prospector/docs/features/draft-engine.md
Natalie 631e131327 docs: correct deploy target (black dead -> DO droplet) + on-demand GPU lifecycle
black homelan is gone; prod target is the DO backend droplet (lilith-store-backend,
209.38.51.98 / wg 10.9.0.5) where mac-sync-server already runs. Fix black:2546x
DB-host refs in comments/migrations. GPU is on-demand + queue-driven: hold warm
while backlog is deep, release on idle grace (not strictly per-tick).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-29 07:22:29 -04:00

52 lines
4.6 KiB
Markdown
Raw Permalink Blame History

This file contains ambiguous Unicode characters

This file contains Unicode characters that might be confused with other characters. If you think that this is intentional, you can safely ignore this warning. Use the Escape button to reveal them.

# Draft engine — OSS-on-GPU + CoT workflow builder
## Why
Outbound (and rich classification) must run on **OSS *uncensored* LLMs** — never hosted Claude/OpenAI, which refuse adult-services copy. The models are tuned to Quinn's voice and the inbound/outbound task. **The GPU is ON-DEMAND, not a standing droplet** (no always-on GPU). Lifecycle is **queue-driven, not strictly per-tick:** provision on demand → warm the model → drain the work; **keep it warm while the queue stays deep** (a big classify/draft backlog amortizes the provision cost — don't thrash provisioning) → **release when the queue drains / goes idle** past a short grace window. So: hold across ticks when busy, never hold when idle. **Reuse LPv2's existing on-demand DO GPU work** (`lilith-platform.live/scripts/provision-raw-gpu-droplet.sh` + its vLLM/OpenAI-compatible inference); do **not** depend on model-boss.
## Engine identifier convention
`prospector_settings.draft_engine` names the active engine:
- `do-gpu-<modelname>_<version_build>` — a specific OSS model + build on a DO GPU droplet, reached over its inference HTTP endpoint. Example: `do-gpu-mistral-nemo-12b-uncensored_2026-06-29.1`. The `<version_build>` pins the exact tuned weights so a decision is always traceable to the model that made it.
- `template` — the MVP/fallback "static" engine: copy rendered from the 🌹 pastebin note, **no LLM** (zero-dependency, always-safe baseline; what ships first).
The runner records the engine id on every draft/decision (audit trail) so the trial can attribute behavior to a model build, and corrections can be bucketed per build for tuning.
## From pastebin → CoT workflow builder
Today (MVP) the pastebin is a flat lookup: `templateKey (①…⓱) → fixed string`. That's the `template` engine. It is being **converted into a CoT (chain-of-thought) workflow builder**:
- A **workflow** is a versioned, ordered chain the OSS model runs to *generate* the reply for a situation (archetype × state × templateKey), instead of emitting canned text:
```
workflow {
id, name, version,
appliesTo: { archetypes[], states[], templateKeys[] },
context: [ pastebin canon snippets, Quinn voice rules, safety rules ], // the 🌹 note becomes injected context, not the output
steps: [ { think: "read the thread + atoms; what does he actually want?" },
{ think: "pick the move per Quinn's rules; check Gate-2 constraints" },
{ draft: "write Quinn's reply in her voice, ≤N chars, no service/price in writing" } ],
engine: "do-gpu-<model>_<build>"
}
```
- The **builder** lets the operator/coworker author + version these workflows (and A/B them per build). Corrections (`prospect_corrections`) feed back as few-shot/tuning data keyed to the workflow + engine build.
- Pastebin canon (voice, lines, rules) moves from "the output" to "context the workflow injects" — single source of truth for voice, now consumed by the model rather than sent verbatim.
## Pipeline placement (unchanged contract)
The runner stays the same; only the body-generation step swaps:
```
inbound → classify → decideNextAction(templateKey) → Gate-2 → mode
→ [DRAFT ENGINE] render body:
template engine → pastebin[templateKey] (MVP)
do-gpu engine → run CoT workflow(appliesTo) on the model (target)
→ dispatch to macsync (or hold)
```
Fail-safe is preserved: if the engine yields no usable body (model down, no matching workflow), the candidate **holds** — never a placeholder send.
## Build path
1. **MVP (done):** `template` engine = pastebin static render; fail-safe holds.
2. **GPU client:** a `GpuDraftEngine` that POSTs rendered CoT prompts (the whole batch) to `do-gpu-<model>_<build>`'s inference endpoint (env `GPU_INFERENCE_URL`), behind an **on-demand GPU lifecycle manager**: provision when work arrives, keep warm while the queue stays deep, release after a short idle grace. Selected when `draft_engine` matches `do-gpu-*`.
3. **Workflow store + builder:** persist workflows (own DB, versioned) + author/edit surface (panel or config) + bind to engine builds; wire corrections as per-build tuning data.
4. **Tuning loop:** export `prospect_corrections` per engine build → fine-tune/eval → publish next `do-gpu-<model>_<build>` → flip `draft_engine`.
## Open scope (confirm before building step 3)
- Workflow authoring: DB-backed builder UI in the panel, vs config files in-repo, vs both?
- Granularity: one workflow per templateKey, or per (archetype × state)?
- Does classification also move to a CoT workflow on the GPU, or stay fast-rules + atom-extraction?