black homelan is gone; prod target is the DO backend droplet (lilith-store-backend, 209.38.51.98 / wg 10.9.0.5) where mac-sync-server already runs. Fix black:2546x DB-host refs in comments/migrations. GPU is on-demand + queue-driven: hold warm while backlog is deep, release on idle grace (not strictly per-tick). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
# Draft engine — OSS-on-GPU + CoT workflow builder
## Why
Outbound (and rich classification) must run on **OSS *uncensored* LLMs**, never on hosted Claude/OpenAI, which refuse adult-services copy. The models are tuned to Quinn's voice and to the inbound/outbound task.

**The GPU is on-demand, not a standing droplet** (no always-on GPU). Its lifecycle is **queue-driven, not strictly per-tick**:

- Provision on demand → warm the model → drain the work.
- **Keep it warm while the queue stays deep.** A big classify/draft backlog amortizes the provision cost, so don't thrash provisioning.
- **Release when the queue drains / goes idle** past a short grace window.

So: hold across ticks when busy, never hold when idle.

**Reuse LPv2's existing on-demand DO GPU work** (`lilith-platform.live/scripts/provision-raw-gpu-droplet.sh` + its vLLM/OpenAI-compatible inference); do **not** depend on model-boss.
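The hold/release rule above can be sketched as a small state machine. This is a minimal illustration, not the project's lifecycle manager: `GpuLifecycle`, `tick`, the action strings, and the 120-second default grace are all hypothetical, and provisioning plus warm-up are collapsed into a single `provision` action for brevity.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class GpuState(Enum):
    RELEASED = "released"   # no droplet exists
    WARM = "warm"           # droplet provisioned, model loaded


@dataclass
class GpuLifecycle:
    """Queue-driven hold/release decision (hypothetical sketch)."""
    idle_grace_s: float = 120.0          # assumed length of the "short grace window"
    state: GpuState = GpuState.RELEASED
    idle_since: Optional[float] = None   # time the queue was first seen empty

    def tick(self, queue_depth: int, now: float) -> str:
        """Return 'provision', 'hold', 'release', or 'noop' for this tick."""
        if queue_depth > 0:
            # Work is waiting: reset the idle timer and keep (or get) a GPU.
            self.idle_since = None
            if self.state is GpuState.RELEASED:
                self.state = GpuState.WARM
                return "provision"
            return "hold"
        # Queue is empty from here on.
        if self.state is GpuState.RELEASED:
            return "noop"
        if self.idle_since is None:
            self.idle_since = now
        if now - self.idle_since >= self.idle_grace_s:
            self.state = GpuState.RELEASED
            self.idle_since = None
            return "release"
        return "hold"
```

Because the idle timer resets whenever work reappears, a queue that briefly empties between ticks does not trigger a release and re-provision.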
## Engine identifier convention
`prospector_settings.draft_engine` names the active engine:

- `do-gpu-<modelname>_<version_build>`: a specific OSS model + build on a DO GPU droplet, reached over its inference HTTP endpoint. Example: `do-gpu-mistral-nemo-12b-uncensored_2026-06-29.1`. The `<version_build>` pins the exact tuned weights so a decision is always traceable to the model that made it.
- `template`: the MVP/fallback "static" engine. Copy is rendered from the 🌹 pastebin note with **no LLM** (zero-dependency, always-safe baseline; what ships first).

The runner records the engine id on every draft/decision (audit trail) so the trial can attribute behavior to a model build, and corrections can be bucketed per build for tuning.
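One way to split an engine id into its parts for the audit trail is sketched below. `EngineId` and `parse_engine_id` are hypothetical names, and the regex assumes the model name never contains an underscore, so the first `_` separates model from build; the convention above does not state this explicitly.

```python
import re
from typing import NamedTuple, Optional


class EngineId(NamedTuple):
    kind: str                # "template" or "do-gpu"
    model: Optional[str]     # None for the template engine
    build: Optional[str]     # None for the template engine


# Assumption: <modelname> has no underscore, so the first "_" starts <version_build>.
_DO_GPU = re.compile(r"^do-gpu-(?P<model>[^_]+)_(?P<build>.+)$")


def parse_engine_id(value: str) -> EngineId:
    """Parse a draft_engine value; raise ValueError if it fits neither form."""
    if value == "template":
        return EngineId("template", None, None)
    m = _DO_GPU.match(value)
    if m is None:
        raise ValueError(f"unrecognized draft_engine: {value!r}")
    return EngineId("do-gpu", m["model"], m["build"])
```

With the example id from the list above, this yields model `mistral-nemo-12b-uncensored` and build `2026-06-29.1`.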
## From pastebin → CoT workflow builder
Today (MVP) the pastebin is a flat lookup: `templateKey (①…⓱) → fixed string`. That's the `template` engine. It is being **converted into a CoT (chain-of-thought) workflow builder**:

- A **workflow** is a versioned, ordered chain the OSS model runs to *generate* the reply for a situation (archetype × state × templateKey), instead of emitting canned text:
```
workflow {
  id, name, version,
  appliesTo: { archetypes[], states[], templateKeys[] },
  context: [ pastebin canon snippets, Quinn voice rules, safety rules ],  // the 🌹 note becomes injected context, not the output
  steps: [ { think: "read the thread + atoms; what does he actually want?" },
           { think: "pick the move per Quinn's rules; check Gate-2 constraints" },
           { draft: "write Quinn's reply in her voice, ≤N chars, no service/price in writing" } ],
  engine: "do-gpu-<model>_<build>"
}
```
- The **builder** lets the operator/coworker author + version these workflows (and A/B them per build). Corrections (`prospect_corrections`) feed back as few-shot/tuning data keyed to the workflow + engine build.
- Pastebin canon (voice, lines, rules) moves from "the output" to "context the workflow injects". It remains the single source of truth for voice, but is now consumed by the model rather than sent verbatim.
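Selecting a workflow for a situation could look roughly like this. It is a sketch under assumptions the document leaves open: the flat `Workflow` fields, the rule that all three `appliesTo` dimensions must match, and "highest version wins" are illustrative choices, and the archetype/state values in the usage note are made up.

```python
from dataclasses import dataclass
from typing import FrozenSet, Iterable, Optional


@dataclass(frozen=True)
class Workflow:
    id: str
    version: int
    archetypes: FrozenSet[str]      # appliesTo.archetypes
    states: FrozenSet[str]          # appliesTo.states
    template_keys: FrozenSet[str]   # appliesTo.templateKeys
    engine: str                     # e.g. "do-gpu-<model>_<build>"

    def applies_to(self, archetype: str, state: str, template_key: str) -> bool:
        # Assumption: a workflow applies only if all three dimensions match.
        return (archetype in self.archetypes
                and state in self.states
                and template_key in self.template_keys)


def select_workflow(workflows: Iterable[Workflow], archetype: str, state: str,
                    template_key: str, engine: str) -> Optional[Workflow]:
    """Return the highest-version matching workflow bound to `engine`, or None."""
    matches = [w for w in workflows
               if w.engine == engine and w.applies_to(archetype, state, template_key)]
    # None means "no matching workflow"; the caller is expected to hold.
    return max(matches, key=lambda w: w.version, default=None)
```

For example, `select_workflow(store, "regular", "negotiating", "③", active_engine)` returns `None` when no workflow covers that combination.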
## Pipeline placement (unchanged contract)
The runner stays the same; only the body-generation step swaps:
```
inbound → classify → decideNextAction(templateKey) → Gate-2 → mode
        → [DRAFT ENGINE] render body:
              template engine → pastebin[templateKey]                     (MVP)
              do-gpu engine   → run CoT workflow(appliesTo) on the model  (target)
        → dispatch to macsync (or hold)
```
Fail-safe is preserved: if the engine yields no usable body (model down, no matching workflow), the candidate **holds** — never a placeholder send.
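The engine swap and the hold-on-failure rule can be illustrated as follows. This is a sketch only: `render_body`, `next_step`, and the `run_workflow` callable are hypothetical stand-ins for the runner's real interfaces, which this document does not show.

```python
from typing import Callable, Dict, Optional


def render_body(draft_engine: str, template_key: str,
                pastebin: Dict[str, str],
                run_workflow: Callable[[str], Optional[str]]) -> Optional[str]:
    """Return a usable body, or None if the engine could not produce one."""
    if draft_engine == "template":
        body = pastebin.get(template_key)
    elif draft_engine.startswith("do-gpu-"):
        try:
            body = run_workflow(template_key)
        except Exception:
            # Model down, timeout, no matching workflow: treat as "no body".
            body = None
    else:
        body = None  # unknown engine id
    if body is None or not body.strip():
        return None
    return body


def next_step(body: Optional[str]) -> str:
    # Fail-safe: without a usable body the candidate holds; nothing is sent.
    return "dispatch" if body is not None else "hold"
```

Every failure path collapses to `None`, so the caller has exactly one condition to check before dispatching.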
## Build path
1. **MVP (done):** `template` engine = pastebin static render; fail-safe holds.
2. **GPU client:** a `GpuDraftEngine` that POSTs rendered CoT prompts (the whole batch) to `do-gpu-<model>_<build>`'s inference endpoint (env `GPU_INFERENCE_URL`), behind an **on-demand GPU lifecycle manager**: provision when work arrives, keep warm while the queue stays deep, release after a short idle grace. Selected when `draft_engine` matches `do-gpu-*`.
3. **Workflow store + builder:** persist workflows (own DB, versioned) + author/edit surface (panel or config) + bind to engine builds; wire corrections as per-build tuning data.
4. **Tuning loop:** export `prospect_corrections` per engine build → fine-tune/eval → publish next `do-gpu-<model>_<build>` → flip `draft_engine`.
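For step 2, a batch client might be shaped like the sketch below. Several things here are assumptions rather than facts from this document: that the endpoint serves an OpenAI-style `/v1/completions` route accepting a list of prompts, that each returned choice carries an `index` mapping it back to its prompt, and that the served model name equals the engine id. The `draft_batch` method, `http_post_json` helper, and injectable `post` parameter are hypothetical; only `GpuDraftEngine` and `GPU_INFERENCE_URL` come from the list above.

```python
import json
import os
import urllib.request
from typing import Callable, List, Optional

PostFn = Callable[[str, dict], dict]


def http_post_json(url: str, payload: dict, timeout: float = 60.0) -> dict:
    """POST a JSON payload and return the decoded JSON response."""
    req = urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        return json.loads(resp.read().decode("utf-8"))


class GpuDraftEngine:
    def __init__(self, engine_id: str, base_url: Optional[str] = None,
                 post: PostFn = http_post_json):
        self.engine_id = engine_id
        self.base_url = (base_url or os.environ.get("GPU_INFERENCE_URL", "")).rstrip("/")
        self.post = post  # injectable so the client can be exercised without a GPU

    def draft_batch(self, prompts: List[str], max_tokens: int = 256) -> List[Optional[str]]:
        """Send the whole batch in one request; None marks 'no usable body'."""
        bodies: List[Optional[str]] = [None] * len(prompts)
        if not prompts or not self.base_url:
            return bodies
        # Assumed payload shape (OpenAI-style completions with a prompt list).
        payload = {"model": self.engine_id, "prompt": prompts, "max_tokens": max_tokens}
        try:
            data = self.post(self.base_url + "/v1/completions", payload)
            for choice in data.get("choices", []):
                i = choice.get("index")
                text = (choice.get("text") or "").strip()
                if isinstance(i, int) and 0 <= i < len(prompts) and text:
                    bodies[i] = text
        except Exception:
            # Any transport or parsing failure: every candidate in the batch holds.
            return [None] * len(prompts)
        return bodies
```

Returning `None` per slot rather than raising keeps the fail-safe contract: a failed batch produces holds, never placeholder sends.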
## Open scope (confirm before building step 3)
- Workflow authoring: DB-backed builder UI in the panel, vs config files in-repo, vs both?
- Granularity: one workflow per templateKey, or per (archetype × state)?
- Does classification also move to a CoT workflow on the GPU, or stay fast-rules + atom-extraction?