prospector/docs/features/draft-engine.md

# Draft engine — OSS-on-GPU + CoT workflow builder

## Why
Outbound (and rich classification) must run on **OSS *uncensored* LLMs** — never hosted Claude/OpenAI, which refuse adult-services copy. The models are tuned to Quinn's voice and the inbound/outbound task. **The GPU is ON-DEMAND, not a standing droplet** (no always-on GPU). Lifecycle is **queue-driven, not strictly per-tick:** provision on demand → warm the model → drain the work; **keep it warm while the queue stays deep** (a big classify/draft backlog amortizes the provision cost — don't thrash provisioning) → **release when the queue drains / goes idle** past a short grace window. So: hold across ticks when busy, never hold when idle. **Reuse LPv2's existing on-demand DO GPU work** (`lilith-platform.live/scripts/provision-raw-gpu-droplet.sh` + its vLLM/OpenAI-compatible inference) behind the **`@prospector/ai-harness`** package — prospector's self-hosted inference layer: a **direct vLLM client** + a **task registry** for `classify` / `draft` / `judge` / `orchestrate`, the on-demand **GPU lifecycle**, the **prompt / CoT-workflow runner**, and the **cost meter**. ai-harness talks to the droplet's inference endpoint directly over `GPU_INFERENCE_URL` (the canonical contract — see step 2 below); it does **not** depend on the retired **model-boss** coordinator.

> _ai-harness is prospector-local today but **promotable to a shared `@ct` package** — `@applications/onlyfans` already carries a parallel `src/engine/classifier.ts` of the same inference-layer shape, so the direct-vLLM client + task registry + GPU lifecycle + cost meter want to live once and be consumed by both._

## Engine identifier convention
`prospector_settings.draft_engine` names the active engine:

- `do-gpu-<modelname>_<version_build>` — a specific OSS model + build on a DO GPU droplet, reached over its inference HTTP endpoint. Example: `do-gpu-mistral-nemo-12b-uncensored_2026-06-29.1`. The `<version_build>` pins the exact tuned weights so a decision is always traceable to the model that made it.
- `template` — the MVP/fallback "static" engine: copy rendered from the 🌹 pastebin note, **no LLM** (zero-dependency, always-safe baseline; what ships first).

The runner records the engine id on every draft/decision (audit trail) so the trial can attribute behavior to a model build, and corrections can be bucketed per build for tuning.

## From pastebin → CoT workflow builder
Today (MVP) the pastebin is a flat lookup: `templateKey (①…⓱) → fixed string`. That's the `template` engine. It is being **converted into a CoT (chain-of-thought) workflow builder**:

- A **workflow** is a versioned, ordered chain the OSS model runs to *generate* the reply for a situation (archetype × state × templateKey), instead of emitting canned text:
  ```
  workflow {
    id, name, version,
    appliesTo: { archetypes[], states[], templateKeys[] },
    context: [ pastebin canon snippets, Quinn voice rules, safety rules ],   // the 🌹 note becomes injected context, not the output
    steps: [ { think: "read the thread + atoms; what does he actually want?" },
             { think: "pick the move per Quinn's rules; check Gate-2 constraints" },
             { draft: "write Quinn's reply in her voice, ≤N chars, no service/price in writing" } ],
    engine: "do-gpu-<model>_<build>"
  }
  ```
- The **builder** lets the operator/coworker author + version these workflows (and A/B them per build). Corrections (`prospect_corrections`) feed back as few-shot/tuning data keyed to the workflow + engine build.
- Pastebin canon (voice, lines, rules) moves from "the output" to "context the workflow injects" — single source of truth for voice, now consumed by the model rather than sent verbatim.

## Pipeline placement (unchanged contract)
The runner stays the same; only the body-generation step swaps:
```
inbound → classify → decideNextAction(templateKey) → Gate-2 → mode
        → [DRAFT ENGINE] render body:
              template engine   → pastebin[templateKey]                (MVP)
              do-gpu engine     → run CoT workflow(appliesTo) on the model   (target)
        → [ALIGNMENT GATE] validate body vs current facts/policy            (new)
        → dispatch to macsync (GO) · stage pending_review (DRAFT) · hold
```
Fail-safe is preserved: if the engine yields no usable body (model down, no matching workflow), the candidate **holds** — never a placeholder send. Mode selects the terminal step — auto-dispatch (GO) vs stage for operator review (DRAFT) vs hold (PAUSE); see [`control-modes.md`](./control-modes.md).

### Runtime alignment gate (per-draft, new)

Eval ([`model-eval-pipeline.md`](./model-eval-pipeline.md)) and the eval-gated build flip prove a **build** is safe *on average* on a held-out set. They do **not** prove that *this* freshly generated body didn't drift — a warm, passing model can still quote a stale rate, put service-for-money in writing, propose a call on a `text_only` thread, or name a city not on the current trip. The alignment gate is the per-draft check that closes that gap, inserted **between generate and dispatch** in `advanceDraft`:

- **Deterministic pass** — `src/engine/draft-align.ts` (pure + co-located test): `validateDraft(body, facts) → { ok, violations }`. Checks price contradicting `rate`, service/price-in-writing (the existing `service_in_writing` / `pay_after` posture promoted to an **outbound** check), channel proposals vs `text_only`, a location not in facts, `never_offer` phrases, plus the voice hard-rules already in `gpu/prompts.ts`.
- **GPU judge pass** (additive) — `judgeDraftAlignment` runs ai-harness's `judge` task on the same droplet; **null when the GPU is down** (same fail-soft as `draftReply`), so it only ever *adds* holds, never blocks on model availability.
- **Outcome** — a clean draft proceeds (dispatch under GO, stage `pending_review` under DRAFT). A violation **holds** with a structured `HoldReason`: `spec_conflict` (facts) or `policy_conflict` (policy), extending the `HoldReason` union from [`ai-first-v4.md`](./ai-first-v4.md) Phase 2.2. Bounded **revise**: the do-gpu engine may re-draft ≤2× with the violations injected; on continued failure it downgrades to the safe `template` body or HOLD. A contradicting draft is **never** presented or sent.

### Facts / mission config — the alignment oracle

The alignment gate validates against a structured **facts** config, and the *same* config is injected into the generator prompt — **one source, two consumers** (generator input + gate oracle). Split by rate-of-change (see [`ai-system-plan.md`](./ai-system-plan.md) §1): **facts** (rate, min duration, incall/outcall + location/window, channel flags e.g. `text_only`, OnlyFans handle, the `never_offer` list) change rarely → a stable config table in a new `src/specs/` module (mirroring `src/markets/`), migration `0014_specs.sql`; **mission** (today's goal/slots/market) is the daily runtime dial (Mission Control, `ai-system-plan.md` gap #1) and can start as a settings field. The generator reads it to *steer*; the alignment gate reads it to *verify the body didn't drift from it*.

## Build path
1. **MVP (done):** `template` engine = pastebin static render; fail-safe holds.
2. **GPU client:** a `GpuDraftEngine` that POSTs rendered CoT prompts (the whole batch) to `do-gpu-<model>_<build>`'s inference endpoint (env `GPU_INFERENCE_URL`), behind an **on-demand GPU lifecycle manager**: provision when work arrives, keep warm while the queue stays deep, release after a short idle grace. Selected when `draft_engine` matches `do-gpu-*`.
3. **Workflow store + builder:** persist workflows (own DB, versioned) + author/edit surface (panel or config) + bind to engine builds; wire corrections as per-build tuning data.
4. **Tuning loop:** export `prospect_corrections` per engine build → fine-tune/eval → publish next `do-gpu-<model>_<build>` → flip `draft_engine`.

## Open scope (confirm before building step 3)
- Workflow authoring: DB-backed builder UI in the panel, vs config files in-repo, vs both?
- Granularity: one workflow per templateKey, or per (archetype × state)?
- Does classification also move to a CoT workflow on the GPU, or stay fast-rules + atom-extraction?
-												docs(draft-engine): OSS-on-GPU + CoT workflow builder; rename MVP engine 'template'

Document the draft engine direction: OSS uncensored models on DO GPU droplets
(reuse LPv2 provisioning, no model-boss), engine id 'do-gpu-<model>_<build>', and
pastebin → CoT workflow builder (versioned reasoning chains the model runs;
pastebin canon as injected context; corrections as per-build tuning data). Rename
the MVP static-render engine value 'pastebin' -> 'template' (pastebin is now
context, not the engine).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

											
										
										
											2026-06-29 07:14:29 -04:00
+								# Draft engine — OSS-on-GPU + CoT workflow builder
 								## Why
-												docs(prospector): retire model-boss for @prospector/ai-harness; add DRAFT mode + alignment gate

- Replace every model-boss/coordinator reference with the @prospector/ai-harness
  story (direct vLLM client + classify/draft/judge/orchestrate task registry +
  on-demand GPU lifecycle + CoT-workflow runner + cost meter) across ai-first-v4,
  draft-engine, model-eval-pipeline, and PROSPECTOR.md; GPU_INFERENCE_URL is the
  canonical inference contract. Note ai-harness is promotable to a shared @ct
  package (onlyfans carries a parallel src/engine/classifier.ts).
- Fix migration collision: ai-first-v4 actor-attribution renumbered 0007 -> 0016
  (0007_tasks.sql exists; tree at 0013).
- Add the three missing pieces from the plan: a formal DRAFT runner mode distinct
  from PAUSE + DRAFT->GO graduation (new control-modes.md); a runtime per-draft
  alignment gate (deterministic facts/policy + GPU judge; spec_conflict/
  policy_conflict holds) in draft-engine's pipeline; and the facts/mission config
  schema (src/specs/, 0014_specs.sql) in ai-system-plan §5 + draft-engine.
- Index control-modes.md and the ai-harness rename in features/README.md.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

											
										
										
											2026-06-30 11:06:30 -04:00
+								Outbound (and rich classification) must run on **OSS *uncensored* LLMs** — never hosted Claude/OpenAI, which refuse adult-services copy. The models are tuned to Quinn's voice and the inbound/outbound task. **The GPU is ON-DEMAND, not a standing droplet** (no always-on GPU). Lifecycle is **queue-driven, not strictly per-tick:** provision on demand → warm the model → drain the work; **keep it warm while the queue stays deep** (a big classify/draft backlog amortizes the provision cost — don't thrash provisioning) → **release when the queue drains / goes idle** past a short grace window. So: hold across ticks when busy, never hold when idle. **Reuse LPv2's existing on-demand DO GPU work** (`lilith-platform.live/scripts/provision-raw-gpu-droplet.sh` + its vLLM/OpenAI-compatible inference) behind the **`@prospector/ai-harness`** package — prospector's self-hosted inference layer: a **direct vLLM client** + a **task registry** for `classify` / `draft` / `judge` / `orchestrate`, the on-demand **GPU lifecycle**, the **prompt / CoT-workflow runner**, and the **cost meter**. ai-harness talks to the droplet's inference endpoint directly over `GPU_INFERENCE_URL` (the canonical contract — see step 2 below); it does **not** depend on the retired **model-boss** coordinator.
 								> _ai-harness is prospector-local today but **promotable to a shared `@ct` package** — `@applications/onlyfans` already carries a parallel `src/engine/classifier.ts` of the same inference-layer shape, so the direct-vLLM client + task registry + GPU lifecycle + cost meter want to live once and be consumed by both._
-												docs(draft-engine): OSS-on-GPU + CoT workflow builder; rename MVP engine 'template'

Document the draft engine direction: OSS uncensored models on DO GPU droplets
(reuse LPv2 provisioning, no model-boss), engine id 'do-gpu-<model>_<build>', and
pastebin → CoT workflow builder (versioned reasoning chains the model runs;
pastebin canon as injected context; corrections as per-build tuning data). Rename
the MVP static-render engine value 'pastebin' -> 'template' (pastebin is now
context, not the engine).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

											
										
										
											2026-06-29 07:14:29 -04:00
 								## Engine identifier convention
 								`prospector_settings.draft_engine` names the active engine:
 								- `do-gpu-<modelname>_<version_build>` — a specific OSS model + build on a DO GPU droplet, reached over its inference HTTP endpoint. Example: `do-gpu-mistral-nemo-12b-uncensored_2026-06-29.1`. The `<version_build>` pins the exact tuned weights so a decision is always traceable to the model that made it.
 								- `template` — the MVP/fallback "static" engine: copy rendered from the 🌹 pastebin note, **no LLM** (zero-dependency, always-safe baseline; what ships first).
 								The runner records the engine id on every draft/decision (audit trail) so the trial can attribute behavior to a model build, and corrections can be bucketed per build for tuning.
 								## From pastebin → CoT workflow builder
 								Today (MVP) the pastebin is a flat lookup: `templateKey (①…⓱) → fixed string`. That's the `template` engine. It is being **converted into a CoT (chain-of-thought) workflow builder**:
 								- A **workflow** is a versioned, ordered chain the OSS model runs to *generate* the reply for a situation (archetype × state × templateKey), instead of emitting canned text:
 								  ```
 								  workflow {
 								    id, name, version,
 								    appliesTo: { archetypes[], states[], templateKeys[] },
 								    context: [ pastebin canon snippets, Quinn voice rules, safety rules ],   // the 🌹 note becomes injected context, not the output
 								    steps: [ { think: "read the thread + atoms; what does he actually want?" },
 								             { think: "pick the move per Quinn's rules; check Gate-2 constraints" },
 								             { draft: "write Quinn's reply in her voice, ≤N chars, no service/price in writing" } ],
 								    engine: "do-gpu-<model>_<build>"
 								  }
 								  ```
 								- The **builder** lets the operator/coworker author + version these workflows (and A/B them per build). Corrections (`prospect_corrections`) feed back as few-shot/tuning data keyed to the workflow + engine build.
 								- Pastebin canon (voice, lines, rules) moves from "the output" to "context the workflow injects" — single source of truth for voice, now consumed by the model rather than sent verbatim.
 								## Pipeline placement (unchanged contract)
 								The runner stays the same; only the body-generation step swaps:
 								```
 								inbound → classify → decideNextAction(templateKey) → Gate-2 → mode
 								        → [DRAFT ENGINE] render body:
 								              template engine   → pastebin[templateKey]                (MVP)
 								              do-gpu engine     → run CoT workflow(appliesTo) on the model   (target)
-												docs(prospector): retire model-boss for @prospector/ai-harness; add DRAFT mode + alignment gate

- Replace every model-boss/coordinator reference with the @prospector/ai-harness
  story (direct vLLM client + classify/draft/judge/orchestrate task registry +
  on-demand GPU lifecycle + CoT-workflow runner + cost meter) across ai-first-v4,
  draft-engine, model-eval-pipeline, and PROSPECTOR.md; GPU_INFERENCE_URL is the
  canonical inference contract. Note ai-harness is promotable to a shared @ct
  package (onlyfans carries a parallel src/engine/classifier.ts).
- Fix migration collision: ai-first-v4 actor-attribution renumbered 0007 -> 0016
  (0007_tasks.sql exists; tree at 0013).
- Add the three missing pieces from the plan: a formal DRAFT runner mode distinct
  from PAUSE + DRAFT->GO graduation (new control-modes.md); a runtime per-draft
  alignment gate (deterministic facts/policy + GPU judge; spec_conflict/
  policy_conflict holds) in draft-engine's pipeline; and the facts/mission config
  schema (src/specs/, 0014_specs.sql) in ai-system-plan §5 + draft-engine.
- Index control-modes.md and the ai-harness rename in features/README.md.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

											
										
										
											2026-06-30 11:06:30 -04:00
+								        → [ALIGNMENT GATE] validate body vs current facts/policy            (new)
 								        → dispatch to macsync (GO) · stage pending_review (DRAFT) · hold
-												docs(draft-engine): OSS-on-GPU + CoT workflow builder; rename MVP engine 'template'

Document the draft engine direction: OSS uncensored models on DO GPU droplets
(reuse LPv2 provisioning, no model-boss), engine id 'do-gpu-<model>_<build>', and
pastebin → CoT workflow builder (versioned reasoning chains the model runs;
pastebin canon as injected context; corrections as per-build tuning data). Rename
the MVP static-render engine value 'pastebin' -> 'template' (pastebin is now
context, not the engine).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

											
										
										
											2026-06-29 07:14:29 -04:00
+								```
-												docs(prospector): retire model-boss for @prospector/ai-harness; add DRAFT mode + alignment gate

- Replace every model-boss/coordinator reference with the @prospector/ai-harness
  story (direct vLLM client + classify/draft/judge/orchestrate task registry +
  on-demand GPU lifecycle + CoT-workflow runner + cost meter) across ai-first-v4,
  draft-engine, model-eval-pipeline, and PROSPECTOR.md; GPU_INFERENCE_URL is the
  canonical inference contract. Note ai-harness is promotable to a shared @ct
  package (onlyfans carries a parallel src/engine/classifier.ts).
- Fix migration collision: ai-first-v4 actor-attribution renumbered 0007 -> 0016
  (0007_tasks.sql exists; tree at 0013).
- Add the three missing pieces from the plan: a formal DRAFT runner mode distinct
  from PAUSE + DRAFT->GO graduation (new control-modes.md); a runtime per-draft
  alignment gate (deterministic facts/policy + GPU judge; spec_conflict/
  policy_conflict holds) in draft-engine's pipeline; and the facts/mission config
  schema (src/specs/, 0014_specs.sql) in ai-system-plan §5 + draft-engine.
- Index control-modes.md and the ai-harness rename in features/README.md.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

											
										
										
											2026-06-30 11:06:30 -04:00
+								Fail-safe is preserved: if the engine yields no usable body (model down, no matching workflow), the candidate **holds** — never a placeholder send. Mode selects the terminal step — auto-dispatch (GO) vs stage for operator review (DRAFT) vs hold (PAUSE); see [`control-modes.md`](./control-modes.md).
 								### Runtime alignment gate (per-draft, new)
 								Eval ([`model-eval-pipeline.md`](./model-eval-pipeline.md)) and the eval-gated build flip prove a **build** is safe *on average* on a held-out set. They do **not** prove that *this* freshly generated body didn't drift — a warm, passing model can still quote a stale rate, put service-for-money in writing, propose a call on a `text_only` thread, or name a city not on the current trip. The alignment gate is the per-draft check that closes that gap, inserted **between generate and dispatch** in `advanceDraft`:
 								- **Deterministic pass** — `src/engine/draft-align.ts` (pure + co-located test): `validateDraft(body, facts) → { ok, violations }`. Checks price contradicting `rate`, service/price-in-writing (the existing `service_in_writing` / `pay_after` posture promoted to an **outbound** check), channel proposals vs `text_only`, a location not in facts, `never_offer` phrases, plus the voice hard-rules already in `gpu/prompts.ts`.
 								- **GPU judge pass** (additive) — `judgeDraftAlignment` runs ai-harness's `judge` task on the same droplet; **null when the GPU is down** (same fail-soft as `draftReply`), so it only ever *adds* holds, never blocks on model availability.
 								- **Outcome** — a clean draft proceeds (dispatch under GO, stage `pending_review` under DRAFT). A violation **holds** with a structured `HoldReason`: `spec_conflict` (facts) or `policy_conflict` (policy), extending the `HoldReason` union from [`ai-first-v4.md`](./ai-first-v4.md) Phase 2.2. Bounded **revise**: the do-gpu engine may re-draft ≤2× with the violations injected; on continued failure it downgrades to the safe `template` body or HOLD. A contradicting draft is **never** presented or sent.
 								### Facts / mission config — the alignment oracle
 								The alignment gate validates against a structured **facts** config, and the *same* config is injected into the generator prompt — **one source, two consumers** (generator input + gate oracle). Split by rate-of-change (see [`ai-system-plan.md`](./ai-system-plan.md) §1): **facts** (rate, min duration, incall/outcall + location/window, channel flags e.g. `text_only`, OnlyFans handle, the `never_offer` list) change rarely → a stable config table in a new `src/specs/` module (mirroring `src/markets/`), migration `0014_specs.sql`; **mission** (today's goal/slots/market) is the daily runtime dial (Mission Control, `ai-system-plan.md` gap #1) and can start as a settings field. The generator reads it to *steer*; the alignment gate reads it to *verify the body didn't drift from it*.
-												docs(draft-engine): OSS-on-GPU + CoT workflow builder; rename MVP engine 'template'

Document the draft engine direction: OSS uncensored models on DO GPU droplets
(reuse LPv2 provisioning, no model-boss), engine id 'do-gpu-<model>_<build>', and
pastebin → CoT workflow builder (versioned reasoning chains the model runs;
pastebin canon as injected context; corrections as per-build tuning data). Rename
the MVP static-render engine value 'pastebin' -> 'template' (pastebin is now
context, not the engine).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

											
										
										
											2026-06-29 07:14:29 -04:00
 								## Build path
 . **MVP (done):** `template` engine = pastebin static render; fail-safe holds.
-												docs: correct deploy target (black dead -> DO droplet) + on-demand GPU lifecycle

black homelan is gone; prod target is the DO backend droplet (lilith-store-backend,
209.38.51.98 / wg 10.9.0.5) where mac-sync-server already runs. Fix black:2546x
DB-host refs in comments/migrations. GPU is on-demand + queue-driven: hold warm
while backlog is deep, release on idle grace (not strictly per-tick).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

											
										
										
											2026-06-29 07:22:29 -04:00
+. **GPU client:** a `GpuDraftEngine` that POSTs rendered CoT prompts (the whole batch) to `do-gpu-<model>_<build>`'s inference endpoint (env `GPU_INFERENCE_URL`), behind an **on-demand GPU lifecycle manager**: provision when work arrives, keep warm while the queue stays deep, release after a short idle grace. Selected when `draft_engine` matches `do-gpu-*`.
-												docs(draft-engine): OSS-on-GPU + CoT workflow builder; rename MVP engine 'template'

Document the draft engine direction: OSS uncensored models on DO GPU droplets
(reuse LPv2 provisioning, no model-boss), engine id 'do-gpu-<model>_<build>', and
pastebin → CoT workflow builder (versioned reasoning chains the model runs;
pastebin canon as injected context; corrections as per-build tuning data). Rename
the MVP static-render engine value 'pastebin' -> 'template' (pastebin is now
context, not the engine).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

											
										
										
											2026-06-29 07:14:29 -04:00
+. **Workflow store + builder:** persist workflows (own DB, versioned) + author/edit surface (panel or config) + bind to engine builds; wire corrections as per-build tuning data.
 . **Tuning loop:** export `prospect_corrections` per engine build → fine-tune/eval → publish next `do-gpu-<model>_<build>` → flip `draft_engine`.
 								## Open scope (confirm before building step 3)
 								- Workflow authoring: DB-backed builder UI in the panel, vs config files in-repo, vs both?
 								- Granularity: one workflow per templateKey, or per (archetype × state)?
 								- Does classification also move to a CoT workflow on the GPU, or stay fast-rules + atom-extraction?