# Training loop — CoT-labeled corpus → LoRA → eval-gated build flip

> How the OSS models ([`draft-engine.md`](./draft-engine.md)) actually improve from
> Quinn's **10K+ message history**. The thesis: **don't train on raw
> `(inbound → reply)` pairs.** First enrich every historical turn into a structured,
> reasoning-bearing record — that record is the durable asset; the weights are
> downstream and disposable. One extraction feeds classifier training, generator
> training, the prompt-level workflow library, the hard-example miner, and the eval
> set.
>
> Model policy unchanged: serving is OSS-on-GPU (no Claude in the runtime loop). The
> **offline labeling pass** may use a strong teacher (incl. hosted Claude) — it
> analyzes history, it does not generate adult copy at runtime.

## 1. The two producers (provenance is a routing key, not metadata)

Quinn's recent good outbound came from two distinct systems. They map onto the
engine's two draft modes and two of the three AI roles
([`ai-first-v4.md`](./ai-first-v4.md)):

| Producer | What it does | Engine mode | Trains |
|---|---|---|---|
| **matcher** | classify convo → campaign, classify inbound, **match** to a reply in the campaign's set | `template` (selection) | the **classifier** + the **retrieval / move-selection** policy |
| **agent** | reads the full convo, **generates** a freeform reply | `do-gpu` (generative `prospect.draft`) | the **generative message-generator** (voice) |

**Do not pool the 10K into one SFT set.** Matcher outputs are *selections from a
fixed library* — training the generator on them teaches it to parrot canned copy (no
generalization). So `source ∈ {matcher, agent}` **dispatches** each example to the
model it can actually improve.

## 2. The per-turn extraction record (the asset)

For every turn in every historical thread, produce one record keyed to the engine's
**existing** vocabulary — invent nothing:

```jsonc
{
  "read": { "archetype": "...", "atoms": { /* ProspectAtomsV4, src/engine/atoms.ts */ } },
  "state": { "stage": "...", "transition": "..." }, // engine/state.ts state machine
  "move": "qualify|screen|deflect-price|book|tease|re-engage|disengage", // ← NEW label, the gold
  "constraints": { "noPriceInWriting": true, "humanOwned": false, "bookingTriad": "partial" }, // Gate-2 facts honored
  "outcome": "booked|engaged|ghosted|blocked|scam", // mined from the thread tail = reward/weight
  "source": { "actor": "agent|matcher|human|runner", "agentId": "...", "context": "..." },
  "reply": ""
}
```

- **`read.atoms`** = the canonical **22-atom `ProspectAtomsV4`** (`src/engine/atoms.ts`).
  Reuse the existing defensive parser; do not define a parallel schema.
- **`move`** is the one field not yet in code (~6–10 recurring plays). It is the
  highest-value extraction — see §4.
- **`outcome`** is mined from how the thread actually ended, not asserted.
- **`source`** is the v4 governance **actor attribution** applied retroactively to
  history (the same column the going-forward pipeline adds to `prospect_drafts`).

> **To confirm against code when locking the schema:** the full 22-atom enum set in
> `src/engine/atoms.ts`, the `move` taxonomy (derive from `PROSPECTOR_TRAINING.md` —
> an **external Executor doc, not yet in this repo** (also dangling-cited at
> `src/engine/fast-classifier.ts:4`; create it here when the taxonomy is locked) — +
> the campaign reply sets), and where the 10K physically lives (Apple Messages via
> macsync / a legacy LP export — `prospect_drafts` is going-forward only; the
> macsync client is outbox-only). Step zero is an **export → structure** job.

## 3. CoT labeling = rationalization (backward from a known-good answer)

We already *know* the good reply — it is Quinn's actual message on a thread that
**converted**. So we generate the reasoning **backward**, conditioned on the answer
(STaR / "distilling step-by-step" / rejection-sampled rationales):

1. Teacher reads `(thread context, atoms, state)` **and the known reply**.
2. Teacher emits the trace: *read → chosen move → constraints honored → therefore
   this reply*.
3. **Reject any rationale that does not reconstruct her actual move** — automatic
   quality filter; bad rationales never enter the corpus.

Backward rationales are far higher quality than forward reasoning because they are
anchored to a verified outcome. The SFT target becomes
`(context, read, state) → [reasoning] → reply`, which distills the **decision
procedure**, not the surface wording. Keep the trace at inference (explainable —
feeds Plane-2 `prospector_explain`) or distill it away for latency.

## 4. The shortcut: distill agent wins into the matcher library

The matcher is cheap (classify + retrieve, no generation); the agent is expensive
(GPU generation). The self-improving cascade turns expensive wins into cheap future
coverage:

```
inbound → classify (cheap)
   ├── good match in campaign library? ─── matcher emits it (cheap) ✅
   └── novel / no good match? ──────────── escalate to agent (generate, costly)
          └── agent reply converts?
                 └── PROMOTE it into the campaign library as a new
                     matchable reply, keyed by (archetype × state × move)
```

Every generator win on a novel situation becomes a new entry the matcher can select
next time → **coverage grows, fewer inbounds need the generator, cost falls while
quality rises.** This *is* "shortcut improvements": you grow behavior at
**prompt/library speed (hours)** instead of **retraining speed (days)**, and only
LoRA when the prompt/library layer saturates.
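
The cascade above can be sketched in TypeScript, the language of the `src/engine` files this document cites. This is a minimal illustration under stated assumptions, not the engine's actual API: `CampaignLibrary`, `LibraryKey`, `Draft`, `draft`, and `WINNING_OUTCOMES` are hypothetical names, and treating both `booked` and `engaged` as "converted" is an assumed policy.

```typescript
// Hypothetical sketch of the matcher-first cascade with outcome-gated promotion.
// All names here are illustrative; none are taken from the engine's code.
type Move =
  | "qualify" | "screen" | "deflect-price" | "book"
  | "tease" | "re-engage" | "disengage";
type Outcome = "booked" | "engaged" | "ghosted" | "blocked" | "scam";

interface LibraryKey {
  archetype: string;
  state: string;
  move: Move;
}

interface Draft {
  body: string;
  producer: "matcher" | "agent";
}

// Assumption: which outcomes count as "converted" is a policy choice.
const WINNING_OUTCOMES: Outcome[] = ["booked", "engaged"];

// Flatten the (archetype x state x move) tuple into one lookup key.
function keyOf(k: LibraryKey): string {
  return [k.archetype, k.state, k.move].join("|");
}

class CampaignLibrary {
  private entries: Record<string, string[]> = {};

  // Cheap path: return the first stored reply for this situation, if any.
  match(k: LibraryKey): string | undefined {
    const replies = this.entries[keyOf(k)];
    return replies !== undefined && replies.length > 0 ? replies[0] : undefined;
  }

  // Promote a reply only when the thread outcome was a win.
  // Returns true when a new entry was actually added.
  promote(k: LibraryKey, reply: string, outcome: Outcome): boolean {
    if (WINNING_OUTCOMES.indexOf(outcome) === -1) return false;
    const id = keyOf(k);
    const existing = this.entries[id];
    const replies = existing !== undefined ? existing : [];
    if (replies.indexOf(reply) !== -1) return false; // already in the library
    this.entries[id] = replies.concat([reply]);
    return true;
  }
}

// Matcher first; call the costly generator only on a library miss.
function draft(lib: CampaignLibrary, k: LibraryKey, generate: () => string): Draft {
  const hit = lib.match(k);
  if (hit !== undefined) return { body: hit, producer: "matcher" };
  return { body: generate(), producer: "agent" };
}
```

In this sketch `promote` is the only way into the library and it checks the outcome first, so a generated reply that ghosted never becomes matchable copy. After one successful promotion, the same key is served by the matcher and `generate` is not called again.
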
The promoted replies are literally new `appliesTo: { archetype, state, templateKey }`
entries in the [`draft-engine.md`](./draft-engine.md) CoT-workflow library.

## 5. Data routing — provenance routes, outcome admits

`source` decides *which model* an example trains; `outcome` decides *whether it is
admitted at all*. Both gates always apply (training on agent output is
self-distillation — keep only the wins, or you reinforce mistakes):

| Source | Good outcome → | Bad outcome → |
|---|---|---|
| **matcher** | classifier SFT + retrieval positive | retrieval negative; **and a "library gap" signal** (novel → had to escalate) marking exactly where the matcher needs new entries |
| **agent** | generator voice SFT + **library-promotion candidate** (§4) | dropped (bad voice example) |

Three derived sets fall out of the same extraction — the data-efficiency multipliers:

- **Hard-example set (active learning).** Run the current `fast-classifier` over all
  10K; diff its atoms/archetype vs the CoT ground truth. **Train only on the
  disagreements** — where you are currently wrong. Biggest efficiency lever.
- **DPO contrast pairs.** Same `archetype × state`, one reply converted, one ghosted →
  a direct preference pair. The `original_body → corrected_body` rows in
  `prospect_corrections` are ready-made pairs.
- **Held-out eval set.** Atom-stratified split (10–20%) → honest metrics. With 10K
  this is real; at 25 examples (the current rule-classifier's set) the eval gate was
  theater.

## 6. Training & serving mechanics

- **Never train on the serving droplet.** Training is an hours-long, 100%-GPU batch
  job; it would starve live inference. It runs on a **separate transient droplet**
  (spin up → produce adapter → write to storage → tear down). This does **not**
  violate "1 inference droplet until I complain" — that decision governs *serving*
  topology, not an offline job.
- **LoRA/QLoRA, not full fine-tune.** 10K is comfortably above the LoRA threshold
  (hundreds–low-thousands), well below full-FT scale. Tiny adapters, cheap, and — key —
  the serving GPU does **multi-LoRA**: one resident base + per-task adapters
  (classifier, draft) swapped per request at ~zero cost. This is why one droplet
  serves all roles concurrently (continuous batching + multi-LoRA + the
  `ChatPriority` queue in `src/gpu/types.ts`). See
  [`gpu-cost-control.md`](./gpu-cost-control.md).
- **Eval-gated build flip.** New build → held-out eval → metrics **and** Gate-2 safety
  must pass → only then flip `draft_engine` / the task-registry version. The
  `do-gpu-_` convention + per-decision engine-id recording let corrections bucket per
  build, closing the loop.

## 7. Sequencing

1. **Export → structure → label** the 10K into per-turn records (§2, §3). The first
   real work is ETL + the labeling pass, not training.
2. **Classifier first** — labels come free from matcher classify decisions + outcome
   mining; cleanest eval; LoRA the hard-example set (§5).
3. **Message-generator second** — voice SFT on **agent** wins (filtered by outcome),
   then DPO on contrast pairs. Uncensored OSS base. Hard-gated by safety eval.
4. **Orchestrator: never trained** — it needs tool-calling + instruction-following,
   and zero orchestration transcripts exist. Strong OSS instruct model + MCP tool
   schemas + few-shot. (Tier A, [`ai-first-v4.md`](./ai-first-v4.md).)
5. **Compounding loop:** label → `{read, move, outcome, source}` → (a) library/prompt
   wins now → (b) hard-example LoRA when saturated → (c) eval gate → (d) the new
   build's corrections re-feed the labels.

## 8. Invariant

Training never widens the send floor. A new build only changes *which body* is
proposed (matched, generated, or `template` fallback); Gate-2, the kill-switch, and
the macsync-outbox floor are untouched, and the eval gate proves safety before any
build reaches the auto-send path.
A cold/failed/unproven model degrades to the `template` fallback — never to an unsafe or placeholder send.
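
The eval-gated flip from §6 and the fallback rule above can be sketched together. This is a hedged illustration, not the registry's real interface: `EvalReport`, `GatePolicy`, `passesGate`, `selectEngine`, the field names, and the single accuracy threshold are all assumptions. Keeping the previously proven build when a candidate fails is also an assumed design choice; the document only requires that an unproven model never reaches the send path.

```typescript
// Hypothetical sketch of the eval gate; shapes and thresholds are illustrative only.
interface EvalReport {
  buildId: string;
  heldOutAccuracy: number; // metric on the held-out eval set, in [0, 1]
  gate2Violations: number; // Gate-2 safety failures found in eval outputs
}

interface GatePolicy {
  minAccuracy: number;
}

// The document's fallback mode when no model is proven.
const TEMPLATE_FALLBACK = "template";

// A build passes only if metrics AND safety both pass.
// A NaN accuracy fails the comparison, so a broken eval run never passes.
function passesGate(report: EvalReport, policy: GatePolicy): boolean {
  return report.gate2Violations === 0 && report.heldOutAccuracy >= policy.minAccuracy;
}

// Decide which engine id serves next.
// - candidate passes the gate: flip to it
// - candidate fails or is missing: keep the current proven build (assumption)
// - no proven build at all: degrade to the template fallback
function selectEngine(
  current: string | null,
  candidate: EvalReport | null,
  policy: GatePolicy,
): string {
  if (candidate !== null && passesGate(candidate, policy)) {
    return candidate.buildId;
  }
  return current !== null ? current : TEMPLATE_FALLBACK;
}
```

The function only ever returns a build that passed the gate, a build that was already serving, or `template`, so a failed evaluation cannot change what is sent. Gate-2 and the kill-switch sit outside this sketch and are not modeled here.
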