# Training loop — CoT-labeled corpus → LoRA → eval-gated build flip

> How the OSS models ([`draft-engine.md`](./draft-engine.md)) actually improve from
> Quinn's **10K+ message history**. The thesis: **don't train on raw
> `(inbound → reply)` pairs.** First enrich every historical turn into a structured,
> reasoning-bearing record — that record is the durable asset; the weights are
> downstream and disposable. One extraction feeds classifier training, generator
> training, the prompt-level workflow library, the hard-example miner, and the eval
> set.
>
> Model policy unchanged: serving is OSS-on-GPU (no Claude in the runtime loop). The
> **offline labeling pass** may use a strong teacher (incl. hosted Claude) — it
> analyzes history, it does not generate adult copy at runtime.

## 1. The two producers (provenance is a routing key, not metadata)

Quinn's recent good outbound came from two distinct systems. They map onto the
engine's two draft modes and two of the three AI roles
([`ai-first-v4.md`](./ai-first-v4.md)):

| Producer | What it does | Engine mode | Trains |
|---|---|---|---|
| **matcher** | classify convo → campaign, classify inbound, **match** to a reply in the campaign's set | `template` (selection) | the **classifier** + the **retrieval / move-selection** policy |
| **agent** | reads the full convo, **generates** a freeform reply | `do-gpu` (generative `prospect.draft`) | the **generative message-generator** (voice) |

**Do not pool the 10K into one SFT set.** Matcher outputs are *selections from a
fixed library*; training the generator on them teaches it to parrot canned copy
instead of generalizing. So `source.actor` (`matcher` or `agent`) **dispatches**
each example to the model it can actually improve.

## 2. The per-turn extraction record (the asset)

For every turn in every historical thread, produce one record keyed to the engine's
**existing** vocabulary — invent nothing:

```jsonc
{
  "read":        { "archetype": "...", "atoms": { /* ProspectAtomsV4, src/engine/atoms.ts */ } },
  "state":       { "stage": "...", "transition": "..." },   // engine/state.ts state machine
  "move":        "qualify|screen|deflect-price|book|tease|re-engage|disengage",  // ← NEW label, the gold
  "constraints": { "noPriceInWriting": true, "humanOwned": false, "bookingTriad": "partial" }, // Gate-2 facts honored
  "outcome":     "booked|engaged|ghosted|blocked|scam",     // mined from the thread tail = reward/weight
  "source":      { "actor": "agent|matcher|human|runner", "agentId": "...", "context": "..." },
  "reply":       "<the actual message sent>"
}
```

- **`read.atoms`** = the canonical **22-atom `ProspectAtomsV4`** (`src/engine/atoms.ts`).
  Reuse the existing defensive parser; do not define a parallel schema.
- **`move`** is the one field not yet in code (~6–10 recurring plays). It is the
  highest-value extraction — see §4.
- **`outcome`** is mined from how the thread actually ended, not asserted.
- **`source`** is the v4 governance **actor attribution** applied retroactively to
  history (the same column the going-forward pipeline adds to `prospect_drafts`).
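
One way to pin the record down before the schema is locked is a type plus a defensive check on labeler output. This is a minimal sketch, not the repo's code: `TurnRecord`, `isTurnRecord`, and the helper names are hypothetical, `atoms` is left opaque because its real shape lives in `src/engine/atoms.ts`, the move list is just the seven values from the example above rather than a locked taxonomy, and `bookingTriad` is typed as a plain string because only `"partial"` is shown.

```typescript
// Hypothetical sketch: field names follow the record above; type names are illustrative.
const MOVES = ["qualify", "screen", "deflect-price", "book", "tease", "re-engage", "disengage"] as const;
const OUTCOMES = ["booked", "engaged", "ghosted", "blocked", "scam"] as const;
const ACTORS = ["agent", "matcher", "human", "runner"] as const;

type Move = (typeof MOVES)[number];
type Outcome = (typeof OUTCOMES)[number];
type Actor = (typeof ACTORS)[number];

interface TurnRecord {
  // atoms stays opaque here; the real shape is ProspectAtomsV4 in src/engine/atoms.ts.
  read: { archetype: string; atoms: Record<string, unknown> };
  state: { stage: string; transition: string };
  move: Move;
  constraints: { noPriceInWriting: boolean; humanOwned: boolean; bookingTriad: string };
  outcome: Outcome;
  source: { actor: Actor; agentId?: string; context?: string };
  reply: string;
}

function isObject(v: unknown): v is Record<string, unknown> {
  return typeof v === "object" && v !== null && !Array.isArray(v);
}

function isOneOf<T extends string>(allowed: readonly T[], v: unknown): v is T {
  return typeof v === "string" && (allowed as readonly string[]).indexOf(v) !== -1;
}

// Defensive check for labeler output: reject anything outside the known vocabulary.
function isTurnRecord(v: unknown): v is TurnRecord {
  if (!isObject(v)) return false;
  const { read, state, constraints, source } = v;
  return (
    isObject(read) && typeof read.archetype === "string" && isObject(read.atoms) &&
    isObject(state) && typeof state.stage === "string" && typeof state.transition === "string" &&
    isOneOf(MOVES, v.move) &&
    isObject(constraints) && typeof constraints.noPriceInWriting === "boolean" &&
    typeof constraints.humanOwned === "boolean" && typeof constraints.bookingTriad === "string" &&
    isOneOf(OUTCOMES, v.outcome) &&
    isObject(source) && isOneOf(ACTORS, source.actor) &&
    typeof v.reply === "string"
  );
}
```

Failing closed on an unknown `move` or `outcome` keeps teacher drift out of the corpus; a rejected record can be re-labeled rather than silently coerced.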

> **To confirm against code when locking the schema:**
>
> - the full 22-atom enum set in `src/engine/atoms.ts`;
> - the `move` taxonomy, derived from `PROSPECTOR_TRAINING.md` plus the campaign
>   reply sets. `PROSPECTOR_TRAINING.md` is an **external Executor doc, not yet in
>   this repo** (also dangling-cited at `src/engine/fast-classifier.ts:4`); create
>   it here when the taxonomy is locked;
> - where the 10K physically lives: Apple Messages via macsync, or a legacy LP
>   export. `prospect_drafts` is going-forward only, and the macsync client is
>   outbox-only.
>
> Step zero is an **export → structure** job.

## 3. CoT labeling = rationalization (backward from a known-good answer)

We already *know* the good reply — it is Quinn's actual message on a thread that
**converted**. So we generate the reasoning **backward**, conditioned on the answer
(STaR / "distilling step-by-step" / rejection-sampled rationales):

1. Teacher reads `(thread context, atoms, state)` **and the known reply**.
2. Teacher emits the trace: *read → chosen move → constraints honored → therefore
   this reply*.
3. **Reject any rationale that does not reconstruct her actual move** — automatic
   quality filter; bad rationales never enter the corpus.

Backward rationales tend to be more reliable than forward reasoning because each one
is anchored to a verified outcome and checked against it (step 3). The SFT target
becomes `(context, read, state) → [reasoning] → reply`, which distills the
**decision procedure**, not the surface wording. At inference, either keep the trace
(explainable; it feeds Plane-2 `prospector_explain`) or distill it away for latency.
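
The rejection step can be sketched as a pure filter over sampled teacher rationales. Everything here is illustrative: the `Rationale` and `GoldTurn` shapes and both function names are hypothetical, and the sketch assumes each turn already carries an independently assigned `move` label for the reply that was actually sent. The teacher call itself is left out; only the accept/reject logic is shown.

```typescript
// Hypothetical shapes: one teacher rationale and the gold label for one turn.
interface Rationale {
  read: string;                 // teacher's summary of the inbound
  chosenMove: string;           // move the trace claims was played
  constraintsHonored: string[];
  trace: string;                // free-text reasoning that leads to the reply
}

interface GoldTurn {
  move: string;                 // assumed: labeled move of the reply that was sent
  reply: string;                // the actual message sent
}

// Keep only rationales whose chosen move matches the move actually played.
function acceptRationales(gold: GoldTurn, candidates: Rationale[]): Rationale[] {
  return candidates.filter((r) => r.chosenMove === gold.move);
}

// Build one SFT example per turn, or null when every sampled rationale was rejected.
function toSftExample(
  context: string,
  gold: GoldTurn,
  candidates: Rationale[],
): { prompt: string; completion: string } | null {
  const first = acceptRationales(gold, candidates)[0];
  if (first === undefined) return null; // no faithful rationale: skip the turn
  return { prompt: context, completion: `${first.trace}\n\n${gold.reply}` };
}
```

Returning `null` instead of falling back to an unfaithful trace is the point of the filter: a turn with no accepted rationale simply stays out of the corpus.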

## 4. The shortcut: distill agent wins into the matcher library

The matcher is cheap (classify + retrieve, no generation); the agent is expensive
(GPU generation). The self-improving cascade turns expensive wins into cheap future
coverage:

```
inbound → classify (cheap)
   ├── good match in campaign library? ─── matcher emits it (cheap)            ✅
   └── novel / no good match? ──────────── escalate to agent (generate, costly)
                                              └── agent reply converts?
                                                    └── PROMOTE it into the campaign
                                                        library as a new matchable
                                                        reply, keyed by (archetype ×
                                                        state × move)
```

Every generator win on a novel situation becomes a new entry the matcher can select
next time → **coverage grows, fewer inbounds need the generator, cost falls while
quality rises.** This *is* "shortcut improvements": you grow behavior at
**prompt/library speed (hours)** instead of **retraining speed (days)**, and only
LoRA when the prompt/library layer saturates. The promoted replies are literally new
`appliesTo: { archetype, state, templateKey }` entries in the
[`draft-engine.md`](./draft-engine.md) CoT-workflow library.
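
A minimal sketch of the cascade, under loud assumptions: the library is an in-memory object keyed by `archetype|state|move`, "good match" is simplified to "any entry exists for the key", and the generator is an injected function. `Library`, `draftReply`, and `promoteIfConverted` are hypothetical names, not the engine's API.

```typescript
// Hypothetical library keyed by archetype x state x move; a plain object stands in for storage.
interface LibraryKey { archetype: string; state: string; move: string }
type Library = Record<string, string[]>;

function keyOf(k: LibraryKey): string {
  return [k.archetype, k.state, k.move].join("|");
}

interface Draft { body: string; via: "matcher" | "agent" }

// Cheap path first; escalate to the injected (costly) generator only on a library miss.
function draftReply(lib: Library, k: LibraryKey, generate: (k: LibraryKey) => string): Draft {
  const hits = lib[keyOf(k)];
  const first = hits === undefined ? undefined : hits[0];
  if (first !== undefined) return { body: first, via: "matcher" };
  return { body: generate(k), via: "agent" };
}

// Promote an agent reply only after its thread converted; matcher drafts are already in the library.
function promoteIfConverted(lib: Library, k: LibraryKey, d: Draft, converted: boolean): boolean {
  if (d.via !== "agent" || !converted) return false;
  const key = keyOf(k);
  const existing = lib[key] ?? [];
  if (existing.indexOf(d.body) !== -1) return false; // already matchable
  lib[key] = existing.concat([d.body]);
  return true;
}
```

After one promoted win, the same key resolves on the cheap path and the generator is not called again for it, which is the coverage growth described above.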

## 5. Data routing — provenance routes, outcome admits

`source` decides *which model* an example trains; `outcome` decides *whether it is
admitted at all*. Both gates always apply (training on agent output is
self-distillation — keep only the wins, or you reinforce mistakes):

| Source | Good outcome → | Bad outcome → |
|---|---|---|
| **matcher** | classifier SFT + retrieval positive | retrieval negative; **and a "library gap" signal** (novel → had to escalate) marking exactly where the matcher needs new entries |
| **agent** | generator voice SFT + **library-promotion candidate** (§4) | dropped from SFT (bad voice example); a ghosted reply can still serve as the rejected side of a DPO contrast pair |
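
The table reduces to a small pure function. This is a sketch with hypothetical destination names, and it assumes `booked` and `engaged` count as good outcomes, which this document does not pin down. Only the two producers in the table are covered; `human` and `runner` rows would need their own rule.

```typescript
// Hypothetical destination names; the routing rule itself comes from the table above.
type Producer = "matcher" | "agent";
type Outcome = "booked" | "engaged" | "ghosted" | "blocked" | "scam";
type Destination =
  | "classifier-sft"
  | "retrieval-positive"
  | "retrieval-negative"
  | "library-gap"
  | "generator-sft"
  | "promotion-candidate";

// Assumption: booked/engaged are wins. Adjust once the outcome policy is locked.
const GOOD_OUTCOMES: readonly Outcome[] = ["booked", "engaged"];

// Provenance picks the model; outcome decides admission.
function routeExample(producer: Producer, outcome: Outcome): Destination[] {
  const good = GOOD_OUTCOMES.indexOf(outcome) !== -1;
  if (producer === "matcher") {
    return good
      ? ["classifier-sft", "retrieval-positive"]
      : ["retrieval-negative", "library-gap"];
  }
  // Agent losses are not admitted to the SFT sets.
  return good ? ["generator-sft", "promotion-candidate"] : [];
}
```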

Three derived sets fall out of the same extraction — the data-efficiency multipliers:

- **Hard-example set (active learning).** Run the current `fast-classifier` over all
  10K; diff its atoms/archetype vs the CoT ground truth. **Train only on the
  disagreements** — where you are currently wrong. Biggest efficiency lever.
- **DPO contrast pairs.** Same `archetype × state`, one reply converted, one ghosted
  → a direct preference pair. The `original_body → corrected_body` rows in
  `prospect_corrections` are ready-made pairs.
- **Held-out eval set.** Atom-stratified split (10–20%) → honest metrics. With 10K
  this is real; at 25 examples (the current rule-classifier's set) the eval gate was
  theater.
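
Two of these derived sets are simple enough to sketch. The shapes and function names below are hypothetical. The hard-example diff compares archetype only (the atom-level diff would follow the same pattern), and the DPO `prompt` is a placeholder bucket key where a real pipeline would carry the thread context.

```typescript
interface Labeled {
  id: string;
  archetype: string; // CoT ground-truth archetype
  state: string;
  reply: string;
  outcome: string;
}

// Hard examples: turns where the current classifier disagrees with the CoT label.
function hardExamples(gold: Labeled[], predictArchetype: (r: Labeled) => string): Labeled[] {
  return gold.filter((r) => predictArchetype(r) !== r.archetype);
}

interface PreferencePair { prompt: string; chosen: string; rejected: string }

// DPO pairs: within one archetype x state bucket, pair each converted reply with each ghosted one.
function dpoPairs(rows: Labeled[], isWin: (outcome: string) => boolean): PreferencePair[] {
  const buckets: Record<string, { wins: Labeled[]; losses: Labeled[] }> = {};
  for (const r of rows) {
    const key = `${r.archetype}|${r.state}`;
    const b = buckets[key] ?? (buckets[key] = { wins: [], losses: [] });
    if (isWin(r.outcome)) b.wins.push(r);
    else if (r.outcome === "ghosted") b.losses.push(r);
  }
  const pairs: PreferencePair[] = [];
  for (const key of Object.keys(buckets)) {
    const b = buckets[key];
    if (b === undefined) continue;
    for (const w of b.wins) {
      for (const l of b.losses) {
        // Placeholder prompt: a real pair would use the shared thread context.
        pairs.push({ prompt: key, chosen: w.reply, rejected: l.reply });
      }
    }
  }
  return pairs;
}
```

Buckets with only wins or only losses yield no pairs, so the contrast set is naturally limited to situations where both results were observed.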

## 6. Training & serving mechanics

- **Never train on the serving droplet.** Training is an hours-long, 100%-GPU batch
  job; it would starve live inference. It runs on a **separate transient droplet**
  (spin up → produce adapter → write to storage → tear down). This does **not**
  violate "1 inference droplet until I complain" — that decision governs *serving*
  topology, not an offline job.
- **LoRA/QLoRA, not full fine-tune.** 10K is comfortably above the LoRA threshold
  (hundreds to low thousands of examples) and well below full-FT scale. Adapters are
  small and cheap to train, and the serving GPU does **multi-LoRA**: one resident
  base plus per-task adapters (classifier, draft) swapped per request at near-zero
  cost. This is why one droplet serves all roles concurrently (continuous batching +
  multi-LoRA + the `ChatPriority` queue in `src/gpu/types.ts`). See
  [`gpu-cost-control.md`](./gpu-cost-control.md).
- **Eval-gated build flip.** New build → held-out eval → metrics **and** Gate-2
  safety must pass → only then flip `draft_engine` / the task-registry version. The
  `do-gpu-<model>_<build>` convention + per-decision engine-id recording let
  corrections bucket per build, closing the loop.
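
The flip condition can be sketched as a fail-closed predicate. `EvalReport`, `mayFlip`, and the metric names are hypothetical; only the `do-gpu-<model>_<build>` id format and the "metrics and Gate-2 safety must both pass" rule come from the text above.

```typescript
interface EvalReport {
  metrics: Record<string, number>; // held-out scores, keyed by a hypothetical metric name
  gate2SafetyPassed: boolean;
}

// Engine id following the do-gpu-<model>_<build> convention.
function engineId(model: string, build: string): string {
  return `do-gpu-${model}_${build}`;
}

// Flip only when Gate-2 safety passed and every required metric meets its floor.
function mayFlip(report: EvalReport, floors: Record<string, number>): boolean {
  if (!report.gate2SafetyPassed) return false;
  for (const name of Object.keys(floors)) {
    const value = report.metrics[name];
    const floor = floors[name];
    // A missing metric fails closed rather than being treated as a pass.
    if (value === undefined || floor === undefined || value < floor) return false;
  }
  return true;
}
```

Treating a missing metric as a failure means a build whose eval run was incomplete cannot be flipped by accident.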

## 7. Sequencing

1. **Export → structure → label** the 10K into per-turn records (§2, §3). The first
   real work is ETL + the labeling pass, not training.
2. **Classifier first** — labels come free from matcher classify decisions + outcome
   mining; cleanest eval; LoRA the hard-example set (§5).
3. **Message-generator second** — voice SFT on **agent** wins (filtered by outcome),
   then DPO on contrast pairs. Uncensored OSS base. Hard-gated by safety eval.
4. **Orchestrator: never trained** — it needs tool-calling + instruction-following,
   and zero orchestration transcripts exist. Strong OSS instruct model + MCP tool
   schemas + few-shot. (Tier A, [`ai-first-v4.md`](./ai-first-v4.md).)
5. **Compounding loop:** label → `{read, move, outcome, source}` → (a) library/prompt
   wins now → (b) hard-example LoRA when saturated → (c) eval gate → (d) the new
   build's corrections re-feed the labels.

## 8. Invariant

Training never widens the send floor. A new build only changes *which body* is
proposed (matched, generated, or `template` fallback); Gate-2, the kill-switch, and
the macsync-outbox floor are untouched, and the eval gate proves safety before any
build reaches the auto-send path. A cold/failed/unproven model degrades to the
`template` fallback — never to an unsafe or placeholder send.
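
The degradation rule can be sketched as a body selector that never returns an empty or placeholder body. `ModelStatus`, `Proposal`, and `proposeBody` are hypothetical names. The sketch only chooses which body is *proposed*; whether anything is sent remains the job of Gate-2, the kill-switch, and the macsync-outbox floor, none of which appear here.

```typescript
type ModelStatus = "ready" | "cold" | "failed" | "unproven";

interface Proposal { body: string; engine: string }

// Choose which body to propose; sending is decided elsewhere and is untouched by this choice.
function proposeBody(
  status: ModelStatus,
  modelBody: string | null,
  templateBody: string,
  modelEngine: string,
): Proposal {
  if (status === "ready" && modelBody !== null && modelBody.trim().length > 0) {
    return { body: modelBody, engine: modelEngine };
  }
  // Cold, failed, unproven, or empty output all degrade to the template fallback.
  return { body: templateBody, engine: "template" };
}
```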