prospector/docs/features/training-loop.md
Natalie 57c3b76ef5
docs(prospector): add training-loop (CoT-labeled corpus -> LoRA -> eval gate)
Thesis: enrich 10K+ history into per-turn {read, move, outcome, source}
records before training. Map the two producers (matcher=classify+retrieve,
agent=generate) to data routing; distill agent wins into the matcher
library as the cost/quality shortcut. LoRA per role on a transient
training droplet; multi-LoRA serving on the single inference droplet;
eval-gated build flip. Classifier first, generator second, orchestrator
never trained.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-29 18:00:01 -04:00


Training loop — CoT-labeled corpus → LoRA → eval-gated build flip

How the OSS models (draft-engine.md) actually improve from Quinn's 10K+ message history. The thesis: don't train on raw (inbound → reply) pairs. First enrich every historical turn into a structured, reasoning-bearing record — that record is the durable asset; the weights are downstream and disposable. One extraction feeds classifier training, generator training, the prompt-level workflow library, the hard-example miner, and the eval set.

Model policy unchanged: serving is OSS-on-GPU (no Claude in the runtime loop). The offline labeling pass may use a strong teacher (incl. hosted Claude) — it analyzes history, it does not generate adult copy at runtime.

1. The two producers (provenance is a routing key, not metadata)

Quinn's recent good outbound came from two distinct systems. They map onto the engine's two draft modes and two of the three AI roles (ai-first-v4.md):

| Producer | What it does | Engine mode | Trains |
| --- | --- | --- | --- |
| matcher | classify convo → campaign, classify inbound, match to a reply in the campaign's set | template (selection) | the classifier + the retrieval / move-selection policy |
| agent | reads the full convo, generates a freeform reply | do-gpu (generative prospect.draft) | the generative message-generator (voice) |

Do not pool the 10K into one SFT set. Matcher outputs are selections from a fixed library — training the generator on them teaches it to parrot canned copy (no generalization). So source ∈ {matcher, agent} dispatches each example to the model it can actually improve.

2. The per-turn extraction record (the asset)

For every turn in every historical thread, produce one record keyed to the engine's existing vocabulary — invent nothing:

{
  "read":        { "archetype": "...", "atoms": { /* ProspectAtomsV4, src/engine/atoms.ts */ } },
  "state":       { "stage": "...", "transition": "..." },   // engine/state.ts state machine
  "move":        "qualify|screen|deflect-price|book|tease|re-engage|disengage",  // ← NEW label, the gold
  "constraints": { "noPriceInWriting": true, "humanOwned": false, "bookingTriad": "partial" }, // Gate-2 facts honored
  "outcome":     "booked|engaged|ghosted|blocked|scam",     // mined from the thread tail = reward/weight
  "source":      { "actor": "agent|matcher|human|runner", "agentId": "...", "context": "..." },
  "reply":       "<the actual message sent>"
}
  • read.atoms = the canonical 22-atom ProspectAtomsV4 (src/engine/atoms.ts). Reuse the existing defensive parser; do not define a parallel schema.
  • move is the one field not yet in code (~6–10 recurring plays). It is the highest-value extraction; see §4.
  • outcome is mined from how the thread actually ended, not asserted.
  • source is the v4 governance actor attribution applied retroactively to history (the same column the going-forward pipeline adds to prospect_drafts).
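The record shape above can be sketched as a TypeScript type plus a defensive structural check. This is a minimal illustration, not the locked schema: `AtomsPlaceholder` stands in for the real `ProspectAtomsV4` (which should be imported from src/engine/atoms.ts rather than redefined), and `isTurnRecord`, `isObject`, and `oneOf` are hypothetical helper names. The move and outcome values are copied from the example record and may change once the taxonomy is confirmed.

```typescript
// Sketch only. AtomsPlaceholder is an assumption: the real shape is
// ProspectAtomsV4 in src/engine/atoms.ts and should be reused, not redefined.
type AtomsPlaceholder = Record<string, unknown>;

const MOVES = ["qualify", "screen", "deflect-price", "book", "tease", "re-engage", "disengage"] as const;
const OUTCOMES = ["booked", "engaged", "ghosted", "blocked", "scam"] as const;
const ACTORS = ["agent", "matcher", "human", "runner"] as const;

type Move = (typeof MOVES)[number];
type Outcome = (typeof OUTCOMES)[number];
type Actor = (typeof ACTORS)[number];

interface TurnRecord {
  read: { archetype: string; atoms: AtomsPlaceholder };
  state: { stage: string; transition: string };
  move: Move;
  constraints: { noPriceInWriting: boolean; humanOwned: boolean; bookingTriad: string };
  outcome: Outcome;
  source: { actor: Actor; agentId: string; context: string };
  reply: string;
}

function isObject(v: unknown): v is Record<string, unknown> {
  return typeof v === "object" && v !== null;
}

function oneOf(allowed: readonly string[], v: unknown): boolean {
  return typeof v === "string" && allowed.indexOf(v) !== -1;
}

// Structural check for one parsed record (hypothetical helper).
// It validates field presence and enum membership, not atom contents.
function isTurnRecord(v: unknown): v is TurnRecord {
  if (!isObject(v)) return false;
  const { read, state, move, constraints, outcome, source, reply } = v;
  return (
    isObject(read) && typeof read.archetype === "string" && isObject(read.atoms) &&
    isObject(state) && typeof state.stage === "string" && typeof state.transition === "string" &&
    oneOf(MOVES, move) &&
    isObject(constraints) && typeof constraints.noPriceInWriting === "boolean" &&
    typeof constraints.humanOwned === "boolean" && typeof constraints.bookingTriad === "string" &&
    oneOf(OUTCOMES, outcome) &&
    isObject(source) && oneOf(ACTORS, source.actor) &&
    typeof source.agentId === "string" && typeof source.context === "string" &&
    typeof reply === "string"
  );
}
```

Deriving the union types from the `as const` arrays keeps the compile-time type and the runtime check from drifting apart when the move taxonomy is revised.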

To confirm against code when locking the schema: the full 22-atom enum set in src/engine/atoms.ts, the move taxonomy (derive from PROSPECTOR_TRAINING.md + the campaign reply sets), and where the 10K physically lives (Apple Messages via macsync / a legacy LP export — prospect_drafts is going-forward only; the macsync client is outbox-only). Step zero is an export → structure job.

3. CoT labeling = rationalization (backward from a known-good answer)

We already know the good reply — it is Quinn's actual message on a thread that converted. So we generate the reasoning backward, conditioned on the answer (STaR / "distilling step-by-step" / rejection-sampled rationales):

  1. Teacher reads (thread context, atoms, state) and the known reply.
  2. Teacher emits the trace: read → chosen move → constraints honored → therefore this reply.
  3. Reject any rationale that does not reconstruct her actual move — automatic quality filter; bad rationales never enter the corpus.

Backward rationales are far higher quality than forward reasoning because they are anchored to a verified outcome. The SFT target becomes (context, read, state) → [reasoning] → reply, which distills the decision procedure, not the surface wording. Keep the trace at inference (explainable — feeds Plane-2 prospector_explain) or distill it away for latency.
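The rejection step (3) can be sketched as a plain filter. This assumes each turn already carries a move label for the real reply (`knownMove`) that was assigned independently of the teacher's trace; the `Rationale` and `RationalizedTurn` shapes and the function name are illustrative, not taken from the codebase.

```typescript
// Hypothetical shapes; field names are illustrative only.
interface Rationale {
  read: string;                 // teacher's reading of the thread
  chosenMove: string;           // move the trace claims led to the reply
  constraintsHonored: string[];
  reply: string;
}

interface RationalizedTurn {
  turnId: string;
  knownMove: string;            // assumed: label already assigned to the real turn
  rationale: Rationale;         // teacher output, generated with the real reply in view
}

// Keep a rationale only when its chosen move matches the move actually made.
// Rejected traces are returned separately so they can be inspected or re-sampled.
function filterRationales(turns: RationalizedTurn[]): {
  kept: RationalizedTurn[];
  rejected: RationalizedTurn[];
} {
  const kept: RationalizedTurn[] = [];
  const rejected: RationalizedTurn[] = [];
  for (const turn of turns) {
    if (turn.rationale.chosenMove === turn.knownMove) {
      kept.push(turn);
    } else {
      rejected.push(turn);
    }
  }
  return { kept, rejected };
}
```

Only `kept` would enter the SFT corpus. The rejection rate per archetype is a possible signal of where the move taxonomy is ambiguous.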

4. The shortcut: distill agent wins into the matcher library

The matcher is cheap (classify + retrieve, no generation); the agent is expensive (GPU generation). The self-improving cascade turns expensive wins into cheap future coverage:

inbound → classify (cheap)
   ├── good match in campaign library? ─── matcher emits it (cheap)            ✅
   └── novel / no good match? ──────────── escalate to agent (generate, costly)
                                              └── agent reply converts?
                                                    └── PROMOTE it into the campaign
                                                        library as a new matchable
                                                        reply, keyed by (archetype ×
                                                        state × move)

Every generator win on a novel situation becomes a new entry the matcher can select next time → coverage grows, fewer inbounds need the generator, cost falls while quality rises. This is "shortcut improvements": you grow behavior at prompt/library speed (hours) instead of retraining speed (days), and only LoRA when the prompt/library layer saturates. The promoted replies are literally new appliesTo: { archetype, state, templateKey } entries in the draft-engine.md CoT-workflow library.
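A minimal sketch of the cascade, assuming an in-memory library keyed by (archetype × state × move) with one reply per key. `ReplyLibrary`, `draftReply`, and `promoteIfConverted` are hypothetical names, and `converted` is supplied by the caller because the document does not fix which outcomes count as a win.

```typescript
interface LibraryKey {
  archetype: string;
  state: string;
  move: string;
}

// Assumption: one matchable reply per key, held in memory for illustration.
type ReplyLibrary = Map<string, string>;

const keyOf = (k: LibraryKey): string => [k.archetype, k.state, k.move].join("|");

interface Draft {
  body: string;
  producer: "matcher" | "agent";
}

// Cheap path first: use the library entry if one exists, otherwise escalate
// to the (costly) generator passed in by the caller.
function draftReply(
  library: ReplyLibrary,
  key: LibraryKey,
  generate: (key: LibraryKey) => string
): Draft {
  const match = library.get(keyOf(key));
  if (match !== undefined) {
    return { body: match, producer: "matcher" };
  }
  return { body: generate(key), producer: "agent" };
}

// Promote an agent reply into the library only when it converted and the
// slot is still empty. Returns true when a new entry was added.
function promoteIfConverted(
  library: ReplyLibrary,
  key: LibraryKey,
  draft: Draft,
  converted: boolean
): boolean {
  if (draft.producer !== "agent" || !converted) return false;
  const k = keyOf(key);
  if (library.has(k)) return false;
  library.set(k, draft.body);
  return true;
}
```

After a promotion, the next inbound with the same key is served by the matcher path and the generator is not called.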

5. Data routing — provenance routes, outcome admits

source decides which model an example trains; outcome decides whether it is admitted at all. Both gates always apply (training on agent output is self-distillation — keep only the wins, or you reinforce mistakes):

| Source | Good outcome → | Bad outcome → |
| --- | --- | --- |
| matcher | classifier SFT + retrieval positive | retrieval negative; and a "library gap" signal (novel → had to escalate) marking exactly where the matcher needs new entries |
| agent | generator voice SFT + library-promotion candidate (§4) | dropped (bad voice example) |
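The two gates can be written as one small routing function. This is a sketch: the destination names are hypothetical labels for the cells of the routing table, and `goodOutcome` is assumed to be computed upstream from the mined outcome.

```typescript
type Source = "matcher" | "agent";

// Hypothetical names for the datasets/signals in the routing table.
type Destination =
  | "classifier-sft"
  | "retrieval-positive"
  | "retrieval-negative"
  | "library-gap-signal"
  | "generator-voice-sft"
  | "library-promotion-candidate";

// source picks the model an example can improve; outcome decides admission.
function routeExample(source: Source, goodOutcome: boolean): Destination[] {
  if (source === "matcher") {
    return goodOutcome
      ? ["classifier-sft", "retrieval-positive"]
      : ["retrieval-negative", "library-gap-signal"];
  }
  // Agent output is self-distillation: keep only the wins, drop the rest.
  return goodOutcome ? ["generator-voice-sft", "library-promotion-candidate"] : [];
}
```

An empty result means the example is dropped, which is the only case where a record contributes nothing to any set.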

Three derived sets fall out of the same extraction — the data-efficiency multipliers:

  • Hard-example set (active learning). Run the current fast-classifier over all 10K; diff its atoms/archetype vs the CoT ground truth. Train only on the disagreements — where you are currently wrong. Biggest efficiency lever.
  • DPO contrast pairs. Same archetype × state, one reply converted, one ghosted → a direct preference pair. The original_body → corrected_body rows in prospect_corrections are ready-made pairs.
  • Held-out eval set. Atom-stratified split (10–20%) → honest metrics. With 10K this is real; at 25 examples (the current rule-classifier's set) the eval gate was theater.
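Two of the derived sets can be sketched directly from labeled turns. The `LabeledTurn` shape is an assumption, the comparison uses archetype only (the atom-level diff would follow the same pattern), and `converted` is a hypothetical boolean derived from the mined outcome. Pairing here is every win against every loss within a bucket, which would need capping or sampling on large buckets.

```typescript
// Hypothetical shape for illustration.
interface LabeledTurn {
  id: string;
  archetype: string;          // CoT ground truth
  predictedArchetype: string; // current fast-classifier output
  state: string;
  reply: string;
  converted: boolean;         // assumed: true = converted, false = ghosted
}

// Hard-example set: only the turns where the current classifier disagrees
// with the ground-truth label.
function hardExamples(turns: LabeledTurn[]): LabeledTurn[] {
  return turns.filter((t) => t.predictedArchetype !== t.archetype);
}

interface PreferencePair {
  bucket: string;   // archetype|state
  chosen: string;   // reply that converted
  rejected: string; // reply that was ghosted
}

// DPO contrast pairs: within the same archetype × state bucket, pair each
// converting reply with each non-converting reply.
function contrastPairs(turns: LabeledTurn[]): PreferencePair[] {
  const buckets = new Map<string, { wins: string[]; losses: string[] }>();
  for (const t of turns) {
    const bucket = `${t.archetype}|${t.state}`;
    const entry = buckets.get(bucket) ?? { wins: [], losses: [] };
    (t.converted ? entry.wins : entry.losses).push(t.reply);
    buckets.set(bucket, entry);
  }
  const pairs: PreferencePair[] = [];
  buckets.forEach((entry, bucket) => {
    for (const chosen of entry.wins) {
      for (const rejected of entry.losses) {
        pairs.push({ bucket, chosen, rejected });
      }
    }
  });
  return pairs;
}
```

Buckets with only wins or only losses produce no pairs, so the pair count is a rough measure of how much contrast the history actually contains.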

6. Training & serving mechanics

  • Never train on the serving droplet. Training is an hours-long, 100%-GPU batch job; it would starve live inference. It runs on a separate transient droplet (spin up → produce adapter → write to storage → tear down). This does not violate "1 inference droplet until I complain" — that decision governs serving topology, not an offline job.
  • LoRA/QLoRA, not full fine-tune. 10K is comfortably above the LoRA threshold (hundreds to low thousands of examples) and well below full-FT scale. Adapters are tiny and cheap, and, crucially, the serving GPU does multi-LoRA: one resident base + per-task adapters (classifier, draft) swapped per request at ~zero cost. This is why one droplet serves all roles concurrently (continuous batching + multi-LoRA + the ChatPriority queue in src/gpu/types.ts). See gpu-cost-control.md.
  • Eval-gated build flip. New build → held-out eval → metrics and Gate-2 safety must pass → only then flip draft_engine / the task-registry version. The do-gpu-<model>_<build> convention + per-decision engine-id recording let corrections bucket per build, closing the loop.
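The eval gate in the last bullet can be sketched as a pure decision function. The `EvalReport` shape, the metric names, and the thresholds are assumptions for illustration; only the `do-gpu-<model>_<build>` naming comes from the text above.

```typescript
// Hypothetical report shape produced by the held-out eval run.
interface EvalReport {
  metrics: Record<string, number>;
  gate2SafetyPassed: boolean;
}

// Engine id following the do-gpu-<model>_<build> convention.
const engineId = (model: string, build: string): string => `do-gpu-${model}_${build}`;

// A build may flip only when safety passed AND every required metric is
// present and at or above its minimum. A missing metric counts as a failure.
function gateBuild(
  report: EvalReport,
  minimums: Record<string, number>
): { flip: boolean; reasons: string[] } {
  const reasons: string[] = [];
  if (!report.gate2SafetyPassed) {
    reasons.push("Gate-2 safety eval failed");
  }
  for (const name of Object.keys(minimums)) {
    const score = report.metrics[name];
    const min = minimums[name];
    if (score === undefined || min === undefined || score < min) {
      reasons.push(`metric ${name} missing or below minimum`);
    }
  }
  return { flip: reasons.length === 0, reasons };
}
```

Returning the reasons alongside the decision lets a failed gate be logged against the engine id, so a rejected build leaves a record rather than silently not shipping.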

7. Sequencing

  1. Export → structure → label the 10K into per-turn records (§2, §3). The first real work is ETL + the labeling pass, not training.
  2. Classifier first — labels come free from matcher classify decisions + outcome mining; cleanest eval; LoRA the hard-example set (§5).
  3. Message-generator second — voice SFT on agent wins (filtered by outcome), then DPO on contrast pairs. Uncensored OSS base. Hard-gated by safety eval.
  4. Orchestrator: never trained — it needs tool-calling + instruction-following, and zero orchestration transcripts exist. Strong OSS instruct model + MCP tool schemas + few-shot. (Tier A, ai-first-v4.md.)
  5. Compounding loop: label → {read, move, outcome, source} → (a) library/prompt wins now → (b) hard-example LoRA when saturated → (c) eval gate → (d) the new build's corrections re-feed the labels.

8. Invariant

Training never widens the send floor. A new build only changes which body is proposed (matched, generated, or template fallback); Gate-2, the kill-switch, and the macsync-outbox floor are untouched, and the eval gate proves safety before any build reaches the auto-send path. A cold/failed/unproven model degrades to the template fallback — never to an unsafe or placeholder send.
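The degrade-to-template rule can be illustrated with a small selector. This is a sketch under assumed names (`ModelStatus`, `proposeBody`); it only chooses which body is proposed and does not replace Gate-2, the kill-switch, or the outbox floor, which still apply to whatever it returns.

```typescript
// Hypothetical status values for the model behind a build.
type ModelStatus = "ready" | "cold" | "failed" | "unproven";

interface Proposal {
  body: string;
  origin: "model" | "template";
}

// Propose the model's body only when the model is ready and actually produced
// non-empty text; in every other case fall back to the template body.
function proposeBody(
  status: ModelStatus,
  candidate: string | null,
  templateBody: string
): Proposal {
  if (status === "ready" && candidate !== null && candidate.trim().length > 0) {
    return { body: candidate, origin: "model" };
  }
  return { body: templateBody, origin: "template" };
}
```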