Thesis: enrich 10K+ history into per-turn {read, move, outcome, source}
records before training. Map the two producers (matcher=classify+retrieve,
agent=generate) to data routing; distill agent wins into the matcher
library as the cost/quality shortcut. LoRA per role on a transient
training droplet; multi-LoRA serving on the single inference droplet;
eval-gated build flip. Classifier first, generator second, orchestrator
never trained.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Training loop — CoT-labeled corpus → LoRA → eval-gated build flip
How the OSS models (draft-engine.md) actually improve from Quinn's 10K+ message history. The thesis: don't train on raw (inbound → reply) pairs. First enrich every historical turn into a structured, reasoning-bearing record. That record is the durable asset; the weights are downstream and disposable. One extraction feeds classifier training, generator training, the prompt-level workflow library, the hard-example miner, and the eval set.

Model policy unchanged: serving is OSS-on-GPU (no Claude in the runtime loop). The offline labeling pass may use a strong teacher (incl. hosted Claude): it analyzes history, it does not generate adult copy at runtime.
1. The two producers (provenance is a routing key, not metadata)
Quinn's recent good outbound came from two distinct systems. They map onto the
engine's two draft modes and two of the three AI roles
(ai-first-v4.md):
| Producer | What it does | Engine mode | Trains |
|---|---|---|---|
| matcher | classify convo → campaign, classify inbound, match to a reply in the campaign's set | template (selection) | the classifier + the retrieval / move-selection policy |
| agent | reads the full convo, generates a freeform reply | do-gpu (generative prospect.draft) | the generative message-generator (voice) |
Do not pool the 10K into one SFT set. Matcher outputs are selections from a
fixed library — training the generator on them teaches it to parrot canned copy
(no generalization). So source ∈ {matcher, agent} dispatches each example to
the model it can actually improve.
2. The per-turn extraction record (the asset)
For every turn in every historical thread, produce one record keyed to the engine's existing vocabulary — invent nothing:
{
"read": { "archetype": "...", "atoms": { /* ProspectAtomsV4, src/engine/atoms.ts */ } },
"state": { "stage": "...", "transition": "..." }, // engine/state.ts state machine
"move": "qualify|screen|deflect-price|book|tease|re-engage|disengage", // ← NEW label, the gold
"constraints": { "noPriceInWriting": true, "humanOwned": false, "bookingTriad": "partial" }, // Gate-2 facts honored
"outcome": "booked|engaged|ghosted|blocked|scam", // mined from the thread tail = reward/weight
"source": { "actor": "agent|matcher|human|runner", "agentId": "...", "context": "..." },
"reply": "<the actual message sent>"
}
- read.atoms = the canonical 22-atom ProspectAtomsV4 (src/engine/atoms.ts). Reuse the existing defensive parser; do not define a parallel schema.
- move is the one field not yet in code (~6–10 recurring plays). It is the highest-value extraction; see §4.
- outcome is mined from how the thread actually ended, not asserted.
- source is the v4 governance actor attribution applied retroactively to history (the same column the going-forward pipeline adds to prospect_drafts).
To confirm against code when locking the schema: the full 22-atom enum set in
src/engine/atoms.ts, the move taxonomy (derive from PROSPECTOR_TRAINING.md + the campaign reply sets), and where the 10K physically lives (Apple Messages via macsync / a legacy LP export; prospect_drafts is going-forward only, and the macsync client is outbox-only). Step zero is an export → structure job.
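The record shape above can be sketched as a type plus a defensive check on labeler output. This is a minimal sketch under stated assumptions: every type and function name here is hypothetical, `atoms` is left as an opaque object standing in for ProspectAtomsV4 (the real code should reuse the existing parser in src/engine/atoms.ts), and treating `agentId` and `context` as optional is a guess, not something the schema states.

```typescript
// Hypothetical names throughout; the move list mirrors the record example above.
type Move =
  | "qualify" | "screen" | "deflect-price" | "book"
  | "tease" | "re-engage" | "disengage";
type Outcome = "booked" | "engaged" | "ghosted" | "blocked" | "scam";
type Actor = "agent" | "matcher" | "human" | "runner";

interface TurnRecord {
  read: { archetype: string; atoms: Record<string, unknown> }; // stand-in for ProspectAtomsV4
  state: { stage: string; transition: string };
  move: Move;
  constraints: { noPriceInWriting: boolean; humanOwned: boolean; bookingTriad: string };
  outcome: Outcome;
  source: { actor: Actor; agentId?: string; context?: string }; // optionality is assumed
  reply: string;
}

const MOVES: readonly string[] = [
  "qualify", "screen", "deflect-price", "book", "tease", "re-engage", "disengage",
];
const OUTCOMES: readonly string[] = ["booked", "engaged", "ghosted", "blocked", "scam"];
const ACTORS: readonly string[] = ["agent", "matcher", "human", "runner"];

function isObject(v: unknown): v is Record<string, unknown> {
  return typeof v === "object" && v !== null && !Array.isArray(v);
}

// Returns the record if its top-level shape and enum fields are valid, else null.
// Partial by design: it does not validate individual atoms or constraint values.
function parseTurnRecord(raw: unknown): TurnRecord | null {
  if (!isObject(raw)) return null;
  const { read, state, move, constraints, outcome, source, reply } = raw;
  if (!isObject(read) || typeof read.archetype !== "string" || !isObject(read.atoms)) return null;
  if (!isObject(state) || typeof state.stage !== "string" || typeof state.transition !== "string") return null;
  if (typeof move !== "string" || MOVES.indexOf(move) === -1) return null;
  if (!isObject(constraints)) return null;
  if (typeof outcome !== "string" || OUTCOMES.indexOf(outcome) === -1) return null;
  if (!isObject(source) || typeof source.actor !== "string" || ACTORS.indexOf(source.actor) === -1) return null;
  if (typeof reply !== "string" || reply.length === 0) return null;
  return raw as unknown as TurnRecord;
}
```

Rejecting a record with an unknown move or outcome at parse time keeps the "invent nothing" rule enforceable: a labeler that drifts outside the agreed vocabulary produces nulls, not new categories.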
3. CoT labeling = rationalization (backward from a known-good answer)
We already know the good reply — it is Quinn's actual message on a thread that converted. So we generate the reasoning backward, conditioned on the answer (STaR / "distilling step-by-step" / rejection-sampled rationales):
- Teacher reads
(thread context, atoms, state)and the known reply. - Teacher emits the trace: read → chosen move → constraints honored → therefore this reply.
- Reject any rationale that does not reconstruct her actual move — automatic quality filter; bad rationales never enter the corpus.
Backward rationales tend to be higher quality than forward-generated reasoning because they are
anchored to a verified outcome. The SFT target becomes
(context, read, state) → [reasoning] → reply, which distills the decision
procedure, not the surface wording. At inference, either keep the trace (explainable; it
feeds Plane-2 prospector_explain) or distill it away for latency.
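The rejection step can be sketched as a filter over sampled rationales. This is an illustration only: `Teacher` is a hypothetical stand-in for the offline labeling model (synchronous here for brevity; a real call would be async), and all field names are assumptions.

```typescript
// Hypothetical shapes; `move` is the label extracted for the turn in §2.
interface LabeledTurn { context: string; move: string; reply: string; }
interface Rationale { read: string; chosenMove: string; constraintsHonored: string[]; }

// The teacher sees the known reply and returns several candidate rationales.
type Teacher = (turn: LabeledTurn) => Rationale[];

// Keep the first candidate whose chosen move reconstructs the labeled move.
function rationalize(turn: LabeledTurn, teacher: Teacher): Rationale | null {
  for (const candidate of teacher(turn)) {
    if (candidate.chosenMove === turn.move) return candidate;
  }
  return null; // every candidate rejected: this turn contributes no rationale
}

// Build the corpus; turns with no accepted rationale are skipped, not patched.
function buildCorpus(
  turns: LabeledTurn[],
  teacher: Teacher,
): Array<{ turn: LabeledTurn; rationale: Rationale }> {
  const corpus: Array<{ turn: LabeledTurn; rationale: Rationale }> = [];
  for (const turn of turns) {
    const rationale = rationalize(turn, teacher);
    if (rationale !== null) corpus.push({ turn, rationale });
  }
  return corpus;
}
```

Matching on the move alone is the weakest useful check; a stricter variant could also compare the constraints the rationale claims to honor against the record's constraints field.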
4. The shortcut: distill agent wins into the matcher library
The matcher is cheap (classify + retrieve, no generation); the agent is expensive (GPU generation). The self-improving cascade turns expensive wins into cheap future coverage:
    inbound → classify (cheap)
      ├── good match in campaign library? ─── matcher emits it (cheap) ✅
      └── novel / no good match? ──────────── escalate to agent (generate, costly)
            └── agent reply converts?
                  └── PROMOTE it into the campaign
                      library as a new matchable
                      reply, keyed by (archetype ×
                      state × move)
Every generator win on a novel situation becomes a new entry the matcher can select
next time → coverage grows, fewer inbounds need the generator, cost falls while
quality rises. This is "shortcut improvements": you grow behavior at
prompt/library speed (hours) instead of retraining speed (days), and only
LoRA when the prompt/library layer saturates. The promoted replies are literally new
appliesTo: { archetype, state, templateKey } entries in the
draft-engine.md CoT-workflow library.
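The cascade and the promotion step can be sketched as follows. The class and function names are hypothetical, and the exact-key lookup is a deliberate simplification: the real matcher classifies and retrieves rather than doing a map lookup.

```typescript
// Hypothetical types; the key mirrors (archetype × state × move) from the diagram.
interface Key { archetype: string; state: string; move: string; }
interface Inbound extends Key { text: string; }
interface Draft { body: string; source: "matcher" | "agent"; }
type GenerateFn = (inbound: Inbound) => string; // stand-in for the GPU generator

const keyOf = (k: Key): string => `${k.archetype}|${k.state}|${k.move}`;

class CampaignLibrary {
  private replies = new Map<string, string>();
  match(k: Key): string | undefined { return this.replies.get(keyOf(k)); }
  promote(k: Key, reply: string): void { this.replies.set(keyOf(k), reply); }
  get size(): number { return this.replies.size; }
}

// Cheap path first; escalate to generation only when the library has no entry.
function draft(inbound: Inbound, library: CampaignLibrary, generate: GenerateFn): Draft {
  const matched = library.match(inbound);
  if (matched !== undefined) return { body: matched, source: "matcher" };
  return { body: generate(inbound), source: "agent" };
}

// Called once the thread outcome is known: only converting agent replies are promoted.
function recordOutcome(
  inbound: Inbound,
  sent: Draft,
  converted: boolean,
  library: CampaignLibrary,
): void {
  if (sent.source === "agent" && converted) library.promote(inbound, sent.body);
}
```

After one promoted win, the next inbound with the same key is served by the matcher and the generator is not called, which is the cost effect the section describes.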
5. Data routing — provenance routes, outcome admits
source decides which model an example trains; outcome decides whether it is
admitted at all. Both gates always apply (training on agent output is
self-distillation — keep only the wins, or you reinforce mistakes):
| Source | Good outcome → | Bad outcome → |
|---|---|---|
| matcher | classifier SFT + retrieval positive | retrieval negative; and a "library gap" signal (novel → had to escalate) marking exactly where the matcher needs new entries |
| agent | generator voice SFT + library-promotion candidate (§4) | dropped (bad voice example) |
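The two gates in the table can be sketched as one routing function. Two assumptions are labeled in the code: which outcomes count as "good" is not specified above (the sketch treats booked and engaged as good), and it emits the library-gap signal for every bad matcher outcome, whereas the table ties that signal to novel inbounds that had to escalate. The sink names are hypothetical.

```typescript
type Source = "matcher" | "agent";
type Outcome = "booked" | "engaged" | "ghosted" | "blocked" | "scam";
type Sink =
  | "classifier-sft" | "retrieval-positive"
  | "retrieval-negative" | "library-gap"
  | "generator-sft" | "promotion-candidate";

// ASSUMPTION: booked/engaged are the admitted ("good") outcomes.
const GOOD_OUTCOMES: readonly Outcome[] = ["booked", "engaged"];

// Provenance picks the model; outcome decides admission. Empty array means dropped.
function route(source: Source, outcome: Outcome): Sink[] {
  const good = GOOD_OUTCOMES.indexOf(outcome) !== -1;
  if (source === "matcher") {
    return good
      ? ["classifier-sft", "retrieval-positive"]
      : ["retrieval-negative", "library-gap"]; // ASSUMPTION: gap flagged on every miss
  }
  // Agent output is self-distillation: keep only the wins.
  return good ? ["generator-sft", "promotion-candidate"] : [];
}
```

For example, `route("agent", "ghosted")` returns an empty array, so a losing agent reply never reaches the generator's SFT set.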
Three derived sets fall out of the same extraction — the data-efficiency multipliers:
- Hard-example set (active learning). Run the current fast-classifier over all 10K; diff its atoms/archetype vs the CoT ground truth. Train only on the disagreements, i.e. where you are currently wrong. Biggest efficiency lever.
- DPO contrast pairs. Same archetype × state, one reply converted, one ghosted → a direct preference pair. The original_body → corrected_body rows in prospect_corrections are ready-made pairs.
- Held-out eval set. Atom-stratified split (10–20%) → honest metrics. With 10K this is real; at 25 examples (the current rule-classifier's set) the eval gate was theater.
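Two of the derived sets, the hard-example diff and the contrast pairs, can be sketched from the same labeled records. All names are hypothetical; the sketch compares archetype only (the full diff would also cover atoms) and collapses outcome into a `converted` boolean for brevity.

```typescript
interface Labeled {
  id: string;
  archetype: string;  // CoT ground-truth label
  state: string;
  reply: string;
  converted: boolean; // simplification of the outcome field
}
interface PreferencePair { prompt: string; chosen: string; rejected: string; }

// Hard examples: records where the current classifier disagrees with ground truth.
function hardExamples(records: Labeled[], predict: (r: Labeled) => string): Labeled[] {
  return records.filter((r) => predict(r) !== r.archetype);
}

// Contrast pairs: within one (archetype × state) bucket, pair a converting
// reply with a non-converting one.
function contrastPairs(records: Labeled[]): PreferencePair[] {
  const buckets = new Map<string, { won: Labeled[]; lost: Labeled[] }>();
  for (const r of records) {
    const key = `${r.archetype}|${r.state}`;
    const bucket = buckets.get(key) ?? { won: [], lost: [] };
    (r.converted ? bucket.won : bucket.lost).push(r);
    buckets.set(key, bucket);
  }
  const pairs: PreferencePair[] = [];
  buckets.forEach((bucket, key) => {
    const n = Math.min(bucket.won.length, bucket.lost.length);
    for (let i = 0; i < n; i++) {
      pairs.push({ prompt: key, chosen: bucket.won[i].reply, rejected: bucket.lost[i].reply });
    }
  });
  return pairs;
}
```

Pairing in index order is arbitrary; buckets with only wins or only losses produce no pairs, so the pair count is bounded by the rarer class in each bucket.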
6. Training & serving mechanics
- Never train on the serving droplet. Training is an hours-long, 100%-GPU batch job; it would starve live inference. It runs on a separate transient droplet (spin up → produce adapter → write to storage → tear down). This does not violate "1 inference droplet until I complain" — that decision governs serving topology, not an offline job.
- LoRA/QLoRA, not full fine-tune. 10K is comfortably above the LoRA threshold (hundreds–low-thousands), well below full-FT scale. Tiny adapters, cheap, and (key) the serving GPU does multi-LoRA: one resident base + per-task adapters (classifier, draft) swapped per request at ~zero cost. This is why one droplet serves all roles concurrently (continuous batching + multi-LoRA + the ChatPriority queue in src/gpu/types.ts). See gpu-cost-control.md.
- Eval-gated build flip. New build → held-out eval → metrics and Gate-2 safety must pass → only then flip draft_engine / the task-registry version. The do-gpu-<model>_<build> convention + per-decision engine-id recording let corrections bucket per build, closing the loop.
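The eval gate can be sketched as a pure function from an eval report to the build that should be active. The report shape, metric names, and thresholds are all hypothetical; the only rules taken from the text are that metrics and safety must both pass before the flip.

```typescript
// Hypothetical report produced by running a candidate build on the held-out set.
interface EvalReport {
  build: string;                   // engine id of the candidate build
  metrics: Record<string, number>; // e.g. accuracy per task; names are illustrative
  safetyViolations: number;        // count of Gate-2 failures on the eval set
}
interface Gate { minimums: Record<string, number>; }

// Passes only if there are zero safety violations and every required
// metric is present and at or above its minimum.
function passesGate(report: EvalReport, gate: Gate): boolean {
  if (report.safetyViolations > 0) return false;
  return Object.keys(gate.minimums).every((metric) => {
    const value = report.metrics[metric];
    return value !== undefined && value >= gate.minimums[metric];
  });
}

// The flip: the active build changes only when the candidate passes.
function nextActiveBuild(current: string, candidate: EvalReport, gate: Gate): string {
  return passesGate(candidate, gate) ? candidate.build : current;
}
```

A missing metric counts as a failure in this sketch, so a build cannot pass by omitting an eval.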
7. Sequencing
- Export → structure → label the 10K into per-turn records (§2, §3). The first real work is ETL + the labeling pass, not training.
- Classifier first — labels come free from matcher classify decisions + outcome mining; cleanest eval; LoRA the hard-example set (§5).
- Message-generator second — voice SFT on agent wins (filtered by outcome), then DPO on contrast pairs. Uncensored OSS base. Hard-gated by safety eval.
- Orchestrator: never trained. It needs tool-calling + instruction-following, and zero orchestration transcripts exist. Strong OSS instruct model + MCP tool schemas + few-shot. (Tier A, ai-first-v4.md.)
- Compounding loop: label → {read, move, outcome, source} → (a) library/prompt wins now → (b) hard-example LoRA when saturated → (c) eval gate → (d) the new build's corrections re-feed the labels.
8. Invariant
Training never widens the send floor. A new build only changes which body is
proposed (matched, generated, or template fallback); Gate-2, the kill-switch, and
the macsync-outbox floor are untouched, and the eval gate proves safety before any
build reaches the auto-send path. A cold/failed/unproven model degrades to the
template fallback — never to an unsafe or placeholder send.
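The degradation rule can be sketched as a body selector that sits upstream of the send floor. The status values and function names are hypothetical. The sketch only chooses which body is proposed; Gate-2, the kill-switch, and the outbox floor are assumed to run afterwards, unchanged.

```typescript
// Hypothetical model states; anything other than "ready" degrades to the template.
type ModelStatus = "ready" | "cold" | "failed" | "unproven";
interface Proposal { body: string; via: "generated" | "template-fallback"; }

function proposeBody(
  modelStatus: ModelStatus,
  generate: () => string, // stand-in for the model call; may throw
  templateBody: string,
): Proposal {
  const fallback: Proposal = { body: templateBody, via: "template-fallback" };
  if (modelStatus !== "ready") return fallback;
  try {
    const body = generate().trim();
    // An empty body is treated like a failure, never sent as a placeholder.
    return body.length > 0 ? { body, via: "generated" } : fallback;
  } catch {
    return fallback;
  }
}
```

Every failure path returns the same template proposal, so a new build can change what is proposed but has no code path that skips the downstream gates.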