# Training loop: CoT-labeled corpus → LoRA → eval-gated build flip

> How the OSS models ([`draft-engine.md`](./draft-engine.md)) actually improve from
> Quinn's **10K+ message history**. The thesis: **don't train on raw
> `(inbound → reply)` pairs.** First enrich every historical turn into a structured,
> reasoning-bearing record. That record is the durable asset; the weights are
> downstream and disposable. One extraction feeds classifier training, generator
> training, the prompt-level workflow library, the hard-example miner, and the eval
> set.
>
> Model policy unchanged: serving is OSS-on-GPU (no Claude in the runtime loop). The
> **offline labeling pass** may use a strong teacher (incl. hosted Claude): it
> analyzes history, it does not generate adult copy at runtime.

## 1. The two producers (provenance is a routing key, not metadata)

Quinn's recent good outbound came from two distinct systems. They map onto the
engine's two draft modes and two of the three AI roles
([`ai-first-v4.md`](./ai-first-v4.md)):

| Producer | What it does | Engine mode | Trains |
|---|---|---|---|
| **matcher** | classify convo → campaign, classify inbound, **match** to a reply in the campaign's set | `template` (selection) | the **classifier** + the **retrieval / move-selection** policy |
| **agent** | reads the full convo, **generates** a freeform reply | `do-gpu` (generative `prospect.draft`) | the **generative message-generator** (voice) |

**Do not pool the 10K into one SFT set.** Matcher outputs are *selections from a
fixed library*; training the generator on them teaches it to parrot canned copy
(no generalization). So `source ∈ {matcher, agent}` **dispatches** each example to
the model it can actually improve.

## 2. The per-turn extraction record (the asset)

For every turn in every historical thread, produce one record keyed to the engine's
**existing** vocabulary (invent nothing):

```jsonc
{
  "read": { "archetype": "...", "atoms": { /* ProspectAtomsV4, src/engine/atoms.ts */ } },
  "state": { "stage": "...", "transition": "..." }, // engine/state.ts state machine
  "move": "qualify|screen|deflect-price|book|tease|re-engage|disengage", // ← NEW label, the gold
  "constraints": { "noPriceInWriting": true, "humanOwned": false, "bookingTriad": "partial" }, // Gate-2 facts honored
  "outcome": "booked|engaged|ghosted|blocked|scam", // mined from the thread tail = reward/weight
  "source": { "actor": "agent|matcher|human|runner", "agentId": "...", "context": "..." },
  "reply": "<the actual message sent>"
}
```

- **`read.atoms`** = the canonical **22-atom `ProspectAtomsV4`** (`src/engine/atoms.ts`).
  Reuse the existing defensive parser; do not define a parallel schema.
- **`move`** is the one field not yet in code (~6–10 recurring plays). It is the
  highest-value extraction; see §4.
- **`outcome`** is mined from how the thread actually ended, not asserted.
- **`source`** is the v4 governance **actor attribution** applied retroactively to
  history (the same column the going-forward pipeline adds to `prospect_drafts`).

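A minimal sketch of a pre-admission check for these records, assuming they arrive as plain dicts shaped like the jsonc sketch above. The value sets are copied from that sketch and are placeholders until the schema is locked; `validate_record` is a hypothetical helper, not an existing function in this repo, and it deliberately does not validate `read.atoms` (that stays with the existing parser in `src/engine/atoms.ts`).

```python
from typing import Any

# Placeholder value sets copied from the record sketch; the real enums should
# come from the engine code once the schema is locked.
MOVES = {"qualify", "screen", "deflect-price", "book", "tease", "re-engage", "disengage"}
OUTCOMES = {"booked", "engaged", "ghosted", "blocked", "scam"}
ACTORS = {"agent", "matcher", "human", "runner"}
REQUIRED = ("read", "state", "move", "constraints", "outcome", "source", "reply")


def validate_record(record: dict[str, Any]) -> list[str]:
    """Return a list of problems; an empty list means the record is usable."""
    problems = [f"missing field: {key}" for key in REQUIRED if key not in record]

    if record.get("move") not in MOVES:
        problems.append(f"unknown move: {record.get('move')!r}")
    if record.get("outcome") not in OUTCOMES:
        problems.append(f"unknown outcome: {record.get('outcome')!r}")

    source = record.get("source")
    actor = source.get("actor") if isinstance(source, dict) else None
    if actor not in ACTORS:
        problems.append(f"unknown source.actor: {actor!r}")

    reply = record.get("reply")
    if not isinstance(reply, str) or not reply.strip():
        problems.append("empty reply")
    return problems
```

Returning a list of problems instead of raising lets the ETL pass log every defect in a record at once and keep going.
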
> **To confirm against code when locking the schema:**
>
> - the full 22-atom enum set in `src/engine/atoms.ts`;
> - the `move` taxonomy, derived from `PROSPECTOR_TRAINING.md` plus the campaign
>   reply sets. `PROSPECTOR_TRAINING.md` is an **external Executor doc, not yet in
>   this repo** (it is also dangling-cited at `src/engine/fast-classifier.ts:4`);
>   create it here when the taxonomy is locked;
> - where the 10K physically lives: Apple Messages via macsync, or a legacy LP
>   export. `prospect_drafts` is going-forward only, and the macsync client is
>   outbox-only.
>
> Step zero is an **export → structure** job.

## 3. CoT labeling = rationalization (backward from a known-good answer)

We already *know* the good reply: it is Quinn's actual message on a thread that
**converted**. So we generate the reasoning **backward**, conditioned on the answer
(STaR / "distilling step-by-step" / rejection-sampled rationales):

1. Teacher reads `(thread context, atoms, state)` **and the known reply**.
2. Teacher emits the trace: *read → chosen move → constraints honored → therefore
   this reply*.
3. **Reject any rationale that does not reconstruct her actual move.** This is an
   automatic quality filter; bad rationales never enter the corpus.

Backward rationales are far higher quality than forward reasoning because they are
anchored to a verified outcome. The SFT target becomes
`(context, read, state) → [reasoning] → reply`, which distills the **decision
procedure**, not the surface wording. Keep the trace at inference (explainable;
feeds Plane-2 `prospector_explain`) or distill it away for latency.

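The rejection step (item 3) can be sketched as a plain filter. This assumes the teacher's trace has already been parsed so the move it arrives at is available as a string, and that the gold move per turn comes from the §2 record; `Rationale` and `filter_rationales` are hypothetical names, and the teacher call itself is out of scope here.

```python
from dataclasses import dataclass


@dataclass
class Rationale:
    turn_id: str
    reconstructed_move: str  # the move the teacher's trace concludes with
    trace: str               # read -> move -> constraints -> reply, as text


def filter_rationales(
    rationales: list[Rationale], gold_moves: dict[str, str]
) -> list[Rationale]:
    """Keep only rationales whose reconstructed move equals the labeled move.

    Turns with no gold label are rejected too: an unverifiable rationale is
    treated the same as a wrong one.
    """
    return [
        r for r in rationales
        if gold_moves.get(r.turn_id) == r.reconstructed_move
    ]
```

If the teacher is sampled several times per turn, the same filter applies to each sample; any survivor can be used for that turn.
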
## 4. The shortcut: distill agent wins into the matcher library

The matcher is cheap (classify + retrieve, no generation); the agent is expensive
(GPU generation). The self-improving cascade turns expensive wins into cheap future
coverage:

```
inbound → classify (cheap)
  ├── good match in campaign library? ─── matcher emits it (cheap) ✅
  └── novel / no good match? ──────────── escalate to agent (generate, costly)
        └── agent reply converts?
              └── PROMOTE it into the campaign
                  library as a new matchable
                  reply, keyed by (archetype ×
                  state × move)
```

Every generator win on a novel situation becomes a new entry the matcher can select
next time → **coverage grows, fewer inbounds need the generator, cost falls while
quality rises.** This *is* "shortcut improvements": you grow behavior at
**prompt/library speed (hours)** instead of **retraining speed (days)**, and reach
for LoRA only when the prompt/library layer saturates. The promoted replies are
literally new `appliesTo: { archetype, state, templateKey }` entries in the
[`draft-engine.md`](./draft-engine.md) CoT-workflow library.

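A minimal in-memory sketch of the cascade, assuming the library is keyed by the `(archetype, state, move)` triple from the diagram. `ReplyLibrary`, `draft`, and the exact-match lookup are illustrative assumptions; the real matcher classifies and retrieves, and the real entries live in the `draft-engine.md` workflow library.

```python
from typing import Callable, Optional

LibraryKey = tuple[str, str, str]  # (archetype, state, move)


class ReplyLibrary:
    """Toy stand-in for a campaign reply library (hypothetical)."""

    def __init__(self) -> None:
        self._entries: dict[LibraryKey, list[str]] = {}

    def match(self, key: LibraryKey) -> Optional[str]:
        """Return a stored reply for this key, or None when there is a gap."""
        replies = self._entries.get(key)
        return replies[0] if replies else None

    def promote(self, key: LibraryKey, reply: str) -> bool:
        """Add a converted agent reply; returns False if it was already there."""
        replies = self._entries.setdefault(key, [])
        if reply in replies:
            return False
        replies.append(reply)
        return True


def draft(
    library: ReplyLibrary, key: LibraryKey, generate: Callable[[], str]
) -> tuple[str, str]:
    """Try the cheap path first; call the generator only on a library gap."""
    matched = library.match(key)
    if matched is not None:
        return "matcher", matched
    return "agent", generate()
```

Promotion is called only after the thread outcome is known, so the library grows from verified wins and never from unscored generations.
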
## 5. Data routing: provenance routes, outcome admits

`source` decides *which model* an example trains; `outcome` decides *whether it is
admitted at all*. Both gates always apply (training on agent output is
self-distillation: keep only the wins, or you reinforce mistakes):

| Source | Good outcome → | Bad outcome → |
|---|---|---|
| **matcher** | classifier SFT + retrieval positive | retrieval negative; **and a "library gap" signal** (novel → had to escalate) marking exactly where the matcher needs new entries |
| **agent** | generator voice SFT + **library-promotion candidate** (§4) | dropped (bad voice example) |

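
The routing table can be sketched as one function over the §2 record. Which outcomes count as "good" is not fixed anywhere in this doc, so `GOOD_OUTCOMES` below is an explicit assumption, and the destination names are hypothetical labels for the four cells of the table.

```python
# ASSUMPTION: "booked" and "engaged" are wins; everything else is a bad outcome.
GOOD_OUTCOMES = {"booked", "engaged"}


def route(record: dict) -> list[str]:
    """Return the training sets a record is admitted to (possibly none)."""
    actor = record["source"]["actor"]
    good = record["outcome"] in GOOD_OUTCOMES

    if actor == "matcher":
        if good:
            return ["classifier_sft", "retrieval_positive"]
        # A failed match is both a negative and a pointer to a library gap.
        return ["retrieval_negative", "library_gap"]

    if actor == "agent":
        if good:
            return ["generator_sft", "promotion_candidate"]
        return []  # bad voice example: dropped

    # human / runner rows are not covered by the table; admit nowhere.
    return []
```

Returning an empty list for unknown actors keeps the default conservative: nothing trains a model unless a row of the table says so.
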
Three derived sets fall out of the same extraction; these are the data-efficiency
multipliers:

- **Hard-example set (active learning).** Run the current `fast-classifier` over all
  10K; diff its atoms/archetype vs the CoT ground truth. **Train only on the
  disagreements**, where you are currently wrong. Biggest efficiency lever.
- **DPO contrast pairs.** Same `archetype × state`, one reply converted, one ghosted
  → a direct preference pair. The `original_body → corrected_body` rows in
  `prospect_corrections` are ready-made pairs.
- **Held-out eval set.** Atom-stratified split (10–20%) → honest metrics. With 10K
  this is real; at 25 examples (the current rule-classifier's set) the eval gate was
  theater.

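The first two derived sets can be sketched directly over the §2 records. This is illustrative only: `predict_archetype` stands in for a call into the current `fast-classifier`, "converted" is assumed to mean `outcome == "booked"`, and the `chosen` / `rejected` field names are a common convention for preference data, not something this repo defines.

```python
from collections import defaultdict
from itertools import product
from typing import Callable


def hard_examples(
    records: list[dict], predict_archetype: Callable[[dict], str]
) -> list[dict]:
    """Keep only records where the current classifier disagrees with the label."""
    return [r for r in records if predict_archetype(r) != r["read"]["archetype"]]


def contrast_pairs(records: list[dict]) -> list[dict]:
    """Pair converted and ghosted replies that share archetype and stage."""
    buckets: dict[tuple[str, str], dict[str, list[str]]] = defaultdict(
        lambda: {"chosen": [], "rejected": []}
    )
    for r in records:
        key = (r["read"]["archetype"], r["state"]["stage"])
        if r["outcome"] == "booked":      # ASSUMPTION: booked == converted
            buckets[key]["chosen"].append(r["reply"])
        elif r["outcome"] == "ghosted":
            buckets[key]["rejected"].append(r["reply"])

    pairs = []
    for (archetype, stage), bucket in buckets.items():
        for chosen, rejected in product(bucket["chosen"], bucket["rejected"]):
            pairs.append(
                {"archetype": archetype, "state": stage,
                 "chosen": chosen, "rejected": rejected}
            )
    return pairs
```

The full cross product can get large for common `archetype × state` buckets; capping pairs per bucket is a reasonable refinement once real counts are known.
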
## 6. Training & serving mechanics

- **Never train on the serving droplet.** Training is an hours-long, 100%-GPU batch
  job; it would starve live inference. It runs on a **separate transient droplet**
  (spin up → produce adapter → write to storage → tear down). This does **not**
  violate "1 inference droplet until I complain": that decision governs *serving*
  topology, not an offline job.
- **LoRA/QLoRA, not full fine-tune.** 10K is comfortably above the LoRA threshold
  (hundreds–low-thousands), well below full-FT scale. The adapters are tiny and
  cheap and, crucially, the serving GPU does **multi-LoRA**: one resident base +
  per-task adapters (classifier, draft) swapped per request at ~zero cost. This is
  why one droplet serves all roles concurrently (continuous batching + multi-LoRA +
  the `ChatPriority` queue in `src/gpu/types.ts`). See
  [`gpu-cost-control.md`](./gpu-cost-control.md).
- **Eval-gated build flip.** New build → held-out eval → metrics **and** Gate-2
  safety must pass → only then flip `draft_engine` / the task-registry version. The
  `do-gpu-<model>_<build>` convention + per-decision engine-id recording let
  corrections bucket per build, closing the loop.

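The flip condition in the last bullet reduces to a conjunction, sketched below. The metric names and thresholds are placeholders (this doc does not define them), and `should_flip` / `engine_id` are hypothetical helpers; only the `do-gpu-<model>_<build>` id format is taken from the text above.

```python
def should_flip(
    metrics: dict[str, float],
    thresholds: dict[str, float],
    gate2_passed: bool,
) -> bool:
    """Flip only if Gate-2 safety passed AND every thresholded metric clears.

    A metric that is missing from the eval output counts as a failure.
    """
    if not gate2_passed:
        return False
    return all(
        metrics.get(name, float("-inf")) >= minimum
        for name, minimum in thresholds.items()
    )


def engine_id(model: str, build: str) -> str:
    """Build the per-decision engine id used to bucket corrections per build."""
    return f"do-gpu-{model}_{build}"
```

Treating a missing metric as a failure means a broken eval run can never flip a build by accident.
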
## 7. Sequencing

1. **Export → structure → label** the 10K into per-turn records (§2, §3). The first
   real work is ETL + the labeling pass, not training.
2. **Classifier first:** labels come free from matcher classify decisions + outcome
   mining; cleanest eval; LoRA the hard-example set (§5).
3. **Message-generator second:** voice SFT on **agent** wins (filtered by outcome),
   then DPO on contrast pairs. Uncensored OSS base. Hard-gated by safety eval.
4. **Orchestrator: never trained.** It needs tool-calling + instruction-following,
   and zero orchestration transcripts exist. Strong OSS instruct model + MCP tool
   schemas + few-shot. (Tier A, [`ai-first-v4.md`](./ai-first-v4.md).)
5. **Compounding loop:** label → `{read, move, outcome, source}` → (a) library/prompt
   wins now → (b) hard-example LoRA when saturated → (c) eval gate → (d) the new
   build's corrections re-feed the labels.

## 8. Invariant

Training never widens the send floor. A new build only changes *which body* is
proposed (matched, generated, or `template` fallback); Gate-2, the kill-switch, and
the macsync-outbox floor are untouched, and the eval gate proves safety before any
build reaches the auto-send path. A cold/failed/unproven model degrades to the
`template` fallback, never to an unsafe or placeholder send.

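The degrade rule can be sketched as a body selector. This only decides which body is proposed; Gate-2, the kill-switch, and the outbox floor still run afterwards and are not modeled here. `choose_body` and its parameters are hypothetical names for illustration.

```python
from typing import Optional


def choose_body(
    model_body: Optional[str],
    build_passed_eval: bool,
    template_body: str,
) -> tuple[str, str]:
    """Propose the model's body only from a proven build with real output.

    A cold model (None), a failed call (empty text), or an unproven build all
    fall back to the template body.
    """
    if build_passed_eval and model_body and model_body.strip():
        return "model", model_body
    return "template", template_body
```
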