tooling/eval — the model eval & training-data pipeline
The runnable pipeline behind docs/features/training-loop.md
and model-eval-pipeline.md: turn
Quinn's message history into labeled training data, score the OSS model's drafts,
and provision the GPU safely. Claude is the offline advisor/judge; the OSS
uncensored model (Qwen3.6-27B-AEON) is the worker — it drafts the adult copy
Claude won't.
PII discipline (hard rule)
Real conversations never enter the repo. All extracted threads, model outputs, and
the handle map live under .data/ (gitignored). Phone numbers are pseudonymized
(RQ_NN, per extract.py); the map stays in *.local.json (gitignored). Conversation
text is sent only to the operator's own GPU droplet, encrypted in transit (SSH
tunnel today; wg-mesh once onboarded). Only scripts + prompts are committed.
The pipeline (grouped by stage)
| File |
Purpose |
lib.py |
Burst-aware, 1:1-only chat.db extraction. Collapses message bursts (one sender, up to 132 in a row — ~38% of runs), excludes group chats (style 43), yields CLIENT→QUINN decision points. The single correct extraction every script uses. |
Labeling (build the training set)
| File |
Purpose |
mine_cluster.py |
Pull a labeled cluster by regex (e.g. bbc) → (client_msg, Quinn's actual reply). For dense token-signals where regex works. |
sweep.py |
Semantic move-classification at scale. Classifies decision points into the move taxonomy (incl. the not-a-prospect gate: existing_client / personal / vendor / spam). Scales via WORKERS/MAX_PER_HANDLE. Finds the sparse classes (escalate/photographer) regex can't. |
rationalize.py |
Backward CoT distillation (STaR). Given a conversation + Quinn's actual reply, infer the move she ran + a reasoning trace anchored to it → the (context → trace → move) LoRA training rows. |
Evaluation (the bake-off)
| File |
Purpose |
extract.py |
Build a pseudonymized eval set from the agent-matcher reply-queue + chat.db context. |
run.py |
The OSS model drafts Quinn's next text per the validated methodology (json_schema strict → 0% malformed, canon few-shot → on-voice, classify-move-first → matcher-level discipline). |
score.py |
Malformed %, on-voice %, move-agreement vs the matcher. |
GPU lifecycle
| File |
Purpose |
gpu.py |
On-demand H100 with self-reaping auto-teardown: NO secret on the droplet; an external reaper enforces a hard lifetime cap (from the DO API) + SSH-only idle check. Region fallback (nyc2→tor1→atl1→ams3). up / reap / install-reaper / down / status. |
Run it
# 1. Provision the GPU (auto-tears-down at idle/cap, even if the laptop sleeps)
python3 gpu.py up && python3 gpu.py install-reaper
ssh -f -N -L 8800:localhost:8000 root@<ip> # encrypted tunnel to vLLM
export OSS_URL=http://localhost:8800/v1/chat/completions DATA_DIR="$PWD/.data"
# 2. Label the corpus at scale
WORKERS=64 MAX_PER_HANDLE=20 python3 sweep.py # → .data/sweep_labels.json
WORKERS=64 python3 rationalize.py sweep_labels.json # → .data/traincot_sweep_labels.json
# 3. Or run the bake-off eval
python3 extract.py && python3 run.py && python3 score.py
# 4. Done — tear down (model weights persist on the nyc2 volume)
python3 gpu.py down
Verdict so far (see model-eval-pipeline.md / ai-system-plan.md)
The OSS generator drafts Quinn's voice well (89% on-voice, 0 location errors after
iteration) — adopt it for the draft engine; Claude stays the offline judge. The
classifier needs the identity gate + clean-data LoRA before it's reliable (it aligned
with her real replies only ~46% on the contaminated full corpus — the not-a-prospect
gate above is the fix).