prospector/tooling/eval/README.md
Natalie 1fa1787dd4
Some checks failed
CI / verify (push) Failing after 1m1s
docs(prospector): fix unverified claims found by the doc-review workflow
Multi-agent review against the real repo confirmed 3 accuracy errors (the
design docs were correctly cleared as forward-looking, not state claims):
- ai-system-plan: drop '95% terse' — score.py emits only on-voice/location/
  malformed; cite those.
- tooling/eval/README: pseudonym is RQ_NN only (extract.py), not THREAD_NN.
- training-loop: mark PROSPECTOR_TRAINING.md as an external Executor doc not
  yet in this repo (also dangling-cited in fast-classifier.ts:4).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-30 11:14:54 -04:00

3.8 KiB

tooling/eval — the model eval & training-data pipeline

The runnable pipeline behind docs/features/training-loop.md and model-eval-pipeline.md: turn Quinn's message history into labeled training data, score the OSS model's drafts, and provision the GPU safely. Claude is the offline advisor/judge; the OSS uncensored model (Qwen3.6-27B-AEON) is the worker — it drafts the adult copy Claude won't.

PII discipline (hard rule)

Real conversations never enter the repo. All extracted threads, model outputs, and the handle map live under .data/ (gitignored). Phone numbers are pseudonymized (RQ_NN, per extract.py); the map stays in *.local.json (gitignored). Conversation text is sent only to the operator's own GPU droplet, encrypted in transit (SSH tunnel today; wg-mesh once onboarded). Only scripts + prompts are committed.

The pipeline (grouped by stage)

Shared extraction

File Purpose
lib.py Burst-aware, 1:1-only chat.db extraction. Collapses message bursts (one sender, up to 132 in a row — ~38% of runs), excludes group chats (style 43), yields CLIENT→QUINN decision points. The single correct extraction every script uses.

Labeling (build the training set)

File Purpose
mine_cluster.py Pull a labeled cluster by regex (e.g. bbc) → (client_msg, Quinn's actual reply). For dense token-signals where regex works.
sweep.py Semantic move-classification at scale. Classifies decision points into the move taxonomy (incl. the not-a-prospect gate: existing_client / personal / vendor / spam). Scales via WORKERS/MAX_PER_HANDLE. Finds the sparse classes (escalate/photographer) regex can't.
rationalize.py Backward CoT distillation (STaR). Given a conversation + Quinn's actual reply, infer the move she ran + a reasoning trace anchored to it → the (context → trace → move) LoRA training rows.

Evaluation (the bake-off)

File Purpose
extract.py Build a pseudonymized eval set from the agent-matcher reply-queue + chat.db context.
run.py The OSS model drafts Quinn's next text per the validated methodology (json_schema strict → 0% malformed, canon few-shot → on-voice, classify-move-first → matcher-level discipline).
score.py Malformed %, on-voice %, move-agreement vs the matcher.

GPU lifecycle

File Purpose
gpu.py On-demand H100 with self-reaping auto-teardown: NO secret on the droplet; an external reaper enforces a hard lifetime cap (from the DO API) + SSH-only idle check. Region fallback (nyc2→tor1→atl1→ams3). up / reap / install-reaper / down / status.

Run it

# 1. Provision the GPU (auto-tears-down at idle/cap, even if the laptop sleeps)
python3 gpu.py up && python3 gpu.py install-reaper
ssh -f -N -L 8800:localhost:8000 root@<ip>          # encrypted tunnel to vLLM
export OSS_URL=http://localhost:8800/v1/chat/completions DATA_DIR="$PWD/.data"

# 2. Label the corpus at scale
WORKERS=64 MAX_PER_HANDLE=20 python3 sweep.py        # → .data/sweep_labels.json
WORKERS=64 python3 rationalize.py sweep_labels.json  # → .data/traincot_sweep_labels.json

# 3. Or run the bake-off eval
python3 extract.py && python3 run.py && python3 score.py

# 4. Done — tear down (model weights persist on the nyc2 volume)
python3 gpu.py down

Verdict so far (see model-eval-pipeline.md / ai-system-plan.md)

The OSS generator drafts Quinn's voice well (89% on-voice, 0 location errors after iteration) — adopt it for the draft engine; Claude stays the offline judge. The classifier needs the identity gate + clean-data LoRA before it's reliable (it aligned with her real replies only ~46% on the contaminated full corpus — the not-a-prospect gate above is the fix).