prospector/tooling/eval
Natalie da65901d96
Some checks are pending
CI / verify (push) Waiting to run
perf(prospector): WORKERS concurrency for rationalize (vertical scale)
Matches sweep.py — 64-way client concurrency against vLLM max_num_seqs=128 so
the full 8K-row backward-rationalization runs in ~12min on one H100, no
horizontal fleet needed.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-30 04:16:55 -04:00
..
.gitignore feat(prospector): add tooling/eval draft-engine bake-off harness 2026-06-30 01:47:56 -04:00
extract.py feat(prospector): add tooling/eval draft-engine bake-off harness 2026-06-30 01:47:56 -04:00
lib.py fix(prospector): burst-aware, 1:1-only extraction (shared lib.py) 2026-06-30 04:03:46 -04:00
mine_cluster.py feat(prospector): add mine_cluster.py — labeled message clusters from chat.db 2026-06-30 02:36:20 -04:00
rationalize.py perf(prospector): WORKERS concurrency for rationalize (vertical scale) 2026-06-30 04:16:55 -04:00
README.md feat(prospector): add tooling/eval draft-engine bake-off harness 2026-06-30 01:47:56 -04:00
run.py feat(prospector): add tooling/eval draft-engine bake-off harness 2026-06-30 01:47:56 -04:00
score.py feat(prospector): add tooling/eval draft-engine bake-off harness 2026-06-30 01:47:56 -04:00
sweep.py fix(prospector): burst-aware, 1:1-only extraction (shared lib.py) 2026-06-30 04:03:46 -04:00

tooling/eval — draft-engine bake-off

Validate that the OSS uncensored model can draft Quinn's voice well enough to be the draft engine, with Claude as the offline judge/advisor, never the runtime generator (it declines the adult copy — which is the whole reason the engine must be OSS). See docs/features/model-eval-pipeline.md.

What it does

  1. extract.py — builds a pseudonymized eval set from the agent-matcher reply-queue (handles + classified cat/tmpl + the matcher's drafted reply = the baseline) joined with full conversation context from the local Messages chat.db. Phone numbers → RQ_NN; the map stays local.
  2. run.py — the OSS model drafts Quinn's next text for each convo, over an SSH tunnel (or the wg-mesh IP) to the GPU droplet's vLLM. Concurrent (12) so vLLM batches.
  3. score.py — malformed %, on-voice %, and move-agreement vs the matcher.

PII discipline (hard rule)

Real conversations (intimate content) never enter the repo. All extracted threads, model outputs, and the handle map live under .data/ (gitignored). Only the scripts and prompt are committed. Conversation text is sent only to the operator's own GPU droplet, encrypted in transit (SSH tunnel today; wg-mesh once the droplet is onboarded as a mesh host).

Run

export OSS_URL=http://localhost:8800/v1/chat/completions   # ssh -L tunnel to vLLM
python3 extract.py && python3 run.py && python3 score.py

Result (recent agent-matcher set, 25 current NYC convos)

Model: AEON-7/Qwen3.6-27B-AEON-Ultimate-Uncensored-BF16. Four methodology iterations eliminated the early weaknesses:

Fix Weakness → result
response_format: json_schema + strict malformed JSON 425% → 0%
canon few-shot (pastebin templates) off-voice 11% → 0% (100% on-voice)
current facts + location-from-context location contradictions → 0
classify-the-move-first, then reply defensive moves (withhold address / redirect harvesters + crude to OF) → 96% move-agreement; all defensive cases fixed

The classify→reply two-step gives the free-generation model the agent-matcher's discipline on the hard cases (e.g. a crude BBC cool? → OF-redirect, not engaging it) while keeping generative flexibility — the matcher+generator hybrid from training-loop.md.

Verdict: adopt the OSS model for the draft engine; Claude stays the offline advisor/judge.