Natalie da65901d96 perf(prospector): WORKERS concurrency for rationalize (vertical scale) Matches sweep.py: 64-way client concurrency against vLLM max_num_seqs=128, so the full 8K-row backward-rationalization runs in ~12min on one H100, no horizontal fleet needed. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>		2026-06-30 04:16:55 -04:00
.gitignore	feat(prospector): add tooling/eval draft-engine bake-off harness	2026-06-30 01:47:56 -04:00
extract.py	feat(prospector): add tooling/eval draft-engine bake-off harness	2026-06-30 01:47:56 -04:00
lib.py	fix(prospector): burst-aware, 1:1-only extraction (shared lib.py)	2026-06-30 04:03:46 -04:00
mine_cluster.py	feat(prospector): add mine_cluster.py — labeled message clusters from chat.db	2026-06-30 02:36:20 -04:00
rationalize.py	perf(prospector): WORKERS concurrency for rationalize (vertical scale)	2026-06-30 04:16:55 -04:00
README.md	feat(prospector): add tooling/eval draft-engine bake-off harness	2026-06-30 01:47:56 -04:00
run.py	feat(prospector): add tooling/eval draft-engine bake-off harness	2026-06-30 01:47:56 -04:00
score.py	feat(prospector): add tooling/eval draft-engine bake-off harness	2026-06-30 01:47:56 -04:00
sweep.py	fix(prospector): burst-aware, 1:1-only extraction (shared lib.py)	2026-06-30 04:03:46 -04:00

README.md

tooling/eval — draft-engine bake-off

Validates that the OSS uncensored model can draft in Quinn's voice well enough to serve as the draft engine. Claude acts as the offline judge/advisor and never as the runtime generator: it declines the adult copy, which is the whole reason the engine must be OSS. See docs/features/model-eval-pipeline.md.

What it does

extract.py: builds a pseudonymized eval set from the agent-matcher reply-queue (handles, the classified cat/tmpl, and the matcher's drafted reply, which serves as the baseline), joined with full conversation context from the local Messages chat.db. Phone numbers are replaced with RQ_NN; the map stays local.
run.py: the OSS model drafts Quinn's next text for each convo, over an SSH tunnel (or the wg-mesh IP) to the GPU droplet's vLLM. Requests run concurrently (12 at a time) so vLLM can batch them.
score.py: reports malformed %, on-voice %, and move-agreement vs the matcher.
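
The three metrics that score.py reports could be computed roughly as in the sketch below. This is an illustration under stated assumptions, not the script itself: the function name `score`, the record keys (`raw`, `on_voice`, `matcher_move`), and the `move` field in the model output are all hypothetical, and the on-voice judgment is assumed to arrive as a precomputed boolean. Every percentage uses the full record count as its denominator.

```python
import json


def score(records):
    """Return malformed %, on-voice %, and move-agreement % for a list of drafts.

    Each record is a dict with hypothetical keys:
      raw          - the model's raw output string (expected to be a JSON object)
      on_voice     - bool, a precomputed voice judgment
      matcher_move - the move the agent-matcher chose for the same convo
    """
    total = len(records)
    if total == 0:
        return {"malformed_pct": 0.0, "on_voice_pct": 0.0, "move_agreement_pct": 0.0}

    malformed = on_voice = agree = 0
    for rec in records:
        try:
            parsed = json.loads(rec["raw"])
        except (json.JSONDecodeError, TypeError):
            malformed += 1
            continue
        if not isinstance(parsed, dict):
            # Valid JSON but not an object still counts as malformed here.
            malformed += 1
            continue
        if rec.get("on_voice"):
            on_voice += 1
        if parsed.get("move") == rec.get("matcher_move"):
            agree += 1

    def pct(n):
        return round(100.0 * n / total, 1)

    return {
        "malformed_pct": pct(malformed),
        "on_voice_pct": pct(on_voice),
        "move_agreement_pct": pct(agree),
    }
```

Counting malformed drafts against the other two metrics (rather than dropping them from the denominator) keeps a broken output from silently inflating the on-voice and agreement numbers.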

PII discipline (hard rule)

Real conversations (intimate content) never enter the repo. All extracted threads, model outputs, and the handle map live under .data/ (gitignored). Only the scripts and prompt are committed. Conversation text is sent only to the operator's own GPU droplet, encrypted in transit (SSH tunnel today; wg-mesh once the droplet is onboarded as a mesh host).
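
A minimal sketch of the phone-number-to-RQ_NN substitution is shown below, to make the rule concrete. It is not the code in extract.py: the `pseudonymize` function, the `PHONE_RE` pattern, and the digits-only normalization are assumptions for illustration. In practice the returned map would be written somewhere under .data/ and never committed.

```python
import re

# Loose pattern for phone-like handles. This is an assumption for the sketch,
# not the matcher the real script uses.
PHONE_RE = re.compile(r"\+?\d[\d\-\s().]{6,}\d")


def pseudonymize(texts, handle_map=None):
    """Replace phone-like handles with stable RQ_NN aliases.

    Returns (rewritten_texts, handle_map). The map is keyed by the
    digits-only form of each number, so differently formatted copies of the
    same number share one alias.
    """
    handle_map = {} if handle_map is None else handle_map

    def repl(match):
        key = re.sub(r"\D", "", match.group(0))  # normalize to digits only
        if key not in handle_map:
            handle_map[key] = f"RQ_{len(handle_map) + 1:02d}"
        return handle_map[key]

    return [PHONE_RE.sub(repl, text) for text in texts], handle_map
```

Passing an existing `handle_map` back in on later runs keeps aliases stable across extractions.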

Run

export OSS_URL=http://localhost:8800/v1/chat/completions   # ssh -L tunnel to vLLM
python3 extract.py && python3 run.py && python3 score.py
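
For orientation, the concurrent drafting step could look something like the sketch below. It assumes an OpenAI-compatible chat-completions endpoint at `OSS_URL` (which vLLM provides) and a worker count of 12 as described above; the helper names `post_chat` and `draft_all` are hypothetical and are not taken from run.py.

```python
import json
import os
import urllib.request
from concurrent.futures import ThreadPoolExecutor

# Same variable the Run section exports; the default mirrors the tunnel address.
OSS_URL = os.environ.get("OSS_URL", "http://localhost:8800/v1/chat/completions")
WORKERS = 12  # client-side concurrency, so the server has requests to batch


def post_chat(payload, url=OSS_URL, timeout=120):
    """POST one chat-completions payload and return the first choice's text."""
    req = urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        body = json.load(resp)
    return body["choices"][0]["message"]["content"]


def draft_all(convos, draft_one, workers=WORKERS):
    """Run draft_one(convo) for every convo with bounded concurrency.

    Results come back in the same order as the input.
    """
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(draft_one, convos))
```

Threads are enough here because each worker spends its time waiting on the network; `pool.map` keeps the outputs aligned with the input convos, which simplifies scoring afterwards.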

Result (recent agent-matcher set, 25 current NYC convos)

Model: AEON-7/Qwen3.6-27B-AEON-Ultimate-Uncensored-BF16. Four methodology iterations eliminated the early weaknesses:

Fix	Weakness → result
`response_format: json_schema` + `strict`	malformed JSON 4–25% → 0%
canon few-shot (pastebin templates)	off-voice 11% → 0% (100% on-voice)
current facts + location-from-context	location contradictions → 0
classify-the-move-first, then reply	defensive moves (withhold address / redirect harvesters + crude to OF) → 96% move-agreement; all defensive cases fixed
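
As an illustration of the first fix in the table, a request body with a strict JSON schema might be built as follows. This is a sketch assuming the OpenAI-style `response_format` shape that vLLM's chat-completions endpoint accepts; the `build_payload` helper, the schema name `draft`, and the `move`/`reply` fields are hypothetical, not the harness's actual schema.

```python
def build_payload(model, messages, moves):
    """Build a chat-completions payload that constrains output to a JSON object.

    `moves` is the list of allowed move labels (hypothetical names supplied
    by the caller); the schema restricts the `move` field to that set.
    """
    schema = {
        "type": "object",
        "properties": {
            "move": {"type": "string", "enum": list(moves)},
            "reply": {"type": "string"},
        },
        "required": ["move", "reply"],
        "additionalProperties": False,
    }
    return {
        "model": model,
        "messages": messages,
        "response_format": {
            "type": "json_schema",
            "json_schema": {"name": "draft", "schema": schema, "strict": True},
        },
    }
```

With the schema enforced at decode time, the client no longer needs to repair or retry malformed JSON, which is consistent with the malformed rate dropping to 0%.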

The classify→reply two-step gives the free-generation model the agent-matcher's discipline on the hard cases (e.g. a crude "BBC cool?" gets an OF-redirect rather than engagement) while keeping generative flexibility. This is the matcher+generator hybrid from training-loop.md.
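
The two-step flow can be sketched as below. This is an assumption-laden outline rather than the harness's implementation: `classify_then_reply`, the prompt wording, the move labels passed by the caller, and the `default_move` fallback are all hypothetical, and `call_model` stands in for whatever function sends a prompt to the model and returns its text.

```python
def classify_then_reply(convo, call_model, moves, default_move):
    """Classify the move first, then generate a reply conditioned on it.

    call_model:   callable taking a prompt string and returning model text
    moves:        allowed move labels (hypothetical names)
    default_move: label used when the classifier output is not in `moves`
    """
    classify_prompt = (
        "Classify the next move for this conversation. "
        f"Answer with exactly one of: {', '.join(moves)}.\n\n{convo}"
    )
    move = call_model(classify_prompt).strip()
    if move not in moves:
        # Fall back rather than letting an unknown label steer the reply.
        move = default_move

    reply_prompt = (
        f"Write the next reply. The chosen move is '{move}'; "
        f"stay within that move.\n\n{convo}"
    )
    reply = call_model(reply_prompt).strip()
    return {"move": move, "reply": reply}
```

Separating the decision from the wording means the move can be compared directly against the matcher's classification, while the reply text remains freely generated.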

Verdict: adopt the OSS model for the draft engine; Claude stays the offline advisor/judge.
