prospector/tooling/eval
Natalie c2bcd23548
Some checks are pending
CI / verify (push) Waiting to run
feat(prospector): add mine_cluster.py — labeled message clusters from chat.db
Pulls every thread matching a pattern (e.g. BBC / self-offer / lowball) paired
with Quinn's actual next reply = the gold move label. The training substrate for
hardening move-classification on the intent-split cases where prompt rules fail
(paying-prospect -> qualify vs non-client -> disengage). PII under gitignored
.data only.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-30 02:36:20 -04:00
..
.gitignore feat(prospector): add tooling/eval draft-engine bake-off harness 2026-06-30 01:47:56 -04:00
extract.py feat(prospector): add tooling/eval draft-engine bake-off harness 2026-06-30 01:47:56 -04:00
mine_cluster.py feat(prospector): add mine_cluster.py — labeled message clusters from chat.db 2026-06-30 02:36:20 -04:00
README.md feat(prospector): add tooling/eval draft-engine bake-off harness 2026-06-30 01:47:56 -04:00
run.py feat(prospector): add tooling/eval draft-engine bake-off harness 2026-06-30 01:47:56 -04:00
score.py feat(prospector): add tooling/eval draft-engine bake-off harness 2026-06-30 01:47:56 -04:00

tooling/eval — draft-engine bake-off

Validate that the OSS uncensored model can draft Quinn's voice well enough to be the draft engine, with Claude as the offline judge/advisor, never the runtime generator (it declines the adult copy — which is the whole reason the engine must be OSS). See docs/features/model-eval-pipeline.md.

What it does

  1. extract.py — builds a pseudonymized eval set from the agent-matcher reply-queue (handles + classified cat/tmpl + the matcher's drafted reply = the baseline) joined with full conversation context from the local Messages chat.db. Phone numbers → RQ_NN; the map stays local.
  2. run.py — the OSS model drafts Quinn's next text for each convo, over an SSH tunnel (or the wg-mesh IP) to the GPU droplet's vLLM. Concurrent (12) so vLLM batches.
  3. score.py — malformed %, on-voice %, and move-agreement vs the matcher.

PII discipline (hard rule)

Real conversations (intimate content) never enter the repo. All extracted threads, model outputs, and the handle map live under .data/ (gitignored). Only the scripts and prompt are committed. Conversation text is sent only to the operator's own GPU droplet, encrypted in transit (SSH tunnel today; wg-mesh once the droplet is onboarded as a mesh host).

Run

export OSS_URL=http://localhost:8800/v1/chat/completions   # ssh -L tunnel to vLLM
python3 extract.py && python3 run.py && python3 score.py

Result (recent agent-matcher set, 25 current NYC convos)

Model: AEON-7/Qwen3.6-27B-AEON-Ultimate-Uncensored-BF16. Four methodology iterations eliminated the early weaknesses:

Fix Weakness → result
response_format: json_schema + strict malformed JSON 425% → 0%
canon few-shot (pastebin templates) off-voice 11% → 0% (100% on-voice)
current facts + location-from-context location contradictions → 0
classify-the-move-first, then reply defensive moves (withhold address / redirect harvesters + crude to OF) → 96% move-agreement; all defensive cases fixed

The classify→reply two-step gives the free-generation model the agent-matcher's discipline on the hard cases (e.g. a crude BBC cool? → OF-redirect, not engaging it) while keeping generative flexibility — the matcher+generator hybrid from training-loop.md.

Verdict: adopt the OSS model for the draft engine; Claude stays the offline advisor/judge.