|
Some checks are pending
CI / verify (push) Waiting to run
Pulls every thread matching a pattern (e.g. BBC / self-offer / lowball) paired with Quinn's actual next reply = the gold move label. The training substrate for hardening move-classification on the intent-split cases where prompt rules fail (paying-prospect -> qualify vs non-client -> disengage). PII under gitignored .data only. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com> |
||
|---|---|---|
| .. | ||
| .gitignore | ||
| extract.py | ||
| mine_cluster.py | ||
| README.md | ||
| run.py | ||
| score.py | ||
tooling/eval — draft-engine bake-off
Validate that the OSS uncensored model can draft Quinn's voice well enough to be
the draft engine, with Claude as the
offline judge/advisor, never the runtime generator (it declines the adult copy —
which is the whole reason the engine must be OSS). See
docs/features/model-eval-pipeline.md.
What it does
extract.py— builds a pseudonymized eval set from the agent-matcher reply-queue (handles + classifiedcat/tmpl+ the matcher's drafted reply = the baseline) joined with full conversation context from the local Messageschat.db. Phone numbers →RQ_NN; the map stays local.run.py— the OSS model drafts Quinn's next text for each convo, over an SSH tunnel (or the wg-mesh IP) to the GPU droplet's vLLM. Concurrent (12) so vLLM batches.score.py— malformed %, on-voice %, and move-agreement vs the matcher.
PII discipline (hard rule)
Real conversations (intimate content) never enter the repo. All extracted threads,
model outputs, and the handle map live under .data/ (gitignored). Only the
scripts and prompt are committed. Conversation text is sent only to the
operator's own GPU droplet, encrypted in transit (SSH tunnel today; wg-mesh once
the droplet is onboarded as a mesh host).
Run
export OSS_URL=http://localhost:8800/v1/chat/completions # ssh -L tunnel to vLLM
python3 extract.py && python3 run.py && python3 score.py
Result (recent agent-matcher set, 25 current NYC convos)
Model: AEON-7/Qwen3.6-27B-AEON-Ultimate-Uncensored-BF16. Four methodology
iterations eliminated the early weaknesses:
| Fix | Weakness → result |
|---|---|
response_format: json_schema + strict |
malformed JSON 4–25% → 0% |
| canon few-shot (pastebin templates) | off-voice 11% → 0% (100% on-voice) |
| current facts + location-from-context | location contradictions → 0 |
| classify-the-move-first, then reply | defensive moves (withhold address / redirect harvesters + crude to OF) → 96% move-agreement; all defensive cases fixed |
The classify→reply two-step gives the free-generation model the agent-matcher's
discipline on the hard cases (e.g. a crude BBC cool? → OF-redirect, not engaging
it) while keeping generative flexibility — the matcher+generator hybrid from
training-loop.md.
Verdict: adopt the OSS model for the draft engine; Claude stays the offline advisor/judge.