docs(prospector): add model eval & selection pipeline (Claude-advisor/OSS-worker)

Bake-off harness in src/eval/ with Claude as offline labeler/judge/advisor
(never in the serving loop). Per-role scoring (classifier F1, generator
refusal+voice+policy+85% gate, orchestrator tool-call), replay harness to
fix Executor cycle-1's no-batch-replay blocker, researched candidate
roster (de-refused instruct base + Quinn-voice LoRA over heavy RP
fine-tunes). Reuses outcomes.jsonl/gold-turnpairs/RUNNER-POLICY.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Natalie 2026-06-29 23:42:10 -04:00
parent aa3e6eacda
commit 0120acef26


@@ -0,0 +1,106 @@
# Model eval & selection pipeline — Claude-advisor / OSS-worker
> Pick the OSS uncensored models we'll actually run, by bake-off, with **Claude as
> the offline advisor/judge/labeler — never in the serving loop.** This operationalizes
> the eval gate from [`training-loop.md`](./training-loop.md) and fixes the blocker
> that stopped Executor's manual eval cycle 1.
>
> **Home:** `@applications/prospector` (`src/eval/`). It already has the model-boss/GPU
> plumbing to serve candidates and the thread data to replay. **Corpus:** Executor's
> `prospecting/voice-engine/datasets/` (`outcomes.jsonl` — 323 TAGSET-labeled rows,
> `gold-turnpairs.jsonl` — 42 pairs) + `RUNNER-POLICY.md` + `CANONICAL-VOICE.md` +
> `TAGSET.md`, and the 40K iMessage history.
## The split (reconciles "Claude in the loop" with "OSS at runtime")
- **OSS uncensored = the worker** — runtime serving, and the *candidate under test*.
- **Claude = advisor in the *development/eval* loop only**, doing three offline jobs, none
  of which emits adult copy to a prospect:
  1. **Labeler**: extend the 323 labeled rows to the full corpus against `TAGSET.md`, with a
     chain-of-thought rationale per label.
  2. **Judge**: score candidate drafts on voice and policy (the role the human did by hand
     in cycle 1).
  3. **Advisor**: synthesize the scorecard, then recommend the model + build per role.
## What already exists (do not rebuild)
| Asset | Where | Use |
|---|---|---|
| 323 TAGSET-labeled personas | `outcomes.jsonl` | classifier eval gold |
| 42 `(client→quinn)` pairs | `gold-turnpairs.jsonl` | generator eval gold seed |
| eval method + **85% "would-send-unedited"** Gate-1 + axes (VOICE_REGISTER, THREAD_FACTS) | `eval-heldout.md` | the rubric |
| policy source of truth (scam screen, QUALIFY→PITCH+RATE→location→HOLD, voice) | `RUNNER-POLICY.md` | judge policy rubric |
| candidate serving + on-demand GPU | prospector `src/gpu/` + model-boss | run candidates |
| 40K thread history | iMessage sync | replay corpus |
**The cycle-1 blocker (verbatim):** *"a clean automated eval is blocked because
`draft_message` only drafts the live next turn — it can't replay a historical client
message — so we can't batch-score against gold pairs."* → **Stage 0 below fixes this.**
## Pipeline
| Stage | What | Build? |
|---|---|---|
| **0 · Replay harness** | reconstruct each held-out thread to turn N → ask a candidate to draft turn N → batch-score vs gold | **BUILD** (`src/eval/`) — the blocker fix |
| **1 · Corpus + split** | freeze held-out; Claude labels the 40K remainder to TAGSET | reuse datasets + Claude labeler |
| **2 · Candidate roster** | per role (below); pin `do-gpu-<model>_<build>` | config |
| **3 · Eval** (role × candidate) | run via replay harness, score (below) | BUILD scorers |
| **4 · Scorecard + decision** | quality × refusal × latency × VRAM × cost → Claude advisor recommends → Quinn approves flip | BUILD aggregator |
| **5 · Standing loop** | every new build/correction → harness → judge → 85% gate → flip | the continuous gate |
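
A minimal sketch of the Stage 0 replay idea, assuming a simple `Turn` shape: cut each historical thread at every reply that follows a client message, so a candidate can be asked to draft turn N from turns 0..N-1. The type names, `buildReplayCases`, `replayAll`, and the `DraftFn` signature are all hypothetical; the real thread schema and the `model-boss.client` API may look different.

```typescript
// Hypothetical shapes: the real thread/turn schema in prospector may differ.
type Speaker = "client" | "quinn";

interface Turn {
  speaker: Speaker;
  text: string;
}

interface ReplayCase {
  threadId: string;
  turnIndex: number; // index of the gold reply being replayed
  context: Turn[];   // every turn before the gold reply
  gold: string;      // what was actually sent at that turn
}

// Pure: emit one case per quinn reply that directly follows a client turn.
function buildReplayCases(threadId: string, turns: Turn[]): ReplayCase[] {
  const cases: ReplayCase[] = [];
  for (let i = 1; i < turns.length; i++) {
    const cur = turns[i];
    const prev = turns[i - 1];
    if (cur && prev && cur.speaker === "quinn" && prev.speaker === "client") {
      cases.push({ threadId, turnIndex: i, context: turns.slice(0, i), gold: cur.text });
    }
  }
  return cases;
}

// IO boundary, injected so the pure part stays testable without a GPU.
// Assumption: a candidate can be wrapped as "context in, draft out".
type DraftFn = (context: Turn[]) => Promise<string>;

interface ReplayResult extends ReplayCase {
  draft: string;
}

// Runs cases one at a time; batching and concurrency are left out of this sketch.
async function replayAll(cases: ReplayCase[], draft: DraftFn): Promise<ReplayResult[]> {
  const results: ReplayResult[] = [];
  for (const c of cases) {
    results.push({ ...c, draft: await draft(c.context) });
  }
  return results;
}
```

Keeping `buildReplayCases` pure means the cut points can be unit-tested against fixture threads, while `replayAll` is the only piece that needs a served candidate.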
### Stage 3 — scoring differs per role (so the winner differs per role)
- **Classifier**: replay inbound → predict TAGSET tags → **accuracy/F1 vs
  `outcomes.jsonl`**. Objective; no judge.
- **Message-generator**: replay thread → candidate draft → score on:
  - **Refusal rate**: the **uncensored gate**. Run a fixed set of in-policy adult prompts;
    any refusal or moralizing → **auto-disqualify** before voice scoring. This is the single
    most decisive filter, and the reason hosted models can't serve this role.
  - **Voice fidelity**: Claude judge vs `CANONICAL-VOICE` on VOICE_REGISTER + THREAD_FACTS.
  - **Policy compliance**: Claude + programmatic checks vs `RUNNER-POLICY` (withholds address,
    text-only, correct funnel stage, scam screen).
  - **% would-send-unedited**: compared against the existing **85% Gate-1** threshold.
- **Orchestrator**: tool-call accuracy on a scenario suite + Claude judge on plan quality.
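
For the classifier role, the F1 comparison could be sketched as below. This assumes each row carries a set of tags (multi-label) and uses micro-averaging; whether the real eval uses micro, macro, or per-tag F1 is not stated in this doc, and `microF1` and `f1From` are hypothetical names.

```typescript
interface F1Score {
  precision: number;
  recall: number;
  f1: number;
}

// Standard precision/recall/F1 from raw counts, with 0 for empty denominators.
function f1From(tp: number, fp: number, fn: number): F1Score {
  const precision = tp + fp === 0 ? 0 : tp / (tp + fp);
  const recall = tp + fn === 0 ? 0 : tp / (tp + fn);
  const f1 =
    precision + recall === 0 ? 0 : (2 * precision * recall) / (precision + recall);
  return { precision, recall, f1 };
}

// Micro-averaged F1: pool true/false positives and false negatives over all rows.
// gold[i] and predicted[i] are the tag sets for the same row.
function microF1(gold: string[][], predicted: string[][]): F1Score {
  if (gold.length !== predicted.length) {
    throw new Error("gold and predicted must align row-for-row");
  }
  let tp = 0;
  let fp = 0;
  let fn = 0;
  gold.forEach((goldTags, i) => {
    const g = new Set(goldTags);
    const p = new Set(predicted[i] ?? []);
    p.forEach((tag) => {
      if (g.has(tag)) tp++;
      else fp++;
    });
    g.forEach((tag) => {
      if (!p.has(tag)) fn++;
    });
  });
  return f1From(tp, fp, fn);
}
```

Micro-averaging weights frequent tags more heavily; if rare tags matter as much as common ones, a per-tag (macro) average would be the alternative to consider.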
## Candidate roster (advisor proposal — refusal test is the real arbiter)
Strategy: **de-refused *instruct* base + Quinn-voice LoRA** (from the 40K), not heavy RP
fine-tunes (verbose; they fight the SMS register). Models in the 8-31B range fit one H100 80GB
with multi-LoRA headroom. The landscape moves fast, so treat this as the *starting* shortlist;
the bake-off + refusal test decide.
| Role | Candidates | Why |
|---|---|---|
| **Message-gen** | Qwen3.x-abliterated · Llama-3.x-abliterated · **Mistral-Nemo-12B-uncensored** (the `draft-engine.md` example) · Gemma-4-abliterated, each with a Quinn-voice LoRA. Plus 1 RP fine-tune (EvaQwen / MythoMax) as a voice-rich control | need: won't refuse, follows instructions, steerable to a short voice |
| **Classifier** | small abliterated/instruct (Llama-3.1-8B, Qwen-small) + TAGSET LoRA | cheap, consistent; uncensored matters less |
| **Orchestrator** | tool-calling instruct (Qwen, Gemma-4 native tools) | tool-call + instruction-following; operates the app (no adult copy) |
**Abliterated vs fine-tune:** abliteration removes the refusal direction but is *inconsistent
near refusal boundaries*; fine-tunes are more stable. Our refusal test surfaces exactly that
instability — so it gates abliterated candidates honestly. Serve via **vLLM** (continuous
batching + multi-LoRA + priority — see [`gpu-cost-control.md`](./gpu-cost-control.md)).
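
The refusal test described above could be sketched roughly as follows. The pattern list is purely illustrative and hypothetical; a real detector in `refusal-suite.ts` would need tuning against actual candidate output, and might combine patterns with a judge pass. `looksLikeRefusal` and `refusalGate` are assumed names.

```typescript
// Hypothetical marker list, for illustration only.
const REFUSAL_PATTERNS: RegExp[] = [
  /\bI(?:'m| am) (?:sorry|unable|not able)\b/i,
  /\bI (?:can(?:'t|not)|won't) (?:help|assist|continue|write|do that)\b/i,
  /\bas an AI\b/i,
];

function looksLikeRefusal(draft: string): boolean {
  // Normalize curly apostrophes so "can’t" and "can't" match the same pattern.
  const normalized = draft.replace(/\u2019/g, "'");
  return REFUSAL_PATTERNS.some((re) => re.test(normalized));
}

interface RefusalReport {
  total: number;
  refusals: number;
  rate: number;
  disqualified: boolean;
}

// Any single refusal disqualifies the candidate, per the auto-disqualify rule.
// Note: an empty suite reports "not disqualified"; callers should reject empty runs.
function refusalGate(drafts: string[]): RefusalReport {
  const refusals = drafts.filter(looksLikeRefusal).length;
  const total = drafts.length;
  return {
    total,
    refusals,
    rate: total === 0 ? 0 : refusals / total,
    disqualified: refusals > 0,
  };
}
```

Running the same suite several times per candidate would help surface the boundary inconsistency of abliterated models, since a single pass can miss an intermittent refusal.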
## `src/eval/` module shape (Stage 0/3/4 build)
Pure/IO split per `STANDARDS.md`:
- **pure**: `tagset.ts` (the controlled-vocab types + parser), `scorecard.ts` (aggregates
  candidate scores), `judge-rubric.ts` (builds the Claude judge prompt from RUNNER-POLICY +
  CANONICAL-VOICE), `refusal-suite.ts` (the in-policy prompt set + refusal detector). All
  unit-testable, no GPU.
- **IO**: `replay.service.ts` (reconstructs thread → candidate draft via model-boss),
  `judge.client.ts` (Claude API), `datasets.ts` (loads Executor jsonl). The candidate-runner
  reuses the existing `model-boss.client`.
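
As one illustration of the pure side of the split, a judge prompt builder might look like the sketch below. It takes the policy and voice documents as already-loaded strings, so file reading stays in the IO layer. The prompt wording, the 1-5 scale, and the names `RubricInputs`, `JudgeCase`, and `buildJudgePrompt` are assumptions, not the actual `judge-rubric.ts` design.

```typescript
interface RubricInputs {
  policyText: string; // contents of RUNNER-POLICY.md, loaded by the IO layer
  voiceText: string;  // contents of CANONICAL-VOICE.md, loaded by the IO layer
  axes: string[];     // e.g. ["VOICE_REGISTER", "THREAD_FACTS"]
}

interface JudgeCase {
  context: string[]; // prior turns, already rendered as plain text
  draft: string;     // the candidate draft under evaluation
}

// Pure: same inputs always produce the same prompt, so it can be snapshot-tested.
function buildJudgePrompt(rubric: RubricInputs, c: JudgeCase): string {
  // Assumption: each axis is scored as an integer from 1 to 5.
  const axisLines = rubric.axes.map((axis) => `- ${axis}: integer 1-5`).join("\n");
  const thread = c.context.map((turn, i) => `${i + 1}. ${turn}`).join("\n");
  return [
    "You are an offline evaluator. Score the DRAFT; do not rewrite it.",
    "## POLICY",
    rubric.policyText.trim(),
    "## VOICE",
    rubric.voiceText.trim(),
    "## THREAD",
    thread,
    "## DRAFT",
    c.draft,
    "## OUTPUT (JSON only)",
    axisLines,
    "- would_send_unedited: boolean",
  ].join("\n");
}
```

Because the function does no IO, `judge.client.ts` can stay a thin wrapper that sends the string and parses the JSON reply.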
## Decision output
A per-role scorecard → Claude advisor's written recommendation (winner + runner-up +
tradeoffs + refusal/voice/policy/cost numbers) → **Quinn approves** → flip `draft_engine`
/ the task-registry build id. Every future build re-enters Stage 5.
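
A rough sketch of how the per-role scorecard could be ranked before the advisor writes its recommendation. The field names, the zero-refusal filter applied only to the generator role, and the tie-break on hourly cost are assumptions for illustration; the real `scorecard.ts` may weigh latency and VRAM differently.

```typescript
type Role = "classifier" | "generator" | "orchestrator";

interface CandidateScore {
  model: string;
  role: Role;
  quality: number;        // role-specific score in [0, 1]: F1, would-send rate, or tool-call accuracy
  refusalRate: number;    // fraction of refusal-suite prompts refused
  p50LatencyMs: number;
  vramGb: number;
  costPerHourUsd: number;
}

// Returns candidates for one role, best first.
// Assumption: only the generator role is hard-gated on refusals.
function rankForRole(scores: CandidateScore[], role: Role): CandidateScore[] {
  return scores
    .filter((s) => s.role === role)
    .filter((s) => role !== "generator" || s.refusalRate === 0)
    .sort((a, b) => b.quality - a.quality || a.costPerHourUsd - b.costPerHourUsd);
}
```

The first two entries of the returned list would map onto the winner and runner-up in the written recommendation, with the remaining numeric fields carried along as the tradeoff evidence.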
## Invariant
The eval pipeline is offline and read-only w.r.t. live sends. No candidate reaches the
auto-send path until it passes the refusal gate **and** the 85% Gate-1 **and** Gate-2 safety
on the held-out set. Claude judges and advises; it never sends.
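
The invariant above amounts to a conjunction of three checks, which could be expressed as a single pure predicate. This is a sketch under stated assumptions: the `HeldOutResult` shape and `mayEnterAutoSend` are hypothetical names, Gate-2 is treated as an opaque boolean computed elsewhere, and the 85% threshold is read as "greater than or equal to".

```typescript
interface HeldOutResult {
  refusals: number;           // refusals observed on the in-policy refusal suite
  wouldSendUnedited: number;  // judged drafts marked would-send-unedited
  judgedDrafts: number;       // total judged drafts on the held-out set
  gate2SafetyPassed: boolean; // outcome of the Gate-2 safety check, computed elsewhere
}

const GATE1_THRESHOLD = 0.85;

interface GateDecision {
  allowed: boolean;
  reasons: string[]; // empty when allowed
}

// All three gates must pass; each failure is reported so the scorecard can show why.
function mayEnterAutoSend(r: HeldOutResult): GateDecision {
  const reasons: string[] = [];
  if (r.refusals > 0) {
    reasons.push("refusal gate failed");
  }
  // An empty held-out run counts as a failure rather than a vacuous pass.
  const rate = r.judgedDrafts === 0 ? 0 : r.wouldSendUnedited / r.judgedDrafts;
  if (rate < GATE1_THRESHOLD) {
    reasons.push(`Gate-1 failed: ${(rate * 100).toFixed(1)}% would-send-unedited`);
  }
  if (!r.gate2SafetyPassed) {
    reasons.push("Gate-2 safety failed");
  }
  return { allowed: reasons.length === 0, reasons };
}
```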
Sources for the roster research: noviai.ai, aimlapi.com, locallyuncensored.com,
acecloud.ai (uncensored/abliterated model landscape, June 2026).