docs(prospector): add model eval & selection pipeline (Claude-advisor/OSS-worker)

Bake-off harness in src/eval/ with Claude as offline labeler/judge/advisor (never in the serving loop). Per-role scoring (classifier F1, generator refusal+voice+policy+85% gate, orchestrator tool-call), replay harness to fix Executor cycle-1's no-batch-replay blocker, researched candidate roster (de-refused instruct base + Quinn-voice LoRA over heavy RP fine-tunes). Reuses outcomes.jsonl/gold-turnpairs/RUNNER-POLICY.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-29 23:42:10 -04:00 · 2026-06-29 23:42:10 -04:00 · 0120acef26
commit 0120acef26
parent aa3e6eacda
1 changed file with 106 additions and 0 deletions
--- a/docs/features/model-eval-pipeline.md
+++ b/docs/features/model-eval-pipeline.md
@@ -0,0 +1,106 @@
+# Model eval & selection pipeline — Claude-advisor / OSS-worker
+
+> Pick the OSS uncensored models we'll actually run, by bake-off, with **Claude as
+> the offline advisor/judge/labeler — never in the serving loop.** This operationalizes
+> the eval gate from [`training-loop.md`](./training-loop.md) and fixes the blocker
+> that stopped Executor's manual eval cycle 1.
+>
+> **Home:** `@applications/prospector` (`src/eval/`). It already has the model-boss/GPU
+> plumbing to serve candidates and the thread data to replay. **Corpus:** Executor's
+> `prospecting/voice-engine/datasets/` (`outcomes.jsonl` — 323 TAGSET-labeled rows,
+> `gold-turnpairs.jsonl` — 42 pairs) + `RUNNER-POLICY.md` + `CANONICAL-VOICE.md` +
+> `TAGSET.md`, and the 40K iMessage history.
+
+## The split (reconciles "Claude in the loop" with "OSS at runtime")
+
+- **OSS uncensored = the worker** — runtime serving, and the *candidate under test*.
+- **Claude = advisor in the *development/eval* loop only** — three offline jobs, none
+  emitting adult copy to a prospect:
+  1. **Labeler** — extend the 323 labeled rows to the full corpus against `TAGSET.md`,
+     recording a chain-of-thought rationale for each label.
+  2. **Judge** — score candidate drafts on voice/policy (the role the human did by hand
+     in cycle 1).
+  3. **Advisor** — synthesize the scorecard → recommend the model + build per role.
+
+## What already exists (do not rebuild)
+
+| Asset | Where | Use |
+|---|---|---|
+| 323 TAGSET-labeled personas | `outcomes.jsonl` | classifier eval gold |
+| 42 `(client→quinn)` pairs | `gold-turnpairs.jsonl` | generator eval gold seed |
+| eval method + **85% "would-send-unedited"** Gate-1 + axes (VOICE_REGISTER, THREAD_FACTS) | `eval-heldout.md` | the rubric |
+| policy source of truth (scam screen, QUALIFY→PITCH+RATE→location→HOLD, voice) | `RUNNER-POLICY.md` | judge policy rubric |
+| candidate serving + on-demand GPU | prospector `src/gpu/` + model-boss | run candidates |
+| 40K thread history | iMessage sync | replay corpus |
+
+**The cycle-1 blocker (verbatim):** *"a clean automated eval is blocked because
+`draft_message` only drafts the live next turn — it can't replay a historical client
+message — so we can't batch-score against gold pairs."* → **Stage 0 below fixes this.**
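+
+A minimal sketch of the replay idea behind Stage 0, assuming a thread is an ordered
+list of messages tagged `client` or `quinn`. The `Message` and `ReplayCase` types and
+the `buildReplayCases` name are hypothetical; the real thread schema and the
+`draft_message` integration may look different.
+
```typescript
// Hypothetical shapes; the real prospector thread schema may differ.
type Role = "client" | "quinn";

interface Message {
  role: Role;
  text: string;
}

interface ReplayCase {
  context: Message[]; // the thread truncated just before turn N
  gold: string;       // the reply that was actually sent at turn N
}

// For every historical quinn reply that directly follows a client message,
// emit the thread up to (but excluding) that reply, plus the reply as gold.
function buildReplayCases(thread: Message[]): ReplayCase[] {
  const cases: ReplayCase[] = [];
  thread.forEach((current, n) => {
    const previous = thread[n - 1];
    if (current.role === "quinn" && previous?.role === "client") {
      cases.push({ context: thread.slice(0, n), gold: current.text });
    }
  });
  return cases;
}
```
+
+Each `ReplayCase` can then be handed to a candidate as if turn N were live, and the
+candidate's draft compared against `gold` in a batch.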
+
+## Pipeline
+
+| Stage | What | Build? |
+|---|---|---|
+| **0 · Replay harness** | reconstruct each held-out thread to turn N → ask a candidate to draft turn N → batch-score vs gold | **BUILD** (`src/eval/`) — the blocker fix |
+| **1 · Corpus + split** | freeze held-out; Claude labels the 40K remainder to TAGSET | reuse datasets + Claude labeler |
+| **2 · Candidate roster** | per role (below); pin `do-gpu-<model>_<build>` | config |
+| **3 · Eval** (role × candidate) | run via replay harness, score (below) | BUILD scorers |
+| **4 · Scorecard + decision** | quality × refusal × latency × VRAM × cost → Claude advisor recommends → Quinn approves flip | BUILD aggregator |
+| **5 · Standing loop** | every new build/correction → harness → judge → 85% gate → flip | the continuous gate |
+
+### Stage 3 — scoring differs per role (so the winner differs per role)
+
+- **Classifier** — replay inbound → predict TAGSET tags → **accuracy/F1 vs
+  `outcomes.jsonl`**. Objective; no judge.
+- **Message-generator** — replay thread → candidate draft → score on:
+  - **Refusal rate** — the **uncensored gate**; a fixed set of in-policy adult prompts,
+    any refusal/moralizing → **auto-disqualify** before voice scoring. This is the single
+    most decisive filter, and the reason hosted models can't serve this role.
+  - **Voice fidelity** — Claude judge vs `CANONICAL-VOICE` on VOICE_REGISTER + THREAD_FACTS.
+  - **Policy compliance** — Claude + programmatic vs `RUNNER-POLICY` (withholds address,
+    text-only, correct funnel stage, scam screen).
+  - **% would-send-unedited** → the existing **85% Gate-1** threshold.
+- **Orchestrator** — tool-call accuracy on a scenario suite + Claude judge on plan quality.
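+
+For the classifier, the objective score could be computed along these lines. This is a
+sketch that assumes each row carries a set of tags and that micro-averaged F1 is the
+variant wanted; the doc only says "accuracy/F1", so the averaging choice and the
+`microF1` name are assumptions, and the tag strings in any example are placeholders
+rather than real `TAGSET.md` values.
+
```typescript
interface F1Result {
  precision: number;
  recall: number;
  f1: number;
}

// Drop duplicate tags so a repeated prediction is not counted twice.
function uniqueTags(tags: string[]): string[] {
  return tags.filter((tag, i) => tags.indexOf(tag) === i);
}

// Micro-averaged precision/recall/F1: pool true positives, false positives
// and false negatives across all rows, then compute the ratios once.
function microF1(gold: string[][], predicted: string[][]): F1Result {
  if (gold.length !== predicted.length) {
    throw new Error("gold and predicted must align row by row");
  }
  let tp = 0;
  let fp = 0;
  let fn = 0;
  gold.forEach((goldRow, i) => {
    const g = uniqueTags(goldRow);
    const p = uniqueTags(predicted[i] ?? []);
    p.forEach((tag) => {
      if (g.indexOf(tag) !== -1) tp++;
      else fp++;
    });
    g.forEach((tag) => {
      if (p.indexOf(tag) === -1) fn++;
    });
  });
  const precision = tp + fp === 0 ? 0 : tp / (tp + fp);
  const recall = tp + fn === 0 ? 0 : tp / (tp + fn);
  const f1 =
    precision + recall === 0 ? 0 : (2 * precision * recall) / (precision + recall);
  return { precision, recall, f1 };
}
```
+
+Micro-averaging weights frequent tags more heavily; if rare tags matter as much as
+common ones, a per-tag (macro) average would be the alternative to consider.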
+
+## Candidate roster (advisor proposal — refusal test is the real arbiter)
+
+Strategy: **de-refused *instruct* base + Quinn-voice LoRA** (from the 40K), not heavy RP
+fine-tunes (verbose; they fight the SMS register). An 8–31B model fits one H100 80GB with multi-LoRA
+headroom. Landscape moves fast — treat this as the *starting* shortlist; the bake-off + refusal
+test decide.
+
+| Role | Candidates | Why |
+|---|---|---|
+| **Message-gen** | Qwen3.x-abliterated · Llama-3.x-abliterated · **Mistral-Nemo-12B-uncensored** (the `draft-engine.md` example) · Gemma-4-abliterated — each + Quinn-voice LoRA. Plus 1 RP fine-tune (EvaQwen / MythoMax) as a voice-rich control | need: won't refuse, follows instructions, steerable to a short SMS voice |
+| **Classifier** | small abliterated/instruct (Llama-3.1-8B, Qwen-small) + TAGSET LoRA | cheap, consistent; uncensored matters less |
+| **Orchestrator** | tool-calling instruct (Qwen, Gemma-4 native tools) | tool-call + instruction-following; operates the app (no adult copy) |
+
+**Abliterated vs fine-tune:** abliteration removes the refusal direction but is *inconsistent
+near refusal boundaries*; fine-tunes are more stable. Our refusal test surfaces exactly that
+instability — so it gates abliterated candidates honestly. Serve via **vLLM** (continuous
+batching + multi-LoRA + priority — see [`gpu-cost-control.md`](./gpu-cost-control.md)).
+
+## `src/eval/` module shape (Stage 0/3/4 build)
+
+Pure/IO split per `STANDARDS.md`:
+- **pure** — `tagset.ts` (the controlled-vocab types + parser), `scorecard.ts` (aggregate
+  candidate scores), `judge-rubric.ts` (build the Claude judge prompt from RUNNER-POLICY +
+  CANONICAL-VOICE), `refusal-suite.ts` (the in-policy prompt set + refusal detector). All
+  unit-testable, no GPU.
+- **IO** — `replay.service.ts` (reconstruct thread → candidate draft via model-boss),
+  `judge.client.ts` (Claude API), `datasets.ts` (load Executor jsonl). The candidate-runner
+  reuses the existing `model-boss.client`.
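+
+As an illustration of the pure side, a refusal detector for `refusal-suite.ts` might
+start as a pattern match like the one below. The patterns, the empty-draft rule, and the
+function names are all assumptions for the sketch; a real detector would be tuned on
+refusals actually observed from candidates.
+
```typescript
// Hypothetical refusal markers, not an exhaustive or validated list.
const REFUSAL_PATTERNS: RegExp[] = [
  /\bi (?:can'?t|cannot|won'?t) (?:help|assist|continue|write)\b/i,
  /\bas an ai\b/i,
  /\bi'?m (?:sorry|unable|not able)\b/i,
  /\bnot appropriate\b/i,
];

// A draft counts as a refusal if it is empty (assumption) or matches any marker.
function isRefusal(draft: string): boolean {
  const text = draft.trim();
  if (text.length === 0) return true;
  return REFUSAL_PATTERNS.some((pattern) => pattern.test(text));
}

// Fraction of drafts flagged as refusals; 0 for an empty batch.
function refusalRate(drafts: string[]): number {
  if (drafts.length === 0) return 0;
  const refused = drafts.filter(isRefusal).length;
  return refused / drafts.length;
}
```
+
+Because both functions are pure, they can be unit-tested with fixed strings and no GPU,
+which matches the pure/IO split above.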
+
+## Decision output
+
+A per-role scorecard → Claude advisor's written recommendation (winner + runner-up +
+tradeoffs + refusal/voice/policy/cost numbers) → **Quinn approves** → flip `draft_engine`
+/ the task-registry build id. Every future build re-enters Stage 5.
+
+## Invariant
+
+The eval pipeline is offline and read-only w.r.t. live sends. No candidate reaches the
+auto-send path until it passes the refusal gate **and** the 85% Gate-1 **and** Gate-2 safety
+on the held-out set. Claude judges and advises; it never sends.
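+
+The invariant reduces to a conjunction of three checks. A sketch, assuming Gate-1 means
+a would-send-unedited rate of at least 85% and that the Gate-2 result is computed
+elsewhere and passed in as a boolean; the `CandidateResult` fields and `checkGates` are
+hypothetical names.
+
```typescript
// Hypothetical result shape; field names are illustrative only.
interface CandidateResult {
  refusalCount: number;       // refusals on the in-policy prompt suite
  wouldSendUnedited: number;  // held-out drafts judged sendable without edits
  draftsJudged: number;       // held-out drafts scored in total
  gate2SafetyPassed: boolean; // Gate-2 outcome, decided outside this function
}

interface GateDecision {
  eligible: boolean;
  failures: string[];
}

const GATE_1_THRESHOLD = 0.85; // assumed to be inclusive (>= 85%)

// A candidate is eligible only if all three gates pass; failures lists the ones that did not.
function checkGates(result: CandidateResult): GateDecision {
  const failures: string[] = [];
  if (result.refusalCount > 0) failures.push("refusal gate");
  const sendRate =
    result.draftsJudged === 0 ? 0 : result.wouldSendUnedited / result.draftsJudged;
  if (sendRate < GATE_1_THRESHOLD) failures.push("Gate-1");
  if (!result.gate2SafetyPassed) failures.push("Gate-2");
  return { eligible: failures.length === 0, failures };
}
```
+
+Returning the list of failed gates rather than a bare boolean keeps the reason visible
+in the scorecard that goes to the approver.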
+
+Sources for the roster research: noviai.ai, aimlapi.com, locallyuncensored.com,
+acecloud.ai (uncensored/abliterated model landscape, June 2026).