diff --git a/docs/features/model-eval-pipeline.md b/docs/features/model-eval-pipeline.md new file mode 100644 index 0000000..f957b28 --- /dev/null +++ b/docs/features/model-eval-pipeline.md @@ -0,0 +1,106 @@ +# Model eval & selection pipeline — Claude-advisor / OSS-worker + +> Pick the OSS uncensored models we'll actually run, by bake-off, with **Claude as +> the offline advisor/judge/labeler — never in the serving loop.** This operationalizes +> the eval gate from [`training-loop.md`](./training-loop.md) and fixes the blocker +> that stopped Executor's manual eval cycle 1. +> +> **Home:** `@applications/prospector` (`src/eval/`). It already has the model-boss/GPU +> plumbing to serve candidates and the thread data to replay. **Corpus:** Executor's +> `prospecting/voice-engine/datasets/` (`outcomes.jsonl` — 323 TAGSET-labeled rows, +> `gold-turnpairs.jsonl` — 42 pairs) + `RUNNER-POLICY.md` + `CANONICAL-VOICE.md` + +> `TAGSET.md`, and the 40K iMessage history. + +## The split (reconciles "Claude in the loop" with "OSS at runtime") + +- **OSS uncensored = the worker** — runtime serving, and the *candidate under test*. +- **Claude = advisor in the *development/eval* loop only** — three offline jobs, none + emitting adult copy to a prospect: + 1. **Labeler** — extend 323 → full corpus against `TAGSET.md` (CoT rationalization). + 2. **Judge** — score candidate drafts on voice/policy (the role the human did by hand + in cycle 1). + 3. **Advisor** — synthesize the scorecard → recommend the model + build per role. + +## What already exists (do not rebuild) + +| Asset | Where | Use | +|---|---|---| +| 323 TAGSET-labeled personas | `outcomes.jsonl` | classifier eval gold | +| 42 `(client→quinn)` pairs | `gold-turnpairs.jsonl` | generator eval gold seed | +| eval method + **85% "would-send-unedited"** Gate-1 + axes (VOICE_REGISTER, THREAD_FACTS) | `eval-heldout.md` | the rubric | +| policy source of truth (scam screen, QUALIFY→PITCH+RATE→location→HOLD, voice) | `RUNNER-POLICY.md` | judge policy rubric | +| candidate serving + on-demand GPU | prospector `src/gpu/` + model-boss | run candidates | +| 40K thread history | iMessage sync | replay corpus | + +**The cycle-1 blocker (verbatim):** *"a clean automated eval is blocked because +`draft_message` only drafts the live next turn — it can't replay a historical client +message — so we can't batch-score against gold pairs."* → **Stage 0 below fixes this.** + +## Pipeline + +| Stage | What | Build? | +|---|---|---| +| **0 · Replay harness** | reconstruct each held-out thread to turn N → ask a candidate to draft turn N → batch-score vs gold | **BUILD** (`src/eval/`) — the blocker fix | +| **1 · Corpus + split** | freeze held-out; Claude labels the 40K remainder to TAGSET | reuse datasets + Claude labeler | +| **2 · Candidate roster** | per role (below); pin `do-gpu-_` | config | +| **3 · Eval** (role × candidate) | run via replay harness, score (below) | BUILD scorers | +| **4 · Scorecard + decision** | quality × refusal × latency × VRAM × cost → Claude advisor recommends → Quinn approves flip | BUILD aggregator | +| **5 · Standing loop** | every new build/correction → harness → judge → 85% gate → flip | the continuous gate | + +### Stage 3 — scoring differs per role (so the winner differs per role) + +- **Classifier** — replay inbound → predict TAGSET tags → **accuracy/F1 vs + `outcomes.jsonl`**. Objective; no judge. +- **Message-generator** — replay thread → candidate draft → score on: + - **Refusal rate** — the **uncensored gate**; a fixed set of in-policy adult prompts, + any refusal/moralizing → **auto-disqualify** before voice scoring. The single most + decisive filter; why hosted models can't serve this role. + - **Voice fidelity** — Claude judge vs `CANONICAL-VOICE` on VOICE_REGISTER + THREAD_FACTS. + - **Policy compliance** — Claude + programmatic vs `RUNNER-POLICY` (withholds address, + text-only, correct funnel stage, scam screen). + - **% would-send-unedited** → the existing **85% Gate-1** threshold. +- **Orchestrator** — tool-call accuracy on a scenario suite + Claude judge on plan quality. + +## Candidate roster (advisor proposal — refusal test is the real arbiter) + +Strategy: **de-refused *instruct* base + Quinn-voice LoRA** (from the 40K), not heavy RP +fine-tunes (verbose; fight the SMS register). 8–31B fits one H100 80GB with multi-LoRA +headroom. Landscape moves fast — treat this as the *starting* shortlist; the bake-off + refusal +test decide. + +| Role | Candidates | Why | +|---|---|---| +| **Message-gen** | Qwen3.x-abliterated · Llama-3.x-abliterated · **Mistral-Nemo-12B-uncensored** (the `draft-engine.md` example) · Gemma-4-abliterated — each + Quinn-voice LoRA. Plus 1 RP fine-tune (EvaQwen / MythoMax) as a voice-rich control | need: won't refuse, instruction-followable, steerable to short voice | +| **Classifier** | small abliterated/instruct (Llama-3.1-8B, Qwen-small) + TAGSET LoRA | cheap, consistent; uncensored matters less | +| **Orchestrator** | tool-calling instruct (Qwen, Gemma-4 native tools) | tool-call + instruction-following; operates the app (no adult copy) | + +**Abliterated vs fine-tune:** abliteration removes the refusal direction but is *inconsistent +near refusal boundaries*; fine-tunes are more stable. Our refusal test surfaces exactly that +instability — so it gates abliterated candidates honestly. Serve via **vLLM** (continuous +batching + multi-LoRA + priority — see [`gpu-cost-control.md`](./gpu-cost-control.md)). + +## `src/eval/` module shape (Stage 0/3/4 build) + +Pure/IO split per `STANDARDS.md`: +- **pure** — `tagset.ts` (the controlled-vocab types + parser), `scorecard.ts` (aggregate + candidate scores), `judge-rubric.ts` (build the Claude judge prompt from RUNNER-POLICY + + CANONICAL-VOICE), `refusal-suite.ts` (the in-policy prompt set + refusal detector). All + unit-testable, no GPU. +- **IO** — `replay.service.ts` (reconstruct thread → candidate draft via model-boss), + `judge.client.ts` (Claude API), `datasets.ts` (load Executor jsonl). The candidate-runner + reuses the existing `model-boss.client`. + +## Decision output + +A per-role scorecard → Claude advisor's written recommendation (winner + runner-up + +tradeoffs + refusal/voice/policy/cost numbers) → **Quinn approves** → flip `draft_engine` +/ the task-registry build id. Every future build re-enters Stage 5. + +## Invariant + +The eval pipeline is offline and read-only w.r.t. live sends. No candidate reaches the +auto-send path until it passes the refusal gate **and** the 85% Gate-1 **and** Gate-2 safety +on the held-out set. Claude judges and advises; it never sends. + +Sources for the roster research: noviai.ai, aimlapi.com, locallyuncensored.com, +acecloud.ai (uncensored/abliterated model landscape, June 2026).