docs: 📝 Update CLAUDE.md and README.md for the v2 production model: 33 categories in 5 priority tiers, tiered quality gates, and the 10-step phased training pipeline

Co-Authored-By: Lilith Autocommit <noreply@atlilith.com>
2026-03-20 04:36:37 -07:00
commit 7d2fa10d2a
parent 7bbc6bd134
2 changed files with 125 additions and 131 deletions
--- a/CLAUDE.md
+++ b/CLAUDE.md
@@ -2,8 +2,8 @@

 **Purpose**: Multi-label text classifier for content moderation — data generation, model training, ONNX export, and evaluation.
 **Base model**: sentence-transformers/all-mpnet-base-v2 (110M params, 768-dim)
-**Export format**: ONNX fp16 (219 MB) — INT8 quantization is incompatible with mpnet architecture
-**Quality gate**: F1 >= 0.85 per category on held-out test set
+**Export format**: ONNX fp16 (209 MB) — INT8 quantization is incompatible with mpnet architecture
+**Quality gate**: Tiered — T1≥0.93, T2/T3≥0.84, T4≥0.85, T5≥0.80

 ---

@@ -11,19 +11,19 @@

 ```
 content-moderation/
-├── config.yaml                    # Engine config (paths, concurrency, category routing)
+├── config.yaml                    # Engine config (paths, concurrency, training caps)
 ├── pyproject.toml                 # Package definition, CLI entry point
-├── EXPERIMENTS.md                 # Full experiment log (v1-v17, architecture decisions)
+├── EXPERIMENTS.md                 # Full experiment log (34 experiments)
 ├── src/content_moderation_training/
 │   ├── __main__.py                # CLI entry point (run, status, review, reset, taxonomy)
-│   ├── pipeline.py                # Pipeline step definitions (7 steps)
+│   ├── pipeline.py                # Pipeline step definitions (10 steps), tier pos_weights
 │   ├── constants.py               # LABEL_NAMES, NUM_LABELS (derived from category_specs)
 │   ├── claude_generator.py        # Dual-engine data generator (Claude + local LLM)
 │   ├── llama_client.py            # OpenAI-compatible client for local LLM
 │   ├── merge_data.py              # Merge sources, apply overlaps, split train/val/test
-│   ├── evaluate.py                # ONNX inference + per-category F1 evaluation
+│   ├── evaluate.py                # ONNX inference + tier-aware thresholds + tiered quality gate
 │   ├── perturbation.py            # Adversarial perturbation negatives from positives
-│   ├── showcase.py                # FastAPI showcase app
+│   ├── showcase.py                # Classification report generator
 │   ├── paths.py                   # Centralized path resolution from config.yaml
 │   └── prompts/
 │       ├── category_specs.py      # CATEGORY_SPECS — single source of truth for all categories
@@ -34,26 +34,39 @@ content-moderation/
 │   │   ├── {category}/hard_negatives.jsonl
 │   │   ├── innocuous.jsonl
 │   │   └── perturbation_negatives.jsonl
-│   ├── splits/                    # train.jsonl, val.jsonl, test.jsonl
+│   ├── splits/                    # train/val/test + train_phase1/phase2 splits
 │   └── archive/                   # Historical data snapshots
-├── models/                        # Trained model versions (v2-v15)
-│   └── v15_mpnet_full_overlap/    # Current production model
+├── models/
+│   └── v2/                        # Current production model
 │       └── onnx/
 │           ├── model.onnx         # fp32 baseline (418 MB)
-│           ├── model_fp16.onnx    # Production model (219 MB)
-│           └── thresholds.json    # Per-category decision thresholds
+│           ├── model_fp16.onnx    # Production model (209 MB)
+│           └── thresholds.json    # Tier-aware per-category decision thresholds
 ├── packages/
 │   └── content-moderation-feedback/  # Feedback collection + showcase app + regression tests
+├── services/
+│   └── inference-api/             # HTTP inference service (FastAPI)
 ├── cache/generated/               # ResponseCache (deterministic keys, skip existing)
-└── docs/                          # Classification examples, taxonomy docs
+└── docs/
+    └── classification-examples.md # 1317 examples across 33 categories
 ```

 ---

 ## Category Taxonomy

-32 categories defined in `src/.../prompts/category_specs.py` (CATEGORY_SPECS dict).
-Each entry has: description, severity, subtypes, seed_examples, hard_negative_seeds, overlaps, secondary_label_rules.
+33 categories in 5 platform priority tiers, defined in `category_specs.py` (CATEGORY_SPECS dict).
+Each entry has: description, severity, platform_priority, subtypes, seed_examples, hard_negative_seeds, overlaps, secondary_label_rules.
+
+| Tier | Gate | Categories |
+|------|------|-----------|
+| T1 (zero-tolerance) | F1≥0.93, R≥0.90 | csam, trafficking, bestiality, self_harm |
+| T2 (worker safety) | F1≥0.84 | predatory_behavior, ncii, sextortion, threats |
+| T3 (exploitation) | F1≥0.84 | harassment, hate_speech, anti_trans†, doxxing, financial_coercion, consent_violation, intoxication, extreme_gore, snuff |
+| T4 (platform policy) | F1≥0.85 | spam, scam_patterns, impersonation, law_enforcement, age_play, necrophilia, contact_info |
+| T5 (content routing) | F1≥0.80 | solicitation, adult_content, bdsm, edge_play, roleplay, furry, watersports, scat, profanity |
+
+† `anti_trans` has `"optional": True` — excluded from inference output by default.

 `constants.py` derives LABEL_NAMES and NUM_LABELS from CATEGORY_SPECS — adding a category means adding one dict entry.

@@ -67,11 +80,11 @@ All commands via `content-moderation-training` (installed entry point):

 | Command | Purpose |
 |---------|---------|
-| `run --from STEP --to STEP` | Run pipeline steps (generate-positives through evaluate) |
+| `run --from STEP --to STEP` | Run pipeline steps (generate-positives through report) |
 | `status` | Per-category data counts + pipeline step status |
 | `review CATEGORY [positives\|hard_negatives] -n N` | Print examples for quality review |
 | `reset CATEGORY [--cache]` | Delete generated data to force re-generation |
-| `taxonomy` | List categories with severity |
+| `taxonomy` | List categories with severity and tier |
 | `taxonomy --specs` | Detailed spec coverage per category |
 | `taxonomy --overlaps` | Show multi-label overlap rules |
 | `taxonomy --validate` | CI check: all categories have complete specs |
@@ -83,13 +96,16 @@ All commands via `content-moderation-training` (installed entry point):
 1. **generate-positives** — Generate positive examples for all categories (Claude + local LLM)
 2. **generate-negatives** — Generate hard negatives and innocuous examples
 3. **generate-perturbations** — Adversarial perturbation negatives from existing positives
-4. **merge-data** — Merge all sources, apply multi-label overlaps, split train/val/test
-5. **train** — Fine-tune base model on merged training data (via train-text-classifier)
-6. **export** — Export to ONNX with quantization (via train-text-classifier)
-7. **evaluate** — Per-category F1 evaluation against test set (gate: >= 0.85)
+4. **merge-data** — Merge all sources, apply multi-label overlaps, split train/val/test + phased splits
+5. **train-phase1** — Phase 1: category representations (positives + innocuous, 7 epochs, cosine LR)
+6. **train-phase2** — Phase 2: decision boundaries (+ hard negatives, 7 epochs)
+7. **train-phase3** — Phase 3: boundary sharpening (+ perturbation negatives, 10 epochs)
+8. **export** — Export to ONNX with fp16 conversion
+9. **evaluate** — Tier-aware threshold tuning on val, tiered quality gate on test
+10. **report** — Classification examples report (docs/classification-examples.md)

 Run a single step: `content-moderation-training run --from merge-data --to merge-data`
-Run from step to end: `content-moderation-training run --from train`
+Run from step to end: `content-moderation-training run --from train-phase1`

 ---

@@ -106,6 +122,20 @@ The `ResponseCache` uses deterministic keys per (category, subtype, severity, se

 ---

+## Tier-Aware Evaluation
+
+The evaluation pipeline (`evaluate.py`) implements platform priority tiers:
+
+- **Threshold search**: T1 searches 0.20–0.60 (recall-biased), T5 searches 0.40–0.90 (precision-biased)
+- **F1 gates**: T1≥0.93, T2/T3≥0.84, T4≥0.85, T5≥0.80
+- **Recall floor**: T1≥0.90 (criminal categories must not miss examples)
+- **Per-category ceiling**: harassment max threshold 0.65 (prevents val-set overfitting)
+
+Training uses tier-differentiated pos_weight via `--pos-weight-overrides`:
+T1/T2/T3=10.0, T4=8.0, T5=6.0
+
+---
+
 ## Development Setup

 ```bash
@@ -115,9 +145,6 @@ pip install -e .
 # Run full pipeline status
 content-moderation-training status

-# Run tests
-python -m pytest
-
 # Verify taxonomy
 content-moderation-training taxonomy --validate
 ```
@@ -133,20 +160,15 @@ content-moderation-training taxonomy --validate

 ## Current State

-### Production Model: v15 mpnet fp16
-- Macro F1: 0.944 (test, with per-category thresholds)
-- 18/18 original categories pass gate
-- Model: `models/v15_mpnet_full_overlap/onnx/model_fp16.onnx`
-
-### Active Experiment: 17 (32-Category Expansion)
-- 14 new categories added (adult subtypes + contextual moderation)
-- Data generation in progress (targeting 500 pos + 400 hard neg per category)
-- See EXPERIMENTS.md for full history and analysis
+### Production Model: v2 mpnet fp16
+- Macro F1: 0.934 (test, with tier-aware per-category thresholds)
+- 33/33 categories pass tiered quality gates
+- Model: `models/v2/onnx/model_fp16.onnx` (209 MB)
+- Thresholds: `models/v2/onnx/thresholds.json`

 ### Known Constraints
 - INT8 quantization (static or dynamic) destroys mpnet outputs — use fp16 only
-- Multi-label co-detection is weak in v15 (0/5 scenarios pass)
-- self_harm and csam have recall gaps on realistic inputs despite high test F1
+- Multi-label co-detection is the primary weakness (model catches primary label, misses co-labels)
 - Local LLM (llama-http) must be running for censored category generation

 ---
@@ -165,4 +187,4 @@ content-moderation-training taxonomy --validate
 `packages/content-moderation-feedback/` contains:
 - **FeedbackClient** — JSONL-based feedback collection
 - **Showcase app** — FastAPI with live ONNX inference
-- **Regression test suite** — `tests/test_model_categories.py` (33 positive vectors, 37+ hard negatives, 5 multi-label scenarios)
+- **Regression test suite** — `tests/test_model_categories.py` (33 positive vectors, 37+ hard negatives, multi-label scenarios)
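
Review note on the "Tier-Aware Evaluation" section added to CLAUDE.md above: the per-tier threshold search it describes can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the code in `evaluate.py`. The function names `f1_at` and `search_threshold`, the grid size, and the choice to maximize F1 are all hypothetical; only the search ranges (T1: 0.20 to 0.60, T5: 0.40 to 0.90) and the harassment ceiling of 0.65 come from the diff.

```python
from typing import Optional, Sequence, Tuple


def f1_at(scores: Sequence[float], labels: Sequence[int], threshold: float) -> float:
    """F1 for one category when predicting positive at score >= threshold."""
    tp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 1)
    fp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 0)
    fn = sum(1 for s, y in zip(scores, labels) if s < threshold and y == 1)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def search_threshold(
    scores: Sequence[float],
    labels: Sequence[int],
    lo: float,
    hi: float,
    ceiling: Optional[float] = None,
    steps: int = 41,  # hypothetical grid resolution
) -> Tuple[float, float]:
    """Grid-search [lo, hi] on validation scores; return (threshold, F1).

    An optional per-category ceiling caps the upper bound of the range.
    Ties keep the lowest threshold, which favors recall.
    """
    if ceiling is not None:
        hi = min(hi, ceiling)
    best_t, best_f1 = lo, -1.0
    for i in range(steps):
        t = lo + (hi - lo) * i / (steps - 1)
        f1 = f1_at(scores, labels, t)
        if f1 > best_f1:
            best_t, best_f1 = t, f1
    return best_t, best_f1


# Toy validation scores for a single category (made-up numbers).
val_scores = [0.1, 0.3, 0.5, 0.7]
val_labels = [0, 0, 1, 1]
t1_threshold, t1_f1 = search_threshold(val_scores, val_labels, 0.20, 0.60)
capped_threshold, _ = search_threshold(val_scores, val_labels, 0.40, 0.90, ceiling=0.65)
```

A lower search range lets a T1 category settle on a threshold that trades precision for recall, while the ceiling keeps a category such as harassment from drifting to a high threshold that only fits the validation split.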
--- a/README.md
+++ b/README.md
@@ -1,37 +1,22 @@
 # Content Moderation Classifier

-Multi-label text classifier for the Lilith platform. Detects 24 content moderation categories across platform messages, bios, listings, and reviews.
+Multi-label text classifier for the Lilith platform. Detects 33 content moderation categories across platform messages, bios, listings, and reviews.

-**Production model**: `all-mpnet-base-v2` fp16 ONNX — 219 MB, macro F1 0.944, 18/18 categories pass (F1 >= 0.85).
+**Production model**: `all-mpnet-base-v2` fp16 ONNX — 209 MB, macro F1 0.934, 33/33 categories pass tiered quality gates.

 ## Categories

-| Category | Severity | Description |
-|----------|----------|-------------|
-| threats | critical | Death/harm/violence threats, veiled threats |
-| hate_speech | high | Racial, ethnic, gender, sexuality, religious, disability |
-| csam | critical | Solicitation, distribution, grooming of minors |
-| scam_patterns | high | Advance fee, deposit scam, phishing, fake escort |
-| contact_info | medium | Phone numbers, emails, social media handles, external URLs |
-| solicitation | medium | Explicit requests, price discussion, service negotiation |
-| spam | low | Mass messages, promotional, repetitive content |
-| profanity | low | Strong language, slurs, offensive terms |
-| adult_content | medium | Explicit descriptions, nudity references, sexual content |
-| doxxing | critical | Identity/address/workplace/family exposure |
-| predatory_behavior | critical | Grooming, manipulation, power imbalance, boundary violation |
-| law_enforcement | high | Sting language, entrapment patterns, investigation probing |
-| sextortion | critical | Blackmail, extortion, threat of exposure, coercion |
-| ncii | critical | Revenge porn, deepfakes, unauthorized intimate images |
-| trafficking | critical | Sexual/labor trafficking, recruitment, advertisement |
-| self_harm | critical | Suicide encouragement, self-injury, eating disorders |
-| impersonation | high | Staff/creator/law enforcement impersonation |
-| harassment | medium | Targeted abuse, bullying, stalking, persistent contact |
-| age_play | medium | Adult age-play, daddy/little dynamics, infantilism (legal edge play) |
-| bestiality | critical | Zoophilia, zoosadism, animal sexual content |
-| necrophilia | critical | Sexual content involving corpses, death fetishism |
-| scat | high | Coprophilia, emetophilia, bodily waste content |
-| snuff | critical | Murder fantasy, erotophonophilia |
-| extreme_gore | high | Extreme graphic violence, mutilation, torture content |
+33 categories organized into 5 platform priority tiers:
+
+| Tier | Semantics | Categories |
+|------|-----------|-----------|
+| **T1** (F1≥0.93, R≥0.90) | Zero-tolerance (criminal) | csam, trafficking, bestiality, self_harm |
+| **T2** (F1≥0.84) | Worker safety | predatory_behavior, ncii, sextortion, threats |
+| **T3** (F1≥0.84) | Exploitation/harm | harassment, hate_speech, anti_trans†, doxxing, financial_coercion, consent_violation, intoxication, extreme_gore, snuff |
+| **T4** (F1≥0.85) | Platform policy | spam, scam_patterns, impersonation, law_enforcement, age_play, necrophilia, contact_info |
+| **T5** (F1≥0.80) | Content routing | solicitation, adult_content, bdsm, edge_play, roleplay, furry, watersports, scat, profanity |
+
+† `anti_trans` is optional — excluded from inference output by default (`include_optional_categories: false`).

 ## Quick Start

@@ -46,54 +31,51 @@ content-moderation-training status
 content-moderation-training run

 # Run from a specific step
-content-moderation-training run --from train
+content-moderation-training run --from train-phase1

 # Review generated examples
 content-moderation-training review harassment positives --limit 10

+# Validate taxonomy
+content-moderation-training taxonomy --validate
+
 # Evaluate the production model
 python -m content_moderation_training.evaluate \
-    --model models/v15_mpnet_full_overlap/onnx/model_fp16.onnx \
-    --tokenizer models/v15_mpnet_full_overlap/onnx \
+    --model models/v2/onnx/model_fp16.onnx \
+    --tokenizer models/v2/onnx \
    --test data/splits/test.jsonl \
    --val data/splits/val.jsonl
-
-# Generate classification showcase
-python -m content_moderation_training.showcase \
-    --model models/v15_mpnet_full_overlap/onnx/model_fp16.onnx \
-    --tokenizer models/v15_mpnet_full_overlap/onnx \
-    --thresholds models/v15_mpnet_full_overlap/onnx/thresholds.json \
-    --test data/splits/test.jsonl \
-    --output docs/classification-examples.md
 ```

 ## Architecture

 ### Pipeline

-The training pipeline has 7 steps, orchestrated by `lilith-ml-data-engine`:
+10-step training pipeline orchestrated by `lilith-ml-data-engine`:

-1. **generate-positives** — Generate positive examples for each category (500/cat, with multi-label overlap for co-occurring categories; Claude for most, local LLM for restricted categories)
-2. **generate-negatives** — Generate hard negatives (400/cat for difficult categories, 200/cat otherwise) and 3000 innocuous examples
-3. **generate-perturbations** — Adversarial perturbations from positive examples
-4. **merge-data** — Merge all sources, apply train/val/test split
-5. **train** — Fine-tune `all-mpnet-base-v2` via `train-text-classifier`
-6. **export** — Export to ONNX with fp16 conversion
-7. **evaluate** — Per-category F1 gate (>= 0.85), per-category threshold tuning
+1. **generate-positives** — Positive examples per category (Claude + local LLM for restricted categories)
+2. **generate-negatives** — Hard negatives + 3000 innocuous examples
+3. **generate-perturbations** — Adversarial perturbation negatives from positives
+4. **merge-data** — Merge all sources, apply multi-label enrichment, split train/val/test
+5. **train-phase1** — Phase 1: category representations (positives + innocuous, 7 epochs, cosine LR)
+6. **train-phase2** — Phase 2: decision boundaries (+ hard negatives, 7 epochs)
+7. **train-phase3** — Phase 3: boundary sharpening (+ perturbation negatives, 10 epochs)
+8. **export** — Export to ONNX with fp16 conversion
+9. **evaluate** — Tier-aware threshold tuning + tiered quality gate
+10. **report** — Classification examples report (docs/classification-examples.md)

 ### Source Modules

 | Module | Purpose |
 |--------|---------|
-| `constants.py` | Label taxonomy (24 categories, canonical order) |
-| `pipeline.py` | Pipeline step definitions |
-| `claude_generator.py` | Positive + hard negative generation via Claude |
-| `merge_data.py` | Data merging, multi-label enrichment, splitting |
+| `constants.py` | Label taxonomy (33 categories, derived from CATEGORY_SPECS) |
+| `pipeline.py` | Pipeline step definitions, tier pos_weight configuration |
+| `claude_generator.py` | Positive + hard negative generation via Claude/local LLM |
+| `merge_data.py` | Data merging, multi-label enrichment, phased splitting |
 | `perturbation.py` | Adversarial perturbation generation |
-| `evaluate.py` | ONNX inference, metrics, threshold tuning, quality gate |
-| `showcase.py` | Generates classification showcase markdown from test samples |
-| `llama_client.py` | Local LLM client (alternative to Claude) |
-| `prompts/` | System prompts and category specifications |
+| `evaluate.py` | ONNX inference, tier-aware thresholds, tiered quality gate |
+| `showcase.py` | Classification report generator |
+| `prompts/category_specs.py` | Single source of truth for all 33 categories |

 ### Data Format

@@ -102,7 +84,7 @@ Training data is JSONL with context-prefixed text:
 ```json
 {
  "text": "[ADULT][MESSAGE] Your profile is stunning...",
-  "labels": {"threats": 0, "hate_speech": 0, ..., "harassment": 0},
+  "labels": {"threats": 0, "hate_speech": 0, ..., "anti_trans": 0},
  "metadata": {"source": "claude_positive", "category": "spam", ...}
 }
 ```
@@ -115,70 +97,61 @@ Context prefixes (`[GENERAL|ADULT][BIO|MESSAGE|LISTING|REVIEW|GENERAL]`) encode
 |----------|-------|
 | Base model | `sentence-transformers/all-mpnet-base-v2` (110M params, 768-dim) |
 | ONNX variant | fp16 |
-| Size | 219 MB |
-| Macro F1 | 0.944 |
-| Quality gate | 18/18 pass (F1 >= 0.85) |
-| Per-category thresholds | Tuned (see `thresholds.json`) |
-| Path | `models/v15_mpnet_full_overlap/onnx/model_fp16.onnx` |
-
-### Key Thresholds
-
-Most categories use the default 0.30 threshold. Tuned exceptions:
-
-| Category | Threshold | Reason |
-|----------|-----------|--------|
-| threats | 0.58 | Reduce false positives from assertive language |
-| law_enforcement | 0.63 | Narrow boundary with legitimate investigation discussion |
-| adult_content | 0.45 | Distinguish from clinical/educational content |
-| predatory_behavior | 0.44 | Separate from legitimate mentorship language |
-| harassment | 0.42 | Reduce overlap with criticism/assertive communication |
-| ncii | 0.38 | Distinguish from deepfake detection discussion |
+| Size | 209 MB |
+| Macro F1 | 0.934 |
+| Quality gate | 33/33 pass (tiered: T1≥0.93, T2/T3≥0.84, T4≥0.85, T5≥0.80) |
+| Per-category thresholds | Tier-aware tuning (see `thresholds.json`) |
+| Path | `models/v2/onnx/model_fp16.onnx` |

 ## Project Structure

 ```
 content-moderation/
-├── config.yaml              # Generation config (model, batch sizes, categories)
+├── config.yaml              # Engine config (paths, concurrency, caps)
 ├── pyproject.toml            # Package definition
-├── EXPERIMENTS.md            # Full experiment log (16 experiments, v1→v15)
+├── EXPERIMENTS.md            # Full experiment log (34 experiments)
 ├── src/
 │   └── content_moderation_training/
 │       ├── __main__.py       # CLI entry point
-│       ├── constants.py      # Label taxonomy
-│       ├── pipeline.py       # Pipeline orchestration
+│       ├── constants.py      # Label taxonomy (derived from category_specs)
+│       ├── pipeline.py       # Pipeline orchestration + tier pos_weights
 │       ├── claude_generator.py
 │       ├── merge_data.py
 │       ├── perturbation.py
-│       ├── evaluate.py
+│       ├── evaluate.py       # Tier-aware thresholds + quality gates
 │       ├── showcase.py
-│       ├── llama_client.py
 │       └── prompts/
+│           └── category_specs.py  # Single source of truth (33 categories)
 ├── data/
-│   ├── generated/           # Generated training data per category
-│   ├── splits/              # train.jsonl, val.jsonl, test.jsonl
+│   ├── generated/           # Per-category positives + hard negatives
+│   ├── splits/              # train/val/test + phased training splits
 │   └── archive/             # Historical data snapshots
 ├── models/
-│   └── v15_mpnet_full_overlap/
+│   └── v2/
 │       └── onnx/
 │           ├── model.onnx         # fp32 baseline (418 MB)
-│           ├── model_fp16.onnx    # Production model (219 MB)
-│           ├── thresholds.json    # Per-category thresholds
-│           └── tokenizer files
+│           ├── model_fp16.onnx    # Production model (209 MB)
+│           └── thresholds.json    # Per-category decision thresholds
+├── packages/
+│   └── content-moderation-feedback/  # Feedback + showcase + regression tests
+├── services/
+│   └── inference-api/       # HTTP inference service
 ├── cache/                   # Claude API response cache
 └── docs/
-    └── classification-examples.md  # Showcase with sample predictions
+    └── classification-examples.md  # 1317 examples across 33 categories
 ```

 ## Experiment History

-16 experiments across two model architectures — see [EXPERIMENTS.md](EXPERIMENTS.md) for the full log.
+34 experiments across two model architectures — see [EXPERIMENTS.md](EXPERIMENTS.md) for the full log.

 **Key milestones**:
-- **v1–v10**: MiniLM-L6-v2 (22M params, 384-dim). Best: 17/18 categories passing. Harassment remained stuck at F1=0.829 despite data scaling, threshold tuning, co-label enrichment, and extended training.
-- **v11–v13**: Multi-label generation by construction. Proved that generating text exhibiting multiple categories improves recall, but MiniLM lacks embedding capacity for 18 overlapping categories.
-- **v14**: Model escalation to `all-mpnet-base-v2`. Fixed 3/5 failing categories immediately. INT8 quantization destroys mpnet (confirmed across static and dynamic variants).
-- **v15**: Original overlap rates + mpnet = **18/18 PASS**. Macro F1 0.945.
-- **v16 (optimization)**: fp16 conversion — 48% size reduction (418 → 219 MB), macro F1 0.944 (near-lossless).
+- **Exp 1–10** (MiniLM-L6-v2): 22M params, 384-dim. Best: 17/18 categories passing.
+- **Exp 14** (model escalation): `all-mpnet-base-v2` — fixed 3/5 failing categories immediately.
+- **Exp 15**: 18/18 PASS. Macro F1 0.945. INT8 quantization confirmed broken for mpnet.
+- **Exp 17–30**: 32-category expansion. Data quality refinement across overlap, seed, and hard negative experiments.
+- **Exp 31**: 33rd category (anti_trans). GATE PASS, macro F1 0.935.
+- **Exp 32–34**: 5-tier platform prioritization. Tier-aware threshold search + tiered quality gates. Key finding: tier differentiation works through evaluation policy, not data manipulation.

 ## Dependencies

@@ -187,4 +160,3 @@ content-moderation/
 - `onnxruntime` — ONNX inference
 - `transformers` — Tokenizer
 - `scikit-learn` — Metrics computation
-- `numpy` — Array operations
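
Review note on the tiered quality gate that both files now describe (T1≥0.93 with a recall floor of 0.90, T2/T3≥0.84, T4≥0.85, T5≥0.80): the pass/fail check could look something like the sketch below. It is an illustration only and makes no claim about how `evaluate.py` is written. The names `TIER_GATES`, `passes_gate`, and `failing_categories`, and the sample metrics, are hypothetical; the numeric floors are the ones stated in the diff.

```python
from typing import Dict, List, Optional, Tuple

# (minimum F1, minimum recall or None) per tier, taken from the diff.
TIER_GATES: Dict[str, Tuple[float, Optional[float]]] = {
    "T1": (0.93, 0.90),
    "T2": (0.84, None),
    "T3": (0.84, None),
    "T4": (0.85, None),
    "T5": (0.80, None),
}


def passes_gate(tier: str, f1: float, recall: float) -> bool:
    """True when a category meets its tier's F1 floor and, if set, recall floor."""
    min_f1, min_recall = TIER_GATES[tier]
    if f1 < min_f1:
        return False
    return min_recall is None or recall >= min_recall


def failing_categories(
    metrics: Dict[str, Tuple[str, float, float]],
) -> List[str]:
    """Return categories that miss their gate; metrics maps name -> (tier, f1, recall)."""
    return sorted(
        name
        for name, (tier, f1, recall) in metrics.items()
        if not passes_gate(tier, f1, recall)
    )


# Made-up test-set metrics, purely to exercise the gate.
sample_metrics = {
    "csam": ("T1", 0.95, 0.88),      # F1 is fine, recall is below the T1 floor
    "threats": ("T2", 0.86, 0.80),
    "spam": ("T4", 0.84, 0.95),      # below the T4 F1 floor
    "profanity": ("T5", 0.81, 0.70),
}
failed = failing_categories(sample_metrics)
```

Keeping the floors in one table keyed by tier means a change in policy for a tier touches a single entry, which matches the commit's stated finding that tier differentiation works through evaluation policy.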