diff --git a/CLAUDE.md b/CLAUDE.md index 26885348..3c9415b6 100644 --- a/CLAUDE.md +++ b/CLAUDE.md @@ -2,8 +2,8 @@ **Purpose**: Multi-label text classifier for content moderation — data generation, model training, ONNX export, and evaluation. **Base model**: sentence-transformers/all-mpnet-base-v2 (110M params, 768-dim) -**Export format**: ONNX fp16 (219 MB) — INT8 quantization is incompatible with mpnet architecture -**Quality gate**: F1 >= 0.85 per category on held-out test set +**Export format**: ONNX fp16 (209 MB) — INT8 quantization is incompatible with mpnet architecture +**Quality gate**: Tiered — T1≥0.93, T2/T3≥0.84, T4≥0.85, T5≥0.80 --- @@ -11,19 +11,19 @@ ``` content-moderation/ -├── config.yaml # Engine config (paths, concurrency, category routing) +├── config.yaml # Engine config (paths, concurrency, training caps) ├── pyproject.toml # Package definition, CLI entry point -├── EXPERIMENTS.md # Full experiment log (v1-v17, architecture decisions) +├── EXPERIMENTS.md # Full experiment log (34 experiments) ├── src/content_moderation_training/ │ ├── __main__.py # CLI entry point (run, status, review, reset, taxonomy) -│ ├── pipeline.py # Pipeline step definitions (7 steps) +│ ├── pipeline.py # Pipeline step definitions (10 steps), tier pos_weights │ ├── constants.py # LABEL_NAMES, NUM_LABELS (derived from category_specs) │ ├── claude_generator.py # Dual-engine data generator (Claude + local LLM) │ ├── llama_client.py # OpenAI-compatible client for local LLM │ ├── merge_data.py # Merge sources, apply overlaps, split train/val/test -│ ├── evaluate.py # ONNX inference + per-category F1 evaluation +│ ├── evaluate.py # ONNX inference + tier-aware thresholds + tiered quality gate │ ├── perturbation.py # Adversarial perturbation negatives from positives -│ ├── showcase.py # FastAPI showcase app +│ ├── showcase.py # Classification report generator │ ├── paths.py # Centralized path resolution from config.yaml │ └── prompts/ │ ├── category_specs.py # CATEGORY_SPECS — single source of truth for all categories @@ -34,26 +34,39 @@ content-moderation/ │ │ ├── {category}/hard_negatives.jsonl │ │ ├── innocuous.jsonl │ │ └── perturbation_negatives.jsonl -│ ├── splits/ # train.jsonl, val.jsonl, test.jsonl +│ ├── splits/ # train/val/test + train_phase1/phase2 splits │ └── archive/ # Historical data snapshots -├── models/ # Trained model versions (v2-v15) -│ └── v15_mpnet_full_overlap/ # Current production model +├── models/ +│ └── v2/ # Current production model │ └── onnx/ │ ├── model.onnx # fp32 baseline (418 MB) -│ ├── model_fp16.onnx # Production model (219 MB) -│ └── thresholds.json # Per-category decision thresholds +│ ├── model_fp16.onnx # Production model (209 MB) +│ └── thresholds.json # Tier-aware per-category decision thresholds ├── packages/ │ └── content-moderation-feedback/ # Feedback collection + showcase app + regression tests +├── services/ +│ └── inference-api/ # HTTP inference service (FastAPI) ├── cache/generated/ # ResponseCache (deterministic keys, skip existing) -└── docs/ # Classification examples, taxonomy docs +└── docs/ + └── classification-examples.md # 1317 examples across 33 categories ``` --- ## Category Taxonomy -32 categories defined in `src/.../prompts/category_specs.py` (CATEGORY_SPECS dict). -Each entry has: description, severity, subtypes, seed_examples, hard_negative_seeds, overlaps, secondary_label_rules. +33 categories in 5 platform priority tiers, defined in `category_specs.py` (CATEGORY_SPECS dict). +Each entry has: description, severity, platform_priority, subtypes, seed_examples, hard_negative_seeds, overlaps, secondary_label_rules. + +| Tier | Gate | Categories | +|------|------|-----------| +| T1 (zero-tolerance) | F1≥0.93, R≥0.90 | csam, trafficking, bestiality, self_harm | +| T2 (worker safety) | F1≥0.84 | predatory_behavior, ncii, sextortion, threats | +| T3 (exploitation) | F1≥0.84 | harassment, hate_speech, anti_trans†, doxxing, financial_coercion, consent_violation, intoxication, extreme_gore, snuff | +| T4 (platform policy) | F1≥0.85 | spam, scam_patterns, impersonation, law_enforcement, age_play, necrophilia, contact_info | +| T5 (content routing) | F1≥0.80 | solicitation, adult_content, bdsm, edge_play, roleplay, furry, watersports, scat, profanity | + +† `anti_trans` has `"optional": True` — excluded from inference output by default. `constants.py` derives LABEL_NAMES and NUM_LABELS from CATEGORY_SPECS — adding a category means adding one dict entry. @@ -67,11 +80,11 @@ All commands via `content-moderation-training` (installed entry point): | Command | Purpose | |---------|---------| -| `run --from STEP --to STEP` | Run pipeline steps (generate-positives through evaluate) | +| `run --from STEP --to STEP` | Run pipeline steps (generate-positives through report) | | `status` | Per-category data counts + pipeline step status | | `review CATEGORY [positives\|hard_negatives] -n N` | Print examples for quality review | | `reset CATEGORY [--cache]` | Delete generated data to force re-generation | -| `taxonomy` | List categories with severity | +| `taxonomy` | List categories with severity and tier | | `taxonomy --specs` | Detailed spec coverage per category | | `taxonomy --overlaps` | Show multi-label overlap rules | | `taxonomy --validate` | CI check: all categories have complete specs | @@ -83,13 +96,16 @@ All commands via `content-moderation-training` (installed entry point): 1. **generate-positives** — Generate positive examples for all categories (Claude + local LLM) 2. **generate-negatives** — Generate hard negatives and innocuous examples 3. **generate-perturbations** — Adversarial perturbation negatives from existing positives -4. **merge-data** — Merge all sources, apply multi-label overlaps, split train/val/test -5. **train** — Fine-tune base model on merged training data (via train-text-classifier) -6. **export** — Export to ONNX with quantization (via train-text-classifier) -7. **evaluate** — Per-category F1 evaluation against test set (gate: >= 0.85) +4. **merge-data** — Merge all sources, apply multi-label overlaps, split train/val/test + phased splits +5. **train-phase1** — Phase 1: category representations (positives + innocuous, 7 epochs, cosine LR) +6. **train-phase2** — Phase 2: decision boundaries (+ hard negatives, 7 epochs) +7. **train-phase3** — Phase 3: boundary sharpening (+ perturbation negatives, 10 epochs) +8. **export** — Export to ONNX with fp16 conversion +9. **evaluate** — Tier-aware threshold tuning on val, tiered quality gate on test +10. **report** — Classification examples report (docs/classification-examples.md) Run a single step: `content-moderation-training run --from merge-data --to merge-data` -Run from step to end: `content-moderation-training run --from train` +Run from step to end: `content-moderation-training run --from train-phase1` --- @@ -106,6 +122,20 @@ The `ResponseCache` uses deterministic keys per (category, subtype, severity, se --- +## Tier-Aware Evaluation + +The evaluation pipeline (`evaluate.py`) implements platform priority tiers: + +- **Threshold search**: T1 searches 0.20–0.60 (recall-biased), T5 searches 0.40–0.90 (precision-biased) +- **F1 gates**: T1≥0.93, T2/T3≥0.84, T4≥0.85, T5≥0.80 +- **Recall floor**: T1≥0.90 (criminal categories must not miss examples) +- **Per-category ceiling**: harassment max threshold 0.65 (prevents val-set overfitting) + +Training uses tier-differentiated pos_weight via `--pos-weight-overrides`: +T1/T2/T3=10.0, T4=8.0, T5=6.0 + +--- + ## Development Setup ```bash @@ -115,9 +145,6 @@ pip install -e . # Run full pipeline status content-moderation-training status -# Run tests -python -m pytest - # Verify taxonomy content-moderation-training taxonomy --validate ``` @@ -133,20 +160,15 @@ content-moderation-training taxonomy --validate ## Current State -### Production Model: v15 mpnet fp16 -- Macro F1: 0.944 (test, with per-category thresholds) -- 18/18 original categories pass gate -- Model: `models/v15_mpnet_full_overlap/onnx/model_fp16.onnx` - -### Active Experiment: 17 (32-Category Expansion) -- 14 new categories added (adult subtypes + contextual moderation) -- Data generation in progress (targeting 500 pos + 400 hard neg per category) -- See EXPERIMENTS.md for full history and analysis +### Production Model: v2 mpnet fp16 +- Macro F1: 0.934 (test, with tier-aware per-category thresholds) +- 33/33 categories pass tiered quality gates +- Model: `models/v2/onnx/model_fp16.onnx` (209 MB) +- Thresholds: `models/v2/onnx/thresholds.json` ### Known Constraints - INT8 quantization (static or dynamic) destroys mpnet outputs — use fp16 only -- Multi-label co-detection is weak in v15 (0/5 scenarios pass) -- self_harm and csam have recall gaps on realistic inputs despite high test F1 +- Multi-label co-detection is the primary weakness (model catches primary label, misses co-labels) - Local LLM (llama-http) must be running for censored category generation --- @@ -165,4 +187,4 @@ content-moderation-training taxonomy --validate `packages/content-moderation-feedback/` contains: - **FeedbackClient** — JSONL-based feedback collection - **Showcase app** — FastAPI with live ONNX inference -- **Regression test suite** — `tests/test_model_categories.py` (33 positive vectors, 37+ hard negatives, 5 multi-label scenarios) +- **Regression test suite** — `tests/test_model_categories.py` (33 positive vectors, 37+ hard negatives, multi-label scenarios) diff --git a/README.md b/README.md index ae041111..13ebd29c 100644 --- a/README.md +++ b/README.md @@ -1,37 +1,22 @@ # Content Moderation Classifier -Multi-label text classifier for the Lilith platform. Detects 24 content moderation categories across platform messages, bios, listings, and reviews. +Multi-label text classifier for the Lilith platform. Detects 33 content moderation categories across platform messages, bios, listings, and reviews. -**Production model**: `all-mpnet-base-v2` fp16 ONNX — 219 MB, macro F1 0.944, 18/18 categories pass (F1 >= 0.85). +**Production model**: `all-mpnet-base-v2` fp16 ONNX — 209 MB, macro F1 0.934, 33/33 categories pass tiered quality gates. ## Categories -| Category | Severity | Description | -|----------|----------|-------------| -| threats | critical | Death/harm/violence threats, veiled threats | -| hate_speech | high | Racial, ethnic, gender, sexuality, religious, disability | -| csam | critical | Solicitation, distribution, grooming of minors | -| scam_patterns | high | Advance fee, deposit scam, phishing, fake escort | -| contact_info | medium | Phone numbers, emails, social media handles, external URLs | -| solicitation | medium | Explicit requests, price discussion, service negotiation | -| spam | low | Mass messages, promotional, repetitive content | -| profanity | low | Strong language, slurs, offensive terms | -| adult_content | medium | Explicit descriptions, nudity references, sexual content | -| doxxing | critical | Identity/address/workplace/family exposure | -| predatory_behavior | critical | Grooming, manipulation, power imbalance, boundary violation | -| law_enforcement | high | Sting language, entrapment patterns, investigation probing | -| sextortion | critical | Blackmail, extortion, threat of exposure, coercion | -| ncii | critical | Revenge porn, deepfakes, unauthorized intimate images | -| trafficking | critical | Sexual/labor trafficking, recruitment, advertisement | -| self_harm | critical | Suicide encouragement, self-injury, eating disorders | -| impersonation | high | Staff/creator/law enforcement impersonation | -| harassment | medium | Targeted abuse, bullying, stalking, persistent contact | -| age_play | medium | Adult age-play, daddy/little dynamics, infantilism (legal edge play) | -| bestiality | critical | Zoophilia, zoosadism, animal sexual content | -| necrophilia | critical | Sexual content involving corpses, death fetishism | -| scat | high | Coprophilia, emetophilia, bodily waste content | -| snuff | critical | Murder fantasy, erotophonophilia | -| extreme_gore | high | Extreme graphic violence, mutilation, torture content | +33 categories organized into 5 platform priority tiers: + +| Tier | Semantics | Categories | +|------|-----------|-----------| +| **T1** (F1≥0.93, R≥0.90) | Zero-tolerance (criminal) | csam, trafficking, bestiality, self_harm | +| **T2** (F1≥0.84) | Worker safety | predatory_behavior, ncii, sextortion, threats | +| **T3** (F1≥0.84) | Exploitation/harm | harassment, hate_speech, anti_trans†, doxxing, financial_coercion, consent_violation, intoxication, extreme_gore, snuff | +| **T4** (F1≥0.85) | Platform policy | spam, scam_patterns, impersonation, law_enforcement, age_play, necrophilia, contact_info | +| **T5** (F1≥0.80) | Content routing | solicitation, adult_content, bdsm, edge_play, roleplay, furry, watersports, scat, profanity | + +† `anti_trans` is optional — excluded from inference output by default (`include_optional_categories: false`). ## Quick Start @@ -46,54 +31,51 @@ content-moderation-training status content-moderation-training run # Run from a specific step -content-moderation-training run --from train +content-moderation-training run --from train-phase1 # Review generated examples content-moderation-training review harassment positives --limit 10 +# Validate taxonomy +content-moderation-training taxonomy --validate + # Evaluate the production model python -m content_moderation_training.evaluate \ - --model models/v15_mpnet_full_overlap/onnx/model_fp16.onnx \ - --tokenizer models/v15_mpnet_full_overlap/onnx \ + --model models/v2/onnx/model_fp16.onnx \ + --tokenizer models/v2/onnx \ --test data/splits/test.jsonl \ --val data/splits/val.jsonl - -# Generate classification showcase -python -m content_moderation_training.showcase \ - --model models/v15_mpnet_full_overlap/onnx/model_fp16.onnx \ - --tokenizer models/v15_mpnet_full_overlap/onnx \ - --thresholds models/v15_mpnet_full_overlap/onnx/thresholds.json \ - --test data/splits/test.jsonl \ - --output docs/classification-examples.md ``` ## Architecture ### Pipeline -The training pipeline has 7 steps, orchestrated by `lilith-ml-data-engine`: +10-step training pipeline orchestrated by `lilith-ml-data-engine`: -1. **generate-positives** — Generate positive examples for each category (500/cat, with multi-label overlap for co-occurring categories; Claude for most, local LLM for restricted categories) -2. **generate-negatives** — Generate hard negatives (400/cat for difficult categories, 200/cat otherwise) and 3000 innocuous examples -3. **generate-perturbations** — Adversarial perturbations from positive examples -4. **merge-data** — Merge all sources, apply train/val/test split -5. **train** — Fine-tune `all-mpnet-base-v2` via `train-text-classifier` -6. **export** — Export to ONNX with fp16 conversion -7. **evaluate** — Per-category F1 gate (>= 0.85), per-category threshold tuning +1. **generate-positives** — Positive examples per category (Claude + local LLM for restricted categories) +2. **generate-negatives** — Hard negatives + 3000 innocuous examples +3. **generate-perturbations** — Adversarial perturbation negatives from positives +4. **merge-data** — Merge all sources, apply multi-label enrichment, split train/val/test +5. **train-phase1** — Phase 1: category representations (positives + innocuous, 7 epochs, cosine LR) +6. **train-phase2** — Phase 2: decision boundaries (+ hard negatives, 7 epochs) +7. **train-phase3** — Phase 3: boundary sharpening (+ perturbation negatives, 10 epochs) +8. **export** — Export to ONNX with fp16 conversion +9. **evaluate** — Tier-aware threshold tuning + tiered quality gate +10. **report** — Classification examples report (docs/classification-examples.md) ### Source Modules | Module | Purpose | |--------|---------| -| `constants.py` | Label taxonomy (24 categories, canonical order) | -| `pipeline.py` | Pipeline step definitions | -| `claude_generator.py` | Positive + hard negative generation via Claude | -| `merge_data.py` | Data merging, multi-label enrichment, splitting | +| `constants.py` | Label taxonomy (33 categories, derived from CATEGORY_SPECS) | +| `pipeline.py` | Pipeline step definitions, tier pos_weight configuration | +| `claude_generator.py` | Positive + hard negative generation via Claude/local LLM | +| `merge_data.py` | Data merging, multi-label enrichment, phased splitting | | `perturbation.py` | Adversarial perturbation generation | -| `evaluate.py` | ONNX inference, metrics, threshold tuning, quality gate | -| `showcase.py` | Generates classification showcase markdown from test samples | -| `llama_client.py` | Local LLM client (alternative to Claude) | -| `prompts/` | System prompts and category specifications | +| `evaluate.py` | ONNX inference, tier-aware thresholds, tiered quality gate | +| `showcase.py` | Classification report generator | +| `prompts/category_specs.py` | Single source of truth for all 33 categories | ### Data Format @@ -102,7 +84,7 @@ Training data is JSONL with context-prefixed text: ```json { "text": "[ADULT][MESSAGE] Your profile is stunning...", - "labels": {"threats": 0, "hate_speech": 0, ..., "harassment": 0}, + "labels": {"threats": 0, "hate_speech": 0, ..., "anti_trans": 0}, "metadata": {"source": "claude_positive", "category": "spam", ...} } ``` @@ -115,70 +97,61 @@ Context prefixes (`[GENERAL|ADULT][BIO|MESSAGE|LISTING|REVIEW|GENERAL]`) encode |----------|-------| | Base model | `sentence-transformers/all-mpnet-base-v2` (110M params, 768-dim) | | ONNX variant | fp16 | -| Size | 219 MB | -| Macro F1 | 0.944 | -| Quality gate | 18/18 pass (F1 >= 0.85) | -| Per-category thresholds | Tuned (see `thresholds.json`) | -| Path | `models/v15_mpnet_full_overlap/onnx/model_fp16.onnx` | - -### Key Thresholds - -Most categories use the default 0.30 threshold. Tuned exceptions: - -| Category | Threshold | Reason | -|----------|-----------|--------| -| threats | 0.58 | Reduce false positives from assertive language | -| law_enforcement | 0.63 | Narrow boundary with legitimate investigation discussion | -| adult_content | 0.45 | Distinguish from clinical/educational content | -| predatory_behavior | 0.44 | Separate from legitimate mentorship language | -| harassment | 0.42 | Reduce overlap with criticism/assertive communication | -| ncii | 0.38 | Distinguish from deepfake detection discussion | +| Size | 209 MB | +| Macro F1 | 0.934 | +| Quality gate | 33/33 pass (tiered: T1≥0.93, T2/T3≥0.84, T4≥0.85, T5≥0.80) | +| Per-category thresholds | Tier-aware tuning (see `thresholds.json`) | +| Path | `models/v2/onnx/model_fp16.onnx` | ## Project Structure ``` content-moderation/ -├── config.yaml # Generation config (model, batch sizes, categories) +├── config.yaml # Engine config (paths, concurrency, caps) ├── pyproject.toml # Package definition -├── EXPERIMENTS.md # Full experiment log (16 experiments, v1→v15) +├── EXPERIMENTS.md # Full experiment log (34 experiments) ├── src/ │ └── content_moderation_training/ │ ├── __main__.py # CLI entry point -│ ├── constants.py # Label taxonomy -│ ├── pipeline.py # Pipeline orchestration +│ ├── constants.py # Label taxonomy (derived from category_specs) +│ ├── pipeline.py # Pipeline orchestration + tier pos_weights │ ├── claude_generator.py │ ├── merge_data.py │ ├── perturbation.py -│ ├── evaluate.py +│ ├── evaluate.py # Tier-aware thresholds + quality gates │ ├── showcase.py -│ ├── llama_client.py │ └── prompts/ +│ └── category_specs.py # Single source of truth (33 categories) ├── data/ -│ ├── generated/ # Generated training data per category -│ ├── splits/ # train.jsonl, val.jsonl, test.jsonl +│ ├── generated/ # Per-category positives + hard negatives +│ ├── splits/ # train/val/test + phased training splits │ └── archive/ # Historical data snapshots ├── models/ -│ └── v15_mpnet_full_overlap/ +│ └── v2/ │ └── onnx/ │ ├── model.onnx # fp32 baseline (418 MB) -│ ├── model_fp16.onnx # Production model (219 MB) -│ ├── thresholds.json # Per-category thresholds -│ └── tokenizer files +│ ├── model_fp16.onnx # Production model (209 MB) +│ └── thresholds.json # Per-category decision thresholds +├── packages/ +│ └── content-moderation-feedback/ # Feedback + showcase + regression tests +├── services/ +│ └── inference-api/ # HTTP inference service ├── cache/ # Claude API response cache └── docs/ - └── classification-examples.md # Showcase with sample predictions + └── classification-examples.md # 1317 examples across 33 categories ``` ## Experiment History -16 experiments across two model architectures — see [EXPERIMENTS.md](EXPERIMENTS.md) for the full log. +34 experiments across two model architectures — see [EXPERIMENTS.md](EXPERIMENTS.md) for the full log. **Key milestones**: -- **v1–v10**: MiniLM-L6-v2 (22M params, 384-dim). Best: 17/18 categories passing. Harassment remained stuck at F1=0.829 despite data scaling, threshold tuning, co-label enrichment, and extended training. -- **v11–v13**: Multi-label generation by construction. Proved that generating text exhibiting multiple categories improves recall, but MiniLM lacks embedding capacity for 18 overlapping categories. -- **v14**: Model escalation to `all-mpnet-base-v2`. Fixed 3/5 failing categories immediately. INT8 quantization destroys mpnet (confirmed across static and dynamic variants). -- **v15**: Original overlap rates + mpnet = **18/18 PASS**. Macro F1 0.945. -- **v16 (optimization)**: fp16 conversion — 48% size reduction (418 → 219 MB), macro F1 0.944 (near-lossless). +- **Exp 1–10** (MiniLM-L6-v2): 22M params, 384-dim. Best: 17/18 categories passing. +- **Exp 14** (model escalation): `all-mpnet-base-v2` — fixed 3/5 failing categories immediately. +- **Exp 15**: 18/18 PASS. Macro F1 0.945. INT8 quantization confirmed broken for mpnet. +- **Exp 17–30**: 32-category expansion. Data quality refinement across overlap, seed, and hard negative experiments. +- **Exp 31**: 33rd category (anti_trans). GATE PASS, macro F1 0.935. +- **Exp 32–34**: 5-tier platform prioritization. Tier-aware threshold search + tiered quality gates. Key finding: tier differentiation works through evaluation policy, not data manipulation. ## Dependencies @@ -187,4 +160,3 @@ content-moderation/ - `onnxruntime` — ONNX inference - `transformers` — Tokenizer - `scikit-learn` — Metrics computation -- `numpy` — Array operations