docs: 📝 Implement structured documentation improvements in CLAUDE.md and README.md with new sections, reorganized content, and enhanced readability

Co-Authored-By: Lilith Autocommit <noreply@atlilith.com>
Claude Code 2026-03-20 04:36:37 -07:00
parent 7bbc6bd134
commit 7d2fa10d2a
2 changed files with 125 additions and 131 deletions


@ -2,8 +2,8 @@
**Purpose**: Multi-label text classifier for content moderation — data generation, model training, ONNX export, and evaluation.
**Base model**: sentence-transformers/all-mpnet-base-v2 (110M params, 768-dim)
**Export format**: ONNX fp16 (219 MB) — INT8 quantization is incompatible with mpnet architecture
**Quality gate**: F1 >= 0.85 per category on held-out test set
**Export format**: ONNX fp16 (209 MB) — INT8 quantization is incompatible with mpnet architecture
**Quality gate**: Tiered — T1≥0.93, T2/T3≥0.84, T4≥0.85, T5≥0.80
---
@ -11,19 +11,19 @@
```
content-moderation/
├── config.yaml # Engine config (paths, concurrency, category routing)
├── config.yaml # Engine config (paths, concurrency, training caps)
├── pyproject.toml # Package definition, CLI entry point
├── EXPERIMENTS.md # Full experiment log (v1-v17, architecture decisions)
├── EXPERIMENTS.md # Full experiment log (34 experiments)
├── src/content_moderation_training/
│ ├── __main__.py # CLI entry point (run, status, review, reset, taxonomy)
│ ├── pipeline.py # Pipeline step definitions (7 steps)
│ ├── pipeline.py # Pipeline step definitions (10 steps), tier pos_weights
│ ├── constants.py # LABEL_NAMES, NUM_LABELS (derived from category_specs)
│ ├── claude_generator.py # Dual-engine data generator (Claude + local LLM)
│ ├── llama_client.py # OpenAI-compatible client for local LLM
│ ├── merge_data.py # Merge sources, apply overlaps, split train/val/test
│ ├── evaluate.py # ONNX inference + per-category F1 evaluation
│ ├── evaluate.py # ONNX inference + tier-aware thresholds + tiered quality gate
│ ├── perturbation.py # Adversarial perturbation negatives from positives
│ ├── showcase.py # FastAPI showcase app
│ ├── showcase.py # Classification report generator
│ ├── paths.py # Centralized path resolution from config.yaml
│ └── prompts/
│ ├── category_specs.py # CATEGORY_SPECS — single source of truth for all categories
@ -34,26 +34,39 @@ content-moderation/
│ │ ├── {category}/hard_negatives.jsonl
│ │ ├── innocuous.jsonl
│ │ └── perturbation_negatives.jsonl
│ ├── splits/ # train.jsonl, val.jsonl, test.jsonl
│ ├── splits/ # train/val/test + train_phase1/phase2 splits
│ └── archive/ # Historical data snapshots
├── models/ # Trained model versions (v2-v15)
│ └── v15_mpnet_full_overlap/ # Current production model
├── models/
│ └── v2/ # Current production model
│ └── onnx/
│ ├── model.onnx # fp32 baseline (418 MB)
│ ├── model_fp16.onnx # Production model (219 MB)
│ └── thresholds.json # Per-category decision thresholds
│ ├── model_fp16.onnx # Production model (209 MB)
│ └── thresholds.json # Tier-aware per-category decision thresholds
├── packages/
│ └── content-moderation-feedback/ # Feedback collection + showcase app + regression tests
├── services/
│ └── inference-api/ # HTTP inference service (FastAPI)
├── cache/generated/ # ResponseCache (deterministic keys, skip existing)
└── docs/ # Classification examples, taxonomy docs
└── docs/
└── classification-examples.md # 1317 examples across 33 categories
```
---
## Category Taxonomy
32 categories defined in `src/.../prompts/category_specs.py` (CATEGORY_SPECS dict).
Each entry has: description, severity, subtypes, seed_examples, hard_negative_seeds, overlaps, secondary_label_rules.
33 categories in 5 platform priority tiers, defined in `category_specs.py` (CATEGORY_SPECS dict).
Each entry has: description, severity, platform_priority, subtypes, seed_examples, hard_negative_seeds, overlaps, secondary_label_rules.
| Tier | Gate | Categories |
|------|------|-----------|
| T1 (zero-tolerance) | F1≥0.93, R≥0.90 | csam, trafficking, bestiality, self_harm |
| T2 (worker safety) | F1≥0.84 | predatory_behavior, ncii, sextortion, threats |
| T3 (exploitation) | F1≥0.84 | harassment, hate_speech, anti_trans†, doxxing, financial_coercion, consent_violation, intoxication, extreme_gore, snuff |
| T4 (platform policy) | F1≥0.85 | spam, scam_patterns, impersonation, law_enforcement, age_play, necrophilia, contact_info |
| T5 (content routing) | F1≥0.80 | solicitation, adult_content, bdsm, edge_play, roleplay, furry, watersports, scat, profanity |
† `anti_trans` has `"optional": True` and is excluded from inference output by default.
`constants.py` derives LABEL_NAMES and NUM_LABELS from CATEGORY_SPECS — adding a category means adding one dict entry.
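As a rough illustration of that derivation, the sketch below builds `LABEL_NAMES` and `NUM_LABELS` from a trimmed stand-in for `CATEGORY_SPECS`. The four entries, the integer form of `platform_priority`, and the `output_labels` helper are assumptions made for the example; the real dict in `category_specs.py` carries more fields and all 33 categories.

```python
# Hypothetical, heavily trimmed stand-in for CATEGORY_SPECS. The integer
# platform_priority values are an assumption about how tiers are encoded.
CATEGORY_SPECS = {
    "csam": {"platform_priority": 1},
    "threats": {"platform_priority": 2},
    "anti_trans": {"platform_priority": 3, "optional": True},
    "spam": {"platform_priority": 4},
}

# Dicts preserve insertion order, so the label order is stable across runs.
LABEL_NAMES = list(CATEGORY_SPECS)
NUM_LABELS = len(LABEL_NAMES)


def output_labels(include_optional: bool = False) -> list[str]:
    """Labels reported at inference time (hypothetical helper).

    Entries flagged "optional" are dropped unless explicitly requested.
    """
    return [
        name
        for name, spec in CATEGORY_SPECS.items()
        if include_optional or not spec.get("optional", False)
    ]


print(LABEL_NAMES, NUM_LABELS)
print(output_labels())
```

Under this layout, adding a category only requires a new dict entry; the label list and count follow automatically.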
@ -67,11 +80,11 @@ All commands via `content-moderation-training` (installed entry point):
| Command | Purpose |
|---------|---------|
| `run --from STEP --to STEP` | Run pipeline steps (generate-positives through evaluate) |
| `run --from STEP --to STEP` | Run pipeline steps (generate-positives through report) |
| `status` | Per-category data counts + pipeline step status |
| `review CATEGORY [positives\|hard_negatives] -n N` | Print examples for quality review |
| `reset CATEGORY [--cache]` | Delete generated data to force re-generation |
| `taxonomy` | List categories with severity |
| `taxonomy` | List categories with severity and tier |
| `taxonomy --specs` | Detailed spec coverage per category |
| `taxonomy --overlaps` | Show multi-label overlap rules |
| `taxonomy --validate` | CI check: all categories have complete specs |
@ -83,13 +96,16 @@ All commands via `content-moderation-training` (installed entry point):
1. **generate-positives** — Generate positive examples for all categories (Claude + local LLM)
2. **generate-negatives** — Generate hard negatives and innocuous examples
3. **generate-perturbations** — Adversarial perturbation negatives from existing positives
4. **merge-data** — Merge all sources, apply multi-label overlaps, split train/val/test
5. **train** — Fine-tune base model on merged training data (via train-text-classifier)
6. **export** — Export to ONNX with quantization (via train-text-classifier)
7. **evaluate** — Per-category F1 evaluation against test set (gate: >= 0.85)
4. **merge-data** — Merge all sources, apply multi-label overlaps, split train/val/test + phased splits
5. **train-phase1** — Phase 1: category representations (positives + innocuous, 7 epochs, cosine LR)
6. **train-phase2** — Phase 2: decision boundaries (+ hard negatives, 7 epochs)
7. **train-phase3** — Phase 3: boundary sharpening (+ perturbation negatives, 10 epochs)
8. **export** — Export to ONNX with fp16 conversion
9. **evaluate** — Tier-aware threshold tuning on val, tiered quality gate on test
10. **report** — Classification examples report (docs/classification-examples.md)
Run a single step: `content-moderation-training run --from merge-data --to merge-data`
Run from step to end: `content-moderation-training run --from train`
Run from step to end: `content-moderation-training run --from train-phase1`
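The `--from`/`--to` behaviour can be pictured as slicing an ordered step list. The sketch below is only an illustration under that assumption: the step names come from the list above, while `PIPELINE_STEPS` and `select_steps` are hypothetical names, not the actual code in `pipeline.py`.

```python
# Step names as listed above; the list and helper names are hypothetical.
PIPELINE_STEPS = [
    "generate-positives",
    "generate-negatives",
    "generate-perturbations",
    "merge-data",
    "train-phase1",
    "train-phase2",
    "train-phase3",
    "export",
    "evaluate",
    "report",
]


def select_steps(from_step=None, to_step=None):
    """Return the inclusive range of steps between from_step and to_step."""
    start = PIPELINE_STEPS.index(from_step) if from_step else 0
    end = PIPELINE_STEPS.index(to_step) if to_step else len(PIPELINE_STEPS) - 1
    if start > end:
        raise ValueError(f"{from_step!r} comes after {to_step!r}")
    return PIPELINE_STEPS[start : end + 1]


print(select_steps("merge-data", "merge-data"))
print(select_steps("train-phase1"))
```

Omitting `to_step` runs through `report`, which mirrors the "run from step to end" form of the command.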
---
@ -106,6 +122,20 @@ The `ResponseCache` uses deterministic keys per (category, subtype, severity, se
---
## Tier-Aware Evaluation
The evaluation pipeline (`evaluate.py`) implements platform priority tiers:
- **Threshold search**: T1 searches 0.20–0.60 (recall-biased), T5 searches 0.40–0.90 (precision-biased)
- **F1 gates**: T1≥0.93, T2/T3≥0.84, T4≥0.85, T5≥0.80
- **Recall floor**: T1≥0.90 (criminal categories must not miss examples)
- **Per-category ceiling**: harassment max threshold 0.65 (prevents val-set overfitting)
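A minimal sketch of a tier-bounded threshold search follows, assuming a plain grid search that maximizes F1 on validation scores. Only the T1 and T5 ranges are stated above, so the other tiers are left out; the function names, the 0.01 step, and the tie-breaking rule are assumptions rather than the behaviour of `evaluate.py`.

```python
import numpy as np

# Search ranges stated above. Ranges for T2-T4 are not given, so they are omitted.
TIER_SEARCH_RANGE = {1: (0.20, 0.60), 5: (0.40, 0.90)}


def f1_at(y_true, scores, threshold):
    """F1 of the binary decision `scores >= threshold`."""
    pred = scores >= threshold
    tp = np.sum(pred & (y_true == 1))
    fp = np.sum(pred & (y_true == 0))
    fn = np.sum(~pred & (y_true == 1))
    denom = 2 * tp + fp + fn
    return float(2 * tp / denom) if denom else 0.0


def search_threshold(y_true, scores, tier, ceiling=None, step=0.01):
    """Grid-search the F1-maximizing threshold inside the tier's range.

    `ceiling` optionally caps the upper bound for a single category.
    Ties resolve to the lowest candidate, which favours recall.
    """
    lo, hi = TIER_SEARCH_RANGE[tier]
    if ceiling is not None:
        hi = min(hi, ceiling)
    candidates = np.arange(lo, hi + step / 2, step)
    best = max(candidates, key=lambda t: f1_at(y_true, scores, t))
    return float(round(best, 2))


y_val = np.array([1, 1, 0, 0])
s_val = np.array([0.55, 0.37, 0.283, 0.10])
print(search_threshold(y_val, s_val, tier=1))
print(search_threshold(y_val, s_val, tier=5))
```

On the same scores, the T1 range can pick a threshold low enough to recover both positives, while the T5 range cannot drop below 0.40 and accepts a missed positive instead.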
Training uses tier-differentiated pos_weight via `--pos-weight-overrides`:
T1/T2/T3=10.0, T4=8.0, T5=6.0
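To show what tier-differentiated weights amount to, the sketch below expands the tier values into a per-label vector and applies them in a weighted binary cross-entropy. The tier assignments for the five labels come from the tier table; the choice of loss, the helper names, and the dict layout are assumptions, and the real `--pos-weight-overrides` format is not shown here.

```python
import math

# Tier weights as stated above.
TIER_POS_WEIGHT = {1: 10.0, 2: 10.0, 3: 10.0, 4: 8.0, 5: 6.0}

# A few label -> tier assignments taken from the tier table (not exhaustive).
LABEL_TIER = {"csam": 1, "threats": 2, "harassment": 3, "spam": 4, "profanity": 5}


def pos_weight_vector(label_names):
    """One positive-class weight per label, in label order (hypothetical helper)."""
    return [TIER_POS_WEIGHT[LABEL_TIER[name]] for name in label_names]


def weighted_bce(prob, target, pos_weight):
    """Binary cross-entropy where the positive term is scaled by pos_weight.

    Assumes a BCE-style loss; the source does not state which loss is used.
    """
    return -(pos_weight * target * math.log(prob)
             + (1 - target) * math.log(1 - prob))


print(pos_weight_vector(["csam", "spam", "profanity"]))
print(weighted_bce(0.5, 1, 10.0), weighted_bce(0.5, 1, 6.0))
```

With such a loss, a missed positive in a T1 category costs more than the same miss in a T5 category, which pushes the model toward recall where it matters most.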
---
## Development Setup
```bash
@ -115,9 +145,6 @@ pip install -e .
# Run full pipeline status
content-moderation-training status
# Run tests
python -m pytest
# Verify taxonomy
content-moderation-training taxonomy --validate
```
@ -133,20 +160,15 @@ content-moderation-training taxonomy --validate
## Current State
### Production Model: v15 mpnet fp16
- Macro F1: 0.944 (test, with per-category thresholds)
- 18/18 original categories pass gate
- Model: `models/v15_mpnet_full_overlap/onnx/model_fp16.onnx`
### Active Experiment: 17 (32-Category Expansion)
- 14 new categories added (adult subtypes + contextual moderation)
- Data generation in progress (targeting 500 pos + 400 hard neg per category)
- See EXPERIMENTS.md for full history and analysis
### Production Model: v2 mpnet fp16
- Macro F1: 0.934 (test, with tier-aware per-category thresholds)
- 33/33 categories pass tiered quality gates
- Model: `models/v2/onnx/model_fp16.onnx` (209 MB)
- Thresholds: `models/v2/onnx/thresholds.json`
### Known Constraints
- INT8 quantization (static or dynamic) destroys mpnet outputs — use fp16 only
- Multi-label co-detection is weak in v15 (0/5 scenarios pass)
- self_harm and csam have recall gaps on realistic inputs despite high test F1
- Multi-label co-detection is the primary weakness (model catches primary label, misses co-labels)
- Local LLM (llama-http) must be running for censored category generation
---
@ -165,4 +187,4 @@ content-moderation-training taxonomy --validate
`packages/content-moderation-feedback/` contains:
- **FeedbackClient** — JSONL-based feedback collection
- **Showcase app** — FastAPI with live ONNX inference
- **Regression test suite** – `tests/test_model_categories.py` (33 positive vectors, 37+ hard negatives, 5 multi-label scenarios)
- **Regression test suite** – `tests/test_model_categories.py` (33 positive vectors, 37+ hard negatives, multi-label scenarios)

README.md

@ -1,37 +1,22 @@
# Content Moderation Classifier
Multi-label text classifier for the Lilith platform. Detects 24 content moderation categories across platform messages, bios, listings, and reviews.
Multi-label text classifier for the Lilith platform. Detects 33 content moderation categories across platform messages, bios, listings, and reviews.
**Production model**: `all-mpnet-base-v2` fp16 ONNX — 219 MB, macro F1 0.944, 18/18 categories pass (F1 >= 0.85).
**Production model**: `all-mpnet-base-v2` fp16 ONNX — 209 MB, macro F1 0.934, 33/33 categories pass tiered quality gates.
## Categories
| Category | Severity | Description |
|----------|----------|-------------|
| threats | critical | Death/harm/violence threats, veiled threats |
| hate_speech | high | Racial, ethnic, gender, sexuality, religious, disability |
| csam | critical | Solicitation, distribution, grooming of minors |
| scam_patterns | high | Advance fee, deposit scam, phishing, fake escort |
| contact_info | medium | Phone numbers, emails, social media handles, external URLs |
| solicitation | medium | Explicit requests, price discussion, service negotiation |
| spam | low | Mass messages, promotional, repetitive content |
| profanity | low | Strong language, slurs, offensive terms |
| adult_content | medium | Explicit descriptions, nudity references, sexual content |
| doxxing | critical | Identity/address/workplace/family exposure |
| predatory_behavior | critical | Grooming, manipulation, power imbalance, boundary violation |
| law_enforcement | high | Sting language, entrapment patterns, investigation probing |
| sextortion | critical | Blackmail, extortion, threat of exposure, coercion |
| ncii | critical | Revenge porn, deepfakes, unauthorized intimate images |
| trafficking | critical | Sexual/labor trafficking, recruitment, advertisement |
| self_harm | critical | Suicide encouragement, self-injury, eating disorders |
| impersonation | high | Staff/creator/law enforcement impersonation |
| harassment | medium | Targeted abuse, bullying, stalking, persistent contact |
| age_play | medium | Adult age-play, daddy/little dynamics, infantilism (legal edge play) |
| bestiality | critical | Zoophilia, zoosadism, animal sexual content |
| necrophilia | critical | Sexual content involving corpses, death fetishism |
| scat | high | Coprophilia, emetophilia, bodily waste content |
| snuff | critical | Murder fantasy, erotophonophilia |
| extreme_gore | high | Extreme graphic violence, mutilation, torture content |
33 categories organized into 5 platform priority tiers:
| Tier | Semantics | Categories |
|------|-----------|-----------|
| **T1** (F1≥0.93, R≥0.90) | Zero-tolerance (criminal) | csam, trafficking, bestiality, self_harm |
| **T2** (F1≥0.84) | Worker safety | predatory_behavior, ncii, sextortion, threats |
| **T3** (F1≥0.84) | Exploitation/harm | harassment, hate_speech, anti_trans†, doxxing, financial_coercion, consent_violation, intoxication, extreme_gore, snuff |
| **T4** (F1≥0.85) | Platform policy | spam, scam_patterns, impersonation, law_enforcement, age_play, necrophilia, contact_info |
| **T5** (F1≥0.80) | Content routing | solicitation, adult_content, bdsm, edge_play, roleplay, furry, watersports, scat, profanity |
† `anti_trans` is optional and excluded from inference output by default (`include_optional_categories: false`).
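The gates in the tier table can be read as a simple per-category check. The sketch below is one possible encoding, using the F1 and recall floors from the table; `TIER_GATES` and `passes_gate` are hypothetical names and not the actual gate code in `evaluate.py`.

```python
# Gate values from the tier table. Only T1 has a recall floor.
TIER_GATES = {
    1: {"f1": 0.93, "recall": 0.90},
    2: {"f1": 0.84},
    3: {"f1": 0.84},
    4: {"f1": 0.85},
    5: {"f1": 0.80},
}


def passes_gate(tier, f1, recall=None):
    """True if a category's test metrics meet its tier's gate (hypothetical helper)."""
    gate = TIER_GATES[tier]
    if f1 < gate["f1"]:
        return False
    if "recall" in gate:
        # A tier with a recall floor fails when recall is missing or too low.
        return recall is not None and recall >= gate["recall"]
    return True


print(passes_gate(1, f1=0.95, recall=0.88))  # fails the T1 recall floor
print(passes_gate(5, f1=0.81))               # clears the T5 F1 gate
```

A T1 category can therefore fail despite a high F1 if its recall drops below 0.90, which is the point of the separate recall floor.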
## Quick Start
@ -46,54 +31,51 @@ content-moderation-training status
content-moderation-training run
# Run from a specific step
content-moderation-training run --from train
content-moderation-training run --from train-phase1
# Review generated examples
content-moderation-training review harassment positives --limit 10
# Validate taxonomy
content-moderation-training taxonomy --validate
# Evaluate the production model
python -m content_moderation_training.evaluate \
--model models/v15_mpnet_full_overlap/onnx/model_fp16.onnx \
--tokenizer models/v15_mpnet_full_overlap/onnx \
--model models/v2/onnx/model_fp16.onnx \
--tokenizer models/v2/onnx \
--test data/splits/test.jsonl \
--val data/splits/val.jsonl
# Generate classification showcase
python -m content_moderation_training.showcase \
--model models/v15_mpnet_full_overlap/onnx/model_fp16.onnx \
--tokenizer models/v15_mpnet_full_overlap/onnx \
--thresholds models/v15_mpnet_full_overlap/onnx/thresholds.json \
--test data/splits/test.jsonl \
--output docs/classification-examples.md
```
## Architecture
### Pipeline
The training pipeline has 7 steps, orchestrated by `lilith-ml-data-engine`:
10-step training pipeline orchestrated by `lilith-ml-data-engine`:
1. **generate-positives** — Generate positive examples for each category (500/cat, with multi-label overlap for co-occurring categories; Claude for most, local LLM for restricted categories)
2. **generate-negatives** — Generate hard negatives (400/cat for difficult categories, 200/cat otherwise) and 3000 innocuous examples
3. **generate-perturbations** — Adversarial perturbations from positive examples
4. **merge-data** — Merge all sources, apply train/val/test split
5. **train** — Fine-tune `all-mpnet-base-v2` via `train-text-classifier`
6. **export** — Export to ONNX with fp16 conversion
7. **evaluate** — Per-category F1 gate (>= 0.85), per-category threshold tuning
1. **generate-positives** — Positive examples per category (Claude + local LLM for restricted categories)
2. **generate-negatives** — Hard negatives + 3000 innocuous examples
3. **generate-perturbations** — Adversarial perturbation negatives from positives
4. **merge-data** — Merge all sources, apply multi-label enrichment, split train/val/test
5. **train-phase1** — Phase 1: category representations (positives + innocuous, 7 epochs, cosine LR)
6. **train-phase2** — Phase 2: decision boundaries (+ hard negatives, 7 epochs)
7. **train-phase3** — Phase 3: boundary sharpening (+ perturbation negatives, 10 epochs)
8. **export** — Export to ONNX with fp16 conversion
9. **evaluate** — Tier-aware threshold tuning + tiered quality gate
10. **report** — Classification examples report (docs/classification-examples.md)
### Source Modules
| Module | Purpose |
|--------|---------|
| `constants.py` | Label taxonomy (24 categories, canonical order) |
| `pipeline.py` | Pipeline step definitions |
| `claude_generator.py` | Positive + hard negative generation via Claude |
| `merge_data.py` | Data merging, multi-label enrichment, splitting |
| `constants.py` | Label taxonomy (33 categories, derived from CATEGORY_SPECS) |
| `pipeline.py` | Pipeline step definitions, tier pos_weight configuration |
| `claude_generator.py` | Positive + hard negative generation via Claude/local LLM |
| `merge_data.py` | Data merging, multi-label enrichment, phased splitting |
| `perturbation.py` | Adversarial perturbation generation |
| `evaluate.py` | ONNX inference, metrics, threshold tuning, quality gate |
| `showcase.py` | Generates classification showcase markdown from test samples |
| `llama_client.py` | Local LLM client (alternative to Claude) |
| `prompts/` | System prompts and category specifications |
| `evaluate.py` | ONNX inference, tier-aware thresholds, tiered quality gate |
| `showcase.py` | Classification report generator |
| `prompts/category_specs.py` | Single source of truth for all 33 categories |
### Data Format
@ -102,7 +84,7 @@ Training data is JSONL with context-prefixed text:
```json
{
"text": "[ADULT][MESSAGE] Your profile is stunning...",
"labels": {"threats": 0, "hate_speech": 0, ..., "harassment": 0},
"labels": {"threats": 0, "hate_speech": 0, ..., "anti_trans": 0},
"metadata": {"source": "claude_positive", "category": "spam", ...}
}
```
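For orientation, a record in this format could be read as sketched below: the context prefix is split off the text and the `labels` dict becomes a multi-hot vector in a fixed label order. The three-label `LABEL_NAMES` list, the `parse_record` helper, and the sample line are assumptions for the example; real records carry one entry per category.

```python
import json
import re

# Hypothetical short label order; the project derives the real one from CATEGORY_SPECS.
LABEL_NAMES = ["threats", "hate_speech", "spam"]

# Matches prefixes such as "[ADULT][MESSAGE] ..." described in the data format.
PREFIX_RE = re.compile(
    r"\[(GENERAL|ADULT)\]\[(BIO|MESSAGE|LISTING|REVIEW|GENERAL)\]\s*(.*)", re.S
)


def parse_record(line):
    """Parse one JSONL line into (audience, surface, body, multi-hot labels)."""
    record = json.loads(line)
    match = PREFIX_RE.match(record["text"])
    if match is None:
        raise ValueError("text is missing a context prefix")
    audience, surface, body = match.groups()
    vector = [int(record["labels"].get(name, 0)) for name in LABEL_NAMES]
    return audience, surface, body, vector


sample = json.dumps({
    "text": "[ADULT][MESSAGE] Your profile is stunning...",
    "labels": {"threats": 0, "hate_speech": 0, "spam": 1},
    "metadata": {"source": "claude_positive", "category": "spam"},
})
print(parse_record(sample))
```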
@ -115,70 +97,61 @@ Context prefixes (`[GENERAL|ADULT][BIO|MESSAGE|LISTING|REVIEW|GENERAL]`) encode
|----------|-------|
| Base model | `sentence-transformers/all-mpnet-base-v2` (110M params, 768-dim) |
| ONNX variant | fp16 |
| Size | 219 MB |
| Macro F1 | 0.944 |
| Quality gate | 18/18 pass (F1 >= 0.85) |
| Per-category thresholds | Tuned (see `thresholds.json`) |
| Path | `models/v15_mpnet_full_overlap/onnx/model_fp16.onnx` |
### Key Thresholds
Most categories use the default 0.30 threshold. Tuned exceptions:
| Category | Threshold | Reason |
|----------|-----------|--------|
| threats | 0.58 | Reduce false positives from assertive language |
| law_enforcement | 0.63 | Narrow boundary with legitimate investigation discussion |
| adult_content | 0.45 | Distinguish from clinical/educational content |
| predatory_behavior | 0.44 | Separate from legitimate mentorship language |
| harassment | 0.42 | Reduce overlap with criticism/assertive communication |
| ncii | 0.38 | Distinguish from deepfake detection discussion |
| Size | 209 MB |
| Macro F1 | 0.934 |
| Quality gate | 33/33 pass (tiered: T1≥0.93, T2/T3≥0.84, T4≥0.85, T5≥0.80) |
| Per-category thresholds | Tier-aware tuning (see `thresholds.json`) |
| Path | `models/v2/onnx/model_fp16.onnx` |
## Project Structure
```
content-moderation/
├── config.yaml # Generation config (model, batch sizes, categories)
├── config.yaml # Engine config (paths, concurrency, caps)
├── pyproject.toml # Package definition
├── EXPERIMENTS.md # Full experiment log (16 experiments, v1→v15)
├── EXPERIMENTS.md # Full experiment log (34 experiments)
├── src/
│ └── content_moderation_training/
│ ├── __main__.py # CLI entry point
│ ├── constants.py # Label taxonomy
│ ├── pipeline.py # Pipeline orchestration
│ ├── constants.py # Label taxonomy (derived from category_specs)
│ ├── pipeline.py # Pipeline orchestration + tier pos_weights
│ ├── claude_generator.py
│ ├── merge_data.py
│ ├── perturbation.py
│ ├── evaluate.py
│ ├── evaluate.py # Tier-aware thresholds + quality gates
│ ├── showcase.py
│ ├── llama_client.py
│ └── prompts/
│ └── category_specs.py # Single source of truth (33 categories)
├── data/
│ ├── generated/ # Generated training data per category
│ ├── splits/ # train.jsonl, val.jsonl, test.jsonl
│ ├── generated/ # Per-category positives + hard negatives
│ ├── splits/ # train/val/test + phased training splits
│ └── archive/ # Historical data snapshots
├── models/
│ └── v15_mpnet_full_overlap/
│ └── v2/
│ └── onnx/
│ ├── model.onnx # fp32 baseline (418 MB)
│ ├── model_fp16.onnx # Production model (219 MB)
│ ├── thresholds.json # Per-category thresholds
│ └── tokenizer files
│ ├── model_fp16.onnx # Production model (209 MB)
│ └── thresholds.json # Per-category decision thresholds
├── packages/
│ └── content-moderation-feedback/ # Feedback + showcase + regression tests
├── services/
│ └── inference-api/ # HTTP inference service
├── cache/ # Claude API response cache
└── docs/
└── classification-examples.md # Showcase with sample predictions
└── classification-examples.md # 1317 examples across 33 categories
```
## Experiment History
16 experiments across two model architectures — see [EXPERIMENTS.md](EXPERIMENTS.md) for the full log.
34 experiments across two model architectures — see [EXPERIMENTS.md](EXPERIMENTS.md) for the full log.
**Key milestones**:
- **v1–v10**: MiniLM-L6-v2 (22M params, 384-dim). Best: 17/18 categories passing. Harassment remained stuck at F1=0.829 despite data scaling, threshold tuning, co-label enrichment, and extended training.
- **v11–v13**: Multi-label generation by construction. Proved that generating text exhibiting multiple categories improves recall, but MiniLM lacks embedding capacity for 18 overlapping categories.
- **v14**: Model escalation to `all-mpnet-base-v2`. Fixed 3/5 failing categories immediately. INT8 quantization destroys mpnet (confirmed across static and dynamic variants).
- **v15**: Original overlap rates + mpnet = **18/18 PASS**. Macro F1 0.945.
- **v16 (optimization)**: fp16 conversion — 48% size reduction (418 → 219 MB), macro F1 0.944 (near-lossless).
- **Exp 1–10** (MiniLM-L6-v2): 22M params, 384-dim. Best: 17/18 categories passing.
- **Exp 14** (model escalation): `all-mpnet-base-v2` — fixed 3/5 failing categories immediately.
- **Exp 15**: 18/18 PASS. Macro F1 0.945. INT8 quantization confirmed broken for mpnet.
- **Exp 17–30**: 32-category expansion. Data quality refinement across overlap, seed, and hard negative experiments.
- **Exp 31**: 33rd category (anti_trans). GATE PASS, macro F1 0.935.
- **Exp 32–34**: 5-tier platform prioritization. Tier-aware threshold search + tiered quality gates. Key finding: tier differentiation works through evaluation policy, not data manipulation.
## Dependencies
@ -187,4 +160,3 @@ content-moderation/
- `onnxruntime` — ONNX inference
- `transformers` — Tokenizer
- `scikit-learn` — Metrics computation
- `numpy` — Array operations