Content Moderation Classifier
Multi-label text classifier for the Lilith platform. Detects 33 content moderation categories across platform messages, bios, listings, and reviews.
Production model: all-mpnet-base-v2 fp16 ONNX — 209 MB, macro F1 0.934, 33/33 categories pass tiered quality gates.
Categories
33 categories organized into 5 platform priority tiers:
| Tier | Semantics | Categories |
|---|---|---|
| T1 (F1≥0.93, R≥0.90) | Zero-tolerance (criminal) | csam, trafficking, bestiality, self_harm |
| T2 (F1≥0.84) | Worker safety | predatory_behavior, ncii, sextortion, threats |
| T3 (F1≥0.84) | Exploitation/harm | harassment, hate_speech, anti_trans†, doxxing, financial_coercion, consent_violation, intoxication, extreme_gore, snuff |
| T4 (F1≥0.85) | Platform policy | spam, scam_patterns, impersonation, law_enforcement, age_play, necrophilia, contact_info |
| T5 (F1≥0.80) | Content routing | solicitation, adult_content, bdsm, edge_play, roleplay, furry, watersports, scat, profanity |
† anti_trans is optional — excluded from inference output by default (include_optional_categories: false).
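The tier thresholds in the table can be read as a simple per-category check. The sketch below is illustrative only: `TierGate`, `TIER_GATES`, and `passes_gate` are hypothetical names, not the API of evaluate.py, and the recall floor is applied only to T1 because that is the only tier where the table lists one.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class TierGate:
    """Minimum metrics a category must reach to pass its tier's gate."""
    min_f1: float
    min_recall: Optional[float] = None  # the table lists a recall floor for T1 only


# Thresholds copied from the tier table; the names here are hypothetical.
TIER_GATES = {
    "T1": TierGate(min_f1=0.93, min_recall=0.90),
    "T2": TierGate(min_f1=0.84),
    "T3": TierGate(min_f1=0.84),
    "T4": TierGate(min_f1=0.85),
    "T5": TierGate(min_f1=0.80),
}


def passes_gate(tier: str, f1: float, recall: float) -> bool:
    """Return True if a category's F1 (and recall, where required) clears its tier."""
    gate = TIER_GATES[tier]
    if f1 < gate.min_f1:
        return False
    if gate.min_recall is not None and recall < gate.min_recall:
        return False
    return True


# Made-up metric values, for illustration only (not measured results).
print(passes_gate("T1", f1=0.95, recall=0.85))  # fails on the recall floor
print(passes_gate("T5", f1=0.81, recall=0.60))  # passes: T5 has no recall floor
```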
Quick Start
# Install
pip install -e .
# Check pipeline status
content-moderation-training status
# Run full pipeline (generate data → merge → train → export → evaluate)
content-moderation-training run
# Run from a specific step
content-moderation-training run --from train-phase1
# Review generated examples
content-moderation-training review harassment positives --limit 10
# Validate taxonomy
content-moderation-training taxonomy --validate
# Evaluate the production model
python -m content_moderation_training.evaluate \
    --model models/v2/onnx/model_fp16.onnx \
    --tokenizer models/v2/onnx \
    --test data/splits/test.jsonl \
    --val data/splits/val.jsonl
Architecture
Pipeline
10-step training pipeline orchestrated by lilith-ml-data-engine:
- generate-positives — Positive examples per category (Claude + local LLM for restricted categories)
- generate-negatives — Hard negatives + 3000 innocuous examples
- generate-perturbations — Adversarial perturbation negatives from positives
- merge-data — Merge all sources, apply multi-label enrichment, split train/val/test
- train-phase1 — Phase 1: category representations (positives + innocuous, 7 epochs, cosine LR)
- train-phase2 — Phase 2: decision boundaries (+ hard negatives, 7 epochs)
- train-phase3 — Phase 3: boundary sharpening (+ perturbation negatives, 10 epochs)
- export — Export to ONNX with fp16 conversion
- evaluate — Tier-aware threshold tuning + tiered quality gate
- report — Classification examples report (docs/classification-examples.md)
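The ordering of the steps above, and what `run --from <step>` plausibly selects, can be sketched as follows. This assumes `--from` means "run this step and everything after it"; how lilith-ml-data-engine actually resolves the flag is not shown here, and `PIPELINE_STEPS` and `steps_from` are hypothetical names.

```python
# Step names in the order listed above.
PIPELINE_STEPS = [
    "generate-positives",
    "generate-negatives",
    "generate-perturbations",
    "merge-data",
    "train-phase1",
    "train-phase2",
    "train-phase3",
    "export",
    "evaluate",
    "report",
]


def steps_from(start: str) -> list[str]:
    """Return the given step and all later steps (assumed --from semantics)."""
    if start not in PIPELINE_STEPS:
        raise ValueError(f"unknown pipeline step: {start!r}")
    return PIPELINE_STEPS[PIPELINE_STEPS.index(start):]


print(steps_from("train-phase1"))
```

Under that assumption, starting from `train-phase1` skips data generation and merging and reuses whatever splits are already on disk.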
Source Modules
| Module | Purpose |
|---|---|
| constants.py | Label taxonomy (33 categories, derived from CATEGORY_SPECS) |
| pipeline.py | Pipeline step definitions, tier pos_weight configuration |
| claude_generator.py | Positive + hard negative generation via Claude/local LLM |
| merge_data.py | Data merging, multi-label enrichment, phased splitting |
| perturbation.py | Adversarial perturbation generation |
| evaluate.py | ONNX inference, tier-aware thresholds, tiered quality gate |
| showcase.py | Classification report generator |
| prompts/category_specs.py | Single source of truth for all 33 categories |
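The phased splitting handled by merge_data.py follows the three training phases: positives plus innocuous examples, then hard negatives added, then perturbation negatives added. A minimal sketch of that selection, assuming each record carries a `metadata.source` tag; only `claude_positive` appears in this README, so the other tag values, `PHASE_SOURCES`, and `records_for_phase` are hypothetical.

```python
# Which example sources each phase trains on (cumulative, per the pipeline list).
# Only "claude_positive" is a tag shown in this README; the rest are assumed.
PHASE_SOURCES = {
    1: {"claude_positive", "innocuous"},
    2: {"claude_positive", "innocuous", "hard_negative"},
    3: {"claude_positive", "innocuous", "hard_negative", "perturbation_negative"},
}


def records_for_phase(records: list[dict], phase: int) -> list[dict]:
    """Keep only the records whose source is allowed in the given phase."""
    allowed = PHASE_SOURCES[phase]
    return [r for r in records if r["metadata"]["source"] in allowed]


# Tiny made-up dataset to show how the phases grow.
sample = [
    {"text": "a", "metadata": {"source": "claude_positive"}},
    {"text": "b", "metadata": {"source": "innocuous"}},
    {"text": "c", "metadata": {"source": "hard_negative"}},
    {"text": "d", "metadata": {"source": "perturbation_negative"}},
]
for phase in (1, 2, 3):
    print(phase, [r["text"] for r in records_for_phase(sample, phase)])
```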
Data Format
Training data is JSONL with context-prefixed text:
{
  "text": "[ADULT][MESSAGE] Your profile is stunning...",
  "labels": {"threats": 0, "hate_speech": 0, ..., "anti_trans": 0},
  "metadata": {"source": "claude_positive", "category": "spam", ...}
}
Context prefixes ([GENERAL|ADULT][BIO|MESSAGE|LISTING|REVIEW|GENERAL]) encode platform context so the model learns context-dependent classification.
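The prefix scheme can be illustrated with a small helper that builds and parses prefixed text. This is a sketch under assumptions: the names `scope`, `surface`, `add_prefix`, and `split_prefix` are hypothetical, and the single space after the prefix is inferred from the example record above.

```python
import re

# The two bracketed fields, as listed in the prefix pattern above.
SCOPES = ("GENERAL", "ADULT")
SURFACES = ("BIO", "MESSAGE", "LISTING", "REVIEW", "GENERAL")

PREFIX_RE = re.compile(
    r"^\[(GENERAL|ADULT)\]\[(BIO|MESSAGE|LISTING|REVIEW|GENERAL)\]\s*(.*)$",
    re.DOTALL,  # let the text body span multiple lines
)


def add_prefix(scope: str, surface: str, text: str) -> str:
    """Prepend the context prefix, rejecting values outside the known sets."""
    if scope not in SCOPES or surface not in SURFACES:
        raise ValueError(f"unknown context: [{scope}][{surface}]")
    return f"[{scope}][{surface}] {text}"


def split_prefix(prefixed: str) -> tuple[str, str, str]:
    """Split prefixed text back into (scope, surface, text)."""
    match = PREFIX_RE.match(prefixed)
    if match is None:
        raise ValueError("text does not start with a valid context prefix")
    return match.group(1), match.group(2), match.group(3)


print(add_prefix("ADULT", "MESSAGE", "Your profile is stunning..."))
print(split_prefix("[GENERAL][BIO] Hello there"))
```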
Production Model
| Property | Value |
|---|---|
| Base model | sentence-transformers/all-mpnet-base-v2 (110M params, 768-dim) |
| ONNX variant | fp16 |
| Size | 209 MB |
| Macro F1 | 0.934 |
| Quality gate | 33/33 pass (tiered: T1≥0.93, T2/T3≥0.84, T4≥0.85, T5≥0.80) |
| Per-category thresholds | Tier-aware tuning (see thresholds.json) |
| Path | models/v2/onnx/model_fp16.onnx |
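Per-category thresholds turn raw model outputs into label decisions. The sketch below assumes the model emits one logit per category, that thresholds.json is a flat category-to-float mapping, and that optional categories are dropped unless requested; none of this is confirmed by the README, and `decide`, `OPTIONAL_CATEGORIES`, and `default_threshold` are hypothetical names.

```python
import math

# anti_trans is the only category the README marks as optional.
OPTIONAL_CATEGORIES = {"anti_trans"}


def sigmoid(x: float) -> float:
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)


def decide(
    logits: dict[str, float],
    thresholds: dict[str, float],
    include_optional: bool = False,
    default_threshold: float = 0.5,  # assumed fallback when a category has no entry
) -> dict[str, float]:
    """Return {category: score} for every category whose score clears its threshold."""
    flagged = {}
    for category, logit in logits.items():
        if category in OPTIONAL_CATEGORIES and not include_optional:
            continue  # mirrors include_optional_categories: false
        score = sigmoid(logit)
        if score >= thresholds.get(category, default_threshold):
            flagged[category] = score
    return flagged


# Made-up logits and thresholds, for illustration only.
example_logits = {"spam": 2.0, "threats": -3.0, "anti_trans": 5.0}
example_thresholds = {"spam": 0.7, "threats": 0.3}
print(sorted(decide(example_logits, example_thresholds)))
```

Keeping the threshold lookup separate from the model call means thresholds can be retuned without re-exporting the ONNX file.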
Project Structure
content-moderation/
├── config.yaml                        # Engine config (paths, concurrency, caps)
├── pyproject.toml                     # Package definition
├── EXPERIMENTS.md                     # Full experiment log (34 experiments)
├── src/
│   └── content_moderation_training/
│       ├── __main__.py                # CLI entry point
│       ├── constants.py               # Label taxonomy (derived from category_specs)
│       ├── pipeline.py                # Pipeline orchestration + tier pos_weights
│       ├── claude_generator.py
│       ├── merge_data.py
│       ├── perturbation.py
│       ├── evaluate.py                # Tier-aware thresholds + quality gates
│       ├── showcase.py
│       └── prompts/
│           └── category_specs.py      # Single source of truth (33 categories)
├── data/
│   ├── generated/                     # Per-category positives + hard negatives
│   ├── splits/                        # train/val/test + phased training splits
│   └── archive/                       # Historical data snapshots
├── models/
│   └── v2/
│       └── onnx/
│           ├── model.onnx             # fp32 baseline (418 MB)
│           ├── model_fp16.onnx        # Production model (209 MB)
│           └── thresholds.json        # Per-category decision thresholds
├── packages/
│   └── content-moderation-feedback/   # Feedback + showcase + regression tests
├── services/
│   └── inference-api/                 # HTTP inference service
├── cache/                             # Claude API response cache
└── docs/
    └── classification-examples.md     # 1317 examples across 33 categories
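The two model file sizes are roughly what the parameter count predicts. A back-of-the-envelope check, assuming the 110M parameter figure from the Production Model table, that "MB" here means MiB, and ignoring graph metadata stored in the ONNX file (`weight_size_mib` is a hypothetical helper):

```python
PARAMS = 110_000_000  # approximate parameter count quoted for all-mpnet-base-v2
BYTES_PER_PARAM = {"fp32": 4, "fp16": 2}


def weight_size_mib(params: int, precision: str) -> float:
    """Estimate raw weight storage in MiB for a given numeric precision."""
    return params * BYTES_PER_PARAM[precision] / 2**20


for precision in ("fp32", "fp16"):
    print(precision, round(weight_size_mib(PARAMS, precision), 1), "MiB")
```

The estimates land near 420 MiB and 210 MiB, close to the listed 418 MB and 209 MB; the small gap is consistent with 110M being a rounded parameter count.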
Experiment History
34 experiments across two model architectures — see EXPERIMENTS.md for the full log.
Key milestones:
- Exp 1–10 (MiniLM-L6-v2): 22M params, 384-dim. Best: 17/18 categories passing.
- Exp 14 (model escalation): switching to all-mpnet-base-v2 fixed 3/5 failing categories immediately.
- Exp 15: 18/18 PASS. Macro F1 0.945. INT8 quantization confirmed broken for mpnet.
- Exp 17–30: 32-category expansion. Data quality refinement across overlap, seed, and hard negative experiments.
- Exp 31: 33rd category (anti_trans). GATE PASS, macro F1 0.935.
- Exp 32–34: 5-tier platform prioritization. Tier-aware threshold search + tiered quality gates. Key finding: tier differentiation works through evaluation policy, not data manipulation.
Dependencies
- lilith-ml-data-engine: Pipeline orchestration framework
- train-text-classifier: Model training + ONNX export CLI
- onnxruntime: ONNX inference
- transformers: Tokenizer
- scikit-learn: Metrics computation