Content Moderation Classifier
Multi-label text classifier for the Lilith platform. Detects 18 content moderation categories across platform messages, bios, listings, and reviews.
Production model: all-mpnet-base-v2 fp16 ONNX — 219 MB, macro F1 0.944, 18/18 categories pass (F1 >= 0.85).
Categories
| Category | Severity | Description |
|---|---|---|
| threats | critical | Death/harm/violence threats, veiled threats |
| hate_speech | high | Racial, ethnic, gender, sexuality, religious, disability |
| csam | critical | Solicitation, distribution, grooming of minors |
| scam_patterns | high | Advance fee, deposit scam, phishing, fake escort |
| contact_info | medium | Phone numbers, emails, social media handles, external URLs |
| solicitation | medium | Explicit requests, price discussion, service negotiation |
| spam | low | Mass messages, promotional, repetitive content |
| profanity | low | Strong language, slurs, offensive terms |
| adult_content | medium | Explicit descriptions, nudity references, sexual content |
| doxxing | critical | Identity/address/workplace/family exposure |
| predatory_behavior | critical | Grooming, manipulation, power imbalance, boundary violation |
| law_enforcement | high | Sting language, entrapment patterns, investigation probing |
| sextortion | critical | Blackmail, extortion, threat of exposure, coercion |
| ncii | critical | Revenge porn, deepfakes, unauthorized intimate images |
| trafficking | critical | Sexual/labor trafficking, recruitment, advertisement |
| self_harm | critical | Suicide encouragement, self-injury, eating disorders |
| impersonation | high | Staff/creator/law enforcement impersonation |
| harassment | medium | Targeted abuse, bullying, stalking, persistent contact |
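The severity column can be carried alongside the labels as a simple lookup. The sketch below copies the table into a dictionary and picks the most severe level among a set of flagged categories. The names `SEVERITY`, `SEVERITY_RANK`, and `highest_severity` are hypothetical, and the low-to-critical ordering is an assumption; how the platform actually consumes severity is not described here.

```python
# Severity per category, copied from the Categories table.
SEVERITY = {
    "threats": "critical",
    "hate_speech": "high",
    "csam": "critical",
    "scam_patterns": "high",
    "contact_info": "medium",
    "solicitation": "medium",
    "spam": "low",
    "profanity": "low",
    "adult_content": "medium",
    "doxxing": "critical",
    "predatory_behavior": "critical",
    "law_enforcement": "high",
    "sextortion": "critical",
    "ncii": "critical",
    "trafficking": "critical",
    "self_harm": "critical",
    "impersonation": "high",
    "harassment": "medium",
}

# Assumed ordering, lowest to highest (not stated in the README).
SEVERITY_RANK = ["low", "medium", "high", "critical"]


def highest_severity(flagged):
    """Return the most severe level among the flagged categories, or None."""
    levels = [SEVERITY[category] for category in flagged]
    if not levels:
        return None
    return max(levels, key=SEVERITY_RANK.index)
```

For example, `highest_severity(["spam", "harassment"])` resolves to the `medium` level, because harassment outranks spam in the table.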
Quick Start
# Install
pip install -e .
# Check pipeline status
content-moderation-training status
# Run full pipeline (generate data → merge → train → export → evaluate)
content-moderation-training run
# Run from a specific step
content-moderation-training run --from train
# Review generated examples
content-moderation-training review harassment positives --limit 10
# Evaluate the production model
python -m content_moderation_training.evaluate \
--model models/v15_mpnet_full_overlap/onnx/model_fp16.onnx \
--tokenizer models/v15_mpnet_full_overlap/onnx \
--test data/splits/test.jsonl \
--val data/splits/val.jsonl
# Generate classification showcase
python -m content_moderation_training.showcase \
--model models/v15_mpnet_full_overlap/onnx/model_fp16.onnx \
--tokenizer models/v15_mpnet_full_overlap/onnx \
--thresholds models/v15_mpnet_full_overlap/onnx/thresholds.json \
--test data/splits/test.jsonl \
--output docs/classification-examples.md
Architecture
Pipeline
The training pipeline has 7 steps, orchestrated by lilith-ml-data-engine:
- generate-positives — Claude generates positive examples for each category (500/cat, with multi-label overlap for co-occurring categories)
- generate-negatives — Claude generates hard negatives (400/cat for difficult categories, 200/cat otherwise) and 3000 innocuous examples
- generate-perturbations — Adversarial perturbations from positive examples
- merge-data — Merge all sources, apply train/val/test split
- train — Fine-tune all-mpnet-base-v2 via train-text-classifier
- export — Export to ONNX with fp16 conversion
- evaluate — Per-category F1 gate (>= 0.85), per-category threshold tuning
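The step order above is what makes `run --from train` meaningful: everything before the named step is skipped. The following is a minimal sketch of that selection logic only. `PIPELINE_STEPS` and `steps_from` are hypothetical names, and the real resume behavior lives in lilith-ml-data-engine, which is not shown here.

```python
# Step names in the order listed above.
PIPELINE_STEPS = [
    "generate-positives",
    "generate-negatives",
    "generate-perturbations",
    "merge-data",
    "train",
    "export",
    "evaluate",
]


def steps_from(start=None):
    """Return the steps to run, beginning at `start` (or all steps if None)."""
    if start is None:
        return list(PIPELINE_STEPS)
    if start not in PIPELINE_STEPS:
        raise ValueError(f"unknown step: {start!r}")
    return PIPELINE_STEPS[PIPELINE_STEPS.index(start):]
```

Under this sketch, `steps_from("train")` yields the train, export, and evaluate steps, so the generated data and splits are reused as they are.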
Source Modules
| Module | Purpose |
|---|---|
| constants.py | Label taxonomy (18 categories, canonical order) |
| pipeline.py | Pipeline step definitions |
| claude_generator.py | Positive + hard negative generation via Claude |
| merge_data.py | Data merging, multi-label enrichment, splitting |
| perturbation.py | Adversarial perturbation generation |
| evaluate.py | ONNX inference, metrics, threshold tuning, quality gate |
| showcase.py | Generates classification showcase markdown from test samples |
| llama_client.py | Local LLM client (alternative to Claude) |
| prompts/ | System prompts and category specifications |
Data Format
Training data is JSONL (one record per line) with context-prefixed text. A single record, pretty-printed here for readability:
{
  "text": "[ADULT][MESSAGE] Your profile is stunning...",
  "labels": {"threats": 0, "hate_speech": 0, ..., "harassment": 0},
  "metadata": {"source": "claude_positive", "category": "spam", ...}
}
Context prefixes ([GENERAL|ADULT][BIO|MESSAGE|LISTING|REVIEW|GENERAL]) encode platform context so the model learns context-dependent classification.
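As an illustration of the format, the sketch below splits a context prefix off a text field and reads JSONL records line by line. The helper names (`split_prefix`, `read_jsonl`) and the terms "audience" and "surface" for the two bracketed fields are assumptions made for this example, not names from the package.

```python
import json
import re

# Mirrors the prefix grammar quoted above: [GENERAL|ADULT][BIO|MESSAGE|LISTING|REVIEW|GENERAL]
PREFIX_RE = re.compile(
    r"^\[(GENERAL|ADULT)\]\[(BIO|MESSAGE|LISTING|REVIEW|GENERAL)\]\s*(.*)$",
    re.DOTALL,
)


def split_prefix(text):
    """Split '[ADULT][MESSAGE] body' into (audience, surface, body)."""
    match = PREFIX_RE.match(text)
    if match is None:
        raise ValueError("text has no recognised context prefix")
    return match.groups()


def read_jsonl(lines):
    """Yield one parsed record per non-empty line."""
    for line in lines:
        line = line.strip()
        if line:
            yield json.loads(line)
```

`read_jsonl` accepts any iterable of strings, so an open file handle on `data/splits/test.jsonl` works as input.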
Production Model
| Property | Value |
|---|---|
| Base model | sentence-transformers/all-mpnet-base-v2 (110M params, 768-dim) |
| ONNX variant | fp16 |
| Size | 219 MB |
| Macro F1 | 0.944 |
| Quality gate | 18/18 pass (F1 >= 0.85) |
| Per-category thresholds | Tuned (see thresholds.json) |
| Path | models/v15_mpnet_full_overlap/onnx/model_fp16.onnx |
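The two headline numbers, macro F1 and the per-category gate, are related in a simple way: macro F1 is the unweighted mean of the per-category F1 scores, and the gate requires every one of them to reach 0.85. The following is a dependency-free sketch of that logic under those definitions. The function names are hypothetical, and the project's own `evaluate.py` may compute these differently (for instance via scikit-learn).

```python
def f1_score(y_true, y_pred):
    """Binary F1 for one category from 0/1 labels and predictions."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def quality_gate(per_category_f1, minimum=0.85):
    """Summarise per-category F1 scores against the minimum F1 gate."""
    failing = sorted(c for c, f1 in per_category_f1.items() if f1 < minimum)
    macro_f1 = sum(per_category_f1.values()) / len(per_category_f1)
    return {"macro_f1": macro_f1, "passed": not failing, "failing": failing}
```

A result of "18/18 pass" corresponds to `quality_gate` returning an empty `failing` list for all 18 categories.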
Key Thresholds
Most categories use the default 0.30 threshold. Tuned exceptions:
| Category | Threshold | Reason |
|---|---|---|
| threats | 0.58 | Reduce false positives from assertive language |
| law_enforcement | 0.63 | Narrow boundary with legitimate investigation discussion |
| adult_content | 0.45 | Distinguish from clinical/educational content |
| predatory_behavior | 0.44 | Separate from legitimate mentorship language |
| harassment | 0.42 | Reduce overlap with criticism/assertive communication |
| ncii | 0.38 | Distinguish from deepfake detection discussion |
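Applying these thresholds amounts to a per-category lookup with a fallback to the default. The sketch below hard-codes the table rather than reading `thresholds.json`, whose layout is not shown here. It also assumes the model's outputs are already per-category probabilities and that a score equal to the threshold counts as a hit; both are assumptions, as is the name `flag_categories`.

```python
DEFAULT_THRESHOLD = 0.30

# Tuned exceptions, copied from the table above.
TUNED_THRESHOLDS = {
    "threats": 0.58,
    "law_enforcement": 0.63,
    "adult_content": 0.45,
    "predatory_behavior": 0.44,
    "harassment": 0.42,
    "ncii": 0.38,
}


def flag_categories(probabilities, thresholds=TUNED_THRESHOLDS,
                    default=DEFAULT_THRESHOLD):
    """Return the categories whose probability meets their threshold."""
    return sorted(
        category
        for category, prob in probabilities.items()
        if prob >= thresholds.get(category, default)
    )
```

With this logic a threats score of 0.50 is not flagged (below 0.58), while a spam score of 0.31 is flagged because spam falls back to the 0.30 default.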
Project Structure
content-moderation/
├── config.yaml                      # Generation config (model, batch sizes, categories)
├── pyproject.toml                   # Package definition
├── EXPERIMENTS.md                   # Full experiment log (16 experiments, v1→v15)
├── src/
│   └── content_moderation_training/
│       ├── __main__.py              # CLI entry point
│       ├── constants.py             # Label taxonomy
│       ├── pipeline.py              # Pipeline orchestration
│       ├── claude_generator.py
│       ├── merge_data.py
│       ├── perturbation.py
│       ├── evaluate.py
│       ├── showcase.py
│       ├── llama_client.py
│       └── prompts/
├── data/
│   ├── claude/                      # Generated training data per category
│   ├── splits/                      # train.jsonl, val.jsonl, test.jsonl
│   └── archive/                     # Historical data snapshots
├── models/
│   └── v15_mpnet_full_overlap/
│       └── onnx/
│           ├── model.onnx           # fp32 baseline (418 MB)
│           ├── model_fp16.onnx      # Production model (219 MB)
│           ├── thresholds.json      # Per-category thresholds
│           └── tokenizer files
├── cache/                           # Claude API response cache
└── docs/
    └── classification-examples.md   # Showcase with sample predictions
Experiment History
16 experiments across two model architectures — see EXPERIMENTS.md for the full log.
Key milestones:
- v1–v10: MiniLM-L6-v2 (22M params, 384-dim). Best: 17/18 categories passing. Harassment remained stuck at F1=0.829 despite data scaling, threshold tuning, co-label enrichment, and extended training.
- v11–v13: Multi-label generation by construction. Proved that generating text exhibiting multiple categories improves recall, but MiniLM lacks embedding capacity for 18 overlapping categories.
- v14: Model escalation to all-mpnet-base-v2. Fixed 3/5 failing categories immediately. INT8 quantization destroys mpnet (confirmed across static and dynamic variants).
- v15: Original overlap rates + mpnet = 18/18 PASS. Macro F1 0.945.
- v16 (optimization): fp16 conversion — 48% size reduction (418 → 219 MB), macro F1 0.944 (near-lossless).
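The 48% figure in the last item follows directly from the two file sizes. As a quick check of the arithmetic (the helper name is hypothetical):

```python
def size_reduction_percent(before_mb, after_mb):
    """Percentage by which the file shrank."""
    return 100.0 * (before_mb - after_mb) / before_mb


# fp32 model.onnx (418 MB) versus model_fp16.onnx (219 MB)
reduction = size_reduction_percent(418, 219)
print(f"{reduction:.1f}%")  # prints "47.6%", which rounds to the quoted 48%
```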
Dependencies
- lilith-ml-data-engine — Pipeline orchestration framework
- train-text-classifier — Model training + ONNX export CLI
- onnxruntime — ONNX inference
- transformers — Tokenizer
- scikit-learn — Metrics computation
- numpy — Array operations