Content Moderation Classifier

Multi-label text classifier for the Lilith platform. Detects 18 content moderation categories across platform messages, bios, listings, and reviews.

Production model: all-mpnet-base-v2 fp16 ONNX — 219 MB, macro F1 0.944, 18/18 categories pass (F1 >= 0.85).

Categories

| Category | Severity | Description |
|---|---|---|
| threats | critical | Death/harm/violence threats, veiled threats |
| hate_speech | high | Racial, ethnic, gender, sexuality, religious, disability |
| csam | critical | Solicitation, distribution, grooming of minors |
| scam_patterns | high | Advance fee, deposit scam, phishing, fake escort |
| contact_info | medium | Phone numbers, emails, social media handles, external URLs |
| solicitation | medium | Explicit requests, price discussion, service negotiation |
| spam | low | Mass messages, promotional, repetitive content |
| profanity | low | Strong language, slurs, offensive terms |
| adult_content | medium | Explicit descriptions, nudity references, sexual content |
| doxxing | critical | Identity/address/workplace/family exposure |
| predatory_behavior | critical | Grooming, manipulation, power imbalance, boundary violation |
| law_enforcement | high | Sting language, entrapment patterns, investigation probing |
| sextortion | critical | Blackmail, extortion, threat of exposure, coercion |
| ncii | critical | Revenge porn, deepfakes, unauthorized intimate images |
| trafficking | critical | Sexual/labor trafficking, recruitment, advertisement |
| self_harm | critical | Suicide encouragement, self-injury, eating disorders |
| impersonation | high | Staff/creator/law enforcement impersonation |
| harassment | medium | Targeted abuse, bullying, stalking, persistent contact |

Quick Start

# Install
pip install -e .

# Check pipeline status
content-moderation-training status

# Run full pipeline (generate data → merge → train → export → evaluate)
content-moderation-training run

# Run from a specific step
content-moderation-training run --from train

# Review generated examples
content-moderation-training review harassment positives --limit 10

# Evaluate the production model
python -m content_moderation_training.evaluate \
    --model models/v15_mpnet_full_overlap/onnx/model_fp16.onnx \
    --tokenizer models/v15_mpnet_full_overlap/onnx \
    --test data/splits/test.jsonl \
    --val data/splits/val.jsonl

# Generate classification showcase
python -m content_moderation_training.showcase \
    --model models/v15_mpnet_full_overlap/onnx/model_fp16.onnx \
    --tokenizer models/v15_mpnet_full_overlap/onnx \
    --thresholds models/v15_mpnet_full_overlap/onnx/thresholds.json \
    --test data/splits/test.jsonl \
    --output docs/classification-examples.md

Architecture

Pipeline

The training pipeline has 7 steps, orchestrated by lilith-ml-data-engine:

  1. generate-positives — Claude generates positive examples for each category (500/cat, with multi-label overlap for co-occurring categories)
  2. generate-negatives — Claude generates hard negatives (400/cat for difficult categories, 200/cat otherwise) and 3000 innocuous examples
  3. generate-perturbations — Adversarial perturbations from positive examples
  4. merge-data — Merge all sources, apply train/val/test split
  5. train — Fine-tune all-mpnet-base-v2 via train-text-classifier
  6. export — Export to ONNX with fp16 conversion
  7. evaluate — Per-category F1 gate (>= 0.85), per-category threshold tuning
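The quality gate in step 7 can be sketched as follows. This is a minimal, dependency-free illustration and not the code in evaluate.py: the function names `f1_per_category` and `quality_gate` are hypothetical, and the project itself lists scikit-learn for metrics. Only the 0.85 gate value comes from this README.

```python
F1_GATE = 0.85  # gate value stated in this README


def f1_per_category(y_true, y_pred, categories):
    """Compute F1 for each label column of two 0/1 matrices (lists of rows)."""
    scores = {}
    for j, cat in enumerate(categories):
        pairs = [(t[j], p[j]) for t, p in zip(y_true, y_pred)]
        tp = sum(1 for t, p in pairs if t == 1 and p == 1)
        fp = sum(1 for t, p in pairs if t == 0 and p == 1)
        fn = sum(1 for t, p in pairs if t == 1 and p == 0)
        denom = 2 * tp + fp + fn
        # F1 = 2TP / (2TP + FP + FN); defined as 0.0 when the column is empty
        scores[cat] = 2 * tp / denom if denom else 0.0
    return scores


def quality_gate(scores, gate=F1_GATE):
    """Return (macro F1, categories whose F1 falls below the gate)."""
    macro = sum(scores.values()) / len(scores)
    failing = [cat for cat, s in scores.items() if s < gate]
    return macro, failing


# Toy data with two categories; real runs would use all 18.
categories = ["spam", "harassment"]
y_true = [[1, 0], [1, 1], [0, 1], [0, 0]]
y_pred = [[1, 0], [1, 0], [0, 1], [0, 0]]
scores = f1_per_category(y_true, y_pred, categories)
macro, failing = quality_gate(scores)
print(scores, round(macro, 3), failing)
```

In this toy run, spam is predicted perfectly while harassment misses one positive, so only harassment fails the gate.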

Source Modules

| Module | Purpose |
|---|---|
| constants.py | Label taxonomy (18 categories, canonical order) |
| pipeline.py | Pipeline step definitions |
| claude_generator.py | Positive + hard negative generation via Claude |
| merge_data.py | Data merging, multi-label enrichment, splitting |
| perturbation.py | Adversarial perturbation generation |
| evaluate.py | ONNX inference, metrics, threshold tuning, quality gate |
| showcase.py | Generates classification showcase markdown from test samples |
| llama_client.py | Local LLM client (alternative to Claude) |
| prompts/ | System prompts and category specifications |

Data Format

Training data is JSONL with context-prefixed text:

{
  "text": "[ADULT][MESSAGE] Your profile is stunning...",
  "labels": {"threats": 0, "hate_speech": 0, ..., "harassment": 0},
  "metadata": {"source": "claude_positive", "category": "spam", ...}
}

Context prefixes ([GENERAL|ADULT][BIO|MESSAGE|LISTING|REVIEW|GENERAL]) encode platform context so the model learns context-dependent classification.
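As an illustration of the prefix convention, the sketch below splits a record's text into platform, surface, and body. The helper `split_context` and the names "platform" and "surface" are assumptions made for this example and are not part of the package. The sample record is abbreviated and omits the labels field.

```python
import json
import re

# Mirrors the documented pattern [GENERAL|ADULT][BIO|MESSAGE|LISTING|REVIEW|GENERAL]
PREFIX_RE = re.compile(
    r"^\[(GENERAL|ADULT)\]\[(BIO|MESSAGE|LISTING|REVIEW|GENERAL)\]\s*"
)


def split_context(text):
    """Return (platform, surface, body) for a context-prefixed text."""
    match = PREFIX_RE.match(text)
    if match is None:
        raise ValueError(f"missing or malformed context prefix: {text[:40]!r}")
    return match.group(1), match.group(2), text[match.end():]


# One JSONL line, abbreviated for illustration (labels omitted).
line = json.dumps({
    "text": "[ADULT][MESSAGE] Your profile is stunning...",
    "metadata": {"source": "claude_positive", "category": "spam"},
})
record = json.loads(line)
platform, surface, body = split_context(record["text"])
print(platform, surface, body)
```

Rejecting unprefixed text early, as this sketch does, is one possible design choice. It keeps records without platform context out of training and evaluation rather than letting them pass silently.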

Production Model

| Property | Value |
|---|---|
| Base model | sentence-transformers/all-mpnet-base-v2 (110M params, 768-dim) |
| ONNX variant | fp16 |
| Size | 219 MB |
| Macro F1 | 0.944 |
| Quality gate | 18/18 pass (F1 >= 0.85) |
| Per-category thresholds | Tuned (see thresholds.json) |
| Path | models/v15_mpnet_full_overlap/onnx/model_fp16.onnx |

Key Thresholds

Most categories use the default 0.30 threshold. Tuned exceptions:

| Category | Threshold | Reason |
|---|---|---|
| threats | 0.58 | Reduce false positives from assertive language |
| law_enforcement | 0.63 | Narrow boundary with legitimate investigation discussion |
| adult_content | 0.45 | Distinguish from clinical/educational content |
| predatory_behavior | 0.44 | Separate from legitimate mentorship language |
| harassment | 0.42 | Reduce overlap with criticism/assertive communication |
| ncii | 0.38 | Distinguish from deepfake detection discussion |
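A minimal sketch of how per-category thresholds might be applied to model scores. It makes three assumptions: the scores are per-category probabilities in [0, 1], a category is flagged when its score is at or above the threshold, and the tuned values are hard-coded here rather than read from thresholds.json, whose exact layout is not shown in this README. `apply_thresholds` is a hypothetical helper.

```python
DEFAULT_THRESHOLD = 0.30  # default stated in this README

# Tuned exceptions copied from the Key Thresholds list.
TUNED_THRESHOLDS = {
    "threats": 0.58,
    "law_enforcement": 0.63,
    "adult_content": 0.45,
    "predatory_behavior": 0.44,
    "harassment": 0.42,
    "ncii": 0.38,
}


def apply_thresholds(probs, tuned=TUNED_THRESHOLDS, default=DEFAULT_THRESHOLD):
    """Map {category: probability} to {category: 0 or 1}.

    Assumption: a score equal to the threshold counts as a positive.
    """
    return {cat: int(p >= tuned.get(cat, default)) for cat, p in probs.items()}


probs = {"threats": 0.50, "spam": 0.35, "harassment": 0.42, "ncii": 0.37}
flags = apply_thresholds(probs)
print(flags)
```

Here a threats score of 0.50 is not flagged because of the raised 0.58 threshold, while a spam score of 0.35 is flagged under the 0.30 default.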

Project Structure

content-moderation/
├── config.yaml              # Generation config (model, batch sizes, categories)
├── pyproject.toml            # Package definition
├── EXPERIMENTS.md            # Full experiment log (16 experiments, v1→v15)
├── src/
│   └── content_moderation_training/
│       ├── __main__.py       # CLI entry point
│       ├── constants.py      # Label taxonomy
│       ├── pipeline.py       # Pipeline orchestration
│       ├── claude_generator.py
│       ├── merge_data.py
│       ├── perturbation.py
│       ├── evaluate.py
│       ├── showcase.py
│       ├── llama_client.py
│       └── prompts/
├── data/
│   ├── claude/              # Generated training data per category
│   ├── splits/              # train.jsonl, val.jsonl, test.jsonl
│   └── archive/             # Historical data snapshots
├── models/
│   └── v15_mpnet_full_overlap/
│       └── onnx/
│           ├── model.onnx         # fp32 baseline (418 MB)
│           ├── model_fp16.onnx    # Production model (219 MB)
│           ├── thresholds.json    # Per-category thresholds
│           └── tokenizer files
├── cache/                   # Claude API response cache
└── docs/
    └── classification-examples.md  # Showcase with sample predictions

Experiment History

16 experiments across two model architectures — see EXPERIMENTS.md for the full log.

Key milestones:

  • v1-v10: MiniLM-L6-v2 (22M params, 384-dim). Best: 17/18 categories passing. Harassment remained stuck at F1=0.829 despite data scaling, threshold tuning, co-label enrichment, and extended training.
  • v11-v13: Multi-label generation by construction. Showed that generating text exhibiting multiple categories improves recall, but MiniLM lacks the embedding capacity for 18 overlapping categories.
  • v14: Model escalation to all-mpnet-base-v2. Fixed 3/5 failing categories immediately. INT8 quantization destroys mpnet accuracy (confirmed for both static and dynamic variants).
  • v15: Original overlap rates + mpnet = 18/18 PASS. Macro F1 0.945.
  • v16 (optimization): fp16 conversion — 48% size reduction (418 → 219 MB), macro F1 0.944 (near-lossless).

Dependencies

  • lilith-ml-data-engine — Pipeline orchestration framework
  • train-text-classifier — Model training + ONNX export CLI
  • onnxruntime — ONNX inference
  • transformers — Tokenizer
  • scikit-learn — Metrics computation
  • numpy — Array operations