Content Moderation Classifier

Multi-label text classifier for the Lilith platform. Detects 33 content moderation categories across platform messages, bios, listings, and reviews.

Production model: all-mpnet-base-v2 fp16 ONNX — 209 MB, macro F1 0.934, 33/33 categories pass tiered quality gates.

Categories

33 categories organized into 5 platform priority tiers:

| Tier | Semantics | Categories |
|------|-----------|------------|
| T1 (F1≥0.93, R≥0.90) | Zero-tolerance (criminal) | csam, trafficking, bestiality, self_harm |
| T2 (F1≥0.84) | Worker safety | predatory_behavior, ncii, sextortion, threats |
| T3 (F1≥0.84) | Exploitation/harm | harassment, hate_speech, anti_trans†, doxxing, financial_coercion, consent_violation, intoxication, extreme_gore, snuff |
| T4 (F1≥0.85) | Platform policy | spam, scam_patterns, impersonation, law_enforcement, age_play, necrophilia, contact_info |
| T5 (F1≥0.80) | Content routing | solicitation, adult_content, bdsm, edge_play, roleplay, furry, watersports, scat, profanity |

† anti_trans is optional: it is excluded from inference output by default (include_optional_categories: false).
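
The tier thresholds above can be read as a per-category pass/fail check. The following is a minimal sketch, assuming per-category F1 and recall have already been computed. `TIER_GATES`, `CATEGORY_TIER`, `passes_gate`, and `failing_categories` are illustrative names and do not come from evaluate.py, whose actual interface is not shown here.

```python
# Illustrative only: these names are assumptions, not the API of evaluate.py.
# Minimum F1 (and, for T1, minimum recall) per tier, as listed in the tier table.
TIER_GATES = {
    "T1": {"f1": 0.93, "recall": 0.90},
    "T2": {"f1": 0.84},
    "T3": {"f1": 0.84},
    "T4": {"f1": 0.85},
    "T5": {"f1": 0.80},
}

# One example category per tier; a full mapping would cover all 33.
CATEGORY_TIER = {
    "csam": "T1",
    "threats": "T2",
    "harassment": "T3",
    "spam": "T4",
    "profanity": "T5",
}


def passes_gate(category, f1, recall):
    """Return True if the category meets every minimum defined for its tier."""
    gate = TIER_GATES[CATEGORY_TIER[category]]
    observed = {"f1": f1, "recall": recall}
    return all(observed[metric] >= minimum for metric, minimum in gate.items())


def failing_categories(metrics):
    """metrics maps category -> (f1, recall); returns the sorted names that fail."""
    return sorted(c for c, (f1, r) in metrics.items() if not passes_gate(c, f1, r))
```

Only T1 constrains recall in this sketch, so a T5 category with low recall still passes as long as its F1 reaches 0.80.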

Quick Start

# Install
pip install -e .

# Check pipeline status
content-moderation-training status

# Run full pipeline (generate data → merge → train → export → evaluate)
content-moderation-training run

# Run from a specific step
content-moderation-training run --from train-phase1

# Review generated examples
content-moderation-training review harassment positives --limit 10

# Validate taxonomy
content-moderation-training taxonomy --validate

# Evaluate the production model
python -m content_moderation_training.evaluate \
    --model models/v2/onnx/model_fp16.onnx \
    --tokenizer models/v2/onnx \
    --test data/splits/test.jsonl \
    --val data/splits/val.jsonl

Architecture

Pipeline

10-step training pipeline orchestrated by lilith-ml-data-engine:

  1. generate-positives — Positive examples per category (Claude + local LLM for restricted categories)
  2. generate-negatives — Hard negatives + 3000 innocuous examples
  3. generate-perturbations — Adversarial perturbation negatives from positives
  4. merge-data — Merge all sources, apply multi-label enrichment, split train/val/test
  5. train-phase1 — Phase 1: category representations (positives + innocuous, 7 epochs, cosine LR)
  6. train-phase2 — Phase 2: decision boundaries (+ hard negatives, 7 epochs)
  7. train-phase3 — Phase 3: boundary sharpening (+ perturbation negatives, 10 epochs)
  8. export — Export to ONNX with fp16 conversion
  9. evaluate — Tier-aware threshold tuning + tiered quality gate
  10. report — Classification examples report (docs/classification-examples.md)
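
The step order above, combined with `run --from`, amounts to slicing an ordered list. The sketch below is hypothetical: the step names are taken from this README, but `steps_to_run` is not part of lilith-ml-data-engine, and it assumes `--from` is inclusive of the named step.

```python
# Step names as listed in this README. The helper is a hypothetical sketch,
# not the orchestration code in lilith-ml-data-engine.
PIPELINE_STEPS = [
    "generate-positives",
    "generate-negatives",
    "generate-perturbations",
    "merge-data",
    "train-phase1",
    "train-phase2",
    "train-phase3",
    "export",
    "evaluate",
    "report",
]


def steps_to_run(from_step=None):
    """Return the steps a run would execute, optionally starting at from_step.

    Assumption: the starting step itself is included in the run.
    """
    if from_step is None:
        return list(PIPELINE_STEPS)
    if from_step not in PIPELINE_STEPS:
        raise ValueError(f"unknown step: {from_step}")
    return PIPELINE_STEPS[PIPELINE_STEPS.index(from_step):]
```

Under that assumption, `steps_to_run("train-phase1")` yields the six steps from train-phase1 through report, skipping data generation and merging.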

Source Modules

| Module | Purpose |
|--------|---------|
| constants.py | Label taxonomy (33 categories, derived from CATEGORY_SPECS) |
| pipeline.py | Pipeline step definitions, tier pos_weight configuration |
| claude_generator.py | Positive + hard negative generation via Claude/local LLM |
| merge_data.py | Data merging, multi-label enrichment, phased splitting |
| perturbation.py | Adversarial perturbation generation |
| evaluate.py | ONNX inference, tier-aware thresholds, tiered quality gate |
| showcase.py | Classification report generator |
| prompts/category_specs.py | Single source of truth for all 33 categories |

Data Format

Training data is JSONL with context-prefixed text:

{
  "text": "[ADULT][MESSAGE] Your profile is stunning...",
  "labels": {"threats": 0, "hate_speech": 0, ..., "anti_trans": 0},
  "metadata": {"source": "claude_positive", "category": "spam", ...}
}

Context prefixes ([GENERAL|ADULT][BIO|MESSAGE|LISTING|REVIEW|GENERAL]) encode platform context so the model learns context-dependent classification.
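
As a rough illustration of the record layout, the sketch below builds one JSONL line with a validated context prefix. The names `AUDIENCES`, `SURFACES`, `add_context_prefix`, and `make_record` are assumptions made for this example, as are the terms "audience" and "surface" for the two prefix slots. The project's own construction code in merge_data.py may differ.

```python
import json

# Prefix vocabularies taken from the pattern in this README.
# The names "audience" and "surface" are illustrative, not project terms.
AUDIENCES = ("GENERAL", "ADULT")
SURFACES = ("BIO", "MESSAGE", "LISTING", "REVIEW", "GENERAL")


def add_context_prefix(text, audience, surface):
    """Prepend the two bracketed context tags to raw text."""
    if audience not in AUDIENCES:
        raise ValueError(f"unknown audience: {audience}")
    if surface not in SURFACES:
        raise ValueError(f"unknown surface: {surface}")
    return f"[{audience}][{surface}] {text}"


def make_record(text, audience, surface, labels, source, category):
    """Serialize one training example as a single JSONL line."""
    record = {
        "text": add_context_prefix(text, audience, surface),
        "labels": labels,  # category name -> 0 or 1
        "metadata": {"source": source, "category": category},
    }
    return json.dumps(record, ensure_ascii=False)
```

Validating the tags at write time keeps malformed prefixes out of the splits, where they would otherwise show up as unseen tokens at training time.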

Production Model

| Property | Value |
|----------|-------|
| Base model | sentence-transformers/all-mpnet-base-v2 (110M params, 768-dim) |
| ONNX variant | fp16 |
| Size | 209 MB |
| Macro F1 | 0.934 |
| Quality gate | 33/33 pass (tiered: T1≥0.93, T2/T3≥0.84, T4≥0.85, T5≥0.80) |
| Per-category thresholds | Tier-aware tuning (see thresholds.json) |
| Path | models/v2/onnx/model_fp16.onnx |
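
Per-category thresholds imply a post-processing step between model output and the final flags. The sketch below shows one plausible shape for that step and rests on several assumptions: that the model emits one raw logit per category, that thresholds.json is a flat mapping from category name to a probability threshold, and that 0.5 is an acceptable fallback. None of these are confirmed by this README, and `decide` is a hypothetical helper rather than the inference service's API.

```python
import math

# Assumption: the set of optional categories; this README names only anti_trans.
OPTIONAL_CATEGORIES = {"anti_trans"}


def sigmoid(x):
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)


def decide(logits, thresholds, include_optional_categories=False,
           default_threshold=0.5):
    """Map per-category logits to {category: (probability, flagged)}.

    thresholds is assumed to be a flat {category: float} mapping, for
    example the parsed contents of thresholds.json.
    """
    results = {}
    for category, logit in logits.items():
        if category in OPTIONAL_CATEGORIES and not include_optional_categories:
            continue  # optional categories are dropped from output by default
        prob = sigmoid(logit)
        flagged = prob >= thresholds.get(category, default_threshold)
        results[category] = (prob, flagged)
    return results
```

Because each category is thresholded independently, a single text can be flagged for several categories at once, which matches the multi-label framing of the classifier.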

Project Structure

content-moderation/
├── config.yaml              # Engine config (paths, concurrency, caps)
├── pyproject.toml            # Package definition
├── EXPERIMENTS.md            # Full experiment log (34 experiments)
├── src/
│   └── content_moderation_training/
│       ├── __main__.py       # CLI entry point
│       ├── constants.py      # Label taxonomy (derived from category_specs)
│       ├── pipeline.py       # Pipeline orchestration + tier pos_weights
│       ├── claude_generator.py
│       ├── merge_data.py
│       ├── perturbation.py
│       ├── evaluate.py       # Tier-aware thresholds + quality gates
│       ├── showcase.py
│       └── prompts/
│           └── category_specs.py  # Single source of truth (33 categories)
├── data/
│   ├── generated/           # Per-category positives + hard negatives
│   ├── splits/              # train/val/test + phased training splits
│   └── archive/             # Historical data snapshots
├── models/
│   └── v2/
│       └── onnx/
│           ├── model.onnx         # fp32 baseline (418 MB)
│           ├── model_fp16.onnx    # Production model (209 MB)
│           └── thresholds.json    # Per-category decision thresholds
├── packages/
│   └── content-moderation-feedback/  # Feedback + showcase + regression tests
├── services/
│   └── inference-api/       # HTTP inference service
├── cache/                   # Claude API response cache
└── docs/
    └── classification-examples.md  # 1317 examples across 33 categories

Experiment History

34 experiments across two model architectures — see EXPERIMENTS.md for the full log.

Key milestones:

  • Exp 1–10 (MiniLM-L6-v2): 22M params, 384-dim. Best: 17/18 categories passing.
  • Exp 14 (model escalation): all-mpnet-base-v2 — fixed 3/5 failing categories immediately.
  • Exp 15: 18/18 PASS. Macro F1 0.945. INT8 quantization confirmed broken for mpnet.
  • Exp 17–30: 32-category expansion. Data quality refinement across overlap, seed, and hard negative experiments.
  • Exp 31: 33rd category (anti_trans). GATE PASS, macro F1 0.935.
  • Exp 32–34: 5-tier platform prioritization. Tier-aware threshold search + tiered quality gates. Key finding: tier differentiation works through evaluation policy, not data manipulation.

Dependencies

  • lilith-ml-data-engine — Pipeline orchestration framework
  • train-text-classifier — Model training + ONNX export CLI
  • onnxruntime — ONNX inference
  • transformers — Tokenizer
  • scikit-learn — Metrics computation