
Content Moderation Classifier

Multi-label text classifier for the Lilith platform. Detects 24 content moderation categories across platform messages, bios, listings, and reviews.

Production model: all-mpnet-base-v2 fp16 ONNX — 219 MB, macro F1 0.944, 18/18 categories pass (F1 >= 0.85).

Categories

| Category | Severity | Description |
| --- | --- | --- |
| threats | critical | Death/harm/violence threats, veiled threats |
| hate_speech | high | Racial, ethnic, gender, sexuality, religious, disability |
| csam | critical | Solicitation, distribution, grooming of minors |
| scam_patterns | high | Advance fee, deposit scam, phishing, fake escort |
| contact_info | medium | Phone numbers, emails, social media handles, external URLs |
| solicitation | medium | Explicit requests, price discussion, service negotiation |
| spam | low | Mass messages, promotional, repetitive content |
| profanity | low | Strong language, slurs, offensive terms |
| adult_content | medium | Explicit descriptions, nudity references, sexual content |
| doxxing | critical | Identity/address/workplace/family exposure |
| predatory_behavior | critical | Grooming, manipulation, power imbalance, boundary violation |
| law_enforcement | high | Sting language, entrapment patterns, investigation probing |
| sextortion | critical | Blackmail, extortion, threat of exposure, coercion |
| ncii | critical | Revenge porn, deepfakes, unauthorized intimate images |
| trafficking | critical | Sexual/labor trafficking, recruitment, advertisement |
| self_harm | critical | Suicide encouragement, self-injury, eating disorders |
| impersonation | high | Staff/creator/law enforcement impersonation |
| harassment | medium | Targeted abuse, bullying, stalking, persistent contact |
| age_play | medium | Adult age-play, daddy/little dynamics, infantilism (legal edge play) |
| bestiality | critical | Zoophilia, zoosadism, animal sexual content |
| necrophilia | critical | Sexual content involving corpses, death fetishism |
| scat | high | Coprophilia, emetophilia, bodily waste content |
| snuff | critical | Murder fantasy, erotophonophilia |
| extreme_gore | high | Extreme graphic violence, mutilation, torture content |

Quick Start

# Install
pip install -e .

# Check pipeline status
content-moderation-training status

# Run full pipeline (generate data → merge → train → export → evaluate)
content-moderation-training run

# Run from a specific step
content-moderation-training run --from train

# Review generated examples
content-moderation-training review harassment positives --limit 10

# Evaluate the production model
python -m content_moderation_training.evaluate \
    --model models/v15_mpnet_full_overlap/onnx/model_fp16.onnx \
    --tokenizer models/v15_mpnet_full_overlap/onnx \
    --test data/splits/test.jsonl \
    --val data/splits/val.jsonl

# Generate classification showcase
python -m content_moderation_training.showcase \
    --model models/v15_mpnet_full_overlap/onnx/model_fp16.onnx \
    --tokenizer models/v15_mpnet_full_overlap/onnx \
    --thresholds models/v15_mpnet_full_overlap/onnx/thresholds.json \
    --test data/splits/test.jsonl \
    --output docs/classification-examples.md

Architecture

Pipeline

The training pipeline has 7 steps, orchestrated by lilith-ml-data-engine:

  1. generate-positives — Generate positive examples for each category (500/cat, with multi-label overlap for co-occurring categories; Claude for most, local LLM for restricted categories)
  2. generate-negatives — Generate hard negatives (400/cat for difficult categories, 200/cat otherwise) and 3000 innocuous examples
  3. generate-perturbations — Adversarial perturbations from positive examples
  4. merge-data — Merge all sources, apply train/val/test split
  5. train — Fine-tune all-mpnet-base-v2 via train-text-classifier
  6. export — Export to ONNX with fp16 conversion
  7. evaluate — Per-category F1 gate (>= 0.85), per-category threshold tuning
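The step order above, combined with the `run --from train` option shown in Quick Start, can be pictured as slicing an ordered list. The sketch below only illustrates that idea: `steps_from` is a hypothetical helper, not the lilith-ml-data-engine API, and the real orchestrator may resolve steps differently.

```python
from typing import List, Optional

# Step names as listed above; everything else here is illustrative.
PIPELINE_STEPS = [
    "generate-positives",
    "generate-negatives",
    "generate-perturbations",
    "merge-data",
    "train",
    "export",
    "evaluate",
]


def steps_from(start: Optional[str] = None) -> List[str]:
    """Return the steps to execute, beginning at `start` (or all steps)."""
    if start is None:
        return list(PIPELINE_STEPS)
    if start not in PIPELINE_STEPS:
        raise ValueError(f"unknown step: {start!r}")
    return PIPELINE_STEPS[PIPELINE_STEPS.index(start):]


# Resuming from "train" skips the four data steps.
print(steps_from("train"))
```

Under this reading, resuming from `train` assumes the merged splits from an earlier run are still on disk.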

Source Modules

| Module | Purpose |
| --- | --- |
| constants.py | Label taxonomy (24 categories, canonical order) |
| pipeline.py | Pipeline step definitions |
| claude_generator.py | Positive + hard negative generation via Claude |
| merge_data.py | Data merging, multi-label enrichment, splitting |
| perturbation.py | Adversarial perturbation generation |
| evaluate.py | ONNX inference, metrics, threshold tuning, quality gate |
| showcase.py | Generates classification showcase markdown from test samples |
| llama_client.py | Local LLM client (alternative to Claude) |
| prompts/ | System prompts and category specifications |
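The threshold tuning attributed to evaluate.py can be illustrated with a simple sweep: try candidate thresholds on validation scores for one category and keep the one with the highest F1. This is a common approach and only an assumption about how tuning might work; the function names and the candidate grid are hypothetical, not taken from evaluate.py.

```python
from typing import Optional, Sequence


def f1_at(scores: Sequence[float], labels: Sequence[int], threshold: float) -> float:
    """F1 for one category when scores >= threshold count as positive."""
    tp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 1)
    fp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 0)
    fn = sum(1 for s, y in zip(scores, labels) if s < threshold and y == 1)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def tune_threshold(
    scores: Sequence[float],
    labels: Sequence[int],
    candidates: Optional[Sequence[float]] = None,
) -> float:
    """Pick the candidate with the highest F1 (lowest one wins on ties)."""
    if candidates is None:
        candidates = [i / 100 for i in range(5, 96)]  # 0.05 .. 0.95
    return max(candidates, key=lambda t: f1_at(scores, labels, t))


# Made-up validation scores and labels for a single category.
val_scores = [0.10, 0.35, 0.40, 0.80, 0.90]
val_labels = [0, 0, 1, 1, 1]
print(tune_threshold(val_scores, val_labels))
```

Tuning on the validation split and reporting on the test split, as the `--val` and `--test` flags in Quick Start suggest, keeps the reported F1 from being inflated by the threshold search.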

Data Format

Training data is JSONL with context-prefixed text:

{
  "text": "[ADULT][MESSAGE] Your profile is stunning...",
  "labels": {"threats": 0, "hate_speech": 0, ..., "harassment": 0},
  "metadata": {"source": "claude_positive", "category": "spam", ...}
}

Context prefixes ([GENERAL|ADULT][BIO|MESSAGE|LISTING|REVIEW|GENERAL]) encode platform context so the model learns context-dependent classification.
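As a rough illustration of the record layout and prefix grammar above, the sketch below splits a prefixed text into its two context tags and body, then lists the categories marked 1. The helper names and the tuple labels (`audience`, `surface`) are made up for this example, and the record is abbreviated: real records carry every category key.

```python
import json
import re
from typing import List, Tuple

# Mirrors [GENERAL|ADULT][BIO|MESSAGE|LISTING|REVIEW|GENERAL] from above.
PREFIX_RE = re.compile(
    r"^\[(GENERAL|ADULT)\]\[(BIO|MESSAGE|LISTING|REVIEW|GENERAL)\]\s*(.*)$",
    re.DOTALL,
)


def split_prefix(text: str) -> Tuple[str, str, str]:
    """Split prefixed text into (audience, surface, body)."""
    match = PREFIX_RE.match(text)
    if match is None:
        raise ValueError("text does not start with a context prefix")
    return match.group(1), match.group(2), match.group(3)


def positive_labels(record: dict) -> List[str]:
    """Names of the categories marked 1 in a record's `labels` map."""
    return [name for name, flag in record["labels"].items() if flag == 1]


# One JSONL line, abbreviated to three labels for readability.
line = json.dumps({
    "text": "[ADULT][MESSAGE] Your profile is stunning...",
    "labels": {"threats": 0, "spam": 1, "harassment": 0},
    "metadata": {"source": "claude_positive", "category": "spam"},
})
record = json.loads(line)
print(split_prefix(record["text"]))
print(positive_labels(record))
```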

Production Model

| Property | Value |
| --- | --- |
| Base model | sentence-transformers/all-mpnet-base-v2 (110M params, 768-dim) |
| ONNX variant | fp16 |
| Size | 219 MB |
| Macro F1 | 0.944 |
| Quality gate | 18/18 pass (F1 >= 0.85) |
| Per-category thresholds | Tuned (see thresholds.json) |
| Path | models/v15_mpnet_full_overlap/onnx/model_fp16.onnx |
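The quality gate and macro F1 reported above can be read as follows: macro F1 is the unweighted mean of the per-category F1 scores, and the gate passes only if every category reaches 0.85. The sketch below shows that check. `quality_gate` is a hypothetical helper, and the scores are illustrative, not the production model's real per-category results.

```python
from statistics import mean
from typing import Dict, List, Tuple

F1_GATE = 0.85  # per-category pass mark stated above


def quality_gate(per_category_f1: Dict[str, float]) -> Tuple[float, List[str]]:
    """Return (macro F1, sorted list of categories below the gate)."""
    macro_f1 = mean(per_category_f1.values())
    failing = sorted(c for c, f1 in per_category_f1.items() if f1 < F1_GATE)
    return macro_f1, failing


# Illustrative numbers only; three categories instead of the full set.
example = {"threats": 0.96, "spam": 0.91, "harassment": 0.829}
macro, failing = quality_gate(example)
print(f"macro F1 = {macro:.3f}, failing = {failing}")
```

A healthy macro F1 can hide a single failing category, which is why the gate is applied per category rather than to the mean.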

Key Thresholds

Most categories use the default 0.30 threshold. Tuned exceptions:

| Category | Threshold | Reason |
| --- | --- | --- |
| threats | 0.58 | Reduce false positives from assertive language |
| law_enforcement | 0.63 | Narrow boundary with legitimate investigation discussion |
| adult_content | 0.45 | Distinguish from clinical/educational content |
| predatory_behavior | 0.44 | Separate from legitimate mentorship language |
| harassment | 0.42 | Reduce overlap with criticism/assertive communication |
| ncii | 0.38 | Distinguish from deepfake detection discussion |
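A minimal sketch of how these thresholds might be applied to per-category scores: each category falls back to the 0.30 default unless it has a tuned value. The helper name and the use of `>=` (rather than `>`) are assumptions, and the sketch hard-codes the table instead of guessing at the layout of thresholds.json.

```python
from typing import Dict, List

DEFAULT_THRESHOLD = 0.30

# Tuned exceptions copied from the table above.
TUNED_THRESHOLDS = {
    "threats": 0.58,
    "law_enforcement": 0.63,
    "adult_content": 0.45,
    "predatory_behavior": 0.44,
    "harassment": 0.42,
    "ncii": 0.38,
}


def flagged_categories(scores: Dict[str, float]) -> List[str]:
    """Categories whose score meets or exceeds their threshold."""
    return [
        category
        for category, score in scores.items()
        if score >= TUNED_THRESHOLDS.get(category, DEFAULT_THRESHOLD)
    ]


# Made-up scores: threats (0.50) stays below its tuned 0.58 threshold,
# while spam (0.35) clears the 0.30 default.
scores = {"threats": 0.50, "spam": 0.35, "harassment": 0.42, "ncii": 0.10}
print(flagged_categories(scores))
```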

Project Structure

content-moderation/
├── config.yaml              # Generation config (model, batch sizes, categories)
├── pyproject.toml            # Package definition
├── EXPERIMENTS.md            # Full experiment log (16 experiments, v1→v15)
├── src/
│   └── content_moderation_training/
│       ├── __main__.py       # CLI entry point
│       ├── constants.py      # Label taxonomy
│       ├── pipeline.py       # Pipeline orchestration
│       ├── claude_generator.py
│       ├── merge_data.py
│       ├── perturbation.py
│       ├── evaluate.py
│       ├── showcase.py
│       ├── llama_client.py
│       └── prompts/
├── data/
│   ├── generated/           # Generated training data per category
│   ├── splits/              # train.jsonl, val.jsonl, test.jsonl
│   └── archive/             # Historical data snapshots
├── models/
│   └── v15_mpnet_full_overlap/
│       └── onnx/
│           ├── model.onnx         # fp32 baseline (418 MB)
│           ├── model_fp16.onnx    # Production model (219 MB)
│           ├── thresholds.json    # Per-category thresholds
│           └── tokenizer files
├── cache/                   # Claude API response cache
└── docs/
    └── classification-examples.md  # Showcase with sample predictions

Experiment History

16 experiments across two model architectures — see EXPERIMENTS.md for the full log.

Key milestones:

  • v1-v10: MiniLM-L6-v2 (22M params, 384-dim). Best: 17/18 categories passing. Harassment remained stuck at F1=0.829 despite data scaling, threshold tuning, co-label enrichment, and extended training.
  • v11-v13: Multi-label generation by construction. Proved that generating text exhibiting multiple categories improves recall, but MiniLM lacks embedding capacity for 18 overlapping categories.
  • v14: Model escalation to all-mpnet-base-v2, which fixed 3 of the 5 failing categories immediately. INT8 quantization breaks mpnet's accuracy (confirmed across static and dynamic variants).
  • v15: Original overlap rates + mpnet = 18/18 PASS. Macro F1 0.945.
  • v16 (optimization): fp16 conversion — 48% size reduction (418 → 219 MB), macro F1 0.944 (near-lossless).
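The fp16 figures in the last milestone can be checked with a few lines of arithmetic, using only the sizes and macro F1 values quoted in this README:

```python
fp32_mb = 418  # model.onnx
fp16_mb = 219  # model_fp16.onnx

# Relative size reduction from fp32 to fp16.
reduction = (fp32_mb - fp16_mb) / fp32_mb

# Macro F1 before (v15, fp32) and after (v16, fp16) conversion.
f1_drop = 0.945 - 0.944

print(f"size reduction: {reduction:.1%}")
print(f"macro F1 drop: {f1_drop:.3f}")
```

The reduction comes out just under 48%, at a cost of about 0.001 macro F1.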

Dependencies

  • lilith-ml-data-engine — Pipeline orchestration framework
  • train-text-classifier — Model training + ONNX export CLI
  • onnxruntime — ONNX inference
  • transformers — Tokenizer
  • scikit-learn — Metrics computation
  • numpy — Array operations