Content Moderation Classifier

Multi-label text classifier for the Lilith platform. Detects 33 content moderation categories across platform messages, bios, listings, and reviews.

Production model: all-mpnet-base-v2 fp16 ONNX — 209 MB, macro F1 0.934, 33/33 categories pass tiered quality gates.

| Tier | Semantics | Categories |
| --- | --- | --- |
| T1 (F1≥0.93, R≥0.90) | Zero-tolerance (criminal) | csam, trafficking, bestiality, self_harm |
| T2 (F1≥0.84) | Worker safety | predatory_behavior, ncii, sextortion, threats |
| T3 (F1≥0.84) | Exploitation/harm | harassment, hate_speech, anti_trans†, doxxing, financial_coercion, consent_violation, intoxication, extreme_gore, snuff |
| T4 (F1≥0.85) | Platform policy | spam, scam_patterns, impersonation, law_enforcement, age_play, necrophilia, contact_info |
| T5 (F1≥0.80) | Content routing | solicitation, adult_content, bdsm, edge_play, roleplay, furry, watersports, scat, profanity |
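
As a rough illustration of how the tier thresholds in the table could be checked, the sketch below encodes each tier's minimum F1 and, for T1 only, its recall floor. The names `TierGate`, `TIER_GATES`, and `passes_gate` are hypothetical and are not taken from `evaluate.py`; only the numeric thresholds come from the table.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class TierGate:
    """Minimum scores a category must reach to pass its tier's gate."""
    min_f1: float
    min_recall: Optional[float] = None  # only T1 lists a recall floor


# Thresholds copied from the tier table; the structure itself is an assumption.
TIER_GATES = {
    "T1": TierGate(min_f1=0.93, min_recall=0.90),
    "T2": TierGate(min_f1=0.84),
    "T3": TierGate(min_f1=0.84),
    "T4": TierGate(min_f1=0.85),
    "T5": TierGate(min_f1=0.80),
}


def passes_gate(tier: str, f1: float, recall: float) -> bool:
    """Return True if a category's F1 (and recall, where required) clears its tier gate."""
    gate = TIER_GATES[tier]
    if f1 < gate.min_f1:
        return False
    if gate.min_recall is not None and recall < gate.min_recall:
        return False
    return True
```

Under this reading, a T1 category with a high F1 but recall below 0.90 would still fail, while T2 through T5 are judged on F1 alone.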

Quick Start

# Install
pip install -e .

# Check pipeline status
content-moderation-training status

# Run full pipeline (generate data → merge → train → export → evaluate)
content-moderation-training run

# Run from a specific step
content-moderation-training run --from train-phase1

# Review generated examples
content-moderation-training review harassment positives --limit 10

# Validate taxonomy
content-moderation-training taxonomy --validate

# Evaluate the production model
python -m content_moderation_training.evaluate \
    --model models/v2/onnx/model_fp16.onnx \
    --tokenizer models/v2/onnx \
    --test data/splits/test.jsonl \
    --val data/splits/val.jsonl

Architecture

Pipeline

10-step training pipeline orchestrated by lilith-ml-data-engine:

1. generate-positives: positive examples per category (Claude + local LLM for restricted categories)
2. generate-negatives: hard negatives + 3000 innocuous examples
3. generate-perturbations: adversarial perturbation negatives from positives
4. merge-data: merge all sources, apply multi-label enrichment, split train/val/test
5. train-phase1: Phase 1, category representations (positives + innocuous, 7 epochs, cosine LR)
6. train-phase2: Phase 2, decision boundaries (+ hard negatives, 7 epochs)
7. train-phase3: Phase 3, boundary sharpening (+ perturbation negatives, 10 epochs)
8. export: export to ONNX with fp16 conversion
9. evaluate: tier-aware threshold tuning + tiered quality gate
10. report: classification examples report (docs/classification-examples.md)
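
The step order above, together with the `run --from` option shown in Quick Start, can be sketched as a simple ordered list. This is only an assumption about the behaviour (that `--from` runs the named step and every later one). The real orchestration lives in lilith-ml-data-engine and is not shown here, and `PIPELINE_STEPS` and `steps_from` are hypothetical names.

```python
# Step names copied from the pipeline list, in execution order.
PIPELINE_STEPS = [
    "generate-positives",
    "generate-negatives",
    "generate-perturbations",
    "merge-data",
    "train-phase1",
    "train-phase2",
    "train-phase3",
    "export",
    "evaluate",
    "report",
]


def steps_from(start: str) -> list[str]:
    """Return the named step and every step after it (assumed --from semantics)."""
    if start not in PIPELINE_STEPS:
        raise ValueError(f"unknown step: {start!r}")
    return PIPELINE_STEPS[PIPELINE_STEPS.index(start):]


if __name__ == "__main__":
    # Would list the steps a resumed run is assumed to execute.
    print(steps_from("train-phase1"))
```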

Source Modules

| Module | Purpose |
| --- | --- |
| `constants.py` | Label taxonomy (33 categories, derived from CATEGORY_SPECS) |
| `pipeline.py` | Pipeline step definitions, tier pos_weight configuration |
| `claude_generator.py` | Positive + hard negative generation via Claude/local LLM |
| `merge_data.py` | Data merging, multi-label enrichment, phased splitting |
| `perturbation.py` | Adversarial perturbation generation |
| `evaluate.py` | ONNX inference, tier-aware thresholds, tiered quality gate |
| `showcase.py` | Classification report generator |
| `prompts/category_specs.py` | Single source of truth for all 33 categories |
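
The tier-aware threshold tuning attributed to `evaluate.py` is not spelled out in this README. One plausible reading, sketched below, is a per-category grid search on validation scores that maximizes F1 while discarding thresholds whose recall falls below a tier-specific floor (for example 0.90 for T1). The grid, the function names, and the tie-breaking rule are all assumptions.

```python
def precision_recall_f1(y_true, scores, threshold):
    """Compute precision, recall, and F1 for binary labels at one threshold."""
    tp = fp = fn = 0
    for label, score in zip(y_true, scores):
        predicted = score >= threshold
        if predicted and label:
            tp += 1
        elif predicted:
            fp += 1
        elif label:
            fn += 1
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


def tune_threshold(y_true, scores, min_recall=0.0, grid=None):
    """Pick the lowest grid threshold with the best F1 among those meeting the recall floor.

    Returns None when no threshold on the grid satisfies min_recall.
    """
    if grid is None:
        grid = [i / 100 for i in range(5, 96)]  # 0.05 .. 0.95, an arbitrary choice
    best = None  # (f1, threshold)
    for threshold in grid:
        _, recall, f1 = precision_recall_f1(y_true, scores, threshold)
        if recall < min_recall:
            continue
        if best is None or f1 > best[0]:
            best = (f1, threshold)
    return best[1] if best else None
```

With a recall floor, the search can settle on a lower threshold than the unconstrained F1 optimum. This is one way tier differentiation could be expressed purely through evaluation policy.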

Data Format

Training data is JSONL with context-prefixed text:

{
  "text": "[ADULT][MESSAGE] Your profile is stunning...",
  "labels": {"threats": 0, "hate_speech": 0, ..., "anti_trans": 0},
  "metadata": {"source": "claude_positive", "category": "spam", ...}
}
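
A minimal sketch of reading one such record follows. It assumes the context prefix is a leading run of bracketed, upper-case tags like `[ADULT][MESSAGE]`, which the README does not define as a grammar. The sample line is illustrative rather than real training data, and `split_context` and `positive_labels` are hypothetical helper names.

```python
import json
import re

# Leading run of bracketed tags such as "[ADULT][MESSAGE] " (assumed format).
_PREFIX = re.compile(r"^((?:\[[A-Z_]+\])+)\s*")


def split_context(text):
    """Split 'text' into (list of context tags, remaining message body)."""
    match = _PREFIX.match(text)
    if not match:
        return [], text
    tags = re.findall(r"\[([A-Z_]+)\]", match.group(1))
    return tags, text[match.end():]


def positive_labels(record):
    """Return the sorted names of labels set to 1 in a record."""
    return sorted(name for name, value in record["labels"].items() if value == 1)


# Illustrative JSONL line with only two of the 33 labels, for brevity.
line = (
    '{"text": "[ADULT][MESSAGE] Your profile is stunning", '
    '"labels": {"spam": 1, "threats": 0}, '
    '"metadata": {"source": "claude_positive", "category": "spam"}}'
)
record = json.loads(line)
tags, body = split_context(record["text"])
```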

Production Model

| Property | Value |
| --- | --- |
| Base model | `sentence-transformers/all-mpnet-base-v2` (110M params, 768-dim) |
| ONNX variant | fp16 |
| Size | 209 MB |
| Macro F1 | 0.934 |
| Quality gate | 33/33 pass (tiered: T1≥0.93, T2/T3≥0.84, T4≥0.85, T5≥0.80) |
| Per-category thresholds | Tier-aware tuning (see `thresholds.json`) |
| Path | `models/v2/onnx/model_fp16.onnx` |
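
To show how per-category thresholds might be applied at inference time, the sketch below converts one logit per category into a probability with a sigmoid (the usual choice for multi-label heads) and flags categories whose probability meets their threshold. It assumes `thresholds.json` is a flat mapping of category name to float and that the model emits raw logits. Neither is confirmed by this README, and the 0.5 fallback is hypothetical.

```python
import math


def sigmoid(x):
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)


def decide(logits, thresholds, default=0.5):
    """Return {category: probability} for categories at or above their threshold.

    logits:     mapping of category name to raw model output (assumed)
    thresholds: mapping of category name to decision threshold (assumed flat)
    default:    fallback threshold for categories missing from the mapping
    """
    flagged = {}
    for category, logit in logits.items():
        probability = sigmoid(logit)
        if probability >= thresholds.get(category, default):
            flagged[category] = probability
    return flagged
```

For example, `decide({"spam": 2.0, "threats": -3.0}, {"spam": 0.7})` would flag only `spam`, since the `threats` probability falls well below the fallback threshold.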

Project Structure

content-moderation/
├── config.yaml              # Engine config (paths, concurrency, caps)
├── pyproject.toml            # Package definition
├── EXPERIMENTS.md            # Full experiment log (34 experiments)
├── src/
│   └── content_moderation_training/
│       ├── __main__.py       # CLI entry point
│       ├── constants.py      # Label taxonomy (derived from category_specs)
│       ├── pipeline.py       # Pipeline orchestration + tier pos_weights
│       ├── claude_generator.py
│       ├── merge_data.py
│       ├── perturbation.py
│       ├── evaluate.py       # Tier-aware thresholds + quality gates
│       ├── showcase.py
│       └── prompts/
│           └── category_specs.py  # Single source of truth (33 categories)
├── data/
│   ├── generated/           # Per-category positives + hard negatives
│   ├── splits/              # train/val/test + phased training splits
│   └── archive/             # Historical data snapshots
├── models/
│   └── v2/
│       └── onnx/
│           ├── model.onnx         # fp32 baseline (418 MB)
│           ├── model_fp16.onnx    # Production model (209 MB)
│           └── thresholds.json    # Per-category decision thresholds
├── packages/
│   └── content-moderation-feedback/  # Feedback + showcase + regression tests
├── services/
│   └── inference-api/       # HTTP inference service
├── cache/                   # Claude API response cache
└── docs/
    └── classification-examples.md  # 1317 examples across 33 categories

Experiment History

34 experiments across two model architectures — see EXPERIMENTS.md for the full log.

Key milestones:

Exp 1–10 (MiniLM-L6-v2): 22M params, 384-dim. Best: 17/18 categories passing.
Exp 14 (model escalation): all-mpnet-base-v2 — fixed 3/5 failing categories immediately.
Exp 15: 18/18 PASS. Macro F1 0.945. INT8 quantization confirmed broken for mpnet.
Exp 17–30: 32-category expansion. Data quality refinement across overlap, seed, and hard negative experiments.
Exp 31: 33rd category (anti_trans). GATE PASS, macro F1 0.935.
Exp 32–34: 5-tier platform prioritization. Tier-aware threshold search + tiered quality gates. Key finding: tier differentiation works through evaluation policy, not data manipulation.

Dependencies

lilith-ml-data-engine — Pipeline orchestration framework
train-text-classifier — Model training + ONNX export CLI
onnxruntime — ONNX inference
transformers — Tokenizer
scikit-learn — Metrics computation
