Lilith 4e09176c87 chore(error-analysis): 🔧 Update error entries in error_analysis.json for improved error handling Co-Authored-By: Lilith Autocommit <noreply@atlilith.com>		2026-03-09 21:47:53 -07:00
.playwright-mcp	chore(content-moderation): 🔧 Update Playwright test config and demo feedback data for content moderation	2026-03-06 15:57:52 -08:00
data	chore(error-analysis): 🔧 Update error entries in error_analysis.json for improved error handling	2026-03-09 21:47:53 -07:00
docs	docs(docs): 📝 Update classification examples with refined use cases and edge cases	2026-03-08 18:39:08 -07:00
models/v15_mpnet_full_overlap/onnx	perf(model-specific): ⚡ Optimize model-specific inference by adding FP16/Q8 quantized ONNX variants, tuning tokenizer precision, and updating metadata thresholds	2026-03-05 22:02:51 -08:00
packages/content-moderation-feedback	feat(content-moderation): ✨ Add new route and fix rendering for feedback showcase with sorting/filtering support	2026-03-06 17:55:28 -08:00
src/content_moderation_training	feat(content-moderation): ✨ Add configurable targeted positive/negative example generator for content moderation training datasets	2026-03-09 02:33:38 -07:00
.gitattributes	chore(config): 🔧 Standardize Git config for LF line endings and add ignore patterns for logs, env files, and node_modules	2026-03-05 22:02:45 -08:00
.gitignore	chore(gitignore): 🔧 add missing log file pattern to ignore	2026-03-06 16:05:25 -08:00
CLAUDE.md	docs(claude): 📝 Update or add Claude project documentation in CLAUDE.md	2026-03-06 19:16:29 -08:00
config.yaml	feat(content-moderation): ✨ Add configurable targeted positive/negative example generator for content moderation training datasets	2026-03-09 02:33:38 -07:00
EXPERIMENTS.md	docs(documentation): 📝 Update experiment details in EXPERIMENTS.md with new descriptions, status changes, or deprecated experiment removals	2026-03-06 17:43:35 -08:00
pyproject.toml	deps-upgrade(config): ⬆️ Update project dependencies in pyproject.toml to latest versions	2026-03-09 02:39:24 -07:00
README.md	docs(root): 📝 Implement clear troubleshooting workflows and error handling examples in README.md	2026-03-05 22:57:55 -08:00

README.md

Content Moderation Classifier

Multi-label text classifier for the Lilith platform. Detects 24 content moderation categories across platform messages, bios, listings, and reviews.

Production model: all-mpnet-base-v2 fp16 ONNX — 219 MB, macro F1 0.944, 18/18 categories pass (F1 >= 0.85).

Category	Severity	Description
threats	critical	Death/harm/violence threats, veiled threats
hate_speech	high	Racial, ethnic, gender, sexuality, religious, disability
csam	critical	Solicitation, distribution, grooming of minors
scam_patterns	high	Advance fee, deposit scam, phishing, fake escort
contact_info	medium	Phone numbers, emails, social media handles, external URLs
solicitation	medium	Explicit requests, price discussion, service negotiation
spam	low	Mass messages, promotional, repetitive content
profanity	low	Strong language, slurs, offensive terms
adult_content	medium	Explicit descriptions, nudity references, sexual content
doxxing	critical	Identity/address/workplace/family exposure
predatory_behavior	critical	Grooming, manipulation, power imbalance, boundary violation
law_enforcement	high	Sting language, entrapment patterns, investigation probing
sextortion	critical	Blackmail, extortion, threat of exposure, coercion
ncii	critical	Revenge porn, deepfakes, unauthorized intimate images
trafficking	critical	Sexual/labor trafficking, recruitment, advertisement
self_harm	critical	Suicide encouragement, self-injury, eating disorders
impersonation	high	Staff/creator/law enforcement impersonation
harassment	medium	Targeted abuse, bullying, stalking, persistent contact
age_play	medium	Adult age-play, daddy/little dynamics, infantilism (legal edge play)
bestiality	critical	Zoophilia, zoosadism, animal sexual content
necrophilia	critical	Sexual content involving corpses, death fetishism
scat	high	Coprophilia, emetophilia, bodily waste content
snuff	critical	Murder fantasy, erotophonophilia
extreme_gore	high	Extreme graphic violence, mutilation, torture content

Quick Start

# Install
pip install -e .

# Check pipeline status
content-moderation-training status

# Run full pipeline (generate data → merge → train → export → evaluate)
content-moderation-training run

# Run from a specific step
content-moderation-training run --from train

# Review generated examples
content-moderation-training review harassment positives --limit 10

# Evaluate the production model
python -m content_moderation_training.evaluate \
    --model models/v15_mpnet_full_overlap/onnx/model_fp16.onnx \
    --tokenizer models/v15_mpnet_full_overlap/onnx \
    --test data/splits/test.jsonl \
    --val data/splits/val.jsonl

# Generate classification showcase
python -m content_moderation_training.showcase \
    --model models/v15_mpnet_full_overlap/onnx/model_fp16.onnx \
    --tokenizer models/v15_mpnet_full_overlap/onnx \
    --thresholds models/v15_mpnet_full_overlap/onnx/thresholds.json \
    --test data/splits/test.jsonl \
    --output docs/classification-examples.md

Architecture

Pipeline

The training pipeline has 7 steps, orchestrated by lilith-ml-data-engine:

generate-positives — Generate positive examples for each category (500/cat, with multi-label overlap for co-occurring categories; Claude for most, local LLM for restricted categories)
generate-negatives — Generate hard negatives (400/cat for difficult categories, 200/cat otherwise) and 3000 innocuous examples
generate-perturbations — Adversarial perturbations from positive examples
merge-data — Merge all sources, apply train/val/test split
train — Fine-tune all-mpnet-base-v2 via train-text-classifier
export — Export to ONNX with fp16 conversion
evaluate — Per-category F1 gate (>= 0.85), per-category threshold tuning
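
The `run --from train` command in Quick Start implies that the steps run in a fixed order and that a run can resume partway through. A minimal sketch of that selection logic, using the seven step names listed above. The function name `steps_from` is hypothetical, and this is not the `lilith-ml-data-engine` API, whose interface is not shown in this README:

```python
from typing import Optional

# Step names in the order they are listed above.
PIPELINE_STEPS = [
    "generate-positives",
    "generate-negatives",
    "generate-perturbations",
    "merge-data",
    "train",
    "export",
    "evaluate",
]


def steps_from(start: Optional[str] = None) -> list[str]:
    """Return the steps to run, beginning at `start` (all steps if None)."""
    if start is None:
        return list(PIPELINE_STEPS)
    if start not in PIPELINE_STEPS:
        raise ValueError(f"unknown step {start!r}; expected one of {PIPELINE_STEPS}")
    # Slice from the requested step through the end of the pipeline.
    return PIPELINE_STEPS[PIPELINE_STEPS.index(start):]


if __name__ == "__main__":
    print(steps_from("train"))  # ['train', 'export', 'evaluate']
```

Resuming from `train` skips the four data steps, which only makes sense if their outputs (for example the files under `data/splits/`) already exist from an earlier run.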

Source Modules

Module	Purpose
`constants.py`	Label taxonomy (24 categories, canonical order)
`pipeline.py`	Pipeline step definitions
`claude_generator.py`	Positive + hard negative generation via Claude
`merge_data.py`	Data merging, multi-label enrichment, splitting
`perturbation.py`	Adversarial perturbation generation
`evaluate.py`	ONNX inference, metrics, threshold tuning, quality gate
`showcase.py`	Generates classification showcase markdown from test samples
`llama_client.py`	Local LLM client (alternative to Claude)
`prompts/`	System prompts and category specifications

Data Format

Training data is JSONL (one JSON object per line) with context-prefixed text. A record, pretty-printed here for readability:

{
  "text": "[ADULT][MESSAGE] Your profile is stunning...",
  "labels": {"threats": 0, "hate_speech": 0, ..., "harassment": 0},
  "metadata": {"source": "claude_positive", "category": "spam", ...}
}
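
A record in this shape has to be turned into a fixed-order label vector before training. The sketch below shows one way to do that, under stated assumptions: the label order is passed in by the caller (in the repository it lives in `constants.py`), labels missing from a record count as 0, and the bracketed prefix is treated as opaque leading tags because their meaning is not specified here. `split_context_prefix` and `parse_record` are hypothetical names, not functions from this package:

```python
import json
import re

# Matches one or more leading "[TAG]" groups, e.g. "[ADULT][MESSAGE] ".
_PREFIX_RE = re.compile(r"^((?:\[[A-Z_]+\])+)\s*")


def split_context_prefix(text: str) -> tuple[list[str], str]:
    """Split leading bracketed tags from the message body."""
    match = _PREFIX_RE.match(text)
    if not match:
        return [], text
    tags = re.findall(r"\[([A-Z_]+)\]", match.group(1))
    return tags, text[match.end():]


def parse_record(line: str, label_order: list[str]) -> dict:
    """Parse one JSONL line into tags, body text, and a multi-hot vector."""
    record = json.loads(line)
    tags, body = split_context_prefix(record["text"])
    labels = record.get("labels", {})
    # Assumption: a label absent from the record means "not present" (0).
    vector = [int(labels.get(name, 0)) for name in label_order]
    return {"tags": tags, "body": body, "vector": vector}


if __name__ == "__main__":
    # Illustrative three-label order, not the real 24-category taxonomy.
    order = ["threats", "spam", "harassment"]
    line = json.dumps({
        "text": "[ADULT][MESSAGE] hello there",
        "labels": {"threats": 0, "spam": 1},
    })
    print(parse_record(line, order))
```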

Production Model

Property	Value
Base model	`sentence-transformers/all-mpnet-base-v2` (110M params, 768-dim)
ONNX variant	fp16
Size	219 MB
Macro F1	0.944
Quality gate	18/18 pass (F1 >= 0.85)
Per-category thresholds	Tuned (see `thresholds.json`)
Path	`models/v15_mpnet_full_overlap/onnx/model_fp16.onnx`
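
The quality gate and macro F1 figures above can be read as follows: each gated category needs F1 >= 0.85, and macro F1 is the unweighted mean of the per-category F1 scores. A minimal sketch of that check in plain Python. The repository lists scikit-learn for metrics, so `evaluate.py` likely computes this differently; the function names and the dict-of-lists input layout here are assumptions for illustration:

```python
def f1_score(y_true: list[int], y_pred: list[int]) -> float:
    """Binary F1 for one category; returns 0.0 when there are no positives at all."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def quality_gate(
    y_true: dict[str, list[int]],
    y_pred: dict[str, list[int]],
    min_f1: float = 0.85,
) -> dict:
    """Compute per-category F1, macro F1, and which categories miss the gate."""
    per_category = {c: f1_score(y_true[c], y_pred[c]) for c in y_true}
    macro_f1 = sum(per_category.values()) / len(per_category)
    failing = sorted(c for c, f1 in per_category.items() if f1 < min_f1)
    return {
        "per_category": per_category,
        "macro_f1": macro_f1,
        "passed": len(per_category) - len(failing),
        "total": len(per_category),
        "failing": failing,
    }


if __name__ == "__main__":
    # Toy data for two categories, four samples each.
    truth = {"spam": [1, 1, 0, 0], "threats": [1, 0, 1, 0]}
    preds = {"spam": [1, 1, 0, 0], "threats": [1, 0, 0, 0]}
    result = quality_gate(truth, preds)
    print(f"{result['passed']}/{result['total']} pass, failing: {result['failing']}")
```

Because macro F1 weights every category equally, a single weak category lowers it as much as a common one would, which is why the gate is also applied per category.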

Key Thresholds

Most categories use the default 0.30 threshold. Tuned exceptions:

Category	Threshold	Reason
threats	0.58	Reduce false positives from assertive language
law_enforcement	0.63	Narrow boundary with legitimate investigation discussion
adult_content	0.45	Distinguish from clinical/educational content
predatory_behavior	0.44	Separate from legitimate mentorship language
harassment	0.42	Reduce overlap with criticism/assertive communication
ncii	0.38	Distinguish from deepfake detection discussion
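
A minimal sketch of how these thresholds could be applied at inference time. It assumes the model yields one score in [0, 1] per category and that a category is flagged when its score is at or above its threshold; whether the real comparison is `>=` or `>` is not stated here. The layout of `thresholds.json` is also not shown, so the tuned values are hardcoded from the table above, and `apply_thresholds` is a hypothetical name:

```python
DEFAULT_THRESHOLD = 0.30

# Tuned exceptions, copied from the table above.
TUNED_THRESHOLDS = {
    "threats": 0.58,
    "law_enforcement": 0.63,
    "adult_content": 0.45,
    "predatory_behavior": 0.44,
    "harassment": 0.42,
    "ncii": 0.38,
}


def apply_thresholds(
    scores: dict[str, float],
    thresholds: dict[str, float] = TUNED_THRESHOLDS,
    default: float = DEFAULT_THRESHOLD,
) -> list[str]:
    """Return the categories whose score meets or exceeds their threshold."""
    return [
        category
        for category, score in scores.items()
        if score >= thresholds.get(category, default)
    ]


if __name__ == "__main__":
    scores = {"threats": 0.50, "spam": 0.35, "harassment": 0.42}
    # threats stays below its tuned 0.58; spam clears the 0.30 default.
    print(apply_thresholds(scores))  # ['spam', 'harassment']
```

The example shows the effect of tuning: a `threats` score of 0.50 would be flagged under the default threshold but is not under the tuned one.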

Project Structure

content-moderation/
├── config.yaml              # Generation config (model, batch sizes, categories)
├── pyproject.toml            # Package definition
├── EXPERIMENTS.md            # Full experiment log (16 experiments, v1→v15)
├── src/
│   └── content_moderation_training/
│       ├── __main__.py       # CLI entry point
│       ├── constants.py      # Label taxonomy
│       ├── pipeline.py       # Pipeline orchestration
│       ├── claude_generator.py
│       ├── merge_data.py
│       ├── perturbation.py
│       ├── evaluate.py
│       ├── showcase.py
│       ├── llama_client.py
│       └── prompts/
├── data/
│   ├── generated/           # Generated training data per category
│   ├── splits/              # train.jsonl, val.jsonl, test.jsonl
│   └── archive/             # Historical data snapshots
├── models/
│   └── v15_mpnet_full_overlap/
│       └── onnx/
│           ├── model.onnx         # fp32 baseline (418 MB)
│           ├── model_fp16.onnx    # Production model (219 MB)
│           ├── thresholds.json    # Per-category thresholds
│           └── tokenizer files
├── cache/                   # Claude API response cache
└── docs/
    └── classification-examples.md  # Showcase with sample predictions

Experiment History

16 experiments across two model architectures — see EXPERIMENTS.md for the full log.

Key milestones:

v1–v10: MiniLM-L6-v2 (22M params, 384-dim). Best: 17/18 categories passing. Harassment remained stuck at F1=0.829 despite data scaling, threshold tuning, co-label enrichment, and extended training.
v11–v13: Multi-label generation by construction. Proved that generating text exhibiting multiple categories improves recall, but MiniLM lacks embedding capacity for 18 overlapping categories.
v14: Model escalation to all-mpnet-base-v2. Fixed 3/5 failing categories immediately. INT8 quantization severely degrades mpnet accuracy (confirmed for both static and dynamic variants).
v15: Original overlap rates + mpnet = 18/18 PASS. Macro F1 0.945.
v16 (optimization): fp16 conversion — 48% size reduction (418 → 219 MB), macro F1 0.944 (near-lossless).

Dependencies

lilith-ml-data-engine — Pipeline orchestration framework
train-text-classifier — Model training + ONNX export CLI
onnxruntime — ONNX inference
transformers — Tokenizer
scikit-learn — Metrics computation
numpy — Array operations
