docs: 📝 Implement structured documentation improvements in CLAUDE.md and README.md with new sections, reorganized content, and enhanced readability
Co-Authored-By: Lilith Autocommit <noreply@atlilith.com>
parent 7bbc6bd134
commit 7d2fa10d2a
2 changed files with 125 additions and 131 deletions
96
CLAUDE.md
@ -2,8 +2,8 @@
|
|||
|
||||
**Purpose**: Multi-label text classifier for content moderation — data generation, model training, ONNX export, and evaluation.
|
||||
**Base model**: sentence-transformers/all-mpnet-base-v2 (110M params, 768-dim)
|
||||
**Export format**: ONNX fp16 (219 MB) — INT8 quantization is incompatible with mpnet architecture
|
||||
**Quality gate**: F1 >= 0.85 per category on held-out test set
|
||||
**Export format**: ONNX fp16 (209 MB) — INT8 quantization is incompatible with mpnet architecture
|
||||
**Quality gate**: Tiered — T1≥0.93, T2/T3≥0.84, T4≥0.85, T5≥0.80
|
||||
|
||||
---
|
||||
|
||||
|
|
@ -11,19 +11,19 @@
|
|||
|
||||
```
|
||||
content-moderation/
|
||||
├── config.yaml # Engine config (paths, concurrency, category routing)
|
||||
├── config.yaml # Engine config (paths, concurrency, training caps)
|
||||
├── pyproject.toml # Package definition, CLI entry point
|
||||
├── EXPERIMENTS.md # Full experiment log (v1-v17, architecture decisions)
|
||||
├── EXPERIMENTS.md # Full experiment log (34 experiments)
|
||||
├── src/content_moderation_training/
|
||||
│ ├── __main__.py # CLI entry point (run, status, review, reset, taxonomy)
|
||||
│ ├── pipeline.py # Pipeline step definitions (7 steps)
|
||||
│ ├── pipeline.py # Pipeline step definitions (10 steps), tier pos_weights
|
||||
│ ├── constants.py # LABEL_NAMES, NUM_LABELS (derived from category_specs)
|
||||
│ ├── claude_generator.py # Dual-engine data generator (Claude + local LLM)
|
||||
│ ├── llama_client.py # OpenAI-compatible client for local LLM
|
||||
│ ├── merge_data.py # Merge sources, apply overlaps, split train/val/test
|
||||
│ ├── evaluate.py # ONNX inference + per-category F1 evaluation
|
||||
│ ├── evaluate.py # ONNX inference + tier-aware thresholds + tiered quality gate
|
||||
│ ├── perturbation.py # Adversarial perturbation negatives from positives
|
||||
│ ├── showcase.py # FastAPI showcase app
|
||||
│ ├── showcase.py # Classification report generator
|
||||
│ ├── paths.py # Centralized path resolution from config.yaml
|
||||
│ └── prompts/
|
||||
│ ├── category_specs.py # CATEGORY_SPECS — single source of truth for all categories
|
||||
|
|
@ -34,26 +34,39 @@ content-moderation/
|
|||
│ │ ├── {category}/hard_negatives.jsonl
|
||||
│ │ ├── innocuous.jsonl
|
||||
│ │ └── perturbation_negatives.jsonl
|
||||
│ ├── splits/ # train.jsonl, val.jsonl, test.jsonl
|
||||
│ ├── splits/ # train/val/test + train_phase1/phase2 splits
|
||||
│ └── archive/ # Historical data snapshots
|
||||
├── models/ # Trained model versions (v2-v15)
|
||||
│ └── v15_mpnet_full_overlap/ # Current production model
|
||||
├── models/
|
||||
│ └── v2/ # Current production model
|
||||
│ └── onnx/
|
||||
│ ├── model.onnx # fp32 baseline (418 MB)
|
||||
│ ├── model_fp16.onnx # Production model (219 MB)
|
||||
│ └── thresholds.json # Per-category decision thresholds
|
||||
│ ├── model_fp16.onnx # Production model (209 MB)
|
||||
│ └── thresholds.json # Tier-aware per-category decision thresholds
|
||||
├── packages/
|
||||
│ └── content-moderation-feedback/ # Feedback collection + showcase app + regression tests
|
||||
├── services/
|
||||
│ └── inference-api/ # HTTP inference service (FastAPI)
|
||||
├── cache/generated/ # ResponseCache (deterministic keys, skip existing)
|
||||
└── docs/ # Classification examples, taxonomy docs
|
||||
└── docs/
|
||||
└── classification-examples.md # 1317 examples across 33 categories
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
## Category Taxonomy
|
||||
|
||||
32 categories defined in `src/.../prompts/category_specs.py` (CATEGORY_SPECS dict).
|
||||
Each entry has: description, severity, subtypes, seed_examples, hard_negative_seeds, overlaps, secondary_label_rules.
|
||||
33 categories in 5 platform priority tiers, defined in `category_specs.py` (CATEGORY_SPECS dict).
|
||||
Each entry has: description, severity, platform_priority, subtypes, seed_examples, hard_negative_seeds, overlaps, secondary_label_rules.
|
||||
|
||||
| Tier | Gate | Categories |
|------|------|-----------|
| T1 (zero-tolerance) | F1≥0.93, R≥0.90 | csam, trafficking, bestiality, self_harm |
| T2 (worker safety) | F1≥0.84 | predatory_behavior, ncii, sextortion, threats |
| T3 (exploitation) | F1≥0.84 | harassment, hate_speech, anti_trans†, doxxing, financial_coercion, consent_violation, intoxication, extreme_gore, snuff |
| T4 (platform policy) | F1≥0.85 | spam, scam_patterns, impersonation, law_enforcement, age_play, necrophilia, contact_info |
| T5 (content routing) | F1≥0.80 | solicitation, adult_content, bdsm, edge_play, roleplay, furry, watersports, scat, profanity |
|
||||
|
||||
† `anti_trans` has `"optional": True` — excluded from inference output by default.
|
||||
|
||||
`constants.py` derives LABEL_NAMES and NUM_LABELS from CATEGORY_SPECS — adding a category means adding one dict entry.
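
The derivation described above could look roughly like the sketch below. It is only an illustration: the trimmed `CATEGORY_SPECS` dict, the severity value given for `anti_trans`, and the `inference_labels` helper are assumptions made for the example, not the contents of the real `category_specs.py` or `constants.py`.

```python
# Hypothetical, trimmed-down stand-in for prompts/category_specs.py.
# Real entries also carry description, subtypes, seed_examples, and so on.
CATEGORY_SPECS = {
    "csam": {"severity": "critical", "platform_priority": 1},
    "threats": {"severity": "critical", "platform_priority": 2},
    # The severity here is a placeholder; only the "optional" flag is documented.
    "anti_trans": {"severity": "high", "platform_priority": 3, "optional": True},
    "spam": {"severity": "low", "platform_priority": 4},
}

# Dicts preserve insertion order, so label order follows the spec file.
LABEL_NAMES = list(CATEGORY_SPECS)
NUM_LABELS = len(LABEL_NAMES)


def inference_labels(include_optional=False):
    """Return labels exposed at inference time; optional ones are hidden by default."""
    return [
        name
        for name, spec in CATEGORY_SPECS.items()
        if include_optional or not spec.get("optional", False)
    ]
```

With this shape, adding a category is a single new dict entry, and `NUM_LABELS` follows automatically.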
|
||||
|
||||
|
|
@ -67,11 +80,11 @@ All commands via `content-moderation-training` (installed entry point):
|
|||
|
||||
| Command | Purpose |
|
||||
|---------|---------|
|
||||
| `run --from STEP --to STEP` | Run pipeline steps (generate-positives through evaluate) |
|
||||
| `run --from STEP --to STEP` | Run pipeline steps (generate-positives through report) |
|
||||
| `status` | Per-category data counts + pipeline step status |
|
||||
| `review CATEGORY [positives\|hard_negatives] -n N` | Print examples for quality review |
|
||||
| `reset CATEGORY [--cache]` | Delete generated data to force re-generation |
|
||||
| `taxonomy` | List categories with severity |
|
||||
| `taxonomy` | List categories with severity and tier |
|
||||
| `taxonomy --specs` | Detailed spec coverage per category |
|
||||
| `taxonomy --overlaps` | Show multi-label overlap rules |
|
||||
| `taxonomy --validate` | CI check: all categories have complete specs |
|
||||
|
|
@ -83,13 +96,16 @@ All commands via `content-moderation-training` (installed entry point):
|
|||
1. **generate-positives** — Generate positive examples for all categories (Claude + local LLM)
|
||||
2. **generate-negatives** — Generate hard negatives and innocuous examples
|
||||
3. **generate-perturbations** — Adversarial perturbation negatives from existing positives
|
||||
4. **merge-data** — Merge all sources, apply multi-label overlaps, split train/val/test
|
||||
5. **train** — Fine-tune base model on merged training data (via train-text-classifier)
|
||||
6. **export** — Export to ONNX with quantization (via train-text-classifier)
|
||||
7. **evaluate** — Per-category F1 evaluation against test set (gate: >= 0.85)
|
||||
4. **merge-data** — Merge all sources, apply multi-label overlaps, split train/val/test + phased splits
|
||||
5. **train-phase1** — Phase 1: category representations (positives + innocuous, 7 epochs, cosine LR)
|
||||
6. **train-phase2** — Phase 2: decision boundaries (+ hard negatives, 7 epochs)
|
||||
7. **train-phase3** — Phase 3: boundary sharpening (+ perturbation negatives, 10 epochs)
|
||||
8. **export** — Export to ONNX with fp16 conversion
|
||||
9. **evaluate** — Tier-aware threshold tuning on val, tiered quality gate on test
|
||||
10. **report** — Classification examples report (docs/classification-examples.md)
|
||||
|
||||
Run a single step: `content-moderation-training run --from merge-data --to merge-data`
|
||||
Run from step to end: `content-moderation-training run --from train`
|
||||
Run from step to end: `content-moderation-training run --from train-phase1`
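
As a rough illustration of how an inclusive `--from`/`--to` range might resolve against the ten steps listed above, here is a minimal sketch. The `select_steps` function is hypothetical and is not taken from `pipeline.py` or `lilith-ml-data-engine`; only the step names come from this document.

```python
# Step names as listed in the pipeline section, in execution order.
PIPELINE_STEPS = [
    "generate-positives",
    "generate-negatives",
    "generate-perturbations",
    "merge-data",
    "train-phase1",
    "train-phase2",
    "train-phase3",
    "export",
    "evaluate",
    "report",
]


def select_steps(from_step=None, to_step=None):
    """Return the inclusive slice of steps between from_step and to_step."""
    start = PIPELINE_STEPS.index(from_step) if from_step else 0
    end = PIPELINE_STEPS.index(to_step) if to_step else len(PIPELINE_STEPS) - 1
    if start > end:
        raise ValueError(f"{from_step!r} comes after {to_step!r}")
    return PIPELINE_STEPS[start : end + 1]
```

Under this assumption, passing the same name for both bounds selects a single step, and omitting the upper bound runs through `report`.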
|
||||
|
||||
---
|
||||
|
||||
|
|
@ -106,6 +122,20 @@ The `ResponseCache` uses deterministic keys per (category, subtype, severity, se
|
|||
|
||||
---
|
||||
|
||||
## Tier-Aware Evaluation
|
||||
|
||||
The evaluation pipeline (`evaluate.py`) implements platform priority tiers:
|
||||
|
||||
- **Threshold search**: T1 searches 0.20–0.60 (recall-biased), T5 searches 0.40–0.90 (precision-biased)
|
||||
- **F1 gates**: T1≥0.93, T2/T3≥0.84, T4≥0.85, T5≥0.80
|
||||
- **Recall floor**: T1≥0.90 (criminal categories must not miss examples)
|
||||
- **Per-category ceiling**: harassment max threshold 0.65 (prevents val-set overfitting)
|
||||
|
||||
Training uses tier-differentiated pos_weight via `--pos-weight-overrides`:
|
||||
T1/T2/T3=10.0, T4=8.0, T5=6.0
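
The gate logic above can be sketched in a few lines. The numbers are copied from this section; the function names and the shape of the inputs are assumptions made for illustration and may not match `evaluate.py`.

```python
# Per-tier F1 gates and the T1 recall floor, as documented above.
F1_GATES = {1: 0.93, 2: 0.84, 3: 0.84, 4: 0.85, 5: 0.80}
RECALL_FLOORS = {1: 0.90}  # only T1 has a recall floor

# Per-category threshold ceilings; harassment is the only one documented.
THRESHOLD_CEILINGS = {"harassment": 0.65}

# Tier-differentiated pos_weight values used during training.
POS_WEIGHTS = {1: 10.0, 2: 10.0, 3: 10.0, 4: 8.0, 5: 6.0}


def passes_gate(tier, f1, recall):
    """A category passes if it meets its tier's F1 gate and any recall floor."""
    if f1 < F1_GATES[tier]:
        return False
    return recall >= RECALL_FLOORS.get(tier, 0.0)


def clamp_threshold(category, threshold):
    """Cap a tuned threshold at the category ceiling, if one is defined."""
    return min(threshold, THRESHOLD_CEILINGS.get(category, 1.0))
```

Note that a T1 category with F1 above 0.93 still fails in this sketch if its recall is below 0.90, which is the point of the recall floor.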
|
||||
|
||||
---
|
||||
|
||||
## Development Setup
|
||||
|
||||
```bash
|
||||
|
|
@ -115,9 +145,6 @@ pip install -e .
|
|||
# Run full pipeline status
|
||||
content-moderation-training status
|
||||
|
||||
# Run tests
|
||||
python -m pytest
|
||||
|
||||
# Verify taxonomy
|
||||
content-moderation-training taxonomy --validate
|
||||
```
|
||||
|
|
@ -133,20 +160,15 @@ content-moderation-training taxonomy --validate
|
|||
|
||||
## Current State
|
||||
|
||||
### Production Model: v15 mpnet fp16
|
||||
- Macro F1: 0.944 (test, with per-category thresholds)
|
||||
- 18/18 original categories pass gate
|
||||
- Model: `models/v15_mpnet_full_overlap/onnx/model_fp16.onnx`
|
||||
|
||||
### Active Experiment: 17 (32-Category Expansion)
|
||||
- 14 new categories added (adult subtypes + contextual moderation)
|
||||
- Data generation in progress (targeting 500 pos + 400 hard neg per category)
|
||||
- See EXPERIMENTS.md for full history and analysis
|
||||
### Production Model: v2 mpnet fp16
|
||||
- Macro F1: 0.934 (test, with tier-aware per-category thresholds)
|
||||
- 33/33 categories pass tiered quality gates
|
||||
- Model: `models/v2/onnx/model_fp16.onnx` (209 MB)
|
||||
- Thresholds: `models/v2/onnx/thresholds.json`
|
||||
|
||||
### Known Constraints
|
||||
- INT8 quantization (static or dynamic) destroys mpnet outputs — use fp16 only
|
||||
- Multi-label co-detection is weak in v15 (0/5 scenarios pass)
|
||||
- self_harm and csam have recall gaps on realistic inputs despite high test F1
|
||||
- Multi-label co-detection is the primary weakness (model catches primary label, misses co-labels)
|
||||
- Local LLM (llama-http) must be running for censored category generation
|
||||
|
||||
---
|
||||
|
|
@ -165,4 +187,4 @@ content-moderation-training taxonomy --validate
|
|||
`packages/content-moderation-feedback/` contains:
|
||||
- **FeedbackClient** — JSONL-based feedback collection
|
||||
- **Showcase app** — FastAPI with live ONNX inference
|
||||
- **Regression test suite** — `tests/test_model_categories.py` (33 positive vectors, 37+ hard negatives, 5 multi-label scenarios)
|
||||
- **Regression test suite** — `tests/test_model_categories.py` (33 positive vectors, 37+ hard negatives, multi-label scenarios)
|
||||
|
|
|
|||
160
README.md
@ -1,37 +1,22 @@
|
|||
# Content Moderation Classifier
|
||||
|
||||
Multi-label text classifier for the Lilith platform. Detects 24 content moderation categories across platform messages, bios, listings, and reviews.
|
||||
Multi-label text classifier for the Lilith platform. Detects 33 content moderation categories across platform messages, bios, listings, and reviews.
|
||||
|
||||
**Production model**: `all-mpnet-base-v2` fp16 ONNX — 219 MB, macro F1 0.944, 18/18 categories pass (F1 >= 0.85).
|
||||
**Production model**: `all-mpnet-base-v2` fp16 ONNX — 209 MB, macro F1 0.934, 33/33 categories pass tiered quality gates.
|
||||
|
||||
## Categories
|
||||
|
||||
| Category | Severity | Description |
|----------|----------|-------------|
| threats | critical | Death/harm/violence threats, veiled threats |
| hate_speech | high | Racial, ethnic, gender, sexuality, religious, disability |
| csam | critical | Solicitation, distribution, grooming of minors |
| scam_patterns | high | Advance fee, deposit scam, phishing, fake escort |
| contact_info | medium | Phone numbers, emails, social media handles, external URLs |
| solicitation | medium | Explicit requests, price discussion, service negotiation |
| spam | low | Mass messages, promotional, repetitive content |
| profanity | low | Strong language, slurs, offensive terms |
| adult_content | medium | Explicit descriptions, nudity references, sexual content |
| doxxing | critical | Identity/address/workplace/family exposure |
| predatory_behavior | critical | Grooming, manipulation, power imbalance, boundary violation |
| law_enforcement | high | Sting language, entrapment patterns, investigation probing |
| sextortion | critical | Blackmail, extortion, threat of exposure, coercion |
| ncii | critical | Revenge porn, deepfakes, unauthorized intimate images |
| trafficking | critical | Sexual/labor trafficking, recruitment, advertisement |
| self_harm | critical | Suicide encouragement, self-injury, eating disorders |
| impersonation | high | Staff/creator/law enforcement impersonation |
| harassment | medium | Targeted abuse, bullying, stalking, persistent contact |
| age_play | medium | Adult age-play, daddy/little dynamics, infantilism (legal edge play) |
| bestiality | critical | Zoophilia, zoosadism, animal sexual content |
| necrophilia | critical | Sexual content involving corpses, death fetishism |
| scat | high | Coprophilia, emetophilia, bodily waste content |
| snuff | critical | Murder fantasy, erotophonophilia |
| extreme_gore | high | Extreme graphic violence, mutilation, torture content |
|
||||
33 categories organized into 5 platform priority tiers:
|
||||
|
||||
| Tier | Semantics | Categories |
|------|-----------|-----------|
| **T1** (F1≥0.93, R≥0.90) | Zero-tolerance (criminal) | csam, trafficking, bestiality, self_harm |
| **T2** (F1≥0.84) | Worker safety | predatory_behavior, ncii, sextortion, threats |
| **T3** (F1≥0.84) | Exploitation/harm | harassment, hate_speech, anti_trans†, doxxing, financial_coercion, consent_violation, intoxication, extreme_gore, snuff |
| **T4** (F1≥0.85) | Platform policy | spam, scam_patterns, impersonation, law_enforcement, age_play, necrophilia, contact_info |
| **T5** (F1≥0.80) | Content routing | solicitation, adult_content, bdsm, edge_play, roleplay, furry, watersports, scat, profanity |
|
||||
|
||||
† `anti_trans` is optional — excluded from inference output by default (`include_optional_categories: false`).
|
||||
|
||||
## Quick Start
|
||||
|
||||
|
|
@ -46,54 +31,51 @@ content-moderation-training status
|
|||
content-moderation-training run
|
||||
|
||||
# Run from a specific step
|
||||
content-moderation-training run --from train
|
||||
content-moderation-training run --from train-phase1
|
||||
|
||||
# Review generated examples
|
||||
content-moderation-training review harassment positives --limit 10
|
||||
|
||||
# Validate taxonomy
|
||||
content-moderation-training taxonomy --validate
|
||||
|
||||
# Evaluate the production model
|
||||
python -m content_moderation_training.evaluate \
|
||||
--model models/v15_mpnet_full_overlap/onnx/model_fp16.onnx \
|
||||
--tokenizer models/v15_mpnet_full_overlap/onnx \
|
||||
--model models/v2/onnx/model_fp16.onnx \
|
||||
--tokenizer models/v2/onnx \
|
||||
--test data/splits/test.jsonl \
|
||||
--val data/splits/val.jsonl
|
||||
|
||||
# Generate classification showcase
|
||||
python -m content_moderation_training.showcase \
|
||||
--model models/v15_mpnet_full_overlap/onnx/model_fp16.onnx \
|
||||
--tokenizer models/v15_mpnet_full_overlap/onnx \
|
||||
--thresholds models/v15_mpnet_full_overlap/onnx/thresholds.json \
|
||||
--test data/splits/test.jsonl \
|
||||
--output docs/classification-examples.md
|
||||
```
|
||||
|
||||
## Architecture
|
||||
|
||||
### Pipeline
|
||||
|
||||
The training pipeline has 7 steps, orchestrated by `lilith-ml-data-engine`:
|
||||
10-step training pipeline orchestrated by `lilith-ml-data-engine`:
|
||||
|
||||
1. **generate-positives** — Generate positive examples for each category (500/cat, with multi-label overlap for co-occurring categories; Claude for most, local LLM for restricted categories)
|
||||
2. **generate-negatives** — Generate hard negatives (400/cat for difficult categories, 200/cat otherwise) and 3000 innocuous examples
|
||||
3. **generate-perturbations** — Adversarial perturbations from positive examples
|
||||
4. **merge-data** — Merge all sources, apply train/val/test split
|
||||
5. **train** — Fine-tune `all-mpnet-base-v2` via `train-text-classifier`
|
||||
6. **export** — Export to ONNX with fp16 conversion
|
||||
7. **evaluate** — Per-category F1 gate (>= 0.85), per-category threshold tuning
|
||||
1. **generate-positives** — Positive examples per category (Claude + local LLM for restricted categories)
|
||||
2. **generate-negatives** — Hard negatives + 3000 innocuous examples
|
||||
3. **generate-perturbations** — Adversarial perturbation negatives from positives
|
||||
4. **merge-data** — Merge all sources, apply multi-label enrichment, split train/val/test
|
||||
5. **train-phase1** — Phase 1: category representations (positives + innocuous, 7 epochs, cosine LR)
|
||||
6. **train-phase2** — Phase 2: decision boundaries (+ hard negatives, 7 epochs)
|
||||
7. **train-phase3** — Phase 3: boundary sharpening (+ perturbation negatives, 10 epochs)
|
||||
8. **export** — Export to ONNX with fp16 conversion
|
||||
9. **evaluate** — Tier-aware threshold tuning + tiered quality gate
|
||||
10. **report** — Classification examples report (docs/classification-examples.md)
|
||||
|
||||
### Source Modules
|
||||
|
||||
| Module | Purpose |
|
||||
|--------|---------|
|
||||
| `constants.py` | Label taxonomy (24 categories, canonical order) |
|
||||
| `pipeline.py` | Pipeline step definitions |
|
||||
| `claude_generator.py` | Positive + hard negative generation via Claude |
|
||||
| `merge_data.py` | Data merging, multi-label enrichment, splitting |
|
||||
| `constants.py` | Label taxonomy (33 categories, derived from CATEGORY_SPECS) |
|
||||
| `pipeline.py` | Pipeline step definitions, tier pos_weight configuration |
|
||||
| `claude_generator.py` | Positive + hard negative generation via Claude/local LLM |
|
||||
| `merge_data.py` | Data merging, multi-label enrichment, phased splitting |
|
||||
| `perturbation.py` | Adversarial perturbation generation |
|
||||
| `evaluate.py` | ONNX inference, metrics, threshold tuning, quality gate |
|
||||
| `showcase.py` | Generates classification showcase markdown from test samples |
|
||||
| `llama_client.py` | Local LLM client (alternative to Claude) |
|
||||
| `prompts/` | System prompts and category specifications |
|
||||
| `evaluate.py` | ONNX inference, tier-aware thresholds, tiered quality gate |
|
||||
| `showcase.py` | Classification report generator |
|
||||
| `prompts/category_specs.py` | Single source of truth for all 33 categories |
|
||||
|
||||
### Data Format
|
||||
|
||||
|
|
@ -102,7 +84,7 @@ Training data is JSONL with context-prefixed text:
|
|||
```json
|
||||
{
|
||||
"text": "[ADULT][MESSAGE] Your profile is stunning...",
|
||||
"labels": {"threats": 0, "hate_speech": 0, ..., "harassment": 0},
|
||||
"labels": {"threats": 0, "hate_speech": 0, ..., "anti_trans": 0},
|
||||
"metadata": {"source": "claude_positive", "category": "spam", ...}
|
||||
}
|
||||
```
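
A record in this format can be read with the standard library alone. The sketch below assumes the prefix pattern `[GENERAL|ADULT][BIO|MESSAGE|LISTING|REVIEW|GENERAL]` described in this README; the `parse_record` helper, the `audience`/`surface` field names, and the shortened sample labels are illustrative and not part of the package.

```python
import json
import re

# Two bracketed tags at the start of the text, per the documented prefix format.
PREFIX_RE = re.compile(
    r"^\[(GENERAL|ADULT)\]\[(BIO|MESSAGE|LISTING|REVIEW|GENERAL)\]\s*"
)


def parse_record(line):
    """Split one JSONL line into prefix tags, body text, and positive labels."""
    record = json.loads(line)
    match = PREFIX_RE.match(record["text"])
    if match is None:
        raise ValueError("missing context prefix")
    audience, surface = match.groups()
    positives = [name for name, value in record["labels"].items() if value == 1]
    return {
        "audience": audience,
        "surface": surface,
        "body": record["text"][match.end():],
        "positive_labels": positives,
    }


# Shortened sample: a real record carries one label key per category.
sample = json.dumps({
    "text": "[ADULT][MESSAGE] Your profile is stunning...",
    "labels": {"threats": 0, "spam": 1},
    "metadata": {"source": "claude_positive", "category": "spam"},
})
parsed = parse_record(sample)
```

Rejecting records without a prefix early is one way to catch malformed generations before they reach the merge step.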
|
||||
|
|
@ -115,70 +97,61 @@ Context prefixes (`[GENERAL|ADULT][BIO|MESSAGE|LISTING|REVIEW|GENERAL]`) encode
|
|||
|----------|-------|
|
||||
| Base model | `sentence-transformers/all-mpnet-base-v2` (110M params, 768-dim) |
|
||||
| ONNX variant | fp16 |
|
||||
| Size | 219 MB |
|
||||
| Macro F1 | 0.944 |
|
||||
| Quality gate | 18/18 pass (F1 >= 0.85) |
|
||||
| Per-category thresholds | Tuned (see `thresholds.json`) |
|
||||
| Path | `models/v15_mpnet_full_overlap/onnx/model_fp16.onnx` |
|
||||
|
||||
### Key Thresholds
|
||||
|
||||
Most categories use the default 0.30 threshold. Tuned exceptions:
|
||||
|
||||
| Category | Threshold | Reason |
|----------|-----------|--------|
| threats | 0.58 | Reduce false positives from assertive language |
| law_enforcement | 0.63 | Narrow boundary with legitimate investigation discussion |
| adult_content | 0.45 | Distinguish from clinical/educational content |
| predatory_behavior | 0.44 | Separate from legitimate mentorship language |
| harassment | 0.42 | Reduce overlap with criticism/assertive communication |
| ncii | 0.38 | Distinguish from deepfake detection discussion |
|
||||
| Size | 209 MB |
|
||||
| Macro F1 | 0.934 |
|
||||
| Quality gate | 33/33 pass (tiered: T1≥0.93, T2/T3≥0.84, T4≥0.85, T5≥0.80) |
|
||||
| Per-category thresholds | Tier-aware tuning (see `thresholds.json`) |
|
||||
| Path | `models/v2/onnx/model_fp16.onnx` |
|
||||
|
||||
## Project Structure
|
||||
|
||||
```
|
||||
content-moderation/
|
||||
├── config.yaml # Generation config (model, batch sizes, categories)
|
||||
├── config.yaml # Engine config (paths, concurrency, caps)
|
||||
├── pyproject.toml # Package definition
|
||||
├── EXPERIMENTS.md # Full experiment log (16 experiments, v1→v15)
|
||||
├── EXPERIMENTS.md # Full experiment log (34 experiments)
|
||||
├── src/
|
||||
│ └── content_moderation_training/
|
||||
│ ├── __main__.py # CLI entry point
|
||||
│ ├── constants.py # Label taxonomy
|
||||
│ ├── pipeline.py # Pipeline orchestration
|
||||
│ ├── constants.py # Label taxonomy (derived from category_specs)
|
||||
│ ├── pipeline.py # Pipeline orchestration + tier pos_weights
|
||||
│ ├── claude_generator.py
|
||||
│ ├── merge_data.py
|
||||
│ ├── perturbation.py
|
||||
│ ├── evaluate.py
|
||||
│ ├── evaluate.py # Tier-aware thresholds + quality gates
|
||||
│ ├── showcase.py
|
||||
│ ├── llama_client.py
|
||||
│ └── prompts/
|
||||
│ └── category_specs.py # Single source of truth (33 categories)
|
||||
├── data/
|
||||
│ ├── generated/ # Generated training data per category
|
||||
│ ├── splits/ # train.jsonl, val.jsonl, test.jsonl
|
||||
│ ├── generated/ # Per-category positives + hard negatives
|
||||
│ ├── splits/ # train/val/test + phased training splits
|
||||
│ └── archive/ # Historical data snapshots
|
||||
├── models/
|
||||
│ └── v15_mpnet_full_overlap/
|
||||
│ └── v2/
|
||||
│ └── onnx/
|
||||
│ ├── model.onnx # fp32 baseline (418 MB)
|
||||
│ ├── model_fp16.onnx # Production model (219 MB)
|
||||
│ ├── thresholds.json # Per-category thresholds
|
||||
│ └── tokenizer files
|
||||
│ ├── model_fp16.onnx # Production model (209 MB)
|
||||
│ └── thresholds.json # Per-category decision thresholds
|
||||
├── packages/
|
||||
│ └── content-moderation-feedback/ # Feedback + showcase + regression tests
|
||||
├── services/
|
||||
│ └── inference-api/ # HTTP inference service
|
||||
├── cache/ # Claude API response cache
|
||||
└── docs/
|
||||
└── classification-examples.md # Showcase with sample predictions
|
||||
└── classification-examples.md # 1317 examples across 33 categories
|
||||
```
|
||||
|
||||
## Experiment History
|
||||
|
||||
16 experiments across two model architectures — see [EXPERIMENTS.md](EXPERIMENTS.md) for the full log.
|
||||
34 experiments across two model architectures — see [EXPERIMENTS.md](EXPERIMENTS.md) for the full log.
|
||||
|
||||
**Key milestones**:
|
||||
- **v1–v10**: MiniLM-L6-v2 (22M params, 384-dim). Best: 17/18 categories passing. Harassment remained stuck at F1=0.829 despite data scaling, threshold tuning, co-label enrichment, and extended training.
|
||||
- **v11–v13**: Multi-label generation by construction. Proved that generating text exhibiting multiple categories improves recall, but MiniLM lacks embedding capacity for 18 overlapping categories.
|
||||
- **v14**: Model escalation to `all-mpnet-base-v2`. Fixed 3/5 failing categories immediately. INT8 quantization destroys mpnet (confirmed across static and dynamic variants).
|
||||
- **v15**: Original overlap rates + mpnet = **18/18 PASS**. Macro F1 0.945.
|
||||
- **v16 (optimization)**: fp16 conversion — 48% size reduction (418 → 219 MB), macro F1 0.944 (near-lossless).
|
||||
- **Exp 1–10** (MiniLM-L6-v2): 22M params, 384-dim. Best: 17/18 categories passing.
|
||||
- **Exp 14** (model escalation): `all-mpnet-base-v2` — fixed 3/5 failing categories immediately.
|
||||
- **Exp 15**: 18/18 PASS. Macro F1 0.945. INT8 quantization confirmed broken for mpnet.
|
||||
- **Exp 17–30**: 32-category expansion. Data quality refinement across overlap, seed, and hard negative experiments.
|
||||
- **Exp 31**: 33rd category (anti_trans). GATE PASS, macro F1 0.935.
|
||||
- **Exp 32–34**: 5-tier platform prioritization. Tier-aware threshold search + tiered quality gates. Key finding: tier differentiation works through evaluation policy, not data manipulation.
|
||||
|
||||
## Dependencies
|
||||
|
||||
|
|
@ -187,4 +160,3 @@ content-moderation/
|
|||
- `onnxruntime` — ONNX inference
|
||||
- `transformers` — Tokenizer
|
||||
- `scikit-learn` — Metrics computation
|
||||
- `numpy` — Array operations
|
||||
|
|
|
|||