content-moderation/docs/classification-examples.md
Lilith 65ac12142d docs(docs): 📝 Add classification examples to clarify usage in docs/classification-examples.md
Co-Authored-By: Lilith Autocommit <noreply@atlilith.com>
2026-03-05 19:06:50 -08:00

Content Moderation Classification — Showcase Examples

Generated 2026-03-06 00:56 UTC from 20 stratified test samples.

Model

| Property | Value |
|---|---|
| Base model | all-mpnet-base-v2 (110M params, 768-dim) |
| ONNX variant | fp16 |
| Model size | 209 MB |
| Macro F1 (full test set) | 0.944 |
| Categories | 18 |
| Quality gate | 18/18 pass (F1 >= 0.85) |
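The macro F1 and quality-gate rows can be read as two small computations over per-category F1 scores. The sketch below assumes macro F1 is the unweighted mean of per-category F1 and that the gate is a simple per-category floor of 0.85; the function names and the three scores are hypothetical and are not taken from the training pipeline or the real test set.

```python
from statistics import mean

QUALITY_GATE_F1 = 0.85  # per-category floor quoted in the Model table


def macro_f1(per_category_f1):
    """Unweighted mean of per-category F1 (assumed definition of macro F1)."""
    return mean(per_category_f1.values())


def quality_gate(per_category_f1, floor=QUALITY_GATE_F1):
    """Return (number of categories passing, sorted names of failing ones)."""
    failing = sorted(c for c, f1 in per_category_f1.items() if f1 < floor)
    return len(per_category_f1) - len(failing), failing


# Hypothetical scores for illustration only; not real model results.
example_scores = {"category_a": 0.95, "category_b": 0.90, "category_c": 0.80}

passed, failing = quality_gate(example_scores)
print(f"macro F1: {macro_f1(example_scores):.3f}")
print(f"quality gate: {passed}/{len(example_scores)} pass, failing: {failing}")
```

With these made-up scores, one category falls below the floor, so the gate would report 2/3 rather than a full pass.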

Per-Category Thresholds

| Category | Threshold |
|---|---|
| threats | 0.58 * |
| hate_speech | 0.30 |
| csam | 0.30 |
| scam_patterns | 0.30 |
| contact_info | 0.30 |
| solicitation | 0.30 |
| spam | 0.30 |
| profanity | 0.30 |
| adult_content | 0.45 * |
| doxxing | 0.30 |
| predatory_behavior | 0.44 * |
| law_enforcement | 0.63 * |
| sextortion | 0.30 |
| ncii | 0.38 * |
| trafficking | 0.30 |
| self_harm | 0.30 |
| impersonation | 0.30 |
| harassment | 0.42 * |

* Marked thresholds were raised above 0.30 through per-category precision/recall tuning.
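A minimal sketch of how these thresholds could turn per-category confidences into multi-label decisions. The threshold values are copied from the table; the `decide` helper and the choice of `>=` (rather than `>`) as the comparison are assumptions made for illustration, not the project's confirmed implementation.

```python
# Per-category thresholds, copied from the table above.
THRESHOLDS = {
    "threats": 0.58, "hate_speech": 0.30, "csam": 0.30,
    "scam_patterns": 0.30, "contact_info": 0.30, "solicitation": 0.30,
    "spam": 0.30, "profanity": 0.30, "adult_content": 0.45,
    "doxxing": 0.30, "predatory_behavior": 0.44, "law_enforcement": 0.63,
    "sextortion": 0.30, "ncii": 0.38, "trafficking": 0.30,
    "self_harm": 0.30, "impersonation": 0.30, "harassment": 0.42,
}


def decide(scores):
    """Return the sorted categories whose confidence reaches its threshold.

    Assumes a category is POSITIVE when score >= threshold.
    """
    return sorted(cat for cat, score in scores.items()
                  if score >= THRESHOLDS[cat])


# Confidences reported for Example 13: both categories clear their thresholds.
print(decide({"sextortion": 0.9979, "harassment": 0.8999}))

# Confidence reported for Example 4: positive at 0.30.
print(decide({"scam_patterns": 0.4968}))

# The same score would not be positive under the higher threats threshold.
print(decide({"threats": 0.4968}))
```

Because each category is thresholded independently, a single text can receive several labels, which is how Example 13 ends up with both sextortion and harassment.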

Sample Summary

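19 of 20 samples had predictions that exactly matched the ground-truth labels. The one mismatch is Example 13, where the model predicted harassment in addition to the ground-truth sextortion label.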
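One plausible reading of a "perfectly matching" prediction is set equality between the ground-truth and predicted labels, so that an extra label counts as a mismatch even when the expected label is present. The sketch below applies that assumed rule to three of the samples in this document; `is_exact_match` is a hypothetical helper name.

```python
def is_exact_match(truth, predicted):
    """True when the predicted label set equals the ground-truth label set."""
    return set(truth) == set(predicted)


# Three samples from this document as (ground truth, predicted) pairs.
samples = [
    (["threats"], ["threats"]),                      # Example 1
    (["sextortion"], ["sextortion", "harassment"]),  # Example 13
    ([], []),                                        # Example 19 (innocuous)
]

matches = sum(is_exact_match(t, p) for t, p in samples)
print(f"{matches}/{len(samples)} exact matches")
```

Under this rule an innocuous sample matches only when the model predicts no category at all.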

Per-Category Results (sampled set only)

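| Category | TP | FP | FN | Sampled F1 |
|---|---|---|---|---|
| threats | 1 | 0 | 0 | 1.000 |
| hate_speech | 1 | 0 | 0 | 1.000 |
| csam | 1 | 0 | 0 | 1.000 |
| scam_patterns | 1 | 0 | 0 | 1.000 |
| contact_info | 1 | 0 | 0 | 1.000 |
| solicitation | 1 | 0 | 0 | 1.000 |
| spam | 1 | 0 | 0 | 1.000 |
| profanity | 1 | 0 | 0 | 1.000 |
| adult_content | 1 | 0 | 0 | 1.000 |
| doxxing | 1 | 0 | 0 | 1.000 |
| predatory_behavior | 1 | 0 | 0 | 1.000 |
| law_enforcement | 1 | 0 | 0 | 1.000 |
| sextortion | 1 | 0 | 0 | 1.000 |
| ncii | 1 | 0 | 0 | 1.000 |
| trafficking | 1 | 0 | 0 | 1.000 |
| self_harm | 1 | 0 | 0 | 1.000 |
| impersonation | 1 | 0 | 0 | 1.000 |
| harassment | 1 | 1 | 0 | 0.667 |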

Note: These counts are from the small stratified sample, not the full test set.
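The Sampled F1 column follows from the TP/FP/FN counts by the standard F1 formula, F1 = 2·TP / (2·TP + FP + FN). The short sketch below reproduces two rows of the table; the helper name is hypothetical, and returning 0.0 when all counts are zero is a convention chosen here, not something the source specifies.

```python
def f1_from_counts(tp, fp, fn):
    """F1 = 2*TP / (2*TP + FP + FN); returns 0.0 when all counts are zero."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


# harassment row: TP=1, FP=1, FN=0 -> precision 0.5, recall 1.0
print(f"harassment: {f1_from_counts(1, 1, 0):.3f}")  # prints "harassment: 0.667"

# threats row: TP=1, FP=0, FN=0
print(f"threats: {f1_from_counts(1, 0, 0):.3f}")     # prints "threats: 1.000"
```

The harassment false positive is the extra label on Example 13, which is why that row alone drops below 1.000.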

Examples

Example 1 MATCH

Text:

[ADULT][GENERAL] Don't come to the tournament. I'm telling you this as a courtesy. Something bad will happen to you if you show up.

Ground truth: threats Predicted: threats

| Category | Confidence | Threshold | Decision |
|---|---|---|---|
| threats | 0.9977 | 0.58 | POSITIVE |

Source: threats_positives

Example 2 MATCH

Text:

[GENERAL][LISTING] feminism was a psyop to double the tax base and destroy the family unit. women are measurably less happy than before 'liberation.' coincidence? they were manipulated.

Ground truth: hate_speech Predicted: hate_speech

| Category | Confidence | Threshold | Decision |
|---|---|---|---|
| hate_speech | 0.9984 | 0.30 | POSITIVE |

Source: hate_speech_positives

Example 3 MATCH

Text:

[GENERAL][BIO] Ive got a bunch of old home videos from my cousins birthday parties—kids being kids, nothing too serious. Anyone into nostalgic content?

Ground truth: csam Predicted: csam

| Category | Confidence | Threshold | Decision |
|---|---|---|---|
| csam | 0.9747 | 0.30 | POSITIVE |

Source: csam_positives

Example 4 MATCH

Text:

[ADULT][MESSAGE] Puppy deposit is $250 to hold your spot — we have 3 other families interested and can't wait. Once deposit clears we'll ship the puppy with all papers next week 🐶

Ground truth: scam_patterns Predicted: scam_patterns

| Category | Confidence | Threshold | Decision |
|---|---|---|---|
| scam_patterns | 0.4968 | 0.30 | POSITIVE |

Source: scam_patterns_positives

Example 5 MATCH

Text:

[GENERAL][LISTING] tips always appreciated 💕 cash app handle in my highlights if u wanna show some love

Ground truth: contact_info Predicted: contact_info

| Category | Confidence | Threshold | Decision |
|---|---|---|---|
| contact_info | 0.9985 | 0.30 | POSITIVE |

Source: contact_info_positives

Example 6 MATCH

Text:

[GENERAL][MESSAGE] saw ur listing — the hourly is kinda steep for me rn. do u ever do quick meets for like $80-100? even 30 mins would work

Ground truth: solicitation Predicted: solicitation

| Category | Confidence | Threshold | Decision |
|---|---|---|---|
| solicitation | 0.9982 | 0.30 | POSITIVE |

Source: solicitation_positives

Example 7 MATCH

Text:

[GENERAL][BIO] hello 😊 hello 😊 hello 😊 hello 😊 just checking if ur there hello 😊

Ground truth: spam Predicted: spam

| Category | Confidence | Threshold | Decision |
|---|---|---|---|
| spam | 0.9980 | 0.30 | POSITIVE |

Source: spam_positives

Example 8 MATCH

Text:

[ADULT][LISTING] this new UI is hot garbage, who approved this crap honestly

Ground truth: profanity Predicted: profanity

| Category | Confidence | Threshold | Decision |
|---|---|---|---|
| profanity | 0.9983 | 0.30 | POSITIVE |

Source: profanity_positives

Example 9 MATCH

Text:

[GENERAL][GENERAL] showering with the curtain open while my roommate's away, living my best life fr 😂 no clothes needed

Ground truth: adult_content Predicted: adult_content

| Category | Confidence | Threshold | Decision |
|---|---|---|---|
| adult_content | 0.9981 | 0.45 | POSITIVE |

Source: adult_content_positives

Example 10 MATCH

Text:

[ADULT][MESSAGE] filed a 'wellness concern' complaint with your nursing license board citing your posts here. included your NPI number so they'd find you fast. you did this to yourself

Ground truth: doxxing Predicted: doxxing

| Category | Confidence | Threshold | Decision |
|---|---|---|---|
| doxxing | 0.9978 | 0.30 | POSITIVE |

Source: doxxing_positives

Example 11 MATCH

Text:

[GENERAL][BIO] i followed you home just to make sure you got back safely. i do it every night actually, you just never noticed. i care about you that much. that's love, not stalking.

Ground truth: predatory_behavior Predicted: predatory_behavior

| Category | Confidence | Threshold | Decision |
|---|---|---|---|
| predatory_behavior | 0.9977 | 0.44 | POSITIVE |

Source: predatory_behavior_positives

Example 12 MATCH

Text:

[ADULT][GENERAL] my buddy said you could help me out with something special. he said to just ask you directly and you'd tell me exactly what you've got and what it'll run me

Ground truth: law_enforcement Predicted: law_enforcement

| Category | Confidence | Threshold | Decision |
|---|---|---|---|
| law_enforcement | 0.9982 | 0.63 | POSITIVE |

Source: law_enforcement_positives

Example 13 MISMATCH

Text:

[GENERAL][LISTING] if u dont get on cam with me rn im posting the screenshots to ur school's subreddit. i already know what year u are and ur dorm building so don't act like this isn't serious

Ground truth: sextortion Predicted: sextortion, harassment

| Category | Confidence | Threshold | Decision |
|---|---|---|---|
| sextortion | 0.9979 | 0.30 | POSITIVE |
| harassment | 0.8999 | 0.42 | POSITIVE |

Source: sextortion_positives

Example 14 MATCH

Text:

[ADULT][MESSAGE] found my bf's cloud backup unlocked on his laptop — his ex sent him a bunch of private stuff. screenshotted everything and passing it around bc she's a homewrecker anyway

Ground truth: ncii Predicted: ncii

| Category | Confidence | Threshold | Decision |
|---|---|---|---|
| ncii | 0.9982 | 0.38 | POSITIVE |

Source: ncii_positives

Example 15 MATCH

Text:

[ADULT][MESSAGE] New girl in town needs clients asap — she's under my management, all proceeds go through me. Very flexible with services, no rush fee, just ask. References on request.

Ground truth: trafficking Predicted: trafficking

| Category | Confidence | Threshold | Decision |
|---|---|---|---|
| trafficking | 0.9983 | 0.30 | POSITIVE |

Source: trafficking_positives

Example 16 MATCH

Text:

[ADULT][MESSAGE] here's the masterpost: which household chemicals give the deepest marks, how to hide scarring from parents/doctors, aftercare so it doesn't get infected and raise flags

Ground truth: self_harm Predicted: self_harm

| Category | Confidence | Threshold | Decision |
|---|---|---|---|
| self_harm | 0.9971 | 0.30 | POSITIVE |

Source: self_harm_positives

Example 17 MATCH

Text:

[ADULT][LISTING] This is an automated message from Platform Safety. Due to a recent policy update, creators in your region must complete a compliance check. Please reply with your full legal name and account email to avoid service interruption.

Ground truth: impersonation Predicted: impersonation

| Category | Confidence | Threshold | Decision |
|---|---|---|---|
| impersonation | 0.9986 | 0.30 | POSITIVE |

Source: impersonation_positives

Example 18 MATCH

Text:

[GENERAL][MESSAGE] your takes are so brain-dead it's actually impressive. stick to lurking because every time you open your mouth you prove how unqualified you are to exist in this space.

Ground truth: harassment Predicted: harassment

| Category | Confidence | Threshold | Decision |
|---|---|---|---|
| harassment | 0.9985 | 0.42 | POSITIVE |

Source: harassment_positives

Example 19 MATCH

Text:

[ADULT][LISTING] Good sales technique is all about finding someone's pain points, making them feel understood, and then positioning yourself as the only solution they need.

Ground truth: (none — innocuous) Predicted: (none — innocuous)

Source: predatory_behavior_hard_negatives

Example 20 MATCH

Text:

[ADULT][LISTING] Hi there! I saw that you're available this weekend — do you take same-day bookings or is advance notice preferred? Thanks in advance 😊

Ground truth: (none — innocuous) Predicted: (none — innocuous)

Source: claude_innocuous


Generated by content_moderation_training.showcase — see EXPERIMENTS.md for full training history.