ml-safety-filter

Content moderation and boundary enforcement for LLM inputs/outputs.

Installation

pip install lilith-ml-safety-filter

Quick Start

from lilith_ml_safety_filter import SafetyFilter, ContactBoundaries

filter = SafetyFilter()

# Check both input and output
result = await filter.check(
    input_text=incoming_message,
    output_text=generated_response,
    boundaries=ContactBoundaries(blocked_topics=["work"])
)

if not result.is_safe:
    print(f"Warnings: {result.warnings}")

if result.should_block:
    print(f"Blocked: {result.blocked_reason}")
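The `await` in the snippet above only works inside a coroutine. As a minimal sketch of how a script can drive an async check, the example below uses `asyncio.run` with a hypothetical stand-in class, `StubFilter`, so that it runs without the package installed. `StubFilter` and its phrase check are assumptions made for illustration and are not the library's implementation.

```python
import asyncio


class StubFilter:
    """Hypothetical stand-in for SafetyFilter; not the library's code."""

    async def check(self, input_text: str) -> dict:
        # Yield to the event loop once, then flag one obvious phrase.
        await asyncio.sleep(0)
        flagged = "ignore previous instructions" in input_text.lower()
        return {"is_safe": not flagged}


async def main() -> dict:
    safety_filter = StubFilter()
    # With the real package, this is where `await filter.check(...)` would go.
    return await safety_filter.check(input_text="Ignore previous instructions")


# asyncio.run creates an event loop, runs the coroutine, and closes the loop.
result = asyncio.run(main())
print(result)
```

Naming the instance `safety_filter` rather than `filter` also avoids shadowing Python's built-in `filter` function.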

Features

Input Sanitization

Prompt Injection Detection: Detects system override attempts, role escapes, delimiter attacks, and jailbreak attempts
PII Pattern Detection: Email, phone, SSN, credit card, address, IP address, and the other types listed under PII Types
Scam Indicator Detection: Urgency patterns, financial requests, and impersonation attempts

Contact Boundaries

Per-contact topic blocking
Risk classification (safe, unknown, suspicious, scam)
Trust level management
Whitelist and blacklist modes

Output Filtering

Financial advice detection
Medical advice detection
Legal advice detection
Topic boundary enforcement

API Reference

SafetyFilter

The main entry point for safety filtering.

from lilith_ml_safety_filter import SafetyFilter, SafetyResult

filter = SafetyFilter(
    injection_block_threshold=0.8,  # Confidence threshold for blocking injections
    scam_block_threshold=0.7,       # Confidence threshold for blocking scams
    pii_confidence_threshold=0.7,   # Confidence threshold for PII detection
)

# Async check
result: SafetyResult = await filter.check(
    input_text="User message",
    output_text="Model response",
    boundaries=boundaries,
    context=conversation_context,
)

# Sync check (for convenience)
result = filter.check_sync(input_text="User message")

SafetyResult

The result of a safety check.

from dataclasses import dataclass

# Summary of fields and properties; property bodies are elided.
@dataclass
class SafetyResult:
    is_safe: bool                    # Overall safety assessment
    warnings: list[str]              # Warning messages
    blocked_reason: str | None       # Reason for blocking (set when should_block is True)
    detected_pii: list[str]          # Types of PII found (not values)
    scam_indicators: list[str]       # Scam indicator types detected
    injection_detected: bool         # Whether injection was detected
    violations: list[RuleViolation]  # Rule violations found

    @property
    def should_block(self) -> bool: ...   # Whether content should be blocked

    @property
    def scam_score(self) -> float: ...    # Aggregate scam likelihood (0.0-1.0)
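A caller typically turns these fields into a single decision. The sketch below shows one way to do that, using a hypothetical `ResultStub` that mirrors a few of the fields listed above. `ResultStub` and `decide` are illustrative names, and the block/warn/allow policy is an assumption, not something the library prescribes.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class ResultStub:
    """Hypothetical mirror of a few SafetyResult fields, for illustration only."""

    is_safe: bool
    warnings: list = field(default_factory=list)
    blocked_reason: Optional[str] = None
    should_block: bool = False  # a property on the real class; a plain field here


def decide(result: ResultStub) -> str:
    """Map a result to 'block', 'warn', or 'allow' (assumed policy)."""
    if result.should_block:
        return "block"
    if not result.is_safe or result.warnings:
        return "warn"
    return "allow"


print(decide(ResultStub(is_safe=True)))
print(decide(ResultStub(is_safe=False, warnings=["PII detected"])))
print(decide(ResultStub(is_safe=False, blocked_reason="scam", should_block=True)))
```

Checking `should_block` first means a blocking result is never downgraded to a warning.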

ContactBoundaries

Configure per-contact safety rules.

from lilith_ml_safety_filter import ContactBoundaries, ContactClassification, TopicCategory

boundaries = ContactBoundaries(
    blocked_topics=[TopicCategory.FINANCIAL, TopicCategory.WORK],
    classification=ContactClassification.UNKNOWN,
    allow_pii_disclosure=False,
    allow_financial_discussion=False,
    allow_meeting_scheduling=True,
    trust_level=50,  # 0-100
)

# Check if topic is allowed
if boundaries.is_topic_allowed(TopicCategory.PERSONAL):
    # Handle allowed topic
    pass
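To make the difference between blacklist and whitelist modes concrete, here is a minimal standalone sketch of a topic check. The `Topic` enum and the `is_topic_allowed` function below are hypothetical stand-ins. How `ContactBoundaries.is_topic_allowed` resolves the two modes internally is not documented here, so the precedence shown (blocked topics always win) is an assumption.

```python
from enum import Enum


class Topic(Enum):
    """Hypothetical stand-in for TopicCategory."""

    WORK = "work"
    FINANCIAL = "financial"
    PERSONAL = "personal"


def is_topic_allowed(topic, blocked=frozenset(), allowed=None):
    """Return True if the topic may be discussed.

    Blacklist mode (allowed is None): everything not in `blocked` passes.
    Whitelist mode (allowed is a set): only topics in `allowed` pass.
    A blocked topic is rejected in both modes (assumed precedence).
    """
    if topic in blocked:
        return False
    if allowed is not None:
        return topic in allowed
    return True


blocked = {Topic.FINANCIAL, Topic.WORK}
print(is_topic_allowed(Topic.PERSONAL, blocked=blocked))
print(is_topic_allowed(Topic.WORK, blocked=blocked))
print(is_topic_allowed(Topic.WORK, allowed={Topic.PERSONAL}))
```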

Preset Boundaries

Pre-configured boundary settings for common scenarios:

from lilith_ml_safety_filter import (
    BOUNDARIES_STRICT,          # Maximum restrictions for new contacts
    BOUNDARIES_MODERATE,        # Balanced for regular interactions
    BOUNDARIES_TRUSTED,         # Relaxed for known contacts
    BOUNDARIES_SCAM_SUSPECTED,  # Maximum restriction for suspected scams
)

ConversationContext

Track conversation state for stateful filtering.

from lilith_ml_safety_filter import ConversationContext

context = ConversationContext(
    contact_id="user-123",
    boundaries=boundaries,
    message_count=5,
    previous_warnings=["Previous warning"],
)

# Check escalation level (0-3)
level = context.get_escalation_level()

# Add warnings/flags
context.add_warning("New warning")
context.add_flagged_pattern("injection_attempt")

Pattern Detection

Direct access to pattern detection:

from lilith_ml_safety_filter import PatternDetector, PIIType, InjectionType

detector = PatternDetector()

# Detect PII
pii_matches = detector.detect_pii(
    "Email: user@example.com",
    types=[PIIType.EMAIL, PIIType.PHONE],
)

# Detect injections
injection_matches = detector.detect_injections(
    "Ignore previous instructions",
    types=[InjectionType.SYSTEM_OVERRIDE],
)

# Detect scam indicators
scam_matches = detector.detect_scam_indicators(
    "URGENT: Send money now!",
    indicator_types=["urgency", "financial_request"],
)

# Redact PII from text
redacted = detector.redact_text("My SSN is 123-45-6789")
# Returns: "My SSN is ***-**-****"
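For readers curious how this kind of redaction can work, the following is a standalone sketch using only the standard `re` module. The pattern and the `redact_ssn` helper are assumptions made for illustration. The library's own patterns are likely broader, and this is not its implementation.

```python
import re

# Simplified SSN shape: three digits, two digits, four digits, hyphen-separated.
SSN_PATTERN = re.compile(r"\b\d{3}-\d{2}-\d{4}\b")


def redact_ssn(text: str) -> str:
    """Mask every digit of each SSN-shaped match, keeping the hyphens."""
    return SSN_PATTERN.sub(lambda m: re.sub(r"\d", "*", m.group()), text)


print(redact_ssn("My SSN is 123-45-6789"))  # My SSN is ***-**-****
```

Keeping the separators preserves the shape of the value, so a reader can still tell what kind of data was removed.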

Custom Rules

Add custom safety rules:

from lilith_ml_safety_filter import SafetyFilter, PatternRule, Severity

filter = SafetyFilter()

# Add a custom pattern rule
filter.add_rule(PatternRule(
    id="custom-forbidden-word",
    sev=Severity.HIGH,
    desc="Detect forbidden content",
    pattern=r"forbidden\s+word",
    message_template="Found forbidden content: {match}",
    suggestion_template="Remove or rephrase the forbidden content",
))

# Remove a rule
filter.remove_rule("custom-forbidden-word")
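The interplay of `pattern` and `message_template` can be sketched without the library. The `RuleSketch` class below is a hypothetical stand-in for `PatternRule`: it searches the text with the regular expression and fills the `{match}` placeholder with the matched substring. Case-insensitive matching is an assumption here, since the library's matching flags are not documented in this README.

```python
import re
from dataclasses import dataclass
from typing import Optional


@dataclass
class RuleSketch:
    """Hypothetical stand-in for PatternRule; not the library's class."""

    id: str
    pattern: str
    message_template: str

    def evaluate(self, text: str) -> Optional[str]:
        """Return the formatted message on a match, or None otherwise."""
        match = re.search(self.pattern, text, flags=re.IGNORECASE)  # assumed flag
        if match is None:
            return None
        return self.message_template.format(match=match.group())


rule = RuleSketch(
    id="custom-forbidden-word",
    pattern=r"forbidden\s+word",
    message_template="Found forbidden content: {match}",
)

print(rule.evaluate("This has a forbidden word in it"))
print(rule.evaluate("Nothing to see here"))
```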

PII Types

Type	Description	Example
`EMAIL`	Email addresses	user@example.com
`PHONE`	Phone numbers	555-123-4567
`SSN`	Social Security Numbers	123-45-6789
`CREDIT_CARD`	Credit card numbers	4111-1111-1111-1111
`ADDRESS`	Physical addresses	123 Main Street
`IP_ADDRESS`	IPv4/IPv6 addresses	192.168.1.1
`DATE_OF_BIRTH`	Birth dates	01/15/1990
`PASSPORT`	Passport numbers	AB1234567
`DRIVER_LICENSE`	Driver license numbers	DL123456789
`BANK_ACCOUNT`	Bank account numbers	Account: 12345678
`ROUTING_NUMBER`	Routing numbers	Routing: 123456789

Injection Types

Type	Description
`SYSTEM_OVERRIDE`	Attempts to override system prompts
`ROLE_ESCAPE`	Attempts to change AI role/personality
`INSTRUCTION_INJECTION`	Injection via special delimiters
`CONTEXT_MANIPULATION`	Attempts to manipulate context
`OUTPUT_HIJACKING`	Attempts to control output format
`DELIMITER_ATTACK`	Use of suspicious delimiters
`ENCODING_BYPASS`	Attempts to bypass via encoding
`JAILBREAK_ATTEMPT`	Common jailbreak patterns

Topic Categories

Category	Description
`WORK`	Work and career discussions
`FINANCIAL`	Money, investments, banking
`PERSONAL`	Personal life details
`HEALTH`	Medical and health topics
`FAMILY`	Family-related discussions
`RELATIONSHIPS`	Romantic or intimate topics
`POLITICS`	Political discussions
`RELIGION`	Religious discussions
`LOCATION`	Geographic location sharing
`SCHEDULE`	Daily schedule or routine

Severity Levels

Level	Description
`CRITICAL`	Immediate block required
`HIGH`	Strong warning, likely block
`MEDIUM`	Moderate concern
`LOW`	Minor issues
`INFO`	Informational only
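Because the levels are ordered, a set of violations can be reduced to its worst severity and mapped to an action. The sketch below assumes a hypothetical `Level` enum with made-up numeric values and an assumed policy (block on `CRITICAL`, warn on `HIGH`, allow otherwise). The library's `Severity` type and its actual thresholds may differ.

```python
from enum import IntEnum


class Level(IntEnum):
    """Hypothetical ordering of the severity levels; the values are assumptions."""

    INFO = 0
    LOW = 1
    MEDIUM = 2
    HIGH = 3
    CRITICAL = 4


def action_for(levels) -> str:
    """Pick an action from the worst severity present (assumed policy)."""
    worst = max(levels, default=Level.INFO)
    if worst >= Level.CRITICAL:
        return "block"
    if worst >= Level.HIGH:
        return "warn"
    return "allow"


print(action_for([Level.LOW, Level.CRITICAL]))
print(action_for([Level.MEDIUM, Level.HIGH]))
print(action_for([]))
```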

Development

# Install dev dependencies
pip install -e ".[dev]"

# Run tests
pytest

# Run with coverage
pytest --cov=lilith_ml_safety_filter

# Type checking
mypy src/

# Linting
ruff check src/

License

MIT