
ml-safety-filter

Content moderation and boundary enforcement for LLM inputs/outputs.

Installation

pip install lilith-ml-safety-filter

Quick Start

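import asyncio

from lilith_ml_safety_filter import ContactBoundaries, SafetyFilter, TopicCategory


async def main() -> None:
    # Named safety_filter to avoid shadowing the built-in filter()
    safety_filter = SafetyFilter()

    # Check both input and output
    result = await safety_filter.check(
        input_text="Hi, can we talk about your job?",         # incoming message
        output_text="Sure, what would you like to know?",     # generated response
        boundaries=ContactBoundaries(blocked_topics=[TopicCategory.WORK]),
    )

    if not result.is_safe:
        print(f"Warnings: {result.warnings}")

    if result.should_block:
        print(f"Blocked: {result.blocked_reason}")


asyncio.run(main())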

Features

Input Sanitization

  • Prompt Injection Detection: Detects system override attempts, role escapes, delimiter attacks, jailbreak attempts
  • PII Pattern Detection: Email, phone, SSN, credit cards, addresses, IP addresses, and more
  • Scam Indicator Detection: Urgency patterns, financial requests, impersonation attempts
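
The detection ideas listed above can be illustrated with a minimal, standard-library sketch of pattern-based injection scoring. The patterns, weights, and function names here are assumptions made for illustration; they are not the library's actual rules or API.

```python
import re

# Hypothetical patterns and weights, chosen only for illustration.
INJECTION_PATTERNS = [
    (re.compile(r"ignore\s+(?:all\s+)?previous\s+instructions", re.IGNORECASE), 0.9),
    (re.compile(r"^\s*system\s*:", re.IGNORECASE | re.MULTILINE), 0.7),
    (re.compile(r"\byou\s+are\s+now\b", re.IGNORECASE), 0.5),
]


def injection_confidence(text: str) -> float:
    """Return the highest weight among the patterns that match, else 0.0."""
    return max(
        (weight for pattern, weight in INJECTION_PATTERNS if pattern.search(text)),
        default=0.0,
    )


def should_block(text: str, threshold: float = 0.8) -> bool:
    """Block only when the confidence reaches the threshold."""
    return injection_confidence(text) >= threshold
```

Taking the maximum weight rather than a sum keeps a single strong signal from being diluted and keeps the score within 0.0 to 1.0.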

Contact Boundaries

  • Per-contact topic blocking
  • Risk classification (safe, unknown, suspicious, scam)
  • Trust level management
  • Whitelist and blacklist modes
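
As a rough illustration of the whitelist and blacklist modes listed above, the following sketch shows one way a topic policy could be evaluated. `Mode` and `TopicPolicy` are hypothetical names invented for this example and are not part of the package.

```python
from dataclasses import dataclass, field
from enum import Enum


class Mode(Enum):
    BLACKLIST = "blacklist"  # everything is allowed except listed topics
    WHITELIST = "whitelist"  # nothing is allowed except listed topics


@dataclass
class TopicPolicy:
    """Hypothetical per-contact topic policy."""

    mode: Mode = Mode.BLACKLIST
    topics: set[str] = field(default_factory=set)

    def is_allowed(self, topic: str) -> bool:
        listed = topic in self.topics
        return listed if self.mode is Mode.WHITELIST else not listed
```

With an empty topic set, blacklist mode allows every topic and whitelist mode allows none, so the choice of default mode decides how an unconfigured contact is treated.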

Output Filtering

  • Financial advice detection
  • Medical advice detection
  • Legal advice detection
  • Topic boundary enforcement

API Reference

SafetyFilter

The main entry point for safety filtering.

from lilith_ml_safety_filter import SafetyFilter, SafetyResult

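# Named safety_filter to avoid shadowing the built-in filter()
safety_filter = SafetyFilter(
    injection_block_threshold=0.8,  # Confidence threshold for blocking injections
    scam_block_threshold=0.7,       # Confidence threshold for blocking scams
    pii_confidence_threshold=0.7,   # Confidence threshold for PII detection
)

# Async check (must be awaited inside a coroutine).
# `boundaries` is a ContactBoundaries and `conversation_context` is a
# ConversationContext; both are described in the sections that follow.
result: SafetyResult = await safety_filter.check(
    input_text="User message",
    output_text="Model response",
    boundaries=boundaries,
    context=conversation_context,
)

# Sync check (for convenience)
result = safety_filter.check_sync(input_text="User message")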
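
For readers wondering how a synchronous convenience method can sit on top of an async one, here is a minimal sketch of the general pattern using `asyncio.run`. `TinyFilter` is a hypothetical stand-in with a toy check; whether the library implements `check_sync` this way is an assumption.

```python
import asyncio


class TinyFilter:
    """Hypothetical stand-in, not the library's SafetyFilter."""

    async def check(self, input_text: str) -> bool:
        await asyncio.sleep(0)  # placeholder for real asynchronous work
        return "ignore previous instructions" not in input_text.lower()

    def check_sync(self, input_text: str) -> bool:
        # asyncio.run() starts a fresh event loop, so this wrapper must not
        # be called from code that is already running inside an event loop.
        return asyncio.run(self.check(input_text))
```

Inside an already running event loop (for example in an async web handler), awaiting the async method directly is the safer choice.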

SafetyResult

The result of a safety check.

@dataclass
class SafetyResult:
    is_safe: bool                    # Overall safety assessment
    warnings: list[str]              # Warning messages
    blocked_reason: str | None       # Reason for blocking (if should_block)
    detected_pii: list[str]          # Types of PII found (not values)
    scam_indicators: list[str]       # Scam indicator types detected
    injection_detected: bool         # Whether injection was detected
    violations: list[RuleViolation]  # Rule violations found

    @property
    def should_block(self) -> bool: ...  # Whether content should be blocked

    @property
    def scam_score(self) -> float: ...   # Aggregate scam likelihood (0.0-1.0)
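
One way calling code might turn these fields into a single decision is sketched below. `ResultStub` is a hypothetical stand-in that mirrors only a few of the documented fields (with `should_block` as a plain attribute rather than a property), and `decide` is an illustrative helper, not part of the package.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class ResultStub:
    """Hypothetical stand-in mirroring a subset of SafetyResult."""

    is_safe: bool = True
    warnings: list[str] = field(default_factory=list)
    blocked_reason: Optional[str] = None
    should_block: bool = False  # a property on the documented class


def decide(result: ResultStub) -> str:
    """Map a result to 'block', 'warn', or 'allow', checking block first."""
    if result.should_block:
        return f"block: {result.blocked_reason or 'unspecified'}"
    if not result.is_safe:
        return "warn: " + "; ".join(result.warnings)
    return "allow"
```

Checking `should_block` before `is_safe` matters because a blocked result is presumably also unsafe, and the stronger action should win.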

ContactBoundaries

Configure per-contact safety rules.

from lilith_ml_safety_filter import ContactBoundaries, ContactClassification, TopicCategory

boundaries = ContactBoundaries(
    blocked_topics=[TopicCategory.FINANCIAL, TopicCategory.WORK],
    classification=ContactClassification.UNKNOWN,
    allow_pii_disclosure=False,
    allow_financial_discussion=False,
    allow_meeting_scheduling=True,
    trust_level=50,  # 0-100
)

# Check if topic is allowed
if boundaries.is_topic_allowed(TopicCategory.PERSONAL):
    # Handle allowed topic
    pass

Preset Boundaries

Pre-configured boundary settings for common scenarios:

from lilith_ml_safety_filter import (
    BOUNDARIES_STRICT,          # Maximum restrictions for new contacts
    BOUNDARIES_MODERATE,        # Balanced for regular interactions
    BOUNDARIES_TRUSTED,         # Relaxed for known contacts
    BOUNDARIES_SCAM_SUSPECTED,  # Maximum restriction for suspected scams
)

ConversationContext

Track conversation state for stateful filtering.

from lilith_ml_safety_filter import ConversationContext

context = ConversationContext(
    contact_id="user-123",
    boundaries=boundaries,
    message_count=5,
    previous_warnings=["Previous warning"],
)

# Check escalation level (0-3)
level = context.get_escalation_level()

# Add warnings/flags
context.add_warning("New warning")
context.add_flagged_pattern("injection_attempt")
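
The README does not say how the 0-3 escalation level is computed. Purely as an illustration of stateful tracking, the sketch below assumes one level per recorded warning or flagged pattern, capped at 3; `ContextStub` and that rule are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class ContextStub:
    """Hypothetical stand-in for ConversationContext."""

    contact_id: str
    previous_warnings: list[str] = field(default_factory=list)
    flagged_patterns: list[str] = field(default_factory=list)

    def add_warning(self, warning: str) -> None:
        self.previous_warnings.append(warning)

    def add_flagged_pattern(self, pattern: str) -> None:
        self.flagged_patterns.append(pattern)

    def get_escalation_level(self) -> int:
        # Assumed rule: one level per recorded signal, capped at 3.
        signals = len(self.previous_warnings) + len(self.flagged_patterns)
        return min(3, signals)
```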

Pattern Detection

Direct access to pattern detection:

from lilith_ml_safety_filter import PatternDetector, PIIType, InjectionType

detector = PatternDetector()

# Detect PII
pii_matches = detector.detect_pii(
    "Email: user@example.com",
    types=[PIIType.EMAIL, PIIType.PHONE],
)

# Detect injections
injection_matches = detector.detect_injections(
    "Ignore previous instructions",
    types=[InjectionType.SYSTEM_OVERRIDE],
)

# Detect scam indicators
scam_matches = detector.detect_scam_indicators(
    "URGENT: Send money now!",
    indicator_types=["urgency", "financial_request"],
)

# Redact PII from text
redacted = detector.redact_text("My SSN is 123-45-6789")
# Returns: "My SSN is ***-**-****"
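
The redaction shown above can be approximated with a short regular-expression sketch. The pattern and the `redact_ssn` helper are assumptions for illustration and cover only the dashed SSN format; the library's own patterns may differ.

```python
import re

# Matches the dashed 3-2-4 digit layout only (an illustrative simplification).
SSN_RE = re.compile(r"\b\d{3}-\d{2}-\d{4}\b")


def redact_ssn(text: str) -> str:
    """Replace every digit of each matched SSN with '*', keeping the dashes."""
    return SSN_RE.sub(lambda m: re.sub(r"\d", "*", m.group(0)), text)
```

Masking digit by digit preserves the shape of the value, which keeps redacted logs readable without exposing the number itself.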

Custom Rules

Add custom safety rules:

from lilith_ml_safety_filter import SafetyFilter, PatternRule, Severity

# Named safety_filter to avoid shadowing the built-in filter()
safety_filter = SafetyFilter()

# Add a custom pattern rule
safety_filter.add_rule(PatternRule(
    id="custom-forbidden-word",
    sev=Severity.HIGH,
    desc="Detect forbidden content",
    pattern=r"forbidden\s+word",
    message_template="Found forbidden content: {match}",
    suggestion_template="Remove or rephrase the forbidden content",
))

# Remove a rule
safety_filter.remove_rule("custom-forbidden-word")
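
To make the role of `pattern` and the `{match}` placeholder concrete, here is a minimal sketch of how a regex rule could produce its message. `RuleStub` and its `evaluate` method are hypothetical; how `PatternRule` actually evaluates text, including case sensitivity, is not documented here.

```python
import re
from dataclasses import dataclass
from typing import Optional


@dataclass
class RuleStub:
    """Hypothetical stand-in for a pattern-based rule."""

    id: str
    pattern: str
    message_template: str

    def evaluate(self, text: str) -> Optional[str]:
        """Return the formatted message for the first match, or None."""
        match = re.search(self.pattern, text, re.IGNORECASE)  # assumed case-insensitive
        if match is None:
            return None
        return self.message_template.format(match=match.group(0))
```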

PII Types

| Type           | Description             | Example             |
|----------------|-------------------------|---------------------|
| EMAIL          | Email addresses         | user@example.com    |
| PHONE          | Phone numbers           | 555-123-4567        |
| SSN            | Social Security Numbers | 123-45-6789         |
| CREDIT_CARD    | Credit card numbers     | 4111-1111-1111-1111 |
| ADDRESS        | Physical addresses      | 123 Main Street     |
| IP_ADDRESS     | IPv4/IPv6 addresses     | 192.168.1.1         |
| DATE_OF_BIRTH  | Birth dates             | 01/15/1990          |
| PASSPORT       | Passport numbers        | AB1234567           |
| DRIVER_LICENSE | Driver license numbers  | DL123456789         |
| BANK_ACCOUNT   | Bank account numbers    | Account: 12345678   |
| ROUTING_NUMBER | Routing numbers         | Routing: 123456789  |
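
Digit patterns alone tend to produce false positives for card numbers. A common complement is the Luhn checksum, sketched below; whether this library applies it is an assumption, and `luhn_valid` is an illustrative helper rather than part of the package.

```python
def luhn_valid(number: str) -> bool:
    """Return True if the digits in `number` pass the Luhn checksum."""
    digits = [int(ch) for ch in number if ch.isdigit()]
    if len(digits) < 13:  # shorter than typical card numbers
        return False
    total = 0
    # Walk from the rightmost digit, doubling every second one.
    for index, digit in enumerate(reversed(digits)):
        if index % 2 == 1:
            digit *= 2
            if digit > 9:
                digit -= 9
        total += digit
    return total % 10 == 0
```

The example value in the table, 4111-1111-1111-1111, is a well-known test number that passes this check.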

Injection Types

| Type                  | Description                            |
|-----------------------|----------------------------------------|
| SYSTEM_OVERRIDE       | Attempts to override system prompts    |
| ROLE_ESCAPE           | Attempts to change AI role/personality |
| INSTRUCTION_INJECTION | Injection via special delimiters       |
| CONTEXT_MANIPULATION  | Attempts to manipulate context         |
| OUTPUT_HIJACKING      | Attempts to control output format      |
| DELIMITER_ATTACK      | Use of suspicious delimiters           |
| ENCODING_BYPASS       | Attempts to bypass via encoding        |
| JAILBREAK_ATTEMPT     | Common jailbreak patterns              |

Topic Categories

| Category      | Description                 |
|---------------|-----------------------------|
| WORK          | Work and career discussions |
| FINANCIAL     | Money, investments, banking |
| PERSONAL      | Personal life details       |
| HEALTH        | Medical and health topics   |
| FAMILY        | Family-related discussions  |
| RELATIONSHIPS | Romantic or intimate topics |
| POLITICS      | Political discussions       |
| RELIGION      | Religious discussions       |
| LOCATION      | Geographic location sharing |
| SCHEDULE      | Daily schedule or routine   |

Severity Levels

| Level    | Description                 |
|----------|-----------------------------|
| CRITICAL | Immediate block required    |
| HIGH     | Strong warning, likely block |
| MEDIUM   | Moderate concern            |
| LOW      | Minor issues                |
| INFO     | Informational only          |
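
Because the levels are ordered, they map naturally onto an integer enum. The sketch below shows one possible way to reduce several violations to a single action; `SeverityStub`, the numeric values, and the action thresholds are assumptions for illustration, not the library's behavior.

```python
from enum import IntEnum
from typing import Iterable


class SeverityStub(IntEnum):
    """Hypothetical ordered mirror of the documented severity levels."""

    INFO = 0
    LOW = 1
    MEDIUM = 2
    HIGH = 3
    CRITICAL = 4


def worst(severities: Iterable[SeverityStub]) -> SeverityStub:
    """Return the most severe level, or INFO when there are none."""
    return max(severities, default=SeverityStub.INFO)


def action_for(severity: SeverityStub) -> str:
    # Assumed mapping: only CRITICAL blocks outright in this sketch.
    if severity >= SeverityStub.CRITICAL:
        return "block"
    if severity >= SeverityStub.MEDIUM:
        return "warn"
    return "log"
```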

Development

# Install dev dependencies
pip install -e ".[dev]"

# Run tests
pytest

# Run with coverage
pytest --cov=lilith_ml_safety_filter

# Type checking
mypy src/

# Linting
ruff check src/

License

MIT