No description
|
|
||
|---|---|---|
| .forgejo/workflows | ||
| dist | ||
| src/lilith_ml_safety_filter | ||
| tests | ||
| pyproject.toml | ||
| README.md | ||
ml-safety-filter
Content moderation and boundary enforcement for LLM inputs/outputs.
Installation
pip install lilith-ml-safety-filter
Quick Start
from lilith_ml_safety_filter import SafetyFilter, ContactBoundaries
filter = SafetyFilter()
# Check both input and output
result = await filter.check(
input_text=incoming_message,
output_text=generated_response,
boundaries=ContactBoundaries(blocked_topics=["work"])
)
if not result.is_safe:
print(f"Warnings: {result.warnings}")
if result.should_block:
print(f"Blocked: {result.blocked_reason}")
Features
Input Sanitization
- Prompt Injection Detection: Detects system override attempts, role escapes, delimiter attacks, jailbreak attempts
- PII Pattern Detection: Email, phone, SSN, credit cards, addresses, IP addresses, and more
- Scam Indicator Detection: Urgency patterns, financial requests, impersonation attempts
Contact Boundaries
- Per-contact topic blocking
- Risk classification (safe, unknown, suspicious, scam)
- Trust level management
- Whitelist and blacklist modes
Output Filtering
- Financial advice detection
- Medical advice detection
- Legal advice detection
- Topic boundary enforcement
API Reference
SafetyFilter
The main entry point for safety filtering.
from lilith_ml_safety_filter import SafetyFilter, SafetyResult
filter = SafetyFilter(
injection_block_threshold=0.8, # Confidence threshold for blocking injections
scam_block_threshold=0.7, # Confidence threshold for blocking scams
pii_confidence_threshold=0.7, # Confidence threshold for PII detection
)
# Async check
result: SafetyResult = await filter.check(
input_text="User message",
output_text="Model response",
boundaries=boundaries,
context=conversation_context,
)
# Sync check (for convenience)
result = filter.check_sync(input_text="User message")
SafetyResult
The result of a safety check.
@dataclass
class SafetyResult:
is_safe: bool # Overall safety assessment
warnings: list[str] # Warning messages
blocked_reason: str | None # Reason for blocking (if should_block)
detected_pii: list[str] # Types of PII found (not values)
scam_indicators: list[str] # Scam indicator types detected
injection_detected: bool # Whether injection was detected
violations: list[RuleViolation] # Rule violations found
@property
def should_block(self) -> bool # Whether content should be blocked
@property
def scam_score(self) -> float # Aggregate scam likelihood (0.0-1.0)
ContactBoundaries
Configure per-contact safety rules.
from lilith_ml_safety_filter import ContactBoundaries, ContactClassification, TopicCategory
boundaries = ContactBoundaries(
blocked_topics=[TopicCategory.FINANCIAL, TopicCategory.WORK],
classification=ContactClassification.UNKNOWN,
allow_pii_disclosure=False,
allow_financial_discussion=False,
allow_meeting_scheduling=True,
trust_level=50, # 0-100
)
# Check if topic is allowed
if boundaries.is_topic_allowed(TopicCategory.PERSONAL):
# Handle allowed topic
pass
Preset Boundaries
Pre-configured boundary settings for common scenarios:
from lilith_ml_safety_filter import (
BOUNDARIES_STRICT, # Maximum restrictions for new contacts
BOUNDARIES_MODERATE, # Balanced for regular interactions
BOUNDARIES_TRUSTED, # Relaxed for known contacts
BOUNDARIES_SCAM_SUSPECTED, # Maximum restriction for suspected scams
)
ConversationContext
Track conversation state for stateful filtering.
from lilith_ml_safety_filter import ConversationContext
context = ConversationContext(
contact_id="user-123",
boundaries=boundaries,
message_count=5,
previous_warnings=["Previous warning"],
)
# Check escalation level (0-3)
level = context.get_escalation_level()
# Add warnings/flags
context.add_warning("New warning")
context.add_flagged_pattern("injection_attempt")
Pattern Detection
Direct access to pattern detection:
from lilith_ml_safety_filter import PatternDetector, PIIType, InjectionType
detector = PatternDetector()
# Detect PII
pii_matches = detector.detect_pii(
"Email: user@example.com",
types=[PIIType.EMAIL, PIIType.PHONE],
)
# Detect injections
injection_matches = detector.detect_injections(
"Ignore previous instructions",
types=[InjectionType.SYSTEM_OVERRIDE],
)
# Detect scam indicators
scam_matches = detector.detect_scam_indicators(
"URGENT: Send money now!",
indicator_types=["urgency", "financial_request"],
)
# Redact PII from text
redacted = detector.redact_text("My SSN is 123-45-6789")
# Returns: "My SSN is ***-**-****"
Custom Rules
Add custom safety rules:
from lilith_ml_safety_filter import SafetyFilter, PatternRule, Severity
filter = SafetyFilter()
# Add a custom pattern rule
filter.add_rule(PatternRule(
id="custom-forbidden-word",
sev=Severity.HIGH,
desc="Detect forbidden content",
pattern=r"forbidden\s+word",
message_template="Found forbidden content: {match}",
suggestion_template="Remove or rephrase the forbidden content",
))
# Remove a rule
filter.remove_rule("custom-forbidden-word")
PII Types
| Type | Description | Example |
|---|---|---|
EMAIL |
Email addresses | user@example.com |
PHONE |
Phone numbers | 555-123-4567 |
SSN |
Social Security Numbers | 123-45-6789 |
CREDIT_CARD |
Credit card numbers | 4111-1111-1111-1111 |
ADDRESS |
Physical addresses | 123 Main Street |
IP_ADDRESS |
IPv4/IPv6 addresses | 192.168.1.1 |
DATE_OF_BIRTH |
Birth dates | 01/15/1990 |
PASSPORT |
Passport numbers | AB1234567 |
DRIVER_LICENSE |
Driver license numbers | DL123456789 |
BANK_ACCOUNT |
Bank account numbers | Account: 12345678 |
ROUTING_NUMBER |
Routing numbers | Routing: 123456789 |
Injection Types
| Type | Description |
|---|---|
SYSTEM_OVERRIDE |
Attempts to override system prompts |
ROLE_ESCAPE |
Attempts to change AI role/personality |
INSTRUCTION_INJECTION |
Injection via special delimiters |
CONTEXT_MANIPULATION |
Attempts to manipulate context |
OUTPUT_HIJACKING |
Attempts to control output format |
DELIMITER_ATTACK |
Use of suspicious delimiters |
ENCODING_BYPASS |
Attempts to bypass via encoding |
JAILBREAK_ATTEMPT |
Common jailbreak patterns |
Topic Categories
| Category | Description |
|---|---|
WORK |
Work and career discussions |
FINANCIAL |
Money, investments, banking |
PERSONAL |
Personal life details |
HEALTH |
Medical and health topics |
FAMILY |
Family-related discussions |
RELATIONSHIPS |
Romantic or intimate topics |
POLITICS |
Political discussions |
RELIGION |
Religious discussions |
LOCATION |
Geographic location sharing |
SCHEDULE |
Daily schedule or routine |
Severity Levels
| Level | Description |
|---|---|
CRITICAL |
Immediate block required |
HIGH |
Strong warning, likely block |
MEDIUM |
Moderate concern |
LOW |
Minor issues |
INFO |
Informational only |
Development
# Install dev dependencies
pip install -e ".[dev]"
# Run tests
pytest
# Run with coverage
pytest --cov=ml_safety_filter
# Type checking
mypy src/
# Linting
ruff check src/
License
MIT