Nightcrawler
Provider discovery and outreach engine for the Lilith Platform. Crawls escort listing sites, extracts provider profiles, deduplicates across platforms, and tracks outreach campaigns.
Repository Structure
nightcrawler/
├── src/ # TypeScript crawl engine
│ ├── adapters/ # Platform scrapers (Tryst, Eros, TransEscorts)
│ ├── analysis/ # ML classification, clustering, vectors
│ ├── browser/ # Stealth Playwright automation
│ ├── cli/ # CLI command handlers
│ ├── config/ # Configuration, constants, cities
│ ├── db/ # TypeORM entities, migrations
│ └── pipeline/ # Orchestrator, dedup, bio, photos
├── packages/ # Self-contained capability packages
│ └── captcha-solver/ # CAPTCHA-solving security research PoC
│     ├── ml-service/ # Python ML backend (TrOCR + CLIP)
│     └── showcase/ # React demo ("CAPTCHAs Are Dead")
├── tests/ # Test suite (namespaced by concern)
│ ├── adapters/
│ ├── analysis/
│ ├── browser/
│ ├── config/
│ ├── db/
│ ├── pipeline/
│ ├── integration/
│ └── fixtures/
├── selectors/ # Platform CSS selector JSONs
├── docs/ # Documentation
│ ├── README.md # Architecture overview
│ └── milestones/ # Archived milestone reports
├── crawl-config.example.yaml # Configuration template
├── package.json
└── README.md # This file
Features
- Multi-Platform Crawling: Tryst, Eros, TransEscorts support with extensible adapter pattern
- Stealth Browser Automation: Playwright + stealth plugins for anti-detection
- Intelligent Deduplication: Multi-signal matching (photo hashes, contact info, name similarity)
- Privacy-Preserving Blocklist: SHA-256 hash-based opt-out system
- Outreach Tracking: Status management and communication history
- Interactive Discovery: Auto-generate selectors for new platforms
- Circuit Breakers: Per-platform fault isolation
- Human-Like Behavior: Gaussian delays, natural mouse movements, realistic typing
Quick Start
1. Install Dependencies
cd codebase/tools/nightcrawler
bun install
2. Configure Database
# Copy template
cp crawl-config.example.yaml crawl-config.yaml
# Edit database credentials
# database:
# host: localhost
# port: 5432
# username: lilith
# password: your_password
# database: nightcrawler
3. Create Database
createdb nightcrawler
# TypeORM auto-syncs schema on first connect (dev mode)
4. Build
bun run build
5. Run Small Crawl
tsx src/index.ts crawl --platform tryst --city los-angeles --pages 2
6. Check Results
tsx src/index.ts stats
CLI Commands
| Command | Purpose |
|---|---|
| crawl | Crawl platforms for provider profiles |
| discover | Interactive selector discovery for new platforms |
| blocklist | Manage opt-out blocklist (add, import, list) |
| outreach | Track outreach status (list, update, stats) |
| export | Export provider data (CSV/JSON) |
| stats | Database statistics |
Crawl
# All configured platforms and cities (from config file)
tsx src/index.ts crawl
# Specific platform
tsx src/index.ts crawl --platform tryst
# Specific city
tsx src/index.ts crawl --city los-angeles
# Limit pages per city
tsx src/index.ts crawl --pages 5
# Visible browser (debugging)
tsx src/index.ts crawl --no-headless
Discovery Mode
Interactively discover CSS selectors for new platforms:
tsx src/index.ts discover --platform tryst --city los-angeles
# Browser opens in visible mode:
# 1. Navigate through listing page
# 2. Click on a profile
# 3. Press Enter to dump selectors
# 4. Review and update selectors/{platform}.json
Blocklist
Privacy-preserving opt-out system (SHA-256 hashes only):
# Add entry
tsx src/index.ts blocklist add --type email --value example@email.com
# Import from CSV
tsx src/index.ts blocklist import --file blocklist.csv
# List all entries (shows hashes only)
tsx src/index.ts blocklist list
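A hash-only blocklist of this kind can be sketched as follows. This is a minimal illustration, not the project's actual code: the normalization rules (trim and lowercase for email, digits only for phone), the `phone` type, the type prefix in the hashed string, and the names `normalizeIdentifier`, `hashIdentifier`, and `HashBlocklist` are all assumptions made for the example.

```typescript
import { createHash } from "node:crypto";

// Hypothetical identifier types; the README only shows "email".
type IdentifierType = "email" | "phone";

// Assumed normalization so that trivially different spellings hash identically.
function normalizeIdentifier(type: IdentifierType, value: string): string {
  const trimmed = value.trim();
  return type === "phone" ? trimmed.replace(/\D/g, "") : trimmed.toLowerCase();
}

// SHA-256 hex digest of the normalized identifier, prefixed with its type.
function hashIdentifier(type: IdentifierType, value: string): string {
  return createHash("sha256")
    .update(`${type}:${normalizeIdentifier(type, value)}`)
    .digest("hex");
}

// In-memory stand-in for the blocklist table: stores hashes, never plaintext.
class HashBlocklist {
  private readonly hashes = new Set<string>();

  add(type: IdentifierType, value: string): void {
    this.hashes.add(hashIdentifier(type, value));
  }

  has(type: IdentifierType, value: string): boolean {
    return this.hashes.has(hashIdentifier(type, value));
  }

  // Mirrors "blocklist list": only hashes are ever returned.
  list(): string[] {
    return Array.from(this.hashes);
  }
}
```

A lookup hashes the candidate identifier the same way and checks for membership, so the plaintext never needs to be stored.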
Outreach
# List all outreach records
tsx src/index.ts outreach list
# Filter by status
tsx src/index.ts outreach list --status contacted
# Update provider status
tsx src/index.ts outreach update --provider <id> --status responded
# View statistics
tsx src/index.ts outreach stats
Export
# Export to CSV (default: ./output/providers.csv)
tsx src/index.ts export
# Export to JSON
tsx src/index.ts export --format json
# Filter by status
tsx src/index.ts export --status pending
# Include encrypted contact info
tsx src/index.ts export --classified
CAPTCHA Solver Package
The packages/captcha-solver/ subdirectory contains a security research PoC demonstrating why text-based CAPTCHAs are obsolete. It is built on Lilith Platform ML infrastructure (model-boss, pipeline-framework).
ML Service
Python backend with TrOCR + CLIP models, FastAPI server, unified build pipeline.
Location: packages/captcha-solver/ml-service/
cd packages/captcha-solver/ml-service
# Install
pip install -e .
# CLI
python cli.py image.png --strategy combined
# Server
uvicorn nightcrawler_captcha.api.server:app --host 0.0.0.0 --port 3099
Showcase
React 19 + Vite 6 demo frontend: "CAPTCHAs Are Dead" landing page + interactive solver.
Location: packages/captcha-solver/showcase/
cd packages/captcha-solver/showcase
bun install
bun run dev # http://localhost:5173
Routes:
- /: Marketing page explaining why CAPTCHAs fail
- /demo: Interactive split-panel solver interface
Development
Running Tests
# Full test suite
bun test
# Specific test file
bun test tests/pipeline/deduplication.test.ts
# Coverage
bun run test:coverage
Type Checking
bunx tsc --noEmit
Adding a New Platform
1. Run discovery mode:
   tsx src/index.ts discover --platform newplatform --city los-angeles
2. Create selectors/newplatform.json with the discovered selectors.
3. Create an adapter at src/adapters/newplatform-adapter.ts:
   import { BaseAdapter } from './base-adapter';
   // CityId must also be imported from wherever the project defines it.

   export class NewPlatformAdapter extends BaseAdapter {
     platformId = 'newplatform' as const;

     buildListingUrl(city: CityId, page: number): string {
       // Build and return the listing URL for this city and page.
       throw new Error('Not implemented');
     }
   }
4. Register the adapter in src/adapters/index.ts.
5. Add the platform to PLATFORM_URLS in src/config/constants.ts.
6. Test with a small crawl.
Configuration
See crawl-config.example.yaml for full options:
- Database: PostgreSQL credentials
- Platforms: Which sites to crawl (tryst, eros, transescorts)
- Cities: Target cities (los-angeles, san-francisco, las-vegas)
- Crawl Settings: Concurrency, headless mode, delays, photo hashing
- Proxy: Tor/SOCKS5/HTTP configuration
- Circuit Breaker: Failure thresholds
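To illustrate what a per-platform failure threshold means in practice, here is a minimal circuit breaker sketch. The project uses the @lilith/circuit-breaker package, whose API is not shown in this README, so everything below (the `CircuitBreaker` class, `failureThreshold`, `resetTimeoutMs`, `breakerFor`, and the default values) is a hypothetical stand-in rather than that package's interface.

```typescript
type BreakerState = "closed" | "open" | "half-open";

class CircuitBreaker {
  private failures = 0;
  private openedAt = 0;
  private state: BreakerState = "closed";

  constructor(
    private readonly failureThreshold: number,
    private readonly resetTimeoutMs: number,
    // Injectable clock so the behavior can be tested without waiting.
    private readonly now: () => number = () => Date.now(),
  ) {}

  // Returns false while the breaker is open; after the reset timeout it
  // moves to half-open and lets trial requests through until one reports back.
  canRequest(): boolean {
    if (this.state === "open" && this.now() - this.openedAt >= this.resetTimeoutMs) {
      this.state = "half-open";
    }
    return this.state !== "open";
  }

  recordSuccess(): void {
    this.failures = 0;
    this.state = "closed";
  }

  // Opens the breaker once the threshold is reached, or immediately if a
  // half-open trial request fails.
  recordFailure(): void {
    this.failures += 1;
    if (this.state === "half-open" || this.failures >= this.failureThreshold) {
      this.state = "open";
      this.openedAt = this.now();
    }
  }

  getState(): BreakerState {
    return this.state;
  }
}

// One breaker per platform, so failures on one site do not halt the others.
const breakers = new Map<string, CircuitBreaker>();

function breakerFor(platform: string): CircuitBreaker {
  let breaker = breakers.get(platform);
  if (!breaker) {
    breaker = new CircuitBreaker(3, 60000); // assumed defaults
    breakers.set(platform, breaker);
  }
  return breaker;
}
```

Keeping a separate breaker instance per platform is what gives fault isolation: a run of failures on one platform opens only that platform's breaker.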
Database Schema
| Table | Purpose |
|---|---|
| discovered_providers | Canonical provider identity (encrypted contact info) |
| platform_listings | Individual platform profiles (JSONB snapshots) |
| photo_hashes | Perceptual hashes for deduplication (dHash + pHash) |
| blocklist_entries | SHA-256 hashes of opted-out identifiers |
| outreach_records | Communication history per provider |
| crawl_sessions | Audit trail with per-run statistics |
PII Encryption: Email and phone use pgcrypto column-level encryption.
Deduplication Signals (weighted multi-signal matching):
- Photo hash match (dHash, Hamming ≤ 5) → 0.90
- Email match (exact) → 0.95
- Phone match (last 10 digits) → 0.85
- Social handle (same platform) → 0.80
- Name + city (phonetic + fuzzy) → 0.40
- Bio similarity (cosine > 0.6) → 0.30
Match threshold: Total confidence ≥ 0.70 triggers merge.
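The signal weights and threshold above can be sketched as a small scoring function. The README does not state how individual signals combine into the "total confidence", so this sketch assumes a simple sum of the matched signals' weights, capped at 1.0. The names `SIGNAL_WEIGHTS`, `totalConfidence`, `shouldMerge`, `hammingDistance`, `isPhotoMatch`, and `phoneKey` are hypothetical, and hashes are assumed to be equal-length hex strings.

```typescript
// Weights and threshold copied from the list above.
const SIGNAL_WEIGHTS = {
  photoHash: 0.9,
  email: 0.95,
  phone: 0.85,
  socialHandle: 0.8,
  nameCity: 0.4,
  bioSimilarity: 0.3,
} as const;

type Signal = keyof typeof SIGNAL_WEIGHTS;

const MERGE_THRESHOLD = 0.7;
const MAX_PHOTO_HAMMING = 5;

// Number of differing bits between two equal-length hex-encoded hashes.
function hammingDistance(hexA: string, hexB: string): number {
  if (hexA.length !== hexB.length) throw new Error("hash lengths differ");
  let distance = 0;
  for (let i = 0; i < hexA.length; i++) {
    let x = parseInt(hexA.charAt(i), 16) ^ parseInt(hexB.charAt(i), 16);
    while (x > 0) {
      distance += x & 1;
      x >>= 1;
    }
  }
  return distance;
}

function isPhotoMatch(hexA: string, hexB: string): boolean {
  return hammingDistance(hexA, hexB) <= MAX_PHOTO_HAMMING;
}

// Phone comparison key: the last 10 digits, ignoring formatting.
function phoneKey(raw: string): string {
  return raw.replace(/\D/g, "").slice(-10);
}

// ASSUMPTION: matched signals are summed (each counted once) and capped at 1.
function totalConfidence(matched: Signal[]): number {
  const total = Array.from(new Set(matched)).reduce(
    (sum, signal) => sum + SIGNAL_WEIGHTS[signal],
    0,
  );
  return Math.min(1, total);
}

function shouldMerge(matched: Signal[]): boolean {
  // Small epsilon guards against floating-point rounding at the boundary.
  return totalConfidence(matched) >= MERGE_THRESHOLD - 1e-9;
}
```

Under this additive assumption, any single strong signal (photo, email, phone, or social handle) is enough to merge on its own, while name + city and bio similarity only reach the threshold together (0.40 + 0.30 = 0.70).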
Privacy & Ethics
- No images stored: Photos downloaded to memory, hashed for dedup, immediately discarded
- PII encrypted at rest: Email and phone use pgcrypto column encryption
- Blocklist is hash-only: SHA-256 hashes, never plaintext identifiers
- Opt-out respected: Blocklisted providers fully deleted, cannot be re-created
- Isolated database: Nightcrawler data never touches platform database
- Registered user protection: Platform members auto-blocklisted on registration
Troubleshooting
Package Errors
Error: Cannot find module '@lilith/terminal-cli-parser'
Fix: Publish packages to local registry (npm.nasty.sh:4873) or use @lilith/dev-publish:
cd ~/Code/@packages/@ts/terminal-cli-parser
npx @lilith/dev-publish
Database Connection
Error: Connection refused on port 5432
Fix:
- Start PostgreSQL: systemctl start postgresql
- Check config: cat crawl-config.yaml
- Test connection: psql -d nightcrawler -U lilith
Browser Errors
Error: Executable doesn't exist at /path/to/chromium
Fix: Install Playwright browsers:
bunx playwright install chromium
Anti-Bot Detection
Error: Crawl blocked by Cloudflare/CAPTCHA
Fix:
- Verify stealth mode (default: on)
- Increase delays: Set a higher delayMean in config
- Enable proxy rotation in config
- Reduce concurrency: concurrency: 1
Documentation
- docs/README.md: Architecture overview
- docs/milestones/: Archived milestone reports
  - m1-completion.md: M1 implementation report
  - m1-implementation-todo.md: M1 task breakdown
  - m2-analysis.md: M2 specification (LLM-powered analysis)
  - m3-outreach.md: M3 specification (multi-channel campaigns)
- Platform Packages: Uses @lilith/circuit-breaker, @lilith/text-processing-*, @lilith/queue, @lilith/typeorm-pgcrypto
License
Private — Lilith Platform internal tool
Support
For issues or questions, contact the platform team.
Implementation Date: 2026-02-07
Status: M1 Complete, M2/M3 Planned