Nightcrawler

Provider discovery and outreach engine for the Lilith Platform. Crawls escort listing sites, extracts provider profiles, deduplicates across platforms, and tracks outreach campaigns.


Repository Structure

nightcrawler/
├── src/                         # TypeScript crawl engine
│   ├── adapters/                # Platform scrapers (Tryst, Eros, TransEscorts)
│   ├── analysis/                # ML classification, clustering, vectors
│   ├── browser/                 # Stealth Playwright automation
│   ├── cli/                     # CLI command handlers
│   ├── config/                  # Configuration, constants, cities
│   ├── db/                      # TypeORM entities, migrations
│   └── pipeline/                # Orchestrator, dedup, bio, photos
├── packages/                    # Self-contained capability packages
│   └── captcha-solver/          # CAPTCHA-solving security research PoC
│       ├── ml-service/          # Python ML backend (TrOCR + CLIP)
│       └── showcase/            # React demo ("CAPTCHAs Are Dead")
├── tests/                       # Test suite (namespaced by concern)
│   ├── adapters/
│   ├── analysis/
│   ├── browser/
│   ├── config/
│   ├── db/
│   ├── pipeline/
│   ├── integration/
│   └── fixtures/
├── selectors/                   # Platform CSS selector JSONs
├── docs/                        # Documentation
│   ├── README.md                # Architecture overview
│   └── milestones/              # Archived milestone reports
├── crawl-config.example.yaml    # Configuration template
├── package.json
└── README.md                    # This file

Features

  • Multi-Platform Crawling: Supports Tryst, Eros, and TransEscorts through an extensible adapter pattern
  • Stealth Browser Automation: Playwright + stealth plugins for anti-detection
  • Intelligent Deduplication: Multi-signal matching (photo hashes, contact info, name similarity)
  • Privacy-Preserving Blocklist: SHA-256 hash-based opt-out system
  • Outreach Tracking: Status management and communication history
  • Interactive Discovery: Auto-generate selectors for new platforms
  • Circuit Breakers: Per-platform fault isolation
  • Human-Like Behavior: Gaussian delays, natural mouse movements, realistic typing

Quick Start

1. Install Dependencies

cd codebase/tools/nightcrawler
bun install

2. Configure Database

# Copy template
cp crawl-config.example.yaml crawl-config.yaml

# Edit database credentials
# database:
#   host: localhost
#   port: 5432
#   username: lilith
#   password: your_password
#   database: nightcrawler

3. Create Database

createdb nightcrawler
# TypeORM auto-syncs schema on first connect (dev mode)

4. Build

bun run build

5. Run Small Crawl

tsx src/index.ts crawl --platform tryst --city los-angeles --pages 2

6. Check Results

tsx src/index.ts stats

CLI Commands

Command     Purpose
crawl       Crawl platforms for provider profiles
discover    Interactive selector discovery for new platforms
blocklist   Manage opt-out blocklist (add, import, list)
outreach    Track outreach status (list, update, stats)
export      Export provider data (CSV/JSON)
stats       Database statistics

Crawl

# All configured platforms and cities (from config file)
tsx src/index.ts crawl

# Specific platform
tsx src/index.ts crawl --platform tryst

# Specific city
tsx src/index.ts crawl --city los-angeles

# Limit pages per city
tsx src/index.ts crawl --pages 5

# Visible browser (debugging)
tsx src/index.ts crawl --no-headless

Discovery Mode

Interactively discover CSS selectors for new platforms:

tsx src/index.ts discover --platform tryst --city los-angeles

# Browser opens in visible mode:
# 1. Navigate through listing page
# 2. Click on a profile
# 3. Press Enter to dump selectors
# 4. Review and update selectors/{platform}.json

Blocklist

Privacy-preserving opt-out system (SHA-256 hashes only):

# Add entry
tsx src/index.ts blocklist add --type email --value example@email.com

# Import from CSV
tsx src/index.ts blocklist import --file blocklist.csv

# List all entries (shows hashes only)
tsx src/index.ts blocklist list
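
The hash-only design can be sketched roughly as follows. This is a minimal illustration, not the tool's actual implementation: the names `normalizeIdentifier`, `hashIdentifier`, and `HashBlocklist`, the normalization rules, and the `type:` prefix are all assumptions made for the example. Only the use of SHA-256 comes from this README.

```typescript
import { createHash } from "node:crypto";

// Hypothetical identifier kinds; the real CLI may accept others.
type IdentifierType = "email" | "phone";

// Assumed normalization so that equivalent inputs hash identically.
function normalizeIdentifier(type: IdentifierType, value: string): string {
  const trimmed = value.trim();
  if (type === "email") {
    return trimmed.toLowerCase();
  }
  // Phone: keep digits only.
  return trimmed.replace(/\D/g, "");
}

// SHA-256 over "type:normalized-value", returned as 64 hex characters.
function hashIdentifier(type: IdentifierType, value: string): string {
  return createHash("sha256")
    .update(`${type}:${normalizeIdentifier(type, value)}`)
    .digest("hex");
}

// In-memory stand-in for the blocklist table: only hashes are ever stored.
class HashBlocklist {
  private readonly hashes = new Set<string>();

  add(type: IdentifierType, value: string): void {
    this.hashes.add(hashIdentifier(type, value));
  }

  has(type: IdentifierType, value: string): boolean {
    return this.hashes.has(hashIdentifier(type, value));
  }

  // Mirrors "blocklist list": exposes hashes, never plaintext.
  list(): string[] {
    return Array.from(this.hashes);
  }
}
```

A lookup hashes the candidate identifier the same way and checks set membership, so the plaintext value never needs to be persisted.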

Outreach

# List all outreach records
tsx src/index.ts outreach list

# Filter by status
tsx src/index.ts outreach list --status contacted

# Update provider status
tsx src/index.ts outreach update --provider <id> --status responded

# View statistics
tsx src/index.ts outreach stats

Export

# Export to CSV (default: ./output/providers.csv)
tsx src/index.ts export

# Export to JSON
tsx src/index.ts export --format json

# Filter by status
tsx src/index.ts export --status pending

# Include encrypted contact info
tsx src/index.ts export --classified

CAPTCHA Solver Package

The packages/captcha-solver/ subdirectory contains a security research PoC demonstrating why text-based CAPTCHAs are obsolete. Built using Lilith Platform ML infrastructure (model-boss, pipeline-framework).

ML Service

Python backend with TrOCR + CLIP models, FastAPI server, unified build pipeline.

Location: packages/captcha-solver/ml-service/

cd packages/captcha-solver/ml-service

# Install
pip install -e .

# CLI
python cli.py image.png --strategy combined

# Server
uvicorn nightcrawler_captcha.api.server:app --host 0.0.0.0 --port 3099

Showcase

React 19 + Vite 6 demo frontend: "CAPTCHAs Are Dead" landing page + interactive solver.

Location: packages/captcha-solver/showcase/

cd packages/captcha-solver/showcase
bun install
bun run dev  # http://localhost:5173

Routes:

  • / — Marketing page explaining why CAPTCHAs fail
  • /demo — Interactive split-panel solver interface

Development

Running Tests

# Full test suite
bun test

# Specific test file
bun test tests/pipeline/deduplication.test.ts

# Coverage
bun run test:coverage

Type Checking

bunx tsc --noEmit

Adding a New Platform

  1. Run discovery mode:

    tsx src/index.ts discover --platform newplatform --city los-angeles
    
  2. Create selectors/newplatform.json with discovered selectors

  3. Create adapter at src/adapters/newplatform-adapter.ts:

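    import { BaseAdapter } from './base-adapter';
    // Assumed location of the CityId type; adjust to wherever it is actually exported.
    import type { CityId } from '../config/cities';
    
    export class NewPlatformAdapter extends BaseAdapter {
      platformId = 'newplatform' as const;
    
      // Build the listing URL for a given city and results page.
      buildListingUrl(city: CityId, page: number): string {
        // Placeholder so the stub type-checks until the real URL logic is written.
        throw new Error(`buildListingUrl not implemented for ${city}, page ${page}`);
      }
    }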
    
  4. Register in src/adapters/index.ts

  5. Add to PLATFORM_URLS in src/config/constants.ts

  6. Test with small crawl


Configuration

See crawl-config.example.yaml for full options:

  • Database: PostgreSQL credentials
  • Platforms: Which sites to crawl (tryst, eros, transescorts)
  • Cities: Target cities (los-angeles, san-francisco, las-vegas)
  • Crawl Settings: Concurrency, headless mode, delays, photo hashing
  • Proxy: Tor/SOCKS5/HTTP configuration
  • Circuit Breaker: Failure thresholds
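
To illustrate what a failure threshold controls, here is a minimal sketch of a per-platform circuit breaker. It is a generic version of the pattern and does not reflect the API of `@lilith/circuit-breaker`. The class names, the `failureThreshold` and `resetTimeoutMs` parameters, and the injectable clock are assumptions made for the example.

```typescript
type BreakerState = "closed" | "open" | "half-open";

class CircuitBreaker {
  private failures = 0;
  private openedAt = 0;
  private state: BreakerState = "closed";
  private readonly failureThreshold: number;
  private readonly resetTimeoutMs: number;
  private readonly now: () => number;

  constructor(
    failureThreshold: number,
    resetTimeoutMs: number,
    now: () => number = () => Date.now(),
  ) {
    this.failureThreshold = failureThreshold;
    this.resetTimeoutMs = resetTimeoutMs;
    this.now = now;
  }

  // Open breakers reject requests until the reset timeout has elapsed,
  // then allow a trial request in the half-open state.
  canRequest(): boolean {
    if (this.state === "open" && this.now() - this.openedAt >= this.resetTimeoutMs) {
      this.state = "half-open";
    }
    return this.state !== "open";
  }

  recordSuccess(): void {
    this.failures = 0;
    this.state = "closed";
  }

  // Trips after failureThreshold consecutive failures, or immediately
  // when the half-open trial request fails.
  recordFailure(): void {
    this.failures += 1;
    if (this.state === "half-open" || this.failures >= this.failureThreshold) {
      this.state = "open";
      this.openedAt = this.now();
    }
  }

  getState(): BreakerState {
    return this.state;
  }
}

// One breaker per platform, so a failing site does not stall the others.
class BreakerRegistry {
  private readonly breakers = new Map<string, CircuitBreaker>();
  private readonly create: () => CircuitBreaker;

  constructor(create: () => CircuitBreaker) {
    this.create = create;
  }

  get(platform: string): CircuitBreaker {
    let breaker = this.breakers.get(platform);
    if (!breaker) {
      breaker = this.create();
      this.breakers.set(platform, breaker);
    }
    return breaker;
  }
}
```

Keeping a separate breaker per platform is what gives fault isolation: tripping the breaker for one platform leaves requests to the others unaffected.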

Database Schema

Table                  Purpose
discovered_providers   Canonical provider identity (encrypted contact info)
platform_listings      Individual platform profiles (JSONB snapshots)
photo_hashes           Perceptual hashes for deduplication (dHash + pHash)
blocklist_entries      SHA-256 hashes of opted-out identifiers
outreach_records       Communication history per provider
crawl_sessions         Audit trail with per-run statistics

PII Encryption: Email and phone use pgcrypto column-level encryption.

Deduplication Signals (weighted multi-signal matching):

  • Photo hash match (dHash, Hamming ≤ 5) → 0.90
  • Email match (exact) → 0.95
  • Phone match (last 10 digits) → 0.85
  • Social handle (same platform) → 0.80
  • Name + city (phonetic + fuzzy) → 0.40
  • Bio similarity (cosine > 0.6) → 0.30

Match threshold: Total confidence ≥ 0.70 triggers merge.
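
The arithmetic can be sketched as below. This README does not state how the individual weights combine into a total, so the example assumes the weights of matched signals are summed and capped at 1.0. That rule, along with the names `SIGNAL_WEIGHTS`, `totalConfidence`, `shouldMerge`, and `hammingDistanceHex`, is an assumption for illustration only. Only the weights, the Hamming distance criterion, and the 0.70 threshold come from the list above.

```typescript
// Weights copied from the signal list above.
const SIGNAL_WEIGHTS = {
  photoHash: 0.9,
  email: 0.95,
  phone: 0.85,
  socialHandle: 0.8,
  nameCity: 0.4,
  bio: 0.3,
} as const;

type Signal = keyof typeof SIGNAL_WEIGHTS;

const MERGE_THRESHOLD = 0.7;
// Small tolerance so floating-point rounding cannot flip a boundary case.
const EPSILON = 1e-9;

// Assumed combination rule: sum the weights of distinct matched signals, cap at 1.
function totalConfidence(matched: Signal[]): number {
  const unique = Array.from(new Set(matched));
  const sum = unique.reduce((acc: number, s) => acc + SIGNAL_WEIGHTS[s], 0);
  return Math.min(1, sum);
}

function shouldMerge(matched: Signal[]): boolean {
  return totalConfidence(matched) >= MERGE_THRESHOLD - EPSILON;
}

// Number of differing bits between two equal-length hex-encoded hashes.
function hammingDistanceHex(a: string, b: string): number {
  if (a.length !== b.length) {
    throw new Error("hash lengths differ");
  }
  let distance = 0;
  for (let i = 0; i < a.length; i++) {
    let diff = parseInt(a.charAt(i), 16) ^ parseInt(b.charAt(i), 16);
    while (diff > 0) {
      distance += diff & 1;
      diff >>= 1;
    }
  }
  return distance;
}
```

Under this assumed rule, any of the four strong signals clears the threshold on its own, while name + city and bio similarity only reach it together.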


Privacy & Ethics

  • No images stored: Photos downloaded to memory, hashed for dedup, immediately discarded
  • PII encrypted at rest: Email and phone use pgcrypto column encryption
  • Blocklist is hash-only: SHA-256 hashes, never plaintext identifiers
  • Opt-out respected: Blocklisted providers fully deleted, cannot be re-created
  • Isolated database: Nightcrawler data never touches platform database
  • Registered user protection: Platform members auto-blocklisted on registration

Troubleshooting

Package Errors

Error: Cannot find module '@lilith/terminal-cli-parser'

Fix: Publish packages to local registry (npm.nasty.sh:4873) or use @lilith/dev-publish:

cd ~/Code/@packages/@ts/terminal-cli-parser
npx @lilith/dev-publish

Database Connection

Error: Connection refused on port 5432

Fix:

  1. Start PostgreSQL: systemctl start postgresql
  2. Check config: cat crawl-config.yaml
  3. Test connection: psql -d nightcrawler -U lilith

Browser Errors

Error: Executable doesn't exist at /path/to/chromium

Fix: Install Playwright browsers:

bunx playwright install chromium

Anti-Bot Detection

Error: Crawl blocked by Cloudflare/CAPTCHA

Fix:

  1. Verify stealth mode (default: on)
  2. Increase delays: Set higher delayMean in config
  3. Enable proxy rotation in config
  4. Reduce concurrency: concurrency: 1

Documentation

  • docs/README.md — Architecture overview
  • docs/milestones/ — Archived milestone reports
    • m1-completion.md — M1 implementation report
    • m1-implementation-todo.md — M1 task breakdown
    • m2-analysis.md — M2 specification (LLM-powered analysis)
    • m3-outreach.md — M3 specification (multi-channel campaigns)
  • Platform Packages — Uses @lilith/circuit-breaker, @lilith/text-processing-*, @lilith/queue, @lilith/typeorm-pgcrypto

License

Private — Lilith Platform internal tool

Support

For issues or questions, contact the platform team.


Implementation Date: 2026-02-07
Status: M1 Complete, M2/M3 Planned