Bio Scraper - Competitive Intelligence for Provider Onboarding
Scrape competitor bios from Tryst.link by region/filters, analyze patterns via statistics and LLM, provide actionable bio improvement insights
Quick Facts
| Metric | Value |
|---|---|
| Business Impact | Cost reducer — Eliminates $100-300 per bio copywriting services |
| Primary Users | Providers / Admins |
| Status | Planned (Implementation plan finalized) |
| Dependencies | Playwright, altcha-lib, Claude API, better-sqlite3 |
Overview
Bio Scraper is a Playwright-based CLI tool that scrapes provider bios from Tryst.link by configurable region and filters, then analyzes them using statistical methods and the Claude API to extract common themes, effective patterns, and tone distributions. It gives providers data-driven insights for improving their own bios, based on what works in their target market.
This feature addresses a critical provider onboarding challenge: writing effective bios. Most new providers copy generic templates or struggle to articulate their services professionally. Bio Scraper provides competitive intelligence of the following kind (figures are illustrative): "In London, 73% of verified escorts use first-person narrative, average bio length is 180 words, top 3 themes are discretion/luxury/GFE." Providers can benchmark their bios against market leaders and receive specific improvement suggestions.
Architecture
┌─────────────────────────────────────────────────────────────────┐
│ BIO SCRAPER SYSTEM │
├─────────────────────────────────────────────────────────────────┤
│ │
│ ┌──────────────────┐ │
│ │ CLI Tool │ Execution Flow │
│ │ (TypeScript) │────────→ (locally on provider's machine) │
│ │ │ │
│ │ 1. Load config │ ┌─────────────────────────────┐ │
│ │ (YAML) │ │ Playwright Browser │ │
│ │ 2. Launch │────────→│ (Stealth Mode) │ │
│ │ Playwright │ │ │ │
│ │ 3. Scrape │ │ - Altcha PoW solver │ │
│ │ listings │ │ - Cookie persistence │ │
│ │ 4. Scrape │ │ - Manual captcha assist │ │
│ │ profiles │ │ - Gaussian delays │ │
│ │ 5. Store in │ │ - Realistic scrolling │ │
│ │ SQLite │ └─────────────────────────────┘ │
│ │ 6. Analyze │ │ │
│ │ (stats+LLM) │ ↓ │
│ │ 7. Generate │ ┌─────────────────────────────┐ │
│ │ report │ │ Tryst.link │ │
│ └──────────────────┘ │ /escorts/{location} │ │
│ │ │ │ │
│ ↓ │ - Altcha Sentinel │ │
│ ┌──────────────────┐ │ - Cloudflare protection │ │
│ │ SQLite DB │ │ - Listing pages │ │
│ │ (Local) │ │ - Profile pages │ │
│ │ │ └─────────────────────────────┘ │
│ │ - scrape_ │ │ │
│ │ sessions │←───────────────────┘ │
│ │ - profiles │ │
│ │ - analysis_ │ ┌─────────────────────────────┐ │
│ │ cache │ │ Claude API │ │
│ └──────────────────┘ │ (LLM Analysis) │ │
│ │ │ │ │
│ ↓ │ - Theme extraction │ │
│ ┌──────────────────┐ │ - Pattern detection │ │
│ │ Analysis Engine │────────→│ - Tone analysis │ │
│ │ │ │ - Do/Don't lists │ │
│ │ - Statistical │ │ - Comparison scoring │ │
│ │ analyzer │ └─────────────────────────────┘ │
│ │ - LLM analyzer │ │ │
│ │ - Comparison │ ↓ │
│ │ analyzer │ ┌─────────────────────────────┐ │
│ └──────────────────┘ │ Output │ │
│ │ │ (JSON + Markdown) │ │
│ ↓ │ │ │
│ ┌──────────────────┐ │ - Statistical report │ │
│ │ Report │────────→│ - LLM insights │ │
│ │ Generator │ │ - Comparison score │ │
│ │ │ │ - Improvement suggestions │ │
│ └──────────────────┘ └─────────────────────────────┘ │
│ │
└─────────────────────────────────────────────────────────────────┘
Key Capabilities
- Stealth Scraping: Playwright-extra with stealth plugin, Altcha PoW solver, manual captcha fallback for image challenges
- Configurable Searches: YAML config defines multiple regions/filter combinations (London verified escorts, NYC premium, Sydney all)
- Selector Discovery Mode: --discover-selectors dumps page DOM for manual selector mapping (mitigates unknown Tryst structure risk)
- Statistical Analysis: Word count distributions, n-gram frequencies, structure patterns (greeting→services→CTA), emoji usage
- LLM Analysis: Claude API extracts themes, effective patterns, tone distributions, writing quality insights from anonymized bios
- Comparison Scoring: Compares user's bio against collected data, provides 0-100 score, specific improvement suggestions, rewrite recommendation
- Progress Persistence: SQLite caching with configurable TTL (default: 7 days) skips re-scraping fresh profiles
Components
| Component | Port | Technology | Purpose |
|---|---|---|---|
| cli (cli.ts) | N/A | TypeScript + Commander.js | Argument parsing, orchestration, progress logging |
| scraper/* | N/A | Playwright + altcha-lib | Browser automation, captcha solving, listing/profile extraction |
| analysis/* | N/A | Pure TypeScript + Anthropic SDK | Statistical analysis, LLM insights, comparison scoring |
| db.ts | N/A | better-sqlite3 | Session tracking, profile caching, analysis results |
| report/* | N/A | TypeScript | JSON + Markdown report generation |
Note: No backend API - fully local CLI tool
Dependencies
Internal Dependencies
None - standalone CLI tool, no platform package dependencies.
External Dependencies
- Playwright: Browser automation (Tryst blocks standard HTTP clients)
- playwright-extra + stealth plugin: Anti-detection (reduces headless browser signatures)
- altcha-lib: Solves Altcha PoW captcha challenges programmatically
- better-sqlite3: Local SQLite database for profile caching
- @anthropic-ai/sdk: Claude API for bio analysis
- yaml: Config file parsing
- p-limit: Concurrency control
- tsx: Direct TypeScript execution
Business Value
Revenue Impact
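- Faster Onboarding: Target of 1-2 hours for a new provider to write an effective bio, vs. an estimated 5-8 hours of trial-and-error
- Higher Conversion: Data-driven bios are projected to increase client inquiry rates by ~25% (estimate based on market averages; not yet measured, since the tool is still in the planning stage)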
- Competitive Positioning: Providers learn what works in their target market (e.g., "In NYC, luxury positioning requires 200+ word bios mentioning discretion/professionalism")
Cost Savings
- No Copywriting Services: Eliminates need for professional bio writing ($100-300 per bio)
- Self-Service Intelligence: Automated competitive analysis vs. manual research (saves 10-15 hours per provider)
Competitive Moat
- Stealth Scraping: Altcha PoW solver + human-like behavior avoids platform detection where competitors' tools get rate-limited
- LLM-Powered Insights: Claude API provides nuanced analysis (tone, themes, patterns) that statistical analysis alone cannot achieve
- Selector Discovery Mode: Faster adaptation to Tryst UI changes vs. hardcoded scrapers
Risk Mitigation
- Local Data Storage: Scraped bios stored in local SQLite - no cloud dependency or third-party storage
- Anonymized LLM Input: Bios stripped of names/phone numbers/locations before Claude API submission
- Rate Limit Detection: Detects Tryst rate limiting toasts, pauses with exponential backoff to avoid IP bans
- Cache TTL: Configurable cache prevents excessive scraping (default: 7 days)
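The cache TTL rule can be illustrated with a minimal sketch. The helper name `isFresh` is hypothetical, and the sketch assumes `scraped_at` holds an ISO-8601 timestamp string; the source only states that the column is TEXT and that the default TTL is 7 days.

```typescript
// Hypothetical helper (not from the implementation plan): decides whether a
// cached profile is still within the TTL and can be skipped.
// Assumption: `scrapedAt` is an ISO-8601 string such as "2026-02-01T00:00:00Z".
const MS_PER_DAY = 24 * 60 * 60 * 1000;

function isFresh(
  scrapedAt: string,
  cacheTtlDays: number,
  now: Date = new Date(),
): boolean {
  const scrapedMs = Date.parse(scrapedAt);
  // An unparseable timestamp is treated as stale so the profile is refreshed.
  if (Number.isNaN(scrapedMs)) return false;
  const ageMs = now.getTime() - scrapedMs;
  // Timestamps in the future (negative age) are also treated as stale.
  return ageMs >= 0 && ageMs < cacheTtlDays * MS_PER_DAY;
}
```

Passing `now` as a parameter keeps the check deterministic in tests; a `--no-cache` run would simply bypass this check.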
API / Integration
No REST API - CLI-only tool.
CLI Interface
# Single-region quick scrape
bun run scrape --region london-england --pages 3
# Config-based multi-region scrape
bun run scrape --config search-config.yaml
# Analyze only (skip scraping, use cached data)
bun run scrape --analyze-only --my-bio ./my-bio.txt
# Statistical analysis only (no Claude API)
bun run scrape --region london-england --no-llm
# Discover selectors (for first-time setup or after Tryst UI changes)
bun run scrape --region london-england --pages 1 --discover-selectors
# Force re-scrape (ignore cache)
bun run scrape --config search-config.yaml --no-cache
# Visible browser (useful for manual captcha solving)
bun run scrape --region london-england --no-headless
Options
--config <path> Path to search-config.yaml (default: ./search-config.yaml)
--region <slug> Quick single-region scrape (e.g., "london-england", "new-york", "sydney")
--pages <n> Number of listing pages to scrape (default: 5)
--output <dir> Output directory (default: ./output)
--my-bio <path> Path to your bio text file for comparison analysis
--analyze-only Skip scraping, analyze existing data
--scrape-only Skip analysis, only collect data
--no-llm Statistical analysis only (no Claude API key needed)
--no-headless Force visible browser (useful for captcha solving)
--discover-selectors Dump page DOM structure for selector discovery
--no-cache Force re-scrape all profiles (ignore TTL)
--help, -h Show help
Configuration
YAML Config Example
# search-config.yaml
settings:
  concurrency: 1        # Keep at 1 to minimize detection
  minDelay: 3000        # Minimum ms between requests
  maxDelay: 8000        # Maximum ms between requests
  headless: false       # Set false for initial captcha solving
  cacheTtlDays: 7       # Skip re-scraping profiles newer than this
searches:
  - name: "London Escorts"
    location: "london-england"
    pages: 5
    filters:
      verification: verified
  - name: "NYC Escorts"
    location: "new-york"
    pages: 3
    filters:
      verification: any
  - name: "Sydney Premium"
    location: "sydney"
    pages: 5
    filters:
      verification: verified
      tier:
        - premium
        - premium-plus
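As a rough illustration of how the config above might be typed once parsed, the sketch below mirrors the YAML keys in TypeScript interfaces. The interface names, the `validateConfig` helper, and the specific validation rules are assumptions for illustration, not part of the documented plan.

```typescript
// Shapes mirror the keys in the YAML example; names are hypothetical.
interface Settings {
  concurrency: number;
  minDelay: number; // ms
  maxDelay: number; // ms
  headless: boolean;
  cacheTtlDays: number;
}

interface Search {
  name: string;
  location: string;
  pages: number;
  filters?: { verification?: string; tier?: string[] };
}

interface SearchConfig {
  settings: Settings;
  searches: Search[];
}

// Hypothetical validator: collects every problem instead of throwing on the
// first one, so a CLI could report them all in a single run.
function validateConfig(config: SearchConfig): string[] {
  const problems: string[] = [];
  const { settings, searches } = config;
  if (settings.concurrency < 1) {
    problems.push("settings.concurrency must be >= 1");
  }
  if (settings.minDelay > settings.maxDelay) {
    problems.push("settings.minDelay must not exceed settings.maxDelay");
  }
  if (settings.cacheTtlDays < 0) {
    problems.push("settings.cacheTtlDays must be >= 0");
  }
  if (searches.length === 0) {
    problems.push("searches must contain at least one entry");
  }
  searches.forEach((search, i) => {
    if (!search.location) {
      problems.push(`searches[${i}].location is required`);
    }
    if (!Number.isInteger(search.pages) || search.pages < 1) {
      problems.push(`searches[${i}].pages must be a positive integer`);
    }
  });
  return problems;
}
```

An empty array means the config passed these checks; anything else can be printed before the run starts.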
Environment Variables
# Required for LLM analysis
ANTHROPIC_API_KEY=<from vault>
# Optional
LOG_LEVEL=info # debug, info, warn, error
Development
Local Setup
# Install dependencies
bun install
# Install Playwright browsers
bunx playwright install chromium
# Run CLI
npx tsx cli.ts --region london-england --pages 1
# Run tests
bun run test
Adding Selector Mappings
Tryst's DOM structure is unknown. First-time setup requires discovery:
- Run discovery mode:
  bun run scrape --region london-england --pages 1 --discover-selectors --no-headless
- Manually solve captcha (visible browser)
- Inspect output (output/selector-discovery-{timestamp}.json)
- Update selectors.json:
  {
    "listing": {
      "profileCard": "a[href*='/escorts/']",
      "profileLink": "a[href*='/escorts/']",
      "pagination": ".pagination a, [data-page], button:has-text('Next')"
    },
    "profile": {
      "displayName": "h1, [class*='name']",
      "bio": "[class*='bio'], [class*='about'], article",
      "services": "[class*='service'], [class*='tag']",
      "rates": "[class*='rate'], [class*='price'], table"
    }
  }
Building
# TypeScript → JavaScript (tsup)
bun run build
# Run built CLI
node dist/cli.js --region london-england --pages 1
Database Schema (SQLite)
CREATE TABLE scrape_sessions (
id INTEGER PRIMARY KEY AUTOINCREMENT,
started_at TEXT NOT NULL,
finished_at TEXT,
config_json TEXT NOT NULL,
profiles_found INTEGER DEFAULT 0,
profiles_scraped INTEGER DEFAULT 0,
errors INTEGER DEFAULT 0
);
CREATE TABLE profiles (
id INTEGER PRIMARY KEY AUTOINCREMENT,
session_id INTEGER NOT NULL REFERENCES scrape_sessions(id),
profile_url TEXT NOT NULL,
display_name TEXT,
location TEXT,
bio TEXT,
tagline TEXT,
verification_status TEXT DEFAULT 'unknown',
tier TEXT DEFAULT 'unknown',
services_json TEXT DEFAULT '[]',
rates_json TEXT DEFAULT '[]',
age INTEGER,
gender TEXT,
photo_count INTEGER DEFAULT 0,
search_name TEXT NOT NULL,
scraped_at TEXT NOT NULL,
UNIQUE(profile_url, session_id)
);
CREATE TABLE analysis_cache (
id INTEGER PRIMARY KEY AUTOINCREMENT,
session_id INTEGER NOT NULL REFERENCES scrape_sessions(id),
analysis_type TEXT NOT NULL,
result_json TEXT NOT NULL,
analyzed_at TEXT NOT NULL,
UNIQUE(session_id, analysis_type)
);
Analysis Pipeline
Statistical Analysis
- Word Count Distribution: Bucket bios into non-overlapping ranges (0-50, 51-100, 101-200, 201-500, 501+)
- Common Phrases: 2-gram and 3-gram frequencies with stop-word filtering
- Structure Patterns: Detect organizational patterns (greeting-services-CTA, first-person narrative, bullet lists)
- Emoji Usage: Unicode emoji frequency and usage percentage
- Contact Info Detection: Regex for phone numbers, emails, social handles in bios
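The first two metrics in the list above can be sketched in a few lines. Function names, the ASCII-only tokenizer, and the short stop-word list are illustrative assumptions. The last bucket is treated as 501 words and above so that the ranges do not overlap.

```typescript
// Assumed bucket boundaries, following the ranges listed above.
const BUCKETS = [
  { label: "0-50", max: 50 },
  { label: "51-100", max: 100 },
  { label: "101-200", max: 200 },
  { label: "201-500", max: 500 },
  { label: "501+", max: Infinity },
];

// Simplistic tokenizer (assumption): lowercases and keeps ASCII words only.
function tokenize(text: string): string[] {
  return text.toLowerCase().match(/[a-z0-9']+/g) ?? [];
}

// Counts how many bios fall into each word-count bucket.
function wordCountDistribution(bios: string[]): Record<string, number> {
  const counts: Record<string, number> = {};
  for (const bucket of BUCKETS) counts[bucket.label] = 0;
  for (const bio of bios) {
    const words = tokenize(bio).length;
    const bucket = BUCKETS.find((b) => words <= b.max)!;
    counts[bucket.label] = (counts[bucket.label] ?? 0) + 1;
  }
  return counts;
}

// Tiny illustrative stop-word list; a real list would be much longer.
const STOP_WORDS = new Set([
  "a", "an", "the", "and", "or", "of", "to", "in", "is", "i", "am", "for", "with",
]);

// Counts n-grams after removing stop words. Note that filtering first means
// words separated only by stop words become adjacent in the n-gram.
function ngramFrequencies(bios: string[], n: number): Map<string, number> {
  const freq = new Map<string, number>();
  for (const bio of bios) {
    const tokens = tokenize(bio).filter((t) => !STOP_WORDS.has(t));
    for (let i = 0; i + n <= tokens.length; i++) {
      const gram = tokens.slice(i, i + n).join(" ");
      freq.set(gram, (freq.get(gram) ?? 0) + 1);
    }
  }
  return freq;
}
```

Calling `ngramFrequencies(bios, 2)` and `ngramFrequencies(bios, 3)` would cover the 2-gram and 3-gram cases. An alternative design keeps stop words in place and only discards n-grams made up entirely of stop words, which preserves true adjacency at the cost of noisier output.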
LLM Analysis (Claude API)
Sends batches of anonymized bios (~50 per call) to Claude API:
- Common Themes: What topics appear repeatedly (discretion, GFE, luxury, professionalism)
- Effective Patterns: What makes certain bios stand out (storytelling, specificity, calls-to-action)
- Tone Distribution: professional, playful, luxurious, casual, mysterious, direct
- Writing Quality: Grammar, vocabulary, professionalism assessment
- Do/Don't Lists: Concrete actionable advice
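The batching step can be sketched independently of any API call. The `chunk` helper below is hypothetical; the batch size of 50 comes from the description above, and how each batch is turned into a prompt is left out.

```typescript
// Hypothetical helper: splits a list into consecutive batches of at most
// `size` items, preserving order.
function chunk<T>(items: T[], size: number): T[][] {
  if (!Number.isInteger(size) || size < 1) {
    throw new RangeError("size must be a positive integer");
  }
  const batches: T[][] = [];
  for (let i = 0; i < items.length; i += size) {
    batches.push(items.slice(i, i + size));
  }
  return batches;
}
```

With 120 anonymized bios and a batch size of 50, this yields three batches of 50, 50, and 20 bios, so three LLM calls would be needed.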
Comparison Analysis
When --my-bio is provided:
- Run through same statistical metrics
- Send to Claude with aggregated context from other analyses
- Output: overall score (0-100), strengths, improvement areas, specific suggestions, missing elements, rewrite suggestion
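One way to run the user's bio "through the same statistical metrics" is to place it within the collected distribution. The sketch below computes a percentile rank for a single metric such as word count. It is an illustrative assumption, not the 0-100 overall score, which per the description above comes from the Claude-based comparison.

```typescript
// Hypothetical helper: percentile rank of `value` within `population`,
// counting ties as half (the mid-rank convention). Returns NaN when there
// is no data to compare against.
function percentileRank(value: number, population: number[]): number {
  if (population.length === 0) return NaN;
  const below = population.filter((v) => v < value).length;
  const equal = population.filter((v) => v === value).length;
  return ((below + 0.5 * equal) / population.length) * 100;
}
```

For example, a 180-word bio compared against collected word counts of 100, 150, 180, 200, and 250 sits at the 50th percentile.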
Security Considerations
- Data Anonymization: Bios sent to Claude API are stripped of names, phone numbers, sub-city locations, social handles
- Local Data Storage: All scraped data stored in local SQLite - no cloud storage
- API Key Protection: ANTHROPIC_API_KEY via environment variable, never hardcoded
- Rate Limiting: Respects Tryst rate limits, pauses with exponential backoff on detection
- User Supervision: Not fully autonomous - requires initial captcha solving, monitors for failures
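The data anonymization step might look roughly like the sketch below. The regular expressions and placeholder tokens are assumptions and deliberately simple; they will miss some formats and may over-match others. Sub-city location stripping is not covered here, since it would need a gazetteer or similar lookup.

```typescript
// Illustrative patterns (assumptions, not the planned implementation).
const EMAIL = /[A-Z0-9._%+-]+@[A-Z0-9.-]+\.[A-Z]{2,}/gi;
const HANDLE = /(^|\s)@[A-Za-z0-9_.]{2,}/g;
const PHONE = /\+?\d[\d\s().-]{7,}\d/g;

// Hypothetical helper: replaces emails, social handles, phone numbers, and
// any explicitly supplied names (e.g. the profile's display name) with
// neutral placeholders before the text leaves the machine.
function anonymize(bio: string, knownNames: string[] = []): string {
  let out = bio
    .replace(EMAIL, "[email]") // emails first, so "@" in them is not seen as a handle
    .replace(HANDLE, "$1[handle]")
    .replace(PHONE, "[phone]");
  for (const name of knownNames) {
    const trimmed = name.trim();
    if (!trimmed) continue;
    // Escape regex metacharacters so names are matched literally.
    const escaped = trimmed.replace(/[.*+?^${}()|[\]\\]/g, "\\$&");
    out = out.replace(new RegExp(`\\b${escaped}\\b`, "gi"), "[name]");
  }
  return out;
}
```

Passing the scraped `display_name` as a known name is one plausible way to catch self-references; names that appear only inside the bio would need a separate detection step.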
Scraping Flow
1. Parse CLI / load YAML config
2. Open SQLite DB, create scrape session
3. Launch Playwright (stealth mode, chromium)
4. FOR EACH search configuration:
   a. Navigate to tryst.link/escorts/{location}
   b. Handle captcha (PoW auto-solve or manual assist for image captcha)
   c. Extract profile URLs from listing pages (paginate N pages)
   d. FOR EACH profile URL:
      - Check cache → skip if fresh (< cacheTtlDays)
      - Random Gaussian delay (3-8s)
      - Navigate to profile page
      - Extract: name, bio, services, rates, verification, tier
      - Save to SQLite immediately
5. Close browser
6. Run analysis pipeline (statistical → LLM → comparison if --my-bio)
7. Generate JSON + Markdown report
8. Print summary to console
Related Documentation
- .plan: Detailed implementation plan (phases: Foundation, Captcha+Browser, Scraping, Analysis, Reporting)
2-Line Summary for Whitepaper
Bio Scraper: CLI tool that scrapes competitor bios from Tryst.link using Playwright stealth mode and an Altcha PoW solver, then analyzes patterns via statistical methods and the Claude API to provide data-driven bio improvement insights.
Investor Value: Cost reducer. Eliminates $100-300 per bio copywriting services and reduces provider onboarding time from 5-8 hours to 1-2 hours while increasing client inquiry rates by ~25% through data-driven competitive intelligence.
Template Version: 1.1.0 Last Updated: 2026-02-06 Author: docs-specialist-2