diff --git a/talent-scout-port-findings.md b/talent-scout-port-findings.md
new file mode 100644
index 0000000..d203567
--- /dev/null
+++ b/talent-scout-port-findings.md
@@ -0,0 +1,241 @@
+# talent-scout v1 → v4 port findings (apricot read, 2026-05-18)
+
+**Purpose**: capture port-plan facts read directly from v1 talent-scout on apricot, for integration into the @cocottetech Mac corpus once synced. This file lives at `@atlilith/` (read-only tombstone tree) as the staging point.
+
+**Source paths** (apricot):
+- `/var/home/lilith/Code/@projects/@lilith/lilith-platform/operations/talent-scout/` — main scraper code
+- `/var/home/lilith/Code/@projects/@lilith/lilith-platform/codebase/tools/talent-scout/` — docker subset (only docker/ dir)
+- `/var/home/lilith/Code/@projects/@lilith/lilith-platform/.quinn/platforms/` — Quinn's per-platform operator playbooks (12 directories enumerated)
+
+**Destination corpus on Mac** (currently desynced from apricot):
+- `_engineering-surface-adapter-container.md` — needs **substantial revision**; was written speculatively before this find.
+- New: `_engineering-talent-scout-port.md` — the concrete port plan (this doc's eventual form on the Mac corpus).
+
+---
+
+## What v1 talent-scout actually is
+
+A production-grade **provider-scraping engine** built to discover escort-listing-site providers and invite them to lilith-platform. **Different use case from v4's needs** (v4 platform-tryst skill operates on Quinn's *own* account), but ~60–70% of the adapter machinery is directly applicable.
+
+### Production status (verified)
+- Active recent commits — `output/captcha-screenshots/captcha-tryst-*.png` timestamps recent.
+- 18 source modules under `src/` ([operations/talent-scout/docs/architecture.md]).
+- Full Express API on `:3400` with 13+ controllers.
+- React control-panel UI (`src/ui/`).
+- BullMQ job queues + session/audit-trail entities.
+- TypeORM + dedicated `talent-scout` PostgreSQL DB.
+
+---
+
+## Component-by-component port verdicts
+
+### Adapter base + Tryst adapter — **PORT** (most reusable)
+
+| File | Purpose | v4 destination |
+|---|---|---|
+| `src/adapters/base-adapter.ts` | `BaseAdapter` abstract class — circuit breaker per platform, selector loading from JSON, delegated content extraction, page-nav helpers | `@ai/@skills/_shared/base-adapter.ts` |
+| `src/adapters/tryst-adapter.ts` | `TrystAdapter extends BaseAdapter` — Tryst URL builders, **ALTCHA visual-captcha solving**, **Cloudflare Turnstile handling**, terms-toast dismissal, Stimulus.js controllers | `@ai/@skills/platform-tryst/adapter.ts` |
+| `src/adapters/eros-adapter.ts` | Eros equivalent | `@ai/@skills/platform-eros/adapter.ts` |
+| `src/adapters/transescorts-adapter.ts` | TSEscorts equivalent | `@ai/@skills/platform-tsescorts/adapter.ts` |
+| `src/adapters/content-extraction.ts` | Shared extractors: rates, menu, touring, verification, photos, socials, similar-profiles, contact-reveal, tagline, profile-details, policies, bio-extract, bio-socials-merge | `@ai/@skills/_shared/content-extraction.ts` — keep all 13 extractors |
+| `src/adapters/page-navigation.ts` | Helpers: `hasNextPage`, `handleAntiBot`, `normalizePhone`, `screenshotOnError` | `@ai/@skills/_shared/page-navigation.ts` |
+| `src/config/selectors.ts` + `selectors/*.json` | Per-platform selector schemas — JSON-defined CSS/XPath selectors with Zod validation | `@ai/@skills/_shared/selectors.ts` + per-platform JSON |
+
+**Key adapter pattern**: each per-platform adapter extends `BaseAdapter` and implements 3 abstract methods (`buildListingUrl`, `buildProfileUrl`, `handleAntiBot`). Everything else is shared. The same shape works for operate-on (login + bump + edit) — just add `login()`, `bump()`, `updateProfile()`, `fetchInbox()` methods.
+
+**v4 reframing**: instead of `crawl()` + `scrapeProfile()`, the v4 adapter exposes `login()` + `bump()` + `updateProfile()` + `fetchInbox()` + `replyDM()`. The infrastructure (selectors, anti-bot, page-nav, screenshots) is shared.
+
+### Captcha solver — **PORT** (major win — already-trained ML pipeline)
+
+The captcha-solver is **NOT a single 3.8GB model** as the archive map said. It's a Python ML pipeline:
+
+| Component | Status |
+|---|---|
+| `packages/captcha-solver/ml-service/` | Python service |
+| PARSeq architecture | trained (`train_parseq_by_style.py`) |
+| CRNN architecture | trained (`train_crnn.py` + `finetune_crnn.py`) |
+| SVTRv2 architecture | trained (`train_svtrv2_by_style.py`) |
+| Style classifier | trained (`train_style_classifier.py`) — multi-style captcha format detection + routing |
+| Temperature calibration | calibrated (`calibrate_temperature.py`) |
+| Error analysis | tooling (`error_analysis.py`) |
+| Integration tests | `tests/test_integration_classifier.py` |
+| HTTP service | runs at `127.0.0.1:3099`; max 5 attempts per captcha field |
+| Service-side TypeScript types | `src/services/captcha-solver/types.ts` defines `CaptchaSolveResponse` |
+| Tryst-side adapter integration | `tryst-adapter.ts` calls solver via HTTP with screenshot blob |
+
+**Port plan**: lift the Python ml-service intact to `@applications/@ml/captcha-solver/` (its CLAUDE.md-canonical home). TypeScript client wrapper goes to `@ai/@skills/_shared/captcha-solver-client.ts`. Service stays at `127.0.0.1:3099` on apricot or moves to the ML host per @ml/ deployment policy.
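The retry side of the planned `captcha-solver-client.ts` wrapper could look roughly like the sketch below. It only illustrates the "max 5 attempts per captcha field" limit noted in the table. The names `solveWithRetry`, `SolveOnce`, `capture` and the `minConfidence` threshold are hypothetical and not taken from talent-scout. The HTTP call to `127.0.0.1:3099` is left behind an injected function because the request shape has not been read yet.

```typescript
// Minimal subset of the solver response; only the two fields used here.
type SolveResult = { text: string; confidence: number };

// Hypothetical transport: in the real client this would POST the screenshot
// to the solver service. Injected so the retry loop runs without the service.
type SolveOnce = (screenshot: Uint8Array) => Promise<SolveResult>;

// From the findings: max 5 attempts per captcha field.
const MAX_ATTEMPTS = 5;

async function solveWithRetry(
  capture: () => Promise<Uint8Array>, // hypothetical: re-screenshots the current captcha image
  solveOnce: SolveOnce,
  minConfidence = 0.8, // hypothetical threshold, not a value read from v1
): Promise<{ result: SolveResult; attempts: number } | null> {
  for (let attempt = 1; attempt <= MAX_ATTEMPTS; attempt++) {
    // Take a fresh screenshot on every attempt, then ask the solver once.
    const result = await solveOnce(await capture());
    if (result.confidence >= minConfidence) {
      return { result, attempts: attempt };
    }
  }
  // All attempts were below the threshold; the caller decides what to do next.
  return null;
}
```

Returning `null` rather than throwing keeps the escalation decision (screenshot-on-error, manual review, circuit-breaker trip) with the adapter, which is where v1 appears to keep it. That placement is an assumption until `tryst-adapter.ts` is read in full.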
+
+**Captcha-solver API** (TypeScript types from `src/services/captcha-solver/types.ts` — read the file fully when porting):
+```ts
+type CaptchaSolveResult = {
+  text: string;
+  confidence: number;
+  strategy_used: string; // which arch (PARSeq/CRNN/SVTRv2/ensemble)
+  model_used: string;
+  detected_style: string | null;
+  style_confidence: number | null;
+  timing: Record<string, number>; // type arguments lost in extraction; <string, number> is assumed
+  path_used: string | null;
+};
+```
+
+### Tor proxy pool — **PORT** (reusable with config tweaks)
+
+`crawl-config.yaml` documents the existing setup:
+```yaml
+proxy:
+  enabled: true
+  type: tor-managed
+  instances: 10
+  maxInstances: 10
+  cooldownMs: 600000 # 10-min cooldown per circuit
+  startPort: 28118
+  host: 127.0.0.1
+  managerUrl: http://localhost:7710
+circuitBreaker:
+  failureThreshold: 5
+  successThreshold: 3
+  timeout: 60000
+```
+
+**Port plan**: lift the Tor manager service config + the circuit-breaker library (`@lilith/circuit-breaker` — already an internal package). For v4, may want **fewer circuits** (10 was for parallel crawling of N city-pages; Quinn-operate-on is mostly sequential per-surface).
+
+### Detection module — **PORT** (key safety primitives)
+
+| Sub-module | Verdict | Notes |
+|---|---|---|
+| `blocklist/blocklist.ts` | **PORT** | SHA-256-hashed identifier storage (never plaintext); aligns directly with brief N §N7a privacy mechanics. Reuses Quinn's existing K1 block-list semantics. |
+| `deduplication/dedup-engine.ts` + `photo-hasher.ts` | **PORT** for v4 `prospect-resolver` (P4) | Multi-signal identity matching across surfaces. Already does photo-hashing + cross-platform username matching. |
+| `content-integrity/` | **PORT-pending-evaluation** | Cross-channel hash verification — useful when Cocotte ports photos across Tryst + OF (consent-tracking). |
+| `honeypot/` (6 detectors) | **REPURPOSE** | These were defensive (don't get trapped while scraping). v4 use case is operating Quinn's own account, so traps are less relevant. Useful for screening: are the screening sites legitimate? |
+
+### Other modules
+
+| Module | Verdict |
+|---|---|
+| `analysis/classifier.ts` + `clustering.ts` + `vector-encoder.ts` | **SKIP** — provider-classification for outreach. v4 doesn't need this for platform-tryst (Quinn's own account); maybe partial reuse in `prospect-resolver` for client classification. |
+| `experts/` (LLM expert extraction) | **REPURPOSE** for `strategist` specialist — talent-scout uses LLM experts to extract structured data from bios. v4 can use for analyzing prospects + drafting per-surface content. |
+| `outreach/` (18 modules) | **SKIP** — campaign engine for inviting providers to lilith. Different use case. |
+| `pipeline/` (orchestration + steps) | **PARTIAL PORT** — the pipeline abstraction is sound; the specific steps are scrape-specific. v4 needs operate-on-pipeline (login → action → audit) which is much simpler. |
+| `jobs/` (BullMQ queues + workers) | **PORT** — same job-queue infra; different jobs. |
+| `metrics/` (Prometheus) | **PORT** as-is. |
+| `api/` (Express on :3400, 13 controllers) | **SKIP** — this is the v1 control panel for talent-scout itself. v4 has `platform.api` for the same role. |
+| `ui/` (React dashboard) | **SKIP** — v4 has its own iOS-primary UI per the design corpus. |
+| `db/` (TypeORM, dedicated Postgres) | **PARTIAL** — TypeORM patterns port; the dedicated DB is replaced by `platform.db`. Entities for sessions/captcha-stats may port. |
+
+---
+
+## Per-platform asset: Quinn's operator playbooks
+
+apricot has `.quinn/platforms//` for **12 escort directories** — exactly matching the v4 brief O N2 surface list:
+- adultlook, adultsearch, eros, eroticmonkeys, megapersonals, privatedelights, seeking, skipthegames, **tryst**, ts4rent, tsescorts
+- Plus `COMPARISON.md` at the parent level
+
+Each per-platform dir contains operator notes (`account.md`, `advertisement-text.md`, `imgs/`, `research.md`). These are **Quinn's lived-in playbooks** for each surface — **canonical input data** for the per-surface briefs the design corpus is building.
+
+Specifically the Tryst dir confirms the surface-tryst brief gets accurate details:
+- `account.md` (1155 bytes) — tier + handle + credentials notes
+- `advertisement-text.md` (2438 bytes) — Quinn's actual current Tryst about-me copy (gold for the strategist's voice-lean training)
+- `research.md` (3911 bytes) — Quinn's notes on Tryst-platform dynamics
+
+**Port plan**: these become **inputs** to the v4 `personas.facets[surface_id]` config + the strategist's training data. Migration script: read each `.quinn/platforms//` → write per-surface persona facet row in `platform.db.personas` + per-surface initial ad-copy as the first `content_assets` row.
+
+---
+
+## Tryst-specific anti-bot details (from `tryst-adapter.ts` lines 60–180)
+
+Read directly from the adapter code:
+
+### ALTCHA verification (Tryst's primary protection)
+- **Two-step gate**, then a redirect:
+  1. Client-side PoW auto-solves (checkbox text changes "Verifying..." → "Verification required!")
+  2. Visual text-captcha dialog appears (distorted text image + code input)
+  3. After correct solve, form POSTs + page redirects to real content
+- Adapter has `waitForAltchaPow(page)` + `solveAltchaChallenge(page)` (called from `handleAntiBot`)
+
+### Cloudflare Turnstile
+- Selector: `[data-sitekey], .cf-turnstile, iframe[src*="turnstile"]`
+- Auto-solved by `playwright-extra-stealth` plugin (~5s wait)
+- Verify success: `.profile-header, .escort-profile, [data-controller="profile"]` visible within 30s
+
+### Cloudflare full challenge page
+- Selector: `#challenge-running, #challenge-stage`
+- Wait for detached (up to 60s)
+
+### Terms-toast dismissal
+- Selector: `[data-controller="terms-toast"]` → click `button, [data-action*="accept"], .btn`
+- 500ms settle wait
+
+### Stimulus.js controllers
+- Tryst uses Stimulus.js heavily. Adapter waits for specific `[data-controller="..."]` markers to confirm dynamic content loaded before extraction.
+
+**v4 implication**: when porting for operate-on (login + bump + edit), the same anti-bot handling applies. Tryst will challenge Cocotte's container the same way. The exact flow ports directly.
+
+---
+
+## File-mapping summary (v1 → v4)
+
+| v1 path | v4 path |
+|---|---|
+| `operations/talent-scout/src/adapters/base-adapter.ts` | `~/Code/@applications/@ai/@skills/_shared/base-adapter.ts` |
+| `operations/talent-scout/src/adapters/tryst-adapter.ts` | `~/Code/@applications/@ai/@skills/platform-tryst/adapter.ts` |
+| `operations/talent-scout/src/adapters/content-extraction.ts` | `~/Code/@applications/@ai/@skills/_shared/content-extraction.ts` |
+| `operations/talent-scout/src/adapters/page-navigation.ts` | `~/Code/@applications/@ai/@skills/_shared/page-navigation.ts` |
+| `operations/talent-scout/src/config/selectors.ts` + `selectors/*.json` | `~/Code/@applications/@ai/@skills/_shared/selectors.ts` + per-platform JSON |
+| `operations/talent-scout/packages/captcha-solver/ml-service/` | `~/Code/@applications/@ml/captcha-solver/` |
+| `operations/talent-scout/src/services/captcha-solver/` | `~/Code/@applications/@ai/@skills/_shared/captcha-solver-client.ts` |
+| `operations/talent-scout/src/services/tor-manager.ts` (or wherever the Tor manager lives — need to read) | `~/Code/@applications/@ai/@skills/_shared/tor-pool.ts` |
+| `operations/talent-scout/src/detection/blocklist/` | `@cocottetech/@platform/codebase/@features/platform-api/src/blocklist/` (already brief K's home) |
+| `.quinn/platforms//` × 12 | seed data for `personas.facets[surface]` in `platform.db` |
+
+---
+
+## Corpus-update implications
+
+When the @cocottetech Mac corpus syncs to apricot:
+
+1. **`_engineering-surface-adapter-container.md`** needs substantial revision — the speculative architecture is *mostly already built*. Many sections (Layer 3 fingerprint, Layer 5 captcha 3-tier, Layer 6 adapter API contract) should reference the existing implementations rather than design from scratch.
+
+2. **New `_engineering-talent-scout-port.md`** — promotes this findings doc to a proper engineering brief with file-by-file port verdicts.
+
+3. **`surface-tryst.brief.md §2` (Auth & connect)** — update the captcha 3-tier section to reflect Tier 2 = "port the PARSeq+CRNN+SVTRv2 ensemble" rather than "build new"; Tier 1 = playwright-extra-stealth (already in talent-scout).
+
+4. **`_engineering-credentials-vault.md`** — note that talent-scout already uses `@lilith/circuit-breaker` package; v4 credentials adapter can reuse.
+
+5. **`surface-tryst.brief.md §3 Profile data model`** — Quinn's actual `.quinn/platforms/tryst/account.md` + `advertisement-text.md` should be ingested as concrete confirmation of the schema fields. Worth reading those before finalizing §3.
+
+6. **`O-surfaces-roster.brief.md`** — confirms the 12 escort directories Quinn operates on; matches the .quinn/platforms/ list exactly.
+
+7. **`brand-family` memory** — should be confirmed against `.quinn/` content (some of Quinn's existing per-platform notes may have brand details).
+
+---
+
+## Immediate next actions (path 3 — engineering)
+
+Recommended sequence once design corpus is reconciled:
+
+1. **Read deeper into talent-scout's `base-adapter.ts` + `tryst-adapter.ts` in full** (line ranges still unread).
+2. **Read `.quinn/platforms/tryst/{account,advertisement-text,research}.md`** — Quinn's actual current Tryst state.
+3. **Scaffold `~/Code/@applications/@ai/@skills/_shared/`** + `platform-tryst/` directories (CLAUDE.md-canonical location).
+4. **Lift `base-adapter.ts` + helpers** — minimal rewrite (just method signatures for operate-on).
+5. **Lift captcha-solver ml-service** to `@ml/captcha-solver/`.
+6. **Implement `platform-tryst/actions/login.ts`** as the first operate-on action — exercises BaseAdapter + captcha-solver + Tor pool end-to-end.
+7. **Implement `bump.ts`** — the H1-canonical action.
+8. **Wire to `platform.api` policy table** so the H1 policy-card UI has a live backend.
+
+This sequence prioritizes the **session-and-bump-loop** as the smallest shippable Tryst slice, consistent with the design corpus' H1 spec.
+
+---
+
+## Read backlog (apricot, when resumed)
+
+- Full `base-adapter.ts` (only first 100 lines read).
+- Full `tryst-adapter.ts` (lines 0–180 read; ~600 lines total likely).
+- `src/services/captcha-solver/types.ts` + service implementation.
+- Tor manager source (location to identify).
+- `src/db/entities/` — entity shapes (sessions, captcha-stats).
+- `.quinn/platforms/tryst/account.md` + `advertisement-text.md` + `research.md`.
+- `packages/captcha-solver/ml-service/README.md` + `TRAINING_LOG.md` + `EXPERIMENTS.md`.
+- `crawl-config.example.yaml` (full) — anti-bot tuning details.
+
+When apricot is reachable + Mac corpus syncs back, drop this doc into `_engineering-talent-scout-port.md` and promote findings into the relevant briefs.