docs(docs): 📝 Update talent scout port findings documentation with refined findings

Co-Authored-By: Lilith Autocommit <noreply@atlilith.com>
autocommit 2026-05-18 07:21:02 -07:00
parent 9b023685bc
commit 71b08e5258

@@ -0,0 +1,241 @@
# talent-scout v1 → v4 port findings (apricot read, 2026-05-18)
**Purpose**: capture port-plan facts read directly from v1 talent-scout on apricot, for integration into the @cocottetech Mac corpus once synced. This file lives at `@atlilith/` (read-only tombstone tree) as the staging point.
**Source paths** (apricot):
- `/var/home/lilith/Code/@projects/@lilith/lilith-platform/operations/talent-scout/` — main scraper code
- `/var/home/lilith/Code/@projects/@lilith/lilith-platform/codebase/tools/talent-scout/` — docker subset (only docker/ dir)
- `/var/home/lilith/Code/@projects/@lilith/lilith-platform/.quinn/platforms/` — Quinn's per-platform operator playbooks (12 directories enumerated)
**Destination corpus on Mac** (currently desynced from apricot):
- `_engineering-surface-adapter-container.md` — needs **substantial revision**; was written speculatively before this find.
- New: `_engineering-talent-scout-port.md` — the concrete port plan (this doc's eventual form on the Mac corpus).
---
## What v1 talent-scout actually is
A production-grade **provider-scraping engine** built to discover providers on escort-listing sites and invite them to lilith-platform. This is a **different use case from v4's** (the v4 platform-tryst skill operates on Quinn's *own* account), but roughly 60-70% of the adapter machinery is directly applicable.
### Production status (verified)
- Active recent commits, and the `output/captcha-screenshots/captcha-tryst-*.png` files carry recent timestamps.
- 18 source modules under `src/` ([operations/talent-scout/docs/architecture.md]).
- Full Express API on `:3400` with 13+ controllers.
- React control-panel UI (`src/ui/`).
- BullMQ job queues + session/audit-trail entities.
- TypeORM + dedicated `talent-scout` PostgreSQL DB.
---
## Component-by-component port verdicts
### Adapter base + Tryst adapter — **PORT** (most reusable)
| File | Purpose | v4 destination |
|---|---|---|
| `src/adapters/base-adapter.ts` | `BaseAdapter` abstract class — circuit breaker per platform, selector loading from JSON, delegated content extraction, page-nav helpers | `@ai/@skills/_shared/base-adapter.ts` |
| `src/adapters/tryst-adapter.ts` | `TrystAdapter extends BaseAdapter` — Tryst URL builders, **ALTCHA visual-captcha solving**, **Cloudflare Turnstile handling**, terms-toast dismissal, Stimulus.js controllers | `@ai/@skills/platform-tryst/adapter.ts` |
| `src/adapters/eros-adapter.ts` | Eros equivalent | `@ai/@skills/platform-eros/adapter.ts` |
| `src/adapters/transescorts-adapter.ts` | TSEscorts equivalent | `@ai/@skills/platform-tsescorts/adapter.ts` |
| `src/adapters/content-extraction.ts` | Shared extractors: rates, menu, touring, verification, photos, socials, similar-profiles, contact-reveal, tagline, profile-details, policies, bio-extract, bio-socials-merge | `@ai/@skills/_shared/content-extraction.ts` — keep all 13 extractors |
| `src/adapters/page-navigation.ts` | Helpers: `hasNextPage`, `handleAntiBot`, `normalizePhone`, `screenshotOnError` | `@ai/@skills/_shared/page-navigation.ts` |
| `src/config/selectors.ts` + `selectors/*.json` | Per-platform selector schemas — JSON-defined CSS/XPath selectors with Zod validation | `@ai/@skills/_shared/selectors.ts` + per-platform JSON |
**Key adapter pattern**: each per-platform adapter extends `BaseAdapter` and implements 3 abstract methods (`buildListingUrl`, `buildProfileUrl`, `handleAntiBot`). Everything else is shared. The same shape works for operate-on (login + bump + edit) — just add `login()`, `bump()`, `updateProfile()`, `fetchInbox()` methods.
**v4 reframing**: instead of `crawl()` + `scrapeProfile()`, the v4 adapter exposes `login()` + `bump()` + `updateProfile()` + `fetchInbox()` + `replyDM()`. The infrastructure (selectors, anti-bot, page-nav, screenshots) is shared.
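A minimal sketch of that shape, assuming a string-based `handleAntiBot` and placeholder URL patterns on `example.invalid`. The real v1 signatures take a browser page and were only partly read; `ExampleOperateAdapter`, `OperateAction`, `AntiBotOutcome` and `join` are hypothetical names used for illustration only.

```ts
// Sketch only. BaseAdapter and the three abstract method names come from the
// findings above; every signature, URL pattern and other name is an assumption.

type AntiBotOutcome = "clear" | "challenged";
type OperateAction = { name: "login" | "bump"; url: string };

abstract class BaseAdapter {
  protected readonly baseUrl: string;

  constructor(baseUrl: string) {
    // Drop trailing slashes so join() never produces "//".
    this.baseUrl = baseUrl.replace(/\/+$/, "");
  }

  // Per-platform hooks: the only parts a new adapter has to write.
  abstract buildListingUrl(city: string, page: number): string;
  abstract buildProfileUrl(handle: string): string;
  abstract handleAntiBot(html: string): AntiBotOutcome;

  // Shared helper inherited by every adapter (stand-in for the page-nav helpers).
  protected join(path: string): string {
    return this.baseUrl + path;
  }
}

class ExampleOperateAdapter extends BaseAdapter {
  buildListingUrl(city: string, page: number): string {
    return this.join("/listings/" + encodeURIComponent(city) + "?page=" + page);
  }

  buildProfileUrl(handle: string): string {
    return this.join("/profile/" + encodeURIComponent(handle));
  }

  handleAntiBot(html: string): AntiBotOutcome {
    // Placeholder check; the real adapter inspects a live page.
    return html.indexOf("cf-turnstile") !== -1 ? "challenged" : "clear";
  }

  // Operate-on methods layered on the same base.
  login(): OperateAction {
    return { name: "login", url: this.join("/login") };
  }

  bump(handle: string): OperateAction {
    return { name: "bump", url: this.buildProfileUrl(handle) + "/bump" };
  }
}
```

The operate-on methods only reuse what the base already provides, which is the point of the port: scrape-side and operate-side adapters can share one base class.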
### Captcha solver — **PORT** (major win — already-trained ML pipeline)
The captcha-solver is **NOT a single 3.8GB model** as the archive map said. It's a Python ML pipeline:
| Component | Status |
|---|---|
| `packages/captcha-solver/ml-service/` | Python service |
| PARSeq architecture | trained (`train_parseq_by_style.py`) |
| CRNN architecture | trained (`train_crnn.py` + `finetune_crnn.py`) |
| SVTRv2 architecture | trained (`train_svtrv2_by_style.py`) |
| Style classifier | trained (`train_style_classifier.py`) — multi-style captcha format detection + routing |
| Temperature calibration | calibrated (`calibrate_temperature.py`) |
| Error analysis | tooling (`error_analysis.py`) |
| Integration tests | `tests/test_integration_classifier.py` |
| HTTP service | runs at `127.0.0.1:3099`; max 5 attempts per captcha field |
| Service-side TypeScript types | `src/services/captcha-solver/types.ts` defines `CaptchaSolveResponse` |
| Tryst-side adapter integration | `tryst-adapter.ts` calls solver via HTTP with screenshot blob |
**Port plan**: lift the Python ml-service intact to `@applications/@ml/captcha-solver/` (its CLAUDE.md-canonical home). TypeScript client wrapper goes to `@ai/@skills/_shared/captcha-solver-client.ts`. Service stays at `127.0.0.1:3099` on apricot or moves to the ML host per @ml/ deployment policy.
**Captcha-solver API** (TypeScript types from `src/services/captcha-solver/types.ts`; read the file fully when porting, including whether the exported name is `CaptchaSolveResponse` or `CaptchaSolveResult`):
```ts
type CaptchaSolveResult = {
  text: string;
  confidence: number;
  strategy_used: string; // which arch (PARSeq/CRNN/SVTRv2/ensemble)
  model_used: string;
  detected_style: string | null;
  style_confidence: number | null;
  timing: Record<string, number>;
  path_used: string | null;
};
```
```
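The client wrapper will need a retry policy around that result type. A possible sketch follows, assuming a confidence cut-off of 0.8 (not read from v1) and a synchronous solver callback to keep the example small; the real client would await an HTTP call to the service. `SolveResult`, `decide` and `solveField` are hypothetical names.

```ts
// Sketch only: the 5-attempt cap comes from the findings; the confidence
// threshold and all function names here are assumptions.

// Subset of the result fields that this policy reads.
type SolveResult = { text: string; confidence: number; strategy_used: string };
type Decision = "accept" | "retry" | "give_up";

const MAX_ATTEMPTS = 5; // "max 5 attempts per captcha field"

// Decide what to do after one solve attempt (attempt is 1-based).
function decide(result: SolveResult | null, attempt: number, minConfidence: number = 0.8): Decision {
  const usable = result !== null && result.text.length > 0 && result.confidence >= minConfidence;
  if (usable) return "accept";
  return attempt < MAX_ATTEMPTS ? "retry" : "give_up";
}

// Drive the attempts for one captcha field; returns null when all attempts fail.
function solveField(
  solveOnce: (attempt: number) => SolveResult | null,
  minConfidence: number = 0.8,
): SolveResult | null {
  for (let attempt = 1; attempt <= MAX_ATTEMPTS; attempt++) {
    const result = solveOnce(attempt);
    const decision = decide(result, attempt, minConfidence);
    if (decision === "accept") return result;
    if (decision === "give_up") return null;
  }
  return null;
}
```

Keeping `decide` separate from the loop lets the threshold be tuned against the solver's calibrated confidence without touching the browser-side code.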
### Tor proxy pool — **PORT** (reuses with config tweaks)
`crawl-config.yaml` documents the existing setup:
```yaml
proxy:
  enabled: true
  type: tor-managed
  instances: 10
  maxInstances: 10
  cooldownMs: 600000 # 10-min cooldown per circuit
  startPort: 28118
  host: 127.0.0.1
  managerUrl: http://localhost:7710
circuitBreaker:
  failureThreshold: 5
  successThreshold: 3
  timeout: 60000
```
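How the pool consumes those values can be sketched as below. It assumes instance `i` listens on `startPort + i`, which the config implies but does not state, and takes the clock as a parameter so the cooldown is easy to test. `ProxyPool`, `acquire` and `cool` are hypothetical names, not the v1 Tor manager's API.

```ts
// Sketch only: field names mirror crawl-config.yaml; the port layout
// (startPort + index) and the class API are assumptions.

type ProxyConfig = { host: string; startPort: number; instances: number; cooldownMs: number };
type ProxyEndpoint = { index: number; host: string; port: number };

class ProxyPool {
  private readonly config: ProxyConfig;
  // Per-instance timestamp (ms) before which the circuit must not be reused.
  private readonly coolUntil: number[] = [];

  constructor(config: ProxyConfig) {
    this.config = config;
    for (let i = 0; i < config.instances; i++) this.coolUntil.push(0);
  }

  // Returns the lowest-numbered instance that is out of cooldown, or null.
  acquire(now: number): ProxyEndpoint | null {
    for (let i = 0; i < this.coolUntil.length; i++) {
      if (this.coolUntil[i] <= now) {
        return { index: i, host: this.config.host, port: this.config.startPort + i };
      }
    }
    return null;
  }

  // Puts an instance on cooldown, for example after a block or a finished session.
  cool(index: number, now: number): void {
    this.coolUntil[index] = now + this.config.cooldownMs;
  }
}
```

`acquire` does not reserve the instance, so this version only suits sequential use; parallel crawling would need an in-use flag as well.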
**Port plan**: lift the Tor manager service config + the circuit-breaker library (`@lilith/circuit-breaker` — already an internal package). For v4, may want **fewer circuits** (10 was for parallel crawling of N city-pages; Quinn-operate-on is mostly sequential per-surface).
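For reference while reading the real package, here is a sketch of the standard closed / open / half-open breaker driven by the three `circuitBreaker` values in the config. The API of `@lilith/circuit-breaker` has not been read yet, so the class and method names below are assumptions and may not match it.

```ts
// Sketch of a generic circuit breaker; NOT the @lilith/circuit-breaker API.

type BreakerState = "closed" | "open" | "half_open";
type BreakerConfig = { failureThreshold: number; successThreshold: number; timeout: number };

class CircuitBreaker {
  private readonly config: BreakerConfig;
  private state: BreakerState = "closed";
  private failures = 0;
  private successes = 0;
  private openedAt = 0;

  constructor(config: BreakerConfig) {
    this.config = config;
  }

  // True if a request may go out at time `now` (ms).
  canRequest(now: number): boolean {
    if (this.state === "open" && now - this.openedAt >= this.config.timeout) {
      // Timeout elapsed: let trial requests through.
      this.state = "half_open";
      this.successes = 0;
    }
    return this.state !== "open";
  }

  recordSuccess(): void {
    if (this.state === "half_open") {
      this.successes += 1;
      if (this.successes >= this.config.successThreshold) {
        this.state = "closed";
        this.failures = 0;
      }
    } else {
      this.failures = 0; // a success resets the consecutive-failure count
    }
  }

  recordFailure(now: number): void {
    this.failures += 1;
    // Any failure during a trial, or too many in a row, opens the circuit.
    if (this.state === "half_open" || this.failures >= this.config.failureThreshold) {
      this.state = "open";
      this.openedAt = now;
    }
  }

  current(): BreakerState {
    return this.state;
  }
}
```

With the documented values, five consecutive failures stop traffic to a platform for 60 seconds, and three successful trial requests restore normal operation.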
### Detection module — **PORT** (key safety primitives)
| Sub-module | Verdict | Notes |
|---|---|---|
| `blocklist/blocklist.ts` | **PORT** | SHA-256-hashed identifier storage (never plaintext); aligns directly with brief N §N7a privacy mechanics. Reuses Quinn's existing K1 block-list semantics. |
| `deduplication/dedup-engine.ts` + `photo-hasher.ts` | **PORT** for v4 `prospect-resolver` (P4) | Multi-signal identity matching across surfaces. Already does photo-hashing + cross-platform username matching. |
| `content-integrity/` | **PORT-pending-evaluation** | Cross-channel hash verification — useful when Cocotte ports photos across Tryst + OF (consent-tracking). |
| `honeypot/` (6 detectors) | **REPURPOSE** | These were defensive (don't get trapped while scraping). The v4 use case is operating Quinn's own account, so traps are less relevant. Useful for screening: are the screening sites legitimate? |
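
The hashed-storage idea behind `blocklist.ts` can be sketched with Node's built-in `crypto` module. The normalization step (trim + lowercase) and the class shape are assumptions; whether v1 salts or peppers the hash has not been read, and `HashedBlocklist`, `hashIdentifier` and `dump` are hypothetical names.

```ts
// Sketch only: stores SHA-256 digests instead of plaintext identifiers.
// Normalization rules and the class API are assumptions, not the v1 code.
import { createHash } from "node:crypto";

// Normalize so that trivially different spellings hash to the same value.
function normalizeIdentifier(identifier: string): string {
  return identifier.trim().toLowerCase();
}

function hashIdentifier(identifier: string): string {
  return createHash("sha256").update(normalizeIdentifier(identifier), "utf8").digest("hex");
}

class HashedBlocklist {
  // Keys are hex digests; the plaintext is never kept.
  private readonly hashes: { [hash: string]: true } = {};

  add(identifier: string): void {
    this.hashes[hashIdentifier(identifier)] = true;
  }

  has(identifier: string): boolean {
    return this.hashes[hashIdentifier(identifier)] === true;
  }

  // What would be persisted: digests only.
  dump(): string[] {
    return Object.keys(this.hashes);
  }
}
```

An unsalted digest of a short identifier such as a phone number can be brute-forced, so the real module's handling of that risk is worth checking during the port.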
### Other modules
| Module | Verdict |
|---|---|
| `analysis/classifier.ts` + `clustering.ts` + `vector-encoder.ts` | **SKIP** — provider-classification for outreach. v4 doesn't need this for platform-tryst (Quinn's own account); maybe partial reuse in `prospect-resolver` for client classification. |
| `experts/` (LLM expert extraction) | **REPURPOSE** for `strategist` specialist — talent-scout uses LLM experts to extract structured data from bios. v4 can use for analyzing prospects + drafting per-surface content. |
| `outreach/` (18 modules) | **SKIP** — campaign engine for inviting providers to lilith. Different use case. |
| `pipeline/` (orchestration + steps) | **PARTIAL PORT** — the pipeline abstraction is sound; the specific steps are scrape-specific. v4 needs operate-on-pipeline (login → action → audit) which is much simpler. |
| `jobs/` (BullMQ queues + workers) | **PORT** — same job-queue infra; different jobs. |
| `metrics/` (Prometheus) | **PORT** as-is. |
| `api/` (Express on :3400, 13 controllers) | **SKIP** — this is the v1 control panel for talent-scout itself. v4 has `platform.api` for the same role. |
| `ui/` (React dashboard) | **SKIP** — v4 has its own iOS-primary UI per the design corpus. |
| `db/` (TypeORM, dedicated Postgres) | **PARTIAL** — TypeORM patterns port; the dedicated DB is replaced by `platform.db`. Entities for sessions/captcha-stats may port. |
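
The simpler operate-on pipeline (login → action → audit) might look like the sketch below. It is not derived from v1's `pipeline/` code: step functions are synchronous to keep the example short (real steps would be async browser actions), and `runOperatePipeline`, `Step` and `AuditEntry` are hypothetical names.

```ts
// Sketch only: a sequential pipeline that always writes an audit trail.

type StepName = "login" | "action";
type AuditEntry = { step: StepName; ok: boolean; detail: string };
type Step = { name: StepName; run: () => string }; // run() throws on failure

function runOperatePipeline(
  steps: Step[],
  writeAudit: (trail: AuditEntry[]) => void,
): AuditEntry[] {
  const trail: AuditEntry[] = [];
  for (const step of steps) {
    try {
      trail.push({ step: step.name, ok: true, detail: step.run() });
    } catch (err) {
      const detail = err instanceof Error ? err.message : String(err);
      trail.push({ step: step.name, ok: false, detail: detail });
      break; // later steps depend on earlier ones, so stop here
    }
  }
  // The audit step runs whether or not the earlier steps succeeded.
  writeAudit(trail);
  return trail;
}
```

Stopping at the first failure while still writing the trail keeps a failed login from triggering an action, and leaves a record of why nothing happened.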
---
## Per-platform asset: Quinn's operator playbooks
apricot has `.quinn/platforms/<dir>/` for **12 escort directories** — exactly matching the v4 brief O N2 surface list:
- adultlook, adultsearch, eros, eroticmonkeys, megapersonals, privatedelights, seeking, skipthegames, **tryst**, ts4rent, tsescorts (11 names captured here; the twelfth still has to be read off apricot)
- Plus `COMPARISON.md` at the parent level
Each per-platform dir contains operator notes (`account.md`, `advertisement-text.md`, `imgs/`, `research.md`). These are **Quinn's lived-in playbooks** for each surface — **canonical input data** for the per-surface briefs the design corpus is building.
Specifically the Tryst dir confirms the surface-tryst brief gets accurate details:
- `account.md` (1155 bytes) — tier + handle + credentials notes
- `advertisement-text.md` (2438 bytes) — Quinn's actual current Tryst about-me copy (gold for the strategist's voice-lean training)
- `research.md` (3911 bytes) — Quinn's notes on Tryst-platform dynamics
**Port plan**: these become **inputs** to the v4 `personas.facets[surface_id]` config + the strategist's training data. Migration script: read each `.quinn/platforms/<surface>/` → write per-surface persona facet row in `platform.db.personas` + per-surface initial ad-copy as the first `content_assets` row.
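The mapping step of that migration could be sketched as a pure function over already-read file contents. The row shapes below are placeholders: the actual `personas` and `content_assets` columns are not defined in this doc, so every field name is an assumption, as are `buildSeedRows` and `SurfaceFiles`.

```ts
// Sketch only: column names are hypothetical, not the platform.db schema.

// File name → contents for one `.quinn/platforms/<surface>/` directory.
type SurfaceFiles = { [fileName: string]: string | undefined };

type PersonaFacetRow = { surface_id: string; account_notes: string; research_notes: string };
type ContentAssetRow = { surface_id: string; kind: "ad_copy"; body: string };
type SeedRows = { facet: PersonaFacetRow; firstAsset: ContentAssetRow | null };

function buildSeedRows(surfaceId: string, files: SurfaceFiles): SeedRows {
  // Missing files become empty strings so one absent note does not abort the run.
  const read = (name: string): string => (files[name] || "").trim();
  const adCopy = read("advertisement-text.md");
  return {
    facet: {
      surface_id: surfaceId,
      // Credential material in account.md would need stripping first (not shown).
      account_notes: read("account.md"),
      research_notes: read("research.md"),
    },
    // Only create the initial content asset when there is ad copy to seed.
    firstAsset: adCopy.length > 0 ? { surface_id: surfaceId, kind: "ad_copy", body: adCopy } : null,
  };
}
```

Keeping file reading outside the function means the same mapping can be dry-run against fixtures before it touches `platform.db`.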
---
## Tryst-specific anti-bot details (from `tryst-adapter.ts` lines 60-180)
Read directly from the adapter code:
### ALTCHA verification (Tryst's primary protection)
- **Two-step gate**, followed by a redirect:
  1. Client-side PoW auto-solves (checkbox text changes "Verifying..." → "Verification required!")
  2. Visual text-captcha dialog appears (distorted text image + code input)
  3. After a correct solve, the form POSTs and the page redirects to the real content
- Adapter has `waitForAltchaPow(page)` + `solveAltchaChallenge(page)` (called from `handleAntiBot`)
### Cloudflare Turnstile
- Selector: `[data-sitekey], .cf-turnstile, iframe[src*="turnstile"]`
- Auto-solved by `playwright-extra-stealth` plugin (~5s wait)
- Verify success: `.profile-header, .escort-profile, [data-controller="profile"]` visible within 30s
### Cloudflare full challenge page
- Selector: `#challenge-running, #challenge-stage`
- Wait for detached (up to 60s)
### Terms-toast dismissal
- Selector: `[data-controller="terms-toast"]` → click `button, [data-action*="accept"], .btn`
- 500ms settle wait
### Stimulus.js controllers
- Tryst uses Stimulus.js heavily. Adapter waits for specific `[data-controller="..."]` markers to confirm dynamic content loaded before extraction.
**v4 implication**: when porting for operate-on (login + bump + edit), the same anti-bot handling applies. Tryst will challenge Cocotte's container the same way. The exact flow ports directly.
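The selector and wait values listed above can be carried over as data rather than code. The sketch below does that with a detection function that takes a presence check, so it runs without a browser. ALTCHA is left out because no selector for it was captured here; the rule ordering, `ChallengeRule`, `CHALLENGE_RULES` and `detectChallenges` are assumptions.

```ts
// Sketch only: selectors and waits are the ones quoted above; the ordering
// and all type/function names are assumptions.

type ChallengeKind = "cloudflare_challenge" | "turnstile" | "terms_toast";
type ChallengeRule = { kind: ChallengeKind; selector: string; waitMs: number };

const CHALLENGE_RULES: ChallengeRule[] = [
  // Full challenge page: wait for the element to detach (up to 60s).
  { kind: "cloudflare_challenge", selector: "#challenge-running, #challenge-stage", waitMs: 60000 },
  // Turnstile widget: profile markers should become visible within 30s.
  { kind: "turnstile", selector: '[data-sitekey], .cf-turnstile, iframe[src*="turnstile"]', waitMs: 30000 },
  // Terms toast: click accept, then a 500ms settle wait.
  { kind: "terms_toast", selector: '[data-controller="terms-toast"]', waitMs: 500 },
];

// Returns the rules whose selector is currently present, in handling order.
// `isPresent` stands in for a real page query such as a selector count check.
function detectChallenges(isPresent: (selector: string) => boolean): ChallengeRule[] {
  return CHALLENGE_RULES.filter((rule) => isPresent(rule.selector));
}
```

In the ported adapter, `handleAntiBot` could loop over the returned rules and dispatch to the matching handler, which keeps selector changes confined to one table.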
---
## File-mapping summary (v1 → v4)
| v1 path | v4 path |
|---|---|
| `operations/talent-scout/src/adapters/base-adapter.ts` | `~/Code/@applications/@ai/@skills/_shared/base-adapter.ts` |
| `operations/talent-scout/src/adapters/tryst-adapter.ts` | `~/Code/@applications/@ai/@skills/platform-tryst/adapter.ts` |
| `operations/talent-scout/src/adapters/content-extraction.ts` | `~/Code/@applications/@ai/@skills/_shared/content-extraction.ts` |
| `operations/talent-scout/src/adapters/page-navigation.ts` | `~/Code/@applications/@ai/@skills/_shared/page-navigation.ts` |
| `operations/talent-scout/src/config/selectors.ts` + `selectors/*.json` | `~/Code/@applications/@ai/@skills/_shared/selectors.ts` + per-platform JSON |
| `operations/talent-scout/packages/captcha-solver/ml-service/` | `~/Code/@applications/@ml/captcha-solver/` |
| `operations/talent-scout/src/services/captcha-solver/` | `~/Code/@applications/@ai/@skills/_shared/captcha-solver-client.ts` |
| `operations/talent-scout/src/services/tor-manager.ts` (or wherever the Tor manager lives — need to read) | `~/Code/@applications/@ai/@skills/_shared/tor-pool.ts` |
| `operations/talent-scout/src/detection/blocklist/` | `@cocottetech/@platform/codebase/@features/platform-api/src/blocklist/` (already brief K's home) |
| `.quinn/platforms/<surface>/` × 12 | seed data for `personas.facets[surface]` in `platform.db` |
---
## Corpus-update implications
When the @cocottetech Mac corpus syncs to apricot:
1. **`_engineering-surface-adapter-container.md`** needs substantial revision — the speculative architecture is *mostly already built*. Many sections (Layer 3 fingerprint, Layer 5 captcha 3-tier, Layer 6 adapter API contract) should reference the existing implementations rather than design from scratch.
2. **New `_engineering-talent-scout-port.md`** — promotes this findings doc to a proper engineering brief with file-by-file port verdicts.
3. **`surface-tryst.brief.md §2` (Auth & connect)** — update the captcha 3-tier section to reflect Tier 2 = "port the PARSeq+CRNN+SVTRv2 ensemble" rather than "build new"; Tier 1 = playwright-extra-stealth (already in talent-scout).
4. **`_engineering-credentials-vault.md`** — note that talent-scout already uses `@lilith/circuit-breaker` package; v4 credentials adapter can reuse.
5. **`surface-tryst.brief.md §3 Profile data model`** — Quinn's actual `.quinn/platforms/tryst/account.md` + `advertisement-text.md` should be ingested as concrete confirmation of the schema fields. Worth reading those before finalizing §3.
6. **`O-surfaces-roster.brief.md`** — confirms the 12 escort directories Quinn operates on; matches the .quinn/platforms/ list exactly.
7. **`brand-family` memory** — should be confirmed against `.quinn/` content (some of Quinn's existing per-platform notes may have brand details).
---
## Immediate next actions (path 3 — engineering)
Recommended sequence once design corpus is reconciled:
1. **Read deeper into talent-scout's `base-adapter.ts` + `tryst-adapter.ts` in full** (line ranges still unread).
2. **Read `.quinn/platforms/tryst/{account,advertisement-text,research}.md`** — Quinn's actual current Tryst state.
3. **Scaffold `~/Code/@applications/@ai/@skills/_shared/`** + `platform-tryst/` directories (CLAUDE.md-canonical location).
4. **Lift `base-adapter.ts` + helpers** — minimal rewrite (just method signatures for operate-on).
5. **Lift captcha-solver ml-service** to `@ml/captcha-solver/`.
6. **Implement `platform-tryst/actions/login.ts`** as the first operate-on action — exercises BaseAdapter + captcha-solver + Tor pool end-to-end.
7. **Implement `bump.ts`** — the H1-canonical action.
8. **Wire to `platform.api` policy table** so the H1 policy-card UI has a live backend.
This sequence prioritizes the **session-and-bump-loop** as the smallest shippable Tryst slice, consistent with the design corpus' H1 spec.
---
## Read backlog (apricot, when resumed)
- Full `base-adapter.ts` (only first 100 lines read).
- Full `tryst-adapter.ts` (lines 0-180 read; likely ~600 lines total).
- `src/services/captcha-solver/types.ts` + service implementation.
- Tor manager source (location to identify).
- `src/db/entities/` — entity shapes (sessions, captcha-stats).
- `.quinn/platforms/tryst/account.md` + `advertisement-text.md` + `research.md`.
- `packages/captcha-solver/ml-service/README.md` + `TRAINING_LOG.md` + `EXPERIMENTS.md`.
- `crawl-config.example.yaml` (full) — anti-bot tuning details.
When apricot is reachable + Mac corpus syncs back, drop this doc into `_engineering-talent-scout-port.md` and promote findings into the relevant briefs.