docs(port): 📝 Update port findings documentation with verified architecture details and v4 reframing

Co-Authored-By: Lilith Autocommit <noreply@atlilith.com>
autocommit 2026-05-18 18:28:57 -07:00
parent 26ccf73242
commit 7685871bb5

@ -144,7 +144,7 @@ A production-grade **provider-scraping engine** built to discover escort-listing
## Component-by-component port verdicts
### Adapter base + Tryst adapter — **PORT** (most reusable)
### Adapter base + Tryst adapter — **PORT** (most reusable) — verified read 2026-05-18 (full files)
| File | Purpose | v4 destination |
|---|---|---|
@ -160,6 +160,22 @@ A production-grade **provider-scraping engine** built to discover escort-listing
**v4 reframing**: instead of `crawl()` + `scrapeProfile()`, the v4 adapter exposes `login()` + `bump()` + `updateProfile()` + `fetchInbox()` + `replyDM()`. The infrastructure (selectors, anti-bot, page-nav, screenshots) is shared.
### BaseAdapter architecture (verified read 2026-05-18 — full 360L)
**Class shape**: `abstract class BaseAdapter implements PlatformAdapter`. Per-platform circuit breaker from `@lilith/circuit-breaker` (rename → `@cocotte/circuit-breaker`). Selector schema loaded from `selectors/<platformId>.json` via `getSelectorSchema(platformId)` (port the registry pattern).
**Abstract methods adapters must implement**: `buildListingUrlFromSlug(slug, page)`, `buildListingUrl(city, page)`, `buildProfileUrl(slug)`. **For v4 CocotteAI**: the operate-on flow (Quinn manages her own profiles) replaces these with `getOwnProfileUrl()`, `getEditUrl()`, `getBumpUrl()`, `getSettingsUrl()` — same selector-schema-driven pattern, different verbs.
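A minimal sketch of how the v4 verbs could hang off an abstract base, assuming the four method names listed above. The class names `OperateOnAdapter` and `ExampleAdapter`, the `listUrls()` helper, and the `example.com` URLs are hypothetical and exist only for illustration; the real adapter would resolve its URLs from the selector schema.

```typescript
// Hypothetical sketch: class names, listUrls(), and example.com URLs are illustrative only.
abstract class OperateOnAdapter {
  constructor(protected readonly platformId: string) {}

  // The four v4 verbs named in the findings; each adapter supplies its own URLs.
  abstract getOwnProfileUrl(): string;
  abstract getEditUrl(): string;
  abstract getBumpUrl(): string;
  abstract getSettingsUrl(): string;

  // Shared helper (assumption): collect every operate-on URL keyed by verb.
  listUrls(): Record<string, string> {
    return {
      profile: this.getOwnProfileUrl(),
      edit: this.getEditUrl(),
      bump: this.getBumpUrl(),
      settings: this.getSettingsUrl(),
    };
  }
}

// Placeholder adapter for a made-up platform.
class ExampleAdapter extends OperateOnAdapter {
  constructor(private readonly slug: string) {
    super("example");
  }
  getOwnProfileUrl(): string {
    return `https://example.com/${this.slug}`;
  }
  getEditUrl(): string {
    return `https://example.com/${this.slug}/edit`;
  }
  getBumpUrl(): string {
    return `https://example.com/${this.slug}/bump`;
  }
  getSettingsUrl(): string {
    return `https://example.com/${this.slug}/settings`;
  }
}
```

Keeping the verbs abstract mirrors the existing `buildListingUrl` / `buildProfileUrl` split: shared infrastructure stays in the base class and only URL construction varies per platform.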
**Helper-module split** (lift verbatim):
- `content-extraction.ts`: pure extractors `extractRates / extractMenu / extractTouringStatus / extractVerification / extractPhotos / extractSocials / extractSimilarProfiles / revealContact / extractTagline / extractProfileDetails / extractPolicies / extractFromBio / mergeBioSocials`.
- `page-navigation.ts`: `hasNextPage / handleAntiBot / normalizePhone / screenshotOnError`.
**`scrapeProfile` parallelism pattern**: 10 extractors fired via `Promise.all([...])` after `waitForSelector(name)`. Port directly — same pattern applies to v4 "read Quinn's current profile state" before computing a diff for the operate-on action.
**Bio-text supplemental extraction**: `extractFromBio()` parses phone/rates/socials out of bio text — DOM extraction takes precedence, bio extraction SUPPLEMENTS only when DOM is empty. **Critical invariant for v4**: applies symmetrically to the operate-on flow when Quinn's *own* draft bio mentions a number that didn't make it into the phone field — surface as a suggestion via the strategist specialist.
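The precedence rule can be sketched as a small pure function. This is an illustration under assumptions, not the adapter's code: the `ContactFields` shape and the name `mergeBioSupplement` are hypothetical, and "empty" is assumed to mean null, undefined, or whitespace-only.

```typescript
// Hypothetical shape: field names are illustrative, not taken from the adapter.
interface ContactFields {
  phone: string | null;
  email: string | null;
  socials: string[];
}

// Assumption: a field counts as empty when it is null, undefined, or whitespace-only.
function isEmpty(value: string | null | undefined): boolean {
  return value == null || value.trim() === "";
}

// DOM values always win; a bio-derived value only fills a field the DOM left empty.
function mergeBioSupplement(
  dom: ContactFields,
  bio: Partial<ContactFields>,
): ContactFields {
  return {
    phone: isEmpty(dom.phone) ? (bio.phone ?? null) : dom.phone,
    email: isEmpty(dom.email) ? (bio.email ?? null) : dom.email,
    socials: dom.socials.length > 0 ? dom.socials : (bio.socials ?? []),
  };
}
```

For the v4 suggestion case, the same comparison could run in the other direction: a bio-derived value that fills an empty field is a candidate to surface to the strategist specialist instead of being written silently.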
**Telemetry hook (`onSolveAttempt?: (data) => void`)**: optional callback set by pipeline worker. Port verbatim — wire to `captcha_solve_attempts` insert in v4 pipeline worker.
### Captcha solver — **PORT** (major win — already-trained ML pipeline)
The captcha-solver is **NOT a single 3.8GB model** as the archive map said. It's a Python ML pipeline:
@ -255,6 +271,14 @@ circuitBreaker:
**Port plan**: lift the Tor manager service config and the circuit-breaker library (`@lilith/circuit-breaker`, already an internal package). v4 may want **fewer circuits**: 10 was sized for parallel crawling of N city pages, whereas the Quinn operate-on flow is mostly sequential per surface.
### Expert pool (LLM extraction experts) — **PARTIAL PORT** — verified read 2026-05-18
`src/experts/expert-pool.ts` runs 5 specialized LLM extractors (`MenuExpert / RateExpert / BioExpert / ContactExpert / PolicyExpert`) against scraped third-party profile HTML to normalize raw data into typed shapes. Execution adapts to pool state: parallel via `Promise.all` when an LLM pool exists (`llmClient.hasPool === true`), sequential otherwise.
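The pool-dependent execution described above might look roughly like the following. The `Expert` interface and `runExperts` function are hypothetical stand-ins; the actual expert and client interfaces are not shown in this document, and only the `hasPool` switch and the use of `Promise.all` come from the findings.

```typescript
// Hypothetical minimal shape; the real expert interface is not shown in the findings.
interface Expert<T> {
  name: string;
  extract(html: string): Promise<T>;
}

// Runs every expert against the same HTML and returns results keyed by expert name.
async function runExperts<T>(
  experts: Expert<T>[],
  html: string,
  hasPool: boolean,
): Promise<Map<string, T>> {
  const results = new Map<string, T>();
  if (hasPool) {
    // A pool can serve concurrent requests, so all experts start at once.
    const values = await Promise.all(experts.map((expert) => expert.extract(html)));
    experts.forEach((expert, i) => results.set(expert.name, values[i]));
  } else {
    // Without a pool, requests are issued one at a time against the single client.
    for (const expert of experts) {
      results.set(expert.name, await expert.extract(html));
    }
  }
  return results;
}
```

Note that `Promise.all` rejects as soon as any one expert fails; whether the original tolerates partial failure is not stated here, so that behaviour should be checked before porting.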
**Port verdict**: **Reuse only for CocotteAI's competitor-research / prospector path** (scanning Tryst listings for competitor pricing, regional trends, etc.). For the operate-on flow (Quinn manages her own profiles) the LLM-extraction experts are mostly N/A — Quinn's draft is already structured, no normalization needed. Drop into `@cocottetech/@platform/codebase/@features/prospector/experts/` (or whichever feature owns competitor scanning); skip for `bookings-tryst` adapter.
**LLM-pool reuse**: the `TalentScoutLLMClient.hasPool` pattern + `acquire/release` semantics already align with `ServicePoolManager` — same shared infrastructure powers captcha-solver pool, Tor circuit pool, LLM expert pool. Three pools, one pattern. **Confirmed unifies cleanly.**
### Detection module — **PORT** (key safety primitives)
| Sub-module | Verdict | Notes |
@ -297,7 +321,39 @@ Specifically the Tryst dir confirms the surface-tryst brief gets accurate detail
---
## Tryst-specific anti-bot details (from `tryst-adapter.ts` lines 60-180)
## Tryst-specific anti-bot details (verified — full `tryst-adapter.ts` 775L read 2026-05-18)
### Reveal flow (Stimulus `unobfuscate-details` controller) — 3-path extraction
For each contact field (email, mobile), the v1 adapter executes a triple-redundant extraction strategy because Tryst's reveal mechanism varies by browser/timing:
1. **Path A — API interception** (primary): `page.waitForResponse((r) => r.url().includes('/api/v1/profiles/') && r.request().method() === 'POST' && r.status() === 200)` installed BEFORE the reveal click; parses JSON for `data.mobile / data.email / data.phone`. Bypasses DOM timing issues.
2. **Path B — DOM polling** (fallback): `waitForFunction` checks `[data-unobfuscate-details-target="output"]` until `●` (obfuscation) chars disappear. Then reads from injected `mailto:` / `sms:` / `tel:` link if present, else from span text.
3. **Path C — postMessage capture** (final fallback): listens to `window.message` events pre-click; iframe sometimes postMessages the revealed value to parent.
**Key trigger detail**: `showButton.dispatchEvent('click')` is used INSTEAD of Playwright's `.click()` — the latter doesn't reliably fire Stimulus action handlers under stealth-mode. **Port directly.**
### CAPTCHA dialog detection
Tryst's fancybox iframe can't be matched reliably by URL, so detection verifies content instead: a frame qualifies only when both `frame.$('img')` and `frame.$('input[type="text"], [role="textbox"]')` are present. It then falls back to `'dialog iframe, .fancybox__container iframe, [id^="fancybox__iframe"]'` with the same content check. Critical: after a successful solve, the iframe navigates to a postMessage-bridge URL that still includes "challenge", so the URL alone is insufficient.
### CAPTCHA submit form quirks
- Image: any `<img>` in iframe (SVG-distorted text)
- Input: `input#captcha_text, input[name="captcha_text"], input[type="text"]`
- Submit: **`<input type="submit">` not `<button>`** — selector must include both: `'input[type="submit"], button[type="submit"], button'`
### Captcha-solver HTTP contract (port verbatim)
`POST http://127.0.0.1:3099/solve` · FormData: `image` (Blob) + `strategy=style_expert` · 30s timeout · returns `CaptchaSolveResponse { text, confidence, strategy_used, model_used, detected_style, style_confidence, timing: { total_ms, preprocess_ms, inference_ms }, path_used }`.
### Telemetry callback contract (`onSolveAttempt`)
Both success and failure paths emit per-attempt telemetry → feeds `captcha_solve_attempts` table:
- Success: `success=true, failureReason=null`
- Failure: `failureReason` classified via body-text-match → `'server_error'` (text "Something went wrong") | `'wrong_answer'` ("did not match") | `'new_captcha'` (default). **Port the classification logic verbatim.**
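A sketch of the failure classification, using only the three reasons and two marker strings listed above. The function name `classifyFailure`, the case-sensitive substring matching, and the order of the checks are assumptions; the original logic should be lifted from the adapter, not from this example.

```typescript
// The three failure reasons named in the telemetry contract.
type FailureReason = "server_error" | "wrong_answer" | "new_captcha";

// Assumption: matching is a case-sensitive substring test, checked in this order.
function classifyFailure(bodyText: string): FailureReason {
  if (bodyText.includes("Something went wrong")) return "server_error";
  if (bodyText.includes("did not match")) return "wrong_answer";
  // Anything unrecognised is treated as a freshly issued challenge.
  return "new_captcha";
}
```

Keeping the classifier a pure function of the body text makes it easy to unit-test against saved page text before wiring it to the `captcha_solve_attempts` insert.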
### Original section: lines 60-180, ALTCHA + Turnstile + terms-toast (kept below)
Read directly from the adapter code: