From 7685871bb55bc3bbfabf4bac21f4ec12415dccbd Mon Sep 17 00:00:00 2001
From: autocommit
Date: Mon, 18 May 2026 18:28:57 -0700
Subject: [PATCH] =?UTF-8?q?docs(port):=20=F0=9F=93=9D=20Update=20port=20fi?=
 =?UTF-8?q?ndings=20documentation=20with=20verified=20architecture=20detai?=
 =?UTF-8?q?ls=20and=20v4=20reframing?=
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

Co-Authored-By: Lilith Autocommit
---
 talent-scout-port-findings.md | 60 +++++++++++++++++++++++++++++++++--
 1 file changed, 58 insertions(+), 2 deletions(-)

diff --git a/talent-scout-port-findings.md b/talent-scout-port-findings.md
index 1fa8e80..a8a6d0e 100644
--- a/talent-scout-port-findings.md
+++ b/talent-scout-port-findings.md
@@ -144,7 +144,7 @@ A production-grade **provider-scraping engine** built to discover escort-listing
 
 ## Component-by-component port verdicts
 
-### Adapter base + Tryst adapter — **PORT** (most reusable)
+### Adapter base + Tryst adapter — **PORT** (most reusable) — verified read 2026-05-18 (full files)
 
 | File | Purpose | v4 destination |
 |---|---|---|
@@ -160,6 +160,22 @@ A production-grade **provider-scraping engine** built to discover escort-listing
 
 **v4 reframing**: instead of `crawl()` + `scrapeProfile()`, the v4 adapter exposes `login()` + `bump()` + `updateProfile()` + `fetchInbox()` + `replyDM()`. The infrastructure (selectors, anti-bot, page-nav, screenshots) is shared.
 
+### BaseAdapter architecture (verified read 2026-05-18 — full 360L)
+
+**Class shape**: `abstract class BaseAdapter implements PlatformAdapter`. Per-platform circuit breaker from `@lilith/circuit-breaker` (rename → `@cocotte/circuit-breaker`). Selector schema loaded from `selectors/.json` via `getSelectorSchema(platformId)` (port the registry pattern).
+
+**Abstract methods adapters must implement**: `buildListingUrlFromSlug(slug, page)`, `buildListingUrl(city, page)`, `buildProfileUrl(slug)`. **For v4 CocotteAI**: the operate-on flow (Quinn manages her own profiles) replaces these with `getOwnProfileUrl()`, `getEditUrl()`, `getBumpUrl()`, `getSettingsUrl()` — same selector-schema-driven pattern, different verbs.
+
+**Helper-module split** (lift verbatim):
+- `content-extraction.ts` — pure extractors: `extractRates / extractMenu / extractTouringStatus / extractVerification / extractPhotos / extractSocials / extractSimilarProfiles / revealContact / extractTagline / extractProfileDetails / extractPolicies / extractFromBio / mergeBioSocials`.
+- `page-navigation.ts` — `hasNextPage / handleAntiBot / normalizePhone / screenshotOnError`.
+
+**`scrapeProfile` parallelism pattern**: 10 extractors fired via `Promise.all([...])` after `waitForSelector(name)`. Port directly — same pattern applies to v4 "read Quinn's current profile state" before computing a diff for the operate-on action.
+
+**Bio-text supplemental extraction**: `extractFromBio()` parses phone/rates/socials out of bio text — DOM extraction takes precedence, bio extraction SUPPLEMENTS only when DOM is empty. **Critical invariant for v4**: applies symmetrically to the operate-on flow when Quinn's *own* draft bio mentions a number that didn't make it into the phone field — surface as a suggestion via the strategist specialist.
+
+**Telemetry hook (`onSolveAttempt?: (data) => void`)**: optional callback set by pipeline worker. Port verbatim — wire to `captcha_solve_attempts` insert in v4 pipeline worker.
+
 ### Captcha solver — **PORT** (major win — already-trained ML pipeline)
 
 The captcha-solver is **NOT a single 3.8GB model** as the archive map said. It's a Python ML pipeline:
@@ -255,6 +271,14 @@ circuitBreaker:
 
 **Port plan**: lift the Tor manager service config + the circuit-breaker library (`@lilith/circuit-breaker` — already an internal package). For v4, may want **fewer circuits** (10 was for parallel crawling of N city-pages; Quinn-operate-on is mostly sequential per-surface).
 
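
The "fewer circuits" point in the port plan above can be illustrated with a minimal TypeScript sketch, assuming circuits are handed out through an acquire/release pool. `CircuitPool`, `withCircuit`, and the numeric circuit ids are hypothetical names chosen for illustration; they are not the actual API of the Tor manager, `ServicePoolManager`, or `@lilith/circuit-breaker`.

```typescript
// Hypothetical sketch: a fixed-size pool of circuit ids with acquire/release.
// Nothing here is taken from the real Tor manager or circuit-breaker package.
class CircuitPool {
  private readonly idle: number[];
  private readonly waiters: Array<(id: number) => void> = [];

  constructor(size: number) {
    // Circuit ids are simply 0..size-1 in this sketch.
    this.idle = Array.from({ length: size }, (_, i) => i);
  }

  // Resolves with a free circuit id; waits if every circuit is in use.
  acquire(): Promise<number> {
    const id = this.idle.shift();
    if (id !== undefined) return Promise.resolve(id);
    return new Promise<number>((resolve) => this.waiters.push(resolve));
  }

  // Hands the circuit to the oldest waiter, or returns it to the idle list.
  release(id: number): void {
    const next = this.waiters.shift();
    if (next) next(id);
    else this.idle.push(id);
  }

  get idleCount(): number {
    return this.idle.length;
  }
}

// Runs one task on a circuit and always releases it, even if the task throws.
async function withCircuit<T>(
  pool: CircuitPool,
  task: (circuitId: number) => Promise<T>,
): Promise<T> {
  const id = await pool.acquire();
  try {
    return await task(id);
  } finally {
    pool.release(id);
  }
}
```

Under these assumptions, a crawler-style workload would construct `new CircuitPool(10)`, while a mostly sequential per-surface flow could use a much smaller size; tasks beyond the pool size simply queue in `acquire()` until a circuit is released.
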
+### Expert pool (LLM extraction experts) — **PARTIAL PORT** — verified read 2026-05-18
+
+`src/experts/expert-pool.ts` runs 5 specialized LLM extractors (`MenuExpert / RateExpert / BioExpert / ContactExpert / PolicyExpert`) against scraped third-party profile HTML to normalize raw data into typed shapes. Execution adapts to pool state: parallel via `Promise.all` when LLM pool exists (`llmClient.hasPool === true`), sequential otherwise.
+
+**Port verdict**: **Reuse only for CocotteAI's competitor-research / prospector path** (scanning Tryst listings for competitor pricing, regional trends, etc.). For the operate-on flow (Quinn manages her own profiles) the LLM-extraction experts are mostly N/A — Quinn's draft is already structured, no normalization needed. Drop into `@cocottetech/@platform/codebase/@features/prospector/experts/` (or whichever feature owns competitor scanning); skip for `bookings-tryst` adapter.
+
+**LLM-pool reuse**: the `TalentScoutLLMClient.hasPool` pattern + `acquire/release` semantics already align with `ServicePoolManager` — same shared infrastructure powers captcha-solver pool, Tor circuit pool, LLM expert pool. Three pools, one pattern. **Confirmed unifies cleanly.**
+
 ### Detection module — **PORT** (key safety primitives)
 
 | Sub-module | Verdict | Notes |
@@ -297,7 +321,39 @@ Specifically the Tryst dir confirms the surface-tryst brief gets accurate detail
 
 ---
 
-## Tryst-specific anti-bot details (from `tryst-adapter.ts` lines 60–180)
+## Tryst-specific anti-bot details (verified — full `tryst-adapter.ts` 775L read 2026-05-18)
+
+### Reveal flow (Stimulus `unobfuscate-details` controller) — 3-path extraction
+
+For each contact field (email, mobile), the v1 adapter executes a triple-redundant extraction strategy because Tryst's reveal mechanism varies by browser/timing:
+
+1. **Path A — API interception** (primary): `page.waitForResponse((r) => r.url().includes('/api/v1/profiles/') && r.request().method() === 'POST' && r.status() === 200)` installed BEFORE the reveal click; parses JSON for `data.mobile / data.email / data.phone`. Bypasses DOM timing issues.
+2. **Path B — DOM polling** (fallback): `waitForFunction` checks `[data-unobfuscate-details-target="output"]` until `●` (obfuscation) chars disappear. Then reads from injected `mailto:` / `sms:` / `tel:` link if present, else from span text.
+3. **Path C — postMessage capture** (final fallback): listens to `window.message` events pre-click; iframe sometimes postMessages the revealed value to parent.
+
+**Key trigger detail**: `showButton.dispatchEvent('click')` is used INSTEAD of Playwright's `.click()` — the latter doesn't reliably fire Stimulus action handlers under stealth-mode. **Port directly.**
+
+### CAPTCHA dialog detection
+
+Tryst's fancybox iframe doesn't reliably URL-match — content-verifies via `frame.$('img')` AND `frame.$('input[type="text"], [role="textbox"]')` both present, then falls back to `'dialog iframe, .fancybox__container iframe, [id^="fancybox__iframe"]'` with same content check. Critical: after successful solve, the iframe navigates to a postMessage-bridge URL that still includes "challenge" — URL alone is insufficient.
+
+### CAPTCHA submit form quirks
+
+- Image: any `<img>` in iframe (SVG-distorted text)
+- Input: `input#captcha_text, input[name="captcha_text"], input[type="text"]`
+- Submit: **`` not `