docs(port): 📝 Update port findings documentation with verified architecture details and v4 reframing
Co-Authored-By: Lilith Autocommit <noreply@atlilith.com>
parent 26ccf73242
commit 7685871bb5
1 changed file with 58 additions and 2 deletions
@@ -144,7 +144,7 @@ A production-grade **provider-scraping engine** built to discover escort-listing
## Component-by-component port verdicts
### Adapter base + Tryst adapter — **PORT** (most reusable)
### Adapter base + Tryst adapter — **PORT** (most reusable) — verified read 2026-05-18 (full files)
| File | Purpose | v4 destination |
|---|---|---|
@@ -160,6 +160,22 @@ A production-grade **provider-scraping engine** built to discover escort-listing
**v4 reframing**: instead of `crawl()` + `scrapeProfile()`, the v4 adapter exposes `login()` + `bump()` + `updateProfile()` + `fetchInbox()` + `replyDM()`. The infrastructure (selectors, anti-bot, page-nav, screenshots) is shared.
### BaseAdapter architecture (verified read 2026-05-18 — full 360L)
**Class shape**: `abstract class BaseAdapter implements PlatformAdapter`. Per-platform circuit breaker from `@lilith/circuit-breaker` (rename → `@cocotte/circuit-breaker`). Selector schema loaded from `selectors/<platformId>.json` via `getSelectorSchema(platformId)` (port the registry pattern).
**Abstract methods adapters must implement**: `buildListingUrlFromSlug(slug, page)`, `buildListingUrl(city, page)`, `buildProfileUrl(slug)`. **For v4 CocotteAI**: the operate-on flow (Quinn manages her own profiles) replaces these with `getOwnProfileUrl()`, `getEditUrl()`, `getBumpUrl()`, `getSettingsUrl()` — same selector-schema-driven pattern, different verbs.
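
A minimal sketch of what the operate-on base class could look like, assuming the four URL builders named above are the only abstract surface. The class name `OperateOnAdapter`, the `SelectorSchema` shape, the `selector()` helper, and the `example.invalid` URLs are hypothetical and not taken from the v1 code.

```typescript
// Hypothetical shape: the real schema loaded by getSelectorSchema() may be richer.
interface SelectorSchema {
  [key: string]: string;
}

abstract class OperateOnAdapter {
  protected readonly platformId: string;
  protected readonly selectors: SelectorSchema;

  constructor(platformId: string, selectors: SelectorSchema) {
    this.platformId = platformId;
    this.selectors = selectors;
  }

  // v4 verbs that stand in for buildListingUrl / buildProfileUrl.
  abstract getOwnProfileUrl(): string;
  abstract getEditUrl(): string;
  abstract getBumpUrl(): string;
  abstract getSettingsUrl(): string;

  // Look up a selector by key and fail loudly when the schema lacks it.
  protected selector(key: string): string {
    const value: string | undefined = this.selectors[key];
    if (value === undefined) {
      throw new Error(`Missing selector "${key}" for platform "${this.platformId}"`);
    }
    return value;
  }
}

// Example subclass with placeholder URLs (example.invalid is not a real endpoint).
class ExampleAdapter extends OperateOnAdapter {
  private readonly slug: string;

  constructor(slug: string) {
    super("example", { bumpButton: "button.bump" });
    this.slug = slug;
  }

  getOwnProfileUrl(): string { return `https://example.invalid/profile/${this.slug}`; }
  getEditUrl(): string { return `https://example.invalid/profile/${this.slug}/edit`; }
  getBumpUrl(): string { return `https://example.invalid/profile/${this.slug}/bump`; }
  getSettingsUrl(): string { return "https://example.invalid/settings"; }
  bumpSelector(): string { return this.selector("bumpButton"); }
}
```

Keeping selector lookup in the base class means a per-platform subclass only supplies URLs, which mirrors the selector-schema-driven pattern described above.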
**Helper-module split** (lift verbatim):
- `content-extraction.ts` — pure extractors: `extractRates / extractMenu / extractTouringStatus / extractVerification / extractPhotos / extractSocials / extractSimilarProfiles / revealContact / extractTagline / extractProfileDetails / extractPolicies / extractFromBio / mergeBioSocials`.
- `page-navigation.ts` — `hasNextPage / handleAntiBot / normalizePhone / screenshotOnError`.
**`scrapeProfile` parallelism pattern**: 10 extractors fired via `Promise.all([...])` after `waitForSelector(name)`. Port directly — same pattern applies to v4 "read Quinn's current profile state" before computing a diff for the operate-on action.
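
The wait-then-fan-out pattern can be sketched as follows. `PageLike` is a hypothetical stand-in for the browser page type, and `readProfileState` and `ExtractorMap` are illustrative names; the v1 code presumably passes its real extractors directly.

```typescript
// Hypothetical minimal page surface; the real adapter uses the browser library's page type.
interface PageLike {
  waitForSelector(selector: string): Promise<unknown>;
}

type ExtractorMap = Record<string, (page: PageLike) => Promise<unknown>>;

async function readProfileState(
  page: PageLike,
  nameSelector: string,
  extractors: ExtractorMap,
): Promise<Record<string, unknown>> {
  // Wait for the profile name first so extractors never run against a half-loaded page.
  await page.waitForSelector(nameSelector);

  const entries = Object.entries(extractors);
  // Start every extractor at once; Promise.all rejects if any single extractor rejects.
  const values = await Promise.all(entries.map(([, extract]) => extract(page)));

  const state: Record<string, unknown> = {};
  entries.forEach(([key], index) => {
    state[key] = values[index];
  });
  return state;
}
```

If one failing extractor should not discard the others, `Promise.allSettled` would be the variant to consider; the notes above only confirm `Promise.all`.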
**Bio-text supplemental extraction**: `extractFromBio()` parses phone/rates/socials out of bio text — DOM extraction takes precedence, bio extraction SUPPLEMENTS only when DOM is empty. **Critical invariant for v4**: applies symmetrically to the operate-on flow when Quinn's *own* draft bio mentions a number that didn't make it into the phone field — surface as a suggestion via the strategist specialist.
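
The precedence rule can be illustrated with a small pure function. This is a sketch of the invariant only: `ContactFields` and `supplementFromBio` are hypothetical names, and the real `extractFromBio` / `mergeBioSocials` signatures are not reproduced here.

```typescript
// Hypothetical field set; the real extractors cover rates and more socials.
interface ContactFields {
  phone?: string;
  email?: string;
  instagram?: string;
}

function isEmpty(value: string | undefined): boolean {
  return value === undefined || value.trim() === "";
}

// DOM values always win; a bio value is used only where the DOM field is empty.
function supplementFromBio(dom: ContactFields, bio: ContactFields): ContactFields {
  const merged: ContactFields = { ...dom };
  for (const key of Object.keys(bio) as (keyof ContactFields)[]) {
    if (isEmpty(dom[key]) && !isEmpty(bio[key])) {
      merged[key] = bio[key];
    }
  }
  return merged;
}
```

For the operate-on flow, the same comparison could flag fields where the bio carries a value the structured field lacks, which is the suggestion case described above.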
**Telemetry hook (`onSolveAttempt?: (data) => void`)**: optional callback set by pipeline worker. Port verbatim — wire to `captcha_solve_attempts` insert in v4 pipeline worker.
### Captcha solver — **PORT** (major win — already-trained ML pipeline)
The captcha-solver is **NOT a single 3.8GB model** as the archive map said. It's a Python ML pipeline:
@@ -255,6 +271,14 @@ circuitBreaker:
**Port plan**: lift the Tor manager service config + the circuit-breaker library (`@lilith/circuit-breaker`, already an internal package). v4 may want **fewer circuits**: 10 was for parallel crawling of N city-pages, whereas the Quinn operate-on flow is mostly sequential per surface.
### Expert pool (LLM extraction experts) — **PARTIAL PORT** — verified read 2026-05-18
`src/experts/expert-pool.ts` runs 5 specialized LLM extractors (`MenuExpert / RateExpert / BioExpert / ContactExpert / PolicyExpert`) against scraped third-party profile HTML to normalize raw data into typed shapes. Execution adapts to pool state: parallel via `Promise.all` when LLM pool exists (`llmClient.hasPool === true`), sequential otherwise.
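
The pool-dependent execution switch might look roughly like this. Only `hasPool` comes from the notes above; the `Expert` interface, `LlmClientLike`, and `runExperts` are assumptions made for illustration.

```typescript
// Hypothetical expert surface; the real experts return typed shapes per domain.
interface Expert {
  name: string;
  run(html: string): Promise<unknown>;
}

interface LlmClientLike {
  hasPool: boolean;
}

async function runExperts(
  llmClient: LlmClientLike,
  experts: Expert[],
  html: string,
): Promise<Record<string, unknown>> {
  const results: Record<string, unknown> = {};

  if (llmClient.hasPool) {
    // A pool can serve several requests at once, so fan out.
    const outputs = await Promise.all(experts.map((expert) => expert.run(html)));
    experts.forEach((expert, index) => {
      results[expert.name] = outputs[index];
    });
  } else {
    // A single LLM instance is shared, so run one expert at a time.
    for (const expert of experts) {
      results[expert.name] = await expert.run(html);
    }
  }
  return results;
}
```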
**Port verdict**: **Reuse only for CocotteAI's competitor-research / prospector path** (scanning Tryst listings for competitor pricing, regional trends, etc.). For the operate-on flow (Quinn manages her own profiles) the LLM-extraction experts are mostly N/A — Quinn's draft is already structured, no normalization needed. Drop into `@cocottetech/@platform/codebase/@features/prospector/experts/` (or whichever feature owns competitor scanning); skip for `bookings-tryst` adapter.
**LLM-pool reuse**: the `TalentScoutLLMClient.hasPool` pattern + `acquire/release` semantics already align with `ServicePoolManager` — same shared infrastructure powers captcha-solver pool, Tor circuit pool, LLM expert pool. Three pools, one pattern. **Confirmed unifies cleanly.**
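
To make the shared `acquire/release` pattern concrete, here is a generic sketch. `SimplePool` is a hypothetical class written for this note; it does not reproduce the `ServicePoolManager` API, which is not shown in the source.

```typescript
// Hypothetical generic pool: one implementation could back solver, Tor circuit, and LLM pools.
class SimplePool<T> {
  private readonly idle: T[];
  private readonly waiters: Array<(resource: T) => void> = [];

  constructor(resources: T[]) {
    this.idle = [...resources];
  }

  // Resolve immediately when a resource is idle, otherwise queue the caller.
  acquire(): Promise<T> {
    if (this.idle.length > 0) {
      return Promise.resolve(this.idle.pop() as T);
    }
    return new Promise<T>((resolve) => {
      this.waiters.push(resolve);
    });
  }

  // Hand the resource straight to the oldest waiter, or park it as idle.
  release(resource: T): void {
    const next = this.waiters.shift();
    if (next) {
      next(resource);
    } else {
      this.idle.push(resource);
    }
  }

  // Acquire, run the task, and always release, even when the task throws.
  async use<R>(task: (resource: T) => Promise<R>): Promise<R> {
    const resource = await this.acquire();
    try {
      return await task(resource);
    } finally {
      this.release(resource);
    }
  }

  get idleCount(): number {
    return this.idle.length;
  }
}
```

Wrapping every call site in `use()` is one way to guarantee release on error paths, which matters most for scarce resources such as Tor circuits.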
### Detection module — **PORT** (key safety primitives)
| Sub-module | Verdict | Notes |
@@ -297,7 +321,39 @@ Specifically the Tryst dir confirms the surface-tryst brief gets accurate detail
---
## Tryst-specific anti-bot details (from `tryst-adapter.ts` lines 60–180)
## Tryst-specific anti-bot details (verified — full `tryst-adapter.ts` 775L read 2026-05-18)
### Reveal flow (Stimulus `unobfuscate-details` controller) — 3-path extraction
For each contact field (email, mobile), the v1 adapter executes a triple-redundant extraction strategy because Tryst's reveal mechanism varies by browser/timing:
1. **Path A — API interception** (primary): `page.waitForResponse((r) => r.url().includes('/api/v1/profiles/') && r.request().method() === 'POST' && r.status() === 200)` installed BEFORE the reveal click; parses JSON for `data.mobile / data.email / data.phone`. Bypasses DOM timing issues.
2. **Path B — DOM polling** (fallback): `waitForFunction` checks `[data-unobfuscate-details-target="output"]` until `●` (obfuscation) chars disappear. Then reads from injected `mailto:` / `sms:` / `tel:` link if present, else from span text.
3. **Path C — postMessage capture** (final fallback): listens for `message` events on `window` before the click; the iframe sometimes sends the revealed value to the parent via `postMessage`.
**Key trigger detail**: `showButton.dispatchEvent('click')` is used INSTEAD of Playwright's `.click()` — the latter doesn't reliably fire Stimulus action handlers under stealth-mode. **Port directly.**
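
The ordering of the three paths can be sketched as a small fallback chain. This assumes each path is wrapped as a function returning the revealed value or `null`, and that Paths A and C were armed before the click was dispatched; `revealWithFallbacks` and `RevealPath` are hypothetical names, not v1 identifiers.

```typescript
// Each path resolves to the revealed value, or null when it found nothing.
type RevealPath = () => Promise<string | null>;

const OBFUSCATION_CHAR = "●";

// A value counts as revealed only when it is non-empty and no longer obfuscated.
function isRevealed(value: string | null): value is string {
  return value !== null && value.trim() !== "" && !value.includes(OBFUSCATION_CHAR);
}

// Try paths in priority order (API interception, DOM polling, postMessage).
async function revealWithFallbacks(paths: RevealPath[]): Promise<string | null> {
  for (const path of paths) {
    try {
      const value = await path();
      if (isRevealed(value)) {
        return value;
      }
    } catch {
      // A timeout or parse error in one path must not block the next one.
    }
  }
  return null;
}
```

A caller would pass the three paths in A, B, C order, so a slow or failed API interception degrades to DOM polling without failing the whole reveal.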
### CAPTCHA dialog detection
Tryst's fancybox iframe cannot be identified reliably by URL, so the adapter verifies by content: a frame qualifies only when `frame.$('img')` AND `frame.$('input[type="text"], [role="textbox"]')` both match. It then falls back to `'dialog iframe, .fancybox__container iframe, [id^="fancybox__iframe"]'` with the same content check. Critical: after a successful solve, the iframe navigates to a postMessage-bridge URL that still includes "challenge", so the URL alone is insufficient.
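
The content check could be expressed as below. `FrameLike` is a hypothetical structural stand-in for the browser library's frame type, and the two function names are illustrative; only the two selectors come from the notes above.

```typescript
// Hypothetical minimal frame surface: $() resolves to an element handle or null.
interface FrameLike {
  url(): string;
  $(selector: string): Promise<object | null>;
}

// A frame is a live CAPTCHA dialog only if it holds both an image and a text input.
async function looksLikeCaptchaFrame(frame: FrameLike): Promise<boolean> {
  const [image, input] = await Promise.all([
    frame.$("img"),
    frame.$('input[type="text"], [role="textbox"]'),
  ]);
  return image !== null && input !== null;
}

// Return the first frame that passes the content check, ignoring URLs entirely.
async function findCaptchaFrame(frames: FrameLike[]): Promise<FrameLike | null> {
  for (const frame of frames) {
    if (await looksLikeCaptchaFrame(frame)) {
      return frame;
    }
  }
  return null;
}
```

Because the check never reads `url()`, the post-solve bridge frame whose URL still contains "challenge" is rejected as long as it no longer contains the image and input.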
### CAPTCHA submit form quirks
- Image: any `<img>` in iframe (SVG-distorted text)
- Input: `input#captcha_text, input[name="captcha_text"], input[type="text"]`
- Submit: **`<input type="submit">` not `<button>`** — selector must include both: `'input[type="submit"], button[type="submit"], button'`
### Captcha-solver HTTP contract (port verbatim)
`POST http://127.0.0.1:3099/solve` · FormData: `image` (Blob) + `strategy=style_expert` · 30s timeout · returns `CaptchaSolveResponse { text, confidence, strategy_used, model_used, detected_style, style_confidence, timing: { total_ms, preprocess_ms, inference_ms }, path_used }`.
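
A client for this contract might be sketched as follows, using the global `fetch`, `FormData`, and `Blob` available in recent Node versions. The field names come from the contract above, but their TypeScript types, the `captcha.png` filename, and the function names are assumptions.

```typescript
// Field names are from the contract; the types are assumed, not confirmed.
interface CaptchaSolveResponse {
  text: string;
  confidence: number;
  strategy_used: string;
  model_used: string;
  detected_style: string;
  style_confidence: number;
  timing: { total_ms: number; preprocess_ms: number; inference_ms: number };
  path_used: string;
}

const SOLVER_URL = "http://127.0.0.1:3099/solve";
const SOLVER_TIMEOUT_MS = 30_000;

// Build the multipart body: the image plus the fixed strategy field.
function buildSolveForm(image: Blob): FormData {
  const form = new FormData();
  form.append("image", image, "captcha.png"); // filename is an assumption
  form.append("strategy", "style_expert");
  return form;
}

async function solveCaptcha(image: Blob): Promise<CaptchaSolveResponse> {
  const controller = new AbortController();
  // Abort the request once the 30s budget from the contract is spent.
  const timer = setTimeout(() => controller.abort(), SOLVER_TIMEOUT_MS);
  try {
    const response = await fetch(SOLVER_URL, {
      method: "POST",
      body: buildSolveForm(image),
      signal: controller.signal,
    });
    if (!response.ok) {
      throw new Error(`Captcha solver returned HTTP ${response.status}`);
    }
    return (await response.json()) as CaptchaSolveResponse;
  } finally {
    clearTimeout(timer);
  }
}
```

The cast on the JSON body performs no runtime validation, so a production port may want to check at least `text` and `confidence` before trusting the result.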
### Telemetry callback contract (`onSolveAttempt`)
Both success and failure paths emit per-attempt telemetry → feeds `captcha_solve_attempts` table:
- Success: `success=true, failureReason=null`
- Failure: `failureReason` classified via body-text-match → `'server_error'` (text "Something went wrong") | `'wrong_answer'` ("did not match") | `'new_captcha'` (default). **Port the classification logic verbatim.**
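
The classification and callback contract above can be sketched like this. The three reason strings and the matched phrases come from the notes; the `SolveAttempt` shape is reduced to two fields, and the case-sensitive `includes` matching and check order are assumptions about the v1 logic.

```typescript
type FailureReason = "server_error" | "wrong_answer" | "new_captcha";

// Reduced telemetry payload; the real one likely carries solver metadata too.
interface SolveAttempt {
  success: boolean;
  failureReason: FailureReason | null;
}

// Classify a failed attempt from the page body text shown after submit.
function classifyFailure(bodyText: string): FailureReason {
  if (bodyText.includes("Something went wrong")) {
    return "server_error";
  }
  if (bodyText.includes("did not match")) {
    return "wrong_answer";
  }
  // Anything else is treated as the site presenting a fresh CAPTCHA.
  return "new_captcha";
}

// Emit one telemetry record per attempt; the hook is optional, so guard the call.
function reportAttempt(
  onSolveAttempt: ((data: SolveAttempt) => void) | undefined,
  success: boolean,
  bodyText: string,
): void {
  onSolveAttempt?.({
    success,
    failureReason: success ? null : classifyFailure(bodyText),
  });
}
```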
### Original section: lines 60–180 ALTCHA + Turnstile + terms-toast (kept below)
Read directly from the adapter code: