Natalie 263cc18aa1 feat(rating): full-history capture + multi-axis SDK rating profile

Replace the brittle keyword verdict with an LLM-consolidated rating profile per
caller, and capture the COMPLETE report history instead of the first screen.

- open_report_detail(): land on the caller detail page (taps the Recent-lookups
  row when the number was searched before) — fixes the 0-reports regression
- expand_all_reports() + capture_full_history(): tap "View all N", scroll-capture
  every page until the UI dump stops changing; merge_reports() dedupes across pages
- build_rating_profile() (batch SDK, sonnet): 0-100 score + A–F grade + per-axis
  sub-scores (reliability/payment/respect/safety) + signals + nuanced_notes.
  Domain nuance: deposit mentions weight POSITIVE; law-enforcement forces denied
- result_from_profile(): honors recommendation, score fallback, hard safety override
- decide_result(): kept as deterministic fallback, fixed to never approve over a
  model 'denied' / red flag and to match punctuation variants (no-show == no show)
- save_history(): persist full consolidated history + profile per caller
- tests: 18/18 (mapping, dedupe, safety override, full flow); DESIGN.md updated

Verified live against the redroid droplet (45.55.191.82): 15166687821 → 3 reports
consolidated → 18/100 grade F → denied, with multi-axis breakdown.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

2026-06-28 10:10:56 -04:00



@mr-number — Design

How this app is structured, why it's shaped this way, and how a screening flows end-to-end. Companion to the top-level README.md (which is the usage guide); this document is the why.

1. What problem it solves

Quinn screens inbound phone numbers against Mr. Number (com.mrnumber.blocker, by Hiya) — a paid app whose crowdsourced caller reports flag no-shows, abusive clients, timewasters, and law-enforcement stings. Mr. Number has no public API; the reports only exist inside the Android app's UI. So the only way to extract them programmatically is to drive the app like a human and read the screen.

This app automates that: drive the app over adb → screenshot the reports screen → vision-extract the text → decide a verdict → record it into the platform's screening service. The recorded verdict then feeds reputation events, client filters, and the prospect-reply gate.

2. Design principles

These are the constraints that shaped every decision below.

Standalone supporting app, not a platform feature. It lives in its own repo (~/Code/@applications/@mr-number/), exactly like @mac-sync and net-tools. It never imports platform code and never opens the platform DB. The only link is HTTP. This keeps the device-automation mess (adb, emulators, droplets, vision) out of the platform and lets the app be deployed, versioned, and broken independently.
The platform owns the data; the app owns the device. The screening data model (screening_checks, reputation events), the consume-side gate, and the trigger queue all live in lilith-platform. This app is a producer of verdicts and a consumer of trigger jobs — never an owner of screening state. There is exactly one source of truth for a screening result, and it is the platform DB.
One pipeline, two front-ends. The device→vision→record pipeline is implemented once, in client/mr_lookup.py. The CLI and the MCP are both thin front-ends over it (the MCP literally shells out to mr_lookup.py --json). No second implementation to drift.
Device-agnostic. The same code runs against a USB phone on plum or the cloud redroid droplet — selected purely by --device / $MR_NUMBER_DEVICE. No forks per host.
Testable without hardware. The whole flow (navigation, screenshot, vision, record) is unit-testable with mocks — no real device, adb, app, or network needed. The wire body sent to the platform is asserted in tests, because that contract is the thing most likely to silently break.
Black-independent. The homelan (black/apricot) is dead. Nothing here depends on it: the vision SDK is invoked locally, the MCP's only npm dep is the public MCP SDK, and the record target is the public quinn.api edge.
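
The device-agnostic principle above can be illustrated with a minimal sketch. It assumes the flag wins over the environment variable (the text only says both select the device), and `resolve_device` and `adb_cmd` are hypothetical helper names, not functions confirmed to exist in mr_lookup.py. The `-s <serial>` option is standard adb.

```python
from __future__ import annotations

import argparse
import os
from typing import Mapping, Optional, Sequence


def resolve_device(
    argv: Sequence[str], env: Optional[Mapping[str, str]] = None
) -> Optional[str]:
    """Pick the target device: --device first, then $MR_NUMBER_DEVICE (assumed order)."""
    env = os.environ if env is None else env
    parser = argparse.ArgumentParser(add_help=False)
    parser.add_argument("--device", default=None)
    # Ignore every other CLI argument; only the device choice matters here.
    args, _unknown = parser.parse_known_args(list(argv))
    return args.device or env.get("MR_NUMBER_DEVICE") or None


def adb_cmd(device: Optional[str], *args: str) -> list[str]:
    """Build an adb argv; `-s` pins the command to one device when a serial is given."""
    prefix = ["adb", "-s", device] if device else ["adb"]
    return prefix + list(args)
```

Under these assumptions a USB serial and the droplet's `45.55.191.82:5555` address go through the same code path, which is the point of the principle: the host only changes the string handed to `-s`.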

3. Architecture

3.1 Two tiers (mirrors mac-sync)

┌─ plum (this Mac) ─────────────────────────┐        ┌─ DO redroid droplet ────────────────┐
│ client/mr_lookup.py    lookup + vision     │  adb   │ lilith-store-redroid  45.55.191.82  │
│ client/console-tray/   SSH-tunnel console  │◄──────►│  · redroid Android (Mr. Number)     │
│ mcp/                   stdio MCP front-end  │ :5555  │  · ws-scrcpy            :8000        │
│ deploy/                install + droplet    │        │  · cloud/adb-keyboard  :8001 (loop) │
└───────────────┬───────────────────────────┘        │  · /data on volume redroidmrnumberdata
                │ HTTPS + service token               └──────────────────────────────────────┘
                ▼
   quinn.api screening service  (POST /admin/screening/check via my.transquinnftw.com)

plum tier runs the brain: the lookup logic, the Claude vision call, the record POST, and the MCP. It does not need to be the host the Android device lives on.
cloud tier is just a headless Android device. The droplet runs the OS + the app; cloud/adb-keyboard/server.py + ws-scrcpy give a browser console for the one thing automation can't do — a human Google/Mr. Number sign-in and occasional calibration.

3.2 Why a droplet and a USB fallback

The lookup needs a real Android device with the paid app signed in. Two ways to get one:

Redroid droplet (45.55.191.82, primary): containerized Android on DigitalOcean, always-on, with a persistent /data volume so the signed-in paid state survives reboots. adb is reached over the network (adb connect 45.55.191.82:5555).
USB phone on plum (fallback): a physical phone with the paid app and USB debugging. The tool runs unchanged — just point --device at the serial.

The first redroid attempt (2026-06-27, on the stock-kernel ct:prod box) failed: binder/ashmem would not load, and the box was destroyed. Its post-mortem is kept in docs/archive/. The current lilith-store-redroid droplet is the working successor and should not be conflated with that first attempt.

3.3 Directory layout and why each piece exists

client/
  mr_lookup.py        THE pipeline. adb drive → screenshot → vision → decide → record.
                      --json mode emits one result object on stdout (for the MCP).
  mr_lookup_test.py   host-free unit tests (mock adb/vision/network; assert wire body).
  console-tray/       macOS menu-bar app: maintains the SSH tunnel to the droplet and
                      opens the combined screen+keyboard console. Human-only surface.
mcp/                  bun stdio MCP. Thin wrapper: shells `mr_lookup.py --json`, exposes
                      mr_number_lookup + mr_number_devices. For coworker-agent/Desktop.
cloud/
  adb-keyboard/       HTTP+WS keyboard server that runs ON the droplet (loopback only).
  terraform/          *.reference — read-only copy of the droplet IaC for context.
deploy/
  install.sh          plum: install MCP deps, run tests, print next steps.
  deploy-droplet.sh   push the adb-keyboard server to the droplet and restart it.
docs/
  DESIGN.md           this file.
  archive/            the failed first-attempt handoffs, kept for history.

4. The screening pipeline (how a lookup works)

client/mr_lookup.py, main_async() — the single code path both front-ends drive:

Launch + navigate. adb launches com.mrnumber.blocker, then uses a uiautomator UI dump to find the search field by text/resource-id (resilient to minor app-UI changes; falls back to a center-top tap if nothing matches).
Input. The phone number is normalized to match ^\+?\d+$ before adb input text, because adb input mangles raw spaces and parentheses. The "Look up " suggestion row is then tapped: the app does not search on Enter, and tapping that row is what triggers the paid lookup.
Land on the detail page. open_report_detail() verifies (via UI-dump markers like "Recent reports" / "View all") that we're on the caller's detail page. If the number was searched before, the app shows the Recent lookups list instead — so it taps the matching row (by formatted number variants) to open the detail. Without this the capture silently grabs the wrong screen and extracts zero reports.
Capture the FULL history. expand_all_reports() taps "View all N reports"; capture_full_history() then screenshots each page and swipes to the next, stopping once the UI dump stops changing (that is, at the bottom of the list). This yields one screenshot per scroll page. The visible-3-reports problem is solved here: everything is captured, not just the first screen.
Vision extraction (per page). Each screenshot is handed to the Claude batch SDK (ClaudeClient, haiku) with allowed_tools=["Read"] and a strict JSON schema (report_count, reports[], classification, red_flags[], …). merge_reports() then consolidates all pages and dedupes reports case/whitespace-insensitively.
Rating profile (the consolidation). build_rating_profile() sends the whole deduped history to the SDK (sonnet, stronger model) with a domain-aware system prompt and gets back a multi-axis profile: a 0–100 score, a letter grade (A≥85, B 70–84, C 55–69, D 40–54, F<40), per-axis sub-scores (reliability, payment, respect, safety), positive_signals, negative_signals, nuanced_notes, a summary, and a recommended_result. The prompt encodes the insider nuance — e.g. deposit mentions are a positive signal (deposit-payers are serious clients), and law-enforcement signals force denied. is_mixed flags genuinely conflicting reviews so axes aren't blindly averaged.
Map to a verdict. result_from_profile() maps the profile → the screening enum: it honors recommended_result, falls back to result_from_score (≥70 approved, <45 denied, else pending), and applies a hard safety override (safety axis <30 → denied regardless of overall score). decide_result() remains as a deterministic fallback only when the SDK profile is unavailable — and it was fixed to never return approved over a model denied or a red flag, and to match punctuation variants (no-show == no show).
Save + record. The full consolidated history + profile is written to client/output/history/<phone>-<ts>.json. Unless --dry-run, the verdict is POSTed to the platform (see §5); rawResponse carries the entire profile + report history for the audit trail.
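
The consolidation and verdict-mapping steps above can be sketched as follows. The grade bands, score thresholds, safety cutoff, and the `recommended_result` field come from the description; everything else is an assumption for illustration: the profile layout (`score`, `axes`), the helpers `_report_key` and `grade_for`, reports represented as plain strings, and checking the safety override before the recommendation. The function bodies are a sketch, not the code in mr_lookup.py.

```python
from __future__ import annotations

import re

# Assumed subset of the screening enum; only these three are named for this step.
VALID_RESULTS = {"approved", "denied", "pending"}


def _report_key(text: str) -> str:
    """Collapse whitespace and case so one report seen on two pages compares equal."""
    return re.sub(r"\s+", " ", text).strip().lower()


def merge_reports(pages: list[list[str]]) -> list[str]:
    """Flatten per-page reports, keeping the first copy of each in page order."""
    seen: set[str] = set()
    merged: list[str] = []
    for page in pages:
        for report in page:
            key = _report_key(report)
            if key and key not in seen:
                seen.add(key)
                merged.append(report.strip())
    return merged


def grade_for(score: int) -> str:
    """Letter grade bands: A>=85, B 70-84, C 55-69, D 40-54, F<40."""
    for floor, letter in ((85, "A"), (70, "B"), (55, "C"), (40, "D")):
        if score >= floor:
            return letter
    return "F"


def result_from_score(score: int) -> str:
    """Score-only fallback: >=70 approved, <45 denied, otherwise pending."""
    if score >= 70:
        return "approved"
    if score < 45:
        return "denied"
    return "pending"


def result_from_profile(profile: dict) -> str:
    """Map a rating profile to a verdict, with the hard safety override first."""
    safety = profile.get("axes", {}).get("safety")
    if safety is not None and safety < 30:
        return "denied"  # the override wins regardless of the overall score
    recommended = profile.get("recommended_result")
    if recommended in VALID_RESULTS:
        return recommended
    return result_from_score(profile["score"])
```

Checking the safety axis before anything else is one way to guarantee that a high overall score or an optimistic recommendation can never outrank a safety signal.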

Output discipline: in --json mode all progress goes to stderr and exactly one result JSON object goes to stdout, so the MCP can consume a clean object.
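
A minimal sketch of that output discipline, assuming two hypothetical helpers (`progress` and `emit_result`) and an illustrative result shape; the real field names in the result object are not specified here.

```python
import json
import sys


def progress(message: str) -> None:
    """Human-readable progress goes to stderr so it never pollutes stdout."""
    print(message, file=sys.stderr, flush=True)


def emit_result(result: dict) -> None:
    """Write the single result object as one JSON line on stdout."""
    sys.stdout.write(json.dumps(result) + "\n")
    sys.stdout.flush()
```

Keeping the result on a single line is what allows a consumer to read the last stdout line and parse it without any framing protocol.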

5. Coupling with the platform (the contracts)

Plum is not the only client — quinn.api and prospector both depend on this integration. The boundary is a job-queue bridge (the same shape as the macsync outbox), with three contracts. None of them require sharing code or a DB.

                    ┌──────────────────── lilith-platform (quinn.api) ───────────────────┐
                    │  screening_checks · reputation events                              │
   (1) RECORD       │  POST /admin/screening/check         ◄── app posts verdicts        │
   app → platform   │  (via my.transquinnftw.com, service token)                         │
                    │                                                                    │
   (2) CONSUME      │  prospect-qualification/mr-number-gate.ts                          │
   platform-internal│  getLatestMrNumberCheckByClient → blocks denied/cop_flag leads     │
                    │                                                                    │
   (3) TRIGGER      │  screening-job queue + enqueue API   ──► app drains & runs lookup  │
   platform → app   │  (quinn.api can't drive a phone; plum runner does)                 │
                    └────────────────────────────────────────────────────────────────────┘

Record (app → platform). mr_lookup.py POSTs {clientId, service:"mr-number", lookupValue, result, rawResponse} to ${QUINN_MY_URL}/api/clients/{id}/screening with QUINN_MY_SERVICE_TOKEN. The quinn.my BFF rewrites that to /admin/screening/check. clientId must be in the body — the rewrite drops it from the path and the server zod schema requires it; the unit tests assert it's present (this was a real 400-on-every-record bug once).
Consume (platform-internal). The prospect-runner gate reads the latest mr-number check for a client and blocks denied/cop_flag like a scam hit. This is pure platform code reading its own table — it lives in the platform, not here, and is unaffected by anything in this repo.
Trigger (platform → app). quinn.api can't drive a phone, so prospector enqueues a screening job ({phone, clientId, reason}, deduped). A plum-side drain runner (this app) polls that queue and invokes mr_lookup.py. The queue + enqueue API stay in quinn.api; the drain runner ships here. (This runner is the one piece still to be built — it depends on the platform's Slice-3 queue API landing first.)

What deliberately does not cross the boundary: no shared DB writes (the app only POSTs via the token API), no shared npm workspace, no entry in the platform's port registry.
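
The record contract above can be sketched as a request builder. The body keys and URL shape are taken from the contract; the `Authorization: Bearer` header is an assumption (the text does not say how the service token is transmitted), `rawResponse` is assumed to be a JSON object, and `build_record_request` is a hypothetical name. The request is only constructed here, not sent.

```python
import json
import urllib.request


def build_record_request(
    base_url: str,
    token: str,
    client_id: str,
    phone: str,
    result: str,
    raw_response: dict,
) -> urllib.request.Request:
    """Build the record POST; clientId is in the body as well as the path."""
    body = {
        "clientId": client_id,  # required by the server schema after the rewrite
        "service": "mr-number",
        "lookupValue": phone,
        "result": result,
        "rawResponse": raw_response,
    }
    url = f"{base_url.rstrip('/')}/api/clients/{client_id}/screening"
    return urllib.request.Request(
        url,
        data=json.dumps(body).encode("utf-8"),
        method="POST",
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {token}",  # assumed header scheme
        },
    )
```

Building the request separately from sending it makes the wire body easy to assert in a unit test, which is where a missing `clientId` would be caught.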

6. The two front-ends

| | CLI | MCP |
|---|---|---|
| Entry | `python3 client/mr_lookup.py …` | `bun run mcp/index.ts` (stdio) |
| Used by | humans, the drain runner, cron | coworker-agent, Claude Desktop |
| Tools | n/a | `mr_number_lookup`, `mr_number_devices` |
| Implementation | the pipeline itself | shells `mr_lookup.py --json`, parses the last stdout line |

The MCP is intentionally dumb: it spawns the Python with a timeout, passes the service token through the environment (read from ~/.config/quinn-secrets/quinn-my.service-token if not in env), and returns the parsed result. All real logic stays in one place.
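
The two small pieces of logic the MCP does carry, token resolution and last-line parsing, can be sketched as below. The real MCP is a bun/TypeScript process; this is a Python rendering of the same idea for illustration only. `resolve_token` and `parse_last_line` are hypothetical names, and the environment variable is assumed to be the `QUINN_MY_SERVICE_TOKEN` used by the record step.

```python
from __future__ import annotations

import json
from pathlib import Path
from typing import Mapping, Optional

TOKEN_FILE = Path.home() / ".config" / "quinn-secrets" / "quinn-my.service-token"


def resolve_token(env: Mapping[str, str], token_file: Path = TOKEN_FILE) -> Optional[str]:
    """Prefer the environment; otherwise read the flat secrets file if it exists."""
    token = env.get("QUINN_MY_SERVICE_TOKEN", "").strip()
    if token:
        return token
    try:
        return token_file.read_text(encoding="utf-8").strip() or None
    except OSError:
        return None  # missing or unreadable file: the caller decides what to do


def parse_last_line(stdout: str) -> dict:
    """Parse the final non-empty stdout line as the result object."""
    lines = [line for line in stdout.splitlines() if line.strip()]
    if not lines:
        raise ValueError("lookup produced no stdout")
    return json.loads(lines[-1])
```

Reading only the last non-empty line is a defensive choice: if anything else ever leaks onto stdout ahead of the result, the parse still succeeds.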

It is distinct from the mr_number_check / mr_number_history tools that live inside the platform's mcp-prospector server — those are the in-API surface (record/list against the DB). This MCP drives the device. They complement each other.

7. Infrastructure ownership

The redroid droplet itself is not provisioned from this repo. Its canonical Terraform lives in the infranet IaC repo ~/Code/@projects/uvlava/terraform/do/ (applied, in TF state, with lifecycle{ignore_changes=[user_data]} to stop drift from destroying the live box). cloud/terraform/android-redroid.tf.reference here is a read-only copy for context only — never terraform apply it from this repo. This keeps droplet lifecycle with the infranet that owns all DO droplets, and avoids a second state file fighting the first.

8. Security notes

The droplet is logged into Quinn's Google + paid Mr. Number account. The ws-scrcpy console and adb-keyboard bind loopback only on the droplet and are reached only through the key-authed SSH tunnel from plum (console-tray). They are never exposed on a public port.
Secrets are flat 0600 files under ~/.config/quinn-secrets/ on plum (quinn-my.service-token); the droplet SSH key is ~/.ssh/id_ed25519_1984. Nothing is committed.
Domain context: this is trust-and-safety tooling for the legal German adult industry — screening protects a sex worker from dangerous clients. See CLAUDE.md.

9. Status & open edges

Built + verified: the pipeline, unit tests (18/18), the MCP (typechecks + boots with both tools), the console tray, deploy scripts.
To build: the trigger drain runner (§5, contract 3), once the platform's screening-job queue API exists.
Cutover: until the platform's .mcp.json is repointed from the old in-tree path to this app's mcp/index.ts, both copies exist side by side. The cutover is a one-line config change + deleting the old users/transquinnftw/tools/mr-number-lookup/.
