lilith-platform.live/docs/FORMS_AUDIT.md
Natalie 934bbc3eaf feat(quinn.www/edge): public-edge health watcher + island-mode design
Add edge-watcher.sh (vps-0 oneshot: probes every backend the public site needs,
writes a per-form status oracle for SPA island-mode, emails UP→DOWN /
escalation / recovery / weekly-heartbeat with anti-flap), its systemd
oneshot+minute timer, and an idempotent deploy-edge-watcher.sh installer.
Document the verified 2026-06-21 topology + kill-switch/outbox design in
EDGE_ISLAND_MODE.md and update FORMS_AUDIT.md (forms now routed; no runtime
auto-disable yet).
2026-06-21 22:11:03 -05:00


# Public Forms Audit — transquinnftw.com
**Date:** 2026-06-03
**Trigger:** bookings=0, client_bookings=0, contact_submissions=1 on prod DB (black:25435/quinn).
**Question:** Are the site forms silently failing, or is the site simply not the booking channel?
> **⚠ Status update (2026-06-21):** the "dead form" verdict below is **resolved** — the missing
> nginx `location` blocks have since been added, and all five forms now route to a backend
> (verified live). Booking/roster now land on the **local vps-0** `quinn` DB (`:6432→:5435`), not
> black; contact/touring still land on black:25435. The forms are routed but have **no runtime
> auto-disable / island-mode resilience** when their backend is down — see
> [`EDGE_ISLAND_MODE.md`](EDGE_ISLAND_MODE.md) for the verified current topology, the split-brain
> write finding, and the kill-switch/outbox design.
## Verdict (one line)
**The forms are broken at the edge.** Four of five public forms POST to nginx paths
that have no `location` block, so the request falls through to the SPA fallback
(`location / → try_files → /index.html`) and never reaches a backend. No row is
written; no email fires. "0 bookings" is a **dead form**, not low demand.
(Tryst/text being the real demand channel is *also* true — but we could not have
captured a site booking even if someone tried.)
## Evidence
### Runtime edge probe (browser-UA, bypassing the WAF)
The edge WAF returns a bare `nginx/1.22.1` 403 to default curl/headless-Chrome
user-agents — that is why earlier automated checks saw 403. With a real browser
UA the site returns 200. Probing each form's REAL submit path:
```
ROUTED (reaches a backend):
  /www/provider-config           200  application/json  (control)
  /waitlist                      404  backend
  /provider-api/destinations     404  backend (JSON)
  /newsletter/subscribe          404  backend           ← ShopSignupModal: WORKS
  /api/i18n/en.json              404  backend
  /analytics/track/              404  backend
UNROUTED (returns the SPA shell, id="root" = DEAD FORM):
  /api/bookings                  200  SPA shell         ← BookingForm
  /public/contact                200  SPA shell         ← ContactForm + ContactModal
  /public/touring/subscribe      200  SPA shell         ← TouringOptIn
  /public/roster/apply           200  SPA shell         ← RosterApplicationForm
  /public/roster/availability    200  SPA shell         ← roster page can't even load tracks
```
Discriminator: an unrouted path returns HTTP 200 + the index.html shell
(`id="root"`, `/assets/index-*.js`). A routed path returns JSON or a backend
status (the local my-api returns `404 {"error":"Not found"}` for GET
`/public/bookings`, never the shell).
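
The discriminator above can be sketched as a small shell helper. This is a minimal illustration under stated assumptions, not the contents of `forms-health.sh`: the function names `classify_response` and `probe_path` and the exact User-Agent string are hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: classify an edge response as "routed" or "unrouted".
# $1 = HTTP status code, $2 = response body.
classify_response() {
    status=$1
    body=$2
    # An unrouted path falls through to the SPA fallback: HTTP 200 plus the
    # index.html shell, recognisable by its id="root" mount point.
    if [ "$status" = "200" ] && printf '%s' "$body" | grep -q 'id="root"'; then
        echo unrouted
    else
        echo routed
    fi
}

# Hypothetical wrapper: fetch a path with a browser-like User-Agent (the WAF
# returns 403 to default curl UAs) and classify the result.
# $1 = base URL, $2 = path. The UA string below is an assumption.
probe_path() {
    ua='Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/124.0 Safari/537.36'
    tmp=$(mktemp) || return 1
    status=$(curl -s -A "$ua" -o "$tmp" -w '%{http_code}' "$1$2")
    classify_response "$status" "$(cat "$tmp")"
    rm -f "$tmp"
}
```

For example, `probe_path https://transquinnftw.com /public/contact` would be expected to print `unrouted` for any path that still falls through to the SPA shell, and `routed` for a path that returns JSON or a backend status.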
### Database (black:25435/quinn, read-only)
```
bookings              = 0
client_bookings       = 0
contact_submissions   = 1   ← a curl smoke test: name "smoke test",
                              email smoke@test.local, UA curl/8.12.1, 2026-05-16.
                              NOT a real visitor.
touring_subscriptions = 0
```
Zero real public submissions have ever landed, across every table.
## Form → route → backend → table map
| Form | Frontend submit path | Edge routed? | Backend | Destination | Status |
|------|----------------------|--------------|---------|-------------|--------|
| BookingForm | `POST /api/bookings` | **NO (drift)** | my-api:3024 → api:3030 | `bookings` (pg) | **DEAD** |
| ContactForm / ContactModal | `POST /public/contact` | **NO (never added)** | api:3030 | `contact_submissions` (pg) | **DEAD** |
| TouringOptIn | `POST /public/touring/subscribe` | **NO (never added)** | api:3030 | `touring_subscriptions` (pg) | **DEAD** |
| RosterApplicationForm | `POST /public/roster/apply` | **NO (never added)** | api:3030 → my (proxy) | quinn.my DB | **DEAD** |
| ShopSignupModal | `POST /newsletter/subscribe` | **YES** | newsletter:3026 | `newsletter.db` (SQLite) | **OK** |
The frontend path comes from `@lilith/provider-api-client` `resolveBaseUrl()`,
which returns `''` (same-origin) in production — so the SPA POSTs to
`/public/contact` etc. expecting nginx to proxy `/public/*`. It does not.
> **Caveat — which prod DB?** The runtime verdict above is **DB-independent**:
> "dead at the edge" is proven by the live SPA-shell response and holds no matter
> where anything writes. For the DB counts, the prod edge nginx upstreams resolve
> to `10.0.0.11:30xx` (black over WireGuard) → black:25435/quinn, so the counts
> above ARE the production data the routed handlers use. BUT `CLAUDE.md` (tour
> section) and a saved memory both assert a **separate quinn-vps-local postgres**.
> This may reflect an older/different topology — **unconfirmed; confirm with Quinn.**
> It does not change the verdict (the forms write nowhere), only which DB the one
> working form's data would land in.
## Two distinct root causes
1. **`/public/*` was never in `prod.conf`.** `resolveBaseUrl()`'s comment assumes
nginx proxies `/public/*` ("Same-origin for any hostname where nginx proxies
/www/* to the API") but only `/www/*`, `/waitlist`, `/newsletter`,
`/provider-api`, `/api/bookings`, `/api/i18n`, `/analytics/track` blocks exist.
`git log -S "location /public" -- prod.conf` returns nothing. Contact, touring, and
roster have been DOA since the same-origin client pattern shipped.
2. **`/api/bookings` is in `prod.conf` (commit `4fd2c0b9`) but not live.** The repo
config is ahead of the deployed nginx. `deploy.sh` step [6/10] (line 310) DOES
sync the vhost — `scp nginx/prod.conf → vps-0:/etc/nginx/sites-available/transquinnftw.com`
then `nginx -t && systemctl reload` — so this is NOT a vps-owned hand-maintained
file (unlike `quinn-upstreams.conf`). The drift means simply: **no `./run deploy:quinn`
has run since the booking block was committed** (deploys are manual / Quinn-gated).
Confirmed mechanism → the fix below is actionable, not inert.
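
Both root causes can be checked mechanically. The sketch below assumes a local copy of the deployed vhost has already been fetched; the helper names `has_location` and `has_drift` are hypothetical, and the `location` match is deliberately naive (it ignores modifiers such as `=` or `^~`).

```shell
#!/bin/sh
# Hypothetical helper: succeed if an nginx config file declares a location
# block whose path starts with the given prefix. Modifiers (=, ~, ^~) are
# not handled in this sketch.
# $1 = config file, $2 = path prefix (e.g. /public).
has_location() {
    grep -Eq "^[[:space:]]*location[[:space:]]+$2" "$1"
}

# Hypothetical helper: succeed if the two files differ, i.e. the repo copy
# and the deployed copy have drifted apart.
# $1 = repo copy of prod.conf, $2 = copy fetched from the server.
has_drift() {
    ! cmp -s "$1" "$2"
}
```

A possible usage, assuming SSH access to vps-0: `scp vps-0:/etc/nginx/sites-available/transquinnftw.com /tmp/deployed.conf`, then `has_drift nginx/prod.conf /tmp/deployed.conf && echo "repo and prod differ"` and `has_location /tmp/deployed.conf /public || echo "no /public block deployed"`.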
## Fix (Quinn-gated — DO NOT auto-deploy)
Add to `deployments/@domains/quinn.www/nginx/prod.conf` (the `black_api` upstream
already exists, so this is safe with respect to the upstream-completeness rule), then deploy
via `./run deploy:quinn` (which syncs prod.conf to vps-0 and runs `nginx -t`):
```nginx
# Public form intake — contact, touring/subscribe, roster apply/availability.
# @features/api (black_api :3030) createPublicSurface mounts ALL of these.
location /public/ {
limit_req zone=quinn_contact burst=5 nodelay;
client_max_body_size 16k;
proxy_pass http://black_api/public/;
proxy_set_header Host $host;
proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
proxy_set_header X-Real-IP $remote_addr;
}
```
This routes contact + touring + roster in one block. The existing `/api/bookings`
block already routes bookings once the drift is resolved by the same deploy.
Verify `nginx -t` on vps-0 BEFORE reload (a bad block 500s the whole vhost).
After deploy, confirm with the monitor below — all rows should flip to "routed".
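
The "verify before reload" step can be sketched as a guard function to run on vps-0. This is an illustration only: the function name `reload_if_valid` is hypothetical, and the systemd unit name `nginx` is an assumption.

```shell
#!/bin/sh
# Hypothetical helper: reload nginx only if the new config passes `nginx -t`.
# The unit name "nginx" is an assumption about the host.
reload_if_valid() {
    if nginx -t; then
        systemctl reload nginx
    else
        echo "nginx -t failed; leaving the running config untouched" >&2
        return 1
    fi
}
```

Because a failed `nginx -t` returns non-zero before any reload is attempted, the running vhost keeps serving the previous config.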
## Monitor (durable health-check)
`deployments/@domains/quinn.www/scripts/forms-health.sh` — runtime edge-routing
probe (browser-UA, detects SPA-shell fallback) + DB-freshness alert (flags 0
submissions in N days). Run standalone or on a timer:
```
bash scripts/forms-health.sh --db --days 30
```
Currently exits 1 with 7 failures (the 5 unrouted paths behind the 4 dead forms + 2 stale tables). After the
fix it should pass the routing section. Recommend wiring into `route-smoke.sh`
(deploy gate step 10.7) ONLY after the fix lands, so it doesn't block unrelated
deploys in the meantime.
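
The DB-freshness half of the monitor can be sketched as follows. This is not the actual `forms-health.sh` logic: the helper names are hypothetical, and the `created_at` timestamp column is an assumption about the schema.

```shell
#!/bin/sh
# Hypothetical helper: print a SQL query counting rows newer than N days.
# $1 = table name, $2 = window in days.
# The created_at column is an assumption; adjust to the real schema.
freshness_sql() {
    printf "SELECT count(*) FROM %s WHERE created_at > now() - interval '%s days';\n" "$1" "$2"
}

# Hypothetical helper: succeed (meaning "stale") when the recent-row count
# is zero. $1 = count returned by the query above.
is_stale() {
    [ "$1" -eq 0 ]
}
```

A possible usage, assuming a read-only connection string in `DATABASE_URL`: `count=$(psql "$DATABASE_URL" -At -c "$(freshness_sql contact_submissions 30)")`, then `is_stale "$count" && echo "contact_submissions: no rows in 30 days"`.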
## Deviation from the task ask (item 3) — Claire to adjudicate
The task asked for a full submit→DB→notify→cleanup **e2e per form**. We did NOT
build five passing full-stack e2e tests, and deliberately so:
- **You cannot write a *passing* full-stack e2e against a form that is dead at the
edge.** Four of five forms return 405 or the SPA fallback in prod; an honest e2e would just
re-assert the breakage — which `forms-health.sh` already does, more cheaply.
- **The "NO submission e2e" premise was partly stale.** `@features/api
__tests__/public-contact.test.ts` and `public-touring.test.ts` already POST
sentinel data and assert a real DB row. Re-creating those would rebuild the
exact false-confidence trap (green handler tests while prod is broken).
- We substituted a **runtime edge-routing monitor + DB-freshness monitor**
(`forms-health.sh`) — which catches the real failure class the existing tests
miss — plus this verdict.
**Genuine remaining e2e gaps** (build AFTER the routing fix lands, so they can be
green): roster-apply (proxy to quinn.my), booking public-intake end-to-end,
shop-newsletter, and an **"email fires" assertion**: none of the current tests assert
the notification/confirmation mailer is actually invoked. We did NOT live-POST the
routed endpoints (`/api/bookings`, `/newsletter/subscribe`) because each fires real
confirmation emails — per the gates, those must be proven with a stub mailer in a
bun integration test, never a live prod POST.
## What is NOT broken
- Backend handlers work when reached: `@features/api __tests__/public-contact.test.ts`
and `public-touring.test.ts` POST sentinel data and assert real DB rows; the lone
`contact_submissions` row (a curl smoke test) reached the handler directly and
persisted. The bug is purely the edge route, not the handler or the DB.
- The e2e specs (`root/e2e/*.spec.ts`) mock the backend (`utils/mock-backend.ts`),
which is why they pass despite production being broken — they cannot catch a
routing gap. `forms-health.sh` closes that blind spot.