Natalie 934bbc3eaf feat(quinn.www/edge): public-edge health watcher + island-mode design

Add edge-watcher.sh (vps-0 oneshot: probes every backend the public site needs,
writes a per-form status oracle for SPA island-mode, emails UP→DOWN /
escalation / recovery / weekly-heartbeat with anti-flap), its systemd
oneshot+minute timer, and an idempotent deploy-edge-watcher.sh installer.
Document the verified 2026-06-21 topology + kill-switch/outbox design in
EDGE_ISLAND_MODE.md and update FORMS_AUDIT.md (forms now routed; no runtime
auto-disable yet).

2026-06-21 22:11:03 -05:00


Public Forms Audit — transquinnftw.com

Date: 2026-06-03
Trigger: bookings=0, client_bookings=0, contact_submissions=1 on prod DB (black:25435/quinn).
Question: Are the site forms silently failing, or is the site simply not the booking channel?

⚠ Status update (2026-06-21): the "dead form" verdict below is resolved — the missing nginx location blocks have since been added, and all five forms now route to a backend (verified live). Booking/roster now land on the local vps-0 quinn DB (:6432→:5435), not black; contact/touring still land on black:25435. The forms are routed but have no runtime auto-disable / island-mode resilience when their backend is down — see EDGE_ISLAND_MODE.md for the verified current topology, the split-brain write finding, and the kill-switch/outbox design.

Verdict (one line)

The forms are broken at the edge. Four of five public forms POST to nginx paths that have no location block, so the request falls through to the SPA fallback (location / → try_files → /index.html) and never reaches a backend. No row is written; no email fires. "0 bookings" is a dead form, not low demand. (Tryst/text being the real demand channel is also true — but we could not have captured a site booking even if someone tried.)

Evidence

Runtime edge probe (browser-UA, bypassing the WAF)

The edge WAF returns a bare nginx/1.22.1 403 to default curl/headless-Chrome user-agents — that is why earlier automated checks saw 403. With a real browser UA the site returns 200. Probing each form's REAL submit path:

ROUTED (reaches a backend):
  /www/provider-config        200 application/json   (control)
  /waitlist                   404 backend
  /provider-api/destinations  404 backend (JSON)
  /newsletter/subscribe       404 backend            ← ShopSignupModal: WORKS
  /api/i18n/en.json           404 backend
  /analytics/track/           404 backend

UNROUTED (returns the SPA shell — id="root" — = DEAD FORM):
  /api/bookings               200 SPA shell          ← BookingForm
  /public/contact             200 SPA shell          ← ContactForm + ContactModal
  /public/touring/subscribe   200 SPA shell          ← TouringOptIn
  /public/roster/apply        200 SPA shell          ← RosterApplicationForm
  /public/roster/availability 200 SPA shell          ← roster page can't even load tracks

Discriminator: an unrouted path returns HTTP 200 + the index.html shell (id="root", /assets/index-*.js). A routed path returns JSON or a backend status (the local my-api returns 404 {"error":"Not found"} for GET /public/bookings, never the shell).
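The discriminator can be sketched as a small shell helper. This is an illustration only, not the code in forms-health.sh: the function name classify_response is hypothetical, and it assumes the caller has already fetched the status code and body with a browser user-agent.

```shell
# classify_response STATUS BODY
# Prints "unrouted" when the response looks like the SPA shell
# (HTTP 200 plus the id="root" marker); prints "routed" otherwise.
# Hypothetical helper for illustration, not taken from forms-health.sh.
classify_response() {
    if [ "$1" = "200" ]; then
        case $2 in
            *'id="root"'*) echo "unrouted"; return 0 ;;
        esac
    fi
    echo "routed"
}

# Usage sketch (needs network; BROWSER_UA is any real browser UA string):
#   status=$(curl -s -o /tmp/probe.body -w '%{http_code}' \
#       -A "$BROWSER_UA" "https://transquinnftw.com/public/contact")
#   classify_response "$status" "$(cat /tmp/probe.body)"
```

A 200 JSON response such as the /www/provider-config control classifies as routed because it lacks the id="root" marker, which is why the status code alone is not enough.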

Database (black:25435/quinn, read-only)

bookings               = 0
client_bookings        = 0
contact_submissions    = 1   ← a curl smoke test: name "smoke test",
                                 email smoke@test.local, UA curl/8.12.1, 2026-05-16.
                                 NOT a real visitor.
touring_subscriptions  = 0

Zero real public submissions have ever landed, across every table.

Form → route → backend → table map

| Form | Frontend submit path | Edge routed? | Backend | Destination | Status |
| --- | --- | --- | --- | --- | --- |
| BookingForm | `POST /api/bookings` | NO (drift) | my-api:3024 → api:3030 | `bookings` (pg) | DEAD |
| ContactForm / ContactModal | `POST /public/contact` | NO (never added) | api:3030 | `contact_submissions` (pg) | DEAD |
| TouringOptIn | `POST /public/touring/subscribe` | NO (never added) | api:3030 | `touring_subscriptions` (pg) | DEAD |
| RosterApplicationForm | `POST /public/roster/apply` | NO (never added) | api:3030 → my (proxy) | quinn.my DB | DEAD |
| ShopSignupModal | `POST /newsletter/subscribe` | YES | newsletter:3026 | `newsletter.db` (SQLite) | OK |

The frontend path comes from @lilith/provider-api-client resolveBaseUrl(), which returns '' (same-origin) in production — so the SPA POSTs to /public/contact etc. expecting nginx to proxy /public/*. It does not.

Caveat: which prod DB? The runtime verdict above is DB-independent: "dead at the edge" is proven by the live SPA-shell response and holds no matter where anything writes. For the DB counts, the prod edge nginx upstreams resolve to 10.0.0.11:30xx (black over WireGuard) → black:25435/quinn, so the counts above ARE the production data the routed handlers use. BUT CLAUDE.md (tour section) and a saved memory both assert a separate quinn-vps-local postgres. This may reflect an older/different topology (unconfirmed; confirm with Quinn). It does not change the verdict (the forms write nowhere), only which DB the one working form's data would land in.

Two distinct root causes

/public/* was never in prod.conf. resolveBaseUrl()'s comment assumes nginx proxies /public/* ("Same-origin for any hostname where nginx proxies /www/* to the API") but only /www/*, /waitlist, /newsletter, /provider-api, /api/bookings, /api/i18n, /analytics/track blocks exist. git log -S "location /public" -- prod.conf returns nothing. Contact/touring/roster have been DOA since the same-origin client pattern shipped.
/api/bookings is in prod.conf (commit 4fd2c0b9) but not live. The repo config is ahead of the deployed nginx. deploy.sh step [6/10] (line 310) DOES sync the vhost — scp nginx/prod.conf → vps-0:/etc/nginx/sites-available/transquinnftw.com then nginx -t && systemctl reload — so this is NOT a vps-owned hand-maintained file (unlike quinn-upstreams.conf). The drift means simply: no ./run deploy:quinn has run since the booking block was committed (deploys are manual / Quinn-gated). Confirmed mechanism → the fix below is actionable, not inert.
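The second root cause (repo config ahead of the deployed vhost) is cheap to check for directly. A minimal sketch, assuming ssh access to vps-0; the helper name config_drift is hypothetical and is not part of deploy.sh:

```shell
# config_drift LOCAL_FILE DEPLOYED_COPY
# Prints "in-sync" when the two files are byte-identical, "drift" otherwise
# (a missing or unreadable file also counts as drift).
config_drift() {
    if cmp -s "$1" "$2"; then
        echo "in-sync"
    else
        echo "drift"
    fi
}

# Usage sketch (assumes ssh access to vps-0):
#   ssh vps-0 cat /etc/nginx/sites-available/transquinnftw.com > /tmp/deployed.conf
#   config_drift deployments/@domains/quinn.www/nginx/prod.conf /tmp/deployed.conf
```

A byte comparison is deliberately strict: any committed-but-undeployed block, such as the /api/bookings one, shows up as drift without parsing nginx syntax.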

Fix (Quinn-gated — DO NOT auto-deploy)

Add to deployments/@domains/quinn.www/nginx/prod.conf (the black_api upstream already exists, so this is safe wrt the upstream-completeness rule), then deploy via ./run deploy:quinn (which syncs prod.conf to vps-0 and runs nginx -t):

# Public form intake — contact, touring/subscribe, roster apply/availability.
# @features/api (black_api :3030) createPublicSurface mounts ALL of these.
location /public/ {
    limit_req zone=quinn_contact burst=5 nodelay;
    client_max_body_size 16k;
    proxy_pass http://black_api/public/;
    proxy_set_header Host $host;
    proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
    proxy_set_header X-Real-IP $remote_addr;
}

This routes contact + touring + roster in one block. The existing /api/bookings block already routes bookings once the drift is resolved by the same deploy. Verify nginx -t on vps-0 BEFORE reload (a bad block 500s the whole vhost).
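The test-before-reload rule can be expressed as a guard. This is a sketch, not the deploy.sh implementation: reload_if_valid is a hypothetical name, and the nginx service name passed to systemctl in the usage comment is an assumption.

```shell
# reload_if_valid TEST_CMD RELOAD_CMD
# Runs TEST_CMD; runs RELOAD_CMD only if the test succeeded.
# Returns 1 without reloading when the test fails.
reload_if_valid() {
    if sh -c "$1"; then
        sh -c "$2"
    else
        echo "config test failed; reload skipped" >&2
        return 1
    fi
}

# On vps-0 this would be invoked as (service name assumed to be "nginx"):
#   reload_if_valid "nginx -t" "systemctl reload nginx"
```

Because the running nginx keeps its old configuration until a reload succeeds, skipping the reload on a failed test leaves the live vhost untouched.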

After deploy, confirm with the monitor below — all rows should flip to "routed".

Monitor (durable health-check)

deployments/@domains/quinn.www/scripts/forms-health.sh — runtime edge-routing probe (browser-UA, detects SPA-shell fallback) + DB-freshness alert (flags 0 submissions in N days). Run standalone or on a timer:

bash scripts/forms-health.sh --db --days 30

Currently exits 1 with 7 failures (the 5 unrouted paths behind the 4 dead forms + 2 stale tables). After the fix it should pass the routing section. Recommend wiring into route-smoke.sh (deploy gate step 10.7) ONLY after the fix lands, so it doesn't block unrelated deploys in the meantime.
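The DB-freshness half of the monitor amounts to a window comparison. The sketch below is an assumption about the logic, not the script's actual code: is_stale is a hypothetical name, and it treats "0 submissions in N days" as equivalent to "newest row older than N days, or no rows at all".

```shell
# is_stale NEWEST_EPOCH NOW_EPOCH DAYS
# Prints "stale" when the newest row is older than DAYS days, or when there
# is no row at all (empty NEWEST_EPOCH); prints "fresh" otherwise.
is_stale() {
    if [ -z "$1" ]; then
        echo "stale"
        return 0
    fi
    if [ $(( $2 - $1 )) -gt $(( $3 * 86400 )) ]; then
        echo "stale"
    else
        echo "fresh"
    fi
}

# Usage sketch (the created_at column name is an assumption):
#   newest=$(psql "$DATABASE_URL" -Atc \
#       "SELECT extract(epoch FROM max(created_at))::bigint FROM contact_submissions")
#   is_stale "$newest" "$(date +%s)" 30
```

An empty table yields an empty max(), which the helper reports as stale rather than erroring, matching the "zero rows ever" case this audit found.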

Deviation from the task ask (item 3) — Claire to adjudicate

The task asked for a full submit→DB→notify→cleanup e2e per form. We did NOT build five passing full-stack e2e tests, and deliberately so:

You cannot write a passing full-stack e2e against a form that is dead at the edge. Four of five forms 405/SPA-fallback in prod; an honest e2e would just re-assert the breakage — which forms-health.sh already does, more cheaply.
The "NO submission e2e" premise was partly stale. @features/api __tests__/public-contact.test.ts and public-touring.test.ts already POST sentinel data and assert a real DB row. Re-creating those would rebuild the exact false-confidence trap (green handler tests while prod is broken).
We substituted a runtime edge-routing monitor + DB-freshness monitor (forms-health.sh) — which catches the real failure class the existing tests miss — plus this verdict.

Genuine remaining e2e gaps (build AFTER the routing fix lands, so they can be green): roster-apply (proxy to quinn.my), booking public-intake end-to-end, shop-newsletter, and an "email fires" assertion; none of the current tests assert the notification/confirmation mailer is actually invoked. We did NOT live-POST the routed endpoints (/api/bookings, /newsletter/subscribe) because each fires real confirmation emails; per the gates, those must be proven with a stub mailer in a bun integration test, never a live prod POST.

What is NOT broken

Backend handlers work when reached: @features/api __tests__/public-contact.test.ts and public-touring.test.ts POST sentinel data and assert real DB rows; the lone contact_submissions row (a curl smoke test) reached the handler directly and persisted. The bug is purely the edge route, not the handler or the DB.
The e2e specs (root/e2e/*.spec.ts) mock the backend (utils/mock-backend.ts), which is why they pass despite production being broken — they cannot catch a routing gap. forms-health.sh closes that blind spot.
