Natalie da16755bfc docs(edge): Phase 2 outbox failover live + document public_write upstream

2b (G9 idempotency) deployed to black; 2c (nginx failover) live and verified
end-to-end (normal 201 / black-down 202 -> spool -> replay -> G9 dedup). Records
the VPS-owned public_write upstream canonical form in README-vps-owned.md.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

2026-06-22 02:55:08 -05:00


Edge Resilience & Island Mode — Verified Topology + Design

Status: Investigation + design. No runtime changes made (read-only probes only).
Verified: 2026-06-21, via read-only ssh quinn-vps / ssh black, live nginx config, live quinn-upstreams.conf, live HTTP + DB probes.
Current-state facts here supersede: the SQLite-era inventory in PROD_DB_UNIFICATION_PLAN.md (the platform has since moved to PostgreSQL) and the "dead forms" verdict in FORMS_AUDIT.md (2026-06-03; the edge location blocks have since been added, so forms are now routed).
Direction docs (target, not current state): migration-vps-to-black.md, PROD_DB_UNIFICATION_PLAN.md.

1. Why this doc exists

The originating ask: public contact forms should automatically disable themselves when their backend is unreachable, and vps-0 should be able to "island mode" without black (keep serving what it can when black or the WireGuard link drops). Investigating that surfaced a live topology that diverges from the documented target, plus a data-integrity issue independent of island mode. This doc records:

The verified current topology (2026-06-21).
The island-mode / runtime-kill-switch design.
The consolidated gap register.
The one open decision that blocks the design.

2. Verified current topology (2026-06-21)

2.1 Hosts

Host	Role	Reachability
vps-0 (`89.127.233.145`, WG `10.9.0.1`)	Public edge: nginx + static SPA + edge cache + a near-complete local backend stack incl. its own Postgres	Public internet
black (`10.0.0.11` over WireGuard)	Canonical for the public read surface + contact/touring writes	LAN/WG only

Reality check: vps-0 is not the "pure edge" the migration target describes. It runs quinn-api, quinn-admin-api, quinn-data-api, quinn-my-api, quinn-sso-api, quinn-newsletter-api, quinn-m-backend-user, plus postgresql@17-quinn (local :5435) and pgBouncer (:6432).

2.2 Edge routing — nginx on vps-0 (`prod.conf`)

Public path	Upstream	Resolves to	Cached?
`/www/*` (destinations, tour, blog, regions)	`black_api`	black:3023	`pseo_cache` 60m, serve-stale-on-error
`/sitemap.xml`	`black_api`	black:3023	`pseo_cache` 60m
`/api/i18n/*`	`black_api`	black:3023	none ⚠
`/provider-api/*` (ProviderData JSON)	`black_data_api`	black:3022	`data_cache` 30m, serve-stale
`/photos/*`	`black_photos`	black:8081	`photos_cache` 7d, serve-stale
`/public/` (contact, touring) — write*	`black_api`	black:3023	none
`/waitlist` — write	`black_api`	black:3023	none
`/api/bookings` — write	`black_my_api`	127.0.0.1:3024 (LOCAL)	none
`/public/roster/` — write*	`black_my_api`	127.0.0.1:3024 (LOCAL)	none
`/newsletter/` — write*	`black_newsletter`	127.0.0.1:3026 (LOCAL)	none

2.3 Live upstreams (`/etc/nginx/conf.d/quinn-upstreams.conf` — VPS-owned, not in repo)

The black_ prefix is historical and misleading — two upstreams are local to vps-0:

black_api        → 10.0.0.11:3023    (black, WG)
black_data_api   → 10.0.0.11:3022    (black, WG)
black_my_api     → 127.0.0.1:3024    (LOCAL vps-0)
black_newsletter → 127.0.0.1:3026    (LOCAL vps-0)
black_photos     → 10.0.0.11:8081    (black, WG)

The repo's README-vps-owned.md was stale (documented black_api as :3030 and my-api/newsletter as 10.0.0.11) — corrected 2026-06-21 to match live, since re-applying the stale values would mis-route production.

2.4 Backends + databases — split-brain

Surface	Backend	Database	Canonical host
Public reads (`/www`, `/provider-api`)	black `:3023` / `:3022`	`black:25435/quinn(_admin)`	black
contact / touring / waitlist (write)	black `:3023`	`black:25435/quinn`	black
booking / roster (write)	vps-0 local `:3024`	`vps-0:5435/quinn` (via pgBouncer `:6432`)	vps-0
newsletter (write)	vps-0 local `:3026`	`vps-0:5435/quinn`	vps-0

Writes are partitioned across two canonical Postgres instances by form. Booking data exists only on vps-0; contact data only on black. Nothing reconciles them.
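
The partition above can be expressed as a small lookup, which is roughly what a per-form "depends on" map would need. This is a minimal sketch, not code from the repo: the upstream-to-host pairs are copied from §2.3 and the form-to-upstream pairs from §2.2, while every identifier (`upstreamHost`, `formUpstream`, `canonicalHost`, `formsLostWhenBlackDown`) is hypothetical.

```typescript
// Hypothetical names throughout; only the data is taken from §2.2 and §2.3.
type Host = "black" | "vps-0";

// Upstream name → host that actually serves it (§2.3).
const upstreamHost: Record<string, Host> = {
  black_api: "black",
  black_data_api: "black",
  black_photos: "black",
  black_my_api: "vps-0", // local, despite the black_ prefix
  black_newsletter: "vps-0", // local, despite the black_ prefix
};

// Write surface → upstream it is proxied to (§2.2).
const formUpstream: Record<string, string> = {
  contact: "black_api",
  touring: "black_api",
  waitlist: "black_api",
  booking: "black_my_api",
  roster: "black_my_api",
  newsletter: "black_newsletter",
};

// Resolve which host holds the canonical copy of a form's data.
function canonicalHost(form: string): Host {
  const upstream = formUpstream[form];
  if (upstream === undefined) throw new Error(`unknown form: ${form}`);
  const host = upstreamHost[upstream];
  if (host === undefined) throw new Error(`unknown upstream: ${upstream}`);
  return host;
}

// Forms whose writes fail when black or the WireGuard link is down.
function formsLostWhenBlackDown(): string[] {
  return Object.keys(formUpstream).filter((f) => canonicalHost(f) === "black");
}
```

Keeping the mapping as data rather than inferring it from the upstream name avoids the trap in §2.3, where the `black_` prefix no longer says where a service runs.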

2.5 The local stack is NOT a replica (HTTP compare, vps-0 local `:3023` vs black `:3023`)

Endpoint	LOCAL vps-0	BLACK	Verdict
`/health`	build 257 / `0.1.149`, `mode:internal`, 2026-06-21	older build (`{"ok":true}` only)	vps-0 is newer
`/www/destinations`	79 items	82	DIFFER
`/www/provider-config`	95 items	98	DIFFER
`/www/tour-stops`	—	—	DIFFER

vps-0's local quinn DB is populated but drifted (destinations max 2026-05-18), with no replication feed from black. The public site reads from black (82 destinations); vps-0's copy (79) is behind and not in the public read path. This reads like a stalled cutover: vps-0 looks half-prepped to become primary (newer build, full local DB) but the public path was never switched to it.

2.6 Edge cache durability

Zone	inactive (eviction)	Island value
`pseo_cache` (`/www`)	1h	Weak — cold pages evict within 1h of an outage → 502
`data_cache` (`/provider-api`)	1d	Good
`photos_cache` (`/photos`)	30d	Strong
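
A rough way to reason about the "Island value" column: nginx drops a cache entry that has not been requested for longer than the zone's `inactive` time, so a page can only be served stale during an outage if it was requested recently enough. The sketch below is a simplified model of that rule, not nginx's actual cache-manager behaviour; the zone values are the ones in the table above and the function name `stillCached` is hypothetical.

```typescript
// Simplified model: an entry survives only if it was accessed within `inactive`.
// Zone values are from the table above; names are hypothetical.
const HOUR_S = 3600;
const DAY_S = 24 * HOUR_S;

const inactiveSeconds: Record<string, number> = {
  pseo_cache: 1 * HOUR_S,
  data_cache: 1 * DAY_S,
  photos_cache: 30 * DAY_S,
};

// True if an entry last requested `secondsSinceLastAccess` ago is still on disk
// and can therefore be served stale while black is unreachable.
function stillCached(zone: string, secondsSinceLastAccess: number): boolean {
  const inactive = inactiveSeconds[zone];
  if (inactive === undefined) throw new Error(`unknown zone: ${zone}`);
  return secondsSinceLastAccess < inactive;
}
```

Under this model a `/www` page last requested two hours before the outage is already gone from `pseo_cache`, while the same gap is harmless for `data_cache` and `photos_cache`.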

3. The exposure (what breaks when black / WG drops)

Hard-fail: contact, touring, waitlist (POST → black:3023). No pre-warning — the SPA fetches no runtime config, so it posts into a 502 blind.
Degrades to stale, then fails: /www reads survive only while cached and only ~1h for cold pages (pseo_cache inactive=1h); /provider-api survives ~1d; /api/i18n is uncached and is fetched at runtime (provider App.tsx:91, landing App.tsx:241) → translations hard-fail.
Stays fully alive (local on vps-0): booking, roster, newsletter.

There is no watcher on vps-0 for the API/forms surface today (a separate gallery monitor exists for photos only).

4. Island-mode design (proposed — not built)

In-process edge-health module in a vps-0-local quinn.api PUBLIC instance (placement decided), with manual override. Maps onto existing seams: public-proxy.ts (isLocallyServable / publicModeGate) and the probe pattern in system-status.ts.

Runtime kill switch. Background prober + circuit breaker per form (fed actively by probes, passively by proxy failures). GET /edge/status served locally (island-safe) returns the per-form enabled/disabled map. Frontend FormGateProvider fetches it on load + focus; forms render a "reach me by SMS" fallback instead of posting into a 502.
Store-and-forward outbox (for the black-dependent writes only: contact, touring, waitlist). Edge accepts the POST, persists to a durable local spool, returns 202 (accepted for later delivery, distinct from the normal 201), and a background forwarder replays to black on recovery. Requires: idempotency key + black-side dedupe (ON CONFLICT DO NOTHING); encrypted/short-lived spool (PII on a public host); throttled replay (respect black + vps-0 fail2ban).
Watcher + alerting. Weekly "active" heartbeat + immediate failure alert with 1h / 4h / 6h backoff, escalation state persisted across restarts, anti-flap (reuse gallery-monitor pattern), sent via vps-0 local DMS (swaks --server 127.0.0.1:25, black-independent).
Backend fail-fast. When a breaker is open, short-circuit the write with a fast structured 503 instead of hanging on a dead TCP connect.
Never a new SPOF. nginx keeps black as primary; the edge service is failover/accept-on-error only; under systemd Restart=always.
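
The kill switch and fail-fast items above could be sketched as follows. This is an illustration under stated assumptions, not the deployed module: the class and function names, the failure threshold, and the `upstream_unavailable` error code are all hypothetical, and the recovery rule (a single success closes the breaker) is a simplification.

```typescript
// Hypothetical per-form circuit breaker, fed by both probes and proxy failures.
class FormBreaker {
  private failures = 0;
  private open = false;
  private readonly threshold: number;

  constructor(threshold: number) {
    this.threshold = threshold;
  }

  // Record one observation, from the active prober or from a proxied request.
  record(ok: boolean): void {
    if (ok) {
      // Simplification: one success closes the breaker again.
      this.failures = 0;
      this.open = false;
      return;
    }
    this.failures += 1;
    if (this.failures >= this.threshold) this.open = true;
  }

  isEnabled(): boolean {
    return !this.open;
  }
}

// Per-form enabled/disabled map, the shape GET /edge/status is described as returning.
function edgeStatus(breakers: Map<string, FormBreaker>): Record<string, boolean> {
  const out: Record<string, boolean> = {};
  breakers.forEach((breaker, form) => {
    out[form] = breaker.isEnabled();
  });
  return out;
}

// Backend fail-fast: answer immediately with a structured 503 when the breaker is open.
function failFast(breaker: FormBreaker): { status: number; error: string } | null {
  return breaker.isEnabled() ? null : { status: 503, error: "upstream_unavailable" };
}
```

A caller would check `failFast` before attempting the upstream write, and call `record(false)` when the proxy itself sees a connect error, so passive failures trip the breaker without waiting for the next probe.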

What stays alive in island mode: booking, roster, newsletter (already local); cached /www + /provider-api reads (stale); contact/touring accepted to the outbox for later replay. Disabled/degraded: cold /www pages, runtime i18n, live contact/touring delivery.

5. Gap register

#	Gap	Handling	Status
G1	Public read/contact upstreams point only at black; no failover to the local twin	nginx upstream failover — blocked: local DB is not a replica (G2)	blocked
G2	Local `:5435/quinn` is not a live replica of black (drifted, no feed)	Establish real replication before any failover-to-local	verified NO
G3	Split-brain writes across two canonical DBs (contact→black, booking/roster/newsletter→vps-0)	Unify canonical DB (see §6); latent data-integrity issue independent of island mode	verified
G4	`README-vps-owned.md` upstream ports stale → re-applying mis-routes prod	Reconciled to live mapping	done 2026-06-21
G5	`pseo_cache inactive=1h` → cold `/www` evicts within 1h of outage	Raised to 24h in `quinn-maps.conf`	DONE — deployed 2026-06-22
G6	`/api/i18n/` uncached and fetched at runtime → translations hard-fail on black down	`proxy_cache` (6h, serve-stale) added	DONE — deployed 2026-06-22
G7	No runtime form gating; SPA posts into 502 blind	`/edge/status` oracle + nginx serve + `FormGateProvider`	DONE — deployed live 2026-06-22 (gitSha `74017f18`)
G8	Black-dependent writes (contact/touring/waitlist) hard-fail on outage	Store-and-forward outbox (nginx failover backup)	DONE — live 2026-06-22
G9	`contact_submissions` has no unique/idempotency constraint → replay duplicates	`idempotency_key` + unique index + `ON CONFLICT DO NOTHING`	DONE — deployed 2026-06-22
G10	PII at rest on public host (outbox spool)	Encrypt at rest / short-lived / never log bodies	build rule
G11	Provider SMTP notify delayed until replay	Accept delay, or local-DMS notify on accept	decision
G12	Edge service could become a new SPOF	black stays primary; edge failover-only; `Restart=always`	build rule
G13	Outbox unbounded growth + recovery thundering herd + vps-0 fail2ban on POST bursts	Cap spool + alert on depth/age; throttle replay (~≤30/min)	build rule
G14	Heartbeat/alert robustness (1h/4h/6h escalation must survive restarts; anti-flap)	Persist state to file; systemd timer; local DMS	DONE 2026-06-21 — deployed
G15	Local write-services can crash independently	Watcher probes `:3024`/`:3026` too — never assume "local = up"	build
G16	Idempotency migration safety on existing contact/touring/waitlist inserts	Backfill-safe migration; verify before deploy	verify

6. Open decision — which DB is canonical? (blocks the design)

The island-mode architecture depends on resolving the split-brain, and that is above an agent's authority — it's an operator decision that also touches migration-vps-to-black.md.

If black stays canonical (the documented target): island mode = outbox + accept-stale-cache (G7–G14). The local vps-0 stack/DB is dead weight until replicated, and booking/roster/newsletter writes must be moved back to black to undo the split-brain.
If vps-0 becomes primary (what the newer shadow build hints at): finish the cutover, replicate vps-0 → black as standby, and move contact/touring writes onto vps-0. Island mode then becomes nearly free.

Either way the split-brain is a standing data-integrity problem (booking data lives only on vps-0, contact only on black) that should be resolved regardless of island mode.

7. Build order

Phases sequenced by risk and by what's blocked on the §6 canonical-DB decision.

Phase 1a — Edge watcher + status oracle (DONE, deployed 2026-06-21). Decision-independent; touches no data path. Probes the five backends every minute, writes the per-form kill-switch JSON, and emails heartbeat + escalating down alerts via local DMS. See §8.
Phase 1b — Serve the oracle + frontend gate (DONE, deployed live 2026-06-22; decision-independent). Adds an nginx location /edge/status.json (an alias onto the watcher state dir) and a SPA gate provider that reads it and disables a form whose dependsOn target is down. Shipped via the normal quinn.www deploy (e2e smoke gate). Also handled G6 (cache /api/i18n) and G5 (raise pseo_cache inactive). See §8.
Phase 2 — Store-and-forward outbox (NOT blocked on §6 — corrected). Only the black-dependent writes (contact/touring/waitlist). The outbox keeps black as contact's canonical home (status quo) and replays to black's current address, so it does not require the §6 unification decision. Lower-risk failover design: nginx keeps routing contact directly to black as primary; the local outbox is the backup upstream that only receives traffic when black fails (proxy_next_upstream error timeout non_idempotent). The normal path is therefore unchanged — a bug in the outbox can only affect the already-failing (black-down) case, never drop a lead during normal operation.
- 2a — outbox service (vps-0 local Node service: accept-on-failover → durable spool → throttled forwarder to black with Idempotency-Key). DONE — deployed dormant 2026-06-22 (quinn-edge-outbox on 127.0.0.1:3098, empty spool, unrouted). Verified in isolation (accept→spool→forward→clear against a sink).
- 2b — G9 idempotency on black contact_submissions (additive nullable column + unique index + ON CONFLICT DO NOTHING). DONE — deployed to black 2026-06-22. (touring/waitlist already natural-idempotent via UNIQUE(email,provider_slug) upsert, so only contact needed it.)
- 2c — nginx failover cutover. DONE — live 2026-06-22. public_write upstream (black primary + outbox :3098 backup) with proxy_next_upstream ... non_idempotent on /public/contact, /public/touring/subscribe, /waitlist.
- 2d — frontend emits a client Idempotency-Key per submission. Optional / not done — the outbox generates a key per spooled item, so replay dedupe already works; 2d only adds client-double-submit protection.
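
Why replay is safe to repeat can be shown with a small model of steps 2a and 2b. This is a sketch under assumptions, not the quinn-edge-outbox source: the spool here is an in-memory array standing in for the durable spool, the forwarder is synchronous for brevity (a real one would be an async HTTP call sending `Idempotency-Key`), and `Outbox`, `DedupSink`, and their methods are hypothetical names. `DedupSink` mimics a unique index with `ON CONFLICT DO NOTHING`.

```typescript
import { randomUUID } from "node:crypto";

interface SpooledItem {
  key: string; // idempotency key generated by the outbox, one per spooled item
  path: string;
  body: string;
}

class Outbox {
  // In-memory stand-in for the durable spool (assumption for the sketch).
  private spool: SpooledItem[] = [];

  // Accept-on-failover: spool the request and answer 202.
  accept(path: string, body: string): { status: number; key: string } {
    const item: SpooledItem = { key: randomUUID(), path, body };
    this.spool.push(item);
    return { status: 202, key: item.key };
  }

  // Replay at most maxItems; an item leaves the spool only after it is acknowledged.
  replay(forward: (item: SpooledItem) => boolean, maxItems: number): number {
    let sent = 0;
    while (sent < maxItems && this.spool.length > 0) {
      const item = this.spool[0];
      if (item === undefined || !forward(item)) break; // keep it for the next run
      this.spool.shift();
      sent += 1;
    }
    return sent;
  }

  depth(): number {
    return this.spool.length;
  }
}

// Stand-in for black: a duplicate key is ignored but still acknowledged.
class DedupSink {
  private rows = new Map<string, string>();

  insert(item: SpooledItem): boolean {
    if (!this.rows.has(item.key)) this.rows.set(item.key, item.body);
    return true;
  }

  count(): number {
    return this.rows.size;
  }
}
```

The `maxItems` argument is where a throttle such as the register's ~≤30/min would plug in. If an acknowledgement is lost after black has already stored the row, the item stays spooled and is sent again, and the key makes the second insert a no-op.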

8. Implementation status

Done & live

G4 — README-vps-owned.md corrected to the live upstream mapping.
Phase 1a watcher (G14, + G7 oracle) — built, verified, deployed to vps-0 and enabled:
- deployments/@domains/quinn.www/scripts/edge-watcher.sh — probe + per-form status JSON + alert state machine (anti-flap threshold, immediate/+1h/+4h/+6h escalation, recovery, weekly heartbeat).
- quinn-edge-watcher.service + quinn-edge-watcher.timer (minute oneshot) → /opt/quinn-edge-watcher on vps-0.
- deploy-edge-watcher.sh (idempotent; --verify ships+dry-runs without enabling).
- Status oracle at /opt/quinn-edge-watcher/state/status.json; alerts via DMS 127.0.0.1:25 → transquinnftw@pm.me.
- Verified: healthy + immediate-down + cross-run persistence/flap-guard (dry-run & NO_MAIL); live deploy run status=0/SUCCESS; ACTIVE email delivery confirmed in DMS log (status=sent, ProtonMail 250 OK).
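
The alert state machine described above can be modelled as a pure step function over a small, serialisable state. The real watcher is a shell script; this TypeScript sketch only illustrates the logic and makes two assumptions that the source does not confirm: the anti-flap threshold is 3 consecutive failures, and +1h/+4h/+6h are offsets from the first alert rather than gaps between alerts. All names are hypothetical, and the weekly heartbeat is left out.

```typescript
// Hypothetical model of the escalation logic; state is plain JSON so it can be
// written to a file between timer runs.
interface WatchState {
  consecutiveFailures: number;
  firstAlertAt: number | null; // epoch ms of the first alert in this outage
  alertsSent: number;
}

type Action = "none" | "alert" | "recovered";

const HOUR_MS = 3_600_000;
const ESCALATION_OFFSETS_MS = [0, 1 * HOUR_MS, 4 * HOUR_MS, 6 * HOUR_MS]; // assumed offsets
const FLAP_THRESHOLD = 3; // assumed value

const INITIAL_STATE: WatchState = { consecutiveFailures: 0, firstAlertAt: null, alertsSent: 0 };

function step(state: WatchState, probeOk: boolean, now: number): { state: WatchState; action: Action } {
  if (probeOk) {
    // Only announce recovery if an alert actually went out.
    const action: Action = state.firstAlertAt !== null ? "recovered" : "none";
    return { state: { ...INITIAL_STATE }, action };
  }
  const failures = state.consecutiveFailures + 1;
  if (failures < FLAP_THRESHOLD) {
    // Anti-flap: a short blip never reaches the alerting path.
    return { state: { ...state, consecutiveFailures: failures }, action: "none" };
  }
  const firstAlertAt = state.firstAlertAt ?? now;
  const due = ESCALATION_OFFSETS_MS[state.alertsSent];
  const fire = due !== undefined && now - firstAlertAt >= due;
  return {
    state: {
      consecutiveFailures: failures,
      firstAlertAt,
      alertsSent: fire ? state.alertsSent + 1 : state.alertsSent,
    },
    action: fire ? "alert" : "none",
  };
}
```

Because `step` takes the previous state and the current time as inputs, a minute-interval oneshot can load the state file, call it once, act on the result, and write the state back, which is what lets escalation survive restarts.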

Phase 1b — DONE, deployed live 2026-06-22 (gitSha `74017f18`)

Verified live: /edge/status.json serves the oracle JSON; the deployed SPA bundle fetches it; /api/i18n is now edge-cached (G6); pseo_cache inactive=24h (G5). e2e smoke gate passed. This delivers the runtime form kill-switch (the original ask). All gating is fail-open: if the oracle is missing, stale, or unreachable, every form stays enabled; the oracle can only ever disable.

provider-website/frontend-public/src/context/EdgeStatusContext.tsx — polls /edge/status.json (60s + on focus), useFormGate(form), stale-guard (5 min) → fail-open. Tested (5 specs).
.../components/shared/FormUnavailableNotice.tsx — SMS-fallback shown in place of a down form.
Wired EdgeStatusProvider into App.tsx; gated all five forms (Contact/ContactModal, Touring, Booking, Roster, ShopSignup/newsletter).
nginx location = /edge/status.json in prod.conf (alias → watcher state dir; no-store; island-safe). status.json verified nginx-readable (644).
Verified: tsc --noEmit clean; 40 existing form tests + 5 new gate tests pass.
Go-live: ./run deploy:quinn (CI from origin/main, e2e smoke gate) — requires commit+push first. Until deployed, the SPA /edge/status.json fetch 404s → fail-open (no behaviour change).
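
The fail-open rule comes down to one decision function. The sketch below is not the code in EdgeStatusContext.tsx: the JSON shape (`generatedAtMs`, `forms[name].enabled`) and the function name `isFormEnabled` are assumptions for illustration; only the 5-minute stale guard and the "oracle can only disable" rule come from the text above.

```typescript
// Assumed shape of the oracle payload; the real status.json may differ.
interface EdgeStatus {
  generatedAtMs: number;
  forms: Record<string, { enabled: boolean }>;
}

const STALE_AFTER_MS = 5 * 60_000; // stale guard from the text above

// Returns false only when a fresh oracle explicitly disables the form.
function isFormEnabled(status: EdgeStatus | null, form: string, nowMs: number): boolean {
  if (status === null) return true; // missing or unreachable (e.g. 404): fail open
  if (nowMs - status.generatedAtMs > STALE_AFTER_MS) return true; // stale: fail open
  const entry = status.forms[form];
  if (entry === undefined) return true; // form not listed: fail open
  return entry.enabled;
}
```

Every branch except the last returns `true`, so a broken watcher, a stopped timer, or a bad deploy of the oracle cannot take a working form offline.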

Not done

Phase 2d (client Idempotency-Key per submission): optional, not done (see §7).
G1–G3 (failover to the local twin, replication, canonical-DB unification): parked on the §6 decision.
Still open in the §5 register: G11 (decision), G15 (build), G16 (verify).

Verification method

Read-only ssh to vps-0/black, live nginx + pgBouncer config reads, HTTP /health + /www/* compares, DB row-count/freshness queries, and the watcher's own dry-run/NO_MAIL self-tests.
