Edge Resilience & Island Mode — Verified Topology + Design
Status: Investigation + design. No runtime changes made (read-only probes only).
Verified: 2026-06-21, via read-only ssh quinn-vps / ssh black, live nginx config, live quinn-upstreams.conf, live HTTP + DB probes.
Current-state facts here supersede: the SQLite-era inventory in PROD_DB_UNIFICATION_PLAN.md (the platform has since moved to PostgreSQL) and the "dead forms" verdict in FORMS_AUDIT.md (2026-06-03 — the edge location blocks have since been added; forms are now routed).
Direction docs (target, not current state): migration-vps-to-black.md, PROD_DB_UNIFICATION_PLAN.md.
1. Why this doc exists
The originating ask: public contact forms should automatically disable themselves when their backend is unreachable, and vps-0 should be able to "island mode" without black (keep serving what it can when black or the WireGuard link drops). Investigating that surfaced a live topology that diverges from the documented target, plus a data-integrity issue independent of island mode. This doc records:
- The verified current topology (2026-06-21).
- The island-mode / runtime-kill-switch design.
- The consolidated gap register.
- The one open decision that blocks the design.
2. Verified current topology (2026-06-21)
2.1 Hosts
| Host | Role | Reachability |
|---|---|---|
| vps-0 (89.127.233.145, WG 10.9.0.1) | Public edge: nginx + static SPA + edge cache + a near-complete local backend stack incl. its own Postgres | Public internet |
| black (10.0.0.11 over WireGuard) | Canonical for the public read surface + contact/touring writes | LAN/WG only |
Reality check: vps-0 is not the "pure edge" the migration target describes. It runs quinn-api, quinn-admin-api, quinn-data-api, quinn-my-api, quinn-sso-api, quinn-newsletter-api, quinn-m-backend-user, plus postgresql@17-quinn (local :5435) and pgBouncer (:6432).
2.2 Edge routing — nginx on vps-0 (prod.conf)
| Public path | Upstream | Resolves to | Cached? |
|---|---|---|---|
| /www/* (destinations, tour, blog, regions) | black_api | black:3023 | pseo_cache 60m, serve-stale-on-error |
| /sitemap.xml | black_api | black:3023 | pseo_cache 60m |
| /api/i18n/* | black_api | black:3023 | none ⚠ |
| /provider-api/* (ProviderData JSON) | black_data_api | black:3022 | data_cache 30m, serve-stale |
| /photos/* | black_photos | black:8081 | photos_cache 7d, serve-stale |
| /public/* (contact, touring) (write) | black_api | black:3023 | none |
| /waitlist (write) | black_api | black:3023 | none |
| /api/bookings (write) | black_my_api | 127.0.0.1:3024 (LOCAL) | none |
| /public/roster/* (write) | black_my_api | 127.0.0.1:3024 (LOCAL) | none |
| /newsletter/* (write) | black_newsletter | 127.0.0.1:3026 (LOCAL) | none |
2.3 Live upstreams (/etc/nginx/conf.d/quinn-upstreams.conf — VPS-owned, not in repo)
The black_ prefix is historical and misleading — two upstreams are local to vps-0:
black_api → 10.0.0.11:3023 (black, WG)
black_data_api → 10.0.0.11:3022 (black, WG)
black_my_api → 127.0.0.1:3024 (LOCAL vps-0)
black_newsletter → 127.0.0.1:3026 (LOCAL vps-0)
black_photos → 10.0.0.11:8081 (black, WG)
The repo's README-vps-owned.md was stale (documented black_api as :3030 and my-api/newsletter as 10.0.0.11); it was corrected 2026-06-21 to match live, since re-applying the stale values would mis-route production.
2.4 Backends + databases — split-brain
| Surface | Backend | Database | Canonical host |
|---|---|---|---|
| Public reads (/www, /provider-api) | black :3023 / :3022 | black:25435/quinn(_admin) | black |
| contact / touring / waitlist (write) | black :3023 | black:25435/quinn | black |
| booking / roster (write) | vps-0 local :3024 | vps-0:5435/quinn (via pgBouncer :6432) | vps-0 |
| newsletter (write) | vps-0 local :3026 | vps-0:5435/quinn | vps-0 |
Writes are partitioned across two canonical Postgres instances by form. Booking data exists only on vps-0; contact data only on black. Nothing reconciles them.
2.5 The local stack is NOT a replica (HTTP compare, vps-0 local :3023 vs black :3023)
| Endpoint | LOCAL vps-0 | BLACK | |
|---|---|---|---|
| /health | build 257 / 0.1.149, mode:internal, 2026-06-21 | older build ({"ok":true} only) | vps-0 is newer |
| /www/destinations | 79 items | 82 | DIFFER |
| /www/provider-config | 95 items | 98 | DIFFER |
| /www/tour-stops | - | - | DIFFER |
vps-0's local quinn DB is populated but drifted (destinations max 2026-05-18), with no replication feed from black. The public site reads from black (82 destinations); vps-0's copy (79) is behind and not in the public read path. This reads like a stalled cutover: vps-0 looks half-prepped to become primary (newer build, full local DB) but the public path was never switched to it.
2.6 Edge cache durability
| Zone | inactive (eviction) | Island value |
|---|---|---|
| pseo_cache (/www) | 1h | Weak: cold pages evict within 1h of an outage → 502 |
| data_cache (/provider-api) | 1d | Good |
| photos_cache (/photos) | 30d | Strong |
3. The exposure (what breaks when black / WG drops)
- Hard-fail: contact, touring, waitlist (POST → black:3023). No pre-warning — the SPA fetches no runtime config, so it posts into a 502 blind.
- Degrades to stale, then fails: /www reads survive only while cached and only ~1h for cold pages (pseo_cache inactive=1h); /provider-api survives ~1d; /api/i18n is uncached and is fetched at runtime (providerApp.tsx:91, landingApp.tsx:241) → translations hard-fail.
- Stays fully alive (local on vps-0): booking, roster, newsletter.
There is no watcher on vps-0 for the API/forms surface today (a separate gallery monitor exists for photos only).
4. Island-mode design (proposed — not built)
In-process edge-health module in a vps-0-local quinn.api PUBLIC instance (placement decided), with manual override. Maps onto existing seams: public-proxy.ts (isLocallyServable / publicModeGate) and the probe pattern in system-status.ts.
- Runtime kill switch. Background prober + circuit breaker per form (fed actively by probes, passively by proxy failures). GET /edge/status served locally (island-safe) returns the per-form enabled/disabled map. Frontend FormGateProvider fetches it on load + focus; forms render a "reach me by SMS" fallback instead of posting into a 502.
- Store-and-forward outbox (for the black-dependent writes only: contact, touring, waitlist). Edge accepts the POST, persists to a durable local spool, returns 200, and a background forwarder replays to black on recovery. Requires: idempotency key + black-side dedupe (ON CONFLICT DO NOTHING); encrypted/short-lived spool (PII on a public host); throttled replay (respect black + vps-0 fail2ban).
- Watcher + alerting. Weekly "active" heartbeat + immediate failure alert with 1h / 4h / 6h backoff, escalation state persisted across restarts, anti-flap (reuse gallery-monitor pattern), sent via vps-0 local DMS (swaks --server 127.0.0.1:25, black-independent).
- Backend fail-fast. When a breaker is open, short-circuit the write with a fast structured 503 instead of hanging on a dead TCP connect.
- Never a new SPOF. nginx keeps black as primary; the edge service is failover/accept-on-error only; under systemd Restart=always.
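The runtime kill switch above can be sketched as a small per-form circuit breaker. This is only an illustration under stated assumptions: the class name `FormBreaker`, the helper `statusMap`, and the threshold of 3 consecutive failures are hypothetical and do not come from the codebase.

```typescript
// Hypothetical sketch of a per-form circuit breaker; names and thresholds
// are illustrative assumptions, not the project's actual module.
class FormBreaker {
  private consecutiveFailures = 0;
  private open = false;
  private readonly failureThreshold: number;

  constructor(failureThreshold: number = 3) {
    this.failureThreshold = failureThreshold;
  }

  // Fed by both sources named in the design: active probes and
  // passive proxy failures call the same method.
  record(ok: boolean): void {
    if (ok) {
      this.consecutiveFailures = 0;
      this.open = false;
      return;
    }
    this.consecutiveFailures += 1;
    if (this.consecutiveFailures >= this.failureThreshold) {
      this.open = true;
    }
  }

  isOpen(): boolean {
    return this.open;
  }
}

// Builds the per-form enabled/disabled map a status endpoint could return.
function statusMap(breakers: Map<string, FormBreaker>): Record<string, boolean> {
  const out: Record<string, boolean> = {};
  breakers.forEach((breaker, form) => {
    out[form] = !breaker.isOpen();
  });
  return out;
}
```

An open breaker is also the natural trigger for the fail-fast 503 path: the write handler would check `isOpen()` before attempting the upstream connect.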
What stays alive in island mode: booking, roster, newsletter (already local); cached /www + /provider-api reads (stale); contact/touring accepted to the outbox for later replay. Disabled/degraded: cold /www pages, runtime i18n, live contact/touring delivery.
5. Gap register
| # | Gap | Handling | Status |
|---|---|---|---|
| G1 | Public read/contact upstreams point only at black; no failover to the local twin | nginx upstream failover — blocked: local DB is not a replica (G2) | blocked |
| G2 | Local :5435/quinn is not a live replica of black (drifted, no feed) | Establish real replication before any failover-to-local | verified NO |
| G3 | Split-brain writes across two canonical DBs (contact→black, booking/roster/newsletter→vps-0) | Unify canonical DB (see §6); latent data-integrity issue independent of island mode | verified |
| G4 | README-vps-owned.md upstream ports stale → re-applying mis-routes prod | Reconciled to live mapping | done 2026-06-21 |
| G5 | pseo_cache inactive=1h → cold /www evicts within 1h of outage | Raised to 24h in quinn-maps.conf | DONE, deployed 2026-06-22 |
| G6 | /api/i18n/ uncached and fetched at runtime → translations hard-fail on black down | proxy_cache (6h, serve-stale) added | DONE, deployed 2026-06-22 |
| G7 | No runtime form gating; SPA posts into 502 blind | /edge/status oracle + nginx serve + FormGateProvider | DONE, deployed live 2026-06-22 (gitSha 74017f18) |
| G8 | Black-dependent writes (contact/touring/waitlist) hard-fail on outage | Store-and-forward outbox (nginx failover backup) | DONE — live 2026-06-22 |
| G9 | contact_submissions has no unique/idempotency constraint → replay duplicates | idempotency_key + unique index + ON CONFLICT DO NOTHING | DONE, deployed 2026-06-22 |
| G10 | PII at rest on public host (outbox spool) | Encrypt at rest / short-lived / never log bodies | build rule |
| G11 | Provider SMTP notify delayed until replay | Accept delay, or local-DMS notify on accept | decision |
| G12 | Edge service could become a new SPOF | black stays primary; edge failover-only; Restart=always | build rule |
| G13 | Outbox unbounded growth + recovery thundering herd + vps-0 fail2ban on POST bursts | Cap spool + alert on depth/age; throttle replay (~≤30/min) | build rule |
| G14 | Heartbeat/alert robustness (1h/4h/6h escalation must survive restarts; anti-flap) | Persist state to file; systemd timer; local DMS | DONE 2026-06-21 — deployed |
| G15 | Local write-services can crash independently | Watcher probes :3024/:3026 too; never assume "local = up" | build |
| G16 | Idempotency migration safety on existing contact/touring/waitlist inserts | Backfill-safe migration; verify before deploy | verify |
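The G13 handling (cap the spool, alert on depth/age, throttle replay to roughly 30 per minute) reduces to simple pacing arithmetic. The sketch below takes only the ~30/min figure from the register; the function names and the alert limits passed in are hypothetical.

```typescript
// Assumed limit taken from G13 ("~<=30/min"); all names are hypothetical.
const MAX_PER_MINUTE = 30;

// Minimum gap between forwarded items so the rate never exceeds the cap.
function minIntervalMs(maxPerMinute: number): number {
  return Math.ceil(60000 / maxPerMinute);
}

// Earliest send time (ms) for each of `count` spooled items, evenly paced.
function replaySchedule(
  count: number,
  startMs: number,
  maxPerMinute: number = MAX_PER_MINUTE
): number[] {
  const gap = minIntervalMs(maxPerMinute);
  const times: number[] = [];
  for (let i = 0; i < count; i++) {
    times.push(startMs + i * gap);
  }
  return times;
}

// Alert when spool depth or the oldest item's age crosses a limit.
function spoolAlert(
  depth: number,
  oldestAgeMs: number,
  maxDepth: number,
  maxAgeMs: number
): boolean {
  return depth > maxDepth || oldestAgeMs > maxAgeMs;
}
```

At 30 per minute the gap is 2000 ms, so a backlog of 90 items drains in about three minutes, which spreads the recovery burst instead of delivering it at once.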
6. Open decision — which DB is canonical? (blocks the design)
The island-mode architecture depends on resolving the split-brain, and that is above an agent's authority — it's an operator decision that also touches migration-vps-to-black.md.
- If black stays canonical (the documented target): island mode = outbox + accept-stale-cache (G7–G14). The local vps-0 stack/DB is dead weight until replicated, and booking/roster/newsletter writes must be moved back to black to undo the split-brain.
- If vps-0 becomes primary (what the newer shadow build hints at): finish the cutover, replicate vps-0 → black as standby, and move contact/touring writes onto vps-0. Island mode then becomes nearly free.
Either way the split-brain is a standing data-integrity problem (booking data lives only on vps-0, contact only on black) that should be resolved regardless of island mode.
7. Build order
Phases sequenced by risk and by what's blocked on the §6 canonical-DB decision.
- Phase 1a: Edge watcher + status oracle (DONE, deployed 2026-06-21). Decision-independent; touches no data path. Probes the five backends every minute, writes the per-form kill-switch JSON, and emails heartbeat + escalating down alerts via local DMS. See §8.
- Phase 1b: Serve the oracle + frontend gate (next, decision-independent). Add an nginx location /edge/status.json (or have the watcher write into a served path) and a SPA FormGateProvider that reads it and disables a form whose dependsOn target is down. Ships via the normal quinn.www deploy (e2e smoke gate). Also handles G6 (cache /api/i18n) and G5 (raise pseo_cache inactive).
- Phase 2: Store-and-forward outbox (NOT blocked on §6; corrected). Only the black-dependent writes (contact/touring/waitlist). The outbox keeps black as contact's canonical home (status quo) and replays to black's current address, so it does not require the §6 unification decision. Lower-risk failover design: nginx keeps routing contact directly to black as primary; the local outbox is the backup upstream that only receives traffic when black fails (proxy_next_upstream error timeout non_idempotent). The normal path is therefore unchanged: a bug in the outbox can only affect the already-failing (black-down) case, never drop a lead during normal operation.
  - 2a: outbox service (vps-0 local Node service: accept-on-failover → durable spool → throttled forwarder to black with Idempotency-Key). DONE, deployed dormant 2026-06-22 (quinn-edge-outbox on 127.0.0.1:3098, empty spool, unrouted). Verified in isolation (accept→spool→forward→clear against a sink).
  - 2b: G9 idempotency on black contact_submissions (additive nullable column + unique index + ON CONFLICT DO NOTHING). DONE, deployed to black 2026-06-22. (touring/waitlist already natural-idempotent via UNIQUE(email,provider_slug) upsert, so only contact needed it.)
  - 2c: nginx failover cutover. DONE, live 2026-06-22. public_write upstream (black primary + outbox :3098 backup) with proxy_next_upstream ... non_idempotent on /public/contact, /public/touring/subscribe, /waitlist.
  - 2d: frontend emits a client Idempotency-Key per submission. Optional / not done. The outbox generates a key per spooled item, so replay dedupe already works; 2d only adds client-double-submit protection.
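The interplay between 2a (one key per spooled item) and 2b (dedupe on black) is easiest to see in a toy model. The sketch below is an in-memory stand-in, not the real quinn-edge-outbox service: `Outbox`, `DedupingSink`, the key format, and the simulated crash flag are all hypothetical, and the sink's `Set` only imitates what a unique index with ON CONFLICT DO NOTHING would do.

```typescript
interface SpooledItem {
  key: string;  // idempotency key generated at accept time
  path: string;
  body: string;
}

// Stand-in for a table with a unique index on the idempotency key.
class DedupingSink {
  private seen = new Set<string>();
  rows: SpooledItem[] = [];

  // Imitates INSERT ... ON CONFLICT DO NOTHING: a repeated key is a no-op.
  insert(item: SpooledItem): boolean {
    if (this.seen.has(item.key)) return false;
    this.seen.add(item.key);
    this.rows.push(item);
    return true;
  }
}

class Outbox {
  private spool: SpooledItem[] = [];
  private counter = 0;

  // Accept-on-failover: persist first, then acknowledge with the key.
  accept(path: string, body: string): string {
    this.counter += 1;
    const key = `outbox-${this.counter}`; // hypothetical key format
    this.spool.push({ key, path, body });
    return key;
  }

  depth(): number {
    return this.spool.length;
  }

  // Replays in order. An item is cleared only after delivery, so a crash
  // between delivery and clear causes a re-send that the sink must absorb.
  forward(sink: DedupingSink, crashBeforeClear: boolean = false): void {
    while (this.spool.length > 0) {
      const item = this.spool[0];
      if (item === undefined) return;
      sink.insert(item);
      if (crashBeforeClear) return; // simulated crash: delivered, not cleared
      this.spool.shift();
    }
  }
}
```

Clearing after delivery gives at-least-once forwarding; the key plus the deduping insert is what turns that into effectively-once rows on the receiving side.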
8. Implementation status
Done & live
- G4: README-vps-owned.md corrected to the live upstream mapping.
- Phase 1a watcher (G14, + G7 oracle): built, verified, deployed to vps-0 and enabled:
  - deployments/@domains/quinn.www/scripts/edge-watcher.sh: probe + per-form status JSON + alert state machine (anti-flap threshold, immediate/+1h/+4h/+6h escalation, recovery, weekly heartbeat).
  - quinn-edge-watcher.service + quinn-edge-watcher.timer (minute oneshot) → /opt/quinn-edge-watcher on vps-0.
  - deploy-edge-watcher.sh (idempotent; --verify ships+dry-runs without enabling).
  - Status oracle at /opt/quinn-edge-watcher/state/status.json; alerts via DMS 127.0.0.1:25 → transquinnftw@pm.me.
  - Verified: healthy + immediate-down + cross-run persistence/flap-guard (dry-run & NO_MAIL); live deploy run status=0/SUCCESS; ACTIVE email delivery confirmed in DMS log (status=sent, ProtonMail 250 OK).
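The alert state machine described for the watcher can be modelled as a pure step function over a small serializable state. This is a TypeScript sketch, not a port of edge-watcher.sh: it assumes the +1h/+4h/+6h steps are offsets from the first alert and that the anti-flap threshold is 3 consecutive failed probes; both are guesses, and the weekly heartbeat is left out.

```typescript
// JSON-serializable so it can be persisted to a file between timer runs.
interface WatchState {
  consecutiveFails: number;
  firstAlertAtMs: number | null; // set once the outage is confirmed
  alertsSent: number;            // escalation steps already fired
}

type Action = "none" | "alert" | "recovered";

const HOUR_MS = 3600000;
// Assumption: escalation steps are offsets measured from the first alert.
const ESCALATION_OFFSETS_MS = [0, 1 * HOUR_MS, 4 * HOUR_MS, 6 * HOUR_MS];
const FLAP_THRESHOLD = 3; // assumed anti-flap threshold

function step(
  state: WatchState,
  probeOk: boolean,
  nowMs: number
): { state: WatchState; action: Action } {
  if (probeOk) {
    const wasAlerting = state.firstAlertAtMs !== null;
    return {
      state: { consecutiveFails: 0, firstAlertAtMs: null, alertsSent: 0 },
      action: wasAlerting ? "recovered" : "none",
    };
  }
  const fails = state.consecutiveFails + 1;
  if (fails < FLAP_THRESHOLD) {
    // Not yet confirmed down: count the failure but stay quiet.
    return { state: { ...state, consecutiveFails: fails }, action: "none" };
  }
  const firstAlertAtMs = state.firstAlertAtMs ?? nowMs;
  const nextOffset = ESCALATION_OFFSETS_MS[state.alertsSent];
  const due = nextOffset !== undefined && nowMs - firstAlertAtMs >= nextOffset;
  return {
    state: {
      consecutiveFails: fails,
      firstAlertAtMs,
      alertsSent: state.alertsSent + (due ? 1 : 0),
    },
    action: due ? "alert" : "none",
  };
}
```

Because the function takes the previous state and the current time as inputs, a minute-oneshot timer can load the state file, call `step` once, write the result back, and exit, which is what lets escalation survive restarts.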
Phase 1b — DONE, deployed live 2026-06-22 (gitSha 74017f18)
Verified live: /edge/status.json serves the oracle JSON; the deployed SPA bundle fetches it; /api/i18n is now edge-cached (G6); pseo_cache inactive=24h (G5). e2e smoke gate passed.
The runtime form kill-switch (the original ask). All fail-open: if the oracle is missing/stale/unreachable, every form stays enabled — the oracle can only ever disable.
- provider-website/frontend-public/src/context/EdgeStatusContext.tsx: polls /edge/status.json (60s + on focus), useFormGate(form), stale-guard (5 min) → fail-open. Tested (5 specs).
- ..../components/shared/FormUnavailableNotice.tsx: SMS-fallback shown in place of a down form.
- Wired EdgeStatusProvider into App.tsx; gated all five forms (Contact/ContactModal, Touring, Booking, Roster, ShopSignup/newsletter).
- nginx location = /edge/status.json in prod.conf (alias → watcher state dir; no-store; island-safe). status.json verified nginx-readable (644).
- Verified: tsc --noEmit clean; 40 existing form tests + 5 new gate tests pass.
- Go-live: ./run deploy:quinn (CI from origin/main, e2e smoke gate); requires commit+push first. Until deployed, the SPA /edge/status.json fetch 404s → fail-open (no behaviour change).
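The fail-open rule ("the oracle can only ever disable") comes down to one decision function. The sketch below is not the code in EdgeStatusContext.tsx: the `EdgeStatus` shape, its field names, and `isFormEnabled` are assumptions made for illustration; only the 5-minute stale guard is taken from the text.

```typescript
// Assumed oracle shape; the real status.json fields may differ.
interface EdgeStatus {
  generatedAtMs: number;           // when the watcher wrote the file
  forms: Record<string, boolean>;  // form name -> enabled
}

const STALE_AFTER_MS = 5 * 60 * 1000; // 5 min stale-guard from the text

// Returns false only when a fresh oracle explicitly disables the form.
function isFormEnabled(
  status: EdgeStatus | null,
  form: string,
  nowMs: number
): boolean {
  if (status === null) return true; // missing or unreachable: fail open
  if (nowMs - status.generatedAtMs > STALE_AFTER_MS) return true; // stale: fail open
  return status.forms[form] !== false; // unknown form: fail open
}
```

Comparing against `false` rather than checking for `true` is the design choice that matters here: any missing key, fetch error, or stale file leaves the form enabled.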
Not done (parked on §6 decision)
- G5 (pseo_cache inactive raise) + G6 (/api/i18n cache): adjacent cache-durability fixes, can ride the same deploy.
- Phase 2 (outbox, G8–G13/G16).
Verification method
Read-only ssh to vps-0/black, live nginx + pgBouncer config reads, HTTP /health + /www/* compares, DB row-count/freshness queries, and the watcher's own dry-run/NO_MAIL self-tests.