lilith-platform.live/docs/EDGE_ISLAND_MODE.md
Natalie da16755bfc docs(edge): Phase 2 outbox failover live + document public_write upstream
2b (G9 idempotency) deployed to black; 2c (nginx failover) live and verified
end-to-end (normal 201 / black-down 202 -> spool -> replay -> G9 dedup). Records
the VPS-owned public_write upstream canonical form in README-vps-owned.md.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-22 02:55:08 -05:00


Edge Resilience & Island Mode — Verified Topology + Design

Status: Investigation + design. No runtime changes made (read-only probes only).

Verified: 2026-06-21, via read-only ssh quinn-vps / ssh black, live nginx config, live quinn-upstreams.conf, live HTTP + DB probes.

Current-state facts here supersede: the SQLite-era inventory in PROD_DB_UNIFICATION_PLAN.md (the platform has since moved to PostgreSQL) and the "dead forms" verdict in FORMS_AUDIT.md (2026-06-03; the edge location blocks have since been added, so forms are now routed).

Direction docs (target, not current state): migration-vps-to-black.md, PROD_DB_UNIFICATION_PLAN.md.


1. Why this doc exists

The originating ask: public contact forms should automatically disable themselves when their backend is unreachable, and vps-0 should be able to "island mode" without black (keep serving what it can when black or the WireGuard link drops). Investigating that surfaced a live topology that diverges from the documented target, plus a data-integrity issue independent of island mode. This doc records:

  1. The verified current topology (2026-06-21).
  2. The island-mode / runtime-kill-switch design.
  3. The consolidated gap register.
  4. The one open decision that blocks the design.

2. Verified current topology (2026-06-21)

2.1 Hosts

| Host | Role | Reachability |
|---|---|---|
| vps-0 (89.127.233.145, WG 10.9.0.1) | Public edge: nginx + static SPA + edge cache + a near-complete local backend stack incl. its own Postgres | Public internet |
| black (10.0.0.11 over WireGuard) | Canonical for the public read surface + contact/touring writes | LAN/WG only |

Reality check: vps-0 is not the "pure edge" the migration target describes. It runs quinn-api, quinn-admin-api, quinn-data-api, quinn-my-api, quinn-sso-api, quinn-newsletter-api, quinn-m-backend-user, plus postgresql@17-quinn (local :5435) and pgBouncer (:6432).

2.2 Edge routing — nginx on vps-0 (prod.conf)

| Public path | Upstream | Resolves to | Cached? |
|---|---|---|---|
| /www/* (destinations, tour, blog, regions) | black_api | black:3023 | pseo_cache 60m, serve-stale-on-error |
| /sitemap.xml | black_api | black:3023 | pseo_cache 60m |
| /api/i18n/* | black_api | black:3023 | none |
| /provider-api/* (ProviderData JSON) | black_data_api | black:3022 | data_cache 30m, serve-stale |
| /photos/* | black_photos | black:8081 | photos_cache 7d, serve-stale |
| /public/* (contact, touring; write) | black_api | black:3023 | none |
| /waitlist (write) | black_api | black:3023 | none |
| /api/bookings (write) | black_my_api | 127.0.0.1:3024 (LOCAL) | none |
| /public/roster/* (write) | black_my_api | 127.0.0.1:3024 (LOCAL) | none |
| /newsletter/* (write) | black_newsletter | 127.0.0.1:3026 (LOCAL) | none |

2.3 Live upstreams (/etc/nginx/conf.d/quinn-upstreams.conf — VPS-owned, not in repo)

The black_ prefix is historical and misleading — two upstreams are local to vps-0:

black_api        → 10.0.0.11:3023    (black, WG)
black_data_api   → 10.0.0.11:3022    (black, WG)
black_my_api     → 127.0.0.1:3024    (LOCAL vps-0)
black_newsletter → 127.0.0.1:3026    (LOCAL vps-0)
black_photos     → 10.0.0.11:8081    (black, WG)

The repo's README-vps-owned.md was stale (documented black_api as :3030 and my-api/newsletter as 10.0.0.11) — corrected 2026-06-21 to match live, since re-applying the stale values would mis-route production.

2.4 Backends + databases — split-brain

| Surface | Backend | Database | Canonical host |
|---|---|---|---|
| Public reads (/www, /provider-api) | black :3023 / :3022 | black:25435/quinn(_admin) | black |
| contact / touring / waitlist (write) | black :3023 | black:25435/quinn | black |
| booking / roster (write) | vps-0 local :3024 | vps-0:5435/quinn (via pgBouncer :6432) | vps-0 |
| newsletter (write) | vps-0 local :3026 | vps-0:5435/quinn | vps-0 |

Writes are partitioned across two canonical Postgres instances by form. Booking data exists only on vps-0; contact data only on black. Nothing reconciles them.

2.5 The local stack is NOT a replica (HTTP compare, vps-0 local :3023 vs black :3023)

| Endpoint | LOCAL vps-0 | BLACK | Result |
|---|---|---|---|
| /health | build 257 / 0.1.149, mode:internal, 2026-06-21 | older build ({"ok":true} only) | vps-0 is newer |
| /www/destinations | 79 items | 82 | DIFFER |
| /www/provider-config | 95 items | 98 | DIFFER |
| /www/tour-stops | | | DIFFER |

vps-0's local quinn DB is populated but drifted (destinations max 2026-05-18), with no replication feed from black. The public site reads from black (82 destinations); vps-0's copy (79) is behind and not in the public read path. This reads like a stalled cutover: vps-0 looks half-prepped to become primary (newer build, full local DB) but the public path was never switched to it.

2.6 Edge cache durability

| Zone | inactive (eviction) | Island value |
|---|---|---|
| pseo_cache (/www) | 1h | Weak: cold pages evict within 1h of an outage → 502 |
| data_cache (/provider-api) | 1d | Good |
| photos_cache (/photos) | 30d | Strong |

3. The exposure (what breaks when black / WG drops)

  • Hard-fail: contact, touring, waitlist (POST → black:3023). No pre-warning — the SPA fetches no runtime config, so it posts into a 502 blind.
  • Degrades to stale, then fails: /www reads survive only while cached and only ~1h for cold pages (pseo_cache inactive=1h); /provider-api survives ~1d; /api/i18n is uncached and is fetched at runtime (provider App.tsx:91, landing App.tsx:241) → translations hard-fail.
  • Stays fully alive (local on vps-0): booking, roster, newsletter.

There is no watcher on vps-0 for the API/forms surface today (a separate gallery monitor exists for photos only).


4. Island-mode design (proposed — not built)

In-process edge-health module in a vps-0-local quinn.api PUBLIC instance (placement decided), with manual override. Maps onto existing seams: public-proxy.ts (isLocallyServable / publicModeGate) and the probe pattern in system-status.ts.

  1. Runtime kill switch. Background prober + circuit breaker per form (fed actively by probes, passively by proxy failures). GET /edge/status served locally (island-safe) returns the per-form enabled/disabled map. Frontend FormGateProvider fetches it on load + focus; forms render a "reach me by SMS" fallback instead of posting into a 502.
  2. Store-and-forward outbox (for the black-dependent writes only: contact, touring, waitlist). Edge accepts the POST, persists to a durable local spool, returns 200, and a background forwarder replays to black on recovery. Requires: idempotency key + black-side dedupe (ON CONFLICT DO NOTHING); encrypted/short-lived spool (PII on a public host); throttled replay (respect black + vps-0 fail2ban).
  3. Watcher + alerting. Weekly "active" heartbeat + immediate failure alert with 1h / 4h / 6h backoff, escalation state persisted across restarts, anti-flap (reuse gallery-monitor pattern), sent via vps-0 local DMS (swaks --server 127.0.0.1:25, black-independent).
  4. Backend fail-fast. When a breaker is open, short-circuit the write with a fast structured 503 instead of hanging on a dead TCP connect.
  5. Never a new SPOF. nginx keeps black as primary; the edge service is failover/accept-on-error only; under systemd Restart=always.
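
The kill switch (item 1) and backend fail-fast (item 4) both hinge on a per-form circuit breaker. The following is a minimal sketch of that idea only, not the design as built: the class name, the failure threshold, and the cool-down are hypothetical, and the clock is passed in as a parameter so the logic can be exercised without timers.

```typescript
// Hypothetical per-form circuit breaker. Names and thresholds are illustrative
// assumptions, not taken from public-proxy.ts or system-status.ts.
type BreakerState = "closed" | "open" | "half-open";

class FormBreaker {
  private failures = 0;
  private openedAt: number | null = null;
  private failureThreshold: number; // consecutive failures before opening
  private coolDownMs: number;       // time to stay open before allowing a trial

  constructor(failureThreshold: number, coolDownMs: number) {
    this.failureThreshold = failureThreshold;
    this.coolDownMs = coolDownMs;
  }

  // Fed actively by probe results and passively by proxy outcomes.
  record(ok: boolean, nowMs: number): void {
    if (ok) {
      this.failures = 0;
      this.openedAt = null;
      return;
    }
    this.failures += 1;
    if (this.failures >= this.failureThreshold) {
      this.openedAt = nowMs; // open (or re-open) and restart the cool-down
    }
  }

  state(nowMs: number): BreakerState {
    if (this.openedAt === null) return "closed";
    return nowMs - this.openedAt >= this.coolDownMs ? "half-open" : "open";
  }

  // Fail-fast: while open, the caller should answer with a structured 503
  // immediately instead of attempting the upstream connect.
  allowRequest(nowMs: number): boolean {
    return this.state(nowMs) !== "open";
  }
}

// Per-form enabled/disabled map, the kind of payload a status endpoint could serve.
function enabledMap(breakers: Record<string, FormBreaker>, nowMs: number): Record<string, boolean> {
  const out: Record<string, boolean> = {};
  for (const form of Object.keys(breakers)) {
    const breaker = breakers[form];
    if (breaker) out[form] = breaker.allowRequest(nowMs);
  }
  return out;
}
```

In this sketch a single success closes the breaker again, and a failure during the half-open trial re-opens it and restarts the cool-down. Whether the real module should require several successes before closing is a tuning choice this doc does not settle.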

What stays alive in island mode: booking, roster, newsletter (already local); cached /www + /provider-api reads (stale); contact/touring accepted to the outbox for later replay. Disabled/degraded: cold /www pages, runtime i18n, live contact/touring delivery.
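
The island-mode outcome above can be derived mechanically from the §2.2 routing. The sketch below is illustrative only: the form-to-upstream mapping follows the routing table, but the type names, the function, and the three-state result are assumptions made for this example.

```typescript
// Hypothetical sketch. The mapping mirrors the §2.2 routing table; the names
// and the "live/queued/disabled" vocabulary are illustrative assumptions.
type Upstream = "black_api" | "black_my_api" | "black_newsletter";
type FormMode = "live" | "queued" | "disabled";

const FORM_UPSTREAM: Record<string, Upstream> = {
  contact: "black_api",
  touring: "black_api",
  waitlist: "black_api",
  booking: "black_my_api",        // local to vps-0 despite the prefix
  roster: "black_my_api",         // local to vps-0
  newsletter: "black_newsletter", // local to vps-0
};

// Only the black-dependent writes have an outbox to fall back on.
const OUTBOX_FORMS = ["contact", "touring", "waitlist"];

function formModes(up: Record<Upstream, boolean>, outboxUp: boolean): Record<string, FormMode> {
  const modes: Record<string, FormMode> = {};
  for (const form of Object.keys(FORM_UPSTREAM)) {
    const upstream = FORM_UPSTREAM[form];
    if (upstream === undefined) continue;
    if (up[upstream]) {
      modes[form] = "live";
    } else if (OUTBOX_FORMS.indexOf(form) !== -1 && outboxUp) {
      modes[form] = "queued"; // accepted locally, replayed to black later
    } else {
      modes[form] = "disabled";
    }
  }
  return modes;
}
```

With black unreachable and the local services healthy, this yields `queued` for contact, touring, and waitlist, and `live` for booking, roster, and newsletter. Local upstreams are inputs too, so a crashed local service disables its form rather than being assumed up (G15).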


5. Gap register

| # | Gap | Handling | Status |
|---|---|---|---|
| G1 | Public read/contact upstreams point only at black; no failover to the local twin | nginx upstream failover; blocked: local DB is not a replica (G2) | blocked |
| G2 | Local :5435/quinn is not a live replica of black (drifted, no feed) | Establish real replication before any failover-to-local | verified NO |
| G3 | Split-brain writes across two canonical DBs (contact→black, booking/roster/newsletter→vps-0) | Unify canonical DB (see §6); latent data-integrity issue independent of island mode | verified |
| G4 | README-vps-owned.md upstream ports stale → re-applying mis-routes prod | Reconciled to live mapping | done 2026-06-21 |
| G5 | pseo_cache inactive=1h → cold /www evicts within 1h of outage | Raised to 24h in quinn-maps.conf | DONE, deployed 2026-06-22 |
| G6 | /api/i18n/ uncached and fetched at runtime → translations hard-fail on black down | proxy_cache (6h, serve-stale) added | DONE, deployed 2026-06-22 |
| G7 | No runtime form gating; SPA posts into 502 blind | /edge/status oracle + nginx serve + FormGateProvider | DONE, deployed live 2026-06-22 (gitSha 74017f18) |
| G8 | Black-dependent writes (contact/touring/waitlist) hard-fail on outage | Store-and-forward outbox (nginx failover backup) | DONE, live 2026-06-22 |
| G9 | contact_submissions has no unique/idempotency constraint → replay duplicates | idempotency_key + unique index + ON CONFLICT DO NOTHING | DONE, deployed 2026-06-22 |
| G10 | PII at rest on public host (outbox spool) | Encrypt at rest / short-lived / never log bodies | build rule |
| G11 | Provider SMTP notify delayed until replay | Accept delay, or local-DMS notify on accept | decision |
| G12 | Edge service could become a new SPOF | black stays primary; edge failover-only; Restart=always | build rule |
| G13 | Outbox unbounded growth + recovery thundering herd + vps-0 fail2ban on POST bursts | Cap spool + alert on depth/age; throttle replay (~≤30/min) | build rule |
| G14 | Heartbeat/alert robustness (1h/4h/6h escalation must survive restarts; anti-flap) | Persist state to file; systemd timer; local DMS | DONE, deployed 2026-06-21 |
| G15 | Local write-services can crash independently | Watcher probes :3024/:3026 too; never assume "local = up" | build |
| G16 | Idempotency migration safety on existing contact/touring/waitlist inserts | Backfill-safe migration; verify before deploy | verify |

6. Open decision — which DB is canonical? (blocks the design)

The island-mode architecture depends on resolving the split-brain, and that is above an agent's authority — it's an operator decision that also touches migration-vps-to-black.md.

  • If black stays canonical (the documented target): island mode = outbox + accept-stale-cache (G7–G14). The local vps-0 stack/DB is dead weight until replicated, and booking/roster/newsletter writes must be moved back to black to undo the split-brain.
  • If vps-0 becomes primary (what the newer shadow build hints at): finish the cutover, replicate vps-0 → black as standby, and move contact/touring writes onto vps-0. Island mode then becomes nearly free.

Either way the split-brain is a standing data-integrity problem (booking data lives only on vps-0, contact only on black) that should be resolved regardless of island mode.


7. Build order

Phases sequenced by risk and by what's blocked on the §6 canonical-DB decision.

  • Phase 1a — Edge watcher + status oracle (DONE, deployed 2026-06-21). Decision-independent; touches no data path. Probes the five backends every minute, writes the per-form kill-switch JSON, and emails heartbeat + escalating down alerts via local DMS. See §8.
  • Phase 1b: Serve the oracle + frontend gate (DONE, deployed live 2026-06-22; decision-independent). Adds an nginx location /edge/status.json and a SPA FormGateProvider (implemented as EdgeStatusProvider, see §8) that reads it and disables a form whose dependsOn target is down. Shipped via the normal quinn.www deploy (e2e smoke gate). Also handles G6 (cache /api/i18n) and G5 (raise pseo_cache inactive).
  • Phase 2 — Store-and-forward outbox (NOT blocked on §6 — corrected). Only the black-dependent writes (contact/touring/waitlist). The outbox keeps black as contact's canonical home (status quo) and replays to black's current address, so it does not require the §6 unification decision. Lower-risk failover design: nginx keeps routing contact directly to black as primary; the local outbox is the backup upstream that only receives traffic when black fails (proxy_next_upstream error timeout non_idempotent). The normal path is therefore unchanged — a bug in the outbox can only affect the already-failing (black-down) case, never drop a lead during normal operation.
    • 2a — outbox service (vps-0 local Node service: accept-on-failover → durable spool → throttled forwarder to black with Idempotency-Key). DONE — deployed dormant 2026-06-22 (quinn-edge-outbox on 127.0.0.1:3098, empty spool, unrouted). Verified in isolation (accept→spool→forward→clear against a sink).
    • 2b — G9 idempotency on black contact_submissions (additive nullable column + unique index + ON CONFLICT DO NOTHING). DONE — deployed to black 2026-06-22. (touring/waitlist already natural-idempotent via UNIQUE(email,provider_slug) upsert, so only contact needed it.)
    • 2c — nginx failover cutover. DONE — live 2026-06-22. public_write upstream (black primary + outbox :3098 backup) with proxy_next_upstream ... non_idempotent on /public/contact, /public/touring/subscribe, /waitlist.
    • 2d — frontend emits a client Idempotency-Key per submission. Optional / not done — the outbox generates a key per spooled item, so replay dedupe already works; 2d only adds client-double-submit protection.
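
The spool, replay, and dedupe path of Phase 2 can be modelled in a few lines. This is an in-memory sketch under stated assumptions, not the quinn-edge-outbox code: the real service persists a durable spool and forwards over HTTP, and every name below (SpoolItem, DedupingSink, replayTick, maxPerTick) is hypothetical.

```typescript
// Hypothetical in-memory model of store-and-forward with idempotent replay.
interface SpoolItem {
  key: string;  // idempotency key generated when the item is spooled
  path: string; // e.g. "/public/contact"
  body: string;
}

// Stand-in for a table with a unique index on the idempotency key: inserting
// an existing key is a no-op, which is the effect of ON CONFLICT DO NOTHING.
class DedupingSink {
  private rows: Record<string, string> = {};

  insert(item: SpoolItem): "inserted" | "duplicate" {
    if (Object.prototype.hasOwnProperty.call(this.rows, item.key)) return "duplicate";
    this.rows[item.key] = item.body;
    return "inserted";
  }

  count(): number {
    return Object.keys(this.rows).length;
  }
}

// Replay at most maxPerTick items per call (the throttle). An item leaves the
// spool only after send() reports success, so a crash mid-replay re-sends
// rather than loses, and the sink's dedupe absorbs the repeat.
function replayTick(
  spool: SpoolItem[],
  send: (item: SpoolItem) => boolean,
  maxPerTick: number,
): SpoolItem[] {
  const remaining = spool.slice();
  let sent = 0;
  while (remaining.length > 0 && sent < maxPerTick) {
    const head = remaining[0];
    if (head === undefined || !send(head)) break; // upstream still down: retry next tick
    remaining.shift();
    sent += 1;
  }
  return remaining;
}
```

The design choice illustrated here is at-least-once delivery plus a unique key, which is why 2b had to land before 2c: without the dedupe on black, any re-sent item would create a duplicate row.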

8. Implementation status

Done & live

  • G4: README-vps-owned.md corrected to the live upstream mapping.
  • Phase 1a watcher (G14, + G7 oracle) — built, verified, deployed to vps-0 and enabled:
    • deployments/@domains/quinn.www/scripts/edge-watcher.sh — probe + per-form status JSON + alert state machine (anti-flap threshold, immediate/+1h/+4h/+6h escalation, recovery, weekly heartbeat).
    • quinn-edge-watcher.service + quinn-edge-watcher.timer (minute oneshot) → /opt/quinn-edge-watcher on vps-0.
    • deploy-edge-watcher.sh (idempotent; --verify ships+dry-runs without enabling).
    • Status oracle at /opt/quinn-edge-watcher/state/status.json; alerts via DMS 127.0.0.1:25 → transquinnftw@pm.me.
    • Verified: healthy + immediate-down + cross-run persistence/flap-guard (dry-run & NO_MAIL); live deploy run status=0/SUCCESS; ACTIVE email delivery confirmed in DMS log (status=sent, ProtonMail 250 OK).
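
The alert state machine listed above (anti-flap threshold, immediate/+1h/+4h/+6h escalation, recovery) can be expressed as a pure step function. The watcher itself is a shell script, so this TypeScript version is only a model: the field names and the flap threshold are hypothetical, and it assumes the escalation offsets are measured from the moment the outage is confirmed.

```typescript
// Hypothetical model of the escalation logic; not a port of edge-watcher.sh.
interface AlertState {
  consecutiveFailures: number;
  downSinceMs: number | null; // set once the anti-flap threshold is crossed
  alertsSent: number;         // down alerts already sent for this outage
}

type AlertAction = "none" | "down-alert" | "recovery";

interface StepResult {
  state: AlertState;
  action: AlertAction;
}

const HOUR_MS = 60 * 60 * 1000;
// Assumed reading of "immediate/+1h/+4h/+6h": offsets from confirmed-down time.
const ESCALATION_OFFSETS_MS = [0, 1 * HOUR_MS, 4 * HOUR_MS, 6 * HOUR_MS];

function step(prev: AlertState, probeOk: boolean, nowMs: number, flapThreshold: number): StepResult {
  if (probeOk) {
    const wasDown = prev.downSinceMs !== null;
    return {
      state: { consecutiveFailures: 0, downSinceMs: null, alertsSent: 0 },
      action: wasDown ? "recovery" : "none",
    };
  }
  const failures = prev.consecutiveFailures + 1;
  let downSinceMs = prev.downSinceMs;
  if (downSinceMs === null && failures >= flapThreshold) downSinceMs = nowMs;
  if (downSinceMs === null) {
    // Below the anti-flap threshold: count the failure, stay quiet.
    return { state: { consecutiveFailures: failures, downSinceMs: null, alertsSent: 0 }, action: "none" };
  }
  const nextOffset = ESCALATION_OFFSETS_MS[prev.alertsSent];
  const due = nextOffset !== undefined && nowMs - downSinceMs >= nextOffset;
  return {
    state: {
      consecutiveFailures: failures,
      downSinceMs: downSinceMs,
      alertsSent: due ? prev.alertsSent + 1 : prev.alertsSent,
    },
    action: due ? "down-alert" : "none",
  };
}
```

Because the state is a plain object, it can be serialized between runs, which is what lets a minute-oneshot timer keep its escalation position across restarts.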

Phase 1b — DONE, deployed live 2026-06-22 (gitSha 74017f18)

Verified live: /edge/status.json serves the oracle JSON; the deployed SPA bundle fetches it; /api/i18n is now edge-cached (G6); pseo_cache inactive=24h (G5). e2e smoke gate passed. This delivers the runtime form kill-switch (the original ask). All gating is fail-open: if the oracle is missing, stale, or unreachable, every form stays enabled; the oracle can only ever disable.

  • provider-website/frontend-public/src/context/EdgeStatusContext.tsx — polls /edge/status.json (60s + on focus), useFormGate(form), stale-guard (5 min) → fail-open. Tested (5 specs).
  • .../components/shared/FormUnavailableNotice.tsx — SMS-fallback shown in place of a down form.
  • Wired EdgeStatusProvider into App.tsx; gated all five forms (Contact/ContactModal, Touring, Booking, Roster, ShopSignup/newsletter).
  • nginx location = /edge/status.json in prod.conf (alias → watcher state dir; no-store; island-safe). status.json verified nginx-readable (644).
  • Verified: tsc --noEmit clean; 40 existing form tests + 5 new gate tests pass.
  • Go-live: ./run deploy:quinn (CI from origin/main, e2e smoke gate) — requires commit+push first. Until deployed, the SPA /edge/status.json fetch 404s → fail-open (no behaviour change).
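
The fail-open rule and the 5-minute stale guard reduce to a small decision function. The sketch below is an assumption-laden illustration rather than the contents of EdgeStatusContext.tsx: the status.json schema is not shown in this doc, so the EdgeStatus shape and its field names are hypothetical.

```typescript
// Hypothetical shape of the oracle document; every field name is an assumption.
interface EdgeStatus {
  generatedAtMs: number;
  forms: Record<string, { enabled: boolean }>;
}

const STALE_AFTER_MS = 5 * 60 * 1000; // stale guard (5 min)

// Fail-open: only a fresh oracle that explicitly reports enabled=false may
// disable a form. Every other case leaves the form enabled.
function isFormEnabled(status: EdgeStatus | null, form: string, nowMs: number): boolean {
  if (status === null) return true;                               // missing or unreachable (e.g. 404)
  if (nowMs - status.generatedAtMs > STALE_AFTER_MS) return true; // stale oracle is ignored
  const entry = status.forms[form];
  if (entry === undefined) return true;                           // form not listed
  return entry.enabled;
}
```

A caller would pass `null` when the fetch fails, so a broken or absent oracle can never take a working form offline.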

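Not done / still open (per the §5 register)

  • G1–G3 (failover to the local twin, replication, split-brain): open; G1 is blocked on G2, and G3 depends on the §6 decision.
  • Phase 2 follow-ups: G10, G12, and G13 are listed as build rules for the outbox; G11 (notify delay) awaits a decision; G15 is listed as to build; G16 is listed as to verify; 2d (client Idempotency-Key) is optional and not done.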

Verification method

Read-only ssh to vps-0/black, live nginx + pgBouncer config reads, HTTP /health + /www/* compares, DB row-count/freshness queries, and the watcher's own dry-run/NO_MAIL self-tests.