feat(quinn.www/edge): public-edge health watcher + island-mode design

Add edge-watcher.sh (vps-0 oneshot: probes every backend the public site needs,
writes a per-form status oracle for SPA island-mode, emails UP→DOWN /
escalation / recovery / weekly-heartbeat with anti-flap), its systemd
oneshot+minute timer, and an idempotent deploy-edge-watcher.sh installer.
Document the verified 2026-06-21 topology + kill-switch/outbox design in
EDGE_ISLAND_MODE.md and update FORMS_AUDIT.md (forms now routed; no runtime
auto-disable yet).
Natalie 2026-06-21 22:11:03 -05:00
parent 00b6329e4e
commit 934bbc3eaf
6 changed files with 523 additions and 0 deletions


@ -0,0 +1,50 @@
#!/usr/bin/env bash
#
# deploy-edge-watcher.sh — install/update the public-edge health watcher on vps-0.
#
# Idempotent. Ships edge-watcher.sh to /opt/quinn-edge-watcher, installs the
# systemd oneshot + minute timer, seeds /etc/quinn-edge-watcher/watcher.env on
# first run (never clobbers an existing one), validates with a --dry-run, then
# enables the timer.
#
# vps-0 runs deploys as root (no sudo). See docs/EDGE_ISLAND_MODE.md.
#
# Usage:
# ./deploy-edge-watcher.sh # deploy + enable timer
# ./deploy-edge-watcher.sh --verify # ship + dry-run only; do NOT enable timer / send mail
set -euo pipefail
REMOTE="${EDGE_WATCHER_REMOTE:-quinn-vps}"
SRC_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)"
VERIFY_ONLY=0
[[ "${1:-}" == "--verify" ]] && VERIFY_ONLY=1
echo "==> [1/4] Shipping watcher to ${REMOTE}:/opt/quinn-edge-watcher"
ssh "$REMOTE" 'mkdir -p /opt/quinn-edge-watcher/state /etc/quinn-edge-watcher'
scp -q "$SRC_DIR/edge-watcher.sh" "$REMOTE:/opt/quinn-edge-watcher/edge-watcher.sh"
ssh "$REMOTE" 'chmod +x /opt/quinn-edge-watcher/edge-watcher.sh'
echo "==> [2/4] Seeding watcher.env (only if absent)"
ssh "$REMOTE" 'test -f /etc/quinn-edge-watcher/watcher.env || cat > /etc/quinn-edge-watcher/watcher.env <<ENV
# quinn-edge-watcher config — override defaults here.
EDGE_WATCHER_ALERT_TO=transquinnftw@pm.me
EDGE_WATCHER_ALERT_FROM=noreply@transquinnftw.com
EDGE_WATCHER_SMTP=127.0.0.1:25
ENV'
echo "==> [3/4] Validating with --dry-run (no email, no state writes)"
ssh "$REMOTE" 'set -a; . /etc/quinn-edge-watcher/watcher.env 2>/dev/null; set +a; /opt/quinn-edge-watcher/edge-watcher.sh --dry-run'
if [[ "$VERIFY_ONLY" == 1 ]]; then
echo "==> [verify] units NOT installed, timer NOT enabled. Re-run without --verify to go live."
exit 0
fi
echo "==> [4/4] Installing systemd units + enabling minute timer"
scp -q "$SRC_DIR/quinn-edge-watcher.service" "$REMOTE:/etc/systemd/system/quinn-edge-watcher.service"
scp -q "$SRC_DIR/quinn-edge-watcher.timer" "$REMOTE:/etc/systemd/system/quinn-edge-watcher.timer"
ssh "$REMOTE" 'systemctl daemon-reload && systemctl enable --now quinn-edge-watcher.timer && systemctl list-timers quinn-edge-watcher.timer --no-pager'
echo "==> Done. First run sends an ACTIVE notice. Tail with:"
echo " ssh ${REMOTE} 'journalctl -u quinn-edge-watcher.service -n 30 --no-pager'"


@ -0,0 +1,261 @@
#!/usr/bin/env bash
#
# edge-watcher.sh — vps-0 public-edge health watcher.
#
# Probes every backend the public site depends on, writes a per-form status JSON
# (the island-mode kill-switch oracle that the SPA reads), and emails alerts:
# - IMMEDIATE on a confirmed UP->DOWN transition (anti-flap: N consecutive fails)
# - ESCALATION reminders at +1h / +4h / +6h while a target stays down
# - RECOVERY on DOWN->UP
# - WEEKLY "watcher active" heartbeat while everything is healthy
#
# Runs as a systemd oneshot fired by quinn-edge-watcher.timer (every minute).
# Email goes through the local DMS relay (127.0.0.1:25, permit_mynetworks) so the
# alert path does NOT depend on black — essential, since black being down is the
# very thing it alerts on.
#
# Decision-independent: read-only probes + email only. Touches no data path, so it
# is safe regardless of the canonical-DB question (see docs/EDGE_ISLAND_MODE.md §6).
#
# Usage:
# edge-watcher.sh # one cycle: probe, persist state, send due alerts
# edge-watcher.sh --dry-run # probe + print status JSON + would-be alerts; no email, no state writes
# NB: errexit (set -e) is deliberately OFF. A monitor must not abort mid-cycle when
# a probe fails — `curl` exits non-zero on connection-refused, which under set -e
# would kill the very run that needs to raise the alarm. Failures are handled
# explicitly instead. nounset + pipefail stay on.
set -uo pipefail
# ---------------------------------------------------------------------------
# Config (override via /etc/quinn-edge-watcher/watcher.env)
# ---------------------------------------------------------------------------
WATCHER_DIR="${EDGE_WATCHER_DIR:-/opt/quinn-edge-watcher}"
STATE_DIR="${EDGE_WATCHER_STATE_DIR:-${WATCHER_DIR}/state}"
STATUS_JSON="${EDGE_WATCHER_STATUS_JSON:-${STATE_DIR}/status.json}"
ALERT_TO="${EDGE_WATCHER_ALERT_TO:-transquinnftw@pm.me}"
ALERT_FROM="${EDGE_WATCHER_ALERT_FROM:-noreply@transquinnftw.com}"
SMTP_SERVER="${EDGE_WATCHER_SMTP:-127.0.0.1:25}"
PROBE_TIMEOUT="${EDGE_WATCHER_TIMEOUT:-3}"
FAIL_THRESHOLD="${EDGE_WATCHER_FAIL_THRESHOLD:-2}" # consecutive fails before DOWN (anti-flap)
HEARTBEAT_SECONDS="${EDGE_WATCHER_HEARTBEAT_SECONDS:-604800}" # 7 days
HOSTLABEL="${EDGE_WATCHER_HOSTLABEL:-vps-0 (transquinnftw.com edge)}"
# Escalation reminder offsets (seconds after down_since). 0 == immediate.
ESCALATIONS=(0 3600 14400 21600) # immediate, +1h, +4h, +6h
DRY_RUN=0
[[ "${1:-}" == "--dry-run" ]] && DRY_RUN=1
# NO_MAIL: run live (persist state) but suppress actual sends — for ops verification.
NO_MAIL="${EDGE_WATCHER_NO_MAIL:-0}"
# ---------------------------------------------------------------------------
# Targets: name|probe_url|forms_disabled_when_down (csv)
# A target is UP when it answers HTTP < 500; DOWN on connection failure/timeout
# (http_code 000) or a 5xx. Photos origin legitimately 404s at root => UP.
# ---------------------------------------------------------------------------
TARGETS=(
"black_api|http://10.0.0.11:3023/health|contact,touring,waitlist"
"black_data_api|http://10.0.0.11:3022/health|"
"local_my_api|http://127.0.0.1:3024/health|booking,roster"
"local_newsletter|http://127.0.0.1:3026/health|newsletter"
"black_photos|http://10.0.0.11:8081/|"
)
# Surfaces reported in status.json (form/read -> the target it depends on).
declare -A FORM_DEP=(
[contact]=black_api [touring]=black_api [waitlist]=black_api
[booking]=local_my_api [roster]=local_my_api
[newsletter]=local_newsletter
[read_www]=black_api [read_provider_data]=black_data_api [read_photos]=black_photos
)
now() { date +%s; }
iso() { date -u +%Y-%m-%dT%H:%M:%SZ; }
log() { printf '[edge-watcher] %s\n' "$*" >&2; }
# ---------------------------------------------------------------------------
# Email
# ---------------------------------------------------------------------------
send_email() {
local subject="$1" body="$2"
if [[ "$DRY_RUN" == 1 ]]; then
printf '\n--- WOULD SEND EMAIL ---\nTo: %s\nSubject: %s\n\n%s\n------------------------\n' \
"$ALERT_TO" "$subject" "$body" >&2
return 0
fi
if [[ "$NO_MAIL" == 1 ]]; then
log "[no-mail] suppressed send: $subject"
return 0
fi
if ! command -v swaks >/dev/null 2>&1; then
log "swaks not installed; cannot send: $subject"
return 1
fi
swaks --server "$SMTP_SERVER" --from "$ALERT_FROM" --to "$ALERT_TO" \
--header "Subject: $subject" --body "$body" --silent 2>/dev/null \
|| log "swaks send failed: $subject"
}
# ---------------------------------------------------------------------------
# Per-target state file helpers (key=value lines)
# ---------------------------------------------------------------------------
state_file() { printf '%s/%s.state' "$STATE_DIR" "$1"; }
state_get() { # name key default
local f; f="$(state_file "$1")"
[[ -f "$f" ]] || { printf '%s' "$3"; return; }
local v; v="$(grep -m1 "^$2=" "$f" 2>/dev/null | cut -d= -f2-)"
[[ -n "$v" ]] && printf '%s' "$v" || printf '%s' "$3"
}
state_write() { # name k=v k=v ...
[[ "$DRY_RUN" == 1 ]] && return 0
local name="$1"; shift
local f tmp; f="$(state_file "$name")"; tmp="${f}.tmp.$$"
printf '%s\n' "$@" > "$tmp"
mv -f "$tmp" "$f"
}
# ---------------------------------------------------------------------------
# Probe one target -> sets globals: P_CODE P_LATENCY P_UP
# ---------------------------------------------------------------------------
probe() {
local url="$1" t0 t1 n
t0="$(date +%s%3N)"
# curl -w prints the http_code even on failure ("000"), so do NOT add `|| echo`
# — that double-appends and corrupts the value.
P_CODE="$(curl -s -o /dev/null -w '%{http_code}' --max-time "$PROBE_TIMEOUT" "$url" 2>/dev/null)"
t1="$(date +%s%3N)"
P_LATENCY=$(( t1 - t0 ))
[[ "$P_CODE" =~ ^[0-9]+$ ]] || P_CODE=000
n=$(( 10#$P_CODE )) # 10# guards against octal parsing of leading-zero codes
# UP = process answered with a non-server-error status (100..499). Photos origin
# legitimately 404s at root. DOWN = no connection (000) or 5xx.
if (( n >= 100 && n < 500 )); then P_UP=1; else P_UP=0; fi
}
# ---------------------------------------------------------------------------
# Main cycle
# ---------------------------------------------------------------------------
# In dry-run, redirect all state to a throwaway dir BEFORE any mkdir, so a bare
# verification run writes nothing under /opt.
[[ "$DRY_RUN" == 1 ]] && { STATE_DIR="$(mktemp -d)"; STATUS_JSON="${STATE_DIR}/status.json"; }
mkdir -p "$STATE_DIR"
NOW="$(now)"
declare -A TARGET_UP=() # name -> 0/1
json_targets=""
for spec in "${TARGETS[@]}"; do
IFS='|' read -r name url _forms <<< "$spec"
probe "$url"
TARGET_UP["$name"]="$P_UP"
prev_status="$(state_get "$name" status up)"
fails="$(state_get "$name" consecutive_fails 0)"
down_since="$(state_get "$name" down_since 0)"
alerts_sent="$(state_get "$name" alerts_sent '')"
if [[ "$P_UP" == 1 ]]; then
if [[ "$prev_status" == down ]]; then
local_dur=$(( NOW - down_since ))
send_email "[edge-watcher] RECOVERED: ${name}" \
"Target ${name} on ${HOSTLABEL} is back UP (HTTP ${P_CODE}).
Was down for $(( local_dur / 60 )) min.
Probe: ${url}
Time: $(iso)"
fi
state_write "$name" "status=up" "consecutive_fails=0" "down_since=0" "alerts_sent="
reason=""
else
fails=$(( fails + 1 ))
if [[ "$prev_status" == up ]]; then
if (( fails >= FAIL_THRESHOLD )); then
# Confirmed transition UP -> DOWN
send_email "[edge-watcher] DOWN: ${name}" \
"Target ${name} on ${HOSTLABEL} is DOWN (HTTP ${P_CODE}) after ${fails} consecutive failed probes.
Disables forms: ${_forms:-<reads only>}
Probe: ${url}
Time: $(iso)
Escalation reminders will follow at +1h / +4h / +6h if it stays down."
state_write "$name" "status=down" "consecutive_fails=${fails}" "down_since=${NOW}" "alerts_sent=0"
else
# Flap guard: not yet confirmed down, do not alert
state_write "$name" "status=up" "consecutive_fails=${fails}" "down_since=0" "alerts_sent="
fi
else
# Already down: send any due escalation reminders
local_elapsed=$(( NOW - down_since ))
for off in "${ESCALATIONS[@]}"; do
[[ "$off" == 0 ]] && continue
if (( local_elapsed >= off )) && [[ ",${alerts_sent}," != *",${off},"* ]]; then
send_email "[edge-watcher] STILL DOWN (+$(( off / 3600 ))h): ${name}" \
"Target ${name} on ${HOSTLABEL} has been DOWN for $(( local_elapsed / 60 )) min.
Disables forms: ${_forms:-<reads only>}
Probe: ${url} (HTTP ${P_CODE})
Time: $(iso)"
alerts_sent="${alerts_sent},${off}"
fi
done
state_write "$name" "status=down" "consecutive_fails=${fails}" "down_since=${down_since}" "alerts_sent=${alerts_sent}"
fi
reason="backend_unreachable"
fi
json_targets+=$(printf '{"name":"%s","up":%s,"httpCode":"%s","latencyMs":%s},' \
"$name" "$([[ "$P_UP" == 1 ]] && echo true || echo false)" "$P_CODE" "$P_LATENCY")
done
# ---------------------------------------------------------------------------
# Derive per-form status + write status.json (atomic)
# ---------------------------------------------------------------------------
json_forms=""
for form in "${!FORM_DEP[@]}"; do
dep="${FORM_DEP[$form]}"
up="${TARGET_UP[$dep]:-0}"
json_forms+=$(printf '"%s":{"enabled":%s,"dependsOn":"%s"},' \
"$form" "$([[ "$up" == 1 ]] && echo true || echo false)" "$dep")
done
black_reachable=false
{ [[ "${TARGET_UP[black_api]:-0}" == 1 ]] || [[ "${TARGET_UP[black_data_api]:-0}" == 1 ]]; } && black_reachable=true
tmp="${STATUS_JSON}.tmp.$$"
printf '{"ts":"%s","host":"%s","blackReachable":%s,"targets":[%s],"forms":{%s}}\n' \
"$(iso)" "$HOSTLABEL" "$black_reachable" "${json_targets%,}" "${json_forms%,}" > "$tmp"
mv -f "$tmp" "$STATUS_JSON"
if [[ "$DRY_RUN" == 1 ]]; then
echo "=== status.json ==="; cat "$STATUS_JSON"; echo
rm -rf "$STATE_DIR"
exit 0
fi
# ---------------------------------------------------------------------------
# Weekly heartbeat (only while fully healthy) + first-run activation notice
# ---------------------------------------------------------------------------
HB_FILE="${STATE_DIR}/heartbeat.last"
all_up=1
for spec in "${TARGETS[@]}"; do IFS='|' read -r name _ _ <<< "$spec"; [[ "${TARGET_UP[$name]}" == 1 ]] || all_up=0; done
if [[ ! -f "$HB_FILE" ]]; then
send_email "[edge-watcher] ACTIVE: monitoring started on ${HOSTLABEL}" \
"edge-watcher is now running on ${HOSTLABEL}.
Probing: black_api, black_data_api, local_my_api, local_newsletter, black_photos every minute.
Weekly active heartbeats + immediate/1h/4h/6h down alerts enabled.
Time: $(iso)"
echo "$NOW" > "$HB_FILE"
elif (( all_up == 1 )); then
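last_hb="$(cat "$HB_FILE" 2>/dev/null)"
# An empty or corrupt heartbeat file would break the arithmetic below; treat it as "never sent".
[[ "$last_hb" =~ ^[0-9]+$ ]] || last_hb=0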
if (( NOW - last_hb >= HEARTBEAT_SECONDS )); then
send_email "[edge-watcher] weekly heartbeat — all healthy on ${HOSTLABEL}" \
"All edge backends healthy. Weekly active heartbeat.
$(cat "$STATUS_JSON")
Time: $(iso)"
echo "$NOW" > "$HB_FILE"
fi
fi
exit 0


@ -0,0 +1,18 @@
[Unit]
Description=Quinn public-edge health watcher (probes backends, writes status oracle, emails alerts)
Documentation=https://github.com/lilith/lilith-platform.live/blob/main/docs/EDGE_ISLAND_MODE.md
After=network-online.target
Wants=network-online.target
[Service]
Type=oneshot
EnvironmentFile=-/etc/quinn-edge-watcher/watcher.env
ExecStart=/opt/quinn-edge-watcher/edge-watcher.sh
Nice=10
# Hardening: the watcher only needs to read configs, curl backends, write its
# own state dir, and invoke the local mail relay.
ProtectSystem=strict
ProtectHome=true
ReadWritePaths=/opt/quinn-edge-watcher/state
NoNewPrivileges=true
PrivateTmp=true


@ -0,0 +1,12 @@
[Unit]
Description=Run quinn-edge-watcher every minute
Documentation=https://github.com/lilith/lilith-platform.live/blob/main/docs/EDGE_ISLAND_MODE.md
[Timer]
OnBootSec=30s
OnUnitActiveSec=60s
AccuracySec=10s
Unit=quinn-edge-watcher.service
[Install]
WantedBy=timers.target

docs/EDGE_ISLAND_MODE.md

@ -0,0 +1,174 @@
# Edge Resilience & Island Mode — Verified Topology + Design
**Status:** Investigation + design, plus the Phase 1a edge watcher deployed on 2026-06-21 (§8). **No data-path changes made**; the topology findings come from read-only probes only.
**Verified:** 2026-06-21, via read-only `ssh quinn-vps` / `ssh black`, live nginx config, live `quinn-upstreams.conf`, live HTTP + DB probes.
**Current-state facts here supersede:** the SQLite-era inventory in [`PROD_DB_UNIFICATION_PLAN.md`](PROD_DB_UNIFICATION_PLAN.md) (the platform has since moved to PostgreSQL) and the "dead forms" verdict in [`FORMS_AUDIT.md`](FORMS_AUDIT.md) (2026-06-03 — the edge `location` blocks have since been added; forms are now routed).
**Direction docs (target, not current state):** [`migration-vps-to-black.md`](migration-vps-to-black.md), [`PROD_DB_UNIFICATION_PLAN.md`](PROD_DB_UNIFICATION_PLAN.md).
---
## 1. Why this doc exists
The originating ask: **public contact forms should automatically disable themselves when their backend is unreachable**, and **vps-0 should be able to run in "island mode" without black** (keep serving what it can when black or the WireGuard link drops). Investigating that surfaced a live topology that **diverges from the documented target**, plus a data-integrity issue independent of island mode. This doc records:
1. The **verified current topology** (2026-06-21).
2. The **island-mode / runtime-kill-switch design**.
3. The **consolidated gap register**.
4. The **one open decision** that blocks the design.
---
## 2. Verified current topology (2026-06-21)
### 2.1 Hosts
| Host | Role | Reachability |
|---|---|---|
| **vps-0** (`89.127.233.145`, WG `10.9.0.1`) | Public edge: nginx + static SPA + edge cache **+ a near-complete local backend stack incl. its own Postgres** | Public internet |
| **black** (`10.0.0.11` over WireGuard) | Canonical for the public read surface + contact/touring writes | LAN/WG only |
> Reality check: vps-0 is **not** the "pure edge" the migration target describes. It runs `quinn-api`, `quinn-admin-api`, `quinn-data-api`, `quinn-my-api`, `quinn-sso-api`, `quinn-newsletter-api`, `quinn-m-backend-user`, plus `postgresql@17-quinn` (local `:5435`) and pgBouncer (`:6432`).
### 2.2 Edge routing — nginx on vps-0 ([`prod.conf`](../deployments/@domains/quinn.www/nginx/prod.conf))
| Public path | Upstream | Resolves to | Cached? |
|---|---|---|---|
| `/www/*` (destinations, tour, blog, regions) | `black_api` | **black:3023** | `pseo_cache` 60m, serve-stale-on-error |
| `/sitemap.xml` | `black_api` | black:3023 | `pseo_cache` 60m |
| `/api/i18n/*` | `black_api` | black:3023 | **none** ⚠ |
| `/provider-api/*` (ProviderData JSON) | `black_data_api` | **black:3022** | `data_cache` 30m, serve-stale |
| `/photos/*` | `black_photos` | **black:8081** | `photos_cache` 7d, serve-stale |
| `/public/*` (contact, touring; **write**) | `black_api` | **black:3023** | none |
| `/waitlist` (**write**) | `black_api` | black:3023 | none |
| `/api/bookings` (**write**) | `black_my_api` | **127.0.0.1:3024 (LOCAL)** | none |
| `/public/roster/*` (**write**) | `black_my_api` | **127.0.0.1:3024 (LOCAL)** | none |
| `/newsletter/*` (**write**) | `black_newsletter` | **127.0.0.1:3026 (LOCAL)** | none |
### 2.3 Live upstreams (`/etc/nginx/conf.d/quinn-upstreams.conf` — VPS-owned, not in repo)
The `black_` prefix is **historical and misleading** — two upstreams are local to vps-0:
```
black_api → 10.0.0.11:3023 (black, WG)
black_data_api → 10.0.0.11:3022 (black, WG)
black_my_api → 127.0.0.1:3024 (LOCAL vps-0)
black_newsletter → 127.0.0.1:3026 (LOCAL vps-0)
black_photos → 10.0.0.11:8081 (black, WG)
```
> The repo's [`README-vps-owned.md`](../deployments/@domains/quinn.www/nginx/README-vps-owned.md) was **stale** (documented `black_api` as `:3030` and my-api/newsletter as `10.0.0.11`) — corrected 2026-06-21 to match live, since re-applying the stale values would mis-route production.
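
To make the misleading prefix concrete, the following is a minimal sketch that labels each upstream by where its address lives rather than by its name. `upstream_location` is a hypothetical helper, not part of the repo; the name/address pairs are copied from the live mapping listed in this section.

```bash
#!/usr/bin/env bash
# Sketch only: classify an upstream address as local to vps-0 or on black.
# upstream_location is a hypothetical helper name.
upstream_location() {
  case "$1" in
    127.*|localhost:*) echo "local vps-0" ;;
    10.0.0.11:*)       echo "black (WG)" ;;
    *)                 echo "unknown" ;;
  esac
}

# Name/address pairs taken from the live mapping in this section.
while read -r name addr; do
  printf '%-17s %-16s %s\n' "$name" "$addr" "$(upstream_location "$addr")"
done <<'MAP'
black_api        10.0.0.11:3023
black_data_api   10.0.0.11:3022
black_my_api     127.0.0.1:3024
black_newsletter 127.0.0.1:3026
black_photos     10.0.0.11:8081
MAP
```

A check of this kind could run before re-applying any upstream file, so that a `black_`-named upstream pointing at `127.0.0.1` is visible rather than assumed.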
### 2.4 Backends + databases — **split-brain**
| Surface | Backend | Database | Canonical host |
|---|---|---|---|
| Public reads (`/www`, `/provider-api`) | black `:3023` / `:3022` | `black:25435/quinn(_admin)` | **black** |
| contact / touring / waitlist (write) | black `:3023` | `black:25435/quinn` | **black** |
| booking / roster (write) | vps-0 local `:3024` | `vps-0:5435/quinn` (via pgBouncer `:6432`) | **vps-0** |
| newsletter (write) | vps-0 local `:3026` | `vps-0:5435/quinn` | **vps-0** |
**Writes are partitioned across two canonical Postgres instances by form.** Booking data exists *only* on vps-0; contact data *only* on black. Nothing reconciles them.
### 2.5 The local stack is NOT a replica (HTTP compare, vps-0 local `:3023` vs black `:3023`)
| Endpoint | LOCAL vps-0 | BLACK | Verdict |
|---|---|---|---|
| `/health` | build 257 / `0.1.149`, `mode:internal`, 2026-06-21 | older build (`{"ok":true}` only) | vps-0 is **newer** |
| `/www/destinations` | 79 items | **82** | DIFFER |
| `/www/provider-config` | 95 items | **98** | DIFFER |
| `/www/tour-stops` | — | — | DIFFER |
vps-0's local `quinn` DB is **populated but drifted** (destinations max `2026-05-18`), with **no replication feed** from black. The public site reads from black (82 destinations); vps-0's copy (79) is behind and **not in the public read path**. This reads like a **stalled cutover**: vps-0 looks half-prepped to become primary (newer build, full local DB) but the public path was never switched to it.
### 2.6 Edge cache durability
| Zone | inactive (eviction) | Island value |
|---|---|---|
| `pseo_cache` (`/www`) | **1h** | Weak — cold pages evict within 1h of an outage → 502 |
| `data_cache` (`/provider-api`) | 1d | Good |
| `photos_cache` (`/photos`) | 30d | Strong |
---
## 3. The exposure (what breaks when black / WG drops)
- **Hard-fail:** contact, touring, waitlist (POST → black:3023). No pre-warning — the SPA fetches no runtime config, so it posts into a 502 blind.
- **Degrades to stale, then fails:** `/www` reads survive only while cached and only ~1h for cold pages (`pseo_cache inactive=1h`); `/provider-api` survives ~1d; `/api/i18n` is **uncached** and is **fetched at runtime** ([provider `App.tsx:91`](../codebase/@features/provider-website/frontend-public/src/App.tsx), [landing `App.tsx:241`](../codebase/@features/landing/frontend-public/src/App.tsx)) → translations hard-fail.
- **Stays fully alive (local on vps-0):** booking, roster, newsletter.
Before the Phase 1a edge watcher (§8), vps-0 had **no watcher** for the API/forms surface (a separate gallery monitor exists for photos only).
---
## 4. Island-mode design (proposed — not built)
In-process `edge-health` module in a vps-0-local `quinn.api` PUBLIC instance (placement decided), with manual override. Maps onto existing seams: [`public-proxy.ts`](../codebase/@features/api/src/app/middleware/public-proxy.ts) (`isLocallyServable` / `publicModeGate`) and the probe pattern in [`system-status.ts`](../codebase/@features/api/src/surfaces/admin/system-status.ts).
1. **Runtime kill switch.** Background prober + circuit breaker per form (fed actively by probes, passively by proxy failures). `GET /edge/status` served **locally** (island-safe) returns the per-form enabled/disabled map. Frontend `FormGateProvider` fetches it on load + focus; forms render a "reach me by SMS" fallback instead of posting into a 502.
2. **Store-and-forward outbox** (for the black-dependent writes only: contact, touring, waitlist). Edge accepts the POST, persists to a durable local spool, returns `200`, and a background forwarder replays to black on recovery. Requires: idempotency key + black-side dedupe (`ON CONFLICT DO NOTHING`); encrypted/short-lived spool (PII on a public host); throttled replay (respect black + vps-0 fail2ban).
3. **Watcher + alerting.** Weekly "active" heartbeat + immediate failure alert with **1h / 4h / 6h backoff**, escalation state persisted across restarts, anti-flap (reuse gallery-monitor pattern), sent via vps-0 local DMS (`swaks --server 127.0.0.1:25`, black-independent). Built and deployed as Phase 1a (§8).
4. **Backend fail-fast.** When a breaker is open, short-circuit the write with a fast structured `503` instead of hanging on a dead TCP connect.
5. **Never a new SPOF.** nginx keeps black as primary; the edge service is failover/accept-on-error only; under systemd `Restart=always`.
**What stays alive in island mode:** booking, roster, newsletter (already local); cached `/www` + `/provider-api` reads (stale); contact/touring **accepted to the outbox** for later replay. Disabled/degraded: cold `/www` pages, runtime i18n, live contact/touring delivery.
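
The 1h / 4h / 6h backoff in item 3 comes down to remembering which reminder offsets have already fired for the current outage. A minimal sketch of that bookkeeping, assuming a comma-separated "already sent" list persisted per target; `due_escalations` is a hypothetical helper name used here for illustration only:

```bash
#!/usr/bin/env bash
# Sketch only: decide which escalation reminders are due for one target.
# due_escalations is a hypothetical helper name.
ESCALATIONS=(3600 14400 21600)   # +1h, +4h, +6h after the target went down

# Args: seconds elapsed since down_since, comma-separated offsets already sent.
# Prints each offset that is due (elapsed >= offset) and not yet recorded.
due_escalations() {
  local elapsed="$1" sent="$2" off
  for off in "${ESCALATIONS[@]}"; do
    if (( elapsed >= off )) && [[ ",${sent}," != *",${off},"* ]]; then
      printf '%s\n' "$off"
    fi
  done
}

due_escalations 3700 ""              # prints 3600
due_escalations 15000 "3600"         # prints 14400
due_escalations 15000 "3600,14400"   # prints nothing
```

Because the sent list lives in a state file rather than in memory, a restart between probes neither repeats nor loses a reminder. If the watcher itself was stopped for several hours, every overdue offset fires on the first run after it comes back.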
---
## 5. Gap register
| # | Gap | Handling | Status |
|---|---|---|---|
| G1 | Public read/contact upstreams point only at black; no failover to the local twin | nginx upstream failover **— blocked: local DB is not a replica (G2)** | blocked |
| G2 | Local `:5435/quinn` is **not** a live replica of black (drifted, no feed) | Establish real replication before any failover-to-local | **verified NO** |
| G3 | **Split-brain writes** across two canonical DBs (contact→black, booking/roster/newsletter→vps-0) | Unify canonical DB (see §6); latent data-integrity issue independent of island mode | **verified** |
| G4 | `README-vps-owned.md` upstream ports stale → re-applying mis-routes prod | Reconciled to live mapping | **done 2026-06-21** |
| G5 | `pseo_cache inactive=1h` → cold `/www` evicts within 1h of outage | Raise to 24h+ in VPS-owned `quinn-maps.conf` | ops change |
| G6 | `/api/i18n/` uncached **and** fetched at runtime → translations hard-fail on black down | Add `proxy_cache` (long stale); confirm `@lilith/i18n` build-time fallbacks | open |
| G7 | No runtime form gating; SPA posts into 502 blind | `/edge/status` oracle (**watcher now produces it**) + serve it via nginx + `FormGateProvider` | **oracle done**; serving + frontend pending |
| G8 | Black-dependent writes (contact/touring/waitlist) hard-fail on outage | Store-and-forward outbox | needs rollout |
| G9 | `contact_submissions` has **no** unique/idempotency constraint → replay duplicates | Add `idempotency_key` + unique index + `ON CONFLICT DO NOTHING` | black migration |
| G10 | PII at rest on public host (outbox spool) | Encrypt at rest / short-lived / never log bodies | build rule |
| G11 | Provider SMTP notify delayed until replay | Accept delay, or local-DMS notify on accept | decision |
| G12 | Edge service could become a new SPOF | black stays primary; edge failover-only; `Restart=always` | build rule |
| G13 | Outbox unbounded growth + recovery thundering herd + vps-0 fail2ban on POST bursts | Cap spool + alert on depth/age; throttle replay (~≤30/min) | build rule |
| G14 | Heartbeat/alert robustness (1h/4h/6h escalation must survive restarts; anti-flap) | Persist state to file; systemd timer; local DMS | **DONE 2026-06-21 — deployed** |
| G15 | Local write-services can crash independently | Watcher probes `:3024`/`:3026` too — never assume "local = up" | build |
| G16 | Idempotency migration safety on existing contact/touring/waitlist inserts | Backfill-safe migration; verify before deploy | verify |
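
For G7, a consumer of the oracle only needs the per-form `enabled` flag. The sketch below reads it without `jq`, assuming the single-line layout the watcher's `printf` produces (`"<form>":{"enabled":true|false,...}`). `form_enabled` is a hypothetical helper and the sample file is illustrative data, not a captured status.

```bash
#!/usr/bin/env bash
# Sketch only: ask the status oracle whether a form is enabled.
# form_enabled is a hypothetical helper; it relies on the compact,
# single-line JSON layout written by the watcher.
form_enabled() {
  local file="$1" form="$2"
  grep -q "\"${form}\":{\"enabled\":true" "$file"
}

# Illustrative sample, not real output.
sample="$(mktemp)"
printf '%s\n' '{"forms":{"contact":{"enabled":false,"dependsOn":"black_api"},"booking":{"enabled":true,"dependsOn":"local_my_api"}}}' > "$sample"

for form in contact booking; do
  if form_enabled "$sample" "$form"; then
    echo "$form: enabled"
  else
    echo "$form: disabled"
  fi
done
rm -f "$sample"
```

A form that is missing from the file is reported as disabled. Whether failing closed is the right default for the SPA gate is a design choice this sketch does not settle.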
---
## 6. Open decision — which DB is canonical? (blocks the design)
The island-mode architecture depends on resolving the split-brain, and that is **above an agent's authority** — it's an operator decision that also touches [`migration-vps-to-black.md`](migration-vps-to-black.md).
- **If black stays canonical** (the documented target): island mode = **outbox + accept-stale-cache** (G7–G14). The local vps-0 stack/DB is dead weight until replicated, and booking/roster/newsletter writes must be **moved back to black** to undo the split-brain.
- **If vps-0 becomes primary** (what the newer shadow build hints at): **finish the cutover**, replicate vps-0 → black as standby, and move contact/touring writes onto vps-0. Island mode then becomes nearly free.
Either way the split-brain is a **standing data-integrity problem** (booking data lives only on vps-0, contact only on black) that should be resolved regardless of island mode.
---
## 7. Build order
Phases sequenced by risk and by what's blocked on the §6 canonical-DB decision.
- **Phase 1a — Edge watcher + status oracle (DONE, deployed 2026-06-21).** Decision-independent; touches no data path. Probes the five backends every minute, writes the per-form kill-switch JSON, and emails heartbeat + escalating down alerts via local DMS. See §8.
- **Phase 1b — Serve the oracle + frontend gate (next, decision-independent).** Add an nginx `location /edge/status.json` (or have the watcher write into a served path) and a SPA `FormGateProvider` that reads it and disables a form whose `dependsOn` target is down. Ships via the normal `quinn.www` deploy (e2e smoke gate). Also handles G6 (cache `/api/i18n`) and G5 (raise `pseo_cache inactive`).
- **Phase 2 — Store-and-forward outbox (BLOCKED on §6).** Only the black-dependent writes (contact/touring/waitlist). Needs the idempotency migration (G9) and the canonical-DB decision, since where replays land depends on it.
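
The Phase 1a probe rule (UP on an HTTP status of 100..499, DOWN on no connection or a 5xx) can be sketched as a small predicate. `is_up` is a hypothetical name for illustration, not a function in `edge-watcher.sh`:

```bash
#!/usr/bin/env bash
# Sketch only: classify an HTTP status code as up or down.
# is_up is a hypothetical helper name.
is_up() {
  local code="$1" n
  # Anything non-numeric (for example empty output from a failed curl)
  # counts as "no connection".
  [[ "$code" =~ ^[0-9]+$ ]] || code=000
  n=$(( 10#$code ))   # force base 10 so "000" or "099" is not parsed as octal
  (( n >= 100 && n < 500 ))
}

for code in 200 404 000 502 ""; do
  if is_up "$code"; then
    echo "${code:-<empty>} -> up"
  else
    echo "${code:-<empty>} -> down"
  fi
done
```

Treating a 404 as UP is deliberate under this rule: it shows the process answered, which is all the photos origin does at its root path.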
## 8. Implementation status
### Done & live
- **G4**: `README-vps-owned.md` corrected to the live upstream mapping.
- **Phase 1a watcher (G14, + G7 oracle)** — built, verified, **deployed to vps-0 and enabled**:
- `deployments/@domains/quinn.www/scripts/edge-watcher.sh` — probe + per-form status JSON + alert state machine (anti-flap threshold, immediate/+1h/+4h/+6h escalation, recovery, weekly heartbeat).
- `quinn-edge-watcher.service` + `quinn-edge-watcher.timer` (minute oneshot) → `/opt/quinn-edge-watcher` on vps-0.
- `deploy-edge-watcher.sh` (idempotent; `--verify` ships+dry-runs without enabling).
- Status oracle at `/opt/quinn-edge-watcher/state/status.json`; alerts via DMS `127.0.0.1:25` → `transquinnftw@pm.me`.
- **Verified:** healthy + immediate-down + cross-run persistence/flap-guard (dry-run & NO_MAIL); live deploy run `status=0/SUCCESS`; ACTIVE email delivery confirmed in DMS log (`status=sent`, ProtonMail 250 OK).
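
The flap guard exercised in those self-tests can be expressed as a pure state transition, which makes it easy to check without touching state files or sending mail. This is a sketch under that framing; `next_state` and its output format are hypothetical, not the watcher's actual interface:

```bash
#!/usr/bin/env bash
# Sketch only: anti-flap transition as a pure function.
# next_state is a hypothetical helper name.
# Args: previous status (up|down), consecutive fails so far,
#       probe result (1 = up, 0 = down), fail threshold.
# Prints: "<new status> <new consecutive fails> <event>"
#         where event is one of none | down | recovered.
next_state() {
  local prev="$1" fails="$2" probe_up="$3" threshold="$4"
  if (( probe_up == 1 )); then
    if [[ "$prev" == down ]]; then echo "up 0 recovered"; else echo "up 0 none"; fi
    return
  fi
  fails=$(( fails + 1 ))
  if [[ "$prev" == up ]] && (( fails >= threshold )); then
    echo "down $fails down"    # confirmed UP -> DOWN: alert once
  elif [[ "$prev" == up ]]; then
    echo "up $fails none"      # below threshold: still reported up, no alert
  else
    echo "down $fails none"    # already down: reminders are handled separately
  fi
}

next_state up 0 0 2     # prints: up 1 none
next_state up 1 0 2     # prints: down 2 down
next_state down 5 1 2   # prints: up 0 recovered
```

With a threshold of 2 and a one-minute timer, a single failed probe never alerts; a target has to fail two consecutive cycles before the DOWN mail goes out.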
### Not done (parked on rollout + the §6 decision)
- Phase 1b (serve oracle + `FormGateProvider`, G5/G6), Phase 2 (outbox, G8–G13/G16). No SPA code or data-path infra changed yet.
### Verification method
Read-only `ssh` to vps-0/black, live nginx + pgBouncer config reads, HTTP `/health` + `/www/*` compares, DB row-count/freshness queries, and the watcher's own dry-run/NO_MAIL self-tests.


@ -4,6 +4,14 @@
**Trigger:** bookings=0, client_bookings=0, contact_submissions=1 on prod DB (black:25435/quinn).
**Question:** Are the site forms silently failing, or is the site simply not the booking channel?
> **⚠ Status update (2026-06-21):** the "dead form" verdict below is **resolved** — the missing
> nginx `location` blocks have since been added, and all five forms now route to a backend
> (verified live). Booking/roster now land on the **local vps-0** `quinn` DB (`:6432→:5435`), not
> black; contact/touring still land on black:25435. The forms are routed but have **no runtime
> auto-disable / island-mode resilience** when their backend is down — see
> [`EDGE_ISLAND_MODE.md`](EDGE_ISLAND_MODE.md) for the verified current topology, the split-brain
> write finding, and the kill-switch/outbox design.
## Verdict (one line)
**The forms are broken at the edge.** Four of five public forms POST to nginx paths