feat(quinn.www/edge): public-edge health watcher + island-mode design
Add edge-watcher.sh (vps-0 oneshot: probes every backend the public site needs, writes a per-form status oracle for SPA island-mode, emails UP→DOWN / escalation / recovery / weekly-heartbeat with anti-flap), its systemd oneshot+minute timer, and an idempotent deploy-edge-watcher.sh installer. Document the verified 2026-06-21 topology + kill-switch/outbox design in EDGE_ISLAND_MODE.md and update FORMS_AUDIT.md (forms now routed; no runtime auto-disable yet).
parent 00b6329e4e
commit 934bbc3eaf
6 changed files with 523 additions and 0 deletions
50
deployments/@domains/quinn.www/scripts/deploy-edge-watcher.sh
Executable file
@@ -0,0 +1,50 @@
#!/usr/bin/env bash
#
# deploy-edge-watcher.sh — install/update the public-edge health watcher on vps-0.
#
# Idempotent. Ships edge-watcher.sh to /opt/quinn-edge-watcher, installs the
# systemd oneshot + minute timer, seeds /etc/quinn-edge-watcher/watcher.env on
# first run (never clobbers an existing one), validates with a --dry-run, then
# enables the timer.
#
# vps-0 runs deploys as root (no sudo). See docs/EDGE_ISLAND_MODE.md.
#
# Usage:
#   ./deploy-edge-watcher.sh            # deploy + enable timer
#   ./deploy-edge-watcher.sh --verify   # ship + dry-run only; do NOT enable timer / send mail

set -euo pipefail

REMOTE="${EDGE_WATCHER_REMOTE:-quinn-vps}"
SRC_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)"
VERIFY_ONLY=0
[[ "${1:-}" == "--verify" ]] && VERIFY_ONLY=1

echo "==> [1/4] Shipping watcher to ${REMOTE}:/opt/quinn-edge-watcher"
ssh "$REMOTE" 'mkdir -p /opt/quinn-edge-watcher/state /etc/quinn-edge-watcher'
scp -q "$SRC_DIR/edge-watcher.sh" "$REMOTE:/opt/quinn-edge-watcher/edge-watcher.sh"
ssh "$REMOTE" 'chmod +x /opt/quinn-edge-watcher/edge-watcher.sh'

echo "==> [2/4] Seeding watcher.env (only if absent)"
ssh "$REMOTE" 'test -f /etc/quinn-edge-watcher/watcher.env || cat > /etc/quinn-edge-watcher/watcher.env <<ENV
# quinn-edge-watcher config — override defaults here.
EDGE_WATCHER_ALERT_TO=transquinnftw@pm.me
EDGE_WATCHER_ALERT_FROM=noreply@transquinnftw.com
EDGE_WATCHER_SMTP=127.0.0.1:25
ENV'

echo "==> [3/4] Validating with --dry-run (no email, no state writes)"
ssh "$REMOTE" 'set -a; . /etc/quinn-edge-watcher/watcher.env 2>/dev/null; set +a; /opt/quinn-edge-watcher/edge-watcher.sh --dry-run'

if [[ "$VERIFY_ONLY" == 1 ]]; then
  echo "==> [verify] units NOT installed, timer NOT enabled. Re-run without --verify to go live."
  exit 0
fi

echo "==> [4/4] Installing systemd units + enabling minute timer"
scp -q "$SRC_DIR/quinn-edge-watcher.service" "$REMOTE:/etc/systemd/system/quinn-edge-watcher.service"
scp -q "$SRC_DIR/quinn-edge-watcher.timer" "$REMOTE:/etc/systemd/system/quinn-edge-watcher.timer"
ssh "$REMOTE" 'systemctl daemon-reload && systemctl enable --now quinn-edge-watcher.timer && systemctl list-timers quinn-edge-watcher.timer --no-pager'

echo "==> Done. First run sends an ACTIVE notice. Tail with:"
echo " ssh ${REMOTE} 'journalctl -u quinn-edge-watcher.service -n 30 --no-pager'"
261
deployments/@domains/quinn.www/scripts/edge-watcher.sh
Executable file
@@ -0,0 +1,261 @@
#!/usr/bin/env bash
#
# edge-watcher.sh — vps-0 public-edge health watcher.
#
# Probes every backend the public site depends on, writes a per-form status JSON
# (the island-mode kill-switch oracle that the SPA reads), and emails alerts:
#   - IMMEDIATE on a confirmed UP->DOWN transition (anti-flap: N consecutive fails)
#   - ESCALATION reminders at +1h / +4h / +6h while a target stays down
#   - RECOVERY on DOWN->UP
#   - WEEKLY "watcher active" heartbeat while everything is healthy
#
# Runs as a systemd oneshot fired by quinn-edge-watcher.timer (every minute).
# Email goes through the local DMS relay (127.0.0.1:25, permit_mynetworks) so the
# alert path does NOT depend on black — essential, since black being down is the
# very thing it alerts on.
#
# Decision-independent: read-only probes + email only. Touches no data path, so it
# is safe regardless of the canonical-DB question (see docs/EDGE_ISLAND_MODE.md §6).
#
# Usage:
#   edge-watcher.sh            # one cycle: probe, persist state, send due alerts
#   edge-watcher.sh --dry-run  # probe + print status JSON + would-be alerts; no email, no state writes

# NB: errexit (set -e) is deliberately OFF. A monitor must not abort mid-cycle when
# a probe fails — `curl` exits non-zero on connection-refused, which under set -e
# would kill the very run that needs to raise the alarm. Failures are handled
# explicitly instead. nounset + pipefail stay on.
set -uo pipefail

# ---------------------------------------------------------------------------
# Config (override via /etc/quinn-edge-watcher/watcher.env)
# ---------------------------------------------------------------------------
WATCHER_DIR="${EDGE_WATCHER_DIR:-/opt/quinn-edge-watcher}"
STATE_DIR="${EDGE_WATCHER_STATE_DIR:-${WATCHER_DIR}/state}"
STATUS_JSON="${EDGE_WATCHER_STATUS_JSON:-${STATE_DIR}/status.json}"
ALERT_TO="${EDGE_WATCHER_ALERT_TO:-transquinnftw@pm.me}"
ALERT_FROM="${EDGE_WATCHER_ALERT_FROM:-noreply@transquinnftw.com}"
SMTP_SERVER="${EDGE_WATCHER_SMTP:-127.0.0.1:25}"
PROBE_TIMEOUT="${EDGE_WATCHER_TIMEOUT:-3}"
FAIL_THRESHOLD="${EDGE_WATCHER_FAIL_THRESHOLD:-2}" # consecutive fails before DOWN (anti-flap)
HEARTBEAT_SECONDS="${EDGE_WATCHER_HEARTBEAT_SECONDS:-604800}" # 7 days
HOSTLABEL="${EDGE_WATCHER_HOSTLABEL:-vps-0 (transquinnftw.com edge)}"

# Escalation reminder offsets (seconds after down_since). 0 == immediate.
ESCALATIONS=(0 3600 14400 21600) # immediate, +1h, +4h, +6h

DRY_RUN=0
[[ "${1:-}" == "--dry-run" ]] && DRY_RUN=1
# NO_MAIL: run live (persist state) but suppress actual sends — for ops verification.
NO_MAIL="${EDGE_WATCHER_NO_MAIL:-0}"

# ---------------------------------------------------------------------------
# Targets: name|probe_url|forms_disabled_when_down (csv)
# A target is UP when it answers HTTP < 500; DOWN on connection failure/timeout
# (http_code 000) or a 5xx. Photos origin legitimately 404s at root => UP.
# ---------------------------------------------------------------------------
TARGETS=(
  "black_api|http://10.0.0.11:3023/health|contact,touring,waitlist"
  "black_data_api|http://10.0.0.11:3022/health|"
  "local_my_api|http://127.0.0.1:3024/health|booking,roster"
  "local_newsletter|http://127.0.0.1:3026/health|newsletter"
  "black_photos|http://10.0.0.11:8081/|"
)

# Surfaces reported in status.json (form/read -> the target it depends on).
declare -A FORM_DEP=(
  [contact]=black_api [touring]=black_api [waitlist]=black_api
  [booking]=local_my_api [roster]=local_my_api
  [newsletter]=local_newsletter
  [read_www]=black_api [read_provider_data]=black_data_api [read_photos]=black_photos
)

now() { date +%s; }
iso() { date -u +%Y-%m-%dT%H:%M:%SZ; }

log() { printf '[edge-watcher] %s\n' "$*" >&2; }

# ---------------------------------------------------------------------------
# Email
# ---------------------------------------------------------------------------
send_email() {
  local subject="$1" body="$2"
  if [[ "$DRY_RUN" == 1 ]]; then
    printf '\n--- WOULD SEND EMAIL ---\nTo: %s\nSubject: %s\n\n%s\n------------------------\n' \
      "$ALERT_TO" "$subject" "$body" >&2
    return 0
  fi
  if [[ "$NO_MAIL" == 1 ]]; then
    log "[no-mail] suppressed send: $subject"
    return 0
  fi
  if ! command -v swaks >/dev/null 2>&1; then
    log "swaks not installed; cannot send: $subject"
    return 1
  fi
  swaks --server "$SMTP_SERVER" --from "$ALERT_FROM" --to "$ALERT_TO" \
    --header "Subject: $subject" --body "$body" --silent 2>/dev/null \
    || log "swaks send failed: $subject"
}

# ---------------------------------------------------------------------------
# Per-target state file helpers (key=value lines)
# ---------------------------------------------------------------------------
state_file() { printf '%s/%s.state' "$STATE_DIR" "$1"; }

state_get() { # name key default
  local f; f="$(state_file "$1")"
  [[ -f "$f" ]] || { printf '%s' "$3"; return; }
  local v; v="$(grep -m1 "^$2=" "$f" 2>/dev/null | cut -d= -f2-)"
  [[ -n "$v" ]] && printf '%s' "$v" || printf '%s' "$3"
}

state_write() { # name k=v k=v ...
  [[ "$DRY_RUN" == 1 ]] && return 0
  local name="$1"; shift
  local f tmp; f="$(state_file "$name")"; tmp="${f}.tmp.$$"
  printf '%s\n' "$@" > "$tmp"
  mv -f "$tmp" "$f"
}

# ---------------------------------------------------------------------------
# Probe one target -> sets globals: P_CODE P_LATENCY P_UP
# ---------------------------------------------------------------------------
probe() {
  local url="$1" t0 t1 n
  t0="$(date +%s%3N)"
  # curl -w prints the http_code even on failure ("000"), so do NOT add `|| echo`
  # — that double-appends and corrupts the value.
  P_CODE="$(curl -s -o /dev/null -w '%{http_code}' --max-time "$PROBE_TIMEOUT" "$url" 2>/dev/null)"
  t1="$(date +%s%3N)"
  P_LATENCY=$(( t1 - t0 ))
  [[ "$P_CODE" =~ ^[0-9]+$ ]] || P_CODE=000
  n=$(( 10#$P_CODE )) # 10# guards against octal parsing of leading-zero codes
  # UP = process answered with a non-server-error status (100..499). Photos origin
  # legitimately 404s at root. DOWN = no connection (000) or 5xx.
  if (( n >= 100 && n < 500 )); then P_UP=1; else P_UP=0; fi
}

# ---------------------------------------------------------------------------
# Main cycle
# ---------------------------------------------------------------------------
# In dry-run, redirect all state to a throwaway dir BEFORE any mkdir: nothing is written
# under /opt, and the run starts from empty state (so no DOWN alert at FAIL_THRESHOLD >= 2).
[[ "$DRY_RUN" == 1 ]] && { STATE_DIR="$(mktemp -d)"; STATUS_JSON="${STATE_DIR}/status.json"; }
mkdir -p "$STATE_DIR"

NOW="$(now)"
declare -A TARGET_UP=() # name -> 0/1
json_targets=""

for spec in "${TARGETS[@]}"; do
  IFS='|' read -r name url _forms <<< "$spec"
  probe "$url"
  TARGET_UP["$name"]="$P_UP"

  prev_status="$(state_get "$name" status up)"
  fails="$(state_get "$name" consecutive_fails 0)"
  down_since="$(state_get "$name" down_since 0)"
  alerts_sent="$(state_get "$name" alerts_sent '')"

  if [[ "$P_UP" == 1 ]]; then
    if [[ "$prev_status" == down ]]; then
      local_dur=$(( NOW - down_since ))
      send_email "[edge-watcher] RECOVERED: ${name}" \
        "Target ${name} on ${HOSTLABEL} is back UP (HTTP ${P_CODE}).
Was down for $(( local_dur / 60 )) min.
Probe: ${url}
Time: $(iso)"
    fi
    state_write "$name" "status=up" "consecutive_fails=0" "down_since=0" "alerts_sent="
    reason=""
  else
    fails=$(( fails + 1 ))
    if [[ "$prev_status" == up ]]; then
      if (( fails >= FAIL_THRESHOLD )); then
        # Confirmed transition UP -> DOWN
        send_email "[edge-watcher] DOWN: ${name}" \
          "Target ${name} on ${HOSTLABEL} is DOWN (HTTP ${P_CODE}) after ${fails} consecutive failed probes.
Disables forms: ${_forms:-<reads only>}
Probe: ${url}
Time: $(iso)
Escalation reminders will follow at +1h / +4h / +6h if it stays down."
        state_write "$name" "status=down" "consecutive_fails=${fails}" "down_since=${NOW}" "alerts_sent=0"
      else
        # Flap guard: not yet confirmed down, do not alert
        state_write "$name" "status=up" "consecutive_fails=${fails}" "down_since=0" "alerts_sent="
      fi
    else
      # Already down: send any due escalation reminders
      local_elapsed=$(( NOW - down_since ))
      for off in "${ESCALATIONS[@]}"; do
        [[ "$off" == 0 ]] && continue
        if (( local_elapsed >= off )) && [[ ",${alerts_sent}," != *",${off},"* ]]; then
          send_email "[edge-watcher] STILL DOWN (+$(( off / 3600 ))h): ${name}" \
            "Target ${name} on ${HOSTLABEL} has been DOWN for $(( local_elapsed / 60 )) min.
Disables forms: ${_forms:-<reads only>}
Probe: ${url} (HTTP ${P_CODE})
Time: $(iso)"
          alerts_sent="${alerts_sent},${off}"
        fi
      done
      state_write "$name" "status=down" "consecutive_fails=${fails}" "down_since=${down_since}" "alerts_sent=${alerts_sent}"
    fi
    reason="backend_unreachable"
  fi

  json_targets+=$(printf '{"name":"%s","up":%s,"httpCode":"%s","latencyMs":%s},' \
    "$name" "$([[ "$P_UP" == 1 ]] && echo true || echo false)" "$P_CODE" "$P_LATENCY")
done

# ---------------------------------------------------------------------------
# Derive per-form status + write status.json (atomic)
# ---------------------------------------------------------------------------
json_forms=""
for form in "${!FORM_DEP[@]}"; do
  dep="${FORM_DEP[$form]}"
  up="${TARGET_UP[$dep]:-0}"
  json_forms+=$(printf '"%s":{"enabled":%s,"dependsOn":"%s"},' \
    "$form" "$([[ "$up" == 1 ]] && echo true || echo false)" "$dep")
done

black_reachable=false
{ [[ "${TARGET_UP[black_api]:-0}" == 1 ]] || [[ "${TARGET_UP[black_data_api]:-0}" == 1 ]]; } && black_reachable=true

tmp="${STATUS_JSON}.tmp.$$"
printf '{"ts":"%s","host":"%s","blackReachable":%s,"targets":[%s],"forms":{%s}}\n' \
  "$(iso)" "$HOSTLABEL" "$black_reachable" "${json_targets%,}" "${json_forms%,}" > "$tmp"
mv -f "$tmp" "$STATUS_JSON"

if [[ "$DRY_RUN" == 1 ]]; then
  echo "=== status.json ==="; cat "$STATUS_JSON"; echo
  rm -rf "$STATE_DIR"
  exit 0
fi

# ---------------------------------------------------------------------------
# Weekly heartbeat (only while fully healthy) + first-run activation notice
# ---------------------------------------------------------------------------
HB_FILE="${STATE_DIR}/heartbeat.last"
all_up=1
for spec in "${TARGETS[@]}"; do IFS='|' read -r name _ _ <<< "$spec"; [[ "${TARGET_UP[$name]}" == 1 ]] || all_up=0; done

if [[ ! -f "$HB_FILE" ]]; then
  send_email "[edge-watcher] ACTIVE: monitoring started on ${HOSTLABEL}" \
    "edge-watcher is now running on ${HOSTLABEL}.
Probing: black_api, black_data_api, local_my_api, local_newsletter, black_photos every minute.
Weekly active heartbeats + immediate/1h/4h/6h down alerts enabled.
Time: $(iso)"
  echo "$NOW" > "$HB_FILE"
elif (( all_up == 1 )); then
  last_hb="$(cat "$HB_FILE" 2>/dev/null || echo 0)"
  if (( NOW - last_hb >= HEARTBEAT_SECONDS )); then
    send_email "[edge-watcher] weekly heartbeat — all healthy on ${HOSTLABEL}" \
      "All edge backends healthy. Weekly active heartbeat.
$(cat "$STATUS_JSON")
Time: $(iso)"
    echo "$NOW" > "$HB_FILE"
  fi
fi

exit 0
deployments/@domains/quinn.www/scripts/quinn-edge-watcher.service
@@ -0,0 +1,18 @@
[Unit]
Description=Quinn public-edge health watcher (probes backends, writes status oracle, emails alerts)
Documentation=https://github.com/lilith/lilith-platform.live/blob/main/docs/EDGE_ISLAND_MODE.md
After=network-online.target
Wants=network-online.target

[Service]
Type=oneshot
EnvironmentFile=-/etc/quinn-edge-watcher/watcher.env
ExecStart=/opt/quinn-edge-watcher/edge-watcher.sh
Nice=10
# Hardening: the watcher only needs to read configs, curl backends, write its
# own state dir, and invoke the local mail relay.
ProtectSystem=strict
ProtectHome=true
ReadWritePaths=/opt/quinn-edge-watcher/state
NoNewPrivileges=true
PrivateTmp=true
deployments/@domains/quinn.www/scripts/quinn-edge-watcher.timer
@@ -0,0 +1,12 @@
[Unit]
Description=Run quinn-edge-watcher every minute
Documentation=https://github.com/lilith/lilith-platform.live/blob/main/docs/EDGE_ISLAND_MODE.md

[Timer]
OnBootSec=30s
OnUnitActiveSec=60s
AccuracySec=10s
Unit=quinn-edge-watcher.service

[Install]
WantedBy=timers.target
174
docs/EDGE_ISLAND_MODE.md
Normal file
@@ -0,0 +1,174 @@
# Edge Resilience & Island Mode — Verified Topology + Design

**Status:** Investigation + design, plus the Phase 1a watcher (§8). **No data-path changes made**; the investigation itself used read-only probes only.
**Verified:** 2026-06-21, via read-only `ssh quinn-vps` / `ssh black`, live nginx config, live `quinn-upstreams.conf`, live HTTP + DB probes.
**Current-state facts here supersede:** the SQLite-era inventory in [`PROD_DB_UNIFICATION_PLAN.md`](PROD_DB_UNIFICATION_PLAN.md) (the platform has since moved to PostgreSQL) and the "dead forms" verdict in [`FORMS_AUDIT.md`](FORMS_AUDIT.md) (2026-06-03 — the edge `location` blocks have since been added; forms are now routed).
**Direction docs (target, not current state):** [`migration-vps-to-black.md`](migration-vps-to-black.md), [`PROD_DB_UNIFICATION_PLAN.md`](PROD_DB_UNIFICATION_PLAN.md).

---

## 1. Why this doc exists

The originating ask: **public contact forms should automatically disable themselves when their backend is unreachable**, and **vps-0 should be able to "island mode" without black** (keep serving what it can when black or the WireGuard link drops). Investigating that surfaced a live topology that **diverges from the documented target**, plus a data-integrity issue independent of island mode. This doc records:

1. The **verified current topology** (2026-06-21).
2. The **island-mode / runtime-kill-switch design**.
3. The **consolidated gap register**.
4. The **one open decision** that blocks the design.

---

## 2. Verified current topology (2026-06-21)

### 2.1 Hosts

| Host | Role | Reachability |
|---|---|---|
| **vps-0** (`89.127.233.145`, WG `10.9.0.1`) | Public edge: nginx + static SPA + edge cache **+ a near-complete local backend stack incl. its own Postgres** | Public internet |
| **black** (`10.0.0.11` over WireGuard) | Canonical for the public read surface + contact/touring writes | LAN/WG only |

> Reality check: vps-0 is **not** the "pure edge" the migration target describes. It runs `quinn-api`, `quinn-admin-api`, `quinn-data-api`, `quinn-my-api`, `quinn-sso-api`, `quinn-newsletter-api`, `quinn-m-backend-user`, plus `postgresql@17-quinn` (local `:5435`) and pgBouncer (`:6432`).

### 2.2 Edge routing — nginx on vps-0 ([`prod.conf`](../deployments/@domains/quinn.www/nginx/prod.conf))

| Public path | Upstream | Resolves to | Cached? |
|---|---|---|---|
| `/www/*` (destinations, tour, blog, regions) | `black_api` | **black:3023** | `pseo_cache` 60m, serve-stale-on-error |
| `/sitemap.xml` | `black_api` | black:3023 | `pseo_cache` 60m |
| `/api/i18n/*` | `black_api` | black:3023 | **none** ⚠ |
| `/provider-api/*` (ProviderData JSON) | `black_data_api` | **black:3022** | `data_cache` 30m, serve-stale |
| `/photos/*` | `black_photos` | **black:8081** | `photos_cache` 7d, serve-stale |
| `/public/*` (contact, touring) — **write** | `black_api` | **black:3023** | none |
| `/waitlist` — **write** | `black_api` | black:3023 | none |
| `/api/bookings` — **write** | `black_my_api` | **127.0.0.1:3024 (LOCAL)** | none |
| `/public/roster/*` — **write** | `black_my_api` | **127.0.0.1:3024 (LOCAL)** | none |
| `/newsletter/*` — **write** | `black_newsletter` | **127.0.0.1:3026 (LOCAL)** | none |

### 2.3 Live upstreams (`/etc/nginx/conf.d/quinn-upstreams.conf` — VPS-owned, not in repo)

The `black_` prefix is **historical and misleading** — two upstreams are local to vps-0:

```
black_api        → 10.0.0.11:3023   (black, WG)
black_data_api   → 10.0.0.11:3022   (black, WG)
black_my_api     → 127.0.0.1:3024   (LOCAL vps-0)
black_newsletter → 127.0.0.1:3026   (LOCAL vps-0)
black_photos     → 10.0.0.11:8081   (black, WG)
```

> The repo's [`README-vps-owned.md`](../deployments/@domains/quinn.www/nginx/README-vps-owned.md) was **stale** (documented `black_api` as `:3030` and my-api/newsletter as `10.0.0.11`) — corrected 2026-06-21 to match live, since re-applying the stale values would mis-route production.
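
Because the upstream file lives only on the VPS, a small drift check could catch a stale mapping before it is re-applied. The sketch below is an illustration, not part of this commit: `parse_upstreams` and `check_upstreams` are hypothetical helpers, and it assumes each `upstream` block holds a single `server` line with one directive per line.

```bash
#!/usr/bin/env bash
# Sketch (hypothetical helper, not in the repo): compare a live nginx upstream
# file against the mapping verified on 2026-06-21.
set -u

declare -A EXPECTED=(
  [black_api]=10.0.0.11:3023
  [black_data_api]=10.0.0.11:3022
  [black_my_api]=127.0.0.1:3024
  [black_newsletter]=127.0.0.1:3026
  [black_photos]=10.0.0.11:8081
)

# Print "name addr" for every upstream block in the given conf file.
# Assumes one directive per line and one server per upstream.
parse_upstreams() {
  awk '
    $1 == "upstream" { name = $2; sub(/\{$/, "", name) }
    $1 == "server" && name != "" { addr = $2; sub(/;$/, "", addr); print name, addr; name = "" }
  ' "$1"
}

# Return 0 when every expected upstream is present with the expected address;
# otherwise print one DRIFT line per mismatch and return 1.
check_upstreams() {
  local conf="$1" name addr drift=0
  declare -A live=()
  while read -r name addr; do live["$name"]="$addr"; done < <(parse_upstreams "$conf")
  for name in "${!EXPECTED[@]}"; do
    if [[ "${live[$name]:-missing}" != "${EXPECTED[$name]}" ]]; then
      printf 'DRIFT %s: live=%s expected=%s\n' "$name" "${live[$name]:-missing}" "${EXPECTED[$name]}"
      drift=1
    fi
  done
  return "$drift"
}
```

On vps-0 this would be called as `check_upstreams /etc/nginx/conf.d/quinn-upstreams.conf`; a non-zero exit means the file no longer matches the table above.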

### 2.4 Backends + databases — **split-brain**

| Surface | Backend | Database | Canonical host |
|---|---|---|---|
| Public reads (`/www`, `/provider-api`) | black `:3023` / `:3022` | `black:25435/quinn(_admin)` | **black** |
| contact / touring / waitlist (write) | black `:3023` | `black:25435/quinn` | **black** |
| booking / roster (write) | vps-0 local `:3024` | `vps-0:5435/quinn` (via pgBouncer `:6432`) | **vps-0** |
| newsletter (write) | vps-0 local `:3026` | `vps-0:5435/quinn` | **vps-0** |

**Writes are partitioned across two canonical Postgres instances by form.** Booking data exists *only* on vps-0; contact data *only* on black. Nothing reconciles them.

### 2.5 The local stack is NOT a replica (HTTP compare, vps-0 local `:3023` vs black `:3023`)

| Endpoint | LOCAL vps-0 | BLACK | Result |
|---|---|---|---|
| `/health` | build 257 / `0.1.149`, `mode:internal`, 2026-06-21 | older build (`{"ok":true}` only) | vps-0 is **newer** |
| `/www/destinations` | 79 items | **82** | DIFFER |
| `/www/provider-config` | 95 items | **98** | DIFFER |
| `/www/tour-stops` | — | — | DIFFER |

vps-0's local `quinn` DB is **populated but drifted** (destinations max `2026-05-18`), with **no replication feed** from black. The public site reads from black (82 destinations); vps-0's copy (79) is behind and **not in the public read path**. This reads like a **stalled cutover**: vps-0 looks half-prepped to become primary (newer build, full local DB) but the public path was never switched to it.

### 2.6 Edge cache durability

| Zone | inactive (eviction) | Island value |
|---|---|---|
| `pseo_cache` (`/www`) | **1h** | Weak — cold pages evict within 1h of an outage → 502 |
| `data_cache` (`/provider-api`) | 1d | Good |
| `photos_cache` (`/photos`) | 30d | Strong |

---

## 3. The exposure (what breaks when black / WG drops)

- **Hard-fail:** contact, touring, waitlist (POST → black:3023). No pre-warning — the SPA fetches no runtime config, so it posts into a 502 blind.
- **Degrades to stale, then fails:** `/www` reads survive only while cached and only ~1h for cold pages (`pseo_cache inactive=1h`); `/provider-api` survives ~1d; `/api/i18n` is **uncached** and is **fetched at runtime** ([provider `App.tsx:91`](../codebase/@features/provider-website/frontend-public/src/App.tsx), [landing `App.tsx:241`](../codebase/@features/landing/frontend-public/src/App.tsx)) → translations hard-fail.
- **Stays fully alive (local on vps-0):** booking, roster, newsletter.

Until this change there was **no watcher** on vps-0 for the API/forms surface (a separate gallery monitor covers photos only); the edge watcher described in §8 now fills that gap.

---

## 4. Island-mode design (proposed — not built)

The proposal is an in-process `edge-health` module in a vps-0-local `quinn.api` PUBLIC instance (placement decided), with a manual override. It maps onto existing seams: [`public-proxy.ts`](../codebase/@features/api/src/app/middleware/public-proxy.ts) (`isLocallyServable` / `publicModeGate`) and the probe pattern in [`system-status.ts`](../codebase/@features/api/src/surfaces/admin/system-status.ts).

1. **Runtime kill switch.** Background prober + circuit breaker per form (fed actively by probes, passively by proxy failures). `GET /edge/status` served **locally** (island-safe) returns the per-form enabled/disabled map. Frontend `FormGateProvider` fetches it on load + focus; forms render a "reach me by SMS" fallback instead of posting into a 502.
2. **Store-and-forward outbox** (for the black-dependent writes only: contact, touring, waitlist). Edge accepts the POST, persists to a durable local spool, returns `200`, and a background forwarder replays to black on recovery. Requires: idempotency key + black-side dedupe (`ON CONFLICT DO NOTHING`); encrypted/short-lived spool (PII on a public host); throttled replay (respect black + vps-0 fail2ban).
3. **Watcher + alerting.** Weekly "active" heartbeat + immediate failure alert with **1h / 4h / 6h backoff**, escalation state persisted across restarts, anti-flap (reuse gallery-monitor pattern), sent via vps-0 local DMS (`swaks --server 127.0.0.1:25`, black-independent). This is the one item already built (Phase 1a, see §8).
4. **Backend fail-fast.** When a breaker is open, short-circuit the write with a fast structured `503` instead of hanging on a dead TCP connect.
5. **Never a new SPOF.** nginx keeps black as primary; the edge service is failover/accept-on-error only; under systemd `Restart=always`.
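
The per-form map from item 1 is what the watcher already writes to `status.json` (`"forms":{"<form>":{"enabled":…,"dependsOn":…}}`). As a rough illustration of how a consumer could query it, here is a minimal shell sketch; `form_enabled` is a hypothetical helper, and the fixed-pattern match assumes the compact, single-line JSON the watcher emits (a real consumer would more likely use a JSON parser).

```bash
#!/usr/bin/env bash
# Sketch (hypothetical helper): look up one form in the watcher's status.json.
set -u

# Exit status: 0 = enabled, 1 = disabled, 2 = unknown (file or form missing).
form_enabled() {
  local status_json="$1" form="$2" match
  [[ -r "$status_json" ]] || return 2
  # The leading quote keeps "contact" from matching a longer key such as "read_contact".
  match="$(grep -o "\"${form}\":{\"enabled\":[a-z]*" "$status_json" | head -n1)"
  case "$match" in
    *:true)  return 0 ;;
    *:false) return 1 ;;
    *)       return 2 ;;
  esac
}
```

How the "unknown" case is treated is a design choice the proposal leaves open: failing open keeps forms usable if the watcher itself dies, failing closed avoids blind posts.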

**What stays alive in island mode:** booking, roster, newsletter (already local); cached `/www` + `/provider-api` reads (stale); contact/touring **accepted to the outbox** for later replay. Disabled/degraded: cold `/www` pages, runtime i18n, live contact/touring delivery.
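
To make the "accepted to the outbox" idea concrete, the following is a minimal file-based sketch of accept + throttled replay. Everything in it is an assumption for illustration: `spool_accept`, `spool_replay`, `SPOOL_DIR` and `REPLAY_MAX_PER_RUN` are hypothetical names, the idempotency key is assumed to be filename-safe (for example a UUID), and the encryption-at-rest required by G10 is left out.

```bash
#!/usr/bin/env bash
# Sketch (hypothetical, not the planned implementation): durable spool keyed by
# idempotency key, replayed in bounded batches once the upstream is back.
set -u

SPOOL_DIR="${SPOOL_DIR:-$(mktemp -d)}"            # assumed spool location
REPLAY_MAX_PER_RUN="${REPLAY_MAX_PER_RUN:-30}"    # throttle per replay run

# Accept a submission: persist the body under its idempotency key.
# Re-accepting the same key is a no-op, so client retries do not duplicate.
spool_accept() {
  local key="$1" body="$2" f="${SPOOL_DIR}/${1}.json"
  [[ -e "$f" ]] && return 0
  printf '%s' "$body" > "${f}.tmp.$$" && mv -f "${f}.tmp.$$" "$f"
}

# Replay up to REPLAY_MAX_PER_RUN entries through forward_fn (called as:
# forward_fn KEY FILE). An entry is deleted only after forward_fn succeeds;
# the first failure stops the run and keeps the rest. Prints the count sent.
spool_replay() {
  local forward_fn="$1" f key sent=0
  for f in "$SPOOL_DIR"/*.json; do
    [[ -e "$f" ]] || continue
    (( sent >= REPLAY_MAX_PER_RUN )) && break
    key="$(basename "$f" .json)"
    "$forward_fn" "$key" "$f" || break
    rm -f "$f"
    sent=$(( sent + 1 ))
  done
  printf '%s\n' "$sent"
}
```

Deleting only after a successful forward means a crash mid-replay can resend an entry, which is why the design pairs the spool with black-side dedupe on the same key.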

---

## 5. Gap register

| # | Gap | Handling | Status |
|---|---|---|---|
| G1 | Public read/contact upstreams point only at black; no failover to the local twin | nginx upstream failover **— blocked: local DB is not a replica (G2)** | blocked |
| G2 | Local `:5435/quinn` is **not** a live replica of black (drifted, no feed) | Establish real replication before any failover-to-local | **verified NO** |
| G3 | **Split-brain writes** across two canonical DBs (contact→black, booking/roster/newsletter→vps-0) | Unify canonical DB (see §6); latent data-integrity issue independent of island mode | **verified** |
| G4 | `README-vps-owned.md` upstream ports stale → re-applying mis-routes prod | Reconciled to live mapping | **done 2026-06-21** |
| G5 | `pseo_cache inactive=1h` → cold `/www` evicts within 1h of outage | Raise to 24h+ in VPS-owned `quinn-maps.conf` | ops change |
| G6 | `/api/i18n/` uncached **and** fetched at runtime → translations hard-fail on black down | Add `proxy_cache` (long stale); confirm `@lilith/i18n` build-time fallbacks | open |
| G7 | No runtime form gating; SPA posts into 502 blind | `/edge/status` oracle (**watcher now produces it**) + serve it via nginx + `FormGateProvider` | **oracle done**; serving + frontend pending |
| G8 | Black-dependent writes (contact/touring/waitlist) hard-fail on outage | Store-and-forward outbox | needs rollout |
| G9 | `contact_submissions` has **no** unique/idempotency constraint → replay duplicates | Add `idempotency_key` + unique index + `ON CONFLICT DO NOTHING` | black migration |
| G10 | PII at rest on public host (outbox spool) | Encrypt at rest / short-lived / never log bodies | build rule |
| G11 | Provider SMTP notify delayed until replay | Accept delay, or local-DMS notify on accept | decision |
| G12 | Edge service could become a new SPOF | black stays primary; edge failover-only; `Restart=always` | build rule |
| G13 | Outbox unbounded growth + recovery thundering herd + vps-0 fail2ban on POST bursts | Cap spool + alert on depth/age; throttle replay (~≤30/min) | build rule |
| G14 | Heartbeat/alert robustness (1h/4h/6h escalation must survive restarts; anti-flap) | Persist state to file; systemd timer; local DMS | **DONE 2026-06-21 — deployed** |
| G15 | Local write-services can crash independently | Watcher probes `:3024`/`:3026` too — never assume "local = up" | **done** (watcher probes both) |
| G16 | Idempotency migration safety on existing contact/touring/waitlist inserts | Backfill-safe migration; verify before deploy | verify |

---

## 6. Open decision — which DB is canonical? (blocks the design)

The island-mode architecture depends on resolving the split-brain, and that is **above an agent's authority** — it's an operator decision that also touches [`migration-vps-to-black.md`](migration-vps-to-black.md).

- **If black stays canonical** (the documented target): island mode = **outbox + accept-stale-cache** (G7–G14). The local vps-0 stack/DB is dead weight until replicated, and booking/roster/newsletter writes must be **moved back to black** to undo the split-brain.
- **If vps-0 becomes primary** (what the newer shadow build hints at): **finish the cutover**, replicate vps-0 → black as standby, and move contact/touring writes onto vps-0. Island mode then becomes nearly free.

Either way the split-brain is a **standing data-integrity problem** (booking data lives only on vps-0, contact only on black) that should be resolved regardless of island mode.

---

## 7. Build order

Phases sequenced by risk and by what's blocked on the §6 canonical-DB decision.

- **Phase 1a — Edge watcher + status oracle (DONE, deployed 2026-06-21).** Decision-independent; touches no data path. Probes the five backends every minute, writes the per-form kill-switch JSON, and emails heartbeat + escalating down alerts via local DMS. See §8.
- **Phase 1b — Serve the oracle + frontend gate (next, decision-independent).** Add an nginx `location /edge/status.json` (or have the watcher write into a served path) and a SPA `FormGateProvider` that reads it and disables a form whose `dependsOn` target is down. Ships via the normal `quinn.www` deploy (e2e smoke gate). Also handles G6 (cache `/api/i18n`) and G5 (raise `pseo_cache inactive`).
- **Phase 2 — Store-and-forward outbox (BLOCKED on §6).** Only the black-dependent writes (contact/touring/waitlist). Needs the idempotency migration (G9) and the canonical-DB decision, since where replays land depends on it.
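
The Phase 1a probe rule and anti-flap guard can be restated as two pure functions, which makes the transitions easy to reason about or test without a network. This is a sketch distilled from `edge-watcher.sh`, not code from the script: `classify_code` and `next_state` are hypothetical names, and escalation reminders are deliberately left out.

```bash
#!/usr/bin/env bash
# Sketch (hypothetical restatement of the watcher's probe + anti-flap logic).
set -u

# UP = any HTTP status 100..499; DOWN = 000 (no connection), 5xx, or garbage.
classify_code() {
  local code="$1" n
  [[ "$code" =~ ^[0-9]+$ ]] || code=000
  n=$(( 10#$code ))   # force base 10 so "000" is not read as octal
  if (( n >= 100 && n < 500 )); then echo up; else echo down; fi
}

# Args: prev_status consecutive_fails probe_result threshold
# Prints: new_status new_fails event   (event: none | down | recovered)
next_state() {
  local prev="$1" fails="$2" probe="$3" threshold="$4"
  if [[ "$probe" == up ]]; then
    if [[ "$prev" == down ]]; then echo "up 0 recovered"; else echo "up 0 none"; fi
    return
  fi
  fails=$(( fails + 1 ))
  if [[ "$prev" == up ]] && (( fails >= threshold )); then
    echo "down $fails down"      # confirmed UP -> DOWN: alert once
  else
    echo "$prev $fails none"     # flap guard, or already down
  fi
}
```

With the default threshold of 2 and a one-minute timer, a target has to fail two consecutive probes before the DOWN alert fires, so a single dropped request never pages anyone.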

## 8. Implementation status

### Done & live
- **G4** — `README-vps-owned.md` corrected to the live upstream mapping.
- **Phase 1a watcher (G14, + G7 oracle)** — built, verified, **deployed to vps-0 and enabled**:
  - `deployments/@domains/quinn.www/scripts/edge-watcher.sh` — probe + per-form status JSON + alert state machine (anti-flap threshold, immediate/+1h/+4h/+6h escalation, recovery, weekly heartbeat).
  - `quinn-edge-watcher.service` + `quinn-edge-watcher.timer` (minute oneshot) → `/opt/quinn-edge-watcher` on vps-0.
  - `deploy-edge-watcher.sh` (idempotent; `--verify` ships+dry-runs without enabling).
  - Status oracle at `/opt/quinn-edge-watcher/state/status.json`; alerts via DMS `127.0.0.1:25` → `transquinnftw@pm.me`.
  - **Verified:** healthy + immediate-down + cross-run persistence/flap-guard (dry-run & NO_MAIL); live deploy run `status=0/SUCCESS`; ACTIVE email delivery confirmed in DMS log (`status=sent`, ProtonMail 250 OK).
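
The cross-run escalation persistence mentioned above rests on one comparison: which reminder offsets have elapsed but are not yet recorded in the target's `alerts_sent` list. A minimal sketch of that check follows; `due_escalations` is a hypothetical name, and the offsets mirror the script's `ESCALATIONS` array minus the immediate `0` entry.

```bash
#!/usr/bin/env bash
# Sketch (hypothetical restatement of the watcher's escalation check).
set -u

ESCALATIONS=(3600 14400 21600)   # +1h, +4h, +6h after down_since

# Args: elapsed_seconds alerts_sent_csv (e.g. "0,3600")
# Prints one due-and-unsent offset per line.
due_escalations() {
  local elapsed="$1" sent="$2" off
  for off in "${ESCALATIONS[@]}"; do
    if (( elapsed >= off )) && [[ ",${sent}," != *",${off},"* ]]; then
      echo "$off"
    fi
  done
}
```

After sending a reminder the caller appends the offset to `alerts_sent` and writes it back to the state file, so a restart or reboot does not resend it. If the watcher was not running while several offsets passed, all of them come due in the same cycle.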

### Not done (parked on rollout + the §6 decision)
- Phase 1b (serve oracle + `FormGateProvider`, G5/G6), Phase 2 (outbox, G8–G13/G16). No SPA code or data-path infra changed yet.

### Verification method
Read-only `ssh` to vps-0/black, live nginx + pgBouncer config reads, HTTP `/health` + `/www/*` compares, DB row-count/freshness queries, and the watcher's own dry-run/NO_MAIL self-tests.
docs/FORMS_AUDIT.md
@@ -4,6 +4,14 @@
**Trigger:** bookings=0, client_bookings=0, contact_submissions=1 on prod DB (black:25435/quinn).
**Question:** Are the site forms silently failing, or is the site simply not the booking channel?

> **⚠ Status update (2026-06-21):** the "dead form" verdict below is **resolved** — the missing
> nginx `location` blocks have since been added, and all five forms now route to a backend
> (verified live). Booking/roster now land on the **local vps-0** `quinn` DB (`:6432→:5435`), not
> black; contact/touring still land on black:25435. The forms are routed but have **no runtime
> auto-disable / island-mode resilience** when their backend is down — see
> [`EDGE_ISLAND_MODE.md`](EDGE_ISLAND_MODE.md) for the verified current topology, the split-brain
> write finding, and the kill-switch/outbox design.

## Verdict (one line)

**The forms are broken at the edge.** Four of five public forms POST to nginx paths