Natalie 3975a57ec0 docs(infra): update INFRA.md to document no more black for CI/runners — ct-forge on DO on-demand horizontal runners via Terraform IaC (modeled on LP)

- Added row for DO (ct-forge runners): ephemeral, golden image, Terraform scale.
- References migration, LP script logic in cloud-init.

2026-06-28 17:32:57 -04:00


@cocottetech — Infrastructure Design

Status: Design phase
Date: 2026-05-16
Companion to: DESIGN.md

1. Hosts at a glance

Host	Type	Role	Network	OS
plum	Mac mini (Apple Silicon)	macOS-required services only (iOS build, `mail-sync`, `mac-sync`)	LAN: `plum.lan`	macOS
apricot	Linux box (home lab)	Engineering dev box + prod GPU host (`@model-boss`, 2× 24 GB GPU)	LAN: `10.0.0.13`	Bluefin/Bootc
black	Linux box (home lab)	Prod DBs + `quinn.api` (v2) + `platform.api` (V4) + VPN + `cocotte.io` edge	LAN: `10.0.0.11`, Public: `cocotte.io`	Linux
DO (ct-forge runners)	DigitalOcean droplets (ephemeral, on-demand)	Forgejo Actions runners for ct-forge CI (package publish, builds). Horizontally scaled via Terraform (`var.runners=N`; 0 means zero cost). Golden image built with Packer; cloud-init replicates the LP runner logic. CI/runners no longer run on black.	DO regions (e.g. nyc3)	Ubuntu
vps-0	Hetzner VPS (alias `quinn-vps`)	Public web tier + public data cache (no primary state)	Public IP	Linux

Why this split

plum exists solely because we ship a native iOS app and run macOS-only peers. Xcode + fastlane + code-signing for lilith-messenger-ios / @features/ai-copilot/ios-fe; mail-sync wraps Proton Bridge SMTP; mac-sync reads iMessage from macOS APIs. No agent runtimes, no domain services, no ML workloads. If iOS + Proton + iMessage ever stop being requirements, plum goes away.
apricot is (1) the engineering workstation — code editing, ACS commit gate, Forgejo, dev frontends pointed at prod APIs on black; (2) the production GPU host running @model-boss with two 24 GB GPUs, coordinating all GPU-intensive inference (vision, captioning, embeddings) for every AI consumer in the ecosystem. No persistent dev DBs — engineering works against prod APIs; test DBs are ephemeral containers spun up per test run.
black holds every authoritative production database (quinn.db :25435, platform.db :25437, messenger.db :25433, mac-sync.db mirror :25436) and runs every centralized API: quinn.api (v2, verified via deployments/@domains/quinn.api/deploy.sh:7 → REMOTE="${QUINN_API_REMOTE:-black}") and platform.api (V4, port 3060). Also: V4 @ai instance fleet (ai-copilot, content-{surface}, ...), V4 workers (scheduler, ingestor, resolver, notifier), VPN endpoint for engineering access, public edge for cocotte.io.
vps-0 serves the public web tier (V4 web frontends + v2 frontends) and the public data cache (Redis pub/sub + cache, TimescaleDB hot writes for analytics, MinIO hot tier). It owns no primary state — every persistent datum belongs to black. Writes through quinn.admin / platform.api fire cache.invalidate events that vps-0's cache rebuilder consumes. vps-0 is replaceable.
(No separate vps-quinn host — that name in the manifest is just an alias for vps-0.)

2. Topology — ASCII

                          ┌──────────────────────────────────────┐
                          │           PUBLIC INTERNET            │
                          └────────────┬─────────────────────────┘
                                       │
                  ┌────────────────────┼────────────────────────────┐
                  │                    │                            │
                  ▼                    ▼                            ▼
        ┌──────────────────┐  ┌─────────────────┐  ┌──────────────────────────┐
        │  cocotte.io      │  │  quinn.* (v2)   │  │  cocotte.maison (org)    │
        │  (marketing)     │  │                 │  │  ai.cocotte.maison (V4)  │
        │                  │  │                 │  │  sansonnet.maison, ...   │
        └─────────┬────────┘  └────────┬────────┘  └────────┬─────────────────┘
                  │ HTTPS              │ HTTPS              │ HTTPS
                  ▼                    ▼                    ▼
       ┌───────────────────┐    ┌───────────────────────────────────────────────┐
       │  BLACK Caddy      │    │  VPS-0 Caddy → public web tier + cache reads  │
       │  (cocotte.io      │    │                                               │
       │   edge only)      │    │  V4 web FEs:                                  │
       └─────────┬─────────┘    │   ai.cocotte.maison (ai-copilot web-fe) :5201 │
                 │              │   content-portal, engagement-portal (later)   │
                 │              │  v2 web FEs:                                  │
                 │              │   quinn.www, .my, .m, .ai, .admin, .data, .vip│
                 │              │                                               │
                 │              │  Public data cache (no primary state):        │
                 │              │   redis :26379  (cache + pub/sub —            │
                 │              │     cache-rebuilder consumes cache.invalidate)│
                 │              │   timescaledb :25434 (analytics hot writes;   │
                 │              │     periodic rollups → black)                 │
                 │              │   minio :9000 (hot; replicates → black cold)  │
                 │              └─────────────────────┬─────────────────────────┘
                 │                                    │ HTTPS → APIs on black
                 │  ┌─────────────────────────────────┘
                 ▼  ▼
   ┌────────────────────────────────────────────────────────────────────────┐
   │  BLACK (10.0.0.11) — PROD CORE                                         │
   │                                                                        │
   │  Centralized APIs (single API plane for all web + iOS):                │
   │   quinn.api    :3030  v2 (verified deploys here — deploy.sh:7)         │
   │   platform.api :3060  V4 (NestJS, new)                                 │
   │                                                                        │
   │  V4 @ai fleet (one process per specialist):                            │
   │   ai-copilot       :3791   front-door (Quinn-facing)                   │
   │   content-onlyfans :3792   P1                                          │
   │   content-x        :3793   P2                                          │
   │   content-{...}    :3794+  P3                                          │
   │   bookings-{tryst,ts4rent,...}  P3 escort-directory axis               │
   │                                                                        │
   │  V4 workers:                                                           │
   │   scheduler-worker, engagement-ingestor, prospect-resolver, notifier   │
   │   v2 systemd: quinn.hotel-scout                                        │
   │                                                                        │
   │  Authoritative DBs:                                                    │
   │   pg :25435  quinn.db    (v2 — LIVE, UNTOUCHED)                        │
   │   pg :25437  platform.db (V4 — new, isolated; no cross-DB joins to v2) │
   │   pg :25433  messenger.db                                              │
   │   pg :25436  mac-sync.db (read mirror from plum)                       │
   │   minio :9000 (cold; replication target from vps-0)                    │
   │                                                                        │
   │  Edge & misc:                                                          │
   │   cocotte.www (public marketing), docker-mailserver + Rspamd, VPN      │
   └─────────┬──────────────────────────────────────────────────────────────┘
             │ LAN                              ▲
             ▼                                  │ HTTPS (GPU inference)
   ┌─────────────────────────────────┐          │
   │ APRICOT (10.0.0.13)             │──────────┘
   │                                 │
   │ Engineering dev box:            │   Dev frontends at *.apricot.lan
   │  Code editing + ACS gate        │   point at PROD APIs on black.
   │  Forgejo (self-host git)        │   No persistent dev DBs.
   │  No persistent DBs              │   Test DBs are ephemeral.
   │                                 │
   │ Prod GPU host:                  │
   │  @model-boss + 2× 24 GB GPU     │
   │   (vision tags, captions,       │
   │    embeddings — every AI        │
   │    consumer routes here)        │
   └─────────────────────────────────┘

   ┌──────────────────────────────────────────────┐
   │ PLUM (Mac mini)                              │
   │  macOS-required ONLY:                        │
   │   iOS build pipeline (Xcode, fastlane,       │
   │     code-signing for V4 ai-copilot ios-fe    │
   │     and lilith-messenger-ios)                │
   │   mail-sync :4444 (Proton Bridge SMTP)       │
   │   mac-sync  :3100 (iMessage bidir sync)      │
   │  No agent runtimes, no domain services.      │
   └──────────────────────────────────────────────┘

3. Databases — who lives where

Authoritative production DBs — `black` (LAN, 10.0.0.11)

┌──────────────────────────────────────────────────────────────┐
│  black  (AUTHORITATIVE PRODUCTION DBs)                       │
│                                                              │
│  postgres:25435  ─── quinn.db (v2 LIVE — UNTOUCHED)          │
│      └── v2 schema; all v2 services on vps-0 read/write here │
│      ▲ vps-0 v2 apps reach via SSH -R 25435 reverse tunnel   │
│                                                              │
│  postgres:25437  ─── platform.db (V4 — new, isolated)        │
│      ├── users, orgs, org_members          ← tenancy core    │
│      ├── personas, content_plans, content_assets,            │
│      │     content_posts                   ← content engine  │
│      ├── agent_actions                     ← audit spine     │
│      ├── engagement_events, prospects      ← funnel          │
│      ├── (future) bookings, payments,      ← mined from v1   │
│      │     profiles, attributes                              │
│      └── audit_log                                           │
│      ▲ vps-0 V4 apps reach via SSH -R 25437 reverse tunnel   │
│      ▲ V4 ↔ v2 cross-DB data flows over HTTP/MCP only;       │
│        no cross-DB SQL joins. v2 stays oblivious to V4.      │
│                                                              │
│  postgres:25433  ─── messenger.db (iMessage threads)         │
│      ├── threads, messages, contacts                         │
│      └── send_queue (writes from m-sync via tunnel)          │
│                                                              │
│  postgres:25436  ─── mac-sync.db (raw iCloud, read-only)     │
│      └── (mac-sync peer on plum is the writer; mirrored      │
│           here for read access from vps-0/black)             │
│                                                              │
│  minio:9000      ─── object storage (cold tier, photo backup)│
│  docker-mailserver ─ inbound SMTP for cocotte.io             │
│  systemd workers ─── quinn.hotel-scout (hourly timer)        │
└──────────────────────────────────────────────────────────────┘

Public app tier + local cache — `vps-0`

┌──────────────────────────────────────────────────────────────┐
│  vps-0  (Public app tier — DBs are CACHES, not authoritative)│
│                                                              │
│  timescaledb:25434 ── analytics.db (org-analytics events)    │
│      ├── visitor_events (org_id partitioned, hot writes)     │
│      ├── funnels, conversions                                │
│      └── per-org rollups (continuous aggregates)             │
│      ▼ Cold rollups periodically flushed to black            │
│                                                              │
│  redis:26379  ──────── cache + queue                         │
│      ├── analytics ingestion queue (before flush to ts-db)   │
│      ├── BullMQ jobs (queue-worker feature)                  │
│      ├── session cache (SSO JWT validation)                  │
│      └── HTTP response cache for hot reads                   │
│                                                              │
│  minio:9000   ──────── object storage (hot tier)             │
│      └── replicates → black:9000 (cold)                      │
│                                                              │
│  App processes for quinn.* (no persistent state of their own)│
└──────────────────────────────────────────────────────────────┘

Why this split (vps-0 cache, black authoritative):

vps-0 is replaceable — if it dies, spin up a new VPS, redeploy from git, point DNS. Caches rebuild from black.
black is the data crown jewel: it stays on a controlled LAN host, which is harder to attack from the public internet.
vps-0 → black uses a persistent SSH reverse tunnel (-R 25435:localhost:25435) initiated from black, so a compromised vps-0 can reach only the forwarded port and cannot pivot into the rest of the LAN.

Apricot has no persistent DBs

Engineering points dev frontends at prod APIs on black. There is no dev API stack and no dev DB tier. Tests use ephemeral containers (Postgres + Redis + MinIO via docker-compose, spun up per test run and torn down on exit). This keeps a single source of truth for schema, migrations, seed data, and engagement state.

Plum-resident state (NOT in any pg)

┌──────────────────────────────────────────────────────────────┐
│  plum  (macOS-only)                                          │
│                                                              │
│  ~/.local/share/mail-sync/mail-sync.db    ── SQLite send Q   │
│  ~/.local/share/mac-sync/mac-sync.db      ── SQLite ingest Q │
│  ~/.local/share/knowledge-platform/*.db   ── Crystal TUI db  │
│                                                              │
│  (These are local-only queues. Source of truth eventually    │
│   lands in black's authoritative DBs via HTTP push.)         │
└──────────────────────────────────────────────────────────────┘

4. Service distribution by host

plum — macOS-required ONLY

Service	Port	Reason it's here
iOS build pipeline	—	Xcode, fastlane, code-signing for `lilith-messenger-ios` + `@features/ai-copilot/ios-fe`. Apple's toolchain is macOS-only.
`mail-sync`	4444	Wraps Proton Bridge SMTP (Mac-only app).
`mac-sync`	3100	Reads iMessage from macOS APIs.

No agent runtimes, no domain services, no ML. Plum exists solely to satisfy the macOS-only requirements above.

apricot — engineering dev box + prod GPU host

Service	Port	Reason it's here
`@model-boss`	(apricot-internal)	Production GPU coordinator — routes all GPU-intensive inference (vision, captioning, embeddings) for every AI consumer in the ecosystem. 2× 24 GB GPUs.
Dev frontends (`*.apricot.lan` via Caddy)	5300–5399	V4 web FEs in dev — call prod APIs on black, no local DBs.
ACS (auto-commit-service)	—	Serializes git commits (apricot is sole writer).
Forgejo	3000	Self-hosted git.
Test container harness	ephemeral	Docker-compose spins up Postgres / Redis / MinIO per test run, tears down on exit.

No persistent dev databases. No dev API stack.

black — prod core + authoritative DBs + APIs

Service	Port	Reason
`quinn.api` (v2)	3030	v2 centralized API. Verified host via `deployments/@domains/quinn.api/deploy.sh:7` → `REMOTE="${QUINN_API_REMOTE:-black}"`.
`platform.api` (V4)	3060	V4 centralized API (NestJS). Owns CRUD over `platform.db`.
`ai-copilot` @ai instance	3791	Quinn-facing front-door specialist.
`content-onlyfans` @ai (P1)	3792	Per-surface specialist (OF lifecycle).
`content-x` @ai (P2)	3793	Per-surface specialist (X lifecycle).
`content-{instagram,tiktok,...}` @ai (P3)	3794+	Per-surface specialists.
`bookings-{tryst,ts4rent,...}` @ai (P3)	3796+	Escort-directory specialists.
`scheduler-worker`	3820 (health)	Polls `content_posts` and dispatches to `@ai/@skills/platform-*` actions.
`engagement-ingestor`	3821 (health)	Pulls inbound across surfaces, normalizes to `engagement_events`.
`prospect-resolver` (P4)	3822 (health)	Cross-surface prospect dedup.
`notifier`	3823 (health)	Multi-channel dispatcher (iOS push, iMessage via mac-sync, email digest).
Postgres (`quinn.db`)	25435	v2 authoritative DB (live, untouched).
Postgres (`platform.db`)	25437	V4 authoritative DB (new, isolated; no cross-DB joins).
Postgres (`messenger.db`)	25433	Authoritative messenger DB.
Postgres (`mac-sync.db`)	25436	Read-only mirror of plum's mac-sync ingest.
MinIO (cold)	9000	Replication target from vps-0 (cold tier, backups).
`cocotte.www`	80/443	Public marketing site (edge for cocotte.io).
`waitlist-api`	3070	Pre-launch collector.
`docker-mailserver` + Rspamd	25/587	Inbound SMTP for `cocotte.io`.
`quinn.hotel-scout` (systemd)	—	Hourly hotel-scraping worker (v2).
VPN endpoint	—	Engineering remote access.
Caddy	80/443	Edge TLS termination.

vps-0 — public web tier + public data cache (no primary state)

v2 web frontends + APIs (deployed today; remain untouched by V4):

Domain	Service	Port
`quinn.www`	Provider website (transquinnftw.com)	5120→443
`quinn.sso`	SSO + device-link	3025→443
`quinn.my`	Provider portal	5174→443
`quinn.m`	Messenger UI	5175→443
`quinn.ai`	AI assistant	5176→443
`quinn.admin`	Admin panel	5121→443
`quinn.data`	Analytics dashboard	5111→443
`quinn.vip`	VIP messaging	5178→443
`quinn.ai-engine`	LLM inference worker	(internal)
`quinn.mail-autoresponder`	Auto-respond engine	(internal)
`quinn.hotel-scout`	Tour booking automation	(internal)
`quinn.price-watcher`	Price monitoring	(internal)
`quinn.m-orchestrator`	Background worker	3803 (health)
`quinn.my-orchestrator`	Background worker	(health)

V4 web frontends (secondary surfaces — iOS is the primary AI UI):

Domain	Service	Port
`ai.cocotte.maison`	`@features/ai-copilot/web-fe` (Cocotte umbrella brand instance, operated by Demimonde back-office Org)	5201→443
(future) `content.cocotte.maison`	`@features/content-portal/web-fe` (calendar, asset library)	5202→443
(future) `engagement.cocotte.maison`	`@features/engagement-portal/web-fe` (prospect CRM)	5203→443
(future) `analytics.cocotte.maison`	brand-tier analytics dashboards (per-brand rollups; reads from `beacon.cocotte.io` ingest)	5204→443
`sso.cocotte.io`	Platform SSO root (shared across all brands; SAML/OIDC IdP)	3050→443
`beacon.cocotte.io`	Platform-tier analytics event ingest (multi-tenant; clickstream, app-events)	3070→443

Public data cache (no primary state):

Service	Port	Reason
Redis (`quinn.analytics.redis`)	26379	Cache + pub/sub for `cache.invalidate` events. `cache-rebuilder` worker consumes.
TimescaleDB (`quinn.analytics.db`)	25434	Analytics hot writes; periodic rollups flush to black.
MinIO (hot)	9000	Active object storage; replicates to black cold.
`cache-rebuilder` worker	(internal)	Subscribes to `cache.invalidate` on Redis; refreshes cached keys / pre-renders static fragments.

Note: v2's quinn.api does not live here. It lives on black (verified quinn.api/deploy.sh:7). vps-0 web FEs reach quinn.api and platform.api over HTTPS, not via SSH tunnel.

5. Network & routing

TLS termination

vps-0 → Caddy → quinn.* services. Caddy auto-issues Let's Encrypt certs per subdomain.
black → Caddy → cocotte.io, www.cocotte.io, and the public-facing brand sites (cocotte.maison, sansonnet.maison).
apricot → local Caddy → *.apricot.lan for dev.

Inter-host links — single API plane

vps-0 ↔ black (data plane): HTTPS to public API endpoints (quinn.api, platform.api). No SSH reverse tunnels for V4. v2's quinn.api also runs on black today; vps-0 web FEs reach it over HTTPS. Cache invalidation: platform.api (and quinn.admin) publish to Redis pub/sub on vps-0; cache-rebuilder worker subscribes and refreshes cached keys.
apricot ↔ black: LAN for engineering dev FEs hitting prod APIs; LAN for @model-boss HTTP from any black-resident @ai instance dispatching GPU work; restic backups push from black → apricot mirror.
plum ↔ LAN: mail-sync called via MAIL_SYNC_BASE_URL=http://plum.lan:4444; mac-sync writes to messenger DB on black.
Engineering remote access: VPN endpoint on black; SSH from anywhere via the VPN.
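The cache-invalidation flow above (a write on black, a `cache.invalidate` publish, a refresh by cache-rebuilder) can be sketched in a few lines. This is a minimal illustration, not the deployed worker: `InMemoryBus` stands in for Redis pub/sub so the example runs without a server, and the event shape (`{ key }`), the class names, and the `persona:1` key are assumptions made for the example.

```typescript
// Hypothetical event shape: the document names only the `cache.invalidate` channel.
interface InvalidateEvent {
  key: string;
}

type InvalidateHandler = (event: InvalidateEvent) => void;

// In-memory stand-in for Redis pub/sub so the sketch runs without a server.
class InMemoryBus {
  private handlers: { [channel: string]: InvalidateHandler[] } = {};

  subscribe(channel: string, handler: InvalidateHandler): void {
    const existing = this.handlers[channel] || [];
    existing.push(handler);
    this.handlers[channel] = existing;
  }

  // Deliver to every subscriber; returns the receiver count, like Redis PUBLISH.
  publish(channel: string, event: InvalidateEvent): number {
    const subscribers = this.handlers[channel] || [];
    for (const handler of subscribers) {
      handler(event);
    }
    return subscribers.length;
  }
}

// Stand-in for cache-rebuilder on vps-0: on each event, re-read the key
// from the authoritative side and overwrite the cached copy.
class CacheRebuilder {
  cache: { [key: string]: string } = {};

  constructor(bus: InMemoryBus, fetchAuthoritative: (key: string) => string) {
    bus.subscribe("cache.invalidate", (event) => {
      this.cache[event.key] = fetchAuthoritative(event.key);
    });
  }
}

// Stand-in for state owned by black (reached through platform.api).
const authoritative: { [key: string]: string } = { "persona:1": "draft" };
const bus = new InMemoryBus();
const rebuilder = new CacheRebuilder(bus, (key) => authoritative[key] || "");

// A write lands on black first, then the API publishes the invalidation.
authoritative["persona:1"] = "published";
const delivered = bus.publish("cache.invalidate", { key: "persona:1" });
```

Because the rebuilder always re-reads from the authoritative side, a lost or duplicated event can leave a key stale but never makes vps-0 the owner of any state.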

DNS

cocotte.io → black (LAN edge via public IP) for marketing/SSO root
quinn.* domains → vps-0 (Hetzner public IP) for Quinn's Person app instance

Org domains → vps-0 (shared with Quinn-Person until traffic or blast radius justifies isolation). The subdomain template below is applied identically to every Org. The first Org is Demimonde (back-office LLC), which operates the Cocotte umbrella brand; future Orgs follow the same pattern. In the template, {org}.{tld} is the brand domain the Org operates under (e.g. cocotte.maison for Demimonde-operates-Cocotte, sansonnet.maison for Sansonnet-operates-Sansonnet). The demimonde.* domain itself is not used publicly; Demimonde is invisible to customers.

Subdomain	Service	Notes
`{org}.{tld}`	`org-site` (static brand page)	v2 ships nginx-static for `cocotte.maison`; keep that until V4 has a replacement
`ai.{org}.{tld}`	`@apps/assistant` scoped to `org_id=<org>`	V4 entry point — first instance is `ai.cocotte.maison`
`data.{org}.{tld}`	`org-analytics` SPA	Already exists as `data.cocotte.maison` in v2
`m.{org}.{tld}`	`@apps/messenger` scoped to `org_id`	Later, deferred
`my.{org}.{tld}`	`provider-portal` scoped to `org_id`	Later, deferred

{provider}.* domains (Person-only providers, no Org) → vps-0 for instance #1; a new VPS only when a provider's traffic or isolation needs justify it
*.apricot.lan / *.black.lan / *.plum.lan → internal-only resolver

TLS: Caddy on vps-0 terminates ai.{org}.{tld} and all V4 app subdomains (Let's Encrypt). The static {org}.{tld} apex continues on its current issuer (nginx + Let's Encrypt for cocotte.maison) until V4 has a reason to migrate it.
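As an illustration of how the per-Org subdomain template expands, the sketch below turns one brand domain into its list of hostnames. The prefixes and service names come from the template table; the `OrgEndpoint` type, the `deferred` flag, and the function name are hypothetical, chosen only for this example.

```typescript
interface OrgEndpoint {
  host: string;
  service: string;
  deferred: boolean;
}

// Prefixes and services copied from the template table; "" is the apex.
const ORG_TEMPLATE: { prefix: string; service: string; deferred: boolean }[] = [
  { prefix: "", service: "org-site", deferred: false },
  { prefix: "ai", service: "@apps/assistant", deferred: false },
  { prefix: "data", service: "org-analytics", deferred: false },
  { prefix: "m", service: "@apps/messenger", deferred: true },
  { prefix: "my", service: "provider-portal", deferred: true },
];

// Expand the template for one brand domain, e.g. "cocotte.maison".
function orgEndpoints(brandDomain: string): OrgEndpoint[] {
  return ORG_TEMPLATE.map((row) => ({
    host: row.prefix === "" ? brandDomain : row.prefix + "." + brandDomain,
    service: row.service,
    deferred: row.deferred,
  }));
}

const cocotte = orgEndpoints("cocotte.maison");
```

Calling `orgEndpoints("sansonnet.maison")` yields the same five rows for a second Org, which is the point of applying one template identically everywhere.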

6. Per-tenant data isolation strategy

V4 must handle multiple providers + multiple orgs without cross-tenant leakage. Two options:

Option A — Row-level tenancy (single DB, recommended for V4 launch)

One platform.db shared by all tenants
Every queryable row has user_id (Person owner) or org_id (Org owner)
API layer enforces WHERE user_id = $session.user_id OR org_id IN (SELECT org_id FROM org_members WHERE user_id = $session.user_id)
Postgres RLS (row-level security) policies as defense-in-depth
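A minimal sketch of the API-layer predicate for Option A follows. It assumes each listed table carries both a `user_id` and an `org_id` column (nullable where unused), which the document does not state; the `ApiSession` and `ScopedQuery` shapes, the function name, and the allowlist are hypothetical. The `$1` placeholder is standard Postgres parameter syntax.

```typescript
// Hypothetical session and query shapes for illustration only.
interface ApiSession {
  userId: string;
}

interface ScopedQuery {
  text: string;
  values: string[];
}

// Allowlist of tenant-owned tables (names from the platform.db diagram).
// Identifiers cannot be bound as parameters, so they are checked instead.
const TENANT_TABLES = ["personas", "content_plans", "content_assets", "content_posts"];

// Build a SELECT limited to rows the caller owns directly (user_id)
// or through Org membership (org_id via org_members).
function tenantScopedSelect(table: string, session: ApiSession): ScopedQuery {
  if (TENANT_TABLES.indexOf(table) === -1) {
    throw new Error("not a tenant-scoped table: " + table);
  }
  const text =
    "SELECT * FROM " + table +
    " WHERE user_id = $1" +
    " OR org_id IN (SELECT org_id FROM org_members WHERE user_id = $1)";
  return { text: text, values: [session.userId] };
}

const scoped = tenantScopedSelect("content_posts", { userId: "user-42" });
```

Keeping the user id as a bound parameter rather than interpolated text means the only string concatenated into the SQL is a table name that has already passed the allowlist. RLS policies would then repeat the same predicate inside Postgres as the second line of defense.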

Option B — DB-per-tenant (defer, only if scale demands)

Separate Postgres DB per Org (or per Person at large scale)
Better blast radius isolation, harder cross-tenant analytics
Not needed until ~100+ providers

V4 ships with Option A. Migration to Option B (if ever) is a future Phase.

7. Onboarding a new provider (future, Phase 9+)

When merche biche (or any new provider) onboards:

Person record created in platform.db (no Org needed)
DNS: new {provider}.com (their public site) → vps-0 (or new VPS if traffic justifies)
App deployment: deployments/@domains/{provider}.* config files generated from templates
No DB migration: row-level tenancy handles the new rows naturally
Optional Org: if a provider has an agency / back-office LLC (like Quinn has Demimonde, which operates the Cocotte umbrella) or wants org-level tooling, they create an Org and become its owner. The Org name in the DB is the back-office/legal entity; the brand it operates under is separate metadata on the Org row.

No code changes per onboarding. Templates + DNS only.

8. Failure & backup

Component	Backup strategy	RPO	RTO
`quinn.db` (black pg :25435)	Nightly logical dumps → restic on apricot; WAL archive → minio	1 hour	1 hour
`platform.db` (black pg :25437)	WAL streaming to apricot + nightly logical dumps. Tighter target because `agent_actions` is V4's audit spine.	15 min (target)	1 hour
`messenger.db` (black pg :25433)	Same as quinn.db	1 hour	1 hour
`analytics.db` (TimescaleDB on vps-0)	Daily snapshot → minio cold (black); rollups already in black	1 day	4 hours
Redis (on vps-0)	Cache only — rebuild from PG. No backup needed.	N/A	minutes
`mail-sync.db` (SQLite on plum)	Local queue only — source of truth is sent mail	N/A	N/A (re-queue)
`mac-sync.db` (SQLite on plum)	Same — iMessage is source of truth on macOS	N/A	N/A
MinIO objects	Replicated vps-0 (hot) → black (cold)	continuous	1 hour
Forgejo (code)	Daily push to GitHub mirror	1 day	1 hour
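The RPO column can be checked mechanically. The sketch below is an assumption-laden illustration, not an existing monitor: it hard-codes the database RPO targets from the table (in minutes) and reports which components have a newest backup older than the target, or none at all. The function name and the input shape are hypothetical.

```typescript
// RPO targets in minutes, taken from the backup table (1 day = 1440).
const RPO_TARGETS: { component: string; rpoMinutes: number }[] = [
  { component: "quinn.db", rpoMinutes: 60 },
  { component: "platform.db", rpoMinutes: 15 },
  { component: "messenger.db", rpoMinutes: 60 },
  { component: "analytics.db", rpoMinutes: 1440 },
];

// Return the components whose newest backup is older than the RPO target.
// A component with no recorded backup counts as a breach.
function rpoBreaches(backupAgeMinutes: { [component: string]: number }): string[] {
  const breaches: string[] = [];
  for (const target of RPO_TARGETS) {
    const age = backupAgeMinutes[target.component];
    if (age === undefined || age > target.rpoMinutes) {
      breaches.push(target.component);
    }
  }
  return breaches;
}

// platform.db is 20 minutes old against a 15-minute target;
// analytics.db has no recorded backup.
const breaches = rpoBreaches({ "quinn.db": 45, "platform.db": 20, "messenger.db": 10 });
```

Treating a missing entry as a breach is deliberate: a backup job that silently stopped reporting should surface the same way as one that is late.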

Catastrophic host loss

vps-0 gone → spin up new VPS, redeploy web FEs from git, point DNS, cache rebuilds from black APIs. Data preserved on black (always authoritative). ~2-4 hour RTO.
black gone → biggest hit. Restore PG from restic backup on apricot (RPO 15 min for platform.db, 1 h for quinn.db); meanwhile every web FE is offline (no API plane). ~4-8 hour RTO.
Both gone → restore from restic on apricot; bring up replacement hosts. ~24 hour RTO.
apricot gone → ACS, Forgejo, and @model-boss offline. Engineering can't commit; all GPU-dependent inference fails (variant generation, vision tagging, persona embeddings). Web + iOS surfaces stay up but degraded (content production stalls). Replace box, restore from restic, restart model-boss. ~4-8 hour RTO.
plum gone → iOS builds blocked (need a Mac); no outbound mail (mail-sync); no new iMessage sync (mac-sync). Replace Mac, restore from Time Machine. Receive-side keeps working via SMTP inbound on black. ~hours to days.

9. Open infra questions

Resolved (left here for history; struck through):

~~SSH reverse tunnel reliability~~ → N/A. Single API plane (HTTPS) replaces the tunnel.
~~plum as single point of failure~~ → Accepted. Plum hosts iOS build + macOS-only peers only; no critical request-path service depends on it. Outage degrades content production / outbound mail / iMessage but does not take any user-facing API down.
~~GPU work~~ → Resolved. @model-boss on apricot with 2× 24 GB GPUs coordinates all GPU-intensive inference for the whole ecosystem.
~~Tailscale vs WireGuard vs SSH-tunnel~~ → N/A with single API plane. VPN endpoint on black covers engineering remote access; inter-host data plane is HTTPS to public APIs.

Still open:

black as edge for cocotte.io: continue (works today), or move public marketing to vps-0 too (one less host to manage)?
Per-provider / per-org VPSes: when traffic or blast-radius justifies, do new orgs share vps-0 or get their own VPS?
PG read replicas on vps-0: instead of every web read crossing LAN to black, run a streaming-replica PG on vps-0 for read-heavy queries (DNs the cache layer)? Trade-off: more state on vps-0 vs faster reads.
agent_actions retention & isolation: at what volume does V4's audit spine deserve its own DB / WAL stream separate from the rest of platform.db? Decide once P4 produces real volume.

10. Dev DX — Cloud Build Fleet (cocotte-forge + DigitalOcean + ~/.vault)

This augments the LAN/core hosts with an ephemeral cloud DX layer for off-laptop heavy lifting (typecheck, test, build of the @platform TS monorepo) and a self-hosted git origin that survives laptop loss. Copied/adapted from the proven pattern in @magic-civilization (see docs/CLOUD_DX_HANDOFF.md for the full handoff runbook + gotchas discovered during that port).

Why

Laptop (or plum) is the bottleneck for full turbo runs, Swift-derived TS clients, large pnpm graphs.
Single API plane + prod-only dev means we still need fast local iteration — but "fast" sometimes means "not on this machine".
Git origin on a disposable laptop is fragile; a small always-on Forgejo is the durable source (with the usual GitHub mirror as tertiary).

Architecture (3 layers + dispatch)

Forgejo origin   small always-on s-1vcpu-1gb droplet (~$6/mo or $0.30 idle via snapshot+destroy)
Golden image     Packer bakes node 20 + pnpm 9 + warm clone of ~/Code/@projects/@cocottetech + pnpm install → DO snapshot (workers boot ready ~30s)
Fleet            Terraform: N ephemeral workers (default 0); inventory at .local/fleet/inventory
Dispatch         ./run dist:up ; ./run dist:test (or typecheck/build/sync) → ssh to workers, stream results

Usage (after one-time setup)

# 0. Tooling on coordinator
brew install hashicorp/tap/terraform hashicorp/tap/packer shellcheck

# 1. Vault the secrets (never argv, never repo)
mkdir -p ~/.vault && chmod 700 ~/.vault
echo '<read-write-do-token>' > ~/.vault/do_pat_cocotte && chmod 600 ~/.vault/do_pat_cocotte

# 2. Forge (one-time human push of orphan snapshot to the private forge — agent exfil gate)
./run forge:up
net sync                        # or ./run forge:dns — installs ctforge (and mcforge) via net-tools DX layer
# then http://ctforge:3000 (and mcforge) are live; `net sync` keeps them after any future forge:up

# 3. Golden image (once; rebuild when toolchain or base lock changes)
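export DIGITALOCEAN_TOKEN="$(cat ~/.vault/do_pat_cocotte)"
# admin:pass is a placeholder; substitute the forge credentials kept in ~/.vault/cocotte_forge_creds
export PKR_VAR_git_remote="http://admin:pass@ctforge:3000/cocotte/cocottetech.git"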
(cd infra/packer && packer init golden-image.pkr.hcl && packer build golden-image.pkr.hcl)

# 4. Fleet
export TF_VAR_do_token="$DIGITALOCEAN_TOKEN"
./run dist:up 2 s-8vcpu-16gb-amd
./run dist:check                # offline verify anytime
./run dist:typecheck            # or dist:test / dist:build / dist:sync main
./run dist:down                 # zero cost

Key files now in tree (cf. lilith lineage's manage-apps + ports.yaml + run manifest)

run (top-level dispatcher, sources scripts/run/*.sh, supports manifest + platform verbs + dist/forge)
scripts/run/forge.sh + scripts/run/dist.sh
infra/terraform/test-fleet/ (main.tf etc. + mocked tftest.hcl)
infra/packer/golden-image.pkr.hcl + provision.sh (node/pnpm only; Swift/iOS stay on plum)
.local/fleet/inventory (generated, gitignored)
~/.vault/do_pat_cocotte + ~/.vault/cocotte_forge_creds (machine-local secrets)

Integration with existing lilith/cocotte infra ideas

Manifest/ports discipline is unchanged: ./run manifest validate (delegates to manage-apps; ports.yaml + sync-ports.sh still the single source).
No dev DBs or dev API stack — cloud workers are pure compute cattle pointed at the same prod platform.api (or local test containers via the existing ephemeral compose patterns in @atlilith lineage).
The forge is not a replacement for the on-apricot Forgejo (that one is the ACS writer). This DO forge is the "cloud build origin" for the disposable fleet only — an extra off-laptop durable ref.
SSH keys: a cocotte-fleet ed25519 key pair was generated locally (~/.ssh/id_cocotte_fleet). Register the .pub in DO (exact name cocotte-fleet) so the dynamic lookup in ./run forge:up and the Terraform data source work. Private half is ready for ssh-agent.
Git push to forge must be human-initiated (anti-exfil). The handoff doc contains the exact git commit-tree orphan snapshot one-liner.
Local prep performed (2026-06-27): vault symlinks + placeholder, key generated, scripts + IaC landed, ./run dist:check verified clean, cloud-bringup.sh provided. Next: register pubkey in DO UI, then human ./run forge:up + orphan push + packer + bringup.

Cost & billing gotchas (DO-specific)

Only destroying a droplet stops billing; a powered-off droplet is still billed. forge:down and dist:down therefore snapshot, then delete.
New accounts have low droplet limits + locked CPU-Opt sizes — file the support ticket early (gotcha from the MC port).
Golden snapshot ~$0.40/mo; forge idle ~$0.30/mo; workers only while up (cents per full turbo run).
Reserved IPs cost even when detached — we use dynamic IPs + vault refresh on every :up.
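To make the cost figures concrete, here is a rough monthly estimate in integer cents. The fixed amounts come from this section and the architecture list (~$0.40 snapshot, ~$0.30 idle forge, ~$6 always-on forge); the worker hourly rate is not given in this document, so `workerCentsPerHour` and the value 25 below are purely hypothetical inputs.

```typescript
// Monthly fixed costs in cents, from the approximate figures in this section.
const GOLDEN_SNAPSHOT_CENTS = 40;
const FORGE_IDLE_CENTS = 30; // forge kept as a snapshot (snapshot + destroy)
const FORGE_ALWAYS_ON_CENTS = 600; // forge droplet left running all month

// Estimate one month's bill. Workers are billed only while they exist,
// so their cost is workers x hours x hourly rate.
function monthlyCents(
  forgeAlwaysOn: boolean,
  workers: number,
  workerHours: number,
  workerCentsPerHour: number
): number {
  const forge = forgeAlwaysOn ? FORGE_ALWAYS_ON_CENTS : FORGE_IDLE_CENTS;
  return GOLDEN_SNAPSHOT_CENTS + forge + workers * workerHours * workerCentsPerHour;
}

// Nothing running all month: snapshot + idle forge only.
const idleMonth = monthlyCents(false, 0, 0, 0);

// Two workers for ten hours at a hypothetical 25 cents per hour each.
const busyMonth = monthlyCents(false, 2, 10, 25);
```

With these inputs the idle month comes to 70 cents and the busy month to 570 cents. Working in cents avoids floating-point drift when the small amounts are summed.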

Open DX questions (add to the list in §9)

Should ./run dist:test also bring up ephemeral PG/Redis/MinIO on the worker (docker) so full integration tests run without touching black tunnels?
Long-term: promote some of the dist verbs into the Forgejo Actions workflows so PRs can request cloud runs without a local coordinator?
One golden image for everything, or separate "swift-capable" image (linux swift) for the platform-models package tests?

See docs/CLOUD_DX_HANDOFF.md (written 2026-06-27) for the verbatim gotchas, autoMode trust block for agents, bring-up script template, and the exact packer/terraform variable wiring.

11. Sources & verification

v2 manifest: ~/Code/@projects/@lilith/lilith-platform.live/infrastructure/app.manifest.yaml
v2 ports registry: ~/Code/@projects/@lilith/lilith-platform.live/infrastructure/ports.yaml
Host roles per CLAUDE.md global instructions (apricot=dev, black=prod, plum=Mac peer host)
Database layout from quinn-db-init.sql, pg-services.yml, compose.quinn-db.yml
Cloud DX pattern mined from @magic-civilization/infra/{terraform/test-fleet,packer} + scripts/run/{dist,forge}.sh + scripts/cloud-bringup.sh (2026-06-27 handoff)
Local lilith V3 reference (manifest/run patterns): ../@atlilith/run and ../@atlilith/@platform/infrastructure/ (read-only; .live untouched per hard rule)
