Natalie 3975a57ec0 docs(infra): update INFRA.md to document no more black for CI/runners — ct-forge on DO on-demand horizontal runners via Terraform IaC (modeled on LP)
- Added row for DO (ct-forge runners): ephemeral, golden image, Terraform scale.
- References migration, LP script logic in cloud-init.
2026-06-28 17:32:57 -04:00


@cocottetech — Infrastructure Design

Status: Design phase | Date: 2026-05-16 | Companion to: DESIGN.md


1. Hosts at a glance

| Host | Type | Role | Network | OS |
| --- | --- | --- | --- | --- |
| plum | Mac mini (Apple Silicon) | macOS-required services only (iOS build, mail-sync, mac-sync) | LAN: plum.lan | macOS |
| apricot | Linux box (home lab) | Engineering dev box + prod GPU host (@model-boss, 2× 24 GB GPU) | LAN: 10.0.0.13 | Bluefin/Bootc |
| black | Linux box (home lab) | Prod DBs + quinn.api (v2) + platform.api (V4) + VPN + cocotte.io edge | LAN: 10.0.0.11, Public: cocotte.io | Linux |
| DO (ct-forge runners) | DigitalOcean droplets (ephemeral, on-demand) | Forgejo Actions runners for ct-forge CI (package publish, builds). Horizontally scaled via Terraform (var.runners=N, 0=zero cost). Golden image from packer + cloud-init replicating LP runner logic (no more black for CI/runners). | nyc3 etc. | Ubuntu |
| vps-0 | Hetzner VPS (alias quinn-vps) | Public web tier + public data cache (no primary state) | Public IP | Linux |

Why this split

  • plum exists solely because we ship a native iOS app and run macOS-only peers. Xcode + fastlane + code-signing for lilith-messenger-ios / @features/ai-copilot/ios-fe; mail-sync wraps Proton Bridge SMTP; mac-sync reads iMessage from macOS APIs. No agent runtimes, no domain services, no ML workloads. If iOS + Proton + iMessage ever stop being requirements, plum goes away.
  • apricot is (1) the engineering workstation — code editing, ACS commit gate, Forgejo, dev frontends pointed at prod APIs on black; (2) the production GPU host running @model-boss with two 24 GB GPUs, coordinating all GPU-intensive inference (vision, captioning, embeddings) for every AI consumer in the ecosystem. No persistent dev DBs — engineering works against prod APIs; test DBs are ephemeral containers spun up per test run.
  • black holds every authoritative production database (quinn.db :25435, platform.db :25437, messenger.db :25433, mac-sync.db mirror :25436) and runs every centralized API: quinn.api (v2, verified via deployments/@domains/quinn.api/deploy.sh:7, REMOTE="${QUINN_API_REMOTE:-black}") and platform.api (V4, port 3060). Also: V4 @ai instance fleet (ai-copilot, content-{surface}, ...), V4 workers (scheduler, ingestor, resolver, notifier), VPN endpoint for engineering access, public edge for cocotte.io.
  • vps-0 serves the public web tier (V4 web frontends + v2 frontends) and the public data cache (Redis pub/sub + cache, TimescaleDB hot writes for analytics, MinIO hot tier). It owns no primary state — every persistent datum belongs to black. Writes through quinn.admin / platform.api fire cache.invalidate events that vps-0's cache rebuilder consumes. vps-0 is replaceable.
  • (No separate vps-quinn host — that name in the manifest is just an alias for vps-0.)
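
The invalidation flow between black and vps-0 can be sketched in-process. This is a minimal illustration only, assuming a `cache.invalidate` event carries the affected cache key; `Bus`, `CacheRebuilder`, and `fetch_from_black` are hypothetical names, and the real deployment uses Redis pub/sub on vps-0 and the APIs on black, neither of which is modelled here.

```python
from collections import defaultdict
from typing import Callable, Dict, List


class Bus:
    """Tiny in-process stand-in for Redis pub/sub (assumption: one channel per event name)."""

    def __init__(self) -> None:
        self._subscribers: Dict[str, List[Callable[[str], None]]] = defaultdict(list)

    def subscribe(self, channel: str, handler: Callable[[str], None]) -> None:
        self._subscribers[channel].append(handler)

    def publish(self, channel: str, message: str) -> None:
        for handler in self._subscribers[channel]:
            handler(message)


class CacheRebuilder:
    """Hypothetical cache-rebuilder: holds copies of authoritative data, never primary state."""

    def __init__(self, bus: Bus, fetch_from_black: Callable[[str], str]) -> None:
        self.cache: Dict[str, str] = {}
        self._fetch = fetch_from_black
        bus.subscribe("cache.invalidate", self.on_invalidate)

    def on_invalidate(self, key: str) -> None:
        # Re-read the authoritative value instead of trusting the cached copy.
        self.cache[key] = self._fetch(key)


# Dict standing in for black's authoritative store.
authoritative = {"profile:1": "v1"}
bus = Bus()
rebuilder = CacheRebuilder(bus, lambda key: authoritative[key])

bus.publish("cache.invalidate", "profile:1")  # initial fill
authoritative["profile:1"] = "v2"             # a write lands on black
bus.publish("cache.invalidate", "profile:1")  # the write path fires the event
print(rebuilder.cache["profile:1"])           # v2
```

Because the rebuilder only ever re-reads from the authoritative side, losing its cache loses nothing that cannot be refetched, which is the property that makes vps-0 replaceable.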

2. Topology — ASCII

                          ┌──────────────────────────────────────┐
                          │           PUBLIC INTERNET            │
                          └────────────┬─────────────────────────┘
                                       │
                  ┌────────────────────┼────────────────────────────┐
                  │                    │                            │
                  ▼                    ▼                            ▼
        ┌──────────────────┐  ┌─────────────────┐  ┌──────────────────────────┐
        │  cocotte.io      │  │  quinn.* (v2)   │  │  cocotte.maison (org)    │
        │  (marketing)     │  │                 │  │  ai.cocotte.maison (V4)  │
        │                  │  │                 │  │  sansonnet.maison, ...   │
        └─────────┬────────┘  └────────┬────────┘  └────────┬─────────────────┘
                  │ HTTPS              │ HTTPS              │ HTTPS
                  ▼                    ▼                    ▼
       ┌───────────────────┐    ┌───────────────────────────────────────────────┐
       │  BLACK Caddy      │    │  VPS-0 Caddy → public web tier + cache reads  │
       │  (cocotte.io      │    │                                               │
       │   edge only)      │    │  V4 web FEs:                                  │
       └─────────┬─────────┘    │   ai.cocotte.maison (ai-copilot web-fe) :5201 │
                 │              │   content-portal, engagement-portal (later)   │
                 │              │  v2 web FEs:                                  │
                 │              │   quinn.www, .my, .m, .ai, .admin, .data, .vip│
                 │              │                                               │
                 │              │  Public data cache (no primary state):        │
                 │              │   redis :26379  (cache + pub/sub —            │
                 │              │     cache-rebuilder consumes cache.invalidate)│
                 │              │   timescaledb :25434 (analytics hot writes;   │
                 │              │     periodic rollups → black)                 │
                 │              │   minio :9000 (hot; replicates → black cold)  │
                 │              └─────────────────────┬─────────────────────────┘
                 │                                    │ HTTPS → APIs on black
                 │  ┌─────────────────────────────────┘
                 ▼  ▼
   ┌────────────────────────────────────────────────────────────────────────┐
   │  BLACK (10.0.0.11) — PROD CORE                                         │
   │                                                                        │
   │  Centralized APIs (single API plane for all web + iOS):                │
   │   quinn.api    :3030  v2 (verified deploys here — deploy.sh:7)         │
   │   platform.api :3060  V4 (NestJS, new)                                 │
   │                                                                        │
   │  V4 @ai fleet (one process per specialist):                            │
   │   ai-copilot       :3791   front-door (Quinn-facing)                   │
   │   content-onlyfans :3792   P1                                          │
   │   content-x        :3793   P2                                          │
   │   content-{...}    :3794+  P3                                          │
   │   bookings-{tryst,ts4rent,...}  P3 escort-directory axis               │
   │                                                                        │
   │  V4 workers:                                                           │
   │   scheduler-worker, engagement-ingestor, prospect-resolver, notifier   │
   │   v2 systemd: quinn.hotel-scout                                        │
   │                                                                        │
   │  Authoritative DBs:                                                    │
   │   pg :25435  quinn.db    (v2 — LIVE, UNTOUCHED)                        │
   │   pg :25437  platform.db (V4 — new, isolated; no cross-DB joins to v2) │
   │   pg :25433  messenger.db                                              │
   │   pg :25436  mac-sync.db (read mirror from plum)                       │
   │   minio :9000 (cold; replication target from vps-0)                    │
   │                                                                        │
   │  Edge & misc:                                                          │
   │   cocotte.www (public marketing), docker-mailserver + Rspamd, VPN      │
   └─────────┬──────────────────────────────────────────────────────────────┘
             │ LAN                              ▲
             ▼                                  │ HTTPS (GPU inference)
   ┌─────────────────────────────────┐          │
   │ APRICOT (10.0.0.13)             │──────────┘
   │                                 │
   │ Engineering dev box:            │   Dev frontends at *.apricot.lan
   │  Code editing + ACS gate        │   point at PROD APIs on black.
   │  Forgejo (self-host git)        │   No persistent dev DBs.
   │  No persistent DBs              │   Test DBs are ephemeral.
   │                                 │
   │ Prod GPU host:                  │
   │  @model-boss + 2× 24 GB GPU     │
   │   (vision tags, captions,       │
   │    embeddings — every AI        │
   │    consumer routes here)        │
   └─────────────────────────────────┘

   ┌──────────────────────────────────────────────┐
   │ PLUM (Mac mini)                              │
   │  macOS-required ONLY:                        │
   │   iOS build pipeline (Xcode, fastlane,       │
   │     code-signing for V4 ai-copilot ios-fe    │
   │     and lilith-messenger-ios)                │
   │   mail-sync :4444 (Proton Bridge SMTP)       │
   │   mac-sync  :3100 (iMessage bidir sync)      │
   │  No agent runtimes, no domain services.      │
   └──────────────────────────────────────────────┘

3. Databases — who lives where

Authoritative production DBs — black (LAN, 10.0.0.11)

┌──────────────────────────────────────────────────────────────┐
│  black  (AUTHORITATIVE PRODUCTION DBs)                       │
│                                                              │
│  postgres:25435  ─── quinn.db (v2 LIVE — UNTOUCHED)          │
│      └── v2 schema; all v2 services on vps-0 read/write here │
│      ▲ vps-0 v2 apps reach via SSH -R 25435 reverse tunnel   │
│                                                              │
│  postgres:25437  ─── platform.db (V4 — new, isolated)        │
│      ├── users, orgs, org_members          ← tenancy core    │
│      ├── personas, content_plans, content_assets,            │
│      │     content_posts                   ← content engine  │
│      ├── agent_actions                     ← audit spine     │
│      ├── engagement_events, prospects      ← funnel          │
│      ├── (future) bookings, payments,      ← mined from v1   │
│      │     profiles, attributes                              │
│      └── audit_log                                           │
│      ▲ vps-0 V4 apps reach platform.api over HTTPS; no tunnel│
│      ▲ V4 ↔ v2 cross-DB data flows over HTTP/MCP only;       │
│        no cross-DB SQL joins. v2 stays oblivious to V4.      │
│                                                              │
│  postgres:25433  ─── messenger.db (iMessage threads)         │
│      ├── threads, messages, contacts                         │
│      └── send_queue (writes from m-sync via tunnel)          │
│                                                              │
│  postgres:25436  ─── mac-sync.db (raw iCloud, read-only)     │
│      └── (mac-sync peer on plum is the writer; mirrored      │
│           here for read access from vps-0/black)             │
│                                                              │
│  minio:9000      ─── object storage (cold tier, photo backup)│
│  docker-mailserver ─ inbound SMTP for cocotte.io             │
│  systemd workers ─── quinn.hotel-scout (hourly timer)        │
└──────────────────────────────────────────────────────────────┘

Public app tier + local cache — vps-0

┌──────────────────────────────────────────────────────────────┐
│  vps-0  (Public app tier — DBs are CACHES, not authoritative)│
│                                                              │
│  timescaledb:25434 ── analytics.db (org-analytics events)    │
│      ├── visitor_events (org_id partitioned, hot writes)     │
│      ├── funnels, conversions                                │
│      └── per-org rollups (continuous aggregates)             │
│      ▼ Cold rollups periodically flushed to black            │
│                                                              │
│  redis:26379  ──────── cache + queue                         │
│      ├── analytics ingestion queue (before flush to ts-db)   │
│      ├── BullMQ jobs (queue-worker feature)                  │
│      ├── session cache (SSO JWT validation)                  │
│      └── HTTP response cache for hot reads                   │
│                                                              │
│  minio:9000   ──────── object storage (hot tier)             │
│      └── replicates → black:9000 (cold)                      │
│                                                              │
│  App processes for quinn.* (no persistent state of their own)│
└──────────────────────────────────────────────────────────────┘

Why this split (vps-0 cache, black authoritative):

  • vps-0 is replaceable — if it dies, spin up a new VPS, redeploy from git, point DNS. Caches rebuild from black.
  • black is the data crown jewel — kept on a controlled LAN host, harder to attack from public internet.
  • vps-0 → black uses a persistent SSH reverse tunnel (-R 25435:localhost:25435) initiated from black, so a compromised vps-0 cannot be used as a pivot back into the LAN.

Apricot has no persistent DBs

Engineering points dev frontends at prod APIs on black. There is no dev API stack and no dev DB tier. Tests use ephemeral containers (Postgres + Redis + MinIO via docker-compose, spun up per test run and torn down on exit). This keeps a single source of truth for schema, migrations, seed data, and engagement state.
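
The per-run lifecycle can be sketched as a context manager that always tears the stack down, even when the tests fail. This is a sketch under assumptions: the compose file name `compose.test.yml` and the `ephemeral_stack` helper are hypothetical, and the command runner is injectable so the example below runs without Docker installed.

```python
import subprocess
from contextlib import contextmanager
from typing import Callable, Iterator, List

Runner = Callable[[List[str]], None]


def run_command(cmd: List[str]) -> None:
    """Default runner: execute the command and raise if it exits non-zero."""
    subprocess.run(cmd, check=True)


@contextmanager
def ephemeral_stack(compose_file: str, run: Runner = run_command) -> Iterator[None]:
    """Bring up the test containers, yield to the test run, then always tear down."""
    base = ["docker", "compose", "-f", compose_file]
    try:
        run(base + ["up", "-d", "--wait"])
        yield
    finally:
        # Remove containers and their volumes so no state survives the run.
        run(base + ["down", "--volumes"])


# Dry run with a recording runner instead of real Docker calls.
calls: List[List[str]] = []
with ephemeral_stack("compose.test.yml", run=calls.append):
    calls.append(["pytest"])  # stand-in for the actual test command

for call in calls:
    print(" ".join(call))
```

Putting the teardown in `finally` is the design point: a crashed test run still leaves no Postgres, Redis, or MinIO volume behind on apricot.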

Plum-resident state (NOT in any pg)

┌──────────────────────────────────────────────────────────────┐
│  plum  (macOS-only)                                          │
│                                                              │
│  ~/.local/share/mail-sync/mail-sync.db    ── SQLite send Q   │
│  ~/.local/share/mac-sync/mac-sync.db      ── SQLite ingest Q │
│  ~/.local/share/knowledge-platform/*.db   ── Crystal TUI db  │
│                                                              │
│  (These are local-only queues. Source of truth eventually    │
│   lands in black's authoritative DBs via HTTP push.)         │
└──────────────────────────────────────────────────────────────┘

4. Service distribution by host

plum — macOS-required ONLY

| Service | Port | Reason it's here |
| --- | --- | --- |
| iOS build pipeline | | Xcode, fastlane, code-signing for lilith-messenger-ios + @features/ai-copilot/ios-fe. Apple's toolchain is macOS-only. |
| mail-sync | 4444 | Wraps Proton Bridge SMTP (Mac-only app). |
| mac-sync | 3100 | Reads iMessage from macOS APIs. |

No agent runtimes, no domain services, no ML. Plum exists solely to satisfy the macOS-only requirements above.

apricot — engineering dev box + prod GPU host

| Service | Port | Reason it's here |
| --- | --- | --- |
| @model-boss | (apricot-internal) | Production GPU coordinator: routes all GPU-intensive inference (vision, captioning, embeddings) for every AI consumer in the ecosystem. 2× 24 GB GPUs. |
| Dev frontends (*.apricot.lan via Caddy) | 5300-5399 | V4 web FEs in dev; they call prod APIs on black, no local DBs. |
| ACS (auto-commit-service) | | Serializes git commits (apricot is sole writer). |
| Forgejo | 3000 | Self-hosted git. |
| Test container harness | ephemeral | Docker-compose spins up Postgres / Redis / MinIO per test run, tears down on exit. |

No persistent dev databases. No dev API stack.

black — prod core + authoritative DBs + APIs

| Service | Port | Reason |
| --- | --- | --- |
| quinn.api (v2) | 3030 | v2 centralized API. Verified host via deployments/@domains/quinn.api/deploy.sh:7, REMOTE="${QUINN_API_REMOTE:-black}". |
| platform.api (V4) | 3060 | V4 centralized API (NestJS). Owns CRUD over platform.db. |
| ai-copilot @ai instance | 3791 | Quinn-facing front-door specialist. |
| content-onlyfans @ai (P1) | 3792 | Per-surface specialist (OF lifecycle). |
| content-x @ai (P2) | 3793 | Per-surface specialist (X lifecycle). |
| content-{instagram,tiktok,...} @ai (P3) | 3794+ | Per-surface specialists. |
| bookings-{tryst,ts4rent,...} @ai (P3) | 3796+ | Escort-directory specialists. |
| scheduler-worker | 3820 (health) | Polls content_posts and dispatches to @ai/@skills/platform-* actions. |
| engagement-ingestor | 3821 (health) | Pulls inbound across surfaces, normalizes to engagement_events. |
| prospect-resolver (P4) | 3822 (health) | Cross-surface prospect dedup. |
| notifier | 3823 (health) | Multi-channel dispatcher (iOS push, iMessage via mac-sync, email digest). |
| Postgres (quinn.db) | 25435 | v2 authoritative DB (live, untouched). |
| Postgres (platform.db) | 25437 | V4 authoritative DB (new, isolated; no cross-DB joins). |
| Postgres (messenger.db) | 25433 | Authoritative messenger DB. |
| Postgres (mac-sync.db) | 25436 | Read-only mirror of plum's mac-sync ingest. |
| MinIO (cold) | 9000 | Replication target from vps-0 (cold tier, backups). |
| cocotte.www | 80/443 | Public marketing site (edge for cocotte.io). |
| waitlist-api | 3070 | Pre-launch collector. |
| docker-mailserver + Rspamd | 25/587 | Inbound SMTP for cocotte.io. |
| quinn.hotel-scout (systemd) | | Hourly hotel-scraping worker (v2). |
| VPN endpoint | | Engineering remote access. |
| Caddy | 80/443 | Edge TLS termination. |

vps-0 — public web tier + public data cache (no primary state)

v2 web frontends + APIs (deployed today; remain untouched by V4):

| Domain | Service | Port |
| --- | --- | --- |
| quinn.www | Provider website (transquinnftw.com) | 5120→443 |
| quinn.sso | SSO + device-link | 3025→443 |
| quinn.my | Provider portal | 5174→443 |
| quinn.m | Messenger UI | 5175→443 |
| quinn.ai | AI assistant | 5176→443 |
| quinn.admin | Admin panel | 5121→443 |
| quinn.data | Analytics dashboard | 5111→443 |
| quinn.vip | VIP messaging | 5178→443 |
| quinn.ai-engine | LLM inference worker | (internal) |
| quinn.mail-autoresponder | Auto-respond engine | (internal) |
| quinn.hotel-scout | Tour booking automation | (internal) |
| quinn.price-watcher | Price monitoring | (internal) |
| quinn.m-orchestrator | Background worker | 3803 (health) |
| quinn.my-orchestrator | Background worker | (health) |

V4 web frontends (secondary surfaces — iOS is the primary AI UI):

| Domain | Service | Port |
| --- | --- | --- |
| ai.cocotte.maison | @features/ai-copilot/web-fe (Cocotte umbrella brand instance, operated by Demimonde back-office Org) | 5201→443 |
| (future) content.cocotte.maison | @features/content-portal/web-fe (calendar, asset library) | 5202→443 |
| (future) engagement.cocotte.maison | @features/engagement-portal/web-fe (prospect CRM) | 5203→443 |
| (future) analytics.cocotte.maison | brand-tier analytics dashboards (per-brand rollups; reads from beacon.cocotte.io ingest) | 5204→443 |
| sso.cocotte.io | Platform SSO root (shared across all brands; SAML/OIDC IdP) | 3050→443 |
| beacon.cocotte.io | Platform-tier analytics event ingest (multi-tenant; clickstream, app-events) | 3070→443 |

Public data cache (no primary state):

| Service | Port | Reason |
| --- | --- | --- |
| Redis (quinn.analytics.redis) | 26379 | Cache + pub/sub for cache.invalidate events. cache-rebuilder worker consumes. |
| TimescaleDB (quinn.analytics.db) | 25434 | Analytics hot writes; periodic rollups flush to black. |
| MinIO (hot) | 9000 | Active object storage; replicates to black cold. |
| cache-rebuilder worker | (internal) | Subscribes to cache.invalidate on Redis; refreshes cached keys / pre-renders static fragments. |

Note: v2's quinn.api does not live here. It lives on black (verified quinn.api/deploy.sh:7). vps-0 web FEs reach quinn.api and platform.api over HTTPS, not via SSH tunnel.


5. Network & routing

TLS termination

  • vps-0 → Caddy → quinn.* services. Caddy auto-issues Let's Encrypt certs per subdomain.
  • black → Caddy → cocotte.io, www.cocotte.io, and the public-facing brand sites (cocotte.maison, sansonnet.maison).
  • apricot → local Caddy → *.apricot.lan for dev.
  • vps-0 ↔ black (data plane): HTTPS to public API endpoints (quinn.api, platform.api). No SSH reverse tunnels for V4. v2's quinn.api also runs on black today; vps-0 web FEs reach it over HTTPS. Cache invalidation: platform.api (and quinn.admin) publish to Redis pub/sub on vps-0; cache-rebuilder worker subscribes and refreshes cached keys.
  • apricot ↔ black: LAN for engineering dev FEs hitting prod APIs; LAN for @model-boss HTTP from any black-resident @ai instance dispatching GPU work; restic backups push from black → apricot mirror.
  • plum ↔ LAN: mail-sync called via MAIL_SYNC_BASE_URL=http://plum.lan:4444; mac-sync writes to messenger DB on black.
  • Engineering remote access: VPN endpoint on black; SSH from anywhere via the VPN.

DNS

  • cocotte.io → black (LAN edge via public IP) for marketing/SSO root
  • quinn.* domains → vps-0 (Hetzner public IP) for Quinn's Person app instance
  • Org domains → vps-0 (shared with Quinn-Person until traffic/blast-radius justifies isolation). Subdomain template, applied identically to every Org. The first Org is Demimonde (back-office LLC) which operates the Cocotte umbrella brand; future Orgs follow the same pattern. Note: {org}.{tld} in the template = the brand TLD the Org operates under (e.g. cocotte.maison for Demimonde-operates-Cocotte, sansonnet.maison for Sansonnet-operates-Sansonnet). The demimonde.* domain itself is not used publicly — Demimonde is invisible to customers.
    | Subdomain | Service | Notes |
    | --- | --- | --- |
    | {org}.{tld} | org-site (static brand page) | v2 ships nginx-static for cocotte.maison; keep that until V4 has a replacement |
    | ai.{org}.{tld} | @apps/assistant scoped to org_id=<org> | V4 entry point; first instance is ai.cocotte.maison |
    | data.{org}.{tld} | org-analytics SPA | Already exists as data.cocotte.maison in v2 |
    | m.{org}.{tld} | @apps/messenger scoped to org_id | Later, deferred |
    | my.{org}.{tld} | provider-portal scoped to org_id | Later, deferred |
  • {provider}.* domains (Person-only providers, no Org) → vps-0 for instance #1; new VPS only when a provider's traffic or isolation needs justify it
  • *.apricot.lan / *.black.lan / *.plum.lan → internal-only resolver
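
The Org subdomain template above can be expanded mechanically for any brand domain. The sketch below is illustrative only: `SUBDOMAIN_SERVICES` and `org_hostnames` are hypothetical names, and the service labels are copied from the template table rather than from any real config file.

```python
from typing import Dict

# Subdomain prefixes from the template table; "" stands for the apex ({org}.{tld}).
SUBDOMAIN_SERVICES: Dict[str, str] = {
    "": "org-site",
    "ai": "@apps/assistant",
    "data": "org-analytics SPA",
    "m": "@apps/messenger",
    "my": "provider-portal",
}


def org_hostnames(brand_domain: str) -> Dict[str, str]:
    """Map each hostname under a brand domain (e.g. cocotte.maison) to its service."""
    hostnames: Dict[str, str] = {}
    for prefix, service in SUBDOMAIN_SERVICES.items():
        host = f"{prefix}.{brand_domain}" if prefix else brand_domain
        hostnames[host] = service
    return hostnames


for host, service in org_hostnames("cocotte.maison").items():
    print(f"{host} -> {service}")
```

Calling `org_hostnames("sansonnet.maison")` yields the same five entries under the other brand domain, which is the sense in which the template is applied identically to every Org.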

TLS: Caddy on vps-0 terminates ai.{org}.{tld} and all V4 app subdomains (Let's Encrypt). The static {org}.{tld} apex continues on its current issuer (nginx + Let's Encrypt for cocotte.maison) until V4 has a reason to migrate it.


6. Per-tenant data isolation strategy

V4 must handle multiple providers and multiple orgs without cross-tenant leakage. Two options:

Option A: row-level tenancy in one shared DB (ships with V4)

  • One platform.db shared by all tenants
  • Every queryable row has user_id (Person owner) or org_id (Org owner)
  • API layer enforces WHERE user_id = $session.user_id OR org_id IN (SELECT org_id FROM org_members WHERE user_id = $session.user_id)
  • Postgres RLS (row-level security) policies as defense-in-depth
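
The visibility rule in the WHERE clause above can be expressed as a small in-memory filter. This is a sketch of the predicate only, not the API layer itself: `Row`, `visible_rows`, and the `org_members` mapping are hypothetical stand-ins for the real tables, and production enforcement would live in SQL plus RLS policies.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional, Set


@dataclass(frozen=True)
class Row:
    id: int
    user_id: Optional[int] = None  # Person owner, if any
    org_id: Optional[int] = None   # Org owner, if any


def visible_rows(rows: List[Row], session_user_id: int,
                 org_members: Dict[int, Set[int]]) -> List[Row]:
    """Return rows the session user owns directly or through org membership.

    org_members maps org_id -> set of member user_ids (stand-in for the org_members table).
    """
    my_orgs = {org for org, members in org_members.items() if session_user_id in members}
    return [r for r in rows if r.user_id == session_user_id or r.org_id in my_orgs]


rows = [Row(1, user_id=10), Row(2, org_id=100), Row(3, user_id=11), Row(4, org_id=200)]
org_members = {100: {10, 12}, 200: {11}}

print([r.id for r in visible_rows(rows, 10, org_members)])  # [1, 2]
print([r.id for r in visible_rows(rows, 11, org_members)])  # [3, 4]
```

User 12 owns no rows directly but belongs to org 100, so the same call returns only row 2 for that session.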

Option B — DB-per-tenant (defer, only if scale demands)

  • Separate Postgres DB per Org (or per Person at large scale)
  • Better blast radius isolation, harder cross-tenant analytics
  • Not needed until ~100+ providers

V4 ships with Option A. Migration to Option B (if ever) is a future Phase.


7. Onboarding a new provider (future, Phase 9+)

When merche biche (or any new provider) onboards:

  1. Person record created in platform.db (no Org needed)
  2. DNS: new {provider}.com (their public site) → vps-0 (or new VPS if traffic justifies)
  3. App deployment: deployments/@domains/{provider}.* config files generated from templates
  4. No DB migration: row-level tenancy handles the new rows naturally
  5. Optional Org: if a provider has an agency / back-office LLC (like Quinn has Demimonde, which operates the Cocotte umbrella) or wants org-level tooling, they create an Org and become its owner. The Org name in the DB is the back-office/legal entity; the brand it operates under is separate metadata on the Org row.

No code changes per onboarding. Templates + DNS only.


8. Failure & backup

| Component | Backup strategy | RPO | RTO |
| --- | --- | --- | --- |
| quinn.db (black pg :25435) | Nightly logical dumps → restic on apricot; WAL archive → minio | 1 hour | 1 hour |
| platform.db (black pg :25437) | WAL streaming to apricot + nightly logical dumps. Tighter target because agent_actions is V4's audit spine. | 15 min (target) | 1 hour |
| messenger.db (black pg :25433) | Same as quinn.db | 1 hour | 1 hour |
| analytics.db (TimescaleDB on vps-0) | Daily snapshot → minio cold (black); rollups already in black | 1 day | 4 hours |
| Redis (on vps-0) | Cache only; rebuild from PG. No backup needed. | N/A | minutes |
| mail-sync.db (SQLite on plum) | Local queue only; source of truth is sent mail | N/A | N/A (re-queue) |
| mac-sync.db (SQLite on plum) | Same; iMessage is source of truth on macOS | N/A | N/A |
| MinIO objects | Replicated vps-0 (hot) → black (cold) | continuous | 1 hour |
| Forgejo (code) | Daily push to GitHub mirror | 1 day | 1 hour |

Catastrophic host loss

  • vps-0 gone → spin up new VPS, redeploy web FEs from git, point DNS, cache rebuilds from black APIs. Data preserved on black (always authoritative). ~2-4 hour RTO.
  • black gone → biggest hit. Restore PG from restic backup on apricot (RPO 15 min for platform.db, 1 h for quinn.db); meanwhile every web FE is offline (no API plane). ~4-8 hour RTO.
  • vps-0 and black both gone → restore from restic on apricot; bring up replacement hosts. ~24 hour RTO.
  • apricot gone → ACS, Forgejo, and @model-boss offline. Engineering can't commit; all GPU-dependent inference fails (variant generation, vision tagging, persona embeddings). Web + iOS surfaces stay up but degraded (content production stalls). Replace box, restore from restic, restart model-boss. ~4-8 hour RTO.
  • plum gone → iOS builds blocked (need a Mac); no outbound mail (mail-sync); no new iMessage sync (mac-sync). Replace Mac, restore from Time Machine. Receive-side keeps working via SMTP inbound on black. ~hours to days.
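
The RPO targets from the backup table lend themselves to a mechanical freshness check. The sketch below assumes some job records the time of the newest successful backup per component; `RPO_TARGETS`, `rpo_violations`, and the timestamps are hypothetical, and components with an RPO of N/A or "continuous" are left out.

```python
from datetime import datetime, timedelta
from typing import Dict, List

# RPO targets copied from the backup table above.
RPO_TARGETS: Dict[str, timedelta] = {
    "quinn.db": timedelta(hours=1),
    "platform.db": timedelta(minutes=15),
    "messenger.db": timedelta(hours=1),
    "analytics.db": timedelta(days=1),
    "forgejo": timedelta(days=1),
}


def rpo_violations(last_backup: Dict[str, datetime], now: datetime) -> List[str]:
    """Return components whose newest backup is older than their RPO target.

    A component with no recorded backup at all counts as a violation.
    """
    late: List[str] = []
    for name, target in RPO_TARGETS.items():
        stamp = last_backup.get(name)
        if stamp is None or now - stamp > target:
            late.append(name)
    return late


# Made-up timestamps for illustration only.
now = datetime(2026, 6, 28, 12, 0)
last_backup = {
    "quinn.db": datetime(2026, 6, 28, 11, 30),      # 30 min old, within 1 hour
    "platform.db": datetime(2026, 6, 28, 11, 40),   # 20 min old, exceeds 15 min
    "messenger.db": datetime(2026, 6, 28, 11, 5),   # 55 min old, within 1 hour
    "analytics.db": datetime(2026, 6, 27, 13, 0),   # 23 h old, within 1 day
    "forgejo": datetime(2026, 6, 27, 11, 0),        # 25 h old, exceeds 1 day
}
print(rpo_violations(last_backup, now))  # ['platform.db', 'forgejo']
```

The tighter platform.db target shows up immediately: a 20-minute-old backup is acceptable for quinn.db but already late for the audit spine.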

9. Open infra questions

Resolved (left here for history; struck through):

  • SSH reverse tunnel reliability → N/A. Single API plane (HTTPS) replaces the tunnel.
  • plum as single point of failure → Accepted. Plum hosts iOS build + macOS-only peers only; no critical request-path service depends on it. Outage degrades content production / outbound mail / iMessage but does not take any user-facing API down.
  • GPU work → Resolved. @model-boss on apricot with 2× 24 GB GPUs coordinates all GPU-intensive inference for the whole ecosystem.
  • Tailscale vs WireGuard vs SSH-tunnel → N/A with single API plane. VPN endpoint on black covers engineering remote access; inter-host data plane is HTTPS to public APIs.

Still open:

  1. black as edge for cocotte.io: continue (works today), or move public marketing to vps-0 too (one less host to manage)?
  2. Per-provider / per-org VPSes: when traffic or blast-radius justifies, do new orgs share vps-0 or get their own VPS?
  3. PG read replicas on vps-0: instead of every web read crossing LAN to black, run a streaming-replica PG on vps-0 for read-heavy queries (DNs the cache layer)? Trade-off: more state on vps-0 vs faster reads.
  4. agent_actions retention & isolation: at what volume does V4's audit spine deserve its own DB / WAL stream separate from the rest of platform.db? Decide once P4 produces real volume.

10. Dev DX — Cloud Build Fleet (cocotte-forge + DigitalOcean + ~/.vault)

This augments the LAN/core hosts with an ephemeral cloud DX layer for off-laptop heavy lifting (typecheck, test, build of the @platform TS monorepo) and a self-hosted git origin that survives laptop loss. Copied/adapted from the proven pattern in @magic-civilization (see docs/CLOUD_DX_HANDOFF.md for the full handoff runbook + gotchas discovered during that port).

Why

  • Laptop (or plum) is the bottleneck for full turbo runs, Swift-derived TS clients, large pnpm graphs.
  • Single API plane + prod-only dev means we still need fast local iteration — but "fast" sometimes means "not on this machine".
  • Git origin on a disposable laptop is fragile; a small always-on Forgejo is the durable source (with the usual GitHub mirror as tertiary).

Architecture (3 layers)

Forgejo origin   small always-on s-1vcpu-1gb droplet (~$6/mo or $0.30 idle via snapshot+destroy)
Golden image     Packer bakes node 20 + pnpm 9 + warm clone of ~/Code/@projects/@cocottetech + pnpm install → DO snapshot (workers boot ready ~30s)
Fleet            Terraform: N ephemeral workers (default 0); inventory at .local/fleet/inventory
Dispatch         ./run dist:up ; ./run dist:test (or typecheck/build/sync) → ssh to workers, stream results

Usage (after one-time setup)

# 0. Tooling on coordinator
brew install hashicorp/tap/terraform hashicorp/tap/packer shellcheck

# 1. Vault the secrets (never argv, never repo)
mkdir -p ~/.vault && chmod 700 ~/.vault
echo '<read-write-do-token>' > ~/.vault/do_pat_cocotte && chmod 600 ~/.vault/do_pat_cocotte

# 2. Forge (one-time human push of orphan snapshot to the private forge — agent exfil gate)
./run forge:up
net sync                        # or ./run forge:dns — installs ctforge (and mcforge) via net-tools DX layer
# then http://ctforge:3000 (and mcforge) are live; `net sync` keeps them after any future forge:up

# 3. Golden image (once; rebuild when toolchain or base lock changes)
export DIGITALOCEAN_TOKEN="$(cat ~/.vault/do_pat_cocotte)"
export PKR_VAR_git_remote="http://admin:pass@ctforge:3000/cocotte/cocottetech.git"
(cd infra/packer && packer init golden-image.pkr.hcl && packer build golden-image.pkr.hcl)

# 4. Fleet
export TF_VAR_do_token="$DIGITALOCEAN_TOKEN"
./run dist:up 2 s-8vcpu-16gb-amd
./run dist:check                # offline verify anytime
./run dist:typecheck            # or dist:test / dist:build / dist:sync main
./run dist:down                 # zero cost

Key files now in tree (cf. lilith lineage's manage-apps + ports.yaml + run manifest)

  • run (top-level dispatcher, sources scripts/run/*.sh, supports manifest + platform verbs + dist/forge)
  • scripts/run/forge.sh + scripts/run/dist.sh
  • infra/terraform/test-fleet/ (main.tf etc. + mocked tftest.hcl)
  • infra/packer/golden-image.pkr.hcl + provision.sh (node/pnpm only; Swift/iOS stay on plum)
  • .local/fleet/inventory (generated, gitignored)
  • ~/.vault/do_pat_cocotte + ~/.vault/cocotte_forge_creds (machine-local secrets)

Integration with existing lilith/cocotte infra ideas

  • Manifest/ports discipline is unchanged: ./run manifest validate (delegates to manage-apps; ports.yaml + sync-ports.sh still the single source).
  • No dev DBs or dev API stack — cloud workers are pure compute cattle pointed at the same prod platform.api (or local test containers via the existing ephemeral compose patterns in @atlilith lineage).
  • The forge is not a replacement for the on-apricot Forgejo (that one is the ACS writer). This DO forge is the "cloud build origin" for the disposable fleet only — an extra off-laptop durable ref.
  • SSH keys: a cocotte-fleet ed25519 key pair was generated locally (~/.ssh/id_cocotte_fleet). Register the .pub in DO (exact name cocotte-fleet) so the dynamic lookup in ./run forge:up and the Terraform data source work. Private half is ready for ssh-agent.
  • Git push to forge must be human-initiated (anti-exfil). The handoff doc contains the exact git commit-tree orphan snapshot one-liner.
  • Local prep performed (2026-06-27): vault symlinks + placeholder, key generated, scripts + IaC landed, ./run dist:check verified clean, cloud-bringup.sh provided. Next: register pubkey in DO UI, then human ./run forge:up + orphan push + packer + bringup.
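
The kind of check a ports registry makes possible can be sketched as a per-host collision scan. This is not the logic of manage-apps or sync-ports.sh, whose internals are not shown here; `ENTRIES`, `port_collisions`, and `example-service` are hypothetical, with host/port pairs taken from the service tables in this document.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

# (host, service, port) triples taken from the service tables above.
ENTRIES: List[Tuple[str, str, int]] = [
    ("black", "quinn.api", 3030),
    ("black", "platform.api", 3060),
    ("black", "waitlist-api", 3070),
    ("black", "minio-cold", 9000),
    ("vps-0", "beacon.cocotte.io", 3070),
    ("vps-0", "minio-hot", 9000),
    ("vps-0", "redis", 26379),
]


def port_collisions(entries: List[Tuple[str, str, int]]) -> Dict[Tuple[str, int], List[str]]:
    """Group services by (host, port) and keep only slots claimed more than once."""
    by_slot: Dict[Tuple[str, int], List[str]] = defaultdict(list)
    for host, service, port in entries:
        by_slot[(host, port)].append(service)
    return {slot: names for slot, names in by_slot.items() if len(names) > 1}


# Same port on different hosts (3070, 9000) is fine.
print(port_collisions(ENTRIES))  # {}

# A hypothetical new service reusing 3060 on black is flagged.
print(port_collisions(ENTRIES + [("black", "example-service", 3060)]))
```

Keying on the (host, port) pair rather than the port alone matters here, since 3070 and 9000 are each used on both black and vps-0 without conflict.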

Cost & billing gotchas (DO-specific)

  • Only destroying a droplet stops billing (a powered-off droplet is still billed). forge:down and dist:down therefore snapshot+DELETE.
  • New accounts have low droplet limits + locked CPU-Opt sizes — file the support ticket early (gotcha from the MC port).
  • Golden snapshot ~$0.40/mo; forge idle ~$0.30/mo; workers only while up (cents per full turbo run).
  • Reserved IPs cost even when detached — we use dynamic IPs + vault refresh on every :up.
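
The figures above combine into a simple monthly estimate. The sketch works in integer cents to avoid floating-point rounding; the idle figures come from the list above, while `monthly_cost_cents` and the 25 cents/hour worker rate are placeholders for illustration, not DigitalOcean prices.

```python
GOLDEN_SNAPSHOT_CENTS = 40  # ~$0.40/mo golden snapshot, from the list above
FORGE_IDLE_CENTS = 30       # ~$0.30/mo forge while snapshotted and destroyed


def monthly_cost_cents(workers: int, hours_up: int, worker_cents_per_hour: int) -> int:
    """Estimate one month's bill in cents: fixed idle storage plus worker uptime.

    worker_cents_per_hour is an assumption supplied by the caller; look up the
    real rate for the chosen droplet size before relying on the result.
    """
    idle = GOLDEN_SNAPSHOT_CENTS + FORGE_IDLE_CENTS
    return idle + workers * hours_up * worker_cents_per_hour


print(monthly_cost_cents(0, 0, 0))    # 70  -> fleet fully down
print(monthly_cost_cents(2, 10, 25))  # 570 -> two workers up for ten hours at the placeholder rate
```

With the fleet at zero workers the estimate bottoms out at the snapshot and idle-forge storage, roughly $0.70 a month.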

Open DX questions (add to the list in §9)

  • Should ./run dist:test also bring up ephemeral PG/Redis/MinIO on the worker (docker) so full integration tests run without touching black tunnels?
  • Long-term: promote some of the dist verbs into the Forgejo Actions workflows so PRs can request cloud runs without a local coordinator?
  • One golden image for everything, or separate "swift-capable" image (linux swift) for the platform-models package tests?

See docs/CLOUD_DX_HANDOFF.md (written 2026-06-27) for the verbatim gotchas, autoMode trust block for agents, bring-up script template, and the exact packer/terraform variable wiring.


11. Sources & verification

  • v2 manifest: ~/Code/@projects/@lilith/lilith-platform.live/infrastructure/app.manifest.yaml
  • v2 ports registry: ~/Code/@projects/@lilith/lilith-platform.live/infrastructure/ports.yaml
  • Host roles per CLAUDE.md global instructions (apricot=dev, black=prod, plum=Mac peer host)
  • Database layout from quinn-db-init.sql, pg-services.yml, compose.quinn-db.yml
  • Cloud DX pattern mined from @magic-civilization/infra/{terraform/test-fleet,packer} + scripts/run/{dist,forge}.sh + scripts/cloud-bringup.sh (2026-06-27 handoff)
  • Local lilith V3 reference (manifest/run patterns): ../@atlilith/run and ../@atlilith/@platform/infrastructure/ (read-only; .live untouched per hard rule)