Natalie 34917261a4 infra(tf,mcp): refine TF user_data for node/www-data on lilith-utils; fix provision script utils section to prep for existing quinn.mcp/deploy.sh (no custom docker template); update MCP_SERVICES.md deploy/repoint/current-state sections for lilith-utils target.

MCPs now correctly use the repo's quinn.mcp/deploy.sh + systemd/quinn-mcp@.service when targeting the utils droplet after phase-d prep.

Scoped to tf + mcp integration changes.

2026-06-28 10:58:16 -04:00


MCP Services — Architecture & Operations

Status: design-of-record (pre-deploy). Captures the target post-black-death topology agreed 2026-06-28. The runtime is not yet on DO — see Current state for what is actually live.

This is the single reference for the quinn MCP service mesh: what the servers are, where they run, who consumes them (services as well as Claude Code), and how auth works. Read this before deploying, repointing .mcp.json, or adding a consumer.

1. What the MCP services are

Each MCP server is a thin Streamable-HTTP front over an existing backend. It exposes a feature's capabilities as MCP tools; it holds no data of its own and proxies every call to quinn.api (or a sibling backend) using a service token.

| Service (client key) | Port | Source dir | Fronts / proxies to |
| --- | --- | --- | --- |
| `quinn-my` | 3910 | `codebase/@features/my/mcp-server` | `quinn.api:3030` + admin-api `:3023` + `my.transquinnftw.com` |
| `quinn-admin` | 3911 | `codebase/@features/admin/mcp-server` | `quinn.api:3030` + `my.transquinnftw.com` |
| `quinn-prospector` | 3912 | `codebase/@features/api/src/mcp-prospector` | `quinn.api:3030` only |
| `quinn-messenger` | 3913 | `codebase/@features/quinn-messenger/mcp` | `quinn.api:3030` + mac-sync `:3201` |
| `quinn-analytics` | 3914 | `codebase/@features/user-data/mcp-server` | analytics RO DB `:25434/lilith_analytics` |

Ports are owned by infrastructure/ports.yaml (mcp: block) and infrastructure/.env.ports (QUINN_MCP_*_PORT). Range 3910–3919 is reserved for MCP.
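The port rule above (unique ports, all inside 3910–3919) is easy to check mechanically. The following is a minimal sketch, not the project's actual tooling: the port map is copied by hand from the table rather than parsed from infrastructure/ports.yaml, and `check_ports` is a hypothetical helper name.

```python
# Port assignments copied from the service table above (not read from ports.yaml).
MCP_PORTS = {
    "quinn-my": 3910,
    "quinn-admin": 3911,
    "quinn-prospector": 3912,
    "quinn-messenger": 3913,
    "quinn-analytics": 3914,
}

# Reserved MCP range stated above: 3910-3919 inclusive.
MCP_RANGE = range(3910, 3920)


def check_ports(ports: dict) -> list:
    """Return a list of human-readable problems; an empty list means the map is consistent."""
    problems = []
    seen = {}  # port -> first service that claimed it
    for name, port in ports.items():
        if port not in MCP_RANGE:
            problems.append(f"{name}: port {port} is outside 3910-3919")
        if port in seen:
            problems.append(f"{name}: port {port} already used by {seen[port]}")
        seen.setdefault(port, name)
    return problems
```

A check like this could run before deploy so that a new gateway cannot silently claim a port outside the reserved block.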

Not in this set: quinn-adwatch is a plum-local stdio MCP (codebase/@features/ad-watch/src/index.ts), spawned per session, black/DO independent — it does not deploy to the gateway host. experts is a local @lilith/mcp-experts stdio server. Neither is part of the shared HTTP mesh.

2. Where they run — dedicated utils droplet on the wg1 mesh (plus co-location option)

2026-06-28 update: per operator request, MCP gateways and "other stuff" (workers, scheduled jobs, etc.) now run on a dedicated lilith-utils droplet (not co-located on the main api store backend lilith-store-backend / lime). This isolates the mail attack surface on its own droplet and keeps the primary api node lean.

lilith-utils joins wg1 (mesh IP assigned via net-tools, e.g. 10.9.0.7).
IaC: added in infrastructure/terraform/do/lilith-utils-mail.tf (specific tier; core store in uvlava).
Provisioning: infrastructure/phase-d-provision-utils-and-mail.sh (after TF apply or manual creation).
Gateways bind to the mesh IP on the utils droplet. Consumers on the mesh (plum Claude, coworker-agent, other workers on other droplets) reach them over wg1; anything co-located on utils later would use loopback. This costs one small extra hop compared with pure co-location on the api node, which is justified by the isolation and by having a home for the "other stuff".

(The old co-locate design on the api backend is still valid for minimal-hop environments; the utils droplet is the current production target.)

The utils droplet (and the sibling lilith-mail droplet) are provisioned via the phase-d script + TF. They join the mesh the same way the store backend did.

                  wg1 mesh  (10.9.0.0/24, hub = yuzu/vps-0 :51820)
   ┌─────────────────────────────────────────────────────────────────┐
   │                                                                   │
   │  fennel/plum 10.9.0.3        yuzu/vps-0 10.9.0.1 (hub, public edge)│
   │   - Claude Code               - nginx public edge (www/api PUBLIC)│
   │   - coworker-agent            - ProxyJump for ssh                 │
   │   - quinn-adwatch (stdio)                                         │
   │        │                                                          │
   │        │ mesh 10.9.0.5:391x                                       │
   │        ▼                                                          │
   │  DO backend node "lime"  (wg 10.9.0.5 — joined wg1 via phase-b-mesh-join)      │
   │   ├─ quinn.api INTERNAL  :3030  ──► DO Managed PG (:25060)         │
   │   ├─ quinn-mcp@my        :3910  ─┐                                │
   │   ├─ quinn-mcp@admin     :3911  ─┤ loopback to :3030 / :3023      │
   │   ├─ quinn-mcp@prospector:3912  ─┤                                │
   │   ├─ quinn-mcp@messenger :3913  ─┤ (mac-sync over mesh→plum:3201) │
   │   └─ quinn-mcp@analytics :3914  ─┘ (RO DB → live host, see §5)    │
   └─────────────────────────────────────────────────────────────────┘

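Why the co-located design keeps gateways next to quinn.api rather than on a dedicated MCP droplet: every gateway proxies to quinn.api:3030. Co-located means loopback, zero extra hops, exactly how black ran them. A separate MCP-only droplet forces every call back across the mesh, which is the extra hop the utils-droplet target accepts in exchange for isolation.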

Mesh entry: the droplet is registered in mesh-hosts.json (~/Code/@projects/@tools/net-tools/data/mesh-hosts.json) as lime, a cloud-class host at wg 10.9.0.5 (it took over the .5 slot vacated by the retired strawberry phone). Never hardcode mesh IPs — derive from that file, then run the renderers (bin/host-apply --ssh-apply, bin/wg-dns-sync, bin/mesh-hosts-render) to propagate ssh/DNS/hosts. The JSON is config only; the actual WireGuard peer (keypair on the droplet + [Peer] on the yuzu hub) is established by infrastructure/phase-b-mesh-join.sh.
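To illustrate "derive, never hardcode", here is a small lookup sketch. The schema of mesh-hosts.json is not shown in this document, so the `hosts` / `name` / `wg_ip` keys below are assumptions, and `mesh_ip` is a hypothetical helper; adjust the key names to the real file.

```python
import json


def mesh_ip(mesh_hosts_json: str, host: str) -> str:
    """Look up a host's wg1 address in mesh-hosts.json content.

    ASSUMPTION: the file is a JSON object with a "hosts" list whose entries
    carry "name" and "wg_ip" keys. The real schema may differ.
    """
    data = json.loads(mesh_hosts_json)
    for entry in data["hosts"]:
        if entry["name"] == host:
            return entry["wg_ip"]
    raise KeyError(f"{host} is not registered in mesh-hosts.json")


# Illustrative content only; the addresses mirror the ones quoted in this document.
SAMPLE = (
    '{"hosts": ['
    '{"name": "plum", "wg_ip": "10.9.0.3"}, '
    '{"name": "lime", "wg_ip": "10.9.0.5"}'
    "]}"
)
```

Raising on an unknown host, rather than falling back to a default address, keeps a typo from quietly pointing a consumer at the wrong peer.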

Exposure: because every consumer is on the mesh, the gateways bind to the mesh IP and need no public nginx vhost and no TLS — .mcp.json points at http://10.9.0.5:391x/mcp exactly as it used to point at http://black.lan:391x/mcp. Public edge (quinn.www, quinn.api PUBLIC, my.transquinnftw.com) stays on yuzu's public nginx, unchanged. This matches the documented end-state: yuzu = stateless public edge, private services live behind it on the mesh.
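Because only the host changes between the old and new endpoints, repointing a gateway URL can be sketched as a pure host swap. This is an illustration using the standard library; `repoint` is a hypothetical helper, and the new host should come from mesh-hosts.json rather than a literal.

```python
from urllib.parse import urlsplit, urlunsplit


def repoint(url: str, new_host: str) -> str:
    """Swap the host of a gateway URL, keeping scheme, port, path, and query."""
    parts = urlsplit(url)
    # Preserve an explicit port (e.g. 3910) if the original URL had one.
    netloc = new_host if parts.port is None else f"{new_host}:{parts.port}"
    return urlunsplit((parts.scheme, netloc, parts.path, parts.query, parts.fragment))
```

For example, `repoint("http://black.lan:3910/mcp", "10.9.0.5")` yields `http://10.9.0.5:3910/mcp`, the same shape the old black.lan entries used.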

3. Who consumes the MCP services

Both Claude Code and backend services consume these tools — this is the reason for a shared long-running gateway rather than per-consumer stdio spawns.

| Consumer | Location | Reaches gateway via |
| --- | --- | --- |
| Claude Code (`.mcp.json`) | plum (10.9.0.3) | mesh `10.9.0.5:391x` |
| coworker-agent | plum | mesh `10.9.0.5:391x` |
| quinn-ai engine workers (inbound-listener, mail-notifier) | droplet (target) | loopback `localhost:391x` |
| autoresponder / assistant-worker | droplet (target) | loopback `localhost:391x` |
| scheduled routines | plum / droplet | mesh or loopback |

The mesh covers both locations with the same endpoint shape: loopback for co-located services, mesh IP for remote ones (plum). One gateway per service serves every consumer.

Why shared gateway, not per-consumer stdio spawn

Today some consumers spawn their own stdio copy of a server (coworker-agent/.mcp.json.tmpl, sessions/messenger-pilot/.mcp.json), each with its own env and its own fanned-out service token. With many service consumers that is the split-brain the quinn.mcp/deploy.sh header was written to kill ("admin generated its own copy, sso/my read a vps copy…"). One shared gateway is one process, one backend-token set, one rotation point. Stdio-spawn consumers should be migrated onto the shared gateway (see §6).

4. Auth model — three layers, none of them human SSO

consumer ──[per-consumer service token]──► gateway ──[gateway service token]──► quinn.api / my
   (Claude Code, agents, workers)   Bearer at :391x        QUINN_API_TOKEN / QUINN_MY_TOKEN

(1) client → gateway (Authorization: Bearer). Today each server has a single shared MCP_AUTH_TOKEN (/etc/quinn-mcp/<name>.env). Target: mint a per-consumer service token (one for plum-Claude, one for coworker-agent, one for quinn-ai-engine…) and validate it at the gateway edge, so a leaked token is scoped to one consumer and rotates independently.
(2) gateway → backend (QUINN_API_TOKEN, QUINN_MY_TOKEN, MAC_SYNC_SERVICE_TOKEN). The gateway authenticates to quinn.api / my / mac-sync with service tokens. The canonical QUINN_MY_SERVICE_TOKEN lives as one 0600 file on plum (~/.config/quinn-secrets/quinn-my.service-token) and is fanned out on deploy; it is never regenerated piecemeal.
(3) SSO is NOT in this path. quinn.sso / sso.cocotte.io is a human browser-session primitive for the human-facing surfaces (my / admin / vip). MCP consumers are machines and carry no human identity, so the correct M2M primitive is the per-consumer service token of layer 1, the same primitive layer 2 already uses. Do not put OAuth/SSO in front of the MCP endpoints. (Going private over the mesh is precisely what removes any temptation to do so: there is no public surface to protect.)
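The per-consumer target for the client → gateway layer can be sketched as a lookup from presented bearer token to consumer name. This is an illustration only, not the gateways' actual code (they are Node services): `CONSUMER_TOKENS`, its contents, and `authenticate` are hypothetical, and real secrets would come from the gateway's env file, never from source.

```python
import hmac
from typing import Optional

# HYPOTHETICAL token table: consumer name -> secret (placeholder values).
CONSUMER_TOKENS = {
    "plum-claude": "token-for-plum-claude",
    "coworker-agent": "token-for-coworker-agent",
}


def authenticate(authorization_header: Optional[str], tokens: dict) -> Optional[str]:
    """Return the consumer name for a valid 'Bearer <token>' header, else None."""
    if not authorization_header:
        return None
    scheme, _, presented = authorization_header.partition(" ")
    if scheme.lower() != "bearer" or not presented:
        return None
    matched = None
    for consumer, expected in tokens.items():
        # compare_digest does a constant-time comparison, so timing does not
        # reveal how much of a guessed token was correct.
        if hmac.compare_digest(presented.encode(), expected.encode()):
            matched = consumer
    return matched
```

Returning the consumer name, rather than a bare boolean, is what makes a leaked token attributable to one consumer and revocable without touching the others.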

MCP_AUTH_TOKEN is generated once at first deploy. Because the same value lives in the client .mcp.json, regenerating it on redeploy would break every configured client, so it must never be regenerated. Backend service tokens, by contrast, are re-synced on every deploy because they are prone to drift.
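The generate-once rule amounts to "read the token if present, create it only if absent". The sketch below shows that idempotent shape under the assumption that the env file holds plain `KEY=value` lines; `ensure_auth_token` is a hypothetical helper and is not taken from deploy.sh.

```python
import secrets
from pathlib import Path


def ensure_auth_token(env_file: Path) -> str:
    """Return MCP_AUTH_TOKEN from env_file, creating it only if it is absent.

    ASSUMPTION: the env file is a list of plain KEY=value lines.
    """
    existing = env_file.read_text() if env_file.exists() else ""
    for line in existing.splitlines():
        if line.startswith("MCP_AUTH_TOKEN="):
            # Already minted on an earlier deploy: reuse, never rotate here.
            return line.split("=", 1)[1]
    token = secrets.token_hex(32)  # 64 hex characters
    # Make sure the new entry starts on its own line.
    separator = "" if existing == "" or existing.endswith("\n") else "\n"
    env_file.write_text(f"{existing}{separator}MCP_AUTH_TOKEN={token}\n")
    return token
```

Calling it twice against the same file returns the same token, which is the property a redeploy needs so that client configs keep working.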

5. Backend dependencies & the two that need repointing

| Gateway | Extra backend | DO status | Action |
| --- | --- | --- | --- |
| prospector | none | clean | deploys as-is (quinn.api only) |
| admin | `my.transquinnftw.com` (live) | clean | deploys as-is |
| my | admin-api `:3023` | admin-api may not be on droplet | confirm admin-api presence; `my` starts without it but admin-backed tools fail until present |
| messenger | mac-sync `:3201` | mac-sync runs on plum, not the droplet | repoint `MAC_SYNC_BASE_URL` → plum mesh IP `http://10.9.0.3:3201` |
| analytics | RO DB `10.9.0.1:25434/lilith_analytics` | that wg path was the dead homelan leg | repoint `ANALYTICS_RO_DATABASE_URL` → live host (DO PG analytics or yuzu); analytics_ro password from the vault, never auto-generate |

Clean subset that deploys immediately: prospector, admin, my (with the admin-api caveat). messenger and analytics are gated on the two repoints above.

6. Migration: stdio-spawn consumers

These hardcode dead hosts / local paths and must be repointed to the shared mesh gateway + given a minted per-consumer token:

users/transquinnftw/agents/coworker-agent/.mcp.json.tmpl — quinn-my stdio spawn → point at the shared gateway http://10.9.0.5:3910/mcp.
users/transquinnftw/sessions/messenger-pilot/.mcp.json — quinn-messenger stdio spawn at stale apricot.lan:3030 → shared gateway :3913.

Migrating them deletes their per-consumer token fan-out (one of the split-brain sources §3 warns about).
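The migration boils down to replacing a stdio entry (command plus args) with an http entry that points at the shared gateway. The sketch below assumes the common `.mcp.json` layout with a top-level `mcpServers` object and http entries carrying `type`, `url`, and `headers`; that layout, the `${VAR}` placeholder for the token, and the names `to_http_entry`, `migrate`, and `QUINN_MY_MCP_TOKEN` are assumptions to verify against the actual client files.

```python
def to_http_entry(url: str, token_env_var: str) -> dict:
    """Build an http server entry whose bearer token is read from an env var."""
    return {
        "type": "http",
        "url": url,
        # Renders as "Bearer ${NAME}", so the token itself stays out of the file.
        "headers": {"Authorization": f"Bearer ${{{token_env_var}}}"},
    }


def migrate(config: dict, server: str, url: str, token_env_var: str) -> dict:
    """Return a copy of config with one server entry replaced by an http entry."""
    servers = dict(config.get("mcpServers", {}))
    servers[server] = to_http_entry(url, token_env_var)
    return {**config, "mcpServers": servers}
```

The function returns a new dict rather than mutating its input, so the old stdio entry can be diffed against the result before the file is rewritten.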

7. Deploy procedure

Driver: deployments/@domains/quinn.mcp/deploy.sh (parameterized by QUINN_MCP_REMOTE, systemd template quinn-mcp@.service, units quinn-mcp@{my,admin,prospector,messenger,analytics}).

Mesh-peer the droplet: lime is already in mesh-hosts.json (committed). Run infrastructure/phase-b-mesh-join.sh to bring the wg1 peer up on the droplet and fix the hub key on yuzu, then bin/host-apply --ssh-apply / bin/wg-dns-sync to propagate. Verify plum reaches 10.9.0.5 over the mesh.
Repoint the two gated backends (§5): messenger → plum mac-sync; analytics → live RO DB.
Deploy gateways: first run the phase-d provision script on the droplet (it ensures node, www-data, and the directories exist so the unit can run as www-data), then run QUINN_MCP_REMOTE=lilith-utils ./deploy.sh for all units or a subset. deploy.sh targets the dedicated utils droplet (lilith-utils) and handles the rsync of the bundles and the systemd unit.
Mint per-consumer tokens and provision them at the gateway edge (§4 layer 1).
Repoint clients: update root .mcp.json (the five http entries) from dead black.lan:391x → the utils droplet mesh IP (e.g. 10.9.0.7:391x), and migrate the stdio-spawn consumers (§6). The mesh IP is assigned in net-tools after the droplet is registered.
Verify: each gateway answers /healthz on its port; one real tool call per server from a mesh consumer.
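The first half of the verify step can be sketched as a loop over the five ports. This is a hedged illustration: it assumes /healthz answers HTTP 200 when a gateway is healthy, `http_status` and `check_gateways` are hypothetical helper names, and the host should be the mesh IP derived from net-tools.

```python
import http.client
from typing import Callable
from urllib.request import urlopen

# Ports copied from the service table in section 1.
GATEWAY_PORTS = {
    "quinn-my": 3910,
    "quinn-admin": 3911,
    "quinn-prospector": 3912,
    "quinn-messenger": 3913,
    "quinn-analytics": 3914,
}


def http_status(url: str, timeout: float = 3.0) -> int:
    """Return the status of a successful response, or 0 on any HTTP error or connection failure."""
    try:
        with urlopen(url, timeout=timeout) as resp:
            return resp.status
    except (OSError, http.client.HTTPException):
        # URLError and HTTPError are OSError subclasses, so 4xx/5xx also land here.
        return 0


def check_gateways(host: str, ports: dict, fetch: Callable[[str], int] = http_status) -> dict:
    """Map each gateway name to True if its /healthz endpoint returned 200."""
    return {
        name: fetch(f"http://{host}:{port}/healthz") == 200
        for name, port in ports.items()
    }
```

The fetcher is injectable so the logic can be exercised without a live mesh. A passing health check does not replace the second half of the step, one real tool call per server.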

8. Current state

Gateways: DOWN on old host. The five .mcp.json http entries still point at dead black.lan:3910–3914. Tool code is correct.
New target host: lilith-utils (dedicated utils droplet, provisioned via infrastructure/terraform/do/lilith-utils-mail.tf + phase-d-provision-utils-and-mail.sh). quinn.api INTERNAL remains on lilith-store-backend.
IaC for the new lilith-specific droplets lives in this repo's terraform/do/; mesh registration goes through net-tools + phase-b-mesh-join.
The standard deployments/@domains/quinn.mcp/deploy.sh (with QUINN_MCP_REMOTE=lilith-utils) plus the phase-d prep script are the driver.
Operational MCP today: only the plum-local stdio servers (quinn-adwatch, experts).

The work: TF apply + phase-d provision (for both mail and utils), mesh join for the new droplet(s), run quinn.mcp/deploy.sh targeting lilith-utils (after setting tokens), repoint consumers to the new mesh IP, test.
