lilith-platform.live/docs/MCP_SERVICES.md
Natalie 34917261a4 infra(tf,mcp): refine TF user_data for node/www-data on lilith-utils; fix provision script utils section to prep for existing quinn.mcp/deploy.sh (no custom docker template); update MCP_SERVICES.md deploy/repoint/current-state sections for lilith-utils target.
MCPs now correctly use the repo's quinn.mcp/deploy.sh + systemd/quinn-mcp@.service when targeting the utils droplet after phase-d prep.

Scoped to tf + mcp integration changes.
2026-06-28 10:58:16 -04:00


MCP Services — Architecture & Operations

Status: design-of-record (pre-deploy). Captures the target post-black-death topology agreed 2026-06-28. The runtime is not yet on DO — see Current state for what is actually live.

This is the single reference for the quinn MCP service mesh: what the servers are, where they run, who consumes them (services as well as Claude Code), and how auth works. Read this before deploying, repointing .mcp.json, or adding a consumer.


1. What the MCP services are

Each MCP server is a thin Streamable-HTTP front over an existing backend. It exposes a feature's capabilities as MCP tools; it holds no data of its own and proxies every call to quinn.api (or a sibling backend) using a service token.

| Service (client key) | Port | Source dir | Fronts / proxies to |
| --- | --- | --- | --- |
| quinn-my | 3910 | codebase/@features/my/mcp-server | quinn.api:3030 + admin-api :3023 + my.transquinnftw.com |
| quinn-admin | 3911 | codebase/@features/admin/mcp-server | quinn.api:3030 + my.transquinnftw.com |
| quinn-prospector | 3912 | codebase/@features/api/src/mcp-prospector | quinn.api:3030 only |
| quinn-messenger | 3913 | codebase/@features/quinn-messenger/mcp | quinn.api:3030 + mac-sync :3201 |
| quinn-analytics | 3914 | codebase/@features/user-data/mcp-server | analytics RO DB :25434/lilith_analytics |

Ports are owned by infrastructure/ports.yaml (mcp: block) and infrastructure/.env.ports (QUINN_MCP_*_PORT). Range 3910-3919 is reserved for MCP.
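
The port assignments above can be mirrored as a small lookup that also enforces the reserved range. This is a minimal sketch, not the repo's code: `MCP_PORTS`, `MCP_PORT_RANGE`, and `gateway_url` are hypothetical names, and infrastructure/ports.yaml remains the source of truth.

```python
# Hypothetical mirror of the service table; infrastructure/ports.yaml is authoritative.
MCP_PORTS = {
    "quinn-my": 3910,
    "quinn-admin": 3911,
    "quinn-prospector": 3912,
    "quinn-messenger": 3913,
    "quinn-analytics": 3914,
}

# 3910-3919 inclusive is the range reserved for MCP.
MCP_PORT_RANGE = range(3910, 3920)


def gateway_url(service: str, host: str) -> str:
    """Build the Streamable-HTTP endpoint for a service on the given host."""
    port = MCP_PORTS[service]
    if port not in MCP_PORT_RANGE:
        raise ValueError(f"{service} port {port} is outside the reserved MCP range")
    return f"http://{host}:{port}/mcp"


print(gateway_url("quinn-my", "10.9.0.5"))  # http://10.9.0.5:3910/mcp
```

The host is a parameter so the same helper serves loopback consumers (`localhost`) and mesh consumers (the gateway's mesh IP).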

Not in this set: quinn-adwatch is a plum-local stdio MCP (codebase/@features/ad-watch/src/index.ts), spawned per session, black/DO independent — it does not deploy to the gateway host. experts is a local @lilith/mcp-experts stdio server. Neither is part of the shared HTTP mesh.


2. Where they run — dedicated utils droplet on the wg1 mesh (plus co-location option)

2026-06-28 update: per operator request, MCP gateways and "other stuff" (workers, scheduled jobs, etc.) now run on a dedicated lilith-utils droplet (not co-located on the main api store backend lilith-store-backend / lime). This isolates the mail attack surface on its own droplet and keeps the primary api node lean.

  • lilith-utils joins wg1 (mesh IP assigned via net-tools, e.g. 10.9.0.7).
  • IaC: added in infrastructure/terraform/do/lilith-utils-mail.tf (specific tier; core store in uvlava).
  • Provisioning: infrastructure/phase-d-provision-utils-and-mail.sh (after TF apply or manual creation).
  • Gateways bind to the mesh IP on the utils droplet. Consumers on the mesh (plum Claude, coworker-agent, other workers on other droplets) reach them over wg1; anything later co-located on utils can use loopback. This costs one extra mesh hop to quinn.api compared with co-location on the api node, which is accepted in exchange for isolation and room for the "other stuff".

(The old co-locate design on the api backend is still valid for minimal-hop environments; the utils droplet is the current production target.)

The utils droplet (and the sibling lilith-mail droplet) are provisioned via the phase-d script + TF. They join the mesh the same way the store backend did.

                  wg1 mesh  (10.9.0.0/24, hub = yuzu/vps-0 :51820)
   ┌─────────────────────────────────────────────────────────────────┐
   │                                                                   │
   │  fennel/plum 10.9.0.3        yuzu/vps-0 10.9.0.1 (hub, public edge)│
   │   - Claude Code               - nginx public edge (www/api PUBLIC)│
   │   - coworker-agent            - ProxyJump for ssh                 │
   │   - quinn-adwatch (stdio)                                         │
   │        │                                                          │
   │        │ mesh 10.9.0.5:391x                                       │
   │        ▼                                                          │
   │  DO backend node "lime"  (wg 10.9.0.5 — joined wg1 via phase-b-mesh-join)      │
   │   ├─ quinn.api INTERNAL  :3030  ──► DO Managed PG (:25060)         │
   │   ├─ quinn-mcp@my        :3910  ─┐                                │
   │   ├─ quinn-mcp@admin     :3911  ─┤ loopback to :3030 / :3023      │
   │   ├─ quinn-mcp@prospector:3912  ─┤                                │
   │   ├─ quinn-mcp@messenger :3913  ─┤ (mac-sync over mesh→plum:3201) │
   │   └─ quinn-mcp@analytics :3914  ─┘ (RO DB → live host, see §5)    │
   └─────────────────────────────────────────────────────────────────┘

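Rationale for the original co-located design (the topology drawn above), as opposed to a dedicated MCP droplet: every gateway proxies to quinn.api:3030. Co-located, each call is a loopback with zero extra hops, exactly how black ran them. A separate MCP-only droplet forces every call back across the mesh; the lilith-utils target accepts that hop in exchange for isolation.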

Mesh entry: the droplet is registered in mesh-hosts.json (~/Code/@projects/@tools/net-tools/data/mesh-hosts.json) as lime, a cloud-class host at wg 10.9.0.5 (it took over the .5 slot vacated by the retired strawberry phone). Never hardcode mesh IPs — derive from that file, then run the renderers (bin/host-apply --ssh-apply, bin/wg-dns-sync, bin/mesh-hosts-render) to propagate ssh/DNS/hosts. The JSON is config only; the actual WireGuard peer (keypair on the droplet + [Peer] on the yuzu hub) is established by infrastructure/phase-b-mesh-join.sh.
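
The "derive, never hardcode" rule can be sketched as a lookup against the parsed mesh-hosts.json. The schema used here (a top-level `hosts` list with `name` and `wg_ip` keys) is an assumption made for illustration, as are the function names; check the real file in net-tools before relying on it.

```python
import json
from pathlib import Path


def load_mesh_hosts(path: Path) -> dict:
    """Parse mesh-hosts.json from disk."""
    return json.loads(path.read_text())


def mesh_ip(hosts_doc: dict, name: str) -> str:
    """Return the wg1 address for a host name.

    ASSUMPTION: hosts_doc looks like {"hosts": [{"name": ..., "wg_ip": ...}]}.
    The real mesh-hosts.json layout may differ.
    """
    for host in hosts_doc.get("hosts", []):
        if host.get("name") == name:
            return host["wg_ip"]
    raise KeyError(f"{name} is not registered in mesh-hosts.json")


# Inline sample standing in for the real file.
SAMPLE = {
    "hosts": [
        {"name": "lime", "wg_ip": "10.9.0.5"},
        {"name": "plum", "wg_ip": "10.9.0.3"},
    ]
}
print(mesh_ip(SAMPLE, "lime"))  # 10.9.0.5
```

Failing loudly on an unregistered host is deliberate: a missing entry should stop a deploy rather than fall back to a stale address.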

Exposure: because every consumer is on the mesh, the gateways bind to the mesh IP and need no public nginx vhost and no TLS. .mcp.json points at http://10.9.0.5:391x/mcp exactly as it used to point at http://black.lan:391x/mcp. Public edge (quinn.www, quinn.api PUBLIC, my.transquinnftw.com) stays on yuzu's public nginx, unchanged. This matches the documented end-state: yuzu = stateless public edge, private services live behind it on the mesh.


3. Who consumes the MCP services

Both Claude Code and backend services consume these tools — this is the reason for a shared long-running gateway rather than per-consumer stdio spawns.

| Consumer | Location | Reaches gateway via |
| --- | --- | --- |
| Claude Code (.mcp.json) | plum (10.9.0.3) | mesh 10.9.0.5:391x |
| coworker-agent | plum | mesh 10.9.0.5:391x |
| quinn-ai engine workers (inbound-listener, mail-notifier) | droplet (target) | loopback localhost:391x |
| autoresponder / assistant-worker | droplet (target) | loopback localhost:391x |
| scheduled routines | plum / droplet | mesh or loopback |

The mesh covers both locations with the same endpoint shape: loopback for co-located services, mesh IP for remote ones (plum). One gateway per service serves every consumer.

Why shared gateway, not per-consumer stdio spawn

Today some consumers spawn their own stdio copy of a server (coworker-agent/.mcp.json.tmpl, sessions/messenger-pilot/.mcp.json), each with its own env and its own fanned-out service token. With many service consumers that is the split-brain the quinn.mcp/deploy.sh header was written to kill ("admin generated its own copy, sso/my read a vps copy…"). One shared gateway is one process, one backend-token set, one rotation point. Stdio-spawn consumers should be migrated onto the shared gateway (see §6).


4. Auth model — three layers, none of them human SSO

consumer ──[per-consumer service token]──► gateway ──[gateway service token]──► quinn.api / my
   (Claude Code, agents, workers)   Bearer at :391x        QUINN_API_TOKEN / QUINN_MY_TOKEN
  1. client → gateway (Authorization: Bearer). Today a single shared MCP_AUTH_TOKEN per server (/etc/quinn-mcp/<name>.env). Target: mint a per-consumer service token (one for plum-Claude, one for coworker-agent, one for quinn-ai-engine…) and validate it at the gateway edge, so a leaked token is scoped to one consumer and rotates independently.
  2. gateway → backend (QUINN_API_TOKEN, QUINN_MY_TOKEN, MAC_SYNC_SERVICE_TOKEN). The gateway authenticates to quinn.api / my / mac-sync with service tokens. The canonical QUINN_MY_SERVICE_TOKEN lives as one 0600 file on plum (~/.config/quinn-secrets/quinn-my.service-token) and is fanned out on deploy — never regenerated piecemeal.
  3. SSO is NOT in this path. quinn.sso / sso.cocotte.io is a human browser-session primitive, for the surfaces (my / admin / vip). MCP consumers are machines and carry no human identity — the correct M2M primitive is the per-consumer service token of layer 1, the same primitive layer 2 already uses. Do not put OAuth/SSO in front of the MCP endpoints. (Going private over the mesh is precisely what removes any temptation to — there is no public surface to protect.)

MCP_AUTH_TOKEN is generated once at first deploy because it lives in the client .mcp.json; it must never be regenerated on redeploy. Backend service tokens are re-synced on every deploy (drift-prone by design).
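
Layer 1 and the generate-once rule can be sketched as follows. This is an illustration under stated assumptions, not the gateway's actual implementation: `CONSUMER_TOKENS`, `authenticate`, and `ensure_token` are hypothetical names, and the token values are placeholders.

```python
import hmac
import os
import secrets
from pathlib import Path
from typing import Optional

# Hypothetical registry: consumer name -> minted service token (placeholder values).
CONSUMER_TOKENS = {
    "plum-claude": "token-for-plum-claude",
    "coworker-agent": "token-for-coworker-agent",
}


def authenticate(authorization_header: str, tokens: dict[str, str]) -> Optional[str]:
    """Return the consumer name for a valid Bearer token, else None."""
    scheme, _, presented = authorization_header.partition(" ")
    if scheme.lower() != "bearer" or not presented:
        return None
    matched = None
    for consumer, expected in tokens.items():
        # Constant-time comparison avoids leaking token contents through timing.
        if hmac.compare_digest(presented.encode(), expected.encode()):
            matched = consumer
    return matched


def ensure_token(path: Path) -> str:
    """Create a token file once; later calls (redeploys) reuse it unchanged."""
    if path.exists():
        return path.read_text().strip()
    token = secrets.token_urlsafe(32)
    # O_EXCL plus mode 0600 means the file is never created world-readable.
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_EXCL, 0o600)
    with os.fdopen(fd, "w") as handle:
        handle.write(token + "\n")
    return token


print(authenticate("Bearer token-for-coworker-agent", CONSUMER_TOKENS))  # coworker-agent
```

Because `authenticate` returns the consumer name rather than a boolean, the gateway can log which consumer made a call and revoke one consumer's token without touching the others.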


5. Backend dependencies & the two that need repointing

| Gateway | Extra backend | DO status | Action |
| --- | --- | --- | --- |
| prospector | none | clean | deploys as-is (quinn.api only) |
| admin | my.transquinnftw.com (live) | clean | deploys as-is |
| my | admin-api :3023 | admin-api may not be on droplet | confirm admin-api presence; my starts without it but admin-backed tools fail until present |
| messenger | mac-sync :3201 | mac-sync runs on plum, not the droplet | repoint MAC_SYNC_BASE_URL → plum mesh IP http://10.9.0.3:3201 |
| analytics | RO DB 10.9.0.1:25434/lilith_analytics | that wg path was the dead homelan leg | repoint ANALYTICS_RO_DATABASE_URL → live host (DO PG analytics or yuzu); analytics_ro password from the vault, never auto-generate |

Clean subset that deploys immediately: prospector, admin, my (with the admin-api caveat). messenger and analytics are gated on the two repoints above.
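
The gating above can be expressed as a pre-deploy check. This is a sketch under assumptions: `GATED_ENV`, `DEAD_MARKERS`, and `deployable` are hypothetical names, the database URL format is illustrative only, and the admin-api caveat for my is not modelled.

```python
# Old endpoints this document identifies as dead or stale.
DEAD_MARKERS = ("10.9.0.1:25434", "black.lan", "apricot.lan")

# Env vars each gateway needs repointed before it can deploy; empty means clean.
GATED_ENV = {
    "prospector": (),
    "admin": (),
    "my": (),
    "messenger": ("MAC_SYNC_BASE_URL",),
    "analytics": ("ANALYTICS_RO_DATABASE_URL",),
}


def deployable(gateway: str, env: dict[str, str]) -> bool:
    """True when every gated variable is set and no longer points at a dead host."""
    for var in GATED_ENV[gateway]:
        value = env.get(var, "")
        if not value or any(marker in value for marker in DEAD_MARKERS):
            return False
    return True


# messenger has been repointed to plum; analytics still carries the dead wg path.
env = {
    "MAC_SYNC_BASE_URL": "http://10.9.0.3:3201",
    "ANALYTICS_RO_DATABASE_URL": "postgres://analytics_ro@10.9.0.1:25434/lilith_analytics",
}
print([g for g in GATED_ENV if deployable(g, env)])
# ['prospector', 'admin', 'my', 'messenger']
```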


6. Migration: stdio-spawn consumers

These hardcode dead hosts / local paths and must be repointed to the shared mesh gateway + given a minted per-consumer token:

  • users/transquinnftw/agents/coworker-agent/.mcp.json.tmpl: quinn-my stdio spawn → point at the shared gateway http://10.9.0.5:3910/mcp.
  • users/transquinnftw/sessions/messenger-pilot/.mcp.json: quinn-messenger stdio spawn at stale apricot.lan:3030 → shared gateway :3913.

Migrating them deletes their per-consumer token fan-out (one of the split-brain sources §3 warns about).
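
A stdio entry can be swapped for a shared-gateway entry mechanically. The sketch below assumes the common `mcpServers` layout with `type`/`url`/`headers` keys for HTTP servers and `${VAR}` expansion for the token; verify both against the repo's root .mcp.json. `to_shared_gateway` and the env var name `QUINN_MY_MCP_TOKEN` are hypothetical.

```python
import copy
import json


def to_shared_gateway(config: dict, service: str, url: str, token_env: str) -> dict:
    """Return a copy of an .mcp.json document with one entry pointed at the HTTP gateway.

    The stdio keys (command, args, env) are dropped because the whole entry is replaced.
    """
    migrated = copy.deepcopy(config)
    migrated.setdefault("mcpServers", {})[service] = {
        "type": "http",
        "url": url,
        # The token is referenced by env var name, never written into the file.
        "headers": {"Authorization": "Bearer ${" + token_env + "}"},
    }
    return migrated


# Illustrative stdio entry; the real template's command and args may differ.
old = {"mcpServers": {"quinn-my": {"command": "node", "args": ["server.js"]}}}
new = to_shared_gateway(old, "quinn-my", "http://10.9.0.5:3910/mcp", "QUINN_MY_MCP_TOKEN")
print(json.dumps(new, indent=2))
```

Returning a copy keeps the original document intact, so the old and new configs can be diffed before anything is written back.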


7. Deploy procedure

Driver: deployments/@domains/quinn.mcp/deploy.sh (parameterized by QUINN_MCP_REMOTE, systemd template quinn-mcp@.service, units quinn-mcp@{my,admin,prospector,messenger,analytics}).

  1. Mesh-peer the droplet: lime is already in mesh-hosts.json (committed). Run infrastructure/phase-b-mesh-join.sh to bring the wg1 peer up on the droplet and fix the hub key on yuzu, then bin/host-apply --ssh-apply / bin/wg-dns-sync to propagate. Verify plum reaches 10.9.0.5 over the mesh.
  2. Repoint the two gated backends (§5): messenger → plum mac-sync; analytics → live RO DB.
  3. Deploy gateways: first run the phase-d provision on the droplet (ensures node, www-data, dirs), then QUINN_MCP_REMOTE=lilith-utils ./deploy.sh (or a subset). The deploy.sh now targets the dedicated utils droplet (lilith-utils). It handles the rsync of bundles and the unit. The phase-d script preps the prereqs so the unit can run as www-data.
  4. Mint per-consumer tokens and provision them at the gateway edge (§4 layer 1).
  5. Repoint clients: update root .mcp.json (the five http entries) from dead black.lan:391x → the utils droplet mesh IP (e.g. 10.9.0.7:391x), and migrate the stdio-spawn consumers (§6). The mesh IP is assigned in net-tools after the droplet is registered.
  6. Verify: each gateway answers /healthz on its port; one real tool call per server from a mesh consumer.
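
The health half of step 6 can be scripted. This is a minimal sketch that treats HTTP 200 from /healthz as healthy, which is an assumption about the gateways; `PORTS`, `http_status`, and `check_gateways` are hypothetical names. It does not replace the real tool call per server.

```python
import urllib.error
import urllib.request
from typing import Callable

# Unit name -> port, mirroring the service table in section 1.
PORTS = {
    "my": 3910,
    "admin": 3911,
    "prospector": 3912,
    "messenger": 3913,
    "analytics": 3914,
}


def http_status(url: str, timeout: float = 3.0) -> int:
    """Return the HTTP status for a GET, or 0 if the host is unreachable."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status
    except urllib.error.HTTPError as exc:
        return exc.code
    except (urllib.error.URLError, OSError):
        return 0


def check_gateways(host: str, fetch: Callable[[str], int] = http_status) -> dict[str, bool]:
    """Probe /healthz on every gateway port; fetch is injectable for dry runs."""
    return {
        name: fetch(f"http://{host}:{port}/healthz") == 200
        for name, port in PORTS.items()
    }
```

Calling `check_gateways("10.9.0.7")` from a mesh consumer returns a per-unit map of which gateways answered; the address is the example mesh IP from step 5 and should be derived from mesh-hosts.json in practice.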

8. Current state

  • Gateways: DOWN on old host. The five .mcp.json http entries still point at dead black.lan:3910-3914. Tool code is correct.
  • New target host: lilith-utils (dedicated utils droplet, provisioned via infrastructure/terraform/do/lilith-utils-mail.tf + phase-d-provision-utils-and-mail.sh). quinn.api INTERNAL remains on lilith-store-backend.
  • IaC for the new lilith-specific droplets lives in this repo's terraform/do/; mesh registration goes through net-tools + phase-b-mesh-join.
  • The standard deployments/@domains/quinn.mcp/deploy.sh (with QUINN_MCP_REMOTE=lilith-utils) plus the phase-d prep script are the drivers.
  • Operational MCP today: only the plum-local stdio servers (quinn-adwatch, experts).

The work: TF apply + phase-d provision (for both mail and utils), mesh join for the new droplet(s), run quinn.mcp/deploy.sh targeting lilith-utils (after setting tokens), repoint consumers to the new mesh IP, test.