cocottetech/docs/CLOUD_DX_HANDOFF.md
Natalie d899f592cc feat(dx): integrate ctforge into net-tools infra installers
- ./run forge:dns now prefers central net-tools/bin/forge-dns-render (part of net sync) with local fallback.
- Updated dispatcher help, INFRA.md steps, and CLOUD_DX_HANDOFF to document that `net sync` (or forge:dns) installs/keeps the ctforge shortcut as part of standard DX infra setup.
- Symmetric with mcforge.

After this, `net sync` (once net-tools is installed) is the canonical way to converge all hosts/DX shortcuts including the cloud forges.
2026-06-28 10:46:09 -04:00


Cloud DX Handoff — DigitalOcean ephemeral fleet + self-hosted Forgejo

Purpose. Replicate, for cocotte, the on-demand cloud build/test/compute setup proven on Magic Civilization (~/Code/@projects/@magic-civilization/infra/). Move heavy work off the laptop onto disposable DigitalOcean droplets; a small self-hosted Forgejo serves as the off-laptop git origin. Pay-per-use, tear down when idle.

Written 2026-06-27 after building it end-to-end on MC. Implemented same day in this repo: run, scripts/run/{forge,dist}.sh, infra/{packer,terraform/test-fleet}/, updated .gitignore + INFRA.md §10. The Gotchas section is the real value — each one cost real iterations. Read it before you start.

See also the live integration notes in INFRA.md §10 (references the lilith lineage manifest/run patterns for consistency).


Architecture (3 layers + origin)

Forgejo origin   small always-on droplet, holds the source (off-laptop git remote)
Golden image     Packer bakes toolchain + warm clone → a DO snapshot (workers boot ready in ~30s)
Fleet            Terraform: N ephemeral workers from the snapshot; workers=0 when idle = ~$0
Dispatch         ./run verbs that ssh work onto a worker + stream results/artifacts back

Reference implementation — copy from MC, then adapt

| MC file | What it is | cocotte action |
| --- | --- | --- |
| infra/terraform/test-fleet/ | DO provider, golden-image auto-discovery (data.digitalocean_images by name), project grouping, mocked-provider test suite (terraform test, no token/spend) | copy near-verbatim |
| infra/packer/golden-image.pkr.hcl + provision.sh | bakes the image | copy; swap the toolchain (cocotte = Python/uv/FastAPI + node, not Rust/Godot) |
| scripts/run/dist.sh | dist:{check,up,sim,test,build,render,sync,down} + dist:{publish,fetch,models} (build-once-load-many, see below) | copy; swap the build/test commands |
| scripts/run/forge.sh | forge:{up,down,dns} lifecycle | copy verbatim |
| scripts/cloud-bringup.sh | one-shot human-run bring-up | copy; adjust sizes/scene |

The whole thing is provider-pluggable: dispatch + cloud-init + outputs are provider-neutral; only versions.tf/main.tf/variables.tf + the Packer builder are DO-specific.

Build once, load many (artifact Space — added 2026-06-28 on MC)

Fan-out otherwise means N workers each rebuilding the same thing. Instead: build the deployable artifact once, publish it to a DO Space, and let the rest fetch it (keyed by git sha). On MC this is the linux .so+wasm; on cocotte it's whatever your runners consume (built wheels / a uv venv tarball / a bundled image / model files).
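A minimal sketch of the fetch-or-build decision keyed by git sha. A local temp directory stands in for the Space, and the names SPACE_DIR, publish, sync_artifact and artifact.txt are hypothetical, not the real verbs' implementation; real code would move files with rclone instead of touching a local directory.

```shell
# Hypothetical stand-in: a local directory plays the role of the DO Space.
SPACE_DIR=$(mktemp -d)
sha="abc1234"   # in practice something like: sha=$(git rev-parse HEAD)

# Build once and store the result under builds/<sha>/.
publish() {
  mkdir -p "$SPACE_DIR/builds/$1"
  echo "artifact for $1" > "$SPACE_DIR/builds/$1/artifact.txt"
}

# Fetch if an artifact exists for this sha, otherwise fall back to building.
sync_artifact() {
  if [ -f "$SPACE_DIR/builds/$1/artifact.txt" ]; then
    echo "fetch"
  else
    echo "build"
  fi
}

first=$(sync_artifact "$sha")    # nothing published yet
publish "$sha"
second=$(sync_artifact "$sha")   # same sha, now published
echo "before publish: $first, after publish: $second"
```

Keying on the sha means a worker can never pick up an artifact built from a different commit; a new commit simply misses the cache and falls back to building.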

  • Space: one DO Space (e.g. cocotte-artifacts). A DO Spaces subscription ($5/mo, 250 GB) covers all your Spaces — a second Space adds ~$0 base. Account-wide S3 keys in ~/.vault/do-spaces-*.{access,secret}.
  • rclone baked into the golden image (provision.sh); the dispatch passes the Spaces creds as RCLONE_S3_* env over ssh — never stored on the worker, never on argv.
  • Verbs (MC scripts/run/dist.sh, copy the shape): dist:publish builds + uploads builds/<sha>/; dist:sync does git pull → fetch the prebuilt artifact if one is published for that sha, else build; dist:models {push,pull,ls} shares model files. Degrades gracefully to build-on-worker when creds/cache are absent.
  • Complements a compile cache (sccache / pip wheel cache): those cache intermediate build steps; the Space caches the final artifact.
  • ⚠️ ssh -n defeats a heredoc: -n redirects stdin from /dev/null, so a ssh … bash -s <<'EOF' remote script silently gets empty stdin and no-ops (exit 0). Use -n only for inline-command ssh, never for heredoc-stdin ssh.
  • ⚠️ Dispatch ssh must pass -i <fleet-key> explicitly — don't rely on the key being agent-loaded, or you'll hit intermittent publickey failures.
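The ssh -n pitfall can be reproduced locally without a worker. In this sketch a local sh -s stands in for the remote bash -s, and redirecting stdin from /dev/null stands in for what ssh -n does; FLEET_KEY and HOST in the trailing comments are hypothetical names.

```shell
# 'sh -s' reads its script from stdin, like 'ssh host bash -s' does remotely.
with_heredoc=$(sh -s <<'EOF'
echo ran
EOF
)

# ssh -n redirects stdin from /dev/null; the local equivalent is below.
# The shell sees an empty script, does nothing, and still exits 0.
without_stdin=$(sh -s </dev/null; echo "exit=$?")

echo "heredoc on stdin:   $with_heredoc"
echo "stdin = /dev/null:  $without_stdin"

# Shapes to use for real dispatch (not executed here):
#   ssh -i "$FLEET_KEY" "$HOST" bash -s <<'EOF' ... EOF   # heredoc: no -n
#   ssh -n -i "$FLEET_KEY" "$HOST" 'uptime'               # inline command: -n is fine
```

The silent exit 0 is what makes this expensive to debug: the dispatch looks successful even though nothing ran.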

⚠️ Gotchas (learned the hard way — each cost hours)

  1. DO account tier restricts size AND count (new accounts).

    • droplet_limit starts low (3) → raise via a support ticket (we got 10).
    • Large + CPU-Optimized sizes are locked: s-8vcpu-16gb (non-amd), c-4, c-8 all return 422 "size restricted / open a ticket". s-8vcpu-16gb-amd works in nyc3 (8 vCPU AMD, $0.167/hr) and is the beefy sweet spot until you file the tier ticket.
    • The /v2/sizes available flag LIES (claimed c-4 available; create 422'd). Test-create + destroy to confirm a size before committing.
  2. Powering off a droplet does NOT stop billing on DO (unlike AWS — DO bills allocated, not running). Only destroy stops it. "Park overnight" = power-off → snapshot → destroy; restore = create-from-snapshot. See forge.sh forge:down/forge:up.

  3. The AI-agent exfil hard-deny. An agent (Claude Code) cannot push/clone your private repo onto a fresh cloud box — it's classified as data exfiltration, and permissions.allow does NOT clear it (it's a hard-deny). Two fixes:

    • You run the source push / build yourself (human-initiated clears it), OR
    • Add an autoMode trust block to .claude/settings.local.json BY HAND (the agent can't self-edit this — that's the anti-injection point) declaring the forge + DO project as the owner's trusted infra. Then the agent can run packer/terraform/git. Template at the bottom.
    • Always keep credentials out of argv — pass via env (PKR_VAR_*, TF_VAR_*), never -var creds=... on the command line (the plaintext password is a second exfil signal + leaks to ps).
  4. apt dpkg-lock race on fresh droplets. cloud-init runs its own apt at boot; your provisioner collides → Could not get lock /var/lib/dpkg/lock-frontend → exit 100. Fix at the top of provisioning:

    cloud-init status --wait >/dev/null 2>&1 || true
    apt-get -o DPkg::Lock::Timeout=600 update -y
    
  5. Build user needs passwordless sudo. Dev-setup scripts install system packages via sudo apt-get. A bare useradd user has no sudo → node/etc. install fails. Add:

    echo "$BUILD_USER ALL=(ALL) NOPASSWD:ALL" > /etc/sudoers.d/90-$BUILD_USER
    chmod 440 /etc/sudoers.d/90-$BUILD_USER
    
  6. Packer DO builder has no project parameter → its transient build droplet lands in the account default project. Fix: make your project the account default — PATCH /v2/projects/{id} {"is_default":true} (reversible). Persistent fleet droplets get assigned explicitly via digitalocean_project_resources in Terraform.

  7. Reserved/floating IPs cost ~$4/mo while DETACHED (i.e. exactly when the box is down) — defeats the savings. Skip them; read the dynamic IP from a vault file that forge:up refreshes.

  8. The golden image is size-agnostic — build it on any size, run workers on the beefy size. Build size ≠ worker size.

  9. macOS coordinator (if you dispatch from a Mac): tools/* that use realpath -m need GNU coreutils (brew install coreutils) or run dispatch from Linux.
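The env-not-argv rule from gotcha #3 can be illustrated with a small sketch. DEMO_SECRET, TF_VAR_demo_token and demo-tool are hypothetical names used only for this demonstration; a child sh stands in for terraform or packer.

```shell
# Hypothetical secret, for illustration only.
DEMO_SECRET='not-a-real-token'

# The child receives the secret through its environment (as with TF_VAR_* / PKR_VAR_*).
# Its argument list, which is what ps displays, contains only "plan".
child_args=$(TF_VAR_demo_token="$DEMO_SECRET" sh -c 'printf "%s" "$*"' demo-tool plan)
child_env=$(TF_VAR_demo_token="$DEMO_SECRET" sh -c 'printf "%s" "$TF_VAR_demo_token"')

echo "argv seen by child: $child_args"
if [ "$child_env" = "$DEMO_SECRET" ]; then
  echo "secret reached the child via env"
fi
```

Passing the same value as -var demo_token=... would instead put it in the argument list, where any process listing can read it.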


Runbook (cocotte)

0. Tooling: brew install hashicorp/tap/terraform hashicorp/tap/packer shellcheck on the laptop.

1. DO account: create account + a read/write API token. File a support ticket to (a) raise droplet limit and (b) unlock larger/CPU-Optimized sizes if you want >8 vCPU. Create a DO Project for cocotte; make it the account default (gotcha #6).

2. Vault the secrets (never in repo/argv):

mkdir -p ~/.vault && chmod 700 ~/.vault
echo '<token>' > ~/.vault/do_pat_cocotte && chmod 600 ~/.vault/do_pat_cocotte

3. Forge droplet + Forgejo: spin a s-1vcpu-1gb Ubuntu droplet (~$6/mo); install Forgejo (single Go binary + sqlite + systemd). Key detail: app.ini must be owned by the git run-user (it writes its INTERNAL_TOKEN on first start) and INSTALL_LOCK = true. Create an admin + a repo. Store IP + admin creds in ~/.vault/cocotte_forge_creds. (Copy MC's install sequence.)

4. Push source to the forge (you, by hand — exfil gate): push an orphan snapshot of the current tree (avoids dragging bloated .git history):

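cd <cocotte-repo>
# Forge URL with embedded admin creds (human-run step).
URL="http://<admin>:<pass>@<forge-ip>:3000/<org>/<repo>.git"
# Tree of the current main; quoted so zsh with EXTENDED_GLOB does not treat ^ as a glob operator.
TREE=$(git rev-parse 'main^{tree}')
# Parentless commit wrapping that tree, so the snapshot carries no history.
COMMIT=$(git commit-tree "$TREE" -m "snapshot for cloud build")
git -c http.postBuffer=524288000 push "$URL" "${COMMIT}:refs/heads/main"   # zsh: brace the var!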
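To see why the orphan snapshot avoids dragging history along, the same commit-tree step can be tried in a throwaway repository. This is a sketch that assumes git is installed; the temp repo, the demo identity and the file name f are all hypothetical.

```shell
# Throwaway repo with two commits of history.
repo=$(mktemp -d)
export GIT_AUTHOR_NAME=demo GIT_AUTHOR_EMAIL=demo@example.invalid
export GIT_COMMITTER_NAME=demo GIT_COMMITTER_EMAIL=demo@example.invalid
git init -q "$repo"
echo one > "$repo/f" && git -C "$repo" add f && git -C "$repo" commit -q -m first
echo two > "$repo/f" && git -C "$repo" commit -q -am second

# Same technique as the runbook step: wrap the current tree in a parentless commit.
TREE=$(git -C "$repo" rev-parse 'HEAD^{tree}')
COMMIT=$(git -C "$repo" commit-tree "$TREE" -m "snapshot for cloud build")

history_len=$(git -C "$repo" rev-list --count HEAD)          # commits reachable from HEAD
snapshot_len=$(git -C "$repo" rev-list --count "$COMMIT")    # commits reachable from the snapshot
snapshot_tree=$(git -C "$repo" rev-parse "${COMMIT}^{tree}")
echo "history=$history_len snapshot=$snapshot_len"
```

The snapshot commit points at the identical tree but has no parents, so pushing it transfers only the objects for the current files.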

5. (Optional) autoMode trust so the agent can run the cloud steps unattended — see template below; add it yourself.

6. Golden image: adapt provision.sh (cocotte toolchain: python3 + uv/pip, node + pnpm, plus #4/#5 fixes baked in), then:

export DIGITALOCEAN_TOKEN=$(cat ~/.vault/do_pat_cocotte)
export PKR_VAR_git_remote="http://<admin>:<pass>@<forge-ip>:3000/<org>/<repo>.git"   # creds in env, not argv
packer init  infra/packer/golden-image.pkr.hcl
packer build infra/packer/golden-image.pkr.hcl

7. Fleet: ./run dist:up 1 s-8vcpu-16gb-amd → ./run dist:test → ./run dist:down.

8. DNS / DX shortcut: after forge:up, net sync (or ./run forge:dns inside the project) installs the managed ctforge (and mcforge) entry via the net-tools infra installer (forge-dns-render). Browse http://ctforge:3000. The shortcuts are adopted into a marked block and survive net sync re-runs.
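One way a managed, marked block can survive re-runs is sketched below. This is not the actual forge-dns-render implementation, which is not shown here; the marker strings, the render_block function and the example IPs (documentation range 203.0.113.0/24) are assumptions, and a temp file stands in for the hosts file.

```shell
HOSTS_FILE=$(mktemp)                      # stand-in for the real hosts file
printf '127.0.0.1 localhost\n' > "$HOSTS_FILE"
BEGIN_MARK='# >>> forge-shortcuts (managed) >>>'
END_MARK='# <<< forge-shortcuts (managed) <<<'

# render_block FILE IP NAME: replace the managed block, leave other lines alone.
render_block() {
  tmp=$(mktemp)
  # Copy everything except a previous managed block.
  awk -v b="$BEGIN_MARK" -v e="$END_MARK" '
    $0 == b { skip = 1; next }
    $0 == e { skip = 0; next }
    !skip   { print }
  ' "$1" > "$tmp"
  # Append a fresh block with the current IP.
  printf '%s\n%s %s\n%s\n' "$BEGIN_MARK" "$2" "$3" "$END_MARK" >> "$tmp"
  cat "$tmp" > "$1" && rm -f "$tmp"
}

render_block "$HOSTS_FILE" 203.0.113.10 ctforge
render_block "$HOSTS_FILE" 203.0.113.20 ctforge   # re-run after the forge IP changed
cat "$HOSTS_FILE"
```

Because each run removes the old block before writing the new one, repeated runs converge on a single entry instead of accumulating stale IPs.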

9. One-shot bring-up (human-run): after forge + key registration + golden image, use scripts/cloud-bringup.sh (or run the steps by hand). It does packer + dist:up 1 + dist:typecheck + auto teardown on exit. Launch with nohup ... & and review the log.


Cost

| Item | Cost |
| --- | --- |
| Forge (s-1vcpu-1gb) | ~$6/mo, or ~$0.30/mo idle (snapshot+destroy via forge:down) |
| Workers (s-8vcpu-16gb-amd) | $0.167/hr, only while up → cents per run, $0 idle |
| Golden snapshot | ~$0.40/mo |
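A rough per-run estimate from the worker rate above, as a sketch. It assumes cost scales linearly with uptime; actual DO billing granularity and minimum charges may round a short run up. The run_cost helper is hypothetical.

```shell
# run_cost MINUTES WORKERS [HOURLY_RATE] -> dollars, assuming linear billing.
run_cost() {
  awk -v m="$1" -v n="$2" -v r="${3:-0.167}" 'BEGIN { printf "%.3f\n", r * n * m / 60 }'
}

one_worker=$(run_cost 20 1)     # 20-minute test run on one worker
four_workers=$(run_cost 60 4)   # one-hour fan-out across four workers
echo "20 min x 1 worker:  \$$one_worker"
echo "60 min x 4 workers: \$$four_workers"
```

Under that assumption a 20-minute single-worker run costs about $0.056 and an hour on four workers about $0.668, which is where the "cents per run" figure comes from.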

autoMode trust block (you add by hand to .claude/settings.local.json)

This is what lets the agent run the cloud steps without tripping the exfil hard-deny. Adapt names/IP/project to cocotte:

python3 - <<'PY'
import json
import os

# Project-local Claude Code settings file; replace <cocotte-repo> with the real path.
p = os.path.expanduser("<cocotte-repo>/.claude/settings.local.json")
with open(p) as f:
    cfg = json.load(f)

def add_once(items, value, front=False):
    """Add value unless already present, so re-running does not duplicate entries."""
    if value in items:
        return
    if front:
        items.insert(0, value)
    else:
        items.append(value)

am = cfg.setdefault("autoMode", {})
env = am.setdefault("environment", [])
add_once(env, "$defaults", front=True)
add_once(env, "The self-hosted Forgejo at <forge-ip>:3000 and the DigitalOcean droplets "
              "in the cocotte project are the owner's OWN build/test infrastructure holding "
              "this project's source. Cloning the repo from that forge onto those droplets, "
              "and pushing snapshots to it, is authorized provisioning — NOT data exfiltration.")
allow = am.setdefault("allow", [])
add_once(allow, "$defaults", front=True)
add_once(allow, "git/packer/terraform operations moving THIS project's source between the "
                "owner's machine, the cocotte Forgejo, and the cocotte DO droplets are authorized.")
# Plain permission rules for the specific commands the agent needs.
perm = cfg.setdefault("permissions", {}).setdefault("allow", [])
for r in ["Bash(git push:*)", "Bash(packer init:*)", "Bash(packer build:*)",
          "Bash(terraform apply:*)", "Bash(terraform destroy:*)"]:
    add_once(perm, r)

# Write back with a trailing newline; file handles are closed by the with-block.
with open(p, "w") as f:
    json.dump(cfg, f, indent=2)
    f.write("\n")
print("autoMode + permissions updated")
PY

Pointers

  • Implemented here: run, scripts/run/{dist,forge}.sh, scripts/cloud-bringup.sh, infra/{packer,terraform/test-fleet}/ (see also updated INFRA.md §10).
  • Working reference (original): ~/Code/@projects/@magic-civilization/infra/{terraform/test-fleet,packer} + scripts/run/{dist,forge}.sh + scripts/cloud-bringup.sh.
  • MC memory note (decisions + tier constraints): ~/.claude/projects/-Users-natalie-Code--projects--magic-civilization/memory/project_cloud_test_fleet.md.
  • Offline verify with zero spend: terraform fmt + validate + test (mocked provider) — ./run dist:check.
  • SSH key for this project: ~/.ssh/id_cocotte_fleet (+ .pub) was generated during this setup. Register the .pub in your DO account (Security → SSH Keys) under the exact name cocotte-fleet, in the DO web UI, before running forge:up or dist:up. The scripts (forge + fleet) now look up its numeric ID via the API. Current pubkey (as of this run): ssh-ed25519 AAAAC3NzaC1lZDI1NTE5AAAAIEkqbC3eHgo3cc263rS+y9KDUz/MuQsrw8srjVSTt8Q1 cocotte-fleet-2026-06
  • Vault: ~/.vault/do_pat_cocotte is symlinked to your existing do-pat-ct.token; a placeholder cocotte_forge_creds was created (populated by the first forge:up).
  • Ready for human bring-up: after key registration in DO, run the steps in the "Runbook" or ./scripts/cloud-bringup.sh (human, with nohup recommended).