- ./run forge:dns now prefers central net-tools/bin/forge-dns-render (part of net sync) with local fallback. - Updated dispatcher help, INFRA.md steps, and CLOUD_DX_HANDOFF to document that `net sync` (or forge:dns) installs/keeps the ctforge shortcut as part of standard DX infra setup. - Symmetric with mcforge. After this, `net sync` (once net-tools is installed) is the canonical way to converge all hosts/DX shortcuts including the cloud forges.
# Cloud DX Handoff — DigitalOcean ephemeral fleet + self-hosted Forgejo

**Purpose.** Replicate, for cocotte, the on-demand cloud build/test/compute setup proven on Magic Civilization (`~/Code/@projects/@magic-civilization/infra/`). Move heavy work off the laptop onto disposable DigitalOcean droplets; a small self-hosted Forgejo is the off-laptop git origin. Pay-per-use, tear down when idle.

> Written 2026-06-27 after building it end-to-end on MC. **Implemented same day** in this repo: `run`, `scripts/run/{forge,dist}.sh`, `infra/{packer,terraform/test-fleet}/`, updated `.gitignore` + `INFRA.md §10`. The **Gotchas** section is the real value — each one cost real iterations. Read it before you start.

See also the live integration notes in `INFRA.md §10` (which references the lilith lineage manifest/run patterns for consistency).

---

## Architecture (3 layers + origin)

```
Forgejo origin   small always-on droplet, holds the source (off-laptop git remote)
Golden image     Packer bakes toolchain + warm clone → a DO snapshot (workers boot ready in ~30s)
Fleet            Terraform: N ephemeral workers from the snapshot; workers=0 when idle = ~$0
Dispatch         ./run verbs that ssh work onto a worker + stream results/artifacts back
```

## Reference implementation — copy from MC, then adapt

| MC file | What it is | cocotte action |
|---|---|---|
| `infra/terraform/test-fleet/` | DO provider, golden-image auto-discovery (`data.digitalocean_images` by name), project grouping, **mocked-provider test suite** (`terraform test`, no token/spend) | copy near-verbatim |
| `infra/packer/golden-image.pkr.hcl` + `provision.sh` | bakes the image | copy; **swap the toolchain** (cocotte = Python/uv/FastAPI + node, not Rust/Godot) |
| `scripts/run/dist.sh` | `dist:{check,up,sim,test,build,render,sync,down}` + `dist:{publish,fetch,models}` (build-once-load-many, see below) | copy; swap the build/test commands |
| `scripts/run/forge.sh` | `forge:{up,down,dns}` lifecycle | copy verbatim |
| `scripts/cloud-bringup.sh` | one-shot human-run bring-up | copy; adjust sizes/scene |

The whole thing is provider-pluggable: dispatch + cloud-init + outputs are provider-neutral; only `versions.tf`/`main.tf`/`variables.tf` + the Packer builder are DO-specific.

### Build once, load many (artifact Space — added 2026-06-28 on MC)

Without a shared artifact, fan-out means N workers each rebuilding the same thing. Instead: **build the deployable artifact once, publish it to a DO Space, and let the rest fetch it** (keyed by git sha). On MC this is the linux `.so`+wasm; on cocotte it's whatever your runners consume (built wheels / a `uv` venv tarball / a bundled image / model files).

- **Space**: one DO Space (e.g. `cocotte-artifacts`). A DO Spaces *subscription* ($5/mo, 250 GB) covers **all** your Spaces — a second Space adds ~$0 base. Account-wide S3 keys live in `~/.vault/do-spaces-*.{access,secret}`.
- **`rclone` baked into the golden image** (`provision.sh`); the dispatch passes the Spaces creds as `RCLONE_S3_*` env over ssh — **never stored on the worker, never on argv**.
- **Verbs** (MC `scripts/run/dist.sh`, copy the shape): `dist:publish` builds + uploads `builds/<sha>/`; `dist:sync` does `git pull` → **fetch the prebuilt artifact if published for that sha, else build**; `dist:models {push,pull,ls}` shares model files. Degrades gracefully to build-on-worker when creds/cache are absent.
- **Complements** a compile cache (sccache / pip wheel cache): those cache *intermediate* build steps; the Space caches the *final* artifact.
- ⚠️ **`ssh -n` defeats a heredoc** — `-n` redirects stdin from `/dev/null`, so a `ssh … bash -s <<'EOF'` remote script silently gets empty stdin and no-ops (exit 0). Use `-n` only for inline-command ssh, never for heredoc-stdin ssh.
- ⚠️ **Dispatch ssh must pass `-i <fleet-key>`** explicitly — don't rely on the key being agent-loaded, or you'll hit intermittent `publickey` failures.
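
The "fetch if published for that sha, else build" rule behind `dist:sync` can be sketched as a small shell function. This is a minimal illustration, not MC's actual `dist.sh`: the names `artifact_prefix` and `fetch_or_build` are hypothetical, and the fetch and build steps are passed in as commands so the sketch makes no assumption about how `rclone` is invoked.

```sh
# Hypothetical helpers: names and layout are illustrative, not MC's dist.sh.

# Space prefix for a given commit, matching the builds/<sha>/ layout.
artifact_prefix() {
  printf 'builds/%s/\n' "$1"
}

# fetch_or_build <sha> <fetch-cmd> <build-cmd>
# <fetch-cmd> receives the prefix and must return non-zero on a cache miss
# or when credentials are absent; <build-cmd> is the on-worker fallback.
fetch_or_build() {
  sha=$1
  fetch_cmd=$2
  build_cmd=$3
  prefix=$(artifact_prefix "$sha")
  if "$fetch_cmd" "$prefix"; then
    echo "fetched $prefix"
  else
    "$build_cmd" && echo "built $sha on worker"
  fi
}
```

A caller would supply its own two functions, for example `fetch_or_build "$(git rev-parse HEAD)" fetch_from_space build_locally`. Because a failed fetch simply falls through to the build, missing credentials degrade to build-on-worker rather than failing the run.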

---

## ⚠️ Gotchas (learned the hard way — each cost hours)

1. **DO account tier restricts size AND count (new accounts).**
   - `droplet_limit` starts low (3) → raise via a support ticket (we got 10).
   - **Large + CPU-Optimized sizes are locked**: `s-8vcpu-16gb` (non-amd), `c-4`, `c-8` all return `422 "size restricted / open a ticket"`. **`s-8vcpu-16gb-amd` works** in nyc3 (8 vCPU AMD, $0.167/hr) and is the beefy sweet spot until you file the tier ticket.
   - The `/v2/sizes` `available` flag **LIES** (claimed `c-4` available; create 422'd). **Test-create + destroy** to confirm a size before committing.

2. **Powering off a droplet does NOT stop billing on DO** (unlike AWS — DO bills allocated, not running). Only **destroy** stops it. "Park overnight" = power-off → **snapshot → destroy**; restore = create-from-snapshot. See `forge.sh` `forge:down`/`forge:up`.

3. **The AI-agent exfil hard-deny.** An agent (Claude Code) **cannot** push/clone your *private* repo onto a *fresh* cloud box — it's classified as data exfiltration, and `permissions.allow` does NOT clear it (it's a hard-deny). Two fixes:
   - **You run the source push / build yourself** (human-initiated clears it), OR
   - Add an **`autoMode` trust block to `.claude/settings.local.json` BY HAND** (the agent can't self-edit this — that's the anti-injection point) declaring the forge + DO project as the owner's trusted infra. Then the agent can run packer/terraform/git. Template at the bottom.

   In both cases, keep **credentials out of argv** — pass via env (`PKR_VAR_*`, `TF_VAR_*`), never `-var creds=...` on the command line (the plaintext password is a second exfil signal + leaks to `ps`).

4. **apt dpkg-lock race on fresh droplets.** cloud-init runs its *own* apt at boot; your provisioner collides → `Could not get lock /var/lib/dpkg/lock-frontend` → exit 100. Fix at the top of provisioning:

   ```sh
   # Wait for cloud-init to finish, then let apt wait up to 10 min for the lock.
   cloud-init status --wait >/dev/null 2>&1 || true
   apt-get -o DPkg::Lock::Timeout=600 update -y
   ```

5. **Build user needs passwordless sudo.** Dev-setup scripts install system packages via `sudo apt-get`. A bare `useradd` user has no sudo → node/etc. install fails. Add:

   ```sh
   # Run as root during provisioning; BUILD_USER must already be set.
   echo "$BUILD_USER ALL=(ALL) NOPASSWD:ALL" > "/etc/sudoers.d/90-$BUILD_USER"
   chmod 440 "/etc/sudoers.d/90-$BUILD_USER"
   ```

6. **Packer DO builder has no `project` parameter** → its transient build droplet lands in the **account default project**. Fix: make your project the account default — `PATCH /v2/projects/{id} {"is_default":true}` (reversible). Persistent fleet droplets get assigned explicitly via `digitalocean_project_resources` in Terraform.

7. **Reserved/floating IPs cost ~$4/mo while DETACHED** (i.e. exactly when the box is down) — defeats the savings. Skip them; read the **dynamic IP from a vault file** that `forge:up` refreshes.

8. **The golden image is size-agnostic** — build it on any size, run workers on the beefy size. Build size ≠ worker size.

9. **macOS coordinator** (if you dispatch from a Mac): `tools/*` that use `realpath -m` need GNU coreutils (`brew install coreutils`) or run dispatch from Linux.
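
For gotcha 9, one possible workaround is a small wrapper that picks whichever GNU `realpath` is present. This is a sketch under assumptions: the function name `realpath_m` is hypothetical and is not something the `tools/*` scripts are known to use, and it relies on Homebrew's coreutils installing GNU tools with a `g` prefix (`grealpath`).

```sh
# realpath_m PATH: canonicalize PATH even if parts of it do not exist,
# using whichever GNU realpath is available. Hypothetical helper.
realpath_m() {
  if command -v grealpath >/dev/null 2>&1; then
    grealpath -m -- "$1"     # Homebrew coreutils: GNU realpath under a g prefix
  elif realpath -m / >/dev/null 2>&1; then
    realpath -m -- "$1"      # the system realpath already understands -m
  else
    echo "realpath -m unavailable: brew install coreutils, or dispatch from Linux" >&2
    return 1
  fi
}
```

Failing loudly in the last branch is deliberate: a silent fallback to a non-GNU `realpath` would change how missing paths are handled.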

---

## Runbook (cocotte)

**0. Tooling:** `brew install hashicorp/tap/terraform hashicorp/tap/packer shellcheck` on the laptop.

**1. DO account:** create account + a **read/write API token**. File a support ticket to (a) raise droplet limit and (b) unlock larger/CPU-Optimized sizes if you want >8 vCPU. Create a DO **Project** for cocotte; make it the account default (gotcha #6).

**2. Vault the secrets** (never in repo/argv):

```sh
mkdir -p ~/.vault && chmod 700 ~/.vault
echo '<token>' > ~/.vault/do_pat_cocotte && chmod 600 ~/.vault/do_pat_cocotte
```
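
When a script later reads from the vault, it can check the file mode before trusting the secret. The helper below is a sketch, and `vault_read` is a hypothetical name rather than something in the repo's scripts. It uses `find -L` so that a vault entry that is a symlink is judged by its target's permissions.

```sh
# vault_read FILE: print a secret only if FILE is readable and mode 600.
# Hypothetical helper, not part of the repo's scripts.
vault_read() {
  f=$1
  if [ ! -r "$f" ]; then
    echo "vault: cannot read $f" >&2
    return 1
  fi
  # `-perm 600` with no prefix matches the permission bits exactly;
  # -L follows a symlink and checks the target file.
  if [ -z "$(find -L "$f" -perm 600)" ]; then
    echo "vault: $f is not mode 600, refusing" >&2
    return 1
  fi
  cat "$f"
}

# Usage: the token reaches child processes via env, never argv.
# export DIGITALOCEAN_TOKEN="$(vault_read ~/.vault/do_pat_cocotte)"
```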

**3. Forge droplet + Forgejo:** spin up an `s-1vcpu-1gb` Ubuntu droplet (~$6/mo); install Forgejo (single Go binary + sqlite + systemd). Key detail: `app.ini` must be **owned by the `git` run-user** (it writes its `INTERNAL_TOKEN` on first start) and must set `INSTALL_LOCK = true`. Create an admin + a repo. Store IP + admin creds in `~/.vault/cocotte_forge_creds`. (Copy MC's install sequence.)

**4. Push source to the forge** (you, by hand — exfil gate): push an **orphan snapshot** of the current tree (avoids dragging bloated `.git` history):

```sh
cd <cocotte-repo>
URL="http://<admin>:<pass>@<forge-ip>:3000/<org>/<repo>.git"
# Quote the revision so zsh does not treat ^ or {} as glob syntax.
TREE=$(git rev-parse 'main^{tree}')
# A commit with no parents: same files as main, none of its history.
COMMIT=$(git commit-tree "$TREE" -m "snapshot for cloud build")
git -c http.postBuffer=524288000 push "$URL" "${COMMIT}:refs/heads/main"   # zsh: brace the var!
```
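
The orphan-snapshot push in step 4 can be tried safely against a throwaway local bare repository standing in for the forge. The sketch below assumes only that `git` is installed; the temp directory layout, the demo identity, and the two sample commits are all made up for illustration.

```sh
# Throwaway sandbox: a source repo with history, and a bare repo as the "forge".
work=$(mktemp -d)
export GIT_AUTHOR_NAME=demo GIT_AUTHOR_EMAIL=demo@example.invalid
export GIT_COMMITTER_NAME=demo GIT_COMMITTER_EMAIL=demo@example.invalid

git init -q "$work/src"
git init -q --bare "$work/forge.git"
cd "$work/src"
git symbolic-ref HEAD refs/heads/main   # do not depend on git's default branch name

echo one > f.txt && git add f.txt && git commit -q -m "first"
echo two > f.txt && git commit -q -am "second"

# Same two commands as the runbook: take main's tree, wrap it in a parentless commit.
TREE=$(git rev-parse 'main^{tree}')
COMMIT=$(git commit-tree "$TREE" -m "snapshot for cloud build")
git push -q "$work/forge.git" "${COMMIT}:refs/heads/main"

history_len=$(git rev-list --count main)
snapshot_len=$(git --git-dir="$work/forge.git" rev-list --count main)
echo "local history: $history_len commits; pushed snapshot: $snapshot_len commit"
```

The remote ends up with a single commit whose tree is identical to local `main`, which is why the push stays small no matter how large the local history is.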

**5. (Optional) autoMode trust** so the agent can run the cloud steps unattended — see template below; add it yourself.

**6. Golden image:** adapt `provision.sh` (cocotte toolchain: python3 + uv/pip, node + pnpm, plus the gotcha #4/#5 fixes baked in), then:

```sh
export DIGITALOCEAN_TOKEN="$(cat ~/.vault/do_pat_cocotte)"
export PKR_VAR_git_remote="http://<admin>:<pass>@<forge-ip>:3000/<org>/<repo>.git"   # creds in env, not argv
packer init infra/packer/golden-image.pkr.hcl
packer build infra/packer/golden-image.pkr.hcl
```

**7. Fleet:** `./run dist:up 1 s-8vcpu-16gb-amd` → `./run dist:test` → `./run dist:down`.

**8. DNS / DX shortcut:** after `forge:up`, `net sync` (or `./run forge:dns` inside the project) installs the managed `ctforge` (and `mcforge`) entry via the net-tools infra installer (`forge-dns-render`). Browse `http://ctforge:3000`. The shortcuts are adopted into a marked block and survive `net sync` re-runs.
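
The "marked block that survives re-runs" idea in step 8 can be illustrated with a small idempotent rewrite. This is only a sketch of the general technique: the marker strings and the `render_block` name are hypothetical, and the real `forge-dns-render` may format its block differently.

```sh
# render_block FILE ENTRY: keep exactly one managed block in FILE containing ENTRY.
# Marker text is an assumption for illustration, not forge-dns-render's format.
render_block() {
  file=$1
  entry=$2
  begin='# >>> forge-dns managed >>>'
  end='# <<< forge-dns managed <<<'
  tmp=$(mktemp)
  # Copy every line outside the markers, dropping any previous managed block.
  awk -v b="$begin" -v e="$end" '
    $0 == b { skip = 1; next }
    $0 == e { skip = 0; next }
    !skip
  ' "$file" > "$tmp"
  # Append a fresh block with the current entry.
  printf '%s\n%s\n%s\n' "$begin" "$entry" "$end" >> "$tmp"
  # Overwrite in place so the original file keeps its owner and mode.
  cat "$tmp" > "$file" && rm -f "$tmp"
}
```

Calling it twice with different addresses leaves one block holding only the latest entry, and lines outside the markers are untouched. That property is what lets a sync command be re-run after every `forge:up` without accumulating stale entries.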

**9. One-shot bring-up (human-run):** after forge + key registration + golden image, use `scripts/cloud-bringup.sh` (or run the steps by hand). It runs packer, `dist:up 1`, `dist:typecheck`, and an automatic teardown on exit. Launch it with `nohup ... &` and review the log afterwards.
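
The "auto teardown on exit" behaviour in step 9 is commonly built on a shell `EXIT` trap. The sketch below shows that pattern in isolation; `with_teardown` is a hypothetical name, and nothing here is taken from the actual `cloud-bringup.sh`.

```sh
# with_teardown <up-cmd> <work-cmd> <down-cmd>
# Runs up then work inside a subshell whose EXIT trap always runs down,
# so the teardown happens even when up or work fails.
with_teardown() {
  (
    trap "$3" EXIT
    "$1" && "$2"
  )
}
```

With wrappers such as `up_fleet() { ./run dist:up 1 s-8vcpu-16gb-amd; }`, `checks() { ./run dist:test; }`, and `down_fleet() { ./run dist:down; }`, the call `with_teardown up_fleet checks down_fleet` destroys the workers whether or not the checks pass, while the subshell still returns the failing step's exit status.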

---

## Cost

| Item | Cost |
|---|---|
| Forge (s-1vcpu-1gb) | ~$6/mo, or ~$0.30/mo idle (snapshot+destroy via `forge:down`) |
| Workers (s-8vcpu-16gb-amd) | $0.167/hr, only while up → cents per run, $0 idle |
| Golden snapshot | ~$0.40/mo |
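
To make "cents per run" concrete, the worker line can be turned into a quick estimate. This sketch assumes cost scales linearly with minutes at the $0.167/hr rate from the table and ignores any minimum billing increment DigitalOcean applies; `run_cost` is a hypothetical helper name.

```sh
# run_cost WORKERS MINUTES [RATE_PER_HOUR]: estimated dollars for one fleet run.
# Linear proration is an assumption; real billing may round up.
run_cost() {
  awk -v n="$1" -v m="$2" -v r="${3:-0.167}" \
    'BEGIN { printf "%.3f\n", n * r * m / 60 }'
}

run_cost 1 20    # one worker for 20 minutes  -> prints 0.056
run_cost 4 30    # four workers for 30 minutes -> prints 0.334
```

Under those assumptions a single 20-minute run costs under six cents, and even a four-worker half-hour stays around a third of a dollar.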

---

## autoMode trust block (you add by hand to `.claude/settings.local.json`)

This is what lets the agent run the cloud steps without tripping the exfil hard-deny. Adapt names/IP/project to cocotte:

```sh
python3 - <<'PY'
import json, os

# Replace <cocotte-repo> and <forge-ip> with real values before running.
p = os.path.expanduser("<cocotte-repo>/.claude/settings.local.json")

# Start from the existing settings, or from an empty object if the file is missing.
cfg = {}
if os.path.exists(p):
    with open(p) as f:
        cfg = json.load(f)

def add_once(items, value, front=False):
    """Add value only if absent, so re-running this script adds no duplicates."""
    if value not in items:
        if front:
            items.insert(0, value)
        else:
            items.append(value)

am = cfg.setdefault("autoMode", {})
env = am.setdefault("environment", [])
add_once(env, "$defaults", front=True)
add_once(env, "The self-hosted Forgejo at <forge-ip>:3000 and the DigitalOcean droplets "
              "in the cocotte project are the owner's OWN build/test infrastructure holding "
              "this project's source. Cloning the repo from that forge onto those droplets, "
              "and pushing snapshots to it, is authorized provisioning — NOT data exfiltration.")
allow = am.setdefault("allow", [])
add_once(allow, "$defaults", front=True)
add_once(allow, "git/packer/terraform operations moving THIS project's source between the "
                "owner's machine, the cocotte Forgejo, and the cocotte DO droplets are authorized.")

perm = cfg.setdefault("permissions", {}).setdefault("allow", [])
for r in ["Bash(git push:*)", "Bash(packer init:*)", "Bash(packer build:*)",
          "Bash(terraform apply:*)", "Bash(terraform destroy:*)"]:
    add_once(perm, r)

# Write back with a trailing newline.
with open(p, "w") as f:
    json.dump(cfg, f, indent=2)
    f.write("\n")
print("autoMode + permissions updated")
PY
```

---

## Pointers

- Implemented here: `run`, `scripts/run/{dist,forge}.sh`, `scripts/cloud-bringup.sh`, `infra/{packer,terraform/test-fleet}/` (see also updated `INFRA.md §10`).
- Working reference (original): `~/Code/@projects/@magic-civilization/infra/{terraform/test-fleet,packer}` + `scripts/run/{dist,forge}.sh` + `scripts/cloud-bringup.sh`.
- MC memory note (decisions + tier constraints): `~/.claude/projects/-Users-natalie-Code--projects--magic-civilization/memory/project_cloud_test_fleet.md`.
- Offline verify with **zero spend**: `terraform fmt + validate + test` (mocked provider) — `./run dist:check`.
- SSH key for this project: **we generated** `~/.ssh/id_cocotte_fleet` + `.pub` during this run. **You must register the `.pub` in your DO account (Security → SSH Keys) under the exact name `cocotte-fleet`**. The scripts (forge + fleet) now look up the numeric key ID via the API automatically. Do this in the DO web UI before running `forge:up` or `dist:up`.
  Current pubkey (as of this run):
  `ssh-ed25519 AAAAC3NzaC1lZDI1NTE5AAAAIEkqbC3eHgo3cc263rS+y9KDUz/MuQsrw8srjVSTt8Q1 cocotte-fleet-2026-06`
- Vault: `~/.vault/do_pat_cocotte` is symlinked to your existing `do-pat-ct.token`; a placeholder `cocotte_forge_creds` has been created (populated by the first `forge:up`).
- Ready for human bring-up: after key registration in DO, run the steps in the "Runbook" or `./scripts/cloud-bringup.sh` (human-run, `nohup` recommended).