diff --git a/docs/features/gpu-cost-control.md b/docs/features/gpu-cost-control.md new file mode 100644 index 0000000..04bcf4d --- /dev/null +++ b/docs/features/gpu-cost-control.md @@ -0,0 +1,103 @@ +# GPU cost-control — presence-driven warm-up, live meter, pause + +> The OSS uncensored model runs on an **on-demand DO GPU droplet** (see +> [`draft-engine.md`](./draft-engine.md)). The droplet bills **$3.39/hr for its +> existence** (`provision` → `teardown`), warm or idle — **not** for inference. +> This feature makes that spend visible, consent-gated, and operator-controllable. + +## The cost reality (design premise) + +DO charges from droplet creation to destruction. Therefore: + +- **Warm-up** = `provision` + load weights to VRAM. **Billing starts at provision**, + minutes before the model answers. +- **The only way to stop billing is `teardown`.** Routing work back to the + `template` engine while keeping the droplet warm saves **nothing**. +- The cost meter tracks **droplet uptime (`provisioned_at` → now)**, not activity. + +## Existing infrastructure (reused, not rebuilt) + +`src/gpu/gpu.service.ts` already has the full lifecycle: `provision()`, +`teardown()`, `recordActivity()` (stamps `last_used_at`), and an +`@Interval(60s) idleTimeoutCheck()` that tears down after `GPU_IDLE_TIMEOUT_MINUTES` +(default 30). `HostsView.tsx` shows status + a static `$3.39/hr` label. This feature +adds **presence-driven warm-up, a live cost meter, an explicit pause, and +settings-backed config** on top — no lifecycle rewrite. + +## Decisions (locked) + +- **Pause = stop billing (teardown).** Resume re-provisions (cold start, ~minutes + to boot + reload weights). The only honest "stop the spend" control. +- **Auto-warm = on, with a confirm toast on first warm-up** ("GPU is starting — + you're now billing $3.39/hr"). Consent without per-session friction. + +## Design + +### 1. Presence-driven warm-up + +- PWA emits a heartbeat while focused: `POST /prospector/gpu/heartbeat` + (throttled, e.g. every 30 s; pauses on `visibilitychange` → hidden). +- First heartbeat with `gpu_auto_warm = true` and **no live droplet** → + `provision()` and return `{ warming: true, justStarted: true }` so the PWA fires + the one-time **confirm toast**. +- Subsequent heartbeats call `recordActivity()` to hold it warm. The existing idle + sweep releases it once heartbeats stop + the idle window passes. + +### 2. Live cost meter + +`GET /prospector/gpu/status` payload (`gpu-status.ts#buildGpuStatus`) gains: + +```ts +uptimeSeconds: number | null // now - provisioned_at, null when no droplet +hourlyUsd: number // from gpu_hourly_usd setting (default 3.39) +sessionCostUsd: number | null // uptimeSeconds/3600 * hourlyUsd +``` + +Hosts shows a live banner: **"🔴 GPU warm — 14m · $0.79 this session · $3.39/hr"**. +`hourlyUsd` moves out of the hardcoded `HOURLY_USD` constant into the payload. + +### 3. Pause / resume + +- **Pause** button → `POST /prospector/gpu/teardown` (existing). Meter stops; status + shows "off — not billing". +- **Resume** → `POST /prospector/gpu/provision` (existing), cold start. UI labels it + honestly: "Resume (cold start ~Nm)". +- A `gpu_max_session_minutes` cost-cap: the idle sweep also force-tears-down any + droplet whose uptime exceeds the cap, regardless of activity — a hard ceiling so a + stuck-warm droplet can't bill overnight. + +### 4. Control via config (settings, not env) + +Move GPU policy from env into `prospector_settings` (migration 0013) so it's +editable in the PWA **and** over MCP (Plane-1 parity, see +[`ai-first-v4.md`](./ai-first-v4.md)): + +| Setting | Default | Meaning | +|---|---|---| +| `gpu_auto_warm` | `true` | activity provisions the droplet | +| `gpu_idle_timeout_minutes` | `30` | release after this much inactivity (was env) | +| `gpu_max_session_minutes` | `120` | hard cost-cap auto-teardown | +| `gpu_hourly_usd` | `3.39` | meter rate (size-dependent) | +| `gpu_region` / `gpu_size` | `nyc2` / `gpu-h100x1-80gb` | provision params (were env) | + +`idleTimeoutCheck()` reads these from settings instead of `ConfigService` env. + +## Endpoints summary + +| Method | Route | Status | +|---|---|---| +| `POST /prospector/gpu/heartbeat` | presence ping → auto-warm / record activity | **new** | +| `GET /prospector/gpu/status` | + `uptimeSeconds`, `hourlyUsd`, `sessionCostUsd` | **extend** | +| `POST /prospector/gpu/provision` | warm / resume | exists | +| `POST /prospector/gpu/teardown` | pause / stop billing | exists | + +MCP parity (per `ai-first-v4.md` Phase 1): `prospector_gpu_status`, +`prospector_gpu_warm`, `prospector_gpu_pause`, and the new GPU settings via +`prospector_set_settings` — so an agent sees and controls the spend too. + +## Invariant + +Cost-control never changes send safety. The GPU being warm, cold, paused, or capped +only affects **whether a draft body comes from the model or the `template` +fallback** — Gate-2, the kill-switch, and the macsync outbox floor are untouched. +A cold/paused GPU degrades to `template`, never to an unsafe or placeholder send.