docs(prospector): add GPU cost-control feature (warm-up, meter, pause)

Presence-driven auto-warm (confirm toast), live uptime cost meter, pause=teardown to stop billing, GPU policy moved to settings config. Corrects the cost model: DO bills for droplet existence, not inference. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-29 16:18:08 -04:00 · 2026-06-29 16:18:08 -04:00 · bc47a2a72d
commit bc47a2a72d
parent 50d5527345
1 changed files with 103 additions and 0 deletions
--- a/docs/features/gpu-cost-control.md
+++ b/docs/features/gpu-cost-control.md
@ -0,0 +1,103 @@
+# GPU cost-control — presence-driven warm-up, live meter, pause
+
+> The OSS uncensored model runs on an **on-demand DO GPU droplet** (see
+> [`draft-engine.md`](./draft-engine.md)). The droplet bills **$3.39/hr for its
+> existence** (`provision` → `teardown`), warm or idle — **not** for inference.
+> This feature makes that spend visible, consent-gated, and operator-controllable.
+
+## The cost reality (design premise)
+
+DO charges from droplet creation to destruction. Therefore:
+
+- **Warm-up** = `provision` + load weights to VRAM. **Billing starts at provision**,
+  minutes before the model answers.
+- **The only way to stop billing is `teardown`.** Routing work back to the
+  `template` engine while keeping the droplet warm saves **nothing**.
+- The cost meter tracks **droplet uptime (`provisioned_at` → now)**, not activity.
+
+## Existing infrastructure (reused, not rebuilt)
+
+`src/gpu/gpu.service.ts` already has the full lifecycle: `provision()`,
+`teardown()`, `recordActivity()` (stamps `last_used_at`), and an
+`@Interval(60s) idleTimeoutCheck()` that tears down after `GPU_IDLE_TIMEOUT_MINUTES`
+(default 30). `HostsView.tsx` shows status + a static `$3.39/hr` label. This feature
+adds **presence-driven warm-up, a live cost meter, an explicit pause, and
+settings-backed config** on top — no lifecycle rewrite.
+
+## Decisions (locked)
+
+- **Pause = stop billing (teardown).** Resume re-provisions (cold start, ~minutes
+  to boot + reload weights). The only honest "stop the spend" control.
+- **Auto-warm = on, with a confirm toast on first warm-up** ("GPU is starting —
+  you're now billing $3.39/hr"). Consent without per-session friction.
+
+## Design
+
+### 1. Presence-driven warm-up
+
+- PWA emits a heartbeat while focused: `POST /prospector/gpu/heartbeat`
+  (throttled, e.g. every 30 s; pauses on `visibilitychange` → hidden).
+- First heartbeat with `gpu_auto_warm = true` and **no live droplet** →
+  `provision()` and return `{ warming: true, justStarted: true }` so the PWA fires
+  the one-time **confirm toast**.
+- Subsequent heartbeats call `recordActivity()` to hold it warm. The existing idle
+  sweep releases it once heartbeats stop + the idle window passes.
+
+### 2. Live cost meter
+
+`GET /prospector/gpu/status` payload (`gpu-status.ts#buildGpuStatus`) gains:
+
+```ts
+uptimeSeconds: number | null     // now - provisioned_at, null when no droplet
+hourlyUsd: number                // from gpu_hourly_usd setting (default 3.39)
+sessionCostUsd: number | null    // uptimeSeconds/3600 * hourlyUsd
+```
+
+Hosts shows a live banner: **"🔴 GPU warm — 14m · $0.79 this session · $3.39/hr"**.
+`hourlyUsd` moves out of the hardcoded `HOURLY_USD` constant into the payload.
+
+### 3. Pause / resume
+
+- **Pause** button → `POST /prospector/gpu/teardown` (existing). Meter stops; status
+  shows "off — not billing".
+- **Resume** → `POST /prospector/gpu/provision` (existing), cold start. UI labels it
+  honestly: "Resume (cold start ~Nm)".
+- A `gpu_max_session_minutes` cost-cap: the idle sweep also force-tears-down any
+  droplet whose uptime exceeds the cap, regardless of activity — a hard ceiling so a
+  stuck-warm droplet can't bill overnight.
+
+### 4. Control via config (settings, not env)
+
+Move GPU policy from env into `prospector_settings` (migration 0013) so it's
+editable in the PWA **and** over MCP (Plane-1 parity, see
+[`ai-first-v4.md`](./ai-first-v4.md)):
+
+| Setting | Default | Meaning |
+|---|---|---|
+| `gpu_auto_warm` | `true` | activity provisions the droplet |
+| `gpu_idle_timeout_minutes` | `30` | release after this much inactivity (was env) |
+| `gpu_max_session_minutes` | `120` | hard cost-cap auto-teardown |
+| `gpu_hourly_usd` | `3.39` | meter rate (size-dependent) |
+| `gpu_region` / `gpu_size` | `nyc2` / `gpu-h100x1-80gb` | provision params (were env) |
+
+`idleTimeoutCheck()` reads these from settings instead of `ConfigService` env.
+
+## Endpoints summary
+
+| Method | Route | Status |
+|---|---|---|
+| `POST /prospector/gpu/heartbeat` | presence ping → auto-warm / record activity | **new** |
+| `GET /prospector/gpu/status` | + `uptimeSeconds`, `hourlyUsd`, `sessionCostUsd` | **extend** |
+| `POST /prospector/gpu/provision` | warm / resume | exists |
+| `POST /prospector/gpu/teardown` | pause / stop billing | exists |
+
+MCP parity (per `ai-first-v4.md` Phase 1): `prospector_gpu_status`,
+`prospector_gpu_warm`, `prospector_gpu_pause`, and the new GPU settings via
+`prospector_set_settings` — so an agent sees and controls the spend too.
+
+## Invariant
+
+Cost-control never changes send safety. The GPU being warm, cold, paused, or capped
+only affects **whether a draft body comes from the model or the `template`
+fallback** — Gate-2, the kill-switch, and the macsync outbox floor are untouched.
+A cold/paused GPU degrades to `template`, never to an unsafe or placeholder send.