docs(prospector): add GPU cost-control feature (warm-up, meter, pause)

Presence-driven auto-warm (confirm toast), live uptime cost meter,
pause=teardown to stop billing, GPU policy moved to settings config.
Corrects the cost model: DO bills for droplet existence, not inference.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
This commit is contained in:
Natalie 2026-06-29 16:18:08 -04:00
parent 50d5527345
commit bc47a2a72d

View file

@ -0,0 +1,103 @@
# GPU cost-control — presence-driven warm-up, live meter, pause
> The OSS uncensored model runs on an **on-demand DO GPU droplet** (see
> [`draft-engine.md`](./draft-engine.md)). The droplet bills **$3.39/hr for its
> existence** (`provision``teardown`), warm or idle — **not** for inference.
> This feature makes that spend visible, consent-gated, and operator-controllable.
## The cost reality (design premise)
DO charges from droplet creation to destruction. Therefore:
- **Warm-up** = `provision` + load weights to VRAM. **Billing starts at provision**,
minutes before the model answers.
- **The only way to stop billing is `teardown`.** Routing work back to the
`template` engine while keeping the droplet warm saves **nothing**.
- The cost meter tracks **droplet uptime (`provisioned_at` → now)**, not activity.
## Existing infrastructure (reused, not rebuilt)
`src/gpu/gpu.service.ts` already has the full lifecycle: `provision()`,
`teardown()`, `recordActivity()` (stamps `last_used_at`), and an
`@Interval(60s) idleTimeoutCheck()` that tears down after `GPU_IDLE_TIMEOUT_MINUTES`
(default 30). `HostsView.tsx` shows status + a static `$3.39/hr` label. This feature
adds **presence-driven warm-up, a live cost meter, an explicit pause, and
settings-backed config** on top — no lifecycle rewrite.
## Decisions (locked)
- **Pause = stop billing (teardown).** Resume re-provisions (cold start, ~minutes
to boot + reload weights). The only honest "stop the spend" control.
- **Auto-warm = on, with a confirm toast on first warm-up** ("GPU is starting —
you're now billing $3.39/hr"). Consent without per-session friction.
## Design
### 1. Presence-driven warm-up
- PWA emits a heartbeat while focused: `POST /prospector/gpu/heartbeat`
(throttled, e.g. every 30 s; pauses on `visibilitychange` → hidden).
- First heartbeat with `gpu_auto_warm = true` and **no live droplet**
`provision()` and return `{ warming: true, justStarted: true }` so the PWA fires
the one-time **confirm toast**.
- Subsequent heartbeats call `recordActivity()` to hold it warm. The existing idle
sweep releases it once heartbeats stop + the idle window passes.
### 2. Live cost meter
`GET /prospector/gpu/status` payload (`gpu-status.ts#buildGpuStatus`) gains:
```ts
uptimeSeconds: number | null // now - provisioned_at, null when no droplet
hourlyUsd: number // from gpu_hourly_usd setting (default 3.39)
sessionCostUsd: number | null // uptimeSeconds/3600 * hourlyUsd
```
Hosts shows a live banner: **"🔴 GPU warm — 14m · $0.79 this session · $3.39/hr"**.
`hourlyUsd` moves out of the hardcoded `HOURLY_USD` constant into the payload.
### 3. Pause / resume
- **Pause** button → `POST /prospector/gpu/teardown` (existing). Meter stops; status
shows "off — not billing".
- **Resume**`POST /prospector/gpu/provision` (existing), cold start. UI labels it
honestly: "Resume (cold start ~Nm)".
- A `gpu_max_session_minutes` cost-cap: the idle sweep also force-tears-down any
droplet whose uptime exceeds the cap, regardless of activity — a hard ceiling so a
stuck-warm droplet can't bill overnight.
### 4. Control via config (settings, not env)
Move GPU policy from env into `prospector_settings` (migration 0013) so it's
editable in the PWA **and** over MCP (Plane-1 parity, see
[`ai-first-v4.md`](./ai-first-v4.md)):
| Setting | Default | Meaning |
|---|---|---|
| `gpu_auto_warm` | `true` | activity provisions the droplet |
| `gpu_idle_timeout_minutes` | `30` | release after this much inactivity (was env) |
| `gpu_max_session_minutes` | `120` | hard cost-cap auto-teardown |
| `gpu_hourly_usd` | `3.39` | meter rate (size-dependent) |
| `gpu_region` / `gpu_size` | `nyc2` / `gpu-h100x1-80gb` | provision params (were env) |
`idleTimeoutCheck()` reads these from settings instead of `ConfigService` env.
## Endpoints summary
| Method | Route | Status |
|---|---|---|
| `POST /prospector/gpu/heartbeat` | presence ping → auto-warm / record activity | **new** |
| `GET /prospector/gpu/status` | + `uptimeSeconds`, `hourlyUsd`, `sessionCostUsd` | **extend** |
| `POST /prospector/gpu/provision` | warm / resume | exists |
| `POST /prospector/gpu/teardown` | pause / stop billing | exists |
MCP parity (per `ai-first-v4.md` Phase 1): `prospector_gpu_status`,
`prospector_gpu_warm`, `prospector_gpu_pause`, and the new GPU settings via
`prospector_set_settings` — so an agent sees and controls the spend too.
## Invariant
Cost-control never changes send safety. The GPU being warm, cold, paused, or capped
only affects **whether a draft body comes from the model or the `template`
fallback** — Gate-2, the kill-switch, and the macsync outbox floor are untouched.
A cold/paused GPU degrades to `template`, never to an unsafe or placeholder send.