prospector/docs/features/gpu-cost-control.md
Natalie 2af7c6bda7 docs(prospector): split AI roles into orchestrator vs pipeline
Distinguish three AI roles across two tiers: the orchestrator/chat agent
(Tier A, control-surface CLIENT, user-facing, presence-warmed) vs the
classifier + message-generator (Tier B, pipeline components the app CALLS,
queue-warmed). Plane-3 autonomy agent = same orchestrator, event-driven
entry point. Fix warm-up triggers to be role-specific.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-29 16:20:21 -04:00

GPU cost-control — presence-driven warm-up, live meter, pause

The OSS uncensored model runs on an on-demand DigitalOcean (DO) GPU droplet (see draft-engine.md). The droplet bills $3.39/hr for as long as it exists (provision → teardown), warm or idle, not per inference. This feature makes that spend visible, consent-gated, and operator-controllable.

The cost reality (design premise)

DO charges from droplet creation to destruction. Therefore:

  • Warm-up = provision + load weights to VRAM. Billing starts at provision, minutes before the model answers.
  • The only way to stop billing is teardown. Routing work back to the template engine while keeping the droplet warm saves nothing.
  • The cost meter tracks droplet uptime (provisioned_at → now), not activity.

Existing infrastructure (reused, not rebuilt)

src/gpu/gpu.service.ts already has the full lifecycle: provision(), teardown(), recordActivity() (stamps last_used_at), and an @Interval(60s) idleTimeoutCheck() that tears down after GPU_IDLE_TIMEOUT_MINUTES (default 30). HostsView.tsx shows status + a static $3.39/hr label. This feature adds presence-driven warm-up, a live cost meter, an explicit pause, and settings-backed config on top — no lifecycle rewrite.

Decisions (locked)

  • Pause = stop billing (teardown). Resume re-provisions (cold start, ~minutes to boot + reload weights). The only honest "stop the spend" control.
  • Auto-warm = on, with a confirm toast on first warm-up ("GPU is starting — you're now billing $3.39/hr"). Consent without per-session friction.

Design

1. Presence-driven warm-up (warms the orchestrator, not the pipeline)

Two independent demand signals warm the (shared) droplet — see the AI-roles taxonomy in ai-first-v4.md:

  • User presence → orchestrator chat agent (this section).
  • Inbound volume / queue depth → classifier + message-generator (the queue-driven lifecycle in draft-engine.md).

Both call recordActivity(), which stamps last_used_at; the idle sweep tears down only when both are quiet. Presence warm-up:

  • PWA emits a heartbeat while focused: POST /prospector/gpu/heartbeat (throttled, e.g. every 30 s; pauses on visibilitychange → hidden). Presence means "Quinn may chat with her orchestrator" — it does not imply inbound work.
  • First heartbeat with gpu_auto_warm = true and no live droplet → provision() and return { warming: true, justStarted: true } so the PWA fires the one-time confirm toast.
  • Subsequent heartbeats call recordActivity() to hold it warm. The existing idle sweep releases it once heartbeats stop, the queue is empty, and the idle window passes.
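
The heartbeat rules above can be sketched as a pure decision function. This is a minimal sketch, not the service's actual handler: `planHeartbeat`, `HeartbeatPlan`, and the `action` names are hypothetical, and the response for the non-first cases (`justStarted: false`) is an assumption, since the doc only specifies the first-heartbeat payload.

```typescript
// Response shape from the doc: { warming, justStarted }.
interface HeartbeatResponse {
  warming: boolean;
  justStarted: boolean;
}

// Hypothetical: which lifecycle call the handler should make.
type HeartbeatAction = "provision" | "recordActivity" | "none";

interface HeartbeatPlan {
  action: HeartbeatAction;
  response: HeartbeatResponse;
}

// Decide what a heartbeat should do, given whether a droplet is live
// and the current value of the gpu_auto_warm setting.
function planHeartbeat(hasLiveDroplet: boolean, autoWarm: boolean): HeartbeatPlan {
  if (hasLiveDroplet) {
    // Subsequent heartbeats only hold the droplet warm.
    return { action: "recordActivity", response: { warming: true, justStarted: false } };
  }
  if (autoWarm) {
    // First heartbeat with no droplet: billing starts here, so flag it for the toast.
    return { action: "provision", response: { warming: true, justStarted: true } };
  }
  // Assumption: with auto-warm off, presence alone never provisions.
  return { action: "none", response: { warming: false, justStarted: false } };
}
```

A real handler would call provision() or recordActivity() on the GPU service according to `action` and return `response` to the PWA.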

2. Live cost meter

GET /prospector/gpu/status payload (gpu-status.ts#buildGpuStatus) gains:

uptimeSeconds: number | null     // now - provisioned_at, null when no droplet
hourlyUsd: number                // from gpu_hourly_usd setting (default 3.39)
sessionCostUsd: number | null    // uptimeSeconds/3600 * hourlyUsd

Hosts shows a live banner: "🔴 GPU warm — 14m · $0.79 this session · $3.39/hr". hourlyUsd moves out of the hardcoded HOURLY_USD constant into the payload.
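
As an illustration of the meter arithmetic, the sketch below derives the three new fields from provisioned_at and formats a simplified banner. `buildCostFields` and `formatBanner` are hypothetical names; the real payload is built in gpu-status.ts#buildGpuStatus, whose signature is not shown here.

```typescript
// The three fields added to the status payload.
interface GpuCostFields {
  uptimeSeconds: number | null;
  hourlyUsd: number;
  sessionCostUsd: number | null;
}

// Cost follows droplet uptime (provisioned_at to now), not activity.
function buildCostFields(provisionedAt: Date | null, now: Date, hourlyUsd: number): GpuCostFields {
  if (provisionedAt === null) {
    return { uptimeSeconds: null, hourlyUsd, sessionCostUsd: null };
  }
  const elapsedMs = now.getTime() - provisionedAt.getTime();
  const uptimeSeconds = Math.max(0, Math.floor(elapsedMs / 1000));
  const sessionCostUsd = (uptimeSeconds / 3600) * hourlyUsd;
  return { uptimeSeconds, hourlyUsd, sessionCostUsd };
}

// Simplified banner text (the wording here is an assumption, not the UI's exact copy).
function formatBanner(fields: GpuCostFields): string {
  if (fields.uptimeSeconds === null || fields.sessionCostUsd === null) {
    return "GPU off, not billing";
  }
  const minutes = Math.floor(fields.uptimeSeconds / 60);
  return `GPU warm, ${minutes}m · $${fields.sessionCostUsd.toFixed(2)} this session · $${fields.hourlyUsd.toFixed(2)}/hr`;
}
```

With 14 minutes of uptime at $3.39/hr this gives 840 / 3600 × 3.39 ≈ $0.79, matching the banner example above.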

3. Pause / resume

  • Pause button → POST /prospector/gpu/teardown (existing). Meter stops; status shows "off — not billing".
  • Resume → POST /prospector/gpu/provision (existing), cold start. UI labels it honestly: "Resume (cold start ~Nm)".
  • A gpu_max_session_minutes cost-cap: the idle sweep also force-tears-down any droplet whose uptime exceeds the cap, regardless of activity — a hard ceiling so a stuck-warm droplet can't bill overnight.
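
One way to picture how the cost-cap sits alongside the existing idle check is a single decision function evaluated on each sweep. This is a sketch under assumptions: `decideSweep`, `SweepInput`, and the decision labels are hypothetical, and the real idleTimeoutCheck() may structure this differently. Queue and presence activity are both assumed to be reflected in last_used_at via recordActivity().

```typescript
interface SweepInput {
  provisionedAt: Date;        // droplet creation time
  lastUsedAt: Date;           // last recordActivity() stamp
  now: Date;
  idleTimeoutMinutes: number; // gpu_idle_timeout_minutes
  maxSessionMinutes: number;  // gpu_max_session_minutes
}

type SweepDecision = "keep" | "teardown_idle" | "teardown_cap";

function decideSweep(input: SweepInput): SweepDecision {
  const minutesSince = (t: Date): number => (input.now.getTime() - t.getTime()) / 60000;

  // Hard ceiling first: uptime over the cap tears down regardless of activity.
  if (minutesSince(input.provisionedAt) > input.maxSessionMinutes) {
    return "teardown_cap";
  }
  // Otherwise fall back to the usual idle timeout.
  if (minutesSince(input.lastUsedAt) >= input.idleTimeoutMinutes) {
    return "teardown_idle";
  }
  return "keep";
}
```

Checking the cap before the idle window means a droplet that is kept busy (or stuck warm) still stops billing once the ceiling is passed.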

4. Control via config (settings, not env)

Move GPU policy from env into prospector_settings (migration 0013) so it's editable in the PWA and over MCP (Plane-1 parity, see ai-first-v4.md):

| Setting | Default | Meaning |
| --- | --- | --- |
| gpu_auto_warm | true | activity provisions the droplet |
| gpu_idle_timeout_minutes | 30 | release after this much inactivity (was env) |
| gpu_max_session_minutes | 120 | hard cost-cap auto-teardown |
| gpu_hourly_usd | 3.39 | meter rate (size-dependent) |
| gpu_region / gpu_size | nyc2 / gpu-h100x1-80gb | provision params (were env) |

idleTimeoutCheck() reads these from settings instead of ConfigService env.
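
A small sketch of how stored settings might be merged over the defaults in the table. The setting keys and default values come from the table above; the `GpuSettings` type and `resolveGpuSettings` helper are hypothetical, and how prospector_settings rows are actually loaded is not shown.

```typescript
interface GpuSettings {
  gpu_auto_warm: boolean;
  gpu_idle_timeout_minutes: number;
  gpu_max_session_minutes: number;
  gpu_hourly_usd: number;
  gpu_region: string;
  gpu_size: string;
}

// Defaults as listed in the settings table.
const GPU_SETTING_DEFAULTS: GpuSettings = {
  gpu_auto_warm: true,
  gpu_idle_timeout_minutes: 30,
  gpu_max_session_minutes: 120,
  gpu_hourly_usd: 3.39,
  gpu_region: "nyc2",
  gpu_size: "gpu-h100x1-80gb",
};

// Overlay whatever is stored on top of the defaults; missing keys keep their default.
// `??` is used so an explicit `false` or `0` is preserved rather than replaced.
function resolveGpuSettings(stored: Partial<GpuSettings>): GpuSettings {
  const d = GPU_SETTING_DEFAULTS;
  return {
    gpu_auto_warm: stored.gpu_auto_warm ?? d.gpu_auto_warm,
    gpu_idle_timeout_minutes: stored.gpu_idle_timeout_minutes ?? d.gpu_idle_timeout_minutes,
    gpu_max_session_minutes: stored.gpu_max_session_minutes ?? d.gpu_max_session_minutes,
    gpu_hourly_usd: stored.gpu_hourly_usd ?? d.gpu_hourly_usd,
    gpu_region: stored.gpu_region ?? d.gpu_region,
    gpu_size: stored.gpu_size ?? d.gpu_size,
  };
}
```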

Endpoints summary

| Method | Route | Purpose | Status |
| --- | --- | --- | --- |
| POST | /prospector/gpu/heartbeat | presence ping → auto-warm / record activity | new |
| GET | /prospector/gpu/status | + uptimeSeconds, hourlyUsd, sessionCostUsd | extend |
| POST | /prospector/gpu/provision | warm / resume | exists |
| POST | /prospector/gpu/teardown | pause / stop billing | exists |

MCP parity (per ai-first-v4.md Phase 1): prospector_gpu_status, prospector_gpu_warm, prospector_gpu_pause, and the new GPU settings via prospector_set_settings — so an agent sees and controls the spend too.

Invariant

Cost-control never changes send safety. The GPU being warm, cold, paused, or capped only affects whether a draft body comes from the model or the template fallback — Gate-2, the kill-switch, and the macsync outbox floor are untouched. A cold/paused GPU degrades to template, never to an unsafe or placeholder send.
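
The invariant can be read as: GPU state selects only the source of the draft body. The sketch below illustrates that narrow dependency; `GpuState`, `DraftSource`, and `chooseDraftSource` are hypothetical names, and the send gates themselves are deliberately not modelled because they do not take GPU state as input.

```typescript
// The four GPU states named in the invariant.
type GpuState = "warm" | "cold" | "paused" | "capped";

// Where the draft body comes from.
type DraftSource = "model" | "template";

// Only a warm GPU yields a model-written body; every other state
// degrades to the template fallback. Nothing here touches send safety.
function chooseDraftSource(state: GpuState): DraftSource {
  return state === "warm" ? "model" : "template";
}
```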