GPU cost-control — presence-driven warm-up, live meter, pause
The OSS uncensored model runs on an on-demand DigitalOcean (DO) GPU droplet (see
draft-engine.md). The droplet bills $3.39/hr for as long as it exists (provision → teardown), whether warm or idle, not per inference. This feature makes that spend visible, consent-gated, and operator-controllable.
The cost reality (design premise)
DO charges from droplet creation to destruction. Therefore:
- Warm-up = provision + load weights to VRAM. Billing starts at provision, minutes before the model answers.
- The only way to stop billing is teardown. Routing work back to the template engine while keeping the droplet warm saves nothing.
- The cost meter tracks droplet uptime (provisioned_at → now), not activity.
Existing infrastructure (reused, not rebuilt)
src/gpu/gpu.service.ts already has the full lifecycle: provision(),
teardown(), recordActivity() (stamps last_used_at), and an
@Interval(60s) idleTimeoutCheck() that tears down after GPU_IDLE_TIMEOUT_MINUTES
(default 30). HostsView.tsx shows status + a static $3.39/hr label. This feature
adds presence-driven warm-up, a live cost meter, an explicit pause, and
settings-backed config on top — no lifecycle rewrite.
Decisions (locked)
- Pause = stop billing (teardown). Resume re-provisions (cold start, ~minutes to boot + reload weights). The only honest "stop the spend" control.
- Auto-warm = on, with a confirm toast on first warm-up ("GPU is starting — you're now billing $3.39/hr"). Consent without per-session friction.
Design
1. Presence-driven warm-up (warms the orchestrator, not the pipeline)
Two independent demand signals warm the (shared) droplet — see the AI-roles
taxonomy in ai-first-v4.md:
- User presence → orchestrator chat agent (this section).
- Inbound volume / queue depth → classifier + message-generator (the
  queue-driven lifecycle in draft-engine.md).
Both stamp recordActivity(); the idle sweep tears down only when both are
quiet. Presence warm-up:
- PWA emits a heartbeat while focused: POST /prospector/gpu/heartbeat (throttled, e.g. every 30 s; pauses on visibilitychange → hidden). Presence means "Quinn may chat with her orchestrator"; it does not imply inbound work.
- First heartbeat with gpu_auto_warm = true and no live droplet → provision() and return { warming: true, justStarted: true } so the PWA fires the one-time confirm toast.
- Subsequent heartbeats call recordActivity() to hold it warm. The existing idle sweep releases it once heartbeats stop, the queue is empty, and the idle window passes.
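The heartbeat rules above can be sketched as a small decision function. This is a minimal illustration, not the actual gpu.service.ts code: the DropletState shape, the handleHeartbeat name, and the in-memory provision/recordActivity stand-ins are all hypothetical. The response for later heartbeats ({ warming: true, justStarted: false }) and the behaviour when gpu_auto_warm is off are assumptions, since the text only specifies the first-heartbeat response.

```typescript
// Hypothetical state shape; the real service may store these differently.
interface DropletState {
  provisionedAt: Date | null; // null when no droplet exists
  lastUsedAt: Date | null;
}

interface HeartbeatResult {
  warming: boolean;
  justStarted: boolean;
}

// In-memory stand-ins for the existing provision() / recordActivity() calls.
function provision(state: DropletState, now: Date): void {
  state.provisionedAt = now; // billing starts here, before the model can answer
  state.lastUsedAt = now;
}

function recordActivity(state: DropletState, now: Date): void {
  state.lastUsedAt = now;
}

function handleHeartbeat(
  state: DropletState,
  autoWarm: boolean,
  now: Date,
): HeartbeatResult {
  const live = state.provisionedAt !== null;
  if (!live) {
    if (!autoWarm) {
      // Assumption: with auto-warm off, presence alone never starts billing.
      return { warming: false, justStarted: false };
    }
    provision(state, now);
    // justStarted lets the PWA show the one-time confirm toast.
    return { warming: true, justStarted: true };
  }
  // Droplet already exists: just hold it warm.
  recordActivity(state, now);
  return { warming: true, justStarted: false };
}
```

Keeping the decision free of I/O means the consent-relevant branch (justStarted) can be unit-tested without touching a real droplet.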
2. Live cost meter
GET /prospector/gpu/status payload (gpu-status.ts#buildGpuStatus) gains:
uptimeSeconds: number | null // now - provisioned_at, null when no droplet
hourlyUsd: number // from gpu_hourly_usd setting (default 3.39)
sessionCostUsd: number | null // uptimeSeconds/3600 * hourlyUsd
Hosts shows a live banner: "🔴 GPU warm — 14m · $0.79 this session · $3.39/hr".
hourlyUsd moves out of the hardcoded HOURLY_USD constant into the payload.
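As a rough sketch of how the three new fields relate, the function below derives them from provisioned_at and the hourly rate. The names buildCostFields and formatBanner are illustrative, not the real gpu-status.ts#buildGpuStatus; rounding uptime down to whole seconds and the exact banner wording are assumptions.

```typescript
interface GpuCostFields {
  uptimeSeconds: number | null; // now - provisioned_at, null when no droplet
  hourlyUsd: number; // from the gpu_hourly_usd setting
  sessionCostUsd: number | null; // uptimeSeconds / 3600 * hourlyUsd
}

function buildCostFields(
  provisionedAt: Date | null,
  now: Date,
  hourlyUsd: number = 3.39,
): GpuCostFields {
  if (provisionedAt === null) {
    return { uptimeSeconds: null, hourlyUsd, sessionCostUsd: null };
  }
  // Uptime, not activity: the meter runs from provision regardless of use.
  const elapsedMs = now.getTime() - provisionedAt.getTime();
  const uptimeSeconds = Math.max(0, Math.floor(elapsedMs / 1000));
  const sessionCostUsd = (uptimeSeconds / 3600) * hourlyUsd;
  return { uptimeSeconds, hourlyUsd, sessionCostUsd };
}

// Illustrative banner text; the real Hosts banner wording may differ.
function formatBanner(f: GpuCostFields): string {
  if (f.uptimeSeconds === null || f.sessionCostUsd === null) {
    return "GPU off, not billing";
  }
  const minutes = Math.floor(f.uptimeSeconds / 60);
  const cost = f.sessionCostUsd.toFixed(2);
  return `GPU warm: ${minutes}m · $${cost} this session · $${f.hourlyUsd.toFixed(2)}/hr`;
}
```

For a droplet provisioned 14 minutes ago at $3.39/hr, this gives 840 / 3600 × 3.39 ≈ $0.79, matching the example banner.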
3. Pause / resume
- Pause button → POST /prospector/gpu/teardown (existing). Meter stops; status shows "off — not billing".
- Resume → POST /prospector/gpu/provision (existing), cold start. UI labels it honestly: "Resume (cold start ~Nm)".
- A gpu_max_session_minutes cost-cap: the idle sweep also force-tears-down any droplet whose uptime exceeds the cap, regardless of activity. This is a hard ceiling so a stuck-warm droplet can't bill overnight.
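One way to picture the sweep after this change is a pure decision function that checks the hard cap first and the idle rule second. This is a sketch under assumptions: decideSweep, SweepSettings, and the explicit queueDepth parameter are hypothetical (the text says queue work stamps recordActivity(), so the real sweep may not need a separate queue check), and the comparison operators at the exact boundary are a guess.

```typescript
interface SweepSettings {
  gpuIdleTimeoutMinutes: number; // gpu_idle_timeout_minutes
  gpuMaxSessionMinutes: number; // gpu_max_session_minutes
}

type SweepDecision = "keep" | "teardown_idle" | "teardown_cap";

function decideSweep(
  provisionedAt: Date | null,
  lastUsedAt: Date | null,
  queueDepth: number,
  settings: SweepSettings,
  now: Date,
): SweepDecision {
  if (provisionedAt === null) {
    return "keep"; // no droplet, nothing to tear down
  }
  const minutesSince = (d: Date): number => (now.getTime() - d.getTime()) / 60000;

  // Hard ceiling: applies regardless of activity or queue depth.
  if (minutesSince(provisionedAt) > settings.gpuMaxSessionMinutes) {
    return "teardown_cap";
  }

  // Idle rule: presence and queue must both be quiet for the full window.
  const lastActivity = lastUsedAt ?? provisionedAt;
  const idle = minutesSince(lastActivity) >= settings.gpuIdleTimeoutMinutes;
  if (idle && queueDepth === 0) {
    return "teardown_idle";
  }
  return "keep";
}
```

With a cap of 120 minutes at $3.39/hr, a single session is bounded at roughly 2 × 3.39 = $6.78, plus at most one sweep interval of overshoot.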
4. Control via config (settings, not env)
Move GPU policy from env into prospector_settings (migration 0013) so it's
editable in the PWA and over MCP (Plane-1 parity, see
ai-first-v4.md):
| Setting | Default | Meaning |
|---|---|---|
| gpu_auto_warm | true | activity provisions the droplet |
| gpu_idle_timeout_minutes | 30 | release after this much inactivity (was env) |
| gpu_max_session_minutes | 120 | hard cost-cap auto-teardown |
| gpu_hourly_usd | 3.39 | meter rate (size-dependent) |
| gpu_region / gpu_size | nyc2 / gpu-h100x1-80gb | provision params (were env) |
idleTimeoutCheck() reads these from settings instead of ConfigService env.
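A possible shape for reading these settings with their defaults is sketched below. It assumes prospector_settings can be read as string key/value pairs, which is not confirmed by this document; loadGpuSettings, GpuSettings, and the helper names are hypothetical. Missing or unparseable values fall back to the defaults from the table.

```typescript
interface GpuSettings {
  gpuAutoWarm: boolean;
  gpuIdleTimeoutMinutes: number;
  gpuMaxSessionMinutes: number;
  gpuHourlyUsd: number;
  gpuRegion: string;
  gpuSize: string;
}

// Defaults copied from the settings table.
const GPU_SETTING_DEFAULTS: GpuSettings = {
  gpuAutoWarm: true,
  gpuIdleTimeoutMinutes: 30,
  gpuMaxSessionMinutes: 120,
  gpuHourlyUsd: 3.39,
  gpuRegion: "nyc2",
  gpuSize: "gpu-h100x1-80gb",
};

// Accept only finite, positive numbers; anything else uses the fallback.
function positiveNumber(raw: string | undefined, fallback: number): number {
  const n = raw === undefined ? NaN : Number(raw);
  return Number.isFinite(n) && n > 0 ? n : fallback;
}

function booleanFlag(raw: string | undefined, fallback: boolean): boolean {
  if (raw === "true") return true;
  if (raw === "false") return false;
  return fallback;
}

// Assumption: settings arrive as a flat map of string values keyed by setting name.
function loadGpuSettings(rows: Record<string, string | undefined>): GpuSettings {
  const d = GPU_SETTING_DEFAULTS;
  return {
    gpuAutoWarm: booleanFlag(rows["gpu_auto_warm"], d.gpuAutoWarm),
    gpuIdleTimeoutMinutes: positiveNumber(rows["gpu_idle_timeout_minutes"], d.gpuIdleTimeoutMinutes),
    gpuMaxSessionMinutes: positiveNumber(rows["gpu_max_session_minutes"], d.gpuMaxSessionMinutes),
    gpuHourlyUsd: positiveNumber(rows["gpu_hourly_usd"], d.gpuHourlyUsd),
    gpuRegion: rows["gpu_region"] || d.gpuRegion,
    gpuSize: rows["gpu_size"] || d.gpuSize,
  };
}
```

Falling back on bad input keeps a mistyped setting from silently disabling the idle timeout or the cost cap.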
Endpoints summary
| Method | Route | Purpose | Status |
|---|---|---|---|
| POST | /prospector/gpu/heartbeat | presence ping → auto-warm / record activity | new |
| GET | /prospector/gpu/status | + uptimeSeconds, hourlyUsd, sessionCostUsd | extend |
| POST | /prospector/gpu/provision | warm / resume | exists |
| POST | /prospector/gpu/teardown | pause / stop billing | exists |
MCP parity (per ai-first-v4.md Phase 1): prospector_gpu_status,
prospector_gpu_warm, prospector_gpu_pause, and the new GPU settings via
prospector_set_settings — so an agent sees and controls the spend too.
Invariant
Cost-control never changes send safety. The GPU being warm, cold, paused, or capped
only affects whether a draft body comes from the model or the template
fallback — Gate-2, the kill-switch, and the macsync outbox floor are untouched.
A cold/paused GPU degrades to template, never to an unsafe or placeholder send.