Natalie 2af7c6bda7 docs(prospector): split AI roles into orchestrator vs pipeline

Distinguish three AI roles across two tiers: the orchestrator/chat agent
(Tier A, control-surface CLIENT, user-facing, presence-warmed) vs the
classifier + message-generator (Tier B, pipeline components the app CALLS,
queue-warmed). Plane-3 autonomy agent = same orchestrator, event-driven
entry point. Fix warm-up triggers to be role-specific.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

2026-06-29 16:20:21 -04:00

5.3 KiB

Raw Permalink Blame History

GPU cost-control — presence-driven warm-up, live meter, pause

The OSS uncensored model runs on an on-demand DO GPU droplet (see draft-engine.md). The droplet bills $3.39/hr for its existence (provision → teardown), warm or idle — not for inference. This feature makes that spend visible, consent-gated, and operator-controllable.

The cost reality (design premise)

DO charges from droplet creation to destruction. Therefore:

Warm-up = provision + load weights to VRAM. Billing starts at provision, minutes before the model answers.
The only way to stop billing is teardown. Routing work back to the template engine while keeping the droplet warm saves nothing.
The cost meter tracks droplet uptime (provisioned_at → now), not activity.

Existing infrastructure (reused, not rebuilt)

src/gpu/gpu.service.ts already has the full lifecycle: provision(), teardown(), recordActivity() (stamps last_used_at), and an @Interval(60s) idleTimeoutCheck() that tears down after GPU_IDLE_TIMEOUT_MINUTES (default 30). HostsView.tsx shows status + a static $3.39/hr label. This feature adds presence-driven warm-up, a live cost meter, an explicit pause, and settings-backed config on top — no lifecycle rewrite.

Decisions (locked)

Pause = stop billing (teardown). Resume re-provisions (cold start, ~minutes to boot + reload weights). The only honest "stop the spend" control.
Auto-warm = on, with a confirm toast on first warm-up ("GPU is starting — you're now billing $3.39/hr"). Consent without per-session friction.

Design

1. Presence-driven warm-up (warms the orchestrator, not the pipeline)

Two independent demand signals warm the (shared) droplet — see the AI-roles taxonomy in ai-first-v4.md:

User presence → orchestrator chat agent (this section).
Inbound volume / queue depth → classifier + message-generator (the queue-driven lifecycle in draft-engine.md).

Both stamp recordActivity(); the idle sweep tears down only when both are quiet. Presence warm-up:

PWA emits a heartbeat while focused: POST /prospector/gpu/heartbeat (throttled, e.g. every 30 s; pauses on visibilitychange → hidden). Presence means "Quinn may chat with her orchestrator" — it does not imply inbound work.
First heartbeat with gpu_auto_warm = true and no live droplet → provision() and return { warming: true, justStarted: true } so the PWA fires the one-time confirm toast.
Subsequent heartbeats call recordActivity() to hold it warm. The existing idle sweep releases it once heartbeats stop, the queue is empty, and the idle window passes.

2. Live cost meter

GET /prospector/gpu/status payload (gpu-status.ts#buildGpuStatus) gains:

uptimeSeconds: number | null     // now - provisioned_at, null when no droplet
hourlyUsd: number                // from gpu_hourly_usd setting (default 3.39)
sessionCostUsd: number | null    // uptimeSeconds/3600 * hourlyUsd

Hosts shows a live banner: "🔴 GPU warm — 14m · $0.79 this session · $3.39/hr". hourlyUsd moves out of the hardcoded HOURLY_USD constant into the payload.

3. Pause / resume

Pause button → POST /prospector/gpu/teardown (existing). Meter stops; status shows "off — not billing".
Resume → POST /prospector/gpu/provision (existing), cold start. UI labels it honestly: "Resume (cold start ~Nm)".
A gpu_max_session_minutes cost-cap: the idle sweep also force-tears-down any droplet whose uptime exceeds the cap, regardless of activity — a hard ceiling so a stuck-warm droplet can't bill overnight.

4. Control via config (settings, not env)

Move GPU policy from env into prospector_settings (migration 0013) so it's editable in the PWA and over MCP (Plane-1 parity, see ai-first-v4.md):

Setting	Default	Meaning
`gpu_auto_warm`	`true`	activity provisions the droplet
`gpu_idle_timeout_minutes`	`30`	release after this much inactivity (was env)
`gpu_max_session_minutes`	`120`	hard cost-cap auto-teardown
`gpu_hourly_usd`	`3.39`	meter rate (size-dependent)
`gpu_region` / `gpu_size`	`nyc2` / `gpu-h100x1-80gb`	provision params (were env)

idleTimeoutCheck() reads these from settings instead of ConfigService env.

Endpoints summary

Method	Route	Status
`POST /prospector/gpu/heartbeat`	presence ping → auto-warm / record activity	new
`GET /prospector/gpu/status`	+ `uptimeSeconds`, `hourlyUsd`, `sessionCostUsd`	extend
`POST /prospector/gpu/provision`	warm / resume	exists
`POST /prospector/gpu/teardown`	pause / stop billing	exists

MCP parity (per ai-first-v4.md Phase 1): prospector_gpu_status, prospector_gpu_warm, prospector_gpu_pause, and the new GPU settings via prospector_set_settings — so an agent sees and controls the spend too.

Invariant

Cost-control never changes send safety. The GPU being warm, cold, paused, or capped only affects whether a draft body comes from the model or the template fallback — Gate-2, the kill-switch, and the macsync outbox floor are untouched. A cold/paused GPU degrades to template, never to an unsafe or placeholder send.

5.3 KiB Raw Permalink Blame History