M — Error, failure, and degraded-mode strategy
Goal
The other briefs handle individual failures (A §offline, C §channel offline, H1 §bump-failure escalation, I3 §auto-failure, K §kill-switch). This brief defines the unified degradation playbook: what CocotteAI does when a piece of the stack fails — what the user sees, what gets retried where, what auto-actions stop, and how the system signals "I'm working with one hand tied" without panicking Quinn or going silent.
The voice register for everything in this brief is plain (per voice.brief.md §V2c) — failure copy never uses culinary metaphor.
Designer skim
- Headline UX: No silent suppression. Halt-and-tell beats auto-retry-forever. Degradation is local, not global — Tryst down ≠ OF down. Three user-facing states: soft (one specialist) / backend (platform.api / db) / catastrophic (kill-switch tripped or whole-platform offline).
- Sections (7): M1 failure-mode taxonomy · M2 three degradation states · M3 failure interrupts · M4 notification fallback hierarchy · M5 quota self-throttling · M6 auth and session expiry · M7 conflict resolution.
- Blocking Qs: OPEN-DECISIONS.md → M-Q1 conflict UX on iPhone-narrow.
In-the-wild copy
M2a · soft degradation, one specialist (plain):
Tryst rejected the last two bumps. Auto-bumps paused there. Re-auth or fix and resume.
M2b · backend degradation, platform.api wobbling (plain):
Cocotte can't reach her memory. Drafts you're writing now will save when she's back.
M2c · catastrophic / kill-switch tripped (plain):
Everything paused. Nothing dispatching. Tap when you want her back on.
M3 · failure interrupt, auto-retry exhausted (plain — V2c hard, no flourish):
Two retries failed on this OF post. Held. Open to see the response or skip.
M4 · notification fallback, mac-sync down (plain):
mac-sync is unreachable. iMessage delivery paused. Falling back to push for high-stakes only.
M7 · conflict resolution, draft edited from two devices (working):
Two versions of this bio diverged — yours on phone, the draft from web 9 minutes later. Yours / theirs / merged.
Constraints
- No silent suppression. Every degraded mode must be visible. Quinn can always tell why something didn't happen.
- Defaults to halt-and-tell, not auto-retry-forever. When in doubt, stop auto-actions and surface the issue. The trust contract is "I tell you what I'm doing and what I've stopped doing." A bot that quietly retries for 6 hours and then dumps the queue is worse than one that stops at 2 failures and asks.
- Bounded retry. Where retry is appropriate, the policy is 2 attempts within 60s, then escalate. No exponential-backoff loops Quinn can't see.
- Degradation is local, not global. Tryst being down doesn't pause content-onlyfans. Each specialist + adapter degrades independently. Global stops are the kill switch (per brief K).
- Optimistic UI ≠ optimistic state. Approval-card mutations animate out immediately (per brief A's SyncEngine pattern), but if the mutation fails server-side, the card re-appears with a clear failure state — not a silent revert.
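The bounded-retry and halt-and-tell constraints above can be sketched as a small per-service failure window. This is a minimal illustration, not the shipped policy engine: `BoundedRetry`, `recordFailure`, and `recordSuccess` are hypothetical names, and the sketch assumes "2 attempts within 60s" means the second failure inside a rolling 60-second window triggers escalation.

```typescript
// Sketch only: class and method names are hypothetical.
type Verdict = "retry" | "escalate";

class BoundedRetry {
  private failures: number[] = [];
  private readonly maxFailures: number;
  private readonly windowMs: number;

  // Defaults follow the brief: 2 failures, 60-second window.
  constructor(maxFailures: number = 2, windowMs: number = 60000) {
    this.maxFailures = maxFailures;
    this.windowMs = windowMs;
  }

  // Record a failed attempt at nowMs and decide what happens next.
  recordFailure(nowMs: number): Verdict {
    // Forget failures that have fallen out of the rolling window.
    this.failures = this.failures.filter((t) => nowMs - t < this.windowMs);
    this.failures.push(nowMs);
    // At the cap: stop auto-actions and surface the issue to Quinn.
    return this.failures.length >= this.maxFailures ? "escalate" : "retry";
  }

  // A success clears the slate for this service.
  recordSuccess(): void {
    this.failures = [];
  }
}

// One tracker per service keeps degradation local, not global.
const trystRetry = new BoundedRetry();
const first = trystRetry.recordFailure(0);
const second = trystRetry.recordFailure(30000);
```

Keeping one tracker per service means a Tryst escalation never touches another specialist's counter, which is the "local, not global" rule expressed in state.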
M1 — Failure-mode taxonomy
| Layer | What can fail | Detection | Response |
|---|---|---|---|
| Device → network | iPhone offline, weak signal | URLSession timeout, NWPathMonitor | Show offline banner; queue mutations; resume on reconnect (per brief A §offline-tolerant). |
| App → platform.api | API 5xx, timeout, auth-expired | NestJS health check fails / 401 / 408 | Auth-expired → re-SSO prompt. 5xx → "platform is having a moment; retrying" banner; queue critical actions; degrade chat to read-only. |
| platform.api → platform.db | DB unreachable, RLS rejection, schema mismatch | TypeORM error | Hard backend issue. User-facing: "Cocotte can't reach her memory right now. Drafts paused. Trying again in 30s." No new auto-actions. |
| platform.api → ai-copilot specialist | Specialist process down, model API timeout, model 429 (rate limit) | HTTP error from ai-copilot:3791 | Per-specialist degraded badge. Other specialists continue. Quinn sees: "content-x is offline — I've paused its auto-posts. Your draft queue is safe." |
| Specialist → @model-boss (apricot GPU) | Model boss down, model not loaded, OOM | HTTP error from model-boss | Same as above; the affected specialist degrades. |
| Specialist → external service (Tryst, OF, mac-sync, etc.) | Directory API down, login expired, 4xx ban | Adapter HTTP error | Per-service degraded state. Auto-actions for that service halt after 2 failures within 60s. Surfaced as failure interrupt (see M3). |
| Notifier → APNS / email / mac-sync | Push failed, SMTP down, mac-sync send-queue down | Send error | Fallback channel hierarchy (see M4); never just drop a high-stakes notification silently. |
| Engagement-ingestor → macsync.messages | Inbound listener disconnected | Heartbeat miss | Triage stops. Inbound queue piles up server-side. Quinn sees "I'm not seeing new messages right now." |
| Web companion → app | WebSocket disconnected | Heartbeat miss | Web shows "reconnecting"; mutations queue; falls back to polling. |
| App → mutation sync | Optimistic mutation rejected by server | 4xx from platform.api | Card reappears with failure badge + retry / view-reason affordances. |
M2 — User-facing degradation states (the playbook)
Three coherent states the user experiences, in increasing severity:
M2a — Soft degradation (one specialist or service hiccuping)
- Visual: Per-specialist red-dot in the fleet status strip (per brief L §L2c). A small banner only inside that specialist's drawer: "Tryst rejected the last two bumps. Auto-bumps paused. Re-auth or fix and resume."
- Behavior: Other specialists keep running. Draft queue for the affected specialist remains visible (Quinn can still review past drafts) but new auto-actions stop.
- Recovery: One-tap fix (re-auth, retry) inline; if fix succeeds, badge clears and a digest-entry records the gap.
M2b — Backend degradation (platform.api or platform.db wobbling)
- Visual: Top-of-chat banner: "Cocotte can't reach her memory. Drafts you're writing now will save when she's back."
- Behavior: Chat goes read-only-ish — Quinn can compose but mutations queue. No auto-actions anywhere. Audit drawer shows last cached state with a "data may be stale" badge.
- Recovery: Banner clears, queued mutations replay in order, any conflicts surfaced as approval-card-style "two changes overlapped — pick one" cards.
M2c — Device offline
- Visual: System iOS offline indicator + a Cocotte-specific banner: "Offline. Drafts you make will sync when you're back."
- Behavior: Full local-first: chat history readable from cache, can compose and edit, mutations queued (per brief A's SyncEngine pattern).
- Recovery: On reconnect, SyncEngine replays in order; user gets a digest entry "Synced 4 drafts and an approval from your offline session."
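One way to picture the three states is a pure function from health signals to a single user-facing state. The sketch below is an assumption-heavy illustration: `StackHealth`, `classify`, and `autoActionsAllowed` are hypothetical names, the specialist keys are examples, and the precedence (device offline, then backend, then soft) is a guess the brief does not spell out. Blocking auto-actions while the device is offline is likewise an assumption.

```typescript
type Health = "ok" | "down";

// Hypothetical snapshot of the signals the M1 table describes.
interface StackHealth {
  deviceOnline: boolean;
  platformApi: Health;
  platformDb: Health;
  specialists: Record<string, Health>; // keyed by specialist or service name
}

type Degradation =
  | { state: "healthy" }
  | { state: "soft"; affected: string[] } // M2a
  | { state: "backend" }                  // M2b
  | { state: "offline" };                 // M2c

function classify(h: StackHealth): Degradation {
  if (!h.deviceOnline) return { state: "offline" };
  if (h.platformApi === "down" || h.platformDb === "down") return { state: "backend" };
  const affected = Object.keys(h.specialists)
    .filter((name) => h.specialists[name] === "down")
    .sort();
  return affected.length > 0 ? { state: "soft", affected } : { state: "healthy" };
}

// M2b stops auto-actions everywhere; M2a stops them only for the affected specialist.
function autoActionsAllowed(h: StackHealth, specialist: string): boolean {
  const d = classify(h);
  if (d.state === "backend" || d.state === "offline") return false;
  return h.specialists[specialist] === "ok";
}

const health: StackHealth = {
  deviceOnline: true,
  platformApi: "ok",
  platformDb: "ok",
  specialists: { "content-onlyfans": "ok", tryst: "down" },
};
const current = classify(health);
```

With this snapshot, Tryst is the only affected entry, so `content-onlyfans` keeps its auto-actions while Tryst's are paused.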
M3 — Failure interrupts (high-stakes)
Some failures need to be interrupts, not banners. Defined by stakes (per brief F semantics):
M3a — Quinn is not visible / not reachable
Cases:
- A bookings-* listing is down (Tryst removed your listing; not just a failed bump).
- Inbound listener is offline for >5 min (Quinn can't see new prospects).
- Notifier failed to deliver a high-stakes push.
UX: Full-screen interrupt sheet, plain-register copy, two clear actions: fix-now affordance + "I'll handle it later" (which logs to audit but downgrades the urgency).
M3b — Auto-action ran but failed mid-fanout
A multi-surface action (per brief H §H4) partially succeeded — e.g. tour announced on 3 of 5 surfaces.
UX: A specific failure card lists per-surface success/failure with a retry-failed affordance. Critically: the successful surfaces are not reverted automatically. Quinn decides if she wants the partial state or a full revert (counter-action per brief I).
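As a rough sketch of the data behind that failure card, the function below turns per-surface results into a card model. All names (`SurfaceResult`, `FanoutCard`, `buildFanoutCard`) and the surface labels are hypothetical; the one rule taken from the brief is that successful surfaces are never reverted automatically, which the sketch encodes as a literal `false`.

```typescript
interface SurfaceResult {
  surface: string;
  ok: boolean;
  reason?: string; // adapter error summary when ok is false
}

interface FanoutCard {
  succeeded: string[];
  failed: { surface: string; reason: string }[];
  retryTargets: string[]; // only the surfaces that failed
  autoRevert: false;      // reverting is Quinn's call, never automatic
}

function buildFanoutCard(results: SurfaceResult[]): FanoutCard | null {
  const failed = results.filter((r) => !r.ok);
  if (failed.length === 0) return null; // full success: nothing to interrupt with
  return {
    succeeded: results.filter((r) => r.ok).map((r) => r.surface),
    failed: failed.map((r) => ({ surface: r.surface, reason: r.reason ?? "unknown" })),
    retryTargets: failed.map((r) => r.surface),
    autoRevert: false,
  };
}

// The "3 of 5 surfaces" case from the text, with placeholder surface names.
const card = buildFanoutCard([
  { surface: "surface-a", ok: true },
  { surface: "surface-b", ok: true },
  { surface: "surface-c", ok: true },
  { surface: "surface-d", ok: false, reason: "login expired" },
  { surface: "surface-e", ok: false },
]);
```

The retry affordance only ever targets `retryTargets`, so a retry cannot double-post to a surface that already succeeded.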
M3c — Specialist process crashed mid-action
Rare but possible: a specialist starts an action, doesn't finish.
UX: On next chat-home open, an audit-row-detail sheet appears proactively for the unfinished action with "finish" / "abandon" / "see what happened" options. Never silently complete.
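How the platform knows an action is unfinished is still an open engineering question (see Open questions). Purely as one possible shape, the sketch below assumes a `state` field on each action plus a heartbeat timestamp; `ActionState`, `AgentAction`, `lastHeartbeatMs`, and the 2-minute staleness threshold are all assumptions, not decisions from this brief.

```typescript
// Hypothetical action states; the real schema is undecided.
type ActionState = "queued" | "in_progress" | "finished" | "abandoned";

interface AgentAction {
  id: string;
  specialist: string;
  state: ActionState;
  lastHeartbeatMs: number; // assumed: the specialist pings while it works
}

// Actions that started but went quiet: candidates for the proactive
// "finish / abandon / see what happened" sheet. Nothing is completed here.
function unfinishedActions(
  actions: AgentAction[],
  nowMs: number,
  staleAfterMs: number = 120000,
): AgentAction[] {
  return actions.filter(
    (a) => a.state === "in_progress" && nowMs - a.lastHeartbeatMs > staleAfterMs,
  );
}

const stalled = unfinishedActions(
  [
    { id: "a1", specialist: "content-onlyfans", state: "in_progress", lastHeartbeatMs: 0 },
    { id: "a2", specialist: "content-onlyfans", state: "in_progress", lastHeartbeatMs: 290000 },
    { id: "a3", specialist: "content-onlyfans", state: "finished", lastHeartbeatMs: 0 },
  ],
  300000,
);
```

The function only selects rows for the sheet; it deliberately has no path that marks an action finished, in line with "never silently complete".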
M4 — Notification channel fallback hierarchy
Per brief C, notifications go to iOS push / iMessage / email digest. When a channel is down:
| Stakes | Primary | Fallback 1 | Fallback 2 | If all down |
|---|---|---|---|---|
| Low | Email digest | (none — wait) | (none) | Surface in next chat-home open |
| Medium | iOS push | iMessage self-notify | (none) | Surface in chat-home banner |
| High | iOS push | iMessage self-notify | Phone-call fallback (P3+) | Persistent in-chat interrupt; will keep waiting |
Fallback channels are visible to Quinn in the audit drawer ("Push failed; reached you via iMessage at 14:02").
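The table reads naturally as an ordered route per stakes level with an in-app last resort. The sketch below is illustrative only: `ROUTES`, `route`, and the channel identifiers are hypothetical names, the Medium row is read as push, then iMessage, then the chat-home banner, and the phone-call channel is listed even though the brief marks it P3+ (a caller would simply report it as down until it ships).

```typescript
type Stakes = "low" | "medium" | "high";
type Channel = "email-digest" | "ios-push" | "imessage" | "phone-call";
type InAppSurface = "next-chat-home-open" | "chat-home-banner" | "persistent-interrupt";

interface Route {
  channels: Channel[];     // tried in order
  ifAllDown: InAppSurface; // in-app last resort
}

const ROUTES: Record<Stakes, Route> = {
  low: { channels: ["email-digest"], ifAllDown: "next-chat-home-open" },
  medium: { channels: ["ios-push", "imessage"], ifAllDown: "chat-home-banner" },
  high: { channels: ["ios-push", "imessage", "phone-call"], ifAllDown: "persistent-interrupt" },
};

interface Delivery {
  via: Channel | InAppSurface;
  skipped: Channel[]; // channels that were down, kept for the audit drawer
}

function route(stakes: Stakes, isUp: (c: Channel) => boolean): Delivery {
  const skipped: Channel[] = [];
  for (const channel of ROUTES[stakes].channels) {
    if (isUp(channel)) return { via: channel, skipped };
    skipped.push(channel);
  }
  // Every channel is down: never drop silently, fall back to an in-app surface.
  return { via: ROUTES[stakes].ifAllDown, skipped };
}

// Push is down, iMessage is up.
const delivery = route("high", (c) => c === "imessage");
```

The `skipped` list is what would let the audit drawer say which channel failed before the one that got through.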
M4a — Self-degrading notifications
A notification that has been auto-retried 3 times across all channels without acknowledgement converts to a quiet-but-permanent in-chat sticky banner rather than continuing to bombard. Voice for that banner is plain: "5 things have been waiting for you since yesterday — open the audit to clear."
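A minimal sketch of that cutoff, assuming a simple retry counter per notification: `PendingNotification`, `nextStep`, and `stickyCount` are hypothetical names, and the sketch treats "3 times" as three retries in total rather than three per channel.

```typescript
interface PendingNotification {
  id: string;
  retries: number; // auto-retries so far, counted across all channels
  acknowledged: boolean;
}

type NextStep = "done" | "retry" | "sticky-banner";

function nextStep(n: PendingNotification, maxRetries: number = 3): NextStep {
  if (n.acknowledged) return "done";
  // Past the cap, stop bombarding and park it in the in-chat banner.
  return n.retries >= maxRetries ? "sticky-banner" : "retry";
}

// Feeds the count shown in the sticky banner copy.
function stickyCount(all: PendingNotification[]): number {
  return all.filter((n) => nextStep(n) === "sticky-banner").length;
}

const waiting = stickyCount([
  { id: "n1", retries: 3, acknowledged: false },
  { id: "n2", retries: 1, acknowledged: false },
  { id: "n3", retries: 5, acknowledged: true },
  { id: "n4", retries: 4, acknowledged: false },
]);
```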
M5 — Quotas, rate limits, and self-throttling
External services (OF, X, Tryst, Instagram) impose rate limits. CocotteAI's response:
- Track per-service consumption locally (in platform.db service_rate_state — TBD migration).
- When approaching limit (≥80%), self-throttle: stretch the schedule, defer non-critical posts. Surface in audit drawer: "Slowed X auto-posts because we were near OF's rate cap."
- When at limit, stop and show a banner per-affected-specialist.
- When a 429 comes back (rate-limit detected late), back off per service docs, record event, alert on persistence.
This is the politeness layer — CocotteAI is a guest on these services. Don't get banned for over-posting.
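The thresholds above reduce to a small decision function. The sketch is hypothetical in its details: `RateState` guesses at what a `service_rate_state` row might hold (the migration is still TBD), and letting critical posts proceed in the 80 to 100 percent band is one reading of "defer non-critical posts".

```typescript
// Guessed shape of a per-service consumption record.
interface RateState {
  service: string;
  used: number;  // calls or posts consumed in the current window
  limit: number; // the service's cap for that window
}

type ThrottleDecision = "proceed" | "defer" | "stop";

function decide(state: RateState, critical: boolean): ThrottleDecision {
  // At or over the cap (or no known budget): stop and show the banner.
  if (state.limit <= 0 || state.used >= state.limit) return "stop";
  const ratio = state.used / state.limit;
  // Near the cap: stretch the schedule by deferring non-critical work.
  if (ratio >= 0.8) return critical ? "proceed" : "defer";
  return "proceed";
}

const nearCap: RateState = { service: "OF", used: 80, limit: 100 };
const routinePost = decide(nearCap, false);
const criticalPost = decide(nearCap, true);
```

A late 429 would sit outside this function: it is a signal that the local count drifted from the service's own, so the recorded `used` value should be corrected before the next decision.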
M6 — Auth and session expiry
- SSO expiry on the device: re-SSO prompt; no auto-actions until restored.
- Per-service auth expiry (Tryst login, OF login, mac-sync token): per-specialist degraded state per M2a. Inline re-auth affordance from within the specialist drawer.
- Token revocation by external service (banned, blocked): hard interrupt (M3a), with audit-trail and a recommendation to investigate via brief K's audit (was a content rule triggered? a phrase violation?).
M7 — Conflict resolution
When two changes overlap (rare; happens during M2b recovery, or web-companion + iPhone race):
- Last-write-wins is the default for atomic fields.
- For approval-state changes (approved vs. declined arriving in different orders): surface a "two responses overlapped" card. Show both, let Quinn pick the final state.
- For copy edits: present both versions as a 3-way diff (current ← yours / from companion).
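The three rules can be sketched as one dispatcher keyed on the kind of field in conflict. Everything named here (`FieldKind`, `Version`, `resolveConflict`) is hypothetical, and treating two identical values as a non-conflict is an added assumption.

```typescript
type FieldKind = "atomic" | "approval-state" | "copy";

interface Version<T> {
  value: T;
  writtenAt: number; // epoch ms
  device: string;
}

type Resolution<T> =
  | { kind: "auto"; winner: Version<T> }                          // last-write-wins
  | { kind: "overlap-card"; options: [Version<T>, Version<T>] }   // Quinn picks
  | { kind: "diff"; base: T; yours: Version<T>; theirs: Version<T> };

function resolveConflict<T>(
  fieldKind: FieldKind,
  base: T,
  yours: Version<T>,
  theirs: Version<T>,
): Resolution<T> {
  const later = yours.writtenAt >= theirs.writtenAt ? yours : theirs;
  // Identical values are not a real conflict; atomic fields take the later write.
  if (yours.value === theirs.value || fieldKind === "atomic") {
    return { kind: "auto", winner: later };
  }
  if (fieldKind === "approval-state") {
    return { kind: "overlap-card", options: [yours, theirs] };
  }
  // Copy edits: hand both versions plus the common base to the diff view.
  return { kind: "diff", base, yours, theirs };
}

const bioConflict = resolveConflict(
  "copy",
  "original bio",
  { value: "bio edited on phone", writtenAt: 1000, device: "phone" },
  { value: "bio edited on web", writtenAt: 541000, device: "web" },
);
```

Only the `auto` branch resolves without Quinn; the other two return data for a card or a diff and leave the final state to her.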
States to design
- Soft-degradation per-specialist badge (idle / yellow / red).
- Specialist-drawer inline degradation banner.
- Top-of-chat backend-degradation banner.
- iOS offline indicator.
- Full-screen failure interrupt (M3a, M3b, M3c variants).
- Partial-fanout failure card with per-surface success/fail.
- Sync-conflict resolution sheet.
- Quota-throttle digest entry.
- Re-auth-required interrupt.
- Token-revoked interrupt (hard escalation).
- Recovery digest entry ("Cocotte was offline 4 hours; here's what queued and what ran when she came back").
Out of scope
- Infrastructure-level alerting (runbook: pager rotation for engineering — not a Quinn-facing UX brief).
- Cost-limit failures (separate concern; covered when cost tracking ships).
- "Cocotte made a wrong decision" — that's brief I (corrections + counter-actions), not a failure state.
- Real-time multi-device conflict beyond what M7 covers (covered in cross-device-handoff.flow.md).
Open questions
- Auto-retry budget per service per day. Lean: 2 retries per action, then escalate; no global daily cap (per-service rate limits handle that).
- Should the soft-degradation banner ever auto-recover-and-stay-quiet (e.g. transient Tryst hiccup that fixes itself in 30s)? Lean: yes, ≤2-min recovery = soft self-clear with an audit row but no banner spam.
- For M3c (specialist crashed mid-action): how does the platform.api know an action is "in progress" vs "finished but unreported"? Need an agent_actions.state field or a separate agent_action_progress table. Engineering question — flagged but not blocking design.
- M7 conflict UX for content drafts: 3-way diff is the right shape, but on iPhone-narrow it may not fit. Maybe collapses to "yours / theirs / merged-by-Cocotte" 3-button option with full-screen diff drawer behind it.
Related
- brief A §offline — the device-level offline pattern.
- brief C — channel-by-channel notification behavior (this brief defines fallback when channels themselves degrade).
- brief H §H1 — bump-failure escalation, the canonical M3a example.
- brief I — audit drawer surfaces failure history.
- brief K — kill switch is the global degradation; this brief covers local degradation.
- brief L §L4 — specialist trust lifecycle (separate from, but related to, degradation; bad specialists get demoted, broken ones get degraded).
- voice.brief.md §V2c — plain-register copy rules for all failure UX.