natalie 1b719e1fd7 chore(bootstrap): initial V4 commit

Clean successor to V3 (forge: lilith/atlilith). Seeded from local Mac
working tree at ~/Code/@projects/@cocottetech/. node_modules and build
artifacts excluded via .gitignore.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

2026-05-18 08:11:41 -07:00


M — Error, failure, and degraded-mode strategy

Goal

The other briefs handle individual failures (A §offline, C §channel offline, H1 §bump-failure escalation, I3 §auto-failure, K §kill-switch). This brief defines the unified degradation playbook: what CocotteAI does when a piece of the stack fails — what the user sees, what gets retried where, what auto-actions stop, and how the system signals "I'm working with one hand tied" without panicking Quinn or going silent.

The voice register for everything in this brief is plain (per voice.brief.md §V2c) — failure copy never uses culinary metaphor.

Designer skim

Headline UX: No silent suppression. Halt-and-tell beats auto-retry-forever. Degradation is local, not global — Tryst down ≠ OF down. Three user-facing states: soft (one specialist) / backend (platform.api / db) / catastrophic (kill-switch tripped or whole-platform offline).
Sections (7): M1 failure-mode taxonomy · M2 three degradation states · M3 failure interrupts · M4 notification fallback hierarchy · M5 quota self-throttling · M6 auth and session expiry · M7 conflict resolution.
Blocking Qs: OPEN-DECISIONS.md → M-Q1 conflict UX on iPhone-narrow.

In-the-wild copy

M2a · soft degradation, one specialist (plain):

Tryst rejected the last two bumps. Auto-bumps paused there. Re-auth or fix and resume.

M2b · backend degradation, platform.api wobbling (plain):

Cocotte can't reach her memory. Drafts you're writing now will save when she's back.

M2c · catastrophic / kill-switch tripped (plain):

Everything paused. Nothing dispatching. Tap when you want her back on.

M3 · failure interrupt, auto-retry exhausted (plain — V2c hard, no flourish):

Two retries failed on this OF post. Held. Open to see the response or skip.

M4 · notification fallback, mac-sync down (plain):

mac-sync is unreachable. iMessage delivery paused. Falling back to push for high-stakes only.

M7 · conflict resolution, draft edited from two devices (working):

Two versions of this bio diverged — yours on phone, the draft from web 9 minutes later. Yours / theirs / merged.

Constraints

No silent suppression. Every degraded mode must be visible. Quinn can always tell why something didn't happen.
Defaults to halt-and-tell, not auto-retry-forever. When in doubt, stop auto-actions and surface the issue. The trust contract is "I tell you what I'm doing and what I've stopped doing." A bot that quietly retries for 6 hours and then dumps the queue is worse than one that stops at 2 failures and asks.
Bounded retry. Where retry is appropriate, the policy is 2 attempts within 60s, then escalate. No exponential-backoff loops Quinn can't see.
Degradation is local, not global. Tryst being down doesn't pause content-onlyfans. Each specialist + adapter degrades independently. Global stops are the kill switch (per brief K).
Optimistic UI ≠ optimistic state. Approval-card mutations animate out immediately (per brief A's SyncEngine pattern), but if the mutation fails server-side, the card re-appears with a clear failure state — not a silent revert.
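
The bounded-retry constraint can be sketched as a pure decision function. This is an illustration only: `RetryPolicy`, `BOUNDED_RETRY`, and `afterFailure` are hypothetical names, and the sketch assumes "2 attempts" means two retries after the first failure, which the brief does not pin down.

```typescript
// Sketch only: names are hypothetical; the 2-retry / 60 s numbers come from the brief.
type RetryDecision = "retry" | "escalate";

interface RetryPolicy {
  maxRetries: number; // retries allowed after the first failed attempt
  windowMs: number;   // all retries must land inside this window
}

const BOUNDED_RETRY: RetryPolicy = { maxRetries: 2, windowMs: 60 * 1000 };

// Decide what happens after a failed attempt.
// retriesSoFar: retries already spent; firstFailureAt / now: epoch milliseconds.
function afterFailure(
  retriesSoFar: number,
  firstFailureAt: number,
  now: number,
  policy: RetryPolicy = BOUNDED_RETRY,
): RetryDecision {
  const budgetSpent = retriesSoFar >= policy.maxRetries;
  const windowClosed = now - firstFailureAt >= policy.windowMs;
  // Either condition ends the loop; there is no backoff path Quinn can't see.
  return budgetSpent || windowClosed ? "escalate" : "retry";
}
```

Keeping the decision pure (the clock is passed in) means the escalation point is easy to audit and to test without timers.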

M1 — Failure-mode taxonomy

Layer	What can fail	Detection	Response
Device → network	iPhone offline, weak signal	`URLSession` timeout, `NWPathMonitor`	Show offline banner; queue mutations; resume on reconnect (per brief A §offline-tolerant).
App → platform.api	API 5xx, timeout, auth-expired	NestJS health check fails / 401 / 408	Auth-expired → re-SSO prompt. 5xx → "platform is having a moment; retrying" banner; queue critical actions; degrade chat to read-only.
platform.api → platform.db	DB unreachable, RLS rejection, schema mismatch	TypeORM error	Hard backend issue. User-facing: "Cocotte can't reach her memory right now. Drafts paused. Trying again in 30s." No new auto-actions.
platform.api → ai-copilot specialist	Specialist process down, model API timeout, model 429 (rate limit)	HTTP error from ai-copilot:3791	Per-specialist degraded badge. Other specialists continue. Quinn sees: "`content-x` is offline — I've paused its auto-posts. Your draft queue is safe."
Specialist → @model-boss (apricot GPU)	Model boss down, model not loaded, OOM	HTTP error from model-boss	Same as above; the affected specialist degrades.
Specialist → external service (Tryst, OF, mac-sync, etc.)	Directory API down, login expired, 4xx ban	Adapter HTTP error	Per-service degraded state. Auto-actions for that service halt after 2 failures within 60s. Surfaced as failure interrupt (see M3).
Notifier → APNS / email / mac-sync	Push failed, SMTP down, mac-sync send-queue down	Send error	Fallback channel hierarchy (see M4); never just drop a high-stakes notification silently.
Engagement-ingestor → macsync.messages	Inbound listener disconnected	Heartbeat miss	Triage stops. Inbound queue piles up server-side. Quinn sees "I'm not seeing new messages right now."
Web companion → app	WebSocket disconnected	Heartbeat miss	Web shows "reconnecting"; mutations queue; falls back to polling.
App → mutation sync	Optimistic mutation rejected by server	4xx from platform.api	Card reappears with failure badge + retry / view-reason affordances.
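
The "halt after 2 failures within 60s" rule in the external-service row, combined with the local-not-global constraint, might look like the sketch below. `FailureTracker` and the service keys are hypothetical; only the threshold and window come from the brief.

```typescript
// Sketch only: a per-service sliding window. One service halting never touches another.
class FailureTracker {
  private failures = new Map<string, number[]>();
  private halted = new Set<string>();

  constructor(
    private readonly threshold: number = 2,
    private readonly windowMs: number = 60 * 1000,
  ) {}

  // Record a failure at time `now` (epoch ms); returns true if the service is now halted.
  recordFailure(service: string, now: number): boolean {
    const recent = (this.failures.get(service) ?? []).filter(
      (t) => now - t < this.windowMs,
    );
    recent.push(now);
    this.failures.set(service, recent);
    if (recent.length >= this.threshold) this.halted.add(service);
    return this.halted.has(service);
  }

  isHalted(service: string): boolean {
    return this.halted.has(service);
  }

  // Called after an explicit fix (re-auth, manual retry); clears history for that service only.
  resume(service: string): void {
    this.halted.delete(service);
    this.failures.delete(service);
  }
}
```

Note that a halt stays in place until `resume` is called, which matches halt-and-tell: the window expiring on its own does not quietly restart auto-actions.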

M2 — User-facing degradation states (the playbook)

Three coherent states the user experiences, in increasing severity:

M2a — Soft degradation (one specialist or service hiccuping)

Visual: Per-specialist red-dot in the fleet status strip (per brief L §L2c). A small banner only inside that specialist's drawer: "Tryst rejected the last two bumps. Auto-bumps paused. Re-auth or fix and resume."
Behavior: Other specialists keep running. Draft queue for the affected specialist remains visible (Quinn can still review past drafts) but new auto-actions stop.
Recovery: One-tap fix (re-auth, retry) inline; if fix succeeds, badge clears and a digest-entry records the gap.

M2b — Backend degradation (platform.api or platform.db wobbling)

Visual: Top-of-chat banner: "Cocotte can't reach her memory. Drafts you're writing now will save when she's back."
Behavior: Chat goes read-only-ish — Quinn can compose but mutations queue. No auto-actions anywhere. Audit drawer shows last cached state with a "data may be stale" badge.
Recovery: Banner clears, queued mutations replay in order, any conflicts surfaced as approval-card-style "two changes overlapped — pick one" cards.

M2c — Device offline

Visual: System iOS offline indicator + a Cocotte-specific banner: "Offline. Drafts you make will sync when you're back."
Behavior: Full local-first: chat history readable from cache, can compose and edit, mutations queued (per brief A's SyncEngine pattern).
Recovery: On reconnect, SyncEngine replays in order; user gets a digest entry "Synced 4 drafts and an approval from your offline session."
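
As a rough sketch, the three states in this section can be derived from a single health snapshot, with the most severe condition winning. `HealthSnapshot` and its fields are assumptions made for illustration; the banner strings are the ones quoted above.

```typescript
// Sketch only: type and field names are hypothetical.
type DegradationState = "healthy" | "soft" | "backend" | "offline";

interface HealthSnapshot {
  deviceOnline: boolean;
  apiReachable: boolean;
  dbReachable: boolean;
  degradedSpecialists: string[]; // e.g. ["bookings-tryst"]
}

// Check in decreasing severity so the most severe state is reported.
function degradationState(h: HealthSnapshot): DegradationState {
  if (!h.deviceOnline) return "offline";                     // M2c
  if (!h.apiReachable || !h.dbReachable) return "backend";   // M2b
  if (h.degradedSpecialists.length > 0) return "soft";       // M2a
  return "healthy";
}

// Top-of-chat banner copy. Soft degradation has no global banner:
// its banner lives inside the affected specialist's drawer.
const TOP_BANNER: Record<DegradationState, string | null> = {
  healthy: null,
  soft: null,
  backend:
    "Cocotte can't reach her memory. Drafts you're writing now will save when she's back.",
  offline: "Offline. Drafts you make will sync when you're back.",
};
```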

M3 — Failure interrupts (high-stakes)

Some failures need to be interrupts, not banners. Defined by stakes (per brief F semantics):

M3a — Quinn is not visible / not reachable

Cases:

A bookings-* listing is down (Tryst removed your listing; not just a failed bump).
Inbound listener is offline for >5 min (Quinn can't see new prospects).
Notifier failed to deliver a high-stakes push.

UX: Full-screen interrupt sheet, plain-register copy, two clear actions: fix-now affordance + "I'll handle it later" (which logs to audit but downgrades the urgency).

M3b — Auto-action ran but failed mid-fanout

A multi-surface action (per brief H §H4) partially succeeded — e.g. tour announced on 3 of 5 surfaces.

UX: A specific failure card lists per-surface success/failure with a retry-failed affordance. Critically: the successful surfaces are not reverted automatically. Quinn decides if she wants the partial state or a full revert (counter-action per brief I).
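
One possible shape for the data behind the partial-fanout card is sketched below. `SurfaceResult`, `FanoutSummary`, and `summarizeFanout` are hypothetical names; the point is that the summary only lists what to retry and never schedules a revert of surfaces that succeeded.

```typescript
// Sketch only: names are hypothetical.
interface SurfaceResult {
  surface: string;
  ok: boolean;
  reason?: string; // adapter error text, shown behind "view reason"
}

interface FanoutSummary {
  status: "complete" | "partial" | "failed";
  succeeded: string[];    // left untouched; reverting is Quinn's call
  retryTargets: string[]; // surfaces offered by the retry-failed affordance
}

function summarizeFanout(results: SurfaceResult[]): FanoutSummary {
  const succeeded = results.filter((r) => r.ok).map((r) => r.surface);
  const retryTargets = results.filter((r) => !r.ok).map((r) => r.surface);
  const status =
    retryTargets.length === 0
      ? "complete"
      : succeeded.length === 0
        ? "failed"
        : "partial";
  return { status, succeeded, retryTargets };
}
```

For the "3 of 5 surfaces" case this yields a `partial` status with two retry targets, which is the state the failure card renders.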

M3c — Specialist process crashed mid-action

Rare but possible: a specialist starts an action, doesn't finish.

UX: On next chat-home open, an audit-row-detail sheet appears proactively for the unfinished action with "finish" / "abandon" / "see what happened" options. Never silently complete.

M4 — Notification channel fallback hierarchy

Per brief C, notifications go to iOS push / iMessage / email digest. When a channel is down:

Stakes	Primary	Fallback 1	Fallback 2	If all down
Low	Email digest	(none — wait)	(none)	Surface in next chat-home open
Medium	iOS push	iMessage self-notify	Email	Surface in chat-home banner
High	iOS push	iMessage self-notify	Phone-call fallback (P3+)	Persistent in-chat interrupt; will keep waiting

Fallback channels are visible to Quinn in the audit drawer ("Push failed; reached you via iMessage at 14:02").
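
The table above reads naturally as an ordered chain per stakes level with an in-app last resort. The sketch below encodes it that way; the channel and surface identifiers are hypothetical labels, and the phone-call entry is included even though the table marks it as P3+.

```typescript
// Sketch only: identifiers are hypothetical labels for the table's cells.
type Stakes = "low" | "medium" | "high";
type Channel = "email-digest" | "ios-push" | "imessage" | "email" | "phone-call";
type InAppSurface = "next-chat-home-open" | "chat-home-banner" | "persistent-interrupt";

type Delivery =
  | { kind: "channel"; channel: Channel }
  | { kind: "in-app"; surface: InAppSurface };

// Primary first, then fallbacks, in table order.
const CHAIN: Record<Stakes, Channel[]> = {
  low: ["email-digest"],
  medium: ["ios-push", "imessage", "email"],
  high: ["ios-push", "imessage", "phone-call"], // phone-call is P3+ in the brief
};

// The "If all down" column: nothing is ever dropped silently.
const IF_ALL_DOWN: Record<Stakes, InAppSurface> = {
  low: "next-chat-home-open",
  medium: "chat-home-banner",
  high: "persistent-interrupt",
};

// isUp is supplied by the caller, e.g. backed by notifier health checks.
function pickDelivery(stakes: Stakes, isUp: (c: Channel) => boolean): Delivery {
  for (const channel of CHAIN[stakes]) {
    if (isUp(channel)) return { kind: "channel", channel };
  }
  return { kind: "in-app", surface: IF_ALL_DOWN[stakes] };
}
```

Whatever `pickDelivery` returns is also what the audit drawer would record, so the fallback path stays visible.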

M4a — Self-degrading notifications

A notification that has been auto-retried 3 times across all channels without acknowledgement converts to a quiet-but-permanent in-chat sticky banner rather than continuing to bombard. Voice for that banner is plain: "5 things have been waiting for you since yesterday — open the audit to clear."

M5 — Quotas, rate limits, and self-throttling

External services (OF, X, Tryst, Instagram) impose rate limits. CocotteAI's response:

Track per-service consumption locally (in platform.db service_rate_state — TBD migration).
When approaching limit (≥80%), self-throttle: stretch the schedule, defer non-critical posts. Surface in audit drawer: "Slowed X auto-posts because we were near OF's rate cap."
When at limit, stop and show a banner per-affected-specialist.
When a 429 comes back (rate-limit detected late), back off per service docs, record event, alert on persistence.

This is the politeness layer — CocotteAI is a guest on these services. Don't get banned for over-posting.
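
A minimal sketch of the self-throttle decision, assuming consumption is tracked as a simple used/limit pair per service. `sendDecision` and the `critical` flag are hypothetical; the 80% threshold is the one stated above.

```typescript
// Sketch only: names are hypothetical; 0.8 is the soft threshold from the brief.
type SendDecision = "proceed" | "defer" | "stop";

function sendDecision(
  used: number,
  limit: number,
  critical: boolean,
  softRatio: number = 0.8,
): SendDecision {
  // At or over the cap (or no usable cap): stop and show the per-specialist banner.
  if (limit <= 0 || used >= limit) return "stop";
  // Near the cap: only critical posts go out; the rest are deferred and logged.
  if (used / limit >= softRatio) return critical ? "proceed" : "defer";
  return "proceed";
}
```

A late 429 is a separate path: it means the local count was wrong, so the response there is to back off per the service's documentation and correct the recorded state rather than trust this function's output.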

M6 — Auth and session expiry

SSO expiry on the device: re-SSO prompt; no auto-actions until restored.
Per-service auth expiry (Tryst login, OF login, mac-sync token): per-specialist degraded state per M2a. Inline re-auth affordance from within the specialist drawer.
Token revocation by external service (banned, blocked): hard interrupt (M3a), with audit-trail and a recommendation to investigate via brief K's audit (was a content rule triggered? a phrase violation?).

M7 — Conflict resolution

When two changes overlap (rare; happens during M2b recovery, or web-companion + iPhone race):

Last-write-wins is the default for atomic fields.
For approval-state changes (approved vs. declined arriving in different orders): surface a "two responses overlapped" card. Show both, let Quinn pick the final state.
For copy edits: present both versions as a 3-way diff (current ← yours / from companion).
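
The three rules above could be dispatched by field kind, as in this sketch. `FieldKind`, `Change`, and `resolveConflict` are hypothetical names; the timestamps are assumed to be comparable epoch milliseconds, which in practice depends on how the server orders writes.

```typescript
// Sketch only: names are hypothetical.
type FieldKind = "atomic" | "approval-state" | "copy";

interface Change<T> {
  value: T;
  writtenAt: number; // epoch ms, assumed comparable across devices
  source: "phone" | "web";
}

type Resolution<T> =
  | { kind: "auto"; value: T }
  | { kind: "ask-user"; card: "two-responses-overlapped" | "diff"; options: [Change<T>, Change<T>] };

function resolveConflict<T>(fieldKind: FieldKind, a: Change<T>, b: Change<T>): Resolution<T> {
  // Identical values are not a conflict, whatever the field kind.
  if (a.value === b.value) return { kind: "auto", value: a.value };
  switch (fieldKind) {
    case "atomic":
      // Last write wins.
      return { kind: "auto", value: a.writtenAt >= b.writtenAt ? a.value : b.value };
    case "approval-state":
      // Never pick silently between approved and declined.
      return { kind: "ask-user", card: "two-responses-overlapped", options: [a, b] };
    case "copy":
      // Both versions go to the diff sheet.
      return { kind: "ask-user", card: "diff", options: [a, b] };
  }
}
```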

States to design

Soft-degradation per-specialist badge (idle / yellow / red).
Specialist-drawer inline degradation banner.
Top-of-chat backend-degradation banner.
iOS offline indicator.
Full-screen failure interrupt (M3a, M3b, M3c variants).
Partial-fanout failure card with per-surface success/fail.
Sync-conflict resolution sheet.
Quota-throttle digest entry.
Re-auth-required interrupt.
Token-revoked interrupt (hard escalation).
Recovery digest entry ("Cocotte was offline 4 hours; here's what queued and what ran when she came back").

Out of scope

Infrastructure-level alerting (runbook: pager rotation for engineering — not a Quinn-facing UX brief).
Cost-limit failures (separate concern; covered when cost tracking ships).
"Cocotte made a wrong decision" — that's brief I (corrections + counter-actions), not a failure state.
Real-time multi-device conflict beyond what M7 covers (covered in cross-device-handoff.flow.md).

Open questions

Auto-retry budget per service per day. Lean: 2 retries per action, then escalate; no global daily cap (per-service rate limits handle that).
Should the soft-degradation banner ever auto-recover-and-stay-quiet (e.g. transient Tryst hiccup that fixes itself in 30s)? Lean: yes, ≤2-min recovery = soft self-clear with an audit row but no banner spam.
For M3c (specialist crashed mid-action): how does the platform.api know an action is "in progress" vs "finished but unreported"? Need an agent_actions.state field or a separate agent_action_progress table. Engineering question — flagged but not blocking design.
M7 conflict UX for content drafts: 3-way diff is the right shape, but on iPhone-narrow it may not fit. Maybe collapses to "yours / theirs / merged-by-Cocotte" 3-button option with full-screen diff drawer behind it.

Related briefs

brief A §offline — the device-level offline pattern.
brief C — channel-by-channel notification behavior (this brief defines fallback when channels themselves degrade).
brief H §H1 — bump-failure escalation, the canonical M3a example.
brief I — audit drawer surfaces failure history.
brief K — kill switch is the global degradation; this brief covers local degradation.
brief L §L4 — specialist trust lifecycle (separate from but related to degradation; bad specialists get demoted, broken ones get degraded).
voice.brief.md §V2c — plain-register copy rules for all failure UX.
