cocottetech/@platform/codebase/@features/ai-copilot/docs/degraded-mode.flow.md
natalie 1b719e1fd7 chore(bootstrap): initial V4 commit
2026-05-18 08:11:41 -07:00


# Degraded-mode flow
End-to-end: from first failure → user-facing degradation → recovery + reconciliation. Pairs with [brief M](./M-error-degraded-modes.brief.md) §M1–M7.
Voice register throughout: **plain** ([voice](./00-system-voice.md) §V2c).
## Trigger taxonomy (which failure path applies)
| Layer | Example | Path |
|---|---|---|
| External adapter (Tryst, OF, mac-sync) | Tryst returns 401 / 429 / 5xx | A — soft local degradation |
| platform.api → specialist | content-onlyfans process down | B — soft specialist degradation |
| platform.api → platform.db | DB unreachable | C — backend degradation |
| App → platform.api | API 5xx | C — backend degradation |
| App offline | iPhone has no signal | D — offline-tolerant |
| Kill-switch tripped | Quinn or auto-trip | E — catastrophic (see [`kill-switch.flow.md`](./kill-switch.flow.md)) |
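
As a minimal sketch, the routing in the table above could be expressed as a lookup. The layer keys (`external_adapter`, `api_to_specialist`, and so on) and the `classify` helper are hypothetical names chosen for illustration; the table only names the layers in prose.

```python
from enum import Enum


class Path(Enum):
    A = "soft local degradation"
    B = "soft specialist degradation"
    C = "backend degradation"
    D = "offline-tolerant"
    E = "catastrophic"


# Hypothetical layer keys, one per row of the trigger taxonomy table.
LAYER_TO_PATH = {
    "external_adapter": Path.A,
    "api_to_specialist": Path.B,
    "api_to_db": Path.C,
    "app_to_api": Path.C,
    "app_offline": Path.D,
    "kill_switch": Path.E,
}


def classify(layer: str) -> Path:
    """Return the degradation path for a failing layer."""
    try:
        return LAYER_TO_PATH[layer]
    except KeyError:
        raise ValueError(f"unknown failure layer: {layer!r}") from None
```

Two distinct layers (`api_to_db` and `app_to_api`) land on Path C, which matches the table: both rows read "backend degradation".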
## Path A — Adapter failure (the common case)
1. **Attempt 1** — adapter dispatches, gets `429`. Specialist records `agent_action.outcome.failure_reason=rate_limit` and queues a 30s retry.
2. **Attempt 2** — same result. Both attempts stay within the 60s window required by brief M's "bounded retry" constraint.
3. **Escalation** — third attempt skipped. Specialist publishes `specialist.degraded` event keyed by `{specialist_id, adapter, reason}`.
4. **UI update**:
   - Fleet-status-strip dot for that specialist flips to red (per brief L §L2c).
   - **Inside that specialist's drawer only**, banner appears:
     > Tryst rejected the last two bumps. Auto-bumps paused. Re-auth or fix and resume.
   - Other specialists keep running. Chat home keeps responding.
5. **Quinn options**:
   - **Re-auth inline** — single-tap if it's a token issue; opens a flow.
   - **Pause** — flips the specialist's policy to "draft only" until Quinn manually resumes.
   - **Investigate** — opens audit drawer (brief I) scoped to this specialist's last hour.
6. **Auto-resolution** — if the adapter recovers within 2 minutes (M open Q-2 default-on), the banner self-clears with a small receipt:
   > Tryst recovered at 14:09. Resumed.
   - An `agent_actions` row records the gap regardless; the audit drawer can replay.
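
The attempt, retry, and escalation steps above can be sketched roughly as follows. This is an illustration under stated assumptions, not the specialist's actual code: `dispatch_with_bounded_retry`, `DispatchResult`, and the injected `send`/`sleep` callables are hypothetical, and only the `rate_limit` reason, the 30s retry, the 60s window, and the `specialist.degraded` key shape come from the flow itself.

```python
from dataclasses import dataclass, field
from typing import Callable

MAX_ATTEMPTS = 2     # the third attempt is skipped
RETRY_DELAY_S = 30   # delay queued after a failed attempt
RETRY_WINDOW_S = 60  # "bounded retry" window from brief M

# The retry schedule must fit inside the bounded window.
assert (MAX_ATTEMPTS - 1) * RETRY_DELAY_S <= RETRY_WINDOW_S


def failure_reason(status: int) -> str:
    """Map an HTTP status to a reason. Only rate_limit is named in the flow."""
    if status == 429:
        return "rate_limit"
    if status == 401:
        return "auth"          # assumed label
    if 500 <= status < 600:
        return "server_error"  # assumed label
    return "unknown"           # assumed label


@dataclass
class DispatchResult:
    ok: bool = False
    actions: list = field(default_factory=list)  # agent_action-style rows
    events: list = field(default_factory=list)   # published events


def dispatch_with_bounded_retry(
    specialist_id: str,
    adapter: str,
    send: Callable[[], int],
    sleep: Callable[[float], None],
) -> DispatchResult:
    """Try up to MAX_ATTEMPTS times, then publish specialist.degraded."""
    result = DispatchResult()
    reason = "unknown"
    for attempt in range(1, MAX_ATTEMPTS + 1):
        status = send()
        if 200 <= status < 300:
            result.ok = True
            result.actions.append({"attempt": attempt, "outcome": "success"})
            return result
        reason = failure_reason(status)
        result.actions.append(
            {"attempt": attempt, "outcome": "failure", "failure_reason": reason}
        )
        if attempt < MAX_ATTEMPTS:
            sleep(RETRY_DELAY_S)
    # Escalation: no further attempt; emit the degraded event instead.
    result.events.append({
        "type": "specialist.degraded",
        "key": {"specialist_id": specialist_id, "adapter": adapter, "reason": reason},
    })
    return result
```

Injecting `sleep` keeps the sketch testable without real waits; a scheduler-based retry queue would serve the same purpose.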
## Path B — Specialist process down
Same UX as Path A, but the failed dependency is the specialist itself, not the external adapter. Banner copy adjusts:

> content-onlyfans is offline. Your draft queue is safe. Trying to bring it back.

platform.api retries the specialist's health endpoint every 10s. After 3 consecutive passing checks, the specialist returns, the banner clears, and a receipt is logged. After 6 consecutive failing checks, the banner promotes to:

> content-onlyfans isn't coming back on its own. Open the audit log or restart from settings.

(Restart-from-settings is an advanced affordance; it surfaces only after the 6-failure mark.)
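
The consecutive-check rule can be sketched as a small counter. `HealthTracker` and its state names are hypothetical; the thresholds (3 passes, 6 failures) come from the text above. Whether a specialist can still recover after the banner has promoted is not stated in the flow; this sketch assumes it can.

```python
from dataclasses import dataclass

PASSES_TO_RECOVER = 3  # consecutive passing checks before the banner clears
FAILS_TO_PROMOTE = 6   # consecutive failing checks before the banner promotes


@dataclass
class HealthTracker:
    passes: int = 0
    fails: int = 0
    state: str = "offline"  # "offline" | "promoted" | "recovered"

    def record(self, healthy: bool) -> str:
        """Feed one health-check result; return the resulting state."""
        if self.state == "recovered":
            return self.state
        if healthy:
            self.passes += 1
            self.fails = 0  # a pass breaks the failure streak
            if self.passes >= PASSES_TO_RECOVER:
                self.state = "recovered"
        else:
            self.fails += 1
            self.passes = 0  # a failure breaks the pass streak
            if self.fails >= FAILS_TO_PROMOTE:
                self.state = "promoted"
        return self.state
```

With a check every 10s, the earliest promotion lands about a minute after the first failed check.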
## Path C — Backend degradation
Top-of-chat banner (above every drawer header):

> Cocotte can't reach her memory. Drafts you're writing now will save when she's back.

While in this state:
- Chat-home stays interactive but **read-only** at the data layer. ai-copilot keeps streaming responses to direct Quinn questions (the LLM doesn't need the DB to talk).
- Approval swipes are disabled with toast: "saved locally, dispatching when she's back."
- New mutations queue in `SyncEngine` (per brief A offline-tolerance).
- Auto-actions across all specialists pause — degraded state is global at this layer.
**Reconciliation on recovery** (M6):
1. platform.api comes back. Banner flips to:
   > Memory's back. Replaying 14 queued actions in order.
2. Each replayed action carries `replay_from_degraded=true` in its audit row.
3. Approval gates apply normally — nothing auto-promotes to skip approval.
4. After replay, banner clears; digest entry appended:
   > 14:02 to 14:49 — platform paused. 14 actions replayed, 0 failures.
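
A rough sketch of the reconciliation steps above, assuming queued mutations are plain dicts. `replay_queue`, the `requires_approval`/`approved` fields, and the summary keys are hypothetical; the in-order replay, the `replay_from_degraded` flag, and the rule that approval gates still apply come from the steps themselves.

```python
from collections import deque
from typing import Callable


def replay_queue(queued: list, dispatch: Callable[[dict], bool]):
    """Replay queued actions in FIFO order after the backend recovers.

    dispatch(action) returns True on success. Actions that need approval
    and do not have it are held, never auto-promoted.
    """
    pending = deque(queued)
    audit_rows = []
    summary = {"dispatched": 0, "held_for_approval": 0, "failures": 0}
    while pending:
        action = pending.popleft()
        if action.get("requires_approval") and not action.get("approved"):
            outcome = "awaiting_approval"
            summary["held_for_approval"] += 1
        elif dispatch(action):
            outcome = "success"
            summary["dispatched"] += 1
        else:
            outcome = "failure"
            summary["failures"] += 1
        audit_rows.append({
            "action_id": action["id"],
            "outcome": outcome,
            "replay_from_degraded": True,  # marks the row as a replay
        })
    return audit_rows, summary
```

The summary counts are the kind of figures the digest entry would need; how the digest line itself is rendered is left out here.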
## Path D — App offline
iPhone has no signal. Behavior per brief A §offline:
- Offline banner (yellow chip, top of chat): `offline — queued mutations: 3`.
- Cached calendar + recent assets browseable.
- Approval swipes work and queue; SyncEngine retries on reconnect.
- No specialist degradation banners — the app can't know the server state until reconnect. On reconnect, server-side state fans out: the device may discover several specialists were degraded while offline, surfaced as a single rollup card:
  > While you were offline: bookings-tryst had 2 failed bumps (recovered 14:09). content-x paused 6 minutes (recovered). No actions lost.
## Path E — Catastrophic (kill-switch)
Different surface entirely; see [`kill-switch.flow.md`](./kill-switch.flow.md). Banner copy and behavior are more restrictive than Path C (no auto-replay on resume; Quinn dictates).
## Cross-cutting: failure interrupts (M3)
If a failure occurs **during** a Quinn-initiated action (e.g. she tapped approve and the dispatch failed), the card animates back in with a failure badge instead of disappearing:
```
┌──────────────────────────────────────┐
│ ⚠ Couldn't publish to OF. │
│ Reason: 429 rate limit · attempt 2/2 │
│ │
│ [ Retry now ] [ Hold ] [ See log ] │
└──────────────────────────────────────┘
```
Plain register. Stakes badge stays whatever it was before — failure doesn't change stakes, only outcome.
## Conflict resolution (M7) — special case of recovery
When two devices edited the same draft and both are reconciling:
```
Two versions diverged.
[ Yours (iPhone, 14:02) ]
| about-me: warm, dry, no marketing line.
[ Theirs (web, 14:11) ]
| about-me: warm, no marketing line — tour focused.
[ Cocotte's merge ]
| about-me: warm, dry, no marketing line — tour focused.
Pick one or edit.
```
iPhone-narrow collapses the three to a button row → tap opens full-screen diff drawer (per [brief M](./M-error-degraded-modes.brief.md) M7 open Q resolution).
## Notification fallback hierarchy (M4) — applied in Paths A–C
For any high-stakes notification while a delivery channel is down:
1. Try iMessage (mac-sync) → if mac-sync degraded, fall to push.
2. Push → if APNS rejects (rare), fall to in-app banner on next launch.
3. In-app banner → always present even after the originating notification was delivered, so Quinn sees it on app open regardless.
The fallback is **never silent** — every fallback hop logs an `agent_actions` row and the high-stakes notification is queued for the next available channel.
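
The hierarchy above can be sketched as an ordered walk over channels. `deliver_high_stakes`, the channel identifiers, and the log-row fields are hypothetical; the order (iMessage, then push, then in-app banner), the per-hop logging, and the always-present banner come from the list above.

```python
CHANNELS = ["imessage", "push", "in_app_banner"]  # fallback order


def deliver_high_stakes(notification_id: str, channel_up: dict, log: list) -> dict:
    """Pick the first available channel, logging every fallback hop.

    channel_up maps a channel name to True/False; the in-app banner is
    treated as always available. log receives agent_actions-style rows.
    """
    delivered_via = "in_app_banner"
    for i, channel in enumerate(CHANNELS):
        if channel == "in_app_banner" or channel_up.get(channel, False):
            delivered_via = channel
            break
        # Never silent: each hop to the next channel is recorded.
        log.append({
            "notification_id": notification_id,
            "event": "fallback_hop",
            "from": channel,
            "to": CHANNELS[i + 1],
        })
    # The banner is shown on app open even if an earlier channel delivered.
    return {"delivered_via": delivered_via, "in_app_banner": True}
```

For example, with mac-sync degraded and push healthy, the call returns `delivered_via == "push"` and leaves one hop row in `log`.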
## Edge cases
- **Multiple failures cascade** (Tryst + TS4Rent fail in the same minute): two specialist banners, each in their own drawer; fleet status strip shows two red dots. No global banner unless the cascade reaches Path C territory.
- **Failure during the failure** (re-auth flow itself errors): plain banner promotes to:
  > Re-auth didn't go through. Hold off and try again, or open the audit log.
- **A high-stakes notification fires while in Path C**: ai-copilot drafts a chat message (possible because chat stays available in read-only mode) and the notification is queued until recovery. Quinn sees the chat message right away; the notification fires on recovery, so she gets both records.
## Related
- [brief M](./M-error-degraded-modes.brief.md) — full design.
- [`kill-switch.flow.md`](./kill-switch.flow.md) — Path E.
- [brief A](./A-chat-surface.brief.md) §offline-tolerant + State 9.
- [brief I](./I-audit-trust-replay.brief.md) — digest entries render the gap.
- [voice](./00-system-voice.md) §V2c plain register.