cocottetech/@platform/codebase/@features/ai-copilot/docs/K-safety-blocklist.brief.md

K — Safety, blocklists, kill switch

Goal

The hard edges of CocotteAI: what Quinn explicitly forbids the system to do, who it must never engage with, what content must never ship, and how to slam the brakes if something goes wrong. These are not corrections (brief I's 👎 + correction patterns) — they are invariants: rules the system must respect categorically, with no model-based judgment override.

Designer skim

  • Headline UX: Deterministic gates at the adapter boundary. Quinn opts out of a default-on rule, never opts in. A K3 hit surfaces in chat with the specific rule label, never silent suppression.
  • Categories (5): K1 prospect blocklist · K2 phrase/topic · K3 surface combo (K3a NSFW / K3b funnel-link / K3c identity / K3d exclusivity / K3e three-ts disambiguation / K3f location / K3g per-surface format / K3h channel-vs-surface / K3i brand-sensitive split / K3j defaults onboarding / K3k how-a-hit-surfaces) · K4 jurisdiction · K5 kill switch.
  • Blocking Qs: OPEN-DECISIONS.md → K-Q1 show-what-was-blocked privacy.

Constraints

  • Rules here are deterministic gates, not probabilistic suggestions. A blocked client doesn't get a "low-confidence reply draft" — they get nothing.
  • Rules are checked at the adapter boundary (in @cocottetech/@platform/codebase/@features/{bookings,content}-*/adapter/), so they apply uniformly across every specialist that might dispatch. Application-layer enforcement; defense-in-depth via the same RLS spine.
  • Rules surface in chat as visible invariants — Quinn should always be able to see what's blocking, never feel ghosted by silent suppression.
  • This brief covers UX only — the policy storage + enforcement is a platform.api + skills concern (separate engineering brief later).

Inputs

  • GET /api/v1/safety/blocklist?user_id=... → list of blocklist entries.
  • POST /api/v1/safety/blocklist → add an entry.
  • Blocklist entry shape:
    {
      id: string;
      user_id: string;
      org_id: string | null;
      kind: 'prospect' | 'phrase' | 'topic' | 'surface_combo' | 'jurisdiction';
      value: string;                        // the actual thing being blocked
      scope: 'global' | SurfaceKind[];      // applies everywhere, or only on certain surfaces
      reason: string | null;                // Quinn's note, optional, surfaced when the block fires
      expires_at: string | null;            // null = permanent
      created_at: string;
      created_by: 'user' | 'auto';          // some entries auto-added (chargeback detected, etc.)
    }
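
To illustrate how the scope and expires_at fields might be read together, here is a minimal sketch that decides whether an entry is active for a given surface. SurfaceKind is not defined in this brief, so a string alias stands in for it; the interface name BlocklistEntry, the helper entryApplies, and the sample values are hypothetical.

```typescript
// Assumption: SurfaceKind is not defined in this brief; a string alias stands in.
type SurfaceKind = string;

interface BlocklistEntry {
  id: string;
  user_id: string;
  org_id: string | null;
  kind: 'prospect' | 'phrase' | 'topic' | 'surface_combo' | 'jurisdiction';
  value: string;
  scope: 'global' | SurfaceKind[];
  reason: string | null;
  expires_at: string | null; // null = permanent
  created_at: string;
  created_by: 'user' | 'auto';
}

// Hypothetical helper: true when the entry has not expired and covers the surface.
function entryApplies(entry: BlocklistEntry, surface: SurfaceKind, now: Date): boolean {
  if (entry.expires_at !== null && Date.parse(entry.expires_at) <= now.getTime()) {
    return false; // an expired entry no longer gates anything
  }
  if (entry.scope === 'global') {
    return true;
  }
  return entry.scope.indexOf(surface) !== -1;
}

// Illustrative entry; every value here is made up for the example.
const sampleEntry: BlocklistEntry = {
  id: 'bl_1',
  user_id: 'u_quinn',
  org_id: null,
  kind: 'phrase',
  value: 'example phrase',
  scope: ['instagram', 'x'],
  reason: null,
  expires_at: '2030-01-01T00:00:00Z',
  created_at: '2025-01-01T00:00:00Z',
  created_by: 'user',
};
```

An unparseable expires_at yields NaN from Date.parse, so the comparison is false and the entry keeps blocking; that fail-closed behaviour is a choice of this sketch, not something the brief specifies.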
    

Five categories of safety surface

K1 — Prospect blocklist (people)

  • Quinn marks a fan / prospect as blocked. Triage never drafts replies to them; their messages don't appear in the engagement feed (collapsed under a "5 blocked-prospect messages today — review?" card).
  • Entry points:
    • Engagement card → overflow menu → "Block this prospect."
    • Prospect detail drawer (brief B3) → "Block" button (with confirmation).
    • Audit log → on a drafted-to-this-prospect row → "Counter + block sender."
  • States: prospect-card-with-block-badge, "show blocked messages" expander, unblock confirmation.
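
The feed behaviour described above (blocked prospects produce no drafts and collapse into a single count card) could be sketched roughly as follows. InboundMessage, FeedView, and buildFeed are hypothetical names, and the sketch assumes the blocked prospect IDs have already been loaded from the blocklist.

```typescript
interface InboundMessage {
  id: string;
  prospect_id: string;
  text: string;
}

interface FeedView {
  visible: InboundMessage[]; // the only messages triage may draft replies to
  blockedCount: number;      // feeds the collapsed "N blocked-prospect messages" card
}

// Hypothetical helper: split the inbound feed by blocked prospect IDs.
function buildFeed(messages: InboundMessage[], blockedProspectIds: string[]): FeedView {
  const visible = messages.filter(m => blockedProspectIds.indexOf(m.prospect_id) === -1);
  return { visible, blockedCount: messages.length - visible.length };
}
```

Because the filter runs before any specialist sees the message, no "low-confidence draft" can be produced for a blocked prospect.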

K2 — Phrase / topic blocklist (content)

  • Words, phrases, or topics that must never appear in outgoing content (drafted captions, DM replies, profile copy, tour descriptions, anything).
  • Examples Quinn might add: real first name, location-revealing details, certain kinks she doesn't perform, ex-partner mentions.
  • Persona's off_limits JSONB is the canonical store; the safety surface is a friendlier editor over it (overlap acknowledged — UI choice: one surface or two?).
  • Block hits in drafts surface inline: a card says "content-x drafted a caption but it tripped the 'real name' blocklist — re-drafting." Quinn sees the originally-drafted-then-suppressed text only if she opts into "show me what was blocked."
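
A deterministic phrase check of the kind described above might look like the sketch below. It assumes plain case-insensitive substring matching, which the brief does not specify (a real gate may need word boundaries or normalization); PhraseHit and findPhraseHits are hypothetical names.

```typescript
interface PhraseHit {
  phrase: string; // the blocklist entry that matched
  index: number;  // offset of the first match in the draft
}

// Hypothetical helper: report every blocked phrase that occurs in the draft.
function findPhraseHits(draft: string, blockedPhrases: string[]): PhraseHit[] {
  const haystack = draft.toLowerCase();
  const hits: PhraseHit[] = [];
  for (const phrase of blockedPhrases) {
    const needle = phrase.trim().toLowerCase();
    if (needle.length === 0) {
      continue; // ignore empty entries so they cannot match everything
    }
    const index = haystack.indexOf(needle);
    if (index !== -1) {
      hits.push({ phrase, index });
    }
  }
  return hits;
}
```

A non-empty result would hold the draft and trigger the re-draft card; the suppressed text itself stays hidden unless Quinn opts into "show me what was blocked."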

K3 — Surface combo rules (cross-platform constraints)

Cross-platform guardrails protect Quinn's accounts from ToS-violation bans, protect her identity from cross-surface leakage (deadname / govt name / home location), and protect commercial exclusivity (one surface promised exclusive content). Each rule is deterministic at the adapter boundary: it fires at send time inside @cocottetech/@platform/codebase/@features/{bookings,content}-{name}/adapter/, never via model judgment.

Surface this as a settings page listing each rule with a per-rule explainer + a "default-on / opt out" toggle. Quinn opts out of a rule (with explicit "yes I know X allows this") — never opts in.

K3a — NSFW gating (anchored on brief O N1/N2 NSFW-allowed column)

Per the brief O roster, NSFW is allowed on: onlyfans, fansly, bluesky (per-server policy), reddit (in NSFW subs only), and all N2 escort directories. NSFW is banned on: instagram, tiktok, youtube, twitch, facebook, threads. On x it is gated by region (see K3a-2).

| Rule | Default | Notes |
|---|---|---|
| K3a-1 Never publish NSFW media to instagram, tiktok, youtube, twitch, facebook | on | Hard ban; opt-out disabled (no jurisdiction permits this). |
| K3a-2 Never publish NSFW to x without per-region check | on | Opt-out via "I confirm my X region allows adult content" toggle. |
| K3a-3 Never publish NSFW to threads | on | Meta-owned; same posture as IG. |
| K3a-4 Reddit NSFW only to flagged-NSFW subreddits | on | Subreddit-aware gate at @cocottetech/@platform/codebase/@features/content-reddit/adapter/publish-post. |
| K3a-5 Bluesky NSFW only to servers with adult-content policy enabled | on | AT Protocol per-server flag; check before dispatch. |
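
As a rough sketch of how the K3a rules could be evaluated at send time, the function below maps a surface and a few context flags to a decision. The names NsfwContext and checkNsfwPublish and the flag fields are hypothetical, and per-rule opt-out state other than the X region toggle is omitted for brevity.

```typescript
interface NsfwContext {
  xRegionConfirmed: boolean;         // K3a-2 opt-out toggle
  subredditIsNsfw?: boolean;         // K3a-4: target subreddit is flagged NSFW
  blueskyServerAllowsAdult?: boolean; // K3a-5: per-server adult-content flag
}

interface NsfwDecision {
  allowed: boolean;
  rule: string | null; // label of the rule that blocked, if any
}

const K3A1_SURFACES = ['instagram', 'tiktok', 'youtube', 'twitch', 'facebook'];

// Hypothetical gate for a draft already classified as NSFW.
function checkNsfwPublish(surface: string, ctx: NsfwContext): NsfwDecision {
  if (K3A1_SURFACES.indexOf(surface) !== -1) {
    return { allowed: false, rule: 'K3a-1' };
  }
  if (surface === 'x' && !ctx.xRegionConfirmed) {
    return { allowed: false, rule: 'K3a-2' };
  }
  if (surface === 'threads') {
    return { allowed: false, rule: 'K3a-3' };
  }
  if (surface === 'reddit' && ctx.subredditIsNsfw !== true) {
    return { allowed: false, rule: 'K3a-4' };
  }
  if (surface === 'bluesky' && ctx.blueskyServerAllowsAdult !== true) {
    return { allowed: false, rule: 'K3a-5' };
  }
  // Assumption: surfaces not named above fall through as allowed in this sketch.
  return { allowed: true, rule: null };
}
```

Missing flags are treated as "not confirmed," so Reddit and Bluesky dispatches fail closed when the adapter has not looked up the subreddit or server policy. A production gate would probably also fail closed for unknown surfaces.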

K3b — Funnel-link gates

The surface that hosts the link matters more than the destination. A link to onlyfans.com/quinn from instagram triggers IG's adult-content filter; from x it's usually fine; from tiktok it's an instant suspension.

| Rule | Default | Notes |
|---|---|---|
| K3b-1 Never include direct onlyfans.com/* or fansly.com/* links in instagram, tiktok, youtube, twitch, facebook content | on | Use linktree/brand-site indirection instead. |
| K3b-2 Never include direct directory URLs (tryst.link, ts4rent.com, etc.) in x, instagram, tiktok posts | on | Directories cross-listed only via transquinnftw.com. |
| K3b-3 Brand-site transquinnftw.com is the ONLY canonical link allowed across all SFW surfaces | on | Single redirect hub; ToS-safe everywhere. |
| K3b-4 Newsletter (email channel) may include any link except those K2-blocked | off | Email is Quinn's owned channel; less restrictive. |
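
The link rules K3b-1 and K3b-2 lend themselves to a host-based check. The sketch below extracts hostnames from a draft with a deliberately simple regular expression and compares them to per-surface ban lists. The function names are hypothetical, the directory list is partial (the rule itself ends in "etc."), and K3b-3/K3b-4 are not modelled.

```typescript
const FAN_SITE_HOSTS = ['onlyfans.com', 'fansly.com'];  // K3b-1
const DIRECTORY_HOSTS = ['tryst.link', 'ts4rent.com'];  // K3b-2, partial list
const K3B1_SURFACES = ['instagram', 'tiktok', 'youtube', 'twitch', 'facebook'];
const K3B2_SURFACES = ['x', 'instagram', 'tiktok'];

interface LinkViolation {
  rule: string;
  host: string;
}

// Pull out anything that looks like a hostname, with or without a scheme.
function extractHosts(text: string): string[] {
  const pattern = /(?:https?:\/\/)?(?:www\.)?((?:[a-z0-9-]+\.)+[a-z]{2,})/gi;
  const hosts: string[] = [];
  let match: RegExpExecArray | null = pattern.exec(text);
  while (match !== null) {
    hosts.push(match[1].toLowerCase());
    match = pattern.exec(text);
  }
  return hosts;
}

// True for the banned host itself or any subdomain of it.
function hostMatches(host: string, banned: string): boolean {
  return host === banned || host.slice(-(banned.length + 1)) === '.' + banned;
}

// Hypothetical gate: list every K3b-1 / K3b-2 violation for this surface.
function checkLinks(surface: string, draft: string): LinkViolation[] {
  const violations: LinkViolation[] = [];
  for (const host of extractHosts(draft)) {
    if (K3B1_SURFACES.indexOf(surface) !== -1 && FAN_SITE_HOSTS.some(b => hostMatches(host, b))) {
      violations.push({ rule: 'K3b-1', host });
    }
    if (K3B2_SURFACES.indexOf(surface) !== -1 && DIRECTORY_HOSTS.some(b => hostMatches(host, b))) {
      violations.push({ rule: 'K3b-2', host });
    }
  }
  return violations;
}
```

The same draft can pass on one surface and fail on another, which is the point of keying the ban lists by hosting surface rather than by destination.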

K3c — Identity / deadname / govt-name leakage

The KYC surfaces hold Quinn's govt ID and likely her deadname. Brief O calls out: ts4rent requires Sumsub KYC (govt ID); privatedelights requires face+ID+DOB; eros is blocked-on-legal-name-change explicitly because of this. Cross-leakage is a permanent identity risk.

| Rule | Default | Notes |
|---|---|---|
| K3c-1 No content draft (caption, bio, DM, tour copy) on any surface may reference Quinn's govt name | on, can never disable | Hard-coded in adapter; matches the deadname blocklist Quinn maintains via K2. |
| K3c-2 KYC artifacts (ID photos, face-match videos, signed-paper photos) must never appear in content_assets table | on, can never disable | Variant pipeline rejects on ingest; separate KYC vault. |
| K3c-3 Verification field values (govt-name on TS4Rent / PD profile internals) must never echo back into public profile copy on the same surface | on | Adapter reads from public_persona, never verification_payload. |
| K3c-4 eros is in "blocked-on-legal-name-change" state — all eros actions disabled until cleared | on | Status-state from brief F F5b; gates the whole adapter, not per-action. |

K3d — Anchor-surface / exclusivity gates

Some platforms forbid cross-platform competitor links or content syndication. Some Quinn-side commercial commitments forbid the same.

| Rule | Default | Notes |
|---|---|---|
| K3d-1 OnlyFans bio must not reference fansly directly (competitor) | on | OF has historically removed creators for this. |
| K3d-2 Fansly bio must not reference onlyfans directly | on | Symmetric. |
| K3d-3 PPV content cross-posted to fansly requires explicit confirmation (commercial exclusivity check) | on | High-stakes gate; brief H4 multi-surface card splits this off per K3i below. |
| K3d-4 seeking (sugar-dating context) bio must not reference any N2 escort-directory profile | on | Seeking ToS distinguishes companionship from escort services; cross-link violates. |

K3e — Three-"ts"-surface disambiguation

Three surfaces share the "ts" prefix and trans-specific framing: ts4rent, tsescorts, ts.live. Handle conflicts and identity bleed between them.

| Rule | Default | Notes |
|---|---|---|
| K3e-1 Handle / display-name on ts.live must not exactly equal ts4rent or tsescorts handle | on | Avoid prospect confusion + adapter routing ambiguity. |
| K3e-2 Tour dates pushed to tsescorts only via "first save" path (per brief O note); subsequent edits via update-profile not update-tour-dates | on | Adapter quirk; not really a Quinn-policy rule but K3 is where surface-specific gotchas live. |
| K3e-3 Profile diff cards (H2) for tsescorts show the editor-strips-&nbsp; warning inline | on | Editor mangles non-breaking spaces and /hr — adapter normalizes. |

K3f — Location / home-base privacy

Quinn's current home cities and tour cities are public; her exact address is not. (Home cities are per brief O; Tryst's home-city set is tier-dependent per surface-tryst §canonical-facts: 1 city on Basic/Standard/Premium, 3 on Premium+.) Multiple directories ask for ZIP / neighborhood granularity that Quinn may not want broadcast.

| Rule | Default | Notes |
|---|---|---|
| K3f-1 Profile / about-me drafts must never include street address, exact ZIP, building name, or named neighborhood smaller than city-district | on, can never disable | Hard rule; phrase-blocklist (K2) backs it up. |
| K3f-2 Tour dates may reveal city + date range, never accommodation name | on, can never disable | Hotel addresses are confidential per brief R. |
| K3f-3 Time-of-day "I'm available now" bumps (H1) must not include current location more specific than city | on | Tryst's home-city setting is sufficient; per-bump location adds nothing and reveals movement patterns. |

K3g — Per-surface format / cap quirks (from brief O)

Adapter-level gates that prevent silent truncation or formatting failures. These read as boring infra but they show up here because Quinn experiences them as "the agent posted something that looks broken on AdultLook."

| Rule | Default | Surface | Notes |
|---|---|---|---|
| K3g-1 Reject draft if adultsearch body exceeds ~2800 chars | on | adultsearch | Adapter uses ✦ as spacer, not &nbsp;. |
| K3g-2 Reject draft if adultlook body exceeds ~500 chars OR contains HTML | on | adultlook | Plain text only; compressed 4-section format. |
| K3g-3 eroticmonkey photo uploads must use Safari (Firefox-broken per brief O) | on | eroticmonkey | Build-tier concern; adapter sets browser UA. |
| K3g-4 tsescorts website-field add must happen on first edit, not first save | on | tsescorts | Sequence gate at adapter. |
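
The length and HTML gates in K3g-1 and K3g-2 could be expressed as a small lookup plus a check, sketched below. The caps are given as approximate ("~2800", "~500"), so the exact numbers here are placeholders; the allowHtml value for adultsearch is an assumption, and FormatCap and checkFormat are hypothetical names.

```typescript
interface FormatCap {
  maxChars: number;
  allowHtml: boolean;
}

// Placeholder caps taken from the approximate figures in K3g-1 and K3g-2.
const FORMAT_CAPS: { [surface: string]: FormatCap } = {
  adultsearch: { maxChars: 2800, allowHtml: true }, // allowHtml is an assumption
  adultlook: { maxChars: 500, allowHtml: false },   // plain text only
};

// Hypothetical gate: return the reasons a draft body must be rejected.
function checkFormat(surface: string, body: string): string[] {
  const cap = FORMAT_CAPS[surface];
  if (cap === undefined) {
    return []; // no format rule registered for this surface
  }
  const problems: string[] = [];
  if (body.length > cap.maxChars) {
    problems.push(`body is ${body.length} chars; cap is ${cap.maxChars}`);
  }
  if (!cap.allowHtml && /<\/?[a-z][^>]*>/i.test(body)) {
    problems.push('body contains HTML');
  }
  return problems;
}
```

Rejecting with an explicit reason, rather than truncating, keeps the failure visible in chat instead of producing a post that "looks broken."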

K3h — Channel vs surface separation (N4 channels)

The unified inbox (brief P) carries iMessage, SMS, multiple Proton mailboxes, Gmail, and per-surface DMs. Rules here prevent channel-of-arrival from leaking into content-of-origin.

| Rule | Default | Notes |
|---|---|---|
| K3h-1 A prospect contacting Quinn via email channel must not have their email address auto-quoted into outbound on a different surface | on | Stops accidental dox via reply-on-wrong-channel. |
| K3h-2 iMessage replies stay in iMessage thread — never auto-promoted to a surface DM with the same prospect | on | Different consent context; Quinn re-routes manually. |
| K3h-3 Auto-conversation engine (brief Q3) restricted to per-surface DMs; never spans email or signal | on | Q3 invariant; restated here for K3 completeness. |

K3i — Brand-sensitive split (works with H5e)

When a multi-surface card (H4/H5) would dispatch to surfaces with materially different risk envelopes, K3 forces the specialist to split the card rather than approve as one unit.

| Rule | Default | Trigger |
|---|---|---|
| K3i-1 Split card if any included surface has NSFW-allowed=yes AND any other has NSFW-allowed=no AND draft contains NSFW-classified material | on | Per draft, per dispatch. |
| K3i-2 Split card if any included surface is in pending verification or blocked state | on | Pending surfaces need eyes; routine surfaces don't wait. |
| K3i-3 Split card if cross-surface confidence variance >20% | on | High-confidence batch + low-confidence one-off can't share a single approval. |
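
The three split triggers can be checked in one pass over the card's targets. This sketch interprets "confidence variance >20%" as the spread between the highest and lowest confidence exceeding 0.20, which is an assumption (the brief does not define the measure); SurfaceTarget and splitReasons are hypothetical names.

```typescript
interface SurfaceTarget {
  surface: string;
  nsfwAllowed: boolean;
  state: 'active' | 'pending_verification' | 'blocked';
  confidence: number; // 0..1, as reported by the drafting specialist
}

// Hypothetical check: which K3i rules force this multi-surface card to split.
function splitReasons(targets: SurfaceTarget[], draftIsNsfw: boolean): string[] {
  const reasons: string[] = [];
  const anyAllowed = targets.some(t => t.nsfwAllowed);
  const anyBanned = targets.some(t => !t.nsfwAllowed);
  if (draftIsNsfw && anyAllowed && anyBanned) {
    reasons.push('K3i-1');
  }
  if (targets.some(t => t.state !== 'active')) {
    reasons.push('K3i-2');
  }
  if (targets.length > 0) {
    const scores = targets.map(t => t.confidence);
    // Assumption: "variance" is read as max minus min, not statistical variance.
    const spread = Math.max(...scores) - Math.min(...scores);
    if (spread > 0.2) {
      reasons.push('K3i-3');
    }
  }
  return reasons;
}
```

An empty result means the card may be approved as one unit; any reason listed would be shown on the split-off card so Quinn can see why it needs her eyes.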

K3j — Defaults onboarding (first-run)

At first run (post-persona-seed; brief D), present a single screen titled "Cross-surface guardrails":

  • Group K3 rules by category (NSFW gating / identity / exclusivity / format / location).
  • Show each rule as a row with the explainer and a default-on toggle.
  • "Accept all defaults" affordance is the primary button; per-rule customization is secondary.
  • Hard rules (K3c-1, K3c-2, K3c-4, K3f-1, K3f-2) render with the toggle disabled and a small lock glyph — informational, not Quinn-editable. Show them anyway; transparency over hidden invariants.
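
A minimal sketch of the "default-on, opt out only, hard rules locked" model follows. The hard-rule IDs are the ones listed above; GuardrailRule, acceptAllDefaults, and optOut are hypothetical names, and the acknowledged flag stands in for the explicit "yes I know X allows this" confirmation.

```typescript
interface GuardrailRule {
  id: string;
  enabled: boolean;
  locked: boolean; // hard rule: toggle rendered disabled with a lock glyph
}

// Hard rules as listed for the first-run screen.
const HARD_RULE_IDS = ['K3c-1', 'K3c-2', 'K3c-4', 'K3f-1', 'K3f-2'];

// "Accept all defaults": each rule takes its table default; hard rules are forced on.
function acceptAllDefaults(defaults: { id: string; defaultOn: boolean }[]): GuardrailRule[] {
  return defaults.map(d => {
    const locked = HARD_RULE_IDS.indexOf(d.id) !== -1;
    return { id: d.id, enabled: locked ? true : d.defaultOn, locked };
  });
}

// Opt-out only: needs the explicit acknowledgement and never touches a locked rule.
function optOut(rule: GuardrailRule, acknowledged: boolean): GuardrailRule {
  if (rule.locked || !acknowledged) {
    return rule; // unchanged
  }
  return { id: rule.id, enabled: false, locked: false };
}
```

There is intentionally no opt-in function in the sketch: a rule that defaults to off (such as K3b-4) is simply seeded that way.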

K3k — How a K3 hit surfaces in chat

When a K3 rule fires on a draft (during a multi-surface fan-out or a single-surface post), the user-facing card shows:

  • The specific rule label ("K3b-1: Never link onlyfans.com from instagram") — not a generic "blocked."
  • The exact substring or attribute that tripped the rule, redacted if it's in the K2 phrase-blocklist itself.
  • A "show me the original draft" affordance (opt-in, per K's existing "show what was blocked" pattern).
  • A "re-draft without this" affordance routed back to the originating specialist.
  • A "this rule is wrong here" affordance that opens a one-tap exception flow — creates an exception_request row Quinn approves once, never auto-applies.
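
The K3k card contents could be assembled as in the sketch below. The field and action names are hypothetical; the only behaviour taken from the list above is that the rule label is specific, the tripped text is redacted when it matches a K2 phrase, and three affordances are offered.

```typescript
interface K3HitCard {
  ruleLabel: string; // e.g. "K3b-1: Never link onlyfans.com from instagram"
  tripped: string;   // substring or attribute that fired the rule, possibly redacted
  actions: string[]; // affordances rendered on the card
}

// Hypothetical builder for the user-facing card.
function buildHitCard(ruleLabel: string, trippedText: string, k2Phrases: string[]): K3HitCard {
  const lowered = trippedText.toLowerCase();
  const isK2Phrase = k2Phrases.some(
    p => p.length > 0 && lowered.indexOf(p.toLowerCase()) !== -1,
  );
  return {
    ruleLabel,
    // Never echo a K2-blocked phrase back into chat.
    tripped: isK2Phrase ? '[redacted]' : trippedText,
    actions: ['show_original_draft', 'redraft_without_this', 'request_exception'],
  };
}
```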

K4 — Jurisdiction

  • Some content / surfaces / actions are restricted in certain jurisdictions. Quinn declares her home + tour jurisdictions; the system applies the rules.
  • Settings: a list of declared jurisdictions, expandable to show what each restricts.
  • When she declares a tour to a jurisdiction with stricter rules, the tour-approval card (brief H3) shows which content/surfaces auto-pause for the duration.

K5 — Kill switch (panic)

  • Single action that pauses every specialist immediately, queues zero further auto-actions, lets in-flight ones complete (or aborts where safe), and routes everything to Quinn's approval queue.
  • Entry points:
    • Settings → top of page, big red destructive button: "Stop everything."
    • Voice: "Hey copilot, stop everything" → ai-copilot confirms with a single-tap card (no typed confirmation needed — kill switch must be fast).
    • Long-press the CocotteAI app icon → "Emergency stop" quick action (iOS Home Screen).
  • After activation:
    • Chat banner top of every surface: "All specialists paused. Tap to resume."
    • All policies (H1 bumps, scheduler-worker dispatch, triage auto-replies) frozen.
    • Audit row recorded with reason field Quinn can fill ("just being cautious", "drama with X fan", whatever).
    • Resume requires explicit reactivation per specialist OR "resume all" with confirmation.
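
The activation and resume behaviour above can be modelled as a small state holder. This is a sketch under stated assumptions: the class and method names are hypothetical, in-flight action handling is not modelled, and the banner is assumed to stay visible until every specialist has been resumed.

```typescript
type SpecialistState = 'running' | 'paused';

interface AuditRow {
  action: 'kill_switch_activated';
  reason: string | null; // optional note Quinn can fill in
  at: string;            // ISO timestamp
}

class KillSwitch {
  private states: { [name: string]: SpecialistState } = {};
  readonly audit: AuditRow[] = [];

  constructor(specialists: string[]) {
    for (const name of specialists) {
      this.states[name] = 'running';
    }
  }

  // Pause every specialist at once and record the audit row.
  activate(reason: string | null, now: Date): void {
    for (const name of Object.keys(this.states)) {
      this.states[name] = 'paused';
    }
    this.audit.push({ action: 'kill_switch_activated', reason, at: now.toISOString() });
  }

  // Adapters would consult this before any auto-action; paused means approval queue.
  canAutoDispatch(name: string): boolean {
    return this.states[name] === 'running';
  }

  // Explicit per-specialist reactivation.
  resume(name: string): void {
    if (this.states[name] === 'paused') {
      this.states[name] = 'running';
    }
  }

  // "Resume all" only takes effect with confirmation.
  resumeAll(confirmed: boolean): boolean {
    if (!confirmed) {
      return false;
    }
    for (const name of Object.keys(this.states)) {
      this.states[name] = 'running';
    }
    return true;
  }

  // Assumption: the banner persists while any specialist is still paused.
  bannerVisible(): boolean {
    return Object.keys(this.states).some(n => this.states[n] === 'paused');
  }
}
```

Activation takes no confirmation argument because the kill switch must be fast, while resumeAll does, mirroring the asymmetry in the bullets above.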

States to design across K1–K5

  • Blocklist settings root — single page with sections per category, search across all entries.
  • Add-blocklist-entry sheet — pick kind, fill value, optional reason, optional expiry.
  • Block fired-in-flight notification — when a draft is suppressed, a small chat card surfaces it ("content-x's caption hit your 'real name' rule — re-drafted").
  • Show-what-was-blocked toggle — opt-in viewer for the raw blocked content (Quinn might want to confirm the rule is working as intended).
  • Default-on platform rules onboarding — first-run interview question: "Here are 12 recommended platform rules — review or accept defaults."
  • Kill switch activation card — confirms scope, gives reason field, immediately effective.
  • Kill switch banner — persistent until reactivated.
  • Per-specialist resume — one specialist at a time vs all at once.
  • Auto-added blocklist entry surface — when the system auto-detects a chargebacker / harasser pattern, a card explains why it was added and asks Quinn to confirm or revert.

In-the-wild copy

K3 hit · NSFW gate on Instagram (plain — K3k requires exact rule label):

K3a-1: never publish NSFW to Instagram. Re-drafting without it.

K3 hit · cross-link gate (plain):

K3b-1: no onlyfans.com link on Instagram. Routed via transquinnftw.com instead.

K3 hit · identity hard rule (plain):

K3c-1: a phrase in this draft matches your name blocklist. Held. Show me what was blocked / re-draft.

K3 hit · split card (working — K3i):

These 8 fan out together. This 1 mentions Berlin near a hotel name — that one needs your eyes.

K5 · kill-switch voice trigger confirmation (plain — fast, no metaphor):

Stop everything. All specialists paused. No drafts in the queue. Confirm.

K5 · banner after activation (plain):

All specialists paused. Tap to resume.

K1 · auto-added block notice (plain):

Cocotte flagged a sender after two chargebacks. Confirm the block or revert?

Out of scope

  • ML-based safety classification (content moderation NSFW detection on uploads) — different concern, lives in the variant pipeline.
  • Multi-user safety governance (Quinn's manager vetoing her actions) — single-Quinn for P0.

Open questions

  • Persona off_limits JSONB vs blocklist kind='phrase' — one storage and one editor, or both with a unified surface?
  • Kill switch — should it also revoke in-flight LLM calls (cancel the inference) or only block dispatch of the result? Cancelling inference is harder; blocking dispatch is sufficient.
  • "Show what was blocked" — privacy implication: if Quinn ever shares her screen, blocked drafts shouldn't be visible by default. Keep behind a toggle that auto-resets after each session?
  • Auto-added entries (chargeback / harassment detection) — who decides what triggers auto-add? A separate specialist? Or rules baked into triage?