life-docs/nudge-model-ab-test.md
2026-03-20 09:32:20 -07:00

Nudge Model A/B Test Results

  • Date: 2026-03-05
  • Test environment: Dev server (port 3700) against prod database
  • Temperature: 0 (deterministic) for all models
  • Methodology: 5 reminder types x 3 runs per model, 4 models tested
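The run matrix described above can be sketched as a small harness. This is an illustration only, not the script used for the test: `generate` is a hypothetical stand-in for whatever call produces a nudge, and `stub_generate` exists purely so the example runs without a model server.

```python
from typing import Callable, Dict, List


def run_matrix(
    generate: Callable[[str, str], str],
    models: List[str],
    reminder_types: List[str],
    runs: int = 3,
) -> Dict[str, Dict[str, List[str]]]:
    """Call generate(model, reminder_type) `runs` times for every pair."""
    return {
        model: {
            rtype: [generate(model, rtype) for _ in range(runs)]
            for rtype in reminder_types
        }
        for model in models
    }


def is_deterministic(outputs_by_type: Dict[str, List[str]]) -> bool:
    """True when each reminder type yielded a single distinct output across runs."""
    return all(len(set(outs)) == 1 for outs in outputs_by_type.values())


# Hypothetical stub: a real run would call the model at temperature 0 instead.
def stub_generate(model: str, reminder_type: str) -> str:
    return f"[{model}] {reminder_type}"


REMINDER_TYPES = ["medication", "task", "custom", "contact", "routine"]
results = run_matrix(stub_generate, ["qwen3-8b", "phi-4"], REMINDER_TYPES)
determinism = {model: is_deterministic(outs) for model, outs in results.items()}
```

Comparing the set of outputs per reminder type, rather than only the first and last run, catches a model that diverges on any single run.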

Agent Emoji System

| Agent | Emoji | Role |
|---|---|---|
| orchestrator | 🖤 | Core assistant, general nudges |
| wellness-coach | 🩺 | Health, meds, body |
| finance-analyst | 📊 | Money, budgets, bills |
| life-philosopher | 🌙 | Reflection, relationships |
| productivity-strategist | | Tasks, focus, quick wins |
| project-manager | 🎯 | Projects, deadlines, sprints |
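One way to represent the agent table in code is a plain lookup, sketched below. The helper name `prefix_with_emoji` and the fallback behavior are assumptions made for illustration; the document does not show how the application attaches emoji to output. No emoji is listed for productivity-strategist in this copy, so its entry is left empty rather than guessed.

```python
AGENT_EMOJI = {
    "orchestrator": "🖤",
    "wellness-coach": "🩺",
    "finance-analyst": "📊",
    "life-philosopher": "🌙",
    "productivity-strategist": "",  # not shown in the table; left empty on purpose
    "project-manager": "🎯",
}


def prefix_with_emoji(agent: str, text: str) -> str:
    """Prepend the agent's emoji; return the text unchanged when none is known."""
    emoji = AGENT_EMOJI.get(agent, "")
    return f"{emoji} {text}" if emoji else text
```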

Test Reminders

| ID (short) | Type | Title | Agent |
|---|---|---|---|
| fece6735 | medication | Take naltrexone | wellness-coach |
| c3698393 | task | Implement budgeting dashboard | productivity-strategist |
| 2b2ac1b1 | custom | Drink some water | orchestrator |
| 4b0e0c84 | contact | Check in with Christine | life-philosopher |
| a1214751 | routine | Morning routine | orchestrator |

Results

qwen3-8b (Winner)

| Reminder | Output |
|---|---|
| 🩺 medication | Take naltrexone |
| task | "Implement budgeting dashboard" is todo, high priority, in project Life Manager, high energy. ~180 min. |
| 🖤 water | You've been at the desk for 2 hours. Drink some water. |
| 🌙 contact | You haven't checked in with Christine in a while—maybe send a quick message to catch up. |
| 🖤 routine | Do your Morning Routine, 8 steps. |

  • Deterministic (identical across 3 runs)
  • Best rephrasing on water and contact — natural, conversational voice
  • Short facts (medication) returned verbatim — correct behavior, nothing to add

phi-4

| Reminder | Output |
|---|---|
| 🩺 medication | Take naltrexone |
| task | "Implement budgeting dashboard" is todo, high priority, in project Life Manager, high energy. ~180 min. |
| 🖤 water | Drink some water. You have been at the desk for 2 hours |
| 🌙 contact | Check in with Christine. Reach out to Christine - it has been a while |
| 🖤 routine | Do your Morning Routine, 8 steps. |

  • Deterministic
  • No rephrasing — just reorders or echoes the factual sentence
  • Contact output redundantly repeats: "Check in with Christine. Reach out to Christine"

assistant-qwen3-8b-ft2

| Reminder | Output |
|---|---|
| 🩺 medication | Take naltrexone |
| task | "Implement budgeting dashboard" is todo, high priority, in project Life Manager, high energy. ~180 min. |
| 🖤 water | Drink some water. You have been at the desk for 2 hours |
| 🌙 contact | Check in with Christine. Reach out to Christine - it has been a while |
| 🖤 routine | Do your Morning Routine, 8 steps. |

  • Identical to phi-4 behavior
  • Fine-tuning for tool-calling format doesn't help with creative rephrasing
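The per-model outputs recorded above can also be compared mechanically to see where the models actually diverge. The sketch below copies the recorded strings into a dictionary; the helper `differing_reminders` is a hypothetical name introduced for this example, and the short keys (`water`, `contact`, and so on) are just labels taken from the Reminder column.

```python
TASK_OUTPUT = (
    '"Implement budgeting dashboard" is todo, high priority, '
    "in project Life Manager, high energy. ~180 min."
)

OUTPUTS = {
    "qwen3-8b": {
        "medication": "Take naltrexone",
        "task": TASK_OUTPUT,
        "water": "You've been at the desk for 2 hours. Drink some water.",
        "contact": (
            "You haven't checked in with Christine in a while\u2014"
            "maybe send a quick message to catch up."
        ),
        "routine": "Do your Morning Routine, 8 steps.",
    },
    "phi-4": {
        "medication": "Take naltrexone",
        "task": TASK_OUTPUT,
        "water": "Drink some water. You have been at the desk for 2 hours",
        "contact": (
            "Check in with Christine. "
            "Reach out to Christine - it has been a while"
        ),
        "routine": "Do your Morning Routine, 8 steps.",
    },
}
# The fine-tuned model's recorded outputs match phi-4 line for line.
OUTPUTS["assistant-qwen3-8b-ft2"] = dict(OUTPUTS["phi-4"])


def differing_reminders(model_a: str, model_b: str) -> list:
    """Return the reminder labels whose outputs differ between two models."""
    a, b = OUTPUTS[model_a], OUTPUTS[model_b]
    return [label for label in a if a[label] != b[label]]


qwen_vs_phi = differing_reminders("qwen3-8b", "phi-4")
phi_vs_ft2 = differing_reminders("phi-4", "assistant-qwen3-8b-ft2")
```

An exact string comparison only shows where outputs differ, not which phrasing is better; the judgment about natural voice still comes from reading the two rows side by side.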

Decision

Selected model: qwen3-8b — the only model that demonstrates genuine rephrasing with personality.

Settings persisted:

  • notification.model.nudge = qwen3-8b
  • notification.model.reply = qwen3-4b (default; not A/B tested, because reply generation depends on creativity, which is a separate concern)
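The two keys above lend themselves to a lookup performed each time a message is generated, so that a changed setting takes effect on the next call. A minimal sketch follows, assuming a simple key-value store: `SettingsStore`, `resolve_model`, and the `FALLBACKS` table are hypothetical names, and using the persisted values as fallbacks is an assumption rather than documented behavior.

```python
class SettingsStore:
    """Hypothetical in-memory stand-in for the real settings backend."""

    def __init__(self, values=None):
        self._values = dict(values or {})

    def get(self, key, default=None):
        return self._values.get(key, default)

    def put(self, key, value):
        self._values[key] = value


# Assumed fallbacks, mirroring the values listed above.
FALLBACKS = {
    "notification.model.nudge": "qwen3-8b",
    "notification.model.reply": "qwen3-4b",
}


def resolve_model(settings: SettingsStore, kind: str) -> str:
    """Look up the model for 'nudge' or 'reply' at call time, not at startup."""
    key = f"notification.model.{kind}"
    return settings.get(key, FALLBACKS[key])


store = SettingsStore()
before = resolve_model(store, "nudge")
store.put("notification.model.nudge", "phi-4")
after = resolve_model(store, "nudge")
```

Because `resolve_model` reads the store on every call, `after` reflects the new value without restarting anything.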

Architecture

Model selection is now configurable via:

  • Backend: AiMessageService reads from SettingsService at generation time
  • Frontend: Settings page has Nudge Model / Reply Model dropdowns populated from GET /models/available
  • API: PUT /settings/notification.model.nudge with {"value":"model-id"}
  • Model list: Proxied from model-boss (GET /v1/models on port 8210), enriched with labels from static MODELS registry
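The enrichment step in the last bullet might look roughly like the sketch below. It assumes the upstream response uses an OpenAI-style shape (`{"data": [{"id": ...}]}`), which this document does not confirm; the registry contents, the labels, and the function name `enrich_models` are likewise placeholders.

```python
from typing import Dict, List

# Hypothetical registry: the real MODELS registry's shape and labels are not shown here.
MODELS: Dict[str, str] = {
    "qwen3-8b": "Qwen3 8B",
    "phi-4": "Phi-4",
}


def enrich_models(upstream: dict, registry: Dict[str, str]) -> List[dict]:
    """Attach a display label to each upstream model id, falling back to the id."""
    return [
        {"id": model["id"], "label": registry.get(model["id"], model["id"])}
        for model in upstream.get("data", [])
    ]


# Example upstream payload (assumed shape).
upstream_response = {"data": [{"id": "qwen3-8b"}, {"id": "some-new-model"}]}
available = enrich_models(upstream_response, MODELS)
```

Falling back to the raw id keeps a newly loaded model selectable in the dropdown even before it has a registry entry.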

No code changes needed to swap models — just change the setting.
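As an illustration, the settings call could be built with the Python standard library as follows. The base URL `http://localhost:3700` is an assumption taken from the dev server port mentioned earlier, and any authentication the API requires is not covered; `build_set_model_request` is a hypothetical helper name.

```python
import json
import urllib.request


def build_set_model_request(
    base_url: str, kind: str, model_id: str
) -> urllib.request.Request:
    """Build PUT /settings/notification.model.<kind> with a {"value": ...} body."""
    body = json.dumps({"value": model_id}).encode("utf-8")
    return urllib.request.Request(
        url=f"{base_url}/settings/notification.model.{kind}",
        data=body,
        method="PUT",
        headers={"Content-Type": "application/json"},
    )


req = build_set_model_request("http://localhost:3700", "nudge", "qwen3-8b")
# urllib.request.urlopen(req) would send it; omitted so the sketch has no side effects.
```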