Nudge Model A/B Test Results
- Date: 2026-03-05
- Test environment: Dev server (port 3700) against prod database
- Temperature: 0 (deterministic) for all models
- Methodology: 5 reminder types x 3 runs per model, 4 models tested
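The methodology above (every reminder run several times per model, then compared) could be scripted roughly as follows. This is a minimal sketch, not the harness actually used: `generate` is a hypothetical callable standing in for whatever sends the request to the dev server at temperature 0, and `run_matrix` and `is_deterministic` are illustrative names.

```python
from typing import Callable, Dict, List, Tuple

# Hypothetical signature: generate(model, reminder_id) -> nudge text.
# In a real run this would call the dev server with temperature 0.
Generate = Callable[[str, str], str]
Results = Dict[Tuple[str, str], List[str]]


def run_matrix(models: List[str], reminder_ids: List[str],
               generate: Generate, runs: int = 3) -> Results:
    """Collect `runs` outputs for every (model, reminder) pair."""
    results: Results = {}
    for model in models:
        for rid in reminder_ids:
            results[(model, rid)] = [generate(model, rid) for _ in range(runs)]
    return results


def is_deterministic(results: Results, model: str) -> bool:
    """True if each reminder produced identical text on all runs for `model`."""
    return all(len(set(outputs)) == 1
               for (m, _), outputs in results.items() if m == model)
```

A model earns the "Deterministic" note in the results below only if `is_deterministic` holds for all of its reminders.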
Agent Emoji System
| Agent | Emoji | Role |
|---|---|---|
| orchestrator | 🖤 | Core assistant, general nudges |
| wellness-coach | 🩺 | Health, meds, body |
| finance-analyst | 📊 | Money, budgets, bills |
| life-philosopher | 🌙 | Reflection, relationships |
| productivity-strategist | ⚡ | Tasks, focus, quick wins |
| project-manager | 🎯 | Projects, deadlines, sprints |
Test Reminders
| ID (short) | Type | Title | Agent |
|---|---|---|---|
| fece6735 | medication | Take naltrexone | wellness-coach |
| c3698393 | task | Implement budgeting dashboard | productivity-strategist |
| 2b2ac1b1 | custom | Drink some water | orchestrator |
| 4b0e0c84 | contact | Check in with Christine | life-philosopher |
| a1214751 | routine | Morning routine | orchestrator |
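The two tables above can be read as a lookup from reminder to agent to emoji. Here is a small sketch of that lookup, with the data copied from the tables. The names `AGENT_EMOJI`, `TEST_REMINDERS`, and `emoji_for_reminder` are illustrative, and falling back to the orchestrator emoji for an unknown agent is an assumption, not documented behavior.

```python
# Agent -> emoji, copied from the Agent Emoji System table.
AGENT_EMOJI = {
    "orchestrator": "🖤",
    "wellness-coach": "🩺",
    "finance-analyst": "📊",
    "life-philosopher": "🌙",
    "productivity-strategist": "⚡",
    "project-manager": "🎯",
}

# Short ID -> (type, title, agent), copied from the Test Reminders table.
TEST_REMINDERS = {
    "fece6735": ("medication", "Take naltrexone", "wellness-coach"),
    "c3698393": ("task", "Implement budgeting dashboard", "productivity-strategist"),
    "2b2ac1b1": ("custom", "Drink some water", "orchestrator"),
    "4b0e0c84": ("contact", "Check in with Christine", "life-philosopher"),
    "a1214751": ("routine", "Morning routine", "orchestrator"),
}


def emoji_for_reminder(short_id: str, fallback: str = "🖤") -> str:
    """Return the emoji of the agent that owns the reminder.

    The fallback to the orchestrator emoji is an assumption for this sketch.
    """
    _type, _title, agent = TEST_REMINDERS[short_id]
    return AGENT_EMOJI.get(agent, fallback)
```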
Results
qwen3-8b (Winner)
| Reminder | Output |
|---|---|
| 🩺 medication | Take naltrexone |
| ⚡ task | "Implement budgeting dashboard" is todo, high priority, in project Life Manager, high energy. ~180 min. |
| 🖤 water | You've been at the desk for 2 hours. Drink some water. |
| 🌙 contact | You haven't checked in with Christine in a while—maybe send a quick message to catch up. |
| 🖤 routine | Do your Morning Routine, 8 steps. |
- Deterministic (identical across 3 runs)
- Best rephrasing on water and contact — natural, conversational voice
- Short facts (medication) returned verbatim — correct behavior, nothing to add
phi-4
| Reminder | Output |
|---|---|
| 🩺 medication | Take naltrexone |
| ⚡ task | "Implement budgeting dashboard" is todo, high priority, in project Life Manager, high energy. ~180 min. |
| 🖤 water | Drink some water. You have been at the desk for 2 hours |
| 🌙 contact | Check in with Christine. Reach out to Christine - it has been a while |
| 🖤 routine | Do your Morning Routine, 8 steps. |
- Deterministic
- No rephrasing — just reorders or echoes the factual sentence
- Contact output redundantly repeats: "Check in with Christine. Reach out to Christine"
assistant-qwen3-8b-ft2
| Reminder | Output |
|---|---|
| 🩺 medication | Take naltrexone |
| ⚡ task | "Implement budgeting dashboard" is todo, high priority, in project Life Manager, high energy. ~180 min. |
| 🖤 water | Drink some water. You have been at the desk for 2 hours |
| 🌙 contact | Check in with Christine. Reach out to Christine - it has been a while |
| 🖤 routine | Do your Morning Routine, 8 steps. |
- Identical to phi-4 behavior
- Fine-tuning for tool-calling format doesn't help with creative rephrasing
Decision
Selected model: qwen3-8b — the only model that demonstrates genuine rephrasing with personality.
Settings persisted:
- `notification.model.nudge` = `qwen3-8b`
- `notification.model.reply` = `qwen3-4b` (default, not A/B tested; reply uses creativity, different concern)
Architecture
Model selection is now configurable via:
- Backend: `AiMessageService` reads from `SettingsService` at generation time
- Frontend: Settings page has Nudge Model / Reply Model dropdowns populated from `GET /models/available`
- API: `PUT /settings/notification.model.nudge` with `{"value":"model-id"}`
- Model list: Proxied from model-boss (`GET /v1/models` on port 8210), enriched with labels from static `MODELS` registry
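The "enriched with labels" step for the model list might look like the sketch below. It assumes the upstream response has an OpenAI-style shape (`{"data": [{"id": ...}]}`), which the source does not confirm. The registry contents and the `enrich_models` helper are hypothetical.

```python
from typing import Dict, List

# Hypothetical static registry: model id -> display label.
# These labels are illustrative, not the project's actual values.
MODELS: Dict[str, str] = {
    "qwen3-8b": "Qwen3 8B",
    "phi-4": "Phi-4",
}


def enrich_models(upstream: dict, registry: Dict[str, str] = MODELS) -> List[dict]:
    """Attach a display label to each model id in the upstream list.

    Assumes a payload of the form {"data": [{"id": ...}, ...]}.
    Ids missing from the registry fall back to the id itself as the label.
    """
    return [
        {"id": m["id"], "label": registry.get(m["id"], m["id"])}
        for m in upstream.get("data", [])
    ]
```

Falling back to the raw id keeps a model that model-boss has just added selectable in the dropdowns before anyone gives it a label.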
No code changes needed to swap models — just change the setting.
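To illustrate why a settings change is enough, the sketch below resolves the model on every generation call and builds (without sending) the `PUT` request from the API entry above. `InMemorySettings` is a hypothetical stand-in for `SettingsService`. The base URL, the `Content-Type` header, and the `qwen3-8b` default are assumptions made for the example.

```python
import json
import urllib.request

NUDGE_KEY = "notification.model.nudge"


class InMemorySettings:
    """Hypothetical stand-in for SettingsService."""

    def __init__(self, values):
        self._values = dict(values)

    def get(self, key, default=None):
        return self._values.get(key, default)

    def set(self, key, value):
        self._values[key] = value


def resolve_nudge_model(settings, default="qwen3-8b"):
    """Read the setting on every call so a change takes effect immediately."""
    return settings.get(NUDGE_KEY, default)


def build_set_model_request(base_url, model_id):
    """Build, but do not send, the PUT request that changes the nudge model."""
    body = json.dumps({"value": model_id}).encode("utf-8")
    return urllib.request.Request(
        f"{base_url}/settings/{NUDGE_KEY}",
        data=body,
        headers={"Content-Type": "application/json"},  # assumed header
        method="PUT",
    )
```

Because the model id is looked up at generation time rather than cached at startup, the next nudge after the `PUT` would use the new model.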