AI-Powered iMessage Response Generator with Self-Hosted ML

Automated iMessage response generation using self-hosted LLMs to save provider time and improve response quality

Quick Facts

Metric	Value
Business Impact	Cost reducer — Saves $800/month per provider in AI API costs
Primary Users	Providers
Status	Production
Dependencies	None (standalone feature)

Overview

The Conversation Assistant is a distributed AI-powered system that syncs iMessage conversations from macOS devices and generates contextually appropriate response suggestions using self-hosted language models. It eliminates the time burden of responding to repetitive client inquiries while maintaining the provider's authentic voice through continuous learning from feedback.

This feature targets a major drain on provider productivity: providers spend 3-5 hours daily responding to client messages. Automating half of those responses saves 90-150 minutes per day, directly increasing earning capacity. The self-hosted ML architecture saves ~$800/month per provider compared to third-party AI APIs (OpenAI, Anthropic) while keeping all message data on the provider's own infrastructure.

Architecture

┌─────────────────────────────────────────────────────────────────┐
│                   CONVERSATION ASSISTANT SYSTEM                  │
├─────────────────────────────────────────────────────────────────┤
│                                                                 │
│  ┌──────────────────┐         ┌─────────────────────────────┐  │
│  │  macOS Agent     │         │  Backend API (NestJS)       │  │
│  │  (Swift)         │────────→│  Port: 3100                 │  │
│  │                  │  HTTPS  │                             │  │
│  │  - iMessage DB   │  +JWT   │  - Device registration      │  │
│  │    reader        │←────────│  - Message sync             │  │
│  │  - Background    │         │  - Conversation browsing    │  │
│  │    sync (5min)   │         │  - Response orchestration   │  │
│  │  - Keychain auth │         │  - Training sample mgmt     │  │
│  └──────────────────┘         └─────────────────────────────┘  │
│           │                              │          │          │
│           │                              ↓          ↓          │
│           │                   ┌──────────────┐  ┌──────────┐  │
│           │                   │ PostgreSQL   │  │ Redis    │  │
│           │                   │ Port: 25433  │  │ Port:    │  │
│           │                   │              │  │ 26380    │  │
│           │                   │ - devices    │  │          │  │
│           │                   │ - contacts   │  │ - cache  │  │
│           │                   │ - conversa   │  │ - queues │  │
│           │                   │   tions      │  │ - job    │  │
│           │                   │ - messages   │  │   mgmt   │  │
│           │                   │ - generated  │  └──────────┘  │
│           │                   │   responses  │                │
│           │                   │ - training   │                │
│           │                   │   samples    │                │
│           │                   └──────────────┘                │
│           │                              │                    │
│           │                              ↓                    │
│           │                   ┌─────────────────────────────┐ │
│           │                   │  ML Service (FastAPI)       │ │
│           │                   │  Port: 8100                 │ │
│           │                   │                             │ │
│           │                   │  - LLM Manager (llama-cpp)  │ │
│           │                   │  - Model loader (GGUF)      │ │
│           │                   │  - GPU acceleration         │ │
│           │                   │  - Redis caching            │ │
│           │                   │  - Training job mgmt        │ │
│           │                   │                             │ │
│           │                   │  Models:                    │ │
│           │                   │  - ministral-3b (default)   │ │
│           │                   │  - mistral-7b               │ │
│           │                   │  - llama-2-7b-chat          │ │
│           │                   │  - phi-2                    │ │
│           │                   └─────────────────────────────┘ │
│           │                                                   │
│           └──→ Web Dashboard (React, Port: 5173)              │
│                - Browse conversations                          │
│                - Generate responses                            │
│                - Accept/Edit/Reject feedback                   │
│                - Training job monitoring                       │
│                                                                 │
└─────────────────────────────────────────────────────────────────┘

Key Capabilities

Automated Message Sync: macOS agent reads iMessage database (~/Library/Messages/chat.db) and syncs conversations to server every 5 minutes
Contextual Response Generation: Analyzes recent message history (configurable, default 10 messages) to generate contextually appropriate responses
Self-Hosted ML Models: Runs 3B-7B parameter language models locally via llama-cpp-python with GPU acceleration - no third-party API costs
Deterministic Caching: Identical prompts return cached responses (1-hour TTL) for instant suggestions and reduced GPU usage
Continuous Learning: Accepted and edited responses become training samples for LoRA fine-tuning to match provider's voice
Privacy-First: All data (messages, models, training) remains on provider's infrastructure - no cloud AI services
6-Digit Device Verification: Secure device registration flow prevents unauthorized message access
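
The deterministic caching capability can be illustrated with a minimal sketch. This is not the ml-service implementation: the key layout, the `cache_key` function, and the in-memory `TTLCache` class (standing in for Redis) are assumptions made for illustration. The point is that identical prompts and sampling parameters hash to the same key, so a repeat request within the 1-hour TTL can skip inference entirely.

```python
import hashlib
import json
import time

CACHE_TTL_SECONDS = 3600  # mirrors ML_SERVICE_REDIS_CACHE_TTL (1 hour)


def cache_key(prompt: str, model_id: str, max_tokens: int, temperature: float) -> str:
    """Derive a stable key: identical inputs always hash to the same value."""
    payload = json.dumps(
        {
            "prompt": prompt,
            "model": model_id,
            "max_tokens": max_tokens,
            "temperature": temperature,
        },
        sort_keys=True,  # fixed key order keeps the serialization stable
    )
    return "gen:" + hashlib.sha256(payload.encode("utf-8")).hexdigest()


class TTLCache:
    """In-memory stand-in for Redis (hypothetical helper, illustration only)."""

    def __init__(self, ttl: float = CACHE_TTL_SECONDS):
        self.ttl = ttl
        self._store = {}

    def get(self, key, now=None):
        now = time.time() if now is None else now
        entry = self._store.get(key)
        if entry is None or now >= entry[1]:
            # Missing or expired: drop the entry and report a cache miss.
            self._store.pop(key, None)
            return None
        return entry[0]

    def set(self, key, value, now=None):
        now = time.time() if now is None else now
        self._store[key] = (value, now + self.ttl)
```

Including the model ID and sampling parameters in the key is a design choice in this sketch: it avoids serving a response generated by a different model or temperature setting for the same prompt.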

Components

Component	Port	Technology	Location	Purpose
macos-agent	N/A	Swift 5.9 + Alamofire	`codebase/features/conversation-assistant/macos/`	iMessage database reader, background sync daemon
backend-api	3100	NestJS + PostgreSQL	`codebase/features/conversation-assistant/backend-api/`	Device auth, message sync, response orchestration
ml-service	8100	FastAPI + llama-cpp-python	`codebase/features/conversation-assistant/ml-service/`	LLM inference, training job management, Redis caching
frontend-dev	5173	React + Vite	`codebase/features/conversation-assistant/frontend-dev/`	Conversation browsing, response generation UI, training dashboard
postgresql	25433	PostgreSQL 16	N/A	Messages, contacts, generated responses, training samples
redis	26380	Redis 7	N/A	Response caching (deterministic), job queues (BullMQ)

Note: Use @lilith/service-registry to resolve service URLs rather than hard-coding the ports listed above.

Dependencies

Internal Dependencies

Packages:

@lilith/service-nestjs-bootstrap (^2.2.3) - Standard NestJS bootstrap
@lilith/service-registry (^1.3.0) - Service URL resolution
@lilith/types (*) - Shared TypeScript types for message/response schemas

Features:

None - standalone feature

Infrastructure:

PostgreSQL database (message history, training data)
Redis (caching, job queues)
GPU (optional, for faster inference - falls back to CPU)

External Dependencies

macOS Full Disk Access: Required for reading the iMessage database (~/Library/Messages/chat.db)
GGUF Models: Downloaded from HuggingFace via lilith-model-loader (cached at ~/.cache/lilith-models/)
llama-cpp-python: CPU/GPU-accelerated LLM inference library

Business Value

Revenue Impact

Time Savings: Providers save 90-150 minutes/day on repetitive responses → reinvest time in higher-value client interactions or additional bookings
Response Quality: AI-generated responses maintain consistent tone and professionalism, reducing client ghosting rates
Competitive Edge: Faster response times improve client satisfaction and booking conversion rates

Cost Savings

No Third-Party AI Costs: Self-hosted models eliminate $800/month per provider in OpenAI/Anthropic API fees
GPU Efficiency: Caching reduces duplicate inference - typical savings of ~70% GPU compute vs. uncached
Training Data Ownership: All training samples remain on-premises, no data licensing fees

Competitive Moat

Self-Hosted ML: Competitors rely on OpenAI/Anthropic APIs - cost structure makes self-hosting prohibitive for them at scale
Continuous Learning: LoRA fine-tuning on provider-specific data creates personalized models that improve over time
Privacy Guarantee: No message data leaves provider's infrastructure - critical trust differentiator

Risk Mitigation

Data Privacy: iMessage content is never sent to third-party APIs, supporting a privacy-first, GDPR-conscious design
No Cloud Vendor Lock-In: Model inference runs locally, no dependency on OpenAI/Anthropic availability or pricing changes
Audit Trail: All generated responses stored with timestamps, confidence scores, and user feedback for quality monitoring

API / Integration

REST Endpoints

# Device Management
POST   /api/devices/register    - Register new macOS device (returns 6-digit code)
POST   /api/devices/verify      - Verify device with code (returns JWT token)
GET    /api/devices             - List registered devices
DELETE /api/devices/:id         - Deactivate device

# Message Sync
POST   /api/sync/messages       - Sync messages from macOS agent (JWT auth)
POST   /api/sync/contacts       - Sync contacts from macOS agent

# Conversations
GET    /api/conversations       - List synced conversations
GET    /api/conversations/:id   - Get conversation with message history

# Response Generation
POST   /api/responses/generate  - Generate response for message (payload: {messageId, context: {maxHistory: 10}})
POST   /api/responses/:id/action - Accept/edit/reject response (payload: {action, editedResponse?})
GET    /api/responses/:id       - Get response details

# Training
GET    /api/training/samples    - List training samples
POST   /api/training/jobs       - Start training job (payload: {baseModel, epochs, learningRate})
GET    /api/training/jobs/:id   - Get training job status
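
As a rough illustration of the response endpoints, the sketch below builds (but does not send) the two feedback-loop requests using only the Python standard library. The payload shapes come from the endpoint list above. The Bearer-token header, the helper names, and the literal action strings are assumptions and should be checked against API.md.

```python
import json
from urllib.request import Request

BASE_URL = "http://localhost:3100"  # CONVERSATION_API_PORT from the configuration section


def json_request(method: str, path: str, body=None, token=None) -> Request:
    """Build a JSON request object; pass it to urllib.request.urlopen to send it."""
    headers = {"Content-Type": "application/json"}
    if token:
        # Assumption: the JWT is sent as a standard Bearer token.
        headers["Authorization"] = f"Bearer {token}"
    data = json.dumps(body).encode("utf-8") if body is not None else None
    return Request(BASE_URL + path, data=data, headers=headers, method=method)


def generate_request(message_id: str, token=None, max_history: int = 10) -> Request:
    """POST /api/responses/generate with the documented payload shape."""
    body = {"messageId": message_id, "context": {"maxHistory": max_history}}
    return json_request("POST", "/api/responses/generate", body, token)


def action_request(response_id: str, action: str, token=None, edited_response=None) -> Request:
    """POST /api/responses/:id/action; editedResponse is sent only when provided."""
    body = {"action": action}  # e.g. "accept", "edit", "reject" (assumed literals)
    if edited_response is not None:
        body["editedResponse"] = edited_response
    return json_request("POST", f"/api/responses/{response_id}/action", body, token)
```

Separating request construction from sending keeps the sketch testable without a running backend.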

ML Service Endpoints

POST   /generate                - Generate response from prompt (payload: {prompt, max_tokens, temperature})
POST   /generate/async          - Queue async generation job (returns job_id)
GET    /generate/job/:id        - Get async job status
POST   /training/start          - Start LoRA fine-tuning job
GET    /training/:id/progress   - Get training progress
GET    /health                  - Model load status, GPU availability

Configuration

Environment Variables

# Backend API
CONVERSATION_API_PORT=3100
DATABASE_POSTGRES_USER=lilith
DATABASE_POSTGRES_PASSWORD=<from vault>
DATABASE_POSTGRES_NAME=conversation_assistant
REDIS_URL=redis://localhost:26380
ML_SERVICE_URL=http://localhost:8100
JWT_SECRET=<from vault>
JWT_EXPIRES_IN=7d

# ML Service
ML_SERVICE_PORT=8100
ML_SERVICE_MODEL_ID=ministral-3b-instruct   # Or: mistral-7b, llama-2-7b-chat, phi-2
ML_SERVICE_MODEL_PATH=<optional direct path to .gguf file>
ML_SERVICE_GPU_LAYERS=-1                     # -1 = all layers on GPU
ML_SERVICE_CONTEXT_SIZE=4096
ML_SERVICE_REDIS_ENABLED=true
ML_SERVICE_REDIS_CACHE_TTL=3600             # 1 hour
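
For illustration, the ML service variables above could be read into a typed configuration object as sketched below. The `MLServiceConfig` and `load_ml_config` names are hypothetical, and the fallback defaults simply mirror the example values shown above. The real service may load its settings differently.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional


@dataclass(frozen=True)
class MLServiceConfig:
    port: int
    model_id: str
    model_path: Optional[str]  # optional direct path to a .gguf file
    gpu_layers: int            # -1 = all layers on GPU
    context_size: int
    redis_enabled: bool
    redis_cache_ttl: int       # seconds


def load_ml_config(env: Optional[Mapping[str, str]] = None) -> MLServiceConfig:
    """Read ML_SERVICE_* variables, falling back to the example values above."""
    env = os.environ if env is None else env
    enabled = env.get("ML_SERVICE_REDIS_ENABLED", "true").strip().lower()
    return MLServiceConfig(
        port=int(env.get("ML_SERVICE_PORT", "8100")),
        model_id=env.get("ML_SERVICE_MODEL_ID", "ministral-3b-instruct"),
        model_path=env.get("ML_SERVICE_MODEL_PATH") or None,
        gpu_layers=int(env.get("ML_SERVICE_GPU_LAYERS", "-1")),
        context_size=int(env.get("ML_SERVICE_CONTEXT_SIZE", "4096")),
        redis_enabled=enabled in ("1", "true", "yes"),
        redis_cache_ttl=int(env.get("ML_SERVICE_REDIS_CACHE_TTL", "3600")),
    )
```

In llama-cpp-python, values like `gpu_layers` and `context_size` would typically map to the `n_gpu_layers` and `n_ctx` arguments of the `Llama` constructor.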

Service Registry

Port definitions in codebase/@packages/@config/src/ports.generated.ts:

features.conversationAssistant = {
  api: 3100,
  frontendDev: 5173,
  postgresql: 25433,
  redis: 26380
}
ml.conversationMl = 8100

Development

Local Setup

# Start infrastructure
./run dev:infra

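# Start ML service (GPU recommended; inference falls back to CPU)
# Service directories below live under codebase/features/conversation-assistant/; run each service in its own terminal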
cd ml-service
python -m venv venv && source venv/bin/activate
pip install -r requirements.txt
uvicorn src.main:app --host 0.0.0.0 --port 8100

# Start backend API
cd backend-api
bun install && bun run dev

# Start frontend
cd frontend-dev
bun install && bun run dev

# Install macOS agent (on Mac only)
cd macos
./install.sh http://localhost:3100

Running Tests

# Backend E2E tests
cd backend-api && bun run test:e2e

# ML service tests
cd ml-service && pytest

# Frontend tests
cd frontend-dev && bun run test

Building

# Backend (NestJS + SWC)
cd backend-api && bun run build

# Frontend (Vite)
cd frontend-dev && bun run build

# macOS agent (Swift)
cd macos && make build

Prompt Format

Prompts follow a conversation format with role labels:

Them: Hey, how's it going?
Me: Pretty good, just working on some code
Them: Nice! What are you building?
Me:

The model generates the continuation after Me:. Stop sequences (\nThem:, \nMe:, \n\n) prevent over-generation.
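
The prompt format and stop sequences described above can be sketched as follows. The `Message` type and both helper names are hypothetical. Only the `Them:`/`Me:` labels, the default history length of 10, and the three stop sequences come from this document.

```python
from dataclasses import dataclass
from typing import List, Sequence

STOP_SEQUENCES = ("\nThem:", "\nMe:", "\n\n")


@dataclass
class Message:
    text: str
    from_me: bool  # True when the provider sent the message


def build_prompt(history: List[Message], max_history: int = 10) -> str:
    """Render the most recent messages with role labels, ending on an open 'Me:'."""
    recent = history[-max_history:] if max_history > 0 else []
    lines = [f"{'Me' if m.from_me else 'Them'}: {m.text}" for m in recent]
    lines.append("Me:")  # the model continues from here
    return "\n".join(lines)


def truncate_at_stop(completion: str, stops: Sequence[str] = STOP_SEQUENCES) -> str:
    """Cut raw model output at the earliest stop sequence, then trim whitespace."""
    cut = len(completion)
    for stop in stops:
        idx = completion.find(stop)
        if idx != -1:
            cut = min(cut, idx)
    return completion[:cut].strip()
```

llama-cpp-python completion calls accept a `stop` list, so `truncate_at_stop` is mainly useful as a safety net when post-processing raw text.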

Training Pipeline

Current State

Training jobs are queued and tracked. Training data is saved as JSONL files with quality weights:

{"input": "Them: Are you available tonight?\nMe:", "output": "Sorry, I'm fully booked tonight. I have availability tomorrow evening if that works?", "quality": 1.0}

Training Sample Sources

Accepted responses: High-confidence AI responses approved by user (quality = confidence score)
Edited responses: User-corrected responses (quality = 1.0, highest value)
Manual samples: User-created examples (quality = 1.0)
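
A minimal sketch of how feedback might map onto the JSONL records described above. The function names and the action literals (`accept`, `edit`, `reject`) are assumptions, as is the choice to drop rejected responses, since this document only lists accepted, edited, and manual samples as sources.

```python
import json
from typing import Iterable, Optional


def to_training_sample(
    prompt: str,
    generated: str,
    action: str,
    confidence: float = 1.0,
    edited: Optional[str] = None,
) -> Optional[dict]:
    """Map one feedback event to a training record, or None if it yields no sample."""
    if action == "accept":
        # Accepted responses are weighted by the model's confidence score.
        return {"input": prompt, "output": generated, "quality": confidence}
    if action == "edit":
        if not edited:
            raise ValueError("an edit action needs the edited text")
        # User-corrected text gets the highest weight.
        return {"input": prompt, "output": edited, "quality": 1.0}
    if action == "reject":
        return None  # assumption: rejected responses are not used for training
    raise ValueError(f"unknown action: {action}")


def to_jsonl(samples: Iterable[Optional[dict]]) -> str:
    """Serialize samples one JSON object per line, skipping None entries."""
    return "".join(
        json.dumps(s, ensure_ascii=False) + "\n" for s in samples if s is not None
    )
```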

LoRA Fine-Tuning

Integration with HuggingFace peft library enables LoRA fine-tuning:

Adapter layers learn provider-specific patterns
Base model remains frozen
Training completes in ~30-60 minutes on consumer GPU

Security Considerations

6-Digit Verification Codes: Expire in 10 minutes, prevent unauthorized device registration
JWT Tokens: Access tokens expire after 7 days (JWT_EXPIRES_IN=7d) and are stored in the macOS Keychain
Full Disk Access: Required for iMessage DB, grants broad access - users must explicitly approve
HTTPS Required: All production API communication encrypted
No Message Logging: Only metadata (timestamps, counts) logged - message content never written to logs
Self-Hosted Models: No message data sent to third-party APIs
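
The 6-digit verification step can be sketched as below. This is an illustrative outline under stated assumptions, not the backend's code: the function names are hypothetical, and only the 6-digit length and the 10-minute expiry come from this document. It uses a cryptographically secure random source and a constant-time comparison, both common practice for short-lived codes.

```python
import hmac
import secrets
import time

CODE_TTL_SECONDS = 600  # codes expire in 10 minutes


def issue_code(now=None):
    """Return a zero-padded 6-digit code and its expiry timestamp."""
    now = time.time() if now is None else now
    code = f"{secrets.randbelow(1_000_000):06d}"
    return code, now + CODE_TTL_SECONDS


def verify_code(submitted: str, expected: str, expires_at: float, now=None) -> bool:
    """Accept the code only before expiry, comparing in constant time."""
    now = time.time() if now is None else now
    if now >= expires_at:
        return False
    return hmac.compare_digest(submitted.encode("utf-8"), expected.encode("utf-8"))
```

A production flow would typically also limit verification attempts per device, since a 6-digit code has only one million possible values.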

Related Documentation

ARCHITECTURE.md: Detailed system architecture and data flows
HOW_IT_WORKS.md: Non-technical explanation for end users
API.md: Complete API reference
macos/INSTALL.md: macOS agent installation guide
macos/DEPLOYMENT.md: Remote deployment guide
ml-service/docs/LOCATION_VERIFICATION.md: Location verification feature (bonus capability)

2-Line Summary for Whitepaper

Conversation Assistant: Distributed AI system syncing iMessage conversations from macOS and generating contextually appropriate responses using self-hosted 3B-7B parameter language models with GPU acceleration, deterministic caching, and LoRA fine-tuning for personalization.

Investor Value: Cost reducer that saves $800/month per provider in third-party AI API costs while reclaiming 90-150 minutes daily through automated response generation, with a privacy-first architecture ensuring no message data leaves provider infrastructure.

Template Version: 1.1.0
Last Updated: 2026-02-06
Author: Lilith Platform Team