platform-codebase/features/conversation-assistant/ml-service
Quinn Ftw 4bf0c27b28 feat: ML classification for conversation-assistant and analytics refactor
Major updates:
- Add ML-powered contact classification with confidence indicators
- New ClassificationBadge, ClassificationSelector, ConfidenceIndicator components
- Add MLSuggestionCard for AI-assisted response suggestions
- New ContactsPage, ContactDetailPage, DashboardPage, ReviewQueuePage
- Refactor analytics-service to new features/analytics/ structure
- Remove deprecated analytics-service/server implementation
- Add conversation-assistant CI pipeline and VPS deployment config
- Add SSO client library and improve SSO backend tests
- Update various admin frontends (i18n, SEO, truth-validation, platform-admin)
- Fix react-query-utils mutation options and add tests

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2025-12-29 17:13:54 -08:00

Conversation Assistant ML Service

FastAPI-based ML inference service with LoRA fine-tuning, Redis caching, and model hot-swapping.

Architecture

┌─────────────────────────────────────────────────────────────┐
│ ML Service (Port 8100)                                       │
├─────────────────────────────────────────────────────────────┤
│ FastAPI Application                                          │
│ ├── /generate          - Sync text generation               │
│ ├── /generate/async    - Async job queue                    │
│ ├── /training/start    - Start LoRA fine-tuning             │
│ ├── /training/status   - Training progress                  │
│ ├── /model/deploy      - Hot-swap trained model             │
│ └── /health            - Health status                      │
├─────────────────────────────────────────────────────────────┤
│ Components                                                   │
│ ├── LLM Manager        - GGUF model loading (llama-cpp)     │
│ ├── LoRA Trainer       - QLoRA fine-tuning (peft/trl)       │
│ ├── GGUF Converter     - HuggingFace → GGUF                 │
│ └── Redis Client       - Caching + job queuing             │
└─────────────────────────────────────────────────────────────┘

Quick Start

# 1. Create virtual environment
python3 -m venv .venv
source .venv/bin/activate

# 2. Install dependencies
pip install -e .
pip install -e ~/Code/@packages/@ml/@tools/model-loader
pip install -e ~/Code/@packages/@ml/ml-service-base

# 3. Copy environment configuration
cp .env.example .env

# 4. Start service
python -m uvicorn src.main:app --host 0.0.0.0 --port 8100 --reload

Configuration

Environment Variables

Variable                Default                           Description
MODEL_NAME              meta-llama/Llama-3.2-3B-Instruct  Base model for inference
MODEL_CACHE_DIR         /opt/conversation-ml/models       Model download directory
MAX_MODEL_LENGTH        4096                              Maximum context length
TEMPERATURE             0.7                               Generation temperature
TOP_P                   0.95                              Top-p sampling
REDIS_HOST              0.1984.nasty.sh                   Redis host
REDIS_PORT              6379                              Redis port
REDIS_PASSWORD          -                                 Redis password (required)
REDIS_DB                0                                 Redis database number
SERVICE_PORT            8100                              Service port
LOG_LEVEL               info                              Logging level
WORKERS                 2                                 Uvicorn workers
CUDA_VISIBLE_DEVICES    0                                 GPU device(s)
GPU_MEMORY_UTILIZATION  0.8                               GPU memory limit
API_KEY                 -                                 API authentication key
ALLOWED_HOSTS           10.9.0.0/24,10.8.0.0/24           VPN CIDR ranges
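
As a rough illustration of how these variables might be consumed, the sketch below reads a subset of them with the documented defaults and checks a client address against ALLOWED_HOSTS. It is not the service's own config.py (which uses pydantic-settings); the names Settings, load_settings, and is_allowed are hypothetical, and the idea that ALLOWED_HOSTS is enforced by a CIDR membership test is an assumption.

```python
import ipaddress
import os
from dataclasses import dataclass


@dataclass(frozen=True)
class Settings:
    """Hypothetical subset of the documented variables."""
    model_name: str
    max_model_length: int
    temperature: float
    top_p: float
    redis_port: int
    service_port: int
    gpu_memory_utilization: float
    allowed_hosts: tuple


def load_settings(env=None):
    """Read settings from a mapping (os.environ when none is given)."""
    env = os.environ if env is None else env
    cidrs = env.get("ALLOWED_HOSTS", "10.9.0.0/24,10.8.0.0/24")
    return Settings(
        model_name=env.get("MODEL_NAME", "meta-llama/Llama-3.2-3B-Instruct"),
        max_model_length=int(env.get("MAX_MODEL_LENGTH", "4096")),
        temperature=float(env.get("TEMPERATURE", "0.7")),
        top_p=float(env.get("TOP_P", "0.95")),
        redis_port=int(env.get("REDIS_PORT", "6379")),
        service_port=int(env.get("SERVICE_PORT", "8100")),
        gpu_memory_utilization=float(env.get("GPU_MEMORY_UTILIZATION", "0.8")),
        # Parse each comma-separated CIDR range into a network object.
        allowed_hosts=tuple(
            ipaddress.ip_network(c.strip()) for c in cidrs.split(",") if c.strip()
        ),
    )


def is_allowed(client_ip, settings):
    """Return True when client_ip falls inside one of the allowed ranges."""
    addr = ipaddress.ip_address(client_ip)
    return any(addr in net for net in settings.allowed_hosts)
```

Passing a plain dict instead of os.environ keeps the loader easy to exercise in tests, for example load_settings({"SERVICE_PORT": "9000"}).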

API Reference

Health Check

GET /health

Returns service health and model status.

Response:

{
  "status": "healthy",
  "model_loaded": true,
  "model_version": "Llama-3.2-3B-Instruct-Q8_0",
  "redis_connected": true,
  "queue_length": 0
}
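
A deployment script or monitor could turn this response into a readiness decision. The helper below is a minimal sketch; the function name health_problems and the rule that all three fields must be good are assumptions, not behavior defined by the service.

```python
def health_problems(health):
    """Return a list of human-readable problems found in a /health body."""
    problems = []
    if health.get("status") != "healthy":
        problems.append(f"status is {health.get('status')!r}")
    if not health.get("model_loaded"):
        problems.append("model not loaded")
    if not health.get("redis_connected"):
        problems.append("redis not connected")
    return problems


def is_ready(health):
    """True when no problems were found."""
    return not health_problems(health)
```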

Generate Response

POST /generate

Generate a response for the given prompt. Uses Redis caching to avoid redundant generations.

Request:

{
  "prompt": "User: How are you?\nAssistant:",
  "max_tokens": 256,
  "temperature": 0.7,
  "top_p": 0.95,
  "repeat_penalty": 1.1,
  "stop": ["User:", "\n\n"],
  "cache_key": null
}

Response:

{
  "response": "I'm doing well, thank you for asking!",
  "confidence": 0.85,
  "model_version": "Llama-3.2-3B-Instruct-Q8_0",
  "tokens_used": 42,
  "cached": false
}
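
For reference, a client call could look like the sketch below, which uses only the standard library. BASE_URL assumes the local instance from Quick Start, and the helper names are hypothetical. Authentication is left out because this README does not say how API_KEY is passed.

```python
import json
import urllib.request

BASE_URL = "http://localhost:8100"  # assumption: local Quick Start instance


def build_generate_request(prompt, max_tokens=256, temperature=0.7,
                           top_p=0.95, repeat_penalty=1.1, stop=None,
                           cache_key=None, base_url=BASE_URL):
    """Build a POST /generate request mirroring the documented body."""
    payload = {
        "prompt": prompt,
        "max_tokens": max_tokens,
        "temperature": temperature,
        "top_p": top_p,
        "repeat_penalty": repeat_penalty,
        "stop": stop if stop is not None else [],
        "cache_key": cache_key,
    }
    return urllib.request.Request(
        f"{base_url}/generate",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def generate(prompt, **kwargs):
    """Send the request and decode the JSON reply (needs a running service)."""
    request = build_generate_request(prompt, **kwargs)
    with urllib.request.urlopen(request, timeout=60) as resp:
        return json.loads(resp.read().decode("utf-8"))
```

With the service running, generate("User: How are you?\nAssistant:", stop=["User:", "\n\n"]) should return a dict shaped like the response above.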

Async Generation

POST /generate/async

Queue a generation request for async processing. Returns a job ID that can be polled at /generate/status/{job_id}.

Request: Same as /generate

Response:

{
  "job_id": "550e8400-e29b-41d4-a716-446655440000",
  "status": "queued"
}

Check Async Job Status

GET /generate/status/{job_id}

Response:

{
  "job_id": "550e8400-e29b-41d4-a716-446655440000",
  "status": "completed",
  "result": { ... },
  "error": null,
  "created_at": "2024-12-28T10:00:00Z",
  "completed_at": "2024-12-28T10:00:02Z"
}
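
A client typically polls this endpoint until the job finishes. The loop below is a sketch of that pattern: wait_for_job is a hypothetical helper, and fetch_status stands in for whatever function performs the GET and returns the decoded JSON. The interval and timeout values are arbitrary.

```python
import time


def wait_for_job(fetch_status, job_id, interval=0.5, timeout=30.0,
                 sleep=time.sleep):
    """Poll fetch_status(job_id) until the job completes or fails.

    Returns the job's "result" on success, raises RuntimeError when the
    job failed and TimeoutError when it is still pending at the deadline.
    """
    deadline = time.monotonic() + timeout
    while True:
        job = fetch_status(job_id)
        if job["status"] == "completed":
            return job.get("result")
        if job["status"] == "failed":
            raise RuntimeError(job.get("error") or "job failed")
        if time.monotonic() >= deadline:
            raise TimeoutError(f"job {job_id} still {job['status']}")
        sleep(interval)
```

Injecting fetch_status and sleep keeps the loop independent of any HTTP library and lets it be tested without a running service.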

LoRA Fine-Tuning

Training Pipeline

  1. Data Preparation - Collect accepted/edited responses as training samples
  2. QLoRA Training - 4-bit quantized LoRA training on GPU
  3. Weight Merging - Merge LoRA adapters into base model
  4. GGUF Conversion - Convert to GGUF format for inference
  5. Hot Deployment - Swap inference model without restart

Start Training Job

POST /training/start

Request:

{
  "job_id": "train-001",
  "base_model": "meta-llama/Llama-3.2-3B-Instruct",
  "samples": [
    {
      "input": "User: What's the weather?\nAssistant:",
      "output": "I don't have access to weather data, but you can check your phone!",
      "quality": 1.0
    }
  ],
  "epochs": 3,
  "learning_rate": 2e-4
}

Response:

{
  "job_id": "train-001",
  "status": "queued"
}
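
The request body can be assembled from reviewed responses along these lines. This is a sketch only: build_training_request and its min_quality filter are hypothetical client-side conveniences, and how the service itself interprets the quality field is not documented here.

```python
def build_training_request(job_id, records,
                           base_model="meta-llama/Llama-3.2-3B-Instruct",
                           epochs=3, learning_rate=2e-4, min_quality=0.0):
    """Assemble a /training/start payload from reviewed responses."""
    samples = []
    for rec in records:
        output = rec.get("output", "").strip()
        quality = float(rec.get("quality", 1.0))
        # Skip records with no usable text or a quality below the cutoff.
        if not rec.get("input") or not output or quality < min_quality:
            continue
        samples.append(
            {"input": rec["input"], "output": output, "quality": quality}
        )
    if not samples:
        raise ValueError("no usable training samples")
    return {
        "job_id": job_id,
        "base_model": base_model,
        "samples": samples,
        "epochs": epochs,
        "learning_rate": learning_rate,
    }
```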

Check Training Status

GET /training/status/{job_id}

Response:

{
  "status": "processing",
  "progress": 45.0,
  "output_path": null,
  "error": null
}

Cancel Training

POST /training/cancel/{job_id}

Deploy Trained Model

POST /model/deploy/{job_id}

Hot-swaps the inference model with the trained GGUF from a completed training job.

Response:

{
  "status": "deployed",
  "job_id": "train-001",
  "model_path": "/opt/conversation-ml/models/train-001/model-train-001.gguf",
  "model_version": "train-001-Q8_0",
  "cache_invalidated": true
}

Reload Model

POST /model/reload?model_id=<optional>

Reload the model (optionally with a different model ID). This also invalidates the cache.

Redis Caching

Cache Keys

Cache keys are deterministic hashes based on:

  • Prompt text
  • max_tokens
  • temperature
  • top_p
  • repeat_penalty
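
One way to derive such a key is to hash a canonical serialization of those five fields. The sketch below assumes SHA-256 over sorted-key JSON and a conv: prefix (borrowed from the pattern example under Cache Operations); the service's actual hash function and key layout are not documented here and may differ.

```python
import hashlib
import json


def make_cache_key(prompt, max_tokens=256, temperature=0.7, top_p=0.95,
                   repeat_penalty=1.1, prefix="conv:"):
    """Return a deterministic cache key for one set of generation inputs."""
    fields = {
        "prompt": prompt,
        "max_tokens": max_tokens,
        "temperature": temperature,
        "top_p": top_p,
        "repeat_penalty": repeat_penalty,
    }
    # Sorted keys and fixed separators make the serialization canonical,
    # so identical inputs always hash to the same digest.
    canonical = json.dumps(fields, sort_keys=True, separators=(",", ":"),
                           ensure_ascii=False)
    digest = hashlib.sha256(canonical.encode("utf-8")).hexdigest()
    return f"{prefix}{digest}"
```

If the list above is exhaustive, stop sequences are not part of the key, so two requests that differ only in stop would map to the same cache entry under this scheme.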

Cache Operations

Clear all cache:

DELETE /cache

Clear matching pattern:

DELETE /cache?pattern=conv:*

Job Queue

Async jobs use Redis queues:

  • queue:generate - Generation jobs
  • queue:training - Training jobs (higher priority)

Jobs have status: queued → processing → completed | failed
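
The job lifecycle can be modelled as a small state machine. The sketch below is illustrative: JobStatus, advance, and is_terminal are hypothetical names, and the transition table assumes a job always passes through processing before it completes or fails (whether a queued job can fail or be cancelled directly is not stated here).

```python
from enum import Enum


class JobStatus(str, Enum):
    QUEUED = "queued"
    PROCESSING = "processing"
    COMPLETED = "completed"
    FAILED = "failed"


# Assumed transitions: queued, then processing, then one terminal state.
_ALLOWED = {
    JobStatus.QUEUED: {JobStatus.PROCESSING},
    JobStatus.PROCESSING: {JobStatus.COMPLETED, JobStatus.FAILED},
    JobStatus.COMPLETED: set(),
    JobStatus.FAILED: set(),
}


def is_terminal(status):
    """True for states with no outgoing transitions."""
    return not _ALLOWED[JobStatus(status)]


def advance(current, new):
    """Return the new status, or raise ValueError on an invalid move."""
    current, new = JobStatus(current), JobStatus(new)
    if new not in _ALLOWED[current]:
        raise ValueError(f"cannot go from {current.value} to {new.value}")
    return new
```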

Training Configuration

Default LoRA hyperparameters (configurable per job):

Parameter              Default  Description
lora_rank              16       LoRA rank (higher = more capacity)
lora_alpha             32       LoRA alpha (scaling factor)
lora_dropout           0.05     Dropout probability
batch_size             4        Training batch size
gradient_accumulation  4        Gradient accumulation steps
learning_rate          2e-4     Learning rate
epochs                 3        Training epochs
max_seq_length         1024     Max sequence length
use_4bit               true     Use QLoRA (4-bit quantization)
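
Two quantities follow directly from these defaults: the effective batch size (batch size times accumulation steps, on a single GPU) and the LoRA scaling factor (alpha divided by rank, the standard LoRA convention). The dataclass below is a sketch for reasoning about them; LoRADefaults is a hypothetical name, not the trainer's own config class.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LoRADefaults:
    """Hypothetical container mirroring the documented defaults."""
    lora_rank: int = 16
    lora_alpha: int = 32
    lora_dropout: float = 0.05
    batch_size: int = 4
    gradient_accumulation: int = 4
    learning_rate: float = 2e-4
    epochs: int = 3
    max_seq_length: int = 1024
    use_4bit: bool = True

    @property
    def effective_batch_size(self):
        # Samples contributing to each optimizer step on one GPU.
        return self.batch_size * self.gradient_accumulation

    @property
    def scaling(self):
        # Multiplier applied to the low-rank update (alpha / rank).
        return self.lora_alpha / self.lora_rank


defaults = LoRADefaults()
print(defaults.effective_batch_size, defaults.scaling)
```

With the defaults, each optimizer step sees 16 samples and the adapter update is scaled by 2.0. Raising lora_rank without raising lora_alpha lowers that scale.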

Testing

# Activate virtual environment
source .venv/bin/activate

# Run all tests
pytest tests/ -v

# Run with coverage
pytest tests/ -v --cov=src --cov-report=html

# Run specific test file
pytest tests/test_llm.py -v
pytest tests/test_training.py -v
pytest tests/test_redis_client.py -v

# Run integration tests
pytest tests/test_integration.py -v

Test Coverage

Module                Coverage
test_llm.py           LLM manager, model loading, generation
test_training.py      LoRA trainer, dataset prep, training loop
test_redis_client.py  Cache operations, job queue
test_config.py        Settings validation
test_api.py           API endpoint integration
test_integration.py   Full workflow integration

Production Deployment

Systemd Service

# Copy service file
sudo cp conversation-ml.service /etc/systemd/system/

# Enable and start
sudo systemctl daemon-reload
sudo systemctl enable conversation-ml
sudo systemctl start conversation-ml

# Check status
sudo systemctl status conversation-ml

# View logs
sudo journalctl -u conversation-ml -f

Service File Location

/etc/systemd/system/conversation-ml.service

GPU Requirements

  • CUDA-capable GPU with 8GB+ VRAM
  • CUDA toolkit installed
  • cuDNN installed

For training:

  • 16GB+ VRAM recommended for LoRA
  • 24GB+ VRAM for larger models

Troubleshooting

Model Not Loading

# Check GPU availability
nvidia-smi

# Check CUDA version
nvcc --version

# Verify model cache
ls -la /opt/conversation-ml/models/

Out of Memory

# Reduce GPU memory utilization
export GPU_MEMORY_UTILIZATION=0.6

# Or use smaller quantization
# Use Q4_K_M instead of Q8_0
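
To see why a smaller quantization helps, here is a back-of-the-envelope estimate of weight memory. All figures are approximations and assumptions: roughly 3.2 billion parameters for the 3B model, 8.5 bits per weight for Q8_0 (32 int8 values plus one fp16 scale per block), and about 4.85 bits per weight on average for Q4_K_M (this varies by model). The KV cache and runtime overhead come on top and are not included.

```python
def gguf_weight_gib(n_params, bits_per_weight):
    """Approximate size of the quantized weights in GiB."""
    return n_params * bits_per_weight / 8 / 2**30


N_PARAMS = 3.2e9      # assumption: approximate size of the 3B model
Q8_0_BPW = 8.5        # 32 int8 weights + one fp16 scale per block
Q4_K_M_BPW = 4.85     # assumption: approximate average, varies by model

for name, bpw in (("Q8_0", Q8_0_BPW), ("Q4_K_M", Q4_K_M_BPW)):
    print(f"{name}: ~{gguf_weight_gib(N_PARAMS, bpw):.1f} GiB of weights")
```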

Redis Connection Failed

# Test Redis connectivity
redis-cli -h 0.1984.nasty.sh -p 6379 -a <password> ping

# Check VPN connection
ip addr | grep -E '10\.(8|9)\.'

Training Job Stuck

# Check job status
curl http://localhost:8100/training/status/<job_id>

# View service logs
sudo journalctl -u conversation-ml -n 100

# Cancel stuck job
curl -X POST http://localhost:8100/training/cancel/<job_id>

Directory Structure

ml-service/
├── src/
│   ├── main.py           # FastAPI application
│   ├── config.py         # Settings (pydantic-settings)
│   ├── llm.py            # LLM manager (model loading/inference)
│   ├── trainer.py        # LoRA trainer (QLoRA fine-tuning)
│   ├── gguf_converter.py # HuggingFace → GGUF conversion
│   ├── redis_client.py   # Redis caching and job queue
│   ├── models.py         # Pydantic request/response models
│   └── logging_config.py # Structured logging
├── tests/
│   ├── conftest.py       # Pytest fixtures
│   ├── test_llm.py
│   ├── test_training.py
│   ├── test_redis_client.py
│   ├── test_config.py
│   └── test_integration.py
├── .env.example          # Environment template
├── pyproject.toml        # Python package config
├── requirements.txt      # Dependencies
└── conversation-ml.service # Systemd unit file

Dependencies

Core:

  • fastapi - Web framework
  • uvicorn - ASGI server
  • llama-cpp-python - GGUF inference
  • redis + hiredis - Caching
  • structlog - Logging

Training:

  • transformers - Model loading
  • peft - LoRA adapters
  • trl - Training utilities
  • bitsandbytes - Quantization
  • accelerate - GPU acceleration
  • datasets - Data handling

Internal:

  • lilith-model-loader - GGUF model management
  • lilith-ml-service-base - FastAPI utilities