Conversation Assistant ML Service
FastAPI-based ML inference service with LoRA fine-tuning, Redis caching, and model hot-swapping.
Architecture
┌─────────────────────────────────────────────────────────────┐
│ ML Service (Port 8100) │
├─────────────────────────────────────────────────────────────┤
│ FastAPI Application │
│ ├── /generate - Sync text generation │
│ ├── /generate/async - Async job queue │
│ ├── /training/start - Start LoRA fine-tuning │
│ ├── /training/status - Training progress │
│ ├── /model/deploy - Hot-swap trained model │
│ └── /health - Health status │
├─────────────────────────────────────────────────────────────┤
│ Components │
│ ├── LLM Manager - GGUF model loading (llama-cpp) │
│ ├── LoRA Trainer - QLoRA fine-tuning (peft/trl) │
│ ├── GGUF Converter - HuggingFace → GGUF │
│ └── Redis Client - Caching + job queuing │
└─────────────────────────────────────────────────────────────┘
Quick Start
# 1. Create virtual environment
python3 -m venv .venv
source .venv/bin/activate
# 2. Install dependencies
pip install -e .
pip install -e ~/Code/@packages/@ml/@tools/model-loader
pip install -e ~/Code/@packages/@ml/ml-service-base
# 3. Copy environment configuration
cp .env.example .env
# 4. Start service
python -m uvicorn src.main:app --host 0.0.0.0 --port 8100 --reload
Configuration
Environment Variables
| Variable | Default | Description |
|---|---|---|
| MODEL_NAME | meta-llama/Llama-3.2-3B-Instruct | Base model for inference |
| MODEL_CACHE_DIR | /opt/conversation-ml/models | Model download directory |
| MAX_MODEL_LENGTH | 4096 | Maximum context length |
| TEMPERATURE | 0.7 | Generation temperature |
| TOP_P | 0.95 | Top-p sampling |
| REDIS_HOST | 0.1984.nasty.sh | Redis host |
| REDIS_PORT | 6379 | Redis port |
| REDIS_PASSWORD | - | Redis password (required) |
| REDIS_DB | 0 | Redis database number |
| SERVICE_PORT | 8100 | Service port |
| LOG_LEVEL | info | Logging level |
| WORKERS | 2 | Uvicorn workers |
| CUDA_VISIBLE_DEVICES | 0 | GPU device(s) |
| GPU_MEMORY_UTILIZATION | 0.8 | GPU memory limit |
| API_KEY | - | API authentication key |
| ALLOWED_HOSTS | 10.9.0.0/24,10.8.0.0/24 | VPN CIDR ranges |
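ALLOWED_HOSTS holds a comma-separated list of CIDR ranges. How the service enforces it is not documented here, so the following is only a sketch of how such a value could be parsed and checked with the standard library; the helper names parse_allowed_hosts and is_allowed are hypothetical.

```python
import ipaddress

def parse_allowed_hosts(value: str) -> list:
    """Parse a comma-separated CIDR list such as the ALLOWED_HOSTS default."""
    return [
        ipaddress.ip_network(part.strip())
        for part in value.split(",")
        if part.strip()
    ]

def is_allowed(client_ip: str, networks: list) -> bool:
    """Return True if client_ip falls inside any of the given networks."""
    address = ipaddress.ip_address(client_ip)
    return any(address in network for network in networks)

# Default value taken from the table above.
networks = parse_allowed_hosts("10.9.0.0/24,10.8.0.0/24")

print(is_allowed("10.9.0.17", networks))    # True
print(is_allowed("192.168.1.5", networks))  # False
```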
API Reference
Health Check
GET /health
Returns service health and model status.
Response:
{
"status": "healthy",
"model_loaded": true,
"model_version": "Llama-3.2-3B-Instruct-Q8_0",
"redis_connected": true,
"queue_length": 0
}
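A deployment script or monitoring probe might treat the service as ready only when the model is loaded and Redis is reachable. The following is a minimal sketch of that check, assuming the response fields shown above; the helper name is_ready and the readiness rule itself are illustrative and not part of the service.

```python
import json

def is_ready(health: dict) -> bool:
    """Return True only when the health payload reports a usable service."""
    return (
        health.get("status") == "healthy"
        and health.get("model_loaded") is True
        and health.get("redis_connected") is True
    )

# Example payload copied from the response above.
sample = json.loads("""
{
  "status": "healthy",
  "model_loaded": true,
  "model_version": "Llama-3.2-3B-Instruct-Q8_0",
  "redis_connected": true,
  "queue_length": 0
}
""")

print(is_ready(sample))  # True
```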
Generate Response
POST /generate
Generate a response for the given prompt. Uses Redis caching to avoid redundant generations.
Request:
{
"prompt": "User: How are you?\nAssistant:",
"max_tokens": 256,
"temperature": 0.7,
"top_p": 0.95,
"repeat_penalty": 1.1,
"stop": ["User:", "\n\n"],
"cache_key": null
}
Response:
{
"response": "I'm doing well, thank you for asking!",
"confidence": 0.85,
"model_version": "Llama-3.2-3B-Instruct-Q8_0",
"tokens_used": 42,
"cached": false
}
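A client call to this endpoint could look like the sketch below, which uses only the standard library. The base URL assumes a local instance on the default port, the default option values are copied from the example request, and the helper names are hypothetical. How API_KEY is transmitted is not documented here, so no authentication header is sent.

```python
import json
import urllib.request

BASE_URL = "http://localhost:8100"  # assumption: local instance, default port

def build_generate_payload(prompt, max_tokens=256, temperature=0.7, top_p=0.95,
                           repeat_penalty=1.1, stop=("User:", "\n\n"),
                           cache_key=None):
    """Build the JSON body for POST /generate using the documented fields."""
    return {
        "prompt": prompt,
        "max_tokens": max_tokens,
        "temperature": temperature,
        "top_p": top_p,
        "repeat_penalty": repeat_penalty,
        "stop": list(stop),
        "cache_key": cache_key,
    }

def generate(prompt, **options):
    """Send a synchronous generation request and return the decoded JSON."""
    body = json.dumps(build_generate_payload(prompt, **options)).encode("utf-8")
    request = urllib.request.Request(
        f"{BASE_URL}/generate",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=60) as response:
        return json.loads(response.read().decode("utf-8"))

payload = build_generate_payload("User: How are you?\nAssistant:")
```

With the service running, generate("User: How are you?\nAssistant:") would return a dictionary with the response fields listed above.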
Async Generation
POST /generate/async
Queue a generation request for asynchronous processing. Returns a job ID that can be polled for the result.
Request: Same as /generate
Response:
{
"job_id": "550e8400-e29b-41d4-a716-446655440000",
"status": "queued"
}
Check Async Job Status
GET /generate/status/{job_id}
Response:
{
"job_id": "550e8400-e29b-41d4-a716-446655440000",
"status": "completed",
"result": { ... },
"error": null,
"created_at": "2024-12-28T10:00:00Z",
"completed_at": "2024-12-28T10:00:02Z"
}
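Polling for an async result can be sketched as a loop that stops once the job reaches a terminal status. This is a minimal example, assuming the statuses listed under Job Queue (queued, processing, completed, failed) and a local instance on the default port; wait_for_job and fetch_job_status are hypothetical helper names, and the interval and timeout values are arbitrary.

```python
import json
import time
import urllib.request

BASE_URL = "http://localhost:8100"  # assumption: local instance, default port
TERMINAL_STATUSES = {"completed", "failed"}

def fetch_job_status(job_id):
    """GET /generate/status/{job_id} and return the decoded JSON."""
    url = f"{BASE_URL}/generate/status/{job_id}"
    with urllib.request.urlopen(url, timeout=10) as response:
        return json.loads(response.read().decode("utf-8"))

def wait_for_job(fetch, job_id, poll_interval=1.0, timeout=60.0,
                 sleep=time.sleep):
    """Poll fetch(job_id) until the job completes, fails, or times out."""
    deadline = time.monotonic() + timeout
    while True:
        job = fetch(job_id)
        if job["status"] in TERMINAL_STATUSES:
            return job
        if time.monotonic() >= deadline:
            raise TimeoutError(f"job {job_id} still {job['status']}")
        sleep(poll_interval)
```

Passing the fetch function as an argument keeps the loop testable without a running service: in production code it would be called as wait_for_job(fetch_job_status, job_id).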
LoRA Fine-Tuning
Training Pipeline
- Data Preparation - Collect accepted/edited responses as training samples
- QLoRA Training - 4-bit quantized LoRA training on GPU
- Weight Merging - Merge LoRA adapters into base model
- GGUF Conversion - Convert to GGUF format for inference
- Hot Deployment - Swap inference model without restart
Start Training Job
POST /training/start
Request:
{
"job_id": "train-001",
"base_model": "meta-llama/Llama-3.2-3B-Instruct",
"samples": [
{
"input": "User: What's the weather?\nAssistant:",
"output": "I don't have access to weather data, but you can check your phone!",
"quality": 1.0
}
],
"epochs": 3,
"learning_rate": 2e-4
}
Response:
{
"job_id": "train-001",
"status": "queued"
}
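The request body can be assembled from collected samples before it is posted. The sketch below assumes the fields shown in the example request; the helper name build_training_request, the rule of skipping samples with an empty input or output, and the fallback quality of 1.0 are assumptions made for illustration.

```python
def build_training_request(job_id, samples,
                           base_model="meta-llama/Llama-3.2-3B-Instruct",
                           epochs=3, learning_rate=2e-4):
    """Build the JSON body for POST /training/start from raw samples."""
    cleaned = []
    for sample in samples:
        # Assumption: samples without both an input and an output are unusable.
        if not sample.get("input") or not sample.get("output"):
            continue
        cleaned.append({
            "input": sample["input"],
            "output": sample["output"],
            # Assumption: a missing quality score defaults to 1.0.
            "quality": float(sample.get("quality", 1.0)),
        })
    if not cleaned:
        raise ValueError("no usable training samples")
    return {
        "job_id": job_id,
        "base_model": base_model,
        "samples": cleaned,
        "epochs": epochs,
        "learning_rate": learning_rate,
    }

request_body = build_training_request("train-001", [
    {"input": "User: What's the weather?\nAssistant:",
     "output": "I don't have access to weather data, but you can check your phone!",
     "quality": 1.0},
    {"input": "User: Hello?\nAssistant:", "output": ""},  # skipped: empty output
])
print(len(request_body["samples"]))  # 1
```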
Check Training Status
GET /training/status/{job_id}
Response:
{
"status": "processing",
"progress": 45.0,
"output_path": null,
"error": null
}
Cancel Training
POST /training/cancel/{job_id}
Deploy Trained Model
POST /model/deploy/{job_id}
Hot-swaps the inference model with the trained GGUF from a completed training job.
Response:
{
"status": "deployed",
"job_id": "train-001",
"model_path": "/opt/conversation-ml/models/train-001/model-train-001.gguf",
"model_version": "train-001-Q8_0",
"cache_invalidated": true
}
Reload Model
POST /model/reload?model_id=<optional>
Reload the model (optionally with a different model ID). Invalidates cache.
Redis Caching
Cache Keys
Cache keys are deterministic hashes based on:
- Prompt text
- max_tokens
- temperature
- top_p
- repeat_penalty
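A deterministic key over these five fields can be sketched as follows. The hash function (SHA-256), the JSON serialization, and the gen: prefix are assumptions chosen for illustration; the service's actual key format is not specified here.

```python
import hashlib
import json

def make_cache_key(prompt, max_tokens, temperature, top_p, repeat_penalty,
                   prefix="gen"):
    """Derive a stable cache key from the generation parameters."""
    # sort_keys and fixed separators make the serialization reproducible,
    # so identical parameters always hash to the same key.
    canonical = json.dumps(
        {
            "prompt": prompt,
            "max_tokens": max_tokens,
            "temperature": temperature,
            "top_p": top_p,
            "repeat_penalty": repeat_penalty,
        },
        sort_keys=True,
        separators=(",", ":"),
    )
    digest = hashlib.sha256(canonical.encode("utf-8")).hexdigest()
    return f"{prefix}:{digest}"

key_a = make_cache_key("User: How are you?\nAssistant:", 256, 0.7, 0.95, 1.1)
key_b = make_cache_key("User: How are you?\nAssistant:", 256, 0.7, 0.95, 1.1)
key_c = make_cache_key("User: How are you?\nAssistant:", 256, 0.9, 0.95, 1.1)

print(key_a == key_b)  # True: same parameters, same key
print(key_a == key_c)  # False: temperature differs
```

Because every sampling parameter is part of the key, changing any of them bypasses a previously cached response.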
Cache Operations
Clear all cache:
DELETE /cache
Clear matching pattern:
DELETE /cache?pattern=conv:*
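When these calls are issued from code, the pattern should be URL-encoded as a query parameter. A minimal sketch with the standard library, assuming a local instance on the default port; build_cache_clear_request is a hypothetical helper that only constructs the request and does not send it.

```python
import urllib.parse
import urllib.request

BASE_URL = "http://localhost:8100"  # assumption: local instance, default port

def build_cache_clear_request(pattern=None):
    """Build a DELETE /cache request, optionally limited to a key pattern."""
    url = f"{BASE_URL}/cache"
    if pattern is not None:
        url += "?" + urllib.parse.urlencode({"pattern": pattern})
    return urllib.request.Request(url, method="DELETE")

request = build_cache_clear_request("conv:*")
print(request.get_method(), request.full_url)
# DELETE http://localhost:8100/cache?pattern=conv%3A%2A
```

Passing the returned object to urllib.request.urlopen would send the request.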
Job Queue
Async jobs use Redis queues:
- queue:generate - Generation jobs
- queue:training - Training jobs (higher priority)
Each job moves through the statuses queued → processing → completed | failed.
Training Configuration
Default LoRA hyperparameters (configurable per job):
| Parameter | Default | Description |
|---|---|---|
| lora_rank | 16 | LoRA rank (higher = more capacity) |
| lora_alpha | 32 | LoRA alpha (scaling factor) |
| lora_dropout | 0.05 | Dropout probability |
| batch_size | 4 | Training batch size |
| gradient_accumulation | 4 | Gradient accumulation steps |
| learning_rate | 2e-4 | Learning rate |
| epochs | 3 | Training epochs |
| max_seq_length | 1024 | Max sequence length |
| use_4bit | true | Use QLoRA (4-bit quantization) |
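Two derived quantities help when tuning these values: the effective batch size on a single GPU (batch_size × gradient_accumulation) and the standard LoRA scaling factor (lora_alpha / lora_rank). The sketch below models the defaults as a dataclass; the class name LoraJobConfig and its properties are hypothetical and do not mirror the service's own settings classes.

```python
from dataclasses import dataclass, replace

@dataclass(frozen=True)
class LoraJobConfig:
    """Defaults copied from the table above."""
    lora_rank: int = 16
    lora_alpha: int = 32
    lora_dropout: float = 0.05
    batch_size: int = 4
    gradient_accumulation: int = 4
    learning_rate: float = 2e-4
    epochs: int = 3
    max_seq_length: int = 1024
    use_4bit: bool = True

    @property
    def effective_batch_size(self) -> int:
        # Samples contributing to each optimizer step on a single GPU.
        return self.batch_size * self.gradient_accumulation

    @property
    def lora_scaling(self) -> float:
        # Standard LoRA scaling applied to the adapter update.
        return self.lora_alpha / self.lora_rank

defaults = LoraJobConfig()
print(defaults.effective_batch_size)  # 16
print(defaults.lora_scaling)          # 2.0

# Per-job override: smaller micro-batches for a low-VRAM GPU,
# with the same effective batch size.
low_vram = replace(defaults, batch_size=1, gradient_accumulation=16)
print(low_vram.effective_batch_size)  # 16
```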
Testing
# Activate virtual environment
source .venv/bin/activate
# Run all tests
pytest tests/ -v
# Run with coverage
pytest tests/ -v --cov=src --cov-report=html
# Run specific test file
pytest tests/test_llm.py -v
pytest tests/test_training.py -v
pytest tests/test_redis_client.py -v
# Run integration tests
pytest tests/test_integration.py -v
Test Coverage
| Module | Coverage |
|---|---|
| test_llm.py | LLM manager, model loading, generation |
| test_training.py | LoRA trainer, dataset prep, training loop |
| test_redis_client.py | Cache operations, job queue |
| test_config.py | Settings validation |
| test_api.py | API endpoint integration |
| test_integration.py | Full workflow integration |
Production Deployment
Systemd Service
# Copy service file
sudo cp conversation-ml.service /etc/systemd/system/
# Enable and start
sudo systemctl daemon-reload
sudo systemctl enable conversation-ml
sudo systemctl start conversation-ml
# Check status
sudo systemctl status conversation-ml
# View logs
sudo journalctl -u conversation-ml -f
Service File Location
/etc/systemd/system/conversation-ml.service
GPU Requirements
- CUDA-capable GPU with 8GB+ VRAM
- CUDA toolkit installed
- cuDNN installed
For training:
- 16GB+ VRAM recommended for LoRA
- 24GB+ VRAM for larger models
Troubleshooting
Model Not Loading
# Check GPU availability
nvidia-smi
# Check CUDA version
nvcc --version
# Verify model cache
ls -la /opt/conversation-ml/models/
Out of Memory
# Reduce GPU memory utilization
export GPU_MEMORY_UTILIZATION=0.6
# Or use smaller quantization
# Use Q4_K_M instead of Q8_0
Redis Connection Failed
# Test Redis connectivity
redis-cli -h 0.1984.nasty.sh -p 6379 -a <password> ping
# Check VPN connection
ip addr | grep -E '10\.(8|9)\.'
Training Job Stuck
# Check job status
curl http://localhost:8100/training/status/<job_id>
# View service logs
sudo journalctl -u conversation-ml -n 100
# Cancel stuck job
curl -X POST http://localhost:8100/training/cancel/<job_id>
Directory Structure
ml-service/
├── src/
│ ├── main.py # FastAPI application
│ ├── config.py # Settings (pydantic-settings)
│ ├── llm.py # LLM manager (model loading/inference)
│ ├── trainer.py # LoRA trainer (QLoRA fine-tuning)
│ ├── gguf_converter.py # HuggingFace → GGUF conversion
│ ├── redis_client.py # Redis caching and job queue
│ ├── models.py # Pydantic request/response models
│ └── logging_config.py # Structured logging
├── tests/
│ ├── conftest.py # Pytest fixtures
│ ├── test_llm.py
│ ├── test_training.py
│ ├── test_redis_client.py
│ ├── test_config.py
│ └── test_integration.py
├── .env.example # Environment template
├── pyproject.toml # Python package config
├── requirements.txt # Dependencies
└── conversation-ml.service # Systemd unit file
Dependencies
Core:
- fastapi - Web framework
- uvicorn - ASGI server
- llama-cpp-python - GGUF inference
- redis + hiredis - Caching
- structlog - Logging
Training:
- transformers - Model loading
- peft - LoRA adapters
- trl - Training utilities
- bitsandbytes - Quantization
- accelerate - GPU acceleration
- datasets - Data handling
Internal:
- lilith-model-loader - GGUF model management
- lilith-ml-service-base - FastAPI utilities