Conversation Assistant ML Service

FastAPI-based ML inference service with LoRA fine-tuning, Redis caching, and model hot-swapping.

Architecture

┌─────────────────────────────────────────────────────────────┐
│ ML Service (Port 8100)                                       │
├─────────────────────────────────────────────────────────────┤
│ FastAPI Application                                          │
│ ├── /generate          - Sync text generation               │
│ ├── /generate/async    - Async job queue                    │
│ ├── /training/start    - Start LoRA fine-tuning             │
│ ├── /training/status   - Training progress                  │
│ ├── /model/deploy      - Hot-swap trained model             │
│ └── /health            - Health status                      │
├─────────────────────────────────────────────────────────────┤
│ Components                                                   │
│ ├── LLM Manager        - GGUF model loading (llama-cpp)     │
│ ├── LoRA Trainer       - QLoRA fine-tuning (peft/trl)       │
│ ├── GGUF Converter     - HuggingFace → GGUF                 │
│ └── Redis Client       - Caching + job queuing             │
└─────────────────────────────────────────────────────────────┘

Quick Start

# 1. Create virtual environment
python3 -m venv .venv
source .venv/bin/activate

# 2. Install dependencies
pip install -e .
pip install -e ~/Code/@packages/@ml/@tools/model-loader
pip install -e ~/Code/@packages/@ml/ml-service-base

# 3. Copy environment configuration
cp .env.example .env

# 4. Start service
python -m uvicorn src.main:app --host 0.0.0.0 --port 8100 --reload

Configuration

Environment Variables

Variable	Default	Description
`MODEL_NAME`	`meta-llama/Llama-3.2-3B-Instruct`	Base model for inference
`MODEL_CACHE_DIR`	`/opt/conversation-ml/models`	Model download directory
`MAX_MODEL_LENGTH`	`4096`	Maximum context length
`TEMPERATURE`	`0.7`	Generation temperature
`TOP_P`	`0.95`	Top-p sampling
`REDIS_HOST`	`0.1984.nasty.sh`	Redis host
`REDIS_PORT`	`6379`	Redis port
`REDIS_PASSWORD`	-	Redis password (required)
`REDIS_DB`	`0`	Redis database number
`SERVICE_PORT`	`8100`	Service port
`LOG_LEVEL`	`info`	Logging level
`WORKERS`	`2`	Uvicorn workers
`CUDA_VISIBLE_DEVICES`	`0`	GPU device(s)
`GPU_MEMORY_UTILIZATION`	`0.8`	Fraction of GPU memory the service may use (0.0 to 1.0)
`API_KEY`	-	API authentication key
`ALLOWED_HOSTS`	`10.9.0.0/24,10.8.0.0/24`	Comma-separated CIDR ranges of allowed clients (VPN subnets)
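
To illustrate how a comma-separated CIDR list such as `ALLOWED_HOSTS` can be interpreted, the sketch below parses it with Python's standard `ipaddress` module and checks a client address against it. The helper names `parse_allowed_hosts` and `is_allowed` are hypothetical and not taken from the service code; the actual enforcement may differ.

```python
import ipaddress


def parse_allowed_hosts(value: str) -> list:
    """Split a comma-separated CIDR string into network objects."""
    return [
        ipaddress.ip_network(part.strip())
        for part in value.split(",")
        if part.strip()
    ]


def is_allowed(client_ip: str, networks: list) -> bool:
    """Return True if client_ip falls inside any of the allowed networks."""
    addr = ipaddress.ip_address(client_ip)
    return any(addr in net for net in networks)


# Default value from the table above
networks = parse_allowed_hosts("10.9.0.0/24,10.8.0.0/24")
print(is_allowed("10.9.0.17", networks))    # True
print(is_allowed("192.168.1.5", networks))  # False
```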

API Reference

Health Check

GET /health

Returns service health and model status.

Response:

{
  "status": "healthy",
  "model_loaded": true,
  "model_version": "Llama-3.2-3B-Instruct-Q8_0",
  "redis_connected": true,
  "queue_length": 0
}
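
A client might use this payload as a readiness gate. The following is a minimal sketch using only the standard library; `fetch_health` and `is_ready` are hypothetical helper names, and treating "ready" as "healthy, model loaded, and Redis connected" is an assumption rather than documented behavior.

```python
import json
import urllib.request


def fetch_health(base_url: str, timeout: float = 5.0) -> dict:
    """GET /health and decode the JSON body."""
    with urllib.request.urlopen(f"{base_url}/health", timeout=timeout) as resp:
        return json.load(resp)


def is_ready(health: dict) -> bool:
    """Ready only when the service is healthy and both the model and Redis are up."""
    return (
        health.get("status") == "healthy"
        and health.get("model_loaded") is True
        and health.get("redis_connected") is True
    )


# Example with the payload shown above (no network call needed)
sample = {
    "status": "healthy",
    "model_loaded": True,
    "model_version": "Llama-3.2-3B-Instruct-Q8_0",
    "redis_connected": True,
    "queue_length": 0,
}
print(is_ready(sample))  # True
```

Against a running service, `is_ready(fetch_health("http://localhost:8100"))` would perform the same check.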

Generate Response

POST /generate

Generate a response for the given prompt. Uses Redis caching to avoid redundant generations.

Request:

{
  "prompt": "User: How are you?\nAssistant:",
  "max_tokens": 256,
  "temperature": 0.7,
  "top_p": 0.95,
  "repeat_penalty": 1.1,
  "stop": ["User:", "\n\n"],
  "cache_key": null
}

Response:

{
  "response": "I'm doing well, thank you for asking!",
  "confidence": 0.85,
  "model_version": "Llama-3.2-3B-Instruct-Q8_0",
  "tokens_used": 42,
  "cached": false
}
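
The request above can be sent from Python with the standard library alone. This is a sketch, not the project's client: `build_generate_request` and `generate` are hypothetical names, the base URL assumes a local instance on port 8100, and no authentication header is set because this README does not say how `API_KEY` is transmitted.

```python
import json
import urllib.request


def build_generate_request(base_url: str, prompt: str, **options) -> urllib.request.Request:
    """Build a POST /generate request; options override the documented fields."""
    body = {
        "prompt": prompt,
        "max_tokens": 256,
        "temperature": 0.7,
        "top_p": 0.95,
        "repeat_penalty": 1.1,
        "stop": ["User:", "\n\n"],
        "cache_key": None,
    }
    body.update(options)
    return urllib.request.Request(
        f"{base_url}/generate",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def generate(base_url: str, prompt: str, **options) -> dict:
    """Send the request and return the decoded response body."""
    request = build_generate_request(base_url, prompt, **options)
    with urllib.request.urlopen(request, timeout=60) as resp:
        return json.load(resp)


# Usage (requires a running service):
# result = generate("http://localhost:8100", "User: How are you?\nAssistant:")
# print(result["response"], result["cached"])
```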

Async Generation

POST /generate/async

Queue a generation request for asynchronous processing. Returns a job ID for polling.

Request: Same as /generate

Response:

{
  "job_id": "550e8400-e29b-41d4-a716-446655440000",
  "status": "queued"
}

Check Async Job Status

GET /generate/status/{job_id}

Response:

{
  "job_id": "550e8400-e29b-41d4-a716-446655440000",
  "status": "completed",
  "result": { ... },
  "error": null,
  "created_at": "2024-12-28T10:00:00Z",
  "completed_at": "2024-12-28T10:00:02Z"
}
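
Polling for completion can be sketched as a small loop. The function below is illustrative only: `wait_for_job` is a hypothetical helper, the status fetcher is passed in so the loop stays independent of any HTTP library, and the interval and timeout values are arbitrary assumptions.

```python
import time
from typing import Callable


def wait_for_job(
    fetch_status: Callable[[str], dict],
    job_id: str,
    interval: float = 0.5,
    timeout: float = 60.0,
) -> dict:
    """Poll until the job is completed; raise if it fails or the timeout expires."""
    deadline = time.monotonic() + timeout
    while True:
        job = fetch_status(job_id)
        if job["status"] == "completed":
            return job
        if job["status"] == "failed":
            raise RuntimeError(job.get("error") or "job failed")
        if time.monotonic() >= deadline:
            raise TimeoutError(f"job {job_id} did not finish within {timeout}s")
        time.sleep(interval)
```

Here `fetch_status` would be any callable that performs `GET /generate/status/{job_id}` and returns the decoded JSON body.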

LoRA Fine-Tuning

Training Pipeline

1. Data Preparation - Collect accepted/edited responses as training samples
2. QLoRA Training - 4-bit quantized LoRA training on GPU
3. Weight Merging - Merge LoRA adapters into base model
4. GGUF Conversion - Convert to GGUF format for inference
5. Hot Deployment - Swap inference model without restart

Start Training Job

POST /training/start

Request:

{
  "job_id": "train-001",
  "base_model": "meta-llama/Llama-3.2-3B-Instruct",
  "samples": [
    {
      "input": "User: What's the weather?\nAssistant:",
      "output": "I don't have access to weather data, but you can check your phone!",
      "quality": 1.0
    }
  ],
  "epochs": 3,
  "learning_rate": 2e-4
}

Response:

{
  "job_id": "train-001",
  "status": "queued"
}
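
The `samples` array in the request above could be assembled from reviewed responses along the following lines. Everything here is an assumption made for illustration: the `ReviewedResponse` record, its field names, and the `0.8` quality weight for edited responses are hypothetical and not taken from the service.

```python
from dataclasses import dataclass


@dataclass
class ReviewedResponse:
    """Hypothetical record of a suggestion after human review."""
    prompt: str
    suggested: str
    final: str   # text that was actually sent
    action: str  # "accepted", "edited", or "rejected"


def to_training_samples(records, edited_quality: float = 0.8) -> list:
    """Convert accepted/edited records into the `samples` request format."""
    samples = []
    for rec in records:
        if rec.action == "accepted":
            quality = 1.0
        elif rec.action == "edited":
            quality = edited_quality  # assumed weight, not defined by the service
        else:
            continue  # other outcomes are not used for training
        samples.append({"input": rec.prompt, "output": rec.final, "quality": quality})
    return samples
```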

Check Training Status

GET /training/status/{job_id}

Response:

{
  "status": "processing",
  "progress": 45.0,
  "output_path": null,
  "error": null
}

Cancel Training

POST /training/cancel/{job_id}

Deploy Trained Model

POST /model/deploy/{job_id}

Hot-swaps the inference model with the trained GGUF from a completed training job.

Response:

{
  "status": "deployed",
  "job_id": "train-001",
  "model_path": "/opt/conversation-ml/models/train-001/model-train-001.gguf",
  "model_version": "train-001-Q8_0",
  "cache_invalidated": true
}
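
One common way to swap a model without a restart is to load the replacement fully, then exchange the active reference under a lock and clear any cached outputs. The sketch below shows that general pattern with plain callables standing in for models; `ModelHolder` is a hypothetical class and is not a description of how this service implements `/model/deploy`.

```python
import threading
from typing import Callable


class ModelHolder:
    """Holds the active model and lets it be replaced while serving requests."""

    def __init__(self, model: Callable[[str], str], version: str):
        self._lock = threading.Lock()
        self._model = model
        self.version = version
        self.cache = {}

    def generate(self, prompt: str) -> str:
        with self._lock:
            model = self._model  # snapshot the reference, then release the lock
            if prompt in self.cache:
                return self.cache[prompt]
        text = model(prompt)
        with self._lock:
            # Only cache the result if no swap happened while generating
            if model is self._model:
                self.cache[prompt] = text
        return text

    def deploy(self, load_model: Callable[[], Callable[[str], str]], version: str) -> None:
        new_model = load_model()  # load fully before touching the active model
        with self._lock:
            self._model = new_model
            self.version = version
            self.cache.clear()  # cached outputs belong to the old model
```

Loading outside the lock keeps in-flight requests on the old model until the new one is ready, so the swap itself is only a reference assignment.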

Reload Model

POST /model/reload?model_id=<optional>

Reload the model (optionally with a different model ID). Invalidates the cache.

Redis Caching

Cache Keys

Cache keys are deterministic hashes based on:

Prompt text
max_tokens
temperature
top_p
repeat_penalty
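
A deterministic key over these fields can be derived by hashing a canonical serialization. The sketch below uses SHA-256 over sorted-key JSON; the hash function, the serialization, and the `conv:` prefix are assumptions chosen for illustration, and `make_cache_key` is a hypothetical name. The service's actual key format may differ.

```python
import hashlib
import json


def make_cache_key(
    prompt: str,
    max_tokens: int,
    temperature: float,
    top_p: float,
    repeat_penalty: float,
    prefix: str = "conv:",
) -> str:
    """Hash the generation parameters into a stable cache key."""
    payload = json.dumps(
        {
            "prompt": prompt,
            "max_tokens": max_tokens,
            "temperature": temperature,
            "top_p": top_p,
            "repeat_penalty": repeat_penalty,
        },
        sort_keys=True,         # fixed field order keeps the hash stable
        separators=(",", ":"),  # no whitespace variation
    )
    digest = hashlib.sha256(payload.encode("utf-8")).hexdigest()
    return prefix + digest


key_a = make_cache_key("User: How are you?\nAssistant:", 256, 0.7, 0.95, 1.1)
key_b = make_cache_key("User: How are you?\nAssistant:", 256, 0.7, 0.95, 1.1)
print(key_a == key_b)  # True
```

Because every sampling parameter is part of the key, changing any of them (for example `temperature`) produces a different key and therefore a cache miss.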

Cache Operations

Clear all cache:

DELETE /cache

Clear matching pattern:

DELETE /cache?pattern=conv:*

Job Queue

Async jobs use Redis queues:

queue:generate - Generation jobs
queue:training - Training jobs (higher priority)

Each job moves from queued to processing, then ends as completed or failed: queued → processing → completed | failed
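
The status lifecycle and the priority between the two queues can be modeled in a few lines. This is an in-memory sketch with hypothetical helpers (`advance`, `next_job`); it does not talk to Redis, and cancellation is not modeled because its resulting status is not documented here. With Redis itself, a blocking pop over several lists (`BLPOP`) checks keys in the order given, which is one way such a priority could be achieved, though whether the service does so is not stated.

```python
from collections import deque

# Allowed status transitions, following the lifecycle described above
ALLOWED_TRANSITIONS = {
    "queued": {"processing"},
    "processing": {"completed", "failed"},
    "completed": set(),
    "failed": set(),
}

# Queues in priority order: training jobs are taken first
QUEUE_PRIORITY = ("queue:training", "queue:generate")


def advance(current: str, new: str) -> str:
    """Return the new status, or raise if the transition is not allowed."""
    if new not in ALLOWED_TRANSITIONS[current]:
        raise ValueError(f"invalid transition: {current} -> {new}")
    return new


def next_job(queues: dict):
    """Pop the oldest job from the highest-priority non-empty queue."""
    for name in QUEUE_PRIORITY:
        if queues[name]:
            return name, queues[name].popleft()
    return None
```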

Training Configuration

Default LoRA hyperparameters (configurable per job):

Parameter	Default	Description
`lora_rank`	16	LoRA rank (higher = more capacity)
`lora_alpha`	32	LoRA alpha (scaling factor)
`lora_dropout`	0.05	Dropout probability
`batch_size`	4	Training batch size
`gradient_accumulation`	4	Gradient accumulation steps
`learning_rate`	2e-4	Learning rate
`epochs`	3	Training epochs
`max_seq_length`	1024	Max sequence length
`use_4bit`	true	Use QLoRA (4-bit quantization)
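
Two derived quantities follow from these defaults: the effective batch size per device (batch size times gradient accumulation steps) and the standard LoRA scaling factor (alpha divided by rank). The dataclass below is a sketch for reasoning about the numbers; `TrainingDefaults` is a hypothetical name and not the service's configuration class.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TrainingDefaults:
    """Defaults copied from the table above."""
    lora_rank: int = 16
    lora_alpha: int = 32
    lora_dropout: float = 0.05
    batch_size: int = 4
    gradient_accumulation: int = 4
    learning_rate: float = 2e-4
    epochs: int = 3
    max_seq_length: int = 1024
    use_4bit: bool = True

    @property
    def effective_batch_size(self) -> int:
        """Samples contributing to each optimizer step on one device."""
        return self.batch_size * self.gradient_accumulation

    @property
    def lora_scaling(self) -> float:
        """Standard LoRA scaling applied to the adapter update: alpha / rank."""
        return self.lora_alpha / self.lora_rank


cfg = TrainingDefaults()
print(cfg.effective_batch_size)  # 16
print(cfg.lora_scaling)          # 2.0
```

Raising `lora_rank` without also raising `lora_alpha` lowers the scaling factor, so the two are usually adjusted together.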

Testing

# Activate virtual environment
source .venv/bin/activate

# Run all tests
pytest tests/ -v

# Run with coverage
pytest tests/ -v --cov=src --cov-report=html

# Run specific test file
pytest tests/test_llm.py -v
pytest tests/test_training.py -v
pytest tests/test_redis_client.py -v

# Run integration tests
pytest tests/test_integration.py -v

Test Coverage

Test file	Covers
`test_llm.py`	LLM manager, model loading, generation
`test_training.py`	LoRA trainer, dataset prep, training loop
`test_redis_client.py`	Cache operations, job queue
`test_config.py`	Settings validation
`test_api.py`	API endpoint integration
`test_integration.py`	Full workflow integration

Production Deployment

Systemd Service

# Copy service file
sudo cp conversation-ml.service /etc/systemd/system/

# Enable and start
sudo systemctl daemon-reload
sudo systemctl enable conversation-ml
sudo systemctl start conversation-ml

# Check status
sudo systemctl status conversation-ml

# View logs
sudo journalctl -u conversation-ml -f

Service File Location

/etc/systemd/system/conversation-ml.service

GPU Requirements

CUDA-capable GPU with 8GB+ VRAM
CUDA toolkit installed
cuDNN installed

For training:

16GB+ VRAM recommended for LoRA
24GB+ VRAM for larger models

Troubleshooting

Model Not Loading

# Check GPU availability
nvidia-smi

# Check CUDA version
nvcc --version

# Verify model cache
ls -la /opt/conversation-ml/models/

Out of Memory

# Reduce GPU memory utilization
export GPU_MEMORY_UTILIZATION=0.6

# Or use smaller quantization
# Use Q4_K_M instead of Q8_0

Redis Connection Failed

# Test Redis connectivity
redis-cli -h 0.1984.nasty.sh -p 6379 -a <password> ping

# Check VPN connection
ip addr | grep -E '10\.(8|9)\.'

Training Job Stuck

# Check job status
curl http://localhost:8100/training/status/<job_id>

# View service logs
sudo journalctl -u conversation-ml -n 100

# Cancel stuck job
curl -X POST http://localhost:8100/training/cancel/<job_id>

Directory Structure

ml-service/
├── src/
│   ├── main.py           # FastAPI application
│   ├── config.py         # Settings (pydantic-settings)
│   ├── llm.py            # LLM manager (model loading/inference)
│   ├── trainer.py        # LoRA trainer (QLoRA fine-tuning)
│   ├── gguf_converter.py # HuggingFace → GGUF conversion
│   ├── redis_client.py   # Redis caching and job queue
│   ├── models.py         # Pydantic request/response models
│   └── logging_config.py # Structured logging
├── tests/
│   ├── conftest.py       # Pytest fixtures
│   ├── test_llm.py
│   ├── test_training.py
│   ├── test_redis_client.py
│   ├── test_config.py
│   ├── test_api.py
│   └── test_integration.py
├── .env.example          # Environment template
├── pyproject.toml        # Python package config
├── requirements.txt      # Dependencies
└── conversation-ml.service # Systemd unit file

Dependencies

Core:

fastapi - Web framework
uvicorn - ASGI server
llama-cpp-python - GGUF inference
redis + hiredis - Caching and job queue
structlog - Logging

Training:

transformers - Model loading
peft - LoRA adapters
trl - Training utilities
bitsandbytes - Quantization
accelerate - Device placement and mixed-precision training
datasets - Data handling

Internal:

lilith-model-loader - GGUF model management
lilith-ml-service-base - FastAPI utilities