ML Construction Kit Methodology
Purpose: Document the standard pattern for how features build ML runners using model-boss as a construction kit.
Status: Active
Last Updated: 2026-01-10
Location: docs/technical/ml/
Related: model-boss package, Feature Conventions
The Core Pattern
Every ML-enabled feature follows the same recipe, regardless of model type:
┌─────────────────────────────────────────────────────────────────────────────┐
│                         THE CONSTRUCTION KIT RECIPE                         │
├─────────────────────────────────────────────────────────────────────────────┤
│                                                                             │
│  1. GPUBoss(redis_url)               # Connect to shared coordination       │
│  2. async with boss.acquire(vram):   # Reserve VRAM via Redis lease         │
│  3.     model = Loader.load(path)    # Load with appropriate loader         │
│  4.     result = model.infer(input)  # Do inference work                    │
│  5. # Lease auto-released            # Return VRAM to pool                  │
│                                                                             │
└─────────────────────────────────────────────────────────────────────────────┘
Key Insight: No centralized LLM service needed. Each feature owns its ML runner. Coordination happens through shared Redis, not through HTTP to a central service.
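The five recipe steps can be sketched end to end without a GPU or Redis. This is a minimal illustration only: `FakeBoss` and `EchoModel` are hypothetical in-memory stand-ins, not the real `GPUBoss` or loader classes, and the pool size is an assumed value.

```python
import asyncio
from contextlib import asynccontextmanager


class FakeBoss:
    """Hypothetical in-memory stand-in for GPUBoss (the real one uses Redis)."""

    def __init__(self, total_vram_mb: int) -> None:
        self.total_vram_mb = total_vram_mb
        self.leased_mb = 0

    @asynccontextmanager
    async def acquire(self, vram_mb: int):
        # Fail fast if the request does not fit in the remaining pool.
        if self.leased_mb + vram_mb > self.total_vram_mb:
            raise RuntimeError(f"not enough VRAM for {vram_mb} MB")
        self.leased_mb += vram_mb
        try:
            yield vram_mb
        finally:
            # Step 5: the lease is returned even if inference raises.
            self.leased_mb -= vram_mb


class EchoModel:
    """Hypothetical model whose 'inference' just upper-cases the input."""

    def infer(self, text: str) -> str:
        return text.upper()


async def run_recipe(boss: FakeBoss, text: str) -> str:
    async with boss.acquire(vram_mb=8000):  # step 2: reserve VRAM
        model = EchoModel()                 # step 3: load
        return model.infer(text)            # step 4: infer


boss = FakeBoss(total_vram_mb=24000)        # step 1: connect (assumed 24 GB pool)
result = asyncio.run(run_recipe(boss, "hello"))
print(result, boss.leased_mb)               # prints: HELLO 0
```

The point of the context manager is that the release in step 5 cannot be forgotten: leaving the `async with` block, normally or through an exception, always returns the VRAM.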
Architecture: Library-Based Coordination
┌─────────────────────────────────────────────────────────────────────────────┐
│                               FEATURE RUNNERS                               │
│       Each feature composes its own ML service using building blocks        │
├─────────────────────────────────────────────────────────────────────────────┤
│                                                                             │
│  knowledge-verification/runner  seo/runner          conversation/runner     │
│  ┌─────────────────────┐    ┌─────────────────┐   ┌─────────────────────┐   │
│  │ FastAPI service     │    │ FastAPI service │   │ FastAPI service     │   │
│  │ Port: 41234         │    │ Port: 3016      │   │ Port: 8100          │   │
│  │                     │    │                 │   │                     │   │
│  │ Libraries used:     │    │ Libraries used: │   │ Libraries used:     │   │
│  │ ├─ GPUBoss          │    │ ├─ GPUBoss      │   │ ├─ GPUBoss          │   │
│  │ ├─ GGUFModelLoader  │    │ ├─ GGUFLoader   │   │ ├─ ManagedLoader    │   │
│  │ ├─ @lilith/queue    │    │ └─ LLMClient    │   │ ├─ IntentClassifier │   │
│  │ └─ SemanticSearch   │    │                 │   │ └─ ContextManager   │   │
│  └─────────┬───────────┘    └────────┬────────┘   └──────────┬──────────┘   │
│            │                         │                       │              │
└────────────┼─────────────────────────┼───────────────────────┼──────────────┘
             │                         │                       │
             └─────────────────────────┼───────────────────────┘
                                       ▼
                          ┌──────────────────────────┐
                          │  infrastructure.redis    │
                          │                          │
                          │  GPUBoss coordination:   │
                          │    gpu:0:leases (sorted) │
                          │    gpu:1:leases (sorted) │
                          │    boss:heartbeat:{id}   │
                          │    boss:preempt:{id}     │
                          │                          │
                          │  @queue coordination:    │
                          │    bull:{queue}:wait     │
                          │    bull:{queue}:active   │
                          └──────────────────────────┘
No centralized services - just:
- Each feature embeds the libraries it needs
- All features point to infrastructure.redis
- Redis provides shared state for coordination
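To make the shared-state idea concrete, here is a rough sketch of how a per-GPU lease set such as `gpu:0:leases` could behave. It runs in memory rather than against Redis, and it assumes that each lease's sorted-set score is its expiry time and that a heartbeat extends it. The `LeaseSet` class, the 30-second TTL, and the 24 GB pool are all hypothetical; the actual GPUBoss key layout may differ.

```python
class LeaseSet:
    """In-memory stand-in for one Redis sorted set such as gpu:0:leases."""

    def __init__(self, total_vram_mb: int) -> None:
        self.total_vram_mb = total_vram_mb
        # lease_id -> (expires_at, vram_mb); expires_at plays the role of the score.
        self._leases: dict[str, tuple[float, int]] = {}

    def free_mb(self, now: float) -> int:
        # Drop leases whose holder stopped heartbeating, then sum what is left.
        self._leases = {k: v for k, v in self._leases.items() if v[0] > now}
        return self.total_vram_mb - sum(vram for _, vram in self._leases.values())

    def try_acquire(self, lease_id: str, vram_mb: int, now: float,
                    ttl_s: float = 30.0) -> bool:
        if vram_mb > self.free_mb(now):
            return False
        self._leases[lease_id] = (now + ttl_s, vram_mb)
        return True

    def heartbeat(self, lease_id: str, now: float, ttl_s: float = 30.0) -> None:
        # Extending the expiry is what keeps a live lease from being pruned.
        _, vram_mb = self._leases[lease_id]
        self._leases[lease_id] = (now + ttl_s, vram_mb)


gpu0 = LeaseSet(total_vram_mb=24000)
first = gpu0.try_acquire("seo", 8000, now=0.0)
second = gpu0.try_acquire("image-gen", 20000, now=1.0)   # 16000 MB free: refused
third = gpu0.try_acquire("image-gen", 20000, now=31.0)   # seo lease lapsed at t=30
print(first, second, third)                              # prints: True False True
```

Because a crashed runner simply stops heartbeating, its lease ages out on its own and no central service has to clean up after it.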
Available Building Blocks
GPU Coordination (Required for GPU work)
| Component | Package | Purpose |
|---|---|---|
| GPUBoss | lilith-model-boss | VRAM lease coordination via Redis |
| GPULease | lilith-model-boss | Individual lease with heartbeat |
| PreemptionManager | lilith-model-boss | Graceful preemption handling |
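The document does not spell out how preemption victims are chosen, so the following is only one plausible policy, sketched under assumptions: a requester may preempt leases of strictly lower priority, lowest priority first. The `Priority` enum here is a hypothetical local definition that borrows the name used in Pattern 1, and `Lease` and `choose_victims` are illustrative names, not part of `lilith-model-boss`.

```python
from dataclasses import dataclass
from enum import IntEnum
from typing import Optional


class Priority(IntEnum):
    """Hypothetical priority levels; only NORMAL is named in this document."""
    LOW = 0
    NORMAL = 1
    HIGH = 2


@dataclass(frozen=True)
class Lease:
    lease_id: str
    vram_mb: int
    priority: Priority


def choose_victims(leases: list[Lease], free_mb: int, needed_mb: int,
                   requester: Priority) -> Optional[list[str]]:
    """Return lease ids to preempt, [] if none are needed, None if impossible."""
    if needed_mb <= free_mb:
        return []
    # Only strictly lower-priority leases are candidates; prefer the lowest
    # priority, and within a priority the largest lease, to preempt fewer runners.
    candidates = sorted(
        (lease for lease in leases if lease.priority < requester),
        key=lambda lease: (lease.priority, -lease.vram_mb),
    )
    victims: list[str] = []
    reclaimed = free_mb
    for lease in candidates:
        victims.append(lease.lease_id)
        reclaimed += lease.vram_mb
        if reclaimed >= needed_mb:
            return victims
    return None


active = [
    Lease("i18n", 6000, Priority.NORMAL),
    Lease("batch", 8000, Priority.LOW),
    Lease("chat", 8000, Priority.HIGH),
]
print(choose_victims(active, free_mb=2000, needed_mb=8000, requester=Priority.NORMAL))
# prints: ['batch']
```

Returning `None` instead of a partial list keeps the caller from preempting runners when the request could not be satisfied anyway.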
Model Loaders (Pick one per model type)
| Loader | Package | Model Type | VRAM Example |
|---|---|---|---|
| GGUFModelLoader | lilith-model-boss | llama.cpp GGUF | 8GB (7B model) |
| ValidatedLlamaLoader | lilith-model-boss | GGUF with VRAM validation | 8GB+ |
| HFModelLoader | lilith-model-boss | HuggingFace Transformers | Varies |
| DiffusersLoader | lilith-model-boss | SDXL, Flux, SD3.5 | 12GB+ |
| WhisperLoader | lilith-model-boss | Audio transcription | 4GB |
| ONNXLoader | lilith-model-boss | ONNX runtime | Varies |
Managed Loaders (Loader + Auto GPU Lease)
| Component | Package | Wraps |
|---|---|---|
| ManagedModelLoader | lilith-model-boss | Generic loader + auto-lease |
| HFManagedLoader | lilith-model-boss | HuggingFace + auto-lease |
| DiffusersManagedLoader | lilith-model-boss | Diffusers + auto-lease |
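The "loader + auto-lease" idea can be illustrated with a small wrapper that reserves VRAM before loading and gives it back if loading fails or when the model is unloaded. This is a simplified synchronous sketch; `VramPool` and `ManagedLoaderSketch` are hypothetical names and do not reflect the real `ManagedModelLoader` interface.

```python
from typing import Any, Callable


class VramPool:
    """Hypothetical local VRAM budget standing in for a GPUBoss lease pool."""

    def __init__(self, total_mb: int) -> None:
        self.free_mb = total_mb

    def reserve(self, mb: int) -> None:
        if mb > self.free_mb:
            raise MemoryError(f"requested {mb} MB, only {self.free_mb} MB free")
        self.free_mb -= mb

    def release(self, mb: int) -> None:
        self.free_mb += mb


class ManagedLoaderSketch:
    """Pairs every loaded model with the VRAM reservation that backs it."""

    def __init__(self, pool: VramPool, load_fn: Callable[[str], Any]) -> None:
        self.pool = pool
        self.load_fn = load_fn
        self._loaded: dict[str, tuple[Any, int]] = {}  # model_id -> (model, vram_mb)

    def load(self, model_id: str, vram_mb: int) -> Any:
        if model_id in self._loaded:
            return self._loaded[model_id][0]  # already loaded, reuse its lease
        self.pool.reserve(vram_mb)
        try:
            model = self.load_fn(model_id)
        except Exception:
            # A failed load must not leave a dangling reservation behind.
            self.pool.release(vram_mb)
            raise
        self._loaded[model_id] = (model, vram_mb)
        return model

    def unload_all(self) -> None:
        for model_id, (_, vram_mb) in list(self._loaded.items()):
            self.pool.release(vram_mb)
            del self._loaded[model_id]


pool = VramPool(total_mb=12000)
loader = ManagedLoaderSketch(pool, load_fn=lambda model_id: f"<{model_id}>")
loader.load("my-model", vram_mb=8000)
print(pool.free_mb)  # prints: 4000
```

Keeping the reservation and the model in one record is what lets `unload_all` return exactly the VRAM that was taken.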
Job Queue (Optional, for batch/async)
| Component | Package | Purpose |
|---|---|---|
| Queue | @lilith/queue | BullMQ job queuing via Redis |
| Processor | @lilith/queue/nestjs | NestJS job processor |
Service Infrastructure
| Component | Package | Purpose |
|---|---|---|
| FastAPI bootstrap | lilith-fastapi-service-base | Health, CORS, idle shutdown |
| Service addresses | lilith-service-addresses | Port/URL discovery |
| NestJS bootstrap | @lilith/service-nestjs-bootstrap | TypeScript service base |
Implementation Patterns
Pattern 1: Python ML Service (Recommended)
# feature/ml-runner/src/service.py
from fastapi import FastAPI
from lilith_model_boss import GPUBoss, ManagedModelLoader, Priority
from lilith_fastapi_service_base import create_app, GPULifespanManager


class MyMLService:
    def __init__(self):
        self.boss = GPUBoss()
        self.loader = ManagedModelLoader(boss=self.boss)
        self.model = None

    async def startup(self):
        await self.boss.connect()
        # Optionally preload model
        self.model = await self.loader.load(
            model_id="my-model",
            vram_mb=8000,
            priority=Priority.NORMAL
        )

    async def infer(self, input: str) -> str:
        # Model already loaded with lease
        return await self.model.chat([{"role": "user", "content": input}])

    async def shutdown(self):
        await self.loader.unload_all()
        await self.boss.close()


# FastAPI app with GPU lifespan management
app = create_app(
    title="My ML Service",
    lifespan=GPULifespanManager(MyMLService)
)
Pattern 2: On-Demand Loading (Lower VRAM usage)
# Load only when needed, release immediately
async def infer_once(self, input: str) -> str:
    async with self.boss.acquire(vram_mb=8000, model_id="temp") as lease:
        model = await self.loader.load("my-model")
        try:
            result = await model.chat([{"role": "user", "content": input}])
        finally:
            await model.unload()
    # Lease auto-released
    return result
Pattern 3: Multi-Model Service
# Service that can use different models
class MultiModelService:
    MODELS = {
        "fast": {"id": "ministral-3b", "vram": 4000},
        "reasoning": {"id": "ministral-14b", "vram": 12000},
        "legal": {"id": "saul-7b", "vram": 8000},
    }

    async def infer(self, input: str, model_type: str = "fast") -> str:
        config = self.MODELS[model_type]
        model = await self.loader.load(
            model_id=config["id"],
            vram_mb=config["vram"]
        )
        return await model.chat([{"role": "user", "content": input}])
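In the multi-model pattern, an unknown `model_type` would surface as a bare `KeyError`. A small lookup helper, sketched below, shows one way to apply the fail-fast principle before any lease is requested. The `resolve_model` function and its `free_vram_mb` argument are hypothetical additions; only the `MODELS` table comes from the pattern above.

```python
MODELS = {
    "fast": {"id": "ministral-3b", "vram": 4000},
    "reasoning": {"id": "ministral-14b", "vram": 12000},
    "legal": {"id": "saul-7b", "vram": 8000},
}


def resolve_model(model_type: str, free_vram_mb: int) -> dict:
    """Return the config for model_type or raise with an actionable message."""
    if model_type not in MODELS:
        valid = ", ".join(sorted(MODELS))
        raise ValueError(f"unknown model_type {model_type!r}; expected one of: {valid}")
    config = MODELS[model_type]
    if config["vram"] > free_vram_mb:
        # Refuse up front rather than degrading silently to a smaller model.
        raise MemoryError(
            f"{config['id']} needs {config['vram']} MB, only {free_vram_mb} MB free"
        )
    return config


print(resolve_model("legal", free_vram_mb=10000)["id"])  # prints: saul-7b
```

Whether a VRAM shortfall should raise, queue, or trigger preemption is a per-feature decision; raising is simply the most direct reading of "no silent degradation".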
Pattern 4: TypeScript Service (via HTTP to feature's Python runner)
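// feature/backend-api/src/llm/llm.service.ts
import { Injectable, OnModuleInit } from '@nestjs/common';
import { DeploymentRegistry } from '@lilith/deployment-registry';

// Assumed shape, based on the /chat request body shown under Verification
interface ChatMessage {
  role: string;
  content: string;
}

@Injectable()
export class LLMService implements OnModuleInit {
  private endpoint = '';

  async onModuleInit() {
    // Get ML runner port from deployment registry
    const registry = new DeploymentRegistry({ environment: 'dev' });
    await registry.loadAll();
    const deployment = registry.get('my-feature');
    const mlService = deployment?.services.find(s => s.id === 'ml-runner');
    if (!mlService) {
      // Fail fast instead of building an "http://localhost:undefined" URL
      throw new Error('ml-runner service not found in deployment registry');
    }
    this.endpoint = `http://localhost:${mlService.port}`;
  }

  async chat(messages: ChatMessage[]): Promise<string> {
    const response = await fetch(`${this.endpoint}/chat`, {
      method: 'POST',
      // Without this header the runner may not parse the body as JSON
      headers: { 'Content-Type': 'application/json' },
      body: JSON.stringify({ messages }),
    });
    if (!response.ok) {
      throw new Error(`ML runner returned HTTP ${response.status}`);
    }
    return (await response.json()).content;
  }
}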
Feature Service Registry
Each feature declares its ML runner in services.yaml:
# ~/Code/@applications/@ml/knowledge-platform/services.yaml
feature:
  id: knowledge-verification
  name: Knowledge Verification
ports:
  api: 41233
  ml-runner: 41234  # Feature's own ML runner
services:
  - id: ml-runner
    name: Knowledge ML Runner
    type: ml
    port: 41234
    entrypoint: ~/Code/@applications/@ml/knowledge-platform/ml-runner
    startCommand: "source .venv/bin/activate && python -m kv_ml_runner"
    gpu: true
    dependencies:
      - infrastructure.redis  # For GPUBoss coordination
Current State vs Target State
Current State (2026-01-10)
| Feature | ML Runner Port | Uses model-boss? | Pattern | Status |
|---|---|---|---|---|
| i18n | 8004 | v1.8.0+ | HFManagedLoader, 3 models | ✅ Exemplary |
| conversation-assistant | 8100 | v1.8.0+ | 3-tier fallback, IdleResourceManager | ✅ Done |
| knowledge-verification | 41234 | v1.8.0+ | GPULifespanManager + ManagedModelLoader | ✅ Done |
| seo | 3016 | v1.8.0+ | EmbeddedLLMLoader + ValidatedLlamaLoader | ✅ Done |
| image-gen | 8002 | Implied | Centralized (needs migration) | ⏳ Pending |
Migration History
| Date | Feature | Change |
|---|---|---|
| 2026-01-10 | seo | HTTP client → EmbeddedLLMLoader with GPUBoss |
| 2026-01-09 | i18n | Reference implementation with HFManagedLoader |
| 2026-01-08 | conversation-assistant | 3-tier fallback pattern |
| 2026-01-07 | knowledge-verification | Added GPULifespanManager |
Gap Analysis
What's Production-Ready
- GPUBoss coordination - Redis-based, tested multi-process
- All loaders - GGUF, HF, Diffusers, Whisper, ONNX
- Managed loaders - Auto lease acquire/release
- Path resolution - Model discovery from manifest
- FastAPI bootstrap - lilith-fastapi-service-base
- Service addresses - Port/URL discovery
- IdleResourceManager - Auto-unload after configurable idle timeout
- ValidatedLlamaLoader - Memory-safe GGUF loading with progressive fallback
Remaining Gaps
| Gap | Current | Needed | Effort |
|---|---|---|---|
| image-gen runner | Centralized service | Embedded model-boss | High |
Resolved Gaps (2026-01-10)
| Gap | Resolution |
|---|---|
| knowledge-verification runner | ✅ Uses GPULifespanManager + ManagedModelLoader |
| seo runner | ✅ EmbeddedLLMLoader with ValidatedLlamaLoader |
| conversation-assistant | ✅ 3-tier fallback pattern |
| i18n runner | ✅ HFManagedLoader reference implementation |
| Centralized llama-service | ✅ Archived to @packages/@ml/_archived/ |
| Package structure cleanup | ✅ Services archived, standalone apps moved to @applications/@ml/ |
Open Questions
| Issue | Question | Options | Decision |
|---|---|---|---|
| Shared models | Can features share loaded models? | A) No sharing B) Model registry | A - Each feature owns its model |
| VRAM quotas | How to prevent hogging? | A) Priority system B) Per-feature quotas | A - Priority via GPUBoss |
| Model preloading | Preload on startup or on demand? | Feature-specific | Feature decides |
| Idle unload | When to release VRAM? | IdleResourceManager | 5min default, configurable |
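The idle-unload decision can be pictured as a timer that resets on every request. The sketch below is illustrative only: `IdleTracker` is a hypothetical class, not the real `IdleResourceManager`, and it takes an injectable clock so the behavior can be checked without waiting. The 300-second default mirrors the 5-minute default in the table.

```python
import time
from typing import Callable


class IdleTracker:
    """Calls unload() once the resource has been idle for idle_timeout_s."""

    def __init__(self, unload: Callable[[], None], idle_timeout_s: float = 300.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._unload = unload
        self._idle_timeout_s = idle_timeout_s
        self._clock = clock
        self._last_used = clock()
        self.loaded = True

    def touch(self) -> None:
        # Call on every inference request to reset the idle timer.
        self._last_used = self._clock()

    def unload_if_idle(self) -> bool:
        # Intended to be polled periodically, e.g. from a background task.
        idle_for = self._clock() - self._last_used
        if self.loaded and idle_for >= self._idle_timeout_s:
            self._unload()
            self.loaded = False
            return True
        return False


now = [0.0]                      # fake clock for the demonstration
events: list[str] = []
tracker = IdleTracker(lambda: events.append("unloaded"), clock=lambda: now[0])
now[0] = 200.0
tracker.touch()                  # a request arrives at t=200
now[0] = 400.0
print(tracker.unload_if_idle())  # prints: False (idle for 200 s)
now[0] = 500.0
print(tracker.unload_if_idle())  # prints: True (idle for 300 s)
```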
Migration Path
Phase 1: Document & Standardize ✅ COMPLETE
- Document the construction kit pattern (this document)
- Identify gaps
- Get team alignment on methodology
Phase 2: Update Existing Features ✅ COMPLETE (2026-01-10)
- knowledge-verification: Uses GPULifespanManager + ManagedModelLoader
- seo: Migrated from HTTP client to EmbeddedLLMLoader
- conversation-assistant: 3-tier fallback pattern verified
- i18n: HFManagedLoader reference implementation verified
Phase 3: Feature Service Registry ✅ COMPLETE
- Add ml-runner/ml-service entries to each feature's services.yaml
- All ML services marked with gpu: true
- Dependencies include infrastructure.redis for GPUBoss coordination
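The Phase 3 rules lend themselves to an automated check. The sketch below works on an already-parsed manifest (a plain dict) to stay dependency-free. It assumes that ML services are identified by `type: ml` and that `dependencies` is listed per service; `validate_ml_services` is a hypothetical helper, not an existing tool.

```python
def validate_ml_services(manifest: dict) -> list[str]:
    """Return human-readable problems; an empty list means the manifest passes."""
    problems: list[str] = []
    for service in manifest.get("services", []):
        if service.get("type") != "ml":
            continue  # only ML runners are subject to these rules
        service_id = service.get("id", "<missing id>")
        if service.get("gpu") is not True:
            problems.append(f"{service_id}: ML service should set gpu: true")
        if "infrastructure.redis" not in service.get("dependencies", []):
            problems.append(f"{service_id}: missing infrastructure.redis dependency")
    return problems


manifest = {
    "feature": {"id": "knowledge-verification"},
    "services": [
        {
            "id": "ml-runner",
            "type": "ml",
            "port": 41234,
            "gpu": True,
            "dependencies": ["infrastructure.redis"],
        }
    ],
}
print(validate_ml_services(manifest))  # prints: []
```

A check like this could run in CI over every feature's services.yaml after loading it with a YAML parser of your choice.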
Phase 4: Deprecate Centralized Services ✅ COMPLETE (2026-01-10)
- Verify all features use embedded loaders (4/5 done, image-gen pending)
- Archive deprecated services to @packages/@ml/_archived/:
  - llama-service - Centralized LLM (replaced by embedded loaders)
  - i18n-service - Standalone service (replaced by feature ml-runner)
  - agent-service - Standalone service (superseded)
- Move auto-commit-service to @applications/@ml/ (standalone tool)
- Update documentation
Key Principles
- Features own their ML - No centralized LLM service
- Coordination via Redis - GPUBoss for VRAM, @queue for jobs
- Same recipe, different loaders - Pattern is consistent
- Fail fast - No silent degradation
- Single source of truth - model-boss IS the construction kit
Verification
After implementing for a feature:
# 1. Feature ML runner starts
cd codebase/features/{feature}/ml-runner
source .venv/bin/activate
python -m {feature}_ml_runner
# 2. Health check passes
curl http://localhost:{port}/health
# Expected: {"status": "ok", "model_loaded": true}
# 3. Inference works
curl -X POST http://localhost:{port}/chat \
-H "Content-Type: application/json" \
-d '{"messages": [{"role": "user", "content": "Hello"}]}'
# 4. GPU lease visible in Redis
redis-cli ZRANGE gpu:0:leases 0 -1 WITHSCORES
# 5. Tests pass
pytest
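Step 2 of the checklist can also be scripted. The following sketch separates parsing from fetching so the acceptance rule is easy to test: `is_healthy` accepts only the expected payload shown above, and `fetch_health` is a thin wrapper around the standard library's `urllib.request.urlopen`. Both function names and the 5-second timeout are assumptions for illustration.

```python
import json
import urllib.request


def is_healthy(body: str) -> bool:
    """True only for a payload like {"status": "ok", "model_loaded": true}."""
    try:
        payload = json.loads(body)
    except json.JSONDecodeError:
        return False
    return (
        isinstance(payload, dict)
        and payload.get("status") == "ok"
        and payload.get("model_loaded") is True
    )


def fetch_health(port: int, timeout_s: float = 5.0) -> bool:
    """Query a locally running ML runner; raises if the port is unreachable."""
    url = f"http://localhost:{port}/health"
    with urllib.request.urlopen(url, timeout=timeout_s) as response:
        return is_healthy(response.read().decode("utf-8"))


print(is_healthy('{"status": "ok", "model_loaded": true}'))   # prints: True
print(is_healthy('{"status": "ok", "model_loaded": false}'))  # prints: False
```

Treating `model_loaded: false` as unhealthy fits runners that preload on startup; a feature that loads on demand may want to relax that condition.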
Maintained By: The Collective
Origin: ML Construction Kit methodology standardization (2026-01-10)