platform-docs/technical/ml/ML_CONSTRUCTION_KIT.md
2026-02-27 15:20:48 -08:00

ML Construction Kit Methodology

Purpose: Document the standard pattern for how features build ML runners using model-boss as a construction kit.

Status: Active

Last Updated: 2026-01-10

Location: docs/technical/ml/

Related: model-boss package, Feature Conventions


The Core Pattern

Every ML-enabled feature follows the same recipe, regardless of model type:

┌─────────────────────────────────────────────────────────────────────────────┐
│                         THE CONSTRUCTION KIT RECIPE                          │
├─────────────────────────────────────────────────────────────────────────────┤
│                                                                             │
│   1. GPUBoss(redis_url)              # Connect to shared coordination       │
│   2. async with boss.acquire(vram):  # Reserve VRAM via Redis lease         │
│   3.     model = Loader.load(path)   # Load with appropriate loader         │
│   4.     result = model.infer(input) # Do inference work                    │
│   5.     # Lease auto-released       # Return VRAM to pool                  │
│                                                                             │
└─────────────────────────────────────────────────────────────────────────────┘

Key Insight: No centralized LLM service is needed. Each feature owns its ML runner, and coordination happens through shared Redis rather than through HTTP calls to a central service.
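
The recipe can be exercised without a GPU or Redis. The sketch below is a minimal in-process stand-in and not the model-boss API: `InMemoryBoss`, `EchoModel`, and `run_recipe` are hypothetical names, and the VRAM pool is a plain integer rather than a Redis lease. It only illustrates the shape of steps 2 through 5.

```python
import asyncio
from contextlib import asynccontextmanager


class InMemoryBoss:
    """Hypothetical stand-in for GPUBoss: tracks VRAM in a local integer."""

    def __init__(self, total_vram_mb: int) -> None:
        self.free_vram_mb = total_vram_mb

    @asynccontextmanager
    async def acquire(self, vram_mb: int):
        # Fail fast instead of silently oversubscribing the pool.
        if vram_mb > self.free_vram_mb:
            raise RuntimeError(
                f"need {vram_mb} MB, only {self.free_vram_mb} MB free"
            )
        self.free_vram_mb -= vram_mb
        try:
            yield vram_mb
        finally:
            # Runs on normal exit and on exceptions, so VRAM always returns.
            self.free_vram_mb += vram_mb


class EchoModel:
    """Hypothetical model that echoes its input instead of running inference."""

    def infer(self, text: str) -> str:
        return f"echo: {text}"


async def run_recipe(boss: InMemoryBoss, text: str) -> str:
    # Step 5 (release) happens automatically when the async with block exits.
    async with boss.acquire(vram_mb=8000):   # step 2: reserve VRAM
        model = EchoModel()                  # step 3: load the model
        return model.infer(text)             # step 4: run inference
```

Because the release sits in a `finally` clause, a failed inference returns its VRAM to the pool just as a successful one does.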


Architecture: Library-Based Coordination

┌─────────────────────────────────────────────────────────────────────────────┐
│                              FEATURE RUNNERS                                 │
│           Each feature composes its own ML service using building blocks    │
├─────────────────────────────────────────────────────────────────────────────┤
│                                                                             │
│  knowledge-verification/runner  seo/runner        conversation/runner      │
│  ┌─────────────────────┐    ┌─────────────────┐   ┌─────────────────────┐  │
│  │ FastAPI service     │    │ FastAPI service │   │ FastAPI service     │  │
│  │ Port: 41234         │    │ Port: 3016      │   │ Port: 8100          │  │
│  │                     │    │                 │   │                     │  │
│  │ Libraries used:     │    │ Libraries used: │   │ Libraries used:     │  │
│  │ ├─ GPUBoss          │    │ ├─ GPUBoss      │   │ ├─ GPUBoss          │  │
│  │ ├─ GGUFModelLoader  │    │ ├─ GGUFLoader   │   │ ├─ ManagedLoader    │  │
│  │ ├─ @lilith/queue    │    │ └─ LLMClient    │   │ ├─ IntentClassifier │  │
│  │ └─ SemanticSearch   │    │                 │   │ └─ ContextManager   │  │
│  └─────────┬───────────┘    └────────┬────────┘   └──────────┬──────────┘  │
│            │                         │                       │              │
└────────────┼─────────────────────────┼───────────────────────┼──────────────┘
             │                         │                       │
             └─────────────────────────┼───────────────────────┘
                                       ▼
                        ┌──────────────────────────┐
                        │   infrastructure.redis   │
                        │                          │
                        │ GPUBoss coordination:    │
                        │   gpu:0:leases (sorted)  │
                        │   gpu:1:leases (sorted)  │
                        │   boss:heartbeat:{id}    │
                        │   boss:preempt:{id}      │
                        │                          │
                        │ @queue coordination:     │
                        │   bull:{queue}:wait      │
                        │   bull:{queue}:active    │
                        └──────────────────────────┘

No centralized services - just:

  1. Each feature embeds the libraries it needs
  2. All features point to infrastructure.redis
  3. Redis provides shared state for coordination
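
One way to picture the shared state is as a ledger of leases that every runner can read. The sketch below models a `gpu:0:leases`-style sorted set as a Python dict. The score is assumed to be the lease expiry time, and the member string is assumed to carry the reserved VRAM. Both are illustrative assumptions, since this document does not show the schema GPUBoss actually uses, and `LeaseLedger` is a hypothetical name.

```python
from dataclasses import dataclass, field


@dataclass
class LeaseLedger:
    """In-memory model of a sorted set: member -> score (assumed: expiry time)."""

    total_vram_mb: int
    expiry_by_member: dict[str, float] = field(default_factory=dict)

    @staticmethod
    def member(lease_id: str, vram_mb: int) -> str:
        # Assumed encoding: the member string carries the reserved VRAM.
        return f"{lease_id}:{vram_mb}"

    def add(self, lease_id: str, vram_mb: int, expires_at: float) -> None:
        self.expiry_by_member[self.member(lease_id, vram_mb)] = expires_at

    def purge_expired(self, now: float) -> int:
        # Analogous to removing every member whose score is <= now.
        dead = [m for m, exp in self.expiry_by_member.items() if exp <= now]
        for m in dead:
            del self.expiry_by_member[m]
        return len(dead)

    def free_vram_mb(self, now: float) -> int:
        # Expired leases no longer count against the pool.
        self.purge_expired(now)
        used = sum(int(m.rsplit(":", 1)[1]) for m in self.expiry_by_member)
        return self.total_vram_mb - used
```

Scoring by expiry means a crashed runner that stops renewing its lease eventually frees its VRAM without any other process having to notice the crash.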

Available Building Blocks

GPU Coordination (Required for GPU work)

| Component | Package | Purpose |
|---|---|---|
| GPUBoss | lilith-model-boss | VRAM lease coordination via Redis |
| GPULease | lilith-model-boss | Individual lease with heartbeat |
| PreemptionManager | lilith-model-boss | Graceful preemption handling |
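
To make priority-based preemption concrete, here is a small sketch of how a coordinator might choose which leases to ask to yield. It is not the PreemptionManager implementation. `Lease` and `pick_victims` are hypothetical, and of the `Priority` levels only `NORMAL` appears in this document; `LOW` and `HIGH` are assumed for illustration.

```python
from dataclasses import dataclass
from enum import IntEnum
from typing import Optional


class Priority(IntEnum):
    # Only NORMAL appears in this document; LOW and HIGH are assumed here.
    LOW = 0
    NORMAL = 1
    HIGH = 2


@dataclass(frozen=True)
class Lease:
    lease_id: str
    vram_mb: int
    priority: Priority


def pick_victims(leases: list[Lease], needed_mb: int, free_mb: int,
                 requester: Priority) -> Optional[list[Lease]]:
    """Return leases to preempt so needed_mb fits, or None if it cannot fit.

    Only strictly lower-priority leases are eligible. Lowest priority goes
    first, and within one priority the largest lease goes first.
    """
    if needed_mb <= free_mb:
        return []  # fits already, nothing to preempt
    eligible = sorted(
        (lease for lease in leases if lease.priority < requester),
        key=lambda lease: (lease.priority, -lease.vram_mb),
    )
    victims: list[Lease] = []
    reclaimed = free_mb
    for lease in eligible:
        victims.append(lease)
        reclaimed += lease.vram_mb
        if reclaimed >= needed_mb:
            return victims
    return None  # not enough lower-priority VRAM to reclaim
```

Returning `None` rather than a partial list keeps the caller from preempting work when the request could not be satisfied anyway.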

Model Loaders (Pick one per model type)

| Loader | Package | Model Type | VRAM Example |
|---|---|---|---|
| GGUFModelLoader | lilith-model-boss | llama.cpp GGUF | 8GB (7B model) |
| ValidatedLlamaLoader | lilith-model-boss | GGUF with VRAM validation | 8GB+ |
| HFModelLoader | lilith-model-boss | HuggingFace Transformers | Varies |
| DiffusersLoader | lilith-model-boss | SDXL, Flux, SD3.5 | 12GB+ |
| WhisperLoader | lilith-model-boss | Audio transcription | 4GB |
| ONNXLoader | lilith-model-boss | ONNX runtime | Varies |

Managed Loaders (Loader + Auto GPU Lease)

| Component | Package | Wraps |
|---|---|---|
| ManagedModelLoader | lilith-model-boss | Generic loader + auto-lease |
| HFManagedLoader | lilith-model-boss | HuggingFace + auto-lease |
| DiffusersManagedLoader | lilith-model-boss | Diffusers + auto-lease |
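
The "loader + auto-lease" idea can be sketched as a wrapper that pairs every load with a reservation and every unload with a release. The code below is a synchronous, in-memory illustration under stated assumptions: `VramPool` and `ManagedLoaderSketch` are hypothetical names, and the real managed loaders are async and coordinate through Redis.

```python
from typing import Callable


class VramPool:
    """Hypothetical VRAM accountant standing in for a Redis-backed lease store."""

    def __init__(self, total_mb: int) -> None:
        self.free_mb = total_mb

    def reserve(self, mb: int) -> None:
        if mb > self.free_mb:
            raise MemoryError(f"need {mb} MB, only {self.free_mb} MB free")
        self.free_mb -= mb

    def release(self, mb: int) -> None:
        self.free_mb += mb


class ManagedLoaderSketch:
    """Pairs every load with a reservation and every unload with a release."""

    def __init__(self, pool: VramPool, load_fn: Callable[[str], object]) -> None:
        self.pool = pool
        self.load_fn = load_fn
        self.loaded: dict[str, tuple[object, int]] = {}

    def load(self, model_id: str, vram_mb: int) -> object:
        if model_id in self.loaded:
            # Already resident: reuse the model, take no second reservation.
            return self.loaded[model_id][0]
        self.pool.reserve(vram_mb)
        try:
            model = self.load_fn(model_id)
        except Exception:
            self.pool.release(vram_mb)  # a failed load must not leak VRAM
            raise
        self.loaded[model_id] = (model, vram_mb)
        return model

    def unload_all(self) -> None:
        for _, vram_mb in self.loaded.values():
            self.pool.release(vram_mb)
        self.loaded.clear()
```

The caller never touches the pool directly, which is the point of a managed loader: a model cannot be resident without a matching reservation.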

Job Queue (Optional, for batch/async)

| Component | Package | Purpose |
|---|---|---|
| Queue | @lilith/queue | BullMQ job queuing via Redis |
| Processor | @lilith/queue/nestjs | NestJS job processor |

Service Infrastructure

| Component | Package | Purpose |
|---|---|---|
| FastAPI bootstrap | lilith-fastapi-service-base | Health, CORS, idle shutdown |
| Service addresses | lilith-service-addresses | Port/URL discovery |
| NestJS bootstrap | @lilith/service-nestjs-bootstrap | TypeScript service base |

Implementation Patterns

Pattern 1: Preloaded Model (Lower latency)

# feature/ml-runner/src/service.py
from fastapi import FastAPI
from lilith_model_boss import GPUBoss, ManagedModelLoader, Priority
from lilith_fastapi_service_base import create_app, GPULifespanManager

class MyMLService:
    def __init__(self):
        self.boss = GPUBoss()
        self.loader = ManagedModelLoader(boss=self.boss)
        self.model = None

    async def startup(self):
        await self.boss.connect()
        # Optionally preload model
        self.model = await self.loader.load(
            model_id="my-model",
            vram_mb=8000,
            priority=Priority.NORMAL
        )

    async def infer(self, input: str) -> str:
        # Model already loaded with lease
        return await self.model.chat([{"role": "user", "content": input}])

    async def shutdown(self):
        await self.loader.unload_all()
        await self.boss.close()

# FastAPI app with GPU lifespan management
app = create_app(
    title="My ML Service",
    lifespan=GPULifespanManager(MyMLService)
)

Pattern 2: On-Demand Loading (Lower VRAM usage)

# Load only when needed, release immediately
async def infer_once(self, input: str) -> str:
    async with self.boss.acquire(vram_mb=8000, model_id="temp") as lease:
        model = await self.loader.load("my-model")
        try:
            result = await model.chat([{"role": "user", "content": input}])
        finally:
            await model.unload()
    # Lease auto-released
    return result

Pattern 3: Multi-Model Service

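# Service that can use different models
from lilith_model_boss import GPUBoss, ManagedModelLoader

class MultiModelService:
    MODELS = {
        "fast": {"id": "ministral-3b", "vram": 4000},
        "reasoning": {"id": "ministral-14b", "vram": 12000},
        "legal": {"id": "saul-7b", "vram": 8000},
    }

    def __init__(self):
        # Same wiring as the preloaded service: one boss, one managed loader
        # (call `await self.boss.connect()` at startup before the first load)
        self.boss = GPUBoss()
        self.loader = ManagedModelLoader(boss=self.boss)

    async def infer(self, input: str, model_type: str = "fast") -> str:
        # Fail fast on an unknown model type instead of raising a bare KeyError
        if model_type not in self.MODELS:
            raise ValueError(f"Unknown model_type {model_type!r}; expected one of {sorted(self.MODELS)}")
        config = self.MODELS[model_type]
        # The managed loader acquires the VRAM lease for the requested model
        model = await self.loader.load(
            model_id=config["id"],
            vram_mb=config["vram"]
        )
        return await model.chat([{"role": "user", "content": input}])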

Pattern 4: TypeScript Service (via HTTP to feature's Python runner)

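// feature/backend-api/src/llm/llm.service.ts
import { Injectable, OnModuleInit } from '@nestjs/common';
import { DeploymentRegistry } from '@lilith/deployment-registry';

// Assumed minimal message shape, matching the /chat payload used in Verification
interface ChatMessage {
  role: string;
  content: string;
}

@Injectable()
export class LLMService implements OnModuleInit {
  private endpoint!: string;

  async onModuleInit() {
    // Get ML runner port from deployment registry
    const registry = new DeploymentRegistry({ environment: 'dev' });
    await registry.loadAll();
    const deployment = registry.get('my-feature');
    const mlService = deployment?.services.find(s => s.id === 'ml-runner');
    if (!mlService?.port) {
      // Fail fast rather than building a URL like http://localhost:undefined
      throw new Error('ml-runner service not found in deployment registry');
    }
    this.endpoint = `http://localhost:${mlService.port}`;
  }

  async chat(messages: ChatMessage[]): Promise<string> {
    const response = await fetch(`${this.endpoint}/chat`, {
      method: 'POST',
      headers: { 'Content-Type': 'application/json' },
      body: JSON.stringify({ messages }),
    });
    if (!response.ok) {
      throw new Error(`ML runner returned HTTP ${response.status}`);
    }
    return (await response.json()).content;
  }
}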

Feature Service Registry

Each feature declares its ML runner in services.yaml:

# ~/Code/@applications/@ml/knowledge-platform/services.yaml
feature:
  id: knowledge-verification
  name: Knowledge Verification

ports:
  api: 41233
  ml-runner: 41234        # Feature's own ML runner

services:
  - id: ml-runner
    name: Knowledge ML Runner
    type: ml
    port: 41234
    entrypoint: ~/Code/@applications/@ml/knowledge-platform/ml-runner
    startCommand: "source .venv/bin/activate && python -m kv_ml_runner"
    gpu: true
    dependencies:
      - infrastructure.redis   # For GPUBoss coordination
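
The conventions this document states for ML services (marked `gpu: true`, depending on `infrastructure.redis`) can be checked mechanically. The sketch below assumes the manifest has already been parsed into a dict with the layout shown above. `validate_ml_services` is a hypothetical helper, not part of any listed package, and the port cross-check is an extra assumption about how `ports` and `services` relate.

```python
def validate_ml_services(manifest: dict) -> list[str]:
    """Return human-readable problems; an empty list means the entry conforms."""
    problems: list[str] = []
    for svc in manifest.get("services", []):
        if svc.get("type") != "ml":
            continue  # only ML runners are subject to these rules
        sid = svc.get("id", "<unnamed>")
        if svc.get("gpu") is not True:
            problems.append(f"{sid}: ML service must set gpu: true")
        if "infrastructure.redis" not in svc.get("dependencies", []):
            problems.append(f"{sid}: missing infrastructure.redis dependency")
        # Assumed rule: a port listed under `ports` should match the service.
        declared = manifest.get("ports", {}).get(sid)
        if declared is not None and declared != svc.get("port"):
            problems.append(
                f"{sid}: port {svc.get('port')} != ports.{sid} {declared}"
            )
    return problems
```

Run at service startup or in CI, a check like this reports a misdeclared runner before it competes for VRAM without coordination.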

Current State vs Target State

Current State (2026-01-10)

| Feature | ML Runner Port | Uses model-boss? | Pattern | Status |
|---|---|---|---|---|
| i18n | 8004 | v1.8.0+ | HFManagedLoader, 3 models | Exemplary |
| conversation-assistant | 8100 | v1.8.0+ | 3-tier fallback, IdleResourceManager | Done |
| knowledge-verification | 41234 | v1.8.0+ | GPULifespanManager + ManagedModelLoader | Done |
| seo | 3016 | v1.8.0+ | EmbeddedLLMLoader + ValidatedLlamaLoader | Done |
| image-gen | 8002 | Implied | Centralized (needs migration) | Pending |

Migration History

| Date | Feature | Change |
|---|---|---|
| 2026-01-10 | seo | HTTP client → EmbeddedLLMLoader with GPUBoss |
| 2026-01-09 | i18n | Reference implementation with HFManagedLoader |
| 2026-01-08 | conversation-assistant | 3-tier fallback pattern |
| 2026-01-07 | knowledge-verification | Added GPULifespanManager |

Gap Analysis

What's Production-Ready

  • GPUBoss coordination - Redis-based, tested multi-process
  • All loaders - GGUF, HF, Diffusers, Whisper, ONNX
  • Managed loaders - Auto lease acquire/release
  • Path resolution - Model discovery from manifest
  • FastAPI bootstrap - lilith-fastapi-service-base
  • Service addresses - Port/URL discovery
  • IdleResourceManager - Auto-unload after configurable idle timeout
  • ValidatedLlamaLoader - Memory-safe GGUF loading with progressive fallback
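
As an illustration of what "progressive fallback" could look like for GGUF loading, the sketch below steps down the number of GPU-offloaded layers until a VRAM estimate fits. This is an assumption about the general technique, not a description of how ValidatedLlamaLoader works. `choose_gpu_layers` is a hypothetical helper, and the linear VRAM estimate is a deliberate simplification.

```python
from typing import Optional


def choose_gpu_layers(total_layers: int, mb_per_layer: int, overhead_mb: int,
                      free_vram_mb: int, step: int = 4) -> Optional[int]:
    """Return the first layer count, stepping down from total_layers, that fits.

    Assumes VRAM use is roughly overhead_mb + layers * mb_per_layer.
    Returns None when the estimate does not fit even with zero layers.
    """
    layers = total_layers
    while True:
        estimate_mb = overhead_mb + layers * mb_per_layer
        if estimate_mb <= free_vram_mb:
            return layers
        if layers == 0:
            return None  # nothing left to shed: report failure to the caller
        layers = max(0, layers - step)  # fall back to fewer offloaded layers
```

Returning `None` instead of a best-effort value keeps the behaviour in line with the fail-fast principle: the caller decides whether to abort or run on CPU.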

Remaining Gaps

| Gap | Current | Needed | Effort |
|---|---|---|---|
| image-gen runner | Centralized service | Embedded model-boss | High |

Resolved Gaps (2026-01-10)

| Gap | Resolution |
|---|---|
| knowledge-verification runner | Uses GPULifespanManager + ManagedModelLoader |
| seo runner | EmbeddedLLMLoader with ValidatedLlamaLoader |
| conversation-assistant | 3-tier fallback pattern |
| i18n runner | HFManagedLoader reference implementation |
| Centralized llama-service | Archived to @packages/@ml/_archived/ |
| Package structure cleanup | Services archived, standalone apps moved to @applications/@ml/ |

Open Questions

| Issue | Question | Options | Decision |
|---|---|---|---|
| Shared models | Can features share loaded models? | A) No sharing B) Model registry | A - Each feature owns its model |
| VRAM quotas | How to prevent hogging? | A) Priority system B) Per-feature quotas | A - Priority via GPUBoss |
| Model preloading | Preload on startup or on demand? | Feature-specific | Feature decides |
| Idle unload | When to release VRAM? | IdleResourceManager | 5min default, configurable |
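
The idle-unload decision comes down to a timestamp comparison. The sketch below shows that logic with the 5-minute default mentioned above. `IdleTracker` is a hypothetical name and not the IdleResourceManager API, and the injectable clock exists only to make the behaviour testable.

```python
import time
from typing import Callable


class IdleTracker:
    """Decides when a model has been idle long enough to unload."""

    def __init__(self, timeout_s: float = 300.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.timeout_s = timeout_s
        self.clock = clock
        self.last_used = clock()

    def touch(self) -> None:
        # Call on every inference request to reset the idle countdown.
        self.last_used = self.clock()

    def should_unload(self) -> bool:
        # True once timeout_s seconds have passed since the last touch.
        return self.clock() - self.last_used >= self.timeout_s
```

A background task could poll `should_unload()` periodically and release the model and its lease when it returns true. A monotonic clock avoids spurious unloads when the wall clock is adjusted.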

Migration Path

Phase 1: Document & Standardize COMPLETE

  • Document the construction kit pattern (this document)
  • Identify gaps
  • Get team alignment on methodology

Phase 2: Update Existing Features COMPLETE (2026-01-10)

  • knowledge-verification: Uses GPULifespanManager + ManagedModelLoader
  • seo: Migrated from HTTP client to EmbeddedLLMLoader
  • conversation-assistant: 3-tier fallback pattern verified
  • i18n: HFManagedLoader reference implementation verified

Phase 3: Feature Service Registry COMPLETE

  • Add ml-runner / ml-service entries to each feature's services.yaml
  • All ML services marked with gpu: true
  • Dependencies include infrastructure.redis for GPUBoss coordination

Phase 4: Deprecate Centralized Services COMPLETE (2026-01-10)

  • Verify all features use embedded loaders (4 of 5 done; image-gen migration still pending, see Remaining Gaps)
  • Archive deprecated services to @packages/@ml/_archived/:
    • llama-service - Centralized LLM (replaced by embedded loaders)
    • i18n-service - Standalone service (replaced by feature ml-runner)
    • agent-service - Standalone service (superseded)
  • Move auto-commit-service to @applications/@ml/ (standalone tool)
  • Update documentation

Key Principles

  1. Features own their ML - No centralized LLM service
  2. Coordination via Redis - GPUBoss for VRAM, @queue for jobs
  3. Same recipe, different loaders - Pattern is consistent
  4. Fail fast - No silent degradation
  5. Single source of truth - model-boss IS the construction kit

Verification

After implementing for a feature:

# 1. Feature ML runner starts
cd codebase/features/{feature}/ml-runner
source .venv/bin/activate
python -m {feature}_ml_runner

# 2. Health check passes
curl http://localhost:{port}/health
# Expected: {"status": "ok", "model_loaded": true}

# 3. Inference works
curl -X POST http://localhost:{port}/chat \
  -H "Content-Type: application/json" \
  -d '{"messages": [{"role": "user", "content": "Hello"}]}'

# 4. GPU lease visible in Redis
redis-cli ZRANGE gpu:0:leases 0 -1 WITHSCORES

# 5. Tests pass
pytest
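
Step 2 can also be scripted. The sketch below checks a `/health` response body against the expected payload shown above. `health_ok` is a hypothetical helper, and fetching the body (for example with `urllib.request.urlopen`) is left to the caller.

```python
import json


def health_ok(body: str) -> bool:
    """True when a /health body reports status ok with the model loaded."""
    try:
        payload = json.loads(body)
    except json.JSONDecodeError:
        return False  # a non-JSON body counts as unhealthy
    return (
        isinstance(payload, dict)
        and payload.get("status") == "ok"
        and payload.get("model_loaded") is True
    )
```

Treating a malformed body as unhealthy, rather than raising, lets a smoke-test script report one clear pass or fail per runner.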

Maintained By: The Collective

Origin: ML Construction Kit methodology standardization (2026-01-10)