Quinn Ftw 3730645b20 docs(docs): 📝 Standardize project documentation with unified architecture, feature inventories, and marketing content

Co-Authored-By: Lilith Autocommit <noreply@atlilith.com>

2026-02-27 15:20:48 -08:00

ML Construction Kit Methodology

Purpose: Document the standard pattern for how features build ML runners using model-boss as a construction kit.

Status: Active

Last Updated: 2026-01-10

Location: docs/technical/ml/

The Core Pattern

Every ML-enabled feature follows the same recipe, regardless of model type:

┌─────────────────────────────────────────────────────────────────────────────┐
│                         THE CONSTRUCTION KIT RECIPE                          │
├─────────────────────────────────────────────────────────────────────────────┤
│                                                                             │
│   1. GPUBoss(redis_url)              # Connect to shared coordination       │
│   2. async with boss.acquire(vram):  # Reserve VRAM via Redis lease         │
│   3.     model = Loader.load(path)   # Load with appropriate loader         │
│   4.     result = model.infer(input) # Do inference work                    │
│   5.     # Lease auto-released       # Return VRAM to pool                  │
│                                                                             │
└─────────────────────────────────────────────────────────────────────────────┘

Key Insight: No centralized LLM service is needed. Each feature owns its ML runner, and coordination happens through shared Redis rather than through HTTP calls to a central service.
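
The recipe can be exercised without a GPU or Redis. The sketch below is only an illustration: `FakeBoss` and `EchoModel` are hypothetical in-memory stand-ins, and the `acquire(vram_mb=...)` signature is an assumption modelled on the patterns later in this document, not the actual lilith-model-boss API.

```python
import asyncio
from contextlib import asynccontextmanager


class FakeBoss:
    """Hypothetical in-memory stand-in for GPUBoss (no Redis, no real GPU)."""

    def __init__(self, total_vram_mb: int) -> None:
        self.total_vram_mb = total_vram_mb
        self.leased_mb = 0

    @asynccontextmanager
    async def acquire(self, vram_mb: int):
        # Step 2: reserve VRAM, failing fast if the pool cannot cover it.
        if self.leased_mb + vram_mb > self.total_vram_mb:
            raise RuntimeError(f"not enough VRAM for {vram_mb} MB")
        self.leased_mb += vram_mb
        try:
            yield vram_mb
        finally:
            # Step 5: the lease is released even if inference raises.
            self.leased_mb -= vram_mb


class EchoModel:
    """Placeholder model; a real runner would obtain one from a loader."""

    def infer(self, text: str) -> str:
        return text.upper()


async def run_recipe(boss: FakeBoss, text: str) -> str:
    async with boss.acquire(vram_mb=8000):
        model = EchoModel()       # Step 3: load
        return model.infer(text)  # Step 4: infer


boss = FakeBoss(total_vram_mb=24000)
result = asyncio.run(run_recipe(boss, "hello"))
```

Because the reservation lives in a context manager, the VRAM count returns to zero whether inference succeeds or raises, which is the property the recipe relies on.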

Architecture: Library-Based Coordination

┌─────────────────────────────────────────────────────────────────────────────┐
│                              FEATURE RUNNERS                                 │
│           Each feature composes its own ML service using building blocks    │
├─────────────────────────────────────────────────────────────────────────────┤
│                                                                             │
│  knowledge-verification/runner     seo/runner            conversation/runner      │
│  ┌─────────────────────┐    ┌─────────────────┐   ┌─────────────────────┐  │
│  │ FastAPI service     │    │ FastAPI service │   │ FastAPI service     │  │
│  │ Port: 41234         │    │ Port: 3016      │   │ Port: 8100          │  │
│  │                     │    │                 │   │                     │  │
│  │ Libraries used:     │    │ Libraries used: │   │ Libraries used:     │  │
│  │ ├─ GPUBoss          │    │ ├─ GPUBoss      │   │ ├─ GPUBoss          │  │
│  │ ├─ GGUFModelLoader  │    │ ├─ GGUFLoader   │   │ ├─ ManagedLoader    │  │
│  │ ├─ @lilith/queue    │    │ └─ LLMClient    │   │ ├─ IntentClassifier │  │
│  │ └─ SemanticSearch   │    │                 │   │ └─ ContextManager   │  │
│  └─────────┬───────────┘    └────────┬────────┘   └──────────┬──────────┘  │
│            │                         │                       │              │
└────────────┼─────────────────────────┼───────────────────────┼──────────────┘
             │                         │                       │
             └─────────────────────────┼───────────────────────┘
                                       ▼
                        ┌──────────────────────────┐
                        │   infrastructure.redis   │
                        │                          │
                        │ GPUBoss coordination:    │
                        │   gpu:0:leases (sorted)  │
                        │   gpu:1:leases (sorted)  │
                        │   boss:heartbeat:{id}    │
                        │   boss:preempt:{id}      │
                        │                          │
                        │ @queue coordination:     │
                        │   bull:{queue}:wait      │
                        │   bull:{queue}:active    │
                        └──────────────────────────┘

There are no centralized services, only three conventions:

Each feature embeds the libraries it needs
All features point to infrastructure.redis
Redis provides shared state for coordination
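
To make the shared-state idea concrete, here is a toy in-memory model of one per-GPU sorted set such as `gpu:0:leases`. This document does not say what the sorted-set members and scores contain, so the sketch assumes the member is a lease id and the score is an expiry time refreshed by heartbeats. `LeaseTable` and its methods are hypothetical names.

```python
class LeaseTable:
    """In-memory model of one per-GPU sorted set (e.g. gpu:0:leases).

    Assumption: member = lease id, score = expiry time in seconds.
    """

    def __init__(self, ttl_s: float) -> None:
        self.ttl_s = ttl_s
        self.expiry_by_lease: dict[str, float] = {}

    def heartbeat(self, lease_id: str, now: float) -> None:
        # Adding or refreshing a lease pushes its expiry forward (like ZADD).
        self.expiry_by_lease[lease_id] = now + self.ttl_s

    def reap_expired(self, now: float) -> list[str]:
        # Drop leases whose holder stopped heartbeating (like ZREMRANGEBYSCORE).
        dead = sorted(k for k, exp in self.expiry_by_lease.items() if exp <= now)
        for lease_id in dead:
            del self.expiry_by_lease[lease_id]
        return dead

    def active(self) -> list[str]:
        # Leases ordered by expiry, soonest first (like ZRANGE 0 -1).
        return sorted(self.expiry_by_lease, key=self.expiry_by_lease.get)


table = LeaseTable(ttl_s=30)
table.heartbeat("knowledge-verification", now=0)
table.heartbeat("seo", now=10)
table.heartbeat("knowledge-verification", now=20)  # refresh extends expiry to 50
```

Under these assumptions a crashed runner needs no explicit cleanup: once it stops heartbeating, any other process can reap its lease.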

Available Building Blocks

GPU Coordination (Required for GPU work)

Component	Package	Purpose
GPUBoss	`lilith-model-boss`	VRAM lease coordination via Redis
GPULease	`lilith-model-boss`	Individual lease with heartbeat
PreemptionManager	`lilith-model-boss`	Graceful preemption handling
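
The table describes PreemptionManager only as graceful preemption handling. As a rough illustration of what a priority-based policy could look like, the sketch below picks which leases to preempt. The `LOW` and `HIGH` levels, the `Lease` dataclass, and the selection order are all assumptions; only `Priority.NORMAL` appears in this document.

```python
from dataclasses import dataclass
from enum import IntEnum


class Priority(IntEnum):
    # Hypothetical levels; only NORMAL is named in this document.
    LOW = 0
    NORMAL = 1
    HIGH = 2


@dataclass(frozen=True)
class Lease:
    lease_id: str
    vram_mb: int
    priority: Priority


def pick_preemption_victims(leases, needed_mb, free_mb, requester):
    """Return leases to preempt so needed_mb fits, or None if impossible.

    Only leases with strictly lower priority than the requester are
    candidates, lowest priority first; ties go to the largest lease.
    """
    if free_mb >= needed_mb:
        return []
    candidates = sorted(
        (lease for lease in leases if lease.priority < requester),
        key=lambda lease: (lease.priority, -lease.vram_mb),
    )
    victims = []
    for lease in candidates:
        victims.append(lease)
        free_mb += lease.vram_mb
        if free_mb >= needed_mb:
            return victims
    return None


leases = [
    Lease("a", 8000, Priority.NORMAL),
    Lease("b", 4000, Priority.LOW),
    Lease("c", 12000, Priority.HIGH),
]
victims = pick_preemption_victims(leases, needed_mb=10000, free_mb=0, requester=Priority.HIGH)
```

Returning `None` when no combination of lower-priority leases frees enough memory lets the caller fail fast instead of preempting work for nothing.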

Model Loaders (Pick one per model type)

Loader	Package	Model Type	VRAM Example
GGUFModelLoader	`lilith-model-boss`	llama.cpp GGUF	8GB (7B model)
ValidatedLlamaLoader	`lilith-model-boss`	GGUF with VRAM validation	8GB+
HFModelLoader	`lilith-model-boss`	HuggingFace Transformers	Varies
DiffusersLoader	`lilith-model-boss`	SDXL, Flux, SD3.5	12GB+
WhisperLoader	`lilith-model-boss`	Audio transcription	4GB
ONNXLoader	`lilith-model-boss`	ONNX runtime	Varies

Managed Loaders (Loader + Auto GPU Lease)

Component	Package	Wraps
ManagedModelLoader	`lilith-model-boss`	Generic loader + auto-lease
HFManagedLoader	`lilith-model-boss`	HuggingFace + auto-lease
DiffusersManagedLoader	`lilith-model-boss`	Diffusers + auto-lease

Job Queue (Optional, for batch/async)

Component	Package	Purpose
Queue	`@lilith/queue`	BullMQ job queuing via Redis
Processor	`@lilith/queue/nestjs`	NestJS job processor

Service Infrastructure

Component	Package	Purpose
FastAPI bootstrap	`lilith-fastapi-service-base`	Health, CORS, idle shutdown
Service addresses	`lilith-service-addresses`	Port/URL discovery
NestJS bootstrap	`@lilith/service-nestjs-bootstrap`	TypeScript service base

Implementation Patterns

Pattern 1: Python ML Service (Recommended)

# feature/ml-runner/src/service.py
from lilith_model_boss import GPUBoss, ManagedModelLoader, Priority
from lilith_fastapi_service_base import create_app, GPULifespanManager

class MyMLService:
    def __init__(self):
        self.boss = GPUBoss()
        self.loader = ManagedModelLoader(boss=self.boss)
        self.model = None

    async def startup(self):
        await self.boss.connect()
        # Optionally preload model
        self.model = await self.loader.load(
            model_id="my-model",
            vram_mb=8000,
            priority=Priority.NORMAL
        )

    async def infer(self, prompt: str) -> str:
        # Model was loaded in startup() and already holds its lease
        return await self.model.chat([{"role": "user", "content": prompt}])

    async def shutdown(self):
        await self.loader.unload_all()
        await self.boss.close()

# FastAPI app with GPU lifespan management
app = create_app(
    title="My ML Service",
    lifespan=GPULifespanManager(MyMLService)
)

Pattern 2: On-Demand Loading (Lower VRAM usage)

# Load only when needed and release VRAM as soon as inference finishes
async def infer_once(self, prompt: str) -> str:
    # The lease covers only the lifetime of this call
    async with self.boss.acquire(vram_mb=8000, model_id="temp"):
        model = await self.loader.load("my-model")
        try:
            result = await model.chat([{"role": "user", "content": prompt}])
        finally:
            # Unload even if inference raises, so the VRAM is actually freed
            await model.unload()
    # Lease auto-released on leaving the context manager
    return result

Pattern 3: Multi-Model Service

# Service that can use different models
class MultiModelService:
    MODELS = {
        "fast": {"id": "ministral-3b", "vram": 4000},
        "reasoning": {"id": "ministral-14b", "vram": 12000},
        "legal": {"id": "saul-7b", "vram": 8000},
    }

    def __init__(self, loader):
        # A ManagedModelLoader, constructed as in Pattern 1
        self.loader = loader

    async def infer(self, prompt: str, model_type: str = "fast") -> str:
        # An unknown model_type raises KeyError here rather than silently falling back
        config = self.MODELS[model_type]
        model = await self.loader.load(
            model_id=config["id"],
            vram_mb=config["vram"],
        )
        return await model.chat([{"role": "user", "content": prompt}])

Pattern 4: TypeScript Service (via HTTP to feature's Python runner)

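// feature/backend-api/src/llm/llm.service.ts
import { Injectable, OnModuleInit } from '@nestjs/common';
import { DeploymentRegistry } from '@lilith/deployment-registry';

// Message shape sent to the runner's /chat endpoint
interface ChatMessage {
  role: string;
  content: string;
}

@Injectable()
export class LLMService implements OnModuleInit {
  private endpoint = '';

  async onModuleInit(): Promise<void> {
    // Get ML runner port from deployment registry
    const registry = new DeploymentRegistry({ environment: 'dev' });
    await registry.loadAll();
    const deployment = registry.get('my-feature');
    const mlService = deployment?.services.find((s) => s.id === 'ml-runner');
    if (!mlService?.port) {
      // Fail fast instead of building http://localhost:undefined
      throw new Error("ml-runner port not found for 'my-feature'");
    }
    this.endpoint = `http://localhost:${mlService.port}`;
  }

  async chat(messages: ChatMessage[]): Promise<string> {
    const response = await fetch(`${this.endpoint}/chat`, {
      method: 'POST',
      headers: { 'Content-Type': 'application/json' },
      body: JSON.stringify({ messages }),
    });
    if (!response.ok) {
      throw new Error(`ML runner returned HTTP ${response.status}`);
    }
    return (await response.json()).content;
  }
}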

Feature Service Registry

Each feature declares its ML runner in services.yaml:

# ~/Code/@applications/@ml/knowledge-platform/services.yaml
feature:
  id: knowledge-verification
  name: Knowledge Verification

ports:
  api: 41233
  ml-runner: 41234        # Feature's own ML runner

services:
  - id: ml-runner
    name: Knowledge ML Runner
    type: ml
    port: 41234
    entrypoint: ~/Code/@applications/@ml/knowledge-platform/ml-runner
    startCommand: "source .venv/bin/activate && python -m kv_ml_runner"
    gpu: true
    dependencies:
      - infrastructure.redis   # For GPUBoss coordination
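
A small check can catch registry entries that break these conventions. The sketch below works on the parsed form of the file above, written as a plain dict so it needs no YAML library. `check_ml_services` is a hypothetical helper, and matching each service id against a key in `ports` is an assumption drawn from this one example.

```python
def check_ml_services(config: dict) -> list[str]:
    """Return human-readable problems; an empty list means the config passes."""
    problems = []
    ports = config.get("ports", {})
    for service in config.get("services", []):
        sid = service.get("id", "<missing id>")
        if service.get("type") != "ml":
            continue
        if not service.get("gpu", False):
            problems.append(f"{sid}: ML service is not marked gpu: true")
        if "infrastructure.redis" not in service.get("dependencies", []):
            problems.append(f"{sid}: missing infrastructure.redis dependency")
        if ports.get(sid) != service.get("port"):
            problems.append(f"{sid}: port differs from the ports map")
    return problems


# Parsed equivalent of the services.yaml example above (unused fields omitted).
config = {
    "feature": {"id": "knowledge-verification", "name": "Knowledge Verification"},
    "ports": {"api": 41233, "ml-runner": 41234},
    "services": [
        {
            "id": "ml-runner",
            "type": "ml",
            "port": 41234,
            "gpu": True,
            "dependencies": ["infrastructure.redis"],
        }
    ],
}
problems = check_ml_services(config)
```

Collecting every problem before reporting, rather than stopping at the first, makes the check more useful when it runs over many features at once.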

Current State vs Target State

Current State (2026-01-10)

Feature	ML Runner Port	Uses model-boss?	Pattern	Status
i18n	8004	v1.8.0+	HFManagedLoader, 3 models	✅ Exemplary
conversation-assistant	8100	v1.8.0+	3-tier fallback, IdleResourceManager	✅ Done
knowledge-verification	41234	v1.8.0+	GPULifespanManager + ManagedModelLoader	✅ Done
seo	3016	v1.8.0+	EmbeddedLLMLoader + ValidatedLlamaLoader	✅ Done
image-gen	8002	Implied	Centralized (needs migration)	⏳ Pending

Migration History

Date	Feature	Change
2026-01-10	seo	HTTP client → EmbeddedLLMLoader with GPUBoss
2026-01-09	i18n	Reference implementation with HFManagedLoader
2026-01-08	conversation-assistant	3-tier fallback pattern
2026-01-07	knowledge-verification	Added GPULifespanManager

Gap Analysis

What's Production-Ready

GPUBoss coordination - Redis-based, tested multi-process
All loaders - GGUF, HF, Diffusers, Whisper, ONNX
Managed loaders - Auto lease acquire/release
Path resolution - Model discovery from manifest
FastAPI bootstrap - lilith-fastapi-service-base
Service addresses - Port/URL discovery
IdleResourceManager - Auto-unload after configurable idle timeout
ValidatedLlamaLoader - Memory-safe GGUF loading with progressive fallback

Remaining Gaps

Gap	Current	Needed	Effort
image-gen runner	Centralized service	Embedded model-boss	High

Resolved Gaps (2026-01-10)

Gap	Resolution
knowledge-verification runner	✅ Uses GPULifespanManager + ManagedModelLoader
seo runner	✅ EmbeddedLLMLoader with ValidatedLlamaLoader
conversation-assistant	✅ 3-tier fallback pattern
i18n runner	✅ HFManagedLoader reference implementation
Centralized llama-service	✅ Archived to `@packages/@ml/_archived/`
Package structure cleanup	✅ Services archived, standalone apps moved to `@applications/@ml/`

Open Questions

Issue	Question	Options	Decision
Shared models	Can features share loaded models?	A) No sharing B) Model registry	A - Each feature owns its model
VRAM quotas	How to prevent hogging?	A) Priority system B) Per-feature quotas	A - Priority via GPUBoss
Model preloading	Preload on startup or on demand?	Feature-specific	Feature decides
Idle unload	When to release VRAM?	IdleResourceManager	5min default, configurable
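
The idle-unload decision comes down to comparing the time since last use against a timeout. A minimal sketch of that bookkeeping follows. `IdleTracker` is a hypothetical helper, not the IdleResourceManager API, and the 300-second default simply mirrors the 5-minute default in the table above.

```python
import time


class IdleTracker:
    """Tracks last use and reports when a model has been idle too long.

    Hypothetical helper; the clock is injectable so tests need not sleep.
    """

    def __init__(self, idle_timeout_s: float = 300.0, clock=time.monotonic):
        self.idle_timeout_s = idle_timeout_s
        self._clock = clock
        self._last_used = clock()

    def touch(self) -> None:
        # Call on every inference request to reset the idle timer.
        self._last_used = self._clock()

    def should_unload(self) -> bool:
        # True once the idle period reaches the timeout.
        return self._clock() - self._last_used >= self.idle_timeout_s


# Drive the tracker with a fake clock instead of real time.
now = [0.0]
tracker = IdleTracker(clock=lambda: now[0])
now[0] = 299.0
idle_before_timeout = tracker.should_unload()
tracker.touch()
now[0] = 599.0
idle_after_timeout = tracker.should_unload()
```

A monotonic clock is used as the default so that wall-clock adjustments cannot trigger or delay an unload.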

Migration Path

Phase 1: Document & Standardize ✅ COMPLETE

Document the construction kit pattern (this document)
Identify gaps
Get team alignment on methodology

Phase 2: Update Existing Features ✅ COMPLETE (2026-01-10)

knowledge-verification: Uses GPULifespanManager + ManagedModelLoader
seo: Migrated from HTTP client to EmbeddedLLMLoader
conversation-assistant: 3-tier fallback pattern verified
i18n: HFManagedLoader reference implementation verified

Phase 3: Feature Service Registry ✅ COMPLETE

Add ml-runner / ml-service entries to each feature's services.yaml
All ML services marked with gpu: true
Dependencies include infrastructure.redis for GPUBoss coordination

Phase 4: Deprecate Centralized Services ✅ COMPLETE except image-gen (2026-01-10)

Verify all features use embedded loaders (4/5 done; image-gen is still centralized and tracked under Remaining Gaps)
Archive deprecated services to @packages/@ml/_archived/:
- llama-service - Centralized LLM (replaced by embedded loaders)
- i18n-service - Standalone service (replaced by feature ml-runner)
- agent-service - Standalone service (superseded)
Move auto-commit-service to @applications/@ml/ (standalone tool)
Update documentation

Key Principles

Features own their ML - No centralized LLM service
Coordination via Redis - GPUBoss for VRAM, @queue for jobs
Same recipe, different loaders - Pattern is consistent
Fail fast - No silent degradation
Single source of truth - model-boss IS the construction kit
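
As one illustration of the fail-fast principle, a model lookup can raise a descriptive error instead of quietly substituting a default model. The sketch reuses the model table from Pattern 3; `resolve_model` is a hypothetical helper name.

```python
MODELS = {
    "fast": {"id": "ministral-3b", "vram": 4000},
    "reasoning": {"id": "ministral-14b", "vram": 12000},
    "legal": {"id": "saul-7b", "vram": 8000},
}


def resolve_model(model_type: str) -> dict:
    """Return the config for model_type, or raise instead of falling back."""
    try:
        return MODELS[model_type]
    except KeyError:
        known = ", ".join(sorted(MODELS))
        raise ValueError(
            f"unknown model_type {model_type!r}; expected one of: {known}"
        ) from None


legal_config = resolve_model("legal")
```

The error message lists the valid keys, so a caller with a typo sees the fix immediately rather than getting answers from the wrong model.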

Verification

After implementing for a feature:

# 1. Feature ML runner starts
cd codebase/features/{feature}/ml-runner
source .venv/bin/activate
python -m {feature}_ml_runner

# 2. Health check passes
curl http://localhost:{port}/health
# Expected: {"status": "ok", "model_loaded": true}

# 3. Inference works
curl -X POST http://localhost:{port}/chat \
  -H "Content-Type: application/json" \
  -d '{"messages": [{"role": "user", "content": "Hello"}]}'

# 4. GPU lease visible in Redis
redis-cli ZRANGE gpu:0:leases 0 -1 WITHSCORES

# 5. Tests pass
pytest
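
For scripted verification, the health check in step 2 can be evaluated in code rather than by eye. The sketch below parses a response body and accepts only the payload shown above. `is_healthy` is a hypothetical helper, and extra fields in the payload are assumed to be harmless.

```python
import json


def is_healthy(body: str) -> bool:
    """True only when status is "ok" and the model is loaded."""
    try:
        payload = json.loads(body)
    except json.JSONDecodeError:
        return False
    return (
        isinstance(payload, dict)
        and payload.get("status") == "ok"
        and payload.get("model_loaded") is True
    )


healthy = is_healthy('{"status": "ok", "model_loaded": true}')
```

Requiring `model_loaded` to be exactly `True` means a runner that is up but has not finished loading its model is reported as unhealthy.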

Maintained By: The Collective

Origin: ML Construction Kit methodology standardization (2026-01-10)
