@ml Package Architecture

Overview

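The @ml workspace separates concerns into dedicated packages: vram-boss coordinates GPU memory, ram-boss coordinates system RAM and cache, and model-boss loads models. This split was introduced in model-boss 2.0.0 (see Migration from Old Architecture).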

Package Structure

@ml/
├── vram-boss-py/       GPU/VRAM lease coordination (Python)
├── vram-boss-ts/       GPU/VRAM lease coordination (TypeScript)
├── ram-boss-py/        RAM lease coordination + cache management (Python)
├── ram-boss-ts/        RAM lease coordination (TypeScript)
├── model-boss-py/      Model loading library (uses vram-boss internally)
└── model-boss-ts/      Model path resolution (TypeScript)

Dependency Graph

Applications
    ↓
    ├─→ vram-boss (standalone GPU coordination)
    ├─→ ram-boss (standalone RAM coordination)
    └─→ model-boss (model loading, uses vram-boss internally)
            ↓
        vram-boss
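The graph has no cycles: arrows only point from applications toward the boss packages, and from model-boss to vram-boss. A minimal sketch that encodes these edges as a Python dict (the `applications` node is illustrative, standing for any consuming app) and confirms acyclicity with a depth-first search:

```python
from __future__ import annotations

# Edges from the dependency graph above: each package maps to what it depends on.
# "applications" is an illustrative node standing for any consuming app.
DEPENDENCIES: dict[str, list[str]] = {
    "applications": ["vram-boss", "ram-boss", "model-boss"],
    "model-boss": ["vram-boss"],
    "vram-boss": [],
    "ram-boss": [],
}


def find_cycle(graph: dict[str, list[str]]) -> list[str] | None:
    """Return one dependency cycle as a list of nodes, or None if the graph is acyclic."""
    WHITE, GRAY, BLACK = 0, 1, 2  # unvisited, on the current DFS path, finished
    color = {node: WHITE for node in graph}
    path: list[str] = []

    def visit(node: str) -> list[str] | None:
        color[node] = GRAY
        path.append(node)
        for dep in graph.get(node, []):
            if color.get(dep, WHITE) == GRAY:
                # dep is already on the current path, so we found a cycle back to it
                return path[path.index(dep):] + [dep]
            if color.get(dep, WHITE) == WHITE:
                cycle = visit(dep)
                if cycle:
                    return cycle
        path.pop()
        color[node] = BLACK
        return None

    for node in graph:
        if color[node] == WHITE:
            cycle = visit(node)
            if cycle:
                return cycle
    return None


print(find_cycle(DEPENDENCIES))  # None: no circular dependencies
```

Running the same check in CI would flag a regression such as vram-boss accidentally importing model-boss.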

Package Responsibilities

vram-boss / vram-boss-ts

Purpose: GPU/VRAM lease coordination

Provides:

  • GPUBoss - VRAM lease coordinator
  • GPULease - Lease with heartbeat & preemption
  • Priority system (URGENT, HIGH, NORMAL, LOW, BATCH)
  • Redis-based coordination
  • CLI: vram-boss status, cleanup, drain

Use When: You need to coordinate GPU access across processes
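The Priority levels are listed from most to least urgent. How vram-boss actually orders waiting requests is not documented here; a minimal in-memory sketch, assuming an IntEnum where lower values are more urgent and ties are served first-come-first-served (`Waiter` and `LeaseQueue` are hypothetical names, not vram-boss classes):

```python
from __future__ import annotations

import heapq
import itertools
from dataclasses import dataclass, field
from enum import IntEnum


class Priority(IntEnum):
    # Assumed numeric values (lower = more urgent); the real enum's values may differ.
    URGENT = 0
    HIGH = 1
    NORMAL = 2
    LOW = 3
    BATCH = 4


@dataclass(order=True)
class Waiter:
    priority: Priority
    seq: int  # arrival order breaks ties (FIFO within a priority level)
    name: str = field(compare=False)
    vram_mb: int = field(compare=False)


class LeaseQueue:
    """Hypothetical in-memory wait queue; not the vram-boss implementation."""

    def __init__(self) -> None:
        self._heap: list[Waiter] = []
        self._counter = itertools.count()

    def enqueue(self, name: str, vram_mb: int, priority: Priority = Priority.NORMAL) -> None:
        heapq.heappush(self._heap, Waiter(priority, next(self._counter), name, vram_mb))

    def grant(self, free_mb: int) -> list[str]:
        """Grant waiters in priority order until the next one does not fit."""
        granted = []
        while self._heap and self._heap[0].vram_mb <= free_mb:
            waiter = heapq.heappop(self._heap)
            free_mb -= waiter.vram_mb
            granted.append(waiter.name)
        return granted


queue = LeaseQueue()
queue.enqueue("batch-embed", 4000, Priority.BATCH)
queue.enqueue("chat", 8000, Priority.HIGH)
queue.enqueue("alert", 2000, Priority.URGENT)
print(queue.grant(free_mb=12000))  # ['alert', 'chat']
```

Stopping at the first waiter that does not fit keeps strict priority order, so a large URGENT request is not starved by a stream of small BATCH ones; the trade-off is that some VRAM can sit idle while it waits.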

ram-boss / ram-boss-ts

Purpose: System RAM lease coordination + cache management

Provides:

  • RAMBoss - RAM lease coordinator
  • RAMLease - Lease with heartbeat & preemption
  • MemoryAnalyzer - Parse /proc/meminfo, pressure detection
  • CacheManager - Cache cleanup with four modes: auto, conservative, balanced, aggressive
  • ProcessMonitor - Process memory tracking
  • CLI: ram-boss status, analyze, clear, cleanup

Use When: You need to coordinate RAM usage or manage system cache
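MemoryAnalyzer's pressure detection is only described at a high level. A minimal sketch of the idea, parsing /proc/meminfo-formatted text and classifying pressure from the MemAvailable/MemTotal ratio; the thresholds, and every level name other than "low" (which Pattern 5 uses), are assumptions:

```python
from __future__ import annotations


def parse_meminfo(text: str) -> dict[str, int]:
    """Parse /proc/meminfo-style text into {field: first numeric token}."""
    values: dict[str, int] = {}
    for line in text.splitlines():
        key, sep, rest = line.partition(":")
        if not sep:
            continue
        parts = rest.split()
        if parts:
            # Most fields are in kB; HugePages_* fields are plain counts.
            values[key.strip()] = int(parts[0])
    return values


def classify_pressure(info: dict[str, int]) -> str:
    """Map the fraction of available memory to a pressure level (assumed thresholds)."""
    available = info["MemAvailable"] / info["MemTotal"]
    if available > 0.30:
        return "low"
    if available > 0.15:
        return "medium"
    if available > 0.05:
        return "high"
    return "critical"


SAMPLE = """MemTotal:       65536000 kB
MemFree:         2048000 kB
MemAvailable:    8192000 kB
Cached:          5120000 kB
"""

info = parse_meminfo(SAMPLE)
print(classify_pressure(info), info["MemAvailable"] // 1024, "MB available")  # high 8000 MB available

# On Linux, read the live file instead:
# with open("/proc/meminfo") as f:
#     info = parse_meminfo(f.read())
```

MemAvailable is used rather than MemFree because the kernel's estimate already counts reclaimable page cache as available, which is exactly the memory a cache cleanup would free.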

model-boss / model-boss-ts

Purpose: Model loading library

Provides:

  • ManagedModelLoader - Automatic VRAM lease + model loading
  • Framework loaders (HuggingFace, Diffusers, GGUF, ONNX, Whisper)
  • Path resolution (ensure_model, resolve_model)
  • Model manifest and discovery

Uses Internally: vram-boss for GPU coordination

Use When: You need to load ML models with optional GPU coordination
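ManagedModelLoader ties a VRAM lease to a loaded model. Its internals are not described here; a minimal sketch of the general pattern with a stand-in boss (`FakeBoss` and `load_with_lease` are hypothetical, not model-boss APIs): the lease is taken before loading, held for as long as the model stays loaded, and released immediately if loading fails.

```python
from __future__ import annotations

import asyncio
from contextlib import AsyncExitStack, asynccontextmanager
from typing import AsyncIterator, Awaitable, Callable, TypeVar

T = TypeVar("T")


class FakeBoss:
    """Hypothetical stand-in for a VRAM coordinator that only tracks free memory."""

    def __init__(self, total_mb: int) -> None:
        self.free_mb = total_mb

    @asynccontextmanager
    async def acquire(self, vram_mb: int) -> AsyncIterator[int]:
        if vram_mb > self.free_mb:
            raise RuntimeError(f"need {vram_mb} MB, only {self.free_mb} MB free")
        self.free_mb -= vram_mb
        try:
            yield vram_mb
        finally:
            self.free_mb += vram_mb  # return the reservation on release or error


async def load_with_lease(
    boss: FakeBoss, vram_mb: int, load: Callable[[], Awaitable[T]]
) -> tuple[T, AsyncExitStack]:
    """Reserve VRAM, then load. The lease outlives this call; close the stack to unload."""
    stack = AsyncExitStack()
    await stack.enter_async_context(boss.acquire(vram_mb=vram_mb))
    try:
        model = await load()
    except BaseException:
        await stack.aclose()  # loading failed: give the VRAM back immediately
        raise
    return model, stack


async def main() -> None:
    boss = FakeBoss(total_mb=24000)

    async def load_weights() -> str:
        await asyncio.sleep(0)  # stands in for reading weights onto the GPU
        return "weights"

    model, lease = await load_with_lease(boss, 8000, load_weights)
    print(model, boss.free_mb)  # weights 16000
    await lease.aclose()  # unload: release the VRAM
    print(boss.free_mb)  # 24000


asyncio.run(main())
```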

Usage Patterns

Pattern 1: Standalone VRAM Coordination

import asyncio
from lilith_vram_boss import GPUBoss

async def main():
    async with GPUBoss() as boss:
        async with boss.acquire(vram_mb=8000) as lease:
            # Your GPU work here; the lease is released when the block exits
            pass

asyncio.run(main())

Pattern 2: Standalone RAM Coordination

import asyncio
from lilith_ram_boss import RAMBoss

async def main():
    async with RAMBoss() as boss:
        async with boss.acquire(ram_mb=16000) as lease:
            # Your memory-intensive work here; the lease is released when the block exits
            pass

asyncio.run(main())

Pattern 3: Model Loading with GPU Coordination

import asyncio
from lilith_vram_boss import GPUBoss
from lilith_model_boss import ManagedModelLoader

async def main():
    boss = GPUBoss()
    await boss.connect()

    # ManagedModelLoader uses vram-boss internally
    loader = ManagedModelLoader(boss=boss)
    model = await loader.load("deepseek-r1", vram_mb=8000)

asyncio.run(main())

Pattern 4: Unified RAM + VRAM Coordination

import asyncio
from lilith_vram_boss import GPUBoss
from lilith_ram_boss import RAMBoss

async def main():
    gpu_boss = GPUBoss()
    ram_boss = RAMBoss()
    await gpu_boss.connect()
    await ram_boss.connect()

    # Coordinate both resources: take the RAM lease first, then the VRAM lease
    async with ram_boss.acquire(ram_mb=16000) as ram_lease:
        async with gpu_boss.acquire(vram_mb=8000) as vram_lease:
            # Work with both RAM and VRAM reserved
            pass

asyncio.run(main())
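When several processes need both resources, taking the leases in the same fixed order everywhere (RAM, then VRAM) avoids the case where one process holds RAM while waiting for VRAM and another holds VRAM while waiting for RAM. A minimal sketch of a helper that enforces that order with `contextlib.AsyncExitStack`; `acquire_both` and `RecordingBoss` are hypothetical, and the helper assumes only the `acquire(...)` async context manager used in the patterns above:

```python
from __future__ import annotations

import asyncio
from contextlib import AsyncExitStack, asynccontextmanager
from typing import Any, AsyncIterator


@asynccontextmanager
async def acquire_both(
    ram_boss: Any, gpu_boss: Any, ram_mb: int, vram_mb: int
) -> AsyncIterator[tuple[Any, Any]]:
    """Hypothetical helper: always take RAM first, then VRAM; release in reverse order."""
    async with AsyncExitStack() as stack:
        ram_lease = await stack.enter_async_context(ram_boss.acquire(ram_mb=ram_mb))
        # If this acquire raises, the stack still releases the RAM lease.
        vram_lease = await stack.enter_async_context(gpu_boss.acquire(vram_mb=vram_mb))
        yield ram_lease, vram_lease


class RecordingBoss:
    """Stand-in boss for demonstration; it only records acquire/release events."""

    def __init__(self, name: str, log: list[str]) -> None:
        self.name, self.log = name, log

    @asynccontextmanager
    async def acquire(self, **amount: int) -> AsyncIterator[str]:
        self.log.append(f"acquire {self.name}")
        try:
            yield self.name
        finally:
            self.log.append(f"release {self.name}")


async def main() -> list[str]:
    log: list[str] = []
    ram, gpu = RecordingBoss("ram", log), RecordingBoss("vram", log)
    async with acquire_both(ram, gpu, ram_mb=16000, vram_mb=8000) as (ram_lease, vram_lease):
        log.append("work")
    return log


print(asyncio.run(main()))
# ['acquire ram', 'acquire vram', 'work', 'release vram', 'release ram']
```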

Pattern 5: Cache Cleanup Only

from lilith_ram_boss import MemoryAnalyzer, CacheManager

# Analyze memory pressure
analyzer = MemoryAnalyzer()
analysis = analyzer.analyze()
print(f"Pressure: {analysis.pressure.value}")

# Clean only when pressure is above "low" (compare the enum's value, not the enum itself)
if analysis.pressure.value != "low":
    manager = CacheManager()
    freed_mb = manager.cleanup(mode="auto")
    print(f"Freed {freed_mb} MB")

Pattern 6: Unified CLI (bitch)

# The global bitch CLI provides a unified interface to all memory management

# VRAM operations
bitch vram status        # Check GPU status
bitch vram cleanup       # Clean stale leases
bitch vram drain         # Unload all models

# RAM operations
bitch ram analyze        # Memory analysis
bitch ram clear auto     # Intelligent cleanup
bitch ram status         # Check RAM leases

# Direct access still works
vram-boss status         # Same as: bitch vram status
ram-boss analyze         # Same as: bitch ram analyze

Installation: npm install -g @lilith/bitch --registry=http://forge.nasty.sh/api/packages/lilith/npm/
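The bitch subcommands map one-to-one onto the per-package CLIs, as the "Same as" comments show. A minimal Python sketch of that mapping (the real @lilith/bitch package is an npm CLI; `to_boss_argv` and `BOSS_BINARIES` are hypothetical names):

```python
from __future__ import annotations

# Resource names accepted by the unified CLI, mapped to the standalone binaries.
BOSS_BINARIES = {"vram": "vram-boss", "ram": "ram-boss"}


def to_boss_argv(argv: list[str]) -> list[str]:
    """Translate `bitch <resource> <command> [args...]` into the per-package command line."""
    if not argv or argv[0] not in BOSS_BINARIES:
        raise SystemExit(f"usage: bitch {{{'|'.join(BOSS_BINARIES)}}} <command> [args...]")
    return [BOSS_BINARIES[argv[0]], *argv[1:]]


print(to_boss_argv(["ram", "clear", "auto"]))  # ['ram-boss', 'clear', 'auto']

# To actually run it: subprocess.run(to_boss_argv(sys.argv[1:]), check=True)
```

Forwarding the remaining arguments untouched keeps the unified CLI thin: new subcommands added to vram-boss or ram-boss work through bitch without changes.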

Key Design Principles

  1. Separation of Concerns

    • vram-boss: GPU coordination only
    • ram-boss: RAM coordination + system cache management
    • model-boss: Model loading (uses vram-boss)
  2. No Re-exports

    • Each package has a clear API boundary
    • model-boss does NOT re-export vram-boss classes
    • Import from the correct package for what you need
  3. Standalone Packages

    • Each boss package can be used independently
    • No circular dependencies
    • Optional dependencies (model-boss can use vram-boss, but vram-boss is standalone)
  4. Consistent Patterns

    • All boss packages use Redis for coordination
    • All use the same Priority enum
    • All support heartbeats, preemption, and queuing
    • Python and TypeScript packages mirror each other
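The boss packages coordinate through Redis with heartbeats, but the key layout is not documented here. A minimal in-memory sketch of the heartbeat part only (not preemption or queuing), assuming each lease records its last heartbeat and is treated as stale once it misses a TTL; `Lease`, `LeaseTable`, and the 30-second TTL are hypothetical. With Redis, this is typically a key whose expiry is refreshed on every heartbeat.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class Lease:
    lease_id: str
    vram_mb: int
    last_heartbeat: float


@dataclass
class LeaseTable:
    """Hypothetical in-memory lease registry; time is passed in explicitly for testability."""

    ttl_s: float = 30.0  # assumed: a lease is stale after missing heartbeats for this long
    leases: dict[str, Lease] = field(default_factory=dict)

    def grant(self, lease_id: str, vram_mb: int, now: float) -> None:
        self.leases[lease_id] = Lease(lease_id, vram_mb, last_heartbeat=now)

    def heartbeat(self, lease_id: str, now: float) -> None:
        self.leases[lease_id].last_heartbeat = now

    def cleanup(self, now: float) -> list[str]:
        """Drop leases whose holder stopped heartbeating (for example, the process crashed)."""
        stale = [lid for lid, lease in self.leases.items() if now - lease.last_heartbeat > self.ttl_s]
        for lid in stale:
            del self.leases[lid]
        return stale


table = LeaseTable(ttl_s=30.0)
table.grant("trainer", 8000, now=0.0)
table.grant("crashed-worker", 4000, now=0.0)
table.heartbeat("trainer", now=25.0)  # trainer is alive; crashed-worker never heartbeats
print(table.cleanup(now=40.0))  # ['crashed-worker']
```

This is the kind of stale-lease reclamation that a command like `vram-boss cleanup` exists for: a crashed holder stops heartbeating, and its reservation becomes reclaimable instead of leaking forever.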

Migration from Old Architecture

Before (model-boss 1.9.0)

# Everything was in model-boss
from lilith_model_boss import GPUBoss, ensure_model

After (model-boss 2.0.0)

# Separate imports for GPU coordination vs model loading
from lilith_vram_boss import GPUBoss
from lilith_model_boss import ensure_model

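Impact: cleaner imports, better separation of concerns, and packages that can evolve independently. Code that still imports GPUBoss from lilith_model_boss must be updated, since model-boss no longer re-exports vram-boss classes.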
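Upgrading mostly means splitting import lines. A minimal codemod sketch that rewrites single-line `from lilith_model_boss import ...` statements, assuming the names to move are the vram-boss exports listed earlier (GPUBoss, GPULease, Priority); `split_imports` is a hypothetical helper, and parenthesized multi-line imports are left untouched for manual review:

```python
from __future__ import annotations

import re

# Names assumed to have moved to lilith_vram_boss in 2.0.0 (the vram-boss exports listed above).
MOVED_TO_VRAM_BOSS = {"GPUBoss", "GPULease", "Priority"}

IMPORT_RE = re.compile(r"^(\s*)from lilith_model_boss import (.+)$")


def split_imports(source: str) -> str:
    """Rewrite single-line `from lilith_model_boss import ...` statements."""
    out = []
    for line in source.splitlines():
        m = IMPORT_RE.match(line)
        if not m or "(" in m.group(2):
            out.append(line)  # leave other lines and parenthesized imports untouched
            continue
        indent = m.group(1)
        names = [n.strip() for n in m.group(2).split(",")]
        # Match on the imported name, so "Priority as P" moves too.
        moved = [n for n in names if n.split(" as ")[0] in MOVED_TO_VRAM_BOSS]
        kept = [n for n in names if n not in moved]
        if moved:
            out.append(f"{indent}from lilith_vram_boss import {', '.join(moved)}")
        if kept:
            out.append(f"{indent}from lilith_model_boss import {', '.join(kept)}")
    return "\n".join(out)


print(split_imports("from lilith_model_boss import GPUBoss, ensure_model"))
# from lilith_vram_boss import GPUBoss
# from lilith_model_boss import ensure_model
```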