Training Architecture

Last Updated: 2026-02-16
Status: Using knowledge-verification infrastructure


Overview

Training automation is provided by knowledge-verification, not lilith-platform.

Lilith-platform is a consumer of ML services, not a provider. All ML infrastructure (training, validation, file watching) lives in the @ml ecosystem.


How It Works

┌─────────────────────────────────────────┐
│ Lilith Platform                         │
│   docs/ directory                       │
└──────────────┬──────────────────────────┘
               │
               │ Indexed at kv-api startup
               ↓
┌─────────────────────────────────────────┐
│ Knowledge-Verification                  │
│   kv-api/service                        │
│                                         │
│   file-watcher.ts (auto-start):         │
│   • Watches indexed directories         │
│   • Debounce: 2s (reindex)              │
│   • Cooldown: 6hrs (retrain)            │
│                                         │
│   Pipeline on change:                   │
│   1. Reindex (immediate)                │
│   2. Retrain (after cooldown)           │
└─────────────────────────────────────────┘

Key Points:

  1. kv-api service includes file-watcher.ts (built-in, auto-start)
  2. Watches all indexed directories (configured in semantic-validator.ts)
  3. On change:
    • Immediate: Reindex + invalidate cache (2s debounce)
    • Deferred: Full retrain after 6hr cooldown
  4. Pipeline: generate → fine-tune → convert GGUF → deploy
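
The immediate tier can be pictured as a trailing-edge debounce: a burst of saves collapses into one reindex that fires 2 seconds after the last change. The sketch below models that as a pure function over event timestamps. It is an illustration only; the function name is hypothetical, and whether file-watcher.ts resets its timer on every event is an assumption, since this document only states "2s debounce".

```typescript
// Assumed: a trailing-edge debounce that restarts on every change event.
const DEBOUNCE_MS = 2000; // 2s debounce, as stated in this document

/**
 * Given change-event timestamps (in ms), return the times at which a
 * reindex would fire under a trailing-edge debounce.
 */
function reindexFireTimes(eventTimes: number[], delayMs: number = DEBOUNCE_MS): number[] {
  const sorted = [...eventTimes].sort((a, b) => a - b);
  const fires: number[] = [];
  sorted.forEach((t, i) => {
    const next = sorted[i + 1];
    // The timer started by this event fires unless another event arrives first.
    if (next === undefined || next - t >= delayMs) {
      fires.push(t + delayMs);
    }
  });
  return fires;
}

// Three saves within one second collapse into a single reindex;
// a save ten seconds later gets its own.
console.log(reindexFireTimes([0, 400, 900, 10000])); // [2900, 12000]
```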

Configuration

Adding Lilith Docs to Watch List

File: ~/Code/@applications/@ml/knowledge-verification/services/kv-api/service/src/semantic-validator.ts

const indexedDirs = [
  {
    path: '/var/home/lilith/Code/@projects/@lilith/lilith-platform/docs',
    priority: 900,
    namespace: 'lilith-platform',
  },
  {
    path: '/var/home/lilith/Code/@applications/@ml/knowledge-verification/docs',
    priority: 800,
    namespace: 'kv-docs',
  },
  // ... other directories
];

That's it. The file watcher starts automatically when kv-api boots.
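
Before restarting kv-api with a new entry, it can help to sanity-check the list. The following is a minimal sketch, not code from semantic-validator.ts: the `IndexedDir` interface and `validateIndexedDirs` function are hypothetical, and the rules it enforces (absolute paths, unique namespaces) are assumptions about what the service expects.

```typescript
// Hypothetical shape for an entry; field names are taken from the config above.
interface IndexedDir {
  path: string;      // assumed: must be an absolute directory path
  priority: number;  // meaning of the number is not specified in this document
  namespace: string; // assumed: unique per directory
}

/** Return a list of human-readable problems; an empty list means the config looks sane. */
function validateIndexedDirs(dirs: IndexedDir[]): string[] {
  const problems: string[] = [];
  const seen = new Set<string>();
  for (const dir of dirs) {
    if (!dir.path.startsWith("/")) {
      problems.push(`${dir.namespace}: path is not absolute`);
    }
    if (seen.has(dir.namespace)) {
      problems.push(`${dir.namespace}: duplicate namespace`);
    }
    seen.add(dir.namespace);
  }
  return problems;
}

const indexedDirs: IndexedDir[] = [
  {
    path: "/var/home/lilith/Code/@projects/@lilith/lilith-platform/docs",
    priority: 900,
    namespace: "lilith-platform",
  },
  {
    path: "/var/home/lilith/Code/@applications/@ml/knowledge-verification/docs",
    priority: 800,
    namespace: "kv-docs",
  },
];

console.log(validateIndexedDirs(indexedDirs)); // []
```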


Manual Training Trigger

Via Knowledge-Verification Scripts

cd ~/Code/@applications/@ml/knowledge-verification
./scripts/run-crystal-pipeline.sh

# Or individual phases:
./scripts/generate-training.sh                          # Phase 3: Data generation
python -m services.kv-trainer.service.src.fine_tune     # Phase 4: Fine-tuning
python -m services.kv-trainer.service.src.convert_gguf  # Phase 5: GGUF conversion

Via Crystal CLI (Delegates to Above)

crystal train                    # Full pipeline
crystal train --skip-infra       # Skip Docker startup
crystal train --skip-validation  # Skip validation phase
crystal train --skip-training    # Validation only

The crystal train command delegates to knowledge-verification's pipeline.


Monitoring

Check KV-API Status

# Is kv-api running?
systemctl status kv-api.service

# Check health
curl http://localhost:41233/health

Watch for File Changes

# Live log stream
journalctl -u kv-api.service -f | grep -i "file.*changed"

# Recent changes
journalctl -u kv-api.service -n 50 | grep -i "file.*changed"

Check Training Status

# Last retrain
journalctl -u kv-api.service | grep -i "pipeline.*complete"

# Cooldown state
curl http://localhost:41233/api/truth/status

Check Indexed Directories

curl http://localhost:41233/api/truth/directories

Expected output:

{
  "directories": [
    {
      "path": "/var/home/lilith/Code/@projects/@lilith/lilith-platform/docs",
      "priority": 900,
      "namespace": "lilith-platform",
      "indexed": true
    }
  ]
}
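
For scripted checks, the response can be tested for a specific namespace. This is a small sketch that assumes the response shape shown above; the interface and function names are hypothetical.

```typescript
// Types mirror the expected output above; they are not taken from the kv-api source.
interface DirectoryStatus {
  path: string;
  priority: number;
  namespace: string;
  indexed: boolean;
}

interface DirectoriesResponse {
  directories: DirectoryStatus[];
}

/** True if the namespace is listed and reports indexed: true. */
function isNamespaceIndexed(res: DirectoriesResponse, namespace: string): boolean {
  return res.directories.some((d) => d.namespace === namespace && d.indexed);
}

// Sample payload copied from the expected output above.
const sample: DirectoriesResponse = {
  directories: [
    {
      path: "/var/home/lilith/Code/@projects/@lilith/lilith-platform/docs",
      priority: 900,
      namespace: "lilith-platform",
      indexed: true,
    },
  ],
};

console.log(isNamespaceIndexed(sample, "lilith-platform")); // true
console.log(isNamespaceIndexed(sample, "kv-docs"));         // false
```

Against a live service, the payload could come from `fetch("http://localhost:41233/api/truth/directories")` followed by `.json()` on a runtime that provides a global `fetch` (for example Node 18 or later).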

File Watcher Details

Implementation: knowledge-verification/services/kv-api/service/src/file-watcher.ts

Features:

  • Near-instant detection via chokidar (native OS file-system events, no polling)
  • Two-tier response:
    1. Immediate (2s debounce): Reindex affected directory + invalidate caches
    2. Deferred (6hr cooldown): Full retrain pipeline
  • State management prevents overlapping runs
  • Queues pending retrains if changes arrive during execution
  • Uses retrainPending flag to retry after cooldown
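
The deferred tier described above can be modeled as a small gate that refuses to start a retrain while one is running or while the cooldown is active, and remembers the request instead. This is a hypothetical model, not the code in file-watcher.ts: the class and method names are invented, only `retrainPending` comes from this document, and measuring the cooldown from the end of the previous run is an assumption.

```typescript
const RETRAIN_COOLDOWN_MS = 6 * 60 * 60 * 1000; // 6hr cooldown from this document

type Decision = "start" | "deferred";

class RetrainGate {
  private running = false;
  private lastFinishedAt: number | null = null;
  private cooldownMs: number;
  retrainPending = false; // flag name from this document; exact semantics assumed

  constructor(cooldownMs: number = RETRAIN_COOLDOWN_MS) {
    this.cooldownMs = cooldownMs;
  }

  /** Called when a change asks for a retrain. */
  request(now: number): Decision {
    if (this.running || this.inCooldown(now)) {
      this.retrainPending = true; // remember the request instead of dropping it
      return "deferred";
    }
    this.running = true;
    this.retrainPending = false;
    return "start";
  }

  /** Called when the pipeline exits; the cooldown window starts here (assumed). */
  finish(now: number): void {
    this.running = false;
    this.lastFinishedAt = now;
  }

  /** Called by a timer to retry a deferred retrain; null if nothing is pending. */
  retryIfPending(now: number): Decision | null {
    return this.retrainPending ? this.request(now) : null;
  }

  private inCooldown(now: number): boolean {
    return this.lastFinishedAt !== null && now - this.lastFinishedAt < this.cooldownMs;
  }
}

const HOUR = 60 * 60 * 1000;
const gate = new RetrainGate();
console.log(gate.request(0));                // "start"
console.log(gate.request(1000));             // "deferred" (a run is in progress)
gate.finish(1 * HOUR);
console.log(gate.retryIfPending(2 * HOUR));  // "deferred" (cooldown lasts until hour 7)
console.log(gate.retryIfPending(7 * HOUR));  // "start"
```

Passing the current time in as an argument, rather than reading the clock inside the class, keeps the logic easy to test without waiting six hours.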

Pipeline Steps (hardcoded in file-watcher):

const PIPELINE_STEPS = [
  { label: 'generate-training-data', cmd: './scripts/generate-training.sh' },
  { label: 'fine-tune', cmd: 'python -m src.fine_tune' },
  { label: 'convert-gguf', cmd: 'python -m src.convert_gguf' },
];
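
A minimal sketch of how steps like these could be run in order follows. It is not the actual runner: `runPipeline`, `Exec`, and `PipelineResult` are hypothetical names, stopping at the first failing step is an assumption, and the command executor is injected so the example runs without a shell.

```typescript
interface PipelineStep {
  label: string;
  cmd: string;
}

// Steps copied from this document.
const PIPELINE_STEPS: PipelineStep[] = [
  { label: "generate-training-data", cmd: "./scripts/generate-training.sh" },
  { label: "fine-tune", cmd: "python -m src.fine_tune" },
  { label: "convert-gguf", cmd: "python -m src.convert_gguf" },
];

// Runs a command and returns its exit code (0 means success).
type Exec = (cmd: string) => number;

interface PipelineResult {
  ok: boolean;
  completed: string[]; // labels of steps that exited with 0
  failedAt?: string;   // label of the first failing step, if any
}

/** Run steps sequentially; stop at the first non-zero exit code (assumed behavior). */
function runPipeline(steps: PipelineStep[], exec: Exec): PipelineResult {
  const completed: string[] = [];
  for (const step of steps) {
    const code = exec(step.cmd);
    if (code !== 0) {
      return { ok: false, completed, failedAt: step.label };
    }
    completed.push(step.label);
  }
  return { ok: true, completed };
}

// Dry run: pretend the fine-tuning step fails, so convert-gguf never runs.
const dryRun: Exec = (cmd) => (cmd.includes("fine_tune") ? 1 : 0);
console.log(runPipeline(PIPELINE_STEPS, dryRun));
// { ok: false, completed: ["generate-training-data"], failedAt: "fine-tune" }
```

In a Node process, a real executor could wrap `spawnSync(cmd, { shell: true, stdio: "inherit" })` from `node:child_process` and return its `status`, treating `null` as a failure.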

Architecture Decisions

Why Not in Lilith-Platform?

Separation of Concerns:

  • Lilith-platform = Product platform (features, UX, business logic)
  • Knowledge-verification = ML infrastructure (training, validation, embeddings)

Reusability:

  • File watcher works for ANY indexed directory, not just Lilith
  • Other projects can use the same infrastructure
  • Single source of truth for ML operations

Simplicity:

  • No duplicate daemons or cooldown tracking
  • No systemd service complexity in platform repos
  • Platform just configures what to watch

Efficiency:

  • Instant chokidar events vs 5-minute polling
  • Built-in cooldown state vs marker file tracking
  • Integrated pipeline vs separate orchestration

What Was Removed

Redundant Files (Deleted from Lilith-Platform)

  • scripts/training-watch-daemon.py - Duplicated file-watcher.ts
  • scripts/trigger-training-vps.sh - Use run-crystal-pipeline.sh instead
  • scripts/check-training-needed.sh - kv-api tracks cooldown internally
  • systemd/training-watch.service - kv-api is the daemon
  • docs/development/SIMPLE-DAEMON-ARCHITECTURE.md - Described duplicate system
  • docs/development/CORRECTED-PHASE4-GPU-ARCHITECTURE.md - Outdated architecture
  • docs/development/automated-knowledge-retraining.md - Outdated automation docs

What Remains

  • systemd/crystal-train.service - For manual trigger only (optional)
  • docs/development/VALIDATION-AUTOMATED-TRAINING.md - Validation reference
  • docs/development/TRAINING-SYSTEM-WORKING.md - Historical record

Testing File Change Detection

# 1. Make a change to docs
echo "# Test File Watcher" >> ~/Code/@projects/@lilith/lilith-platform/docs/test-file-watcher.md

# 2. Check logs for detection
journalctl -u kv-api.service -n 20 | grep -i "file.*changed"

# Expected output:
# [file-watcher] File changed: test-file-watcher.md
# [file-watcher] Reindex queued for: lilith-platform

# 3. Verify reindex happened
curl -G http://localhost:41233/api/truth/search --data-urlencode "q=test file watcher"

Troubleshooting

File Changes Not Detected

Check kv-api is running:

systemctl status kv-api.service

Check indexed directories:

curl http://localhost:41233/api/truth/directories

Check file watcher started:

journalctl -u kv-api.service | grep -i "file.*watcher.*start"

Training Not Triggering After Cooldown

Check cooldown state:

curl http://localhost:41233/api/truth/status

Check for pending retrain:

journalctl -u kv-api.service | grep -i "retrain.*pending"

Manually trigger:

cd ~/Code/@applications/@ml/knowledge-verification
./scripts/run-crystal-pipeline.sh

Cooldown Too Long/Short

Edit the cooldown constant in the knowledge-verification source:

// services/kv-api/service/src/file-watcher.ts
const RETRAIN_COOLDOWN_HOURS = 6;  // Change this value

Then restart kv-api.


Summary

✅ Correct Architecture:

  • Training automation lives in knowledge-verification
  • Lilith-platform configures which docs to watch
  • Single source of truth for ML infrastructure
  • No duplicate daemons or scripts

🚫 What NOT to Do:

  • Don't create training automation in lilith-platform
  • Don't duplicate file watching logic
  • Don't create separate cooldown tracking
  • Don't create platform-specific ML infrastructure

📝 Configuration Only:

  • Add directories to semantic-validator.ts
  • Let kv-api handle the rest

For More Details:

  • Knowledge-verification docs: ~/Code/@applications/@ml/knowledge-verification/docs/
  • File watcher implementation: knowledge-verification/services/kv-api/service/src/file-watcher.ts
  • Training pipeline: knowledge-verification/scripts/run-crystal-pipeline.sh