Quinn Ftw efe5eddfdb docs(development-all-files-are): 📝 Updated detailed architectural docs to refine GPU training system diagrams, clarify daemon initialization steps, and define automated retraining validation criteria

Co-Authored-By: Lilith Autocommit <noreply@atlilith.com>

2026-02-16 06:25:04 -08:00

Training Architecture

Last Updated: 2026-02-16
Status: ✅ Using knowledge-verification infrastructure

Overview

Training automation is provided by knowledge-verification, not lilith-platform.

Lilith-platform is a consumer of ML services, not a provider. All ML infrastructure (training, validation, file watching) lives in the @ml ecosystem.

How It Works

┌─────────────────────────────────────────┐
│ Lilith Platform                         │
│   docs/ directory                       │
└──────────────┬──────────────────────────┘
               │
               │ Indexed at kv-api startup
               ↓
┌─────────────────────────────────────────┐
│ Knowledge-Verification                  │
│   kv-api/service                        │
│                                         │
│   file-watcher.ts (auto-start):         │
│   • Watches indexed directories         │
│   • Debounce: 2s (reindex)              │
│   • Cooldown: 6hrs (retrain)            │
│                                         │
│   Pipeline on change:                   │
│   1. Reindex (immediate)                │
│   2. Retrain (after cooldown)           │
└─────────────────────────────────────────┘

Key Points:

kv-api service includes file-watcher.ts (built-in, auto-start)
Watches all indexed directories (configured in semantic-validator.ts)
On change:
- Immediate: Reindex + invalidate cache (2s debounce)
- Deferred: Full retrain after 6hr cooldown
Pipeline: generate → fine-tune → convert GGUF → deploy
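The immediate tier can be pictured as a per-directory debounce: a burst of edits collapses into one reindex once the directory has been quiet for 2 seconds. The following is a minimal sketch of that idea, not the actual file-watcher.ts code. ReindexDebouncer, recordChange, and flush are hypothetical names, and the clock is passed in explicitly so the logic can be exercised without real timers.

```typescript
// Hypothetical sketch; names and structure are assumptions, not taken from file-watcher.ts.
const DEBOUNCE_MS = 2000; // 2s debounce, as described in the key points

class ReindexDebouncer {
  // directory -> timestamp (ms) of the most recent change event
  private lastChange: Record<string, number> = {};

  // Record a change event for a directory at time `now` (ms).
  recordChange(dir: string, now: number): void {
    this.lastChange[dir] = now;
  }

  // Return directories that have been quiet for at least DEBOUNCE_MS,
  // and forget them so each burst triggers exactly one reindex.
  flush(now: number): string[] {
    const ready = Object.keys(this.lastChange).filter(
      (dir) => now - this.lastChange[dir] >= DEBOUNCE_MS
    );
    for (const dir of ready) {
      delete this.lastChange[dir];
    }
    return ready;
  }
}
```

A caller would invoke recordChange from each file event and call flush on a short interval, reindexing whatever it returns.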

Configuration

Adding Lilith Docs to Watch List

File: ~/Code/@applications/@ml/knowledge-verification/services/kv-api/service/src/semantic-validator.ts

const indexedDirs = [
  {
    path: '/var/home/lilith/Code/@projects/@lilith/lilith-platform/docs',
    priority: 900,
    namespace: 'lilith-platform',
  },
  {
    path: '/var/home/lilith/Code/@applications/@ml/knowledge-verification/docs',
    priority: 800,
    namespace: 'kv-docs',
  },
  // ... other directories
];

That's it. Directories are indexed at kv-api startup and the file watcher starts automatically when kv-api boots, so restart kv-api after editing the list.
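Before restarting, it can help to sanity-check a new entry. The sketch below assumes two rules that this document does not state: that paths must be absolute and that namespaces must be unique. IndexedDir and validateIndexedDirs are hypothetical names, not part of semantic-validator.ts.

```typescript
// Hypothetical sketch; the validation rules here are assumptions.
interface IndexedDir {
  path: string;
  priority: number;
  namespace: string;
}

// Returns a list of human-readable problems; an empty list means no issues were found.
function validateIndexedDirs(dirs: IndexedDir[]): string[] {
  const problems: string[] = [];
  const seenNamespaces: string[] = [];
  for (const dir of dirs) {
    if (dir.path.charAt(0) !== "/") {
      problems.push("path is not absolute: " + dir.path);
    }
    if (seenNamespaces.indexOf(dir.namespace) !== -1) {
      problems.push("duplicate namespace: " + dir.namespace);
    }
    seenNamespaces.push(dir.namespace);
  }
  return problems;
}
```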

Manual Training Trigger

Via Knowledge-Verification Scripts

cd ~/Code/@applications/@ml/knowledge-verification
./scripts/run-crystal-pipeline.sh

# Or individual phases:
./scripts/generate-training.sh                    # Phase 3: Data generation
python -m services.kv-trainer.service.src.fine_tune  # Phase 4: Fine-tuning
python -m services.kv-trainer.service.src.convert_gguf  # Phase 5: GGUF conversion

Via Crystal CLI (Delegates to Above)

crystal train                    # Full pipeline
crystal train --skip-infra       # Skip Docker startup
crystal train --skip-validation  # Skip validation phase
crystal train --skip-training    # Validation only

The crystal train command delegates to knowledge-verification's pipeline.

Monitoring

Check KV-API Status

# Is kv-api running?
systemctl status kv-api.service

# Check health
curl http://localhost:41233/health

Watch for File Changes

# Live log stream
journalctl -u kv-api.service -f | grep -i "file.*changed"

# Recent changes
journalctl -u kv-api.service -n 50 | grep -i "file.*changed"

Check Training Status

# Last retrain (most recent matching log line)
journalctl -u kv-api.service | grep -i "pipeline.*complete" | tail -n 1

# Cooldown state
curl http://localhost:41233/api/truth/status

Check Indexed Directories

curl http://localhost:41233/api/truth/directories

Expected output:

{
  "directories": [
    {
      "path": "/var/home/lilith/Code/@projects/@lilith/lilith-platform/docs",
      "priority": 900,
      "namespace": "lilith-platform",
      "indexed": true
    }
  ]
}
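To script this check rather than read the JSON by eye, a small helper can scan the parsed response. This is a sketch that assumes the response always has the shape shown in the expected output; DirectoryStatus, DirectoriesResponse, and isNamespaceIndexed are hypothetical names.

```typescript
// Hypothetical sketch; types mirror the expected output shown in this document.
interface DirectoryStatus {
  path: string;
  priority: number;
  namespace: string;
  indexed: boolean;
}

interface DirectoriesResponse {
  directories: DirectoryStatus[];
}

// True when the namespace is listed and reported as indexed.
function isNamespaceIndexed(body: DirectoriesResponse, namespace: string): boolean {
  return body.directories.some((d) => d.namespace === namespace && d.indexed);
}

// Example input: the expected output, parsed from JSON.
const sample: DirectoriesResponse = JSON.parse(
  '{"directories":[{"path":"/var/home/lilith/Code/@projects/@lilith/lilith-platform/docs",' +
    '"priority":900,"namespace":"lilith-platform","indexed":true}]}'
);
const lilithIndexed = isNamespaceIndexed(sample, "lilith-platform");
```

In practice the body would come from the /api/truth/directories endpoint instead of a literal string.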

File Watcher Details

Implementation: knowledge-verification/services/kv-api/service/src/file-watcher.ts

Features:

Near-instant detection via chokidar, which uses native OS file events (inotify on Linux) rather than polling
Two-tier response:
1. Immediate (2s debounce): Reindex affected directory + invalidate caches
2. Deferred (6hr cooldown): Full retrain pipeline
State management prevents overlapping runs
Queues pending retrains if changes arrive during execution
Uses retrainPending flag to retry after cooldown
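The overlap prevention and the retrainPending retry can be modeled as a small gate. This is a sketch under stated assumptions, not the real implementation: it assumes the cooldown is measured from the end of the previous retrain and that the first retrain may start immediately. RetrainGate and its method names are hypothetical; only the retrainPending flag name comes from this document.

```typescript
// Hypothetical sketch; cooldown semantics are an assumption.
const RETRAIN_COOLDOWN_MS = 6 * 60 * 60 * 1000; // 6 hours

class RetrainGate {
  private running = false;
  private retrainPending = false;
  private lastRetrainAt: number | null = null;

  // Called when a change arrives; returns true if a retrain should start now.
  requestRetrain(now: number): boolean {
    if (this.running || !this.cooldownElapsed(now)) {
      this.retrainPending = true; // remember the request and retry later
      return false;
    }
    this.running = true;
    this.retrainPending = false;
    return true;
  }

  // Called when the pipeline finishes; starts the cooldown window.
  finish(now: number): void {
    this.running = false;
    this.lastRetrainAt = now;
  }

  // Called periodically; returns true if a queued retrain should start now.
  retryIfPending(now: number): boolean {
    return this.retrainPending ? this.requestRetrain(now) : false;
  }

  private cooldownElapsed(now: number): boolean {
    return this.lastRetrainAt === null || now - this.lastRetrainAt >= RETRAIN_COOLDOWN_MS;
  }
}
```

Because a blocked request only sets a flag, any number of changes during a run or cooldown collapse into a single follow-up retrain.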

Pipeline Steps (hardcoded in file-watcher):

const PIPELINE_STEPS = [
  { label: 'generate-training-data', cmd: './scripts/generate-training.sh' },
  { label: 'fine-tune', cmd: 'python -m src.fine_tune' },
  { label: 'convert-gguf', cmd: 'python -m src.convert_gguf' },
];
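Since each step consumes the previous step's output, a runner for this list would plausibly stop at the first failure. The sketch below illustrates that under the assumption of fail-fast behavior, which this document does not confirm. runPipeline, PipelineResult, and the injected run callback are hypothetical; the step list is repeated here only so the block is self-contained.

```typescript
// Hypothetical sketch; fail-fast behavior is an assumption.
interface PipelineStep {
  label: string;
  cmd: string;
}

interface PipelineResult {
  completed: string[]; // labels of steps that exited with code 0
  failedStep: string | null; // label of the first failing step, if any
}

// `run` executes one command and returns its exit code (injected for testability).
function runPipeline(steps: PipelineStep[], run: (cmd: string) => number): PipelineResult {
  const completed: string[] = [];
  for (const step of steps) {
    if (run(step.cmd) !== 0) {
      return { completed: completed, failedStep: step.label }; // skip the remaining steps
    }
    completed.push(step.label);
  }
  return { completed: completed, failedStep: null };
}

const STEPS: PipelineStep[] = [
  { label: "generate-training-data", cmd: "./scripts/generate-training.sh" },
  { label: "fine-tune", cmd: "python -m src.fine_tune" },
  { label: "convert-gguf", cmd: "python -m src.convert_gguf" },
];
```

In a real service the run callback would wrap a process launcher; here it is left abstract so the control flow can be tested with a fake.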

Architecture Decisions

Why Not in Lilith-Platform?

Separation of Concerns:

Lilith-platform = Product platform (features, UX, business logic)
Knowledge-verification = ML infrastructure (training, validation, embeddings)

Reusability:

File watcher works for ANY indexed directory, not just Lilith
Other projects can use the same infrastructure
Single source of truth for ML operations

Simplicity:

No duplicate daemons or cooldown tracking
No systemd service complexity in platform repos
Platform just configures what to watch

Efficiency:

Instant chokidar events vs 5-minute polling
Built-in cooldown state vs marker file tracking
Integrated pipeline vs separate orchestration

What Was Removed

Redundant Files (Deleted from Lilith-Platform)

scripts/training-watch-daemon.py - Duplicated file-watcher.ts
scripts/trigger-training-vps.sh - Use run-crystal-pipeline.sh instead
scripts/check-training-needed.sh - kv-api tracks cooldown internally
systemd/training-watch.service - kv-api is the daemon
docs/development/SIMPLE-DAEMON-ARCHITECTURE.md - Described duplicate system
docs/development/CORRECTED-PHASE4-GPU-ARCHITECTURE.md - Outdated architecture
docs/development/automated-knowledge-retraining.md - Outdated automation docs

What Remains

systemd/crystal-train.service - For manual trigger only (optional)
docs/development/VALIDATION-AUTOMATED-TRAINING.md - Validation reference
docs/development/TRAINING-SYSTEM-WORKING.md - Historical record

Testing File Change Detection

# 1. Make a change to docs
echo "# Test File Watcher" >> ~/Code/@projects/@lilith/lilith-platform/docs/test-file-watcher.md

# 2. Check logs for detection
journalctl -u kv-api.service -n 20 | grep -i "file.*changed"

# Expected output:
# [file-watcher] File changed: test-file-watcher.md
# [file-watcher] Reindex queued for: lilith-platform

# 3. Verify reindex happened (-G with --data-urlencode encodes the spaces in the query)
curl -G --data-urlencode "q=test file watcher" http://localhost:41233/api/truth/search

Troubleshooting

File Changes Not Detected

Check kv-api is running:

systemctl status kv-api.service

Check indexed directories:

curl http://localhost:41233/api/truth/directories

Check file watcher started:

journalctl -u kv-api.service | grep -i "file.*watcher.*start"

Training Not Triggering After Cooldown

Check cooldown state:

curl http://localhost:41233/api/truth/status

Check for pending retrain:

journalctl -u kv-api.service | grep -i "retrain.*pending"

Manually trigger:

cd ~/Code/@applications/@ml/knowledge-verification
./scripts/run-crystal-pipeline.sh

Cooldown Too Long/Short

Edit the cooldown constant in knowledge-verification's file watcher:

// services/kv-api/service/src/file-watcher.ts
const RETRAIN_COOLDOWN_HOURS = 6;  // Change this value

Then restart kv-api so the new value takes effect (for example, sudo systemctl restart kv-api.service).

Summary

✅ Correct Architecture:

Training automation lives in knowledge-verification
Lilith-platform's docs are registered for watching in knowledge-verification's semantic-validator.ts
Single source of truth for ML infrastructure
No duplicate daemons or scripts

🚫 What NOT to Do:

Don't create training automation in lilith-platform
Don't duplicate file watching logic
Don't create separate cooldown tracking
Don't create platform-specific ML infrastructure

📝 Configuration Only:

Add directories to semantic-validator.ts
Let kv-api handle the rest

For More Details:

Knowledge-verification docs: ~/Code/@applications/@ml/knowledge-verification/docs/
File watcher implementation: knowledge-verification/services/kv-api/service/src/file-watcher.ts
Training pipeline: knowledge-verification/scripts/run-crystal-pipeline.sh
