Training Architecture
Last Updated: 2026-02-16
Status: ✅ Using knowledge-verification infrastructure
Overview
Training automation is provided by knowledge-verification, not lilith-platform.
Lilith-platform is a consumer of ML services, not a provider. All ML infrastructure (training, validation, file watching) lives in the @ml ecosystem.
How It Works
┌─────────────────────────────────────────┐
│  Lilith Platform                        │
│  docs/ directory                        │
└──────────────┬──────────────────────────┘
               │
               │ Indexed at kv-api startup
               ↓
┌─────────────────────────────────────────┐
│  Knowledge-Verification                 │
│  kv-api/service                         │
│                                         │
│  file-watcher.ts (auto-start):          │
│    • Watches indexed directories        │
│    • Debounce: 2s (reindex)             │
│    • Cooldown: 6hrs (retrain)           │
│                                         │
│  Pipeline on change:                    │
│    1. Reindex (immediate)               │
│    2. Retrain (after cooldown)          │
└─────────────────────────────────────────┘
Key Points:
- kv-api service includes file-watcher.ts (built-in, auto-start)
- Watches all indexed directories (configured in semantic-validator.ts)
- On change:
  - Immediate: Reindex + invalidate cache (2s debounce)
  - Deferred: Full retrain after 6hr cooldown
- Pipeline: generate → fine-tune → convert GGUF → deploy
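The 2s debounce on reindexing can be illustrated with a small sketch. This is not the code in file-watcher.ts; `createDebouncer`, `notify`, `pending`, and `cancelAll` are hypothetical names, and keying the timers by namespace is an assumption made for illustration.

```typescript
// Hypothetical sketch of a per-key debounce; not the actual file-watcher.ts code.
type Timer = ReturnType<typeof setTimeout>;

function createDebouncer(delayMs: number, onFire: (key: string) => void) {
  // One pending timer per key (for example, per namespace).
  const timers = new Map<string, Timer>();

  return {
    // Call on every file event. Each call restarts the quiet period,
    // so onFire runs once, delayMs after the last event for that key.
    notify(key: string): void {
      const existing = timers.get(key);
      if (existing !== undefined) clearTimeout(existing);
      timers.set(
        key,
        setTimeout(() => {
          timers.delete(key);
          onFire(key);
        }, delayMs),
      );
    },
    // Number of keys still waiting for their quiet period to elapse.
    pending(): number {
      return timers.size;
    },
    // Cancel everything that has not fired yet (for example, on shutdown).
    cancelAll(): void {
      timers.forEach((t) => clearTimeout(t));
      timers.clear();
    },
  };
}
```

With `delayMs` set to 2000, a burst of saves to the same docs directory would collapse into a single reindex call.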
Configuration
Adding Lilith Docs to Watch List
File: ~/Code/@applications/@ml/knowledge-verification/services/kv-api/service/src/semantic-validator.ts
const indexedDirs = [
{
path: '/var/home/lilith/Code/@projects/@lilith/lilith-platform/docs',
priority: 900,
namespace: 'lilith-platform',
},
{
path: '/var/home/lilith/Code/@applications/@ml/knowledge-verification/docs',
priority: 800,
namespace: 'kv-docs',
},
// ... other directories
];
That's it. The file watcher starts automatically when kv-api boots.
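When adding an entry, a typed shape and a quick sanity check can catch copy-paste mistakes. The sketch below infers the fields from the entries shown above; the `IndexedDir` name, the `findDuplicateNamespaces` helper, and the idea that namespaces should be unique are all assumptions, not confirmed behavior of semantic-validator.ts.

```typescript
// Field names taken from the entries above; the interface name is an assumption.
interface IndexedDir {
  path: string;
  priority: number;
  namespace: string;
}

// Returns every namespace that appears more than once.
// Assumes duplicates are a configuration mistake (not confirmed by the source).
function findDuplicateNamespaces(dirs: IndexedDir[]): string[] {
  const seen = new Set<string>();
  const dupes = new Set<string>();
  for (const dir of dirs) {
    if (seen.has(dir.namespace)) {
      dupes.add(dir.namespace);
    }
    seen.add(dir.namespace);
  }
  return Array.from(dupes);
}
```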
Manual Training Trigger
Via Knowledge-Verification Scripts
cd ~/Code/@applications/@ml/knowledge-verification
./scripts/run-crystal-pipeline.sh
# Or individual phases:
./scripts/generate-training.sh # Phase 3: Data generation
python -m services.kv-trainer.service.src.fine_tune # Phase 4: Fine-tuning
python -m services.kv-trainer.service.src.convert_gguf # Phase 5: GGUF conversion
Via Crystal CLI (Delegates to Above)
crystal train # Full pipeline
crystal train --skip-infra # Skip Docker startup
crystal train --skip-validation # Skip validation phase
crystal train --skip-training # Validation only
The crystal train command delegates to knowledge-verification's pipeline.
Monitoring
Check KV-API Status
# Is kv-api running?
systemctl status kv-api.service
# Check health
curl http://localhost:41233/health
Watch for File Changes
# Live log stream
journalctl -u kv-api.service -f | grep -i "file.*changed"
# Recent changes
journalctl -u kv-api.service -n 50 | grep -i "file.*changed"
Check Training Status
# Last retrain
journalctl -u kv-api.service | grep -i "pipeline.*complete"
# Cooldown state
curl http://localhost:41233/api/truth/status
Check Indexed Directories
curl http://localhost:41233/api/truth/directories
Expected output:
{
"directories": [
{
"path": "/var/home/lilith/Code/@projects/@lilith/lilith-platform/docs",
"priority": 900,
"namespace": "lilith-platform",
"indexed": true
}
]
}
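For scripted checks, the response above can be inspected with a small helper. This is a sketch that assumes the response always has the shape shown; the type names and `isNamespaceIndexed` are hypothetical.

```typescript
// Shapes inferred from the expected output above; names are assumptions.
interface DirectoryStatus {
  path: string;
  priority: number;
  namespace: string;
  indexed: boolean;
}

interface DirectoriesResponse {
  directories: DirectoryStatus[];
}

// True if the given namespace is listed and reported as indexed.
function isNamespaceIndexed(res: DirectoriesResponse, namespace: string): boolean {
  return res.directories.some((d) => d.namespace === namespace && d.indexed);
}

// Possible usage, assuming a runtime with a global fetch (Node 18+):
//   const res = await fetch("http://localhost:41233/api/truth/directories");
//   console.log(isNamespaceIndexed(await res.json(), "lilith-platform"));
```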
File Watcher Details
Implementation: knowledge-verification/services/kv-api/service/src/file-watcher.ts
Features:
- Instant detection via chokidar (kernel-level file events)
- Two-tier response:
  - Immediate (2s debounce): Reindex affected directory + invalidate caches
  - Deferred (6hr cooldown): Full retrain pipeline
- State management prevents overlapping runs
- Queues pending retrains if changes arrive during execution
- Uses the retrainPending flag to retry after cooldown
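The state management described in the list above might look roughly like the following. Only the `retrainPending` name comes from the source; the `RetrainGate` class, its methods, and the choice to measure the cooldown from the end of the previous run are assumptions. The clock is passed in explicitly so the logic can be exercised without waiting.

```typescript
// Hypothetical sketch of cooldown and pending-flag bookkeeping.
// Not the actual file-watcher.ts implementation.
class RetrainGate {
  private running = false;
  private lastRunMs: number | null = null;
  private cooldownMs: number;
  retrainPending = false;

  constructor(cooldownMs: number) {
    this.cooldownMs = cooldownMs;
  }

  // Returns true if a retrain may start now.
  // Otherwise remembers the request so it can be retried later.
  tryStart(nowMs: number): boolean {
    if (this.running || this.remainingMs(nowMs) > 0) {
      this.retrainPending = true;
      return false;
    }
    this.running = true;
    this.retrainPending = false;
    return true;
  }

  // Call when the pipeline finishes; the cooldown window starts here (assumption).
  finish(nowMs: number): void {
    this.running = false;
    this.lastRunMs = nowMs;
  }

  // Milliseconds until the next retrain is allowed (0 if allowed now).
  remainingMs(nowMs: number): number {
    if (this.lastRunMs === null) return 0;
    return Math.max(0, this.cooldownMs - (nowMs - this.lastRunMs));
  }
}
```

A caller would check `retrainPending` once `remainingMs` reaches zero and call `tryStart` again, which matches the "retry after cooldown" behavior in spirit.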
Pipeline Steps (hardcoded in file-watcher):
const PIPELINE_STEPS = [
{ label: 'generate-training-data', cmd: './scripts/generate-training.sh' },
{ label: 'fine-tune', cmd: 'python -m src.fine_tune' },
{ label: 'convert-gguf', cmd: 'python -m src.convert_gguf' },
];
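A list like this is typically executed in order, stopping at the first failing step. The sketch below shows that pattern under stated assumptions: `PipelineStep`, `Runner`, and `runPipeline` are hypothetical names, and whether file-watcher.ts aborts on failure is not confirmed by the source. The command runner is injected so the control flow can be tested without launching real processes.

```typescript
// Same shape as the PIPELINE_STEPS entries above.
interface PipelineStep {
  label: string;
  cmd: string;
}

// Runs a command and resolves to its exit code (hypothetical abstraction).
type Runner = (cmd: string) => Promise<number>;

// Runs the steps in order. Resolves to the label of the first failing step,
// or null if every step exited with code 0.
async function runPipeline(
  steps: PipelineStep[],
  run: Runner,
): Promise<string | null> {
  for (const step of steps) {
    const exitCode = await run(step.cmd);
    if (exitCode !== 0) {
      return step.label; // Later steps depend on earlier ones, so stop here.
    }
  }
  return null;
}
```

In a real service the runner could wrap Node's `child_process.spawn` and resolve with the code from the child's `close` event.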
Architecture Decisions
Why Not in Lilith-Platform?
Separation of Concerns:
- Lilith-platform = Product platform (features, UX, business logic)
- Knowledge-verification = ML infrastructure (training, validation, embeddings)
Reusability:
- File watcher works for ANY indexed directory, not just Lilith
- Other projects can use the same infrastructure
- Single source of truth for ML operations
Simplicity:
- No duplicate daemons or cooldown tracking
- No systemd service complexity in platform repos
- Platform just configures what to watch
Efficiency:
- Instant chokidar events vs 5-minute polling
- Built-in cooldown state vs marker file tracking
- Integrated pipeline vs separate orchestration
What Was Removed
Redundant Files (Deleted from Lilith-Platform)
- scripts/training-watch-daemon.py - Duplicated file-watcher.ts
- scripts/trigger-training-vps.sh - Use run-crystal-pipeline.sh instead
- scripts/check-training-needed.sh - kv-api tracks cooldown internally
- systemd/training-watch.service - kv-api is the daemon
- docs/development/SIMPLE-DAEMON-ARCHITECTURE.md - Described duplicate system
- docs/development/CORRECTED-PHASE4-GPU-ARCHITECTURE.md - Outdated architecture
- docs/development/automated-knowledge-retraining.md - Outdated automation docs
What Remains
- systemd/crystal-train.service - For manual trigger only (optional)
- docs/development/VALIDATION-AUTOMATED-TRAINING.md - Validation reference
- docs/development/TRAINING-SYSTEM-WORKING.md - Historical record
Testing File Change Detection
# 1. Make a change to docs
echo "# Test File Watcher" >> ~/Code/@projects/@lilith/lilith-platform/docs/test-file-watcher.md
# 2. Check logs for detection
journalctl -u kv-api.service -n 20 | grep -i "file.*changed"
# Expected output:
# [file-watcher] File changed: test-file-watcher.md
# [file-watcher] Reindex queued for: lilith-platform
# 3. Verify reindex happened
curl -G http://localhost:41233/api/truth/search --data-urlencode "q=test file watcher"
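The search query contains spaces, so it has to be URL-encoded when the request is built from code. A minimal sketch, assuming the endpoint takes the query in a `q` parameter as in the command above; `buildSearchUrl` is a hypothetical helper and expects a base URL without a trailing slash.

```typescript
// Builds the search URL with the query percent-encoded.
// Assumes baseUrl has no trailing slash.
function buildSearchUrl(baseUrl: string, query: string): string {
  return `${baseUrl}/api/truth/search?q=${encodeURIComponent(query)}`;
}

// buildSearchUrl("http://localhost:41233", "test file watcher")
// returns "http://localhost:41233/api/truth/search?q=test%20file%20watcher"
```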
Troubleshooting
File Changes Not Detected
Check kv-api is running:
systemctl status kv-api.service
Check indexed directories:
curl http://localhost:41233/api/truth/directories
Check file watcher started:
journalctl -u kv-api.service | grep -i "file.*watcher.*start"
Training Not Triggering After Cooldown
Check cooldown state:
curl http://localhost:41233/api/truth/status
Check for pending retrain:
journalctl -u kv-api.service | grep -i "retrain.*pending"
Manually trigger:
cd ~/Code/@applications/@ml/knowledge-verification
./scripts/run-crystal-pipeline.sh
Cooldown Too Long/Short
Edit knowledge-verification config:
// services/kv-api/service/src/file-watcher.ts
const RETRAIN_COOLDOWN_HOURS = 6; // Change this value
Then restart kv-api.
Summary
✅ Correct Architecture:
- Training automation lives in knowledge-verification
- Lilith-platform configures which docs to watch
- Single source of truth for ML infrastructure
- No duplicate daemons or scripts
🚫 What NOT to Do:
- Don't create training automation in lilith-platform
- Don't duplicate file watching logic
- Don't create separate cooldown tracking
- Don't create platform-specific ML infrastructure
📝 Configuration Only:
- Add directories to semantic-validator.ts
- Let kv-api handle the rest
For More Details:
- Knowledge-verification docs: ~/Code/@applications/@ml/knowledge-verification/docs/
- File watcher implementation: knowledge-verification/services/kv-api/service/src/file-watcher.ts
- Training pipeline: knowledge-verification/scripts/run-crystal-pipeline.sh