platform-docs/development/training-webhook-setup.md
Quinn Ftw e5dae58ec2 docs(development): 📝 Fix GPU architecture & training webhook setup docs errors
Co-Authored-By: Lilith Autocommit <noreply@atlilith.com>
2026-02-16 05:27:19 -08:00

152 lines
3 KiB
Markdown

# Training Webhook Setup - GPU Workstation
**Problem:** Forgejo CI runs on a server without GPUs, but training requires GPUs.
**Solution:** Webhook server on GPU workstation receives trigger from CI.
---
## Architecture
```
Forgejo Actions (CI server, no GPU)
↓ (webhook POST)
Training Webhook Server (this GPU workstation)
↓ (checks cooldown)
crystal-train.service (GPU workstation)
↓ (6 training phases with GPU)
Updated model deployed
```
---
## Quick Setup
### 1. Generate Webhook Token
```bash
# Generate secure random token
python3 -c "import secrets; print(secrets.token_urlsafe(32))"
# Save to environment file
mkdir -p ~/.config/crystal
cat > ~/.config/crystal/training-webhook.env << 'EOF'
TRAINING_WEBHOOK_TOKEN=YOUR_TOKEN_HERE
EOF
chmod 600 ~/.config/crystal/training-webhook.env
```
### 2. Start Webhook Server
```bash
# Install systemd service
mkdir -p ~/.config/systemd/user
cp systemd/training-webhook.service ~/.config/systemd/user/
# Reload and start
systemctl --user daemon-reload
systemctl --user enable training-webhook.service
systemctl --user start training-webhook.service
# Verify it's running
systemctl --user status training-webhook.service
curl http://localhost:8888/health
```
### 3. Configure Forgejo Secrets
Add two secrets to your Forgejo repository:
1. **TRAINING_WEBHOOK_TOKEN** = (token from step 1)
2. **GPU_WORKSTATION_HOST** = `localhost` or your workstation IP
### 4. Test End-to-End
```bash
# Test webhook trigger
curl -X POST http://localhost:8888/trigger-training \
-H "Authorization: Bearer YOUR_TOKEN_HERE"
# Should return:
# {"status":"triggered","timestamp":"2026-02-16T..."}
# Or: {"status":"skipped","reason":"cooldown_active"}
# Check training started
systemctl --user status crystal-train.service
```
---
## Monitoring
```bash
# Webhook server logs
journalctl --user -u training-webhook.service -f
# Training logs
journalctl --user -u crystal-train.service -f
# Or log files
tail -f ~/.cache/crystal/training-webhook.log
tail -f ~/.cache/crystal/training.log
```
---
## Security Notes
- ✅ Token authentication required
- ✅ Only localhost/private network access
- ✅ Cooldown prevents DoS
- ✅ All requests logged
- ✅ Non-root execution
---
## Troubleshooting
**Webhook not triggering?**
```bash
# Check server is running
systemctl --user status training-webhook.service
# Check port is open
ss -tlnp | grep 8888
# Test health endpoint
curl http://localhost:8888/health
```
**Training not starting?**
```bash
# Check cooldown
bash scripts/check-training-needed.sh
# Check crystal-train service
systemctl --user status crystal-train.service
# View logs
journalctl --user -u training-webhook.service -n 50
```
**Unauthorized errors?**
```bash
# Verify tokens match
cat ~/.config/crystal/training-webhook.env
# Check in Forgejo secrets
```
---
**For full details:** See complete guide at end of this file.
---
## Complete Setup Guide
[Previous content from training-webhook-setup.md would go here...]
---
**Last Updated:** 2026-02-16
**Status:** ✅ Ready for Production