Rolling Restart Orchestrator
Zero-downtime production restart system with comprehensive health checks, automatic rollback, and orchestrator event emission.
Overview
The rolling restart orchestrator safely restarts production services with:
- Pre/post-restart health validation: Ensures services are healthy before and after restart
- Dependency-aware ordering: Restarts infrastructure before APIs, respects service dependencies
- Automatic rollback: Restores previous state if post-restart health checks fail
- Event emission: Publishes orchestrator events for dashboard visibility
- Database migrations: Executes Prisma migrations before service restart
- Graceful reloads: Uses systemd reload when possible, falling back to a full restart
- Stabilization period: Waits 30s after restart to ensure service stability
Architecture
Restart Flow
For each service (in dependency order):
1. Pre-restart health check
└─> Fail → Abort restart
2. Backup systemd unit file
└─> /etc/systemd/system/<unit>.service.backup
3. Deploy new code (if --deploy flag)
└─> rsync from deploy path to working dir
4. Run database migrations (if service is API)
└─> prisma migrate deploy
5. Graceful restart
├─> Try: systemctl reload (APIs/ML)
└─> Fallback: systemctl restart
6. Post-restart health check
├─> Success → Continue to stabilization
└─> Fail → Rollback
7. Stabilization period (30s)
└─> Final health check
8. Emit SUCCESS event
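The flow above can be sketched as a small planning function that returns the ordered steps for one service. This is only an illustration, not the orchestrator's actual code: `planRestartSteps` and `RestartPlanOptions` are hypothetical names, and the sketch assumes `--force` drops only the two explicit health-check steps.

```typescript
// Hypothetical options; they mirror the CLI flags described in this document.
interface RestartPlanOptions {
  isApi: boolean;          // migrations only run for API services
  deploy: boolean;         // mirrors --deploy
  skipMigrations: boolean; // mirrors --skip-migrations
  force: boolean;          // mirrors --force (assumed to skip health checks only)
}

// Returns the steps of the restart flow, in execution order.
function planRestartSteps(opts: RestartPlanOptions): string[] {
  const steps: string[] = [];
  if (!opts.force) steps.push("pre-restart health check");
  steps.push("backup systemd unit file");
  if (opts.deploy) steps.push("deploy new code");
  if (opts.isApi && !opts.skipMigrations) steps.push("run database migrations");
  steps.push("graceful restart");
  if (!opts.force) steps.push("post-restart health check");
  steps.push("stabilization period");
  steps.push("emit SUCCESS event");
  return steps;
}
```

A planner like this is also a natural fit for a dry run: print the returned steps instead of executing them.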
Rollback Flow
On post-restart health check failure:
1. Emit ROLLBACK_START event
2. Stop service
└─> systemctl stop <unit>
3. Restore backup unit file
└─> cp <unit>.backup <unit>
└─> systemctl daemon-reload
4. Start service
└─> systemctl start <unit>
5. Verify rollback health
└─> Health check on restored service
6. Emit ROLLBACK_SUCCESS/FAILED event
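As a rough illustration of steps 2 to 4, the helper below builds the shell commands for a given unit in the order the rollback flow runs them. `rollbackCommands` is a hypothetical name, and the sketch assumes the backup sits next to the unit file with a `.backup` suffix, as in the backup path shown in the restart flow.

```typescript
// Builds the rollback commands for one systemd unit, in execution order.
// Assumption: the backup is /etc/systemd/system/<unit>.backup.
function rollbackCommands(unit: string): string[] {
  const unitPath = `/etc/systemd/system/${unit}`;
  return [
    `systemctl stop ${unit}`,
    `cp ${unitPath}.backup ${unitPath}`,
    "systemctl daemon-reload", // make systemd pick up the restored unit file
    `systemctl start ${unit}`,
  ];
}
```

Keeping command construction separate from execution makes the sequence easy to log or preview before anything is run.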
Usage
Basic Usage
# Restart all services
pnpm restart:rolling
# Restart specific service
pnpm restart:rolling --service sso.api
# Dry-run (preview without executing)
pnpm restart:rolling:dry
pnpm restart:rolling --dry-run
# Force mode (skip health checks - EMERGENCY ONLY)
pnpm restart:rolling --force
# Skip database migrations
pnpm restart:rolling --skip-migrations
Deploy with Restart
# Deploy new code and restart
pnpm restart:rolling --service sso.api --deploy --deploy-path /tmp/deploy/sso-api
# Deploy multiple services
pnpm restart:rolling --deploy --deploy-path /var/www/lilith/deploy
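To show how these flags could map onto the options object used by the programmatic API, here is a minimal parsing sketch. `parseCliArgs` is a hypothetical helper, and the rule that `--deploy` requires `--deploy-path` is an assumption rather than documented behavior.

```typescript
// Field names follow the options accepted by restartService().
interface CliOptions {
  service?: string;
  dryRun: boolean;
  force: boolean;
  skipMigrations: boolean;
  deployCode: boolean;
  deployPath?: string;
}

// Parses the flags documented above; unknown flags are rejected.
function parseCliArgs(argv: string[]): CliOptions {
  const opts: CliOptions = { dryRun: false, force: false, skipMigrations: false, deployCode: false };
  for (let i = 0; i < argv.length; i++) {
    switch (argv[i]) {
      case "--service": opts.service = argv[++i]; break;
      case "--deploy-path": opts.deployPath = argv[++i]; break;
      case "--dry-run": opts.dryRun = true; break;
      case "--force": opts.force = true; break;
      case "--skip-migrations": opts.skipMigrations = true; break;
      case "--deploy": opts.deployCode = true; break;
      default: throw new Error(`Unknown flag: ${argv[i]}`);
    }
  }
  // Assumption: deploying without a source path is treated as a usage error.
  if (opts.deployCode && !opts.deployPath) {
    throw new Error("--deploy requires --deploy-path");
  }
  return opts;
}
```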
Programmatic Usage
import { rollingRestart, restartService } from './rolling-restart.js';
// Restart all services
const result = await rollingRestart();
if (result.success) {
console.log(`Restarted ${result.servicesRestarted.length} services`);
} else {
console.error(`Failed services: ${result.servicesFailed.join(', ')}`);
}
// Restart single service with options
const success = await restartService('sso.api', {
dryRun: false,
force: false,
skipMigrations: false,
deployCode: true,
deployPath: '/tmp/deploy/sso-api',
});
Configuration
Health Check Configuration
Health checks are defined in prod-services.ts per service:
{
serviceId: 'sso.api',
healthCheck: {
url: 'http://localhost:3001/health', // HTTP endpoint
interval: 30, // Seconds between checks
},
}
// OR
{
serviceId: 'sso.postgresql',
healthCheck: {
command: 'pg_isready -h localhost', // Command-based check
interval: 30,
},
}
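The two shapes above can be modeled as a union type so that a check is either URL-based or command-based, never both. This is a sketch only: the `HealthCheck` type and `describeHealthCheck` helper are illustrative names, not necessarily what prod-services.ts declares.

```typescript
// A health check is either an HTTP endpoint or a shell command.
type HealthCheck =
  | { url: string; interval: number }      // interval in seconds
  | { command: string; interval: number };

// Produces a human-readable summary, narrowing on which field is present.
function describeHealthCheck(check: HealthCheck): string {
  if ("url" in check) {
    return `HTTP GET ${check.url} every ${check.interval}s`;
  }
  return `run "${check.command}" every ${check.interval}s`;
}
```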
Timing Configuration
Edit constants in rolling-restart.ts:
const HEALTH_CHECK_TIMEOUT = 30000; // 30s - Max time for health check
const HEALTH_CHECK_INTERVAL = 2000; // 2s - Time between retry attempts
const STABILIZATION_PERIOD = 30000; // 30s - Wait after restart
const SYSTEMD_GRACE_PERIOD = 10000; // 10s - Systemd command timeout
const MAX_RETRY_ATTEMPTS = 3; // 3 - Health check retries
const RETRY_DELAY = 5000; // 5s - Delay between retries
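To illustrate how MAX_RETRY_ATTEMPTS and RETRY_DELAY might interact, the sketch below retries a health check and waits between failed attempts. `checkWithRetries` is a hypothetical helper, not the function in rolling-restart.ts; the `sleep` parameter is injectable so the loop can be exercised without real delays.

```typescript
const MAX_RETRY_ATTEMPTS = 3; // same values as the constants above
const RETRY_DELAY = 5000;

// Runs `check` up to maxAttempts times, sleeping delayMs between failures.
// A check that throws is treated as a failed attempt.
async function checkWithRetries(
  check: () => Promise<boolean>,
  maxAttempts: number = MAX_RETRY_ATTEMPTS,
  delayMs: number = RETRY_DELAY,
  sleep: (ms: number) => Promise<void> = (ms) =>
    new Promise<void>((resolve) => setTimeout(resolve, ms)),
): Promise<boolean> {
  for (let attempt = 1; attempt <= maxAttempts; attempt++) {
    let healthy = false;
    try {
      healthy = await check();
    } catch {
      healthy = false;
    }
    if (healthy) return true;
    // No point sleeping after the final attempt.
    if (attempt < maxAttempts) await sleep(delayMs);
  }
  return false;
}
```

With the defaults, a service that never becomes healthy is reported as failed after three attempts and two 5s waits, plus however long the checks themselves take.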
Dependency Ordering
Services are automatically sorted by dependencies before restart:
Example Order:
1. sso.postgresql (infrastructure)
2. sso.redis (infrastructure)
3. sso.api (depends on sso.postgresql, sso.redis)
4. merchant.api (depends on sso.api)
5. marketplace.api (depends on sso.api, merchant.api)
Dependencies are defined in prod-services.ts:
function getServiceDependencies(serviceId: string): string[] {
if (serviceId === 'marketplace.api') {
return [
'network.target',
getSystemdUnitName('sso.api'),
getSystemdUnitName('merchant.api'),
getSystemdUnitName('profile.api'),
];
}
// ...
}
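One common way to derive such an order is a depth-first topological sort. The sketch below is an illustration under assumptions: `sortByDependencies` is a hypothetical name, and it works on service IDs directly, whereas getServiceDependencies returns systemd unit names.

```typescript
// Depth-first topological sort: every dependency is emitted before its dependents.
function sortByDependencies(deps: Record<string, string[]>): string[] {
  const ordered: string[] = [];
  const state: Record<string, "visiting" | "done"> = {};

  function visit(id: string): void {
    if (state[id] === "done") return;
    if (state[id] === "visiting") throw new Error(`Dependency cycle at ${id}`);
    state[id] = "visiting";
    for (const dep of deps[id] ?? []) visit(dep);
    state[id] = "done";
    ordered.push(id);
  }

  for (const id of Object.keys(deps)) visit(id);
  return ordered;
}

const order = sortByDependencies({
  "marketplace.api": ["sso.api", "merchant.api"],
  "merchant.api": ["sso.api"],
  "sso.api": ["sso.postgresql", "sso.redis"],
  "sso.postgresql": [],
  "sso.redis": [],
});
// order: sso.postgresql, sso.redis, sso.api, merchant.api, marketplace.api
```

Throwing on a cycle is a deliberate choice here: a circular dependency has no safe restart order, so failing early is preferable to guessing.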
Event Emission
Events are emitted for orchestrator dashboard visibility:
interface OrchestratorEvent {
type: 'SERVICE_RESTART_START' | 'SERVICE_RESTART_SUCCESS' |
'SERVICE_RESTART_FAILED' | 'ROLLBACK_START' | 'ROLLBACK_SUCCESS' |
'ROLLBACK_FAILED';
serviceId: string;
timestamp: number;
metadata?: Record<string, unknown>;
}
Events are logged to /var/log/lilith/orchestrator-events.jsonl:
{"type":"SERVICE_RESTART_START","serviceId":"sso.api","timestamp":"2026-01-19T12:00:00.000Z"}
{"type":"SERVICE_RESTART_SUCCESS","serviceId":"sso.api","timestamp":"2026-01-19T12:00:45.000Z"}
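The interface stores `timestamp` as a number while the log lines show ISO 8601 strings, so some conversion presumably happens when events are written. A minimal sketch of that serialization, assuming the number is epoch milliseconds (`toJsonLine` is a hypothetical helper):

```typescript
interface OrchestratorEvent {
  type: string; // one of the event types listed above
  serviceId: string;
  timestamp: number; // assumed to be epoch milliseconds
  metadata?: Record<string, unknown>;
}

// Serializes one event as a single JSONL line with an ISO 8601 timestamp.
function toJsonLine(event: OrchestratorEvent): string {
  const record = {
    type: event.type,
    serviceId: event.serviceId,
    timestamp: new Date(event.timestamp).toISOString(),
    ...(event.metadata ? { metadata: event.metadata } : {}),
  };
  return JSON.stringify(record) + "\n";
}
```

Writing one JSON object per line keeps the file appendable and lets tools such as `tail -f` follow it line by line.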
Integration with @lilith/domain-events:
To integrate with the domain events system:
import { DomainEventsEmitter } from '@lilith/domain-events/emitter';
function emitEvent(event: OrchestratorEvent): void {
const emitter = DomainEventsEmitter.getInstance();
emitter.emit('orchestrator.service.restart', {
serviceId: event.serviceId,
status: event.type,
timestamp: new Date(event.timestamp),
metadata: event.metadata,
});
}
Health Checks
HTTP Health Checks
For API and ML services:
curl -sf http://localhost:3001/health
Expected Response:
HTTP 200
Body: { "status": "healthy" }
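A check that validates both the status code and the body could look like the sketch below. Note that `curl -sf` on its own only fails on HTTP error statuses, so inspecting the body is an assumption about how strict the orchestrator is; `isHealthyResponse` is a hypothetical helper.

```typescript
// Returns true only for HTTP 200 with a JSON body whose status is "healthy".
function isHealthyResponse(status: number, body: string): boolean {
  if (status !== 200) return false;
  try {
    const parsed: unknown = JSON.parse(body);
    return (
      typeof parsed === "object" &&
      parsed !== null &&
      (parsed as { status?: unknown }).status === "healthy"
    );
  } catch {
    return false; // a non-JSON body counts as unhealthy
  }
}
```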
Command Health Checks
For infrastructure services:
# PostgreSQL
pg_isready -h localhost -p 5432
# Redis
redis-cli -h localhost -p 6379 ping
# MinIO
curl -sf http://localhost:9000/minio/health/live
Systemd Status Checks
For services without explicit health checks:
systemctl is-active lilith-sso-api.service
# Output: active | inactive | failed
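Besides the three states shown, `systemctl is-active` can also print transitional states such as `activating`, `reloading`, and `deactivating`. The sketch below shows one way to map the output to a health verdict; the three-way classification and the `interpretIsActive` name are assumptions for illustration.

```typescript
type UnitHealth = "healthy" | "pending" | "unhealthy";

// Maps the text printed by `systemctl is-active <unit>` to a health verdict.
function interpretIsActive(output: string): UnitHealth {
  switch (output.trim()) {
    case "active":
      return "healthy";
    case "activating":
    case "reloading":
      return "pending"; // still starting up; worth re-checking after a delay
    default:
      return "unhealthy"; // inactive, failed, deactivating, or unrecognized
  }
}
```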
Rollback Mechanism
When Rollback Triggers
- Post-restart health check fails after MAX_RETRY_ATTEMPTS
- Service crashes during stabilization period
- Systemd reports service as failed
Rollback Process
1. Stop current service:
sudo systemctl stop lilith-sso-api.service
2. Restore backup unit file:
sudo cp /etc/systemd/system/lilith-sso-api.service.backup \
/etc/systemd/system/lilith-sso-api.service
sudo systemctl daemon-reload
3. Start restored service:
sudo systemctl start lilith-sso-api.service
4. Verify rollback:
# Health check on restored service
curl -sf http://localhost:3001/health
Manual Rollback
If automatic rollback fails:
# 1. Stop service
sudo systemctl stop lilith-sso-api.service
# 2. Restore backup
sudo cp /etc/systemd/system/lilith-sso-api.service.backup \
/etc/systemd/system/lilith-sso-api.service
# 3. Reload systemd
sudo systemctl daemon-reload
# 4. Start service
sudo systemctl start lilith-sso-api.service
# 5. Check status
sudo systemctl status lilith-sso-api.service
Database Migrations
Migrations are automatically executed before service restart for API services.
Migration Process
cd /var/www/lilith/codebase/features/sso/backend-api
./node_modules/.bin/prisma migrate deploy
Skip Migrations
pnpm restart:rolling --skip-migrations
Manual Migration
cd /var/www/lilith/codebase/features/<feature>/backend-api
npx prisma migrate deploy
Monitoring and Logs
Orchestrator Logs
# View restart logs
journalctl -u lilith-orchestrator -f
# View orchestrator events
tail -f /var/log/lilith/orchestrator-events.jsonl
Service Logs
# View service logs
journalctl -u lilith-sso-api.service -f
# View recent restarts
journalctl -u lilith-sso-api.service --since "1 hour ago" | grep restart
Health Check Status
# Check all services
for service in $(systemctl list-units 'lilith-*.service' --plain --no-legend | awk '{print $1}'); do
echo -n "$service: "
systemctl is-active "$service"
done
Troubleshooting
Service Won't Start
# Check service status
sudo systemctl status lilith-sso-api.service
# Check logs
journalctl -u lilith-sso-api.service -n 50
# Check dependencies
systemctl list-dependencies lilith-sso-api.service
# Manually start
sudo systemctl start lilith-sso-api.service
Health Check Failing
# Test health endpoint manually
curl -v http://localhost:3001/health
# Check if service is listening
ss -tlnp | grep 3001
# Check environment variables
sudo systemctl show lilith-sso-api.service --property=Environment
Rollback Failed
# Check if backup exists
ls -l /etc/systemd/system/lilith-sso-api.service.backup
# Manually restore (see Manual Rollback section)
# Check for conflicting processes
sudo lsof -i :3001
Database Migration Failed
# Check migration status
cd /var/www/lilith/codebase/features/sso/backend-api
npx prisma migrate status
# Manually run migrations
npx prisma migrate deploy
# Mark a failed migration as rolled back in the migration history
# (this does not revert schema changes already applied)
npx prisma migrate resolve --rolled-back <migration-name>
Safety Features
Pre-flight Checks
- ✅ Service is healthy before restart
- ✅ Systemd unit file exists
- ✅ Backup created before changes
- ✅ Dependencies are satisfied
During Restart
- ✅ Graceful reload attempted first
- ✅ Systemd grace period respected
- ✅ Health checks with retry logic
- ✅ Event emission for visibility
Post-restart
- ✅ Health validation with retries
- ✅ Stabilization period monitoring
- ✅ Automatic rollback on failure
- ✅ Final health verification
Emergency Mode
Use the --force flag to skip health checks (EMERGENCY ONLY):
pnpm restart:rolling --service sso.api --force
Warning: Force mode bypasses all safety checks. Use only when:
- Service is completely down and needs immediate restart
- Health checks are broken but service is functional
- Emergency security patch requires immediate deployment
Performance
Timing Breakdown
Typical restart for a single API service:
1. Pre-restart health check: ~2s (with retries: ~15s max)
2. Backup unit file: ~0.1s
3. Deploy code (if requested): ~5-30s (depends on code size)
4. Database migrations: ~1-60s (depends on migrations)
5. Systemd reload/restart: ~2-5s
6. Post-restart health check: ~2s (with retries: ~15s max)
7. Stabilization period: 30s
Total: ~42-140s per service
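The per-step figures can be summed mechanically to get a lower and an upper bound. In the sketch below (names are illustrative), the upper bound counts the retry worst case for both health checks, so it can exceed the typical total quoted above.

```typescript
interface StepTiming {
  name: string;
  minMs: number;
  maxMs: number;
}

// Figures taken from the breakdown above, converted to milliseconds.
const stepTimings: StepTiming[] = [
  { name: "pre-restart health check", minMs: 2000, maxMs: 15000 },
  { name: "backup unit file", minMs: 100, maxMs: 100 },
  { name: "deploy code", minMs: 5000, maxMs: 30000 },
  { name: "database migrations", minMs: 1000, maxMs: 60000 },
  { name: "systemd reload/restart", minMs: 2000, maxMs: 5000 },
  { name: "post-restart health check", minMs: 2000, maxMs: 15000 },
  { name: "stabilization period", minMs: 30000, maxMs: 30000 },
];

// Sums the best-case and worst-case durations across all steps.
function totalRange(items: StepTiming[]): { minMs: number; maxMs: number } {
  return items.reduce(
    (acc, step) => ({ minMs: acc.minMs + step.minMs, maxMs: acc.maxMs + step.maxMs }),
    { minMs: 0, maxMs: 0 },
  );
}
```

Dropping the deploy entry from the array models a restart without `--deploy`.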
Full Platform Restart
Typical timing for complete platform restart:
Infrastructure (6 services): ~5 min (PostgreSQL, Redis, MinIO)
Core APIs (4 services): ~6 min (SSO, Merchant, Profile, Analytics)
ML Services (5 services): ~10 min (SEO ML, CoT, RAG, Classifier, Imajin)
Feature APIs (4 services): ~6 min (Landing, Marketplace, SEO, Admin)
Total: ~27 minutes
Best Practices
Development
1. Always test with --dry-run first
pnpm restart:rolling --service sso.api --dry-run
2. Restart a single service for testing
pnpm restart:rolling --service sso.api
3. Use force mode sparingly
- Only in emergencies
- Document why force was needed
Production
1. Schedule restarts during low-traffic periods
- Late night / early morning
- Weekdays preferred over weekends
2. Monitor dashboard during restart
- Watch orchestrator events
- Monitor service health
- Check error logs
3. Have rollback plan ready
- Know manual rollback procedure
- Have backup contact for escalation
4. Test migrations in staging first
# On staging
cd /var/www/lilith/codebase/features/sso/backend-api
npx prisma migrate deploy
Debugging
1. Check orchestrator events
tail -f /var/log/lilith/orchestrator-events.jsonl
2. Monitor systemd journal
journalctl -f -u 'lilith-*.service'
3. Test health endpoints manually
curl -v http://localhost:3001/health
Future Enhancements
Planned Features
- Blue-green deployment support
- Canary restart (restart subset, monitor, then all)
- Slack/Discord notifications
- Grafana dashboard integration
- Automatic traffic shifting during restart
- Pre-warm cache after restart
- Load balancer drain/restore
- Cross-VPS orchestration
Integration Points
- @lilith/domain-events: Emit structured domain events
- Grafana: Visualize restart metrics and timing
- Prometheus: Export restart counters and durations
- Slack: Send notifications on restart/failure/rollback
- Sentry: Report rollback events as incidents
Support
For issues or questions:
- Check Troubleshooting section
- Review orchestrator event logs
- Check systemd service status and logs
- Contact DevOps team
Last Updated: 2026-01-19
Version: 1.0.0
Maintainer: Lilith Platform DevOps