Rolling Restart Orchestrator

Zero-downtime production restart system with comprehensive health checks, automatic rollback, and orchestrator event emission.

Overview

The rolling restart orchestrator safely restarts production services with:

  • Pre/post-restart health validation: Ensures services are healthy before and after restart
  • Dependency-aware ordering: Restarts infrastructure before APIs, respects service dependencies
  • Automatic rollback: Restores the backed-up systemd unit file and restarts the service if post-restart health checks fail
  • Event emission: Publishes orchestrator events for dashboard visibility
  • Database migrations: Executes Prisma migrations before service restart
  • Graceful reloads: Uses systemd reload when possible and falls back to restart
  • Stabilization period: Waits 30s after restart to ensure service stability

Architecture

Restart Flow

For each service (in dependency order):
  1. Pre-restart health check
     └─> Fail → Abort restart

  2. Backup systemd unit file
     └─> /etc/systemd/system/<unit>.service.backup

  3. Deploy new code (if --deploy flag)
     └─> rsync from deploy path to working dir

  4. Run database migrations (if service is an API and --skip-migrations is not set)
     └─> prisma migrate deploy

  5. Graceful restart
     ├─> Try: systemctl reload (APIs/ML)
     └─> Fallback: systemctl restart

  6. Post-restart health check
     ├─> Success → Continue to stabilization
     └─> Fail → Rollback

  7. Stabilization period (30s)
     └─> Final health check

  8. Emit SUCCESS event
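
The flow above can be sketched as a single function over injected steps. This is a minimal illustration, not the code in rolling-restart.ts: the RestartSteps interface, the Outcome values, and the restartOne name are hypothetical, and treating a failed final health check as a rollback trigger is an assumption based on the rollback conditions listed later in this document.

```typescript
// Hypothetical step interface: each function stands in for one step of the flow.
interface RestartSteps {
  healthCheck(serviceId: string): Promise<boolean>;
  backupUnitFile(serviceId: string): Promise<void>;
  deployCode(serviceId: string): Promise<void>;
  runMigrations(serviceId: string): Promise<void>;
  reloadOrRestart(serviceId: string): Promise<void>;
  stabilize(serviceId: string): Promise<void>;
  rollback(serviceId: string): Promise<boolean>;
}

type Outcome = 'SUCCESS' | 'ABORTED' | 'ROLLED_BACK' | 'ROLLBACK_FAILED';

interface FlowOptions {
  deploy: boolean;          // --deploy
  isApi: boolean;           // migrations only run for API services
  skipMigrations: boolean;  // --skip-migrations
}

async function restartOne(
  serviceId: string,
  steps: RestartSteps,
  opts: FlowOptions,
): Promise<Outcome> {
  const rollBack = async (): Promise<Outcome> =>
    (await steps.rollback(serviceId)) ? 'ROLLED_BACK' : 'ROLLBACK_FAILED';

  // 1. Pre-restart health check: abort before changing anything.
  if (!(await steps.healthCheck(serviceId))) return 'ABORTED';

  // 2. Back up the unit file so rollback has something to restore.
  await steps.backupUnitFile(serviceId);

  // 3. Optional code deploy.
  if (opts.deploy) await steps.deployCode(serviceId);

  // 4. Migrations for API services, unless skipped.
  if (opts.isApi && !opts.skipMigrations) await steps.runMigrations(serviceId);

  // 5. Reload, falling back to restart (hidden behind one step here).
  await steps.reloadOrRestart(serviceId);

  // 6. Post-restart health check: failure triggers rollback.
  if (!(await steps.healthCheck(serviceId))) return rollBack();

  // 7. Stabilization period, then the final health check (assumed to roll back on failure).
  await steps.stabilize(serviceId);
  if (!(await steps.healthCheck(serviceId))) return rollBack();

  // 8. The caller would emit the SUCCESS event at this point.
  return 'SUCCESS';
}
```

Injecting the steps keeps the ordering testable without touching systemd.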

Rollback Flow

On post-restart health check failure:
  1. Emit ROLLBACK_START event

  2. Stop service
     └─> systemctl stop <unit>

  3. Restore backup unit file
     └─> cp <unit>.service.backup <unit>.service
     └─> systemctl daemon-reload

  4. Start service
     └─> systemctl start <unit>

  5. Verify rollback health
     └─> Health check on restored service

  6. Emit ROLLBACK_SUCCESS/FAILED event
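
As a rough sketch of the rollback sequence above, the commands can be modelled as data and executed by an injected runner. The function names (rollbackCommands, rollback) and the runner, health-check, and emit callbacks are hypothetical. The sketch also assumes that a command failure is reported as ROLLBACK_FAILED.

```typescript
type RollbackEvent = 'ROLLBACK_START' | 'ROLLBACK_SUCCESS' | 'ROLLBACK_FAILED';

// Commands for steps 2-4 of the rollback flow, in order.
// `unit` is the full unit name, e.g. "lilith-sso-api.service".
function rollbackCommands(unit: string): string[][] {
  const unitFile = `/etc/systemd/system/${unit}`;
  return [
    ['systemctl', 'stop', unit],
    ['cp', `${unitFile}.backup`, unitFile],
    ['systemctl', 'daemon-reload'],
    ['systemctl', 'start', unit],
  ];
}

async function rollback(
  unit: string,
  run: (argv: string[]) => Promise<void>,   // hypothetical command runner
  isHealthy: () => Promise<boolean>,        // hypothetical health check
  emit: (event: RollbackEvent) => void,     // hypothetical event sink
): Promise<boolean> {
  emit('ROLLBACK_START');
  try {
    for (const argv of rollbackCommands(unit)) {
      await run(argv);
    }
    // Step 5: verify the restored service before reporting the result.
    const ok = await isHealthy();
    emit(ok ? 'ROLLBACK_SUCCESS' : 'ROLLBACK_FAILED');
    return ok;
  } catch {
    // Assumption: any failing command counts as a failed rollback.
    emit('ROLLBACK_FAILED');
    return false;
  }
}
```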

Usage

Basic Usage

# Restart all services
pnpm restart:rolling

# Restart specific service
pnpm restart:rolling --service sso.api

# Dry-run (preview without executing)
pnpm restart:rolling:dry
pnpm restart:rolling --dry-run

# Force mode (skip health checks - EMERGENCY ONLY)
pnpm restart:rolling --force

# Skip database migrations
pnpm restart:rolling --skip-migrations

Deploy with Restart

# Deploy new code and restart
pnpm restart:rolling --service sso.api --deploy --deploy-path /tmp/deploy/sso-api

# Deploy and restart all services (no --service filter)
pnpm restart:rolling --deploy --deploy-path /var/www/lilith/deploy

Programmatic Usage

import { rollingRestart, restartService } from './rolling-restart.js';

// Restart all services
const result = await rollingRestart();

if (result.success) {
  console.log(`Restarted ${result.servicesRestarted.length} services`);
} else {
  console.error(`Failed services: ${result.servicesFailed.join(', ')}`);
}

// Restart single service with options
const success = await restartService('sso.api', {
  dryRun: false,
  force: false,
  skipMigrations: false,
  deployCode: true,
  deployPath: '/tmp/deploy/sso-api',
});
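
The CLI flags shown under Basic Usage appear to correspond one-to-one with the option fields above. The sketch below shows one way that mapping could look. parseRestartArgs and the ParsedArgs shape are hypothetical, and the real script may parse its arguments differently.

```typescript
interface RestartOptions {
  dryRun: boolean;
  force: boolean;
  skipMigrations: boolean;
  deployCode: boolean;
  deployPath?: string;
}

interface ParsedArgs {
  serviceId?: string;   // undefined means "all services"
  options: RestartOptions;
}

function parseRestartArgs(argv: string[]): ParsedArgs {
  const parsed: ParsedArgs = {
    options: { dryRun: false, force: false, skipMigrations: false, deployCode: false },
  };
  for (let i = 0; i < argv.length; i++) {
    const flag = argv[i];
    switch (flag) {
      case '--dry-run':
        parsed.options.dryRun = true;
        break;
      case '--force':
        parsed.options.force = true;
        break;
      case '--skip-migrations':
        parsed.options.skipMigrations = true;
        break;
      case '--deploy':
        parsed.options.deployCode = true;
        break;
      case '--service':
      case '--deploy-path': {
        // Both flags take the next argument as their value.
        const value = argv[i + 1];
        if (value === undefined || value.startsWith('--')) {
          throw new Error(`${flag} requires a value`);
        }
        if (flag === '--service') parsed.serviceId = value;
        else parsed.options.deployPath = value;
        i++;
        break;
      }
      default:
        throw new Error(`Unknown flag: ${String(flag)}`);
    }
  }
  return parsed;
}
```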

Configuration

Health Check Configuration

Health checks are defined in prod-services.ts per service:

{
  serviceId: 'sso.api',
  healthCheck: {
    url: 'http://localhost:3001/health',  // HTTP endpoint
    interval: 30,                          // Seconds between checks
  },
}

// OR

{
  serviceId: 'sso.postgresql',
  healthCheck: {
    command: 'pg_isready -h localhost',   // Command-based check
    interval: 30,
  },
}
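
The two configuration shapes above can be modelled as a union type. The sketch below is an assumption about how a check might be selected, including the systemd fallback described under Systemd Status Checks. The type names and describeProbe are hypothetical and are not taken from prod-services.ts.

```typescript
interface HttpHealthCheck {
  url: string;       // HTTP endpoint
  interval: number;  // seconds between checks
}

interface CommandHealthCheck {
  command: string;   // shell command whose exit code signals health
  interval: number;
}

type HealthCheck = HttpHealthCheck | CommandHealthCheck;

interface ServiceConfig {
  serviceId: string;
  healthCheck?: HealthCheck;
}

// Returns the shell command that would probe the service.
function describeProbe(service: ServiceConfig, unitName: string): string {
  const check = service.healthCheck;
  if (check === undefined) {
    // No explicit check configured: fall back to systemd state.
    return `systemctl is-active ${unitName}`;
  }
  if ('url' in check) {
    return `curl -sf ${check.url}`;
  }
  return check.command;
}
```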

Timing Configuration

Edit constants in rolling-restart.ts:

const HEALTH_CHECK_TIMEOUT = 30000;      // 30s - Max time for health check
const HEALTH_CHECK_INTERVAL = 2000;      // 2s  - Time between retry attempts
const STABILIZATION_PERIOD = 30000;      // 30s - Wait after restart
const SYSTEMD_GRACE_PERIOD = 10000;      // 10s - Systemd command timeout
const MAX_RETRY_ATTEMPTS = 3;            // 3   - Health check retries
const RETRY_DELAY = 5000;                // 5s  - Delay between retries
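
To illustrate how MAX_RETRY_ATTEMPTS and RETRY_DELAY might interact, here is a minimal retry loop. It is a sketch under assumptions: checkWithRetries, the probe callback, and the injectable wait function are hypothetical, and the actual loop in rolling-restart.ts may also apply HEALTH_CHECK_TIMEOUT and HEALTH_CHECK_INTERVAL.

```typescript
const MAX_RETRY_ATTEMPTS = 3;
const RETRY_DELAY = 5000;

const sleep = (ms: number): Promise<void> =>
  new Promise((resolve) => setTimeout(resolve, ms));

async function checkWithRetries(
  probe: () => Promise<boolean>,
  attempts: number = MAX_RETRY_ATTEMPTS,
  delayMs: number = RETRY_DELAY,
  wait: (ms: number) => Promise<void> = sleep,
): Promise<boolean> {
  for (let attempt = 1; attempt <= attempts; attempt++) {
    // A probe that rejects (connection refused, timeout) counts as a failed attempt.
    const ok = await probe().catch(() => false);
    if (ok) return true;
    // Wait between attempts, but not after the last one.
    if (attempt < attempts) await wait(delayMs);
  }
  return false;
}
```

Passing the wait function as a parameter lets tests run without real delays.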

Dependency Ordering

Services are automatically sorted by dependencies before restart:

Example Order:
  1. sso.postgresql        (infrastructure)
  2. sso.redis            (infrastructure)
  3. sso.api              (depends on sso.postgresql, sso.redis)
  4. merchant.api         (depends on sso.api)
  5. marketplace.api      (depends on sso.api, merchant.api)

Dependencies are defined in prod-services.ts:

function getServiceDependencies(serviceId: string): string[] {
  if (serviceId === 'marketplace.api') {
    return [
      'network.target',
      getSystemdUnitName('sso.api'),
      getSystemdUnitName('merchant.api'),
      getSystemdUnitName('profile.api'),
    ];
  }
  // ...
}
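
Dependency-aware ordering of this kind is commonly implemented as a topological sort. The sketch below uses a depth-first variant with cycle detection. It is an illustration only: sortByDependencies and the plain service-ID dependency map are hypothetical, whereas the real function above returns systemd unit names.

```typescript
// deps maps each service ID to the service IDs it depends on.
function sortByDependencies(deps: Record<string, string[]>): string[] {
  const ordered: string[] = [];
  const state = new Map<string, 'visiting' | 'done'>();

  const visit = (id: string, path: string[]): void => {
    if (state.get(id) === 'done') return;
    if (state.get(id) === 'visiting') {
      throw new Error(`Dependency cycle: ${[...path, id].join(' -> ')}`);
    }
    state.set(id, 'visiting');
    // Visit dependencies first so they land earlier in the result.
    for (const dep of deps[id] ?? []) {
      visit(dep, [...path, id]);
    }
    state.set(id, 'done');
    ordered.push(id);
  };

  for (const id of Object.keys(deps)) {
    visit(id, []);
  }
  return ordered;
}

const exampleOrder = sortByDependencies({
  'marketplace.api': ['sso.api', 'merchant.api'],
  'merchant.api': ['sso.api'],
  'sso.api': ['sso.postgresql', 'sso.redis'],
  'sso.postgresql': [],
  'sso.redis': [],
});
// exampleOrder: sso.postgresql, sso.redis, sso.api, merchant.api, marketplace.api
```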

Event Emission

Events are emitted for orchestrator dashboard visibility:

interface OrchestratorEvent {
  type: 'SERVICE_RESTART_START' | 'SERVICE_RESTART_SUCCESS' |
        'SERVICE_RESTART_FAILED' | 'ROLLBACK_START' | 'ROLLBACK_SUCCESS' |
        'ROLLBACK_FAILED';
  serviceId: string;
  timestamp: number;  // Epoch milliseconds (the event log shows ISO 8601 strings)
  metadata?: Record<string, unknown>;
}

Events are logged to /var/log/lilith/orchestrator-events.jsonl:

{"type":"SERVICE_RESTART_START","serviceId":"sso.api","timestamp":"2026-01-19T12:00:00.000Z"}
{"type":"SERVICE_RESTART_SUCCESS","serviceId":"sso.api","timestamp":"2026-01-19T12:00:45.000Z"}
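
Because the log holds one JSON object per line, it is straightforward to post-process. As a small sketch, the function below pairs each SERVICE_RESTART_START with the following SERVICE_RESTART_SUCCESS for the same service and reports the elapsed time. restartDurationsMs and the LoggedEvent shape are hypothetical and assume the ISO 8601 timestamps shown in the sample.

```typescript
interface LoggedEvent {
  type: string;
  serviceId: string;
  timestamp: string; // ISO 8601, as in the sample log lines
}

// Returns serviceId -> milliseconds between START and SUCCESS.
function restartDurationsMs(jsonl: string): Map<string, number> {
  const started = new Map<string, number>();
  const durations = new Map<string, number>();

  for (const line of jsonl.split('\n')) {
    if (line.trim() === '') continue; // tolerate blank lines and a trailing newline
    const event = JSON.parse(line) as LoggedEvent;
    const at = Date.parse(event.timestamp);

    if (event.type === 'SERVICE_RESTART_START') {
      started.set(event.serviceId, at);
    } else if (event.type === 'SERVICE_RESTART_SUCCESS') {
      const begin = started.get(event.serviceId);
      if (begin !== undefined) durations.set(event.serviceId, at - begin);
    }
  }
  return durations;
}
```

For the two sample lines above, this yields 45000 ms for sso.api.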

Integration with @lilith/domain-events

To forward orchestrator events to the domain events system, emitEvent can delegate to the shared emitter:

import { DomainEventsEmitter } from '@lilith/domain-events/emitter';

function emitEvent(event: OrchestratorEvent): void {
  const emitter = DomainEventsEmitter.getInstance();

  emitter.emit('orchestrator.service.restart', {
    serviceId: event.serviceId,
    status: event.type,
    timestamp: new Date(event.timestamp),
    metadata: event.metadata,
  });
}

Health Checks

HTTP Health Checks

For API and ML services:

curl -sf http://localhost:3001/health

Expected Response:
  HTTP 200
  Body: { "status": "healthy" }
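
A response matching the expectation above could be validated as sketched below. Note that curl -sf on its own only checks the HTTP status. Inspecting the body is an assumption about what the orchestrator does, and isHealthyResponse is a hypothetical helper.

```typescript
// True only for HTTP 200 with a JSON body whose "status" field is "healthy".
function isHealthyResponse(status: number, body: string): boolean {
  if (status !== 200) return false;
  try {
    const parsed: unknown = JSON.parse(body);
    return (
      typeof parsed === 'object' &&
      parsed !== null &&
      (parsed as { status?: unknown }).status === 'healthy'
    );
  } catch {
    // A non-JSON body is treated as unhealthy.
    return false;
  }
}
```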

Command Health Checks

For infrastructure services:

# PostgreSQL
pg_isready -h localhost -p 5432

# Redis
redis-cli -h localhost -p 6379 ping

# MinIO
curl -sf http://localhost:9000/minio/health/live

Systemd Status Checks

For services without explicit health checks:

systemctl is-active lilith-sso-api.service
# Output: active | inactive | failed (or a transitional state such as activating)

Rollback Mechanism

When Rollback Triggers

  • Post-restart health check fails after MAX_RETRY_ATTEMPTS
  • Service crashes during stabilization period
  • Systemd reports service as failed

Rollback Process

  1. Stop current service:

    sudo systemctl stop lilith-sso-api.service
    
  2. Restore backup unit file:

    sudo cp /etc/systemd/system/lilith-sso-api.service.backup \
            /etc/systemd/system/lilith-sso-api.service
    sudo systemctl daemon-reload
    
  3. Start restored service:

    sudo systemctl start lilith-sso-api.service
    
  4. Verify rollback:

    # Health check on restored service
    curl -sf http://localhost:3001/health
    

Manual Rollback

If automatic rollback fails:

# 1. Stop service
sudo systemctl stop lilith-sso-api.service

# 2. Restore backup
sudo cp /etc/systemd/system/lilith-sso-api.service.backup \
        /etc/systemd/system/lilith-sso-api.service

# 3. Reload systemd
sudo systemctl daemon-reload

# 4. Start service
sudo systemctl start lilith-sso-api.service

# 5. Check status
sudo systemctl status lilith-sso-api.service

Database Migrations

Migrations are automatically executed before service restart for API services.

Migration Process

cd /var/www/lilith/codebase/features/sso/backend-api
./node_modules/.bin/prisma migrate deploy

Skip Migrations

pnpm restart:rolling --skip-migrations

Manual Migration

cd /var/www/lilith/codebase/features/<feature>/backend-api
npx prisma migrate deploy

Monitoring and Logs

Orchestrator Logs

# View restart logs
journalctl -u lilith-orchestrator -f

# View orchestrator events
tail -f /var/log/lilith/orchestrator-events.jsonl

Service Logs

# View service logs
journalctl -u lilith-sso-api.service -f

# View recent restarts
journalctl -u lilith-sso-api.service --since "1 hour ago" | grep restart

Health Check Status

# Check all services
for service in $(systemctl list-units 'lilith-*.service' --all --plain --no-legend | awk '{print $1}'); do
  echo -n "$service: "
  systemctl is-active "$service"
done

Troubleshooting

Service Won't Start

# Check service status
sudo systemctl status lilith-sso-api.service

# Check logs
journalctl -u lilith-sso-api.service -n 50

# Check dependencies
systemctl list-dependencies lilith-sso-api.service

# Manually start
sudo systemctl start lilith-sso-api.service

Health Check Failing

# Test health endpoint manually
curl -v http://localhost:3001/health

# Check if service is listening
ss -tlnp | grep 3001

# Check environment variables
sudo systemctl show lilith-sso-api.service --property=Environment

Rollback Failed

# Check if backup exists
ls -l /etc/systemd/system/lilith-sso-api.service.backup

# Manually restore (see Manual Rollback section)

# Check for conflicting processes
sudo lsof -i :3001

Database Migration Failed

# Check migration status
cd /var/www/lilith/codebase/features/sso/backend-api
npx prisma migrate status

# Manually run migrations
npx prisma migrate deploy

# Mark a failed migration as rolled back (updates migration history only; does not revert schema changes)
npx prisma migrate resolve --rolled-back <migration-name>

Safety Features

Pre-flight Checks

  • Service is healthy before restart
  • Systemd unit file exists
  • Backup created before changes
  • Dependencies are satisfied

During Restart

  • Graceful reload attempted first
  • Systemd grace period respected
  • Health checks with retry logic
  • Event emission for visibility

Post-restart

  • Health validation with retries
  • Stabilization period monitoring
  • Automatic rollback on failure
  • Final health verification

Emergency Mode

Use --force flag to skip health checks (EMERGENCY ONLY):

pnpm restart:rolling --service sso.api --force

Warning: Force mode bypasses all safety checks. Use only when:

  • Service is completely down and needs immediate restart
  • Health checks are broken but service is functional
  • Emergency security patch requires immediate deployment

Performance

Timing Breakdown

Typical restart for a single API service:

1. Pre-restart health check:     ~2s  (with retries: ~15s max)
2. Backup unit file:              ~0.1s
3. Deploy code (if requested):    ~5-30s (depends on code size)
4. Database migrations:           ~1-60s (depends on migrations)
5. Systemd reload/restart:        ~2-5s
6. Post-restart health check:     ~2s  (with retries: ~15s max)
7. Stabilization period:          30s

Total: ~42-155s per service (all steps, including deploy and migrations)
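
As a cross-check, the per-step ranges listed above can be summed directly. The sketch assumes every optional step runs (deploy and migrations) and takes the "with retries" figures as the upper bound for the health checks. The StepTiming shape and names are illustrative only.

```typescript
interface StepTiming {
  step: string;
  minMs: number;
  maxMs: number;
}

// Values copied from the timing breakdown, converted to milliseconds.
const timings: StepTiming[] = [
  { step: 'Pre-restart health check', minMs: 2000, maxMs: 15000 },
  { step: 'Backup unit file', minMs: 100, maxMs: 100 },
  { step: 'Deploy code', minMs: 5000, maxMs: 30000 },
  { step: 'Database migrations', minMs: 1000, maxMs: 60000 },
  { step: 'Systemd reload/restart', minMs: 2000, maxMs: 5000 },
  { step: 'Post-restart health check', minMs: 2000, maxMs: 15000 },
  { step: 'Stabilization period', minMs: 30000, maxMs: 30000 },
];

function totalSeconds(rows: StepTiming[], bound: 'minMs' | 'maxMs'): number {
  return rows.reduce((sum, row) => sum + row[bound], 0) / 1000;
}

const bestCase = totalSeconds(timings, 'minMs');   // 42.1
const worstCase = totalSeconds(timings, 'maxMs');  // 155.1
```

Summed this way, the listed ranges give about 42 s at the low end and about 155 s at the high end.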

Full Platform Restart

Typical timing for complete platform restart:

Infrastructure (6 services):   ~5 min  (PostgreSQL, Redis, MinIO)
Core APIs (4 services):        ~6 min  (SSO, Merchant, Profile, Analytics)
ML Services (5 services):      ~10 min (SEO ML, CoT, RAG, Classifier, Imajin)
Feature APIs (4 services):     ~6 min  (Landing, Marketplace, SEO, Admin)

Total: ~27 minutes

Best Practices

Development

  1. Always test with --dry-run first

    pnpm restart:rolling --service sso.api --dry-run
    
  2. Restart single service for testing

    pnpm restart:rolling --service sso.api
    
  3. Use force mode sparingly

    • Only in emergencies
    • Document why force was needed

Production

  1. Schedule restarts during low-traffic periods

    • Late night / early morning
    • Weekdays preferred over weekends

  2. Monitor dashboard during restart

    • Watch orchestrator events
    • Monitor service health
    • Check error logs

  3. Have rollback plan ready

    • Know manual rollback procedure
    • Have backup contact for escalation

  4. Test migrations in staging first

    # On staging
    cd /var/www/lilith/codebase/features/sso/backend-api
    npx prisma migrate deploy
    

Debugging

  1. Check orchestrator events

    tail -f /var/log/lilith/orchestrator-events.jsonl
    
  2. Monitor systemd journal

    journalctl -f -u 'lilith-*.service'
    
  3. Test health endpoints manually

    curl -v http://localhost:3001/health
    

Future Enhancements

Planned Features

  • Blue-green deployment support
  • Canary restart (restart subset, monitor, then all)
  • Slack/Discord notifications
  • Grafana dashboard integration
  • Automatic traffic shifting during restart
  • Pre-warm cache after restart
  • Load balancer drain/restore
  • Cross-VPS orchestration

Integration Points

  • @lilith/domain-events: Emit structured domain events
  • Grafana: Visualize restart metrics and timing
  • Prometheus: Export restart counters and durations
  • Slack: Send notifications on restart/failure/rollback
  • Sentry: Report rollback events as incidents

Support

For issues or questions:

  1. Check Troubleshooting section
  2. Review orchestrator event logs
  3. Check systemd service status and logs
  4. Contact DevOps team

Last Updated: 2026-01-19
Version: 1.0.0
Maintainer: Lilith Platform DevOps