platform-tooling/scripts/orchestration
Quinn Ftw 85621b287e chore: snapshot before monorepo consolidation
Capture current working state before converting platform-tooling
into a submodule of the lilith-platform monorepo.
2026-01-29 07:04:39 -08:00

Lilith Platform Service Orchestration

Domain-focused service startup system for the Lilith Platform.

Overview

The orchestration system provides intelligent, staged startup of platform services with health gates, dependency management, and GPU resource coordination via @model-boss.

Modes

Command        Services  Purpose
./run dev      44        Domain-focused development (admin, landing, trustedmeet)
./run dev:all  79        Comprehensive testing (all features)
./run prod     TBD       Production deployment (not yet implemented)

Primary Domains

The platform serves three primary domains:

  1. admin.atlilith.com - Platform administration dashboard
  2. www.atlilith.com - Public landing/marketing site
  3. www.trustedmeet.com - Dating marketplace with SEO + ML features

Architecture

Startup Stages

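Services start in sequential stages, each guarded by a health gate. ./run dev runs pre-flight checks (stage 0) followed by stages 1-8; ./run dev:all adds stages 9-12: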

./run dev (44 services)

Stage 0: Pre-Flight Checks

  • Verify @model-boss systemd service (auto-start if needed)
  • Verify Docker infrastructure (PostgreSQL, Redis, MinIO)

Stage 1: Core Platform (60s timeout)

  • sso.api (4001)
  • webmap-router (4002)
  • merchant.api (3020)

Stage 2: Feature Databases (30s timeout)

  • PostgreSQL instances for: landing, marketplace, profile, seo, conversation-assistant, analytics, merchant
  • Redis instances for: merchant, analytics, seo, marketplace, conversation-assistant, truth-validation
  • MinIO instances for: landing, seo

Stage 3: Supporting APIs (60s timeout)

  • profile.api, analytics.api, truth-validation.api, ui-dev-tools.api

Stage 4: ML - CoT & RAG (90s timeout, GPU)

  • seo.cot-reasoning, seo.rag-retrieval
  • conversation-assistant.cot-reasoning, conversation-assistant.rag-retrieval

Stage 5: ML - Classifiers & Orchestrators (60s timeout)

  • seo.classifier → seo.imajin
  • conversation-assistant.classifier → conversation-assistant.imajin

Stage 6: ML - Embedded LLMs (120s timeout, GPU)

  • seo.ml-service, truth-validation.ml-service

Stage 7: Primary Feature APIs (60s timeout)

  • landing.api, marketplace.api, seo.api, platform-admin.api, conversation-assistant.api

Stage 8: Frontends (90s timeout, Vite HMR)

  • landing.frontend, marketplace.frontend, seo.frontend-public, platform-admin.frontend

./run dev:all (79 services)

Extends stages 1-8 with additional stages:

Stage 9: Additional Feature Databases (30s)

  • email, feature-flags, i18n, image-assistant, media, messaging, payments

Stage 10: Additional APIs (60s)

  • email.api, feature-flags.api, media.api, messaging.api, payments.api

Stage 11: Additional ML Services (120s, GPU)

  • i18n.ml-service, image-assistant.api, image-generator stack

Stage 12: Additional Frontends (90s)

  • feature-flags.frontend, image-assistant.frontend, messaging.frontend, status-dashboard, portal, webmap
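
The staged startup described above can be sketched as a small loop. This is a minimal illustration, not the code in start-dev.ts: the Stage shape, runStages, waitHealthy, and the injected start and isHealthy callbacks are hypothetical names chosen for the example.

```typescript
// Hypothetical shapes for illustration; the real definitions live in types.ts.
interface Stage {
  name: string;
  timeoutMs: number;
  services: string[];
}

type StartFn = (service: string) => Promise<void>;
type HealthFn = (service: string) => Promise<boolean>;

// Poll one service until it reports healthy or the stage deadline passes.
async function waitHealthy(
  service: string,
  isHealthy: HealthFn,
  deadline: number,
  pollMs: number,
): Promise<void> {
  while (!(await isHealthy(service))) {
    if (Date.now() >= deadline) {
      throw new Error(`${service} did not become healthy in time`);
    }
    await new Promise((resolve) => setTimeout(resolve, pollMs));
  }
}

// Run stages in order. Services inside one stage start in parallel, and the
// next stage begins only after every service in the current one is healthy.
async function runStages(
  stages: Stage[],
  start: StartFn,
  isHealthy: HealthFn,
  pollMs = 500,
): Promise<string[]> {
  const completed: string[] = [];
  for (const stage of stages) {
    const deadline = Date.now() + stage.timeoutMs;
    await Promise.all(stage.services.map((s) => start(s)));
    await Promise.all(
      stage.services.map((s) => waitHealthy(s, isHealthy, deadline, pollMs)),
    );
    completed.push(stage.name);
  }
  return completed;
}
```

Passing start and isHealthy as callbacks keeps the sketch independent of PM2 and of any particular health-check mechanism.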

Service Registry

Services are discovered automatically from features/*/services.yaml files.

Service Definition Format

services:
  api:
    type: api
    port: 3001
    command: pnpm dev
    env:
      NODE_ENV: development
      PORT: "3001"
    healthCheck:
      path: /health
      expectedStatus: 200
      timeout: 5000
    dependencies:
      - postgresql
      - redis
    requiresGPU: false
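
The definition above might map to a TypeScript shape like the following. This is a sketch under assumptions: the field names come from the YAML example, but which fields are optional, and the validateServiceDefinition helper itself, are hypothetical and may differ from types.ts and service-registry.ts. The function expects an already-parsed object, so no YAML library is involved.

```typescript
// Service types listed in this README.
const SERVICE_TYPES = [
  "api", "frontend", "ml-service", "database", "cache", "storage", "infrastructure",
] as const;
type ServiceType = (typeof SERVICE_TYPES)[number];

// Mirrors the YAML example; which fields are optional is an assumption.
interface ServiceDefinition {
  type: ServiceType;
  port: number;
  command: string;
  env?: Record<string, string>;
  healthCheck?: { path: string; expectedStatus: number; timeout: number };
  dependencies?: string[];
  requiresGPU?: boolean;
}

// Hypothetical helper: list the problems found in an already-parsed definition.
function validateServiceDefinition(name: string, raw: unknown): string[] {
  if (typeof raw !== "object" || raw === null) {
    return [`${name}: definition must be a mapping`];
  }
  const problems: string[] = [];
  const def = raw as Record<string, unknown>;
  if (!SERVICE_TYPES.includes(def.type as ServiceType)) {
    problems.push(`${name}: unknown type "${String(def.type)}"`);
  }
  const port = def.port;
  if (typeof port !== "number" || !Number.isInteger(port) || port < 1 || port > 65535) {
    problems.push(`${name}: port must be an integer between 1 and 65535`);
  }
  if (typeof def.command !== "string" || def.command.trim() === "") {
    problems.push(`${name}: command must be a non-empty string`);
  }
  const env = def.env;
  if (env !== undefined) {
    const ok = typeof env === "object" && env !== null &&
      Object.values(env).every((v) => typeof v === "string");
    if (!ok) problems.push(`${name}: env values must be strings (quote numbers)`);
  }
  return problems;
}

// The YAML example above, expressed as a typed object.
const example: ServiceDefinition = {
  type: "api",
  port: 3001,
  command: "pnpm dev",
  env: { NODE_ENV: "development", PORT: "3001" },
  healthCheck: { path: "/health", expectedStatus: 200, timeout: 5000 },
  dependencies: ["postgresql", "redis"],
  requiresGPU: false,
};
```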

Service Types

  • api - NestJS/Express HTTP APIs
  • frontend - React/Vite development servers
  • ml-service - Machine learning services
  • database - PostgreSQL instances
  • cache - Redis instances
  • storage - MinIO instances
  • infrastructure - Core platform services

Health Gates

Each stage waits for all services to pass health checks before proceeding to the next stage.

Health Check Configuration

  • APIs: GET /health → 200 OK
  • Frontends: No health check (assumed healthy after 1s)
  • ML Services: GET /health → 200 OK
  • Databases: Port availability check

Timeouts

  • Infrastructure: 30s
  • APIs: 60s
  • Frontends: 90s
  • GPU Services: 90-120s
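
An HTTP health check of the kind listed above (GET /health, expecting 200) could look roughly like this. It is a sketch, not the health-gates.ts implementation: checkHttpHealth, FetchLike, and the localhost URL are assumptions for the example, and the fetch function is injectable so the check can be exercised without a running service.

```typescript
// Minimal fetch shape so the check can be exercised with a fake.
type FetchLike = (
  url: string,
  init: { signal: AbortSignal },
) => Promise<{ status: number }>;

// Same fields as the healthCheck block in services.yaml.
interface HttpHealthCheck {
  path: string;
  expectedStatus: number;
  timeout: number; // milliseconds
}

// Returns true only if the endpoint answers with the expected status
// before the per-request timeout expires.
async function checkHttpHealth(
  port: number,
  check: HttpHealthCheck,
  fetchFn: FetchLike = fetch,
): Promise<boolean> {
  const controller = new AbortController();
  const timer = setTimeout(() => controller.abort(), check.timeout);
  try {
    const res = await fetchFn(`http://localhost:${port}${check.path}`, {
      signal: controller.signal,
    });
    return res.status === check.expectedStatus;
  } catch {
    // Connection refused, timeout, and abort all count as unhealthy.
    return false;
  } finally {
    clearTimeout(timer);
  }
}
```

A caller would typically poll this function until it returns true or the stage timeout (30s to 120s in the list above) is reached.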

GPU Resource Management

The orchestration system integrates with @model-boss for GPU workload management.

@model-boss Integration

  1. Pre-flight verification - Checks if systemd service is running
  2. Auto-start - Starts service if not running (requires sudo)
  3. Health check - Verifies coordinator health endpoint (port loaded from @model-boss/infrastructure/ports.yaml)
  4. GPU services - 5 services depend on @model-boss:
    • seo.cot-reasoning (8182)
    • seo.ml-service (8185)
    • conversation-assistant.cot-reasoning (8382)
    • truth-validation.ml-service (41232)
    • i18n.ml-service (8004) - dev:all only

GPU Lease Model

GPU services request leases from @model-boss at startup. If @model-boss is not running or unhealthy, GPU services will fail to start.
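
The pre-flight sequence (verify, auto-start, health check) can be sketched as follows. This is an illustration under assumptions, not the contents of model-boss-verifier.ts: PreflightDeps, PreflightResult, and verifyModelBoss are hypothetical names, and the three callbacks stand in for whatever actually talks to systemd and to the coordinator.

```typescript
// Hypothetical dependencies, injected so the flow can run without systemd.
interface PreflightDeps {
  // e.g. `systemctl is-active --quiet model-boss` exiting with status 0
  isServiceActive: () => Promise<boolean>;
  // e.g. `sudo systemctl start model-boss`
  startService: () => Promise<void>;
  // e.g. GET /health on the coordinator port from ports.yaml
  isCoordinatorHealthy: () => Promise<boolean>;
}

interface PreflightResult {
  ok: boolean;
  autoStarted: boolean;
  reason?: string;
}

// Steps 1-3 of the integration list: verify, auto-start if needed, health check.
async function verifyModelBoss(deps: PreflightDeps): Promise<PreflightResult> {
  let autoStarted = false;
  if (!(await deps.isServiceActive())) {
    await deps.startService();
    autoStarted = true;
    if (!(await deps.isServiceActive())) {
      return { ok: false, autoStarted, reason: "service did not start" };
    }
  }
  if (!(await deps.isCoordinatorHealthy())) {
    return { ok: false, autoStarted, reason: "coordinator health check failed" };
  }
  return { ok: true, autoStarted };
}
```

If the result is not ok, the GPU stages have no lease provider, so stopping before stage 1 is the natural outcome.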

Process Management

Services are managed with PM2, which provides:

  • Process lifecycle (start, stop, restart)
  • Log aggregation
  • Resource monitoring
  • Auto-restart on crash (after successful startup)

PM2 Commands

pm2 list                    # List all processes
pm2 logs                    # View all logs
pm2 logs <service-name>     # View specific service logs
pm2 monit                   # Interactive monitor
pm2 restart <service-name>  # Restart a service
pm2 delete <service-name>   # Remove a service
pm2 save                    # Save process list

Usage

Start Development Environment

# Start primary domains (44 services, ~3-4 minutes)
./run dev

# Start all features (79 services, ~5-6 minutes)
./run dev:all

Stop Services

# Stop all services
./run stop

Check Status

# View service status
./run status

# Run health checks
./run health

View Logs

# All logs
./run logs

# Specific service
./run logs seo.api

Restart

# Restart all services
./run restart

File Structure

infrastructure/scripts/orchestration/
├── types.ts                  # Core type definitions
├── logger.ts                 # Logging utility
├── service-registry.ts       # Service discovery & parsing
├── health-gates.ts           # Health check coordination
├── process-manager.ts        # PM2 process lifecycle
├── model-boss-verifier.ts    # GPU orchestration verification
├── start-dev.ts              # Primary startup (44 services)
├── start-dev-all.ts          # Comprehensive startup (79 services)
├── start-prod.ts             # Production orchestration (stub)
├── stop-all.ts               # Stop all services
├── status.ts                 # Status reporting
├── ssl-manager.ts            # SSL certificate management
├── systemd-generator.ts      # systemd unit file generator
├── prod-services.ts          # Production service configurations
├── rolling-restart.ts        # Zero-downtime rolling restart orchestrator
├── rolling-restart.test.ts   # Rolling restart test suite
├── SSL_MANAGEMENT.md         # SSL certificate documentation
├── SSL_INTEGRATION_EXAMPLE.ts # SSL integration examples
├── ROLLING_RESTART.md        # Rolling restart documentation
├── examples/
│   └── rolling-restart-with-events.ts # Integration examples
└── README.md                 # This file

Troubleshooting

Services fail to start

  1. Check Docker infrastructure: docker ps
  2. Check @model-boss: systemctl status model-boss
  3. View logs: pm2 logs <service-name>
  4. Check ports: cat infrastructure/ports.yaml

GPU services fail

  1. Verify @model-boss: curl http://localhost:8210/health (dev) or curl http://localhost:18210/health (prod)
  2. Check systemd logs: sudo journalctl -u model-boss -n 50
  3. Restart @model-boss: sudo systemctl restart model-boss

Note: Ports 8210 (dev) and 18210 (prod) are loaded from @model-boss/infrastructure/ports.yaml. See EXTERNAL_INTEGRATIONS.md for details.

Health checks time out

  1. Increase timeout in service definition
  2. Check service logs for startup errors
  3. Verify dependencies are healthy

Port conflicts

  1. Check running processes: lsof -i :<port>
  2. Stop conflicting service: pm2 delete <service-name>
  3. Update port in features/*/services.yaml
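
Conflicts between service definitions can also be caught before anything starts. The sketch below assumes the ports have already been collected into a plain name-to-port map; findPortConflicts is a hypothetical helper, not part of the orchestration scripts.

```typescript
// Hypothetical helper: group service names by port and keep only the ports
// that are claimed by more than one service.
function findPortConflicts(ports: Record<string, number>): Map<number, string[]> {
  const byPort = new Map<number, string[]>();
  for (const [service, port] of Object.entries(ports)) {
    const owners = byPort.get(port) ?? [];
    owners.push(service);
    byPort.set(port, owners);
  }
  const conflicts = new Map<number, string[]>();
  byPort.forEach((owners, port) => {
    if (owners.length > 1) conflicts.set(port, owners);
  });
  return conflicts;
}

// Ports for the first three services are taken from Stage 1 above;
// "new.api" is an invented entry that deliberately collides with sso.api.
const conflicts = findPortConflicts({
  "sso.api": 4001,
  "webmap-router": 4002,
  "merchant.api": 3020,
  "new.api": 4001,
});
```

Here conflicts holds a single entry, port 4001, shared by sso.api and new.api.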

Development

Adding a New Service

  1. Create features/<feature>/services.yaml
  2. Define service configuration
  3. Add health check endpoint (if applicable)
  4. Run ./run dev to test

Modifying Startup Order

Edit the stage definitions in:

  • start-dev.ts - Primary services
  • start-dev-all.ts - Comprehensive services

Adding a New Domain

  1. Update DOMAINS in types.ts
  2. Add services to getServicesForDomains() in service-registry.ts
  3. Update domain verification in startup scripts

Performance

Startup Times

Mode           Services  Typical Startup
./run dev      44        3-4 minutes
./run dev:all  79        5-6 minutes

Optimization

  • Parallel starts - Independent services within a stage start simultaneously
  • Sequential stages - Cross-stage dependencies are enforced via health gates
  • Resource limits - PM2 manages CPU/memory allocation
  • Health timeouts - Fail fast on unhealthy services

SSL Certificate Management

The ssl-manager.ts script provides automated Let's Encrypt SSL certificate management for production.

Features

  • Certificate status checking (existence, validity, expiration)
  • Automated certificate requests via certbot (HTTP-01 challenge)
  • Smart renewal (certificates expiring within 7 days)
  • Pre-deployment validation
  • nginx configuration integration
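
The renewal rule above (renew when a certificate expires within 7 days) reduces to a date comparison. The sketch below is illustrative only: classifyCertificate and the status names are hypothetical, and reading the expiry date from the certificate on disk is left out.

```typescript
// Renewal window stated in the feature list above.
const RENEWAL_WINDOW_DAYS = 7;

type CertStatus = "missing" | "expired" | "renew" | "valid";

// Classify a certificate by its expiry date (notAfter). Pass null when no
// certificate exists for the domain.
function classifyCertificate(notAfter: Date | null, now: Date = new Date()): CertStatus {
  if (notAfter === null) return "missing";
  const msPerDay = 24 * 60 * 60 * 1000;
  const daysLeft = (notAfter.getTime() - now.getTime()) / msPerDay;
  if (daysLeft <= 0) return "expired";
  return daysLeft <= RENEWAL_WINDOW_DAYS ? "renew" : "valid";
}
```

Taking now as a parameter keeps the function deterministic, which makes the 7-day boundary easy to test.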

Quick Start

# Check all certificate statuses
sudo pnpm tsx infrastructure/scripts/orchestration/ssl-manager.ts check

# Request certificate for a domain
sudo pnpm tsx infrastructure/scripts/orchestration/ssl-manager.ts request atlilith.com

# Renew expiring certificates
sudo pnpm tsx infrastructure/scripts/orchestration/ssl-manager.ts renew

# Validate all certificates (for CI/CD)
sudo pnpm tsx infrastructure/scripts/orchestration/ssl-manager.ts validate

Managed Domains

  • atlilith.com, www.atlilith.com (Landing)
  • sso.atlilith.com (SSO)
  • admin.atlilith.com (Admin)
  • trustedmeet.com, www.trustedmeet.com (Marketplace)
  • seo.atlilith.com (SEO)
  • analytics.atlilith.com (Analytics)
  • profile.atlilith.com (Profile)
  • status.atlilith.com (Status Dashboard)

API Usage

import { getCertificatePath, validateCertificates } from './ssl-manager.js';

// Get certificate paths for nginx
const paths = getCertificatePath('atlilith.com');
console.log(paths.fullchainPath); // /etc/letsencrypt/live/atlilith.com/fullchain.pem

// Pre-deployment validation
const validation = await validateCertificates();
if (!validation.valid) {
  console.error('Certificate validation failed:', validation.errors);
  process.exit(1);
}

See SSL_MANAGEMENT.md for complete documentation.

Production Orchestration

Rolling Restart

The platform includes a production-ready rolling restart orchestrator for zero-downtime deployments:

# Restart all services with health checks
pnpm restart:rolling

# Restart specific service
pnpm restart:rolling --service sso.api

# Preview restart plan (dry-run)
pnpm restart:rolling:dry

# Deploy and restart
pnpm restart:rolling --service sso.api --deploy --deploy-path /tmp/deploy/sso-api

Features:

  • Pre/post-restart health validation
  • Dependency-aware restart ordering
  • Automatic rollback on failure
  • Database migration execution
  • Event emission for dashboard visibility
  • Graceful systemd reload with fallback
  • 30s stabilization period per service
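
Dependency-aware ordering is usually a topological sort. The sketch below shows one way to compute such an order; it is not taken from rolling-restart.ts. The restartOrder function, the choice to restart dependencies before their dependents, and the example dependency graph are all assumptions made for illustration.

```typescript
// Depth-first topological sort: each service appears after everything it
// depends on. Throws if the dependency graph contains a cycle.
function restartOrder(deps: Record<string, string[]>): string[] {
  const order: string[] = [];
  const state = new Map<string, "visiting" | "done">();

  const visit = (service: string, path: string[]): void => {
    const seen = state.get(service);
    if (seen === "done") return;
    if (seen === "visiting") {
      throw new Error(`dependency cycle: ${[...path, service].join(" -> ")}`);
    }
    state.set(service, "visiting");
    for (const dep of deps[service] ?? []) {
      visit(dep, [...path, service]);
    }
    state.set(service, "done");
    order.push(service);
  };

  // Sort the keys so the result is stable across runs.
  for (const service of Object.keys(deps).sort()) {
    visit(service, []);
  }
  return order;
}

// Invented dependency graph using service names from this README.
const plan = restartOrder({
  "marketplace.api": ["sso.api", "profile.api"],
  "profile.api": ["sso.api"],
  "sso.api": [],
});
```

With this graph, plan is sso.api, then profile.api, then marketplace.api. Failing loudly on a cycle is a deliberate choice: a cyclic graph has no safe restart order.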

Documentation: See ROLLING_RESTART.md for complete usage guide.

Examples: See examples/rolling-restart-with-events.ts for integration patterns.

Future Enhancements

  • Blue-green deployment support
  • Canary restart (partial rollout)
  • Prometheus metrics integration
  • Distributed tracing setup
  • Auto-scaling based on load
  • Service mesh integration (Istio/Linkerd)
  • DNS-01 challenge for wildcard certificates