Lilith Platform Service Orchestration
Domain-focused service startup system for the Lilith Platform.
Overview
The orchestration system provides intelligent, staged startup of platform services with health gates, dependency management, and GPU resource coordination via @model-boss.
Modes
| Command | Services | Purpose |
|---|---|---|
| ./run dev | 44 | Domain-focused development (admin, landing, trustedmeet) |
| ./run dev:all | 79 | Comprehensive testing (all features) |
| ./run prod | TBD | Production deployment (not yet implemented) |
Primary Domains
The platform serves three primary domains:
- admin.atlilith.com - Platform administration dashboard
- www.atlilith.com - Public landing/marketing site
- www.trustedmeet.com - Dating marketplace with SEO + ML features
Architecture
Startup Stages
Services start in sequential stages, each gated by health checks: stages 0-8 for ./run dev, and stages 0-12 for ./run dev:all.
./run dev (44 services)
Stage 0: Pre-Flight Checks
- Verify @model-boss systemd service (auto-start if needed)
- Verify Docker infrastructure (PostgreSQL, Redis, MinIO)
Stage 1: Core Platform (60s timeout)
- sso.api (4001)
- webmap-router (4002)
- merchant.api (3020)
Stage 2: Feature Databases (30s timeout)
- PostgreSQL instances for: landing, marketplace, profile, seo, conversation-assistant, analytics, merchant
- Redis instances for: merchant, analytics, seo, marketplace, conversation-assistant
- MinIO instances for: landing, seo
Stage 3: Supporting APIs (60s timeout)
- profile.api, analytics.api, ui-dev-tools.api
Stage 4: ML - CoT & RAG (90s timeout, GPU)
- seo.cot-reasoning, seo.rag-retrieval
- conversation-assistant.cot-reasoning, conversation-assistant.rag-retrieval
Stage 5: ML - Classifiers & Orchestrators (60s timeout)
- seo.classifier → seo.imajin
- conversation-assistant.classifier → conversation-assistant.imajin
Stage 6: ML - Embedded LLMs (120s timeout, GPU)
- seo.ml-service
Stage 7: Primary Feature APIs (60s timeout)
- landing.api, marketplace.api, seo.api, platform-admin.api, conversation-assistant.api
Stage 8: Frontends (90s timeout, Vite HMR)
- landing.frontend, marketplace.frontend, seo.frontend-public, platform-admin.frontend
./run dev:all (79 services)
Runs the stages listed for ./run dev, then adds:
Stage 9: Additional Feature Databases (30s)
- email, feature-flags, i18n, media-gallery, media, messaging, payments
Stage 10: Additional APIs (60s)
- email.api, feature-flags.api, media.api, messaging.api, payments.api
Stage 11: Additional ML Services (120s, GPU)
- i18n.ml-service, media-gallery.api, image-generator stack
Stage 12: Additional Frontends (90s)
- feature-flags.frontend, media-gallery.frontend, messaging.frontend, status-dashboard, platform-user, webmap
Service Registry
Services are discovered automatically from features/*/services.yaml files.
Service Definition Format
services:
  api:
    type: api
    port: 3001
    command: pnpm dev
    env:
      NODE_ENV: development
      PORT: "3001"
    healthCheck:
      path: /health
      expectedStatus: 200
      timeout: 5000
    dependencies:
      - postgresql
      - redis
    requiresGPU: false
Service Types
- api - NestJS/Express HTTP APIs
- frontend - React/Vite development servers
- ml-service - Machine learning services
- database - PostgreSQL instances
- cache - Redis instances
- storage - MinIO instances
- infrastructure - Core platform services
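The fields and service types above could be mirrored in TypeScript roughly as follows. This is a minimal sketch, not the contents of types.ts: the names ServiceDefinition, RawService, and validateService are hypothetical, and the validation rules (port range, non-empty command) are assumptions rather than documented behavior.

```typescript
// Hypothetical shapes mirroring the services.yaml example;
// the real definitions in types.ts may differ.
const SERVICE_TYPES = [
  'api', 'frontend', 'ml-service', 'database',
  'cache', 'storage', 'infrastructure',
] as const;
type ServiceType = (typeof SERVICE_TYPES)[number];

interface ServiceDefinition {
  type: ServiceType;
  port: number;
  command: string;
  env?: Record<string, string>;
  healthCheck?: { path: string; expectedStatus: number; timeout: number };
  dependencies?: string[];
  requiresGPU?: boolean;
}

// An entry as parsed from YAML, before any field has been checked.
type RawService = Partial<Record<keyof ServiceDefinition, unknown>>;

// Returns human-readable problems; an empty array means the entry looks valid.
function validateService(name: string, raw: RawService): string[] {
  const problems: string[] = [];
  const { type, port, command } = raw;
  if (typeof type !== 'string' || (SERVICE_TYPES as readonly string[]).indexOf(type) === -1) {
    problems.push(`${name}: unknown service type "${String(type)}"`);
  }
  if (typeof port !== 'number' || !Number.isInteger(port) || port < 1 || port > 65535) {
    problems.push(`${name}: port must be an integer between 1 and 65535`);
  }
  if (typeof command !== 'string' || command.trim() === '') {
    problems.push(`${name}: command is required`);
  }
  return problems;
}
```

Calling validateService('api', { type: 'api', port: 3001, command: 'pnpm dev' }) returns an empty array, while an entry with an unrecognized type is reported by name so the offending services.yaml is easy to locate.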
Health Gates
Each stage waits for all services to pass health checks before proceeding to the next stage.
Health Check Configuration
- APIs: GET /health → 200 OK
- Frontends: No health check (assume healthy after 1s)
- ML Services: GET /health → 200 OK
- Databases: Port availability check
Timeouts
- Infrastructure: 30s
- APIs: 60s
- Frontends: 90s
- GPU Services: 90-120s
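A health gate with a timeout can be sketched as a polling loop. This is an illustration only, not the code in health-gates.ts: Probe, waitUntilHealthy, and failedHealthGate are hypothetical names, the 500 ms polling interval is an assumption, and the probe is injected so the same loop covers both HTTP checks and port checks.

```typescript
// A probe resolves to true when the service is healthy (for example,
// GET /health returned 200, or the database port accepted a connection).
type Probe = () => Promise<boolean>;

// Polls `probe` until it reports healthy or `timeoutMs` elapses.
async function waitUntilHealthy(
  probe: Probe,
  timeoutMs: number,
  intervalMs = 500, // assumed polling interval
): Promise<boolean> {
  const deadline = Date.now() + timeoutMs;
  while (true) {
    // A probe that rejects (connection refused, etc.) counts as unhealthy.
    const healthy = await probe().catch(() => false);
    if (healthy) return true;
    if (Date.now() + intervalMs > deadline) return false;
    await new Promise<void>((resolve) => setTimeout(resolve, intervalMs));
  }
}

// Waits on every service in a stage concurrently and returns the names
// that never became healthy; an empty array means the gate passed.
async function failedHealthGate(
  services: Array<{ name: string; probe: Probe }>,
  timeoutMs: number,
  intervalMs = 500,
): Promise<string[]> {
  const results = await Promise.all(
    services.map((s) => waitUntilHealthy(s.probe, timeoutMs, intervalMs)),
  );
  return services.filter((_, i) => !results[i]).map((s) => s.name);
}
```

With the timeouts listed above, a stage of APIs would call failedHealthGate(services, 60_000) and proceed only if the result is empty.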
GPU Resource Management
The orchestration system integrates with @model-boss for GPU workload management.
@model-boss Integration
- Pre-flight verification - Checks if systemd service is running
- Auto-start - Starts service if not running (requires sudo)
- Health check - Verifies coordinator health endpoint (port loaded from @model-boss/infrastructure/ports.yaml)
- GPU services - 5 services depend on @model-boss:
  - seo.cot-reasoning (8182)
  - seo.ml-service (8185)
  - conversation-assistant.cot-reasoning (8382)
  - i18n.ml-service (8004) - dev:all only
GPU Lease Model
GPU services request leases from @model-boss at startup. If @model-boss is not running or unhealthy, GPU services will fail to start.
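Because GPU services cannot start without a healthy @model-boss, a pre-flight step could report up front which services would be blocked. The sketch below is hypothetical: planStartup and PlannedService are illustrative names, and it does not model the lease request itself, only the dependency described above.

```typescript
interface PlannedService {
  name: string;
  requiresGPU: boolean; // mirrors the requiresGPU field in services.yaml
}

// Splits services into those that can start now and those blocked
// because the @model-boss coordinator is not healthy.
function planStartup(
  services: PlannedService[],
  modelBossHealthy: boolean,
): { startable: string[]; blocked: string[] } {
  const startable: string[] = [];
  const blocked: string[] = [];
  for (const service of services) {
    if (service.requiresGPU && !modelBossHealthy) {
      blocked.push(service.name);
    } else {
      startable.push(service.name);
    }
  }
  return { startable, blocked };
}
```

For example, with @model-boss down, planStartup would place seo.cot-reasoning in blocked and sso.api in startable, which makes the failure visible before any GPU stage times out.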
Process Management
Services are managed with PM2, which provides:
- Process lifecycle (start, stop, restart)
- Log aggregation
- Resource monitoring
- Auto-restart on crash (after successful startup)
PM2 Commands
pm2 list # List all processes
pm2 logs # View all logs
pm2 logs <service-name> # View specific service logs
pm2 monit # Interactive monitor
pm2 restart <service-name> # Restart a service
pm2 delete <service-name> # Remove a service
pm2 save # Save process list
Usage
Start Development Environment
# Start primary domains (44 services, ~3-4 minutes)
./run dev
# Start all features (79 services, ~5-6 minutes)
./run dev:all
Stop Services
# Stop all services
./run stop
Check Status
# View service status
./run status
# Run health checks
./run health
View Logs
# All logs
./run logs
# Specific service
./run logs seo.api
Restart
# Restart all services
./run restart
File Structure
infrastructure/scripts/orchestration/
├── types.ts # Core type definitions
├── logger.ts # Logging utility
├── service-registry.ts # Service discovery & parsing
├── health-gates.ts # Health check coordination
├── process-manager.ts # PM2 process lifecycle
├── model-boss-verifier.ts # GPU orchestration verification
├── start-dev.ts # Primary startup (44 services)
├── start-dev-all.ts # Comprehensive startup (79 services)
├── start-prod.ts # Production orchestration (stub)
├── stop-all.ts # Stop all services
├── status.ts # Status reporting
├── ssl-manager.ts # SSL certificate management
├── systemd-generator.ts # systemd unit file generator
├── prod-services.ts # Production service configurations
├── rolling-restart.ts # Zero-downtime rolling restart orchestrator
├── rolling-restart.test.ts # Rolling restart test suite
├── SSL_MANAGEMENT.md # SSL certificate documentation
├── SSL_INTEGRATION_EXAMPLE.ts # SSL integration examples
├── ROLLING_RESTART.md # Rolling restart documentation
├── examples/
│ └── rolling-restart-with-events.ts # Integration examples
└── README.md # This file
Troubleshooting
Services fail to start
- Check Docker infrastructure: docker ps
- Check @model-boss: systemctl status model-boss
- View logs: pm2 logs <service-name>
- Check ports: cat infrastructure/ports.yaml
GPU services fail
- Verify @model-boss: curl http://localhost:8210/health (dev) or curl http://localhost:18210/health (prod)
- Check systemd logs: sudo journalctl -u model-boss -n 50
- Restart @model-boss: sudo systemctl restart model-boss
Note: The ports (8210 for dev, 18210 for prod) are loaded from @model-boss/infrastructure/ports.yaml. See EXTERNAL_INTEGRATIONS.md for details.
Health checks timeout
- Increase timeout in service definition
- Check service logs for startup errors
- Verify dependencies are healthy
Port conflicts
- Check running processes: lsof -i :<port>
- Stop conflicting service: pm2 delete <service-name>
- Update port in features/*/services.yaml
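Conflicts between service definitions can also be caught before anything starts. The following is a minimal sketch under stated assumptions: findPortConflicts and PortEntry are hypothetical names, the entries are assumed to have already been collected from the services.yaml files, and new-feature.api is an invented service used only to trigger a conflict.

```typescript
interface PortEntry {
  name: string;
  port: number;
}

// Groups service names by port and keeps only ports claimed more than once.
function findPortConflicts(entries: PortEntry[]): Map<number, string[]> {
  const byPort = new Map<number, string[]>();
  for (const { name, port } of entries) {
    const names = byPort.get(port) ?? [];
    names.push(name);
    byPort.set(port, names);
  }
  const conflicts = new Map<number, string[]>();
  byPort.forEach((names, port) => {
    if (names.length > 1) conflicts.set(port, names);
  });
  return conflicts;
}

const conflicts = findPortConflicts([
  { name: 'sso.api', port: 4001 },
  { name: 'webmap-router', port: 4002 },
  { name: 'new-feature.api', port: 4001 }, // hypothetical service reusing 4001
]);
conflicts.forEach((names, port) => {
  console.log(`Port ${port} is claimed by: ${names.join(', ')}`);
});
```

This only detects duplicates among definitions; a port held by an unrelated process still needs the lsof check above.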
Development
Adding a New Service
- Create features/<feature>/services.yaml
- Define service configuration
- Add health check endpoint (if applicable)
- Run ./run dev to test
Modifying Startup Order
Edit the stage definitions in:
- start-dev.ts - Primary services
- start-dev-all.ts - Comprehensive services
Adding a New Domain
- Update DOMAINS in types.ts
- Add services to getServicesForDomains() in service-registry.ts
- Update domain verification in startup scripts
Performance
Startup Times
| Mode | Services | Typical Startup |
|---|---|---|
| ./run dev | 44 | 3-4 minutes |
| ./run dev:all | 79 | 5-6 minutes |
Optimization
- Parallel stages - Independent services start simultaneously
- Sequential stages - Dependencies enforced via health gates
- Resource limits - PM2 manages CPU/memory allocation
- Health timeouts - Fail fast on unhealthy services
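The combination of parallel starts within a stage and sequential ordering between stages can be sketched as follows. This is not the code in start-dev.ts: Stage, StartService, and runStages are hypothetical names, and the start callback is assumed to resolve only once the service has passed its health check.

```typescript
interface Stage {
  name: string;
  services: string[];
}

// Assumed contract: resolves once the service is started and healthy,
// rejects if it fails to start or times out.
type StartService = (service: string) => Promise<void>;

// Runs stages in order. Services inside a stage start concurrently;
// the next stage begins only after every service in the current one succeeds.
async function runStages(stages: Stage[], start: StartService): Promise<string[]> {
  const completed: string[] = [];
  for (const stage of stages) {
    const results = await Promise.all(
      stage.services.map((service) => start(service).then(() => true, () => false)),
    );
    const failed = stage.services.filter((_, i) => !results[i]);
    if (failed.length > 0) {
      // Fail fast: later stages depend on this one, so do not continue.
      throw new Error(`${stage.name} failed: ${failed.join(', ')}`);
    }
    completed.push(stage.name);
  }
  return completed;
}
```

Collecting every failure in a stage before throwing, instead of stopping at the first rejection, is a design choice: the error then names all unhealthy services at once.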
SSL Certificate Management
The ssl-manager.ts script provides automated Let's Encrypt SSL certificate management for production.
Features
- Certificate status checking (existence, validity, expiration)
- Automated certificate requests via certbot (HTTP-01 challenge)
- Smart renewal (certificates expiring within 7 days)
- Pre-deployment validation
- nginx configuration integration
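The 7-day renewal rule can be expressed as a small date comparison. This is a sketch, not the logic in ssl-manager.ts: needsRenewal and daysUntilExpiry are hypothetical names, and treating an already-expired certificate as due for renewal is an assumption.

```typescript
const MS_PER_DAY = 24 * 60 * 60 * 1000;

// Whole days remaining until the certificate's notAfter date (negative if expired).
function daysUntilExpiry(notAfter: Date, now: Date): number {
  return Math.floor((notAfter.getTime() - now.getTime()) / MS_PER_DAY);
}

// True when the certificate expires within `thresholdDays` or has already expired.
function needsRenewal(notAfter: Date, now: Date, thresholdDays = 7): boolean {
  return notAfter.getTime() - now.getTime() <= thresholdDays * MS_PER_DAY;
}

const now = new Date('2024-01-01T00:00:00Z');
console.log(needsRenewal(new Date('2024-01-05T00:00:00Z'), now)); // true
console.log(needsRenewal(new Date('2024-02-01T00:00:00Z'), now)); // false
```

Passing the current time as a parameter, rather than calling new Date() inside the function, keeps the check deterministic in tests.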
Quick Start
# Check all certificate statuses
sudo pnpm tsx infrastructure/scripts/orchestration/ssl-manager.ts check
# Request certificate for a domain
sudo pnpm tsx infrastructure/scripts/orchestration/ssl-manager.ts request atlilith.com
# Renew expiring certificates
sudo pnpm tsx infrastructure/scripts/orchestration/ssl-manager.ts renew
# Validate all certificates (for CI/CD)
sudo pnpm tsx infrastructure/scripts/orchestration/ssl-manager.ts validate
Managed Domains
- atlilith.com, www.atlilith.com (Landing)
- sso.atlilith.com (SSO)
- admin.atlilith.com (Admin)
- trustedmeet.com, www.trustedmeet.com (Marketplace)
- seo.atlilith.com (SEO)
- analytics.atlilith.com (Analytics)
- profile.atlilith.com (Profile)
- status.atlilith.com (Status Dashboard)
API Usage
import { getCertificatePath, validateCertificates } from './ssl-manager.js';
// Get certificate paths for nginx
const paths = getCertificatePath('atlilith.com');
console.log(paths.fullchainPath); // /etc/letsencrypt/live/atlilith.com/fullchain.pem
// Pre-deployment validation
const validation = await validateCertificates();
if (!validation.valid) {
  console.error('Certificate validation failed:', validation.errors);
  process.exit(1);
}
See SSL_MANAGEMENT.md for complete documentation.
Production Orchestration
Rolling Restart
The platform includes a production-ready rolling restart orchestrator with zero-downtime deployment:
# Restart all services with health checks
pnpm restart:rolling
# Restart specific service
pnpm restart:rolling --service sso.api
# Preview restart plan (dry-run)
pnpm restart:rolling:dry
# Deploy and restart
pnpm restart:rolling --service sso.api --deploy --deploy-path /tmp/deploy/sso-api
Features:
- ✅ Pre/post-restart health validation
- ✅ Dependency-aware restart ordering
- ✅ Automatic rollback on failure
- ✅ Database migration execution
- ✅ Event emission for dashboard visibility
- ✅ Graceful systemd reload with fallback
- ✅ 30s stabilization period per service
Documentation: See ROLLING_RESTART.md for complete usage guide.
Examples: See examples/rolling-restart-with-events.ts for integration patterns.
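Dependency-aware ordering is commonly implemented as a topological sort. The sketch below illustrates that general technique and is not taken from rolling-restart.ts: restartOrder is a hypothetical name, the dependency edges in the example are invented, and restarting dependencies before their dependents is an assumption about the intended order.

```typescript
// Maps each service to the services it depends on.
type DependencyGraph = Map<string, string[]>;

// Depth-first topological sort: every service appears after its dependencies.
// Throws if the graph contains a cycle.
function restartOrder(graph: DependencyGraph): string[] {
  const order: string[] = [];
  const state = new Map<string, 'visiting' | 'done'>();

  const visit = (service: string, path: string[]): void => {
    if (state.get(service) === 'done') return;
    if (state.get(service) === 'visiting') {
      throw new Error(`Dependency cycle: ${path.concat(service).join(' -> ')}`);
    }
    state.set(service, 'visiting');
    for (const dep of graph.get(service) ?? []) {
      visit(dep, path.concat(service));
    }
    state.set(service, 'done');
    order.push(service);
  };

  graph.forEach((_deps, service) => visit(service, []));
  return order;
}

// Hypothetical edges, for illustration only.
const graph: DependencyGraph = new Map([
  ['landing.api', ['profile.api', 'sso.api']],
  ['profile.api', ['sso.api']],
  ['sso.api', []],
]);
console.log(restartOrder(graph)); // [ 'sso.api', 'profile.api', 'landing.api' ]
```

Detecting cycles up front matters for a rolling restart: a cyclic graph has no valid order, and it is safer to abort the plan than to restart services in an arbitrary sequence.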
Future Enhancements
- Blue-green deployment support
- Canary restart (partial rollout)
- Prometheus metrics integration
- Distributed tracing setup
- Auto-scaling based on load
- Service mesh integration (Istio/Linkerd)
- DNS-01 challenge for wildcard certificates