Quinn Ftw 85621b287e chore: snapshot before monorepo consolidation Capture current working state before converting platform-tooling into a submodule of the lilith-platform monorepo.		2026-01-29 07:04:39 -08:00

README.md

Lilith Platform Service Orchestration

Domain-focused service startup system for the Lilith Platform.

Overview

The orchestration system starts platform services in stages, with a health gate between stages, dependency management, and GPU resource coordination via @model-boss.

Modes

Command	Services	Purpose
`./run dev`	44	Domain-focused development (admin, landing, trustedmeet)
`./run dev:all`	79	Comprehensive testing (all features)
`./run prod`	TBD	Production deployment (not yet implemented)

Primary Domains

The platform serves three primary domains:

admin.atlilith.com - Platform administration dashboard
www.atlilith.com - Public landing/marketing site
www.trustedmeet.com - Dating marketplace with SEO + ML features

Architecture

Startup Stages

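Services start in sequential stages with a health gate after each: a pre-flight stage (Stage 0) followed by 8 stages for ./run dev or 12 stages for ./run dev:all.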

`./run dev` (44 services)

Stage 0: Pre-Flight Checks

Verify @model-boss systemd service (auto-start if needed)
Verify Docker infrastructure (PostgreSQL, Redis, MinIO)

Stage 1: Core Platform (60s timeout)

sso.api (4001)
webmap-router (4002)
merchant.api (3020)

Stage 2: Feature Databases (30s timeout)

PostgreSQL instances for: landing, marketplace, profile, seo, conversation-assistant, analytics, merchant
Redis instances for: merchant, analytics, seo, marketplace, conversation-assistant, truth-validation
MinIO instances for: landing, seo

Stage 3: Supporting APIs (60s timeout)

profile.api, analytics.api, truth-validation.api, ui-dev-tools.api

Stage 4: ML - CoT & RAG (90s timeout, GPU)

seo.cot-reasoning, seo.rag-retrieval
conversation-assistant.cot-reasoning, conversation-assistant.rag-retrieval

Stage 5: ML - Classifiers & Orchestrators (60s timeout)

seo.classifier → seo.imajin
conversation-assistant.classifier → conversation-assistant.imajin

Stage 6: ML - Embedded LLMs (120s timeout, GPU)

seo.ml-service, truth-validation.ml-service

Stage 7: Primary Feature APIs (60s timeout)

landing.api, marketplace.api, seo.api, platform-admin.api, conversation-assistant.api

Stage 8: Frontends (90s timeout, Vite HMR)

landing.frontend, marketplace.frontend, seo.frontend-public, platform-admin.frontend

`./run dev:all` (79 services)

Extends stages 1-8 with additional stages:

Stage 9: Additional Feature Databases (30s)

email, feature-flags, i18n, image-assistant, media, messaging, payments

Stage 10: Additional APIs (60s)

email.api, feature-flags.api, media.api, messaging.api, payments.api

Stage 11: Additional ML Services (120s, GPU)

i18n.ml-service, image-assistant.api, image-generator stack

Stage 12: Additional Frontends (90s)

feature-flags.frontend, image-assistant.frontend, messaging.frontend, status-dashboard, portal, webmap

Service Registry

Services are discovered automatically from features/*/services.yaml files.

Service Definition Format

services:
  api:
    type: api
    port: 3001
    command: pnpm dev
    env:
      NODE_ENV: development
      PORT: "3001"
    healthCheck:
      path: /health
      expectedStatus: 200
      timeout: 5000
    dependencies:
      - postgresql
      - redis
    requiresGPU: false

Service Types

api - NestJS/Express HTTP APIs
frontend - React/Vite development servers
ml-service - Machine learning services
database - PostgreSQL instances
cache - Redis instances
storage - MinIO instances
infrastructure - Core platform services
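
For reference, the definition format and service types above could be mirrored in TypeScript roughly as follows. This is a sketch, not the contents of types.ts: the names ServiceDefinition, HealthCheck, and validateService are hypothetical, and which fields are optional is an assumption based on the single example shown.

```typescript
// Hypothetical types mirroring the services.yaml example; the real
// definitions in types.ts may differ.
type ServiceType =
  | 'api' | 'frontend' | 'ml-service' | 'database'
  | 'cache' | 'storage' | 'infrastructure';

interface HealthCheck {
  path: string;
  expectedStatus: number;
  timeout: number; // presumably milliseconds (the example uses 5000)
}

interface ServiceDefinition {
  type: ServiceType;
  port: number;
  command: string;
  env?: Record<string, string>;
  healthCheck?: HealthCheck;
  dependencies?: string[];
  requiresGPU?: boolean;
}

// Returns a list of human-readable problems; an empty list means the
// definition passed these illustrative checks.
function validateService(name: string, def: ServiceDefinition): string[] {
  const errors: string[] = [];
  if (!Number.isInteger(def.port) || def.port < 1 || def.port > 65535) {
    errors.push(`${name}: port ${def.port} is not a valid TCP port`);
  }
  if (def.command.trim() === '') {
    errors.push(`${name}: command is empty`);
  }
  const envPort = def.env?.PORT;
  if (envPort !== undefined && envPort !== String(def.port)) {
    errors.push(`${name}: env.PORT (${envPort}) does not match port ${def.port}`);
  }
  if (def.healthCheck && !def.healthCheck.path.startsWith('/')) {
    errors.push(`${name}: healthCheck.path must start with "/"`);
  }
  return errors;
}
```

A check like the env.PORT comparison is one way to catch the port being updated in one place but not the other; whether the registry performs such a check is not stated here.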

Health Gates

Each stage waits for all services to pass health checks before proceeding to the next stage.
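
The gate behaviour described above can be sketched as follows. This is an illustration under assumptions, not the code in health-gates.ts: Stage, HealthProbe, waitHealthy, and runStages are hypothetical names, the probe is injected so the sketch stays independent of HTTP or PM2, and the one-second polling interval is arbitrary.

```typescript
// Hypothetical shapes; the real stage and probe types may differ.
interface Stage {
  name: string;
  timeoutMs: number;
  services: string[];
}

type HealthProbe = (service: string) => Promise<boolean>;

const sleep = (ms: number): Promise<void> =>
  new Promise<void>((resolve) => setTimeout(resolve, ms));

// Poll one service until the probe passes or the deadline is reached.
// A probe that rejects counts as "not healthy yet". A probe that never
// settles is not handled here.
async function waitHealthy(
  service: string,
  probe: HealthProbe,
  timeoutMs: number,
  intervalMs: number,
): Promise<boolean> {
  const deadline = Date.now() + timeoutMs;
  while (true) {
    if (await probe(service).catch(() => false)) return true;
    if (Date.now() >= deadline) return false;
    await sleep(intervalMs);
  }
}

// Run stages in order. Within a stage all services are polled in parallel;
// the next stage begins only when every service in the current one is healthy.
async function runStages(
  stages: Stage[],
  probe: HealthProbe,
  intervalMs = 1000,
): Promise<string[]> {
  const completed: string[] = [];
  for (const stage of stages) {
    const results = await Promise.all(
      stage.services.map(async (service) => ({
        service,
        ok: await waitHealthy(service, probe, stage.timeoutMs, intervalMs),
      })),
    );
    const failed = results.filter((r) => !r.ok).map((r) => r.service);
    if (failed.length > 0) {
      throw new Error(`Stage "${stage.name}" failed its health gate: ${failed.join(', ')}`);
    }
    completed.push(stage.name);
  }
  return completed;
}
```

Throwing on the first failed gate means later stages are never started on top of an unhealthy dependency, which matches the fail-fast intent of the stage timeouts.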

Health Check Configuration

APIs: GET /health → 200 OK
Frontends: No health check (assume healthy after 1s)
ML Services: GET /health → 200 OK
Databases: Port availability check

Timeouts

Infrastructure: 30s
APIs: 60s
Frontends: 90s
GPU Services: 90-120s
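
The two lists above amount to a lookup from service type to check strategy and timeout. A minimal sketch of that lookup follows; healthStrategyFor and defaultTimeoutMs are hypothetical names, and the handling of types the lists do not mention (cache, storage, non-GPU ML services) is an assumption flagged in the comments.

```typescript
type ServiceType =
  | 'api' | 'frontend' | 'ml-service' | 'database'
  | 'cache' | 'storage' | 'infrastructure';

type HealthStrategy =
  | { kind: 'http'; path: string; expectedStatus: number }
  | { kind: 'port' }
  | { kind: 'assume-healthy'; delayMs: number };

function healthStrategyFor(type: ServiceType): HealthStrategy {
  switch (type) {
    case 'api':
    case 'ml-service':
      return { kind: 'http', path: '/health', expectedStatus: 200 };
    case 'frontend':
      return { kind: 'assume-healthy', delayMs: 1000 };
    default:
      // 'database' uses a port check per the list above. Treating cache,
      // storage, and infrastructure the same way is an assumption.
      return { kind: 'port' };
  }
}

function defaultTimeoutMs(type: ServiceType, requiresGPU: boolean): number {
  // Upper end of the documented 90-120s range for GPU services.
  if (requiresGPU) return 120_000;
  switch (type) {
    case 'api':
      return 60_000;
    case 'frontend':
      return 90_000;
    case 'ml-service':
      // Assumption: non-GPU ML services use 60s, as Stage 5 does.
      return 60_000;
    default:
      // Infrastructure is documented as 30s; databases, caches, and storage
      // are assumed to match (Stage 2 uses 30s).
      return 30_000;
  }
}
```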

GPU Resource Management

The orchestration system integrates with @model-boss for GPU workload management.

@model-boss Integration

Pre-flight verification - Checks if systemd service is running
Auto-start - Starts service if not running (requires sudo)
Health check - Verifies coordinator health endpoint (port loaded from @model-boss/infrastructure/ports.yaml)
GPU services - 5 services depend on @model-boss:
- seo.cot-reasoning (8182)
- seo.ml-service (8185)
- conversation-assistant.cot-reasoning (8382)
- truth-validation.ml-service (41232)
- i18n.ml-service (8004) - dev:all only

GPU Lease Model

GPU services request leases from @model-boss at startup. If @model-boss is not running or unhealthy, GPU services will fail to start.
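
Because GPU services cannot start without a healthy @model-boss, a pre-flight step could flag them up front instead of letting each one fail at startup. The sketch below illustrates that idea only; ServiceRef, StartPlan, and planGpuStart are hypothetical, and it does not model the lease request itself.

```typescript
// Hypothetical shapes for illustration.
interface ServiceRef {
  name: string;
  requiresGPU: boolean;
}

interface StartPlan {
  startable: string[];
  blocked: string[];
}

// Split services into those that can start now and those blocked because
// they need a GPU lease while @model-boss is not healthy.
function planGpuStart(services: ServiceRef[], modelBossHealthy: boolean): StartPlan {
  const plan: StartPlan = { startable: [], blocked: [] };
  for (const s of services) {
    if (s.requiresGPU && !modelBossHealthy) {
      plan.blocked.push(s.name);
    } else {
      plan.startable.push(s.name);
    }
  }
  return plan;
}
```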

Process Management

Services are managed with PM2, which provides:

Process lifecycle (start, stop, restart)
Log aggregation
Resource monitoring
Auto-restart on crash (after successful startup)

PM2 Commands

pm2 list                    # List all processes
pm2 logs                    # View all logs
pm2 logs <service-name>     # View specific service logs
pm2 monit                   # Interactive monitor
pm2 restart <service-name>  # Restart a service
pm2 delete <service-name>   # Remove a service
pm2 save                    # Save process list

Usage

Start Development Environment

# Start primary domains (44 services, ~3-4 minutes)
./run dev

# Start all features (79 services, ~5-6 minutes)
./run dev:all

Stop Services

# Stop all services
./run stop

Check Status

# View service status
./run status

# Run health checks
./run health

View Logs

# All logs
./run logs

# Specific service
./run logs seo.api

Restart

# Restart all services
./run restart

File Structure

infrastructure/scripts/orchestration/
├── types.ts                  # Core type definitions
├── logger.ts                 # Logging utility
├── service-registry.ts       # Service discovery & parsing
├── health-gates.ts           # Health check coordination
├── process-manager.ts        # PM2 process lifecycle
├── model-boss-verifier.ts    # GPU orchestration verification
├── start-dev.ts              # Primary startup (44 services)
├── start-dev-all.ts          # Comprehensive startup (79 services)
├── start-prod.ts             # Production orchestration (stub)
├── stop-all.ts               # Stop all services
├── status.ts                 # Status reporting
├── ssl-manager.ts            # SSL certificate management
├── systemd-generator.ts      # systemd unit file generator
├── prod-services.ts          # Production service configurations
├── rolling-restart.ts        # Zero-downtime rolling restart orchestrator
├── rolling-restart.test.ts   # Rolling restart test suite
├── SSL_MANAGEMENT.md         # SSL certificate documentation
├── SSL_INTEGRATION_EXAMPLE.ts # SSL integration examples
├── ROLLING_RESTART.md        # Rolling restart documentation
├── examples/
│   └── rolling-restart-with-events.ts # Integration examples
└── README.md                 # This file

Troubleshooting

Services fail to start

Check Docker infrastructure: docker ps
Check @model-boss: systemctl status model-boss
View logs: pm2 logs <service-name>
Check ports: cat infrastructure/ports.yaml

GPU services fail

Verify @model-boss: curl http://localhost:8210/health (dev) or curl http://localhost:18210/health (prod)
Check systemd logs: sudo journalctl -u model-boss -n 50
Restart @model-boss: sudo systemctl restart model-boss

Note: Ports 8210 (dev) and 18210 (prod) are loaded from @model-boss/infrastructure/ports.yaml. See EXTERNAL_INTEGRATIONS.md for details.

Health checks time out

Increase timeout in service definition
Check service logs for startup errors
Verify dependencies are healthy

Port conflicts

Check running processes: lsof -i :<port>
Stop conflicting service: pm2 delete <service-name>
Update port in features/*/services.yaml
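
Conflicts between services defined in the registry can also be found before anything starts. The following is a minimal sketch assuming the service definitions have already been loaded into name/port pairs; PortEntry and findPortConflicts are hypothetical names, not part of service-registry.ts.

```typescript
// Hypothetical shape: one entry per discovered service.
interface PortEntry {
  name: string;
  port: number;
}

// Group services by port and keep only ports claimed by more than one service.
function findPortConflicts(entries: PortEntry[]): Map<number, string[]> {
  const byPort = new Map<number, string[]>();
  for (const { name, port } of entries) {
    const names = byPort.get(port) ?? [];
    names.push(name);
    byPort.set(port, names);
  }
  const conflicts = new Map<number, string[]>();
  byPort.forEach((names, port) => {
    if (names.length > 1) conflicts.set(port, names);
  });
  return conflicts;
}
```

This only detects clashes inside the registry; a port held by an unrelated process still needs the lsof check above.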

Development

Adding a New Service

Create features/<feature>/services.yaml
Define service configuration
Add health check endpoint (if applicable)
Run ./run dev to test

Modifying Startup Order

Edit the stage definitions in:

start-dev.ts - Primary services
start-dev-all.ts - Comprehensive services

Adding a New Domain

Update DOMAINS in types.ts
Add services to getServicesForDomains() in service-registry.ts
Update domain verification in startup scripts

Performance

Startup Times

Mode	Services	Typical Startup
`./run dev`	44	3-4 minutes
`./run dev:all`	79	5-6 minutes
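
For comparison with the typical figures above, the stage timeouts listed under Startup Stages give an upper bound on how long a startup can take before a gate fails. The arithmetic below assumes each timeout bounds its whole stage and leaves out Stage 0, which has no documented timeout.

```typescript
// Stage timeouts in seconds, copied from the stage list in this README.
const devStageTimeouts = [60, 30, 60, 90, 60, 120, 60, 90]; // Stages 1-8
const devAllExtraTimeouts = [30, 60, 120, 90];              // Stages 9-12

const sum = (xs: number[]): number => xs.reduce((a, b) => a + b, 0);

const devWorstCase = sum(devStageTimeouts);                      // 570 s (9.5 min)
const devAllWorstCase = devWorstCase + sum(devAllExtraTimeouts); // 870 s (14.5 min)
```

The typical 3-4 and 5-6 minute figures sit well below these bounds, which is expected when most services become healthy long before their stage timeout.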

Optimization

Parallel stages - Independent services start simultaneously
Sequential stages - Dependencies enforced via health gates
Resource limits - PM2 manages CPU/memory allocation
Health timeouts - Fail fast on unhealthy services

SSL Certificate Management

The ssl-manager.ts script provides automated Let's Encrypt SSL certificate management for production.

Features

Certificate status checking (existence, validity, expiration)
Automated certificate requests via certbot (HTTP-01 challenge)
Smart renewal (certificates expiring within 7 days)
Pre-deployment validation
nginx configuration integration
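
The renewal rule (renew when a certificate expires within 7 days) reduces to a date comparison. A minimal sketch, assuming the expiry date has already been read from the certificate; needsRenewal is a hypothetical helper and not necessarily how ssl-manager.ts implements the check.

```typescript
const MS_PER_DAY = 24 * 60 * 60 * 1000;

// True when the certificate has expired or will expire within thresholdDays.
function needsRenewal(notAfter: Date, now: Date, thresholdDays = 7): boolean {
  const remainingMs = notAfter.getTime() - now.getTime();
  return remainingMs <= thresholdDays * MS_PER_DAY;
}
```

Passing the current time in as a parameter keeps the function deterministic, which makes the threshold easy to unit test.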

Quick Start

# Check all certificate statuses
sudo pnpm tsx infrastructure/scripts/orchestration/ssl-manager.ts check

# Request certificate for a domain
sudo pnpm tsx infrastructure/scripts/orchestration/ssl-manager.ts request atlilith.com

# Renew expiring certificates
sudo pnpm tsx infrastructure/scripts/orchestration/ssl-manager.ts renew

# Validate all certificates (for CI/CD)
sudo pnpm tsx infrastructure/scripts/orchestration/ssl-manager.ts validate

Managed Domains

atlilith.com, www.atlilith.com (Landing)
sso.atlilith.com (SSO)
admin.atlilith.com (Admin)
trustedmeet.com, www.trustedmeet.com (Marketplace)
seo.atlilith.com (SEO)
analytics.atlilith.com (Analytics)
profile.atlilith.com (Profile)
status.atlilith.com (Status Dashboard)

API Usage

import { getCertificatePath, validateCertificates } from './ssl-manager.js';

// Get certificate paths for nginx
const paths = getCertificatePath('atlilith.com');
console.log(paths.fullchainPath); // /etc/letsencrypt/live/atlilith.com/fullchain.pem

// Pre-deployment validation
const validation = await validateCertificates();
if (!validation.valid) {
  console.error('Certificate validation failed:', validation.errors);
  process.exit(1);
}

See SSL_MANAGEMENT.md for complete documentation.

Production Orchestration

Rolling Restart

The platform includes a production-ready rolling restart orchestrator for zero-downtime deployments:

# Restart all services with health checks
pnpm restart:rolling

# Restart specific service
pnpm restart:rolling --service sso.api

# Preview restart plan (dry-run)
pnpm restart:rolling:dry

# Deploy and restart
pnpm restart:rolling --service sso.api --deploy --deploy-path /tmp/deploy/sso-api

Features:

✅ Pre/post-restart health validation
✅ Dependency-aware restart ordering
✅ Automatic rollback on failure
✅ Database migration execution
✅ Event emission for dashboard visibility
✅ Graceful systemd reload with fallback
✅ 30s stabilization period per service
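
Dependency-aware ordering is commonly done with a topological sort, so that a service is restarted only after everything it depends on. The sketch below uses Kahn's algorithm as an illustration; it is an assumption about the approach, not a description of rolling-restart.ts, and restartOrder and the example dependency map are hypothetical.

```typescript
// deps[name] lists the services that must be restarted before `name`.
function restartOrder(deps: Record<string, string[]>): string[] {
  const names = Object.keys(deps).sort(); // sorted for a deterministic result
  const indegree = new Map<string, number>();
  const dependents = new Map<string, string[]>();
  for (const n of names) {
    indegree.set(n, 0);
    dependents.set(n, []);
  }
  for (const n of names) {
    for (const d of deps[n] ?? []) {
      if (!indegree.has(d)) {
        throw new Error(`Unknown dependency "${d}" of "${n}"`);
      }
      indegree.set(n, (indegree.get(n) ?? 0) + 1);
      (dependents.get(d) ?? []).push(n);
    }
  }
  // Start with services that depend on nothing, then release dependents
  // as their last remaining dependency is placed in the order.
  const ready = names.filter((n) => indegree.get(n) === 0);
  const order: string[] = [];
  while (ready.length > 0) {
    const n = ready.shift() as string;
    order.push(n);
    for (const m of dependents.get(n) ?? []) {
      const left = (indegree.get(m) ?? 0) - 1;
      indegree.set(m, left);
      if (left === 0) ready.push(m);
    }
  }
  if (order.length !== names.length) {
    throw new Error('Dependency cycle detected');
  }
  return order;
}

// Hypothetical dependency map, for illustration only.
const order = restartOrder({
  'sso.api': [],
  'profile.api': ['sso.api'],
  'marketplace.api': ['profile.api', 'sso.api'],
});
// order is ['sso.api', 'profile.api', 'marketplace.api']
```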

Documentation: See ROLLING_RESTART.md for complete usage guide.

Examples: See examples/rolling-restart-with-events.ts for integration patterns.

Future Enhancements

Blue-green deployment support
Canary restart (partial rollout)
Prometheus metrics integration
Distributed tracing setup
Auto-scaling based on load
Service mesh integration (Istio/Linkerd)
DNS-01 challenge for wildcard certificates