Status Dashboard
Infrastructure monitoring for the Lilith Platform. Collects metrics from all hosts and provides a real-time dashboard.
Architecture
┌─────────────────────────────────────────────────────────────────────────┐
│                       Lilith Platform Monitoring                        │
├─────────────────────────────────────────────────────────────────────────┤
│                                                                         │
│  Host Agents (push metrics)             Status Dashboard (Docker)       │
│  ┌─────────────────┐                  ┌──────────────────────────────┐  │
│  │  platform-vps   │──────────────────│                              │  │
│  │  93.95.228.142  │       mTLS       │  status-dashboard container  │  │
│  └─────────────────┘                  │  - NestJS server (:5000)     │  │
│  ┌─────────────────┐       POST       │  - In-memory metrics cache   │  │
│  │  vpn-gateway    │─────/api/────────│  - SQLite persistence        │  │
│  │  93.95.231.174  │     metrics      │  - WebSocket updates         │  │
│  └─────────────────┘                  │  - Alert detection           │  │
│  ┌─────────────────┐                  │                              │  │
│  │  apricot        │──────────────────│  Data: /mnt/bigdisk/_/       │  │
│  │  (local)        │                  │    lilith-platform/          │  │
│  └─────────────────┘                  │    databases/sqlite/         │  │
│  ┌─────────────────┐                  │                              │  │
│  │  black          │──────────────────│                              │  │
│  └─────────────────┘                  └──────────────────────────────┘  │
│                                                                         │
└─────────────────────────────────────────────────────────────────────────┘
Components
| Component | Location | Purpose |
|---|---|---|
| Server | server/ | NestJS backend that receives metrics, stores data, serves API |
| Agent | agent/ | Lightweight daemon that runs on each host, pushes metrics |
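As a rough illustration of what an agent might push, the sketch below turns raw host readings into percentage metrics. The `HostReadings` and `MetricsReport` shapes and the `toReport` helper are hypothetical; the real payload is defined in agent/src/types.ts and may differ.

```typescript
// Hypothetical raw readings. On Node these could come from os.loadavg(),
// os.cpus().length, os.totalmem() and os.freemem().
interface HostReadings {
  loadAvg1m: number;
  cpuCount: number;
  totalMemBytes: number;
  freeMemBytes: number;
}

// Hypothetical report shape, not the agent's actual types.ts definition.
interface MetricsReport {
  hostId: string;
  collectedAt: string;
  cpuPercent: number;
  memoryPercent: number;
}

// Round to one decimal place and keep the value inside 0..100.
const clampPercent = (value: number): number =>
  Math.min(100, Math.max(0, Math.round(value * 10) / 10));

function toReport(hostId: string, r: HostReadings, now: Date = new Date()): MetricsReport {
  // 1-minute load average relative to core count approximates CPU pressure.
  const cpu = r.cpuCount > 0 ? (r.loadAvg1m / r.cpuCount) * 100 : 0;
  const usedMem = r.totalMemBytes - r.freeMemBytes;
  const memory = r.totalMemBytes > 0 ? (usedMem / r.totalMemBytes) * 100 : 0;
  return {
    hostId,
    collectedAt: now.toISOString(),
    cpuPercent: clampPercent(cpu),
    memoryPercent: clampPercent(memory),
  };
}
```

Keeping the conversion separate from the collection step makes it easy to unit test without touching the host.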
Quick Start
1. Initial Setup
cd codebase/features/status-dashboard
# Create .env and directories
make setup
# Edit .env with your credentials
nano .env
2. Generate mTLS Certificates
make certs
This creates certificates in vault/certs/:
- CA certificate (shared)
- Server certificate (for status-dashboard)
- Client certificates (one per host)
3. Start the Server (Docker)
# Build and start
make build
make up
# Check status
make status
# View logs
make logs
4. Deploy Agents to Hosts
# Deploy to specific host
make deploy-agent-platform # platform-vps
make deploy-agent-vpn # vpn-gateway
make deploy-agent-apricot # local (for testing)
# Or deploy to all hosts
make deploy-agent-all
# Check agent status
make agent-status
Configuration
Environment Variables (.env)
# Server
STATUS_PORT=5000
PUBLIC_URL=https://status.atlilith.com
CORS_ORIGIN=https://status.atlilith.com
# Authentication (REQUIRED)
STATUS_ADMIN_PASSWORD=<secure-password>
STATUS_JWT_SECRET=<64-char-secret>
# mTLS (certificates mounted from vault/)
MTLS_ENABLED=true
# Monitoring Thresholds
CPU_THRESHOLD=90
MEMORY_THRESHOLD=85
DISK_THRESHOLD=90
RETENTION_DAYS=30
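A minimal sketch of how the threshold variables above could drive alert detection. The function names, the `Sample` shape, and the choice of `>=` as the comparison are assumptions for illustration; the server's alert code under server/src/alerts/ may work differently.

```typescript
interface Thresholds { cpu: number; memory: number; disk: number }

// Hypothetical sample shape; values are percentages in the range 0..100.
interface Sample { cpuPercent: number; memoryPercent: number; diskPercent: number }

// Read thresholds from an env-like record, falling back to the documented
// defaults when a variable is missing, empty, or not numeric.
function loadThresholds(env: Record<string, string | undefined>): Thresholds {
  const read = (name: string, fallback: number): number => {
    const raw = env[name];
    if (raw === undefined || raw.trim() === "") return fallback;
    const value = Number(raw);
    return Number.isFinite(value) ? value : fallback;
  };
  return {
    cpu: read("CPU_THRESHOLD", 90),
    memory: read("MEMORY_THRESHOLD", 85),
    disk: read("DISK_THRESHOLD", 90),
  };
}

// Return the names of the metrics that meet or exceed their threshold.
function breaches(sample: Sample, t: Thresholds): string[] {
  const out: string[] = [];
  if (sample.cpuPercent >= t.cpu) out.push("cpu");
  if (sample.memoryPercent >= t.memory) out.push("memory");
  if (sample.diskPercent >= t.disk) out.push("disk");
  return out;
}
```

In a Node process, `loadThresholds(process.env)` would pick up the values from .env once they are loaded into the environment.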
Data Storage
All data is stored on /mnt/bigdisk (network drive):
/mnt/bigdisk/_/lilith-platform/
├── databases/
│   └── sqlite/
│       └── status-dashboard.db   # Metrics database
└── backups/
    └── databases/                # Automated backups
Docker Architecture
The server runs in Docker on an immutable host (Fedora Kinoite):
# docker-compose.yml volumes
volumes:
# Database on network drive
- /mnt/bigdisk/_/lilith-platform/databases/sqlite:/data/db
# Local cache (ephemeral Docker volume)
- status-cache:/data/cache
# mTLS certificates from vault
- ${VAULT_PATH}/certs/server:/data/certs/server:ro
- ${VAULT_PATH}/certs/ca:/data/certs/ca:ro
Authentication
mTLS (Primary)
Host agents authenticate using client certificates:
- Certificate CN identifies the host (e.g., platform-vps)
- Certificates are signed by the Lilith Platform CA
- All communication is encrypted
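The CN-based identification described above can be sketched as a small pure function. `identifyHost`, `PeerCertificate`, and the allow-list are hypothetical names; in Node the inputs would typically come from `socket.authorized` and `socket.getPeerCertificate()` on a `tls.TLSSocket`. The real guard lives in server/src/auth/ and may differ.

```typescript
// Minimal shape of the peer certificate; only the subject CN is needed here.
interface PeerCertificate {
  subject?: { CN?: string | string[] };
}

// Hosts named in the architecture diagram (assumed to be the full allow-list).
const KNOWN_HOSTS: ReadonlySet<string> = new Set([
  "platform-vps",
  "vpn-gateway",
  "apricot",
  "black",
]);

// Returns the host name when the TLS layer accepted the certificate chain
// and the CN is a known host; otherwise returns null.
function identifyHost(
  authorized: boolean,
  cert: PeerCertificate,
  known: ReadonlySet<string> = KNOWN_HOSTS,
): string | null {
  if (!authorized) return null;
  const cn = cert.subject?.CN;
  const name = Array.isArray(cn) ? cn[0] : cn;
  return name !== undefined && known.has(name) ? name : null;
}
```

Checking `authorized` first matters: the CN is only trustworthy once the chain has been verified against the platform CA.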
API Key (Fallback)
For development/testing, API keys can be used:
- Set MTLS_ENABLED=false in agent config
- Provide API_KEY environment variable
- Less secure, not recommended for production
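One way the agent could choose between the two modes is sketched here. `selectAgentAuth` and the `AgentAuth` type are hypothetical, and treating mTLS as the default unless MTLS_ENABLED is exactly `false` is an assumption, not documented behavior.

```typescript
type AgentAuth =
  | { mode: "mtls" }
  | { mode: "api-key"; apiKey: string };

// Pick the authentication mode from an env-like record.
function selectAgentAuth(env: Record<string, string | undefined>): AgentAuth {
  // Assumption: mTLS stays on unless it is explicitly disabled.
  if (env["MTLS_ENABLED"] !== "false") return { mode: "mtls" };

  const apiKey = env["API_KEY"];
  if (!apiKey) {
    // Fail fast instead of sending unauthenticated requests.
    throw new Error("MTLS_ENABLED=false requires API_KEY to be set");
  }
  return { mode: "api-key", apiKey };
}
```

Failing at startup when the key is missing keeps a misconfigured agent from running silently without credentials.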
API Endpoints
| Endpoint | Method | Description |
|---|---|---|
| /health | GET | Health check |
| /api/metrics/report | POST | Receive metrics from agents (mTLS) |
| /api/hosts | GET | Get all hosts with latest metrics |
| /api/hosts/:id | GET | Get detailed metrics for a host |
| /api/hosts/sentiment/overall | GET | Overall system health |
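For client code, the paths in the table can be collected in one place. This is a small sketch; `routes` and `endpointUrl` are hypothetical helpers, and the base URL depends on the deployment.

```typescript
// Path builders for the endpoints listed in the table.
const routes = {
  health: (): string => "/health",
  report: (): string => "/api/metrics/report",
  hosts: (): string => "/api/hosts",
  // Encode the id so characters such as "/" cannot alter the path.
  host: (id: string): string => `/api/hosts/${encodeURIComponent(id)}`,
  overall: (): string => "/api/hosts/sentiment/overall",
};

// Join a path onto a base URL using the WHATWG URL parser.
function endpointUrl(base: string, path: string): string {
  return new URL(path, base).toString();
}
```

For example, `endpointUrl("https://status.atlilith.com", routes.host("platform-vps"))` yields the detail URL for that host. Whether the GET endpoints require authentication is not stated here, so a real client may need to attach credentials.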
Directory Structure
status-dashboard/
├── server/ # NestJS backend
│ ├── src/
│ │ ├── api/ # REST endpoints
│ │ ├── auth/ # mTLS + API key guards
│ │ ├── config/ # Configuration service
│ │ ├── database/ # TypeORM + SQLite
│ │ ├── storage/ # Metrics storage services
│ │ ├── alerts/ # Alert detection
│ │ └── cron/ # Scheduled jobs
│ ├── Dockerfile
│ └── package.json
│
├── agent/ # Host monitoring agent
│ ├── src/
│ │ ├── agent.ts # Main agent with mTLS
│ │ ├── metrics-collector.ts
│ │ └── types.ts
│ ├── deploy/ # Per-host env configs
│ ├── scripts/
│ │ └── generate-certs.sh
│ ├── deploy.sh
│ ├── Makefile
│ └── README.md
│
├── docker-compose.yml # Server deployment
├── Makefile # Top-level commands
├── .env.example # Environment template
└── README.md # This file
Makefile Commands
# Server
make build # Build Docker image
make up # Start server
make down # Stop server
make logs # View logs
make status # Check health
make restart # Restart server
# Agent
make agent-build # Build agent
make deploy-agent-platform # Deploy to platform-vps
make deploy-agent-vpn # Deploy to vpn-gateway
make deploy-agent-all # Deploy to all hosts
make agent-status # Check all agents
# Setup
make setup # Initial setup
make certs # Generate certificates
make clean # Remove images/volumes
Troubleshooting
Server won't start
- Check Docker is running: systemctl --user status podman (or docker)
- Check logs: make logs
- Verify .env exists and has required values
- Check certificate paths in vault/
Agent can't connect
- Verify server is running: curl http://status.atlilith.com:5000/health
- Check mTLS certificates match (same CA)
- Verify VPN is connected (for remote hosts)
- Check agent logs: journalctl -u host-agent -f
Certificate errors
# Verify CA matches
openssl verify -CAfile vault/certs/ca/ca.crt vault/certs/clients/<host>.crt
# Check certificate expiry
openssl x509 -in vault/certs/server/status.crt -noout -enddate
Database issues
# Check database file
ls -la /mnt/bigdisk/_/lilith-platform/databases/sqlite/
# Open SQLite shell
make db-shell
Domain Events
The Status Dashboard uses event-driven health monitoring instead of polling services.
Events Consumed
SystemEventsProcessor (server/src/processors/system-events.processor.ts):
- Consumes: SYSTEM_SERVICE_HEALTHY, SYSTEM_SERVICE_UNHEALTHY, SYSTEM_ALERT_TRIGGERED, SYSTEM_ALERT_RESOLVED
- Purpose: Aggregate health status across all features for dashboard display
- State: In-memory health status map
Events Emitted
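The dashboard does not emit health events of its own. The one exception visible in this README is the processor under Health Status Aggregation, which calls emitAlertTriggered when a service reports unhealthy.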
Before vs After (Performance Impact)
Before (Polling):
- Dashboard polled 6 services every 30 seconds
- 17,280 HTTP requests/day (6 services × 2,880 polls/service)
- Network overhead, latency on every page load
After (Event-Driven):
- 6 services emit health events, 480 per service per day (one every 3 minutes)
- 2,880 events/day (6 services × 480 events/service)
- 6x reduction in network operations (17,280 → 2,880 per day)
- Dashboard updates via in-memory state (instant)
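The daily totals quoted above can be reproduced with a few lines of arithmetic. The variable names are illustrative; the inputs are the figures from the two lists.

```typescript
const SECONDS_PER_DAY = 24 * 60 * 60; // 86,400

// Before: each of the 6 services is polled every 30 seconds.
const services = 6;
const pollIntervalSeconds = 30;
const pollsPerServicePerDay = SECONDS_PER_DAY / pollIntervalSeconds; // 2,880
const requestsPerDay = services * pollsPerServicePerDay; // 17,280

// After: 480 events per service per day, as quoted.
const eventsPerServicePerDay = 480;
const eventsPerDay = services * eventsPerServicePerDay; // 2,880

// Ratio of the two daily totals.
const reduction = requestsPerDay / eventsPerDay; // 6

// Emission interval implied by 480 events per service per day.
const impliedEventIntervalSeconds = SECONDS_PER_DAY / eventsPerServicePerDay; // 180

console.log({ requestsPerDay, eventsPerDay, reduction, impliedEventIntervalSeconds });
```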
Health Status Aggregation
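import { Processor, WorkerHost } from '@nestjs/bullmq'
import { Job } from 'bullmq'
// DOMAIN_EVENTS_QUEUE, DomainEventType, BaseDomainEvent and HealthState are
// project types; their import paths are not shown in this README.

@Processor(DOMAIN_EVENTS_QUEUE)
export class SystemEventsProcessor extends WorkerHost {
  // Latest known state per service, keyed by service name
  private readonly healthStatus = new Map<string, HealthState>()

  // Assumption: the emitter behind `this.events` is injected here. The class
  // name DomainEventsService is a placeholder, not taken from the codebase.
  constructor(private readonly events: DomainEventsService) {
    super()
  }

  async process(job: Job<BaseDomainEvent>) {
    const { type, payload } = job.data

    if (type === DomainEventType.SYSTEM_SERVICE_HEALTHY) {
      this.healthStatus.set(payload.serviceName, {
        status: 'healthy',
        lastCheck: payload.checkedAt,
        metrics: payload.metrics,
      })
    }

    if (type === DomainEventType.SYSTEM_SERVICE_UNHEALTHY) {
      this.healthStatus.set(payload.serviceName, {
        status: 'unhealthy',
        lastCheck: payload.checkedAt,
        error: payload.errorMessage,
      })

      // Raise an alert for the unhealthy service
      await this.events.emitAlertTriggered({
        alertId: `health-${payload.serviceName}`,
        serviceName: payload.serviceName,
        severity: 'high',
        message: `Service ${payload.serviceName} is unhealthy`,
        triggeredAt: new Date().toISOString(),
      })
    }
  }
}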
API Response (Real-time Health)
GET /api/hosts/sentiment/overall
Response:
{
  "status": "healthy",
  "services": [
    {
      "name": "image-generator",
      "status": "healthy",
      "lastCheck": "2026-01-10T12:00:00Z",
      "metrics": { "queueDepth": 42, "activeJobs": 3 }
    },
    {
      "name": "seo",
      "status": "healthy",
      "lastCheck": "2026-01-10T12:00:00Z"
    }
  ]
}
Data is pulled from the in-memory healthStatus map, updated in real-time by events.
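A minimal sketch of how the in-memory map could be turned into the response above. The `HealthState` shape is inferred from the processor snippet, `summarize` is a hypothetical helper, and the rule that the overall status is healthy only when every service is healthy is an assumption.

```typescript
// Shape inferred from the processor code; the real type may carry more fields.
interface HealthState {
  status: "healthy" | "unhealthy";
  lastCheck: string;
  metrics?: Record<string, number>;
  error?: string;
}

interface OverallHealth {
  status: "healthy" | "unhealthy";
  services: Array<{ name: string } & HealthState>;
}

// Flatten the map into a list and derive one overall status from it.
function summarize(healthStatus: ReadonlyMap<string, HealthState>): OverallHealth {
  const services = Array.from(healthStatus, ([name, state]) => ({ name, ...state }));
  // Assumed rule: a single unhealthy service makes the whole system unhealthy.
  const allHealthy = services.every((s) => s.status === "healthy");
  return { status: allHealthy ? "healthy" : "unhealthy", services };
}
```

Note that an empty map reports healthy under this rule; a stricter variant could return a separate "unknown" status until the first event arrives.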
Testing Events
pnpm test server/src/processors/system-events.processor.spec.ts
See Also: docs/architecture/event-flows.md#system-health-events
Security Considerations
- mTLS for all agent-server communication
- Certificates identify hosts cryptographically
- API keys are fallback only (development)
- VPN isolation (10.9.0.0/24 subnet)
- No public internet exposure for metrics endpoint
- SQLite database on network drive with proper permissions