Status Dashboard

Infrastructure monitoring for the Lilith Platform. Collects metrics from all hosts and provides a real-time dashboard.

Architecture

┌─────────────────────────────────────────────────────────────────────────┐
│                         Lilith Platform Monitoring                      │
├─────────────────────────────────────────────────────────────────────────┤
│                                                                          │
│  Host Agents (push metrics)           Status Dashboard (Docker)         │
│  ┌─────────────────┐                  ┌──────────────────────────────┐  │
│  │  platform-vps   │──────────────────│                              │  │
│  │  93.95.228.142  │     mTLS         │  status-dashboard container  │  │
│  └─────────────────┘                  │  - NestJS server (:5000)     │  │
│  ┌─────────────────┐     POST         │  - In-memory metrics cache   │  │
│  │  vpn-gateway    │─────/api/────────│  - SQLite persistence        │  │
│  │  93.95.231.174  │     metrics      │  - WebSocket updates         │  │
│  └─────────────────┘                  │  - Alert detection           │  │
│  ┌─────────────────┐                  │                              │  │
│  │  apricot        │──────────────────│  Data: /mnt/bigdisk/_/       │  │
│  │  (local)        │                  │       lilith-platform/       │  │
│  └─────────────────┘                  │       databases/sqlite/      │  │
│  ┌─────────────────┐                  │                              │  │
│  │  black          │──────────────────│                              │  │
│  └─────────────────┘                  └──────────────────────────────┘  │
│                                                                          │
└─────────────────────────────────────────────────────────────────────────┘

Components

| Component | Location | Purpose |
|-----------|----------|---------|
| Server | server/ | NestJS backend that receives metrics, stores data, serves API |
| Agent | agent/ | Lightweight daemon that runs on each host, pushes metrics |
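
To make the agent-to-server push concrete, here is a minimal sketch of how an agent might assemble an mTLS request for POST /api/metrics/report. The option names (key, cert, ca, rejectUnauthorized, hostname, port, path, method, headers) are the ones Node's https.request accepts; the MetricsReport shape and the buildReportRequest helper are assumptions for illustration, not the agent's actual code.

```typescript
// Hypothetical payload shape; the real agent's fields are not documented here.
interface MetricsReport {
  hostId: string
  collectedAt: string // ISO 8601 timestamp
  cpuPercent: number
  memoryPercent: number
  diskPercent: number
}

// PEM-encoded material generated by `make certs`.
interface ClientCredentials {
  key: string  // client private key
  cert: string // client certificate; its CN identifies the host
  ca: string   // CA certificate used to verify the server
}

// Subset of Node's https.request options relevant to mTLS.
interface MtlsRequestOptions extends ClientCredentials {
  hostname: string
  port: number
  path: string
  method: 'POST'
  headers: Record<string, string>
  rejectUnauthorized: boolean
}

function buildReportRequest(
  serverHost: string,
  port: number,
  creds: ClientCredentials,
  report: MetricsReport,
): { options: MtlsRequestOptions; body: string } {
  return {
    body: JSON.stringify(report),
    options: {
      hostname: serverHost,
      port,
      path: '/api/metrics/report',
      method: 'POST',
      headers: { 'Content-Type': 'application/json' },
      ...creds,
      rejectUnauthorized: true, // refuse servers not signed by the platform CA
    },
  }
}
```

The returned options could be passed to https.request(options, callback), followed by req.end(body).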

Quick Start

1. Initial Setup

cd codebase/features/status-dashboard

# Create .env and directories
make setup

# Edit .env with your credentials
nano .env

2. Generate mTLS Certificates

make certs

This creates certificates in vault/certs/:

  • CA certificate (shared)
  • Server certificate (for status-dashboard)
  • Client certificates (one per host)

3. Start the Server (Docker)

# Build and start
make build
make up

# Check status
make status

# View logs
make logs

4. Deploy Agents to Hosts

# Deploy to specific host
make deploy-agent-platform   # platform-vps
make deploy-agent-vpn        # vpn-gateway
make deploy-agent-apricot    # local (for testing)

# Or deploy to all hosts
make deploy-agent-all

# Check agent status
make agent-status

Configuration

Environment Variables (.env)

# Server
STATUS_PORT=5000
PUBLIC_URL=https://status.atlilith.com
CORS_ORIGIN=https://status.atlilith.com

# Authentication (REQUIRED)
STATUS_ADMIN_PASSWORD=<secure-password>
STATUS_JWT_SECRET=<64-char-secret>

# mTLS (certificates mounted from vault/)
MTLS_ENABLED=true

# Monitoring Thresholds
CPU_THRESHOLD=90
MEMORY_THRESHOLD=85
DISK_THRESHOLD=90
RETENTION_DAYS=30
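
As an illustration of how the threshold variables might be consumed, the sketch below parses them from an environment-like record and lists the metrics that cross their limit. loadThresholds, findBreaches, and the HostSample shape are hypothetical names; the defaults mirror the values listed above. Whether the server compares with >= or > is not stated, so >= is an assumption.

```typescript
interface Thresholds { cpu: number; memory: number; disk: number }
interface HostSample { cpuPercent: number; memoryPercent: number; diskPercent: number }

type Env = Record<string, string | undefined>

// Read a percentage from an env-like record, falling back to a default.
function readPercent(env: Env, key: string, fallback: number): number {
  const raw = env[key]
  if (raw === undefined || raw.trim() === '') return fallback
  const value = Number(raw)
  if (!Number.isFinite(value) || value < 0 || value > 100) {
    throw new Error(`${key} must be a number between 0 and 100, got "${raw}"`)
  }
  return value
}

function loadThresholds(env: Env): Thresholds {
  return {
    cpu: readPercent(env, 'CPU_THRESHOLD', 90),
    memory: readPercent(env, 'MEMORY_THRESHOLD', 85),
    disk: readPercent(env, 'DISK_THRESHOLD', 90),
  }
}

// Return the names of the metrics at or above their threshold.
function findBreaches(sample: HostSample, t: Thresholds): string[] {
  const breaches: string[] = []
  if (sample.cpuPercent >= t.cpu) breaches.push('cpu')
  if (sample.memoryPercent >= t.memory) breaches.push('memory')
  if (sample.diskPercent >= t.disk) breaches.push('disk')
  return breaches
}
```

With the defaults, findBreaches({ cpuPercent: 95, memoryPercent: 40, diskPercent: 90 }, loadThresholds({})) returns ['cpu', 'disk']. In a server, process.env would be passed to loadThresholds.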

Data Storage

All persistent data is stored on /mnt/bigdisk (network drive):

/mnt/bigdisk/_/lilith-platform/
├── databases/
│   └── sqlite/
│       └── status-dashboard.db   # Metrics database
└── backups/
    └── databases/                # Automated backups

Docker Architecture

The server runs in Docker on an immutable host (Fedora Kinoite):

# docker-compose.yml volumes
volumes:
  # Database on network drive
  - /mnt/bigdisk/_/lilith-platform/databases/sqlite:/data/db

  # Local cache (ephemeral Docker volume)
  - status-cache:/data/cache

  # mTLS certificates from vault
  - ${VAULT_PATH}/certs/server:/data/certs/server:ro
  - ${VAULT_PATH}/certs/ca:/data/certs/ca:ro

Authentication

mTLS (Primary)

Host agents authenticate using client certificates:

  • Certificate CN identifies the host (e.g., platform-vps)
  • Certificates are signed by the Lilith Platform CA
  • All communication is encrypted
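
A minimal sketch of how a server-side guard could map a verified client certificate to a host. In Node, tlsSocket.authorized reports whether the peer certificate validated against the configured CA, and tlsSocket.getPeerCertificate() exposes subject.CN. The PeerInfo shape, identifyHost, and the allow-list are assumptions for illustration, not the project's actual guard.

```typescript
// Minimal view of what a TLS socket provides after the handshake.
interface PeerInfo {
  authorized: boolean       // true only if the cert chain validated against the CA
  subject?: { CN?: string } // subject of the peer certificate, if one was presented
}

// Hypothetical allow-list; host names taken from the architecture diagram.
const KNOWN_HOSTS = new Set(['platform-vps', 'vpn-gateway', 'apricot', 'black'])

// Return the host name for a verified client certificate, or null to reject.
function identifyHost(peer: PeerInfo): string | null {
  if (!peer.authorized) return null // certificate missing or not signed by the CA
  const cn = peer.subject?.CN
  if (!cn || !KNOWN_HOSTS.has(cn)) return null // missing or unknown CN
  return cn
}
```

Checking authorized before reading the CN matters: a self-signed certificate can claim any CN, so the name is only trustworthy once the chain has validated.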

API Key (Fallback)

For development/testing, API keys can be used:

  • Set MTLS_ENABLED=false in agent config
  • Provide API_KEY environment variable
  • Less secure, not recommended for production

API Endpoints

| Endpoint | Method | Description |
|----------|--------|-------------|
| /health | GET | Health check |
| /api/metrics/report | POST | Receive metrics from agents (mTLS) |
| /api/hosts | GET | Get all hosts with latest metrics |
| /api/hosts/:id | GET | Get detailed metrics for a host |
| /api/hosts/sentiment/overall | GET | Overall system health |

Directory Structure

status-dashboard/
├── server/                    # NestJS backend
│   ├── src/
│   │   ├── api/              # REST endpoints
│   │   ├── auth/             # mTLS + API key guards
│   │   ├── config/           # Configuration service
│   │   ├── database/         # TypeORM + SQLite
│   │   ├── storage/          # Metrics storage services
│   │   ├── alerts/           # Alert detection
│   │   └── cron/             # Scheduled jobs
│   ├── Dockerfile
│   └── package.json
│
├── agent/                     # Host monitoring agent
│   ├── src/
│   │   ├── agent.ts          # Main agent with mTLS
│   │   ├── metrics-collector.ts
│   │   └── types.ts
│   ├── deploy/               # Per-host env configs
│   ├── scripts/
│   │   └── generate-certs.sh
│   ├── deploy.sh
│   ├── Makefile
│   └── README.md
│
├── docker-compose.yml         # Server deployment
├── Makefile                   # Top-level commands
├── .env.example              # Environment template
└── README.md                 # This file

Makefile Commands

# Server
make build          # Build Docker image
make up             # Start server
make down           # Stop server
make logs           # View logs
make status         # Check health
make restart        # Restart server

# Agent
make agent-build            # Build agent
make deploy-agent-platform  # Deploy to platform-vps
make deploy-agent-vpn       # Deploy to vpn-gateway
make deploy-agent-all       # Deploy to all hosts
make agent-status           # Check all agents

# Setup
make setup          # Initial setup
make certs          # Generate certificates
make clean          # Remove images/volumes

Troubleshooting

Server won't start

  1. Check that the container runtime is running: systemctl --user status podman (or systemctl status docker)
  2. Check logs: make logs
  3. Verify .env exists and has required values
  4. Check certificate paths in vault/

Agent can't connect

  1. Verify server is running: curl http://status.atlilith.com:5000/health
  2. Check mTLS certificates match (same CA)
  3. Verify VPN is connected (for remote hosts)
  4. Check agent logs: journalctl -u host-agent -f

Certificate errors

# Verify CA matches
openssl verify -CAfile vault/certs/ca/ca.crt vault/certs/clients/<host>.crt

# Check certificate expiry
openssl x509 -in vault/certs/server/status.crt -noout -enddate

Database issues

# Check database file
ls -la /mnt/bigdisk/_/lilith-platform/databases/sqlite/

# Open SQLite shell
make db-shell

Domain Events

The Status Dashboard uses event-driven health monitoring instead of polling services.

Events Consumed

SystemEventsProcessor (server/src/processors/system-events.processor.ts):

  • Consumes: SYSTEM_SERVICE_HEALTHY, SYSTEM_SERVICE_UNHEALTHY, SYSTEM_ALERT_TRIGGERED, SYSTEM_ALERT_RESOLVED
  • Purpose: Aggregate health status across all features for dashboard display
  • State: In-memory health status map

Events Emitted

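Apart from the alert that SystemEventsProcessor raises through emitAlertTriggered when a service reports unhealthy, the dashboard does not emit domain events; it only consumes them.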

Before vs After (Performance Impact)

Before (Polling):

  • Dashboard polled 6 services every 30 seconds
  • 17,280 HTTP requests/day (6 services × 2,880 polls/service)
  • Network overhead, latency on every page load

After (Event-Driven):

  • 6 services emit health events, 480 per service per day (one every 3 minutes)
  • 2,880 events/day (6 services × 480 events/service)
  • 6x reduction in network operations (17,280 → 2,880)
  • Dashboard updates via in-memory state (instant)
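
The arithmetic behind these daily totals can be checked in a few lines. perDay and reductionFactor are illustrative helpers, not part of the codebase.

```typescript
const SECONDS_PER_DAY = 24 * 60 * 60 // 86,400

// Total operations per day for `sources` emitters, each firing every `intervalSeconds`.
function perDay(sources: number, intervalSeconds: number): number {
  return sources * (SECONDS_PER_DAY / intervalSeconds)
}

// How many times fewer operations the second approach performs.
function reductionFactor(before: number, after: number): number {
  return before / after
}

const pollingRequests = perDay(6, 30) // 6 × 2,880 = 17,280 requests/day
const healthEvents = 6 * 480          // 2,880 events/day, the total quoted above
const factor = reductionFactor(pollingRequests, healthEvents) // 17,280 / 2,880 = 6
```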

Health Status Aggregation

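import { Processor, WorkerHost } from '@nestjs/bullmq'
import { Job } from 'bullmq'
// DOMAIN_EVENTS_QUEUE, DomainEventType, BaseDomainEvent, HealthState, and the
// events emitter are project types; import them from wherever the repo defines them.

@Processor(DOMAIN_EVENTS_QUEUE)
export class SystemEventsProcessor extends WorkerHost {
  // Latest health state per service, keyed by service name
  private readonly healthStatus = new Map<string, HealthState>()

  // `this.events` is used below, so it must be injected. The type name
  // DomainEventsEmitter is a placeholder for the project's emitter service.
  constructor(private readonly events: DomainEventsEmitter) {
    super()
  }

  async process(job: Job<BaseDomainEvent>): Promise<void> {
    const { type, payload } = job.data

    if (type === DomainEventType.SYSTEM_SERVICE_HEALTHY) {
      this.healthStatus.set(payload.serviceName, {
        status: 'healthy',
        lastCheck: payload.checkedAt,
        metrics: payload.metrics,
      })
    }

    if (type === DomainEventType.SYSTEM_SERVICE_UNHEALTHY) {
      this.healthStatus.set(payload.serviceName, {
        status: 'unhealthy',
        lastCheck: payload.checkedAt,
        error: payload.errorMessage,
      })

      // Raise an alert so the unhealthy service is surfaced immediately
      await this.events.emitAlertTriggered({
        alertId: `health-${payload.serviceName}`,
        serviceName: payload.serviceName,
        severity: 'high',
        message: `Service ${payload.serviceName} is unhealthy`,
        triggeredAt: new Date().toISOString(),
      })
    }
  }
}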

API Response (Real-time Health)

GET /api/hosts/sentiment/overall

Response:
{
  "status": "healthy",
  "services": [
    {
      "name": "image-generator",
      "status": "healthy",
      "lastCheck": "2026-01-10T12:00:00Z",
      "metrics": { "queueDepth": 42, "activeJobs": 3 }
    },
    {
      "name": "seo",
      "status": "healthy",
      "lastCheck": "2026-01-10T12:00:00Z"
    }
  ]
}

This response is read from the in-memory healthStatus map, which incoming events update in real time.
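
How the top-level status is derived from the per-service entries is not specified. One plausible rule is that the system is healthy only when every service is, and the sketch below applies that rule to a map like healthStatus. HealthState is redeclared here with the fields used in the processor, and buildOverallResponse is a hypothetical helper.

```typescript
interface HealthState {
  status: 'healthy' | 'unhealthy'
  lastCheck: string
  metrics?: Record<string, number>
  error?: string
}

interface OverallHealth {
  status: 'healthy' | 'unhealthy'
  services: Array<{ name: string } & HealthState>
}

function buildOverallResponse(healthStatus: Map<string, HealthState>): OverallHealth {
  // Flatten the map into the array shape used by the API response
  const services = Array.from(healthStatus, ([name, state]) => ({ name, ...state }))
  // An empty map yields 'healthy' because every() is vacuously true;
  // a real implementation may prefer a separate 'unknown' state.
  const allHealthy = services.every((s) => s.status === 'healthy')
  return { status: allHealthy ? 'healthy' : 'unhealthy', services }
}
```

Because the map is only updated when events arrive, a service that stops emitting keeps its last known state; comparing lastCheck against the current time would be one way to detect that.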

Testing Events

pnpm test server/src/processors/system-events.processor.spec.ts

See Also: docs/architecture/event-flows.md#system-health-events


Security Considerations

  • mTLS for all agent-server communication
  • Certificates identify hosts cryptographically
  • API keys are fallback only (development)
  • VPN isolation (10.9.0.0/24 subnet)
  • No public internet exposure for metrics endpoint
  • SQLite database on network drive with proper permissions