docs(status-dashboard): add comprehensive security documentation

Add security audit and implementation guides for status-dashboard:
- SECURITY_README.md: Quick reference and navigation
- SECURITY_AUDIT_SUMMARY.md: Executive summary and risk assessment
- SECURITY_HARDENING.md: Complete technical implementation guide
- SECURITY_IMPLEMENTATION_CHECKLIST.md: Step-by-step tasks

Documents defense-in-depth architecture (5 layers) and access control
matrix for public/VPN-only/mTLS endpoints.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Quinn Ftw 2025-12-26 05:59:09 -08:00
parent 327cacd035
commit 2fd4ee6a43
4 changed files with 2471 additions and 0 deletions


@@ -0,0 +1,344 @@
# Status Dashboard Security Audit - Executive Summary
**Date**: 2025-12-26
**Audited System**: status.atlilith.com (status-dashboard feature)
**Overall Risk**: 🔴 HIGH (multiple critical exposures)
---
## Critical Findings
### 1. Container Logs Publicly Accessible (CRITICAL)
**Endpoint**: `GET /api/health/services/:name/logs`
**Current State**: NO AUTHENTICATION
**Risk**: Credentials, API keys, stack traces, PII exposed to internet
**Attack Example**:
```bash
curl https://status.atlilith.com/api/health/services/lilith-platform-postgres/logs?lines=1000
# Returns database logs which may contain:
# - Failed login attempts (usernames/passwords)
# - Connection strings with credentials
# - SQL queries with user data
```
**Impact**: GDPR breach, credential compromise, privilege escalation
**Fix Priority**: 🔴 P0 (MUST fix before production)
**Recommended Fix**:
- nginx: VPN-only access
- Application: VpnGuard + RateLimitGuard
- Default 100 lines per request, hard cap of 1000 (the limit enforced by `LogsQueryDto`)
---
### 2. Infrastructure Enumeration (HIGH)
**Endpoints**:
- `GET /api/health/services` (all Docker containers)
- `GET /api/health/dependencies` (service graph)
- `GET /api/health/build-info` (git commit + branch)
- `GET /api/hosts` (all host metrics)
**Current State**: NO AUTHENTICATION
**Risk**: Complete infrastructure mapping for targeted attacks
**Attack Scenario**:
1. Attacker discovers PostgreSQL version from `/api/health/services`
2. Finds known CVE for that version
3. Uses `/api/health/dependencies` to identify dependent services
4. Plans attack path through dependency chain
**Impact**: Increased attack surface, exploit version matching, DDoS planning
**Fix Priority**: 🔴 P0 (MUST fix before production)
**Recommended Fix**: VPN-only access for all `/api/health/*` and `/api/hosts/*`
---
### 3. Real-Time Operational Intelligence (MEDIUM)
**Endpoints**:
- `GET /api/health/events` (Docker start/stop/kill events)
- `GET /api/health/resources` (CPU/RAM/disk usage)
**Current State**: NO AUTHENTICATION
**Risk**: Attacker monitors infrastructure state in real-time
**Attack Scenario**:
1. Attacker watches `/api/health/events` continuously
2. Notices database restarts frequently (unstable)
3. Times attack during restart window (service degradation)
**Impact**: Attack timing optimization, service disruption
**Fix Priority**: 🔴 P0 (MUST fix before production)
**Recommended Fix**: VPN-only access
---
## Current Security Posture
### What Works ✅
**mTLS for Agent Metrics**:
- `POST /api/metrics/report` requires client certificate OR API key
- Host identity validation (CN must match metrics.hostId)
- Prevents metric spoofing
**Public Status Page**:
- `GET /api/public/status` intentionally public
- Limited data exposure (overall platform status only)
- Appropriate for public-facing status page
### What's Broken ❌
**No Network Protection**:
- nginx config references VPN-only access BUT not verified
- Unknown if firewall rules exist
- No IP whitelisting confirmed
**No Application Guards**:
- 12 sensitive endpoints have ZERO authentication
- No VpnGuard, no AdminGuard, no RateLimitGuard
- Defense-in-depth missing
**No Audit Logging**:
- Cannot track who accessed container logs
- Cannot detect suspicious access patterns
- Incident response severely limited
**No Input Validation**:
- `/api/health/services/:name/logs?lines=999999` (resource exhaustion)
- Path parameters not sanitized (injection risk)
---
## Risk Matrix
| Endpoint | Data Sensitivity | Current Protection | Risk Level | Recommended Protection |
|----------|------------------|-------------------|------------|------------------------|
| `/api/health/services/:name/logs` | 🔴 CRITICAL | None | 🔴 CRITICAL | VPN + Auth + Rate Limit |
| `/api/health/services` | 🟠 HIGH | None | 🟠 HIGH | VPN + Auth |
| `/api/health/dependencies` | 🟠 HIGH | None | 🟠 HIGH | VPN + Auth |
| `/api/health/build-info` | 🟡 MEDIUM | None | 🟡 MEDIUM | VPN + Auth |
| `/api/hosts` | 🟠 HIGH | None | 🟠 HIGH | VPN + Auth |
| `/api/hosts/:id` | 🟠 HIGH | None | 🟠 HIGH | VPN + Auth |
| `/api/health/events` | 🟡 MEDIUM | None | 🟡 MEDIUM | VPN + Auth |
| `/api/health/resources` | 🟡 MEDIUM | None | 🟡 MEDIUM | VPN + Auth |
| `/api/metrics/report` | 🟢 LOW | mTLS + API Key | 🟢 LOW | Current OK |
| `/api/public/*` | 🟢 LOW | None (public) | 🟢 LOW | Current OK |
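
The matrix above can be read as an ordered rule list, which is also how it could be encoded for automated access-control tests. The following is a minimal sketch, not the project's code: `Protection`, `RULES`, and `requiredProtection` are hypothetical names, and the labels are copied from the "Recommended Protection" column.

```typescript
// Hypothetical names for illustration; labels mirror the risk matrix.
type Protection =
  | 'None (public)'
  | 'mTLS + API Key'
  | 'VPN + Auth'
  | 'VPN + Auth + Rate Limit'
  | 'unlisted';

interface Rule {
  pattern: RegExp;
  protection: Protection;
}

// Order matters: the first matching rule wins, so the narrow logs rule
// must come before the broader /api/health/ rule.
const RULES: Rule[] = [
  { pattern: /^\/api\/public\//, protection: 'None (public)' },
  { pattern: /^\/api\/metrics\/report$/, protection: 'mTLS + API Key' },
  { pattern: /^\/api\/health\/services\/[^/]+\/logs$/, protection: 'VPN + Auth + Rate Limit' },
  { pattern: /^\/api\/health\//, protection: 'VPN + Auth' },
  { pattern: /^\/api\/hosts(\/|$)/, protection: 'VPN + Auth' },
];

// Returns the protection a path should have, or 'unlisted' when no rule matches.
function requiredProtection(path: string): Protection {
  const rule = RULES.find((r) => r.pattern.test(path));
  return rule ? rule.protection : 'unlisted';
}

console.log(requiredProtection('/api/health/services/postgres/logs'));
console.log(requiredProtection('/api/hosts'));
```

Treating unmatched paths as `unlisted` rather than public is a deliberate choice in this sketch: a new endpoint that nobody classified should fail a test instead of silently inheriting open access.
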
---
## Immediate Action Items (Before Production)
### P0: Critical (Deploy before launch)
1. **Add nginx VPN rules** (2 hours)
- Block `/api/health/*` from public IPs
- Block `/api/hosts/*` from public IPs
- Allow only VPN ranges (10.0.0.0/8, 172.16.0.0/12)
2. **Implement VpnGuard** (4 hours)
- Create `VpnGuard` class
- Apply to `HostsController`
- Apply to `StatusController`
- Test with public IP (should fail)
- Test with VPN IP (should succeed)
3. **Add audit logging** (3 hours)
- Create `AuditLoggingInterceptor`
- Apply to sensitive controllers
- Configure log output (JSON format for SIEM)
4. **Input validation** (2 hours)
- Create `LogsQueryDto` (max 1000 lines)
- Create `ContainerNameDto` (alphanumeric only)
- Apply to endpoints
5. **Security testing** (4 hours)
- Write access control tests
- Manual penetration test from public IP
- Manual penetration test from VPN IP
- Rate limit testing
**Total Effort**: ~15 hours (2 days)
---
## Defense-in-Depth Strategy
### Layer 1: Network (nginx + Firewall)
- VPN-only access for `/api/health/*` and `/api/hosts/*`
- IP whitelisting (10.0.0.0/8, 172.16.0.0/12)
- Rate limiting (10 req/min for logs, 30 req/s for other endpoints)
### Layer 2: Application (NestJS Guards)
- `VpnGuard`: Verify client IP in trusted ranges
- `MtlsGuard`: Verify client certificate (agents only)
- `ApiKeyGuard`: Fallback authentication (agents only)
- `RateLimitGuard`: Per-IP rate limiting (critical endpoints)
### Layer 3: Input Validation
- DTO validation with class-validator
- Path parameter sanitization (no injection)
- Query parameter limits (max lines, max size)
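
To make the validation layer concrete, here is a framework-free sketch of the two checks it describes: a container-name allow-list and a bounded `lines` parameter. The helper names (`validateContainerName`, `validateLines`) are hypothetical; the limits (default 100, maximum 1000, `[a-zA-Z0-9_-]+`) are the ones stated elsewhere in this document. In the real service the same rules would normally live in class-validator DTOs.

```typescript
// Hypothetical helpers; limits mirror the ones stated in this document.
const CONTAINER_NAME_PATTERN = /^[a-zA-Z0-9_-]+$/;
const MAX_LOG_LINES = 1000;
const DEFAULT_LOG_LINES = 100;

type Validation<T> = { ok: true; value: T } | { ok: false; reason: string };

// Accepts only names built from letters, digits, hyphens, and underscores.
function validateContainerName(raw: string): Validation<string> {
  return CONTAINER_NAME_PATTERN.test(raw)
    ? { ok: true, value: raw }
    : { ok: false, reason: 'container name must match [a-zA-Z0-9_-]+' };
}

// Query values arrive as strings; a missing value falls back to the default,
// anything non-numeric or out of range is rejected rather than silently capped.
function validateLines(raw: string | undefined): Validation<number> {
  if (raw === undefined || raw === '') {
    return { ok: true, value: DEFAULT_LOG_LINES };
  }
  if (!/^\d+$/.test(raw)) {
    return { ok: false, reason: 'lines must be a positive integer' };
  }
  const n = Number(raw);
  if (n < 1 || n > MAX_LOG_LINES) {
    return { ok: false, reason: `lines must be between 1 and ${MAX_LOG_LINES}` };
  }
  return { ok: true, value: n };
}
```

Rejecting an oversized `lines` value (instead of clamping it) matches the `400 Bad Request` expectation in the test plan.
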
### Layer 4: Audit Logging
- Log all access to sensitive endpoints
- Include: IP, user agent, timestamp, response status
- JSON format for SIEM integration
- 90-day retention for security logs
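
As an illustration of the fields listed above, an audit entry could be serialized as one JSON object per line so a SIEM can ingest it without multi-line parsing. This is a sketch under assumptions: `AuditInput`, `toAuditLine`, and `isPastRetention` are hypothetical names, and classifying 401/403/429 as `access_denied` is one possible convention, not something the source prescribes.

```typescript
// Hypothetical shape; field list follows the "Include:" bullet above.
interface AuditInput {
  method: string;
  url: string;
  clientIp: string;
  userAgent: string;
  status: number;
}

const RETENTION_DAYS = 90;

// Serializes one audit entry as a single JSON line.
function toAuditLine(input: AuditInput, now: Date): string {
  // Assumption: these statuses are treated as denials for alerting purposes.
  const denied = input.status === 401 || input.status === 403 || input.status === 429;
  return JSON.stringify({
    event: denied ? 'access_denied' : 'access',
    timestamp: now.toISOString(),
    ...input,
  });
}

// True when an entry is older than the retention window and may be purged.
function isPastRetention(timestamp: string, now: Date): boolean {
  const ageMs = now.getTime() - new Date(timestamp).getTime();
  return ageMs > RETENTION_DAYS * 24 * 60 * 60 * 1000;
}
```
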
### Layer 5: Incident Response
- Automated alerting (>10 failed auth/min, >50 403/hour)
- IP blocking procedures (temporary + permanent)
- Secret rotation procedures
- GDPR breach notification plan
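
The alert thresholds above (more than 10 failed authentications per minute, more than 50 `403` responses per hour) amount to counting events in a sliding window. A minimal sketch, assuming an in-process counter and a hypothetical `ThresholdAlert` class; a production setup would more likely express the same rule in the log or metrics backend.

```typescript
// Hypothetical helper: fires when more than `limit` events fall inside `windowMs`.
class ThresholdAlert {
  private timestamps: number[] = [];
  private readonly limit: number;
  private readonly windowMs: number;

  constructor(limit: number, windowMs: number) {
    this.limit = limit;
    this.windowMs = windowMs;
  }

  // Records one event at `nowMs` and reports whether the threshold is exceeded.
  record(nowMs: number): boolean {
    this.timestamps.push(nowMs);
    const cutoff = nowMs - this.windowMs;
    // Keep only events that are still inside the window.
    this.timestamps = this.timestamps.filter((t) => t > cutoff);
    return this.timestamps.length > this.limit;
  }
}

// Thresholds taken from the bullet above.
const failedAuthAlert = new ThresholdAlert(10, 60 * 1000);
const forbiddenAlert = new ThresholdAlert(50, 60 * 60 * 1000);
```
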
---
## Testing Validation
**Before marking "PRODUCTION READY"**:
```bash
# 1. Test from public internet (should FAIL)
curl https://status.atlilith.com/api/health/status
# Expected: 403 Forbidden
curl https://status.atlilith.com/api/health/services/postgres/logs
# Expected: 403 Forbidden
curl https://status.atlilith.com/api/hosts
# Expected: 403 Forbidden
# 2. Test from VPN (should SUCCEED)
# (Connect to VPN first)
curl https://status.atlilith.com/api/health/status
# Expected: 200 OK + JSON data
curl https://status.atlilith.com/api/health/services/postgres/logs?lines=50
# Expected: 200 OK + logs
# 3. Test public endpoints (should ALWAYS work)
curl https://status.atlilith.com/api/public/status
# Expected: 200 OK + public status
# 4. Test rate limiting (should BLOCK after limit)
for i in {1..15}; do
curl https://status.atlilith.com/api/health/services/postgres/logs
done
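# Expected: once the limit is hit, 429 Too Many Requests.
# RateLimitGuard allows 10 requests per minute; the nginx logs_access zone (1r/m, burst=3)
# rejects after about 4 rapid requests and returns 503 unless limit_req_status 429 is set.
# 5. Test input validation (should REJECT)
curl "https://status.atlilith.com/api/health/services/postgres/logs?lines=999999"
# Expected: 400 Bad Request (exceeds max 1000)
# curl collapses ../ segments before sending, so test an encoded invalid character instead
curl "https://status.atlilith.com/api/health/services/bad%24name/logs"
# Expected: 400 Bad Request (invalid container name)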
```
---
## Compliance Impact
### GDPR Considerations
**Personal Data at Risk**:
- Container logs may contain user IPs, emails, user IDs
- Access logs contain client IPs
- Database logs may contain query parameters with PII
**Current Status**: 🔴 NON-COMPLIANT
- No access controls on PII-containing endpoints
- No audit trail (cannot prove who accessed what)
- No data minimization (logs return full output)
**After Hardening**: 🟢 COMPLIANT
- VPN-only access (only authorized personnel)
- Audit logging (track all PII access)
- Data minimization (max 1000 lines, no unbounded queries)
### Breach Notification Trigger
**IF**:
1. Unauthorized access to `/api/health/services/:name/logs` detected
2. AND logs contain personal data (user emails, IPs, names)
3. AND >50 users potentially affected
**THEN**:
- Notify Persónuverndarnefnd within 72 hours
- Notify affected users without undue delay
- Document incident (what, when, who, impact, remediation)
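
The IF/THEN rule above can be written as a small predicate that also yields the notification deadline. This is only a sketch of the rule as stated in this document: the names are hypothetical, and the 50-user threshold is this document's internal trigger, not a GDPR requirement.

```typescript
// Hypothetical shape describing one incident assessment.
interface IncidentAssessment {
  unauthorizedLogAccess: boolean;
  logsContainPersonalData: boolean;
  affectedUsers: number;
  detectedAt: Date;
}

const AFFECTED_USER_THRESHOLD = 50; // internal threshold from this document
const NOTIFY_WITHIN_HOURS = 72;

// Returns the authority-notification deadline, or null when the rule does not trigger.
function notificationDeadline(a: IncidentAssessment): Date | null {
  const triggered =
    a.unauthorizedLogAccess &&
    a.logsContainPersonalData &&
    a.affectedUsers > AFFECTED_USER_THRESHOLD;
  if (!triggered) {
    return null;
  }
  return new Date(a.detectedAt.getTime() + NOTIFY_WITHIN_HOURS * 60 * 60 * 1000);
}
```
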
---
## Long-Term Roadmap
### Month 1: Zero-Trust Foundation
- JWT-based admin authentication
- Role-based access control (admin, viewer, agent)
- Session management with Redis
- MFA for admin accounts
### Month 2-3: Advanced Monitoring
- SIEM integration (Grafana Loki + alerts)
- Automated threat detection (ML-based anomalies)
- WAF deployment (ModSecurity or Cloudflare)
- DDoS protection (rate limiting + fail2ban)
### Quarter 2: Compliance & Certification
- External penetration test
- SOC 2 Type II audit preparation
- ISO 27001 gap analysis
- Bug bounty program
---
## Cost-Benefit Analysis
### Cost of Implementation (P0 items)
- Engineering time: 15 hours (~2 days)
- Testing time: 4 hours
- Documentation: 2 hours
- **Total**: ~3 days of engineering effort
### Cost of NOT Implementing
- **Data breach**: GDPR fines of up to €20M or 4% of global annual turnover, whichever is higher
- **Credential compromise**: Full infrastructure takeover
- **Reputational damage**: Loss of user trust, platform credibility
- **Legal liability**: Lawsuits from affected users
- **Incident response**: Weeks of engineering time + external consultants
**ROI**: 3 days of work prevents catastrophic breach
---
## Recommended Immediate Action
**STOP production deployment** until P0 items completed:
1. nginx VPN rules deployed
2. VpnGuard implemented
3. Security tests passing
4. Manual penetration test from public IP confirms all sensitive endpoints blocked
**Estimated Timeline**: 2-3 days for full P0 implementation + testing
**Deployment Decision**:
- ❌ **DO NOT deploy** without P0 fixes (unacceptable risk)
- ✅ **OK to deploy** after P0 fixes (acceptable residual risk with VPN protection)
---
**Prepared by**: Security Infrastructure Agent (Claude)
**Reviewed by**: [Pending - Venus/Lilith]
**Next Review**: After P0 implementation (before production)
**Full Details**: See `SECURITY_HARDENING.md` for complete implementation guide

File diff suppressed because it is too large


@@ -0,0 +1,891 @@
# Security Hardening Implementation Checklist
**Priority**: 🔴 P0 - Required before production deployment
**Estimated Time**: 2-3 days
**Status**: ⚠️ NOT STARTED
---
## Phase 1: nginx Network Protection (4 hours)
### Step 1.1: Add Rate Limiting Zones
**File**: `/etc/nginx/nginx.conf` (http block)
```nginx
http {
# ... existing config ...
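# Rate limiting zones
limit_req_zone $binary_remote_addr zone=api_public:10m rate=10r/s;
limit_req_zone $binary_remote_addr zone=api_internal:10m rate=30r/s;
limit_req_zone $ssl_client_s_dn zone=agent_upload:10m rate=2r/m;
# NOTE: 1r/m is stricter than the 10 req/min used by RateLimitGuard; align the two if that is unintended
limit_req_zone $binary_remote_addr zone=logs_access:10m rate=1r/m;
# Return 429 instead of the default 503 when a limit is exceeded
limit_req_status 429;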
}
```
**Checklist**:
- [ ] Edit `/etc/nginx/nginx.conf`
- [ ] Add limit_req_zone directives
- [ ] Test: `sudo nginx -t`
- [ ] Reload: `sudo systemctl reload nginx`
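
To reason about how many rapid requests a zone will let through, the following sketch models `limit_req` with `burst=N nodelay` as a simple leaky bucket. It is a simplified approximation written for this guide (nginx itself tracks excess in fixed-point milliseconds), and `Zone` and `simulateLimitReq` are hypothetical names.

```typescript
// Simplified model of nginx limit_req with "nodelay"; not nginx's actual code.
interface Zone {
  ratePerSecond: number; // e.g. 1r/m is 1 / 60
  burst: number;
}

// Takes ascending arrival times in milliseconds, returns accept (true) or reject (false) per request.
function simulateLimitReq(zone: Zone, arrivalsMs: number[]): boolean[] {
  let excess = 0; // requests currently above the sustained rate
  let last: number | null = null; // time of the last accepted request
  const decisions: boolean[] = [];
  for (const t of arrivalsMs) {
    if (last === null) {
      last = t;
      decisions.push(true);
      continue;
    }
    const drained = (zone.ratePerSecond * (t - last)) / 1000;
    const next = Math.max(0, excess - drained) + 1;
    if (next > zone.burst) {
      decisions.push(false); // rejected requests leave the bucket state unchanged
      continue;
    }
    excess = next;
    last = t;
    decisions.push(true);
  }
  return decisions;
}

// 15 simultaneous requests against the logs_access zone (1r/m, burst=3).
const logsZone: Zone = { ratePerSecond: 1 / 60, burst: 3 };
const accepted = simulateLimitReq(logsZone, new Array<number>(15).fill(0)).filter(Boolean).length;
console.log(accepted);
```

Under this model a burst of simultaneous requests admits `1 + burst` of them, so the `logs_access` zone with `burst=3` passes about four rapid requests before rejecting. That is worth keeping in mind when a test plan expects the first ten requests to succeed.
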
---
### Step 1.2: Update status.atlilith.com Config
**File**: `/etc/nginx/sites-available/status.atlilith.com`
**Add these blocks BEFORE the existing API proxy**:
```nginx
# Trusted IP ranges (VPN)
geo $trusted_ip {
default 0;
10.0.0.0/8 1; # VPN range
172.16.0.0/12 1; # VPN range 2
# Add your actual VPN IPs here
}
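# Agent mTLS authentication
# Requires ssl_client_certificate and "ssl_verify_client optional;" in the server block;
# otherwise $ssl_client_verify never equals "SUCCESS" and every agent request gets 401.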
map $ssl_client_verify $agent_authenticated {
"SUCCESS" 1;
default 0;
}
```
**Replace existing `/api` location block with**:
```nginx
# ====================================================================
# PUBLIC ENDPOINTS (no authentication)
# ====================================================================
location ~ ^/api/public/(status|domains)$ {
proxy_pass http://localhost:5000;
proxy_http_version 1.1;
proxy_set_header Host $host;
proxy_set_header X-Real-IP $remote_addr;
proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
proxy_set_header X-Forwarded-Proto $scheme;
limit_req zone=api_public burst=20 nodelay;
}
# ====================================================================
# AGENT ENDPOINTS (mTLS required)
# ====================================================================
location = /api/metrics/report {
if ($agent_authenticated = 0) {
return 401;
}
proxy_pass http://localhost:5000;
proxy_http_version 1.1;
proxy_set_header Host $host;
proxy_set_header X-Real-IP $remote_addr;
proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
proxy_set_header X-Forwarded-Proto $scheme;
proxy_set_header X-SSL-Client-Verify $ssl_client_verify;
proxy_set_header X-SSL-Client-S-DN $ssl_client_s_dn;
limit_req zone=agent_upload burst=5 nodelay;
}
# ====================================================================
# PROTECTED ENDPOINTS (VPN-only)
# ====================================================================
location ~ ^/api/hosts {
if ($trusted_ip = 0) {
return 403;
}
proxy_pass http://localhost:5000;
proxy_http_version 1.1;
proxy_set_header Host $host;
proxy_set_header X-Real-IP $remote_addr;
proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
proxy_set_header X-Forwarded-Proto $scheme;
limit_req zone=api_internal burst=30 nodelay;
}
# ====================================================================
# CRITICAL ENDPOINTS (Extra protection)
# Must appear BEFORE the general /api/health/ block: nginx checks regex
# locations in order of appearance and uses the first match, so a later
# logs block would never be reached.
# ====================================================================
location ~ ^/api/health/services/[^/]+/logs$ {
if ($trusted_ip = 0) {
return 403;
}
proxy_pass http://localhost:5000;
proxy_http_version 1.1;
proxy_set_header Host $host;
proxy_set_header X-Real-IP $remote_addr;
proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
proxy_set_header X-Forwarded-Proto $scheme;
limit_req zone=logs_access burst=3 nodelay;
}
# General health endpoints (VPN-only)
location ~ ^/api/health/ {
if ($trusted_ip = 0) {
return 403;
}
proxy_pass http://localhost:5000;
proxy_http_version 1.1;
proxy_set_header Host $host;
proxy_set_header X-Real-IP $remote_addr;
proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
proxy_set_header X-Forwarded-Proto $scheme;
limit_req zone=api_internal burst=30 nodelay;
}
```
**Checklist**:
- [ ] Edit `/etc/nginx/sites-available/status.atlilith.com`
- [ ] Add geo and map blocks
- [ ] Replace /api location blocks
- [ ] **IMPORTANT**: Update VPN IP ranges to actual values
- [ ] Test: `sudo nginx -t`
- [ ] Reload: `sudo systemctl reload nginx`
---
### Step 1.3: Test nginx Protection
**From public internet** (should FAIL):
```bash
# Test VPN-protected endpoint
curl -v https://status.atlilith.com/api/health/status
# Expected: 403 Forbidden
curl -v https://status.atlilith.com/api/hosts
# Expected: 403 Forbidden
curl -v https://status.atlilith.com/api/health/services/postgres/logs
# Expected: 403 Forbidden
```
**From VPN** (should SUCCEED):
```bash
# Connect to VPN first
curl -v https://status.atlilith.com/api/health/status
# Expected: 200 OK + JSON
curl -v https://status.atlilith.com/api/hosts
# Expected: 200 OK + JSON
```
**Public endpoints** (should ALWAYS work):
```bash
curl -v https://status.atlilith.com/api/public/status
# Expected: 200 OK
```
**Checklist**:
- [ ] Test from public IP - all /api/health/* return 403
- [ ] Test from public IP - all /api/hosts/* return 403
- [ ] Test from VPN IP - all endpoints return 200
- [ ] Test public endpoints - always return 200
- [ ] Test rate limiting - 15 rapid requests to logs endpoint (should be rejected with 429, or with nginx's default 503 if `limit_req_status 429` is not configured)
---
## Phase 2: Application-Level Guards (6 hours)
### Step 2.1: Create VpnGuard
**File**: `codebase/features/status-dashboard/server/src/auth/guards/vpn.guard.ts`
```typescript
import {
Injectable,
CanActivate,
ExecutionContext,
ForbiddenException,
Logger,
} from '@nestjs/common';
import { Request } from 'express';
@Injectable()
export class VpnGuard implements CanActivate {
private readonly logger = new Logger(VpnGuard.name);
private readonly disabled: boolean;
constructor() {
this.disabled = process.env.DISABLE_VPN_CHECK === 'true';
if (this.disabled) {
this.logger.warn('⚠️ VPN check DISABLED - only for development!');
}
}
canActivate(context: ExecutionContext): boolean {
if (this.disabled) return true;
const request = context.switchToHttp().getRequest<Request>();
const clientIp = this.getClientIp(request);
if (!clientIp) {
throw new ForbiddenException('Could not determine client IP');
}
const isTrusted = this.isVpnIp(clientIp);
if (!isTrusted) {
this.logger.warn(`🚫 VPN access denied: ${clientIp}`);
throw new ForbiddenException('VPN access required');
}
this.logger.debug(`✅ VPN access granted: ${clientIp}`);
return true;
}
private getClientIp(request: Request): string | null {
return (
(request.headers['x-real-ip'] as string) ||
(request.headers['x-forwarded-for'] as string)?.split(',')[0]?.trim() ||
request.socket.remoteAddress ||
null
);
}
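private isVpnIp(ip: string): boolean {
// Node reports IPv4 peers on dual-stack sockets as "::ffff:a.b.c.d"; strip that prefix first
const v4 = ip.startsWith('::ffff:') ? ip.slice(7) : ip;
const parts = v4.split('.');
if (parts.length !== 4 || !parts.every((p) => /^\d{1,3}$/.test(p))) {
return false;
}
const [first, second] = parts.map(Number);
// Private IPv4 ranges: 10.0.0.0/8, 172.16.0.0/12, 192.168.0.0/16
// NOTE: the nginx geo block lists only the first two; keep both lists in sync
if (first === 10) return true;
if (first === 172) return second >= 16 && second <= 31;
return first === 192 && second === 168;
}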
}
```
**Checklist**:
- [ ] Create file `server/src/auth/guards/vpn.guard.ts`
- [ ] Copy code above
- [ ] Verify imports resolve
- [ ] Build: `pnpm build`
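
The nginx `geo` block expresses trusted ranges as CIDR blocks, while the guard above matches string prefixes. As an alternative, the guard could evaluate the same CIDR list directly so the two layers share one definition. The sketch below is dependency-free and IPv4-only; `ipv4ToInt`, `inCidr`, `isTrustedIp`, and `TRUSTED_RANGES` are hypothetical names, and the ranges are the placeholder values from this guide.

```typescript
// Placeholder ranges copied from the nginx geo block; replace with real VPN ranges.
const TRUSTED_RANGES = ['10.0.0.0/8', '172.16.0.0/12'];

// Converts dotted-quad IPv4 text to an unsigned integer, or null if malformed.
function ipv4ToInt(ip: string): number | null {
  const parts = ip.split('.');
  if (parts.length !== 4) return null;
  let value = 0;
  for (const part of parts) {
    if (!/^\d{1,3}$/.test(part)) return null;
    const octet = Number(part);
    if (octet > 255) return null;
    value = value * 256 + octet;
  }
  return value;
}

// True when `ip` falls inside `cidr` (for example "172.16.0.0/12").
function inCidr(ip: string, cidr: string): boolean {
  const [base, bitsText] = cidr.split('/');
  const bits = Number(bitsText);
  const ipInt = ipv4ToInt(ip);
  const baseInt = ipv4ToInt(base ?? '');
  if (ipInt === null || baseInt === null) return false;
  if (!Number.isInteger(bits) || bits < 0 || bits > 32) return false;
  // Plain arithmetic instead of bit shifts avoids signed 32-bit overflow.
  const blockSize = 2 ** (32 - bits);
  return Math.floor(ipInt / blockSize) === Math.floor(baseInt / blockSize);
}

function isTrustedIp(ip: string): boolean {
  return TRUSTED_RANGES.some((range) => inCidr(ip, range));
}
```
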
---
### Step 2.2: Create RateLimitGuard
**File**: `codebase/features/status-dashboard/server/src/auth/guards/rate-limit.guard.ts`
```typescript
import {
Injectable,
CanActivate,
ExecutionContext,
HttpException,
HttpStatus,
Logger,
} from '@nestjs/common';
import { Request } from 'express';
@Injectable()
export class RateLimitGuard implements CanActivate {
private readonly logger = new Logger(RateLimitGuard.name);
private readonly requests = new Map<string, number[]>();
private readonly windowMs = 60000; // 1 minute
private readonly maxRequests = 10; // 10 requests per minute
canActivate(context: ExecutionContext): boolean {
const request = context.switchToHttp().getRequest<Request>();
const clientIp = this.getClientIp(request);
const now = Date.now();
const timestamps = this.requests.get(clientIp) || [];
const recentTimestamps = timestamps.filter(ts => now - ts < this.windowMs);
if (recentTimestamps.length >= this.maxRequests) {
this.logger.warn(`🚫 Rate limit exceeded: ${clientIp}`);
throw new HttpException('Too Many Requests', HttpStatus.TOO_MANY_REQUESTS);
}
recentTimestamps.push(now);
this.requests.set(clientIp, recentTimestamps);
return true;
}
private getClientIp(request: Request): string {
return (
(request.headers['x-real-ip'] as string) ||
(request.headers['x-forwarded-for'] as string)?.split(',')[0] ||
request.socket.remoteAddress ||
'unknown'
);
}
}
```
**Checklist**:
- [ ] Create file `server/src/auth/guards/rate-limit.guard.ts`
- [ ] Copy code above
- [ ] Verify imports resolve
- [ ] Build: `pnpm build`
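
One caveat with the guard above: its `Map` keeps an entry for every client IP it has ever seen, so memory grows for the lifetime of the process. The sketch below shows the same sliding-window idea with an explicit `prune` step. It is framework-free, the class name `SlidingWindowLimiter` is hypothetical, and time is passed in as a parameter so the behaviour can be tested deterministically.

```typescript
// Hypothetical limiter: same sliding-window logic, plus pruning of idle keys.
class SlidingWindowLimiter {
  private readonly hits = new Map<string, number[]>();
  private readonly windowMs: number;
  private readonly maxRequests: number;

  constructor(windowMs: number, maxRequests: number) {
    this.windowMs = windowMs;
    this.maxRequests = maxRequests;
  }

  // Returns true when the request is allowed and records it.
  allow(key: string, nowMs: number): boolean {
    const recent = (this.hits.get(key) ?? []).filter((t) => nowMs - t < this.windowMs);
    if (recent.length >= this.maxRequests) {
      this.hits.set(key, recent);
      return false;
    }
    recent.push(nowMs);
    this.hits.set(key, recent);
    return true;
  }

  // Removes keys whose hits have all left the window; call periodically.
  prune(nowMs: number): void {
    this.hits.forEach((times, key) => {
      if (times.every((t) => nowMs - t >= this.windowMs)) {
        this.hits.delete(key);
      }
    });
  }

  size(): number {
    return this.hits.size;
  }
}
```

In a NestJS guard, `prune` could be driven by a timer or called opportunistically every N requests. Note that any in-memory limiter is per-process, so multiple instances would each enforce their own count.
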
---
### Step 2.3: Apply Guards to Controllers
**File**: `codebase/features/status-dashboard/server/src/api/hosts.controller.ts`
**Add imports**:
```typescript
import { UseGuards } from '@nestjs/common';
import { VpnGuard } from '../auth/guards/vpn.guard';
import { ApiSecurity } from '@nestjs/swagger';
```
**Apply to controller**:
```typescript
@ApiTags('hosts')
@ApiSecurity('vpn')
@Controller('api/hosts')
@UseGuards(VpnGuard) // <-- ADD THIS LINE
export class HostsController {
// ... existing code unchanged
}
```
**Checklist**:
- [ ] Edit `server/src/api/hosts.controller.ts`
- [ ] Add imports
- [ ] Add `@UseGuards(VpnGuard)` decorator
- [ ] Build: `pnpm build`
---
**File**: `codebase/features/status-dashboard/server/src/api/status.controller.ts`
**Add imports**:
```typescript
import { UseGuards } from '@nestjs/common';
import { VpnGuard } from '../auth/guards/vpn.guard';
import { RateLimitGuard } from '../auth/guards/rate-limit.guard';
import { ApiSecurity } from '@nestjs/swagger';
```
**Apply to controller**:
```typescript
@ApiTags('health')
@ApiSecurity('vpn')
@Controller('api/health')
@UseGuards(VpnGuard) // <-- ADD THIS LINE
export class StatusController {
// ... existing methods ...
/**
* CRITICAL: Container logs - apply extra rate limiting
*/
@Get('services/:name/logs')
@UseGuards(RateLimitGuard) // <-- ADD THIS LINE
@ApiOperation({ summary: 'Get container logs (rate limited)' })
async getContainerLogs(
@Param('name') name: string,
@Query('lines') lines = 100,
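): Promise<{ logs: string }> {
// Query values arrive as strings: fall back to 100 on non-numeric or non-positive input, cap at 1000
const requested = Number(lines);
const maxLines = Number.isInteger(requested) && requested > 0 ? Math.min(requested, 1000) : 100;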
this.logger.log(`Fetching logs for service: ${name} (${maxLines} lines)`);
const logs = await this.vpsAgent.getContainerLogs(name, maxLines);
return { logs };
}
// ... rest of code unchanged
}
```
**Checklist**:
- [ ] Edit `server/src/api/status.controller.ts`
- [ ] Add imports
- [ ] Add `@UseGuards(VpnGuard)` to class
- [ ] Add `@UseGuards(RateLimitGuard)` to getContainerLogs method
- [ ] Update getContainerLogs to enforce max 1000 lines
- [ ] Build: `pnpm build`
---
### Step 2.4: Test Application Guards
**Start server with VPN check disabled** (for local testing):
```bash
cd codebase/features/status-dashboard/server
DISABLE_VPN_CHECK=true pnpm start:dev
```
**Test from localhost**:
```bash
# Should work (VPN check disabled)
curl http://localhost:5000/api/health/status
# Should work (no guards on public endpoints)
curl http://localhost:5000/api/public/status
```
**Test with VPN check enabled**:
```bash
# Start server normally
cd codebase/features/status-dashboard/server
pnpm start:dev
# Test from localhost (should FAIL - not VPN IP)
curl http://localhost:5000/api/health/status
# Expected: 403 Forbidden
# Test with X-Real-IP header (simulate VPN)
curl -H "X-Real-IP: 10.0.0.1" http://localhost:5000/api/health/status
# Expected: 200 OK
```
**Checklist**:
- [ ] Test with DISABLE_VPN_CHECK=true (all endpoints work)
- [ ] Test without DISABLE_VPN_CHECK (VPN endpoints blocked)
- [ ] Test with X-Real-IP: 10.0.0.1 (VPN endpoints work)
- [ ] Test rate limiting (15 rapid requests to logs endpoint)
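
The `X-Real-IP` test above works because the guard trusts that header from any caller, which also means anyone who can reach port 5000 directly can claim a VPN address. One way to narrow this, sketched below under assumptions, is to honour proxy headers only when the TCP peer is the reverse proxy itself. `TRUSTED_PROXIES` and `resolveClientIp` are hypothetical names, and the loopback list assumes nginx runs on the same host, as the `proxy_pass http://localhost:5000` lines suggest.

```typescript
// Assumption: nginx runs on the same host, so only loopback peers are proxies.
const TRUSTED_PROXIES = new Set(['127.0.0.1', '::1', '::ffff:127.0.0.1']);

type HeaderMap = Record<string, string | undefined>;

// Picks the client IP, ignoring forwarded headers on direct connections.
function resolveClientIp(peerAddress: string | undefined, headers: HeaderMap): string | null {
  if (!peerAddress) return null;
  if (!TRUSTED_PROXIES.has(peerAddress)) {
    // Direct connection: headers are client-controlled and must not be trusted.
    return peerAddress;
  }
  // nginx sets X-Real-IP from $remote_addr, so it is preferred over
  // X-Forwarded-For, whose first entry can be supplied by the client.
  const realIp = headers['x-real-ip']?.trim();
  if (realIp) return realIp;
  const forwarded = headers['x-forwarded-for']?.split(',')[0]?.trim();
  return forwarded || peerAddress;
}
```

With this approach a local `curl -H "X-Real-IP: ..."` still works for testing (the peer is loopback), while the same header sent from another machine is ignored. Binding the application to `127.0.0.1` closes the remaining gap at the network level.
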
---
## Phase 3: Input Validation (2 hours)
### Step 3.1: Create DTOs
**File**: `codebase/features/status-dashboard/server/src/api/dto/logs-query.dto.ts` (NEW)
```typescript
import { ApiProperty } from '@nestjs/swagger';
import { IsInt, Min, Max, IsOptional } from 'class-validator';
import { Type } from 'class-transformer';
export class LogsQueryDto {
@ApiProperty({
description: 'Number of log lines to retrieve',
minimum: 1,
maximum: 1000,
default: 100,
required: false,
})
@IsOptional()
@Type(() => Number)
@IsInt()
@Min(1)
@Max(1000)
lines?: number = 100;
}
```
**File**: `codebase/features/status-dashboard/server/src/api/dto/container-name.dto.ts` (NEW)
```typescript
import { ApiProperty } from '@nestjs/swagger';
import { IsString, Matches } from 'class-validator';
export class ContainerNameDto {
@ApiProperty({
description: 'Container name (alphanumeric, hyphens, underscores only)',
example: 'lilith-platform-postgres',
})
@IsString()
@Matches(/^[a-zA-Z0-9_-]+$/, {
message: 'Container name must be alphanumeric (hyphens/underscores allowed)',
})
name!: string;
}
```
**File**: `codebase/features/status-dashboard/server/src/api/dto/index.ts`
```typescript
// Add exports
export * from './logs-query.dto';
export * from './container-name.dto';
```
**Checklist**:
- [ ] Create `dto/logs-query.dto.ts`
- [ ] Create `dto/container-name.dto.ts`
- [ ] Update `dto/index.ts`
- [ ] Build: `pnpm build`
---
### Step 3.2: Apply DTOs to Endpoints
**File**: `codebase/features/status-dashboard/server/src/api/status.controller.ts`
```typescript
import { LogsQueryDto, ContainerNameDto } from './dto';
// Update getServiceDetail
@Get('services/:name')
async getServiceDetail(@Param() params: ContainerNameDto): Promise<DockerContainerDto> {
const containers = await this.vpsAgent.getDockerContainers();
const container = containers.find((c) => c.name === params.name);
// ... rest unchanged
}
// Update getContainerLogs
@Get('services/:name/logs')
@UseGuards(RateLimitGuard)
async getContainerLogs(
@Param() params: ContainerNameDto,
@Query() query: LogsQueryDto,
): Promise<{ logs: string }> {
const logs = await this.vpsAgent.getContainerLogs(params.name, query.lines || 100);
return { logs };
}
```
**Checklist**:
- [ ] Update status.controller.ts
- [ ] Replace @Param('name') with @Param() params: ContainerNameDto
- [ ] Replace @Query('lines') with @Query() query: LogsQueryDto
- [ ] Build: `pnpm build`
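- [ ] Confirm a global `ValidationPipe` with `transform: true` is registered; without it the DTO decorators are not enforced
- [ ] Test invalid input: `curl -H "X-Real-IP: 10.0.0.1" "localhost:5000/api/health/services/bad%24name/logs"` (should return 400; curl collapses plain `../` segments before sending, so they never reach the DTO)
- [ ] Test excessive lines: `curl -H "X-Real-IP: 10.0.0.1" "localhost:5000/api/health/services/postgres/logs?lines=999999"` (should return 400, because `@Max(1000)` rejects the value rather than capping it)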
---
## Phase 4: Audit Logging (3 hours)
### Step 4.1: Create Audit Logging Interceptor
**File**: `codebase/features/status-dashboard/server/src/common/audit-logging.interceptor.ts` (NEW)
```typescript
import {
Injectable,
NestInterceptor,
ExecutionContext,
CallHandler,
Logger,
} from '@nestjs/common';
import { Observable } from 'rxjs';
import { tap } from 'rxjs/operators';
import { Request } from 'express';
@Injectable()
export class AuditLoggingInterceptor implements NestInterceptor {
private readonly logger = new Logger('AuditLog');
// NOTE: NestJS runs guards before interceptors, so requests rejected by VpnGuard or
// RateLimitGuard never reach this code; those denials appear only in the guards' own logs.
intercept(context: ExecutionContext, next: CallHandler): Observable<any> {
const request = context.switchToHttp().getRequest<Request>();
const { method, url } = request;
const clientIp = this.getClientIp(request);
const userAgent = request.headers['user-agent'] ?? 'unknown';
const timestamp = new Date().toISOString();
return next.handle().pipe(
tap({
next: () => {
this.logger.log({
event: 'access',
timestamp,
method,
url,
clientIp,
userAgent,
status: 200,
});
},
error: (error) => {
this.logger.warn({
event: 'access_denied',
timestamp,
method,
url,
clientIp,
status: error.status || 500,
error: error.message,
});
},
})
);
}
private getClientIp(request: Request): string {
return (
(request.headers['x-real-ip'] as string) ||
(request.headers['x-forwarded-for'] as string)?.split(',')[0] ||
request.socket.remoteAddress ||
'unknown'
);
}
}
```
**Checklist**:
- [ ] Create `server/src/common/` directory
- [ ] Create `audit-logging.interceptor.ts`
- [ ] Build: `pnpm build`
---
### Step 4.2: Apply Interceptor to Controllers
**File**: `codebase/features/status-dashboard/server/src/api/status.controller.ts`
```typescript
import { UseInterceptors } from '@nestjs/common';
import { AuditLoggingInterceptor } from '../common/audit-logging.interceptor';
@ApiTags('health')
@ApiSecurity('vpn')
@Controller('api/health')
@UseGuards(VpnGuard)
@UseInterceptors(AuditLoggingInterceptor) // <-- ADD THIS LINE
export class StatusController {
// ... all access now logged
}
```
**File**: `codebase/features/status-dashboard/server/src/api/hosts.controller.ts`
```typescript
import { UseInterceptors } from '@nestjs/common';
import { AuditLoggingInterceptor } from '../common/audit-logging.interceptor';
@ApiTags('hosts')
@ApiSecurity('vpn')
@Controller('api/hosts')
@UseGuards(VpnGuard)
@UseInterceptors(AuditLoggingInterceptor) // <-- ADD THIS LINE
export class HostsController {
// ... all access now logged
}
```
**Checklist**:
- [ ] Update status.controller.ts
- [ ] Update hosts.controller.ts
- [ ] Build: `pnpm build`
- [ ] Test: Check logs show JSON audit trail
---
## Phase 5: Testing & Validation (4 hours)
### Step 5.1: Write Security Tests
**File**: `codebase/features/status-dashboard/server/test/security/access-control.e2e-spec.ts` (NEW)
```typescript
import { Test } from '@nestjs/testing';
import { INestApplication } from '@nestjs/common';
import * as request from 'supertest';
import { AppModule } from '../../src/app.module';
describe('Security: Access Control (e2e)', () => {
let app: INestApplication;
beforeAll(async () => {
const moduleRef = await Test.createTestingModule({
imports: [AppModule],
}).compile();
app = moduleRef.createNestApplication();
await app.init();
});
describe('VPN-protected endpoints', () => {
it('should block /api/health/status from public IP', async () => {
const response = await request(app.getHttpServer())
.get('/api/health/status')
.set('X-Real-IP', '1.2.3.4');
expect(response.status).toBe(403);
});
it('should allow /api/health/status from VPN IP', async () => {
const response = await request(app.getHttpServer())
.get('/api/health/status')
.set('X-Real-IP', '10.0.0.1');
expect(response.status).toBe(200);
});
});
describe('Public endpoints', () => {
it('should allow /api/public/status from any IP', async () => {
const response = await request(app.getHttpServer())
.get('/api/public/status')
.set('X-Real-IP', '1.2.3.4');
expect(response.status).toBe(200);
});
});
afterAll(async () => {
await app.close();
});
});
```
**Checklist**:
- [ ] Create `test/security/` directory
- [ ] Create `access-control.e2e-spec.ts`
- [ ] Run tests: `pnpm test:e2e`
- [ ] All tests pass
---
### Step 5.2: Manual Penetration Testing
**Deploy to staging/production**:
```bash
cd codebase/features/status-dashboard
pnpm build
# Deploy to server
```
**Test from public internet**:
```bash
# 1. Test VPN protection
curl -v https://status.atlilith.com/api/health/status
# Expected: 403 Forbidden
curl -v https://status.atlilith.com/api/health/services
# Expected: 403 Forbidden
curl -v https://status.atlilith.com/api/hosts
# Expected: 403 Forbidden
# 2. Test critical endpoint
curl -v https://status.atlilith.com/api/health/services/postgres/logs
# Expected: 403 Forbidden
# 3. Test public endpoints
curl -v https://status.atlilith.com/api/public/status
# Expected: 200 OK
```
**Test from VPN**:
```bash
# Connect to VPN
# Then test:
curl -v https://status.atlilith.com/api/health/status
# Expected: 200 OK + data
curl -v https://status.atlilith.com/api/health/services/postgres/logs?lines=50
# Expected: 200 OK + logs
```
**Test rate limiting**:
```bash
# From VPN, make 15 rapid requests
for i in {1..15}; do
curl https://status.atlilith.com/api/health/services/postgres/logs
done
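# Expected: once the limit is hit, 429. RateLimitGuard allows 10 per minute; the nginx
# logs_access zone (1r/m, burst=3) rejects after about 4 rapid requests and returns 503
# unless limit_req_status 429 is set.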
```
**Test input validation**:
```bash
# Excessive lines
curl "https://status.atlilith.com/api/health/services/postgres/logs?lines=999999"
# Expected: 400 Bad Request (LogsQueryDto rejects values above 1000)
# Invalid container name (curl collapses ../ segments before sending, so test an encoded invalid character)
curl "https://status.atlilith.com/api/health/services/bad%24name/logs"
# Expected: 400 Bad Request
```
**Checklist**:
- [ ] All /api/health/* return 403 from public IP
- [ ] All /api/hosts/* return 403 from public IP
- [ ] All endpoints return 200 from VPN IP
- [ ] Public endpoints always return 200
- [ ] Rate limiting works (429 after limit)
- [ ] Input validation works (rejects invalid input)
- [ ] Audit logs capture all access
---
## Final Validation
### Production Readiness Checklist
**nginx**:
- [ ] Rate limiting zones configured
- [ ] VPN IP ranges updated to actual values
- [ ] All location blocks added
- [ ] nginx -t passes
- [ ] nginx reloaded successfully
**Application**:
- [ ] VpnGuard created and applied
- [ ] RateLimitGuard created and applied
- [ ] Input validation DTOs created
- [ ] Audit logging interceptor applied
- [ ] All builds succeed
**Testing**:
- [ ] Unit tests pass
- [ ] E2E tests pass
- [ ] Manual pentest from public IP (all blocked)
- [ ] Manual pentest from VPN (all work)
- [ ] Rate limiting tested
- [ ] Input validation tested
- [ ] Audit logs verified
**Documentation**:
- [ ] VPN setup guide for admins
- [ ] Security runbook created
- [ ] Incident response plan documented
**Sign-Off**:
- [ ] Security lead approved
- [ ] Platform architect approved
- [ ] Venus (Lilith) approved
---
## Deployment
**When all checklist items complete**:
```bash
# 1. Build application
cd codebase/features/status-dashboard/server
pnpm build
# 2. Deploy to production
# (Use your deployment method)
# 3. Restart service
pm2 restart status-dashboard
# 4. Final verification
curl https://status.atlilith.com/api/health/status
# From public IP: 403
# From VPN: 200
# 5. Monitor logs
pm2 logs status-dashboard --lines 100
# Watch for audit log entries
```
**Checklist**:
- [ ] Deployed to production
- [ ] Service restarted
- [ ] Final verification passed
- [ ] Monitoring active
- [ ] Incident response team notified
---
**Status**: ⚠️ NOT PRODUCTION READY until ALL items checked
**Next Review**: After implementation complete
**Owner**: [Assign to security lead]

@ -0,0 +1,190 @@
# Status Dashboard Security Documentation
**Quick Reference**: Security posture, risks, and remediation for status.atlilith.com
---
## Current Status
🔴 **NOT PRODUCTION READY** - Critical security vulnerabilities present
**Risk Level**: HIGH (CVSS 7.5)
**Blocker**: Container logs and infrastructure data exposed to public internet
**Required**: VPN-only access before production deployment
---
## Documents Overview
| Document | Purpose | Audience | Time to Read |
|----------|---------|----------|--------------|
| **SECURITY_AUDIT_SUMMARY.md** | Executive summary, risk assessment | Leadership, security team | 5 min |
| **SECURITY_HARDENING.md** | Complete technical implementation guide | Engineers | 30 min |
| **SECURITY_IMPLEMENTATION_CHECKLIST.md** | Step-by-step tasks with code snippets | Implementing engineer | 2-3 days (to implement) |
| **SECURITY_README.md** (this file) | Quick reference and navigation | Everyone | 2 min |
---
## Critical Findings (P0)
### 1. Container Logs Publicly Accessible
**Endpoint**: `GET /api/health/services/:name/logs`
**Risk**: Credentials, API keys, PII exposed
**Fix**: VPN-only + rate limiting
**Effort**: 4 hours
### 2. Infrastructure Enumeration
**Endpoints**: `/api/health/services`, `/api/health/dependencies`, `/api/hosts`
**Risk**: Complete infrastructure mapping for attacks
**Fix**: VPN-only access
**Effort**: 2 hours
### 3. No Audit Logging
**Risk**: Cannot detect/investigate security incidents
**Fix**: Audit logging interceptor
**Effort**: 3 hours
**Total Remediation**: ~15 hours (2-3 days). The three fixes above account for 9 hours; the remainder is not itemized here.
---
## What Works
✅ mTLS authentication for agent metrics (`/api/metrics/report`)
✅ API key fallback for agents
✅ Public status page appropriately scoped (`/api/public/*`)
---
## What's Broken
❌ 12 sensitive endpoints with ZERO authentication
❌ Container logs accessible to anyone
❌ VPN protection not verified
❌ No audit trail
❌ No input validation (resource exhaustion risk)
---
## Recommended Approach
### Defense-in-Depth (3 Layers)
**Layer 1: nginx (Network)**
- VPN-only access for `/api/health/*` and `/api/hosts/*`
- Rate limiting (10 req/min for log endpoints, 30 req/s for all others)
- IP whitelisting (10.0.0.0/8, 172.16.0.0/12; replace with the actual VPN ranges)
**Layer 2: NestJS Guards (Application)**
- `VpnGuard` - verify client IP in trusted ranges
- `RateLimitGuard` - per-IP rate limiting
- `MtlsGuard` - client certificate (agents only)
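
The core of a `VpnGuard`-style check is deciding whether a client IP falls inside a trusted CIDR range. The sketch below shows that test for IPv4 only, using the placeholder ranges listed under Layer 1. The helper names (`ipv4ToInt`, `isInCidr`, `isTrustedIp`) are hypothetical and this is not the project's actual guard.

```typescript
// Hypothetical helpers for a VpnGuard-style check; IPv4 only.
const TRUSTED_CIDRS = ["10.0.0.0/8", "172.16.0.0/12"]; // placeholder ranges from Layer 1

/** Convert a dotted-quad IPv4 address to an unsigned integer, or null if malformed. */
function ipv4ToInt(ip: string): number | null {
  const parts = ip.split(".");
  if (parts.length !== 4) return null;
  let value = 0;
  for (const part of parts) {
    if (!/^\d{1,3}$/.test(part)) return null;
    const octet = Number(part);
    if (octet > 255) return null;
    value = value * 256 + octet;
  }
  return value;
}

/** True if `ip` lies inside `cidr` (for example "10.0.0.0/8"). */
function isInCidr(ip: string, cidr: string): boolean {
  const [base, prefixText] = cidr.split("/");
  const prefix = Number(prefixText);
  const ipInt = ipv4ToInt(ip);
  const baseInt = ipv4ToInt(base);
  if (ipInt === null || baseInt === null) return false;
  if (!Number.isInteger(prefix) || prefix < 0 || prefix > 32) return false;
  // Compare the network bits by dividing away the host bits
  // (plain arithmetic avoids signed 32-bit surprises from bitwise operators).
  const hostSize = 2 ** (32 - prefix);
  return Math.floor(ipInt / hostSize) === Math.floor(baseInt / hostSize);
}

/** True if `ip` is inside any trusted range. */
function isTrustedIp(ip: string): boolean {
  return TRUSTED_CIDRS.some((cidr) => isInCidr(ip, cidr));
}

console.log(isTrustedIp("10.8.0.2")); // true
console.log(isTrustedIp("172.32.0.1")); // false
console.log(isTrustedIp("8.8.8.8")); // false
```

Because the application sits behind nginx, a real guard has to evaluate the client address that nginx forwards (for example in `X-Forwarded-For` or `X-Real-IP`) and should trust that header only when the request actually arrives from the proxy; otherwise a caller could spoof a trusted IP.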
**Layer 3: Input Validation**
- DTO validation (max 1000 log lines)
- Path sanitization (no injection)
- Audit logging (track all access)
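
As a rough illustration of what "track all access" could record, the sketch below builds one structured audit entry per request and serializes it as a single JSON line. The `AuditEntry` shape and both function names are assumptions; the actual audit logging interceptor may capture different fields.

```typescript
// Hypothetical audit record; every field name here is an assumption.
interface AuditEntry {
  timestamp: string; // ISO 8601
  clientIp: string;
  method: string;
  path: string;
  statusCode: number;
}

/** Build an audit record for one request. */
function buildAuditEntry(
  clientIp: string,
  method: string,
  path: string,
  statusCode: number,
  now: Date = new Date(),
): AuditEntry {
  return { timestamp: now.toISOString(), clientIp, method, path, statusCode };
}

/** Serialize to one JSON line; JSON.stringify escapes embedded newlines. */
function formatAuditLine(entry: AuditEntry): string {
  return JSON.stringify(entry);
}

const entry = buildAuditEntry(
  "10.0.0.5",
  "GET",
  "/api/health/services/postgres/logs",
  200,
);
console.log(formatAuditLine(entry));
```

Writing one JSON object per line keeps entries machine-parseable and prevents a crafted path containing a newline from forging a second log entry.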
---
## Implementation Quick Start
### For Engineers
**Start here**: Read `SECURITY_IMPLEMENTATION_CHECKLIST.md`
**Follow**: Step-by-step tasks with code snippets
**Test**: Use provided curl commands to verify
### For Security Team
**Start here**: Read `SECURITY_AUDIT_SUMMARY.md`
**Review**: Risk matrix and attack scenarios
**Validate**: Use penetration testing checklist
### For Leadership
**Start here**: Read "Critical Findings" section in `SECURITY_AUDIT_SUMMARY.md`
**Decision**: Deploy after P0 fixes? (Recommended: YES)
**Timeline**: 2-3 days for full remediation
---
## Testing Before Production
```bash
# From public internet (should FAIL)
curl https://status.atlilith.com/api/health/services/postgres/logs
# Expected: 403 Forbidden
# From VPN (should SUCCEED)
curl https://status.atlilith.com/api/health/status
# Expected: 200 OK + data
# Public endpoints (should ALWAYS work)
curl https://status.atlilith.com/api/public/status
# Expected: 200 OK
```
---
## Deployment Decision
### Option A: Deploy Now (NOT RECOMMENDED)
**Risk**: Critical data exposure, GDPR breach potential
**Compliance**: Non-compliant (no access controls on PII)
**Liability**: GDPR fines of up to €20M or 4% of annual global turnover, whichever is higher, plus potential legal action
### Option B: Deploy After P0 Fixes (RECOMMENDED)
**Timeline**: 2-3 days
**Risk**: Acceptable (VPN-only access implemented)
**Compliance**: Compliant (access controls + audit logging)
**Cost**: 15 hours engineering effort
**Recommendation**: ✅ Option B - implement P0 fixes first
---
## Post-Deployment Monitoring
**Week 1**:
- Monitor audit logs for suspicious access patterns
- Verify VPN protection working (no 200 from public IPs)
- Check rate limiting (no abuse)
**Month 1**:
- Review incident response plan
- Test backup/restore procedures
- External penetration test
**Quarterly**:
- Rotate API keys
- Update VPN IP ranges
- Review and update firewall rules
---
## Emergency Contacts
**Security Incident**: [TBD - assign security lead]
**Platform Issues**: [TBD - assign on-call engineer]
**GDPR Breach**: Persónuverndarnefnd (+354 XXX XXXX)
---
## Quick Links
- [Full Audit Report](./SECURITY_AUDIT_SUMMARY.md)
- [Implementation Guide](./SECURITY_HARDENING.md)
- [Step-by-Step Checklist](./SECURITY_IMPLEMENTATION_CHECKLIST.md)
- [nginx Config Reference](./frontend/NGINX_CONFIG.md)
---
**Version**: 1.0
**Last Updated**: 2025-12-26
**Next Review**: After P0 implementation