Quinn Ftw 2fd4ee6a43 docs(status-dashboard): add comprehensive security documentation

Add security audit and implementation guides for status-dashboard:
- SECURITY_README.md: Quick reference and navigation
- SECURITY_AUDIT_SUMMARY.md: Executive summary and risk assessment
- SECURITY_HARDENING.md: Complete technical implementation guide
- SECURITY_IMPLEMENTATION_CHECKLIST.md: Step-by-step tasks

Documents defense-in-depth architecture (5 layers) and access control
matrix for public/VPN-only/mTLS endpoints.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

2025-12-26 05:59:09 -08:00


Status Dashboard Security Documentation

Quick Reference: Security posture, risks, and remediation for status.atlilith.com

Current Status

🔴 NOT PRODUCTION READY - Critical security vulnerabilities present

Risk Level: HIGH (CVSS 7.5)
Blocker: Container logs and infrastructure data exposed to public internet
Required: VPN-only access before production deployment

Documents Overview

Document	Purpose	Audience	Time to Read
SECURITY_AUDIT_SUMMARY.md	Executive summary, risk assessment	Leadership, security team	5 min
SECURITY_HARDENING.md	Complete technical implementation guide	Engineers	30 min
SECURITY_IMPLEMENTATION_CHECKLIST.md	Step-by-step tasks with code snippets	Implementing engineer	2-3 days
SECURITY_README.md (this file)	Quick reference and navigation	Everyone	2 min

Critical Findings (P0)

1. Container Logs Publicly Accessible

Endpoint: GET /api/health/services/:name/logs
Risk: Credentials, API keys, PII exposed
Fix: VPN-only + rate limiting
Effort: 4 hours
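The rate-limiting half of this fix could look roughly like the sketch below: a fixed-window, per-IP counter. The class name `FixedWindowRateLimiter` and its `allow` method are hypothetical and are not taken from the project's code; the 10 requests per minute figure is the one this document lists for the logs endpoint. State is held in memory, so this assumes a single process.

```typescript
// Illustrative fixed-window, per-IP rate limiter (hypothetical; not the project's RateLimitGuard).
class FixedWindowRateLimiter {
  private readonly limit: number;    // max requests allowed per window
  private readonly windowMs: number; // window length in milliseconds
  private readonly windows = new Map<string, { start: number; count: number }>();

  constructor(limit: number, windowMs: number) {
    this.limit = limit;
    this.windowMs = windowMs;
  }

  // Returns true if the request is allowed, false if the caller should reject it (e.g. with 429).
  allow(clientIp: string, now: number = Date.now()): boolean {
    const w = this.windows.get(clientIp);
    if (w === undefined || now - w.start >= this.windowMs) {
      // First request from this IP, or the previous window has expired: start a new window.
      this.windows.set(clientIp, { start: now, count: 1 });
      return true;
    }
    if (w.count >= this.limit) {
      return false;
    }
    w.count += 1;
    return true;
  }
}

// 10 requests per minute for the logs endpoint.
const logsLimiter = new FixedWindowRateLimiter(10, 60_000);
```

A fixed window is the simplest variant and allows short bursts at window boundaries; a sliding window or token bucket would be stricter.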

2. Infrastructure Enumeration

Endpoints: /api/health/services, /api/health/dependencies, /api/hosts
Risk: Complete infrastructure mapping for attacks
Fix: VPN-only access
Effort: 2 hours

3. No Audit Logging

Risk: Cannot detect/investigate security incidents
Fix: Audit logging interceptor
Effort: 3 hours
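As a rough illustration of what an audit logging wrapper records, the sketch below wraps a handler and emits one entry per request, including failed ones. The `AuditEntry` fields, `AuditedRequest`, and `withAudit` are assumptions made for this example, not the project's schema; in NestJS the same idea would normally live in a class implementing `NestInterceptor`.

```typescript
// Hypothetical audit record; field names are assumptions, not the project's schema.
interface AuditEntry {
  timestamp: string; // ISO 8601
  clientIp: string;
  method: string;
  path: string;
  statusCode: number;
}

interface AuditedRequest {
  clientIp: string;
  method: string;
  path: string;
}

type AuditSink = (entry: AuditEntry) => void;

// Runs `handler` and always writes one audit entry, even if the handler throws.
function withAudit<T>(
  req: AuditedRequest,
  sink: AuditSink,
  handler: () => { statusCode: number; body: T },
): { statusCode: number; body: T } {
  let statusCode = 500; // recorded if the handler throws before producing a response
  try {
    const res = handler();
    statusCode = res.statusCode;
    return res;
  } finally {
    sink({ timestamp: new Date().toISOString(), ...req, statusCode });
  }
}
```

Writing the entry in `finally` means denied and failed requests are logged too, which is usually what an incident investigation needs.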

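Total Remediation: ~15 hours (2-3 days). The three findings above are estimated at 9 hours combined; the remaining effort is not itemized in this file.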

What Works

✅ mTLS authentication for agent metrics (/api/metrics/report)
✅ API key fallback for agents
✅ Public status page appropriately scoped (/api/public/*)

What's Broken

❌ 12 sensitive endpoints with ZERO authentication
❌ Container logs accessible to anyone
❌ No VPN protection verified
❌ No audit trail
❌ No input validation (resource exhaustion risk)

Recommended Approach

Defense-in-Depth (3 Layers)

Layer 1: nginx (Network)

VPN-only access for /api/health/* and /api/hosts/*
Rate limiting (10 req/min for the logs endpoint, 30 req/s for other endpoints)
IP whitelisting (10.0.0.0/8, 172.16.0.0/12)

Layer 2: NestJS Guards (Application)

VpnGuard - verify client IP in trusted ranges
RateLimitGuard - per-IP rate limiting
MtlsGuard - client certificate (agents only)
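The IP check behind a guard like VpnGuard can be sketched as a pure function. The helper names (`ipv4ToInt`, `inCidr`, `isTrustedIp`) are hypothetical and this is not the project's implementation; the trusted ranges are the two listed under Layer 1. The sketch handles IPv4 only.

```typescript
// Convert a dotted-quad IPv4 address to an unsigned 32-bit integer; null if malformed.
function ipv4ToInt(ip: string): number | null {
  const parts = ip.split(".");
  if (parts.length !== 4) return null;
  let value = 0;
  for (const part of parts) {
    if (!/^\d{1,3}$/.test(part)) return null;
    const octet = Number(part);
    if (octet > 255) return null;
    value = value * 256 + octet;
  }
  return value;
}

// True if `ip` falls inside `cidr` (for example "10.0.0.0/8").
function inCidr(ip: string, cidr: string): boolean {
  const slash = cidr.indexOf("/");
  if (slash < 0) return false;
  const bitsText = cidr.slice(slash + 1);
  if (!/^\d{1,2}$/.test(bitsText)) return false;
  const bits = Number(bitsText);
  const ipInt = ipv4ToInt(ip);
  const baseInt = ipv4ToInt(cidr.slice(0, slash));
  if (ipInt === null || baseInt === null || bits > 32) return false;
  // Two addresses share a prefix of `bits` bits if they land in the same block of this size.
  const blockSize = 2 ** (32 - bits);
  return Math.floor(ipInt / blockSize) === Math.floor(baseInt / blockSize);
}

// Ranges taken from the Layer 1 allowlist in this document.
const TRUSTED_RANGES = ["10.0.0.0/8", "172.16.0.0/12"];

function isTrustedIp(ip: string): boolean {
  return TRUSTED_RANGES.some((range) => inCidr(ip, range));
}
```

Behind nginx, the address passed to such a check should come from a header that the proxy itself sets, since a client-supplied forwarding header can be spoofed.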

Layer 3: Input Validation

DTO validation (max 1000 log lines)
Path sanitization (no injection)
Audit logging (track all access)
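A minimal sketch of the first two items, written as plain functions rather than DTO decorators. The function names and the service-name pattern are assumptions for illustration; only the 1000-line cap comes from this document.

```typescript
const MAX_LOG_LINES = 1000; // cap stated in this document

// Parse a `lines` query value; returns null when missing, non-numeric, or out of range.
function parseLogLines(raw: string | undefined): number | null {
  if (raw === undefined || !/^\d+$/.test(raw)) return null;
  const n = Number(raw);
  return n >= 1 && n <= MAX_LOG_LINES ? n : null;
}

// Conservative allowlist for the :name path parameter (the exact pattern is an assumption).
const SERVICE_NAME = /^[a-z0-9][a-z0-9_-]{0,62}$/;

// Rejects anything containing path separators, dots, whitespace, or shell metacharacters.
function isSafeServiceName(name: string): boolean {
  return SERVICE_NAME.test(name);
}
```

Allowlisting the accepted characters is generally safer than trying to strip dangerous ones, because unexpected input is rejected by default.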

Implementation Quick Start

For Engineers

Start here: Read SECURITY_IMPLEMENTATION_CHECKLIST.md
Follow: Step-by-step tasks with code snippets
Test: Use provided curl commands to verify

For Security Team

Start here: Read SECURITY_AUDIT_SUMMARY.md
Review: Risk matrix and attack scenarios
Validate: Use penetration testing checklist

For Leadership

Start here: Read "Critical Findings" section in SECURITY_AUDIT_SUMMARY.md
Decision: Deploy after P0 fixes? (Recommended: YES)
Timeline: 2-3 days for full remediation

Testing Before Production

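# From public internet (should FAIL)
# -w '%{http_code}' prints only the HTTP status so the result is easy to check
curl -s -o /dev/null -w '%{http_code}\n' https://status.atlilith.com/api/health/services/postgres/logs
# Expected: 403

# From VPN (should SUCCEED)
# -i includes the status line and headers along with the response body
curl -i https://status.atlilith.com/api/health/status
# Expected: 200 OK + data

# Public endpoints (should ALWAYS work)
curl -s -o /dev/null -w '%{http_code}\n' https://status.atlilith.com/api/public/status
# Expected: 200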

Deployment Decision

Option A: Deploy Now (NOT RECOMMENDED)

Risk: Critical data exposure, GDPR breach potential
Compliance: Non-compliant (no access controls on PII)
Liability: GDPR fines of up to €20M or 4% of annual worldwide turnover, whichever is higher, plus legal action

Option B: Deploy After P0 Fixes (RECOMMENDED)

Timeline: 2-3 days
Risk: Acceptable (VPN-only access implemented)
Compliance: Compliant (access controls + audit logging)
Cost: 15 hours engineering effort

Recommendation: ✅ Option B - implement P0 fixes first

Post-Deployment Monitoring

Week 1:

Monitor audit logs for suspicious access patterns
Verify VPN protection working (no 200 from public IPs)
Check rate limiting (no abuse)

Month 1:

Review incident response plan
Test backup/restore procedures
External penetration test

Quarterly:

Rotate API keys
Update VPN IP ranges
Review and update firewall rules

Emergency Contacts

Security Incident: [TBD - assign security lead]
Platform Issues: [TBD - assign on-call engineer]
GDPR Breach: Persónuverndarnefnd (+354 XXX XXXX)

Quick Links

Version: 1.0
Last Updated: 2025-12-26
Next Review: After P0 implementation
