NVIDIA OC Architecture
Overview
@infrastructure/nvidia-oc is a full-stack GPU overclocking solution with three primary interfaces:
- Python Core Library - NVML-based GPU control primitives
- CLI Tool - Terminal interface for local operations
- Web Dashboard - Network-accessible control panel with real-time telemetry
Technology Stack
Backend (Python 3.11+)
- NVML (nvidia-ml-py) - Direct GPU hardware control
- FastAPI - Async REST + WebSocket API
- Pydantic - Data validation and settings management
- Click - CLI framework
- Rich - Terminal UI formatting
- PyYAML - Configuration file loading
Frontend (React 19 + TypeScript)
- React - UI framework
- Vite - Development server and build tool
- @ui/* packages - Component library (workspace dependencies)
- styled-components - CSS-in-JS styling
- WebSocket API - Real-time telemetry streaming
Architecture Layers
┌─────────────────────────────────────────────────────────────────┐
│                         User Interfaces                         │
├──────────────────────┬────────────────────────┬─────────────────┤
│  CLI (nvidia-oc)     │  Web UI (React)        │  REST API       │
│  - status            │  - Dashboard           │  - /api/gpus    │
│  - set-clock         │  - Live charts         │  - /api/profiles│
│  - set-fan           │  - Profile manager     │  - /ws/telemetry│
│  - profile           │                        │                 │
└──────────────────────┴────────────────────────┴─────────────────┘
                                 │
                                 ▼
┌─────────────────────────────────────────────────────────────────┐
│                        Application Layer                        │
├──────────────────────┬────────────────────────┬─────────────────┤
│  CLI Commands        │  FastAPI Routes        │  WebSocket      │
│  (Click framework)   │  (REST endpoints)      │  (Telemetry)    │
└──────────────────────┴────────────────────────┴─────────────────┘
                                 │
                                 ▼
┌─────────────────────────────────────────────────────────────────┐
│                          Core Library                           │
├──────────────────────┬────────────────────────┬─────────────────┤
│  GPUManager          │  ClockController       │  FanController  │
│  - Enumeration       │  - Get/Set offsets     │  - Manual ctrl  │
│  - Device handles    │  - Reset to defaults   │  - Fan curves   │
├──────────────────────┼────────────────────────┼─────────────────┤
│  TelemetryCollector  │  ProfileManager        │  Validation     │
│  - Metrics polling   │  - YAML profiles       │  - Safety checks│
│  - Async streaming   │  - Apply/Save          │  - Thresholds   │
└──────────────────────┴────────────────────────┴─────────────────┘
                                 │
                                 ▼
┌─────────────────────────────────────────────────────────────────┐
│                      Hardware Abstraction                       │
├─────────────────────────────────────────────────────────────────┤
│  NVML (nvidia-ml-py) - NVIDIA Management Library                │
│  - nvmlDeviceGetHandleByIndex()                                 │
│  - nvmlDeviceGetClockInfo()                                     │
│  - nvmlDeviceSetFanSpeed_v2()                                   │
│  - nvmlDeviceGetTemperature()                                   │
└─────────────────────────────────────────────────────────────────┘
                                 │
                                 ▼
┌─────────────────────────────────────────────────────────────────┐
│                          GPU Hardware                           │
├─────────────────────────────────────────────────────────────────┤
│  NVIDIA GPU (Ampere, Ada Lovelace, Blackwell architectures)     │
│  - Core clock domain                                            │
│  - Memory clock domain                                          │
│  - Fan controller                                               │
│  - Power management                                             │
│  - Thermal sensors                                              │
└─────────────────────────────────────────────────────────────────┘
Core Modules
core/gpu.py - Device Management
Responsibilities:
- Enumerate NVIDIA GPUs via NVML
- Maintain device handles
- Provide device metadata (name, UUID, index)
Key Classes:
- GPUDevice: Dataclass representing a GPU
- GPUManager: Singleton for NVML initialization and device enumeration
core/clock.py - Overclocking
Responsibilities:
- Read current clock speeds (core, memory, shader)
- Apply clock offsets (requires Coolbits)
- Reset to default clocks
Key Classes:
- ClockInfo: Dataclass for clock speeds
- ClockController: Methods for clock manipulation
Safety:
- Validates offsets are within safe ranges (±200MHz core, ±1000MHz memory)
- Checks Coolbits enabled before attempting writes
core/fan.py - Fan Control
Responsibilities:
- Read current fan speed percentage
- Set manual fan speed
- Apply temperature-based fan curves
- Re-enable automatic fan control
Key Classes:
- FanCurve: Type alias for List[Tuple[temp_C, fan_percent]]
- FanController: Methods for fan manipulation
Fan Curve Algorithm:
def apply_curve(temp: int, curve: FanCurve) -> int:
    """Linear interpolation between curve points."""
    for i, (temp_threshold, fan_speed) in enumerate(curve):
        if temp < temp_threshold:
            if i == 0:
                return fan_speed
            prev_temp, prev_speed = curve[i - 1]
            # Linear interpolate
            ratio = (temp - prev_temp) / (temp_threshold - prev_temp)
            return int(prev_speed + ratio * (fan_speed - prev_speed))
    return curve[-1][1]  # Max speed if temp exceeds all thresholds
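A variant of the same interpolation with a worked example may help when checking curve values by hand. This sketch is an assumption-laden alternative, not the project's implementation: the name `interpolate_fan_speed` is hypothetical, it locates the segment with `bisect` instead of a loop, rejects empty or unsorted curves, and rounds instead of truncating.

```python
from bisect import bisect_right
from typing import List, Tuple

FanCurve = List[Tuple[int, int]]  # (temp_C, fan_percent), sorted by temperature


def interpolate_fan_speed(temp: int, curve: FanCurve) -> int:
    """Piecewise-linear fan speed for `temp`, clamped at both ends of the curve."""
    if not curve:
        raise ValueError("fan curve must contain at least one point")
    temps = [point[0] for point in curve]
    if any(a >= b for a, b in zip(temps, temps[1:])):
        raise ValueError("fan curve temperatures must be strictly increasing")
    i = bisect_right(temps, temp)  # number of points at or below `temp`
    if i == 0:
        return curve[0][1]   # colder than the first point
    if i == len(curve):
        return curve[-1][1]  # at or above the last point
    (t0, s0), (t1, s1) = curve[i - 1], curve[i]
    return round(s0 + (temp - t0) / (t1 - t0) * (s1 - s0))


balanced = [(60, 50), (70, 70), (75, 85), (80, 100)]
print(interpolate_fan_speed(50, balanced))  # 50: below the first point
print(interpolate_fan_speed(65, balanced))  # 60: halfway between 50% and 70%
print(interpolate_fan_speed(90, balanced))  # 100: above the last point
```

The strict-ordering check also rules out two points with the same temperature, which would otherwise cause a division by zero in the interpolation step.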
core/telemetry.py - Metrics Collection
Responsibilities:
- Poll GPU metrics (temp, fan, power, clocks, utilization, memory)
- Stream metrics asynchronously via async generator
- Batch metrics for all GPUs
Key Classes:
- GPUMetrics: Dataclass for all metrics
- TelemetryCollector: Methods for collection and streaming
Streaming Pattern:
async def stream(device: GPUDevice, interval: float):
    """Yield metrics every `interval` seconds."""
    while True:
        yield collect(device)
        await asyncio.sleep(interval)
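Because the generator never finishes on its own, the consumer decides when streaming stops. The sketch below shows one way to consume such a stream. It is illustrative only: `fake_collect` and `take` are hypothetical, and the collector is passed in as a callable so the example runs without a GPU.

```python
import asyncio
from typing import AsyncIterator, Callable, Dict, List


async def stream(collect: Callable[[], Dict[str, int]],
                 interval: float) -> AsyncIterator[Dict[str, int]]:
    """Yield one metrics sample, then sleep `interval` seconds, forever."""
    while True:
        yield collect()
        await asyncio.sleep(interval)


async def take(n: int, interval: float = 0.01) -> List[Dict[str, int]]:
    """Collect `n` samples and then stop consuming the generator."""
    temps = iter([70, 72, 75])

    def fake_collect() -> Dict[str, int]:
        # Hypothetical stand-in for a real NVML read.
        return {"temp": next(temps)}

    samples = []
    async for sample in stream(fake_collect, interval):
        samples.append(sample)
        if len(samples) == n:
            break  # leaving the loop is how a consumer ends the stream
    return samples


samples = asyncio.run(take(3))
print(samples)
```

A WebSocket handler would follow the same shape, with the `break` replaced by the client disconnecting.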
core/profile.py - Profile Management
Responsibilities:
- Load profiles from YAML files
- Validate profiles with Pydantic schemas
- Apply profiles to GPUs
- Save current settings as profiles
Key Classes:
- ProfileConfig: Pydantic model for profile validation
- ProfileManager: Methods for profile I/O and application
Profile Format (YAML):
name: "Balanced"
core_offset: 100      # MHz
memory_offset: 500    # MHz
power_limit: 100      # percent
fan_curve:
  - [60, 50]          # [temp_C, fan_percent]
  - [70, 70]
  - [75, 85]
  - [80, 100]
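The kind of validation a profile goes through can be sketched with the standard library alone. Note the substitution: this uses a dataclass as a stand-in for the Pydantic model so that it runs without extra dependencies. The field names follow the YAML, the offset limits follow the ranges stated for clock safety, and `profile_from_mapping` plus the fan-curve rules are assumptions.

```python
from dataclasses import dataclass
from typing import Any, Dict, Tuple


@dataclass(frozen=True)
class ProfileConfig:
    """Dataclass stand-in for the Pydantic model; field names follow the YAML."""
    name: str
    core_offset: int = 0      # MHz
    memory_offset: int = 0    # MHz
    power_limit: int = 100    # percent
    fan_curve: Tuple[Tuple[int, int], ...] = ()

    def __post_init__(self) -> None:
        if abs(self.core_offset) > 200:
            raise ValueError("core_offset must be within ±200 MHz")
        if abs(self.memory_offset) > 1000:
            raise ValueError("memory_offset must be within ±1000 MHz")
        temps = [temp for temp, _ in self.fan_curve]
        if temps != sorted(set(temps)):
            raise ValueError("fan_curve temperatures must be strictly increasing")
        if any(not 0 <= speed <= 100 for _, speed in self.fan_curve):
            raise ValueError("fan_curve speeds must be between 0 and 100 percent")


def profile_from_mapping(data: Dict[str, Any]) -> ProfileConfig:
    """Build a ProfileConfig from the mapping a YAML loader would return."""
    curve = tuple((int(t), int(s)) for t, s in data.get("fan_curve", []))
    fields = {k: v for k, v in data.items() if k != "fan_curve"}
    return ProfileConfig(fan_curve=curve, **fields)


# The mapping a YAML loader would produce for the "Balanced" profile.
balanced = profile_from_mapping({
    "name": "Balanced",
    "core_offset": 100,
    "memory_offset": 500,
    "power_limit": 100,
    "fan_curve": [[60, 50], [70, 70], [75, 85], [80, 100]],
})
print(balanced.fan_curve[-1])  # (80, 100)
```

Rejecting a bad profile at load time keeps out-of-range values from ever reaching the clock or fan controllers.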
API Layer
REST Endpoints
| Method | Endpoint | Description |
|---|---|---|
| GET | /api/gpus | List all GPUs with metadata |
| GET | /api/gpus/{id}/status | Get real-time metrics for GPU |
| POST | /api/gpus/{id}/clock | Set clock offsets |
| POST | /api/gpus/{id}/fan | Set fan speed or curve |
| GET | /api/profiles | List available profiles |
| POST | /api/profiles/{name}/apply | Apply profile to all GPUs |
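A client call against the clock endpoint might look like the sketch below. The request body schema is not specified in this document, so the JSON field names `core_offset` and `memory_offset` are assumptions borrowed from the profile format. `build_clock_request` is a hypothetical helper that only constructs the request and does not send it.

```python
import json
from urllib.request import Request

BASE_URL = "http://localhost:8000"  # backend port used elsewhere in this document


def build_clock_request(gpu_id: int, core_offset: int, memory_offset: int) -> Request:
    """Build (but do not send) a POST to the clock endpoint."""
    # The JSON field names below are assumptions, not a documented schema.
    body = json.dumps({"core_offset": core_offset, "memory_offset": memory_offset})
    return Request(
        url=f"{BASE_URL}/api/gpus/{gpu_id}/clock",
        data=body.encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


request = build_clock_request(0, core_offset=100, memory_offset=500)
print(request.get_method(), request.full_url)
# Sending it would be: urllib.request.urlopen(request, timeout=5)
```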
WebSocket Telemetry
Endpoint: WS /ws/telemetry
Format:
{
  "timestamp": 1234567890.123,
  "gpus": [
    {
      "id": 0,
      "name": "RTX 3090",
      "temp": 75,
      "fan": 48,
      "power": 367.99,
      "core_clock": 1815,
      "memory_clock": 9501,
      "utilization": 95,
      "memory_used": 6144,
      "memory_total": 24576
    }
  ]
}
Update Frequency: 1Hz (configurable)
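A consumer written in Python could turn each frame into typed records as sketched here. `GPUSample` and `parse_telemetry` are hypothetical names, and the units in the comments are inferred from the sample values rather than stated by the API.

```python
import json
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class GPUSample:
    id: int
    name: str
    temp: int            # °C (assumed)
    fan: int             # percent (assumed)
    power: float         # watts (assumed)
    core_clock: int      # MHz (assumed)
    memory_clock: int    # MHz (assumed)
    utilization: int     # percent (assumed)
    memory_used: int     # MiB (assumed)
    memory_total: int    # MiB (assumed)


def parse_telemetry(raw: str) -> List[GPUSample]:
    """Turn one WebSocket text frame into typed samples."""
    message = json.loads(raw)
    return [GPUSample(**gpu) for gpu in message["gpus"]]


frame = json.dumps({
    "timestamp": 1234567890.123,
    "gpus": [{
        "id": 0, "name": "RTX 3090", "temp": 75, "fan": 48, "power": 367.99,
        "core_clock": 1815, "memory_clock": 9501, "utilization": 95,
        "memory_used": 6144, "memory_total": 24576,
    }],
})
sample = parse_telemetry(frame)[0]
print(f"{sample.name}: {sample.memory_used / sample.memory_total:.0%} memory in use")
```

Constructing the dataclass with `**gpu` fails loudly if the server adds or drops a field, which is useful while the message format is still changing.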
CLI Commands
Status Command
nvidia-oc status [--watch]
Displays a formatted table of GPU metrics using the Rich library. With --watch, the table refreshes every second.
Clock Command
nvidia-oc set-clock --gpu <id> --core <offset> --memory <offset>
nvidia-oc set-clock --gpu <id> --reset
Fan Command
nvidia-oc set-fan --gpu <id> --speed <percent>
nvidia-oc set-fan --gpu <id> --auto
Profile Command
nvidia-oc profile list
nvidia-oc profile apply <name>
nvidia-oc profile save <name>
Frontend Architecture
Component Hierarchy
App.tsx
├── ThemeProvider (cyberpunk adapter)
│   └── ToastProvider
│       ├── Navigation
│       └── Container
│           ├── Grid (GPU Cards)
│           │   ├── GPUCard (GPU 0)
│           │   └── GPUCard (GPU 1)
│           ├── Card (Controls)
│           │   ├── ClockControl
│           │   └── FanControl
│           ├── Grid (Charts)
│           │   ├── Card (Temperature)
│           │   │   └── TelemetryChart
│           │   └── Card (Power)
│           │       └── TelemetryChart
│           └── ProfileManager
│               └── DataTable
Custom Hooks
useWebSocket(url)
- Establishes WebSocket connection
- Manages connection state (connecting, connected, disconnected, error)
- Parses incoming telemetry messages
- Returns:
{ metrics, connectionState, error }
useGPUData()
- Fetches GPU list from REST API
- Manages GPU state
- Provides mutation functions (updateClock, updateFan)
- Returns:
{ gpus, loading, error, updateClock, updateFan }
useProfiles()
- Fetches profile list
- Provides apply/save/delete functions
- Returns:
{ profiles, loading, applyProfile, saveProfile, deleteProfile }
Data Flow
Overclocking Flow
User Input (Web UI slider or CLI)
↓
Validation (client-side or CLI)
↓
API Request (POST /api/gpus/{id}/clock)
↓
Pydantic Validation (API models)
↓
ClockController.set_clock_offset()
↓
NVML nvmlDeviceSetGpcClkVfOffset() / nvmlDeviceSetMemClkVfOffset()
↓
GPU Hardware
↓
Success Response (API)
↓
Toast Notification (Web UI) or Console Output (CLI)
Telemetry Streaming Flow
TelemetryCollector.stream_all(interval=1.0)
↓
Async Generator (yields metrics every 1s)
↓
FastAPI WebSocket Handler
↓
JSON Serialization
↓
WebSocket Send
↓
Frontend useWebSocket Hook
↓
React State Update
↓
Component Re-render (charts, cards, status)
Safety Mechanisms
Clock Validation
def validate_clock_offset(offset: int, domain: str) -> None:
    if domain == "core" and abs(offset) > 200:
        raise ValueError("Core offset must be within ±200MHz")
    if domain == "memory" and abs(offset) > 1000:
        raise ValueError("Memory offset must be within ±1000MHz")
Temperature Protection
async def monitor_temperature():
    """Background task to prevent overheating."""
    while True:
        for gpu in gpus:
            temp = get_temperature(gpu)
            if temp >= 85:
                # Emergency: set fan to 100%
                set_fan_speed(gpu, 100)
                logger.warning(f"GPU {gpu.index} hit 85°C, fan to 100%")
            if temp >= 90:
                # Critical: reset clocks
                reset_clocks(gpu)
                logger.error(f"GPU {gpu.index} hit 90°C, clocks reset")
        await asyncio.sleep(5)
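Pulling the threshold decision out of the monitoring loop makes it testable without hardware. The sketch below is one possible shape, not the project's code: `thermal_actions`, the constant names, and the action labels are hypothetical, while the 85°C and 90°C thresholds come from the loop shown here.

```python
from typing import List

FAN_MAX_TEMP_C = 85      # at or above this, force the fan to 100%
CLOCK_RESET_TEMP_C = 90  # at or above this, also drop the overclock


def thermal_actions(temp_c: int) -> List[str]:
    """Return the protective actions for one temperature reading."""
    actions = []
    if temp_c >= FAN_MAX_TEMP_C:
        actions.append("fan_100")
    if temp_c >= CLOCK_RESET_TEMP_C:
        actions.append("reset_clocks")
    return actions


for reading in (70, 85, 92):
    print(reading, thermal_actions(reading))
```

The background task would then only map each returned label to the matching fan or clock call.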
Coolbits Check
On startup and before any OC operation, verify Coolbits is enabled:
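import re

def check_coolbits(path: str = "/etc/X11/xorg.conf") -> bool:
    """Check if Coolbits enables fan (4) or clock (8) control in Xorg config."""
    try:
        with open(path, "r") as f:
            content = f.read()
    except FileNotFoundError:
        return False
    # Read the option's own value; a bare "4" elsewhere in the file must not count.
    match = re.search(r'Option\s+"Coolbits"\s+"(\d+)"', content, re.IGNORECASE)
    if match is None:
        return False
    # Coolbits is a bitmask: 4 = manual fan control, 8 = clock offsets (28 sets both).
    return bool(int(match.group(1)) & (4 | 8))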
Deployment
Development
# Terminal 1: Backend
uvicorn nvidia_oc.api.main:app --reload --host 0.0.0.0
# Terminal 2: Frontend
cd frontend && pnpm dev
Access: http://localhost:5173 (Vite dev server proxies API to port 8000)
Production (Systemd)
[Unit]
Description=NVIDIA Overclocking Control Panel
After=network.target nvidia-persistenced.service
[Service]
Type=simple
User=root
ExecStart=/usr/local/bin/uvicorn nvidia_oc.api.main:app --host 0.0.0.0 --port 8000
Restart=always
RestartSec=10
[Install]
WantedBy=multi-user.target
Frontend build artifacts are served by FastAPI at the / route.
Testing Strategy
Unit Tests (pytest)
- Mock NVML calls - Test without GPU hardware
- Test validation logic - Clock/fan range checks
- Test profile loading - YAML parsing and Pydantic validation
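Mocking NVML for a unit test could look like the sketch below. `read_temperature` is a hypothetical helper written only to have something to test. The example uses `unittest` from the standard library so it runs without extra packages, and pytest collects `TestCase` classes like this one unchanged.

```python
import unittest
from unittest.mock import MagicMock


def read_temperature(nvml, index: int) -> int:
    """Hypothetical helper under test: read the GPU core temperature in °C."""
    handle = nvml.nvmlDeviceGetHandleByIndex(index)
    return nvml.nvmlDeviceGetTemperature(handle, nvml.NVML_TEMPERATURE_GPU)


class ReadTemperatureTest(unittest.TestCase):
    def test_reads_temperature_through_nvml(self):
        # MagicMock stands in for the pynvml module, so no GPU is needed.
        nvml = MagicMock()
        nvml.nvmlDeviceGetHandleByIndex.return_value = "handle-0"
        nvml.nvmlDeviceGetTemperature.return_value = 75

        self.assertEqual(read_temperature(nvml, 0), 75)
        nvml.nvmlDeviceGetHandleByIndex.assert_called_once_with(0)
        nvml.nvmlDeviceGetTemperature.assert_called_once_with(
            "handle-0", nvml.NVML_TEMPERATURE_GPU)
```

Passing the NVML module in as a parameter is the design choice that makes this possible; code that imports pynvml at module level would need patching instead.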
Integration Tests
- API endpoint tests - Use FastAPI TestClient
- WebSocket tests - Verify telemetry streaming
- Multi-GPU tests - Mock multiple devices
Stress Tests
- 24-hour burn-in - Continuous operation with load
- Stability validation - Monitor for crashes/throttling
- Profile switching - Rapid profile changes
Performance Considerations
- NVML polling - Limit to 1Hz to avoid overhead
- WebSocket batching - Send all GPU metrics in single message
- React memoization - Prevent unnecessary re-renders of charts
- Async operations - Non-blocking clock/fan adjustments
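One way to hold NVML polling to 1Hz regardless of how many clients ask for data is to cache the latest sample. The sketch below illustrates that idea under stated assumptions: `CachedPoller` is a hypothetical class, and the clock is injected so the behavior can be shown without waiting in real time.

```python
import time
from typing import Callable, Dict, Optional


class CachedPoller:
    """Serve a cached sample until `min_interval` seconds have passed."""

    def __init__(self, collect: Callable[[], Dict[str, int]],
                 min_interval: float = 1.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._collect = collect
        self._min_interval = min_interval
        self._clock = clock
        self._last_time: Optional[float] = None
        self._last_sample: Optional[Dict[str, int]] = None

    def sample(self) -> Dict[str, int]:
        now = self._clock()
        if self._last_time is None or now - self._last_time >= self._min_interval:
            self._last_sample = self._collect()  # the only place hardware is read
            self._last_time = now
        return self._last_sample


calls = []


def fake_collect() -> Dict[str, int]:
    calls.append(1)
    return {"temp": 70 + len(calls)}


fake_now = [0.0]
poller = CachedPoller(fake_collect, min_interval=1.0, clock=lambda: fake_now[0])
first = poller.sample()    # reads hardware
second = poller.sample()   # same instant: served from cache
fake_now[0] = 1.5
third = poller.sample()    # interval elapsed: reads hardware again
print(first, second, third, len(calls))
```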
Security Considerations
- Root privileges - Required for NVML writes (document clearly)
- Network access - Bind to 0.0.0.0 for LAN access (optional localhost-only mode)
- Input validation - Pydantic models reject malformed or out-of-range request payloads before they reach NVML
- No authentication - v0.1.0 assumes a trusted LAN (authentication is planned for v0.3.0)
Future Enhancements
- v0.2.0:
  - Voltage control (requires advanced Coolbits)
  - Power limit curves (dynamic based on load)
  - Historical metrics database (SQLite or InfluxDB)
- v0.3.0:
  - Multi-node support (cluster management)
  - Authentication (JWT + HTTPS)
  - Profile scheduler (auto-switch based on time/load)
- v1.0.0:
  - Stable API contract
  - Packaging for major distros (Fedora, Ubuntu, Arch)
  - Mobile-responsive UI