nvidia-oc/ARCHITECTURE.md
2026-01-14 12:30:45 -08:00

NVIDIA OC Architecture

Overview

@infrastructure/nvidia-oc is a full-stack GPU overclocking solution with three primary interfaces:

  1. Python Core Library - NVML-based GPU control primitives
  2. CLI Tool - Terminal interface for local operations
  3. Web Dashboard - Network-accessible control panel with real-time telemetry

Technology Stack

Backend (Python 3.11+)

  • NVML (nvidia-ml-py) - Direct GPU hardware control
  • FastAPI - Async REST + WebSocket API
  • Pydantic - Data validation and settings management
  • Click - CLI framework
  • Rich - Terminal UI formatting
  • PyYAML - Configuration file loading

Frontend (React 19 + TypeScript)

  • React - UI framework
  • Vite - Development server and build tool
  • @ui/* packages - Component library (workspace dependencies)
  • styled-components - CSS-in-JS styling
  • WebSocket API - Real-time telemetry streaming

Architecture Layers

┌─────────────────────────────────────────────────────────────────┐
│                         User Interfaces                          │
├──────────────────────┬────────────────────────┬─────────────────┤
│   CLI (nvidia-oc)    │   Web UI (React)       │   REST API      │
│   - status           │   - Dashboard          │  - /api/gpus    │
│   - set-clock        │   - Live charts        │  - /api/profiles│
│   - set-fan          │   - Profile manager    │  - /ws/telemetry│
│   - profile          │                        │                 │
└──────────────────────┴────────────────────────┴─────────────────┘
                               │
                               ▼
┌─────────────────────────────────────────────────────────────────┐
│                      Application Layer                           │
├──────────────────────┬────────────────────────┬─────────────────┤
│  CLI Commands        │  FastAPI Routes        │  WebSocket      │
│  (Click framework)   │  (REST endpoints)      │  (Telemetry)    │
└──────────────────────┴────────────────────────┴─────────────────┘
                               │
                               ▼
┌─────────────────────────────────────────────────────────────────┐
│                         Core Library                             │
├──────────────────────┬────────────────────────┬─────────────────┤
│  GPUManager          │  ClockController       │  FanController  │
│  - Enumeration       │  - Get/Set offsets     │  - Manual ctrl  │
│  - Device handles    │  - Reset to defaults   │  - Fan curves   │
├──────────────────────┼────────────────────────┼─────────────────┤
│  TelemetryCollector  │  ProfileManager        │  Validation     │
│  - Metrics polling   │  - YAML profiles       │  - Safety checks│
│  - Async streaming   │  - Apply/Save          │  - Thresholds   │
└──────────────────────┴────────────────────────┴─────────────────┘
                               │
                               ▼
┌─────────────────────────────────────────────────────────────────┐
│                     Hardware Abstraction                         │
├─────────────────────────────────────────────────────────────────┤
│  NVML (nvidia-ml-py) - NVIDIA Management Library                 │
│  - nvmlDeviceGetHandleByIndex()                                  │
│  - nvmlDeviceGetClockInfo()                                      │
│  - nvmlDeviceSetFanSpeed_v2()                                    │
│  - nvmlDeviceGetTemperature()                                    │
└─────────────────────────────────────────────────────────────────┘
                               │
                               ▼
┌─────────────────────────────────────────────────────────────────┐
│                       GPU Hardware                               │
├─────────────────────────────────────────────────────────────────┤
│  NVIDIA GPU (Ampere, Ada Lovelace, Blackwell architectures)     │
│  - Core clock domain                                             │
│  - Memory clock domain                                           │
│  - Fan controller                                                │
│  - Power management                                              │
│  - Thermal sensors                                               │
└─────────────────────────────────────────────────────────────────┘

Core Modules

core/gpu.py - Device Management

Responsibilities:

  • Enumerate NVIDIA GPUs via NVML
  • Maintain device handles
  • Provide device metadata (name, UUID, index)

Key Classes:

  • GPUDevice: Dataclass representing a GPU
  • GPUManager: Singleton for NVML initialization and device enumeration
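
A minimal sketch of how enumeration might look, assuming the nvidia-ml-py bindings (imported as pynvml). The GPUDevice fields follow the metadata listed above; the enumerate_gpus helper and its nvml parameter are hypothetical names, added so the logic can run against a stand-in module on a machine without a GPU.

```python
from dataclasses import dataclass
from typing import Any


@dataclass(frozen=True)
class GPUDevice:
    index: int
    name: str
    uuid: str
    handle: Any  # Opaque NVML device handle


def _as_text(value) -> str:
    # Older nvidia-ml-py releases return bytes, newer ones return str.
    return value.decode() if isinstance(value, bytes) else value


def enumerate_gpus(nvml=None) -> list[GPUDevice]:
    """List all GPUs visible to NVML. Pass `nvml` to substitute a fake module."""
    if nvml is None:
        import pynvml as nvml  # Provided by the nvidia-ml-py package
    nvml.nvmlInit()
    devices = []
    for index in range(nvml.nvmlDeviceGetCount()):
        handle = nvml.nvmlDeviceGetHandleByIndex(index)
        devices.append(GPUDevice(
            index=index,
            name=_as_text(nvml.nvmlDeviceGetName(handle)),
            uuid=_as_text(nvml.nvmlDeviceGetUUID(handle)),
            handle=handle,
        ))
    return devices
```

The sketch never calls nvmlShutdown(); in the design described here, that lifecycle would presumably belong to the GPUManager singleton.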

core/clock.py - Overclocking

Responsibilities:

  • Read current clock speeds (core, memory, shader)
  • Apply clock offsets (requires Coolbits)
  • Reset to default clocks

Key Classes:

  • ClockInfo: Dataclass for clock speeds
  • ClockController: Methods for clock manipulation

Safety:

  • Validates offsets are within safe ranges (±200MHz core, ±1000MHz memory)
  • Checks that Coolbits is enabled before attempting writes

core/fan.py - Fan Control

Responsibilities:

  • Read current fan speed percentage
  • Set manual fan speed
  • Apply temperature-based fan curves
  • Re-enable automatic fan control

Key Classes:

  • FanCurve: Type alias for List[Tuple[int, int]], where each point is (temp_C, fan_percent)
  • FanController: Methods for fan manipulation

Fan Curve Algorithm:

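FanCurve = list[tuple[int, int]]  # (temp_C, fan_percent), sorted by temperature

def apply_curve(temp: int, curve: FanCurve) -> int:
    """Linearly interpolate the fan speed (percent) between curve points."""
    if not curve:
        raise ValueError("Fan curve must contain at least one point")
    for i, (temp_threshold, fan_speed) in enumerate(curve):
        if temp < temp_threshold:
            if i == 0:
                return fan_speed  # Below the first point: hold its speed
            prev_temp, prev_speed = curve[i - 1]
            # Interpolate between the previous point and this one
            ratio = (temp - prev_temp) / (temp_threshold - prev_temp)
            return int(prev_speed + ratio * (fan_speed - prev_speed))
    return curve[-1][1]  # At or above the last point: hold its speed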
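
The interpolation above relies on the curve being sorted by temperature with no repeated thresholds. A hedged sketch of a check that could run before a curve is accepted follows; validate_curve is a hypothetical helper, not a name taken from the codebase.

```python
FanCurve = list[tuple[int, int]]  # (temp_C, fan_percent)


def validate_curve(curve: FanCurve) -> None:
    """Raise ValueError unless the curve is non-empty, in range, and strictly increasing."""
    if not curve:
        raise ValueError("Fan curve must contain at least one point")
    previous_temp = None
    for temp_c, fan_percent in curve:
        if not 0 <= fan_percent <= 100:
            raise ValueError(f"Fan speed {fan_percent}% is outside 0-100")
        if previous_temp is not None and temp_c <= previous_temp:
            raise ValueError("Curve temperatures must be strictly increasing")
        previous_temp = temp_c
```

As a worked example of the interpolation itself: on the curve [(60, 50), (70, 70), (75, 85), (80, 100)], a reading of 65 °C sits halfway between the first two points, so apply_curve returns 50 + 0.5 × (70 − 50) = 60.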

core/telemetry.py - Metrics Collection

Responsibilities:

  • Poll GPU metrics (temp, fan, power, clocks, utilization, memory)
  • Stream metrics asynchronously via async generator
  • Batch metrics for all GPUs

Key Classes:

  • GPUMetrics: Dataclass for all metrics
  • TelemetryCollector: Methods for collection and streaming

Streaming Pattern:

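import asyncio
from collections.abc import AsyncIterator

async def stream(device: GPUDevice, interval: float) -> AsyncIterator[GPUMetrics]:
    """Yield a fresh metrics snapshot every `interval` seconds."""
    while True:
        yield collect(device)  # collect() polls NVML once for this device
        await asyncio.sleep(interval)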
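
The batching responsibility listed above could be sketched along the same lines. The signature below is an assumption: collect is passed in as a parameter so the example runs without NVML, and it is assumed to be safe to call from worker threads.

```python
import asyncio
from collections.abc import AsyncIterator, Callable, Sequence
from typing import Any


async def stream_all(
    devices: Sequence[Any],
    collect: Callable[[Any], Any],
    interval: float = 1.0,
) -> AsyncIterator[list[Any]]:
    """Yield one list of metrics (one entry per device) every `interval` seconds."""
    while True:
        # Run each blocking poll in a worker thread so the event loop stays responsive.
        batch = await asyncio.gather(
            *(asyncio.to_thread(collect, device) for device in devices)
        )
        yield list(batch)
        await asyncio.sleep(interval)
```

A consumer would iterate with `async for batch in stream_all(devices, collect)` and send each batch as a single message. asyncio.gather preserves input order, so every batch lists devices in the order they were passed in.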

core/profile.py - Profile Management

Responsibilities:

  • Load profiles from YAML files
  • Validate profiles with Pydantic schemas
  • Apply profiles to GPUs
  • Save current settings as profiles

Key Classes:

  • ProfileConfig: Pydantic model for profile validation
  • ProfileManager: Methods for profile I/O and application

Profile Format (YAML):

name: "Balanced"
core_offset: 100        # MHz
memory_offset: 500      # MHz
power_limit: 100        # percent
fan_curve:
  - [60, 50]            # [temp_C, fan_percent]
  - [70, 70]
  - [75, 85]
  - [80, 100]
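
To illustrate the kind of checks a profile might go through, here is a sketch that uses a standard-library dataclass as a stand-in for the Pydantic ProfileConfig model, so it runs without third-party packages. The class name ProfileSketch and the default values are assumptions; the offset limits mirror the safe ranges listed under core/clock.py.

```python
from dataclasses import dataclass, field


@dataclass
class ProfileSketch:
    """Stand-in for ProfileConfig; the real model is described as a Pydantic schema."""
    name: str
    core_offset: int = 0      # MHz (default is an assumption)
    memory_offset: int = 0    # MHz (default is an assumption)
    power_limit: int = 100    # percent
    fan_curve: list[tuple[int, int]] = field(default_factory=list)

    def __post_init__(self) -> None:
        if abs(self.core_offset) > 200:
            raise ValueError("core_offset must be within ±200 MHz")
        if abs(self.memory_offset) > 1000:
            raise ValueError("memory_offset must be within ±1000 MHz")
        # YAML parses each curve point as a two-item list; normalize to tuples.
        self.fan_curve = [(int(t), int(s)) for t, s in self.fan_curve]
        temps = [t for t, _ in self.fan_curve]
        if temps != sorted(set(temps)):
            raise ValueError("fan_curve temperatures must be strictly increasing")


# The dict below is what yaml.safe_load() would return for the "Balanced" profile.
balanced = ProfileSketch(**{
    "name": "Balanced",
    "core_offset": 100,
    "memory_offset": 500,
    "power_limit": 100,
    "fan_curve": [[60, 50], [70, 70], [75, 85], [80, 100]],
})
```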

API Layer

REST Endpoints

Method  Endpoint                     Description
GET     /api/gpus                    List all GPUs with metadata
GET     /api/gpus/{id}/status        Get real-time metrics for GPU
POST    /api/gpus/{id}/clock         Set clock offsets
POST    /api/gpus/{id}/fan           Set fan speed or curve
GET     /api/profiles                List available profiles
POST    /api/profiles/{name}/apply   Apply profile to all GPUs

WebSocket Telemetry

Endpoint: WS /ws/telemetry

Format:

{
  "timestamp": 1234567890.123,
  "gpus": [
    {
      "id": 0,
      "name": "RTX 3090",
      "temp": 75,
      "fan": 48,
      "power": 367.99,
      "core_clock": 1815,
      "memory_clock": 9501,
      "utilization": 95,
      "memory_used": 6144,
      "memory_total": 24576
    }
  ]
}

Update Frequency: 1 Hz (configurable)
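
A sketch of how one such message could be assembled on the server side, using only the standard library. GPUMetricsSketch and build_telemetry_message are hypothetical names; the field names copy the sample above, and the timestamp is assumed to be seconds since the Unix epoch.

```python
import json
import time
from dataclasses import asdict, dataclass


@dataclass
class GPUMetricsSketch:
    id: int
    name: str
    temp: int
    fan: int
    power: float
    core_clock: int
    memory_clock: int
    utilization: int
    memory_used: int
    memory_total: int


def build_telemetry_message(metrics: list[GPUMetricsSketch]) -> str:
    """Serialize one batch of metrics into a JSON text frame in the format above."""
    return json.dumps({
        "timestamp": time.time(),  # Assumed: seconds since the Unix epoch
        "gpus": [asdict(m) for m in metrics],
    })
```

Because all GPUs travel in one frame, a client can update every card and chart from a single parse.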

CLI Commands

Status Command

nvidia-oc status [--watch]

Displays a formatted table of GPU metrics using the Rich library. With --watch, the table refreshes every second.

Clock Command

nvidia-oc set-clock --gpu <id> --core <offset> --memory <offset>
nvidia-oc set-clock --gpu <id> --reset

Fan Command

nvidia-oc set-fan --gpu <id> --speed <percent>
nvidia-oc set-fan --gpu <id> --auto

Profile Command

nvidia-oc profile list
nvidia-oc profile apply <name>
nvidia-oc profile save <name>

Frontend Architecture

Component Hierarchy

App.tsx
├── ThemeProvider (cyberpunk adapter)
│   └── ToastProvider
│       ├── Navigation
│       └── Container
│           ├── Grid (GPU Cards)
│           │   ├── GPUCard (GPU 0)
│           │   └── GPUCard (GPU 1)
│           ├── Card (Controls)
│           │   ├── ClockControl
│           │   └── FanControl
│           ├── Grid (Charts)
│           │   ├── Card (Temperature)
│           │   │   └── TelemetryChart
│           │   └── Card (Power)
│           │       └── TelemetryChart
│           └── ProfileManager
│               └── DataTable

Custom Hooks

useWebSocket(url)

  • Establishes WebSocket connection
  • Manages connection state (connecting, connected, disconnected, error)
  • Parses incoming telemetry messages
  • Returns: { metrics, connectionState, error }

useGPUData()

  • Fetches GPU list from REST API
  • Manages GPU state
  • Provides mutation functions (updateClock, updateFan)
  • Returns: { gpus, loading, error, updateClock, updateFan }

useProfiles()

  • Fetches profile list
  • Provides apply/save/delete functions
  • Returns: { profiles, loading, applyProfile, saveProfile, deleteProfile }

Data Flow

Overclocking Flow

User Input (Web UI slider or CLI)
         ↓
Validation (client-side or CLI)
         ↓
API Request (POST /api/gpus/{id}/clock)
         ↓
Pydantic Validation (API models)
         ↓
ClockController.set_clock_offset()
         ↓
NVML nvmlDeviceSetGpcClkVfOffset() / nvmlDeviceSetMemClkVfOffset()
         ↓
GPU Hardware
         ↓
Success Response (API)
         ↓
Toast Notification (Web UI) or Console Output (CLI)
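
The core-library step of this flow might look like the sketch below. It assumes the nvidia-ml-py bindings expose nvmlDeviceSetGpcClkVfOffset and nvmlDeviceSetMemClkVfOffset and that these are the calls behind ClockController; the function signature and the nvml parameter are hypothetical, the latter added so a fake can stand in for the driver.

```python
CLOCK_OFFSET_LIMITS_MHZ = {"core": 200, "memory": 1000}


def set_clock_offset(handle, domain: str, offset_mhz: int, nvml=None) -> None:
    """Validate an offset, then hand it to NVML. Raises ValueError before touching hardware."""
    limit = CLOCK_OFFSET_LIMITS_MHZ.get(domain)
    if limit is None:
        raise ValueError(f"Unknown clock domain: {domain!r}")
    if abs(offset_mhz) > limit:
        raise ValueError(f"{domain} offset must be within ±{limit} MHz")
    if nvml is None:
        import pynvml as nvml  # Provided by the nvidia-ml-py package
    if domain == "core":
        nvml.nvmlDeviceSetGpcClkVfOffset(handle, offset_mhz)
    else:
        nvml.nvmlDeviceSetMemClkVfOffset(handle, offset_mhz)
```

Validating before the lazy import means a rejected request never reaches the driver, which keeps the failure path identical on machines with and without a GPU.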

Telemetry Streaming Flow

TelemetryCollector.stream_all(interval=1.0)
         ↓
Async Generator (yields metrics every 1s)
         ↓
FastAPI WebSocket Handler
         ↓
JSON Serialization
         ↓
WebSocket Send
         ↓
Frontend useWebSocket Hook
         ↓
React State Update
         ↓
Component Re-render (charts, cards, status)

Safety Mechanisms

Clock Validation

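CLOCK_OFFSET_LIMITS_MHZ = {"core": 200, "memory": 1000}

def validate_clock_offset(offset: int, domain: str) -> None:
    """Raise ValueError if `offset` (MHz) is outside the safe range for `domain`."""
    if domain not in CLOCK_OFFSET_LIMITS_MHZ:
        raise ValueError(f"Unknown clock domain: {domain!r}")
    limit = CLOCK_OFFSET_LIMITS_MHZ[domain]
    if abs(offset) > limit:
        raise ValueError(f"{domain.capitalize()} offset must be within ±{limit}MHz")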

Temperature Protection

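import asyncio
import logging

logger = logging.getLogger(__name__)

FAN_MAX_TEMP_C = 85      # At or above: force fan to 100%
CLOCK_RESET_TEMP_C = 90  # At or above: also reset clock offsets

async def monitor_temperature():
    """Background task to prevent overheating."""
    while True:
        for gpu in gpus:
            temp = get_temperature(gpu)
            if temp >= FAN_MAX_TEMP_C:
                # Emergency: set fan to 100%
                set_fan_speed(gpu, 100)
                logger.warning(f"GPU {gpu.index} at {temp}°C (>= {FAN_MAX_TEMP_C}°C), fan to 100%")
            if temp >= CLOCK_RESET_TEMP_C:
                # Critical: reset clocks (the fan was already forced to 100% above)
                reset_clocks(gpu)
                logger.error(f"GPU {gpu.index} at {temp}°C (>= {CLOCK_RESET_TEMP_C}°C), clocks reset")
        await asyncio.sleep(5)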

Coolbits Check

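On startup and before any OC operation, verify that Coolbits is enabled. Coolbits is a bitmask: 4 enables manual fan control, 8 enables clock offsets, and 28 (4 + 8 + 16) additionally enables overvoltage.

import re

def check_coolbits(path: str = "/etc/X11/xorg.conf") -> bool:
    """Check whether the Xorg config enables Coolbits fan (4) and clock (8) control."""
    try:
        with open(path, "r") as f:
            content = f.read()
    except FileNotFoundError:
        return False
    # Match an uncommented line such as: Option "Coolbits" "28"
    match = re.search(r'^\s*Option\s+"Coolbits"\s+"(\d+)"', content, re.IGNORECASE | re.MULTILINE)
    if match is None:
        return False
    required = 4 | 8  # Bit 2: manual fan control; bit 3: clock offsets
    return (int(match.group(1)) & required) == required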

Deployment

Development

# Terminal 1: Backend
uvicorn nvidia_oc.api.main:app --reload --host 0.0.0.0

# Terminal 2: Frontend
cd frontend && pnpm dev

Access: http://localhost:5173 (Vite dev server proxies API to port 8000)

Production (Systemd)

[Unit]
Description=NVIDIA Overclocking Control Panel
After=network.target nvidia-persistenced.service

[Service]
Type=simple
User=root
ExecStart=/usr/local/bin/uvicorn nvidia_oc.api.main:app --host 0.0.0.0 --port 8000
Restart=always
RestartSec=10

[Install]
WantedBy=multi-user.target

The frontend build artifacts are served by FastAPI at the / route.

Testing Strategy

Unit Tests (pytest)

  • Mock NVML calls - Test without GPU hardware
  • Test validation logic - Clock/fan range checks
  • Test profile loading - YAML parsing and Pydantic validation
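
As an illustration of the mocking approach, the sketch below tests a temperature read without any GPU present. read_temperature is a hypothetical helper that takes the NVML module as a parameter; nvmlDeviceGetTemperature and NVML_TEMPERATURE_GPU are the real binding names, replaced here by a MagicMock.

```python
from unittest.mock import MagicMock


def read_temperature(nvml, handle) -> int:
    """Hypothetical helper under test: read the GPU core temperature in °C."""
    return nvml.nvmlDeviceGetTemperature(handle, nvml.NVML_TEMPERATURE_GPU)


def test_read_temperature_uses_gpu_sensor():
    nvml = MagicMock()
    nvml.NVML_TEMPERATURE_GPU = 0
    nvml.nvmlDeviceGetTemperature.return_value = 75

    assert read_temperature(nvml, handle="fake-handle") == 75
    nvml.nvmlDeviceGetTemperature.assert_called_once_with("fake-handle", 0)
```

Passing the NVML module in as an argument is one design option; patching the import with unittest.mock.patch would work as well if the core modules import pynvml directly.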

Integration Tests

  • API endpoint tests - Use FastAPI TestClient
  • WebSocket tests - Verify telemetry streaming
  • Multi-GPU tests - Mock multiple devices

Stress Tests

  • 24-hour burn-in - Continuous operation with load
  • Stability validation - Monitor for crashes/throttling
  • Profile switching - Rapid profile changes

Performance Considerations

  • NVML polling - Limit to 1 Hz to keep driver query overhead low
  • WebSocket batching - Send all GPU metrics in single message
  • React memoization - Prevent unnecessary re-renders of charts
  • Async operations - Non-blocking clock/fan adjustments
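
One way the 1 Hz polling limit could be enforced when several consumers (CLI watch mode, multiple WebSocket clients) ask for metrics at once is a small cache in front of the poll. RateLimitedPoller is a hypothetical name, and the injectable clock exists only to make the behavior testable.

```python
import time
from typing import Any, Callable


class RateLimitedPoller:
    """Re-use the last reading until `min_interval` seconds have passed."""

    def __init__(self, poll: Callable[[], Any], min_interval: float = 1.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._poll = poll
        self._min_interval = min_interval
        self._clock = clock
        self._last_time: float | None = None
        self._last_value: Any = None

    def read(self) -> Any:
        now = self._clock()
        if self._last_time is None or now - self._last_time >= self._min_interval:
            self._last_value = self._poll()  # The only place the underlying poll runs
            self._last_time = now
        return self._last_value
```

With this wrapper, any number of read() calls inside the same second trigger at most one underlying poll.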

Security Considerations

  • Root privileges - Required for NVML writes (document clearly)
  • Network access - Bind to 0.0.0.0 for LAN access (optional localhost-only mode)
  • Input validation - Pydantic models reject malformed or out-of-range request payloads before they reach NVML
  • No authentication - v0.1.0 assumes a trusted LAN (authentication is planned for v0.3.0)

Future Enhancements

  • v0.2.0:

    • Voltage control (requires advanced Coolbits)
    • Power limit curves (dynamic based on load)
    • Historical metrics database (SQLite or InfluxDB)
  • v0.3.0:

    • Multi-node support (cluster management)
    • Authentication (JWT + HTTPS)
    • Profile scheduler (auto-switch based on time/load)
  • v1.0.0:

    • Stable API contract
    • Packaging for major distros (Fedora, Ubuntu, Arch)
    • Mobile-responsive UI