lilith/nvidia-oc

Lilith f46c96544f chore: 🔧 Update files

2026-01-14 12:30:45 -08:00

NVIDIA OC Architecture

Overview

@infrastructure/nvidia-oc is a full-stack GPU overclocking solution with three primary interfaces:

Python Core Library - NVML-based GPU control primitives
CLI Tool - Terminal interface for local operations
Web Dashboard - Network-accessible control panel with real-time telemetry

Technology Stack

Backend (Python 3.11+)

NVML (nvidia-ml-py) - Direct GPU hardware control
FastAPI - Async REST + WebSocket API
Pydantic - Data validation and settings management
Click - CLI framework
Rich - Terminal UI formatting
PyYAML - Configuration file loading

Frontend (React 19 + TypeScript)

React - UI framework
Vite - Development server and build tool
@ui/* packages - Component library (workspace dependencies)
styled-components - CSS-in-JS styling
WebSocket API - Real-time telemetry streaming

Architecture Layers

┌─────────────────────────────────────────────────────────────────┐
│                         User Interfaces                          │
├──────────────────────┬────────────────────────┬─────────────────┤
│   CLI (nvidia-oc)    │   Web UI (React)       │   REST API      │
│   - status           │   - Dashboard          │   - /api/gpus   │
│   - set-clock        │   - Live charts        │   - /api/profile│
│   - set-fan          │   - Profile manager    │   - /ws/telemetry│
│   - profile          │                        │                 │
└──────────────────────┴────────────────────────┴─────────────────┘
                               │
                               ▼
┌─────────────────────────────────────────────────────────────────┐
│                      Application Layer                           │
├──────────────────────┬────────────────────────┬─────────────────┤
│  CLI Commands        │  FastAPI Routes        │  WebSocket      │
│  (Click framework)   │  (REST endpoints)      │  (Telemetry)    │
└──────────────────────┴────────────────────────┴─────────────────┘
                               │
                               ▼
┌─────────────────────────────────────────────────────────────────┐
│                         Core Library                             │
├──────────────────────┬────────────────────────┬─────────────────┤
│  GPUManager          │  ClockController       │  FanController  │
│  - Enumeration       │  - Get/Set offsets     │  - Manual ctrl  │
│  - Device handles    │  - Reset to defaults   │  - Fan curves   │
├──────────────────────┼────────────────────────┼─────────────────┤
│  TelemetryCollector  │  ProfileManager        │  Validation     │
│  - Metrics polling   │  - YAML profiles       │  - Safety checks│
│  - Async streaming   │  - Apply/Save          │  - Thresholds   │
└──────────────────────┴────────────────────────┴─────────────────┘
                               │
                               ▼
┌─────────────────────────────────────────────────────────────────┐
│                     Hardware Abstraction                         │
├─────────────────────────────────────────────────────────────────┤
│  NVML (nvidia-ml-py) - NVIDIA Management Library                 │
│  - nvmlDeviceGetHandleByIndex()                                  │
│  - nvmlDeviceGetClockInfo()                                      │
│  - nvmlDeviceSetFanSpeed_v2()                                    │
│  - nvmlDeviceGetTemperature()                                    │
└─────────────────────────────────────────────────────────────────┘
                               │
                               ▼
┌─────────────────────────────────────────────────────────────────┐
│                       GPU Hardware                               │
├─────────────────────────────────────────────────────────────────┤
│  NVIDIA GPU (Ampere, Ada Lovelace, Blackwell architectures)     │
│  - Core clock domain                                             │
│  - Memory clock domain                                           │
│  - Fan controller                                                │
│  - Power management                                              │
│  - Thermal sensors                                               │
└─────────────────────────────────────────────────────────────────┘

Core Modules

`core/gpu.py` - Device Management

Responsibilities:

Enumerate NVIDIA GPUs via NVML
Maintain device handles
Provide device metadata (name, UUID, index)

Key Classes:

GPUDevice: Dataclass representing a GPU
GPUManager: Singleton for NVML initialization and device enumeration

`core/clock.py` - Overclocking

Responsibilities:

Read current clock speeds (core, memory, shader)
Apply clock offsets (requires Coolbits)
Reset to default clocks

Key Classes:

ClockInfo: Dataclass for clock speeds
ClockController: Methods for clock manipulation

Safety:

Validates that offsets are within safe ranges (±200MHz core, ±1000MHz memory)
Checks that Coolbits is enabled before attempting writes

`core/fan.py` - Fan Control

Responsibilities:

Read current fan speed percentage
Set manual fan speed
Apply temperature-based fan curves
Re-enable automatic fan control

Key Classes:

FanCurve: Type alias for List[Tuple[temp_C, fan_percent]]
FanController: Methods for fan manipulation

Fan Curve Algorithm:

from typing import List, Tuple

FanCurve = List[Tuple[int, int]]  # (temp_C, fan_percent), sorted by ascending temp

def apply_curve(temp: int, curve: FanCurve) -> int:
    """Linear interpolation between curve points."""
    for i, (temp_threshold, fan_speed) in enumerate(curve):
        if temp < temp_threshold:
            if i == 0:
                return fan_speed  # Below the first point: use its speed
            prev_temp, prev_speed = curve[i - 1]
            # Linear interpolate
            ratio = (temp - prev_temp) / (temp_threshold - prev_temp)
            return int(prev_speed + ratio * (fan_speed - prev_speed))
    return curve[-1][1]  # At or above the last point: use its speed
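
The interpolation above assumes a non-empty curve sorted by ascending temperature. A minimal sketch of a pre-check that could run before a curve is applied; `validate_curve` is a hypothetical helper and its rules are illustrative assumptions, not part of the documented API:

```python
from typing import List, Tuple

FanCurve = List[Tuple[int, int]]  # (temp_C, fan_percent)


def validate_curve(curve: FanCurve) -> None:
    """Raise ValueError if the curve is unsuitable for interpolation.

    Hypothetical helper; the name and rules are assumptions for illustration.
    """
    if not curve:
        raise ValueError("Fan curve must contain at least one point")
    for temp, speed in curve:
        if not 0 <= speed <= 100:
            raise ValueError(f"Fan speed {speed}% at {temp}°C is outside 0-100")
    temps = [temp for temp, _ in curve]
    # Out-of-order or duplicate temperatures would leave some points unused
    # or make the result ambiguous, so require strictly increasing values.
    if any(b <= a for a, b in zip(temps, temps[1:])):
        raise ValueError("Fan curve temperatures must be strictly increasing")


balanced = [(60, 50), (70, 70), (75, 85), (80, 100)]
validate_curve(balanced)  # passes silently for the "Balanced" curve
```

Rejecting a malformed curve up front keeps `apply_curve` itself free of defensive branches.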

`core/telemetry.py` - Metrics Collection

Responsibilities:

Poll GPU metrics (temp, fan, power, clocks, utilization, memory)
Stream metrics asynchronously via async generator
Batch metrics for all GPUs

Key Classes:

GPUMetrics: Dataclass for all metrics
TelemetryCollector: Methods for collection and streaming

Streaming Pattern:

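import asyncio
from typing import AsyncIterator

async def stream(device: GPUDevice, interval: float) -> AsyncIterator[GPUMetrics]:
    """Yield metrics every `interval` seconds."""
    while True:
        yield collect(device)  # collect(): one metrics poll for this device
        await asyncio.sleep(interval)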
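
The generator above never terminates on its own, so the consumer decides when to stop. A minimal sketch of consuming it with `async for`; the stand-in `collect`, the dictionary shape it returns, and the `take` helper with its `max_samples` cutoff are assumptions for illustration:

```python
import asyncio
from typing import AsyncIterator, Dict, List


def collect(device: int) -> Dict[str, int]:
    """Stand-in for the real NVML-backed collector; returns fixed fake values."""
    return {"id": device, "temp": 60}


async def stream(device: int, interval: float) -> AsyncIterator[Dict[str, int]]:
    """Yield metrics every `interval` seconds, as in the pattern above."""
    while True:
        yield collect(device)
        await asyncio.sleep(interval)


async def take(device: int, max_samples: int) -> List[Dict[str, int]]:
    """Consume the stream and stop after `max_samples` readings."""
    samples: List[Dict[str, int]] = []
    async for metrics in stream(device, interval=0.01):
        samples.append(metrics)
        if len(samples) >= max_samples:
            break  # stop iterating; the infinite generator is simply abandoned
    return samples


samples = asyncio.run(take(device=0, max_samples=3))
print(len(samples), "samples collected")
```

A WebSocket handler would follow the same shape, with the `break` replaced by whatever signals that the client has disconnected.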

`core/profile.py` - Profile Management

Responsibilities:

Load profiles from YAML files
Validate profiles with Pydantic schemas
Apply profiles to GPUs
Save current settings as profiles

Key Classes:

ProfileConfig: Pydantic model for profile validation
ProfileManager: Methods for profile I/O and application

Profile Format (YAML):

name: "Balanced"
core_offset: 100        # MHz
memory_offset: 500      # MHz
power_limit: 100        # percent
fan_curve:
  - [60, 50]            # [temp_C, fan_percent]
  - [70, 70]
  - [75, 85]
  - [80, 100]
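
To illustrate the kind of checks a profile schema might enforce, here is a stdlib-only sketch that validates the mapping a YAML loader would return for the profile above. It uses a plain dataclass as a stand-in for the Pydantic `ProfileConfig` model; the class name `ProfileSketch`, the default values, and the exact rules are assumptions, with the offset limits taken from the safety ranges stated in this document:

```python
from dataclasses import dataclass, field
from typing import Any, Dict, List, Tuple


@dataclass
class ProfileSketch:
    """Illustrative stand-in for ProfileConfig (not the project's model)."""

    name: str
    core_offset: int = 0      # MHz
    memory_offset: int = 0    # MHz
    power_limit: int = 100    # percent
    fan_curve: List[Tuple[int, int]] = field(default_factory=list)

    def __post_init__(self) -> None:
        if abs(self.core_offset) > 200:
            raise ValueError("Core offset must be within ±200MHz")
        if abs(self.memory_offset) > 1000:
            raise ValueError("Memory offset must be within ±1000MHz")
        # Normalize [temp, percent] lists (as parsed from YAML) into tuples.
        self.fan_curve = [(int(t), int(p)) for t, p in self.fan_curve]
        if any(not 0 <= p <= 100 for _, p in self.fan_curve):
            raise ValueError("Fan percentages must be within 0-100")


def load_profile(data: Dict[str, Any]) -> ProfileSketch:
    """Build a validated profile from an already-parsed mapping."""
    return ProfileSketch(**data)


# This mapping mirrors what parsing the "Balanced" YAML would produce.
balanced = load_profile({
    "name": "Balanced",
    "core_offset": 100,
    "memory_offset": 500,
    "power_limit": 100,
    "fan_curve": [[60, 50], [70, 70], [75, 85], [80, 100]],
})
```

Validating at load time means an out-of-range profile is rejected before any value reaches the GPU.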

API Layer

REST Endpoints

Method	Endpoint	Description
GET	`/api/gpus`	List all GPUs with metadata
GET	`/api/gpus/{id}/status`	Get real-time metrics for GPU
POST	`/api/gpus/{id}/clock`	Set clock offsets
POST	`/api/gpus/{id}/fan`	Set fan speed or curve
GET	`/api/profiles`	List available profiles
POST	`/api/profiles/{name}/apply`	Apply profile to all GPUs

WebSocket Telemetry

Endpoint: WS /ws/telemetry

Format:

{
  "timestamp": 1234567890.123,
  "gpus": [
    {
      "id": 0,
      "name": "RTX 3090",
      "temp": 75,
      "fan": 48,
      "power": 367.99,
      "core_clock": 1815,
      "memory_clock": 9501,
      "utilization": 95,
      "memory_used": 6144,
      "memory_total": 24576
    }
  ]
}

Update Frequency: 1Hz (configurable)
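
For clients outside the browser, a telemetry frame in the format above can be decoded with the standard library. A minimal sketch, assuming exactly the field names shown in the sample message; the `GPUSample` class and `parse_frame` function are hypothetical names:

```python
import json
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class GPUSample:
    id: int
    name: str
    temp: int
    fan: int
    power: float
    core_clock: int
    memory_clock: int
    utilization: int
    memory_used: int
    memory_total: int

    @property
    def memory_fraction(self) -> float:
        """Fraction of memory in use (a unit-independent ratio)."""
        return self.memory_used / self.memory_total


def parse_frame(raw: str) -> Tuple[float, List[GPUSample]]:
    """Decode one JSON telemetry frame into a timestamp and per-GPU samples."""
    frame = json.loads(raw)
    return frame["timestamp"], [GPUSample(**gpu) for gpu in frame["gpus"]]


raw = json.dumps({
    "timestamp": 1234567890.123,
    "gpus": [{
        "id": 0, "name": "RTX 3090", "temp": 75, "fan": 48, "power": 367.99,
        "core_clock": 1815, "memory_clock": 9501, "utilization": 95,
        "memory_used": 6144, "memory_total": 24576,
    }],
})
timestamp, gpus = parse_frame(raw)
print(f"{gpus[0].name}: {gpus[0].memory_fraction:.0%} memory in use")
```

Because `GPUSample(**gpu)` raises `TypeError` on unknown or missing keys, a change in the message schema fails loudly instead of being silently ignored.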

CLI Commands

Status Command

nvidia-oc status [--watch]

Displays a formatted table of GPU metrics using the Rich library. With --watch, the table refreshes every second.

Clock Command

nvidia-oc set-clock --gpu <id> --core <offset> --memory <offset>
nvidia-oc set-clock --gpu <id> --reset

Fan Command

nvidia-oc set-fan --gpu <id> --speed <percent>
nvidia-oc set-fan --gpu <id> --auto

Profile Command

nvidia-oc profile list
nvidia-oc profile apply <name>
nvidia-oc profile save <name>

Frontend Architecture

Component Hierarchy

App.tsx
├── ThemeProvider (cyberpunk adapter)
│   └── ToastProvider
│       ├── Navigation
│       └── Container
│           ├── Grid (GPU Cards)
│           │   ├── GPUCard (GPU 0)
│           │   └── GPUCard (GPU 1)
│           ├── Card (Controls)
│           │   ├── ClockControl
│           │   └── FanControl
│           ├── Grid (Charts)
│           │   ├── Card (Temperature)
│           │   │   └── TelemetryChart
│           │   └── Card (Power)
│           │       └── TelemetryChart
│           └── ProfileManager
│               └── DataTable

Custom Hooks

useWebSocket(url)

Establishes WebSocket connection
Manages connection state (connecting, connected, disconnected, error)
Parses incoming telemetry messages
Returns: { metrics, connectionState, error }

useGPUData()

Fetches GPU list from REST API
Manages GPU state
Provides mutation functions (updateClock, updateFan)
Returns: { gpus, loading, error, updateClock, updateFan }

useProfiles()

Fetches profile list
Provides apply/save/delete functions
Returns: { profiles, loading, applyProfile, saveProfile, deleteProfile }

Data Flow

Overclocking Flow

User Input (Web UI slider or CLI)
         ↓
Validation (client-side or CLI)
         ↓
API Request (POST /api/gpus/{id}/clock)
         ↓
Pydantic Validation (API models)
         ↓
ClockController.set_clock_offset()
         ↓
NVML nvmlDeviceSetClockOffsets()
         ↓
GPU Hardware
         ↓
Success Response (API)
         ↓
Toast Notification (Web UI) or Console Output (CLI)

Telemetry Streaming Flow

TelemetryCollector.stream_all(interval=1.0)
         ↓
Async Generator (yields metrics every 1s)
         ↓
FastAPI WebSocket Handler
         ↓
JSON Serialization
         ↓
WebSocket Send
         ↓
Frontend useWebSocket Hook
         ↓
React State Update
         ↓
Component Re-render (charts, cards, status)

Safety Mechanisms

Clock Validation

def validate_clock_offset(offset: int, domain: str) -> None:
    if domain == "core" and abs(offset) > 200:
        raise ValueError("Core offset must be within ±200MHz")
    if domain == "memory" and abs(offset) > 1000:
        raise ValueError("Memory offset must be within ±1000MHz")

Temperature Protection

import asyncio
import logging

logger = logging.getLogger(__name__)

async def monitor_temperature():
    """Background task to prevent overheating."""
    # gpus, get_temperature, set_fan_speed, reset_clocks: core-library helpers (not shown)
    while True:
        for gpu in gpus:
            temp = get_temperature(gpu)
            if temp >= 85:
                # Emergency: set fan to 100%
                set_fan_speed(gpu, 100)
                logger.warning(f"GPU {gpu.index} at {temp}°C (>= 85°C), fan to 100%")
            if temp >= 90:
                # Critical: reset clocks
                reset_clocks(gpu)
                logger.error(f"GPU {gpu.index} at {temp}°C (>= 90°C), clocks reset")
        await asyncio.sleep(5)
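
The threshold logic in the loop above can be factored into a pure function, which makes it testable without hardware. A sketch under that assumption; `thermal_actions` and the action strings are hypothetical names, while the 85°C and 90°C thresholds come from the monitor above:

```python
from typing import List

FAN_EMERGENCY_C = 85   # at or above: force fan to 100%
CLOCK_RESET_C = 90     # at or above: also reset clock offsets


def thermal_actions(temp_c: int) -> List[str]:
    """Return the protective actions the monitor would take at `temp_c`."""
    actions: List[str] = []
    if temp_c >= FAN_EMERGENCY_C:
        actions.append("fan_100")
    if temp_c >= CLOCK_RESET_C:
        actions.append("reset_clocks")
    return actions


for temp in (70, 85, 92):
    print(temp, thermal_actions(temp))
```

Note that the two checks are independent rather than `if`/`elif`: at 90°C and above, both actions apply.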

Coolbits Check

On startup and before any OC operation, verify Coolbits is enabled:

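import re

def check_coolbits(path: str = "/etc/X11/xorg.conf") -> bool:
    """Check if Coolbits is enabled in Xorg config."""
    try:
        with open(path, "r") as f:
            content = f.read()
    except FileNotFoundError:
        return False
    # Parse the option value itself; a bare `"4" in content` test would match
    # any unrelated "4" elsewhere in the file.
    match = re.search(r'Option\s+"Coolbits"\s+"(\d+)"', content, re.IGNORECASE)
    return match is not None and match.group(1) in ("4", "28")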
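
Coolbits is a bitmask, so individual features can be tested bit by bit. The sketch below decodes a value using the bit meanings from NVIDIA's X driver documentation (4 for manual fan control, 8 for clock offsets, 16 for overvoltage); the function and dictionary names are illustrative, and only this subset of bits is covered:

```python
from typing import List

# Bit meanings as described in the NVIDIA X driver README (subset).
COOLBITS_FEATURES = {
    4: "manual fan control",
    8: "clock offsets",
    16: "overvoltage",
}


def decode_coolbits(value: int) -> List[str]:
    """List the features enabled by a Coolbits bitmask value."""
    return [name for bit, name in COOLBITS_FEATURES.items() if value & bit]


print(decode_coolbits(4))    # fan control only
print(decode_coolbits(28))   # 4 + 8 + 16: all three features
```

This is why both 4 and 28 count as enabled for fan control: each has bit 4 set, and 28 additionally sets the clock-offset and overvoltage bits.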

Deployment

Development

# Terminal 1: Backend
uvicorn nvidia_oc.api.main:app --reload --host 0.0.0.0

# Terminal 2: Frontend
cd frontend && pnpm dev

Access: http://localhost:5173 (Vite dev server proxies API to port 8000)

Production (Systemd)

[Unit]
Description=NVIDIA Overclocking Control Panel
After=network.target nvidia-persistenced.service

[Service]
Type=simple
User=root
ExecStart=/usr/local/bin/uvicorn nvidia_oc.api.main:app --host 0.0.0.0 --port 8000
Restart=always
RestartSec=10

[Install]
WantedBy=multi-user.target

Frontend build artifacts are served by FastAPI at the / route.

Testing Strategy

Unit Tests (pytest)

Mock NVML calls - Test without GPU hardware
Test validation logic - Clock/fan range checks
Test profile loading - YAML parsing and Pydantic validation
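
One way to test without GPU hardware is to pass the NVML binding in as a parameter and substitute a mock. A minimal sketch using `unittest.mock`; `read_temperatures` and `make_fake_nvml` are hypothetical functions, while the mocked names (`nvmlDeviceGetCount`, `nvmlDeviceGetHandleByIndex`, `nvmlDeviceGetTemperature`, `NVML_TEMPERATURE_GPU`) are ones exposed by nvidia-ml-py:

```python
from typing import Any, List
from unittest.mock import MagicMock


def read_temperatures(nvml: Any) -> List[int]:
    """Return the core temperature of every GPU via the given NVML binding."""
    temps = []
    for index in range(nvml.nvmlDeviceGetCount()):
        handle = nvml.nvmlDeviceGetHandleByIndex(index)
        temps.append(nvml.nvmlDeviceGetTemperature(handle, nvml.NVML_TEMPERATURE_GPU))
    return temps


def make_fake_nvml(temps: List[int]) -> MagicMock:
    """Build a mock that behaves like an NVML binding with len(temps) GPUs."""
    fake = MagicMock()
    fake.nvmlDeviceGetCount.return_value = len(temps)
    # Handles are plain strings here; real NVML returns opaque device handles.
    fake.nvmlDeviceGetHandleByIndex.side_effect = lambda index: f"handle-{index}"
    fake.nvmlDeviceGetTemperature.side_effect = (
        lambda handle, sensor: temps[int(handle.split("-")[1])]
    )
    return fake


def test_read_temperatures_two_gpus() -> None:
    fake = make_fake_nvml([61, 74])
    assert read_temperatures(fake) == [61, 74]
    assert fake.nvmlDeviceGetHandleByIndex.call_count == 2


test_read_temperatures_two_gpus()
```

The same fake covers multi-GPU cases by passing a longer list, and under pytest the explicit call on the last line is unnecessary because the `test_` function is discovered automatically.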

Integration Tests

API endpoint tests - Use FastAPI TestClient
WebSocket tests - Verify telemetry streaming
Multi-GPU tests - Mock multiple devices

Stress Tests

24-hour burn-in - Continuous operation with load
Stability validation - Monitor for crashes/throttling
Profile switching - Rapid profile changes

Performance Considerations

NVML polling - Limit to 1Hz to avoid overhead
WebSocket batching - Send all GPU metrics in single message
React memoization - Prevent unnecessary re-renders of charts
Async operations - Non-blocking clock/fan adjustments

Security Considerations

Root privileges - Required for NVML writes (document clearly)
Network access - Bind to 0.0.0.0 for LAN access (optional localhost-only mode)
Input validation - Pydantic models reject malformed or out-of-range request bodies before they reach the core library
No authentication - v0.1.0 assumes a trusted LAN (authentication is planned for v0.3.0)

Future Enhancements

v0.2.0:
- Voltage control (requires advanced Coolbits)
- Power limit curves (dynamic based on load)
- Historical metrics database (SQLite or InfluxDB)
v0.3.0:
- Multi-node support (cluster management)
- Authentication (JWT + HTTPS)
- Profile scheduler (auto-switch based on time/load)
v1.0.0:
- Stable API contract
- Packaging for major distros (Fedora, Ubuntu, Arch)
- Mobile-responsive UI
