Running a Local LLM on AMD Radeon 780M — gfx1103, ROCm, and the GPU That Wasn't Supposed to Work

Table of Contents

  • The Machine
  • The Problem: gfx1103 Doesn’t Exist
  • GTT Memory — 24 GB for Free
  • The ROCm Stack
  • Getting GPU Inference Working
  • Optimizing: The Hidden GPU Clock Problem
  • Benchmarks — Every Configuration Tested
  • The Surprising Finding: CPU Beats GPU on Generation
  • Monitoring with Collectd and Grafana
  • What the Dashboard Actually Shows
  • Lessons Learned

I wanted a local AI box. Not a cloud API with latency and per-token billing. Not a GPU workstation that sounds like a jet engine. A quiet mini-PC that runs a capable model at home, on my desk, forever, for free.

After some research I picked up a Lenovo ThinkCentre M75Q Gen5 — 8-core AMD Ryzen 7 PRO 8700GE APU, 32 GB DDR5-5600, Radeon 780M iGPU. Around 400€. Fits in a hand. Silent.

What followed was three days of ROCm archaeology, kernel parameter tuning, and benchmarking everything I could think of. This is the full account.


The Machine

The M75Q Gen5 is an APU (Accelerated Processing Unit) — CPU and GPU on the same die, sharing system RAM. No discrete GPU. No separate VRAM. Just 32 GB DDR5-5600 doing triple duty as CPU memory, GPU memory, and swap space for inference workloads.

Hardware
├── CPU: AMD Ryzen 7 PRO 8700GE — 8 cores / 16 threads, Zen 4
├── iGPU: AMD Radeon 780M — 12 Compute Units, gfx1103
├── RAM: 32 GB DDR5-5600 dual-channel (~89.6 GB/s theoretical bandwidth)
├── Storage: 512 GB NVMe → LVM 466 GB at /
└── OS: Ubuntu 26.04 LTS, kernel 7.0.0-22

The GPU has no dedicated VRAM. Instead, it uses two memory pools:

  • UMA VRAM: a slice of RAM carved out by the BIOS (typically 512 MB – 2 GB)
  • GTT (Graphics Translation Table): the GPU’s window into the rest of system RAM, managed by the amdgpu kernel driver

For LLM inference, GTT is what matters. With the right kernel parameter, you can expose 24 GB of system RAM to the GPU. That’s enough to load Gemma 4 12B Q4_0 (7.6 GB) with room to spare.


The Problem: gfx1103 Doesn’t Exist

The Radeon 780M uses GPU architecture version gfx1103. In AMD’s naming scheme, that’s 11.0.3 — RDNA3, Hawk Point generation.

When you install Ollama and try to run a model, this happens:

WARN dropping ROCm device — no rocblas support for gfx target
device=ROCm0 gfx_target=gfx1103
supported="map[gfx1030:true gfx1100:true gfx1101:true gfx1102:true
           gfx1150:true gfx1151:true gfx1200:true gfx1201:true
           gfx908:true gfx90a:true gfx942:true gfx950:true]"
hint="set HSA_OVERRIDE_GFX_VERSION to map to a supported target"

gfx1103 is not in Ollama’s supported list. The GPU is rejected. Inference falls back to CPU.

The hint tells you exactly what to do. HSA_OVERRIDE_GFX_VERSION is an AMD ROCm environment variable that tells the hardware identification layer to present the GPU as a different version. Set it to 11.0.2 and the GPU announces itself as gfx1102 — which IS in the supported list.

export HSA_OVERRIDE_GFX_VERSION=11.0.2

That’s it. One environment variable. GPU inference works.

Why 11.0.2 and not the GPU’s real 11.0.3? Ollama’s bundled ROCm 7.2 includes highly optimized Tensile kernel libraries for gfx1102 (Navi 33 RDNA3 cards such as the RX 7600, a common gaming card with large community usage). Those kernels run on gfx1103 hardware because the architectures are close relatives: same RDNA3 generation, similar CU layout.
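The override string follows from the gfx target name itself: the last two characters are the minor version and stepping, and everything before them is the major version. Here is a minimal sketch of that conversion. It assumes the last two characters are hex digits (so gfx90a would become 9.0.10), and gfx_to_version is a hypothetical helper name, not something shipped with ROCm or Ollama.

```shell
# gfx_to_version is a hypothetical helper, not part of ROCm or Ollama.
# It turns a gfx target name into the dotted form used by
# HSA_OVERRIDE_GFX_VERSION, e.g. gfx1102 -> 11.0.2.
# Assumption: the last two characters are single hex digits.
gfx_to_version() (
  digits=${1#gfx}                    # "1102"
  step=${digits#"${digits%?}"}       # last character: stepping
  rest=${digits%?}                   # "110"
  minor=${rest#"${rest%?}"}          # second-to-last character: minor
  major=${rest%?}                    # whatever is left: major
  printf '%d.%d.%d\n' "$major" "0x$minor" "0x$step"
)

gfx_to_version gfx1102   # prints 11.0.2
gfx_to_version gfx1103   # prints 11.0.3
```

The function body runs in a subshell, so its scratch variables do not leak into the calling script.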

flowchart TD
    A[Ollama starts] --> B{GPU detection}
    B -->|gfx1103 found| C[Check supported list]
    C -->|gfx1103 NOT in list| D[Drop GPU → CPU fallback]
    C -->|HSA_OVERRIDE=11.0.2| E[GPU presents as gfx1102]
    E --> F[gfx1102 IS in list ✓]
    F --> G[Load ROCm 7.2 Tensile kernels for gfx1102]
    G --> H[GPU inference active]
    D --> I[CPU-only inference ~5 tok/s]
    H --> I2[GPU inference ~4.5 tok/s gen / 54 tok/s prefill]

GTT Memory — 24 GB for Free

Before ROCm can use GPU memory, the driver needs to know how much GTT to expose. By default, amdgpu limits GTT to a fraction of system RAM. For a 32 GB machine that might be 4–8 GB — not enough for a 7.6 GB model.

The fix is a kernel boot parameter:

# /etc/default/grub
GRUB_CMDLINE_LINUX="amdgpu.gttsize=24576"
# 24576 MB = 24 GB GTT pool

After sudo update-grub && sudo reboot, verify:

cat /sys/class/drm/card0/device/mem_info_gtt_total
# 25769803776 = 24 GB ✓

cat /sys/class/drm/card0/device/mem_info_gtt_used
# 7845441536 during inference = ~7.8 GB used for model

The model loads entirely into GTT. The GPU can access it directly via DMA without copying.
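The raw byte counts are easier to read as GiB with a usage percentage. The sketch below wraps the two sysfs reads in a function. gtt_report is a hypothetical helper name, and it assumes the GPU is card0 unless another device directory is passed in.

```shell
# gtt_report is a hypothetical helper. It reads the two amdgpu sysfs files
# shown above and prints usage in GiB (2^30 bytes).
# Assumption: the GPU is card0; pass another device directory if not.
gtt_report() (
  dev=${1:-/sys/class/drm/card0/device}
  total=$(cat "$dev/mem_info_gtt_total") || exit 1
  used=$(cat "$dev/mem_info_gtt_used") || exit 1
  awk -v t="$total" -v u="$used" 'BEGIN {
    printf "GTT %.1f / %.1f GiB (%.0f%%)\n", u / 2^30, t / 2^30, 100 * u / t
  }'
)
# Usage: gtt_report
```

With the values above this prints "GTT 7.3 / 24.0 GiB (30%)". The 7.3 GiB is the same quantity as the ~7.8 GB quoted earlier, just in binary units.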

graph LR
    subgraph DDR5["DDR5-5600 32 GB (89.6 GB/s)"]
        UMA["UMA VRAM\n~512 MB\n(BIOS carved)"]
        GTT["GTT Pool\n24 GB\n(amdgpu.gttsize=24576)"]
        SYSRAM["System RAM\n~7.5 GB\nOS + processes"]
    end
    subgraph GPU["Radeon 780M — 12 CU gfx1103"]
        ROCm["ROCm runtime"]
    end
    MODEL["/Storage/models/\ngemma4-qat:12b\n7.6 GB GGUF"] -->|loaded into| GTT
    ROCm <-->|DMA access| GTT
    ROCm <-->|direct| UMA

The ROCm Stack

Here’s what’s actually running on the machine after setup:

graph TB
    CLIENT["Client HTTP / OpenAI-compat\n:11434"]
    OLLAMA["Ollama 0.30.6 official\n/usr/local/bin/ollama"]
    ROCM["ROCm 7.2 bundled\n/usr/local/lib/ollama/rocm_v7_2/"]
    ROCBLAS["librocblas.so.5.2\n1792 Tensile kernels\ngfx1102 optimized"]
    HIP["libggml-hip.so\nGGML HIP backend"]
    GPU["AMD Radeon 780M\ngfx1103 → gfx1102 via HSA_OVERRIDE"]
    GTT["GTT Pool 24 GB\nModel: gemma4-qat:12b 7.6 GB"]
    COLLECTD["collectd exec\ngpu_metrics.sh\ncollect_ollama_metrics.sh"]
    DEVA["deva 192.168.1.28\nInfluxDB → Grafana"]
    CLIENT --> OLLAMA
    OLLAMA --> ROCM
    ROCM --> ROCBLAS
    ROCM --> HIP
    HIP --> GPU
    GPU <--> GTT
    COLLECTD -->|UDP :25826| DEVA
    OLLAMA -.->|metrics via journald| COLLECTD
    GPU -.->|sysfs hwmon| COLLECTD

Key environment variables in /etc/systemd/system/ollama.service:

[Service]
User=lgirardi
Environment=HSA_OVERRIDE_GFX_VERSION=11.0.2
Environment=ROC_ENABLE_PRE_VEGA=1
Environment=AMDGPU_TARGETS=gfx1103
Environment=OLLAMA_MODELS=/Storage/models
Environment=OLLAMA_IGPU_ENABLE=1
Environment=OLLAMA_FLASH_ATTENTION=1
Environment=OLLAMA_NUM_PARALLEL=1
Environment=LD_LIBRARY_PATH=/usr/local/lib/ollama
ExecStartPre=+/bin/sh -c 'echo high > /sys/class/drm/card0/device/power_dpm_force_performance_level'
ExecStart=/usr/local/bin/ollama serve
ExecStopPost=+/bin/sh -c 'echo auto > /sys/class/drm/card0/device/power_dpm_force_performance_level'

Two things worth calling out:

OLLAMA_IGPU_ENABLE=1: Ollama drops integrated GPUs by default. This tells it to keep them.

ExecStartPre: Forces the GPU to maximum clock speed (2700 MHz) at service start. This turns out to be critical — see the next section.


Optimizing: The Hidden GPU Clock Problem

After getting GPU inference working, I benchmarked it and got 4.4 tok/s generation speed. Then I ran it on CPU and got 5.3 tok/s. The CPU was faster.

Something was wrong.

I checked the GPU clock:

cat /sys/class/drm/card0/device/pp_dpm_sclk
# 0: 800Mhz *      ← running at 800 MHz!
# 1: 1100Mhz
# 2: 2700Mhz

The GPU was running at 800 MHz — the lowest power state. The amdgpu driver defaults to auto power management, and with an iGPU doing inference (which doesn’t look like a gaming workload to the driver), it chose to downclock.

Force it to maximum:

echo high | sudo tee /sys/class/drm/card0/device/power_dpm_force_performance_level

After the change:

cat /sys/class/drm/card0/device/pp_dpm_sclk
# 0: 2700Mhz
# 1: 1100Mhz
# 2: 2700Mhz *     ← now at 2700 MHz

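This is why the ExecStartPre line in the service unit matters. The forced level does not survive a reboot, and the ExecStopPost line hands control back to auto whenever the service stops, so without ExecStartPre the GPU ends up back at 800 MHz. The payoff shows up in prefill (37 tok/s at 800 MHz versus 54 tok/s at 2700 MHz in the benchmarks below); generation speed barely changes, for reasons covered after the benchmarks.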
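To catch a silent downclock from a script rather than by eye, the active state can be pulled out of pp_dpm_sclk by looking for the line marked with an asterisk. This is a minimal sketch. active_sclk_mhz is a hypothetical helper name, and it assumes the GPU is card0.

```shell
# active_sclk_mhz is a hypothetical helper: it prints the clock (in MHz) of
# the state marked with '*' in pp_dpm_sclk.
# Assumption: the GPU is card0; pass another file path if not.
active_sclk_mhz() {
  awk '/\*/ { sub(/[Mm][Hh][Zz]$/, "", $2); print $2 }' \
    "${1:-/sys/class/drm/card0/device/pp_dpm_sclk}"
}
# Usage, e.g. from a cron job or a monitoring script:
#   [ "$(active_sclk_mhz)" -ge 2700 ] || echo "GPU is downclocked" >&2
```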


Benchmarks — Every Configuration Tested

I tested every meaningful combination I could think of:

  • Ollama + ROCm (the main path, gfx1102 mapping)
  • llama.cpp + Vulkan (Mesa RADV, native gfx1103 support)
  • llama.cpp + HIP native gfx1103 (compiled from source with GPU_TARGETS=gfx1103)
  • llama.cpp + CPU only (8 threads, 16 threads)
  • llama.cpp split (24 GPU layers + 24 CPU layers)

Model: Gemma 4 12B QAT Q4_0 (google/gemma-4-12B-it-qat-q4_0-gguf), 7.6 GB. Prompt: “write numbers 1 to 30”, 80 tokens generated. GPU tests ran at 2700 MHz unless a row says otherwise.

Setup                                       Prefill (tok/s)   Generation (tok/s)
Ollama + ROCm 7.2 gfx1102 + FA              54                4.5
Vulkan RADV PHOENIX (Mesa 26.0.3)           39                4.7
llama.cpp HIP native gfx1103 (ROCm 7.1)     22                4.5
llama.cpp split 24GPU + 24CPU               18.5              4.7
llama.cpp CPU-only 8 threads                14                5.3
llama.cpp CPU-only 16 threads               16                5.1
Ollama + ROCm gfx1102 (800 MHz, no fix)     37                4.5

xychart-beta
    title "Prefill Speed (tok/s) — higher is better"
    x-axis ["Ollama ROCm\n2700MHz+FA", "Vulkan RADV", "llama.cpp\nHIP native", "Split 24/24", "CPU 8t", "CPU 16t", "Ollama ROCm\n800MHz"]
    y-axis "tok/s" 0 --> 60
    bar [54, 39, 22, 18.5, 14, 16, 37]

xychart-beta
    title "Generation Speed (tok/s) — higher is better"
    x-axis ["Ollama ROCm\n2700MHz+FA", "Vulkan RADV", "llama.cpp\nHIP native", "Split 24/24", "CPU 8t", "CPU 16t", "Ollama ROCm\n800MHz"]
    y-axis "tok/s" 0 --> 6
    bar [4.5, 4.7, 4.5, 4.7, 5.3, 5.1, 4.5]
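For the Ollama rows, comparable numbers can be derived from the timing fields returned with a non-streaming /api/generate response: prompt_eval_count and prompt_eval_duration for prefill, eval_count and eval_duration for generation, with durations in nanoseconds. The sketch below shows only that arithmetic and is not necessarily the exact harness behind the table. tok_per_s is a hypothetical helper name.

```shell
# tok_per_s COUNT DURATION_NS -> tokens per second, one decimal place.
# Hypothetical helper; feed it eval_count/eval_duration (generation) or
# prompt_eval_count/prompt_eval_duration (prefill) from an Ollama response.
tok_per_s() {
  awk -v n="$1" -v ns="$2" 'BEGIN { printf "%.1f\n", n / (ns / 1e9) }'
}

# Illustrative values, not measurements: 80 tokens in 16 s of eval time.
tok_per_s 80 16000000000    # prints 5.0
```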

The Surprising Finding: CPU Beats GPU on Generation

Look at the generation column. CPU (5.3 tok/s) beats every GPU configuration (4.5–4.7 tok/s).

This is not intuitive. GPUs are supposed to be faster. For LLM inference on a discrete GPU with fast GDDR6 or HBM, they are. But on an APU with unified memory, the math is different.

Here’s why:

graph LR
    subgraph CPU["CPU Path"]
        CPU_CORE["Ryzen 7 PRO 8700GE\n8 cores / AVX-512 BF16"] -->|direct DDR5 access\nno overhead| DDR5_CPU["DDR5 ~60-70 GB/s\neffective bandwidth"]
    end
    subgraph GPU["GPU Path (GTT)"]
        GPU_CU["Radeon 780M\n12 CUs"] -->|UMA / GTT DMA\n+ ROCm dispatch overhead| DDR5_GPU["DDR5 ~29 GB/s\neffective bandwidth\n(32% of theoretical)"]
    end

Token generation is memory bandwidth bound, not compute bound. Each forward pass reads the full 6.5 GB of model weights from RAM. The CPU does this with direct DDR5 access and AVX-512 BF16 SIMD instructions. The iGPU has to go through the GTT DMA path, which adds latency and reduces effective bandwidth.

The result: CPU achieves ~5.3 tok/s, GPU achieves ~4.5 tok/s. Both read the same DDR5, so the GPU has no bandwidth advantage to begin with, and its extra overhead leaves it slightly behind.
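The bandwidth figures can be sanity-checked with one multiplication: if every generated token streams the weights from RAM once, effective bandwidth is roughly tokens per second times bytes read per token. This is a back-of-the-envelope sketch under that assumption, using the 6.5 GB per pass and the 89.6 GB/s theoretical peak quoted in this post. effective_bw is a hypothetical helper name.

```shell
# effective_bw TOK_PER_S GB_PER_TOKEN PEAK_GBS
# Hypothetical helper. Assumes each generated token streams GB_PER_TOKEN of
# weights from RAM exactly once; caches and KV-cache reads are ignored.
effective_bw() {
  awk -v tps="$1" -v gb="$2" -v peak="$3" 'BEGIN {
    bw = tps * gb
    printf "%.2f GB/s (%.1f%% of peak)\n", bw, 100 * bw / peak
  }'
}

effective_bw 4.5 6.5 89.6   # GPU: prints 29.25 GB/s (32.6% of peak)
effective_bw 5.3 6.5 89.6   # CPU: prints 34.45 GB/s (38.4% of peak)
```

The GPU result lines up with the ~29 GB/s in the diagram. Under the same simple model the CPU run works out to about 34 GB/s, so the 60-70 GB/s label is better read as what the CPU can pull from DDR5 in general than as something derived from the 5.3 tok/s result.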

Prefill is the opposite. Processing an input prompt (the “prefill” phase) is compute-bound — you’re doing batch matrix multiply across all input tokens simultaneously. Here the GPU’s 12 CUs at 2700 MHz dominate: 54 tok/s vs 14 tok/s on CPU. The clock speed matters too: the same GPU at 800 MHz only manages 37 tok/s.

quadrantChart
    title "Prefill vs Generation — Where Each Backend Wins"
    x-axis "Low Generation Speed" --> "High Generation Speed"
    y-axis "Low Prefill Speed" --> "High Prefill Speed"
    quadrant-1 "Best overall"
    quadrant-2 "Best for RAG / long prompts"
    quadrant-3 "Avoid"
    quadrant-4 "Best for chat"
    Ollama ROCm 2700MHz: [0.43, 0.97]
    Vulkan RADV: [0.49, 0.67]
    llama.cpp native: [0.43, 0.36]
    CPU 8 threads: [0.96, 0.18]
    Split 24-24: [0.49, 0.30]

Practical strategy:

  • Chat (short prompts, long responses): CPU-only llama.cpp — smoother output at 5.3 tok/s
  • RAG / document analysis (long prompts): Ollama GPU — 54 tok/s prefill makes a real difference
  • Current setup: Ollama GPU (best prefill, slightly slower generation but acceptable)
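The trade-off in the list above becomes concrete once prefill and generation times are added up for a given request shape. This is a rough sketch with the speeds taken from the benchmark table. The prompt and response lengths are illustrative assumptions, and response_seconds is a hypothetical helper name.

```shell
# response_seconds PROMPT_TOKENS GEN_TOKENS PREFILL_TPS GEN_TPS
# Hypothetical helper: total time = prefill time + generation time.
response_seconds() {
  awk -v p="$1" -v g="$2" -v pf="$3" -v gen="$4" \
    'BEGIN { printf "%.1f\n", p / pf + g / gen }'
}

# Token counts are illustrative assumptions; speeds come from the table.
response_seconds 50 300 54 4.5     # short chat prompt, Ollama GPU
response_seconds 50 300 14 5.3     # short chat prompt, CPU 8 threads
response_seconds 4000 300 54 4.5   # long RAG-style prompt, Ollama GPU
response_seconds 4000 300 14 5.3   # long RAG-style prompt, CPU 8 threads
```

For the short prompt the CPU finishes first (about 60 s versus about 68 s). For the 4000-token prompt the GPU wins by a wide margin (about 141 s versus about 342 s), because prefill dominates the total.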

Why llama.cpp Native gfx1103 Lost

I compiled llama.cpp from source with native gfx1103 HIP support:

cmake -B build -DGGML_HIP=ON -DGPU_TARGETS=gfx1103 -DCMAKE_BUILD_TYPE=Release
cmake --build build -j$(nproc) --target llama-server

This produces a 73 MB libggml-hip.so with GGML HIP kernels compiled natively for gfx1103. No HSA_OVERRIDE needed — the HIP runtime accepts the GPU directly.

Yet it was slower than Ollama’s gfx1102 mapping (22 vs 54 tok/s prefill).

The reason: ROCm version. Ubuntu 26.04 ships ROCm 7.1 in apt. Ollama bundles ROCm 7.2 in its own library directory (/usr/local/lib/ollama/rocm_v7_2/). The 7.2 gfx1102 Tensile kernels are more aggressively optimized than the 7.1 gfx1103 native kernels. A newer version of a close-relative kernel beats an older version of a native one.

The system ROCm 7.1 does include native gfx1103 Tensile kernels:

ls /usr/lib/x86_64-linux-gnu/rocblas/5.1.0/library/ | grep gfx1103
# Kernels.so-000-gfx1103.hsaco  ← exists!

But they’re slower. For now, the mapping approach wins.
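The same check generalizes to "which targets does this rocBLAS build ship kernels for?" by pulling the gfx names out of the file names in the library directory. This is a small sketch. rocblas_targets is a hypothetical helper name, and it assumes the target name appears in each file name, as it does in the listing above.

```shell
# rocblas_targets is a hypothetical helper: it lists the distinct gfx targets
# that appear in the file names of a rocBLAS library directory.
rocblas_targets() {
  ls "$1" | grep -oE 'gfx[0-9a-f]+' | sort -u
}
# Usage (path from this machine; adjust for other installs):
#   rocblas_targets /usr/lib/x86_64-linux-gnu/rocblas/5.1.0/library
```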


Why Vulkan Lost

Mesa RADV supports gfx1103 natively. It’s a well-maintained Vulkan driver for AMD hardware, and after installing mesa-vulkan-drivers, Ollama automatically detected it:

library=Vulkan compute=0.0 name=Vulkan0
description="AMD Radeon 780M Graphics (RADV PHOENIX)"
type=iGPU total="24.5 GiB"

Note that Vulkan sees 24.5 GiB (essentially the full GTT), while ROCm sees only 15 GiB. More memory visibility didn’t translate to better performance: 39 tok/s prefill vs 54 tok/s for ROCm.

GGML’s Vulkan shader kernels are less optimized for batch GEMM operations than the Tensile ROCm path. Vulkan is a viable fallback if ROCm refuses to work, but it’s not the fastest path on this hardware.


Monitoring with Collectd and Grafana

The machine ships metrics to my central monitoring stack via collectd UDP to deva (192.168.1.28:25826 → InfluxDB → Grafana).

sequenceDiagram
    participant GPU as GPU sysfs/hwmon
    participant Script as gpu_metrics.sh
    participant Ollama as Ollama API/journald
    participant OScript as collect_ollama_metrics.sh
    participant Collectd as collectd exec plugin
    participant InfluxDB as InfluxDB (deva)
    participant Grafana as Grafana
    loop Every 60 seconds
        Script->>GPU: read mem_info_gtt_used/total
        Script->>GPU: read hwmon temp1_input (amdgpu hwmon)
        Script->>Collectd: PUTVAL gpu/bytes-gtt_used
        Script->>Collectd: PUTVAL gpu/bytes-vram_used
        Script->>Collectd: PUTVAL gpu/temperature-gpu
        OScript->>Ollama: GET /api/ps (model loaded?)
        OScript->>Ollama: journalctl parse (request count, duration)
        OScript->>Collectd: PUTVAL ollama/gauge-models_loaded
        OScript->>Collectd: PUTVAL ollama/counter-requests
        Collectd->>InfluxDB: UDP :25826
        Grafana->>InfluxDB: query
    end

One gotcha: the GPU temperature sensor. The machine has multiple hwmon entries: the NVMe thermal sensor on hwmon0, the CPU on hwmon1, and amdgpu on hwmon3. A naive cat /sys/class/hwmon/hwmon*/temp1_input | head -1 grabs the NVMe (always ~45°C), not the GPU. The correct approach is to look up the hwmon entry by name:

AMDGPU_HWMON=$(grep -rl "^amdgpu$" /sys/class/hwmon/hwmon*/name | head -1 | xargs dirname)
GPU_TEMP_RAW=$(cat "$AMDGPU_HWMON/temp1_input")
GPU_TEMP_C=$(awk "BEGIN {printf \"%.1f\", $GPU_TEMP_RAW/1000}")

Another gotcha: bc (the calculator) is not installed on Ubuntu 26.04. Use awk for arithmetic instead.
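Once the temperature is in hand, it still has to reach collectd. The exec plugin reads PUTVAL lines from the script's stdout and sets COLLECTD_HOSTNAME and COLLECTD_INTERVAL in its environment. The sketch below formats one such line with awk, using the gpu/temperature-gpu identifier from the diagram above. emit_gpu_temp is a hypothetical helper name, and the fallback values are only there for running it by hand.

```shell
# emit_gpu_temp MILLIDEGREES -> one PUTVAL line for collectd's exec plugin.
# Hypothetical helper. COLLECTD_HOSTNAME and COLLECTD_INTERVAL are set by the
# exec plugin; the fallbacks are assumptions for manual runs.
emit_gpu_temp() {
  host=${COLLECTD_HOSTNAME:-localhost}
  interval=${COLLECTD_INTERVAL:-60}
  awk -v raw="$1" -v h="$host" -v i="$interval" 'BEGIN {
    printf "PUTVAL \"%s/gpu/temperature-gpu\" interval=%s N:%.1f\n", h, i, raw / 1000
  }'
}
# Usage inside the collection loop:
#   emit_gpu_temp "$GPU_TEMP_RAW"
```

The N: prefix tells collectd to timestamp the value with the current time, which saves the script from calling date.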

The Ollama metrics situation is more awkward. Ollama 0.30.6 returns 404 for GET /metrics regardless of the OLLAMA_METRICS=true environment variable — the endpoint isn’t implemented in this version. The workaround is to parse the GIN access logs from journald:

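# Successful POSTs to the native and OpenAI-compatible endpoints in the last interval.
# v1/(chat/)?completions also matches /v1/chat/completions, which a plain v1/completions misses.
LINES=$(journalctl -u ollama.service --since "65 seconds ago" --no-pager -o cat | \
  grep -E '\[GIN\].*\| 200 \|.*POST.*(api/generate|api/chat|v1/(chat/)?completions)')
REQS=$(echo "$LINES" | grep -c "\[GIN\]")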

This gives request counts and durations. Token counts (prompt tokens, completion tokens) aren’t available without the metrics endpoint.
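Durations take a little more work than counts, because the latency column in a GIN access line is a Go duration whose unit changes with magnitude (for example 500ms, 2s, 1m23.5s). This is a sketch of averaging that column over the filtered lines. It assumes the default GIN layout, where latency is the third pipe-separated field, and gin_avg_seconds is a hypothetical helper name.

```shell
# gin_avg_seconds is a hypothetical helper: reads GIN access-log lines on
# stdin and prints the mean latency in seconds.
# Assumption: default GIN layout, latency in the third '|'-separated field.
gin_avg_seconds() {
  awk -F'|' '
    {
      d = $3; gsub(/[ \t]/, "", d)          # e.g. "13.251ms"
      unit = d; sub(/^[0-9.]+/, "", unit)   # strip the number, keep the unit
      if (d ~ /^[0-9]+m[0-9.]+s$/) { split(d, p, "m"); sec = p[1] * 60 + p[2] }
      else if (unit == "s")  sec = d + 0
      else if (unit == "ms") sec = d / 1000
      else if (unit == "ns") sec = d / 1e9
      else                   sec = d / 1e6  # remaining case: microseconds
      total += sec; n++
    }
    END { if (n) printf "%.3f\n", total / n; else print "0.000" }
  '
}
# Usage with the variable from the snippet above:
#   AVG=$(echo "$LINES" | grep "\[GIN\]" | gin_avg_seconds)
```

Hour-long requests are not handled; on this hardware a request that slow would be a problem in its own right.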


What the Dashboard Actually Shows

The Grafana dashboard has four sections:

System Overview (stat panels, always populated):

  • Load Average, Memory %, CPU %, GPU GTT Used, GPU Temperature, Disk /

Ollama LLM (stat panels):

  • Model Loaded (0 or 1, from /api/ps)
  • Requests/s (from journald)
  • Avg Request Duration (from journald)
  • Request Duration sum

Time Series (graphs):

  • Request Rate over time
  • System Load (1m/5m/15m)
  • GPU Memory — GTT Used, VRAM Used, VRAM Total
  • GPU Temperature
  • Memory (used/cached/buffered)
  • Network (enp2s0f1 RX/TX)

Token Metrics:

  • Cumulative Request Count
  • CPU Detail (user/system/wait %)

The GPU memory panel shows the inference pattern clearly: a spike from ~14 MB (idle) to ~7.8 GB (model loaded) when a request comes in, then back down after the 5-minute keep-alive expires.


Lessons Learned

The GPU clock issue is the biggest surprise. APU iGPUs don’t behave like discrete GPUs. The power management driver doesn’t associate “ROCm inference” with “needs maximum clock”. Without explicitly forcing power_dpm_force_performance_level=high, you’re running at a third of peak frequency and wondering why performance is bad.

CPU beats GPU for generation on APUs. If you’re running a single-user chat assistant where generation fluency matters, CPU inference with llama.cpp may actually give a better experience. The unified memory architecture removes the GPU’s bandwidth advantage. This flips the usual “always use GPU” assumption.

Ollama’s bundled ROCm beats system ROCm. Don’t try to build your own ROCm stack to get “native” support — you’ll end up with an older version of the libraries that performs worse. The official Ollama installer bundles ROCm 7.2 with well-optimized kernels. Use it.

HSA_OVERRIDE_GFX_VERSION works, but understand what it does. You’re telling the ROCm runtime to report the GPU as something it isn’t. The gfx1102 Tensile kernels run on gfx1103 hardware because the architectures are close enough. This isn’t a hack: AMD maintains near-binary compatibility across RDNA3 variants by design. But it means you’re not running kernel code compiled for your exact hardware. Native gfx1103 kernels already exist in ROCm 7.1 and are slower today; if a future ROCm release tunes them properly, removing the override should give better performance.

For APU inference, the model size ceiling is real but generous. 24 GB GTT gives you headroom for anything up to ~13B parameters at Q4 quantization — comfortably covering the best open-source models in the “fits in a living room” category. Gemma 4 12B QAT Q4_0 is an excellent choice: quantization-aware training means near-BF16 quality at Q4_0 file size.


The machine runs 24/7. Response times are acceptable. It doesn’t phone home. The electricity bill is a rounding error.

For anyone who wants a local AI assistant without a datacenter, an APU mini-PC is a surprisingly capable platform — once you understand why the GPU clock is sitting at 800 MHz and fix it.