Unlocking the Beast: Pushing AMD Strix Halo to 120GB Unified VRAM under Linux

When AMD announced the Ryzen AI Max+ 395 (codenamed Strix Halo), local AI developers took notice. For the longest time, running massive foundation models required either chaining discrete desktop graphics cards together or buying into Apple's locked-down unified memory ecosystems.

Strix Halo fundamentally shifts this equation. By combining a 16-core Zen 5 CPU, a 40-CU RDNA 3.5 GPU, and a 256-bit memory bus on a single APU, it finally lets us build a powerhouse local AI workstation on a clean, open Linux stack.

In this guide, we will break down the capabilities of a top-tier Ryzen AI Max+ 395 workstation outfitted with 128GB of RAM and walk through the Linux kernel parameters needed to lift the default memory limits, so that up to 120GB of unified memory can be allocated to the GPU.

The Hardware Snapshot: Ryzen AI Max+ 395

While mainstream mobile and desktop APUs historically ship with narrow 64-bit or 128-bit memory buses that throttle local model execution, Strix Halo uses an ultra-wide 256-bit LPDDR5X-8000 memory interface. That works out to roughly 256 GB/s of theoretical peak bandwidth: double a 128-bit design at the same speed and well beyond a dual-channel DDR5 desktop, though still short of Apple's high-end Mac Studio configurations.
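As a quick sanity check on that figure, theoretical peak bandwidth is the bus width in bytes multiplied by the transfer rate. The sketch below assumes LPDDR5X-8000 for both a 256-bit bus and a hypothetical 128-bit bus at the same speed; the function name is illustrative, and sustained real-world bandwidth will be lower than these peaks.

```shell
#!/usr/bin/env bash
# Theoretical peak bandwidth (GB/s) = (bus width in bits / 8) x transfer rate (MT/s) / 1000.
# Assumption: LPDDR5X-8000 for both rows; measured bandwidth is always below this peak.
peak_bandwidth_gbs() {
  local bus_bits=$1 mega_transfers=$2
  echo $(( bus_bits / 8 * mega_transfers / 1000 ))
}

echo "256-bit bus (Strix Halo):        $(peak_bandwidth_gbs 256 8000) GB/s"
echo "128-bit bus (hypothetical APU):  $(peak_bandwidth_gbs 128 8000) GB/s"
```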

Our target test configuration for this build features:

Processor: AMD Ryzen AI Max+ 395 (16 Zen 5 Cores / 32 Threads)
Graphics Compute: Integrated Radeon 8060S (40 RDNA 3.5 Compute Units)
System Memory: 128GB LPDDR5X-8000 (256-bit bus)
Dedicated NPU: AMD XDNA 2 engine (up to 50 TOPS INT8)
Host OS: Fedora / Ubuntu (Linux kernel 6.16 or newer for current Strix Halo amdgpu support; see the troubleshooting notes on 6.16.9)

The Core Concept: Coarse-Grained vs. GTT Allocations

Out of the box, whether you boot Windows or a stock Linux distribution, the firmware typically caps the dedicated UMA (Unified Memory Architecture) graphics pool at 64GB or 96GB. On top of that, many default GPU allocation paths use fine-grained (coherent) memory, which keeps CPU and GPU caches coherent on every access. That coherence traffic is expensive: in memory-bound inference it can reportedly cut generation throughput by a factor of 2 to 4.

To achieve maximum inference speeds on giant models (70B+ parameter dense models or large Mixture-of-Experts architectures), we need to maximize coarse-grained, non-coherent allocations.

Instead of forcing a large static carve-out in the BIOS (which removes that memory from the Linux kernel for as long as the machine is running), the better strategy on Linux is to keep the BIOS reservation at its minimum (512MB) and let the kernel's Graphics Translation Table (GTT) map system RAM into the GPU's address space on demand, up to nearly the full capacity of the machine.

Step-by-Step Configuration: Maximizing the Kernel GTT Pool

By default, recent kernels size the amdgpu GTT pool from the TTM (Translation Table Manager) page limit, which defaults to roughly half of physical RAM: about 64GB on a 128GB machine. To let the driver map up to 120GB of the 128GB pool, we must pass explicit limits to both the amdgpu driver and TTM.
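To see roughly what the default gives you before changing anything, the sketch below estimates the default GTT ceiling from /proc/meminfo, assuming the TTM default of half of system RAM. The function name is hypothetical, and the kernel's exact figure can differ slightly because it works from the memory the kernel actually manages rather than the marketing capacity.

```shell
#!/usr/bin/env bash
# Estimate the default GTT ceiling as half of MemTotal (assumes the TTM default of 50% of RAM).
# Pass a MemTotal value in kB to try other sizes; otherwise /proc/meminfo is read.
default_gtt_estimate() {
  local memtotal_kb=${1:-$(awk '/^MemTotal:/ {print $2}' /proc/meminfo)}
  local total_mib=$(( memtotal_kb / 1024 ))
  echo "MemTotal ${total_mib} MiB -> default GTT about $(( total_mib / 2 )) MiB"
}

default_gtt_estimate
```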

1. The Mathematical Calculation

The two parameters use different units: amdgpu.gttsize is specified in mebibytes (MiB), while ttm.pages_limit is specified in 4 KiB pages. For a 128GB machine, we reserve a safe 8GB baseline for the Linux host OS and map the remaining 120GB to the GPU:

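120 GiB x 1024 = 122,880 MiB
amdgpu.gttsize=120000

The exact conversion is 122,880 MiB; this guide rounds amdgpu.gttsize down to 120,000 MiB, which keeps the GTT size just under the TTM ceiling calculated below.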

To calculate the explicit page limits required by the Translation Table Manager:

120 GiB x 1024 x 1024 x 1024 = 128,849,018,880 bytes
128,849,018,880 bytes / 4096 bytes per page = 31,457,280 pages
ttm.pages_limit=31457280
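The same arithmetic can be wrapped in a small helper so you can try other budgets. This is a sketch with a hypothetical function name; it assumes 4 KiB pages (the x86-64 default) unless you pass a different page size. Note that it prints the exact 122,880 MiB for a 120 GiB budget, whereas this guide uses the slightly smaller round value 120000 for amdgpu.gttsize.

```shell
#!/usr/bin/env bash
# Convert a GPU memory budget in GiB into amdgpu.gttsize (MiB) and ttm.pages_limit (pages).
# Assumes 4 KiB pages unless a page size in bytes is passed as the second argument.
gtt_kernel_params() {
  local gib=$1 page_bytes=${2:-4096}
  local mib=$(( gib * 1024 ))
  local pages=$(( gib * 1024 * 1024 * 1024 / page_bytes ))
  echo "amdgpu.gttsize=${mib} ttm.pages_limit=${pages}"
}

gtt_kernel_params 120
```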

2. Injecting the Parameters

Depending on your Linux distribution, use one of the two methods below to add these parameters to the kernel command line:

For Fedora / RHEL-based Systems (using `grubby`):

sudo grubby --update-kernel=ALL --args='amd_iommu=off amdgpu.gttsize=120000 ttm.pages_limit=31457280'

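(Note: amd_iommu=off disables the AMD IOMMU and its address-translation overhead, which community reports suggest can lower memory latency for heavy matrix workloads. The trade-off is losing IOMMU-based DMA isolation and VFIO device passthrough.)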

For Ubuntu / Debian-based Systems (via standard GRUB configuration):

Open your primary configuration file:

sudo nano /etc/default/grub

Locate the line beginning with GRUB_CMDLINE_LINUX_DEFAULT and append the parameters inside the string quotes:

GRUB_CMDLINE_LINUX_DEFAULT="quiet splash amd_iommu=off ttm.pages_limit=31457280 amdgpu.gttsize=120000"

Save the file, regenerate the GRUB configuration, and reboot:

sudo update-grub
sudo reboot
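If you prefer not to edit /etc/default/grub by hand, a small filter can append the parameters and skip any that are already present, so re-running it is harmless. This is a hypothetical helper (add_kernel_params is not a standard tool); it assumes the GRUB_CMDLINE_LINUX_DEFAULT value is wrapped in double quotes, as on stock Ubuntu. Review the diff before copying the result into place.

```shell
#!/usr/bin/env bash
# Append kernel parameters to GRUB_CMDLINE_LINUX_DEFAULT, skipping any already present.
# Reads a GRUB defaults file on stdin and writes the edited file to stdout.
# Assumption: the value is enclosed in double quotes.
add_kernel_params() {
  local params=$1 line value p
  while IFS= read -r line || [[ -n $line ]]; do
    if [[ $line == GRUB_CMDLINE_LINUX_DEFAULT=\"*\" ]]; then
      value=${line#GRUB_CMDLINE_LINUX_DEFAULT=\"}   # strip the key and opening quote
      value=${value%\"}                             # strip the closing quote
      for p in $params; do
        # Only add a parameter if it is not already a whole word in the value.
        [[ " $value " == *" $p "* ]] || value="${value:+$value }$p"
      done
      line="GRUB_CMDLINE_LINUX_DEFAULT=\"$value\""
    fi
    printf '%s\n' "$line"
  done
}

# Usage (preview first, then install and regenerate GRUB):
#   add_kernel_params "amd_iommu=off ttm.pages_limit=31457280 amdgpu.gttsize=120000" \
#     < /etc/default/grub > /tmp/grub.new
#   diff /etc/default/grub /tmp/grub.new
#   sudo cp /tmp/grub.new /etc/default/grub && sudo update-grub
```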

3. Crucial BIOS Adjustment

As the system reboots, press the setup key (usually F2 or DEL) to enter your workstation's UEFI/BIOS configuration screen:

Navigate to the Advanced / Chipset configuration menu.
Locate the UMA Frame Buffer Size or GPU Memory Allocation profile.
Set this value to Auto or to its lowest available setting (512MB).

Why reduce the BIOS allocation? If you set it to a hard 96GB, that memory is reserved before the kernel even starts, leaving Linux only about 32GB of system RAM for as long as the machine runs. With the reservation at 512MB, the kernel boots with nearly all 128GB available and then hands up to 120GB to ROCm or Vulkan workloads on demand, within the GTT limits set by our kernel parameters.
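The arithmetic behind that trade-off can be sketched as follows, in MiB and ignoring firmware-reserved memory and kernel overhead. The function name is hypothetical. Keep in mind that GTT memory is borrowed from the same RAM, so whatever the GPU is actively using is not available to applications at that moment.

```shell
#!/usr/bin/env bash
# RAM visible to the OS = total RAM minus the static BIOS carve-out (simplified).
# Arguments: total RAM in GiB, BIOS carve-out in MiB.
os_visible_mib() {
  echo $(( $1 * 1024 - $2 ))
}

echo "96 GiB carve-out:  OS sees $(os_visible_mib 128 98304) MiB; GPU fixed at 98304 MiB"
echo "512 MiB carve-out: OS sees $(os_visible_mib 128 512) MiB; GPU borrows up to 120000 MiB via GTT"
```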

Verifying the Allocation

Once your desktop environment reloads, verify that the kernel picked up the new limits by running:

sudo dmesg | grep "amdgpu.*memory"

You should see lines confirming the allocation, similar to:

[drm] amdgpu: 512M of VRAM memory ready
[drm] amdgpu: 120000M of GTT memory ready
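As an alternative to scanning dmesg, the amdgpu driver exposes the same totals in sysfs as mem_info_vram_total and mem_info_gtt_total (values in bytes) under each card's device directory. A minimal reader might look like this; the function name is hypothetical, and the optional argument only exists so it can be pointed at a different root.

```shell
#!/usr/bin/env bash
# Print VRAM and GTT totals for each amdgpu device found under /sys/class/drm.
# Entries without the amdgpu mem_info files (e.g. display connectors) are skipped.
report_gpu_memory() {
  local root=${1:-/sys/class/drm} dev vram gtt
  for dev in "$root"/card*/device; do
    [[ -r $dev/mem_info_gtt_total && -r $dev/mem_info_vram_total ]] || continue
    vram=$(< "$dev/mem_info_vram_total")
    gtt=$(< "$dev/mem_info_gtt_total")
    printf '%s: VRAM %d MiB, GTT %d MiB\n' "$(basename "$(dirname "$dev")")" \
      $(( vram / 1048576 )) $(( gtt / 1048576 ))
  done
}

report_gpu_memory
```

On the configuration in this guide you would expect something like 512 MiB of VRAM and 120000 MiB of GTT for the Radeon 8060S.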

Next, confirm that your inference tooling sees the same capacity. With llama.cpp, for example, list the devices that llama-cli (or llama-server) detects on its ROCm or Vulkan backend:

./llama-cli --list-devices

The integrated Radeon GPU should now be listed with roughly 120,000 MiB of available memory. That is enough to fully offload large Mixture-of-Experts models, heavy image diffusion workloads, and long-context sessions without out-of-memory crashes or falling back to partial CPU offload.

Extra Troubleshooting & Kernel Tips

When deploying this setup, make sure you are running Linux kernel 6.16.9 or newer. Early 2026 distribution snapshots that still shipped 6.15-series kernels had an amdgpu driver bug that limited ROCm to seeing only 15.5GB of memory on Strix Halo, regardless of the parameters passed to it. Upgrading to a current kernel resolves this limitation.
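To check both prerequisites at once (a new enough kernel and the boot parameters from the earlier steps), a short script along these lines can help. The function names are hypothetical. The version comparison relies on GNU sort -V and ignores distribution suffixes, so a vendor kernel that backports the fix under an older version number may be flagged even though it works.

```shell
#!/usr/bin/env bash
# Succeeds if kernel version $2 (default: the running kernel) is >= $1.
# Drops everything after the first "-" (e.g. "-300.fc43.x86_64") before comparing.
kernel_at_least() {
  local required=$1 current=${2:-$(uname -r)}
  current=${current%%-*}
  [[ $(printf '%s\n' "$required" "$current" | sort -V | head -n1) == "$required" ]]
}

# Reports whether each expected parameter appears as a whole word on the kernel
# command line (default: /proc/cmdline). Returns 1 if any are missing.
check_cmdline() {
  local cmdline=${1:-$(< /proc/cmdline)} missing=0 param
  for param in amd_iommu=off amdgpu.gttsize=120000 ttm.pages_limit=31457280; do
    if [[ " $cmdline " == *" $param "* ]]; then
      echo "ok       $param"
    else
      echo "MISSING  $param"
      missing=1
    fi
  done
  return "$missing"
}

if kernel_at_least 6.16.9; then
  echo "kernel $(uname -r): OK"
else
  echo "kernel $(uname -r): older than 6.16.9, consider upgrading"
fi
check_cmdline || echo "Some parameters are missing; re-check the bootloader step."
```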