Why Oculink GPU Fails in Ollama: Dual Setup & Scheduling

Oculink dual-GPU configuration refers to a setup where an external PCIe 4.0 x4 interface is used to add a second GPU.

NVIDIA drivers correctly recognize both GPUs, yet Ollama uses only the first one. This happens frequently in Oculink expansion environments. In our test environment (RTX 5080 16GB + RTX 5060 Ti 16GB via Oculink DEG1 / i7-14700F / 96GB RAM), we reproduced the issue: ollama ps showed no sign of the second GPU, and inference ran on a single GPU without layer distribution. The behavior was observed with Ollama 0.22.1, NVIDIA driver 596.21, and Windows 10, the latest combination at the time of testing.

Key Takeaways

  • Ollama may fail to automatically recognize Oculink expansion GPUs; explicit specification via CUDA_VISIBLE_DEVICES is required.
  • A dual-GPU setup extends usable VRAM to 16GB+16GB=32GB, which is what makes 70B-class models practical to run, but single-GPU inference remains faster for 14B-class models.
  • Beyond Oculink PCIe 4.0 x4 bandwidth, power capacity and environment variable configuration are the practical bottlenecks.

The “DUAL_GPU_NOT_CONFIGURED” Phenomenon: RTX 5060 Ti Invisible to Ollama

Immediately after setting up the test environment, nvidia-smi recognized both GPUs, yet Ollama inference jobs utilized only the first one. Checking VRAM usage revealed that only the RTX 5080 was filled, while the RTX 5060 Ti remained idle at 0%. Layer distribution did not occur; when models could not fit into VRAM, they were offloaded to system RAM.

Test Environment and Reproduction Conditions

In our site’s test environment, the RTX 5080 connects via PCIe 5.0 x16 on the mainboard, while the RTX 5060 Ti is connected externally via Oculink DEG1 using PCIe 4.0 x4. The OS was Windows 10, running Ollama version 0.22.1 with NVIDIA driver 596.21. Even when varying model sizes within this same configuration, we could not induce behavior that utilized the second GPU under default settings.

Known Issues Revealed by Official Documentation and GitHub Issues

The official Ollama GPU documentation describes how to select GPUs explicitly with CUDA_VISIBLE_DEVICES, including selection by UUID. Separately, GitHub issue #13163 reports a known problem in which Blackwell-generation (Compute Capability 12.0) GPUs are not automatically recognized. Our conclusion is that the phenomenon is not caused by Oculink alone or by the new GPU generation alone; it arises from the combination of new-generation GPUs and the behavior of Ollama’s scheduler. Compute Capabilities for each generation are listed in the official NVIDIA CUDA GPUs list.

Workaround: Verifying Three Environment Variables Step-by-Step

The workaround involves three steps. While specifying only CUDA_VISIBLE_DEVICES may suffice in minimal configurations, if Ollama’s scheduler does not select layer distribution, adding OLLAMA_SCHED_SPREAD becomes necessary.

CUDA_VISIBLE_DEVICES and UUID Specification (Official Recommendation)

The most reliable procedure is to obtain the UUIDs of both GPUs via nvidia-smi and launch Ollama in the format CUDA_VISIBLE_DEVICES=GPU-uuid1,GPU-uuid2. While device numbers (0, 1) can work, specifying by UUID offers greater stability as numbering may swap depending on PCIe connection states. To obtain a UUID, use nvidia-smi -L and copy the string starting with GPU- from the displayed list.
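As a minimal sketch of the UUID step, the snippet below turns text in the shape printed by nvidia-smi -L into a CUDA_VISIBLE_DEVICES value. The sample text and its UUIDs are made up for illustration, and the helper names are hypothetical; on a real machine you would feed in the actual command output.

```python
import re

# Sample text in the shape printed by `nvidia-smi -L`; these UUIDs are made up.
SAMPLE = """\
GPU 0: NVIDIA GeForce RTX 5080 (UUID: GPU-11111111-2222-3333-4444-555555555555)
GPU 1: NVIDIA GeForce RTX 5060 Ti (UUID: GPU-aaaaaaaa-bbbb-cccc-dddd-eeeeeeeeeeee)
"""

# One line per GPU: index, product name, and the UUID starting with "GPU-".
_LINE = re.compile(r"^GPU (\d+): (.+?) \(UUID: (GPU-[0-9a-fA-F-]+)\)\s*$")


def parse_gpu_list(text):
    """Return a list of (index, name, uuid) tuples, skipping unrelated lines."""
    gpus = []
    for line in text.splitlines():
        match = _LINE.match(line.strip())
        if match:
            gpus.append((int(match.group(1)), match.group(2), match.group(3)))
    return gpus


def cuda_visible_devices(text):
    """Join all UUIDs in listing order into a CUDA_VISIBLE_DEVICES value."""
    return ",".join(uuid for _, _, uuid in parse_gpu_list(text))


print("CUDA_VISIBLE_DEVICES=" + cuda_visible_devices(SAMPLE))
```

Building the value from UUIDs rather than indices keeps it stable if the index order changes between boots.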

The Role of num_gpu_split / OLLAMA_SCHED_SPREAD

num_gpu_split is a parameter specifying how many segments to divide layers into during model loading. OLLAMA_SCHED_SPREAD=1 is an environment variable instructing the Ollama scheduler to “prioritize distribution across multiple GPUs.” The former controls behavior at the model level, while the latter operates at the job level. The syntax for num_gpu_split is defined in the PARAMETER section of the Ollama Modelfile reference; it can be written directly into a Modelfile and rebuilt, or specified via options in API requests.

Operability of Configuration Combinations

Configuration | Second GPU Recognized? | Layer Distribution Occurs?
Default | No | No
CUDA_VISIBLE_DEVICES only | Yes | No
+ OLLAMA_SCHED_SPREAD=1 | Yes | Established
+ num_gpu_split specified | Yes | Ratios arbitrarily controllable

Layers began to be placed on the RTX 5060 Ti only after the third configuration (enabling OLLAMA_SCHED_SPREAD). This matches the flow recommended in the official documentation.

After modifying CUDA_VISIBLE_DEVICES, you must restart the Ollama daemon. Do not conclude that “the second GPU is invisible” before the new settings have actually taken effect. On Windows, restart the service; on Linux, use systemctl restart ollama.
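For quick experiments without touching the service configuration, one option is to start a fresh server process with the variables set explicitly. The sketch below assumes the ollama binary is on PATH and that no other Ollama instance is holding the port; the function names are hypothetical.

```python
import os
import subprocess


def build_ollama_env(gpu_ids, spread=True, base_env=None):
    """Copy an environment and set the two variables discussed in this article."""
    env = dict(os.environ if base_env is None else base_env)
    env["CUDA_VISIBLE_DEVICES"] = ",".join(gpu_ids)
    if spread:
        env["OLLAMA_SCHED_SPREAD"] = "1"
    else:
        # Make sure a value inherited from the parent environment does not leak in.
        env.pop("OLLAMA_SCHED_SPREAD", None)
    return env


def launch_ollama(env):
    """Start `ollama serve` as a child process; the variables are read at startup."""
    return subprocess.Popen(["ollama", "serve"], env=env)


# Example with placeholder UUIDs (replace with the values from nvidia-smi -L):
example = build_ollama_env(["GPU-uuid1", "GPU-uuid2"], base_env={})
print(example)
```

Because the variables are read only when the process starts, each configuration in the table above needs its own fresh launch.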

Persisting Environment Variables and Scope of Application

CUDA_VISIBLE_DEVICES and OLLAMA_SCHED_SPREAD are evaluated at process startup. Therefore, if running Ollama as a service, updating user environment variables alone will not take effect. The official Ollama FAQ provides procedures for setting system-level environment variables via setx followed by a service restart on Windows, and adding Environment= to the unit file via systemctl edit ollama.service on Linux.
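On Linux, the text that goes into the editor opened by systemctl edit ollama.service is a small [Service] section with one Environment= line per variable. The sketch below only renders that text; the UUID values are placeholders and the function name is hypothetical.

```python
def render_systemd_override(variables):
    """Render a systemd drop-in that sets environment variables for a service."""
    lines = ["[Service]"]
    for name, value in variables.items():
        # Quoting the whole assignment keeps values with commas or spaces intact.
        lines.append(f'Environment="{name}={value}"')
    return "\n".join(lines) + "\n"


override = render_systemd_override(
    {
        "CUDA_VISIBLE_DEVICES": "GPU-uuid1,GPU-uuid2",  # placeholder UUIDs
        "OLLAMA_SCHED_SPREAD": "1",
    }
)
print(override)
```

After saving the override, systemctl daemon-reload followed by systemctl restart ollama makes the service pick up the new values.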

For an Ollama instance registered as a Windows service, the stop → edit → restart workflow takes three commands from an elevated PowerShell: net stop ollama; setx OLLAMA_SCHED_SPREAD 1 /M; net start ollama. The /M flag writes the variable to the system-wide (machine) environment; without it, setx writes only to the current user’s environment, which a service running under a different account does not see.

If you build a custom model that embeds num_gpu_split via a Modelfile, you must create a new tag with ollama create and specify that tag at inference time. For a one-off request, passing {"num_gpu_split": [16, 16]} in the options field of an API request yields the same effect.
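A per-request version might look like the sketch below, which builds a body for the /api/generate endpoint on the default local port. The num_gpu_split key is used exactly as this article names it and has not been verified here against the Modelfile reference, so treat it as an assumption and check it for your Ollama version; the model tag and helper names are illustrative.

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/generate"  # default local endpoint


def build_generate_request(model, prompt, split):
    """Build a request body that carries per-request options."""
    return {
        "model": model,
        "prompt": prompt,
        "stream": False,
        # Key name taken from this article; verify it for your Ollama version.
        "options": {"num_gpu_split": list(split)},
    }


def send(payload, url=OLLAMA_URL):
    """POST the payload to a running Ollama server and return the parsed JSON."""
    request = urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request) as response:
        return json.loads(response.read().decode("utf-8"))


payload = build_generate_request("llama3.1:70b", "Hello", [16, 16])
print(json.dumps(payload, indent=2))
```

Calling send(payload) requires a running server; the snippet itself only prints the request body.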

Pros and Cons of Dual-GPU Setup: Results Reverse Based on Model Size

Our testing showed that “VRAM expansion is beneficial, but performance gains are conditional.” This conclusion aligns with Ollama’s Multi-GPU design philosophy: “VRAM expansion first, performance second.”

Multi-GPU inference is fundamentally a VRAM extension feature, not a mechanism for doubling throughput. In layer-splitting inference, PCIe transfer overhead between devices occurs; in single-stream generation, this can negate or even outweigh the benefits gained from parallel computation. — Summary of Ollama GPU documentation intent.

70B Class: Only Operable via VRAM Expansion

Models with 70 billion parameters require approximately 40GB even with GGUF Q4 quantization. A single RTX 5080 (16GB VRAM) cannot hold them, so adding a second GPU via Oculink to raise total capacity to 16GB+16GB=32GB is what makes operation practical. Because 32GB is still below roughly 40GB, the layers that do not fit are offloaded to system RAM. Tokens/sec may be lower than a hypothetical single GPU with enough VRAM would deliver, but the meaningful metric here is “operable vs. not operable.” Target models in this class include Llama 3.1 70B and Qwen 2.5 72B; Mixtral 8x22B, at roughly 141B total parameters, needs considerably more memory. VRAM requirements for each are listed on their respective pages in the Ollama official model library.

14B Class: Single GPU is Faster

Conversely, 14B-class models (approx. 9–10GB in Q4) fit comfortably on a single RTX 5080. Dual-GPU configuration here adds PCIe x4 layer communication overhead, resulting in lower tokens/sec compared to single GPU operation. This is the zone where intuition that “using two GPUs = faster” reverses.

Newer models like those at the 128B class (e.g., Mistral Medium 3.5) are emerging, and use cases requiring VRAM equivalent of 32GB are certainly increasing. The demand to run 70B–128B-class models “at home” is one factor driving dual-GPU adoption. However, the trade-off remains: for tasks satisfiable by 14B-class models, a single GPU configuration remains more efficient.

VRAM Requirements and Recommended GPU Configurations by Model Size

Necessary VRAM is determined by LLM parameter counts and quantization levels. Understanding the threshold where dual-GPU setups become meaningful allows for quicker hardware investment decisions.

Model Scale | Q4 Quantized VRAM | Q8 Quantized VRAM | Recommended GPU Configuration
7B | Approx. 4–5GB | Approx. 7–8GB | Single GPU, 8GB or more
13–14B | Approx. 8–10GB | Approx. 14–16GB | Single GPU, 16GB or more
32–34B | Approx. 19–22GB | Approx. 35–38GB | Single 24GB OR Dual 16+16
70–72B | Approx. 40–44GB | Approx. 70–75GB | Dual 24+24 or 16+16
120B+ | Approx. 70–80GB | Approx. 130GB+ | Multi-GPU Required

The table values are approximate sizes of the Q4/Q8 quantized GGUF weight files. Actual loading also requires VRAM for the KV cache, which grows with the specified num_ctx (context length); a design margin of 2–4GB beyond the table values is realistic. Extending the context length to 32k or more consumes several GB of VRAM for the KV cache alone, so some cases will exceed the minimums listed in the recommendations.
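The KV cache growth can be estimated with the usual formula for transformer attention caches: two tensors (K and V) per layer, each of size KV heads × head dimension × context length. The sketch below assumes a 16-bit cache and uses commonly cited architecture figures for Llama 3.1 8B and 70B as illustrative inputs; a quantized KV cache or a different architecture changes the result.

```python
GIB = 1024 ** 3


def kv_cache_bytes(n_layers, n_kv_heads, head_dim, num_ctx, bytes_per_elem=2):
    """Estimate KV cache size: one K and one V tensor per layer."""
    return 2 * n_layers * n_kv_heads * head_dim * num_ctx * bytes_per_elem


def total_vram_gib(weights_gib, kv_bytes, overhead_gib=1.0):
    """Weights plus KV cache plus a flat allowance for buffers (an assumption)."""
    return weights_gib + kv_bytes / GIB + overhead_gib


# Illustrative inputs (assumed): 8B = 32 layers, 70B = 80 layers,
# both with 8 KV heads of dimension 128, at a 32k context.
kv_8b = kv_cache_bytes(n_layers=32, n_kv_heads=8, head_dim=128, num_ctx=32768)
kv_70b = kv_cache_bytes(n_layers=80, n_kv_heads=8, head_dim=128, num_ctx=32768)

print(f"8B  KV cache at 32k: {kv_8b / GIB:.1f} GiB")
print(f"70B KV cache at 32k: {kv_70b / GIB:.1f} GiB")
print(f"70B Q4 total estimate: {total_vram_gib(40.0, kv_70b):.1f} GiB")
```

Halving num_ctx halves the cache, which is often the cheapest way to bring a borderline model back under the VRAM limit.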

Oculink bandwidth is PCIe 4.0 x4, approx. 64Gbps (about 8GB/s) in each direction. At first glance, this is an eighth of PCIe 5.0 x16 and looks like a bottleneck; however, when layers were actually distributed across the GPUs, we observed no behavior indicating insufficient bandwidth.
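The link arithmetic can be checked with a few lines. The sketch below uses the per-lane transfer rates and 128b/130b line encoding of PCIe 3.0 through 5.0 and reports theoretical per-direction throughput; real-world throughput is lower because of protocol overhead, which this sketch ignores.

```python
# Per-lane raw rate in GT/s for each PCIe generation that uses 128b/130b encoding.
GT_PER_LANE = {3: 8.0, 4: 16.0, 5: 32.0}
ENCODING_EFFICIENCY = 128 / 130


def pcie_gbytes_per_s(gen, lanes):
    """Theoretical per-direction throughput in GB/s (decimal gigabytes)."""
    return GT_PER_LANE[gen] * lanes * ENCODING_EFFICIENCY / 8


def transfer_ms(n_bytes, gen, lanes):
    """Time in milliseconds to move n_bytes at the theoretical rate."""
    return n_bytes / (pcie_gbytes_per_s(gen, lanes) * 1e9) * 1e3


oculink = pcie_gbytes_per_s(4, 4)
slot = pcie_gbytes_per_s(5, 16)
print(f"PCIe 4.0 x4 : {oculink:.2f} GB/s per direction")
print(f"PCIe 5.0 x16: {slot:.2f} GB/s per direction ({slot / oculink:.0f}x)")
print(f"1 MiB over PCIe 4.0 x4: {transfer_ms(1024 * 1024, 4, 4):.3f} ms")
```

Even a full mebibyte per token costs well under a millisecond on the x4 link, which is consistent with bandwidth not being the limiting factor for layer-split generation.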

Power Supply and Environment Variables Matter More Than Bandwidth

In multi-GPU inference with layer splitting, only the activation tensors at the boundary between layers need to cross the link. The transfer volume per generated token is at most in the megabyte range, which PCIe 4.0 x4 handles easily.

Power supply handling, by contrast, varies significantly by configuration. Oculink docks (e.g., MINISFORUM DEG1) require a separate ATX power supply, completely isolated from the main PSU. In this test environment, the main 850W PSU powers the RTX 5080 (TDP 360W), the i7-14700F, and peripherals, while a separate 750W PSU powers only the RTX 5060 Ti (TDP 180W) on the Oculink side. Even under simultaneous full load, each PSU bears only its own load, so no single unit carries the combined GPU total of 540W. Conversely, in an integrated configuration with both GPUs installed in two PCIe slots (without Oculink), the combined TDP concentrates on one PSU; 850W is tight there, and a 1000W-class supply is preferable.

Estimated Power Capacity by Configuration

Configuration | Total TDP | Recommended PSU | Notes
Single RTX 5080 + i7-14700F | Approx. 540W | 850W or more | 1000W acceptable for headroom
Internal Dual PCIe (RTX 5080 + 5060 Ti) | Approx. 720W | 1000W or more | Transient power handling recommended; 1200W preferred
Oculink Configuration (via DEG1) | Main: 540W / Dock: 180W | Main: 850W + Dock: 750W | Two independent systems, no concentrated load

The official specifications on the NVIDIA GeForce RTX 50 Series page list TDPs of 360W for RTX 5080 and 180W for RTX 5060 Ti. Transient power spikes can reach 1.5–2 times the rated TDP, so PSUs should be selected with headroom beyond their ratings. A benefit of the dual Oculink configuration is risk isolation: if a power issue occurs on one side, it does not affect the main environment.
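The headroom arithmetic behind the table can be sketched as follows. The 180W figure for CPU and peripherals is an assumption derived from the article's 540W main-system total minus the 360W GPU TDP, and the transient factor is applied to GPUs only; both are simplifications.

```python
def sustained_load_w(components):
    """Sum of rated power for everything attached to one PSU."""
    return sum(components.values())


def transient_peak_w(components, gpu_names, factor):
    """Scale only the GPUs by a transient factor (1.5 to 2.0 per the article)."""
    return sum(
        watts * factor if name in gpu_names else watts
        for name, watts in components.items()
    )


def headroom_w(psu_w, load_w):
    """Remaining PSU capacity; a negative value means the estimate exceeds the rating."""
    return psu_w - load_w


MAIN = {"RTX 5080": 360, "CPU and peripherals": 180}  # 180W is derived, see above
DOCK = {"RTX 5060 Ti": 180}
INTERNAL = {**MAIN, **DOCK}  # both GPUs on one PSU

for label, parts, psu in [("Main", MAIN, 850), ("Dock", DOCK, 750), ("Internal dual", INTERNAL, 1000)]:
    load = sustained_load_w(parts)
    peak = transient_peak_w(parts, {"RTX 5080", "RTX 5060 Ti"}, 1.5)
    print(f"{label}: sustained {load}W, 1.5x GPU transient {peak:.0f}W, headroom {headroom_w(psu, peak):.0f}W")
```

Raising the factor toward 2.0 shrinks the margin quickly on whichever PSU carries the RTX 5080, which is the reason to size PSUs above the sustained totals.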

Software Configuration Often Becomes the Bottleneck in Multi-GPU Environments

The same structural issues occur with other frameworks. For example, vLLM has reported cases where TTFT and token generation speeds plummet when processing long-context inputs exceeding 64k tokens on multi-GPU setups; this is not resolved unless AITER Unified Attention’s backend is explicitly enabled via environment variables. Ollama’s OLLAMA_SCHED_SPREAD faces a structurally identical problem: the current reality is that software configuration, rather than hardware limitations, often constitutes the bottleneck.

Troubleshooting: Verification Order When Second GPU Is Not Recognized

If environment variables are set but the second GPU remains invisible or layer distribution fails to establish, isolating causes in the following order is efficient. Quickly determining whether the issue lies with Ollama settings or hardware recognition avoids unnecessary tweaking of environment variables.

Step 1: Verify GPU Recognition via nvidia-smi

First, check if both GPUs appear in a list with UUIDs using nvidia-smi -L. If only one appears, the cause lies with drivers, physical connections, or power supply to the Oculink dock—not an Ollama issue. In cases involving Oculink docks, forgetting to turn on the unit’s main power is surprisingly common. PCIe link status can be confirmed via nvidia-smi --query-gpu=pcie.link.gen.current,pcie.link.width.current --format=csv; if Gen4 x4 displays for an Oculink connection, the link is healthy.
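The link check can be automated by parsing the CSV that the query above prints. The sketch below works on sample text in that shape (a header row followed by one row per GPU in index order); the sample values are illustrative. One caveat worth keeping in mind is that an idle GPU may report a lower current link generation because of power saving, so it is safer to check while a model is loaded.

```python
import csv
import io

# Sample in the shape of:
#   nvidia-smi --query-gpu=pcie.link.gen.current,pcie.link.width.current --format=csv
SAMPLE = """\
pcie.link.gen.current, pcie.link.width.current
5, 16
4, 4
"""


def parse_link_status(csv_text):
    """Return a list of (generation, width) tuples, one per GPU, skipping the header."""
    rows = [row for row in csv.reader(io.StringIO(csv_text), skipinitialspace=True) if row]
    return [(int(gen), int(width)) for gen, width in rows[1:]]


def link_matches(link, expected_gen, expected_width):
    """True when the negotiated link equals the expected generation and width."""
    return link == (expected_gen, expected_width)


links = parse_link_status(SAMPLE)
for index, (gen, width) in enumerate(links):
    print(f"GPU {index}: Gen{gen} x{width}")
print("Oculink link healthy:", link_matches(links[1], 4, 4))
```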

Step 2: Obtain ollama ps and OLLAMA_DEBUG Logs

During model loading, run ollama ps in a separate terminal to check if GPU assignments appear under the PROCESSOR column. If only “100% GPU” or “partial offload” appears without multiple GPU names listed, distribution is not occurring. Restart Ollama with OLLAMA_DEBUG=1 set; startup scheduler logs will then sequentially output which GPUs were detected, available VRAM per unit, and how Compute Capabilities were evaluated. Debug procedures are documented in the log acquisition section of the Ollama FAQ.

Step 3: Verify Compute Capability Compatibility

The Blackwell generation (RTX 50 series) has a Compute Capability of 12.0, Ada (RTX 40) is 8.9, and Ampere (RTX 30) is 8.6. In mixed-generation setups, Ollama aligns to the feature set of the lower generation; consequently, tensor core optimizations specific to newer generations may not be fully utilized on those devices. After confirming generation differences, this serves as a basis for deciding whether to separate one unit onto another machine. Simultaneously verify compatibility between CUDA runtime and driver versions; outdated drivers can prevent Blackwell units from being recognized at all.
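A mixed-generation check can be scripted from the compute capability each GPU reports. The sketch below assumes input in the shape of nvidia-smi --query-gpu=name,compute_cap --format=csv,noheader (the compute_cap field is available on recent drivers) and uses only the three generation values quoted above; the sample text is illustrative.

```python
# Compute Capability to generation, limited to the values quoted in this article.
GENERATIONS = {"12.0": "Blackwell", "8.9": "Ada", "8.6": "Ampere"}

SAMPLE = """\
NVIDIA GeForce RTX 5080, 12.0
NVIDIA GeForce RTX 5060 Ti, 12.0
"""


def parse_compute_caps(text):
    """Return a list of (name, compute_capability) pairs from 'name, cap' lines."""
    result = []
    for line in text.splitlines():
        if not line.strip():
            continue
        name, cap = line.rsplit(",", 1)  # split on the last comma only
        result.append((name.strip(), cap.strip()))
    return result


def is_mixed_generation(caps):
    """True when the GPUs report more than one distinct compute capability."""
    return len({cap for _, cap in caps}) > 1


caps = parse_compute_caps(SAMPLE)
for name, cap in caps:
    print(f"{name}: CC {cap} ({GENERATIONS.get(cap, 'unknown')})")
print("Mixed generations:", is_mixed_generation(caps))
```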

Step 4: Check VRAM Free Space and Interference by Other Processes

If other processes (browser GPU acceleration, resident Stable Diffusion instances, background game processes) are holding VRAM on the second GPU, Ollama may abandon layer placement. Verify VRAM usage and process lists via nvidia-smi, terminate unnecessary processes, then attempt reloading again. On Windows especially, DWM (Desktop Window Manager) consumes several hundred MB of VRAM; in dual configurations, operation is most stable when no display is connected to the second GPU, so that it serves purely as a compute unit.
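Per-GPU free VRAM can be checked programmatically before loading a model. The sketch below parses text in the shape of nvidia-smi --query-gpu=index,name,memory.used,memory.total --format=csv,noheader,nounits, where memory values are in MiB; the sample numbers are made up and the helper names are hypothetical.

```python
import csv
import io

# Made-up sample: index, name, memory.used (MiB), memory.total (MiB)
SAMPLE = """\
0, NVIDIA GeForce RTX 5080, 14210, 16303
1, NVIDIA GeForce RTX 5060 Ti, 312, 16311
"""


def parse_vram(csv_text):
    """Return one dict per GPU with used and total memory in MiB."""
    gpus = []
    for row in csv.reader(io.StringIO(csv_text), skipinitialspace=True):
        if not row:
            continue
        index, name, used, total = row
        gpus.append(
            {"index": int(index), "name": name, "used_mib": int(used), "total_mib": int(total)}
        )
    return gpus


def free_mib(gpu):
    """Free memory on one GPU in MiB."""
    return gpu["total_mib"] - gpu["used_mib"]


def gpus_short_of(gpus, required_mib):
    """Indices of GPUs whose free memory is below the required amount."""
    return [gpu["index"] for gpu in gpus if free_mib(gpu) < required_mib]


gpus = parse_vram(SAMPLE)
for gpu in gpus:
    print(f"GPU {gpu['index']} ({gpu['name']}): {free_mib(gpu)} MiB free")
print("GPUs with less than 8000 MiB free:", gpus_short_of(gpus, 8000))
```

In the sample, the first GPU is nearly full while the second sits almost empty, which is the same pattern described earlier for the case where layers are not being distributed.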

Summary

Ollama’s Oculink dual-GPU configuration operates on the premise that “it does not work automatically.” The mechanism requires explicitly specifying devices via CUDA_VISIBLE_DEVICES, forcing distribution with OLLAMA_SCHED_SPREAD, and adjusting ratios via num_gpu_split as needed. While 70B-class models benefit from VRAM expansion, a trade-off exists where single-GPU performance is faster for 14B-class models; practical usage involves determining suitability based on one’s specific workload.

Power supply requirements vary by configuration. When using Oculink (e.g., MINISFORUM DEG1), the dock requires its own ATX power supply (750W in this test) separate from the main PSU (850W here), so two power systems operate independently: one carries the RTX 5080 and CPU load, the other only the RTX 5060 Ti. Since the PSUs are isolated, the combined TDP does not concentrate on a single unit. Conversely, an integrated configuration with both GPUs in two PCIe slots requires one PSU to carry a combined GPU TDP of approx. 540W (approx. 720W including the CPU); here, 850W is the minimum and 1000W provides headroom. Our verification concludes that Oculink bandwidth is rarely the bottleneck; insufficient software configuration is far more often the cause.

References

Hardware
Test Environment GPU | RTX 5080 (VRAM 16GB) + RTX 5060 Ti (VRAM 16GB / via Oculink DEG1)
CPU / RAM | Intel Core i7-14700F / 96GB
Recommended PSU | 1000W or more (Minimum 850W)

Software
Ollama | 0.22.1
NVIDIA Driver | 596.21
Mandatory Environment Variables | CUDA_VISIBLE_DEVICES, OLLAMA_SCHED_SPREAD

Measurement Conditions
Date Measured | 2026-05-03


Measurement Time: 2026-05-02 / Measurements on this page reflect conditions as of that date. Evaluations may change due to product updates or third-party benchmark publications. Re-evaluation is recommended for content older than 30 days.
