Oculink dual-GPU configuration refers to a setup where an external PCIe 4.0 x4 interface is used to add a second GPU.
NVIDIA drivers correctly recognize both GPUs, yet Ollama uses only the first one. This is common in Oculink expansion environments. In our test environment (RTX 5080 16GB + RTX 5060 Ti 16GB via Oculink DEG1 / i7-14700F / 96GB RAM), we reproduced the issue: ollama ps showed no sign of the second GPU, and inference ran on a single GPU without layer distribution. The behavior occurred with Ollama 0.22.1, NVIDIA driver 596.21, and Windows 10.
- Ollama may fail to automatically recognize Oculink expansion GPUs; explicit specification via CUDA_VISIBLE_DEVICES is required.
- Dual-GPU setups pool VRAM (16GB+16GB=32GB), which is what makes 70B-class models practical to run, but single-GPU inference remains faster for 14B-class models.
- Beyond Oculink PCIe 4.0 x4 bandwidth, power capacity and environment variable configuration are the practical bottlenecks.
- The “DUAL_GPU_NOT_CONFIGURED” Phenomenon: RTX 5060 Ti Invisible to Ollama
- Workaround: Verifying Three Environment Variables Step-by-Step
- Pros and Cons of Dual-GPU Setup: Results Reverse Based on Model Size
- VRAM Requirements and Recommended GPU Configurations by Model Size
- Is Oculink PCIe 4.0 x4 a Bottleneck? The Real One Revealed by Benchmarks
- Troubleshooting: Verification Order When Second GPU Is Not Recognized
- Summary
- References
The “DUAL_GPU_NOT_CONFIGURED” Phenomenon: RTX 5060 Ti Invisible to Ollama
Immediately after setting up the test environment, nvidia-smi recognized both GPUs, yet Ollama inference jobs utilized only the first one. Checking VRAM usage revealed that only the RTX 5080 was filled, while the RTX 5060 Ti remained idle at 0%. Layer distribution did not occur; when models could not fit into VRAM, they were offloaded to system RAM.
Test Environment and Reproduction Conditions
In our site’s test environment, the RTX 5080 connects via PCIe 5.0 x16 on the mainboard, while the RTX 5060 Ti is connected externally via Oculink DEG1 using PCIe 4.0 x4. The OS was Windows 10, running Ollama version 0.22.1 with NVIDIA driver 596.21. Even when varying model sizes within this same configuration, we could not induce behavior that utilized the second GPU under default settings.
Known Issues Revealed by Official Documentation and GitHub Issues
The Ollama official GPU documentation outlines procedures to explicitly recognize multiple GPUs using CUDA_VISIBLE_DEVICES and UUID specifications. However, in GitHub issue #13163, a known problem is reported where Blackwell-generation (Compute Capability 12.0) GPUs are not automatically recognized. We can conclude that this phenomenon is specific neither to Oculink nor solely the new generation; rather, it arises from the combination of new-gen GPUs and Ollama’s scheduler specifications. Compute Capabilities for each generation can be referenced in the official NVIDIA CUDA GPUs list.
Workaround: Verifying Three Environment Variables Step-by-Step
The workaround involves three steps. While specifying only CUDA_VISIBLE_DEVICES may suffice in minimal configurations, if Ollama’s scheduler does not select layer distribution, adding OLLAMA_SCHED_SPREAD becomes necessary.
CUDA_VISIBLE_DEVICES and UUID Specification (Official Recommendation)
The most reliable procedure is to obtain the UUIDs of both GPUs via nvidia-smi and launch Ollama in the format CUDA_VISIBLE_DEVICES=GPU-uuid1,GPU-uuid2. While device numbers (0, 1) can work, specifying by UUID offers greater stability as numbering may swap depending on PCIe connection states. To obtain a UUID, use nvidia-smi -L and copy the string starting with GPU- from the displayed list.
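As a minimal sketch of the UUID step, the following Python parses text in the shape printed by nvidia-smi -L and joins the UUIDs into a CUDA_VISIBLE_DEVICES value. The sample output and its UUID strings are placeholders invented for illustration, and the helper names uuids_from_nvidia_smi_list and cuda_visible_devices are hypothetical.

```python
import re

# Placeholder text in the shape printed by `nvidia-smi -L`; the UUIDs are made up.
SAMPLE_LIST_OUTPUT = """\
GPU 0: NVIDIA GeForce RTX 5080 (UUID: GPU-11111111-2222-3333-4444-555555555555)
GPU 1: NVIDIA GeForce RTX 5060 Ti (UUID: GPU-aaaaaaaa-bbbb-cccc-dddd-eeeeeeeeeeee)
"""

# Matches the "(UUID: GPU-...)" suffix on each line.
UUID_PATTERN = re.compile(r"\(UUID:\s*(GPU-[0-9a-fA-F-]+)\)")


def uuids_from_nvidia_smi_list(text: str) -> list[str]:
    """Return GPU UUIDs in the order they appear in `nvidia-smi -L` output."""
    return UUID_PATTERN.findall(text)


def cuda_visible_devices(uuids: list[str]) -> str:
    """Join UUIDs into the comma-separated form used for CUDA_VISIBLE_DEVICES."""
    return ",".join(uuids)


uuids = uuids_from_nvidia_smi_list(SAMPLE_LIST_OUTPUT)
print(cuda_visible_devices(uuids))
```

On a real machine the sample text could be replaced with the output of subprocess.run(["nvidia-smi", "-L"], capture_output=True, text=True, check=True).stdout.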
The Role of num_gpu_split / OLLAMA_SCHED_SPREAD
num_gpu_split is a parameter specifying how many segments to divide layers into during model loading. OLLAMA_SCHED_SPREAD=1 is an environment variable instructing the Ollama scheduler to “prioritize distribution across multiple GPUs.” The former controls behavior at the model level, while the latter operates at the job level. The syntax for num_gpu_split is defined in the PARAMETER section of the Ollama Modelfile reference; it can be written directly into a Modelfile and rebuilt, or specified via options in API requests.
Operability of Configuration Combinations
| Configuration | Second GPU Recognized? | Layer Distribution Occurs? |
|---|---|---|
| Default | No | No |
| CUDA_VISIBLE_DEVICES only | Yes | No |
| + OLLAMA_SCHED_SPREAD=1 | Yes | Yes |
| + num_gpu_split specified | Yes | Yes, with controllable ratios |
Only after proceeding to the third step (enabling OLLAMA_SCHED_SPREAD) did layers begin placing onto the RTX 5060 Ti. This aligns with the recommended flow in official documentation.
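The second and third rows of the table can be sketched as two process environments. This is an illustrative helper only: build_ollama_env is a hypothetical name, the UUIDs are placeholders, and the sketch assumes Ollama is started as a foreground process rather than as a service.

```python
import os


def build_ollama_env(gpu_uuids, spread=True, base_env=None):
    """Return a copy of the environment with the GPU-related variables set."""
    env = dict(os.environ if base_env is None else base_env)
    env["CUDA_VISIBLE_DEVICES"] = ",".join(gpu_uuids)
    if spread:
        env["OLLAMA_SCHED_SPREAD"] = "1"
    else:
        # Drop any inherited value so spreading is not enabled by accident.
        env.pop("OLLAMA_SCHED_SPREAD", None)
    return env


# Placeholder UUIDs; substitute the values reported by `nvidia-smi -L`.
PLACEHOLDER_UUIDS = [
    "GPU-11111111-2222-3333-4444-555555555555",
    "GPU-aaaaaaaa-bbbb-cccc-dddd-eeeeeeeeeeee",
]

# Row 2 of the table: devices listed, scheduler left at its default.
step2_env = build_ollama_env(PLACEHOLDER_UUIDS, spread=False, base_env={})
# Row 3 of the table: devices listed and spreading requested.
step3_env = build_ollama_env(PLACEHOLDER_UUIDS, spread=True, base_env={})

print(step2_env)
print(step3_env)
```

To try a configuration, the returned dict could be passed as env= to subprocess.Popen(["ollama", "serve"], ...) with base_env left at its default so PATH and other variables are inherited. Because both variables are read at startup, an already running Ollama instance has to be stopped first.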
Persisting Environment Variables and Scope of Application
CUDA_VISIBLE_DEVICES and OLLAMA_SCHED_SPREAD are evaluated at process startup. Therefore, if running Ollama as a service, updating user environment variables alone will not take effect. The official Ollama FAQ provides procedures for setting system-level environment variables via setx followed by a service restart on Windows, and adding Environment= to the unit file via systemctl edit ollama.service on Linux.
The workflow of stopping an Ollama instance registered as a Windows service → editing environment variables → restarting can be completed in three steps from an elevated PowerShell session: net stop ollama; setx OLLAMA_SCHED_SPREAD 1 /M; net start ollama. The /M flag writes the variable to the system-wide (machine) scope, which requires administrator rights; without it, setx writes only to the current user's environment, which a service running under a different account does not see.
If building a custom model embedding num_gpu_split via Modelfile, you must re-tag using ollama create and specify that tag during inference. For one-time application from the command line, passing { "num_gpu_split": [16, 16] } in API request options yields the same effect.
Pros and Cons of Dual-GPU Setup: Results Reverse Based on Model Size
The real-world verification revealed a structure where “VRAM expansion is beneficial, but performance gains are conditional.” This conclusion aligns with Ollama’s Multi-GPU design philosophy: “VRAM expansion first, performance second.”
Multi-GPU inference is fundamentally a VRAM extension feature, not a mechanism for doubling throughput. In layer-splitting inference, PCIe transfer overhead between devices occurs; in single-stream generation, this can negate or even outweigh the benefits gained from parallel computation. — Summary of Ollama GPU documentation intent.
70B Class: Only Operable via VRAM Expansion
Models with 70 billion parameters require approximately 40GB even in GGUF Q4 quantization. A single RTX 5080 (16GB VRAM) holds well under half of that, so adding a second GPU via Oculink to raise total VRAM to 16GB+16GB=32GB is what makes these models practical to run. Note that 32GB is still below the roughly 40GB of Q4 weights, so the remaining layers are offloaded to system RAM unless a smaller quantization is used. While tokens/sec may be lower than on a hypothetical single GPU with enough VRAM, the meaningful metric here becomes “operable vs. not operable.” Specific target models include Llama 3.1 70B and Qwen 2.5 72B (Mixtral 8x22B, at 141B total parameters, is larger still); VRAM requirements for each are listed on their respective pages in the Ollama official model library.
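The roughly 40GB figure can be sanity-checked with simple arithmetic. The sketch below assumes an average of about 4.5 bits per weight for a Q4-class GGUF file; that average is an assumption for illustration, and real files vary by quantization variant. The function name quantized_weight_gb is hypothetical.

```python
def quantized_weight_gb(params_billion: float, bits_per_weight: float) -> float:
    """Approximate weight size in GB (10^9 bytes) for a quantized model."""
    return params_billion * bits_per_weight / 8


# Assumption: about 4.5 bits per weight on average for a Q4-class GGUF file.
weights_gb = quantized_weight_gb(70, 4.5)
pooled_vram_gb = 16 + 16  # RTX 5080 + RTX 5060 Ti

print(f"70B at ~4.5 bits/weight: {weights_gb:.1f} GB")
print(f"Pooled VRAM: {pooled_vram_gb} GB")
print(f"Shortfall before KV cache: {weights_gb - pooled_vram_gb:.1f} GB")
```

Under this assumption the weights alone exceed the pooled 32GB, so some layers would still sit in system RAM; the second GPU reduces how many.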
14B Class: Single GPU is Faster
Conversely, 14B-class models (approx. 9–10GB in Q4) fit comfortably on a single RTX 5080. Dual-GPU configuration here adds PCIe x4 layer communication overhead, resulting in lower tokens/sec compared to single GPU operation. This is the zone where intuition that “using two GPUs = faster” reverses.
Newer models like those at the 128B class (e.g., Mistral Medium 3.5) are emerging, and use cases requiring VRAM equivalent of 32GB are certainly increasing. The demand to run 70B–128B-class models “at home” is one factor driving dual-GPU adoption. However, the trade-off remains: for tasks satisfiable by 14B-class models, a single GPU configuration remains more efficient.
VRAM Requirements and Recommended GPU Configurations by Model Size
Necessary VRAM is determined by LLM parameter counts and quantization levels. Understanding the threshold where dual-GPU setups become meaningful allows for quicker hardware investment decisions.
| Model Scale | Q4 Quantized VRAM | Q8 Quantized VRAM | Recommended GPU Configuration |
|---|---|---|---|
| 7B | Approx. 4–5GB | Approx. 7–8GB | Single GPU, 8GB or more |
| 13–14B | Approx. 8–10GB | Approx. 14–16GB | Single GPU, 16GB or more |
| 32–34B | Approx. 19–22GB | Approx. 35–38GB | Single 24GB OR Dual 16+16 |
| 70–72B | Approx. 40–44GB | Approx. 70–75GB | Dual 24+24; 16+16 with partial offload to system RAM |
| 120B+ | Approx. 70–80GB | Approx. 130GB+ | Multi-GPU Required |
The table values are approximate weights for Q4/Q8 quantized GGUF files. During actual loading, additional VRAM is required for KV cache depending on the specified num_ctx (context length); a design margin of 2–4GB beyond the table values is realistic. Extending context lengths to 32k or higher consumes several GBs in VRAM just from KV caches, meaning some cases will exceed the minimum lines listed in the recommendations.
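The KV cache overhead can be estimated from a model's attention shape. The sketch below uses the standard size formula (keys plus values, for every layer and every position) with a 16-bit cache. The layer counts, KV head counts, and head dimension are assumed values for the Llama 3.1 8B and 70B architectures, and the function name kv_cache_gib is hypothetical. Actual usage in Ollama may differ, for example when the cache is quantized.

```python
def kv_cache_gib(n_layers, n_kv_heads, head_dim, num_ctx, bytes_per_elem=2):
    """Estimate KV cache size in GiB: keys and values for every layer and position."""
    total_bytes = 2 * n_layers * n_kv_heads * head_dim * num_ctx * bytes_per_elem
    return total_bytes / 2**30


# Assumed architecture values (Llama 3.1 family, grouped-query attention).
llama_8b = dict(n_layers=32, n_kv_heads=8, head_dim=128)
llama_70b = dict(n_layers=80, n_kv_heads=8, head_dim=128)

for num_ctx in (8192, 32768):
    print(
        f"num_ctx={num_ctx}:",
        f"8B {kv_cache_gib(num_ctx=num_ctx, **llama_8b):.1f} GiB,",
        f"70B {kv_cache_gib(num_ctx=num_ctx, **llama_70b):.1f} GiB",
    )
```

The estimate grows linearly with num_ctx, which is why a long context can push a configuration past the minimum lines in the table.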
Is Oculink PCIe 4.0 x4 a Bottleneck? The Real One Revealed by Benchmarks
Oculink provides a PCIe 4.0 x4 link: approx. 64Gbps (roughly 8GB/s) in each direction. At first glance, this is an eighth of PCIe 5.0 x16 and looks like a bottleneck; however, when layers were actually distributed across GPUs, we observed no behavior indicating insufficient bandwidth.
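The bandwidth comparison follows from the per-lane signalling rates. This sketch uses 16 GT/s per lane for PCIe 4.0 and 32 GT/s for PCIe 5.0 with 128b/130b line encoding; it ignores packet-level protocol overhead, so real throughput is somewhat lower. The function name link_gbps is hypothetical.

```python
# Per-lane raw signalling rates in GT/s; PCIe 3.0 and later use 128b/130b encoding.
GT_PER_S = {"4.0": 16, "5.0": 32}
ENCODING_EFFICIENCY = 128 / 130


def link_gbps(gen: str, lanes: int) -> float:
    """Bandwidth per direction in Gbit/s after line encoding, before protocol overhead."""
    return GT_PER_S[gen] * lanes * ENCODING_EFFICIENCY


oculink = link_gbps("4.0", 4)
slot = link_gbps("5.0", 16)

print(f"PCIe 4.0 x4 : {oculink:.1f} Gbps = {oculink / 8:.2f} GB/s per direction")
print(f"PCIe 5.0 x16: {slot:.1f} Gbps = {slot / 8:.2f} GB/s per direction")
print(f"Ratio: 1/{slot / oculink:.0f}")
```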
Power Supply and Environment Variables Matter More Than Bandwidth
In multi-GPU inference communication patterns, only activation tensors between layers need to be passed. The transfer volume per token generation is in the megabyte range at most, which PCIe 4.0 x4 can easily handle.
Power supply handling varies significantly by configuration. Oculink docks (e.g., MINISFORUM DEG1) require a separate independent ATX power supply, completely isolated from the main PSU. In this test environment, the main 850W PSU handles the RTX 5080 (TDP 360W), i7-14700F, and peripherals; meanwhile, a separate 750W PSU powers only the RTX 5060 Ti (TDP 180W) on the Oculink side. Even during simultaneous full-load operation, each PSU independently bears its load, so no single unit carries the combined GPU total of 540W.
Conversely, in an integrated configuration installing both GPUs into two PCIe slots (without using Oculink), the combined TDP concentrates on one PSU; in that case 850W is tight, and a 1000W-class supply is preferable.
Estimated Power Capacity by Configuration
| Configuration | Total TDP | Recommended PSU | Notes |
|---|---|---|---|
| Single RTX 5080 + i7-14700F | Approx. 540W | 850W or more | 1000W acceptable for headroom |
| Internal Dual PCIe (RTX 5080 + 5060 Ti) | Approx. 720W | 1000W or more | Transient power handling recommended; 1200W preferred |
| Oculink Configuration (via DEG1) | Main: 540W / Dock: 180W | Main: 850W + Dock: 750W | Two independent systems, no concentrated load |
The official specifications on the NVIDIA GeForce RTX 50 Series page list TDPs of 360W for RTX 5080 and 180W for RTX 5060 Ti. Transient power spikes can reach 1.5–2 times the rated TDP, so PSUs should be selected with headroom beyond their ratings. A benefit of the dual Oculink configuration is risk isolation: if a power issue occurs on one side, it does not affect the main environment.
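The headroom reasoning can be worked through per PSU. The sketch below uses the TDP figures quoted in the article; the 180W for CPU and peripherals is an inference from the table (540W for RTX 5080 + i7-14700F minus 360W for the GPU), not a measured value. The 1.5x and 2x factors are the transient range quoted above, and the function name transient_peak_w is hypothetical.

```python
# TDP figures quoted in the article. CPU_AND_PERIPHERALS_W is inferred from the
# table (540W for RTX 5080 + CPU, minus 360W for the GPU) and is an assumption.
RTX_5080_W = 360
RTX_5060_TI_W = 180
CPU_AND_PERIPHERALS_W = 180


def transient_peak_w(gpu_watts, other_watts, factor):
    """Worst-case draw if every GPU on this PSU spikes to factor x TDP at once."""
    return gpu_watts * factor + other_watts


# (label, PSU rating, GPU watts on this PSU, non-GPU watts on this PSU)
configs = [
    ("Oculink main PSU", 850, RTX_5080_W, CPU_AND_PERIPHERALS_W),
    ("Oculink dock PSU", 750, RTX_5060_TI_W, 0),
    ("Internal dual, one PSU", 1000, RTX_5080_W + RTX_5060_TI_W, CPU_AND_PERIPHERALS_W),
]

for label, psu, gpu, other in configs:
    sustained = gpu + other
    low = transient_peak_w(gpu, other, 1.5)
    high = transient_peak_w(gpu, other, 2.0)
    print(f"{label}: sustained {sustained}W ({sustained / psu:.0%} of {psu}W), "
          f"transient {low:.0f}-{high:.0f}W")
```

Transient spikes are brief, so a peak figure above the label rating does not by itself mean the supply will shut down, but it does show why the integrated configuration leaves much less margin than the split Oculink setup.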
Software Configuration Often Becomes the Bottleneck in Multi-GPU Environments
The same structural issues occur with other frameworks. For example, vLLM has reported cases where TTFT and token generation speeds plummet when processing long-context inputs exceeding 64k tokens on multi-GPU setups; this is not resolved unless AITER Unified Attention’s backend is explicitly enabled via environment variables. Ollama’s OLLAMA_SCHED_SPREAD faces a structurally identical problem: the current reality is that software configuration, rather than hardware limitations, often constitutes the bottleneck.
Troubleshooting: Verification Order When Second GPU Is Not Recognized
If environment variables are set but the second GPU remains invisible or layer distribution fails to establish, isolating causes in the following order is efficient. Quickly determining whether the issue lies with Ollama settings or hardware recognition avoids unnecessary tweaking of environment variables.
Step 1: Verify GPU Recognition via nvidia-smi
First, check if both GPUs appear in a list with UUIDs using nvidia-smi -L. If only one appears, the cause lies with drivers, physical connections, or power supply to the Oculink dock—not an Ollama issue. In cases involving Oculink docks, forgetting to turn on the unit’s main power is surprisingly common. PCIe link status can be confirmed via nvidia-smi --query-gpu=pcie.link.gen.current,pcie.link.width.current --format=csv; if Gen4 x4 displays for an Oculink connection, the link is healthy.
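A small parser makes the link check repeatable. The sketch assumes the query is extended with the name field and run with --format=csv,noheader so each GPU appears on one line; the sample text is a placeholder, not captured output, and the function names are hypothetical.

```python
import csv
import io

# Placeholder text in the shape of:
#   nvidia-smi --query-gpu=name,pcie.link.gen.current,pcie.link.width.current --format=csv,noheader
SAMPLE_LINK_OUTPUT = """\
NVIDIA GeForce RTX 5080, 5, 16
NVIDIA GeForce RTX 5060 Ti, 4, 4
"""


def parse_pcie_links(text):
    """Return a list of (name, generation, width) tuples, one per GPU."""
    links = []
    for row in csv.reader(io.StringIO(text), skipinitialspace=True):
        if len(row) != 3:
            continue  # skip blank or malformed lines
        name, gen, width = row
        links.append((name.strip(), int(gen), int(width)))
    return links


def is_expected_oculink_link(gen, width):
    """True when the link matches the Gen4 x4 state expected for Oculink."""
    return gen == 4 and width == 4


for name, gen, width in parse_pcie_links(SAMPLE_LINK_OUTPUT):
    print(f"{name}: Gen{gen} x{width}, Oculink-style link: {is_expected_oculink_link(gen, width)}")
```

The reported generation can read lower while a GPU is idle because of link power management, so it may be worth repeating the query while a model is loading.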
Step 2: Obtain ollama ps and OLLAMA_DEBUG Logs
During model loading, run ollama ps in a separate terminal to check if GPU assignments appear under the PROCESSOR column. If only “100% GPU” or “partial offload” appears without multiple GPU names listed, distribution is not occurring. Restart Ollama with OLLAMA_DEBUG=1 set; startup scheduler logs will then sequentially output which GPUs were detected, available VRAM per unit, and how Compute Capabilities were evaluated. Debug procedures are documented in the log acquisition section of the Ollama FAQ.
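The same information is available programmatically from Ollama's running-models endpoint (GET /api/ps), which reports each loaded model's total size and the portion resident in VRAM. The sketch below parses a placeholder response; the model name and byte counts are made up, and the function name vram_fraction is hypothetical. It shows the GPU versus CPU split, not which GPU holds the layers.

```python
import json

# Placeholder response in the shape returned by GET /api/ps; values are made up.
SAMPLE_PS_RESPONSE = json.dumps({
    "models": [
        {
            "name": "example-model:latest",
            "size": 42_000_000_000,
            "size_vram": 30_000_000_000,
        }
    ]
})


def vram_fraction(ps_json: str) -> dict[str, float]:
    """Map each loaded model name to the fraction of its size resident in VRAM."""
    result = {}
    for model in json.loads(ps_json).get("models", []):
        size = model.get("size", 0)
        if size:
            result[model["name"]] = model.get("size_vram", 0) / size
    return result


for name, fraction in vram_fraction(SAMPLE_PS_RESPONSE).items():
    print(f"{name}: {fraction:.0%} in VRAM")
```

Against a live server, the JSON could be fetched with urllib.request.urlopen("http://localhost:11434/api/ps").read(), assuming Ollama is listening on its default port.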
Step 3: Verify Compute Capability Compatibility
The Blackwell generation (RTX 50 series) has a Compute Capability of 12.0, Ada (RTX 40) is 8.9, and Ampere (RTX 30) is 8.6. In mixed-generation setups, Ollama aligns to the feature set of the lower generation; consequently, tensor core optimizations specific to newer generations may not be fully utilized on those devices. After confirming generation differences, this serves as a basis for deciding whether to separate one unit onto another machine. Simultaneously verify compatibility between CUDA runtime and driver versions; outdated drivers can prevent Blackwell units from being recognized at all.
Step 4: Check VRAM Free Space and Interference by Other Processes
If other processes (browser GPU acceleration, resident Stable Diffusion instances, background game processes) are holding VRAM on the second GPU, Ollama may abandon layer placement. Verify VRAM usage and process lists via nvidia-smi, terminate unnecessary processes, then attempt reloading again. On Windows especially, DWM (Desktop Window Manager) consumes several hundred MBs of VRAM; for dual configurations, a stable operation involves disconnecting the second GPU from displays to use it purely as a compute unit.
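Per-GPU memory can be checked in one pass by parsing nvidia-smi's CSV query output. The sketch assumes the command nvidia-smi --query-gpu=index,name,memory.used,memory.total --format=csv,noheader,nounits; the sample numbers are placeholders, and the 1024 MiB threshold for calling a GPU idle is an arbitrary illustration. Function names are hypothetical.

```python
import csv
import io

# Placeholder text in the shape of:
#   nvidia-smi --query-gpu=index,name,memory.used,memory.total --format=csv,noheader,nounits
# Values are in MiB and are made up for illustration.
SAMPLE_MEMORY_OUTPUT = """\
0, NVIDIA GeForce RTX 5080, 15200, 16303
1, NVIDIA GeForce RTX 5060 Ti, 310, 16311
"""


def parse_gpu_memory(text):
    """Return one dict per GPU with used and free memory in MiB."""
    gpus = []
    for row in csv.reader(io.StringIO(text), skipinitialspace=True):
        if len(row) != 4:
            continue  # skip blank or malformed lines
        index, name, used, total = row
        gpus.append({
            "index": int(index),
            "name": name.strip(),
            "used_mib": int(used),
            "free_mib": int(total) - int(used),
        })
    return gpus


def idle_gpus(gpus, threshold_mib=1024):
    """Names of GPUs using less than threshold_mib, so likely holding no model layers."""
    return [g["name"] for g in gpus if g["used_mib"] < threshold_mib]


gpus = parse_gpu_memory(SAMPLE_MEMORY_OUTPUT)
print(idle_gpus(gpus))
```

Run before loading a model, the same parse shows how much VRAM other processes already hold on the second GPU; run during inference, a near-empty second GPU indicates that layers were not distributed.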
Summary
Ollama’s Oculink dual-GPU configuration operates on the premise that “it does not work automatically.” The mechanism requires explicitly specifying devices via CUDA_VISIBLE_DEVICES, forcing distribution with OLLAMA_SCHED_SPREAD, and adjusting ratios via num_gpu_split as needed. While 70B-class models benefit from VRAM expansion, a trade-off exists where single-GPU performance is faster for 14B-class models; practical usage involves determining suitability based on one’s specific workload.
Power supply requirements vary by configuration. When using Oculink (e.g., MINISFORUM DEG1), the dock requires an independent ATX power supply (750W in this test) separate from the main PSU (850W here); thus, two systems operate independently: one handling the RTX 5080 and CPU load, the other powering only the RTX 5060 Ti. Since both PSUs are isolated, combined TDP does not concentrate on a single unit. Conversely, an integrated configuration installing both GPUs into two PCIe slots requires one PSU to handle a combined GPU TDP of approx. 540W on top of the CPU and peripherals (approx. 720W in total); here, 850W is the minimum line while 1000W provides headroom. Real-world verification concludes that Oculink bandwidth rarely becomes a bottleneck compared to insufficient software settings.
References
- Ollama Official GPU Documentation (Multi-GPU & CUDA_VISIBLE_DEVICES Settings)
- Ollama FAQ (Environment Variable Persistence & Debug Log Acquisition)
- Ollama Modelfile Reference (PARAMETER for num_gpu_split, etc.)
- Ollama Official Model Library (VRAM Requirements per Model)
- NVIDIA CUDA GPUs List (Compute Capabilities)
- Official NVIDIA GeForce RTX 50 Series Specifications
| Hardware | |
|---|---|
| Test Environment GPU | RTX 5080 (VRAM 16GB) + RTX 5060 Ti (VRAM 16GB / via Oculink DEG1) |
| CPU / RAM | Intel Core i7-14700F / 96GB |
| Recommended PSU | 1000W or more (Minimum 850W) |
| Software | |
| Ollama | 0.22.1 |
| NVIDIA Driver | 596.21 |
| Mandatory Environment Variables | CUDA_VISIBLE_DEVICES, OLLAMA_SCHED_SPREAD |
| Measurement Conditions | |
| Date Measured | 2026-05-03 |

