ComfyUI Multi-GPU Guide: RTX 5080 + 4070 Super Real-World Tests


I want to run image generation on the RTX 5080 while sending a second generation task to another GPU at the same time. If you use ComfyUI daily, you will likely run into this scenario. My main PC has an RTX 5080 (16GB), and I added an RTX 4070 Super (12GB) over an Oculink connection through the MINISFORUM DEG1 eGPU dock, giving me a dual-GPU configuration. With ComfyUI on dual GPUs, the gap between what is possible and what is not is large. Based on actual measurements, this article sorts out which use cases are practical and which are not.

Key Points of This Article

  • Dual-GPU operation for ComfyUI via port separation (parallel processing) is the most practical approach. Launch two separate instances on different ports to run them independently.
  • VRAM sharing between GPUs is impossible in ComfyUI. 16GB + 12GB does not equal a combined 28GB pool.
  • In some LLM tools like Ollama, pseudo-integration of VRAM via pipeline parallelism is possible. Whether dual GPUs are worth it depends on the specific use case.
The information in this article reflects the status as of 2026-04-10 and may be outdated.

Parallel Processing via Port Separation ― The Most Practical Dual-GPU Operation

If you are using ComfyUI with dual GPUs, the first approach to consider is port separation. It is simple: launch one instance of ComfyUI for each GPU and assign a separate port to each. For example, assign port 8188 to the main GPU and port 8189 to the sub-GPU. By opening two tabs in your browser, they will operate as independent generation environments.

Key Configuration Points at Startup

For the main GPU (RTX 5080), the key startup options are: --port 8188 --cuda-device 1 --normalvram --reserve-vram 1.5 --bf16-unet --bf16-text-enc. For the sub-GPU (RTX 4070 Super), first set the environment variable with set CUDA_VISIBLE_DEVICES=0, then specify: --port 8189 --cuda-device 0 --normalvram --reserve-vram 1.0 --bf16-unet --bf16-text-enc.

| Item | Main GPU (RTX 5080) | Sub-GPU (RTX 4070 Super) |
| --- | --- | --- |
| Environment Variables | None required | set CUDA_VISIBLE_DEVICES=0 |
| Port | --port 8188 | --port 8189 |
| --cuda-device | 1 | 0 |
| VRAM Mode | --normalvram | --normalvram |
| Reserved VRAM | --reserve-vram 1.5 | --reserve-vram 1.0 |
| Precision Options | --bf16-unet --bf16-text-enc | --bf16-unet --bf16-text-enc |

Detailed specifications for startup options can be verified in the argument definitions of main.py from the ComfyUI Official Repository. The flags --bf16-unet and --bf16-text-enc set weights to BF16 precision while leveraging Tensor Cores, ensuring stable operation on architectures from Ampere onwards (RTX 30/40/50 series). The scope of BF16 support for Tensor Cores is documented in the NVIDIA Mixed Precision Training Documentation.

Be careful with the combination of numbers used for CUDA_VISIBLE_DEVICES and --cuda-device. If you restrict visible GPUs via CUDA_VISIBLE_DEVICES, the number specified by --cuda-device refers to the index within those “visible” GPUs. If errors occur due to a mismatch in numbering, check this correspondence first.
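The two launch configurations above can be sketched as data, so that the flags for each instance are built in one place. This is a minimal sketch, not my actual launcher: the helper name build_launch is hypothetical, and the "python main.py" entry point is an assumption (the portable Windows build uses its own embedded interpreter path, so adjust accordingly). The script only builds the commands and does not start anything.

```python
import os

# Flags shared by both instances, copied from the configuration above.
COMMON_FLAGS = ["--normalvram", "--bf16-unet", "--bf16-text-enc"]


def build_launch(port, cuda_device, reserve_vram_gb, visible_devices=None):
    """Return (argv, env) for one ComfyUI instance without starting it."""
    argv = [
        "python", "main.py",  # assumed entry point; adjust to your install
        "--port", str(port),
        "--cuda-device", str(cuda_device),
        "--reserve-vram", str(reserve_vram_gb),
        *COMMON_FLAGS,
    ]
    env = dict(os.environ)  # copy, so the parent environment is untouched
    if visible_devices is not None:
        env["CUDA_VISIBLE_DEVICES"] = visible_devices
    return argv, env


main_argv, main_env = build_launch(8188, 1, 1.5)
sub_argv, sub_env = build_launch(8189, 0, 1.0, visible_devices="0")

print(" ".join(main_argv))
print(" ".join(sub_argv))
```

To actually start an instance, each pair could be passed to subprocess.Popen(argv, env=env, cwd=...) with cwd pointing at the ComfyUI folder.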

Selecting Which GPU to Use ― Correspondence Between CUDA_VISIBLE_DEVICES and --cuda-device

The CUDA_VISIBLE_DEVICES environment variable is a mechanism that filters which GPUs are “visible” from the CUDA runtime. When specifying set CUDA_VISIBLE_DEVICES=0, only the first physical GPU is visible; thus, ComfyUI’s --cuda-device 0 refers to this “visible GPU index.” Confusion often arises because physical device numbers and logical device numbers become separated.

To control the enumeration order of GPUs, use the CUDA_DEVICE_ORDER environment variable. The default is FASTEST_FIRST (performance order), so in a dual-GPU configuration the order can differ from the physical slot order. To fix it to PCI bus order, set CUDA_DEVICE_ORDER=PCI_BUS_ID. This variable is documented in the CUDA Environment Variables section of the NVIDIA CUDA C++ Programming Guide.
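The split between logical and physical device numbers can be illustrated with a small lookup. This is a sketch of the general CUDA renumbering rule only; the function name logical_to_physical is hypothetical, it assumes the variable holds comma-separated integer indices (GPU UUIDs are also legal but not handled), and the "physical" numbers themselves still depend on CUDA_DEVICE_ORDER.

```python
def logical_to_physical(cuda_visible_devices):
    """Map each logical CUDA index to the physical index it points at.

    The process always sees its visible GPUs renumbered from 0, in the
    order they are listed in CUDA_VISIBLE_DEVICES.
    """
    entries = [e.strip() for e in cuda_visible_devices.split(",") if e.strip()]
    return {logical: int(physical) for logical, physical in enumerate(entries)}


print(logical_to_physical("0"))    # {0: 0}
print(logical_to_physical("1,0"))  # {0: 1, 1: 0}
```

With "1,0", logical device 0 is physical GPU 1. This is the kind of swap that causes confusion when an index is copied from nvidia-smi into a launch script.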

The biggest advantage of this method is that generation on the sub-GPU does not interrupt work on the main GPU. GPUs with different speeds are not a problem either, because neither instance waits for the other. System RAM usage roughly doubles because two instances are running, but with 32GB or more of RAM this should not be a practical obstacle.

Benchmark Data: Speed During Parallel Execution

The results below show the outcome when generation tasks were simultaneously submitted to both GPUs on our test bench (i7-14700F / 96GB RAM).

| GPU | Connection Method | Sampling Speed | Steps |
| --- | --- | --- | --- |
| RTX 5080 (16GB) | PCIe x16 | 3.83 s/it | 70 steps |
| RTX 4070 Super (12GB) | Oculink (PCIe x4) | 5.71 s/it | 75 steps |

A key point is that the two GPUs showed no sign of slowing each other down during parallel execution. Although bandwidth differs between PCIe x16 and Oculink (PCIe x4), most image generation processing completes within the GPU’s VRAM, so contention for bus bandwidth rarely occurs. The RTX 4070 Super via Oculink achieved speed comparable to standalone operation.
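To put the s/it figures in terms of wall-clock time, the measured values can be multiplied out. The sketch below uses only the numbers from the benchmark table; it covers sampling time only and ignores model loading and VAE decoding.

```python
# Measured values from the parallel benchmark.
runs = {
    "RTX 5080": {"sec_per_it": 3.83, "steps": 70},
    "RTX 4070 Super": {"sec_per_it": 5.71, "steps": 75},
}


def sampling_seconds(sec_per_it, steps):
    """Total time spent in the sampler for one generation."""
    return sec_per_it * steps


totals = {gpu: sampling_seconds(**run) for gpu, run in runs.items()}
for gpu, seconds in totals.items():
    print(f"{gpu}: {seconds:.1f} s of sampling")

# Per-step speed ratio between the two cards.
speed_ratio = runs["RTX 4070 Super"]["sec_per_it"] / runs["RTX 5080"]["sec_per_it"]
print(f"per-step ratio: {speed_ratio:.2f}x")
```

The per-step ratio comes out at roughly 1.5x, which is the speed gap the two cards show in the rest of this article.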

CMD output during parallel execution. Top is RTX 5080 (3.83s/it), bottom is RTX 4070 Super (5.71s/it). No bandwidth interference.

Simultaneous Processing via Multi-GPU Nodes ― An Advanced Option

ComfyUI has several custom nodes available that allow utilizing multiple GPUs within a single workflow. However, they cannot currently be described as “universal speed-up tools.”

Overview of Major Custom Nodes

| Node Name | Mechanism | Speed Improvement | VRAM Efficiency | Support for GPUs with Speed Differences |
| --- | --- | --- | --- | --- |
| ComfyUI-MultiGPU | Distributes UNet/CLIP/VAE to separate GPUs (sequential execution) | × | ○ Saves VRAM | △ Minimal impact from speed differences |
| ComfyUI-Distributed | Parallel execution of different seeds | ○ Throughput improvement | × Full copy to each GPU | × Slowest GPU becomes bottleneck |
| ComfyUI-ParallelAnything | Splits batches and distributes across GPUs | ○ Reduces total batch time | × Full copy to each GPU | × Assumes equal speeds |

ComfyUI-MultiGPU is a node that assigns different components of models (such as UNet, CLIP, and VAE) to separate GPUs. Although the processing itself executes sequentially, it eliminates the need to load everything onto a single GPU, resulting in VRAM savings. It supports FLUX and LTX Video. For implementation details, refer to the ComfyUI-MultiGPU Official Repository.

ComfyUI-Distributed takes a different approach: it executes the same workflow in parallel across multiple GPUs. Since generation can happen simultaneously with different seeds, throughput improves. While speed per card remains unchanged, the number of images output per unit time increases.

ComfyUI-ParallelAnything is a node that splits batches and distributes them to each GPU. However, it requires a full copy of the model on every GPU, doubling VRAM consumption. It is designed assuming two GPUs with identical speeds.

Currently, there is no method to split a single denoising step across two GPUs. In other words, you cannot generate one image twice as fast using two GPUs. If aiming for speed-up with dual GPUs, the focus must be on improving throughput (generating multiple images simultaneously).

This also explains why I adopt the port separation method. There is approximately a 1.5x difference in processing speed between my RTX 5080 and RTX 4070 Super; using simultaneous processing nodes would make the slower card the bottleneck. For GPUs with significant speed differences, running them independently is more rational.
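The bottleneck argument can be made concrete with a simplified throughput model. This is a sketch under stated assumptions, not a measurement: "lockstep" assumes work is split evenly and each round waits for the slowest card, the step count of 70 is a hypothetical value applied to both GPUs, and the function names are made up for illustration.

```python
def images_per_hour_independent(sec_per_it, steps):
    """Each GPU runs its own queue and never waits for the other."""
    return sum(3600 / (s * steps) for s in sec_per_it)


def images_per_hour_lockstep(sec_per_it, steps):
    """Work is split evenly and every round waits for the slowest GPU."""
    slowest = max(sec_per_it)
    return len(sec_per_it) * 3600 / (slowest * steps)


measured = [3.83, 5.71]  # s/it for the RTX 5080 and RTX 4070 Super
steps = 70               # hypothetical: same step count on both GPUs

independent = images_per_hour_independent(measured, steps)
lockstep = images_per_hour_lockstep(measured, steps)
print(f"independent: {independent:.1f} images/h")
print(f"lockstep:    {lockstep:.1f} images/h")
print(f"advantage:   {independent / lockstep:.2f}x")
```

Under this model, independent queues come out roughly 25% ahead for this pair of cards, because the faster GPU never idles while the slower one finishes.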

You Cannot Share VRAM ― The Biggest Misconception About Dual-GPU

“Can’t I combine the 16GB of RTX 5080 and the 12GB of RTX 4070 Super to use a total of 28GB?” This is likely the most common misconception when considering dual-GPU configurations. In conclusion, combining VRAM for shared usage is impossible in ComfyUI. NVIDIA GPU VRAM is physically independent on each card; consumer-grade cards do not have mechanisms to integrate memory space across multiple GPUs.

NVLink / SLI / NVSwitch Overview

NVLink is a high-bandwidth interconnect technology for communication between GPUs, with the RTX 3090 being the last generation of consumer support. For the RTX 40 series and later, references to NVLink have been removed from the official page (NVIDIA GeForce RTX 40 Series Official Page), confirming its discontinuation. While data center cards like H100 / A100 allow integrated memory addressing via NVSwitch—enabling them to treat multiple GPUs as a single virtual memory space—this feature is not available for consumer products.

NVLink and NVSwitch are interconnect technologies optimized for data center GPUs, enabling high-bandwidth, low-latency memory access across multiple GPUs.

You may also see discussions about “VRAM sharing via SLI,” but SLI is a frame-splitting technology for gaming and has nothing to do with VRAM sharing. Furthermore, the bandwidth of the SLI bridge (about 2 GB/s, i.e. 16 Gbps) makes it impractical for AI computation.

I actually attempted to specify the RTX 4070 Super as an offload destination for the RTX 5080’s VRAM, but due to ComfyUI’s architecture, this was impossible. If you want to run large models, your only option is to prepare a single GPU with sufficient VRAM capacity.

For large models like SDXL or Flux where VRAM is insufficient, dual-GPU will not solve the problem. If the goal is increasing VRAM capacity, it is better to choose a single GPU with larger memory rather than investing in two GPUs.
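The difference between the "28GB pool" expectation and what actually applies can be written as a two-line check. This is a minimal sketch: the function name and the 20GB model footprint are hypothetical, and real VRAM needs also depend on resolution, precision, and the memory ComfyUI reserves.

```python
def fits_without_pooling(model_gb, gpu_vram_gb, reserve_gb=0.0):
    """True if the model fits on at least one single GPU.

    Capacities are never added together, which mirrors the rule above.
    """
    return any(model_gb <= vram - reserve_gb for vram in gpu_vram_gb)


gpus = [16, 12]  # RTX 5080 and RTX 4070 Super, in GB
model_gb = 20    # hypothetical model footprint

print(sum(gpus) >= model_gb)                 # True: the naive pooled check
print(fits_without_pooling(model_gb, gpus))  # False: the check that applies
```

The deciding number is the largest single card, not the sum, which is why a bigger single GPU is the fix for a model that does not fit.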

Exception: Ollama Can “Pseudo-Share” VRAM

I stated that combining VRAM is impossible for ComfyUI, but the situation is entirely different for LLM inference tools like Ollama. Ollama can distribute transformer layers across multiple GPUs using pipeline parallelism (a method distinct from tensor parallelism). Data flows through layers sequentially; as a result, it becomes possible to load models equivalent to 16GB + 12GB ≈ 28GB.

The advantage of pipeline parallelism is that inter-GPU communication can be minimized. Unlike tensor parallelism (where multiple GPUs share the same layer), only intermediate activations between layers are transferred to the next GPU, resulting in low dependency on PCIe bandwidth. The basis for this being practical even with low-bandwidth connections like Oculink (PCIe x4) is confirmed in the Multi-GPU operation specifications of the Ollama Official GPU Documentation.
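The idea of placing whole layers on different GPUs can be sketched as a proportional split. This illustrates the concept only and is not Ollama's actual allocation logic (which I have not inspected): the function name, the 64-layer model, and the VRAM-proportional rule are all assumptions for the example.

```python
def split_layers(n_layers, gpu_vram_gb):
    """Assign contiguous layer ranges to GPUs in proportion to VRAM."""
    total = sum(gpu_vram_gb)
    ranges, start = [], 0
    for i, vram in enumerate(gpu_vram_gb):
        if i == len(gpu_vram_gb) - 1:
            count = n_layers - start  # last GPU takes the remainder
        else:
            count = round(n_layers * vram / total)
        ranges.append(range(start, start + count))
        start += count
    return ranges


# Hypothetical 64-layer model on the 16GB + 12GB pair.
assignment = split_layers(64, [16, 12])
for gpu, layers in zip(["GPU0 (16GB)", "GPU1 (12GB)"], assignment):
    print(f"{gpu}: layers {layers.start}-{layers.stop - 1} ({len(layers)} layers)")
```

Only the activations at the boundary between the two ranges cross the link per token, which is why a narrow connection is tolerable for this style of split.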

Benchmark Results During Dual-GPU Operation

The measurement results from our test environment are summarized below.

| Model | RTX 5080 Standalone | Dual GPU (5080+4070S) | Difference |
| --- | --- | --- | --- |
| Qwen3 8B | 99.15 tok/s | 130.68 tok/s | +32% |
| Gemma3 12B | 85.90 tok/s | 84.47 tok/s | Nearly identical |
| Qwen3 32B | 5.69 tok/s (CPU overflow) | 11.24 tok/s | +97% |

The Qwen3 8B model fits within a single GPU’s memory, yet the dual-GPU run was 32% faster. A likely explanation is that distributing layers expands the effective memory bandwidth. Gemma3 12B, on the other hand, shows almost no change; the effect varies with model size and with how layers are distributed between the GPUs.

The most dramatic difference appeared with Qwen3 32B. When running solely on an RTX 5080 (16GB VRAM), the model exceeds capacity, forcing offloading to CPU memory and dropping speed to just 5.69 tok/s. With dual GPUs, however, it distributes across both cards’ VRAMs, eliminating the need for offloading and achieving nearly double speed at 11.24 tok/s. Models that overflow into CPU due to insufficient VRAM are precisely those that benefit most from a dual-GPU setup.
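The percentages in the table follow directly from the tok/s values. The sketch below recomputes them from the measured figures; nothing here is new data.

```python
# (single RTX 5080, dual GPU) in tokens per second, from the table.
results = {
    "Qwen3 8B": (99.15, 130.68),
    "Gemma3 12B": (85.90, 84.47),
    "Qwen3 32B": (5.69, 11.24),
}


def percent_change(single, dual):
    """Relative change of the dual-GPU result versus the single GPU."""
    return (dual - single) / single * 100


changes = {model: percent_change(s, d) for model, (s, d) in results.items()}
for model, change in changes.items():
    print(f"{model}: {change:+.1f}%")
```

Gemma3 12B comes out slightly negative (under 2%), which is within what I would treat as run-to-run noise.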

However, there is a caveat: data transfer can become a bottleneck between GPUs with different speeds. Especially in Oculink (PCIe x4) connections, latency increases during the prefill process at prompt input. While token generation (eval) speed remains good, wait times for the first response may be longer.

Cases Suitable and Unsuitable for Dual-GPU

Based on the results so far, here is where a dual-GPU configuration fits and where it does not.

Suitable Use Cases

Running separate tasks simultaneously via port separation. This is ideal for operations like generating images on a sub-GPU while working on the main GPU. Since processes do not interfere, speed differences are irrelevant.

Running large models for LLM inference with tools like Ollama. The ability to run models that don’t fit in a single GPU without CPU offloading is a significant advantage. As shown by the Qwen3 32B example, the speed difference compared to when CPU offloading occurs is striking.

Doubling throughput with two GPUs of equal speed. If using identical models on both cards, parallel batch processing via ComfyUI-Distributed or ParallelAnything functions effectively.

Unsuitable Use Cases

Aiming to accelerate a single image generation process. There is currently no technology available that splits a single denoising step across GPUs.

Wanting to combine VRAM capacity for large models. ComfyUI does not support sharing VRAM between GPUs. If you lack VRAM for SDXL or Flux, the correct solution is to upgrade to a single GPU with larger capacity rather than buying two cards.

Aiming for simultaneous processing with GPUs that have significant speed differences. In simultaneous processing using multi-GPU nodes, the slower GPU becomes the bottleneck. For combinations like RTX 5080 + RTX 4070 Super, independent operation via port separation yields higher overall efficiency.

Considerations for Power / PSU Configuration

A frequently overlooked aspect of dual-GPU operation is power supply configuration. The TGP (Total Graphics Power) is 360W for the RTX 5080 and 220W for the RTX 4070 Super. My CPU, the i7-14700F, also draws up to 219W at maximum turbo power, so these three components alone can momentarily pull nearly 800W (360 + 220 + 219 = 799W).

| Component | Rated Power | Source |
| --- | --- | --- |
| RTX 5080 | 360W (TGP) | NVIDIA Official Specs |
| RTX 4070 Super | 220W (TGP) | NVIDIA Official Specs |
| i7-14700F | 219W (Max Turbo Power) | Intel Official Specs |

In my environment, the sub-GPU connected via Oculink has a completely separate power supply. The MINISFORUM DEG1 eGPU dock requires its own ATX power source, so the load on the main PSU and the Oculink-side PSU is physically isolated, and fluctuations on one side do not propagate to the other. A dual-PSU configuration adds wiring complexity, but each PSU carries only its own load, so heat build-up during long generation tasks on one side does not affect the other. If building with a single PSU, aim for an 80PLUS Platinum unit rated at 1000W or more to cover both GPUs and the CPU.

As a specific example of power supply usage, I use the Thermaltake TOUGHPOWER GT 750W for the DEG1. For sub-GPUs in the RTX 4070 Super class, 600–750W provides ample headroom; there is no need to overspecify capacity.
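The PSU sizes above can be sanity-checked by combining the rated power figures with the 80% load ceiling mentioned in the troubleshooting section. This is a rough sketch: the helper name is hypothetical, it assumes integer wattages, and it counts only the GPUs and CPU, so motherboard, drives, and fans need extra margin on top.

```python
def min_psu_watts(component_watts, max_load_percent=80):
    """Smallest PSU rating that keeps peak draw at or below the load ceiling.

    Uses integer ceiling division to avoid floating-point rounding surprises.
    """
    peak = sum(component_watts)
    return (peak * 100 + max_load_percent - 1) // max_load_percent


single_psu = min_psu_watts([360, 220, 219])  # both GPUs + CPU on one PSU
dock_psu = min_psu_watts([220])              # RTX 4070 Super alone on the dock

print(f"single PSU: at least {single_psu} W")
print(f"dock PSU:   at least {dock_psu} W")
```

The single-PSU case lands just under 1000W, and the dock-side figure shows why a 750W unit is already generous for a 4070 Super-class card.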

The DEG1 also does not have to be treated as a small power setup for a sub-GPU. If modifying an existing PC is difficult due to budget, another option is to install a high-end GPU such as the RTX 5090 on the DEG1 side. This adds a high-TGP GPU over Oculink as a separate system without running into the main PSU's capacity or the case's size limits. You can power on the DEG1-side GPU only for heavy generation tasks and handle daily browsing and light workloads on the main GPU alone, which avoids the constant idle draw of a high-end GPU (approx. 30W for the RTX 5090). Compared with a GPU that sits in a PCIe x16 slot and is always powered, this approach is easier to manage in terms of both electricity cost and heat.

Troubleshooting FAQ

Q1: “No CUDA device available” error when starting ComfyUI

The specification for CUDA_VISIBLE_DEVICES likely does not match the number specified by --cuda-device. If you set CUDA_VISIBLE_DEVICES=0 but specify --cuda-device 1, it will fail because only one GPU is visible (index 0), yet index 1 is requested. Align both to index 0, or remove CUDA_VISIBLE_DEVICES entirely and specify the physical number.

Q2: Generation speed on one GPU is extremely slow compared to the other

This is often caused by insufficient PCIe lane bandwidth. Oculink via M.2 usually connects as PCIe x4; consequently, model load times are longer compared to GPUs connected via PCIe x16. While s/it during generation (VRAM internal processing) shows little difference, the experience differs significantly when switching models for the first time. The impact of PCIe bandwidth and host-device transfers is also discussed in NVIDIA’s official CUDA Developer Documentation.

Q3: Browser tab for sub-GPU does not respond

This could be due to a missing specification for --port 8189 or the Windows Firewall blocking it. Check LISTEN status using netstat -ano | findstr 8189; if nothing appears, recheck startup options. If you see “To see the GUI go to: http://127.0.0.1:8189” in the sub-GPU’s command prompt, it is listening correctly.
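The same listening check can be done from Python if you prefer a script over netstat. This is a small sketch using only the standard library; the function name is hypothetical and the ports are the ones used in this article.

```python
import socket


def is_listening(port, host="127.0.0.1", timeout=1.0):
    """Return True if something accepts TCP connections on host:port."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        sock.settimeout(timeout)
        return sock.connect_ex((host, port)) == 0


for port in (8188, 8189):
    state = "listening" if is_listening(port) else "not reachable"
    print(f"port {port}: {state}")
```

If 8188 is reported as listening but 8189 is not, the sub-GPU instance either did not start or was launched without its --port option.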

Q4: How do I force Ollama to use only one specific GPU?

To pin Ollama to a single card, use the CUDA_VISIBLE_DEVICES environment variable to limit which GPUs Ollama can see—just as in the ComfyUI section above (e.g., set CUDA_VISIBLE_DEVICES=0 uses only GPU 0). Set it before starting the Ollama server, or add it to override.conf if running via systemd. Note that num_gpu (OLLAMA_NUM_GPU) controls how many layers are offloaded to the GPU—not which GPU is used—so treat it as separate from GPU pinning. You can verify placement with ollama ps. Details are in the Ollama FAQ. On Windows, using set in your current shell rather than setx tends to apply changes immediately.

Q5: Main GPU suddenly crashes during parallel execution while sub-GPU continues working?

This may indicate insufficient capacity with a single PSU configuration. Measure instantaneous power usage using tools like GPU-Z or HWiNFO64 and verify that it stays below 80% of the PSU’s rated limit for your total load (GPU + CPU). Operating consistently above 80% shortens lifespan; if you suspect insufficiency, consider adding a second PSU or upgrading to a higher-capacity model. Separating systems with dual PSUs is also a rational choice.

Q6: If buying new in 2026, should I choose the RTX 5060 Ti 16GB or RTX 4070 Super as my sub-GPU?

If purchasing new, the RTX 5060 Ti 16GB is the practical solution. It offers advantages in both VRAM capacity (16GB vs 12GB) and power consumption (180W vs 220W), with a similar price point. The 4070 Super represents an older generation; stock availability may be thin, making it only worth considering if hunting for used units. However, the 4070 Super’s memory bandwidth (504GB/s) is wider than that of the 5060 Ti (448GB/s), so there are scenarios where the 4070 Super performs better in bandwidth-dependent workloads (= large model inference). While it depends on usage, under the premise “building a new setup,” I recommend the RTX 5060 Ti 16GB.

Q7: Is there a perceptible difference between Oculink x4 and PCIe x16 during generation?

In image generation (SDXL / Flux), there is virtually no perceptible difference. This is because the primary load during generation comes from internal VRAM ↔ GPU core computations, not bandwidth between GPU and CPU. However, for LLMs (during large model loading) or heavy ControlNet tasks where a large preprocessor must be passed from the CPU side, Oculink x4 (= equivalent to PCIe 4.0 x4) may result in increased wait times. While there can be second-level differences in total workflow time including load and preprocessing steps, it has almost no impact on generation time per step.
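The "second-level differences" can be estimated from link bandwidth alone. The sketch below is a theoretical lower bound, not a measurement: it assumes both links run at PCIe 4.0 rates (16 GT/s per lane with 128b/130b encoding), uses a hypothetical 12GB checkpoint, and ignores disk read speed and protocol overhead, which usually dominate in practice.

```python
# PCIe 4.0: 16 GT/s per lane, 128b/130b encoding, 8 bits per byte -> GB/s.
PCIE4_GB_PER_S_PER_LANE = 16 * (128 / 130) / 8


def transfer_seconds(size_gb, lanes, per_lane=PCIE4_GB_PER_S_PER_LANE):
    """Minimum time to move size_gb over a link with the given lane count."""
    return size_gb / (per_lane * lanes)


model_gb = 12  # hypothetical checkpoint size
for lanes in (4, 16):
    print(f"x{lanes}: {transfer_seconds(model_gb, lanes):.2f} s minimum")
```

The gap is on the order of one second per model load. It is noticeable when switching models often, and invisible once sampling has started.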

Summary

The most practical way to operate ComfyUI with dual GPUs is “gaining two independent generation environments via port separation.” Features like combining VRAM or splitting denoising steps remain unrealized for now. For GPUs with significant speed differences, it is more rational to offload separate tasks rather than forcing simultaneous processing.

On the other hand, in LLM inference tools like Ollama, pipeline parallelism allows pseudo-integration of VRAM. The value of dual-GPU setups varies greatly depending on use cases; therefore, you should clarify “what exactly do I want to achieve” before planning your configuration. If your only goal is loading a large model onto one GPU, purchasing a single high-capacity VRAM card is preferable over investing in a two-card setup.

The eGPU dock used in this setup is the MINISFORUM DEG1, which connects the RTX 4070 Super to the main PC over Oculink.


This article was written by the AI Hardware Zukan Editorial Team based on information available at the time of publication. Evaluations may change due to product updates or fluctuations in third-party benchmarks, prices, and supported runtimes. We recommend re-evaluating content after a certain period has passed.
