RTX 5080 MoE Models: Power Drops to One-Quarter in Real-World Tests

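Mixture of Experts (MoE) models are LLMs that route each token through only a subset of their parameters, the so-called active parameters, instead of computing with every weight.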

In continuous measurements of multiple LLMs on an RTX 5080, dense models consistently drew 200–300W. In contrast, the MoE-configured gemma4:26b and qwen3.5:35b-a3b were the only ones to drop as low as 47–73W. Counterintuitively, this happened while their GPU memory usage sat at around 14.8GB, and GPU temperatures also dropped to just 42–45°C.

Key Takeaways

  • Dense models stabilize in the 200–300W range, while MoE drops to 47–73W (1/4 to 1/6 of dense power)
  • Despite GPU memory usage of 14.8GB, higher than any dense model tested (4.0–11.0GB), power consumption remains low
  • For cost-sensitive batch operations, MoE is advantageous; when conversational speed is the priority, dense models are superior

Dense Models at 280W vs. MoE Models at 50–70W — Power Difference Observed on RTX 5080

We ran multiple models under identical conditions: Ollama 0.20.7, NVIDIA driver 595.97, and Windows 11 (25H2, build 26200) on the same RTX 5080. The results clearly split into two clusters: dense models converged to 200–300W, while only the two MoE-configured models dropped to the 50–70W range.

This gap cannot be explained by model size or quantization differences alone. The dense phi4:14b occupied 11.0GB of GPU memory and drew about 300W, whereas the MoE models occupied 14.8GB yet drew only around 50–70W.

[Figure: RTX 5080 Measured: Power Draw of Dense vs. MoE Models (W). Bar chart of dense models and MoE models on a 0–300W axis. Measured 2026-04-19 / Ollama 0.20.7 / RTX 5080, i7-14700F, 96GB / steady-state median]
Dense models cluster between 217 and 301W, while the MoE models draw 73W (gemma4:26b) and 47W (qwen3.5:35b-a3b) — roughly a quarter of the dense range. Measurement conditions are listed in the table below (as of Ollama 0.20.7).

Test Environment and Measurement Conditions

The test environment for this site is as follows:

GPU: NVIDIA GeForce RTX 5080 (VRAM 15.9GB)
CPU: Intel Core i7-14700F
RAM: 96GB
NVIDIA Driver: 595.97
Ollama Version: 0.20.7
OS: Windows 11 (25H2, build 26200)
Date Measured: April 19, 2026

The RTX 5080 is specified with 16GB of GDDR7 VRAM, a TBP (Total Board Power) of 360W, 10,752 CUDA cores, and fifth-generation Tensor Cores. (Source: NVIDIA GeForce RTX 5080 Official Specification Page)

Power Sampling Procedure via nvidia-smi

Power consumption was measured by sampling every second using the command nvidia-smi --query-gpu=power.draw,utilization.gpu,temperature.gpu,memory.used --format=csv -l 1 for a duration of 60 seconds. The median value during steady-state inference was adopted as the final reading. Tokens/sec were calculated from Ollama API responses using eval_count / eval_duration. VRAM usage reflects stable values after model loading completed, while TTFT (Time to First Token) was aggregated based on prompt_eval_duration and the timestamp of the first token arrival. This measurement procedure follows guidelines from the NVIDIA System Management Interface (nvidia-smi) Official Documentation.
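As a rough illustration of the procedure above, the following Python sketch reduces a saved nvidia-smi CSV log to a median power figure and derives tokens/sec from an Ollama response. The function names and the sample values are hypothetical. The sketch assumes the log was produced with the query shown above, so that power.draw is the first column and is reported in a form like "280.00 W", and that eval_duration is given in nanoseconds.

```python
import csv
import io
import statistics


def median_power_watts(csv_text: str) -> float:
    """Median of the power.draw column from an nvidia-smi CSV log."""
    reader = csv.reader(io.StringIO(csv_text))
    next(reader)  # skip the header row, e.g. "power.draw [W], ..."
    samples = []
    for row in reader:
        if not row:
            continue
        # Values are assumed to look like "280.12 W"; keep the numeric part.
        samples.append(float(row[0].strip().split()[0]))
    return statistics.median(samples)


def tokens_per_second(eval_count: int, eval_duration_ns: int) -> float:
    """Generation speed from eval_count and eval_duration (nanoseconds)."""
    return eval_count / (eval_duration_ns / 1e9)


# Hypothetical three-sample log, not taken from the measurements in this article.
sample_log = """power.draw [W], utilization.gpu [%], temperature.gpu, memory.used [MiB]
279.50 W, 97 %, 62, 6349 MiB
281.10 W, 98 %, 62, 6349 MiB
280.00 W, 97 %, 62, 6349 MiB
"""

print(median_power_watts(sample_log))
print(tokens_per_second(eval_count=512, eval_duration_ns=4_000_000_000))
```

Using the median rather than the mean keeps brief spikes or dips at the start and end of a run from skewing the reported value.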

Each model underwent three warm-up runs with short prompts beforehand to stabilize KV cache and shader compilation before actual measurements began. Tokens/sec, VRAM usage, GPU temperature, and power consumption all reflect steady-state inference observations.

Power Consumption Profile of Dense Models (Stable in 200–300W Range)

We begin with data from the dense models. Despite parameter counts ranging from 3B to 14B, power consumption stayed within a surprisingly narrow band of 217–301W.

Model | Tokens/sec | VRAM Usage | GPU Temperature | Power Consumption
phi4-mini:3.8b | 237.8 | 4.7GB | 56.0°C | 242W
llama3.2:3b | 278.7 | 4.0GB | 57.0°C | 239W
gemma3:4b | 189.1 | 4.7GB | 59.0°C | 217W
mistral:7b | 155.2 | 6.2GB | 62.0°C | 279W
llama3.1:8b | 140.5 | 6.6GB | 60.0°C | 267W
deepseek-r1:8b | 132.7 | 6.8GB | 60.0°C | 268W
qwen3.5:9b | 107.8 | 8.7GB | 59.0°C | 248W
gemma3:12b | 85.9 | 9.8GB | 58.0°C | 261W
phi4:14b | 82.6 | 11.0GB | 62.0°C | 301W
qwen3:14b | 79.1 | 10.7GB | 62.0°C | 287W

Relationship Between Model Size and Power Consumption

The 3B class consumed around 239W, while the 14B class reached 287–301W. Even with parameters increasing roughly fivefold, power consumption increased by only about 20–25%. The GPU runs continuously at roughly 60–84% of its 360W TBP (217–301W), indicating that compute units are heavily engaged in matrix operations.
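The scaling comparison above can be checked with a few lines of arithmetic. This is only a sketch using the measured wattages from the dense table; the parameter counts are the nominal sizes in the model names, and the helper name share_of_tbp is made up for illustration.

```python
TBP_WATTS = 360  # RTX 5080 total board power, as quoted in this article


def share_of_tbp(watts: float) -> float:
    """Fraction of the board power limit that a measured draw represents."""
    return watts / TBP_WATTS


# (nominal parameters in billions, measured watts) from the dense table
small = (3.0, 239)   # llama3.2:3b
large = (14.0, 301)  # phi4:14b

param_ratio = large[0] / small[0]
power_ratio = large[1] / small[1]
print(f"parameters: x{param_ratio:.1f}, power: x{power_ratio:.2f}")

for name, watts in [("gemma3:4b", 217), ("llama3.2:3b", 239), ("phi4:14b", 301)]:
    print(f"{name}: {share_of_tbp(watts):.0%} of TBP")
```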

Smaller models achieve higher tokens/sec: llama3.2:3b reached 278.7 tok/s and phi4-mini:3.8b recorded a similarly high 237.8 tok/s. Because smaller models generate tokens faster, the amount of computation per unit time stays high even though each token requires less work, which helps explain why power draw does not fall in proportion to parameter count.

Thermal Behavior from GPU Temperature Perspective

Temperatures remained stable within a narrow band of 56–62°C. With fans operating effectively inside the case, heat generation remains consistent even during prolonged inference. Considering the RTX 5080’s TBP of 360W, this class demonstrates ample thermal headroom.

Unusual Power Drop Observed with MoE Models

This is the core topic. The two models—gemma4:26b and qwen3.5:35b-a3b—exhibited distinctly different behavior on the same RTX 5080.

Model | Tokens/sec | VRAM Usage | GPU Temperature | Power Consumption
gemma4:26b (MoE, A4B) | 36.6 | 14.8GB | 45.0°C | 73W
qwen3.5:35b-a3b (MoE, A3B) | 18.4 | 14.8GB | 42.0°C | 47W

Despite GPU memory usage of around 14.8GB, power consumption is only one-fourth (gemma4:26b) to one-sixth (qwen3.5:35b-a3b) that of the dense model phi4:14b (11.0GB, 301W). The assumption that “high GPU memory usage equals high power draw” does not hold here.

Behavior of gemma4:26b (A4B)

The gemma4:26b model has a total parameter count of 26B but activates only 3.8B parameters per token (denoted as A4B). In our test environment (RTX 5080 / i7-14700F / 96GB RAM), it achieved 36.6 tok/s at 73W. Compared to the dense model phi4:14b (82.6 tok/s·301W), its speed is less than half, yet power consumption drops to one-fourth.

External reports have recorded much higher speeds for this model, such as 131 tok/s on vLLM and 181 tok/s on Ollama, but in our Windows 11 + Ollama 0.20.7 environment the slower, low-power behavior was dominant.

Behavior of qwen3.5:35b-a3b

The qwen3.5:35b-a3b model adopts an even more extreme configuration. With a total of 35B parameters, it activates only about 3B (A3B), making its active parameter count smaller than gemma4’s and reducing power consumption to just 47W.

At 18.4 tok/s, speed drops to roughly one-fourth of the 14B dense models (79–83 tok/s) and lower still relative to the smaller dense models. It is nonetheless noteworthy that a model with 35B total parameters can run in a power range more typical of laptop GPUs.

Origins of MoE Architecture and Implementation Differences Across Models

The modern implementation of Mixture of Experts (MoE) traces back to Google Brain’s Switch Transformer. It proposed a mechanism that drastically reduces inference costs by limiting the number of experts accessed per token, even for models with trillion-parameter scales (Fedus, Zoph, Shazeer, “Switch Transformers: Scaling to Trillion Parameter Models with Simple and Efficient Sparsity”, arXiv:2101.03961).

Subsequently, Mistral AI introduced a practical open-weight MoE model with Mixtral 8x7B, publishing a design featuring two active experts per token for an effective activation of 12.9B parameters out of 46.7B total (Jiang et al., “Mixtral of Experts”, arXiv:2401.04088). The qwen3.5:35b-a3b model tested here is an MoE variant within Alibaba Cloud’s Qwen3 series, characterized by 3B active parameters (denoted as A3B) (Qwen Team, “Qwen3 Technical Blog”).

“Sparse MoE models achieve equivalent quality to dense models using only a fraction of the inference FLOPs because they activate only some experts per token.” — Summary of the central claim from the Switch Transformer paper (Fedus et al. 2022)

Model | Total Parameters | Active Parameters | Architecture Family
gemma4:26b | 26B | 3.8B (A4B) | Gemma-family MoE
qwen3.5:35b-a3b | 35B | 3.0B (A3B) | Qwen3-family MoE
Mixtral 8x7B (Reference Value) | 46.7B | 12.9B | Mistral-family MoE
phi4:14b (Dense Model Reference) | 14B | 14B | Dense Transformer

While Mixtral 8x7B activates a relatively large 12.9B parameters per token, both gemma4:26b and qwen3.5:35b-a3b activate far fewer, in the 3–4B range. The smaller the active parameter count, the more pronounced we would expect the reduction in Streaming Multiprocessor (SM) utilization discussed later to be.
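To make the comparison concrete, the share of weights that take part in computing each token can be derived directly from the table. The sketch below uses only the total and active parameter counts listed above; the helper name active_fraction is made up for illustration.

```python
# (total parameters, active parameters) in billions, from the table above
models = {
    "gemma4:26b": (26.0, 3.8),
    "qwen3.5:35b-a3b": (35.0, 3.0),
    "Mixtral 8x7B": (46.7, 12.9),
    "phi4:14b": (14.0, 14.0),
}


def active_fraction(total_b: float, active_b: float) -> float:
    """Share of the model's weights used to compute a single token."""
    return active_b / total_b


for name, (total_b, active_b) in models.items():
    share = active_fraction(total_b, active_b)
    print(f"{name}: {share:.1%} of weights active per token")
```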

Why Only MoE Models Show Power Reduction — Asymmetric Utilization of Active Parameters and Memory Bandwidth

The following section includes analysis based on estimation. We believe the observed combination—high GPU memory usage, low power, and lower speed—can be partly explained by structural characteristics of MoE models.

Meaning of Active Parameters (A3B/A4B)

MoE architectures contain multiple expert subnetworks internally, selecting only a subset for computation per input token. The notation “A4B” in gemma4:26b refers to approximately 4B active parameters (actually 3.8B), while “A3B” in qwen3.5:35b-a3b indicates about 3B active parameters.
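The selection step can be sketched as a generic top-k router. This is a toy illustration in plain Python, not the actual routing code of gemma4 or qwen3.5: the experts are stand-in scalar functions, the router scores are hypothetical, and in a real model the scores would come from a learned layer applied to each token’s hidden state.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def route_top_k(router_logits, k):
    """Return (expert_index, weight) pairs for the k highest-scoring experts."""
    order = sorted(range(len(router_logits)),
                   key=lambda i: router_logits[i], reverse=True)
    top = order[:k]
    # Renormalise over the selected experts only.
    weights = softmax([router_logits[i] for i in top])
    return list(zip(top, weights))


def moe_layer(x, experts, router_logits, k=2):
    """Run only the selected experts and mix their outputs by router weight."""
    out = 0.0
    for idx, w in route_top_k(router_logits, k):
        out += w * experts[idx](x)
    return out


# Toy experts: each scalar function stands in for a feed-forward block.
experts = [lambda x: x + 1, lambda x: 2 * x, lambda x: -x, lambda x: x * x]
logits = [0.1, 2.0, -1.0, 2.0]  # hypothetical router scores for one token

selected = route_top_k(logits, k=2)
print([i for i, _ in selected])  # indices of the experts that actually run
print(moe_layer(3.0, experts, logits, k=2))
```

Only two of the four toy experts are evaluated for this token; the other two contribute no computation, which is the property the A3B/A4B notation summarizes.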

In our measurement, nvidia-smi reported GPU memory usage near 14.8GB, higher than any dense model we tested (4.0–11.0GB). However, this alone does not prove that all model weights were resident in VRAM: the q4_K_M build of qwen3.5:35b-a3b is roughly 24GB, larger than the RTX 5080’s 16GB, so part of the model may sit outside VRAM depending on the runtime’s offloading behavior. We therefore treat 14.8GB as observed GPU memory usage, not as proof of full VRAM residency. In any case, only about 3–4B parameters are actually routed through computation per token.
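A quick bound shows why full residency cannot be assumed for the larger model. The sketch below uses only the two figures from the paragraph above; the helper name is made up, and the result is an upper bound because the observed GPU memory also holds the KV cache and runtime buffers, not just weights.

```python
def max_resident_fraction(model_gb: float, gpu_used_gb: float) -> float:
    """Upper bound on the share of model weights that could be in GPU memory."""
    return min(1.0, gpu_used_gb / model_gb)


# Roughly 24GB q4_K_M build vs. 14.8GB of observed GPU memory usage
frac = max_resident_fraction(model_gb=24.0, gpu_used_gb=14.8)
print(f"at most {frac:.0%} of the weights could be GPU-resident")
```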

Trade-off Between Memory Bandwidth Bottleneck and SM Idle Time

A plausible interpretation starts from how each model type drives the GPU. During dense model inference on an RTX 5080, its Streaming Multiprocessors (SMs)—the compute units—operate nearly at full capacity performing matrix operations. In contrast, MoE models perform routing decisions per token and execute computations only for selected active parameters. Consequently, SM utilization likely drops, reducing the power drawn from the SMs—the primary consumers of GPU power.

Memory bandwidth may behave differently. Because the loaded weights still have to be fetched to feed the compute units, memory-bandwidth pressure could match or exceed dense-model levels. The picture we are proposing is asymmetric—memory traffic staying busy while the compute units have idle capacity—but this is an interpretation, not a measurement.

A Microsoft Research publication on DeepSpeed-MoE also notes that during MoE inference, memory bandwidth becomes a bottleneck and GPU compute unit utilization drops significantly compared to dense models (Rajbhandari et al., “DeepSpeed-MoE: Advancing Mixture-of-Experts Inference and Training”, arXiv:2201.05596). The low power we observed is consistent with this kind of asymmetric resource use, though our measurements do not directly confirm it.

This remains a hypothesis: we did not directly profile SM occupancy or memory-bandwidth utilization (for example with Nsight), so it should be read as an explanation consistent with the observed low power, low temperature, and lower throughput—not as a directly measured cause.

Comparison of Power Efficiency per Token (Tokens/W)

Dividing tokens/sec by power consumption gives tokens generated per watt of sustained draw (tokens/W), which is equivalent to tokens per joule. For scenarios prioritizing electricity costs or battery operation, this metric proves more practical than raw tokens/sec alone.

Model | Tokens/sec | Power Consumption | Tokens/W
llama3.2:3b (Dense) | 278.7 | 239W | 1.17
phi4-mini:3.8b (Dense) | 237.8 | 242W | 0.98
gemma3:4b (Dense) | 189.1 | 217W | 0.87
mistral:7b (Dense) | 155.2 | 279W | 0.56
llama3.1:8b (Dense) | 140.5 | 267W | 0.53
qwen3.5:9b (Dense) | 107.8 | 248W | 0.43
gemma3:12b (Dense) | 85.9 | 261W | 0.33
phi4:14b (Dense) | 82.6 | 301W | 0.27
gemma4:26b (MoE) | 36.6 | 73W | 0.50
qwen3.5:35b-a3b (MoE) | 18.4 | 47W | 0.39

The 3B-class dense model llama3.2:3b achieves the highest efficiency at 1.17 tokens/W. The 35B-class qwen3.5:35b-a3b reaches only 0.39 tokens/W, yet that is still approximately 1.4 times more efficient than the 14B dense model phi4:14b (0.27 tokens/W). In other words, a model with 2.5× as many total parameters uses less energy per generated token.
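The efficiency figures follow directly from the measured speed and power values. A minimal sketch for a few of the models, using the numbers from the table (the helper name tokens_per_watt is made up for illustration):

```python
# (tokens/sec, watts) from the measurements above
measurements = {
    "llama3.2:3b": (278.7, 239),
    "phi4:14b": (82.6, 301),
    "gemma4:26b": (36.6, 73),
    "qwen3.5:35b-a3b": (18.4, 47),
}


def tokens_per_watt(tok_s: float, watts: float) -> float:
    # tokens/sec divided by watts (joules/sec) is tokens per joule
    return tok_s / watts


efficiency = {name: tokens_per_watt(*m) for name, m in measurements.items()}
for name, value in sorted(efficiency.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{name}: {value:.2f} tokens/W")

ratio = efficiency["qwen3.5:35b-a3b"] / efficiency["phi4:14b"]
print(f"qwen3.5:35b-a3b vs phi4:14b: {ratio:.2f}x")
```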

Tokens/W matters most for overnight batch processing or always-on applications constrained by power capacity. Conversely, for conversational UIs where response latency comes first, dense models in the 3–8B range are the better choice because of their higher absolute tokens/sec.

Trade-off Between Power Reduction and Throughput — Operational Decision Criteria

The decision to adopt MoE depends on specific use cases. The question boils down to whether you prioritize electricity costs and thermal headroom or response speed.

Approach for Estimating Electricity Costs

Assuming always-on inference 24 hours a day, a dense model at 280W draws about 200kWh per month, while an MoE setup at around 60W uses about 43kWh, roughly a fifth of the energy for the same running time, which cuts the running cost by about 80%. The absolute figure depends on your local electricity rate; at Japan’s ¥30/kWh (a 2026 reference based on Tokyo Electric Power Company’s tiered residential rate, Juryo Dento B Tier 3) that works out to roughly ¥6,000 versus ¥1,300 per month, a gap of about ¥4,700, or over ¥50,000 a year. The same roughly 5× difference applies at any rate. This is a GPU-board-power-only estimate and excludes the CPU, motherboard, storage, fans, PSU efficiency loss, and idle system power; it also compares equal running time rather than equal token output. (Source: Tokyo Electric Power Company Energy Partner, Juryo Dento B Standard Residential Rate Plan)
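The estimate can be reproduced, or adapted to another electricity rate, with a short calculation. This sketch assumes a 30-day month and continuous operation, and like the figures above it covers GPU board power only; the function names are made up for illustration.

```python
def monthly_kwh(watts: float, hours_per_day: float = 24, days: int = 30) -> float:
    """Energy used in one month at a constant power draw."""
    return watts * hours_per_day * days / 1000


def monthly_cost(watts: float, rate_per_kwh: float) -> float:
    """Monthly running cost in the currency of rate_per_kwh."""
    return monthly_kwh(watts) * rate_per_kwh


RATE_YEN = 30  # yen per kWh, the reference rate used in this article

dense = monthly_cost(280, RATE_YEN)
moe = monthly_cost(60, RATE_YEN)
print(f"dense: {monthly_kwh(280):.1f} kWh, about ¥{dense:,.0f}")
print(f"MoE:   {monthly_kwh(60):.1f} kWh, about ¥{moe:,.0f}")
print(f"gap:   about ¥{dense - moe:,.0f}/month, ¥{(dense - moe) * 12:,.0f}/year")
```

Replacing RATE_YEN with a local rate changes the absolute amounts but not the ratio between the two configurations.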

Prolonged high-load GPU operation affects power supply unit lifespan, case heat dissipation, and breaker capacity. Before maintaining 280W operations for extended periods on systems with limited power headroom, always verify the PSU’s rated output and its 80PLUS certification grade.

Decision Criteria for Selection

For local assistant applications prioritizing conversational response speed, dense models hold the advantage: llama3.1:8b, for example, is roughly 4–8× faster in tokens/sec than the two MoE models (140.5 vs. 36.6 and 18.4 tok/s). For interactive use, the value of immediate responses outweighs the electricity savings.

Conversely, for tasks where delays are acceptable—overnight batch processing, long-document summarization, or bulk draft generation—the slower but power-efficient MoE models prove effective. With GPU temperatures stabilizing between 42–45°C, thermal headroom increases, facilitating concurrent execution with other processes.
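For a batch job of fixed size, both the running time and the total energy matter. The sketch below estimates each from the measured steady-state figures; the job size of one million generated tokens is a hypothetical example, and the calculation covers GPU board power only, ignoring prompt processing and any difference in output quality between models.

```python
def job_cost(tokens: int, tok_per_s: float, watts: float):
    """Return (hours, kWh) needed to generate `tokens` at steady state."""
    seconds = tokens / tok_per_s
    return seconds / 3600, watts * seconds / 3.6e6


# (tokens/sec, watts) from the measurements in this article
measured = {
    "llama3.1:8b": (140.5, 267),
    "phi4:14b": (82.6, 301),
    "gemma4:26b": (36.6, 73),
    "qwen3.5:35b-a3b": (18.4, 47),
}

JOB_TOKENS = 1_000_000  # hypothetical batch size
for name, (tok_s, watts) in measured.items():
    hours, kwh = job_cost(JOB_TOKENS, tok_s, watts)
    print(f"{name}: {hours:.1f} h, {kwh:.2f} kWh")
```

Looking at hours and kWh side by side makes the trade-off explicit: a lower power draw only reduces the energy for a fixed job if tokens/W is also higher.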

Note that in other AI software such as ComfyUI, a mid-process drop in GPU utilization with processing shifting to the CPU is typically a bug. In contrast, the lower power draw of MoE models appears to follow from their design, which is a fundamentally different situation.

Version Differences in Ollama Runtime

Ollama improved GPU offloading for MoE models starting with the v0.20 series. Particularly for Qwen3-family MoEs, how active parameters are routed depends on the level of runtime support and can affect speed. Our measurements used Ollama 0.20.7; reports indicate that later versions have further improved throughput for gemma4-series models. It is not uncommon for tokens/sec to fluctuate by 20–40% through runtime updates alone. (Source: Ollama Releases, GitHub)

Summary

In our RTX 5080 tests, dense models clustered in the 200–300W range, while only the MoE-configured models, gemma4:26b and qwen3.5:35b-a3b, dropped to 47–73W. Although their GPU memory usage reached 14.8GB, more than any dense model tested, power consumption fell to roughly one-fourth to one-sixth of phi4:14b’s 301W. We interpret this as an asymmetric pattern in which only the active parameters drive the SMs while memory traffic may remain a limiting factor, though we did not directly measure it.

In terms of tokens/W efficiency, the 3B dense model achieves a peak of 1.17, while the 35B MoE reaches 0.39—surpassing the 14B dense model’s 0.27. For conversational applications requiring absolute token throughput, dense models remain superior; for extended batch operations prioritizing electricity cost and heat reduction, MoE is preferable. This binary choice forms the core of operational decision-making. Maintaining both options locally to switch between them proves most practical.

This article was written by the editorial team at AI Hardware Zukan based on information available as of its publication date. Evaluations may change due to product updates, third-party benchmarks, pricing fluctuations, or supported runtime variations. We recommend re-evaluating content after a certain period has elapsed.
