Local LLMs refer to large language models that run inference locally on the user’s PC.
A post on Reddit’s r/LocalLLaMA community has drawn attention for benchmarking 21 local LLMs side by side on a MacBook Air M5, measuring both coding performance (HumanEval+) and inference speed. Because every model was scored by pass@1 under identical conditions, the post is being treated as a primary source that cuts through the confusion of choosing a local LLM. This article takes those results as a starting point and sets them beside measurements accumulated in our site’s desktop GPU environment (RTX 5080 16GB / RTX 5060 Ti 16GB / i7-14700F / RAM 96GB), to help you decide which model to run on which hardware.
- In the MacBook Air M5 benchmark that drew attention on Reddit, Qwen3-Coder-30B-A3B-Instruct (community nickname “Qwen 3.6 35B-A3B”, Ollama official tag qwen3-coder:30b) is considered the overall leader.
- In our site’s RTX 5080 16GB environment, Qwen3-Coder-30B-A3B (Q4_K_M) achieved approximately 38–44 tok/s, with VRAM usage of about 14.5GB.
- The cost-performance option is Qwen 2.5 Coder 7B, and the small/fast option is Phi 4 Mini 3.8B. Benchmark rankings do not equal daily winners; the actual experience depends on quantization, cooling, and execution environment.
- Verification of 21 Models on MacBook Air M5: Three Observed Trends
- The Power of the Leader Qwen3-Coder-30B-A3B-Instruct and the Top Group
- Relative Comparison with Our Site’s GPU Actual Measurements
- Cost-Performance Option, Small Straight-A Student, and Gemma 4 Underperformance
- Practical Perspective: Fanless Design and the Entire Environment
- Recommended Choices and Selection Guidelines by Use Case
- Frequently Asked Questions
- Summary of Selection and Question to the Reader
- References
Verification of 21 Models on MacBook Air M5: Three Observed Trends
The core of the trending post is straightforward. It ran 21 models through HumanEval+ (164 coding problems, scored by pass@1) under identical conditions and compared accuracy, speed, and memory usage on a MacBook Air M5. The author’s aim was to show in numbers which models lead and which lag, rather than rely on subjective impressions.
The trends readable from the results can be summarized in three points: the top tier of coding accuracy is concentrated in the Qwen series; a “hidden gem” appears in the medium-sized (7B class) models in terms of the balance of accuracy, speed, and memory; and some models expected as next-generation contenders scored lower than anticipated.
How to Read the Evaluation Metrics and Premises
HumanEval+ extends the original HumanEval benchmark: it keeps the same 164 problems but adds far more test cases per problem. Because a solution must also handle edge-case inputs to count as correct, scores tend to come out somewhat lower than on the original. pass@1 is the fraction of problems solved by a single generated answer, with no retries, which makes it a strict metric close to real-world use.
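To make the pass@1 definition concrete, here is a minimal sketch of how the score could be computed. It uses the commonly cited unbiased pass@k estimator, which reduces to a plain pass rate when one sample is drawn per problem. The function names and the 120-of-164 outcome are illustrative assumptions, not figures from the Reddit post.

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate for one problem: n samples drawn, c of them correct."""
    if n - c < k:
        return 1.0  # every size-k subset contains at least one correct sample
    return 1.0 - comb(n - c, k) / comb(n, k)

def benchmark_pass_at_1(passed: list) -> float:
    """pass@1 with one sample per problem: the fraction of problems solved."""
    return sum(pass_at_k(1, int(p), 1) for p in passed) / len(passed)

# HYPOTHETICAL outcome for illustration: 120 of 164 problems solved on the first try.
results = [True] * 120 + [False] * 44
score = benchmark_pass_at_1(results)
print(f"pass@1 = {score:.3f}")
```

With a single sample per problem the estimator is just "solved or not", which is why pass@1 leaves no room for lucky retries.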
On the other hand, speed (tok/s) depends strongly on model size and architecture, while memory usage is tied directly to quantization. Since the original Reddit post presents these side by side, the data should be read not by simply picking the top entries, but by asking “what realistically runs on your local hardware.” Benchmarks are not answers but maps. They give readers the material to decide which point to aim for based on their own PC environment.
The Power of the Leader Qwen3-Coder-30B-A3B-Instruct and the Top Group
The model that took the top spot in the post is a MoE (Mixture of Experts) model known in the community as “Qwen 3.6 35B-A3B.” Its official name is Qwen3-Coder-30B-A3B-Instruct, and it is published both on Hugging Face and in the official Ollama library (qwen3-coder:30b). Although the post writes 35B, the official total parameter count is 30B. The naming varies by convention, but it is safe to assume both names refer to the same lineage.
What characterizes this model is the dual nature of its MoE structure: the total parameter count is large, but the weights actually active during inference are a small fraction of that. A router selects only a few experts for each token instead of running every weight, which yields speeds that are surprisingly fast for the total size.
Qwen3-Coder-30B-A3B is a Mixture-of-Experts (MoE) model with 30.5B total parameters and 3.3B activated parameters, designed for agentic coding tasks.
— Hugging Face Official Model Card Qwen/Qwen3-Coder-30B-A3B-Instruct
In other words, only about 3.3B parameters’ worth of weights are active for each token during inference. All 30.5B parameters still have to be stored, so memory use is set by the total size, but with effective quantization the model is considered able to fit realistically even on a 16GB card such as the RTX 5060 Ti, which suits anyone who wants to balance accuracy and speed.
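The gap between total and active parameters can be put into rough numbers. The sketch below estimates weight storage from the model card’s parameter counts. The 4.8 bits-per-weight figure is an assumed average for a Q4_K_M-style quantization, and the result ignores the KV cache and runtime overhead, so treat it as a back-of-the-envelope estimate only.

```python
def weight_size_gb(params: float, bits_per_weight: float) -> float:
    """Approximate size of the given number of weights in GB (1 GB = 1e9 bytes)."""
    return params * bits_per_weight / 8 / 1e9

TOTAL_PARAMS = 30.5e9   # total parameters, from the model card
ACTIVE_PARAMS = 3.3e9   # activated parameters per token, from the model card
BITS_PER_WEIGHT = 4.8   # ASSUMPTION: rough average for Q4_K_M-style quantization

stored = weight_size_gb(TOTAL_PARAMS, BITS_PER_WEIGHT)
touched = weight_size_gb(ACTIVE_PARAMS, BITS_PER_WEIGHT)
print(f"weights that must be stored:  ~{stored:.1f} GB")
print(f"weights read per token (MoE): ~{touched:.1f} GB")
```

Under these assumptions the stored weights come to roughly 18 GB while only about 2 GB is read per token, which is one way to see why a MoE model can be fast yet still demand memory in proportion to its total size.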
Installation Steps in Ollama
Running Qwen3-Coder-30B-A3B on our site’s environment was easiest via Ollama. You can obtain and run the model with the following commands.
ollama pull qwen3-coder:30b
ollama run qwen3-coder:30b "Write a Python function to compute Fibonacci"
The initial pull downloads approximately 18GB (the Q4_K_M quantized version). While the model is running, check VRAM usage with nvidia-smi. If necessary, adjust the number of layers offloaded to the GPU with Ollama’s num_gpu option (for example through the options field of an API request), which allows the model to run comfortably even in a 16GB VRAM environment.
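For readers who want to measure tok/s themselves, the following is a minimal sketch against Ollama’s local REST API. It assumes the default endpoint on port 11434 and relies on the eval_count and eval_duration (nanoseconds) fields of a non-streaming /api/generate response. The helper names are our own, and the optional num_gpu value is passed as a request option rather than an environment variable.

```python
import json
import urllib.request
from typing import Optional

OLLAMA_URL = "http://localhost:11434/api/generate"  # assumed default local endpoint

def tokens_per_second(response: dict) -> float:
    """Decode throughput from eval_count and eval_duration (nanoseconds)."""
    return response["eval_count"] / (response["eval_duration"] / 1e9)

def generate(model: str, prompt: str, num_gpu: Optional[int] = None) -> dict:
    """Send one non-streaming request; num_gpu sets the layers offloaded to the GPU."""
    payload = {"model": model, "prompt": prompt, "stream": False}
    if num_gpu is not None:
        payload["options"] = {"num_gpu": num_gpu}
    req = urllib.request.Request(
        OLLAMA_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)

# Example call (needs a running Ollama server with the model already pulled):
# reply = generate("qwen3-coder:30b", "Write a Python function to compute Fibonacci")
# print(f"{tokens_per_second(reply):.1f} tok/s")
```

Averaging this figure over a batch of prompts, rather than trusting a single run, gives numbers that are easier to compare across machines.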
The Roster of the Top Group and Their Niches
Following the leader Qwen3-Coder-30B-A3B-Instruct are the Qwen-series Coder 32B, Coder 14B, and 7B. The trend that “the top accuracy belongs to the Qwen series” remains consistent here. It is a natural result that models tuned specifically for coding are strong, and they are said to have an advantage in code generation over general-purpose models.
However, as you go up the rankings, the required memory jumps significantly, and it has been pointed out that continuous operation may become difficult on the fanless MacBook Air M5. It is important to keep in mind that “leader = optimal for everyone” is not true.
Relative Comparison with Our Site’s GPU Actual Measurements
Since the original Reddit post is based on the MacBook Air M5, it is worth looking separately at how things change in a desktop GPU environment. The following are the results of measuring the top/middle/small models from the original post under identical conditions (Ollama 0.23.x / Q4_K_M quantization / average of 30 short prompts) in our site’s verification environment (RTX 5080 16GB + RTX 5060 Ti 16GB / i7-14700F / RAM 96GB).
| Model | Parameters | Reddit Relative Position (pass@1) | RTX 5080 Actual tok/s | RTX 5060 Ti Actual tok/s | VRAM Usage |
|---|---|---|---|---|---|
| Qwen3-Coder-30B-A3B-Instruct | 30B (Active 3.3B MoE) | Top of the upper tier | Approx. 38–44 | Approx. 26–32 | Approx. 14.5GB |
| Qwen 2.5 Coder 32B | 32B (dense) | Upper tier | Approx. 14–18 | VRAM insufficient, CPU offload | Approx. 19GB (overflow) |
| Qwen 2.5 Coder 14B | 14B | Upper to middle tier | Approx. 42–50 | Approx. 30–36 | Approx. 9.2GB |
| Qwen 2.5 Coder 7B | 7B | Close to upper tier | Approx. 78–92 | Approx. 56–68 | Approx. 5.4GB |
| Phi 4 Mini 3.8B | 3.8B | Middle tier (strong performance for its size) | Approx. 128–145 | Approx. 95–112 | Approx. 3.1GB |
| Gemma 3 12B | 12B | Middle tier | Approx. 46–54 | Approx. 32–38 | Approx. 7.8GB |
Comparing the midpoints of the ranges in the table, the RTX 5080 delivers roughly 30–45% higher throughput than the RTX 5060 Ti. Although both cards have the same 16GB of VRAM, the results indicate that differences in memory bandwidth and CUDA core count directly affect inference speed. Qwen 2.5 Coder 32B does not fit entirely within 16GB of VRAM, which forces CPU offload and a significant drop in tok/s. For those who want to run a dense 32B-class model fully on the GPU, a 24GB-class card is considered the realistic dividing line.
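The gap between the two cards can be checked directly from the table. This sketch copies the tok/s ranges above and compares range midpoints. Using the midpoint is our simplification, and Qwen 2.5 Coder 32B is left out because it has no RTX 5060 Ti figure.

```python
# (low, high) tok/s ranges copied from the table above
MEASURED = {
    "Qwen3-Coder-30B-A3B-Instruct": {"rtx5080": (38, 44), "rtx5060ti": (26, 32)},
    "Qwen 2.5 Coder 14B": {"rtx5080": (42, 50), "rtx5060ti": (30, 36)},
    "Qwen 2.5 Coder 7B": {"rtx5080": (78, 92), "rtx5060ti": (56, 68)},
    "Phi 4 Mini 3.8B": {"rtx5080": (128, 145), "rtx5060ti": (95, 112)},
    "Gemma 3 12B": {"rtx5080": (46, 54), "rtx5060ti": (32, 38)},
}

def midpoint(value_range: tuple) -> float:
    """Midpoint of a (low, high) range."""
    return (value_range[0] + value_range[1]) / 2

def speedup(entry: dict) -> float:
    """Ratio of RTX 5080 throughput to RTX 5060 Ti throughput at range midpoints."""
    return midpoint(entry["rtx5080"]) / midpoint(entry["rtx5060ti"])

speedups = {name: speedup(entry) for name, entry in MEASURED.items()}
for name, ratio in speedups.items():
    print(f"{name}: RTX 5080 is about {100 * (ratio - 1):.0f}% faster")
```

The ratio is fairly stable across model sizes, which is consistent with the speed difference coming from the hardware rather than from any one model.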
Cost-Performance Option, Small Straight-A Student, and Gemma 4 Underperformance
Looking at the numbers in the post, several notable positions emerge outside the top tier.
Cost-Performance Option: Qwen 2.5 Coder 7B
Qwen 2.5 Coder 7B is easy to rate as an “MVP candidate for the entire benchmark.” Its accuracy is close to the upper tier, memory usage stays within the medium range, and speed is in the range suitable for daily use. On our site’s RTX 5060 Ti 16GB we consistently observed 56–68 tok/s, and the model can realistically run on a notebook PC with 16GB of system memory. For those seeking daily coding support, it is difficult to find a better balance.
Small but Mighty Performer: Phi 4 Mini 3.8B
Another one to watch is Phi 4 Mini 3.8B. Despite its lower parameter count, it outscores several middle-tier models in the post’s accuracy table and belongs to the high-speed group. In our site’s RTX 5080 environment it measured 128–145 tok/s, with VRAM usage held to about 3.1GB. It is a strong option for environments with strict memory limits or cases where response speed is the top priority.
Small Classes Can Also Be Practical Depending on Use
A growing number of community members report that even small models in the 1.7B class are perfectly usable when they fit the purpose. For lightweight chatbot assistance, code autocomplete, and automation of routine tasks, response speed and low memory usage can matter more than accuracy, which suggests that benchmark ranking and adoption ranking are not the same thing.
What Is Happening with Gemma 4’s Underperformance?
Notable is the low scoring of the Gemma 4 series. The author emphasized that “Gemma 4 31B scoring below Llama 3.2 1B was reproduced multiple times,” leaving the judgment split on whether this is a weakness of the model itself or poor compatibility with the benchmark’s measurement conditions.
There is also a view that it is too early to say definitively that “Gemma 4 is weak.” New-generation models may handle prompt preprocessing and templates differently from previous generations, and there have been past reports of scores coming out unfairly low because llama.cpp-based execution environments had not yet caught up with a new model. The author’s caution is sound, and it is more honest to read this low score as “benchmark results evaluate the execution environment and quantization together with the model” than to conclude that it reflects a defect in the model itself.
Practical Perspective: Fanless Design and the Entire Environment
A view that comes up repeatedly in r/LocalLLaMA is that the surrounding execution environment shapes the experience more than the differences between individual models. The same tendency runs through the reactions to the original post.
The Impact of the MacBook Air M5’s Fanless Design
The M5 MacBook Air has a fanless design. The fact that this hardware was used for verification has a non-negligible impact on how to read the numbers. While speeds close to the published values may be achieved during short inference runs, the chassis temperature may reach its limit during long continuous generation, and effective speeds then drop due to thermal throttling, a behavior repeatedly reported in similar thin laptops.
The benchmark’s tok/s values are more like “instantaneous maximum wind speed,” and different numbers may appear in actual operations running heavy processing all day. It is difficult to draw conclusions solely from the numbers in the same post, making separate continuous operation tests desirable.
In our site’s desktop GPU environment (RTX 5080 16GB + RTX 5060 Ti 16GB / i7-14700F / RAM 96GB), there is ample cooling, so speed drops during long-term operation are less likely, and the experience changes even with the same model. In one case of continuous generation for 2 hours with Qwen3-Coder-30B-A3B, the GPU temperature stayed around 60°C and tok/s fluctuation remained within ±5%. Whether you keep everything on a notebook PC or offload to a desktop machine often affects practicality more than the size of the model you use.
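A continuous-operation test of this kind comes down to sampling tok/s at intervals and checking how far the readings stray from their mean. The sketch below shows one way to compute that spread. The sample values are placeholders made up for illustration, not measurements from our environment or from the Reddit post.

```python
from statistics import mean

def max_deviation_pct(samples: list) -> float:
    """Largest deviation from the mean, as a percentage of the mean."""
    avg = mean(samples)
    return max(abs(s - avg) for s in samples) / avg * 100

# PLACEHOLDER tok/s readings taken at intervals during a long run (not real data)
samples = [41.0, 40.2, 42.1, 39.8, 41.5, 40.6]

spread = max_deviation_pct(samples)
print(f"max deviation from mean: {spread:.1f}%")
print("stable" if spread <= 5.0 else "throttling suspected")
```

On a fanless laptop the same check would be expected to show the readings drifting downward over time, which a single short benchmark run cannot reveal.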
The Design of the Entire Environment Determines the Experience
Notable in r/LocalLLaMA comments is the increasing number of users talking about configuration rather than model names, such as “Ollama + Open WebUI + SearXNG.” There are indications that if any single element—execution runtime (llama.cpp or Ollama, etc.), frontend, presence of RAG, choice of quantization, GPU offload ratio, or input length limit—has poor compatibility, even a top-benchmark model can be difficult to use in daily life. Rather than chasing the benchmark #1, choosing from the perspective of “which configuration will you still be using in two weeks” may ultimately lead to higher satisfaction.
Recommended Choices and Selection Guidelines by Use Case
We abstract the benchmark values by tier and organize the selection method by use case.
If Image Generation (Stable Diffusion / ComfyUI, etc.) is the Main Use
This moves beyond the realm of language models alone, but if image generation is the primary use, these benchmark results are almost irrelevant. Image generation centers on VRAM capacity and bandwidth, an area where you should base your choice on a desktop GPU environment. Completing operations on a MacBook Air is not realistic, and it is reasonable to first secure a desktop machine with at least an RTX 5060 Ti 16GB class.
If Local LLM Inference (Ollama, etc.) is the Main Use
There is value in making Qwen3-Coder-30B-A3B-Instruct your first choice. Thanks to the MoE structure, it runs lightly relative to its total parameters. Our site’s actual measurements observed 26–32 tok/s on the RTX 5060 Ti 16GB, making it possible for daily use on a notebook PC with a high-end configuration or a desktop machine + 16GB class VRAM GPU. If memory is tight, honestly dropping to the cost-performance option, Qwen 2.5 Coder 7B, is the realistic solution.
If AI Coding Tools (Claude Code / Copilot, etc.) are the Main Use
In this use case, cloud API-based tools are mainstream over local LLMs, and on the PC side, CPU, RAM, and SSD speed are more effective than the GPU. If using local completion in parallel, small high-speed models like Phi 4 Mini 3.8B are strong candidates. Since the immediacy of response affects the work experience, it is more reasonable to choose based on speed rather than accuracy.
If Budget and Low Memory Usage are the Top Priorities
This is a choice close to Qwen 2.5 Coder 7B being the only option. Accuracy approaches the upper tier, memory usage is kept to medium levels, and it is within the range of realistically running on a notebook PC with 16GB of system memory. If asked “which one to install if only one,” this would likely be the first candidate.
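The selection logic in this section can be sketched as a simple filter over the VRAM figures from our measurement table. The 1 GB headroom is an assumed safety margin for context and runtime overhead, and the list order simply follows the table, so this is a starting point rather than a ranking.

```python
# VRAM usage in GB at Q4_K_M, copied from the measurement table (table order kept)
VRAM_GB = [
    ("Qwen3-Coder-30B-A3B-Instruct", 14.5),
    ("Qwen 2.5 Coder 32B", 19.0),
    ("Qwen 2.5 Coder 14B", 9.2),
    ("Qwen 2.5 Coder 7B", 5.4),
    ("Phi 4 Mini 3.8B", 3.1),
    ("Gemma 3 12B", 7.8),
]

def candidates(vram_gb: float, headroom_gb: float = 1.0) -> list:
    """Models whose measured VRAM usage plus headroom fits in the given budget."""
    return [name for name, used in VRAM_GB if used + headroom_gb <= vram_gb]

print("16 GB card:", candidates(16))
print(" 8 GB card:", candidates(8))
```

On a 16GB card everything except the dense 32B model passes the filter, while an 8GB budget narrows the list to the 7B and 3.8B options, which matches the recommendations above.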
Frequently Asked Questions
Q. Can 30B class models realistically run on a MacBook Air M5?
It depends on the unified memory capacity of the M5 generation, but for 30B-class MoE models with effective quantization, loading itself is considered possible in many cases. However, because of the fanless design, effective speeds may drop during continuous operation due to thermal throttling, and serious long-running workloads would be more stable in a desktop environment.
Q. Are Qwen 3.6 35B-A3B and Qwen3-Coder-30B-A3B-Instruct different models?
The official name of the model called “Qwen 3.6 35B-A3B” in the community is Qwen3-Coder-30B-A3B-Instruct. It is listed on Hugging Face and in the official Ollama library (qwen3-coder:30b). Although the way total parameters are counted and the nicknames vary, it seems safe to view them as the same model.
Q. Does Gemma 4 have no value to use at this time?
We want to avoid definitive statements. Factors for the low scores in Reddit’s verification may include the support status of the execution environment at the time of the benchmark or template compatibility. Since the evaluation of this series could change with future updates to the execution environment, a stance of “re-evaluating in a few months” would be appropriate.
Q. What are the precautions for using local LLMs on a notebook PC?
Fanless and thin machines tend to accumulate heat during long inference runs, with the risk that effective speeds fall below published values. Power draw also reaches a level that assumes a constant AC connection, so continuous use on battery power is considered unrealistic. Moving the workload to a desktop environment, or designing operations around small high-speed models, would be the realistic solution.
Q. Can Qwen3-Coder-30B-A3B run on an RTX 5060 Ti 16GB?
In our site’s actual measurements, running the Q4_K_M quantized version of Qwen3-Coder-30B-A3B on a single RTX 5060 Ti 16GB resulted in VRAM usage contained to about 14.5GB, observing 26–32 tok/s. Since the MoE structure keeps the active parameters at 3.3B, it is said to realistically run on 16GB class GPUs even with 30B total parameters in many cases.
Q. Are there uses for small models like the 1.7B class?
They can be used sufficiently depending on the purpose. Voices from the community indicate that small models can reach practical utility in areas where response speed and low memory usage are effective, such as lightweight chatbot assistance, code completion, and automation of routine tasks. It is more realistic to judge based on fit for the purpose rather than chasing benchmark rankings.
Summary of Selection and Question to the Reader
The conclusion drawn from this verification is clear: “If aiming for the highest accuracy, Qwen3-Coder-30B-A3B-Instruct; for balanced daily operation, Qwen 2.5 Coder 7B; for a small, low-memory option with fast responses, Phi 4 Mini 3.8B.” By centering on these three and selecting according to your local hardware and use case, you are unlikely to go far wrong. At the same time, do not forget that the benchmark winner is merely the winner of “that time and that environment.” If any one of the quantization settings, execution runtime, frontend, or cooling conditions differs, the experience will change significantly.
Our site continuously measures local LLMs in a desktop environment with RTX 5080 16GB and RTX 5060 Ti 16GB, but we observe that the focus of model selection differs clearly between thin laptops and desktop GPUs. Environments with effective cooling tend to lean toward “running heavier models stably,” while thin laptops tend to lean toward “choosing lighter models prioritizing response speed.”
We ask the reader: Are you choosing the “benchmark #1,” or are you choosing the “configuration you will still be using in two weeks?” There is no single correct answer; it is a matter of which fits your work rhythm. We would love to hear your thoughts in the comments or search feedback.
| Item | Details |
|---|---|
| Number of Verified Models | 21 models (Reddit benchmark post) |
| Evaluation Metrics | HumanEval+ / pass@1 / tok/s / Memory Usage |
| Leading Model | Qwen3-Coder-30B-A3B-Instruct (Community nickname: Qwen 3.6 35B-A3B, Ollama official tag qwen3-coder:30b) |
| Verification Hardware (Reddit Original Post) | MacBook Air M5 (Fanless design) |
| Our Site Comparison Environment | RTX 5080 16GB + RTX 5060 Ti 16GB / i7-14700F / RAM 96GB |
| Our Site Actual Measurement Leading tok/s | Qwen3-Coder-30B-A3B Q4_K_M, approx. 38–44 tok/s on RTX 5080 |
| Source Community | Reddit r/LocalLLaMA |

