Qwen3.6-35B-A3B: Local MoE Multimodal LLM Guide

Local LLMs

Qwen3.6-35B-A3B is a MoE-based multimodal LLM with 35 billion total parameters and 3 billion active parameters, distributed by Alibaba on HuggingFace. It fits within 34.87GB in its FP8 quantized version, bringing it within the reach of high-end consumer GPUs. It is also a model of note for its official claim of outperforming Claude Sonnet 4.5 in most vision-language benchmarks, making it a key model for individual users considering local execution.

Key Points of This Article

  • Qwen3.6-35B-A3B is a MoE model with 35 billion total / 3 billion active parameters; the FP8 version is 34.87GB.
  • It claims to surpass Claude Sonnet 4.5 in vision-language benchmarks, but independent verification for text inference and coding is pending.
  • Since it is not supported by Ollama, the practical solution for individuals to try it is LM Studio v0.4.12 or later, or via HuggingFace/vLLM.

Overview: Positioning of Qwen3.6-35B-A3B

Qwen3.6-35B-A3B is the open-source version of the Qwen series developed by Alibaba, featuring a multimodal LLM with an integrated vision encoder. The series includes the cloud-based commercial version “Qwen3.6-Plus” (supporting 1M context), which was released first, with this model following its release on HuggingFace.

There are two key design aspects. First is the compression efficiency of MoE (Mixture of Experts), where only 3 billion parameters are active out of 35 billion total. Second is the integration of vision capabilities, which forms the basis of the official claim that it “surpassed Anthropic Claude Sonnet 4.5 in VQA benchmarks.”

However, this claim is limited to “most vision-language benchmarks,” with no official comparisons provided for text inference, coding, or long-document summarization. This point should be kept in mind as a premise for interpretation.

Detailed Specifications and Storage Requirements

The available weights come in two types: BF16 (full precision) and FP8 quantized. The sizes are as follows.

Precision / Quantization Size / Requirements
BF16 (Full Precision) 71.9GB; cannot reside on a single consumer GPU
FP8 Quantized Version (Qwen3.6-35B-A3B-FP8) 34.87GB; requires CPU offloading even on RTX 5090 (32GB)
Total Parameters 35 billion (MoE)
Active Parameters 3 billion (activated during inference)
Multimodal Support Integrated Vision Encoder

The 1M context is limited to the Plus version; the correct approach is to follow the official repository’s description for the maximum context of the publicly released 35B-A3B. Alibaba creates buzz with the OSS version while guiding customers requiring long-context processing to the Plus version, employing a two-stage structure.

Prerequisites for Local Execution and Runtime

The officially verified runtimes are LM Studio v0.4.12 or later, HuggingFace transformers, and vLLM. Note that Ollama is not supported, and there is no official slug in the Ollama registry. Ollama users can manually register the model by downloading GGUF from HuggingFace and using ollama create, but this requires preparation for trial and error with the Modelfile.

For individual users to try it quickly, the process is as follows: Update LM Studio to v0.4.12 or later, load the FP8 version, and measure VRAM usage and token generation speed on your GPU. When running the FP8 version on an RTX 5090, the actual speed may drop to around 15–30 tokens/second when using CPU offloading (including estimates due to unpublished official benchmarks).

The BF16 version (71.9GB) does not run on a single consumer GPU. Even on an RTX 5090 (32GB), full VRAM residency is impossible, requiring 96GB+ of CPU RAM and partial offloading. If you wish to try it, please choose the FP8 version or lower quantized versions.

There is also the option of using GGUF quantized versions distributed by the community (e.g., Q4_K_M). If compressed to 4-bit, it fits within approximately 17–20GB, potentially reaching practical speeds with a configuration of RTX 4090, 5080, 5060 Ti 16GB, and some CPU offloading.

Performance and Features: Combination of MoE and Vision-Language

The advantage of the MoE design is lower inference cost. Even if both claim to be “35B,” a Dense model runs 35 billion calculations, whereas MoE runs only 3 billion. Depending on the design, inference speed can differ by about 10x. The core of this model is the ability to leverage 35 billion parameters of knowledge capacity at inference speeds equivalent to 8GB-class local LLMs.

Regarding vision-language tasks, Qwen series has a strong track record in VQA, extending this lineage. Following Alibaba’s official claims, it becomes a strong candidate for applications such as RAG with images, structuring after document OCR, and multilingual support.

On the other hand, independent verification is not yet complete. It is reasonable to interpret that its capabilities in text inference and coding are “pending verification” until scores are published on third-party benchmarks like LiveBench, MMLU-Pro, HumanEval, and SWE-bench Verified. MoE models are said to tend to lose performance with quantization (observed in the Mixtral series), so the degree of quality degradation when 4-bit/5-bit GGUF versions appear will be a key observation point.

Comparison with Existing Models: Distance Between Local and Cloud

Since independent verification of Qwen3.6-35B-A3B itself is not yet available, we include the results of our site’s local 8-model actual measurement (RTX 5060 Ti 16GB, agent_bench 11 tasks) as a reference. Please read this as a benchmark for how much local LLMs in the same class differ from Claude Sonnet.

Model Match Rate (agent_bench 11 tasks)
claude-sonnet-4-6 (API) 10/11 tasks matched (91%)
Gemma 4 (8B) (Ollama: gemma4:latest) (Local 8B) 10/11 (91%)
Phi-4 14B (Ollama: phi4:14b) (Local) 10/11 (91%)
DeepSeek R1 8B (Ollama: deepseek-r1:8b) (Local) 10/11 (91%)
Gemma 3 12B (Ollama: gemma3:12b) (Local) 9/11 (82%)
Mistral 7B (Ollama: mistral:7b) (Local) 9/11 (82%)

Local models in the 8B–14B class have reached scores equivalent to the Claude Sonnet 4.6 API on agent_bench. It can be said that the naive intuition that “local is inferior to cloud” is breaking down for specific task sets.

However, the 11 tasks in agent_bench are not comprehensive; they do not include complex long-text inference, multi-step planning, or vision-language tasks. Similar to Alibaba’s claim being limited to “most vision-language benchmarks,” benchmark-dependent numbers should not be generalized without confirming the scope of the target tasks. While Qwen3.6-35B-A3B is likely excellent in vision-language benchmarks, its superiority in text inference and coding requires separate verification.

Use Cases by Category

We organize realistic choices available at this time from the perspective of those considering adoption.

IT Directors / DX Officers

For the short term, “not rushing to act” is the correct answer. This is because independent verification is incomplete, making it impossible to judge the validity of benchmark claims; Ollama is unsupported, raising the hurdle for PoC; and it takes time for the quality of quantized versions to stabilize. In the medium term, it is sufficient to keep it in mind as a candidate for use cases where confidential data cannot leave the premises (e.g., legal document review, RAG handling personnel information, medical record processing).

Local LLM Operators

The top priority is confirming the update to LM Studio v0.4.12 or later. Check if the FP8 version can be loaded, and measure VRAM usage and token generation speed on your GPU. If GGUF quantized versions are released, you may choose to prioritize those.

SaaS Product Developers

If verifying via API, the fastest way is to call Qwen3.6-Plus (1M context version) via Alibaba Cloud. For products concerned with domestic data residency, you must carefully review region selection and data handling agreements. If running the OSS version in-house, economic viability may be established in cases where existing API costs exceed roughly 1,000,000 JPY (about US$6,000–6,500) per month, such as RAG including images or structuring after OCR. For cases under roughly 100,000 JPY (about US$600–650) per month, continuing with the API remains advantageous.

Individual Developers / Prompt Designers

Bookmark the model information on HuggingFace and try it via LM Studio when third-party GGUF quantized versions are released. Rather than forcing BF16 or FP8 to run, waiting for 4-bit to 5-bit quantization is the optimal solution for time efficiency.

Local LLM introductions are covered in a separate article. If you have not yet touched Ollama or LM Studio, we recommend reading that first.

Frequently Asked Questions

Q. What is the difference between Qwen3.6-35B-A3B and Qwen3.6-Plus?

Qwen3.6-Plus is the commercial API version via Alibaba Cloud, supporting 1M context. Qwen3.6-35B-A3B is the open-source version distributed on HuggingFace, a MoE model with 35 billion total and 3 billion active parameters. Some features, such as 1M context, are limited to the Plus version.

Q. Can it run on a personal PC?

The FP8 quantized version (34.87GB) requires CPU offloading even on an RTX 5090 (32GB). The BF16 version (71.9GB) is impossible on a single consumer GPU. The most realistic short route is to wait for the 4-bit GGUF quantized version created by the community and load it with LM Studio v0.4.12 or later.

Q. Can it be used with Ollama?

It is not currently registered in the Ollama registry. While there is a method to manually register GGUF files from HuggingFace using ollama create, it is more reliable to wait for the official slug registration. LM Studio has official support (v0.4.12 or later), so if you are not an Ollama user, this is faster.

Q. Is it really more performant than Claude Sonnet 4.5?

Alibaba’s official claim is that it “surpasses in most vision-language benchmarks,” but comparisons in text inference and coding have not been published. Until independent third-party benchmarks are complete, the appropriate evaluation is “promising for vision-language tasks, general performance requires verification.”

Q. Is commercial use allowed?

Usage conditions follow the official model card on HuggingFace. The Qwen series has historically been provided under conditions equivalent to Apache 2.0 for commercial use, but the official conditions for 3.6-35B-A3B must be confirmed on the model card. If using the Plus version as a commercial API, Alibaba Cloud’s terms of service apply.

Summary

We narrow down the significance of Qwen3.6-35B-A3B to three points.

First, the design of 35-billion-total / 3-billion-active MoE has established a scenario where free OSS stands alongside commercial APIs in the vision-language field. Alibaba’s intent in directly naming Claude Sonnet 4.5 can be read as a challenge to the necessity of payment itself.

Second, the official claim of “winning in vision-language benchmarks” is a conditional claim with limited scope, and independent verification for text inference and coding has not yet been released. Until third-party benchmarks are complete, we want to avoid both excessive expectations and excessive disappointment.

Third, the immediate response differs by role. IT directors should wait for independent verification. Local LLM operators should prepare for LM Studio v0.4.12 or later and wait for GGUF quantized versions. SaaS developers should verify the Plus version via Alibaba Cloud and begin cost calculations. Individual developers should watch the HuggingFace model information. As a turning point where the boundary between closed APIs and OSS is practically dissolving in the vision-language field, this model will be remembered as a key entry.

The information in this article is as of the date of publication. Evaluations may change due to product updates or fluctuations in third-party benchmarks, prices, and supported runtimes. Re-verification is recommended for content after a certain period has passed.

Copied title and URL