Fixing RAM Exhaustion When Running Gemma 4 with llama.cpp

A model was loaded onto a GPU with 32GB of VRAM. Plenty of VRAM was still free, yet the process was forcibly terminated after just a few prompts. The cause was not the GPU but exhausted system RAM.

Abnormally high system RAM consumption while running Gemma 4 31B with llama.cpp has become a topic of discussion on Reddit’s r/LocalLLaMA community. The reporter stated that llama.cpp was killed by the Linux OOM Killer despite a generous setup of 32GB VRAM and 64GB system RAM. Multiple users have confirmed similar behavior, and reproduction has been reported on RTX 5090 systems as well.

In the world of local LLMs, the primary concern is usually “whether VRAM is sufficient,” but this case challenges that assumption. This article examines the causes of the phenomenon and summarizes the countermeasures available at this time.

Key Points of This Article

  • Running Gemma 4 with llama.cpp at a long context length can trigger OOM through exhaustion of system RAM, not VRAM.
  • The main cause is believed to be KV cache checkpoints accumulating in system RAM.
  • Shortening the context length (the -c value) is the most reliable countermeasure. In the reported case, 64GB of RAM could not sustain a 100k context.

The Phenomenon of System RAM Exhaustion Instead of VRAM

First, let’s accurately grasp the reported symptoms.

The Reddit poster’s environment consisted of 32GB VRAM and 64GB (DDR5) system RAM. When loading the Gemma 4 31B Unsloth quantized model (UD_Q5_K_XL) with a context length of 102400 (approx. 100k tokens), the model itself fit within VRAM, and there was ample VRAM available immediately after loading.

The problem surfaced from there. After sending a few prompts, system RAM usage surged rapidly, reaching 63GB out of 64GB, resulting in the Linux OOM Killer terminating the llama.cpp process.

Interestingly, the poster later verified this on another PC equipped with 128GB RAM (DDR4). While it did not crash immediately, reports indicate that RAM usage rose to 80GB after processing just a few prompts of tens of thousands of tokens, and the trend was still increasing.

Another key point to note is that lowering the quantization level did not resolve the issue. Switching to Q4 quantization reduced VRAM usage to approximately 23GB, but the abnormal consumption of system RAM remained unchanged. This fact suggests that the root of the problem lies not in the model’s weight size, but in another memory area.

Before trying countermeasures, please check the following environment information first.

  • OS: Linux (where the OOM Killer activates) or Windows (where the symptom appears as page file growth)
  • GPU: Model name and VRAM capacity (checkable via nvidia-smi)
  • System RAM: Installed capacity and current usage (checkable via free -h)
  • llama.cpp: Build date or commit hash (checkable via the --version option)
  • Model Used: Model name, quantization format, and context length settings
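
As a rough Linux-only companion to the checklist above, the following sketch reads the same totals that free -h reports by parsing /proc/meminfo. The function names are illustrative, not part of any tool mentioned in this article.

```python
from pathlib import Path


def parse_meminfo(text: str) -> dict[str, int]:
    """Parse /proc/meminfo-style text into {field: number}.

    Most fields are reported in kB (KiB); a few counters have no unit.
    """
    fields = {}
    for line in text.splitlines():
        name, sep, rest = line.partition(":")
        parts = rest.split()
        if sep and parts and parts[0].isdigit():
            fields[name.strip()] = int(parts[0])
    return fields


def summarize(fields: dict[str, int]) -> str:
    """Format total/available RAM and total swap in GiB."""
    kib_per_gib = 1024 * 1024
    return (
        f"RAM total {fields.get('MemTotal', 0) / kib_per_gib:.1f} GiB, "
        f"available {fields.get('MemAvailable', 0) / kib_per_gib:.1f} GiB, "
        f"swap total {fields.get('SwapTotal', 0) / kib_per_gib:.1f} GiB"
    )


if __name__ == "__main__":
    meminfo = Path("/proc/meminfo")
    if meminfo.exists():  # present on Linux only
        print(summarize(parse_meminfo(meminfo.read_text())))
    else:
        print("/proc/meminfo not found; use your OS's memory monitor instead")
```

Recording this output together with the nvidia-smi and llama.cpp version information makes it easier to compare your setup with the reported ones.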

Why Is So Much System RAM Consumed?

The author believes that llama.cpp’s memory management structure is behind this counterintuitive phenomenon where system RAM is exhausted despite having ample VRAM.

The Relationship Between KV Cache and Context Length

When running an LLM with llama.cpp, memory is largely divided into two uses. One is for model weights (parameters), which are primarily placed in VRAM. The other is the KV cache, an area for holding past token information during inference.

KV cache consumption increases proportionally with context length. If the context length is set to 100k tokens, a correspondingly massive KV cache area is required.
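
To make the proportionality concrete, here is a minimal estimate for a plain full-attention transformer. The layer count, KV head count, and head dimension below are hypothetical placeholders, not Gemma 4’s actual architecture, and models that use sliding-window attention on some layers would come in lower. The bytes-per-element values follow the GGML block layouts for q8_0 and q4_0.

```python
# Bytes per element for KV cache types. q8_0 and q4_0 store blocks of
# 32 values plus one 2-byte scale: 34 and 18 bytes per block respectively.
BYTES_PER_ELEMENT = {"f16": 2.0, "q8_0": 34 / 32, "q4_0": 18 / 32}


def kv_cache_bytes(n_ctx, n_layers, n_kv_heads, head_dim, cache_type="f16"):
    """Estimate K+V cache size in bytes when every layer caches every token."""
    per_token = 2 * n_layers * n_kv_heads * head_dim  # K and V elements
    return n_ctx * per_token * BYTES_PER_ELEMENT[cache_type]


if __name__ == "__main__":
    # HYPOTHETICAL architecture values, chosen only for illustration.
    arch = dict(n_layers=60, n_kv_heads=8, head_dim=128)
    for n_ctx in (16384, 32768, 102400):
        gib = kv_cache_bytes(n_ctx, cache_type="q8_0", **arch) / 2**30
        print(f"n_ctx={n_ctx:>6}: about {gib:.1f} GiB at q8_0")
```

Whatever the real architecture values are, the estimate scales linearly with n_ctx, which is why lowering the context length has such a direct effect.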

Comments on Reddit have pointed out the existence of “context checkpoints.” In llama.cpp, checkpoints are reportedly created approximately every 8192 tokens. For MoE models, each checkpoint is reported to be about 533MB, and potentially larger for Dense models. A 100k context (102400 / 8192 = 12.5) would therefore produce about 12 checkpoints, occupying over 6GB (12 × 533MB ≈ 6.4GB) from this alone. Moreover, it is highly likely that these accumulate in system RAM rather than VRAM.

The behavior of KV cache checkpoints accumulating in system RAM is believed to stem from llama.cpp’s design. It is dangerous to judge “there is still plenty of room” based solely on free VRAM capacity.
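
The checkpoint arithmetic can be sketched as follows. The 8192-token interval and the 533MB size are the community-reported figures quoted in this section, not values confirmed from llama.cpp’s source, and the function name is illustrative.

```python
def checkpoint_footprint(n_ctx, interval_tokens=8192, mb_per_checkpoint=533):
    """Return (checkpoint count, total MB) for one fully filled context.

    Defaults are the figures reported on Reddit and should be treated
    as assumptions.
    """
    count = n_ctx // interval_tokens  # only completed intervals count
    return count, count * mb_per_checkpoint


if __name__ == "__main__":
    for n_ctx in (16384, 32768, 102400):
        count, total_mb = checkpoint_footprint(n_ctx)
        print(f"-c {n_ctx}: {count} checkpoints, about {total_mb / 1000:.1f} GB")
```

Under these assumptions the figure covers a single filled context only, so it does not by itself account for the tens of gigabytes reported.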

Potential Differences in Behavior Between Dense and MoE

The Gemma 4 family includes two variants: the Dense 31B (the subject of this article) and the MoE 26B variant (activating 8 out of 128 experts). Memory behavior may differ depending on the architecture. In MoE models, different experts (sub-networks) are activated depending on the input, which may require additional buffers for expert switching during inference. In Dense models, by contrast, every token passes through all parameters, so memory pressure from the KV cache is the more likely primary cause.

The Gemma 4 31B used by the Reddit poster is the Dense variant, so it is highly likely that the accumulation of KV cache and checkpoints, rather than MoE-specific expert buffers, drove up system RAM consumption.

However, there is currently insufficient backing from clear technical documentation for this point, and it remains speculation within the community.

Reproduction Conditions and Scope of Impact

Summarizing the reported conditions in a table reveals the contours of the problem.

| Environment | VRAM | System RAM | Quantization | Context Length | Result |
|---|---|---|---|---|---|
| PC1 | 32GB | 64GB DDR5 | UD_Q5_K_XL | 102400 | RAM reached 63GB → OOM Kill |
| PC1 | 32GB | 64GB DDR5 | Q4 | 102400 | VRAM 23GB used, RAM exhaustion not improved |
| PC2 | Multi-GPU | 128GB DDR4 | UD_Q5_K_XL | 102400 | RAM reached 80GB, still rising |

There are three key points to note.

First, RAM consumption does not improve even when lowering the quantization level. Changing from Q5 to Q4 reduced VRAM usage to about 23GB, but the abnormal consumption of system RAM remained the same. This is evidence that factors independent of model weight size are at play.

Second, the issue is reproducible across multiple machines. The reporter verified this on two machines with different hardware configurations and confirmed the same behavior. It is likely caused by memory management on the llama.cpp side rather than a specific hardware or driver issue.

Third, Gemma 4 is not yet fully stable for general local operation. Around the same time, reports on r/LocalLLaMA indicated that Gemma 4’s tool-calling functionality does not work correctly and that inference exhibits “lazy” behavior (such as skipping necessary external tool calls). While there is no direct causal link to the RAM exhaustion problem, the author views this as circumstantial evidence that support for the model is still maturing.

Countermeasures Available at This Time

A fundamental fix requires waiting for the llama.cpp development team, but there are several countermeasures users can try now.

Guidelines for Context Length and RAM Capacity

The most reliable countermeasure is to shorten the context length (-c option). The reporter themselves stated, “Even with ample VRAM, I had to lower the context.”

Please refer to the following figures as a guideline (these are estimates at present and may change with future llama.cpp updates).

| System RAM | Recommended Context Length (30B Class) | Notes |
|---|---|---|
| 32GB | 8192-16384 | Be conservative, considering OS and other process usage |
| 64GB | 16384-32768 | Can handle somewhat longer prompts |
| 128GB | 32768-65536 | 100k is still risky. Monitoring RAM usage is recommended |

If you want to fully utilize a 100k context, even 128GB of system RAM might be insufficient. In practice, the safe operating line is to stay within the ranges above, which means at most around 32k to 64k even with 128GB.

The reporter’s parameters were -ngl 999 -c 102400 -fa on --cache-type-k q8_0 --cache-type-v q8_0. The most effective change among these is lowering the -c value. Try -c 32768 or -c 16384 first to see whether RAM consumption stabilizes.
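
As a small sketch, the reporter’s flags can be assembled with a reduced context length like this. The source does not say which llama.cpp binary was used, so llama-server and the model path model.gguf are assumptions for illustration; -m is llama.cpp’s model-path option.

```python
import shlex


def build_args(binary, model_path, n_ctx=32768, cache_type="q8_0"):
    """Build an argument list from the reporter's flags with a chosen -c."""
    return [
        binary, "-m", model_path,
        "-ngl", "999",                   # offload all layers to the GPU
        "-c", str(n_ctx),                # context length: the key knob here
        "-fa", "on",                     # flash attention, as in the report
        "--cache-type-k", cache_type,
        "--cache-type-v", cache_type,
    ]


if __name__ == "__main__":
    # ASSUMED binary name and placeholder model path.
    for n_ctx in (32768, 16384):
        print(shlex.join(build_args("llama-server", "model.gguf", n_ctx=n_ctx)))
```

Keeping every other flag identical while changing only -c makes it easier to attribute any change in RAM usage to the context length.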

Options to Try in llama.cpp Settings

Aside from shortening the context length, there are a few settings worth trying.

  1. Lower the KV cache quantization level: The reporter used --cache-type-k q8_0 --cache-type-v q8_0. Changing this to q4_0 may reduce KV cache memory consumption by about half. However, the impact on output quality varies by model and task.
  2. Adjust batch size: Reducing -b (batch size) or -ub (micro-batch size) may shrink temporary buffers during processing. Try lowering -b from its default to 512 or 256 and observe the effect.
  3. Expand swap space: This is not a fundamental solution, but on Linux, adding a swap file can delay the activation of the OOM Killer. However, inference speed drops significantly once swap is accessed, so this should only be considered an emergency measure.

Lowering the KV cache quantization level too far (q4_0 or lower) may degrade the quality of generated text, so check the output carefully after making changes. Also note that placing swap space on an SSD may shorten the SSD’s lifespan due to frequent writes.
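
To judge whether any of these settings helps, it can be useful to watch the resident memory of the llama.cpp process while sending prompts. The following Linux-only sketch polls /proc/<pid>/status; the script name and function names are illustrative. RSS may not capture every allocation, so cross-checking with free -h is still worthwhile.

```python
import sys
import time
from pathlib import Path


def parse_vmrss_kib(status_text):
    """Return the VmRSS value (KiB) from /proc/<pid>/status text, or None."""
    for line in status_text.splitlines():
        if line.startswith("VmRSS:"):
            return int(line.split()[1])
    return None


def watch(pid, interval_s=5.0):
    """Print the resident set size of `pid` until the process exits."""
    status = Path(f"/proc/{pid}/status")
    while status.exists():
        try:
            rss = parse_vmrss_kib(status.read_text())
        except OSError:  # the process exited between the check and the read
            break
        if rss is not None:
            print(f"pid {pid}: RSS {rss / 2**20:.2f} GiB", flush=True)
        time.sleep(interval_s)


if __name__ == "__main__":
    if len(sys.argv) == 2 and sys.argv[1].isdigit():
        watch(int(sys.argv[1]))
    else:
        print("usage: python watch_rss.py <pid of the llama.cpp process>")
```

If the printed value keeps climbing from prompt to prompt at a fixed context length, that matches the accumulation pattern described in this article.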

Additionally, llama.cpp is under active development, and multiple builds were released in April 2026, including b8838 (release confirmed on April 18). Builds for macOS, Linux, and Windows are available with support for GPU backends such as CUDA 12/13 and ROCm 7.2. Discussion of this RAM exhaustion problem has also begun in llama.cpp’s GitHub Discussions, and the author believes there is a good chance it will be improved in future versions.

In our site’s test environment, running gemma4:26b (MoE variant) via Ollama recorded 14.8GB VRAM usage and an inference speed of 36.6 tokens/sec. However, this was measured with Ollama’s default context length, which differs significantly from setting a 100k context in llama.cpp. If the context length is kept short, this reference data suggests that inference of the Gemma 4 MoE variant can still operate at a practically useful speed even in a 16GB VRAM environment. Note that the RAM exhaustion discussed in this article is a phenomenon prominent in long-context operation of the Dense 31B, and the conditions for occurrence differ from the MoE variant 26B.

How Much System RAM Is Needed for a Local LLM Environment?

This case contains an important lesson for considering hardware configurations for local LLMs. Focusing only on “VRAM capacity” can lead to unexpected bottlenecks.

The memory in a local LLM environment needs to be understood as having a three-layer structure.

Layer 1: VRAM (GPU side) — Stores model weights and the main part of the KV cache. The most critical resource directly linked to inference speed.

Layer 2: System RAM (CPU side) — Used for model loading, KV cache checkpoints, and temporary buffers during inference. It also serves as the overflow destination for data spilling over from VRAM.

Layer 3: Storage (Swap) — The final fallback destination when RAM is insufficient. Speed is orders of magnitude slower, and practicality drops significantly when relying on this.

The estimated guidelines for system RAM by model size and context length are as follows. Please note that these are estimates derived by the author from source information and general memory consumption trends, and may vary depending on model architecture and quantization format.

| Model Size | Short Context (8k or less) | Medium Context (8k to 32k) | Long Context (over 32k) |
|---|---|---|---|
| 7B-8B class | 16GB | 32GB | 32-64GB |
| 14B-26B class | 32GB | 64GB | 64-128GB |
| Over 30B class | 64GB | 64-128GB | 128GB or more |

When considering the configuration of an AI PC, you should be mindful of investment in system RAM, not just the GPU VRAM budget. Prices for DDR5 memory are trending downward, and a 32GB×2 (64GB) configuration can be obtained for around 20,000 yen as of April 2026. If you plan to operate with long contexts, it is reassuring to install 64GB or more from the start.

NVIDIA is also advancing technology development for VRAM efficiency, and new technologies like RTX Neural Texture Compression are beginning to contribute to reducing VRAM usage. However, it is necessary to distinguish that this has no direct effect on system RAM issues like the one discussed here and is strictly a technology within the context of VRAM optimization.

Summary

The problem of system RAM exhaustion when running Gemma 4 with llama.cpp in long context is believed to be primarily caused by KV cache checkpoints and buffers accumulating on the system RAM side, rather than VRAM.

In priority order, the countermeasures are as follows.

  1. First, shorten the context length (-c) to 32768 or less to see if the symptoms improve.
  2. If not improved, lower the KV cache quantization level to q4_0.
  3. If system RAM is less than 64GB, consider upgrading to 128GB.
  4. Update to the latest build of llama.cpp and wait for fixes in future versions.

In local LLM environment design, attention often focuses solely on “whether VRAM is sufficient,” but system RAM is also a critical resource. Especially when handling models of 30B class or larger with long contexts, 64GB or more of system RAM is becoming a de facto requirement.

Have you ever been mindful of system RAM usage in your environment? The era where RAM becomes a bottleneck next to VRAM may be approaching.

Frequently Asked Questions (FAQ)

Q: Does this problem only occur with Gemma 4?

A: It is prominently reported with Gemma 4, but it could occur with other models when a large context is combined with a large parameter count. Reddit comments put each checkpoint at about 533MB for MoE models and potentially more for Dense models, so caution is needed with any large model run at long context.

Q: Does the same problem occur on Windows?

A: Windows has no equivalent of the Linux OOM Killer forcibly terminating the process, but RAM exhaustion itself can happen regardless of the OS. On Windows, the page file (the equivalent of swap) typically expands automatically, making immediate crashes less likely, but inference speed drops sharply once paging increases. Monitoring memory usage in Task Manager during operation is advisable.

Q: Can I avoid this by using Ollama instead of llama.cpp?

A: Since Ollama’s backend uses GGML/llama.cpp technology, the fundamental memory management behavior may be similar. However, Ollama is set with a relatively short default context length, so it is considered unlikely to encounter the same problem unless an extreme long context like 100k is explicitly specified.

This article was written by the AI Hardware Zukan Editorial Department based on information available at the time of writing. Evaluations may change due to product updates or fluctuations in third-party benchmarks, prices, or supported runtimes. Re-verification is recommended for content that has aged.