Have you ever tried to upscale a video you made in ComfyUI to 4K, only for your PC’s memory usage to suddenly spike near its ceiling and freeze the whole system? Or fed in a long video, waited over 15 minutes, and then had it throw a “not enough memory” error ── wasting all that waiting time? You may have run into something like this.
To state the conclusion first: Upscaling an 8-second, 4K, 60fps video in ComfyUI consumes as much as 67〜77GB of system RAM. If memory is insufficient, it crashes with a “not enough memory” (OOM) error after processing for over 15 minutes. I produce videos at this specification daily for Adobe Stock and have suffered greatly from this memory issue.
The answer I arrived at was the “standalone” method of running the same ESRGAN model directly without going through ComfyUI. Simply stripping away the wrapper changed the numbers as follows:
- RAM Consumption: 67〜77GB → 1.4GB (Approximately 1/48th; runs even on a PC with 16GB)
- Processing Time: 19 minutes → 7 minutes (Measured on RTX 5080, 2.6x faster)
- Stability: Crashes due to insufficient memory → Completes successfully regardless of video length
Since the model used is identical, image quality remains unchanged. This method solves all three issues at once: crashing, slowness, and excessive memory consumption.
In this article, I will break down from first principles why upscaling within ComfyUI consumes so much system RAM, and then explain how to run ESRGAN outside of ComfyUI with actual working code and a full benchmark test using two GPUs. Anyone with one GPU can replicate these results. Note that the basic procedure for easily upscaling via nodes inside ComfyUI is explained in another article titled “Upscaling LTX 1 Videos to 4K,” so if you want to try it via GUI first, please read that one. This article serves as an advanced follow-up.
What exactly is “standalone”?
It refers to a program that runs only the necessary processing on its own, without going through an integrated GUI environment such as ComfyUI or Stable Diffusion WebUI. In this article, “standalone” means calling the ESRGAN model that ComfyUI uses internally directly from a Python script, without launching ComfyUI. Think of it as stripping away the “heaviness” that comes with GUI convenience and running just the essential processing in a minimal configuration.
What you will learn from this article
- Why upscaling within ComfyUI consumes 67〜77GB of system RAM (from first principles)
- How to run ESRGAN directly without using ComfyUI ── with copy-pasteable code that works immediately
- Benchmark results comparing standalone vs. ComfyUI across two types of GPUs (memory, speed, GPU utilization)
- Determining which method suits your specific use case
- Pitfalls often encountered with eGPU bandwidth (e.g., Oculink) and GPU selection
Contents
- Why does upscaling within ComfyUI consume so much memory?
- How I arrived at building a standalone solution
- What is spandrel ── Borrowing the “insides” of ComfyUI directly
- How to build ── Setting up a standalone for 4K video upscaling on one GPU
- Benchmark ── Comparing memory and speed between ComfyUI and Standalone
- Which one should you choose? ── Considering your use case
- Automatically selecting models based on video content
- Pitfalls often encountered
- Application: Mass production in parallel with two GPUs
- Summary
Why does upscaling within ComfyUI consume so much memory?
ComfyUI is designed to cache the results of executed nodes in memory to speed up subsequent executions. Intermediate tensors, decoded frames, and outputs from each node remain in memory throughout processing. When you work on still images one at a time, this “remember everything” behavior makes for a comfortable experience. The problem arises with video.
In a typical video upscaling workflow, all frames are expanded into memory at once before flowing through the nodes. For an 8-second, 60fps video, that is 481 frames. Upscaling this to 4K (3840×2160) results in approximately 35MB per frame output. Holding all 481 frames simultaneously consumes over ten gigabytes just for the data itself. Furthermore, ComfyUI’s upscaling node (ImageUpscaleWithModel) processes tiles to save VRAM but then writes back the final 4K outputs of all frames as one massive tensor into CPU memory at the very end. It is during this aggregation moment that memory requirements peak.
In actual measurements, ComfyUI consumed 67〜77GB of system RAM for an 8-second video upscaled to 4K. In an RTX 5080 environment with 96GB of RAM, it barely completes the task using almost all available memory. While the video itself finishes correctly, the consumption is extraordinary. If you attempt this when there is no spare memory in your system, it crashes at that final aggregation stage like so:
RuntimeError: [enforce fail at alloc_cpu.cpp:121] DefaultCPUAllocator: not enough memory: you tried to allocate 68089282560 bytes.
This error means the process failed to reserve approximately 68GB (68,089,282,560 bytes). What makes it particularly nasty is that it occurs at the final stage, after more than 15 minutes of processing. Everything proceeds normally up to that point, so you notice nothing wrong, and the entire 15-minute wait is wasted. The memory required grows with the frame count, so longer videos are more likely to hit this limit.
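The byte count in that error message can be reproduced with simple arithmetic. The sketch below rests on two assumptions that are not stated above: that the aggregated tensor stores frames as float32 with three channels, and that the output is exactly 4x the 1152×640 source used in the benchmark later in this article.

```python
# Assumptions: float32 (4 bytes per value), RGB, 4x of a 1152x640 source, 481 frames.
frames = 481
src_w, src_h, scale = 1152, 640, 4
out_w, out_h = src_w * scale, src_h * scale    # 4608 x 2560
channels = 3

bytes_uint8 = out_w * out_h * channels         # one frame as 8-bit RGB
bytes_float32 = bytes_uint8 * 4                # one frame as float32
total_float32 = frames * bytes_float32         # every frame held at once

print(f"per frame (8-bit):    {bytes_uint8 / 1e6:.1f} MB")
print(f"per frame (float32):  {bytes_float32 / 1e6:.1f} MB")
print(f"all frames (float32): {total_float32:,} bytes")
```

Under these assumptions the total comes to 68,089,282,560 bytes, the same figure as in the error message, which is consistent with one float32 tensor holding every output frame. The 8-bit figure is the roughly 35MB per frame quoted earlier.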
This is not a defect in ComfyUI; it is the trade-off for its versatility that allows free connection of any nodes via GUI. However, during mass production phases where “fixed processing runs on large batches,” the cost of this versatility becomes burdensome. For use cases like mine where dozens are processed daily, it can be fatal.
How I arrived at building a standalone solution
I initially tried to make things work within ComfyUI as straightforwardly as possible. I lowered tile sizes to save VRAM, split frame counts, and even tested custom nodes for memory management. That was sufficient for the VRAM side. However, there was no way around that single bulk allocation at the end where all frames are aggregated into CPU memory. Tiling is merely a measure to save VRAM (on the GPU side); the design itself of aggregating final outputs into RAM remains unchanged.
This prompted me to change my thinking. The core of upscaling is an ESRGAN neural network model, and ComfyUI is just the “wrapper” that calls it. If so, why not remove the wrapper and run the model directly? Instead of holding all frames in memory, process one by one, write out, then another ── if we stream them like this, memory usage should remain constant. When I actually implemented this, RAM consumption capped at 1.4GB, and it turned out faster than ComfyUI. Below, I will show how to build this step-by-step.
What is spandrel ── Borrowing the “insides” of ComfyUI directly
The key lies in a library called spandrel. It automatically detects and loads super-resolution models like ESRGAN, SwinIR, and HAT to run them. Interestingly, ComfyUI itself uses this spandrel internally for upscaling. This means we can call the core of ComfyUI directly without launching ComfyUI.
The best part is that you can use the .pth files (such as 4x-UltraSharp.pth) located in ComfyUI’s models/upscale_models/ directory exactly as they are. There is no need to repurchase or convert models. The model assets you usually use with ComfyUI work perfectly intact.
How to build ── Setting up a standalone for 4K video upscaling on one GPU
You only need three things:
- ComfyUI’s embedded Python (contains torch and spandrel. If you use ComfyUI normally, no additional installation is needed)
- An upscaling model (4x-UltraSharp.pth, etc., from the models/upscale_models/ folder)
- ffmpeg / ffprobe (used for video frame input/output)
In the code below, the parts you need to change for your environment are the folder paths and the model path, each noted in a comment. The rest can be copied and pasted as is.
Step 1: Loading the model
Loading a model via spandrel takes just a few lines. Using fp16 (half-precision) reduces VRAM usage and increases speed.
```python
import torch
import spandrel

# Specify the model with an absolute path (e.g., D:/ComfyUI/ComfyUI/models/upscale_models/4x-UltraSharp.pth)
loaded = spandrel.ModelLoader(device="cuda:0").load_from_file("4x-UltraSharp.pth")
model = loaded.model.eval().half()  # Save memory and speed up with fp16
print(f"scale={loaded.scale}x")     # 4 for 4x-UltraSharp
```
This places the 4x upscaling model onto the GPU. The scale factor (4 for 4x-UltraSharp) is stored in loaded.scale, which can be used in subsequent calculations. The “4” in scale=4 and W*4 appearing later must match your model’s scale factor (2 for a 2x model); passing loaded.scale instead of the literal keeps them aligned automatically.
Step 2: Upscaling one frame by splitting it into tiles
If you pass a full 4K-class frame directly to the model, even 16GB of VRAM may not be enough. Therefore, we split each frame into 768px tiles, process them sequentially, and stitch the results together. The trick is to overlap adjacent tiles by 32 pixels (padding) so that tile boundaries are less noticeable.
```python
def upscale_frame(img, model, tile=768, pad=32, scale=4):
    # img: fp16 tensor [1, 3, H, W] on GPU
    _, _, H, W = img.shape
    if H <= tile and W <= tile:  # If small enough, process directly
        with torch.inference_mode():
            return model(img)

    out = torch.zeros((1, 3, H*scale, W*scale), dtype=img.dtype, device=img.device)
    for y in range(0, H, tile):
        for x in range(0, W, tile):
            # Extract tiles with padding
            ys, xs = max(y-pad, 0), max(x-pad, 0)
            ye, xe = min(y+tile+pad, H), min(x+tile+pad, W)
            with torch.inference_mode():
                up = model(img[:, :, ys:ye, xs:xe])
            # Remove padding and paste to the correct position in output
            top, left = (y-ys)*scale, (x-xs)*scale
            dy, dx = y*scale, x*scale
            h = min(tile, H-y) * scale
            w = min(tile, W-x) * scale
            out[:, :, dy:dy+h, dx:dx+w] = up[:, :, top:top+h, left:left+w]
    return out
```
Making tiles smaller further reduces VRAM consumption, while larger ones increase speed. For a GPU with 16GB of VRAM, 768px offers the best balance, and actual measurements showed peak VRAM usage capped at 13GB.
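To see what the tiling loop in Step 2 actually cuts out of a frame, the following sketch repeats only the index arithmetic, without a GPU or a model. The helper name tile_boxes is made up for this illustration; the frame size is the 1152×640 source from the benchmark section.

```python
def tile_boxes(H, W, tile=768, pad=32):
    """Yield (ys, ye, xs, xe) padded crop bounds, mirroring the loop in Step 2."""
    for y in range(0, H, tile):
        for x in range(0, W, tile):
            ys, xs = max(y - pad, 0), max(x - pad, 0)
            ye, xe = min(y + tile + pad, H), min(x + tile + pad, W)
            yield ys, ye, xs, xe

boxes = list(tile_boxes(640, 1152))
for ys, ye, xs, xe in boxes:
    print(f"rows {ys}:{ye}, cols {xs}:{xe}")
```

For this frame size the loop produces two crops that share a 64-pixel-wide band of columns (736 to 800). Each crop is upscaled separately, and only the unpadded core of each result is pasted into the output.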
Step 3: Processing frames "on the fly" (this is where it differs most)
This part makes a decisive difference from ComfyUI. Instead of holding all frames in memory, we receive them one by one from ffmpeg → upscale them → immediately write them out to ffmpeg again. The processing flow looks like this:
- Producer: ffmpeg decomposes the video into raw RGB frames and outputs them one by one via a pipe.
- GPU: Upscales received frames using tile-based processing.
- Consumer: Another ffmpeg receives upscaled frames and encodes them back into a 4K video.
We run these three in parallel via separate threads, connecting them with small queues (approx. 64 input frames / 16 output frames). Because the queue limits are fixed, peak memory stays the same whether the video is 8 seconds or 5 minutes long, and the GPU no longer sits idle waiting for I/O. The code shown here is a simplified version: it has the reader thread and the 64-frame input queue, and writes to the encoder directly from the main loop. In code, it looks like this:
```python
import subprocess, threading, queue
import numpy as np

def upscale_video(src, dst, W, H, fps=60, model=model):
    out_w, out_h = W*4, H*4

    # Producer: Input video → Raw RGB frames
    dec = subprocess.Popen(
        ["ffmpeg", "-i", src, "-f", "rawvideo", "-pix_fmt", "rgb24",
         "-v", "error", "pipe:1"],
        stdout=subprocess.PIPE)

    # Consumer: Raw RGB frames → 4K h264
    enc = subprocess.Popen(
        ["ffmpeg", "-y", "-f", "rawvideo", "-pix_fmt", "rgb24",
         "-s", f"{out_w}x{out_h}", "-r", str(fps), "-i", "pipe:0",
         "-c:v", "libx264", "-crf", "8", "-pix_fmt", "yuv420p", dst],
        stdin=subprocess.PIPE)

    q = queue.Queue(maxsize=64)  # Input buffer (fixed upper limit)
    frame_bytes = W * H * 3

    def reader():  # Producer thread
        while True:
            buf = dec.stdout.read(frame_bytes)
            if len(buf) < frame_bytes:
                break
            # .copy() makes it writable (avoids warnings)
            q.put(np.frombuffer(buf, np.uint8).reshape(H, W, 3).copy())
        q.put(None)  # End-of-stream marker

    threading.Thread(target=reader, daemon=True).start()

    while True:  # GPU (Main loop)
        frame = q.get()
        if frame is None:
            break
        t = torch.from_numpy(frame).to("cuda:0").half().permute(2,0,1)[None] / 255.0
        up = upscale_frame(t, model)  # Function from Step 2
        out = (up[0].permute(1,2,0).clamp(0,1) * 255).byte().cpu().numpy()
        enc.stdin.write(out.tobytes())  # Send to consumer

    enc.stdin.close()
    enc.wait()
```
The key is the maxsize in queue.Queue(maxsize=64), which acts as the memory cap. Even if reading runs ahead, the queue fills at 64 frames and the reader waits, so memory does not grow beyond that limit. In the configuration measured here, ComfyUI "reads everything, then processes," while this method "processes while reading, then discards"; that difference cut memory usage to roughly 1/48th. Because ComfyUI's behavior varies with workflow and settings, this comparison applies strictly to the upscaling configuration measured in this article.
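Step 3 describes three stages joined by two bounded queues, while the Step 3 code keeps the encoder write in the main loop. As a minimal sketch of the fuller layout, the example below adds a writer thread and a second queue, with plain Python callables standing in for ffmpeg and the GPU. The names run_pipeline, process, and write are placeholders for this illustration, not the author's implementation.

```python
import queue
import threading

def run_pipeline(frames, process, write, in_cap=64, out_cap=16):
    """Reader -> processor -> writer, joined by bounded queues (None = end)."""
    q_in = queue.Queue(maxsize=in_cap)
    q_out = queue.Queue(maxsize=out_cap)

    def reader():
        for f in frames:
            q_in.put(f)           # blocks while q_in is full
        q_in.put(None)

    def writer():
        while True:
            item = q_out.get()
            if item is None:
                break
            write(item)

    t_reader = threading.Thread(target=reader, daemon=True)
    t_writer = threading.Thread(target=writer, daemon=True)
    t_reader.start()
    t_writer.start()

    while True:                   # main thread stands in for the GPU stage
        f = q_in.get()
        if f is None:
            break
        q_out.put(process(f))     # blocks while q_out is full
    q_out.put(None)

    t_reader.join()
    t_writer.join()

written = []
run_pipeline(range(1000), lambda f: f * 2, written.append, in_cap=4, out_cap=2)
print(len(written), written[:3])
```

Because each queue has a single producer and a single consumer, frame order is preserved. The number of frames in flight is bounded by the two capacities plus the one frame being processed, however long the input is.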
Step 4: Batch process videos placed in an input folder to make them all 4K
You can specify files one by one, but since we are here, let's set it up so that placing a video into the "input" folder and running the script will result in 4K versions appearing in the "output" folder. This allows you to throw in as many videos as needed for processing, making mass production much easier. You simply append the following code after the previous steps (Steps 1-3).
```python
import json, glob, os

INPUT_DIR = "input"    # Path of folder containing input videos (e.g., C:/Users/you/videos/input). Put your videos here
OUTPUT_DIR = "output"  # Path of output destination folder (e.g., C:/Users/you/videos/output). 4K videos will be created here
os.makedirs(OUTPUT_DIR, exist_ok=True)

for src in sorted(glob.glob(os.path.join(INPUT_DIR, "*.mp4"))):
    name = os.path.basename(src)
    dst = os.path.join(OUTPUT_DIR, name)

    # Get width and height of input video via ffprobe
    probe = subprocess.run(
        ["ffprobe", "-v", "error", "-select_streams", "v:0",
         "-show_entries", "stream=width,height", "-of", "json", src],
        capture_output=True, text=True).stdout
    stream = json.loads(probe)["streams"][0]
    W = stream["width"]
    H = stream["height"]

    print(f"Processing: {name} ({W}x{H} → {W*4}x{H*4})")
    upscale_video(src, dst, W, H)

print("All finished")
```
Save Steps 1-4 together as a single file (e.g., upscale.py) and execute it using the Python bundled with ComfyUI. The key is to use ComfyUI's embedded Python rather than your system python, since it contains torch and spandrel.
"C:\ComfyUI\python_embeded\python.exe" upscale.py
All you need to do now is drop the videos you want to upscale into the input folder and run it. 4K versions with the same names will be created one by one in the output folder. If you put in 10 files, all 10 are processed together ── this is automation of mass production where "you just leave them in a folder and they get upscaled automatically." Since we process frames sequentially, memory usage remains constant regardless of how many videos you add.
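Step 4 reads only width and height from ffprobe, and upscale_video defaults to fps=60. For sources at other frame rates, one option is to also request r_frame_rate and pass it along. The sketch below parses such output; the helper name parse_video_info and the sample JSON are made up for illustration.

```python
import json
from fractions import Fraction

def parse_video_info(ffprobe_json):
    """Extract width, height and frame rate from ffprobe JSON output.

    Expects the output of:
      ffprobe -v error -select_streams v:0
              -show_entries stream=width,height,r_frame_rate -of json <file>
    """
    stream = json.loads(ffprobe_json)["streams"][0]
    fps = Fraction(stream["r_frame_rate"])    # a ratio such as "60/1" or "30000/1001"
    return stream["width"], stream["height"], float(fps)

sample = '{"streams": [{"width": 1152, "height": 640, "r_frame_rate": "60/1"}]}'
W, H, fps = parse_video_info(sample)
print(W, H, fps)
```

The result could then be passed as upscale_video(src, dst, W, H, fps=fps).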
To adjust image quality or size, modify the ffmpeg encoding side in Step 3. -crf 8 produces high-quality output; if the result after 4x scaling is too large, adding -vf "scale=3840:2160:flags=lanczos" resizes it to exact 4K dimensions. Note that a 1152×640 source is 9:5 rather than 16:9, so this resize stretches the picture slightly. For submissions such as Adobe Stock, keep the setting on the high-quality side. Audio is not handled here; if you need it, mux the original video's audio back in with ffmpeg at the end (see FAQ).
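The audio step mentioned above could look roughly like the following. This is a sketch, not the procedure from the FAQ: it only builds an ffmpeg command that takes video from the upscaled file and audio from the original, copying both streams without re-encoding. The function name and file paths are placeholders.

```python
def build_mux_cmd(upscaled, original, dst):
    """ffmpeg command: video from `upscaled`, audio (if any) from `original`."""
    return [
        "ffmpeg", "-y",
        "-i", upscaled,       # input 0: upscaled, silent video
        "-i", original,       # input 1: source that may carry audio
        "-map", "0:v:0",      # take the video stream from input 0
        "-map", "1:a:0?",     # take audio from input 1; '?' tolerates a missing track
        "-c", "copy",         # copy both streams without re-encoding
        dst,
    ]

cmd = build_mux_cmd("output/clip.mp4", "input/clip.mp4", "output/clip_audio.mp4")
print(" ".join(cmd))
# To execute: import subprocess; subprocess.run(cmd, check=True)
```

Writing to a separate destination file avoids reading and overwriting the same file in one ffmpeg call.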
Benchmark ── Comparing memory and speed between ComfyUI and Standalone
We measured processing an identical 8-second video (1152×640 / 60fps / 481 frames) upscaled to 4K using both standalone and ComfyUI, across two GPUs: RTX 5080 (direct PCIe connection) and RTX 5060 Ti (Oculink eGPU). The model used was the same 4x-UltraSharp for all tests.
| Configuration | Method | Processing Time | RAM Peak | GPU Utilization |
|---|---|---|---|---|
| RTX 5080 (Direct) | Standalone | ~7 minutes (443 sec) | 1.4GB | 96% (Full load) |
| RTX 5080 (Direct) | ComfyUI | ~19 minutes (1140 sec) | 76.6GB | 80〜97% (with dips) |
| RTX 5060 Ti (Oculink) | Standalone | ~15 minutes (928 sec) | 1.4GB | Intermittent (bandwidth wait) |
| RTX 5060 Ti (Oculink) | ComfyUI | ~36 minutes (2186 sec) | 67.8GB | Intermittent |
To translate what this table implies for practical use:
The 48〜55x difference in memory usage changes which PCs can run the job at all. Standalone caps at just 1.4GB of RAM regardless of GPU, whereas ComfyUI requires 67〜77GB. To run an 8-second 4K video safely on ComfyUI, you practically need a system with around 96GB of RAM. With standalone, the same video can be processed even on a PC with only 16GB of memory. This matters most to those who had given up on upscaling because of insufficient memory.
The more-than-2x speedup was contrary to my expectation. The RTX 5080 took 443 seconds with standalone versus 1140 seconds with ComfyUI (2.6x); the 5060 Ti took 928 seconds versus 2186 seconds (2.4x). I initially thought "ComfyUI sends all frames to the GPU at once, so transfer happens only once and should be faster." The measurements showed the opposite. Holding all frames in memory creates overhead just from allocating and moving that much data, which makes it slower. Processing while reading and then discarding gives better results for both memory and time.
The GPU doesn't idle. For standalone on the directly connected RTX 5080, GPU utilization averaged 96% and stayed high almost constantly. Because read → process → write run in parallel as a stream, the GPU never stalls waiting for I/O. ComfyUI, by contrast, has distinct stages ("read all → process all → write all"), and utilization dips at stage transitions and during memory management (fluctuating between 80〜97%). Keeping the most expensive component, the GPU, from idling pays off significantly during mass production.
eGPU (Oculink) requires attention to bandwidth. For the RTX 5060 Ti, regardless of method, GPU utilization became intermittent (jagged). This is because Oculink's transfer bandwidth (PCIe 4.0 x4, approx. 8GB/s) cannot keep up with round-trips for 4K output (~35MB per frame), causing the GPU to pause waiting for transfers. This does not occur on a direct PCIe 5.0 x16 connection like the RTX 5080 (approx. 63GB/s). When running upscaling via an external GPU, consider this bandwidth penalty (processing time approx. double compared to RTX 5080) in addition to raw compute performance differences.
Which one should you choose? ── Considering your use case
You might think standalone is the only choice after reading this far, but ComfyUI isn't inferior. Honestly, there are clear pros and cons depending on usage scenarios.
Suitable for upscaling within ComfyUI if:
- You want to try easily via GUI first
- You process only a few videos per month
- You don't want to write or touch code
Suitable for standalone if:
- You process videos daily
- You mass-produce videos for platforms like Adobe Stock
- You handle long-form video (wanting to avoid memory usage skyrocketing)
- You have limited RAM (want to run 4K on a PC with 16〜32GB)
If you are just testing a few videos, the convenience of the GUI wins. Once you enter mass production, standalone's low memory consumption, speed, and reliable completion become real benefits. I myself first settled the procedure in the GUI and switched to standalone when mass production began. This order is the most logical.
Automatically selecting models based on video content
ESRGAN-series models each have their strengths: UltraSharp focuses on sharpness and detail, Remacri offers natural organic textures, and RealESRGAN excels at noise removal. Manually choosing which model to apply for every video is tedious and prone to inconsistent judgment.
To address this, we quantify visual characteristics of the video (brightness, edge count, noise, saturation) from a few frames and automatically assign models accordingly. If there's strong noise → RealESRGAN; dark and flat scenes → RealESRGAN; sharp with vivid colors → UltraSharp; smooth footage otherwise → Remacri.
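The selection rules above map naturally onto a small decision function. The sketch below is illustrative only: the metric names, the 0 to 1 normalization, and every threshold are hypothetical placeholders rather than values from my pipeline, and would need tuning on real footage.

```python
def choose_model(stats, noise_hi=0.6, dark=0.3, flat=0.2, sharp=0.5, vivid=0.5):
    """Pick a model name from per-video metrics normalized to 0..1.

    `stats` keys (brightness, edges, noise, saturation) and all thresholds
    are hypothetical; they only mirror the order of the rules in the text.
    """
    if stats["noise"] >= noise_hi:
        return "RealESRGAN"                      # strong noise
    if stats["brightness"] <= dark and stats["edges"] <= flat:
        return "RealESRGAN"                      # dark and flat
    if stats["edges"] >= sharp and stats["saturation"] >= vivid:
        return "UltraSharp"                      # sharp with vivid colors
    return "Remacri"                             # smooth footage otherwise

clip = {"brightness": 0.55, "edges": 0.7, "noise": 0.1, "saturation": 0.8}
print(choose_model(clip))
```

The rules are checked in the order they are listed in the text, so the noise check comes first and takes priority over sharpness.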
In my case, I always subject videos to multimodal inspection before upscaling via a pipeline like the one described in "Multimodal Inspection for Video Stock Pipeline," where I already see details such as "is this video bright or dark, does it have much noise?" at that stage. Therefore, model selection naturally happens during the inspection phase. Passing the judgment results from inspection directly to model selection allows us to apply suitable models per video without manual intervention. Furthermore, a significant advantage is that "failed works" where image quality doesn't meet standards can be filtered out at this inspection stage. Since upscaling takes over ten minutes per file and is heavy processing, discarding bad videos beforehand reduces the total wasted upscales significantly.
However, this workflow relies on having pre-built mechanisms for inspection and sorting in my environment; it cannot be instantly automated everywhere without setup. While I won't delve into building those specific systems here, knowing that "inspection and upscaling should be designed as a set to reduce waste" is useful knowledge. For more details on inspection itself, see "What is Multimodal Inspection?."
Pitfalls often encountered
"Wanting to run on a second GPU" but it runs on another one instead
In multi-GPU environments, specifying cuda:1 sometimes results in using an unintended GPU. This is because CUDA's default device order (FASTEST_FIRST) does not match the numbering from nvidia-smi (PCI bus order). When installing two GPUs of the same generation, which one gets assigned to cuda:0 may contradict intuition.
```bat
set CUDA_DEVICE_ORDER=PCI_BUS_ID
rem To use a specific single GPU most reliably
set CUDA_VISIBLE_DEVICES=1
```
To be honest, I once forgot this setting and ended up measuring on the RTX 5080 when I intended to test the 5060 Ti, which forced me to redo the measurements. Do not assume GPU numbers; I strongly recommend checking with nvidia-smi during processing to see which GPU is actually active.
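The same two variables can be set from Python instead of the shell, as long as that happens before CUDA is initialized (the safest point is before importing torch). The helper below is a sketch with a made-up name, pin_gpu. It returns an environment mapping that can be applied to the current process or handed to a child process.

```python
import os

def pin_gpu(index, env=None):
    """Return a copy of `env` that exposes only one GPU, in nvidia-smi order."""
    env = dict(os.environ if env is None else env)
    env["CUDA_DEVICE_ORDER"] = "PCI_BUS_ID"     # match nvidia-smi numbering
    env["CUDA_VISIBLE_DEVICES"] = str(index)    # hide every other GPU
    return env

env = pin_gpu(1, env={})
print(env)
# In-process use: os.environ.update(pin_gpu(1)) at the top of the script,
# before `import torch`. Child process: subprocess.run(cmd, env=pin_gpu(1)).
```

With only one device visible, the script addresses it as cuda:0, so the code in Steps 1 to 3 needs no change. Calling torch.cuda.get_device_name(0) at startup is a quick way to confirm which card was picked.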
VRAM and RAM are different ── lowering tiles doesn't reduce RAM usage
A common pitfall is confusing VRAM (GPU memory) with RAM (system memory). Tiling saves VRAM, but ComfyUI's consumption of 67〜77GB occurs in system RAM. Therefore, even if you lower tile sizes thinking "not enough memory," the expansion on the RAM side won't stop. VRAM is inside the GPU; RAM is on the motherboard ── different locations entirely. Tiling happens on the GPU side, while bulk allocation of outputs happens on the system side. Standalone solves this by eliminating that aggregation in system RAM itself. If you lack VRAM, lower tiles; if you lack RAM, change processing methods ── treat them separately.
Other common bottlenecks
- Model not found → Check the path of the .pth file (ComfyUI's models/upscale_models/)
- VRAM insufficient → Lower tile size (e.g., 768 → 512)
- ffmpeg not found → Specify full path or add to PATH environment variable
Application: Mass production in parallel with two GPUs
In mass production phases where dozens of videos are processed, distributing one video per GPU and running them in parallel increases throughput. In my environment (RTX 5080 + RTX 5060 Ti), streaming one to each yielded a perceived speedup of 1.5〜2x (though the 5060 Ti side suffers from the aforementioned Oculink bandwidth penalty). This assumes a dual-GPU setup, so please first complete the steps above with a single GPU. Parallel processing and VRAM handling for dual GPUs are examined in detail in "ComfyUI Dual-GPU Operation Guide."
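One way to hand out videos to two GPUs is a shared job queue with one worker thread per GPU, so that whichever card finishes first takes the next file. The sketch below shows only that scheduling logic. The run_one callable is a hypothetical stand-in for launching a per-video upscale on a given GPU; the upscale.py built above processes a whole folder, so a single-file entry point is an assumption here.

```python
import queue
import threading

def dispatch(videos, gpu_ids, run_one):
    """Give each video to whichever GPU worker is free; returns {video: gpu_id}."""
    jobs = queue.Queue()
    for v in videos:
        jobs.put(v)
    done = {}
    lock = threading.Lock()

    def worker(gpu_id):
        while True:
            try:
                v = jobs.get_nowait()
            except queue.Empty:
                return                    # no work left for this GPU
            run_one(v, gpu_id)            # hypothetical per-video upscale call
            with lock:
                done[v] = gpu_id

    threads = [threading.Thread(target=worker, args=(g,)) for g in gpu_ids]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return done

def fake_run(video, gpu_id):              # stand-in so the sketch runs anywhere
    print(f"GPU {gpu_id}: {video}")

result = dispatch(["a.mp4", "b.mp4", "c.mp4"], [0, 1], fake_run)
```

Because workers pull jobs rather than receiving a fixed half each, a faster card naturally ends up processing more files than a slower one.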
Summary
The reason upscaling within ComfyUI consumes excessive memory for long videos is its design to bulk-allocate the 4K output of all frames at the end (actual measurement: 67〜77GB; crashes with OOM when memory is tight). By running the ESRGAN model, which forms the core of upscaling, directly from outside ComfyUI via spandrel and processing frames one by one in a streaming fashion, this problem disappears fundamentally.
- Memory: Constant regardless of video length (actual measurement: 1.4GB = approx. 1/48〜55th of ComfyUI). Runs even on PCs with 16GB RAM
- Speed: 2.4〜2.6x faster on the same GPU. Keeps GPU at 96% utilization without idling
- Image Quality: Identical to ComfyUI since it uses the same .pth file (optimization for automatically selecting models per video is possible but requires a pre-built sorting mechanism via inspection; this article does not cover that depth)
"ComfyUI can do it, but at the cost of massive memory and time. Standalone drastically cuts both." This was the conclusion drawn from actual measurements. If you want to first grasp procedures using ComfyUI nodes, start with "Upscaling LTX 1 Videos to 4K"; try making this article's standalone version as your next step.
