Will this model fit on my GPU?

Pick your graphics card, the model parameter count and a quantization format. The checker estimates total VRAM needed for weights, KV cache and overhead, then tells you if it runs and how much headroom is left.

GPU (VRAM)

Custom VRAM (GB)

Model size — parameters (billions)

Quantization

Context length (tokens): 8192

512128k

KV cache precision

—

0 GB usedcap

Weights

–

KV cache

–

Overhead

–

Total needed

–

Headroom

–

Max context

–

How the fit estimate works

Most "will it run" disappointments come from forgetting that weights are only part of the budget. This checker adds three terms the way an inference runtime actually allocates them.

1. Weight memory. Each parameter occupies a number of bytes set by the quantization format: 2 bytes at FP16, 1 at INT8, and roughly 0.5 at 4-bit. So weights_GB = params_billion × bytes_per_param. A 7B model at Q4_K_M is about 7 × 0.5 = 3.5 GB.

2. KV cache. Every token you keep in context stores a key and a value vector for each transformer layer, and this grows linearly with context length. We approximate it from the parameter count (which scales hidden size × layers) as kv_GB ≈ params_billion × ctx_tokens × kv_bytes × 5.2e-5. At long context this term can dwarf the weights — a 32k window on a 13B model can need 6–10 GB on its own, which is why an 8-bit or 4-bit KV cache option matters.

3. Runtime overhead. CUDA context, activation buffers, the allocator's fragmentation slack and framework reservations cost a roughly fixed amount plus a small fraction of the weights. We use overhead_GB = 0.8 + 0.06 × weights_GB.

The total is the sum of all three. We compare it to your card's VRAM minus a 7% safety reserve (drivers and the desktop compositor never give you the full sticker number). If the total is under that usable budget it fits; if it is within 10% we flag it as tight, since long generations push allocation higher. The "max context" figure solves the same equation backwards: how many tokens of KV cache the remaining VRAM allows after weights and overhead. The math is deliberately conservative — treat a green result as "should load with room for normal generation," not a guarantee for every batch size.

Will this model fit on my GPU?

How the fit estimate works

Related Tools