Estimate GPU Memory Saved by Quantizing Your LLM
Set your model size and pick a quantization level to see VRAM for FP16, INT8, INT4 and GGUF formats side by side — including weights, KV cache and runtime overhead.
How this VRAM calculator works
A large language model stores one number per parameter (weight). At FP16, the half-precision baseline this calculator compares against, each weight takes 16 bits (2 bytes), so the raw weight memory is simply params × bytes_per_weight. Quantization shrinks bytes_per_weight by packing each value into fewer bits, which is why an INT4 model is roughly a quarter the size of its FP16 original.
The core formula this tool uses for weight memory is shown below. Because it divides by 1024³, the result is in binary gigabytes (GiB):
weight_GB = (params_B × 1e9 × bits_per_weight / 8) / 1024³
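As a minimal sketch of that formula, the Python below computes weight memory for a 7B-parameter model at three precisions. The function name weight_gib and the 7B example size are illustrative choices, not part of the calculator itself.

```python
def weight_gib(params_b: float, bits_per_weight: float) -> float:
    """Weight memory in binary gigabytes (bytes / 1024**3).

    params_b is the parameter count in billions; bits_per_weight is the
    storage width of one weight (16 for FP16, 8 for INT8, 4 for INT4).
    """
    total_bytes = params_b * 1e9 * bits_per_weight / 8
    return total_bytes / 1024**3


if __name__ == "__main__":
    # Hypothetical 7B model, chosen only as an example size.
    for label, bits in [("FP16", 16), ("INT8", 8), ("INT4", 4)]:
        print(f"{label}: {weight_gib(7, bits):.2f}")
    # FP16: 13.04
    # INT8: 6.52
    # INT4: 3.26
```

Halving the bit width halves the result, so the INT4 figure is exactly a quarter of the FP16 one before any per-block metadata is counted.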
GGUF "K-quants" do not have clean integer widths. Each block of weights carries its own scale metadata, and a few tensors are kept at higher precision, so the effective bits per weight (bpw) land above the number in the label: Q4_K_M is about 4.5 bpw, Q5_K_M about 5.5, Q6_K about 6.6, and Q2_K about 2.6. This calculator uses those measured effective widths rather than the nominal bit count. That is the detail most simple estimators get wrong, and it is why real GGUF files are larger than a naive "params ÷ 2" guess (half a byte per weight for a 4-bit format).
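To show how much the nominal label understates file size, here is a small sketch that compares nominal and effective widths. The bpw values are the ones quoted above; the dictionary and function names are hypothetical, and the 7B size is just an example.

```python
# Effective bits per weight as quoted in the text above.
EFFECTIVE_BPW = {"Q2_K": 2.6, "Q4_K_M": 4.5, "Q5_K_M": 5.5, "Q6_K": 6.6}
# Nominal width implied by the digit in each label.
NOMINAL_BITS = {"Q2_K": 2, "Q4_K_M": 4, "Q5_K_M": 5, "Q6_K": 6}


def size_gib(params_b: float, bits_per_weight: float) -> float:
    """Weight size in binary gigabytes for a given bit width."""
    return params_b * 1e9 * bits_per_weight / 8 / 1024**3


def underestimate_pct(quant: str) -> float:
    """How much larger the effective size is than the nominal guess, in percent."""
    return 100 * (EFFECTIVE_BPW[quant] / NOMINAL_BITS[quant] - 1)


if __name__ == "__main__":
    params_b = 7  # example model size in billions of parameters
    for quant in EFFECTIVE_BPW:
        naive = size_gib(params_b, NOMINAL_BITS[quant])
        actual = size_gib(params_b, EFFECTIVE_BPW[quant])
        print(f"{quant}: naive {naive:.2f}, effective {actual:.2f} "
              f"(+{underestimate_pct(quant):.1f}%)")
```

With these widths, a Q4_K_M estimate based on a flat 4 bits comes out 12.5% too small, and the gap is wider still for Q2_K.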
Weights are only part of the budget. Inference also needs a KV cache that grows linearly with context length. We approximate it as kv_GB = (2 × layers × hidden × ctx × 2 bytes) / 1024³, where the leading 2 covers the key and value tensors and the trailing 2 bytes is the FP16 width of each cached value. Layers and hidden width are derived from the parameter count, and the cache is held in FP16 regardless of weight quantization. Because the formula caches a full hidden-width vector per token, it tends to overestimate for models that use grouped-query attention.
On top of that we add a flat ~15% runtime overhead for activations, CUDA context and fragmentation.
The "savings" figure compares each quant level's total against the FP16 baseline, so you can see the estimated GB and percentage each step buys you. Lower precision trades some accuracy for memory, and the loss grows as the bit width drops: Q4_K_M is the usual sweet spot, while Q2_K is best reserved for fitting a model that otherwise will not load at all.
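The sketch below puts weights, KV cache, overhead and savings together. It rests on several assumptions: the KV product is divided by 1024³ so it is in the same unit as the weight formula, the 15% overhead is applied to the sum of weights and cache, and layers and hidden width are passed in explicitly because the derivation from parameter count is not spelled out here. The example values (32 layers, hidden width 4096, 4096-token context) match a typical 7B Llama-style model, and all function names are hypothetical.

```python
GIB = 1024**3


def weight_gib(params_b: float, bits_per_weight: float) -> float:
    """Weight memory in binary gigabytes."""
    return params_b * 1e9 * bits_per_weight / 8 / GIB


def kv_cache_gib(layers: int, hidden: int, ctx: int,
                 bytes_per_value: int = 2) -> float:
    """KV cache size: K and V tensors per layer, FP16 (2 bytes) per value."""
    return 2 * layers * hidden * ctx * bytes_per_value / GIB


def total_gib(params_b: float, bits: float, layers: int, hidden: int,
              ctx: int, overhead: float = 0.15) -> float:
    """Weights plus KV cache, scaled by the runtime overhead.

    Assumption: overhead multiplies the sum of weights and cache.
    """
    base = weight_gib(params_b, bits) + kv_cache_gib(layers, hidden, ctx)
    return base * (1 + overhead)


def savings(params_b: float, bits: float, layers: int, hidden: int,
            ctx: int) -> tuple[float, float]:
    """Return (GiB saved, percent saved) relative to the FP16 total."""
    fp16 = total_gib(params_b, 16, layers, hidden, ctx)
    quant = total_gib(params_b, bits, layers, hidden, ctx)
    return fp16 - quant, 100 * (fp16 - quant) / fp16


if __name__ == "__main__":
    # Example: 7B model, 32 layers, hidden 4096, 4096-token context, Q4_K_M.
    saved, pct = savings(7, 4.5, layers=32, hidden=4096, ctx=4096)
    print(f"KV cache: {kv_cache_gib(32, 4096, 4096):.2f}")
    print(f"Saved vs FP16: {saved:.2f} ({pct:.1f}%)")
```

Because the KV cache stays in FP16, it is the same size at every quant level, so the percentage saved shrinks as the context length grows.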