Estimate KV Cache VRAM for Long-Context LLM Inference

Transformer inference stores a key and a value vector for every token at every layer. This calculator turns your model shape, context length, batch size, and cache precision into an estimate of the GPU memory the KV cache consumes: the hidden cost that decides whether a long prompt fits on your card.

Inputs: context length (tokens); batch size (concurrent sequences); number of layers; hidden size (model dimension); query heads; key/value heads (GQA); KV cache precision; your GPU VRAM (GB, optional).

Outputs: KV per token; total KV cache; per layer; % of your VRAM; and a table of KV cache size by context length for this configuration.

How the KV cache memory formula works

During autoregressive decoding the model does not recompute the key (K) and value (V) projections of past tokens; it caches them and reuses them at every step. Memory therefore grows linearly with sequence length, unlike the model weights, which are fixed. The size is:

KV_bytes = 2 × batch × context × layers × kv_heads × head_dim × bytes_per_element

The leading 2 accounts for storing both K and V. The subtlety many calculators miss is that two different head counts are involved. head_dim is derived from the query side as head_dim = hidden_size / query_heads (true for most models, though some set head_dim independently in their config), but the cache is sized by the number of key/value heads, not query heads. Modern models use Grouped-Query Attention (GQA), where several query heads share one KV head. Llama 3.1 8B has 32 query heads but only 8 KV heads, a 4× cache reduction versus standard Multi-Head Attention. Multi-Query Attention (MQA) pushes this to a single KV head.
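The formula above can be sketched in a few lines of Python. This is a minimal illustration, not the calculator's own source: the function name kv_cache_bytes and its parameter names are hypothetical, and the model shape in the example (32 layers, hidden size 4096, 32 query heads, 8 KV heads) is assumed from the published Llama 3.1 8B configuration.

```python
def kv_cache_bytes(context, batch, layers, hidden_size,
                   query_heads, kv_heads, bytes_per_element=2.0):
    """KV cache size in bytes, following the formula in the text.

    Assumes head_dim = hidden_size / query_heads, which holds for most
    models but not for those that set head_dim independently.
    """
    if hidden_size % query_heads != 0:
        raise ValueError("hidden_size must be divisible by query_heads")
    head_dim = hidden_size // query_heads
    # Leading 2: one tensor for K and one for V.
    return 2 * batch * context * layers * kv_heads * head_dim * bytes_per_element


# Assumed Llama 3.1 8B shape, FP16 cache, 8192-token context, one sequence.
gqa = kv_cache_bytes(8192, 1, 32, 4096, 32, 8)    # 8 KV heads (GQA)
mha = kv_cache_bytes(8192, 1, 32, 4096, 32, 32)   # 32 KV heads (plain MHA)

print(gqa / 2**30, mha / 2**30)  # 1.0 4.0 (GiB)
```

Under these assumptions each token costs 128 KiB of cache, so 8192 tokens come to 1 GiB with GQA against 4 GiB if every query head had its own KV head.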

This tool computes head_dim from your hidden size and query-head count, then multiplies the GQA-aware KV-head count through the formula. Switching precision rescales bytes_per_element: FP16 uses 2 bytes, FP8/INT8 halves that to 1 byte, and INT4 quantization quarters it to 0.5 bytes, at some accuracy cost on long contexts. The percentage bar compares the result against your stated VRAM; because weights and activations must also fit on the same card, treat anything above ~60% as a squeeze. The cache scales with both context and batch, so doubling either doubles memory. A 70B model serving a few concurrent sequences at 128K context can need more VRAM for its cache than for several quantized copies of its weights, which is one reason long-context serving tends to be limited by memory rather than by compute.
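To make the precision and percentage steps concrete, here is a rough sketch that sweeps the cache precision for a single 128K-token sequence. Everything named here is an assumption for illustration: the helper kv_cache_gib, the BYTES_PER_ELEMENT table, the 70B-class shape (80 layers, hidden size 8192, 64 query heads, 8 KV heads, as in the published Llama 3.1 70B configuration), and an 80 GiB card. It also ignores the small overhead of quantization scales and any allocator padding a real serving stack adds.

```python
# Bytes per cached element for each precision named in the text.
BYTES_PER_ELEMENT = {"fp16": 2.0, "fp8": 1.0, "int8": 1.0, "int4": 0.5}


def kv_cache_gib(context, batch, layers, hidden_size,
                 query_heads, kv_heads, precision):
    """KV cache size in GiB for a given cache precision (hypothetical helper)."""
    head_dim = hidden_size // query_heads
    total = (2 * batch * context * layers * kv_heads * head_dim
             * BYTES_PER_ELEMENT[precision])
    return total / 2**30


# Assumed 70B-class shape and an assumed 80 GiB card.
shape = dict(layers=80, hidden_size=8192, query_heads=64, kv_heads=8)
vram_gib = 80.0

for precision in ("fp16", "fp8", "int4"):
    size = kv_cache_gib(131072, 1, precision=precision, **shape)
    print(f"{precision}: {size:.1f} GiB, {100 * size / vram_gib:.1f}% of VRAM")
# fp16: 40.0 GiB, 50.0% of VRAM
# fp8: 20.0 GiB, 25.0% of VRAM
# int4: 10.0 GiB, 12.5% of VRAM
```

With these numbers a single FP16 sequence already takes half the card before any weights are loaded, and a second concurrent sequence would double that to 80 GiB.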
