How Many Concurrent Users Can One GPU Serve?

Estimate the realistic concurrent-session capacity of a single GPU running an LLM, derived from free VRAM after model weights and the KV cache footprint each active context consumes.

e.g. A100 80, H100 80, RTX 4090 24, L40S 48
Use head count of grouped-query KV, not total attention heads
Activations, CUDA graphs, fragmentation buffer
0
Concurrent users / GPU
0
KV cache / user (GB)
0
Model weights (GB)
0
VRAM left for KV (GB)

The formula this tool uses

weights_GB = params_B × (quant_bits / 8)
kv_per_token_bytes = 2 × layers × kv_heads × head_dim × kv_bytes
kv_per_user_GB = kv_per_token_bytes × context_tokens / 1024³
usable_GB = VRAM × (usable_fraction) − weights_GB − overhead_GB
concurrent_users = floor( usable_GB / kv_per_user_GB )

Most "tokens per second" benchmarks ignore the harder ceiling on a production endpoint: memory, not compute. Once the model weights are resident, every additional concurrent session must keep its own key/value cache in VRAM for the full length of its conversation. That KV cache, not the GPU's FLOPS, is what caps how many users a single card can hold at once.

The leading factor of 2 in the per-token formula accounts for storing both the key and the value tensor. We multiply by the number of transformer layers, the number of KV heads (use the grouped-query value, which is why modern models like Llama-3 70B fit far more sessions than their total head count would suggest), and the head dimension. Multiplying by the bytes per element gives the memory a single token occupies; scaling by the context length gives one user's footprint.

Available memory is whatever remains after weights and a reserved overhead slice for activations, CUDA graphs, and allocator fragmentation. We also apply a usable-fraction discount because no inference server safely fills 100% of VRAM. Dividing the remainder by the per-user KV cost and taking the floor yields a defensible concurrency estimate. Quantizing weights to INT4 frees room for the cache, while an 8-bit KV cache roughly doubles the sessions you can host. Treat the result as a steady-state ceiling at full context; with shorter average prompts or paged-attention reuse, real throughput is usually higher.