How Many Concurrent Users Can One GPU Serve?
Estimate the realistic concurrent-session capacity of a single GPU running an LLM, derived from free VRAM after model weights and the KV cache footprint each active context consumes.
The formula this tool uses
kv_per_token_bytes = 2 × layers × kv_heads × head_dim × kv_bytes
kv_per_user_GB = kv_per_token_bytes × context_tokens / 1024³
usable_GB = VRAM_GB × usable_fraction − weights_GB − overhead_GB
concurrent_users = floor( usable_GB / kv_per_user_GB )
Most "tokens per second" benchmarks ignore the harder ceiling on a production endpoint: memory, not compute. Once the model weights are resident, every additional concurrent session must keep its own key/value cache in VRAM for the full length of its conversation. That KV cache, not the GPU's FLOPS, is what caps how many users a single card can hold at once.
The leading factor of 2 in the per-token formula accounts for storing both the key and the value tensor. We multiply by the number of transformer layers, the number of KV heads, and the head dimension. For models with grouped-query attention, use the KV-head count rather than the total attention-head count; this is why modern models like Llama-3 70B fit far more sessions than their total head count would suggest. Multiplying by the bytes per element gives the memory a single token occupies; scaling by the context length gives one user's footprint.
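As a rough illustration of the per-token and per-user formulas, here is a minimal Python sketch. The function names are hypothetical, and the model shape (80 layers, 8 KV heads, head dimension 128, FP16 cache) is an assumed Llama-3 70B-style configuration, not a value read from this tool.

```python
def kv_bytes_per_token(layers, kv_heads, head_dim, kv_bytes):
    """Bytes of KV cache that one token occupies across all layers."""
    # Factor of 2: one key tensor and one value tensor per layer.
    return 2 * layers * kv_heads * head_dim * kv_bytes


def kv_gb_per_user(per_token_bytes, context_tokens):
    """KV cache footprint of one session at full context, in GB (1024**3 bytes)."""
    return per_token_bytes * context_tokens / 1024**3


# Assumed shape: 80 layers, 8 KV heads, head_dim 128, 2 bytes per element (FP16).
per_token = kv_bytes_per_token(layers=80, kv_heads=8, head_dim=128, kv_bytes=2)
per_user = kv_gb_per_user(per_token, context_tokens=8192)

print(per_token)  # 327680 bytes per token
print(per_user)   # 2.5 GB per user at 8192 tokens
```

Under the same assumptions, plugging in 64 total attention heads instead of the 8 KV heads would inflate the per-user figure eightfold, to 20 GB.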
Available memory is whatever remains after weights and a reserved overhead slice for activations, CUDA graphs, and allocator fragmentation. We also apply a usable-fraction discount because no inference server safely fills 100% of VRAM. Dividing the remainder by the per-user KV cost and taking the floor yields a defensible concurrency estimate. Quantizing weights to INT4 shrinks weights_GB and frees room for the cache, while an 8-bit KV cache halves kv_bytes relative to FP16 and so roughly doubles the sessions you can host. Treat the result as a steady-state estimate with every session at full context; with shorter average prompts or paged-attention reuse, real concurrency is usually higher.
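The whole estimate can be sketched end to end as follows. This is an illustration under stated assumptions rather than the tool's actual code: the function name is hypothetical, the inputs (an 80 GB card, 0.9 usable fraction, about 35 GB of INT4 weights for a 70B model, 4 GB overhead) are example figures, and returning 0 when the weights do not fit is an added guard that the formula above does not spell out.

```python
import math


def concurrent_users(vram_gb, usable_fraction, weights_gb, overhead_gb, kv_per_user_gb):
    """Estimate how many full-context sessions fit in the VRAM left after weights."""
    usable_gb = vram_gb * usable_fraction - weights_gb - overhead_gb
    if usable_gb <= 0:
        # Assumption: if weights plus overhead exceed the budget, report zero users.
        return 0
    return math.floor(usable_gb / kv_per_user_gb)


# Example inputs (assumed): 2.5 GB per user with an FP16 cache, 1.25 GB with an 8-bit cache.
fp16_kv = concurrent_users(80, 0.9, 35, 4, kv_per_user_gb=2.5)
int8_kv = concurrent_users(80, 0.9, 35, 4, kv_per_user_gb=1.25)
fp16_weights = concurrent_users(80, 0.9, 140, 4, kv_per_user_gb=2.5)

print(fp16_kv)       # 13
print(int8_kv)       # 26
print(fp16_weights)  # 0, since 140 GB of FP16 weights do not fit on the card
```

With these example numbers, 33 GB remains for the cache, so halving the per-user footprint doubles the estimate from 13 to 26 sessions.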