Memory-bandwidth-bound estimator for local and server LLM decoding speed. Set your GPU bandwidth, model parameters and quantization to see the theoretical and realistic tokens-per-second ceiling.

How the tokens-per-second estimate works

Autoregressive decoding is memory-bandwidth bound, not compute bound: to generate one token the GPU must stream every model weight (plus the KV cache) from VRAM through the chip exactly once. That single fact lets us predict speed without running the model. The core formula this calculator uses is:

tokens/sec ≈ (bandwidth × efficiency) ÷ bytes_per_token

Bytes-per-token is dominated by the weights: bytes_weights = params × 1e9 × (bits ÷ 8), with params in billions. So an 8B model at 4-bit (Q4) needs ~4 GB read per token, while the same model at FP16 needs ~16 GB, which is why quantizing from FP16 to 8-bit or 4-bit roughly doubles or quadruples your speed. We add a KV-cache term, bytes_kv ≈ context × 2 × layers × hidden × 2 bytes (the first 2 counts the key and value tensors, the trailing 2 bytes assumes FP16 cache entries), approximated here from parameter count, because long contexts add real per-token reads that flatten throughput as the conversation grows.
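The arithmetic above can be sketched in a few lines of Python. This is a minimal illustration of the formulas as stated, not the calculator's actual code: the function names are made up for this example, and the bandwidth (1000 GB/s), efficiency (0.7), layer count (32) and hidden size (4096) are assumed values chosen only to make the numbers concrete. The calculator itself approximates the KV-cache term from parameter count instead of taking layers and hidden size as inputs.

```python
def weight_bytes(params_billion: float, bits: float) -> float:
    """Bytes of weights read per token: params x 1e9 x (bits / 8)."""
    return params_billion * 1e9 * bits / 8


def kv_cache_bytes(context: int, layers: int, hidden: int,
                   bytes_per_value: int = 2) -> float:
    """Bytes of KV cache read per token.

    The factor 2 covers keys and values; bytes_per_value=2 assumes FP16 entries.
    """
    return context * 2 * layers * hidden * bytes_per_value


def tokens_per_second(bandwidth_gb_s: float, efficiency: float,
                      bytes_per_token: float) -> float:
    """tokens/sec ~= (bandwidth x efficiency) / bytes_per_token."""
    return bandwidth_gb_s * 1e9 * efficiency / bytes_per_token


# Assumed example inputs (hypothetical GPU and model shape).
BANDWIDTH_GB_S = 1000.0
EFFICIENCY = 0.7

q4 = weight_bytes(8, 4)     # 8B model at 4-bit
fp16 = weight_bytes(8, 16)  # same model at FP16
kv = kv_cache_bytes(context=4096, layers=32, hidden=4096)

print(f"Q4 weights:   {q4 / 1e9:.1f} GB per token")
print(f"FP16 weights: {fp16 / 1e9:.1f} GB per token")
print(f"KV cache at 4096 tokens: {kv / 1e9:.2f} GB per token")
print(f"Q4, empty context: {tokens_per_second(BANDWIDTH_GB_S, EFFICIENCY, q4):.0f} tok/s")
print(f"Q4, 4096 context:  {tokens_per_second(BANDWIDTH_GB_S, EFFICIENCY, q4 + kv):.0f} tok/s")
```

With these assumed inputs the KV-cache term is a sizeable fraction of the Q4 weight read at a 4096-token context, which is the flattening effect described above.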

The efficiency slider models Model Bandwidth Utilization (MBU), the fraction of spec-sheet bandwidth that kernels actually achieve (the bandwidth counterpart of Model FLOPs Utilization, MFU). Real kernels never hit 100% of the spec-sheet bandwidth: 60-80% is typical for well-tuned inference (llama.cpp, vLLM, TensorRT-LLM), lower for unoptimized paths. Batching multiple requests reuses the same weight read across several sequences, so aggregate throughput scales nearly linearly with batch size while each individual request stays close to the single-stream rate (until you become compute-bound). Per-request speed dips slightly because every sequence still reads its own KV cache. That is the information most spec pages hide: a 4090 doing 50 tok/s for one user can serve 8 users at ~45 tok/s each, ~360 tok/s aggregate, because the weight read that dominates each step is shared across the batch. Use these numbers as an upper-bound sanity check before you buy a GPU or pick a deployment tier. Measured speed will land within ~10-20% of the realistic estimate if your serving stack is healthy.
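The batching effect can be sketched the same way. The model below is an assumption made for illustration, not the calculator's documented method: each decode step reads the weights once for the whole batch plus one KV cache per sequence, and compute limits are ignored. The function name and all input values are hypothetical.

```python
def batched_throughput(bandwidth_gb_s: float, efficiency: float,
                       weight_bytes: float, kv_bytes_per_seq: float,
                       batch_size: int) -> tuple[float, float]:
    """Return (per-request tok/s, aggregate tok/s) for a bandwidth-bound decode step.

    Assumes weights are read once per step and shared by the batch, while each
    sequence reads its own KV cache. Ignores the compute-bound regime.
    """
    effective_bytes_per_s = bandwidth_gb_s * 1e9 * efficiency
    step_seconds = (weight_bytes + batch_size * kv_bytes_per_seq) / effective_bytes_per_s
    per_request = 1.0 / step_seconds  # each sequence gains one token per step
    return per_request, batch_size * per_request


# Assumed inputs: 4 GB of weights, 0.05 GB of KV cache per sequence.
for batch in (1, 2, 4, 8):
    per_req, aggregate = batched_throughput(1000.0, 0.7, 4e9, 0.05e9, batch)
    print(f"batch={batch}: {per_req:.1f} tok/s per request, {aggregate:.1f} tok/s aggregate")
```

Under this model per-request speed falls only slightly as the batch grows while aggregate throughput rises almost in proportion, until the per-sequence KV reads (or, in practice, compute) start to dominate.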
