Question 1

How is VRAM calculated for model weights?

Accepted Answer

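Weights VRAM (GB) ≈ parameters (in billions) × bytes per parameter: FP32 = 4 bytes, FP16 = 2 bytes, INT8 = 1 byte, INT4 = 0.5 bytes. For example, a 7B-parameter model in FP16 needs about 7 × 2 = 14 GB for its weights alone.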
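A minimal Python sketch of this formula. The function name and the precision keys are illustrative, not part of any particular tool, and the estimate covers weights only (no KV cache, activations, or framework overhead).

```python
# Bytes per parameter for common precisions, as listed in the answer.
BYTES_PER_PARAM = {
    "fp32": 4.0,
    "fp16": 2.0,
    "int8": 1.0,
    "int4": 0.5,
}


def weights_vram_gb(params_billions: float, precision: str) -> float:
    """Estimate weight memory in GB (1 GB = 1e9 bytes).

    Billions of parameters times bytes per parameter yields gigabytes
    directly, because the two factors of 1e9 cancel.
    """
    return params_billions * BYTES_PER_PARAM[precision]


for precision in ("fp32", "fp16", "int8", "int4"):
    print(f"7B model @ {precision}: {weights_vram_gb(7, precision):.1f} GB")
```

For a 7B model this prints 28.0, 14.0, 7.0 and 3.5 GB, so each halving of precision halves the weight footprint.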

Question 2

What is KV cache VRAM?

Accepted Answer

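The KV cache stores the key and value tensors computed for every token processed so far, in every layer, so they do not have to be recomputed at each decoding step during transformer inference. The full estimate is 2 × num_layers × hidden_dim × (batch_size × seq_len) × bytes_per_element, where the factor 2 counts keys and values and bytes_per_element is the size of the cache's data type (2 for FP16). For models with grouped-query attention, replace hidden_dim with num_kv_heads × head_dim. We use a simplified heuristic: 2 × batch_size × seq_len × (params_B × 0.5) × bytes_per_element.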
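A minimal sketch of the full formula (not the simplified heuristic). The model shape used in the example (32 layers, hidden_dim 4096, FP16 cache, one 4096-token sequence) is an assumed configuration chosen for illustration.

```python
def kv_cache_bytes(
    num_layers: int,
    hidden_dim: int,
    batch_size: int,
    seq_len: int,
    bytes_per_element: float,
) -> float:
    """Full KV cache size in bytes.

    2 (one tensor for K, one for V) x layers x hidden_dim x cached tokens
    x bytes per element. For grouped-query attention, pass
    num_kv_heads * head_dim as hidden_dim.
    """
    tokens = batch_size * seq_len
    return 2 * num_layers * hidden_dim * tokens * bytes_per_element


# Assumed example shape: 32 layers, hidden_dim 4096, FP16 cache, 1 x 4096 tokens.
size = kv_cache_bytes(
    num_layers=32, hidden_dim=4096, batch_size=1, seq_len=4096, bytes_per_element=2
)
print(f"{size / 2**30:.2f} GiB")  # 2.00 GiB
```

The cache grows linearly in both batch_size and seq_len, so doubling either one doubles the result.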

Question 3

Which GPUs are compared?

Accepted Answer

The following GPUs are compared (VRAM in parentheses): T4 (16GB), L4 (24GB), A10G (24GB), A100 (40GB), A100 (80GB), H100 (80GB), RTX 3090 (24GB), RTX 4090 (24GB).
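A rough sketch of checking which of these GPUs can hold a given memory estimate. The function name and the 0.9 headroom factor are assumptions standing in for runtime overhead (CUDA context, activations, allocator fragmentation); they are not part of the method described above.

```python
# VRAM capacities (GB) of the compared GPUs, as listed in the answer.
GPU_VRAM_GB = {
    "T4": 16,
    "L4": 24,
    "A10G": 24,
    "A100-40GB": 40,
    "A100-80GB": 80,
    "H100": 80,
    "RTX 3090": 24,
    "RTX 4090": 24,
}


def gpus_that_fit(required_gb: float, headroom: float = 0.9) -> list[str]:
    """Return GPUs whose usable VRAM (capacity x headroom) covers required_gb.

    The default headroom of 0.9 is an assumed allowance for overhead,
    not a measured value; tune it for your runtime.
    """
    return [name for name, vram in GPU_VRAM_GB.items() if vram * headroom >= required_gb]


# Example: 14 GB of FP16 weights for a 7B model plus roughly 2.1 GB of KV cache.
print(gpus_that_fit(14 + 2.1))
```

With these assumptions the T4 is excluded (16 × 0.9 = 14.4 GB usable), while every 24GB card and larger qualifies.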