Quick LLM VRAM Checker

Compare all quantization and cache format configurations at once. See which combinations fit your GPU and get performance estimates for each.

You can also use our Can I Run This LLM calculator to check VRAM requirements for a specific configuration with visual breakdown.

  • Results update in real time
  • Account for Windows/Linux display driver VRAM usage

Configuration Comparison

Select a model and GPU to see all possible configurations.

How LLM VRAM is Calculated

Understanding the three components that determine GPU memory requirements.

Total VRAM = Model Weights + KV Cache + Overhead

Model Weights

The model's parameters are the primary VRAM consumer. The requirement depends on the parameter count and the precision each parameter is stored at, measured in bits per weight (BPW); dividing by 8 converts bits to bytes, so with parameters counted in billions the result comes out in GB:

VRAM (GB) = Params (in B) × BPW × Quant Overhead / 8
Quantization   BPW    7B Model   70B Model
FP16           16     ~14 GB     ~140 GB
Q8_0           8      ~7 GB      ~70 GB
Q4_K_M         4.65   ~4.1 GB    ~41 GB
Q2_K           2.63   ~2.3 GB    ~23 GB
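As a sketch, the table's figures come down to a single multiplication (ignoring the small quantization overhead factor):

```python
def weight_vram_gb(params_b: float, bpw: float) -> float:
    """VRAM in GB for model weights: billions of params x bits per weight / 8 bits per byte."""
    return params_b * bpw / 8

print(round(weight_vram_gb(7, 4.65), 1))   # Q4_K_M 7B  -> 4.1 GB
print(round(weight_vram_gb(70, 16), 1))    # FP16 70B   -> 140.0 GB
```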

KV Cache (Context Memory)

During generation, the model caches attention Key and Value tensors for each processed token. This cache grows with context length:

KV Cache = 2 × Layers × KV Heads × Head Dim × Context × Bytes / 1024³
  • FP16 cache: 2 bytes per element
  • Q8 cache: 1 byte (50% savings)
  • Q4 cache: 0.5 bytes (75% savings)
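A minimal sketch of this formula, using Llama 2 70B's shape (80 layers, 8 KV heads via GQA, head dim 128) as the worked example:

```python
def kv_cache_gb(layers: int, kv_heads: int, head_dim: int,
                context: int, bytes_per_elem: float = 2) -> float:
    """KV cache in GB: a K and a V tensor for every layer, KV head, and cached token."""
    return 2 * layers * kv_heads * head_dim * context * bytes_per_elem / 1024**3

# Llama 2 70B at 4096 context with an FP16 cache:
print(kv_cache_gb(80, 8, 128, 4096))                     # 1.25 GB
print(kv_cache_gb(80, 8, 128, 4096, bytes_per_elem=1))   # 0.625 GB with a Q8 cache
```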

System Overhead

Additional memory required for the inference engine, driver buffers, and temporary computations:

  • CUDA context + activations: ~0.5 GB + (model size × 1%)
  • OS/Display overhead: ~1 GB if GPU is used for monitor

This is why we recommend leaving 1-2 GB of headroom beyond your calculated requirements.
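Putting the three components together, a rough total-VRAM estimate — the overhead constants are the approximations listed above, not measured values:

```python
def total_vram_gb(weights_gb: float, kv_gb: float, display_attached: bool = True) -> float:
    """Weights + KV cache + system overhead, per the breakdown above."""
    overhead = 0.5 + weights_gb * 0.01   # CUDA context + activations: ~0.5 GB + 1% of model
    if display_attached:
        overhead += 1.0                  # OS/display driver usage on the same GPU
    return weights_gb + kv_gb + overhead

# Q4_K_M 7B (~4.07 GB weights) with a 1.25 GB KV cache, GPU also driving the monitor:
print(round(total_vram_gb(4.07, 1.25), 2))   # 6.86 GB
```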

Model Architectures Explained

Different architectures handle attention differently, significantly impacting KV cache memory usage.

Standard

Standard Transformer / GQA

Llama 2/3, Mistral, Qwen, GPT-2

Most common architecture. Uses KV heads to store attention state. Modern models use GQA (Grouped Query Attention) where multiple query heads share KV pairs. Llama 2 70B uses 8 KV heads for 64 query heads.

KV Cache = 2 × Layers × KV Heads × Head Dim × Context × Bytes

MLA

Multi-head Latent Attention

DeepSeek-V2, DeepSeek-V3

Compresses KV into a low-rank latent space using learned projections. Can achieve 10-20× compression compared to standard attention.

KV Cache = Layers × (KV LoRA Rank + RoPE Dim) × Context × Bytes

MoE

Mixture of Experts

DeepSeek-V3, Mixtral, Qwen2-MoE

Uses sparse activation, where only a subset of "expert" layers process each token. The KV cache is the same as for a standard transformer, but all expert weights must remain in VRAM.

Total params determine VRAM, active params determine speed

Hybrid

Mamba + Attention

Jamba, Zamba

Combines Mamba SSM layers (no KV cache needed) with sparse attention layers. Only attention layers contribute to KV cache, dramatically reducing memory for long contexts.

KV Cache = 2 × Attention Layers Only × KV Heads × Head Dim × Context × Bytes
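To make the difference concrete, here is a sketch comparing the three KV cache formulas above at a 32K context. The model shapes (a Llama-2-70B-style GQA config, DeepSeek-V3-style MLA ranks, and a Jamba-style hybrid with only 4 attention layers) are illustrative assumptions, not exact published specs:

```python
CTX, FP16 = 32_768, 2  # context length and bytes per cached element

def standard_kv(layers, kv_heads, head_dim, ctx=CTX, b=FP16):
    return 2 * layers * kv_heads * head_dim * ctx * b / 1024**3

def mla_kv(layers, kv_lora_rank, rope_dim, ctx=CTX, b=FP16):
    return layers * (kv_lora_rank + rope_dim) * ctx * b / 1024**3

def hybrid_kv(attn_layers, kv_heads, head_dim, ctx=CTX, b=FP16):
    return 2 * attn_layers * kv_heads * head_dim * ctx * b / 1024**3

gqa = standard_kv(80, 8, 128)   # Llama-2-70B-style GQA
mla = mla_kv(61, 512, 64)       # DeepSeek-V3-style MLA (latent rank 512, RoPE dim 64)
hyb = hybrid_kv(4, 8, 128)      # Jamba-style: only the attention layers cache KV
print(f"GQA {gqa:.2f} GB, MLA {mla:.2f} GB, Hybrid {hyb:.2f} GB")
# GQA 10.00 GB, MLA 2.14 GB, Hybrid 0.50 GB
```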

Understanding Speed Estimates

Why memory bandwidth, not compute power, determines LLM inference speed.

LLM token generation is memory-bandwidth limited. Each generated token requires reading the entire model's weights from VRAM. The GPU spends more time waiting for data than computing.

Tokens/sec ≈ (GPU Bandwidth × Efficiency) / Bytes per Token
Bytes per Token = (Active Params × BPW) / 8 + KV Cache Read
GPU           Bandwidth   Q4 7B (~4GB)   Q4 70B (~40GB)
RTX 4090      1008 GB/s   ~170 tok/s     ~21 tok/s
RTX 3090      936 GB/s    ~160 tok/s     ~20 tok/s
RTX 4080      717 GB/s    ~125 tok/s     ~15 tok/s
RTX 3080      760 GB/s    ~130 tok/s     ~16 tok/s
RTX 4070 Ti   504 GB/s    ~90 tok/s      ~10 tok/s

Actual speeds vary by ±15-20% depending on batch size, prompt length, and software optimizations. MoE models use active parameters for speed calculations, making them faster than their total size suggests.
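The two formulas above can be sketched directly. With the defaults this gives an upper bound (~211 tok/s for a Q4 7B on an RTX 4090, above the table's figure), since it ignores KV cache reads and kernel overheads; the hypothetical kv_read_gb parameter folds the cache read back in:

```python
def tokens_per_sec(bandwidth_gbs: float, active_params_b: float, bpw: float,
                   efficiency: float = 0.85, kv_read_gb: float = 0.0) -> float:
    """Decode speed estimate: effective bandwidth / bytes read per generated token."""
    bytes_per_token_gb = active_params_b * bpw / 8 + kv_read_gb
    return bandwidth_gbs * efficiency / bytes_per_token_gb

# RTX 4090 (1008 GB/s) running a Q4_K_M 7B model (~4.07 GB of weights read per token):
print(round(tokens_per_sec(1008, 7, 4.65)))              # upper bound, no KV cache read
print(round(tokens_per_sec(1008, 7, 4.65, kv_read_gb=0.5)))  # with a ~0.5 GB cache read
```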

Frequently Asked Questions

Why compare all configurations at once?

Different use cases have different requirements. Quick batch comparisons let you see trade-offs between quality (higher-bit quantization) and VRAM savings (lower-bit quantization) at a glance, without testing each combination manually. This is especially useful when you're unsure which quantization level to target.

What's the difference between this and the "Can I Run This LLM" calculator?

This tool shows ALL possible configurations in a sortable table for quick comparison, while "Can I Run This LLM" shows VRAM usage for a single configuration you select, with a visual breakdown bar. Use this tool to compare many options quickly; use the other once you've narrowed down to a specific setup.

How accurate are these VRAM estimates?

Estimates are within 5-10% of actual usage for most configurations. Real-world usage may vary slightly due to inference engine optimizations (llama.cpp, Ollama, vLLM, etc.), specific model implementations, and system-specific factors like driver versions.

What does "Tight Fit" status mean?

The model will technically fit and run, but VRAM usage exceeds 85% of capacity, with less than 3 GB free. That leaves minimal headroom for context growth, memory fragmentation, or other processes, so you may experience out-of-memory errors in longer conversations. Consider reducing context length or using a lower quantization.

How is tokens per second (TPS) calculated?

TPS is estimated using: (GPU Memory Bandwidth × Efficiency Factor) / (Bytes per Token). We use an 85% efficiency factor by default, which accounts for memory access patterns and overhead. Actual speeds depend on prompt length, batch size, and software optimizations.

Why do 70B MoE models need so much VRAM?

MoE (Mixture of Experts) models like Mixtral 8x7B or DeepSeek-V3 activate only a fraction of parameters per token for faster inference, but ALL expert weights must be loaded in VRAM. The "total parameters" determine VRAM requirements, while "active parameters" determine generation speed. This is why DeepSeek-V3 (671B total, 37B active) is fast but needs massive VRAM.