Quick LLM VRAM Checker

Compare all quantization and cache format configurations at once. See which combinations fit your GPU and get performance estimates for each.

You can also use our Can I Run This LLM calculator to check VRAM requirements for a specific configuration with visual breakdown.

  • Results update in real time
  • Account for Windows/Linux display driver VRAM usage

Configuration Comparison

Select a model and GPU to see all possible configurations.

How LLM VRAM is Calculated

Understanding the three components that determine GPU memory requirements.

Total VRAM = Model Weights + KV Cache + Overhead

Model Weights

The model's parameters are the primary VRAM consumer. The requirement depends on the parameter count and the precision each parameter is stored at, measured in bits per weight (BPW); dividing by 8 converts bits to bytes, so with parameters counted in billions the result comes out in GB:

VRAM (GB) = Params (in B) × BPW × Quant Overhead / 8
Quantization   BPW    7B Model   70B Model
FP16           16     ~14 GB     ~140 GB
Q8_0           8      ~7 GB      ~70 GB
Q4_K_M         4.65   ~4.1 GB    ~41 GB
Q2_K           2.63   ~2.3 GB    ~23 GB
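As a sketch, the table's figures come down to a single multiplication (ignoring the small quantization overhead factor):

```python
def weight_vram_gb(params_b: float, bpw: float) -> float:
    """VRAM in GB for model weights: billions of params x bits per weight / 8 bits per byte."""
    return params_b * bpw / 8

print(round(weight_vram_gb(7, 4.65), 1))   # Q4_K_M 7B  -> 4.1 GB
print(round(weight_vram_gb(70, 16), 1))    # FP16 70B   -> 140.0 GB
```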

KV Cache (Context Memory)

During generation, the model caches attention Key and Value tensors for each processed token. This cache grows with context length:

KV Cache = 2 × Layers × KV Heads × Head Dim × Context × Bytes / 1024³
  • FP16 cache: 2 bytes per element
  • Q8 cache: 1 byte (50% savings)
  • Q4 cache: 0.5 bytes (75% savings)
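A minimal sketch of this formula, using Llama 2 70B's shape (80 layers, 8 KV heads via GQA, head dim 128) as the worked example:

```python
def kv_cache_gb(layers: int, kv_heads: int, head_dim: int,
                context: int, bytes_per_elem: float = 2) -> float:
    """KV cache in GB: a K and a V tensor for every layer, KV head, and cached token."""
    return 2 * layers * kv_heads * head_dim * context * bytes_per_elem / 1024**3

# Llama 2 70B at 4096 context with an FP16 cache:
print(kv_cache_gb(80, 8, 128, 4096))                     # 1.25 GB
print(kv_cache_gb(80, 8, 128, 4096, bytes_per_elem=1))   # 0.625 GB with a Q8 cache
```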

System Overhead

Additional memory required for the inference engine, driver buffers, and temporary computations:

  • CUDA context + activations: ~0.5 GB + (model size × 1%)
  • OS/Display overhead: ~1 GB if GPU is used for monitor

This is why we recommend leaving 1-2 GB of headroom beyond your calculated requirements.
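Putting the three components together, a rough total-VRAM estimate — the overhead constants are the approximations listed above, not measured values:

```python
def total_vram_gb(weights_gb: float, kv_gb: float, display_attached: bool = True) -> float:
    """Weights + KV cache + system overhead, per the breakdown above."""
    overhead = 0.5 + weights_gb * 0.01   # CUDA context + activations: ~0.5 GB + 1% of model
    if display_attached:
        overhead += 1.0                  # OS/display driver usage on the same GPU
    return weights_gb + kv_gb + overhead

# Q4_K_M 7B (~4.07 GB weights) with a 1.25 GB KV cache, GPU also driving the monitor:
print(round(total_vram_gb(4.07, 1.25), 2))   # 6.86 GB
```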

Model Architectures Explained

Different architectures handle attention differently, significantly impacting KV cache memory usage.

Standard

Standard Transformer / GQA

Llama 2/3, Mistral, Qwen, GPT-2

Most common architecture. Uses KV heads to store attention state. Modern models use GQA (Grouped Query Attention) where multiple query heads share KV pairs. Llama 2 70B uses 8 KV heads for 64 query heads.

KV Cache = 2 × Layers × KV Heads × Head Dim × Context × Bytes

MLA

Multi-head Latent Attention

DeepSeek-V2, DeepSeek-V3

Compresses KV into a low-rank latent space using learned projections. Can achieve 10-20× compression compared to standard attention.

KV Cache = Layers × (KV LoRA Rank + RoPE Dim) × Context × Bytes

MoE

Mixture of Experts

DeepSeek-V3, Mixtral, Qwen2-MoE

Uses sparse activation, where only a subset of "expert" layers process each token. The KV cache is the same as for a standard transformer, but all expert weights must remain in VRAM.

Total params determine VRAM, active params determine speed

Hybrid

Mamba + Attention

Jamba, Zamba

Combines Mamba SSM layers (no KV cache needed) with sparse attention layers. Only attention layers contribute to KV cache, dramatically reducing memory for long contexts.

KV Cache = 2 × Attention Layers Only × KV Heads × Head Dim × Context × Bytes
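To make the difference concrete, here is a sketch comparing the three KV cache formulas above at a 32K context. The model shapes (a Llama-2-70B-style GQA config, DeepSeek-V3-style MLA ranks, and a Jamba-style hybrid with only 4 attention layers) are illustrative assumptions, not exact published specs:

```python
CTX, FP16 = 32_768, 2  # context length and bytes per cached element

def standard_kv(layers, kv_heads, head_dim, ctx=CTX, b=FP16):
    return 2 * layers * kv_heads * head_dim * ctx * b / 1024**3

def mla_kv(layers, kv_lora_rank, rope_dim, ctx=CTX, b=FP16):
    return layers * (kv_lora_rank + rope_dim) * ctx * b / 1024**3

def hybrid_kv(attn_layers, kv_heads, head_dim, ctx=CTX, b=FP16):
    return 2 * attn_layers * kv_heads * head_dim * ctx * b / 1024**3

gqa = standard_kv(80, 8, 128)   # Llama-2-70B-style GQA
mla = mla_kv(61, 512, 64)       # DeepSeek-V3-style MLA (latent rank 512, RoPE dim 64)
hyb = hybrid_kv(4, 8, 128)      # Jamba-style: only the attention layers cache KV
print(f"GQA {gqa:.2f} GB, MLA {mla:.2f} GB, Hybrid {hyb:.2f} GB")
# GQA 10.00 GB, MLA 2.14 GB, Hybrid 0.50 GB
```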

Understanding Speed Estimates

Why memory bandwidth, not compute power, determines LLM inference speed.

LLM token generation is memory-bandwidth limited. Each generated token requires reading the entire model's weights from VRAM. The GPU spends more time waiting for data than computing.

Tokens/sec ≈ (GPU Bandwidth × Efficiency) / Bytes per Token
Bytes per Token = (Active Params × BPW) / 8 + KV Cache Read
GPU           Bandwidth   Q4 7B (~4GB)   Q4 70B (~40GB)
RTX 4090      1008 GB/s   ~170 tok/s     ~21 tok/s
RTX 3090      936 GB/s    ~160 tok/s     ~20 tok/s
RTX 4080      717 GB/s    ~125 tok/s     ~15 tok/s
RTX 3080      760 GB/s    ~130 tok/s     ~16 tok/s
RTX 4070 Ti   504 GB/s    ~90 tok/s      ~10 tok/s

Actual speeds vary by ±15-20% depending on batch size, prompt length, and software optimizations. MoE models use active parameters for speed calculations, making them faster than their total size suggests.
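The two formulas above can be sketched directly. With the defaults this gives an upper bound (~211 tok/s for a Q4 7B on an RTX 4090, above the table's figure), since it ignores KV cache reads and kernel overheads; the hypothetical kv_read_gb parameter folds the cache read back in:

```python
def tokens_per_sec(bandwidth_gbs: float, active_params_b: float, bpw: float,
                   efficiency: float = 0.85, kv_read_gb: float = 0.0) -> float:
    """Decode speed estimate: effective bandwidth / bytes read per generated token."""
    bytes_per_token_gb = active_params_b * bpw / 8 + kv_read_gb
    return bandwidth_gbs * efficiency / bytes_per_token_gb

# RTX 4090 (1008 GB/s) running a Q4_K_M 7B model (~4.07 GB of weights read per token):
print(round(tokens_per_sec(1008, 7, 4.65)))              # upper bound, no KV cache read
print(round(tokens_per_sec(1008, 7, 4.65, kv_read_gb=0.5)))  # with a ~0.5 GB cache read
```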

Frequently Asked Questions

Why compare all configurations at once?

Different use cases have different requirements. Quick batch comparisons let you see trade-offs between quality (higher-bit quantization) and VRAM savings (lower-bit quantization) at a glance, without testing each combination manually. This is especially useful when you're unsure which quantization level to target.

What's the difference between this and the "Can I Run This LLM" calculator?

This tool shows ALL possible configurations in a sortable table for quick comparison, while "Can I Run This LLM" shows VRAM usage for a single configuration you select, with a visual breakdown bar. Use this tool to compare many options quickly; use the other once you've narrowed down to a specific setup.

How accurate are these VRAM estimates?

Estimates are within 5-10% of actual usage for most configurations. Real-world usage may vary slightly due to inference engine optimizations (llama.cpp, Ollama, vLLM, etc.), specific model implementations, and system-specific factors like driver versions.

What does "Tight Fit" status mean?

The model will technically fit and run, but VRAM usage exceeds 85% of capacity, with less than 3 GB free. That leaves minimal headroom for context growth, memory fragmentation, or other processes, so you may experience out-of-memory errors in longer conversations. Consider reducing context length or using a lower quantization.

How is tokens per second (TPS) calculated?

TPS is estimated using: (GPU Memory Bandwidth × Efficiency Factor) / (Bytes per Token). We use an 85% efficiency factor by default, which accounts for memory access patterns and overhead. Actual speeds depend on prompt length, batch size, and software optimizations.

Why do 70B MoE models need so much VRAM?

MoE (Mixture of Experts) models like Mixtral 8x7B or DeepSeek-V3 activate only a fraction of parameters per token for faster inference, but ALL expert weights must be loaded in VRAM. The "total parameters" determine VRAM requirements, while "active parameters" determine generation speed. This is why DeepSeek-V3 (671B total, 37B active) is fast but needs massive VRAM.