Model Weights (VRAM)
The model's parameters must fit in GPU memory. A 7B parameter model at FP16 (16-bit) needs ~14GB. Quantization to Q4 (4-bit) reduces this to ~4GB, making large models accessible on consumer GPUs.
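The arithmetic above is just parameters × bits per weight. A minimal sketch (function name and decimal-GB convention are our own):

```python
def weight_vram_gb(params_billions: float, bits_per_weight: float) -> float:
    """Rough VRAM needed for the weights alone (no KV cache or runtime overhead)."""
    total_bytes = params_billions * 1e9 * bits_per_weight / 8
    return total_bytes / 1e9  # decimal GB

# 7B model: FP16 vs plain 4-bit
print(weight_vram_gb(7, 16))  # → 14.0
print(weight_vram_gb(7, 4))   # → 3.5
```

Real Q4 formats store a little metadata per block (e.g. Q4_K_M averages ~4.65 bits/weight), which is why practical Q4 files land closer to 4GB than 3.5GB.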
Free GPU VRAM calculator for running Large Language Models locally. Check if your graphics card can handle Llama 3, DeepSeek V3, Mistral, Qwen, and other open-source AI models with your preferred quantization settings.
Want to compare all configurations at once? Use our Quick LLM VRAM Checker to see every quantization and cache format combination in a single table, with detailed explanations of how VRAM calculations work.
Select a model, GPU, and quantization to see if it will run on your hardware.
During inference, the model caches attention keys and values for each token. Longer context windows require more cache memory. A 70B model at 32K context can need 8GB+ just for the cache.
LLM inference is memory-bandwidth limited. The GPU must read all model weights for each token generated. Higher bandwidth GPUs (RTX 4090: 1TB/s) generate tokens faster than lower bandwidth ones (RTX 3080: 760GB/s).
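This gives a simple upper bound on decode speed: bandwidth divided by the bytes read per token (roughly the model size). It ignores KV cache reads and kernel overheads, so real throughput is somewhat lower:

```python
def est_tokens_per_sec(bandwidth_gb_s: float, model_size_gb: float) -> float:
    """Bandwidth-bound ceiling: all weights are read once per generated token."""
    return bandwidth_gb_s / model_size_gb

# 14GB FP16 7B model: RTX 4090 (~1008 GB/s) vs RTX 3080 (~760 GB/s)
print(round(est_tokens_per_sec(1008, 14)))  # → 72
print(round(est_tokens_per_sec(760, 14)))   # → 54
```

The same formula explains why quantization speeds up generation: a 4GB Q4 file is read ~3.5× faster than a 14GB FP16 one.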
Lower-bit formats use less VRAM but may slightly reduce model quality.
| Format | Bits/Weight | VRAM vs FP16 | Quality | Best For |
|---|---|---|---|---|
| FP16 | 16 bits | 100% | Excellent | Maximum quality, large VRAM GPUs |
| Q8_0 | 8 bits | ~50% | Excellent | Near-lossless, recommended if VRAM allows |
| Q4_K_M | 4.65 bits | ~30% | Great | Best balance of quality and VRAM savings |
| Q4_K_S | 4.58 bits | ~29% | Great | Slightly smaller, good quality |
| Q3_K_M | 3.91 bits | ~24.4% | Good | Limited VRAM, acceptable quality loss |
| Q2_K | 2.63 bits | ~16.4% | Fair | Extreme VRAM constraints, noticeable quality loss |
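The table's percentages fall out of the bits-per-weight values directly. A sketch that turns them into absolute sizes for a given model (the dictionary just restates the table above):

```python
BITS_PER_WEIGHT = {"FP16": 16, "Q8_0": 8, "Q4_K_M": 4.65,
                   "Q4_K_S": 4.58, "Q3_K_M": 3.91, "Q2_K": 2.63}

def quant_table(params_billions: float) -> dict:
    """Weight VRAM in GB for each quantization format."""
    return {name: round(params_billions * bits / 8, 2)
            for name, bits in BITS_PER_WEIGHT.items()}

for name, gb in quant_table(7).items():
    print(f"{name:8s} {gb:6.2f} GB")
```

For a 7B model this yields 14.00 GB at FP16 down to about 2.30 GB at Q2_K; add 1-2GB on top for the KV cache and runtime buffers.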
VRAM requirements depend on the model size, quantization format, and context length. A 7B parameter model in Q4 quantization needs about 4-5GB VRAM, while a 70B model needs 35-40GB. Use our calculator above to get exact requirements for your specific setup.
Quantization reduces the precision of model weights from 16-bit floats to lower bit formats like 8-bit, 4-bit, or even 2-bit integers. Q4_K_M (4.65 bits per weight) typically offers the best balance of quality and VRAM savings, reducing memory usage by about 70% compared to full precision with minimal quality loss.
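The "about 70%" figure is just the ratio of bit widths:

```python
fp16_bits, q4km_bits = 16, 4.65
savings = 1 - q4km_bits / fp16_bits
print(f"{savings:.0%}")  # → 71%
```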
Yes! Modern gaming GPUs work great for local LLM inference. An RTX 3090/4090 (24GB) can run 70B models at aggressive quantization (Q2-Q3). RTX 3080/4080 cards (10-16GB) handle 13B models well, and even an RTX 3060 (12GB) can run 7B models comfortably. The key is choosing the right quantization for your VRAM.
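That choice can be automated: walk the formats from highest to lowest quality and take the first one whose weights plus overhead fit. A sketch, where the ~1.5GB overhead allowance for cache and buffers is our own rough assumption:

```python
BITS_PER_WEIGHT = {"FP16": 16, "Q8_0": 8, "Q4_K_M": 4.65,
                   "Q4_K_S": 4.58, "Q3_K_M": 3.91, "Q2_K": 2.63}

def best_quant(params_billions: float, vram_gb: float, overhead_gb: float = 1.5):
    """Highest-quality format whose weights + assumed overhead fit in VRAM."""
    for name, bits in sorted(BITS_PER_WEIGHT.items(), key=lambda kv: -kv[1]):
        need = params_billions * bits / 8 + overhead_gb
        if need <= vram_gb:
            return name, round(need, 1)
    return None  # model doesn't fit at any listed quantization

print(best_quant(7, 12))   # RTX 3060 → ('Q8_0', 8.5)
print(best_quant(13, 16))  # RTX 4080 → ('Q8_0', 14.5)
```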
LLM inference is memory bandwidth limited during token generation. The GPU must read all model weights from VRAM for each token generated. Higher bandwidth GPUs like RTX 4090 (1TB/s) generate tokens faster. Quantized models also generate faster because there's less data to read per token.
The KV (Key-Value) cache stores attention computations from previous tokens so they don't need to be recalculated. This enables fast generation but requires VRAM proportional to context length. Longer contexts need more cache. You can use quantized KV cache (Q8 or Q4) to reduce this overhead.
llama.cpp is a low-level C++ inference engine optimized for CPUs and GPUs. Ollama provides a user-friendly wrapper with easy model management. vLLM is designed for high-throughput serving with advanced batching. For personal use, Ollama is easiest. For maximum control, use llama.cpp directly.