Why compare all configurations at once?
Different use cases have different requirements. Quick batch comparisons let you see the trade-off
between quality (higher-precision quantization) and VRAM savings (more aggressive quantization) at a
glance, without testing each combination manually. This is especially useful when you're unsure which
quantization level to target.
What's the difference between this and the "Can I Run This LLM" calculator?
This tool shows ALL possible configurations in a sortable table for quick comparison. "Can I Run This
LLM" shows VRAM usage for a single configuration you select, with a visual breakdown bar. Use this
tool for quick comparisons across many options; use that one once you've narrowed down to a specific
setup.
How accurate are these VRAM estimates?
Estimates are within 5-10% of actual usage for most configurations. Real-world usage may vary
slightly due to inference engine optimizations (llama.cpp, Ollama, vLLM, etc.), specific model
implementations, and system-specific factors like driver versions.
What does "Tight Fit" status mean?
The model will technically fit and run, but VRAM usage exceeds 85% of total capacity and less than
30GB remains free. This leaves minimal headroom for context growth, memory fragmentation, or other
processes, so you may hit out-of-memory errors during longer conversations. Consider reducing the
context length or using a lower quantization level.
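The thresholds above can be sketched as a simple classifier. The function and status names here are illustrative, not the tool's actual code:

```python
def fit_status(required_gb: float, total_gb: float) -> str:
    """Classify a configuration using the FAQ's stated thresholds (assumed logic)."""
    if required_gb > total_gb:
        return "Won't Fit"
    utilization = required_gb / total_gb
    free_gb = total_gb - required_gb
    # "Tight Fit": over 85% used AND under 30GB of headroom left
    if utilization > 0.85 and free_gb < 30:
        return "Tight Fit"
    return "Comfortable"

print(fit_status(22, 24))  # 91% used, 2GB free on a 24GB card
```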
How is tokens per second (TPS) calculated?
TPS is estimated using: (GPU Memory Bandwidth × Efficiency Factor) / (Bytes per Token). We use an 85%
efficiency factor by default, which accounts for memory access patterns and overhead. Actual speeds
depend on prompt length, batch size, and software optimizations.
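The formula above can be written out directly. This is a minimal sketch; the function name and the example GPU bandwidth figure are assumptions:

```python
def estimate_tps(bandwidth_gbps: float, active_params_b: float,
                 bits_per_weight: float, efficiency: float = 0.85) -> float:
    """Estimate decode tokens/sec: each generated token streams all weights once.

    bandwidth_gbps: GPU memory bandwidth in GB/s
    active_params_b: parameters read per token, in billions
    bits_per_weight: e.g. 4 for Q4, 16 for FP16
    """
    bytes_per_token = active_params_b * 1e9 * bits_per_weight / 8
    effective_bandwidth = bandwidth_gbps * 1e9 * efficiency
    return effective_bandwidth / bytes_per_token

# A 7B model at Q4 on a ~1008 GB/s GPU comes out around 245 tokens/sec
print(estimate_tps(1008, 7, 4))
```

Note this models the memory-bound decode phase only; prompt processing is compute-bound and follows different math.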
Why do MoE models need so much VRAM?
MoE (Mixture of Experts) models like Mixtral 8x7B or DeepSeek-V3 activate only a fraction of
parameters per token for faster inference, but ALL expert weights must be loaded in VRAM. The "total
parameters" determine VRAM requirements, while "active parameters" determine generation speed. This
is why DeepSeek-V3 (671B total, 37B active) is fast but needs massive VRAM.
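The total-vs-active split can be made concrete with a small sketch. The Q4 bit-width and GPU bandwidth below are assumed for illustration:

```python
def moe_profile(total_params_b: float, active_params_b: float,
                bits_per_weight: float, bandwidth_gbps: float,
                efficiency: float = 0.85) -> tuple[float, float]:
    """Return (vram_gb, tokens_per_sec) for an MoE model.

    VRAM scales with TOTAL parameters (all experts loaded);
    speed scales with ACTIVE parameters (experts used per token).
    """
    vram_gb = total_params_b * 1e9 * bits_per_weight / 8 / 1e9
    bytes_per_token = active_params_b * 1e9 * bits_per_weight / 8
    tps = bandwidth_gbps * 1e9 * efficiency / bytes_per_token
    return vram_gb, tps

# DeepSeek-V3 (671B total, 37B active) at Q4 on a ~1008 GB/s GPU:
# weights alone need ~335 GB of VRAM, yet it decodes like a 37B model
vram, tps = moe_profile(671, 37, 4, 1008)
print(vram, tps)
```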