VRAM usage for all quantization and cache format combinations. Base overhead: 0.58 GB (CUDA context + activations).
| Quantization | Cache Format | Model Weights | Total VRAM at 4K Context |
|---|---|---|---|
| FP16 16.0 bpw | FP32 | 15.96 GB | 16.97 GB (+0.44 KV) |
| FP16 16.0 bpw | FP16 | 15.96 GB | 16.75 GB (+0.22 KV) |
| FP16 16.0 bpw | Q8_0 | 15.96 GB | 16.66 GB (+0.12 KV) |
| FP16 16.0 bpw | FP8 (Exp) | 15.96 GB | 16.65 GB (+0.11 KV) |
| FP16 16.0 bpw | Q4_0 (Exp) | 15.96 GB | 16.60 GB (+0.07 KV) |
| Q8_0 8.0 bpw | FP32 | 7.98 GB | 8.99 GB (+0.44 KV) |
| Q8_0 8.0 bpw | FP16 | 7.98 GB | 8.77 GB (+0.22 KV) |
| Q8_0 8.0 bpw | Q8_0 | 7.98 GB | 8.68 GB (+0.12 KV) |
| Q8_0 8.0 bpw | FP8 (Exp) | 7.98 GB | 8.67 GB (+0.11 KV) |
| Q8_0 8.0 bpw | Q4_0 (Exp) | 7.98 GB | 8.62 GB (+0.07 KV) |
| Q4_K_M 4.65 bpw | FP32 | 4.64 GB | 5.65 GB (+0.44 KV) |
| Q4_K_M 4.65 bpw | FP16 | 4.64 GB | 5.43 GB (+0.22 KV) |
| Q4_K_M 4.65 bpw | Q8_0 | 4.64 GB | 5.33 GB (+0.12 KV) |
| Q4_K_M 4.65 bpw | FP8 (Exp) | 4.64 GB | 5.32 GB (+0.11 KV) |
| Q4_K_M 4.65 bpw | Q4_0 (Exp) | 4.64 GB | 5.28 GB (+0.07 KV) |
| Q4_K_S 4.58 bpw | FP32 | 4.57 GB | 5.58 GB (+0.44 KV) |
| Q4_K_S 4.58 bpw | FP16 | 4.57 GB | 5.36 GB (+0.22 KV) |
| Q4_K_S 4.58 bpw | Q8_0 | 4.57 GB | 5.26 GB (+0.12 KV) |
| Q4_K_S 4.58 bpw | FP8 (Exp) | 4.57 GB | 5.25 GB (+0.11 KV) |
| Q4_K_S 4.58 bpw | Q4_0 (Exp) | 4.57 GB | 5.21 GB (+0.07 KV) |
| Q3_K_M 3.91 bpw | FP32 | 3.90 GB | 4.91 GB (+0.44 KV) |
| Q3_K_M 3.91 bpw | FP16 | 3.90 GB | 4.69 GB (+0.22 KV) |
| Q3_K_M 3.91 bpw | Q8_0 | 3.90 GB | 4.60 GB (+0.12 KV) |
| Q3_K_M 3.91 bpw | FP8 (Exp) | 3.90 GB | 4.59 GB (+0.11 KV) |
| Q3_K_M 3.91 bpw | Q4_0 (Exp) | 3.90 GB | 4.54 GB (+0.07 KV) |
| Q2_K 2.63 bpw | FP32 | 2.62 GB | 3.64 GB (+0.44 KV) |
| Q2_K 2.63 bpw | FP16 | 2.62 GB | 3.42 GB (+0.22 KV) |
| Q2_K 2.63 bpw | Q8_0 | 2.62 GB | 3.32 GB (+0.12 KV) |
| Q2_K 2.63 bpw | FP8 (Exp) | 2.62 GB | 3.31 GB (+0.11 KV) |
| Q2_K 2.63 bpw | Q4_0 (Exp) | 2.62 GB | 3.27 GB (+0.07 KV) |
Total VRAM = Model Weights + KV Cache + 0.58 GB overhead. Actual usage may vary by ±5% depending on the inference engine and its optimizations.
Use our calculator to see if this model fits your specific hardware configuration.
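
As a rough illustration of the formula above, here is a minimal Python sketch that adds up the three terms and checks the result against a GPU's memory. The weight and KV cache figures are copied from the table (4K context only). The function names `total_vram_gb` and `fits`, and the choice to pad the estimate by the full 5% before comparing, are assumptions made for this example, not part of any calculator or library. Because the table's inputs are already rounded, a computed total can differ from the listed one by about 0.01 GB.

```python
OVERHEAD_GB = 0.58  # base overhead stated above (CUDA context + activations)

# Model weight sizes in GB, copied from the table.
WEIGHTS_GB = {
    "FP16": 15.96,
    "Q8_0": 7.98,
    "Q4_K_M": 4.64,
    "Q4_K_S": 4.57,
    "Q3_K_M": 3.90,
    "Q2_K": 2.62,
}

# KV cache sizes in GB at 4K context, copied from the table.
KV_CACHE_4K_GB = {
    "FP32": 0.44,
    "FP16": 0.22,
    "Q8_0": 0.12,
    "FP8": 0.11,   # marked experimental in the table
    "Q4_0": 0.07,  # marked experimental in the table
}


def total_vram_gb(quant: str, cache_format: str) -> float:
    """Total VRAM = model weights + KV cache + fixed overhead (all in GB)."""
    return WEIGHTS_GB[quant] + KV_CACHE_4K_GB[cache_format] + OVERHEAD_GB


def fits(quant: str, cache_format: str, vram_gb: float, margin: float = 0.05) -> bool:
    """Conservative check: pad the estimate by `margin` (the +5% case) before comparing.

    The padding policy is an assumption for this sketch.
    """
    return total_vram_gb(quant, cache_format) * (1 + margin) <= vram_gb


if __name__ == "__main__":
    for quant in WEIGHTS_GB:
        estimate = total_vram_gb(quant, "FP16")
        print(f"{quant:7s} + FP16 cache: {estimate:.2f} GB, fits in 8 GB: {fits(quant, 'FP16', 8.0)}")
```

Padding by the upper end of the ±5% range keeps the check on the safe side. A configuration that only fits without the margin is worth testing on the actual hardware before relying on it.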