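Estimated VRAM usage for every combination of weight quantization and KV cache format. Each context column shows total VRAM, with the KV cache portion in parentheses (in GB). Base overhead: 0.58 GB (CUDA context + activations).
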
| Quantization | Cache Format | Model Weights | 8K Context | 16K Context | 32K Context | 40K Context |
|---|---|---|---|---|---|---|
| FP16 16.0 bpw | FP32 | 16.8 GB | 19.63 GB (+2.25 KV) | 21.88 GB (+4.5 KV) | 26.38 GB (+9.0 KV) | 28.63 GB (+11.25 KV) |
| FP16 16.0 bpw | FP16 | 16.8 GB | 18.5 GB (+1.12 KV) | 19.63 GB (+2.25 KV) | 21.88 GB (+4.5 KV) | 23.0 GB (+5.62 KV) |
| FP16 16.0 bpw | Q8_0 | 16.8 GB | 18.0 GB (+0.62 KV) | 18.62 GB (+1.24 KV) | 19.86 GB (+2.48 KV) | 20.47 GB (+3.09 KV) |
| FP16 16.0 bpw | FP8 (Exp) | 16.8 GB | 17.94 GB (+0.56 KV) | 18.5 GB (+1.12 KV) | 19.63 GB (+2.25 KV) | 20.19 GB (+2.81 KV) |
| FP16 16.0 bpw | Q4_0 (Exp) | 16.8 GB | 17.72 GB (+0.34 KV) | 18.05 GB (+0.67 KV) | 18.73 GB (+1.35 KV) | 19.07 GB (+1.69 KV) |
| Q8_0 8.0 bpw | FP32 | 8.4 GB | 11.23 GB (+2.25 KV) | 13.48 GB (+4.5 KV) | 17.98 GB (+9.0 KV) | 20.23 GB (+11.25 KV) |
| Q8_0 8.0 bpw | FP16 | 8.4 GB | 10.11 GB (+1.12 KV) | 11.23 GB (+2.25 KV) | 13.48 GB (+4.5 KV) | 14.61 GB (+5.62 KV) |
| Q8_0 8.0 bpw | Q8_0 | 8.4 GB | 9.6 GB (+0.62 KV) | 10.22 GB (+1.24 KV) | 11.46 GB (+2.48 KV) | 12.07 GB (+3.09 KV) |
| Q8_0 8.0 bpw | FP8 (Exp) | 8.4 GB | 9.54 GB (+0.56 KV) | 10.11 GB (+1.12 KV) | 11.23 GB (+2.25 KV) | 11.79 GB (+2.81 KV) |
| Q8_0 8.0 bpw | Q4_0 (Exp) | 8.4 GB | 9.32 GB (+0.34 KV) | 9.66 GB (+0.67 KV) | 10.33 GB (+1.35 KV) | 10.67 GB (+1.69 KV) |
| Q4_K_M 4.65 bpw | FP32 | 4.88 GB | 7.71 GB (+2.25 KV) | 9.96 GB (+4.5 KV) | 14.46 GB (+9.0 KV) | 16.71 GB (+11.25 KV) |
| Q4_K_M 4.65 bpw | FP16 | 4.88 GB | 6.59 GB (+1.12 KV) | 7.71 GB (+2.25 KV) | 9.96 GB (+4.5 KV) | 11.09 GB (+5.62 KV) |
| Q4_K_M 4.65 bpw | Q8_0 | 4.88 GB | 6.08 GB (+0.62 KV) | 6.7 GB (+1.24 KV) | 7.94 GB (+2.48 KV) | 8.56 GB (+3.09 KV) |
| Q4_K_M 4.65 bpw | FP8 (Exp) | 4.88 GB | 6.03 GB (+0.56 KV) | 6.59 GB (+1.12 KV) | 7.71 GB (+2.25 KV) | 8.28 GB (+2.81 KV) |
| Q4_K_M 4.65 bpw | Q4_0 (Exp) | 4.88 GB | 5.8 GB (+0.34 KV) | 6.14 GB (+0.67 KV) | 6.81 GB (+1.35 KV) | 7.15 GB (+1.69 KV) |
| Q4_K_S 4.58 bpw | FP32 | 4.81 GB | 7.64 GB (+2.25 KV) | 9.89 GB (+4.5 KV) | 14.39 GB (+9.0 KV) | 16.64 GB (+11.25 KV) |
| Q4_K_S 4.58 bpw | FP16 | 4.81 GB | 6.51 GB (+1.12 KV) | 7.64 GB (+2.25 KV) | 9.89 GB (+4.5 KV) | 11.01 GB (+5.62 KV) |
| Q4_K_S 4.58 bpw | Q8_0 | 4.81 GB | 6.01 GB (+0.62 KV) | 6.63 GB (+1.24 KV) | 7.86 GB (+2.48 KV) | 8.48 GB (+3.09 KV) |
| Q4_K_S 4.58 bpw | FP8 (Exp) | 4.81 GB | 5.95 GB (+0.56 KV) | 6.51 GB (+1.12 KV) | 7.64 GB (+2.25 KV) | 8.2 GB (+2.81 KV) |
| Q4_K_S 4.58 bpw | Q4_0 (Exp) | 4.81 GB | 5.73 GB (+0.34 KV) | 6.06 GB (+0.67 KV) | 6.74 GB (+1.35 KV) | 7.08 GB (+1.69 KV) |
| Q3_K_M 3.91 bpw | FP32 | 4.11 GB | 6.94 GB (+2.25 KV) | 9.19 GB (+4.5 KV) | 13.69 GB (+9.0 KV) | 15.94 GB (+11.25 KV) |
| Q3_K_M 3.91 bpw | FP16 | 4.11 GB | 5.81 GB (+1.12 KV) | 6.94 GB (+2.25 KV) | 9.19 GB (+4.5 KV) | 10.31 GB (+5.62 KV) |
| Q3_K_M 3.91 bpw | Q8_0 | 4.11 GB | 5.3 GB (+0.62 KV) | 5.92 GB (+1.24 KV) | 7.16 GB (+2.48 KV) | 7.78 GB (+3.09 KV) |
| Q3_K_M 3.91 bpw | FP8 (Exp) | 4.11 GB | 5.25 GB (+0.56 KV) | 5.81 GB (+1.12 KV) | 6.94 GB (+2.25 KV) | 7.5 GB (+2.81 KV) |
| Q3_K_M 3.91 bpw | Q4_0 (Exp) | 4.11 GB | 5.02 GB (+0.34 KV) | 5.36 GB (+0.67 KV) | 6.04 GB (+1.35 KV) | 6.37 GB (+1.69 KV) |
| Q2_K 2.63 bpw | FP32 | 2.76 GB | 5.59 GB (+2.25 KV) | 7.84 GB (+4.5 KV) | 12.34 GB (+9.0 KV) | 14.59 GB (+11.25 KV) |
| Q2_K 2.63 bpw | FP16 | 2.76 GB | 4.47 GB (+1.12 KV) | 5.59 GB (+2.25 KV) | 7.84 GB (+4.5 KV) | 8.97 GB (+5.62 KV) |
| Q2_K 2.63 bpw | Q8_0 | 2.76 GB | 3.96 GB (+0.62 KV) | 4.58 GB (+1.24 KV) | 5.82 GB (+2.48 KV) | 6.44 GB (+3.09 KV) |
| Q2_K 2.63 bpw | FP8 (Exp) | 2.76 GB | 3.9 GB (+0.56 KV) | 4.47 GB (+1.12 KV) | 5.59 GB (+2.25 KV) | 6.15 GB (+2.81 KV) |
| Q2_K 2.63 bpw | Q4_0 (Exp) | 2.76 GB | 3.68 GB (+0.34 KV) | 4.02 GB (+0.67 KV) | 4.69 GB (+1.35 KV) | 5.03 GB (+1.69 KV) |

Total VRAM = Model Weights + KV Cache + 0.58 GB overhead. Actual usage may vary by about ±5% depending on the inference engine and its optimizations.
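
The formula above can be sketched in a few lines of Python. This is a rough illustration, not the method used to generate the table. The function names are hypothetical. Two assumptions go beyond the stated formula: weight size scales linearly with bits per weight from the FP16 row, and KV cache size scales linearly with context length. Both are consistent with the table values to within rounding. The `fits` helper treats the ±5% variation as extra headroom, which is also an assumption.

```python
OVERHEAD_GB = 0.58       # base overhead from the text (CUDA context + activations)
FP16_WEIGHTS_GB = 16.8   # weight size of the FP16 (16.0 bpw) rows in the table


def weights_gb_from_bpw(bpw: float) -> float:
    """Estimate weight size, assuming it scales linearly with bits per weight."""
    return FP16_WEIGHTS_GB * bpw / 16.0


def scale_kv_gb(kv_gb_at_8k: float, context_k: float) -> float:
    """Estimate KV cache size, assuming linear growth from the 8K column."""
    return kv_gb_at_8k * context_k / 8.0


def total_vram_gb(weights_gb: float, kv_gb: float) -> float:
    """Total VRAM = model weights + KV cache + base overhead."""
    return weights_gb + kv_gb + OVERHEAD_GB


def fits(total_gb: float, vram_gb: float, margin: float = 0.05) -> bool:
    """Check the estimate against available VRAM, padding by the stated 5%."""
    return total_gb * (1.0 + margin) <= vram_gb


if __name__ == "__main__":
    # Q4_K_M weights (4.65 bpw) with an FP32 cache (2.25 GB at 8K) at 32K context.
    weights = weights_gb_from_bpw(4.65)
    kv = scale_kv_gb(2.25, 32)
    total = total_vram_gb(weights, kv)
    print(f"weights={weights:.2f} GB, kv={kv:.2f} GB, total={total:.2f} GB")
    print("fits in 16 GB:", fits(total, 16.0))
```

With these inputs the estimate lands on the same figure as the Q4_K_M / FP32 row at 32K context (14.46 GB). Values taken directly from the table can be passed to `total_vram_gb` instead of the scaled estimates.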