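Estimated VRAM usage for every combination of weight quantization and KV cache format. Each context-length column shows total VRAM with the KV cache share in parentheses, and every total includes a fixed base overhead of 0.55 GB (CUDA context + activations).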
| Quantization | KV Cache Format | Model Weights | Total at 8K Context | Total at 16K Context | Total at 32K Context |
|---|---|---|---|---|---|
| FP16 16.0 bpw | FP32 | 10.08 GB | 11.19 GB (+0.56 KV) | 11.75 GB (+1.12 KV) | 12.88 GB (+2.25 KV) |
| FP16 16.0 bpw | FP16 | 10.08 GB | 10.91 GB (+0.28 KV) | 11.19 GB (+0.56 KV) | 11.75 GB (+1.12 KV) |
| FP16 16.0 bpw | Q8_0 | 10.08 GB | 10.78 GB (+0.15 KV) | 10.94 GB (+0.31 KV) | 11.25 GB (+0.62 KV) |
| FP16 16.0 bpw | FP8 (Exp) | 10.08 GB | 10.77 GB (+0.14 KV) | 10.91 GB (+0.28 KV) | 11.19 GB (+0.56 KV) |
| FP16 16.0 bpw | Q4_0 (Exp) | 10.08 GB | 10.71 GB (+0.08 KV) | 10.80 GB (+0.17 KV) | 10.97 GB (+0.34 KV) |
| Q8_0 8.0 bpw | FP32 | 5.04 GB | 6.15 GB (+0.56 KV) | 6.71 GB (+1.12 KV) | 7.84 GB (+2.25 KV) |
| Q8_0 8.0 bpw | FP16 | 5.04 GB | 5.87 GB (+0.28 KV) | 6.15 GB (+0.56 KV) | 6.71 GB (+1.12 KV) |
| Q8_0 8.0 bpw | Q8_0 | 5.04 GB | 5.74 GB (+0.15 KV) | 5.90 GB (+0.31 KV) | 6.21 GB (+0.62 KV) |
| Q8_0 8.0 bpw | FP8 (Exp) | 5.04 GB | 5.73 GB (+0.14 KV) | 5.87 GB (+0.28 KV) | 6.15 GB (+0.56 KV) |
| Q8_0 8.0 bpw | Q4_0 (Exp) | 5.04 GB | 5.67 GB (+0.08 KV) | 5.76 GB (+0.17 KV) | 5.93 GB (+0.34 KV) |
| Q4_K_M 4.65 bpw | FP32 | 2.93 GB | 4.04 GB (+0.56 KV) | 4.60 GB (+1.12 KV) | 5.73 GB (+2.25 KV) |
| Q4_K_M 4.65 bpw | FP16 | 2.93 GB | 3.76 GB (+0.28 KV) | 4.04 GB (+0.56 KV) | 4.60 GB (+1.12 KV) |
| Q4_K_M 4.65 bpw | Q8_0 | 2.93 GB | 3.63 GB (+0.15 KV) | 3.79 GB (+0.31 KV) | 4.10 GB (+0.62 KV) |
| Q4_K_M 4.65 bpw | FP8 (Exp) | 2.93 GB | 3.62 GB (+0.14 KV) | 3.76 GB (+0.28 KV) | 4.04 GB (+0.56 KV) |
| Q4_K_M 4.65 bpw | Q4_0 (Exp) | 2.93 GB | 3.56 GB (+0.08 KV) | 3.65 GB (+0.17 KV) | 3.81 GB (+0.34 KV) |
| Q4_K_S 4.58 bpw | FP32 | 2.89 GB | 4.00 GB (+0.56 KV) | 4.56 GB (+1.12 KV) | 5.68 GB (+2.25 KV) |
| Q4_K_S 4.58 bpw | FP16 | 2.89 GB | 3.71 GB (+0.28 KV) | 4.00 GB (+0.56 KV) | 4.56 GB (+1.12 KV) |
| Q4_K_S 4.58 bpw | Q8_0 | 2.89 GB | 3.59 GB (+0.15 KV) | 3.74 GB (+0.31 KV) | 4.05 GB (+0.62 KV) |
| Q4_K_S 4.58 bpw | FP8 (Exp) | 2.89 GB | 3.57 GB (+0.14 KV) | 3.71 GB (+0.28 KV) | 4.00 GB (+0.56 KV) |
| Q4_K_S 4.58 bpw | Q4_0 (Exp) | 2.89 GB | 3.52 GB (+0.08 KV) | 3.6 GB (+0.17 KV) | 3.77 GB (+0.34 KV) |
| Q3_K_M 3.91 bpw | FP32 | 2.46 GB | 3.57 GB (+0.56 KV) | 4.14 GB (+1.12 KV) | 5.26 GB (+2.25 KV) |
| Q3_K_M 3.91 bpw | FP16 | 2.46 GB | 3.29 GB (+0.28 KV) | 3.57 GB (+0.56 KV) | 4.14 GB (+1.12 KV) |
| Q3_K_M 3.91 bpw | Q8_0 | 2.46 GB | 3.17 GB (+0.15 KV) | 3.32 GB (+0.31 KV) | 3.63 GB (+0.62 KV) |
| Q3_K_M 3.91 bpw | FP8 (Exp) | 2.46 GB | 3.15 GB (+0.14 KV) | 3.29 GB (+0.28 KV) | 3.57 GB (+0.56 KV) |
| Q3_K_M 3.91 bpw | Q4_0 (Exp) | 2.46 GB | 3.10 GB (+0.08 KV) | 3.18 GB (+0.17 KV) | 3.35 GB (+0.34 KV) |
| Q2_K 2.63 bpw | FP32 | 1.66 GB | 2.77 GB (+0.56 KV) | 3.33 GB (+1.12 KV) | 4.45 GB (+2.25 KV) |
| Q2_K 2.63 bpw | FP16 | 1.66 GB | 2.49 GB (+0.28 KV) | 2.77 GB (+0.56 KV) | 3.33 GB (+1.12 KV) |
| Q2_K 2.63 bpw | Q8_0 | 1.66 GB | 2.36 GB (+0.15 KV) | 2.51 GB (+0.31 KV) | 2.82 GB (+0.62 KV) |
| Q2_K 2.63 bpw | FP8 (Exp) | 1.66 GB | 2.35 GB (+0.14 KV) | 2.49 GB (+0.28 KV) | 2.77 GB (+0.56 KV) |
| Q2_K 2.63 bpw | Q4_0 (Exp) | 1.66 GB | 2.29 GB (+0.08 KV) | 2.37 GB (+0.17 KV) | 2.54 GB (+0.34 KV) |
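Total VRAM = Model Weights + KV Cache + 0.55 GB overhead. Actual usage may vary by about ±5% depending on the inference engine and its optimizations. A displayed total can differ from the sum of its displayed parts by 0.01 GB (for example, Q4_K_M with a Q4_0 cache at 32K), which is consistent with each figure being rounded independently.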
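
The formula above can be sketched in a few lines of Python. This is only an illustration: the function names, the `fits` helper, and its `margin` parameter are hypothetical, and the weight scaling assumes that size grows linearly with bits per weight from the 10.08 GB FP16 figure in the table. KV cache sizes are taken directly from the table rather than derived.

```python
BASE_OVERHEAD_GB = 0.55  # CUDA context + activations, as stated with the table


def total_vram_gb(weights_gb: float, kv_cache_gb: float) -> float:
    """Total VRAM = Model Weights + KV Cache + 0.55 GB overhead."""
    return weights_gb + kv_cache_gb + BASE_OVERHEAD_GB


def weights_from_bpw(bpw: float, fp16_weights_gb: float = 10.08) -> float:
    """Assumption: weight size scales linearly with bits per weight,
    anchored at the table's FP16 (16.0 bpw) value."""
    return fp16_weights_gb * bpw / 16.0


def fits(weights_gb: float, kv_cache_gb: float,
         available_gb: float, margin: float = 0.05) -> bool:
    """Hypothetical check: pad the estimate by the quoted 5% variance
    before comparing it with the VRAM that is actually available."""
    padded = total_vram_gb(weights_gb, kv_cache_gb) * (1.0 + margin)
    return padded <= available_gb


if __name__ == "__main__":
    # Q4_K_M weights (2.93 GB) with an FP16 cache at 16K context (+0.56 KV)
    print(f"{total_vram_gb(2.93, 0.56):.2f} GB")   # 4.04 GB, matching the table
    print(fits(2.93, 0.56, available_gb=6.0))      # True
```

Padding by the upper end of the ±5% range is a conservative choice; a tighter or looser margin may suit a particular inference engine better.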