Estimated VRAM usage for every combination of weight quantization and KV cache format. All totals include a fixed base overhead of 0.53 GB (CUDA context + activations).
| Quantization | KV Cache Format | Model Weights | 8K Context: Total (+KV) | 16K Context: Total (+KV) | 32K Context: Total (+KV) |
|---|---|---|---|---|---|
| FP16 16.0 bpw | FP32 | 6.51 GB | 7.6 GB (+0.56 KV) | 8.17 GB (+1.12 KV) | 9.29 GB (+2.25 KV) |
| FP16 16.0 bpw | FP16 | 6.51 GB | 7.32 GB (+0.28 KV) | 7.6 GB (+0.56 KV) | 8.17 GB (+1.12 KV) |
| FP16 16.0 bpw | Q8_0 | 6.51 GB | 7.2 GB (+0.15 KV) | 7.35 GB (+0.31 KV) | 7.66 GB (+0.62 KV) |
| FP16 16.0 bpw | FP8 (Exp) | 6.51 GB | 7.18 GB (+0.14 KV) | 7.32 GB (+0.28 KV) | 7.6 GB (+0.56 KV) |
| FP16 16.0 bpw | Q4_0 (Exp) | 6.51 GB | 7.13 GB (+0.08 KV) | 7.21 GB (+0.17 KV) | 7.38 GB (+0.34 KV) |
| Q8_0 8.0 bpw | FP32 | 3.26 GB | 4.35 GB (+0.56 KV) | 4.91 GB (+1.12 KV) | 6.04 GB (+2.25 KV) |
| Q8_0 8.0 bpw | FP16 | 3.26 GB | 4.07 GB (+0.28 KV) | 4.35 GB (+0.56 KV) | 4.91 GB (+1.12 KV) |
| Q8_0 8.0 bpw | Q8_0 | 3.26 GB | 3.94 GB (+0.15 KV) | 4.1 GB (+0.31 KV) | 4.4 GB (+0.62 KV) |
| Q8_0 8.0 bpw | FP8 (Exp) | 3.26 GB | 3.93 GB (+0.14 KV) | 4.07 GB (+0.28 KV) | 4.35 GB (+0.56 KV) |
| Q8_0 8.0 bpw | Q4_0 (Exp) | 3.26 GB | 3.87 GB (+0.08 KV) | 3.95 GB (+0.17 KV) | 4.12 GB (+0.34 KV) |
| Q4_K_M 4.65 bpw | FP32 | 1.89 GB | 2.99 GB (+0.56 KV) | 3.55 GB (+1.12 KV) | 4.67 GB (+2.25 KV) |
| Q4_K_M 4.65 bpw | FP16 | 1.89 GB | 2.7 GB (+0.28 KV) | 2.99 GB (+0.56 KV) | 3.55 GB (+1.12 KV) |
| Q4_K_M 4.65 bpw | Q8_0 | 1.89 GB | 2.58 GB (+0.15 KV) | 2.73 GB (+0.31 KV) | 3.04 GB (+0.62 KV) |
| Q4_K_M 4.65 bpw | FP8 (Exp) | 1.89 GB | 2.56 GB (+0.14 KV) | 2.7 GB (+0.28 KV) | 2.99 GB (+0.56 KV) |
| Q4_K_M 4.65 bpw | Q4_0 (Exp) | 1.89 GB | 2.51 GB (+0.08 KV) | 2.59 GB (+0.17 KV) | 2.76 GB (+0.34 KV) |
| Q4_K_S 4.58 bpw | FP32 | 1.86 GB | 2.96 GB (+0.56 KV) | 3.52 GB (+1.12 KV) | 4.64 GB (+2.25 KV) |
| Q4_K_S 4.58 bpw | FP16 | 1.86 GB | 2.68 GB (+0.28 KV) | 2.96 GB (+0.56 KV) | 3.52 GB (+1.12 KV) |
| Q4_K_S 4.58 bpw | Q8_0 | 1.86 GB | 2.55 GB (+0.15 KV) | 2.7 GB (+0.31 KV) | 3.01 GB (+0.62 KV) |
| Q4_K_S 4.58 bpw | FP8 (Exp) | 1.86 GB | 2.54 GB (+0.14 KV) | 2.68 GB (+0.28 KV) | 2.96 GB (+0.56 KV) |
| Q4_K_S 4.58 bpw | Q4_0 (Exp) | 1.86 GB | 2.48 GB (+0.08 KV) | 2.56 GB (+0.17 KV) | 2.73 GB (+0.34 KV) |
| Q3_K_M 3.91 bpw | FP32 | 1.59 GB | 2.68 GB (+0.56 KV) | 3.25 GB (+1.12 KV) | 4.37 GB (+2.25 KV) |
| Q3_K_M 3.91 bpw | FP16 | 1.59 GB | 2.4 GB (+0.28 KV) | 2.68 GB (+0.56 KV) | 3.25 GB (+1.12 KV) |
| Q3_K_M 3.91 bpw | Q8_0 | 1.59 GB | 2.28 GB (+0.15 KV) | 2.43 GB (+0.31 KV) | 2.74 GB (+0.62 KV) |
| Q3_K_M 3.91 bpw | FP8 (Exp) | 1.59 GB | 2.26 GB (+0.14 KV) | 2.4 GB (+0.28 KV) | 2.68 GB (+0.56 KV) |
| Q3_K_M 3.91 bpw | Q4_0 (Exp) | 1.59 GB | 2.21 GB (+0.08 KV) | 2.29 GB (+0.17 KV) | 2.46 GB (+0.34 KV) |
| Q2_K 2.63 bpw | FP32 | 1.07 GB | 2.16 GB (+0.56 KV) | 2.73 GB (+1.12 KV) | 3.85 GB (+2.25 KV) |
| Q2_K 2.63 bpw | FP16 | 1.07 GB | 1.88 GB (+0.28 KV) | 2.16 GB (+0.56 KV) | 2.73 GB (+1.12 KV) |
| Q2_K 2.63 bpw | Q8_0 | 1.07 GB | 1.76 GB (+0.15 KV) | 1.91 GB (+0.31 KV) | 2.22 GB (+0.62 KV) |
| Q2_K 2.63 bpw | FP8 (Exp) | 1.07 GB | 1.74 GB (+0.14 KV) | 1.88 GB (+0.28 KV) | 2.16 GB (+0.56 KV) |
| Q2_K 2.63 bpw | Q4_0 (Exp) | 1.07 GB | 1.69 GB (+0.08 KV) | 1.77 GB (+0.17 KV) | 1.94 GB (+0.34 KV) |
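Total VRAM = Model Weights + KV Cache + 0.53 GB overhead. The "+KV" figure in each cell is the KV cache size at that context length. Because of rounding, some totals differ from the sum of the displayed figures by 0.01 GB. Actual usage may vary by about ±5% depending on the inference engine and its optimizations.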
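The table's columns follow two simple scaling patterns, which can be used to estimate combinations it does not list.

- **Weight size** scales linearly with bits per weight (bpw), starting from the 6.51 GB FP16 figure.
- **KV cache size** grows linearly with context length, since each doubling of context doubles the KV figure. It also scales with a per-format factor r_c.

The r_c factors below come from this table, as the ratio of each format's KV size to the FP16 KV size. They are not taken from any inference engine's documentation, so treat this as a rough sketch:

```latex
\begin{aligned}
\text{Total}(q, c, n) &= W(q) + \text{KV}(c, n) + 0.53~\text{GB} \\
W(q) &\approx 6.51~\text{GB} \times \frac{b_q}{16} \\
\text{KV}(c, n) &\approx 0.28~\text{GB} \times \frac{n}{8192} \times r_c \\
r_{\text{FP32}} = 2,\; r_{\text{FP16}} = 1,\; r_{\text{FP8}} &= 0.5,\; r_{\text{Q8\_0}} \approx 0.55,\; r_{\text{Q4\_0}} \approx 0.30
\end{aligned}
```

Worked example: Q4_K_M weights (b_q = 4.65) with a Q8_0 cache at n = 32768 give:

- Weights: W ≈ 6.51 × 4.65 / 16 ≈ 1.89 GB.
- KV cache: KV ≈ 0.28 × 4 × 0.55 ≈ 0.62 GB.
- Total: ≈ 3.04 GB, which matches the table.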
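The following sketch checks whether a combination fits a specific GPU. It hard-codes the table's figures and applies the upper end of the ±5% variance as a safety margin. Some names are illustrative assumptions, not part of any real tool's API:

- the dictionary names
- the `fits` and `options_that_fit` helpers
- the 4 GB budget in the usage example

Only the three listed context lengths (8K, 16K, 32K) are supported.

```python
"""Rough VRAM fit check built from the table's figures (a sketch, not an official calculator)."""
from __future__ import annotations

OVERHEAD_GB = 0.53  # CUDA context + activations, per the table

# Model weight sizes (GB) by quantization, copied from the table.
WEIGHTS_GB = {
    "FP16": 6.51,
    "Q8_0": 3.26,
    "Q4_K_M": 1.89,
    "Q4_K_S": 1.86,
    "Q3_K_M": 1.59,
    "Q2_K": 1.07,
}

# KV cache sizes (GB) by cache format and context length, copied from the table.
KV_GB = {
    "FP32": {8192: 0.56, 16384: 1.12, 32768: 2.25},
    "FP16": {8192: 0.28, 16384: 0.56, 32768: 1.12},
    "Q8_0": {8192: 0.15, 16384: 0.31, 32768: 0.62},
    "FP8": {8192: 0.14, 16384: 0.28, 32768: 0.56},
    "Q4_0": {8192: 0.08, 16384: 0.17, 32768: 0.34},
}


def total_vram_gb(quant: str, cache: str, context: int) -> float:
    """Total VRAM = model weights + KV cache + fixed overhead."""
    return WEIGHTS_GB[quant] + KV_GB[cache][context] + OVERHEAD_GB


def fits(quant: str, cache: str, context: int, budget_gb: float, margin: float = 0.05) -> bool:
    """True if the estimate, inflated by the margin (default +5%), stays within the budget."""
    return total_vram_gb(quant, cache, context) * (1 + margin) <= budget_gb


def options_that_fit(budget_gb: float, context: int, margin: float = 0.05) -> list[tuple[str, str]]:
    """All (quantization, cache format) pairs that fit, smallest total first."""
    combos = [
        (q, c)
        for q in WEIGHTS_GB
        for c in KV_GB
        if fits(q, c, context, budget_gb, margin)
    ]
    return sorted(combos, key=lambda qc: total_vram_gb(qc[0], qc[1], context))


if __name__ == "__main__":
    # Hypothetical 4 GB GPU running at 32K context.
    for quant, cache in options_that_fit(4.0, 32768):
        print(f"{quant:7} + {cache:5} cache: {total_vram_gb(quant, cache, 32768):.2f} GB")
```

Using the top of the ±5% range as a margin errs on the side of leaving headroom. Pass `margin=0` to compare the raw table estimates instead.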