Estimated VRAM usage for every combination of weight quantization and KV cache format. Each total includes a fixed base overhead of 0.51 GB (CUDA context and activations).
| Weight Quantization (bpw) | KV Cache Format | Model Weights | Total VRAM @ 8K Context (KV cache) | Total VRAM @ 16K Context (KV cache) | Total VRAM @ 32K Context (KV cache) |
|---|---|---|---|---|---|
| FP16 16.0 bpw | FP32 | 1.26 GB | 3.52 GB (+1.75 KV) | 5.27 GB (+3.5 KV) | 8.77 GB (+7.0 KV) |
| FP16 16.0 bpw | FP16 | 1.26 GB | 2.64 GB (+0.88 KV) | 3.52 GB (+1.75 KV) | 5.27 GB (+3.5 KV) |
| FP16 16.0 bpw | Q8_0 | 1.26 GB | 2.25 GB (+0.48 KV) | 2.73 GB (+0.96 KV) | 3.69 GB (+1.93 KV) |
| FP16 16.0 bpw | FP8 (Exp) | 1.26 GB | 2.2 GB (+0.44 KV) | 2.64 GB (+0.88 KV) | 3.52 GB (+1.75 KV) |
| FP16 16.0 bpw | Q4_0 (Exp) | 1.26 GB | 2.03 GB (+0.26 KV) | 2.29 GB (+0.53 KV) | 2.82 GB (+1.05 KV) |
| Q8_0 8.0 bpw | FP32 | 0.63 GB | 2.89 GB (+1.75 KV) | 4.64 GB (+3.5 KV) | 8.14 GB (+7.0 KV) |
| Q8_0 8.0 bpw | FP16 | 0.63 GB | 2.01 GB (+0.88 KV) | 2.89 GB (+1.75 KV) | 4.64 GB (+3.5 KV) |
| Q8_0 8.0 bpw | Q8_0 | 0.63 GB | 1.62 GB (+0.48 KV) | 2.1 GB (+0.96 KV) | 3.06 GB (+1.93 KV) |
| Q8_0 8.0 bpw | FP8 (Exp) | 0.63 GB | 1.57 GB (+0.44 KV) | 2.01 GB (+0.88 KV) | 2.89 GB (+1.75 KV) |
| Q8_0 8.0 bpw | Q4_0 (Exp) | 0.63 GB | 1.4 GB (+0.26 KV) | 1.66 GB (+0.53 KV) | 2.19 GB (+1.05 KV) |
| Q4_K_M 4.65 bpw | FP32 | 0.37 GB | 2.62 GB (+1.75 KV) | 4.37 GB (+3.5 KV) | 7.87 GB (+7.0 KV) |
| Q4_K_M 4.65 bpw | FP16 | 0.37 GB | 1.75 GB (+0.88 KV) | 2.62 GB (+1.75 KV) | 4.37 GB (+3.5 KV) |
| Q4_K_M 4.65 bpw | Q8_0 | 0.37 GB | 1.35 GB (+0.48 KV) | 1.83 GB (+0.96 KV) | 2.8 GB (+1.93 KV) |
| Q4_K_M 4.65 bpw | FP8 (Exp) | 0.37 GB | 1.31 GB (+0.44 KV) | 1.75 GB (+0.88 KV) | 2.62 GB (+1.75 KV) |
| Q4_K_M 4.65 bpw | Q4_0 (Exp) | 0.37 GB | 1.13 GB (+0.26 KV) | 1.4 GB (+0.53 KV) | 1.92 GB (+1.05 KV) |
| Q4_K_S 4.58 bpw | FP32 | 0.36 GB | 2.62 GB (+1.75 KV) | 4.37 GB (+3.5 KV) | 7.87 GB (+7.0 KV) |
| Q4_K_S 4.58 bpw | FP16 | 0.36 GB | 1.74 GB (+0.88 KV) | 2.62 GB (+1.75 KV) | 4.37 GB (+3.5 KV) |
| Q4_K_S 4.58 bpw | Q8_0 | 0.36 GB | 1.35 GB (+0.48 KV) | 1.83 GB (+0.96 KV) | 2.79 GB (+1.93 KV) |
| Q4_K_S 4.58 bpw | FP8 (Exp) | 0.36 GB | 1.3 GB (+0.44 KV) | 1.74 GB (+0.88 KV) | 2.62 GB (+1.75 KV) |
| Q4_K_S 4.58 bpw | Q4_0 (Exp) | 0.36 GB | 1.13 GB (+0.26 KV) | 1.39 GB (+0.53 KV) | 1.92 GB (+1.05 KV) |
| Q3_K_M 3.91 bpw | FP32 | 0.31 GB | 2.56 GB (+1.75 KV) | 4.31 GB (+3.5 KV) | 7.81 GB (+7.0 KV) |
| Q3_K_M 3.91 bpw | FP16 | 0.31 GB | 1.69 GB (+0.88 KV) | 2.56 GB (+1.75 KV) | 4.31 GB (+3.5 KV) |
| Q3_K_M 3.91 bpw | Q8_0 | 0.31 GB | 1.3 GB (+0.48 KV) | 1.78 GB (+0.96 KV) | 2.74 GB (+1.93 KV) |
| Q3_K_M 3.91 bpw | FP8 (Exp) | 0.31 GB | 1.25 GB (+0.44 KV) | 1.69 GB (+0.88 KV) | 2.56 GB (+1.75 KV) |
| Q3_K_M 3.91 bpw | Q4_0 (Exp) | 0.31 GB | 1.08 GB (+0.26 KV) | 1.34 GB (+0.53 KV) | 1.86 GB (+1.05 KV) |
| Q2_K 2.63 bpw | FP32 | 0.21 GB | 2.46 GB (+1.75 KV) | 4.21 GB (+3.5 KV) | 7.71 GB (+7.0 KV) |
| Q2_K 2.63 bpw | FP16 | 0.21 GB | 1.59 GB (+0.88 KV) | 2.46 GB (+1.75 KV) | 4.21 GB (+3.5 KV) |
| Q2_K 2.63 bpw | Q8_0 | 0.21 GB | 1.19 GB (+0.48 KV) | 1.68 GB (+0.96 KV) | 2.64 GB (+1.93 KV) |
| Q2_K 2.63 bpw | FP8 (Exp) | 0.21 GB | 1.15 GB (+0.44 KV) | 1.59 GB (+0.88 KV) | 2.46 GB (+1.75 KV) |
| Q2_K 2.63 bpw | Q4_0 (Exp) | 0.21 GB | 0.98 GB (+0.26 KV) | 1.24 GB (+0.53 KV) | 1.76 GB (+1.05 KV) |
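Total VRAM = Model Weights + KV Cache + 0.51 GB overhead. The KV cache grows linearly with context length, so doubling the context doubles its footprint. Cache formats marked "(Exp)" are experimental. Actual usage may vary by about ±5% depending on the inference engine and its optimizations.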
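The totals in the table follow directly from that formula. The sketch below reproduces them in Python under a few assumptions drawn from the table itself: weight size scales as bpw / 16 from the 1.26 GB FP16 figure, an FP16 KV cache costs 0.875 GB per 8K of context, and each other cache format scales that FP16 KV size by a fixed ratio read off the KV column (about 0.55 for Q8_0 and 0.3 for Q4_0). The constant and function names are hypothetical and not part of any inference engine's API. The table's figures appear to be truncated to two decimals, so results may differ from it by up to about 0.01 GB.

```python
# Minimal sketch reproducing the table's VRAM estimates.
# Assumptions (inferred from the table, not from any library):
#   - FP16 weights take 1.26 GB; other quantizations scale by bpw / 16.
#   - An FP16 KV cache takes 0.875 GB per 8K of context and grows linearly.
#   - Other cache formats scale the FP16 KV size by the ratios below.

BASE_OVERHEAD_GB = 0.51      # CUDA context + activations
FP16_WEIGHTS_GB = 1.26       # model weights at 16.0 bpw
FP16_KV_GB_PER_8K = 0.875    # FP16 KV cache for an 8K context

# KV cache size relative to an FP16 cache, as implied by the table.
CACHE_RATIO = {
    "FP32": 2.0,
    "FP16": 1.0,
    "Q8_0": 0.55,
    "FP8": 0.5,
    "Q4_0": 0.3,
}


def weights_gb(bpw: float) -> float:
    """Weight memory for a given bits-per-weight, scaled from FP16."""
    return FP16_WEIGHTS_GB * bpw / 16.0


def kv_cache_gb(context_k: int, cache_format: str) -> float:
    """KV cache memory for a context of `context_k` thousand tokens."""
    return FP16_KV_GB_PER_8K * (context_k / 8) * CACHE_RATIO[cache_format]


def total_vram_gb(bpw: float, cache_format: str, context_k: int) -> float:
    """Total VRAM = weights + KV cache + fixed base overhead."""
    return weights_gb(bpw) + kv_cache_gb(context_k, cache_format) + BASE_OVERHEAD_GB


if __name__ == "__main__":
    # Example: Q4_K_M weights (4.65 bpw) with a Q8_0 cache at each context size.
    for ctx in (8, 16, 32):
        kv = kv_cache_gb(ctx, "Q8_0")
        total = total_vram_gb(4.65, "Q8_0", ctx)
        print(f"{ctx}K context: {total:.2f} GB (+{kv:.2f} KV)")
```

To estimate a different model, only the FP16 weight size and the per-8K KV cost would need to change; the linear structure of the formula stays the same.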