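Estimated VRAM usage for every combination of weight quantization and KV cache format. Each context column shows the total VRAM, with the KV cache share in parentheses. Base overhead: 0.52 GB (CUDA context + activations).
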
| Quantization (bpw) | KV Cache Format | Model Weights | Total at 8K Context | Total at 16K Context | Total at 32K Context |
|---|---|---|---|---|---|
| FP16 16.0 bpw | FP32 | 3.15 GB | 4.1 GB (+0.44 KV) | 4.54 GB (+0.88 KV) | 5.42 GB (+1.75 KV) |
| FP16 16.0 bpw | FP16 | 3.15 GB | 3.88 GB (+0.22 KV) | 4.1 GB (+0.44 KV) | 4.54 GB (+0.88 KV) |
| FP16 16.0 bpw | Q8_0 | 3.15 GB | 3.79 GB (+0.12 KV) | 3.91 GB (+0.24 KV) | 4.15 GB (+0.48 KV) |
| FP16 16.0 bpw | FP8 (Exp) | 3.15 GB | 3.77 GB (+0.11 KV) | 3.88 GB (+0.22 KV) | 4.1 GB (+0.44 KV) |
| FP16 16.0 bpw | Q4_0 (Exp) | 3.15 GB | 3.73 GB (+0.07 KV) | 3.8 GB (+0.13 KV) | 3.93 GB (+0.26 KV) |
| Q8_0 8.0 bpw | FP32 | 1.58 GB | 2.53 GB (+0.44 KV) | 2.97 GB (+0.88 KV) | 3.84 GB (+1.75 KV) |
| Q8_0 8.0 bpw | FP16 | 1.58 GB | 2.31 GB (+0.22 KV) | 2.53 GB (+0.44 KV) | 2.97 GB (+0.88 KV) |
| Q8_0 8.0 bpw | Q8_0 | 1.58 GB | 2.21 GB (+0.12 KV) | 2.33 GB (+0.24 KV) | 2.57 GB (+0.48 KV) |
| Q8_0 8.0 bpw | FP8 (Exp) | 1.58 GB | 2.2 GB (+0.11 KV) | 2.31 GB (+0.22 KV) | 2.53 GB (+0.44 KV) |
| Q8_0 8.0 bpw | Q4_0 (Exp) | 1.58 GB | 2.16 GB (+0.07 KV) | 2.22 GB (+0.13 KV) | 2.35 GB (+0.26 KV) |
| Q4_K_M 4.65 bpw | FP32 | 0.92 GB | 1.87 GB (+0.44 KV) | 2.31 GB (+0.88 KV) | 3.18 GB (+1.75 KV) |
| Q4_K_M 4.65 bpw | FP16 | 0.92 GB | 1.65 GB (+0.22 KV) | 1.87 GB (+0.44 KV) | 2.31 GB (+0.88 KV) |
| Q4_K_M 4.65 bpw | Q8_0 | 0.92 GB | 1.55 GB (+0.12 KV) | 1.67 GB (+0.24 KV) | 1.91 GB (+0.48 KV) |
| Q4_K_M 4.65 bpw | FP8 (Exp) | 0.92 GB | 1.54 GB (+0.11 KV) | 1.65 GB (+0.22 KV) | 1.87 GB (+0.44 KV) |
| Q4_K_M 4.65 bpw | Q4_0 (Exp) | 0.92 GB | 1.5 GB (+0.07 KV) | 1.56 GB (+0.13 KV) | 1.69 GB (+0.26 KV) |
| Q4_K_S 4.58 bpw | FP32 | 0.9 GB | 1.85 GB (+0.44 KV) | 2.29 GB (+0.88 KV) | 3.17 GB (+1.75 KV) |
| Q4_K_S 4.58 bpw | FP16 | 0.9 GB | 1.64 GB (+0.22 KV) | 1.85 GB (+0.44 KV) | 2.29 GB (+0.88 KV) |
| Q4_K_S 4.58 bpw | Q8_0 | 0.9 GB | 1.54 GB (+0.12 KV) | 1.66 GB (+0.24 KV) | 1.9 GB (+0.48 KV) |
| Q4_K_S 4.58 bpw | FP8 (Exp) | 0.9 GB | 1.53 GB (+0.11 KV) | 1.64 GB (+0.22 KV) | 1.85 GB (+0.44 KV) |
| Q4_K_S 4.58 bpw | Q4_0 (Exp) | 0.9 GB | 1.48 GB (+0.07 KV) | 1.55 GB (+0.13 KV) | 1.68 GB (+0.26 KV) |
| Q3_K_M 3.91 bpw | FP32 | 0.77 GB | 1.72 GB (+0.44 KV) | 2.16 GB (+0.88 KV) | 3.03 GB (+1.75 KV) |
| Q3_K_M 3.91 bpw | FP16 | 0.77 GB | 1.5 GB (+0.22 KV) | 1.72 GB (+0.44 KV) | 2.16 GB (+0.88 KV) |
| Q3_K_M 3.91 bpw | Q8_0 | 0.77 GB | 1.41 GB (+0.12 KV) | 1.53 GB (+0.24 KV) | 1.77 GB (+0.48 KV) |
| Q3_K_M 3.91 bpw | FP8 (Exp) | 0.77 GB | 1.39 GB (+0.11 KV) | 1.5 GB (+0.22 KV) | 1.72 GB (+0.44 KV) |
| Q3_K_M 3.91 bpw | Q4_0 (Exp) | 0.77 GB | 1.35 GB (+0.07 KV) | 1.42 GB (+0.13 KV) | 1.55 GB (+0.26 KV) |
| Q2_K 2.63 bpw | FP32 | 0.52 GB | 1.47 GB (+0.44 KV) | 1.91 GB (+0.88 KV) | 2.78 GB (+1.75 KV) |
| Q2_K 2.63 bpw | FP16 | 0.52 GB | 1.25 GB (+0.22 KV) | 1.47 GB (+0.44 KV) | 1.91 GB (+0.88 KV) |
| Q2_K 2.63 bpw | Q8_0 | 0.52 GB | 1.15 GB (+0.12 KV) | 1.27 GB (+0.24 KV) | 1.51 GB (+0.48 KV) |
| Q2_K 2.63 bpw | FP8 (Exp) | 0.52 GB | 1.14 GB (+0.11 KV) | 1.25 GB (+0.22 KV) | 1.47 GB (+0.44 KV) |
| Q2_K 2.63 bpw | Q4_0 (Exp) | 0.52 GB | 1.1 GB (+0.07 KV) | 1.16 GB (+0.13 KV) | 1.3 GB (+0.26 KV) |
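
Total VRAM = Model Weights + KV Cache + 0.52 GB overhead. Displayed totals can differ by 0.01 GB from the sum of the rounded components shown in each row. Actual usage may vary by about ±5% depending on the inference engine and its optimizations.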
Use our calculator to see if this model fits your specific hardware configuration.
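
The total-VRAM formula above can be sketched in a few lines of Python. This is a minimal illustration, not the calculator's actual implementation: the function names `total_vram_gb` and `fits` are hypothetical, and the fit check assumes the pessimistic end of the stated ±5% range. Inputs are the rounded values from the table, so results may be off by 0.01 GB from the displayed totals.

```python
# Sketch of the formula: total = model weights + KV cache + fixed overhead.
# Function names are hypothetical; the constants come from the text above.

BASE_OVERHEAD_GB = 0.52  # CUDA context + activations
TOLERANCE = 0.05         # stated variation of about +/-5%


def total_vram_gb(weights_gb: float, kv_cache_gb: float) -> float:
    """Return the estimated total VRAM in GB for one table row."""
    return weights_gb + kv_cache_gb + BASE_OVERHEAD_GB


def fits(weights_gb: float, kv_cache_gb: float, available_gb: float) -> bool:
    """Pessimistic fit check: compare the +5% upper bound to available VRAM."""
    upper_bound = total_vram_gb(weights_gb, kv_cache_gb) * (1 + TOLERANCE)
    return upper_bound <= available_gb


if __name__ == "__main__":
    # Q4_K_M weights with an FP16 cache at 32K context (values from the table).
    estimate = total_vram_gb(weights_gb=0.92, kv_cache_gb=0.88)
    print(f"Estimated total: {estimate:.2f} GB")
    print("Fits in 4 GB:", fits(0.92, 0.88, available_gb=4.0))
```

Checking against the upper bound rather than the nominal estimate is a deliberately conservative choice: a configuration that only fits at the nominal figure may still run out of memory on some engines.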