Estimated VRAM usage for every combination of weight quantization and KV cache format. Each total includes a base overhead of 0.51 GB (CUDA context + activations).

| Quantization | Cache Format | Model Weights | 8K Context: Total (+KV Cache) | 16K Context: Total (+KV Cache) | 32K Context: Total (+KV Cache) |
|---|---|---|---|---|---|
| FP16 16.0 bpw | FP32 | 1.05 GB | 1.74 GB (+0.19 KV) | 1.93 GB (+0.38 KV) | 2.31 GB (+0.75 KV) |
| FP16 16.0 bpw | FP16 | 1.05 GB | 1.65 GB (+0.09 KV) | 1.74 GB (+0.19 KV) | 1.93 GB (+0.38 KV) |
| FP16 16.0 bpw | Q8_0 | 1.05 GB | 1.61 GB (+0.05 KV) | 1.66 GB (+0.10 KV) | 1.76 GB (+0.21 KV) |
| FP16 16.0 bpw | FP8 (Exp) | 1.05 GB | 1.60 GB (+0.05 KV) | 1.65 GB (+0.09 KV) | 1.74 GB (+0.19 KV) |
| FP16 16.0 bpw | Q4_0 (Exp) | 1.05 GB | 1.58 GB (+0.03 KV) | 1.61 GB (+0.06 KV) | 1.67 GB (+0.11 KV) |
| Q8_0 8.0 bpw | FP32 | 0.53 GB | 1.22 GB (+0.19 KV) | 1.41 GB (+0.38 KV) | 1.78 GB (+0.75 KV) |
| Q8_0 8.0 bpw | FP16 | 0.53 GB | 1.12 GB (+0.09 KV) | 1.22 GB (+0.19 KV) | 1.41 GB (+0.38 KV) |
| Q8_0 8.0 bpw | Q8_0 | 0.53 GB | 1.08 GB (+0.05 KV) | 1.13 GB (+0.10 KV) | 1.24 GB (+0.21 KV) |
| Q8_0 8.0 bpw | FP8 (Exp) | 0.53 GB | 1.08 GB (+0.05 KV) | 1.12 GB (+0.09 KV) | 1.22 GB (+0.19 KV) |
| Q8_0 8.0 bpw | Q4_0 (Exp) | 0.53 GB | 1.06 GB (+0.03 KV) | 1.09 GB (+0.06 KV) | 1.14 GB (+0.11 KV) |
| Q4_K_M 4.65 bpw | FP32 | 0.31 GB | 1.00 GB (+0.19 KV) | 1.19 GB (+0.38 KV) | 1.56 GB (+0.75 KV) |
| Q4_K_M 4.65 bpw | FP16 | 0.31 GB | 0.90 GB (+0.09 KV) | 1.00 GB (+0.19 KV) | 1.19 GB (+0.38 KV) |
| Q4_K_M 4.65 bpw | Q8_0 | 0.31 GB | 0.86 GB (+0.05 KV) | 0.91 GB (+0.10 KV) | 1.02 GB (+0.21 KV) |
| Q4_K_M 4.65 bpw | FP8 (Exp) | 0.31 GB | 0.86 GB (+0.05 KV) | 0.90 GB (+0.09 KV) | 1.00 GB (+0.19 KV) |
| Q4_K_M 4.65 bpw | Q4_0 (Exp) | 0.31 GB | 0.84 GB (+0.03 KV) | 0.87 GB (+0.06 KV) | 0.92 GB (+0.11 KV) |
| Q4_K_S 4.58 bpw | FP32 | 0.30 GB | 0.99 GB (+0.19 KV) | 1.18 GB (+0.38 KV) | 1.56 GB (+0.75 KV) |
| Q4_K_S 4.58 bpw | FP16 | 0.30 GB | 0.90 GB (+0.09 KV) | 0.99 GB (+0.19 KV) | 1.18 GB (+0.38 KV) |
| Q4_K_S 4.58 bpw | Q8_0 | 0.30 GB | 0.86 GB (+0.05 KV) | 0.91 GB (+0.10 KV) | 1.01 GB (+0.21 KV) |
| Q4_K_S 4.58 bpw | FP8 (Exp) | 0.30 GB | 0.85 GB (+0.05 KV) | 0.90 GB (+0.09 KV) | 0.99 GB (+0.19 KV) |
| Q4_K_S 4.58 bpw | Q4_0 (Exp) | 0.30 GB | 0.83 GB (+0.03 KV) | 0.86 GB (+0.06 KV) | 0.92 GB (+0.11 KV) |
| Q3_K_M 3.91 bpw | FP32 | 0.26 GB | 0.95 GB (+0.19 KV) | 1.14 GB (+0.38 KV) | 1.51 GB (+0.75 KV) |
| Q3_K_M 3.91 bpw | FP16 | 0.26 GB | 0.86 GB (+0.09 KV) | 0.95 GB (+0.19 KV) | 1.14 GB (+0.38 KV) |
| Q3_K_M 3.91 bpw | Q8_0 | 0.26 GB | 0.81 GB (+0.05 KV) | 0.86 GB (+0.10 KV) | 0.97 GB (+0.21 KV) |
| Q3_K_M 3.91 bpw | FP8 (Exp) | 0.26 GB | 0.81 GB (+0.05 KV) | 0.86 GB (+0.09 KV) | 0.95 GB (+0.19 KV) |
| Q3_K_M 3.91 bpw | Q4_0 (Exp) | 0.26 GB | 0.79 GB (+0.03 KV) | 0.82 GB (+0.06 KV) | 0.87 GB (+0.11 KV) |
| Q2_K 2.63 bpw | FP32 | 0.17 GB | 0.87 GB (+0.19 KV) | 1.05 GB (+0.38 KV) | 1.43 GB (+0.75 KV) |
| Q2_K 2.63 bpw | FP16 | 0.17 GB | 0.77 GB (+0.09 KV) | 0.87 GB (+0.19 KV) | 1.05 GB (+0.38 KV) |
| Q2_K 2.63 bpw | Q8_0 | 0.17 GB | 0.73 GB (+0.05 KV) | 0.78 GB (+0.10 KV) | 0.88 GB (+0.21 KV) |
| Q2_K 2.63 bpw | FP8 (Exp) | 0.17 GB | 0.72 GB (+0.05 KV) | 0.77 GB (+0.09 KV) | 0.87 GB (+0.19 KV) |
| Q2_K 2.63 bpw | Q4_0 (Exp) | 0.17 GB | 0.71 GB (+0.03 KV) | 0.73 GB (+0.06 KV) | 0.79 GB (+0.11 KV) |
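
Total VRAM = Model Weights + KV Cache + 0.51 GB overhead. The value in parentheses in each context column is the KV cache size in GB. All figures are rounded to two decimals, so a listed total can differ from the sum of its rounded parts by 0.01 GB. Actual usage may vary by ±5% depending on the inference engine and its optimizations.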
Use our calculator to see if this model fits your specific hardware configuration.
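
As a rough illustration of the formula above, the total can be computed by hand and compared against a card's memory. This is a minimal sketch, not the calculator's actual logic: the function names, the `available_gb` parameter, and the choice to pad the estimate by the full 5% are assumptions made for the example.

```python
# Base overhead stated above (CUDA context + activations).
OVERHEAD_GB = 0.51


def total_vram_gb(weights_gb: float, kv_cache_gb: float,
                  overhead_gb: float = OVERHEAD_GB) -> float:
    """Total VRAM = Model Weights + KV Cache + overhead, all in GB."""
    return weights_gb + kv_cache_gb + overhead_gb


def fits(total_gb: float, available_gb: float, margin: float = 0.05) -> bool:
    """Hypothetical check: pad the estimate by the +5% variance before comparing.

    available_gb is the free VRAM on the target GPU (an assumed input).
    """
    return total_gb * (1 + margin) <= available_gb


# FP16 weights with an FP16 cache at 8K context, values taken from the table:
# 1.05 GB weights + 0.09 GB KV cache + 0.51 GB overhead.
total = total_vram_gb(1.05, 0.09)
print(f"{total:.2f} GB")       # 1.65 GB, matching the table row
print(fits(total, 2.0))        # True: 1.65 * 1.05 = 1.7325 <= 2.0
```

Padding by the upper end of the ±5% range is a conservative choice; a tighter or looser margin may suit a particular inference engine better.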