VRAM usage at a 4K context for every combination of weight quantization and KV cache format. Base overhead: 0.52 GB (CUDA context + activations).

| Quantization | Bits per Weight | Cache Format | Model Weights | KV Cache (4K) | Total VRAM (4K) |
|---|---|---|---|---|---|
| FP16 | 16.0 | FP32 | 3.15 GB | 0.22 GB | 3.88 GB |
| FP16 | 16.0 | FP16 | 3.15 GB | 0.11 GB | 3.77 GB |
| FP16 | 16.0 | Q8_0 | 3.15 GB | 0.06 GB | 3.73 GB |
| FP16 | 16.0 | FP8 (Exp) | 3.15 GB | 0.05 GB | 3.72 GB |
| FP16 | 16.0 | Q4_0 (Exp) | 3.15 GB | 0.03 GB | 3.70 GB |
| Q8_0 | 8.0 | FP32 | 1.58 GB | 0.22 GB | 2.31 GB |
| Q8_0 | 8.0 | FP16 | 1.58 GB | 0.11 GB | 2.20 GB |
| Q8_0 | 8.0 | Q8_0 | 1.58 GB | 0.06 GB | 2.15 GB |
| Q8_0 | 8.0 | FP8 (Exp) | 1.58 GB | 0.05 GB | 2.14 GB |
| Q8_0 | 8.0 | Q4_0 (Exp) | 1.58 GB | 0.03 GB | 2.12 GB |
| Q4_K_M | 4.65 | FP32 | 0.92 GB | 0.22 GB | 1.65 GB |
| Q4_K_M | 4.65 | FP16 | 0.92 GB | 0.11 GB | 1.54 GB |
| Q4_K_M | 4.65 | Q8_0 | 0.92 GB | 0.06 GB | 1.49 GB |
| Q4_K_M | 4.65 | FP8 (Exp) | 0.92 GB | 0.05 GB | 1.49 GB |
| Q4_K_M | 4.65 | Q4_0 (Exp) | 0.92 GB | 0.03 GB | 1.46 GB |
| Q4_K_S | 4.58 | FP32 | 0.90 GB | 0.22 GB | 1.64 GB |
| Q4_K_S | 4.58 | FP16 | 0.90 GB | 0.11 GB | 1.53 GB |
| Q4_K_S | 4.58 | Q8_0 | 0.90 GB | 0.06 GB | 1.48 GB |
| Q4_K_S | 4.58 | FP8 (Exp) | 0.90 GB | 0.05 GB | 1.47 GB |
| Q4_K_S | 4.58 | Q4_0 (Exp) | 0.90 GB | 0.03 GB | 1.45 GB |
| Q3_K_M | 3.91 | FP32 | 0.77 GB | 0.22 GB | 1.50 GB |
| Q3_K_M | 3.91 | FP16 | 0.77 GB | 0.11 GB | 1.39 GB |
| Q3_K_M | 3.91 | Q8_0 | 0.77 GB | 0.06 GB | 1.34 GB |
| Q3_K_M | 3.91 | FP8 (Exp) | 0.77 GB | 0.05 GB | 1.34 GB |
| Q3_K_M | 3.91 | Q4_0 (Exp) | 0.77 GB | 0.03 GB | 1.32 GB |
| Q2_K | 2.63 | FP32 | 0.52 GB | 0.22 GB | 1.25 GB |
| Q2_K | 2.63 | FP16 | 0.52 GB | 0.11 GB | 1.14 GB |
| Q2_K | 2.63 | Q8_0 | 0.52 GB | 0.06 GB | 1.09 GB |
| Q2_K | 2.63 | FP8 (Exp) | 0.52 GB | 0.05 GB | 1.09 GB |
| Q2_K | 2.63 | Q4_0 (Exp) | 0.52 GB | 0.03 GB | 1.07 GB |

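Total VRAM = Model Weights + KV Cache + 0.52 GB overhead. All figures are rounded to two decimals, so a listed total can differ from the sum of the displayed values by 0.01 GB. Actual usage may vary by about ±5% depending on the inference engine and its optimizations.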
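
The formula above can be sketched in a few lines of Python. The constants are copied from the table; the function names and the linear bits-per-weight scaling in `weights_gb` are assumptions for illustration (the scaling is inferred from the Model Weights column, not stated by the source), so treat the result as an estimate that should land within about 0.01 GB of the listed figures.

```python
# Values copied from the table. The bpw scaling rule below is an assumption
# inferred from the Model Weights column, not a documented formula.
OVERHEAD_GB = 0.52        # CUDA context + activations
FP16_WEIGHTS_GB = 3.15    # Model Weights at 16.0 bits per weight

# KV cache size at 4K context, by cache format (GB).
# "FP8" and "Q4_0" correspond to the rows marked "(Exp)" in the table.
KV_CACHE_4K_GB = {
    "FP32": 0.22,
    "FP16": 0.11,
    "Q8_0": 0.06,
    "FP8": 0.05,
    "Q4_0": 0.03,
}


def weights_gb(bpw: float) -> float:
    """Estimate weight size by scaling the FP16 size linearly with bits per weight."""
    return FP16_WEIGHTS_GB * bpw / 16.0


def total_vram_gb(model_weights_gb: float, cache_format: str) -> float:
    """Total VRAM = Model Weights + KV Cache + overhead."""
    return model_weights_gb + KV_CACHE_4K_GB[cache_format] + OVERHEAD_GB


if __name__ == "__main__":
    q4_weights = weights_gb(4.65)  # Q4_K_M
    print(f"Q4_K_M weights: {q4_weights:.2f} GB")
    print(f"Q4_K_M + FP16 cache: {total_vram_gb(q4_weights, 'FP16'):.2f} GB")
```

Passing the table's own Model Weights value to `total_vram_gb` instead of the `weights_gb` estimate reproduces the formula exactly as written.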
Use our calculator to see if this model fits your specific hardware configuration.
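
For a quick offline check, a fit test can be approximated as below. This is a minimal sketch and not the calculator itself: `fits_in_vram`, `fitting_rows`, and the 2.0 GB budget are hypothetical names and values chosen for illustration, and the 5% padding simply applies the upper end of the ±5% variation noted above. The totals are a handful of rows copied from the table.

```python
# Total VRAM at 4K context (GB) for a few (quantization, cache format) rows,
# copied from the table.
TOTAL_VRAM_4K_GB = {
    ("FP16", "FP16"): 3.77,
    ("Q8_0", "FP16"): 2.20,
    ("Q4_K_M", "Q8_0"): 1.49,
    ("Q2_K", "Q4_0"): 1.07,
}


def fits_in_vram(total_gb: float, available_gb: float, margin: float = 0.05) -> bool:
    """Return True if the estimate, padded by `margin`, fits in `available_gb`."""
    return total_gb * (1.0 + margin) <= available_gb


def fitting_rows(available_gb: float) -> list[tuple[str, str]]:
    """List the table rows whose padded total fits in the given VRAM budget."""
    return [
        combo
        for combo, total_gb in TOTAL_VRAM_4K_GB.items()
        if fits_in_vram(total_gb, available_gb)
    ]


if __name__ == "__main__":
    # Hypothetical budget: VRAM actually free on the GPU, which is usually
    # less than the card's nominal capacity.
    for quant, cache in fitting_rows(2.0):
        print(f"{quant} weights with {cache} cache fits")
```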