Quantization by definition lowers memory requirements - instead of storing weights as f16, you store them as q8, q6, q4, or q2, which makes the weights roughly 2x, ~2.7x, 4x, or 8x smaller respectively.
That doesn’t necessarily translate into the same reduction in total memory, because intermediate compute tensors and the KV cache also take up space, but those can be quantized too.
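
To make the arithmetic concrete, here is a rough back-of-the-envelope sketch. The model shape (7B parameters, 32 layers, 32 KV heads, head dimension 128, 4096-token context) is a made-up example, and the bit widths are nominal: real quantized formats usually store per-block scales too, so the effective bits per weight come out a bit higher. Intermediate compute tensors are left out entirely.

```python
GIB = 2**30

def weight_gib(n_params: float, bits_per_weight: float) -> float:
    """Approximate weight storage in GiB at a nominal bit width."""
    return n_params * bits_per_weight / 8 / GIB

def kv_cache_gib(n_layers: int, n_kv_heads: int, head_dim: int,
                 context_len: int, bits_per_value: float = 16) -> float:
    """Approximate KV cache size in GiB: one K and one V tensor per layer."""
    n_values = 2 * n_layers * n_kv_heads * head_dim * context_len
    return n_values * bits_per_value / 8 / GIB

# Hypothetical 7B model; these shape numbers are assumptions for illustration.
N_PARAMS = 7e9
SHAPE = dict(n_layers=32, n_kv_heads=32, head_dim=128, context_len=4096)

baseline = weight_gib(N_PARAMS, 16) + kv_cache_gib(**SHAPE, bits_per_value=16)

for name, bits in [("f16", 16), ("q8", 8), ("q6", 6), ("q4", 4), ("q2", 2)]:
    w = weight_gib(N_PARAMS, bits)
    kv = kv_cache_gib(**SHAPE, bits_per_value=16)  # KV cache left at f16
    total = w + kv
    print(f"{name}: weights {w:.2f} GiB, KV {kv:.2f} GiB, "
          f"total {total:.2f} GiB ({baseline / total:.2f}x smaller overall)")
```

With the KV cache left at f16, the overall saving in this sketch is noticeably smaller than the saving on weights alone; passing a lower `bits_per_value` to `kv_cache_gib` shows how quantizing the cache closes part of that gap.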