
Quantization by definition lowers memory requirements: instead of using f16 for weights, you use q8, q6, q4, or q2, which makes the weights smaller by 2x, ~2.7x, 4x, or 8x respectively.
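To make the arithmetic concrete, here is a rough sketch of where those ratios come from. It assumes the nominal bit width per weight; real quantized formats usually also store per-block scale factors, so the effective bits per weight run a little higher. The 7B parameter count is a hypothetical example, not something from the comment above.

```python
F16_BITS = 16  # baseline: one f16 weight takes 16 bits


def compression_ratio(bits_per_weight: float) -> float:
    """How many times smaller the weights are than f16 (nominal bit width only)."""
    return F16_BITS / bits_per_weight


def weight_bytes(n_params: int, bits_per_weight: float) -> float:
    """Approximate storage for the weights alone, in bytes."""
    return n_params * bits_per_weight / 8


N_PARAMS = 7_000_000_000  # hypothetical 7B-parameter model

for name, bits in [("f16", 16), ("q8", 8), ("q6", 6), ("q4", 4), ("q2", 2)]:
    gib = weight_bytes(N_PARAMS, bits) / 2**30
    print(f"{name}: {compression_ratio(bits):.2f}x smaller than f16, ~{gib:.1f} GiB of weights")
```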

That doesn’t necessarily translate into the full memory reduction, because intermediate compute tensors and the KV cache also take up memory, but those can be quantized as well.
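A back-of-the-envelope sketch of why the overall saving falls short of the weight-only ratio. The model shape (32 layers, 32 KV heads, head dimension 128, 4096-token context, 7B parameters) is a hypothetical configuration picked for illustration, and the estimate ignores intermediate compute tensors entirely, so treat the numbers as rough.

```python
def weight_bytes(n_params: int, bits_per_weight: float) -> float:
    """Approximate storage for the weights alone, in bytes."""
    return n_params * bits_per_weight / 8


def kv_cache_bytes(n_layers, n_kv_heads, head_dim, seq_len, bits_per_elem):
    """KV cache for one sequence: a K and a V tensor per layer,
    each holding n_kv_heads * head_dim values per token."""
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * bits_per_elem / 8


# Hypothetical model shape, assumed for illustration only.
N_PARAMS = 7_000_000_000
CFG = dict(n_layers=32, n_kv_heads=32, head_dim=128, seq_len=4096)


def total_bytes(weight_bits: float, kv_bits: float) -> float:
    """Weights plus KV cache; intermediate compute tensors are not counted."""
    return weight_bytes(N_PARAMS, weight_bits) + kv_cache_bytes(**CFG, bits_per_elem=kv_bits)


baseline = total_bytes(16, 16)
for label, w_bits, kv_bits in [
    ("f16 weights, f16 KV", 16, 16),
    ("q4 weights, f16 KV", 4, 16),
    ("q4 weights, q8 KV", 4, 8),
]:
    total = total_bytes(w_bits, kv_bits)
    print(f"{label}: ~{total / 2**30:.2f} GiB total, {baseline / total:.2f}x smaller than baseline")
```

With the KV cache left in f16, the total shrinks by less than the 4x the weights alone would suggest; quantizing the cache as well closes part of that gap.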


