- https://arxiv.org/pdf/2502.02631
- https://pytorch.org/blog/paretoq-scaling-laws-in-extremely-low-bit-llm-quantization/
Conclusions
- QAT finetuning consistently surpasses both PTQ and QAT from scratch. Near-optimal performance is achieved by dedicating the majority of the training budget to full-precision (FP) training and roughly 10% to QAT finetuning (see the budget-split sketch after this list).
- Quantization grids and ranges are pivotal in the sub-4-bit regime, with a sharp transition in learning behavior between the 1-bit/1.58-bit/2-bit and 3-bit/4-bit settings (see the grid sketch after this list).
- Learnable range settings outperform statistics-based methods.
- While prior work favored learnable policies for activations and statistics-based quantization for weights, learnable scales with appropriate gradient scaling yield stable, superior performance for weights as well (see the LSQ-style sketch after this list).
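
As a rough illustration of the budget split in the first conclusion, the toy loop below trains for ~90% of the steps in full precision and switches to fake-quantized (QAT) updates for the remaining ~10%. The model, data, loss, and the simple min-max fake quantizer are illustrative placeholders, not the paper's training setup.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Toy sketch of the ~90% FP / ~10% QAT budget split. Model, data, loss, and the
# min-max fake quantizer are stand-ins, not the paper's setup.
torch.manual_seed(0)
model = nn.Linear(16, 16)
opt = torch.optim.SGD(model.parameters(), lr=1e-2)

def fake_quant(w: torch.Tensor, bits: int = 2) -> torch.Tensor:
    """Symmetric fake quantization with a straight-through estimator (STE)."""
    qmax = 2 ** (bits - 1) - 1
    s = w.abs().max() / qmax
    w_q = torch.clamp(torch.round(w / s), -qmax - 1, qmax) * s
    return w + (w_q - w).detach()   # forward: quantized values, backward: identity

TOTAL_STEPS = 1_000
FP_STEPS = int(0.9 * TOTAL_STEPS)   # ~90% of the budget in full precision

for step in range(TOTAL_STEPS):
    x = torch.randn(32, 16)
    if step < FP_STEPS:
        y = model(x)                                            # full-precision phase
    else:
        y = F.linear(x, fake_quant(model.weight), model.bias)   # QAT finetuning phase
    loss = (y - x).pow(2).mean()    # dummy reconstruction-style loss
    opt.zero_grad()
    loss.backward()
    opt.step()
```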
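The grid sketch below illustrates the second conclusion: how the set of representable levels collapses from 4-bit down to ternary (1.58-bit) and binary, and how a range (scale) choice is applied before snapping weights to the grid. The grid definitions and the mean-abs vs. max-abs scaling heuristics are assumptions for illustration, not the paper's exact quantizers.

```python
import torch

def make_grid(bits: float) -> torch.Tensor:
    """Integer-level grid for a given bit-width (illustrative, symmetric).

    1.58-bit is the ternary case {-1, 0, 1}; 1-bit keeps only {-1, +1};
    higher bit-widths use the usual [-2^(b-1), 2^(b-1) - 1] range.
    """
    if bits == 1:
        return torch.tensor([-1.0, 1.0])
    if abs(bits - 1.58) < 1e-6:
        return torch.tensor([-1.0, 0.0, 1.0])
    b = int(bits)
    return torch.arange(-(2 ** (b - 1)), 2 ** (b - 1)).float()

def snap_to_grid(w: torch.Tensor, grid: torch.Tensor, scale: torch.Tensor) -> torch.Tensor:
    """Map each weight to the nearest point of the scaled grid."""
    levels = scale * grid
    idx = (w.unsqueeze(-1) - levels).abs().argmin(dim=-1)
    return levels[idx]

w = torch.randn(6)
for bits in (1, 1.58, 2, 3, 4):
    grid = make_grid(bits)
    # Range choice matters most at very low bit-widths: mean-abs scaling for the
    # 1/1.58/2-bit grids, max-abs scaling otherwise (illustrative heuristics only).
    scale = w.abs().mean() if bits <= 2 else w.abs().max() / grid.max()
    print(f"{bits}-bit grid {grid.tolist()} -> {snap_to_grid(w, grid, scale).tolist()}")
```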
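For the learnable-range conclusion, here is a minimal LSQ-style sketch of a weight quantizer with a learnable scale and the usual 1/sqrt(numel * qmax) gradient scaling. It is a generic implementation of the idea under assumed conventions, not the paper's code; the class name and initialization heuristic are hypothetical.

```python
import torch
import torch.nn as nn

def grad_scale(x: torch.Tensor, factor: float) -> torch.Tensor:
    """Identity in the forward pass; scales x's gradient by `factor` in the backward pass."""
    return x * factor + (x - x * factor).detach()

def round_ste(x: torch.Tensor) -> torch.Tensor:
    """Round to the nearest integer in the forward pass, identity gradient in the backward pass."""
    return (x.round() - x).detach() + x

class LearnableScaleWeightQuantizer(nn.Module):
    """Fake-quantizes weights with a learnable scale (bits >= 2 assumed).

    The scale is initialized from weight statistics and then trained; its gradient
    is rescaled by 1/sqrt(numel * qmax), the usual LSQ heuristic, so scale updates
    stay commensurate with weight updates. A sketch of the idea, not the paper's
    exact quantizer.
    """

    def __init__(self, weight: torch.Tensor, bits: int = 2):
        super().__init__()
        self.qmin = -(2 ** (bits - 1))
        self.qmax = 2 ** (bits - 1) - 1
        init = 2.0 * weight.detach().abs().mean() / (self.qmax ** 0.5)
        self.scale = nn.Parameter(init)

    def forward(self, w: torch.Tensor) -> torch.Tensor:
        g = 1.0 / (w.numel() * self.qmax) ** 0.5         # gradient-scaling factor
        s = grad_scale(self.scale, g)
        q = round_ste(torch.clamp(w / s, self.qmin, self.qmax))
        return q * s                                      # dequantized (fake-quantized) weights

# Usage: gradients reach both the weights (via the STE) and the learnable scale.
w = torch.randn(256, 256, requires_grad=True)
quantizer = LearnableScaleWeightQuantizer(w, bits=2)
loss = quantizer(w).pow(2).mean()
loss.backward()
print(w.grad.shape, quantizer.scale.grad)
```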