Conclusions

  • QAT finetuning consistently surpasses both PTQ and QAT from scratch. Near-optimal performance is achieved by dedicating the majority of the training budget to full-precision (FP) training and roughly the final 10% to QAT.

  • Quantization grids and ranges are pivotal in the sub-4-bit regime, with a sharp transition in learning behavior between the 1-bit/1.58-bit/2-bit and 3-bit/4-bit settings.
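The bit widths mentioned above imply different quantization grids. As a minimal sketch (hypothetical helper names; assumes a symmetric binary grid for 1-bit, a ternary grid for 1.58-bit, and signed integer grids otherwise), nearest-grid-point quantization looks like:

```python
import numpy as np

def grid(bits):
    """Return the quantization grid for a given (possibly fractional) bit width."""
    if bits == 1:
        return np.array([-1.0, 1.0])          # binary weights
    if bits == 1.58:                          # log2(3) ~ 1.58: ternary grid
        return np.array([-1.0, 0.0, 1.0])
    n = int(2 ** bits)
    return np.arange(-(n // 2), n // 2)       # e.g. 2-bit: [-2, -1, 0, 1]

def quantize(x, bits, scale):
    """Map each value of x to the nearest grid point, in units of `scale`."""
    g = grid(bits)
    idx = np.abs(x[:, None] / scale - g[None, :]).argmin(axis=1)
    return g[idx] * scale
```

The grid choice is what separates the two regimes: ternary and 2-bit grids leave only two or three representable magnitudes per scale, while 3-bit and 4-bit grids already approximate a fine-grained integer range.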

  • Learnable range settings outperform statistics-based methods.

    • Prior work favored learnable range policies for activations but statistics-based quantization for weights; with appropriate gradient scaling, learnable scales yield stable, superior performance for weights as well.
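The gradient scaling referred to above can be illustrated with an LSQ-style learnable scale (a sketch with hypothetical helper names, not the paper's exact implementation): the scale s is a trained parameter, and its gradient is multiplied by g = 1/sqrt(N * Qp) so that the update magnitude stays balanced against the weight gradients.

```python
import numpy as np

def lsq_forward(w, s, Qn, Qp):
    """Quantize weights w with learnable scale s onto the integer range [Qn, Qp]."""
    v = np.clip(w / s, Qn, Qp)
    return np.round(v) * s

def lsq_scale_grad(w, s, Qn, Qp, upstream):
    """Gradient of the loss w.r.t. the scale s, with LSQ gradient scaling."""
    v = w / s
    # d(quantized)/ds: round(v) - v inside the range, Qn / Qp where clipped
    d = np.where(v <= Qn, Qn, np.where(v >= Qp, Qp, np.round(v) - v))
    g = 1.0 / np.sqrt(w.size * Qp)  # the stabilizing gradient-scale factor
    return g * np.sum(upstream * d)
```

Without the factor g, the scale gradient grows with the number of weights it covers, which is a plausible reason statistics-based weight ranges looked safer in earlier work.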