Conclusions

  • QAT finetuning consistently surpasses both PTQ and QAT from scratch. Near-optimal performance is achieved by dedicating the majority of the training budget to full-precision (FP) training and roughly the final 10% to QAT.

  • Quantization grids and ranges are pivotal in the sub-4-bit regime, with a sharp transition in learning behavior between the 1-bit/1.58-bit/2-bit and 3-bit/4-bit settings.
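The bit widths mentioned above imply different quantization grids. As a minimal sketch (hypothetical helper names; assumes a symmetric binary grid for 1-bit, a ternary grid for 1.58-bit, and signed integer grids otherwise), nearest-grid-point quantization looks like:

```python
import numpy as np

def grid(bits):
    """Return the quantization grid for a given (possibly fractional) bit width."""
    if bits == 1:
        return np.array([-1.0, 1.0])          # binary weights
    if bits == 1.58:                          # log2(3) ~ 1.58: ternary grid
        return np.array([-1.0, 0.0, 1.0])
    n = int(2 ** bits)
    return np.arange(-(n // 2), n // 2)       # e.g. 2-bit: [-2, -1, 0, 1]

def quantize(x, bits, scale):
    """Map each value of x to the nearest grid point, in units of `scale`."""
    g = grid(bits)
    idx = np.abs(x[:, None] / scale - g[None, :]).argmin(axis=1)
    return g[idx] * scale
```

The grid choice is what separates the two regimes: ternary and 2-bit grids leave only two or three representable magnitudes per scale, while 3-bit and 4-bit grids already approximate a fine-grained integer range.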

  • Learnable range settings outperform statistics-based methods.

    • Prior work favored learnable range policies for activations but statistics-based quantization for weights; with appropriate gradient scaling, learnable scales yield stable, superior performance for weights as well.
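The gradient scaling referred to above can be illustrated with an LSQ-style learnable scale (a sketch with hypothetical helper names, not the paper's exact implementation): the scale s is a trained parameter, and its gradient is multiplied by g = 1/sqrt(N * Qp) so that the update magnitude stays balanced against the weight gradients.

```python
import numpy as np

def lsq_forward(w, s, Qn, Qp):
    """Quantize weights w with learnable scale s onto the integer range [Qn, Qp]."""
    v = np.clip(w / s, Qn, Qp)
    return np.round(v) * s

def lsq_scale_grad(w, s, Qn, Qp, upstream):
    """Gradient of the loss w.r.t. the scale s, with LSQ gradient scaling."""
    v = w / s
    # d(quantized)/ds: round(v) - v inside the range, Qn / Qp where clipped
    d = np.where(v <= Qn, Qn, np.where(v >= Qp, Qp, np.round(v) - v))
    g = 1.0 / np.sqrt(w.size * Qp)  # the stabilizing gradient-scale factor
    return g * np.sum(upstream * d)
```

Without the factor g, the scale gradient grows with the number of weights it covers, which is a plausible reason statistics-based weight ranges looked safer in earlier work.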