- non-linear quantization
- Most existing llama.cpp quantization types use a linear mapping between quants and de-quantized weights (i.e., x = a * q or x = a * q + b)
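For reference, a minimal sketch of what such a linear (affine) de-quantization looks like; the function name and signature here are illustrative, not actual llama.cpp kernels:

```c
#include <stdint.h>

// Linear (affine) de-quantization: every quant level q maps to an evenly
// spaced value a*q + b. Some schemes use only the scale a, others add an
// offset b. Illustrative sketch only.
static inline float dequant_linear(float a, float b, int8_t q) {
    return a * (float)q + b;
}
```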
- In the case of i-quants, llama.cpp instead hardcodes a non-linear mapping from quant values to de-quantized weights using a look-up table
- Example for a 4-bit quant (IQ4_NL):
static const int8_t kvalues_iq4nl[16] = {-127, -104, -83, -65, -49, -35, -22, -10, 1, 13, 25, 38, 53, 69, 89, 113};
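Note that the table entries are not evenly spaced: the steps are small near zero (where most weights live) and larger toward the extremes. A quick way to see this (illustrative snippet, not from llama.cpp):

```c
#include <stdint.h>
#include <stdio.h>

static const int8_t kvalues_iq4nl[16] = {-127, -104, -83, -65, -49, -35, -22, -10, 1, 13, 25, 38, 53, 69, 89, 113};

int main(void) {
    // Print the gap between consecutive table entries:
    // roughly 11-12 near zero, 23-24 at the edges.
    for (int i = 1; i < 16; ++i) {
        printf("step %2d: %d\n", i, kvalues_iq4nl[i] - kvalues_iq4nl[i-1]);
    }
    return 0;
}
```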
- Quantization is done this way:
float al = id*xb[j]; // scale the weight into int8 range (id is the inverse of the block scale)
// binary search for the index of the closest value in kvalues_iq4nl to al
int l = best_index_int8(16, kvalues_iq4nl, al); // find the closest entry in the table
Lb[j] = l; // the quantized representation is the index into the length-16 array
float q = kvalues_iq4nl[l]; // the de-quantized value (in int8 range) is the table entry itself
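Putting it together, here is a simplified, self-contained sketch of quantizing one block against the table. It uses a linear scan where llama.cpp uses a binary search (best_index_int8), and the scale choice is a plausible simplification rather than the exact scale search llama.cpp performs; closest_index and quantize_block are names made up for this sketch:

```c
#include <stdint.h>
#include <math.h>

static const int8_t kvalues_iq4nl[16] = {-127, -104, -83, -65, -49, -35, -22, -10, 1, 13, 25, 38, 53, 69, 89, 113};

// Find the index of the table entry closest to x (linear scan for clarity;
// llama.cpp does a binary search since the table is sorted).
static int closest_index(float x) {
    int best = 0;
    float best_d = fabsf(x - (float)kvalues_iq4nl[0]);
    for (int i = 1; i < 16; ++i) {
        float d = fabsf(x - (float)kvalues_iq4nl[i]);
        if (d < best_d) { best_d = d; best = i; }
    }
    return best;
}

// Quantize one block of n weights xb into table indices Lb.
// Sketch of a scale choice: map the largest-magnitude weight onto
// kvalues_iq4nl[0] = -127 (the sign of d absorbs a possible flip);
// llama.cpp additionally refines the scale.
static void quantize_block(const float *xb, uint8_t *Lb, float *d_out, int n) {
    float amax = 0.0f, max = 0.0f;
    for (int j = 0; j < n; ++j) {
        float ax = fabsf(xb[j]);
        if (ax > amax) { amax = ax; max = xb[j]; }
    }
    float d  = max / (float)kvalues_iq4nl[0];
    float id = d != 0.0f ? 1.0f/d : 0.0f;   // inverse scale, as in the snippet above
    for (int j = 0; j < n; ++j) {
        float al = id * xb[j];               // scale the weight into int8 range
        Lb[j] = (uint8_t)closest_index(al);  // store the 4-bit table index
    }
    *d_out = d;
}
```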
- Dequantization is just a table look-up:
qs = packed quantized values (two 4-bit indices per byte)
dl = per-block scaling factor
y[j+ 0] = dl * kvalues_iq4nl[qs[j] & 0xf];
y[j+16] = dl * kvalues_iq4nl[qs[j] >> 4];
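Conversely, a self-contained sketch of de-quantizing one 32-weight block, following the layout in the lines above (16 bytes of qs, each packing two 4-bit indices; the per-block fp16 scale is assumed to have already been converted to float). The function name is made up for this sketch:

```c
#include <stdint.h>

static const int8_t kvalues_iq4nl[16] = {-127, -104, -83, -65, -49, -35, -22, -10, 1, 13, 25, 38, 53, 69, 89, 113};

// De-quantize one block of 32 weights: low nibble of qs[j] feeds y[j],
// high nibble feeds y[j+16]; dl is the per-block scale.
static void dequant_block_iq4nl(float dl, const uint8_t *qs, float *y) {
    for (int j = 0; j < 16; ++j) {
        y[j +  0] = dl * kvalues_iq4nl[qs[j] & 0xf];
        y[j + 16] = dl * kvalues_iq4nl[qs[j] >> 4];
    }
}
```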