Neural network quantization is used almost exclusively at the production (inference) stage. Its primary goal is to adapt a model for inference on CPU devices. An ambitious objective, isn’t it?
Let’s examine in detail how quantization works and what performance improvements it brings, using an LLM as an example.
Recently, I trained a transformer-based model for text paraphrasing. Training took 2 hours on one of the most powerful GPUs, the Nvidia A100. The model has around 800 million parameters (weights) and occupies approximately 800 MB on disk. This size is clearly problematic for inference.
If we print the model parameters, we can see layer names and their data types. Below is part of the transformer encoder block: embedding layers and Multihead Attention layers.
encoder.embedding.weight torch.float32
encoder.encoder_layers.0.mha.wq.weight torch.float32
encoder.encoder_layers.0.mha.wq.bias torch.float32
encoder.encoder_layers.0.mha.wk.weight torch.float32
encoder.encoder_layers.0.mha.wk.bias torch.float32
encoder.encoder_layers.0.mha.wv.weight torch.float32
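A listing like the one above can be produced by iterating over the model's named parameters. A minimal sketch, using a toy two-layer model as a stand-in for the real paraphraser (which is not shown here):

```python
import torch
import torch.nn as nn

# a toy stand-in for the paraphrasing model
model = nn.Sequential(nn.Embedding(100, 16), nn.Linear(16, 16))

# print each parameter's name and storage dtype
for name, param in model.named_parameters():
    print(name, param.dtype)  # e.g. "0.weight torch.float32"
```

Every parameter comes back as torch.float32, the default dtype for freshly trained PyTorch models.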
Why is the model so large? The vast majority of models are trained using float32, a 32-bit floating-point format. It is standard for training because it can represent very small values accurately. float16, for example, covers a much narrower range of values, which can be problematic during training.
Consider a simple example. Suppose we compute a prediction error and want to calculate gradients to update the weights. If one gradient equals 0.000007 (which is entirely possible), float32 will represent it accurately, and the weights will update correctly. In float16, this value falls into the subnormal range and loses precision, and even smaller gradients (below roughly 6×10⁻⁸) underflow to zero, meaning the corresponding weights would stop updating.
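This is easy to verify directly; a quick sketch comparing how both dtypes store a small gradient:

```python
import torch

g = 0.000007  # a small but realistic gradient value

print(torch.tensor(g, dtype=torch.float32).item())  # ~7e-06, accurate
print(torch.tensor(g, dtype=torch.float16).item())  # slightly off: float16 subnormal range

# gradients below float16's smallest subnormal (~6e-8) vanish entirely
print(torch.tensor(1e-8, dtype=torch.float16).item())  # 0.0
```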
However, while numerical precision is critical during training, inference has different requirements: weights are no longer updated, and small numerical inaccuracies have little effect on predictions. Therefore, we can convert the model weights to int8, an 8-bit integer format that occupies four times less memory than float32. This is the core idea behind quantization.
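Under the hood, the float-to-int8 conversion is typically an affine mapping with a scale and a zero point. A minimal sketch of per-tensor affine quantization (the values below are illustrative, not taken from the real model):

```python
import torch

w = torch.tensor([-0.51, 0.02, 0.34, 1.27])  # float32 weights

# affine quantization: map the float range onto the 256 int8 values
scale = (w.max() - w.min()) / 255
zero_point = torch.round(-w.min() / scale) - 128

# quantize: scale, shift, round, and clamp into [-128, 127]
q = torch.clamp(torch.round(w / scale + zero_point), -128, 127).to(torch.int8)

# dequantize: approximate reconstruction of the original weights
deq = (q.float() - zero_point) * scale

print(q)    # int8 values, 1 byte each instead of 4
print(deq)  # close to the originals, with small rounding error
```

The reconstruction error is bounded by the scale, which is why well-chosen per-tensor (or per-channel) scales matter.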
As a result, the model becomes at least four times smaller while keeping the same number of parameters. In my case, the transformer model trained in float32 weighed 800 MB, and after quantization, it was reduced to 133 MB.
Key Characteristics of Quantization
First and most important: quantization is performed at the layer level, not on the entire model uniformly. Typically, some layers are quantized to int8, while others remain in float32.
Second: there are two main types of quantization — post-training (dynamic) and static.
Dynamic quantization corresponds to my example. I trained the model in float32 and then converted certain layers to int8 before inference to reduce size and enable CPU deployment.
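In PyTorch this is a one-liner via `torch.ao.quantization.quantize_dynamic`. A minimal sketch with a toy model standing in for the paraphraser's linear layers:

```python
import io

import torch
import torch.nn as nn

# a toy float32 model standing in for the paraphraser's linear layers
model = nn.Sequential(nn.Linear(512, 512), nn.ReLU(), nn.Linear(512, 512))

# post-training dynamic quantization: nn.Linear weights become int8
quantized = torch.ao.quantization.quantize_dynamic(
    model, {nn.Linear}, dtype=torch.qint8
)

def size_mb(m):
    # serialize the state dict in memory and measure its size
    buf = io.BytesIO()
    torch.save(m.state_dict(), buf)
    return buf.getbuffer().nbytes / 1e6

print(f"float32: {size_mb(model):.2f} MB")
print(f"int8:    {size_mb(quantized):.2f} MB")  # roughly 4x smaller
```

The second argument is the set of module types to quantize; everything else is left in float32.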
Static quantization requires preparation before deployment. The model is first calibrated: representative data is passed through it so that observers can record the ranges of activations, which are then quantized ahead of time. In the related quantization-aware training approach, the model is additionally fine-tuned while simulating quantized values alongside float32. In other words, the model “learns to live with” quantization.
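A minimal sketch of PyTorch's static (eager-mode) workflow, using a toy convolutional model and assuming an x86 CPU backend:

```python
import torch
import torch.nn as nn

# QuantStub/DeQuantStub mark where tensors enter and leave the int8 domain
class Net(nn.Module):
    def __init__(self):
        super().__init__()
        self.quant = torch.ao.quantization.QuantStub()
        self.conv = nn.Conv2d(3, 8, kernel_size=3)
        self.dequant = torch.ao.quantization.DeQuantStub()

    def forward(self, x):
        return self.dequant(self.conv(self.quant(x)))

model = Net().eval()
model.qconfig = torch.ao.quantization.get_default_qconfig("fbgemm")
prepared = torch.ao.quantization.prepare(model)

# calibration: feed representative data so observers record activation ranges
with torch.no_grad():
    for _ in range(8):
        prepared(torch.randn(1, 3, 32, 32))

quantized = torch.ao.quantization.convert(prepared)
print(quantized.conv.weight().dtype)  # torch.qint8
```

After `convert`, both the convolution weights and its activations run in int8, using the ranges gathered during calibration.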
Not all layers can be quantized dynamically due to their structure.
For example:
- Convolutional layers cannot be dynamically quantized.
- Linear layers are well-suited for dynamic quantization.
A linear layer is simply a matrix multiplication performed in one step, and its runtime cost is dominated by loading weights from memory, so pre-quantized int8 weights pay off immediately. A convolutional layer, by contrast, operates via a sliding window over images or spectrograms, often with strides and dilation. Quantizing such operations dynamically, with activation scales computed on the fly, carries a high risk of accuracy degradation, which is why it is restricted; convolutions are instead supported in static quantization.
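This restriction is easy to observe: if a model mixes both layer types, dynamic quantization converts the linear layers and leaves the convolutions untouched. A minimal sketch with a toy model:

```python
import torch
import torch.nn as nn

model = nn.Sequential(
    nn.Conv1d(16, 16, kernel_size=3),  # convolution
    nn.Flatten(),
    nn.Linear(16 * 6, 32),             # linear
)

# request dynamic quantization of Linear layers only
quantized = torch.ao.quantization.quantize_dynamic(
    model, {nn.Linear}, dtype=torch.qint8
)

print(quantized[0])  # Conv1d(...): untouched, still float32
print(quantized[2])  # DynamicQuantizedLinear(...): int8 weights
```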
Supported Layers
Dynamic quantization:
- Linear
- LSTM
- Embedding (with specific nuances)
Static quantization:
- Linear
- LSTM
- Embedding
- Conv1d/2d/3d
- MultiheadAttention
The choice of quantization method depends on the architecture. In my case, I used a transformer where most layers are linear. Even if MultiheadAttention cannot be fully quantized, quantizing the linear components significantly reduces the model size.
After quantizing the linear and embedding layers, the model shrank to 133 MB and paraphrased a 120-word text in 5 seconds on a CPU.


