This article was translated from Chinese by Gemini 2.5 Pro.
To be honest, many people may not know that ICECHUI offers a free model API, supporting up to 6 million tokens per day. Of course, developers need to apply for it. Currently, various Qwen models are available, and I’ve found them to be quite good.
But this isn’t an advertisement. It’s a prelude to the topic of large language model lightweighting: What is quantization?
To Put It Simply#
Quantization is a technique aimed at reducing the numerical precision of model parameters, thereby decreasing the number of bits required to store each parameter.
In a typical AI model training process, the model’s parameters—namely weights and biases—are usually stored in 32-bit floating-point format (FP32). This format offers extremely high precision, capable of representing a very wide and fine range of numerical values, such as 7.892345678.
However, quantization can approximate such a high-precision value with a low-precision 8-bit integer (INT8), like the integer 8. The essence of this process is to trade an acceptable loss in precision for a significant boost in model efficiency.
To understand this process more vividly, imagine simplifying a comprehensive academic tome into a children’s book or a summary.
The academic tome contains a wealth of advanced vocabulary and complex sentence structures, making it information-dense but also heavy and difficult to read quickly.
In contrast, the children’s book or summary conveys the core ideas using simpler language and fewer pages.
Although some subtle details and nuances might be lost in the simplification, the core content is preserved and becomes easier to store, disseminate, and understand.
AI model quantization is just like this simplification process. It converts the complex numerical language inside the model (high-precision floating-point numbers) into a more concise and efficient language (low-precision integers), making the model lighter and faster, albeit at the cost of a slight sacrifice in expressive precision.
Why Quantize?#
The reason is quite straightforward. The most direct and significant benefit of quantization is the drastic reduction in the model’s storage size. A model’s size is primarily determined by the number of its parameters and the storage precision of each parameter.
When we convert parameters from the standard 32-bit floating-point format (FP32, occupying 4 bytes) to 8-bit integers (INT8, occupying 1 byte), the storage requirement for each parameter drops by 75%. For large models with billions or even tens of billions of parameters, this reduction is anything but trivial.
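As a quick back-of-the-envelope check (the 7-billion-parameter figure below is purely an illustrative assumption), the arithmetic looks like this:

```python
# Rough storage estimate for a hypothetical 7B-parameter model.
params = 7_000_000_000

fp32_bytes = params * 4   # FP32: 4 bytes per parameter
int8_bytes = params * 1   # INT8: 1 byte per parameter

print(f"FP32: {fp32_bytes / 1e9:.0f} GB")            # ~28 GB
print(f"INT8: {int8_bytes / 1e9:.0f} GB")            # ~7 GB
print(f"Saved: {1 - int8_bytes / fp32_bytes:.0%}")   # 75%
```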
Besides reducing size, quantization can also significantly speed up the model’s inference, which is the process where the model uses its learned knowledge to make predictions or generate content.
This speed improvement comes from two main sources:
- Integer arithmetic is inherently faster than floating-point arithmetic. On GPUs or hardware specifically designed for AI (like Google’s TPUs or NVIDIA’s Tensor Cores), the circuitry for performing integer math operations is simpler, has lower latency, and higher throughput. Converting the vast number of multiply-accumulate operations in a model from floating-point to integer directly leverages this hardware advantage, thus reducing computation time.
- It alleviates memory bandwidth pressure. Memory bandwidth, the rate of data transfer between the processor and memory, is often a major bottleneck in AI computation. Since a quantized model is smaller, the amount of data that needs to be loaded from memory to the processor’s cache for computation is reduced. Less data transfer means shorter waiting times, allowing the processor to be more efficiently engaged in actual calculations.
Furthermore, computational efficiency is directly related to energy consumption. Because integer operations are simpler than floating-point operations, they consume less energy. Therefore, quantized models are more energy-efficient during runtime.
Taken together, these three benefits matter most outside the data center. Traditional AI applications mostly follow a cloud AI model, in which terminal devices (like smartphones) send data to cloud servers for processing and receive the results back.
This model relies on a stable and fast network connection and carries the risk of data privacy leaks. Edge AI, on the other hand, aims to give terminal devices direct computational capabilities, allowing AI to run locally, and quantization is one of the key techniques that makes this feasible on resource-constrained hardware.
The Art of Mathematics#
To achieve the conversion from high-precision floating-point numbers to low-precision integers, we need a clear and reliable mathematical framework.
This process can be seen as an art of translation, with the core challenge being how to accurately map a continuous interval with infinite possible values (e.g., all floating-point numbers from -15.0 to +15.0) to a discrete set of integers with a finite number of values (e.g., the 256 values representable by INT8, from -128 to 127).
At the heart of this art is a linear mapping method known as the Affine Quantization Scheme.
A Little Math, Easy Peasy!#
The basic idea of affine quantization is very simple. It establishes a correspondence between floating-point numbers and integers through a linear equation. This relationship can be summarized by the following formula:
$$x = S \times (x_q - Z)$$

Here, $x$ is the original floating-point value, and $x_q$ is its corresponding integer value. The scale $S$ and zero_point $Z$ in the formula are two key quantization parameters that together define the specific rules of this translation.
This formula is essentially a simple linear transformation, similar to converting temperature from Celsius to Fahrenheit, accomplished through scaling and shifting.
Conversely, when we want to convert a floating-point number $x$ into an integer $x_q$, we just need to rearrange the above formula to get the core calculation step of quantization:

$$x_q = \operatorname{round}\!\left(\frac{x}{S} + Z\right)$$

Here, $S$ represents the scale (scaling factor), and $Z$ represents the zero_point. This formula tells us that quantizing a floating-point number involves three steps. First, divide $x$ by the scaling factor $S$. Then, add the zero-point $Z$. Finally, round the result to the nearest integer. This integer $x_q$ is the quantized representation of the original floating-point number $x$.
Values that fall outside the preset floating-point range would land outside the representable integer range (for INT8, below -128 or above 127) and are therefore clipped to the nearest boundary, ensuring all inputs are mapped into the valid integer space.
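Here is a minimal NumPy sketch of this round trip; the scale and zero_point values are arbitrary, chosen only to make the numbers easy to follow:

```python
import numpy as np

def quantize(x, scale, zero_point, qmin=-128, qmax=127):
    """Affine quantization: float -> INT8, with rounding and clipping."""
    q = np.round(x / scale + zero_point)
    return np.clip(q, qmin, qmax).astype(np.int8)

def dequantize(q, scale, zero_point):
    """The inverse mapping: INT8 -> approximate float."""
    return scale * (q.astype(np.float32) - zero_point)

x = np.array([-15.0, -3.7, 0.0, 7.892345678, 15.0], dtype=np.float32)
scale, zero_point = 30.0 / 255, 0   # roughly maps [-15, +15] onto [-128, 127]

q = quantize(x, scale, zero_point)
print(q)                                  # integer codes
print(dequantize(q, scale, zero_point))   # floats carrying a small rounding error
```

Running the round trip shows that 7.892345678 comes back as roughly 7.88: the information is mostly preserved, but not exactly.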
Scale Factor? Zero-Point?#
The scale is a positive floating-point number that defines the granularity of the quantization. It can be thought of as the numerical range in the floating-point world that a single unit step in the integer world represents.
For example, if the scale is 0.1, the difference between integer values 5 and 6 represents a floating-point change of 0.1.
A smaller scale value means higher precision, as each integer corresponds to a narrower range of floating-point numbers. Conversely, a larger scale value means coarser granularity and greater precision loss.
The zero_point is an integer whose purpose is to ensure that the special value of 0.0 in the floating-point world can be exactly represented by a value in the integer world.
This is crucial because 0 is a very common and meaningful value in neural networks. For instance, in the ReLU activation function, all negative inputs are set to 0. In convolution operations, 0 is also frequently used for padding.
If floating-point 0.0 cannot be accurately represented after quantization, it will introduce a systematic bias that continuously affects the model’s computational results. The zero_point is the integer value that corresponds exactly to 0.0.
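As an illustration of how these two parameters might be derived from an observed floating-point range, here is a small sketch; the example range of $[-2.0, 6.0]$ is made up:

```python
import numpy as np

def calc_qparams(x_min, x_max, qmin=-128, qmax=127):
    """Derive asymmetric quantization parameters from a float range."""
    # Make sure 0.0 lies inside the range so it can be represented exactly.
    x_min, x_max = min(x_min, 0.0), max(x_max, 0.0)
    scale = (x_max - x_min) / (qmax - qmin)
    zero_point = int(round(qmin - x_min / scale))
    return scale, zero_point

scale, zero_point = calc_qparams(-2.0, 6.0)
print(scale, zero_point)                               # ~0.031, -64
print(round(0.0 / scale + zero_point) == zero_point)   # True: 0.0 maps exactly
```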
So?#
Depending on the distribution characteristics of the floating-point values, we can adopt two different mapping strategies to determine the scale and zero_point: asymmetric quantization and symmetric quantization.
- Asymmetric Quantization is the most general affine quantization scheme. It is suitable for any range of floating-point numbers, for example, a tensor after being processed by a ReLU activation function, whose values are all non-negative and therefore span a range like $[0, r_{\max}]$. In this case, the floating-point 0 is not at the center of the range. Asymmetric quantization uses a non-zero zero_point $Z$ to shift the entire integer mapping interval so that it aligns with this asymmetric floating-point range.
- Symmetric Quantization is a simplified special case, suitable for tensors whose numerical range is roughly centered around 0, such as weights distributed in a range like $[-r, +r]$. In symmetric quantization, we enforce that $Z$ must be 0. The advantage of this is that the quantization formula simplifies to $x_q = \operatorname{round}(x / S)$, which saves an addition during computation and can lead to a slight speed increase. However, if the actual data distribution is not strictly symmetric, forcing the use of symmetric quantization may sacrifice some representational accuracy. (A small sketch of the symmetric case follows this list.)
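For comparison with the asymmetric sketch above, a minimal version of symmetric weight quantization might look like this; the weight values are random placeholders:

```python
import numpy as np

def quantize_symmetric(w, qmax=127):
    """Symmetric INT8 quantization: the zero_point is fixed at 0."""
    scale = np.abs(w).max() / qmax   # map [-max|w|, +max|w|] onto [-127, +127]
    q = np.clip(np.round(w / scale), -qmax, qmax).astype(np.int8)
    return q, scale

w = np.random.uniform(-0.5, 0.5, size=8).astype(np.float32)
q, scale = quantize_symmetric(w)
print(q)              # integer codes, roughly symmetric around 0
print(scale * q - w)  # per-element quantization error
```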
In practice, the art of mathematics lies not just in applying these formulas, but more in determining the optimal quantization range for each layer’s weights and activations, and from there, calculating the most appropriate scale and zero_point. This process of determining parameters is called calibration.
For model weights, their numerical range is known and static at the time of quantization. But for activations, their values change dynamically with each different input data. How to choose a fixed range to accommodate all possible inputs becomes a central challenge in quantization.
The calibration process usually involves feeding a small, representative set of data samples into the model to observe and statistically analyze the typical distribution range of the activations.
Choosing a range that is too narrow will cause many out-of-range outliers to be clipped, resulting in large clipping errors. On the other hand, choosing a range that is too wide will make the representation of most common values coarser, leading to larger rounding errors.
Finding the optimal balance in this trade-off is the core problem that different quantization methodologies (such as static quantization, dynamic quantization, and quantization-aware training) aim to solve.
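A toy sketch of min/max calibration is shown below; the "layer" and the calibration batches are stand-ins invented for illustration, not a real network:

```python
import numpy as np

def calibrate(run_layer, calibration_batches, qmin=-128, qmax=127):
    """Track the min/max of a layer's activations over representative inputs."""
    obs_min, obs_max = np.inf, -np.inf
    for batch in calibration_batches:
        act = run_layer(batch)
        obs_min = min(obs_min, float(act.min()))
        obs_max = max(obs_max, float(act.max()))
    # Turn the observed range into fixed quantization parameters.
    obs_min, obs_max = min(obs_min, 0.0), max(obs_max, 0.0)
    scale = (obs_max - obs_min) / (qmax - qmin)
    zero_point = int(round(qmin - obs_min / scale))
    return scale, zero_point

# Stand-in "layer": ReLU of a random projection, fed with random calibration data.
rng = np.random.default_rng(0)
W = rng.normal(size=(16, 16))
relu_layer = lambda x: np.maximum(x @ W, 0.0)
batches = [rng.normal(size=(32, 16)) for _ in range(10)]

print(calibrate(relu_layer, batches))   # one fixed (scale, zero_point) pair
```

More sophisticated calibrators replace the raw min/max with percentile- or entropy-based criteria precisely to manage the clipping-versus-rounding trade-off described above.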
PTQ: Post-Training Quantization#
As the name suggests, post-training quantization is an operation performed on a model that has already been fully trained using high precision (like FP32).
It’s more like a model conversion or post-processing step rather than part of the training process. The workflow typically involves taking a pre-trained model and applying a quantization algorithm to convert its parameters to a low-precision format.
The biggest advantage of PTQ is its convenience and efficiency. Since it doesn’t require re-training the model, the entire process is very fast, usually taking only a few minutes to a few hours.
Furthermore, it requires much less data; sometimes, no data is needed at all. This makes PTQ an extremely attractive option, especially when you have an off-the-shelf pre-trained model but lack the original training dataset or sufficient computational resources.
PTQ can be further divided into several different techniques, with the two most common being:
- Static Quantization: This method quantizes both the weights and the activations of the model. Quantizing weights is relatively straightforward as they are fixed. However, for activations, which change with the input, static quantization requires an additional calibration step. In this step, developers feed a small set of representative sample data into the model and record the dynamic range of activations for each layer. Based on these statistics, the algorithm calculates a fixed, optimal set of quantization parameters for each layer’s activations. During actual inference, the model uses these pre-calculated parameters to quantize the activations. Since all parameters are determined before inference, static quantization has no extra computational overhead at runtime, resulting in the fastest inference speed.
- Dynamic Quantization: Unlike static quantization, dynamic quantization typically only pre-quantizes the model’s weights, while the activations are quantized on the fly during inference. Specifically, each time new input data arrives, the system calculates the current range (maximum and minimum values) of the activations in real time and generates quantization parameters from it dynamically. The advantage of this method is that the quantization parameters are perfectly adapted to each specific input, often resulting in higher accuracy than static quantization. The trade-off is the added overhead of computing these parameters on every inference, which can make overall inference slower than static quantization. (A minimal sketch of this flavor follows this list.)
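As a hedged illustration of dynamic PTQ, PyTorch ships a one-call utility for it; the tiny model below is a placeholder standing in for a real pre-trained network:

```python
import torch
import torch.nn as nn

# Stand-in model: in practice this would be a pre-trained network.
model = nn.Sequential(nn.Linear(128, 256), nn.ReLU(), nn.Linear(256, 10))
model.eval()

# Weights of the Linear layers are converted to INT8 ahead of time;
# their activations are quantized on the fly at inference time.
quantized = torch.quantization.quantize_dynamic(
    model, {nn.Linear}, dtype=torch.qint8
)

x = torch.randn(1, 128)
with torch.no_grad():
    print(quantized(x).shape)   # same interface, lighter Linear layers
```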
QAT: Quantization-Aware Training#
Quantization-Aware Training takes a completely different approach from PTQ. Instead of introducing quantization after training is complete, it integrates the quantization process directly into the model’s training or fine-tuning phase.
Its core objective is to make the model “aware” during training that it will be quantized in the future, and to proactively learn to adapt to and compensate for the precision loss introduced by quantization.
QAT is implemented through a clever mechanism, often called pseudo-quantization or simulated quantization. In each iteration of training, the specific process is as follows:
- Forward Pass: When computing the network’s output, the model simulates the quantization operation. It first pseudo-quantizes the full-precision weights and activations (i.e., calculates their corresponding integer values according to the quantization rule, and then converts them back to floating-point numbers using the dequantization formula; these floating-point numbers now carry the quantization error). It then uses these values with simulated errors to complete the network’s calculations.
- Backward Pass: When calculating gradients and updating weights, the model ignores the quantization operation from the forward pass. Gradients are calculated based on the original full-precision weights, and the updates are also applied to the full-precision weights. This design (often implemented using a technique called the Straight-Through Estimator) ensures the stability of the training process and the effective propagation of gradients. (A minimal sketch of this trick follows this list.)
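Here is a toy PyTorch sketch of the fake-quantization step together with a straight-through estimator; it illustrates the idea only and is not a full QAT pipeline (real frameworks provide dedicated QAT tooling):

```python
import torch

def fake_quantize(x, scale, zero_point, qmin=-128, qmax=127):
    """Quantize then immediately dequantize: the value carries quantization
    error but remains a float and stays inside the autograd graph."""
    q = torch.clamp(torch.round(x / scale + zero_point), qmin, qmax)
    x_q = (q - zero_point) * scale
    # Straight-through estimator: the forward pass uses x_q, while the backward
    # pass acts as if the rounding never happened (the detached term has zero gradient).
    return x + (x_q - x).detach()

w = torch.randn(4, requires_grad=True)
loss = fake_quantize(w, scale=0.05, zero_point=0).sum()
loss.backward()
print(w.grad)   # all ones: gradients flow straight through the rounding
```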
In this way, the model continuously experiences the effects of quantization throughout the training process and adjusts its weights to find a solution that remains optimal even after quantization.
This is like a marksman who considers the effect of wind speed during training and proactively adjusts their aim, rather than experiencing the wind for the first time on competition day.
Therefore, the greatest advantage of QAT is its ability to preserve model accuracy to the maximum extent.
Since the model has learned how to coexist with quantization error, its final quantized version can almost always achieve higher accuracy than PTQ, sometimes even approaching the level of the original FP32 model. Thus, when an application has extremely stringent accuracy requirements, QAT is the preferred solution.
Of course, this high accuracy comes at a cost. QAT requires a complete training pipeline, which means it needs access to a large training dataset and a significant amount of computational resources for re-training or fine-tuning the model.
The Preview is Over, Here’s a Foreshadowing#
While PTQ and QAT form the core framework of quantization technology, the field is constantly evolving, with many specialized tools and more advanced techniques emerging for specific scenarios and needs.
GPTQ#
GPTQ takes its name from applying post-training quantization to GPT-style models (the original paper is titled “GPTQ: Accurate Post-Training Quantization for Generative Pre-trained Transformers”); it is an advanced post-training quantization method designed for large Transformer-based language models.
Its standout feature is its ability to quantize models to very low bit-widths, such as INT4 or even INT3, while maintaining extremely low performance loss. GPTQ achieves a better precision-compression ratio than traditional PTQ methods through a layer-by-layer quantization approach combined with precise analysis of quantization error.
In practical applications, GPTQ-quantized models often exhibit extremely high inference speeds on GPUs, making it one of the top choices for GPU users running large models.
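As a hedged sketch, loading a checkpoint that has already been quantized with GPTQ through Hugging Face transformers usually looks something like the following, assuming a GPTQ backend such as auto-gptq/optimum is installed; the repository name is a placeholder, not a real recommendation:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# Placeholder repo id: substitute an actual GPTQ-quantized checkpoint.
repo = "some-org/some-model-GPTQ"

tokenizer = AutoTokenizer.from_pretrained(repo)
# The quantization config stored in the checkpoint is picked up automatically,
# provided a GPTQ backend is available in the environment.
model = AutoModelForCausalLM.from_pretrained(repo, device_map="auto")

inputs = tokenizer("Quantization makes large models", return_tensors="pt").to(model.device)
print(tokenizer.decode(model.generate(**inputs, max_new_tokens=20)[0]))
```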
GGML#
GGML is a tensor library designed for machine learning, which includes a set of custom binary formats for storing quantized models. Unlike GPTQ, which is primarily GPU-oriented, models in the GGML format show excellent performance when running on CPUs.
This allows users without high-end graphics cards to experience large language models on their own laptops or desktops. GGML supports multiple quantization levels (e.g., the commonly seen q4_0, q5_1, etc., in filenames), providing users with flexible options to trade off between model size and performance.
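A hedged sketch of running such a file on the CPU through the llama-cpp-python bindings might look like this; the model path is a placeholder (and note that recent llama.cpp builds expect GGUF, the successor of the original GGML format):

```python
from llama_cpp import Llama

# Placeholder path to a locally downloaded, 4-bit quantized model file.
llm = Llama(model_path="./models/model-q4_0.gguf", n_ctx=2048)

out = llm("Q: In one sentence, what does quantization do to a model?\nA:", max_tokens=64)
print(out["choices"][0]["text"])
```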
Extreme Quantization#
- Binarization is one of the most extreme forms of quantization. In Binarized Neural Networks (BNNs), the model’s weights are restricted to only two possible values, typically $\{-1, +1\}$ or a scaled $\{-\alpha, +\alpha\}$.
- Ternarization slightly relaxes this constraint, allowing weights to take on three values, usually $\{-1, 0, +1\}$ or a scaled $\{-\alpha, 0, +\alpha\}$. (A toy sketch follows below.)
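To make the idea concrete, here is a toy sketch of ternarizing a single weight tensor; the thresholding rule is a simplified illustration, not any specific published method:

```python
import numpy as np

def ternarize(w, threshold_ratio=0.7):
    """Map weights to {-alpha, 0, +alpha} using a simple magnitude threshold."""
    delta = threshold_ratio * np.abs(w).mean()              # illustrative threshold
    mask = np.abs(w) > delta                                 # which weights survive
    alpha = np.abs(w[mask]).mean() if mask.any() else 0.0    # per-tensor scale
    return alpha * np.sign(w) * mask

w = np.random.normal(scale=0.1, size=8)
print(np.round(ternarize(w), 3))   # only three distinct values: -alpha, 0, +alpha
```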
Conclusion#
Quantization, at its core, is an optimization technique designed to make massive and complex AI models smaller, faster, and more energy-efficient.
This enables large models like LLMs to move out of resource-abundant cloud data centers and be deployed on the various resource-constrained devices we use in our daily lives.
Whether it’s drastically reducing model size to fit into a smartphone, accelerating inference to support real-time decisions in autonomous driving, or lowering power consumption to extend the battery life of wearable devices, quantization plays an indispensable role.
It has become one of the core engineering pillars driving the popularization, practical application, and democratization of AI technology.
Footnotes#
1. Hugging Face. (n.d.). Quantization. In Optimum documentation. Retrieved August 20, 2025, from https://huggingface.co/docs/optimum/concept_guides/quantization
2. Li, J. (2024, May 27). Mechanistic Interpretability of Binary and Ternary Transformers (arXiv:2405.17703) [Preprint]. arXiv. https://doi.org/10.48550/arXiv.2405.17703