In the previous post on KV caching, we saw how caching key and value vectors eliminates redundant computation during autoregressive generation. The speedup is enormous: per-token cost drops from quadratic to roughly linear in the sequence length. But we also met the memory wall: the cache itself grows linearly with sequence length, and at long contexts it dominates GPU memory.
For Llama-3.1-8B at 32K tokens, the KV cache alone consumes ~4.3 GB in FP16. At 128K tokens, it's over 17 GB — more than the model weights. Serve 8 users in parallel and you need 140 GB just for caches.
The brute-force solution is quantization — storing each value in fewer bits. Standard 4-bit quantization cuts cache memory by 4×. But existing methods like KIVI and KVQuant require per-block calibration constants (scales, zero-points) that eat into the compression gains, and they introduce systematic bias into attention scores.
In March 2026, a team at Google Research published TurboQuant (ICLR 2026), an algorithm that compresses KV caches to as few as 3 bits per value with:
- No training or calibration required — completely data-oblivious
- Zero memory overhead — no scales or zero-points stored
- Provably near-optimal distortion — within 2.7× of the information-theoretic limit
- Unbiased inner products — attention scores have zero systematic error
This post builds TurboQuant from scratch. We'll implement every piece in Python, understand why it works mathematically, and see it compress real KV cache vectors.
The Core Idea
TurboQuant solves a precise mathematical problem: given a high-dimensional vector $x \in \mathbb{R}^d$ and a bit budget of $b$ bits per coordinate, find a quantizer that minimizes distortion while keeping inner products unbiased.
The algorithm has two stages:
Stage 1 — PolarQuant ($b-1$ bits): Randomly rotate the vector, then apply an optimal scalar quantizer to each coordinate independently. This handles most of the compression.
Stage 2 — QJL Error Correction (1 bit): Take the residual error from Stage 1, project it through a random matrix, and store only the sign bits. This single bit per coordinate eliminates the bias that Stage 1 introduces.
Together, the two stages use $b$ bits per coordinate ($b-1$ in Stage 1 plus 1 in Stage 2) and produce unbiased inner product estimates with near-optimal mean squared error. Let's build each piece.
Stage 1: Random Rotation
Why rotate?
Real KV cache vectors are messy. Some coordinates carry huge values (outliers), others are near zero. The distribution varies wildly across layers, heads, and positions. A one-size-fits-all quantizer seems hopeless — you'd need to measure each vector's statistics before you could quantize it, and storing those statistics eats into your compression gains.
Here's the core problem. Imagine a 128-dimensional key vector where most of the "meaning" is concentrated in just 3 coordinates. If you give every coordinate the same number of quantization bits, you're wasting 125 coordinates' worth of bits on near-zero values while the 3 important coordinates don't get enough resolution. It's like giving equal shelf space to every product in a store — most of it goes to items nobody buys.
The solution is beautifully simple: spread the information evenly before quantizing. If you multiply by a random orthogonal matrix $R$, the information that was concentrated in 3 coordinates gets distributed across all 128. Now every coordinate carries roughly the same amount of signal, and uniform bit allocation becomes optimal.
There's a remarkable fact from high-dimensional geometry that makes this work: after rotation, every coordinate of the result follows nearly the same distribution — approximately $\mathcal{N}(0, 1/d)$ for a unit-norm vector in $d$ dimensions. It doesn't matter what the original vector looked like. Sparse, dense, outlier-heavy — the rotation makes them all look the same.
This means a single, pre-computed quantizer works for every input, regardless of its original distribution. No calibration, no per-tensor statistics, no overhead.
Try it yourself — the left panel shows a sparse vector where 3 coordinates carry all the information. The right panel shows the same vector after rotation. Notice how the quantization error (red) drops dramatically when the information is spread evenly:
And here's the same effect on a larger scale. The left histogram shows the spiky, uneven coordinate distribution before rotation. The right shows the smooth bell curve after. Crank the dimension slider up and watch the bell curve tighten:
The math
For a unit vector uniformly distributed on the sphere $\mathbb{S}^{d-1}$, each coordinate $x_i$ follows a Beta distribution: $x_i^2 \sim \mathrm{Beta}\!\left(\tfrac{1}{2}, \tfrac{d-1}{2}\right)$.
As $d$ grows, this converges to $\mathcal{N}(0, 1/d)$. For KV cache vectors with $d = 128$ (a typical head dimension), the approximation is already excellent.
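This is easy to check numerically. The sketch below (mine, not from the paper) samples points uniformly on the sphere and compares the spread of one coordinate to the Gaussian prediction:

```python
import numpy as np

rng = np.random.default_rng(0)
d = 128

# Sample 10,000 points uniformly on the unit sphere S^{d-1}:
# normalize i.i.d. Gaussian vectors.
X = rng.standard_normal((10_000, d))
X /= np.linalg.norm(X, axis=1, keepdims=True)

coords = X[:, 0]  # first coordinate of each unit vector

# The Gaussian limit N(0, 1/d) predicts a standard deviation of 1/sqrt(d).
print(f"empirical std:       {coords.std():.4f}")
print(f"predicted 1/sqrt(d): {1/np.sqrt(d):.4f}")
```

The two numbers agree to about three decimal places at $d = 128$, which is why a single pre-computed quantizer suffices.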
The random rotation $R$ is generated once at initialization via QR decomposition of a Gaussian random matrix. Since $R$ is orthogonal, it preserves norms and inner products: $\|Rx\| = \|x\|$ and $\langle Rx, Ry \rangle = \langle x, y \rangle$. The rotation itself adds no distortion — it just reshapes the distribution into something we can quantize optimally.
Implementation
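A minimal sketch of the rotation stage (function and variable names are mine, not from the paper's code):

```python
import numpy as np

def make_rotation(d: int, seed: int = 0) -> np.ndarray:
    """Random orthogonal matrix via QR decomposition of a Gaussian matrix."""
    rng = np.random.default_rng(seed)
    G = rng.standard_normal((d, d))
    Q, R = np.linalg.qr(G)
    # Fix column signs so Q is uniformly (Haar) distributed over rotations.
    return Q * np.sign(np.diag(R))

d = 128
R = make_rotation(d)

# A "hard" input: nearly all energy in 3 coordinates.
x = np.zeros(d)
x[[5, 40, 99]] = [3.0, -2.0, 1.5]

y = R @ x
# Orthogonality preserves the norm exactly...
print(np.allclose(np.linalg.norm(y), np.linalg.norm(x)))  # True
# ...while spreading the energy across all coordinates.
print(f"max |x_i| = {np.abs(x).max():.2f}, max |y_i| = {np.abs(y).max():.2f}")
```

After rotation the largest coordinate shrinks dramatically, which is exactly the outlier-flattening effect the quantizer needs.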
Stage 1: Lloyd-Max Quantization
The rotation gives us coordinates with a known distribution — a bell curve. Now we need the best possible way to map those continuous values to discrete levels. This is the Lloyd-Max quantizer — the optimal scalar quantizer for a given probability distribution.
The intuition
Consider 2-bit quantization: we get 4 discrete levels to represent the entire bell curve. Where should we place them?
The naive approach is uniform quantization: space the levels evenly across the range. But think about where the values actually live. The bell curve peaks at zero — the vast majority of values cluster near the center. Very few values sit in the tails. Uniform quantization gives equal resolution to the crowded center and the empty tails. It's like putting the same number of bus stops in downtown Manhattan and rural Wyoming.
Lloyd-Max does the obvious-in-hindsight thing: put more levels where more values live. It packs centroids densely near zero (where the bell curve peaks) and spaces them out in the tails (where values are rare). Each centroid is placed at the average of the values it represents, minimizing the squared distance to its assigned points. This is k-means in 1D, and it converges to the provably optimal quantizer for any given distribution.
Since we know the exact distribution (the Beta distribution from the rotation step), we can pre-compute these optimal centroids once and reuse them forever.
Move the bit-width slider and watch the centroids rearrange. At 1 bit (2 levels), you get a crude positive/negative split. At 4 bits (16 levels), the centroids pack tightly around zero where most values live, giving fine resolution where it matters most.
How it works
The Lloyd-Max algorithm alternates two steps until convergence:
- Update boundaries: Place each decision boundary $t_j$ at the midpoint between adjacent centroids: $t_j = \dfrac{c_j + c_{j+1}}{2}$
- Update centroids: Move each centroid $c_j$ to the conditional mean of the distribution $p$ within its region: $c_j = \dfrac{\int_{t_{j-1}}^{t_j} x \, p(x) \, dx}{\int_{t_{j-1}}^{t_j} p(x) \, dx}$
This is essentially k-means in 1D, optimized for minimum MSE. The resulting quantizer is provably optimal for the given distribution and bit-width.
Implementation
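Here is a sample-based sketch of the Lloyd-Max quantizer (my own simplification: training on Gaussian samples rather than integrating the exact density — the boundary/centroid updates are the two steps described above):

```python
import numpy as np

def lloyd_max(samples: np.ndarray, bits: int, iters: int = 50) -> np.ndarray:
    """1-D k-means (Lloyd-Max) centroids for the samples' distribution."""
    k = 2 ** bits
    # Initialize centroids at evenly spaced quantiles.
    centroids = np.quantile(samples, (np.arange(k) + 0.5) / k)
    for _ in range(iters):
        # Boundaries at midpoints between adjacent centroids.
        bounds = (centroids[:-1] + centroids[1:]) / 2
        # Assign each sample to its region; move centroid to the region mean.
        idx = np.searchsorted(bounds, samples)
        for j in range(k):
            region = samples[idx == j]
            if region.size:
                centroids[j] = region.mean()
    return centroids

rng = np.random.default_rng(0)
samples = rng.standard_normal(200_000)

for bits in (1, 2, 4):
    c = lloyd_max(samples, bits)
    bounds = (c[:-1] + c[1:]) / 2
    q = c[np.searchsorted(bounds, samples)]
    mse = np.mean((samples - q) ** 2)
    print(f"{bits} bits: MSE = {mse:.4f}")
```

On the unit Gaussian this lands near the classical Lloyd-Max values (about 0.36 at 1 bit and 0.0095 at 4 bits), consistent with the distortion figures quoted below.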
Distortion guarantees
TurboQuant proves that this approach achieves MSE distortion at most $2.7 \cdot 2^{-2b}$ per unit of signal energy at $b$ bits per coordinate.
The information-theoretic lower bound (Shannon source coding) is $2^{-2b}$. So TurboQuant is within a factor of 2.7 of the theoretical optimum at every bit-width. No quantizer can beat the Shannon bound — and TurboQuant nearly matches it.
At 4 bits, the MSE is just 0.009 — meaning the average squared error per coordinate is less than 1%. At 3 bits, it's 0.03. This is why TurboQuant can compress to 3 bits without visible quality degradation.
Stage 2: QJL Error Correction
Stage 1 achieves excellent MSE, but it has a subtle problem: the quantized inner products are biased. When we compute the attention score $\langle q, \hat{k} \rangle$, where $\hat{k}$ is the Stage 1 reconstruction of a key $k$, we get $\mathbb{E}[\langle q, \hat{k} \rangle] \neq \langle q, k \rangle$: the quantization error has a systematic component that does not average out.
Why bias matters more than error
You might think: the MSE is tiny, who cares about bias? The issue is what happens after the inner product. Attention scores pass through softmax, which computes $\mathrm{softmax}(s)_i = e^{s_i} / \sum_j e^{s_j}$. Softmax doesn't just amplify errors — it amplifies them exponentially.
An analogy: imagine you're weighing packages on a slightly miscalibrated scale that always reads 0.1 kg too high. For a single measurement, 0.1 kg is nothing. But if you're summing those measurements over 1,000 packages, the systematic error in one direction compounds into a 100 kg drift. A scale with random noise of 0.2 kg but zero bias would actually be better for the total, because random errors cancel out over many measurements.
That's exactly what happens in attention. A biased quantizer makes the model systematically over-attend to certain positions and under-attend to others. Over a 32K-token context with thousands of key vectors, this systematic drift accumulates and degrades generation quality. An unbiased quantizer with slightly more variance is safer — the random errors average out across tokens.
The visualization below shows this in action. The biased quantizer shifts attention weights away from the correct positions, and the problem gets worse as context length grows. The unbiased quantizer stays close to the true distribution:
The Johnson-Lindenstrauss trick
The Quantized Johnson-Lindenstrauss (QJL) transform fixes the bias with a single additional bit per coordinate. The idea:
- Compute the Stage 1 residual $r = x - \hat{x}$, where $\hat{x}$ is the Stage 1 reconstruction — this is the error that Stage 1 made
- Project $r$ through a random Gaussian matrix $S \in \mathbb{R}^{m \times d}$ — this scrambles the error into a random basis
- Store only the sign bits $\mathrm{sign}(Sr)$ — throw away the magnitudes, keep only "positive or negative"
- Also store $\|r\|$ (the residual norm — a single scalar, telling us how big the total error was)
The intuition: the sign of a random projection tells you which direction the error points in. It's like asking 128 random people "was the error more to the left or the right?" — each answer is crude (just +1 or -1), but averaging 128 independent noisy estimates gives a surprisingly accurate correction. The random projection ensures these estimates are independent and unbiased.
The dequantized residual correction is $\hat{r} = \sqrt{\tfrac{\pi}{2}} \cdot \tfrac{\|r\|}{m} \, S^{\top} \mathrm{sign}(Sr)$, where $r$ is the Stage 1 residual, $S$ is the random Gaussian projection, and $m$ is its number of rows.
The magic: this estimator is unbiased. $\mathbb{E}[\hat{r}] = r$, which exactly cancels the bias from Stage 1. Writing $\hat{x}$ for the Stage 1 reconstruction, the combined estimator gives $\mathbb{E}[\langle q, \hat{x} + \hat{r} \rangle] = \langle q, \hat{x} \rangle + \langle q, r \rangle = \langle q, x \rangle$.
Zero bias. The cost? Just 1 bit per coordinate for the sign, plus one scalar for the residual norm.
Implementation
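A sketch of the QJL estimator (variable names are mine; the Stage 1 reconstruction is faked by rounding, just to produce a realistic residual). Unbiasedness is over the random projection, so averaging estimates from many independent matrices should converge to the true inner product:

```python
import numpy as np

rng = np.random.default_rng(0)
d = 128
m = d  # one sign bit per coordinate

x = rng.standard_normal(d)    # stand-in for a rotated key vector
x_hat = np.round(x * 2) / 2   # crude stand-in for a Stage 1 reconstruction
r = x - x_hat                 # the Stage 1 residual
q = rng.standard_normal(d)    # a query vector

def qjl_estimate(q, r, S):
    """Estimate <q, r> from the sign bits of S @ r plus the scalar ||r||."""
    signs = np.sign(S @ r)    # the stored 1-bit code
    return np.sqrt(np.pi / 2) / S.shape[0] * np.linalg.norm(r) * (signs @ (S @ q))

# Average many estimates, each with a fresh random projection.
trials = [qjl_estimate(q, r, rng.standard_normal((m, d))) for _ in range(2000)]
print(f"true <q,r>    : {q @ r:+.4f}")
print(f"mean estimate : {np.mean(trials):+.4f}")
```

Any single estimate is noisy, but the mean converges to the true residual inner product — the empirical face of the zero-bias guarantee.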
A practical caveat
The paper's QJL stage provides mathematically unbiased inner products — a clean theoretical result. However, community implementations have found that for KV cache compression specifically, the variance introduced by QJL can be amplified by softmax, sometimes degrading generation quality. In practice, many implementations dedicate all bits to the MSE stage and skip QJL, achieving better empirical results at the cost of slight bias.
The lesson: unbiased doesn't always mean lower error. Softmax's exponential nonlinearity means low-variance biased estimators can outperform high-variance unbiased ones. The right choice depends on your context length and quality requirements.
Putting It All Together
Here's the complete TurboQuant pipeline, end to end:
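A condensed sketch combining all three pieces (my own naming and simplifications: the Lloyd-Max centroids are fit on Gaussian samples at init, one scalar norm is kept per vector, and Stage 1 gets $b-1$ of the $b$ bits):

```python
import numpy as np

class TurboQuantSketch:
    """Two-stage pipeline: random rotation -> Lloyd-Max -> QJL sign bits."""

    def __init__(self, d: int, bits: int, seed: int = 0):
        rng = np.random.default_rng(seed)
        # Stage 1a: random rotation via QR of a Gaussian matrix.
        Q, R = np.linalg.qr(rng.standard_normal((d, d)))
        self.rot = Q * np.sign(np.diag(R))
        # Stage 2: QJL projection with m = d rows (1 sign bit per coordinate).
        self.S = rng.standard_normal((d, d))
        # Stage 1b: Lloyd-Max centroids for N(0, 1), fit once on samples.
        k = 2 ** (bits - 1)            # Stage 1 gets bits-1, QJL gets 1
        s = rng.standard_normal(500_000)
        c = np.quantile(s, (np.arange(k) + 0.5) / k)
        for _ in range(60):
            bounds = (c[:-1] + c[1:]) / 2
            idx = np.searchsorted(bounds, s)
            c = np.array([s[idx == j].mean() if np.any(idx == j) else c[j]
                          for j in range(k)])
        self.centroids = c

    def quantize(self, x):
        y = self.rot @ x
        # One scalar per vector (its norm), analogous to storing ||r|| below.
        scale = np.linalg.norm(y) / np.sqrt(y.size) + 1e-12
        bounds = (self.centroids[:-1] + self.centroids[1:]) / 2
        codes = np.searchsorted(bounds, y / scale)
        r = y - self.centroids[codes] * scale      # Stage 1 residual
        return codes, scale, np.sign(self.S @ r), np.linalg.norm(r)

    def dequantize(self, codes, scale, signs, r_norm):
        y_hat = self.centroids[codes] * scale
        m = self.S.shape[0]
        r_hat = np.sqrt(np.pi / 2) * r_norm / m * (self.S.T @ signs)
        return self.rot.T @ (y_hat + r_hat)

rng = np.random.default_rng(1)
tq = TurboQuantSketch(d=128, bits=4)
x = rng.standard_normal(128)
x_rec = tq.dequantize(*tq.quantize(x))
cos = x @ x_rec / (np.linalg.norm(x) * np.linalg.norm(x_rec))
print(f"cosine similarity: {cos:.4f}")
```

The cosine from this rough sketch lands a bit below the figures reported for the real implementation, presumably because the centroids here are sample-fit and the bit split is simplified, but the structure is the same.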
Testing it
At 4 bits per coordinate, we lose less than 0.1% of the information. The cosine similarity of 0.999 means the reconstructed vector is nearly identical to the original.
Using TurboQuant in Practice
With HuggingFace Transformers
The simplest way to use TurboQuant today is through the community `turboquant` package, which plugs into HuggingFace Transformers. A few lines of setup are enough to compress a model's KV cache.
Practical tips
Bit allocation matters. Keys need more precision than values. Keys determine where the model attends (via ), while values are averaged together (weighted by attention scores). A common sweet spot is 4-bit keys + 2-bit values, giving ~5× compression with working generation.
Protect the residual window. Keep the most recent 128-256 tokens in full FP16 precision and only compress older tokens. The model relies most heavily on recent context, so preserving it at full precision is cheap insurance.
Layer-adaptive precision. The first and last transformer layers are disproportionately important. Allocating an extra bit to these layers (e.g., 5-bit for first/last, 3-bit for the rest) improves output quality with minimal memory increase.
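These three tips can be collected into a simple configuration helper. Everything below is hypothetical glue code of my own, not an API from any package:

```python
def kv_bit_plan(num_layers: int,
                key_bits: int = 4, value_bits: int = 2,
                boost_layers=(0, -1), boost: int = 1,
                residual_window: int = 256):
    """Per-layer (key_bits, value_bits) plan with boosted first/last layers.

    Keys get more precision than values; the most recent `residual_window`
    tokens are meant to stay in FP16 and bypass quantization entirely.
    """
    boosted = {i % num_layers for i in boost_layers}
    plan = {
        layer: (key_bits + boost, value_bits + boost) if layer in boosted
        else (key_bits, value_bits)
        for layer in range(num_layers)
    }
    return plan, residual_window

plan, window = kv_bit_plan(num_layers=32)
print(plan[0], plan[15], plan[31], window)   # (5, 3) (4, 2) (5, 3) 256
```

The exact numbers are knobs to tune per model; the structure (asymmetric key/value bits, boosted edge layers, an FP16 recency window) is what matters.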
Benchmark Results
TurboQuant was evaluated on Llama-3.1-8B-Instruct and Mistral-7B-Instruct across several long-context benchmarks:
| Method | Bits | LongBench | Needle-in-Haystack | Overhead |
|---|---|---|---|---|
| Full FP16 | 16 | 50.06 | 0.997 | — |
| KIVI | 5 | 50.16 | 0.981 | Per-block scales |
| PolarQuant | 3.9 | 49.78 | 0.995 | None |
| TurboQuant | 3.5 | 50.06 | 0.997 | None |
| TurboQuant | 2.5 | 49.44 | 0.993 | None |
At 3.5 bits, TurboQuant matches full FP16 performance on LongBench while using 4.5× less memory. Even at the extreme 2.5-bit setting, degradation is minimal.
On H100 GPUs, 4-bit TurboQuant achieves up to 8× speedup on attention logit computation compared to unquantized 32-bit keys. The compressed cache is not just smaller — it's faster to compute with, because less data needs to move from HBM to compute cores.
Quantization speed
Unlike Product Quantization or RabitQ, which require expensive offline indexing, TurboQuant quantizes vectors in microseconds:
| Method | Indexing Time (d=1536) |
|---|---|
| TurboQuant | 0.002 seconds |
| Product Quantization | 494 seconds |
| RabitQ | 3,957 seconds |
This is what "online" and "data-oblivious" buys you. No preprocessing, no calibration dataset, no waiting. Each vector is quantized independently using the pre-computed rotation matrix and codebook.
The Paper Landscape
TurboQuant builds on and connects to several lines of research:
- QJL (Zandieh, Daliri, Han, 2024) — Introduced the 1-bit Quantized Johnson-Lindenstrauss transform for KV cache compression with zero overhead. The theoretical foundation for Stage 2.
- PolarQuant (Han, Kacham, Karbasi, Mirrokni, Zandieh, AISTATS 2026) — Uses polar coordinate transformation for quantization. TurboQuant's random rotation approach is simpler and achieves better theoretical bounds.
- KIVI (Liu et al., 2024) — Per-channel quantization with asymmetric key/value precision. Strong practical results but requires storing per-block statistics.
- KVQuant (Hooper et al., 2024) — Sensitivity-aware KV cache quantization. Good empirical performance but requires calibration data.
TurboQuant's key contribution is unifying the theoretical and practical: it proves optimality bounds while being simpler to implement than methods that require training or calibration.
When NOT to Use TurboQuant
TurboQuant isn't a silver bullet. Here's when other approaches are better:
Short contexts (< 1K tokens). The KV cache is tiny and not the bottleneck. The overhead of the rotation matrix multiply and centroid lookup costs more than it saves. At short contexts, the model weights dominate memory, not the cache.
When you have calibration data. Methods like GPTQ and AWQ exploit the specific weight distribution of your model to achieve lower distortion at the same bit-width. TurboQuant's advantage is being data-oblivious — it works without calibration. If you can afford a calibration pass, calibration-aware methods will beat it for weight quantization.
When FP8 is good enough. FP8 KV cache (8-bit) is simpler, has native hardware support on H100/GB200, and adds zero compute overhead. If your context length fits in memory with FP8, the added complexity of TurboQuant isn't worth it. TurboQuant shines when you need to go below 8 bits — the 3-4 bit regime where FP8 can't go.
Training weights and gradients. Model weights must be stored at full precision during training — any approximation compounds over millions of gradient steps. TurboQuant is designed for ephemeral data (KV cache entries, embeddings for retrieval) that is written once and read a few times, not for parameters that are updated continuously.
The rotation matrix is $d \times d$. For d_head=128, that's 128×128 = 16K floats (64 KB in FP32), which is trivial. But for high-dimensional embeddings (d=4096), it's about 16.8M floats (67 MB), and the dense matrix-vector product costs $O(d^2)$ per vector. At those dimensions, consider the randomized Walsh-Hadamard transform instead — it achieves a similar decorrelation effect in $O(d \log d)$ time using only $O(d)$ extra space (a random sign vector).
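For reference, the fast Walsh-Hadamard transform is only a few lines, and a random sign flip turns it into a structured rotation. This is a standard construction, sketched here rather than taken from the paper:

```python
import numpy as np

def fwht(x: np.ndarray) -> np.ndarray:
    """Normalized fast Walsh-Hadamard transform, O(d log d); d a power of 2."""
    y = x.astype(float).copy()
    d = y.size
    h = 1
    while h < d:
        # Butterfly step: combine each pair of length-h blocks.
        y = y.reshape(-1, h * 2)
        a, b = y[:, :h].copy(), y[:, h:].copy()
        y[:, :h], y[:, h:] = a + b, a - b
        y = y.ravel()
        h *= 2
    return y / np.sqrt(d)

rng = np.random.default_rng(0)
d = 4096
signs = rng.choice([-1.0, 1.0], size=d)  # O(d) storage, no d x d matrix
x = np.zeros(d)
x[7] = 1.0                               # a maximally "spiky" input

y = fwht(signs * x)                      # randomized Hadamard rotation
print(np.allclose(np.linalg.norm(y), 1.0))   # norm preserved: True
print(f"max |y_i| = {np.abs(y).max():.6f}")  # every coordinate is +-1/sqrt(d)
```

A single spike spreads into equal-magnitude coordinates, the same flattening effect as the dense random rotation but at a fraction of the cost.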
What We Built
We started with a problem: KV caches consuming tens of gigabytes of GPU memory. We built a solution in three pieces:
- Random rotation — a linear algebra trick that transforms arbitrary distributions into a predictable bell curve, enabling a universal quantizer
- Lloyd-Max quantization — the information-theoretically optimal scalar quantizer, pre-computed once and reused forever
- QJL error correction — a 1-bit sign-based estimator that eliminates bias in inner products
Together, these compress 32-bit floating-point values to 3-4 bits with provably near-optimal distortion and zero systematic error. No training. No calibration. No overhead.
The result: models that once needed 80GB GPUs for long-context inference can now run on consumer hardware. That's the kind of efficiency gain that changes who gets to use these models — and that matters.
