SpectralQuant Implementation Analysis
Companion to SpectralQuant KV Cache. See also: Inference Stack Synthesis.
Update (April 2026): The repo at https://github.com/Dynamis-Labs/spectralquant is now public (MIT license, 106 stars). The actual code structure is documented below. The original reconstruction has been replaced with verified details from the repo.
1. Paper recap (implementation-relevant points)
For the full technical analysis of SpectralQuant’s spectral rotation, selective error correction, and non-uniform bit allocation, see SpectralQuant KV Cache. This document focuses on the implementation: code structure, reproduction, and production integration path.
2. Actual repository structure
spectralquant/
  src/spectralquant/
    calibration.py              # Eigenspectral calibration with PCA and d_eff calculation
    spectral_rotation.py        # Rotation-based compression
    nonuniform_quantization.py  # Lloyd-Max quantization with codebooks
    selective_qjl.py            # Selective error correction (signal dims only)
    engine.py                   # Main compression engine
    metrics.py                  # Evaluation metrics (cosine sim, perplexity)
  experiments/                  # 21 scripts covering:
    run_memory_efficiency.py
    # Cross-architecture testing (Qwen, Llama, Mistral, Gemma)
    # Statistical significance (10-seed confidence intervals)
    # Latency benchmarks, perplexity, NIAH, LongBench
  results/                      # 44 JSON files across 15 subdirectories
  paper_output/                 # LaTeX source + compiled PDF + figures
  eval/
    cosine_sim.py               # Cosine similarity measurement
    perplexity.py               # Perplexity evaluation on WikiText-2 / C4
  scripts/
    calibrate.py                # CLI entry point for calibration
    evaluate.py                 # CLI entry point for benchmarks
  configs/
    qwen2.5_14b.yaml            # Per-model calibration configs
    llama3_8b.yaml
3. Implementation details (Reconstructed)
3.1 calibration data collection
Expected implementation:
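# scripts/calibrate.py (reconstructed)
import torch

def collect_key_activations(model, dataloader, num_samples=128):
    """Run forward passes, collect K projections per layer (split into heads afterwards)."""
    key_accum = {}  # {module_name: list of key tensors}
    hooks = []
    # Hook every key projection; assumes Hugging Face-style module names ending in "k_proj"
    for name, module in model.named_modules():
        if name.endswith("k_proj"):
            def hook(_mod, _inp, out, name=name):
                key_accum.setdefault(name, []).append(out.detach().cpu())
            hooks.append(module.register_forward_hook(hook))
    for batch_idx, batch in enumerate(dataloader):
        if batch_idx >= num_samples:
            break
        with torch.no_grad():
            model(batch, output_attentions=False)
    for h in hooks:
        h.remove()  # leave the model unmodified after calibration
    return key_accum
Key parameters: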
- Dataset: Likely WikiText-2 or C4 validation split (128-512 sequences)
- Sequence length: 2048 tokens (matching model context)
- Batch size: 1 (calibration is not throughput-sensitive)
- Sampling: First N sequences, no shuffling (deterministic)
Collection mechanism: Register forward hooks on each attention layer’s key projection (k_proj). Capture the output tensor after the linear projection but before any RoPE or reshaping. This gives raw key vectors of shape [batch, seq_len, num_kv_heads * head_dim], which reshape to [batch, seq_len, num_kv_heads, head_dim].
3.2 eigendecomposition
import torch

def compute_spectral_basis(key_activations, per_head=True):
    """
    Compute eigenbasis for key covariance.
    Args:
        key_activations: [N, head_dim] concatenated key vectors
        per_head: If True, separate basis per attention head
    Returns:
        eigenvectors: [head_dim, head_dim] orthogonal rotation matrix
        eigenvalues: [head_dim] variance along each component
    """
    # Center the data
    mean = key_activations.mean(dim=0)
    centered = key_activations - mean
    # Covariance matrix: [head_dim, head_dim]
    # For head_dim=128, this is 128x128 -- trivially small
    cov = (centered.T @ centered) / (centered.shape[0] - 1)
    # Eigendecomposition (symmetric positive semi-definite)
    eigenvalues, eigenvectors = torch.linalg.eigh(cov)
    # Sort descending by eigenvalue (eigh returns ascending order)
    idx = eigenvalues.argsort(descending=True)
    eigenvalues = eigenvalues[idx]
    eigenvectors = eigenvectors[:, idx]
    return eigenvectors, eigenvalues
Critical design decisions:
- Per-head vs per-layer: Should be per-head because different heads attend to different subspaces. The covariance structure varies significantly across heads (some are “positional,” some are “semantic”).
- torch.linalg.eigh vs numpy: Use torch.linalg.eigh (symmetric eigendecomposition) because the covariance matrix is guaranteed symmetric PSD. For head_dim=128, the 128x128 eigendecomposition takes <1ms; the bottleneck is collecting activations.
- Storage: The eigenvector matrix is [num_layers, num_heads, head_dim, head_dim]. For Qwen 2.5-14B: ~32 MB in FP16. One-time cost, stored alongside the model.
- RoPE interaction: Keys have RoPE applied. Calibration must eigendecompose post-RoPE keys because that’s the representation stored in the KV cache. However, RoPE makes the covariance position-dependent. The paper likely averages across positions. Note that a hook on k_proj (Section 3.1) sees keys before RoPE, so the rotary embedding would have to be applied to the captured tensors before the covariance is computed.
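The storage estimate above is plain arithmetic over the tensor shape, and a small helper makes it easy to redo for other shapes. This is a sketch only: `eigenbasis_bytes` is a hypothetical name, and the 32-layer, 32-head shape is an illustrative choice that happens to reproduce the ~32 MB figure, not a confirmed model configuration.

```python
def eigenbasis_bytes(num_layers, num_heads, head_dim, bytes_per_element=2):
    """Bytes needed to store one [head_dim, head_dim] rotation per (layer, head)."""
    return num_layers * num_heads * head_dim * head_dim * bytes_per_element

# Illustrative shape only (an assumption, not a verified model config).
full = eigenbasis_bytes(num_layers=32, num_heads=32, head_dim=128)
# Same depth, but keeping a basis only for 8 KV heads per layer (GQA-style).
kv_only = eigenbasis_bytes(num_layers=32, num_heads=8, head_dim=128)

print(full / 2**20)     # 32.0 (MiB, FP16)
print(kv_only / 2**20)  # 8.0 (MiB, FP16)
```

For grouped-query models the basis count follows the number of KV heads, so the footprint shrinks accordingly.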
3.3 spectral rotation at inference
class SpectralKVCache:
    def __init__(self, eigenvectors, eigenvalues, d_eff, bit_config):
        self.R = eigenvectors         # rotation matrices
        self.lambdas = eigenvalues    # for bit allocation
        self.d_eff = d_eff            # signal/noise boundary
        self.bit_config = bit_config  # per-dim bit widths

    def compress_keys(self, keys, layer_idx, head_idx):
        """
        keys: [batch, seq_len, head_dim]
        Returns: compressed representation
        """
        R = self.R[layer_idx, head_idx]  # [head_dim, head_dim]
        # Rotate into eigenbasis
        rotated = keys @ R  # [batch, seq_len, head_dim]
        # Split signal and noise
        d = self.d_eff[layer_idx, head_idx]
        signal = rotated[..., :d]  # high-variance dims
        noise = rotated[..., d:]   # low-variance dims
        # Quantize signal with more bits (e.g., 4-8 bit)
        # (quantize / low_bit_quantize are placeholder methods, not shown here)
        signal_q = self.quantize(signal, self.bit_config[..., :d])
        # Quantize noise aggressively, with no QJL correction on these dims
        noise_q = self.low_bit_quantize(noise, bits=2)
        return (signal_q, noise_q, d)
The rotation replaces random projection: In QJL, you multiply keys by a random Gaussian matrix. SpectralQuant replaces this with the eigenvector matrix, which concentrates variance into the first few dimensions.
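Two properties of the rotation are worth seeing in numbers: an orthogonal rotation leaves query-key dot products unchanged, and rotating into the eigenbasis pushes the energy into the leading dimension. The sketch below uses plain Python and a hand-built 2-D example so it runs without PyTorch. The keys are synthetic, constructed so that their covariance eigenvectors are (1,1)/sqrt(2) and (1,-1)/sqrt(2); none of this comes from the repo.

```python
import math

def rotate(x, R):
    """Row vector times matrix: x @ R for a 1-D list x."""
    return [sum(x[i] * R[i][j] for i in range(len(x))) for j in range(len(R[0]))]

def dot(a, b):
    return sum(u * v for u, v in zip(a, b))

c = 1 / math.sqrt(2)
# Columns are the orthonormal directions (1,1)/sqrt(2) and (1,-1)/sqrt(2).
R = [[c, c],
     [c, -c]]

# Synthetic zero-mean keys: large spread along (1,1), tiny spread along (1,-1).
keys = [[1.05, 0.95], [1.95, 2.05], [-0.95, -1.05], [-2.05, -1.95]]
query = [0.5, -0.3]

rotated_keys = [rotate(k, R) for k in keys]
rotated_query = rotate(query, R)

# Attention logits before and after rotation agree up to rounding.
scores = [dot(query, k) for k in keys]
rotated_scores = [dot(rotated_query, k) for k in rotated_keys]

# Energy per rotated dimension: almost everything lands in dimension 0.
energy = [sum(k[j] ** 2 for k in rotated_keys) for j in range(2)]
print(scores, rotated_scores, energy)
```

Because the scores are preserved, quantization error is the only source of deviation, and it can be budgeted per dimension according to the energy split.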
3.4 QJL selective application
d_eff cutoff determination:
def compute_d_eff(eigenvalues, threshold=0.99):
    """
    Find effective dimensionality via participation ratio
    or cumulative variance threshold.
    """
    # Participation ratio method (as described in paper)
    total = eigenvalues.sum()
    d_eff_pr = (total ** 2) / (eigenvalues ** 2).sum()
    # Or: cumulative variance threshold (computed for comparison only)
    cumvar = eigenvalues.cumsum(dim=0) / total
    d_eff_cv = (cumvar >= threshold).nonzero()[0][0].item() + 1
    return int(d_eff_pr)  # paper uses participation ratio
The paper reports d_eff ≈ 4 via participation ratio for keys. This is much lower than a 99% variance threshold would give. The participation ratio weights by squared eigenvalues, making it sensitive to the spectral gap, which is what makes it the right metric here.
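The gap between the two definitions is easy to see on a toy spectrum. The sketch below is plain Python; the spectrum (4 large eigenvalues, 124 small ones) is made up for illustration and is not taken from the paper or the repo, and both function names are hypothetical.

```python
def participation_ratio(eigenvalues):
    """(sum of eigenvalues)^2 / (sum of squared eigenvalues)."""
    total = sum(eigenvalues)
    return total ** 2 / sum(v ** 2 for v in eigenvalues)

def cumulative_variance_dim(eigenvalues, threshold=0.99):
    """Smallest number of leading components covering `threshold` of the variance."""
    total = sum(eigenvalues)
    running = 0.0
    for count, value in enumerate(sorted(eigenvalues, reverse=True), start=1):
        running += value
        if running >= threshold * total:
            return count
    return len(eigenvalues)

# Synthetic spectrum: a sharp gap after the fourth eigenvalue.
spectrum = [10.0] * 4 + [0.1] * 124

pr = participation_ratio(spectrum)      # about 6.8
cv = cumulative_variance_dim(spectrum)  # well over 100
print(pr, cv)
```

The long tail of small eigenvalues barely moves the participation ratio, but it still holds enough total variance that a 99% threshold has to include most of it.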
Key insight: SpectralQuant’s “selective” QJL means NO QJL on noise dimensions, not reduced QJL. The paper shows removing QJL entirely from noise dims improves quality by +3.0pp cosine similarity, because QJL correction on near-zero signals injects noise.
3.5 Non-Uniform bit allocation (Water-Filling)
import torch

def water_filling_allocation(eigenvalues, total_bits_per_dim=4.0):
    """
    Optimal bit allocation: R_i = max(0, 0.5 * log2(sigma_i^2 / theta))
    where theta is the water level chosen so sum(R_i) = R_total.
    """
    d = len(eigenvalues)
    total_budget = total_bits_per_dim * d
    # Binary search for the water level theta
    lo, hi = 1e-10, eigenvalues.max().item()
    for _ in range(100):  # bisection
        theta = (lo + hi) / 2
        bits = 0.5 * torch.log2(eigenvalues / theta).clamp(min=0)
        total_used = bits.sum().item()
        if total_used > total_budget:
            lo = theta  # over budget: raise the water level
        else:
            hi = theta
    # Snap to implementable bit widths: {0, 1, 2, 3, 4, 6, 8}
    # (snap_to_codebook is a helper not shown here)
    bits = snap_to_codebook(bits)
    return bits
Bit width mapping: Signal dimensions get 4-8 bits, noise dimensions get 0-2 bits. The original reconstruction assumed a simple uniform quantizer with learned scale/zero-point per channel; the repo instead uses Lloyd-Max quantization with codebooks (see Section 4).
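The snapping step is left undefined in the reconstruction. A minimal sketch of the whole allocation in plain Python follows, assuming nearest-width rounding; `snap_to_codebook` and `water_fill` here are hypothetical helpers written for this example, not the repo's functions, and the variances are synthetic.

```python
import math

ALLOWED_BITS = (0, 1, 2, 3, 4, 6, 8)  # widths listed in the reconstruction above

def snap_to_codebook(bits, allowed=ALLOWED_BITS):
    """Round each real-valued allocation to the nearest allowed width (assumed policy)."""
    return [min(allowed, key=lambda width: abs(width - b)) for b in bits]

def water_fill(variances, total_budget, iters=100):
    """Bisection on the water level theta so that sum(R_i) matches total_budget."""
    def rates(theta):
        return [max(0.0, 0.5 * math.log2(v / theta)) for v in variances]

    lo, hi = 1e-12, max(variances)
    for _ in range(iters):
        theta = (lo + hi) / 2
        if sum(rates(theta)) > total_budget:
            lo = theta  # over budget: raise the water level
        else:
            hi = theta
    return rates(hi)

# Synthetic variances: four signal directions, four near-zero ones.
variances = [64.0, 16.0, 4.0, 1.0, 0.01, 0.01, 0.01, 0.01]
ideal = water_fill(variances, total_budget=7.0)  # about [3.25, 2.25, 1.25, 0.25, 0, 0, 0, 0]
snapped = snap_to_codebook(ideal)                # [3, 2, 1, 0, 0, 0, 0, 0]
print(ideal, snapped)
```

Nearest-width rounding can land above or below the budget (here it uses 6 of the 7 bits), so a real implementation would likely need a rebalancing pass after snapping.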
3.6 memory layout
Per token, per layer, per head:
+---------------------+------------------------+----------+
| Signal dims (d_eff) | Noise dims (remaining) | Metadata |
| mixed-precision     | low-bit quantized      | scales   |
+---------------------+------------------------+----------+
Memory savings (for head_dim=128, d_eff=4):
- Uncompressed: 128 dims * 16 bits = 256 bytes
- Signal: 4 dims * 8 bits = 4 bytes + 2 bytes scales
- Noise: 124 dims * 2 bits = 31 bytes + 4 bytes scales
- Total: ~41 bytes ⇒ ~6.2x compression (close to reported 5.95x with metadata overhead)
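The byte counts above can be reproduced with a few lines of arithmetic, which also makes it easy to try other values of d_eff or bit widths. The function name and its parameters are hypothetical; the defaults mirror the figures in this section (8-bit signal, 2-bit noise, 2 + 4 bytes of scales).

```python
def bytes_per_token_head(head_dim=128, d_eff=4, signal_bits=8, noise_bits=2,
                         signal_scale_bytes=2, noise_scale_bytes=4):
    """Compressed size of one key vector (one token, one layer, one head)."""
    signal = d_eff * signal_bits / 8
    noise = (head_dim - d_eff) * noise_bits / 8
    return signal + signal_scale_bytes + noise + noise_scale_bytes

uncompressed = 128 * 16 / 8           # 256.0 bytes in FP16
compressed = bytes_per_token_head()   # 4 + 2 + 31 + 4 = 41.0 bytes
ratio = uncompressed / compressed     # about 6.24

print(compressed, round(ratio, 2))
```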
4. Code quality assessment
research code — Well-Structured
The repo is research code but well-organized (106 stars, MIT license):
- 21 experiment scripts with cross-architecture testing (Qwen, Llama, Mistral, Gemma)
- Statistical significance via 10-seed confidence intervals (seeds: 42, 123, 7, 2024, 31415, 99, 1337, 8675309, 271828, 314159)
- 44 JSON result files across 15 subdirectories — raw data is reproducible
- Lloyd-Max quantization with codebooks (not simple uniform quantization as initially assumed)
- Pure PyTorch, no custom CUDA kernels
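For readers unfamiliar with Lloyd-Max quantization, the following is a textbook 1-D sketch in plain Python: alternate between assigning samples to their nearest level and moving each level to the mean of its samples. It is not the code in nonuniform_quantization.py, which was not reproduced here; the function names, the Gaussian test data, and the 4-level (2-bit) codebook are all assumptions for illustration.

```python
import random

def nearest(x, levels):
    """Index of the reconstruction level closest to x."""
    return min(range(len(levels)), key=lambda i: abs(x - levels[i]))

def lloyd_max(samples, init_levels, iters=50):
    """Alternate nearest-level assignment and centroid update."""
    levels = list(init_levels)
    for _ in range(iters):
        buckets = [[] for _ in levels]
        for x in samples:
            buckets[nearest(x, levels)].append(x)
        # An empty bucket keeps its previous level.
        levels = [sum(b) / len(b) if b else lv for b, lv in zip(buckets, levels)]
    return levels

def mse(samples, levels):
    return sum((x - levels[nearest(x, levels)]) ** 2 for x in samples) / len(samples)

rng = random.Random(0)
samples = [rng.gauss(0.0, 1.0) for _ in range(2000)]

# Baseline: 4 evenly spaced levels over the observed range.
lo, hi = min(samples), max(samples)
uniform = [lo + (hi - lo) * (i + 0.5) / 4 for i in range(4)]

codebook = lloyd_max(samples, uniform)
mse_uniform = mse(samples, uniform)
mse_lloyd = mse(samples, codebook)
print(mse_uniform, mse_lloyd)
```

Each iteration can only lower (or keep) the mean squared error, so the fitted codebook is never worse than the uniform grid it starts from; on bell-shaped data the levels move toward the dense center.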
integration gap to production (vLLM / SGLang / FlashInfer)
Estimated weeks to months of engineering work (speculative — depends on vLLM architecture changes needed):
- PagedAttention compatibility: vLLM’s paged blocks need mixed-precision packing within pages
- CUDA kernel for fused rotate-quantize: Must avoid materializing the full rotated tensor
- Attention kernel modification: FlashAttention expects uniform precision KV cache — mixed-precision requires a custom kernel
- Calibration pipeline: One-time step on model load (~15 seconds per the paper)
FlashInfer integration path: FlashInfer already supports INT4 KV cache. Extending to mixed-precision per-dimension would require modifying append_paged_kv_cache and batch_decode kernels.
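To give a sense of what mixed-precision packing within a page involves, here is a minimal sketch of packing 2-bit codes four to a byte, in plain Python. The bit order within each byte is an arbitrary choice for this example and does not describe any layout used by vLLM, FlashInfer, or the SpectralQuant repo; `pack_2bit` and `unpack_2bit` are hypothetical names.

```python
def pack_2bit(codes):
    """Pack 2-bit codes (values 0..3) into bytes, four codes per byte, low bits first."""
    if len(codes) % 4:
        raise ValueError("number of codes must be a multiple of 4")
    if any(not 0 <= c <= 3 for c in codes):
        raise ValueError("codes must be in the range 0..3")
    out = bytearray()
    for i in range(0, len(codes), 4):
        a, b, c, d = codes[i:i + 4]
        out.append(a | (b << 2) | (c << 4) | (d << 6))
    return bytes(out)

def unpack_2bit(packed):
    """Inverse of pack_2bit."""
    codes = []
    for byte in packed:
        codes.extend((byte >> shift) & 0b11 for shift in (0, 2, 4, 6))
    return codes

# 124 noise dimensions at 2 bits each, as in the memory-layout estimate.
codes = [i % 4 for i in range(124)]
packed = pack_2bit(codes)
print(len(packed))  # 31
```

A production kernel would do the same shifts and masks on the GPU, with the added constraint that signal and noise segments of different widths must sit at predictable offsets inside each page.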
5. Reproduction guide
hardware requirements
- Calibration: 1x GPU with enough VRAM for the model + ~10GB activation cache
- Qwen 2.5-14B: ~28GB model + ~10GB activations = 38GB (1x A100-80GB)
- Calibration time: ~15 seconds (per paper claim)
- Evaluation: Same as calibration
verified commands (from repo README)
git clone https://github.com/dynamis-labs/spectralquant.git
cd spectralquant
pip install -e ".[dev]"
# Quick test
PYTHONPATH=src python experiments/run_memory_efficiency.py --quick
# Full evaluation
PYTHONPATH=src python experiments/run_memory_efficiency.py
# Expected: cos_sim = 0.9485, ratio = 5.95x
Requirements: Python >= 3.10, PyTorch >= 2.2.0, CUDA GPU
verified headline numbers (Qwen 2.5-14B)
| Method | Compression | Cosine Sim | Latency (512 tokens) |
|---|---|---|---|
| FP16 baseline | 1.0x | 1.000 | — |
| TurboQuant | 5.02x | 0.9226 | 0.566 ms/step |
| SpectralQuant | 5.95x | 0.9485 | 0.257 ms/step |
The 2.2x latency improvement (0.257 vs 0.566 ms/step at 512 tokens) is the end-to-end number. The 4.5x attention-specific speedup cited in the paper refers to the attention computation alone (skipping QJL on 124 dims).
what could go wrong
- RoPE handling: Wrong treatment of RoPE-transformed keys during calibration invalidates the eigenbasis
- Calibration data mismatch: Different split/dataset shifts results by 0.1-0.3 perplexity
- KV heads under GQA: Calibration must target KV heads, not query heads. LLaMA-3, for example, uses 8 KV heads vs 32 Q heads.
6. Comparison with related work
| Feature | QJL/TurboQuant | KIVI | GEAR | SpectralQuant |
|---|---|---|---|---|
| Rotation | Random Gaussian | None | None | Learned eigenvectors |
| Bit allocation | Uniform | Uniform + outlier | Uniform + residual | Water-filling |
| Calibration | None (data-oblivious) | None | None | 15 seconds (data-aware) |
| Error correction | QJL on all dims | None | Low-rank residual | QJL on signal only |
| Compression | ~5x | ~8x | ~6x | ~6x |
| Quality at 6x | Good | Degraded | Good | Best (by cosine sim) |
SpectralQuant’s tradeoff: requires offline calibration (a new step in the deployment pipeline) but delivers the best quality at a given compression ratio.
7. Answered questions (from repo)
- GQA models: Yes — experiments include Llama, Mistral, Gemma (all GQA)
- Triton kernel: No — pure PyTorch implementation
- Value vectors: d_eff ≈ 40-55 for values (vs 4 for keys), confirmed in calibration.py
- Calibration determinism: Default seed 42, with 10-seed robustness testing
- Latency overhead: 0.257 ms/step at 512 tokens (2.2x faster than TurboQuant)
- vLLM/SGLang plugin: No, standalone research code
- Memory layout: Handled in engine.py with nonuniform_quantization.py for codebook packing
interesting reads
- SpectralQuant repository (GitHub, MIT license) — the source code analyzed in this article
- TurboQuant (Google Research) — the QJL baseline implementation
- KIVI — per-channel KV quantization, earlier approach
- vLLM / SGLang — the serving frameworks where SpectralQuant would integrate
Updated April 2026 with verified details from the SpectralQuant repository (MIT license, 106 stars).