SpectralQuant Implementation Analysis

Companion to SpectralQuant KV Cache. See also: Inference Stack Synthesis.

Update (April 2026): The repo at https://github.com/Dynamis-Labs/spectralquant is now public (MIT license, 106 stars). The actual code structure is documented below. Repo-level facts (structure, commands, headline numbers) have been replaced with verified details from the repo; the code listings in section 3 are still reconstructions.

1. Paper recap (implementation-relevant points)

For the full technical analysis of SpectralQuant’s spectral rotation, selective error correction, and non-uniform bit allocation, see SpectralQuant KV Cache. This document focuses on the implementation: code structure, reproduction, and production integration path.

2. Actual repository structure

spectralquant/
  src/spectralquant/
    calibration.py          # Eigenspectral calibration with PCA and d_eff calculation
    spectral_rotation.py    # Rotation-based compression
    nonuniform_quantization.py  # Lloyd-Max quantization with codebooks
    selective_qjl.py        # Selective error correction (signal dims only)
    engine.py               # Main compression engine
    metrics.py              # Evaluation metrics (cosine sim, perplexity)
  experiments/              # 21 scripts covering:
    run_memory_efficiency.py
    # Cross-architecture testing (Qwen, Llama, Mistral, Gemma)
    # Statistical significance (10-seed confidence intervals)
    # Latency benchmarks, perplexity, NIAH, LongBench
  results/                  # 44 JSON files across 15 subdirectories
  paper_output/             # LaTeX source + compiled PDF + figures
  eval/
    cosine_sim.py       # Cosine similarity measurement
    perplexity.py       # Perplexity evaluation on WikiText-2 / C4
  scripts/
    calibrate.py        # CLI entry point for calibration
    evaluate.py         # CLI entry point for benchmarks
  configs/
    qwen2.5_14b.yaml    # Per-model calibration configs
    llama3_8b.yaml

3. Implementation details (reconstructed)

3.1 Calibration data collection

Expected implementation:

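# scripts/calibrate.py (reconstructed)
import torch
 
def collect_key_activations(model, dataloader, num_samples=128):
    """Run forward passes, collect K projections per layer (split per head later)."""
    key_accum = {}  # {k_proj module name: list of key tensors}
 
    def make_hook(name):
        def hook(module, inputs, output):
            # Output of k_proj, captured before RoPE and before any reshaping
            key_accum.setdefault(name, []).append(output.detach().cpu())
        return hook
 
    # Hook attention layers to capture K after projection (assumes modules named "k_proj")
    handles = [module.register_forward_hook(make_hook(name))
               for name, module in model.named_modules() if name.endswith("k_proj")]
    with torch.no_grad():
        for batch_idx, batch in enumerate(dataloader):
            if batch_idx >= num_samples:
                break
            model(batch, output_attentions=False)
    for handle in handles:
        handle.remove()
 
    return key_accum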

Key parameters:

Dataset: Likely WikiText-2 or C4 validation split (128-512 sequences)
Sequence length: 2048 tokens (matching model context)
Batch size: 1 (calibration is not throughput-sensitive)
Sampling: First N sequences, no shuffling (deterministic)

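Collection mechanism: Register forward hooks on each attention layer’s key projection (k_proj). Capture the output tensor after the linear projection but before any RoPE or reshaping. At that point the tensor is [batch, seq_len, num_kv_heads * head_dim]; reshaping it to [batch, seq_len, num_kv_heads, head_dim] gives raw per-head key vectors. Because these keys are pre-RoPE while the KV cache stores post-RoPE keys, RoPE has to be applied to the captured tensors (or the hook moved after the rotary embedding) before the covariance is computed.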
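The reshaping step described above can be sketched as follows. This is a minimal illustration, assuming the hooked tensor is the flat output of a linear key projection; split_keys_per_head is a hypothetical helper name, not a function from the repo, and the shapes are toy values.

```python
import torch

def split_keys_per_head(k_proj_out: torch.Tensor, num_kv_heads: int) -> torch.Tensor:
    """Reshape [batch, seq_len, num_kv_heads * head_dim] -> [num_kv_heads, N, head_dim].

    N = batch * seq_len, so each head ends up with one matrix of key samples.
    """
    batch, seq_len, width = k_proj_out.shape
    if width % num_kv_heads != 0:
        raise ValueError("projection width must be divisible by num_kv_heads")
    head_dim = width // num_kv_heads
    per_head = k_proj_out.reshape(batch, seq_len, num_kv_heads, head_dim)
    # Move heads first, then flatten batch and sequence into one sample axis.
    return per_head.permute(2, 0, 1, 3).reshape(num_kv_heads, batch * seq_len, head_dim)

# Toy shapes only: 2 sequences, 16 tokens, 8 KV heads, head_dim 128.
fake_k = torch.randn(2, 16, 8 * 128)
keys_by_head = split_keys_per_head(fake_k, num_kv_heads=8)
print(keys_by_head.shape)  # torch.Size([8, 32, 128])
```

Each keys_by_head[h] is then an [N, head_dim] matrix of the kind the eigendecomposition step expects.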

3.2 Eigendecomposition

def compute_spectral_basis(key_activations, per_head=True):
    """
    Compute eigenbasis for key covariance.
 
    Args:
        key_activations: [N, head_dim] concatenated key vectors
        per_head: If True, separate basis per attention head
 
    Returns:
        eigenvectors: [head_dim, head_dim] orthogonal rotation matrix
        eigenvalues: [head_dim] variance along each component
    """
    # Center the data
    mean = key_activations.mean(dim=0)
    centered = key_activations - mean
 
    # Covariance matrix: [head_dim, head_dim]
    # For head_dim=128, this is 128x128 -- trivially small
    cov = (centered.T @ centered) / (centered.shape[0] - 1)
 
    # Eigendecomposition (symmetric positive semi-definite)
    eigenvalues, eigenvectors = torch.linalg.eigh(cov)
 
    # Sort descending by eigenvalue
    idx = eigenvalues.argsort(descending=True)
    eigenvalues = eigenvalues[idx]
    eigenvectors = eigenvectors[:, idx]
 
    return eigenvectors, eigenvalues

Critical design decisions:

Per-head vs per-layer: Should be per-head because different heads attend to different subspaces. The covariance structure varies significantly across heads (some are “positional,” some are “semantic”).
torch.linalg.eigh vs numpy: Use torch.linalg.eigh (symmetric eigendecomposition) because the covariance matrix is guaranteed symmetric PSD. For head_dim=128, the 128x128 eigendecomposition takes <1ms — the bottleneck is collecting activations.
Storage: The eigenvector matrix is [num_layers, num_heads, head_dim, head_dim]. For Qwen 2.5-14B: ~32 MB in FP16. One-time cost, stored alongside the model.
RoPE interaction: Keys have RoPE applied. Calibration must eigendecompose post-RoPE keys because that’s the representation stored in the KV cache. However, RoPE makes the covariance position-dependent. The paper likely averages across positions.
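A quick sanity check for any calibrated basis is that the eigenvector matrix is orthonormal and that it diagonalizes the covariance. The sketch below runs that check on synthetic data; the variance profile and sizes are assumptions chosen for illustration, not values from the paper or repo.

```python
import torch

torch.manual_seed(0)
head_dim, n = 16, 4096
# Synthetic keys with a steep variance profile, mixed by a random orthogonal matrix.
scales = torch.logspace(0, -2, head_dim, dtype=torch.float64)
mix, _ = torch.linalg.qr(torch.randn(head_dim, head_dim, dtype=torch.float64))
keys = (torch.randn(n, head_dim, dtype=torch.float64) * scales) @ mix.T

centered = keys - keys.mean(dim=0)
cov = centered.T @ centered / (n - 1)
eigenvalues, eigenvectors = torch.linalg.eigh(cov)   # eigh returns ascending order
eigenvalues, eigenvectors = eigenvalues.flip(0), eigenvectors.flip(1)

identity = torch.eye(head_dim, dtype=torch.float64)
is_orthonormal = torch.allclose(eigenvectors.T @ eigenvectors, identity, atol=1e-8)
rotated_cov = eigenvectors.T @ cov @ eigenvectors
is_diagonal = torch.allclose(rotated_cov, torch.diag(eigenvalues), atol=1e-8)
print(is_orthonormal, is_diagonal)
```

Running the same two checks per layer and head after calibration is a cheap way to catch a transposed or mis-sorted basis before it reaches the cache.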

3.3 Spectral rotation at inference

class SpectralKVCache:
    def __init__(self, eigenvectors, eigenvalues, d_eff, bit_config):
        self.R = eigenvectors           # rotation matrices
        self.lambdas = eigenvalues      # for bit allocation
        self.d_eff = d_eff              # signal/noise boundary
        self.bit_config = bit_config    # per-dim bit widths
 
    def compress_keys(self, keys, layer_idx, head_idx):
        """
        keys: [batch, seq_len, head_dim]
        Returns: compressed representation
        """
        R = self.R[layer_idx, head_idx]  # [head_dim, head_dim]
 
        # Rotate into eigenbasis
        rotated = keys @ R  # [batch, seq_len, head_dim]
 
        # Split signal and noise
        d = self.d_eff[layer_idx, head_idx]
        signal = rotated[..., :d]     # high-variance dims
        noise = rotated[..., d:]      # low-variance dims
 
        # Quantize signal with more bits (e.g., 4-8 bit)
        signal_q = self.quantize(signal, self.bit_config[..., :d])
 
        # Quantize noise aggressively or skip QJL
        noise_q = self.low_bit_quantize(noise, bits=2)
 
        return (signal_q, noise_q, d)

The rotation replaces random projection: In QJL, you multiply keys by a random Gaussian matrix. SpectralQuant replaces this with the eigenvector matrix, which concentrates variance into the first few dimensions.
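To illustrate the claim about variance concentration, the sketch below compares the share of variance that lands in the first four coordinates after an eigenvector rotation versus a random orthogonal rotation. The data is synthetic (four strong directions plus low-variance filler, an assumption for illustration), so the numbers only show the effect, not the paper's results.

```python
import torch

torch.manual_seed(0)
head_dim, n, top = 32, 8192, 4
# Synthetic keys: a few strong directions hidden by a random orthogonal mix.
scales = torch.cat([torch.tensor([8.0, 6.0, 4.0, 3.0]), torch.full((head_dim - 4,), 0.3)])
mix, _ = torch.linalg.qr(torch.randn(head_dim, head_dim))
keys = (torch.randn(n, head_dim) * scales) @ mix.T

def top_fraction(rotation: torch.Tensor) -> float:
    """Share of total variance in the first `top` coordinates after rotating."""
    var = (keys @ rotation).var(dim=0)
    return (var[:top].sum() / var.sum()).item()

centered = keys - keys.mean(dim=0)
_, vecs = torch.linalg.eigh(centered.T @ centered / (n - 1))
spectral = vecs.flip(1)                       # columns in descending eigenvalue order
random_rot, _ = torch.linalg.qr(torch.randn(head_dim, head_dim))

spectral_share = top_fraction(spectral)
random_share = top_fraction(random_rot)
print(f"spectral rotation: {spectral_share:.3f}")
print(f"random rotation:   {random_share:.3f}")
```

A random rotation spreads variance roughly evenly across coordinates, which is why it needs error correction everywhere; the eigenbasis puts almost all of it in the leading coordinates.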

3.4 QJL selective application

d_eff cutoff determination:

def compute_d_eff(eigenvalues, threshold=0.99):
    """
    Find effective dimensionality via participation ratio
    or cumulative variance threshold.
    """
    # Participation ratio method (as described in paper)
    total = eigenvalues.sum()
    d_eff_pr = (total ** 2) / (eigenvalues ** 2).sum()
 
    # Or: cumulative variance threshold
    cumvar = eigenvalues.cumsum(dim=0) / total
    d_eff_cv = (cumvar >= threshold).nonzero()[0][0].item() + 1
 
    return int(d_eff_pr)  # paper uses participation ratio

The paper reports d_eff ≈ 4 via participation ratio for keys. This is much lower than a 99% variance threshold would give. The participation ratio weights by squared eigenvalues, making it sensitive to the spectral gap — which is what makes it the right metric here.
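The gap between the two measures is easy to see on a toy spectrum. The sketch below uses made-up eigenvalues (four large ones followed by a flat tail) and plain Python; it is an illustration of the two formulas, not the repo's calibration code.

```python
def participation_ratio(eigenvalues):
    """(sum lambda)^2 / sum lambda^2: equals d for a flat spectrum, 1 for a single spike."""
    total = sum(eigenvalues)
    return total * total / sum(v * v for v in eigenvalues)

def cumulative_variance_dim(eigenvalues, threshold=0.99):
    """Smallest k whose top-k eigenvalues reach `threshold` of the total variance."""
    total = sum(eigenvalues)
    running = 0.0
    for k, v in enumerate(sorted(eigenvalues, reverse=True), start=1):
        running += v
        if running >= threshold * total:
            return k
    return len(eigenvalues)

flat = [1.0] * 128
spiky = [100.0, 50.0, 25.0, 12.0] + [1.0] * 124   # made-up spectrum

print(participation_ratio(flat))                   # 128.0
print(round(participation_ratio(spiky), 2))        # 7.22
print(cumulative_variance_dim(spiky))              # 125
```

On the spiky spectrum the participation ratio stays in single digits because the squared eigenvalues are dominated by the leading terms, while the 99% threshold is dragged out to 125 dimensions by the long flat tail.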

Key insight: SpectralQuant’s “selective” QJL means NO QJL on noise dimensions, not reduced QJL. The paper shows removing QJL entirely from noise dims improves quality by +3.0pp cosine similarity, because QJL correction on near-zero signals injects noise.

3.5 Non-uniform bit allocation (water-filling)

def water_filling_allocation(eigenvalues, total_bits_per_dim=4.0):
    """
    Optimal bit allocation: R_i = max(0, 0.5 * log2(sigma_i^2 / theta))
    where theta is the water level chosen so sum(R_i) = R_total.
    """
    d = len(eigenvalues)
    total_budget = total_bits_per_dim * d
 
    # Binary search for the water level theta
    lo, hi = 1e-10, eigenvalues.max().item()
 
    for _ in range(100):  # bisection
        theta = (lo + hi) / 2
        bits = 0.5 * torch.log2(eigenvalues / theta).clamp(min=0)
        total_used = bits.sum().item()
 
        if total_used > total_budget:
            lo = theta
        else:
            hi = theta
 
    # Snap to implementable bit widths: {0, 1, 2, 3, 4, 6, 8}
    bits = snap_to_codebook(bits)
 
    return bits
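The listing above calls snap_to_codebook without defining it. One possible version is sketched below in plain Python, together with a compact copy of the bisection so the example runs on its own. Rounding down to the nearest allowed width is an assumption made here to stay within budget; the repo may use a different policy, and the eigenvalues are made up.

```python
import math

ALLOWED_BITS = (0, 1, 2, 3, 4, 6, 8)  # widths listed in the reconstruction's comment

def snap_to_codebook(bits, allowed=ALLOWED_BITS):
    """Round each fractional bit width down to the nearest allowed width.

    Rounding down keeps the total at or under the water-filling budget.
    """
    return [max(a for a in allowed if a <= b) for b in bits]

def water_fill(eigenvalues, total_bits):
    """Bisection on theta so that sum(max(0, 0.5 * log2(lambda / theta))) ~= total_bits."""
    lo, hi = 1e-12, max(eigenvalues)
    for _ in range(100):
        theta = (lo + hi) / 2
        bits = [max(0.0, 0.5 * math.log2(v / theta)) for v in eigenvalues]
        if sum(bits) > total_bits:
            lo = theta      # over budget: raise the water level
        else:
            hi = theta      # under budget: lower the water level
    return bits

spectrum = [64.0, 16.0, 4.0, 1.0, 0.25, 0.25, 0.25, 0.25]  # made-up eigenvalues
fractional = water_fill(spectrum, total_bits=8.0)
snapped = snap_to_codebook(fractional)
print([round(b, 2) for b in fractional])
print(snapped)
```

For this spectrum the water level settles at theta = 0.5, so the fractional widths come out as 3.5, 2.5, 1.5, and 0.5 bits for the four largest eigenvalues and zero for the rest. Rounding down leaves part of the budget unused; a fuller version could redistribute the leftover bits to the largest eigenvalues.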

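Bit width mapping: Signal dimensions get 4-8 bits, noise dimensions get 0-2 bits. The repo implements the codebooks with Lloyd-Max quantization (nonuniform_quantization.py), not the simple uniform quantizer with learned per-channel scale/zero-point that the original reconstruction assumed.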
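The repository listing describes nonuniform_quantization.py as Lloyd-Max quantization with codebooks. For readers unfamiliar with the technique, here is a generic textbook Lloyd-Max sketch in plain Python: it alternates nearest-level assignment with centroid updates on one-dimensional samples. The function names, the quantile initialization, and the Gaussian test data are all assumptions for illustration and are not taken from the repo.

```python
import random

def lloyd_max(samples, num_levels, iterations=50):
    """Textbook Lloyd-Max: alternate nearest-level assignment and centroid update."""
    ordered = sorted(samples)
    # Initialise levels at evenly spaced quantiles of the data.
    levels = [ordered[(2 * i + 1) * len(ordered) // (2 * num_levels)]
              for i in range(num_levels)]
    for _ in range(iterations):
        buckets = [[] for _ in levels]
        for x in samples:
            nearest = min(range(num_levels), key=lambda i: abs(x - levels[i]))
            buckets[nearest].append(x)
        # Empty buckets keep their previous level.
        levels = [sum(b) / len(b) if b else lv for b, lv in zip(buckets, levels)]
    return sorted(levels)

def quantize(x, levels):
    """Return the index of the reconstruction level closest to x."""
    return min(range(len(levels)), key=lambda i: abs(x - levels[i]))

random.seed(0)
data = [random.gauss(0.0, 1.0) for _ in range(5000)]
codebook = lloyd_max(data, num_levels=4)      # a 2-bit codebook
print([round(c, 2) for c in codebook])
print(quantize(0.1, codebook))
```

Unlike a uniform grid, the learned levels cluster where the data is dense, which lowers mean squared error at the same bit width.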

3.6 Memory layout

Per token, per layer, per head:
+---------------------+------------------------+----------+
| Signal dims (d_eff) | Noise dims (remaining) | Metadata |
| mixed-precision     | low-bit quantized      | scales   |
+---------------------+------------------------+----------+

Memory savings (for head_dim=128, d_eff=4):

Uncompressed: 128 dims * 16 bits = 256 bytes
Signal: 4 dims * 8 bits = 4 bytes + 2 bytes scales
Noise: 124 dims * 2 bits = 31 bytes + 4 bytes scales
Total: ~41 bytes ⇒ ~6.2x compression (close to reported 5.95x with metadata overhead)
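The arithmetic above can be reproduced with a few lines of Python, which also makes it easy to try other values of d_eff or bit widths. The function name and the default scale sizes (2 and 4 bytes, as in the estimate above) are illustrative assumptions.

```python
def bytes_per_token_head(head_dim, d_eff, signal_bits, noise_bits,
                         signal_scale_bytes=2, noise_scale_bytes=4):
    """Bytes per token, per layer, per head under the layout sketched above."""
    signal = d_eff * signal_bits / 8 + signal_scale_bytes
    noise = (head_dim - d_eff) * noise_bits / 8 + noise_scale_bytes
    return signal + noise

fp16_bytes = 128 * 16 / 8
compressed = bytes_per_token_head(head_dim=128, d_eff=4, signal_bits=8, noise_bits=2)
ratio = fp16_bytes / compressed
print(fp16_bytes, compressed, round(ratio, 2))  # 256.0 41.0 6.24
```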

4. Code quality assessment

Research code: well-structured

The repo is research code but well-organized (106 stars, MIT license):

21 experiment scripts with cross-architecture testing (Qwen, Llama, Mistral, Gemma)
Statistical significance via 10-seed confidence intervals (seeds: 42, 123, 7, 2024, 31415, 99, 1337, 8675309, 271828, 314159)
44 JSON result files across 15 subdirectories — raw data is reproducible
Lloyd-Max quantization with codebooks (not simple uniform quantization as initially assumed)
Pure PyTorch, no custom CUDA kernels

Integration gap to production (vLLM / SGLang / FlashInfer)

Estimated weeks to months of engineering work (speculative — depends on vLLM architecture changes needed):

PagedAttention compatibility: vLLM’s paged blocks need mixed-precision packing within pages
CUDA kernel for fused rotate-quantize: Must avoid materializing the full rotated tensor
Attention kernel modification: FlashAttention expects uniform precision KV cache — mixed-precision requires a custom kernel
Calibration pipeline: One-time step on model load (~15 seconds per the paper)

FlashInfer integration path: FlashInfer already supports INT4 KV cache. Extending to mixed-precision per-dimension would require modifying append_paged_kv_cache and batch_decode kernels.
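The mixed-precision packing mentioned in the PagedAttention item comes down to bit-packing sub-byte codes. The sketch below packs 2-bit codes four to a byte in plain Python. It only illustrates the data layout a fused kernel would have to produce and consume; the function names and the low-bits-first ordering are assumptions, and nothing here reflects vLLM or FlashInfer internals.

```python
def pack_2bit(codes):
    """Pack integers in [0, 3] four per byte, first code in the low bits."""
    if any(not 0 <= c <= 3 for c in codes):
        raise ValueError("2-bit codes must be in [0, 3]")
    padded = list(codes) + [0] * (-len(codes) % 4)
    out = bytearray()
    for i in range(0, len(padded), 4):
        a, b, c, d = padded[i:i + 4]
        out.append(a | (b << 2) | (c << 4) | (d << 6))
    return bytes(out)

def unpack_2bit(packed, count):
    """Inverse of pack_2bit; `count` drops the zero padding."""
    codes = []
    for byte in packed:
        codes.extend((byte >> shift) & 0b11 for shift in (0, 2, 4, 6))
    return codes[:count]

noise_codes = [i % 4 for i in range(124)]   # 124 noise dims at 2 bits each
packed = pack_2bit(noise_codes)
print(len(packed))                           # 31
assert unpack_2bit(packed, 124) == noise_codes
```

The 31 bytes match the noise-dimension payload in the memory layout estimate; a paged layout would additionally need each token's signal bytes and scales stored at fixed offsets within the page.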

5. Reproduction guide

Hardware requirements

Calibration: 1x GPU with enough VRAM for the model + ~10GB activation cache
- Qwen 2.5-14B: ~28GB model + ~10GB activations = 38GB (1x A100-80GB)
- Calibration time: ~15 seconds (per paper claim)
Evaluation: Same as calibration

Verified commands (from repo README)

git clone https://github.com/dynamis-labs/spectralquant.git
cd spectralquant
pip install -e ".[dev]"
 
# Quick test
PYTHONPATH=src python experiments/run_memory_efficiency.py --quick
 
# Full evaluation
PYTHONPATH=src python experiments/run_memory_efficiency.py
# Expected: cos_sim = 0.9485, ratio = 5.95x

Requirements: Python >= 3.10, PyTorch >= 2.2.0, CUDA GPU

Verified headline numbers (Qwen 2.5-14B)

Method          Compression   Cosine sim   Latency (512 tokens)
FP16 baseline   1.0x          1.000        n/a
TurboQuant      5.02x         0.9226       0.566 ms/step
SpectralQuant   5.95x         0.9485       0.257 ms/step

The 2.2x latency improvement (0.257 vs 0.566 ms/step at 512 tokens) is the end-to-end number. The 4.5x attention-specific speedup cited in the paper refers to the attention computation alone (skipping QJL on 124 dims).

What could go wrong

RoPE handling: Wrong treatment of RoPE-transformed keys during calibration invalidates the eigenbasis
Calibration data mismatch: Different split/dataset shifts results by 0.1-0.3 perplexity
GQA head count: Calibration and compression must operate on KV heads, not Q heads. LLaMA-3 8B, for example, uses 8 KV heads vs 32 Q heads, so there are 8 key bases per layer, not 32.
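The KV-head point can be made concrete with the head counts quoted above. The sketch assumes the contiguous grouping used by common grouped-query attention implementations (consecutive query heads share one KV head) and a 32-layer model; both are assumptions for illustration, and the helper names are hypothetical.

```python
def kv_head_for_query_head(q_head: int, num_q_heads: int, num_kv_heads: int) -> int:
    """Map a query head to the KV head it shares under grouped-query attention."""
    if num_q_heads % num_kv_heads != 0:
        raise ValueError("num_q_heads must be a multiple of num_kv_heads")
    group_size = num_q_heads // num_kv_heads
    return q_head // group_size

def num_spectral_bases(num_layers: int, num_kv_heads: int) -> int:
    """One eigenbasis per (layer, KV head) pair."""
    return num_layers * num_kv_heads

# Head counts from the text: 32 query heads sharing 8 KV heads.
mapping = [kv_head_for_query_head(q, 32, 8) for q in range(32)]
print(mapping[:8])                      # [0, 0, 0, 0, 1, 1, 1, 1]
print(num_spectral_bases(32, 8))        # 256
```

Calibrating per query head would compute four identical bases for every KV head and index the cache with the wrong head count.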

6. Comparison with related work

Feature            QJL/TurboQuant          KIVI                GEAR                 SpectralQuant
Rotation           Random Gaussian         None                None                 Learned eigenvectors
Bit allocation     Uniform                 Uniform + outlier   Uniform + residual   Water-filling
Calibration        None (data-oblivious)   None                None                 15 seconds (data-aware)
Error correction   QJL on all dims         None                Low-rank residual    QJL on signal only
Compression        ~5x                     ~8x                 ~6x                  ~6x
Quality at 6x      Good                    Degraded            Good                 Best (by cosine sim)

SpectralQuant’s tradeoff: requires offline calibration (a new step in the deployment pipeline) but delivers the best quality at a given compression ratio.

7. Answered questions (from repo)

GQA models: Yes — experiments include Llama, Mistral, Gemma (all GQA)
Triton kernel: No — pure PyTorch implementation
Value vectors: d_eff ≈ 40-55 for values (vs 4 for keys) — confirmed in calibration.py
Calibration determinism: Default seed 42, with 10-seed robustness testing
Latency overhead: 0.257 ms/step at 512 tokens (2.2x faster than TurboQuant)
vLLM/SGLang plugin: No — standalone research code
Memory layout: Handled in engine.py with nonuniform_quantization.py for codebook packing

Interesting reads

SpectralQuant repository (GitHub, MIT license) — the source code analyzed in this article
TurboQuant (Google Research) — the QJL baseline implementation
KIVI — per-channel KV quantization, earlier approach
vLLM / SGLang — the serving frameworks where SpectralQuant would integrate

Updated April 2026 with verified details from the SpectralQuant repository (MIT license, 106 stars).
