SpectralQuant Implementation Analysis

Companion to SpectralQuant KV Cache. See also: Inference Stack Synthesis.

Update (April 2026): The repo at https://github.com/Dynamis-Labs/spectralquant is now public (MIT license, 106 stars). Section 2 documents the actual code structure, and verified details from the repo replace the original reconstruction where available. Section 3 is still reconstructed from the paper.

1. Paper recap (implementation-relevant points)

For the full technical analysis of SpectralQuant’s spectral rotation, selective error correction, and non-uniform bit allocation, see SpectralQuant KV Cache. This document focuses on the implementation: code structure, reproduction, and production integration path.

2. Actual repository structure

spectralquant/
  src/spectralquant/
    calibration.py          # Eigenspectral calibration with PCA and d_eff calculation
    spectral_rotation.py    # Rotation-based compression
    nonuniform_quantization.py  # Lloyd-Max quantization with codebooks
    selective_qjl.py        # Selective error correction (signal dims only)
    engine.py               # Main compression engine
    metrics.py              # Evaluation metrics (cosine sim, perplexity)
  experiments/              # 21 scripts covering:
    run_memory_efficiency.py
    # Cross-architecture testing (Qwen, Llama, Mistral, Gemma)
    # Statistical significance (10-seed confidence intervals)
    # Latency benchmarks, perplexity, NIAH, LongBench
  results/                  # 44 JSON files across 15 subdirectories
  paper_output/             # LaTeX source + compiled PDF + figures
  eval/
    cosine_sim.py       # Cosine similarity measurement
    perplexity.py       # Perplexity evaluation on WikiText-2 / C4
  scripts/
    calibrate.py        # CLI entry point for calibration
    evaluate.py         # CLI entry point for benchmarks
  configs/
    qwen2.5_14b.yaml    # Per-model calibration configs
    llama3_8b.yaml

3. Implementation details (reconstructed)

3.1 Calibration data collection

Expected implementation:

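# scripts/calibrate.py (reconstructed)
import torch

def collect_key_activations(model, dataloader, num_samples=128):
    """Run forward passes, collect k_proj outputs per attention layer."""
    key_accum = {}  # {module name: list of [batch, seq_len, num_kv_heads * head_dim] tensors}

    def make_hook(name):
        def hook(module, inputs, output):
            key_accum.setdefault(name, []).append(output.detach().cpu())
        return hook

    # Assumes Hugging Face-style module names ending in "k_proj"
    handles = [m.register_forward_hook(make_hook(n))
               for n, m in model.named_modules() if n.endswith("k_proj")]
    try:
        with torch.no_grad():
            for batch_idx, batch in enumerate(dataloader):
                if batch_idx >= num_samples:
                    break
                model(batch)  # hooks capture K after projection, before RoPE
    finally:
        for handle in handles:
            handle.remove()

    return key_accum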

Key parameters:

  • Dataset: Likely WikiText-2 or C4 validation split (128-512 sequences)
  • Sequence length: 2048 tokens (an assumed calibration length, well under the context limits of the tested models)
  • Batch size: 1 (calibration is not throughput-sensitive)
  • Sampling: First N sequences, no shuffling (deterministic)

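Collection mechanism: Register forward hooks on each attention layer’s key projection (k_proj) and capture the output tensor after the linear projection but before RoPE or reshaping. At that point the tensor is flat, [batch, seq_len, num_kv_heads * head_dim]; reshape it to [batch, seq_len, num_kv_heads, head_dim] to separate the heads. Keys captured this way are pre-RoPE, while the KV cache stores post-RoPE keys (see the RoPE point in 3.2), so either apply the model’s rotary embedding to the captured tensors or hook after it before computing covariances.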

3.2 Eigendecomposition

import torch

def compute_spectral_basis(key_activations, per_head=True):
    """
    Compute eigenbasis for key covariance.

    Args:
        key_activations: [N, head_dim] concatenated key vectors
        per_head: If True, the caller passes one head's keys per call
            (the flag itself is not used inside this function)

    Returns:
        eigenvectors: [head_dim, head_dim] orthogonal rotation matrix
        eigenvalues: [head_dim] variance along each component
    """
    # Center the data; upcast first because torch.linalg.eigh rejects half precision
    key_activations = key_activations.float()
    mean = key_activations.mean(dim=0)
    centered = key_activations - mean
 
    # Covariance matrix: [head_dim, head_dim]
    # For head_dim=128, this is 128x128 -- trivially small
    cov = (centered.T @ centered) / (centered.shape[0] - 1)
 
    # Eigendecomposition (symmetric positive semi-definite)
    eigenvalues, eigenvectors = torch.linalg.eigh(cov)
 
    # Sort descending by eigenvalue
    idx = eigenvalues.argsort(descending=True)
    eigenvalues = eigenvalues[idx]
    eigenvectors = eigenvectors[:, idx]
 
    return eigenvectors, eigenvalues

Critical design decisions:

  1. Per-head vs per-layer: Should be per-head because different heads attend to different subspaces. The covariance structure varies significantly across heads (some are “positional,” some are “semantic”).

  2. torch.linalg.eigh vs numpy: Use torch.linalg.eigh (symmetric eigendecomposition) because the covariance matrix is guaranteed symmetric PSD. For head_dim=128, the 128x128 eigendecomposition takes <1ms — the bottleneck is collecting activations.

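  3. Storage: The eigenvector matrix is [num_layers, num_kv_heads, head_dim, head_dim]. For Qwen 2.5-14B (48 layers, 8 KV heads, head_dim 128) that is 48 × 8 × 128 × 128 × 2 bytes ≈ 12.6 MB in FP16 for the key bases, and roughly double if values get their own bases. One-time cost, stored alongside the model.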

  4. RoPE interaction: Keys have RoPE applied. Calibration must eigendecompose post-RoPE keys because that’s the representation stored in the KV cache. However, RoPE makes the covariance position-dependent. The paper likely averages across positions.

3.3 Spectral rotation at inference

class SpectralKVCache:
    def __init__(self, eigenvectors, eigenvalues, d_eff, bit_config):
        self.R = eigenvectors           # rotation matrices
        self.lambdas = eigenvalues      # for bit allocation
        self.d_eff = d_eff              # signal/noise boundary
        self.bit_config = bit_config    # per-dim bit widths
 
    def compress_keys(self, keys, layer_idx, head_idx):
        """
        keys: [batch, seq_len, head_dim]
        Returns: compressed representation
        """
        R = self.R[layer_idx, head_idx]  # [head_dim, head_dim]
 
        # Rotate into eigenbasis
        rotated = keys @ R  # [batch, seq_len, head_dim]
 
        # Split signal and noise
        d = int(self.d_eff[layer_idx, head_idx])
        signal = rotated[..., :d]     # high-variance dims
        noise = rotated[..., d:]      # low-variance dims

        # Quantize signal with more bits (e.g., 4-8 bit)
        # quantize / low_bit_quantize are placeholders, not defined in this sketch
        signal_q = self.quantize(signal, self.bit_config[layer_idx, head_idx, :d])

        # Quantize noise aggressively; QJL correction is skipped on these dims
        noise_q = self.low_bit_quantize(noise, bits=2)
 
        return (signal_q, noise_q, d)

The rotation replaces random projection: In QJL, you multiply keys by a random Gaussian matrix. SpectralQuant replaces this with the eigenvector matrix, which concentrates variance into the first few dimensions.
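
To make the variance-concentration claim concrete, here is a minimal sketch on synthetic data. The key distribution (four strong latent directions plus small isotropic noise), the tensor sizes, and every variable name are assumptions chosen for illustration; none of it comes from the repo.

```python
import torch

torch.manual_seed(0)
head_dim, n_tokens, n_signal = 16, 2000, 4

# Synthetic "keys": 4 strong latent directions mixed into 16 dims, plus small noise.
latent = 5.0 * torch.randn(n_tokens, n_signal, dtype=torch.float64)
mixing = torch.randn(n_signal, head_dim, dtype=torch.float64)
keys = latent @ mixing + 0.1 * torch.randn(n_tokens, head_dim, dtype=torch.float64)

# Eigenbasis of the key covariance, sorted by descending eigenvalue.
centered = keys - keys.mean(dim=0)
cov = centered.T @ centered / (n_tokens - 1)
eigenvalues, eigenvectors = torch.linalg.eigh(cov)  # ascending order
eigenvalues, eigenvectors = eigenvalues.flip(0), eigenvectors.flip(1)

# Rotate into the eigenbasis: per-dimension variance now equals the eigenvalues.
rotated = centered @ eigenvectors
rotated_var = rotated.var(dim=0)  # unbiased, matches the (n - 1) normalisation above

top_share = (rotated_var[:n_signal].sum() / rotated_var.sum()).item()
print(f"variance share of first {n_signal} rotated dims: {top_share:.4f}")

# The rotation is orthogonal, so inner products (attention logits) are preserved.
identity = torch.eye(head_dim, dtype=torch.float64)
print(torch.allclose(eigenvectors.T @ eigenvectors, identity, atol=1e-8))
```

A random Gaussian projection of the same data would spread the variance roughly evenly over all 16 output dimensions, which is why a data-oblivious method has to treat every dimension alike.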

3.4 QJL selective application

d_eff cutoff determination:

def compute_d_eff(eigenvalues, threshold=0.99):
    """
    Find effective dimensionality via participation ratio
    or cumulative variance threshold.
    """
    # Participation ratio method (as described in paper)
    total = eigenvalues.sum()
    d_eff_pr = (total ** 2) / (eigenvalues ** 2).sum()
 
    # Or: cumulative variance threshold
    cumvar = eigenvalues.cumsum(dim=0) / total
    d_eff_cv = (cumvar >= threshold).nonzero()[0][0].item() + 1
 
    return int(d_eff_pr)  # paper uses participation ratio

The paper reports d_eff ≈ 4 via participation ratio for keys. This is much lower than a 99% variance threshold would give. The participation ratio weights by squared eigenvalues, making it sensitive to the spectral gap — which is what makes it the right metric here.
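
The gap between the two metrics is easy to see on a toy spectrum. The sketch below uses plain Python and a made-up eigenvalue list (four dominant directions followed by a flat tail); the spectrum and function names are illustrative assumptions, not values from the paper or the repo.

```python
def participation_ratio(eigenvalues):
    """Effective dimensionality: (sum of eigenvalues)^2 / sum of squared eigenvalues."""
    total = sum(eigenvalues)
    return total ** 2 / sum(v ** 2 for v in eigenvalues)


def cumulative_variance_dim(eigenvalues, threshold=0.99):
    """Smallest k whose top-k eigenvalues cover `threshold` of the total variance."""
    total = sum(eigenvalues)
    running = 0.0
    for k, value in enumerate(sorted(eigenvalues, reverse=True), start=1):
        running += value
        if running / total >= threshold:
            return k
    return len(eigenvalues)


# Hypothetical spectrum for head_dim=128: four dominant directions, flat tail.
spectrum = [100.0, 50.0, 25.0, 12.0] + [0.5] * 124

d_eff_pr = participation_ratio(spectrum)
d_eff_cv = cumulative_variance_dim(spectrum)
print(f"participation ratio: {d_eff_pr:.2f}")   # about 4.66
print(f"99% variance cutoff: {d_eff_cv} dims")  # 124
```

The flat tail holds about a quarter of the total variance here, so the 99% cutoff has to walk almost all the way through it, while the participation ratio stays near the number of dominant directions.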

Key insight: SpectralQuant’s “selective” QJL means NO QJL on noise dimensions, not reduced QJL. The paper shows removing QJL entirely from noise dims improves quality by +3.0pp cosine similarity, because QJL correction on near-zero signals injects noise.

3.5 Non-uniform bit allocation (water-filling)

import torch

def water_filling_allocation(eigenvalues, total_bits_per_dim=4.0):
    """
    Optimal bit allocation: R_i = max(0, 0.5 * log2(sigma_i^2 / theta))
    where theta is the water level chosen so sum(R_i) = R_total.
    """
    d = len(eigenvalues)
    total_budget = total_bits_per_dim * d

    # Floor the eigenvalues so log2 stays finite if any are numerically zero or negative
    eigenvalues = eigenvalues.clamp(min=1e-10)

    # Binary search for the water level theta
    lo, hi = 1e-10, eigenvalues.max().item()

    for _ in range(100):  # bisection
        theta = (lo + hi) / 2
        bits = 0.5 * torch.log2(eigenvalues / theta).clamp(min=0)
        total_used = bits.sum().item()

        if total_used > total_budget:
            lo = theta  # over budget: raise the water level
        else:
            hi = theta

    # Snap to implementable bit widths: {0, 1, 2, 3, 4, 6, 8}
    # snap_to_codebook is a helper not defined in this reconstruction
    bits = snap_to_codebook(bits)
 
    return bits

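Bit width mapping: Signal dimensions get 4-8 bits, noise dimensions get 0-2 bits. The original reconstruction assumed a simple uniform quantizer with learned scale/zero-point per channel; the repo instead uses Lloyd-Max quantization with codebooks (nonuniform_quantization.py).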
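
The allocation-then-snap step can be sketched end to end in plain Python. Everything here is an assumption for illustration: the toy variance spectrum, the geometric bisection, and in particular snap_to_codebook, which is a hypothetical nearest-width rounding rule rather than anything taken from the paper or the repo.

```python
import math

ALLOWED_BITS = (0, 1, 2, 3, 4, 6, 8)  # widths listed in the reconstruction


def water_fill(variances, avg_bits):
    """Reverse water-filling: R_i = max(0, 0.5 * log2(var_i / theta))."""
    budget = avg_bits * len(variances)
    lo, hi = 1e-12, max(variances)
    for _ in range(200):  # bisection on the water level theta
        theta = math.sqrt(lo * hi)  # geometric midpoint: theta spans many decades
        bits = [max(0.0, 0.5 * math.log2(v / theta)) for v in variances]
        if sum(bits) > budget:
            lo = theta  # over budget: raise the water level
        else:
            hi = theta
    return bits


def snap_to_codebook(bits, allowed=ALLOWED_BITS):
    """Hypothetical helper: round each allocation to the nearest allowed width."""
    return [min(allowed, key=lambda width: abs(width - b)) for b in bits]


# Hypothetical spectrum: 4 signal dims with a steep decay, 12 near-flat noise dims.
variances = [100.0, 50.0, 25.0, 12.0] + [0.5] * 12
ideal = water_fill(variances, avg_bits=1.0)
snapped = snap_to_codebook(ideal)

print([round(b, 2) for b in ideal])
print(snapped, "total:", sum(snapped), "budget:", 1.0 * len(variances))
```

Rounding to discrete widths generally moves the total away from the budget, so a real implementation would need a repair pass (for example, greedily adding or removing bits where the distortion change is smallest). That step is left out of this sketch.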

3.6 Memory layout

Per token, per layer, per head:
+---------------------+------------------------+----------+
| Signal dims (d_eff) | Noise dims (remaining) | Metadata |
| mixed-precision     | low-bit quantized      | scales   |
+---------------------+------------------------+----------+

Memory savings (for head_dim=128, d_eff=4):

  • Uncompressed: 128 dims * 16 bits = 256 bytes
  • Signal: 4 dims * 8 bits = 4 bytes + 2 bytes scales
  • Noise: 124 dims * 2 bits = 31 bytes + 4 bytes scales
  • Total: ~41 bytes, or ~6.2x compression (close to the reported 5.95x with metadata overhead)
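
The same arithmetic as a few lines of Python, so the byte counts can be re-derived for other head sizes or bit widths. The helper name and the scale sizes (2 and 4 bytes) simply mirror the bullets above; they are not read from the repo.

```python
def bits_to_bytes(n_dims, bits_per_dim):
    """Packed size in bytes, rounded up to a whole byte."""
    return (n_dims * bits_per_dim + 7) // 8


head_dim, d_eff = 128, 4

uncompressed = bits_to_bytes(head_dim, 16)       # FP16 key vector
signal = bits_to_bytes(d_eff, 8) + 2             # 8-bit signal dims + 2 bytes of scales
noise = bits_to_bytes(head_dim - d_eff, 2) + 4   # 2-bit noise dims + 4 bytes of scales
compressed = signal + noise

print(uncompressed, signal, noise, compressed)   # 256 6 35 41
print(f"{uncompressed / compressed:.2f}x")       # 6.24x
```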

4. Code quality assessment

Research code, well structured

The repo is research code but well-organized (106 stars, MIT license):

  • 21 experiment scripts with cross-architecture testing (Qwen, Llama, Mistral, Gemma)
  • Statistical significance via 10-seed confidence intervals (seeds: 42, 123, 7, 2024, 31415, 99, 1337, 8675309, 271828, 314159)
  • 44 JSON result files across 15 subdirectories — raw data is reproducible
  • Lloyd-Max quantization with codebooks (not simple uniform quantization as initially assumed)
  • Pure PyTorch, no custom CUDA kernels
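
For readers unfamiliar with Lloyd-Max quantization, the following is a minimal 1-D sketch of the idea in plain Python: alternate between assigning samples to their nearest level and moving each level to the mean of its samples. It is not the repo's nonuniform_quantization.py; the Gaussian test data, the uniform starting grid, and the function names are assumptions made for illustration.

```python
import random


def mse(samples, levels):
    """Mean squared error when each sample maps to its nearest level."""
    return sum(min((x - c) ** 2 for c in levels) for x in samples) / len(samples)


def uniform_grid(samples, n_levels):
    """Evenly spaced levels over the data range (the uniform-quantizer baseline)."""
    lo, hi = min(samples), max(samples)
    return [lo + (hi - lo) * (i + 0.5) / n_levels for i in range(n_levels)]


def lloyd_max(samples, n_levels, iterations=50):
    """1-D Lloyd-Max: alternate nearest-level assignment and centroid update."""
    levels = uniform_grid(samples, n_levels)
    for _ in range(iterations):
        cells = [[] for _ in levels]
        for x in samples:
            nearest = min(range(n_levels), key=lambda i: (x - levels[i]) ** 2)
            cells[nearest].append(x)
        # Move each level to the mean of its cell; keep it if the cell is empty.
        levels = [sum(c) / len(c) if c else lvl for c, lvl in zip(cells, levels)]
    return levels


random.seed(0)
samples = [random.gauss(0.0, 1.0) for _ in range(2000)]

baseline = uniform_grid(samples, 4)
codebook = lloyd_max(samples, n_levels=4)  # 4 levels = 2 bits per value

print("uniform MSE:  ", round(mse(samples, baseline), 4))
print("Lloyd-Max MSE:", round(mse(samples, codebook), 4))
```

Each iteration can only lower (or keep) the empirical error, so the learned codebook is never worse than the uniform grid it starts from on the calibration samples.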

Integration gap to production (vLLM / SGLang / FlashInfer)

Estimated weeks to months of engineering work (speculative — depends on vLLM architecture changes needed):

  1. PagedAttention compatibility: vLLM’s paged blocks need mixed-precision packing within pages
  2. CUDA kernel for fused rotate-quantize: Must avoid materializing the full rotated tensor
  3. Attention kernel modification: FlashAttention expects uniform precision KV cache — mixed-precision requires a custom kernel
  4. Calibration pipeline: One-time step on model load (~15 seconds per the paper)

FlashInfer integration path: FlashInfer already supports INT4 KV cache. Extending to mixed-precision per-dimension would require modifying append_paged_kv_cache and batch_decode kernels.
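
As a rough picture of what mixed-precision packing means at the byte level, the sketch below packs one token's codes for one head: 4 signal dimensions at 8 bits followed by 124 noise dimensions at 2 bits. The record layout, function names, and code values are hypothetical; this is not the layout used by vLLM, FlashInfer, or the SpectralQuant repo, and a production version would live in a CUDA kernel.

```python
def pack_codes(codes, bits):
    """Pack unsigned integer codes of width `bits` (1, 2, 4 or 8) into bytes."""
    if bits not in (1, 2, 4, 8):
        raise ValueError("this sketch only handles widths that divide 8")
    per_byte = 8 // bits
    padded = list(codes) + [0] * (-len(codes) % per_byte)
    out = bytearray()
    for start in range(0, len(padded), per_byte):
        byte = 0
        for offset, code in enumerate(padded[start:start + per_byte]):
            byte |= (code & ((1 << bits) - 1)) << (offset * bits)
        out.append(byte)
    return bytes(out)


def unpack_codes(data, bits, count):
    """Inverse of pack_codes; `count` drops any padding codes."""
    mask = (1 << bits) - 1
    per_byte = 8 // bits
    codes = [(byte >> (offset * bits)) & mask for byte in data for offset in range(per_byte)]
    return codes[:count]


# One token, one head: 4 signal dims at 8 bits, 124 noise dims at 2 bits.
signal_codes = [200, 17, 255, 0]
noise_codes = [i % 4 for i in range(124)]

token_record = pack_codes(signal_codes, 8) + pack_codes(noise_codes, 2)
print(len(token_record))  # 4 + 31 = 35 bytes, before scales
```

Because d_eff varies per layer and head, the split point inside each record varies too, which is the part that a uniform-precision paged cache does not have to handle.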

5. Reproduction guide

Hardware requirements

  • Calibration: 1x GPU with enough VRAM for the model + ~10GB activation cache
    • Qwen 2.5-14B: ~28GB model + ~10GB activations = 38GB (1x A100-80GB)
    • Calibration time: ~15 seconds (per paper claim)
  • Evaluation: Same as calibration

Verified commands (from repo README)

git clone https://github.com/dynamis-labs/spectralquant.git
cd spectralquant
pip install -e ".[dev]"
 
# Quick test
PYTHONPATH=src python experiments/run_memory_efficiency.py --quick
 
# Full evaluation
PYTHONPATH=src python experiments/run_memory_efficiency.py
# Expected: cos_sim = 0.9485, ratio = 5.95x

Requirements: Python >= 3.10, PyTorch >= 2.2.0, CUDA GPU

Verified headline numbers (Qwen 2.5-14B)

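Method         | Compression | Cosine Sim | Latency (512 tokens)
---------------+-------------+------------+---------------------
FP16 baseline  | 1.0x        | 1.000      | -
TurboQuant     | 5.02x       | 0.9226     | 0.566 ms/step
SpectralQuant  | 5.95x       | 0.9485     | 0.257 ms/step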

The 2.2x latency improvement (0.257 vs 0.566 ms/step at 512 tokens) is the end-to-end number. The 4.5x attention-specific speedup cited in the paper refers to the attention computation alone (skipping QJL on 124 dims).

What could go wrong

  1. RoPE handling: Wrong treatment of RoPE-transformed keys during calibration invalidates the eigenbasis
  2. Calibration data mismatch: A different split or dataset can shift results by 0.1-0.3 perplexity
  3. GQA head count: Calibration must run on KV heads, not Q heads. Llama 3 8B, for example, has 8 KV heads against 32 Q heads
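
The KV-head point can be illustrated with a short PyTorch sketch that splits a flat key-projection output into one activation matrix per KV head. The shapes follow the Llama 3 8B example (8 KV heads, 32 Q heads, head_dim 128); the random tensor stands in for a captured k_proj output, and the variable names are assumptions.

```python
import torch

# Llama-3-8B-style attention shape: 32 query heads share 8 KV heads of width 128.
num_q_heads, num_kv_heads, head_dim = 32, 8, 128
batch, seq_len = 2, 16

# Stand-in for a captured k_proj output: heads are laid out contiguously on the last axis.
k_proj_out = torch.randn(batch, seq_len, num_kv_heads * head_dim)

# [batch, seq, kv_heads * head_dim] -> [kv_heads, batch * seq, head_dim]
per_head = (
    k_proj_out.view(batch, seq_len, num_kv_heads, head_dim)
    .permute(2, 0, 1, 3)
    .reshape(num_kv_heads, batch * seq_len, head_dim)
)
print(per_head.shape)  # torch.Size([8, 32, 128])

# One covariance (and hence one eigenbasis) per KV head, not per query head.
# torch.cov expects variables in rows, so each [N, head_dim] matrix is transposed.
covariances = [torch.cov(per_head[h].T) for h in range(num_kv_heads)]
print(f"{len(covariances)} bases needed, not {num_q_heads}")
```

Reshaping with num_q_heads instead would either fail outright or silently slice each KV head into pieces, which is the failure mode the list above warns about.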

6. Comparison with related work

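Feature          | QJL/TurboQuant        | KIVI              | GEAR               | SpectralQuant
-----------------+-----------------------+-------------------+--------------------+------------------------
Rotation         | Random Gaussian       | None              | None               | Learned eigenvectors
Bit allocation   | Uniform               | Uniform + outlier | Uniform + residual | Water-filling
Calibration      | None (data-oblivious) | None              | None               | 15 seconds (data-aware)
Error correction | QJL on all dims       | None              | Low-rank residual  | QJL on signal only
Compression      | ~5x                   | ~8x               | ~6x                | ~6x
Quality at 6x    | Good                  | Degraded          | Good               | Best (by cosine sim)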

SpectralQuant’s tradeoff: requires offline calibration (a new step in the deployment pipeline) but delivers the best quality at a given compression ratio.

7. Answered questions (from repo)

  1. GQA models: Yes — experiments include Llama, Mistral, Gemma (all GQA)
  2. Triton kernel: No — pure PyTorch implementation
  3. Value vectors: d_eff ≈ 40-55 for values (vs 4 for keys) — confirmed in calibration.py
  4. Calibration determinism: Default seed 42, with 10-seed robustness testing
  5. Latency overhead: 0.257 ms/step at 512 tokens (2.2x faster than TurboQuant)
  6. vLLM/SGLang plugin: No — standalone research code
  7. Memory layout: Handled in engine.py with nonuniform_quantization.py for codebook packing

Interesting reads

  • SpectralQuant repository (GitHub, MIT license) — the source code analyzed in this article
  • TurboQuant (Google Research) — the QJL baseline implementation
  • KIVI — per-channel KV quantization, earlier approach
  • vLLM / SGLang — the serving frameworks where SpectralQuant would integrate

Updated April 2026 with verified details from the SpectralQuant repository (MIT license, 106 stars).