Alan's PKB

Tag: inference

8 items with this tag.

  • May 03, 2026

    SpecDecode-1: A Speculative Decoding ASIC

    • hardware
    • inference
    • speculative-decoding
    • chip-design
    • ASIC
    • KV-cache
    • tree-attention
  • May 01, 2026

    The Case for an Agentic Inference Chip

    • hardware
    • inference
    • agentic-ai
    • chip-design
    • memory-architecture
    • speculative-decoding
    • ARIA
    • KV-cache
  • Apr 13, 2026

    Interactive Visualizations

    • tools
    • interactive
    • GPU
    • B200
    • memory
    • KV
    • roofline
    • inference
    • systolic
  • Apr 12, 2026

    Breaking Down Blackwell

    • nvidia
    • blackwell
    • b200
    • systolic-arrays
    • inference
    • gpu-architecture
    • peak-flops
    • power
    • stress-testing
    • decode
    • NVL72
    • project
    • interesting
  • Apr 11, 2026

    InferBench

    • inference
    • benchmarking
    • asic
    • architecture
    • research
    • workload
    • LLM
    • diffusion
    • MoE
    • vision
    • systolic
    • SIMT
    • In-Memory
    • dataflow
    • reconfigurable
    • worked
    • NVIDIA
    • groq
    • google
    • comparison
    • key
    • proposed
    • useful
    • cost
    • energy
    • flexibility
    • validation
    • calibration
    • publication
  • Apr 11, 2026

    Inference Optimization Stack

    • inference
    • optimization
    • quantization
    • cuda
    • blackwell
    • moe
    • kv-cache
    • synthesis
    • research
    • TrtLLMGen
    • thunder
    • baseline
  • Apr 11, 2026

    SpectralQuant KV Cache

    • kv-cache
    • quantization
    • inference
    • attention
    • compression
    • spectral-methods
    • transformer-internals
    • research
    • executive
    • participation
    • task
    • statistical
    • distribution
    • connection
    • KIVI
    • loki
    • KV-CoRE
    • RoPE
    • random
    • Rate-Distortion
    • memory
    • throughput
    • calibration
    • compatibility
    • tested
    • layer
    • Training-Time
    • dynamic
    • interaction
    • theoretical
  • Apr 11, 2026

    TrtLLMGen MoE Kernels

    • nvidia
    • tensorrt-llm
    • flashinfer
    • moe
    • cuda
    • blackwell
    • sm100
    • inference
    • open-source
    • mlperf
    • research
    • InferenceX
    • Short-Term
    • Medium-Term

© 2026