Knowing what a GPU can do in theory is very different from making it do that thing in practice. CUDA programming is where the abstraction meets the silicon — you’re managing warps, shared memory banks, register pressure, and occupancy, all while trying to keep thousands of cores fed with data. The gap between a naive kernel and an expert one is routinely 10-50x on the same hardware. That’s not an optimization opportunity; that’s the difference between a viable product and a science project.
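To make the link between register pressure and occupancy concrete, here is a rough back-of-the-envelope sketch. The per-SM limits (65,536 32-bit registers, 64 resident warps) are assumptions typical of recent NVIDIA datacenter parts, and the function name `register_limited_occupancy` is hypothetical. It deliberately ignores shared memory, block-size granularity, and register allocation rounding, so treat the numbers as an upper bound rather than what a profiler would report.

```python
# Back-of-the-envelope occupancy estimate that considers register usage only.
# The per-SM limits below are assumptions; check the specs for your device.
REGISTERS_PER_SM = 65536   # 32-bit registers available per SM (assumed)
MAX_WARPS_PER_SM = 64      # resident warp limit per SM (assumed)
WARP_SIZE = 32             # threads per warp


def register_limited_occupancy(regs_per_thread: int) -> float:
    """Fraction of an SM's warp slots that can be resident, registers only."""
    regs_per_warp = regs_per_thread * WARP_SIZE
    # How many whole warps fit in the register file.
    warps_that_fit = REGISTERS_PER_SM // regs_per_warp
    # The hardware warp limit caps residency even if registers are plentiful.
    resident_warps = min(warps_that_fit, MAX_WARPS_PER_SM)
    return resident_warps / MAX_WARPS_PER_SM


if __name__ == "__main__":
    for regs in (32, 64, 128, 255):
        occ = register_limited_occupancy(regs)
        print(f"{regs:3d} regs/thread -> {occ:.1%} of warp slots")
```

Under these assumptions, doubling the registers per thread roughly halves the number of warps that can be resident. Lower occupancy is not automatically worse: kernels that keep large tiles in registers often accept it on purpose in exchange for more data reuse per thread.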
At the infrastructure level, the challenges shift from single-chip efficiency to orchestration at scale. How do you serve models across hundreds of GPUs with NVLink and InfiniBand without the interconnect becoming the bottleneck? How does Apple run Private Cloud Compute for on-device intelligence, routing requests between Apple Silicon inference nodes while maintaining the privacy guarantees that make the whole thing possible? These are systems problems in the classic sense: scheduling, networking, fault tolerance, and the unglamorous plumbing that determines whether your expensive hardware actually delivers tokens to users.
Edge compute is the other end of the spectrum. Robotics accelerators and on-device inference chips operate under hard power and latency constraints that datacenter hardware never sees. When your power budget is 5 watts instead of 1000, the design tradeoffs are fundamentally different — and the systems work to make those chips useful is where a lot of the real innovation happens.
- Thunder Kittens CUDA — register-tiled kernel design on H100
- Apple Intelligence Infrastructure — PCC, on-device vs cloud, Apple Silicon for inference
- Robotics Edge Accelerators — edge compute for robotics