HBM Is Not All You Need: Efficient Disaggregated LLM Serving across Memory-heterogeneous Accelerators

Zhixiang Wei, Yun Wang, James Yen, Mingyuan Xia 2026-06-30

LLM inference has two phases with opposite bottlenecks: prefill is compute-bound and leaves HBM bandwidth idle, while decode is memory-bound. Serving both phases on costly HBM-based GPUs is therefore inefficient. HMA-Serve disaggregates serving across memory-heterogeneous accelerators (MemHA), pairing GDDR-based accelerators for prefill with HBM-based GPUs for decode, and relies on three techniques: phase-wise quantization, a compute-transfer pipeline, and deferred dequantization. Across four Qwen3 models and three production traces, HMA-Serve achieves up to 3.2× higher goodput than state-of-the-art memory-homogeneous methods and 4.8× higher goodput-per-dollar, with no measurable loss on generation-quality benchmarks. The approach enables cost-effective, cross-vendor LLM serving that uses heterogeneous memory technologies efficiently, and it breaks single-vendor assumptions about KV format and software stack.
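The compute-bound versus memory-bound distinction can be illustrated with a rough roofline estimate. The sketch below is not taken from HMA-Serve: it models a single square FP16 linear layer, and the accelerator figures (`HBM_GPU`, `GDDR_ACC`) are hypothetical placeholders chosen only to show the shape of the argument, not measurements of any real device.

```python
# Roofline-style sketch: why prefill tends to be compute-bound and decode memory-bound.
# All hardware numbers below are hypothetical placeholders, not vendor specifications.

def arithmetic_intensity(n_tokens: int, d_model: int, bytes_per_elem: int = 2) -> float:
    """FLOPs per byte moved for one d_model x d_model linear layer over n_tokens."""
    flops = 2 * n_tokens * d_model * d_model              # one multiply-add = 2 FLOPs
    weight_bytes = bytes_per_elem * d_model * d_model      # weights are read once
    act_bytes = bytes_per_elem * 2 * n_tokens * d_model    # input read + output write
    return flops / (weight_bytes + act_bytes)


def ridge_point(peak_flops: float, mem_bandwidth: float) -> float:
    """Intensity above which a device is limited by compute rather than memory."""
    return peak_flops / mem_bandwidth


def classify(n_tokens: int, d_model: int, peak_flops: float, mem_bandwidth: float) -> str:
    """Label a workload as compute-bound or memory-bound on a given device."""
    intensity = arithmetic_intensity(n_tokens, d_model)
    if intensity >= ridge_point(peak_flops, mem_bandwidth):
        return "compute-bound"
    return "memory-bound"


# Hypothetical devices: (peak FLOP/s, memory bandwidth in bytes/s).
HBM_GPU = (400e12, 3.0e12)    # assumed HBM-class bandwidth
GDDR_ACC = (400e12, 0.9e12)   # assumed GDDR-class bandwidth

if __name__ == "__main__":
    d_model = 4096
    for name, device in (("HBM GPU", HBM_GPU), ("GDDR accelerator", GDDR_ACC)):
        prefill = classify(2048, d_model, *device)  # many prompt tokens per pass
        decode = classify(1, d_model, *device)      # one new token per pass
        print(f"{name}: prefill is {prefill}, decode is {decode}")
```

Under these assumptions, a 2048-token prefill pass sits well above the ridge point of both devices, so extra memory bandwidth buys little there, while a single-token decode pass sits far below it on both, so bandwidth dominates. The model ignores attention, KV-cache traffic, and batching, all of which shift the exact numbers.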
