GPUSparse: GPU-Accelerated Learned Sparse Retrieval with Parallel Inverted Indices
GPUSparse addresses the CPU bottleneck in learned sparse retrieval with a GPU-accelerated inverted index that scores queries in parallel. The system combines block-aligned posting lists, batched scatter-add algorithms, and fused Triton kernels to process hundreds of queries simultaneously. On MS MARCO passage ranking, GPUSparse reproduces exact CPU scoring (MRR@10 = 0.383) while achieving a 235x speedup over Pyserini and a throughput of 787 queries per second (QPS). By contrast, Seismic sacrifices 25% recall for speed. GPUSparse thus enables real-time, exact sparse retrieval at scale, and it reveals a fundamental tradeoff between work efficiency and bandwidth efficiency in GPU-based search systems.
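To make the idea of batched scatter-add scoring over an inverted index concrete, here is a minimal CPU sketch in NumPy. It is not the GPUSparse implementation: the function names (`build_inverted_index`, `score_batch`), the CSR-style layout, and the dictionary-based input format are assumptions made for illustration, and block alignment and Triton kernel fusion are omitted entirely. The sketch only shows how every (query, document) contribution from the matching posting lists can be accumulated into one score matrix with a single scatter-add.

```python
import numpy as np


def build_inverted_index(doc_vectors, vocab_size):
    """Build a CSR-style inverted index (hypothetical layout for illustration).

    doc_vectors: list of {term_id: weight} dicts, one per document.
    Returns (offsets, doc_ids, weights); the posting list of term t is the
    slice offsets[t]:offsets[t + 1] of doc_ids and weights.
    """
    postings = [[] for _ in range(vocab_size)]
    for doc_id, vec in enumerate(doc_vectors):
        for term_id, weight in vec.items():
            postings[term_id].append((doc_id, weight))

    lengths = np.array([len(p) for p in postings], dtype=np.int64)
    offsets = np.zeros(vocab_size + 1, dtype=np.int64)
    offsets[1:] = np.cumsum(lengths)

    doc_ids = np.array([d for p in postings for d, _ in p], dtype=np.int64)
    weights = np.array([w for p in postings for _, w in p], dtype=np.float32)
    return offsets, doc_ids, weights


def score_batch(queries, offsets, doc_ids, weights, num_docs):
    """Score a batch of sparse queries with exact dot products.

    queries: list of {term_id: weight} dicts.
    Returns a dense (num_queries, num_docs) float32 score matrix.
    """
    scores = np.zeros((len(queries), num_docs), dtype=np.float32)
    rows, cols, contribs = [], [], []

    # Gather one (query row, doc column, partial score) triple per posting hit.
    for q_idx, query in enumerate(queries):
        for term_id, q_weight in query.items():
            start, end = offsets[term_id], offsets[term_id + 1]
            if start == end:
                continue  # term appears in no document
            rows.append(np.full(end - start, q_idx, dtype=np.int64))
            cols.append(doc_ids[start:end])
            contribs.append(np.float32(q_weight) * weights[start:end])

    if not rows:
        return scores

    # One scatter-add for the whole batch. np.add.at accumulates correctly
    # when the same (row, col) pair occurs more than once.
    np.add.at(
        scores,
        (np.concatenate(rows), np.concatenate(cols)),
        np.concatenate(contribs),
    )
    return scores


if __name__ == "__main__":
    docs = [{0: 1.0, 2: 0.5}, {1: 2.0}, {0: 0.5, 1: 1.0}]
    queries = [{0: 2.0}, {1: 1.0, 2: 2.0}]
    offsets, doc_ids, weights = build_inverted_index(docs, vocab_size=3)
    print(score_batch(queries, offsets, doc_ids, weights, num_docs=len(docs)))
```

Because scoring visits every posting of every query term and prunes nothing, the result equals the dense query-document dot product. On a GPU, the accumulation step would typically map to atomic additions or a scatter-add primitive rather than `np.add.at`.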