Daily | Yixun Hong


EGG: An Expert-Guided Agent Framework for Kernel Generation

Yaochen Han, Ke Fan, Hongxu Jiang, Wanqi Xu 2026-06-28

Language Model LLM GPU Hardware Tensor Agent

EGG targets automated generation of high-performance GPU kernels for LLM workloads, a task that currently depends on manual expert tuning. The method decomposes kernel generation into two hierarchical stages (algorithmic structure design and hardware-specific tuning), guided by expert optimization principles and a stage-aware multi-agent collaboration mechanism. On KernelBench and real-world workloads, it reports a 2.13x average speedup over PyTorch and outperforms existing agent-based and RL-based approaches. This matters because it reduces reliance on manual optimization, making kernel generation more scalable and efficient as the computational cost of LLMs grows.


RolloutPipe: Overlapping Pipelined Rollout and Training in Disaggregated On-Policy LLM Reinforcement Learning

Rongjian Chen, Jianmin Hu, Kejiang Ye, Minxian Xu 2026-06-28

Language Model LLM Training GPU

Problem: Existing synchronous on-policy GRPO RLVR systems leave trainer GPUs idle during rollout, while asynchronous systems train on stale data. Method: RolloutPipe introduces complete-group pipelining (CGP) and frontier-group dispatch (FGD) to overlap rollout and training in disaggregated architectures while maintaining on-policy correctness. Finding: Evaluated on Qwen3-1.7B across four benchmarks and twelve rollout settings, RolloutPipe reduces rollout-to-train-end time by 30.7% to 42.3% and lowers the trainer waiting ratio by 37% to 76% relative to Slime. Why it matters: This enables efficient on-policy LLM reinforcement-learning post-training without idle GPUs or stale data, which is critical for scaling reasoning tasks.
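To make the overlap idea concrete, here is a toy timing model in Python. It is only a sketch under stated assumptions, not the paper's CGP or FGD mechanism, whose details this summary does not give. It assumes that each rollout group becomes ready at a known time, that the trainer spends a fixed cost per group, and that a group can be trained on as soon as it is complete. The function names and the example numbers are hypothetical. The last helper converts a reported fractional time reduction into the equivalent speedup factor.

```python
def sync_end_time(finish_times, train_cost):
    """Synchronous baseline: training starts only after the slowest group finishes."""
    return max(finish_times) + train_cost * len(finish_times)


def pipelined_end_time(finish_times, train_cost):
    """Overlapped schedule: the trainer consumes each complete group once it is ready.

    Assumption: one trainer processes groups one at a time, in order of readiness.
    """
    t = 0.0
    for ready in sorted(finish_times):
        # Wait for the group if the trainer is ahead, then pay the training cost.
        t = max(t, ready) + train_cost
    return t


def reduction_to_speedup(reduction):
    """Convert a fractional time reduction (e.g. 0.307) into a speedup factor."""
    return 1.0 / (1.0 - reduction)


if __name__ == "__main__":
    # Hypothetical numbers: four groups ready at t = 2, 4, 6, 8; training costs 1 per group.
    ready_at = [2, 4, 6, 8]
    sync = sync_end_time(ready_at, 1)            # 8 + 4 * 1 = 12
    piped = pipelined_end_time(ready_at, 1)      # 3, 5, 7, 9 -> 9
    print(f"sync={sync}, pipelined={piped}, reduction={1 - piped / sync:.1%}")

    # The reported 30.7% and 42.3% reductions expressed as speedup factors.
    for r in (0.307, 0.423):
        print(f"{r:.1%} less time is about {reduction_to_speedup(r):.2f}x faster")
```

In this toy setting the trainer hides most of its work behind the slower rollouts, so the end time is set by the last group to finish plus one training step. Under the same conversion, the reported 30.7% to 42.3% reductions correspond to roughly 1.44x to 1.73x shorter rollout-to-train-end time.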
