RolloutPipe: Overlapping Pipelined Rollout and Training in Disaggregated On-Policy LLM Reinforcement Learning
Problem: Existing synchronous on-policy GRPO RLVR systems leave trainer GPUs idle during rollout, while asynchronous systems train on stale data.
Method: RolloutPipe introduces complete-group pipelining (CGP) and frontier-group dispatch (FGD) to overlap rollout and training in disaggregated architectures while preserving on-policy correctness.
Finding: Evaluated on Qwen3-1.7B across four benchmarks and twelve rollout settings, RolloutPipe reduces rollout-to-train-end time by 30.7% to 42.3% and lowers the trainer waiting ratio by 37% to 76% relative to Slime.
Why it matters: On-policy RL post-training of LLMs can run with substantially less trainer idle time and without stale data, which matters when scaling RL for reasoning tasks.
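As background for the "complete-group" terminology: GRPO normalizes each reward against the other rollouts sampled for the same prompt, so a group is only trainable once all of its rollouts have finished. The sketch below illustrates that constraint only and is not RolloutPipe's implementation; `CompleteGroupBuffer`, `group_advantages`, the `eps` term, and the use of the population standard deviation are all assumptions made for illustration.

```python
from collections import defaultdict
from statistics import mean, pstdev


def group_advantages(rewards, eps=1e-6):
    """GRPO-style group-relative advantages: (r - mean) / (std + eps).

    Assumption: population standard deviation and a small eps for stability.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


class CompleteGroupBuffer:
    """Hypothetical buffer that holds rollouts until a prompt's group is complete."""

    def __init__(self, group_size):
        self.group_size = group_size
        self._pending = defaultdict(list)  # prompt_id -> rewards received so far

    def add(self, prompt_id, reward):
        """Record one finished rollout.

        Returns the group's advantages once all rollouts have arrived,
        otherwise None (the group is not yet trainable).
        """
        group = self._pending[prompt_id]
        group.append(reward)
        if len(group) < self.group_size:
            return None
        del self._pending[prompt_id]
        return group_advantages(group)
```

In this toy model, a trainer could begin consuming a group as soon as `add` returns a list, rather than waiting for the whole rollout batch to finish, which is the kind of overlap the Method line describes at a high level.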