ReSET: Accurate Latency-Critical NVFP4 Reasoning via Step-Aware Temperature Scaling

Sihwa Lee, Janghwan Lee, Donghoon Yoo, Jae Gon Kim 2026-06-14

Problem: Large reasoning models (LRMs) are expensive to serve because of their long reasoning traces. Applying NVFP4 low-precision quantization directly degrades reasoning accuracy, and existing NVFP4 kernels fail to deliver latency benefits in small-batch autoregressive decoding. Method: ReSET combines step-aware temperature scaling, which estimates step-level uncertainty online from both token-level and step-level entropy signals, with a CUDA-core small-M NVFP4 kernel for latency-critical decoding. Finding: ReSET improves NVFP4 reasoning accuracy by up to about 2 points over the NVFP4 baseline. The custom kernel achieves up to 2.5× kernel-level speedup over NVFP4 in vLLM and approximately 2× end-to-end decoding speedup over BF16. Why it matters: The work enables accurate and efficient low-precision inference for latency-critical LRM deployments, reducing compute and memory costs without sacrificing reasoning quality.
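
The summary does not give ReSET's actual scaling rule, so the following is only a minimal sketch of the general idea of entropy-driven, step-level temperature adjustment. Every name in it (`token_entropy`, `step_entropy`, `step_aware_temperature`, `base_temp`, `min_temp`) is hypothetical. The linear mapping from entropy to temperature, and the choice to lower the temperature as uncertainty rises, are assumptions made for illustration and are not taken from the paper.

```python
import math


def softmax(logits, temperature=1.0):
    """Temperature-scaled softmax over a list of logits."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp((x - m) / temperature) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def token_entropy(probs):
    """Shannon entropy (in nats) of one next-token distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def step_entropy(token_entropies):
    """Step-level signal: mean token entropy over one reasoning step."""
    return sum(token_entropies) / len(token_entropies)


def step_aware_temperature(token_entropies, vocab_size,
                           base_temp=1.0, min_temp=0.5):
    """Hypothetical rule (an assumption, not the paper's): interpolate
    linearly from base_temp down to min_temp as the normalized step
    entropy goes from 0 to 1."""
    max_entropy = math.log(vocab_size)  # entropy of a uniform distribution
    u = step_entropy(token_entropies) / max_entropy
    u = min(1.0, max(0.0, u))  # clamp to [0, 1]
    return base_temp - (base_temp - min_temp) * u


# Usage: entropies collected online for the tokens of the current step,
# then used to pick the sampling temperature for the next step.
entropies = [token_entropy(softmax([2.0, 1.0, 0.0, -1.0])) for _ in range(3)]
temperature = step_aware_temperature(entropies, vocab_size=4)
next_probs = softmax([2.0, 1.0, 0.0, -1.0], temperature)
```

Under this assumed rule, a confident step leaves the temperature near `base_temp`, while a high-entropy step pulls it toward `min_temp` and sharpens the next sampling distribution.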

GF-DiT: Scheduling Parallelism for Diffusion Transformer Serving

Xinwei Qiang, Yifan Hu, Shixuan Sun, Jing Yang 2026-06-14

Existing Diffusion Transformer (DiT) serving systems assign static parallelism to each request, which is inefficient given the heterogeneity across requests, execution stages, and system conditions. GF-DiT introduces a policy-programmable runtime that adapts parallelism dynamically, using an asynchronous execution abstraction and group-free collectives for low-overhead online GPU reallocation. Evaluated in vLLM-Omni, GF-DiT improves throughput by up to 6.01×, reduces mean latency by up to 95%, and lowers SLO violation rates by up to 90% compared with fixed-pipeline execution. By treating GPU parallelism as a schedulable resource, it enables efficient, elastic DiT serving with better performance and service quality for image and video generation workloads.
