Publications

(2026). AdaServe: SLO-Customized LLM Serving with Fine-Grained Speculative Decoding. EuroSys 2026.

(2026). FlexLLM: Token-Level Co-Serving of LLM Inference and Finetuning with SLO Guarantees. NSDI 2026.

(2025). SuffixDecoding: A Model-Free Approach to Speeding Up Large Language Model Inference. NeurIPS 2025 (Spotlight 🏆).

(2025). SpecReason: Fast and Accurate Inference-Time Compute via Speculative Reasoning. NeurIPS 2025.

(2025). OWL: Overcoming Window Length-Dependence in Speculative Decoding for Long-Context Inputs. arXiv 2025.

(2024). Quantized Side Tuning: Fast and Memory-Efficient Tuning of Quantized Large Language Models. ACL 2024 Oral (Outstanding Paper Award 🏆).

(2024). SpecInfer: Accelerating Generative Large Language Model Serving with Tree-based Speculative Inference and Verification. ASPLOS 2024 (Cited 350+ times 🏆).

(2024). Optimal Kernel Orchestration for Tensor Programs with Korch. ASPLOS 2024.

(2023). Towards Efficient Generative Large Language Model Serving: A Survey from Algorithms to Systems. ACM Computing Surveys.

(2023). Direct Telemetry Access. SIGCOMM 2023.

(2021). Zero-CPU Collection with Direct Telemetry Access. HotNets 2021.
