Recurrent Drafter for Fast Speculative Decoding in Large Language Models

We present Recurrent Drafter (ReDrafter), an advanced speculative decoding approach that achieves state-of-the-art speedup for large language model (LLM) inference. The performance gains are driven by three key aspects: (1) leveraging a recurrent neural network (RNN) as the draft model, conditioned on the LLM's hidden states, (2) applying a dynamic tree attention algorithm over beam search results to eliminate duplicated prefixes in candidate sequences, and (3) training through knowledge distillation from the LLM. ReDrafter accelerates Vicuna inference on MT-Bench by up to 3.5x with a PyTorch implementation on Nvidia H100 GPUs. To demonstrate its practicality in production environments, we integrate ReDrafter into TensorRT-LLM, reaching up to 2.5x speedup on H100 GPUs. We also validate its effectiveness for on-device applications by implementing the approach in MLX and benchmarking performance on Metal GPUs in Apple Silicon chips, achieving up to 2.3x speedup.
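The first component can be illustrated with a minimal PyTorch-style sketch of an RNN draft head whose recurrent state is initialized from the target LLM's last hidden state. This is a hedged illustration under stated assumptions; the class and parameter names (DraftRNN, hidden_dim, steps) are hypothetical and not the paper's actual implementation, which additionally uses beam search rather than greedy drafting.

```python
# Illustrative sketch (assumption): an RNN draft head conditioned on the
# target LLM's last hidden state, in the spirit of ReDrafter's draft model.
# Names and the greedy drafting loop are simplifications, not the paper's API.
import torch
import torch.nn as nn


class DraftRNN(nn.Module):
    def __init__(self, hidden_dim: int, vocab_size: int):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, hidden_dim)
        self.rnn_cell = nn.GRUCell(hidden_dim, hidden_dim)
        self.lm_head = nn.Linear(hidden_dim, vocab_size)

    @torch.no_grad()
    def draft(self, llm_hidden: torch.Tensor, last_token: torch.Tensor, steps: int = 4):
        """Propose `steps` draft tokens greedily, conditioned on the LLM's
        last hidden state and the most recently accepted token."""
        state = llm_hidden                 # recurrent state seeded from LLM hidden state
        token = last_token
        draft = []
        for _ in range(steps):
            state = self.rnn_cell(self.embed(token), state)
            token = self.lm_head(state).argmax(dim=-1)
            draft.append(token)
        return torch.stack(draft, dim=-1)  # (batch, steps) candidate draft tokens
```

In the full method, the draft head proposes multiple candidate sequences via beam search, and dynamic tree attention collapses their shared prefixes so the LLM verifies each unique prefix only once before accepting or rejecting the drafted tokens.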


