
Google DeepMind Introduces Tandem Transformers for Inference-Efficient Large Language Models (LLMs)


Large language models (LLMs) continue to face major computational cost barriers that prevent their broad deployment, even as inference optimization techniques have advanced significantly. A major source of high inference latency is the autoregressive generation process, which produces tokens one at a time. Because ML accelerators (GPUs/TPUs) are optimized for matrix-matrix multiplications rather than the matrix-vector operations that dominate autoregressive decoding, this step leaves much of their compute idle. As a result, autoregressive response generation is far less efficient than prompt processing (prefill), which handles all tokens concurrently.
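To make that bottleneck concrete, here is a minimal NumPy sketch (with made-up, illustrative dimensions) of why prefill keeps an accelerator busy while token-by-token decoding does not:

```python
import numpy as np

d_model = 4096          # hidden size (illustrative)
prompt_len = 512        # tokens in the prompt (illustrative)
W = np.random.randn(d_model, d_model).astype(np.float32)  # one weight matrix

# Prefill: all prompt tokens are processed at once -> matrix-matrix multiply,
# which maps well onto GPU/TPU matrix units.
prompt_states = np.random.randn(prompt_len, d_model).astype(np.float32)
prefill_out = prompt_states @ W          # (512, 4096) @ (4096, 4096)

# Autoregressive decode: one new token per step -> matrix-vector multiply,
# which is memory-bound and leaves most of the accelerator's compute idle.
token_state = np.random.randn(1, d_model).astype(np.float32)
decode_out = token_state @ W             # (1, 4096) @ (4096, 4096)
```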

However, the relative importance of the ability to understand the query or prefill (natural language understanding, or NLU) versus the ability to generate a response (natural language generation, or NLG) remains unclear. Modern decoder-only LLM designs couple these two capabilities in a single model.

A new study by Google Research and DeepMind takes an efficiency-oriented look at this basic question. The study presents Tandem Transformers, a new architecture that allocates a far larger share of the model's capacity to NLU (prefill processing) than to NLG (response generation).

In Tandem, a small secondary model generates the response while attending to representations produced by a large primary model, and a projection layer aligns the primary model's (potentially higher-dimensional) representation space with the secondary model's. Experiments with Tandem (PaLM2-Bison, PaLM2-Gecko) show that the capacity required for the NLU and NLG parts of an LLM can be decoupled, yielding a more efficient design with no noticeable drop in accuracy (by model size, PaLM2-Gecko < PaLM2-Otter < PaLM2-Bison). To maintain high accuracy, Tandem's primary model refreshes all prefill representations, in contrast to an encoder-decoder architecture that would process the query/prefix once through an encoder and then generate the entire response through a decoder.
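As a rough illustration of that idea, the sketch below uses hypothetical names and illustrative dimensions (not the paper's exact layer placement or projection direction) to show how a projection can let the small model's attention consume the large model's prefill representations:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

d_large, d_small = 4096, 1024   # illustrative widths for the primary/secondary models

# Hypothetical projection that maps the large model's prefill representations
# into the small model's lower-dimensional space.
W_proj = np.random.randn(d_large, d_small).astype(np.float32) * 0.02

def attend_to_prefill(query_small, prefill_large):
    """query_small: (1, d_small) decode-time state of the small model.
    prefill_large: (prompt_len, d_large) representations from the large model."""
    keys = values = prefill_large @ W_proj                     # (prompt_len, d_small)
    scores = softmax(query_small @ keys.T / np.sqrt(d_small))  # (1, prompt_len)
    return scores @ values                                     # (1, d_small)
```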

For applications that require output indistinguishable from the large model, the authors recommend Tandem + SPEED. In the speculative decoding (SPEED) framework, the Tandem secondary model drafts tokens, which the large model then verifies. Because the secondary model can attend to the large model's representations, it produces higher-quality drafts and incurs lower verification overhead than a conventional SPEED drafter.
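A minimal greedy sketch of that draft-then-verify loop is below; the function names, block size, and greedy acceptance rule are illustrative placeholders rather than the paper's exact SPEED implementation:

```python
def speculative_decode(prompt_ids, draft_next_id, target_next_ids,
                       num_draft=5, max_new=64):
    """Greedy sketch of SPEED-style drafting and verification.
    draft_next_id(ids)   -> next-token id proposed by the small (secondary) model.
    target_next_ids(ids) -> for every position, the token the large (primary)
                            model would emit next.
    Both callables are placeholders for real model calls."""
    out = list(prompt_ids)
    while len(out) - len(prompt_ids) < max_new:
        # 1) The small model drafts a block of tokens autoregressively (cheap).
        draft = []
        for _ in range(num_draft):
            draft.append(draft_next_id(out + draft))
        # 2) The large model scores prompt + draft in one parallel pass
        #    (matrix-matrix friendly) instead of token by token.
        target = target_next_ids(out + draft)[-(num_draft + 1):]
        # 3) Keep the longest draft prefix the large model agrees with, then
        #    take the large model's own token at the first mismatch.
        n_accept = 0
        while n_accept < num_draft and draft[n_accept] == target[n_accept]:
            n_accept += 1
        out.extend(draft[:n_accept])
        out.append(target[n_accept])
    return out
```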

Since Tandem is a standalone model, it can produce respectable results on its own, without requiring verification by the large model. Within SPEED, Tandem can also leverage the large model's representations while autoregressively generating draft tokens, giving the drafter a much better trade-off between token quality and latency. Prior work has shown that logit distillation helps when training SPEED draft models, and the Tandem approach is complementary to, and works well with, distillation.

Finally, the researchers report extensive latency measurements on TPUv5e for both the standalone and SPEED configurations of Tandem (PaLM2-Bison, PaLM2-Gecko), where PaLM2-Bison is the primary large model and PaLM2-Gecko the secondary small model. Tandem + SPEED with distillation outperforms the baseline PaLM2-Bison model by a factor of at least 2.19× on several datasets while maintaining the same output quality. It is also 1.11× to 1.17× faster than standard SPEED using the small model as the drafter. Using an adaptive block length in SPEED reduces Tandem's latency by a further 1.04× to 1.09× across datasets.
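For reference, here is a minimal sketch of the logit-distillation objective mentioned above, written in illustrative NumPy rather than the paper's exact training recipe:

```python
import numpy as np

def log_softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    return x - np.log(np.exp(x).sum(axis=axis, keepdims=True))

def logit_distillation_loss(student_logits, teacher_logits):
    """Token-level KL(teacher || student), the usual logit-distillation loss.
    Shapes: (seq_len, vocab_size); illustrative only."""
    log_p_teacher = log_softmax(teacher_logits)
    log_p_student = log_softmax(student_logits)
    p_teacher = np.exp(log_p_teacher)
    return (p_teacher * (log_p_teacher - log_p_student)).sum(axis=-1).mean()
```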


Check out the Paper. All credit for this research goes to the researchers of this project.



Dhanshree Shenwai is a computer science engineer with solid experience at FinTech companies across the financial, cards & payments, and banking domains, and a keen interest in applications of AI. She is enthusiastic about exploring new technologies and advancements that make everyone's life easier in today's evolving world.




