KV Prediction for Improved Time to First Token

Inference with transformer-based language models begins with a prompt processing step. In this step, the model generates the first output token and stores the KV cache needed for future generation steps. Prompt processing can be computationally expensive, taking tens of seconds or more for billion-parameter models on edge devices as prompt lengths or batch sizes grow, which degrades user experience by introducing significant latency before the model's first output. To reduce the time spent producing the first output token (known as the "time to first token," or TTFT) of a pretrained model, we introduce a novel method called KV Prediction. In our method, a small auxiliary model processes the prompt and produces an approximation of the KV cache used by a base model. This approximated KV cache is then used with the base model for autoregressive generation, without the need to query the auxiliary model again. We demonstrate that our method produces a Pareto-optimal efficiency-accuracy trade-off when compared to baselines. On TriviaQA, we demonstrate relative accuracy improvements in the range of 15%-50% across a range of TTFT FLOPs budgets. We also demonstrate accuracy improvements of up to 30% on HumanEval Python code completion at fixed TTFT FLOPs budgets. Additionally, we benchmark models on an Apple M2 Pro CPU and demonstrate that our improvement in FLOPs translates to a TTFT speedup on hardware. We release our code here.
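The abstract describes the mechanism only at a high level. The following is a minimal illustrative sketch in PyTorch, not the released implementation: it assumes the auxiliary model's per-layer KV cache is mapped to the base model's KV cache with simple learned linear projections (the paper's actual predictor may differ), and the `prompt_forward` and `step` model methods are hypothetical APIs introduced here for illustration.

```python
import torch
import torch.nn as nn


class KVPredictor(nn.Module):
    """Maps the auxiliary model's per-layer (K, V) tensors to the base model's shapes."""

    def __init__(self, aux_dim: int, base_dim: int, num_base_layers: int):
        super().__init__()
        # Assumption: one learned linear projection pair per base-model layer.
        self.k_proj = nn.ModuleList([nn.Linear(aux_dim, base_dim) for _ in range(num_base_layers)])
        self.v_proj = nn.ModuleList([nn.Linear(aux_dim, base_dim) for _ in range(num_base_layers)])

    def forward(self, aux_kv):
        # aux_kv: list of (k, v) pairs, each of shape [batch, prompt_len, aux_dim],
        # one entry per base-model layer (auxiliary layers may be shared across them).
        return [(self.k_proj[i](k), self.v_proj[i](v)) for i, (k, v) in enumerate(aux_kv)]


@torch.no_grad()
def generate_with_predicted_kv(base_model, aux_model, predictor, prompt_ids, max_new_tokens):
    # 1) Cheap prompt processing: run the small auxiliary model once over the
    #    prompt (here, all but the last token; this convention is an assumption).
    aux_kv = aux_model.prompt_forward(prompt_ids[:, :-1])  # hypothetical API
    # 2) Predict the base model's KV cache from the auxiliary cache.
    base_kv = predictor(aux_kv)
    # 3) Autoregressive generation with the base model, seeded with the
    #    approximated cache. The base model never attends over the raw prompt
    #    itself, which is where the TTFT savings come from.
    token = prompt_ids[:, -1:]
    outputs = []
    for _ in range(max_new_tokens):
        logits, base_kv = base_model.step(token, past_kv=base_kv)  # hypothetical API
        token = logits[:, -1].argmax(dim=-1, keepdim=True)  # greedy decoding
        outputs.append(token)
    return torch.cat(outputs, dim=1)
```

The appeal of this design is that the auxiliary model's prompt pass plus the projections cost far fewer FLOPs than the base model's own prompt pass, so the first token arrives sooner; downstream accuracy then depends on how faithfully the predicted cache approximates the base model's true KV cache.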


Source link
