
Stanford University Researchers Introduce FlashFFTConv: A New Artificial Intelligence System for Optimizing FFT Convolutions for Long Sequences


Reasoning efficiently over long sequences is a major challenge in machine learning. Recently, convolutions have emerged as a critical primitive for sequence modeling, supporting state-of-the-art performance in language modeling, time-series analysis, computer vision, DNA modeling, and more. Despite these strong quality results and additional advantages, such as improved stability and better scaling as the sequence length increases, convolutional sequence models are still significantly slower than Transformers.

One main cause is poor hardware support. Convolutions for sequence modeling frequently employ filters as long as the input sequence, in contrast to the short filters used in classical convolutions for visual applications. The Fast Fourier Transform (FFT) convolution algorithm computes the convolution between an input u and a convolution kernel k by mapping both into the frequency domain, multiplying them pointwise, and mapping the result back, which takes O(N log N) rather than O(N²) time.
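As a rough illustration (not the paper's implementation), a causal long convolution between an input u and a kernel k of the same length can be computed this way by zero-padding both to twice the length, multiplying their spectra, and transforming back:

```python
import numpy as np

def fft_conv(u, k):
    """Causal (linear) convolution of u and k via the FFT, in O(N log N).

    Both sequences have length N; zero-padding to 2N avoids the wrap-around
    of a circular convolution, and only the first N outputs are kept.
    """
    N = u.shape[-1]
    u_f = np.fft.fft(u, n=2 * N)         # FFT of the (zero-padded) input
    k_f = np.fft.fft(k, n=2 * N)         # FFT of the (zero-padded) kernel
    y = np.fft.ifft(u_f * k_f, n=2 * N)  # pointwise multiply, then inverse FFT
    return y.real[..., :N]               # truncate back to the input length

u = np.random.randn(1024)
k = np.random.randn(1024)
assert np.allclose(fft_conv(u, k), np.convolve(u, k)[:1024])
```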

Despite being asymptotically efficient, the FFT convolution algorithm achieves poor wall-clock performance on contemporary accelerators. Meanwhile, systems advances have allowed Transformers to approach the limits of current accelerators, with end-to-end FLOP utilization of over 72% when using FlashAttention-v2.

To offer longer-context capabilities, new research from Stanford University investigates how to optimize the FFT convolution method on contemporary accelerators. The researchers argue that, just as systems advances like FlashAttention led to better models and new attention algorithms, optimizing the FFT convolution will lead to new and better algorithms, boosting the quality of convolutional sequence models.

The FFT convolution can be easily optimized for short sequences. It is common practice to reuse kernel filters over multiple batches, which makes it possible to precompute the FFT of the filter before reusing it. Thus, the FFT convolution is parallel across batches and filters, and kernel fusion allows intermediate convolution outputs to be cached in SRAM or registers.
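A sketch of this precompute-and-reuse pattern (our own illustration, with made-up shapes) shows why short sequences are easy: the filter FFT is computed once and the work is independent across batch and filter dimensions.

```python
import numpy as np

# u: a batch of inputs, shape (batch, heads, seqlen); k: one filter per head, shape (heads, seqlen)
batch, heads, seqlen = 8, 16, 1024
u = np.random.randn(batch, heads, seqlen)
k = np.random.randn(heads, seqlen)

# Precompute the filter FFT once; it is reused for every example in the batch.
k_f = np.fft.rfft(k, n=2 * seqlen)                       # shape (heads, seqlen + 1)

# The convolution is then independent (parallel) across the batch and head dimensions.
u_f = np.fft.rfft(u, n=2 * seqlen)                       # shape (batch, heads, seqlen + 1)
y = np.fft.irfft(u_f * k_f, n=2 * seqlen)[..., :seqlen]  # broadcasting reuses k_f across the batch
```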

However, the team highlights two major bottlenecks that appear as the sequence length grows:

  1. FFT convolutions do not make optimal use of the specialized matrix-matrix multiply units on current accelerators.
  2. Kernel fusion fails as sequences grow too long to fit in SRAM, so costly I/O operations are required. Padding operations for causality and conversions from real-valued inputs/outputs to complex-valued FFT intermediates can increase these I/O costs further.

In response, the researchers offer FlashFFTConv, a novel algorithm that employs a Monarch decomposition of the FFT to optimize the FFT convolution for long sequences. A Monarch decomposition of order p rewrites the FFT as a series of p matrix-matrix multiply operations, which maps efficiently onto hardware. Higher values of p incur lower FLOP cost thanks to smaller matrices but require more I/O to move intermediate results, so there is a tradeoff.
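To make the idea concrete, here is a minimal order-2 sketch in NumPy. This is our own illustration of the classic Cooley-Tukey/four-step factorization that the Monarch view builds on, not the FlashFFTConv kernel itself: a length N = N1·N2 FFT becomes two dense matrix multiplies plus a pointwise twiddle correction, exactly the kind of work that matrix-multiply units handle well.

```python
import numpy as np

def dft_matrix(n):
    """Dense DFT matrix F[j, k] = exp(-2j*pi*j*k/n)."""
    idx = np.arange(n)
    return np.exp(-2j * np.pi * np.outer(idx, idx) / n)

def monarch_fft(x, n1, n2):
    """Length-(n1*n2) FFT expressed as two matrix multiplies (order-2 decomposition)."""
    n = n1 * n2
    assert x.shape[-1] == n
    m = x.reshape(n1, n2)                  # step 1: view the input as an n1 x n2 matrix
    y = dft_matrix(n1) @ m                 # step 2: size-n1 DFTs down the columns (a matmul)
    tw = np.exp(-2j * np.pi * np.outer(np.arange(n1), np.arange(n2)) / n)
    z = y * tw                             # step 3: pointwise twiddle-factor correction
    w = z @ dft_matrix(n2)                 # step 4: size-n2 DFTs along the rows (a matmul)
    return w.T.reshape(-1)                 # step 5: reorder to the standard FFT output layout

x = np.random.randn(4096)
assert np.allclose(monarch_fft(x, 64, 64), np.fft.fft(x))
```

Decomposing further (larger p) shrinks the dense DFT matrices, cutting FLOPs, but each extra stage adds another pass over the intermediate data, which is where the I/O tradeoff comes from.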

The study shows how to choose p to balance FLOP cost and I/O cost on a GPU using a simple cost model based on the sequence length. Besides enabling kernel fusion at longer sequence lengths, this decomposition reduces the amount of the sequence that must be kept in SRAM, so FlashFFTConv can handle sequences anywhere from 256 to 4 million tokens long. By using a real-valued FFT algorithm and skipping parts of the matrix-multiply operations when the input is zero-padded, FlashFFTConv can also cut the length of the FFT operation by as much as half.

Finally, the matrix view of the FFT convolution provides a simple interface for two architectural modifications: partial convolutions, which learn with a convolution kernel that is shorter than the input sequence, and frequency-sparse convolutions, which zero out sections of the kernel in frequency space. Both can be implemented simply by omitting sections of the matrix decomposition, lowering memory footprint and wall-clock runtime, and can be viewed as convolutional analogues of sparse/approximate attention in Transformers.
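As a loose illustration of the frequency-sparse idea (our own sketch, not the paper's kernel), zeroing a section of the kernel's spectrum means the corresponding blocks of the matrix decomposition never need to be computed at all:

```python
import numpy as np

def freq_sparse_conv(u, k, keep_frac=0.25):
    """Sketch of a frequency-sparse long convolution.

    The kernel's spectrum is zeroed outside the lowest `keep_frac` of bins;
    in a fused matrix-decomposition kernel, the zeroed blocks could simply
    be skipped, saving compute, memory, and I/O.
    """
    N = u.shape[-1]
    k_f = np.fft.rfft(k, n=2 * N)
    cutoff = int(keep_frac * k_f.shape[-1])
    k_f[cutoff:] = 0.0                        # zero out a section of the kernel in frequency space
    u_f = np.fft.rfft(u, n=2 * N)
    return np.fft.irfft(u_f * k_f, n=2 * N)[..., :N]
```

A partial convolution is the time-domain counterpart: the kernel is simply shorter than the input sequence, so the trailing blocks of the decomposition are dropped instead.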

The researchers demonstrate that FlashFFTConv accelerates the FFT convolution, yielding higher-quality, more efficient models that can handle longer sequences.

  • FlashFFTConv improves the quality of convolutional sequence models via better efficiency: for the same compute budget, FlashFFTConv allows Hyena-GPT-s to achieve 2.3 points better perplexity and allows M2-BERT-base to achieve up to 3.3 points higher average GLUE score—a gain in performance equivalent to doubling the parameters of the model.
  • FlashFFTConv speeds up convolutions by up to 7.93× and reduces memory use by up to 5.60× compared to PyTorch, and this efficiency holds over four orders of magnitude in sequence length. Thanks to lower FLOP costs, FlashFFTConv is faster in wall-clock time than FlashAttention-v2 end-to-end for sequence lengths of 2K and longer, and it achieves up to 62.3% end-to-end FLOP utilization, only about 10% less than FlashAttention-v2.
  • Longer-sequence models become possible with FlashFFTConv. It yields the only model capable of completing the Long Range Arena benchmark’s Path-512 task (sequence length 256K) for high-resolution image classification, and it enables the first model to embed the longest human genes (up to 2.3M base pairs) at single-nucleotide resolution by extending HyenaDNA to a 4M sequence length via partial convolutions.

The team hopes that FlashFFTConv will pave the way for wider use of convolutional sequence models and that the lessons learned will lead to more resource-efficient computer architectures.


Check out the Paper, GitHub, and Blog Article. All credit for this research goes to the researchers of this project.



Dhanshree Shenwai is a computer science engineer with experience at FinTech companies spanning the financial, cards & payments, and banking domains, and a keen interest in applications of AI. She is enthusiastic about exploring new technologies and advancements that make everyone’s life easier.



