Multimodal Large Language Models with Fusion Low Rank Adaptation for Device Directed Speech Detection

Although Large Language Models (LLMs) have shown promise for human-like conversations, they are primarily pre-trained on text data. Incorporating audio or video improves performance, but collecting large-scale multimodal data and pre-training multimodal LLMs is challenging. To this end, we propose Fusion Low Rank Adaptation (FLoRA), a technique that efficiently adapts a pre-trained unimodal LLM to consume new, previously unseen modalities via low rank adaptation. For device-directed speech detection, the FLoRA-adapted multimodal LLM achieves a 22% relative reduction in equal error rate (EER) over the text-only approach and attains performance parity with its full fine-tuning (FFT) counterpart while tuning only a fraction of its parameters. Furthermore, with the newly introduced adapter dropout, FLoRA is robust to missing data, improving over FFT with 20% lower EER and 56% lower false accept rate. The proposed approach scales well for model sizes from 16M to 3B parameters.
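
The results above are reported as equal error rate (EER) and as relative reductions in it. As a rough illustration of what those numbers mean, the sketch below approximates EER for a binary detector by sweeping thresholds over its scores, and computes a relative reduction between two error rates. The function names `equal_error_rate` and `relative_reduction`, the label convention (1 = device-directed), and all scores and rates are assumptions made for illustration; none of them come from the paper, and this is not its evaluation code.

```python
def equal_error_rate(scores, labels):
    """Approximate EER by sweeping thresholds over the observed scores.

    labels: 1 = device-directed (positive), 0 = not directed (negative).
    An utterance is accepted when its score is >= the threshold.
    """
    positives = [s for s, y in zip(scores, labels) if y == 1]
    negatives = [s for s, y in zip(scores, labels) if y == 0]
    if not positives or not negatives:
        raise ValueError("need at least one positive and one negative example")

    best_gap, best_eer = None, None
    for threshold in sorted(set(scores)):
        # False accept rate: negatives wrongly accepted at this threshold.
        far = sum(s >= threshold for s in negatives) / len(negatives)
        # False reject rate: positives wrongly rejected at this threshold.
        frr = sum(s < threshold for s in positives) / len(positives)
        gap = abs(far - frr)
        if best_gap is None or gap < best_gap:
            # EER is where FAR and FRR cross; average them at the closest point.
            best_gap, best_eer = gap, (far + frr) / 2
    return best_eer


def relative_reduction(baseline, improved):
    """Fractional reduction of an error rate relative to a baseline."""
    return (baseline - improved) / baseline


# Toy scores and labels, invented for this example.
toy_scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.3, 0.2, 0.1]
toy_labels = [1, 1, 1, 0, 1, 0, 0, 0]
toy_eer = equal_error_rate(toy_scores, toy_labels)

# Hypothetical error rates chosen only so the ratio works out to 22%.
toy_reduction = relative_reduction(0.10, 0.078)
```

A relative reduction is measured against the baseline error rate, so a 22% relative reduction means the new EER is 0.78 times the baseline EER, not that the EER dropped by 22 percentage points.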


