AI

Meta AI introduces SPIRIT-LM: A Foundation Multimodal Language Model that Freely Mixes Text and Speech

3 Mins read

Prompting Large Language Models (LLMs) has emerged as a standard practice in Natural Language Processing (NLP) following the introduction of GPT-3. The scaling of language models to billions of parameters using extensive datasets contributes significantly to achieving broad language understanding and generation capabilities. Moreover, large-scale language models exhibit the ability to tackle novel tasks by leveraging a few examples via in-context, few-shot learning.

Speech Language Models (SpeechLMs), which are language models trained directly on speech, have been introduced by researchers, marking the beginning of an active area of research. Recent studies have contributed to advancing this field.

SPIRIT-LM is introduced as a foundational multimodal language model that seamlessly integrates text and speech. The model builds upon a pre-trained text language model and expands its capabilities to incorporate speech by continual training on a combination of text and speech data. Text and speech sequences are merged into a unified token set and trained using a word-level interleaving approach with a curated speech-text parallel corpus.

SPIRIT-LM is available in two variants: a BASE version employing speech semantic units and an EXPRESSIVE version that incorporates pitch and style units to model expressivity alongside semantic units. Both versions encode text using the subword BPE tokens. The resultant model demonstrates a fusion of semantic comprehension from text models and expressive qualities from speech models.

a. The architecture of SPIRIT-LM involves a language model trained through next-token prediction. Tokens are generated either from speech or text via an encoder and then reconstructed back to their original modality using a decoder. Training of SPIRIT-LM models encompasses a combination of text-only sequences, speech-only sequences, and interleaved speech-text sequences.

b. The scheme for interleaving speech and text involves encoding speech into tokens (depicted in pink) using clusterized speech units such as Hubert, Pitch, or Style tokens and text (depicted in blue) using BPE. Special tokens ([TEXT] for text and [SPEECH] for speech) are used to mark the respective modality. During training, a switch between modalities occurs randomly at word boundaries within aligned speech-text corpora. Speech tokens are deduplicated and then interleaved with text tokens at the boundary where the modality changes.

c. Expressive speech tokens are introduced for SPIRIT-LM-EXPRESSIVE. Pitch tokens and style tokens are interleaved after deduplication.

Their contributions are as follows:

  • They introduce SPIRIT-LM, a unified language model capable of generating both speech and text. SPIRIT-LM is developed by continuously pretraining LLAMA 2 with interleaved speech and text data.
  • Similar to text-based Language Models (LLMs), they observe that SPIRIT-LM can adeptly learn new tasks in a few-shot learning setting across text, speech, and crossmodal tasks (i.e., speech-to-text and text-to-speech).
  • To assess the expressive capabilities of generative models, they introduce the SPEECHTEXT SENTIMENT PRESERVATION benchmark (STSP). This benchmark evaluates how effectively generative models maintain the sentiment of prompts within and across modalities for both spoken and written expressions.
  • Finally, they propose an expressive variant of SPIRIT-LM, named SPIRIT-LM-EXPRESSIVE. Through the use of STSP, they demonstrate that SPIRIT-LM is the first language model capable of preserving the sentiment of both text and speech prompts within and across modalities.

As advancements in Large Language Models (LLMs) and Speech Language Models (SpeechLMs) continue, along with innovative approaches to prompt creation and model design, there’s great potential to improve natural language understanding systems. These advancements could profoundly impact many areas, such as conversational agents, virtual assistants, language translation, and accessibility tools. Ultimately, they could lead to more lifelike interactions between humans and machines.


Check out the Paper. All credit for this research goes to the researchers of this project. Also, don’t forget to follow us on Twitter and Google News. Join our 36k+ ML SubReddit, 41k+ Facebook Community, Discord Channel, and LinkedIn Group.

If you like our work, you will love our newsletter..

Don’t Forget to join our Telegram Channel


Janhavi Lande, is an Engineering Physics graduate from IIT Guwahati, class of 2023. She is an upcoming data scientist and has been working in the world of ml/ai research for the past two years. She is most fascinated by this ever changing world and its constant demand of humans to keep up with it. In her pastime she enjoys traveling, reading and writing poems.




Source link

Related posts
AI

MMSearch Engine: AI Search with Advanced Multimodal Capabilities to Accurately Process and Integrate Text and Visual Queries for Enhanced Search Results

3 Mins read
Traditional search engines have predominantly relied on text-based queries, limiting their ability to process and interpret the increasingly complex information found online…
AI

CodeMaker AI Breakthrough in Software Development: Achieves 91% Accuracy in Recreating 90,000 Lines of Code, Setting a New Benchmark for AI-driven code Generation and Fine-Tuned Model

6 Mins read
In an era of AI-transforming industries, CodeMaker AI has achieved a landmark breakthrough by autonomously recreating a 90,000-line software library with an…
AI

ByteDance Introduced Hierarchical Large Language Model (HLLM) Architecture to Transform Sequential Recommendations, Overcoming Cold-Start Challenges, and Enhancing Scalability with State-of-the-Art Performance

4 Mins read
Recommendation systems have become the foundation for personalized services across e-commerce, streaming, and social media platforms. These systems aim to predict user…

 

 

Leave a Reply

Your email address will not be published. Required fields are marked *