
Grounding Multimodal Large Language Models in Actions


Multimodal Large Language Models (MLLMs) have demonstrated a wide range of capabilities across many domains, including Embodied AI. In this work, we study how to best ground an MLLM into different embodiments and their associated action spaces, with the goal of leveraging the multimodal world knowledge of the MLLM. We first generalize a number of methods through a unified architecture and the lens of action space adapters. For continuous actions, we show that a learned tokenization allows for sufficient modeling precision, yielding the best performance on downstream tasks. For discrete actions, we demonstrate that semantically aligning these actions with the native output token space of the MLLM leads to the strongest performance. We arrive at these lessons via a thorough study of seven action space adapters on five different environments, encompassing over 114 embodied tasks.
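To make the two grounding strategies concrete, here is a minimal PyTorch sketch. It is not the paper's implementation: the class name `ActionVQTokenizer`, the codebook size, and the `semantic_action_map` entries are illustrative assumptions. The idea it shows is that continuous actions are compressed into a small learned vocabulary of action tokens the MLLM can predict, while discrete actions are expressed as natural-language phrases the MLLM's existing token space already covers.

```python
# Illustrative sketch only; names and hyperparameters are hypothetical,
# not taken from the paper.
import torch
import torch.nn as nn


class ActionVQTokenizer(nn.Module):
    """Maps continuous action vectors to discrete codebook indices (action tokens)."""

    def __init__(self, action_dim: int = 7, codebook_size: int = 256):
        super().__init__()
        # Learned codebook: each row is a prototype continuous action.
        self.codebook = nn.Parameter(torch.randn(codebook_size, action_dim))

    def tokenize(self, actions: torch.Tensor) -> torch.Tensor:
        # actions: (batch, action_dim), e.g. end-effector deltas.
        # Assign each action to its nearest codebook entry (Euclidean distance).
        dists = torch.cdist(actions, self.codebook)   # (batch, codebook_size)
        return dists.argmin(dim=-1)                   # (batch,) token ids

    def detokenize(self, token_ids: torch.Tensor) -> torch.Tensor:
        # Recover an executable continuous action from its token id.
        return self.codebook[token_ids]


# Discrete actions: align each environment action with language tokens the
# MLLM already models, rather than introducing new, unembedded action ids.
semantic_action_map = {
    0: "move forward",
    1: "turn left",
    2: "turn right",
    3: "pick up object",
}


if __name__ == "__main__":
    tok = ActionVQTokenizer(action_dim=7, codebook_size=256)
    cont_actions = torch.randn(4, 7)       # a batch of continuous actions
    ids = tok.tokenize(cont_actions)       # discrete tokens for the MLLM to predict
    recon = tok.detokenize(ids)            # approximate actions at execution time
    print(ids.shape, recon.shape)
    print(semantic_action_map[1])          # discrete action rendered as language
```

Under these assumptions, rendering discrete actions as language lets the model reuse its pretrained multimodal knowledge, while a learned codebook gives continuous actions enough precision to be decoded back into executable commands.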



