
Grounding Multimodal Large Language Models in Actions


Multimodal Large Language Models (MLLMs) have demonstrated a wide range of capabilities across many domains, including Embodied AI. In this work, we study how to best ground an MLLM in different embodiments and their associated action spaces, with the goal of leveraging the MLLM's multimodal world knowledge. We first generalize a number of methods through a unified architecture and the lens of action space adapters. For continuous actions, we show that a learned tokenization allows for sufficient modeling precision, yielding the best performance on downstream tasks. For discrete actions, we demonstrate that semantically aligning these actions with the native output token space of the MLLM leads to the strongest performance. We arrive at these lessons via a thorough study of seven action space adapters on five different environments, encompassing over 114 embodied tasks.
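To make the two lessons concrete, here is a minimal sketch in PyTorch, not the paper's implementation: a learned vector-quantized tokenizer that discretizes continuous actions through a trainable codebook, and a semantic map that expresses discrete actions as phrases the MLLM already knows how to emit. All names here (`VQActionTokenizer`, `SEMANTIC_ACTION_MAP`) are hypothetical.

```python
# Hypothetical illustration of the two action-space-adapter lessons from the
# abstract; not the paper's code.
import torch
import torch.nn as nn


class VQActionTokenizer(nn.Module):
    """Maps continuous action vectors to discrete codebook indices.

    The codebook is learned (e.g., with a VQ-VAE-style objective), so the
    discretization adapts to the action distribution instead of relying on
    fixed uniform bins.
    """

    def __init__(self, action_dim: int, codebook_size: int = 256):
        super().__init__()
        # One learnable code vector per discrete action token.
        self.codebook = nn.Embedding(codebook_size, action_dim)

    def encode(self, actions: torch.Tensor) -> torch.Tensor:
        # actions: (batch, action_dim) -> nearest codebook index per action.
        dists = torch.cdist(actions, self.codebook.weight)  # (batch, K)
        return dists.argmin(dim=-1)                         # (batch,)

    def decode(self, token_ids: torch.Tensor) -> torch.Tensor:
        # Discrete token ids -> reconstructed continuous actions.
        return self.codebook(token_ids)


# For discrete action spaces, the abstract's lesson is to align each action
# with semantically meaningful text in the MLLM's native vocabulary, rather
# than introducing arbitrary new action ids. Hypothetical example mapping:
SEMANTIC_ACTION_MAP = {
    0: "move forward",
    1: "turn left",
    2: "turn right",
    3: "pick up object",
}


if __name__ == "__main__":
    tok = VQActionTokenizer(action_dim=7)
    a = torch.randn(4, 7)                # e.g., a batch of 7-DoF arm actions
    ids = tok.encode(a)
    recon = tok.decode(ids)
    print(ids.shape, recon.shape)        # torch.Size([4]) torch.Size([4, 7])
```

In a full system the codebook would be trained jointly with the policy so that reconstruction precision matches the task, and the semantic action strings would be run through the MLLM's own tokenizer, keeping predicted actions inside its native output token space.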

