Vision-and-language models (VLMs) are important tools that use text to handle a wide range of computer vision tasks. Image recognition, reading text from images (OCR), and object detection can all be framed as answering visual questions with text responses. While VLMs have shown success on such tasks, it remains unclear how they process and represent multimodal inputs like images and text to produce those answers, which raises questions about the kind of representations that enable this performance.
Current approaches treat tasks as either text-based or image-based, focusing on one input type at a time and missing the deeper possibilities of combining information from images and text. In-context learning (ICL), a capability of large language models (LLMs), allows models to adapt to new tasks from only a few examples, driven by mechanisms such as attention heads and task vectors that encode the task as latent activations. VLMs, inspired by LLMs, combine visual and textual data using either late-fusion (pre-trained components) or early-fusion (end-to-end training) architectures. Studies have shown that task representations can transfer across modalities, and that even VLMs without image ICL support can use task vectors to improve performance, highlighting similarities between image and text ICL. Combining image and text inputs can therefore allow VLMs to perform complex tasks more effectively.
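To make the task-vector idea concrete, here is a minimal sketch of how such a vector might be read out of a model's activations from a few text ICL examples. The model name, layer index, prompt format, and last-token readout are illustrative assumptions, not the paper's exact recipe.

```python
# Minimal sketch: a "task vector" taken as an intermediate-layer hidden state
# at the final token of a few-shot prompt. Model, layer, and prompt format are
# placeholder assumptions for illustration.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_NAME = "meta-llama/Llama-2-7b-hf"  # placeholder; any causal LM works
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(MODEL_NAME, output_hidden_states=True)
model.eval()

def extract_task_vector(icl_examples, layer=15):
    """Encode few-shot (input, output) pairs and return the hidden state at the
    final prompt token of an intermediate layer as the task vector."""
    prompt = "\n".join(f"{x} -> {y}" for x, y in icl_examples) + "\n"
    inputs = tokenizer(prompt, return_tensors="pt")
    with torch.no_grad():
        out = model(**inputs)
    # out.hidden_states: tuple of (num_layers + 1) tensors of shape [batch, seq, dim]
    return out.hidden_states[layer][0, -1]

# Example: a country -> capital task defined purely with text examples
task_vec = extract_task_vector([("France", "Paris"), ("Japan", "Tokyo")])
```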
To address this, researchers from the University of California, Berkeley conducted experiments to analyze how task vectors are encoded and transferred in VLMs. They found that VLMs map inputs into a shared task representation space, regardless of whether the task is defined by text examples, image examples, or explicit instructions.
The researchers designed six tasks to test how VLMs form task vectors and how well those vectors transfer across modalities, defining each task with text examples, image examples, or direct instructions. The vectors were then applied in cross-modal settings, such as defining a task with text examples but querying with images. Tracking how token representations change through the VLM revealed a three-phase process: encoding the input, forming a task representation, and generating the output. Decoding the task vectors often yielded a summary of the task concept that was aligned across text and image modalities, although image-defined tasks produced less distinct representations.
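Cross-modal patching can be pictured as injecting a stored task vector back into the model while it answers a query from a different modality. The sketch below assumes a decoder that exposes its transformer blocks as `model.model.layers` and patches only the final prompt token; it is an illustration of the idea, not the authors' implementation.

```python
# Sketch of cross-modal patching: overwrite the last-token hidden state of one
# transformer block with a task vector derived from text ICL, then let the VLM
# answer an image query. Layer index and hook target are assumptions.
import torch

def patch_task_vector(language_model, task_vec, layer=15):
    """Register a forward hook that overwrites the last-token hidden state of
    the chosen block with `task_vec` during the prompt forward pass."""
    def hook(module, inputs, output):
        hidden = output[0] if isinstance(output, tuple) else output
        # patch only the prompt pass (seq_len > 1), not cached decode steps
        if hidden.shape[1] > 1:
            hidden[:, -1, :] = task_vec.to(hidden.dtype)
        return output
    # assumes the decoder exposes its blocks as language_model.model.layers
    return language_model.model.layers[layer].register_forward_hook(hook)

# Hypothetical usage: task defined with text ICL, query posed as an image
# handle = patch_task_vector(vlm.language_model, task_vec, layer=15)
# answer = vlm.generate(images=query_image, prompt="Answer:")
# handle.remove()
```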
The study evaluated the cross-modal transfer performance of task vectors from text and image in-context learning (ICL), revealing significant improvements. Cross-modal patching (xPatch) surpassed same-context examples (xBase), boosting accuracy by 14–33% over text ICL xBase and by 8–13% over image ICL Patch. Text-based task vectors proved more effective than image-based ones, since the latter involve an extra recognition step. Ensembling instruction-based and exemplar-based task vectors into a single vector improved the task representation, reducing variance and increasing efficiency by 18%. Cross-modal transfer from text to image reached accuracies as high as 37–52%, outperforming the baselines. LLM-to-VLM transfer showed high similarity between task vectors (cosine similarity: 0.89–0.95). Overall, the results highlight cross-modal patching and vector ensembling as keys to optimizing task performance.
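For the ensembling and similarity results above, a simple way to picture the operations is averaging the two task vectors and comparing vectors with cosine similarity. The averaging rule and function names here are assumptions used for illustration, not the paper's exact procedure.

```python
# Sketch: combine exemplar- and instruction-derived task vectors, and measure
# how similar two task vectors are (e.g., from an LLM vs. a VLM).
import torch
import torch.nn.functional as F

def ensemble_task_vectors(exemplar_vec: torch.Tensor, instruction_vec: torch.Tensor) -> torch.Tensor:
    """Combine two task vectors for the same task by simple averaging (assumed rule)."""
    return (exemplar_vec + instruction_vec) / 2

def task_vector_similarity(vec_a: torch.Tensor, vec_b: torch.Tensor) -> float:
    """Cosine similarity between two task vectors."""
    return F.cosine_similarity(vec_a.unsqueeze(0), vec_b.unsqueeze(0)).item()

# Hypothetical usage: the combined vector is patched in just like either source vector
# combined = ensemble_task_vectors(exemplar_task_vec, instruction_task_vec)
# print(task_vector_similarity(llm_task_vec, vlm_task_vec))  # paper reports ~0.89-0.95
```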
In summary, VLMs can effectively encode and transfer task representations across modalities, which shows potential for more versatile and efficient multi-modal models. The researchers offered possible explanations, such as shared structure between language and perception or the models learning from the same underlying reality. They found that transfer from text to images works better than from images to text, likely because VLM training focuses more heavily on text. Thus, this work can serve as a baseline for further research and innovation.
Check out the Paper and Project. All credit for this research goes to the researchers of this project.
Divyesh is a consulting intern at Marktechpost. He is pursuing a BTech in Agricultural and Food Engineering from the Indian Institute of Technology, Kharagpur. He is a Data Science and Machine Learning enthusiast who wants to integrate these leading technologies into the agricultural domain and solve challenges.