Advances in AI have driven the incorporation of increasingly varied datasets into multimodal models, enabling a more comprehensive understanding of complex information and substantial gains in accuracy. Leveraging these advantages, multimodal models find applications in healthcare, autonomous vehicles, speech recognition, and more. However, the large data requirements of these models lead to inefficiencies in computational cost, memory usage, and energy consumption. Even for advanced models, it is difficult to curate training data while maintaining or improving performance, and these limitations hinder their real-world scalability. Researchers at Google, Google DeepMind, the Tübingen AI Center, the University of Tübingen, and the University of Cambridge have devised a novel framework, Active Data Curation, to address these limitations.
Traditional approaches to optimizing model training include strategies such as random sampling, data augmentation, and active learning. These methods have proven effective, but they face significant issues, such as ineffective fusion of diverse information from different modalities, which ultimately hinders output evaluation. They are also prone to overfitting, since different data types generalize at different rates, and they require extensive resources.
The proposed framework, Active Data Curation, combines active learning principles with multimodal sampling techniques to curate training data efficiently for robust AI models. The framework uses active learning to select the examples the model is most uncertain about and learns from them through a feedback loop, while a multimodal sampling method maintains diversity across data types such as text and images. The framework is flexible enough to work with a variety of multimodal models and can handle large datasets by processing them in a distributed fashion with these sampling strategies. The net effect is a smaller dataset that maintains or improves model performance.
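To make the selection idea concrete, here is a minimal, illustrative sketch (not the paper's actual algorithm) of the two ingredients described above: ranking examples by an uncertainty score and splitting the selection budget across modalities to preserve diversity. The function name, the scalar uncertainty scores, and the even per-modality budget split are all simplifying assumptions for illustration.

```python
import numpy as np

def curate_batch(uncertainty, modality, k):
    """Illustrative data-curation step: split the budget k evenly across
    modalities, then keep the most uncertain examples within each group.
    `uncertainty` holds per-example scores (higher = more informative);
    `modality` labels each example, e.g. "image" or "text"."""
    uncertainty = np.asarray(uncertainty)
    modality = np.asarray(modality)
    groups = np.unique(modality)
    per_group = k // len(groups)  # even split; real systems may weight groups
    chosen = []
    for g in groups:
        idx = np.flatnonzero(modality == g)
        # Indices of the per_group highest-uncertainty examples in this group.
        top = idx[np.argsort(-uncertainty[idx])[:per_group]]
        chosen.extend(top.tolist())
    return sorted(chosen)

# Toy run: 8 examples, two modalities, budget of 4.
unc = [0.9, 0.1, 0.5, 0.8, 0.3, 0.7, 0.2, 0.6]
mod = ["image", "image", "image", "image", "text", "text", "text", "text"]
print(curate_batch(unc, mod, 4))  # → [0, 3, 5, 7]
```

In a training loop, the uncertainty scores would be recomputed from the current model (e.g., from per-example loss), closing the feedback loop the framework relies on.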
The Active Data Curation framework accelerates model training and reduces inference time by up to 11%. Because the curated datasets are compact but more informative, the computing workload shrinks significantly, and the models maintained or improved their accuracy on image and text tasks while training on smaller datasets. The diversity and quality of the curated data also translated into better performance in real-world settings.
In conclusion, Active Data Curation offers a novel way to train large-scale multimodal models. By selecting data according to a particular model's needs, it addresses the problems caused by traditional training methods, significantly lowering computing costs while maintaining or even raising model performance, which is essential for efficient AI. This work highlights the importance of using data innovatively in large multimodal models and provides a benchmark for training scalable, sustainable models. Future research should integrate the framework into real-time training pipelines and generalize it further across multimodal tasks.
Afeerah Naseem is a consulting intern at Marktechpost. She is pursuing her B.Tech at the Indian Institute of Technology (IIT), Kharagpur. She is passionate about data science and fascinated by the role of artificial intelligence in solving real-world problems. She loves discovering new technologies and exploring how they can make everyday tasks easier and more efficient.