One of the most pressing challenges in the evaluation of Vision-Language Models (VLMs) is related to not having comprehensive benchmarks that assess the full spectrum of model capabilities. This is because most existing evaluations are narrow in terms of focusing on only one aspect of the respective tasks, such as either visual perception or question answering, at the expense of critical aspects like fairness, multilingualism, bias, robustness, and safety. Without a holistic evaluation, the performance of models may be fine in some tasks but critically fail in others that concern their practical deployment, especially in sensitive real-world applications. There is, therefore, a dire need for a more standardized and complete evaluation that is effective enough to ensure that VLMs are robust, fair, and safe across diverse operational environments.
The current methods for the evaluation of VLMs include isolated tasks like image captioning, VQA, and image generation. Benchmarks like A-OKVQA and VizWiz are specialized in the limited practice of these tasks, not capturing the holistic capability of the model to generate contextually relevant, equitable, and robust outputs. Such methods generally possess different protocols for evaluation; therefore, comparisons between different VLMs cannot be equitably made. Moreover, most of them are created by omitting important aspects, such as bias in predictions regarding sensitive attributes like race or gender and their performance across different languages. These are limiting factors toward an effective judgment with respect to the overall capability of a model and whether it is ready for general deployment.
Researchers from Stanford University, University of California, Santa Cruz, Hitachi America, Ltd., University of North Carolina, Chapel Hill, and Equal Contribution propose VHELM, short for Holistic Evaluation of Vision-Language Models, as an extension of the HELM framework for a comprehensive evaluation of VLMs. VHELM picks up particularly where the lack of existing benchmarks leaves off: integrating multiple datasets with which it evaluates nine critical aspects—visual perception, knowledge, reasoning, bias, fairness, multilingualism, robustness, toxicity, and safety. It allows the aggregation of such diverse datasets, standardizes the procedures for evaluation to allow for fairly comparable results across models, and has a lightweight, automated design for affordability and speed in comprehensive VLM evaluation. This provides precious insight into the strengths and weaknesses of the models.
VHELM evaluates 22 prominent VLMs using 21 datasets, each mapped to one or more of the nine evaluation aspects. These include well-known benchmarks such as image-related questions in VQAv2, knowledge-based queries in A-OKVQA, and toxicity assessment in Hateful Memes. Evaluation uses standardized metrics like ‘Exact Match’ and Prometheus Vision, as a metric that scores the models’ predictions against ground truth data. Zero-shot prompting used in this study simulates real-world usage scenarios where models are asked to respond to tasks for which they had not been specifically trained; having an unbiased measure of generalization skills is thus assured. The research work evaluates models over more than 915,000 instances hence statistically significant to gauge performance.
The benchmarking of 22 VLMs over nine dimensions indicates that there is no model excelling across all the dimensions, hence at the cost of some performance trade-offs. Efficient models like Claude 3 Haiku show key failures in bias benchmarking when compared with other full-featured models, such as Claude 3 Opus. While GPT-4o, version 0513, has high performances in robustness and reasoning, attesting to high performances of 87.5% on some visual question-answering tasks, it shows limitations in addressing bias and safety. On the whole, models with closed API are better than those with open weights, especially regarding reasoning and knowledge. However, they also show gaps in terms of fairness and multilingualism. For most models, there is only partial success in terms of both toxicity detection and handling out-of-distribution images. The results bring forth many strengths and relative weaknesses of each model and the importance of a holistic evaluation system such as VHELM.
In conclusion, VHELM has substantially extended the assessment of Vision-Language Models by offering a holistic frame that assesses model performance along nine essential dimensions. Standardization of evaluation metrics, diversification of datasets, and comparisons on equal footing with VHELM allow one to get a full understanding of a model with respect to robustness, fairness, and safety. This is a game-changing approach to AI assessment that in the future will make VLMs adaptable to real-world applications with unprecedented confidence in their reliability and ethical performance .
Check out the Paper. All credit for this research goes to the researchers of this project. Also, don’t forget to follow us on Twitter and join our Telegram Channel and LinkedIn Group. If you like our work, you will love our newsletter.. Don’t Forget to join our 50k+ ML SubReddit
[Upcoming Event- Oct 17 202] RetrieveX – The GenAI Data Retrieval Conference (Promoted)