AI

ScreenSpot-Pro: The First Benchmark Driving Multi-Modal LLMs into High-Resolution Professional GUI-Agent and Computer-Use Environments

2 Mins read

GUI agents face three critical challenges in professional environments: (1) the greater complexity of professional applications compared to general-use software, requiring detailed comprehension of intricate layouts; (2) the higher resolution of professional tools, resulting in smaller target sizes and reduced grounding accuracy; and (3) the reliance on additional tools and documents, adding complexity to workflows. These challenges highlight the need for advanced benchmarks and solutions to enhance GUI agent performance in these demanding scenarios.

Current GUI grounding models and benchmarks are insufficient to fulfill professional environment requirements. Tools like ScreenSpot are designed for low-resolution tasks and lack the variety to simulate real-world scenarios accurately. Models such as OS-Atlas and UGround are computationally inefficient and fail when the targets are small or the interface is icon-rich, which is common in professional applications. In addition, the absence of multilingual support reduces their applicability in global workflows. These shortcomings highlight the need for more comprehensive and realistic benchmarks to further this field.

A team of researchers from the National University of Singapore, East China Normal University, and Hong Kong Baptist University introduce ScreenSpot-Pro: a new framework that is tailored to professional high-resolution environments. This benchmark has a dataset of 1,581 tasks across 23 applications in industries such as development, creative tools, CAD, scientific platforms, and office suites. It incorporates high-resolution, full-screen visuals and expert annotations that ensure accuracy and realism. Multilingual guidelines encompass both English and Chinese for an expanded range of evaluation. ScreenSpot-Pro is unique as it documents the actual workflows that result in real, high-quality annotations, therefore serving as a tool for the full assessment and development of GUI grounding models.

The dataset ScreenSpot-Pro captures realistic and challenging scenarios. The base of this dataset is formed by high-resolution images, where the target regions form an average of only 0.07% of the total screen, thus pointing to subtle and small GUI elements. Data was collected by professional users with experience in relevant applications, who used specialized tools to ensure accurate annotations. Additionally, the dataset supports multilingual capabilities to test bilingual functionality and contains several workflows to capture the subtleties of real professional tasks. These characteristics render it particularly advantageous for the assessment and enhancement of the accuracy and flexibility of GUI agents.

The analysis of current GUI grounding models utilizing ScreenSpot-Pro reveals considerable deficiencies in their capacity to manage high-resolution professional settings. OS-Atlas-7B attained the greatest accuracy rate of 18.9%. However, iterative methodologies, exemplified by ReGround, demonstrated the capacity to enhance performance, reaching an accuracy of 40.2% by fine-tuning predictions through a multi-step methodology. Minor components, such as icons, presented significant difficulties, whereas bilingual assignments further highlighted the limitations of the models. These findings emphasize the necessity for improved techniques that bolster contextual comprehension and resilience in intricate GUI situations.

ScreenSpot-Pro sets a transformative benchmark for the evaluation of GUI agents in professional high-resolution environments. It addresses the specific challenges in complex workflows, offering a diverse and precise dataset to guide innovations in GUI grounding. This contribution forms the foundation of much smarter and more efficient agents that support a seamless performance of professional tasks, significantly boosting productivity and innovation in all industry fields.


Check out the Paper and Data. All credit for this research goes to the researchers of this project. Also, don’t forget to follow us on Twitter and join our Telegram Channel and LinkedIn Group. Don’t Forget to join our 60k+ ML SubReddit.

🚨 FREE UPCOMING AI WEBINAR (JAN 15, 2025): Boost LLM Accuracy with Synthetic Data and Evaluation IntelligenceJoin this webinar to gain actionable insights into boosting LLM model performance and accuracy while safeguarding data privacy.


Aswin AK is a consulting intern at MarkTechPost. He is pursuing his Dual Degree at the Indian Institute of Technology, Kharagpur. He is passionate about data science and machine learning, bringing a strong academic background and hands-on experience in solving real-life cross-domain challenges.



Source link

Related posts
AI

This AI Paper from Tel Aviv University Introduces GASLITE: A Gradient-Based Method to Expose Vulnerabilities in Dense Embedding-Based Text Retrieval Systems

3 Mins read
Dense embedding-based text retrieval has become the cornerstone for ranking text passages in response to queries. The systems use deep learning models…
AI

Researchers from USC and Prime Intellect Released METAGENE-1: A 7B Parameter Autoregressive Transformer Model Trained on Over 1.5T DNA and RNA Base Pairs

3 Mins read
In a time when global health faces persistent threats from emerging pandemics, the need for advanced biosurveillance and pathogen detection systems is…
AI

3D Shape Tokenization - Apple Machine Learning Research

4 Mins read
We introduce Shape Tokens, a 3D representation that is continuous, compact, and easy to integrate into machine learning models. Shape Tokens serve…

 

 

Leave a Reply

Your email address will not be published. Required fields are marked *