AI

Generalizable Error Modeling for Human Data Annotation: Evidence from an Industry-Scale Search Data Annotation Program

1 Mins read

Machine learning (ML) and artificial intelligence (AI) systems rely heavily on human-annotated data for training and evaluation. A major challenge in this context is the occurrence of annotation errors, as their effects can degrade model performance. This paper presents a predictive error model trained to detect potential errors in search relevance annotation tasks for three industry-scale ML applications (music streaming, video streaming, and mobile apps). Drawing on real-world data from an extensive search relevance annotation program, we demonstrate that errors can be predicted with moderate model performance (AUC=0.65-0.75) and that model performance generalizes well across applications (i.e., a global, task-agnostic model performs on par with task-specific models). In contrast to past research, which has often focused on predicting annotation labels from task-specific features, our model is trained to predict errors directly from a combination of task features and behavioral features derived from the annotation process, in order to achieve a high degree of generalizability. We demonstrate the usefulness of the model in the context of auditing, where prioritizing tasks with high predicted error probabilities considerably increases the amount of corrected annotation errors (e.g., 40% efficiency gains for the music streaming application). These results highlight that behavioral error detection models can yield considerable improvements in the efficiency and quality of data annotation processes. Our findings reveal critical insights into effective error management in the data annotation process, thereby contributing to the broader field of human-in-the-loop ML.


Source link

Related posts
AI

What Happens When Diffusion and Autoregressive Models Merge? This AI Paper Unveils Generation with Unified Diffusion

3 Mins read
Generative models based on diffusion processes have shown great promise in transforming noise into data, but they face key challenges in flexibility…
AI

MOSEL: Collection of Open Source Speech Data for Speech Foundation Model Training on EU Languages

2 Mins read
While existing speech datasets are heavily skewed towards English, many EU languages are underserved in terms of accessible and high-quality speech data….
AI

Vinoground: A Temporal Counterfactual Large Multimodal Models LMM Evaluation Benchmark Encompassing 1000 Short and Natural Video-Caption Pairs

3 Mins read
Generative Intelligence has remained a hot topic for some time, with the current world witnessing an unprecedented boom in AI-related innovations and…

 

 

Leave a Reply

Your email address will not be published. Required fields are marked *