
Web-Scale Data Has Driven Incredible Progress in AI, But Do We Really Need All That Data? Meet SemDeDup: A New Method to Remove Semantic Duplicates in Web Data With Minimal Performance Loss


The growth of self-supervised learning (SSL) applied to ever-larger models and unlabeled datasets has been a major factor in recent machine learning success. In particular, many contemporary huge datasets are scraped at web scale and are typically unfiltered, save for NSFW filtering. LAION, for example, is a public multi-modal dataset containing 5 billion image/text pairs.

Growing interest in scaling laws, which forecast how a model’s performance changes with more data and/or parameters, has established that test error often scales as a power law in the amount of training data. Power-law scaling cannot be sustained forever, however, since it quickly reaches a regime of diminishing returns in which ever more data is needed for ever smaller performance improvements. Improving data efficiency would therefore have a significant impact: with the same computational budget, models could reach the same performance much faster, or reach better performance outright.
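To make the diminishing-returns point concrete, here is the standard power-law form of this relationship; the exponent value in the comments is illustrative, not a figure from the paper:

```latex
% Test error as a power law in training-set size N:
E(N) \approx a\,N^{-\alpha}, \qquad \alpha > 0
% Doubling the data scales the error by 2^{-\alpha}; with an
% illustrative \alpha = 0.1, that is only about a 7% reduction
% (2^{-0.1} \approx 0.93), even though the data cost doubled.
```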

These findings have motivated recent work proposing that, with an ideal data-ranking metric, exponential scaling might be possible by pruning training data according to an intelligent criterion, thus breaking the power law with respect to data. Yet little is known about the best ways to select data. Such methods may prioritize one of three groups of redundant data, approximately ranked by the difficulty of identifying them:

  1. Perceptual duplicates are data pairs that are virtually indistinguishable to the naked eye.
  2. Semantic duplicates have nearly identical information content but are easily distinguishable to the human eye.
  3. Semantically redundant data differ from semantic duplicates in that the pairs do not arise from the same underlying object or event; even so, such data can still contain a great deal of repeated information.

A fourth category, misleading data, goes further: instead of supplying no information, as the preceding types do, misleading data generate a negative or detrimental signal, so deleting them improves performance rather than merely having no effect.

SemDeDup, proposed by researchers from Meta AI and Stanford University, is a computationally tractable and straightforward method for detecting semantic duplicates. 

Semantically identical data that would be difficult to find with simple exact-match deduplication algorithms are the primary focus of this effort. Finding such data points is difficult because input-space distance measures are unlikely to reveal semantic duplicates. The researchers overcame this restriction by embedding each example with a publicly available pre-trained model, applying k-means clustering to the embeddings, and then searching within each cluster for pairs whose embedding distance falls below a given cutoff, i.e., whose similarity exceeds a threshold.
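As a rough illustration of this pipeline, the sketch below clusters pre-computed embeddings with k-means and flags within-cluster pairs whose cosine similarity exceeds a threshold. This is not the authors’ implementation; the model producing the embeddings, the number of clusters, and the `threshold` value are all illustrative assumptions.

```python
# Minimal SemDeDup-style sketch (assumed parameters, not the paper's code).
import numpy as np
from sklearn.cluster import KMeans

def semdedup(embeddings: np.ndarray, n_clusters: int = 100,
             threshold: float = 0.95) -> np.ndarray:
    """Return a boolean mask over examples: True = keep."""
    # Normalize rows so dot products equal cosine similarity.
    emb = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    labels = KMeans(n_clusters=n_clusters, n_init=10).fit_predict(emb)

    keep = np.ones(len(emb), dtype=bool)
    for c in range(n_clusters):
        idx = np.where(labels == c)[0]
        # Compare pairs only within a cluster, which keeps the cost
        # far below an all-pairs scan of the full dataset.
        sims = emb[idx] @ emb[idx].T
        for i in range(len(idx)):
            if not keep[idx[i]]:
                continue
            # Greedily drop later cluster members that are
            # near-duplicates of a kept example.
            dup = np.where(sims[i, i + 1:] > threshold)[0] + i + 1
            keep[idx[dup]] = False
    return keep
```

At LAION scale one would swap in a strong pre-trained encoder (the paper clusters embeddings from publicly available pre-trained models) and distributed clustering, but the cluster-then-compare structure above is what makes the search for semantic duplicates tractable.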

By omitting redundant information, training can proceed much more quickly. Alternatively, by removing fewer duplicates, one can exceed baseline performance, especially on out-of-distribution (OOD) tasks, while still obtaining a speedup, albeit a smaller one than in the matched-performance setting. The LAION training set was shrunk by half with almost no performance loss, leading to faster learning and the same or better out-of-distribution results. The study also applies SemDeDup to C4, a large text corpus, achieving efficiency gains of 15% while often outperforming prior state-of-the-art deduplication methods.

Getting rid of semantic duplication is a good starting point for minimizing data size, but it’s not the only option. The team’s goal is to eventually have much smaller datasets, reducing training time and making massive models more accessible.


Check out the Paper. All Credit For This Research Goes To the Researchers on This Project. Also, don’t forget to join our 16k+ ML SubReddit, Discord Channel, and Email Newsletter, where we share the latest AI research news, cool AI projects, and more.


Tanushree Shenwai is a consulting intern at MarktechPost. She is currently pursuing her B.Tech from the Indian Institute of Technology (IIT), Bhubaneswar. She is a Data Science enthusiast and has a keen interest in the scope of application of artificial intelligence in various fields. She is passionate about exploring new advancements in technologies and their real-life applications.


