arxiv:2407.05528

An accurate detection is not all you need to combat label noise in web-noisy datasets

Published on Jul 8 · Submitted by PAlbert31 on Jul 11


Authors:

Paul Albert ,

Jack Valmadre ,

Abstract

Training a classifier on web-crawled data demands learning algorithms that are robust to annotation errors and irrelevant examples. This paper builds upon the recent empirical observation that applying unsupervised contrastive learning to noisy, web-crawled datasets yields a feature representation under which the in-distribution (ID) and out-of-distribution (OOD) samples are linearly separable. We show that direct estimation of the separating hyperplane can indeed offer an accurate detection of OOD samples, and yet, surprisingly, this detection does not translate into gains in classification accuracy. Digging deeper into this phenomenon, we discover that the near-perfect detection misses a type of clean example that is valuable for supervised learning. These examples are often visually simple images, which are relatively easy to identify as clean using standard loss- or distance-based methods despite being poorly separated from the OOD distribution using unsupervised learning. Because we further observe a low correlation with state-of-the-art (SOTA) metrics, we propose a hybrid solution that alternates between noise detection using linear separation and a SOTA small-loss approach. Combined with the SOTA algorithm PLS, our approach substantially improves SOTA results for real-world image classification in the presence of web noise. Code: github.com/PaulAlbert31/LSA
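To make the idea of detecting OOD samples with a separating hyperplane concrete, here is a minimal sketch. It is not the authors' implementation: the features are synthetic stand-ins for contrastive embeddings, the hyperplane is fitted with plain logistic regression on a small labelled subset (an assumption, since the abstract does not say how the hyperplane is estimated), and the names `fit_hyperplane` and `detect_ood` are hypothetical.

```python
import numpy as np


def l2_normalize(x):
    """Project each row onto the unit hypersphere."""
    return x / np.linalg.norm(x, axis=1, keepdims=True)


def fit_hyperplane(feats, is_ood, lr=0.5, steps=500):
    """Fit w, b by gradient descent on the logistic loss (assumed estimator)."""
    w = np.zeros(feats.shape[1])
    b = 0.0
    for _ in range(steps):
        p = 1.0 / (1.0 + np.exp(-(feats @ w + b)))  # predicted P(OOD)
        grad = p - is_ood
        w -= lr * feats.T @ grad / len(feats)
        b -= lr * grad.mean()
    return w, b


def detect_ood(feats, w, b):
    """Flag a sample as OOD when it falls on the positive side of the hyperplane."""
    return feats @ w + b > 0


# Synthetic stand-in for unsupervised contrastive features (assumption):
# ID and OOD samples cluster around two different directions on the sphere.
rng = np.random.default_rng(0)
dim, n = 16, 300
id_center = np.zeros(dim)
id_center[0] = 1.0
ood_center = np.zeros(dim)
ood_center[1] = 1.0
id_feats = l2_normalize(id_center + 0.1 * rng.normal(size=(n, dim)))
ood_feats = l2_normalize(ood_center + 0.1 * rng.normal(size=(n, dim)))
feats = np.vstack([id_feats, ood_feats])
is_ood = np.concatenate([np.zeros(n), np.ones(n)])

# Estimate the hyperplane from a small labelled subset, then score everything.
subset = rng.choice(2 * n, size=60, replace=False)
w, b = fit_hyperplane(feats[subset], is_ood[subset])
pred = detect_ood(feats, w, b)
accuracy = (pred == is_ood.astype(bool)).mean()
print(f"OOD detection accuracy on synthetic features: {accuracy:.3f}")
```

On real data the features would come from a contrastively pre-trained encoder, and, as the abstract notes, an accurate detector of this kind may still discard clean samples that matter for classification.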


Community

PAlbert31

Paper author Paper submitter Jul 11

We extend our previous work, which observed a linear separation between in- and out-of-distribution samples on the contrastive hypersphere when training unsupervised contrastive objectives on web-noise datasets.
We observe that, although linear separation detects explicit out-of-distribution samples very accurately, it can also wrongly flag important clean samples as noisy; removing these samples greatly reduces validation accuracy.

We devise an alternating noise detection strategy in which we switch between linear separation and small loss from one epoch to the next. While linear separation is accurate at identifying out-of-distribution noise, the small loss re-identifies the important clean samples and the in-distribution noise.
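The alternation described above can be sketched roughly as follows. This is an illustrative toy, not the code from the paper: the hyperplane `w, b` is assumed to be already fitted, the small-loss rule is a simple quantile threshold (a stand-in for whatever criterion the actual method uses), the choice of which detector runs on even epochs is arbitrary, and all function names and the `keep_ratio` parameter are hypothetical.

```python
import numpy as np


def linear_separation_clean_mask(feats, w, b):
    """Clean = on the ID side of a pre-fitted hyperplane (assumed given)."""
    return feats @ w + b <= 0


def small_loss_clean_mask(losses, keep_ratio):
    """Clean = samples whose loss is at or below the keep_ratio quantile."""
    return losses <= np.quantile(losses, keep_ratio)


def select_clean(epoch, feats, w, b, losses, keep_ratio=0.5):
    """Alternate detectors: linear separation on even epochs, small loss on odd."""
    if epoch % 2 == 0:
        return linear_separation_clean_mask(feats, w, b)
    return small_loss_clean_mask(losses, keep_ratio)


# Toy data (hypothetical): sample 2 is OOD; sample 3 is clean but lies on the
# wrong side of the hyperplane, so only the small-loss rule keeps it.
feats = np.array([[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.4, 0.6]])
w, b = np.array([-1.0, 1.0]), 0.0
losses = np.array([0.1, 0.2, 2.0, 0.3])

for epoch in range(4):
    mask = select_clean(epoch, feats, w, b, losses, keep_ratio=0.75)
    print(epoch, mask.tolist())
```

In this toy run the even epochs drop sample 3 while the odd epochs recover it, which mirrors the role the comment assigns to the small loss: re-identifying clean samples that linear separation rejects.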

top-1 mini-Webvision accuracy: 82.08

nielsr

Jul 12

Hi @PAlbert31 congrats on this work! Are you planning to share any artifacts (such as models, datasets) on the hub?
