TL;DR: We introduce a dataset, Noisy Imagenette, which is a version of the Imagenette dataset with noisy labels. We hope this dataset is useful for rapid experimentation and testing of methods to address noisy label training.

# Introduction

## Datasets have noisy labels!

Deep learning has produced impressive results on datasets of all types, but it works best when models are trained on large datasets with human-annotated labels (extreme examples: GPT-3 and, more recently, CLIP/ALIGN/DALL-E). A major challenge when constructing these datasets is obtaining enough labels to train a neural network. There is an inherent tradeoff between the quality of the annotations and their cost in time or money. For example, services like Amazon Mechanical Turk provide cheap labeling, but non-expert annotators will often produce unreliable labels. These are referred to as noisy labels, since they do not necessarily match the ground truth. Unfortunately, neural networks are known to overfit to noisy labels (see here), which means alternative approaches are needed to achieve good generalization in their presence.

## Prior research on noisy labels

Recently, many techniques have been proposed to address label noise. These include novel loss functions like Bi-Tempered Logistic Loss, Taylor Cross Entropy Loss, and Symmetric Cross Entropy. There are also many recently developed training techniques, such as MentorMix, DivideMix, Early-Learning Regularization, and Noise-Robust Contrastive Learning.

Most of these papers use MNIST, SVHN, CIFAR10, or related datasets with synthetically added noise. Other common datasets are WebVision and Clothing1M, which are real-world noisy, large-scale datasets with millions of images. The former are small and simple, while the latter are expensive to train on. There is therefore an opportunity to develop a mid-scale dataset that allows for rapid prototyping but is complex enough to provide useful results for noisy label training.

## fastai's Imagenette - a dataset for rapid prototyping

The idea of mid-scale datasets for rapid prototyping has been explored before. For example, in 2019 fast.ai released the Imagenette and Imagewoof datasets (subsequently updated in 2020), subsets of ImageNet for rapid experimentation and prototyping. They can serve as a small proxy for ImageNet, or as datasets with more complexity than MNIST or CIFAR10 that are still small and simple enough for benchmarking and rapid experimentation. These datasets have been used to test and establish new training techniques like the Mish activation function and the Ranger optimizer (see here). They have also been used in various papers (see here, here, here, here, here, and here). Clearly, they have been quite useful to machine learning researchers and practitioners for testing and comparing new methods. We believe that an analogous dataset could be useful to researchers with modest compute for testing and comparing new methods for addressing label noise.

# Introducing Noisy Imagenette

We introduce Noisy Imagenette, a version of Imagenette (and Imagewoof) with synthetically noisy labels at four noise levels: 1%, 5%, 25%, and 50% incorrect labels. The noisy labels ship with the Imagenette download itself:

```python
from fastai.vision.all import *

# Download (if needed) and extract Imagenette; returns the local dataset path
source = untar_data(URLs.IMAGENETTE)
```

While the regular labels for the Imagenette dataset are given by the names of the image folders, the noisy labels are provided in a separate CSV file, with one column for the image path and one label column for each noise level:

```python
csv_file = pd.read_csv(source/'noisy_imagenette.csv')
csv_file.head()  # preview the first five rows
```

| | path | noisy_labels_0 | noisy_labels_1 | noisy_labels_5 | noisy_labels_25 | noisy_labels_50 | is_valid |
|---|---|---|---|---|---|---|---|
| 0 | train/n02979186/n02979186_9036.JPEG | n02979186 | n02979186 | n02979186 | n02979186 | n02979186 | False |
| 1 | train/n02979186/n02979186_11957.JPEG | n02979186 | n02979186 | n02979186 | n02979186 | n03000684 | False |
| 2 | train/n02979186/n02979186_9715.JPEG | n02979186 | n02979186 | n02979186 | n03417042 | n03000684 | False |
| 3 | train/n02979186/n02979186_21736.JPEG | n02979186 | n02979186 | n02979186 | n02979186 | n03417042 | False |
| 4 | train/n02979186/ILSVRC2012_val_00046953.JPEG | n02979186 | n02979186 | n02979186 | n02979186 | n03394916 | False |
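
As a quick sanity check on these columns, one could measure how many training labels actually differ from the clean ones at each noise level. The sketch below is only illustrative: it assumes `noisy_labels_0` holds the clean labels, that the file is comma-separated with `is_valid` stored as the text `True`/`False`, and it runs on an in-memory copy of the five preview rows rather than the real file. The helper name `empirical_noise_rates` is hypothetical.

```python
import csv
import io

# In-memory stand-in for noisy_imagenette.csv, built from the five preview rows.
# Assumption: the real file is comma-separated with this header.
SAMPLE_CSV = """path,noisy_labels_0,noisy_labels_1,noisy_labels_5,noisy_labels_25,noisy_labels_50,is_valid
train/n02979186/n02979186_9036.JPEG,n02979186,n02979186,n02979186,n02979186,n02979186,False
train/n02979186/n02979186_11957.JPEG,n02979186,n02979186,n02979186,n02979186,n03000684,False
train/n02979186/n02979186_9715.JPEG,n02979186,n02979186,n02979186,n03417042,n03000684,False
train/n02979186/n02979186_21736.JPEG,n02979186,n02979186,n02979186,n02979186,n03417042,False
train/n02979186/ILSVRC2012_val_00046953.JPEG,n02979186,n02979186,n02979186,n02979186,n03394916,False
"""

NOISE_LEVELS = (1, 5, 25, 50)


def empirical_noise_rates(rows):
    """Fraction of training rows whose noisy label differs from the clean one.

    Assumes `noisy_labels_0` is the clean label column.
    """
    train = [r for r in rows if r["is_valid"] == "False"]
    rates = {}
    for pct in NOISE_LEVELS:
        flipped = sum(r[f"noisy_labels_{pct}"] != r["noisy_labels_0"] for r in train)
        rates[pct] = flipped / len(train) if train else 0.0
    return rates


rows = list(csv.DictReader(io.StringIO(SAMPLE_CSV)))
rates = empirical_noise_rates(rows)
print(rates)
```

Five rows are far too few to reflect the nominal noise levels, so the printed fractions only show the mechanics. On the full CSV the same comparison should land close to 1%, 5%, 25%, and 50%.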

The code that generates these noisy labels is provided in this Jupyter notebook. We have also updated fastai's train_imagenette.py to use the new noisy labels. To train on the Noisy Imagenette dataset with this script, pass the --pct-noise argument with the desired noise level.
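
For readers who want a feel for what synthetic label noise looks like, here is a minimal sketch of one common scheme, symmetric noise, in which a fixed fraction of labels is flipped to a different class chosen uniformly at random. This is an assumption for illustration and not necessarily the procedure used in the notebook; the function name `add_label_noise` and the toy label list are hypothetical.

```python
import random


def add_label_noise(labels, classes, pct_noise, seed=0):
    """Return a copy of `labels` with pct_noise percent of entries flipped.

    Each selected label is replaced by a different class chosen uniformly
    at random, so every flipped label is guaranteed to be incorrect.
    """
    rng = random.Random(seed)
    n_flip = round(len(labels) * pct_noise / 100)
    flip_idx = rng.sample(range(len(labels)), n_flip)  # distinct positions
    noisy = list(labels)
    for i in flip_idx:
        wrong_classes = [c for c in classes if c != labels[i]]
        noisy[i] = rng.choice(wrong_classes)
    return noisy


# Toy example: 200 clean labels spread evenly over four class IDs.
classes = ["n02979186", "n03000684", "n03417042", "n03394916"]
clean = [classes[i % len(classes)] for i in range(200)]
noisy = add_label_noise(clean, classes, pct_noise=25)

n_changed = sum(a != b for a, b in zip(clean, noisy))
print(n_changed / len(clean))  # 0.25
```

Sampling positions without replacement and excluding the original class keeps the realized noise rate exactly at the requested level, which makes results at different levels easier to compare.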

Note: The validation set remains clean and its labels are not changed. While technically the accuracy metric is robust to noise, I believe it’s simpler to use a clean validation set to clearly see whether a model is learning appropriate decision boundaries on the ground truth.
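
To illustrate the robustness remark, the sketch below computes the accuracy one would expect to measure against noisy labels under a simplifying assumption: symmetric noise, where each label is flipped with probability `noise_rate` to one of the other classes uniformly at random, independently of the model's prediction. The actual noise in this dataset may not follow that model exactly, and the function name `expected_noisy_accuracy` is hypothetical.

```python
def expected_noisy_accuracy(clean_acc, noise_rate, n_classes):
    """Expected agreement with noisy labels under symmetric label noise.

    A correct prediction matches the noisy label only if the label was not
    flipped. A wrong prediction matches only if the label was flipped to
    exactly the predicted class, which is one of (n_classes - 1) options.
    """
    agree_when_right = clean_acc * (1 - noise_rate)
    agree_when_wrong = (1 - clean_acc) * noise_rate / (n_classes - 1)
    return agree_when_right + agree_when_wrong


# Imagenette has 10 classes; compare three models at 25% label noise.
for acc in (0.70, 0.80, 0.90):
    noisy_acc = expected_noisy_accuracy(acc, noise_rate=0.25, n_classes=10)
    print(f"clean accuracy {acc:.2f} -> expected noisy accuracy {noisy_acc:.4f}")
```

Under this model the measured accuracy is an increasing function of the clean accuracy as long as the noise rate stays below (K - 1) / K for K classes, so model rankings are preserved in expectation. The absolute numbers are compressed, though, which is one reason a clean validation set is easier to interpret.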