Cleaning an Image Dataset: A Step-by-Step Tutorial with Fastdup on Kaggle

May 17, 2023

Fastdup provided by Visual Layer.

Building a clean and reliable image dataset is crucial for successful machine learning and computer vision projects. However, image datasets often suffer from issues such as duplicate images, corrupt files, inconsistent formats, and noisy data. Fastdup is a powerful free tool designed to rapidly extract valuable insights from image and video datasets. In this tutorial, we will use it to clean an image dataset. Fastdup not only identifies duplicate images but also surfaces other common issues such as broken files, outliers, and dark, bright, or blurry images, helping you build a high-quality dataset for your AI projects.

In this post we will see how to use Fastdup to clean an image dataset in a Kaggle notebook. Kaggle notebook link: https://www.kaggle.com/code/joelbhaskarnadar/visuallayercleaningimagedataset

Installing Fastdup: To get started, make sure you are in the working directory of your Kaggle notebook and have Fastdup installed.

Once installed, import Fastdup into your notebook using the following command:
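The install and import steps might look like the sketch below. In a Kaggle cell, `!pip install fastdup` followed by `import fastdup` is the usual shortcut. The helper here is a hypothetical plain-Python equivalent (the name `ensure_installed` is not part of Fastdup) that installs the package only when it cannot be imported yet.

```python
import importlib
import importlib.util
import subprocess
import sys


def ensure_installed(package="fastdup"):
    """Import `package`, installing it with pip first if it is missing."""
    if importlib.util.find_spec(package) is None:
        # Same effect as running `pip install <package>` in a notebook cell.
        subprocess.check_call([sys.executable, "-m", "pip", "install", package])
        # Make the freshly installed package visible to the import system.
        importlib.invalidate_caches()
    return importlib.import_module(package)
```

In the notebook you would then call `fastdup = ensure_installed("fastdup")`.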

Downloading the Image Dataset: We will be using the Food-101 dataset, which consists of 101 food classes with 1,000 images per class. Download and extract the dataset by running:
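A minimal sketch of the download step using only the standard library. The URL below is an assumption about where the Food-101 archive is hosted, so verify it before relying on it; the function name and the destination folder are illustrative. The archive is large, so the download is skipped when the file already exists.

```python
import tarfile
import urllib.request
from pathlib import Path

# Assumed download location for Food-101; verify it before relying on it.
FOOD101_URL = "http://data.vision.ee.ethz.ch/cvl/food-101.tar.gz"


def download_and_extract(url=FOOD101_URL, dest_dir="."):
    """Download a tar archive into dest_dir (if not present) and unpack it there."""
    dest = Path(dest_dir)
    dest.mkdir(parents=True, exist_ok=True)
    archive = dest / url.rsplit("/", 1)[-1]  # e.g. food-101.tar.gz
    if not archive.exists():
        urllib.request.urlretrieve(url, str(archive))
    with tarfile.open(archive) as tar:
        tar.extractall(dest)
    return dest
```

Calling `download_and_extract()` in the notebook's working directory should leave the images under a `food-101/images/` folder, assuming the archive keeps that layout.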

Run Fastdup: With the dataset folder in place, let’s run it:
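A minimal sketch of this step, assuming the Fastdup 1.x Python API (`fastdup.create` followed by `run`) and assuming the images were extracted to `food-101/images/`. The function name and the `fastdup_work_dir/` output folder are illustrative choices, not taken from the original notebook.

```python
def run_fastdup(input_dir="food-101/images/", work_dir="fastdup_work_dir/"):
    """Create a fastdup object for input_dir and analyze the images."""
    import fastdup  # imported here so the function can be defined before installing

    # work_dir is where fastdup stores its artifacts (assumed folder name).
    fd = fastdup.create(work_dir=work_dir, input_dir=input_dir)
    fd.run()  # can take a while on a dataset of this size
    return fd
```

In the notebook, `fd = run_fastdup()` gives you the object that the later steps operate on.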

Once the run completes you can get a summary of the run with:
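As a sketch, and assuming `fd` is the Fastdup object from the run step, the summary is a single method call. The wrapper name is illustrative; in a notebook cell you would normally just write `fd.summary()`.

```python
def show_summary(fd):
    """Print fastdup's summary of the run and pass its return value through."""
    # fd.summary() prints an overview of the dataset and detected issues.
    return fd.summary()
```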

This command gives a clear overview of the dataset present in the directory.

Broken Images: Let’s find out whether the dataset contains any.

Luckily, there are no broken images in this dataset.

Moving on, let’s create a list of the broken images, even though the count is 0 😂. To do that, run this command:
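The broken-image check and the list built from it can be sketched as follows. This assumes the Fastdup 1.x method `invalid_instances()`, which returns a pandas DataFrame with a `filename` column; the function name is illustrative.

```python
def list_broken_images(fd):
    """Return the filenames fastdup flagged as invalid (broken) images."""
    broken_df = fd.invalid_instances()  # one row per file that failed to load
    return broken_df["filename"].to_list()
```

For this dataset, `len(list_broken_images(fd))` should come out as 0, matching the result reported above.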

Duplicates: The best part is that we get a clear view of the duplicate images. Let’s visualize duplicate image pairs by running this command:
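A sketch of the visualization call, assuming the Fastdup 1.x gallery API exposed under `fd.vis`. The wrapper name is illustrative; in a notebook the gallery is rendered inline.

```python
def show_duplicate_pairs(fd):
    """Render fastdup's gallery of duplicate and near-duplicate image pairs."""
    return fd.vis.duplicates_gallery()
```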

Image Clusters: Visualize image clusters with the following command:
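Clusters of similar images (connected components, in Fastdup's terms) have their own gallery. The sketch assumes the Fastdup 1.x method `component_gallery()`; the wrapper name is illustrative.

```python
def show_image_clusters(fd):
    """Render fastdup's gallery of image clusters (connected components)."""
    return fd.vis.component_gallery()
```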

Let’s make a list of duplicates using the following commands.

First, write a utility function to get the clusters and run it:
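One way to write such a utility is sketched below. It assumes that `fd.connected_components()` returns a tuple whose first element is a pandas DataFrame with `component_id` and `filename` columns, as in Fastdup 1.x. The grouping is done in plain Python rather than with a pandas `groupby`, which is a simplification and not necessarily how the original notebook does it.

```python
from collections import defaultdict


def get_clusters(components_df, min_count=2):
    """Group filenames by component_id; keep clusters with >= min_count images."""
    groups = defaultdict(list)
    for row in components_df.to_dict("records"):
        groups[row["component_id"]].append(row["filename"])
    clusters = [files for files in groups.values() if len(files) >= min_count]
    # Largest clusters first.
    return sorted(clusters, key=len, reverse=True)


def load_clusters(fd, min_count=2):
    """Fetch connected components from fastdup and turn them into clusters."""
    components_df, _ = fd.connected_components()
    return get_clusters(components_df, min_count=min_count)
```

Each returned cluster is a list of filenames that Fastdup considers near-identical.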

Now let’s keep one image from each cluster and mark the rest for removal. The discarded images make up the list of duplicates:
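The keep-one rule can be sketched in plain Python. The input is assumed to be a list of clusters, each a list of filenames; keeping the first file of each cluster is an arbitrary but simple choice.

```python
def split_clusters(clusters):
    """Keep the first image of every cluster and treat the rest as duplicates."""
    images_to_keep = []
    list_of_duplicates = []
    for files in clusters:
        images_to_keep.append(files[0])
        list_of_duplicates.extend(files[1:])
    return images_to_keep, list_of_duplicates


images_to_keep, list_of_duplicates = split_clusters(
    [["a.jpg", "b.jpg", "c.jpg"], ["d.jpg", "e.jpg"]]
)
print(list_of_duplicates)  # ['b.jpg', 'c.jpg', 'e.jpg']
```

The filenames in the usage lines are made up for illustration; on the real dataset you would pass in the clusters obtained from Fastdup.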

Outliers: Next, we visualize the outliers in a gallery.

List of Outliers: Let’s get the outliers DataFrame and build the list from it:
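Both outlier steps can be sketched as below, assuming the Fastdup 1.x methods `vis.outliers_gallery()` and `outliers()`, and assuming the returned DataFrame has `filename_outlier` and `distance` columns. The 0.68 cutoff is an illustrative assumption, not a value from this article; lower `distance` values mean an image is less similar to its nearest neighbor.

```python
def show_outliers(fd):
    """Render fastdup's gallery of images that look unlike the rest."""
    return fd.vis.outliers_gallery()


def list_outliers(fd, max_distance=0.68):
    """Return filenames whose similarity to their nearest neighbor is low."""
    rows = fd.outliers().to_dict("records")
    names = [r["filename_outlier"] for r in rows if r["distance"] < max_distance]
    # An image can appear in several rows; keep each name once, in order.
    return list(dict.fromkeys(names))
```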

Dark, Bright, Blurry Images:

We can visualize the dark, bright, and blurry images using this code:
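A sketch of the three galleries, assuming the Fastdup 1.x method `vis.stats_gallery(metric=...)` with the metric names `dark`, `bright`, and `blur`. The wrapper function is illustrative.

```python
def show_quality_galleries(fd, metrics=("dark", "bright", "blur")):
    """Render one fastdup stats gallery per image-quality metric."""
    return [fd.vis.stats_gallery(metric=metric) for metric in metrics]
```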

DataFrame of dark images: use this command.


If an image has a mean pixel value below 13, we conclude it’s a dark image; run this command to check. The code contains several more conditions like this one.

List of the dark, bright, and blurry images: use this command.
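The threshold checks can be sketched as follows, assuming that `fd.img_stats()` returns a pandas DataFrame with `filename`, `mean`, and `blur` columns as in Fastdup 1.x. Only the dark threshold (mean below 13) comes from the text above; the bright and blur thresholds are assumptions that you should tune for your own data.

```python
DARK_MEAN_MAX = 13       # from the text: a mean below 13 counts as dark
BRIGHT_MEAN_MIN = 220.5  # assumption, tune for your data
BLUR_SCORE_MAX = 50      # assumption, tune for your data


def split_by_quality(fd):
    """Return (dark, bright, blurry) filename lists from fastdup image stats."""
    rows = fd.img_stats().to_dict("records")
    dark = [r["filename"] for r in rows if r["mean"] < DARK_MEAN_MAX]
    bright = [r["filename"] for r in rows if r["mean"] > BRIGHT_MEAN_MIN]
    blurry = [r["filename"] for r in rows if r["blur"] < BLUR_SCORE_MAX]
    return dark, bright, blurry
```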

Let’s come to the final part: the report for this image dataset.
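A simple report can be sketched as a count per problem category plus the number of distinct problem images. The helper and its key names are hypothetical and the original notebook may format its report differently. The input maps a category name to the list of filenames collected in the earlier steps.

```python
def summarize_problems(problem_lists):
    """Count images per problem category and the distinct problem images overall."""
    counts = {name: len(files) for name, files in problem_lists.items()}
    # The same image can be flagged in several categories, so count it once.
    counts["unique_total"] = len(set().union(*problem_lists.values()))
    return counts
```

For example, passing `{"duplicates": list_of_duplicates, "broken": list_of_broken_images}` would return the size of each list together with the overall number of images to review.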

In case I have missed any step, do refer to my Kaggle notebook.

Cleaning an image dataset is a crucial step in ensuring the accuracy and reliability of your machine learning or computer vision projects. Fastdup simplifies this process by efficiently detecting duplicate images, broken files, outliers, and dark, bright, or blurry images so that you can remove them. By following this step-by-step tutorial, you can use Fastdup to clean your image dataset, paving the way for more accurate and reliable AI applications. Embrace the power of Fastdup and unlock the full potential of your image datasets.

My LinkedIn Profile — https://www.linkedin.com/in/joelnadar123

My YouTube Channel — https://www.youtube.com/@joelnadarai

My Twitter Page — https://twitter.com/joelnadarai

Thank you. Feel free to contact me…

References:

GitHub - visual-layer/fastdup: fastdup is a powerful free tool designed to rapidly extract valuable… (github.com)