Cleaning an Image Dataset: A Step-by-Step Tutorial with Fastdup on Kaggle

JOEL BHASKAR NADAR
5 min read · May 17, 2023

Fastdup provided by Visual Layer.

Image by Author

Building a clean and reliable image dataset is crucial for successful machine learning and computer vision projects. However, image datasets often suffer from issues such as duplicate images, corrupt files, inconsistent formats, and noisy data. Fastdup is a free tool designed to rapidly extract insights from image and video datasets. In this tutorial, we will use it to clean an image dataset: besides identifying duplicate images, Fastdup also flags broken files, outliers, and dark, bright, or blurry images, which helps you build a high-quality dataset for your AI projects.

In this post we will see how to use Fastdup to clean an image dataset in a Kaggle notebook. Kaggle notebook link: https://www.kaggle.com/code/joelbhaskarnadar/visuallayercleaningimagedataset

Installing Fastdup: To get started, make sure you are in the working directory of your Kaggle notebook and that Fastdup is installed.

KaggleNotebook

Once installed, import Fastdup into your notebook using the following command:

KaggleNotebook
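
The install and import cells above were screenshots, so here is only a minimal sketch of the two steps in plain Python. In a notebook the install is usually a one-line `pip install fastdup` cell; `ensure_installed` is a hypothetical helper name of mine, not part of Fastdup.

```python
import importlib
import importlib.util
import subprocess
import sys


def ensure_installed(package: str) -> bool:
    """Install `package` with pip if it cannot be imported yet.

    Returns True when the package is importable afterwards.
    """
    if importlib.util.find_spec(package) is None:
        # Use the notebook's own interpreter so the package lands in the
        # environment the kernel is actually running in.
        subprocess.check_call(
            [sys.executable, "-m", "pip", "install", "-q", package]
        )
        importlib.invalidate_caches()
    return importlib.util.find_spec(package) is not None


# Usage in the notebook (the install step needs internet access):
# if ensure_installed("fastdup"):
#     import fastdup
#     print(fastdup.__version__)
```

Checking with `find_spec` first means that re-running the cell does not reinstall the package every time.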

Downloading the Image Dataset: We will use the Food-101 dataset, which consists of 101 food classes with 1,000 images per class. Download and extract the dataset by running:

KaggleNotebook
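
Since the download cell is not reproduced here, the following is a rough sketch of downloading and unpacking a `.tar.gz` archive with the standard library. The URL is my assumption about where the Food-101 archive is hosted, and both function names are hypothetical; please check them against the Kaggle notebook.

```python
import tarfile
import urllib.request
from pathlib import Path

# Assumed location of the Food-101 archive; verify before relying on it.
FOOD101_URL = "http://data.vision.ee.ethz.ch/cvl/food-101.tar.gz"


def extract_archive(archive_path: str, dest_dir: str) -> Path:
    """Extract a .tar.gz archive into dest_dir and return that directory."""
    dest = Path(dest_dir)
    dest.mkdir(parents=True, exist_ok=True)
    with tarfile.open(archive_path, "r:gz") as archive:
        archive.extractall(dest)
    return dest


def download_and_extract(url: str = FOOD101_URL, dest_dir: str = ".") -> Path:
    """Download the archive (a multi-gigabyte file) and unpack it."""
    dest = Path(dest_dir)
    dest.mkdir(parents=True, exist_ok=True)
    archive_path = dest / url.rsplit("/", 1)[-1]
    if not archive_path.exists():  # skip the download on re-runs
        urllib.request.urlretrieve(url, archive_path)
    return extract_archive(str(archive_path), dest_dir)


# Usage in the notebook:
# download_and_extract()
```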

Run Fastdup: With the dataset folder in place, let’s run Fastdup:

KaggleNotebook
From Visual Layer
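
As a sketch of what the run cell could look like: the snippet below wraps `fastdup.create(...)` and `fd.run()` in a small function. The wrapper name `run_fastdup`, the work directory name, and the image folder in the usage comment are assumptions of mine, not taken from the notebook.

```python
from pathlib import Path


def run_fastdup(input_dir: str, work_dir: str = "fastdup_work_dir"):
    """Analyze the images under input_dir and return the fastdup object."""
    if not Path(input_dir).is_dir():
        raise FileNotFoundError(f"Image folder not found: {input_dir}")

    # Imported lazily so the folder check above works even before
    # fastdup has been installed.
    import fastdup

    fd = fastdup.create(work_dir=work_dir, input_dir=input_dir)
    fd.run()
    return fd


# Usage (folder name is an assumption about where the archive was extracted):
# fd = run_fastdup("food-101/images")
```

Failing early on a wrong folder name saves waiting for a run that would find no images.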

Once the run completes you can get a summary of the run with:

KaggleNotebook

This command gives a clear overview of the dataset found in the input directory.

Broken Images: Let's find out whether the dataset contains any broken images by running the command below.

KaggleNotebook

Luckily, there are no broken images in this dataset.

Moving on, let's create a list of the broken images, even though the count is 0 😂. To do that, run this command.

KaggleNotebook
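
A minimal sketch of both broken-image steps, assuming `fd` is the Fastdup object from a completed run and that the table returned by `fd.invalid_instances()` has a `filename` column (the column name is my assumption, and `broken_image_paths` is a hypothetical helper):

```python
def broken_image_paths(invalid_df, column: str = "filename") -> list:
    """Return the file paths listed in the invalid-instances table.

    Works with a pandas DataFrame or any mapping of column name to values.
    """
    return list(invalid_df[column])


# Usage with a fastdup object `fd` from a completed run:
# broken_images_df = fd.invalid_instances()
# list_of_broken_images = broken_image_paths(broken_images_df)
# print(f"Found {len(list_of_broken_images)} broken images")
```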

Duplicates: Let’s visualize duplicate image pairs by running this command. The best part is that we get a clear view of the duplicate images.

KaggleNotebook

Image Clusters: Visualize image clusters with the following commands.

KaggleNotebook
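
The two gallery cells above can be sketched as follows. I am assuming the calls are `fd.vis.duplicates_gallery()` and `fd.vis.component_gallery()`; the `show_gallery` dispatcher is a hypothetical convenience of mine, and calling the two methods directly works just as well.

```python
# Gallery methods on fd.vis, keyed by the section names used in this post.
GALLERIES = {
    "duplicates": "duplicates_gallery",
    "clusters": "component_gallery",
}


def show_gallery(fd, kind: str):
    """Render one of the galleries for a completed fastdup run."""
    if kind not in GALLERIES:
        raise ValueError(
            f"Unknown gallery {kind!r}; choose from {sorted(GALLERIES)}"
        )
    return getattr(fd.vis, GALLERIES[kind])()


# Usage with a fastdup object `fd` from a completed run:
# show_gallery(fd, "duplicates")
# show_gallery(fd, "clusters")
```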

Let's make a list of duplicates using the following command:

KaggleNotebook

Write a utility function to get the clusters, then run it.

KaggleNotebook
KaggleNotebook
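
Here is one way such a utility function could look. This is a dependency-free sketch, not the notebook's code: it groups rows by `component_id` and collects the `filename` values, where both column names and the tuple returned by `fd.connected_components()` are assumptions on my side. In the notebook the same grouping could be done with a pandas `groupby`.

```python
from collections import defaultdict


def get_clusters(records, min_count: int = 2):
    """Group connected-component rows into clusters of file paths.

    `records` is an iterable of dicts with at least `component_id` and
    `filename` keys. Returns a list of (component_id, filenames) tuples,
    largest cluster first, keeping only clusters with at least
    `min_count` images.
    """
    grouped = defaultdict(list)
    for row in records:
        grouped[row["component_id"]].append(row["filename"])

    clusters = [
        (component_id, files)
        for component_id, files in grouped.items()
        if len(files) >= min_count
    ]
    clusters.sort(key=lambda item: len(item[1]), reverse=True)
    return clusters


# Usage with fastdup (return shape and column names are assumptions):
# connected_components_df, _ = fd.connected_components()
# clusters = get_clusters(connected_components_df.to_dict("records"))
```

With `min_count=2`, images that have no near-duplicate partner are dropped, so every remaining cluster contains at least one candidate for removal.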

Now let’s keep one image from each cluster and remove the rest. Run the code cell below:

KaggleNotebook
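
The keep-one-per-cluster idea can be sketched like this. The function name is hypothetical, and keeping the first file of each cluster is an arbitrary choice; any other rule (largest file, highest resolution) would fit the same structure.

```python
def split_keep_and_discard(clusters):
    """Keep the first file of each cluster and mark the rest as duplicates.

    `clusters` is an iterable of lists of file paths, one list per cluster.
    Returns (files_to_keep, list_of_duplicates).
    """
    keep, duplicates = [], []
    for files in clusters:
        if not files:
            continue  # ignore empty clusters
        keep.append(files[0])
        duplicates.extend(files[1:])
    return keep, duplicates


# Usage: pass one list of file paths per cluster, for example
# cluster_images_to_keep, list_of_duplicates = split_keep_and_discard(
#     [["a.jpg", "a_copy.jpg"], ["b.jpg", "b_copy.jpg", "b_copy2.jpg"]]
# )
```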

For the list of duplicates, run this command:

KaggleNotebook

Outliers: Visualize the outliers with the following commands.

KaggleNotebook

List of Outliers: Let’s first get the outliers DataFrame:

KaggleNotebook
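
A sketch of turning the outliers table into a list of files. I am assuming that `fd.outliers()` returns rows with `filename_outlier` and `distance` columns and that a low `distance` score marks a stronger outlier; the 0.68 cutoff is purely illustrative and should be tuned by looking at the gallery.

```python
def outlier_paths(records, max_distance: float = 0.68):
    """Return outlier files whose `distance` score is below max_distance.

    `records` is an iterable of dicts with `filename_outlier` and
    `distance` keys. Each file is listed once, in first-seen order.
    """
    selected = [
        row["filename_outlier"]
        for row in records
        if row["distance"] < max_distance
    ]
    return list(dict.fromkeys(selected))  # de-duplicate, keep order


# Usage with fastdup (method and column names are assumptions):
# fd.vis.outliers_gallery()
# outlier_df = fd.outliers()
# list_of_outliers = outlier_paths(outlier_df.to_dict("records"))
```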

Dark, Bright, Blurry Images:

We can visualize the dark, bright, and blurry images using this code:

KaggleNotebook
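
As a hedged sketch of the visualization cell: I am assuming the galleries come from `fd.vis.stats_gallery(metric=...)` with the metric names `dark`, `bright`, and `blur`. The loop helper is a hypothetical convenience; three separate calls do the same thing.

```python
def show_quality_galleries(fd, metrics=("dark", "bright", "blur")):
    """Render one stats gallery per metric and return the results in order."""
    return [fd.vis.stats_gallery(metric=metric) for metric in metrics]


# Usage with a fastdup object `fd` from a completed run:
# show_quality_galleries(fd)
# show_quality_galleries(fd, metrics=("blur",))  # only the blurry images
```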

DataFrame of dark images: Use this command.

KaggleNotebook

If an image has a mean value below 13, we conclude that it is a dark image; run this command to check. The code contains several similar conditions.

KaggleNotebook
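
The threshold checks can be sketched with a small filter over the image statistics. Only the dark cutoff (mean below 13) comes from the text above; the bright and blur cutoffs in the usage comment, the `mean` and `blur` column names, and `fd.img_stats()` as the source of the table are assumptions of mine.

```python
def filter_by_stat(records, column: str, *, below=None, above=None):
    """Keep rows whose `column` value passes the given thresholds.

    `below` keeps values strictly less than it, `above` keeps values
    strictly greater than it; both can be combined.
    """
    selected = []
    for row in records:
        value = row[column]
        if below is not None and not value < below:
            continue
        if above is not None and not value > above:
            continue
        selected.append(row)
    return selected


# Usage with fastdup (method name, columns, and two cutoffs are assumptions):
# stats = fd.img_stats().to_dict("records")
# dark_images = filter_by_stat(stats, "mean", below=13)
# bright_images = filter_by_stat(stats, "mean", above=220.5)
# blurry_images = filter_by_stat(stats, "blur", below=50)
# list_of_dark_images = [row["filename"] for row in dark_images]
```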

List of the dark, bright, and blurry images: Use these commands.

KaggleNotebook
KaggleNotebook
KaggleNotebook

Finally, let's come to the last part and produce the report for this image dataset.
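
One possible shape for such a report, as a sketch: count the files in each problem category and merge them into a single list without repeats. The function and category names are hypothetical; the input is whatever lists of file paths you collected in the earlier steps.

```python
def summarize_problems(**problem_lists):
    """Summarize problem images per category.

    Each keyword argument is a category name mapped to a list of file
    paths. Returns (counts_per_category, unique_problem_files).
    """
    counts = {name: len(files) for name, files in problem_lists.items()}
    # A file can appear in several categories, so de-duplicate in order.
    unique_files = list(
        dict.fromkeys(
            path for files in problem_lists.values() for path in files
        )
    )
    return counts, unique_files


# Usage with small example lists:
# counts, problem_images = summarize_problems(
#     broken=[],
#     duplicates=["a_copy.jpg"],
#     dark=["night.jpg", "a_copy.jpg"],
# )
# for name, count in counts.items():
#     print(f"{name}: {count}")
# print(f"Total unique problem images: {len(problem_images)}")
```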

In case I have missed any step, please refer to my Kaggle notebook.

Cleaning an image dataset is a crucial step in ensuring the accuracy and reliability of your machine learning or computer vision projects. Fastdup simplifies this process by efficiently detecting duplicate images, broken files, outliers, and dark, bright, or blurry images, so that you can decide what to remove. By following this step-by-step tutorial, you can use Fastdup to clean your own image dataset, paving the way for more accurate and reliable AI applications. Embrace the power of Fastdup and unlock the full potential of your image datasets.

My LinkedIn Profile — https://www.linkedin.com/in/joelnadar123

My YouTube Channel — https://www.youtube.com/@joelnadarai

My Twitter Page — https://twitter.com/joelnadarai

Thank you! Feel free to contact me…

Written by JOEL BHASKAR NADAR

Computer Vision || Data Analytics & Data Science || Object Detection || Segmentation || Power BI || SQL ||
