How I Won the First Data-centric AI Competition: Pierre-Louis Bescond

In this blog post, Pierre-Louis Bescond, one of the winners of the Data-Centric AI Competition, describes techniques and strategies that led to victory. Participants received a fixed model architecture and a dataset of 1,500 handwritten Roman numerals. Their task was to optimize model performance solely by improving the dataset and dividing it into training and validation sets. The dataset size was capped at 10,000. You can find more details about the competition here.

  1. Your personal journey in AI

Back in 2018, a Data Scientist I met advised me to follow Andrew Ng’s course on Machine Learning. I completed it in 6 weeks instead of 11; that was pure addiction!

From then on, I started what I call “daily learning”: making sure that I had gained one new skill or understanding by the end of each day, by reading articles, testing frameworks, libraries, snippets, etc. I have a Mechanical Engineering Degree with a specialization in Industrial IT, and I have always been very interested in Computer Science. Ten years ago, I specialized in Statistical Process Control applied to Manufacturing processes and quickly realized how data and statistics could deliver great insights!

The projects I am working on with my team focus on physical-chemical processes modeling, optimization, forecasting, and computer vision.

2. Why you decided to participate in the competition

During my first Machine Learning experiments, I learned the hard way that there was no sense in chasing terabytes of data to feed complex models. I remember struggling to deliver to production operators a model that reached the expected accuracy (±2%).

I kept on increasing the volume of data as well as the complexity of the models (from linear regressions… to ensemble trees… to deep learning) but, at some point (and maybe a bit too late!), I understood that their sampling method suffered from a ±5% inaccuracy. The data I was processing was so noisy that, no matter the model, no algorithm would reach the expected performance.

This competition should be an eye-opener for all data scientists: “We should focus on data first, even if this task is not as appealing as jumping immediately on deep-learning models.”

The techniques you used

3. Pictures review

This first task is probably the most demanding: reviewing each picture to check a few criteria. Here are the ones I used:

  • Does this look like a Roman numeral? (If not, we should remove it!)
  • Is the picture correctly labeled? (e.g., a “II” in the “III” folder, or vice versa)
  • What is the number quality? (rating from 1: good to 4: poor)
  • What is the background quality? (same rating as above)
  • What is the font style? (“Arial” or “Roman”)
  • What is the exact format of the number? (“viii” or “VIII”)
  • Could we apply symmetries? (horizontal or vertical symmetries usually suit “I,” “II,” “III,” or “X” but not “i,” “ii,” or “VII”)

Here are three examples of my evaluation (stored in a tabular way):

While reviewing the 3,000 pictures (which took me approximately 2 or 3 hours 😅), I sometimes had a feeling of “déjà vu” and started to wonder whether some duplicates were hidden in the dataset. That would not have been very surprising, so I had to take this into consideration as well.

I also designed a simple function to automatically check the content of the folders, so I could evaluate the results of the different operations I would perform. Like all the other functions used afterward in the notebook, it is stored in a dedicated “dcc_functions.py” file (available on the GitHub repository).
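Here is a minimal sketch of such a helper (not the repository's exact implementation), assuming one sub-folder per class and PNG files; the paths in the usage example are hypothetical:

```python
import os


def check_folders(root_dir):
    """Print and return the number of pictures found in each class sub-folder."""
    counts = {}
    for class_name in sorted(os.listdir(root_dir)):
        class_path = os.path.join(root_dir, class_name)
        if os.path.isdir(class_path):
            # Count only PNG files; adjust the extension if needed
            counts[class_name] = len(
                [f for f in os.listdir(class_path) if f.lower().endswith(".png")]
            )
    total = sum(counts.values())
    for class_name, n in counts.items():
        print(f"{class_name:>5}: {n} pictures")
    print(f"Total: {total} pictures")
    return counts


# Example usage (hypothetical folder layout):
# check_folders("data/train")
# check_folders("data/val")
```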

Here is the output on the initial dataset:

Dataset cleaning

4. Noise removal

I identified approximately 260 pictures to be removed (the corresponding list is stored in an Excel file on the GitHub repo). I started by excluding all the pictures that I had identified as pure noise or, at least, too noisy to train the model correctly. This is a personal choice, and each participant probably ended up with a different selection.

5. Duplicates removal

As explained before, I had the feeling that some pictures were the same but manually identifying them was impossible. I knew a few techniques that could help solve this issue:

  • Pairing files with identical sizes… but a lot of false positives would arise
  • Pairing files with identical sizes & configurations (like two “II” or “viii”)
  • Pairing files according to their statistics (using, for ex., PIL’s ImageStat)
  • Pairing files using the Structural Similarity Index (some explanations here)
  • Pairing files according to their “hash” number

As it was not a “life or death” matter, I decided to use the second solution, pairing files with identical sizes and configurations, because it was easy and quick to implement. The script (available here) identified approximately 200 pairs of twin pictures, out of which 53 were genuine duplicates (some examples below):
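The actual script is on the repository; below is only a rough sketch of the idea, assuming one sub-folder per class (the “configuration”) and using identical file sizes within a class as the pairing criterion. Candidate groups still need a quick visual check to separate true duplicates from false positives:

```python
import os
from collections import defaultdict


def find_candidate_duplicates(root_dir):
    """Group pictures of the same class by identical file size.

    Groups with more than one file are candidate duplicates
    that still require a visual confirmation.
    """
    candidates = []
    for class_name in sorted(os.listdir(root_dir)):
        class_path = os.path.join(root_dir, class_name)
        if not os.path.isdir(class_path):
            continue
        by_size = defaultdict(list)
        for file_name in os.listdir(class_path):
            file_path = os.path.join(class_path, file_name)
            by_size[os.path.getsize(file_path)].append(file_path)
        # Keep only the size buckets containing at least two files
        candidates.extend(group for group in by_size.values() if len(group) > 1)
    return candidates


# print(find_candidate_duplicates("data/train"))  # hypothetical path
```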

6. Moving some pictures in the correct folders

No need to spend much time on that one: whenever a picture was mislabeled, I moved it back to the folder it belonged to.

7. Edgy or not edgy?

Before we go further, I’d like to share an interesting finding: including edge cases in the data usually improves a model’s performance, but not always… and I discovered this while reviewing the images.

I did so twice: once when I entered the competition, and a second time when I had a better idea of what to look for in the pictures.

When reviewing the original pictures for the first time, I had excluded many “edgy cases” that seemed too ambiguous to train the model.

But a few weeks after the competition started, I got used to these edge cases and began to consider them differently: “Well, it could be good to include this one to teach the model that this case might happen.” I ended up adding approximately 80 pictures to the dataset.

Counter-intuitively, performance decreased with this new selection that included more edge cases. How come?

One of the participants, Mohamed Mohey, highlighted on the dedicated Discourse thread that the 32×32 transformation (applied to the dataset before the training) would sometimes completely denature the essence of the picture, as shown in the example below:

We can observe that, due to this 32×32 transformation, an apparent “III” becomes a plausible “II,” explaining why some edge cases would not necessarily bring valuable information to the model.

It would probably have been good to review the pictures after a 32×32 transformation, but I did not!
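For anyone who would like to do that review, here is a minimal sketch of how pictures could be previewed after the resize, assuming a plain PIL resize to 32×32 (the competition's exact preprocessing may differ); the file path is hypothetical:

```python
from PIL import Image
import matplotlib.pyplot as plt


def preview_32x32(image_path):
    """Show a picture before and after a 32x32 resize."""
    original = Image.open(image_path).convert("L")
    resized = original.resize((32, 32))

    _, axes = plt.subplots(1, 2, figsize=(6, 3))
    axes[0].imshow(original, cmap="gray")
    axes[0].set_title(f"Original {original.size}")
    axes[1].imshow(resized, cmap="gray")
    axes[1].set_title("After 32x32 resize")
    for ax in axes:
        ax.axis("off")
    plt.show()


# preview_32x32("data/train/iii/ambiguous_example.png")  # hypothetical file
```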

8. Using the “label book” pictures to train the model

The organizers from DeepLearning.ai had provided a set of 52 pictures, not present in the “train” or “validation” folders, to evaluate our model’s performance once the ResNet50 training was over.

It was a good way to understand how the model would perform on the final, hidden dataset. But, after looking at the scores displayed on the leaderboard, I deduced that the final evaluation on the hidden dataset included 2,420 pictures (see the corresponding notebook here). In that case, the 52 pictures were not very representative!

So, I simply included these pictures in my training folder! The more, the merrier 😁.

9. Evaluating the impact of the augmentation techniques

As you might know, it is quite common to use augmentation techniques on a dataset of pictures to help deep learning models identify the features that allow them to infer the classes correctly.

I decided to consider a few of them:

  • Horizontal and Vertical Symmetries
  • Clockwise and Anti-clockwise rotations (10° and 20°)
  • Horizontal and Vertical Translations
  • Cropping the white areas in the pictures
  • Adding synthetic “salt and pepper” noise
  • Transferring the noise of some pictures to others

10. Implementing the custom functions

The first functions are quite simple and easily implemented with PIL, OpenCV, or even “packaged solutions” such as ImgAug. I thought it would be more interesting to share some tips regarding some of the custom functions I had designed 😀.
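As an illustration of those simple ones, here is a minimal sketch using plain PIL (this is not the repository's code, and the file paths are hypothetical):

```python
from PIL import Image


def augment_simple(image: Image.Image) -> dict:
    """Return a few basic augmentations of a grayscale picture using plain PIL."""
    white = 255  # fill colour for the areas uncovered by rotations/translations
    return {
        "horizontal_flip": image.transpose(Image.FLIP_LEFT_RIGHT),
        "vertical_flip": image.transpose(Image.FLIP_TOP_BOTTOM),
        "rotate_plus_10": image.rotate(10, fillcolor=white),
        "rotate_minus_20": image.rotate(-20, fillcolor=white),
        # Translate 3 pixels right and 2 pixels down via an affine transform
        "translate": image.transform(
            image.size, Image.AFFINE, (1, 0, -3, 0, 1, -2), fillcolor=white
        ),
    }


# variants = augment_simple(Image.open("data/train/vii/example.png").convert("L"))
# variants["rotate_plus_10"].save("vii_rot10.png")
```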

11. Squared Cropping Function

The cropping operation is an interesting one! As the picture will, ultimately, be converted to a 32×32 picture, it might be better to zoom in on the area where the number is located.

However, if the number does not have a “squared” shape, the result could be distorted when converted to 32×32 (as shown below). I redesigned the function so that the cropped output always has a square shape, avoiding this distortion effect.
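Here is a simplified sketch of the idea (not the exact function from the repository): crop to the ink’s bounding box, then extend the shorter side so the box becomes square. The `threshold` and `margin` parameters are illustrative, and the box may still be clipped at the picture borders:

```python
import numpy as np
from PIL import Image


def square_crop(image: Image.Image, threshold: int = 200, margin: int = 4) -> Image.Image:
    """Crop to the numeral's bounding box, then pad the box into a square.

    Pixels darker than `threshold` are considered ink. The square shape keeps
    the aspect ratio intact when the picture is later resized to 32x32.
    """
    grayscale = np.array(image.convert("L"))
    ys, xs = np.where(grayscale < threshold)
    if len(xs) == 0:  # blank picture: nothing to crop
        return image

    left, right = xs.min(), xs.max()
    top, bottom = ys.min(), ys.max()

    # Make the bounding box square by extending its shorter side
    side = max(right - left, bottom - top) + 2 * margin
    center_x, center_y = (left + right) // 2, (top + bottom) // 2
    half = side // 2

    box = (
        max(int(center_x - half), 0),
        max(int(center_y - half), 0),
        min(int(center_x + half), image.width),
        min(int(center_y + half), image.height),
    )
    return image.crop(box)


# cropped = square_crop(Image.open("data/train/viii/example.png"))  # hypothetical file
```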

12. “Salt and Pepper” Function

As the background is probably not always plain white on the final evaluation dataset, I tried to augment pictures by adding a synthetic background.

I used a “salt & pepper” function, which randomly inserts “0”s and “1”s into the NumPy arrays describing the pictures.
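A minimal sketch of such a function, working on 8-bit grayscale arrays (so the “0”s and “1”s become 0 and 255); the `amount` parameter is illustrative:

```python
import numpy as np
from PIL import Image


def add_salt_and_pepper(image: Image.Image, amount: float = 0.05) -> Image.Image:
    """Randomly force a fraction of pixels to black (0) or white (255)."""
    pixels = np.array(image.convert("L"))
    # Draw one uniform value per pixel and split it into "pepper" and "salt" masks
    noise = np.random.random(pixels.shape)
    pixels[noise < amount / 2] = 0        # pepper: black pixels
    pixels[noise > 1 - amount / 2] = 255  # salt: white pixels
    return Image.fromarray(pixels)


# noisy = add_salt_and_pepper(Image.open("data/train/ix/example.png"), amount=0.1)
```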

13. Background Noise Transfer Function

I was not fully happy with the results of the “Salt and Pepper” function, as the noise was always homogeneous. I imagined another way to add noise to the pictures.

I recycled some of the pictures I had initially considered unreadable and turned them into a “noisy background” basis. There were also some pictures with a “heavy background”; I removed the number (as shown below) to get more samples.

It provided me with a “noisy backgrounds bank” of 10 pictures that I added randomly to some pictures after applying horizontal or vertical symmetries.
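One plausible way to sketch this blending (again, not the exact function from the repository): since both pictures are dark ink on a light background, a pixel-wise minimum keeps the numeral and the background noise. The file names in the usage example are hypothetical:

```python
import random

import numpy as np
from PIL import Image


def transfer_background(numeral: Image.Image, backgrounds: list) -> Image.Image:
    """Blend a clean numeral with a randomly chosen noisy background.

    Both pictures are grayscale with dark ink on a light background, so the
    pixel-wise minimum keeps both the numeral and the background noise.
    """
    background = random.choice(backgrounds).resize(numeral.size)

    numeral_arr = np.array(numeral.convert("L"))
    background_arr = np.array(background.convert("L"))
    combined = np.minimum(numeral_arr, background_arr)
    return Image.fromarray(combined)


# backgrounds_bank = [Image.open(p) for p in ["bg_01.png", "bg_02.png"]]  # hypothetical bank
# augmented = transfer_background(Image.open("data/train/vi/example.png"), backgrounds_bank)
```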

14. Choosing the best augmentations

As the number of pictures allowed could not exceed 10,000, I had to know which transformations provided the highest impact.

I decided to benchmark them by comparing a baseline (a cleaned dataset with no transformation) against the individual performance of each augmentation technique (summary below):

We can observe that the rotations, translations, and cropping had a significant impact compared to others, so I decided to focus on those.

And “voilà”!

As the process is stochastic (transformations are applied with a 50% probability and some random parameters), each script iteration will produce a unique combination of pictures.
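A minimal sketch of such a stochastic pipeline, keeping only the rotation and translation steps for brevity (the square cropping sketched earlier would be chained the same way); the parameter ranges are illustrative:

```python
import random

from PIL import Image


def augment_stochastic(image: Image.Image) -> Image.Image:
    """Apply each retained augmentation with a 50% probability and random parameters."""
    white = 255  # fill colour for uncovered areas

    if random.random() < 0.5:
        # Random rotation between -20 and +20 degrees
        image = image.rotate(random.uniform(-20, 20), fillcolor=white)

    if random.random() < 0.5:
        # Random translation of a few pixels in each direction
        dx, dy = random.randint(-4, 4), random.randint(-4, 4)
        image = image.transform(
            image.size, Image.AFFINE, (1, 0, -dx, 0, 1, -dy), fillcolor=white
        )

    # The square cropping sketched above could be chained here the same way.
    return image


# augmented = augment_stochastic(Image.open("data/train/iv/example.png").convert("L"))
```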

Many of my tests produced a performance of around 84%.

There would have been some additional tweaks to consider (like creating my own pictures and adding them to the dataset), but I chose to rely only on the initial pictures provided. Some other participants probably gave it a try!

Link to the GitHub repository: https://github.com/pierrelouisbescond/data-centric-challenge-public.

15. Your advice for other learners

My strongest belief is that learning should be a regular and never-ending process, especially in a fast-rising field like data science. Every day, I make sure to read new articles, sometimes to consolidate knowledge I have already acquired, but also to explore new concepts.

The same applies to testing tools, software, and libraries. I have two ready-to-use datasets (one for regression and one for three-class classification, created with scikit-learn) and a folder of 500 adequately labeled pictures. I know them by heart and use them to benchmark any new library I am willing to test.

And, most importantly, the best way to learn is probably to teach and coach others. Sharing complex ideas or concepts with others forces you to deepen and expand your understanding; that is a key differentiator.