Trash Trash Trash
Updated: May 18
One of the most common sayings in computer science is “garbage in, garbage out.” It summarizes the idea that if you feed a system bad data, the output will be bad too. This is especially true in data science, where the model is built from the input data: if the data is bad, so is the model.
In the era of big data, we enjoy abundance on one end, but on the other end, the data is full of trash.
There are two typical methods to process the trash:
1) Labeling the data – annotating the data is a complicated task. Even experts in the problem’s domain won’t always agree on the label.
Since we are usually handling big data, the labeling is often outsourced to services that might use less-skilled workers. Because of these issues, designing the experiment and producing high-quality labels is time-consuming, expensive, and difficult.
2) Data exploration and cleaning – cleaning the data from mistakes and anomalies is a crucial process.
It is a tedious part of a data science project, involving manual deletion of rows and scripts that try to eliminate specific problems observed in the data. This requires domain knowledge, which is not always easy to master.
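To make the cleaning step more concrete, here is a minimal sketch in plain Python. The rows, the field names (`id`, `value`), and the outlier rule are assumptions made up for illustration. The rule used is a modified z-score based on the median absolute deviation, with the commonly quoted 0.6745 constant and 3.5 cutoff. Real projects need domain-specific rules, so treat this as a starting point rather than a recipe.

```python
from statistics import median


def clean_rows(rows, value_key="value", max_score=3.5):
    """Drop exact duplicates, rows with missing fields, and extreme outliers."""
    # 1) Remove exact duplicate rows, keeping the first occurrence.
    seen, unique = set(), []
    for row in rows:
        key = tuple(sorted(row.items()))
        if key not in seen:
            seen.add(key)
            unique.append(row)

    # 2) Drop rows where any field is missing (None).
    complete = [r for r in unique if all(v is not None for v in r.values())]
    if not complete:
        return []

    # 3) Drop outliers with a modified z-score built on the
    #    median absolute deviation (MAD), which is robust to the outliers themselves.
    values = [r[value_key] for r in complete]
    med = median(values)
    mad = median(abs(v - med) for v in values)
    if mad == 0:
        return complete  # no spread to measure against, keep everything
    return [
        r for r in complete
        if 0.6745 * abs(r[value_key] - med) / mad <= max_score
    ]


# Hypothetical toy data: one duplicate, one missing value, one obvious outlier.
rows = [
    {"id": 1, "value": 10.0},
    {"id": 2, "value": 11.0},
    {"id": 2, "value": 11.0},   # exact duplicate
    {"id": 3, "value": None},   # missing value
    {"id": 4, "value": 9.5},
    {"id": 5, "value": 10.5},
    {"id": 6, "value": 500.0},  # outlier
]

cleaned = clean_rows(rows)
print([r["id"] for r in cleaned])
```

Whether a value like 500 is a mistake or a legitimate rare event is exactly the kind of call that needs domain knowledge, so a script like this is better used to flag candidates for review than to delete rows blindly.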
Prof. Ng, a well-respected AI researcher in both academia and industry, suggests that companies focus on data-centric processes instead of model-centric ones. This ties in directly with the topic of this article. Ng argues that clean data matters more than the model, pointing, for example, to the consistency of the labeling.
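One way to put a number on labeling consistency is to have two annotators label the same items and measure how much they agree beyond chance. The sketch below computes Cohen's kappa in plain Python. The annotator names and labels are invented for illustration, and kappa is just one possible agreement measure, not something the experiment above is known to have used.

```python
from collections import Counter


def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa: agreement between two annotators, corrected for chance."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two non-empty label lists of equal length")
    n = len(labels_a)

    # Fraction of items where both annotators chose the same label.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n

    # Agreement expected by chance, from each annotator's label frequencies.
    count_a, count_b = Counter(labels_a), Counter(labels_b)
    expected = sum(count_a[c] * count_b[c] for c in count_a) / (n * n)

    if expected == 1.0:
        return 1.0  # both annotators used a single identical label throughout
    return (observed - expected) / (1 - expected)


# Hypothetical labels from two annotators on the same eight items.
annotator_a = ["cat", "cat", "dog", "dog", "cat", "dog", "cat", "dog"]
annotator_b = ["cat", "dog", "dog", "dog", "cat", "dog", "cat", "cat"]

kappa = cohen_kappa(annotator_a, annotator_b)

# Items the annotators disagree on are good candidates for a second look.
disagreements = [
    i for i, (a, b) in enumerate(zip(annotator_a, annotator_b)) if a != b
]
print(kappa, disagreements)
```

A kappa near 1 means the annotators are consistent, while a value near 0 means they agree about as often as chance would predict, which usually points to unclear labeling instructions.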
In the figure below, we can see an experiment that compared two versions of a data set, one cleaned and one noisy, across different numbers of samples. As can be observed, even with a third of the labels, the model trained on clean data outperforms the model trained on noisy data.
In conclusion, the important but hated part of cleaning the data is the key to success.
Taking the time to work with clean, high-quality data has more potential than sophisticated modeling. Make sure you know your data and clean it. Your KPIs will thank you for that.