Sometimes the gap between theory and its real-world application leads to critical mistakes. Machine Learning and AI are a case in point. The way academia currently teaches test-set creation, by selecting data at random, may be appropriate for academic purposes but not necessarily for business problems.
In many cases, a test set created from random data does not fit the business requirement. To work around this, organizations usually provide us with a list of conditions ("If… Then") intended to deal with the randomness of the data.
The right approach in such a case, however, is to build a test set from examples suited to the business problem that needs solving: that is, we choose the samples in the test set ourselves.
Thus, when asked about the quality of the data, we are equipped with a test set that can provide an answer from a business-related perspective rather than a random selection which would lead to random answers.
However, we should not rely on just one test set. Instead, we should employ three testing methods:
One test set to answer the business question.
A large random test set to test the numbers and enable a more precise examination of the model.
Cross-validation to examine the model and assess how well it generalizes.
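The three methods above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `build_evaluation_sets` and the predicate `is_business_case` are hypothetical, the samples are assumed to be independent of each other, and the business test set is assumed to be definable by a simple rule (in practice it may be hand-picked).

```python
import random

def build_evaluation_sets(samples, is_business_case,
                          random_test_fraction=0.2, k=5, seed=0):
    """Return (business_test, random_test, folds) with no sample shared between them.

    is_business_case is a hypothetical predicate that marks the samples
    relevant to the business question.
    """
    rng = random.Random(seed)

    # 1) Business test set: chosen deliberately, not at random.
    business_test = [s for s in samples if is_business_case(s)]
    remaining = [s for s in samples if not is_business_case(s)]

    # 2) Large random test set, drawn from what is left.
    rng.shuffle(remaining)
    n_random = int(len(remaining) * random_test_fraction)
    random_test = remaining[:n_random]
    development = remaining[n_random:]

    # 3) k folds over the development data for cross-validation:
    #    in round i, folds[i] is held out and the other folds are used for training.
    folds = [development[i::k] for i in range(k)]
    return business_test, random_test, folds


# Toy usage: integers stand in for samples; multiples of 10 play the
# role of the business-critical cases.
samples = list(range(100))
business_test, random_test, folds = build_evaluation_sets(
    samples, lambda s: s % 10 == 0
)
print(len(business_test), len(random_test), [len(f) for f in folds])
```

Keeping the three sets disjoint means each one answers a different question about the same model without one evaluation contaminating another.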
How to Choose Suitable Data for the Test Set?
Sometimes the data can be confusing, and whoever examines the model might choose an incorrect test set. Below is an example of how easily the creation of a test set and a training set can be flawed:
Suppose we are building an algorithm for medical imaging such as MRI or X-ray. And suppose we have data of a thousand patients who have participated in the study, and each patient has had 100 images throughout their treatment. We wish to build an algorithm that will determine, based on the images, whether the patient has cancer or not.
The instinct of data scientists is to take the entire group of 100,000 images and randomly divide it into a training set and a test set, without regard to the chronological order of the images or to which patient each image belongs. But this random split would be a mistake, since the true goal of the model is to identify cancer among people whose images have not been included in the training set.
If we randomly divide the images, we might place a later image in the training set and an earlier image in the test set for the same patient. The model will provide a "good" result, which is unrealistic since we have predicted someone's future while already having their future information. If we use the future image in the training set, the results will be much better than those obtained when we run the actual model on a new case. It is easy to make a mistake here. If we don't plan the experiment correctly, its entire evaluation is incorrect, and we might assume the experiment has succeeded while it has not.
We try to generalize the problem, but if we choose an unsuitable test set, instead of generalizing, we might teach the system the opposite: to identify the person rather than the condition. Therefore, we should not select randomly. We should divide the data considering both the patient's identity and the chronological order of the images. There is no point in building the model based on a terminal case because it will not identify the initial stages of the disease.
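A minimal sketch of such a split, assuming each image is described by a `(patient_id, timestamp, image_ref)` tuple. The tuple layout and the function name `split_by_patient` are assumptions made for illustration, not part of any particular system. Whole patients are assigned to either the training set or the test set, and each patient's images are kept in chronological order so that early-stage images can be examined separately.

```python
import random
from collections import defaultdict

def split_by_patient(images, test_fraction=0.2, seed=0):
    """Split (patient_id, timestamp, image_ref) tuples so that no patient
    appears in both the training set and the test set."""
    by_patient = defaultdict(list)
    for patient_id, timestamp, image_ref in images:
        by_patient[patient_id].append((timestamp, image_ref))

    # Randomness is applied to patients, never to individual images.
    patient_ids = sorted(by_patient)
    random.Random(seed).shuffle(patient_ids)
    n_test = max(1, int(len(patient_ids) * test_fraction))
    test_patients = set(patient_ids[:n_test])

    train, test = [], []
    for patient_id in patient_ids:
        # Keep each patient's images in chronological order.
        items = sorted(by_patient[patient_id], key=lambda item: item[0])
        target = test if patient_id in test_patients else train
        target.extend((patient_id, ts, ref) for ts, ref in items)
    return train, test


# Toy usage: 10 patients with 5 images each, timestamps given as day offsets.
images = [(f"patient_{p}", day, f"img_{p}_{day}")
          for p in range(10) for day in range(5)]
train, test = split_by_patient(images)
print(len(train), len(test))
```

Because the split is made at the patient level, a later image of a patient can never end up in training while an earlier image of the same patient sits in the test set.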
Appropriate Distribution of the Data is Vital
Another error can result from multiple images of the same patient in one training set. If a training set includes many images of a specific patient, the model might identify the cancer by identifying the patient rather than the symptoms. The goal is to identify cancer in a completely new person whose images are not in the training set. The appropriate distribution of the data is therefore critical. However, most data practitioners still divide the data randomly and are not aware of the issue.
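The issue is easy to check for. The sketch below uses made-up toy data (50 patients with 20 images each) and a hypothetical helper, `shared_patients`, to count how many patients end up on both sides of a split. It compares a naive image-level random split with a patient-level split.

```python
import random

def shared_patients(train, test):
    """Return the patient IDs that appear in both sets."""
    return {patient for patient, _ in train} & {patient for patient, _ in test}

rng = random.Random(0)
# Toy data: each image is a (patient_id, image_index) pair.
images = [(patient, i) for patient in range(50) for i in range(20)]

# Naive split: shuffle individual images, then cut 80/20.
shuffled = images[:]
rng.shuffle(shuffled)
cut = int(len(shuffled) * 0.8)
naive_train, naive_test = shuffled[:cut], shuffled[cut:]

# Patient-level split: hold out 10 whole patients.
patients = list(range(50))
rng.shuffle(patients)
held_out = set(patients[:10])
grouped_train = [img for img in images if img[0] not in held_out]
grouped_test = [img for img in images if img[0] in held_out]

# With the naive split, most patients typically appear in both sets.
print("naive overlap:  ", len(shared_patients(naive_train, naive_test)))
# With the patient-level split, no patient is shared.
print("grouped overlap:", len(shared_patients(grouped_train, grouped_test)))
```

A non-empty overlap is a warning sign that the model may be rewarded for recognizing the patient rather than the symptoms.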
Efficiently Work with Several Test Sets Simultaneously
Thanks to its unique data structure, the CoreAI system meets the need of organizations to build a test set for a specific business question. The system lets you:
Hold an experiment with all the data
Easily build a test set
Work with several test sets simultaneously
Perform cross-validation
Testing with three different methods, together with the ease of creating and running multiple test sets in parallel, provides a convenient and fast tool for professionals who wish to verify the quality of their model and avoid mistakes.
To learn more about the CoreAI test-set creation method, please contact us via our contact form.