7 reasons why machine learning projects fail





Machine learning is a great tool, and it is changing our world. In many exciting applications, machine learning, and deep learning in particular, has proven far superior to traditional methods. From AlexNet for image classification to U-Net for image segmentation, we have seen tremendous successes in computer vision and medical image processing. However, I also regularly see machine learning methods fail. When this happens, people have usually committed one of the seven deadly sins of machine learning.

While all of these issues are serious and can lead to wrong conclusions, some are worse than others, and even machine learning experts commit them in the excitement of their work. Even for other experts, many of these mistakes are hard to spot, because you need to examine the code and the experimental setup in detail to identify them. In particular, these errors tend to occur when your results look too good to be true, so you may want to use this post as a checklist to avoid drawing wrong conclusions from your work. Only when you are completely sure that you have not fallen into any of these traps should you go ahead and report your results to colleagues or the public.

Sin #1: Biased data and models

An overfitted model can explain the training data perfectly, but usually cannot generalize to new observations

Beginners in deep learning make this mistake frequently. In the most common case, the experimental design is flawed, for example by using the training data as test data. With a simple nearest-neighbor classifier, this immediately yields a 100% recognition rate on most problems. With more complex, deeper models the accuracy may not be 100%, but 98-99%. So, if you obtain such high recognition rates on your first modeling attempt, you should double-check your experimental setup. On new data, however, your model will collapse completely, and you may even produce results worse than random guessing, i.e., an accuracy below 1/K, where K is the number of classes, for example below 50% in a two-class problem. You can also easily overfit your model by increasing the number of parameters until it effectively memorizes the training set. Another variant is using a training set that is too small to represent your actual application. All of these models are likely to fail on new data, i.e., when used in the actual application scenario.
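
To make this concrete, here is a minimal sketch, assuming NumPy and scikit-learn are available, of why evaluating on the training data is meaningless: a 1-nearest-neighbor classifier trained on pure noise scores 100% on its own training data, while a held-out test set reveals that there was never anything to learn (accuracy around 1/K).

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score

rng = np.random.default_rng(0)
K = 2                                      # number of classes
X = rng.normal(size=(500, 20))             # features with no real signal
y = rng.integers(0, K, size=500)           # random labels

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.5, random_state=0
)

model = KNeighborsClassifier(n_neighbors=1).fit(X_train, y_train)

# Evaluating on the training data: 100%, because 1-NN memorizes every sample.
print("train accuracy:", accuracy_score(y_train, model.predict(X_train)))
# Evaluating on held-out data: roughly chance level (about 1/K), because
# there was never any real signal to learn.
print("test accuracy: ", accuracy_score(y_test, model.predict(X_test)))
```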

Sin #2: Unfair comparisons

If you do not make a fair comparison, you might get the results you want, but they may not be reproducible on other data.

Even machine learning experts make this mistake. It is particularly common when you want to show that your new method is better than the state of the art. Research papers in particular tend to succumb to it, in order to convince reviewers of the advantages of the proposed method. In the simplest case, you download a model from a public repository and use it without proper fine-tuning or hyperparameter search, and then compare it against your carefully tuned new method. There are many such examples in the literature. Isensee et al. expose a recent one in their paper, where they demonstrate that on 10 different problems the original U-Net actually beats all improvements to it proposed since 2015. Therefore, you should always apply the same amount of hyperparameter tuning to the state-of-the-art baseline as to your newly proposed method.
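
As a rough illustration of what a fairer setup can look like, here is a sketch assuming scikit-learn, with an SVM as a stand-in baseline and a random forest as the stand-in "new" method: both get the same hyperparameter search budget and the same cross-validation splits.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import GridSearchCV, StratifiedKFold
from sklearn.svm import SVC
from sklearn.ensemble import RandomForestClassifier

X, y = load_breast_cancer(return_X_y=True)
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)

# Baseline method: tuned, not just used with default parameters.
baseline = GridSearchCV(
    SVC(), {"C": [0.1, 1, 10], "gamma": ["scale", 0.01, 0.001]}, cv=cv
).fit(X, y)

# "New" method: gets exactly the same search effort and the same splits.
new_method = GridSearchCV(
    RandomForestClassifier(random_state=0),
    {"n_estimators": [100, 300], "max_depth": [None, 5, 10]},
    cv=cv,
).fit(X, y)

print("baseline (tuned):  ", baseline.best_score_)
print("new method (tuned):", new_method.best_score_)
# For a final claim, evaluate both winners on a separate held-out test set.
```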

Sin #3: No significant improvement

Significance testing ensures that the improvement you report is not just a drop in the ocean of random variation

After all your experiments, you finally find a model that produces better results than the state of the art. However, even at this point you are not done. Everything in machine learning is inexact. In addition, because of the probabilistic nature of the learning process, your experiments are influenced by many random factors. To account for this randomness, you need to perform statistical tests. This is usually done by running the experiment several times with different random seeds. In this way, you can report the mean and standard deviation of the performance over all runs. Using a significance test, such as a t-test, you can determine the probability that the observed improvement arose merely by chance. This probability should be below 5%, or better below 1%, before you trust that your result is significant. You do not need to be a professional statistician to do this; there are even online tools that calculate such tests. If you repeat tests, make sure to apply the Bonferroni correction, i.e., divide the required significance level by the number of tests performed on the same data.
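
The following sketch, assuming NumPy and SciPy and using made-up accuracy values purely as placeholders, shows the basic recipe: report mean and standard deviation over several seeds, run a t-test, and tighten the significance level with the Bonferroni correction when several tests are performed on the same data.

```python
import numpy as np
from scipy import stats

# Hypothetical accuracies from 10 runs with different random seeds.
baseline_acc = np.array([0.842, 0.851, 0.838, 0.846, 0.849,
                         0.840, 0.844, 0.847, 0.839, 0.845])
new_acc      = np.array([0.855, 0.861, 0.849, 0.858, 0.860,
                         0.852, 0.857, 0.859, 0.851, 0.856])

print(f"baseline: {baseline_acc.mean():.3f} +/- {baseline_acc.std(ddof=1):.3f}")
print(f"new:      {new_acc.mean():.3f} +/- {new_acc.std(ddof=1):.3f}")

# Welch's t-test: probability of seeing a difference this large by chance alone.
t_stat, p_value = stats.ttest_ind(new_acc, baseline_acc, equal_var=False)
print(f"p-value: {p_value:.4f}")

# Bonferroni correction: if you ran m tests on the same data, divide the
# required significance level by m.
m = 3                       # e.g., the new method was compared on 3 metrics
alpha = 0.05 / m
print("significant after Bonferroni correction:", p_value < alpha)
```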

Sin #4: Confounders and bad data

Speech from 51 speakers recorded with two different microphones. Each dot represents one recording. The main factor of variation in the data is the different microphones.

Data quality is one of the biggest pitfalls in machine learning. It can introduce severe biases and even lead to racist artificial intelligence. Here, however, the problem lies not in the training algorithm but in the data itself. As an example, we show recordings of 51 speakers captured with two different microphones. Because we recorded the same speakers, the recordings should be projected onto the same points, i.e., yield the same features. Instead, we observe that the same recordings form two separate clusters. In fact, one microphone was located directly at the speaker's mouth, while the other was mounted on a camera 2.5 meters away that recorded the scene. Similar effects arise in medical imaging when using two different types of scanners, or scanners from two different vendors. If you record all pathological patients on scanner A and all control subjects on scanner B, your machine learning method will most likely learn to distinguish the scanners rather than the actual pathology. You will be very happy with the experimental results, which will show close to perfect recognition, but your model will fail completely in practice. So avoid confounders and erroneous data.
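
The scanner example can be reproduced in miniature with synthetic data. The sketch below, assuming NumPy and scikit-learn, uses a purely hypothetical setup in which the "scanner" adds an offset to one feature and perfectly tracks the diagnosis in the training data; the classifier then looks excellent until the confound is balanced.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

rng = np.random.default_rng(0)

def make_data(n, confounded):
    y = rng.integers(0, 2, size=n)               # 0 = control, 1 = pathology
    X = rng.normal(size=(n, 10))                 # features: no real pathology signal
    if confounded:
        scanner = y                              # scanner perfectly tracks the label
    else:
        scanner = rng.integers(0, 2, size=n)     # scanner independent of the label
    X[:, 0] += 3.0 * scanner                     # scanner-specific offset
    return X, y

X_train, y_train = make_data(400, confounded=True)
X_test_conf, y_test_conf = make_data(400, confounded=True)
X_test_fair, y_test_fair = make_data(400, confounded=False)

clf = LogisticRegression().fit(X_train, y_train)

# Near-perfect on confounded data: the model has learned the scanner offset.
print("confounded test set:", accuracy_score(y_test_conf, clf.predict(X_test_conf)))
# Chance level once the confound is removed: there was no pathology signal at all.
print("balanced test set:  ", accuracy_score(y_test_fair, clf.predict(X_test_fair)))
```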

Sin #5: Inappropriate labels

A single label per training sample is often insufficient to capture the complexity of the problem. In some cases, multiple raters produce many different labels (blue distribution); in others, all raters produce the same label (red curve).

Protagoras already knew: "Man is the measure of all things." This also applies to the labels, or ground truth, of many classification problems. We train machine learning models to reflect human-defined categories. For many problems, we believe the classes are perfectly clear when we define them. Yet when we look at the data, we find that it often contains ambiguous cases, for example images in the ImageNet challenge that contain two objects instead of one. If we study more complex phenomena, such as emotion recognition, it gets even harder. Here we realize that for many real-life observations, even the emotion itself cannot be assessed unambiguously. To obtain correct labels, we need to ask multiple raters and obtain a label distribution. This is illustrated in the figure: the red curve shows a clear case with a so-called prototypical, spiked distribution. The blue curve shows an ambiguous case with a broad distribution. In such cases, not only the machine but also human raters may arrive at contradictory interpretations. If you use only a single rater to create your ground truth, you will not even be aware of this problem, which typically results in [label noise that you then have to handle](https://papers.nips.cc/paper/5073-learning-with-noisy-labels.pdf). If you obtain a true label distribution (which is of course expensive), you can even show that removing the ambiguous cases significantly improves system performance, as we have seen, for example, for perceived versus true emotions in emotion recognition. However, this may not reflect the real application, where you will still encounter the ambiguous cases. In any case, you should prefer ratings from multiple raters over a single rater.
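
A small sketch of what working with multiple raters can look like, assuming NumPy and using made-up votes as placeholders: the raters' votes are turned into a per-sample label distribution, and the entropy of that distribution flags the ambiguous cases.

```python
import numpy as np

K = 3  # number of classes (hypothetical)

# Hypothetical votes: rows = samples, columns = raters.
votes = np.array([
    [0, 0, 0, 0, 0],   # clear case: everyone agrees (the "red curve")
    [1, 2, 1, 0, 2],   # ambiguous case: raters disagree (the "blue curve")
    [2, 2, 2, 1, 2],
])

# Per-sample label distribution (soft label) instead of a single hard label.
counts = np.stack([np.bincount(v, minlength=K) for v in votes])
soft_labels = counts / counts.sum(axis=1, keepdims=True)

# Entropy of the distribution: 0 for full agreement, high for ambiguous cases.
entropy = -np.sum(
    soft_labels * np.log(np.where(soft_labels > 0, soft_labels, 1.0)), axis=1
)

for p, h in zip(soft_labels, entropy):
    print(p, f"entropy={h:.2f}", "-> ambiguous" if h > 0.5 else "-> clear")
```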

Sin #6: Cross-validation chaos

Do not use the same data to select your model and features and to evaluate the model

This is almost the same as the first sin, but it comes in disguise, and I have seen it even in doctoral theses that were nearly submitted. So even experts can be confused by this problem. A typical setup is that model, architecture, or feature selection is performed in a first step. Because you only have a small number of data samples, you decide to use cross-validation to evaluate each step. So you split the data into N folds, select features on N-1 folds, and evaluate on the Nth fold. After repeating this N times, you compute the average performance and pick the best features. Now that you know the best features, you use cross-validation again to select the best parameters for your machine learning model.

This seems correct, right? No! It is flawed, because you have already seen all the test data in the first step and averaged over all observations. Information from the complete data set therefore leaks into the next step, and you can obtain excellent results even on completely random data. To avoid this, you need a nested procedure that wraps the first step inside a second, outer cross-validation loop. Of course, this is very expensive and produces a huge number of experiments that have to be run. Note that with such a large number of experiments on the same data, good results can again arise purely by chance, so statistical tests and Bonferroni corrections are mandatory here as well. I generally try to avoid large cross-validation experiments and instead try to obtain enough data to work with proper training/validation/test splits.
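
A sketch of such a nested procedure, assuming scikit-learn: feature selection and hyperparameter search run in an inner cross-validation loop, while the outer loop only measures performance, so no test fold ever influences the model selection.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.model_selection import GridSearchCV, StratifiedKFold, cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.svm import SVC

X, y = load_breast_cancer(return_X_y=True)

# Feature selection and classifier are bundled into one pipeline so that
# feature selection is refitted inside every training fold.
pipe = Pipeline([("select", SelectKBest(f_classif)), ("clf", SVC())])
param_grid = {"select__k": [5, 10, 20], "clf__C": [0.1, 1, 10]}

inner_cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
outer_cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=1)

# Inner loop: model/feature selection. Outer loop: unbiased performance estimate.
search = GridSearchCV(pipe, param_grid, cv=inner_cv)
scores = cross_val_score(search, X, y, cv=outer_cv)

print(f"nested CV accuracy: {scores.mean():.3f} +/- {scores.std():.3f}")
```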

Sin #7: Over-interpretation of results

Let others praise your work, not yourself.

Beyond all the previous mistakes, I think the biggest sin we commit in machine learning is that, in the current hype phase, we oversell and exaggerate our own results. Of course, everyone is happy about a successful solution created with machine learning, and you have every right to be proud of it. However, you should avoid extrapolating your results to unseen data or to problems you have not worked on, just because you happened to solve two different problems with the same method. Likewise, following the observations from Sin #5, claims of super-human performance are dubious: how can you outperform the source of your own labels? Of course, a machine can beat a human in terms of fatigue and attention, but can it outperform humans on categories that humans themselves defined? You should be very careful with such statements.

Every claim should be grounded in facts. You may hypothesize about the general applicability of your method, as long as you clearly mark it as a hypothesis; for a real claim, you must provide experimental or theoretical evidence. Admittedly, it is hard these days to get the attention you think your method deserves, and broad claims certainly help to promote your approach. Still, I suggest staying down to earth and sticking to the evidence. Otherwise, we may soon face the next AI winter, along with the kind of general distrust of artificial intelligence we have already seen in past years. Let us avoid this in the current cycle and stick to claims we can actually demonstrate.

Of course, most of you probably already know these pitfalls. Still, you may want to revisit the seven sins of machine learning from time to time, just to make sure you are still standing on solid ground and have not fallen into any of their traps.