Training Neural Networks on Data That’s Lying
Oh no! You discover that your data is corrupted: enough training examples carry the wrong label for it to be a significant problem. What should you do?
If you want to be a radical optimist, you could think of this label corruption as a form of regularization, depending on the level of corruption. However, if too many labels are corrupted, or the corruption is not balanced across classes, this view quickly stops being practical (if it was ever practical to begin with, of course).
Depending on the problem, though, the model may actually be able to learn despite the corrupted labels. Youngdong Kim, Junho Yim, Juseung Yun, and Junmo Kim introduce a solution for corrupted labels in multiclass problems, i.e., problems with more than two mutually exclusive classes. MNIST is one example: each image belongs to exactly one of ten classes. Hence, the scope of this article applies only to multiclass problems.
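To make the setting concrete, here is a minimal sketch of what "corrupted labels" looks like in practice. The function below (a hypothetical helper, not from the paper) injects symmetric label noise into a multiclass label array: a chosen fraction of labels is reassigned to a different, incorrect class.

```python
import numpy as np

def corrupt_labels(labels, noise_rate, num_classes, seed=0):
    """Hypothetical helper: flip a fraction `noise_rate` of the labels
    to a uniformly random *incorrect* class (symmetric label noise)."""
    rng = np.random.default_rng(seed)
    noisy = labels.copy()
    n = len(noisy)
    # pick which examples to corrupt
    idx = rng.choice(n, size=int(noise_rate * n), replace=False)
    for i in idx:
        # draw from the other num_classes - 1 classes, skipping the true one
        wrong = rng.integers(num_classes - 1)
        noisy[i] = wrong if wrong < noisy[i] else wrong + 1
    return noisy

# MNIST-like setup: 10 classes, 1000 examples, 30% of labels corrupted
clean = np.random.default_rng(1).integers(10, size=1000)
noisy = corrupt_labels(clean, noise_rate=0.3, num_classes=10)
print((noisy != clean).mean())  # exactly 0.3: every corrupted label differs
```

A network trained on `noisy` as if it were ground truth is exactly the situation this article is about.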
Consider the traditional way neural networks are trained — positive learning. That is, the network is given data points in the format “x has label y”.
- “[This image of a bird] has label [bird].”
- “[This image of a cat] has label [cat].”
- “[This image of a dog] has…