Abhishek Kalra
Jul 20, 2021

MixUp Regularization: Synthetic Data Generation and Model Generalization Improvement Technique

Mixup is a data augmentation technique introduced by Zhang et al. (2018) to train neural networks by constructing virtual training examples using convex combinations of pairs of examples and their labels. In effect, Mixup regularizes the neural network to favor simple linear behavior in between training examples and improves generalization.

Mathematically, mixup can be represented as follows:

x̃ = λ * xi + (1 − λ) * xj
ỹ = λ * yi + (1 − λ) * yj

where (xi, yi) and (xj, yj) are two examples drawn at random from our training data, and λ ∈ [0, 1] is a mixing coefficient sampled from a Beta(α, α) distribution.
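As a quick illustration (not part of the original paper's code), the interpolation above can be written in a few lines of NumPy. The function name, the default α = 0.2, and the assumption of one-hot labels are my own choices for the sketch; small α values in roughly the 0.1 to 0.4 range keep the mixed examples close to the originals, which is the regime reported to work well in the paper.

```python
import numpy as np

def mixup_batch(x, y, alpha=0.2, rng=None):
    """Construct a Mixup batch by convexly combining examples and their labels.

    x: array of shape (batch, ...) holding the inputs
    y: array of shape (batch, num_classes) holding one-hot labels
    alpha: Beta distribution parameter controlling how strongly examples are mixed
    """
    if rng is None:
        rng = np.random.default_rng()
    lam = rng.beta(alpha, alpha)                 # lambda ~ Beta(alpha, alpha)
    perm = rng.permutation(len(x))               # random pairing of (xi, yi) with (xj, yj)
    x_mix = lam * x + (1.0 - lam) * x[perm]      # x~ = lambda * xi + (1 - lambda) * xj
    y_mix = lam * y + (1.0 - lam) * y[perm]      # y~ = lambda * yi + (1 - lambda) * yj
    return x_mix, y_mix
```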

Mixup is grounded in Vicinal Risk Minimization (VRM) rather than Empirical Risk Minimization (ERM): instead of minimizing the error only on the observed training examples, it draws virtual examples from a vicinity distribution around the training examples, enlarging the support of the training distribution.

As shown in Figure 8 below, Mixup allows for more robust training of the classifier by introducing synthetic training examples drawn from convex combinations of the original training examples.

Figure 8. Regular vs. Mixup classifier training

Historically, defining the vicinity or neighborhood of training examples has required human domain knowledge. Mixup, however, offers a simple alternative: it is data-agnostic, requires no domain knowledge, and is quite effective. Mixup further offers the following distinct advantages:

1. Regularization: MixUp transformations establish a linear relationship between the data augmentation and the supervision signal. This in turn strongly regularizes the model and improves performance. The creators of fastai, for instance, noted that it is extremely effective at regularizing computer vision models and helped bring the time to train CIFAR-10 to 94% accuracy on a single GPU down to 6 minutes.

2. Generalization: The authors observed in their experiments an improvement in generalization error for state-of-the-art models trained on ImageNet, CIFAR, speech, and tabular datasets. As shown in Table 1 below, mixup improves the average test error on four of the six UCI datasets considered and never under-performs ERM. As noted by DeVries & Taylor (2017), interpolation and extrapolation of the nearest neighbors of the same class in feature space can improve generalization and reduce overfitting.

Table 1: ERM and mixup classification errors on the UCI datasets

3. Simplicity and speed: The MixUp concept is simple to implement and introduces little to no computational overhead, as the short training-step sketch below illustrates.
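As a hedged sketch of how this drops into an existing training loop (the model, optimizer, and batch names here are placeholders, not code from the original post or paper), a single PyTorch training step with Mixup only adds a few lines on top of a standard step:

```python
import torch
import torch.nn.functional as F

def mixup_train_step(model, optimizer, x, y, alpha=0.2):
    """Run one training step with Mixup applied to the batch (x, y).

    x: input batch, y: integer class labels of shape (batch,).
    """
    lam = torch.distributions.Beta(alpha, alpha).sample().item()  # mixing coefficient
    perm = torch.randperm(x.size(0), device=x.device)             # random pairing within the batch

    x_mix = lam * x + (1.0 - lam) * x[perm]                       # interpolate the inputs
    logits = model(x_mix)

    # Mixing the two cross-entropy terms is equivalent to using the
    # interpolated one-hot labels y~ = lam * yi + (1 - lam) * yj.
    loss = lam * F.cross_entropy(logits, y) + (1.0 - lam) * F.cross_entropy(logits, y[perm])

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```

Because the only extra work is one interpolation and one additional loss term, the per-step overhead is negligible compared to the forward and backward passes.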

As a result, Mixup increases robustness to adversarial examples, can be used to boost the performance of ML algorithms on tabular data, and stabilizes the training of generative adversarial networks.

References:

Zhang, H., Cissé, M., Dauphin, Y. N., Lopez-Paz, D. mixup: Beyond Empirical Risk Minimization. In 6th International Conference on Learning Representations (ICLR 2018), Vancouver, BC, Canada, April 30 to May 3, 2018, Conference Track Proceedings. URL: https://openreview.net/forum?id=r1Ddp1-Rb
