Abhishek Kalra
4 min read · Jul 7, 2021

Evolution of ResNets: Identity Residual Units and other architectural experiments

Deep residual networks (ResNets), put forth by He et al. (2015) and now commonly used for computer vision, are built from residual units (Figure 1) stacked on top of each other. The underlying principle of ResNets is to learn an additive residual function on top of the input, with the key choice of using an identity mapping for the shortcut connection.

Figure 1: Residual Unit

A residual unit is described by the following equations:

y(l) = h(x(l)) + F(x(l); W(l)) …… (1)

x(l+1) = f(y(l)) …… (2)

Here x(l) and x(l+1) are the input and output of the residual unit, h(x(l)) = x(l) is the identity shortcut, F(x(l); W(l)) is the residual function, and f is a ReLU applied after the addition, yielding the input x(l+1) to the next unit. ResNets built this way have been shown to achieve high accuracy with networks as deep as 100 layers. However, the problems of vanishing gradients and memorization have been noted to worsen with further increases in network depth. This can be attributed to applying ReLU after the addition: the result of the addition, y(l), is not passed on directly; instead the output of the residual unit, x(l+1), is the result of applying ReLU to y(l). The application of f therefore alters the combined signal, so the next layer no longer receives exactly the original input plus the residual.
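For concreteness, a minimal PyTorch-style sketch of this original, post-activation residual unit is shown below. The two-convolution residual branch, the channel-preserving 3x3 convolutions, and the use of PyTorch itself are illustrative assumptions, not details taken from the article or the original implementation.

import torch.nn as nn

class OriginalResidualUnit(nn.Module):
    """Post-activation residual unit: y = x + F(x), output = ReLU(y)."""
    def __init__(self, channels):
        super().__init__()
        # Residual function F(x; W): two 3x3 convolutions with batch norm,
        # with a ReLU between them (an assumed, typical configuration).
        self.conv1 = nn.Conv2d(channels, channels, 3, padding=1, bias=False)
        self.bn1 = nn.BatchNorm2d(channels)
        self.conv2 = nn.Conv2d(channels, channels, 3, padding=1, bias=False)
        self.bn2 = nn.BatchNorm2d(channels)
        self.relu = nn.ReLU()

    def forward(self, x):
        residual = self.bn2(self.conv2(self.relu(self.bn1(self.conv1(x)))))
        y = x + residual      # Eqn. (1) with h as the identity
        return self.relu(y)   # Eqn. (2): f = ReLU applied after the addition

Note that the final ReLU acts on the sum, which is exactly the point the article raises: the next unit never sees y(l) itself.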

To address this and ensure unimpeded information flow, in a subsequent iteration of ResNets (He et al. (2016)) the authors focused on creating a "direct" path for information propagation, both through individual residual units and through the entire network. Through their derivations they showed that signals can be transmitted directly, in both the forward and backward passes, if both h(x(l)) and f(y(l)) are identity mappings. The structure of the residual unit was therefore modified as shown in Figure 2.

Figure 2: Modified Residual Unit (with identity mapping)

Given that f is also an identity mapping, Eqn. (1) can be rewritten as Eqn. (3) below and, by applying it recursively between any shallower unit l and any deeper unit L, generalized as Eqn. (4).

x(l+1) = x(l) + F(x(l); W(l)) …… (3)

x(L) = x(l) + Σ F(x(i); W(i)), with the sum taken over i = l, …, L−1 …… (4)

Equation 4: the signal at any deeper unit L is the signal at any shallower unit l plus a sum of residual functions.

As the equations above illustrate, the original input is added to the output of the residual function without modification, ensuring no loss of information as the signal propagates through the network.
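Eqn. (4) also makes the benefit for backpropagation explicit. As the authors show, differentiating a loss ε with respect to x(l) by the chain rule gives

∂ε/∂x(l) = ∂ε/∂x(L) · (1 + ∂/∂x(l) Σ F(x(i); W(i))), with the sum taken over i = l, …, L−1

The additive term 1 means the gradient at a deep unit, ∂ε/∂x(L), is propagated directly back to any shallower unit, no matter how small the gradients through the residual branches become; this is the sense in which identity mappings counter vanishing gradients.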

To test the importance of the identity shortcut, the authors experimented with both gating mechanisms and the placement of the activation functions. Gates were applied to the original input, to the residual function output, or to both (Figure 3); however, all of these variants paled in comparison to the identity residual unit.

Figure 3: Various Types of Shortcut Connections
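As a rough illustration of the gating variants just mentioned, below is a sketch of a highway-style "exclusive gating" shortcut, where a sigmoid gate trades off the residual branch against the shortcut. The 1x1 convolution gate and the two-convolution residual branch are assumptions for illustration, not the exact configuration reported in the paper.

import torch
import torch.nn as nn

class GatedShortcutUnit(nn.Module):
    """Exclusive-gating variant: g(x) * F(x) + (1 - g(x)) * x."""
    def __init__(self, channels):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, 3, padding=1, bias=False)
        self.bn1 = nn.BatchNorm2d(channels)
        self.conv2 = nn.Conv2d(channels, channels, 3, padding=1, bias=False)
        self.bn2 = nn.BatchNorm2d(channels)
        self.relu = nn.ReLU()
        # A 1x1 convolution followed by a sigmoid produces the gate g(x).
        self.gate = nn.Conv2d(channels, channels, 1)

    def forward(self, x):
        residual = self.bn2(self.conv2(self.relu(self.bn1(self.conv1(x)))))
        g = torch.sigmoid(self.gate(x))
        # The shortcut is scaled by (1 - g), so it is no longer a pure identity,
        # which is exactly why such variants underperformed in the experiments.
        return g * residual + (1.0 - g) * x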

The authors further experimented with the activation functions and designed new residual units with pre-activation. As shown in Figure 4, this entails placing batch normalization and ReLU before each convolution rather than after it. With this reordering of the activations, the output of the addition becomes the output of the unit, achieving the desired identity effect.

Figure 4: Experiments on Usage of Activation Function
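A minimal sketch of the full pre-activation unit, again assuming a two-convolution residual branch in PyTorch, makes the reordering concrete: the addition is the last operation, so the unit's output is exactly the input plus the residual.

import torch.nn as nn

class PreActResidualUnit(nn.Module):
    """Full pre-activation unit: BN and ReLU precede each convolution,
    and nothing is applied after the addition."""
    def __init__(self, channels):
        super().__init__()
        self.bn1 = nn.BatchNorm2d(channels)
        self.conv1 = nn.Conv2d(channels, channels, 3, padding=1, bias=False)
        self.bn2 = nn.BatchNorm2d(channels)
        self.conv2 = nn.Conv2d(channels, channels, 3, padding=1, bias=False)
        self.relu = nn.ReLU()

    def forward(self, x):
        residual = self.conv1(self.relu(self.bn1(x)))
        residual = self.conv2(self.relu(self.bn2(residual)))
        return x + residual   # Eqn. (3): the identity shortcut is left untouched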

Deep residual networks with identity mappings as their shortcut connections therefore work well due to the unimpeded flow of information from the first layer to the last layer of the network. The authors noted being able to train residual networks as deep as 1001 layers with increasing accuracy, overcoming the problem of vanishing gradients. As a result, better performance, improved stability and increased accuracy were achieved.

References:

1. Chawla, N.V., Bowyer, K.W., Hall, L.O., Kegelmeyer, W.P.: SMOTE: Synthetic Minority Over-sampling Technique. Journal of Artificial Intelligence Research, 16:321–357, 2002.

2. DeVries, T., Taylor, G.W.: Dataset Augmentation in Feature Space. ICLR Workshops, 2017.

3. He, K., Zhang, X., Ren, S., Sun, J.: Deep Residual Learning for Image Recognition. In: CVPR, 2016.

4. He, K., Zhang, X., Ren, S., Sun, J.: Identity Mappings in Deep Residual Networks. In: Leibe, B., Matas, J., Sebe, N., Welling, M. (eds) Computer Vision – ECCV 2016. Lecture Notes in Computer Science, vol 9908. Springer, Cham. https://doi.org/10.1007/978-3-319-46493-0_38
