Training Faster by Separating Modes of Variation in Batch-Normalized Models
- 28 January 2019
- journal article
- research article
- Published by Institute of Electrical and Electronics Engineers (IEEE) in IEEE Transactions on Pattern Analysis and Machine Intelligence
- Vol. 42 (6), 1483-1500
- https://doi.org/10.1109/tpami.2019.2895781
Abstract
Batch Normalization (BN) is essential to effectively training state-of-the-art deep Convolutional Neural Networks (CNNs). It normalizes the layer outputs during training using the statistics of each mini-batch. BN accelerates training by allowing the safe use of large learning rates and alleviates the need for careful parameter initialization. In this work, we study BN from the viewpoint of Fisher kernels that arise from generative probability models. We show that, assuming the samples within a mini-batch are drawn from the same probability density function, BN is identical to the Fisher vector of a Gaussian distribution. This means the batch normalizing transform can be explained in terms of kernels that naturally emerge from the probability density function modeling the generative process of the underlying data distribution. Consequently, it promises higher discrimination power for the batch-normalized mini-batch. However, given the rectifying non-linearities employed in CNN architectures, the distribution of the layer outputs is asymmetric. Therefore, for BN to fully benefit from the aforementioned properties, we propose approximating the underlying data distribution not with a single Gaussian density but with a mixture of them. Deriving the Fisher vector for a Gaussian Mixture Model (GMM) reveals that batch normalization can be improved by normalizing independently with respect to the statistics of disentangled sub-populations. We refer to our proposed soft piecewise version of batch normalization as Mixture Normalization (MN). Through an extensive set of experiments on CIFAR-10 and CIFAR-100, using both a 5-layer-deep CNN and the modern Inception-V3 architecture, we show that mixture normalization reduces the number of gradient updates required to reach the maximum test accuracy of the batch-normalized model by $\sim 31\%-47\%$ across a variety of training scenarios.
Replacing even a few BN modules with MN in the 48-layer-deep Inception-V3 architecture is sufficient not only to obtain considerable training acceleration but also to reach better final test accuracy. We show that similar observations hold for 40- and 100-layer-deep DenseNet architectures as well. We complement our study by evaluating the application of mixture normalization to Generative Adversarial Networks (GANs), where “mode collapse” hinders the training process. We replace only a few batch normalization layers in the generator with our proposed mixture normalization. Our experiments using a Deep Convolutional GAN (DCGAN) on CIFAR-10 show that the mixture-normalized DCGAN not only provides an acceleration of $\sim 58\%$ but also reaches a lower (better) “Fréchet Inception Distance” (FID) of 33.35, compared to 37.56 for its batch-normalized counterpart.
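The soft piecewise normalization described in the abstract can be illustrated with a minimal NumPy sketch. This is not the paper's implementation; it assumes a diagonal-covariance GMM has already been fitted (e.g. by EM) to the layer activations, and all function names are illustrative. Each sample is normalized by the statistics of every mixture component, and the results are blended by that component's posterior responsibility, whereas standard BN uses a single set of mini-batch statistics:

```python
import numpy as np

def batch_norm(x, eps=1e-5):
    # Standard BN: normalize with the statistics of the whole mini-batch.
    mu, var = x.mean(axis=0), x.var(axis=0)
    return (x - mu) / np.sqrt(var + eps)

def mixture_norm(x, means, variances, weights, eps=1e-5):
    # Mixture Normalization sketch (hypothetical helper, not the authors' code):
    # `means`, `variances`, `weights` describe a K-component diagonal-Gaussian
    # mixture over the activations; here they are assumed to be given.
    K = len(weights)
    # Log-likelihood of each sample under each component, plus the mixing weight.
    log_p = np.stack([
        -0.5 * np.sum((x - means[k]) ** 2 / (variances[k] + eps)
                      + np.log(2 * np.pi * (variances[k] + eps)), axis=1)
        + np.log(weights[k])
        for k in range(K)
    ], axis=1)
    # Posterior responsibilities via a numerically stable softmax.
    post = np.exp(log_p - log_p.max(axis=1, keepdims=True))
    post /= post.sum(axis=1, keepdims=True)
    # Normalize per component, then blend by the responsibilities
    # ("soft piecewise" batch normalization).
    out = np.zeros_like(x)
    for k in range(K):
        out += post[:, k:k + 1] * (x - means[k]) / np.sqrt(variances[k] + eps)
    return out
```

With a single component of weight 1, `mixture_norm` reduces to normalizing with that component's statistics, mirroring the paper's observation that BN is the one-Gaussian special case.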
Funding Information
- National Science Foundation (1741431)