Training Faster by Separating Modes of Variation in Batch-Normalized Models
- 28 January 2019
- journal article
- research article
- Published by Institute of Electrical and Electronics Engineers (IEEE) in IEEE Transactions on Pattern Analysis and Machine Intelligence
- Vol. 42 (6), 1483-1500
- https://doi.org/10.1109/tpami.2019.2895781
Abstract
Batch Normalization (BN) is essential to effectively training state-of-the-art deep Convolutional Neural Networks (CNNs). It normalizes the layer outputs during training using the statistics of each mini-batch. BN accelerates training by allowing the safe use of large learning rates and alleviates the need for careful parameter initialization. In this work, we study BN from the viewpoint of Fisher kernels that arise from generative probability models. We show that, assuming the samples within a mini-batch are drawn from the same probability density function, BN is identical to the Fisher vector of a Gaussian distribution. This means the batch normalizing transform can be explained in terms of kernels that naturally emerge from the probability density function modeling the generative process of the underlying data distribution. Consequently, it promises higher discrimination power for the batch-normalized mini-batch. However, given the rectifying non-linearities employed in CNN architectures, the distribution of the layer outputs is asymmetric. Therefore, for BN to fully benefit from the aforementioned properties, we propose approximating the underlying data distribution not with a single Gaussian density but with a mixture of them. Deriving the Fisher vector for a Gaussian Mixture Model (GMM) reveals that batch normalization can be improved by normalizing independently with respect to the statistics of disentangled sub-populations. We refer to our proposed soft piecewise version of batch normalization as Mixture Normalization (MN). Through an extensive set of experiments on CIFAR-10 and CIFAR-100, using both a 5-layer-deep CNN and the modern Inception-V3 architecture, we show that mixture normalization reduces the number of gradient updates required to reach the maximum test accuracy of the batch-normalized model by $\sim 31\%-47\%$ across a variety of training scenarios.
Replacing even a few BN modules with MN in the 48-layer-deep Inception-V3 architecture is sufficient not only to obtain considerable training acceleration but also to reach better final test accuracy. We show that similar observations hold for 40- and 100-layer-deep DenseNet architectures as well. We complement our study by evaluating the application of mixture normalization to Generative Adversarial Networks (GANs), where “mode collapse” hinders the training process. We replace only a few batch normalization layers in the generator with our proposed mixture normalization. Our experiments using a Deep Convolutional GAN (DCGAN) on CIFAR-10 show that the mixture-normalized DCGAN not only provides an acceleration of $\sim 58\%$ but also reaches a lower (better) “Fréchet Inception Distance” (FID) of 33.35, compared to 37.56 for its batch-normalized counterpart.
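The soft piecewise normalization described in the abstract can be illustrated with a minimal NumPy sketch. This is not the paper's implementation; it assumes a diagonal-covariance GMM has already been fitted (e.g. by EM) to the layer activations, and all function names are illustrative. Each sample is normalized by the statistics of every mixture component, and the results are blended by that component's posterior responsibility, whereas standard BN uses a single set of mini-batch statistics:

```python
import numpy as np

def batch_norm(x, eps=1e-5):
    # Standard BN: normalize with the statistics of the whole mini-batch.
    mu, var = x.mean(axis=0), x.var(axis=0)
    return (x - mu) / np.sqrt(var + eps)

def mixture_norm(x, means, variances, weights, eps=1e-5):
    # Mixture Normalization sketch (hypothetical helper, not the authors' code):
    # `means`, `variances`, `weights` describe a K-component diagonal-Gaussian
    # mixture over the activations; here they are assumed to be given.
    K = len(weights)
    # Log-likelihood of each sample under each component, plus the mixing weight.
    log_p = np.stack([
        -0.5 * np.sum((x - means[k]) ** 2 / (variances[k] + eps)
                      + np.log(2 * np.pi * (variances[k] + eps)), axis=1)
        + np.log(weights[k])
        for k in range(K)
    ], axis=1)
    # Posterior responsibilities via a numerically stable softmax.
    post = np.exp(log_p - log_p.max(axis=1, keepdims=True))
    post /= post.sum(axis=1, keepdims=True)
    # Normalize per component, then blend by the responsibilities
    # ("soft piecewise" batch normalization).
    out = np.zeros_like(x)
    for k in range(K):
        out += post[:, k:k + 1] * (x - means[k]) / np.sqrt(variances[k] + eps)
    return out
```

With a single component of weight 1, `mixture_norm` reduces to normalizing with that component's statistics, mirroring the paper's observation that BN is the one-Gaussian special case.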
Funding Information
- National Science Foundation (1741431)