On the Convergence Proof of AMSGrad and a New Version
Open Access
- 13 May 2019
- journal article
- research article
- Published by Institute of Electrical and Electronics Engineers (IEEE) in IEEE Access
- Vol. 7, 61706-61716
- https://doi.org/10.1109/access.2019.2916341
Abstract
The adaptive moment estimation algorithm Adam (Kingma and Ba) is a popular optimizer in the training of deep neural networks. However, Reddi et al. have recently shown that the convergence proof of Adam is problematic, and they have also proposed a variant of Adam called AMSGrad as a fix. In this paper, we show that the convergence proof of AMSGrad is also problematic. Concretely, the problem in the convergence proof of AMSGrad is in handling the hyper-parameters, treating them as equal while they are not. This is also the neglected issue in the convergence proof of Adam. We provide an explicit counter-example of a simple convex optimization setting to show this neglected issue. Depending on manipulating the hyper-parameters, we present various fixes for this issue. We provide a new convergence proof for AMSGrad as the first fix. We also propose a new version of AMSGrad called AdamX as another fix. Our experiments on the benchmark dataset also support our theoretical results.Keywords
This publication has 3 references indexed in Scilit:
- Identity Mappings in Deep Residual NetworksPublished by Springer Science and Business Media LLC ,2016
- Deep Residual Learning for Image RecognitionPublished by Institute of Electrical and Electronics Engineers (IEEE) ,2016
- A Stochastic Approximation MethodThe Annals of Mathematical Statistics, 1951