Deep learning: a statistical viewpoint

Open Access

1 May 2021

journal article
research article
Published by Cambridge University Press (CUP) in Acta Numerica

Vol. 30, 87-201
https://doi.org/10.1017/s0962492921000027

Abstract

The remarkable practical success of deep learning has revealed some major surprises from a theoretical perspective. In particular, simple gradient methods easily find near-optimal solutions to non-convex optimization problems, and despite giving a near-perfect fit to training data without any explicit effort to control model complexity, these methods exhibit excellent predictive accuracy. We conjecture that specific principles underlie these phenomena: that overparametrization allows gradient methods to find interpolating solutions, that these methods implicitly impose regularization, and that overparametrization leads to benign overfitting, that is, accurate predictions despite overfitting training data. In this article, we survey recent progress in statistical learning theory that provides examples illustrating these principles in simpler settings. We first review classical uniform convergence results and why they fall short of explaining aspects of the behaviour of deep learning methods. We give examples of implicit regularization in simple settings, where gradient methods lead to minimal norm functions that perfectly fit the training data. Then we review prediction methods that exhibit benign overfitting, focusing on regression problems with quadratic loss. For these methods, we can decompose the prediction rule into a simple component that is useful for prediction and a spiky component that is useful for overfitting but, in a favourable setting, does not harm prediction accuracy. We focus specifically on the linear regime for neural networks, where the network can be approximated by a linear model. In this regime, we demonstrate the success of gradient flow, and we consider benign overfitting with two-layer networks, giving an exact asymptotic analysis that precisely demonstrates the impact of overparametrization. We conclude by highlighting the key challenges that arise in extending these insights to realistic deep learning settings.
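One of the claims above, that gradient methods can lead to minimal norm functions that perfectly fit the training data, can be illustrated in the simplest overparametrized setting. The following is a minimal sketch, not the article's own construction: it assumes a plain linear model with more parameters than samples, synthetic Gaussian data, and gradient descent started from zero. The function name `gradient_descent_least_squares` and all sizes and step counts are hypothetical choices made for illustration.

```python
import numpy as np


def gradient_descent_least_squares(X, y, steps=2000):
    """Run plain gradient descent on 0.5 * ||X w - y||^2, starting from w = 0."""
    d = X.shape[1]
    # Step size 1 / L, where L = sigma_max(X)^2 is the Lipschitz constant of the gradient.
    step = 1.0 / np.linalg.norm(X, 2) ** 2
    w = np.zeros(d)
    for _ in range(steps):
        w -= step * X.T @ (X @ w - y)
    return w


# Assumed toy data: n = 20 samples, d = 200 parameters (overparametrized).
rng = np.random.default_rng(0)
n, d = 20, 200
X = rng.standard_normal((n, d))
y = rng.standard_normal(n)

w_gd = gradient_descent_least_squares(X, y)

# Minimum Euclidean norm interpolant, computed via the pseudoinverse.
w_min_norm = np.linalg.pinv(X) @ y

# Both quantities should be close to zero.
train_residual = np.linalg.norm(X @ w_gd - y)
gap_to_min_norm = np.linalg.norm(w_gd - w_min_norm)
print(train_residual, gap_to_min_norm)
```

Because the iterates start at zero and every update lies in the row space of `X`, the limit is the interpolating solution of smallest Euclidean norm. This is the kind of implicit regularization the abstract refers to, shown here only for a linear model.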
