Section: New Results
Fast and Faster Convergence of SGD for Over-Parameterized Models (and an Accelerated Perceptron).
Modern machine learning focuses on highly expressive models that are able to fit or interpolate the data completely, resulting in zero training loss. For such models, we show that the stochastic gradients of common loss functions satisfy a strong growth condition.
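For reference, a common way to write a strong growth condition for a finite-sum objective $f(w) = \tfrac{1}{n}\sum_{i=1}^{n} f_i(w)$ is the following, where the expectation is over an index $i$ drawn uniformly at random and $\rho$ is a constant; the exact formulation used in the paper may differ slightly:

\[
\mathbb{E}_i\!\left[\|\nabla f_i(w)\|^2\right] \;\le\; \rho\,\|\nabla f(w)\|^2 \qquad \text{for all } w.
\]

Interpolation makes a bound of this kind plausible: if every individual loss $f_i$ is minimized at the same point, then the stochastic gradients vanish wherever the full gradient does.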
Under this condition, we prove that constant step-size stochastic gradient descent (SGD) with Nesterov acceleration matches the convergence rate of the deterministic accelerated method for both convex and strongly-convex functions.
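As an illustration of the kind of method being analyzed, the following is a minimal Python sketch of constant step-size SGD with a Nesterov-style extrapolation step. The interface grad_i(w, i) and the fixed step_size and momentum parameters are hypothetical placeholders; the paper works with a specific parameterization of Nesterov's accelerated method, which this sketch does not claim to reproduce exactly.

    import numpy as np

    def nesterov_sgd(grad_i, w0, n, step_size, momentum, num_iters, seed=0):
        """Sketch of constant step-size SGD with Nesterov-style momentum.

        grad_i(w, i) is assumed to return the gradient of the i-th loss
        term at w (a hypothetical interface, not one from the paper).
        """
        rng = np.random.default_rng(seed)
        w = np.asarray(w0, dtype=float).copy()
        w_prev = w.copy()
        for _ in range(num_iters):
            # Extrapolate using the previous two iterates (Nesterov-style momentum).
            v = w + momentum * (w - w_prev)
            i = rng.integers(n)                   # sample one loss term uniformly
            g = grad_i(v, i)                      # stochastic gradient at the extrapolated point
            w_prev, w = w, v - step_size * g      # constant step-size update
        return w

Setting momentum to zero recovers plain constant step-size SGD, the method the abstract analyzes under the weaker growth condition below.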
We also show that this condition implies that SGD can find a first-order stationary point as efficiently as full gradient descent in non-convex settings.
Under interpolation, we further show that all smooth loss functions with a finite-sum structure satisfy a weaker growth condition. Given this weaker condition, we prove that SGD with a constant step-size attains the deterministic convergence rate in both the strongly-convex and convex settings.
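Again for reference (notation assumed here, not given in the abstract), the weaker growth condition is commonly stated in terms of the suboptimality of the full objective, with $L$ the smoothness constant, $w^*$ a minimizer, and $\rho$ a constant:

\[
\mathbb{E}_i\!\left[\|\nabla f_i(w)\|^2\right] \;\le\; 2\rho L\,\bigl(f(w) - f(w^*)\bigr) \qquad \text{for all } w.
\]

This is weaker in the sense that, for an $L$-smooth function, $\|\nabla f(w)\|^2 \le 2L\,\bigl(f(w) - f(w^*)\bigr)$, so any objective satisfying the strong growth condition above also satisfies a condition of this form.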
Under additional assumptions, the above results enable us to prove an