Section: New Results
Fast Convergence of Stochastic Gradient Descent under a Strong Growth Condition
Participants: Mark Schmidt [correspondent], Nicolas Le Roux [correspondent].
In [33] we consider optimizing a smooth convex function $g$ that is the average of a set of $N$ differentiable functions $f_i$, under the assumption considered by [87] and [90] that the norm of each gradient $f_i'(x)$ is bounded by a linear function of the norm of the average gradient $g'(x)$. We show that under these assumptions the basic stochastic gradient method with a sufficiently small constant step size has an $O(1/k)$ convergence rate, and has a linear convergence rate if $g$ is strongly convex.
We write our problem as
\[
\min_{x \in \mathbb{R}^P} \; g(x) := \frac{1}{N} \sum_{i=1}^{N} f_i(x),
\]
where we assume that $g$ is convex and its gradient $g'$ is Lipschitz-continuous with constant $L$, meaning that for all $x$ and $y$ we have
\[
\| g'(x) - g'(y) \| \le L \, \| x - y \| .
\]
If $g$ is twice-differentiable, these assumptions are equivalent to assuming that the eigenvalues of the Hessian $g''(x)$ are bounded between $0$ and $L$ for all $x$.
Deterministic gradient methods for problems of this form use the iteration
\[
x_{k+1} = x_k - \alpha_k \, g'(x_k),
\]
for a sequence of step sizes $\{\alpha_k\}$. In contrast, stochastic gradient methods use the iteration
\[
x_{k+1} = x_k - \alpha_k \, f_{i_k}'(x_k),
\]
for an individual data sample $i_k$ selected uniformly at random from the set $\{1, 2, \ldots, N\}$.
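As a concrete illustration, here is a minimal sketch (not code from [33]) of the two iterations on a toy realizable least-squares problem with $f_i(x) = \tfrac{1}{2}(a_i^\top x - b_i)^2$; all names and numerical values are illustrative. The realizable choice $b = A x^*$ makes every individual gradient vanish at the solution, which is necessary for the strong growth condition discussed below and is the regime in which a constant step size also makes sense for the stochastic method.

```python
import numpy as np

# Toy average-of-functions problem: g(x) = (1/N) * sum_i 0.5*(a_i^T x - b_i)^2.
rng = np.random.default_rng(0)
N, P = 100, 5
A = rng.standard_normal((N, P))
x_star = rng.standard_normal(P)
b = A @ x_star                      # realizable: f_i'(x_star) = 0 for every i

def grad_i(x, i):
    """Gradient of the individual term f_i at x."""
    return (A[i] @ x - b[i]) * A[i]

def grad_full(x):
    """Gradient of the average g at x."""
    return A.T @ (A @ x - b) / N

alpha = 0.02                        # constant step size (illustrative value)
x_det = np.zeros(P)                 # deterministic gradient iterate
x_sto = np.zeros(P)                 # stochastic gradient iterate
for k in range(2000):
    x_det = x_det - alpha * grad_full(x_det)      # x_{k+1} = x_k - alpha * g'(x_k)
    i_k = rng.integers(N)                         # i_k uniform on {1, ..., N}
    x_sto = x_sto - alpha * grad_i(x_sto, i_k)    # x_{k+1} = x_k - alpha * f'_{i_k}(x_k)

print("deterministic error:", np.linalg.norm(x_det - x_star))
print("stochastic error:   ", np.linalg.norm(x_sto - x_star))
```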
The stochastic gradient method is appealing because the cost of its iterations is independent of $N$. However, in order to guarantee convergence, stochastic gradient methods require a decreasing sequence of step sizes, and this leads to a slower convergence rate. In particular, for convex objective functions the stochastic gradient method with a decreasing sequence of step sizes has an expected error on iteration $k$ of $O(1/\sqrt{k})$ [78], meaning that
\[
\mathbb{E}[g(x_k)] - g(x^*) = O(1/\sqrt{k}) .
\]
In contrast, the deterministic gradient method with a constant step size has a smaller error of $O(1/k)$ [79]. The situation is more dramatic when $g$ is strongly convex, meaning that
\[
g(y) \ge g(x) + \langle g'(x), y - x \rangle + \frac{\mu}{2} \, \| y - x \|^2
\]
for all $x$ and $y$ and some $\mu > 0$. For twice-differentiable functions, this is equivalent to assuming that the eigenvalues of the Hessian $g''(x)$ are bounded below by $\mu$. For strongly convex objective functions, the stochastic gradient method with a decreasing sequence of step sizes has an error of $O(1/k)$ [77], while the deterministic method with a constant step size has a linear convergence rate. In particular, the deterministic method satisfies
\[
g(x_k) - g(x^*) = O(\rho^k)
\]
for some $\rho < 1$ [71].
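For reference, the four rates just discussed can be summarized as follows (constants omitted, $\rho < 1$):
\[
\begin{array}{lcc}
 & \text{stochastic (decreasing steps)} & \text{deterministic (constant step)} \\
\text{convex} & O(1/\sqrt{k}) & O(1/k) \\
\text{strongly convex} & O(1/k) & O(\rho^k)
\end{array}
\]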
We show that if the individual gradients $f_i'$ satisfy a certain strong growth condition relative to the full gradient $g'$, the stochastic gradient method with a sufficiently small constant step size achieves (in expectation) the convergence rates stated above for the deterministic gradient method.
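Written out, this growth condition restates the assumption from the opening paragraph: the norm of each individual gradient is bounded by a linear function of the norm of the full gradient, i.e., for some constant $B$,
\[
\| f_i'(x) \| \le B \, \| g'(x) \| \qquad \text{for all } x \text{ and all } i .
\]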