Section: New Results

Fast Convergence of Stochastic Gradient Descent under a Strong Growth Condition

Participants : Mark Schmidt [correspondent] , Nicolas Le Roux [correspondent] .

In [33] we consider optimizing a smooth convex function f that is the average of a set of differentiable functions fi, under the assumption considered by  [87] and  [90] that the norm of each gradient fi' is bounded by a linear function of the norm of the average gradient f'. We show that under this assumption the basic stochastic gradient method with a sufficiently small constant step size has an O(1/k) convergence rate, and has a linear convergence rate if f is strongly convex.

We write our problem

\min_{x \in \mathbb{R}^P} f(x) := \frac{1}{N} \sum_{i=1}^{N} f_i(x), \qquad (2)

where we assume that f is convex and its gradient f' is Lipschitz-continuous with constant L, meaning that for all x and y we have

\|f'(x) - f'(y)\| \le L \|x - y\|.

If f is twice-differentiable, these assumptions are equivalent to assuming that the eigenvalues of the Hessian f''(x) are bounded between 0 and L for all x.
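As a minimal sketch of what the smoothness assumption says, consider a hypothetical average of 1-D quadratics f_i(x) = 0.5·a_i·(x − b_i)²; then f'(x) is linear and its Lipschitz constant is L = mean(a_i). The coefficients below are illustrative, not from the paper:

```python
import random

# Hypothetical example: f(x) = (1/N) * sum_i 0.5 * a_i * (x - b_i)**2,
# whose gradient f'(x) = (1/N) * sum_i a_i * (x - b_i) is Lipschitz
# continuous with constant L = mean(a_i).
a = [1.0, 2.0, 3.0]
b = [0.0, 1.0, -1.0]
N = len(a)
L = sum(a) / N

def grad_f(x):
    """Full gradient f'(x) of the averaged objective."""
    return sum(a[i] * (x - b[i]) for i in range(N)) / N

# Numeric check of the Lipschitz condition |f'(x) - f'(y)| <= L |x - y|.
random.seed(0)
for _ in range(100):
    x, y = random.uniform(-5, 5), random.uniform(-5, 5)
    assert abs(grad_f(x) - grad_f(y)) <= L * abs(x - y) + 1e-12
```

For this quadratic the bound holds with equality, since f'' is constant and equal to L everywhere.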

Deterministic gradient methods for problems of this form use the iteration

x_{k+1} = x_k - \alpha_k f'(x_k), \qquad (3)

for a sequence of step sizes αk. In contrast, stochastic gradient methods use the iteration

x_{k+1} = x_k - \alpha_k f_i'(x_k), \qquad (4)

for an individual data sample i selected uniformly at random from the set {1, 2, …, N}.
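The two iterations (3) and (4) can be sketched side by side on a hypothetical 1-D problem (the coefficients and step sizes below are illustrative assumptions, not values from the paper):

```python
import random

# Illustrative objective: f_i(x) = 0.5 * a_i * (x - b_i)**2, so that
# f(x) = (1/N) * sum_i f_i(x) is a smooth convex quadratic whose
# minimizer is x* = sum(a_i * b_i) / sum(a_i).
a = [1.0, 2.0, 3.0]
b = [0.0, 1.0, 2.0]
N = len(a)
x_star = sum(ai * bi for ai, bi in zip(a, b)) / sum(a)

def grad_f(x):           # full gradient f'(x), used by iteration (3)
    return sum(a[i] * (x - b[i]) for i in range(N)) / N

def grad_fi(x, i):       # single-sample gradient f_i'(x), iteration (4)
    return a[i] * (x - b[i])

# Deterministic gradient iteration (3) with a constant step size.
x = 5.0
for k in range(200):
    x = x - 0.1 * grad_f(x)

# Stochastic gradient iteration (4): i sampled uniformly from {1,...,N},
# with the decreasing step sizes 1/(k+1) that convergence requires.
random.seed(0)
y = 5.0
for k in range(2000):
    i = random.randrange(N)
    y = y - (1.0 / (k + 1)) * grad_fi(y, i)
```

Note that each deterministic step touches all N samples while each stochastic step touches one, which is the cost advantage discussed next.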

The stochastic gradient method is appealing because the cost of its iterations is independent of N. However, in order to guarantee convergence, stochastic gradient methods require a decreasing sequence of step sizes {αk}, and this leads to a slower convergence rate. In particular, for convex objective functions the stochastic gradient method with a decreasing sequence of step sizes has an expected error on iteration k of O(1/\sqrt{k})  [78] , meaning that

\mathbb{E}[f(x_k)] - f(x^*) = O(1/\sqrt{k}).

In contrast, the deterministic gradient method with a constant step size has a smaller error of O(1/k)  [79] . The situation is more dramatic when f is strongly convex, meaning that

f(y) \ge f(x) + \langle f'(x), y - x \rangle + \frac{\mu}{2} \|y - x\|^2, \qquad (5)

for all x and y and some μ>0. For twice-differentiable functions, this is equivalent to assuming that the eigenvalues of the Hessian are bounded below by μ. For strongly convex objective functions, the stochastic gradient method with a decreasing sequence of step sizes has an error of O(1/k)  [77] , while the deterministic method with a constant step size has a linear convergence rate. In particular, the deterministic method satisfies

f(x_k) - f(x^*) \le \rho^k \, [f(x_0) - f(x^*)],

for some ρ<1  [71] .
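The linear rate is easy to observe numerically. Below is a minimal sketch on a strongly convex 1-D quadratic (the curvature h and step size are illustrative choices, and for a quadratic the per-step contraction factor can be computed exactly):

```python
# Strongly convex quadratic f(x) = 0.5 * h * x**2, minimized at x* = 0,
# with mu = L = h. Gradient descent with constant step alpha contracts
# x_k by (1 - alpha*h) per step, so f(x_k) - f(x*) shrinks by
# rho = (1 - alpha*h)**2 per step.
h = 2.0
alpha = 0.25                   # constant step size, alpha < 2/h
rho = (1.0 - alpha * h) ** 2   # per-step contraction of the gap

x = 4.0
gaps = []
for k in range(10):
    gaps.append(0.5 * h * x ** 2)   # f(x_k) - f(x*)
    x = x - alpha * h * x           # deterministic gradient step (3)

# For this quadratic, gap_k = rho**k * gap_0 holds exactly.
```

Here rho = 0.25, so the suboptimality gap drops by a factor of four at every iteration, exactly the geometric decay ρ^k stated above.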

We show that if the individual gradients fi'(xk) satisfy a certain strong growth condition relative to the full gradient f'(xk), the stochastic gradient method with a sufficiently small constant step size achieves (in expectation) the convergence rates stated above for the deterministic gradient method.
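A simple setting where such a strong growth condition holds is the interpolation regime, where every f_i shares the same minimizer, so each ||f_i'(x)|| is bounded by a constant multiple of ||f'(x)||. The sketch below uses hypothetical 1-D quadratics with a common minimizer c (all values illustrative) to show constant-step stochastic gradient converging linearly, with no step-size decay:

```python
import random

# Interpolation regime: f_i(x) = 0.5 * a_i * (x - c)**2 all share the
# minimizer c, so |f_i'(x)| = a_i * |x - c| <= (max(a)/mean(a)) * |f'(x)|
# and the strong growth condition holds with constant max(a)/mean(a).
a = [1.0, 2.0, 3.0]
c = 1.5
alpha = 0.25   # sufficiently small CONSTANT step size (alpha < 2/max(a))

random.seed(0)
x = 10.0
for k in range(200):
    i = random.randrange(len(a))
    # Constant-step stochastic gradient iteration (4): every step
    # contracts (x - c) by a factor (1 - alpha * a_i) in (0, 1).
    x = x - alpha * a[i] * (x - c)
```

Because each sampled gradient vanishes at the shared minimizer, the constant step size introduces no residual noise, and x converges to c at a linear rate, matching the deterministic behavior described above.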