Section:
New Results
Minimizing Finite Sums with the Stochastic Average Gradient.
Participants:
Mark Schmidt [correspondent], Nicolas Le Roux, Francis Bach.
In [32] we propose the stochastic average gradient (SAG) method for optimizing
the sum of a finite number of smooth convex functions. Like stochastic
gradient (SG) methods, the SAG method's iteration cost is independent of the
number of terms in the sum. However, by incorporating a memory of previous
gradient values, the SAG method achieves a faster convergence rate than black-box SG methods. The convergence rate is improved from $O(1/\sqrt{k})$ to $O(1/k)$ in general, and when the sum is strongly convex the convergence rate is improved from the sub-linear $O(1/k)$ to a linear convergence rate of the form $O(\rho^{k})$ for $\rho < 1$.
Further, in many cases the convergence rate of the new method is also faster than that of black-box deterministic gradient methods, in terms of the number of gradient evaluations.
Numerical experiments indicate that the new algorithm often dramatically
outperforms existing SG and deterministic gradient methods.
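For reference, the finite-sum problem considered in [32] can be written as
\[
\min_{x \in \mathbb{R}^p} \; g(x) := \frac{1}{n} \sum_{i=1}^{n} f_i(x),
\]
where each $f_i$ is smooth and convex; the quantities $n$ and $f_i$ are used with this meaning in the remainder of this section.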
The primary contribution of this work is the analysis of a new algorithm that we call the stochastic average
gradient (SAG) method, a randomized variant of the incremental aggregated
gradient (IAG) method of [43]. The SAG method has the low iteration cost of SG methods, but achieves the convergence rates stated above for the full gradient (FG) method. The SAG iterations take the form
\[
x^{k+1} = x^{k} - \frac{\alpha_k}{n} \sum_{i=1}^{n} y_i^{k},
\]
where at each iteration a random index $i_k$ is selected and we set $y_i^{k} = f_i'(x^{k})$ if $i = i_k$, and $y_i^{k} = y_i^{k-1}$ otherwise.
That is, like the FG method, the step incorporates a gradient with respect
to
each function. But, like the SG method, each iteration only computes
the gradient with respect to a single example and the cost of the
iterations is independent of $n$. Despite the low cost of the SAG iterations, we show in [32] that, with a constant step size, the SAG iterations have an $O(1/k)$ convergence rate for convex objectives and a linear convergence rate for strongly convex objectives, like the FG method.
That is, by having access to $n$ and by keeping a memory of the most recent gradient value computed for each index $i$, this iteration achieves a faster convergence rate than is possible for standard SG methods.
Further, in terms of effective passes through the data, for many problems the convergence rate of the SAG method is also faster than is possible for standard FG methods.
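To make the update above concrete, the following is a minimal Python sketch of the SAG iteration applied to a least-squares finite sum; the problem instance, step-size choice, and function names are illustrative assumptions rather than material from [32].
\begin{verbatim}
# Minimal sketch of the SAG iteration for a least-squares finite sum,
# g(x) = (1/n) * sum_i 0.5 * (a_i^T x - b_i)^2.
# The problem instance, step size, and names are illustrative assumptions.
import numpy as np


def sag(A, b, step_size, num_iters, seed=0):
    """Run SAG on the terms f_i(x) = 0.5*(a_i^T x - b_i)^2; return the iterate."""
    rng = np.random.default_rng(seed)
    n, d = A.shape
    x = np.zeros(d)
    y = np.zeros((n, d))   # y_i: most recent gradient computed for index i
    y_sum = np.zeros(d)    # running sum of the stored gradients, sum_i y_i
    for _ in range(num_iters):
        i = rng.integers(n)                  # pick a random index i_k
        grad_i = (A[i] @ x - b[i]) * A[i]    # fresh gradient f_i'(x^k)
        y_sum += grad_i - y[i]               # update sum_i y_i incrementally
        y[i] = grad_i                        # overwrite the stored gradient
        x -= (step_size / n) * y_sum         # x^{k+1} = x^k - (alpha/n) sum_i y_i
    return x


# Example: recover a planted linear model from noiseless observations.
rng = np.random.default_rng(1)
A = rng.standard_normal((200, 5))
x_true = rng.standard_normal(5)
b = A @ x_true
L_max = np.max(np.sum(A ** 2, axis=1))   # largest per-example Lipschitz constant
x_hat = sag(A, b, step_size=1.0 / L_max, num_iters=20000)  # illustrative step size
print(np.linalg.norm(x_hat - x_true))    # should be close to zero
\end{verbatim}
Note how the running sum of stored gradients is maintained incrementally, so each iteration costs work proportional to the dimension of $x$ regardless of $n$, at the price of storing one gradient per index.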