Section: New Results
Minimizing Finite Sums with the Stochastic Average Gradient.
Participants: Mark Schmidt [correspondent], Nicolas Le Roux, Francis Bach.
In this work we propose the stochastic average gradient (SAG) method for optimizing the sum of a finite number of smooth convex functions. Like stochastic gradient (SG) methods, the SAG method's iteration cost is independent of the number of terms in the sum. However, by incorporating a memory of previous gradient values, the SAG method achieves a faster convergence rate than black-box SG methods. The convergence rate is improved from $O(1/\sqrt{k})$ to $O(1/k)$ in general, and when the sum is strongly convex the convergence rate is improved from this sub-linear rate to a linear convergence rate of the form $O(\rho^k)$ for some $\rho < 1$. Further, in many cases the convergence rate of the new method is also faster than that of black-box deterministic gradient methods, in terms of the number of gradient evaluations. Numerical experiments indicate that the new algorithm often dramatically outperforms existing SG and deterministic gradient methods.
The primary contribution of this work is the analysis of a new algorithm that we call the stochastic average gradient (SAG) method, a randomized variant of the earlier incremental aggregated gradient (IAG) method. The SAG method has the low iteration cost of SG methods, but achieves the convergence rates stated above for the full gradient (FG) method. The SAG iterations take the form
$$x^{k+1} = x^{k} - \frac{\alpha_k}{n} \sum_{i=1}^{n} y_i^{k},$$
where at each iteration a random index $i_k$ is selected and we set $y_i^{k} = f_i'(x^{k})$ if $i = i_k$, and $y_i^{k} = y_i^{k-1}$ otherwise. That is, like the FG method, the step incorporates a gradient with respect to each function $f_i$. But, like the SG method, each iteration only computes the gradient with respect to a single example, and the cost of the iterations is independent of $n$. Despite the low cost of the SAG iterations, we show in this paper that with a constant step-size the SAG iterations have an $O(1/k)$ convergence rate for convex objectives and a linear convergence rate for strongly-convex objectives, like the FG method. That is, by having access to $n$ and by keeping a memory of the most recent gradient value computed for each index $i$, this iteration achieves a faster convergence rate than is possible for standard SG methods. Further, in terms of effective passes through the data, we will also see that for many problems the convergence rate of the SAG method is also faster than is possible for standard FG methods.
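The iteration above can be sketched in a few lines of NumPy. The following is a minimal illustration (not the authors' released code), assuming for concreteness a least-squares objective $f_i(x) = \frac{1}{2}(a_i^\top x - b_i)^2$; the function name, step-size choice, and iteration count are illustrative. The key implementation point is that the sum of the stored gradients is maintained incrementally, so each iteration costs $O(d)$ rather than $O(nd)$.

```python
import numpy as np

def sag_least_squares(A, b, step, n_iters, seed=0):
    """Sketch of the SAG iteration for f(x) = (1/n) * sum_i 0.5*(a_i^T x - b_i)^2.

    Keeps a table y holding the most recent gradient computed for each
    index i, plus the running sum of that table, so one iteration touches
    only a single example and costs O(d).
    """
    rng = np.random.default_rng(seed)
    n, d = A.shape
    x = np.zeros(d)
    y = np.zeros((n, d))   # memory: last gradient seen for each index i
    g_sum = np.zeros(d)    # running sum of the rows of y
    for _ in range(n_iters):
        i = rng.integers(n)                # select a random index i_k
        g_new = (A[i] @ x - b[i]) * A[i]   # gradient of f_i at the current x
        g_sum += g_new - y[i]              # update the stored sum in O(d)
        y[i] = g_new                       # refresh the memory for index i
        x -= (step / n) * g_sum            # x^{k+1} = x^k - (alpha/n) * sum_i y_i^k
    return x
```

A constant step-size on the order of $1/L$, with $L$ the largest Lipschitz constant of the $f_i'$ (here $\max_i \|a_i\|^2$), is the kind of choice the analysis permits; in practice larger steps often work well.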