## Section: New Results

### Minimizing Finite Sums with the Stochastic Average Gradient

Participants: Mark Schmidt [correspondent], Nicolas Le Roux, Francis Bach.

In [32] we propose the stochastic average gradient (SAG) method for optimizing the sum of a finite number of smooth convex functions. Like stochastic gradient (SG) methods, the SAG method's iteration cost is independent of the number of terms in the sum. However, by incorporating a memory of previous gradient values, the SAG method achieves a faster convergence rate than black-box SG methods. The convergence rate is improved from $O(1/\sqrt{k})$ to $O(1/k)$ in general, and when the sum is strongly-convex the convergence rate is improved from the sub-linear $O(1/k)$ to a linear convergence rate of the form $O\left({\rho}^{k}\right)$ for $\rho <1$. Further, in many cases the convergence rate of the new method is also faster than that of black-box deterministic gradient methods, in terms of the number of gradient evaluations. Numerical experiments indicate that the new algorithm often dramatically outperforms existing SG and deterministic gradient methods.

The primary contribution of this work is the analysis of a new algorithm that we call the *stochastic average
gradient* (SAG) method, a randomized variant of the incremental aggregated
gradient (IAG) method of [43]. The SAG method has the low
iteration cost of SG methods, but achieves the convergence rates stated above for the full gradient (FG)
method. The SAG iterations take the form

$${x}^{k+1}={x}^{k}-\frac{{\alpha}_{k}}{n}\sum_{i=1}^{n}{y}_{i}^{k},$$

where at each iteration a random index ${i}_{k}$ is selected and we
set ${y}_{i}^{k}={f}_{i}^{\prime}\left({x}^{k}\right)$ if $i={i}_{k}$, and ${y}_{i}^{k}={y}_{i}^{k-1}$ otherwise.
That is, like the FG method, the step incorporates a gradient with respect
to
each function. But, like the SG method, each iteration only computes
the gradient with respect to a single example and the cost of the
iterations is independent of $n$. Despite the low cost of the SAG
iterations, we show in this paper that with a constant step-size *the SAG iterations have an $O(1/k)$ convergence rate for convex objectives and a linear convergence rate for strongly-convex objectives*, like the FG method.
That is, by having access to
${i}_{k}$ and by keeping a *memory* of the most recent gradient value
computed for each index $i$, this iteration achieves a faster
convergence rate than is possible for standard SG methods.
Further, in terms of effective passes through the data, we will also see that
for many problems the convergence rate of the SAG method is faster than is possible
for standard FG methods.
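
As a rough illustration of the update described above, the following is a minimal Python sketch of the SAG iteration with its gradient memory. The function names (`sag`, `grad_i`), the least-squares toy problem, and the step-size choice `1/L` are illustrative assumptions, not the paper's recommended configuration:

```python
import numpy as np

def sag(grad_i, n, x0, step, iters, seed=0):
    """Sketch of the SAG update: keep a memory y[i] of the most recent
    gradient seen for each index i, and step along the average of the y[i]."""
    rng = np.random.default_rng(seed)
    x = np.array(x0, dtype=float)
    y = np.zeros((n,) + x.shape)    # gradient memory, one slot per f_i
    g_sum = np.zeros_like(x)        # running sum of the y[i]
    for _ in range(iters):
        i = rng.integers(n)         # select a random index i_k
        g = grad_i(i, x)            # gradient of the single term f_i at x^k
        g_sum += g - y[i]           # refresh the sum incrementally, cost independent of n
        y[i] = g
        x -= (step / n) * g_sum     # x^{k+1} = x^k - (alpha_k / n) * sum_i y_i^k
    return x

# Toy usage (illustrative): least squares, f_i(x) = (1/2) * (a_i @ x - b_i)**2
rng = np.random.default_rng(0)
A = rng.standard_normal((50, 3))
x_true = np.array([1.0, -2.0, 0.5])
b = A @ x_true
grad = lambda i, x: (A[i] @ x - b[i]) * A[i]
L = np.max(np.sum(A ** 2, axis=1))  # Lipschitz constant of the worst f_i'
x_hat = sag(grad, n=50, x0=np.zeros(3), step=1.0 / L, iters=10000)
```

Note how each iteration evaluates only one gradient, as in SG, while the step uses the stored gradients of all $n$ terms, as in FG; the running sum makes the per-iteration cost independent of $n$.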