## Section: New Results

### Fast Convergence of Stochastic Gradient Descent under a Strong Growth Condition

Participants: Mark Schmidt [correspondent], Nicolas Le Roux [correspondent].

In [33] we consider optimizing a smooth convex function $f$ that is the average of a set of differentiable functions ${f}_{i}$, under the assumption considered by [87] and [90] that the norm of each gradient ${f}_{i}^{\text{'}}$ is bounded by a linear function of the norm of the average gradient ${f}^{\text{'}}$. We show that under these assumptions the basic stochastic gradient method with a sufficiently small constant step size has an $O(1/k)$ convergence rate, and has a linear convergence rate if $f$ is strongly convex.

We write our problem

$\underset{x\in {\mathbb{R}}^{P}}{min}f\left(x\right):=\frac{1}{N}\sum _{i=1}^{N}{f}_{i}\left(x\right),$ | (2) |

where we assume that $f$ is convex and its gradient ${f}^{\text{'}}$ is Lipschitz-continuous with constant $L$, meaning that for all $x$ and $y$ we have

$\|{f}^{\text{'}}\left(x\right)-{f}^{\text{'}}\left(y\right)\|\le L\|x-y\|.$

If $f$ is twice-differentiable, these assumptions are equivalent to assuming that the eigenvalues of the Hessian ${f}^{\text{'}\text{'}}\left(x\right)$ are bounded between 0 and $L$ for all $x$.
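For a quadratic $f(x)=\frac{1}{2}x^{\top}Hx$ this equivalence is easy to check numerically: the Hessian is the constant matrix $H$, and its largest eigenvalue serves as the Lipschitz constant of ${f}^{\text{'}}$. The following sketch (ours, not from [33]) verifies the inequality on random pairs of points:

```python
import numpy as np

rng = np.random.default_rng(1)
M = rng.standard_normal((4, 4))
H = M.T @ M                            # positive semidefinite Hessian of f(x) = 0.5 x^T H x
L = float(np.linalg.eigvalsh(H)[-1])   # largest eigenvalue: Lipschitz constant of f'

def grad(x):
    return H @ x                       # f'(x) for this quadratic

# Check ||f'(x) - f'(y)|| <= L ||x - y|| on 100 random pairs (x, y);
# the small slack guards against floating-point rounding.
ok = all(
    np.linalg.norm(grad(p[0]) - grad(p[1])) <= L * np.linalg.norm(p[0] - p[1]) + 1e-12
    for p in (rng.standard_normal((2, 4)) for _ in range(100))
)
```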

*Deterministic gradient* methods for problems of this form use the iteration

${x}_{k+1}={x}_{k}-{\alpha}_{k}{f}^{\text{'}}\left({x}_{k}\right),$

for a sequence of step sizes ${\alpha}_{k}$. In contrast, *stochastic gradient* methods use the iteration

${x}_{k+1}={x}_{k}-{\alpha}_{k}{f}_{i}^{\text{'}}\left({x}_{k}\right),$

for an individual data sample $i$ selected uniformly at random from the set $\{1,2,\cdots ,N\}$.
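As an illustration (our sketch, not code from [33]), the two update rules can be written side by side for a least-squares instance of (2) with ${f}_{i}(x)=\frac{1}{2}{({a}_{i}^{\top}x-{b}_{i})}^{2}$; the helper names `grad_i` and `grad_full` are ours:

```python
import numpy as np

rng = np.random.default_rng(0)
N, P = 50, 5
A = rng.standard_normal((N, P))      # rows a_i of the data for f_i(x) = 0.5 (a_i^T x - b_i)^2
b = rng.standard_normal(N)

def grad_i(x, i):
    return A[i] * (A[i] @ x - b[i])  # gradient of the single term f_i

def grad_full(x):
    return A.T @ (A @ x - b) / N     # gradient of the average f

# Constant step size small enough for every individual f_i
# (each f_i' has Lipschitz constant ||a_i||^2).
alpha = 1.0 / np.sum(A**2, axis=1).max()

x_det = np.zeros(P)                  # deterministic gradient iterate
x_sto = np.zeros(P)                  # stochastic gradient iterate
for k in range(2000):
    x_det = x_det - alpha * grad_full(x_det)
    i = int(rng.integers(N))         # sample i uniformly from {1, ..., N}
    x_sto = x_sto - alpha * grad_i(x_sto, i)
```

Each stochastic step touches one row of the data, so its cost does not grow with $N$, while the deterministic step costs $O(NP)$.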

The stochastic gradient method is appealing because the cost of its iterations is *independent of* $N$. However, in order to guarantee convergence, stochastic gradient methods require a decreasing sequence of step sizes $\left\{{\alpha}_{k}\right\}$, and this leads to a slower convergence rate. In particular, for convex objective functions the stochastic gradient method with a decreasing sequence of step sizes has an expected error on iteration $k$ of $O(1/\sqrt{k})$ [78], meaning that

$\mathbb{E}\left[f\left({x}_{k}\right)\right]-f\left({x}^{*}\right)=O\left(1/\sqrt{k}\right).$

In contrast, the deterministic gradient method with a *constant* step size has a smaller error of $O(1/k)$ [79] . The situation is more dramatic when $f$ is *strongly* convex, meaning that

$f\left(y\right)\ge f\left(x\right)+\langle {f}^{\text{'}}\left(x\right),y-x\rangle +\frac{\mu}{2}{\|y-x\|}^{2},$ | (5) |

for all $x$ and $y$ and some $\mu >0$. For twice-differentiable functions, this is equivalent to assuming that the eigenvalues of the Hessian are bounded below by $\mu $. For strongly convex objective functions, the stochastic gradient method with a decreasing sequence of step sizes has an error of $O(1/k)$ [77], while the deterministic method with a constant step size has a *linear* convergence rate. In particular, the deterministic method satisfies

$f\left({x}_{k}\right)-f\left({x}^{*}\right)=O\left({\rho}^{k}\right),$

for some $\rho <1$ [71] .
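For a strongly convex quadratic, the linear rate of the constant-step deterministic method can be observed directly. The sketch below (ours, for illustration) runs gradient descent with step size $1/L$ and checks the bound $f({x}_{k})-f({x}^{*})\le {\rho}^{k}\left(f({x}_{0})-f({x}^{*})\right)$ with $\rho =1-\mu /L$:

```python
import numpy as np

rng = np.random.default_rng(2)
M = rng.standard_normal((6, 4))
H = M.T @ M + 0.1 * np.eye(4)     # strongly convex quadratic: eigenvalues in [mu, L]
eigs = np.linalg.eigvalsh(H)
mu, L = float(eigs[0]), float(eigs[-1])
rho = 1.0 - mu / L                # linear rate for the constant step size 1/L

def f(x):
    return 0.5 * x @ H @ x        # minimized at x^* = 0 with f(x^*) = 0

x = np.ones(4)
gap0 = f(x)                       # initial suboptimality f(x_0) - f(x^*)
for k in range(200):
    x = x - (1.0 / L) * (H @ x)   # deterministic gradient step with constant step 1/L
gap = f(x)                        # final suboptimality f(x_200) - f(x^*)
```

For this quadratic the per-iteration contraction is in fact ${(1-\mu /L)}^{2}$ in the function values, so the bound above holds with room to spare.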

We show that if the individual gradients ${f}_{i}^{\text{'}}\left({x}_{k}\right)$ satisfy a certain strong growth condition relative to the full gradient ${f}^{\text{'}}\left({x}_{k}\right)$, the stochastic gradient method with a sufficiently small constant step size achieves (in expectation) the convergence rates stated above for the deterministic gradient method.
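One standard setting where such a strong growth condition holds is a consistent least-squares problem, $b=A{x}^{*}$: every individual gradient vanishes at ${x}^{*}$, so each $\|{f}_{i}^{\text{'}}(x)\|$ is bounded by a constant (depending on the data) times $\|{f}^{\text{'}}(x)\|$. The following sketch (ours, not the paper's code) illustrates empirically that the constant-step stochastic gradient method then drives the error to numerical precision:

```python
import numpy as np

rng = np.random.default_rng(3)
N, P = 40, 5
A = rng.standard_normal((N, P))
x_star = rng.standard_normal(P)
b = A @ x_star                      # consistent system: every f_i'(x_star) = 0,
                                    # so a strong growth condition holds here

alpha = 1.0 / np.sum(A**2, axis=1).max()   # constant step, safe for each f_i
x = np.zeros(P)
err0 = float(np.linalg.norm(x - x_star))
for k in range(5000):
    i = int(rng.integers(N))
    x = x - alpha * A[i] * (A[i] @ x - b[i])   # constant-step stochastic gradient
err = float(np.linalg.norm(x - x_star))
```

Without the consistency assumption (generic $b$), the same constant-step iteration would stall at a noise floor rather than converge, which is why decreasing step sizes are needed in general.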