
## Section: New Results

### Fast Convergence of Stochastic Gradient Descent under a Strong Growth Condition

Participants: Mark Schmidt [correspondent], Nicolas Le Roux [correspondent].

In  we consider optimizing a smooth convex function $f$ that is the average of a set of differentiable functions ${f}_{i}$, under the assumption considered by   and   that the norm of each gradient ${f}_{i}^{\text{'}}$ is bounded by a linear function of the norm of the average gradient ${f}^{\text{'}}$. We show that under these assumptions the basic stochastic gradient method with a sufficiently small constant step size has an $O\left(1/k\right)$ convergence rate, and has a linear convergence rate if $f$ is strongly convex.

We write our problem

 $\underset{x\in {ℝ}^{P}}{min}f\left(x\right):=\frac{1}{N}\sum _{i=1}^{N}{f}_{i}\left(x\right),$ (2)

where we assume that $f$ is convex and its gradient ${f}^{\text{'}}$ is Lipschitz-continuous with constant $L$, meaning that for all $x$ and $y$ we have

$||{f}^{\text{'}}\left(x\right)-{f}^{\text{'}}\left(y\right)||\le L||x-y||.$

If $f$ is twice-differentiable, these assumptions are equivalent to assuming that the eigenvalues of the Hessian ${f}^{\text{'}\text{'}}\left(x\right)$ are bounded between 0 and $L$ for all $x$.

Deterministic gradient methods for problems of this form use the iteration

 ${x}_{k+1}={x}_{k}-{\alpha }_{k}{f}^{\text{'}}\left({x}_{k}\right),$ (3)

for a sequence of step sizes ${\alpha }_{k}$. In contrast, stochastic gradient methods use the iteration

 ${x}_{k+1}={x}_{k}-{\alpha }_{k}{f}_{i}^{\text{'}}\left({x}_{k}\right),$ (4)

for an individual data sample $i$ selected uniformly at random from the set $\left\{1,2,\cdots ,N\right\}$.
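The two iterations can be sketched in a few lines of Python. The least-squares instance, step sizes, and iteration counts below are illustrative assumptions, not taken from the report:

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative least-squares instance: f_i(x) = 0.5 * (a_i^T x - b_i)^2
N, P = 50, 5
A = rng.standard_normal((N, P))
b = rng.standard_normal(N)

def grad_i(x, i):
    """Gradient of a single f_i."""
    return (A[i] @ x - b[i]) * A[i]

def grad(x):
    """Full gradient f'(x) = (1/N) * sum_i f_i'(x)."""
    return A.T @ (A @ x - b) / N

# Deterministic gradient iteration (3): x_{k+1} = x_k - alpha_k f'(x_k),
# here with a constant step size.
x_det = np.zeros(P)
alpha = 0.1
for k in range(500):
    x_det = x_det - alpha * grad(x_det)

# Stochastic gradient iteration (4): one random f_i' per step, with a
# decreasing step-size sequence to guarantee convergence.
x_sgd = np.zeros(P)
for k in range(500):
    i = rng.integers(N)            # i uniform on {1, ..., N}
    alpha_k = 1.0 / (k + 10)
    x_sgd = x_sgd - alpha_k * grad_i(x_sgd, i)
```

Each stochastic iteration touches a single row of $A$, so its cost is independent of $N$, while each deterministic iteration costs $O\left(NP\right)$.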

The stochastic gradient method is appealing because the cost of its iterations is independent of $N$. However, in order to guarantee convergence, stochastic gradient methods require a decreasing sequence of step sizes $\left\{{\alpha }_{k}\right\}$, and this leads to a slower convergence rate. In particular, for convex objective functions the stochastic gradient method with a decreasing sequence of step sizes has an expected error on iteration $k$ of $O\left(1/\sqrt{k}\right)$  , meaning that

$𝔼\left[f\left({x}_{k}\right)\right]-f\left({x}^{*}\right)=O\left(1/\sqrt{k}\right).$

In contrast, the deterministic gradient method with a constant step size has a smaller error of $O\left(1/k\right)$   . The situation is more dramatic when $f$ is strongly convex, meaning that

 $f\left(y\right)\ge f\left(x\right)+〈{f}^{\text{'}}\left(x\right),y-x〉+\frac{\mu }{2}||y-x|{|}^{2},$ (5)

for all $x$ and $y$ and some $\mu >0$. For twice-differentiable functions, this is equivalent to assuming that the eigenvalues of the Hessian are bounded below by $\mu$. For strongly convex objective functions, the stochastic gradient method with a decreasing sequence of step sizes has an error of $O\left(1/k\right)$   while the deterministic method with a constant step size has a linear convergence rate. In particular, the deterministic method satisfies

$f\left({x}_{k}\right)-f\left({x}^{*}\right)\le {\rho }^{k}\left[f\left({x}_{0}\right)-f\left({x}^{*}\right)\right],$

for some $\rho <1$   .
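The linear rate above can be checked numerically on a strongly convex quadratic. The instance, the step size $\alpha =1/L$, and the classical contraction factor $\rho =1-\mu /L$ used below are standard illustrative choices, not claims from the report:

```python
import numpy as np

# Strongly convex quadratic f(x) = 0.5 x^T H x with Hessian eigenvalues
# in [mu, L]; the minimizer is x* = 0 with f(x*) = 0.
mu, L = 0.5, 4.0
H = np.diag([mu, 1.0, 2.0, L])
f = lambda x: 0.5 * x @ H @ x

x = np.ones(4)
alpha = 1.0 / L          # constant step size
rho = 1.0 - mu / L       # classical contraction factor for this step size

f0 = f(x)
for k in range(1, 51):
    x = x - alpha * (H @ x)                  # deterministic gradient step
    # linear rate: f(x_k) - f(x*) <= rho^k [f(x_0) - f(x*)]
    assert f(x) <= rho**k * f0 + 1e-12
```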

We show that if the individual gradients ${f}_{i}^{\text{'}}\left({x}_{k}\right)$ satisfy a certain strong growth condition relative to the full gradient ${f}^{\text{'}}\left({x}_{k}\right)$, the stochastic gradient method with a sufficiently small constant step size achieves (in expectation) the convergence rates stated above for the deterministic gradient method.
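One regime where a growth condition of this kind holds is when all the ${f}_{i}$ share a common minimizer, so that every individual gradient vanishes where the full gradient does. The sketch below builds such a consistent least-squares instance, empirically estimates a growth constant $B$ with $||{f}_{i}^{\text{'}}\left(x\right)||\le B||{f}^{\text{'}}\left(x\right)||$ at random points, and runs the stochastic gradient method with a constant step size; the instance, constants, and step size are illustrative assumptions, not the paper's:

```python
import numpy as np

rng = np.random.default_rng(1)

# Consistent instance: f_i(x) = 0.5 (a_i^T x - b_i)^2 with b = A x*,
# so f_i'(x*) = 0 for every i (all f_i share the minimizer x*).
N, P = 40, 5
A = rng.standard_normal((N, P))
x_star = rng.standard_normal(P)
b = A @ x_star

grad_i = lambda x, i: (A[i] @ x - b[i]) * A[i]
grad = lambda x: A.T @ (A @ x - b) / N

# Empirically estimate a growth constant B over random test points
# (a numerical sketch of the condition, not a proof).
B = 0.0
for _ in range(100):
    x = rng.standard_normal(P)
    g = np.linalg.norm(grad(x))
    B = max(B, max(np.linalg.norm(grad_i(x, i)) for i in range(N)) / g)

# Constant-step-size stochastic gradient: in this regime no decreasing
# step-size sequence is needed.
x = np.zeros(P)
for k in range(3000):
    i = rng.integers(N)
    x = x - 0.05 * grad_i(x, i)
```

Since the full gradient is the average of the individual gradients, the estimated $B$ is always at least 1; in this consistent instance the constant-step iterates approach ${x}^{*}$ rather than stalling at a noise-limited accuracy.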