Section: New Results

Fast Convergence of Stochastic Gradient Descent under a Strong Growth Condition

Participants : Mark Schmidt [correspondent] , Nicolas Le Roux [correspondent] .

In [33] we consider optimizing a smooth convex function f that is the average of a set of differentiable functions fi, under the assumption considered by  [87] and  [90] that the norm of each gradient fi' is bounded by a linear function of the norm of the average gradient f'. We show that under this assumption the basic stochastic gradient method with a sufficiently small constant step size has an O(1/k) convergence rate, and has a linear convergence rate if f is strongly convex.

We write our problem

\min_{x \in \mathbb{R}^P} f(x) := \frac{1}{N} \sum_{i=1}^{N} f_i(x), \qquad (2)

where we assume that f is convex and its gradient f' is Lipschitz-continuous with constant L, meaning that for all x and y we have

\|f'(x) - f'(y)\| \le L \|x - y\|.

If f is twice-differentiable, these assumptions are equivalent to assuming that the eigenvalues of the Hessian f''(x) are bounded between 0 and L for all x.
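As a minimal sketch of what the smoothness assumption says, consider a hypothetical average of 1-D quadratics f_i(x) = 0.5·a_i·(x − b_i)²; then f'(x) is linear and its Lipschitz constant is L = mean(a_i). The coefficients below are illustrative, not from the paper:

```python
import random

# Hypothetical example: f(x) = (1/N) * sum_i 0.5 * a_i * (x - b_i)**2,
# whose gradient f'(x) = (1/N) * sum_i a_i * (x - b_i) is Lipschitz
# continuous with constant L = mean(a_i).
a = [1.0, 2.0, 3.0]
b = [0.0, 1.0, -1.0]
N = len(a)
L = sum(a) / N

def grad_f(x):
    """Full gradient f'(x) of the averaged objective."""
    return sum(a[i] * (x - b[i]) for i in range(N)) / N

# Numeric check of the Lipschitz condition |f'(x) - f'(y)| <= L |x - y|.
random.seed(0)
for _ in range(100):
    x, y = random.uniform(-5, 5), random.uniform(-5, 5)
    assert abs(grad_f(x) - grad_f(y)) <= L * abs(x - y) + 1e-12
```

For this quadratic the bound holds with equality, since f'' is constant and equal to L everywhere.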

Deterministic gradient methods for problems of this form use the iteration

x_{k+1} = x_k - \alpha_k f'(x_k), \qquad (3)

for a sequence of step sizes αk. In contrast, stochastic gradient methods use the iteration

x_{k+1} = x_k - \alpha_k f_i'(x_k), \qquad (4)

for an individual data sample i selected uniformly at random from the set {1, 2, …, N}.
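The two iterations (3) and (4) can be sketched side by side on a hypothetical 1-D problem (the coefficients and step sizes below are illustrative assumptions, not values from the paper):

```python
import random

# Illustrative objective: f_i(x) = 0.5 * a_i * (x - b_i)**2, so that
# f(x) = (1/N) * sum_i f_i(x) is a smooth convex quadratic whose
# minimizer is x* = sum(a_i * b_i) / sum(a_i).
a = [1.0, 2.0, 3.0]
b = [0.0, 1.0, 2.0]
N = len(a)
x_star = sum(ai * bi for ai, bi in zip(a, b)) / sum(a)

def grad_f(x):           # full gradient f'(x), used by iteration (3)
    return sum(a[i] * (x - b[i]) for i in range(N)) / N

def grad_fi(x, i):       # single-sample gradient f_i'(x), iteration (4)
    return a[i] * (x - b[i])

# Deterministic gradient iteration (3) with a constant step size.
x = 5.0
for k in range(200):
    x = x - 0.1 * grad_f(x)

# Stochastic gradient iteration (4): i sampled uniformly from {1,...,N},
# with the decreasing step sizes 1/(k+1) that convergence requires.
random.seed(0)
y = 5.0
for k in range(2000):
    i = random.randrange(N)
    y = y - (1.0 / (k + 1)) * grad_fi(y, i)
```

Note that each deterministic step touches all N samples while each stochastic step touches one, which is the cost advantage discussed next.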

The stochastic gradient method is appealing because the cost of its iterations is independent of N. However, in order to guarantee convergence, stochastic gradient methods require a decreasing sequence of step sizes {αk}, and this leads to a slower convergence rate. In particular, for convex objective functions the stochastic gradient method with a decreasing sequence of step sizes has an expected error on iteration k of O(1/\sqrt{k})  [78] , meaning that

\mathbb{E}[f(x_k)] - f(x^*) = O(1/\sqrt{k}).

In contrast, the deterministic gradient method with a constant step size has a smaller error of O(1/k)  [79] . The situation is more dramatic when f is strongly convex, meaning that

f(y) \ge f(x) + \langle f'(x), y - x \rangle + \frac{\mu}{2} \|y - x\|^2, \qquad (5)

for all x and y and some μ>0. For twice-differentiable functions, this is equivalent to assuming that the eigenvalues of the Hessian are bounded below by μ. For strongly convex objective functions, the stochastic gradient method with a decreasing sequence of step sizes has an error of O(1/k)  [77] , while the deterministic method with a constant step size has a linear convergence rate. In particular, the deterministic method satisfies

f(x_k) - f(x^*) \le \rho^k \, [f(x_0) - f(x^*)],

for some ρ<1  [71] .
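The linear rate is easy to observe numerically. Below is a minimal sketch on a strongly convex 1-D quadratic (the curvature h and step size are illustrative choices, and for a quadratic the per-step contraction factor can be computed exactly):

```python
# Strongly convex quadratic f(x) = 0.5 * h * x**2, minimized at x* = 0,
# with mu = L = h. Gradient descent with constant step alpha contracts
# x_k by (1 - alpha*h) per step, so f(x_k) - f(x*) shrinks by
# rho = (1 - alpha*h)**2 per step.
h = 2.0
alpha = 0.25                   # constant step size, alpha < 2/h
rho = (1.0 - alpha * h) ** 2   # per-step contraction of the gap

x = 4.0
gaps = []
for k in range(10):
    gaps.append(0.5 * h * x ** 2)   # f(x_k) - f(x*)
    x = x - alpha * h * x           # deterministic gradient step (3)

# For this quadratic, gap_k = rho**k * gap_0 holds exactly.
```

Here rho = 0.25, so the suboptimality gap drops by a factor of four at every iteration, exactly the geometric decay ρ^k stated above.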

We show that if the individual gradients fi'(xk) satisfy a certain strong growth condition relative to the full gradient f'(xk), the stochastic gradient method with a sufficiently small constant step size achieves (in expectation) the convergence rates stated above for the deterministic gradient method.
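A simple setting where such a strong growth condition holds is the interpolation regime, where every f_i shares the same minimizer, so each ||f_i'(x)|| is bounded by a constant multiple of ||f'(x)||. The sketch below uses hypothetical 1-D quadratics with a common minimizer c (all values illustrative) to show constant-step stochastic gradient converging linearly, with no step-size decay:

```python
import random

# Interpolation regime: f_i(x) = 0.5 * a_i * (x - c)**2 all share the
# minimizer c, so |f_i'(x)| = a_i * |x - c| <= (max(a)/mean(a)) * |f'(x)|
# and the strong growth condition holds with constant max(a)/mean(a).
a = [1.0, 2.0, 3.0]
c = 1.5
alpha = 0.25   # sufficiently small CONSTANT step size (alpha < 2/max(a))

random.seed(0)
x = 10.0
for k in range(200):
    i = random.randrange(len(a))
    # Constant-step stochastic gradient iteration (4): every step
    # contracts (x - c) by a factor (1 - alpha * a_i) in (0, 1).
    x = x - alpha * a[i] * (x - c)
```

Because each sampled gradient vanishes at the shared minimizer, the constant step size introduces no residual noise, and x converges to c at a linear rate, matching the deterministic behavior described above.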