In this lecture we move on to efficient algorithms for computing (approximate) solutions to convex optimization problems. We will assume throughout the lecture that the optimization objective is specified by a black-box gradient (first-order) oracle.

Gradient descent

First, let us consider unconstrained minimization of $f: \R^d \to \R$.

Recall that the negative gradient $-\nabla f(x)$ gives the direction of steepest descent of $f$ at the point $x$. Thus, a very natural “greedy” approach for minimizing $f$ is via an iterative process:

<aside> ⚙ Algorithm: Gradient Descent (GD)

Starting from $x_1 \in \R^d$ with step size $\eta>0$, do:

$$ x_{t+1} = x_t - \eta \nabla f(x_t), \qquad\qquad t=1,2,\ldots. $$

</aside>
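To make the update rule concrete, here is a minimal Python/NumPy sketch of the iteration above. The names `grad_f` (standing in for the first-order oracle), `x1`, `eta`, and `T` are placeholders chosen for illustration, not part of the lecture's notation.

```python
import numpy as np

def gradient_descent(grad_f, x1, eta, T):
    """Run T iterates of gradient descent from x1 with constant step size eta.

    grad_f : callable returning the gradient of f at a given point
             (the black-box first-order oracle).
    Returns the list of iterates x_1, ..., x_T.
    """
    xs = [np.asarray(x1, dtype=float)]
    for _ in range(T - 1):
        x = xs[-1]
        xs.append(x - eta * grad_f(x))  # x_{t+1} = x_t - eta * grad f(x_t)
    return xs
```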

Gradient descent dates back to Cauchy (circa the 1840s), but the analysis we present here was only developed as late as the 1940s. Here we will prove:

<aside> 💡 Theorem:
Assume that $f$ is convex with $\|\nabla f(x)\| \leq G$ for all $x$, that $\|x_1-x^*\| \leq D$ for a minimizer $x^*$ of $f$, and set $\eta = D/(G\sqrt{T})$. Then for the average iterate of gradient descent, we have:

$$ f(\bar x_T) - f(x^*) \leq \frac{GD}{\sqrt{T}}, \qquad\text{where}\qquad \bar x_T = \frac1T \sum_{t=1}^T x_t . $$

</aside>
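As a quick numerical illustration (not part of the proof), the following sketch runs gradient descent on the simple convex, $1$-Lipschitz function $f(x) = \|x\|_2$ with $D = 1$, using the step size from the theorem, and compares the suboptimality of the average iterate to the $GD/\sqrt{T}$ bound. The concrete choices of $d$, $T$, and the starting point are assumptions made only for this demo.

```python
import numpy as np

# f(x) = ||x||_2 is convex and 1-Lipschitz (G = 1), minimized at x* = 0.
def f(x):
    return np.linalg.norm(x)

def grad_f(x):
    n = np.linalg.norm(x)
    return x / n if n > 0 else np.zeros_like(x)  # (sub)gradient of the norm

d, T = 10, 1000
x1 = np.ones(d) / np.sqrt(d)       # ||x1 - x*|| = 1 =: D
G, D = 1.0, 1.0
eta = D / (G * np.sqrt(T))         # step size from the theorem

xs = [x1]
for _ in range(T - 1):
    xs.append(xs[-1] - eta * grad_f(xs[-1]))

x_bar = np.mean(xs, axis=0)        # average iterate
# Observed gap f(x_bar) - f(x*) versus the theoretical G*D/sqrt(T) bound:
print(f(x_bar) - f(np.zeros(d)), G * D / np.sqrt(T))
```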