Recap and context
In the previous lecture,
- we introduced basic gradient methods (for convex and Lipschitz objectives) and covered (sub)gradient descent and its projected variant;
- we saw that despite the name “gradient descent”, these methods do not necessarily descend in function values — namely, they are not monotone.
Intuitively, the reason why moving from a point $x_t$ along the negative gradient $-\nabla f(x_t)$ does not necessarily decrease the function value is that the linear approximation to $f$ at $x_t$, defined by the gradient at $x_t$, can be quite a bad approximation even if we move only slightly away from $x_t$.
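As a minimal added illustration (a toy example of my own, not from the previous lecture): take $f(x) = x^2$, which has gradient $\nabla f(x) = 2x$, and use step size $\eta = 2$. Starting from $x_t = 1$, the gradient step gives
$$
x_{t+1} = x_t - \eta \nabla f(x_t) = 1 - 2 \cdot 2 = -3,
\qquad
f(x_{t+1}) = 9 > 1 = f(x_t),
$$
so the function value increases: the linear approximation $f(x_t) + \nabla f(x_t)(x - x_t)$ keeps decreasing along the direction $-\nabla f(x_t)$, but $f$ itself stops decreasing long before the step ends.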
Smoothness
The next notion we introduce is aimed precisely at quantifying how quickly the quality of the linear approximation degrades as we move away locally.
<aside>
💡 Definition: Smooth function
A differentiable function $f$ is $\boldsymbol\beta$-smooth over $S \subseteq \mathrm{dom}\, f$ if for all $x,y \in S$:
$$
\begin{align*}
-\frac{\beta}{2} \|y-x\|^2
\leq
f(y) - f(x) - \nabla f(x) \cdot (y-x)
\leq
\frac{\beta}{2} \|y-x\|^2
.
\end{align*}
$$
</aside>

- In words: $f$ is $\beta$-smooth iff the difference between $f$ and its linear approximation at $x$ is bounded, in absolute value, by the quadratic $\frac{\beta}{2}\|y-x\|^2$. That is, the smaller $\beta$ is, the better the linear approximation defined by the gradient $\nabla f(x)$ approximates $f$ around $x$. (A small numerical sanity check of this inequality appears after this list.)
- Note that if $f$ is also convex, the lower bound in the definition is redundant. (Why?)
- Smoothness is unrelated to convexity: there are smooth non-convex functions, just as there are convex non-smooth functions. (Examples?)
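As a sanity check of the definition, here is a minimal numerical sketch (an added illustration; the choice $f(x) = \tfrac12 x^\top A x$ with $A$ positive semidefinite and $\beta = \lambda_{\max}(A)$ is an assumption made for this example, not something stated above). It samples random pairs $(x, y)$ and verifies the two-sided bound.

```python
import numpy as np

rng = np.random.default_rng(0)

# A symmetric positive semidefinite matrix; f(x) = 0.5 * x^T A x is convex,
# differentiable, and beta-smooth with beta = largest eigenvalue of A.
d = 5
M = rng.standard_normal((d, d))
A = M @ M.T
beta = np.linalg.eigvalsh(A).max()

def f(x):
    return 0.5 * x @ A @ x

def grad_f(x):
    return A @ x

# Check the smoothness inequality on random pairs (x, y):
# |f(y) - f(x) - <grad f(x), y - x>| <= (beta / 2) * ||y - x||^2
for _ in range(1000):
    x, y = rng.standard_normal(d), rng.standard_normal(d)
    gap = f(y) - f(x) - grad_f(x) @ (y - x)
    bound = 0.5 * beta * np.linalg.norm(y - x) ** 2
    assert -bound - 1e-9 <= gap <= bound + 1e-9

print("smoothness inequality holds on all sampled pairs; beta =", beta)
```

Since this particular $f$ is convex, the quantity `gap` is in fact nonnegative, which also illustrates why the lower bound in the definition is redundant for convex functions.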
Examples and basic properties