Last lecture:
- saw that regularized ERM achieves the optimal rate of convergence of the population risk
- but it is still not a fully specified algorithm (and we already know that the choice of actual algorithm matters for out-of-sample performance)
Our goal here:
- Consider actual, natural algorithms for optimizing the empirical risk (approximate ERMs)
- Our main focus will be on (batch) gradient descent, but the techniques are applicable to a wide range of gradient methods
Recap: SCO, algorithmic stability
<aside>
🚧 Stochastic Convex Optimization (SCO)
Given:
- convex optimization domain $W \subseteq \R^d$, arbitrary sample space $Z$
- loss function $f : W \times Z \to \R$, convex in $w$
- sample $S = \{ z_1, \ldots, z_n \}$ drawn i.i.d. from an (unknown) population distribution $\cal D$ over $Z$
Goal: minimize:
$$
\newcommand{\E}{\mathbb E}
\begin{aligned}
F(w) = \E_{z \sim \cal D}[f(w,z)]
\end{aligned}
$$
</aside>
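As a concrete illustration (an example added here, not part of the formal setup above): linear least squares fits this framework with $Z = \R^d \times \R$ and

$$
\begin{aligned}
f(w, (x, y)) = \tfrac12 (\langle w, x \rangle - y)^2 ,
\end{aligned}
$$

which is convex in $w$ for every fixed $(x, y)$; here $F(w)$ is the expected squared prediction error of $w$ on a fresh example drawn from $\cal D$.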
- $F$ is called the true loss/risk, or population loss/risk; its minimizer is denoted $w^*$:
$$
\begin{aligned}
w^* \in \arg\min_{w \in W} F(w)
\end{aligned}
$$
- the empirical loss/risk of $w \in W$ is:
$$
\newcommand{\E}{\mathbb E}
\begin{aligned}
F_S(w) = \frac1n \sum_{i=1}^n f(w,z_i)
\end{aligned}
$$
- the generalization gap of $w \in W$ is:
$$
\newcommand{\E}{\mathbb E}
\begin{aligned}
\Delta_S(w) = F(w) - F_S(w)
\end{aligned}
$$
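To make these quantities concrete, here is a minimal numerical sketch (assuming the least-squares instance above; the constants, function names, and sample sizes are illustrative choices, not from the lecture) that runs batch gradient descent on $F_S$ and estimates $\Delta_S(w)$ by Monte Carlo:

```python
import numpy as np

rng = np.random.default_rng(0)
d, n = 5, 100

# Hypothetical "population" distribution D: Gaussian features, noisy linear labels.
w_true = rng.normal(size=d)

def sample(m):
    X = rng.normal(size=(m, d))
    y = X @ w_true + 0.1 * rng.normal(size=m)
    return X, y

X, y = sample(n)  # the training sample S of size n

def F_S(w):
    """Empirical risk: average of f(w, z_i) over the sample S."""
    return 0.5 * np.mean((X @ w - y) ** 2)

def grad_F_S(w):
    """Gradient of the empirical risk at w."""
    return X.T @ (X @ w - y) / n

# Batch gradient descent on the empirical risk (an approximate ERM).
w = np.zeros(d)
eta = 0.1  # step size, an ad hoc choice for this sketch
for _ in range(500):
    w = w - eta * grad_F_S(w)

# Monte Carlo estimate of the population risk F(w) via a large fresh sample.
X_new, y_new = sample(100_000)
F_hat = 0.5 * np.mean((X_new @ w - y_new) ** 2)

print("empirical risk F_S(w):", F_S(w))
print("population risk F(w) (estimate):", F_hat)
print("generalization gap Delta_S(w) (estimate):", F_hat - F_S(w))
```

The printed gap is exactly $\Delta_S(w) = F(w) - F_S(w)$, up to the Monte Carlo error in estimating $F(w)$ from the fresh sample.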
We will require the notion of uniform stability seen last lecture: