Last lecture:
- saw that regularized ERM achieves the optimal rate of convergence of the population risk
- but it is still not a fully specified algorithm (and we already know that the choice of actual algorithm matters for out-of-sample performance)
Our goal here:
- Consider actual, natural algorithms for optimizing the empirical risk (approximate ERMs)
- Our main focus will be on (batch) gradient descent, but the techniques are applicable to a wide range of gradient methods
Recap: SCO, algorithmic stability
<aside>
🚧 Stochastic Convex Optimization (SCO)
Given:
- convex optimization domain $W \subseteq \R^d$, arbitrary sample space $Z$
- loss function $f : W \times Z \to \R$, convex in $w$
- sample $S = \{ z_1, \ldots, z_n \}$ drawn i.i.d. from an (unknown) population distribution $\cal D$ over $Z$
Goal: minimize:
$$
\newcommand{\E}{\mathbb E}
\begin{aligned}
F(w) = \E_{z \sim \cal D}[f(w,z)]
\end{aligned}
$$
</aside>
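As a concrete illustration (an example added here, not part of the formal setup above): linear least squares fits this framework with $Z = \R^d \times \R$ and

$$
\begin{aligned}
f(w, (x, y)) = \tfrac12 (\langle w, x \rangle - y)^2 ,
\end{aligned}
$$

which is convex in $w$ for every fixed $(x, y)$; here $F(w)$ is the expected squared prediction error of $w$ on a fresh example drawn from $\cal D$.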
- $F$ is called the true loss/risk, or population loss/risk; its minimizer is denoted $w^*$:
$$
\begin{aligned}
w^* \in \arg\min_{w \in W} F(w)
\end{aligned}
$$
- the empirical loss/risk of $w \in W$ is:
$$
\newcommand{\E}{\mathbb E}
\begin{aligned}
F_S(w) = \frac1n \sum_{i=1}^n f(w,z_i)
\end{aligned}
$$
- the generalization gap of $w \in W$ is:
$$
\newcommand{\E}{\mathbb E}
\begin{aligned}
\Delta_S(w) = F(w) - F_S(w)
\end{aligned}
$$
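To make these quantities concrete, here is a minimal numerical sketch (assuming the least-squares instance above; the constants, function names, and sample sizes are illustrative choices, not from the lecture) that runs batch gradient descent on $F_S$ and estimates $\Delta_S(w)$ by Monte Carlo:

```python
import numpy as np

rng = np.random.default_rng(0)
d, n = 5, 100

# Hypothetical "population" distribution D: Gaussian features, noisy linear labels.
w_true = rng.normal(size=d)

def sample(m):
    X = rng.normal(size=(m, d))
    y = X @ w_true + 0.1 * rng.normal(size=m)
    return X, y

X, y = sample(n)  # the training sample S of size n

def F_S(w):
    """Empirical risk: average of f(w, z_i) over the sample S."""
    return 0.5 * np.mean((X @ w - y) ** 2)

def grad_F_S(w):
    """Gradient of the empirical risk at w."""
    return X.T @ (X @ w - y) / n

# Batch gradient descent on the empirical risk (an approximate ERM).
w = np.zeros(d)
eta = 0.1  # step size, an ad hoc choice for this sketch
for _ in range(500):
    w = w - eta * grad_F_S(w)

# Monte Carlo estimate of the population risk F(w) via a large fresh sample.
X_new, y_new = sample(100_000)
F_hat = 0.5 * np.mean((X_new @ w - y_new) ** 2)

print("empirical risk F_S(w):", F_S(w))
print("population risk F(w) (estimate):", F_hat)
print("generalization gap Delta_S(w) (estimate):", F_hat - F_S(w))
```

The printed gap is exactly $\Delta_S(w) = F(w) - F_S(w)$, up to the Monte Carlo error in estimating $F(w)$ from the fresh sample.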
We will require the notion of uniform stability seen last lecture: