Last lecture:
We saw that the uniform convergence rate in SCO is necessarily dimension dependent: the sample complexity for constant error $\varepsilon$ is $n = \Omega(d)$
In fact, even just the generalization gap of (generic) ERM is dimension dependent
In contrast, for online-to-batch algorithms we obtained a dimension-independent $O(1/\sqrt{n})$ rate
(hence generalization in SCO is strongly dependent on the optimization algorithm used)
In this lecture, we will see more positive results:
<aside> 🚧 Stochastic Convex Optimization (SCO)
Given: a convex domain $W \subseteq \mathbb{R}^d$, a loss $f(w, z)$ convex in $w$, and an i.i.d. sample $z_1, \dots, z_n \sim D$
Goal: minimize over $w \in W$:
$$ \newcommand{\E}{\mathbb E}
\begin{aligned} F(w) = \E_{z \sim D}[f(w,z)] \end{aligned} $$
</aside>
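To make the setup concrete, here is a toy instance (purely illustrative, not from the lecture): squared-loss mean estimation, with $f(w, z) = \|w - z\|^2$, $W$ a convex set containing $\mu = \mathbb{E}_{z \sim D}[z]$, and $D$ satisfying $\mathbb{E}\|z - \mu\|^2 = d\sigma^2$. Then

$$ \newcommand{\E}{\mathbb E}
\begin{aligned} F(w) = \E_{z \sim D}\|w - z\|^2 = \|w - \mu\|^2 + d\sigma^2 , \end{aligned} $$

so the minimizer of $F$ over $W$ is simply $\mu$.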
$F$ is called the true loss/risk, or population loss/risk; its minimizer is denoted $w^*$:
$$ \newcommand{\E}{\mathbb E}
\begin{aligned} w^* \in \arg\min_{w \in W} F(w) \end{aligned} $$
the empirical loss/risk of $w \in W$ is:
$$ \newcommand{\E}{\mathbb E}
\begin{aligned} F_S(w) = \frac1n \sum_{i=1}^n f(w,z_i) \end{aligned} $$
the generalization gap of $w \in W$ is:
$$ \newcommand{\E}{\mathbb E}
\begin{aligned} \Delta_S(w) = F(w) - F_S(w) \end{aligned} $$
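As a quick numerical sanity check of these definitions (a minimal sketch, assuming the toy squared-loss instance above with $D = \mathcal{N}(\mu, \sigma^2 I_d)$; the names `mu`, `w_hat`, etc. are illustrative, not from the lecture):

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy instance (an assumption for illustration): f(w, z) = ||w - z||^2 with
# z ~ N(mu, sigma^2 I), so F(w) = ||w - mu||^2 + d * sigma^2 and w* = mu.
d, n, sigma = 5, 100, 1.0
mu = np.ones(d) / np.sqrt(d)                 # true mean, unknown to the learner

def f(w, z):
    return np.sum((w - z) ** 2)              # instance loss f(w, z)

def F(w):
    return np.sum((w - mu) ** 2) + d * sigma ** 2   # population risk (closed form)

def F_S(w, S):
    return np.mean([f(w, z) for z in S])     # empirical risk F_S(w)

S = rng.normal(mu, sigma, size=(n, d))       # i.i.d. sample z_1, ..., z_n ~ D

w_hat = S.mean(axis=0)                       # ERM for the squared loss = sample mean
for name, w in [("w = 0", np.zeros(d)), ("ERM", w_hat)]:
    print(f"{name}: F(w) = {F(w):.3f}, F_S(w) = {F_S(w, S):.3f}, "
          f"gap Delta_S(w) = {F(w) - F_S(w, S):.3f}")
```

At any fixed $w$ the gap $\Delta_S(w)$ has zero mean, while at the data-dependent ERM point $F_S$ tends to underestimate $F$; controlling the gap at such data-dependent points is what last lecture showed to be dimension dependent in general.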