Last lecture:
We saw that the uniform convergence rate in SCO is necessarily dimension dependent: the sample complexity for constant error $\varepsilon$ is $n = \Omega(d)$
In fact, even just the generalization gap of (generic) ERM is dimension dependent
In contrast, for online-to-batch algorithms we obtained a dimension-independent $O(1/\sqrt{n})$ rate
(hence generalization in SCO is strongly dependent on the optimization algorithm used)
In this lecture, we will see more positive results:
<aside> 🚧 Stochastic Convex Optimization (SCO)
Given: a convex domain $W \subseteq \mathbb{R}^d$, a loss $f(w, z)$ convex in $w$, and an i.i.d. sample $z_1, \dots, z_n \sim D$
Goal: minimize over $w \in W$:
$$ \newcommand{\E}{\mathbb E}
\begin{aligned} F(w) = \E_{z \sim D}[f(w,z)] \end{aligned} $$
</aside>
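To make the setup concrete, here is a toy instance (purely illustrative, not from the lecture): squared-loss mean estimation, with $f(w, z) = \|w - z\|^2$, $W$ a convex set containing $\mu = \mathbb{E}_{z \sim D}[z]$, and $D$ satisfying $\mathbb{E}\|z - \mu\|^2 = d\sigma^2$. Then

$$ \newcommand{\E}{\mathbb E}
\begin{aligned} F(w) = \E_{z \sim D}\|w - z\|^2 = \|w - \mu\|^2 + d\sigma^2 , \end{aligned} $$

so the minimizer of $F$ over $W$ is simply $\mu$.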
$F$ is called the true loss/risk, or population loss/risk; its minimizer is denoted $w^*$:
$$ \newcommand{\E}{\mathbb E}
\begin{aligned} w^* \in \arg\min_{w \in W} F(w) \end{aligned} $$
the empirical loss/risk of $w \in W$ is:
$$ \newcommand{\E}{\mathbb E}
\begin{aligned} F_S(w) = \frac1n \sum_{i=1}^n f(w,z_i) \end{aligned} $$
the generalization gap of $w \in W$ is:
$$ \newcommand{\E}{\mathbb E}
\begin{aligned} \Delta_S(w) = F(w) - F_S(w) \end{aligned} $$
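As a quick numerical sanity check of these definitions (a minimal sketch, assuming the toy squared-loss instance above with $D = \mathcal{N}(\mu, \sigma^2 I_d)$; the names `mu`, `w_hat`, etc. are illustrative, not from the lecture):

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy instance (an assumption for illustration): f(w, z) = ||w - z||^2 with
# z ~ N(mu, sigma^2 I), so F(w) = ||w - mu||^2 + d * sigma^2 and w* = mu.
d, n, sigma = 5, 100, 1.0
mu = np.ones(d) / np.sqrt(d)                 # true mean, unknown to the learner

def f(w, z):
    return np.sum((w - z) ** 2)              # instance loss f(w, z)

def F(w):
    return np.sum((w - mu) ** 2) + d * sigma ** 2   # population risk (closed form)

def F_S(w, S):
    return np.mean([f(w, z) for z in S])     # empirical risk F_S(w)

S = rng.normal(mu, sigma, size=(n, d))       # i.i.d. sample z_1, ..., z_n ~ D

w_hat = S.mean(axis=0)                       # ERM for the squared loss = sample mean
for name, w in [("w = 0", np.zeros(d)), ("ERM", w_hat)]:
    print(f"{name}: F(w) = {F(w):.3f}, F_S(w) = {F_S(w, S):.3f}, "
          f"gap Delta_S(w) = {F(w) - F_S(w, S):.3f}")
```

At any fixed $w$ the gap $\Delta_S(w)$ has zero mean, while at the data-dependent ERM point $F_S$ tends to underestimate $F$; controlling the gap at such data-dependent points is what last lecture showed to be dimension dependent in general.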