Optimal methods: momentum and acceleration

Let us revisit the last result we saw: in the smooth and strongly convex case, GD converges exponentially fast, but it takes roughly $\kappa$ gradient steps for this exponential decay to “kick in”.
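As a reminder, here is one standard form of that rate (a sketch, assuming step size $\eta = 1/\beta$ for an $\alpha$-strongly convex, $\beta$-smooth $f$, with $\kappa = \beta/\alpha$; the constants may differ slightly from the exact statement used earlier):

$$ f(x_t) - f^* \le \left(1 - \tfrac{1}{\kappa}\right)^t \bigl(f(x_0) - f^*\bigr) \le e^{-t/\kappa}\,\bigl(f(x_0) - f^*\bigr), $$

so shrinking the suboptimality by a constant factor takes on the order of $\kappa$ iterations.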

Is this the best we can get? After all, this is still just the same old gradient descent, nothing fancy...

Polyak’s “Heavy ball” method

The “heavy ball” method was proposed by Polyak (one of the fathers of classical optimization, who passed away very recently) in the 1960s as an improvement over gradient descent.

Why “heavy ball”?

“Gradient descent is a man walking down a hill. He follows the steepest path downwards; his progress is slow, but steady. Momentum is a heavy ball rolling down the same hill. The added inertia acts both as a smoother and an accelerator, dampening oscillations and causing us to barrel through narrow valleys, small humps and local minima.”

The method is only slightly more complex than basic GD:

$$ x_{t+1} \gets x_t - \eta \nabla f(x_t) + \mu (x_t - x_{t-1}) $$
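To make the update concrete, here is a minimal sketch (not from the original notes) of the heavy-ball iteration on an ill-conditioned quadratic; the function `heavy_ball`, the test matrix, and the parameter choices below are illustrative assumptions, using the classical quadratic-case tuning of $\eta$ and $\mu$:

```python
import numpy as np

def heavy_ball(grad, x0, eta, mu, num_steps):
    """Heavy-ball iteration: x_{t+1} = x_t - eta * grad(x_t) + mu * (x_t - x_{t-1})."""
    x_prev, x = x0.copy(), x0.copy()
    for _ in range(num_steps):
        x_next = x - eta * grad(x) + mu * (x - x_prev)
        x_prev, x = x, x_next
    return x

if __name__ == "__main__":
    # Quadratic f(x) = 0.5 * x^T A x with eigenvalues alpha = 1 and beta = 100,
    # so the condition number is kappa = 100.
    alpha, beta = 1.0, 100.0
    A = np.diag([alpha, beta])
    grad = lambda x: A @ x
    x0 = np.array([1.0, 1.0])

    # Classical tuning for quadratics:
    #   eta = 4 / (sqrt(alpha) + sqrt(beta))^2,
    #   mu  = ((sqrt(beta) - sqrt(alpha)) / (sqrt(beta) + sqrt(alpha)))^2.
    eta = 4.0 / (np.sqrt(alpha) + np.sqrt(beta)) ** 2
    mu = ((np.sqrt(beta) - np.sqrt(alpha)) / (np.sqrt(beta) + np.sqrt(alpha))) ** 2

    x_hb = heavy_ball(grad, x0, eta, mu, num_steps=100)
    # Setting mu = 0 recovers plain GD (here with step size 1/beta) for comparison.
    x_gd = heavy_ball(grad, x0, eta=1.0 / beta, mu=0.0, num_steps=100)
    print("heavy ball:", np.linalg.norm(x_hb), " plain GD:", np.linalg.norm(x_gd))
```

Note that setting $\mu = 0$ recovers plain gradient descent, so the momentum term $\mu (x_t - x_{t-1})$ is the only change.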