Machine Learning


Intro

History

1950, Turing test.

1951-1955, early AI programs.

  • 1951, Marvin Minsky designed SNARC, the first artificial neural network;
  • 1951, Christopher Strachey wrote a checkers program and Dietrich Prinz wrote one for chess;
  • 1955, Allen Newell and Herbert Simon created the “Logic Theorist” for proving theorems.

1956, the birth of AI. Dartmouth conference, John McCarthy coined the name "Artificial Intelligence".

1956-1974, golden years. People were optimistic about AI, thinking that it would be solved in a few years.

  • Search algorithms via path elimination;
  • NLP: First chatterbot ELIZA;
  • Robotics: stack blocks, walk with lower limbs.

1974-1980, first AI winter. People realized that AI was much harder than they thought, and funding was cut.

1980-1987, AI boom. Neural networks revived.

  • Expert systems: answer questions about a specific domain of knowledge, using logical rules derived from experts;
  • 1982, John Hopfield proposed the Hopfield network;
  • 1982, Geoffrey Hinton and David Rumelhart proposed backpropagation.

1987-1993, second AI winter. Most AI projects are not that useful.

1993-2011, steady development. However, AI researchers called their work by other names. Neural networks fell out of favor again; SVMs, graphical models, and reinforcement learning were popular.

  • 1997, Deep Blue from IBM beat the world chess champion;
  • 2005, a Stanford robot won the DARPA Grand Challenge by driving autonomously for 131 miles;

2012-now, deep learning.

  • 2012, Hinton’s group used deep learning on ImageNet and improved accuracy by about 10%, which is seen as the beginning of the deep learning era;
  • 2013, DeepMind beat humans on Atari games;
  • 2016, DeepMind’s AlphaGo beat a human champion at Go;
  • 2018-, foundation models (CLIP, ChatGPT, Midjourney, etc.)

Framework

Supervised Learning: Given

  • Domain set \(\mathcal{X}\), which is the set of all possible inputs;
  • Label set \(\mathcal{Y}\), which is the set of all possible outputs (\(\mathcal{Y}=\{0,1\}\) for classification);
  • Training set \(S=((x_1,y_1),...,(x_n,y_n))\), where \(x_i\in\mathcal{X}\) and \(y_i\in\mathcal{Y}\). Denote \(X_{\text{train}}=(x_1,x_2,...,x_n)\) and \(Y_{\text{train}}=(y_1,y_2,...,y_n)\).

You want to learn a function (called predictor, hypothesis or classifier) \(f:\mathcal{X}\to\mathcal{Y}\) such that \(f(x_i) \approx y_i\) for all \(i\).

In order to evaluate the performance of the learned function, we use a loss function \(l(f, x_i, y_i)\) that quantifies the error between the predicted value \(f(x_i)\) and the true value \(y_i\).

  • For a categorical target (classification), the zero-one loss \(l(f,x_i,y_i)=[f(x_i)\neq y_i]\) is natural but not differentiable.
  • For a real-valued target (regression), let \(l(f,x_i,y_i)=\text{dist}(f(x_i),y_i)\), e.g., the squared error \((f(x_i)-y_i)^2\).

Then the training loss is the average of all individual losses

\[L(f,X_{\text{train}},Y_{\text{train}})=\frac{1}{n}\sum_{i=1}^n l(f,x_i,y_i). \]

Similarly, we have test loss, validation loss, and population loss.
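As a concrete illustration, here is a minimal numpy sketch of the training loss under the squared loss (the predictor and the toy data are hypothetical):

```python
import numpy as np

def training_loss(f, X_train, Y_train):
    """Average of the individual squared losses over the training set."""
    return np.mean([(f(x) - y) ** 2 for x, y in zip(X_train, Y_train)])

# Toy 1-D regression data and a hypothetical linear predictor f(x) = x.
X_train = np.array([0.0, 1.0, 2.0, 3.0])
Y_train = np.array([0.1, 0.9, 2.2, 2.8])
print(training_loss(lambda x: x, X_train, Y_train))
```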

Step-by-step process:

  1. Identify the task you want to solve;
  2. Create a dataset, containing thousands, or millions of examples;
  3. Define a loss function to evaluate the performance of the learned function;
  4. Learn a function to minimize the loss function;
  5. Evaluate the learned function on a test set to see how well it generalizes.

Optimization: find \(f\) to minimize the loss \(L\).

Generalization: the learned function \(f\) should perform well on unseen data, not just the training data.

Sometimes, your function may overfit or underfit:

  • Overfitting: Your function has too much representation power. It captures the noise in the training data, leading to poor performance on unseen data.
  • Underfitting: Your function does not have enough representation power. It fails to capture the underlying patterns in the training data, leading to poor performance on both training and unseen data.

In a classical view, to avoid overfitting, we should restrict the representation power, which is called regularization. However, in a modern view, there are implicit regularizations to prevent overfitting for neural networks, so that overfitting almost never happens.

Unsupervised learning: Given input \(X=(x_1,x_2,...,x_n)\), learn a function \(f\) such that \(f(x_i)\) captures the underlying structure of the data.

Common unsupervised learning tasks:

  • Clustering: Group similar data points together.
  • Principal component analysis (PCA) / dimension reduction: Reduce the dimensionality of the data while preserving its variance.
  • Generative model: Learn the underlying distribution of the data to generate new samples that resemble the training data. Usually done by mapping a Gaussian to the target distribution.
  • Anomaly detection: Identify data points that are significantly different from the rest of the data.
  • ...

In practice, semi-supervised learning is often used, where a small amount (say 10%) of labeled data is used to guide the learning process.

Implicitly, we rely on some assumptions.

  • Continuity assumption: Points which are close to each other are more likely to share a label;
  • Manifold assumption: The data lie approximately on a manifold of much lower dimension than the input space.

Optimization

  • Zero-order methods: only use the values of \(f\). Hard to optimize;
  • First-order methods: use \(f\) and \(\nabla f\);
  • Second-order methods: use \(f\), \(\nabla f\), and \(\nabla^2 f\). The Hessian matrix has size \(O(d^2)\) and is time-consuming to compute.

Almost universal optimization algorithm: (Stochastic) Gradient Descent.

Gradient Descent

The intuition is that the negative gradient gives the direction of fastest descent of the function.

Start with a guess $\boldsymbol{x_0}$ and consider the sequence $\boldsymbol{x_0},\boldsymbol{x_1},\boldsymbol{x_2},...$ such that $$ \boldsymbol{x_{n+1}}=\boldsymbol{x_n} -\eta_n \nabla f(\boldsymbol{x_n}), n\ge 0 $$

where \(\eta_n\) is the step size (learning rate) at iteration \(n\).
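A minimal sketch of this update rule (the quadratic test function and the step size are illustrative; the choice \(\eta\le \frac{1}{L}\) is justified by the lemma in the next subsection):

```python
import numpy as np

def gradient_descent(grad_f, x0, eta, n_steps=100):
    """Plain gradient descent with a constant step size eta."""
    x = np.asarray(x0, dtype=float)
    for _ in range(n_steps):
        x = x - eta * grad_f(x)        # x_{n+1} = x_n - eta * grad f(x_n)
    return x

# Example: f(x) = ||x||^2 / 2 has gradient x and smoothness constant L = 1,
# so any eta <= 1/L = 1 works.
print(gradient_descent(lambda x: x, x0=[3.0, -2.0], eta=0.5))   # -> approximately [0, 0]
```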

Smoothness

We assume that \(f\in \mathscr{C}^2\) in the following.

$\boldsymbol{x^*}$ is a local minimum of $f$ if $\nabla f(\boldsymbol{x^*})=0$ and $\nabla^2 f(\boldsymbol{x^*})$ is positive definite. Here $\nabla^2 f$ is the Hessian matrix of $f$.

Consider the first-order Taylor expansion of \(f\) with a quadratic remainder, for some \(\boldsymbol{\xi}\) on the segment between \(\boldsymbol{x}\) and \(\boldsymbol{x'}\):

\[f(\boldsymbol{x'})=f(\boldsymbol{x})+\nabla f(\boldsymbol{x})^\top(\boldsymbol{x'}-\boldsymbol{x})+\frac{g(\boldsymbol{\xi})}{2}\|\boldsymbol{x'}-\boldsymbol{x}\|^2. \]

Now we would like to bound the term \(g(\boldsymbol{\xi})\).

We say that a function $f$ is $L$-smooth if for all $\boldsymbol{x},\boldsymbol{x'}$, $$ |g(\boldsymbol{\xi})|\le L. $$ In other words, $$|f(\boldsymbol{x'})-f(\boldsymbol{x})-\nabla f(\boldsymbol{x})^\top(\boldsymbol{x'}-\boldsymbol{x})|\le \frac{L}{2}\|\boldsymbol{x'}-\boldsymbol{x}\|^2.$$

Smoothness means that the function is bounded above and below by a quadratic at every point. We will show that it is equivalent to Lipschitz continuity of the gradient, and to the norm of the Hessian matrix being bounded.

We say that a function $f$ is $L$-Lipschitz continuous if for all $\boldsymbol{x},\boldsymbol{y}$, $$\|f(\boldsymbol{x})-f(\boldsymbol{y})\|\le L\|\boldsymbol{x}-\boldsymbol{y}\|.$$
The following three statements are equivalent:
  1. \(f\) is \(L\)-smooth;
  2. \(\nabla f\) is \(L\)-Lipschitz continuous;
  3. \(\|\nabla^2 f(\boldsymbol{x})\|\le L\) for all \(\boldsymbol{x}\), where \(\|\cdot\|\) is the Spectral Norm of the matrix.

\(1 \Rightarrow 3\): Consider its remainder term in the Lagrange form

\[f(\boldsymbol{x'})=f(\boldsymbol{x})+\nabla f(\boldsymbol{x})^\top(\boldsymbol{x'}-\boldsymbol{x})+\frac{1}{2}(\boldsymbol{x'}-\boldsymbol{x})^\top\nabla^2 f(\boldsymbol{x}+\theta(\boldsymbol{x'}-\boldsymbol{x}))(\boldsymbol{x'}-\boldsymbol{x}). \]

Then \(f\) is \(L\)-smooth iff for all \(\boldsymbol{x},\boldsymbol{y}\),

\[\left|\boldsymbol{y}^\top\nabla^2 f(\boldsymbol{x}+\theta\boldsymbol{y})\boldsymbol{y}\right|\le L\|\boldsymbol{y}\|^2. \]

Suppose \(\boldsymbol{y}=t\boldsymbol{y'}\), where \(\|\boldsymbol{y'}\|=1\). Then we have

\[\left|\boldsymbol{y'}^\top\nabla^2 f(\boldsymbol{x}+\theta t\boldsymbol{y'})\boldsymbol{y'}\right|\le L. \]

Letting \(t\to 0^+\) gives

\[\left|\boldsymbol{y'}^\top\nabla^2 f(\boldsymbol{x})\boldsymbol{y'}\right|\le L. \]

Notice that \(\nabla^2 f(\boldsymbol{x})\) is self-adjoint, so

\[\|\nabla^2 f(\boldsymbol{x})\|=\max_i\left|\lambda_i(\nabla^2 f(\boldsymbol{x}))\right|=\max_{\|\boldsymbol{y}\|=1} \left|\boldsymbol{y}^\top\nabla^2 f(\boldsymbol{x})\boldsymbol{y}\right|\le L. \]

\(3\Rightarrow 2\):

\[\begin{align*} &\ \|\nabla f(\boldsymbol{x})-\nabla f(\boldsymbol{y})\|\\ =&\ \left\|\int_0^1\nabla^2 f(\boldsymbol{y}+t(\boldsymbol{x}-\boldsymbol{y}))(\boldsymbol{x}-\boldsymbol{y})\mathrm{d}t\right\|\ \\ \le &\ \int_0^1\left\|\nabla^2 f(\boldsymbol{y}+t(\boldsymbol{x}-\boldsymbol{y}))(\boldsymbol{x}-\boldsymbol{y})\right\|\ \mathrm{d}t \\ \le &\ \|(\boldsymbol{x}-\boldsymbol{y})\|\int_0^1\left\|\nabla^2 f(\boldsymbol{y}+t(\boldsymbol{x}-\boldsymbol{y}))\right\|\ \mathrm{d}t \\ \le &\ L\|\boldsymbol{x}-\boldsymbol{y}\|. \end{align*} \]

\(2\Rightarrow 1\):

\[\begin{align*} &\ \left|f(\boldsymbol{x'})-f(\boldsymbol{x})-\nabla f(\boldsymbol{x})^\top(\boldsymbol{x'}-\boldsymbol{x})\right|\\ =&\ \left|\int_0^1\nabla f(\boldsymbol{x}+t(\boldsymbol{x'}-\boldsymbol{x}))^\top(\boldsymbol{x'}-\boldsymbol{x})\mathrm{d}t-\nabla f(\boldsymbol{x})^\top(\boldsymbol{x'}-\boldsymbol{x})\right|\\ = &\ \left|\int_0^1(\nabla f(\boldsymbol{x}+t(\boldsymbol{x'}-\boldsymbol{x}))-\nabla f(\boldsymbol{x}))^\top(\boldsymbol{x'}-\boldsymbol{x})\mathrm{d}t\right|\\ \le &\ \int_0^1\left|(\nabla f(\boldsymbol{x}+t(\boldsymbol{x'}-\boldsymbol{x}))-\nabla f(\boldsymbol{x}))^\top(\boldsymbol{x'}-\boldsymbol{x})\right|\mathrm{d}t\\ \le &\ \int_0^1\|\nabla f(\boldsymbol{x}+t(\boldsymbol{x'}-\boldsymbol{x}))-\nabla f(\boldsymbol{x})\|\|(\boldsymbol{x'}-\boldsymbol{x})\|\mathrm{d}t \tag{Cauchy-Schwarz}\\ \le &\ \int_0^1 L\|t(\boldsymbol{x'}-\boldsymbol{x})\|\|(\boldsymbol{x'}-\boldsymbol{x})\|\mathrm{d}t \\ \le &\ \frac{L}{2}\|\boldsymbol{x'}-\boldsymbol{x}\|^2. \end{align*} \]

Back to GD, we can show that

If $f$ is $L$-smooth, then running GD with step size $\eta\le \frac{1}{L}$ gives $$f(\boldsymbol{x_{n+1}})\le f(\boldsymbol{x_n})-\frac{\eta}{2}\|\nabla f(\boldsymbol{x_n})\|^2. $$
$$ \begin{align*}f(\boldsymbol{x_{n+1}})-f(\boldsymbol{x_n}) \le &\ \nabla f(\boldsymbol{x_n})^\top(\boldsymbol{x_{n+1}}-\boldsymbol{x_n})+\frac{L}{2}\|\boldsymbol{x_{n+1}}-\boldsymbol{x_n}\|^2\\ = &\ -\eta\nabla f(\boldsymbol{x_n})^\top\nabla f(\boldsymbol{x_n})+\frac{L\eta^2}{2}\|\nabla f(\boldsymbol{x_n})\|^2\\ = &\ -\eta\left(1-\frac{L\eta}{2}\right)\|\nabla f(\boldsymbol{x_n})\|^2\\ \le &\ -\frac{\eta}{2}\|\nabla f(\boldsymbol{x_n})\|^2. \end{align*} $$

Thus we should set \(\eta=O\left(\frac{1}{L}\right)\).

Convexity

We say that a function $f$ is convex if for all $\boldsymbol{x},\boldsymbol{y}$ and $t\in[0,1]$, $$ f(t\boldsymbol{x}+(1-t)\boldsymbol{y})\le tf(\boldsymbol{x})+(1-t)f(\boldsymbol{y}).$$

Similarly, we have

The following three statements are equivalent:
  1. \(f\) is convex;
  2. \(f(\boldsymbol{x'})\ge f(\boldsymbol{x})+\nabla f(\boldsymbol{x})^\top(\boldsymbol{x'}-\boldsymbol{x})\) for all \(\boldsymbol{x},\boldsymbol{x'}\);
  3. \(\nabla^2 f(\boldsymbol{x})\) is positive semi-definite for all \(\boldsymbol{x}\).

Convexity lower-bounds the Hessian eigenvalues (by \(0\)), while smoothness upper-bounds their magnitude (by \(L\)).

If $f$ is $L$-smooth and convex and $\boldsymbol{x^*}=\arg\min_{\boldsymbol{x}} f(\boldsymbol{x})$, then running GD with step size $\eta\le\frac{1}{L}$ satisfies $$f(\boldsymbol{x_n})\le f(\boldsymbol{x^*})+\frac{\|\boldsymbol{x_0}-\boldsymbol{x^*}\|^2}{2\eta n}$$
$$ \begin{align*} f(\boldsymbol{x_{n+1}})\le &\ f(\boldsymbol{x_n})-\frac{\eta}{2}\|\nabla f(\boldsymbol{x_n})\|^2 \tag{Lemma}\\ \le &\ f(\boldsymbol{x^*})+\nabla f(\boldsymbol{x_n})^\top(\boldsymbol{x_n}-\boldsymbol{x^*})-\frac{\eta}{2}\|\nabla f(\boldsymbol{x_n})\|^2 \tag{Convexity}\\ = &\ f(\boldsymbol{x^*})-\frac{1}{\eta}(\boldsymbol{x_{n+1}}-\boldsymbol{x_n})^\top(\boldsymbol{x_n}-\boldsymbol{x^*})-\frac{1}{2\eta}\|\boldsymbol{x_{n+1}}-\boldsymbol{x_n}\|^2 \tag{GD}\\ = &\ f(\boldsymbol{x^*})+\frac{1}{2\eta}\left(\|\boldsymbol{x_n}-\boldsymbol{x^*}\|^2-\|\boldsymbol{x_{n+1}}-\boldsymbol{x^*}\|^2\right)\\ \end{align*} $$ Thus $$ \sum\limits_{i=0}^{t-1}(f(\boldsymbol{x_{i+1}})-f(\boldsymbol{x^*}))\le \frac{1}{2\eta}\left(\|\boldsymbol{x_0}-\boldsymbol{x^*}\|^2-\|\boldsymbol{x_t}-\boldsymbol{x^*}\|^2\right)\le \frac{1}{2\eta}\|\boldsymbol{x_0}-\boldsymbol{x^*}\|^2. $$

Since \(f(\boldsymbol{x_t})\le f(\boldsymbol{x_{t-1}})\le ...\le f(\boldsymbol{x_0})\), we have

\[f(\boldsymbol{x_t})-f(\boldsymbol{x^*})\le \frac{1}{t}\sum\limits_{i=0}^{t-1}(f(\boldsymbol{x_{i+1}})-f(\boldsymbol{x^*}))\le \frac{1}{2\eta t}\|\boldsymbol{x_0}-\boldsymbol{x^*}\|^2. \]

Therefore, we only need \(T=\frac{L\|\boldsymbol{x_0}-\boldsymbol{x^*}\|^2}{2\epsilon}\) steps to find a solution \(f(\boldsymbol{x_T})\le f(\boldsymbol{x^*})+\epsilon\).

We say that a function $f$ is $\mu$-strongly convex if for all $\boldsymbol{x},\boldsymbol{x'}$, $$f(\boldsymbol{x'})\ge f(\boldsymbol{x})+\nabla f(\boldsymbol{x})^\top(\boldsymbol{x'}-\boldsymbol{x})+\frac{\mu}{2}\|\boldsymbol{x'}-\boldsymbol{x}\|^2.$$
A function $f$ is $\mu$-strongly convex if and only if $\lambda_{\min}(\nabla^2 f(\boldsymbol{x}))\ge \mu$ for all $\boldsymbol{x}$.
If $f$ is $L$-smooth and $\mu$-strongly convex and $\boldsymbol{x^*}=\arg\min_{\boldsymbol{x}} f(\boldsymbol{x})$, then running GD with step size $\eta\le\frac{1}{L}$ satisfies $$f(\boldsymbol{x_n})\le f(\boldsymbol{x^*})+\frac{(1-\mu\eta)^nL}{2}\|\boldsymbol{x_0}-\boldsymbol{x^*}\|^2$$
$$ \begin{align*} \|\boldsymbol{x_{n+1}}-\boldsymbol{x^*}\|^2=&\ \|\boldsymbol{x_n}-\eta\nabla f(\boldsymbol{x_n})-\boldsymbol{x^*}\|^2\\ =&\ \|\boldsymbol{x_n}-\boldsymbol{x^*}\|^2-2\eta\nabla f(\boldsymbol{x_n})^\top(\boldsymbol{x_n}-\boldsymbol{x^*})+\eta^2\|\nabla f(\boldsymbol{x_n})\|^2\\ \le &\ \|\boldsymbol{x_n}-\boldsymbol{x^*}\|^2+2\eta\nabla f(\boldsymbol{x_n})^\top(\boldsymbol{x^*}-\boldsymbol{x_n})+2\eta(f(\boldsymbol{x_n})-f(\boldsymbol{x^*})) \tag{Lemma}\\ \le &\ \|\boldsymbol{x_n}-\boldsymbol{x^*}\|^2+2\eta(f(\boldsymbol{x^*})-f(\boldsymbol{x_n})-\frac{\mu}{2}\|\boldsymbol{x^*}-\boldsymbol{x_n}\|^2)+2\eta(f(\boldsymbol{x_n})-f(\boldsymbol{x^*})) \tag{Strong convexity}\\ = &\ (1-\mu\eta)\|\boldsymbol{x_n}-\boldsymbol{x^*}\|^2 \end{align*} $$ Immediately we have $$ \|\boldsymbol{x_{n}}-\boldsymbol{x^*}\|^2\le (1-\mu\eta)^n\|\boldsymbol{x_0}-\boldsymbol{x^*}\|^2. $$

Thus

\[\begin{align*} f(\boldsymbol{x_n})-f(\boldsymbol{x^*})\le &\ \nabla f(\boldsymbol{x^*})^\top(\boldsymbol{x_n}-\boldsymbol{x^*}) + \frac{L}{2}\|\boldsymbol{x_n}-\boldsymbol{x^*}\|^2 \tag{Smoothness}\\ = &\ \frac{L}{2}\|\boldsymbol{x_n}-\boldsymbol{x^*}\|^2 \tag{Since $\nabla f(\boldsymbol{x^*})=0$}\\ \le &\ \frac{L}{2}(1-\mu\eta)^n\|\boldsymbol{x_0}-\boldsymbol{x^*}\|^2\\ \end{align*} \]

We can see that strong convexity ensures a linear convergence rate.

Stochastic Gradient Descent

GD has two limitations:

  1. Each iteration requires computing the full gradient, which is expensive for large datasets;
  2. It may get stuck at stationary points (local minima, saddle points) for non-convex functions.

The solution is to add randomness. Stochastic Gradient Descent (SGD) is a variant of GD that only uses a small random subset of the data to compute the gradient at each iteration. Formally,

Start with a guess $\boldsymbol{x_0}$ and consider the sequence $\boldsymbol{x_0},\boldsymbol{x_1},\boldsymbol{x_2},...$ such that $$ \boldsymbol{x_{n+1}}=\boldsymbol{x_n} -\eta_n \boldsymbol{G_n}, n\ge 0 $$ where $\boldsymbol{G_n}$ is a random vector satisfying $\mathbb{E}[\boldsymbol{G_n}]=\nabla f(\boldsymbol{x_n})$. A common choice is to let $$\boldsymbol{G_n}=\frac{1}{|S|}\sum_{i\in S}\nabla \ell(\boldsymbol{x_n},x_i,y_i)$$ where $S$ is a random subset with fixed size (mini-batch).

Usually, \(|S|\) (batch size) is 64, 128, or 256.

  • If \(|S|\) is too small, the variance of \(\boldsymbol{G_n}\) is large;
  • If \(|S|\) is too large, \(\boldsymbol{G_n}\) is slow to compute.
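A minimal mini-batch SGD sketch under these definitions (the per-example gradient interface `grad_l` and the least-squares example are illustrative assumptions):

```python
import numpy as np

def sgd(grad_l, X, Y, x0, eta=0.1, batch_size=64, n_steps=1000, seed=0):
    """Mini-batch SGD: G_n averages per-example gradients over a random subset S."""
    rng = np.random.default_rng(seed)
    x = np.asarray(x0, dtype=float)
    n = len(X)
    for _ in range(n_steps):
        S = rng.choice(n, size=min(batch_size, n), replace=False)   # random mini-batch
        G = np.mean([grad_l(x, X[i], Y[i]) for i in S], axis=0)     # unbiased: E[G] = grad f(x)
        x = x - eta * G
    return x

# Example: least squares, l(w, x, y) = (w^T x - y)^2 / 2, so grad = (w^T x - y) x.
rng = np.random.default_rng(1)
X = rng.normal(size=(500, 3))
w_true = np.array([1.0, -2.0, 0.5])
Y = X @ w_true + 0.01 * rng.normal(size=500)
print(sgd(lambda w, x, y: (w @ x - y) * x, X, Y, x0=np.zeros(3)))   # close to w_true
```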
Define $\mathrm{var}(\boldsymbol{G_n})=\mathbb{E}[\|\boldsymbol{G_n}\|^2]-\|\mathbb{E}[\boldsymbol{G_n}]\|^2=\mathbb{E}[\|\boldsymbol{G_n}\|^2]-\|\nabla f(\boldsymbol{x_n})\|^2.$
If $f$ is $L$-smooth and convex and $\boldsymbol{x^*}=\arg\min_{\boldsymbol{x}} f(\boldsymbol{x})$, then running SGD with step size $\eta\le\frac{1}{L}$ and $\mathrm{var}(\boldsymbol{G_n})\le \sigma^2$ for all $n$ satisfies $$\mathbb{E}[f(\boldsymbol{\overline{x_n}})]\le f(\boldsymbol{x^*})+\frac{\|\boldsymbol{x_0}-\boldsymbol{x^*}\|^2}{2\eta n}+\eta\sigma^2$$ where $\boldsymbol{\overline{x_n}}=\frac{1}{n}\sum_{i=1}^{n}\boldsymbol{x_i}$.
$$ \begin{align*} \mathbb{E}[f(\boldsymbol{x_{n+1}})]\le &\ \mathbb{E}\left[f(\boldsymbol{x_n})+\nabla f(\boldsymbol{x_n})^\top(\boldsymbol{x_{n+1}}-\boldsymbol{x_n})+\frac{L}{2}\|\boldsymbol{x_{n+1}}-\boldsymbol{x_n}\|^2\right] \tag{Smoothness}\\ = &\ f(\boldsymbol{x_n})+\nabla f(\boldsymbol{x_n})^\top\mathbb{E}[\boldsymbol{x_{n+1}}-\boldsymbol{x_n}]+\frac{L}{2}\mathbb{E}\left[\|\boldsymbol{x_{n+1}}-\boldsymbol{x_n}\|^2\right] \\ = &\ f(\boldsymbol{x_n})-\eta\nabla f(\boldsymbol{x_n})^\top\mathbb{E}\left[\boldsymbol{G_n}\right]+\frac{L\eta^2}{2}\mathbb{E}\left[\|\boldsymbol{G_n}\|^2\right] \tag{SGD}\\ = &\ f(\boldsymbol{x_n})-\eta\|\nabla f(\boldsymbol{x_n})\|^2+\frac{L\eta^2}{2}\left(\mathrm{var}(\boldsymbol{G_n})+\|\nabla f(\boldsymbol{x_n})\|^2\right) \tag{Lemma}\\ \le &\ f(\boldsymbol{x_n})-\eta\left(1-\frac{L\eta}{2}\right)\|\nabla f(\boldsymbol{x_n})\|^2+\frac{L\eta^2}{2}\sigma^2\\ \le &\ f(\boldsymbol{x_n})-\frac{\eta}{2}\|\nabla f(\boldsymbol{x_n})\|^2+\frac{\eta}{2}\sigma^2\\ \le &\ f(\boldsymbol{x^*})+\nabla f(\boldsymbol{x_n})^\top(\boldsymbol{x_n}-\boldsymbol{x^*})-\frac{\eta}{2}\|\nabla f(\boldsymbol{x_n})\|^2+\frac{\eta}{2}\sigma^2 \tag{Convexity}\\ = &\ f(\boldsymbol{x^*})-\frac{1}{\eta}\mathbb{E}[\boldsymbol{x_{n+1}}-\boldsymbol{x_n}]^\top(\boldsymbol{x_n}-\boldsymbol{x^*})-\frac{\eta}{2}\left(\mathbb{E}\left[\|\boldsymbol{G_n}\|^2\right]-\mathrm{var}(\boldsymbol{G_n})\right)+\frac{\eta}{2}\sigma^2\\ \le &\ f(\boldsymbol{x^*})-\mathbb{E}\left[\frac{1}{\eta}(\boldsymbol{x_{n+1}}-\boldsymbol{x_n})^\top(\boldsymbol{x_n}-\boldsymbol{x^*})+\frac{1}{2\eta}\|\boldsymbol{x_{n+1}}-\boldsymbol{x_n}\|^2\right]+\eta\sigma^2\\ = &\ f(\boldsymbol{x^*})-\mathbb{E}\left[\frac{1}{2\eta}\left(\|\boldsymbol{x_{n+1}}-\boldsymbol{x^*}\|^2-\|\boldsymbol{x_n}-\boldsymbol{x^*}\|^2\right)\right]+\eta\sigma^2\\ \end{align*} $$

Thus

\[\begin{align*} \mathbb{E}[f(\boldsymbol{\overline{x_n}})]\le &\ \frac{1}{n}\sum_{i=0}^{n-1}\mathbb{E}[f(\boldsymbol{x_{i+1}})] \tag{Jensen}\\ \le &\ \frac{1}{n}\sum_{i=0}^{n-1}\left(f(\boldsymbol{x^*})-\mathbb{E}\left[\frac{1}{2\eta}(\|\boldsymbol{x_{i+1}}-\boldsymbol{x^*}\|^2-\|\boldsymbol{x_i}-\boldsymbol{x^*}\|^2)\right]+\eta\sigma^2\right)\\ \le &\ f(\boldsymbol{x^*})+\frac{1}{2\eta n}\|\boldsymbol{x_0}-\boldsymbol{x^*}\|^2+\eta\sigma^2\\ \end{align*} \]

Therefore, letting

\[T=\frac{2\|\boldsymbol{x_0}-\boldsymbol{x^*}\|^2\sigma^2}{\epsilon^2},\eta=\frac{\epsilon}{2\sigma^2} \]

gives \(\mathbb{E}[f(\boldsymbol{\overline{x_T}})]\le f(\boldsymbol{x^*})+\epsilon\).

We can see that the convergence rate of SGD is \(O\left(\frac{1}{\sqrt{n}}\right)\), which is slower than GD's \(O\left(\frac{1}{n}\right)\) rate. This is because of the variance introduced by the stochastic gradient. Furthermore, in the strongly convex case, SGD achieves an \(O\left(\frac{1}{n}\right)\) rate.

In practice, the noise is not that harmful, and we care about the population loss rather than the training loss. What's more, we can use various tricks to reduce the variance.

Stochastic Variance Reduced Gradient

The main idea is to maintain some gradient statistics so that the expected gradient stays correct while its variance is reduced.

  • For \(s=1,2,...\)
    • \(\widetilde{\boldsymbol{x}}=\boldsymbol{\widetilde{x}_{s-1}}\)
    • \(\widetilde{\boldsymbol{u}}=\frac{1}{N}\sum_{i=1}^N\nabla l_i(\widetilde{\boldsymbol{x}})=\nabla f(\widetilde{\boldsymbol{x}})\)
    • \(\boldsymbol{x_0}=\widetilde{\boldsymbol{x}}\)
    • For \(t=1,2,...,m\)
      • Randomly pick a sample \(i_t\) from \(\{1,2,...,N\}\)
      • \(\boldsymbol{x_t}=\boldsymbol{x_{t-1}}-\eta(\nabla l_{i_t}(\boldsymbol{x_{t-1}})-\nabla l_{i_t}(\widetilde{\boldsymbol{x}})+\boldsymbol{\widetilde{u}})\)
    • Two update rules: \(\boldsymbol{\widetilde{x}_s}=\boldsymbol{x_m}\) or \(\boldsymbol{x_t}\) for randomly chosen \(t\in \{0,1,...,m-1\}\).

In practice, we usually choose the first rule, while the second rule is used for theoretical analysis.
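A minimal sketch of the loop above, using the first update rule (the interface `grad_li(x, i)`, returning \(\nabla l_i(\boldsymbol{x})\), is a hypothetical convention for illustration):

```python
import numpy as np

def svrg(grad_li, N, x0, eta=0.01, m=100, n_epochs=10, seed=0):
    """SVRG: grad_li(x, i) returns the gradient of the i-th loss l_i at x."""
    rng = np.random.default_rng(seed)
    x_tilde = np.asarray(x0, dtype=float)
    for _ in range(n_epochs):                        # outer loop over s
        # full gradient at the snapshot: u_tilde = grad f(x_tilde)
        u_tilde = np.mean([grad_li(x_tilde, i) for i in range(N)], axis=0)
        x = x_tilde.copy()
        for _ in range(m):                           # inner loop over t
            i = rng.integers(N)                      # sample i_t uniformly
            # variance-reduced direction: unbiased, small variance near x_tilde
            v = grad_li(x, i) - grad_li(x_tilde, i) + u_tilde
            x = x - eta * v
        x_tilde = x                                  # first update rule
    return x_tilde
```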

If each $l_i$ is $L$-smooth and convex, $f$ is $\mu$-strongly convex, and $\boldsymbol{x^*}=\arg\min_{\boldsymbol{x}} f(\boldsymbol{x})$, then running SVRG (with the second update rule) with step size $\eta<\frac{1}{2L}$ satisfies $$\mathbb{E}[f(\boldsymbol{\widetilde{x}_n})]\le f(\boldsymbol{x^*})+\left(\frac 1{m\mu \eta (1 - 2L \eta)} + \frac {2L\eta}{1 - 2L\eta}\right)^n\left(\mathbb{E}[f(\boldsymbol{\widetilde{x}_0})]-f(\boldsymbol{x^*})\right).$$
For all $\boldsymbol{x}$, if $i$ is randomly chosen from $\{1,2,...,N\}$, then $$ \mathbb{E}\left[\|\nabla l_i(\boldsymbol{x})-\nabla l_i(\boldsymbol{x^*})\|^2\right]\le 2L(f(\boldsymbol{x})-f(\boldsymbol{x^*})). $$
Consider any $i\in \{1,2,...,N\}$, let $$g_i(\boldsymbol{x})=l_i(\boldsymbol{x})-l_i(\boldsymbol{x^*})-\nabla l_i(\boldsymbol{x^*})^\top(\boldsymbol{x}-\boldsymbol{x^*}).$$ Clearly, $$ \nabla g_i(\boldsymbol{x})=\nabla l_i(\boldsymbol{x})-\nabla l_i(\boldsymbol{x^*}). $$ Since $l_i(\boldsymbol{x})$ is $L$-smooth and convex, $g_i(\boldsymbol{x})$ is also $L$-smooth and convex. Also we have $\nabla g_i(\boldsymbol{x^*})=0$, so $\boldsymbol{x^*}=\arg\min_x g_i(\boldsymbol{x})$.

\[\begin{align*} 0=g_i(\boldsymbol{x^*})\le &\ \min_\eta\{g_i(\boldsymbol{x}-\eta\nabla g_i(\boldsymbol{x}))\}\\ \le &\ \min_\eta\{g_i(\boldsymbol{x})-\eta \|\nabla g_i(\boldsymbol{x})\|^2 + \frac L2 \eta^2 \|\nabla g_i(\boldsymbol{x})\|^2\} \tag{Smoothness} \\= &\ g_i(\boldsymbol{x}) - \frac 1{2L} \|\nabla g_i(\boldsymbol{x})\|^2 \end{align*} \]

Immediately we have

\[\|\nabla l_i(\boldsymbol{x})-\nabla l_i(\boldsymbol{x^*})\|^2\le 2L(l_i(\boldsymbol{x})-l_i(\boldsymbol{x^*})-\nabla l_i(\boldsymbol{x^*})^\top(\boldsymbol{x}-\boldsymbol{x^*})). \]

Taking expectation over \(i\), we have

\[\mathbb{E}\left[\|\nabla l_i(\boldsymbol{x})-\nabla l_i(\boldsymbol{x^*})\|^2\right]\le 2L(f(\boldsymbol{x})-f(\boldsymbol{x^*})). \]

Denote the update term as \(\boldsymbol{v_t}=\nabla l_{i_t}(\boldsymbol{x_{t-1}})-\nabla l_{i_t}(\boldsymbol{\widetilde{x}})+\nabla f(\widetilde{\boldsymbol{x}})\). Then

\[\mathbb{E}\left[\boldsymbol{v_t}\right]=\mathbb{E}\left[\nabla l_{i_t}(\boldsymbol{x_{t-1}})\right]-\mathbb{E}\left[\nabla l_{i_t}(\widetilde{\boldsymbol{x}})\right]+\nabla f(\widetilde{\boldsymbol{x}})=\nabla f(\boldsymbol{x_{t-1}}) \]

and

\[\begin{align*}&\ \mathbb E\left[\|\boldsymbol{v_t}\|^2\right] \\\le &\ 2 \mathbb E\left[\|\nabla l_{i_t}(\boldsymbol{x_{t-1}}) - \nabla l_{i_t}(\boldsymbol{x^*})\|^2\right] + 2 \mathbb E\left[\|\nabla l_{i_t}(\widetilde {\boldsymbol{x}}) - \nabla l_{i_t}(\boldsymbol{x^*}) - \nabla f(\widetilde {\boldsymbol{x}})\|^2\right] \tag{$\|a + b\|^2 \leq 2\|a\|^2 + 2\|b\|^2$} \\ = &\ 2 \mathbb E\left[\|\nabla l_{i_t}(\boldsymbol{x_{t-1}}) - \nabla l_{i_t}(\boldsymbol{x^*})\|^2\right] + 2 \mathbb E\left[\|\nabla l_{i_t}(\widetilde {\boldsymbol{x}}) - \nabla l_{i_t}(\boldsymbol{x^*}) - \mathbb E\left[\nabla l_{i_t}(\widetilde {\boldsymbol{x}})\right]\|^2\right] \\=&\ 2 \mathbb E\left[\|\nabla l_{i_t}(\boldsymbol{x_{t-1}}) - \nabla l_{i_t}(\boldsymbol{x^*})\|^2\right] + 2 \mathbb E\left[\|\nabla l_{i_t}(\widetilde {\boldsymbol{x}}) - \nabla l_{i_t}(\boldsymbol{x^*}) - \mathbb E\left[\nabla l_{i_t}(\widetilde {\boldsymbol{x}})-\nabla l_{i_t}(\boldsymbol{x^*})\right]\|^2\right] \qquad\qquad\qquad\qquad\tag{$\mathbb{E}\left[\nabla l_{i_t}(\boldsymbol{x^*})\right]=0$}\\\le &\ 2 \mathbb E\left[\|\nabla l_{i_t}(\boldsymbol{x_{t-1}}) - \nabla l_{i_t}(\boldsymbol{x^*})\|^2\right] + 2 \mathbb E\left[\|\nabla l_{i_t}(\widetilde {\boldsymbol{x}}) - \nabla l_{i_t}(\boldsymbol{x^*})\|^2\right] \tag{$\mathbb{E}\left[\|X-\mathbb{E}[X]\|^2\right]\le \mathbb{E}\left[\|X\|^2\right]$}\\\le &\ 4L(f(\boldsymbol{x_{t-1}}) - f(\boldsymbol{x^*}) + f(\widetilde {\boldsymbol{x}}) - f(\boldsymbol{x^*})) \tag{Lemma}\\ \end{align*} \]

Next, consider a single inner step \(t\) (expectations below are conditioned on \(\boldsymbol{x_{t-1}}\)),

\[\begin{align*} \mathbb{E}\left[\|\boldsymbol{x_t}-\boldsymbol{x^*}\|^2\right]=&\ \mathbb{E}\left[\|\boldsymbol{x_{t-1}}-\eta \boldsymbol{v_t}-\boldsymbol{x^*}\|^2\right]\\ =&\ \mathbb{E}\left[\|\boldsymbol{x_{t-1}}-\boldsymbol{x^*}\|^2\right]-2\eta(\boldsymbol{x_{t-1}}-\boldsymbol{x^*})^\top\mathbb{E}\left[\boldsymbol{v_t}\right]+\eta^2\mathbb{E}\left[\|\boldsymbol{v_t}\|^2\right]\\ \le &\ \mathbb{E}\left[\|\boldsymbol{x_{t-1}}-\boldsymbol{x^*}\|^2\right]-2\eta(\boldsymbol{x_{t-1}}-\boldsymbol{x^*})^\top\nabla f(\boldsymbol{x_{t-1}})+4L\eta^2(f(\boldsymbol{x_{t-1}}) - f(\boldsymbol{x^*}) + f(\widetilde {\boldsymbol{x}}) - f(\boldsymbol{x^*})) \qquad\\ \le &\ \mathbb{E}\left[\|\boldsymbol{x_{t-1}}-\boldsymbol{x^*}\|^2\right]-2\eta(f(\boldsymbol{x_{t-1}})-f(\boldsymbol{x^*}))+4L\eta^2(f(\boldsymbol{x_{t-1}}) - f(\boldsymbol{x^*}) + f(\widetilde {\boldsymbol{x}}) - f(\boldsymbol{x^*})) \tag{Convexity}\\ \le &\ \mathbb{E}\left[\|\boldsymbol{x_{t-1}}-\boldsymbol{x^*}\|^2\right]-2\eta(1-2L\eta)(f(\boldsymbol{x_{t-1}})-f(\boldsymbol{x^*})) + 4L\eta^2(f(\widetilde {\boldsymbol{x}}) - f(\boldsymbol{x^*}))\\ \end{align*} \]

Summing over \(t=1,2,...,m\), we have

\[\begin{align*} 0\le &\ \mathbb{E}\left[\|\boldsymbol{x_m}-\boldsymbol{x^*}\|^2\right]\\ \le &\ \mathbb{E}\left[\|\boldsymbol{x_0}-\boldsymbol{x^*}\|^2\right]-2\eta(1-2L\eta)\mathbb{E}\left[\sum_{t=0}^{m-1}(f(\boldsymbol{x_t})-f(\boldsymbol{x^*}))\right] + 4L\eta^2 m\mathbb{E}\left[(f(\widetilde {\boldsymbol{x}}) - f(\boldsymbol{x^*}))\right]\\ \le &\ \mathbb{E}\left[\|\boldsymbol{\widetilde{x}_{s-1}}-\boldsymbol{x^*}\|^2\right]-2\eta(1-2L\eta)m\mathbb{E}[(f(\boldsymbol{\widetilde{x}_s})-f(\boldsymbol{x^*}))] + 4L\eta^2 m\mathbb{E}\left[(f(\boldsymbol{\widetilde {x}_{s-1}}) - f(\boldsymbol{x^*}))\right] \tag{Update rule 2}\\ \le &\ \frac{2}{\mu}\mathbb{E}\left[f(\boldsymbol{\widetilde{x}_{s-1}})-f(\boldsymbol{x^*})-\nabla f(\boldsymbol{x^*})^\top(\boldsymbol{\widetilde{x}_{s-1}}-\boldsymbol{x^*}) \right] \tag{Strong convexity}\\ &\ -2\eta(1-2L\eta)m\mathbb{E}[(f(\boldsymbol{\widetilde{x}_s})-f(\boldsymbol{x^*}))] + 4L\eta^2 m\mathbb{E}\left[(f(\boldsymbol{\widetilde {x}_{s-1}}) - f(\boldsymbol{x^*}))\right] \\ \le &\ \left(\frac{2}{\mu}+4L\eta^2m\right)\mathbb{E}\left[f(\boldsymbol{\widetilde{x}_{s-1}})-f(\boldsymbol{x^*})\right]-2\eta(1-2L\eta)m\mathbb{E}[(f(\boldsymbol{\widetilde{x}_s})-f(\boldsymbol{x^*}))]\\ \end{align*} \]

Finally we have

\[\mathbb{E}[f(\boldsymbol{\widetilde{x}_s})]-f(\boldsymbol{x^*})\le \left(\frac 1{m\mu \eta (1 - 2L \eta)} + \frac {2L\eta}{1 - 2L\eta}\right)\left(\mathbb{E}[f(\boldsymbol{\widetilde{x}_{s-1}})]-f(\boldsymbol{x^*})\right). \]

This theorem shows that SVRG has a linear convergence rate in the strongly convex case. Remember that SGD only achieves \(O\left(\frac{1}{n}\right)\) rate in this case.

Linear Coupling

Mirror Descent

Here's another way to interpret GD. By Taylor expansion, \(f(\boldsymbol{x_n})+\nabla f(\boldsymbol{x_n})^\top(\boldsymbol{x}-\boldsymbol{x_n})\) approximates \(f(\boldsymbol{x})\) to first order. However, we can't directly minimize this approximation (it is linear, hence unbounded below), so a regularization term \(\frac{1}{2\eta}\|\boldsymbol{x}-\boldsymbol{x_n}\|^2\) is added to keep the solution close to \(\boldsymbol{x_n}\). That is,

\[\begin{align*} \boldsymbol{x_{n+1}}=&\ \arg\min_{\boldsymbol{x}}\left\{f(\boldsymbol{x_n})+\nabla f(\boldsymbol{x_n})^\top(\boldsymbol{x}-\boldsymbol{x_n})+\frac{1}{2\eta}\|\boldsymbol{x}-\boldsymbol{x_n}\|^2\right\}\\ =&\ \arg\min_{\boldsymbol{x}}\left\{\eta\nabla f(\boldsymbol{x_n})^\top(\boldsymbol{x}-\boldsymbol{x_n})+\frac{1}{2}\|\boldsymbol{x}-\boldsymbol{x_n}\|^2\right\} \end{align*} \]

which is equivalent to \(\boldsymbol{x_{n+1}}=\boldsymbol{x_n}-\eta\nabla f(\boldsymbol{x_n})\), the update rule of GD.

Now we want to generalize this term.

For a strictly convex function $w$ (called distance generating function), we define the Bregman divergence w.r.t. $w$ as $$V_{\boldsymbol{x}}(\boldsymbol{y})=w(\boldsymbol{y})-\nabla w(\boldsymbol{x})^\top(\boldsymbol{y}-\boldsymbol{x})-w(\boldsymbol{x}).$$

Assume \(w\) is \(1\)-strongly convex; then \(V_{\boldsymbol{x}}(\boldsymbol{y})\ge \frac{1}{2}\|\boldsymbol{y}-\boldsymbol{x}\|^2\), which recovers the original regularization term.

For any $\boldsymbol{x},\boldsymbol{y},\boldsymbol{z}$, we have $$ -\nabla V_{\boldsymbol{x}}(\boldsymbol{y})^\top(\boldsymbol{y}-\boldsymbol{z})=V_{\boldsymbol{x}}(\boldsymbol{z})-V_{\boldsymbol{y}}(\boldsymbol{z})-V_{\boldsymbol{x}}(\boldsymbol{y}). $$
$$ \begin{align*} &\ V_{\boldsymbol{x}}(\boldsymbol{z})-V_{\boldsymbol{y}}(\boldsymbol{z})-V_{\boldsymbol{x}}(\boldsymbol{y}) \\ =&\ (w(\boldsymbol{z})-\nabla w(\boldsymbol{x})^\top(\boldsymbol{z}-\boldsymbol{x})-w(\boldsymbol{x}))-(w(\boldsymbol{z})-\nabla w(\boldsymbol{y})^\top(\boldsymbol{z} -\boldsymbol{y})-w(\boldsymbol{y}))-(w(\boldsymbol{y})-\nabla w(\boldsymbol{x})^\top(\boldsymbol{y}-\boldsymbol{x})-w(\boldsymbol{x}))\\ =&\ \nabla w(\boldsymbol{x})^\top(\boldsymbol{y}-\boldsymbol{z})-\nabla w(\boldsymbol{y})^\top(\boldsymbol{y}-\boldsymbol{z})\\ =&\ -\nabla V_{\boldsymbol{x}}(\boldsymbol{y})^\top(\boldsymbol{y}-\boldsymbol{z}) \end{align*} $$
Start with a guess $\boldsymbol{x_0}$ and consider the sequence $\boldsymbol{x_0},\boldsymbol{x_1},\boldsymbol{x_2},...$ such that $$ \boldsymbol{x_{n+1}}=\mathrm{Mirr}_{\boldsymbol{x_n}}(\alpha \nabla f(\boldsymbol{x_n})), n\ge 0 $$ where $\mathrm{Mirr}_{\boldsymbol{x}}(\boldsymbol{\xi})=\arg\min_{\boldsymbol{y}} \left\{V_{\boldsymbol{x}}(\boldsymbol{y})+\boldsymbol{\xi}^\top(\boldsymbol{y}-\boldsymbol{x})\right\}$.

We can see that it is actually GD in the mirror space. One can verify that the mirror map is \(\nabla w\), so

For any $n\ge 0$, $\nabla w(\boldsymbol{x_{n+1}})=\nabla w(\boldsymbol{x_n})-\alpha\nabla f(\boldsymbol{x_n}).$
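For a concrete instance, take the negative-entropy distance generating function \(w(\boldsymbol{x})=\sum_i x_i\log x_i\) on the probability simplex; the mirror step then becomes the familiar multiplicative-weights update. A minimal sketch (step sizes and the linear test problem are illustrative):

```python
import numpy as np

def mirror_descent_simplex(grad_f, x0, alpha=0.1, n_steps=100):
    """Mirror descent on the probability simplex with the negative-entropy DGF
    w(x) = sum_i x_i log x_i, whose mirror map is grad w(x) = 1 + log x."""
    x = np.asarray(x0, dtype=float)
    for _ in range(n_steps):
        # mirror step: log x <- log x - alpha * grad f(x), then renormalize
        x = x * np.exp(-alpha * grad_f(x))
        x = x / x.sum()
    return x

# Example: minimize a linear function f(x) = c^T x over the simplex;
# the mass concentrates on the coordinate with the smallest c_i.
c = np.array([0.3, 0.1, 0.5])
print(mirror_descent_simplex(lambda x: c, np.ones(3) / 3, alpha=0.5, n_steps=200))
```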
If $f$ is $\rho$-Lipschitz and convex and $\boldsymbol{x^*}=\arg\min_{\boldsymbol{x}} f(\boldsymbol{x})$, then running MD with step size $\alpha=\frac{\sqrt{2\Theta}}{\rho\sqrt{n}}$ satisfies $$ f(\boldsymbol{\overline{x_n}})-f(\boldsymbol{x^*})\le \frac{\rho\sqrt{2\Theta}}{\sqrt{n}} $$ where $V_{\boldsymbol{x_0}}(\boldsymbol{x^*})\le \Theta$.
$$ \begin{align*} &\ \alpha (f(\boldsymbol{x_n})-f(\boldsymbol{x^*}))\\ \le&\ \alpha \nabla f(\boldsymbol{x_n})^\top(\boldsymbol{x_n}-\boldsymbol{x^*}) \tag{Convexity}\\ =&\ \alpha \nabla f(\boldsymbol{x_n})^\top(\boldsymbol{x_n}-\boldsymbol{x_{n+1}})+\alpha \nabla f(\boldsymbol{x_n})^\top(\boldsymbol{x_{n+1}}-\boldsymbol{x^*})\\ =&\ \alpha \nabla f(\boldsymbol{x_n})^\top(\boldsymbol{x_n}-\boldsymbol{x_{n+1}})- \nabla V_{\boldsymbol{x_n}}(\boldsymbol{x_{n+1}})^\top(\boldsymbol{x_{n+1}}-\boldsymbol{x^*}) \tag{Lemma}\\ =&\ \alpha \nabla f(\boldsymbol{x_n})^\top(\boldsymbol{x_n}-\boldsymbol{x_{n+1}})+V_{\boldsymbol{x_n}}(\boldsymbol{x^*})-V_{\boldsymbol{x_{n+1}}}(\boldsymbol{x^*})-V_{\boldsymbol{x_n}}(\boldsymbol{x_{n+1}}) \tag{Proposition}\\ \le &\ \left(\alpha \nabla f(\boldsymbol{x_n})^\top(\boldsymbol{x_n}-\boldsymbol{x_{n+1}})-\frac{1}{2}\|\boldsymbol{x_n}-\boldsymbol{x_{n+1}}\|^2\right)+V_{\boldsymbol{x_n}}(\boldsymbol{x^*})-V_{\boldsymbol{x_{n+1}}}(\boldsymbol{x^*}) \\ \le &\ \frac{\alpha^2}{2}\|\nabla f(\boldsymbol{x_n})\|^2+V_{\boldsymbol{x_n}}(\boldsymbol{x^*})-V_{\boldsymbol{x_{n+1}}}(\boldsymbol{x^*}) \\ \le &\ \frac{\alpha^2\rho^2}{2}+V_{\boldsymbol{x_n}}(\boldsymbol{x^*})-V_{\boldsymbol{x_{n+1}}}(\boldsymbol{x^*}) \tag{Lipschitz}\\ \end{align*} $$

Thus

\[\begin{align*} &\ \alpha n(f(\overline{\boldsymbol{x_n}})-f(\boldsymbol{x^*}))\\ \le &\ \sum_{i=0}^{n-1}\alpha (f(\boldsymbol{x_i})-f(\boldsymbol{x^*}))\\ \le &\ \sum_{i=0}^{n-1}\left(\frac{\alpha^2\rho^2}{2}+V_{\boldsymbol{x_i}}(\boldsymbol{x^*})-V_{\boldsymbol{x_{i+1}}}(\boldsymbol{x^*})\right)\\ =&\ \frac{n\alpha^2\rho^2}{2}+V_{\boldsymbol{x_0}}(\boldsymbol{x^*})-V_{\boldsymbol{x_n}}(\boldsymbol{x^*})\\ \le &\ \frac{n\alpha^2\rho^2}{2}+\Theta\\ \end{align*} \]

Immediately we have

\[f(\boldsymbol{\overline{x_n}})-f(\boldsymbol{x^*})\le \frac{\alpha\rho^2}{2}+\frac{\Theta}{\alpha n}=\frac{\rho\sqrt{2\Theta}}{\sqrt{n}}. \]

We can see that MD has a convergence rate of \(O\left(\frac{1}{\sqrt{n}}\right)\).

MD performs well when \(\rho\) (i.e., the bound on \(\|\nabla f(\boldsymbol{x})\|\)) is small, while GD makes large progress when \(\|\nabla f(\boldsymbol{x})\|\) is large (recall the descent lemma). A straightforward idea is to combine these two algorithms.

Linear Coupling

This leads to the linear coupling method. For each step, we run MD and GD in parallel, and then combine the results.

Start with a guess $\boldsymbol{x_0}$ and consider the sequences $\left\{\boldsymbol{x_n}\right\}_{n\ge 0},\left\{\boldsymbol{y_n}\right\}_{n\ge 0},\left\{\boldsymbol{z_n}\right\}_{n\ge 0}$ such that for $n\ge 0$, $$ \begin{align*} \boldsymbol{x_{n+1}}=&\ \tau\boldsymbol{z_n}+(1-\tau)\boldsymbol{y_n}, \\ \boldsymbol{y_{n+1}}=&\ \mathrm{Grad}(\boldsymbol{x_{n+1}})=\boldsymbol{x_{n+1}}-\eta \nabla f(\boldsymbol{x_{n+1}}), \\ \boldsymbol{z_{n+1}}=&\ \mathrm{Mirr}_{\boldsymbol{z_n}}(\alpha \nabla f(\boldsymbol{x_{n+1}})), \\ \end{align*} $$ where $\eta,\alpha,\tau$ are hyperparameters.
If $f$ is $L$-smooth and convex and $\boldsymbol{x^*}=\arg\min_{\boldsymbol{x}} f(\boldsymbol{x})$, then running LC with $\eta=\frac{1}{L},\alpha=\sqrt{\frac{\Theta}{Ld}}$ and $\tau=\frac{1}{\alpha L+1}$ satisfies $$ f(\boldsymbol{\overline{x_n}})-f(\boldsymbol{x^*})\le \frac{2\sqrt{L\Theta d}}{n}. $$ where $V_{\boldsymbol{x_0}}(\boldsymbol{x^*})\le \Theta$ and $f(\boldsymbol{y_0})-f(\boldsymbol{x^*})\le d$.

\[\begin{align*} &\ \alpha (f(\boldsymbol{x_{n+1}})-f(\boldsymbol{x^*}))\\ \le &\ \alpha \nabla f(\boldsymbol{x_{n+1}})^\top(\boldsymbol{x_{n+1}}-\boldsymbol{x^*}) \tag{Convexity}\\ =&\ \alpha \nabla f(\boldsymbol{x_{n+1}})^\top(\boldsymbol{x_{n+1}}-\boldsymbol{z_n})+\alpha \nabla f(\boldsymbol{x_{n+1}})^\top(\boldsymbol{z_n}-\boldsymbol{x^*}) \end{align*} \]

The first term:

\[\begin{align*} &\ \alpha \nabla f(\boldsymbol{x_{n+1}})^\top(\boldsymbol{x_{n+1}}-\boldsymbol{z_n})\\ = &\ \frac{(1-\tau)\alpha}{\tau}\nabla f(\boldsymbol{x_{n+1}})^\top(\boldsymbol{y_{n}}-\boldsymbol{x_{n+1}}) \tag{Coupling}\\ \le &\ \alpha^2L (f(\boldsymbol{y_{n}})-f(\boldsymbol{x_{n+1}})) \tag{Convexity, $\tau=\frac{1}{\alpha L+1}$}\\ \end{align*} \]

The second term:

\[\begin{align*} &\ \alpha \nabla f(\boldsymbol{x_{n+1}})^\top(\boldsymbol{z_n}-\boldsymbol{x^*})\\ \le &\ \frac{\alpha^2}{2}\|\nabla f(\boldsymbol{x_{n+1}})\|^2+V_{\boldsymbol{z_n}}(\boldsymbol{x^*})-V_{\boldsymbol{z_{n+1}}}(\boldsymbol{x^*}) \tag{MD}\\ \le &\ \alpha^2L(f(\boldsymbol{x_{n+1}})-f(\boldsymbol{y_{n+1}}))+V_{\boldsymbol{z_n}}(\boldsymbol{x^*})-V_{\boldsymbol{z_{n+1}}}(\boldsymbol{x^*}) \tag{GD}\\ \end{align*} \]

Combining them gives

\[\alpha (f(\boldsymbol{x_{n+1}})-f(\boldsymbol{x^*}))\le \alpha^2L (f(\boldsymbol{y_{n}})-f(\boldsymbol{y_{n+1}}))+V_{\boldsymbol{z_n}}(\boldsymbol{x^*})-V_{\boldsymbol{z_{n+1}}}(\boldsymbol{x^*}). \]

Summing over \(n\) gives

\[\begin{align*} &\ \alpha n(f(\boldsymbol{\overline{x_n}})-f(\boldsymbol{x^*}))\\ \le &\ \alpha \sum_{i=0}^{n-1} (f(\boldsymbol{x_{i+1}})-f(\boldsymbol{x^*})) \tag{Convexity}\\ \le &\ \alpha^2L (f(\boldsymbol{y_{0}})-f(\boldsymbol{y_{n}}))+V_{\boldsymbol{z_0}}(\boldsymbol{x^*})-V_{\boldsymbol{z_n}}(\boldsymbol{x^*}) \\ \end{align*} \]

Thus

\[f(\boldsymbol{\overline{x_n}})-f(\boldsymbol{x^*})\le \frac{1}{n}\left(\alpha L d+\frac{\Theta}{\alpha}\right)=\frac{2\sqrt{L\Theta d}}{n}. \]

Setting \(T=4\sqrt{\frac{L\Theta}{d}}\) gives

\[f(\boldsymbol{\overline{x_T}})-f(\boldsymbol{x^*})\le \frac{d}{2}. \]

Now we get a new parameter \(d'=\frac{d}{2}\) and update all the hyperparameters. Then do the same thing again with \(T'=4\sqrt{\frac{L\Theta}{d'}}\), and so on. Finally, we want \( f(\boldsymbol{\overline{x_T}})-f(\boldsymbol{x^*})\le \epsilon\). So the total number of iterations is

\[O\left(\sqrt{\frac{L\Theta}{\epsilon}}+\sqrt{\frac{L\Theta}{2\epsilon}}+\sqrt{\frac{L\Theta}{4\epsilon}}+\cdots\right)=O\left(\sqrt{\frac{L\Theta}{\epsilon}}\right). \]

It achieves a convergence rate of \(O\left(\frac{1}{n^2}\right)\), which is known to be optimal for first-order methods under these assumptions.

Non-convex Optimization

Matrix Completion

Given a matrix \(A\) with some entries missing, complete the matrix by filling in the missing entries. Assume

  • \(A\) is low-rank;
  • The known entries of \(A\) are sampled uniformly at random;
  • (Incoherence) Let the SVD of \(A\) be \(A = U \Sigma V^\top\), where \(U\in\mathbb{R}^{n\times r}, V\in\mathbb{R}^{m\times r},\Sigma\in\mathbb{R}^{r\times r}\). There is a constant \(u\) such that for all \(i\in[n],j\in[m]\),

\[\left\|e_i^\top U\right\|\le \sqrt{\frac{ur}{n}},\quad \left\|e_j^\top V\right\|\le \sqrt{\frac{ur}{m}}. \]

Approach: Find low-rank \(U\) and \(V\), so that \(UV^\top\) matches the known entries of \(A\). Formally,

\[\min_{U\in\mathbb{R}^{n\times r}, V\in\mathbb{R}^{m\times r}} \|P_\Omega\left(UV^\top-A\right)\|^2, \]

where \(\Omega\) is the set of indices of the known entries of \(A\) and \(P_\Omega\) is the projection onto these entries.

Unlike before, this loss function is not convex; it can have multiple local minima. However, notice that if we fix \(V\), the loss function is convex in \(U\), and vice versa. Therefore, we can use alternating minimization to find a local minimum. That is, for \(t=0,1,\ldots,T-1\), do

\[\begin{align*} U_{t+1} &\leftarrow \arg\min_U \|P_\Omega\left(UV_t^\top-A\right)\|^2,\\ V_{t+1} &\leftarrow \arg\min_V \|P_\Omega\left(U_{t+1}V^\top-A\right)\|^2. \end{align*} \]

Each subproblem is convex and can be solved efficiently.
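A minimal sketch of this alternating scheme (rank, iteration count, and the row-wise least-squares solver are illustrative choices):

```python
import numpy as np

def alt_min_completion(A, mask, r, n_iters=50, seed=0):
    """Alternating minimization for matrix completion.
    A: (n, m) matrix (values outside mask are ignored); mask: boolean (n, m),
    True where the entry is observed; r: target rank."""
    n, m = A.shape
    rng = np.random.default_rng(seed)
    U = rng.normal(size=(n, r))
    V = rng.normal(size=(m, r))
    for _ in range(n_iters):
        # Fix V: each row u_i solves min_u sum_{j observed} (u^T v_j - A_ij)^2.
        for i in range(n):
            obs = mask[i]
            if obs.any():
                U[i] = np.linalg.lstsq(V[obs], A[i, obs], rcond=None)[0]
        # Fix U: solve for each row of V symmetrically.
        for j in range(m):
            obs = mask[:, j]
            if obs.any():
                V[j] = np.linalg.lstsq(U[obs], A[obs, j], rcond=None)[0]
    return U @ V.T
```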

Escaping From Saddle Points

In fact, non-convex objectives are common in ML, thanks to the following assumption: no spurious local minima (i.e., all local minima are equally good, so it suffices to find any one of them). The only remaining question is how to escape saddle points.

For stationary points \(\nabla L(\boldsymbol{x})=0\):

  • If \(\nabla^2 L(\boldsymbol{x})\succ 0\) , then \(\boldsymbol{x}\) is a local minimum;
  • If \(\nabla^2 L(\boldsymbol{x})\prec 0\), then \(\boldsymbol{x}\) is a local maximum;
  • If \(\nabla^2 L(\boldsymbol{x})\) has both positive and negative eigenvalues, then \(\boldsymbol{x}\) is a strict saddle point;
  • If \(\nabla^2 L(\boldsymbol{x})\succeq 0\) with some zero eigenvalues, then \(\boldsymbol{x}\) may be a local minimum or a flat saddle point.

Assume the loss function is strict saddle, i.e., it does not contain any flat saddle points (which is reasonable). We want to show that SGD can (efficiently) escape saddle points for these loss functions.

If $f$ is smooth, bounded, strict saddle, has a smooth Hessian, and the SGD noise has non-negligible variance in every direction with constant probability, then SGD escapes all saddle points and local maxima and converges to a local minimum after a polynomial number of steps.
We only give an outline of the proof here. For the current point $\boldsymbol{x_0}$, we discuss what happens with SGD.
  • \(\|\nabla f(\boldsymbol{x_0})\|\) is large. By smoothness, there exists some constant \(c>0\) such that $$\mathbb{E}\left[f(\boldsymbol{x_1})\right]\le f(\boldsymbol{x_0})-c.$$

  • Otherwise, \(\boldsymbol{x_0}\) is close to a stationary point.

    • If \(\boldsymbol{x_0}\) is close to a local minimum, then with high probability we will stay there.
    • Otherwise, \(\boldsymbol{x_0}\) is close to a strict saddle point, so there is a direction of negative curvature. A random perturbation (from the SGD noise) has a positive component along this direction with constant probability; following it escapes the saddle. Since the Hessian is smooth, the number of steps needed is polynomial.

Generalization

The data is sampled from a distribution \(\mathcal{D}\) on \(\mathcal{X}\times \mathcal{Y}\), which is unknown to the learner. The learner has access to a training set \(S\) of size \(m\), which is sampled from \(\mathcal{D}^m\). The learner can calculate the empirical loss on the training set:

\[L_S(h)=\frac{1}{m}\sum_{(x,y)\in S}l(h(x),y),\]

where \(l\) is the loss function.

In contrast, the generalization error of a classifier \(h:\mathcal{X}\to\mathcal{Y}\) is defined as the expected loss with respect to the distribution \(\mathcal{D}\):

\[L_\mathcal{D}(h)=\mathbb{E}_{(x,y)\sim \mathcal{D}}\left[l(h(x),y)\right],\]

where \(l\) is the loss function. For now, we assume that there exists a labeling function \(f:\mathcal{X}\to\mathcal{Y}\) such that \(L_\mathcal{D}(f)=0.\)

We will study how to bound the generalization error \(L_\mathcal{D}(h)\) based on the empirical loss \(L_S(h)\).

The No-Free-Lunch Theorem

First, we show that there is no universal learner that can perform well on all possible distributions. In other words, prior knowledge about \(\mathcal{D}\) is necessary for it to be learnable. This is known as the No-Free-Lunch theorem.

Let $A$ be any learning algorithm for the task of binary classification with respect to the $0-1$ loss function over a domain $\mathcal{X}$. Let $m$ be any number smaller than $|\mathcal{X}|/2$, representing a training set size. Then, there exists a distribution $\mathcal{D}$ over $\mathcal{X}\times \{0,1\}$ such that
  1. There exists a function \(f:\mathcal{X}\to\{0,1\}\) with \(L_\mathcal{D}(f)=0\);
  2. \[\mathbb{E}_{S\sim \mathcal{D}^m}\left[L_{\mathcal{D}}(A(S))\right]\ge\frac{1}{4}. \]

W.l.o.g., assume $|\mathcal{X}|=2m$. Note that there are $T=2^{2m}$ possible functions from $\mathcal{X}$ to $\{0,1\}$, denote them by $f_1,\ldots,f_T$. For each $i\in [T]$, let $\mathcal{D}_i$ be a distribution over $\mathcal{X}\times \{0,1\}$ such that $$ \mathcal{D}_i(\{x,y\})=\begin{cases}\frac{1}{|\mathcal{X}|} & \text{if } y=f_i(x)\\0 & \text{otherwise}\end{cases}. $$ Clearly, $L_{\mathcal{D}_i}(f_i)=0$.

We will show that for every algorithm \(A\), that receives a training set \(S\) of size \(m\) from \(\mathcal{X}\times \{0,1\}\) and returns a function \(A(S):\mathcal{X}\to\{0,1\}\), there exists a distribution \(\mathcal{D}_i\) such that \(\mathbb{E}_{S\sim \mathcal{D}_i^m}\left[L_{\mathcal{D}_i}(A(S))\right]\ge\frac{1}{4}.\) Equivalently, we will show that

\[\max_{i\in [T]}\mathbb{E}_{S\sim \mathcal{D}_i^m}\left[L_{\mathcal{D}_i}(A(S))\right]\ge\frac{1}{4}. \]

There are \(k=(2m)^m\) possible sequences of \(m\) examples from \(\mathcal{X}\), denote them by \(S_1,\ldots,S_k\). For each \(S_j=(x_1,\ldots,x_m)\), let \(S^i_j=\left((x_1,f_i(x_1)),\ldots,(x_m,f_i(x_m))\right)\). If the distribution is \(\mathcal{D}_i\), then \(S_1^i,\ldots,S_k^i\) are the possible training sets.

\[\begin{align*} \max_{i\in[T]}\mathbb{E}_{S\sim \mathcal{D}_i^m}\left[L_{\mathcal{D}_i}(A(S))\right] &= \max_{i\in[T]}\frac{1}{k}\sum_{j=1}^k L_{\mathcal{D}_i}\left(A(S_j^i)\right) \\ &\ge \frac{1}{T}\sum_{i=1}^T\frac{1}{k}\sum_{j=1}^k L_{\mathcal{D}_i}\left(A(S_j^i)\right) \\ &\ge \min_{j\in[k]}\frac{1}{T}\sum_{i=1}^T L_{\mathcal{D}_i}\left(A(S_j^i)\right) \\ \end{align*} \]

Now we fix some \(j\in[k]\). Denote \(S_j=(x_1,\ldots,x_m)\) and \(\{v_1,...,v_m\}=\mathcal{X}-S_j\). For each function \(h:\mathcal{X}\to\{0,1\}\) and \(i\in[T]\) we have

\[L_{\mathcal{D}_i}(h)=\frac{1}{2m}\sum_{x\in \mathcal{X}}\mathbb{I}\left[h(x)\neq f_i(x)\right]\ge \frac{1}{2m}\sum_{r=1}^m\mathbb{I}\left[h(v_r)\neq f_i(v_r)\right]. \]

Thus

\[\begin{align*} \frac{1}{T}\sum_{i=1}^T L_{\mathcal{D}_i}\left(A(S_j^i)\right)&\ge \frac{1}{T}\sum_{i=1}^T\frac{1}{2m}\sum_{r=1}^m\mathbb{I}\left[A(S_j^i)(v_r)\neq f_i(v_r)\right] \\ &= \frac{1}{2m}\sum_{r=1}^m\frac{1}{T}\sum_{i=1}^T\mathbb{I}\left[A(S_j^i)(v_r)\neq f_i(v_r)\right] \\ &\ge \frac{1}{2}\min_{r\in[m]}\frac{1}{T}\sum_{i=1}^T\mathbb{I}\left[A(S_j^i)(v_r)\neq f_i(v_r)\right]. \end{align*} \]

Now we fix some \(r\in[m]\). We can partition all the functions \(f_1,\ldots,f_T\) into \(\frac{T}{2}\) disjoint pairs, where for a pair \((f_i,f_{i'})\) we have that for every \(x\in\mathcal{X}\), \(f_i(x)\neq f_{i'}(x)\) if and only if \(x=v_r\). Since for such a pair we have that \(S_j^i=S_j^{i'}\), it follows that

\[\mathbb{I}\left[A(S_j^i)(v_r)\neq f_i(v_r)\right]+\mathbb{I}\left[A(S_j^{i'})(v_r)\neq f_{i'}(v_r)\right]= 1. \]

Immediately,

\[\frac{1}{T}\sum_{i=1}^T\mathbb{I}\left[A(S_j^i)(v_r)\neq f_i(v_r)\right]=\frac{1}{2}. \]

Thus

\[\min_{j\in[k]}\frac{1}{T}\sum_{i=1}^T L_{\mathcal{D}_i}\left(A(S_j^i)\right)\ge \frac{1}{4}. \]

Under the same assumptions, with probability of at least $\frac{1}{7}$ over the choice of $S\sim\mathcal{D}^m$, we have that $L_\mathcal{D}(A(S))\ge\frac{1}{8}$, i.e., $$ \mathrm{Pr}_{S\sim \mathcal{D}^m}\left[L_{\mathcal{D}}(A(S))\ge\frac{1}{8}\right]\ge\frac{1}{7}. $$
This follows by applying Markov's inequality (in its reverse form, using that the $0-1$ loss is bounded by $1$) to the theorem above.

PAC Learning

Empirical Risk Minimization (ERM) means that the learner chooses a hypothesis \(h\) that minimizes the empirical loss on the training set \(S\). But it may lead to overfitting. To avoid this, we may apply the ERM learning rule over a restricted search space. The PAC (Probably Approximately Correct) learning framework formalizes this idea.

Let $\mathcal{H}$ be a hypothesis class, the ERM learner over $\mathcal{H}$ is defined as $$ \mathrm{ERM}_\mathcal{H}(S)=\arg\min_{h\in \mathcal{H}}L_S(h). $$

We can decompose the generalization error of a hypothesis \(h\):

\[L_\mathcal{D}(h)=\epsilon_{\text{app}}+\epsilon_{\text{est}} \]

where \(\epsilon_{\text{app}}:=\min\limits_{h'\in\mathcal{H}}L_\mathcal{D}(h')\) is the approximation error and \(\epsilon_{\text{est}}:=L_\mathcal{D}(h)-\min\limits_{h'\in\mathcal{H}}L_\mathcal{D}(h')\) is the estimation error. The more we restrict the hypothesis class \(\mathcal{H}\), the larger the approximation error \(\epsilon_{\text{app}}\) is, but (maybe) the smaller the estimation error \(\epsilon_{\text{est}}\) is. The goal of PAC learning is to find a hypothesis class balancing these two errors.

Realizability assumption: there exists a hypothesis $h^*\in \mathcal{H}$ such that $L_\mathcal{D}(h^*)=0$.

i.e., \(\epsilon_{\text{app}}=0\).

A hypothesis class $\mathcal{H}$ is PAC learnable if there exists a function $m_\mathcal{H}:(0,1)^2\to\mathbb{N}$ and a learning algorithm such that: For every $\epsilon,\delta\in(0,1)$, for every distribution $\mathcal{D}$ over $\mathcal{X}\times\mathcal{Y}$, if the realizability assumption holds with respect to $\mathcal{H}$ and $\mathcal{D}$, then when running the learning algorithm on a training set $S$ of size $m\ge m_\mathcal{H}(\epsilon,\delta)$ sampled from $\mathcal{D}$, the algorithm returns a hypothesis $h$ such that, with probability of at least $1-\delta$ (over the choice of $S$), $$ L_\mathcal{D}(h)\le \epsilon. $$

We will show that if \(\mathcal{H}\) is finite, then it is PAC learnable by ERM, and the sample complexity \(m_\mathcal{H}(\epsilon,\delta)\le\log(|\mathcal{H}|/\delta)/\epsilon\).

Let $\mathcal{H}$ be a finite hypothesis class. Let $\delta\in(0,1), \epsilon>0$ and $m\ge \log(|\mathcal{H}|/\delta)/\epsilon$. For any distribution $\mathcal{D}$ over $\mathcal{X}\times \mathcal{Y}$ such that the realizability assumption holds with respect to $\mathcal{H}$ and $\mathcal{D}$, with probability of at least $1-\delta$ over the choice of $S\sim \mathcal{D}^m$, we have that for every ERM hypothesis $h_S$, it holds that $$ L_\mathcal{D}(h_S)\le \epsilon. $$
We would like to upper bound $$\mathcal{D}^m\left(\left\{S:L_\mathcal{D}(h_S)>\epsilon\right\}\right).$$ Let $\mathcal{H}_B$ be the set of "bad" hypotheses and $M$ be the set of misleading samples, i.e., $$ \mathcal{H}_B=\left\{h\in \mathcal{H}:L_\mathcal{D}(h)>\epsilon\right\},M=\left\{S:\exists h\in\mathcal{H}_B,L_S(h)=0\right\}. $$ By realizability assumption, $$\left\{S:L_\mathcal{D}(h_S)>\epsilon\right\}\subseteq M.$$
$$\mathcal{D}\left(\bigcup_{i=1}^{+\infty}A_i\right)\le \sum_{i=1}^{+\infty}\mathcal{D}(A_i)$$
Combined with the lemma above, we have (where $f$ is the labeling function): $$ \begin{align*} \mathcal{D}^m\left(\left\{S:L_\mathcal{D}(h_S)>\epsilon\right\}\right)&\le \mathcal{D}^m(M)\\ &=\mathcal{D}^m\left(\bigcup_{h\in\mathcal{H}_B}\left\{S:L_S(h)=0\right\}\right)\\ &\le \sum_{h\in\mathcal{H}_B}\mathcal{D}^m\left(\left\{S:L_S(h)=0\right\}\right)\\ &=\sum_{h\in\mathcal{H}_B}\left(\mathrm{Pr}_{(x,y)\sim\mathcal{D}}\left[h(x)=y\right]\right)^m\\ &=\sum_{h\in\mathcal{H}_B}(1-L_\mathcal{D}(h))^m\\ &\le \sum_{h\in\mathcal{H}_B}(1-\epsilon)^m\\ &\le |\mathcal{H}_B|e^{-\epsilon m}\\ &\le |\mathcal{H}|e^{-\epsilon m}\\ &\le \delta. \end{align*} $$
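As a quick sanity check of the bound, with hypothetical numbers: for \(|\mathcal{H}|=10^6\), \(\delta=0.01\), and \(\epsilon=0.05\),

\[m_\mathcal{H}(\epsilon,\delta)\le \frac{\log(|\mathcal{H}|/\delta)}{\epsilon}=\frac{\ln 10^8}{0.05}\approx \frac{18.4}{0.05}\approx 369,\]

so a few hundred samples already guarantee \(L_\mathcal{D}(h_S)\le 0.05\) with probability \(0.99\).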

What about infinite hypothesis class? We will show the class of threshold functions is PAC learnable, also by ERM. (A function \(f:\mathbb{R}\to\{0,1\}\) is a threshold function if there exists a threshold \(t\) such that \(f(x)=\mathbb{I}(x< t)\) for all \(x\in\mathbb{R}\).)

Let $\mathcal{H}$ be the class of threshold functions. Then $\mathcal{H}$ is PAC learnable, using the ERM learning rule, with sample complexity $m_\mathcal{H}(\epsilon,\delta)\le \lceil\log(2/\delta)\rceil/\epsilon$.
Let $a^*$ be a threshold such that the hypothesis $h^*=\mathbb{I}(x < a^*)$ achieves $L_\mathcal{D}(h^*)=0$. Let $a_0 < a^* < a_1$ be such that $$\text{Pr}_{(x,y)\sim \mathcal{D}}\left(x\in(a_0,a^*)\right)=\text{Pr}_{(x,y)\sim \mathcal{D}}\left(x\in(a^*,a_1)\right)=\epsilon.$$

Given a training set \(S\), let \(b_0=\max\{x:(x,1)\in S\}\) and \(b_1=\min\{x:(x,0)\in S\}\). Let \(b_S\) be a threshold corresponding to an ERM hypothesis \(h_S\), which implies that \(b_S\in (b_0,b_1]\). Therefore, a sufficient condition for \(L_\mathcal{D}(h_S)\le \epsilon\) is that both \(b_0\ge a_0\) and \(b_1\le a_1\).

Thus

\[\begin{align*} \text{Pr}_{S\sim \mathcal{D}^m}\left(L_\mathcal{D}(h_S)>\epsilon\right) &= \text{Pr}_{S\sim \mathcal{D}^m}\left(b_0< a_0\text{ or }b_1> a_1\right)\\ &\le \text{Pr}_{S\sim \mathcal{D}^m}\left(b_0< a_0\right)+\text{Pr}_{S\sim \mathcal{D}^m}\left(b_1> a_1\right)\\ &= \text{Pr}_{S\sim \mathcal{D}^m}\left(\forall (x,y)\in S, x\notin(a_0,a^*)\right)+\text{Pr}_{S\sim \mathcal{D}^m}\left(\forall (x,y)\in S, x\notin(a^*,a_1)\right)\\ &= (1-\epsilon)^m+(1-\epsilon)^m\\ &\le 2e^{-\epsilon m}\\ &\le \delta. \end{align*} \]

In practice, the realizability assumption may not hold. In this case, we can replace PAC learning by agnostic PAC learning.

A hypothesis class $\mathcal{H}$ is agnostic PAC learnable if there exists a function $m_\mathcal{H}:(0,1)^2\to\mathbb{N}$ and a learning algorithm such that: For every $\epsilon,\delta\in(0,1)$, for every distribution $\mathcal{D}$ over $\mathcal{X}\times\mathcal{Y}$, when running the learning algorithm on a training set $S$ of size $m\ge m_\mathcal{H}(\epsilon,\delta)$ sampled from $\mathcal{D}$, the algorithm returns a hypothesis $h$ such that, with probability of at least $1-\delta$ (over the choice of $S$), $$ L_\mathcal{D}(h)\le \min_{h'\in\mathcal{H}}L_\mathcal{D}(h')+\epsilon. $$

VC Dimension

We have shown that threshold functions are PAC learnable. What about other infinite hypothesis classes? In general, the VC dimension is the tool for this analysis.

The intuition: recall that in the No-Free-Lunch theorem, the adversary used a finite set \(\mathcal{C}\subseteq \mathcal{X}\) and concentrated the distribution on examples in \(\mathcal{C}\). Thus we should study the behavior of \(\mathcal{H}\) on such finite sets \(\mathcal{C}\).

Let $\mathcal{H}$ be a hypothesis class and $\mathcal{C}=\{c_1,\ldots,c_m\}\subseteq \mathcal{X}$ be a finite subset. The restriction of $\mathcal{H}$ to $\mathcal{C}$ is the set of functions from $\mathcal{C}$ to $\mathcal{Y}$ that can be derived from $\mathcal{H}$, that is, $$ \mathcal{H}_\mathcal{C}=\{(h(c_1),\ldots,h(c_m)):h\in \mathcal{H}\}, $$ where we represent a function from $\mathcal{C}$ to $\mathcal{Y}$ by a vector in $\mathcal{Y}^{|\mathcal{C}|}$.
A hypothesis class $\mathcal{H}$ shatters a finite subset $\mathcal{C}\subseteq \mathcal{X}$ if the restriction of $\mathcal{H}$ to $\mathcal{C}$ contains all the possible functions from $\mathcal{C}$ to $\mathcal{Y}$, i.e., $|\mathcal{H}_\mathcal{C}|=|\mathcal{Y}|^{|\mathcal{C}|}$.

Recalling the proof of the No-Free-Lunch theorem, we have

Let $\mathcal{H}$ be a hypothesis class of functions from $\mathcal{X}$ to $\{0,1\}$ and $m$ be the training set size. Assume that there exists a set $\mathcal{C}\subseteq \mathcal{X}$ of size $2m$ that is shattered by $\mathcal{H}$. Then, for any learning algorithm $A$, there exists a distribution $\mathcal{D}$ over $\mathcal{X}\times \{0,1\}$ and a hypothesis $h\in \mathcal{H}$ such that $L_\mathcal{D}(h)=0$ but with probability of at least $\frac{1}{7}$ over the choice of $S\sim \mathcal{D}^m$, we have that $L_\mathcal{D}(A(S))\ge \frac{1}{8}$.

This leads us directly to the definition of the VC dimension.

The VC dimension of a hypothesis class $\mathcal{H}$, denoted by $\mathrm{VCdim}(\mathcal{H})$, is the size of the largest finite subset of $\mathcal{X}$ that is shattered by $\mathcal{H}$. If $\mathcal{H}$ can shatter arbitrarily large finite subsets of $\mathcal{X}$, then we say $\mathcal{H}$ has infinite VC dimension, i.e., $\mathrm{VCdim}(\mathcal{H})=+\infty$.

A direct consequence of the corollary above is therefore

Let $\mathcal{H}$ be a hypothesis class of infinite VC dimension. Then $\mathcal{H}$ is not PAC learnable.
Suppose $\mathcal{H}$ is PAC learnable. Let $M=m_\mathcal{H}(\frac{1}{8},\frac{1}{7})$. Since $\mathcal{H}$ has infinite VC dimension, there exists a set $\mathcal{C}\subseteq \mathcal{X}$ of size $2M$ that is shattered by $\mathcal{H}$. By the corollary above, there exists a distribution $\mathcal{D}$ over $\mathcal{X}\times \{0,1\}$ and a hypothesis $h\in \mathcal{H}$ such that $L_\mathcal{D}(h)=0$ but with probability of at least $\frac{1}{7}$ over the choice of $S\sim \mathcal{D}^M$, we have that $L_\mathcal{D}(A(S))\ge \frac{1}{8}$. This contradicts the definition of PAC learning.
Let $\mathcal{H}$ be a hypothesis class of functions from $\mathcal{X}$ to $\{0,1\}$ and let the loss function be the $0-1$ loss function. Assume that $\text{VCdim}(\mathcal{H})=d<+\infty$. Then there are absolute constants $c_1,c_2$ such that
  • \(\mathcal{H}\) is agnostic PAC learnable with sample complexity

\[c_1\frac{d+\log(1/\delta)}{\epsilon^2}\le m_\mathcal{H}(\epsilon,\delta)\le c_2\frac{d+\log(1/\delta)}{\epsilon^2}; \]

  • \(\mathcal{H}\) is PAC learnable with sample complexity

\[c_1\frac{d+\log(1/\delta)}{\epsilon}\le m_\mathcal{H}(\epsilon,\delta)\le c_2\frac{d\log(1/\epsilon)+\log(1/\delta)}{\epsilon}. \]

Supervised Learning

Linear Methods

Linear Regression

Let \(\mathcal{X}\subset\mathbb{R}^d\) and \(\mathcal{Y}=\mathbb{R}\). Given \(S=((x_1,y_1),...,(x_n,y_n))\), we would like to learn a linear function \(f:\mathbb{R}^d\to\mathbb{R}\) that best approximates the data in \(S\).

By definition, the hypothesis class of linear regression predictors is

\[L_d=\{x\mapsto w^\top x+b\mid w\in\mathbb{R}^d, b\in\mathbb{R}\}. \]

Clearly, \(b\) can be ignored, since we can always add one more dimension with a constant value of \(1\) to \(x\), so that \(w^\top x+b=[w,b]^\top[x,1]\).

With the squared loss function $$l(f,x_i,y_i)=(f(x_i)-y_i)^2,$$ the exact minimizer \(w^*\) is given by

\[w^*=(X^\top X)^{-1}X^\top Y \]

where \(X\) is the matrix with rows \(x_i^\top\) and \(Y\) is the vector with entries \(y_i\) (assuming \(X^\top X\) is invertible).

Since the loss function is convex, we can also use gradient descent to find \(w^*\).
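A minimal sketch of the closed-form fit, using the \([x,1]\) trick above (a least-squares solver is used instead of the explicit inverse, which is the standard numerically stable choice):

```python
import numpy as np

def fit_linear(X, Y):
    """Least-squares fit; appends a constant-1 feature so the bias b is absorbed into w."""
    X1 = np.hstack([X, np.ones((X.shape[0], 1))])   # the [x, 1] trick from above
    w, *_ = np.linalg.lstsq(X1, Y, rcond=None)      # solves the normal equations X^T X w = X^T Y
    return w                                        # last entry plays the role of b
```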

Perceptron

What if the task is not regression but classification? The first attempt is to let \(f(x)=\text{sign}(w^\top x)\), where \(\text{sign}(z)=1\) if \(z\geq 0\) and \(-1\) otherwise.

This is called a perceptron, which can be seen as a simple form of neural network. However, the sign function is not differentiable, so we cannot use gradient descent but instead use the perceptron algorithm.

Initialize $w=0$ (the convergence proof below assumes this).

Repeat until convergence (all examples are classified correctly):

  1. Pick a random example \((x,y)\) from \(S\).
  2. If \(y=1\) and \(w^\top x<0\), then \(w\leftarrow w+x\).
  3. Else if \(y=-1\) and \(w^\top x\ge 0\), then \(w\leftarrow w-x\).
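A minimal sketch of this loop (the mistake check matches the \(\text{sign}\) convention above; the iteration cap is a safety assumption not in the original algorithm):

```python
import numpy as np

def perceptron(X, Y, max_iters=10_000, seed=0):
    """Perceptron algorithm. X: (n, d) array, Y: labels in {-1, +1}.
    Terminates once every example is classified correctly (assumes separable data)."""
    rng = np.random.default_rng(seed)
    w = np.zeros(X.shape[1])                  # w = 0, matching the convergence proof below
    for _ in range(max_iters):                # safety cap, an assumption for illustration
        pred = np.where(X @ w >= 0, 1, -1)    # sign(z) = 1 if z >= 0, else -1
        errs = np.where(pred != Y)[0]         # currently misclassified examples
        if len(errs) == 0:
            return w
        i = rng.choice(errs)                  # pick a random mistake
        w = w + Y[i] * X[i]                   # covers both update rules: w <- w + y x
    return w
```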
Assume that there exists $w^*$ such that $\|w^*\|=1$, and $\exists \gamma>0$ such that $\forall i, y_i{w^{*}}^\top x_i\geq \gamma$. Moreover, assume $\|x_i\|\le R$ for all $i$. Then the perceptron algorithm converges in at most $\frac{R^2}{\gamma^2}$ iterations.
Consider an iteration where we find a mistake $(x_i,y_i)$; then $w_{t+1}=w_t+y_ix_i$. Therefore, $$w_{t+1}^\top w^*=w_t^\top w^*+y_ix_i^\top w^*\ge w_t^\top w^*+\gamma.$$

Since \(w\) starts at \(0\), induction gives \(w_{t+1}^\top w^*\ge t\gamma\) after \(t\) updates.

On the other hand, since \(y_ix_i^\top w_t\le 0\) whenever we update, we have $$\|w_{t+1}\|^2=\|w_t+y_ix_i\|^2=\|w_t\|^2+\|y_ix_i\|^2+2y_ix_i^\top w_t\le \|w_t\|^2+R^2.$$

Combining these two inequalities with Cauchy–Schwarz (\(t\gamma\le w_{t+1}^\top w^*\le\|w_{t+1}\|\,\|w^*\|=\|w_{t+1}\|\)) gives

\[t^2\gamma^2\le \|w_{t+1}\|^2\le tR^2. \]

So \(t\le \frac{R^2}{\gamma^2}\), which proves the theorem.

Logistic Regression

Rather than using the sign function, we can output a probability \(f(x)\in[0,1]\), interpreted as the probability that \(y=-1\) given \(x\). Equivalently, we predict a distribution \((f(x), 1-f(x))\) over the two classes, and the target distribution is \((1,0)\) if \(y=-1\) and \((0,1)\) if \(y=1\). Finally, we can use L1, cross-entropy, or other loss functions. This approach is called logistic regression.

The hypothesis class associated with logistic regression is the composition of a logistic function over the class of linear functions, where the logistic function is generally the sigmoid function \(\sigma_{\text{sig}}:\mathbb{R}\to[0,1]\) defined as

\[\sigma_{\text{sig}}(z)=\frac{1}{1+e^{-z}}. \]

Cross-entropy is a widely used way to compute loss between two distributions. Given a distribution \(p\) and a target distribution \(y\), the cross-entropy is defined as

\[XE(y,p)=-\sum_i y_i\log(p_i). \]

In particular, for binary classification with the label encoded as \(y\in\{0,1\}\), we have

\[XE(y,f(x))=-y\log(f(x))-(1-y)\log(1-f(x)). \]
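
A minimal numpy sketch of the logistic predictor and its cross-entropy loss, with the label encoded as \(y\in\{0,1\}\) (the clamping constant is an illustrative guard against \(\log 0\)):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def binary_cross_entropy(w, X, y):
    """Average cross-entropy of p = sigmoid(Xw) against labels y in {0, 1}."""
    p = sigmoid(X @ w)                # predicted probability of class 1
    eps = 1e-12                       # numerical guard against log(0)
    return -np.mean(y * np.log(p + eps) + (1 - y) * np.log(1 - p + eps))
```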

Ridge Regression

To avoid overfitting, we can add a regularization term to the loss function. For example, in linear regression, we can use the following loss function:

\[L(w,S)=\frac{1}{2n}\sum_{i=1}^n (w^\top x_i-y_i)^2+\frac{\lambda}{2} \|w\|_2^2, \]

where \(\lambda>0\) is a hyperparameter that controls the strength of regularization. The second term is called L2 regularization or ridge regularization. From the PAC learning perspective, it restricts the hypothesis class to \(\|w\|_2^2\le c\) for some \(c>0\).

Its gradient and Hessian are given by

\[\begin{align*} \nabla L&= \frac{1}{n}\sum_{i=1}^n (w^\top x_i-y_i)x_i+\lambda w,\\ H&=\frac{1}{n}\sum_{i=1}^n x_ix_i^\top+\lambda I. \end{align*} \]

Note that it is \(\lambda\)-strongly convex, so GD converges efficiently. In practice, the update step is divided into two parts:

  1. \(\widehat{w_{t+1}}=w_t-\frac{\eta}{n}\sum_{i=1}^n (w_t^\top x_i-y_i)x_i\) (GD)
  2. \(w_{t+1}=(1-\eta\lambda)\widehat{w_{t+1}}\) (weight decay)

and \(\lambda\) is chosen by validation.
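
A minimal numpy sketch of one such update (step size and shapes are illustrative assumptions):

```python
import numpy as np

def ridge_gd_step(w, X, Y, eta, lam):
    """One update: a plain GD step on the squared loss, then weight decay."""
    n = X.shape[0]
    w_hat = w - (eta / n) * X.T @ (X @ w - Y)   # step 1: gradient step
    return (1 - eta * lam) * w_hat              # step 2: shrink toward zero
```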

Lasso Regression

In some scenarios, we want to restrict the number of non-zero entries in \(w\), which is called feature selection. Unfortunately, \(\|w\|_0\) is not convex, so we use L1 regularization instead. Similar to ridge regression, the loss function is given by

\[L(w,S)=\frac{1}{2n}\sum_{i=1}^n (w^\top x_i-y_i)^2+\lambda \|w\|_1 \]

where \(\|w\|_1=\sum_{j=1}^d |w_j|\). This is called Lasso.

Now the objective is convex. However, it is not differentiable wherever some coordinate of \(w\) is zero; away from such points,

\[\nabla L=\frac{1}{n}\sum_{i=1}^n (w^\top x_i-y_i)x_i+\lambda\, \text{sign}(w). \]

In practice, we do the following update:

  1. \(\widehat{w_{t+1}}=w_t-\frac{\eta}{n}\sum_{i=1}^n (w_t^\top x_i-y_i)x_i\)
  2. For every coordinate \(i\):
    • If \(\widehat{w_{t+1}}_i> \eta\lambda\), then \(w_{t+1,i}=\widehat{w_{t+1}}_i-\eta\lambda\).
    • Else if \(\widehat{w_{t+1}}_i< -\eta\lambda\), then \(w_{t+1,i}=\widehat{w_{t+1}}_i+\eta\lambda\).
    • Else \(w_{t+1,i}=0\).
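
The coordinate-wise rule above is exactly the soft-thresholding operator, i.e., the proximal map of \(\eta\lambda\|\cdot\|_1\), so the whole update is proximal gradient descent (ISTA). A minimal numpy sketch:

```python
import numpy as np

def soft_threshold(v, tau):
    """Elementwise soft-thresholding: the proximal operator of tau * ||.||_1."""
    return np.sign(v) * np.maximum(np.abs(v) - tau, 0.0)

def lasso_step(w, X, Y, eta, lam):
    """One ISTA update matching the two-part rule above."""
    n = X.shape[0]
    w_hat = w - (eta / n) * X.T @ (X @ w - Y)   # gradient step on the squared loss
    return soft_threshold(w_hat, eta * lam)     # proximal step on the L1 term
```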

Compressed Sensing

Let \(\boldsymbol{x}\in\mathbb{R}^d\) be a "compressible" vector. We may pick a non-adaptive matrix \(A\in\mathbb{R}^{n\times d}\), where \(n\ll d\), as the measurement matrix, and obtain the measurement vector \(\boldsymbol{y}=A\boldsymbol{x}\).

The question is how to choose \(A\) such that we can recover \(\boldsymbol{x}\) from \(\boldsymbol{y}\).

We say that $A\in\mathbb{R}^{n\times d}$ is $(\epsilon,s)$-RIP, if for all $\boldsymbol{x}\neq 0$ such that $\|\boldsymbol{x}\|_0\le s$, we have $$(1-\epsilon)\|\boldsymbol{x}\|_2^2\le \|A\boldsymbol{x}\|_2^2\le (1+\epsilon)\|\boldsymbol{x}\|_2^2.$$
Let $\epsilon< 1$ and let $A$ be a $(\epsilon,2s)$-RIP matrix. Let $\boldsymbol{x}$ be a vector such that $\|\boldsymbol{x}\|_0\le s$, $\boldsymbol{y}=A\boldsymbol{x}$ be the compression of $\boldsymbol{x}$, and $$\boldsymbol{\tilde{x}}\in \arg\min_{\boldsymbol{z}:A\boldsymbol{z}=\boldsymbol{y}}\|\boldsymbol{z}\|_0$$ be a reconstruction of $\boldsymbol{x}$ from $\boldsymbol{y}$. Then $\boldsymbol{\tilde{x}}=\boldsymbol{x}$.
Clearly we have $\|\boldsymbol{\tilde{x}}\|_0\le \|\boldsymbol{x}\|_0\le s$, so $\|\boldsymbol{\tilde{x}}-\boldsymbol{x}\|_0\le 2s$. Applying the RIP property on $\boldsymbol{\tilde{x}}-\boldsymbol{x}$ gives $$(1-\epsilon)\|\boldsymbol{\tilde{x}}-\boldsymbol{x}\|_2^2\le \|A(\boldsymbol{\tilde{x}}-\boldsymbol{x})\|_2^2\le (1+\epsilon)\|\boldsymbol{\tilde{x}}-\boldsymbol{x}\|_2^2.$$

Since \(A(\boldsymbol{\tilde{x}}-\boldsymbol{x})=0\), we have \(\|\boldsymbol{\tilde{x}}-\boldsymbol{x}\|_2^2=0\), which implies \(\boldsymbol{\tilde{x}}=\boldsymbol{x}\).

However, \(\arg\min_{\boldsymbol{z}:A\boldsymbol{z}=\boldsymbol{y}}\|\boldsymbol{z}\|_0\) is hard to optimize. Instead, we can use L1 minimization.
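
The L1 problem \(\min_{\boldsymbol{z}:A\boldsymbol{z}=\boldsymbol{y}}\|\boldsymbol{z}\|_1\) (basis pursuit) is a linear program: writing \(\boldsymbol{z}=\boldsymbol{u}-\boldsymbol{v}\) with \(\boldsymbol{u},\boldsymbol{v}\ge 0\), we minimize \(\sum_i(u_i+v_i)\) subject to \(A\boldsymbol{u}-A\boldsymbol{v}=\boldsymbol{y}\). A sketch using scipy's LP solver:

```python
import numpy as np
from scipy.optimize import linprog

def basis_pursuit(A, y):
    """Solve min ||z||_1 s.t. Az = y as an LP with z = u - v, u, v >= 0."""
    n, d = A.shape
    c = np.ones(2 * d)                  # objective: sum(u) + sum(v) = ||z||_1
    A_eq = np.hstack([A, -A])           # equality constraint: A u - A v = y
    res = linprog(c, A_eq=A_eq, b_eq=y, bounds=[(0, None)] * (2 * d))
    u, v = res.x[:d], res.x[d:]
    return u - v
```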

For simplicity, we will use the following notation below: Given a vector \(\boldsymbol{v}\) and a set of indices \(I\), we denote \(\boldsymbol{v_I}\) as the vector obtained by keeping only the entries of \(\boldsymbol{v}\) indexed by \(I\) and setting the rest to \(0\).

Let $A$ be a $(\epsilon,2s)$-RIP matrix. For any disjoint sets $I,J$ such that $|I|,|J|\le s$, for any vector $\boldsymbol{u}$, we have $$(A\boldsymbol{u_I})^\top(A\boldsymbol{u_J})\le \epsilon\|\boldsymbol{u_I}\|_2\|\boldsymbol{u_J}\|_2.$$
$$ \begin{align*} (A\boldsymbol{u_I})^\top(A\boldsymbol{u_J})&=\frac{1}{4}\left(\|A\boldsymbol{u_I}+A\boldsymbol{u_J}\|^2-\|A\boldsymbol{u_I}-A\boldsymbol{u_J}\|^2\right)\\ &\le\frac{1}{4}\left((1+\epsilon)\|\boldsymbol{u_I}+\boldsymbol{u_J}\|^2-(1-\epsilon)\|\boldsymbol{u_I}-\boldsymbol{u_J}\|^2\right) \tag{RIP}\\ &=\frac{\epsilon}{2}(\|\boldsymbol{u_I}\|^2+\|\boldsymbol{u_J}\|^2) \tag{$\boldsymbol{u_I}^\top\boldsymbol{u_J}=0$}\\ &\le\epsilon\|\boldsymbol{u_I}\|_2\|\boldsymbol{u_J}\|_2. \end{align*} $$
Let $\epsilon< \frac{1}{1+\sqrt{2}}$ and let $A$ be a $(\epsilon,2s)$-RIP matrix. Let $\boldsymbol{x}$ be an arbitrary vector and denote $\boldsymbol{x_s}\in \arg\min_{\boldsymbol{z}:\|\boldsymbol{z}\|_0\le s}\|\boldsymbol{x}-\boldsymbol{z}\|_1$. Let $\boldsymbol{y}=A\boldsymbol{x}$ be the compression of $\boldsymbol{x}$, and $$\boldsymbol{\tilde{x}}\in \arg\min_{\boldsymbol{z}:A\boldsymbol{z}=\boldsymbol{y}}\|\boldsymbol{z}\|_1$$ be a reconstruction of $\boldsymbol{x}$ from $\boldsymbol{y}$. Then $$\|\boldsymbol{\tilde{x}}-\boldsymbol{x}\|_2\le \frac{2(1+\rho)s^{-\frac{1}{2}}}{1-\rho}\|\boldsymbol{x}-\boldsymbol{x}_s\|_1,$$ where $\rho=\frac{\sqrt{2}\epsilon}{1-\epsilon}$.
Let $T_0$ be the set of indices of the $s$ largest entries (absolute value) of $\boldsymbol{x}$, and $T_0^c=[d]-T_0$. Let $T_1$ be the set of indices of the $s$ largest entries (absolute value) of $\boldsymbol{x_{T_0^c}}$, and $T_{0,1}=T_0\cup T_1,T_{0,1}^c=[d]-T_{0,1}$, and so on. Finally, we partition indices as $[d]=T_0\cup T_1\cup \ldots$.

Let \(\boldsymbol{h}=\boldsymbol{\tilde{x}}-\boldsymbol{x}\). We want to bound

\[\|\boldsymbol{h}\|_2=\left\|\boldsymbol{h_{T_{0,1}}}+\boldsymbol{h_{T_{0,1}^c}}\right\|_2\le\left\|\boldsymbol{h_{T_{0,1}}}\right\|_2+\left\|\boldsymbol{h_{T_{0,1}^c}}\right\|_2. \]

First, let's consider \(\left\|\boldsymbol{h_{T_{0,1}^c}}\right\|_2\).

$$\left\|\boldsymbol{h_{T_{0,1}^c}}\right\|_2\le \left\|\boldsymbol{h_{T_{0}}}\right\|_2+2s^{-\frac{1}{2}}\|\boldsymbol{x}-\boldsymbol{x_s}\|_1.$$
For any $j>1$, $\forall i\in T_j,i'\in T_{j-1}$, we have $|h_i|\le |h_{i'}|$. So $$\left\|\boldsymbol{h_{T_j}}\right\|_{\infty}\le \frac{1}{s}\left\|\boldsymbol{h_{T_{j-1}}}\right\|_1.$$

By triangle inequality, we have

\[\left\|\boldsymbol{h_{T_{0,1}^c}}\right\|_2\le \sum_{j\ge 2}\left\|\boldsymbol{h_{T_j}}\right\|_2\le \sum_{j\ge 2}s^{\frac{1}{2}}\left\|\boldsymbol{h_{T_j}}\right\|_{\infty}\le s^{-\frac{1}{2}}\sum_{j\ge 2}\left\|\boldsymbol{h_{T_{j-1}}}\right\|_1=s^{-\frac{1}{2}}\left\|\boldsymbol{h_{T_0^c}}\right\|_1. \]

Since \(\boldsymbol{\tilde{x}}=\boldsymbol{x}+\boldsymbol{h}\) has minimal L1 norm,

\[\begin{align*} \|\boldsymbol{x_{T_0}}\|_1+\|\boldsymbol{x_{T_0^c}}\|_1&=\|\boldsymbol{x}\|_1\\ &\ge\|\boldsymbol{\tilde{x}}\|_1\\ &=\|\boldsymbol{\tilde{x}_{T_0}}\|_1+\|\boldsymbol{\tilde{x}_{T_0^c}}\|_1\\ &\ge \|\boldsymbol{x_{T_0}}\|_1-\left\|\boldsymbol{h_{T_0}}\right\|_1+\left\|\boldsymbol{h_{T_0^c}}\right\|_1-\|\boldsymbol{x_{T_0^c}}\|_1. \end{align*} \]

This immediately gives

\[\left\|\boldsymbol{h_{T_0^c}}\right\|_1\le\left\|\boldsymbol{h_{T_0}}\right\|_1+2\|\boldsymbol{x_{T_0^c}}\|_1. \]

Combining these two inequalities (together with \(\left\|\boldsymbol{h_{T_0}}\right\|_1\le\sqrt{s}\left\|\boldsymbol{h_{T_0}}\right\|_2\)) proves the claim:

\[\left\|\boldsymbol{h_{T_{0,1}^c}}\right\|_2\le s^{-\frac{1}{2}}\left\|\boldsymbol{h_{T_0^c}}\right\|_1\le s^{-\frac{1}{2}}\left(\left\|\boldsymbol{h_{T_0}}\right\|_1+2\|\boldsymbol{x_{T_0^c}}\|_1\right)\le \left\|\boldsymbol{h_{T_{0}}}\right\|_2+2s^{-\frac{1}{2}}\|\boldsymbol{x}-\boldsymbol{x_s}\|_1. \]

Next, let's consider \(\left\|\boldsymbol{h_{T_{0,1}}}\right\|_2\).

$$\left\|\boldsymbol{h_{T_{0,1}}}\right\|_2\le \frac{2\rho}{1-\rho}s^{-\frac{1}{2}}\|\boldsymbol{x}-\boldsymbol{x_s}\|_1,$$ where $\rho=\frac{\sqrt{2}\epsilon}{1-\epsilon}$.
The RIP condition gives $(1-\epsilon)\left\|\boldsymbol{h_{T_{0,1}}}\right\|_2^2\le \left\|A\boldsymbol{h_{T_{0,1}}}\right\|_2^2.$

Since \(A\boldsymbol{h_{T_{0,1}}}=A\boldsymbol{h}-A\boldsymbol{h_{T_{0,1}^c}}=-\sum_{j\ge 2}A\boldsymbol{h_{T_j}}\), we have

\[\begin{align*} \left\|A\boldsymbol{h_{T_{0,1}}}\right\|_2^2&=-\sum_{j\ge 2}(A\boldsymbol{h_{T_{0,1}}})^\top A\boldsymbol{h_{T_j}}\\ &=-\sum_{j\ge 2}(A\boldsymbol{h_{T_{0}}}+A\boldsymbol{h_{T_{1}}})^\top A\boldsymbol{h_{T_j}}\\ &\le \sum_{j\ge 2}\epsilon\left(\left\|\boldsymbol{h_{T_{0}}}\right\|_2+\left\|\boldsymbol{h_{T_{1}}}\right\|_2\right)\left\|\boldsymbol{h_{T_j}}\right\|_2 \tag{Lemma}\\ &\le \sqrt{2}\epsilon\left\|\boldsymbol{h_{T_{0,1}}}\right\|_2 \sum_{j\ge 2}\left\|\boldsymbol{h_{T_j}}\right\|_2, \end{align*} \]

where the last step uses \(\left\|\boldsymbol{h_{T_0}}\right\|_2+\left\|\boldsymbol{h_{T_1}}\right\|_2\le\sqrt{2}\left\|\boldsymbol{h_{T_{0,1}}}\right\|_2\).

So

\[\left\|\boldsymbol{h_{T_{0,1}}}\right\|_2\le\frac{\sqrt{2}\epsilon}{1-\epsilon}\sum_{j\ge 2}\left\|\boldsymbol{h_{T_j}}\right\|_2=\rho \sum_{j\ge 2}\left\|\boldsymbol{h_{T_j}}\right\|_2. \]

From the first claim, we know

\[\sum_{j\ge 2}\left\|\boldsymbol{h_{T_j}}\right\|_2\le s^{-\frac{1}{2}}\left\|\boldsymbol{h_{T_0^c }}\right\|_1\le s^{-\frac{1}{2}}\left(\left\|\boldsymbol{h_{T_0}}\right\|_1+2\left\|\boldsymbol{x_{T_0^c}}\right\|_1\right). \]

Thus

\[\left\|\boldsymbol{h_{T_{0,1}}}\right\|_2\le\rho s^{-\frac{1}{2}}\left(\left\|\boldsymbol{h_{T_0}}\right\|_1+2\left\|\boldsymbol{x_{T_0^c}}\right\|_1\right)\le \rho \left\|\boldsymbol{h_{T_0}}\right\|_2+2\rho s^{-\frac{1}{2}}\left\|\boldsymbol{x_{T_0^c}}\right\|_1. \]

Since \(\left\|\boldsymbol{h_{T_0}}\right\|_2\le \left\|\boldsymbol{h_{T_{0,1}}}\right\|_2\), finally we have

\[\left\|\boldsymbol{h_{T_{0,1}}}\right\|_2\le \frac{2\rho}{1-\rho}s^{-\frac{1}{2}}\|\boldsymbol{x_{T_0^c}}\|_1. \]

Combining the two claims immediately proves the theorem:

\[\begin{align*} \|\boldsymbol{h}\|_2&\le\left\|\boldsymbol{h_{T_{0,1}}}\right\|_2+\left\|\boldsymbol{h_{T_{0,1}^c}}\right\|_2\\ &\le \frac{2\rho}{1-\rho}s^{-\frac{1}{2}}\|\boldsymbol{x}-\boldsymbol{x_s}\|_1+2s^{-\frac{1}{2}}\|\boldsymbol{x}-\boldsymbol{x_s}\|_1\\ &\le \frac{2(1+\rho)s^{-\frac{1}{2}}}{1-\rho}\|\boldsymbol{x}-\boldsymbol{x_s}\|_1. \end{align*} \]

If \(\boldsymbol{x}\) is sparse, then \(\boldsymbol{\tilde{x}}\approx\boldsymbol{x}\):

If $\|\boldsymbol{x}\|_0\le s$, then $\boldsymbol{x}=\boldsymbol{x_s}$, so $\boldsymbol{\tilde{x}}=\boldsymbol{x}$.

How can we find a RIP matrix? One way is to use random matrices.

Let $\epsilon,\delta$ be scalars in $(0,1)$. Let $s$ be an integer in $[d]$ and $n$ be an integer that satisfies $$n\ge 216\frac{s\log\left(\frac{120d}{\delta\epsilon}\right)}{\epsilon^2}.$$ Let $A\in\mathbb{R}^{n\times d}$ be a matrix such that each entry is independently sampled from a Gaussian distribution $\mathcal{N}\left(0,\frac{1}{n}\right)$. Then with probability at least $1-\delta$, $A$ is $(\epsilon,s)$-RIP.
Let $\epsilon\in(0,1)$. There exists a finite set $Q\subset\mathbb{R}^d$ of size $|Q|\le \left(\frac{5}{\epsilon}\right)^d$ such that $$\sup_{\boldsymbol{x}:\|\boldsymbol{x}\|\le 1}\min_{\boldsymbol{v}\in Q}\|\boldsymbol{x}-\boldsymbol{v}\|\le \epsilon.$$
Construct $Q$ greedily as follows: pick any vector $\boldsymbol{x}$ such that $\|\boldsymbol{x}\|\le 1$ and $\forall \boldsymbol{y}\in Q,\|\boldsymbol{x}-\boldsymbol{y}\|>\epsilon$, and add it to $Q$. Repeat until no such vector exists. This terminates after finitely many steps (by the volume argument below), and by construction, for all $\boldsymbol{x}$ with $\|\boldsymbol{x}\|\le 1$, $$\min_{\boldsymbol{v}\in Q}\|\boldsymbol{x}-\boldsymbol{v}\|\le \epsilon.$$

Denote \(B(\boldsymbol{x},\epsilon)=\{\boldsymbol{y}\in\mathbb{R}^d:\|\boldsymbol{x}-\boldsymbol{y}\|\le \epsilon\}\). Note that the balls \(B\left(\boldsymbol{x},\frac{\epsilon}{2}\right)\), \(\boldsymbol{x}\in Q\), are pairwise disjoint, and

\[\bigcup_{\boldsymbol{x}\in Q}B\left(\boldsymbol{x},\frac{\epsilon}{2}\right)\subseteq B\left(\boldsymbol{0},1+\frac{\epsilon}{2}\right). \]

Thus, we have

\[|Q|\le \frac{\text{vol}\left(B\left(1+\frac{\epsilon}{2}\right)\right)}{\text{vol}\left(B\left(\frac{\epsilon}{2}\right)\right)}= \frac{\left(1+\frac{\epsilon}{2}\right)^d}{\left(\frac{\epsilon}{2}\right)^d}=\left(\frac{2}{\epsilon}+1\right)^d\le \left(\frac{3}{\epsilon}\right)^d\le \left(\frac{5}{\epsilon}\right)^d \]

where \(\text{vol}(B)\) is the volume of \(B\).

Let $Q$ be a finite set of vectors in $\mathbb{R}^d$. Let $\delta\in(0,1)$ and $n$ be an integer such that $$\epsilon=\sqrt{\frac{6\ln\left(\frac{2|Q|}{\delta}\right)}{n}}\le 3.$$ Then, with probability at least $1-\delta$ over the choice of a random matrix $A\in\mathbb{R}^{n\times d}$ such that each entry is independently sampled from $\mathcal{N}(0,\frac{1}{n})$, we have $$\sup_{\boldsymbol{x}\in Q}\left|\frac{\|A\boldsymbol{x}\|^2}{\|\boldsymbol{x}\|^2}-1\right|<\epsilon.$$
See CMSC 35900 (Spring 2009) L2 Theorem 1.2.
Consider any $I\subseteq[d]$ of size $s$. Let $S$ be the span of $\{\boldsymbol{e_i}:i\in I\}$, which is a subspace of dimension $s$. For $\boldsymbol{x}\in S$, we can write $\boldsymbol{x}=U_I\boldsymbol{a}$, where $U_I$ is the matrix with columns $e_i$ for $i\in I$, and $\boldsymbol{a}\in\mathbb{R}^s$. W.l.o.g., assume $\|\boldsymbol{a}\|=1$.

By the covering lemma above (applied with accuracy \(\frac{\epsilon}{4}\)), there exists a set \(Q\) of size \(|Q|\le \left(\frac{20}{\epsilon}\right)^s\) such that

\[\sup_{\boldsymbol{a}:\|\boldsymbol{a}\|=1}\min_{\boldsymbol{v}\in Q}\|\boldsymbol{a}-\boldsymbol{v}\|\le \frac{\epsilon}{4}. \]

By the definition of \(U_I\), this immediately gives

\[\sup_{\boldsymbol{x}\in S}\min_{\boldsymbol{v}\in Q}\|\boldsymbol{x}-U_I\boldsymbol{v}\|\le \frac{\epsilon}{4}. \]

Apply the JL lemma with accuracy \(\frac{\epsilon}{2}\) on \(\{U_I\boldsymbol{v}:\boldsymbol{v}\in Q\}\); for \(n\ge \frac{4\cdot 6\ln\left(\frac{2|Q|}{\delta}\right)}{\epsilon^2}=24\frac{\ln\left(\frac{2}{\delta}\right)+s\ln\left(\frac{20}{\epsilon}\right)}{\epsilon^2}\), with probability at least \(1-\delta\), we have

\[\sup_{\boldsymbol{v}\in Q}\left|\frac{\|AU_I\boldsymbol{v}\|^2}{\|U_I\boldsymbol{v}\|^2}-1\right|<\frac{\epsilon}{2}. \]

It implies that

\[\sup_{\boldsymbol{v}\in Q}\frac{\|AU_I\boldsymbol{v}\|}{\|U_I\boldsymbol{v}\|}<1+\frac{\epsilon}{2} \]

and

\[\inf_{\boldsymbol{v}\in Q}\frac{\|AU_I\boldsymbol{v}\|}{\|U_I\boldsymbol{v}\|}>1-0.3\epsilon. \]

Let \(a=\sup_{\boldsymbol{x}\in S}\frac{\|A\boldsymbol{x}\|}{\|\boldsymbol{x}\|}-1\). For any \(\boldsymbol{x}\in S\) with \(\|\boldsymbol{x}\|=1\), there exists \(\boldsymbol{v}\in Q\) such that \(\|\boldsymbol{x}-U_I\boldsymbol{v}\|\le\frac{\epsilon}{4}\), and

\[\begin{align*} \|A\boldsymbol{x}\|&\le \|AU_I\boldsymbol{v}\|+\|A(\boldsymbol{x}-U_I\boldsymbol{v})\|\\ &\le \left(1+\frac{\epsilon}{2}\right)\|U_I\boldsymbol{v}\|+(1+a)\|\boldsymbol{x}-U_I\boldsymbol{v}\|\\ &\le \left(1+\frac{\epsilon}{2}\right)+\frac{(1+a)\epsilon}{4}. \end{align*} \]

Taking the supremum over \(\boldsymbol{x}\) gives \(a+1\le \left(1+\frac{\epsilon}{2}\right)+\frac{(1+a)\epsilon}{4}\); solving for \(a\) gives

\[a\le \epsilon. \]

On the other hand,

\[\begin{align*} \|A\boldsymbol{x}\| &\ge \|AU_I\boldsymbol{v}\|-\|A(\boldsymbol{x}-U_I\boldsymbol{v})\|\\ &\ge \left(1-0.3\epsilon\right)\|U_I\boldsymbol{v}\|-\frac{(1+\epsilon)\epsilon}{4}\\ &\ge \left(1-0.3\epsilon\right)\left(1-\frac{\epsilon}{4}\right)-\frac{(1+\epsilon)\epsilon}{4}\\ &= 1-0.8\epsilon-0.175\epsilon^2\\ &\ge 1-\epsilon, \end{align*} \]

where the last step uses \(\epsilon<1\).

So far, we have

\[\sup_{\boldsymbol{x}\in S}\left|\frac{\|A\boldsymbol{x}\|}{\|\boldsymbol{x}\|}-1\right|<\epsilon. \]

Since \(\left|r^2-1\right|=|r-1|\,|r+1|\le \epsilon(2+\epsilon)\le 3\epsilon\) for \(\epsilon\le 1\), this also implies

\[\sup_{\boldsymbol{x}\in S}\left|\frac{\|A\boldsymbol{x}\|^2}{\|\boldsymbol{x}\|^2}-1\right|<3\epsilon. \]

Finally, let \(\delta'=d^s\delta\) and \(\epsilon'=3\epsilon\), and apply a union bound over all \(\binom{d}{s}\le d^s\) subsets \(I\subseteq[d]\) of size \(s\): with probability at least \(1-\delta'\), \(A\) is \((\epsilon',s)\)-RIP, provided

\[n\ge 24\frac{\ln\left(\frac{2d^s}{\delta'}\right)+s\ln\left(\frac{60}{\epsilon'}\right)}{(\epsilon'/3)^2}, \]

which is implied by \(n\ge 216\frac{s\ln\left(\frac{120d}{\delta'\epsilon'}\right)}{\epsilon'^2}\). This proves the theorem.

What if the sparsity is not with respect to a linear basis? That is, suppose \(\boldsymbol{x}\) lies in the range of a nonlinear function \(G\) that maps a low-dimensional space to a high-dimensional one. For example, in face generation, \(z\in\mathbb{R}^{12}\) may encode a schematic (rectangle) face while \(G(z)\) renders a realistic face from it. How can we compress \(G(z)\) so that we can still recover it from the compressed vector?

Let $G:\mathbb{R}^{k}\to\mathbb{R}^{n}$ be a generative model from a $d$-layer neural network using ReLU activations. Let $A\in\mathbb{R}^{m\times n}$ be a random Gaussian matrix for $m=O(kd\log n)$, scaled so $A_{i,j}\sim \mathcal{N}\left(0,\frac{1}{m}\right)$. For any $\boldsymbol{x^*}\in\mathbb{R}^{n}$ and any observation $\boldsymbol{y}=A\boldsymbol{x^*}+\boldsymbol{\eta}$, let $\boldsymbol{\widehat{z}}$ minimize $\|\boldsymbol{y}-AG(\boldsymbol{z})\|_{2}$ to within additive $\epsilon$ of the optimum. Then with $1-e^{-\Omega(m)}$ probability,

\[\|G(\boldsymbol{\widehat{z}})-\boldsymbol{x^{*}}\|_{2} \leq 6\min\limits_{\boldsymbol{z^{*}}\in\mathbb{R}^{k}}\|G(\boldsymbol{z^{*}})-\boldsymbol{x^{*}}\|_{2} + 3\|\boldsymbol{\eta}\|_{2} + 2\epsilon. \]

Support Vector Machine

Recall that perceptron is a linear classifier that tries to find a hyperplane that separates the data points of different classes. However, there are often many such hyperplanes that can separate the data points. The Support Vector Machine (SVM) aims to find the optimal hyperplane that maximizes the margin between the two classes, which helps in better generalization.

Lagrange Duality

The first question is how to define the margin. A natural choice is the distance from the hyperplane to the closest point, \(\min_i\frac{|\boldsymbol{w}^\top\boldsymbol{x_i}|}{\|\boldsymbol{w}\|_2}\). By rescaling \(\boldsymbol{w}\) so that \(\min_i y_i\boldsymbol{w}^\top\boldsymbol{x_i}=1\), maximizing the margin becomes minimizing \(\|\boldsymbol{w}\|_2\):

Hard SVM: Find \(\arg\min_{\boldsymbol{w}}\frac{\|\boldsymbol{w}\|^2_2}{2}\) subject to \(y_i\boldsymbol{w}^\top\boldsymbol{x}_i \geq 1\) for all \(i\).

Now, we can use Lagrange multipliers to solve this constrained optimization problem.

Its Lagrangian is given by

\[\mathcal{L}(\boldsymbol{w},\alpha)=\frac{\|\boldsymbol{w}\|^2_2}{2}-\sum_i\alpha_i\left(y_i\boldsymbol{w}^\top\boldsymbol{x_i}-1\right) \]

To solve the dual problem \(\max_{\alpha\ge 0}\min_{\boldsymbol{w}}\mathcal{L}(\boldsymbol{w},\alpha)\), we take the derivative:

\[\frac{\partial \mathcal{L}}{\partial \boldsymbol{w}} = 0\Rightarrow \boldsymbol{w}=\sum_i \alpha_i y_i \boldsymbol{x}_i. \]

Substituting this back into the Lagrangian gives

\[\max_{\alpha\ge 0}\left(\sum_i \alpha_i - \frac{1}{2}\sum_{i,j} \alpha_i \alpha_j y_i y_j \boldsymbol{x}_i^\top \boldsymbol{x}_j\right). \]

However, this method assumes that the data is linearly separable. If it is not, we can relax the hard constraint by using hinge loss \(\ell(\boldsymbol{x},y)=\max\left(0,1-y\boldsymbol{w}^\top\boldsymbol{x}\right).\) Equivalently, we can introduce slack variables \(\xi_i\) that allow some misclassification.

Soft SVM: Find \(\arg\min_{\boldsymbol{w}}\left(\frac{\|\boldsymbol{w}\|^2_2}{2} + \lambda\sum_i\xi_i\right)\) subject to \(y_i\boldsymbol{w}^\top\boldsymbol{x}_i \geq 1 - \xi_i\) and \(\xi_i \geq 0\) for all \(i\).

Its Lagrangian is given by

\[\mathcal{L}(\boldsymbol{w},\xi,\alpha,\kappa)=\frac{\|\boldsymbol{w}\|^2_2}{2} + \lambda\sum_i\xi_i - \sum_i\alpha_i\left(y_i\boldsymbol{w}^\top\boldsymbol{x}_i - 1 + \xi_i\right) - \sum_i \kappa_i \xi_i. \]

To solve the dual problem \(\max_{\alpha\ge 0,\kappa\ge 0}\min_{\boldsymbol{w},\xi}\mathcal{L}(\boldsymbol{w},\xi,\alpha,\kappa)\), we take the derivative:

\[\begin{align*} \frac{\partial \mathcal{L}}{\partial \boldsymbol{w}} = 0 \Rightarrow \boldsymbol{w} = \sum_i \alpha_i y_i \boldsymbol{x}_i,\\ \frac{\partial \mathcal{L}}{\partial \xi_i} = 0 \Rightarrow \lambda = \alpha_i+\kappa_i. \end{align*} \]

Substituting this back into the Lagrangian gives

\[\max_{\forall i,0\le \alpha_i\le \lambda}\left(\sum_i \alpha_i - \frac{1}{2}\sum_{i,j} \alpha_i \alpha_j y_i y_j \boldsymbol{x}_i^\top \boldsymbol{x}_j\right). \]

We can see that the two dual problems coincide, except that the soft-margin dual additionally caps each \(\alpha_i\) at \(\lambda\).

Kernel Method

Why solve the dual problem instead of the primal one? One reason is that there are quadratic programming algorithms that solve the dual efficiently. The main reason, however, is that the dual enables the so-called "kernel method".

Sometimes the data is not linearly separable but can be separated using higher-order features. So we can map the data to a higher-dimensional space using a feature map \(\phi(\cdot)\), and then apply SVM in that space.

Notice that in the dual problem, the data points only appear in the form of inner products \(\boldsymbol{x_i}^\top\boldsymbol{x_j}\). Therefore, we can replace these inner products with a kernel function \(K(\boldsymbol{x_i},\boldsymbol{x_j})=\phi(\boldsymbol{x_i})^\top\phi(\boldsymbol{x_j})\). So we do not need to explicitly compute the mapping \(\phi(\cdot)\), which may be computationally expensive.

  1. Pick a kernel function \(K(\cdot,\cdot)\), which corresponds to some feature map \(\phi(\cdot)\).
  2. Solve the following quadratic program:

    \[\max_{\forall i,0\le \alpha_i\le \lambda}\left(\sum_i \alpha_i - \frac{1}{2}\sum_{i,j} \alpha_i \alpha_j y_i y_j K(\boldsymbol{x_i},\boldsymbol{x_j})\right). \]

  3. The classifier is given by \(f(\boldsymbol{x})=\text{sign}\left(\sum_i \alpha_i y_i K(\boldsymbol{x_i},\boldsymbol{x})\right)\).

The next question is how to choose the kernel function so that it corresponds to a valid feature map.

If the kernel matrix $$\begin{bmatrix} K(\boldsymbol{x_1},\boldsymbol{x_1}) & K(\boldsymbol{x_1},\boldsymbol{x_2}) & \cdots\\ K(\boldsymbol{x_2},\boldsymbol{x_1}) & K(\boldsymbol{x_2},\boldsymbol{x_2}) & \cdots\\ \vdots & \vdots & \ddots\\ \end{bmatrix}$$

is positive semi-definite for any \(\{\boldsymbol{x_i}\}\), then there exists a feature map \(\phi(\cdot)\) such that \(K(\boldsymbol{x_i},\boldsymbol{x_j})=\phi(\boldsymbol{x_i})^\top\phi(\boldsymbol{x_j})\).

For example, the polynomial kernel \(K(\boldsymbol{x_i},\boldsymbol{x_j})=(\boldsymbol{x_i}^\top\boldsymbol{x_j}+c)^d\) and the Gaussian kernel \(K(\boldsymbol{x_i},\boldsymbol{x_j})=\exp\left(-\frac{\|\boldsymbol{x_i}-\boldsymbol{x_j}\|^2_2}{2\sigma^2}\right)\) are both valid kernels.
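
A minimal numpy sketch that builds the Gaussian kernel matrix and numerically checks the positive semi-definiteness required by the condition above (the data and bandwidth are illustrative assumptions):

```python
import numpy as np

def gaussian_kernel_matrix(X, sigma=1.0):
    """Gram matrix K_ij = exp(-||x_i - x_j||^2 / (2 sigma^2)) for X of shape (n, d)."""
    sq = np.sum(X**2, axis=1)
    d2 = sq[:, None] + sq[None, :] - 2 * X @ X.T   # pairwise squared distances
    return np.exp(-d2 / (2 * sigma**2))

K = gaussian_kernel_matrix(np.random.default_rng(0).normal(size=(50, 4)))
print(np.linalg.eigvalsh(K).min() > -1e-9)         # PSD up to round-off
```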

Decision Tree

SVM makes predictions with a single decision boundary, while a decision tree makes predictions by recursively splitting on different features according to a tree structure.

Boolean Functional Analysis

First, we need to define some concepts with respect to boolean functions, i.e., \(f:\{-1,1\}^n\to [0,1]\). Let \(\mathcal{D}\) be the uniform distribution over \(\{-1,1\}^n\) and \(\chi_S(\boldsymbol{x})=\prod_{i\in S} x_i\) for any \(S\subseteq [n]\).

We say that a family of mappings from $\{-1,1\}^n$ to $\mathbb{R}$, say $\{\psi_1,\ldots,\psi_N\}$, is a random orthonormal family if for any $i,j$, $$ \mathbb{E}_{\boldsymbol{x}\sim \mathcal{D}}[\psi_i(\boldsymbol{x})\psi_j(\boldsymbol{x})]=\delta_{ij}=\begin{cases} 1 & i=j\\ 0 & i\ne j \end{cases}. $$

Clearly, \(\{\chi_S\}_{S\subseteq [n]}\) is a random orthonormal family, called the Fourier basis.

For any boolean function $f:\{-1,1\}^n\to [0,1]$, its Fourier expansion is given by the unique representation under the Fourier basis: $$ f(\boldsymbol{x})=\sum_{S\subseteq [n]} \widehat{f}_S\chi_S(\boldsymbol{x}), $$ where $\widehat{f}_S=\mathbb{E}_{\boldsymbol{x}\sim \mathcal{D}}[f(\boldsymbol{x})\chi_S(\boldsymbol{x})]$ is called the Fourier coefficient.
For any boolean function $f:\{-1,1\}^n\to [0,1]$, we have $\mathbb{E}_{\boldsymbol{x}\sim \mathcal{D}}[f(\boldsymbol{x})^2]=\sum_{S\subseteq [n]} \widehat{f}_S^2.$

Intuitively, the easier the function is to represent, the simpler its Fourier coefficients are, i.e., the lower the degree and the higher the sparsity.

We say that two boolean functions $f$ and $g$ are $\epsilon$-close if $\mathbb{E}_{\boldsymbol{x}\sim \mathcal{D}}[(f(\boldsymbol{x})- g(\boldsymbol{x}))^2]\le \epsilon$.

We would like to approximate a decision tree by a low-degree and sparse function.

For any decision tree $T$ with $s$ leaves, there exists a degree-$\log\left(\frac{s}{\epsilon}\right)$, sparsity-$\frac{s^2}{\epsilon}$ function $h$ that is $4\epsilon$-close to $T$.
First, truncate $T$ at depth $\log\left(\frac{s}{\epsilon}\right)$. Each leaf at depth more than $\log\left(\frac{s}{\epsilon}\right)$ is reached with probability at most $2^{-\log(s/\epsilon)}=\frac{\epsilon}{s}$, so truncation introduces error at most $\epsilon$ in total. We assume that $T$ has depth at most $\log\left(\frac{s}{\epsilon}\right)$ in the following.

Clearly, a decision tree with \(s\) leaves can be represented as a sum of \(s\) "AND" terms (one per leaf), denoted by \(f\). Since every "AND" term has Fourier \(L_1\)-norm at most \(1\) and involves at most \(\log\left(\frac{s}{\epsilon}\right)\) variables, we get \(L_1(f)\le s\) and that the degree of \(f\) is at most \(\log\left(\frac{s}{\epsilon}\right)\).

Finally, let \(h\) be the truncation of \(f\) by keeping only the Fourier coefficients with \(|\widehat{f}_S|\ge\frac{\epsilon}{L_1(f)}\). Then \(L_0(h)\le \frac{L_1(f)}{\frac{\epsilon}{L_1(f)}}\le\frac{s^2}{\epsilon}\). By Parseval's identity, the missing terms have contribution at most

\[\sum_{|\widehat{f}_S|<\frac{\epsilon}{L_1(f)}} \left(\widehat{f}_S\right)^2 \le \max_{|\widehat{f}_S|<\frac{\epsilon}{L_1(f)}} |\widehat{f}_S|\cdot\sum_{|\widehat{f}_S|<\frac{\epsilon}{L_1(f)}} |\widehat{f}_S|\le L_1(f)\cdot \frac{\epsilon}{L_1(f)}=\epsilon. \]

Thus, \(h\) is \(\epsilon\)-close to \(f\). Since \((a+b)^2\le 2(a^2+b^2)\), we have $$\mathbb{E}_{\boldsymbol{x}\sim \mathcal{D}}[(T(\boldsymbol{x})- h(\boldsymbol{x}))^2]\le 2\mathbb{E}_{\boldsymbol{x}\sim \mathcal{D}}\left[(T(\boldsymbol{x})- f(\boldsymbol{x}))^2+(f(\boldsymbol{x})- h(\boldsymbol{x}))^2\right]\le 4\epsilon.$$

However, in practice, we do not know the decision tree \(T\) but only have access to random samples. Under this setting, we can use the following method.

  1. Take \(m\) uniformly random samples \(\{\boldsymbol{x_i},y_i\}_{i=1}^m\) for \(f\).
  2. For every \(S\subseteq [n]\) with degree at most \(\log\left(\frac{s}{\epsilon}\right)\), estimate \(\widehat{f}_S\) by \(\frac{1}{m}\sum_{i=1}^m y_i \chi_S(\boldsymbol{x_i})\).
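
A minimal numpy sketch of this estimator (the dictionary representation of the coefficients is an illustrative choice):

```python
import numpy as np
from itertools import combinations

def estimate_fourier_coeffs(X, y, max_deg):
    """Empirical low-degree Fourier coefficients from samples.
    X: (m, n) array with entries in {-1, 1}; y[i] = f(X[i])."""
    m, n = X.shape
    coeffs = {}
    for k in range(max_deg + 1):
        for S in combinations(range(n), k):
            chi = np.prod(X[:, list(S)], axis=1)   # chi_S(x); the empty product is 1
            coeffs[S] = float(np.mean(y * chi))    # estimate of the coefficient of S
    return coeffs
```
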
Denote the vector of Fourier coefficients of $f$ by $\boldsymbol{x}$. Each sample gives a linear measurement of $\boldsymbol{x}$: collecting the values $\chi_S(\boldsymbol{x_i})$ into the $i$-th row of a matrix $A$, the labels satisfy $\boldsymbol{y}=A\boldsymbol{x}$. Since $A$ is a random matrix, we can use compressed sensing to recover a sparse approximation of $\boldsymbol{x}$.
Given measurement matrix $A\in \mathbb{R}^{m\times N}$ such that its columns are in a random orthonormal family, and vector $\boldsymbol{y}=A\boldsymbol{x}+\boldsymbol{e}$ where $\boldsymbol{e}$ is some noise. If $\boldsymbol{x}$ is $s$-sparse, then LASSO finds an $\boldsymbol{x^*}$ such that $$\|\boldsymbol{x^*}-\boldsymbol{x}\|_2 \le c\frac{\|\boldsymbol{e}\|_2}{\sqrt{m}}$$ for some constant $c$, with probability $1-\delta$, as long as $m\ge \tilde{O}(s\log N)$ ($\tilde{O}$ hides dependencies on $\delta,s$, and other constants).

Construct Decision Tree

Now, we consider how to construct a decision tree from a boolean function. Unfortunately, finding the optimal decision tree is \(\mathsf{NP}\)-complete. Therefore, greedy algorithms are often used in practice.

The key step is to decide which feature to split on at each node. To do this, we need to define a metric to measure the "impurity" (that is, how mixed the classes are) of a node. A common choice is the Gini impurity, which measures the probability that two independently sampled elements of the node have different labels. Formally, for a node with \(K\) classes, let \(p_k\) be the proportion of class \(k\) in the node. Then the Gini impurity is defined as

\[\text{Gini} = 1 - \sum_{k=1}^K p_k^2. \]

Recursively construct the decision tree as follows:
  1. At each node, compute the Gini impurity of the current node. If it is zero, stop.
  2. For each feature, compute the Gini impurity of the child nodes after splitting on that feature.
  3. Choose the feature that minimizes the weighted average Gini impurity of the child nodes.
  4. Repeat the process for each child node.
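
A minimal numpy sketch of the impurity and of steps 2–3 for binary features (the \(\{0,1\}\) feature encoding is an illustrative assumption):

```python
import numpy as np

def gini(labels):
    """Gini impurity 1 - sum_k p_k^2 of a node's labels."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p**2)

def best_split(X, y):
    """Return (feature index, weighted child impurity) of the best binary split."""
    n, d = X.shape
    best_j, best_score = None, np.inf
    for j in range(d):
        mask = X[:, j] == 1                       # assumes features take values {0, 1}
        if mask.all() or (~mask).all():
            continue                              # split separates nothing; skip
        score = (mask.sum() * gini(y[mask]) + (~mask).sum() * gini(y[~mask])) / n
        if score < best_score:
            best_j, best_score = j, score
    return best_j, best_score
```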

Theoretically, a sufficiently deep decision tree can fit any training set; however, it may overfit. To avoid this, one may force each decision tree to only consider a subset of the data, using bagging (or bootstrap aggregating). Formally, denote the dataset by \((x_1,y_1),...,(x_n,y_n)\). We sample \(n\) times with replacement from the original dataset to get a new dataset \((x_1',y_1'),...,(x_n',y_n')\), and train a decision tree on it. This procedure is repeated multiple times to get an ensemble of decision trees, which is called a random forest.

Bag the dataset (and possibly the features) $B$ times to get $B$ datasets. Train a decision tree on each dataset. The final prediction is given by averaging the predictions of all decision trees (in regression) or taking the majority vote (in classification); the resampling step is sketched below.
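
A minimal sketch of the resampling step (the seed handling is illustrative, and numpy arrays are assumed):

```python
import numpy as np

def bootstrap_datasets(X, y, B, seed=0):
    """B bootstrap resamples of (X, y), each of size n, drawn with replacement."""
    rng = np.random.default_rng(seed)
    n = len(y)
    idx = [rng.integers(0, n, size=n) for _ in range(B)]
    return [(X[i], y[i]) for i in idx]
```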

Boosting

If we have lots of weak learners, can we combine them to get a strong learner? A straightforward approach is stacking weak learners. That is, let \(h_1, h_2, \ldots, h_n\) be a sequence of weak learners, and we train the final learner on \(\{((h_1(x),h_2(x),...,h_n(x)),y)\mid(x,y)\in S\}\), where \(S\) is the original training set.

In practice, however, we usually have access to the learning algorithm itself, so a weak learner can be trained on any (re-weighted or modified) training set we choose. Therefore, we can iteratively add weak learners to the current model to reduce the loss. Boosting does exactly this.

Gradient Boosting

We can view the problem as an optimization problem in function space. So we can use gradient descent to solve it. That is, at each iteration, we find a weak learner that fits the negative gradient of the loss function with respect to the current model.

Given training set $S=\{(x_1,y_1),...,(x_n,y_n)\}$ and a loss function $L(h)=\sum\limits_{i=1}^n\ell(y_i,h(x_i))$. Start with any initial learner $H$.

For \(t=1\) to \(T\):

  • For all \(i\), calculate gradients $$g(x_i)=\frac{\partial L(H)}{\partial H(x_i)}.$$
  • Find the best weak learner \(h_t\) that fits the negative gradients, i.e., $$h_t=\arg\min_h \sum\limits_{i=1}^n\ell(-g(x_i),h(x_i)).$$
  • Update the learner \(H\leftarrow H+\eta h_t\) for some step size \(\eta\).

Output the final learner \(H\).

For example, if we use squared loss \(\ell(y,h(x))=\frac{1}{2}(y-h(x))^2\), then \(-g(x_i)=y_i-H(x_i)\), which is exactly the residual.
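
A minimal sketch of the loop under squared loss, where each round fits the residual; `fit_weak` is an assumed black-box routine that returns a callable predictor:

```python
import numpy as np

def gradient_boost(X, y, fit_weak, T=100, eta=0.1):
    """Gradient boosting with squared loss; starts from the zero learner H = 0.
    `fit_weak(X, r)` is assumed to return a callable h with h(X) approximating r."""
    preds = np.zeros(len(y))
    learners = []
    for _ in range(T):
        residual = y - preds               # negative gradient of 0.5 * (y - H(x))^2
        h = fit_weak(X, residual)
        learners.append(h)
        preds += eta * h(X)                # H <- H + eta * h_t
    return lambda Xq: eta * sum(h(Xq) for h in learners)
```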

AdaBoost

Another popular boosting algorithm is AdaBoost. Instead of fitting the negative gradient, AdaBoost reweights the training examples at each iteration, so that harder examples are given more weight.

Given training set $S=\{(x_1,y_1),...,(x_n,y_n)\}$. Let $\mathcal{D}_1$ be the uniform distribution over $\{1,...,n\}$.

For \(t=1\) to \(T\):

  • Find the weak learner \(h_t\) that minimizes the weighted error $$\epsilon_t=\text{Pr}_{i\sim \mathcal{D}_t}[h_t(x_i)\ne y_i];$$
  • Construct the distribution \(\mathcal{D}_{t+1}\):

\[\mathcal{D}_{t+1}(i)=\frac{1}{Z_t}\mathcal{D}_t(i)\exp(-\alpha_t y_i h_t(x_i)), \]

where \(\alpha_t=\frac{1}{2}\ln\left(\frac{1-\epsilon_t}{\epsilon_t}\right)\) and \(Z_t\) is the normalization factor.

Output the final learner $$H_{\text{final}}(x)=\text{sign}\left(\sum_{t=1}^T \alpha_t h_t(x)\right).$$

By the way, one can verify that AdaBoost is a special case of gradient boosting with exponential loss \(\ell(y,h(x))=\exp(-y h(x))\).
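
A minimal sketch of the AdaBoost loop with labels in \(\{-1,+1\}\); `fit_weak(X, y, D)` is an assumed black-box routine returning a classifier with small weighted error under the weights \(D\), and the clipping of \(\epsilon_t\) is an illustrative numerical guard:

```python
import numpy as np

def adaboost(X, y, fit_weak, T=50):
    """AdaBoost; y has entries in {-1, +1}."""
    n = len(y)
    D = np.full(n, 1.0 / n)                       # D_1: uniform distribution
    hs, alphas = [], []
    for _ in range(T):
        h = fit_weak(X, y, D)
        pred = h(X)
        eps = np.clip(np.sum(D * (pred != y)), 1e-12, 1 - 1e-12)
        alpha = 0.5 * np.log((1 - eps) / eps)
        D = D * np.exp(-alpha * y * pred)
        D /= D.sum()                              # division by the factor Z_t
        hs.append(h)
        alphas.append(alpha)
    return lambda Xq: np.sign(sum(a * h(Xq) for a, h in zip(alphas, hs)))
```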

In AdaBoost, the training error of $H_{\text{final}}$ is at most $$\prod_t\left(2\sqrt{\epsilon_t(1-\epsilon_t)}\right)\le\exp\left(-2\sum_t\left(\frac{1}{2}-\epsilon_t\right)^2\right).$$
Unrolling the update rule gives $$ \mathcal{D}_{T+1}(i)=\frac{1}{n}\frac{\exp(-y_i\sum_{t} \alpha_t h_t(x_i))}{\prod_t Z_t}. $$

The training error of \(H_{\text{final}}\) is

\[\begin{align*} &\ \text{Pr}_{(x,y)\sim S} [H_{\text{final}}(x)\ne y]\\ =&\ \text{Pr}_{S} \left[y \sum_t \alpha_t h_t(x)\le 0\right]\\ \le&\ \mathbb{E}_{S}\left[\exp\left(-y \sum_{t} \alpha_t h_t(x)\right)\right]\\ =&\ \sum_i \mathcal{D}_{T+1}(i)\prod_t Z_t\\ =&\ \prod_t Z_t. \end{align*} \]

On the other hand,

\[\begin{align*} Z_t=&\ \sum_i \mathcal{D}_t(i)\exp(-\alpha_t y_i h_t(x_i))\\ =&\ \sum_{i:h_t(x_i)=y_i} \mathcal{D}_t(i)\exp(-\alpha_t) + \sum_{i:h_t(x_i)\ne y_i} \mathcal{D}_t(i)\exp(\alpha_t)\\ =&\ (1-\epsilon_t)\exp(-\alpha_t)+\epsilon_t\exp(\alpha_t)\\ =&\ 2\sqrt{\epsilon_t(1-\epsilon_t)}. \end{align*} \]

Let \(\gamma_t=\frac{1}{2}-\epsilon_t\). Then $$Z_t=\sqrt{1-4\gamma_t^2}\le \exp(-2\gamma_t^2).$$

Immediately we have

\[\prod_t Z_t \le \exp\left(-2\sum_t \gamma_t^2\right). \]

Margin Theory

Boosting has a critical and counterintuitive property: even after the training error has dropped to zero, continuing training can still improve the performance of the model on the test set.

The explanation is that boosting increases the margin of the training examples. The margin of an example \((x,y)\) is defined as \(yf(x)\). Intuitively, the larger the margin, the more confident the prediction is. So even if the training error is zero, increasing the margin can still improve the generalization.

In AdaBoost, let $f(x)=\frac{\sum_t \alpha_th_t(x)}{\sum_t \alpha_t}$, then for any $\theta$, $$ \text{Pr}_{S}[yf(x)\le \theta]\le \prod_t \left(2\sqrt{\epsilon_t^{1-\theta}(1-\epsilon_t)^{1+\theta}}\right). $$
Similarly, $$ \begin{align*} &\ \text{Pr}_{S}[yf(x)\le \theta]\\ =&\ \text{Pr}_S\left[y\sum_t \alpha_t h_t(x)\le \theta\sum_t \alpha_t\right]\\ \le &\ \mathbb{E}_S\left[\exp\left(-y\sum_t\alpha_t h_t(x)+\theta\sum_t\alpha_t\right)\right]\\ =&\ \exp\left(\theta\sum_t\alpha_t\right)\mathbb{E}_S\left[\exp\left(-y \sum_{t} \alpha_t h_t(x)\right)\right]\\ =&\ \exp\left(\theta\sum_t\alpha_t\right)\prod_t Z_t. \end{align*} $$

Substituting \(Z_t\) and \(\alpha_t\) gives

\[\prod_t\left(2\sqrt{\epsilon_t^{1-\theta}(1-\epsilon_t)^{1+\theta}}\right). \]

Assume that \(\theta<\frac{1}{2}-\epsilon_t\) for all \(t\). Then the expression inside the parentheses is less than 1. So the probability decreases exponentially with \(T\).

The significance of margin theory is that we can bound the generalization error by the margin distribution on the training set, instead of the training error.

Let $S$ be a set of $m$ samples chosen independently at random according to $\mathcal{D}$. Assume that the base hypothesis class $\mathcal{H}$ is finite, and let $\delta>0$. Then with probability at least $1-\delta$ over the random choice of the training set $S$, every weighted average function $f$ satisfies the following bound for all $\theta>0$: $$ \text{Pr}_{\mathcal{D}}[y f(x)\le 0]\le \text{Pr}_S[y f(x)\le \theta] + O\left(\frac{1}{\sqrt{m}}\left(\frac{\log m \log |\mathcal{H}|}{\theta^2}+\log\frac{1}{\delta}\right)^{1/2}\right). $$

Nearest Neighbor

KNN

Consider a classification task. When a new query \(x\) is given, we predict its label by looking at the labels of its nearest neighbors in the training set. Such algorithms are called nearest neighbor algorithms. The most common version is the \(k\)-nearest neighbor (\(k\)NN) algorithm.

Let $S=\{(x_1,y_1),...,(x_n,y_n)\}$ be the training set and $x^*$ be the query point.

Find \(k\) nearest neighbors of \(x^*\) in \(S\), denoted by \(\{(x_{i_1},y_{i_1}),...,(x_{i_k},y_{i_k})\}\).

Predict the label of \(x^*\) by majority vote:

\[y=\arg\max_{y'} \sum_{j=1}^k \delta(y_{i_j},y'), \]

where \(\delta(a,b)=1\) if \(a=b\) and \(0\) otherwise.
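
A minimal numpy sketch of this procedure (Euclidean distance is the illustrative choice of metric):

```python
import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, x_query, k=5):
    """k-NN classification by majority vote among the k closest training points."""
    dists = np.linalg.norm(X_train - x_query, axis=1)   # distances to all points
    nearest = np.argsort(dists)[:k]                     # indices of the k nearest
    return Counter(y_train[i] for i in nearest).most_common(1)[0][0]
```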

Choosing \(k\) is critical. If \(k\) is too small, the model may be too sensitive to noise. If \(k\) is too large, the model may be too smooth and miss important patterns.

\(k\)NN is non-parametric: no parameters are learned from the data, and the model structure is determined by the dataset itself. For this reason, some people regard \(k\)NN as closer to unsupervised learning.

However, exact nearest neighbor search is hard, so we consider approximation. Since nearest neighbor search reduces to "\(R\)-near search" via binary search over \(R\), we can relax "\(R\)-near search" to "\(c\)-approximate \(R\)-near search": we only need to return a point within distance \(cR\) whenever there exists a point within distance \(R\).

A family $\mathcal{H}$ is called $(R,cR,P_1,P_2)$-sensitive if for any $x,y\in\mathbb{R}^d$:
  • If \(\|x-y\|\leq R\), then \(\Pr_{h\in\mathcal{H}}[h(x)=h(y)]\geq P_1\);
  • If \(\|x-y\|\geq cR\), then \(\Pr_{h\in\mathcal{H}}[h(x)=h(y)]\leq P_2\).

An LSH family can be used to design an efficient algorithm for approximate near neighbor search.

Choose $L$ hash functions $g_1,g_2,\cdots,g_L$, where each $g_i=(h_{i,1},h_{i,2},\cdots,h_{i,k})$ and each $h_{i,j}$ is chosen independently from $\mathcal{H}$.

Construct \(L\) hash tables, where each table \(i\) contains the dataset points hashed using the function \(g_i\).

When a query \(x\) is given, for each \(j=1,2,\cdots,L\):

  • Find the bucket \(g_j(x)\) in the \(j\)-th hash table and retrieve all points in that bucket.
  • For each retrieved point, check if it is within distance \(cR\) from \(x\). If yes, return it.
  • Stop as soon as we have checked \(L'\) points.
Let $\rho=\frac{\log 1/P_1}{\log 1/P_2}$. Set $k=\log_{1/P_2}(n),L=n^\rho$ and $L'=2L+1$. If there exists a point that is $R$-near to $x$, the algorithm will return a point that is $cR$-near to $x$ with probability at least $\frac{1}{2}-\frac{1}{e}$.
Denote by $S'$ the set of points that are not $cR$-near to $x$, i.e., $S'=S-B(x,cR)$. For every $i$ and every $x'\in S'$, $$ \text{Pr}[g_i(x')=g_i(x)]\leq P_2^k=\frac{1}{n}. $$ Thus, $$ \mathbb{E}[\#(x'\in S':g_i(x')=g_i(x))]\leq n\cdot \frac{1}{n}=1. $$ Furthermore, since there are $L$ hash tables, $$ \mathbb{E}[\#(\text{total wrong points})]\le L. $$ By Markov's inequality, $$ \Pr[\#(\text{total wrong points})\le 2L]\ge \frac{1}{2}. $$

On the other hand, suppose there exists a point \(x^*\) that is \(R\)-near to \(x\). Then for each \(i\),

\[\text{Pr}[g_i(x^*)=g_i(x)]\ge P_1^k=n^{-\rho}. \]

Thus

\[\text{Pr}[g_i(x^*)\neq g_i(x),\forall i\in[L]]\le (1-n^{-\rho})^L\le \frac{1}{e}. \]

Thus, by a union bound over the two failure events, with probability at least \(1-\frac{1}{2}-\frac{1}{e}=\frac{1}{2}-\frac{1}{e}\), some \(cR\)-near point (e.g., \(x^*\)) is found among the first \(2L+1\) retrieved points.

We can repeat the algorithm (with independent hash tables) $O\left(\log\frac{1}{\delta}\right)$ times to amplify the success probability to $1-\delta$.

Finally, the problem is how to construct LSH families. For the \(\ell_2\) metric, we can use the following family:

\[h_{r,b}(x)=\left\lfloor\frac{\langle r,x\rangle+b}{w}\right\rfloor, \]

where \(r\sim\mathcal{N}(0,I_d)\), \(b\sim\text{Unif}[0,w)\) and \(w>0\) is a parameter. Letting \(c=\|x-y\|_2\), we have

\[P(c)=\text{Pr}[h_{r,b}(x)=h_{r,b}(y)]=\int_0^w \frac{1}{c}\cdot f_p\left(\frac{t}{c}\right)\left(1-\frac{t}{w}\right)\mathrm{d}t, \]

where \(f_p\) is the density of the absolute value of a standard normal random variable. It can be shown that \(P(c)\) is a monotonically decreasing function of \(c\).
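
A minimal sketch of sampling one hash function from this family (the bucket width \(w\) and seed are illustrative choices); a table function \(g_i\) then concatenates \(k\) such hashes into a tuple:

```python
import numpy as np

def make_l2_lsh(d, w=4.0, seed=0):
    """Sample one h_{r,b}(x) = floor((<r, x> + b) / w) from the l2 LSH family."""
    rng = np.random.default_rng(seed)
    r = rng.normal(size=d)             # r ~ N(0, I_d)
    b = rng.uniform(0.0, w)            # b ~ Unif[0, w)
    return lambda x: int(np.floor((r @ x + b) / w))
```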

Metric Learning

Searching for nearest neighbors in the original space \(\mathbb{R}^d\) may not be the best choice. We can use neural networks to learn a feature space with better nearest-neighbor structure. This is called metric learning.

The key point is how to define the loss function.

NCA (to be maximized): $$L=\sum_i\sum\limits_{j:c_j=c_i}p_{i,j},$$

where \(p_{ii}=0\) and for \(i\neq j\),

\[p_{i,j}=\frac{\exp(-\|f(x_i)-f(x_j)\|^2)}{\sum_{k\neq i}\exp(-\|f(x_i)-f(x_k)\|^2)}. \]

LMNN (to be minimized): $$L=\sum_{i}\sum_{j\in N_i}\sum_{k:y_k\neq y_i}\max\left(0,\|f(x_i)-f(x_j)\|_2-\|f(x_i)-f(x_k)\|_2+1\right),$$

where \(N_i\) is the set of target neighbors of \(x_i\).
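
A minimal numpy sketch of the NCA objective on already-embedded points \(Z=f(X)\); in practice one differentiates this through \(f\):

```python
import numpy as np

def nca_objective(Z, labels):
    """NCA objective sum_i sum_{j: c_j = c_i} p_ij; to be maximized."""
    d2 = np.sum((Z[:, None, :] - Z[None, :, :])**2, axis=-1)  # pairwise squared dists
    np.fill_diagonal(d2, np.inf)                              # enforces p_ii = 0
    P = np.exp(-d2)
    P /= P.sum(axis=1, keepdims=True)                         # softmax over neighbors
    same = labels[:, None] == labels[None, :]
    return float(P[same].sum())
```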

Unsupervised Learning

PCA

PCA is a method for finding the directions in high-dimensional data that are most informative, i.e., that capture the most variance. Formally, denote the (centered) data matrix by \(X=[x_1,x_2,\cdots,x_n]\in\mathbb{R}^{d\times n}\); we want to find

\[\arg\max_{v:\|v\|^2=1}\mathbb{E}_{x_i}\left[\langle v,x_i\rangle^2\right]. \]

Equivalently (up to a factor of \(\frac{1}{n}\)), we can write it as

\[\arg\max_{v:\|v\|^2=1}v^\top XX^\top v. \]

Clearly, the solution is the top eigenvector of \(XX^\top\). The \(k\)-PCA is to find the top \(k\) eigenvectors of \(XX^\top\). To compute PCA, we can use the power method:

Let $b_0$ be a random vector in $\mathbb{R}^d$.

For \(t=1,2,\cdots,T\):

  • \(b_t = \frac{(XX^\top) b_{t-1}}{\|(XX^\top) b_{t-1}\|}.\)

Return \(b_T\).

To analyze the convergence, let \(v_1,v_2,\cdots,v_d\) be the eigenvectors of \(XX^\top\) with eigenvalues \(\lambda_1\geq \lambda_2\geq \cdots\geq \lambda_d\geq 0\). We can write \(b_0=\sum_{i=1}^d \alpha_i v_i\). Then

\[b_t = \frac{1}{Z_t}\sum_{i=1}^d \lambda_i^t\alpha_iv_i, \]

where \(Z_t\) is a normalization factor. Thus, provided \(\alpha_1\neq 0\) and \(\lambda_1>\lambda_2\), \(b_t\) converges to \(\pm v_1\) exponentially fast with rate \(\lambda_2/\lambda_1\). Once we know \(v_1\), we can remove the \(v_1\) component and repeat the process to find \(v_2\), and so on.
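
A minimal numpy sketch of the iteration (the fixed iteration count is an illustrative stopping rule; a random start has \(\alpha_1\neq 0\) almost surely):

```python
import numpy as np

def power_method(M, T=200, seed=0):
    """Power iteration for the top eigenvector of a PSD matrix M (e.g., M = X X^T)."""
    rng = np.random.default_rng(seed)
    b = rng.normal(size=M.shape[0])
    for _ in range(T):
        b = M @ b
        b /= np.linalg.norm(b)        # normalize to prevent overflow/underflow
    return b
```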

Clustering

Clustering is the task of grouping a set of objects in such a way that objects in the same group (called a cluster) are more similar to each other than to those in other clusters.

K-means & Lloyd's Algorithm

One of the most popular clustering algorithms is K-means.

Given $(x_1,x_2,\cdots,x_n)$, $x_i\in\mathbb{R}^d$, and $k$.

Find \(S=\{S_1,S_2,\cdots,S_k\}\) to minimize

\[\sum_{i=1}^k \sum_{x\in S_i} \|x - \mu_i\|^2, \]

where \(\mu_i = \frac{1}{|S_i|}\sum_{x\in S_i} x\) is the center of cluster \(S_i\).

However, it is \(\mathsf{NP}\)-hard to find the optimal solution. Lloyd's algorithm is a heuristic that converges quickly in practice.

Randomly initialize $k$ cluster centers $\mu_1,\mu_2,\cdots,\mu_k$.

Repeat until convergence:

  • Assign each point to the nearest cluster center:

    \[S_i = \{x_j: i = \arg\min_{l} \|x_j - \mu_l\|^2\}, \quad \forall i=1,2,\cdots,k. \]

  • Update each cluster center:

    \[\mu_i = \frac{1}{|S_i|}\sum_{x\in S_i} x, \quad \forall i=1,2,\cdots,k. \]
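
A minimal numpy sketch of Lloyd's algorithm (initialization at random data points is an illustrative choice, and the sketch assumes no cluster empties out):

```python
import numpy as np

def lloyd(X, k, iters=100, seed=0):
    """Lloyd's algorithm on X of shape (n, d); returns centers and assignments."""
    rng = np.random.default_rng(seed)
    mu = X[rng.choice(len(X), size=k, replace=False)]   # init at random data points
    for _ in range(iters):
        assign = np.argmin(((X[:, None] - mu[None]) ** 2).sum(-1), axis=1)
        new_mu = np.array([X[assign == i].mean(axis=0) for i in range(k)])
        if np.allclose(new_mu, mu):                     # converged
            break
        mu = new_mu
    return mu, assign
```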

The above algorithms cluster under the \(\ell_2\) metric, but sometimes this is not the best choice.

Spectral Clustering

Assume that we can define "similarity" between any two points \(x_i\) and \(x_j\), denoted by \(A_{ij}\ge 0\). i.e., \(A\) is the adjacency matrix of a weighted graph. Then we can define the degree matrix \(D\) as a diagonal matrix with \(D_{ii}=\sum_{j=1}^n A_{ij}\), and the Laplacian matrix \(L=D-A\).

Let $G$ be undirected graph with non-negative weights. Then
  • The multiplicity of eigenvalue \(0\) of \(L\) equals the number of connected components \(S_1,S_2,...,S_k\) in \(G\);
  • The eigenspace of eigenvalue \(0\) is spanned by the indicator vectors \(\mathbf{1}_{S_1},\mathbf{1}_{S_2},...,\mathbf{1}_{S_k}\).
Clearly, $L$ is symmetric. Also, for any $v\in\mathbb{R}^n$, $$ \begin{align*} v^\top L v &= v^\top D v - v^\top A v \\ &= \sum_{i=1}^n D_{ii} v_i^2 - \sum_{1\le i,j\le n} A_{ij} v_i v_j \\ &= \frac{1}{2}\left(\sum_{i=1}^n D_{ii} v_i^2 + \sum_{j=1}^n D_{jj} v_j^2 - 2\sum_{1\le i,j\le n} A_{ij} v_i v_j\right) \\ &= \frac{1}{2} \sum_{1\le i,j\le n} A_{ij} (v_i - v_j)^2 \\ & \ge 0. \end{align*} $$ Thus, $L$ is positive semi-definite.

Assume \(S_1,S_2,...,S_k\) are the connected components of \(G\). One can easily verify that \(L\mathbf{1}_{S_i}=0\) for all \(i=1,2,...,k\). Thus, the multiplicity of eigenvalue \(0\) is at least \(k\). On the other hand, if \(Lv=0\), then \(v^\top L v=0\). Since \(A_{ij}\ge 0\), we must have \(v_i=v_j\) for any \((i,j)\) such that \(A_{ij}>0\). Thus, \(v\) must be a linear combination of \(\mathbf{1}_{S_1},\mathbf{1}_{S_2},...,\mathbf{1}_{S_k}\). Therefore, the multiplicity of eigenvalue \(0\) is exactly \(k\).

However, in general, the graph is connected and we cannot directly use the above lemma. Instead, we compute the first \(k\) eigenvectors of \(L\) and use them as features for clustering.

Given $(x_1,x_2,\cdots,x_n)$, $x_i\in\mathbb{R}^d$, and $k$.

Construct the similarity graph and compute the adjacency matrix \(A\), the degree matrix \(D\) and the Laplacian matrix \(L=D-A\).

Compute the first \(k\) eigenvectors (corresponding to smallest \(k\) eigenvalues) \(v_1,v_2,\cdots,v_k\) of \(L\).

Let \(U=[v_1,v_2,\cdots,v_k]\in\mathbb{R}^{n\times k}\), and let \(y_i\in\mathbb{R}^k\) be the \(i\)-th row of \(U\) for \(i=1,2,\cdots,n\).

Cluster \((y_1,y_2,\cdots,y_n)\) using K-means into \(k\) clusters \(C_1,C_2,\cdots,C_k\).

Output the clusters \(S_i=\{x_j:y_j\in C_i\}\) for \(i=1,2,\cdots,k\).
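
A minimal numpy sketch of the whole pipeline, reusing the `lloyd` sketch above in place of a library k-means:

```python
import numpy as np

def spectral_clustering(A, k):
    """Unnormalized spectral clustering from an affinity matrix A of shape (n, n)."""
    D = np.diag(A.sum(axis=1))
    L = D - A
    _, vecs = np.linalg.eigh(L)       # eigenpairs in ascending eigenvalue order
    U = vecs[:, :k]                   # rows of U are the new features y_i
    _, assign = lloyd(U, k)           # k-means on the embedded points
    return assign
```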

How can we find the smallest eigenvalues? We already know that the power method can find the largest eigenvalue \(\lambda_{max}\) of \(L\). Since \(L\) is positive semi-definite, \(B=L-\lambda_{max}I\) is negative semi-definite, and its eigenvalue of largest magnitude is \(\lambda_{min}-\lambda_{max}\). Thus, we can run the power method on \(B\) to find \(\lambda_{min}-\lambda_{max}\), and hence the smallest eigenvalue (and eigenvector) of \(L\).

To understand why spectral clustering works, we need to introduce the RatioCut objective function. Given a clustering \(C_1,C_2,...,C_k\), define

\[\text{RatioCut}(C_1,C_2,...,C_k) = \sum_{i=1}^k \frac{\text{cut}(C_i,\bar{C_i})}{|C_i|}, \]

where \(\text{cut}(C_i,\bar{C_i}) = \sum_{x\in C_i,y\in \bar{C_i}} A_{xy}\) is the total weight of edges between \(C_i\) and its complement \(\bar{C_i}\). However, minimizing RatioCut is \(\mathsf{NP}\)-hard.

Let $C_1,C_2,...,C_k$ be a clustering. Define $H\in\mathbb{R}^{n\times k}$ as $$ H_{ij} = \begin{cases} \frac{1}{\sqrt{|C_j|}}, & \text{if } x_i\in C_j \\ 0, & \text{otherwise} \end{cases}. $$ Then $H^\top H=I$ and $\text{RatioCut}(C_1,C_2,...,C_k)=\text{tr}(H^\top L H)$.
Recall that $$v^\top L v = \frac{1}{2} \sum_{1\le i,j\le n} A_{ij} (v_i - v_j)^2.$$

Thus, let \(H_i=(H_{1i},H_{2i},...,H_{ni})\), we have

\[H_i^\top L H_i = \frac{1}{2} \sum_{1\le p,q\le n} A_{pq} (H_{pi} - H_{qi})^2 = \sum_{x\in C_i, y\in \bar{C_i}} A_{xy} \left(\frac{1}{\sqrt{|C_i|}} - 0\right)^2 = \frac{\text{cut}(C_i,\bar{C_i})}{|C_i|}. \]

Immediately,

\[\text{RatioCut}(C_1,C_2,...,C_k) = \sum_{i=1}^k H_i^\top L H_i = \sum_{i=1}^k (H^\top L H)_{ii}=\text{tr}(H^\top L H). \]

Now, we can relax \(H\) to be any orthogonal matrix and solve the following optimization problem:

\[\min_{H\in\mathbb{R}^{n\times k},H^\top H=I}\text{tr}(H^\top L H). \]

Standard trace minimization (proved below) tells us to pick the eigenvectors corresponding to the smallest \(k\) eigenvalues of \(L\) as the columns of \(H\). This is exactly what spectral clustering does.

Let $L\in\mathbb{R}^{n\times n}$ be a symmetric and positive semi-definite matrix with eigenvalues $\lambda_1\leq \lambda_2\leq \cdots\leq \lambda_n$. Then $$ \min_{H\in\mathbb{R}^{n\times k},H^\top H=I}\text{tr}(H^\top L H)=\sum\limits_{i=1}^k \lambda_i. $$
Apply the eigendecomposition $L=Q\Lambda Q^\top$, where $Q=[v_1,v_2,...,v_n]$ is orthogonal and $\Lambda=\text{diag}(\lambda_1,\lambda_2,...,\lambda_n)$. Expand each column $h_i$ of $H$ in this basis: $h_i=\sum_{j=1}^n c_{ij} v_j$. Then $$ \text{tr}(H^\top L H) = \text{tr}(C^\top\Lambda C) = \sum_{i=1}^k \sum_{j=1}^n \lambda_j c_{ij}^2. $$

Since \(H^\top H=I\), we have \(\sum_{j=1}^n c_{ij}^2=1\) for all \(i\) and \(\sum_{i=1}^k c_{ij}^2\leq 1\) for all \(j\). That is, \((c_{ij}^2)\) forms the first \(k\) rows of a doubly stochastic matrix \(B\). By Birkhoff's theorem, \(B\) is a convex combination of permutation matrices.

Notice that all the feasible \((c_{ij}^2)\) form a compact and convex set, so the minimizer must occur at an extreme point, which is a permutation matrix. Thus, we should pick \(c_{ij}^2=1\) for \(j=i\) and \(0\) otherwise, which corresponds to the smallest \(k\) eigenvalues of \(L\).
