
# Concentration Inequalities

## Basic inequalities

### Markov’s inequality

Let $\bfX$ be a non-negative random variable. For any $t>0$, we have $$\bbP[\bfX\geq t]\leq\frac{\bbE[\bfX]}{t}.$$

Let $f_{\bfX}(\cdot)$ be the pdf of $\bfX$. \begin{align} \bbE[\bfX] &= \int_{0}^{\infty}X\cdot f_{\bfX}(X)dX\\
&= \int_{0}^{t}X\cdot f_{\bfX}(X)dX + \int_{t}^{\infty}X\cdot f_{\bfX}(X)dX\\
&\geq \int_{t}^{\infty}t\cdot f_{\bfX}(X)dX = t\cdot\bbP[\bfX\geq t], \end{align} where the inequality drops the first integral (its integrand is non-negative) and lower bounds $X$ by $t$ on $[t,\infty)$.

Note that Markov’s inequality only works for non-negative random variables. As a result, we may apply Markov’s inequality to $\bfY=\card{\bfX-\bbE[\bfX]}$ for a random variable $\bfX$ which may take negative values.

It’s also natural to boost Markov’s inequality by applying a non-negative and nondecreasing function in advance. Concretely, let $\phi:I\rightarrow\bbR^{\geq0}$ be a non-negative and nondecreasing function on $I\subseteq\bbR$. For any $t\in I$ with $\phi(t)>0$, we have $$\bbP[\bfX\geq t]\leq\bbP[\phi(\bfX)\geq\phi(t)]\leq\frac{\bbE[\phi(\bfX)]}{\phi(t)}.$$

For instance, taking $\phi(x)=x^2$ on $[0,\infty)$ and applying the bound to $\card{\bfX-\bbE[\bfX]}$ yields Chebyshev’s inequality $\bbP[\card{\bfX-\bbE[\bfX]}\geq t]\leq\frac{\Var[\bfX]}{t^2}$.
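
Both bounds can be sanity-checked numerically. The following sketch (my own illustration, using an exponential random variable with mean $1$ and numpy) compares the empirical tail probabilities against the Markov and Chebyshev bounds:

```python
import numpy as np

rng = np.random.default_rng(0)

# Non-negative random variable: X ~ Exponential(1), so E[X] = 1 and Var[X] = 1.
X = rng.exponential(scale=1.0, size=1_000_000)
t = 3.0

# Markov: P[X >= t] <= E[X] / t.
empirical_tail = np.mean(X >= t)
markov_bound = X.mean() / t
assert empirical_tail <= markov_bound

# Chebyshev: P[|X - E[X]| >= t] <= Var[X] / t^2.
empirical_dev = np.mean(np.abs(X - X.mean()) >= t)
chebyshev_bound = X.var() / t**2
assert empirical_dev <= chebyshev_bound
```

Here the true tail $e^{-3}\approx0.05$ sits well below the Markov bound $1/3$, illustrating that Markov is valid but often loose.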

### Chernoff’s inequality

Let $\bfX$ be a random variable over $\bbR$. For any $t\in\bbR$ and $\lambda\geq0$, we have $$\mathbb{P}[\bfX\geq t]\leq e^{-(\lambda t-\log\mathbb{E}e^{\lambda\bfX})}.$$

Following the idea in the previous subsection, here we take $\phi(x)=e^{\lambda x}$ for any $\lambda\geq0$. Let $t'=e^{\lambda t}>0$ for any $t\in\bbR$.

\begin{align} \bbP[\bfX\geq t] &= \bbP[e^{\lambda\bfX}\geq t']\\
&\leq\frac{\bbE[e^{\lambda\bfX}]}{t'}\\
&=\frac{\bbE[e^{\lambda\bfX}]}{e^{\lambda t}}. \end{align}

We denote the logarithm of the moment generating function of $\bfX$ by $\psi_{\bfX}(\lambda):=\log\mathbb{E}e^{\lambda\bfX}$ and define the Cramér transform $\psi^{*}_{\bfX}(t):=\max_{\lambda\geq0}\lambda t-\psi_{\bfX}(\lambda)$. Now we can rewrite Chernoff’s inequality as

$$\mathbb{P}[\bfX\geq t]\leq e^{-\psi^{*}_{\bfX}(t)}.$$

As a result, optimizing the performance of Chernoff’s inequality is equivalent to explicitly computing $\psi^{*}_{\bfX}(t)$.

• Properties of $\psi_{\bfX}(\lambda)$:
  • By Jensen’s inequality, $\psi_{\bfX}(\lambda)\geq\lambda\mathbb{E}[\bfX]$.
  • $S=\{\lambda>0:\ \mathbb{E}e^{\lambda\bfX}<\infty\}$ is either empty or an interval with left endpoint 0. Let $b=\sup S$.
  • $\psi_{\bfX}(\lambda)$ is convex and infinitely differentiable on $I=(0,b)$.
  • If $\mathbb{E}[\bfX]=0$, then $\psi_{\bfX}(\lambda)$ is continuously differentiable on $I=[0,b)$ with $\psi_{\bfX}'(0)=0$.
  • Thus, $\psi^{*}_{\bfX}(t)=\max_{\lambda\in I}\lambda t-\psi_{\bfX}(\lambda)$, and one can compute $\psi^{*}_{\bfX}(t)$ by differentiating $\lambda t-\psi_{\bfX}(\lambda)$ with respect to $\lambda$ and setting the derivative to zero.
• Triviality: $\psi^{*}_{\bfX}(t)=0$ when either
  • $\psi_{\bfX}(\lambda)=\infty$ for all $\lambda>0$, or
  • $t\leq\mathbb{E}[\bfX]$ (since $\lambda t-\psi_{\bfX}(\lambda)\leq\lambda(t-\mathbb{E}[\bfX])\leq0$).
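
As a standard worked example (not specific to these notes), take $\bfX\sim\mathcal{N}(0,1)$. Then $\psi_{\bfX}(\lambda)=\log\bbE e^{\lambda\bfX}=\frac{\lambda^2}{2}$, and differentiating $\lambda t-\frac{\lambda^2}{2}$ with respect to $\lambda$ and setting the derivative to zero gives $\lambda=t$ (valid for $t\geq0$), so $$\psi^{*}_{\bfX}(t)=\frac{t^2}{2}\quad\text{and hence}\quad\bbP[\bfX\geq t]\leq e^{-t^2/2}.$$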

Here, we list several useful forms of Chernoff’s inequalities.

Let $X_1,X_2,\dots,X_n$ be mutually independent random variables taking values in $\{0,1\}$, and let $\mu=\mathbb{E}[\sum_{i\in[n]}X_i]$. Then, for any $\epsilon>0$ in the first bound and any $\epsilon\in(0,1)$ in the latter two, \begin{align} \mathbb{P}\left[\sum_{i\in[n]}X_i\geq(1+\epsilon)\mu\right]\leq\left(\frac{e^{\epsilon}}{(1+\epsilon)^{1+\epsilon}}\right)^{\mu},\\
\mathbb{P}\left[\sum_{i\in[n]}X_i\leq(1-\epsilon)\mu\right]\leq\left(\frac{e^{-\epsilon}}{(1-\epsilon)^{1-\epsilon}}\right)^{\mu},\\
\mathbb{P}\left[\card{\sum_{i\in[n]}X_i-\mu}\geq\epsilon\cdot\mu\right]\leq2\cdot e^{-\epsilon^2\mu/3}. \end{align}

Let $X_1,X_2,\dots,X_n$ be mutually independent random variables taking values in $[a,b]$, and let $\mu=\mathbb{E}[\sum_{i\in[n]}X_i]$. Then, for any $\epsilon>0$, \begin{align} \mathbb{P}\left[\sum_{i\in[n]}X_i\geq(1+\epsilon)\mu\right]\leq e^{-\frac{2\epsilon^2\mu^2}{n(b-a)^2}},\\
\mathbb{P}\left[\sum_{i\in[n]}X_i\leq(1-\epsilon)\mu\right]\leq e^{-\frac{2\epsilon^2\mu^2}{n(b-a)^2}}. \end{align}

### Khintchine’s inequality

Khintchine’s inequality deals with the concentration of inner products with i.i.d. Rademacher random variables.

Let $\sigma_1,\sigma_2,\dots,\sigma_n$ be i.i.d. Rademacher random variables (uniform on $\{\pm1\}$) and $x=(x_1,x_2,\dots,x_n)\in\R^n$. We have $$\mathbb{P}\left[\sum_{i\in[n]}\sigma_ix_i>\lambda\right]\leq e^{-\frac{\lambda^2}{2\|x\|_2^2}}.$$

The proof is based on the standard exponential moment method. Let $s\geq0$ be chosen later. By Markov’s inequality we have $$\mathbb{P}\left[e^{s\sum_{i\in[n]}\sigma_ix_i}>e^{s\lambda}\right]\leq\frac{\mathbb{E}[e^{s\sum_{i\in[n]}\sigma_ix_i}]}{e^{s\lambda}}.$$ By the independence of the Rademacher random variables, we have $$\mathbb{E}\left[e^{s\sum_{i\in[n]}\sigma_ix_i}\right] = \prod_{i\in[n]}\mathbb{E}\left[e^{s\sigma_ix_i}\right] = \prod_{i\in[n]}\frac{1}{2}(e^{sx_i}+e^{-sx_i}).$$ By Taylor expansion, we can upper bound $e^{sx_i}+e^{-sx_i}$ by $2e^{s^2x_i^2/2}$ and yield $$\mathbb{P}\left[\sum_{i\in[n]}\sigma_ix_i>\lambda\right]\leq\frac{e^{s^2\|x\|_2^2/2}}{e^{s\lambda}}.$$ Finally, picking $s=\frac{\lambda}{\|x\|_2^2}$ we have $$\mathbb{P}\left[\sum_{i\in[n]}\sigma_ix_i>\lambda\right]\leq e^{-\frac{\lambda^2}{2\|x\|_2^2}}.$$
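
A quick Monte Carlo check of the Khintchine tail bound (my own illustration with a Gaussian test vector $x$ and threshold $\lambda=2\|x\|_2$, using numpy):

```python
import numpy as np

rng = np.random.default_rng(2)

n, trials = 50, 200_000
x = rng.standard_normal(n)        # a fixed vector x in R^n
norm_sq = np.dot(x, x)            # ||x||_2^2
lam = 2.0 * np.sqrt(norm_sq)      # threshold lambda = 2 * ||x||_2

# Draw i.i.d. Rademacher signs and form the inner products sum_i sigma_i x_i.
sigma = rng.choice([-1.0, 1.0], size=(trials, n))
inner = sigma @ x

empirical = np.mean(inner > lam)
bound = np.exp(-lam**2 / (2 * norm_sq))  # e^{-lambda^2 / (2 ||x||_2^2)}
assert empirical <= bound
```

Since $\sum_i\sigma_ix_i$ has variance $\|x\|_2^2$, the threshold is two standard deviations out: the empirical tail is roughly $0.02$, below the bound $e^{-2}\approx0.135$.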

### Hanson-Wright inequality

Another useful inequality about Rademacher random variables is the Hanson-Wright inequality. While Khintchine’s inequality deals with a linear form, the Hanson-Wright inequality deals with a quadratic form.

Let $B\in\bbR^{n\times n}$ be a symmetric matrix and let $\sigma_1,\sigma_2,\dots,\sigma_n$ be i.i.d. Rademacher random variables (uniform on $\{\pm1\}$). For every integer $p\geq1$, we have $$\left\|\sigma^\top B\sigma-\bbE_\sigma[\sigma^\top B\sigma]\right\|_p\leq O(\sqrt{p}\|B\|_F+p\|B\|)$$ where $\|X\|_p$ is defined as $\bbE[|X|^p]^{1/p}$, $\|\cdot\|_F$ is the Frobenius norm, and $\|\cdot\|$ is the operator norm.
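
Since the statement hides an absolute constant in the $O(\cdot)$, one can only check it numerically up to a constant. The sketch below (my own illustration; the constant $C=10$ is a generous guess, not from the original) estimates the $p$-norms of the centered quadratic form, using $\bbE_\sigma[\sigma^\top B\sigma]=\mathrm{tr}(B)$:

```python
import numpy as np

rng = np.random.default_rng(3)

n, trials = 30, 100_000
A = rng.standard_normal((n, n))
B = (A + A.T) / 2                          # a random symmetric matrix
fro = np.linalg.norm(B, 'fro')             # ||B||_F
op = np.linalg.norm(B, 2)                  # operator norm ||B||

# Rademacher chaos: sigma^T B sigma, centered by E[sigma^T B sigma] = tr(B).
sigma = rng.choice([-1.0, 1.0], size=(trials, n))
quad = np.einsum('ti,ij,tj->t', sigma, B, sigma)
centered = quad - np.trace(B)

C = 10.0  # hedged absolute constant standing in for the O(.)
for p in [1, 2, 4, 8]:
    p_norm = np.mean(np.abs(centered) ** p) ** (1 / p)
    assert p_norm <= C * (np.sqrt(p) * fro + p * op)
```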

The Hanson-Wright inequality has applications such as in the chaining technique. See a paper of mine for an example.

## Martingale concentration inequalities

A martingale is a generalization of i.i.d. samples. We say a sequence of random variables $M_1,M_2,\dots$ forms a martingale if $\bbE[M_{t}|M_1,\dots,M_{t-1}]=M_{t-1}$. There are also two variants called supermartingale and submartingale. The intuition is that given the past information (i.e., $M_1,\dots,M_{t-1}$, technically denoted as a filtration $\{\cF_{t-1}\}$), the expectation of where you are at the next time step (i.e., $\bbE[M_{t}|M_1,\dots,M_{t-1}]$) is known.

The reason why martingales are important and useful is that (i) many things can be modeled as a martingale and (ii) there are concentration inequalities for martingales! There are many martingale concentration inequalities on the market, but what you really need to remember is the following Freedman’s inequality, which basically subsumes all the others. For more details and properties of martingales, I might create a separate note soon.

Let $\{M_t\}_{t\geq0}$ be an adapted stochastic process (meaning that it does not depend on the future). Let $T\in\bbN$ and $a,c,\sigma_t,\mu_t\geq0$ be some constants for every $t\in[T]$. Suppose the following conditions hold.

• (Bounded difference) $|M_t-M_{t-1}|\leq c$ for all $t\in[T]$.
• (Bounded conditional expectation) $|\bbE[M_{t}-M_{t-1}|\cF_{t-1}]|\leq\mu_t$ for all $t\in[T]$.
• (Bounded conditional variance) $\Var[M_{t}|\cF_{t-1}]\leq\sigma_t^2$ for all $t\in[T]$.

Then, $$\bbP\left[\sup_{t\in[T]} \card{M_t-M_0}\geq a\right]<\exp\left(-\Omega\left(\frac{a^2}{\sum_{t\in[T]}\sigma_t^2+ca}\right)\right) \, .$$
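
Since the bound above hides its constants in $\Omega(\cdot)$, a simulation can only check a concrete instance of it. The sketch below (my own illustration: a $\pm1$ random walk, which is a martingale with $c=1$ and per-step conditional variance $1$) tests the deviation of the running supremum against the commonly stated explicit form $2\exp\big(-\frac{a^2}{2(\sum_t\sigma_t^2+ca/3)}\big)$, which is an assumption on the hidden constants rather than the notes' own statement:

```python
import numpy as np

rng = np.random.default_rng(4)

T, trials = 200, 50_000
c = 1.0            # bounded differences: |M_t - M_{t-1}| <= 1
var_sum = float(T) # sum_t Var[M_t - M_{t-1} | F_{t-1}] = T (one per step)

# Martingale: simple +/-1 random walk started at M_0 = 0.
steps = rng.choice([-1.0, 1.0], size=(trials, T))
paths = np.cumsum(steps, axis=1)
sup_dev = np.max(np.abs(paths), axis=1)   # sup_t |M_t - M_0|

a = 50.0
empirical = np.mean(sup_dev >= a)
# Explicit Freedman-style bound (assumed constants): 2*exp(-a^2 / (2*(V + c*a/3))).
bound = 2 * np.exp(-a**2 / (2 * (var_sum + c * a / 3)))
assert empirical <= bound
```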