Statistical Learning Theory Part 1: Hoeffding’s Inequality Derivation & Simulation
Derivation of Hoeffding’s Inequality from first principles
1: Background & Motivation
Hoeffding’s Inequality is an important concentration inequality in Mathematical Statistics and Machine Learning (ML), leveraged extensively in theoretical areas such as Statistical Learning Theory as well as in applied areas such as Reinforcement Learning.
I have noticed that in pockets of the ML community it is common to present Hoeffding’s Inequality as a “given,” with only slight intuition (if any) provided as to where the inequality comes from. I’m generally not a fan of this type of “magical thinking” when it comes to understanding mathematical material. Given that I will be writing future pieces that leverage Hoeffding’s Inequality extensively, I’ve written this piece as a primer deriving the inequality step-by-step from first principles.
We begin by stating Hoeffding’s Inequality, and in the following sections derive the inequality clearly step-by-step. We end the piece with a computational simulation showing the conservative (but probabilistically correct) bounds this inequality provides around Random Variables and empirical estimates of sampling estimators.
Suppose we have:
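- n i.i.d. Random Variables X₁, …, Xₙ with common mean μ = E[Xᵢ] and bounded support, i.e. a ≤ Xᵢ ≤ b almost surely;
- the sample mean X̄ₙ = (1/n) Σᵢ Xᵢ;
- a deviation threshold t > 0.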
Given the above conditions, Hoeffding’s Inequality provides the following two-sided statistical inequality:
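P(|X̄ₙ − μ| ≥ t) ≤ 2·exp(−2nt² / (b − a)²)

In words: the probability that the sample mean deviates from the true mean by at least t decays exponentially in both n and t².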
The sections that follow derive the above inequality from first principles. The Table of Contents for this piece is as follows:
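1: Background & Motivation
2: Markov’s Inequality
3: Chebyshev’s Inequality
4: Chernoff Bounds
5: Moment Generating Function (MGF) of Gaussian Random Variables
6: Gaussian Tail Bounds via Chernoff Bounds
7: sub-Gaussian Random Variables
8: Popoviciu’s Inequality on Variances
9: Hoeffding’s Inequality
10: Computational Simulation
11: Wrap-Up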
With that said, let’s jump in.
2: Markov’s Inequality
Beginning from first principles, we start with Markov’s Inequality.
Suppose we have:
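- a non-negative Random Variable X, i.e. X ≥ 0 almost surely, with finite expectation E[X];
- a constant t > 0.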
Given the above conditions, Markov’s Inequality provides the following statistical inequality:
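P(X ≥ t) ≤ E[X] / t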
Proof:
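Since X ≥ 0,

E[X] ≥ E[X · 1{X ≥ t}] ≥ t · E[1{X ≥ t}] = t · P(X ≥ t),

and dividing both sides by t > 0 gives the result.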
Markov’s Inequality provides a fairly loose bound. If our Random Variable of interest has a defined and finite variance, we can tighten Markov’s Inequality with Chebyshev’s Inequality, as shown in the next section.
3: Chebyshev’s Inequality
We next move to Chebyshev’s Inequality, which is an immediate consequence of Markov’s Inequality.
Suppose we have:
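- a Random Variable X with finite mean μ = E[X] and finite variance σ² = Var(X);
- a constant t > 0.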
Given the above conditions, Chebyshev’s Inequality provides the following statistical inequality:
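P(|X − μ| ≥ t) ≤ σ² / t²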
Proof:
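The Random Variable (X − μ)² is non-negative, so applying Markov’s Inequality to it at level t²:

P(|X − μ| ≥ t) = P((X − μ)² ≥ t²) ≤ E[(X − μ)²] / t² = σ² / t².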
Note that above we leveraged the fact that X has a defined and finite variance, i.e. its second central moment is defined and finite. If X has defined and finite moments up to degree r, we can extend the above to the following inequality:
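P(|X − μ| ≥ t) ≤ E[|X − μ|^r] / t^r, for any t > 0.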
For many Random Variables, the Moment Generating Function (MGF) exists in a neighborhood around zero, i.e. E[exp(λX)] is finite for all |λ| ≤ b, where b > 0 is some constant. In these cases we can use the MGF to produce a tail bound, as with the Chernoff Bounds in the next section.
4: Chernoff Bounds
By extending Chebyshev’s Inequality from second moments to the exponential moments captured by the MGF, we derive Chernoff Bounds.
Suppose we have:
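- a Random Variable X with mean μ = E[X] whose centered MGF, φ(λ) = E[exp(λ(X − μ))], is finite for all λ ∈ [0, b] with b > 0;
- a constant t > 0.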
Given the above conditions, Chernoff Bounds provide the following statistical inequality:
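P(X − μ ≥ t) ≤ inf over 0 < λ ≤ b of exp(−λt) · E[exp(λ(X − μ))]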
Proof:
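For any fixed λ ∈ (0, b], the map x ↦ exp(λx) is increasing and exp(λ(X − μ)) is non-negative, so by Markov’s Inequality:

P(X − μ ≥ t) = P(exp(λ(X − μ)) ≥ exp(λt)) ≤ E[exp(λ(X − μ))] / exp(λt).

Since this holds for every such λ, taking the infimum over λ gives the tightest bound.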
In Section 6 we will derive Chernoff Bounds specifically for Gaussian Random Variables. In preparation for doing so, however, we will first derive the MGF of Gaussian Random Variables in the next section.
5: Moment Generating Function (MGF) of Gaussian Random Variables
We will first derive the Moment Generating Function (MGF) for a single Gaussian Random Variable, and then derive the MGF of the centered mean of i.i.d. Gaussian Random Variables.
5.1: MGF of a Single Gaussian Random Variable
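For X ~ N(μ, σ²), the MGF is defined for all λ ∈ ℝ and equals

E[exp(λX)] = exp(μλ + σ²λ²/2).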
Proof:
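Writing out the Gaussian density and completing the square in the exponent:

E[exp(λX)] = ∫ exp(λx) · (1/√(2πσ²)) · exp(−(x − μ)²/(2σ²)) dx.

The exponent satisfies λx − (x − μ)²/(2σ²) = μλ + σ²λ²/2 − (x − μ − σ²λ)²/(2σ²), so

E[exp(λX)] = exp(μλ + σ²λ²/2) · ∫ (1/√(2πσ²)) · exp(−(x − μ − σ²λ)²/(2σ²)) dx = exp(μλ + σ²λ²/2),

since the remaining integrand is the density of N(μ + σ²λ, σ²) and integrates to 1.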
We next extend the above to the centered mean of i.i.d. Gaussian Random Variables.
5.2: MGF of Centered Mean of i.i.d. Gaussian Random Variables
Consider n independently and identically distributed (i.i.d.) Gaussian Random Variables:
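Xᵢ ~ N(μ, σ²), i = 1, …, n, with sample mean X̄ₙ = (1/n) Σᵢ Xᵢ. The MGF of the centered mean X̄ₙ − μ is

E[exp(λ(X̄ₙ − μ))] = exp(σ²λ² / (2n)).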
Proof:
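By independence, the expectation factorizes across the Xᵢ, and each factor is the MGF of the centered Gaussian Xᵢ − μ ~ N(0, σ²) evaluated at λ/n:

E[exp(λ(X̄ₙ − μ))] = Πᵢ E[exp((λ/n)(Xᵢ − μ))] = Πᵢ exp(σ²λ²/(2n²)) = exp(σ²λ²/(2n)).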
6: Gaussian Tail Bounds via Chernoff Bounds
Leveraging the results from sections 4 and 5, we now derive Chernoff Bounds for the centered mean of i.i.d. Gaussian Random Variables.
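For X̄ₙ the mean of n i.i.d. N(μ, σ²) Random Variables and any t > 0:

P(X̄ₙ − μ ≥ t) ≤ exp(−nt² / (2σ²)), and hence the two-sided bound P(|X̄ₙ − μ| ≥ t) ≤ 2·exp(−nt² / (2σ²)).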
Proof:
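Applying the Chernoff Bound with the MGF from section 5.2:

P(X̄ₙ − μ ≥ t) ≤ inf over λ > 0 of exp(σ²λ²/(2n) − λt).

The exponent is a convex quadratic in λ, minimized at λ = nt/σ², which yields exp(−nt²/(2σ²)). The Gaussian is symmetric about its mean, so the same bound holds for the lower tail, and a union bound gives the two-sided statement.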
In the following section, we will explore sub-Gaussian Random Variables, a family of Random Variables for which the Gaussian tail bounds above still hold.
7: sub-Gaussian Random Variables
In the previous section we derived Chernoff Bounds for the centered mean of i.i.d. Gaussian Random Variables. It turns out that these Gaussian tail inequalities apply much more broadly, to a class of Random Variables called sub-Gaussian Random Variables. Roughly speaking, these are Random Variables whose tails decay at least as fast as those of a Gaussian.
In the following subsections, we formally define the class of sub-Gaussian Random Variables, prove Rademacher Random Variables are within the sub-Gaussian class, and prove all bounded Random Variables are within the sub-Gaussian class.
7.1: Definition of the sub-Gaussian Random Variable class
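A Random Variable X with mean μ = E[X] is sub-Gaussian with parameter σ > 0 if its centered MGF is dominated by that of a N(0, σ²) Random Variable:

E[exp(λ(X − μ))] ≤ exp(σ²λ²/2) for all λ ∈ ℝ.

Running the Chernoff argument of section 6 with this MGF bound in place of the exact Gaussian MGF immediately gives the same tail bounds: P(X − μ ≥ t) ≤ exp(−t²/(2σ²)), and for the mean X̄ₙ of n i.i.d. sub-Gaussian Random Variables with parameter σ, P(|X̄ₙ − μ| ≥ t) ≤ 2·exp(−nt²/(2σ²)).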
7.2: Rademacher Random Variables are sub-Gaussian
We next prove Rademacher Random Variables are sub-Gaussian.
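A Rademacher Random Variable ε takes the values +1 and −1, each with probability 1/2, so E[ε] = 0 and, for any λ ∈ ℝ:

E[exp(λε)] = (exp(λ) + exp(−λ)) / 2 = Σ over k ≥ 0 of λ^(2k) / (2k)! ≤ Σ over k ≥ 0 of λ^(2k) / (2^k · k!) = exp(λ²/2),

where the inequality uses (2k)! ≥ 2^k · k! term by term. Hence Rademacher Random Variables are sub-Gaussian with parameter σ = 1.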
7.3: All bounded Random Variables are sub-Gaussian
Lastly, we will prove all bounded Random Variables (i.e. those with a bounded support) are sub-Gaussian.
Suppose we have:
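- a Random Variable X with bounded support [a, b], i.e. a ≤ X ≤ b almost surely.

A standard symmetrization argument then shows X is sub-Gaussian. Let X′ be an independent copy of X and ε an independent Rademacher Random Variable. By Jensen’s Inequality, E[exp(λ(X − E[X]))] ≤ E[exp(λ(X − X′))]; since X − X′ is symmetric about zero, it has the same distribution as ε(X − X′). Conditioning on (X, X′) and applying the Rademacher bound from section 7.2 yields

E[exp(λ(X − E[X]))] ≤ E[exp(λ²(X − X′)²/2)] ≤ exp(λ²(b − a)²/2),

i.e. X is sub-Gaussian with parameter at most (b − a).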
In the next section, through Popoviciu’s Inequality on Variances, we show that this variance term can be tightened further, from (b-a)² to (b-a)²/4.
8: Popoviciu’s Inequality on Variances
For Random Variable X with bounded support [a,b], Popoviciu’s Inequality provides the following inequality bound on the variance Var(X):
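Var(X) ≤ (b − a)² / 4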
Proof:
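For any constant c, define g(c) = E[(X − c)²]. Expanding the square shows g is minimized at c = E[X], where g(E[X]) = Var(X). Choosing c = (a + b)/2, the midpoint of the support, therefore gives

Var(X) ≤ E[(X − (a + b)/2)²].

Since a ≤ X ≤ b almost surely, |X − (a + b)/2| ≤ (b − a)/2.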
Therefore:
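Var(X) ≤ E[(X − (a + b)/2)²] ≤ ((b − a)/2)² = (b − a)²/4.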
9: Hoeffding’s Inequality
Suppose we have:
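- n i.i.d. Random Variables X₁, …, Xₙ with common mean μ = E[Xᵢ] and bounded support [a, b];
- the sample mean X̄ₙ = (1/n) Σᵢ Xᵢ;
- a deviation threshold t > 0.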
Taking the sub-Gaussian tail bounds from section 7 and Popoviciu’s Inequality from section 8, we have:
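Each Xᵢ is sub-Gaussian by section 7.3, and by section 8 its variance term (b − a)² can be tightened to (b − a)²/4. Using σ² = (b − a)²/4 in the sub-Gaussian tail bound for the sample mean from section 7.1:

P(X̄ₙ − μ ≥ t) ≤ exp(−nt² / (2σ²)) = exp(−2nt² / (b − a)²).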
… and we finally arrive at the two-sided Hoeffding’s Inequality:
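P(|X̄ₙ − μ| ≥ t) ≤ 2·exp(−2nt² / (b − a)²)

This is exactly the statement we began with in section 1.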
10: Computational Simulation
We now perform a computational simulation in Python showing the conservative (but probabilistically correct) bounds Hoeffding’s Inequality can provide around Random Variables and empirical estimates of sampling estimators.
Let us start by loading our libraries:
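The sketch below assumes only numpy (for random sampling) and pandas (for tabulating results):

```python
import numpy as np
import pandas as pd
```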
Next, we create a function to simulate data and recover Hoeffding Bounds:
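A minimal sketch of such helpers follows; the names (hoeffding_bound, gaussian_chernoff_bound, simulate_coverage) and their signatures are illustrative choices rather than a fixed API. Setting the two-sided bound equal to δ and solving for t gives the half-widths used here: t = (b − a)·√(ln(2/δ)/(2n)) for a bounded support of width b − a, and t = σ·√(2·ln(2/δ)/n) for the Gaussian Chernoff case.

```python
def hoeffding_bound(n, support_width, delta):
    """Half-width t solving 2*exp(-2*n*t^2 / width^2) = delta (two-sided Hoeffding)."""
    return support_width * np.sqrt(np.log(2.0 / delta) / (2.0 * n))


def gaussian_chernoff_bound(n, sigma, delta):
    """Half-width t solving 2*exp(-n*t^2 / (2*sigma^2)) = delta (two-sided Gaussian Chernoff)."""
    return sigma * np.sqrt(2.0 * np.log(2.0 / delta) / n)


def simulate_coverage(sampler, true_mean, bound, n=100, n_trials=10_000, seed=0):
    """Fraction of simulated sample means landing within +/- bound of the true mean."""
    rng = np.random.default_rng(seed)
    sample_means = np.array([sampler(rng, n).mean() for _ in range(n_trials)])
    return float(np.mean(np.abs(sample_means - true_mean) < bound))
```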
Lastly, we run the simulation and recover the results:
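One way to run the experiment, sketched with illustrative distribution parameters (Rademacher, Beta(2, 5), Binomial(10, 0.3), Uniform(0, 1), and a standard Gaussian), n = 100 observations per sample mean, and δ = 0.2:

```python
n, delta = 100, 0.20

# name: (sampler, true mean of one observation, bound half-width for the sample mean)
experiments = {
    "Rademacher":        (lambda rng, n: rng.choice([-1.0, 1.0], size=n), 0.0,
                          hoeffding_bound(n, support_width=2.0, delta=delta)),
    "Beta(2, 5)":        (lambda rng, n: rng.beta(2, 5, size=n), 2.0 / 7.0,
                          hoeffding_bound(n, support_width=1.0, delta=delta)),
    "Binomial(10, 0.3)": (lambda rng, n: rng.binomial(10, 0.3, size=n), 3.0,
                          hoeffding_bound(n, support_width=10.0, delta=delta)),
    "Uniform(0, 1)":     (lambda rng, n: rng.uniform(0.0, 1.0, size=n), 0.5,
                          hoeffding_bound(n, support_width=1.0, delta=delta)),
    "Gaussian(0, 1)":    (lambda rng, n: rng.normal(0.0, 1.0, size=n), 0.0,
                          gaussian_chernoff_bound(n, sigma=1.0, delta=delta)),
}

results = pd.DataFrame([
    {
        "distribution": name,
        "bound half-width": bound,
        "empirical coverage": simulate_coverage(sampler, mu, bound, n=n),
        "guaranteed coverage": 1.0 - delta,
    }
    for name, (sampler, mu, bound) in experiments.items()
])
print(results)
```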
From the analysis above, we simulated data for Rademacher, Beta, Binomial, Uniform, and Gaussian sampling estimators (all parametric distributions within the sub-Gaussian class) and recovered Hoeffding Bounds (Chernoff Bounds in the case of the Gaussian sampling estimator) for a delta of 20%. As we can see from the results, the Hoeffding Bounds are conservative: all provide empirical coverage well above the guaranteed 1 − δ = 80%.
11: Wrap-Up
I will be writing future pieces on both Reinforcement Learning and Statistical Learning Theory for both finite and infinite parameter function classes, where leveraging Hoeffding’s Inequality will be essential.
For solid Statistical Learning Theory reference material, I would recommend the textbooks “All of Statistics” and “All of Nonparametric Statistics” by Larry Wasserman (Statistics and Machine Learning Professor at Carnegie Mellon), and “The Elements of Statistical Learning” by faculty at Stanford.
I look forward to writing future pieces, and please subscribe and follow me here on Medium!