Statistical Learning Theory Part 5: Shattering Coefficient
Proof of Consistency, Rates, and Generalization Bounds for ML Estimators over Infinite Function Classes Leveraging the Shattering Coefficient
1: Background & Motivation
As stated in previous pieces in this series, Statistical Learning Theory provides a probabilistic framework for understanding the problem of Machine Learning Inference. In mathematical terms, the basic goals of Statistical Learning Theory can be formulated as:
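$$
\text{given } (X_1, Y_1), \dots, (X_n, Y_n) \overset{\text{iid}}{\sim} P \text{ and a function class } \mathcal{F}, \text{ learn } \hat{f}_n \in \mathcal{F} \text{ such that } R(\hat{f}_n) \approx \inf_{f \in \mathcal{F}} R(f)
$$

with high probability over the draw of the training sample, where R(f) denotes the true (population) risk of f. This is a loose, informal statement; the earlier parts of this series make each ingredient precise.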
This piece is Part 5 in a Series on Statistical Learning Theory. The first four pieces are:
- Part 1: Hoeffding’s Inequality Derivation & Simulation
- Part 2: Optimality of the Bayes Classifier
- Part 3: Convergence and Consistency of Learned ML Estimators
- Part 4: Consistency over Finite Function Classes
In Part 1 of this series we derived Hoeffding’s Inequality from first principles, in Part 2 we proved optimality of the Bayes Classifier, in Part 3 we developed theory to assess consistency of data-adaptive machine learning estimators, and in Part 4 we derived consistency rates and generalization bounds for ML estimators over finite-sized function classes. In this piece we extend that theory to learned ML estimators over infinite-sized function classes and derive consistency rates and generalization bounds by leveraging the Shattering Coefficient.
To motivate the current problem of interest, consider:
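$$
\hat{f}_n \;=\; \underset{f \in \mathcal{F}}{\arg\min}\; \hat{R}_n(f), \qquad \mathcal{F} = \{ f_1, \dots, f_N \}, \quad N < \infty,
$$

i.e., roughly speaking, an estimator learned from a finite function class by empirical risk minimization, where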
we define:
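$$
\hat{R}_n(f) \;=\; \frac{1}{n} \sum_{i=1}^{n} \mathbb{1}\{ f(X_i) \neq Y_i \}, \qquad R(f) \;=\; \mathbb{P}\big( f(X) \neq Y \big),
$$

i.e., the empirical risk under the 0-1 loss and the true (population) risk of a classifier f, as in earlier parts of this series,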
and recall:
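$$
\mathbb{P}\left( \sup_{f \in \mathcal{F}} \big| \hat{R}_n(f) - R(f) \big| > \epsilon \right) \;\le\; 2\,|\mathcal{F}|\, e^{-2 n \epsilon^2},
$$

i.e., the uniform bound we obtained in Part 4 for a finite class via Hoeffding’s Inequality and a union bound. Since the right-hand side scales with |F|, this bound is only meaningful when the function class is finite.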
However, what if we consider an infinite-sized function class rather than a finite one, for example the class of all linear models? In such a case, do we still have consistency?
In the notes that follow, we derive inequalities, rates, and generalization bounds for ML estimators over infinite function classes leveraging the Shattering Coefficient.
The Table of Contents for this piece is as follows:
- 1: Background & Motivation
- 2: Shattering Coefficient
- 3: Statistical Inequality via Ghost Sample
- 4: Generalization Bounds and Consistency Rates
- 5: Wrap-up & Conclusions
With that said, let’s jump in.
2: Shattering Coefficient
2.1: Definition of the Shattering Coefficient
We would like to measure the capacity of an infinite function class. The Shattering Coefficient is the simplest such capacity measure.
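For a class F of binary classifiers, the Shattering Coefficient (also called the growth function) at sample size n is the largest number of distinct labelings of n points that functions in F can produce:

$$
S(\mathcal{F}, n) \;=\; \max_{x_1, \dots, x_n} \Big| \big\{ \big( f(x_1), \dots, f(x_n) \big) : f \in \mathcal{F} \big\} \Big|.
$$

Note that S(F, n) ≤ 2^n always holds, and F is said to "shatter" the points x_1, ..., x_n when every one of the 2^n possible labelings is achievable.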
Let’s work through the Shattering Coefficient for a few simple toy examples.
2.2: Toy Example #1
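As a first illustrative example, take the class of threshold classifiers on the real line, F = { f_a(x) = 1{x ≥ a} : a ∈ R }. The class is infinite, yet for any n distinct points only n + 1 labelings are achievable (sweep the threshold from left to right and watch the points flip one at a time), so

$$
S(\mathcal{F}, n) \;=\; n + 1.
$$

Below is a minimal Python sketch that brute-forces this count by trying one threshold below, between, and above the sorted points (a sufficient set of representative thresholds); the function and variable names here are purely for illustration.

```python
import random

def count_labelings(points, classifiers):
    """Count the distinct labelings of `points` realizable by `classifiers`."""
    labelings = {tuple(int(c(x)) for x in points) for c in classifiers}
    return len(labelings)

def threshold_classifiers(points):
    """One representative threshold below, between, and above the sorted points."""
    xs = sorted(points)
    cuts = [xs[0] - 1.0] + [(a + b) / 2 for a, b in zip(xs, xs[1:])] + [xs[-1] + 1.0]
    return [lambda x, a=a: x >= a for a in cuts]

n = 7
points = random.sample(range(100), n)  # n distinct points on the real line
print(count_labelings(points, threshold_classifiers(points)))  # prints n + 1 = 8
```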
2.3: Toy Example #2
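As a second illustrative example, consider indicator functions of closed intervals, F = { f_{a,b}(x) = 1{a ≤ x ≤ b} : a ≤ b }. For n distinct points on the real line, the achievable labelings are exactly the n(n+1)/2 contiguous nonempty runs of points plus the all-zeros labeling, so

$$
S(\mathcal{F}, n) \;=\; \frac{n(n+1)}{2} + 1.
$$

Again the class is infinite, but its Shattering Coefficient grows only polynomially in n.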
2.4: Toy Example #3
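As a richer illustration, consider linear classifiers (half-planes) in the plane, F = { f_{w,b}(x) = 1{w·x + b ≥ 0} }. For n points in general position, a classical counting argument (Cover’s function-counting theorem) gives

$$
S(\mathcal{F}, n) \;=\; 2\left[ \binom{n-1}{0} + \binom{n-1}{1} + \binom{n-1}{2} \right] \;=\; n^2 - n + 2.
$$

In particular, all 2^n labelings are achievable for n ≤ 3 (three points in general position can be shattered by a line), but not for n ≥ 4. The common theme across these toy examples is that infinite classes can still have Shattering Coefficients that grow polynomially rather than exponentially in n, and that is exactly what we exploit in the bounds below.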
3: Statistical Inequality via Ghost Sample
Starting with our statistical inequality from Part 4 of this series:
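$$
\mathbb{P}\left( R(\hat{f}_n) - \inf_{f \in \mathcal{F}} R(f) > \epsilon \right) \;\le\; \mathbb{P}\left( \sup_{f \in \mathcal{F}} \big| \hat{R}_n(f) - R(f) \big| > \frac{\epsilon}{2} \right)
$$

(stated here up to constants), which reduces the excess risk of the learned estimator to a uniform deviation term over the function class.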
In this section we will prove a further bound on the right-hand side of the inequality above:
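$$
\mathbb{P}\left( \sup_{f \in \mathcal{F}} \big| \hat{R}_n(f) - R(f) \big| > \epsilon \right) \;\le\; 4\, S(\mathcal{F}, 2n)\, e^{-n\epsilon^2 / 8}
$$

(the exact constants vary across references, but a bound of this general form is what we are after). Note that the Shattering Coefficient S(F, 2n) has replaced the cardinality |F|, which is what makes the result usable for infinite classes.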
To prove the above statistical inequality, we start by proving the following intermediate result, which we will later leverage:
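Let (X'_1, Y'_1), ..., (X'_n, Y'_n) be a "ghost sample," i.e., an independent copy of the training data, and let R̂'_n(f) denote the empirical risk computed on that ghost sample. Then, provided nε² ≥ 2,

$$
\mathbb{P}\left( \sup_{f \in \mathcal{F}} \big| \hat{R}_n(f) - R(f) \big| > \epsilon \right) \;\le\; 2\, \mathbb{P}\left( \sup_{f \in \mathcal{F}} \big| \hat{R}_n(f) - \hat{R}'_n(f) \big| > \frac{\epsilon}{2} \right).
$$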
Proof of the above intermediate result is as follows:
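At a high level, the argument runs as follows. Let f* be a (data-dependent) function achieving the supremum on the first sample, so that |R̂_n(f*) − R(f*)| > ε on the event of interest. Because the ghost sample is independent of the first sample, Chebyshev’s Inequality gives P(|R̂'_n(f*) − R(f*)| ≤ ε/2 | first sample) ≥ 1/2 whenever nε² ≥ 2. On the intersection of these two events, the triangle inequality forces |R̂_n(f*) − R̂'_n(f*)| > ε/2, and therefore

$$
\mathbb{P}\left( \sup_{f \in \mathcal{F}} \big| \hat{R}_n(f) - \hat{R}'_n(f) \big| > \frac{\epsilon}{2} \right) \;\ge\; \frac{1}{2}\, \mathbb{P}\left( \sup_{f \in \mathcal{F}} \big| \hat{R}_n(f) - R(f) \big| > \epsilon \right),
$$

which rearranges to the stated intermediate result.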
We are now ready to prove that:
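$$
\mathbb{P}\left( \sup_{f \in \mathcal{F}} \big| \hat{R}_n(f) - R(f) \big| > \epsilon \right) \;\le\; 2\, \mathbb{P}\left( \sup_{f \in \mathcal{F}} \big| \hat{R}_n(f) - \hat{R}'_n(f) \big| > \frac{\epsilon}{2} \right) \;\le\; 4\, S(\mathcal{F}, 2n)\, e^{-n\epsilon^2 / 8},
$$

i.e., the full Shattering Coefficient bound stated at the start of this section (again, with constants that may differ slightly depending on the reference).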
Proof of the above inequality is as follows:
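The first inequality is exactly the ghost-sample lemma proven above. For the second, note that conditionally on the combined 2n data points (the original sample plus the ghost sample), each f ∈ F affects R̂_n(f) − R̂'_n(f) only through its vector of values on those 2n points, so at most S(F, 2n) distinct behaviors need to be considered. A union bound over these behaviors, together with a Hoeffding-type exponential tail bound for each fixed behavior (applied conditionally via a random sign/swap argument, since R̂_n(f) − R̂'_n(f) is a centered average of n bounded terms), gives

$$
\mathbb{P}\left( \sup_{f \in \mathcal{F}} \big| \hat{R}_n(f) - \hat{R}'_n(f) \big| > \frac{\epsilon}{2} \right) \;\le\; 2\, S(\mathcal{F}, 2n)\, e^{-n\epsilon^2 / 8},
$$

and combining this with the ghost-sample lemma yields the stated bound. Crucially, |F| never enters: only the number of distinct behaviors on 2n points, i.e., the Shattering Coefficient, matters.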
Using the statistical inequality above, in the next section we examine generalization bounds and consistency rates leveraging the Shattering Coefficient.
4: Generalization Bounds and Consistency Rates
From the results in the previous section, we are now ready to show:
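a generalization bound and consistency rate of roughly the following form (constants again vary by reference). Setting the tail bound from the previous section equal to δ and solving for ε, with probability at least 1 − δ,

$$
\sup_{f \in \mathcal{F}} \big| \hat{R}_n(f) - R(f) \big| \;\le\; \sqrt{ \frac{8}{n} \left( \log S(\mathcal{F}, 2n) + \log \frac{4}{\delta} \right) },
$$

and consequently, via the ERM decomposition,

$$
R(\hat{f}_n) - \inf_{f \in \mathcal{F}} R(f) \;\le\; 2 \sqrt{ \frac{8}{n} \left( \log S(\mathcal{F}, 2n) + \log \frac{4}{\delta} \right) }.
$$

In particular, the learned estimator is consistent (relative to the best function in F) whenever log S(F, 2n) / n → 0, which holds for any class whose Shattering Coefficient grows polynomially in n, as in the toy examples of Section 2.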
The proof of the above generalization bounds is as follows:
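In short: invert the tail bound from Section 3 by setting 4 S(F, 2n) e^{−nε²/8} = δ and solving for ε, which gives the uniform deviation bound above with probability at least 1 − δ; then feed that uniform bound through the ERM decomposition R(f̂_n) − inf_{f ∈ F} R(f) ≤ 2 sup_{f ∈ F} |R̂_n(f) − R(f)| to obtain the excess-risk bound and, letting n → ∞, the consistency rate.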
5: Wrap-up & Conclusions
Note that although we were able to derive generalization bounds and consistency rates for ML estimators over infinite-sized function classes, the approaches in this piece suffer from some drawbacks. Namely:
- Beyond simple toy examples, the Shattering Coefficient is generally very difficult to compute.
- The Shattering Coefficient is also a function of the specific sample size “n,” meaning we need to know how fast it grows asymptotically in order to leverage it for the purposes of this piece.
In Part 6 of this series, we will leverage another tool to measure the capacity of infinite-sized function classes: the Vapnik-Chervonenkis (VC) dimension. As we will see in the next piece, the VC dimension is, for certain use-cases, more tractable to calculate than the Shattering Coefficient. And unlike the Shattering Coefficient, the VC dimension does not depend on the sample size “n”.
For solid reference material on Statistical Learning Theory, I would recommend the textbooks “All of Statistics” and “All of Nonparametric Statistics” by Larry Wasserman (Professor of Statistics and Machine Learning at Carnegie Mellon), “The Elements of Statistical Learning” by Hastie, Tibshirani, and Friedman of Stanford, and “Statistical Learning Theory” by Vladimir Vapnik.
I look forward to writing future pieces, and please subscribe and follow me here on Medium!