Statistical Learning Theory Part 5: Shattering Coefficient

Proof of Consistency, Rates, and Generalization Bounds for ML Estimators over Infinite Function Classes leveraging the Shattering Coefficient

Andrew Rothman
5 min read · Nov 3, 2023

1: Background & Motivation

As stated in previous pieces in this series, Statistical Learning Theory provides a probabilistic framework for understanding the problem of Machine Learning Inference. In mathematical terms, the basic goals of Statistical Learning Theory can be formulated as:

image by author

This piece is Part 5 in a series on Statistical Learning Theory.

In Part 1 of this series we derived Hoeffding’s Inequality from first principles, in Part 2 we proved the optimality of the Bayes Classifier, in Part 3 we developed theory to assess the consistency of data-adaptive machine learning sampling estimators, and in Part 4 we derived consistency rates and generalization bounds for ML estimators over finite-sized function classes. In this piece we extend that theory to ML estimators learned over infinite-sized function classes, deriving consistency rates and generalization bounds by leveraging the Shattering Coefficient.

To motivate the current problem of interest, consider:

image by author

we define:

image by author

and recall:

image by author
image by author

However, what happens if we consider an infinite-sized function class rather than a finite one, for example the function class of all linear models? In that case, do we still have consistency?

In the notes that follow, we derive inequalities, rates, and generalization bounds for ML estimators over infinite function classes leveraging the Shattering Coefficient.

image by author

The Table of Contents for this piece is as follows:

image by author

With that said, let’s jump in.

2: Shattering Coefficient

2.1: Definition of the Shattering Coefficient

We would like to measure the capacity of an infinite function class. The Shattering Coefficient is the simplest such capacity measure.

image by author
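For readers following along without the figure, one standard way to write the definition (the notation may differ slightly from the image above) is:

```latex
% Shattering coefficient (also called the growth function) of a binary
% function class \mathcal{F}, evaluated at sample size n: the largest
% number of distinct labelings the class can realize on any n points.
S(\mathcal{F}, n)
  \;=\;
  \max_{x_1, \ldots, x_n \in \mathcal{X}}
  \Bigl|
    \bigl\{ \, (f(x_1), \ldots, f(x_n)) \; : \; f \in \mathcal{F} \, \bigr\}
  \Bigr|
```

By construction S(F, n) ≤ 2^n, and we say the class "shatters" a set of n points when all 2^n possible labelings of those points are realized.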

Let’s work through the Shattering Coefficient on a few simple toy examples.

2.2: Toy Example #1

image by author

2.3: Toy Example #2

image by author

2.4: Toy Example #3

image by author
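Since the toy examples above live in the figures, here is a small brute-force sketch that makes the counting concrete. It is not necessarily one of the exact function classes pictured above: it uses one-dimensional threshold classifiers of the form f(x) = 1{x ≥ t}, for which the number of achievable labelings on n distinct points is n + 1. The function name and the sample points are purely illustrative.

```python
def threshold_labelings(points):
    """Count the distinct labelings of `points` achievable by f(x) = 1{x >= t}.

    For one-dimensional threshold classifiers this count is n + 1 on n
    distinct points: sweeping the threshold t from right to left switches
    the points "on" one at a time in sorted order.
    """
    xs = sorted(points)
    # Candidate thresholds: below all points, between consecutive points,
    # and above all points. Only these can change the induced labeling.
    candidates = (
        [xs[0] - 1.0]
        + [(a + b) / 2.0 for a, b in zip(xs, xs[1:])]
        + [xs[-1] + 1.0]
    )
    labelings = {tuple(1 if x >= t else 0 for x in points) for t in candidates}
    return len(labelings)


if __name__ == "__main__":
    pts = [0.3, 1.7, 2.2, 5.0]
    print(threshold_labelings(pts))  # prints 5, i.e. n + 1 for n = 4 points
```

The same enumeration idea works for any simple class: generate the candidate classifiers that can differ on the sample and count the distinct label vectors they induce.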

3: Statistical Inequality via Ghost Sample

Starting with our statistical inequality from Part 4 of this series:

image by author

In this section we will prove a further bound on the right-hand side of the inequality above, as follows:

image by author
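The exact constants depend on the derivation and vary across textbooks, but a bound of this type typically has the following shape (this is the form of the VC inequality stated in Wasserman’s “All of Statistics”):

```latex
% One common form of the VC inequality. The constants (8 and 32 here)
% differ between references depending on the symmetrization argument used.
\mathbb{P}\!\left(
  \sup_{f \in \mathcal{F}}
  \bigl| \widehat{R}_n(f) - R(f) \bigr| > \varepsilon
\right)
\;\le\;
8 \, S(\mathcal{F}, n) \, e^{-n \varepsilon^2 / 32}
```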

To prove the above statistical inequality, we start by establishing the following intermediate result, which we will leverage later:

image by author
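In ghost-sample arguments, the intermediate result is typically a symmetrization lemma. One standard statement (which may differ in its constants from the figure above) introduces an independent "ghost" sample of the same size n, with empirical risk denoted R̂′_n, and asserts that once nε² ≥ 2:

```latex
% Symmetrization via a ghost sample: \widehat{R}'_n is the empirical risk
% computed on an independent copy of the data. Valid once n \varepsilon^2 \ge 2.
\mathbb{P}\!\left(
  \sup_{f \in \mathcal{F}} \bigl| \widehat{R}_n(f) - R(f) \bigr| > \varepsilon
\right)
\;\le\;
2 \, \mathbb{P}\!\left(
  \sup_{f \in \mathcal{F}} \bigl| \widehat{R}_n(f) - \widehat{R}'_n(f) \bigr| > \varepsilon / 2
\right)
```

The payoff is that the right-hand side depends on the function class only through its behavior on 2n sample points, which is exactly what the Shattering Coefficient controls.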

Proof of the above intermediate result is as follows:

image by author
image by author

We are now ready to prove that:

image by author

Proof of the above inequality is as follows:

image by author

Using the statistical inequality above, in the next section we examine generalization bounds and consistency rates leveraging the Shattering Coefficient.

4: Generalization Bounds and Consistency Rates

From the results in the previous section, we are now ready to show:

image by author
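Stated without the figure, a generalization bound of this type is obtained by inverting a tail bound of the VC-inequality form sketched earlier (set the right-hand side equal to δ and solve for ε), giving a result of roughly the following shape, with constants that depend on the version of the inequality used:

```latex
% With probability at least 1 - \delta, uniformly over f \in \mathcal{F}:
\bigl| \widehat{R}_n(f) - R(f) \bigr|
\;\le\;
\sqrt{ \frac{32}{n} \left( \log S(\mathcal{F}, n) + \log \frac{8}{\delta} \right) }
```

In particular, applying this uniform bound to the empirical risk minimizer controls its excess risk relative to the best function in the class.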

The proof of the above generalization bounds is as follows:

image by author
image by author
image by author
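To see how such a bound behaves numerically, here is a minimal sketch that evaluates the gap term from the form stated above. The function name, the choice of δ, and the assumed growth S(F, n) = n + 1 (as for one-dimensional threshold classifiers) are all illustrative assumptions, not part of the original derivation.

```python
import math


def vc_generalization_gap(n, shatter_coeff, delta=0.05):
    """Width of the uniform generalization bound sketched above:
    sqrt((32 / n) * (log S(F, n) + log(8 / delta))).

    The constants 32 and 8 follow the VC-inequality form used earlier;
    other references give the same rate with different constants.
    """
    return math.sqrt((32.0 / n) * (math.log(shatter_coeff) + math.log(8.0 / delta)))


if __name__ == "__main__":
    # Assumed (hypothetical) polynomial growth S(F, n) = n + 1.
    for n in [100, 1_000, 10_000, 100_000]:
        print(n, round(vc_generalization_gap(n, shatter_coeff=n + 1), 4))
```

Because the Shattering Coefficient enters only through its logarithm, any polynomial growth in n yields a gap of order sqrt(log n / n), which shrinks to zero; it is exponential growth (S(F, n) close to 2^n) that breaks the bound.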

5: Wrap-up & Conclusions

image by author

Note: although we were able to derive generalization bounds and consistency rates for ML estimators over infinite-sized function classes, the approaches in this piece suffer from some drawbacks. Mainly:

  • Beyond simple toy examples, the Shattering Coefficient is generally very difficult to compute.
  • The Shattering Coefficient is also a function of the specific sample size “n”, meaning we need to know how fast it grows asymptotically in order to leverage it for the bounds derived in this piece.

In the following Part 6 of this series, we will leverage another tool for measuring the capacity of infinite-sized function classes: the Vapnik-Chervonenkis (VC) dimension. As we will see in the next piece, the VC dimension is, for certain use-cases, more tractable to compute than the Shattering Coefficient. And unlike the Shattering Coefficient, the VC dimension does not depend on the sample size “n”.

For solid reference material on Statistical Learning Theory, I would recommend the textbooks “All of Statistics” and “All of Nonparametric Statistics” by Larry Wasserman (Professor of Statistics and Machine Learning at Carnegie Mellon), “The Elements of Statistical Learning” by faculty at Stanford, and “Statistical Learning Theory” by Vladimir Vapnik.

I look forward to writing future pieces, and please subscribe and follow me here on Medium!
