Statistical Learning Theory Part 4: Finite Function Classes — Consistency, Rates, & Bounds
Proof of Consistency, Rates, and Generalization Bounds for ML Estimators over Finite Function Classes
1: Background & Motivation
As stated in previous pieces in this series, Statistical Learning Theory provides a probabilistic framework for understanding the problem of Machine Learning Inference. In mathematical terms, the basic goals of Statistical Learning Theory can be formulated as:
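In its simplest form (a standard formulation, restated here from earlier parts of the series): given observations drawn i.i.d. from an unknown distribution, find a function whose expected loss, or risk, is as small as possible:

$$
f^{*} \;=\; \underset{f \in \mathcal{F}}{\arg\min}\; R(f),
\qquad
R(f) \;=\; \mathbb{E}_{(X,Y)\sim P}\big[\,L\big(f(X),\,Y\big)\,\big],
$$

where the joint distribution P of (X, Y) is unknown and only a finite sample $(X_1, Y_1), \dots, (X_n, Y_n)$ is available for choosing an estimator from the function class $\mathcal{F}$.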
This piece is Part 4 in a Series on Statistical Learning Theory. The first three pieces are:
- Part 1: Hoeffding’s Inequality Derivation & Simulation
- Part 2: Optimality of the Bayes Classifier
- Part 3: Convergence and Consistency of Learned ML Estimators
In Part 1 of this series we derived Hoeffding’s Inequality from first principles, in Part 2 we proved optimality of the Bayes Classifier, and in Part 3 we started to develop theory to assess consistency of data-adaptive machine learning sampling estimators. In this piece we assess consistency and generalization bounds for sampling estimators chosen from a finite-sized function class.
The Table of Contents for this piece is as follows:

- 1: Background & Motivation
- 2: Proof of Consistency for Finite Function Classes
- 3: Toy Examples
- 4: Wrap-up & Conclusions
With that said, let’s jump in.
2: Proof of Consistency for Finite Function Classes
Let’s begin with our problem set-up and assumptions. Suppose we have:
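Specifically (stated here for a loss bounded in [0, 1], with the 0–1 classification loss as the running example, as in the Bayes classifier setting of Part 2):

- a finite function class $\mathcal{F} = \{f_1, f_2, \dots, f_m\}$ with $|\mathcal{F}| = m < \infty$;
- observations $(X_1, Y_1), \dots, (X_n, Y_n)$ drawn i.i.d. from an unknown joint distribution $P$;
- a loss function $L$ with $0 \le L(f(X), Y) \le 1$ for every $f \in \mathcal{F}$, for example the 0–1 loss $L(f(X), Y) = \mathbf{1}\{f(X) \neq Y\}$.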
Given this set-up, we define the true (population) risk, the empirical risk over the observed sample, and the Empirical Risk Minimization (ERM) estimator:
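$$
R(f) \;=\; \mathbb{E}_{(X,Y)\sim P}\big[L(f(X),\,Y)\big],
\qquad
\hat{R}_n(f) \;=\; \frac{1}{n}\sum_{i=1}^{n} L\big(f(X_i),\,Y_i\big),
$$

$$
\hat{f}_n \;=\; \underset{f \in \mathcal{F}}{\arg\min}\; \hat{R}_n(f),
\qquad
f^{*} \;=\; \underset{f \in \mathcal{F}}{\arg\min}\; R(f),
$$

where $f^{*}$ denotes the best function available in the class $\mathcal{F}$.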
Finally, recall Hoeffding’s Inequality from Part 1: for any single fixed function f in the class and any ε > 0,
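$$
\mathbb{P}\Big(\big|\hat{R}_n(f) - R(f)\big| > \varepsilon\Big) \;\le\; 2\exp\!\big(-2n\varepsilon^{2}\big).
$$

The whole proof for a finite class is then the standard Hoeffding-plus-union-bound calculation. Because $\mathcal{F}$ contains only $m$ functions, the probability that any one of them has a large deviation between empirical and true risk is at most the sum of the individual probabilities:

$$
\mathbb{P}\Big(\max_{f \in \mathcal{F}} \big|\hat{R}_n(f) - R(f)\big| > \varepsilon\Big)
\;\le\; \sum_{f \in \mathcal{F}} \mathbb{P}\Big(\big|\hat{R}_n(f) - R(f)\big| > \varepsilon\Big)
\;\le\; 2m\exp\!\big(-2n\varepsilon^{2}\big).
$$

Setting the right-hand side equal to $\delta$ and solving for $\varepsilon$ gives a generalization bound: with probability at least $1 - \delta$, simultaneously for every $f \in \mathcal{F}$,

$$
\big|\hat{R}_n(f) - R(f)\big| \;\le\; \sqrt{\frac{\log(2m/\delta)}{2n}}.
$$

The excess risk of the ERM estimator follows from a three-term decomposition, where the middle term is at most zero because $\hat{f}_n$ minimizes the empirical risk:

$$
R(\hat{f}_n) - R(f^{*})
\;=\; \big[R(\hat{f}_n) - \hat{R}_n(\hat{f}_n)\big]
\;+\; \big[\hat{R}_n(\hat{f}_n) - \hat{R}_n(f^{*})\big]
\;+\; \big[\hat{R}_n(f^{*}) - R(f^{*})\big]
\;\le\; 2\max_{f \in \mathcal{F}} \big|\hat{R}_n(f) - R(f)\big|
\;\le\; 2\sqrt{\frac{\log(2m/\delta)}{2n}}.
$$

The bound shrinks at rate $\sqrt{\log(m)/n}$, so the ERM estimator is consistent with respect to the best function in the class: $R(\hat{f}_n) \to R(f^{*})$ in probability as $n \to \infty$ for any fixed finite $m$, and more generally whenever $\log(m_n)/n \to 0$ if the class is allowed to grow with the sample size.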
3: Toy Examples
We will now review two simple toy examples with finite function classes: one with the function class size m fixed and specified a priori, and another with the class size m specified as a function of the sample size n.
3.1: Fixed Finite Function Class Size “m” specified a priori
Consider the following set-up (the specific support and partition are illustrative choices):
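- X supported on a bounded interval, say [0, 1], and a binary label Y in {0, 1};
- a partition of the support of X into k equal-width bins, with k fixed a priori (here k = 3);
- the function class of all classifiers that are constant on each bin, so that m = 2^k = 8.

Because m is fixed and finite, the excess-risk bound from Section 2, $2\sqrt{\log(2m/\delta)/(2n)}$, shrinks to zero as n grows, so the ERM estimator over this class is consistent for the best piecewise constant classifier in it.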
As an example, the finite function class of all piecewise constant functions on the support of X and Y above contains, for partition size k = 3, exactly 2^3 = 8 possible functions: every assignment of a label in {0, 1} to each of the three bins, i.e. (0, 0, 0), (0, 0, 1), (0, 1, 0), (0, 1, 1), (1, 0, 0), (1, 0, 1), (1, 1, 0), and (1, 1, 1).
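As a quick sanity check, here is a minimal simulation sketch of this example. The data-generating distribution, sample size, and random seed below are made-up choices for illustration only: the sketch enumerates the 2^k piecewise constant classifiers, runs empirical risk minimization on simulated data, and prints the uniform deviation bound derived in Section 2.

```python
import itertools
import numpy as np

# Hypothetical toy setup: X ~ Uniform[0, 1], Y in {0, 1}, and a finite class of
# piecewise constant classifiers on k = 3 equal-width bins (2^k = 8 functions).
rng = np.random.default_rng(0)
k, n = 3, 500

def true_label_prob(x):
    # Assumed data-generating mechanism: P(Y = 1 | X = x) depends on the bin of x.
    return np.select([x < 1/3, x < 2/3], [0.9, 0.2], default=0.7)

X = rng.uniform(0, 1, size=n)
Y = rng.binomial(1, true_label_prob(X))

bins = np.minimum((X * k).astype(int), k - 1)   # which of the k bins each X falls into
F = list(itertools.product([0, 1], repeat=k))   # all 2^k piecewise constant functions

# Empirical 0-1 risk of each candidate function, then empirical risk minimization.
emp_risks = np.array([np.mean(np.array(f)[bins] != Y) for f in F])
f_hat = F[int(np.argmin(emp_risks))]
print("ERM selection:", f_hat, "empirical risk:", emp_risks.min())

# Hoeffding + union bound: with probability >= 1 - delta, every f in F satisfies
# |R_hat(f) - R(f)| <= sqrt(log(2 * |F| / delta) / (2 * n)).
delta = 0.05
print("uniform deviation bound:", np.sqrt(np.log(2 * len(F) / delta) / (2 * n)))
```

With n = 500 samples, m = 8 functions, and δ = 0.05, the uniform deviation bound is already below 0.08, so the empirical risk of the selected classifier is a trustworthy estimate of its true risk.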
3.2: Finite Function Class Size “m” specified as a function of sample size “n”
Consider, as one natural continuation of the example above (the construction here is again illustrative), the same class of piecewise constant classifiers, but with the number of bins allowed to grow with the sample size:
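$$
k_n \;\to\; \infty, \qquad m_n \;=\; |\mathcal{F}_n| \;=\; 2^{k_n},
$$

so the class of candidate functions gets richer as more data arrive. Plugging $m_n$ into the finite-class bound from Section 2 gives, with probability at least $1 - \delta$,

$$
R(\hat{f}_n) - R(f^{*}_n)
\;\le\; 2\sqrt{\frac{\log\!\big(2 \cdot 2^{k_n}/\delta\big)}{2n}}
\;=\; 2\sqrt{\frac{k_n \log 2 + \log(2/\delta)}{2n}},
$$

where $f^{*}_n$ is the best function in the sample-size-n class $\mathcal{F}_n$. The bound still goes to zero as long as $k_n/n \to 0$ (equivalently, $\log(m_n)/n \to 0$); for example $k_n = \lceil n^{1/3} \rceil$ keeps the estimator consistent while letting the function class grow without bound.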
4: Wrap-up & Conclusions
Whereas this piece dealt with consistency of machine learning sampling estimators with respect to finite function classes, what happens when our available function class is infinite? For instance, the function class of all linear models? Is it possible to construct generalization bounds and have consistency under these conditions? The answer is yes, which we will begin exploring in Part 5 with the Shattering Coefficient.
For further reading on Statistical Learning Theory, I would recommend the textbooks “All of Statistics” and “All of Nonparametric Statistics” by Larry Wasserman (Professor of Statistics and Machine Learning at Carnegie Mellon), “The Elements of Statistical Learning” by Hastie, Tibshirani, and Friedman at Stanford, and “Statistical Learning Theory” by Vladimir Vapnik.
I look forward to writing future pieces, and please subscribe and follow me here on Medium!