A Formal Framework for ML
Machine Learning
Jesus A. Gonzalez
August 10, 2019
Overfitting
- We might be able to find a hypothesis \(h\) with \(L_S(h) = 0\)
- \(h\) performs very well on the training set
- \(h\) performs poorly on new data (see the sketch below)
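As a minimal sketch of this failure mode (the data, the labeling rule, and the memorizing predictor below are made up for illustration, not taken from the text): a predictor that memorizes the training labels and guesses a constant elsewhere reaches \(L_S(h) = 0\) yet has true risk close to \(1/2\).

```python
import numpy as np

rng = np.random.default_rng(0)

# Made-up labeling function for the example: f(x) = 1 iff x >= 0.5
def f(x):
    return (np.asarray(x) >= 0.5).astype(int)

X_train = rng.uniform(size=20)
y_train = f(X_train)

# "Memorizing" predictor: returns the stored label on training points and a
# fixed guess (0) everywhere else, so its empirical risk L_S(h_S) is 0 by construction.
memory = dict(zip(X_train, y_train))
def h_S(x):
    return memory.get(x, 0)

train_error = np.mean([h_S(x) != y for x, y in zip(X_train, y_train)])

# On fresh points from the same distribution the predictor is wrong whenever
# the true label is 1, so its true risk is close to 0.5.
X_test = rng.uniform(size=10_000)
test_error = np.mean([h_S(x) != y for x, y in zip(X_test, f(X_test))])

print(train_error, test_error)  # roughly 0.0 vs 0.5
```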
ERM and Inductive Bias
- Possible way to avoid overfitting
- Restrict the search space
- Choose a set of predictors in advance, the hypothesis class, \(H\)
- By restricting the learner to choose a predictor from \(H\), we introduce a bias towards certain predictors
- This bias is called inductive bias
ERM and Inductive Bias
- Ideally, the inductive bias would be chosen according to prior knowledge about the problem we need to solve
- Over which hypothesis classes will \(ERM_H\) learning not overfit?
- Restricting the hypothesis class protects us against overfitting but might cause a larger inductive bias
Finite Hypothesis Classes
- The simplest restriction is to impose an upper bound on the size of \(H\)
- The number of predictors \(h\) in \(H\)
- Now we need to find
- \(h_S \in \underset{h \in H}{\operatorname{argmin}}\, L_S(h)\) (a small sketch of this rule follows this list)
- We are interested in the true risk \(L_{(D,f)} (h_S)\) of \(h_S\), not its empirical risk
- This depends on the relation between \(S\) and \(D\)
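A small sketch of the \(ERM_H\) rule over a finite class, assuming a toy class of threshold classifiers and a made-up labeling function (neither is from the text):

```python
import numpy as np

def empirical_risk(h, S):
    """L_S(h): fraction of labeled examples (x, y) in S that h misclassifies."""
    return np.mean([h(x) != y for x, y in S])

def erm(H, S):
    """ERM_H rule: return a hypothesis in H that minimizes the empirical risk on S."""
    return min(H, key=lambda h: empirical_risk(h, S))

# Illustrative finite class (not from the text): threshold classifiers on [0, 1]
# with thresholds at 0.00, 0.01, ..., 1.00, so |H| = 101.
H = [lambda x, t=i / 100: int(x >= t) for i in range(101)]

# Toy i.i.d. sample labeled by f(x) = 1[x >= 0.3]; this f belongs to H,
# so the realizability assumption holds for this toy problem.
rng = np.random.default_rng(1)
xs = rng.uniform(size=50)
S = [(x, int(x >= 0.3)) for x in xs]

h_S = erm(H, S)
print(empirical_risk(h_S, S))  # 0.0
```

Under the realizability assumption the minimizer reaches empirical risk \(0\), but its true risk still depends on how representative \(S\) is of \(D\).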
The i.i.d. assumption
- ML Assumption: \(S\) is created by sampling points from the distribution \(D\) independently of each other
- The i.i.d. assumption: the examples are independent and identically distributed according to the distribution \(D\)
- Denoted by \(S \sim D^m\)
- \(m\) is the size of \(S\)
- \(D^m\) denotes the probability over \(m\)-tuples induced by applying \(D\) to pick each element of the tuple independently of the other members of the tuple
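A minimal sketch of what \(S \sim D^m\) means operationally, with \(D\) chosen as the uniform distribution on \([0, 1]\) and a made-up labeling function purely for illustration:

```python
import numpy as np

rng = np.random.default_rng(2)

# One draw of S ~ D^m: each of the m instances is sampled from D independently
# of the others. Here D is uniform on [0, 1] and the labels come from a
# made-up target f(x) = 1[x >= 0.5].
m = 30
X = rng.uniform(size=m)               # m independent draws from D
S = [(x, int(x >= 0.5)) for x in X]   # the labeled training sample S
```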
Randomness in the model creation
- Because \(L_{(D,f)}(h_S)\) depends on the training set \(S\) and \(S\) is obtained through a random process, then there is randomness in the choice of the predictor \(h_S\) and consequently in the risk \(L_{(D,f)}(h_S)\)
- We formally say that it is a random variable
- Because \(S\) is sampled i.i.d. from \(D\), it will usually be representative of \(D\)
- There is a small probability that it won’t be representative, denoted by \(\delta\)
- \((1 - \delta)\) is the confidence parameter of our prediction
Quality of the predictions
- The learned model \(h_S\) won’t be perfect; it will make errors in its predictions
- We use another parameter for the quality of the predictions: the accuracy parameter, denoted by \(\epsilon\)
- We interpret \(L_{(D,f)}(h_S) > \epsilon\) as a failure of the learner
- We interpret \(L_{(D,f)}(h_S) \leq \epsilon\) as an approximately correct predictor
- What is an upper bound on the probability that we sample an \(m\)-tuple of instances that leads to failure of the learner?
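A sketch of the standard argument that answers this question for a finite class under the realizability assumption: a fixed hypothesis whose true risk exceeds \(\epsilon\) is consistent with a single i.i.d. example with probability at most \(1 - \epsilon\), hence with all \(m\) examples with probability at most \((1 - \epsilon)^m \leq e^{-\epsilon m}\). A union bound over the at most \(|H|\) such hypotheses gives

\[ D^m\big(\{S : L_{(D,f)}(h_S) > \epsilon\}\big) \;\leq\; |H|\,(1 - \epsilon)^m \;\leq\; |H|\,e^{-\epsilon m} \]

Requiring \(|H|\,e^{-\epsilon m} \leq \delta\) and solving for \(m\) yields \(m \geq \frac{\log(|H|/\delta)}{\epsilon}\), which is the sample size in the corollary below.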
Probably Approximately Correct
- Corollary 2.3. Let \(H\) be a finite hypothesis class. Let \(\delta \in (0,1)\) and \(\epsilon > 0\) and let \(m\) be an integer that satisfies \(m \geq \frac{\log(|H|/\delta)}{\epsilon}\). Then, for any labeling function, \(f\), and for any distribution, \(D\), for which the realizability assumption holds (that is, for some \(h \in H\), \(L_{(D,f)}(h) = 0\)), with probability of at least \(1 - \delta\) over the choice of an i.i.d. sample \(S\) of size \(m\), we have that for every ERM hypothesis, \(h_S\), it holds that \(L_{(D,f)}(h_S) \leq \epsilon\).
- This means that for a sufficiently large \(m\), the \(ERM_H\) rule over a finite hypothesis class will be probably (with confidence \(1 - \delta\)) approximately (up to an error of \(\epsilon\)) correct.
PAC Learnability
- Definition 3.1 (PAC Learnability). A hypothesis class \(H\) is PAC learnable if there exist a function \(m_H : (0,1)^2 \rightarrow N\) and a learning algorithm with the following property: For every \(\epsilon, \delta \in (0, 1)\), for every distribution \(D\) over \(X\), and for every labeling function \(f: X \rightarrow \{0, 1\}\), if the realizability assumption holds w.r.t. \(H, D, f\), then when running the learning algorithm on \(m \geq m_H(\epsilon, \delta)\) i.i.d. examples generated by \(D\) and labeled by \(f\), the algorithm returns a hypothesis \(h\) such that, with probability of at least \(1 - \delta\) (over the choice of the examples), \(L_{(D, f)}(h) \leq \epsilon\).
PAC Learnability
- \(m_H : (0, 1)^2 \rightarrow N\) determines the sample complexity of learning \(H\)
- How many examples are required to guarantee a probably approximately correct solution
- Sample complexity is a function of the accuracy (\(\epsilon\)) and confidence (\(\delta\)) parameters
- It also depends on the hypothesis class \(H\); for a finite class the sample complexity depends on the \(\log\) of the size of \(H\)
- \(m_H(\epsilon, \delta) \leq \Big \lceil \frac{\log(|H|/\delta)}{\epsilon} \Big \rceil\)
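As a quick numeric check of this bound (the values of \(|H|\), \(\epsilon\), and \(\delta\) below are chosen only for illustration):

```python
import math

def sample_complexity_upper_bound(H_size, epsilon, delta):
    """Upper bound on m_H(epsilon, delta) for a finite hypothesis class under
    the realizability assumption: ceil(log(|H| / delta) / epsilon),
    with log the natural logarithm."""
    return math.ceil(math.log(H_size / delta) / epsilon)

# Illustrative values (not from the text): |H| = 1000, epsilon = 0.1, delta = 0.05
print(sample_complexity_upper_bound(1000, 0.1, 0.05))  # 100 examples suffice
```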