A Formal Framework for ML
Machine Learning
Jesus A. Gonzalez
August 10, 2019
Overfitting
- We might be able to find a hypothesis \(h\) with \(L_S(h) = 0\)
- \(h\) performs very well on the training set
- \(h\) performs poorly on new data (see the sketch below)
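As a minimal sketch of this failure mode (the data, the labeling rule, and the memorizing predictor below are made up for illustration, not taken from the text): a predictor that memorizes the training labels and guesses a constant elsewhere reaches \(L_S(h) = 0\) yet has true risk close to \(1/2\).

```python
import numpy as np

rng = np.random.default_rng(0)

# Made-up labeling function for the example: f(x) = 1 iff x >= 0.5
def f(x):
    return (np.asarray(x) >= 0.5).astype(int)

X_train = rng.uniform(size=20)
y_train = f(X_train)

# "Memorizing" predictor: returns the stored label on training points and a
# fixed guess (0) everywhere else, so its empirical risk L_S(h_S) is 0 by construction.
memory = dict(zip(X_train, y_train))
def h_S(x):
    return memory.get(x, 0)

train_error = np.mean([h_S(x) != y for x, y in zip(X_train, y_train)])

# On fresh points from the same distribution the predictor is wrong whenever
# the true label is 1, so its true risk is close to 0.5.
X_test = rng.uniform(size=10_000)
test_error = np.mean([h_S(x) != y for x, y in zip(X_test, f(X_test))])

print(train_error, test_error)  # roughly 0.0 vs 0.5
```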
ERM and Inductive Bias
- Possible way to avoid overfitting
- Restrict the search space
- Choose a set of predictors in advance, the hypothesis class, \(H\)
- By restricting the learner to choose a predictor from \(H\), we introduce a bias towards certain predictors
- This bias is called inductive bias
ERM and Inductive Bias
- Ideally, the inductive bias would be chosen according to prior knowledge about the problem we need to solve
- Over which hypothesis classes will \(ERM_H\) learning not overfit?
- Restricting the hypothesis class protects us against overfitting but might cause a larger inductive bias
Finite Hypothesis Classes
- The simplest restriction is to impose an upper bound on the size of \(H\)
- The number of predictors \(h\) in \(H\)
- Now we need to find
- \(h_S \in \underset{h \in H}{\operatorname{argmin}}\, L_S(h)\) (a small sketch of this rule follows this list)
- We are interested in the true risk \(L_{(D,f)} (h_S)\) of \(h_S\), not its empirical risk
- This depends on the relation between \(S\) and \(D\)
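A small sketch of the \(ERM_H\) rule over a finite class, assuming a toy class of threshold classifiers and a made-up labeling function (neither is from the text):

```python
import numpy as np

def empirical_risk(h, S):
    """L_S(h): fraction of labeled examples (x, y) in S that h misclassifies."""
    return np.mean([h(x) != y for x, y in S])

def erm(H, S):
    """ERM_H rule: return a hypothesis in H that minimizes the empirical risk on S."""
    return min(H, key=lambda h: empirical_risk(h, S))

# Illustrative finite class (not from the text): threshold classifiers on [0, 1]
# with thresholds at 0.00, 0.01, ..., 1.00, so |H| = 101.
H = [lambda x, t=i / 100: int(x >= t) for i in range(101)]

# Toy i.i.d. sample labeled by f(x) = 1[x >= 0.3]; this f belongs to H,
# so the realizability assumption holds for this toy problem.
rng = np.random.default_rng(1)
xs = rng.uniform(size=50)
S = [(x, int(x >= 0.3)) for x in xs]

h_S = erm(H, S)
print(empirical_risk(h_S, S))  # 0.0
```

Under the realizability assumption the minimizer reaches empirical risk \(0\), but its true risk still depends on how representative \(S\) is of \(D\).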
The i.i.d. assumption
- ML Assumption: \(S\) is created by sampling points from the distribution \(D\) independently of each other
- The i.i.d. assumption: the examples are independent and identically distributed according to the distribution \(D\)
- Denoted by \(S \sim D^m\)
- \(m\) is the size of \(S\)
- \(D^m\) denotes the probability over \(m\)-tuples induced by applying \(D\) to pick each element of the tuple independently of the other members of the tuple
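A minimal sketch of what \(S \sim D^m\) means operationally, with \(D\) chosen as the uniform distribution on \([0, 1]\) and a made-up labeling function purely for illustration:

```python
import numpy as np

rng = np.random.default_rng(2)

# One draw of S ~ D^m: each of the m instances is sampled from D independently
# of the others. Here D is uniform on [0, 1] and the labels come from a
# made-up target f(x) = 1[x >= 0.5].
m = 30
X = rng.uniform(size=m)               # m independent draws from D
S = [(x, int(x >= 0.5)) for x in X]   # the labeled training sample S
```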
Randomness in the model creation
- Because \(L_{(D,f)}(h_S)\) depends on the training set \(S\) and \(S\) is obtained through a random process, then there is randomness in the choice of the predictor \(h_S\) and consequently in the risk \(L_{(D,f)}(h_S)\)
- We formally say that it is a random variable
- Because \(S\) is sampled i.i.d. from \(D\), it will usually be representative of \(D\)
- There is a small probability that it won’t be representative, denoted by \(\delta\)
- \((1 - \delta)\) is the confidence parameter of our prediction
Quality of the predictions
- The learned model \(h_S\) won’t be perfect; it will make errors in its predictions
- We use another parameter for the quality of the predictions: the accuracy parameter, denoted by \(\epsilon\)
- We interpret \(L_{(D,f)}(h_S) > \epsilon\) as a failure of the learner
- We interpret \(L_{(D,f)}(h_S) \leq \epsilon\) as an approximately correct predictor
- What is an upper bound on the probability that we sample an \(m\)-tuple of instances that leads to failure of the learner?
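A sketch of the standard argument that answers this question for a finite class under the realizability assumption: a fixed hypothesis whose true risk exceeds \(\epsilon\) is consistent with a single i.i.d. example with probability at most \(1 - \epsilon\), hence with all \(m\) examples with probability at most \((1 - \epsilon)^m \leq e^{-\epsilon m}\). A union bound over the at most \(|H|\) such hypotheses gives

\[ D^m\big(\{S : L_{(D,f)}(h_S) > \epsilon\}\big) \;\leq\; |H|\,(1 - \epsilon)^m \;\leq\; |H|\,e^{-\epsilon m} \]

Requiring \(|H|\,e^{-\epsilon m} \leq \delta\) and solving for \(m\) yields \(m \geq \frac{\log(|H|/\delta)}{\epsilon}\), which is the sample size in the corollary below.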
Probably Approximately Correct
- Corollary 2.3. Let \(H\) be a finite hypothesis class. Let \(\delta \in (0,1)\) and \(\epsilon > 0\) and let \(m\) be an integer that satisfies \(m \geq \frac{\log(|H|/\delta)}{\epsilon}\). Then, for any labeling function, \(f\), and for any distribution, \(D\), for which the realizability assumption holds (that is, for some \(h \in H\), \(L_{(D,f)}(h) = 0\)), with probability of at least \(1 - \delta\) over the choice of an i.i.d. sample \(S\) of size \(m\), we have that for every ERM hypothesis, \(h_S\), it holds that \(L_{(D,f)}(h_S) \leq \epsilon\).
- This means that for a sufficiently large \(m\), the \(ERM_H\) rule over a finite hypothesis class will be probably (with confidence \(1 - \delta\)) approximately (up to an error of \(\epsilon\)) correct.
PAC Learnability
- Definition 3.1 (PAC Learnability). A hypothesis class \(H\) is PAC learnable if there exist a function \(m_H : (0,1)^2 \rightarrow N\) and a learning algorithm with the following property: For every \(\epsilon, \delta \in (0, 1)\), for every distribution \(D\) over \(X\), and for every labeling function \(f: X \rightarrow \{0, 1\}\), if the realizability assumption holds w.r.t. \(H, D, f\), then when running the learning algorithm on \(m \geq m_H(\epsilon, \delta)\) i.i.d. examples generated by \(D\) and labeled by \(f\), the algorithm returns a hypothesis \(h\) such that, with probability of at least \(1 - \delta\) (over the choice of the examples), \(L_{(D, f)}(h) \leq \epsilon\).
PAC Learnability
- \(m_H : (0, 1)^2 \rightarrow N\) determines the sample complexity of learning \(H\)
- How many examples are required to guarantee a probably approximately correct solution
- Sample complexity is a function of the accuracy (\(\epsilon\)) and confidence (\(\delta\)) parameters
- It also depends on the hypothesis class \(H\); for a finite class the sample complexity depends on the \(\log\) of the size of \(H\)
- \(m_H(\epsilon, \delta) \leq \Big \lceil \frac{\log(|H|/\delta)}{\epsilon} \Big \rceil\)
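As a quick numeric check of this bound (the values of \(|H|\), \(\epsilon\), and \(\delta\) below are chosen only for illustration):

```python
import math

def sample_complexity_upper_bound(H_size, epsilon, delta):
    """Upper bound on m_H(epsilon, delta) for a finite hypothesis class under
    the realizability assumption: ceil(log(|H| / delta) / epsilon),
    with log the natural logarithm."""
    return math.ceil(math.log(H_size / delta) / epsilon)

# Illustrative values (not from the text): |H| = 1000, epsilon = 0.1, delta = 0.05
print(sample_complexity_upper_bound(1000, 0.1, 0.05))  # 100 examples suffice
```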