Bayes' theorem is given by the equation: \[P(A|B)=\frac{P(B|A)P(A)}{P(B)}\]
\(A\) and \(B\) are events, and \(P(B) \neq 0\).
\(P(A)\) and \(P(B)\) are the probabilities of observing \(A\) and \(B\)
\(P(A|B)\) is a conditional probability, the probability of observing event \(A\) given that event \(B\) is true
\(P(B|A)\) is the probability of observing event \(B\) given that event \(A\) is true
Introduction - Bayes Rule - Example
Student population
Smart \(\rightarrow\) Intelligent students
GradeA \(\rightarrow\) Students that got an A
We believe that \(P(GradeA | Smart) = 0.6\)
Based on data from previous students
Now we know that a particular student got an A
Can we compute the probability that this student is Smart?
We need the probabilities of:
\(P(Smart) = 0.3\), \(P(GradeA) = 0.2\)
\(P(Smart | GradeA)= 0.6*(0.3/0.2) = 0.9\)
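A minimal Python sketch of this calculation, using the values above (the variable names are only for illustration):

```python
# Bayes' rule: P(Smart | GradeA) = P(GradeA | Smart) * P(Smart) / P(GradeA)
p_grade_given_smart = 0.6  # P(GradeA | Smart), believed from previous students
p_smart = 0.3              # P(Smart)
p_grade = 0.2              # P(GradeA)

p_smart_given_grade = p_grade_given_smart * p_smart / p_grade
print(p_smart_given_grade)  # ~0.9
```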
Characteristics of Bayesian Methods
Each training example can incrementally increase or decrease the estimated probability of a hypothesis
We can combine previous knowledge with observed data to determine the final probability of a hypothesis
A priori (prior) probability of each candidate hypothesis
A probability distribution over the observed data for each possible hypothesis
Can classify new instances by combining the prediction of multiple hypotheses weighted by their probabilities
Characteristics of Bayesian Methods
Practical difficulty: initial knowledge about many probabilities is required
If we don’t know them, we estimate them from previous knowledge or from the available data
Computational cost: determining the optimal hypothesis is expensive in the general case
Bayesian Learning
We want to find the best hypothesis \(h\) given the data \(D\)
We can interpret the best hypothesis as the most probable one given the data \(D\) and any initial knowledge about the a priori probabilities of the hypotheses in \(H\)
Bayes theorem
Allows computing conditional probabilities
\(P(h|D) = \frac{P(D|h)P(h)}{P(D)}\)
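A standard way to formalize the "most probable hypothesis" is the maximum a posteriori (MAP) hypothesis; \(P(D)\) can be dropped because it does not depend on \(h\): \[h_{MAP} = \arg\max\limits_{h \in H} P(h|D) = \arg\max\limits_{h \in H} \frac{P(D|h)P(h)}{P(D)} = \arg\max\limits_{h \in H} P(D|h)P(h)\]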
Naive Bayes
Its accuracy is, in some cases, comparable to that of neural networks and decision trees
Learning tasks:
Each instance \(x\) is described by a conjunction of attribute values
The objective function \(f(x)\) may take any value from a finite set \(V\)
Classify a new instance
Assign the most probable objective value \(v_{MAP}\) given the values of the attributes that describe the instance
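Applying Bayes' theorem, and dropping the denominator because it does not depend on \(v_j\), this is usually written as: \[v_{MAP} = \arg\max\limits_{v_j \in V} P(v_j|a_1,a_2,\dots,a_n) = \arg\max\limits_{v_j \in V} \frac{P(a_1,a_2,\dots,a_n|v_j)P(v_j)}{P(a_1,a_2,\dots,a_n)} = \arg\max\limits_{v_j \in V} P(a_1,a_2,\dots,a_n|v_j)P(v_j)\]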
\(P(v_j)\) is estimated with the frequency of each objective value \(v_j\) (class) from the training data
\(P(a_1,a_2, \dots,a_n|v_j)\) cannot be estimated in the same way
We would need a huge dataset
Naive Bayes
Naive Bayes assumes that the values of the attributes are conditionally independent of one another given the value of the class
This is a simplification
Given the objective value of the instance, the probability of observing \(a_1,a_2,\dots,a_n\) is the product of the probabilities for the individual attributes
\(P(a_1,a_2,\dots,a_n|v_j)=\prod\limits_{i}P(a_i|v_j)\), which we substitute into our previous equation
We replace \(a_i\) by the values of the attributes of the new instance \[v_{NB}=\arg\max\limits_{v_j\in\{yes,no\}}P(v_j) \times P(Sky=sunny|v_j) \times P(Temperature=low|v_j) \times P(Humidity=high|v_j) \times P(Wind=strong|v_j)\]
Naive Bayes
We need \(10\) probabilities that we will estimate from our data
The probabilities of the objective values
From the frequencies of the \(14\) examples
\(P(Play=yes) = 9/14 = 0.64\)
\(P(Play=no) = 5/14 = 0.36\)
Naive Bayes
Now we compute the conditional probabilities
\(P(Wind=Strong | Play=yes) = 3/9 = 0.33\)
\(P(Wind=Strong | Play=no) = 3/5 = 0.60\)
\(P(Sky=Sunny|Play=yes) = 2/9 = 0.22\)
\(P(Sky=Sunny|Play=no) =3/5 = 0.60\)
\(P(Temperature=Low|Play=yes) = 3/9 = 0.33\)
\(P(Temperature=Low|Play=no) = 1/5 = 0.20\)
\(P(Humidity=High|Play=yes) = 3/9 = 0.33\)
\(P(Humidity=High|Play=no) = 4/5 = 0.80\)
Naive Bayes
After computing the probabilities \[P(yes)P(Sunny|yes)P(Low|yes)P(High|yes)P(Strong|yes)=0.0053\]\[P(no)P(Sunny|no)P(Low|no)P(High|no)P(Strong|no)=0.0206\]
Then, the class of the new instance is: Play=No
Obtained from the estimates learned from data
If we normalize, we obtain the conditional probability that the objective value is no given the observed attribute values
\(\frac{0.0206}{0.0206+0.0053}= 0.795\)
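A small Python sketch, assuming the probability estimates listed above, that reproduces these numbers:

```python
# Probabilities estimated from the 14 training examples
priors = {"yes": 9 / 14, "no": 5 / 14}
conditionals = {
    "yes": {"Sky=Sunny": 2 / 9, "Temperature=Low": 3 / 9,
            "Humidity=High": 3 / 9, "Wind=Strong": 3 / 9},
    "no":  {"Sky=Sunny": 3 / 5, "Temperature=Low": 1 / 5,
            "Humidity=High": 4 / 5, "Wind=Strong": 3 / 5},
}

new_instance = ["Sky=Sunny", "Temperature=Low", "Humidity=High", "Wind=Strong"]

# Naive Bayes score for each class: P(v_j) * prod_i P(a_i | v_j)
scores = {}
for v, prior in priors.items():
    score = prior
    for attr in new_instance:
        score *= conditionals[v][attr]
    scores[v] = score

print(scores)                               # yes: ~0.0053, no: ~0.0206
print(max(scores, key=scores.get))          # 'no'
print(scores["no"] / sum(scores.values()))  # ~0.795
```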
Naive Bayes - m-estimate of Probability
We are estimating conditional probabilities \(P(A|B)\), e.g. \(P(Wind=Strong|Play=no)\), by using \(\frac{n_c}{n}\)
\(n_c\) is the number of times \(A \land B\) happened in the training data
i.e. \(Wind=Strong \land Play=no\), \(n_c=3\)
\(n\) is the number of times \(B\) happened in the training data
i.e. \(Play=no\), \(n=5\)
This causes a problem when \(n_c=0\): the estimated probability is \(0\), and it dominates the entire product
Naive Bayes - m-estimate of Probability
We can avoid this by fixing two numbers beforehand
A nonzero prior estimate \(p\) for \(P(A|B)\)
With no additional information, \(p\) is usually set to uniform a priori values
If an attribute has \(k\) possible values, then \(p=\frac{1}{k}\)
For \(P(Wind=Strong| Play=no)\)
Wind has 2 possible values
Uniform apriori probabilities give us \(p=0.5\)
A constant \(m\), called the equivalent sample size, which determines the weight of \(p\) relative to the observed data
Instead of using \(\frac{n_c}{n}\) we use \(\frac{n_c + mp}{n + m}\)
We can think of this as augmenting the \(n\) observed examples with \(m\) additional virtual examples distributed according to \(p\)
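A brief Python sketch of the m-estimate applied to \(P(Wind=Strong|Play=no)\); the values chosen for \(m\) are only illustrative:

```python
def m_estimate(n_c, n, p, m):
    """m-estimate of probability: (n_c + m*p) / (n + m)."""
    return (n_c + m * p) / (n + m)

# P(Wind=Strong | Play=no): n_c = 3, n = 5, uniform prior p = 1/2
print(m_estimate(3, 5, p=0.5, m=0))  # 0.6  -> plain frequency n_c / n
print(m_estimate(3, 5, p=0.5, m=4))  # ~0.56 -> pulled toward p
print(m_estimate(0, 5, p=0.5, m=4))  # ~0.22 -> never exactly 0 when m > 0
```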
Bayesian Networks
Bayesian Networks describe the joint probability distribution for a set of variables
They do so by specifying a set of conditional independence assumptions
Together with a set of conditional probabilities
Allow describing conditional independence assumptions that apply to subsets of the variables
Less restrictive than Naive Bayes, which assumes all attributes are conditionally independent given the class
This approach is more tractable than avoiding conditional independence assumptions altogether
Used for inference
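Concretely, the conditional independence assumptions and the conditional probabilities together define the joint distribution through the standard factorization: \[P(x_1,\dots,x_n)=\prod\limits_{i=1}^{n}P\big(x_i \mid Parents(X_i)\big)\] where \(Parents(X_i)\) denotes the immediate predecessors (parents) of \(X_i\) in the network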
Bayesian Networks
Bayesian Networks are used to reason under uncertainty
Based on the representation of dependence between variables
Provide a concise specification of the full joint probability distribution
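As a toy illustration of this compact specification, here is a minimal Python sketch of a hypothetical three-variable network (the structure, variable names, and numbers are invented for the example); the joint probability of a full assignment is the product of each node's conditional probability given its parents:

```python
# Hypothetical network: Rain and Sprinkler have no parents; WetGrass depends on both.
# Each table stores P(variable = True | parent values).
p_rain = 0.2
p_sprinkler = 0.1
p_wet_given = {  # keyed by (rain, sprinkler)
    (True, True): 0.99, (True, False): 0.90,
    (False, True): 0.80, (False, False): 0.05,
}

def joint(rain, sprinkler, wet):
    """Joint probability as the product of each node's CPT entry."""
    p = p_rain if rain else 1 - p_rain
    p *= p_sprinkler if sprinkler else 1 - p_sprinkler
    p_w = p_wet_given[(rain, sprinkler)]
    p *= p_w if wet else 1 - p_w
    return p

print(joint(rain=True, sprinkler=False, wet=True))  # 0.2 * 0.9 * 0.9 = 0.162
```

Here six numbers (one per root node plus one per parent configuration of WetGrass) specify the whole joint distribution; with more variables, the savings over an explicit joint probability table grow rapidly.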