Bayes' theorem is given by the equation: \[P(A|B)=\frac{P(B|A)P(A)}{P(B)}\]
\(A\) and \(B\) are events, and \(P(B) \neq 0\).
\(P(A)\) and \(P(B)\) are the probabilities of observing \(A\) and \(B\)
\(P(A|B)\) is a conditional probability, the probability of observing event \(A\) given that event \(B\) is true
\(P(B|A)\) is the probability of observing event \(B\) given that event \(A\) is true
Introduction - Bayes Rule - Example
Student population
Smart \(\rightarrow\) Intelligent students
GradeA \(\rightarrow\) Students that got an A
We believe that \(P(GradeA | Smart) = 0.6\)
Based on data from previous students
Now we know that a particular student got an A
Can we compute the probability that this student is Smart?
We need the probabilities of:
\(P(Smart) = 0.3\), \(P(GradeA) = 0.2\)
\(P(Smart | GradeA)= 0.6*(0.3/0.2) = 0.9\)
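A minimal Python sketch of this calculation, using the values above (the variable names are only for illustration):

```python
# Bayes' rule: P(Smart | GradeA) = P(GradeA | Smart) * P(Smart) / P(GradeA)
p_grade_given_smart = 0.6  # P(GradeA | Smart), believed from previous students
p_smart = 0.3              # P(Smart)
p_grade = 0.2              # P(GradeA)

p_smart_given_grade = p_grade_given_smart * p_smart / p_grade
print(p_smart_given_grade)  # ~0.9
```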
Characteristics of Bayesian Methods
Each training example can incrementally increase or decrease the estimated probability of a hypothesis
We can combine previous knowledge with observed data to determine the final probability of a hypothesis
A priori (prior) probability of each candidate hypothesis
A probability distribution over the observed data for each possible hypothesis
Can classify new instances by combining the prediction of multiple hypotheses weighted by their probabilities
Characteristics of Bayesian Methods
Practical difficulty: initial knowledge about many probabilities is required
If we don’t know them, we estimate them from previous knowledge or from the available data
Computational cost: determining the optimal hypothesis is expensive in the general case
Bayesian Learning
We want to find the best hypothesis \(h\) given the data \(D\)
We can interpret the best hypothesis as the most probable one given the data \(D\) and any initial knowledge about the a priori probabilities of the hypotheses in \(H\)
Bayes theorem
Allows computing conditional probabilities
\(P(h|D) = \frac{P(D|h)P(h)}{P(D)}\)
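A standard way to formalize the "most probable hypothesis" is the maximum a posteriori (MAP) hypothesis; \(P(D)\) can be dropped because it does not depend on \(h\): \[h_{MAP} = \arg\max\limits_{h \in H} P(h|D) = \arg\max\limits_{h \in H} \frac{P(D|h)P(h)}{P(D)} = \arg\max\limits_{h \in H} P(D|h)P(h)\]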
Naive Bayes
Its accuracy is, in some cases, comparable to that of neural networks and decision trees
Learning tasks:
Each instance \(x\) is described by a conjunction of attribute values
The objective function \(f(x)\) may take any value from a finite set \(V\)
Classify a new instance
Assign the most probable objective value \(v_{MAP}\) given the values of the attributes that describe the instance
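Applying Bayes' theorem, and dropping the denominator because it does not depend on \(v_j\), this is usually written as: \[v_{MAP} = \arg\max\limits_{v_j \in V} P(v_j|a_1,a_2,\dots,a_n) = \arg\max\limits_{v_j \in V} \frac{P(a_1,a_2,\dots,a_n|v_j)P(v_j)}{P(a_1,a_2,\dots,a_n)} = \arg\max\limits_{v_j \in V} P(a_1,a_2,\dots,a_n|v_j)P(v_j)\]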
\(P(v_j)\) is estimated with the frequency of each objective value \(v_j\) (class) from the training data
\(P(a_1,a_2, \dots,a_n|v_j)\) cannot be estimated in the same way
We would need a huge dataset
Naive Bayes
Naive Bayes assumes that the values of the attributes are conditionally independent of one another given the value of the class
This is a simplification
Given the objective value of the instance, the probability of observing \(a_1,a_2,\dots,a_n\) is the product of the probabilities for the individual attributes
\(P(a_1,a_2,\dots,a_n|v_j)=\prod\limits_{i}P(a_i|v_j)\), which we substitute into our previous equation
We replace \(a_i\) by the values of the attributes of the new instance \[v_{NB}=\arg\max\limits_{v_j\in\{yes,no\}}P(v_j) \times P(Sky=sunny|v_j) \times P(Temperature=low|v_j) \times P(Humidity=high|v_j) \times P(Wind=strong|v_j)\]
Naive Bayes
We need \(10\) probabilities that we will estimate from our data
The probabilities of the objective values
From the frequencies of the \(14\) examples
\(P(Play=yes) = 9/14 = 0.64\)
\(P(Play=no) = 5/14 = 0.36\)
Naive Bayes
Now we compute the conditional probabilities
\(P(Wind=Strong | Play=yes) = 3/9 = 0.33\)
\(P(Wind=Strong | Play=no) = 3/5 = 0.60\)
\(P(Sky=Sunny|Play=yes) = 2/9 = 0.22\)
\(P(Sky=Sunny|Play=no) =3/5 = 0.60\)
\(P(Temperature=Low|Play=yes) = 3/9 = 0.33\)
\(P(Temperature=Low|Play=no) = 1/5 = 0.20\)
\(P(Humidity=High|Play=yes) = 3/9 = 0.33\)
\(P(Humidity=High|Play=no) = 4/5 = 0.80\)
Naive Bayes
After computing the probabilities \[P(yes)P(Sunny|yes)P(Low|yes)P(High|yes)P(Strong|yes)=0.0053\]\[P(no)P(Sunny|no)P(Low|no)P(High|no)P(Strong|no)=0.0206\]
Then, the class of the new instance is: Play=No
Obtained from the estimates learned from data
If we normalize, we obtain the conditional probability that the objective value is no given the observed attribute values
\(\frac{0.0206}{0.0206+0.0053}= 0.795\)
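A small Python sketch, assuming the probability estimates listed above, that reproduces these numbers:

```python
# Probabilities estimated from the 14 training examples
priors = {"yes": 9 / 14, "no": 5 / 14}
conditionals = {
    "yes": {"Sky=Sunny": 2 / 9, "Temperature=Low": 3 / 9,
            "Humidity=High": 3 / 9, "Wind=Strong": 3 / 9},
    "no":  {"Sky=Sunny": 3 / 5, "Temperature=Low": 1 / 5,
            "Humidity=High": 4 / 5, "Wind=Strong": 3 / 5},
}

new_instance = ["Sky=Sunny", "Temperature=Low", "Humidity=High", "Wind=Strong"]

# Naive Bayes score for each class: P(v_j) * prod_i P(a_i | v_j)
scores = {}
for v, prior in priors.items():
    score = prior
    for attr in new_instance:
        score *= conditionals[v][attr]
    scores[v] = score

print(scores)                               # yes: ~0.0053, no: ~0.0206
print(max(scores, key=scores.get))          # 'no'
print(scores["no"] / sum(scores.values()))  # ~0.795
```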
Naive Bayes - m-estimate of Probability
We are estimating conditional probabilities \(P(A|B)\), e.g. \(P(Wind=Strong|Play=no)\), by using \(\frac{n_c}{n}\)
\(n_c\) is the number of times \(A \land B\) happened in the training data
i.e. \(Wind=Strong \land Play=no\), \(n_c=3\)
\(n\) is the number of times \(B\) happened in the training data
i.e. \(Play=no\), \(n=5\)
This causes a problem when \(n_c=0\): the estimated probability is \(0\), and it dominates the entire product
Naive Bayes - m-estimate of Probability
We can avoid this by fixing two numbers beforehand
A nonzero prior estimate \(p\) for \(P(A|B)\)
With no additional information, \(p\) is usually set to uniform a priori values
If an attribute has \(k\) possible values, then \(p=\frac{1}{k}\)
For \(P(Wind=Strong| Play=no)\)
Wind has 2 possible values
Uniform apriori probabilities give us \(p=0.5\)
A constant \(m\), called the equivalent sample size, which determines the weight of \(p\) relative to the observed data
Instead of using \(\frac{n_c}{n}\) we use \(\frac{n_c + mp}{n + m}\)
We can think of this as augmenting the \(n\) observed examples with \(m\) additional virtual examples distributed according to \(p\)
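A brief Python sketch of the m-estimate applied to \(P(Wind=Strong|Play=no)\); the values chosen for \(m\) are only illustrative:

```python
def m_estimate(n_c, n, p, m):
    """m-estimate of probability: (n_c + m*p) / (n + m)."""
    return (n_c + m * p) / (n + m)

# P(Wind=Strong | Play=no): n_c = 3, n = 5, uniform prior p = 1/2
print(m_estimate(3, 5, p=0.5, m=0))  # 0.6  -> plain frequency n_c / n
print(m_estimate(3, 5, p=0.5, m=4))  # ~0.56 -> pulled toward p
print(m_estimate(0, 5, p=0.5, m=4))  # ~0.22 -> never exactly 0 when m > 0
```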
Bayesian Networks
Bayesian Networks describe the joint probability distribution for a set of variables
They do so by specifying a set of conditional independence assumptions
Together with a set of conditional probabilities
Allow describing conditional independence assumptions that apply to subsets of the variables
Less restrictive than Naive Bayes, which assumes all attributes are conditionally independent given the class
This approach is more tractable than avoiding conditional independence assumptions altogether
Used for inference
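Concretely, the conditional independence assumptions and the conditional probabilities together define the joint distribution through the standard factorization: \[P(x_1,\dots,x_n)=\prod\limits_{i=1}^{n}P\big(x_i \mid Parents(X_i)\big)\] where \(Parents(X_i)\) denotes the immediate predecessors (parents) of \(X_i\) in the network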
Bayesian Networks
Bayesian Networks are used to reason under uncertainty
Based on the representation of dependence between variables
Provide a concise specification of the full joint probability distribution
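As a toy illustration of this compact specification, here is a minimal Python sketch of a hypothetical three-variable network (the structure, variable names, and numbers are invented for the example); the joint probability of a full assignment is the product of each node's conditional probability given its parents:

```python
# Hypothetical network: Rain and Sprinkler have no parents; WetGrass depends on both.
# Each table stores P(variable = True | parent values).
p_rain = 0.2
p_sprinkler = 0.1
p_wet_given = {  # keyed by (rain, sprinkler)
    (True, True): 0.99, (True, False): 0.90,
    (False, True): 0.80, (False, False): 0.05,
}

def joint(rain, sprinkler, wet):
    """Joint probability as the product of each node's CPT entry."""
    p = p_rain if rain else 1 - p_rain
    p *= p_sprinkler if sprinkler else 1 - p_sprinkler
    p_w = p_wet_given[(rain, sprinkler)]
    p *= p_w if wet else 1 - p_w
    return p

print(joint(rain=True, sprinkler=False, wet=True))  # 0.2 * 0.9 * 0.9 = 0.162
```

Here six numbers (one per root node plus one per parent configuration of WetGrass) specify the whole joint distribution; with more variables, the savings over an explicit joint probability table grow rapidly.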