Probabilistic Reasoning
Machine Learning
Jesus A. Gonzalez
August 10, 2019
Introduction
- This class presents a systematic way to represent relationships in the form of Bayesian Networks
- Syntax
- Semantics
- How they are used to capture uncertain knowledge in a natural and efficient way
Bayesian Networks
- Full joint probability distribution
- Can be used to answer any question about the domain
- But can be intractable as we add variables
- Independence and conditional independence relationships among variables
- Reduce number of probabilities required to define the full joint distribution
- Bayesian Networks
- Used to represent dependencies among variables
- Can represent any full joint probability distribution
Bayesian Networks
- Bayesian Network
- Directed graph
- Each node contains quantitative probability information
- BN specification
- Each node corresponds to a random variable (discrete or continuous)
- Directed edges (arrows, links) connect pairs of nodes
- In an edge from \(X\) to \(Y\), \(X\) is the parent of \(Y\)
- The graph has no cycles (it is a DAG)
- Each node \(X_i\) has a conditional probability distribution
- \(P(X_i | Parents(X_i))\)
- Quantifies the effect of the parents on the node
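The specification above can be sketched as a small data structure: each node stores its parent list and a CPT giving the probability of each value for every combination of parent values. This is a minimal sketch assuming boolean variables; the Rain/WetGrass names and numbers are illustrative, not from the text.

```python
# A Bayesian network as a dict: node -> parents and a CPT mapping each
# tuple of parent values to P(node=True | those parent values).
# (Illustrative network: Rain is a parent of WetGrass.)
network = {
    "Rain":     {"parents": [],       "cpt": {(): 0.2}},
    "WetGrass": {"parents": ["Rain"], "cpt": {(True,): 0.9, (False,): 0.1}},
}

def prob(node, value, parent_values):
    """P(node=value | parents=parent_values), read from the node's CPT."""
    p_true = network[node]["cpt"][tuple(parent_values)]
    return p_true if value else 1.0 - p_true
```

For example, `prob("Rain", False, ())` returns \(1 - 0.2 = 0.8\): the CPT only stores \(P(X = true)\) per row, since the boolean values are exhaustive.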
Bayesian Networks
- Topology of the network
- The set of nodes and links
- Specifies the conditional independence relationships
- Those that hold in the domain
- Intuitive meaning of an arrow
- X has direct influence on Y
- Means that causes should be parents of effects
- Domain experts are able to identify these relationships
- Easier than specifying the actual probabilities
Bayesian Networks
- Once we specify the network topology
- We need to specify the conditional probability distribution
- For each variable, given its parents
- The combination of topology and conditional distributions suffices to specify the full joint distribution of all variables
Bayesian Networks: Example
- Variables: \(Toothache\), \(Cavity\), \(Catch\), and \(Weather\)
- \(Weather\) is independent of the other variables
- \(Toothache\) and \(Catch\) are conditionally independent given \(Cavity\)
- These relationships are represented in a Bayesian Network as:
Bayesian Networks: Example
Bayesian Networks
- The conditional independence of \(Toothache\) and \(Catch\) given \(Cavity\)
- Indicated as the absence of a link between \(Toothache\) and \(Catch\)
- \(Cavity\) is a direct cause of \(Toothache\) and \(Catch\)
- No direct cause between \(Toothache\) and \(Catch\)
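The conditional independence of \(Toothache\) and \(Catch\) given \(Cavity\) can be checked numerically against a full joint distribution. The joint entries below are illustrative numbers assumed for this sketch (the text gives only the independence relationships, not the probabilities).

```python
# Entries are P(cavity, toothache, catch) for an assumed joint distribution.
joint = {
    (True,  True,  True):  0.108, (True,  True,  False): 0.012,
    (True,  False, True):  0.072, (True,  False, False): 0.008,
    (False, True,  True):  0.016, (False, True,  False): 0.064,
    (False, False, True):  0.144, (False, False, False): 0.576,
}

def p(cavity=None, toothache=None, catch=None):
    """Marginal probability of a partial assignment (None = summed out)."""
    return sum(v for (c, t, k), v in joint.items()
               if (cavity is None or c == cavity)
               and (toothache is None or t == toothache)
               and (catch is None or k == catch))

# Check P(toothache, catch | cavity) == P(toothache | cavity) * P(catch | cavity)
for cav in (True, False):
    lhs = p(cavity=cav, toothache=True, catch=True) / p(cavity=cav)
    rhs = (p(cavity=cav, toothache=True) / p(cavity=cav)) * \
          (p(cavity=cav, catch=True) / p(cavity=cav))
    assert abs(lhs - rhs) < 1e-9
```

With these numbers, \(P(toothache, catch \mid cavity) = 0.54 = 0.6 \times 0.9\), so the factorization holds in both conditioning cases.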
Bayesian Networks: Example
- New burglar alarm installed at home
- Fairly reliable at detecting a burglary
- Sometimes it also responds to minor earthquakes
- There are also 2 neighbors: John and Mary
- They promised to call you at work when they hear the alarm
- John almost always calls when he hears the alarm
- Sometimes he confuses the phone ringing with the alarm and also calls
- Mary often listens to loud music and misses the alarm
- Given the evidence (of who has or has not called)
- We want to estimate the probability of a burglary
Bayesian Networks: Example
Bayesian Networks: Example
- The network structure shows that
- Burglary and earthquakes directly affect the probability of the alarm’s going off
- John and Mary calling depends only on the alarm
- Network represents the assumptions that they don’t
- Directly perceive burglaries
- Notice minor earthquakes
Bayesian Networks: Example
- Conditional Probability Tables or CPTs
- The conditional distributions
- Each row contains the conditional probability of each node value for a conditioning case
- A possible combination of values of the parent nodes
- i.e. a possible world
- Each row sums to 1 (entries represent an exhaustive set of cases for the variable)
- A node with no parents has a table with a single row: the prior probabilities of each possible value of the variable
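As a concrete sketch, the CPT for the \(Alarm\) node can be stored one row per combination of parent values (\(Burglary\), \(Earthquake\)). Of the numbers below, only \(P(a \mid \neg b, \neg e) = 0.001\) appears later in the text; the other entries are assumed for illustration.

```python
# CPT for Alarm: (burglary, earthquake) -> (P(alarm), P(~alarm)).
# Only the (False, False) row's 0.001 comes from the text; the rest
# are illustrative values.
alarm_cpt = {
    (True,  True):  (0.95,  0.05),
    (True,  False): (0.94,  0.06),
    (False, True):  (0.29,  0.71),
    (False, False): (0.001, 0.999),
}

# Each row sums to 1: its entries exhaust the values of the variable.
for row in alarm_cpt.values():
    assert abs(sum(row) - 1.0) < 1e-9
```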
Bayesian Networks: Example
- Conditional Probability Tables or CPTs
- The network omits information that is irrelevant or already captured by other variables
- Laziness and ignorance in operation
- The probabilities summarize an infinite set of circumstances in which the alarm might fail to go off
- High humidity, power failure, dead battery, cut wires, and so on…
- This is how an agent can deal with a very large world
- Mary listening to music (already captured in the CPT, link from Alarm to MaryCalls)
- The phone ringing confusing John (already captured in the CPT, link from Alarm to JohnCalls)
- Other, i.e. irrelevant information
The Semantics of Bayesian Networks
- The meaning of a Bayesian network
- See the network as a representation of the joint probability distribution
- Useful for understanding how to construct networks
- See the network as an encoding of a collection of conditional independence statements
- Useful in designing inference procedures
Representing the Full Joint Distribution
- We define the way in which the network represents a specific joint distribution over all the variables
- \(P(x_1, \dots, x_n) = \prod\limits_{i=1}^n P(x_i | parents(X_i))\).
- Probability that the alarm has sounded but neither a burglary nor an earthquake has occurred, and both John and Mary call
- Multiply entries from the joint distribution
- \(P(j,m,a,\neg b, \neg e) = P(j|a)P(m|a)P(a| \neg b \land \neg e)P(\neg b)P(\neg e)\)
- \(= 0.90 \times 0.70 \times 0.001 \times 0.999 \times 0.998 = 0.000628\)
- The global semantics of the network defines the full joint distribution as the product of the local conditional distributions
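The worked example above is just a product of numbers taken from the network's CPTs, all of which appear in the text:

```python
# P(j, m, a, ~b, ~e) = P(j|a) P(m|a) P(a|~b,~e) P(~b) P(~e)
p_j_given_a = 0.90
p_m_given_a = 0.70
p_a_given_not_b_not_e = 0.001
p_not_b = 0.999
p_not_e = 0.998

p = (p_j_given_a * p_m_given_a * p_a_given_not_b_not_e
     * p_not_b * p_not_e)
assert abs(p - 0.000628) < 1e-5   # matches the 0.000628 in the text
```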
Representing the Full Joint Distribution
- We use the full joint distribution to answer any query about the domain
- If the BN represents the joint distribution, it can also be used to answer any query
- Summing all relevant joint entries
- There are several methods to do this efficiently
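The simplest such method is inference by enumeration: sum the joint entries consistent with the query and the evidence, where each entry is the product of local conditionals. The sketch below computes \(P(Burglary \mid JohnCalls, MaryCalls)\); the CPT entries other than \(P(a \mid \neg b, \neg e)\), \(P(j \mid a)\), \(P(m \mid a)\) and the priors implied by the text are assumed for illustration.

```python
from itertools import product

# Priors implied by the text (P(~b)=0.999, P(~e)=0.998); other CPT
# values are illustrative assumptions.
P_b, P_e = 0.001, 0.002
P_a = {(True, True): 0.95, (True, False): 0.94,
       (False, True): 0.29, (False, False): 0.001}
P_j = {True: 0.90, False: 0.05}   # P(JohnCalls=True | Alarm)
P_m = {True: 0.70, False: 0.01}   # P(MaryCalls=True | Alarm)

def pr(p_true, value):             # P(X=value) given P(X=True)
    return p_true if value else 1.0 - p_true

def joint(b, e, a, j, m):
    """One joint entry as the product of the local conditionals."""
    return (pr(P_b, b) * pr(P_e, e) * pr(P_a[(b, e)], a)
            * pr(P_j[a], j) * pr(P_m[a], m))

# P(Burglary | j, m): sum out Earthquake and Alarm, then normalize.
num = {b: sum(joint(b, e, a, True, True)
              for e, a in product([True, False], repeat=2))
       for b in (True, False)}
p_burglary = num[True] / (num[True] + num[False])
```

With these numbers the query evaluates to roughly 0.284: even with both neighbors calling, a burglary is far from certain.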
A Method for Constructing Bayesian Networks
- We need to construct the BN so that the resulting joint distribution is a good representation of a given domain
- The equation \(P(x_1, \dots, x_n) = \prod\limits_{i=1}^n P(x_i | parents(X_i))\) implies some conditional independence relationships
- We can use this information in constructing the topology of the network
A Method for Constructing Bayesian Networks
- We rewrite the entries in the joint distribution in terms of conditional probability using the product rule
- \(P(x_1, \dots, x_n) = P(x_n | x_{n-1}, \dots, x_1)P(x_{n-1}, \dots, x_1)\)
- Repeating the process, we reduce each conjunctive probability to a conditional probability and a smaller conjunction
- We end with this product
- \(P(x_1, \dots, x_n) = P(x_n|x_{n-1}, \dots, x_1)P(x_{n-1}|x_{n-2}, \dots, x_1) \dots P(x_2|x_1)P(x_1)\)
- \(= \prod\limits_{i=1}^n P(x_i|x_{i-1}, \dots, x_1)\).
- This identity is called the chain rule
A Method for Constructing Bayesian Networks
- continues…
- This identity is called the chain rule
- Holds for any set of random variables
- This is equivalent to saying that
- \(P(X_i|X_{i-1}, \dots, X_1) = P(X_i|Parents(X_i))\) (3)
- given that \(Parents(X_i) \subseteq \{X_{i-1}, \dots, X_1\}\)
- We can satisfy this condition by numbering the nodes in a way consistent with the partial order implicit in the graph structure
A Method for Constructing Bayesian Networks
- Equation (3) says that
- The Bayesian Network is a correct representation of the domain
- Only if each node is conditionally independent of its other predecessors in the node ordering, given its parents
- We can satisfy this condition with the following methodology
A Method for Constructing Bayesian Networks
- Nodes:
- Determine the set of variables required to model the domain
- Order the variables: \(\{X_1, \dots, X_n\}\)
- Any order works, but the resulting network will be more compact if the variables are ordered so that causes precede effects
- Links: For \(i = 1\) to \(n\) do:
- Choose from \(X_1, \dots, X_{i-1}\), a minimal set of parents for \(X_i\), satisfying equation (3)
- For each parent insert a link from the parent to \(X_i\)
- CPTs: Write the conditional probability table, \(P(X_i | Parents(X_i))\).
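The parent-selection step above can be brute-forced on a small domain: for each variable in the chosen order, try subsets of its predecessors from smallest to largest until one satisfies equation (3). The joint distribution below reuses the dentist variables with illustrative numbers (assumed, not from the text).

```python
from itertools import combinations, product

names = ("cavity", "toothache", "catch")
# Assumed full joint: (cavity, toothache, catch) -> probability.
joint = {
    (True,  True,  True):  0.108, (True,  True,  False): 0.012,
    (True,  False, True):  0.072, (True,  False, False): 0.008,
    (False, True,  True):  0.016, (False, True,  False): 0.064,
    (False, False, True):  0.144, (False, False, False): 0.576,
}

def p(**fixed):
    """Marginal probability of a partial assignment, e.g. p(cavity=True)."""
    return sum(v for assign, v in joint.items()
               if all(assign[names.index(k)] == val for k, val in fixed.items()))

def cond(var, given):
    """P(var=True | given)."""
    return p(**{var: True, **given}) / p(**given)

def minimal_parents(i):
    """Smallest predecessor subset satisfying equation (3) for names[i]."""
    var, preds = names[i], names[:i]
    for size in range(len(preds) + 1):          # smallest subsets first
        for subset in combinations(preds, size):
            if all(abs(cond(var, dict(zip(preds, vals)))
                       - cond(var, {k: v for k, v in zip(preds, vals)
                                    if k in subset})) < 1e-9
                   for vals in product([True, False], repeat=len(preds))):
                return subset
```

Here `minimal_parents(2)` returns `("cavity",)`: given \(Cavity\), the variable \(Catch\) is independent of \(Toothache\), so \(Cavity\) alone suffices as its parent set.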
A Method for Constructing Bayesian Networks
- Intuitively
- Parents of \(X_i\) should contain all those nodes in \(X_1, \dots, X_{i-1}\) that directly influence \(X_i\)
- Consider the node MaryCalls
- It is influenced by Burglary and Earthquake, but not directly: only through their effect on the Alarm
- Given the state of the alarm, JohnCalls doesn’t influence MaryCalls
A Method for Constructing Bayesian Networks
- Formally
- We think the following conditional independence statement holds
- \(P(MaryCalls | JohnCalls, Alarm, Earthquake, Burglary) = P(MaryCalls | Alarm)\).
- Then, Alarm is the only parent node for MaryCalls
- Each node is connected only to earlier nodes
- Guarantees that the network is acyclic
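This acyclicity guarantee is easy to check: if every node's parents come earlier in the ordering, no directed cycle can exist. A minimal sketch with the alarm network's structure (the structure itself is from the text):

```python
# Causal ordering and parent sets of the burglar-alarm network.
order = ["Burglary", "Earthquake", "Alarm", "JohnCalls", "MaryCalls"]
parents = {
    "Burglary": [], "Earthquake": [],
    "Alarm": ["Burglary", "Earthquake"],
    "JohnCalls": ["Alarm"], "MaryCalls": ["Alarm"],
}

pos = {v: i for i, v in enumerate(order)}
# Every edge points from an earlier node to a later one => no cycles.
assert all(pos[par] < pos[v] for v, ps in parents.items() for par in ps)
```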
Compactness and Node Ordering
- The compactness of a Bayesian Network stems from the fact that most domains are locally structured (sparse) systems
- In a locally structured system
- Each subcomponent interacts directly with a bounded number of other components
- No matter the total number of components
- Local structure is usually associated with linear (rather than exponential) growth in complexity
Compactness and Node Ordering
- In a BN it is reasonable to suppose that each node is influenced by at most \(k\) others
- Assuming boolean variables (for simplicity)
- The amount of information needed to specify each CPT will be at most \(2^k\) numbers
- The complete network could be specified with \(n2^k\) numbers
- In contrast with the joint distribution that contains \(2^n\) numbers
- Suppose we have \(n = 30\) nodes, each with 5 parents (\(k = 5\))
- The BN requires 960 numbers
- The full joint distribution requires over a billion numbers
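The counting argument from the slide, as a two-line check:

```python
# With n boolean variables and at most k parents each, the network
# needs at most n * 2^k numbers versus 2^n for the full joint.
n, k = 30, 5
bn_numbers = n * 2**k       # 30 * 32 = 960
joint_numbers = 2**n        # 2^30 = 1,073,741,824 -- over a billion
assert bn_numbers == 960 and joint_numbers > 10**9
```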
Compactness and Node Ordering
- If we choose the ordering: MaryCalls, JohnCalls, Alarm, Burglary, Earthquake
- Add MaryCalls: no predecessors, so no parents
- Add JohnCalls:
- If Mary calls, the alarm has probably gone off, which makes it more likely that John calls
- JohnCalls needs MaryCalls as parent
- Add Alarm:
- If both call, it is more likely that the alarm has gone off than if only one calls
- We need both MaryCalls and JohnCalls as parents
Compactness and Node Ordering
- If we choose the ordering: MaryCalls, JohnCalls, Alarm, Burglary, Earthquake
- continues…
- Add Burglary:
- Knowing the alarm state, a call from John or Mary can give us information about our phone ringing or Mary’s music, but not about the burglary
- \(P(Burglary | Alarm, JohnCalls, MaryCalls) = P(Burglary | Alarm)\)
- We need Alarm as parent
- Add Earthquake:
- If the alarm is on, it is more likely that there has been an earthquake
- But if there has been a burglary, that explains the alarm, and the probability of an earthquake would only be slightly above normal
- We need both Alarm and Burglary as parents
Compactness and Node Ordering
- The network structure depends on the order in which we consider the nodes: an ordering with causes before effects yields a sparser, more compact network
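The two orderings can be compared by counting CPT parameters (one number, \(P(X{=}true)\), per parent combination of each boolean variable). The parent sets follow the analysis in the slides above:

```python
# CPT parameter count: each node with k parents needs 2^k numbers.
def cpt_size(parent_counts):
    return sum(2**k for k in parent_counts)

# Causal ordering: Burglary, Earthquake, Alarm(B,E), JohnCalls(A), MaryCalls(A)
causal = cpt_size([0, 0, 2, 1, 1])        # 1 + 1 + 4 + 2 + 2 = 10
# Ordering above: MaryCalls, JohnCalls(M), Alarm(M,J), Burglary(A),
# Earthquake(A,B)
diagnostic = cpt_size([0, 1, 2, 1, 2])    # 1 + 2 + 4 + 2 + 4 = 13
assert (causal, diagnostic) == (10, 13)
```

The causal ordering needs 10 numbers versus 13 for the diagnostic one; with more variables and denser parent sets, the gap grows quickly.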