Probabilistic Reasoning
Machine Learning
Jesus A. Gonzalez
August 10, 2019
Introduction
- This class presents a systematic way to represent relationships in the form of Bayesian Networks
- Syntax
- Semantics
- How they are used to capture uncertain knowledge in a natural and efficient way
Bayesian Networks
- Full joint probability distribution
- Can be used to answer any question about the domain
- But can be intractable as we add variables
- Independence and conditional independence relationships among variables
- Reduce number of probabilities required to define the full joint distribution
- Bayesian Networks
- Used to represent dependencies among variables
- Can represent any full joint probability distribution
Bayesian Networks
- Bayesian Network
- Directed graph
- Each node contains quantitative probability information
- BN specification
- Each node corresponds to a random variable (discrete or continuous)
- Directed edges (arrows, links) connect pairs of nodes
- In an edge from \(X\) to \(Y\), \(X\) is the parent of \(Y\)
- The graph has no cycles (it is a DAG)
- Each node \(X_i\) has a conditional probability distribution
- \(P(X_i | Parents(X_i))\)
- Quantifies the effect of the parents on the node
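The specification above can be sketched as a small data structure: each node stores its parent list and a CPT giving the probability of each value for every combination of parent values. This is a minimal sketch assuming boolean variables; the Rain/WetGrass names and numbers are illustrative, not from the text.

```python
# A Bayesian network as a dict: node -> parents and a CPT mapping each
# tuple of parent values to P(node=True | those parent values).
# (Illustrative network: Rain is a parent of WetGrass.)
network = {
    "Rain":     {"parents": [],       "cpt": {(): 0.2}},
    "WetGrass": {"parents": ["Rain"], "cpt": {(True,): 0.9, (False,): 0.1}},
}

def prob(node, value, parent_values):
    """P(node=value | parents=parent_values), read from the node's CPT."""
    p_true = network[node]["cpt"][tuple(parent_values)]
    return p_true if value else 1.0 - p_true
```

For example, `prob("Rain", False, ())` returns \(1 - 0.2 = 0.8\): the CPT only stores \(P(X = true)\) per row, since the boolean values are exhaustive.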
Bayesian Networks
- Topology of the network
- The set of nodes and links
- Specifies the conditional independence relationships
- Those that hold in the domain
- Intuitive meaning of an arrow
- X has direct influence on Y
- Means that causes should be parents of effects
- Domain experts are able to identify these relationships
- Easier than specifying the actual probabilities
Bayesian Networks
- Once we specify the network topology
- We need to specify the conditional probability distribution
- For each variable, given its parents
- The combination of topology and conditional distributions suffices to specify the full joint distribution of all variables
Bayesian Networks: Example
- Variables: \(Toothache\), \(Cavity\), \(Catch\), and \(Weather\)
- \(Weather\) is independent of the other variables
- \(Toothache\) and \(Catch\) are conditionally independent given \(Cavity\)
- These relationships are represented in a Bayesian Network as:
Bayesian Networks: Example
Bayesian Networks
- The conditional independence of \(Toothache\) and \(Catch\) given \(Cavity\)
- Indicated as the absence of a link between \(Toothache\) and \(Catch\)
- \(Cavity\) is a direct cause of \(Toothache\) and \(Catch\)
- No direct cause between \(Toothache\) and \(Catch\)
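The conditional independence of \(Toothache\) and \(Catch\) given \(Cavity\) can be checked numerically against a full joint distribution. The joint entries below are illustrative numbers assumed for this sketch (the text gives only the independence relationships, not the probabilities).

```python
# Entries are P(cavity, toothache, catch) for an assumed joint distribution.
joint = {
    (True,  True,  True):  0.108, (True,  True,  False): 0.012,
    (True,  False, True):  0.072, (True,  False, False): 0.008,
    (False, True,  True):  0.016, (False, True,  False): 0.064,
    (False, False, True):  0.144, (False, False, False): 0.576,
}

def p(cavity=None, toothache=None, catch=None):
    """Marginal probability of a partial assignment (None = summed out)."""
    return sum(v for (c, t, k), v in joint.items()
               if (cavity is None or c == cavity)
               and (toothache is None or t == toothache)
               and (catch is None or k == catch))

# Check P(toothache, catch | cavity) == P(toothache | cavity) * P(catch | cavity)
for cav in (True, False):
    lhs = p(cavity=cav, toothache=True, catch=True) / p(cavity=cav)
    rhs = (p(cavity=cav, toothache=True) / p(cavity=cav)) * \
          (p(cavity=cav, catch=True) / p(cavity=cav))
    assert abs(lhs - rhs) < 1e-9
```

With these numbers, \(P(toothache, catch \mid cavity) = 0.54 = 0.6 \times 0.9\), so the factorization holds in both conditioning cases.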
Bayesian Networks: Example
- New burglar alarm installed at home
- Fairly reliable at detecting a burglary
- Sometimes it also responds to minor earthquakes
- There are also 2 neighbors: John and Mary
- They promised to call you at work when they hear the alarm
- John almost always calls when he hears the alarm
- Sometimes he confuses the phone ringing with the alarm and also calls
- Mary often listens to loud music and misses the alarm
- Given the evidence (of who has or has not called)
- We want to estimate the probability of a burglary
Bayesian Networks: Example
Bayesian Networks: Example
- The network structure shows that
- Burglary and earthquakes directly affect the probability of the alarm’s going off
- John and Mary calling depends only on the alarm
- Network represents the assumptions that they don’t
- Directly perceive burglaries
- Notice minor earthquakes
Bayesian Networks: Example
- Conditional Probability Tables or CPTs
- The conditional distributions
- Each row contains the conditional probability of each node value for a conditioning case
- A possible combination of values of the parent nodes
- i.e. a possible world
- Each row sums to 1 (entries represent an exhaustive set of cases for the variable)
- A node with no parents has a table with a single row: the prior probabilities of each possible value of the variable
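As a concrete sketch, the CPT for the \(Alarm\) node can be stored one row per combination of parent values (\(Burglary\), \(Earthquake\)). Of the numbers below, only \(P(a \mid \neg b, \neg e) = 0.001\) appears later in the text; the other entries are assumed for illustration.

```python
# CPT for Alarm: (burglary, earthquake) -> (P(alarm), P(~alarm)).
# Only the (False, False) row's 0.001 comes from the text; the rest
# are illustrative values.
alarm_cpt = {
    (True,  True):  (0.95,  0.05),
    (True,  False): (0.94,  0.06),
    (False, True):  (0.29,  0.71),
    (False, False): (0.001, 0.999),
}

# Each row sums to 1: its entries exhaust the values of the variable.
for row in alarm_cpt.values():
    assert abs(sum(row) - 1.0) < 1e-9
```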
Bayesian Networks: Example
- Conditional Probability Tables or CPTs
- The network omits information that is irrelevant or already captured by other variables
- Laziness and ignorance in operation
- The probabilities summarize an infinite set of circumstances in which the alarm might fail to go off
- High humidity, power failure, dead battery, cut wires, and so on…
- This is how an agent can deal with a very large world
- Mary listening to music (already captured in the CPT, link from Alarm to MaryCalls)
- The phone ringing confusing John (already captured in the CPT, link from Alarm to JohnCalls)
- Other, i.e. irrelevant information
The Semantics of Bayesian Networks
- The meaning of a Bayesian network
- See the network as a representation of the joint probability distribution
- Useful for understanding how to construct networks
- See the network as an encoding of a collection of conditional independence statements
- Useful in designing inference procedures
Representing the Full Joint Distribution
- We define the way in which the network represents a specific joint distribution over all the variables
- \(P(x_1, \dots, x_n) = \prod\limits_{i=1}^n P(x_i | parents(X_i))\).
- Probability that the alarm has sounded but neither a burglary nor an earthquake has occurred, and both John and Mary call
- Multiply entries from the joint distribution
- \(P(j,m,a,\neg b, \neg e) = P(j|a)P(m|a)P(a| \neg b \land \neg e)P(\neg b)P(\neg e)\)
- \(= 0.90 \times 0.70 \times 0.001 \times 0.999 \times 0.998 = 0.000628\)
- The global semantics of the network defines the full joint distribution as the product of the local conditional distributions
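The worked example above is just a product of numbers taken from the network's CPTs, all of which appear in the text:

```python
# P(j, m, a, ~b, ~e) = P(j|a) P(m|a) P(a|~b,~e) P(~b) P(~e)
p_j_given_a = 0.90
p_m_given_a = 0.70
p_a_given_not_b_not_e = 0.001
p_not_b = 0.999
p_not_e = 0.998

p = (p_j_given_a * p_m_given_a * p_a_given_not_b_not_e
     * p_not_b * p_not_e)
assert abs(p - 0.000628) < 1e-5   # matches the 0.000628 in the text
```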
Representing the Full Joint Distribution
- We use the full joint distribution to answer any query about the domain
- If the BN represents the joint distribution, it can also be used to answer any query
- Summing all relevant joint entries
- There are several methods to do this efficiently
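The simplest such method is inference by enumeration: sum the joint entries consistent with the query and the evidence, where each entry is the product of local conditionals. The sketch below computes \(P(Burglary \mid JohnCalls, MaryCalls)\); the CPT entries other than \(P(a \mid \neg b, \neg e)\), \(P(j \mid a)\), \(P(m \mid a)\) and the priors implied by the text are assumed for illustration.

```python
from itertools import product

# Priors implied by the text (P(~b)=0.999, P(~e)=0.998); other CPT
# values are illustrative assumptions.
P_b, P_e = 0.001, 0.002
P_a = {(True, True): 0.95, (True, False): 0.94,
       (False, True): 0.29, (False, False): 0.001}
P_j = {True: 0.90, False: 0.05}   # P(JohnCalls=True | Alarm)
P_m = {True: 0.70, False: 0.01}   # P(MaryCalls=True | Alarm)

def pr(p_true, value):             # P(X=value) given P(X=True)
    return p_true if value else 1.0 - p_true

def joint(b, e, a, j, m):
    """One joint entry as the product of the local conditionals."""
    return (pr(P_b, b) * pr(P_e, e) * pr(P_a[(b, e)], a)
            * pr(P_j[a], j) * pr(P_m[a], m))

# P(Burglary | j, m): sum out Earthquake and Alarm, then normalize.
num = {b: sum(joint(b, e, a, True, True)
              for e, a in product([True, False], repeat=2))
       for b in (True, False)}
p_burglary = num[True] / (num[True] + num[False])
```

With these numbers the query evaluates to roughly 0.284: even with both neighbors calling, a burglary is far from certain.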
A Method for Constructing Bayesian Networks
- We need to construct the BN so that the resulting joint distribution is a good representation of a given domain
- The equation \(P(x_1, \dots, x_n) = \prod\limits_{i=1}^n P(x_i | parents(X_i))\) implies some conditional independence relationships
- We can use this information in constructing the topology of the network
A Method for Constructing Bayesian Networks
- We rewrite the entries in the joint distribution in terms of conditional probability using the product rule
- \(P(x_1, \dots, x_n) = P(x_n | x_{n-1}, \dots, x_1)P(x_{n-1}, \dots, x_1)\)
- Repeating the process, we reduce each conjunctive probability to a conditional probability and a smaller conjunction
- We end with this product
- \(P(x_1, \dots, x_n) = P(x_n|x_{n-1}, \dots, x_1)P(x_{n-1}|x_{n-2}, \dots, x_1) \dots P(x_2|x_1)P(x_1)\)
- \(= \prod\limits_{i=1}^n P(x_i|x_{i-1}, \dots, x_1)\).
- This identity is called the chain rule
A Method for Constructing Bayesian Networks
- continues…
- This identity is called the chain rule
- Holds for any set of random variables
- This is equivalent to saying that
- \(P(X_i|X_{i-1}, \dots, X_1) = P(X_i|Parents(X_i))\) (3)
- given that \(Parents(X_i) \subseteq \{X_{i-1}, \dots, X_1\}\)
- We can satisfy this condition by numbering the nodes in a way consistent with the partial order implicit in the graph structure
A Method for Constructing Bayesian Networks
- Equation (3) says that
- The Bayesian Network is a correct representation of the domain
- Only if each node is conditionally independent of its other predecessors in the node ordering, given its parents
- We can satisfy this condition with the following methodology
A Method for Constructing Bayesian Networks
- Nodes:
- Determine the set of variables required to model the domain
- Order the variables: \(\{X_1, \dots, X_n\}\)
- Any order works, but the resulting network will be more compact if the variables are ordered so that causes precede effects
- Links: For \(i = 1\) to \(n\) do:
- Choose from \(X_1, \dots, X_{i-1}\), a minimal set of parents for \(X_i\), satisfying equation (3)
- For each parent insert a link from the parent to \(X_i\)
- CPTs: Write the conditional probability table, \(P(X_i | Parents(X_i))\).
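The parent-selection step above can be brute-forced on a small domain: for each variable in the chosen order, try subsets of its predecessors from smallest to largest until one satisfies equation (3). The joint distribution below reuses the dentist variables with illustrative numbers (assumed, not from the text).

```python
from itertools import combinations, product

names = ("cavity", "toothache", "catch")
# Assumed full joint: (cavity, toothache, catch) -> probability.
joint = {
    (True,  True,  True):  0.108, (True,  True,  False): 0.012,
    (True,  False, True):  0.072, (True,  False, False): 0.008,
    (False, True,  True):  0.016, (False, True,  False): 0.064,
    (False, False, True):  0.144, (False, False, False): 0.576,
}

def p(**fixed):
    """Marginal probability of a partial assignment, e.g. p(cavity=True)."""
    return sum(v for assign, v in joint.items()
               if all(assign[names.index(k)] == val for k, val in fixed.items()))

def cond(var, given):
    """P(var=True | given)."""
    return p(**{var: True, **given}) / p(**given)

def minimal_parents(i):
    """Smallest predecessor subset satisfying equation (3) for names[i]."""
    var, preds = names[i], names[:i]
    for size in range(len(preds) + 1):          # smallest subsets first
        for subset in combinations(preds, size):
            if all(abs(cond(var, dict(zip(preds, vals)))
                       - cond(var, {k: v for k, v in zip(preds, vals)
                                    if k in subset})) < 1e-9
                   for vals in product([True, False], repeat=len(preds))):
                return subset
```

Here `minimal_parents(2)` returns `("cavity",)`: given \(Cavity\), the variable \(Catch\) is independent of \(Toothache\), so \(Cavity\) alone suffices as its parent set.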
A Method for Constructing Bayesian Networks
- Intuitively
- Parents of \(X_i\) should contain all those nodes in \(X_1, \dots, X_{i-1}\) that directly influence \(X_i\)
- Consider the node MaryCalls
- It is influenced by Burglary and Earthquake, but not directly: only through their effect on the Alarm
- Given the state of the alarm, JohnCalls doesn’t influence MaryCalls
A Method for Constructing Bayesian Networks
- Formally
- We think the following conditional independence statement holds
- \(P(MaryCalls | JohnCalls, Alarm, Earthquake, Burglary) = P(MaryCalls | Alarm)\).
- Then, Alarm is the only parent node for MaryCalls
- Each node is connected only to earlier nodes
- Guarantees that the network is acyclic
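This acyclicity guarantee is easy to check: if every node's parents come earlier in the ordering, no directed cycle can exist. A minimal sketch with the alarm network's structure (the structure itself is from the text):

```python
# Causal ordering and parent sets of the burglar-alarm network.
order = ["Burglary", "Earthquake", "Alarm", "JohnCalls", "MaryCalls"]
parents = {
    "Burglary": [], "Earthquake": [],
    "Alarm": ["Burglary", "Earthquake"],
    "JohnCalls": ["Alarm"], "MaryCalls": ["Alarm"],
}

pos = {v: i for i, v in enumerate(order)}
# Every edge points from an earlier node to a later one => no cycles.
assert all(pos[par] < pos[v] for v, ps in parents.items() for par in ps)
```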
Compactness and Node Ordering
- The compactness of a Bayesian Network stems from the fact that most domains are locally structured (sparse) systems
- In a locally structured system
- Each subcomponent interacts directly with a bounded number of other components
- No matter the total number of components
- Local structure is usually associated with linear (rather than exponential) growth in complexity
Compactness and Node Ordering
- In a BN it is reasonable to suppose that each node is influenced by at most \(k\) others
- Assuming boolean variables (for simplicity)
- The amount of information needed to specify each CPT will be at most \(2^k\) numbers
- The complete network could be specified with \(n2^k\) numbers
- In contrast with the joint distribution that contains \(2^n\) numbers
- Suppose we have \(n = 30\) nodes, each with 5 parents (\(k = 5\))
- The BN requires 960 numbers
- The full joint distribution requires over a billion numbers
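The counting argument from the slide, as a two-line check:

```python
# With n boolean variables and at most k parents each, the network
# needs at most n * 2^k numbers versus 2^n for the full joint.
n, k = 30, 5
bn_numbers = n * 2**k       # 30 * 32 = 960
joint_numbers = 2**n        # 2^30 = 1,073,741,824 -- over a billion
assert bn_numbers == 960 and joint_numbers > 10**9
```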
Compactness and Node Ordering
- If we choose the ordering: MaryCalls, JohnCalls, Alarm, Burglary, Earthquake
- Add MaryCalls: no predecessors, so no parents
- Add JohnCalls:
- If Mary calls, the alarm has probably gone off, which makes it more likely that John calls
- JohnCalls needs MaryCalls as parent
- Add Alarm:
- If both call, it is more likely that the alarm has gone off than if only one calls
- We need both MaryCalls and JohnCalls as parents
Compactness and Node Ordering
- If we choose the ordering: MaryCalls, JohnCalls, Alarm, Burglary, Earthquake
- continues…
- Add Burglary:
- Knowing the alarm state, a call from John or Mary can give us information about our phone ringing or Mary’s music, but not about the burglary
- \(P(Burglary | Alarm, JohnCalls, MaryCalls) = P(Burglary | Alarm)\)
- We need Alarm as parent
- Add Earthquake:
- If the alarm is on, it is more likely that there has been an earthquake
- But if there has been a burglary, that explains the alarm, and the probability of an earthquake would only be slightly above normal
- We need both Alarm and Burglary as parents
Compactness and Node Ordering
- The network structure depends on the order in which we consider the nodes: an ordering with causes before effects yields a sparser, more compact network
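The two orderings can be compared by counting CPT parameters (one number, \(P(X{=}true)\), per parent combination of each boolean variable). The parent sets follow the analysis in the slides above:

```python
# CPT parameter count: each node with k parents needs 2^k numbers.
def cpt_size(parent_counts):
    return sum(2**k for k in parent_counts)

# Causal ordering: Burglary, Earthquake, Alarm(B,E), JohnCalls(A), MaryCalls(A)
causal = cpt_size([0, 0, 2, 1, 1])        # 1 + 1 + 4 + 2 + 2 = 10
# Ordering above: MaryCalls, JohnCalls(M), Alarm(M,J), Burglary(A),
# Earthquake(A,B)
diagnostic = cpt_size([0, 1, 2, 1, 2])    # 1 + 2 + 4 + 2 + 4 = 13
assert (causal, diagnostic) == (10, 13)
```

The causal ordering needs 10 numbers versus 13 for the diagnostic one; with more variables and denser parent sets, the gap grows quickly.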