## Bayes Theorem in practice: Kolmogorov Complexity

**This is the second post in a sequence titled “Basic concepts in the formalisation of thought.”**

In the previous post we looked at Bayes Theorem, an equation as below:

Basically, what this does is allow us to take a prior probability and produce a posterior probability given some evidence. So, for example, if we were trying to determine the probability the sun would rise tomorrow, we would start with our prior probability (very high) but then, if we learnt that an alien fleet was determine to destroy it tonight, we could use Bayes Theorem to recalculate this probability based on that new evidence.

In the equation, P(A) represents the prior probability and if we don’t know this figure then it becomes impossible to apply the equation. This post is going to explore Kolmogorov Complexity, one suggestion for how we can reason under these conditions.

**What is Kolmogorov Complexity?**

Kolmogorov Complexity is an attempt to formalise the lose concept of how complex a string is. It is defined as the length of the shortest program that can produce the string. Of course, the length on this program will differ depending on which programming language you use to implement it. However, it can be shown that the difference in this length can be no more than a constant.

To explain, let’s imagine that we originally used the programming language c++ to determine the complexity of a string, S. Now let’s say that instead we use java to do the same job. We could choose to write a c++ interpreter in java and then run the original program. That means that, at most, the complexity of the string in java can be equal to its complexity in c++ plus the length of the interpreter. So, while the choice of language can lead to different Kolmogorov Complexities, this difference is bounded.

**Kolmogorov Complexity and Bayes Theorem**

Ockham’s Razor famously says that, “All else being equal, the simplest theory is the best”, or at least something along those lines and this is the basic idea behind use Kolmogorov Complexity in Bayes Theorem. Our problem with Bayes Theorem was circumstances where we did not know the prior probability of a statement. Ockham’s Razor provides a response – the likelihood of the prior should be dependent on the simplicity of the theory. And Kolmogorov Complexity gives us a way to formalise this.

So the idea, under circumstances where no more information is available, is to define the prior, P(m), as:

2^{-K(m)}

Which has the desired format. Actually, this equation will not ensure probabilities sum to 1 and so, instead, we use what is called prefix-Kolmogorov complexity instead of standard Kolmogorov complexity. This prefix-Kolmogorov Complexity, L(m), is calculated as follows:

L(m) = k(m) + Log_{2}K(,)

And the prior probability is then calculated as:

2^{-L(m)}

In this case, the probabilities do add to one.

**Next post**

The next sequence of posts will explore causal discovery – that is, finding causal information from data.

## Bayes Theorem and the Monty Hall Problem

**This is the first post in a sequence titled “Basic concepts in the formalisation of thought.”**

Having written a few posts on Causality I’m now going to write a short sequence exploring basic concepts in the formalisation of thought. This first post will deal with Bayes Theorem. However, because there are already good explanations of Bayes Theorem (for example, this one by Eliezer Yudkowsky and this one on community blog Less Wrong) I’m not going to do a standard basic introduction but will instead apply Bayes Theorem to a well known puzzle, the Monty Hall problem, to show how it can aid with probabilistic reasoning.

**What is Bayes Theorem**

Bayes Theorem is simply an equation that allows you to calculate conditional probabilities. That is, the probability of A, given B. The equation is:

Which just means that, for example, the probability that I’m hungry given that my belly is rumbling is equal to the number of times that I’m hungry and my belly rumbles out of the number of times that my belly rumbles which is intuitively what we would expect it to say.

There are a number of variants of this formula, but the one that matters for us is Bayes Theorem under the condition that you are given not just one piece of information but two. That formula is as follows:

**What is the Monty Hall Problem?**

You’re on a quiz program and are faced with three doors. A door has been selected at random before the show and a car placed behind it. The other doors have nothing behind them. You are asked to choose a door.

Once you have done so, the host chooses, at random, one of the other doors. He never chooses the one with the car behind it. He opens the door and then asks whether you would like to change your answer?

Should you do so?

**An intuitive response to the Monty Hall Problem?**

The initial response is normally “No” or, more to the point, “Who cares. Whatever door I choose there’s a 1/3 chance of winning so why both changing.”

The answer, however, is actually “Yes.” Think about it like this: You had a 1/3 chance of choosing the winning door in the first place. Now, if you change, you will lose. You had a 2/3 chance of choosing a losing door in the first place. Now, if you change, you will win (because the other door has been opened to reveal no prize so you will inevitably change to the winning door). So 2/3 of the time you win by changing, 1/3 of the time you don’t.

**What information do you gain?**

The reasoning is easy enough to understand. What’s harder is to figure out what new piece of information the presenter gave you by opening the door. It becomes a bit more clear when we consider the following situation: If the host had simply pointed at a door at random, regardless of whether it was the winning door, and not opened it, then changing doors would not be beneficial. So the information is given by the fact that he cannot open the winning door.

So imagine a different situation: The host says you can either guess the door straight up or, before you do, he will point at the door. He will tell you the correct door 2/3 of the time and lie and tell you the incorrect one 1/3 of the time. Do you simply guess or do you wait for him to tell you and then choose the door he points out? If you guess you have a 1/3 chance of winning, if you wait, he is telling the truth 2/3 of the time. So you wait.

The new piece of information in the first case is similar to that: By not choosing a door, the host is giving you information that that is the correct door. It just so happens that this information is only true 2/3 of the time but, it’s still additional information on those occasions.

To put it another way: Even if you were told you’d picked a wrong door, you would still not know which of the other two doors the prize was behind. But after the other wrong door was selected, you would. So his actions in choosing a door give you new information if you knew you’d picked the wrong door. And given that you know you’re 2/3 likely to have picked the wrong door and only 1/3 likely to have picked the right, this information is of use,

**Bayes Theorem and the Monty Hall Problem**

All of this is well and good in relation to the specific problem but, unless you got it right the first time you heard it, what it has revealed is that there is a flaw in the way that you process probabilistic information. And this flaw isn’t necessarily dissolved simply by understanding one circumstance under which it was revealed. This is where Bayes Theorem comes in. Bayes Theorem is an approach that increases your general skills of probabilistic reasoning.

So Bayes theorem, remember, is about conditional probability: The probability that A is true given the evidence B.

Bayes Theorem makes conditioning on the situation (right door/wrong door) explicit. It would still be possible to know Bayes Theorem but not think to use it in this case. But if you approach probabilistic questions with Bayes Theorem in mind then you’re explicitly primed to ask whether there is a conditional approach to the problem. Here’s a Bayes Theorem based approach to the Monty Hall Problem.

So what we want to discover is the probability that the unselected and unopened door contains the prize given the information of the choice and the opening of door.

We now calculate the various components:

Because the host never selects the door you choose or the one with the prize.

Because the position of the prize had an initial probability of 1/3 and this doesn’t change based on the choice.

Now plug these in:

That’s the right answer (the probability that the prize is behind the door that has not been chosen or opened is 2/3 so you should change your choice)

But the important thing about Bayes Theorem isn’t that it gets the right answer, it’s that its use improves your ability to reason with probability so you’re less likely to get other probability questions wrong in the future. Simply by plugging in what you want to know and what you do know, this equation can improve your abilities and not just give you the answer to a specific question.

**The limits of Bayes Theorem**

Bayes Theorem is more than a solution to a specific problem – it’s a general technique for reasoning with probability. That’s not to say it’s a solve all solution. Some problems have the complexity in calculating those components above and, in those cases, Bayes Theorem won’t help. However, it’s a big step toward being able to solve problems where the components are simple but it’s the coming together of these which is complicated (see Greg Egan’s website for an example of a problem that Bayes Theorem may not help you with).

**The next post is Bayes Theorem in practice: Kolmogorov Complexity**

## Basic concepts in the formalisation of thought: Sequence index

This sequence will explore basic concepts in the formalisation of thought especially as it related to logic, probabalistic reasoning and inductive reasoning. Most of these posts will be able to be read alone.

Post 1: Bayes Theorem and the Monty Hall Problem – An introduction to Bayes Theorem, one of the most important equations in probabalistic reasoning. Applied to the Monty Hall Problem and with links to other more thorough introductions to the theory.

Post 2: Bayes Theorem in practice: Kolmogorov Complexity – Bayes Theorem takes a prior probability of something being true and modifies it based on new evidence. But what do you do if you have no prior probability to use in the equation?

## Uses of Bayes Nets in causality: Modeling type and token causal relevance

**This is post 6 in a sequence exploring formalisations of causality entitled “Reasoning with causality”. This post continues to summarise the paper “Causal Reasoning with Causal Models”.**

**The previous post in the sequence is Is a causal interpretation of Bayes Nets fundamental?**

Causal Relevance, as opposed to causal role, is simply interested in whether there is a causal relationship between a cause and an effect, not whether the cause increases or decreases the chance of the effect. “Causal Reasoning with Causal Models” attempts to find a way of capturing and formalizing causal relevance.

**Type causal relevance**

Type causality looks at relationships between general categories – does smoking cause cancer. Token causality looks at a specific instance – did smoking cause cancer in this patient.

In relation to capturing type causal relevance, the first act seems to be to look at the relationship under intervention. If we change the amount someone in general smokes, does this change their chance of getting cancer? If so, smoking is causally relevant to cancer. Unfortunately, it’s not this simple. Imagine the following case: Taking the pill increases the chance of getting thrombosis but decreases the chance of getting pregnant. Getting pregnant increases the chances of getting thrombosis. Further, imagine that these two effects balance perfectly: So whatever dosage of pill you take, the increased chance of getting thrombosis due to this is directly countered by the decreased chance of getting pregnancy caused thrombosis. However, we still want to say that the pill is causally relevant thrombosis.

The problem with the above scenario is that we want to consider the component causal effects rather than the net causal effects. This is achieved by looking at each path from a cause to an effect one at a time (and blocking the other paths while doing so). If there is a probabilistic dependence between any of the paths then the factor is causally relevant. So, if we block the path from the pill via decreased chance of pregnancy then there is plainly a probabilistic dependence between the pill and thrombosis via the other path.

**Token causal relevance: First attempt**

Extending this account to token causality is a little harder. Imagine a hiker walking up a hill when a boulder is dislodged. They duck and so survive. If the boulder had not fallen, they also would have survived. In terms of type causality, this sort of event can cause death so we want to see the boulder falling as causally relevant to survival. This is achieved with the analysis above. However, from a token perspective, this boulder fall was not causally relevant.

A possible criteria: A is causally relevant to B if for all paths from A to B there is a probabilistic dependence between A and B if the other paths are set at their observed file.

So in the boulder example, if we set the value of the variable duck to true then the boulder falling is not causally relevant to survival – just as we were hoping.

However, this criteria doesn’t solve all problems of this type: Imagine Suzy throws a rock at a bottle and a second late Billy does the same. Suzy’s rock breaks the bottle but there’s no probabalistic dependence because if she didn’t throw a rock the bottle would still have been broken (by Billy). This doesn’t change if we set the other paths to their observed value (ie. we set Billy throwing to true) But we do want her throw to have token causal relevance to the bottle breaking.

**Token causal relevance: A solution**

This can be solved if we have a more complex criteria for token causal relevance:

1.) The causal model is correctly constructed – I won’t go into all the details here but this means the model must meet certain criteria (like all variables must be intervenable)

2.) The context should be fixed according to observation (so fix the value of Billy throwing to true)

3.) All paths from cause to effect must meet certain criteria including that they must be type causally relevant (there is some value of the variable which will induce a probabalistic dependence with other paths blocked). Other paths should be removed.

Now we check, without blocking other pathways from the cause to the effect, whether there is a probabalistic dependence.

So in the above example we construct the model as per 1:

We now fix the context so both Suzy throwing and Billy throwing are set to true. Now with the context of Suzy Throwing being set to true, there is no longer a probabalistic dependence between Billy throws and bottle breaks. So we remove the path.

Now the probabalistic dependence is clear. The same solution can also be used to solve a variety of scenarios which I haven’t had the space to go into here.

Thus we have a solution for how to formalise both type and token causal relevance.

**The next post is Probabilistic causality**

## Is a causal interpretation of Bayes Nets fundamental?

**This is post 5 in a sequence exploring formalisations of causality entitled “Reasoning with causality”. This post continues to summarise the paper “Causal Reasoning with Causal Models”.
**

**The previous post in the sequence is An introduction to Bayesian networks in causal modeling**

Is a causal interpretation of Bayes Nets fundamental in some way or is it simply accidental. To what extent can Bayes Net be considered to be suitable causal models?

**Bayes Nets as representations of probability distributions**

A common argument that the causal interpretation of Bayes Nets isn’t fundamental is that Bayes Nets are designed to represent probability distributions. And given any Bayes Net with a causal interpretation, another one can be generated that represents the same probability distribution but does not have the causal interpretation. On the surface then, the Bayes Net which is a causal model seems no more fundamental than the other.

Chickering’s Arc Reversal Rule can be used to support the above assertion. This rule basically just reverses the direction of all the arrows. A technical addition to how this works means that the rule can introduce arrows but not remove them. From this, there seems to be an obvious way that Bayes Nets do fundamentally represent causality: The causal model is the one with least arrows that still manages to capture the relevant probability distribution.

Unfortunately, this doesn’t work.

**Why the simplest Bayes Net isn’t the causal model**

There are circumstances under which the causal model is not the simplest Bayes Net that captures the probabalistic dependencies. Imagine the following situation: Sunscreen decreases instances of cancer but increases the time people spend in the sun (because they feel safer) which then increases the chance of cancer (modelled as below):

Now imagine that the two effects perfectly balanced so that sunscreen made no difference as to whether you got sunscreen. The simplest model that would capture the related probabality distribution is far simpler than that above. It looks like this:

The only problem is, this doesn’t represent the causal system. So the causal model cannot be the simplest one which captures the probability distribution.

**Augmented Causal Simplicity**

Which leads to a more complicated conjecture: A causal model is the one with the least arrows that still manages to capture the relevant fully augmented probability distribution. In a basic sense, the fully augmented probability distribution simply means captures the probabilities regardless of how we intervene in the model. So say we intervene in the above model to set the amount of time spent in the sun. In the first graph, this means sunscreen no longer changes the amount of time spent in the sun but it does change the chance of getting cancer. In the second model, changing the time spent in the sum will not capture this probabalistic relationship.

So by demanding that a model capture the probability distribution under all possible interventions, the causal model can still be said to be the simplest that captures the distribution. This then implies that Bayes Nets representation of causality are fundamental.

**The next post in the sequence is Modeling type and token causal relevance**

## An introduction to Bayesian networks in causal modeling

**This is post 4 in a sequence exploring formalisations of causality entitled “Reasoning with causality”**

**The previous post was Reasoning with causality: An example**

The remainder of this sequence is going to depart from the path previously indicated (continuing to read Pearl’s *Causality*) and will instead explore the use of Bayesian networks in causal modeling (in doing so, it will also discuss probabalistic vs determinstic causality) by summarising a technical report titled *Causal Reasoning with Causal Models* (Which should link to, http://www.csse.monash.edu.au/~korb/pubs/techrept.pdf). This first post will introduce the use of Bayesian networks in causal modeling.

**What are Bayesian Networks**

A Bayesian Network is a directed acyclical graph (DAG) and their conditional probabilities. What does this mean? Well take the graph (series of nodes connected by edges) below:

1.) Directed simply means it has arrows pointing out from each node.

2.) Acylical means that if you follow the arrows of any path you will never return to your starting point.

So now we know what the DAG aspect of the definition above means. How about the fact that it captures their conditional probabilities. While that simply means that the Bayesian Network (or Bayes Net) also captures the way the variables are conditionally dependent on one another. Take the central node (“grass wet”). Let’s define the conditional probabilities for this node in the form of a table:

Sprinkler | Rain | Grass wet | Grass dry | |

T | T | 1 | 0 | |

T | F | 0.7 | 0.3 | |

F | T | 0.9 | 0.1 | |

F | F | 0 | 1 |

So this says that if it rained and the sprinkler was on last night, the grass will definitely be wet. If just the sprinkler was on, there’s an 0.3 chance that the grass has dried by now and so on.

So a Bayes Net is a DAG combined with the conditional probabilities that hold between the nodes. Bayes nets are used because they make many problems more computationally cheap (if the list of relevant conditional probabilities is small enough).

**Bayesian Networks and conditional dependencies**

If there are no arrows between nodes in a Bayes Net then this means they must be probabalistically independent. Which is to say:

P(A | B) = P(A)

Conditional independence is an extension of this to situations where a third variable induces independence. So, in the above graph, rain and the paper being wet are conditionally independent because they are “screened off” by the grass being wet.

There are two properties related to this:

1.) A Bayes Net is said to have the Markov Property if all of the conditional independences in the graph are true of the system being modelled (the Markov Property simply means that the future state of a system depends only on the present state and not on the past ones. This is clearly related to conditional independence). It can also be called an independence-map or I-map of the system.

2.) If all of the dependencies in the Bayes Net are true in the system then it’s said to be faithful (or a D-map, Dependence Map) to the system.

If a Bayes Net has both of these properties then it’s said to be a perfect map. Generally in Bayes Nets, we want the I-map to be minimal (such that if we removed any nodes, it would no longer be an I-map).

**Bayesian Networks and Causality**

A Bayes Net becomes a causal model if each of the arrows in the graph can be given a causal interpretation. Each arrow then represents direct causality. Indirect causality is based on paths that always follow arrows. So in the above graph, rain is indirectly related to the paper being wet via the path (rain -> grass wet -> paper wet). Sprinkler being on is not indirectly causally related to rain because the path from sprinkler to rain does not always follow the direction of the arrows.

A path from X to Y is blocked by Z if X and Y are conditionally independent given Z (at least under some basic assumptions). If the graph has the Markov Property than the equivalent procedure to blocking is d-seperation which is best explained as follows. Given two variables, A and B there are four ways that a path can go from A to B through C in an undirected graph:

1.) They can go via a chain A->C->B. In this case A and B are d-seperated given a set of nodes Z if C is in Z (if you know C, knowing A won’t tell you anything more about B).

2.) They can go via a chain B->C->A (reasoning as above)

3.) They can both be caused by C. In this case, A and B are d-seperated given Z is C in in Z (once again, knowing C means that knowing A or B tells you nothing more about the other variable).

4.) They can both cause C. In this case if C isn’t in Z then there is no dependence between A and B by default. They may both cause C but knowing one doesn’t tell you anything about the other. On the other hand, if C is in Z then A and B are causally related. Look at the graph earlier – here if we know whether the grass is wet then there is a conditional relationship between rain and sprinkler – namely that if it didn’t rain the sprinkler must have been on and vice versa. So A and B are d-desperated given Z if C is not in Z.

All up, A and B are d-seperated given Z if all paths from A to B meet one of the above criteria.

**Conclusions**

This post has explored Bayesian Networks and how these can be used as causal models. In the following posts it will explore objections to the use of Bayes Nets to represent causality, the debate over probablistic vs deterministc interpretations of causality and the difference between type and token causality.

**The next post in the sequence is **

**Is a causal interpretation of Bayes Nets fundamental?**

## Reasoning with causality: An example

**This is post 3 in a sequence exploring formalisations of causality entitled “Reasoning with causality”**

**The previous post was “A causal calculus: processing causal information”
**

The last two posts have introduced a graphical method and a calculus for discussing causality. This post will demonstrate how these can be used for causal reasoning by following one of Pearl’s examples – an exploration of the causal relationship between smoking and lung cancer in a deliberately simplified world as follows: *Smoking causes lung cancer via the intermediary of building up tar deposits in the lungs. There is also a genetic feature that both increases the chance of developing cancer and increases the chance of one smoking.*

The first thing we’re going to need to think about this is the relevant causal graph which we can see showing both of these two causes of cancer. The question we now need to determine is the strength of these links – basically, what is the probability of getting cancer due to do(smoking)?:

So we’re trying to discover the probability of cancer giving do(smoking), or:

The rest of the proof will use the rules of do() introduced in the last post to remove the do() statement so that the problem can be resolved with the normal rules of probability (the first few steps will be explained explicitly but, if you want to follow after that you should have the tools to work out the steps for yourself).

**Step 1**

The first step is to state that the above equation is equivalent to:

The justification for this is due to the axioms of probability, as follows:

Any probability P(a) can instead be thought of as a sum of the probabilities of exhaustive, mutually exclusive events. So, for example: *A school has four classes of 25 students. All 100 students are currently gathered in a hall for a meeting. They are mixed up at random. What is the probability that a student selected at random is the tallest in their class?*

* *

You can reason as follows: There are four tallest students (because there are four classes) and 100 students so the probability is 4/100. Or you could think in terms of classes (which are mutually exclusive – no students are in two classes – and exhaustive – all students are in a class). In each class there are 25 students so you can say: What’s the probability of this student being in class 1: ¼ and what’s they’re probability of being the tallest in their class: 1/25. So what’s the probability that the student is the tallest in class 1: 1/100. You could then find the same values for the other three classes and sum these values together to once again get 4/100.

Similarly, the probability of cancer given do(smoking) is equivalent to the sum, for all possible tar levels, of all the probability of cancer given do(smoking) and given a certain level of tar in the lungs multiplied by the probability of that level of tar being in the lungs given smoking (if that’s not clear, think of it in terms of the school student example).

**Step 2**

** **

The next step is to show that this is equal to:

** **** **

** **

This is simply a use of Pearl’s second rule (you may want to have the rules post open for reference). However, Pearl’s rules can only be used if a certain precondition is met. You may have noticed that each of the preconditions in the previous post ended with something like:

This notation tells you what graph the precondition must be met in. So rather than seeing if it applies to the graph we drew above, we see if it applies to a subgraph of this defined such that if there’s a hat above the letter, all edges going into the letter are removed and, if there’s a hat below, all edges going out of the letter are removed. So in the above example all arrows going into X and out of Z are removed. Which in our example above means the preconditions must be met in the following graph:

The precondition is that the rule can only be used if c and t are conditionally independent given do(s) (which is to say that if you know do(s), knowing t as well won’t change your probability for c). In the above graph this is plainly true as there is no causal link from t to c.

** **

**Remaining steps**

** **

The rest of the proof proceeds as follows, ending with a probability equation that does not contain any do() operators. Each step of this proof takes place in a similar way to those above. One of Pearl’s rules for do() are applied after the appropriate subgraph is determined to meet the required criteria:

** **** **

** **

**Conclusion**

** **

This is where the graphical and calculus based approaches to causality come together and begin to allow us to reason with causal information that is given to us. In *Causality*, Pearl’s proof is much shorter and to the point. Here I’ve tried to provide more detail on the first few steps to make the proof clear but for those who worry that they’re drowning in detail, read Pearl’s presentation and see if you find it any more clear.

So far I have been focusing on how causal information can be manipulated once you have it using the do() calculus and Pearl’s graphical methods. In the next lot of posts, I will be exploring more about how Bayesian Networks can be used as causal models.

**The next post in the sequence is An introduction to Bayesian networks in causal modeling**