## Categories and types of intervention

**This is post 8 in a sequence exploring formalisations of causality entitled “Reasoning with causality”.**

This post will summarise the paper Varieties of Causal Intervention and, by doing so, will explore what a causal intervention is. To start with, imagine a Bayes Net representing a causal process. Let’s say that cleaning your teeth and your genes both affect the chance of dental decay, but that the same genes also influence the chance that you’ll clean your teeth:

Now imagine there’s a government policy being considered whereby policemen would enforce tooth cleaning. To analyse the effects of this we would intervene on “Clean”, which we can think of as setting the value of the variable and ignoring the influence of any parents, and we would then observe the effects of the intervention on decay.

**Intervention vs Observation**

This concept of intervention is different from observation. Imagine, for example, that we observed someone cleaning their teeth. This may tell us something different from intervening to make them clean their teeth, because it suggests they’re more likely to have certain genes, and these genes also make it less likely that their teeth will decay. Thus, in this case, observing tooth cleaning would predict a larger reduction in decay than intervening to bring about tooth cleaning would.

The important point is that the two concepts are different because intervention surgically removes the context of any parent nodes.
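The gap between the two can be checked with a little arithmetic. The sketch below uses made-up probabilities for the genes, cleaning and decay nodes (none of these numbers come from the paper) and computes the chance of decay once when cleaning is observed and once when it is set by intervention.

```python
from fractions import Fraction as F

# Hypothetical numbers, chosen only for illustration.
P_GOOD_GENES = F(1, 2)
# P(clean | good_genes): good genes make cleaning more likely.
P_CLEAN = {True: F(4, 5), False: F(1, 5)}
# P(decay | clean, good_genes), keyed by (clean, good_genes).
P_DECAY = {
    (True, True): F(1, 10),
    (True, False): F(3, 10),
    (False, True): F(3, 10),
    (False, False): F(3, 5),
}


def p_genes(good):
    """Prior probability of having (or not having) the good genes."""
    return P_GOOD_GENES if good else 1 - P_GOOD_GENES


def decay_given_observed_clean():
    """P(decay | clean): seeing someone clean also shifts our belief about their genes."""
    joint = sum(p_genes(g) * P_CLEAN[g] * P_DECAY[(True, g)] for g in (True, False))
    evidence = sum(p_genes(g) * P_CLEAN[g] for g in (True, False))
    return joint / evidence


def decay_given_do_clean():
    """P(decay | do(clean)): the genes -> clean arrow is cut, so genes keep their prior."""
    return sum(p_genes(g) * P_DECAY[(True, g)] for g in (True, False))


print(decay_given_observed_clean())  # 7/50
print(decay_given_do_clean())        # 1/5
```

With these numbers, observing cleaning predicts less decay (7/50) than forcing cleaning does (1/5), because the observation also counts as evidence of good genes.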

**Making interventions complex**

The simple view of interventions, expressed in most of the literature on the topic, is that they are deterministic, always achieve a desired effect and affect only one variable. However, interventions in the real world often fail to match these simplified assumptions. For example, an intervention to pressure someone to clean their teeth might fail. Or an intervention might affect multiple variables, rather than just one. And finally, an intervention may be indeterministic and fail to set a variable to a specific value.

To model this, rather than simply changing the value of a variable in the system (setting Clean to true, in the above example), the paper suggests introducing a new parent node for the variable. This node sits outside the system and is intended to push the variable towards some particular target distribution. See below:

This intervention node will be binary (yes/no) but its interaction with other nodes is open so that the particular target distribution may not be achieved. Note that intervention variables can be parents of more than their target variable, allowing side effects to be modelled.
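As a minimal sketch of this idea, the code below treats the policy as an extra binary parent of Clean. The compliance rate and the other probabilities are hypothetical, as is the choice that a person who escapes enforcement simply falls back on their usual habits.

```python
from fractions import Fraction as F

# Hypothetical numbers, for illustration only.
P_GOOD_GENES = F(1, 2)
P_CLEAN_NO_POLICY = {True: F(4, 5), False: F(1, 5)}  # P(clean | good_genes)
P_COMPLY = F(9, 10)  # chance that enforcement actually produces cleaning


def p_clean(good_genes, policy):
    """P(clean | good_genes, policy), with the policy as an extra parent of Clean."""
    if not policy:
        return P_CLEAN_NO_POLICY[good_genes]
    # When enforcement fails, the person falls back on their usual habits.
    return P_COMPLY + (1 - P_COMPLY) * P_CLEAN_NO_POLICY[good_genes]


def p_clean_marginal(policy):
    """P(clean | policy), averaging over the genes."""
    return sum(
        (P_GOOD_GENES if g else 1 - P_GOOD_GENES) * p_clean(g, policy)
        for g in (True, False)
    )


print(p_clean_marginal(False))  # 1/2
print(p_clean_marginal(True))   # 19/20
```

The intervention aims at everybody cleaning their teeth, but in this sketch it only reaches 19/20, and the outcome still depends on the genes.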

**Types of intervention**

An intervention then leads to a new probability distribution across its states in the targeted variable. The types of interventions possible are:

- An independent intervention is one where the results of the intervention do not depend on the other parents of the variable. This changes the probability distribution of the target to that intended by the intervention. A dependent intervention would calculate the new probability distribution based on both this intended distribution and the other parents. So imagine a gene therapy intervention that fixes the effects of bad genes on not cleaning teeth. For someone with bad genes this will decrease the chance of decay. For someone who already has the good genes, this will have no effect. This would be a dependent intervention.
- A deterministic intervention aims to set a variable to a specific value, say by forcing people to clean their teeth. It may still be complicated to model if the intervention is dependent. A stochastic intervention aims to leave the target variable with a distribution over more than one state.

**Conclusion**

Representing interventions as the introduction of a new parent node rather than as a surgical change of the value of a variable allows a wider range of types of intervention to be modelled.

## Probabilistic causality

**This is post 7 in a sequence exploring formalisations of causality entitled “Reasoning with causality”.**

Many of the posts on this blog in relation to causality have a hidden assumption: namely, that causality is inherently probabilistic (though many of the posts are still just as relevant whether this is accepted or not). However, this goes against the mainstream view. This post will summarise part of the paper Varieties of Causal Intervention, which argues against the mainstream.

**Deterministic causality**

Judea Pearl, one of the principal researchers in the area of causality, has argued that causality should be interpreted deterministically for the following reasons:

- The deterministic interpretation is the more intuitive one.
- Deterministic interpretations are more general as any indeterministic interpretation can be modelled as a deterministic one.
- A deterministic causality is needed to make sense of counterfactuals and causal explanation.

This post is going to ignore the first point (intuitive doesn’t mean right) and will explore possible responses to the other two.

**Deterministic causality as the more general theory**

Imagine a system where both A and B have a causal influence on C. We can model that as the following equation:

C = dA + eB + U

This says that C is influenced to degree d by A and degree e by B. It also says that C isn’t entirely determined by A and B but there is some degree of variation. By adding U in, a case that was originally indeterministic has become deterministic. However, that doesn’t mean the causal system being modelled is in fact deterministic unless U is a part of the system. Given that, in practice, U is normally defined as that which is left over, claiming this as part of the system seems to be reasonable only if we presume from the start that the system is deterministic.

Given that indeterministic worlds are consistent, it seems to be an a posteriori question to determine whether causality should be deterministic or indeterministic and hence such an a priori assumption seems unwarranted.

Even if you don’t buy all that, there’s a further point: This process can be applied two ways. This means that any indeterministic system can be modelled as a deterministic one but any deterministic one can also be modelled as an indeterministic one.

Neither way is more general.
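Both directions can be seen in a small sketch. It uses a binary analogue of the linear equation above rather than the equation itself, and every number in it is hypothetical: a conditional probability table for C given A is rewritten as a deterministic function of A and an exogenous U, and averaging over U then recovers the original table.

```python
from fractions import Fraction as F

# Indeterministic description: P(C = 1 | A) given directly (hypothetical values).
P_C_GIVEN_A = {0: F(2, 10), 1: F(7, 10)}

# Deterministic description: C is a function of A and an exogenous U,
# where U is assumed to be uniform on {0, ..., 9}.
U_VALUES = range(10)


def c_of(a, u):
    """Structural equation: C = 1 exactly when U falls below a threshold set by A."""
    return 1 if u < P_C_GIVEN_A[a] * 10 else 0


def p_c_from_equation(a):
    """Recover P(C = 1 | A = a) by averaging the deterministic equation over U."""
    return F(sum(c_of(a, u) for u in U_VALUES), len(U_VALUES))


for a in (0, 1):
    print(a, p_c_from_equation(a) == P_C_GIVEN_A[a])  # True for both values of a
```

The same table can be read either way, which is the point: the arithmetic does not tell us whether U is a real part of the system.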

**Indeterministic causality and causal explanation**

In a previous post I discussed an approach by the same author to type and token causality that claims to present an indeterministic account of causal explanation. As such, the third of the problems listed above seems to be solved.

**The next post is Categories and types of intervention**

## Uses of Bayes Nets in causality: Modeling type and token causal relevance

**This is post 6 in a sequence exploring formalisations of causality entitled “Reasoning with causality”. This post continues to summarise the paper “Causal Reasoning with Causal Models”.**

**The previous post in the sequence is Is a causal interpretation of Bayes Nets fundamental?**

Causal Relevance, as opposed to causal role, is simply interested in whether there is a causal relationship between a cause and an effect, not whether the cause increases or decreases the chance of the effect. “Causal Reasoning with Causal Models” attempts to find a way of capturing and formalizing causal relevance.

**Type causal relevance**

Type causality looks at relationships between general categories – does smoking cause cancer? Token causality looks at a specific instance – did smoking cause cancer in this patient?

In relation to capturing type causal relevance, the first step seems to be to look at the relationship under intervention. If we change the amount someone in general smokes, does this change their chance of getting cancer? If so, smoking is causally relevant to cancer. Unfortunately, it’s not this simple. Imagine the following case: Taking the pill increases the chance of getting thrombosis but decreases the chance of getting pregnant. Getting pregnant increases the chances of getting thrombosis. Further, imagine that these two effects balance perfectly: whatever dosage of pill you take, the increased chance of getting thrombosis due to this is exactly countered by the decreased chance of getting pregnancy-caused thrombosis. However, we still want to say that the pill is causally relevant to thrombosis.

The problem with the above scenario is that we want to consider the component causal effects rather than the net causal effects. This is achieved by looking at each path from a cause to an effect one at a time (and blocking the other paths while doing so). If there is a probabilistic dependence along any of the paths then the factor is causally relevant. So, if we block the path from the pill via decreased chance of pregnancy then there is plainly a probabilistic dependence between the pill and thrombosis via the other path.
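Here is a minimal numerical sketch of the pill case. The probabilities are invented so that the two paths cancel exactly, and "blocking" the pregnancy path is modelled simply as holding pregnancy fixed.

```python
from fractions import Fraction as F

# Hypothetical numbers chosen so that the two paths cancel exactly.
P_PREGNANT = {True: F(0), False: F(1, 2)}  # P(pregnant | pill)
P_THROMBOSIS = {  # P(thrombosis | pill, pregnant), keyed by (pill, pregnant)
    (True, True): F(2, 5),
    (True, False): F(1, 5),
    (False, True): F(3, 10),
    (False, False): F(1, 10),
}


def net_effect(pill):
    """P(thrombosis | do(pill)), letting pregnancy respond to the pill."""
    p = P_PREGNANT[pill]
    return p * P_THROMBOSIS[(pill, True)] + (1 - p) * P_THROMBOSIS[(pill, False)]


def component_effect(pill, pregnant):
    """P(thrombosis | do(pill)) with the pregnancy path blocked by fixing pregnancy."""
    return P_THROMBOSIS[(pill, pregnant)]


print(net_effect(True), net_effect(False))                            # 1/5 1/5
print(component_effect(True, False), component_effect(False, False))  # 1/5 1/10
```

The net effect is the same with or without the pill, yet once the pregnancy path is blocked the direct path shows a clear dependence.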

**Token causal relevance: First attempt**

Extending this account to token causality is a little harder. Imagine a hiker walking up a hill when a boulder is dislodged. They duck and so survive. If the boulder had not fallen, they also would have survived. In terms of type causality, this sort of event can cause death so we want to see the boulder falling as causally relevant to survival. This is achieved with the analysis above. However, from a token perspective, this boulder fall was not causally relevant.

A possible criterion: A is causally relevant to B if for all paths from A to B there is a probabilistic dependence between A and B when the other paths are set at their observed values.

So in the boulder example, if we set the value of the variable duck to true then the boulder falling is not causally relevant to survival – just as we were hoping.
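The boulder case is simple enough to write out directly. The sketch below is a deliberately deterministic simplification (the post's models are probabilistic), and the function names are my own.

```python
def survives(boulder_falls, ducks):
    """The hiker dies only if the boulder falls and they fail to duck."""
    return not (boulder_falls and not ducks)


def ducks_given(boulder_falls):
    """In the story, the hiker ducks exactly when the boulder falls."""
    return boulder_falls


# Net effect: let ducking respond to the boulder. The hiker survives either way.
net_relevant = survives(True, ducks_given(True)) != survives(False, ducks_given(False))

# Type relevance: block the ducking path (here by holding ducks at False) and
# look at the direct path from the boulder to survival.
type_relevant = survives(True, False) != survives(False, False)

# Token relevance, first attempt: hold ducking at its observed value (True).
observed_ducks = True
token_relevant = survives(True, observed_ducks) != survives(False, observed_ducks)

print(net_relevant, type_relevant, token_relevant)  # False True False
```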

However, this criterion doesn’t solve all problems of this type: Imagine Suzy throws a rock at a bottle and a second later Billy does the same. Suzy’s rock breaks the bottle but there’s no probabilistic dependence because if she hadn’t thrown a rock the bottle would still have been broken (by Billy). This doesn’t change if we set the other paths to their observed value (i.e. we set Billy throwing to true). But we do want her throw to have token causal relevance to the bottle breaking.

**Token causal relevance: A solution**

This can be solved if we have a more complex set of criteria for token causal relevance:

1.) The causal model is correctly constructed – I won’t go into all the details here but this means the model must meet certain criteria (like all variables must be intervenable)

2.) The context should be fixed according to observation (so fix the value of Billy throwing to true)

3.) All paths from cause to effect must meet certain criteria including that they must be type causally relevant (there is some value of the variable which will induce a probabilistic dependence with other paths blocked). Other paths should be removed.

Now we check, without blocking other pathways from the cause to the effect, whether there is a probabilistic dependence.

So in the above example we construct the model as per 1:

We now fix the context so both Suzy throwing and Billy throwing are set to true. Now with the context of Suzy throwing being set to true, there is no longer a probabilistic dependence between Billy throws and bottle breaks. So we remove the path.

Now the probabilistic dependence is clear. The same solution can also be used to solve a variety of scenarios which I haven’t had the space to go into here.

Thus we have a solution for how to formalise both type and token causal relevance.

**The next post is Probabilistic causality**

## Is a causal interpretation of Bayes Nets fundamental?

**This is post 5 in a sequence exploring formalisations of causality entitled “Reasoning with causality”. This post continues to summarise the paper “Causal Reasoning with Causal Models”.**

**The previous post in the sequence is An introduction to Bayesian networks in causal modeling**

Is a causal interpretation of Bayes Nets fundamental in some way or is it simply accidental? To what extent can Bayes Nets be considered suitable causal models?

**Bayes Nets as representations of probability distributions**

A common argument that the causal interpretation of Bayes Nets isn’t fundamental is that Bayes Nets are designed to represent probability distributions. And given any Bayes Net with a causal interpretation, another one can be generated that represents the same probability distribution but does not have the causal interpretation. On the surface then, the Bayes Net which is a causal model seems no more fundamental than the other.

Chickering’s Arc Reversal Rule can be used to support the above assertion. This rule basically just reverses the direction of all the arrows. A technical addition to how this works means that the rule can introduce arrows but not remove them. From this, there seems to be an obvious way that Bayes Nets do fundamentally represent causality: The causal model is the one with the fewest arrows that still manages to capture the relevant probability distribution.

Unfortunately, this doesn’t work.

**Why the simplest Bayes Net isn’t the causal model**

There are circumstances under which the causal model is not the simplest Bayes Net that captures the probabilistic dependencies. Imagine the following situation: Sunscreen decreases instances of cancer but increases the time people spend in the sun (because they feel safer) which then increases the chance of cancer (modelled as below):

Now imagine that the two effects perfectly balanced so that sunscreen made no difference as to whether you got cancer. The simplest model that would capture the related probability distribution is far simpler than that above. It looks like this:

The only problem is, this doesn’t represent the causal system. So the causal model cannot be the simplest one which captures the probability distribution.

**Augmented Causal Simplicity**

Which leads to a more complicated conjecture: A causal model is the one with the fewest arrows that still manages to capture the relevant fully augmented probability distribution. In a basic sense, the fully augmented probability distribution is the one that captures the probabilities regardless of how we intervene in the model. So say we intervene in the above model to set the amount of time spent in the sun. In the first graph, this means sunscreen no longer changes the amount of time spent in the sun but it does change the chance of getting cancer. In the second model, changing the time spent in the sun will not capture this probabilistic relationship.

So by demanding that a model capture the probability distribution under all possible interventions, the causal model can still be said to be the simplest that captures the distribution. This then implies that the Bayes Net representation of causality is fundamental.
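To make the contrast concrete, here is a sketch with invented probabilities for the sunscreen model, balanced so that sunscreen and cancer look unrelated in observational data. Intervening on time in the sun then exposes the sunscreen-to-cancer arrow that a simpler graph would have dropped.

```python
from fractions import Fraction as F

# Hypothetical numbers, balanced so that the two effects of sunscreen cancel.
P_LONG_SUN = {True: F(1, 2), False: F(0)}  # P(long time in sun | sunscreen)
P_CANCER = {  # P(cancer | sunscreen, long_sun), keyed by (sunscreen, long_sun)
    (True, True): F(3, 10),
    (True, False): F(1, 10),
    (False, True): F(2, 5),
    (False, False): F(1, 5),
}


def p_cancer_observed(sunscreen):
    """P(cancer | sunscreen) with no intervention: time in sun responds to sunscreen."""
    p = P_LONG_SUN[sunscreen]
    return p * P_CANCER[(sunscreen, True)] + (1 - p) * P_CANCER[(sunscreen, False)]


def p_cancer_do_sun(sunscreen, long_sun):
    """P(cancer | sunscreen, do(long_sun)): time in sun is set from outside."""
    return P_CANCER[(sunscreen, long_sun)]


print(p_cancer_observed(True), p_cancer_observed(False))            # 1/5 1/5
print(p_cancer_do_sun(True, False), p_cancer_do_sun(False, False))  # 1/10 1/5
```

Any model without a direct sunscreen-to-cancer arrow can match the first line but not the second, which is why the augmented distribution singles out the causal graph.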

**The next post in the sequence is Modeling type and token causal relevance**

## An introduction to Bayesian networks in causal modeling

**This is post 4 in a sequence exploring formalisations of causality entitled “Reasoning with causality”**

**The previous post was Reasoning with causality: An example**

The remainder of this sequence is going to depart from the path previously indicated (continuing to read Pearl’s *Causality*) and will instead explore the use of Bayesian networks in causal modeling (in doing so, it will also discuss probabilistic vs deterministic causality) by summarising a technical report titled *Causal Reasoning with Causal Models* (available at http://www.csse.monash.edu.au/~korb/pubs/techrept.pdf). This first post will introduce the use of Bayesian networks in causal modeling.

**What are Bayesian Networks**

A Bayesian Network is a directed acyclic graph (DAG) together with a set of conditional probabilities. What does this mean? Well, take the graph (a series of nodes connected by edges) below:

1.) Directed simply means that each edge is an arrow pointing from one node to another.

2.) Acyclic means that if you follow the arrows along any path you will never return to your starting point.

So now we know what the DAG aspect of the definition above means. How about the fact that it captures conditional probabilities? Well, that simply means that the Bayesian Network (or Bayes Net) also captures the way the variables are conditionally dependent on one another. Take the central node (“grass wet”). Let’s define the conditional probabilities for this node in the form of a table:

| Sprinkler | Rain | Grass wet | Grass dry |
| --- | --- | --- | --- |
| T | T | 1 | 0 |
| T | F | 0.7 | 0.3 |
| F | T | 0.9 | 0.1 |
| F | F | 0 | 1 |

So this says that if it rained and the sprinkler was on last night, the grass will definitely be wet. If just the sprinkler was on, there’s a 0.3 chance that the grass has dried by now and so on.

So a Bayes Net is a DAG combined with the conditional probabilities that hold between the nodes. Bayes nets are used because they make many problems more computationally cheap (if the list of relevant conditional probabilities is small enough).
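As a small illustration, the table above can be stored as a dictionary and combined with priors for its parents to get the overall chance of wet grass. The priors below are hypothetical (the post gives none), and treating sprinkler and rain as independent root nodes is an assumption made for the sketch.

```python
from fractions import Fraction as F

# P(grass wet | sprinkler, rain), copied from the table above.
P_WET = {
    (True, True): F(1),
    (True, False): F(7, 10),
    (False, True): F(9, 10),
    (False, False): F(0),
}

# Hypothetical priors; sprinkler and rain are assumed to be independent roots.
P_SPRINKLER = F(3, 10)
P_RAIN = F(1, 5)


def prob(p_true, value):
    """Probability that a binary variable with P(True) = p_true takes `value`."""
    return p_true if value else 1 - p_true


def p_grass_wet():
    """P(grass wet), summing over every combination of the two parents."""
    return sum(
        prob(P_SPRINKLER, s) * prob(P_RAIN, r) * P_WET[(s, r)]
        for s in (True, False)
        for r in (True, False)
    )


print(p_grass_wet())  # 177/500
```

Only the four table entries and two priors are needed, rather than a full joint table over every variable, which is where the computational saving comes from.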

**Bayesian Networks and conditional dependencies**

If there are no arrows between nodes in a Bayes Net then this means they must be probabilistically independent. Which is to say:

P(A | B) = P(A)

Conditional independence is an extension of this to situations where a third variable induces independence. So, in the above graph, rain and the paper being wet are conditionally independent because they are “screened off” by the grass being wet.
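Screening off can be verified by brute force on a small version of the network. The sketch assumes the structure rain -> grass wet <- sprinkler and grass wet -> paper wet, with hypothetical priors and a hypothetical table for the paper.

```python
from fractions import Fraction as F
from itertools import product

# Hypothetical priors and paper table; P_WET follows the grass-wet table above.
P_RAIN, P_SPRINKLER = F(1, 5), F(3, 10)
P_WET = {(True, True): F(1), (True, False): F(7, 10),
         (False, True): F(9, 10), (False, False): F(0)}  # keyed by (sprinkler, rain)
P_PAPER = {True: F(4, 5), False: F(1, 10)}  # P(paper wet | grass wet)


def prob(p_true, value):
    """Probability that a binary variable with P(True) = p_true takes `value`."""
    return p_true if value else 1 - p_true


def joint(rain, sprinkler, wet, paper):
    """Joint probability, factorised along the arrows of the assumed graph."""
    return (prob(P_RAIN, rain) * prob(P_SPRINKLER, sprinkler)
            * prob(P_WET[(sprinkler, rain)], wet) * prob(P_PAPER[wet], paper))


def p_paper(wet, rain=None):
    """P(paper wet | grass wet = wet [, rain = rain]) by enumeration."""
    def total(paper_values):
        return sum(joint(r, s, wet, p)
                   for r, s, p in product((True, False), (True, False), paper_values)
                   if rain is None or r == rain)
    return total((True,)) / total((True, False))


# Once we know the grass is wet, learning about the rain changes nothing.
print(p_paper(True), p_paper(True, rain=True), p_paper(True, rain=False))
```

All three printed values are 4/5: given the state of the grass, rain carries no further information about the paper.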

There are two properties related to this:

1.) A Bayes Net is said to have the Markov Property if all of the conditional independences in the graph are true of the system being modelled (the Markov Property simply means that the future state of a system depends only on the present state and not on the past ones. This is clearly related to conditional independence). It can also be called an independence-map or I-map of the system.

2.) If all of the dependencies in the Bayes Net are true in the system then it’s said to be faithful (or a D-map, Dependence Map) to the system.

If a Bayes Net has both of these properties then it’s said to be a perfect map. Generally in Bayes Nets, we want the I-map to be minimal (such that if we removed any arcs, it would no longer be an I-map).

**Bayesian Networks and Causality**

A Bayes Net becomes a causal model if each of the arrows in the graph can be given a causal interpretation. Each arrow then represents direct causality. Indirect causality is based on paths that always follow arrows. So in the above graph, rain is indirectly related to the paper being wet via the path (rain -> grass wet -> paper wet). Sprinkler being on is not indirectly causally related to rain because the path from sprinkler to rain does not always follow the direction of the arrows.

A path from X to Y is blocked by Z if X and Y are conditionally independent given Z (at least under some basic assumptions). If the graph has the Markov Property then the equivalent procedure to blocking is d-separation, which is best explained as follows. Given two variables, A and B, there are four ways that a path can go from A to B through C when the direction of the arrows is ignored:

1.) It can go via a chain A->C->B. In this case A and B are d-separated given a set of nodes Z if C is in Z (if you know C, knowing A won’t tell you anything more about B).

2.) It can go via a chain B->C->A (reasoning as above).

3.) A and B can both be caused by C. In this case, A and B are d-separated given Z if C is in Z (once again, knowing C means that knowing A or B tells you nothing more about the other variable).

4.) A and B can both cause C. In this case if C isn’t in Z then there is no dependence between A and B by default. They may both cause C but knowing one doesn’t tell you anything about the other. On the other hand, if C is in Z then A and B become conditionally dependent. Look at the graph earlier – here if we know that the grass is wet then there is a conditional relationship between rain and sprinkler – namely that if it didn’t rain the sprinkler must have been on and vice versa. So A and B are d-separated given Z if C is not in Z.

All up, A and B are d-separated given Z if all paths from A to B meet one of the above criteria.
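The four cases above can be turned into a short checker. This is a sketch for small graphs only: it enumerates every path, which does not scale. It also applies the standard refinement that conditioning on a descendant of a common effect opens the path, which the four cases above do not spell out. The example graph is an assumed version of the wet grass network.

```python
def d_separated(edges, a, b, given):
    """Return True if every path between a and b is blocked by the set `given`."""
    parents, children = {}, {}
    for u, v in edges:
        children.setdefault(u, set()).add(v)
        parents.setdefault(v, set()).add(u)

    def descendants(node):
        seen, stack = set(), [node]
        while stack:
            for child in children.get(stack.pop(), ()):
                if child not in seen:
                    seen.add(child)
                    stack.append(child)
        return seen

    def paths(current, path):
        # Enumerate simple paths from a to b, ignoring arrow direction.
        if current == b:
            yield path
            return
        for n in children.get(current, set()) | parents.get(current, set()):
            if n not in path:
                yield from paths(n, path + [n])

    def blocked(path):
        for prev, mid, nxt in zip(path, path[1:], path[2:]):
            mid_parents = parents.get(mid, set())
            if prev in mid_parents and nxt in mid_parents:
                # Case 4 (common effect): blocked unless mid or a descendant is given.
                if mid not in given and not (descendants(mid) & given):
                    return True
            elif mid in given:
                # Cases 1-3 (chains and common cause): blocked when mid is given.
                return True
        return False

    return all(blocked(p) for p in paths(a, [a]))


# Assumed structure of the wet grass example.
EDGES = [("rain", "grass_wet"), ("sprinkler", "grass_wet"), ("grass_wet", "paper_wet")]
print(d_separated(EDGES, "rain", "sprinkler", set()))          # True
print(d_separated(EDGES, "rain", "sprinkler", {"grass_wet"}))  # False
print(d_separated(EDGES, "rain", "paper_wet", {"grass_wet"}))  # True
```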

**Conclusions**

This post has explored Bayesian Networks and how these can be used as causal models. In the following posts it will explore objections to the use of Bayes Nets to represent causality, the debate over probabilistic vs deterministic interpretations of causality and the difference between type and token causality.

**The next post in the sequence is Is a causal interpretation of Bayes Nets fundamental?**

## Reasoning with causality: An example

**This is post 3 in a sequence exploring formalisations of causality entitled “Reasoning with causality”**

**The previous post was “A causal calculus: processing causal information”**

The last two posts have introduced a graphical method and a calculus for discussing causality. This post will demonstrate how these can be used for causal reasoning by following one of Pearl’s examples – an exploration of the causal relationship between smoking and lung cancer in a deliberately simplified world as follows: *Smoking causes lung cancer via the intermediary of building up tar deposits in the lungs. There is also a genetic feature that both increases the chance of developing cancer and increases the chance of one smoking.*

The first thing we’re going to need to think about this is the relevant causal graph which we can see showing both of these two causes of cancer. The question we now need to determine is the strength of these links – basically, what is the probability of getting cancer due to do(smoking)?:

So we’re trying to discover the probability of cancer given do(smoking), or:

The rest of the proof will use the rules of do() introduced in the last post to remove the do() statement so that the problem can be resolved with the normal rules of probability (the first few steps will be explained explicitly but, if you want to follow after that you should have the tools to work out the steps for yourself).

**Step 1**

The first step is to state that the above equation is equivalent to:

The justification for this is due to the axioms of probability, as follows:

Any probability P(a) can instead be thought of as a sum of the probabilities of exhaustive, mutually exclusive events. So, for example: *A school has four classes of 25 students. All 100 students are currently gathered in a hall for a meeting. They are mixed up at random. What is the probability that a student selected at random is the tallest in their class?*

You can reason as follows: There are four tallest students (because there are four classes) and 100 students so the probability is 4/100. Or you could think in terms of classes (which are mutually exclusive – no students are in two classes – and exhaustive – all students are in a class). In each class there are 25 students so you can say: What’s the probability of this student being in class 1: ¼ and what’s their probability of being the tallest in their class: 1/25. So what’s the probability that the student is the tallest in class 1: 1/100. You could then find the same values for the other three classes and sum these values together to once again get 4/100.

Similarly, the probability of cancer given do(smoking) is equivalent to the sum, over all possible tar levels, of the probability of cancer given do(smoking) and a certain level of tar in the lungs, multiplied by the probability of that level of tar being in the lungs given do(smoking) (if that’s not clear, think of it in terms of the school student example).
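Written out, the two decompositions look as follows. The symbols are assumptions on my part: c for cancer, s for smoking and t for the tar level, matching the letters used for the precondition later in this post. The second factor is conditioned on do(s) because the sum rule keeps the same condition in every term.

```latex
% School example: condition on the class k = 1, ..., 4
P(\text{tallest}) = \sum_{k=1}^{4} P(\text{class } k)\, P(\text{tallest} \mid \text{class } k)
                  = 4 \cdot \tfrac{1}{4} \cdot \tfrac{1}{25} = \tfrac{4}{100}

% Smoking example: condition on the tar level t
P(c \mid do(s)) = \sum_{t} P(c \mid do(s), t)\, P(t \mid do(s))
```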

**Step 2**

The next step is to show that this is equal to:

This is simply a use of Pearl’s second rule (you may want to have the rules post open for reference). However, Pearl’s rules can only be used if a certain precondition is met. You may have noticed that each of the preconditions in the previous post ended with something like:

This notation tells you what graph the precondition must be met in. So rather than seeing if it applies to the graph we drew above, we see if it applies to a subgraph of this defined such that if there’s a hat above the letter, all edges going into the letter are removed and, if there’s a hat below, all edges going out of the letter are removed. So in the above example all arrows going into X and out of Z are removed. Which in our example above means the preconditions must be met in the following graph:

The precondition is that the rule can only be used if c and t are conditionally independent given do(s) (which is to say that if you know do(s), knowing t as well won’t change your probability for c). In the above graph this is plainly true as there is no causal link from t to c.
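The subgraph check can be sketched in a few lines. The edge list below is my reading of the simplified world described at the top of this post (a gene influencing both smoking and cancer, and smoking acting on cancer only through tar). The check used is the crude one of asking whether any path at all still links tar and cancer, which is enough here because no connecting path means the two are independent whatever we condition on.

```python
# Assumed graph for the simplified world described at the top of the post.
EDGES = [("gene", "smoking"), ("gene", "cancer"),
         ("smoking", "tar"), ("tar", "cancer")]


def mutilate(edges, cut_incoming=(), cut_outgoing=()):
    """Drop every edge entering a node in cut_incoming or leaving a node in cut_outgoing."""
    return [(u, v) for u, v in edges
            if v not in cut_incoming and u not in cut_outgoing]


def connected(edges, a, b):
    """True if a and b are linked by any path, ignoring arrow direction."""
    seen, stack = {a}, [a]
    while stack:
        node = stack.pop()
        for u, v in edges:
            for here, there in ((u, v), (v, u)):
                if here == node and there not in seen:
                    seen.add(there)
                    stack.append(there)
    return b in seen


# Remove the arrows into smoking and the arrows out of tar.
subgraph = mutilate(EDGES, cut_incoming={"smoking"}, cut_outgoing={"tar"})
print(subgraph)                              # [('gene', 'cancer'), ('smoking', 'tar')]
print(connected(subgraph, "tar", "cancer"))  # False
```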

**Remaining steps**

The rest of the proof proceeds as follows, ending with a probability equation that does not contain any do() operators. Each step of this proof takes place in a similar way to those above. One of Pearl’s rules for do() is applied after the appropriate subgraph is determined to meet the required criteria:

**Conclusion**

This is where the graphical and calculus based approaches to causality come together and begin to allow us to reason with causal information that is given to us. In *Causality*, Pearl’s proof is much shorter and to the point. Here I’ve tried to provide more detail on the first few steps to make the proof clear but for those who worry that they’re drowning in detail, read Pearl’s presentation and see if you find it any clearer.

So far I have been focusing on how causal information can be manipulated once you have it using the do() calculus and Pearl’s graphical methods. In the next lot of posts, I will be exploring more about how Bayesian Networks can be used as causal models.

**The next post in the sequence is An introduction to Bayesian networks in causal modeling**

## A causal calculus: Processing causal information

**This is post 2 in a sequence exploring formalisations of causality entitled “Reasoning with causality”**

**The first post is “Causality and graphical methods”.**

In the previous post, I introduced Pearl’s graphical method for discussing causality. In this post we will attempt to answer the question: How can we process causal information that is given to us?

**A definition of causality**

The previous post introduced the idea of surgery – that is, reaching in and modifying the value of a node in a causal graph. Pearl defines causality in terms of this. He says that A can be said to cause B if performing surgery on A (i.e. changing the value of A) can change the value of B. To put it another way, if we specify A surgically via a new equation, A causes B if the value of B relies on this new equation.

So take the graph from the previous post. Y can be said to cause Z in this graph because if we reached in and changed the value of Y (while leaving X the same), the value of Z would change because Z is simply equal to 2Y.
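Surgery is easy to mimic with structural equations in code. In the sketch below only Z = 2Y comes from the example above; the equation for Y is a hypothetical stand-in, and the `do_y` and `do_z` parameters are my own names for the surgical overrides.

```python
def run_model(x, do_y=None, do_z=None):
    """Evaluate the structural equations, optionally performing surgery on Y or Z."""
    # Hypothetical equation for Y; only Z = 2Y is taken from the example above.
    y = x + 1 if do_y is None else do_y
    z = 2 * y if do_z is None else do_z
    return {"x": x, "y": y, "z": z}


print(run_model(3))             # {'x': 3, 'y': 4, 'z': 8}
print(run_model(3, do_y=10))    # {'x': 3, 'y': 10, 'z': 20}
print(run_model(3, do_z=100))   # {'x': 3, 'y': 4, 'z': 100}
```

Surgery on Y changes Z while X stays put, so Y causes Z. Surgery on Z leaves Y untouched, so Z does not cause Y.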

**A causality calculus**

Having just established a graph, rather than equation, based view of causality, Pearl now points out the benefits of being able to discuss causality in a formal language. So just as we have the Boolean Algebra for deductive logic and algebras to discuss probability, we now need a causal calculus.

He sets out to establish such a calculus by introducing a new operator, do(), to standard probability theory. Do is designed to capture causal ideas. So take the wet grass example (a robot might conclude from, “If the grass is wet then it rained” and “if I break this bottle, the grass is wet” that “if I break this bottle, then it rained”). The “do()” operator should allow us to differentiate the grass being wet and one undertaking (or *doing*) an action that makes the grass wet. It should then be able to determine that undertaking an action that makes the grass wet does not change the probability that it rained.

How do we specify the do() operator so that this occurs? Pearl proposes three rules that will allow do() operators to be manipulated in useful ways.

The first of those rules is called “Ignoring observations” and is expressed as:

Basically what this says is that the probability of y is the same given z, w and the undertaking of the action of x as if you were simply given w and the undertaking of the action of x. Obviously this won’t be true in all situations and so there is a precondition that must be met before rule 1 can be used:

Basically, this means that rule 1 can only be applied when Y and Z are conditionally independent given X and W. Which is to say when: *Knowledge of Z will make no difference to the probability of Y if you already have X and W.* This means that rule 1 basically allows you to ignore an irrelevant observation.

The second rule is called “Action/Observation exchange”:

This basically says that the probability of y is the same given w and the undertaking of x and z as it is if given w and z and the undertaking of x. Once again, there is a precondition:

That is, Y and Z must be conditionally independent given X and W. Which is to say that if you know X and W, then Z doesn’t change the probability of Y. So rule 2 basically allows you to use this rule of conditional probabilities to swap one action (a do() statement) for an observation of the same thing.

The third rule is called “Ignoring actions”:

Which is basically the equivalent of the first rule but for actions: So while the first rule allowed you to ignore an irrelevant observation, this rule allows you to ignore an irrelevant action. The precondition is:

Which is to say, the rule can be used when Z provides no additional information about Y given X and W.
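For reference, here are the three rules in the form in which they are usually stated in Pearl's work. The notation is an assumption on my part rather than a copy of the equations discussed above: a bar over a variable in the graph subscript means arrows into it are removed, a bar under it means arrows out of it are removed, and Z(W) is the set of Z-nodes that are not ancestors of any W-node once the arrows into X are removed.

```latex
% Rule 1 (ignoring observations)
P(y \mid do(x), z, w) = P(y \mid do(x), w)
  \quad \text{if } (Y \perp Z \mid X, W) \text{ in } G_{\overline{X}}

% Rule 2 (action/observation exchange)
P(y \mid do(x), do(z), w) = P(y \mid do(x), z, w)
  \quad \text{if } (Y \perp Z \mid X, W) \text{ in } G_{\overline{X}\,\underline{Z}}

% Rule 3 (ignoring actions)
P(y \mid do(x), do(z), w) = P(y \mid do(x), w)
  \quad \text{if } (Y \perp Z \mid X, W) \text{ in } G_{\overline{X}\,\overline{Z(W)}}
```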

**Conclusion**

The do() operator is Pearl’s way of developing a causal calculus. The graphical method is his way of defining causality. The next post will bring these together and hopefully explain them both more clearly by following an example that Pearl provides in Causality as to how these can be used to do causal reasoning.

**The next post is “Reasoning with causality: an example”**