The Autoverse was a ‘toy’ universe, a computer model which obeyed its own simplified ‘laws of physics’ – laws far easier to deal with mathematically than the equations of real-world quantum mechanics. Atoms could exist in this stylized universe, but they were subtly different from their real-world counterparts; the Autoverse was no more a faithful simulation of the real world than the game of chess was a faithful simulation of medieval warfare. It was far more insidious than chess, though, in the eyes of many real world chemists. The false chemistry it supported was too rich, too complex, too seductive by far.

Greg Egan, Permutation City

Imagine you’ve created a simulation rich enough that evolution can take place within it. But the environment is not a normal environment – it’s an environment designed to evolve new decision theories. The selective pressures are different decision scenarios, which exist in different combinations in different parts of the world. On Earth, selective advantage comes about through the ability to reproduce. What could fill the same role in this simulation? The traditional answer in philosophy is rationality. Decision theory is intended to be a theory not of how an agent should decide so as to thrive, but of how it should decide so as to be rational. Whether there should be a distinction between the two, and how we should think about rationality, will be the topic of a future section of this sequence. In this section I intend to discuss the standard debate of decision theory: the background.

So what determines rationality? Well, traditional philosophy may come up with criteria, but philosophers are happy to abandon them if they seem to fail to capture rationality in some circumstance. So that means that underlying the explicit criteria is an implicit appeal to intuition. We will know rationality when we see it (it may not come as a surprise, then, that which decision counts as rational is debated for many decision scenarios).

These are issues for later though. For now, imagine that our intuitions are more coherent and have been fed to the simulation. An agent that decides rationally finds that they are more likely to reproduce.

Evolution can’t proceed without variation and mutation. So what is varying and mutating in this simulation? Take the original formula we discussed for expected utility, in which the expected utility of a decision d is the sum of the utility of each world state s, weighted by the probability of that state:

EU(d) = Σ_s P(s) × U(s, d)

The utility section of this formula is fairly uncontroversial. Different decision theories involve different probabilities playing a role in the formula. So that is what will vary and mutate in the simulation: the probability term – different probabilities will be used which capture different relationships between the decision and the world state.

The simulation will start with naive decision theory which, having at least some coherence, will survive and spread widely enough for mutations to begin to develop. Many of these will be evolutionary dead ends. Maybe one will refer not to the probability of the world state alone but instead to the probability of the decision alone – the formula will now ask what the probability is that the agent will decide a certain way. But the formula is designed to determine the decision the agent makes, so each time it seems to reach a decision a new probability will be determined for the decision and the formula will have to be recalculated. Even if it eventually finds an equilibrium point, there’s no reason to believe that this point will identify the rational decision. Such mutations will rapidly die out.

Others will be more useful. Eventually, the evidential decision theory that we discussed in the last post will come into being. Presuming the environment it develops in is one which includes decision scenarios where the world state depends on the decision, it will take over from naive decision theory in these areas.

Elsewhere, however, another decision theory will evolve to deal with this same problem: causal decision theory. While evidential decision theory made use of conditional probabilities, causal decision theory makes use of the probability of a particular type of conditional. In other words, its probability term looks something like this (traditionally a box with an arrow coming out of it represents the relationship but WordPress doesn’t render that symbol so I’m using the standard arrow):

P(d → s)

This represents the probability of the subjunctive conditional relating the decision to the world state. A subjunctive conditional can also be thought of as a counterfactual conditional capturing the statement “If I were to make this decision then the world would be in this state”. Causal decision theory then considers the probability of this statement being true.

Another way of thinking of this is that while evidential decision theory asks how the decision and the world state are correlated, causal decision theory is interested only in causation between the decision and the world state. This will also resolve the difficulties faced by naive decision theory. If an agent is deciding whether to enter an air raid shelter during a bombing raid, it will note that doing so will cause it to be more likely to survive, and so it will take shelter. If naive decision theory, on the other hand, divides the world into the states where the agent survives and where it dies, it will fail to realise that its decision can change the probabilities of the world being each of these ways. So it won’t bother with the hassle of sheltering.
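To make the structure concrete, here is a minimal Python sketch of the causal calculation for the air-raid example. The probabilities and utilities are invented for illustration; the key point is that the probability term is read causally, as P(d → s), the chance that making decision d would bring about world state s.

```python
# Toy causal expected utility for the air-raid example.
# All numbers below are invented assumptions, not values from any real model.

# P(d -> s): probability that decision d would causally bring about state s.
prob_do = {
    ("shelter", "survive"): 0.99,
    ("shelter", "die"): 0.01,
    ("stay_outside", "survive"): 0.60,
    ("stay_outside", "die"): 0.40,
}

utility = {"survive": 10, "die": -100}

def causal_expected_utility(decision):
    # Weight each outcome's utility by the probability that the
    # decision would causally bring that outcome about.
    return sum(prob_do[(decision, s)] * utility[s] for s in utility)

# EU(shelter) = 0.99*10 + 0.01*(-100) = 8.9
# EU(stay_outside) = 0.60*10 + 0.40*(-100) = -34
best = max(["shelter", "stay_outside"], key=causal_expected_utility)
```

Because sheltering causally raises the chance of survival, the causal decision theorist shelters, just as the prose above describes.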

So causal decision theory also begins to establish a foothold in the world. The next post will ask the question: what happens when causal and evidential decision theory meet in the same area of the simulation?


You have rebuilt the artificial creature and now you’ve placed it back in the false environment. For a while, it sits and simply gathers data about the world in the form of probabilistic relationships. It observes ten creatures going to a substantial patch of food; eight of them get picked off by predators. It observes ten creatures going to a less substantial but more sheltered patch of food; none of them get eaten. It begins to form opinions about probabilities relating the predators and the two patches.

Eventually though, it grows hungry and has to choose which patch to go to. The previous incarnation of the creature just divided the world into the possibilities it survived and the possibilities it didn’t, without taking into account the effect of its decisions on its chances of survival. This new version of the creature will not fall into the same trap.

There are two ways it could have been programmed to avoid this. It could have been given a causal decision theory, so that it would ask what its actions were likely to cause (would going to this patch of food cause me to be more likely to die?). Causation is difficult though. The creature watches and takes in probabilistic information, but this isn’t enough – there are many occasions where purely probabilistic information isn’t enough to identify a single causal structure for the world. Add in temporal information and more cases can be distinguished. However, an unobserved third variable might be responsible for the probabilistic relationships, and even temporality won’t always help to distinguish this.

Causation is difficult and if instead you could read off useful information directly from just the probabilistic information, surely that would be preferred. Correlation, it turns out, is much easier to figure out. And correlation is what fuels evidential decision theory (EDT). Naive evidential decision theory (we’ll discuss more sophisticated versions in later posts) simply looks at the correlations between the decision and the state of the world. It seems that its simple nature may give it an initial benefit as a decision theory – if evidential decision theory can do all that causal decision theory can, then, the argument goes, it wins because it has less conceptual baggage.

So how does EDT capture correlation? Let’s look at the original equation for expected utility:

EU(d) = Σ_s P(s) × U(s, d)

In the previous post, we realised that the probability of the world state can’t be treated as being independent of the decision. In other words, the P(s) term of the equation needs to be changed.

Evidential decision theory does this by replacing it with the probability of the world state given the decision, P(s|d) – where P(A|B) means the probability of A given B.

So evidential decision theory calls for an agent to make the decision which maximises the following formula:

EU(d) = Σ_s P(s|d) × U(s, d)

How does this work? Well think of the patches of food. Given that the relationship used in this equation is simple correlation, the agent can easily work out that the probability of being eaten given the decision to go to the more substantial patch is much higher than the probability of being eaten given the choice to go to the less substantial patch. So it will choose to go to the less substantial patch.

To put the numbers in, we need some probabilities. Presuming the agent drew these from its earlier observations, the probability of death given a trip to the substantial patch is 0.8 (eight out of ten creatures were eaten). On the other hand, the probability of being eaten given a trip to the less substantial patch is 0.

However, while we’ve been focusing on the probabilities in the last few posts (as these are the issue of the debate we’re exploring), the equation above does also mention another factor – the utility received from the combination of a world state and a decision. In a previous post we represented that in this utility table:

|  | Death | Survival |
| --- | --- | --- |
| Substantial patch | -10 | 5 |
| Less substantial patch | -10 | 2 |

Now we have all the information we need to do the necessary calculations. For the substantial patch: 0.8 × -10 + 0.2 × 5 = -7. For the less substantial patch: 0 × -10 + 1 × 2 = 2.
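The same calculation can be written as a short Python sketch, using the observed frequencies from the story (eight of ten eaten at the substantial patch, none at the sheltered one) and the utility table above:

```python
# Evidential expected utility for the two food patches.
# P(world state | decision), taken from the creature's observations.
prob_given_decision = {
    ("substantial", "death"): 0.8,
    ("substantial", "survival"): 0.2,
    ("less_substantial", "death"): 0.0,
    ("less_substantial", "survival"): 1.0,
}

# Utility of each decision/world-state combination, from the table above.
utility = {
    ("substantial", "death"): -10,
    ("substantial", "survival"): 5,
    ("less_substantial", "death"): -10,
    ("less_substantial", "survival"): 2,
}

def evidential_expected_utility(decision):
    # Weight each utility by the conditional probability of its state.
    return sum(prob_given_decision[(decision, s)] * utility[(decision, s)]
               for s in ("death", "survival"))

# EU(substantial) = 0.8*(-10) + 0.2*5 = -7
# EU(less_substantial) = 0*(-10) + 1*2 = 2
best = max(["substantial", "less_substantial"],
           key=evidential_expected_utility)
```

Note that nothing here is causal: the agent only needs the conditional probabilities it read off from watching other creatures.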

So evidential decision theory reaches the correct decision in the case facing our agent and it does so without relying on any complex causal apparatus. The next few posts will explore how causal decision theory reaches the same decision in at least this instance. The question that will then be asked is, does causal decision theory have advantages such that taking on the extra causal baggage is worthwhile?


You watch the creature hesitate, processing the best data it can gather about the environment around it. You hold your breath. The last ten years of your life have been spent as part of the team programming this artificial creature – so simple on the face of it, yet it took such a level of complexity to equal even this simple achievement of evolution. The creature is now faced with its first decision: one patch of food is more tempting but less sheltered from predators. Another is less tempting but well sheltered. In this setup, predators are so prevalent that the sensible decision is to go for the more sheltered food.

Finally, the agent makes its decision. Seconds later it is caught by a predator. You sigh and download the log. Time to figure out what went wrong. You see it straight away. Before the creature could make its decision, it first needed to build up a utility table. You had expected it to build up the following table.

|  | Predators present | Predators absent |
| --- | --- | --- |
| Substantial patch | -10 | 5 |
| Less substantial patch | 2 | 2 |

Instead, it had built up the following utility table:

|  | Death | Survival |
| --- | --- | --- |
| Substantial patch | -10 | 5 |
| Less substantial patch | -10 | 2 |

There’s nothing wrong with this second utility table. Dying is always worth -10. Dying in either patch of food is worth the same disutility. Survival is indeed worth more in the substantial patch of food because it’s a preferable location. However, the point is that *the creature is more likely to survive if it heads for the less substantial patch*. In other words, the probability of the world state depends on the decision.

However, if we look at the formula for expected utility that we discussed last week, we can see that the probability of the state of the world doesn’t take into account the decision at all.

The term to look at is the probability of the world state, P(s).

This fails to take into account the fact that the decision can influence the world state (which patch the creature chooses influences whether it is likely to survive). This equation is unable to handle this model of reality. So what equation should you reprogram into your artificial creature? This issue is the focus of one of the principal debates in decision theory.

Historically, the debate was between two main decision theories: evidential decision theory and causal decision theory. Broadly, evidential decision theory says that an agent should ask what evidence a decision provides about the world state. So the creature heading toward the sheltered patch of food would provide evidence that the creature is more likely to survive. Causal decision theory, on the other hand, says that an agent should ask what the causal influence of the decision is on the world state. So heading for the sheltered patch causes the creature to be more likely to survive.

The next few posts will outline both of these theories in more detail.


Post 1. What is decision theory?

Post 2. A problem with naive decision theory

Post 3. Evidential decision theory

Post 4. Causal decision theory


At some point in the history of the universe, a decision had never been made. The entire history of the universe to that point had just unfurled quietly without a single choice being made. The first warning sign that this era was over was the beginning of life. However, early living things would have floated, or stayed still, or done whatever the world told them to. They would not have intervened in the world or pondered the rational response to the environment. Not only could such a creature not decide but, even if it could have, it would still have been unable to impose its will on the world.

Then a form of life developed that was able to interact with the world. Maybe it gained the power of locomotion. Maybe it gained the power to cling – to decide when not to be moved by the elements. Maybe it gained any number of abilities but, for whatever reason, suddenly it was able to intervene in the world. And a new question came about: how should it act? This first intervener would have had few cognitive tools with which to process the question.

The next development is the most important one to our story. Not first life. Not first intervention. But first decision. A form of life that could not only intervene in the world but could decide how to do so. Life that could choose where to move and when to move. And the question became more important: how should one best take advantage of this ability to decide?

This sequence will explore decision theory, one attempt to answer this question, at least in the abstract.

So imagine then a creature – maybe not the first decision maker but one of its descendants. This is a simple creature that gains its energy by eating algae that floats in the water, and that survives by both eating enough and avoiding being eaten by its predators. This creature is faced with a decision: to its left there is a substantial patch of algae that would feed it comfortably for some time. To its right, there is a less substantial patch that it could nevertheless survive on, albeit not for as long. The creature is faced with the decision of which patch to approach.

However, here’s the complication: the more substantial patch of algae is in an area that would be more exposed to predators if they were around. The less substantial patch is in a more sheltered area. In other words, if predators are likely to be around then the creature would be better going to the less substantial patch. If they are unlikely to be around, it is better going to the more substantial patch. This allows us to introduce our first tool from decision theory: the utility table.

|  | Predators present | Predators absent |
| --- | --- | --- |
| Substantial patch | -10 | 5 |
| Less substantial patch | 2 | 2 |

This utility table contains almost everything that decision theory uses to determine the rational decision. Across the top are the possible states of the world (the world can either be such that the predators are present or absent). Along the side are the possible decisions (the creature can head for either the substantial or the less substantial patch). The table cells then contain a utility value for each possible combination of a decision and world state (so if the creature headed to the substantial patch and predators were present then this would be worth a utility of -10). The utility value is simply a measure of how much the decision maker values the outcome. From this table, the rational decision can be determined given the state of the world. So if the predators are present, then the less substantial patch is clearly the best option.

Generally, though, decision theory deals with decision making under conditions of uncertainty. This means it deals with circumstances where the state of the world isn’t known. So, our creature might not know whether there are predators around today. In this case, it could make its judgement based on how likely the predators are to be around. This is the final piece of information that decision theory requires: the probability that the world is in a certain state.

It combines all of this information together into a single formula which calculates the expected utility of each decision:

EU(d) = Σ_s P(s) × U(s, d)

The decision with the highest expected utility is the rational one. (ETA: Those who have studied decision theory before might be expecting the probability here to take into account the decision in some way. This issue will be discussed in the next post).

What does this equation mean? Well basically, we take all of the utilities for each possible decision and add them together. So in the case of the creature deciding to go for the substantial patch this is (-10 + 5 = -5) and for the less substantial patch it is (2 + 2 = 4). However, if the state of the world leading to a utility is unlikely to happen, we want that utility to count for less, because the creature is unlikely to get it. So each utility is weighted by the probability of receiving it (in other words, by the probability of the world state occurring, as the world state determines what utility the creature will receive).
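The weighted version can be sketched in a few lines of Python. The 50/50 chance of predators is an assumption added purely for illustration; the utilities are the ones from the table above.

```python
# Naive expected utility: P(world state) ignores the decision entirely.
# Assume, for illustration only, that predators are present half the time.
p_state = {"predators": 0.5, "no_predators": 0.5}

utility = {
    ("substantial", "predators"): -10,
    ("substantial", "no_predators"): 5,
    ("less_substantial", "predators"): 2,
    ("less_substantial", "no_predators"): 2,
}

def expected_utility(decision):
    # Each outcome's utility is weighted by how likely its world state is.
    return sum(p_state[s] * utility[(decision, s)] for s in p_state)

# EU(substantial) = 0.5*(-10) + 0.5*5 = -2.5
# EU(less_substantial) = 0.5*2 + 0.5*2 = 2.0
```

Under these assumed probabilities the weighted sums point the same way as the unweighted ones, but the weighting is what lets the answer change as the predator probability changes.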

This is decision theory in its basic form.

The next post will look at a problem with this approach.


This post will summarise the paper Varieties of Causal Intervention and, by doing so, will explore what a causal intervention is. To start with, imagine a Bayes Net representing a causal process. Let’s say that cleaning your teeth and your genes both affect the chances of dental decay, but that the same genes also influence the chance that you’ll clean your teeth:

Now imagine there’s a government policy being considered whereby policemen would enforce tooth cleaning. To analyse the effects of this we would intervene on “Clean”, which we can think of as setting the value of the variable and ignoring the influence of any parents, and we would then observe the effects of the intervention on decay.

**Intervention vs Observation**

This concept of intervention is different to observation. Imagine, for example, that we observed someone cleaning their teeth. This may have a different effect from intervening to make them clean their teeth: observed cleaning suggests they’re more likely to have certain genes, and these genes also make it less likely that their teeth will decay. Thus, in this case, the observation of tooth cleaning would shift our expectations about decay more strongly than an intervention would.

The important point is that the two concepts are different because intervention surgically removes the context of any parent nodes.
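This difference can be computed directly. Below is a toy version of the genes/clean/decay network with invented numbers (good genes make both cleaning and healthy teeth more likely), comparing conditioning on an observation of cleaning with an intervention that forces cleaning.

```python
# Toy genes -> clean -> decay model. All probabilities are invented
# assumptions for illustration, not values from the paper.
p_genes = {"good": 0.5, "bad": 0.5}              # P(genes)
p_clean = {"good": 0.9, "bad": 0.3}              # P(clean | genes)
p_decay_given_clean = {"good": 0.1, "bad": 0.4}  # P(decay | clean, genes)

def p_decay_observing_clean():
    # Seeing cleaning re-weights our beliefs about the genes by Bayes.
    joint = {g: p_genes[g] * p_clean[g] for g in p_genes}
    total = sum(joint.values())
    return sum((joint[g] / total) * p_decay_given_clean[g] for g in p_genes)

def p_decay_do_clean():
    # Intervening cuts the genes -> clean arrow: cleaning is forced,
    # so the gene distribution stays at its prior.
    return sum(p_genes[g] * p_decay_given_clean[g] for g in p_genes)

# Observation gives P(decay) = 0.175; intervention gives 0.25.
# Observing cleaning shifts our expectation of decay further than
# intervening does, exactly as described above.
```

The only coding difference between the two functions is whether the gene distribution gets updated, which is the "surgical removal of the parent's context" in miniature.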

**Making interventions complex**

The simple view of interventions, expressed in most of the literature on the topic, is that they are deterministic, always achieve the desired effect and affect only one variable. However, interventions in the real world often fail to match these simplified assumptions. For example, an intervention to pressure someone to clean their teeth might fail. An intervention might affect multiple variables rather than just one. And an intervention may be indeterministic, failing to set a variable to a specific value.

To model this, rather than simply changing the value of a variable in the system (setting clean to true, in the above example), the paper suggests introducing a new parent node for the variable, one which sits outside the system and which aims to shift the variable toward some particular target distribution. See below:

This intervention node will be binary (yes/no) but its interaction with other nodes is open so that the particular target distribution may not be achieved. Note that intervention variables can be parents of more than their target variable, allowing side effects to be modelled.

**Types of intervention**

An intervention then leads to a new probability distribution across the states of the targeted variable. The types of intervention possible are:

- An independent intervention is one where the results of the intervention do not depend on the other parents of the variable. This changes the probability distribution of the target to that intended by the intervention. A dependent intervention, by contrast, would calculate the new probability distribution based on both this intended distribution and the other parents. So imagine a gene therapy intervention that fixes the effect of bad genes on tooth cleaning. For someone with bad genes this will decrease the chance of decay. For someone who already has the good genes, it will have no effect. This would be a dependent intervention.
- A deterministic intervention aims to achieve a specific effect, that is, to set a variable to a specific value (say, forcing people to clean their teeth). It may still be complicated to model if the intervention is dependent. A stochastic intervention instead aims to leave the target variable with a distribution over more than one state.

**Conclusion**

Representing interventions as the introduction of a new parent node rather than as a surgical change of the value of a variable allows a wider range of types of intervention to be modelled.


Many of the posts on this blog in relation to causality have a hidden assumption: Namely, that causality is inherently probabilistic (though many of the posts are still just as relevant whether this is accepted or not). However, this goes against the mainstream view. This post will summarise part of the paper Varieties of Causal Intervention which argues against the mainstream.

**Deterministic causality**

Judea Pearl, one of the principal researchers in the area of causality, has argued that causality should be interpreted deterministically for the following reasons:

- The deterministic interpretation is the more intuitive one.
- Deterministic interpretations are more general as any indeterministic interpretation can be modelled as a deterministic one.
- A deterministic causality is needed to make sense of counterfactuals and causal explanation.

This post is going to ignore the first point (intuitive doesn’t mean right) and will explore possible responses to the other two.

**Deterministic causality as the more general theory**

Imagine a system where both A and B have a causal influence on C. We can model that as the following equation:

C = dA + eB + U

This says that C is influenced to degree d by A and degree e by B. It also says that C isn’t entirely determined by A and B but there is some degree of variation. By adding U in, a case that was originally indeterministic has become deterministic. However, that doesn’t mean the causal system being modelled is in fact deterministic unless U is a part of the system. Given that, in practice, U is normally defined as that which is left over, claiming this as part of the system seems to be reasonable only if we presume from the start that the system is deterministic.

Given that indeterministic worlds are consistent, it seems to be an a posteriori question whether causality should be deterministic or indeterministic, and hence such an a priori assumption seems unwarranted.

Even if you don’t buy all that, there’s a further point: this process can be applied in both directions. Any indeterministic system can be modelled as a deterministic one, but any deterministic system can also be modelled as an indeterministic one.

Neither way is more general.

**Indeterministic causality and causal explanation**

In a previous post I discussed an approach by the same author to type and token causality that claims to present an indeterministic account of causal explanation. As such, the third of the problems listed above seems to be solved.

**The next post is Categories and types of intervention**


In the previous post we looked at Bayes Theorem, the equation below:

P(A|B) = P(B|A) × P(A) / P(B)

Basically, what this does is allow us to take a prior probability and produce a posterior probability given some evidence. So, for example, if we were trying to determine the probability that the sun will rise tomorrow, we would start with our prior probability (very high) but then, if we learnt that an alien fleet was determined to destroy it tonight, we could use Bayes Theorem to recalculate this probability based on that new evidence.

In the equation, P(A) represents the prior probability and if we don’t know this figure then it becomes impossible to apply the equation. This post is going to explore Kolmogorov Complexity, one suggestion for how we can reason under these conditions.

**What is Kolmogorov Complexity?**

Kolmogorov Complexity is an attempt to formalise the loose concept of how complex a string is. It is defined as the length of the shortest program that can produce the string. Of course, the length of this program will differ depending on which programming language you use to implement it. However, it can be shown that the difference in this length can be no more than a constant.

To explain, let’s imagine that we originally used the programming language C++ to determine the complexity of a string, S. Now let’s say that instead we use Java to do the same job. We could choose to write a C++ interpreter in Java and then run the original program. That means that, at most, the complexity of the string in Java can be equal to its complexity in C++ plus the length of the interpreter. So, while the choice of language can lead to different Kolmogorov Complexities, this difference is bounded.
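Kolmogorov Complexity itself is uncomputable, but a common computable stand-in for the intuition is compressed length: a string with an obvious pattern should need a much shorter description than a random-looking one. A rough sketch using zlib (this is only a proxy, not Kolmogorov Complexity proper):

```python
import random
import zlib

# Compressed length as a crude, computable proxy for descriptive
# complexity. The choice of zlib and the seed are arbitrary.
def compressed_length(s: str) -> int:
    return len(zlib.compress(s.encode("utf-8")))

regular = "ab" * 500  # 1000 characters with an obvious short description

rng = random.Random(0)
irregular = "".join(rng.choice("abcdefghij") for _ in range(1000))

# The patterned string compresses to far fewer bytes than the
# pseudo-random one of the same length.
```

The gap between the two compressed lengths is the proxy's version of the complexity gap the post is describing.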

**Kolmogorov Complexity and Bayes Theorem**

Ockham’s Razor famously says that, “All else being equal, the simplest theory is the best”, or at least something along those lines, and this is the basic idea behind using Kolmogorov Complexity in Bayes Theorem. Our problem with Bayes Theorem was circumstances where we did not know the prior probability of a statement. Ockham’s Razor provides a response – the prior should depend on the simplicity of the theory. And Kolmogorov Complexity gives us a way to formalise this.

So the idea, under circumstances where no more information is available, is to define the prior, P(m), as:

2^{-K(m)}

This has the desired form: simpler theories (shorter programs) receive higher priors. However, this definition will not ensure that the probabilities sum to 1, so we instead use what is called prefix-Kolmogorov Complexity. This prefix-Kolmogorov Complexity, L(m), is calculated as follows:

L(m) = K(m) + log_{2} K(m)

And the prior probability is then calculated as:

2^{-L(m)}

In this case, the probabilities sum to no more than one, so they can serve as a prior.
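The reason the prefix trick works is the Kraft inequality: for any prefix-free set of codewords, the weights 2^-length sum to at most 1, so they can behave like probabilities. This is a standard fact about prefix-free codes rather than anything specific to the paper; a tiny check:

```python
# Kraft inequality check on a small prefix-free code.
codewords = ["0", "10", "110", "111"]  # no codeword is a prefix of another

def is_prefix_free(words):
    # True if no codeword is a proper prefix of any other codeword.
    return not any(a != b and b.startswith(a) for a in words for b in words)

kraft_sum = sum(2.0 ** -len(w) for w in codewords)
# 1/2 + 1/4 + 1/8 + 1/8 = 1.0, so the weights form a valid distribution.
```

For this particular code the weights sum to exactly one; in general the Kraft sum can fall short of one, which is why the resulting prior may need normalising.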

**Next post**

The next sequence of posts will explore causal discovery – that is, finding causal information from data.


Having written a few posts on Causality I’m now going to write a short sequence exploring basic concepts in the formalisation of thought. This first post will deal with Bayes Theorem. However, because there are already good explanations of Bayes Theorem (for example, this one by Eliezer Yudkowsky and this one on community blog Less Wrong) I’m not going to do a standard basic introduction but will instead apply Bayes Theorem to a well known puzzle, the Monty Hall problem, to show how it can aid with probabilistic reasoning.

**What is Bayes Theorem**

Bayes Theorem is simply an equation that allows you to calculate conditional probabilities. That is, the probability of A given B. The equation is:

P(A|B) = P(A ∧ B) / P(B)

This just means that, for example, the probability that I’m hungry given that my belly is rumbling is equal to the number of times that I’m hungry and my belly rumbles out of the number of times that my belly rumbles, which is intuitively what we would expect it to say.

There are a number of variants of this formula, but the one that matters for us is Bayes Theorem under the condition that you are given not just one piece of information but two. That formula is as follows:

P(A|B,C) = P(B|A,C) × P(A|C) / P(B|C)
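Both forms are one-liners in code. The numbers in the example below are invented purely to exercise the hungry/rumbling illustration; the two-condition form simply carries an extra condition C through every term.

```python
# Bayes Theorem as code. The caller supplies the component probabilities.
def bayes(p_b_given_a, p_a, p_b):
    """P(A|B) = P(B|A) * P(A) / P(B)"""
    return p_b_given_a * p_a / p_b

def bayes_two_conditions(p_b_given_ac, p_a_given_c, p_b_given_c):
    """P(A|B,C) = P(B|A,C) * P(A|C) / P(B|C)"""
    return p_b_given_ac * p_a_given_c / p_b_given_c

# Hungry/rumbling with made-up numbers: I'm hungry 20% of the time,
# my belly rumbles 10% of the time, and when I'm hungry it rumbles
# 40% of the time.
p_hungry_given_rumble = bayes(p_b_given_a=0.4, p_a=0.2, p_b=0.1)
# = 0.4 * 0.2 / 0.1 = 0.8
```

So on these numbers, a rumbling belly raises the probability of hunger from the 0.2 prior to a 0.8 posterior.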

**What is the Monty Hall Problem?**

You’re on a quiz program and are faced with three doors. A door has been selected at random before the show and a car placed behind it. The other doors have nothing behind them. You are asked to choose a door.

Once you have done so, the host chooses, at random, one of the other doors. He never chooses the one with the car behind it. He opens the door and then asks whether you would like to change your answer?

Should you do so?

**An intuitive response to the Monty Hall Problem?**

The initial response is normally “No” or, more to the point, “Who cares? Whatever door I choose there’s a 1/3 chance of winning, so why bother changing.”

The answer, however, is actually “Yes.” Think about it like this: You had a 1/3 chance of choosing the winning door in the first place. Now, if you change, you will lose. You had a 2/3 chance of choosing a losing door in the first place. Now, if you change, you will win (because the other door has been opened to reveal no prize so you will inevitably change to the winning door). So 2/3 of the time you win by changing, 1/3 of the time you don’t.
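If the argument still feels slippery, a Monte Carlo simulation makes it concrete: play many games both ways and compare win rates. The door numbering and random seed below are arbitrary choices.

```python
import random

# Simulate Monty Hall games and compare sticking with switching.
def play(switch, rng):
    doors = [0, 1, 2]
    prize = rng.choice(doors)
    choice = rng.choice(doors)
    # The host opens a door that is neither your choice nor the prize.
    opened = rng.choice([d for d in doors if d != choice and d != prize])
    if switch:
        choice = next(d for d in doors if d != choice and d != opened)
    return choice == prize

trials = 100_000
rng = random.Random(0)
stay_wins = sum(play(False, rng) for _ in range(trials)) / trials
rng = random.Random(0)  # replay the same games, this time switching
switch_wins = sum(play(True, rng) for _ in range(trials)) / trials
# stay_wins comes out near 1/3 and switch_wins near 2/3
```

Replaying the identical games with and without switching also makes the complementarity visible: every game the sticker loses is a game the switcher wins, except when the first guess was right.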

**What information do you gain?**

The reasoning is easy enough to understand. What’s harder is to figure out what new piece of information the presenter gave you by opening the door. It becomes a bit clearer when we consider the following situation: if the host had simply pointed at a door at random, regardless of whether it was the winning door, and not opened it, then changing doors would not be beneficial. So the information is given by the fact that he cannot open the winning door.

So imagine a different situation: the host says you can either guess the door straight up or, before you do, he will point at a door. He will point to the correct door 2/3 of the time and lie and point to an incorrect one 1/3 of the time. Do you simply guess, or do you wait and then choose the door he points out? If you guess, you have a 1/3 chance of winning; if you wait, he is telling the truth 2/3 of the time. So you wait.

The new piece of information in the first case is similar to that: by not opening a door, the host is giving you evidence that it is the correct door. It just so happens that this information is only accurate 2/3 of the time but it’s still additional information on those occasions.

To put it another way: even if you were told you’d picked a wrong door, you would still not know which of the other two doors the prize was behind. But after the other wrong door was opened, you would. So the host’s choice of door gives you new information in the case where you’d picked a wrong door. And given that you’re 2/3 likely to have picked a wrong door and only 1/3 likely to have picked the right one, this information is of use.

**Bayes Theorem and the Monty Hall Problem**

All of this is well and good in relation to the specific problem but, unless you got it right the first time you heard it, what it has revealed is that there is a flaw in the way that you process probabilistic information. And this flaw isn’t necessarily dissolved simply by understanding one circumstance under which it was revealed. This is where Bayes Theorem comes in. Bayes Theorem is an approach that increases your general skills of probabilistic reasoning.

So Bayes theorem, remember, is about conditional probability: The probability that A is true given the evidence B.

Bayes Theorem makes conditioning on the situation (right door/wrong door) explicit. It would still be possible to know Bayes Theorem but not think to use it in this case. But if you approach probabilistic questions with Bayes Theorem in mind then you’re explicitly primed to ask whether there is a conditional approach to the problem. Here’s a Bayes Theorem based approach to the Monty Hall Problem.

So what we want to discover is the probability that the unselected and unopened door contains the prize, given the information of your choice and the opening of a door.

We now calculate the various components. Say you chose door 1 and the host opened door 3; we want the probability that the prize is behind door 2. First:

P(host opens door 3 | prize behind door 2, you chose door 1) = 1

Because the host never selects the door you chose or the one with the prize.

P(prize behind door 2 | you chose door 1) = 1/3

Because the position of the prize had an initial probability of 1/3 and this doesn’t change based on the choice. Finally, by symmetry, the host is equally likely to open door 2 or door 3:

P(host opens door 3 | you chose door 1) = 1/2

Now plug these in:

P(prize behind door 2 | host opens door 3, you chose door 1) = (1 × 1/3) / (1/2) = 2/3

That’s the right answer (the probability that the prize is behind the door that has not been chosen or opened is 2/3, so you should change your choice).
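The same arithmetic, written as a tiny Python calculation (the door labels follow the convention that you picked door 1 and the host opened door 3):

```python
# Exact Monty Hall posterior via Bayes Theorem.
p_prize_2 = 1 / 3           # prior: the prize was placed at random
p_open3_given_prize2 = 1.0  # host can't open your door or the prize door
p_open3 = 1 / 2             # by symmetry, host opens door 2 or door 3

posterior = p_open3_given_prize2 * p_prize_2 / p_open3
# (1 * 1/3) / (1/2) = 2/3, so switching is the better choice
```

Reading the three inputs off the problem statement and dividing is the whole computation; the surprise of the puzzle lives entirely in the 1/2 in the denominator.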

But the important thing about Bayes Theorem isn’t that it gets the right answer, it’s that its use improves your ability to reason with probability so you’re less likely to get other probability questions wrong in the future. Simply by plugging in what you want to know and what you do know, this equation can improve your abilities and not just give you the answer to a specific question.

**The limits of Bayes Theorem**

Bayes Theorem is more than a solution to a specific problem – it’s a general technique for reasoning with probability. That’s not to say it’s a cure-all. For some problems, the difficulty lies in calculating the component probabilities above and, in those cases, Bayes Theorem won’t help. However, it’s a big step toward being able to solve problems where the components are simple but it’s the combination of them that is complicated (see Greg Egan’s website for an example of a problem that Bayes Theorem may not help you with).

**The next post is Bayes Theorem in practice: Kolmogorov Complexity**


Post 1: Bayes Theorem and the Monty Hall Problem – An introduction to Bayes Theorem, one of the most important equations in probabilistic reasoning. Applied to the Monty Hall Problem and with links to other more thorough introductions to the theory.

Post 2: Bayes Theorem in practice: Kolmogorov Complexity – Bayes Theorem takes a prior probability of something being true and modifies it based on new evidence. But what do you do if you have no prior probability to use in the equation?
