# Probability/Conditional Probability

## Motivation[edit | edit source]

In the previous chapter, we have actually only dealt with *unconditional probabilities*.
To be more precise, in the problems encountered in the previous chapter, the sample space is defined initially,
and all probabilities are assigned with respect to that initial sample space.
However, in many situations, after defining the initial sample space for a random experiment,
we may get some new information about the random experiment.
Hence we may need to *update* the sample space based on those information.
The probability based on this updated sample space is known as a *conditional probability*.

To illustrate how we get the new information and update the sample space correspondingly, consider the following example:

**Example.**
Suppose we randomly draw a card from the poker deck (such that every card is equally likely to be drawn).
We define the sample space to be the set containing all 52 cards.

(a) Calculate the probability that an ace is drawn.

(b) Suppose an ace is drawn out from the poker deck beforehand. Calculate the probability that an ace is drawn. (This is a *conditional probability*.)

(c) Suppose two aces are drawn out from the poker deck beforehand. Calculate the probability that an ace is drawn. (This is a *conditional probability*.)

(d) Suppose three aces are drawn out from the poker deck beforehand. Calculate the probability that an ace is drawn. (This is a *conditional probability*.)

(e) Suppose we instead randomly draw four cards from the deck. Calculate the probability that four aces are drawn.

*Solution*.

(a) The probability is (since there are four aces in the deck).

(b) With the given *condition* (an ace is removed from the deck), we know that there are only 51 cards, and 3 aces in the deck. So, the sample space is updated to contain these 51 cards. Although we cannot precisely describe the sample space since we do not know *which* ace is removed,
we know that the sample space has 51 sample points and only 3 aces.
It is also reasonable to assume that the sample points in the updated sample space are equally likely. Hence, the probability should be
.
(Notice that we are still able to calculate the probability, despite we do not know exactly what the updated sample space is.)

(c) Similar to (b), the sample space is updated to contain 50 cards and two aces. Hence, the probability is .

(d) Similar to (b), the sample space is updated to contain 49 cards and one ace. Hence, the probability is .

(e) Clearly, we can use the concept of combinatorial probability to calculate this probability: .

Another argument (which may be quite intuitive to many people) is to consider the four draws "one by one".

- For the first draw, the probability is .
- For the second draw, we know that an ace is drawn for the first draw, so the probability is (similar to (b)).
- For the third draw, we know that two aces are drawn for the first two draws, so the probability is (similar to (c)).
- For the fourth draw, we know that three aces are drawn for the first three draws, so the probability is (similar to (d)).

Then, we are somehow told that multiplying all the probabilities gives the desired one ("multiplication rule of probability"):

It turns out that this argument is valid, but actually we have not discussed any results that justify this argument in the previous chapter. Indeed, we have implicitly used the concept of conditional probability (in 2,3,4 above), where the sample space is updated as we have new information/knowledge. (In some sense, the (conditional) probability reflect our state of knowledge about the random experiment (about the deck).)

From the example above, we are able to calculate the (conditional) probability "reasonably" through some arguments (see (b)) when the sample points in the initial sample space are equally likely. Furthermore, we can notice that the condition should be an occurrence of an event, which involves the sample points in the sample space. When the condition does not involves the sample points at all, it is irrelevant to the random experiment. For example, if the condition is "the poker deck costs $10", then this is clearly not an event in the sample space and does not involve the sample points. Also, it is irrelevant from this experiment.

To motivate the definition of conditional probability, let us consider more precisely how we obtain the (conditional) probability in (b). In (b), we are given that an ace is drawn out from the poker deck beforehand. This means that ace can never be drawn in our draw. This corresponds to the occurrence of the event (with respect to original sample space) (denoted by ) which consists of 51 sample points, resulting from excluding that ace from the original 52 sample points. Thus, we can regard the condition as the occurrence of event . Now, under this condition, the sample space is updated to be the set , that is, only points in are regarded as legit sample points now.

Consider part (b) again. Let us denote by the event (with respect to the original sample space).

Now, only *part* of the points (whose also lie in the set ) in the set are regarded as legit sample points.
All other points in set are no longer legit sample points anymore under this condition.
In other words, only the points in *both* sets and (i.e., in set ) are legit sample points in event under this condition.

In the part (b) above, only the three aces remaining in the deck (in both sets and , and hence in set )
are considered to be legit sample points. The other ace in set (the ace that is drawn out in the condition) is
*not* considered to be a legit sample point, since that ace is not in the deck at all!

To summarize, when we want to calculate the *conditional probability* of event *given the occurrence of event *, we do the following:

- We update the sample space to the set .
- We only regard the sample points in set to be the (valid) sample points of event .

In the above example, we encounter a special case where the sample points in the initial sample space (assumed to be finite) are equally likely (and hence the sample points in the updated sample space should also be equally likely). In this case, using the result about combinatorial probability (in previous chapter), the conditional probability, denoted by , is given by

*not*an event/set.)

When the sample points are *not* equally likely, we can apply a theorem in the previous chapter for constructing probability measure
on the updated sample space . (Here, we assume that is countable.)
Particularly, since we are only regarding the sample points in set as the (valid) sample points of event ,
it *seems that* the (naive) "conditional probability" of given the occurrence of event should be given by

*initial*probability space ).

However, when we apply the original probability measure (in the original probability space) to every singleton event in the new sample space , we face an issue: the sum of those probabilities are just which is not 1 in general! But that theorem requires this probability to be 1! A natural remedy to this problem is to define a new probability measure , based on the original probability measure and the above (naive) "conditional probability" , such that the sum must be 1. After noticing that , a natural choice of such probability measure is given by

*normalizing*constant, and every (naive) "conditional probability" (as suggested previously) is scaled by a factor of . (It turns out that the probability measure defined in this way also satisfies all the probability axioms. We will prove this later.)

When comparing this formula with the formula for the equally likely sample points case, the two formulas look quite similar actually.
In fact, we can express the formula for the *equally likely sample points case* in the same form as this formula (since the equally likely
sample points case is actually just a special case for the theorem we are considering):

So, now we have developed a reasonable and natural formula to calculate the conditional probability when the outcomes are equally likely (applicable to finite sample space), and the outcomes are not equally likely (only for countable sample space). It is thus natural to also use the same formula when the sample space is uncountable. This motivates the following definition of conditional probability:

## Definition[edit | edit source]

**Definition.**
(Conditional probability)
Let be a probability space, and be events.
Assume that . Then,
The *conditional probability* of event given the occurrence of event , denoted by , is

**Remark.**

- The assumption that prevents the above formula gives an undefined value.
- It follows that .

- This formula can be illustrated by the following
*tree diagram*:

- This formula can be illustrated by the following

*-------> A P(A n B) / P(A|B) P(B) / *------> B / \ / \ / *-------> A^c \ \ \ *------> B^c

- Sometimes, the probability is unknown, and we are given and . In this case, we can apply this formula to get .
- Also, the value of is often
*not*stated in the question explicitly. Instead, we may have to use our intuition to*understand*the situation to get the conditional probability . - Besides, in a more complicated situation, it may not be clear that what the condition "" in is, and therefore we may have to decide
*what*should we condition at, depending on the context. (When we condition on some appropriate events, we may be able to determine the conditional probability readily.)

- Through this definition, we
*induces*(in some sense) another probability space where the conditional probability is taken to be the probability measure: , where is defined by

- for every .
- (Technically speaking, since we have not proved that satisfies all probability axioms, we cannot say that is a probability space yet. But it turns out to be the case.)

**Example.**
( is a valid probability measure)
Let be a probability space, and be events.
Assume that .
Define the function by

Prove that is a valid probability measure, that is, this function satisfies all three probability axioms.

**Proof.**

**P1 (Nonnegativity)**:
Since and ( satisfies all the probability axioms, particularly, the nonnegativity),
it follows that

**P2 (Unitarity)**:
We have

**P3 (Countable additivity)**:
For every infinite sequence of pairwise disjoint events , we have

**Example.**
(Special cases for conditional probability)

- Suppose . Then, when we update the sample space to set , the set contains all (legit) sample points in the updated sample space. Thus, the conditional probability should be 1. Alternatively and more intuitively, given that occurs, this means the realized outcome lies in set , and hence must lie in set (since ). So, the probability should be 1.

- Formally, we can see this readily: ( when .)

- Suppose and are disjoint. Then, when we update the sample space to set , the set contains zero (legit) sample point in the updated sample space. Thus, the conditional probability should be 1. Alternatively and more intuitively, given that occurs, this means the realized outcome lies in set . So, it must
*not*lie in set . Hence, the probability should be 0.

- Formally, we can see this readily: .

**Example.**
Suppose we roll a fair dice once. Let and be the events that even number comes up
and prime number comes up respectively.

(a) Calculate .

(b) Calculate .

*Solution*.

(a)

(Alternatively, we may use the formula obtained in the motivation section since all sample points are equally likely: (a) ; (b) .)

**Exercise.**

(a) Calculate .

(b) Calculate .

(a)

(b)

**Example.**
Suppose Amy tosses a fair coin twice, and she tells Bob that she obtains at least one heads in the two tosses.

(a) Calculate the probability that the outcome of the two tosses is "HH" (two heads).

(b) Calculate the probability that the outcome of the two tosses is "TT" (two tails).

*Solution*.

(a) The probability is

(b) Notice that . Thus, the probability is 0.

**Exercise.**
A student claims that the probability that the outcome of the two tosses is "HH" should be . He reasons as follows:

When Bob is told that there is at least one heads in the two tosses, we can update the sample space to be . The probability is thus

Point out the mistakes made by the student.

The mistake is that the two outcomes in are *not* equally likely. Indeed, the outcome "exactly one head" in
the set can actually be decomposed into two more specific outcomes: and .
So, strictly speaking is not even a sample space!

**Example.**
(Boy or Girl paradox)
Amy is a mother with two children, where each of them is equally likely to be a boy or girl.

(a) Suppose Amy tells you that she has at least one son. Calculate the probability that Amy has two sons .

(b) Suppose Amy tells you that her older child is a son. Calculate the probability that Amy has two sons.

*Solution*.

Define the sample space to be where represents the older child is boy and the younger child is a girl, etc. Then, all four sample points in the sample space are equally likely.

(a) The probability is

(b) The probability is

**Remark.**

- This example shows that even with a small change in the given information, the conditional probability obtained can be quite different. So, we should be careful about what information is actually given in order to have a correct calculation.

**Example.**
Consider a course in probability theory.
The following summarizes the probability for a student to pass the course when the student has different scores in the mid-term exam:

(a) Suppose the probability for getting at least 50 marks in the mid-term exam (for every student) is 0.6. Calculate the probability that a student gets less than 50 marks in the mid-term exam but pass the course.

(b) Assuming that it is equally likely for the score of mid-term exam of a student to lie in one of the above four intervals, calculate the probability that a student (with unknown score) passes the course.

*Solution*.

(a) Let be the event that a student gets at least 50 marks in the mid-term exam, be the event that a student passes the course. Then, the desired probability is . From the question, we know that , and . Hence, the desired probability is

(b) Let be the event that the score of mid-term exam of a student lies in respectively. From the assumption, we have . Notice that the events are pairwise disjoint. Also,

**Exercise.**
Suppose the assumption for part (b) above still applies.

(a) Calculate the probability that a student gets at least 90 marks in the mid-term exam if the student passes the course. (Answer: approximately 0.3585)

(b) Calculate the probability that a student gets less than 50 marks in the mid-term exam if the student fails the course. (Answer: approximately 0.5185)

Let us continue using the notations defined in part (b) above.

(a) The desired probability is

(b) The desired probability is

**Example.**
Amy rolls two fair six-faced dice, with one colored red and another colored blue (so that they are distinguishable), without looking at the dice.
After Amy rolls the two dice, Bob tells Amy that there is *at least one* 6 coming up (assume Bob tells the truth).
Calculate the probability that 6 comes up for both dice after hearing the information from Bob.

*Solution*.

The condition is there is *at least one* 6 coming up, and the probability of this condition can be calculated by inclusion-exclusion principle:

**Exercise.**

1. Calculate the probability again if the blue dice is colored red, such that the two dice is not distinguishable.

Changing the color of dice does not affect the probability. Thus, the probability is still .

**Remark.**

- After changing the color, although the number of sample points in the sample space changes, the sample points are not equally likely.

2. Chris claims that the desired probability in the example should be , and he reasons as follows:

Given that there is at least one 6 coming up, we know that 6 comes up in a dice. Now, we consider another dice, having six equally likely possible outcomes for the number coming up, namely 1,2,3,4,5 and 6. Thus, we can update sample space to where the second coordinate of each ordered pair represents the number coming up for another dice.

The desired event is that 6 comes up for both dice, that is, . Clearly, . It follows that the probability is .

We know that the correct answer is , and not , but why is this claim wrong? (cf. this discussion)

The six outcomes considered include the outcome in which 6 comes up in *either* red or blue dice.
But the sample points for which 6 comes up for the blue or red dice only (resp.) are missed.
There are five of them, namely .
So, the updated sample space suggested is actually not complete, and thus a wrong answer is obtained.

The complete updated sample space should be

**Remark.**

- Chris' claim is correct if Bob tells Amy that 6 comes up for red (or blue) dice. There is difference between '6 comes up for red (or blue) dice' and 'there is at least one 6 coming up (which does not specify color)'. E.g., if Bob tells Amy that 6 comes up for the red dice, then the updated sample space does consist of 6 outcomes, as suggested by Chris:

**Example.**
Suppose a poker deck is to be distributed to four players: player A,B,C,D.
Calculate the probability that player C gets exactly 3 cards with diamond suit, with the condition

(a) Exactly 8 cards with diamond suit are distributed to players A and B, and then 13 of the remaining 44 cards are to be distributed randomly to player C.

(b) 26 cards are distributed to players A and B, where exactly 8 of them are with diamond suit. Then, the remaining 26 cards are to be distributed randomly and equally to players C and D.

*Solution*.

(a) Let us update the sample space to contain the possible distributions of the 13 of the 44 cards. The number of sample points in the updated sample space is (placing 13 card places for player C into 44 cards). On the other hand, the number of sample points in the event is (placing 3 card places for player C into 5 cards with diamond suit, then placing 10 card places for player C into 39 cards with non-diamond suit). Then, the probability is

(b) Let us update the sample space to contain the possible distributions of the remaining 26 cards. The number of sample points in the updated sample space is (placing 13 card places for player C into 26 cards. Then, the remaining 13 cards are for player D). On the other hand, the number of sample points in the event is . (placing 3 card places for player C into 5 distinguishable cards with diamond suit, then placing 10 card places for player C into 39 cards with non-diamond suit. The remaining 13 cards are for player D.) Then, the probability is

**Exercise.**
Suppose 13 of the 52 cards are to be distributed to player C. Calculate the probability that player C gets
exactly 3 cards with diamond suit. (Answer: approximately 0.056)

The probability is

**Example.**
(Simpson's paradox)
Suppose there are two treatments for a disease: treatment A and B.
To test the effectiveness of the treatments, treatment A and B are applied to two different groups of patients with that disease.
The following summarizes the number of patients with different results for the two treatments:

(a) Calculate the probability of recovery for men receiving treatment A.

(b) Calculate the probability of recovery for men receiving treatment B.

(c) Calculate the probability of recovery for women receiving treatment A.

(d) Calculate the probability of recovery for women receiving treatment B.

(e) Calculate the probability of recovery for all patients receiving treatment A.

(f) Calculate the probability of recovery for all patients receiving treatment B.

(To calculate the probability of recovery, we consider the probability of picking a recovered patient when we select a patient from
the pool of patients randomly such that every patient is equally likely to be picked, where the pool of patients depends on our conditions.)

*Solution*.

(a) The probability is .

(b) The probability is . (This means treatment B is better than A when applied to men only.)

(c) The probability is .

(d) The probability is . (This means treatment B is better than A when applied to women only.)

(e) The probability is .

(f) The probability is . (This means treatment B is *worse* than A when applied to all patients.)

**Remark.**

- This example shows the
*Simpson's paradox*where there is a reverse in direction of a comparison when data from several groups are combined into a single group.

The following is a generalization to the formula . It is useful when we calculate the probability of the occurrence of multiple events together, by considering the events "one by one".

**Proposition.**
(Multiplication rule of probability)
Let be a probability space, and
be events.
Then,

**Proof.**
With the assumptions, we have by definition

**Remark.**

- It is also known as
*chain rule*of probability.

**Example.**
Consider an urn that contains 20 distinguishable balls, where 5 of them is red, 6 of them is blue, 7 of them is green, and 2 of them is purple.
Suppose we draw 5 balls from the urn.

(a) Calculate the probability of drawing 3 red balls and 2 blue balls from the urn, if the draw is done without replacement.

(b) Calculate the probability of drawing 3 red balls and 2 blue balls from the urn, if the draw is done with replacement.

*Solution*.

Let be the event that red ball is drawn in -th draw, and blue ball is drawn in -th draw respectively.

(a) The desired probability can be expressed as (or , etc., but simply changing the order does not affect the probability in this case).

**Method 1**: Combinatorial probability.
Let be the sample space that contains all possible outcomes of the 5 ordered draws.
The probability is

**Method 2**: Multiplication rule.
Then, the probability is given by

(b) By the multiplication rule, the probability is given by

*independent*. We will discuss the formal definition of independence in more details in a later section. )

Two important theorems related to conditional probability, namely law of total probability and Bayes' theorem, will be discussed in the following section.

## Law of total probability and Bayes' theorem[edit | edit source]

Sometimes, it is not an easy task to assign a suitable unconditional probability to an event. For instance, suppose Amy will perform a COVID-19 test, and the result is either positive or negative (impossible to have invalid results). Let . What should be ? It is actually quite difficult to answer directly, since this probability is without any condition. In particular, it is unknown that whether Amy gets infected by COVID-19 or not, and clearly the infection will affect the probability assignment quite significantly.

On the other hand, it may be easier to assign/calculate related conditional/unconditional probabilities.
Now, let .
The *conditional* probability , called *sensitivity*, may be known based on the research on the COVID-19 test.
Also, the *conditional* probability , called *specificity*, may also be known based on the research.
Besides, the probability may be obtained according to studies on COVID-19 infection for Amy's place of living.
Since

This shows that conditional probabilities can be quite helpful for calculating unconditional probabilities, especially when we condition appropriately so that the conditional probabilities, and the probability of the condition are known in some ways.

The following theorem is an important theorem that relates unconditional probabilities and conditional probabilities, as in above discussion.

**Theorem.**
(Law of total probability)

Let be a probability space.

- (finite case) Assume that are pairwise disjoint events such that and . Then,

- (countably infinite case) Assume that are pairwise disjoint events such that and . Then,

**Proof.**
Here we only prove the finite case. The proof for countably infinite case is similar and thus left as an exercise.

Under the assumptions, are pairwise disjoint, and thus are also pairwise disjoint (by observing that , and other intersections have similar results). Also, since , the conditional probabilities are defined. Moreover, since , we can observe that (through Venn diagram, informally). It follows that

**Exercise.**
Prove the countably infinite case for law of total probability.

**Proof.**
Under the assumptions, are pairwise disjoint, and thus are also pairwise disjoint
(by observing that , and other intersections have similar results).
Also, since , the conditional probabilities are defined.
Moreover, since , we can observe that (through Venn diagram, informally).
It follows that

Now, suppose Amy has performed a COVID-19 test, and the result is positive! So now Amy is worrying about whether she really gets infected by COVID-19,
or it is just a *false positive*.
Therefore, she would like to know the conditional probability (the conditional probability for getting infected given testing positive).
Notice that the conditional probability may be known (based on some research).
However, it does not equal the conditional probability generally.
(These two probabilities are referring to two different things.)
So, now we are interested in knowing that whether there is formula that relates these two probabilities, which have somewhat "similar" expressions.
See the following exercise for deriving the relationship between and :

**Exercise.**
Let be a probability space, and be events.
Assume and .
Propose a relationship between and , and prove it.
(*Hint*: You may apply the definition of conditional probability on each of these two conditional probabilities. Do you notice any similarity in the expressions?)

**Proposition**: .

**Proof.**
Applying the definition of conditional probability, we have

*same numerator*. Thus, we have

**Remark.**

- is known as the
*prior probability*of event . It is called prior since this probability does not take into account the knowledge of the occurrence of event . It reflects our initial degrees of belief of event (according to the subjectivism interpretation of probability). - is known as the
*posterior probability*of event. It is called posterior since this probability takes into account the knowledge of the occurrence of event . It reflects our adjusted degrees of belief of event as there are more new information (the occurrence of event in this case) (according to the subjectivism interpretation of probability).

- This formula suggests us the way of adjusting/updating the prior probability to the posterior probability .
- Of course, after obtaining the posterior probability, there may be another round of new information available, and we may need to treat this "posterior probability" as the "prior probability", and update it to a "new version" of posterior probability.

The following theorem is the generalization of the above result.

**Theorem.**
(Bayes' theorem)

Let be a probability space.

- (finite case) Assume that are pairwise disjoint events such that and . Then,

- for every .

- (countably infinite case) Assume that are pairwise disjoint events such that and . Then,

- for every .

**Proof.**

**Finite case**:
Under the assumptions, we have by law of total probability

**Countably infinite case**:
Under the assumptions, we have by law of total probability

**Example.**
Suppose the weather at a certain day can either be sunny or rainy, with equal probability.
Amy has a probability of () to bring an umbrella at that day if the weather of that day is rainy (sunny).
At a day, we see that Amy brings an umbrella. Calculate the probability for that day to be rainy.

*Solution*.
Let be the events that the weather at that day is rainy, sunny and Amy brings an umbrella at that day respectively.
Then, the probability that Amy brings an umbrella at that day is

Given that Amy brings an umbrella at that day, the probability for that day to be rainy is

**Exercise.**
Assume that the weather can also be *cloudy*, such that the weather is twice as likely to be cloudy compared with sunny (or rainy).
(That is, the probability for the weather to be cloudy is twice of the probability for the weather to be sunny (which equals the probability for the weather to be rainy).)
Also, Amy has a probability to bring an umbrella at a day if the weather of that day is cloudy.
Calculate such that the probability for a day to be rainy if Amy brings an umbrella at that day is instead
of .
(Answer:)

Let be the event that the weather is cloudy. Since the weather is twice as likely to be cloudy compared with sunny (or rainy) and also the weather can either be sunny, rainy, or cloudy, it follows that

**Example.**
Consider *Morse code* where dots and dashes are used to encode messages.
Suppose Amy and Bob want to communicate through Morse code. However, when Amy sends a Morse code to Bob, there
is interference during the transmission, and hence there is a probability for a dot (or dash) to be
mistakenly received as a dash (or dot resp.).
It is known that the dots and dashes are used in proportion 3:4 in the Morse code.
From this, we may assume that the probability for a dot and dash to be sent is and respectively.
Calculate the probability that a dot is really sent by Amy if Bob receives a dot.

*Solution*.

Let and be the event that a dot is sent and a dot is received respectively.

Then, we have

**Example.**
Suppose a diagnosis for cancer is conducted for Bob. Let be the event that Bob has cancer,
and be the event that Bob is diagnosed with cancer.
Based on research, it is known that the probability for a correct diagnosis is 0.99.
That is,

*Solution*.
The desired probability is given by .
First,

**Remark.**

- This result is quite counter-intuitive, since it appears that the probability for a correct diagnosis is quite high, but it turns out that this probability is quite low!

- The main reason for this result is that the incidence rate of cancer is quite low.

**Exercise.**
Calculate the probability that Bob does not have cancer if he is not diagnosed with cancer. (Answer: approximately 0.99999)

The desired probability is given by . We have

The following is a famous problem.

**Example.**
(Monty Hall problem)

Suppose you are on a game show. There are three doors in front of you, labelled door 1,2,3, and you are allowed to open one of them. Among these three doors, a new car is behind one of the doors, and a goat is behind the other doors. If you open the door with a car behind it, then you will get the car. Otherwise, you get nothing.

Now, suppose you pick the door 1. Then the host Monty Hall, who knows what are behind the three doors, opens a door with a goat behind it (which is not chosen by you, of course). It turns out that Monty Hall opens door 3. After that, you are given an offer where you can keep choosing door 1, or switch your choice to door 2.

With these information, calculate the probability that the new car is behind door 2.

*Solution*.

Let be the event that a car is behind door 1,2,3 respectively. Also, let be the event
that Monty Hall opens *door 3*.
It should be reasonable to assume that a car is equally likely to be behind one of the doors. So, we have

- Case 1: The car is behind door 1. Then, Monty Hall can open door 2 and 3, and it is reasonable to assume that it is equally likely for him to open one of them. Hence, the conditional probability .
- Case 2: The car is behind door 2. Then, Monty Hall must open door 3 (since this is the only choice). Hence, the conditional probability .
- Case 3: The car is behind door 3. Then, Monty Hall can never open door 3 (where a car is behind it). Hence, the conditional probability .

The desired probability is given by

**Remark.**

- Clearly, since it is impossible for the car to be behind door 3 if Monty Hall opens door 3. It follows that . Thus, it is favourable for you to switch your choice to door 2.

**Exercise.**
1. A student thinks that the probability should be with the following argument:

After Monty Hall opens door 3, we know that a goat is behind it. This means a car is not behind door 3. Then, it is equally likely for the car to be behind one of door 1 and door 2. Hence, the probability is .

What is the mistake in this argument?

The mistake is that a wrong event is conditioned on for the above argument. To be more precise, through this reasoning the probability calculated is actually

2. Suppose there are more doors in addition to the doors 1,2,3, labelled door 4,5,...,, such that there are doors in total. In this case, Monty Hall opens doors where a goat is behind each of them. It turns out that Monty Hall opens doors 3,4,...,. Now, you are given an offer where you can keep choosing door 1, or switch your choice to door 2. Calculate the probability that the car is behind door 2. (Answer: )

Let be the event that a car is behind door 1,2,3,..., respectively. Also, let be the event
that Monty Hall opens *doors 3,4,...,*.
It should be reasonable to assume that a car is equally likely to be behind one of the doors. So, we have

- Case 1: The car is behind door 1. Monty can then open of the doors 2,3,..., (or equivalently, choose one of the doors 2,3,..., to
*not open*). We know that the event means Monty Hall chooses door 2 to*not open*. Thus, the probability is . - Case 2: The car is behind door 2. Monty Hall must then open of the doors 3,4,...,. Thus, the probability is .
- Case 3: The car is behind door 3. Then, Monty Hall must
*not*open door 3, and hence must not open doors 3,4,...,. Thus, the probability is

With similar arguments, we have

**Remark.**

- For other cases in which another door is picked, the same result holds by symmetry (notations can be changed).

**Example.**
(Gambler's ruin)
Consider a gambling game as follows. The player starts with capital
( is a nonnegative integer and is specified by the player), and tosses a fair coin repeatedly until he wants to stop.
If head comes up, then he gains . If tail comes up, then he loses .

Suppose Bob plays this game and decides to stop tossing when one of the following conditions are satisfied:

- He runs out of money. (That is, his capital is .)
- His capital is , where is a certain nonnegative integer specified by Bob.

Calculate the probability that Bob will eventually run out of all money if he starts with capital .

*Solution*.
In this case, we should calculate the probability using *recursive approach*.
First, let be the event that Bob will eventually run out of all money if he starts with capital .
Also, let be the event that head comes up in the first toss.

Then, a crucial observation is that if the first toss gives a head, then the situation is exactly the *same*
as that Bob *starts* playing the game with capital after the toss where he gains .
Hence, we have . Likewise, we have .
(This is often the most important step for solving a problem recursively: identifying the recursive relation given (usually implicitly) in the question.)

Then, we can establish the equation

Notice that the above equation implies that the decrement of the probability when increases by one is the same for every . Since there are in total decrements, it follows that every decrement is of value . Hence,

**Exercise.**
Suppose Bob is so greedy that . Under this case, what does the probability tend to (that is, what is the limit of
the above probability as )? Suggest an intuitive meaning of this result.

Since ,
it follows that the probability () tends to 1 as .
This means Bob will inevitably run out of all money if he is extremely greedy and never wants to stop at any fixed value of .
This result is known as the *Gambler's ruin*.

**Example.**
Player A and B take turn to roll a fair dice, where player A rolls the dice first.
A player wins if he is the first player that rolls a six.
Calculate the probability that player A wins.

*Solution*.
We divide the situation into three cases:

- Player A rolls a six in the first roll. (probability ) In this case, player A wins.
- Player A does not roll a six in the first roll, and then player B rolls a six in the second roll. (probability . This can be obtained by considering multiplication rule of probability. We can see that the probability for player B to roll a six in the second roll given that player A does not roll a six in the first roll is .) But in this case, the probability for player A to win is zero anyway.
- Player A and B does not roll a six in the first and second roll respectively. (probability )

Notice that in case 3, the situation is exactly the same as the original situation at the beginning where it is player A's turn to roll the dice. Hence, under the condition for case 3, the probability for player A to win is the same as the unconditional one.

Thus, we can write

**Remark.**

- Notice that eventually, one of the players must roll a six. Hence, it is impossible for the game to end up with a draw. This means the probability that player B wins is .

**Exercise.**
Suppose the dice is changed such that the probability for getting a six from rolling the dice is
(which is not necessarily ), while the player A still rolls the dice first.
Show that the game is always favourable to player A for every .

**Proof.**
When the probability is , we can write the above equation as

**Example.**
(Three Prisoners problem)
There are three prisoners, A, B and C, which will originally be executed tomorrow. However, the governor chooses one of them randomly
to be pardoned (and hence not executed). Only the warden knows who is to be pardoned, but is not allowed to tell the prisoners.

The three prisoners have also heard about this. So, prisoner A asks the warden

Which of the prisoners B and C will be executed? If both are to be executed, then just tell me randomly one of them.

The warden thinks a while, and tells the prisoner A that prisoner B is to be executed. The warden thinks that he does not give any information to prisoner A about whether he is to be pardoned or not, with the following reasoning:

Let be the events that A, B, or C is pardoned respectively. Also, let be the event that the warden says B will die. Since the prisoner to be pardoned is chosen randomly, we have

After prisoner A hears the answer from warden, the conditional probability for him to be pardoned isSo, this probability is the same as the unconditional one.

However, prisoner A thinks that his probability of being pardoned is increased to with the following reasoning:

Given that prisoner B will be executed, either prisoner A or C will be pardoned. Hence, my probability of being pardoned is .

Explain why the prisoner A's reasoning is incorrect.

*Solution*.
Prisoner A falsely interprets the warden's saying as the event , and he calculates

*not*the same as ! Particularly, if B is not to be pardoned (that is, to be executed), then the warden will not necessarily tell that prisoner B is to be executed, since it is possible that C is also to be pardoned, and the warden tells C is to be executed instead.

**Exercise.**
Suppose prisoner A tells the other two prisoners about the answer from the warden to his question.

(a) What is the probability that prisoner B is to be pardoned?

(b) What is the probability that prisoner C is to be pardoned?

Let us use the notations as defined by the warden's answer.

(a) The probability is

(b) The probability is

**Remark.**

- From this, we actually know that the warden can give information to the prisoners about whether they are pardoned or not through his answer, if any of the prisoners B and C knows about this.

## Independence[edit | edit source]

From the previous discussion, we know that the conditional probability of event given the occurrence of event can be interpreted as the probability of where the sample space is updated to event . In general, through this update, the probability of should be affected. But what if the probability is somehow the same as the one before the update?

If this is the case, then the occurrence of a particular event does not affect the probability of event actually. Symbolically, it means . If this holds, then we have . This means the occurrence of event also does not affect the probability of event . This result matches with our intuitive interpretation of the independence of two events, so it seems that it is quite reasonable to define the independence of events and as follows:

Two events and are independent if or .

However, this definition has some slight issues. If , then some of the conditional probabilities involved may be *undefined*. So, for some events, we may not be able to tell whether they are independent or not using this "definition".
To deal with this, we consider an alternative definition, that is equivalent to the above when and .
To motivate that definition,
we can see that ,
when both and are nonzero.
This results in the following definition:

**Definition.**
(Independence of two events)
Let be a probability space, and be two events.
Then, the events and are *independent* if .

**Remark.**

- We sometimes write when and are independent.

But what if there are more than two events involved? Intuitively, we may consider the following as the general "definition" of independence:

The events are independent if . ( is an integer that is at least 2.)

But we will get some strange results by using this as the "definition":

**Example.**
Suppose we roll a dice twice.
Let be the event that both rolls result in the same number,
be the event that sum of the numbers obtained is between 7 and 10,
be the event that the sum is 7 or 8 or 12.
Show that (that is, the events are "independent" according to the above "definition")
but (that is, the events and are not independent).

**Proof.**
First, we have , , and .
Hence, we have

From this example, merely requiring
cannot ensure all *pairs* of events involved are independent.
So, this suggests us another definition:

The events are independent if all the pairs of events involved are independent.

However, we will again get some strange results by using this as the "definition":

**Example.**
Consider two balls, where one is bigger than another. Both balls are equally likely to be red or blue.
Let be the event that the bigger ball is red, be the event that the smaller ball is red,
and be the event that both balls have the same color.

Show that all the pairs of events involved are independent, but (that is, the events and are not independent.)

**Proof.**
Consider the following tables containing relevant probabilities:

However, .
(Actually, we have . This means given the occurrence of both events and , we know
for sure that the event occurs.
So, the knowledge of the occurrence of *both* events and affect the probability of event ,
although the probability of event is not affected by merely the occurrence of or merely the occurrence of .
)

The above examples suggest us that we actually need *both* requirements

- . (This ensures . That is, if we condition on the intersection of two events, the probability of another one is not affected.)
- All the pairs of events are independent. (That is, .

in the definition of independence of events for the definition to "make sense".

Similarly, the independence of four events should require

- .
- All the triples of events are independent.

- In other words, we need the probability of intersection of any three and any two events to be able to "split as product of probabilities of single events" as in above.

This leads us to the following general definition:

**Definition.**
(Independence)
Let be a probability space, and be events.
Then, the events are (mutually) *independent* if

*for every finite index set*.

**Remark.**

- This definition also applies to finitely many events , where the index set is a subset of instead.
- If all the pairs of events involved are independent, then we say that are
*pairwise independent*. By definition, the mutual independence of implies their pairwise independence (we just consider the index set with two elements in this case). But, the converse does not hold (discussed in a previous example). - Sometimes, based on the situation, two events may be
*assumed*to be independent.

**Example.**
Suppose Amy and Bob take turn to toss a fair coin, where Amy toss the coin first.
Let and be the events that head comes up in the first and second toss respectively.
Calculate the probability .

*Solution*.
It is reasonable to assume that and are independent. Hence,

**Example.**
Suppose a fair coin is tossed 10 times independently. Which of the following outcomes is more likely? (H represents head and T represents tail)

- HHHHHHHHHT
- HTTHTHHTHT

*Solution*.
The probability for each of these outcomes is .
So, they are equally likely.

**Remark.**

- One may falsely think that the second outcome is more likely, since it seems the heads and tails are distributed more "equally". The reasoning behind this may be one falsely interpret the second outcome to be the same as "5 heads and 5 tails" (without specifying the order). But this is not the case: the second one specifies the outcome required in every toss.

**Exercise.**
Calculate the probability that 5 heads and 5 tails are obtained in these 10 tosses. (Answer: approximately 0.246)
(*Hint*: consider the number of different arrangements of 5 heads and 5 tails in these 10 tosses.)

We can regard the 10 tosses as 10 distinguishable cells with capacity one, and 5 "heads" to be 5 indistinguishable balls. Then, the process of arranging the outcomes is equivalent to placing the 5 balls into the 10 cells (the 5 empty cells are for "tails"). Hence, the number of arrangements is given by . Notice that every such arrangement is a sample point in the sample space (for the experiment of tossing a coin 10 times), and the probability of every singleton event is . It follows that the desired probability is

**Example.**
Let be a probability space, and be events.
Show that if the events and are independent, then so is the events and .

**Proof.**
Assume that the events and are independent.
We wish to show that .

We have

**Exercise.**
Show that if the events and are independent,

(a) then so is the events and .

(b) then so is the events and .

**Proof.**
Assume that the events and are independent.
Then,

**Remark.**

- Actually, this result is symmetric to the one in the example above.

**Proof.**
Assume that the events and are independent.
Then, by (a) in this exercise, and are independent.
It follows by the result in the example above that and are independent. (Take "" in the example above to be .)

In general, we have the following result.

**Proposition.**
Let be a probability space, and be events (where is an arbitrary integer such that ).
If are independent, then after replacing some of the events by their respective complements, the independence still holds.

**Proof.**
Assume that the events are independent.
Then, we have

Notice that we can split the process of replacing *some* of the events into multiple steps,
where we replace one event in every step (that is, we replace the events one by one).
Now, we can apply the above argument in each step to ensure that the resulting events are still independent.
Thus, after all steps, and we finish the replacement, the independence still holds.

**Example.**
Let be a probability space, and be events such that
and .

(a) Show that for every event , and