# Introduction

## Overview

Probability theory provides a mathematical model for the study of randomness and uncertainty. Many important decisions, whether in business, government, science, recreation, or even one's personal life, must be made with incomplete information or some degree of uncertainty. Hence, a formalized study of uncertain or random outcomes occupies an important role in modern society. In situations where one of any number of possible outcomes may occur, the mathematical model of probability theory offers methods for quantifying the likelihoods associated with those outcomes. Probability also provides tools which allow us to move beyond simply describing the information contained within a set of data (descriptive statistics) to actually inferring further information from that data (inferential statistics). Many of the early attempts to model likelihood arose from games of chance. For a brief history of probability, see the Wikipedia article on the history of probability.

Although probability theory is now a very formal branch of mathematics, the language of probability is often used informally in everyday speech. We express our beliefs about likelihoods of outcomes in situations involving uncertainty using intuition guided by our experiences and in some cases statistics. Consider the following examples:

• Bill says "Don't buy the avocados here; about half the time, they're rotten". Bill is expressing his belief about the probability of an event — that an avocado will be rotten — based on his personal experience.
• Lisa says "I am 95% certain the capital of Spain is Barcelona". Here, the belief Lisa is expressing is only a probability from her point of view, because she does not know that the capital of Spain is Madrid (from our point of view, who know the answer, the probability that her statement is true is 0%). However, we can still view this as a subjective probability because it expresses a measure of uncertainty. It is as though Lisa is saying "in 95% of cases where I feel as sure as I do about this, I turn out to be right".
• Susan says "There is a lower chance of being shot in Omaha than in Detroit". Susan is expressing a belief based (presumably) on statistics.
• Dr. Smith says to Christina, "There is a 75% chance that you will live." Dr. Smith is basing this on his research.
• Nicolas says "It will probably rain tomorrow." In this case the likelihood that it will rain is expressed in vague terms and is subjective, but implies that the speaker believes it is greater than ${\displaystyle {\frac {1}{2}}}$ (or 50%). Subjective probabilities have been extensively studied, especially with regard to gambling and securities markets. While this type of probability is important, it is not the subject of this book. A good reference is "Degrees of Belief" by Steven Vick (2002).

Notice that in the previous examples the likelihood of any particular outcome is expressed as a percentage (between 0% and 100%), as is common in everyday language. However, probabilities in formal probability theory are always expressed as real numbers in the interval ${\displaystyle [0,1]}$ (e.g. a probability of .25 may be expressed as 25%, or a probability of ${\displaystyle {\frac {1}{\pi }}}$ may be expressed as approximately 31.83%). Other differences exist between common expressions of probabilities and formal probability theory. For example, a probability of 0% is typically taken to mean that the event to which that probability is assigned is impossible. However, in probability theory (usually in cases where there are infinitely many possible outcomes) an event ascribed a probability of zero may actually occur. In some situations, it is certain that such an event will occur (e.g. in selecting a real number between 0 and 1, the probability of selecting any given number is zero, but it is certain that one such number will be selected).

Another way to express the probability of an outcome is by its odds: the ratio of the probability of "success" (event occurs) to the probability of "failure" (event does not occur). In gambling, odds are expressed as the ratio of the stakes risked by each participant in a wager. For instance, a bookmaker offering odds of 3 to 1 "against" a horse will pay a punter three times their stake if the horse wins. In fact, the bookmaker (ignoring factors such as his potential need to "lay off" bets which are exposing him to the possibility of an unacceptable overall loss) is announcing that he thinks the horse has a ${\displaystyle {\frac {1}{4}}}$ chance of winning. If we express odds as "chance of winning" : "chance of not winning", then 3 to 1 against would be represented as ${\displaystyle 1:3={\frac {1}{4}}:{\frac {3}{4}}}$, i.e. odds of ${\displaystyle {\frac {1}{3}}}$. So an event with a probability of ${\displaystyle {\frac {1}{4}}}$ or 25% has odds of ${\displaystyle {\frac {1}{3}}}$, or about 33%. This disparity is even more clear where an event has a probability of 50%: the odds of a coin showing heads are 50%:50% = 1:1, i.e. 1, even though the probability is ${\displaystyle {\frac {1}{2}}}$.
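To make the conversion between probability and odds concrete, here is a minimal Python sketch (the function names are illustrative, not from any library):

```python
def prob_to_odds(p):
    """Convert a probability p to odds in favour (success : failure)."""
    return p / (1 - p)

def odds_to_prob(odds):
    """Convert odds in favour back to a probability."""
    return odds / (1 + odds)

# "3 to 1 against" means odds in favour of 1:3, i.e. 1/3.
p = odds_to_prob(1 / 3)   # probability 0.25
o = prob_to_odds(0.5)     # a fair coin: odds 1:1, i.e. 1.0
```

Note that odds grow without bound as the probability approaches 1, which is why odds and probability figures should never be read interchangeably.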

## Types of probability

As mentioned earlier, probability can be expressed informally in a variety of different ways, but even formal definitions and approaches vary. The most general and rigorous approach is known as axiomatic probability theory, which will be the focus of later chapters. Here we briefly discuss a few other approaches, their uses and limitations. All of these approaches rely in one way or another on the concept of an experiment. Recall that probability provides means to study randomness and uncertainty.

An experiment is any action or process whose outcome is subject to uncertainty or randomness.

Here the term experiment is used in a wider sense than its usual connotation with controlled laboratory situations. Further clarification on experiments will be given later, but for now the following examples of experiments will suffice:

• observing whether or not a commercial product is defective.
• tossing a coin one or more times or selecting a card from a card deck.
• conducting a survey.
• measuring the wind speed or rainfall in a particular area.

Assuming that an experiment can be repeated under identical conditions, then each repetition of an experiment is called a trial.

### Basic Concepts

There are two standard approaches to conceptually interpreting probabilities: the relative frequency approach and the subjective belief (or confidence) approach. In the Frequency Theory of Probability, probability is the limit of the relative frequency with which certain outcomes occur in repeated trials (note that the outcome of any single trial cannot depend on the outcome of other trials). The relative frequency approach requires that experiments be random and that all possible outcomes be known before execution of the experiment. The probability of any set of outcomes is expressed as the relative frequency with which those outcomes will occur among many repeated trials.

Physical probabilities fall within the category of objective or frequency probabilities, and are associated with random physical systems such as roulette wheels, rolling dice and radioactive atoms. In such systems, a given outcome (such as a die yielding a six) tends to occur at a persistent rate, or 'relative frequency', in a long run of trials. Physical probabilities either explain, or are invoked to explain these stable frequencies.

Relative frequency probabilities are always expressed as a figure between 0% (the outcome essentially never happens) and 100% (the outcome essentially always happens), or similarly as a figure between 0 and 1. According to the Frequency Theory of Probability, saying that "the probability that A occurs is p%" means that if you repeat the experiment many times under essentially identical conditions, the percentage of time for which A occurs will converge to p. For example, a 50% chance that a coin lands "heads up" means that if you toss the coin over and over again, then the ratio of times the coin lands heads to the total number of tosses approaches a limiting value of 50% as the number of tosses grows. Notice that the outcome of one toss never depends on another toss, and that the ratio of heads to total number of tosses is always between 0% and 100%.
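The frequency interpretation is easy to simulate. A short Python sketch, modelling a fair coin with the standard `random` module:

```python
import random

random.seed(1)  # fixed seed so repeated runs give the same result

def relative_frequency(n_tosses):
    """Toss a fair coin n_tosses times and return the fraction of heads."""
    heads = sum(random.random() < 0.5 for _ in range(n_tosses))
    return heads / n_tosses

# The relative frequency is always between 0 and 1, and for a large
# number of tosses it settles near the underlying probability 0.5.
print(relative_frequency(100_000))
```

Running this for ever larger values of `n_tosses` illustrates the limiting behaviour the Frequency Theory describes.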

In the Subjective Theory of Probability, probability measures the speaker's "degree of belief" that a set of outcomes will result, on a scale of 0% (complete disbelief that the event will happen) to 100% (certainty that the event will happen). According to the Subjective Theory, saying that "the probability that A occurs is ${\displaystyle {\frac {2}{3}}}$ " means that I believe that A will happen twice as strongly as I believe that A will not happen. The Subjective Theory is particularly useful in assigning meaning to the probability of outcomes that in principle can occur only once. For example, how might one assign meaning to the following statement: "there is a 25% chance of an earthquake on the San Andreas fault with magnitude 8 or larger before 2050"? It would be very hard to quantify this measure in terms of relative frequency.

One way to represent an individual's degree of belief in a statement, given available evidence, is with the Bayesian approach. Evidential probability, also called Bayesian probability, can be assigned to any statement whatsoever, even when no random process is involved. On most accounts evidential probabilities are considered degrees of belief, defined in terms of dispositions to gamble at certain odds. The primary evidential interpretations include the classical interpretation, the subjective interpretation, the epistemic or inductive interpretation, and the logical interpretation.

The next several sections discuss the principal theories within the relative frequency approach to probability.

### Classical theory of probability

The classical approach to probability expresses probability as a ratio of the number of favorable outcomes in a series of successive trials to the number of total possible outcomes. Note the immediate implication that the number of total possible outcomes be known. Furthermore, all possible outcomes are assumed to be equally probable and no two possible outcomes can both result from the same trial. Here, the term "favorable" is not subjective, but rather indicates that an outcome belongs to a group of outcomes of interest. This group of outcomes is called an event, which will be formalized with the introduction of axiomatic probability theory.

Classical definition of probability: If the number of outcomes belonging to an event ${\displaystyle E}$ is ${\displaystyle N_{E}}$ , and the total number of outcomes is ${\displaystyle N}$ , then the probability of event ${\displaystyle E}$ is defined as ${\displaystyle p_{E}={\frac {N_{E}}{N}}}$ .

For example, a standard deck of cards (without jokers) has 52 cards. If we randomly draw a card from the deck, we can think of each card as a possible outcome. Therefore, there are 52 total outcomes. We can now look at various events and calculate their probabilities:

• Out of the 52 cards, there are 13 clubs. Therefore, if the event of interest is drawing a club, there are 13 favorable outcomes, and the probability of this event is ${\displaystyle {\frac {13}{52}}={\frac {1}{4}}}$ .
• There are 4 kings (one of each suit). The probability of drawing a king is ${\displaystyle {\frac {4}{52}}={\frac {1}{13}}}$ .
• What is the probability of drawing a king OR a club? This example is slightly more complicated. We cannot simply add together the number of outcomes for each event separately (${\displaystyle 4+13=17}$) as this inadvertently counts one of the outcomes twice (the king of clubs). The correct answer is ${\displaystyle {\frac {16}{52}}}$ from ${\displaystyle {\frac {13}{52}}+{\frac {4}{52}}-{\frac {1}{52}}}$ where this is essentially ${\displaystyle p({\text{club}})+p({\text{king}})-p({\text{king of clubs}})}$ .
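The card probabilities above, including the inclusion-exclusion correction for the king of clubs, can be checked by brute-force enumeration. A small Python sketch:

```python
# Enumerate a standard 52-card deck as (rank, suit) pairs.
ranks = ['A', '2', '3', '4', '5', '6', '7', '8', '9', '10', 'J', 'Q', 'K']
suits = ['clubs', 'diamonds', 'hearts', 'spades']
deck = [(r, s) for r in ranks for s in suits]

# Count favorable outcomes directly for each event.
clubs = [c for c in deck if c[1] == 'clubs']
kings = [c for c in deck if c[0] == 'K']
king_or_club = [c for c in deck if c[0] == 'K' or c[1] == 'clubs']

p_club = len(clubs) / len(deck)                   # 13/52 = 1/4
p_king = len(kings) / len(deck)                   # 4/52 = 1/13
p_king_or_club = len(king_or_club) / len(deck)    # 16/52, not 17/52
```

The enumeration counts the king of clubs exactly once, which is why the combined event has 16 outcomes rather than 17.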

Classical probability suffers from a serious limitation. The definition of probability implicitly defines all outcomes to be equiprobable. While this might be useful for drawing cards, rolling dice, or pulling balls from urns, it offers no method for dealing with outcomes with unequal probabilities.

This limitation can even lead to mistaken statements about probabilities. An often given example goes like this:

I could be hit by a meteor tomorrow. There are two possible outcomes: I will be hit, or I will not be hit. Therefore, the probability I will be hit by a meteor tomorrow is ${\displaystyle {\frac {1}{2}}=50\%}$ .

Of course, the problem here is not with the classical theory, merely the attempted application of the theory to a situation to which it is not well adapted.

This limitation does not, however, mean that the classical theory of probability is useless. At many points in the development of the axiomatic approach to probability, classical theory is an important guiding factor.

### Empirical or Statistical Probability or Frequency of occurrence

This approach to probability is well-suited to a wide range of scientific disciplines. It is based on the idea that the underlying probability of an event can be measured by repeated trials.

Empirical or statistical probability as a measure of frequency: Let ${\displaystyle n_{A}}$ be the number of times event ${\displaystyle A}$ occurs after ${\displaystyle n}$ trials. We define the probability of event ${\displaystyle A}$ as ${\displaystyle p_{A}=\lim _{n\to \infty }{\frac {n_{A}}{n}}}$

It is of course impossible to conduct an infinite number of trials. However, it usually suffices to conduct a large number of trials, where the standard of large depends on the probability being measured and how accurate a measurement we need.

A note on this definition of probability: How do we know the sequence ${\displaystyle {\frac {n_{A}}{n}}}$ in the limit will converge to the same result every time, or that it will converge at all? The unfortunate answer is that we don't. To see this, consider an experiment consisting of flipping a coin an infinite number of times. We are interested in the probability of heads coming up. Imagine the result is the following sequence:

HTHHTTHHHHTTTTHHHHHHHHTTTTTTTTHHHHHHHHHHHHHHHHTTTTTTTTTTTTTTTT...

with each run of ${\displaystyle k}$ heads and ${\displaystyle k}$ tails being followed by another run twice as long. For this example, the sequence ${\displaystyle {\frac {n_{A}}{n}}}$ oscillates between ${\displaystyle {\frac {1}{2}}}$ (at the end of each run of tails) and roughly ${\displaystyle {\frac {2}{3}}}$ (at the end of each run of heads) and doesn't converge.

We might expect such sequences to be unlikely, and we would be right. It will be shown later that the probability of such a sequence is 0, as is the probability of any sequence whose relative frequency converges to anything other than the underlying probability of the event. However, such examples make it clear that the limit in the definition above does not express convergence in the more familiar sense, but rather some kind of convergence in probability. The problem of formulating exactly what this means belongs to axiomatic probability theory.
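The run-doubling sequence above can be generated programmatically. A Python sketch of its partial head-ratios, using exact arithmetic via `fractions`:

```python
from fractions import Fraction

def head_ratios(num_runs):
    """Build runs of H's and T's of lengths 1,1,2,2,4,4,... and record
    the fraction of heads at the end of each run."""
    heads = total = 0
    ratios = []
    run_length = 1
    for i in range(num_runs):
        if i % 2 == 0:            # a run of heads
            heads += run_length
        total += run_length
        ratios.append(Fraction(heads, total))
        if i % 2 == 1:            # double the length after each H,T pair
            run_length *= 2
    return ratios

ratios = head_ratios(40)
# Even late in the sequence, the ratio keeps swinging: it returns to
# exactly 1/2 after every run of tails and climbs back toward 2/3
# after every run of heads, so it never converges.
```

Printing `ratios` shows the oscillation directly, which is the point of the counterexample: the limit in the empirical definition need not exist for every individual sequence.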

### Axiomatic probability theory

Although axiomatic probability theory is often frightening to beginners, it is the most general approach to probability and has been employed in tackling some of the more difficult problems in probability. It begins with a set of axioms which, although not immediately intuitive, are guided by the more familiar classical probability theory. These axioms are discussed in the (as yet unwritten) following chapter.

This book discusses mathematical probability using calculus and linear algebra. Readers should have a good understanding of both topics before attempting to read this book in its entirety.

# The Counting Principle

## The Counting Principle

Before we can delve into the properties of probability and odds, we need to understand the Counting Principle. We use the Counting Principle to determine how many different ways one can choose/do certain events. It is easier to define the Principle through examples:

### Independent Events

Let's say that John is at a deli sub restaurant. There are 3 different breads, 3 different cheeses, 3 different condiments, and 3 different vegetables he can place on his sandwich, assuming that he can only place one from each category on to his sandwich. How many different ways can he set up his sandwich?

Since choosing a cheese doesn't affect the number of choices of vegetables, condiments, or bread, these events are called independent events. For this problem, we will multiply ${\displaystyle 3\times 3\times 3\times 3}$ , so ${\displaystyle 3^{4}}$ , which is 81. So there are 81 different possible combinations to form this sandwich.
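The multiplication of independent choices can be verified by enumerating all combinations. A Python sketch with hypothetical menu items (the specific names are made up for illustration):

```python
from itertools import product

# One choice from each of four categories, three options per category.
breads = ['white', 'wheat', 'rye']
cheeses = ['swiss', 'cheddar', 'provolone']
condiments = ['mayo', 'mustard', 'oil']
vegetables = ['lettuce', 'tomato', 'onion']

# product() yields every combination of one item per category.
sandwiches = list(product(breads, cheeses, condiments, vegetables))
print(len(sandwiches))  # 3 * 3 * 3 * 3 = 81
```

Because the choices are independent, the enumeration count always equals the product of the category sizes.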

#### Practice Problems

1) Thomas goes to a McDonald's restaurant and chooses to create his own burger. He can choose from 2 different kinds of cheese, 3 different breads, and 3 different sauces, but he can only choose one from each category. How many different ways can he create this burger?

2) Diane is ordering pizza for her family. There are 4 different possible sizes of the pizza. Also, she has to choose one of 5 toppings to place on the pizza and one of 3 different types of cheese for the pizza. In addition, she must choose one of 3 different kinds of crust. How many different ways can she have her pizza?

3)

a) How many 3-digit numbers can be formed from the digits 2, 3, 4, 5, 7 and 9?
b) How many of these numbers are less than 400?

Solutions:

1) ${\displaystyle (2)(3)(3)=18}$

2) ${\displaystyle (4)(5)(3)(3)=180}$

3)

a) Since there are six available digits, the answer is ${\displaystyle (6)(6)(6)=216}$
b) For the value of the 3-digit number to be less than 400, we only have two choices for the first digit, namely 2 or 3. After that we can choose the other two digits freely. The answer is thus ${\displaystyle (2)(6)(6)=72}$.

### Dependent Events

Assume that John is now working at a library. He must put 5 books on a shelf in any order. How many different ways can he order the books on the shelf? Unlike the independent events, when John puts a book on the shelf, that eliminates one book from the remaining choices of books to put next on the shelf; thus these are referred to as dependent events. At first, he has 5 different choices, so our first number in the multiplication problem will be 5. Now that one is missing, the number is reduced to 4. Next, it's down to 3, and so on. So, the problem will be

${\displaystyle (5)(4)(3)(2)(1)}$

However, there is a notation for this very idea. The symbol ${\displaystyle !}$ denotes a factorial. So, for example, ${\displaystyle 3!=(3)(2)(1)}$ . Factorials are very useful in statistics and probability.

Therefore, the problem can be rewritten as ${\displaystyle 5!}$ , which ends up being equal to 120.

However, not all dependent event problems are that simple. Let's say that there are 10 dogs at a dog competition. How many different ways can one select the Champion AND the Runner-Up? This problem could actually be considered simpler than the last one, but it doesn't involve a factorial. So, how many different ways can the judge determine the Champion? Of course, there are 10 different dogs, so 10 different ways. Next, how many dogs are left to select the Runner-Up from? Well, you removed one, so it's down to 9. Instead of placing a factorial, though, you will only multiply 10 and 9, resulting in 90.
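Both dependent-event counts above can be checked in Python; `math.perm` counts ordered selections directly:

```python
import math
from itertools import permutations

# All orderings of 5 books on a shelf: 5! = 120.
books = ['A', 'B', 'C', 'D', 'E']
orderings = list(permutations(books))

# Champion and runner-up from 10 dogs: an ordered selection of 2,
# i.e. 10 choices for first place times 9 for second.
podiums = math.perm(10, 2)   # 10 * 9 = 90
```

`math.perm(n, k)` computes ${\displaystyle n!/(n-k)!}$ without enumerating anything, which matters once the counts grow large.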

### Independent Or Dependent?

To help you differentiate between the two, we will do a few more examples, but we will have to decide if it is dependent or independent before solving the problem.

Let's say that you are creating a 5-digit garage door opener code (the digits would include 0-9). If there were absolutely no restrictions, would the digit choices be independent of each other or dependent on each other? Since there are no restrictions, they are independent (you could even have five 4's for the code), so to solve the problem, you multiply 10 by itself 5 times, resulting in 100000.

Alternatively, suppose that the first number of the code cannot be 0, and that there can be no repeated numbers whatsoever. Obviously these events are dependent on each other, because there cannot be any repetitions. Let's look at this problem one number at a time.

The first number can be all the numbers except 0, reducing the possible numbers to 9. The second number can be 0 this time, so the possible numbers returns to 10. However, it cannot be a repeat of the previous number, so there are 9 possible choices again. After that, the numbers will reduce by one each time, due to the fact that there cannot be repeats, so the problem would look like this

${\displaystyle (9)(9)(8)(7)(6)={\frac {(9)(9!)}{(5!)}}=27216}$
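The count above can be verified by brute force. A Python sketch that enumerates every ordering of five distinct digits and keeps those with a nonzero first digit:

```python
from itertools import permutations

# All 5-digit codes with no repeated digits and a nonzero first digit.
valid = [p for p in permutations(range(10), 5) if p[0] != 0]
print(len(valid))  # 9 * 9 * 8 * 7 * 6 = 27216
```

Of the ${\displaystyle 10\cdot 9\cdot 8\cdot 7\cdot 6=30240}$ repeat-free orderings, the enumeration discards exactly those starting with 0, leaving 27216.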

Now, just one more example. Let's say that you were choosing your schedule for school. There are 8 periods each day, and there are 7 classes to choose from. However, you must have a lunch period during the 4th period. We can think of the 4th period as nonexistent, because it is in a fixed position and therefore does not affect the possible choices. With 7 slots and 7 classes, the answer is simply ${\displaystyle 7!}$ .

${\displaystyle (7!)=5040}$

### Review Of The Counting Principle

So, we use the Counting Principle to determine the number of different unique ways we can do something, such as making a sandwich or choosing a selection of classes. Sometimes these events affect each other, such as when you cannot choose the same number twice for your garage door code; these are dependent events. Other times, one event has no effect on the next, such as when you choose different cheeses and breads for your sandwich; these are independent events. The Counting Principle is a fundamental mathematical idea and an essential part of probability.

## Counting Rules

Rule 1: If any one of ${\displaystyle k}$ mutually exclusive and exhaustive events can occur on each of ${\displaystyle n}$ trials, there are ${\displaystyle k^{n}}$ different sequences that may result from a set of such trials.

• Example: The number of possible sequences resulting from flipping a coin three times is ${\displaystyle k^{n}=2^{3}=8}$.

Rule 2: If ${\displaystyle k_{1},\dots ,k_{n}}$ are the numbers of distinct events that can occur on trials ${\displaystyle 1,\dots ,n}$ in a series, the number of different sequences of ${\displaystyle n}$ events that can occur is ${\displaystyle k_{1}\times \cdots \times k_{n}}$ .

• Example: The number of possible sequences resulting from flipping a coin and rolling a six-faced die is ${\displaystyle (k_{1})(k_{2})=(2)(6)=12}$.

Rule 3: The number of different ways that ${\displaystyle n}$ distinguishable things may be arranged in order is ${\displaystyle n!=(1)(2)(3)\times \cdots \times (n-1)(n)}$, in which ${\displaystyle 0!{\overset {\text{def}}{=}}1}$. An arrangement in order is called a permutation. The total number of permutations of ${\displaystyle n}$ objects is ${\displaystyle n!}$ (the symbol ${\displaystyle n!}$ is called n-factorial).

• Example: Number of possible ways to arrange 10 items in order is ${\displaystyle 10!=10\cdot 9\cdot 8\cdot 7\cdot 6\cdot 5\cdot 4\cdot 3\cdot 2\cdot 1=3,628,800}$

Rule 4: The number of ways of selecting and arranging ${\displaystyle k}$ distinguishable objects from among ${\displaystyle n}$ distinct objects is: ${\displaystyle {\frac {n!}{(n-k)!}}}$ , or as seen on calculators, [nPr].

• Example: Number of possible ways to pick 3 things from 10 items, and arrange them in order is ${\displaystyle {\frac {10!}{(10-3)!}}={\frac {10!}{7!}}=720}$.

Rule 5: The total number of ways of selecting ${\displaystyle k}$ distinct combinations of ${\displaystyle n}$ distinguishable objects, irrespective of order (i.e. order is NOT important), is: ${\displaystyle {\binom {n}{k}}={\frac {n!}{k!(n-k)!}}}$ or as seen on calculators, [nCr].

• Example: Number of possible ways to pick 3 items from 10 items irrespective of order is ${\displaystyle {\binom {10}{3}}={\frac {10!}{3!(7!)}}={\frac {720}{6}}=120}$.
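All five counting rules, applied to the examples above, can be checked with Python's `math` module:

```python
import math

# Rule 1: k^n sequences from n trials with k outcomes each.
three_flips = 2 ** 3                  # 8 sequences of three coin flips

# Rule 2: product of per-trial counts.
coin_and_die = 2 * 6                  # 12 sequences

# Rule 3: permutations of n distinguishable items.
ten_in_order = math.factorial(10)     # 3,628,800

# Rule 4: ordered selections of k from n (nPr).
pick3_ordered = math.perm(10, 3)      # 10!/7! = 720

# Rule 5: unordered selections of k from n (nCr).
pick3_unordered = math.comb(10, 3)    # 10!/(3! 7!) = 120
```

Note the relationship between the last two rules: each unordered selection of 3 items corresponds to ${\displaystyle 3!=6}$ ordered ones, so ${\displaystyle 720/6=120}$.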

# Probability Spaces

## Concept

We will now proceed to develop a more axiomatic theory of probability, allowing for a simpler mathematical formalism. We shall proceed by developing the concept of a probability space, which will allow us to harness many theorems in mathematical analysis.

Recall that an experiment is any action or process with an outcome that is subject to uncertainty or randomness. A probability space or a probability triple is a mathematical construct that models an experiment and its set of possible outcomes.

## Probability space

Before defining probability space, we define several terms used in its definition.

Definition. (Sample space) The sample space, denoted by ${\displaystyle \Omega }$, is the non-empty set whose elements are all possible outcomes of an experiment.

Remark.

• the sample space is often not unique, since there are often multiple ways to describe the possible outcomes of an experiment
• alternative notations of sample space include ${\displaystyle S}$ and ${\displaystyle U}$ (${\displaystyle U}$ stands for 'universal set')
• an outcome from the experiment is commonly denoted by ${\displaystyle \omega }$ (small letter of ${\displaystyle \Omega }$, omega)

Example. A sample space of the numbers coming up from rolling a six-faced dice is ${\displaystyle \Omega =\{1,2,3,4,5,6\}}$.

Definition. (Event) An event is a subset of the sample space.

Remark.

• it follows that the event space is ${\displaystyle {\mathcal {F}}={\mathcal {P}}(\Omega )}$ (the power set of the sample space)
• an event consisting of a single outcome (a singleton) is sometimes referred to as a simple event, and an event consisting of more than one outcome as a compound event
• an event is said to have happened or occurred if the outcome of the experiment is an element of the event

Example. ${\displaystyle \varnothing ,\{1,2,3\},\Omega }$ are events from rolling a six-faced dice, while ${\displaystyle \{0\}}$ is not.

Definition. (Probability space) A probability space is a mathematical triplet ${\displaystyle (\Omega ,{\mathcal {F}},\mathbb {P} )}$ consisting of the sample space ${\displaystyle \Omega }$, a set of events (or event space) ${\displaystyle {\mathcal {F}}}$, and a probability function ${\displaystyle \mathbb {P} }$.

Remark.

• there are multiple ways to define the probability function, as we will see in the following sections; among those definitions, the axiomatic definition is the most commonly used and the most general
• the probability function is sometimes denoted by ${\displaystyle \Pr ,P}$ or ${\displaystyle p}$ instead
• the notation ${\displaystyle \mathbb {P} }$ is mainly used in this book to distinguish the probability function from other functions named ${\displaystyle P}$ or ${\displaystyle p}$
• a probability space is arbitrary, in the sense that its author ultimately defines which elements ${\displaystyle \Omega }$, ${\displaystyle {\mathcal {F}}}$, and ${\displaystyle \mathbb {P} }$ will contain
• the probability function ${\displaystyle \mathbb {P} }$ may present a model for a particular class of real-world situations

## Terminology

Terminology from set theory also applies to events, since an event is essentially a set. In addition, we have the following extra terminology for events.

Definition. (Exhaustive) Events ${\displaystyle E_{1},\dotsc ,E_{n}}$ are exhaustive if ${\displaystyle E_{1}\cup \dotsb \cup E_{n}=\Omega }$.

Example. When we are rolling a six-faced dice, the events ${\displaystyle \{1,2,3,4\}}$ and ${\displaystyle \{3,4,5,6\}}$ are exhaustive, while the events ${\displaystyle \varnothing }$ and ${\displaystyle \{1,2,3,4,5\}}$ are not exhaustive.

Definition. (Partition) A group of events ${\displaystyle E_{1},\dotsc ,E_{n}}$ is a partition of ${\displaystyle \Omega }$ if the events are both disjoint and exhaustive.

Example. When we are rolling a six-faced dice, the group of events ${\displaystyle \varnothing }$ and ${\displaystyle \Omega }$ is a partition, while the group of events ${\displaystyle \{1,2,3,4\}}$ and ${\displaystyle \{3,4,5,6\}}$ is not a partition, since these events are not disjoint.
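Since events are just sets, the definitions of exhaustive and partition can be sketched with Python sets (the function names are illustrative):

```python
omega = {1, 2, 3, 4, 5, 6}          # sample space for one die roll

def exhaustive(events):
    """Events are exhaustive if their union is the whole sample space."""
    return set().union(*events) == omega

def is_partition(events):
    """A partition is disjoint and exhaustive; the events are pairwise
    disjoint exactly when their sizes sum to the size of their union."""
    disjoint = sum(len(e) for e in events) == len(set().union(*events))
    return disjoint and exhaustive(events)

print(exhaustive([{1, 2, 3, 4}, {3, 4, 5, 6}]))      # True
print(is_partition([{1, 2, 3, 4}, {3, 4, 5, 6}]))    # False: not disjoint
print(is_partition([{1, 2}, {3, 4}, {5, 6}]))        # True
```

The size-sum test for disjointness works because any shared outcome would be counted twice in the sum but only once in the union.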

## Probability definition

The remaining undefined item in the probability space is the probability function ${\displaystyle \mathbb {P} }$, and we will give various definitions of it, in which the combinatorial (or classical), and axiomatic definitions are important.

Definition. (Subjective probability) The probability of an event is a measure of the chance with which we can expect the event to occur. We assign a number between ${\displaystyle 0}$ and ${\displaystyle 1}$ inclusively to the probability of an event. A probability of ${\displaystyle 1}$ means that we are certain the event will occur, and a probability of ${\displaystyle 0}$ means that we are certain the event will not occur.

Example. Amy and Bob assess their probabilities of winning the top prize in a lucky draw using the subjective probability approach.

• Amy thinks that she is lucky, and thus assigns 0.7 to the probability of winning the top prize
• Bob thinks that he is unlucky, and thus assigns 0.1 to the probability of winning the top prize

Remark.

• this illustrates a major problem of subjective probability, namely that the probability assigned to an event is often not unique, since different people hold different opinions

Definition. (Combinatorial probability) Assume all outcomes in the sample space ${\displaystyle \Omega }$ are equally likely. Then, the (combinatorial) probability of an event (say ${\displaystyle E}$) in the sample space is

${\displaystyle \mathbb {P} (E)={\frac {{\text{no. of outcomes in }}E}{{\text{no. of outcomes in }}\Omega }}.}$

Remark.

• it is also called classical probability
• if the outcomes are not equally likely, we cannot apply this definition
• by the principle of indifference (or insufficient reason), we should assume that the outcomes are equally likely unless there exists evidence showing otherwise
• when the sample space contains infinitely many outcomes, the combinatorial probability is undefined

Example. The probability that both a fair red six-faced dice and a fair blue six-faced dice come up showing the number 1 is ${\displaystyle {\frac {1}{6\times 6}}={\frac {1}{36}}}$.

Proof. The number of pairs of numbers coming up on the two dice is ${\displaystyle \underbrace {6} _{\text{red}}\times \underbrace {6} _{\text{blue}}=36}$. Since the dice are fair, the 36 outcomes are equally likely, and so we can apply combinatorial probability here. Exactly one of these outcomes (both dice showing 1) belongs to the event, so the probability is ${\displaystyle {\frac {1}{36}}}$.

${\displaystyle \Box }$
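The two-dice example can be checked by enumerating the 36 equally likely outcomes, reading the event as "both dice show 1". A Python sketch:

```python
from itertools import product

# All ordered outcomes (red, blue) of rolling two fair six-faced dice.
outcomes = list(product(range(1, 7), repeat=2))

# The event: both dice show the number 1.
event = [o for o in outcomes if o == (1, 1)]

p = len(event) / len(outcomes)  # 1/36
```

Ordering the outcomes (red die first) is what makes all 36 of them equally likely, mirroring the notional ordering used in the proof.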

Exercise.

Suppose the blue dice is colored red. Calculate the probability again.

• ${\displaystyle {\frac {1}{36}}}$
• ${\displaystyle {\frac {1}{21}}}$
• ${\displaystyle {\frac {1}{18}}}$
• ${\displaystyle {\frac {1}{15}}}$
• ${\displaystyle {\frac {1}{6}}}$

Example. (Capture-mark-recapture) We are fishing in a lake containing ${\displaystyle N}$ fishes. First, we catch ${\displaystyle k\leq N}$ fishes from the lake (capture) and give each of them a marker (mark). Then, we catch fishes from the lake again (recapture), catching ${\displaystyle n\leq N}$ fishes this time. The probability that there are ${\displaystyle r}$ marked fishes among the ${\displaystyle n}$ fishes (where ${\displaystyle r\leq k}$ and ${\displaystyle r\leq n}$) is ${\displaystyle {\frac {{\binom {k}{r}}\times {\binom {N-k}{n-r}}}{\binom {N}{n}}}}$.

Proof. We order the ${\displaystyle N}$ fishes in the lake notionally (e.g. by assigning them different numbers one by one), so that they are (notionally) distinguishable. Then, we have:

• ${\displaystyle {\binom {N}{n}}}$: the number of outcomes of catching ${\displaystyle n}$ fishes from ${\displaystyle N}$ fishes
• ${\displaystyle {\binom {k}{r}}}$: the number of outcomes of catching ${\displaystyle r}$ marked fishes from the ${\displaystyle k}$ marked fishes in the recapture process
• ${\displaystyle {\binom {N-k}{n-r}}}$: the number of outcomes of catching ${\displaystyle n-r}$ unmarked fishes from the ${\displaystyle N-k}$ unmarked fishes in the recapture process (this ensures that we catch exactly ${\displaystyle r}$ marked fishes, since the remaining caught fishes contain no marked fish)

Since the outcomes are equally likely, the probability is the ratio of the number of favourable outcomes, ${\displaystyle {\binom {k}{r}}\times {\binom {N-k}{n-r}}}$, to the total number of outcomes, ${\displaystyle {\binom {N}{n}}}$.

${\displaystyle \Box }$
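The formula above is the hypergeometric probability, and it can be computed directly with Python's `math.comb`; the concrete numbers below (100 fish, 20 marked, 10 recaptured) are hypothetical, chosen only for illustration:

```python
from math import comb

def recapture_prob(N, k, n, r):
    """P(exactly r marked fish among the n recaptured),
    when k of the N fish in the lake are marked."""
    return comb(k, r) * comb(N - k, n - r) / comb(N, n)

p = recapture_prob(100, 20, 10, 2)   # hypothetical numbers for illustration
# Summing over all possible r gives 1 (a consequence of Vandermonde's identity)
total = sum(recapture_prob(100, 20, 10, r) for r in range(11))
```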

Exercise. There are 9 balls in a box, consisting of 3 red balls, 2 blue balls and 4 green balls.

1 Calculate the probability that a red ball is drawn from the box if 1 ball is drawn from the box.

 ${\displaystyle {\frac {1}{28}}}$ ${\displaystyle {\frac {3}{28}}}$ ${\displaystyle {\frac {1}{9}}}$ ${\displaystyle {\frac {1}{3}}}$ none of the above

2 Calculate the probability that 2 red balls and 3 green balls are drawn from the box if 6 balls are drawn from the box.

 ${\displaystyle {\frac {2}{7}}}$ ${\displaystyle {\frac {5}{9}}}$ ${\displaystyle {\frac {5}{7}}}$ ${\displaystyle {\frac {5}{6}}}$ none of the above

3 ${\displaystyle n}$ orange balls are added to the box such that the probability that 2 red balls and 3 green balls are drawn from the box if 6 balls are drawn from the box is now ${\displaystyle {\frac {1}{3}}}$. Calculate ${\displaystyle n}$.

 4 8 16 there exist such ${\displaystyle n}$, and its value is none of the above there does not exist such ${\displaystyle n}$

4 Select the correct (in numerical value sense) expression(s) of the probability that ${\displaystyle r}$ red balls are drawn and ${\displaystyle b}$ blue balls are drawn if ${\displaystyle k}$ balls are drawn from the box (${\displaystyle r}$, ${\displaystyle b}$ and ${\displaystyle k}$ are of values such that all terms in the following are defined).

 ${\displaystyle {\frac {{\binom {3}{r}}{\binom {2}{b}}}{\binom {9}{k}}}}$ ${\displaystyle {\frac {{\binom {3}{r}}{\binom {2}{b}}}{\binom {b+r+k}{k}}}}$ ${\displaystyle {\frac {{\binom {3}{r}}{\binom {2}{b}}}{\binom {9}{9-b-r}}}}$ ${\displaystyle {\frac {{\binom {9-b-k}{r}}{\binom {2}{b}}}{\binom {9}{k}}}}$ ${\displaystyle {\frac {{\binom {3}{r}}{\binom {9-r-k}{b}}}{\binom {9}{k}}}}$

Definition. (Frequentist probability) The probability of an event or outcome is the long-term proportion of times the event would occur if the experiment were repeated independently many times. That is, letting ${\displaystyle n(E)}$ be the number of times that event ${\displaystyle E}$ occurs in ${\displaystyle n}$ repetitions of the experiment, the probability of ${\displaystyle E}$ is

${\displaystyle \mathbb {P} (E)=\lim _{n\to \infty }{\frac {n(E)}{n}}.}$

Example. Suppose we toss a coin 1 million times (i.e. 1000000 times). Heads comes up 700102 times, tails comes up 299896 times, and the coin lands on its edge 2 times.

Then, the probability that heads comes up is close to ${\displaystyle {\frac {700102}{1000000}}=0.700102}$.

From this, we may infer that the coin is unfair.
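We can mimic this frequentist reasoning in a short simulation; the bias 0.7 below is an assumption built into the simulated coin, and the point is that the empirical proportion approaches it:

```python
import random

random.seed(0)                    # fixed seed for reproducibility
n = 100_000
# Simulated coin with an assumed true P(head) = 0.7
heads = sum(random.random() < 0.7 for _ in range(n))
estimate = heads / n              # long-run proportion of heads
```

With this many repetitions the estimate lands very close to 0.7, so, as in the example, we would infer that the simulated coin is unfair.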

Definition. (Axiomatic probability) A probability is a set function defined on the event space ${\displaystyle {\mathcal {F}}}$ (here taken to be ${\displaystyle {\mathcal {P}}(\Omega )}$). It assigns a real value ${\displaystyle \mathbb {P} (E)}$ to each event ${\displaystyle E}$, with the following probability axioms satisfied:

(P1) for each event ${\displaystyle E\in {\mathcal {F}}}$, ${\displaystyle \mathbb {P} (E)\geq 0}$ (nonnegativity)
(P2) ${\displaystyle \mathbb {P} (\Omega )=1}$ (unitarity)
(P3) for each (countable) infinite sequence of mutually exclusive (or disjoint) events ${\displaystyle E_{1},E_{2},\dotsc }$, ${\displaystyle \mathbb {P} \left(\bigcup _{i=1}^{\infty }E_{i}\right)=\sum _{i=1}^{\infty }\mathbb {P} (E_{i})}$ (countable additivity)

Remark.

• mutually exclusive events means that the intersection of every two of the events is the empty set

Example. By the probability axioms, the probability of an event cannot be ${\displaystyle -0.1}$, since this would violate nonnegativity (P1).

Example. (Combinatorial probability is probability) Combinatorial probability is a probability since it satisfies all three probability axioms.

Proof.

(P1) this follows from observing that the number of outcomes in an event is nonnegative
(P2) this follows from observing that the number of outcomes in the sample space divided by itself is 1
(P3) this follows from observing that the number of outcomes in a union of (infinitely many) disjoint sets equals the sum of the numbers of outcomes in each of the disjoint sets (this can be seen, non-rigorously, through a Venn diagram)

${\displaystyle \Box }$
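For a concrete check, the following Python sketch verifies P1, P2 and additivity on disjoint events for combinatorial probability on a small sample space of our own choosing:

```python
from fractions import Fraction
from itertools import combinations

omega = frozenset({1, 2, 3, 4})          # a small sample space

def P(event):
    """Combinatorial probability: proportion of outcomes in the event."""
    return Fraction(len(event), len(omega))

# Every event, i.e. every subset of omega
events = [set(c) for r in range(len(omega) + 1)
          for c in combinations(omega, r)]

assert all(P(E) >= 0 for E in events)    # P1: nonnegativity
assert P(omega) == 1                     # P2: unitarity
A, B = {1}, {2, 3}                       # two disjoint events
assert P(A | B) == P(A) + P(B)           # additivity on disjoint events
```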

With these three axioms only, we can prove many well-known properties of probability.

## Properties of probability

### Basic properties of probability

Proposition. (Probability of empty set) ${\displaystyle \mathbb {P} (\varnothing )=0}$.

Proof. Let ${\displaystyle E_{i}=\varnothing }$ for each positive integer ${\displaystyle i}$. ${\displaystyle E_{1},E_{2},\dotsc }$ are mutually exclusive, since they are all empty sets, and so the intersection of any two of them is also the empty set. Also, ${\displaystyle E_{1}\cup E_{2}\cup \dotsb =\varnothing \cup \varnothing \cup \dotsb =\varnothing }$. So,

{\displaystyle {\begin{aligned}&&\mathbb {P} (\varnothing )&=\mathbb {P} (E_{1}\cup E_{2}\cup \dotsb )\\&&&{\overset {\text{ P3 }}{=}}\mathbb {P} (E_{1})+\mathbb {P} (E_{2})+\dotsb \\&\Rightarrow &\underbrace {\mathbb {P} (\varnothing )-\mathbb {P} (E_{1})} _{0}&=\mathbb {P} (E_{2})+\mathbb {P} (E_{3})+\dotsb \\&\Rightarrow &\mathbb {P} (E_{2})+\mathbb {P} (E_{3})+\dotsb &=0\\&\Rightarrow &\mathbb {P} (E_{2})&\leq \mathbb {P} (E_{2})+\mathbb {P} (E_{3})+\dotsb =0.\end{aligned}}}
By P1, ${\displaystyle \mathbb {P} (E_{2})\geq 0}$. It follows from these two inequalities that ${\displaystyle \mathbb {P} (\varnothing )=\mathbb {P} (E_{2})=0}$.

${\displaystyle \Box }$

Proposition. (Extended P3) The property of probability in the third axiom of probability (P3) is also valid for a finite sequence of events.

Proof. For each positive integer ${\displaystyle k}$, suppose that ${\displaystyle A_{1},\dotsc ,A_{k}}$ are disjoint events, and append to these the infinite sequence of events ${\displaystyle A_{k+1}=\varnothing ,A_{k+2}=\varnothing ,\dotsc }$. By P3,

${\displaystyle \mathbb {P} \left(\bigcup _{i=1}^{k}A_{i}\right)=\mathbb {P} \left(\bigcup _{i=1}^{\infty }A_{i}\right)=\sum _{i=1}^{\infty }\mathbb {P} (A_{i})=\sum _{i=1}^{k}\mathbb {P} (A_{i})}$
since ${\displaystyle \sum _{i=k+1}^{\infty }\mathbb {P} (A_{i})=\mathbb {P} (\varnothing )+\dotsb =0}$.

${\displaystyle \Box }$

Proposition. (Simplified law of total probability) For each event ${\displaystyle A,B}$, ${\displaystyle \mathbb {P} (B)=\mathbb {P} (B\cap A)+\mathbb {P} (B\setminus A)}$.

Proof.

${\displaystyle \mathbb {P} (B)=\mathbb {P} (B\cap (\underbrace {A\cup A^{c}} _{\Omega }))=\mathbb {P} {\big (}(B\cap A)\cup (\underbrace {B\cap A^{c}} _{:=B\setminus A}){\big )}{\overset {\text{ ext. P3 }}{=}}\mathbb {P} (B\cap A)+\mathbb {P} (B\setminus A)}$

${\displaystyle \Box }$

Illustration of simplified law of total probability:

|---------|
|  B\A    | <----- B
|    |----|-----|
|    |BnA |     |
|----|----|     | <---- A
|----------|


Proposition. (Simplified inclusion-exclusion principle) For each event ${\displaystyle A}$ and ${\displaystyle B}$, ${\displaystyle \mathbb {P} (A\cup B)=\mathbb {P} (A)+\mathbb {P} (B)-\mathbb {P} (A\cap B)}$.

Proof. Since events ${\displaystyle A}$ and ${\displaystyle B\setminus A}$ are disjoint, by extended P3,

${\displaystyle \mathbb {P} (A\cup (B\setminus A))=\mathbb {P} (A)+\mathbb {P} (B\setminus A)=\mathbb {P} (A)+(\mathbb {P} (B)-\mathbb {P} (B\cap A))=\mathbb {P} (A)+\mathbb {P} (B)-\mathbb {P} (A\cap B)}$
since ${\displaystyle \mathbb {P} (B)=\mathbb {P} (B\cap A)+\mathbb {P} (B\setminus A)\Rightarrow \mathbb {P} (B\setminus A)=\mathbb {P} (B)-\mathbb {P} (B\cap A)}$.

${\displaystyle \Box }$

Illustration of simplified inclusion-exclusion principle:

|---------|
|         | <----- B
| II |----|-----|
|    |AnB |     |
|----|----| I   | <---- A
|----------|


${\displaystyle \mathbb {P} (A\cup B)=\mathbb {P} ({\text{I}})+\mathbb {P} ({\text{II}})+\mathbb {P} (A\cap B)=\underbrace {\mathbb {P} ({\text{I}})+\mathbb {P} (A\cap B)} _{\mathbb {P} (A)}+\underbrace {\mathbb {P} ({\text{II}})+\mathbb {P} (A\cap B)} _{\mathbb {P} (B)}-\mathbb {P} (A\cap B)}$
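The identity can be checked mechanically on any finite, equally likely sample space; here is a small Python example with the outcomes 1 to 12:

```python
from fractions import Fraction

omega = set(range(1, 13))                 # 12 equally likely outcomes
A = {x for x in omega if x % 2 == 0}      # even numbers
B = {x for x in omega if x % 3 == 0}      # multiples of 3

def P(E):
    return Fraction(len(E), len(omega))

# Simplified inclusion-exclusion principle
assert P(A | B) == P(A) + P(B) - P(A & B)
print(P(A | B))  # 2/3
```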

Proposition. (Complement rule) For each event ${\displaystyle E}$, ${\displaystyle \mathbb {P} (E)=1-\mathbb {P} (E^{c}).}$

Proof.

${\displaystyle \mathbb {P} (E)=\mathbb {P} (E)+\mathbb {P} (E^{c})-\mathbb {P} (E^{c}){\overset {\text{ ext. P3 }}{=}}\mathbb {P} (\underbrace {E\cup E^{c}} _{\Omega })-\mathbb {P} (E^{c}){\overset {\text{ P2 }}{=}}1-\mathbb {P} (E^{c})}$

${\displaystyle \Box }$

Illustration of complement rule:

|---------------|
|               |
|      E^c      | <--- Omega (Pr(Omega)=1)
|    |---|      |
|    | E |      |
|    |---|      |
|---------------|


Proposition. (Numeric bound for probability) For each event ${\displaystyle E}$, ${\displaystyle 0\leq \mathbb {P} (E)\leq 1}$.

Proof. By P1, ${\displaystyle \mathbb {P} (E)\geq 0}$, and ${\displaystyle \mathbb {P} (E^{c})\geq 0}$. So, ${\displaystyle \mathbb {P} (E)\leq \mathbb {P} (E)+\mathbb {P} (E^{c})=\mathbb {P} (E)+(1-\mathbb {P} (E))=1}$.

${\displaystyle \Box }$

Proposition. (Monotonicity) If ${\displaystyle A\subseteq B}$, then ${\displaystyle \mathbb {P} (A)\leq \mathbb {P} (B)}$.

Proof. By simplified law of total probability,

${\displaystyle \mathbb {P} (B)=\mathbb {P} (\underbrace {B\cap A} _{A})+\mathbb {P} (B\setminus A){\overset {\text{ P1 }}{\geq }}\mathbb {P} (A){\cancel {+0}}.}$

${\displaystyle \Box }$

Illustration of monotonicity: (when ${\displaystyle A\subsetneq B}$)

|---------------|
|               |
|               | <--- B
|    |---|      |
|    | A |      |
|    |---|      |
|---------------|


Example. The probability of winning the championship in a competition is less than or equal to that of entering the final of the competition, by monotonicity.

Proof. Let ${\displaystyle C}$ and ${\displaystyle F}$ be the events of winning the championship in the competition, and entering the final of the competition, respectively. Then, ${\displaystyle C\subseteq F}$, since ${\displaystyle C}$ implies ${\displaystyle F}$ (to win the championship, we must have entered the final), and so ${\displaystyle \mathbb {P} (C)\leq \mathbb {P} (F)}$.

${\displaystyle \Box }$

Exercise.

Select all correct statement(s). All following capital letters are events.

 If ${\displaystyle A=B}$, then ${\displaystyle \mathbb {P} (A)=\mathbb {P} (B)}$ . ${\displaystyle \mathbb {P} {\big (}A\setminus (B\cup C){\big )}=\mathbb {P} (A)+\mathbb {P} (A\cap B)+\mathbb {P} (A\cap C)-\mathbb {P} (A\cap B\cap C)}$. ${\displaystyle {\frac {\mathbb {P} (A\cap B)}{\mathbb {P} (B)}}=1}$ if ${\displaystyle A\subseteq B}$ and ${\displaystyle \mathbb {P} (B)>0}$. ${\displaystyle 0\leq {\frac {\mathbb {P} (A\cap B)}{\mathbb {P} (B)}}\leq 1}$ if ${\displaystyle \mathbb {P} (B)>0}$ .

### More advanced properties of probability

Theorem. (Inclusion-exclusion principle)

Illustration of inclusion-exclusion principle when ${\displaystyle n=3}$

For each event ${\displaystyle E_{1},\dotsc ,E_{n}}$,

{\displaystyle {\begin{aligned}\mathbb {P} (E_{1}\cup \dotsb \cup E_{n})&=\mathbb {P} (E_{1})+\dotsb +\mathbb {P} (E_{n})\\&\;-{\big (}\mathbb {P} (E_{1}\cap E_{2})+\mathbb {P} (E_{1}\cap E_{3})+\dotsb +\mathbb {P} (E_{n-1}\cap E_{n}){\big )}\\&\;+{\big (}\mathbb {P} (E_{1}\cap E_{2}\cap E_{3})+\mathbb {P} (E_{1}\cap E_{2}\cap E_{4})+\dotsb +\mathbb {P} (E_{n-2}\cap E_{n-1}\cap E_{n}){\big )}\\&\;-\dotsb \\&\;+(-1)^{n+1}\mathbb {P} (E_{1}\cap \dotsb \cap E_{n}).\end{aligned}}}

Proof. We can prove this by induction.

Recall the simplified inclusion-exclusion principle, which is essentially the inclusion-exclusion principle when ${\displaystyle n=2}$. So, we know that the inclusion-exclusion principle is true for ${\displaystyle n=2}$, and it remains to prove the case with larger ${\displaystyle n}$.

The idea of the induction is illustrated as follows: by simplified inclusion-exclusion principle,

{\displaystyle {\begin{aligned}\mathbb {P} ((E_{1}\cup \dotsb \cup E_{n-1})\cup {\color {darkgreen}E_{n}})&=\mathbb {P} (E_{1}\cup \dotsb \cup E_{n-1})+\mathbb {P} ({\color {darkgreen}E_{n}})-\mathbb {P} {\big (}(E_{1}\cup \dotsb \cup E_{n-1})\cap {\color {darkgreen}E_{n}}{\big )}\\&=\mathbb {P} (E_{1}\cup \dotsb \cup E_{n-1})+\mathbb {P} ({\color {darkgreen}E_{n}})-\mathbb {P} {\big (}(E_{1}\cap {\color {darkgreen}E_{n}})\cup \dotsb \cup (E_{n-1}\cap {\color {darkgreen}E_{n}}){\big )}\\&=\dotsb \end{aligned}}}

${\displaystyle \Box }$

Remark.

• we can write the inclusion-exclusion principle more compactly as follows:

${\displaystyle \mathbb {P} (E_{1}\cup \dotsb \cup E_{n})=\sum _{j=1}^{n}(-1)^{j+1}\sum _{1\leq i_{1}<\dotsb <i_{j}\leq n}\mathbb {P} (E_{i_{1}}\cap \dotsb \cap E_{i_{j}})}$

• an alternative and more elegant proof is provided in the chapter about properties of distributions
• for the intersections of events, every distinct combination of the events is involved

Example. When ${\displaystyle n=3}$, for each event ${\displaystyle A,B}$ and ${\displaystyle C}$,

${\displaystyle \mathbb {P} (A\cup B\cup C)=\mathbb {P} (A)+\mathbb {P} (B)+\mathbb {P} (C)-\mathbb {P} (A\cap B)-\mathbb {P} (A\cap C)-\mathbb {P} (B\cap C)+\mathbb {P} (A\cap B\cap C).}$

Exercise.

Select the correct expression(s) for ${\displaystyle \mathbb {P} (A\cup B\cup C\cup D)}$ for each event ${\displaystyle A,B,C}$ and ${\displaystyle D}$.

 {\displaystyle {\begin{aligned}\mathbb {P} (A)+\mathbb {P} (B)+\mathbb {P} (C)+\mathbb {P} (D)-\mathbb {P} (A\cap B)-\mathbb {P} (A\cap C)-\mathbb {P} (A\cap D)-\mathbb {P} (B\cap C)-\mathbb {P} (B\cap D)\\-\mathbb {P} (C\cap D)+\mathbb {P} (A\cap B\cap C)+\mathbb {P} (A\cap B\cap D)+\mathbb {P} (B\cap C\cap D)-\mathbb {P} (A\cap B\cap C\cap D)\end{aligned}}} {\displaystyle {\begin{aligned}\mathbb {P} (A)+\mathbb {P} (B)+\mathbb {P} (C)+\mathbb {P} (D)-\mathbb {P} (A\cap B)-\mathbb {P} (A\cap C)-\mathbb {P} (A\cap D)-\mathbb {P} (B\cap C)-\mathbb {P} (B\cap D)\\-\mathbb {P} (C\cap D)+\mathbb {P} (A\cap B\cap C)+\mathbb {P} (A\cap B\cap D)+\mathbb {P} (B\cap C\cap D)+\mathbb {P} (A\cap B\cap C\cap D)\end{aligned}}} {\displaystyle {\begin{aligned}\mathbb {P} (A)+\mathbb {P} (B)+\mathbb {P} (C)+\mathbb {P} (D)+\mathbb {P} (A\cap B)+\mathbb {P} (A\cap C)+\mathbb {P} (A\cap D)+\mathbb {P} (B\cap C)+\mathbb {P} (B\cap D)\\+\mathbb {P} (C\cap D)-\mathbb {P} (A\cap B\cap C)-\mathbb {P} (A\cap B\cap D)-\mathbb {P} (B\cap C\cap D)+\mathbb {P} (A\cap B\cap C\cap D)\end{aligned}}} ${\displaystyle \mathbb {P} (A)+\mathbb {P} (B)+\mathbb {P} (C)+\mathbb {P} (D)-\mathbb {P} (A\cap B)-\mathbb {P} (A\cap C)-\mathbb {P} (B\cap C)+\mathbb {P} (A\cap B\cap C)-\mathbb {P} {\big (}(A\cup B\cup C)\cap D{\big )}}$

Example. (Application of inclusion-exclusion principle) Among 160 students,

• 40% have a major in mathematics
• 55% have a major in statistics
• 30% have a major in accounting
• 20% have a major in statistics and accounting
• 15% have a major in accounting and mathematics
• 20% have a major in mathematics and statistics
• 10% have a major in mathematics, statistics and accounting

The number of students that do not have any of these majors is ${\displaystyle 160{\big (}1-(0.4+0.55+0.3-0.2-0.15-0.2+0.1){\big )}=160(1-0.8)=32}$

Proof. We can regard the percentage as combinatorial probability, since their definitions match (both are about proportion). Let ${\displaystyle M,S,A}$ be the event that a student among them has a major in mathematics, statistics and accounting respectively. Then,

{\displaystyle {\begin{aligned}\mathbb {P} (M^{c}\cap S^{c}\cap A^{c})&=\mathbb {P} {\big (}(M\cup S\cup A)^{c}{\big )}=1-\mathbb {P} (M\cup S\cup A)\\&=1-(\mathbb {P} (M)+\mathbb {P} (S)+\mathbb {P} (A)-\mathbb {P} (M\cap S)-\mathbb {P} (M\cap A)-\mathbb {P} (S\cap A)+\mathbb {P} (M\cap S\cap A))\\&=1-(0.4+0.55+0.3-0.2-0.15-0.2+0.1)\\&=1-0.8=0.2.\end{aligned}}}
Thus,
${\displaystyle {\frac {\text{no. of students that do not have any of those majors}}{\text{no. of students}}}=0.2\Rightarrow {\text{no. of students that do not have any of those majors}}=0.2(160)=32.}$

Alternatively, we can consider the following Venn diagram:

|-------------| <--------- A
|             |
|        |----|----|
|        |    |    |
| 0.05   |0.05|0.15| <---- M
|        |    |    |
|--------|----|----|------|
|        |0.1 |0.1 |      |
| 0.1    |    |    | 0.25 | <---- S
|        |----|----|      |
|-------------|-----------|


We can see from this diagram that ${\displaystyle \mathbb {P} (M\cup S\cup A)=0.05+0.05+0.15+0.1+0.1+0.1+0.25=0.8}$, and then we can calculate the desired number as before.


${\displaystyle \Box }$
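The arithmetic in this example is easy to mis-copy, so a quick script can double-check it (allowing for floating-point rounding):

```python
# Given percentages, regarded as probabilities
M, S, A = 0.40, 0.55, 0.30       # single majors
MS, MA, SA = 0.20, 0.15, 0.20    # pairwise overlaps
MSA = 0.10                       # all three majors

# Inclusion-exclusion principle for P(M or S or A)
p_union = M + S + A - MS - MA - SA + MSA
students_with_none = round(160 * (1 - p_union))

assert abs(p_union - 0.8) < 1e-9
assert students_with_none == 32
```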

Exercise.

1 Calculate the percentage of students that have at least two of those three majors.

 10% 15% 20% 25% 40%

2 Calculate the percentage of students that have one and only one major.

 30% 35% 40% 45% 50%

Remark.

• this example illustrates that we can apply the inclusion-exclusion principle to the number of outcomes in events
• to be more precise, we can replace ${\displaystyle \mathbb {P} }$ with ${\displaystyle N}$ in the inclusion-exclusion principle, where ${\displaystyle N(E)}$ is the number of outcomes in the event ${\displaystyle E}$

Lemma. For each event ${\displaystyle E_{1},E_{2},\dotsc }$,

${\displaystyle \mathbb {P} \left(\bigcup _{i=1}^{\infty }E_{i}\right){\overset {\text{ def }}{=}}\mathbb {P} \left(\lim _{n\to \infty }\bigcup _{i=1}^{n}E_{i}\right)=\lim _{n\to \infty }\mathbb {P} \left(\bigcup _{i=1}^{n}E_{i}\right).}$

Proof. First suppose the events ${\displaystyle E_{1},E_{2},\dotsc }$ are disjoint. Then

${\displaystyle \lim _{n\to \infty }\mathbb {P} \left(\bigcup _{i=1}^{n}E_{i}\right){\overset {\text{ext. P3}}{=}}\lim _{n\to \infty }\sum _{i=1}^{n}\mathbb {P} (E_{i}){\overset {\text{ def }}{=}}\sum _{i=1}^{\infty }\mathbb {P} (E_{i}){\overset {\text{ P3 }}{=}}\mathbb {P} \left(\bigcup _{i=1}^{\infty }E_{i}\right).}$

For general events, we can apply this to the disjoint events ${\displaystyle F_{1}=E_{1}}$ and ${\displaystyle F_{i}=E_{i}\setminus (E_{1}\cup \dotsb \cup E_{i-1})}$ for ${\displaystyle i\geq 2}$, which satisfy ${\displaystyle \bigcup _{i=1}^{n}F_{i}=\bigcup _{i=1}^{n}E_{i}}$ for each ${\displaystyle n}$, and also ${\displaystyle \bigcup _{i=1}^{\infty }F_{i}=\bigcup _{i=1}^{\infty }E_{i}}$.

${\displaystyle \Box }$

Proposition. (Boole's inequality) For each event ${\displaystyle E_{1},E_{2},\ldots }$, ${\displaystyle \mathbb {P} \left(\bigcup _{i=1}^{\infty }E_{i}\right)\leq \sum _{i=1}^{\infty }\mathbb {P} (E_{i})}$.

Proof. First, by the inclusion-exclusion principle, for each event ${\displaystyle A}$ and ${\displaystyle B}$, ${\displaystyle \mathbb {P} (A\cup B)=\mathbb {P} (A)+\mathbb {P} (B)-\mathbb {P} (A\cap B){\overset {\text{ P1 }}{\leq }}\mathbb {P} (A)+\mathbb {P} (B)}$.

So,

${\displaystyle \mathbb {P} \left(\bigcup _{i=1}^{n}E_{i}\right)\leq \mathbb {P} (E_{1})+\mathbb {P} \left(\bigcup _{i=2}^{n}E_{i}\right)\leq \mathbb {P} (E_{1})+\mathbb {P} (E_{2})+\mathbb {P} \left(\bigcup _{i=3}^{n}E_{i}\right)\leq \dotsb \leq \mathbb {P} (E_{1})+\mathbb {P} (E_{2})+\dotsb +\mathbb {P} (E_{n})=\sum _{i=1}^{n}\mathbb {P} (E_{i}).}$

Using the lemma,

${\displaystyle \mathbb {P} \left(\bigcup _{i=1}^{\infty }E_{i}\right)=\lim _{n\to \infty }\mathbb {P} \left(\bigcup _{i=1}^{n}E_{i}\right){\overset {\text{from above}}{\leq }}\lim _{n\to \infty }\sum _{i=1}^{n}\mathbb {P} (E_{i}){\overset {\text{ def }}{=}}\sum _{i=1}^{\infty }\mathbb {P} (E_{i}).}$

${\displaystyle \Box }$
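A small finite example shows why the inequality can be strict when events overlap; the three events below are "flip ${\displaystyle i}$ shows a head" for three fair coin flips:

```python
from fractions import Fraction
from itertools import product

omega = set(product((0, 1), repeat=3))   # 8 outcomes of three coin flips

def P(E):
    return Fraction(len(E), len(omega))

# E_i = "flip i shows a head (1)"; these events overlap heavily
events = [{w for w in omega if w[i] == 1} for i in range(3)]
union = set().union(*events)

assert P(union) <= sum(P(E) for E in events)   # Boole's inequality
print(P(union), sum(P(E) for E in events))     # 7/8 3/2
```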

# Mathematical Review

The review of set theory contained herein adopts a naive point of view. We assume that the meaning of a set as a collection of objects is intuitively clear. A rigorous analysis of the concept belongs to the foundations of mathematics and mathematical logic. Although we shall not initiate a study of these fields, the rules we follow in dealing with sets are derived from them. A set is a collection of objects, which are the elements of the set.

If an element ${\displaystyle x}$ belongs to a set ${\displaystyle S}$ , we express this fact by writing ${\displaystyle x\in S}$. If ${\displaystyle x}$ does not belong to ${\displaystyle S}$ , we write ${\displaystyle x\notin S}$ . We use the equality symbol to denote logical identity. For instance, ${\displaystyle x=y}$ means that ${\displaystyle x,y}$ are symbols denoting the same object. Similarly, the equation ${\displaystyle S=T}$ states that ${\displaystyle S,T}$ are two symbols for the same set. In particular, the sets ${\displaystyle S,T}$ contain precisely the same elements. If ${\displaystyle x,y}$ are different objects then we write ${\displaystyle x\neq y}$ . Also, we can express the fact that ${\displaystyle S,T}$ are different sets by writing ${\displaystyle S\neq T}$ .

A set ${\displaystyle S}$ is a subset of ${\displaystyle T}$ if every element of ${\displaystyle S}$ is also contained in ${\displaystyle T}$ . We express this relation by writing ${\displaystyle S\subset T}$ . Note that this definition does not require ${\displaystyle S}$ to be different from ${\displaystyle T}$ . In fact, ${\displaystyle S=T}$ if and only if ${\displaystyle S\subset T}$ and ${\displaystyle T\subset S}$ . If ${\displaystyle S\subset T}$ and ${\displaystyle S}$ is different from ${\displaystyle T}$ , then ${\displaystyle S}$ is a proper subset of ${\displaystyle T}$ and we write ${\displaystyle S\subsetneq T}$ .

There are many ways to specify a set. If the set contains only a few elements, one can simply list the objects in the set:

${\displaystyle S=\{x_{1},x_{2},x_{3}\}}$

The content of a set can also be enumerated whenever ${\displaystyle S}$ has a countable number of elements,

${\displaystyle S=\{x_{1},x_{2},\dots \}}$

Usually, the way to specify a set is to take some collection ${\displaystyle S}$ of objects and some property that elements of ${\displaystyle S}$ may or may not possess, and to form the set consisting of all elements of ${\displaystyle S}$ having that property. For example, starting with the integers ${\displaystyle \mathbb {Z} }$ , we can form the subset of ${\displaystyle \mathbb {Z} }$ consisting of all even numbers

${\displaystyle S=\{x\in \mathbb {Z} |x{\text{ is an even number}}\}}$

More generally, we denote the set of all elements that have a certain property ${\displaystyle P}$ by

${\displaystyle S=\{x|x{\text{ satisfies }}P\}}$

The braces are to be read as the words "the set of" whereas the symbol | stands for the words "such that." It is convenient to introduce two special sets. The empty set, denoted by ${\displaystyle \varnothing }$ , is a set that contains no elements. The universal set is the collection of all objects of interest in a particular context, and it is denoted by ${\displaystyle \Omega }$ . Once a universal set ${\displaystyle \Omega }$ is specified, we need only consider sets that are subsets of ${\displaystyle \Omega }$ . In the context of probability, ${\displaystyle \Omega }$ is often called the sample space.

The complement of a set ${\displaystyle S}$ , with respect to the universal set ${\displaystyle \Omega }$ , is the collection of all objects in ${\displaystyle \Omega }$ that do not belong to ${\displaystyle S}$ ,

${\displaystyle S^{c}={\bar {S}}=\{x\in \Omega |x\notin S\}}$

We note that ${\displaystyle \Omega ^{c}=\varnothing }$ .

## Elementary Set Operations

Probability theory makes extensive use of elementary set operations. Below, we review the ideas of set theory, and establish the basic terminology and notation. Consider two sets ${\displaystyle S,T}$ .

The union of sets ${\displaystyle S,T}$ is the collection of all elements that belong to ${\displaystyle S}$ or ${\displaystyle T}$ (or both), and it is denoted by ${\displaystyle S\cup T}$ . Formally, we define the union of these two sets by

${\displaystyle S\cup T=\{x|x\in S{\text{ or }}x\in T\}}$

The intersection of sets ${\displaystyle S,T}$ is the collection of all elements that are common to both ${\displaystyle S}$ and ${\displaystyle T}$ . It is denoted by ${\displaystyle S\cap T}$, and it can be expressed mathematically as

${\displaystyle S\cap T=\{x|x\in S{\text{ and }}x\in T\}}$

When ${\displaystyle S,T}$ have no elements in common, we write ${\displaystyle S\cap T=\varnothing }$ . We also express this fact by saying that ${\displaystyle S,T}$ are disjoint. More generally, a collection of sets is said to be disjoint if no two sets have a common element. A collection of sets is said to form a partition of ${\displaystyle S}$ if the sets in the collection are disjoint and their union is ${\displaystyle S}$ .

The difference of two sets, denoted by ${\displaystyle S-T}$ or ${\displaystyle S\setminus T}$ , is defined as the set consisting of those elements of ${\displaystyle S}$ that are not in ${\displaystyle T}$ ,

${\displaystyle S-T=S\setminus T=\{x|x\in S{\text{ and }}x\notin T\}}$

This set is sometimes called the complement of ${\displaystyle T}$ relative to ${\displaystyle S}$ , or the complement of ${\displaystyle T}$ in ${\displaystyle S}$ .

We have already looked at the definition of the union and the intersection of two sets. We can also form the union or the intersection of arbitrarily many sets. This is defined in the obvious way,

${\displaystyle \bigcup _{\alpha \in I}S_{\alpha }=\{x|x\in S_{\alpha }{\text{ for some }}\alpha \in I\}}$
${\displaystyle \bigcap _{\alpha \in I}S_{\alpha }=\{x|x\in S_{\alpha }{\text{ for all }}\alpha \in I\}}$

The index set ${\displaystyle I}$ can be finite or even infinite.
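These indexed unions and intersections translate directly into Python set operations; here the index set is ${\displaystyle I=\{1,2,3\}}$ and ${\displaystyle S_{\alpha }=\{\alpha ,\alpha +1\}}$ is a made-up family used only for illustration:

```python
# Family of sets indexed by I = {1, 2, 3}, with S_a = {a, a + 1}
family = {a: {a, a + 1} for a in (1, 2, 3)}

union = set().union(*family.values())              # elements in some S_a
intersection = set.intersection(*family.values())  # elements in every S_a

assert union == {1, 2, 3, 4}
assert intersection == set()   # no element lies in all three sets
```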

## Rules of Set Theory

Given a collection of sets, it is possible to form new ones by applying elementary set operations to them. As in algebra, one uses parentheses to indicate precedence. For instance, ${\displaystyle R\cup (S\cap T)}$ denotes the union of two sets ${\displaystyle R,S\cap T}$ , while ${\displaystyle (R\cup S)\cap T}$ represents the intersection of two sets ${\displaystyle R\cup S,T}$ . The sets thus formed are quite different.

Sometimes different combinations of operations lead to the same set. For instance, we have the two distributive laws

${\displaystyle R\cap (S\cup T)=(R\cap S)\cup (R\cap T)}$
${\displaystyle R\cup (S\cap T)=(R\cup S)\cap (R\cup T)}$

Two particularly useful equivalent combinations of operations are given by De Morgan's laws, which state that

${\displaystyle R-(S\cup T)=(R-S)\cap (R-T)}$
${\displaystyle R-(S\cap T)=(R-S)\cup (R-T)}$

These two laws can be generalized to

${\displaystyle \left(\bigcup _{\alpha \in I}S_{\alpha }\right)^{c}=\bigcap _{\alpha \in I}S_{\alpha }^{c}}$
${\displaystyle \left(\bigcap _{\alpha \in I}S_{\alpha }\right)^{c}=\bigcup _{\alpha \in I}S_{\alpha }^{c}}$

when multiple sets are involved. To establish the first equality, suppose that ${\displaystyle x}$ belongs to ${\displaystyle \left(\bigcup _{\alpha \in I}S_{\alpha }\right)^{c}}$ . Then ${\displaystyle x}$ is not contained in ${\displaystyle \bigcup _{\alpha \in I}S_{\alpha }}$ . That is, ${\displaystyle x}$ is not an element of ${\displaystyle S_{\alpha }}$ for any ${\displaystyle \alpha \in I}$ . This implies that ${\displaystyle x}$ belongs to ${\displaystyle S_{\alpha }^{c}}$ for all ${\displaystyle \alpha \in I}$ , and therefore ${\displaystyle x\in \bigcap _{\alpha \in I}S_{\alpha }^{c}}$ . We have shown that ${\displaystyle \left(\bigcup _{\alpha \in I}S_{\alpha }\right)^{c}\subset \bigcap _{\alpha \in I}S_{\alpha }^{c}}$ . The converse inclusion is obtained by reversing the above argument. The second law can be obtained in a similar fashion.
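Both generalized De Morgan's laws can also be verified directly on concrete sets; the universe and the family in this sketch are arbitrary small examples:

```python
omega = set(range(10))                    # universal set for this check
family = [{0, 1, 2}, {2, 3}, {2, 4, 5}]   # arbitrary sets S_alpha

def complement(S):
    return omega - S

# (union of S_alpha)^c == intersection of S_alpha^c
assert complement(set().union(*family)) == \
       set.intersection(*(complement(S) for S in family))

# (intersection of S_alpha)^c == union of S_alpha^c
assert complement(set.intersection(*family)) == \
       set().union(*(complement(S) for S in family))
```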

## Cartesian Products

There is yet another way to create new sets from existing ones. It involves the notion of an ordered pair of objects. Given sets ${\displaystyle S,T}$ , the Cartesian product ${\displaystyle S\times T}$ is the set of all ordered pairs ${\displaystyle (x,y)}$ for which ${\displaystyle x}$ is an element of ${\displaystyle S}$ and ${\displaystyle y}$ is an element of ${\displaystyle T}$ ,

${\displaystyle S\times T=\{(x,y)|x\in S{\text{ and }}y\in T\}}$
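In Python, `itertools.product` plays the role of the Cartesian product; the two sets below are arbitrary examples:

```python
from itertools import product

S = {1, 2}
T = {'a', 'b', 'c'}

cartesian = set(product(S, T))   # all ordered pairs (x, y)

assert len(cartesian) == len(S) * len(T)                 # |S x T| = |S| * |T|
assert (1, 'a') in cartesian and ('a', 1) not in cartesian  # pairs are ordered
```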

## Summary of Probability & Set Theory

• ${\displaystyle P(A)\in [0,1]}$ for any event ${\displaystyle A}$
• ${\displaystyle {\text{not }}A}$: ${\displaystyle P({\bar {A}})=P(A^{c})=1-P(A)}$
• ${\displaystyle A{\text{ or }}B}$: ${\displaystyle P(A\cup B)=P(A)+P(B)-P(A\cap B)}$, which equals ${\displaystyle P(A)+P(B)}$ if and only if ${\displaystyle A}$ and ${\displaystyle B}$ are mutually exclusive
• ${\displaystyle A{\text{ and }}B}$: ${\displaystyle P(A\cap B)=P(A|B)\cdot P(B)}$, which equals ${\displaystyle P(A)\cdot P(B)}$ if and only if ${\displaystyle A}$ and ${\displaystyle B}$ are independent
• ${\displaystyle A{\text{ given }}B}$ [conditional]: ${\displaystyle P(A|B)={\frac {P(A\cap B)}{P(B)}}}$
Basic Set Theory

UNION: combined area of set ${\displaystyle A,B}$ , known as ${\displaystyle A\cup B}$ . The set of all items which are members of either ${\displaystyle A}$ or ${\displaystyle B}$

• The union of ${\displaystyle A,B}$ collects the elements of both sets together
• Some basic properties of unions:
• ${\displaystyle A\cup B=B\cup A}$
• ${\displaystyle A\cup (B\cup C)=(A\cup B)\cup C}$
• ${\displaystyle A\subset (A\cup B)}$
• ${\displaystyle A\cup A=A}$
• ${\displaystyle A\cup \varnothing =A}$ , where ${\displaystyle \varnothing ={\text{empty set}}}$
• ${\displaystyle A\subset B}$ , if and only if ${\displaystyle A\cup B=B}$

INTERSECTION: area where both ${\displaystyle A,B}$ overlap, known as ${\displaystyle A\cap B}$ . It represents which elements the two sets ${\displaystyle A,B}$ have in common

• If ${\displaystyle A\cap B=\varnothing }$ , then A and B are said to be DISJOINT.
• Some basic properties of intersections:
• ${\displaystyle A\cap B=B\cap A}$
• ${\displaystyle A\cap (B\cap C)=(A\cap B)\cap C}$
• ${\displaystyle (A\cap B)\subset A}$
• ${\displaystyle A\cap A=A}$
• ${\displaystyle A\cap \varnothing =\varnothing }$
• ${\displaystyle A\subset B}$ , if and only if ${\displaystyle A\cap B=A}$
• UNIVERSAL SET: space of all things possible, which contains ALL of the elements or elementary events.
• ${\displaystyle U\setminus A}$ is called the absolute complement of ${\displaystyle A}$

Complement (set): 2 sets can be subtracted. The relative complement (set theoretic difference of ${\displaystyle B}$ and ${\displaystyle A}$). Denoted by ${\displaystyle B\setminus A}$ (or ${\displaystyle B-A}$) is the set of all elements which are members of ${\displaystyle B}$ , but not members of ${\displaystyle A}$

Some basic properties of complements (${\displaystyle {\overline {A}}\ ,\ A^{c}\ ,\ ~A\ ,\ A'}$):

• ${\displaystyle A\cup {\overline {A}}=U}$
• ${\displaystyle A\cap {\overline {A}}=\varnothing }$
• ${\displaystyle {\overline {({\overline {A}})}}=A}$
• ${\displaystyle A\setminus A=\varnothing }$
• ${\displaystyle {\overline {U}}=\varnothing }$ , and ${\displaystyle {\overline {\varnothing }}=U}$
• ${\displaystyle A\setminus B=A\cap {\overline {B}}}$
Summary
• Intersection ${\displaystyle A\cap B}$ --> AND – both events occur together at the same time
• Union ${\displaystyle A\cup B}$ --> OR – everything about both events, ${\displaystyle A,B}$
• Complement ${\displaystyle {\overline {A}}}$ --> NOT ${\displaystyle A}$ – everything else except ${\displaystyle A}$ (or the event in question)
• ${\displaystyle A\cup {\overline {A}}=S}$ (sample space)
• ${\displaystyle A\cap {\overline {A}}=\varnothing }$ (impossible event)

Union and Intersection are:

Commutative
• ${\displaystyle A\cup B=B\cup A}$
• ${\displaystyle A\cap B=B\cap A}$
Associative
• ${\displaystyle A\cup (B\cup C)=(A\cup B)\cup C}$
• ${\displaystyle A\cap (B\cap C)=(A\cap B)\cap C}$
Distributive
• ${\displaystyle A\cup (B\cap C)=(A\cup B)\cap (A\cup C)}$
• ${\displaystyle A\cap (B\cup C)=(A\cap B)\cup (A\cap C)}$

# Combinatorics

## What is combinatorics?

Figure 1. Venn diagram to illustrate union, intersection, and universe.

Combinatorics involves the counting and enumeration of elements of sets and similar structures such as sequences and multisets. We begin by reviewing the properties of sets, as illustrated in the Venn diagram of figure 1: the integer 2 is both even and prime, while the integer 1 is not a prime number. Letting ${\displaystyle P}$ denote prime and ${\displaystyle E}$ denote even, the figure suggests that we define the two sets ${\displaystyle P=\{2,3,5\}}$ and ${\displaystyle E=\{2,4\}}$. If a set has a finite number of elements, that number is known as the set's cardinality. Hence the cardinalities of ${\displaystyle P}$ and ${\displaystyle E}$ are 3 and 2, respectively. The intersection, ${\displaystyle P\cap E=\{2\},}$ and union, ${\displaystyle P\cup E=\{2,3,4,5\},}$ of our two sets are probably well known to most of our readers. The value in introducing students to set theory using Venn diagrams is that they make it self-evident that order is not important in defining the elements of a set. In roster notation, this equivalence can be expressed as,

${\displaystyle E=\{2,4\}=\{4,2\}.}$

Two other sets associated with figure 1 are worth mentioning. The universe can be defined as the set of all items under consideration within the context of a given discussion or analysis of a system. In figure 1, the box around ${\displaystyle P\cup E}$ suggests that the universe for this discussion should be ${\displaystyle U=\{1,2,3,4,5\}.}$ Another important set is the empty set, ${\displaystyle \{\},}$ which contains no elements whatsoever. The empty set can also be written as ${\displaystyle \varnothing }$.
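The sets of figure 1 can be reproduced with Python's built-in `set` type, which, like roster notation, ignores order; a small sketch:

```python
# Sets from Figure 1: P = primes, E = evens, U = the universe.
P = {2, 3, 5}
E = {2, 4}
U = {1, 2, 3, 4, 5}

assert len(P) == 3 and len(E) == 2   # cardinalities
assert P & E == {2}                  # intersection
assert P | E == {2, 3, 4, 5}         # union
assert U - P == {1, 4}               # complement of P within U
assert {2, 4} == {4, 2}              # order does not matter in a set
```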

Figure 2. The dotted circles and ovals define the 8 elements of the power set of the set {1,2,3}.

### Power set

The power set of a set ${\displaystyle S}$ is the set of all subsets of ${\displaystyle S}$, including the empty set and ${\displaystyle S}$ itself. The power set of the set ${\displaystyle \{1,2,3\}}$ is illustrated in Figure 2. We denote the power set of ${\displaystyle S}$ as ${\displaystyle {\mathcal {P}}(S)}$ (or ${\displaystyle 2^{S}}$).

Example. If ${\displaystyle S=\{1,2,3\}}$, then

${\displaystyle {\mathcal {P}}(S)=\{\{\},\{1\},\{2\},\{3\},\{1,2\},\{3,1\},\{2,3\},\{1,2,3\}\}}$

Remark.

• note that ${\displaystyle S}$ contains ${\displaystyle 3}$ elements, while its power set contains ${\displaystyle 2^{3}=8}$ elements
• the generalization to the power set of a set with ${\displaystyle n}$ elements containing ${\displaystyle 2^{n}}$ elements follows in the next section.
• the empty set ${\displaystyle \{\}}$ is often denoted by ${\displaystyle \varnothing }$

### Sequences

Like a set, a sequence contains a possibly infinite number of elements (or members). In contrast with sets, the order (enumeration) of elements is part of what defines a sequence. Sequences are written with parentheses ${\displaystyle (\,)}$ instead of curly brackets ${\displaystyle \{\,\}}$, so that, in contrast with the situation for the set ${\displaystyle E}$ (even) in Figure 1, these two sequences are not equal:

${\displaystyle (1,2)\neq (2,1)}$ ... or in string/word notation: ${\displaystyle 12\neq 21}$

In computer science, a sequence consisting of characters is called a string. For that reason, it is often convenient to write sequences of characters as words (or strings). For example, the sequence ${\displaystyle (b,e,e)}$ is more efficiently expressed as ${\displaystyle {\textit {bee}}}$. Note that repetition plays an essential role in defining a sequence:
Example: While ${\displaystyle {\textit {be}}}$ and ${\displaystyle {\textit {bee}}}$ are pronounced the same, ${\displaystyle {\textit {be}}\neq {\textit {bee}}.}$

### Multisets

A multiset (or bag, or mset) combines the properties of a set and a sequence. Order (enumeration) is not important in defining a multiset, but repetition of elements is allowed. It is often convenient to denote multisets with square brackets ${\displaystyle [\,].}$ Example: While ${\displaystyle [b,e,e]=[e,b,e]}$ represents the equality of two multisets, the corresponding sequences are not equal: ${\displaystyle (b,e,e)\neq (e,b,e)}$ (or ${\displaystyle {\textit {bee}}\neq {\textit {ebe}}}$ in string notation). Note also that ${\displaystyle [b,e,e]\neq [b,e],}$ since repetition matters in a multiset.
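Python's `collections.Counter` behaves like a multiset and can illustrate these distinctions; a small sketch:

```python
from collections import Counter

# A Counter models a multiset: order is ignored, but multiplicity matters.
assert Counter("bee") == Counter("ebe")      # [b,e,e] = [e,b,e] as multisets
assert Counter("bee") != Counter("be")       # repetition matters: two e's vs one
assert ("b", "e", "e") != ("e", "b", "e")    # as sequences, order matters
```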

## Fundamental Counting Principle

Theorem. (Fundamental counting principle) If trials ${\displaystyle 1,\ldots ,k}$ have ${\displaystyle n_{1},\dots ,n_{k}}$ possible outcomes respectively, then the ${\displaystyle k}$ trials together have ${\displaystyle n_{1}\times \cdots \times n_{k}}$ possible outcomes.

Proof. First, consider the case ${\displaystyle k=2}$: we can enumerate each possible outcome as an ordered pair, as follows:

${\displaystyle {\begin{array}{ccccc}{\text{trial 1}}\backslash {\text{trial 2}}&{\text{1st outcome}}&{\text{2nd outcome}}&\cdots &n_{2}{\text{th outcome}}\\\hline {\text{1st outcome}}&(1,1)&(1,2)&\cdots &(1,n_{2})\\{\text{2nd outcome}}&(2,1)&(2,2)&\cdots &(2,n_{2})\\\vdots &\vdots &\vdots &\vdots &\vdots \\n_{1}{\text{th outcome}}&(n_{1},1)&(n_{1},2)&\cdots &(n_{1},n_{2})\\\end{array}}}$
Then, we can count that there are ${\displaystyle n_{1}n_{2}}$ possible outcomes (by considering the rows (or columns) one by one).

After establishing the case ${\displaystyle k=2}$, we can establish the cases for positive integers ${\displaystyle k\geq 3}$ inductively, e.g.:

${\displaystyle {\begin{array}{ccccc}{\text{trial 1 }}\&{\text{ trial 2}}\backslash {\text{trial 3}}&{\text{1st outcome}}&{\text{2nd outcome}}&\cdots &n_{3}{\text{th outcome}}\\\hline {\text{1st outcome}}&(1,1,1)&(1,1,2)&\cdots &(1,1,n_{3})\\{\text{2nd outcome}}&(1,2,1)&(1,2,2)&\cdots &(1,2,n_{3})\\\vdots &\vdots &\vdots &\vdots &\vdots \\n_{1}n_{2}{\text{th outcome}}&(n_{1},n_{2},1)&(n_{1},n_{2},2)&\cdots &(n_{1},n_{2},n_{3})\\\end{array}}}$
then we can count that there are ${\displaystyle n_{1}n_{2}n_{3}}$ outcomes (by considering the rows (or columns) one by one), and we can prove the remaining cases inductively.

${\displaystyle \Box }$
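The theorem can be checked by brute-force enumeration with `itertools.product`, which lists exactly the ordered tuples counted in the proof; a sketch with arbitrarily chosen outcome sets:

```python
from itertools import product

# Three trials with 3, 2, and 4 possible outcomes respectively.
trial1 = ["a", "b", "c"]
trial2 = [0, 1]
trial3 = ["w", "x", "y", "z"]

# product() enumerates every ordered choice of one outcome per trial.
outcomes = list(product(trial1, trial2, trial3))
assert len(outcomes) == 3 * 2 * 4   # fundamental counting principle: 24
```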

Remark.

• it is also known as multiplicative counting principle, or rule of product

Example.

Figure 3. In set theory, this is called the Cartesian product of two sets, with cardinality, ${\displaystyle {\mathcal {N}}=3\times 2}$

The tree diagram of Figure 3 illustrates this for ${\displaystyle k=2}$, ${\displaystyle n_{1}=3}$, and ${\displaystyle n_{2}=2}$. The number of possible outcomes is ${\displaystyle {\mathcal {N}}=n_{1}\times n_{2}=6.}$

Exercise.

1 Determine the number of possible outcomes if ${\displaystyle n_{1}=2}$ and ${\displaystyle n_{2}=3}$ instead.

 ${\displaystyle 1}$ ${\displaystyle 2}$ ${\displaystyle 3}$ ${\displaystyle 6}$ ${\displaystyle 12}$

2 Suppose ${\displaystyle k=3}$ now. Given that the number of possible outcomes is still ${\displaystyle 6}$ without changing other conditions given in the example, calculate ${\displaystyle n_{3}}$.

 ${\displaystyle 1}$ ${\displaystyle 2}$ ${\displaystyle 3}$ ${\displaystyle 6}$ ${\displaystyle 12}$

Remark.

• this might be visualized by imagining a roll of a three-sided die (with three outcomes, e.g. 1, 2, 3), followed by a flip of a two-sided coin (with two outcomes, e.g. A, B).

### Counting the number of elements in a power set

Example. (Number of elements in a power set) The number of elements in the power set of a set ${\displaystyle S}$ with ${\displaystyle n}$ elements is ${\displaystyle 2^{n}}$.

Proof. Consider the ${\displaystyle n}$ elements of ${\displaystyle S}$ one by one. For each of them, we can either include it in, or exclude it from, a subset of ${\displaystyle S}$. Then, there are ${\displaystyle n}$ steps involved in constructing a subset of ${\displaystyle S}$, and each step has two outcomes. It follows from the multiplicative counting principle that the ${\displaystyle n}$ steps have ${\displaystyle \underbrace {2\times 2\times \cdots \times 2} _{n{\text{ times}}}=2^{n}}$ outcomes. That is, there are ${\displaystyle 2^{n}}$ possible (distinct) subsets of ${\displaystyle S}$. Since the power set contains all subsets of ${\displaystyle S}$ by definition, it follows that the power set has ${\displaystyle 2^{n}}$ elements.

${\displaystyle \Box }$
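The include/exclude argument in the proof translates directly into code: each subset of an ${\displaystyle n}$-element set corresponds to an ${\displaystyle n}$-bit mask, one bit per element. A minimal sketch:

```python
# Enumerate the power set of S: bit i of `mask` decides whether the
# i-th element is included, so 2**n masks give 2**n subsets.
def power_set(S):
    elems = list(S)
    n = len(elems)
    return [{elems[i] for i in range(n) if mask >> i & 1}
            for mask in range(2 ** n)]

subsets = power_set({1, 2, 3})
assert len(subsets) == 2 ** 3                     # 8 elements in the power set
assert set() in subsets and {1, 2, 3} in subsets  # empty set and S itself
```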

Exercise.

Determine the number of elements in ${\displaystyle {\mathcal {P}}(\varnothing )}$ (i.e. power set of empty set).

 ${\displaystyle 0}$ ${\displaystyle 1}$ ${\displaystyle 2}$ ${\displaystyle 4}$ It is undefined.

Remark.

• ${\displaystyle n}$ is an arbitrary nonnegative integer

### The counting principle misused

Figure 4. The counting principle is not useful when the number of choices at each step is not unique.
Figure 5. The counting principle only applies if the outcomes are defined such that ${\displaystyle HT\neq TH}$ (i.e. order matters).

Figures 4 and 5 illustrate the fact that the counting principle is not always useful. Figure 4 counts the ways three integers can be added to give five, if the integers are restricted to the set ${\displaystyle \{1,2,3\}}$. Since these three integers are choices (decisions), it is convenient to label the choice indices with capital letters: ${\displaystyle T_{A},T_{B},T_{C}.}$

E.g., ${\displaystyle T_{B}=3}$ means the second choice is the integer 3.

We cannot apply the counting principle to figure 4 because ${\displaystyle n_{B}}$ depends on ${\displaystyle T_{A}.}$ In our case, ${\displaystyle T_{A}=1\Rightarrow n_{B}=3,}$ and ${\displaystyle T_{A}=2\Rightarrow n_{B}=2,}$ and ${\displaystyle T_{A}=3\Rightarrow n_{B}=1.}$ This leads us to an important caveat about using the counting principle:

• the counting principle cannot be used if the number of outcomes at each step cannot be uniquely defined

Figure 5 examines two flips of a coin. It calculates the correct number of outcomes to be ${\displaystyle {\mathcal {N}}=2\times 2=4,}$ but only if we carefully define the outcome. The counting principle is valid only if heads followed by tails (HT) is a different outcome than tails followed by heads (TH). In other words:

• when counting outcomes it is important to understand the role that order (enumeration) plays in defining outcomes

But if we instead count the outcomes in a fashion such that HT and TH are considered to be the same, then a formula such as ${\displaystyle {\mathcal {N}}=n_{A}\times n_{B}}$ cannot be used:

• the counting principle does not hold if two different decision paths lead to the same final outcome (in the theorem, we say 'trial ${\displaystyle 1,\ldots ,k}$', which implicitly assumes that the order matters in the outcomes for the ${\displaystyle k}$ trials)

Example. Suppose we throw two six-faced dice, colored red and blue respectively. The number of possible distinct pairs of numbers facing up is ${\displaystyle 6\times 6=36}$.

Proof. Since the dice are distinguishable, we can use the multiplicative principle of counting. To be more precise, we can let the possible numbers facing up on the red die be the possible outcomes in 'trial 1', and those of the blue die be the possible outcomes in 'trial 2'. Since each trial has six outcomes, it follows that the number of outcomes (i.e. possible distinct pairs) is ${\displaystyle 6\times 6=36}$.

${\displaystyle \Box }$

Exercise.

Suppose the red die becomes a blue die, so that the two dice are no longer distinguishable. Calculate the number of possible distinct pairs of numbers facing up.

 ${\displaystyle 6}$ ${\displaystyle 15}$ ${\displaystyle 18}$ ${\displaystyle 21}$ ${\displaystyle 36}$

## Number of ways to select some objects from distinguishable objects

In this section, we will discuss the number (no.) of ways to select some objects from distinguishable objects, in four types of selection, classified by whether the selection is ordered and whether it is with replacement.

Before discussing these four types of selection, we introduce some preliminary mathematical concepts used in what follows.

### Preliminary mathematical concepts

Definition. (Factorial) For each nonnegative integer ${\displaystyle n}$, the factorial of ${\displaystyle n}$, denoted by ${\displaystyle n!}$, is

${\displaystyle n!={\begin{cases}1&{\text{if }}n=0,\\n(n-1)\cdots (1)&{\text{if }}n\geq 1.\end{cases}}}$

More generally, we have the gamma function.

Definition. (Gamma function) The gamma function is

${\displaystyle \Gamma (x)=\int _{0}^{\infty }y^{x-1}e^{-y}\,dy,}$
in which ${\displaystyle x>0}$.

Proposition. (Relationship between gamma function and factorial) For each nonnegative integer ${\displaystyle n}$, ${\displaystyle \Gamma (n+1)=n!}$.

Proof. Using integration by parts,

{\displaystyle {\begin{aligned}\Gamma (x+1)&=\int _{0}^{\infty }y^{x}e^{-y}\,dy\\&=\int _{0}^{\infty }y^{x}\,d(-e^{-y})\\&=\left[-y^{x}e^{-y}\right]_{y=0}^{y=\infty }+\int _{0}^{\infty }e^{-y}\,d(y^{x})\\&=\left[-y^{x}e^{-y}\right]_{y=0}^{y=\infty }+x\underbrace {\int _{0}^{\infty }y^{x-1}e^{-y}\,dy} _{\Gamma (x)}\\&=0-0+x\Gamma (x)\qquad {\text{since }}e^{-y}\to 0{\text{ as }}y\to \infty \\&=x\Gamma (x).\end{aligned}}}
Since
${\displaystyle \Gamma (1)=\int _{0}^{\infty }y^{1-1}e^{-y}\,dy=\left[-e^{-y}\right]_{0}^{\infty }=0-(-1)=1,}$
${\displaystyle \Gamma (n+1)=n\Gamma (n)=\cdots =n(n-1)\cdots (1)\Gamma (1)=n(n-1)\cdots (1)=n!}$
for each nonnegative integer ${\displaystyle n}$.

${\displaystyle \Box }$
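The proposition can be checked numerically with Python's standard `math` module, whose `math.gamma` implements the gamma function:

```python
import math

# Gamma(n + 1) == n! for every nonnegative integer n.
for n in range(10):
    assert math.isclose(math.gamma(n + 1), math.factorial(n))
```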

Remark.

• the infinity in the proof is in the limit sense, so '${\displaystyle y=\infty }$' is a slight abuse of notation

Theorem. (Binomial series theorem) For each real number ${\displaystyle \alpha }$,

${\displaystyle (1+x)^{\alpha }=1+\alpha x+{\frac {\alpha (\alpha -1)}{2!}}x^{2}+{\frac {\alpha (\alpha -1)(\alpha -2)}{3!}}x^{3}+\cdots ,}$
in which ${\displaystyle |x|<1}$.

Remark.

• special case: ${\displaystyle (1+x)^{-1}=1-x+x^{2}-x^{3}+\cdots }$
• special case: ${\displaystyle (1-x)^{-1}=1+x+x^{2}+x^{3}+\cdots }$

Definition. (Binomial coefficient) The binomial coefficient, indexed by nonnegative integers ${\displaystyle n}$ and ${\displaystyle r}$ such that ${\displaystyle n\geq r}$, denoted by ${\displaystyle {\binom {n}{r}}}$, is ${\displaystyle {\binom {n}{r}}={\frac {n!}{r!(n-r)!}}}$.

When ${\displaystyle \alpha =n}$ is a nonnegative integer, the series terminates and the binomial series theorem reduces to a special case of the binomial theorem:

${\displaystyle (1+x)^{n}=1+{\binom {n}{1}}x+{\binom {n}{2}}x^{2}+\cdots +{\binom {n}{n}}x^{n}{\cancel {+{\frac {n(n-1)\cdots (\overbrace {\color {darkgreen}n-n} ^{0})}{n!}}+\cdots }}.}$

#### Binomial theorem

Theorem. (Binomial theorem) For each nonnegative integer ${\displaystyle n}$,

${\displaystyle (x+y)^{n}={\binom {n}{\color {darkgreen}0}}x^{n}y^{\color {darkgreen}0}+{\binom {n}{\color {darkgreen}1}}x^{n-{\color {darkgreen}1}}y^{\color {darkgreen}1}+\cdots +{\binom {n}{\color {darkgreen}n}}x^{n-{\color {darkgreen}n}}y^{\color {darkgreen}n}}$

Proof. It can be proved combinatorially or by induction; the complete proof is omitted.

${\displaystyle \Box }$

Remark.

• binomial series theorem is more useful than binomial theorem in this chapter

The binomial theorem can be illustrated by Pascal's triangle:

${\displaystyle {\begin{array}{lc}(a+b)^{0}=&{\color {Red}{\boldsymbol {1}}}\\(a+b)^{1}=&{\color {Red}{\boldsymbol {1}}}a+{\color {Red}{\boldsymbol {1}}}b\\(a+b)^{2}=&{\color {Red}{\boldsymbol {1}}}a^{2}+{\color {Red}{\boldsymbol {2}}}ab+{\color {Red}{\boldsymbol {1}}}b^{2}\\(a+b)^{3}=&{\color {Red}{\boldsymbol {1}}}a^{3}+{\color {Red}{\boldsymbol {3}}}a^{2}b+{\color {Red}{\boldsymbol {3}}}ab^{2}+{\color {Red}{\boldsymbol {1}}}b^{3}\\(a+b)^{4}=&{\color {Red}{\boldsymbol {1}}}a^{4}+{\color {Red}{\boldsymbol {4}}}a^{3}b+{\color {Red}{\boldsymbol {6}}}a^{2}b^{2}+{\color {Red}{\boldsymbol {4}}}ab^{3}+{\color {Red}{\boldsymbol {1}}}b^{4}\\\end{array}}}$ ${\displaystyle {\begin{array}{c}{\color {Red}{\boldsymbol {1}}}={\binom {0}{0}}\\{\color {Red}{\boldsymbol {1}}}={\binom {1}{0}}\quad \quad {\color {Red}{\boldsymbol {1}}}={\binom {1}{1}}\\{\color {Red}{\boldsymbol {1}}}={\binom {2}{0}}\quad \quad {\color {Red}{\boldsymbol {2}}}={\binom {2}{1}}\quad \quad {\color {Red}{\boldsymbol {1}}}={\binom {2}{2}}\\{\color {Red}{\boldsymbol {1}}}={\binom {3}{0}}\quad \quad {\color {Red}{\boldsymbol {3}}}={\binom {3}{1}}\quad \quad {\color {Red}{\boldsymbol {3}}}={\binom {3}{2}}\quad \quad \quad \quad {\color {Red}{\boldsymbol {1}}}={\binom {3}{3}}\\{\color {Red}{\boldsymbol {1}}}={\binom {4}{0}}\quad \quad {\color {Red}{\boldsymbol {4}}}={\binom {4}{1}}\quad \quad {\color {Red}{\boldsymbol {6}}}={\binom {4}{2}}\quad \quad {\color {Red}{\boldsymbol {4}}}={\binom {4}{3}}\quad \quad {\color {Red}{\boldsymbol {1}}}={\binom {4}{4}}\\\end{array}}}$
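Rows of Pascal's triangle can be generated with `math.comb` (available in Python 3.8+); a small sketch, including the rule that each interior entry is the sum of the two entries above it:

```python
import math

# Row n of Pascal's triangle lists C(n, 0), ..., C(n, n).
def pascal_row(n):
    return [math.comb(n, r) for r in range(n + 1)]

assert pascal_row(4) == [1, 4, 6, 4, 1]
# Pascal's rule: each interior entry is the sum of the two above it.
assert math.comb(4, 2) == math.comb(3, 1) + math.comb(3, 2)
```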

### Ordered selection without replacement

Theorem. The no. of ways for ordered selection of ${\displaystyle r\leq n}$ objects from ${\displaystyle n}$ distinguishable objects without replacement is ${\displaystyle {\frac {n!}{(n-r)!}}}$.

Proof. Consider an equivalent situation: selecting ${\displaystyle r\leq n}$ objects from ${\displaystyle n}$ distinguishable objects to be put into ${\displaystyle r}$ ordered boxes, labelled box ${\displaystyle 1,2,\ldots ,r}$, in which each box contains at most one object. Considering the ${\displaystyle r}$ boxes from box ${\displaystyle 1}$ to box ${\displaystyle r}$:

• for box ${\displaystyle 1}$, there are ${\displaystyle n}$ choices of object to be put into it
• for box ${\displaystyle 2}$, there are ${\displaystyle n-1}$ choices of object to be put into it, since the object put into box ${\displaystyle 1}$ cannot be simultaneously put into box ${\displaystyle 2}$
• ...
• for box ${\displaystyle r}$, there are ${\displaystyle n-(r-1)=n-r+1}$ choices of object to be put into it, since each of the ${\displaystyle r-1}$ objects put into box ${\displaystyle 1,2,\ldots ,r-1}$ cannot be simultaneously put into box ${\displaystyle r}$

Thus, by multiplicative principle of counting, the desired no. of ways is

${\displaystyle \underbrace {n} _{\text{box 1}}\times \underbrace {(n-1)} _{\text{box 2}}\times \cdots \times \underbrace {(n-r+1)} _{{\text{box }}r}={\frac {n\times (n-1)\times \cdots \times (n-r+1)\times {\color {darkgreen}(n-r)\times \cdots \times 1}}{\color {darkgreen}(n-r)\times \cdots \times 1}}={\frac {n!}{(n-r)!}}}$

${\displaystyle \Box }$
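The count can be confirmed by enumerating the ordered selections with `itertools.permutations`, whose second argument is exactly the ${\displaystyle r}$ of the theorem; a sketch with n = 5, r = 3:

```python
import math
from itertools import permutations

# Ordered selection of r objects from n distinguishable objects
# without replacement: n! / (n - r)! ways.
n, r = 5, 3
ways = list(permutations(range(n), r))
assert len(ways) == math.factorial(n) // math.factorial(n - r)  # 5*4*3 = 60
```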

Remark.

• ordered selection without replacement is also known as partial permutation
• the unordered selection without replacement discussed below is known as combination; sometimes, ${\displaystyle {\binom {n}{r}}}$ is read as '${\displaystyle n}$-choose-${\displaystyle r}$'

Example. The no. of distinct ways to select ${\displaystyle 3}$ objects to be put into ${\displaystyle 3}$ boxes, labelled ${\displaystyle B_{1},B_{2}}$ and ${\displaystyle B_{3}}$ from ${\displaystyle 5}$ objects, labelled ${\displaystyle O_{1},O_{2},O_{3},O_{4}}$ and ${\displaystyle O_{5}}$ is ${\displaystyle {\frac {5!}{(5-3)!}}=\underbrace {5} _{B_{1}}\times \underbrace {4} _{B_{2}}\times \underbrace {3} _{B_{3}}=60.}$

Exercise.

1 After putting the ${\displaystyle 3}$ objects into the ${\displaystyle 3}$ boxes, ${\displaystyle 2}$ of them are taken out and put into ${\displaystyle 2}$ boxes, labelled ${\displaystyle B_{4}}$ and ${\displaystyle B_{5}}$. Calculate the no. of distinct ways to do this.

 ${\displaystyle 3}$ ${\displaystyle 6}$ ${\displaystyle 60}$ ${\displaystyle 180}$ ${\displaystyle 360}$

2 Continue from previous question. Suppose only ${\displaystyle 1}$ of them is taken out and put into ${\displaystyle 1}$ box, labelled ${\displaystyle B_{4}}$, now. Calculate the no. of distinct ways to do this.

 ${\displaystyle 3}$ ${\displaystyle 6}$ ${\displaystyle 60}$ ${\displaystyle 180}$ ${\displaystyle 360}$

Example. (Competition) There are ${\displaystyle 16}$ candidates for a competition. The no. of ways to award winner, 1st and 2nd runners-up is ${\displaystyle {\frac {16!}{(16-3)!}}=\underbrace {16} _{\text{winner}}\times \underbrace {15} _{\text{1st runner-up}}\times \underbrace {14} _{\text{2nd runner-up}}=3360.}$

If Amy and Bob are among the ${\displaystyle 16}$ candidates, and it is given that Amy is awarded 1st runner-up while Bob does not receive any award, the no. of ways to award winner, 1st and 2nd runners-up becomes ${\displaystyle {\frac {14!}{(14-2)!}}\times 1=\underbrace {14} _{\text{winner}}\times \underbrace {1} _{\text{1st runner-up: Amy}}\times \underbrace {13} _{\text{2nd runner-up}}=182}$. In particular, neither Amy nor Bob can be awarded winner or 2nd runner-up.

Exercise.

1 Suppose Chris is also among the ${\displaystyle 16}$ candidates. Given that Amy, Bob and Chris receive an award from the competition, calculate the no. of ways to award winner, 1st and 2nd-runners up.

 ${\displaystyle 1}$ ${\displaystyle 3}$ ${\displaystyle 6}$ ${\displaystyle 32}$ ${\displaystyle 96}$

2 Continue from previous question. Given that Amy, Bob and Chris do not receive any award from the competition, calculate the no. of ways to award winner, 1st and 2nd-runners up.

 ${\displaystyle 1716}$ ${\displaystyle 2496}$ ${\displaystyle 3354}$ ${\displaystyle 3357}$ ${\displaystyle 3359}$

A special case of ordered selection without replacement occurs when the no. of selected objects equals the total no. of available objects. In this case, the selection is called a permutation, and the no. of ways to permute ${\displaystyle n}$ objects (i.e. ordered selection of ${\displaystyle n}$ objects from ${\displaystyle n}$ objects) is ${\displaystyle {\frac {n!}{(n-n)!}}=n!}$.

Example.

Figure 6. 3!=3·2·1= 6 permutations of {1,2,3}.

The 6 ways to permute the string 123 are shown in Figure 6.

### Unordered selection of distinguishable objects without replacement

Theorem. The no. of ways for unordered selection of ${\displaystyle r\leq n}$ objects from ${\displaystyle n}$ distinguishable objects without replacement is ${\displaystyle {\frac {n!}{r!(n-r)!}}={\binom {n}{r}}}$.

Proof. There are two ways to prove this.

First, consider an equivalent situation: selecting ${\displaystyle r\leq n}$ objects from ${\displaystyle n}$ distinguishable objects without replacement to be put into one box.[6] Then, we count the no. of ways to do this in order, and then merge the ways that are regarded as the same for unordered selection (i.e. regarded as the same when we put the objects into one box). The no. of ways to do this in order is ${\displaystyle \underbrace {n} _{\text{choice 1}}\times \underbrace {(n-1)} _{\text{choice 2}}\times \cdots \times \underbrace {(n-r+1)} _{{\text{choice }}r}={\frac {n!}{(n-r)!}}}$ (choice ${\displaystyle k}$ means the ${\displaystyle k}$th selection of an object to be put into the box).

Among these ways, putting the same ${\displaystyle r}$ objects into the box in different orders counts as different ways, and we need to merge them into one way. To do so, we consider how many ordered ways correspond to putting the same ${\displaystyle r}$ objects into the box. This is a permutation (ordered selection of ${\displaystyle r}$ objects from ${\displaystyle r}$ distinguishable objects), so each unordered selection is counted ${\displaystyle r!}$ times (i.e. the count is scaled up by a factor of ${\displaystyle r!}$). We therefore scale the count down by dividing by ${\displaystyle r!}$, and the desired no. of ways is ${\displaystyle {\frac {n!}{r!(n-r)!}}={\binom {n}{r}}}$.

Second, we use the notion of generating function, by encoding the selection process into a binomial series, and then use the coefficients to determine the desired no. of ways. To be more precise, recall a special case of binomial series theorem:

${\displaystyle \underbrace {(1+x)(1+x)\cdots (1+x)} _{n{\text{ times}}}=(1+x)^{n}=1+\cdots +{\binom {n}{r}}x^{r}+\cdots +{\binom {n}{n}}x^{n}.}$
By encoding each selection into one factor ${\displaystyle (1+x)}$, treating the ${\displaystyle x^{0}=1}$ and ${\displaystyle x^{1}=x}$ in the ${\displaystyle k}$th factor ${\displaystyle (1+x)}$ as not selecting and selecting the ${\displaystyle k}$th object respectively, the coefficient of ${\displaystyle x^{r}}$ is the desired no. of ways: it counts the ways to build ${\displaystyle x^{r}}$ by selecting ${\displaystyle x}$ from ${\displaystyle r}$ of the factors and ${\displaystyle 1}$ from the others (i.e. selecting ${\displaystyle r}$ objects, regardless of the order). Thus, the desired no. of ways is ${\displaystyle {\binom {n}{r}}}$.

${\displaystyle \Box }$
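Both arguments can be confirmed by enumeration with `itertools.combinations`; a sketch with n = 16, r = 3:

```python
import math
from itertools import combinations

# Unordered selection of r objects from n distinguishable objects
# without replacement: C(n, r) = n! / (r! (n - r)!) ways.
n, r = 16, 3
ways = list(combinations(range(n), r))
assert len(ways) == math.comb(n, r) == 560

# Each unordered selection merges r! ordered ones (the first proof).
assert math.comb(n, r) * math.factorial(r) == math.factorial(n) // math.factorial(n - r)
```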

Example.

Figure 8: ${\displaystyle C(n,k)}$ is shown for ${\displaystyle n=3,}$ and ${\displaystyle k=0,1,2,3.}$ The dotted ellipses remind us that these are sets, where order is not important.

For a combination, the order in which the items are selected is not important, so each selection from a set can be regarded as a subset of the original set. Figure 8 illustrates this for the set ${\displaystyle \{1,2,3\}.}$ The number of elements in this set is ${\displaystyle n=3.}$ From our earlier discussion of the power set, we know that the total number of subsets is ${\displaystyle 2^{3}=8}$. All 8 subsets are shown in the figure, organized by how many items are in each subset (for example, the subset in the upper-left corner contains 3 elements, while all subsets with 2 elements occupy the lower-right corner). Let ${\displaystyle k}$ denote the number of elements "chosen" to be in each of the 8 subsets of the set ${\displaystyle S=\{1,2,3\}}$ (where the number of elements in ${\displaystyle S}$ is ${\displaystyle n=3}$).

• ${\displaystyle {\binom {3}{0}}=1}$ set has ${\displaystyle k=0}$ elements. It is the empty set: ${\displaystyle \varnothing }$.
• ${\displaystyle {\binom {3}{1}}=3}$ sets have ${\displaystyle k=1}$ element. They are ${\displaystyle \{1\}}$,${\displaystyle \{2\}}$,and ${\displaystyle \{3\}}$.
• ${\displaystyle {\binom {3}{2}}=3}$ sets have ${\displaystyle k=2}$ elements. They are ${\displaystyle \{2,3\}}$,${\displaystyle \{1,3\}}$,and ${\displaystyle \{1,2\}}$.
• ${\displaystyle {\binom {3}{3}}=1}$ set has ${\displaystyle k=3}$ elements. It is the set itself: ${\displaystyle \{1,2,3\}}$.

Example.

Figure 9: combinations. ${\displaystyle {\textit {4-choose-2}}}$ ${\displaystyle =6}$

No. of ways to select ${\displaystyle 2}$ objects from ${\displaystyle 4}$ distinguishable objects without considering the order is ${\displaystyle {\binom {4}{2}}=6}$

Example. (Competition) There are ${\displaystyle 16}$ candidates for a competition. The no. of ways to select ${\displaystyle 3}$ candidates to enter final is ${\displaystyle {\binom {16}{3}}={\frac {16\times 15\times 14}{3\times 2\times 1}}=560.}$

Exercise.

1 Amy, Bob and Chris are among the ${\displaystyle 16}$ candidates. Calculate the no. of ways to select them to enter final.

 ${\displaystyle 1}$ ${\displaystyle 3}$ ${\displaystyle 6}$ ${\displaystyle 32}$ ${\displaystyle 96}$

2 Continue from the previous question. Calculate the no. of ways to select candidates other than Amy, Bob and Chris to enter final.

 ${\displaystyle 220}$ ${\displaystyle 286}$ ${\displaystyle 554}$ ${\displaystyle 557}$ ${\displaystyle 559}$

#### Special cases worth remembering

The formula for counting combinations has special cases that are worth remembering:

${\displaystyle {\binom {n}{0}}={\binom {n}{n}}=1}$ (There is only one way to pick nothing and only one way to pick all ${\displaystyle n}$ things.)
${\displaystyle {\binom {n}{1}}={\binom {n}{n-1}}=n}$ (There are ${\displaystyle n}$ ways to pick one thing, or to leave one thing out.)
${\displaystyle {\binom {n}{k}}={\binom {n}{n-k}}}$ (There are the same number of ways of picking ${\displaystyle k}$ of ${\displaystyle n}$ things as there are of leaving out ${\displaystyle k}$ of ${\displaystyle n}$ things.)
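These special cases, and the symmetry in particular, can be spot-checked with `math.comb`; a minimal sketch with the arbitrary choice n = 7:

```python
import math

n = 7
assert math.comb(n, 0) == math.comb(n, n) == 1      # pick nothing / everything
assert math.comb(n, 1) == math.comb(n, n - 1) == n  # pick one / leave one out
for k in range(n + 1):
    assert math.comb(n, k) == math.comb(n, n - k)   # symmetry
```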

### Ordered selection of distinguishable objects with replacement

Theorem. The no. of ways for selecting ${\displaystyle r}$ objects from ${\displaystyle n}$ distinguishable objects in order, with replacement is ${\displaystyle n^{r}}$.

Proof. Consider the equivalent situation: selecting ${\displaystyle r}$ objects from ${\displaystyle n}$ types of objects, in which each type of the objects has unlimited stock, to be put into ${\displaystyle r}$ ordered boxes (the same object may be selected more than once). Then, the no. of ways is ${\displaystyle \underbrace {n} _{\text{box 1}}\times \underbrace {n} _{\text{box 2}}\times \cdots \times \underbrace {n} _{{\text{box }}r}=n^{r}}$, since for each box, there are ${\displaystyle n}$ types of objects that can be selected to be put into it.

${\displaystyle \Box }$
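With replacement the same object may be chosen repeatedly, which `itertools.product` with `repeat=r` enumerates directly; a sketch with n = 3, r = 4 (note that r > n is allowed):

```python
from itertools import product

# Ordered selection of r objects from n distinguishable objects
# WITH replacement: n ** r ways; r may exceed n.
n, r = 3, 4
ways = list(product(range(n), repeat=r))
assert len(ways) == n ** r == 81
```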

Remark.

• ${\displaystyle r}$ can be greater than ${\displaystyle n}$

Example. (Setting a password) The number of ways to set a password with ${\displaystyle 6}$ characters, with the following rules:

(R1) numbers are allowed
(R2) alphabets