# Statistics/Preliminaries

This chapter discusses some preliminary knowledge (related to statistics) for the following chapters in the advanced part.

## Empirical distribution

Suppose ${\displaystyle X}$ is a random variable resulting from a random experiment, with cdf ${\displaystyle F(x)}$. After repeating this random experiment ${\displaystyle n}$ times, we obtain ${\displaystyle n}$ random variables, denoted by ${\displaystyle X_{1},X_{2},\dotsc ,X_{n}}$, associated with the ${\displaystyle n}$ outcomes.

Remark.

• Often, a computer is useful for conducting such an experiment and repeating it many times.
• In particular, the programming language R is commonly used for computational statistics. You may see the wikibook R Programming for more details about it.
• As a result, the content discussed in this section (as well as the section about resampling) is quite relevant to computational statistics.

We call these ${\displaystyle n}$ random variables, ${\displaystyle X_{1},X_{2},\dotsc ,X_{n}}$, a (random) sample from a distribution with cdf ${\displaystyle F(x)}$ (or pmf/pdf ${\displaystyle f(x)}$), and the number ${\displaystyle n}$ is the sample size. When the ${\displaystyle n}$ random variables are independent (i.e., the experiment is repeated independently), we call the sample an independent (random) sample.

Remark.

• In some other places, just "random sample" is used to refer to what we call an independent random sample here.

Also, in this case, the ${\displaystyle n}$ random variables are iid (independent and identically distributed) (they are identically distributed since they come from the same distribution, i.e. the one with cdf ${\displaystyle F(x)}$).

Since all these ${\displaystyle n}$ random variables follow the same cdf as ${\displaystyle X}$, we may expect their distribution should be somewhat similar to the distribution of ${\displaystyle X}$, and indeed, this is true. Before showing how this is true, we need to define "the distribution of these ${\displaystyle n}$ random variables" more precisely, as follows:

Definition. (Empirical distribution) The cdf of the empirical distribution, called the empirical cdf, of a random sample ${\displaystyle X_{1},X_{2},\dotsc ,X_{n}}$, denoted by ${\displaystyle F_{\color {darkgreen}n}(x)}$, is ${\displaystyle {\frac {1}{n}}\sum _{k=1}^{n}\mathbf {1} \{X_{k}\leq x\}}$.

Remark.

• ${\displaystyle \mathbf {1} \{A\}}$ is the indicator function with value 1 if ${\displaystyle A}$ is true and 0 otherwise.
• We can see that ${\displaystyle F_{n}(x)}$ "assigns" the probability (or "mass") ${\displaystyle 1/n}$ to each of ${\displaystyle X_{1},X_{2},\dotsc ,X_{n}}$, and this is indeed a valid cdf.
• This is because for each of ${\displaystyle X_{1},\dotsc ,X_{n}}$, if it is less than or equal to ${\displaystyle x}$, then the corresponding indicator function in the sum is one, and thus a value of "${\displaystyle 1/n}$" is contributed to the cdf.
• To understand this more clearly, consider the following example.
• We can interpret ${\displaystyle F_{n}(x)}$ as the relative frequency of the event ${\displaystyle \{X\leq x\}}$. Recall that the frequentist definition of probability of an event is the "long-term" relative frequency of the event (i.e. the relative frequency of the event after repeating a random experiment infinite number of times). As a result, we will intuitively expect that ${\displaystyle F_{n}(x)\approx F(x)}$ when ${\displaystyle n}$ is large.
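To make the definition concrete, here is a minimal Python sketch of the empirical cdf (the function name `empirical_cdf` is our own, not from the text):

```python
# A minimal sketch of the empirical cdf F_n(x) = (1/n) * sum_k 1{X_k <= x}.
# The name `empirical_cdf` is illustrative only.

def empirical_cdf(sample):
    """Return the function x -> F_n(x) for the observed sample."""
    n = len(sample)
    def F_n(x):
        # count the observations at or below x, each carrying mass 1/n
        return sum(1 for x_k in sample if x_k <= x) / n
    return F_n

F3 = empirical_cdf([2.0, 1.0, 3.0])
print(F3(0.5), F3(1.0), F3(2.5))  # 0, then 1/3, then 2/3
```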

Example. A random sample of size 5 is taken from an unknown distribution, and the following numbers are obtained:

-1.4, 2.3, 0.8, 1.9, -1.6

(a) Find the empirical cdf.

(b) Let ${\displaystyle Y}$ be a (discrete) random variable with cdf exactly the same as the empirical cdf in (a). Prove that the pmf of ${\displaystyle Y}$ (called empirical pmf) is

${\displaystyle f_{Y}(y)=\mathbb {P} (Y=y)={\frac {1}{5}},\quad y=-1.6,-1.4,0.8,1.9{\text{ or }}2.3.}$
Solution:

(a) First, we order the sample data ascendingly so that we can find the empirical cdf more conveniently:

-1.6, -1.4, 0.8, 1.9, 2.3

The empirical cdf is given by

${\displaystyle F_{5}(x)={\begin{cases}0,&x<-1.6;\\1/5,&-1.6\leq x<-1.4;\\2/5,&-1.4\leq x<0.8;\\3/5,&0.8\leq x<1.9;\\4/5,&1.9\leq x<2.3;\\1,&x\geq 2.3.\\\end{cases}}}$
Explanations:

• After ordering the sample data, we treat each of the numbers as an observed value of the random sample: ${\displaystyle X_{1}=-1.6,X_{2}=-1.4,X_{3}=0.8,X_{4}=1.9,X_{5}=2.3}$.
• Then, when ${\displaystyle x<-1.6}$, none of ${\displaystyle X_{1},\dotsc ,X_{5}}$ is less than or equal to ${\displaystyle x}$. So, all indicator functions involved are zero, and thus the value of the empirical cdf is zero.
• When ${\displaystyle -1.6\leq x<-1.4}$, only ${\displaystyle X_{1}\leq x}$, and thus only the indicator function ${\displaystyle \mathbf {1} \{X_{1}\leq x\}=1}$ in this case, and all other indicator functions are zero. As a result, the value is ${\displaystyle {\frac {\sum _{k=1}^{5}\mathbf {1} \{X_{k}\leq x\}}{5}}={\frac {\mathbf {1} \{X_{1}\leq x\}+0+0+0+0}{5}}={\frac {1}{5}}}$.
• Similarly, when ${\displaystyle -1.4\leq x<0.8}$, only ${\displaystyle X_{1},X_{2}\leq x}$, and thus only the indicator functions ${\displaystyle \mathbf {1} \{X_{1}\leq x\}}$ and ${\displaystyle \mathbf {1} \{X_{2}\leq x\}}$ equal one in this case, while all other indicator functions are zero. As a result, the value is ${\displaystyle {\frac {\sum _{k=1}^{5}\mathbf {1} \{X_{k}\leq x\}}{5}}={\frac {\mathbf {1} \{X_{1}\leq x\}+\mathbf {1} \{X_{2}\leq x\}+0+0+0}{5}}={\frac {2}{5}}}$.
• ...
• When ${\displaystyle x\geq 2.3}$, all ${\displaystyle X_{1},\dotsc ,X_{5}\leq x}$. Hence, all indicator functions are one, and thus the value of the empirical cdf is ${\displaystyle {\frac {1+1+1+1+1}{5}}=1}$.
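The piecewise formula above can be checked numerically; the following Python sketch (with our own helper name `F5`) probes one point inside each piece:

```python
# Checking the piecewise values of F_5 numerically, using the sample from
# the example. The helper name `F5` is ours, not from the text.

data = [-1.4, 2.3, 0.8, 1.9, -1.6]

def F5(x):
    return sum(1 for x_k in data if x_k <= x) / len(data)

# one probe point inside each piece of the case-by-case formula
for x in [-2.0, -1.5, 0.0, 1.0, 2.0, 3.0]:
    print(x, F5(x))  # 0, 1/5, 2/5, 3/5, 4/5, 1 respectively
```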

(b)

Proof. First, notice that the cdf of ${\displaystyle Y}$ is ${\displaystyle F_{Y}(y)=\mathbb {P} (Y\leq y)=F_{5}(y)}$, and hence the pmf is ${\displaystyle f_{Y}(y)=\mathbb {P} (Y=y)=\mathbb {P} (Y\leq y)-\mathbb {P} (Y<y)}$.

Then, we observe that when ${\displaystyle y=-1.6}$, ${\displaystyle \mathbb {P} (Y\leq y)=F_{5}(-1.6)=1/5}$, and ${\displaystyle \mathbb {P} (Y<y)=0}$ (from the empirical cdf). Hence, ${\displaystyle f_{Y}(y)={\frac {1}{5}}-0={\frac {1}{5}}}$ in this case. Similarly, when ${\displaystyle y=-1.4}$, ${\displaystyle \mathbb {P} (Y\leq y)=F_{5}(-1.4)=2/5}$, and ${\displaystyle \mathbb {P} (Y<y)=1/5}$. Thus, ${\displaystyle f_{Y}(y)={\frac {2}{5}}-{\frac {1}{5}}={\frac {1}{5}}}$ also in this case. With similar arguments, we can show that ${\displaystyle f_{Y}(y)={\frac {1}{5}}}$ also when ${\displaystyle y=0.8,1.9,{\text{ or }}2.3}$.

${\displaystyle \Box }$

Remark.

• Observe from (b) that the support of ${\displaystyle Y}$ contains exactly the numbers in the sample data, which are the realization of the random sample ${\displaystyle X_{1},\dotsc ,X_{5}}$. This shows that the probability ${\displaystyle 1/5}$ is "assigned" to each of ${\displaystyle X_{1},\dotsc ,X_{5}}$.

Theorem. (Glivenko–Cantelli theorem) As ${\displaystyle n\to \infty }$, ${\displaystyle \sup _{x\in \mathbb {R} }|F_{n}(x)-F(x)|\to 0}$ almost surely (a.s.).

Remark.

• ${\displaystyle \sup }$ stands for the supremum of a set (subject to some technical requirements), which means the least upper bound of the set: the smallest number that is greater than or equal to every element of the set.
• The meaning of ${\displaystyle \sup _{x\in \mathbb {R} }|F_{n}(x)-F(x)|}$ is the least upper bound of the set containing the values of ${\displaystyle |F_{n}(x)-F(x)|}$ over ${\displaystyle x\in \mathbb {R} }$.
• The supremum is similar to the concept of maximum (indeed, if maximum exists, then maximum is the same as supremum), but a difference between them is that sometimes supremum exists while maximum does not exist.
• For instance, the supremum of the set (or interval) ${\displaystyle [0,1)}$ is 1 (intuitively). However, the maximum of the set ${\displaystyle [0,1)}$ (i.e. the greatest element in the set) does not exist (notice that 1 is not included in this set) [1].
• The term "almost surely" means that this happens with probability 1. The details for the reason of calling this "almost surely" instead of "surely" involves some understanding of measure theory, and so is omitted here.
• Roughly speaking, this theorem tells us that ${\displaystyle F_{n}(x)}$ is a good estimate of ${\displaystyle F(x)}$, and an even better (or "closer") one when ${\displaystyle n}$ is large: since even the least upper bound of the absolute differences tends to zero, we intuitively expect each individual absolute difference ${\displaystyle |F_{n}(x)-F(x)|}$ to tend to zero as well.
• This theorem is sometimes referred to as the fundamental theorem of statistics, indicating its importance in statistics.
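The theorem can be illustrated by simulation. The following Python sketch (our own construction, using the standard fact that the supremum can be computed at the jump points of ${\displaystyle F_{n}}$) draws Uniform(0,1) samples, whose cdf is ${\displaystyle F(x)=x}$ on ${\displaystyle [0,1]}$, and prints the sup-distance for increasing ${\displaystyle n}$:

```python
# Illustration of Glivenko-Cantelli: sup_x |F_n(x) - F(x)| shrinks as n grows,
# here for Uniform(0,1), whose cdf is F(x) = x on [0, 1].
import random

random.seed(0)

def sup_distance(sample):
    """sup_x |F_n(x) - x| for a sample from Uniform(0,1)."""
    xs = sorted(sample)
    n = len(xs)
    # the sup is attained just before or just after a jump of F_n,
    # i.e. where F_n equals k/n or (k+1)/n at the k-th order statistic
    return max(max(abs(k / n - x), abs((k + 1) / n - x))
               for k, x in enumerate(xs))

for n in [10, 100, 10000]:
    sample = [random.random() for _ in range(n)]
    print(n, round(sup_distance(sample), 3))
```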

We have mentioned how we can approximate the cdf, and now we would like to estimate the pdf/pmf. Let us first discuss how to estimate the pmf.

For the discrete random variable ${\displaystyle X}$, from the empirical cdf, we know that each of ${\displaystyle X_{1},\dotsc ,X_{n}}$ is "assigned" the probability ${\displaystyle 1/n}$. Also, in view of the previous example, the empirical pmf is ${\displaystyle f_{n}(x)={\frac {\sum _{k=1}^{n}\mathbf {1} \{X_{k}=x\}}{n}}}$.

Remark.

• The empirical pmf ${\displaystyle f_{n}(x)}$ shows the relative frequency of occurrences of ${\displaystyle x}$, and therefore can approximate the probability of occurrences of ${\displaystyle x}$, which is the long-term relative frequency of occurrences of ${\displaystyle x}$.
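A short Python sketch of the empirical pmf as relative frequencies (the name `empirical_pmf` is illustrative only):

```python
# The empirical pmf f_n(x) = (1/n) * sum_k 1{X_k = x}, i.e. the relative
# frequency of each observed value; Counter does the counting for us.
from collections import Counter

def empirical_pmf(sample):
    n = len(sample)
    return {x: count / n for x, count in Counter(sample).items()}

rolls = [1, 2, 2, 6, 2, 1]   # e.g. six die rolls
# masses: 1/3 for the value 1, 1/2 for 2, 1/6 for 6
print(empirical_pmf(rolls))
```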

To discuss the estimation of pdf of continuous random variable, we need to define class intervals first.

Definition. (Class intervals) First, choose an integer ${\displaystyle i\geq 1}$, and a sequence of real numbers ${\displaystyle c_{0},c_{1},\dotsc ,c_{i}}$ such that ${\displaystyle c_{0}<c_{1}<\dotsb <c_{i}}$. Then, the class intervals are ${\displaystyle (c_{0},c_{1}],(c_{1},c_{2}],\dotsc ,(c_{i-1},c_{i}]}$.

For the continuous random variable ${\displaystyle X}$, construct class intervals for ${\displaystyle X}$ that form a non-overlapping partition of the interval ${\displaystyle [X_{\text{min}},X_{\text{max}}]}$, in which ${\displaystyle X_{\text{min}}}$ and ${\displaystyle X_{\text{max}}}$ are the minimum and maximum values in the sample. Then, the pdf

${\displaystyle f(x)\approx {\frac {F(c_{j})-F(c_{j-1})}{c_{j}-c_{j-1}}},\quad x\in (c_{j-1},c_{j}]{\text{ and }}j=1,2,\dotsc ,i,}$
when ${\displaystyle c_{j-1}}$ and ${\displaystyle c_{j}}$ are close, i.e. the length of each class interval is small. (Although the union of the above class intervals is ${\displaystyle (c_{0},c_{i}]}$ and thus the value ${\displaystyle c_{0}}$ is not included in the interval, it does not matter since the value of the pdf at ${\displaystyle c_{0}}$ does not affect the calculation of probability.) Here, ${\displaystyle c_{0}}$ is ${\displaystyle X_{\text{min}}}$ and ${\displaystyle c_{i}}$ is ${\displaystyle X_{\text{max}}}$.

Since ${\displaystyle F(c_{j})-F(c_{j-1})=\mathbb {P} (X\in (c_{j-1},c_{j}])\approx {\color {darkgreen}{\frac {\sum _{k=1}^{n}\mathbf {1} \{X_{k}\in (c_{j-1},c_{j}]\}}{n}}}}$ is the relative frequency of occurrences of the event ${\displaystyle \{X_{k}\in (c_{j-1},c_{j}]\}}$, we can rewrite the above expression as

${\displaystyle f(x)\approx h_{n}(x)={\frac {\color {darkgreen}\sum _{k=1}^{n}\mathbf {1} \{X_{k}\in (c_{j-1},c_{j}]\}}{{\color {darkgreen}n}(c_{j}-c_{j-1})}},\quad x\in (c_{j-1},c_{j}]{\text{ and }}j=1,2,\dotsc ,i}$
in which ${\displaystyle h_{n}(x)}$ is called the relative frequency histogram.

Since there are many possible ways to construct the class intervals, the value of ${\displaystyle h_{n}(x)}$ can differ even with the same ${\displaystyle n}$ and ${\displaystyle x}$. When ${\displaystyle n}$ is large and the length of each class interval is small, we will expect ${\displaystyle h_{n}(x)}$ to be a good estimate of ${\displaystyle f(x)}$ (the theoretical pdf).
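The construction of ${\displaystyle h_{n}(x)}$ can be sketched in Python as follows (the name `histogram_density` and the chosen cut points are our own):

```python
# The relative frequency histogram h_n: pick class intervals (c_{j-1}, c_j],
# count the sample points in each, and divide by n times the interval length.

def histogram_density(sample, cuts):
    """Return the list of h_n values, one per class interval (cuts[j-1], cuts[j]]."""
    n = len(sample)
    heights = []
    for lo, hi in zip(cuts, cuts[1:]):
        count = sum(1 for x in sample if lo < x <= hi)
        heights.append(count / (n * (hi - lo)))
    return heights

sample = [0.1, 0.2, 0.25, 0.6, 0.9]
heights = histogram_density(sample, cuts=[0.0, 0.5, 1.0])
print(heights)  # [1.2, 0.8]: three points in (0, 0.5], two in (0.5, 1]
# total area = 1.2 * 0.5 + 0.8 * 0.5 = 1, as in property (ii) below
```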

There are some properties related to the relative frequency histogram, as follows:

Proposition. (Properties of relative frequency histogram)

(i) ${\displaystyle h_{n}(x)\geq 0}$;

(ii) The total area bounded by ${\displaystyle h_{n}(x)}$ and the ${\displaystyle x}$-axis is one, i.e. ${\displaystyle \int _{c_{0}}^{c_{i}}h_{n}(x)\,dx=1}$ [2];

(iii) The probability of an event ${\displaystyle A}$ that is a union of some class intervals is ${\displaystyle \mathbb {P} (A)\approx \int _{A}^{}h_{n}(x)\,dx}$.

Proof.

(i) Since the indicator function is nonnegative (its value is either 0 or 1), ${\displaystyle n}$ is positive, and ${\displaystyle c_{j}>c_{j-1}}$ so ${\displaystyle c_{j}-c_{j-1}}$ is positive, we have ${\displaystyle h_{n}(x)\geq 0}$ by definition.

(ii)

{\displaystyle {\begin{aligned}\int _{c_{0}}^{c_{i}}h_{n}(x)\,dx&=\int _{c_{0}}^{c_{1}}h_{n}(x)\,dx+\int _{c_{1}}^{c_{2}}h_{n}(x)\,dx+\dotsb +\int _{c_{i-1}}^{c_{i}}h_{n}(x)\,dx\\&={\frac {1}{n}}\left(\int _{c_{0}}^{c_{1}}{\frac {\sum _{k=1}^{n}\mathbf {1} \{X_{k}\in (c_{0},c_{1}]\}}{c_{1}-c_{0}}}\,dx+\int _{c_{1}}^{c_{2}}{\frac {\sum _{k=1}^{n}\mathbf {1} \{X_{k}\in (c_{1},c_{2}]\}}{c_{2}-c_{1}}}\,dx+\dotsb +\int _{c_{i-1}}^{c_{i}}{\frac {\sum _{k=1}^{n}\mathbf {1} \{X_{k}\in (c_{i-1},c_{i}]\}}{c_{i}-c_{i-1}}}\,dx\right)\\&={\frac {1}{n}}\left({\frac {\sum _{k=1}^{n}\mathbf {1} \{X_{k}\in (c_{0},c_{1}]\}}{c_{1}-c_{0}}}\cdot (c_{1}-c_{0})+{\frac {\sum _{k=1}^{n}\mathbf {1} \{X_{k}\in (c_{1},c_{2}]\}}{c_{2}-c_{1}}}\cdot (c_{2}-c_{1})+\dotsb +{\frac {\sum _{k=1}^{n}\mathbf {1} \{X_{k}\in (c_{i-1},c_{i}]\}}{c_{i}-c_{i-1}}}\cdot (c_{i}-c_{i-1})\right)\\&={\frac {1}{n}}\left(\sum _{k=1}^{n}\mathbf {1} \{X_{k}\in (c_{0},c_{1}]\}+\sum _{k=1}^{n}\mathbf {1} \{X_{k}\in (c_{1},c_{2}]\}+\dotsb +\sum _{k=1}^{n}\mathbf {1} \{X_{k}\in (c_{i-1},c_{i}]\}\right)\\&={\frac {1}{n}}\left(\sum _{k=1}^{n}\mathbf {1} \{X_{k}\in (c_{0},c_{1}]\cup (c_{1},c_{2}]\cup \dotsb \cup (c_{i-1},c_{i}]\}\right)\\&={\frac {1}{n}}\left(\sum _{k=1}^{n}\mathbf {1} \{X_{k}\in \underbrace {(c_{0},c_{i}]} _{{\text{sample space of }}X}\}\right)\\&={\frac {1}{n}}\cdot \sum _{k=1}^{n}1\\&={\frac {1}{n}}\cdot n\\&=1.\end{aligned}}}
Here, ${\displaystyle c_{0}}$ is ${\displaystyle X_{\text{min}}}$ and ${\displaystyle c_{i}}$ is ${\displaystyle X_{\text{max}}}$.

(iii) We can "split" the integral in a similar way as in (ii), and then eventually the integral equals ${\displaystyle {\frac {1}{n}}\cdot \sum _{k=1}^{n}\mathbf {1} \{X_{k}\in A\}}$, which can approximate ${\displaystyle \mathbb {P} (A)}$ since it is the relative frequency of occurrences of the event ${\displaystyle \{X_{k}\in A\}}$.

${\displaystyle \Box }$

## Expectation

In this section, we will discuss some results about expectation, which involve some sort of inequalities. Let ${\displaystyle a}$ and ${\displaystyle b}$ be constants. Also, let ${\displaystyle \Omega }$ be the sample space of ${\displaystyle X}$.

Proposition. Let ${\displaystyle X}$ be a discrete or continuous random variable. If ${\displaystyle \mathbb {P} (a<X\leq b)=1}$, then ${\displaystyle a<\mathbb {E} [X]\leq b}$.

Proof. Assume ${\displaystyle \mathbb {P} (a<X\leq b)=1}$.

Case 1: ${\displaystyle X}$ is discrete.

By definition of expectation, ${\displaystyle \mathbb {E} [X]=\sum _{x\in \Omega }^{}xf(x)}$. Then, we have ${\displaystyle \sum _{x\in \Omega }^{}af(x)<\sum _{x\in \Omega }^{}xf(x)\leq \sum _{x\in \Omega }^{}bf(x)\Rightarrow a\sum _{x\in \Omega }^{}f(x)<\mathbb {E} [X]\leq b\sum _{x\in \Omega }^{}f(x)\Rightarrow a<\mathbb {E} [X]\leq b}$, since the condition ${\displaystyle \mathbb {P} (a<X\leq b)=1}$ gives ${\displaystyle \sum _{x\in \Omega }^{}f(x)=1}$.

Case 2: ${\displaystyle X}$ is continuous.

We have similarly ${\displaystyle \int _{\Omega }^{}af(x)\,dx<\int _{\Omega }^{}xf(x)\,dx\leq \int _{\Omega }^{}bf(x)\,dx\Rightarrow a<\mathbb {E} [X]\leq b}$, since the condition ${\displaystyle \mathbb {P} (a<X\leq b)=1}$ gives ${\displaystyle \int _{\Omega }^{}f(x)\,dx=1}$.

${\displaystyle \Box }$

Remark.

• We can interchange "${\displaystyle <}$" and "${\displaystyle \leq }$" without affecting the result. This can be seen from the proof.

Proposition. (Markov's inequality) Let ${\displaystyle X}$ be a continuous nonnegative random variable with finite mean ${\displaystyle \mathbb {E} [X]}$. Then, for each positive number ${\displaystyle a}$, ${\displaystyle \mathbb {P} (X\geq a)\leq {\frac {\mathbb {E} [X]}{a}}}$.

Proof.

${\displaystyle {\frac {\mathbb {E} [X]}{a}}={\frac {1}{a}}\int _{-\infty }^{\infty }\underbrace {xf(x)} _{\color {darkgreen}\geq 0}\,dx{\color {darkgreen}\geq }{\frac {1}{a}}\int _{a}^{\infty }xf(x)\,dx{\color {darkgreen}\geq }{\frac {1}{a}}\int _{a}^{\infty }af(x)\,dx=\int _{a}^{\infty }f(x)\,dx=\mathbb {P} (X\geq a),}$
as desired.

${\displaystyle \Box }$
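As a sanity check (not a proof), we can verify Markov's inequality empirically, e.g. with ${\displaystyle X\sim \operatorname {Exp} (1)}$ so that ${\displaystyle \mathbb {E} [X]=1}$:

```python
# A Monte Carlo sanity check of Markov's inequality on a nonnegative random
# variable: here X ~ Exponential(1), so E[X] = 1. Purely illustrative.
import random

random.seed(0)
xs = [random.expovariate(1.0) for _ in range(100_000)]
mean = sum(xs) / len(xs)

for a in [1.0, 2.0, 4.0]:
    p = sum(1 for x in xs if x >= a) / len(xs)   # empirical P(X >= a)
    assert p <= mean / a                          # the inequality holds
    print(a, round(p, 3), round(mean / a, 3))
```

Note how loose the bound is for small ${\displaystyle a}$: the true ${\displaystyle \mathbb {P} (X\geq 1)=e^{-1}\approx 0.368}$, while the bound gives 1.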

Corollary. (Chebyshev's inequality) Suppose ${\displaystyle \mathbb {E} [X^{2}]}$ is finite. Then, for each positive number ${\displaystyle a}$, ${\displaystyle \mathbb {P} (|X|\geq a)\leq {\frac {\mathbb {E} [X^{2}]}{a^{2}}}.}$

Proof. First, observe that ${\displaystyle X^{2}}$ is a nonnegative random variable. Then, applying Markov's inequality with the positive number ${\displaystyle a'=a^{2}}$, we have

${\displaystyle \mathbb {P} (X^{2}\geq a')\leq {\frac {\mathbb {E} [X^{2}]}{a'}}\implies \mathbb {P} (X^{2}\geq a^{2})\leq {\frac {\mathbb {E} [X^{2}]}{a^{2}}}\implies \mathbb {P} \left({\sqrt {X^{2}}}\geq {\sqrt {a^{2}}}\right)\leq {\frac {\mathbb {E} [X^{2}]}{a^{2}}}\implies \mathbb {P} (|X|\geq a)\leq {\frac {\mathbb {E} [X^{2}]}{a^{2}}}}$
Here, ${\displaystyle {\sqrt {a^{2}}}=a}$ in the last step since ${\displaystyle a}$ is positive.

${\displaystyle \Box }$

Proposition. (Jensen's inequality) Let ${\displaystyle X}$ be a continuous random variable. If ${\displaystyle g}$ is a convex function, then ${\displaystyle g\left(\mathbb {E} [X]\right)\leq \mathbb {E} [g(X)]}$.

Proof. Let ${\displaystyle L(x)=a+bx}$ be the tangent line of the function ${\displaystyle g(x)}$ at ${\displaystyle x=\mathbb {E} [X]}$ (or, if ${\displaystyle g}$ is not differentiable there, a supporting line). Then, since ${\displaystyle g}$ is convex, we have ${\displaystyle g(x)\geq L(x)}$ for each ${\displaystyle x}$ (informally, we can observe this graphically). As a result, we have

{\displaystyle {\begin{aligned}&&\int _{\Omega }^{}g(x)f(x)\,dx&\geq \int _{\Omega }^{}L(x)f(x)\,dx\\&\Rightarrow &\mathbb {E} [g(X)]&\geq \mathbb {E} [L(X)]\\&&&=\mathbb {E} [a+bX]\\&&&=a+b\mathbb {E} [X]\\&&&=L(\mathbb {E} [X])\\&&&=g(\mathbb {E} [X])&{\text{since }}L(x){\text{ is tangent of }}g(x){\text{ at }}x=\mathbb {E} [X],\end{aligned}}}
as desired.

${\displaystyle \Box }$
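For a quick numerical illustration with the convex function ${\displaystyle g(x)=x^{2}}$, Jensen's inequality says ${\displaystyle (\mathbb {E} [X])^{2}\leq \mathbb {E} [X^{2}]}$; the following sketch checks this for an empirical distribution (the helper name is ours):

```python
# Jensen's inequality with the convex g(x) = x**2: for any sample (viewed as
# an empirical distribution), g(mean) <= mean of g. Illustration only.

def check_jensen_square(xs):
    mean = sum(xs) / len(xs)
    mean_of_squares = sum(x * x for x in xs) / len(xs)
    return mean ** 2 <= mean_of_squares   # g(E[X]) <= E[g(X)]

print(check_jensen_square([1.0, 2.0, 3.0]))   # True: 4 <= 14/3
```

For this particular ${\displaystyle g}$, the gap ${\displaystyle \mathbb {E} [X^{2}]-(\mathbb {E} [X])^{2}}$ is exactly the variance, which is why the check always passes.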

Theorem. (Cauchy-Schwarz inequality) Suppose ${\displaystyle \mathbb {E} [X^{2}]}$ and ${\displaystyle \mathbb {E} [Y^{2}]}$ are finite. Then,

${\displaystyle (\mathbb {E} [XY])^{2}\leq \mathbb {E} [X^{2}]\mathbb {E} [Y^{2}]}$

Proof.

{\displaystyle {\begin{aligned}0&\leq \mathbb {E} [(X\mathbb {E} [Y^{2}]-Y\mathbb {E} [XY])^{2}]\\&={\color {darkgreen}\mathbb {E} [}X^{2}\underbrace {(\mathbb {E} [Y^{2}])^{2}} _{\text{constant}}-2XY\underbrace {\mathbb {E} [Y^{2}]\mathbb {E} [XY]} _{\text{constant}}+Y^{2}\underbrace {(\mathbb {E} [XY])^{2}} _{\text{constant}}{\color {darkgreen}]}\\&=(\mathbb {E} [Y^{2}])^{2}{\color {darkgreen}\mathbb {E} [}X^{2}{\color {darkgreen}]}-2\mathbb {E} [Y^{2}]\mathbb {E} [XY]{\color {darkgreen}\mathbb {E} [}XY{\color {darkgreen}]}+(\mathbb {E} [XY])^{2}{\color {darkgreen}\mathbb {E} [}Y^{2}{\color {darkgreen}]}\\&=\mathbb {E} [Y^{2}]\left(\mathbb {E} [X^{2}]\mathbb {E} [Y^{2}]-2(\mathbb {E} [XY])^{2}+(\mathbb {E} [XY])^{2}\right)\\&=\mathbb {E} [Y^{2}]\left(\mathbb {E} [X^{2}]\mathbb {E} [Y^{2}]-(\mathbb {E} [XY])^{2}\right)\\\end{aligned}}}
Since ${\displaystyle \mathbb {E} [Y^{2}]\geq 0}$, we must have ${\displaystyle \mathbb {E} [X^{2}]\mathbb {E} [Y^{2}]-(\mathbb {E} [XY])^{2}\geq 0\Leftrightarrow (\mathbb {E} [XY])^{2}\leq \mathbb {E} [X^{2}]\mathbb {E} [Y^{2}]}$.

${\displaystyle \Box }$
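Again as an illustration (not a proof), we can check the inequality on paired data under the empirical distribution (the helper name is ours):

```python
# A numeric check of (E[XY])^2 <= E[X^2] E[Y^2] under the empirical
# distribution of paired data.

def cauchy_schwarz_holds(xs, ys):
    n = len(xs)
    e_xy = sum(x * y for x, y in zip(xs, ys)) / n
    e_x2 = sum(x * x for x in xs) / n
    e_y2 = sum(y * y for y in ys) / n
    return e_xy ** 2 <= e_x2 * e_y2

print(cauchy_schwarz_holds([1.0, -2.0, 3.0], [0.5, 4.0, -1.0]))  # True
```

Equality holds when one sample is a scalar multiple of the other, e.g. `ys = [2 * x for x in xs]`.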

Example. (Covariance inequality) Use the Cauchy-Schwarz inequality for expectations (above theorem) to prove the covariance inequality (it is sometimes simply called the Cauchy-Schwarz inequality):

${\displaystyle {\big (}\operatorname {Cov} (X,Y){\big )}^{2}\leq \operatorname {Var} (X)\operatorname {Var} (Y)}$
(assuming the existence of the covariance and variances)

Proof. Let ${\displaystyle X'=X-\mathbb {E} [X]}$ and ${\displaystyle Y'=Y-\mathbb {E} [Y]}$. Then, ${\displaystyle \mathbb {E} [(X')^{2}]}$ and ${\displaystyle \mathbb {E} [(Y')^{2}]}$ are finite. Hence, by the Cauchy-Schwarz inequality,

${\displaystyle (\mathbb {E} [X'Y'])^{2}\leq \mathbb {E} [(X')^{2}]\mathbb {E} [(Y')^{2}]\Leftrightarrow {\big (}\mathbb {E} [(X-\mathbb {E} [X])(Y-\mathbb {E} [Y])]{\big )}^{2}\leq \mathbb {E} [(X-\mathbb {E} [X])^{2}]\mathbb {E} [(Y-\mathbb {E} [Y])^{2}]{\overset {\text{ def }}{\Leftrightarrow }}{\big (}\operatorname {Cov} (X,Y){\big )}^{2}\leq \operatorname {Var} (X)\operatorname {Var} (Y).}$

${\displaystyle \Box }$

## Convergence

Before discussing convergence, we will define some terms that will be used later.

Definition. (Statistics) A statistic is a function of the random sample.

Remark.

• The random sample consists of ${\displaystyle n}$ (${\displaystyle n}$ is sample size) random variables ${\displaystyle X_{1},\dotsc ,X_{n}}$.
• Two important statistics are the sample mean ${\displaystyle {\overline {X}}={\frac {\sum _{i=1}^{n}X_{i}}{n}}}$ and the sample variance ${\displaystyle S^{2}={\frac {\sum _{i=1}^{n}(X_{i}-{\overline {X}})^{2}}{n}}}$.
• In many other places, ${\displaystyle S^{2}}$ is used to denote ${\displaystyle {\frac {\sum _{i=1}^{n}(X_{i}-{\overline {X}})^{2}}{n-1}}}$, the unbiased sample variance. In fact, ${\displaystyle S^{2}}$ here is biased (we will discuss what "(un)biased" means in the next chapter). Warning: we should be careful about this difference in definitions.
• Both of ${\displaystyle {\overline {X}}}$ and ${\displaystyle S^{2}}$ are random variables, since random variables are involved in them.

In a particular sample, say ${\displaystyle x_{1},\dotsc ,x_{n}}$, we observe definite values of their sample mean, ${\displaystyle {\overline {x}}={\frac {\sum _{i=1}^{n}x_{i}}{n}}}$, and sample variance, ${\displaystyle s^{2}={\frac {\sum _{i=1}^{n}(x_{i}-{\overline {x}})^{2}}{n}}}$. However, each of the values is only one realization of the respective random variables ${\displaystyle {\overline {X}}}$ and ${\displaystyle S^{2}}$. We should notice the difference between these definite values (not random variables) and the statistics (random variables).
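The two conventions for the sample variance can be compared directly; the Python standard library happens to implement both (`statistics.pvariance` divides by ${\displaystyle n}$, `statistics.variance` by ${\displaystyle n-1}$):

```python
# Sample mean and the (biased, divide-by-n) sample variance as defined above,
# next to the divide-by-(n-1) version used elsewhere.
import statistics

def sample_mean(xs):
    return sum(xs) / len(xs)

def sample_variance_biased(xs):
    m = sample_mean(xs)
    return sum((x - m) ** 2 for x in xs) / len(xs)   # divide by n

xs = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]
print(sample_mean(xs))                 # 5.0
print(sample_variance_biased(xs))      # 4.0   (divide by n = 8)
print(statistics.variance(xs))         # 32/7  (divide by n - 1 = 7)
```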

To explain the definitions of the sample mean ${\displaystyle {\overline {X}}}$ and sample variance ${\displaystyle S^{2}}$ more intuitively, consider the following.

Recall that the empirical cdf ${\displaystyle F_{n}(x)}$ assigns probability ${\displaystyle {\frac {1}{n}}}$ to each of the random sample ${\displaystyle X_{1},\dotsc ,X_{n}}$. Thus, by the definition of mean and variance, the mean of a random variable, say ${\displaystyle Y}$, with this cdf ${\displaystyle F_{n}(x)}$ (and hence with the corresponding pmf ${\displaystyle f_{n}(x)}$) is ${\displaystyle \sum _{i=1}^{n}\left(X_{i}\cdot {\frac {1}{n}}\right)={\overline {X}}}$. Similarly, the variance of ${\displaystyle Y}$ is ${\displaystyle \sum _{i=1}^{n}\left((X_{i}-{\overline {X}})^{2}\cdot {\frac {1}{n}}\right)=S^{2}}$. In other words, the mean and variance of the empirical distribution, which corresponds to the random sample, are the sample mean ${\displaystyle {\overline {X}}}$ and the sample variance ${\displaystyle S^{2}}$ respectively, which is quite natural, right?

Remark.

• Here, we use "${\displaystyle X_{i}}$" rather than the usual "${\displaystyle x_{i}}$" in the expression, and the mean and variance are also random variables. This is because the sample space of the empirical cdf consists of random variables ${\displaystyle X_{1},\dotsc ,X_{n}}$, rather than definite values ${\displaystyle x_{1},\dotsc ,x_{n}}$.

Also, recall that the empirical cdf ${\displaystyle F_{n}(x)}$ can well approximate the cdf of ${\displaystyle X}$, ${\displaystyle F(x)}$, when ${\displaystyle n}$ is large. Since ${\displaystyle {\overline {X}}}$ and ${\displaystyle S^{2}}$ are the mean and variance of a random variable with cdf ${\displaystyle F_{n}(x)}$, it is natural to expect that ${\displaystyle {\overline {X}}}$ and ${\displaystyle S^{2}}$ can well approximate the mean and variance of ${\displaystyle X}$.

### Convergence in probability

Definition. (Convergence in probability) Let ${\displaystyle Z_{1},Z_{2},\dotsc }$ be a sequence of random variables. The sequence converges in probability to a random variable ${\displaystyle Z}$, if for each ${\displaystyle \varepsilon >0}$, ${\displaystyle \mathbb {P} (|Z_{n}-Z|>\varepsilon )\to 0}$ as ${\displaystyle n\to \infty }$. If this is the case, we write this as ${\displaystyle Z_{n}\;{\overset {p}{\to }}\;Z}$ for simplicity.

Remark.

• We may compare this definition with the definition of convergence of a deterministic sequence ${\displaystyle (a_{n}:n\in \mathbb {N} )}$:
${\displaystyle a_{n}\to a}$ as ${\displaystyle n\to \infty }$ if for each ${\displaystyle \varepsilon >0}$, there exists an integer ${\displaystyle N>0}$ (which is a function of ${\displaystyle \varepsilon }$), such that when ${\displaystyle n\geq N}$, ${\displaystyle |a_{n}-a|<\varepsilon }$ (surely).
• For comparison, we may rewrite the above definition as
${\displaystyle Z_{n}\;{\overset {p}{\to }}\;Z}$ as ${\displaystyle n\to \infty }$ if for each ${\displaystyle \varepsilon >0}$, there exists an integer ${\displaystyle N>0}$ (which is a function of ${\displaystyle \varepsilon }$), such that when ${\displaystyle n\geq N}$ , the probability for ${\displaystyle |Z_{n}-Z|<\varepsilon }$ is very close to one (but this event does not happen surely).
• ${\displaystyle \varepsilon }$ specifies the accuracy of the convergence: if higher accuracy is desired, ${\displaystyle \varepsilon }$ is set to a smaller (positive) value. When ${\displaystyle n}$ is sufficiently large, the probability in the definition is very close to zero, and we say that convergence with the accuracy specified by ${\displaystyle \varepsilon }$ is "achieved".

The following theorem, namely the weak law of large numbers, is an important theorem related to convergence in probability.

Theorem. (Weak law of large numbers (Weak LLN)) Let ${\displaystyle X_{1},\dotsc ,X_{n}}$ be a sequence of independent random variables with the same finite mean ${\displaystyle \mu }$ and same finite variance ${\displaystyle \sigma ^{2}}$. Then, as ${\displaystyle n\to \infty }$, ${\displaystyle {\overline {X}}\;{\overset {p}{\to }}\;\mu }$.

Proof. We use ${\displaystyle S_{n}}$ to denote ${\displaystyle \sum _{i=1}^{n}X_{i}}$.

By definition, ${\displaystyle {\overline {X}}\;{\overset {p}{\to }}\;\mu }$ as ${\displaystyle n\to \infty }$ is equivalent to ${\displaystyle \mathbb {P} \left(\left|{\frac {S_{n}}{n}}-\mu \right|>\varepsilon \right)\to 0}$ as ${\displaystyle n\to \infty }$.

By Chebyshev's inequality, we have

{\displaystyle {\begin{aligned}\mathbb {P} \left(\left|{\frac {S_{n}}{n}}-\mu \right|>\varepsilon \right)&\leq {\frac {1}{\varepsilon ^{2}}}\mathbb {E} \left[\left({\frac {S_{n}}{n}}-\mu \right)^{2}\right]\\&={\frac {1}{\varepsilon ^{2}}}\mathbb {E} \left[\left({\frac {S_{n}-n\mu }{\color {darkgreen}n}}\right)^{2}\right]\\&={\frac {1}{{\color {darkgreen}n^{2}}\varepsilon ^{2}}}\mathbb {E} \left[\left(S_{n}-n\mu \right)^{2}\right]\\&={\frac {1}{n^{2}\varepsilon ^{2}}}\mathbb {E} \left[\left(\sum _{i=1}^{n}(X_{i}-\mu )\right)^{2}\right]\\&={\frac {1}{n^{2}\varepsilon ^{2}}}\mathbb {E} \left[\sum _{i=1}^{n}\sum _{j=1}^{n}(X_{i}-\mu )(X_{j}-\mu )\right]\\&={\frac {1}{n^{2}\varepsilon ^{2}}}\left(\mathbb {E} \left[\sum _{i=j=1}^{n}(X_{i}-\mu )^{2}\right]+\mathbb {E} \left[\sum _{i=1}^{n}\sum _{j\neq i,j=1}^{n}(X_{i}-\mu )(X_{j}-\mu )\right]\right)\\\end{aligned}}}

Since ${\displaystyle X_{1},X_{2},\dotsc }$ are independent (and hence functions of them are also independent) and the expectation is multiplicative under independence,

{\displaystyle {\begin{aligned}{\frac {1}{n^{2}\varepsilon ^{2}}}\left(\mathbb {E} \left[\sum _{i=j=1}^{n}(X_{i}-\mu )^{2}\right]+\mathbb {E} \left[\sum _{i=1}^{n}\sum _{j\neq i,j=1}^{n}(X_{i}-\mu )(X_{j}-\mu )\right]\right)&={\frac {1}{n^{2}\varepsilon ^{2}}}\left(\mathbb {E} \left[\sum _{i=j=1}^{n}(X_{i}-\mu )^{2}\right]+\sum _{i=1}^{n}\sum _{j\neq i,j=1}^{n}\underbrace {\mathbb {E} [X_{i}-\mu ]} _{=\mu -\mu =0}\underbrace {\mathbb {E} [X_{j}-\mu ]} _{=\mu -\mu =0}\right)\\&={\frac {1}{n^{2}\varepsilon ^{2}}}\cdot \sum _{i=1}^{n}\underbrace {\mathbb {E} \left[(X_{i}-\mu )^{2}\right]} _{=\sigma ^{2}}\\&={\frac {n\sigma ^{2}}{n^{2}\varepsilon ^{2}}}\\&={\frac {\sigma ^{2}}{n\varepsilon ^{2}}}\\&\to 0&{\text{as }}n\to \infty .\end{aligned}}}
So, the probability ${\displaystyle \mathbb {P} \left(\left|{\frac {S_{n}}{n}}-\mu \right|>\varepsilon \right)}$ is less than or equal to an expression that tends to 0 as ${\displaystyle n\to \infty }$. Since the probability is nonnegative (${\displaystyle \geq 0}$), it follows that the probability also tends to 0 as ${\displaystyle n\to \infty }$.

${\displaystyle \Box }$
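The weak LLN can be illustrated by simulating fair-die rolls, whose mean is ${\displaystyle \mu =3.5}$ (a sketch with a fixed seed, our own construction):

```python
# Simulating the weak law of large numbers: the sample mean of fair-die rolls
# approaches mu = 3.5 as n grows. Illustration only.
import random

random.seed(0)

def running_mean(n):
    rolls = [random.randint(1, 6) for _ in range(n)]
    return sum(rolls) / n

for n in [10, 1000, 100_000]:
    print(n, running_mean(n))   # the means drift toward 3.5
```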

Remark.

• There is also the strong law of large numbers, which is related to almost sure convergence (a stronger mode of convergence, i.e. it implies convergence in probability).

There are also some properties of convergence in probability that help us determine what a more complex expression converges to.

Proposition. (Properties of convergence in probability) If ${\displaystyle X_{n}\;{\overset {p}{\to }}\;X}$ and ${\displaystyle Y_{n}\;{\overset {p}{\to }}\;Y}$, then

• (linearity) ${\displaystyle aX_{n}+bY_{n}\;{\overset {p}{\to }}\;aX+bY}$ where ${\displaystyle a,b}$ are constants;
• (multiplicativity) ${\displaystyle X_{n}Y_{n}\;{\overset {p}{\to }}\;XY}$;
• ${\displaystyle X_{n}/Y_{n}\;{\overset {p}{\to }}\;X/Y}$ given that ${\displaystyle Y_{n}\neq 0}$ and ${\displaystyle Y\neq 0}$;
• (continuous mapping theorem) if ${\displaystyle g}$ is a continuous function, then ${\displaystyle g(X_{n})\;{\overset {p}{\to }}\;g(X)}$ (and ${\displaystyle g(Y_{n})\;{\overset {p}{\to }}\;g(Y)}$).

Proof. Brief idea: assume ${\displaystyle X_{n}\;{\overset {p}{\to }}\;X}$ and ${\displaystyle Y_{n}\;{\overset {p}{\to }}\;Y}$. The continuous mapping theorem is proven first so that it can be used in the proofs of the other properties (its proof is omitted here). It can also be shown that ${\displaystyle (X_{n},Y_{n})\;{\overset {p}{\to }}\;(X,Y)}$ (joint convergence in probability; the definition is similar, except that the random variables become ordered pairs, so "${\displaystyle |Z_{n}-Z|}$" is interpreted as the distance between the two points in the Cartesian plane represented by the ordered pairs).

After that, we define ${\displaystyle g(z_{1},z_{2})=az_{1}+bz_{2}}$, ${\displaystyle g(z_{1},z_{2})=z_{1}z_{2}}$, and ${\displaystyle g(z_{1},z_{2})=z_{1}/z_{2}}$ respectively, where each of these functions is continuous, and ${\displaystyle a,b}$ are constants. Then, applying the continuous mapping theorem with each of these functions gives us the first three results.

${\displaystyle \Box }$

### Convergence in distribution

Definition. (Convergence in distribution) Let ${\displaystyle Z_{1},Z_{2},\dotsc }$ be a sequence of random variables. The sequence converges in distribution to a random variable ${\displaystyle Z}$ if as ${\displaystyle n\to \infty }$, ${\displaystyle G_{n}(x)\to G(x)}$ for each ${\displaystyle x}$ at which ${\displaystyle G(x)}$ is continuous, where ${\displaystyle G_{n}(x)}$ and ${\displaystyle G(x)}$ are the cdf of ${\displaystyle Z_{n}}$ and ${\displaystyle Z}$ respectively. If this is the case, we write this as ${\displaystyle Z_{n}\;{\overset {d}{\to }}\;Z}$ for simplicity.

Remark.

• The requirement for ${\displaystyle G(x)}$ to be continuous is added so that the convergence in distribution still holds even if the convergence of cdf's fails at some points at which ${\displaystyle G(x)}$ is discontinuous.
• We may alternatively express the definition as ${\displaystyle \lim _{n\to \infty }G_{n}(x)=G(x)}$ which has the same meaning as ${\displaystyle G_{n}(x)\to G(x)}$ as ${\displaystyle n\to \infty }$.
• It can be shown that convergence in probability implies convergence in distribution: if ${\displaystyle X_{n}\;{\overset {p}{\to }}\;X}$, then ${\displaystyle X_{n}\;{\overset {d}{\to }}\;X}$. The converse is false in general, but it holds when the limit is a constant: if ${\displaystyle X_{n}\;{\overset {d}{\to }}\;c}$ where ${\displaystyle c}$ is a constant, then ${\displaystyle X_{n}\;{\overset {p}{\to }}\;c}$.

A very important theorem in statistics related to convergence in distribution is the central limit theorem.

Theorem. (Central limit theorem (CLT)) Let ${\displaystyle X_{1},X_{2},\dotsc }$ be a sequence of independent and identically distributed random variables with finite mean ${\displaystyle \mu }$ and finite nonzero variance ${\displaystyle \sigma ^{2}}$. Then, as ${\displaystyle n\to \infty }$, ${\displaystyle {\frac {{\overline {X}}-\mathbb {E} [{\overline {X}}]}{\sqrt {\operatorname {Var} ({\overline {X}})}}}={\frac {{\sqrt {n}}({\overline {X}}-\mu )}{\sigma }}\;{\overset {d}{\to }}\;Z}$, in which ${\displaystyle Z}$ follows the standard normal distribution, ${\displaystyle {\mathcal {N}}(0,1)}$.

Proof. A (lengthy) proof can be found in Probability/Transformation of Random Variables#Central limit theorem.

${\displaystyle \Box }$
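The theorem can be checked empirically by simulation. Below is a minimal Python sketch (NumPy assumed; a similar experiment can be run in R): standardized means of exponential samples are compared against standard normal probabilities. The choice of the exponential distribution and the simulation sizes are arbitrary illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(0)

# Exponential(1) has mean mu = 1 and variance sigma^2 = 1 (an arbitrary,
# clearly non-normal choice for the underlying distribution).
mu, sigma, n, reps = 1.0, 1.0, 500, 20000

# Repeat the experiment `reps` times: draw a sample of size n and
# standardize its mean as in the theorem.
samples = rng.exponential(scale=1.0, size=(reps, n))
z = np.sqrt(n) * (samples.mean(axis=1) - mu) / sigma

# If the CLT holds, z should behave like N(0, 1): compare empirical
# probabilities with Phi(0) = 0.5 and Phi(1.96) - Phi(-1.96) ~ 0.95.
print(np.mean(z <= 0), np.mean(np.abs(z) <= 1.96))
```

Even though the underlying exponential distribution is heavily skewed, the standardized mean is already close to standard normal at this sample size.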

There are also some properties of convergence in distribution, but they differ somewhat from the properties of convergence in probability. These properties are given by Slutsky's theorem, together with the continuous mapping theorem.

Theorem. (Continuous mapping theorem) If ${\displaystyle X_{n}\;{\overset {d}{\to }}\;X}$, then

${\displaystyle g(X_{n})\;{\overset {d}{\to }}\;g(X)}$
given that ${\displaystyle g}$ is a continuous function.

Proof. Omitted.

${\displaystyle \Box }$
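As a quick numerical illustration of the continuous mapping theorem with ${\displaystyle g(x)=x^{2}}$: if ${\displaystyle Z_{n}\;{\overset {d}{\to }}\;{\mathcal {N}}(0,1)}$ (here obtained via the central limit theorem), then ${\displaystyle Z_{n}^{2}}$ converges in distribution to the chi-squared distribution with one degree of freedom. A minimal Python sketch (NumPy assumed; the uniform underlying distribution is an arbitrary choice):

```python
import numpy as np

rng = np.random.default_rng(1)

# Build Z_n ->d N(0,1) via the CLT: standardized means of Uniform(0,1)
# samples (mean 1/2, variance 1/12).
n, reps = 400, 20000
samples = rng.uniform(size=(reps, n))
z = np.sqrt(n) * (samples.mean(axis=1) - 0.5) / np.sqrt(1 / 12)

# Continuous mapping with g(x) = x^2: z**2 should be approximately
# chi-squared with 1 degree of freedom (mean 1, 0.95-quantile ~ 3.841).
w = z ** 2
print(np.mean(w), np.mean(w <= 3.841))
```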

Theorem. (Slutsky's theorem) If ${\displaystyle X_{n}\;{\overset {d}{\to }}\;X}$ and ${\displaystyle Y_{n}\;{\overset {p}{\to }}\;c}$ where ${\displaystyle c}$ is a constant, then

• ${\displaystyle X_{n}+Y_{n}\;{\overset {d}{\to }}\;X+c}$;
• ${\displaystyle X_{n}Y_{n}\;{\overset {d}{\to }}\;cX}$;
• ${\displaystyle X_{n}/Y_{n}\;{\overset {d}{\to }}\;X/c}$ given that ${\displaystyle c\neq 0}$.

Proof. Brief idea: Assume ${\displaystyle X_{n}\;{\overset {d}{\to }}\;X}$ and ${\displaystyle Y_{n}\;{\overset {p}{\to }}\;c}$. Then, it can be shown that ${\displaystyle (X_{n},Y_{n})\;{\overset {d}{\to }}\;(X,c)}$ (joint convergence in distribution; the definition is similar, except that the cdf's become joint cdf's of ordered pairs). After that, we define ${\displaystyle g(z_{1},z_{2})=z_{1}+z_{2}}$, ${\displaystyle g(z_{1},z_{2})=z_{1}z_{2}}$, and ${\displaystyle g(z_{1},z_{2})=z_{1}/z_{2}}$ respectively, where each of these functions is continuous. Applying the continuous mapping theorem to each of them then gives the three desired results.

${\displaystyle \Box }$
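A classic use of Slutsky's theorem is replacing ${\displaystyle \sigma }$ by the sample standard deviation ${\displaystyle S}$ in the central limit theorem: with ${\displaystyle X_{n}={\sqrt {n}}({\overline {X}}-\mu )/\sigma \;{\overset {d}{\to }}\;Z}$ and ${\displaystyle Y_{n}=S/\sigma \;{\overset {p}{\to }}\;1}$, the third result gives ${\displaystyle {\sqrt {n}}({\overline {X}}-\mu )/S\;{\overset {d}{\to }}\;Z}$. A minimal Python sketch of this (NumPy assumed; the exponential data are an arbitrary illustrative choice):

```python
import numpy as np

rng = np.random.default_rng(2)

# Exponential with scale 2: mu = 2, sigma = 2 (illustrative choice).
mu, n, reps = 2.0, 400, 20000
samples = rng.exponential(scale=2.0, size=(reps, n))
xbar = samples.mean(axis=1)
s = samples.std(axis=1, ddof=1)   # sample standard deviation, S ->p sigma

# X_n = sqrt(n)(Xbar - mu)/sigma ->d N(0,1) and Y_n = S/sigma ->p 1,
# so by the third result X_n / Y_n = sqrt(n)(Xbar - mu)/S ->d N(0,1).
t = np.sqrt(n) * (xbar - mu) / s
print(np.mean(np.abs(t) <= 1.96))
```

The printed proportion should be close to ${\displaystyle \mathbb {P} (|Z|\leq 1.96)\approx 0.95}$, even though ${\displaystyle \sigma }$ was never used directly.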

Remark.

• Notice that the assumption mentions ${\displaystyle Y_{n}\;{\overset {\color {darkgreen}p}{\to }}\;c}$ rather than ${\displaystyle Y_{n}\;{\overset {\color {darkgreen}d}{\to }}\;c}$. (Since the limit ${\displaystyle c}$ is a constant, the two conditions are actually equivalent.)

## Resampling

By resampling, we mean creating new samples based on an existing sample. The following gives a general overview of the resampling procedure.

Suppose ${\displaystyle X_{1},\dotsc ,X_{n}}$ is a random sample from the distribution of a random variable ${\displaystyle X}$ with cdf ${\displaystyle F(x)}$. Let ${\displaystyle x_{1},\dotsc ,x_{n}}$ be a corresponding realization of the random sample ${\displaystyle X_{1},\dotsc ,X_{n}}$. Based on this realization, we also have a realization of the empirical cdf: ${\displaystyle {\frac {1}{n}}\sum _{k=1}^{n}\mathbf {1} \{x_{k}\leq x\}}$ [3]. Since this is a realization of the empirical cdf, by the Glivenko-Cantelli theorem it is a good estimate of the cdf ${\displaystyle F(x)}$ when ${\displaystyle n}$ is large [4]. In other words, if we denote by ${\displaystyle X^{*}}$ the random variable whose cdf is this realization of the empirical cdf, then ${\displaystyle X^{*}}$ and ${\displaystyle X}$ have similar distributions when ${\displaystyle n}$ is large.
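The Glivenko-Cantelli behaviour can be observed directly. The following minimal Python sketch (NumPy assumed) computes a realization of the empirical cdf for Uniform(0, 1) samples, whose cdf is simply ${\displaystyle F(x)=x}$, and shows that the largest gap between the two over a grid shrinks as ${\displaystyle n}$ grows:

```python
import numpy as np

rng = np.random.default_rng(3)

def ecdf(sample, x):
    """Realization of the empirical cdf: (1/n) * #{k : x_k <= x}."""
    sample = np.asarray(sample)
    return np.mean(sample[:, None] <= x, axis=0)

# Uniform(0, 1) has cdf F(x) = x on [0, 1], so the estimate is easy to check.
grid = np.linspace(0.0, 1.0, 201)
gaps = {}
for n in (50, 5000):
    sample = rng.uniform(size=n)
    # Largest deviation between the realized empirical cdf and F on the grid.
    gaps[n] = np.max(np.abs(ecdf(sample, grid) - grid))
print(gaps)
```

The maximum gap for ${\displaystyle n=5000}$ is much smaller than for ${\displaystyle n=50}$, consistent with the (uniform) convergence asserted by the theorem.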

Notice that a realization of the empirical cdf is a discrete cdf (since the support ${\displaystyle x_{1},\dotsc ,x_{n}}$ is finite). We now draw a random sample (called the bootstrap (or resampling) random sample) with sample size ${\displaystyle B}$ (called the bootstrap sample size), ${\displaystyle X_{1}^{*},\dotsc ,X_{B}^{*}}$, from the distribution of the random variable ${\displaystyle X^{*}}$ (since ${\displaystyle X^{*}}$ arises from sampling from ${\displaystyle X}$, sampling from ${\displaystyle X^{*}}$ is called resampling).

Then, the relative frequency histogram of ${\displaystyle X_{1}^{*},\dotsc ,X_{B}^{*}}$ should be close to the corresponding realization of the empirical pmf of ${\displaystyle X^{*}}$ (obtained from the realization of the empirical cdf of ${\displaystyle X^{*}}$), which in turn is close to the pdf ${\displaystyle f(x)}$ of ${\displaystyle X}$. This means the relative frequency histogram of ${\displaystyle X_{1}^{*},\dotsc ,X_{B}^{*}}$ is close to the pdf ${\displaystyle f(x)}$ of ${\displaystyle X}$.

In particular, since the cdf of ${\displaystyle X^{*}}$, ${\displaystyle F_{n}(x)}$, assigns probability ${\displaystyle 1/n}$ to each of ${\displaystyle x_{1},\dotsc ,x_{n}}$ [5], the pmf of ${\displaystyle X^{*}}$ is ${\displaystyle \mathbb {P} (X^{*}=x_{i})={\frac {1}{n}},\quad i=1,2,\dotsc ,n}$. This pmf is quite simple, which makes calculations involving it simpler. For example, in the following we want to know the distribution of ${\displaystyle T^{*}=g(X_{1}^{*},\dotsc ,X_{n}^{*})}$, and this simple pmf makes the resulting distribution also quite simple.

Remark. Things involved in the bootstrap method ("bootstrapped" things) usually carry an additional "*" in their notation.

In the following, we will discuss an application of the bootstrap method (or resampling) mentioned above, namely using the bootstrap method to approximate the distribution of a statistic ${\displaystyle T=g(X_{1},X_{2},\dotsc ,X_{n})}$ (where ${\displaystyle g}$ is a function whose inputs are the random variables of the sample). The reason for approximating, rather than finding the distribution exactly, is that the latter is usually infeasible (or too complicated).

To do this, consider the "bootstrapped statistic" ${\displaystyle T^{*}=g(X_{1}^{*},X_{2}^{*},\dotsc ,X_{n}^{*})}$ and the statistic ${\displaystyle T=g(X_{1},X_{2},\dotsc ,X_{n})}$, where ${\displaystyle X_{1}^{*},X_{2}^{*},\dotsc ,X_{n}^{*}}$ is the bootstrap random sample (with bootstrap sample size ${\displaystyle n}$) from the distribution of ${\displaystyle X^{*}}$, and ${\displaystyle X_{1},X_{2},\dotsc ,X_{n}}$ is the random sample from the distribution of ${\displaystyle X}$. When ${\displaystyle n}$ is large, since the distribution of ${\displaystyle X^{*}}$ is similar to that of ${\displaystyle X}$, the bootstrap random sample ${\displaystyle X_{1}^{*},X_{2}^{*},\dotsc ,X_{n}^{*}}$ and the random sample ${\displaystyle X_{1},X_{2},\dotsc ,X_{n}}$ are also similar. It follows that ${\displaystyle T^{*}}$ and ${\displaystyle T}$ are similar as well, or more precisely, that the distributions of ${\displaystyle T^{*}}$ and ${\displaystyle T}$ are close. As a result, we can utilize the distribution of ${\displaystyle T^{*}}$ (which is easier to find and simpler, since the pmf of ${\displaystyle X^{*}}$ is simple as above) to approximate the distribution of ${\displaystyle T}$. A procedure to do this is as follows:

1. Generate a bootstrapped realization ${\displaystyle x_{1}^{*},x_{2}^{*},\dotsc ,x_{n}^{*}}$ from the bootstrap random sample ${\displaystyle X_{1}^{*},X_{2}^{*},\dotsc ,X_{n}^{*}}$, which is from the distribution of ${\displaystyle X^{*}}$.
2. Calculate a realization of the bootstrapped statistic ${\displaystyle T^{*}}$, ${\displaystyle t^{*}=g(x_{1}^{*},x_{2}^{*},\dotsc ,x_{n}^{*})}$.
3. Repeat steps 1 and 2 ${\displaystyle j}$ times to get a sequence of ${\displaystyle j}$ realizations of ${\displaystyle T^{*}}$: ${\displaystyle t_{1}^{*},t_{2}^{*},\dotsc ,t_{j}^{*}}$.
4. Plot the relative frequency histogram of the ${\displaystyle j}$ realizations ${\displaystyle t_{1}^{*},t_{2}^{*},\dotsc ,t_{j}^{*}}$.

This histogram of the ${\displaystyle j}$ realizations (which form a realization of a random sample of size ${\displaystyle j}$ from the distribution of ${\displaystyle T^{*}}$) is close to the pmf of ${\displaystyle T^{*}}$ [6], and thus close to the pmf of ${\displaystyle T}$.
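The four-step procedure above can be sketched in Python (NumPy assumed). Here the underlying exponential data, the choice of the sample mean as the statistic ${\displaystyle g}$, and the helper name `bootstrap_statistic` are all illustrative assumptions, not part of the method itself:

```python
import numpy as np

rng = np.random.default_rng(4)

# An observed realization x_1, ..., x_n of the original random sample
# (simulated here from an exponential distribution purely for illustration).
n = 30
x = rng.exponential(scale=2.0, size=n)

def bootstrap_statistic(x, g, j, rng):
    """Steps 1-3: draw j bootstrap realizations from the realized empirical
    cdf (i.e. sample the observed values with replacement, each value having
    probability 1/n) and evaluate the statistic g on each."""
    n = len(x)
    t_star = np.empty(j)
    for i in range(j):
        x_star = rng.choice(x, size=n, replace=True)  # step 1
        t_star[i] = g(x_star)                         # step 2
    return t_star                                     # step 3 (j realizations)

# Bootstrapped realizations of T = Xbar; step 4 would plot their
# relative frequency histogram to approximate the distribution of T.
t_star = bootstrap_statistic(x, np.mean, j=5000, rng=rng)
print(t_star.mean(), t_star.std())
```

Sampling the observed values with replacement, each with probability ${\displaystyle 1/n}$, is exactly drawing from the simple pmf of ${\displaystyle X^{*}}$ described above.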

1. Intuitively, given a candidate for the maximum, we can always add "a little bit" to it to get a greater candidate. So, there is no "greatest" element in the set.
2. This is because ${\displaystyle X_{\text{min}}=c_{0}}$ and ${\displaystyle X_{\text{max}}=c_{i}}$.
3. This is different from the empirical cdf ${\displaystyle {\frac {1}{n}}\sum _{k=1}^{n}\mathbf {1} \{X_{k}\leq x\}}$.
4. For Glivenko-Cantelli theorem, the empirical cdf is a good estimate of the cdf ${\displaystyle F(x)}$, regardless of what the actual values (realization) of the random sample are, i.e. for each realization of the empirical cdf, it is a good estimate of the cdf ${\displaystyle F(x)}$, when ${\displaystyle n}$ is large.
5. That is, for a realization of random sample ${\displaystyle X_{1},X_{2},\dotsc ,X_{n}}$, say ${\displaystyle x_{1},x_{2},\dotsc ,x_{n}}$, the probability for ${\displaystyle X^{*}}$ to equal ${\displaystyle x_{1},x_{2},\dotsc ,x_{n}}$ (which corresponds to the realization of ${\displaystyle X_{1},X_{2},\dotsc ,X_{n}}$ respectively), is ${\displaystyle 1/n}$ each.
6. The reason is mentioned similarly above: the histogram should be close to the pmf of ${\displaystyle T^{*}}$ since the cdf corresponding to the histogram (i.e. the realization of the empirical cdf of the random sample ${\displaystyle T_{1}^{*},T_{2}^{*},\dotsc ,T_{j}^{*}}$) is close to the cdf of ${\displaystyle T^{*}}$.