# Statistics/Point Estimation

## Introduction

Usually, a random variable $X$ resulting from a random experiment is *assumed* to follow a certain distribution with an unknown (but *fixed*) parameter (vector) $\theta\in\mathbb{R}^k$ ^{[1]} ^{[2]} ($k$ is a positive integer, and its value depends on the distribution), taking value in a set $\Theta$, called the parameter space.

**Remark.**

- In the context of *frequentist statistics* (the context here), parameters are regarded as *fixed*.
- On the other hand, in the context of *Bayesian statistics*, parameters are regarded as *random variables*.

For example, suppose the random variable $X$ is assumed to follow a normal distribution $\mathcal{N}(\mu,\sigma^2)$. Then, in this case, the parameter vector $\theta=(\mu,\sigma^2)$ is unknown, and the parameter space is $\Theta=\{(\mu,\sigma^2):\mu\in\mathbb{R},\ \sigma^2>0\}$.
It is often useful to *estimate* those unknown parameters in some way to "understand" the random variable better.
We would like the estimation to be "good" ^{[3]} enough, so that the understanding is more accurate.

Intuitively, the (realization of) *random sample* should be useful.
Indeed, the estimators introduced in this chapter are all based on the random sample in some sense, and this is what *point estimates* mean.
To be more precise, let us define *point estimation* and *point estimates*.

**Definition.**
(Point estimation)
*Point estimation* is a process of using the value of a *statistic* to give a single value estimate (which can be interpreted as a *point*) of an unknown parameter.

**Remark.**

- Recall that *statistics* are functions of a random sample.
- We call the unknown parameter the *population parameter* (since the underlying distribution corresponding to the parameter is often called a *population*).
- The statistic is called a *point estimator*, and its realization is called a *point estimate*.
  - The notation of a *point estimator* commonly has a hat, e.g. $\hat\theta$.
- *Point* estimation will be contrasted with *interval* estimation, which uses the value of a statistic to estimate an *interval* of plausible values of the unknown parameter.

**Example.**
Suppose $X_1,\dots,X_n$ are random samples from the normal distribution $\mathcal{N}(\mu,\sigma^2)$.

- We may use the *statistic* $\overline{X}$ to estimate $\mu$ intuitively. Here $\overline{X}$ is called the *point estimator*, and its realization $\overline{x}$ is called the *point estimate*.
- Alternatively, we may simply use the statistic $X_1$ (although it does not involve $X_2,\dots,X_n$, it can still be regarded as a function of $X_1,\dots,X_n$) to estimate $\mu$. That is, we use the value of the first random sample from the normal distribution as the point estimate of the mean of the distribution! Intuitively, such an estimator seems quite "bad".
  - Such an estimator, which just takes one random sample directly, is called a *single observation estimator*.
  - We will later discuss how to evaluate how "good" a point estimator is.

In the following, we will introduce two well-known point estimators, which are actually quite "good", namely the *maximum likelihood estimator* and the *method of moments estimator*.

## Maximum likelihood estimator (MLE)

As suggested by its name, this estimator *maximizes* some kind of "likelihood".
Now, we would like to know which "likelihood" we should maximize to estimate the unknown parameter(s) (in a "good" way).
Also, as mentioned in the introduction section, the estimator is based on the random sample in some sense. Hence,
this "likelihood" should also be based on the random sample in some sense.

To motivate the definition of maximum likelihood estimator, consider the following example.

**Example.**
In a random experiment, a (fair or unfair) coin is tossed once. Let the random variable $X=1$ if head comes up, and $X=0$ otherwise.
Then, the pmf of $X$ is $f(x;p)=p^x(1-p)^{1-x},\ x\in\{0,1\}$, in which the unknown parameter $p$ represents
the probability that head comes up, and $0<p<1$.

Now, suppose you get a random sample $X_1,\dots,X_n$ by tossing that coin $n$ independent times (such a random sample is called an *independent* random sample, since the random variables involved are independent), with corresponding realizations $x_1,\dots,x_n$.
Then, the probability for $X_1=x_1,\dots,X_n=x_n$, i.e. the random sample having exactly these realizations, is
$$\mathbb{P}(X_1=x_1,\dots,X_n=x_n;p)=\prod_{i=1}^{n}f(x_i;p)=p^{\sum_{i=1}^{n}x_i}(1-p)^{n-\sum_{i=1}^{n}x_i}.$$

**Remark.**

- **Remark on notation**: You may observe that there is an additional "$;p$" in the pmf of $X$. Such notation means the pmf is *with the parameter value* $p$. It is included to *emphasize* the parameter value we are referring to.
- In general, we write $f(x;\theta)$ for the pmf/pdf with the parameter value $\theta$ ($\theta$ may be a vector).
  - There are some alternative notations with the same meaning: $f_\theta(x)$, $f(x\mid\theta)$.
- Similarly, we have similar notations, e.g. $\mathbb{P}_\theta(A)$ and $\mathbb{P}(A;\theta)$, to mean the probability for event $A$ to happen, with the parameter value $\theta$. (It is more common to use the first notation: $\mathbb{P}_\theta(A)$.)
- We also have similar notations for mean, variance, covariance, etc., like $\mathbb{E}_\theta[X]$, $\operatorname{Var}_\theta(X)$ and $\operatorname{Cov}_\theta(X,Y)$.

Intuitively, we would like to find a value of $p$ that maximizes this probability, i.e. makes the realization obtained the *most* probable one for the random sample.
Such a value is called the MLE of $p$.
The MLE can also be expressed as $\arg\max_{p}\mathcal{L}(p;\mathbf{x})$, where $\mathbf{x}=(x_1,\dots,x_n)$,
in which $\mathcal{L}(p;\mathbf{x})=\mathbb{P}(X_1=x_1,\dots,X_n=x_n;p)$ is called the *likelihood* function, which is the probability for the random sample to be exactly the realization.

Now, let us formally define the terms related to MLE.

**Definition.**
(Likelihood function)
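Let $X_1,\dots,X_n$ be a random sample with a *joint* pmf or pdf $f(x_1,\dots,x_n;\theta)$, in which the parameter (vector) $\theta$ belongs to the parameter space $\Theta$.
Then, the *likelihood function* of this random sample is
$$\mathcal{L}(\theta;\mathbf{x})=f(x_1,\dots,x_n;\theta),\qquad\mathbf{x}=(x_1,\dots,x_n),$$
(which is the value of the joint pmf or pdf of $X_1,\dots,X_n$ at $x_1,\dots,x_n$ respectively) for each $\theta\in\Theta$.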

**Remark.**

- When the random sample comes from a *discrete* distribution, the value of the likelihood function $\mathcal{L}(\theta;\mathbf{x})$ is the probability $\mathbb{P}(X_1=x_1,\dots,X_n=x_n)$ at the parameter vector $\theta$. That is, the probability for getting this specific realization exactly.
- When the random sample comes from a *continuous* distribution, the value of the likelihood function is *not* a probability. Instead, it is only the value of the joint pdf at $(x_1,\dots,x_n)$ (which can be greater than one). However, the value can still be used to "reflect" the probability for getting "very close to" this specific realization, where the probability can be obtained by integrating the joint pdf over a "very small" region around $(x_1,\dots,x_n)$.
- The natural logarithm of the likelihood function, $\ln\mathcal{L}(\theta;\mathbf{x})$, denoted by $\ell(\theta;\mathbf{x})$, is called the *log-likelihood function*.

**Definition.**
(Maximum likelihood estimate)
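Given a *likelihood function* $\mathcal{L}(\theta;\mathbf{x})$ for each $\theta\in\Theta$,
$\hat\theta(\mathbf{x})=\arg\max_{\theta\in\Theta}\mathcal{L}(\theta;\mathbf{x})$ is called the *maximum likelihood estimate* (which is a function of $\mathbf{x}=(x_1,\dots,x_n)$, since $\mathcal{L}$ involves $\mathbf{x}$).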

**Remark.**

- $\arg\max_{\theta\in\Theta}$ gives the argument(s) maximizing the function (the value of the function is maximum at this argument compared to other arguments in $\Theta$).
- Since $\mathcal{L}(\theta;\mathbf{x})>0$ wherever the logarithm is taken (the domain of the natural logarithm function is the set of all *positive* real numbers) and the natural logarithm function is strictly increasing, the output is larger when the input is larger. Thus, when we find the $\theta$ that maximizes $\mathcal{L}(\theta;\mathbf{x})$, the input of the natural logarithm function, $\mathcal{L}(\theta;\mathbf{x})$, is maximized, and so is the output. This means such $\theta$ also maximizes the log-likelihood function. It follows that we also have $\hat\theta(\mathbf{x})=\arg\max_{\theta\in\Theta}\ln\mathcal{L}(\theta;\mathbf{x})$. Indeed, it is often easier to find $\arg\max_{\theta\in\Theta}\ln\mathcal{L}(\theta;\mathbf{x})$.
- The *maximum likelihood estimator* (MLE) of $\theta$ is a function of $X_1,\dots,X_n$, which is obtained by replacing the $x_i$'s in the maximum likelihood estimate by the $X_i$'s.

Now, let us find the MLE of the unknown parameter $p$ in the previous coin flipping example.

**Example.**
(Motivating example revisited)
Recall that we use a coin flipping example to motivate maximum likelihood estimation.
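$X$ follows the Bernoulli distribution with success probability $p$. The pmf of $X$ is $f(x;p)=p^x(1-p)^{1-x},\ x\in\{0,1\}$. $X_1,\dots,X_n$ is an independent random sample from the distribution, with realization $\mathbf{x}=(x_1,\dots,x_n)$ and $\overline{x}=\frac{1}{n}\sum_{i=1}^{n}x_i$.

- The likelihood function is the *joint* pmf of $X_1,\dots,X_n$:
  $$\mathcal{L}(p;\mathbf{x})=\prod_{i=1}^{n}p^{x_i}(1-p)^{1-x_i}=p^{n\overline{x}}(1-p)^{n-n\overline{x}}.$$
- The log-likelihood function is thus
  $$\ln\mathcal{L}(p;\mathbf{x})=n\overline{x}\ln p+n(1-\overline{x})\ln(1-p).$$
- To find the maximum of the log-likelihood function, we may use the derivative test from calculus. Differentiating with respect to $p$ gives
  $$\frac{d\ln\mathcal{L}(p;\mathbf{x})}{dp}=\frac{n\overline{x}}{p}-\frac{n(1-\overline{x})}{1-p}=\frac{n(\overline{x}-p)}{p(1-p)}.$$
- To find critical point(s) of $\ln\mathcal{L}(p;\mathbf{x})$, we set $\frac{d\ln\mathcal{L}(p;\mathbf{x})}{dp}=0$ (we have $p(1-p)>0$ since $0<p<1$), which gives $p=\overline{x}$.
- To verify that $\ln\mathcal{L}(p;\mathbf{x})$ actually attains a *maximum* (instead of a minimum) at $p=\overline{x}$, we need to perform a derivative test. In this case, we use the first derivative test.
- When $p<\overline{x}$, we have $\overline{x}-p>0$, and thus $\frac{d\ln\mathcal{L}(p;\mathbf{x})}{dp}>0$. On the other hand, when $p>\overline{x}$, we have $\overline{x}-p<0$, and thus $\frac{d\ln\mathcal{L}(p;\mathbf{x})}{dp}<0$. As a result, we can conclude that $\ln\mathcal{L}(p;\mathbf{x})$ attains its maximum at $p=\overline{x}$. It follows that the MLE of $p$ is $\overline{X}$ (not $\overline{x}$, which is instead the maximum likelihood *estimate*!)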
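The conclusion of this example can be checked numerically. The following is a minimal sketch, not part of the original text: the realization `xs` is hypothetical, and the helper name `bernoulli_log_likelihood` is introduced only for illustration. It evaluates the Bernoulli log-likelihood on a grid of candidate values of the success probability and compares the maximizer with the sample mean.

```python
import math

def bernoulli_log_likelihood(p, xs):
    """Log-likelihood of success probability p for 0/1 observations xs."""
    heads = sum(xs)
    tails = len(xs) - heads
    return heads * math.log(p) + tails * math.log(1 - p)

# Hypothetical realization of 10 tosses (1 = head, 0 = tail).
xs = [1, 0, 1, 1, 0, 1, 0, 1, 1, 1]
sample_mean = sum(xs) / len(xs)

# Evaluate the log-likelihood on a grid of p values strictly inside (0, 1).
grid = [k / 1000 for k in range(1, 1000)]
p_hat = max(grid, key=lambda p: bernoulli_log_likelihood(p, xs))

print(sample_mean, p_hat)
```

A grid search is used here only because it mirrors the definition of the maximizer directly; the closed form derived above makes it unnecessary in practice.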

**Exercise.**
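Use the second derivative test to verify that $\ln\mathcal{L}(p;\mathbf{x})$ attains its maximum at $p=\overline{x}$.

- We have $\frac{d^2\ln\mathcal{L}(p;\mathbf{x})}{dp^2}=-\frac{n\overline{x}}{p^2}-\frac{n(1-\overline{x})}{(1-p)^2}$, which at $p=\overline{x}$ (assuming $0<\overline{x}<1$) equals $\frac{-n}{\overline{x}(1-\overline{x})}$, in which the numerator is negative and the denominator is positive. Thus, $\frac{d^2\ln\mathcal{L}(p;\mathbf{x})}{dp^2}\Big|_{p=\overline{x}}<0$. By the second derivative test, this means $\ln\mathcal{L}(p;\mathbf{x})$ attains its maximum at $p=\overline{x}$.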

Sometimes, a constraint is imposed on the parameter when we are finding its MLE. The MLE of the parameter in this case is called a *restricted* MLE. We will illustrate this in the following example.

**Example.**
Continue from the previous coin flipping example. Suppose we have a constraint on where .
Find the MLE of in this case.

*Solution*:
For the steps about deriving likelihood function and log-likelihood function, they are the same in this case.
Without the restriction, the MLE of is . Now, with the restriction, the MLE of is only when (we always have since ).

If (and thus ), even though is maximized at , we cannot set the MLE to be due to the restriction on : . Under this case, this means when (we have when from previous example), i.e. is strictly increasing when . Thus, is maximized when with the restriction. As a result, the MLE of is (the MLE can be a constant, which can still be regarded as a function of ).

Therefore, the MLE of can be written as a case defined function: , or it can be written as

**Exercise.**
Find the MLE of when .

- When , we cannot set the MLE to be due to the restriction. In this case, we know that when , i.e. is strictly decreasing when . Thus, is maximized at , and so the MLE of is .
- When , we can set the MLE to be at which is maximized, and so is the MLE of in this case.
- Therefore, the MLE of is .

To find the MLE, we sometimes use methods other than derivative test, and we do not need to find the log-likelihood function. Let us illustrate this in the following example.

**Example.**
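Let $X_1,\dots,X_n$ be an independent random sample from the uniform distribution $\mathcal{U}[0,\beta]$. Find the MLE of $\beta$.

*Solution*:
The pdf of the uniform distribution is $f(x;\beta)=\frac{1}{\beta}\mathbf{1}\{0\le x\le\beta\}$.
Thus, the likelihood function is
$$\mathcal{L}(\beta;\mathbf{x})=\prod_{i=1}^{n}\frac{1}{\beta}\mathbf{1}\{0\le x_i\le\beta\}=\frac{1}{\beta^n}\prod_{i=1}^{n}\mathbf{1}\{0\le x_i\le\beta\}.$$

In order for $\mathcal{L}(\beta;\mathbf{x})$ to attain its maximum, first, we need to ensure that $0\le x_i\le\beta$ for each $i$, so that the product of the indicator functions in the likelihood function is nonzero (the value is actually one in this case). Apart from that, since $\frac{1}{\beta^n}$ is a strictly decreasing function of $\beta$ (because $\frac{d}{d\beta}\beta^{-n}=-n\beta^{-n-1}<0$ (we have $\beta>0$)), we should pick a $\beta$ that is as small as possible so that $\frac{1}{\beta^n}$, and hence $\mathcal{L}(\beta;\mathbf{x})$, is as large as possible.

As a result, we should choose a $\beta$ that is as small as possible, subject to the constraint that $0\le x_i\le\beta$ for each $i$, which means that $\beta\ge x_i$ (it is always the case that $x_i\ge 0$, regardless of the choice of $\beta$) for each $i$. It follows that $\mathcal{L}(\beta;\mathbf{x})$ attains its maximum when $\beta$ is the maximum of $x_1,\dots,x_n$. Hence, the MLE of $\beta$ is $\max\{X_1,\dots,X_n\}$.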
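The argument above can be illustrated with a short sketch. It assumes the uniform distribution on the interval from 0 to `beta`; the realization `xs` and the function name `uniform_likelihood` are hypothetical and introduced only for this illustration. The likelihood is zero for any candidate below the largest observation and strictly decreasing above it.

```python
def uniform_likelihood(beta, xs):
    """Likelihood of beta for observations xs from the uniform distribution on [0, beta]."""
    if beta <= 0 or any(x < 0 or x > beta for x in xs):
        return 0.0  # at least one indicator function is zero
    return beta ** (-len(xs))

xs = [0.8, 2.5, 1.7, 0.3]  # hypothetical realization
beta_hat = max(xs)         # candidate maximizer: the largest observation

# Compare the likelihood at max(xs) with a smaller and two larger candidates.
for beta in (beta_hat - 0.1, beta_hat, beta_hat + 0.1, beta_hat + 1.0):
    print(beta, uniform_likelihood(beta, xs))
```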

**Exercise.**
Show that the MLE of $\beta$ does not exist if the uniform distribution becomes $\mathcal{U}[0,\beta)$.

**Proof.**
In this case, the constraint from the indicator functions becomes $0\le x_i<\beta$ for each $i$.
With a similar argument, for the MLE of $\beta$, we should choose a $\beta$ that is as small as possible subject to this constraint, which means $\beta>x_i$ for each $i$.
However, in this case, we *cannot* set $\beta$ to be the maximum of $x_1,\dots,x_n$, or else the constraint will not be satisfied and the likelihood function becomes zero due to the indicator function.
Instead, we should set $\beta$ to be slightly greater than the maximum of $x_1,\dots,x_n$, so that the constraint is still satisfied and $\beta$ is quite small. However, for each such $\beta$, we can always choose a smaller $\beta'$ that still satisfies the constraint. For example, for each $\beta>\max\{x_1,\dots,x_n\}$, the smaller beta, $\beta'$, can be selected as $\frac{\beta+\max\{x_1,\dots,x_n\}}{2}$ ^{[4]}. Hence, we cannot find a minimum value of $\beta$ subject to this constraint. Thus, there is no maximum point for $\mathcal{L}(\beta;\mathbf{x})$, and hence the MLE does not exist.

In the following example, we will find the MLEs of two parameters.

**Example.**
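Let $X_1,\dots,X_n$ be an independent random sample from the normal distribution with mean $\mu$ and variance $\sigma^2$, $\mathcal{N}(\mu,\sigma^2)$.
Find the MLEs of $\mu$ and $\sigma^2$.

*Solution*:
Let $\theta=(\mu,\sigma^2)$ and $\mathbf{x}=(x_1,\dots,x_n)$. The likelihood function is
$$\mathcal{L}(\theta;\mathbf{x})=\prod_{i=1}^{n}\frac{1}{\sqrt{2\pi\sigma^2}}\exp\left(-\frac{(x_i-\mu)^2}{2\sigma^2}\right)=(2\pi\sigma^2)^{-n/2}\exp\left(-\frac{1}{2\sigma^2}\sum_{i=1}^{n}(x_i-\mu)^2\right),$$
and hence the log-likelihood function is
$$\ln\mathcal{L}(\theta;\mathbf{x})=-\frac{n}{2}\ln(2\pi)-\frac{n}{2}\ln\sigma^2-\frac{1}{2\sigma^2}\sum_{i=1}^{n}(x_i-\mu)^2.$$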
Since this function is multivariate, we may use the second partial derivative test from multivariable calculus to find maximum point(s).
However, in this case, we actually do not need to use such test.
Instead, we fix the variables one by one to make the function univariate, so that we can use the derivative test for univariate function to find maximum point (with another variable fixed).

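Since $\frac{\partial\ln\mathcal{L}}{\partial\mu}=\frac{1}{\sigma^2}\sum_{i=1}^{n}(x_i-\mu)=\frac{n(\overline{x}-\mu)}{\sigma^2}$, we have $\frac{\partial\ln\mathcal{L}}{\partial\mu}=0\iff\mu=\overline{x}$.

Also, $\frac{\partial^2\ln\mathcal{L}}{\partial\mu^2}=-\frac{n}{\sigma^2}$, and the critical point $\mu=\overline{x}$ is independent from $\sigma^2$ (this is important for us to use this kind of method).

Since $-\frac{n}{\sigma^2}<0$, by the second derivative test (for univariate functions), $\ln\mathcal{L}$ is maximized at $\mu=\overline{x}$, *given any fixed $\sigma^2$*.

On the other hand, since $\frac{\partial\ln\mathcal{L}}{\partial\sigma^2}=-\frac{n}{2\sigma^2}+\frac{1}{2\sigma^4}\sum_{i=1}^{n}(x_i-\mu)^2$, we have $\frac{\partial\ln\mathcal{L}}{\partial\sigma^2}=0\iff\sigma^2=\frac{1}{n}\sum_{i=1}^{n}(x_i-\mu)^2$, and at this critical point $\frac{\partial^2\ln\mathcal{L}}{\partial(\sigma^2)^2}=\frac{n}{2\sigma^4}-\frac{1}{\sigma^6}\sum_{i=1}^{n}(x_i-\mu)^2=\frac{n}{2\sigma^4}-\frac{n}{\sigma^4}=-\frac{n}{2\sigma^4}<0$ (since $\sigma^4>0$).
Thus, by the second derivative test, $\ln\mathcal{L}$ is maximized at $\sigma^2=\frac{1}{n}\sum_{i=1}^{n}(x_i-\mu)^2$, *given any fixed $\mu$*.

So, now we fix $\mu=\overline{x}$, and thus $\ln\mathcal{L}$ is maximized at $\sigma^2=\frac{1}{n}\sum_{i=1}^{n}(x_i-\overline{x})^2=\frac{(n-1)s^2}{n}$, where $s^2$ is the realization of the sample variance $S^2=\frac{1}{n-1}\sum_{i=1}^{n}(X_i-\overline{X})^2$. Now, $\sigma^2$ is fixed as well, and we know that $\ln\mathcal{L}$ attains its maximum at $\mu=\overline{x}$ for each fixed $\sigma^2$, including this fixed $\sigma^2$. As a result, $\ln\mathcal{L}$ is maximized at $(\mu,\sigma^2)=\left(\overline{x},\frac{(n-1)s^2}{n}\right)$. Hence, the MLE of $\mu$ is $\overline{X}$, and the MLE of $\sigma^2$ is $\frac{(n-1)S^2}{n}$.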
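As a hedged numerical sketch of this result (the realization `xs` and the function name `normal_log_likelihood` are hypothetical and used only for illustration), the code below computes the sample mean and the divide-by-$n$ variance, compares the latter with $(n-1)/n$ times the sample variance, and checks that perturbing either value lowers the normal log-likelihood.

```python
import math
import statistics

def normal_log_likelihood(mu, sigma2, xs):
    """Log-likelihood of (mu, sigma2) for observations xs from a normal distribution."""
    n = len(xs)
    squared_error = sum((x - mu) ** 2 for x in xs)
    return -0.5 * n * math.log(2 * math.pi * sigma2) - squared_error / (2 * sigma2)

xs = [4.1, 5.3, 3.8, 6.0, 5.2, 4.6]  # hypothetical realization
n = len(xs)

mu_hat = statistics.fmean(xs)                          # sample mean
sigma2_hat = sum((x - mu_hat) ** 2 for x in xs) / n    # divides by n, not n - 1

# statistics.variance divides by n - 1 (the sample variance).
s2 = statistics.variance(xs)
print(mu_hat, sigma2_hat, (n - 1) / n * s2)

# Perturbing either coordinate should give a smaller log-likelihood.
best = normal_log_likelihood(mu_hat, sigma2_hat, xs)
for d_mu, scale in [(0.1, 1.0), (-0.1, 1.0), (0.0, 1.2), (0.0, 0.8)]:
    print(d_mu, scale, normal_log_likelihood(mu_hat + d_mu, sigma2_hat * scale, xs) < best)
```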

**Exercise.**

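(a) Calculate the determinant of the Hessian matrix of $\ln\mathcal{L}$ at $(\mu,\sigma^2)=(\overline{x},\hat\sigma^2)$, where $\hat\sigma^2=\frac{1}{n}\sum_{i=1}^{n}(x_i-\overline{x})^2$, which can be expressed as
$$\begin{vmatrix}\dfrac{\partial^2\ln\mathcal{L}}{\partial\mu^2}&\dfrac{\partial^2\ln\mathcal{L}}{\partial\mu\,\partial\sigma^2}\\[2ex]\dfrac{\partial^2\ln\mathcal{L}}{\partial\sigma^2\,\partial\mu}&\dfrac{\partial^2\ln\mathcal{L}}{\partial(\sigma^2)^2}\end{vmatrix}.$$

(b) Hence, verify that $(\overline{x},\hat\sigma^2)$ is the maximum point of $\ln\mathcal{L}$ using the second partial derivative test.

(a) First, at $(\mu,\sigma^2)=(\overline{x},\hat\sigma^2)$,
$$\frac{\partial^2\ln\mathcal{L}}{\partial\mu^2}=-\frac{n}{\hat\sigma^2},\qquad\frac{\partial^2\ln\mathcal{L}}{\partial\mu\,\partial\sigma^2}=-\frac{n(\overline{x}-\mu)}{\sigma^4}\bigg|_{\mu=\overline{x}}=0,\qquad\frac{\partial^2\ln\mathcal{L}}{\partial(\sigma^2)^2}=\frac{n}{2\hat\sigma^4}-\frac{n\hat\sigma^2}{\hat\sigma^6}=-\frac{n}{2\hat\sigma^4}.$$

As a result, the determinant of the Hessian matrix is $\left(-\frac{n}{\hat\sigma^2}\right)\left(-\frac{n}{2\hat\sigma^4}\right)-0^2=\frac{n^2}{2\hat\sigma^6}$.

(b) From (a), the determinant of the Hessian matrix is positive. Also, $\frac{\partial^2\ln\mathcal{L}}{\partial\mu^2}=-\frac{n}{\hat\sigma^2}<0$. Thus, by the second partial derivative test, $\ln\mathcal{L}$ attains its maximum at $(\overline{x},\hat\sigma^2)$.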

**Exercise.**
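Let $X_1,\dots,X_n$ be an independent random sample from the exponential distribution with rate parameter $\lambda$, with pdf $f(x;\lambda)=\lambda e^{-\lambda x},\ x>0$, where $\lambda>0$.
Show that the MLE of $\lambda$ is $\frac{1}{\overline{X}}$.

**Proof.**
The likelihood function is $\mathcal{L}(\lambda;\mathbf{x})=\prod_{i=1}^{n}\lambda e^{-\lambda x_i}=\lambda^n e^{-\lambda n\overline{x}}$.
Thus, the log-likelihood function is $\ln\mathcal{L}(\lambda;\mathbf{x})=n\ln\lambda-\lambda n\overline{x}$.
Differentiating the log-likelihood function with respect to $\lambda$ gives $\frac{d\ln\mathcal{L}(\lambda;\mathbf{x})}{d\lambda}=\frac{n}{\lambda}-n\overline{x}$.
Setting the derivative to be zero, we get
$\frac{n}{\lambda}-n\overline{x}=0\iff\lambda=\frac{1}{\overline{x}}$.
It remains to verify that $\ln\mathcal{L}(\lambda;\mathbf{x})$ attains its maximum at $\lambda=\frac{1}{\overline{x}}$.
Since $\frac{d^2\ln\mathcal{L}(\lambda;\mathbf{x})}{d\lambda^2}=-\frac{n}{\lambda^2}<0$, this is verified.
Hence, the MLE of $\lambda$ is $\frac{1}{\overline{X}}$.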

**Example.**
(Application of maximum likelihood estimation)
Suppose you are given a box which contains four balls, with unknown number of red and black balls.
Now, you draw three balls out of the box, and find out that you get two red balls and one black ball.
Using maximum likelihood estimation, estimate the number of red and black balls inside the box.

*Solution*:
Given the color of the balls drawn, we know that the box contains at least two red balls and at least one black ball.
This means the box contains either two red balls or three red balls.
Let $r$ be the number of red balls inside the box. Then, the number of black balls inside the box is $4-r$.
The possible values of the parameter $r$ are 2 and 3.

Now, we compare the probability of getting such a result from drawing three balls when $r=2$ and $r=3$.

- For $r=2$, the probability is $\frac{\binom{2}{2}\binom{2}{1}}{\binom{4}{3}}=\frac{2}{4}=\frac{1}{2}$ (consider the pmf of the hypergeometric distribution).
- For $r=3$, the probability is $\frac{\binom{3}{2}\binom{1}{1}}{\binom{4}{3}}=\frac{3}{4}$.

Hence, the maximum likelihood estimate of $r$ is 3. Thus, the estimated number of red balls is 3, and that of black balls is 1.
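The comparison in this example is a maximization over a finite set of candidate parameter values, which is easy to sketch in code. The function name `draw_probability` and its argument names are hypothetical; the computation is the hypergeometric pmf for drawing without replacement, evaluated with exact fractions.

```python
from fractions import Fraction
from math import comb

def draw_probability(total, red_in_box, drawn, red_drawn):
    """Probability of getting red_drawn red balls in drawn draws without replacement."""
    black_in_box = total - red_in_box
    black_drawn = drawn - red_drawn
    ways = comb(red_in_box, red_drawn) * comb(black_in_box, black_drawn)
    return Fraction(ways, comb(total, drawn))

# Likelihood of each candidate number of red balls (box of 4, draw 3, observe 2 red).
likelihood = {r: draw_probability(4, r, 3, 2) for r in (2, 3)}
print(likelihood)                           # 1/2 for 2 red balls, 3/4 for 3 red balls
print(max(likelihood, key=likelihood.get))  # candidate with the largest likelihood
```

The same function can be reused with other box and draw sizes, for instance a box of 100 balls from which 99 are drawn.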

**Exercise.**
Suppose the box now contains 100 balls, with unknown number of red and black balls.
Now, you draw 99 balls out of the box, and find out that you get 98 red balls and one black ball.
Using maximum likelihood estimation, estimate the number of red and black balls inside the box.

Similarly, the box contains at least 98 red balls and one black ball. We use the same notation as in the above example. Then, the number of black balls is $100-r$, and the possible values of the parameter $r$ are 98 and 99.

- For $r=98$, the probability is $\frac{\binom{98}{98}\binom{2}{1}}{\binom{100}{99}}=\frac{2}{100}=0.02$.
- For $r=99$, the probability is $\frac{\binom{99}{98}\binom{1}{1}}{\binom{100}{99}}=\frac{99}{100}=0.99$.

Thus, the maximum likelihood estimate of $r$ is 99, so the estimated number of red balls is 99 and that of black balls is 1.

**Remark.**

- The difference of the probabilities between the two possible values of $r$ becomes much larger in this case.
- Intuitively, when you have such draw result, you will think that it is quite unlikely that the box has two black balls inside, i.e. the ball that is not drawn is actually black, and somehow you draw all red balls out, but not the black ball.

## Method of moments estimator (MME)

For maximum likelihood estimation, we need to utilize the likelihood function, which is found from the joint pmf or pdf of the random sample from a distribution. However, we may not know the pmf or pdf of the distribution exactly in practice. Instead, we may just know some information about the distribution, e.g. mean, variance, and some moments (the $r$th moment of a random variable $X$ is $\mathbb{E}[X^r]$, and we denote it by $\mu_r$ for simplicity). Such moments often contain information about the unknown parameter. For example, for a normal distribution $\mathcal{N}(\mu,\sigma^2)$, we know that $\mu_1=\mu$ and $\mu_2=\sigma^2+\mu^2$. Because of this, when we want to estimate the parameters, we can do this through estimating the moments.

Now, we would like to know how to estimate the moments.
We let $M_r=\frac{1}{n}\sum_{i=1}^{n}X_i^r$ be the $r$th *sample moment* ^{[5]}, where the $X_i$'s are independent and identically distributed.
By the *weak law of large numbers* (assuming the conditions are satisfied), we have

- $M_1=\overline{X}\overset{p}{\to}\mu_1$
- $M_2=\frac{1}{n}\sum_{i=1}^{n}X_i^2\overset{p}{\to}\mu_2$ (this can be seen from replacing the "$X_i$" by "$X_i^2$" in the weak law of large numbers; the conditions are then still satisfied, and so we can still apply the weak law of large numbers)

In general, we have $M_r\overset{p}{\to}\mu_r$, since the conditions are still satisfied after replacing the "$X_i$" by "$X_i^r$" in the weak law of large numbers.

Because of these results, we can estimate the $r$-th moment $\mu_r$ using the $r$-th sample moment $M_r$, and the estimation is "better" when $n$ is large.
For example, in the above normal distribution example, we can estimate $\mu$ by $M_1$ and $\sigma^2$ by $M_2-M_1^2$,
and these estimators are actually called the *method of moments estimators*.

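To be more precise, we have the following definition of the *method of moments*:

**Definition.**
(Method of moments)
Let $X_1,\dots,X_n$ be a random sample from a distribution with pdf or pmf $f(x;\theta_1,\dots,\theta_k)$.
Write $k$ moment(s), e.g. $\mu_1,\dots,\mu_k$, as function(s) of $\theta_1,\dots,\theta_k$: $g_1(\theta_1,\dots,\theta_k),\dots,g_k(\theta_1,\dots,\theta_k)$ respectively.
Then, the *method of moments estimator* (MME) of $\theta_1,\dots,\theta_k$, $\hat\theta_1,\dots,\hat\theta_k$ respectively, is given by the solution (in the form of $\theta_1,\dots,\theta_k$ in terms of $M_1,\dots,M_k$, corresponding to the moments $\mu_1,\dots,\mu_k$) to the following system of equations:
$$g_j(\theta_1,\dots,\theta_k)=M_j,\qquad j=1,\dots,k,$$
where $M_j=\frac{1}{n}\sum_{i=1}^{n}X_i^j$ is the $j$th sample moment.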

**Remark.**

- When there are $k$ unknown parameters, we need to solve a system of $k$ equations, involving $k$ sample moments.
- Usually, we select the first $k$ moments for the $k$ moments, as in the definition. But this is not necessary, and we may choose other moments, including fractional moments (e.g. $\mu_{1/2}$, and we use $M_{1/2}$ in this case).
  - Because of this, the method of moments estimator is *not* unique.

**Example.**
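Let $X_1,\dots,X_n$ be an independent random sample from the normal distribution $\mathcal{N}(\mu,\sigma^2)$. Find the MME of $\mu$ and $\sigma^2$.

*Solution*:
First, there are two unknown parameters. Thus, we need to solve a system of 2 equations, involving 2 sample moments and 2 moments.
Since $\mathbb{E}[X]=\mu$ and $\mathbb{E}[X^2]=\sigma^2+\mu^2$, consider the following system of equations:
$$\begin{cases}\mu=\overline{X},\\ \sigma^2+\mu^2=\frac{1}{n}\sum_{i=1}^{n}X_i^2.\end{cases}$$
Solving it, the MME of $\mu$ is $\overline{X}$, and the MME of $\sigma^2$ is $\frac{1}{n}\sum_{i=1}^{n}X_i^2-\overline{X}^2=\frac{1}{n}\sum_{i=1}^{n}(X_i-\overline{X})^2$.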

**Remark.**

- We can see that the process of finding the MME of and is much easier than finding the MLE of and . This is because the expression of the first and second moment in terms of parameters is simple in this case. However, when the expression is more complicated, finding the MME of the parameters can be quite complicated.
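The moment-matching step for the normal distribution can be sketched as follows. The realization `xs` and the helper name `sample_moment` are hypothetical; the code matches the first two sample moments to the mean and to the variance plus squared mean, then solves for the two parameters.

```python
def sample_moment(xs, r):
    """r-th sample moment: the average of x**r over the observations."""
    return sum(x ** r for x in xs) / len(xs)

xs = [4.1, 5.3, 3.8, 6.0, 5.2, 4.6]  # hypothetical realization

m1 = sample_moment(xs, 1)
m2 = sample_moment(xs, 2)

# Match E[X] = mu and E[X^2] = sigma^2 + mu^2 to the sample moments and solve.
mu_mme = m1
sigma2_mme = m2 - m1 ** 2
print(mu_mme, sigma2_mme)
```

For this distribution the resulting variance estimate is the average squared deviation from the sample mean, i.e. it divides by the sample size rather than by the sample size minus one.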

**Example.**
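Let $X_1,\dots,X_n$ be an independent random sample from the exponential distribution with rate parameter $\lambda$. Find the MME of $\lambda$ and compare it to the MLE of $\lambda$.

*Solution*:
Since $\mathbb{E}[X]=\frac{1}{\lambda}$, consider the following equation:
$\frac{1}{\lambda}=\overline{X}$.
We then have $\lambda=\frac{1}{\overline{X}}$. Hence, the MME of $\lambda$ is $\frac{1}{\overline{X}}$,
which is the same as the MLE of $\lambda$.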

**Exercise.**
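Let $X_1,\dots,X_n$ be an independent random sample from the uniform distribution $\mathcal{U}[\alpha,\beta]$. Show that the MMEs of $\alpha$ and $\beta$ are $\overline{X}-\sqrt{3(M_2-\overline{X}^2)}$ and $\overline{X}+\sqrt{3(M_2-\overline{X}^2)}$ respectively, where $M_2=\frac{1}{n}\sum_{i=1}^{n}X_i^2$.

**Proof.**
Since $\mathbb{E}[X]=\frac{\alpha+\beta}{2}$ and $\mathbb{E}[X^2]=\operatorname{Var}(X)+(\mathbb{E}[X])^2=\frac{(\beta-\alpha)^2}{12}+\frac{(\alpha+\beta)^2}{4}$,
consider the following system of equations:
$$\begin{cases}\frac{\alpha+\beta}{2}=\overline{X},\\ \frac{(\beta-\alpha)^2}{12}+\frac{(\alpha+\beta)^2}{4}=M_2.\end{cases}$$
Substituting the first equation into the second gives $(\beta-\alpha)^2=12(M_2-\overline{X}^2)$, i.e. $\beta-\alpha=\pm2\sqrt{3(M_2-\overline{X}^2)}$.

When $\beta-\alpha=-2\sqrt{3(M_2-\overline{X}^2)}$, we get $\alpha=\overline{X}+\sqrt{3(M_2-\overline{X}^2)}$ and $\beta=\overline{X}-\sqrt{3(M_2-\overline{X}^2)}$. However, from the definition of the uniform distribution, we need to have $\alpha<\beta$, and thus this case is rejected.

When $\beta-\alpha=2\sqrt{3(M_2-\overline{X}^2)}$, we get $\alpha=\overline{X}-\sqrt{3(M_2-\overline{X}^2)}$ and $\beta=\overline{X}+\sqrt{3(M_2-\overline{X}^2)}$, which satisfies the definition of the uniform distribution.

Thus, we have the desired result.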
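A small sketch of this computation is given below, under the assumption that the two estimators are the sample mean minus and plus the square root of three times the difference between the second sample moment and the squared sample mean. The realization `xs` and the function name `uniform_mme` are hypothetical. The check at the end confirms that the fitted endpoints reproduce the two matched moments.

```python
import math

def uniform_mme(xs):
    """Method of moments estimates of the two endpoints of a uniform distribution."""
    n = len(xs)
    m1 = sum(xs) / n                   # first sample moment
    m2 = sum(x * x for x in xs) / n    # second sample moment
    half_width = math.sqrt(3 * (m2 - m1 ** 2))
    return m1 - half_width, m1 + half_width

xs = [2.0, 3.5, 4.0, 5.5, 7.0]  # hypothetical realization
lower, upper = uniform_mme(xs)
print(lower, upper)

# The fitted uniform distribution has mean (lower + upper) / 2
# and variance (upper - lower)**2 / 12.
print((lower + upper) / 2, (upper - lower) ** 2 / 12)
```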

## Properties of estimator

In this section, we will introduce some criteria for evaluating how "good" a point estimator is, namely *unbiasedness*, *efficiency* and *consistency*.

### Unbiasedness

For $\hat\theta$ to be a "good" estimator of a parameter $\theta$, a desirable property of $\hat\theta$ is that its expected value equals the value of the parameter $\theta$, or is at least close to that value.
Because of this, we introduce a quantity, namely the *bias*, to measure how close the mean of $\hat\theta$ is to $\theta$.

**Definition.**
(Bias)
The *bias* of an estimator $\hat\theta$ is
$$\operatorname{Bias}(\hat\theta)=\mathbb{E}[\hat\theta]-\theta.$$

We will also define some terms related to bias.

**Definition.**
((Un)biased estimator)
An estimator $\hat\theta$ is an *unbiased estimator* of a parameter $\theta$ if $\operatorname{Bias}(\hat\theta)=0$, i.e. $\mathbb{E}[\hat\theta]=\theta$.
Otherwise, the estimator is called a *biased estimator*.

**Definition.**
(Asymptotically unbiased estimator)
An estimator $\hat\theta$ is an *asymptotically unbiased estimator* of a parameter $\theta$ if
$\lim_{n\to\infty}\operatorname{Bias}(\hat\theta)=0$, where $n$ is the sample size.

**Remark.**

- An unbiased estimator must be an asymptotically unbiased estimator, but the converse is not true, i.e. an asymptotically unbiased estimator may not be an unbiased estimator. Thus, a biased estimator may be an asymptotically unbiased estimator.
- When we discuss the goodness of estimators in terms of unbiasedness, an unbiased estimator is better than an asymptotically unbiased estimator, which is in turn better than an estimator that is not even asymptotically unbiased.

- However, there are also other criteria for evaluating the goodness of estimators apart from unbiasedness, so when we also account for other criteria, a biased estimator may be somehow "better" than an unbiased estimator overall.

**Example.**
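Let $X_1,\dots,X_n$ be an independent random sample from the Bernoulli distribution with success probability $p$. Show that the MLE of $p$, $\overline{X}$, is an unbiased estimator of $p$.

**Proof.**
Since $\mathbb{E}[\overline{X}]=\frac{1}{n}\sum_{i=1}^{n}\mathbb{E}[X_i]=\frac{1}{n}(np)=p$, the result follows.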

**Exercise.**
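Suppose the Bernoulli distribution is replaced by the binomial distribution with $m$ trials ($m\ge 2$) and success probability $p$. Show that $\overline{X}$ is a biased estimator of $p$. Modify this estimator such that it is an unbiased estimator of $p$.

**Proof.**
Since $\mathbb{E}[\overline{X}]=\frac{1}{n}\sum_{i=1}^{n}\mathbb{E}[X_i]=mp\ne p$ (for $p>0$), $\overline{X}$ is a biased estimator of $p$.

We can modify this estimator to $\frac{\overline{X}}{m}$, and then its mean is $\frac{mp}{m}=p$. Alternatively, we may choose the estimator to be $\frac{X_i}{m}$ (for any fixed $i\in\{1,\dots,n\}$), whose mean is also $p$ (other estimators whose mean is $p$ are also fine).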

**Example.**
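Let $X_1,\dots,X_n$ be an independent random sample from the normal distribution $\mathcal{N}(\mu,\sigma^2)$. Show that the MLE of $\mu$, $\overline{X}$, is an unbiased estimator of $\mu$, and the MLE of $\sigma^2$, $\frac{(n-1)S^2}{n}$ (where $S^2=\frac{1}{n-1}\sum_{i=1}^{n}(X_i-\overline{X})^2$ is the sample variance), is an *asymptotically* unbiased estimator of $\sigma^2$.

**Proof.**
First, since $\mathbb{E}[\overline{X}]=\frac{1}{n}\sum_{i=1}^{n}\mathbb{E}[X_i]=\mu$, $\overline{X}$ is an unbiased estimator of $\mu$.

On the other hand, since $\mathbb{E}[S^2]=\sigma^2$,
$$\mathbb{E}\left[\frac{(n-1)S^2}{n}\right]=\frac{n-1}{n}\sigma^2,\qquad\text{so}\qquad\operatorname{Bias}\left(\frac{(n-1)S^2}{n}\right)=-\frac{\sigma^2}{n}\to 0\quad\text{as }n\to\infty.$$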

**Exercise.**
Modify the estimator of $\sigma^2$ such that it becomes an unbiased estimator.

The estimator can be modified as $\frac{n}{n-1}\cdot\frac{(n-1)S^2}{n}=S^2$, whose mean is $\sigma^2$.
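The bias of the divide-by-$n$ variance estimator, and its removal by dividing by $n-1$ instead, can be checked exactly on a small discrete population. This is only a sketch: a fair die is used as an illustrative assumption in place of the normal distribution so that expectations can be computed by enumerating every equally likely sample with exact fractions. The identity being checked does not depend on normality.

```python
from fractions import Fraction
from itertools import product

faces = [1, 2, 3, 4, 5, 6]   # fair die, an illustrative population
mean = Fraction(sum(faces), 6)
sigma2 = sum((Fraction(f) - mean) ** 2 for f in faces) / 6   # population variance

def expected_variance_estimators(n):
    """Exact expectations of the divide-by-n and divide-by-(n-1) variance estimators."""
    total_n = Fraction(0)
    total_n1 = Fraction(0)
    samples = list(product(faces, repeat=n))   # all equally likely samples of size n
    for sample in samples:
        xbar = Fraction(sum(sample), n)
        ss = sum((x - xbar) ** 2 for x in sample)   # sum of squared deviations
        total_n += ss / n
        total_n1 += ss / (n - 1)
    return total_n / len(samples), total_n1 / len(samples)

for n in (2, 3, 4):
    e_div_n, e_div_n1 = expected_variance_estimators(n)
    print(n, e_div_n, e_div_n1, sigma2)
```

In this sketch the first expectation equals $(n-1)/n$ times the population variance, so its bias shrinks as the sample size grows, while the second equals the population variance for every sample size.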

### Efficiency

We have discussed how to evaluate the unbiasedness of estimators.
Now, if we are given two unbiased estimators, $\hat\theta_1$ and $\hat\theta_2$, how should we compare their goodness?
Their goodness is the same if we are only comparing them in terms of unbiasedness.
Therefore, we need another criterion in this case. One possible way is to compare their *variances*,
and the one with smaller variance is better, since on average, the estimator is less deviated from its mean, which is the value of the unknown parameter by the definition of unbiased estimator, and thus the one with smaller variance is more accurate in some deviation sense.
Indeed, an unbiased estimator can still have a large variance, and thus deviate a lot from its mean. Such estimator is unbiased since the positive deviations and negative deviations somehow cancel out each other.
This is the idea of *efficiency*.

**Definition.**
(Efficiency)
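Suppose $\hat\theta_1$ and $\hat\theta_2$ are two unbiased estimators of an unknown parameter $\theta$.
The *efficiency* of $\hat\theta_1$ relative to $\hat\theta_2$ is $\operatorname{Eff}(\hat\theta_1,\hat\theta_2)=\frac{\operatorname{Var}(\hat\theta_2)}{\operatorname{Var}(\hat\theta_1)}$. If $\operatorname{Eff}(\hat\theta_1,\hat\theta_2)>1$, then we say that $\hat\theta_1$ is *relatively more efficient* than $\hat\theta_2$.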

**Remark.**

- Since $\operatorname{Eff}(\hat\theta_1,\hat\theta_2)>1\iff\operatorname{Var}(\hat\theta_1)<\operatorname{Var}(\hat\theta_2)$, the estimator with smaller variance is relatively more efficient than the estimator with larger variance.
- Normally, the variance should be nonzero, and thus the efficiency should be defined in normal cases.
- Sometimes, it is also called *relative efficiency*, since the efficiency describes "how many" $\hat\theta_2$ one $\hat\theta_1$ equals.
- One may ask why the *ratio* of variances is used in the definition to compare variances, instead of the *difference* in variances. A possible reason is that the *ratio* of variances does not have any unit (the units of the variances, if any, cancel out), but the *difference* in variances can have a unit. Also, using the *ratio* of variances allows us to *compare* different *efficiencies*, calculated from different variances, numerically.

Actually, for the variance of an unbiased estimator $\hat\theta$, since the mean of the unbiased estimator is the unknown parameter $\theta$, it measures
the mean of the squared deviation from $\theta$, and we have a specific term for this quantity, namely *mean squared error* (MSE).

**Definition.**
(Mean squared error)
Suppose $\hat\theta$ is an estimator of a parameter $\theta$. The *mean squared error* (MSE) of $\hat\theta$ is
$$\operatorname{MSE}(\hat\theta)=\mathbb{E}\left[(\hat\theta-\theta)^2\right].$$

**Remark.**

- From this definition, $\operatorname{MSE}(\hat\theta)$ is the *mean* value of the *square* of the *error* $\hat\theta-\theta$, and hence the name *mean squared error*.

Notice that in the definition of MSE, we do not require $\hat\theta$ to be an unbiased estimator. Thus, $\hat\theta$ in the definition may be biased. We have mentioned that when $\hat\theta$ is unbiased, its variance is actually its MSE. In the following, we will give a more general relationship between $\operatorname{MSE}(\hat\theta)$ and $\operatorname{Var}(\hat\theta)$, not just for unbiased estimators.

**Proposition.**
(Relationship between mean squared error and variance)
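If $\operatorname{Var}(\hat\theta)$ exists, then $\operatorname{MSE}(\hat\theta)=\operatorname{Var}(\hat\theta)+[\operatorname{Bias}(\hat\theta)]^2$.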
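Before the proof, the decomposition of the mean squared error into variance plus squared bias can be checked exactly on a toy case. The sketch below is an illustration under stated assumptions: the population is a fair die, and the estimator of its mean is deliberately biased (the sample sum divided by the sample size plus one). All quantities are computed by enumerating every equally likely sample with exact fractions.

```python
from fractions import Fraction
from itertools import product

faces = [1, 2, 3, 4, 5, 6]
theta = Fraction(7, 2)   # true mean of a fair die
n = 3

# Values of a deliberately biased estimator, sum / (n + 1), over all equally likely samples.
values = [Fraction(sum(s), n + 1) for s in product(faces, repeat=n)]

def expectation(vals):
    """Average of equally likely values."""
    return sum(vals) / len(vals)

mean_est = expectation(values)
bias = mean_est - theta
variance = expectation([(v - mean_est) ** 2 for v in values])
mse = expectation([(v - theta) ** 2 for v in values])

print(bias, variance, mse, variance + bias ** 2)
```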

**Proof.**
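By definition, we have $\operatorname{MSE}(\hat\theta)=\mathbb{E}[(\hat\theta-\theta)^2]$ and $\operatorname{Var}(\hat\theta)=\mathbb{E}[(\hat\theta-\mathbb{E}[\hat\theta])^2]$.
From these, we are motivated to write
$$\begin{aligned}\operatorname{MSE}(\hat\theta)&=\mathbb{E}\left[\left((\hat\theta-\mathbb{E}[\hat\theta])+(\mathbb{E}[\hat\theta]-\theta)\right)^2\right]\\&=\mathbb{E}\left[(\hat\theta-\mathbb{E}[\hat\theta])^2\right]+2(\mathbb{E}[\hat\theta]-\theta)\,\mathbb{E}\left[\hat\theta-\mathbb{E}[\hat\theta]\right]+(\mathbb{E}[\hat\theta]-\theta)^2\\&=\operatorname{Var}(\hat\theta)+0+[\operatorname{Bias}(\hat\theta)]^2.\end{aligned}$$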

**Example.**
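Let $X_1,\dots,X_n$ ($n>1$) be an independent random sample from $\mathcal{N}(\mu,\sigma^2)$.

(a) Show that the single observation estimator $X_1$ is an unbiased estimator for $\mu$.

(b) Calculate the MSE of $X_1$ and $\overline{X}$ respectively.

(c) Which of $X_1$ and $\overline{X}$ is a better estimator of $\mu$ in terms of unbiasedness and efficiency?

*Solution*:

(a) Since $\mathbb{E}[X_1]=\mu$, the result follows.

(b) Both estimators are unbiased, so their MSEs equal their variances: $\operatorname{MSE}(X_1)=\operatorname{Var}(X_1)=\sigma^2$, and $\operatorname{MSE}(\overline{X})=\operatorname{Var}(\overline{X})=\frac{\sigma^2}{n}$.

(c) Since the efficiency of $\overline{X}$ relative to $X_1$ is $\frac{\operatorname{Var}(X_1)}{\operatorname{Var}(\overline{X})}=n>1$, $\overline{X}$ is relatively more efficient than $X_1$. Since both $X_1$ and $\overline{X}$ are unbiased estimators of $\mu$, we conclude that $\overline{X}$ is a better estimator of $\mu$ in terms of unbiasedness and efficiency.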
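The comparison between the single observation estimator and the sample mean can be reproduced exactly on a toy population. As an illustrative assumption, a fair die replaces the normal distribution so that every equally likely sample can be enumerated; the variance relations used in the solution hold for any distribution with finite variance. The helper name `mse` is hypothetical.

```python
from fractions import Fraction
from itertools import product

faces = [1, 2, 3, 4, 5, 6]
mu = Fraction(7, 2)   # true mean of a fair die
n = 4

samples = list(product(faces, repeat=n))   # all equally likely samples of size n

def mse(estimator):
    """Exact mean squared error of estimator(sample) as an estimator of mu."""
    return sum((estimator(s) - mu) ** 2 for s in samples) / len(samples)

mse_first = mse(lambda s: Fraction(s[0]))       # single observation estimator
mse_mean = mse(lambda s: Fraction(sum(s), n))   # sample mean

# Both estimators are unbiased here, so the ratio of MSEs is the ratio of variances.
print(mse_first, mse_mean, mse_first / mse_mean)
```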

**Exercise.**
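In addition to the random sample with sample size $n$ in the example, suppose we take another random sample with sample size $m$.
Let $\overline{X}_n$ and $\overline{X}_m$ denote the sample mean for the sample with sample size $n$ and $m$ respectively.

(a) Calculate the efficiency of $\overline{X}_n$ relative to $\overline{X}_m$.

(b) State the condition on the sample sizes $n$ and $m$ under which $\overline{X}_n$ is relatively more efficient than $\overline{X}_m$.

(a) Since $\operatorname{Var}(\overline{X}_n)=\frac{\sigma^2}{n}$ (from the example), and $\operatorname{Var}(\overline{X}_m)=\frac{\sigma^2}{m}$ (by similar arguments as in the example), the efficiency is $\frac{\operatorname{Var}(\overline{X}_m)}{\operatorname{Var}(\overline{X}_n)}=\frac{n}{m}$.

(b) Since $\overline{X}_n$ is relatively more efficient than $\overline{X}_m$ exactly when $\frac{n}{m}>1$, the condition is $n>m$.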

**Remark.**

- This shows that the sample mean with a larger sample size is relatively more efficient than the one with smaller sample size.

**Proposition.**
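$\lim_{n\to\infty}\operatorname{MSE}(\hat\theta)=0$ if and only if $\lim_{n\to\infty}\operatorname{Bias}(\hat\theta)=0$ and $\lim_{n\to\infty}\operatorname{Var}(\hat\theta)=0$ (assuming the limits exist).

**Proof.**

- "if" part is simple. Assume $\lim_{n\to\infty}\operatorname{Bias}(\hat\theta)=0$ and $\lim_{n\to\infty}\operatorname{Var}(\hat\theta)=0$. Then, $\lim_{n\to\infty}\operatorname{MSE}(\hat\theta)=\lim_{n\to\infty}\operatorname{Var}(\hat\theta)+\left[\lim_{n\to\infty}\operatorname{Bias}(\hat\theta)\right]^2=0$.
- "only if" part: we can use proof by contrapositive, i.e. proving that if $\lim_{n\to\infty}\operatorname{Var}(\hat\theta)\ne 0$ *or* $\lim_{n\to\infty}\operatorname{Bias}(\hat\theta)\ne 0$, then $\lim_{n\to\infty}\operatorname{MSE}(\hat\theta)\ne 0$.
  - Case 1: when $\lim_{n\to\infty}\operatorname{Var}(\hat\theta)\ne 0$, it means $\lim_{n\to\infty}\operatorname{Var}(\hat\theta)>0$ since the variance is nonnegative. Also, $[\operatorname{Bias}(\hat\theta)]^2\ge 0$. It follows that $\lim_{n\to\infty}\operatorname{MSE}(\hat\theta)>0$, i.e. the limit of the MSE does not equal zero.
  - Case 2: when $\lim_{n\to\infty}\operatorname{Bias}(\hat\theta)\ne 0$, it means $\lim_{n\to\infty}[\operatorname{Bias}(\hat\theta)]^2>0$. Also, $\operatorname{Var}(\hat\theta)\ge 0$. It follows that $\lim_{n\to\infty}\operatorname{MSE}(\hat\theta)>0$, i.e. the limit of the MSE does not equal zero.

**Remark.**

- As a result, if we know that $\lim_{n\to\infty}\operatorname{MSE}(\hat\theta)=0$, then we know that $\lim_{n\to\infty}\operatorname{Bias}(\hat\theta)=0$, i.e. $\hat\theta$ is an asymptotically unbiased estimator (in addition to $\lim_{n\to\infty}\operatorname{Var}(\hat\theta)=0$) ($\hat\theta$ *may* be an unbiased estimator).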

#### Uniformly minimum-variance unbiased estimator

Now, we know that the smaller the variance of an unbiased estimator, the more efficient (and "better") it is.
Thus, it is natural that we want to know what is the *most* efficient (i.e. the "best") unbiased estimator, i.e. the unbiased estimator with the smallest variance.
We have a specific name for such unbiased estimator, namely *uniformly minimum-variance unbiased estimator (UMVUE)* ^{[6]}.
To be more precise, we have the following definition for UMVUE:

**Definition.**
(Uniformly minimum-variance unbiased estimator)
The *uniformly minimum-variance unbiased estimator* (UMVUE) is an unbiased estimator with the *smallest variance* among all unbiased estimators.

Indeed, UMVUE is *unique*, i.e. there is exactly one unbiased estimator with the smallest variance among all unbiased estimators,
and we will prove it in the following.

**Proposition.**
(Uniqueness of UMVUE)
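If $W$ is an UMVUE of a function $\tau(\theta)$ of a parameter $\theta$, then $W$ is unique.

**Proof.**
Assume that $W$ is an UMVUE of $\tau(\theta)$, and $W'$ is another UMVUE of $\tau(\theta)$.
Define the estimator $W^*=\frac{1}{2}(W+W')$.
Since $\mathbb{E}[W^*]=\frac{1}{2}(\mathbb{E}[W]+\mathbb{E}[W'])=\tau(\theta)$, $W^*$ is an unbiased estimator of $\tau(\theta)$.

Now, we consider the variance of $W^*$. Since $W$ and $W'$ are both UMVUEs, $\operatorname{Var}(W)=\operatorname{Var}(W')$, and so by the covariance inequality $\operatorname{Cov}(W,W')\le\sqrt{\operatorname{Var}(W)\operatorname{Var}(W')}$,
$$\operatorname{Var}(W^*)=\frac{1}{4}\operatorname{Var}(W)+\frac{1}{4}\operatorname{Var}(W')+\frac{1}{2}\operatorname{Cov}(W,W')\le\frac{1}{4}\operatorname{Var}(W)+\frac{1}{4}\operatorname{Var}(W')+\frac{1}{2}\sqrt{\operatorname{Var}(W)\operatorname{Var}(W')}=\operatorname{Var}(W).$$
Thus, either $\operatorname{Var}(W^*)<\operatorname{Var}(W)$ or $\operatorname{Var}(W^*)=\operatorname{Var}(W)$. If the former holds, then $W$ is *not* an UMVUE of $\tau(\theta)$ by definition, since we can find another unbiased estimator, namely $W^*$, with smaller variance than it. Hence, we must have the latter, i.e. equality holds in the covariance inequality:
$$\operatorname{Cov}(W,W')=\sqrt{\operatorname{Var}(W)\operatorname{Var}(W')}=\operatorname{Var}(W),$$
which implies that $W'=aW+b$ for some constants $a$ and $b$ (not depending on the sample).

Now, we consider the covariance $\operatorname{Cov}(W,W')=\operatorname{Cov}(W,aW+b)=a\operatorname{Var}(W)$. Comparing with $\operatorname{Cov}(W,W')=\operatorname{Var}(W)$ gives $a=1$ (assuming $\operatorname{Var}(W)>0$).

It remains to show that $b=0$ to prove that $W'=W$, and therefore conclude that $W$ is *unique*.

From above, we currently have $W'=W+b$. Taking expectations on both sides gives $\tau(\theta)=\tau(\theta)+b$, so $b=0$, as desired.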

**Remark.**

- Thus, when we are able to find an UMVUE, it is the unique one, and the variance of every other possible unbiased estimator is strictly greater than the variance of the UMVUE.

##### Cramer-Rao lower bound

Without using some results, it is quite difficult to determine the UMVUE, since there are many (perhaps even infinitely many) possible unbiased estimators, so it is quite hard to ensure that one particular unbiased estimator is relatively more efficient than every other possible unbiased estimator.

Therefore, we will introduce some approaches that help us to find the UMVUE.
For the first approach, we find a *lower bound* ^{[7]} on the variances of all possible unbiased estimators.
After getting such lower bound, if we can find an unbiased estimator with variance to be exactly equal to the lower bound, then the lower bound is the minimum value of the variances, and hence such unbiased estimator is an UMVUE by definition.

**Remark.**

- There are many possible lower bounds, but when the lower bound is greater, it is closer to the actual minimum value of the variances, and hence "better".
- An unbiased estimator can still be an UMVUE even if its variance does not achieve the lower bound.

A common way to find such lower bound is to use the *Cramer-Rao lower bound* (CRLB), and we get the CRLB through *Cramer-Rao inequality*.
Before stating the inequality, let us define some related terms.

**Definition.**
(Fisher information)
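The *Fisher information* about $\theta$ is
$$\mathcal{I}(\theta)=\mathbb{E}\left[\left(\frac{\partial\ln\mathcal{L}(\theta;\mathbf{X})}{\partial\theta}\right)^2\right],\qquad\mathbf{X}=(X_1,\dots,X_n).$$

**Remark.**

- $\frac{\partial\ln\mathcal{L}(\theta;\mathbf{X})}{\partial\theta}$ is called the *score*, and is denoted by $\mathcal{S}(\theta;\mathbf{X})$.
  - When the "$\theta$" in the log-likelihood function is a *parameter vector*, then the "$\theta$" in $\mathcal{I}(\theta)$ and $\mathcal{S}(\theta;\mathbf{X})$ is just referring to one of the parameters inside the parameter vector.
  - To be more clear, in this case, we may write $\mathcal{I}(\theta_i)=\mathbb{E}\left[\left(\frac{\partial\ln\mathcal{L}(\theta;\mathbf{X})}{\partial\theta_i}\right)^2\right]$, where "$\theta_i$" is a component of "$\theta$".
- Since $\mathbb{E}[\mathcal{S}(\theta;\mathbf{X})]=\int\frac{\partial\ln\mathcal{L}(\theta;\mathbf{x})}{\partial\theta}\mathcal{L}(\theta;\mathbf{x})\,d\mathbf{x}=\int\frac{\partial\mathcal{L}(\theta;\mathbf{x})}{\partial\theta}\,d\mathbf{x}$, and under some regularity conditions which allow interchange of derivative and integral, this equals $\frac{\partial}{\partial\theta}\int\mathcal{L}(\theta;\mathbf{x})\,d\mathbf{x}=\frac{\partial}{\partial\theta}1=0$, the Fisher information about $\theta$ is also the variance of the score, i.e. $\mathcal{I}(\theta)=\mathbb{E}[\mathcal{S}(\theta;\mathbf{X})^2]-(\mathbb{E}[\mathcal{S}(\theta;\mathbf{X})])^2=\operatorname{Var}(\mathcal{S}(\theta;\mathbf{X}))$.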
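The two facts in this remark, that the score has mean zero and that the Fisher information is the mean of the squared score, can be checked exactly for a small Bernoulli sample. The sketch below is illustrative only: the function names `score` and `probability` and the chosen values of `p` and `n` are assumptions. It enumerates all $2^n$ possible samples with exact fractions and compares the result with the known closed form $n/(p(1-p))$ for the Bernoulli distribution.

```python
from fractions import Fraction
from itertools import product

def score(p, sample):
    """Derivative of the Bernoulli log-likelihood with respect to p."""
    heads = sum(sample)
    return heads / p - (len(sample) - heads) / (1 - p)

def probability(p, sample):
    """Joint pmf of an independent Bernoulli sample."""
    heads = sum(sample)
    return p ** heads * (1 - p) ** (len(sample) - heads)

p = Fraction(3, 10)   # illustrative parameter value
n = 3                 # illustrative sample size
samples = list(product([0, 1], repeat=n))

mean_score = sum(probability(p, s) * score(p, s) for s in samples)
fisher = sum(probability(p, s) * score(p, s) ** 2 for s in samples)

print(mean_score, fisher, n / (p * (1 - p)))
```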

We have some results that assist us to compute the Fisher information.

**Proposition.**
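Let $X_1,\dots,X_n$ be an independent random sample from a distribution with pdf or pmf $f(x;\theta)$.
Also, let $\mathcal{I}_1(\theta)=\mathbb{E}\left[\left(\frac{\partial\ln f(X;\theta)}{\partial\theta}\right)^2\right]$, the Fisher information about $\theta$ when the sample size is one.
Then, under some regularity conditions which allow interchange of derivative and integral, $\mathcal{I}(\theta)=n\,\mathcal{I}_1(\theta)$.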

**Proof.**

**Proposition.**
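Under some regularity conditions which allow interchange of derivative and integral, $\mathcal{I}(\theta)=-\mathbb{E}\left[\frac{\partial^2\ln\mathcal{L}(\theta;\mathbf{X})}{\partial\theta^2}\right]$.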

**Proof.**