Statistics/Point Estimation

From Wikibooks, open books for an open world

Introduction

Usually, a random variable resulting from a random experiment is assumed to follow a certain distribution with an unknown (but fixed) parameter (vector) $\theta = (\theta_1, \dots, \theta_k)$ ($k$ is a positive integer, and its value depends on the distribution), taking value in a set $\Theta$, called the parameter space.

Remark.

  • In the context of frequentist statistics (the context here), parameters are regarded as fixed.
  • On the other hand, in the context of Bayesian statistics, parameters are regarded as random variables.

For example, suppose the random variable is assumed to follow a normal distribution $\mathcal{N}(\mu, \sigma^2)$. Then, in this case, the parameter vector $\theta = (\mu, \sigma^2)$ is unknown, and the parameter space is $\Theta = \{(\mu, \sigma^2) : \mu \in \mathbb{R},\ \sigma^2 > 0\}$. It is often useful to estimate those unknown parameters in some way to "understand" the random variable better. We would like to make sure the estimation is "good" enough, so that the understanding is more accurate.

Intuitively, the (realization of the) random sample $X_1, \dots, X_n$ should be useful. Indeed, the estimators introduced in this chapter are all based on the random sample in some sense, and this is what point estimates mean. To be more precise, let us define point estimation and point estimates.

Definition. (Point estimation) Point estimation is a process of using the value of a statistic to give a single value estimate (which can be interpreted as a point) of an unknown parameter.

Remark.

  • Recall that statistics are functions of a random sample.
  • We call the unknown parameter a population parameter (since the underlying distribution corresponding to the parameter is called a population).
  • The statistic used is called a point estimator, and its realization is called a point estimate.
  • The notation for a point estimator commonly has a hat, e.g. $\hat{\theta}$ denotes a point estimator of $\theta$.
  • Point estimation will be contrasted with interval estimation, which uses the value of a statistic to estimate an interval of plausible values of the unknown parameter.

Example. Suppose $X_1, \dots, X_n$ are a random sample from the normal distribution $\mathcal{N}(\mu, \sigma^2)$.

  • We may use the statistic $\overline{X} = \frac{1}{n}\sum_{i=1}^n X_i$ to estimate $\mu$ intuitively; $\overline{X}$ is called the point estimator, and its realization $\overline{x}$ is called the point estimate.
  • Alternatively, we may simply use the statistic $X_1$ (although it does not involve $X_2, \dots, X_n$, it can still be regarded as a function of $X_1, \dots, X_n$) to estimate $\mu$. That is, we use the value of the first random sample from the normal distribution as the point estimate of the mean of the distribution! Intuitively, such an estimator may seem quite "bad".
  • Such an estimator, which just takes one random sample directly, is called a single observation estimator.
  • We will later discuss how to evaluate how "good" a point estimator is.
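The two estimators in this example can be tried out numerically. A minimal sketch, assuming a small hypothetical set of realizations (the values below are illustrative, not from the text):

```python
import statistics

# Hypothetical realizations of a random sample of size 5
sample = [4.8, 5.3, 4.9, 5.6, 5.1]

# Point estimate from the sample-mean estimator: uses every observation
xbar = statistics.mean(sample)

# Point estimate from the single observation estimator: uses only the first
single = sample[0]

print(round(xbar, 2))  # 5.14
print(single)          # 4.8
```

Both are legitimate point estimates of the mean; how to judge which estimator is "better" is discussed later in this chapter.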

In the following, we will introduce two well-known point estimators, which are actually quite "good", namely the maximum likelihood estimator and the method of moments estimator.

Maximum likelihood estimator (MLE)

As suggested by its name, this estimator maximizes some kind of "likelihood". Now, we would like to know what "likelihood" we should maximize to estimate the unknown parameter(s) (in a "good" way). Also, as mentioned in the introduction section, the estimator is based on the random sample in some sense. Hence, this "likelihood" should also be based on the random sample in some sense.

To motivate the definition of maximum likelihood estimator, consider the following example.

Example. In a random experiment, a (fair or unfair) coin is tossed once. Let the random variable $X = 1$ if a head comes up, and $X = 0$ otherwise. Then, the pmf of $X$ is $f(x; p) = p^x(1-p)^{1-x}$, $x \in \{0, 1\}$, in which the unknown parameter $p$ represents the probability that a head comes up, and $0 \le p \le 1$.

Now, suppose you get a random sample $X_1, \dots, X_n$ by tossing that coin $n$ independent times (such a random sample is called an independent random sample, since the random variables involved are independent), and the corresponding realizations are $x_1, \dots, x_n$. Then, the probability for $X_1 = x_1, \dots, X_n = x_n$, i.e., that the random sample takes these realizations exactly, is
$$\mathbb{P}(X_1 = x_1, \dots, X_n = x_n; p) = \prod_{i=1}^n p^{x_i}(1-p)^{1-x_i} = p^{\sum_{i=1}^n x_i}(1-p)^{n - \sum_{i=1}^n x_i}.$$

Remark.

  • Remark on notation: You may observe that there is an additional "$;p$" in the pmf of $X$. Such notation means the pmf is with the parameter value $p$. It is included to emphasize the parameter value we are referring to.
  • In general, we write $f(x; \theta)$ for a pmf/pdf with the parameter value $\theta$ ($\theta$ may be a vector).
  • There are some alternative notations with the same meaning: $f(x \mid \theta)$ and $f_\theta(x)$.
  • Similarly, we have similar notations, e.g. $\mathbb{P}(A; \theta)$ and $\mathbb{P}_\theta(A)$, to mean the probability for event $A$ to happen, with the parameter value $\theta$. (It is more common to use the first notation: $\mathbb{P}(A; \theta)$.)
  • We also have similar notations for mean, variance, covariance, etc., like $\mathbb{E}[X; \theta]$, $\operatorname{Var}(X; \theta)$ and $\operatorname{Cov}(X, Y; \theta)$.

Intuitively, with these particular realizations (fixed), we would like to find a value of $p$ that maximizes this probability, i.e., makes the realizations obtained the ones that are "most probable" or "with maximum likelihood". Now, let us formally define the terms related to the MLE.

Definition. (Likelihood function) Let $X_1, \dots, X_n$ be a random sample with a joint pmf or pdf $f(x_1, \dots, x_n; \theta)$, where the parameter (vector) $\theta \in \Theta$ ($\Theta$ is the parameter space). Suppose $x_1, \dots, x_n$ are the corresponding realizations of the random sample $X_1, \dots, X_n$. Then, the likelihood function, denoted by $\mathcal{L}(\theta; x_1, \dots, x_n)$, is the function $\theta \mapsto f(x_1, \dots, x_n; \theta)$ ($\theta$ is a variable, and $x_1, \dots, x_n$ are fixed).

Remark.

  • For simplicity, we may use the notation $\mathcal{L}(\theta)$ instead of $\mathcal{L}(\theta; x_1, \dots, x_n)$. Sometimes, we may also just write "$\mathcal{L}$" for convenience.
  • When we replace $x_1, \dots, x_n$ by $X_1, \dots, X_n$, the resulting "likelihood function" becomes a random variable, and we denote it by $\mathcal{L}(\theta; X_1, \dots, X_n)$.
  • The likelihood function is in contrast with the joint pmf or pdf itself, where $\theta$ is fixed and $x_1, \dots, x_n$ are variables.
  • When the random sample comes from a discrete distribution, the value of the likelihood function is the probability $\mathbb{P}(X_1 = x_1, \dots, X_n = x_n; \theta)$ at the parameter (vector) $\theta$, that is, the probability of getting this specific realization exactly.
  • When the random sample comes from a continuous distribution, the value of the likelihood function is not a probability. Instead, it is only the value of the joint pdf at $(x_1, \dots, x_n)$ (which can be greater than one). However, the value can still be used to "reflect" the probability of getting "very close to" this specific realization, where that probability can be obtained by integrating the joint pdf over a "very small" region around $(x_1, \dots, x_n)$.
  • The natural logarithm of the likelihood function, $\ln \mathcal{L}(\theta)$ (sometimes denoted by $\ell(\theta)$), is called the log-likelihood function.
  • Notice that the "expression" of the likelihood function is actually the same as that of the joint pdf, and just the inputs are different. So, one may still integrate/sum the likelihood function with respect to $x_1, \dots, x_n$ (which changes the likelihood function to the joint pdf/pmf in such context in some sense) as if it were the joint pdf/pmf to get probabilities.

Definition. (Maximum likelihood estimate) Given a likelihood function $\mathcal{L}(\theta; x_1, \dots, x_n)$, a maximum likelihood estimate $\hat{\theta}$ of the parameter $\theta$ is a value at which $\mathcal{L}(\theta; x_1, \dots, x_n)$ is maximized.

Remark.

  • The maximum likelihood estimator (MLE) of $\theta$ is $\hat{\theta}(X_1, \dots, X_n)$ (obtained by replacing "$x_i$" in $\hat{\theta}$ by "$X_i$").
  • In some other places, the abbreviation MLE can also mean maximum likelihood estimate depending on the context. However, we will just use the abbreviation MLE when we are talking about the maximum likelihood estimator here.
  • Since the natural logarithm function is strictly increasing on its domain (the set of all positive real numbers), i.e., the output is larger when the input is larger, when we find a value at which $\ln \mathcal{L}(\theta)$ is maximized, $\mathcal{L}(\theta)$ is also maximized at the same value.

Now, let us find the MLE of the unknown parameter in the previous coin flipping example.

Example. (Motivating example revisited) Recall that we use a coin flipping example to motivate maximum likelihood estimation. $X$ follows the Bernoulli distribution with success probability $p$. The pmf of $X$ is $f(x; p) = p^x(1-p)^{1-x}$, $x \in \{0, 1\}$. $X_1, \dots, X_n$ is a random sample from the distribution.

  • The likelihood function is the joint pmf of $X_1, \dots, X_n$:
$$\mathcal{L}(p) = \prod_{i=1}^n p^{x_i}(1-p)^{1-x_i} = p^{\sum_{i=1}^n x_i}(1-p)^{n - \sum_{i=1}^n x_i}.$$

  • The log-likelihood function is thus
$$\ell(p) = \ln \mathcal{L}(p) = \left(\sum_{i=1}^n x_i\right)\ln p + \left(n - \sum_{i=1}^n x_i\right)\ln(1-p).$$

  • To find the maximum of the log-likelihood function, we may use the derivative test learnt in Calculus. Differentiating $\ell(p)$ with respect to $p$ gives
$$\ell'(p) = \frac{\sum_{i=1}^n x_i}{p} - \frac{n - \sum_{i=1}^n x_i}{1-p}.$$

  • To find critical point(s) of $\ell$, we set $\ell'(p) = 0$, which gives (we have $0 < p < 1$) $p = \frac{1}{n}\sum_{i=1}^n x_i = \overline{x}$.
  • To verify that $\ell$ actually attains a maximum (instead of a minimum) at $p = \overline{x}$, we need to perform a derivative test. In this case, we use the first derivative test.
  • We can see that when $p < \overline{x}$, we have $\left(\sum_{i=1}^n x_i\right)(1-p) > \left(n - \sum_{i=1}^n x_i\right)p$, which makes $\ell'(p) > 0$, and thus $\ell$ is strictly increasing there. On the other hand, when $p > \overline{x}$, this makes $\ell'(p) < 0$, and thus $\ell$ is strictly decreasing there. As a result, we can conclude that $\ell$ attains its maximum at $p = \overline{x}$. It follows that the MLE of $p$ is $\overline{X}$ (not $\overline{x}$, which is instead the maximum likelihood estimate!)
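The derivation can be checked numerically: evaluating the Bernoulli log-likelihood on a fine grid of candidate values, its maximizer coincides with the sample mean. A small sketch with hypothetical coin-toss realizations:

```python
import math

def log_likelihood(p, xs):
    """Bernoulli log-likelihood: (sum x_i) ln p + (n - sum x_i) ln(1 - p)."""
    s, n = sum(xs), len(xs)
    return s * math.log(p) + (n - s) * math.log(1 - p)

# Hypothetical coin-toss realizations: 7 heads out of 10 tosses
xs = [1, 1, 0, 1, 1, 0, 1, 1, 0, 1]

# Maximize over a fine grid of p values in (0, 1)
grid = [i / 1000 for i in range(1, 1000)]
p_hat = max(grid, key=lambda p: log_likelihood(p, xs))

print(p_hat)              # 0.7, matching the derivation
print(sum(xs) / len(xs))  # 0.7, the sample mean
```

Because the log-likelihood is strictly concave here, the grid maximizer lands exactly on the sample mean whenever the mean is a grid point.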

Exercise. Use the second derivative test to verify that $\ell$ attains a maximum at $p = \overline{x}$.

Solution
  • Since $\ell''(\overline{x}) = -\dfrac{\sum_{i=1}^n x_i}{\overline{x}^2} - \dfrac{n - \sum_{i=1}^n x_i}{(1-\overline{x})^2} = -\dfrac{n}{\overline{x}(1-\overline{x})}$, in which the numerator is negative and the denominator is positive, we have $\ell''(\overline{x}) < 0$. By the second derivative test, this means $\ell$ attains a maximum at $p = \overline{x}$.


Sometimes, a constraint is imposed on the parameter when we are finding its MLE. The MLE of the parameter in this case is called a restricted MLE. We will illustrate this in the following example.

Example. Continue from the previous coin flipping example. Suppose we have a constraint on $p$ where $0 \le p \le 1/2$. Find the MLE of $p$ in this case.

Solution: The steps for deriving the likelihood function and the log-likelihood function are the same in this case. Without the restriction, the MLE of $p$ is $\overline{X}$. Now, with the restriction, the MLE of $p$ is $\overline{X}$ only when $\overline{x} \le 1/2$ (we always have $\overline{x} \ge 0$ since each $x_i \ge 0$).

If $\overline{x} > 1/2$ (and thus $\overline{x} \notin [0, 1/2]$), even though $\ell$ is maximized at $p = \overline{x}$, we cannot set the MLE to be $\overline{X}$ due to the restriction on $p$: $0 \le p \le 1/2$. In this case, since $\ell'(p) > 0$ when $p < \overline{x}$ (we have shown this in the previous example), $\ell$ is strictly increasing when $p < \overline{x}$. Thus, $\ell$ is maximized at $p = 1/2$ under the restriction. As a result, the MLE of $p$ is $1/2$ (the MLE can be a constant, which can still be regarded as a function of $X_1, \dots, X_n$).

Therefore, the MLE of $p$ can be written as a case-defined function: $\hat{p} = \begin{cases} \overline{X}, & \overline{X} \le 1/2, \\ 1/2, & \overline{X} > 1/2, \end{cases}$ or it can be written as $\hat{p} = \min\{\overline{X}, 1/2\}$.


Exercise. Find the MLE of $p$ when $1/2 \le p \le 1$.

Solution
  • When $\overline{x} < 1/2$, we cannot set the MLE to be $\overline{X}$ due to the restriction. In this case, we know that $\ell'(p) < 0$ when $p > \overline{x}$, i.e., $\ell$ is strictly decreasing when $p > \overline{x}$. Thus, $\ell$ is maximized at $p = 1/2$ under the restriction, and so the MLE of $p$ is $1/2$.
  • When $\overline{x} \ge 1/2$, we can set the MLE to be $\overline{X}$, at which $\ell$ is maximized, and so $\overline{X}$ is the MLE of $p$ in this case.
  • Therefore, the MLE of $p$ is $\max\{\overline{X}, 1/2\}$.


To find the MLE, we sometimes use methods other than the derivative test, and we do not always need to find the log-likelihood function. Let us illustrate this in the following example.

Example. Let $X_1, \dots, X_n$ be a random sample from the uniform distribution on $[0, \theta]$, where $\theta > 0$. Find the MLE of $\theta$.

Solution: The pdf of the uniform distribution is $f(x; \theta) = \frac{1}{\theta}\mathbf{1}\{0 \le x \le \theta\}$. Thus, the likelihood function is $\mathcal{L}(\theta) = \frac{1}{\theta^n}\prod_{i=1}^n \mathbf{1}\{0 \le x_i \le \theta\}$.

In order for $\mathcal{L}(\theta)$ to attain a maximum, first, we need to ensure that $0 \le x_i \le \theta$ for each $i$, so that the product of the indicator functions in the likelihood function is nonzero (the value is actually one in this case). Apart from that, since $1/\theta^n$ is a strictly decreasing function of $\theta$ (because $\frac{d}{d\theta}\theta^{-n} = -n\theta^{-n-1} < 0$ (we have $\theta > 0$)), we should pick a $\theta$ that is as small as possible so that $1/\theta^n$, and hence $\mathcal{L}(\theta)$, is as large as possible.

As a result, we should choose a $\theta$ that is as small as possible, subject to the constraint that $0 \le x_i \le \theta$ for each $i$, which means that $\theta \ge x_i$ (it is always the case that $x_i \ge 0$, regardless of the choice of $\theta$) for each $i$. It follows that $\mathcal{L}(\theta)$ attains its maximum when $\theta$ is the maximum of $x_1, \dots, x_n$. Hence, the MLE of $\theta$ is $\max\{X_1, \dots, X_n\}$.
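The argument can be illustrated numerically: scanning candidate values of $\theta$, the likelihood is zero below the sample maximum and strictly decreasing above it. The realizations below are hypothetical:

```python
# Uniform[0, theta] likelihood: theta^(-n) if all 0 <= x_i <= theta, else 0
def likelihood(theta, xs):
    if theta <= 0 or max(xs) > theta:
        return 0.0               # some indicator is zero
    return theta ** (-len(xs))   # strictly decreasing in theta

# Hypothetical realizations
xs = [0.9, 2.4, 1.7, 3.1, 0.5]

# Scan candidate theta values from 0.01 to 10.00
grid = [i / 100 for i in range(1, 1001)]
theta_hat = max(grid, key=lambda t: likelihood(t, xs))

print(theta_hat)  # 3.1, i.e., the sample maximum
```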


Exercise. Show that the MLE of $\theta$ does not exist if the uniform distribution becomes the uniform distribution on $[0, \theta)$.

Solution

Proof. In this case, the constraint from the indicator functions becomes $0 \le x_i < \theta$ for each $i$. With a similar argument, for the MLE of $\theta$, we should choose a $\theta$ that is as small as possible subject to this constraint, which means $\theta > x_i$ for each $i$. However, in this case, we cannot set $\theta$ to be the maximum of $x_1, \dots, x_n$, or else the constraint will not be satisfied and the likelihood function becomes zero due to the indicator function. Instead, we should set $\theta$ to be slightly greater than the maximum of $x_1, \dots, x_n$, so that the constraint can still be satisfied, and $\theta$ is quite small. However, for each such $\theta$, we can always choose a smaller $\theta^*$ that still satisfies the constraint. For example, for each such $\theta$, the smaller $\theta^*$ can be selected as $\theta^* = \frac{\theta + \max\{x_1, \dots, x_n\}}{2}$. Hence, we cannot find a minimum value of $\theta$ subject to this constraint. Thus, there is no maximum point for $\mathcal{L}(\theta)$, and hence the MLE does not exist.



In the following example, we will find the MLE of a parameter vector.

Example. Let $X_1, \dots, X_n$ be a random sample from the normal distribution with mean $\mu$ and variance $\sigma^2$, $\mathcal{N}(\mu, \sigma^2)$. Find the MLE of $(\mu, \sigma^2)$.

Solution: Let $v = \sigma^2$. The likelihood function is $\mathcal{L}(\mu, v) = (2\pi v)^{-n/2}\exp\left(-\frac{1}{2v}\sum_{i=1}^n (x_i - \mu)^2\right)$, and hence the log-likelihood function is $\ell(\mu, v) = -\frac{n}{2}\ln(2\pi) - \frac{n}{2}\ln v - \frac{1}{2v}\sum_{i=1}^n (x_i - \mu)^2$. Since this function is multivariate, we may use the second partial derivative test from multivariable calculus to find maximum point(s). However, in this case, we actually do not need to use such a test. Instead, we fix the variables one by one to make the function univariate, so that we can use the derivative test for a univariate function to find the maximum point (with the other variable fixed).

Since
$$\frac{\partial \ell}{\partial \mu} = \frac{1}{v}\sum_{i=1}^n (x_i - \mu),$$

setting $\frac{\partial \ell}{\partial \mu} = 0$ gives $\mu = \frac{1}{n}\sum_{i=1}^n x_i = \overline{x}$.

Also,
$$\frac{\partial \ell}{\partial v} = -\frac{n}{2v} + \frac{1}{2v^2}\sum_{i=1}^n (x_i - \mu)^2,$$

and setting $\frac{\partial \ell}{\partial v} = 0$ gives $v = \frac{1}{n}\sum_{i=1}^n (x_i - \mu)^2$. Notice that the critical point $\mu = \overline{x}$ is independent of $v$ (this is important for us to use this kind of method).

Since
$$\frac{\partial^2 \ell}{\partial \mu^2} = -\frac{n}{v} < 0,$$

by the second derivative test (for a univariate function), $\ell$ is maximized at $\mu = \overline{x}$, given any fixed $v$.

On the other hand, since
$$\frac{\partial^2 \ell}{\partial v^2} = \frac{n}{2v^2} - \frac{1}{v^3}\sum_{i=1}^n (x_i - \mu)^2,$$

at the critical point $v = \frac{1}{n}\sum_{i=1}^n (x_i - \mu)^2$ we thus have
$$\frac{\partial^2 \ell}{\partial v^2} = \frac{n}{2v^2} - \frac{nv}{v^3} = -\frac{n}{2v^2} < 0 \quad \text{(since $v > 0$)}.$$

Thus, by the second derivative test, $\ell$ is maximized at $v = \frac{1}{n}\sum_{i=1}^n (x_i - \mu)^2$, given any fixed $\mu$.

So, now we fix $\mu = \overline{x}$, and thus $\ell(\overline{x}, v)$ is maximized at $v = \frac{1}{n}\sum_{i=1}^n (x_i - \overline{x})^2 = \frac{n-1}{n}s^2$, where $s^2$ is the realization of the sample variance $S^2 = \frac{1}{n-1}\sum_{i=1}^n (X_i - \overline{X})^2$. Now, fix $v$ to be $\frac{1}{n}\sum_{i=1}^n (x_i - \overline{x})^2$, and we know that $\ell(\mu, v)$ attains its maximum at $\mu = \overline{x}$ for each fixed $v$, including this fixed $v$. As a result, $\ell$ is maximized at $(\mu, v) = \left(\overline{x}, \frac{1}{n}\sum_{i=1}^n (x_i - \overline{x})^2\right)$. Hence, the MLE of $(\mu, \sigma^2)$ is $\left(\overline{X}, \frac{1}{n}\sum_{i=1}^n (X_i - \overline{X})^2\right)$.


Exercise.

(a) Calculate the determinant of the Hessian matrix of $\ell$ at $(\mu, v) = (\overline{x}, \hat{v})$, where $\hat{v} = \frac{1}{n}\sum_{i=1}^n (x_i - \overline{x})^2$, which can be expressed as $\dfrac{n^2}{2\hat{v}^3}$.

(b) Hence, verify that $(\overline{x}, \hat{v})$ is a maximum point of $\ell$ using the second partial derivative test.


Solution

(a) First, at $(\mu, v) = (\overline{x}, \hat{v})$,
$$\frac{\partial^2 \ell}{\partial \mu^2} = -\frac{n}{\hat{v}}, \qquad \frac{\partial^2 \ell}{\partial v^2} = -\frac{n}{2\hat{v}^2}, \qquad \frac{\partial^2 \ell}{\partial \mu\,\partial v} = -\frac{1}{\hat{v}^2}\sum_{i=1}^n (x_i - \overline{x}) = 0.$$

As a result, the determinant of the Hessian matrix is $\left(-\dfrac{n}{\hat{v}}\right)\left(-\dfrac{n}{2\hat{v}^2}\right) - 0^2 = \dfrac{n^2}{2\hat{v}^3}$.

(b) From (a), the determinant of the Hessian matrix is positive. Also, $\frac{\partial^2 \ell}{\partial \mu^2} = -\frac{n}{\hat{v}} < 0$. Thus, by the second partial derivative test, $\ell$ attains a maximum at $(\overline{x}, \hat{v})$.



Exercise. Let $X_1, \dots, X_n$ be a random sample from the exponential distribution with rate parameter $\lambda$, with pdf $f(x; \lambda) = \lambda e^{-\lambda x}$, $x > 0$, where $\lambda > 0$. Show that the MLE of $\lambda$ is $1/\overline{X}$.

Solution

Proof. The likelihood function is $\mathcal{L}(\lambda) = \prod_{i=1}^n \lambda e^{-\lambda x_i} = \lambda^n e^{-\lambda \sum_{i=1}^n x_i}$. Thus, the log-likelihood function is $\ell(\lambda) = n\ln\lambda - \lambda\sum_{i=1}^n x_i$. Differentiating the log-likelihood function with respect to $\lambda$ gives $\ell'(\lambda) = \frac{n}{\lambda} - \sum_{i=1}^n x_i$. Setting the derivative to be zero, we get $\lambda = \frac{n}{\sum_{i=1}^n x_i} = \frac{1}{\overline{x}}$. It remains to verify that $\ell$ attains a maximum at $\lambda = 1/\overline{x}$. Since $\ell''(\lambda) = -\frac{n}{\lambda^2} < 0$, this is verified. Hence, the MLE of $\lambda$ is $1/\overline{X}$.


Example. (Application of maximum likelihood estimation) Suppose you are given a box which contains four balls, with an unknown number of red and black balls. Now, you draw three balls out of the box, and find out that you get two red balls and one black ball. Using maximum likelihood estimation, estimate the number of red and black balls inside the box.

Solution: Given the colors of the balls drawn, we know that the box contains at least two red balls and at least one black ball. This means the box contains either two red balls or three red balls. Let $r$ be the number of red balls inside the box. Then, the number of black balls inside the box is $4 - r$. The possible values of the parameter $r$ are 2 and 3.

Now, we compare the probability of getting such a result from drawing three balls when $r = 2$ and when $r = 3$.

  • For $r = 2$, the probability is $\dfrac{\binom{2}{2}\binom{2}{1}}{\binom{4}{3}} = \dfrac{2}{4} = \dfrac{1}{2}$ (consider the pmf of the hypergeometric distribution).
  • For $r = 3$, the probability is $\dfrac{\binom{3}{2}\binom{1}{1}}{\binom{4}{3}} = \dfrac{3}{4}$.

Hence, the maximum likelihood estimate of $r$ is 3. Thus, the estimated number of red balls is 3, and that of black balls is 1.
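The two hypergeometric probabilities can be computed directly; a small sketch using Python's `math.comb`:

```python
from math import comb

# P(draw 2 red and 1 black | r red balls among the 4), hypergeometric pmf
def prob(r):
    return comb(r, 2) * comb(4 - r, 1) / comb(4, 3)

print(prob(2))                  # 0.5
print(prob(3))                  # 0.75
r_hat = max([2, 3], key=prob)   # maximum likelihood estimate
print(r_hat)                    # 3
```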


Exercise. Suppose the box now contains 100 balls, with an unknown number of red and black balls. Now, you draw 99 balls out of the box, and find out that you get 98 red balls and one black ball. Using maximum likelihood estimation, estimate the number of red and black balls inside the box.

Solution

Similarly, the box contains at least 98 red balls and one black ball. We use the same notation as in the above example. Then, the number of black balls is $100 - r$, and the possible values of the parameter $r$ are 98 and 99.

  • For $r = 98$, the probability is $\dfrac{\binom{98}{98}\binom{2}{1}}{\binom{100}{99}} = \dfrac{2}{100} = \dfrac{1}{50}$.
  • For $r = 99$, the probability is $\dfrac{\binom{99}{98}\binom{1}{1}}{\binom{100}{99}} = \dfrac{99}{100}$.

Thus, the maximum likelihood estimate of $r$ is 99. Hence, the estimated number of red balls is 99 and that of black balls is 1.

Remark.

  • The difference between the probabilities of the two possible values of $r$ becomes much larger in this case.
  • Intuitively, when you have such a draw result, you will think that it is quite unlikely that the box has two black balls inside, i.e., that the ball that is not drawn is actually black, and somehow you draw all the red balls out, but not the black ball.




Method of moments estimator (MME)

For maximum likelihood estimation, we need to utilize the likelihood function, which is found from the joint pmf or pdf of the random sample from a distribution. However, we may not know exactly the pmf or pdf of the distribution in practice. Instead, we may just know some information about the distribution, e.g. mean, variance, and some moments (the $k$th moment of a random variable $X$ is $\mathbb{E}[X^k]$; we denote it by $\mu_k$ for simplicity). Such moments often contain information about the unknown parameter. For example, for a normal distribution $\mathcal{N}(\mu, \sigma^2)$, we know that $\mu_1 = \mu$ and $\mu_2 = \sigma^2 + \mu^2$. Because of this, when we want to estimate the parameters, we can do so through estimating the moments.

Now, we would like to know how to estimate the moments. We let $M_k = \frac{1}{n}\sum_{i=1}^n X_i^k$ be the $k$th sample moment, where the $X_i$'s are independent and identically distributed. By the weak law of large numbers (assuming the conditions are satisfied), we have

  • $M_1 \overset{p}{\to} \mu_1$ and $M_2 \overset{p}{\to} \mu_2$ (the latter can be seen from replacing the "$X_i$" by "$X_i^2$" in the weak law of large numbers; the conditions are still satisfied, and so we can still apply the weak law of large numbers).

In general, we have $M_k \overset{p}{\to} \mu_k$, since the conditions are still satisfied after replacing the "$X_i$" by "$X_i^k$" in the weak law of large numbers.

Because of these results, we can estimate the $k$th moment $\mu_k$ using the $k$th sample moment $M_k$, and the estimation is "better" when $n$ is large. For example, in the above normal distribution example, we can estimate $\mu_1 = \mu$ by $M_1$ and $\mu_2 = \sigma^2 + \mu^2$ by $M_2$, and the resulting estimators are actually called method of moments estimators.

To be more precise, we have the following definition of the method of moments:

Definition. (Method of moments) Let $X_1, \dots, X_n$ be a random sample from a distribution with pdf or pmf $f(x; \theta_1, \dots, \theta_k)$. Write the moments $\mu_1, \dots, \mu_k$ as functions of $\theta_1, \dots, \theta_k$: $\mu_1 = g_1(\theta_1, \dots, \theta_k), \dots, \mu_k = g_k(\theta_1, \dots, \theta_k)$ respectively. Then, the method of moments estimators (MME) of $\theta_1, \dots, \theta_k$, respectively, are given by the solution (in the form of $\theta_1, \dots, \theta_k$ in terms of the sample moments $M_1, \dots, M_k$, corresponding to the moments $\mu_1, \dots, \mu_k$) to the following system of equations:
$$g_1(\theta_1, \dots, \theta_k) = M_1, \quad \dots, \quad g_k(\theta_1, \dots, \theta_k) = M_k.$$

Remark.

  • When there are $k$ unknown parameters, we need to solve a system of $k$ equations, involving $k$ sample moments.
  • Usually, we select the first $k$ moments $\mu_1, \dots, \mu_k$, as in the definition. But this is not necessary, and we may choose other moments, including fractional moments (e.g. $\mathbb{E}[X^{1/2}]$, and we use $M_{1/2} = \frac{1}{n}\sum_{i=1}^n X_i^{1/2}$ in this case).
  • Because of this, the method of moments estimator is not unique.

Example. Let $X_1, \dots, X_n$ be a random sample from the normal distribution $\mathcal{N}(\mu, \sigma^2)$. Find the MMEs of $\mu$ and $\sigma^2$.

Solution: First, there are two unknown parameters. Thus, we need to solve a system of 2 equations, involving 2 sample moments and 2 moments. Since $\mu_1 = \mu$ and $\mu_2 = \sigma^2 + \mu^2$, consider the following system of equations:
$$\begin{cases} \mu = M_1, \\ \sigma^2 + \mu^2 = M_2. \end{cases}$$

Substituting $\mu = M_1$ into the second equation, we get $\sigma^2 = M_2 - M_1^2$. Hence, the MME of $\mu$ is $M_1 = \overline{X}$ and the MME of $\sigma^2$ is $M_2 - M_1^2 = \frac{1}{n}\sum_{i=1}^n X_i^2 - \overline{X}^2 = \frac{1}{n}\sum_{i=1}^n (X_i - \overline{X})^2$.
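A quick numeric sketch (with hypothetical realizations) confirming that the MME of $\sigma^2$ computed from the sample moments agrees with the direct formula:

```python
xs = [2.1, 1.4, 3.3, 2.7, 1.9]   # hypothetical realizations
n = len(xs)

m1 = sum(xs) / n                 # first sample moment M1
m2 = sum(x * x for x in xs) / n  # second sample moment M2

mu_mme = m1                      # solves  mu = M1
var_mme = m2 - m1 ** 2           # solves  sigma^2 + mu^2 = M2

# Agrees with the direct "biased variance" formula (and with the MLE)
direct = sum((x - m1) ** 2 for x in xs) / n
assert abs(var_mme - direct) < 1e-9

print(round(mu_mme, 4), round(var_mme, 4))  # 2.28 0.4336
```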

Remark.

  • We can see that the process of finding the MMEs of $\mu$ and $\sigma^2$ is much easier than finding the MLEs of $\mu$ and $\sigma^2$. This is because the expressions of the first and second moments in terms of the parameters are simple in this case. However, when the expressions are more complicated, finding the MMEs of the parameters can be quite complicated.

Example. Let $X_1, \dots, X_n$ be a random sample from the exponential distribution with rate parameter $\lambda$. Find the MME of $\lambda$ and compare it to the MLE of $\lambda$.

Solution: Since $\mu_1 = 1/\lambda$, consider the following equation: $\frac{1}{\lambda} = M_1$. We then have $\lambda = \frac{1}{M_1} = \frac{1}{\overline{X}}$. Hence, the MME of $\lambda$ is $1/\overline{X}$, which happens to be the same as the MLE of $\lambda$.


Exercise. Let $X_1, \dots, X_n$ be a random sample from the uniform distribution on $[a, b]$. Show that the MMEs of $a$ and $b$ are $M_1 - \sqrt{3(M_2 - M_1^2)}$ and $M_1 + \sqrt{3(M_2 - M_1^2)}$ respectively.

Solution

Proof. Since $\mu_1 = \frac{a+b}{2}$ and $\mu_2 = \operatorname{Var}(X) + \mu_1^2 = \frac{(b-a)^2}{12} + \frac{(a+b)^2}{4}$, consider the following system of equations:
$$\begin{cases} \dfrac{a+b}{2} = M_1, \\[2pt] \dfrac{(b-a)^2}{12} + \dfrac{(a+b)^2}{4} = M_2. \end{cases}$$

From the first equation, we have $b = 2M_1 - a$. Substituting it into the second equation, we have $\frac{(2M_1 - 2a)^2}{12} + M_1^2 = M_2$, i.e., $(M_1 - a)^2 = 3(M_2 - M_1^2)$. Solving this equation by the quadratic formula, we get $a = M_1 \pm \sqrt{3(M_2 - M_1^2)}$.

When $a = M_1 + \sqrt{3(M_2 - M_1^2)}$, we have $b = 2M_1 - a = M_1 - \sqrt{3(M_2 - M_1^2)} \le a$. However, from the definition of the uniform distribution, we need to have $a < b$, and thus this case is rejected.

When $a = M_1 - \sqrt{3(M_2 - M_1^2)}$, we have $b = M_1 + \sqrt{3(M_2 - M_1^2)} > a$, which satisfies the definition of the uniform distribution.

Thus, we have the desired result.


Properties of estimators

In this section, we will introduce some criteria for evaluating how "good" a point estimator is, namely unbiasedness, efficiency and consistency.

Unbiasedness

For $\hat{\theta}$ to be a "good" estimator of a parameter $\theta$, a desirable property of $\hat{\theta}$ is that its expected value equals the value of the parameter $\theta$, or is at least close to that value. Because of this, we introduce a value, namely the bias, to measure how close the mean of $\hat{\theta}$ is to $\theta$.

Definition. (Bias) The bias of an estimator $\hat{\theta}$ of a parameter $\theta$ is $\operatorname{bias}(\hat{\theta}) = \mathbb{E}[\hat{\theta}] - \theta$.

We will also define some terms related to bias.

Definition. ((Un)biased estimator) An estimator $\hat{\theta}$ is an unbiased estimator of a parameter $\theta$ if $\mathbb{E}[\hat{\theta}] = \theta$. Otherwise, the estimator is called a biased estimator.

Definition. (Asymptotically unbiased estimator) An estimator $\hat{\theta}$ is an asymptotically unbiased estimator of a parameter $\theta$ if $\lim_{n \to \infty} \mathbb{E}[\hat{\theta}] = \theta$, where $n$ is the sample size.

Remark.

  • An unbiased estimator must be an asymptotically unbiased estimator, but the converse is not true, i.e., an asymptotically unbiased estimator may not be an unbiased estimator. Thus, a biased estimator may be an asymptotically unbiased estimator.
  • When we discuss the goodness of estimators in terms of unbiasedness, an unbiased estimator is better than an asymptotically unbiased (but biased) estimator, which is in turn better than a biased estimator that is not even asymptotically unbiased.
  • However, there are also other criteria for evaluating the goodness of estimators apart from unbiasedness, so when we also account for other criteria, a biased estimator may be somehow "better" than an unbiased estimator overall.

Example. Let $X_1, \dots, X_n$ be a random sample from the Bernoulli distribution with success probability $p$. Show that the MLE of $p$, $\overline{X}$, is an unbiased estimator of $p$.

Proof. Since $\mathbb{E}[\overline{X}] = \frac{1}{n}\sum_{i=1}^n \mathbb{E}[X_i] = \frac{np}{n} = p$, the result follows.


Exercise. Suppose the Bernoulli distribution is replaced by the binomial distribution with $m$ trials and success probability $p$. Show that $\overline{X}$ is a biased estimator of $p$. Modify this estimator such that it becomes an unbiased estimator of $p$.

Solution

Proof. Since $\mathbb{E}[\overline{X}] = \frac{1}{n}\sum_{i=1}^n \mathbb{E}[X_i] = mp$, which does not equal $p$ in general, $\overline{X}$ is a biased estimator of $p$.

We can modify this estimator to $\overline{X}/m$, and then its mean is $p$. Alternatively, we may choose the estimator to be $X_1/m$ (a single observation estimator), whose mean is also $p$. (Other estimators whose mean is $p$ are also fine.)


Example. Let $X_1, \dots, X_n$ be a random sample from the normal distribution $\mathcal{N}(\mu, \sigma^2)$. Show that the MLE of $\mu$, $\overline{X}$, is an unbiased estimator of $\mu$, and the MLE of $\sigma^2$, $\hat{\sigma}^2 = \frac{1}{n}\sum_{i=1}^n (X_i - \overline{X})^2$, is an asymptotically unbiased estimator of $\sigma^2$.

Proof. First, since $\mathbb{E}[\overline{X}] = \frac{1}{n}\sum_{i=1}^n \mathbb{E}[X_i] = \mu$, $\overline{X}$ is an unbiased estimator of $\mu$.

On the other hand, since $\hat{\sigma}^2 = \frac{n-1}{n}S^2$, where $S^2 = \frac{1}{n-1}\sum_{i=1}^n (X_i - \overline{X})^2$ is the sample variance with $\mathbb{E}[S^2] = \sigma^2$, we have
$$\mathbb{E}[\hat{\sigma}^2] = \frac{n-1}{n}\sigma^2.$$

Thus, $\lim_{n \to \infty} \mathbb{E}[\hat{\sigma}^2] = \lim_{n \to \infty} \frac{n-1}{n}\sigma^2 = \sigma^2$, as desired.
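Using the formula $\mathbb{E}[\hat{\sigma}^2] = \frac{n-1}{n}\sigma^2$ derived above, the bias $-\sigma^2/n$ shrinks to zero as $n$ grows; a small sketch with an assumed value $\sigma^2 = 2$:

```python
sigma2 = 2.0   # assumed true variance of the normal population

# E[sigma_hat^2] = (n - 1) / n * sigma2, so bias = -sigma2 / n -> 0
biases = {n: (n - 1) / n * sigma2 - sigma2 for n in (2, 10, 100, 10_000)}
for n, b in biases.items():
    print(n, round(b, 6))   # bias shrinks toward 0 as n grows
```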


Exercise. Modify the estimator $\hat{\sigma}^2$ of $\sigma^2$ such that it becomes an unbiased estimator.

Solution

The estimator can be modified as $\frac{n}{n-1}\hat{\sigma}^2 = \frac{1}{n-1}\sum_{i=1}^n (X_i - \overline{X})^2 = S^2$, the sample variance.


Efficiency

We have discussed how to evaluate the unbiasedness of estimators. Now, if we are given two unbiased estimators, $\hat{\theta}_1$ and $\hat{\theta}_2$, how should we compare their goodness? Their goodness is the same if we are only comparing them in terms of unbiasedness. Therefore, we need another criterion in this case. One possible way is to compare their variances, and the one with the smaller variance is better: on average, that estimator deviates less from its mean, which is the value of the unknown parameter by the definition of an unbiased estimator, and thus it is more accurate in some deviation sense. Indeed, an unbiased estimator can still have a large variance, and thus deviate a lot from its mean. Such an estimator is unbiased because the positive deviations and negative deviations somehow cancel each other out. This is the idea of efficiency.

Definition. (Efficiency) Suppose $\hat{\theta}_1$ and $\hat{\theta}_2$ are two unbiased estimators of an unknown parameter $\theta$. The efficiency of $\hat{\theta}_1$ relative to $\hat{\theta}_2$ is $\operatorname{eff}(\hat{\theta}_1, \hat{\theta}_2) = \dfrac{\operatorname{Var}(\hat{\theta}_2)}{\operatorname{Var}(\hat{\theta}_1)}$. If $\operatorname{eff}(\hat{\theta}_1, \hat{\theta}_2) > 1$, then we say that $\hat{\theta}_1$ is relatively more efficient than $\hat{\theta}_2$.

Remark.

  • Since $\operatorname{eff}(\hat{\theta}_1, \hat{\theta}_2) > 1 \iff \operatorname{Var}(\hat{\theta}_2) > \operatorname{Var}(\hat{\theta}_1)$, the estimator with the smaller variance is relatively more efficient than the estimator with the larger variance.
  • Normally, the variances should be nonzero, and thus the efficiency is defined in normal cases.
  • Sometimes, it is also called relative efficiency, due to the fact that the efficiency describes "how many" $\operatorname{Var}(\hat{\theta}_1)$'s equal $\operatorname{Var}(\hat{\theta}_2)$.
  • One may ask why the ratio of variances is used in the definition to compare variances, instead of the difference in variances. A possible reason is that the ratio of variances does not have any unit (the units of the variances (if they exist) cancel each other out), but the difference in variances can have a unit. Also, using the ratio of variances allows us to compare different efficiencies numerically, calculated from different variances.

Actually, for the variance of an unbiased estimator, since the mean of the unbiased estimator is the unknown parameter $\theta$, it measures the mean of the squared deviation from $\theta$, and we have a specific term for this quantity, namely the mean squared error (MSE).

Definition. (Mean squared error) Suppose $\hat{\theta}$ is an estimator of a parameter $\theta$. The mean squared error (MSE) of $\hat{\theta}$ is $\operatorname{MSE}(\hat{\theta}) = \mathbb{E}\left[(\hat{\theta} - \theta)^2\right]$.

Remark.

  • From this definition, $\operatorname{MSE}(\hat{\theta})$ is the mean value of the square of the error $\hat{\theta} - \theta$, and hence the name mean squared error.

Notice that in the definition of MSE, we do not require $\hat{\theta}$ to be an unbiased estimator. Thus, $\hat{\theta}$ in the definition may be biased. We have mentioned that when $\hat{\theta}$ is unbiased, its variance is actually its MSE. In the following, we will give a more general relationship between $\operatorname{MSE}(\hat{\theta})$ and $\operatorname{Var}(\hat{\theta})$, not just for unbiased estimators.

Proposition. (Relationship between mean squared error and variance) If $\mathbb{E}[\hat{\theta}^2]$ exists, then $\operatorname{MSE}(\hat{\theta}) = \operatorname{Var}(\hat{\theta}) + \left(\operatorname{bias}(\hat{\theta})\right)^2$.

Proof. By definition, we have $\operatorname{MSE}(\hat{\theta}) = \mathbb{E}[(\hat{\theta} - \theta)^2]$ and $\operatorname{Var}(\hat{\theta}) = \mathbb{E}[(\hat{\theta} - \mathbb{E}[\hat{\theta}])^2]$. From these, we are motivated to write
$$\operatorname{MSE}(\hat{\theta}) = \mathbb{E}\left[\left(\hat{\theta} - \mathbb{E}[\hat{\theta}] + \mathbb{E}[\hat{\theta}] - \theta\right)^2\right] = \operatorname{Var}(\hat{\theta}) + 2\left(\mathbb{E}[\hat{\theta}] - \theta\right)\underbrace{\mathbb{E}\left[\hat{\theta} - \mathbb{E}[\hat{\theta}]\right]}_{=0} + \left(\mathbb{E}[\hat{\theta}] - \theta\right)^2 = \operatorname{Var}(\hat{\theta}) + \left(\operatorname{bias}(\hat{\theta})\right)^2,$$

as desired.
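The identity can be verified exactly on a finite distribution: treating a short list of estimator realizations as the whole distribution, expectations become plain averages. The values below are hypothetical:

```python
# Treat a short list of estimator realizations as its entire distribution,
# so expectations are plain averages and the identity holds exactly.
theta = 5.0                             # assumed true parameter value
estimates = [4.2, 5.9, 5.1, 4.7, 5.6]   # hypothetical estimator realizations

n = len(estimates)
mean = sum(estimates) / n
mse = sum((e - theta) ** 2 for e in estimates) / n
var = sum((e - mean) ** 2 for e in estimates) / n
bias = mean - theta

assert abs(mse - (var + bias ** 2)) < 1e-9
print(round(mse, 4), round(var + bias ** 2, 4))  # 0.382 0.382
```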

Example. Let $X_1, \dots, X_n$ ($n > 1$) be a random sample from $\mathcal{N}(\mu, \sigma^2)$.

(a) Show that the single observation estimator $X_1$ is an unbiased estimator of $\mu$.

(b) Calculate the MSEs of $X_1$ and $\overline{X}$ respectively.

(c) Which of $X_1$ and $\overline{X}$ is a better estimator of $\mu$ in terms of unbiasedness and efficiency?

Solution:

(a) Since $\mathbb{E}[X_1] = \mu$, the result follows.

(b) $\operatorname{MSE}(X_1) = \operatorname{Var}(X_1) = \sigma^2$, and $\operatorname{MSE}(\overline{X}) = \operatorname{Var}(\overline{X}) = \sigma^2/n$ (both estimators are unbiased, so their MSEs equal their variances).

(c) Since $\operatorname{eff}(\overline{X}, X_1) = \dfrac{\operatorname{Var}(X_1)}{\operatorname{Var}(\overline{X})} = \dfrac{\sigma^2}{\sigma^2/n} = n > 1$, $\overline{X}$ is relatively more efficient than $X_1$. Since both $X_1$ and $\overline{X}$ are unbiased estimators of $\mu$, we conclude that $\overline{X}$ is a better estimator of $\mu$ in terms of unbiasedness and efficiency.
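A small sketch of the efficiency computation in part (c), with assumed values for $\sigma^2$ and $n$:

```python
sigma2, n = 3.0, 4   # assumed variance and sample size

var_x1 = sigma2          # Var(X1)
var_xbar = sigma2 / n    # Var(Xbar)

# Efficiency of Xbar relative to X1: Var(X1) / Var(Xbar) = n
eff = var_x1 / var_xbar
print(eff)   # 4.0, so Xbar is relatively more efficient whenever n > 1
```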


Exercise. In addition to the random sample with sample size $n$ in the example, suppose we take another random sample with sample size $m \ne n$. Let $\overline{X}_n$ and $\overline{X}_m$ denote the sample means for the samples with sample sizes $n$ and $m$ respectively.

(a) Calculate $\operatorname{eff}(\overline{X}_n, \overline{X}_m)$.

(b) State the condition on the sample sizes $n$ and $m$ under which $\overline{X}_n$ is relatively more efficient than $\overline{X}_m$.


Solution

(a) Since $\operatorname{Var}(\overline{X}_n) = \sigma^2/n$ (from the example), and $\operatorname{Var}(\overline{X}_m) = \sigma^2/m$ (by similar arguments as in the example),

$\operatorname{eff}(\overline{X}_n, \overline{X}_m) = \dfrac{\operatorname{Var}(\overline{X}_m)}{\operatorname{Var}(\overline{X}_n)} = \dfrac{\sigma^2/m}{\sigma^2/n} = \dfrac{n}{m}$.

(b) Since $\operatorname{eff}(\overline{X}_n, \overline{X}_m) > 1 \iff n/m > 1$, the condition is $n > m$.

Remark.

  • This shows that the sample mean with a larger sample size is relatively more efficient than the one with a smaller sample size.



Proposition. $\lim_{n \to \infty} \operatorname{MSE}(\hat{\theta}) = 0$ if and only if $\lim_{n \to \infty} \operatorname{Var}(\hat{\theta}) = 0$ and $\lim_{n \to \infty} \operatorname{bias}(\hat{\theta}) = 0$.

Proof.

  • The "if" part is simple. Assume $\lim_{n \to \infty} \operatorname{Var}(\hat{\theta}) = 0$ and $\lim_{n \to \infty} \operatorname{bias}(\hat{\theta}) = 0$. Then, $\lim_{n \to \infty} \operatorname{MSE}(\hat{\theta}) = \lim_{n \to \infty} \left[\operatorname{Var}(\hat{\theta}) + \left(\operatorname{bias}(\hat{\theta})\right)^2\right] = 0 + 0^2 = 0$.
  • "Only if" part: we can use proof by contrapositive, i.e., proving that if $\lim_{n \to \infty} \operatorname{Var}(\hat{\theta}) \ne 0$ or $\lim_{n \to \infty} \operatorname{bias}(\hat{\theta}) \ne 0$, then $\lim_{n \to \infty} \operatorname{MSE}(\hat{\theta}) \ne 0$.
  • Case 1: when $\lim_{n \to \infty} \operatorname{Var}(\hat{\theta}) \ne 0$, it means $\lim_{n \to \infty} \operatorname{Var}(\hat{\theta}) > 0$ since the variance is nonnegative. Also, $\left(\operatorname{bias}(\hat{\theta})\right)^2 \ge 0$. It follows that $\lim_{n \to \infty} \operatorname{MSE}(\hat{\theta}) \ge \lim_{n \to \infty} \operatorname{Var}(\hat{\theta}) > 0$, i.e., the limit of the MSE does not equal zero.
  • Case 2: when $\lim_{n \to \infty} \operatorname{bias}(\hat{\theta}) \ne 0$, it means $\lim_{n \to \infty} \left(\operatorname{bias}(\hat{\theta})\right)^2 > 0$. Also, $\operatorname{Var}(\hat{\theta}) \ge 0$. It follows that $\lim_{n \to \infty} \operatorname{MSE}(\hat{\theta}) \ge \lim_{n \to \infty} \left(\operatorname{bias}(\hat{\theta})\right)^2 > 0$, i.e., the limit of the MSE does not equal zero.

Remark.

  • As a result, if we know that $\lim_{n \to \infty} \operatorname{MSE}(\hat{\theta}) = 0$, then we know that $\lim_{n \to \infty} \mathbb{E}[\hat{\theta}] = \theta$, i.e., $\hat{\theta}$ is an asymptotically unbiased estimator (in addition to $\lim_{n \to \infty} \operatorname{Var}(\hat{\theta}) = 0$) ($\hat{\theta}$ may also happen to be an unbiased estimator).

Uniformly minimum-variance unbiased estimator

Now, we know that the smaller the variance of an unbiased estimator, the more efficient (and "better") it is. Thus, it is natural to ask what the most efficient (i.e., the "best") unbiased estimator is, i.e., the unbiased estimator with the smallest variance. We have a specific name for such an unbiased estimator, namely the uniformly minimum-variance unbiased estimator (UMVUE). To be more precise, we have the following definition of the UMVUE:

Definition. (Uniformly minimum-variance unbiased estimator) The uniformly minimum-variance unbiased estimator (UMVUE) is an unbiased estimator with the smallest variance among all unbiased estimators.

Indeed, the UMVUE is unique, i.e., there is exactly one unbiased estimator with the smallest variance among all unbiased estimators, and we will prove this in the following.

Proposition. (Uniqueness of UMVUE) If $\hat{\theta}_1$ is an UMVUE of a function $g(\theta)$ of a parameter $\theta$, then $\hat{\theta}_1$ is unique.

Proof. Assume that $\hat{\theta}_1$ is an UMVUE of $g(\theta)$, and $\hat{\theta}_2$ is another UMVUE of $g(\theta)$. Define the estimator $\hat{\theta}_3 = \frac{1}{2}(\hat{\theta}_1 + \hat{\theta}_2)$. Since $\mathbb{E}[\hat{\theta}_3] = \frac{1}{2}\left(\mathbb{E}[\hat{\theta}_1] + \mathbb{E}[\hat{\theta}_2]\right) = g(\theta)$, $\hat{\theta}_3$ is an unbiased estimator of $g(\theta)$.

Now, we consider the variance of $\hat{\theta}_3$. Writing $\operatorname{Var}(\hat{\theta}_1) = \operatorname{Var}(\hat{\theta}_2) = v$ (both are UMVUEs, so their variances are equal), by the covariance inequality $\operatorname{Cov}(\hat{\theta}_1, \hat{\theta}_2) \le \sqrt{\operatorname{Var}(\hat{\theta}_1)\operatorname{Var}(\hat{\theta}_2)}$,
$$\operatorname{Var}(\hat{\theta}_3) = \frac{1}{4}\operatorname{Var}(\hat{\theta}_1) + \frac{1}{4}\operatorname{Var}(\hat{\theta}_2) + \frac{1}{2}\operatorname{Cov}(\hat{\theta}_1, \hat{\theta}_2) \le \frac{v}{2} + \frac{1}{2}\sqrt{\operatorname{Var}(\hat{\theta}_1)\operatorname{Var}(\hat{\theta}_2)} = v.$$

Thus, we now have either $\operatorname{Var}(\hat{\theta}_3) < \operatorname{Var}(\hat{\theta}_1)$ or $\operatorname{Var}(\hat{\theta}_3) = \operatorname{Var}(\hat{\theta}_1)$. If the former is true, then $\hat{\theta}_1$ is not an UMVUE of $g(\theta)$ by definition, since we can find another unbiased estimator, namely $\hat{\theta}_3$, with smaller variance than it. Hence, we must have the latter, i.e.,
$$\operatorname{Var}(\hat{\theta}_3) = \operatorname{Var}(\hat{\theta}_1).$$
This implies that when we apply the covariance inequality, the equality holds, i.e.,
$$\operatorname{Cov}(\hat{\theta}_1, \hat{\theta}_2) = \sqrt{\operatorname{Var}(\hat{\theta}_1)\operatorname{Var}(\hat{\theta}_2)} = v,$$
which means $\hat{\theta}_2$ is increasing linearly with $\hat{\theta}_1$, i.e., we can write
$$\hat{\theta}_2 = a\hat{\theta}_1 + b$$
for some constants $a > 0$ and $b$.

Now, we consider the covariance $\operatorname{Cov}(\hat{\theta}_1, \hat{\theta}_2)$:
$$\operatorname{Cov}(\hat{\theta}_1, \hat{\theta}_2) = \operatorname{Cov}(\hat{\theta}_1, a\hat{\theta}_1 + b) = a\operatorname{Var}(\hat{\theta}_1) = av.$$

On the other hand, since the equality holds in the covariance inequality, and $\operatorname{Var}(\hat{\theta}_1) = \operatorname{Var}(\hat{\theta}_2) = v$ (since they are both UMVUEs), $\operatorname{Cov}(\hat{\theta}_1, \hat{\theta}_2) = v$.
Thus, we have $a = 1$.

It remains to show that $b = 0$ to prove that $\hat{\theta}_2 = \hat{\theta}_1$, and therefore conclude that $\hat{\theta}_1$ is unique.

From above, we currently have $g(\theta) = \mathbb{E}[\hat{\theta}_2] = \mathbb{E}[\hat{\theta}_1 + b] = g(\theta) + b$, and hence $b = 0$, as desired.

Remark.

  • Thus, when we are able to find an UMVUE, it is the unique one, and the variance of every other possible unbiased estimator is strictly greater than the variance of the UMVUE.
Cramer-Rao lower bound

Without using some results, it is quite difficult to determine the UMVUE, since there are many (perhaps even infinitely many) possible unbiased estimators, so it is quite hard to ensure that one particular unbiased estimator is relatively more efficient than every other possible unbiased estimator.

Therefore, we will introduce some approaches that help us to find the UMVUE. For the first approach, we find a lower bound on the variances of all possible unbiased estimators. After getting such a lower bound, if we can find an unbiased estimator whose variance is exactly equal to the lower bound, then the lower bound is the minimum value of the variances, and hence such an unbiased estimator is an UMVUE by definition.

Remark.

  • There are many possible lower bounds, but when the lower bound is greater, it is closer to the actual minimum value of the variances, and hence "better".
  • An unbiased estimator can still be an UMVUE even if its variance does not achieve the lower bound.

A common way to find such a lower bound is to use the Cramer-Rao lower bound (CRLB), and we get the CRLB through the Cramer-Rao inequality. Before stating the inequality, let us define some related terms.

Definition. (Fisher information) The Fisher information about a parameter $\theta$ with sample size $n$ is
$$I_n(\theta) = \mathbb{E}\left[\left(\frac{\partial}{\partial \theta}\ell(\theta)\right)^2\right],$$

where $\ell(\theta) = \ln \mathcal{L}(\theta; X_1, \dots, X_n)$ is the log-likelihood function (as a random variable).

Remark.

  • $\frac{\partial}{\partial \theta}\ell(\theta)$ is called the score function, and is denoted by $S(\theta)$.
  • The "$\theta$" may or may not be a parameter vector. If it is just a single parameter (usually the case here), then "$\partial/\partial \theta$" is the same as "$d/d\theta$".
  • It is possible to define the "Fisher information about a parameter vector", but in this case the Fisher information takes the form of a matrix instead of a single number, and it is called the Fisher information matrix. However, since it is more complicated, we will not discuss it here.
  • Since the expected value of the score function,
$$\mathbb{E}\left[\frac{\partial}{\partial \theta}\ell(\theta)\right] = \mathbb{E}\left[\frac{\frac{\partial}{\partial \theta}\mathcal{L}(\theta)}{\mathcal{L}(\theta)}\right],$$

equals zero under some regularity conditions which allow interchange of derivative and integral, the Fisher information about $\theta$ is also the variance of the score function, i.e., $I_n(\theta) = \operatorname{Var}\left(\frac{\partial}{\partial \theta}\ell(\theta)\right)$.

For the regularity conditions which allow interchange of derivative and integral, they include:

  1. the partial derivatives involved should exist, i.e., the (natural logarithms of the) functions involved are differentiable;
  2. the integrals involved should be differentiable;
  3. the support does not depend on the parameter(s) involved.

We have some results that assist us in computing the Fisher information.

Proposition. Let $X_1, \dots, X_n$ be a random sample from a distribution with pdf or pmf $f(x; \theta)$. Also, let $I(\theta) = I_1(\theta)$, the Fisher information about $\theta$ with sample size one. Then, under some regularity conditions which allow interchange of derivative and integral, $I_n(\theta) = nI(\theta)$.

Proof. Since the $X_i$'s are independent and identically distributed, $\ell(\theta) = \sum_{i=1}^n \ln f(X_i; \theta)$, so the score function is a sum of $n$ independent and identically distributed terms, each with mean zero (under the regularity conditions). Hence,
$$I_n(\theta) = \operatorname{Var}\left(\sum_{i=1}^n \frac{\partial}{\partial \theta}\ln f(X_i; \theta)\right) = \sum_{i=1}^n \operatorname{Var}\left(\frac{\partial}{\partial \theta}\ln f(X_i; \theta)\right) = nI(\theta).$$

Proposition. Under some regularity conditions which allow interchange of derivative and integral, $I_n(\theta) = -\mathbb{E}\left[\dfrac{\partial^2}{\partial\theta^2}\ell(\theta;\mathbf{X})\right]$.

Proof.

Now, it suffices to prove that , which is true since

Remark.

  • This proposition can be quite useful: after partially differentiating the score once more, many terms are likely to vanish, and thus the computation of the expectation becomes easier.
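The second-derivative formula can be illustrated with the Poisson distribution, where $\frac{\partial^2}{\partial\lambda^2}\ell = -X/\lambda^2$, so $I(\lambda) = \mathbb{E}[X]/\lambda^2 = 1/\lambda$. A hedged numerical check (the rate $\lambda = 2.5$ is an arbitrary assumption):

```python
import numpy as np

# For Poisson(lam), log f(x; lam) = x*log(lam) - lam - log(x!),
# so the second derivative w.r.t. lam is -x/lam**2 and
#   I(lam) = -E[-X/lam**2] = E[X]/lam**2 = 1/lam.
# lam = 2.5 is an arbitrary illustrative rate.
rng = np.random.default_rng(1)
lam = 2.5
x = rng.poisson(lam, size=200_000)

fisher = np.mean(x / lam**2)   # -E[∂²ℓ/∂λ²] ≈ 1/λ = 0.4
print(fisher)
```

Note how the second-derivative route avoids squaring the score, matching the remark above about easier computation.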

Theorem. (Cramer-Rao inequality) Let $X_1,\ldots,X_n$ be a random sample from a distribution, and let $\hat\theta$ be an unbiased estimator of $\tau(\theta)$ (a function of $\theta$). Then, under some regularity conditions which allow interchange of derivative and integral, $\operatorname{Var}(\hat\theta) \ge \dfrac{[\tau'(\theta)]^2}{I_n(\theta)}$.

Proof. Since is an unbiased estimator of , we have by definition . By definition of expectation, we have where is the likelihood function. Thus,

Consider the covariance inequality: $\operatorname{Var}(X) \ge \dfrac{[\operatorname{Cov}(X,Y)]^2}{\operatorname{Var}(Y)}$ (for random variables $X,Y$ with finite, nonzero variances). We have
( by remark about Fisher information)

Remark.

  • $\dfrac{[\tau'(\theta)]^2}{I_n(\theta)}$ is called the Cramer-Rao lower bound (CRLB).
  • When $\tau(\theta) = \theta$, meaning that $\hat\theta$ is an unbiased estimator of $\theta$ itself, since $\tau'(\theta) = 1$, the CRLB becomes $\dfrac{1}{I_n(\theta)}$.

Example. Let $X_1,\ldots,X_n$ be a random sample from the normal distribution $\mathcal{N}(\mu,\sigma^2)$. Show that the MLE of $\mu$, namely $\bar{X}$, is an UMVUE of $\mu$.

Proof. First, we can see that the regularity conditions are satisfied in this case. So, we may consider the CRLB of as follows. Since

we have
Thus, the CRLB of $\mu$ is $\dfrac{\sigma^2}{n}$.

On the other hand, the variance of $\bar{X}$ is $\dfrac{\sigma^2}{n}$ (which is shown in a previous example), which equals the CRLB of $\mu$. It follows that $\bar{X}$ is an UMVUE of $\mu$.
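This agreement between $\operatorname{Var}(\bar X)$ and the CRLB can be seen in simulation. A hedged sketch (the values $\mu=1$, $\sigma=2$, $n=10$ are arbitrary assumptions):

```python
import numpy as np

# Simulate many samples of size n from N(mu, sigma^2); the empirical
# variance of the sample mean should match the CRLB sigma**2 / n.
# mu=1, sigma=2, n=10 are illustrative assumptions.
rng = np.random.default_rng(2)
mu, sigma, n, reps = 1.0, 2.0, 10, 100_000

xbar = rng.normal(mu, sigma, size=(reps, n)).mean(axis=1)
crlb = sigma**2 / n   # = 0.4

print(xbar.var(), crlb)   # both ≈ 0.4
```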

Clipboard

Exercise. A student claims that is another UMVUE of , since , which equals the CRLB of as well. Is the claim correct? Why?

Solution

Recall that UMVUE is an unbiased estimator.

The claim is wrong, since is not an unbiased estimator in general. This is because unless . But if , then this estimator is simply , which is exactly the same as . So, the estimator is not another UMVUE in this case.


Sometimes, we cannot use the CRLB method for finding UMVUE, because

  • the regularity conditions may not be satisfied, and thus we cannot use the Cramer-Rao inequality, and
  • the variance of the unbiased estimator may not equal the CRLB; but then we cannot conclude that it is not an UMVUE, because the CRLB may not be attainable at all, in which case the smallest variance among all unbiased estimators is actually larger than the CRLB.

We will illustrate some examples for these two cases in the following.

Example. Let be a random sample from the uniform distribution . If we want to find the UMVUE of , we cannot use the Cramer-Rao inequality to find it, since the support depends on the parameter .

Example. Let $X_1,\ldots,X_n$ be a random sample from the normal distribution $\mathcal{N}(\mu,\sigma^2)$. It is given that in this case $\dfrac{(n-1)S^2}{\sigma^2} \sim \chi^2_{n-1}$, where $\chi^2_{n-1}$ is the chi-squared distribution with $n-1$ degrees of freedom, and its variance is $2(n-1)$. Calculate $\operatorname{Var}(S^2)$, and the CRLB of $\sigma^2$.

Solution: By the given information, we have

Hence, $\operatorname{Var}(S^2) = \dfrac{2\sigma^4}{n-1}$.

On the other hand, since

and thus $I_n(\sigma^2) = \dfrac{n}{2\sigma^4}$, the CRLB of $\sigma^2$ is $\dfrac{2\sigma^4}{n}$.

Remark.

  • $S^2$ is an unbiased estimator of $\sigma^2$, since $\mathbb{E}[S^2] = \sigma^2$.
  • We can observe that $\operatorname{Var}(S^2)$ is greater than the CRLB. But does this mean $S^2$ is not the UMVUE of $\sigma^2$? We do not know: we are not sure whether there is another unbiased estimator with variance less than $\operatorname{Var}(S^2)$, and it is possible that the CRLB is not attainable at all.
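The gap between $\operatorname{Var}(S^2) = \frac{2\sigma^4}{n-1}$ and the CRLB $\frac{2\sigma^4}{n}$ is easy to exhibit numerically. A hedged sketch ($\sigma=1$, $n=5$ are arbitrary assumptions):

```python
import numpy as np

# The empirical variance of S^2 over many samples should be close to
# 2*sigma**4/(n-1) = 0.5, strictly above the CRLB 2*sigma**4/n = 0.4.
# sigma=1, n=5 are illustrative assumptions.
rng = np.random.default_rng(3)
sigma, n, reps = 1.0, 5, 200_000

s2 = rng.normal(0.0, sigma, size=(reps, n)).var(axis=1, ddof=1)  # S^2
var_s2 = s2.var()                # ≈ 2σ⁴/(n-1) = 0.5
crlb = 2 * sigma**4 / n          # = 0.4

print(var_s2, crlb)
```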

Since the CRLB is sometimes attainable and sometimes not, it is natural to ask when the CRLB can be attained. In other words, we would like to know the attainment conditions for the CRLB, which are stated in the following corollary.

Corollary. (Attainable condition for the CRLB) Let $X_1,\ldots,X_n$ be a random sample from a distribution, and let $\hat\theta$ be an unbiased estimator of $\tau(\theta)$. Suppose the regularity conditions in the Cramer-Rao inequality are satisfied. Then, the CRLB can be attained, i.e., there exists some unbiased estimator $\hat\theta$ such that $\operatorname{Var}(\hat\theta) = \dfrac{[\tau'(\theta)]^2}{I_n(\theta)}$, if and only if

$s(\theta;\mathbf{X}) = k(\theta)\big(\hat\theta - \tau(\theta)\big),$

where $s(\theta;\mathbf{X})$ is the score function, and $k(\theta)$ is a "constant" that may depend on $\theta$ but not on the sample.

Proof. Considering the proof for Cramer-Rao inequality, we have

We can write as (by result about covariance). Also, (by result about variance). Thus, we have
where is the correlation coefficient between two random variables. This means increases or decreases linearly with , i.e.,
for some constants . Now, it suffices to show that the constant is actually zero.

We know that $\mathbb{E}[\hat\theta] = \tau(\theta)$ (since $\hat\theta$ is an unbiased estimator of $\tau(\theta)$), and $\mathbb{E}[s(\theta;\mathbf{X})] = 0$ (from the remark about Fisher information). Thus, applying expectations on both sides gives

Then, the result follows.

Remark.

  • Considering the proof, we know that if the attainable condition is satisfied, the variance of the unbiased estimator equals the CRLB of $\tau(\theta)$, i.e., that estimator is the UMVUE of $\tau(\theta)$.

Example. We have shown that the log-likelihood function of a random sample from the normal distribution $\mathcal{N}(\mu,\sigma^2)$ is $\ell(\mu,\sigma^2) = -\dfrac{n}{2}\ln(2\pi\sigma^2) - \dfrac{1}{2\sigma^2}\sum_{i=1}^n (x_i-\mu)^2$. Prove that the CRLB of $\mu$ is attainable using the attainable conditions for the CRLB.

Proof. The score function is

$s(\mu;\mathbf{X}) = \frac{\partial \ell}{\partial \mu} = \frac{1}{\sigma^2}\sum_{i=1}^n (X_i - \mu) = \frac{n}{\sigma^2}\big(\bar{X} - \mu\big).$

Since we have $\tau(\mu) = \mu$ and $\hat\theta = \bar{X}$ (which is an unbiased estimator of $\mu$), the attainable conditions for the CRLB are satisfied (the constant "$k(\mu)$" is $\dfrac{n}{\sigma^2}$ in this case), and hence the CRLB of $\mu$ is attainable.


Remark.

  • Indeed, we knew that the CRLB of $\mu$ is attainable even before this proof, since we have already found an unbiased estimator of $\mu$, namely $\bar{X}$, whose variance is exactly equal to the CRLB.

Example. Continue from the previous example. Prove that the CRLB of $\sigma^2$ is not attainable using the attainable conditions for the CRLB.

Proof. The score function in this case is

$s(\sigma^2;\mathbf{X}) = \frac{\partial \ell}{\partial (\sigma^2)} = -\frac{n}{2\sigma^2} + \frac{1}{2\sigma^4}\sum_{i=1}^n (X_i-\mu)^2 = \frac{n}{2\sigma^4}\left(\frac{1}{n}\sum_{i=1}^n (X_i-\mu)^2 - \sigma^2\right).$

Taking the constant $k(\sigma^2) = \dfrac{n}{2\sigma^4}$, a potential candidate for the unbiased estimator that achieves the CRLB is $\dfrac{1}{n}\sum_{i=1}^n (X_i-\mu)^2$. However, we notice that this quantity is not calculable since $\mu$ is unknown, so it is not a statistic. It follows that there does not exist an unbiased estimator $\hat\theta$ such that $s(\sigma^2;\mathbf{X}) = k(\sigma^2)\big(\hat\theta - \sigma^2\big)$, where $k(\sigma^2)$ does not depend on the sample.


Remark.

  • Even though we know that the CRLB of $\sigma^2$ is not attainable, we still do not know whether $S^2$ is the UMVUE, since there may be some unbiased estimator with smaller variance that does not achieve the CRLB.

We have discussed the MLE previously, and the MLE is actually a "best choice" asymptotically (i.e., as the sample size $n \to \infty$) according to the following theorem.

Theorem. Suppose $\hat\theta$ is the MLE of an unknown parameter $\theta$ from a distribution. Then, under some regularity conditions, as $n \to \infty$,

$\sqrt{n}\,\big(\hat\theta - \theta\big) \xrightarrow{d} \mathcal{N}\!\left(0, \frac{1}{I(\theta)}\right).$

Proof. Partial proof: we consider the Taylor series of order 2 for , and we will get

where is between and . Since is the MLE of , from the derivative test, we know that (we apply regularity condition to ensure the existence of this derivative). Hence, we have
Since
by central limit theorem,
Furthermore, we apply the weak law of large numbers to show that
It can be shown in a quite complicated way (and using regularity conditions) that
Considering and , using property of convergence in probability, we have
Considering and , and using Slutsky's theorem, we have
where , and hence . It follows that
This means
and thus
as desired.

Remark.

  • Equivalently, we can write that $\hat\theta$ is approximately $\mathcal{N}\!\left(\theta, \dfrac{1}{nI(\theta)}\right)$ distributed for large $n$. Thus, the variance of the MLE of $\theta$ achieves the CRLB of $\theta$ asymptotically. This means the MLE of $\theta$ is the UMVUE of $\theta$ asymptotically.
  • The regularity conditions are basically similar to the regularity conditions mentioned in Cramer-Rao inequality.
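The asymptotic variance $\frac{1}{nI(\theta)}$ can be illustrated with the exponential distribution, where the MLE of the rate $\lambda$ is $1/\bar X$ and $I(\lambda) = 1/\lambda^2$. A hedged sketch ($\lambda = 1.5$, $n = 2000$ are arbitrary assumptions):

```python
import numpy as np

# For Exp(rate lam), the MLE is 1/X̄ and I(lam) = 1/lam**2, so for large n
#   Var(MLE) ≈ 1/(n*I(lam)) = lam**2/n.
# lam=1.5 and n=2000 are illustrative assumptions.
rng = np.random.default_rng(4)
lam, n, reps = 1.5, 2_000, 20_000

mle = 1.0 / rng.exponential(1.0 / lam, size=(reps, n)).mean(axis=1)

print(n * mle.var())   # ≈ λ² = 2.25, the inverse Fisher information
```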

Since we are not able to use the CRLB to find UMVUE in some situations, we will introduce another method to find UMVUE in the following, which uses the concepts of sufficiency and completeness.

Sufficiency[edit | edit source]

Intuitively, a sufficient statistic , which is a function of a given random sample , contains all information needed for estimating the unknown parameter (vector) . Thus, the statistic itself is "sufficient" for estimating the unknown parameter (vector) .

Formally, we can define and describe sufficient statistic as follows:

Definition. (Sufficient statistic) A statistic $T = T(\mathbf{X})$ is a sufficient statistic for the unknown parameter (vector) $\theta$ if the conditional distribution of the random sample $\mathbf{X}$ given $T$ does not depend on $\theta$.

Remark.

  • The definition may be expressed as

where is the joint pdf or pmf of .
  • The equation means the joint conditional pmf or pdf of given (the value of) is the same as the joint conditional pmf or pdf of given (the values of ), and with the parameter value .
  • This means the pmf or pdf is not changed even if the parameter value is provided, which in turn means the joint conditional pmf or pdf of the random sample, given the value of $T$, actually does not depend on $\theta$.
  • refers to before the realization , and it is a random variable (the randomness comes from ).
  • After the realization , the equation still holds ( is modified to ).

Example. Consider a random sample $X_1,\ldots,X_n$ from $\mathcal{N}(\mu,\sigma^2)$. It can be shown that $\bar{X}$ is a sufficient statistic for $\mu$, but not a sufficient statistic for $\sigma^2$.

This can be shown by applying the definition. However, we will later give an alternative and often more convenient method to check the sufficiency of a statistic, and find sufficient statistics. We will explain informally why it is true here.

  • $\bar{X}$ contains the information about the central tendency of the distribution, which is the information needed to estimate the mean $\mu$. Hence, it is a sufficient statistic for $\mu$.
  • However, $\bar{X}$ does not contain the information about the dispersion of the distribution (it only tells us the "central location"; for a particular central location, the dispersion can be very different), which is the information needed to estimate the variance $\sigma^2$. Hence, it is not a sufficient statistic for $\sigma^2$.

Remark.

  • From here, we can also expect that the sufficient statistic is not unique, since, for example, $2\bar{X}$ should also contain the information about the central tendency (we can divide it by 2 to get the value of $\bar{X}$, and thus get the information).
  • Indeed, in general, given that $T$ is a sufficient statistic for $\theta$, $g(T)$ is also a sufficient statistic for $\theta$, provided that $g$ is a bijective function (also known as an invertible function, one-to-one correspondence, or bijection), so that its inverse $g^{-1}$ exists.

Let us state the above remark about transformation of sufficient statistic formally below.

Proposition. Let $T$ be a sufficient statistic for the unknown parameter (vector) $\theta$. Then, $g(T)$ is also a sufficient statistic for $\theta$ for each bijective function $g$.

Now, we discuss a theorem that helps us to check the sufficiency of a statistic, namely the (Fisher-Neyman) factorization theorem.

Theorem. (Factorization theorem) Let $f(\mathbf{x};\theta)$ be the joint pdf or pmf of a random sample $X_1,\ldots,X_n$. A statistic $T = T(\mathbf{X})$ is a sufficient statistic for $\theta$ if and only if there exist functions $g$ and $h$ such that

$f(\mathbf{x};\theta) = g\big(T(\mathbf{x});\theta\big)\,h(\mathbf{x}),$

where $g$ depends on $\mathbf{x}$ only through $T(\mathbf{x})$, and $h$ does not depend on $\theta$.

Proof. Since the proof for continuous case is quite complicated, we will only give a proof for the discrete case. For simplicity of presentation, let , , , and , and hence there are notations for different types of pmfs from these. By definition, . Also, we have . Thus, we can write

.

"only if" () direction: Assume is a sufficient statistic. Then, we choose and , which does not depend on by the definition of sufficient statistic. It remains to verify that the equation actually holds for this choice.

Hence,

"if" () direction: Assume we can write . Then,

Now, we aim to show that does not depend on , which means is a sufficient statistic for . We have
which does not depend on , as desired.

Remark.

  • $h$ can also be a constant, which clearly does not depend on $\theta$.

Example. Consider a random sample from . Find a sufficient statistic of .

Solution: The joint pdf of is

Notice that the function depends on only through , so we can conclude that .

Remark.

  • We can also write as , which is also a sufficient statistic for .
  • Intuitively, this is because the latter one also contains the same statistics, and thus contains the same information.
  • Alternatively, we can define the function as , which is a bijective function, so is also a sufficient statistic for .
  • We need to separate "" out from "", since for the function , it cannot depend on . Hence, we cannot include "" in the definition of the function.
  • There are many ways to define the and functions in this case.
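The defining property — that the conditional distribution of the sample given $T$ does not depend on $\theta$ — can also be seen directly in simulation. A hedged sketch for Bernoulli samples (all values assumed for illustration): with $n = 4$ and $T = \sum_i X_i$, conditioning on $T = 2$ gives $\mathbb{P}(X_1 = 1 \mid T = 2) = \binom{3}{1}/\binom{4}{2} = 1/2$ for every $p$.

```python
import numpy as np

# Condition Bernoulli samples of size n=4 on T = sum = 2 and look at the
# conditional frequency of X1 = 1.  It should be 1/2 regardless of p,
# because T is sufficient for p.  p values 0.3 and 0.7 are arbitrary.
rng = np.random.default_rng(5)
n, t, reps = 4, 2, 400_000

cond_means = []
for p in (0.3, 0.7):
    x = rng.binomial(1, p, size=(reps, n))
    cond = x[x.sum(axis=1) == t]           # keep only samples with T = t
    cond_means.append(cond[:, 0].mean())   # ≈ 0.5 for every p

print(cond_means)
```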


For some "nice" distributions, which belong to exponential family, sufficient statistics can be found using another alternative method easily and more conveniently. This method works because of the "nice" form of the pdf or pmf of those distributions, which can be characterized as follows:

Definition. (Exponential family) The distribution of a random variable $X$ belongs to the exponential family if the pdf or pmf of $X$ has the form

$f(x;\boldsymbol\theta) = \exp\left(\sum_{i=1}^{k} c_i(\boldsymbol\theta)\,T_i(x) + d(\boldsymbol\theta) + S(x)\right),$

where $\boldsymbol\theta = (\theta_1,\ldots,\theta_k)$, for some functions $c_i(\cdot)$, $T_i(\cdot)$ ($i = 1,\ldots,k$), $d(\cdot)$ and $S(\cdot)$.

Remark.

  • The value of depends on the number of unknown parameters.
  • Notice that $k$ can be 1, and in this case the "$\boldsymbol\theta$" is just a single parameter.
  • Exponential family includes many common distributions, e.g. normal, exponential, gamma, chi squared, beta, Bernoulli, Poisson, geometric, etc.
  • However, some common distributions do not belong to the exponential family, e.g. Student's $t$-distribution, $F$-distribution, Cauchy distribution and hypergeometric distribution.

Example. The normal distribution $\mathcal{N}(\mu,\sigma^2)$ belongs to the exponential family, with $\boldsymbol\theta = (\mu,\sigma^2)$ (so the "$k$" is 2 in this case), since its pdf can be expressed as

$f(x;\mu,\sigma^2) = \exp\left(\frac{\mu}{\sigma^2}\,x - \frac{1}{2\sigma^2}\,x^2 - \frac{\mu^2}{2\sigma^2} - \frac{1}{2}\ln(2\pi\sigma^2)\right).$

Theorem. (Sufficient statistic for exponential family) Let $X_1,\ldots,X_n$ be a random sample from a distribution belonging to the exponential family, with pdf or pmf $f(x;\boldsymbol\theta) = \exp\left(\sum_{i=1}^{k} c_i(\boldsymbol\theta)T_i(x) + d(\boldsymbol\theta) + S(x)\right)$ where $\boldsymbol\theta = (\theta_1,\ldots,\theta_k)$. Then, a sufficient statistic for $\boldsymbol\theta$ is

$T = \left(\sum_{j=1}^{n} T_1(X_j), \ldots, \sum_{j=1}^{n} T_k(X_j)\right).$

Proof. Since the distribution belongs to the exponential family, the joint pdf or pmf of $X_1,\ldots,X_n$ can be expressed as

$f(\mathbf{x};\boldsymbol\theta) = \exp\left(\sum_{i=1}^{k} c_i(\boldsymbol\theta)\sum_{j=1}^{n} T_i(x_j) + n\,d(\boldsymbol\theta)\right)\exp\left(\sum_{j=1}^{n} S(x_j)\right).$

For applying the factorization theorem, we can identify the first factor as "$g$", and the second factor as "$h$". We can notice that the "$g$" factor depends on $\mathbf{x}$ only through $\left(\sum_j T_1(x_j),\ldots,\sum_j T_k(x_j)\right)$. The result follows.

Example. Consider a random sample $X_1,\ldots,X_n$ from $\mathcal{N}(\mu,\sigma^2)$. Show that a sufficient statistic for $(\mu,\sigma^2)$ is $(\bar{X}, S^2)$, using the result for finding sufficient statistics for the exponential family.

Proof. From the previous example, we have shown that the normal distribution belongs to the exponential family, and from the expression there, we can see that a sufficient statistic for $(\mu,\sigma^2)$ is $\left(\sum_{j=1}^{n} X_j, \sum_{j=1}^{n} X_j^2\right)$.

Since $\bar{X} = \frac{1}{n}\sum_j X_j$ and $S^2 = \frac{1}{n-1}\left(\sum_j X_j^2 - n\bar{X}^2\right)$, we can define the function $g$ as

$g(t_1, t_2) = \left(\frac{t_1}{n}, \frac{t_2 - t_1^2/n}{n-1}\right),$

which can be shown to be a bijective function.

Thus, $(\bar{X}, S^2) = g\left(\sum_j X_j, \sum_j X_j^2\right)$ is also a sufficient statistic for $(\mu,\sigma^2)$.


Now, we will start discussing how sufficient statistics are related to the UMVUE. We begin our discussion with the Rao-Blackwell theorem.

Theorem. (Rao-Blackwell theorem) Let $\hat\theta$ be an arbitrary unbiased estimator of $\theta$, and $T$ be a sufficient statistic for $\theta$. Define $\hat\theta^* = \mathbb{E}[\hat\theta \mid T]$. Then, $\hat\theta^*$ is an unbiased estimator of $\theta$ and $\operatorname{Var}(\hat\theta^*) \le \operatorname{Var}(\hat\theta)$.

Proof. Assume $\hat\theta$ is an arbitrary unbiased estimator of $\theta$, and $T$ is a sufficient statistic for $\theta$.

Before proving the unbiasedness, we should ensure that $\hat\theta^*$ is actually an estimator, i.e., a statistic: it must be a function of the random sample, and must not depend on $\theta$ (so that it is calculable). Since $\hat\theta$ is a function of the random sample and $T$ is a sufficient statistic, the conditional distribution of $\hat\theta$ given $T$ does not depend on $\theta$, and hence neither does $\hat\theta^* = \mathbb{E}[\hat\theta \mid T]$. Also, $\hat\theta^*$ is a function of $T$, and thus is also a function of the random sample.

Now, we prove that $\hat\theta^*$ is an unbiased estimator of $\theta$: since $\mathbb{E}[\hat\theta^*] = \mathbb{E}\big[\mathbb{E}[\hat\theta \mid T]\big] = \mathbb{E}[\hat\theta] = \theta$ by the law of total expectation, $\hat\theta^*$ is an unbiased estimator of $\theta$.

Next, we prove that $\operatorname{Var}(\hat\theta^*) \le \operatorname{Var}(\hat\theta)$: by the law of total variance, we have

$\operatorname{Var}(\hat\theta) = \mathbb{E}\big[\operatorname{Var}(\hat\theta \mid T)\big] + \operatorname{Var}\big(\mathbb{E}[\hat\theta \mid T]\big) \ge \operatorname{Var}\big(\mathbb{E}[\hat\theta \mid T]\big) = \operatorname{Var}(\hat\theta^*),$

as desired.

Remark.

  • The random variable $\hat\theta^*$ is determined by first finding $\mathbb{E}[\hat\theta \mid T = t]$, and then replacing $t$ by $T$. Here, $t$ is the realization of $T$.
  • From the Rao-Blackwell theorem, we know that $\hat\theta^*$ is a better (or at least "same quality") estimator than $\hat\theta$ in the efficiency sense. Notice that the theorem does not state that $\hat\theta^*$ is the best estimator in the efficiency sense (i.e., the UMVUE). Instead, it only states that $\hat\theta^*$ is better than $\hat\theta$ in the efficiency sense.
  • After applying this theorem once, the $\hat\theta^*$ can act as the "arbitrary unbiased estimator of $\theta$", and we can apply this theorem again, and so on. Heuristically, after applying this theorem repeatedly, the "$\hat\theta^*$" we get will be an UMVUE.
  • We can interpret this process as repeatedly "improving" the unbiased estimator $\hat\theta$, until it becomes the best one (in the efficiency sense), i.e., an UMVUE.
  • Since the UMVUE is unique, the UMVUE must be the conditional expectation of an unbiased estimator given a sufficient statistic $T$, which is a function of $T$.
  • Thus, we can now narrow the candidates for the UMVUE down to functions of the sufficient statistic $T$.
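Rao-Blackwellization can be watched in action. A hedged sketch for Bernoulli data ($p = 0.4$, $n = 10$ are arbitrary assumptions): start with the crude unbiased estimator $\hat\theta = X_1$; by symmetry, conditioning on the sufficient statistic $T = \sum_i X_i$ gives $\mathbb{E}[X_1 \mid T] = T/n = \bar X$, which is unbiased with far smaller variance:

```python
import numpy as np

# Crude unbiased estimator of p: X1 alone, with Var = p(1-p) = 0.24.
# Rao-Blackwellized version: E[X1 | T] = T/n = X̄, with Var = p(1-p)/n = 0.024.
# p=0.4, n=10 are illustrative assumptions.
rng = np.random.default_rng(6)
p, n, reps = 0.4, 10, 100_000

x = rng.binomial(1, p, size=(reps, n))
crude = x[:, 0].astype(float)   # δ = X₁
improved = x.mean(axis=1)       # δ* = E[X₁ | T] = X̄

print(crude.mean(), improved.mean())   # both ≈ 0.4 (unbiased)
print(crude.var(), improved.var())     # ≈ 0.24 vs ≈ 0.024
```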

To actually determine the UMVUE, we need another theorem, called Lehmann-Scheffé theorem, which is based on Rao-Blackwell theorem, and requires the concept of completeness.

Completeness[edit | edit source]

Definition. (Complete statistic) Let $X_1,\ldots,X_n$ be a random sample from a distribution with a parameter (vector) $\theta$ lying in the parameter space $\Theta$. A statistic $T$ is a complete statistic if, for every function $g$, $\mathbb{E}[g(T)] = 0$ for each $\theta \in \Theta$ implies $\mathbb{P}\big(g(T) = 0\big) = 1$ for each $\theta \in \Theta$.

When a random sample is from a distribution in the exponential family, a complete statistic can also be found easily, similar to the case for sufficient statistics.

Theorem. (Complete statistic for exponential family) If $X_1,\ldots,X_n$ is a random sample from a distribution in the exponential family where the unknown parameter (vector) is $\boldsymbol\theta = (\theta_1,\ldots,\theta_k)$, then

$T = \left(\sum_{j=1}^{n} T_1(X_j), \ldots, \sum_{j=1}^{n} T_k(X_j)\right)$

is a complete statistic, given that the parameter space $\Theta$ contains an open set in $\mathbb{R}^k$.

Proof. Omitted.

Remark.

  • Open sets are a generalization of open intervals. Indeed, an open set in $\mathbb{R}$ is simply an open interval (or a union of open intervals).
  • Intuitively, an open set is a set that, for each point in it, contains all points that are sufficiently near that point.
  • For example, a line in $\mathbb{R}^2$ (which can be interpreted as a set) is not an open set, since for each point on the line, the line does not contain all points that are sufficiently near that point (there are some points "above" and "below" the line that are not contained in the set).
  • Also, a disk (a region in a plane bounded by a circle) in $\mathbb{R}^3$ is not an open set, since for each point in the disk, the disk does not contain all points that are sufficiently near that point (there are some points "above" and "below" the disk that are not contained in the disk).
  • From the previous theorem about sufficient statistics for the exponential family, we know that $T$ is also a sufficient statistic for $\boldsymbol\theta$ under such conditions.
  • When a statistic is sufficient for a parameter (vector) $\theta$ and is also a complete statistic, we call such a statistic a complete and sufficient statistic for $\theta$.

Theorem. (Lehmann-Scheffé theorem) If $T$ is a complete and sufficient statistic for $\theta$ and $\mathbb{E}[g(T)] = \tau(\theta)$, then $g(T)$ is the unique UMVUE of $\tau(\theta)$ (with probability 1).

Proof. Assume is a complete and sufficient statistic for and .

Since $T$ is a sufficient statistic, we can apply the Rao-Blackwell theorem. From the Rao-Blackwell theorem, if $\hat\theta$ is an arbitrary unbiased estimator of $\tau(\theta)$, then $\hat\theta^* = \mathbb{E}[\hat\theta \mid T]$ is another unbiased estimator with $\operatorname{Var}(\hat\theta^*) \le \operatorname{Var}(\hat\theta)$.

To prove that $g(T)$ is the unique UMVUE of $\tau(\theta)$, we proceed to show that regardless of the choice of the unbiased estimator of $\tau(\theta)$, we get the same $\hat\theta^*$ from the Rao-Blackwell theorem (with probability 1). Then, for every possible unbiased estimator $\hat\theta$ of $\tau(\theta)$, we will have $\operatorname{Var}(g(T)) \le \operatorname{Var}(\hat\theta)$ (with probability 1) [8], which means $g(T)$ is the UMVUE, and is also the unique UMVUE since we always get the same $\hat\theta^*$ [9].

Assume that is another unbiased estimator of (). By Rao-Blackwell theorem again, there is an unbiased estimator () where . Since both and are unbiased estimators of , we have for each ,

Since is a complete statistic, we have
which means (with probability 1), i.e., we get the same from the Rao-Blackwell theorem in this case (with probability 1).

Remark.

  • The "" in this theorem is a function of , and we know from the proof and Rao-Blackwell theorem that it is actually where is an arbitrary unbiased estimator of .
  • Thus, when we apply this theorem, as long as we can find a function $g$ of $T$ (perhaps by inspection) such that $\mathbb{E}[g(T)] = \tau(\theta)$, we know that $g(T)$ is the unique UMVUE of $\tau(\theta)$. Also, due to the uniqueness of the UMVUE, $g(T)$ is actually $\mathbb{E}[\hat\theta \mid T]$ where $\hat\theta$ is an arbitrary unbiased estimator of $\tau(\theta)$.
  • We can find $g$ by inspection, as above, in simple cases. However, in more complicated cases, it may not be immediately clear what the explicit form of $g$ should be such that $\mathbb{E}[g(T)] = \tau(\theta)$. In such cases, we need to find an unbiased estimator $\hat\theta$ of $\tau(\theta)$ and evaluate $\mathbb{E}[\hat\theta \mid T]$ to get the explicit form of $g(T)$.

Example. Consider a random sample $X_1,\ldots,X_n$ from $\mathcal{N}(\mu,\sigma^2)$. Let the unknown parameter vector be $\boldsymbol\theta = (\mu,\sigma^2)$.

(a) Show that the sufficient statistic for $\boldsymbol\theta$, namely $\left(\sum_j X_j, \sum_j X_j^2\right)$, is also a complete statistic.

(b) Hence, show that $\bar{X}$ and $S^2$ are the UMVUEs of $\mu$ and $\sigma^2$ respectively.

Solution:

(a)

Proof. It suffices to show that the parameter space $\Theta = \{(\mu,\sigma^2) : \mu \in \mathbb{R},\ \sigma^2 > 0\}$ contains an open set in $\mathbb{R}^2$. This is true since the parameter space is the whole open half-plane above the horizontal axis if we represent it using the Cartesian coordinate system, and thus contains an open set.


(b)

Proof. Since $\mathbb{E}[\bar{X}] = \mu$ and $\mathbb{E}[S^2] = \sigma^2$ (we have shown these before), and $\bar{X}$ and $S^2$ are functions of the complete and sufficient statistic $\left(\sum_j X_j, \sum_j X_j^2\right)$, by the Lehmann-Scheffé theorem, we have the desired result.


Remark.

  • We have shown that $S^2$ does not attain the CRLB of $\sigma^2$, and that the CRLB of $\sigma^2$ is actually unattainable. Thus, we were not able to determine whether $S^2$ is the UMVUE of $\sigma^2$ before. Now, we know that $S^2$ is actually the UMVUE of $\sigma^2$, with the help of the Lehmann-Scheffé theorem.

Example. Consider a random sample $X_1,\ldots,X_n$ from the Bernoulli distribution with success probability $p$, i.e., $\operatorname{Ber}(p)$, with pmf $f(x;p) = p^x (1-p)^{1-x}$ for $x \in \{0,1\}$.

(a) Find a complete and sufficient statistic for .

(b) Hence, find the UMVUE of $p$.

(c) Show that is an unbiased estimator of , and is the UMVUE of .

Solution:

(a) The pmf satisfies $f(x;p) = p^x(1-p)^{1-x} = \exp\left(x \ln\dfrac{p}{1-p} + \ln(1-p)\right)$ for $x \in \{0,1\}$. This means the Bernoulli distribution belongs to the exponential family. Also, the parameter space $(0,1)$ contains an open set in $\mathbb{R}$. Hence, $T = \sum_{j=1}^{n} X_j$ is a complete and sufficient statistic for $p$.

(b) Notice that $\mathbb{E}[\bar{X}] = p$. Hence, $\bar{X} = T/n$ (which is a function of $T$) is the UMVUE of $p$.

(c)

Proof. Since , is an unbiased estimator of .

Now, we consider . We denote by . Then, this expectation becomes . In the following, we evaluate .

Notice that $T$ follows the binomial distribution with $n$ trials and success probability $p$, i.e., $T \sim \operatorname{Bin}(n,p)$. Hence,
Now, replacing by gives
which is the UMVUE of , as desired.
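The "evaluate an expectation of $T$, then match the target" recipe can be checked numerically. As a hedged illustration (the values $p = 0.6$, $n = 8$ are assumptions, and $p^2$ is taken as the target, which is one standard version of this exercise): since $\mathbb{E}[T(T-1)] = n(n-1)p^2$ for $T \sim \operatorname{Bin}(n,p)$, the statistic $\frac{T(T-1)}{n(n-1)}$ is unbiased for $p^2$, and being a function of the complete and sufficient statistic $T$, it is then the UMVUE of $p^2$ by Lehmann-Scheffé.

```python
import numpy as np

# Unbiasedness check for T(T-1)/(n(n-1)) as an estimator of p**2,
# where T ~ Bin(n, p).  p=0.6, n=8 are illustrative assumptions,
# and p**2 as the target is an assumed (standard) version of the exercise.
rng = np.random.default_rng(7)
p, n, reps = 0.6, 8, 500_000

t = rng.binomial(n, p, size=reps)
umvue = t * (t - 1) / (n * (n - 1))

print(umvue.mean())   # ≈ p² = 0.36
```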

Clipboard

Exercise. Can we find the UMVUE of using the CRLB of ? If yes, find it using this way. If no, explain why.

Solution

No. This is because the log-likelihood function is not differentiable (it has nonzero value only when ), and thus the Fisher information is undefined. Hence, the CRLB does not exist.


Clipboard

Exercise. Consider a random sample $X_1,\ldots,X_n$ from the Poisson distribution with rate parameter $\lambda$, with pmf $f(x;\lambda) = \dfrac{\lambda^x e^{-\lambda}}{x!}$ for $x = 0,1,2,\ldots$.

(a) Find a complete and sufficient statistic for .

(b) Find the UMVUE of .


Solution

(a) The pmf is

$f(x;\lambda) = \frac{\lambda^x e^{-\lambda}}{x!} = \exp\big(x\ln\lambda - \lambda - \ln(x!)\big).$

Hence, the Poisson distribution belongs to the exponential family, and a complete and sufficient statistic for $\lambda$ is $T = \sum_{j=1}^{n} X_j$.

(b) Take . Since , we have

Thus, the UMVUE of is (which is a function of ).


Consistency[edit | edit source]

In the previous sections, we have discussed unbiasedness and efficiency. In this section, we will discuss another property called consistency.

Definition. (Consistent estimator) $\hat\theta_n$ (an estimator based on a sample of size $n$) is a consistent estimator of the unknown parameter $\theta$ if $\hat\theta_n \xrightarrow{p} \theta$.

Remark.

  • By the definition of convergence in probability, $\hat\theta_n \xrightarrow{p} \theta$ means $\mathbb{P}\big(|\hat\theta_n - \theta| > \varepsilon\big) \to 0$ as $n \to \infty$, for each $\varepsilon > 0$.

Proposition. If $\hat\theta_n$ is an (asymptotically) unbiased estimator of an unknown parameter $\theta$ and $\operatorname{Var}(\hat\theta_n) \to 0$ as $n \to \infty$, then $\hat\theta_n$ is a consistent estimator of $\theta$.

Proof. Assume $\hat\theta_n$ is an (asymptotically) unbiased estimator of an unknown parameter $\theta$ and $\operatorname{Var}(\hat\theta_n) \to 0$ as $n \to \infty$. Since $\hat\theta_n$ is an (asymptotically) unbiased estimator of $\theta$, we have $\mathbb{E}[\hat\theta_n] \to \theta$ (this is true for both asymptotically unbiased and unbiased estimators of $\theta$). In addition, we have by assumption that $\operatorname{Var}(\hat\theta_n) \to 0$. By the definition of mean squared error, these imply that $\operatorname{MSE}(\hat\theta_n) = \operatorname{Var}(\hat\theta_n) + \big(\mathbb{E}[\hat\theta_n] - \theta\big)^2 \to 0$. Thus, as $n \to \infty$, we have by Chebyshev's inequality, for each $\varepsilon > 0$,

$\mathbb{P}\big(|\hat\theta_n - \theta| > \varepsilon\big) \le \frac{\mathbb{E}\big[(\hat\theta_n - \theta)^2\big]}{\varepsilon^2} = \frac{\operatorname{MSE}(\hat\theta_n)}{\varepsilon^2}.$

Since probability is nonnegative, and this probability is bounded above by an expression that tends to 0 as $n \to \infty$, we conclude that this probability tends to zero as $n \to \infty$. That is, $\hat\theta_n$ is a consistent estimator of $\theta$.

Remark.

  • Unbiasedness alone does not imply consistency.

Example. Let $X_1,\ldots,X_n$ be a random sample from $\mathcal{N}(\mu,\sigma^2)$. Then, $X_1$ is an unbiased estimator of $\mu$ since $\mathbb{E}[X_1] = \mu$. However, there exists some $\varepsilon > 0$ such that

$\mathbb{P}\big(|X_1 - \mu| > \varepsilon\big) \not\to 0$

as $n \to \infty$. Since $X_1$ is independent of $n$, this amounts to $\mathbb{P}\big(|X_1 - \mu| > \varepsilon\big) > 0$ for some $\varepsilon > 0$, which is true. Hence, $X_1$ is not a consistent estimator of $\mu$.
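The contrast between a consistent and an inconsistent unbiased estimator can be visualized numerically. A hedged sketch ($\mu = 0$, $\sigma = 1$, $\varepsilon = 0.5$ are arbitrary assumptions; we sample $\bar X$ directly from its exact $\mathcal{N}(\mu, \sigma^2/n)$ distribution to keep memory small):

```python
import numpy as np

# P(|X̄ - μ| > ε) shrinks to 0 as n grows (X̄ is consistent), while
# P(|X₁ - μ| > ε) stays at 2(1 - Φ(0.5)) ≈ 0.617 for every n (X₁ is not).
# mu=0, sigma=1, eps=0.5 are illustrative assumptions.
rng = np.random.default_rng(8)
mu, sigma, eps, reps = 0.0, 1.0, 0.5, 20_000

probs = {}
for n in (5, 50, 500):
    xbar = rng.normal(mu, sigma / np.sqrt(n), size=reps)  # distribution of X̄
    x1 = rng.normal(mu, sigma, size=reps)                 # X₁ alone
    probs[n] = (np.mean(np.abs(xbar - mu) > eps),
                np.mean(np.abs(x1 - mu) > eps))

print(probs)
```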

Clipboard

Exercise.

(a) Propose a consistent estimator of $\mu$, and show that it is actually a consistent estimator of $\mu$ (Hint: consider the weak law of large numbers).

(b) Propose a consistent estimator of the coefficient of variation (or relative standard deviation) $\sigma/\mu$ (assuming $\mu \neq 0$ so that it is defined), and show that it is actually a consistent estimator of $\sigma/\mu$ (Hint: consider the weak law of large numbers, and properties of convergence in probability. You can use the fact that the normal distribution has a finite fourth moment).

Solution

(a) $\bar{X}$ is a consistent estimator of $\mu$.

Proof. By the weak law of large numbers (notice that the mean and variance are finite for the normal distribution), $\bar{X} \xrightarrow{p} \mu$, as desired.


(b) $S/\bar{X}$ is a consistent estimator of $\sigma/\mu$.

Proof. By the weak law of large numbers (the variance is finite, and the fourth moment is finite), $\frac{1}{n}\sum_{i=1}^{n} X_i^2 \xrightarrow{p} \mathbb{E}[X_1^2]$. Also, by the continuous mapping theorem, $\bar{X}^2 \xrightarrow{p} \mu^2$ since $\bar{X} \xrightarrow{p} \mu$. Thus, by properties of convergence in probability and the result about the sample variance,

$S^2 = \frac{n}{n-1}\left(\frac{1}{n}\sum_{i=1}^{n} X_i^2 - \bar{X}^2\right) \xrightarrow{p} \mathbb{E}[X_1^2] - \mu^2 = \sigma^2.$

By the continuous mapping theorem again,

$S \xrightarrow{p} \sigma$

(since the square root function is continuous). Hence, by properties of convergence in probability (we assume that $\mu \neq 0$) again,

$\frac{S}{\bar{X}} \xrightarrow{p} \frac{\sigma}{\mu},$

as desired.




  1. For the parameter vector, it contains all parameters governing the distribution.
  2. We will simply use "" when we do not know whether it is parameter vector or just a single parameter. We may use instead if we know it is indeed a parameter vector.
  3. We will discuss some criterion for "good" in the #Properties of estimator section.
  4. . Thus, .
  5. For each positive integer , always exist, unlike .
  6. "uniformly" means that the variance is minimum compared to other unbiased estimators, over the parameter space (i.e., for each possible value of ). That is, the variance is not just minimum for a particular value of , but all possible values of .
  7. This is different from the minimum value. For lower bound, it only needs to be smaller than all variances involved, and there may not be any variance that actually achieve this lower bound. However, for the minimum value, it has to be one of the values of the variance.
  8. Notice that this is a stronger result than the result in the Rao-Blackwell theorem, where the latter only states that , for the corresponding to
  9. Indeed, we know that UMVUE must be unique from previous proposition. However, in this argument, when we show that is UMVUE, we also automatically show that it is unique.