# Statistics/Hypothesis Testing

## Introduction

In previous chapters, we have discussed two methods for *estimating unknown parameters*, namely *point estimation* and *interval estimation*.
Estimating unknown parameters is an important area in statistical inference, and in this chapter we will discuss another important area, namely *hypothesis testing*, which is related to *decision making*.
Indeed, the concepts of *confidence intervals* and *hypothesis testing* are closely related, as we will demonstrate.

## Basic concepts and terminologies

Before discussing how to *conduct* hypothesis testing, and *evaluate* the "goodness" of a hypothesis test, let us introduce some basic concepts and terminologies related to hypothesis testing first.

**Definition.**
(Hypothesis)
A (statistical) *hypothesis* is a statement about population parameter(s).

There are two terms that classify hypotheses:

**Definition.**
(Simple and composite hypothesis)
A hypothesis is a *simple hypothesis* if it *completely specifies* the distribution of the population (that is, the distribution is completely known, without any unknown parameters involved), and
is a *composite hypothesis* otherwise.

Sometimes, it is not immediately clear whether a hypothesis is simple or composite. To understand the classification of hypotheses more clearly, let us consider the following example.

**Example.**
Consider a distribution with parameter $\theta$, taking values in the parameter space $\Theta$.
Determine whether each of the following hypotheses is simple or composite.

(a) .

(b) where is known.

(c) .

(d) where is known.

(e) where is known.

(f) $\theta\in\Theta_0$, where $\Theta_0$ is a nonempty subset of $\Theta$. ^{[1]}

*Solution*.

- (a) and (b) are simple hypotheses, since each of them completely specifies the distribution.
- (c), (d) and (e) are composite hypotheses, since the parameter $\theta$ is not completely specified, and hence neither is the distribution.
- (f) may be a simple hypothesis or a composite hypothesis, depending on $\Theta_0$. If $\Theta_0$ contains exactly one element, then it is a simple hypothesis. Otherwise, it is a composite hypothesis.

In hypothesis tests, we consider two hypotheses:

**Definition.**
(Null hypothesis and alternative hypothesis)
In hypothesis testing, the hypothesis being tested is the *null hypothesis* (denoted by $H_0$) and another *complementary* hypothesis (to $H_0$) is the *alternative hypothesis* (denoted by $H_1$).

**Remark.**

- $H_1$ is the complementary hypothesis to $H_0$ in the sense that if $H_0$ is true (false), then $H_1$ is false (true); that is, *exactly* one of $H_0$ and $H_1$ is true. Because of this, we usually say $H_0$ is tested *against* $H_1$ (so we often write "$H_0$ vs. $H_1$").
- Usually, $H_0$ corresponds to the *status quo* ("no effect"), and $H_1$ corresponds to some interesting "research finding" (so $H_1$ is sometimes also called the *research hypothesis*).
  - Since $H_0$ often corresponds to the status quo, we usually *assume* $H_0$ is true, unless there is sufficient evidence against it.
  - This is somewhat analogous to the legal principle of *presumption of innocence*, which states that every person accused of a crime is considered *innocent* ($H_0$ is assumed to be true) until proven guilty (there is sufficient evidence against $H_0$).

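A general form of $H_0$ and $H_1$ is $H_0:\theta\in\Theta_0$ and $H_1:\theta\in\Theta_1$,
where $\Theta_1=\Theta_0^c$, which is the *complement* of $\Theta_0$ (with respect to $\Theta$), i.e., $\Theta_0^c=\Theta\setminus\Theta_0$ ($\Theta$ is the parameter space, containing all possible values of $\theta$).
The reason for choosing the complement of $\Theta_0$ in $H_1$ is that $H_1$ is the complementary hypothesis to $H_0$, as suggested in the above definition.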

**Remark.**

- In some books, it is only required that $\Theta_0$ and $\Theta_1$ be disjoint (nonempty) subsets of the parameter space $\Theta$, and it is not necessary that $\Theta_1=\Theta_0^c$.
- However, usually it is still assumed that *exactly* one of $H_0$ and $H_1$ is true, so $\theta$ is *not* supposed to take values outside the set $\Theta_0\cup\Theta_1$ (otherwise, neither $H_0$ nor $H_1$ would be true).
- Thus, in this case, we may actually say the parameter space is indeed $\Theta_0\cup\Theta_1$. With this parameter space (since $\theta$ is assumed to take values in this union), $\Theta_1$ *is* the complement of $\Theta_0$.
- Alternatively, some may view the parameter space as "linked" with a distribution, so that for a given distribution the parameter space is fixed to be the one suggested by the distribution itself. In this case, $\Theta_1$ is *not* the complement of $\Theta_0$ (with respect to the parameter space).
- Despite the different definitions of $\Theta_0$ and $\Theta_1$, a common feature is that we assume *exactly* one of $H_0$ and $H_1$ is true.

**Example.**
Suppose your friend gives you a coin to toss, and you do not know whether it is fair or not.
However, since the coin is given by your friend, you believe that the coin is fair unless there is sufficient evidence suggesting otherwise.
What are the null hypothesis and the alternative hypothesis in this context (suppose the coin never lands on its edge)?

*Solution*.
Let $p$ be the probability of landing on heads after tossing the coin.
The null hypothesis is $H_0:p=\frac{1}{2}$. The alternative hypothesis is $H_1:p\ne\frac{1}{2}$.

**Exercise.**
Suppose we replace "coin" with "six-faced die" in the above question. What are the null and alternative hypotheses? (*Hint*:
You may let $p_1,p_2,\dots,p_6$ be the probabilities of "1","2",...,"6" coming up after rolling the die respectively.)

Let $p_1,p_2,\dots,p_6$ be the probabilities of "1","2",...,"6" coming up after rolling the die respectively. The null hypothesis is $H_0:p_1=p_2=\dots=p_6=\frac{1}{6}$, and the alternative hypothesis is $H_1$: at least one of $p_1,\dots,p_6$ is not $\frac{1}{6}$. (In fact, when one of $p_1,\dots,p_6$ is not $\frac{1}{6}$, at least one other probability must also differ from $\frac{1}{6}$, since the six probabilities sum to 1.)

We have mentioned that exactly one of $H_0$ and $H_1$ is assumed to be true.
To make a decision, we need to *decide* which hypothesis should be regarded as true. Of course, as one may expect, this decision is not perfect, and we will have some errors involved in our decision.
So, we cannot say we "prove that" a particular hypothesis is true (that is, we cannot be *certain* that a particular hypothesis is true).
Despite this, we may "regard" (or "accept") a particular hypothesis as true (but *not* prove it to be true) when we have *sufficient evidence* that leads us to make this decision (ideally, with small errors ^{[2]}).

**Remark.**

- Philosophically, "not rejecting $H_0$" is different from "accepting $H_0$", since the phrase "not rejecting $H_0$" can mean that we do not actually regard $H_0$ as true but merely lack sufficient evidence to reject $H_0$. On the other hand, the phrase "accepting $H_0$" should mean that we regard $H_0$ as true.
- In spite of this, we will not handle these philosophical issues. We will just assume that whenever there is not sufficient evidence to reject $H_0$ (i.e., we do not reject $H_0$), we will act as if $H_0$ is true, that is, still accept $H_0$, even if we may not actually "believe" in $H_0$.
- Of course, in some other places, the phrase "accepting the null hypothesis" is avoided because of these philosophical issues.

Now, we face two questions. First, what evidence should we consider? Second, what is meant by "sufficient"?
For the first question, a natural answer is that we should consider the observed *samples*.
This is because we are making a hypothesis about the population, and the samples are taken from, and thus closely related to, the population, which should help us make the decision.

To answer the second question, we need the concepts in *hypothesis testing*. In particular, we will construct a so-called *rejection region* or *critical region* to help us determine *whether* we should reject the *null* hypothesis (i.e., regard $H_0$ as false), and hence (naturally) regard $H_1$ as true ("accept" $H_1$) (we have assumed that exactly one of $H_0$ and $H_1$ is true, so when we regard one of them as false, we should regard the other as true).
In particular, when we do *not* reject $H_0$, we will act as if $H_0$ is true, i.e., accept $H_0$ (and thus should also reject $H_1$, since exactly one of $H_0$ and $H_1$ is true).

Let us formally define the terms related to hypothesis testing in the following.

**Definition.**
(Hypothesis test)
A *hypothesis test* is a rule that specifies for which observed sample values we (do not reject $H_0$ and) accept $H_0$ as true (and thus reject $H_1$), and for which observed sample values we reject $H_0$ and accept $H_1$.

**Remark.**

- A hypothesis test is sometimes simply called a "test". We also sometimes use Greek letters such as "$\varphi$", "$\psi$", etc. to denote tests.

**Definition.**
(Rejection and acceptance regions)
Let $S$ be the set containing all possible observations $(x_1,\dots,x_n)$ of a random sample $(X_1,\dots,X_n)$.
The *rejection region* (denoted by $R$) is the subset of $S$ for which $H_0$ is rejected. The complement of the rejection region (with respect to the set $S$), $R^c$, is the *acceptance region* (it is thus the subset of $S$ for which $H_0$ is accepted).

**Remark.**

- Graphically, it looks like:

      S
      *------------*
      |///|........|
      |///\........|
      |////\.......|
      |/////\......|
      *------------*

      *--*
      |//|: R
      *--*
      *--*
      |..|: R^c
      *--*

Typically, we use a *test statistic* (a statistic for conducting a hypothesis test) to specify the rejection region.
For instance, if the random sample is and the test statistic is , the rejection region may be, say, (where and is observed value of and respectively).
Through this, we can directly construct a hypothesis test: when the observed sample falls in the rejection region $R$, we reject $H_0$ and accept $H_1$. Otherwise, if it falls in $R^c$, we accept $H_0$.
So, in general, to specify the rule in a hypothesis test, we just need a *rejection region*.
After that, we will apply the test for testing $H_0$ against $H_1$.
There is some terminology related to the hypothesis tests constructed in this way:

**Definition.**
(Left-, Right- and two-tailed tests)
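Let $T(\mathbf{x})$ be an observed test statistic for a hypothesis test, where $\mathbf{x}=(x_1,\dots,x_n)$ is the realization of the random sample, and let $c$, $c_1$ and $c_2$ be constants.

- If the rejection region is of the form $\{\mathbf{x}:T(\mathbf{x})\le c\}$, then the hypothesis test is called a *left-tailed test* (or lower-tailed test).
- If the rejection region is of the form $\{\mathbf{x}:T(\mathbf{x})\ge c\}$, then the hypothesis test is called a *right-tailed test* (or upper-tailed test).
- If the rejection region is of the form $\{\mathbf{x}:T(\mathbf{x})\le c_1\text{ or }T(\mathbf{x})\ge c_2\}$ (with $c_1<c_2$), then the hypothesis test is called a *two-tailed test*.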

**Remark.**

- The inequality signs can be strict, i.e., the above inequality signs can be replaced by "$<$" and "$>$".
- We use the terminology "tail" since the rejection region includes the values that are located at the "extreme portions" (i.e., very left (with small values) or very right (with large values) portions) (called tails) of distributions.
- When , we may say the two-tailed test is equal-tailed. In this case, we can also express the rejection region as .
- We sometimes also call upper-tailed and lower-tailed tests *one-sided tests*, and two-tailed tests *two-sided tests*.

**Example.**
Suppose the rejection region is , and it is observed that .
Which hypothesis, $H_0$ or $H_1$, should we accept?

*Solution*.
Since , we should (not reject and) accept .

**Exercise.**
What is the type of this hypothesis test?

Right-tailed test.

As we have mentioned, the decisions made by a hypothesis test are not perfect, and errors occur.
Indeed, when we think carefully, there are actually *two types* of errors, as follows:

**Definition.**
(Type I and II errors)
A *type I error* is the *rejection* of $H_0$ when $H_0$ is *true*.
A *type II error* is the *acceptance* of $H_0$ when $H_0$ is *false*.

We can illustrate these two types of errors more clearly using the following table.

| | Accept $H_0$ | Reject $H_0$ |
|---|---|---|
| $H_0$ is true | Correct decision | Type I error |
| $H_0$ is false | Type II error | Correct decision |

We can express and . Also, assume the rejection region is (i.e., the rejection region with "" replaced by ""). In general, when "" is put together with "", we assume .

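Then we have some notation and expressions for the *probabilities* of making type I and type II errors
(let $X_1,\dots,X_n$ be a random sample and $\mathbf{X}=(X_1,\dots,X_n)$):

- The probability of making a type I error, denoted by $\alpha(\theta)$, is $\mathbb{P}_\theta(\mathbf{X}\in R)$ if $\theta\in\Theta_0$.
- The probability of making a type II error, denoted by $\beta(\theta)$, is $\mathbb{P}_\theta(\mathbf{X}\in R^c)=1-\mathbb{P}_\theta(\mathbf{X}\in R)$ if $\theta\in\Theta_0^c$.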

**Remark.**

- **Remark on notations**: In some other places, the probability in $\alpha(\theta)$ may be written in other notations. We should be careful that these notations are not supposed to be interpreted as conditional probabilities^{[3]}. Instead, they are just notations. This applies similarly to $\beta(\theta)$.
- When $\Theta_0$ contains a single value only, we simply denote the type I error probability by $\alpha$. Similarly, when $\Theta_0^c$ contains a single value only, we simply denote the type II error probability by $\beta$.

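Notice that we have a common expression in both $\alpha(\theta)$ and $\beta(\theta)$, which is "$\mathbb{P}_\theta(\mathbf{X}\in R)$". Indeed, we can also write this expression as
$$\mathbb{P}_\theta(\mathbf{X}\in R)=\begin{cases}\alpha(\theta),&\theta\in\Theta_0;\\1-\beta(\theta),&\theta\in\Theta_0^c.\end{cases}$$
Through this, we can observe that this expression contains all the information about the probabilities of making errors, given a hypothesis test with rejection region $R$. Hence, we will give a special name to it: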

**Definition.**
(Power function)
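Let $R$ be the rejection region of a hypothesis test, and $\mathbf{X}=(X_1,\dots,X_n)$ be a random sample. Then, the *power function* of the hypothesis test is
$$\pi(\theta)=\mathbb{P}_\theta(\mathbf{X}\in R)$$
where $\theta\in\Theta$.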

**Remark.**

- "$\pi$" can be thought of as the Greek letter "p". We choose $\pi$ instead of $p$ since "$p$" is sometimes used to denote probability (mass or density) functions.
- The power function will be our basis for evaluating the goodness of a test or comparing two different tests.

**Example.**
Suppose we toss a (fair or unfair) coin 5 times (suppose the coin never lands on its edge), and we have the following hypotheses:
where $p$ is the probability of landing on heads after tossing the coin.
Let $X_1,\dots,X_5$ be the random sample for the 5 coin tosses, and $x_1,\dots,x_5$ be the corresponding realizations.
Also, the value of each $X_i$ is 1 if heads comes up and 0 otherwise.
Suppose we will reject $H_0$ if and only if heads comes up in all 5 coin tosses.

(a) Determine the rejection region .

(b) What is the power function (express in terms of )?

(c) Calculate and .

*Solution*.

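(a) The rejection region is $R=\{(x_1,\dots,x_5):x_1=x_2=\dots=x_5=1\}=\{(1,1,1,1,1)\}$.

(b) The power function is $\pi(p)=\mathbb{P}_p(X_1=X_2=\dots=X_5=1)=p^5$, by the independence of the tosses.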

(c) We have and . (Notice that although the probability of type I error can be low, the probability of type II error can be quite high. This is because, intuitively, it is quite "hard" to reject $H_0$ due to the strict requirement. So, even if $H_0$ is false, it may not be rejected, causing a type II error.)
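The power function in this example can be checked by brute force. The following is a minimal sketch, not part of the original solution: it enumerates every outcome in $S=\{0,1\}^5$, adds up the probabilities of the outcomes in the rejection region, and compares the result with $p^5$. The function name `power`, the helper `all_heads`, and the trial values of `p` are assumptions chosen for illustration.

```python
from itertools import product

def power(p, in_rejection_region, n=5):
    """Power function pi(p) = P_p(X in R) for n independent Bernoulli(p) tosses."""
    total = 0.0
    for x in product((0, 1), repeat=n):      # every point of S = {0,1}^n
        if in_rejection_region(x):
            heads = sum(x)
            total += p ** heads * (1 - p) ** (n - heads)
    return total

# Rejection rule from the example: reject only when all 5 tosses are heads.
def all_heads(x):
    return all(xi == 1 for xi in x)

for p in (0.5, 0.7):                          # illustrative values of p
    print(p, power(p, all_heads), p ** 5)
```

Because `power` takes the rejection region as a predicate, the same function can be reused to explore other rejection rules for the same five tosses.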

**Exercise.**
Does exist? If yes, calculate it.

exists, and (notice that is a strictly increasing function).

You notice that the type II error of this hypothesis test can be quite large, so you want to revise the test to lower the type II error.

(a) What is in the above hypothesis test?

(b) Suppose the rejection region is modified to . Calculate and . (*Hint*: consider binomial distribution.)

(c) Suppose the rejection region is modified to . Calculate and .

(d) is minimized at which hypothesis test: the original one, the one in (b), or the one in (c)?

(a) if .

(b) In this case, we have , and .

(c) In this case, we have and .

(d) At the original one, , at the one in (b), , and at the one in (c), . So, is minimized at the one in (b).

**Example.**
Let be a random sample from the normal distribution where is known. Consider the following hypotheses:
where is a constant.
We use the test statistic for the hypothesis testing,
and we reject if and only if .

Find the power function , , and .

*Solution*.
The power function is
Thus, and
(some abuse of notations), by the definition of cdf.
(Indeed, is a strictly increasing function of .)

**Exercise.**
Show that if .

**Proof.**
Assume .
Then, .

Ideally, we want to make both $\alpha(\theta)$ and $\beta(\theta)$ arbitrarily small. But this is generally impossible. To understand this, we can consider the following extreme examples:

- Set the rejection region $R$ to be $S$, the set of all possible observations of the random sample. Then, $\pi(\theta)=1$ for each $\theta\in\Theta$. From this, of course we have $\beta(\theta)=0$, which is nice. But the serious problem is that $\alpha(\theta)=1$ due to the mindless rejection.
- Another extreme is setting the rejection region $R$ to be the empty set $\varnothing$. Then, $\pi(\theta)=0$ for each $\theta\in\Theta$. From this, we have $\alpha(\theta)=0$, which is nice. But, again, the serious problem is that $\beta(\theta)=1$ due to the mindless acceptance.

We can observe that to make $\alpha(\theta)$ ($\beta(\theta)$) very small, it is inevitable that $\beta(\theta)$ ($\alpha(\theta)$) will increase as a consequence, due to accepting (rejecting) "too much".
As a result, we can only try to minimize the probability of making one type of error, while holding the probability of making the other type of error *controlled*.
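The trade-off between the two error probabilities can be seen numerically. The following sketch uses assumed values that do not come from the text: 5 coin tosses, a rule "reject when the number of heads is at least `k`", a null value `p_null = 0.5` and an alternative value `p_alt = 0.7`. The choice `k = 0` corresponds to always rejecting and `k = 6` to never rejecting.

```python
from math import comb

def prob_at_least(k, n, p):
    """P(Y >= k) for Y ~ Binomial(n, p)."""
    return sum(comb(n, j) * p**j * (1 - p)**(n - j) for j in range(k, n + 1))

n, p_null, p_alt = 5, 0.5, 0.7   # assumed values for illustration
rows = []
for k in range(0, n + 2):        # k = 0 always rejects, k = n + 1 never rejects
    alpha = prob_at_least(k, n, p_null)      # type I error probability at p_null
    beta = 1 - prob_at_least(k, n, p_alt)    # type II error probability at p_alt
    rows.append((k, alpha, beta))
    print(f"k={k}: alpha={alpha:.4f}, beta={beta:.4f}")
```

As `k` grows, the printed type I error probability shrinks while the type II error probability grows, which matches the two extreme cases described above.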

Now, we are interested in knowing *which* type of error should be controlled.
To motivate the choice, we can again consider the analogy of the legal principle of presumption of innocence. In this case, a type I error means convicting an innocent person, and a type II error means acquitting a guilty person. Then, as suggested by Blackstone's ratio, a type I error is more serious than a type II error.
This motivates us to control the probability of type I error, i.e., $\alpha(\theta)$, at a specified small value $\alpha$, so that we can control the probability of making this more serious error.
After that, we consider the tests that "control the type I error probability at this level", and the one with the smallest $\beta(\theta)$ is the "best" one (in the sense of the probability of making errors).

To describe "control the type I error probability at this level" in a more precise way, let us define the following term.

**Definition.**
(Size of a test)
A test with power function $\pi(\theta)$ is a *size $\alpha$ test* if
$$\sup_{\theta\in\Theta_0}\pi(\theta)=\alpha$$
where $0\le\alpha\le 1$.

**Remark.**

- Supremum is similar to maximum, and in "nice" situations (you may assume the situations here are "nice"), the supremum is the same as the maximum. Hence, choosing the supremum of $\pi(\theta)$ over $\Theta_0$ as the size of a test means that the size gives the maximum probability of type I error (rejecting $H_0$ when $H_0$ is true), considering all situations, i.e., all possible values of $\theta$ that make $H_0$ true.
- Intuitively, we choose the maximum probability of type I error to be the size so that the size tells us how probable a type I error is in the *worst situation*, showing how "well" the test can *control* the type I error^{[4]}.
- Special case: if $\Theta_0$ contains a single parameter only, say $\theta_0$ (a known value) (i.e., $H_0$ is a simple hypothesis which states that $\theta=\theta_0$), then $\alpha=\pi(\theta_0)$.
- $\alpha$ is also called the *level of significance* or *significance level* (these terms are related to the concept of *statistical (in)significance*, which is in turn related to the concept of *$p$-value*. We will discuss these later.).
- The "$\alpha$" here and the "$\alpha$" in the confidence coefficient $1-\alpha$ can actually be interpreted as the "same", by connecting confidence intervals with hypothesis testing. We will discuss this later.
- Because of this definition, the null hypothesis conventionally contains an equality (i.e., it is of the form $H_0:\theta=\theta_0$, $H_0:\theta\le\theta_0$ or $H_0:\theta\ge\theta_0$), since the size of the test can be calculated more conveniently if this is the case.

So, using this definition, controlling the type I error probability at a particular level $\alpha$ means that the size of the test should not exceed $\alpha$, i.e., $\sup_{\theta\in\Theta_0}\pi(\theta)\le\alpha$ (in some other places, such a test is called a *level $\alpha$ test*).
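To make the supremum in the definition of size concrete, here is a minimal sketch under assumptions that are not taken from the text: the sample is $N(\mu,1)$ with $n=10$, the null hypothesis is $\mu\le 0$, and the test rejects when the sample mean is at least a cutoff `c`. The code picks `c` so that the power at the boundary $\mu=0$ equals 0.05, then approximates the supremum over the null parameter set on a grid.

```python
from statistics import NormalDist
from math import sqrt

Z = NormalDist()  # standard normal distribution

def power(mu, c, n, sigma=1.0):
    """pi(mu) = P_mu(sample mean >= c) when X_i ~ N(mu, sigma^2)."""
    return 1 - Z.cdf(sqrt(n) * (c - mu) / sigma)

n, mu0, alpha = 10, 0.0, 0.05              # assumed illustrative values
c = mu0 + Z.inv_cdf(1 - alpha) / sqrt(n)   # cutoff giving pi(mu0) = alpha

# Approximate the supremum over Theta_0 = {mu <= mu0} on a grid of mu values.
grid = [mu0 - 0.01 * i for i in range(0, 301)]
size = max(power(mu, c, n) for mu in grid)
print(round(c, 4), round(size, 4))
```

In this setting the power function is increasing in `mu`, so the maximum over the grid is attained at the boundary value `mu0`, which is why the size can be computed from the boundary alone.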

**Example.**
Consider the normal distribution (with the parameter space: ) , and the hypotheses
Let be a random sample from the normal distribution ,
and the corresponding realizations are .
Suppose the rejection region is .

(a) Find such that the significance level of the test is .

(b) Calculate the type II error probability . To have the type II error probability , what is the minimum sample size (with the same rejection region)?

*Solution*.

(a) In order for the significance level to be 0.05, we need to have But . So, this means where . We then have

(b) The type II error probability is () With the sample size , the type II error probability is When the sample size increases, will become more negative, and hence the type II error probability decreases. It follows that Hence, the minimum sample size is 12.

**Exercise.**
Calculate the type I error probability and type II error probability when the sample size is 12 (the rejection region remains unchanged).

The type II error probability is The type I error probability is So, with the same rejection region and different sample size, the significance level (type I error probability in this case) of the test changes.

So far, we have focused on using a *rejection region* to conduct hypothesis tests.
But this is not the only way. Alternatively, we can make use of the $p$-value.

**Definition.**
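($p$-value)
Let $T(\mathbf{x})$ be an observed value of a test statistic $T(\mathbf{X})$ in a hypothesis test. In each case below, the probability is computed assuming $H_0$ is true.

- **Case 1**: The test is left-tailed. Then, the $p$-value is $\mathbb{P}(T(\mathbf{X})\le T(\mathbf{x}))$.
- **Case 2**: The test is right-tailed. Then, the $p$-value is $\mathbb{P}(T(\mathbf{X})\ge T(\mathbf{x}))$.
- **Case 3**: The test is two-tailed.
  - **Subcase 1**: The distribution of $T(\mathbf{X})$ is symmetric about zero (when $H_0$ is true). Then, the $p$-value is $\mathbb{P}(|T(\mathbf{X})|\ge|T(\mathbf{x})|)$.
  - **Subcase 2**: The distribution of $T(\mathbf{X})$ is not symmetric about zero (when $H_0$ is true). Then, the $p$-value is $2\min\{\mathbb{P}(T(\mathbf{X})\le T(\mathbf{x})),\mathbb{P}(T(\mathbf{X})\ge T(\mathbf{x}))\}$.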
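The cases in this definition can be sketched in code. The example below assumes, purely for illustration, that the test statistic follows a standard normal distribution when the null hypothesis is true; the function name `p_value` and the observed value 1.96 are hypothetical. For the two-tailed case it uses the common "twice the smaller tail probability" convention, which for a null distribution symmetric about zero coincides with the probability that the statistic is at least as large in absolute value as the observed one.

```python
from statistics import NormalDist

def p_value(t_obs, tail, null_dist=NormalDist()):
    """p-value of an observed statistic t_obs under a continuous null distribution."""
    left = null_dist.cdf(t_obs)        # P(T <= t_obs)
    right = 1 - null_dist.cdf(t_obs)   # P(T >= t_obs), valid for a continuous T
    if tail == "left":
        return left
    if tail == "right":
        return right
    if tail == "two":
        # Twice the smaller tail probability.
        return 2 * min(left, right)
    raise ValueError("tail must be 'left', 'right' or 'two'")

print(p_value(1.96, "right"), p_value(1.96, "two"))
```

For a discrete test statistic the right-tail probability would need to include the observed value itself, so this sketch should not be reused there without adjustment.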

**Remark.**

- The $p$-value can be interpreted as the probability that the test statistic is at least as "*extreme*" as the observed test statistic in a hypothesis test, when $H_0$ is true. Here, "extreme" is in favour of $H_1$, i.e., the "direction of extreme" is towards the "direction of tail" for the test (when the test statistic is more towards the tail direction, it is more likely to fall in the rejection region, and therefore to lead us to reject $H_0$ and accept $H_1$).
- So, when the $p$-value is small, it implies that the observed value of the test statistic is already very "extreme", making it unlikely for the test statistic to be even more "extreme" than the observed value.
- In general, it can be quite difficult to compute $p$-values manually. Thus, $p$-values are often computed using software, e.g. R.
- For **case 3 subcase 1**, picture the pdf of $T(\mathbf{X})$, which is symmetric about zero, with both tails shaded:
  - If $T(\mathbf{x})<0$, "more extreme" means $T(\mathbf{X})\le T(\mathbf{x})$ or $T(\mathbf{X})\ge -T(\mathbf{x})$, that is, $|T(\mathbf{X})|\ge|T(\mathbf{x})|$ (since $T(\mathbf{x})=-|T(\mathbf{x})|$ and $-T(\mathbf{x})=|T(\mathbf{x})|$).
  - If $T(\mathbf{x})>0$, "more extreme" means $T(\mathbf{X})\le -T(\mathbf{x})$ or $T(\mathbf{X})\ge T(\mathbf{x})$, that is, $|T(\mathbf{X})|\ge|T(\mathbf{x})|$ (since $-T(\mathbf{x})=-|T(\mathbf{x})|$ and $T(\mathbf{x})=|T(\mathbf{x})|$).

- For **case 3 subcase 2**, picture the pdf of $T(\mathbf{X})$, which is not symmetric about zero. The observed value $T(\mathbf{x})$ splits the real line into the two events $T(\mathbf{X})\le T(\mathbf{x})$ and $T(\mathbf{X})\ge T(\mathbf{x})$.
- We can observe that the observed value may lie in the left tail or the right tail. In either case, for $T(\mathbf{X})$ to be "more extreme", the resulting inequality corresponds to the one with the smaller probability. Thus, we have "$\min$". But we also need to consider the "extreme" in the other tail. It is intuitive to say that when $T(\mathbf{X})$ is comparably extreme in the other tail, it should also be considered "more extreme". Thus, there is a "$2$".

The following theorem allows us to use the $p$-value for hypothesis testing.

**Theorem.**
Let $T(\mathbf{x})$ be an observed value of a test statistic $T(\mathbf{X})$ in a hypothesis test.
The null hypothesis $H_0$ is rejected at the significance level $\alpha$ *if and only if* the $p$-value is less than or equal to $\alpha$.

**Proof.**
(Partial)
We can prove "if" and "only if" directions at once.
Let us first consider the case 1 in the definition of -value.
By definitions, -value is and
(Define such that .).
Then, we have
For other cases, the idea is similar (just the directions of inequalities for are different).
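The equivalence can also be checked numerically in a simple special case. The sketch below assumes a right-tailed test whose statistic is standard normal under the null hypothesis, with an assumed significance level of 0.05. It compares the decision made through the rejection region with the decision made through the $p$-value over a grid of observed values; the names `cutoff` and `decisions` are illustrative.

```python
from statistics import NormalDist

Z = NormalDist()
alpha = 0.05                         # assumed significance level
cutoff = Z.inv_cdf(1 - alpha)        # rejection region: z >= cutoff

def decisions(z):
    """Return (reject by rejection region, reject by p-value) for observed z."""
    by_region = z >= cutoff
    by_p_value = (1 - Z.cdf(z)) <= alpha
    return by_region, by_p_value

# Observed values from -3 to 3 in steps of 0.01.
checks = [decisions(-3 + 0.01 * i) for i in range(601)]
print(all(a == b for a, b in checks))
```

This is only a spot check on one family of tests, not a substitute for the proof.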

**Remark.**

- From this, we can observe that the $p$-value can be used to report the test result on a more "continuous" scale, instead of just a single decision "accept $H_0$" or "reject $H_0$", in the sense that if the $p$-value is "much smaller" than the significance level $\alpha$, then we have "stronger" evidence for rejecting $H_0$ (stronger in the sense that even if the significance level is very low (a very strict requirement on type I error), $H_0$ can still be rejected).
- Also, reporting a $p$-value allows readers to choose an appropriate significance level $\alpha$ themselves, compare the $p$-value with $\alpha$, and therefore make their own decisions, which are not necessarily the same as the decisions made in the test report (since readers may choose a significance level different from that of the report).
- Here, let us also mention the concept of *statistical significance*. An observation has *statistical significance* if it is "unlikely" to happen (i.e., the observed value is quite "extreme") when the null hypothesis is true. To be more precise, in terms of the $p$-value, an observed value of a test statistic is *statistically significant* if the $p$-value is less than or equal to $\alpha$, and we say that the observed value is *statistically insignificant* otherwise. Thus, $\alpha$ can be interpreted as the benchmark of "significance" or "extremeness", and hence the name *significance level*.

**Example.**
Recall the setting of a previous example:
consider the normal distribution (with the parameter space for : ) , and the hypotheses
Let be a random sample from the normal distribution ,
and the corresponding realizations are .

At the significance level , we have determined that the rejection region is . Suppose it is observed that .

(a) Use the rejection region to determine whether we should reject .

(b) Use -value to determine whether we should reject .

*Solution*.

(a) Since , we have . Thus, we should not reject .

(b) Since the test is right-tailed, the -value is where . Thus, should not be rejected.

**Exercise.**

**Remark.**

- From this, we can notice that one can "manipulate" the decision by changing the significance level. In fact, if one sets the significance level to be 1, then $H_0$ must be rejected (since the $p$-value is a probability, which must be less than or equal to 1). But such a significance level is meaningless, since it means that the type I error probability can be as high as 1, so such a test has a large error and the result is not reliable anyway.
- On the other hand, if one sets the significance level to be 0, then $H_0$ is never rejected (unless the $p$-value is exactly zero, which is very unlikely, since a zero $p$-value means that the observation is the *most extreme one*, so the test statistic is (almost) never at least as extreme as the observation).

## Evaluating a hypothesis test

After discussing some basic concepts and terminologies, let us now study some ways to evaluate the goodness of a hypothesis test. As we have previously mentioned, we want the probabilities of making type I errors and type II errors to be small, but it is generally impossible to make both probabilities arbitrarily small. Hence, we have suggested controlling the type I error, using the size of a test, and the "best" test should be the one with the smallest probability of making a type II error, after controlling the type I error.

These ideas lead us to the following definitions.

**Definition.**
(Power of a test)
The *power* of a test is the probability of rejecting $H_0$ when $H_0$ is false.
That is, the power is $1-\beta$, if the probability of making a type II error is $\beta$.
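As a rough illustration of how power depends on the sample size, the following sketch considers a right-tailed z-test under assumptions that are not from the text: known standard deviation 1, null value 0, a specific alternative value 0.5, size 0.05, and a target power of 0.8. It searches for the smallest sample size reaching the target; the function name `power` and all numeric settings are hypothetical.

```python
from statistics import NormalDist
from math import sqrt

Z = NormalDist()

def power(n, mu1, mu0=0.0, sigma=1.0, alpha=0.05):
    """Power at mu1 of the size-alpha right-tailed z-test of mu = mu0."""
    z_crit = Z.inv_cdf(1 - alpha)
    return 1 - Z.cdf(z_crit - sqrt(n) * (mu1 - mu0) / sigma)

n = 1
while power(n, mu1=0.5) < 0.8:   # assumed alternative value and target power
    n += 1
print(n, round(power(n, mu1=0.5), 4))
```

The loop relies on the power being increasing in `n` for a fixed alternative above the null value, so the first `n` that reaches the target is the minimum.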

Using this definition, instead of saying "best" test (test with the smallest type II error probability), we can say "a test with the most power", or in other words, the "most powerful test".

**Definition.**
(Uniformly most powerful test)
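A test $\varphi$ with rejection region $R$ is a *uniformly most powerful (UMP) test* with size $\alpha$ for testing $H_0:\theta\in\Theta_0$ against $H_1:\theta\in\Theta_1$ if

- (size of $\alpha$) $\sup_{\theta\in\Theta_0}\pi_\varphi(\theta)=\alpha$, and
- (UMP) $\pi_\varphi(\theta)\ge\pi_\psi(\theta)$ for each $\theta\in\Theta_1$ and for each test $\psi$ with a different rejection region and $\sup_{\theta\in\Theta_0}\pi_\psi(\theta)\le\alpha$.

($\pi_\varphi$ and $\pi_\psi$ are the power functions of the tests $\varphi$ and $\psi$ respectively.)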

**Remark.**

- The rejection region $R$ is sometimes called a *best rejection region* of size $\alpha$.
- In other words, a test is UMP with size $\alpha$ if it has size $\alpha$ and its power is the largest among all other tests with size less than or equal to $\alpha$, for each $\theta\in\Theta_1$. The adverb "uniformly" emphasizes that this is true *for each* $\theta\in\Theta_1$.
  - Since the power is largest for each value of $\theta\in\Theta_1$, the rejection region of the UMP test does *not* depend on the choice of $\theta\in\Theta_1$; that is, regardless of the chosen value of $\theta\in\Theta_1$, the rejection region is the same. This is expected, since the rejection region is not supposed to change when the choice of $\theta\in\Theta_1$ is different. The (fixed) rejection region should always be the best, for each $\theta\in\Theta_1$.
- If $H_1$ is simple, we may simply call the UMP test the *most powerful (MP) test*.

## Constructing a hypothesis test

There are many ways of constructing a hypothesis test, but of course not all are good (i.e., "powerful"). In the following, we will provide some common approaches to constructing hypothesis tests. In particular, the following lemma is very useful for constructing a MP test with size $\alpha$.

### Neyman-Pearson lemma

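**Lemma.**
(Neyman-Pearson lemma)
Let $X_1,\dots,X_n$ be a random sample from a population with pdf or pmf $f(x;\theta)$ ($\theta$ may be a parameter vector, and the parameter space is $\Theta=\{\theta_0,\theta_1\}$). Let $L(\cdot)$ be the likelihood function. Then a test $\varphi$ with rejection region
$$R=\left\{(x_1,\dots,x_n):\frac{L(\theta_0;x_1,\dots,x_n)}{L(\theta_1;x_1,\dots,x_n)}\le k\right\}$$
and size $\alpha$ is the MP test with size $\alpha$ for testing
$$H_0:\theta=\theta_0\quad\text{vs.}\quad H_1:\theta=\theta_1,$$
where $k$ is a value determined by the size $\alpha$.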

**Proof.**
Let us first consider the case where the underlying distribution is continuous.
With the assumption that the size of is , the "size" requirement for being a UMP test is satisfied immediately.
So, it suffices to show that satisfies the "UMP" requirement for being a MP test.

Notice that in this case, "" is simply . So, for every test with rejection region and , we will proceed to show that .

Since we have as desired.

For the case where the underlying distribution is discrete, the proof is very similar (just replace the integrals with sums), and hence omitted.

**Remark.**

- Sometimes, we call $\frac{L(\theta_0;\mathbf{x})}{L(\theta_1;\mathbf{x})}$ the *likelihood ratio*.
- In fact, the MP test constructed by the Neyman-Pearson lemma is a variant of the *likelihood-ratio test*, which is more general in the sense that the likelihood-ratio test can also be constructed for *composite* null and alternative hypotheses, not only *simple* ones. But the likelihood-ratio test may not be (U)MP. We will discuss the likelihood-ratio test later.
- For a *discrete* distribution, it may be *impossible* to determine a $k$ for the rejection region for some $\alpha$. In this case, we say such $\alpha$ is *not attainable*.
- Intuitively, this test means that we should reject $H_0$ when the "likelihood" of $\theta_0$ ($L(\theta_0;\mathbf{x})$) is not as large as the "likelihood" of $\theta_1$ ($L(\theta_1;\mathbf{x})$), with respect to the observed samples. The meaning of "not as large as" depends on the size $\alpha$.
  - Intuitively, we expect that $k$ should be a positive value that is *strictly less than* 1, so that $\theta_0$ is "less likely" than $\theta_1$. This is usually, but not necessarily, the case. Particularly, when the size $\alpha$ is large, $k$ may be greater than 1.
- Typically, to determine the value of $k$, we need to transform "$\frac{L(\theta_0;\mathbf{x})}{L(\theta_1;\mathbf{x})}\le k$" into another *equivalent* inequality for which the probability under $H_0$ is easier to calculate.
  - It must be *equivalent*, so that its probability under $H_0$ is the same as the probability of "$\frac{L(\theta_0;\mathbf{x})}{L(\theta_1;\mathbf{x})}\le k$" under $H_0$. As a result, during the transformation, it is better to use "$\iff$" instead of just "$\implies$", or even writing different inequalities line by line.
- If $\theta$ is a vector, then $\theta_0$ and $\theta_1$ should also be vectors.
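The idea of transforming the likelihood-ratio inequality into an equivalent, simpler one can be illustrated numerically. The sketch below assumes a normal sample with known variance 1 and two hypothetical parameter values `mu0 = 0` and `mu1 = 1`; none of these come from the text. It checks on simulated samples that the likelihood ratio is a decreasing function of the sample mean, so that "ratio at most k" is equivalent to "sample mean at least some constant".

```python
from math import exp
import random

def likelihood_ratio(xs, mu0, mu1, sigma=1.0):
    """L(mu0; x) / L(mu1; x) for an i.i.d. N(mu, sigma^2) sample xs."""
    log_ratio = sum((x - mu1) ** 2 - (x - mu0) ** 2 for x in xs) / (2 * sigma ** 2)
    return exp(log_ratio)

random.seed(0)
mu0, mu1, n = 0.0, 1.0, 8            # assumed illustrative values
samples = [[random.gauss(0.5, 1.0) for _ in range(n)] for _ in range(200)]

# Pair each sample mean with its likelihood ratio and sort by the mean.
pairs = sorted((sum(xs) / n, likelihood_ratio(xs, mu0, mu1)) for xs in samples)
ratios = [r for _, r in pairs]

# When mu1 > mu0, a larger sample mean should never give a larger ratio.
print(all(a >= b for a, b in zip(ratios, ratios[1:])))
```

Working with the logarithm of the ratio, as in `likelihood_ratio`, avoids multiplying many small density values and makes the dependence on the sample mean easier to see.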

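Even though the hypotheses involved in the Neyman-Pearson lemma are simple, under some conditions we can use the lemma to construct a UMP test for testing a *composite* null hypothesis against a *composite* alternative hypothesis.
The details are as follows.
For testing $H_0:\theta\le\theta_0$ against $H_1:\theta>\theta_0$:

1. Find a MP test $\varphi$ with size $\alpha$ for testing $H_0:\theta=\theta_0$ against $H_1:\theta=\theta_1$ using the Neyman-Pearson lemma, where $\theta_1$ is an arbitrary value such that $\theta_1>\theta_0$.
2. *If the rejection region $R$ does not depend on $\theta_1$*, then the test $\varphi$ has the greatest power for *each* $\theta_1>\theta_0$. So, the test $\varphi$ is a UMP test with size $\alpha$ for testing $H_0:\theta=\theta_0$ against $H_1:\theta>\theta_0$.
3. *If we can further show that $\sup_{\theta\le\theta_0}\pi_\varphi(\theta)=\pi_\varphi(\theta_0)=\alpha$*, then the size of the test $\varphi$ is still $\alpha$, even if the null hypothesis is changed to $H_0:\theta\le\theta_0$. So, after changing $H_0$ to $\theta\le\theta_0$ and not changing $H_1$ (also adjusting the parameter space) for the test $\varphi$, the test still satisfies the "MP" requirement (because $H_1$ is unchanged, so the result in step 2 still applies), and it also satisfies the "size" requirement (because of changing $H_0$ in this way). Hence, the test $\varphi$ is a UMP test with size $\alpha$ for testing $H_0:\theta\le\theta_0$ against $H_1:\theta>\theta_0$.

For testing $H_0:\theta\ge\theta_0$ against $H_1:\theta<\theta_0$, the steps are similar. But in general, there is no UMP test for testing $H_0:\theta=\theta_0$ against $H_1:\theta\ne\theta_0$.

Of course, when the condition in step 3 holds but that in step 2 does not hold, the test $\varphi$ in step 1 is a UMP test with size $\alpha$ for testing $H_0:\theta\le\theta_0$ against $H_1:\theta=\theta_1$, where $\theta_1$ is a constant (which is larger than $\theta_0$, or else $\Theta_0$ and $\Theta_1$ are not disjoint). However, the hypotheses are generally not in this form.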

**Example.**
Let be a random sample from the normal distribution .

(a) Construct a MP test with size 0.05 for testing .

(b) Hence, show that the test is also a UMP test with size 0.05 for testing .

(c) Hence, show that the test is also a UMP test with size 0.05 for testing .

*Solution*.
(a) We can use the Neyman-Pearson lemma.
First, consider the likelihood ratio
Now, we have
where are some constants.
To find , consider the size 0.05:
()
Hence, we have .
Now, we can construct the rejection region:
and the test with the rejection region is a MP test with size 0.05 for testing .

(b) **Proof.**
Let be an arbitrary value such that .
Then, we can show that (see the following exercise)
where are some constants (may be different from the above constants).
Since here is the same as in (a), the rejection region constructed is also
Notice that does not depend on the value of . It follows that the test
is a UMP test with size 0.05 for testing .

(c) **Proof.**
It suffices to show that .
First let us consider the power function
where is the cdf of .
Now, since when *increases*, *decreases* and hence *decreases*,
it follows that the power function is a *strictly increasing* function of .
Hence,
Then, the result follows.

**Exercise.**
Show that
for every .

**Proof.**
First, consider the likelihood ratio
Then, we have
(The last equivalence follows since .)

**Remark.**

- This rejection region has appeared in a previous example.
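As a numerical companion to the example, the following minimal sketch assumes a specific setting that is not taken from the text: unit variance, null value 0, and sample size 25 (the names `MU0`, `N` and `ALPHA` are hypothetical). It computes the cutoff of a one-sided test that rejects for large sample means and evaluates its power function, to illustrate that the cutoff involves only the null value and the size, and that the power increases with the true mean.

```python
from math import sqrt
from statistics import NormalDist

# Hypothetical setting (assumptions, not values from the text):
# X_1, ..., X_n i.i.d. N(mu, 1), null value MU0, size ALPHA.
MU0 = 0.0
N = 25
ALPHA = 0.05

STD_NORMAL = NormalDist()  # standard normal distribution N(0, 1)


def critical_value(mu0: float, n: int, alpha: float) -> float:
    """Cutoff c such that P(sample mean >= c) = alpha when mu = mu0."""
    z = STD_NORMAL.inv_cdf(1 - alpha)
    return mu0 + z / sqrt(n)


def power(mu: float, c: float, n: int) -> float:
    """P(sample mean >= c) when the true mean is mu."""
    return 1 - STD_NORMAL.cdf(sqrt(n) * (c - mu))


c = critical_value(MU0, N, ALPHA)
print(f"reject H0 when the sample mean is at least {c:.4f}")
for mu in (-0.5, -0.2, 0.0, 0.2, 0.5):
    print(f"mu = {mu:+.1f}: power = {power(mu, c, N):.4f}")
```

Because the power is increasing in the true mean, its supremum over all means not exceeding `MU0` is attained at `MU0` itself, which is the condition needed to keep the size unchanged when the null hypothesis is enlarged.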

Now, let us consider another example where the underlying distribution is discrete.

**Example.**
Let be a discrete random variable. Its pmf is given by
(Notice that the sum of values in each row is 1. The parameter space is .)
Given a *single observation* , construct a MP test with size 0.1 for testing .

*Solution*.
We use the Neyman-Pearson lemma. First, we calculate the likelihood ratio for each value of :
For convenience, let us sort the likelihood ratios in ascending order (we put the undefined value at the last):
By the Neyman-Pearson lemma, the MP test with size 0.1 for testing is a test with size 0.1 and rejection region
So, it remains to determine $k$.
Since the size is 0.1, we have
Notice that
So, we can choose (approximately), so that the rejection region is

**Exercise.**
Calculate the probability of making type II error for the above test.

The probability is (Notice that although the test is MP, the type II error probability is still quite large in this case.)

Use the Neyman-Pearson lemma to construct another MP test with size 0.05 for testing .

It is impossible to construct the MP test with this size using the Neyman-Pearson lemma, since we are unable to choose a $k$ such that . We can choose (approximately) for the size to be 0.04, or choose for the size to be 0.06, but it is impossible to choose a $k$ for the size to be 0.05.
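The reason why only some sizes are attainable in the discrete case can be sketched in code. The pmf table below is hypothetical (it is an assumption made for illustration and is not the table of the example above), and it avoids zero probabilities so that every likelihood ratio is defined. The sketch lists every rejection region of the form "likelihood ratio at most $k$" together with its size.

```python
from fractions import Fraction as F

# Hypothetical pmf table for a single observation x in {1, ..., 5}
# (an assumption for illustration, not the table from the text).
PMF_H0 = {1: F(4, 100), 2: F(6, 100), 3: F(20, 100), 4: F(30, 100), 5: F(40, 100)}
PMF_H1 = {1: F(40, 100), 2: F(30, 100), 3: F(20, 100), 4: F(6, 100), 5: F(4, 100)}


def likelihood_ratio(x):
    """Likelihood under H0 divided by likelihood under H1 at the observation x."""
    return PMF_H0[x] / PMF_H1[x]


def achievable_tests():
    """Map each attainable size to the rejection region {x : ratio <= k}."""
    tests = {F(0): []}  # k below every ratio: the test never rejects
    for k in sorted({likelihood_ratio(x) for x in PMF_H0}):
        region = [x for x in PMF_H0 if likelihood_ratio(x) <= k]
        tests[sum(PMF_H0[x] for x in region)] = region
    return tests


tests = achievable_tests()
for size, region in tests.items():
    print(f"size {float(size):.2f}: reject when x is in {region}")

# Type II error probability of the size 0.1 test under the hypothetical H1.
region_01 = tests[F(1, 10)]
type_2_error = 1 - sum(PMF_H1[x] for x in region_01)
print("size 0.05 attainable:", F(5, 100) in tests)
print("type II error of the size 0.10 test:", float(type_2_error))
```

Exact fractions are used instead of floats so that the comparison of sizes is not affected by rounding. Since the size can only jump by the null probability of the next observation that enters the rejection region, sizes between two consecutive attainable values cannot be reached with a rejection region of this form.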

### Likelihood-ratio test

Previously, we have suggested using the Neyman-Pearson lemma to construct an MPT for testing a simple null hypothesis against a simple alternative hypothesis. However, when the hypotheses are composite, we may not be able to use the Neyman-Pearson lemma. So, in the following, we will give a general method for constructing tests for any hypotheses, not limited to simple hypotheses. But we should notice that the tests constructed are not necessarily UMPT.

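**Definition.**
(Likelihood ratio test)
Let $\lambda(\mathbf{x})=\dfrac{\sup_{\theta\in\Theta_0}\mathcal{L}(\theta;\mathbf{x})}{\sup_{\theta\in\Theta}\mathcal{L}(\theta;\mathbf{x})}$.
The *likelihood ratio test* with size $\alpha$ for testing $H_0:\theta\in\Theta_0\quad\text{vs.}\quad H_1:\theta\in\Theta_1$ ($\Theta_1=\Theta\setminus\Theta_0$, and $\theta$ may be a vector)
has the rejection region
$$R=\{\mathbf{x}:\lambda(\mathbf{x})\le k\}$$
where
$k$ is a constant determined by the size $\alpha$.

**Remark.**

- When the maxima of $\mathcal{L}(\theta;\mathbf{x})$ over $\Theta_0$ and over $\Theta$ exist, we have $\lambda(\mathbf{x})=\dfrac{\mathcal{L}(\hat\theta_0;\mathbf{x})}{\mathcal{L}(\hat\theta;\mathbf{x})}$ where $\hat\theta_0$ is the restricted maximum likelihood estimate of $\theta$ in $\Theta_0$, and $\hat\theta$ is the maximum likelihood estimate of $\theta$ in $\Theta$. We can assume the existence in the following.
- Since $\Theta_0\subseteq\Theta$, $\sup_{\theta\in\Theta_0}\mathcal{L}(\theta;\mathbf{x})\le\sup_{\theta\in\Theta}\mathcal{L}(\theta;\mathbf{x})$, so we have $\lambda(\mathbf{x})\le 1$.
- Intuitively, when $\lambda(\mathbf{x})$ is very small, i.e., $\sup_{\theta\in\Theta_0}\mathcal{L}(\theta;\mathbf{x})\ll\sup_{\theta\in\Theta}\mathcal{L}(\theta;\mathbf{x})$, it suggests that there are many $\theta$ falling in $\Theta_1$ that are more likely than all $\theta$ in $\Theta_0$. So, $H_0$ should intuitively be rejected.
- On the other hand, when $\lambda(\mathbf{x})$ is very close to 1, i.e., $\sup_{\theta\in\Theta_0}\mathcal{L}(\theta;\mathbf{x})\approx\sup_{\theta\in\Theta}\mathcal{L}(\theta;\mathbf{x})$, it suggests that there are very few $\theta$ falling in $\Theta_1$ that are more likely than all $\theta$ in $\Theta_0$. So, $H_0$ should intuitively *not* be rejected.
- When the null and alternative hypotheses are simple, the likelihood ratio test will be the same as the test suggested in the Neyman-Pearson lemma.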
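To make the likelihood ratio statistic concrete, here is a minimal sketch under an assumed model that is not taken from the text: a normal sample with unit variance, where the null hypothesis fixes the mean at a hypothetical value `MU0` and the alternative allows any other mean. In that model the unrestricted maximum likelihood estimate is the sample mean, and minus twice the logarithm of the ratio equals $n(\bar{x}-\mu_0)^2$, which is the square of a standard normal variable under the null hypothesis. The cutoff is computed from that fact; the sample is made up.

```python
from math import exp, log, pi
from statistics import NormalDist, fmean

MU0 = 0.0      # hypothetical null value of the mean
ALPHA = 0.05   # hypothetical size of the test


def log_likelihood(mu, xs):
    """Log-likelihood of the N(mu, 1) model at the sample xs."""
    return sum(-0.5 * log(2 * pi) - 0.5 * (x - mu) ** 2 for x in xs)


def likelihood_ratio(xs, mu0=MU0):
    """Maximum likelihood under the null divided by the overall maximum."""
    mle = fmean(xs)  # unrestricted MLE of the mean is the sample mean
    return exp(log_likelihood(mu0, xs) - log_likelihood(mle, xs))


def cutoff(alpha=ALPHA):
    """Value k with P(ratio <= k) = alpha under the null, for this model only."""
    z = NormalDist().inv_cdf(1 - alpha / 2)
    return exp(-0.5 * z * z)


sample = [0.9, -0.3, 1.4, 0.6, 1.1, 0.2, 0.8, 1.7]  # made-up data
lam = likelihood_ratio(sample)
k = cutoff()
print(f"lambda = {lam:.4f}, k = {k:.4f}, reject H0: {lam <= k}")
```

Working with log-likelihoods and exponentiating only the difference avoids underflow when the sample is large. For other models the distribution of the ratio under the null hypothesis has to be worked out separately, so `cutoff` is specific to this assumed setting.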

## Relationship between hypothesis testing and confidence intervals

We have mentioned that there are similarities between hypothesis testing and confidence intervals.
In this section, we will introduce a theorem suggesting how to construct a hypothesis test from a confidence interval (or in general, confidence *set*), and vice versa.

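**Theorem.**
For each $\theta_0\in\Theta$, let $R(\theta_0)$ be the rejection region of a size $\alpha$ test
for testing $H_0:\theta=\theta_0$.
Also, let $x_1,\dotsc,x_n$ be the corresponding realizations of a random sample $X_1,\dotsc,X_n$ from the underlying distribution.
Furthermore, let $\mathbf{x}=(x_1,\dotsc,x_n)$ and $\mathbf{X}=(X_1,\dotsc,X_n)$.

Define a set $C(\mathbf{x})=\{\theta_0\in\Theta:\mathbf{x}\notin R(\theta_0)\}$. Then, the random set $C(\mathbf{X})$ is a $1-\alpha$ confidence set of $\theta$.

Conversely, let a set $C(\mathbf{X})$ be a $1-\alpha$ confidence set of an unknown parameter $\theta$. For each $\theta_0\in\Theta$, define $R(\theta_0)=\{\mathbf{x}:\theta_0\notin C(\mathbf{x})\}$. Then, $R(\theta_0)$ is the rejection region of a size $\alpha$ test for testing $H_0:\theta=\theta_0$.

**Proof.**
For the first part, since $R(\theta_0)$ is the rejection region of the size $\alpha$ test, we have
$$\mathbb{P}_{\theta_0}(\mathbf{X}\in R(\theta_0))=\alpha.$$
Hence, the coverage probability for the random set $C(\mathbf{X})$ is
$$\mathbb{P}_{\theta_0}(\theta_0\in C(\mathbf{X}))=\mathbb{P}_{\theta_0}(\mathbf{X}\notin R(\theta_0))=1-\alpha,$$
which means that the random set $C(\mathbf{X})$ is a $1-\alpha$ confidence set of $\theta$.

For the second part, by assumption, we have $\mathbb{P}_{\theta_0}(\theta_0\in C(\mathbf{X}))=1-\alpha$. So, the size of the test with the rejection region $R(\theta_0)$ is
$$\mathbb{P}_{\theta_0}(\mathbf{X}\in R(\theta_0))=\mathbb{P}_{\theta_0}(\theta_0\notin C(\mathbf{X}))=1-(1-\alpha)=\alpha.$$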
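The first direction of the theorem, collecting all null values that a test does not reject, can be sketched numerically. The setting below is an assumption made for illustration and is not taken from the text: a normal sample with unit variance and a two-sided test of the mean based on the standardized sample mean. The sample is made up and the function names are hypothetical.

```python
from math import sqrt
from statistics import NormalDist, fmean

ALPHA = 0.05  # hypothetical size; the confidence coefficient is 1 - ALPHA


def rejects(xs, mu0, alpha=ALPHA):
    """True if the two-sided size-alpha test of 'mean = mu0' rejects for xs."""
    z = NormalDist().inv_cdf(1 - alpha / 2)
    return abs(sqrt(len(xs)) * (fmean(xs) - mu0)) >= z


def confidence_interval(xs, alpha=ALPHA):
    """Set of all mu0 that the test does not reject (an open interval here)."""
    z = NormalDist().inv_cdf(1 - alpha / 2)
    half_width = z / sqrt(len(xs))
    centre = fmean(xs)
    return centre - half_width, centre + half_width


sample = [0.9, -0.3, 1.4, 0.6, 1.1, 0.2, 0.8, 1.7]  # made-up data
low, high = confidence_interval(sample)
print(f"95% confidence interval: ({low:.3f}, {high:.3f})")
for mu0 in (0.0, 0.5, 1.0, 1.6):
    inside = low < mu0 < high
    print(f"mu0 = {mu0}: in interval = {inside}, rejected = {rejects(sample, mu0)}")
```

For every candidate value, membership in the interval and non-rejection by the test coincide, which is exactly the correspondence the theorem describes. In this assumed model the set of non-rejected values happens to be an interval; in general it is only a confidence set.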

**Remark.**

- The "" can take arbitrary value in . So, one can take as the unknown parameter of a distribution.
- Usually, the first result is more useful. But, the second result justifies our intuition that given a $1-\alpha$ confidence interval of an unknown parameter $\theta$, if a particular value $\theta_0$ lies in the confidence interval, then we are "$100(1-\alpha)\%$" confident that $\theta=\theta_0$. Now, from this theorem, we know that we can interpret "being $100(1-\alpha)\%$ confident that $\theta=\theta_0$" as "accepting $H_0:\theta=\theta_0$ at the significance level $\alpha$".

- For example, if a 95% confidence interval of is , and since , we intuitively say that we are 95% confident that . Now, we can say, more formally, we accept at the significance level .
- So, the relationship between the confidence coefficient and the significance level is now clear.
- In some places, given some observed values, when a confidence interval of includes zero, then has only *statistically insignificant* difference from . This saying is natural when we consider the relationship between the confidence coefficient and the significance level.

  - Since 0 is included in the confidence interval, we accept (and do *not* reject) at the significance level . This means that the observed values are *statistically insignificant*. Hence, we have such a saying.

- ↑ If is empty, then this hypothesis has no meaning at all, so we are not interested in this case.
- ↑ Thus, a natural measure of "goodness" of a hypothesis test is its "size of errors". We will discuss these later in this chapter.
- ↑ This is because it is meaningless to condition on "" or " (is true)", which are not random and hence have probability zero or one. When the probability is zero, the "conditional probability" is not defined. When the probability is one, conditioning on it is the same as not conditioning on it.
- ↑ Even when a test has a low probability of making a type I error for most parameter values in , if the test has a high probability of making a type I error for a certain parameter value in , then this intuitively means that the test does *not control* the type I error well, right?