Before we discuss hypothesis testing, we need to understand the way in which statisticians develop information about populations.
When we talk about populations, we don't just mean groups of people as in city and country populations. If we think about an observation as a numerical recording of information, as in a measurement, then a population consists of the totality of the measurements with which we are concerned.
This means it might be the population consisting of all heights of Australian males, or the population consisting of all lengths of a certain type of fish in a particular lake. It might also refer to an infinite population. If we toss a regular pair of dice indefinitely, recording the total that occurs each time, we obtain an infinite sequence of values, each ranging from $2$ through to $12$.
Hypothesis testing by sampling a population is an established technique of decision theory.
A key role of a statistician is to infer (make general statements about) certain characteristics of a population by sampling some of it.
The information in any sample of size $n$ drawn from a population is often summarized by a sample statistic, such as the sample mean $\overline{x}$, and these sample statistics become estimates of the corresponding population parameters, such as the population mean $\mu$.
The larger the sample, the more representative the sample statistic becomes.
The method by which a statistician infers certain characteristics of a population in this way is generally known as Decision Theory or the Theory of Statistical Inference.
The diagram shows the concept. A sample (in this diagram of size $6$) is drawn from a population of unknown size and population mean $\mu$. A sample statistic (in this case $\overline{x}$) is calculated.
Using the sample mean, the statistician either tries to infer something about $\mu$ or else brings its value into question.
The confidence the statistician has in $\overline{x}$ as a dependable estimate of $\mu$ depends on a number of factors, such as the size of the sample and how it was drawn.
In the technique of hypothesis testing, the statistician draws a sample specifically to check the validity of a claimed or hypothesized population mean $\mu$. Sometimes when a random sample is taken, there is a concerning disparity between $\overline{x}$ and $\mu$.
Hypothesis testing provides a decision rule that allows the investigator to accept or reject the hypothesized $\mu$ based on the evidence provided by the sample mean.
Consider these examples where hypothesis testing could be applied.
Referring to example $3$, the dispensing machine is meant to deliver $200$ ml of water into each cup. However, in reality there would be variation around this amount that would depend largely on the quality of the machine. We would in most instances expect a range of possible quantity values in our sample batch of $30$ cups, and from these we could calculate the sample mean $\overline{x}$.
Imagine repeating the experiment over and over again, so that we could calculate many sample means from the corresponding batches of $30$ cups. From all of these sample means we could construct a frequency histogram.
What would the histogram look like?
We would expect that most of the sample means would cluster close to the claimed average of $200$ ml, some higher and some lower. There would be some, however, that would be further away (where the dispenser has come up short or delivered too much), but the frequency of these would reduce the further the sample means were from $200$ ml.
Such a histogram is referred to as a sampling distribution.
A sampling distribution is nothing more than a distribution of $n$-size sample means.
One of these sample means ($\overline{x}=193$) has been highlighted as a small red box.
Note how the sampling distribution is approximately normal with the average at $200$ ml.
Think of this average as the average of all the sample means. It is denoted $\mu_{\overline{x}}$.
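To make the sampling distribution concrete, here is a minimal simulation sketch (assuming Python with NumPy, and borrowing the population variance of $750$ that is used later in this section) that repeatedly draws batches of $30$ cups and summarizes the resulting sample means.

```python
import numpy as np

rng = np.random.default_rng(seed=1)

# Hypothetical population of cup volumes: mean 200 ml, variance 750 (sd ≈ 27.4 ml)
mu, sigma, batch_size, n_batches = 200, 750 ** 0.5, 30, 10_000

# Draw many batches of 30 cups and record each batch's sample mean
sample_means = rng.normal(mu, sigma, size=(n_batches, batch_size)).mean(axis=1)

print(f"mean of sample means : {sample_means.mean():.2f} ml")   # clusters around 200
print(f"sd of sample means   : {sample_means.std():.2f} ml")    # ≈ sigma/sqrt(30) ≈ 5
print(f"proportion below 195 or above 205: {np.mean(np.abs(sample_means - mu) > 5):.2f}")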
Any hypothesis test starts by assuming that a claimed or hypothesized average is true; this assumption is called the null hypothesis. The null hypothesis is always the default hypothesis. It is the starting point for any hypothesis test.
From an historical perspective, scientists would apply certain treatments to things (people, animals and other objects) to see if those treatments had measurable effects. The default position would always be that a treatment had no effect unless sufficient evidence to the contrary was observed. This explains the origin of the term null hypothesis - it is the 'no effect' hypothesis.
Statisticians denote the null hypothesis $H_0$, where the subscript $0$ implies 'no effect'.
While our water dispensing example has little to do with treatments, the default position we have taken is that, on average, the machine should deliver $200$ ml per cup, and so we write:
$H_0$: $\mu=200$
Any null hypothesis, generally described as $\mu=\mu_0$, is the statement that is always under examination.
It is the statement that is being challenged by the sample evidence.
If evidence can be produced that sufficiently brings into question the validity of the null hypothesis, then we can decide to reject it in favor of an alternative hypothesis.
This alternative hypothesis is usually labeled $H_a$ and takes the form of either $\mu\ne\mu_0$, $\mu<\mu_0$ or $\mu>\mu_0$.
Just what constitutes sufficient evidence is the question we address now.
For example, would a sample mean of $193$ ml be sufficiently different from $\mu=200$ to reject $H_0$? While a sample mean like this is entirely possible, it might be just too improbable to accept - something may be wrong with the machine.
At its heart, any decision that is made must ultimately be an arbitrary one. With decisions like this there is always the chance of making a mistake.
One way forward is to arbitrarily decide on some interval around $\mu=200$ ml within which a sample mean may fall without rejecting the null hypothesis. This is like drawing lines in the sand.
For example, suppose we simply draw vertical lines at $195$ and $205$. These endpoints are sometimes referred to as critical sample mean values.
The lines are shown here:
Note that a sample mean of $193$ ml is beyond these limits.
Any value beyond the limits, either below $195$ ml or above $205$ ml in this case, is said to be in the critical region. A sample mean in a critical region provides a sufficient reason to reject $H_0$ in favor of $H_a$.
The action of drawing lines like this immediately creates the potential for error.
Take for example the possibility of the manufacturer finding an actual sample mean of $193$ ml. As stated above, with critical values set at $195$ ml and $205$ ml, $H_0$ would be rejected in favor of some $H_a$.
We now ask: what is the risk of doing this?
If this decision is a mistake, then we have committed a Type 1 error.
We have rejected $H_0$ in favor of $H_a$ when in fact $H_0$ was true.
The probability of making a mistake like this can actually be calculated. It is simply the proportion of area under the normal curve that lies in the critical regions. This is because the ratio of 'critical region' area to 'total' area represents the probability of finding a sample mean in the critical region.
It looks as though about $30\%$ of the sample means lie in the critical region defined by the two lines. This would indicate that the probability of making a Type 1 error is approximately $0.3$.
The Greek letter $\alpha$ is used to denote the Type 1 error probability, so that:
$\Pr(\text{Type 1 error})=\alpha$
Drawing the lines so that $\alpha=0.3$ is usually considered a little too severe. Usually tests are set at smaller $\alpha$ levels in order to make it quite difficult to reject $H_0$.
Suppose the lines had been drawn further apart so as to include $193$. The smaller critical region would have meant that $H_0$ could not have been rejected. But this alerts us to another type of error.
It might be that $H_0$ is actually false and we have made a mistake in retaining it. When $H_0$ is accepted as true when in fact it is false, a Type 2 error is said to have occurred.
Using $\beta$ as the probability, we write:
$\Pr(\text{Type 2 error})=\beta$
A useful analogy for understanding Type 1 and Type 2 errors is provided by the Universal Declaration of Human Rights. Article 11 of that declaration states in part that a person is 'presumed innocent until proven guilty'. The presumption of innocence can be thought of as the null hypothesis $H_0$. If sufficient evidence is gathered to convict the person (analogous to an extreme sample mean), then the person is found guilty. This verdict could be thought of as $H_a$.
We make a Type 1 error when a person who is found guilty is in fact innocent, and we make a Type 2 error when a person who is found not guilty (the default status) actually committed the crime. The probability of making either error should be kept to a minimum; however, perhaps it is better for a guilty person to roam free than to send an innocent person to the gallows. What do you think?
You can see why the above method might be too arbitrary to persist with. A researcher could conceivably draw lines at any point and argue against $H_0$ on the basis of any $\alpha$ size.
To avoid this, a universally accepted convention has been established. In a way, it is a standardized protocol for hypothesis testing across all scientific research.
There are three parts to this convention:
Rather than choosing arbitrary limits to define the rejection region, the accepted convention is to define $\alpha$ first.
For example, suppose we define $\alpha$ to be $0.05$. This means that we set the probability of making a Type 1 error at $0.05$. Put another way, we set the proportion of area in the critical region to $0.05$.
In the chapter titled Hypothesis testing 1 and 2 tailed tests you will learn that there are two basic types of hypothesis test - one like the water dispensing example, which results in two critical regions, and another which results in one critical region in either the left or right tail of the distribution. For now, we will stick to the 2-tailed scenario.
By setting $\alpha=0.05$ in a 2-tailed test, we would have $\frac{\alpha}{2}=0.025$ in each tail. We can then determine the positions at which the lines need to be drawn by using normal area tables.
Generally speaking, the most common $\alpha$ levels are $\alpha=0.10$, $\alpha=0.05$ and $\alpha=0.01$, which correspond to each tail containing $\frac{\alpha}{2}=0.05$, $\frac{\alpha}{2}=0.025$ and $\frac{\alpha}{2}=0.005$ respectively.
These agreed $\alpha$ probabilities are referred to as the test's level of significance.
The smaller the level of significance, the harder it becomes to reject the null hypothesis.
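As a quick check on these figures, here is a small sketch (assuming Python with SciPy, which is not part of the original text) that looks up the critical $z$ values corresponding to the conventional $\alpha$ levels for both 2-tailed and 1-tailed tests.

```python
from scipy.stats import norm

# Conventional significance levels and their critical z values
for alpha in (0.10, 0.05, 0.01):
    z_two = norm.ppf(1 - alpha / 2)   # 2-tailed: alpha/2 in each tail
    z_one = norm.ppf(1 - alpha)       # 1-tailed: all of alpha in one tail
    print(f"alpha = {alpha:.2f}:  2-tailed z = ±{z_two:.3f},  1-tailed z = {z_one:.3f}")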
An important result first investigated by the French mathematician Abraham de Moivre in 1733 is now known as the Central Limit Theorem.
The Central Limit Theorem states that if random samples of size $n$ are drawn from any population with a mean of $\mu$ and a variance of $\sigma^2$, then the sampling distribution of $\overline{x}$ will be approximately normally distributed, with the average of the sample means (often denoted $\mu_{\overline{x}}$) equal to $\mu$ and the variance of the sample means (often denoted $\sigma_{\overline{x}}^2$) equal to $\frac{\sigma^2}{n}$.
The standard deviation of the sample means, given by $\frac{\sigma}{\sqrt{n}}$, is often referred to as the Standard Error of the Mean, or SEM for short.
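The theorem is easiest to appreciate with a population that is clearly not normal. The sketch below (again assuming Python with NumPy) uses the totals of two dice as the population and shows that the sample means are nevertheless centered on $\mu$ with standard deviation close to $\sigma/\sqrt{n}$.

```python
import numpy as np

rng = np.random.default_rng(seed=7)

# Non-normal population: the total of two fair dice (values 2 to 12)
population = rng.integers(1, 7, size=(1_000_000, 2)).sum(axis=1)
mu, sigma = population.mean(), population.std()

n = 36                                                # sample size
sample_means = rng.choice(population, size=(20_000, n)).mean(axis=1)

print(f"population: mean {mu:.3f}, sd {sigma:.3f}")
print(f"sample means: mean {sample_means.mean():.3f}, sd {sample_means.std():.3f}")
print(f"predicted SEM sigma/sqrt(n) = {sigma / np.sqrt(n):.3f}")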
While not an entirely accurate description, we can at least get an intuitive feel for the Central Limit Theorem with the following thought experiment. Suppose we imagine sampling, say, $120$ cups of water and then writing down each of the $120$ results in a long list.
If we group consecutive pairs of results and calculate the $60$ averages, we would find that these averages would be scattered fairly widely around the population average of $200$ ml.
If we instead group into sets of three, and thus work out the $40$ averages, we would expect these new averages to exhibit less scatter than the paired averages, because groups of three would naturally show less variation than groups of two.
Continuing in this way, we could take groups of four, five and six consecutive numbers and determine $30$, $24$ and $20$ averages respectively. As the groups got larger, we would notice that the variation in the averages would reduce. While still centered around $\mu=200$, they would tend to become more homogeneous as the group size increases.
That is to say, the variation among the sample means reduces as the number in each sample increases. This is the magic of the Central Limit Theorem. As the sample size $n$ increases, the variance of the sample means becomes inversely proportional to $n$.
The Central Limit Theorem is the key to the entire technique.
In the water dispensing sampling distribution, we could only make a guess at the size of the critical region. But by knowing the variance of the population $\sigma^2$ and the sample size $n$, we can also obtain $\sigma_{\overline{x}}^2$, the variance of the sampling distribution. This means we can determine the exact proportion of area in the critical regions of the sampling distribution.
But we can go further and simplify all problems involving large samples by transforming each of the corresponding sampling distributions to just one standardized normal distribution.
Assuming the null hypothesis $\mu=\mu_0$, with the sampling distribution parameters $\mu_{\overline{x}}=\mu_0$ and $\sigma_{\overline{x}}=\frac{\sigma}{\sqrt{n}}$,
we can transform any sampling distribution to the standard normal distribution given by:
$z=\frac{\overline{x}-\mu_0}{\sigma/\sqrt{n}}$
Think of this formula as simply the difference between the sample and population mean divided by the standard error. In other words, the difference between the two means is being sliced into standard error units.
Because the total area under the standard normal curve is $1$, $\alpha$ now equals the area of the critical region! That is to say, the test's level of significance is exactly the size of the area in the critical region under the standard normal curve.
Keeping with our water dispensing machine with a hypothesized mean of $\mu=200$ ml, suppose that we also knew that the population variance was $750$. We could write:
$H_0$: $\mu=200$
$H_a$: $\mu\ne200$
Now the actual sample size in our dispensing example was $30$ cups, and so the variance of the sampling distribution becomes $\frac{750}{30}=25$. Therefore the standard deviation of the sampling distribution is $5$.
We can see therefore (referring to the second histogram above) that the manufacturer drew lines at $\overline{x}=200\pm5$.
Equivalently, this corresponds to critical values of the standardized normal distribution of $z=\pm1$.
We can show this algebraically.
From $z=\frac{\overline{x}-200}{\sqrt{750}/\sqrt{30}}=\frac{\overline{x}-200}{5}$:
For $\overline{x}=195$, $z=\frac{195-200}{5}=-1$
For $\overline{x}=205$, $z=\frac{205-200}{5}=1$
From the Empirical Rule, the area within $\pm1$ under the standardized normal curve is $0.68$.
This means that $\alpha$ (the area in the critical region) is $0.32$, or that the area in each tail is $0.16$.
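The Empirical Rule figure of $0.68$ is a rounded value; a one-line check (assuming Python with SciPy) gives the exact area outside $z=\pm1$.

```python
from scipy.stats import norm

alpha = 2 * norm.cdf(-1)                  # area in both tails beyond z = ±1
print(f"alpha ≈ {alpha:.4f}")             # ≈ 0.3173, i.e. roughly 0.32
print(f"each tail ≈ {alpha / 2:.4f}")     # ≈ 0.1587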
The diagram below shows this procedure graphically.
Our original estimate of $\alpha\approx30\%$ turned out to be fairly close. It confirms that drawing lines at $195$ ml and $205$ ml would be too restrictive in terms of validating $H_0$.
Sometimes we are interested in a proportion of some binary population rather than a population mean.
For example, we might be interested in the proportion of defective light globes coming off an assembly line, or the proportion of voters choosing a nominated candidate in a presidential election, or the proportion of dice rolls showing the number $6$ when a single die is rolled over and over again.
In terms of assembly lines and dice rolls we imagine an infinite population, so it is simply impossible to determine a definitive population proportion in either instance. In terms of a presidential election, even though voter numbers are finite, it would be impractical to ask every voter prior to voting what their intentions were.
So rather than thinking about an infinite population, we define a p-value as the theoretical probability of a particular event. The aim of the hypothesis test is to compare an $n$-size sample proportion to the p-value with a view to accepting or rejecting it.
To illustrate with a simple example, when flipping a coin we may be concerned about its fairness. The p-value of the coin is taken as $0.5$. We might flip the coin a number of times and find that the proportion of heads landing face up is only $0.3$. Given that sample proportion, the hypothesis test provides a decision rule that allows us to either reject or accept the p-value.
In situations like the one above, the standard practice is to denote the sample proportion as $\hat{p}=\frac{x}{n}$, where $\hat{p}$ might be, say, the proportion of defective light globes, the proportion of voters voting for candidate A, or the proportion of occurrences of $6$. We denote $n$ as the sample size and $x$ as the number of "successes" in the sample (e.g. the number of defective light globes, the number of voters voting for candidate A, or the number of occurrences of $6$).
We can consider the problem of testing the null hypothesis $H_0$ that the proportion of successes in the entire population is given by some p-value $p=p_0$. The alternative hypothesis $H_a$ might be $p\ne p_0$ for a 2-tailed test, or else either $p<p_0$ or $p>p_0$ for a 1-tailed test.
Given $H_0$, repeated $n$-size sampling will produce a sampling distribution of proportions with mean proportion given by $\mu_{\hat{p}}=p_0$ and variance given by $\sigma_{\hat{p}}^2=\frac{p_0(1-p_0)}{n}$.
Hence our test statistic for a sufficiently large sample becomes:
$z=\frac{\hat{p}-p_0}{\sqrt{\frac{p_0(1-p_0)}{n}}}$
Then, by replacing $\hat{p}$ with $\frac{x}{n}$ and multiplying numerator and denominator by $n$, we obtain the easier equivalent form for the test statistic:
$z=\frac{x-np_0}{\sqrt{np_0(1-p_0)}}$
In this form, substitution becomes easy. The numerator is simply the signed difference between the sample and expected number of "successes", and the denominator is the square root of the product of $n$ and the two complementary probabilities $p_0$ and $(1-p_0)$.
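A small sketch (assuming Python; the function names are ours, not from the text) confirming that the two forms of the test statistic are indeed equivalent, using illustrative numbers.

```python
import math

def z_from_proportion(p_hat: float, p0: float, n: int) -> float:
    """First form: built from the sample proportion p_hat."""
    return (p_hat - p0) / math.sqrt(p0 * (1 - p0) / n)

def z_from_count(x: int, p0: float, n: int) -> float:
    """Equivalent form: built from the raw count of successes x."""
    return (x - n * p0) / math.sqrt(n * p0 * (1 - p0))

# Illustrative numbers: 9 "successes" in a sample of 100, tested against p0 = 0.06
n, x, p0 = 100, 9, 0.06
print(z_from_proportion(x / n, p0, n))   # both lines print the same value (≈ 1.263)
print(z_from_count(x, p0, n))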
This hypothesis test uses the normal curve approximation to the binomial distribution. One condition for making this assumption is that the binomial distribution approximates the normal curve sufficiently well, and it is generally agreed that a good guide is to check that $np_0$ and $np_0(1-p_0)$ are both greater than $5$. So it's a good idea to check this first.
This diagram is a depiction of the parabolic region where the binomial distribution sufficiently approximates the normal distribution. The horizontal axis shows the sample size $n$ and the vertical axis shows the probability $p$. Note that under no circumstances is a sample size of less than $20$ considered adequate. Also note that the closer $p$ is to $0.5$, the smaller the sample size required.
We could, on the other hand, use these inequalities to design our test.
Suppose, for example, we were interested in checking the veracity of two casino dice. The game the dice are being used for involves summing the two numbers landing face up when both dice are thrown onto a board. We might decide to check the p-value $\frac{1}{36}$, the probability that the dice sum will be $12$.
Since we require $np(1-p)>5$, we must have $n\left(\frac{1}{36}\times\frac{35}{36}\right)>5$. This means that $n>\frac{5\times36^2}{35}\approx185.14$, and so perhaps we might decide to roll the two dice, say, $200$ times to collect our sample proportion.
The critical values are obtained from standardized normal tables (or the last line of the student $t$ tables) in exactly the same way as for the hypothesis tests conducted on means for large samples. The student $t$ tables provide the critical values for small-sized samples (between $n=20$ and $n=30$), provided of course the population is normally distributed.
The company Speedy manufactures remote-controlled cars, some of which come off the assembly line defective. Speedy believes that the proportion of defective cars coming off its assembly line is about $6\%$, but a recent random sample of $100$ cars contained $9$ defective ones. Would this sample information indicate that the general defect rate has increased? Use a significance level of $\alpha=0.05$.
Before we begin, note that:
$np_0=100\times0.06=6$
$np_0(1-p_0)=100\times0.06\times(1-0.06)=5.64$
and so we can proceed with the test.
This is clearly a 1-tailed test with:
$H_0$: $p=0.06$
$H_a$: $p>0.06$
Now $\hat{p}=\frac{x}{n}=\frac{9}{100}=0.09$.
Hence $n=100$ and $x=9$, so $np_0=100\times0.06=6$ and $z=\frac{9-6}{\sqrt{100\times0.06\times0.94}}=1.263$.
The critical $z$ value at the right tail is given by $z_c=1.645$, and so the sample proportion, while larger than $0.06$, is not high enough to reject the null hypothesis.
However, we are naturally led to ask what number of defectives would be required to reject $H_0$.
In this instance, because the defective rate is relatively low, it doesn't take too many more defectives in the sample before $H_0$ is rejected.
There is only one variable in our test statistic once $p_0$ has been fixed. So by rearranging the formula to make $x$ the subject, we find that $x=np_0+z_c\sqrt{np_0(1-p_0)}$.
With $p_0=0.06$, $n=100$ and $z_c=1.645$, this reduces to $x\approx9.91$, and so $x=10$ defectives will cause a rejection of $H_0$.
This is just one more defective than the sample actually contained. The sensitivity is primarily due to the relatively small size of the defective rate $p_0$. For example, if $p_0=0.5$, with $n$ and $z_c$ the same, we can calculate that approximately $59$ defectives are needed to cause a rejection of $H_0$. In other words, there can be $8$ more defective cars than the expected $50$ defective cars without rejecting $H_0$. Of course, there would be serious concerns for the company if half of their cars were coming off the assembly line defective!
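The rearranged formula is easy to automate. A minimal sketch (assuming Python; the helper name is ours) reproduces the two thresholds just discussed.

```python
import math

def rejection_threshold(p0: float, n: int, z_c: float) -> float:
    """Count of successes at the edge of the right-tail critical region."""
    return n * p0 + z_c * math.sqrt(n * p0 * (1 - p0))

# Speedy example: p0 = 0.06, n = 100, right-tail critical value z_c = 1.645
print(rejection_threshold(0.06, 100, 1.645))   # ≈ 9.91, so 10 defectives triggers rejection

# Same sample size and critical value, but p0 = 0.5
print(rejection_threshold(0.5, 100, 1.645))    # ≈ 58.2, so 59 defectives triggers rejection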
A "$6$" on the roll of a die should turn up $\frac{1}{6}$ of the time. However, in the real world, this needs to be tested for any particular die that rolls off the assembly line.
Suppose, in such a test, a particular die rolled $120$ times turns up a "$6$" $10$ times. Should we be concerned?
Note that we have $np_0=20$ and $np_0(1-p_0)=16.67$, both over $5$.
We could conduct a 1-tailed hypothesis test with:
$H_0$: $p=\frac{1}{6}$
$H_a$: $p<\frac{1}{6}$
With $n=120$ and $x=10$, our test value becomes $z=\frac{10-20}{\sqrt{120\times\frac{1}{6}\times\frac{5}{6}}}\approx-2.4495$.
At $\alpha=0.01$, we have $z_c=-2.33$, and so $H_0$ can be rejected. This means that $10$ occurrences of a "$6$" in $120$ rolls is enough evidence to suspect that the die is faulty. We could say that we are $99\%$ sure that the die is faulty and should not be used in a casino.
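The whole dice test can be reproduced in a few lines (assuming Python with SciPy).

```python
import math
from scipy.stats import norm

# Dice example: 10 sixes in 120 rolls, tested against p0 = 1/6 with a left-tail alternative
n, x, p0, alpha = 120, 10, 1 / 6, 0.01

z = (x - n * p0) / math.sqrt(n * p0 * (1 - p0))   # ≈ -2.4495
z_c = norm.ppf(alpha)                              # left-tail critical value ≈ -2.326

print(f"z = {z:.4f}, critical z = {z_c:.3f}")
print("reject H0" if z < z_c else "retain H0")     # H0 is rejected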
This chapter discusses the meaning of the terms 2-tailed and 1-tailed tests, and provides a more detailed explanation of the type 2 error.
There are two basic types of hypothesis tests available, and the choice of test very much depends on the nature of the investigation. We will look at the 2-tailed test first.
We take as our first example a certain water dispensing machine that, for some reason, is not accurately delivering $200$ ml of water to each plastic cup. The cup may overflow if too much water is dispensed. Patrons will complain if too little water is dispensed. Hence, we are interested in checking the validity of the claimed population mean $\mu=200$ ml.
We will reject the null hypothesis if our sample mean is significantly different from $200$ ml, whether that difference is positive or negative. Thus we need to reject $H_0$ if, for a given significance level, a sample mean lands in either tail of the sampling distribution.
Let's develop the water dispenser example a little further.
Suppose we were told that dispenser machines of this kind deliver on average $200$ ml per cup with a variance of $750$. We wish to know whether the stated mean is accurate, so we set up a 2-tailed test with the following parameters:
$H_0$: $\mu=200$
$H_a$: $\mu\ne200$
$\alpha=0.03$
Note that a 2-tailed test has the alternative hypothesis $H_a$ written with a $\ne$ sign. That is, at this point we are not really interested in whether the machine delivers more than $200$ ml or less than $200$ ml, but rather whether it delivers something significantly different from $200$ ml.
Note also that the significance level we chose is $\alpha=0.03$. This is somewhat atypical of most testing, but sometimes there is a rationale for a particular choice. By choosing $\alpha=0.03$, we are expressing $97\%$ confidence in $H_0$ should a sample mean fall within the critical limits.
We do a little math. From the Central Limit Theorem we assume that the sampling distribution has a mean of $200$ ml and a variance of $\frac{\sigma^2}{n}=\frac{750}{30}=25$, and thus a standard deviation (the standard error of the mean) of $5$.
Therefore our $z$ variable becomes $z=\frac{\overline{x}-200}{5}$, ready for the sample results.
We also need to find the critical $z$ values that cut off left and right tail areas of $0.015$. From tables these turn out to be $\pm2.17$. If the $z$ value of our sample mean falls above $2.17$ or below $-2.17$, then $H_0$ is in serious trouble.
We sample $30$ cups and find that the average amount of water dispensed into each cup is $213$ ml.
Substituting this sample mean into our $z$ formula gives the computed $z$ value as:
$z=\frac{\overline{x}-200}{5}=\frac{213-200}{5}=2.6$
The number $2.6$ is in the right-hand part of the critical region, and this means that $H_0$ needs to be rejected in favor of $H_a$. In other words, we are expressing a $97\%$ level of confidence that the claimed mean is not $200$ ml.
Note that, in the strictest sense, we are not claiming that the mean is larger than $200$, even though the computed $z$ value landed in the right tail. This is a 2-tailed test with a specific alternative hypothesis. What we would formally have to complete is a 1-tailed test, which we will do shortly.
Here is a graphical depiction of the 2-tailed test. Look at this carefully:
The diagram covers most of the procedure. We see the two tails and the critical region area in each tail of $0.015$, which automatically defines the critical values $z_c=\pm2.17$, obtainable from the standard normal tables.
We see the computed $z$ value of $2.6$ corresponding to the actual sample mean of $213$ ml on a sample mean axis. It is clearly in the rejection region, providing sufficient evidence to reject $H_0$. We can state, with $97\%$ certainty, that the claimed mean is not $200$ ml.
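Here is the complete 2-tailed test as a short sketch (assuming Python with SciPy), matching the numbers above.

```python
import math
from scipy.stats import norm

# Water dispenser: H0: mu = 200, population variance 750, n = 30, 2-tailed at alpha = 0.03
mu0, var, n, alpha, x_bar = 200, 750, 30, 0.03, 213

sem = math.sqrt(var / n)            # standard error of the mean = 5
z = (x_bar - mu0) / sem             # computed z value = 2.6
z_c = norm.ppf(1 - alpha / 2)       # critical value ≈ 2.17

print(f"SEM = {sem}, z = {z}, critical z = ±{z_c:.2f}")
print("reject H0" if abs(z) > z_c else "retain H0")   # 2.6 > 2.17, so H0 is rejected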
When sampling from a population, it is usually the case that statisticians do not have access to the population variance $\sigma^2$σ2. It is, in most practical situations, unknown.
Provided the sample is large enough (perhaps $30$ or more), a good estimate of the population variance is the sample variance $s^2$. The variable $z=\frac{\overline{x}-\mu}{s/\sqrt{n}}$ is still approximately distributed as a standardized normal variable. Note carefully that $s$, taken as an approximation to $\sigma$, is treated as a 'constant' in the formula, and the only real variable is the sample mean $\overline{x}$.
For small samples, however, the sample variance can fluctuate quite severely from sample to sample, and so using $s^2$ as a substitute for $\sigma^2$ in those circumstances changes the sampling distribution to something other than normal.
This means that, when the sample size is small, transforming the sampling distribution to a standardized $z$ distribution, as we do for large samples, no longer works. Prior to the year 1908, samples had to be of a reasonable size (around $30$) to be reliable indicators of population parameters.
In 1908 this all changed when a remarkable statistics paper was published by William S Gosset (1876-1937). In this paper Gosset derived the actual probability distribution for small-size samples. Although similar to the standardized $z$ distribution, this new distribution was a function of the sample size itself. That is to say, the distribution changed as the sample size changed.
The graph below shows three of these probability distributions together. The symbol $\nu$ is known as the number of degrees of freedom for the sample, where $\nu=n-1$. Recall that calculating a sample variance $s^2$ involves a division by $n-1$ rather than $n$, and that is where the idea came from to use $\nu$ instead of $n$ in the description of the distribution.
By the time the sample size becomes close to $30$ ($n=30$ corresponds to $\nu=29$), the distribution is virtually identical to the standard normal distribution. In fact, it can be shown that for $\nu>2$ the variance of the $t$ distribution is given by $\frac{\nu}{\nu-2}$, so that as $\nu\rightarrow\infty$ the variance approaches $1$.
At the time, Gosset was employed by the Guinness brewery in Ireland. Because of the possibility of trade secrets leaking from that establishment, research publications by staff were strictly forbidden. Gosset was able to get around this by secretly publishing under the name 'student', and as a consequence the $t$ distribution became known as the student $t$ distribution.
You can see that the distributions are not all wildly different from each other. Like the standard normal distribution, each of the new distributions exhibits a bell symmetry, but for low values of $n$ the peak is not as high as that of the normal distribution, and the tails tend to be a little thicker. In other words, the probability is more dispersed than for the normal distribution.
For a given significance level $\alpha$, the lower the sample size, the thicker the distribution tails become and the further the critical values must be from the center of the distribution. In other words, as $n$ decreases, the critical $t$ values need to move further out in order to keep the given Type 1 error probability constant.
What this means is that it becomes harder to reject the null hypothesis for a given $\alpha$ when the sample size is small. This is a good thing, because small-sized samples are less reliable than larger samples. Put another way, we are more likely to trust a large-sized sample statistic because the variability is expected to be smaller.
The computed values of the student $t$ distribution are found from the transformation formula $t=\frac{\overline{x}-\mu}{s/\sqrt{n}}$, where both $\overline{x}$ and $s$ are variables.
For example, suppose a sample of size $16$ is drawn from a normally distributed population with mean $\mu=50$ and found to have a sample mean and standard deviation of $46$ and $8$ respectively. Then the computed $t$ statistic becomes:
$t=\frac{\overline{x}-\mu}{s/\sqrt{n}}=\frac{46-50}{8/\sqrt{16}}=-2$
This value would be compared to the values provided in a student $t$ table, where critical $t$ values are listed for various significance levels $\alpha$.
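The computed $t$ statistic, and the table lookup it would be compared against, can be sketched as follows (assuming Python with SciPy; the 2-tailed comparison is just for illustration).

```python
import math
from scipy.stats import t

# Example above: n = 16, hypothesized mean 50, sample mean 46, sample sd 8
n, mu0, x_bar, s = 16, 50, 46, 8
nu = n - 1                                      # degrees of freedom

t_stat = (x_bar - mu0) / (s / math.sqrt(n))     # computed t statistic = -2.0

# Critical values for a 2-tailed test at a few common significance levels
for alpha in (0.10, 0.05, 0.01):
    t_c = t.ppf(1 - alpha / 2, df=nu)
    decision = "reject H0" if abs(t_stat) > t_c else "retain H0"
    print(f"alpha = {alpha:.2f}: critical t = ±{t_c:.3f} -> {decision}")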
When Gosset derived the formula for the $t$ distribution, he assumed that the populations from which the samples were drawn were themselves normally distributed. However, even if the population distribution is not normal but is still bell shaped, the sampling distribution of $t$ still approximates the $t$ distribution very closely.
A table of critical $t$ values always uses $\nu$, the number of degrees of freedom of the sample, with $\nu\ge1$. Here are the $t$ values (for various areas $\alpha$ in 1- and 2-tailed tests) for the first $20$ degrees of freedom (from $n=2$ to $n=21$). Note that only the positive $t$ values are given, so the value is simply negated when dealing with left-tailed tests.
table ref: http://growingknowing.com/GKStatsBookStudentTTable.html
So, for example, suppose a random sample of $16$ candy bars is weighed and found to have a mean and standard deviation of $57$ grams and $8$ grams. If this is to be tested against a null hypothesis of $\mu=60$ grams using a 1-tailed test at a significance level of $0.05$, then, with $\nu=15$, the critical $t$ statistic is given by $t=-1.7531$.
The computed $t$ value becomes $t=\frac{57-60}{8/\sqrt{16}}=-1.5$. This computed value is not in the critical region, and therefore the null hypothesis cannot be rejected.
Because successive $t$ distributions become asymptotic to the standard normal distribution as $n\rightarrow\infty$, the $t$ table also provides the $\alpha$ quantities for large samples, with the corresponding critical values written on the last line of the $t$ table as $\nu=\infty$.
We have listed these here:
$\alpha=0.10$: critical $t=1.282$
$\alpha=0.05$: critical $t=1.645$
$\alpha=0.025$: critical $t=1.960$
$\alpha=0.01$: critical $t=2.326$
$\alpha=0.005$: critical $t=2.576$
The following examples show how hypothesis testing on population means using small samples works in practice.
A certain brand of candy bar indicates on the plastic wrap packaging an average weight of $60$ grams.
Assuming the weights are normally distributed, $10$ randomly chosen bars are weighed and found to have an average weight of $58$ grams and a standard deviation of $3.5$ grams. Based on the sample results, and using a 1-tailed test on the hypothesized mean of $60$ grams at the $\alpha=0.05$ level, does our sample support the wrapper's claim?
The fact that we have chosen a 1-tailed test clearly means that we suspect that the bars are in fact lighter than $60$ grams. We first set up $H_0$ and $H_a$:
$H_0$: $\mu=60$
$H_a$: $\mu<60$
The critical $t$ statistic, provided to us from tables, is given by $t=-1.8331$.
We compute the sample $t$ value as:
$t=\frac{\overline{x}-\mu}{s/\sqrt{n}}=\frac{58-60}{3.5/\sqrt{10}}=-1.807$
The value is not inside the critical region, and therefore $H_0$ cannot be rejected at this level of significance.
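A short sketch of the candy bar test (assuming Python with SciPy) confirms how close the decision is.

```python
import math
from scipy.stats import t

# Candy bars: H0: mu = 60 g, Ha: mu < 60 g, n = 10, 1-tailed test at alpha = 0.05
n, mu0, x_bar, s, alpha = 10, 60, 58, 3.5, 0.05
nu = n - 1

t_stat = (x_bar - mu0) / (s / math.sqrt(n))    # ≈ -1.807
t_c = t.ppf(alpha, df=nu)                      # left-tail critical value ≈ -1.833

print(f"t = {t_stat:.3f}, critical t = {t_c:.4f}")
print("reject H0" if t_stat < t_c else "retain H0")   # -1.807 > -1.833, so H0 is retained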
In the chapter Hypothesis testing-1 sample 1 and 2 tailed tests, an example was discussed involving burning times for candles made by the Cute Candles company. We repeat this example using the $t$ distribution to see if a different result presents itself.
Recall that a sample of $25$ candles was tested to check the validity of the null hypothesis that the company's candles have a burn life of $50$ hours. The sample mean and standard deviation were $49.2$ hours and $2.5$ hours respectively.
Also recall that the estimated sampling distribution variance was found as $\frac{\sigma^2}{25}\approx\frac{s^2}{25}=\frac{6.25}{25}=0.25$, which implies that the standard deviation of the sampling distribution is $0.5$. This led to a computed $z$ value given by $z=\frac{\overline{x}-50}{0.5}=\frac{49.2-50}{0.5}=-1.6$.
This computed value was then compared to the critical $z$ value of $-1.645$, and as a result $H_0$ could not be rejected.
The sample size of $25$ was perhaps a little too small to be used in this way, now that we know about the student $t$ distribution. It will be interesting, then, to compare this large-sample $z$ test against the more appropriate student $t$ test.
With a sample mean and standard deviation given by $49.2$ and $2.5$, we have the computed $t$ value given by $t=\frac{49.2-50}{2.5/\sqrt{25}}=-1.6$. In other words, there is no change to the computed value.
However, the critical $t$ score does change.
From tables, with $24$ degrees of freedom, we find that the critical $t$ value is $-1.711$.
This is good news for the research section of the company.
There is therefore no change to the earlier advice that $H_0$ cannot be rejected, and the claim of $50$ hours burning life still stands.
There is a very important lesson with the candle problem.
For a fixed computed $t$ score, as the sample size decreases, the sampling distribution becomes less peaked and the tails begin to thicken. This in turn means that, for a fixed critical region of area $\alpha$, the critical value moves outward.
You can see this with our example. The computed $z$ value and $t$ value were identical, but the critical $t$ value pushed out, making it even harder to reject $H_0$.
But this makes complete sense!
The smaller the sample size, the more variability you are accepting, and so the less chance there is of rejecting $H_0$.
In fact, a sample size of just $2$ candles gives a critical value of $-6.314$! Such a scenario would mean that it would be virtually impossible to reject $H_0$.
Buzz Electrics claims that the lives of its Candy bar vending machine light globes are normally distributed, lasting on average $1600$ hours. A sample of $5$ light globes is tested and found to have a mean of $1550$ hours and a sample standard deviation of $80$ hours.
Assuming the life of a light globe is normally distributed, test the hypothesis that $\mu=1600$ against the alternative hypothesis $\mu\ne1600$ using both an $\alpha=0.01$ and an $\alpha=0.05$ significance level. Comment on your results.
A sample of $5$ like this is a very small sample. Referring to a table of critical $t$ values, we find that for a 2-tailed test with $\nu=4$ and $\alpha=0.01$ (so $0.005$ in each tail), the critical $t$ values are $\pm4.604$. Also, for $\alpha=0.05$ the critical $t$ values change to $\pm2.776$.
The sample mean of $1550$ hours means that the computed $t$ statistic becomes:
$t=\frac{1550-1600}{80/\sqrt{5}}=-1.3975$
This is low, but still nowhere near enough to reject the null hypothesis at either level of significance.
Looking at the problem differently, we might ask: what would the sample mean need to be in order for the null hypothesis to be rejected at, say, the $0.05$ level?
This again is easy to work out by solving the equation $\frac{\overline{x}-1600}{80/\sqrt{5}}=\pm2.776$.
Thus:
$\frac{\overline{x}-1600}{80/\sqrt{5}}=\pm2.776$
$\overline{x}-1600=\pm\frac{80\times2.776}{\sqrt{5}}$
$\therefore\ \overline{x}=1600\pm\frac{222.08}{\sqrt{5}}$
$=1600\pm99.3$
$=1699.3,\ 1500.7$
A discrepancy of nearly $100$ hours between the sample mean and $\mu$ is required to reject the null hypothesis at the $0.05$ level of significance. A larger sample size would have reduced the discrepancy needed.
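The same rearrangement can be sketched in code (assuming Python with SciPy), which also supplies the critical $t$ value directly.

```python
import math
from scipy.stats import t

# Buzz Electrics: H0: mu = 1600, n = 5, s = 80, 2-tailed test at alpha = 0.05
n, mu0, s, alpha = 5, 1600, 80, 0.05
nu = n - 1

t_c = t.ppf(1 - alpha / 2, df=nu)       # ≈ 2.776 for nu = 4
margin = t_c * s / math.sqrt(n)         # required distance of x_bar from mu0

print(f"critical t = ±{t_c:.3f}")
print(f"reject H0 if the sample mean falls below {mu0 - margin:.2f} or above {mu0 + margin:.2f}")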
The following examples show how hypothesis testing on population means using large samples works in practice.
A line-marking machine is supposed to paint marks that are on average $600$ mm long on road surfaces, with a variance of $200$ mm$^2$. The local council has become concerned that the machine is actually producing markings that appear to be of different lengths - some longer and some shorter. A random sample of $50$ marks is measured and found to have an average length of $604$ mm. Should the council be concerned?
We begin by recognizing that this is a 2-tailed test with:
$H_0$: $\mu=600$
$H_a$: $\mu\ne600$
By the Central Limit Theorem, under this hypothesis the sampling distribution has a mean of $600$ mm and a variance of $\sigma_{\overline{x}}^2=\frac{200}{50}=4$.
In other words, the variable $z=\frac{\overline{x}-\mu}{2}$ has a standard normal distribution.
A decision about the level of significance (the probability of a Type 1 error) needs to be made.
Suppose we define $\alpha=0.01$, so that $\frac{\alpha}{2}=0.005$.
From standardized normal tables, we find that the critical region is approximately defined by the inequalities $z<-2.58$ and $z>2.58$, as depicted here:
In other words, the total blue area of $\alpha=0.01$ in the two tails ($0.005$ in each tail) is the probability that we reject $H_0$ by mistake when a sample mean lands within those critical regions. We make a Type 1 error in doing so.
A quick computation shows that the sample mean $\overline{x}=604$ converts to a $z$ value given by $z=\frac{604-600}{2}=2$.
This means that even though the $z$ value is fairly high, it is not high enough to reject the null hypothesis that $\mu=600$. At this level of significance, we cannot reject $H_0$.
A check at the $0.05$ level of significance, however, shows the critical region changing to the inequalities $z<-1.96$ and $z>1.96$.
Since $2>1.96$, we now have sufficient evidence to reject $H_0$ in favor of $H_a$. In effect, we can state that at the $\alpha=0.05$ level, we believe there may well be a problem with the line-marking machine. Based on the evidence, it could be that on average the lines are longer than $600$ mm.
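A compact sketch of the line-marking test at both significance levels (assuming Python with SciPy) makes the switch in decision explicit.

```python
import math
from scipy.stats import norm

# Line marking: H0: mu = 600, population variance 200, n = 50, sample mean 604, 2-tailed test
mu0, var, n, x_bar = 600, 200, 50, 604

z = (x_bar - mu0) / math.sqrt(var / n)     # = 4 / 2 = 2.0

for alpha in (0.01, 0.05):
    z_c = norm.ppf(1 - alpha / 2)          # ≈ 2.58 and 1.96
    decision = "reject H0" if abs(z) > z_c else "retain H0"
    print(f"alpha = {alpha}: critical z = ±{z_c:.2f} -> {decision}")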
Why then can our opinion suddenly switch just on the basis of an arbitrarily chosen significance level? This is a great question to think about. Here's one way to understand it.
Imagine a statistician's answer:
A significance level of $\alpha=0.01$ essentially means that you want me to be $99\%$ sure that $H_0$ should be rejected. If you want me to be $99\%$ sure, then it forces me to set a very high benchmark for a sample mean's $z$ score.
However, if you only want me to be $95\%$ sure, then I'll show some leniency. You are allowing me a greater chance of being wrong, so I'll lower the benchmark accordingly.
A random sample of $100$ giant tortoises during the past year showed an average life span of $143.6$ years, with a standard deviation of $17.8$ years. A scientific journal recently claimed an average life span for the reptile of $140$ years. Does the sample data indicate that the real average is greater than $140$? Use a significance level of $\alpha=0.05$.
This example is interesting because the variance $\sigma^2$ of the population is not given. When this happens, we are forced to use the variance of the sample as an estimate of $\sigma^2$.
Provided the sample is large enough, perhaps $30$ or more, a good estimate of the population variance is the sample variance $s^2$. The variable $\frac{\overline{x}-\mu}{s/\sqrt{n}}$ is still approximately distributed as a standardized normal variable.
Note that the denominator $s/\sqrt{n}$ becomes an estimate of the standard error.
So we set up a 1-tailed test, because we are interested in whether the reptile lives longer than $140$ years:
$H_0$: $\mu=140$
$H_a$: $\mu>140$
Now we know that $\overline{x}=143.6$, and since $\sigma\approx s=17.8$ years and $n=100$, we have:
$z=\frac{143.6-140}{17.8/\sqrt{100}}=2.02$
Clearly this $z$ value is well into the critical region (the critical value is $z_c=1.645$), and so $H_0$ must be rejected. We have to accept, on the basis of the sample data, that the life span of the giant tortoise is more than $140$ years.
Note that the risk of rejecting $H_0$ incorrectly can be quantified. We could make any one of three statements:
This example shows how a statistician might in practice summarize the elements of a particular problem.
A research study measured the resting pulse rates of $40$ Olympic marathon runners and found a mean pulse rate of $65.45$ beats per minute with a standard deviation of $8.25$ beats per minute. Researchers want to know if this sample is different from the generally accepted pulse rate for marathon runners, believed to be $68$ beats per minute. Use a significance level of $\alpha=0.05$.
Think: What do we know?
Do: Calculate the $z$ value.
$z=\frac{65.45-68}{8.25/\sqrt{40}}=-1.95$
Reflect:
Strictly speaking, we cannot reject the null hypothesis that the standard rate for all marathon runners is $68$ beats per minute, since the 2-tailed critical values at $\alpha=0.05$ are $z_c=\pm1.96$. However, because the computed value is perilously close to the critical value, we might check for any rounding errors in our sample data, or even repeat the test with another sample. Consider that a 1-tailed test with an alternative hypothesis of $\mu<68$ would mean that $z_c=-1.645$, and so the null hypothesis would be immediately rejected.
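A final sketch (assuming Python with SciPy) contrasts the 2-tailed decision with the 1-tailed one for this borderline case.

```python
import math
from scipy.stats import norm

# Pulse rates: H0: mu = 68, sample mean 65.45, s = 8.25, n = 40, alpha = 0.05
mu0, x_bar, s, n, alpha = 68, 65.45, 8.25, 40, 0.05

z = (x_bar - mu0) / (s / math.sqrt(n))    # ≈ -1.95

z_two = norm.ppf(1 - alpha / 2)           # 2-tailed critical value ≈ 1.96
z_one = norm.ppf(alpha)                   # 1-tailed (left) critical value ≈ -1.645

print(f"z = {z:.3f}")
print("2-tailed:", "reject H0" if abs(z) > z_two else "retain H0")   # retained, just
print("1-tailed:", "reject H0" if z < z_one else "retain H0")        # rejected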
What this example serves to show is that any test result should not be accepted blindly. Our intuition and common sense should also play a pivotal role in decision theory.