Before we discuss hypothesis testing, we need to understand the way in which statisticians develop information about populations.
When we talk about populations, we don't just mean groups of people as in city and country populations. If we think about an observation as a numerical recording of information, as in a measurement, then a population consists of the totality of the measurements with which we are concerned.
This means it might be the population consisting of all heights of Australian males, or the population consisting of all lengths of a certain type of fish in a particular lake. It might also refer to an infinite population. If we toss a regular pair of dice indefinitely, recording the total that occurs each time, we obtain an infinite sequence of values, each ranging from $2$ through to $12$.
Hypothesis testing by sampling a population is an established technique of decision theory.
A key role of a statistician is to infer (make general statements about) certain characteristics of a population by sampling some of it.
The information in any sample of size $n$ drawn from a population is often summarized by a sample statistic, such as the sample mean $\overline{x}$, and these sample statistics become estimates of the corresponding population parameters, such as the population mean $\mu$.
The larger the sample, the more representative the sample statistic becomes.
The method by which a statistician infers certain characteristics of a population in this way is generally known as Decision Theory or the Theory of Statistical Inference.
The diagram shows the concept. A sample (in this diagram of size $6$) is drawn from a population of unknown size and population mean $\mu$. A sample statistic (in this case $\overline{x}$) is calculated.
Using the sample mean, the statistician either tries to infer something about $\mu$ or else brings its value into question.
The confidence the statistician has in $\overline{x}$ as a dependable estimate of $\mu$ depends on a number of factors, such as the size of the sample and how it was drawn.
In the technique of hypothesis testing, the statistician draws a sample specifically to check the validity of a claimed or hypothesized population mean $\mu$. Sometimes when a random sample is taken, there is a concerning disparity between $\overline{x}$ and $\mu$.
Hypothesis testing provides a decision rule that allows the investigator to accept or reject the hypothesized $\mu$ based on the evidence provided by the sample mean.
Consider these examples where hypothesis testing could be applied.
Referring to example $3$, the dispensing machine is meant to deliver $200$ ml of water into each cup. However, in reality there would be variation around this amount that would depend largely on the quality of the machine. We would in most instances expect a range of possible quantity values in our sample batch of $30$ cups, and from these we could calculate the sample mean $\overline{x}$.
Imagine repeating the experiment over and over again, so that we could calculate many sample means from the corresponding batches of $30$ cups. From all of these sample means we could construct a frequency histogram.
What would the histogram look like?
We would expect that most of the sample means would cluster close to the claimed average of $200$ ml, some higher and some lower. There would be some, however, that would be further away (where the dispenser has come up short or delivered too much), but the frequency of these would reduce the further the sample means were from $200$ ml.
Such a histogram is referred to as a sampling distribution.
A sampling distribution is nothing more than a distribution of $n$-size sample means.
One of these sample means ($\overline{x}=193$) has been highlighted as a small red box.
Note how the sampling distribution is approximately normal with the average at $200$ ml.
Think of this average as the average of all the sample means. It is denoted $\mu_{\overline{x}}$.
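To make the sampling distribution concrete, here is a minimal simulation sketch (assuming Python with NumPy, and borrowing the population variance of $750$ that is used later in this section) that repeatedly draws batches of $30$ cups and summarizes the resulting sample means.

```python
import numpy as np

rng = np.random.default_rng(seed=1)

# Hypothetical population of cup volumes: mean 200 ml, variance 750 (sd ≈ 27.4 ml)
mu, sigma, batch_size, n_batches = 200, 750 ** 0.5, 30, 10_000

# Draw many batches of 30 cups and record each batch's sample mean
sample_means = rng.normal(mu, sigma, size=(n_batches, batch_size)).mean(axis=1)

print(f"mean of sample means : {sample_means.mean():.2f} ml")   # clusters around 200
print(f"sd of sample means   : {sample_means.std():.2f} ml")    # ≈ sigma/sqrt(30) ≈ 5
print(f"proportion below 195 or above 205: {np.mean(np.abs(sample_means - mu) > 5):.2f}")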
Any hypothesis test starts by assuming that a claimed or hypothesized average is true; this assumption is called the null hypothesis. The null hypothesis is always the default hypothesis. It is the starting point for any hypothesis test.
From an historical perspective, scientists would apply certain treatments to things (people, animals and other objects) to see if those treatments had measurable effects. The default position would always be that a treatment had no effect unless sufficient evidence to the contrary was observed. This explains the origin of the term null hypothesis - it is the 'no effect' hypothesis.
Statisticians denote the null hypothesis $H_0$, where the subscript $0$ implies 'no effect'.
While our water dispensing example has little to do with treatments, the default position we have taken is that, on average, the machine should deliver $200$ ml per cup, and so we write:
$H_0$: $\mu=200$
Any null hypothesis, generally described as $\mu=\mu_0$, is the statement that is always under examination.
It is the statement that is being challenged by the sample evidence.
If evidence can be produced that sufficiently brings into question the validity of the null hypothesis, then we can decide to reject it in favor of an alternative hypothesis.
This alternative hypothesis is usually labeled $H_a$ and takes the form of either $\mu\ne\mu_0$, $\mu<\mu_0$ or $\mu>\mu_0$.
Just what constitutes sufficient evidence is the question we address now.
For example, would a sample mean of $193$ ml be sufficiently different from $\mu=200$ to reject $H_0$? While a sample mean like this is entirely possible, it might be just too improbable to accept - something may be wrong with the machine.
At its heart, any decision that is made must ultimately be an arbitrary one. With decisions like this there is always the chance of making a mistake.
One way forward is to arbitrarily decide on some interval around $\mu=200$ ml within which a sample mean may fall without rejecting the null hypothesis. This is like drawing lines in the sand.
For example, suppose we simply draw vertical lines at $195$ and $205$. These endpoints are sometimes referred to as critical sample mean values.
The lines are shown here:
Note that a sample mean of $193$ ml is beyond these limits.
Any value beyond the limits, either below $195$ ml or above $205$ ml in this case, is said to be in the critical region. A sample mean in a critical region provides a sufficient reason to reject $H_0$ in favor of $H_a$.
The action of drawing lines like this immediately creates the potential for error.
Take for example the possibility of the manufacturer finding an actual sample mean of $193$ ml. As stated above, with critical values set at $195$ ml and $205$ ml, $H_0$ would be rejected in favor of some $H_a$.
We now ask: what is the risk of doing this?
If this decision is a mistake, then we have committed a Type 1 error.
We have rejected $H_0$ in favor of $H_a$ when in fact $H_0$ was true.
The probability of making a mistake like this can actually be calculated. It is simply the proportion of area under the normal curve that lies in the critical regions. This is because the ratio of 'critical region' area to 'total' area represents the probability of finding a sample mean in the critical region.
It looks as though about $30\%$ of the sample means lie in the critical region defined by the two lines. This would indicate that the probability of making a Type 1 error is approximately $0.3$.
The Greek letter $\alpha$ is used to denote the Type 1 error probability, so that:
$\Pr(\text{Type 1 error})=\alpha$
Drawing the lines so that $\alpha=0.3$ is usually considered a little too severe. Usually tests are set at smaller $\alpha$ levels in order to make it quite difficult to reject $H_0$.
Suppose the lines had been drawn further apart so as to include $193$. The smaller critical region would have meant that $H_0$ could not have been rejected. But this alerts us to another type of error.
It might be that $H_0$ is actually false and we have made a mistake in retaining it. When $H_0$ is accepted as true when in fact it is false, a Type 2 error is said to have occurred.
Using $\beta$ as the probability, we write:
$\Pr(\text{Type 2 error})=\beta$
A useful analogy for understanding Type 1 and Type 2 errors is provided by the Universal Declaration of Human Rights. Article 11 of that declaration states in part that a person is 'presumed innocent until proven guilty'. The presumption of innocence can be thought of as the null hypothesis $H_0$. If sufficient evidence is gathered to convict the person (analogous to an extreme sample mean), then the person is found guilty. This verdict could be thought of as $H_a$.
We make a Type 1 error when a person who is found guilty is in fact innocent, and we make a Type 2 error when a person who is found not guilty (the default status) actually committed the crime. The probability of making either error should be kept to a minimum; however, perhaps it is better for a guilty person to roam free than to send an innocent person to the gallows. What do you think?
You can see why the above method might be too arbitrary to persist with. A researcher could conceivably draw lines at any point and argue against $H_0$ on the basis of any $\alpha$ size.
To avoid this, a universally accepted convention has been established. In a way, it is a standardized protocol for hypothesis testing across all scientific research.
There are three parts to this convention:
Rather than choosing arbitrary limits to define the rejection region, the accepted convention is to define $\alpha$ first.
For example, suppose we define $\alpha$ to be $0.05$. This means that we set the probability of making a Type 1 error at $0.05$. Put another way, we set the proportion of area in the critical region to $0.05$.
In the chapter titled Hypothesis testing 1 and 2 tailed tests you will learn that there are two basic types of hypothesis test - one like the water dispensing example, which results in two critical regions, and another which results in one critical region in either the left or right tail of the distribution. For now, we will stick to the 2-tailed scenario.
By setting $\alpha=0.05$ in a 2-tailed test, we would have $\frac{\alpha}{2}=0.025$ in each tail. We can then determine the positions at which the lines need to be drawn by using normal area tables.
Generally speaking, the most common $\alpha$ levels are $\alpha=0.10$, $\alpha=0.05$ and $\alpha=0.01$, which correspond to each tail containing $\frac{\alpha}{2}=0.05$, $\frac{\alpha}{2}=0.025$ and $\frac{\alpha}{2}=0.005$ respectively.
These agreed $\alpha$ probabilities are referred to as the test's level of significance.
The smaller the level of significance, the harder it becomes to reject the null hypothesis.
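As a quick check on these figures, here is a small sketch (assuming Python with SciPy, which is not part of the original text) that looks up the critical $z$ values corresponding to the conventional $\alpha$ levels for both 2-tailed and 1-tailed tests.

```python
from scipy.stats import norm

# Conventional significance levels and their critical z values
for alpha in (0.10, 0.05, 0.01):
    z_two = norm.ppf(1 - alpha / 2)   # 2-tailed: alpha/2 in each tail
    z_one = norm.ppf(1 - alpha)       # 1-tailed: all of alpha in one tail
    print(f"alpha = {alpha:.2f}:  2-tailed z = ±{z_two:.3f},  1-tailed z = {z_one:.3f}")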
An important result first investigated by the French mathematician Abraham de Moivre in 1733 is now known as the Central Limit Theorem.
The Central Limit Theorem states that if random samples of size $n$ are drawn from any population with a mean of $\mu$ and a variance of $\sigma^2$, then the sampling distribution of $\overline{x}$ will be approximately normally distributed, with the average of the sample means (often denoted $\mu_{\overline{x}}$) equal to $\mu$ and the variance of the sample means (often denoted $\sigma_{\overline{x}}^2$) equal to $\frac{\sigma^2}{n}$.
The standard deviation of the sample means, given by $\frac{\sigma}{\sqrt{n}}$, is often referred to as the Standard Error of the Mean, or SEM for short.
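The theorem is easiest to appreciate with a population that is clearly not normal. The sketch below (again assuming Python with NumPy) uses the totals of two dice as the population and shows that the sample means are nevertheless centered on $\mu$ with standard deviation close to $\sigma/\sqrt{n}$.

```python
import numpy as np

rng = np.random.default_rng(seed=7)

# Non-normal population: the total of two fair dice (values 2 to 12)
population = rng.integers(1, 7, size=(1_000_000, 2)).sum(axis=1)
mu, sigma = population.mean(), population.std()

n = 36                                                # sample size
sample_means = rng.choice(population, size=(20_000, n)).mean(axis=1)

print(f"population: mean {mu:.3f}, sd {sigma:.3f}")
print(f"sample means: mean {sample_means.mean():.3f}, sd {sample_means.std():.3f}")
print(f"predicted SEM sigma/sqrt(n) = {sigma / np.sqrt(n):.3f}")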
While not an entirely accurate description, we can at least get an intuitive feel for the Central Limit Theorem with the following thought experiment. Suppose we imagine sampling, say, $120$ cups of water and then writing down each of the $120$ results in a long list.
If we group consecutive pairs of results and calculate the $60$ averages, we would find that these averages would be scattered fairly widely around the population average of $200$ ml.
If we instead group into sets of three, and thus work out the $40$ averages, we would expect these new averages to exhibit less scatter than the paired averages, because groups of three would naturally show less variation than groups of two.
Continuing in this way, we could take groups of four, five and six consecutive numbers and determine $30$, $24$ and $20$ averages respectively. As the groups got larger, we would notice that the variation in the averages would reduce. While still centered around $\mu=200$, they would tend to become more homogeneous as the group size increases.
That is to say, the variation among the sample means reduces as the number in each sample increases. This is the magic of the Central Limit Theorem. As the sample size $n$ increases, the variance of the sample means becomes inversely proportional to $n$.
The Central Limit Theorem is the key to the entire technique.
In the water dispensing sampling distribution, we could only make a guess at the size of the critical region. But by knowing the variance of the population $\sigma^2$ and the sample size $n$, we can also obtain $\sigma_{\overline{x}}^2$, the variance of the sampling distribution. This means we can determine the exact proportion of area in the critical regions of the sampling distribution.
But we can go further and simplify all problems involving large samples by transforming each of the corresponding sampling distributions to just one standardized normal distribution.
Assuming the null hypothesis $\mu=\mu_0$, with the sampling distribution parameters $\mu_{\overline{x}}=\mu_0$ and $\sigma_{\overline{x}}=\frac{\sigma}{\sqrt{n}}$,
we can transform any sampling distribution to the standard normal distribution given by:
$z=\frac{\overline{x}-\mu_0}{\sigma/\sqrt{n}}$
Think of this formula as simply the difference between the sample and population mean divided by the standard error. In other words, the difference between the two means is being sliced into standard error units.
Because the total area under the standard normal curve is $1$, $\alpha$ now equals the area of the critical region! That is to say, the test's level of significance is exactly the size of the area in the critical region under the standard normal curve.
Keeping with our water dispensing machine with a hypothesized mean of $\mu=200$ ml, suppose that we also knew that the population variance was $750$. We could write:
$H_0$: $\mu=200$
$H_a$: $\mu\ne200$
Now the actual sample size in our dispensing example was $30$ cups, and so the variance of the sampling distribution becomes $\frac{750}{30}=25$. Therefore the standard deviation of the sampling distribution is $5$.
We can see therefore (referring to the second histogram above) that the manufacturer drew lines at $\overline{x}=200\pm5$.
Equivalently, this corresponds to critical values of the standardized normal distribution of $z=\pm1$.
We can show this algebraically.
From $z=\frac{\overline{x}-200}{\sqrt{750}/\sqrt{30}}=\frac{\overline{x}-200}{5}$:
For $\overline{x}=195$, $z=\frac{195-200}{5}=-1$
For $\overline{x}=205$, $z=\frac{205-200}{5}=1$
From the Empirical Rule, the area within $\pm1$ under the standardized normal curve is $0.68$.
This means that $\alpha$ (the area in the critical region) is $0.32$, or that the area in each tail is $0.16$.
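The Empirical Rule figure of $0.68$ is a rounded value; a one-line check (assuming Python with SciPy) gives the exact area outside $z=\pm1$.

```python
from scipy.stats import norm

alpha = 2 * norm.cdf(-1)                  # area in both tails beyond z = ±1
print(f"alpha ≈ {alpha:.4f}")             # ≈ 0.3173, i.e. roughly 0.32
print(f"each tail ≈ {alpha / 2:.4f}")     # ≈ 0.1587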
The diagram below shows this procedure graphically.
Our original estimate of $\alpha\approx30\%$ turned out to be fairly close. It confirms that drawing lines at $195$ ml and $205$ ml would be too restrictive in terms of validating $H_0$.
Sometimes we are interested in a proportion of some binary population rather than a population mean.
For example, we might be interested in the proportion of defective light globes coming off an assembly line, or the proportion of voters choosing a nominated candidate in a presidential election, or the proportion of dice rolls showing the number $6$ when a single die is rolled over and over again.
In terms of assembly lines and dice rolls we imagine an infinite population, so it is simply impossible to determine a definitive population proportion in either instance. In terms of a presidential election, even though voter numbers are finite, it would be impractical to ask every voter prior to voting what their intentions were.
So rather than thinking about an infinite population, we define a p-value as the theoretical probability of a particular event. The aim of the hypothesis test is to compare an $n$-size sample proportion to the p-value with a view to accepting or rejecting it.
To illustrate with a simple example, when flipping a coin we may be concerned about its fairness. The p-value of the coin is taken as $0.5$. We might flip the coin a number of times and find that the proportion of heads landing face up is only $0.3$. Given that sample proportion, the hypothesis test provides a decision rule that allows us to either reject or accept the p-value.
In situations like the one above, the standard practice is to denote the sample proportion as $\hat{p}=\frac{x}{n}$, where $\hat{p}$ might be, say, the proportion of defective light globes, the proportion of voters voting for candidate A, or the proportion of occurrences of $6$. We denote $n$ as the sample size and $x$ as the number of "successes" in the sample (e.g. the number of defective light globes, the number of voters voting for candidate A, or the number of occurrences of $6$).
We can consider the problem of testing the null hypothesis $H_0$ that the proportion of successes in the entire population is given by some p-value $p=p_0$. The alternative hypothesis $H_a$ might be $p\ne p_0$ for a 2-tailed test, or else either $p<p_0$ or $p>p_0$ for a 1-tailed test.
Given $H_0$, repeated $n$-size sampling will produce a sampling distribution of proportions with mean proportion given by $\mu_{\hat{p}}=p_0$ and variance given by $\sigma_{\hat{p}}^2=\frac{p_0(1-p_0)}{n}$.
Hence our test statistic for a sufficiently large sample becomes:
$z=\frac{\hat{p}-p_0}{\sqrt{\frac{p_0(1-p_0)}{n}}}$
Then, by replacing $\hat{p}$ with $\frac{x}{n}$ and multiplying numerator and denominator by $n$, we obtain the easier equivalent form for the test statistic:
$z=\frac{x-np_0}{\sqrt{np_0(1-p_0)}}$
In this form, substitution becomes easy. The numerator is simply the signed difference between the sample and expected number of "successes", and the denominator is the square root of the product of $n$ and the two complementary probabilities $p_0$ and $(1-p_0)$.
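A small sketch (assuming Python; the function names are ours, not from the text) confirming that the two forms of the test statistic are indeed equivalent, using illustrative numbers.

```python
import math

def z_from_proportion(p_hat: float, p0: float, n: int) -> float:
    """First form: built from the sample proportion p_hat."""
    return (p_hat - p0) / math.sqrt(p0 * (1 - p0) / n)

def z_from_count(x: int, p0: float, n: int) -> float:
    """Equivalent form: built from the raw count of successes x."""
    return (x - n * p0) / math.sqrt(n * p0 * (1 - p0))

# Illustrative numbers: 9 "successes" in a sample of 100, tested against p0 = 0.06
n, x, p0 = 100, 9, 0.06
print(z_from_proportion(x / n, p0, n))   # both lines print the same value (≈ 1.263)
print(z_from_count(x, p0, n))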
This hypothesis test uses the normal curve approximation to the binomial distribution. One condition for making this assumption is that the binomial distribution approximates the normal curve sufficiently well, and it is generally agreed that a good guide is to check that $np_0$ and $np_0(1-p_0)$ are both greater than $5$. So it's a good idea to check this first.
This diagram is a depiction of the parabolic region where the binomial distribution sufficiently approximates the normal distribution. The horizontal axis shows the sample size $n$ and the vertical axis shows the probability $p$. Note that under no circumstances is a sample size of less than $20$ considered adequate. Also note that the closer $p$ is to $0.5$, the smaller the sample size required.
We could, on the other hand, use these inequalities to design our test.
Suppose, for example, we were interested in checking the veracity of two casino dice. The game the dice are being used for involves summing the two numbers landing face up when both dice are thrown onto a board. We might decide to check the p-value $\frac{1}{36}$, the probability that the dice sum will be $12$.
Since we require $np(1-p)>5$, we must have $n\left(\frac{1}{36}\times\frac{35}{36}\right)>5$. This means that $n>\frac{5\times36^2}{35}\approx185.14$, and so perhaps we might decide to roll the two dice, say, $200$ times to collect our sample proportion.
The critical values are obtained from standardized normal tables (or the last line of the student $t$ tables) in exactly the same way as for the hypothesis tests conducted on means for large samples. The student $t$ tables provide the critical values for small-sized samples (between $n=20$ and $n=30$), provided of course the population is normally distributed.
The company Speedy manufactures remote-controlled cars, some of which come off the assembly line defective. Speedy believes that the proportion of defective cars coming off its assembly line is about $6\%$, but a recent random sample of $100$ cars contained $9$ defective ones. Would this sample information indicate that the general defect rate has increased? Use a significance level of $\alpha=0.05$.
Before we begin, note that:
$np_0=100\times0.06=6$
$np_0(1-p_0)=100\times0.06\times(1-0.06)=5.64$
and so we can proceed with the test.
This is clearly a 1-tailed test with:
$H_0$: $p=0.06$
$H_a$: $p>0.06$
Now $\hat{p}=\frac{x}{n}=\frac{9}{100}=0.09$.
Hence $n=100$ and $x=9$, so $np_0=100\times0.06=6$ and $z=\frac{9-6}{\sqrt{100\times0.06\times0.94}}=1.263$.
The critical $z$ value at the right tail is given by $z_c=1.645$, and so the sample proportion, while larger than $0.06$, is not high enough to reject the null hypothesis.
However, we are naturally led to ask what number of defectives would be required to reject $H_0$.
In this instance, because the defective rate is relatively low, it doesn't take too many more defectives in the sample before $H_0$ is rejected.
There is only one variable in our test statistic once $p_0$ has been fixed. So by rearranging the formula to make $x$ the subject, we find that $x=np_0+z_c\sqrt{np_0(1-p_0)}$.
With $p_0=0.06$, $n=100$ and $z_c=1.645$, this reduces to $x\approx9.91$, and so $x=10$ defectives will cause a rejection of $H_0$.
This is just one more defective than the sample actually contained. The sensitivity is primarily due to the relatively small size of the defective rate $p_0$. For example, if $p_0=0.5$, with $n$ and $z_c$ the same, we can calculate that approximately $59$ defectives are needed to cause a rejection of $H_0$. In other words, there can be $8$ more defective cars than the expected $50$ defective cars without rejecting $H_0$. Of course, there would be serious concerns for the company if half of their cars were coming off the assembly line defective!
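The rearranged formula is easy to automate. A minimal sketch (assuming Python; the helper name is ours) reproduces the two thresholds just discussed.

```python
import math

def rejection_threshold(p0: float, n: int, z_c: float) -> float:
    """Count of successes at the edge of the right-tail critical region."""
    return n * p0 + z_c * math.sqrt(n * p0 * (1 - p0))

# Speedy example: p0 = 0.06, n = 100, right-tail critical value z_c = 1.645
print(rejection_threshold(0.06, 100, 1.645))   # ≈ 9.91, so 10 defectives triggers rejection

# Same sample size and critical value, but p0 = 0.5
print(rejection_threshold(0.5, 100, 1.645))    # ≈ 58.2, so 59 defectives triggers rejection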
A "$6$" on the roll of a die should turn up $\frac{1}{6}$ of the time. However, in the real world, this needs to be tested for any particular die that rolls off the assembly line.
Suppose, in such a test, a particular die rolled $120$ times turns up a "$6$" $10$ times. Should we be concerned?
Note that we have $np_0=20$ and $np_0(1-p_0)=16.67$, both over $5$.
We could conduct a 1-tailed hypothesis test with:
$H_0$: $p=\frac{1}{6}$
$H_a$: $p<\frac{1}{6}$
With $n=120$ and $x=10$, our test value becomes $z=\frac{10-20}{\sqrt{120\times\frac{1}{6}\times\frac{5}{6}}}\approx-2.4495$.
At $\alpha=0.01$, we have $z_c=-2.33$, and so $H_0$ can be rejected. This means that $10$ occurrences of a "$6$" in $120$ rolls is enough evidence to suspect that the die is faulty. We could say that we are $99\%$ sure that the die is faulty and should not be used in a casino.
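The whole dice test can be reproduced in a few lines (assuming Python with SciPy).

```python
import math
from scipy.stats import norm

# Dice example: 10 sixes in 120 rolls, tested against p0 = 1/6 with a left-tail alternative
n, x, p0, alpha = 120, 10, 1 / 6, 0.01

z = (x - n * p0) / math.sqrt(n * p0 * (1 - p0))   # ≈ -2.4495
z_c = norm.ppf(alpha)                              # left-tail critical value ≈ -2.326

print(f"z = {z:.4f}, critical z = {z_c:.3f}")
print("reject H0" if z < z_c else "retain H0")     # H0 is rejected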
This chapter discusses the meaning of the terms 2-tailed and 1-tailed tests, and provides a more detailed explanation of the type 2 error.
There are two basic types of hypothesis tests available, and the choice of test very much depends on the nature of the investigation. We will look at the 2-tailed test first.
We take as our first example a certain water dispensing machine that, for some reason, is not accurately delivering $200$ ml of water to each plastic cup. The cup may overflow if too much water is dispensed. Patrons will complain if too little water is dispensed. Hence, we are interested in checking the validity of the claimed population mean $\mu=200$ ml.
We will reject the null hypothesis if our sample mean is significantly different from $200$ ml, whether that difference is positive or negative. Thus we need to reject $H_0$ if, for a given significance level, a sample mean lands in either tail of the sampling distribution.
Let's develop the water dispenser example a little further.
Suppose we were told that dispenser machines of this kind deliver on average $200$ ml per cup with a variance of $750$. We wish to know whether the stated mean is accurate, so we set up a 2-tailed test with the following parameters:
$H_0$: $\mu=200$
$H_a$: $\mu\ne200$
$\alpha=0.03$
Note that a 2-tailed test has the alternative hypothesis $H_a$ written with a $\ne$ sign. That is, at this point we are not really interested in whether the machine delivers more than $200$ ml or less than $200$ ml, but rather whether it delivers something significantly different from $200$ ml.
Note also that the significance level we chose is $\alpha=0.03$. This is somewhat atypical of most testing, but sometimes there is a rationale for a particular choice. By choosing $\alpha=0.03$, we are expressing $97\%$ confidence in $H_0$ should a sample mean fall within the critical limits.
We do a little math. From the Central Limit Theorem we assume that the sampling distribution has a mean of $200$ ml and a variance of $\frac{\sigma^2}{n}=\frac{750}{30}=25$, and thus a standard deviation (the standard error of the mean) of $5$.
Therefore our $z$ variable becomes $z=\frac{\overline{x}-200}{5}$, ready for the sample results.
We also need to find the critical $z$ values that cut off left and right tail areas of $0.015$. From tables these turn out to be $\pm2.17$. If the $z$ value of our sample mean falls above $2.17$ or below $-2.17$, then $H_0$ is in serious trouble.
We sample $30$ cups and find that the average amount of water dispensed into each cup is $213$ ml.
Substituting this sample mean into our $z$ formula gives the computed $z$ value as:
$z=\frac{\overline{x}-200}{5}=\frac{213-200}{5}=2.6$
The number $2.6$ is in the right-hand part of the critical region, and this means that $H_0$ needs to be rejected in favor of $H_a$. In other words, we are expressing a $97\%$ level of confidence that the claimed mean is not $200$ ml.
Note that, in the strictest sense, we are not claiming that the mean is larger than $200$, even though the computed $z$ value landed in the right tail. This is a 2-tailed test with a specific alternative hypothesis. What we would formally have to complete is a 1-tailed test, which we will do shortly.
Here is a graphical depiction of the 2-tailed test. Look at this carefully:
The diagram covers most of the procedure. We see the two tails and the critical region area in each tail of $0.015$, which automatically defines the critical values $z_c=\pm2.17$, obtainable from the standard normal tables.
We see the computed $z$ value of $2.6$ corresponding to the actual sample mean of $213$ ml on a sample mean axis. It is clearly in the rejection region, providing sufficient evidence to reject $H_0$. We can state, with $97\%$ certainty, that the claimed mean is not $200$ ml.
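Here is the complete 2-tailed test as a short sketch (assuming Python with SciPy), matching the numbers above.

```python
import math
from scipy.stats import norm

# Water dispenser: H0: mu = 200, population variance 750, n = 30, 2-tailed at alpha = 0.03
mu0, var, n, alpha, x_bar = 200, 750, 30, 0.03, 213

sem = math.sqrt(var / n)            # standard error of the mean = 5
z = (x_bar - mu0) / sem             # computed z value = 2.6
z_c = norm.ppf(1 - alpha / 2)       # critical value ≈ 2.17

print(f"SEM = {sem}, z = {z}, critical z = ±{z_c:.2f}")
print("reject H0" if abs(z) > z_c else "retain H0")   # 2.6 > 2.17, so H0 is rejected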
When sampling from a population, it is usually the case that statisticians do not have access to the population variance $\sigma^2$σ2. It is, in most practical situations, unknown.
Provided the sample is large enough (perhaps $30$ or more), a good estimate of the population variance is the sample variance $s^2$. The variable $z=\frac{\overline{x}-\mu}{s/\sqrt{n}}$ is still approximately distributed as a standardized normal variable. Note carefully that $s$, taken as an approximation to $\sigma$, is treated as a 'constant' in the formula, and the only real variable is the sample mean $\overline{x}$.
For small samples, however, the sample variance can fluctuate quite severely from sample to sample, and so using $s^2$ as a substitute for $\sigma^2$ in those circumstances changes the sampling distribution to something other than normal.
This means that, when the sample size is small, transforming the sampling distribution to a standardized $z$ distribution, as we do for large samples, no longer works. Prior to the year 1908, samples had to be of a reasonable size (around $30$) to be reliable indicators of population parameters.
In 1908 this all changed when a remarkable statistics paper was published by William S Gosset (1876-1937). In this paper Gosset derived the actual probability distribution for small-size samples. Although similar to the standardized $z$ distribution, this new distribution was a function of the sample size itself. That is to say, the distribution changed as the sample size changed.
The graph below shows three of these probability distributions together. The symbol $\nu$ is known as the number of degrees of freedom for the sample, where $\nu=n-1$. Recall that calculating a sample variance $s^2$ involves a division by $n-1$ rather than $n$, and that is where the idea came from to use $\nu$ instead of $n$ in the description of the distribution.
By the time the sample size becomes close to $30$ ($n=30$ corresponds to $\nu=29$), the distribution is virtually identical to the standard normal distribution. In fact, it can be shown that for $\nu>2$ the variance of the $t$ distribution is given by $\frac{\nu}{\nu-2}$, so that as $\nu\rightarrow\infty$ the variance approaches $1$.
At the time, Gosset was employed by the Guinness brewery in Ireland. Because of the possibility of trade secrets leaking from that establishment, research publications by staff were strictly forbidden. Gosset was able to get around this by secretly publishing under the name 'student', and as a consequence the $t$ distribution became known as the student $t$ distribution.
You can see that the distributions are not all wildly different from each other. Like the standard normal distribution, each of the new distributions exhibits a bell symmetry, but for low values of $n$ the peak is not as high as that of the normal distribution, and the tails tend to be a little thicker. In other words, the probability is more dispersed than for the normal distribution.
For a given significance level $\alpha$, the lower the sample size, the thicker the distribution tails become and the further the critical values must be from the center of the distribution. In other words, as $n$ decreases, the critical $t$ values need to move further out in order to keep the given Type 1 error probability constant.
What this means is that it becomes harder to reject the null hypothesis for a given $\alpha$ when the sample size is small. This is a good thing, because small-sized samples are less reliable than larger samples. Put another way, we are more likely to trust a large-sized sample statistic because the variability is expected to be smaller.
The computed values of the student $t$ distribution are found from the transformation formula $t=\frac{\overline{x}-\mu}{s/\sqrt{n}}$, where both $\overline{x}$ and $s$ are variables.
For example, suppose a sample of size $16$ is drawn from a normally distributed population with mean $\mu=50$ and found to have a sample mean and standard deviation of $46$ and $8$ respectively. Then the computed $t$ statistic becomes:
$t=\frac{\overline{x}-\mu}{s/\sqrt{n}}=\frac{46-50}{8/\sqrt{16}}=-2$
This value would be compared to the values provided in a student $t$ table, where critical $t$ values are listed for various significance levels $\alpha$.
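The computed $t$ statistic, and the table lookup it would be compared against, can be sketched as follows (assuming Python with SciPy; the 2-tailed comparison is just for illustration).

```python
import math
from scipy.stats import t

# Example above: n = 16, hypothesized mean 50, sample mean 46, sample sd 8
n, mu0, x_bar, s = 16, 50, 46, 8
nu = n - 1                                      # degrees of freedom

t_stat = (x_bar - mu0) / (s / math.sqrt(n))     # computed t statistic = -2.0

# Critical values for a 2-tailed test at a few common significance levels
for alpha in (0.10, 0.05, 0.01):
    t_c = t.ppf(1 - alpha / 2, df=nu)
    decision = "reject H0" if abs(t_stat) > t_c else "retain H0"
    print(f"alpha = {alpha:.2f}: critical t = ±{t_c:.3f} -> {decision}")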
When Gosset derived the formula for the $t$ distribution, he assumed that the populations from which the samples were drawn were themselves normally distributed. However, even if the population distribution is not normal but is still bell shaped, the sampling distribution of $t$ still approximates the $t$ distribution very closely.
A table of critical $t$ values always uses $\nu$, the number of degrees of freedom of the sample, with $\nu\ge1$. Here are the $t$ values (for various areas $\alpha$ in 1- and 2-tailed tests) for the first $20$ degrees of freedom (from $n=2$ to $n=21$). Note that only the positive $t$ values are given, so the value is simply negated when dealing with left-tailed tests.
table ref: http://growingknowing.com/GKStatsBookStudentTTable.html
So, for example, suppose a random sample of $16$ candy bars is weighed and found to have a mean and standard deviation of $57$ grams and $8$ grams. If this is to be tested against a null hypothesis of $\mu=60$ grams using a 1-tailed test at a significance level of $0.05$, then, with $\nu=15$, the critical $t$ statistic is given by $t=-1.7531$.
The computed $t$ value becomes $t=\frac{57-60}{8/\sqrt{16}}=-1.5$. This computed value is not in the critical region, and therefore the null hypothesis cannot be rejected.
Because successive $t$ distributions become asymptotic to the standard normal distribution as $n\rightarrow\infty$, the $t$ table also provides the $\alpha$ quantities for large samples, with the corresponding critical values written on the last line of the $t$ table as $\nu=\infty$.
We have listed these here:
$\alpha=0.10$: critical $t=1.282$
$\alpha=0.05$: critical $t=1.645$
$\alpha=0.025$: critical $t=1.960$
$\alpha=0.01$: critical $t=2.326$
$\alpha=0.005$: critical $t=2.576$
The following examples show how hypothesis testing on population means using small samples works in practice.
A certain brand of candy bar indicates on the plastic wrap packaging an average weight of $60$ grams.
Assuming the weights are normally distributed, $10$ randomly chosen bars are weighed and found to have an average weight of $58$ grams and a standard deviation of $3.5$ grams. Based on the sample results, and using a 1-tailed test on the hypothesized mean of $60$ grams at the $\alpha=0.05$ level, does our sample support the wrapper's claim?
The fact that we have chosen a 1-tailed test clearly means that we suspect that the bars are in fact lighter than $60$ grams. We first set up $H_0$ and $H_a$:
$H_0$: $\mu=60$
$H_a$: $\mu<60$
The critical $t$ statistic, provided to us from tables, is given by $t=-1.8331$.
We compute the sample $t$ value as:
$t=\frac{\overline{x}-\mu}{s/\sqrt{n}}=\frac{58-60}{3.5/\sqrt{10}}=-1.807$
The value is not inside the critical region, and therefore $H_0$ cannot be rejected at this level of significance.
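A short sketch of the candy bar test (assuming Python with SciPy) confirms how close the decision is.

```python
import math
from scipy.stats import t

# Candy bars: H0: mu = 60 g, Ha: mu < 60 g, n = 10, 1-tailed test at alpha = 0.05
n, mu0, x_bar, s, alpha = 10, 60, 58, 3.5, 0.05
nu = n - 1

t_stat = (x_bar - mu0) / (s / math.sqrt(n))    # ≈ -1.807
t_c = t.ppf(alpha, df=nu)                      # left-tail critical value ≈ -1.833

print(f"t = {t_stat:.3f}, critical t = {t_c:.4f}")
print("reject H0" if t_stat < t_c else "retain H0")   # -1.807 > -1.833, so H0 is retained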
In the chapter Hypothesis testing-1 sample 1 and 2 tailed tests, an example was discussed involving burning times for candles made by the Cute Candles company. We repeat this example using the $t$ distribution to see if a different result presents itself.
Recall that a sample of $25$ candles was tested to check the validity of the null hypothesis that the company's candles have a burn life of $50$ hours. The sample mean and standard deviation were $49.2$ hours and $2.5$ hours respectively.
Also recall that the estimated sampling distribution variance was found as $\frac{\sigma^2}{25}\approx\frac{s^2}{25}=\frac{6.25}{25}=0.25$, which implies that the standard deviation of the sampling distribution is $0.5$. This led to a computed $z$ value given by $z=\frac{\overline{x}-50}{0.5}=\frac{49.2-50}{0.5}=-1.6$.
This computed value was then compared to the critical $z$ value of $-1.645$, and as a result $H_0$ could not be rejected.
The sample size of $25$ was perhaps a little too small to be used in this way, now that we know about the student $t$ distribution. It will be interesting, then, to compare this large-sample $z$ test against the more appropriate student $t$ test.
With a sample mean and standard deviation given by $49.2$ and $2.5$, we have the computed $t$ value given by $t=\frac{49.2-50}{2.5/\sqrt{25}}=-1.6$. In other words, there is no change to the computed value.
However, the critical $t$ score does change.
From tables, with $24$ degrees of freedom, we find that the critical $t$ value is $-1.711$.
This is good news for the research section of the company.
There is therefore no change to the earlier advice that $H_0$ cannot be rejected, and the claim of $50$ hours burning life still stands.
There is a very important lesson with the candle problem.
For a fixed computed $t$ score, as the sample size decreases, the sampling distribution becomes less peaked and the tails begin to thicken. This in turn means that, for a fixed critical region of area $\alpha$, the critical value moves outward.
You can see this with our example. The computed $z$ value and $t$ value were identical, but the critical $t$ value pushed out, making it even harder to reject $H_0$.
But this makes complete sense!
The smaller the sample size, the more variability you are accepting, and so the less chance there is of rejecting $H_0$.
In fact, a sample size of just $2$ candles gives a critical value of $-6.314$! Such a scenario would mean that it would be virtually impossible to reject $H_0$.
Buzz Electrics claims that the lives of its Candy bar vending machine light globes are normally distributed, lasting on average $1600$ hours. A sample of $5$ light globes is tested and found to have a mean of $1550$ hours and a sample standard deviation of $80$ hours.
Assuming the life of a light globe is normally distributed, test the hypothesis that $\mu=1600$ against the alternative hypothesis $\mu\ne1600$ using both an $\alpha=0.01$ and an $\alpha=0.05$ significance level. Comment on your results.
A sample of $5$ like this is a very small sample. Referring to a table of critical $t$ values, we find that for a 2-tailed test with $\nu=4$ and $\alpha=0.01$ (so $0.005$ in each tail), the critical $t$ values are $\pm4.604$. Also, for $\alpha=0.05$ the critical $t$ values change to $\pm2.776$.
The sample mean of $1550$ hours means that the computed $t$ statistic becomes:
$t=\frac{1550-1600}{80/\sqrt{5}}=-1.3975$
This is low, but still nowhere near enough to reject the null hypothesis at either level of significance.
Looking at the problem differently, we might ask: what would the sample mean need to be in order for the null hypothesis to be rejected at, say, the $0.05$ level?
This again is easy to work out by solving the equation $\frac{\overline{x}-1600}{80/\sqrt{5}}=\pm2.776$.
Thus:
$\frac{\overline{x}-1600}{80/\sqrt{5}}=\pm2.776$
$\overline{x}-1600=\pm\frac{80\times2.776}{\sqrt{5}}$
$\therefore\ \overline{x}=1600\pm\frac{222.08}{\sqrt{5}}$
$=1600\pm99.3$
$=1699.3,\ 1500.7$
A discrepancy of nearly $100$ hours between the sample mean and $\mu$ is required to reject the null hypothesis at the $0.05$ level of significance. A larger sample size would have reduced the discrepancy needed.
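The same rearrangement can be sketched in code (assuming Python with SciPy), which also supplies the critical $t$ value directly.

```python
import math
from scipy.stats import t

# Buzz Electrics: H0: mu = 1600, n = 5, s = 80, 2-tailed test at alpha = 0.05
n, mu0, s, alpha = 5, 1600, 80, 0.05
nu = n - 1

t_c = t.ppf(1 - alpha / 2, df=nu)       # ≈ 2.776 for nu = 4
margin = t_c * s / math.sqrt(n)         # required distance of x_bar from mu0

print(f"critical t = ±{t_c:.3f}")
print(f"reject H0 if the sample mean falls below {mu0 - margin:.2f} or above {mu0 + margin:.2f}")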
The following examples show how hypothesis testing on population means using large samples works in practice.
A line-marking machine is supposed to paint marks that are on average $600$ mm long on road surfaces, with a variance of $200$ mm$^2$. The local council has become concerned that the machine is actually producing markings that appear to be of different lengths - some longer and some shorter. A random sample of $50$ marks is measured and found to have an average length of $604$ mm. Should the council be concerned?
We begin by recognizing that this is a 2-tailed test with:
$H_0$: $\mu=600$
$H_a$: $\mu\ne600$
By the Central Limit Theorem, under this hypothesis the sampling distribution has a mean of $600$ mm and a variance of $\sigma_{\overline{x}}^2=\frac{200}{50}=4$.
In other words, the variable $z=\frac{\overline{x}-\mu}{2}$ has a standard normal distribution.
A decision about the level of significance (the probability of a Type 1 error) needs to be made.
Suppose we define $\alpha=0.01$, so that $\frac{\alpha}{2}=0.005$.
From standardized normal tables, we find that the critical region is approximately defined by the inequalities $z<-2.58$ and $z>2.58$, as depicted here:
In other words, the total blue area of $\alpha=0.01$ in the two tails ($0.005$ in each tail) is the probability that we reject $H_0$ by mistake when a sample mean lands within those critical regions. We make a Type 1 error in doing so.
A quick computation shows that the sample mean $\overline{x}=604$ converts to a $z$ value given by $z=\frac{604-600}{2}=2$.
This means that even though the $z$ value is fairly high, it is not high enough to reject the null hypothesis that $\mu=600$. At this level of significance, we cannot reject $H_0$.
A check at the $0.05$ level of significance, however, shows the critical region changing to the inequalities $z<-1.96$ and $z>1.96$.
Since $2>1.96$, we now have sufficient evidence to reject $H_0$ in favor of $H_a$. In effect, we can state that at the $\alpha=0.05$ level, we believe there may well be a problem with the line-marking machine. Based on the evidence, it could be that on average the lines are longer than $600$ mm.
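A compact sketch of the line-marking test at both significance levels (assuming Python with SciPy) makes the switch in decision explicit.

```python
import math
from scipy.stats import norm

# Line marking: H0: mu = 600, population variance 200, n = 50, sample mean 604, 2-tailed test
mu0, var, n, x_bar = 600, 200, 50, 604

z = (x_bar - mu0) / math.sqrt(var / n)     # = 4 / 2 = 2.0

for alpha in (0.01, 0.05):
    z_c = norm.ppf(1 - alpha / 2)          # ≈ 2.58 and 1.96
    decision = "reject H0" if abs(z) > z_c else "retain H0"
    print(f"alpha = {alpha}: critical z = ±{z_c:.2f} -> {decision}")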
Why then can our opinion suddenly switch just on the basis of an arbitrarily chosen significance level? This is a great question to think about. Here's one way to understand it.
Imagine a statistician's answer:
A significance level of $\alpha=0.01$ essentially means that you want me to be $99\%$ sure that $H_0$ should be rejected. If you want me to be $99\%$ sure, then it forces me to set a very high benchmark for a sample mean's $z$ score.
However, if you only want me to be $95\%$ sure, then I'll show some leniency. You are allowing me a greater chance of being wrong, so I'll lower the benchmark accordingly.
A random sample of $100$ giant tortoises during the past year showed an average life span of $143.6$ years, with a standard deviation of $17.8$ years. A scientific journal recently claimed an average life span for the reptile of $140$ years. Does the sample data indicate that the real average is greater than $140$? Use a significance level of $\alpha=0.05$.
This example is interesting because the variance $\sigma^2$ of the population is not given. When this happens, we are forced to use the variance of the sample as an estimate of $\sigma^2$.
Provided the sample is large enough, perhaps $30$ or more, a good estimate of the population variance is the sample variance $s^2$. The variable $\frac{\overline{x}-\mu}{s/\sqrt{n}}$ is still approximately distributed as a standardized normal variable.
Note that the denominator $s/\sqrt{n}$ becomes an estimate of the standard error.
So we set up a 1-tailed test, because we are interested in whether the reptile lives longer than $140$ years:
$H_0$: $\mu=140$
$H_a$: $\mu>140$
Now we know that $\overline{x}=143.6$, and since $\sigma\approx s=17.8$ years and $n=100$, we have:
$z=\frac{143.6-140}{17.8/\sqrt{100}}=2.02$
Clearly this $z$ value is well into the critical region (the critical value is $z_c=1.645$), and so $H_0$ must be rejected. We have to accept, on the basis of the sample data, that the life span of the giant tortoise is more than $140$ years.
Note that the risk of rejecting $H_0$ incorrectly can be quantified. We could make any one of three statements:
This example shows how a statistician might in practice summarize the elements of a particular problem.
A research study measured the resting pulse rates of $40$ Olympic marathon runners and found a mean pulse rate of $65.45$ beats per minute with a standard deviation of $8.25$ beats per minute. Researchers want to know if this sample is different from the generally accepted pulse rate for marathon runners, believed to be $68$ beats per minute. Use a significance level of $\alpha=0.05$.
Think: What do we know?
Do: Calculate the $z$ value.
$z=\frac{65.45-68}{8.25/\sqrt{40}}=-1.95$
Reflect:
Strictly speaking, we cannot reject the null hypothesis that the standard rate for all marathon runners is $68$ beats per minute, since the 2-tailed critical values at $\alpha=0.05$ are $z_c=\pm1.96$. However, because the computed value is perilously close to the critical value, we might check for any rounding errors in our sample data, or even repeat the test with another sample. Consider that a 1-tailed test with an alternative hypothesis of $\mu<68$ would mean that $z_c=-1.645$, and so the null hypothesis would be immediately rejected.
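A final sketch (assuming Python with SciPy) contrasts the 2-tailed decision with the 1-tailed one for this borderline case.

```python
import math
from scipy.stats import norm

# Pulse rates: H0: mu = 68, sample mean 65.45, s = 8.25, n = 40, alpha = 0.05
mu0, x_bar, s, n, alpha = 68, 65.45, 8.25, 40, 0.05

z = (x_bar - mu0) / (s / math.sqrt(n))    # ≈ -1.95

z_two = norm.ppf(1 - alpha / 2)           # 2-tailed critical value ≈ 1.96
z_one = norm.ppf(alpha)                   # 1-tailed (left) critical value ≈ -1.645

print(f"z = {z:.3f}")
print("2-tailed:", "reject H0" if abs(z) > z_two else "retain H0")   # retained, just
print("1-tailed:", "reject H0" if z < z_one else "retain H0")        # rejected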
What this example serves to show is that any test result should not be accepted blindly. Our intuition and common sense should also play a pivotal role in decision theory.