Measures of spread in a quantitative (numerical) data set seek to describe whether the scores in a data set are very similar and clustered together, or whether there is a lot of variation in the scores and they are very spread out.
There are several ways to describe the spread of data, and they vary greatly in complexity. The simplest is to look at the numerical range of the entire data set. Alternatively, the data can be broken into chunks. Spread can also be measured relative to the mean, which gives a value that allows a meaningful comparison with other data sets.
This section will define the range, interquartile range, and standard deviation as measures of spread. How to break a data set of any size into quartiles is also explored.
The range is the simplest measure of spread in a quantitative (numerical) data set. It is the difference between the maximum and minimum scores in a data set.
Subtract the lowest score in the set from the highest score in the set. That is: \text{Range }=\text{ Highest score}-\text{Lowest score}.
For example, at one school the ages of students in Year 7 vary between 11 and 14. So the range for this set is 14-11=3.
As a different example, if we looked at the ages of people waiting at a bus stop, the youngest person might be a 7 year old and the oldest person might be a 90 year old. The range of this set of data is 90-7=83, which is a much larger range of ages.
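The two examples above can be checked with a minimal Python sketch. The function name `score_range` and the individual ages in each list are hypothetical; only the lowest and highest ages come from the examples.

```python
def score_range(scores):
    """Range = highest score - lowest score."""
    return max(scores) - min(scores)

# Hypothetical ages: only the minimum and maximum match the examples above
year7_ages = [11, 12, 12, 13, 14]
bus_stop_ages = [7, 23, 35, 41, 68, 90]

print(score_range(year7_ages))     # 3
print(score_range(bus_stop_ages))  # 83
```

Changing any of the middle ages leaves both results unchanged, because only the highest and lowest scores are used.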
Remember, the range only changes if the highest or lowest score in a data set is changed. Otherwise, it will remain the same.
What is the lowest score in a set if the range is 8, and the highest score is 19?
Whilst the range is very simple to calculate, it is based on the sparse information provided by the upper and lower limits of the data set. To get a better picture of the internal spread in a data set, it is often more useful to find the set's quartiles, from which the interquartile range (IQR) can be calculated.
Quartiles are scores at particular locations in the data set. They are similar to the median, but instead of dividing a data set into halves, they divide a data set into quarters. Let's look at how we would divide up some data sets into quarters now.
Make sure the data set is ordered before finding the quartiles or the median.
Here is a data set with 8 scores:
First locate the median, between the 4\text{th} and 5\text{th} scores:
Now there are four scores in each half of the data set, so split each group of four scores in half to find the quartiles. We can see the first quartile, Q_{1}, is between the 2\text{nd} and 3\text{rd} scores, so there are two scores on either side of Q_{1} in the lower half. Similarly, the third quartile, Q_{3}, is between the 6\text{th} and 7\text{th} scores:
Now let's look at a situation with 9 scores:
This time, the 5\text{th} term is the median. There are four terms on either side of the median, like for the set with eight scores. So Q_{1} is still between the 2\text{nd} and 3\text{rd} scores. The upper half is now the 6\text{th} to 9\text{th} scores, so Q_{3} is between the 7\text{th} and 8\text{th} scores.
Finally, let's look at a set with 10 scores:
For this set, the median is between the 5\text{th} and 6\text{th} scores. This time, however, there are 5 scores on either side of the median. So Q_{1} is the 3\text{rd} term and Q_{3} is the 8\text{th} term.
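The splitting procedure used for the sets of 8, 9 and 10 scores can be sketched in Python. This is a minimal illustration, assuming the "median of each half" method described above, where the median itself is left out of both halves when the number of scores is odd. The function name `quartiles` is hypothetical, and the data are simply the numbers 1 to n, so each value equals its position in the ordered set.

```python
from statistics import median

def quartiles(scores):
    """Return (Q1, median, Q3) using the median-of-each-half method."""
    ordered = sorted(scores)              # the data must be ordered first
    n = len(ordered)
    lower_half = ordered[: n // 2]        # scores below the median position
    upper_half = ordered[(n + 1) // 2 :]  # scores above the median position
    return median(lower_half), median(ordered), median(upper_half)

# Hypothetical data: the scores 1..n, so each result reads as a position
for n in (8, 9, 10):
    print(n, quartiles(range(1, n + 1)))
# 8 (2.5, 4.5, 6.5)
# 9 (2.5, 5, 7.5)
# 10 (3, 5.5, 8)
```

A result such as 2.5 means the quartile falls halfway between the 2nd and 3rd scores. Other software may use a different quartile convention and give slightly different values.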
The quartiles divide the data set into four sections, each containing approximately 25\% of the scores. The lowest score to the first quartile is approximately 25\% of the data, the first quartile to the median is another 25\%, the median to the third quartile is another 25\%, and the third quartile to the highest score represents the last 25\% of the data. We can combine these sections together. For example, 50\% of the scores in a data set lie between the first and third quartiles.
The quartiles can also be described using percentiles. A percentile is the value below which a given percentage of the observations in a data set fall. For example, a score at the 75\text{th} percentile in a statistical test is higher than 75\% of all scores. The median represents the 50\text{th} percentile, or the halfway point in a data set.
Q_{1} is the first quartile (sometimes called the lower quartile). It is the middle score in the bottom half of the data set, and represents the 25\text{th} percentile.
Q_{2} is the second quartile, and is usually called the median, which we have already learned about. It represents the 50\text{th} percentile of the data set.
Q_{3} is the third quartile (sometimes called the upper quartile). It is the middle score in the top half of the data set, and represents the 75\text{th} percentile.
The interquartile range (IQR) is the difference between the third quartile and the first quartile. 50\% of scores lie within the IQR because it contains the data set between the first quartile and the median, as well as the median and the third quartile. Since it focuses on the middle 50\% of the data set, the interquartile range often gives a better indication of the internal spread than the range does, and it is less affected by individual scores that are unusually high or low (called outliers).
Subtract the first quartile from the third quartile. That is, \text{IQR} = Q_{3}-Q_{1}
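As a rough illustration of this calculation, and of why the IQR is less affected by outliers than the range, here is a short Python sketch. The function name `iqr` and both data sets are hypothetical, and the quartiles are found with the median-of-each-half method used in this section.

```python
from statistics import median

def iqr(scores):
    """Interquartile range: Q3 - Q1, using the median of each half."""
    ordered = sorted(scores)
    n = len(ordered)
    q1 = median(ordered[: n // 2])        # middle of the bottom half
    q3 = median(ordered[(n + 1) // 2 :])  # middle of the top half
    return q3 - q1

scores = [2, 4, 5, 7, 8, 10, 11, 13]        # hypothetical data
with_outlier = [2, 4, 5, 7, 8, 10, 11, 40]  # same data, largest score replaced

print(max(scores) - min(scores), iqr(scores))                    # 11 6.0
print(max(with_outlier) - min(with_outlier), iqr(with_outlier))  # 38 6.0
```

The single unusually high score more than triples the range, while the IQR does not change.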
Consider the following set of scores: 33,\,38,\,50,\,12,\,33,\,48,\,41
Sort the scores in ascending order.
Find the number of scores.
Find the median.
Find the first quartile of the set of scores.
Find the third quartile of the set of scores.
Find the interquartile range.
For the following set of scores in the bar chart to the right:
Input the data in the following distribution table:
\text{Score }(x) | \text{Freq }(f) | fx | \text{Cumulative Freq } (cf) |
---|---|---|---|
30 | |||
40 | |||
50 | |||
60 | |||
70 | |||
\text{Total} | |||
Find the median score using the distribution table above.
Find the first quartile score.
Find the third quartile score.
Find the interquartile range.
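One way to work with a frequency table like the one above is sketched below in Python. The frequencies here are hypothetical, since the bar chart's values are not reproduced in the text. The sketch prints the fx and cumulative frequency columns, then expands the table back into a list of individual scores to find the quartiles.

```python
from statistics import median

# Hypothetical frequencies for each score (NOT the values from the bar chart)
freq = {30: 2, 40: 5, 50: 6, 60: 4, 70: 3}

cumulative = 0
for score, f in freq.items():
    cumulative += f
    print(score, f, score * f, cumulative)  # x, f, fx, cf

# Expand the table into an ordered list of individual scores
scores = [score for score, f in freq.items() for _ in range(f)]
n = len(scores)

q1 = median(scores[: n // 2])        # middle of the bottom half
q2 = median(scores)                  # the median
q3 = median(scores[(n + 1) // 2 :])  # middle of the top half
print(q1, q2, q3, q3 - q1)           # 40.0 50.0 60.0 20.0
```

With 20 scores, the median lies between the 10th and 11th scores, and the cumulative frequency column shows that both of these are 50.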
Standard deviation is a measure of spread, which helps give a meaningful estimate of the variability in a data set. While the quartiles gave us a measure of spread about the median, the standard deviation gives us a measure of spread with respect to the mean. Roughly speaking, it is a typical distance of the data points from the mean, calculated from the average of the squared distances of each data point from the mean. A small standard deviation indicates that most scores are close to the mean, while a large standard deviation indicates that the scores are more spread out away from the mean value.
The standard deviation can be calculated for a population or a sample.
The symbols used are:
\displaystyle \text{Population standard deviation} | \displaystyle = | \displaystyle \sigma \text{ (lowercase sigma)} |
\displaystyle \text{Sample standard deviation} | \displaystyle = | \displaystyle s |
In Statistics mode on a calculator, the following symbols might be used:
\displaystyle \text{Population standard deviation} | \displaystyle = | \displaystyle \sigma _n |
\displaystyle \text{Sample standard deviation} | \displaystyle = | \displaystyle \sigma _{n-1} |
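The same two options appear in software. As a small sketch with hypothetical data, Python's standard `statistics` module provides `pstdev` for the population standard deviation (dividing by n, like \sigma _n) and `stdev` for the sample standard deviation (dividing by n-1, like \sigma _{n-1}).

```python
from statistics import pstdev, stdev

scores = [2, 4, 4, 4, 5, 5, 7, 9]  # hypothetical data

population_sd = pstdev(scores)  # divides by n, like sigma_n
sample_sd = stdev(scores)       # divides by n - 1, like sigma_(n-1)

print(round(population_sd, 2))  # 2.0
print(round(sample_sd, 2))      # 2.14
```

The sample value is always slightly larger than the population value for the same data, because it divides by a smaller number.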
Note: You are only required to calculate the standard deviation using the built-in function in the statistics mode of a calculator. The formal definition and formula that follow are included for reference only.
The standard deviation is found by calculating the square root of the variance.
Variance is the average of the squared differences from the mean. Here is its formula. \sigma ^2=\dfrac{1}{n}\Sigma\left(x_i-\mu \right)^2
This is the formula by which a calculator calculates the standard deviation of a data set from a full population. That is, it is the formula used for census data rather than sample data. \sigma =\sqrt{\dfrac{1}{n}\Sigma\left(x_i-\mu \right)^2}
In this formula,
The numbers x_i are the values in the data set. There is one value for each subscript i.
There are n numbers x_i in the data set. So, i goes from 1 to n in the summation.
The symbol \mu (Greek letter 'mu') is the population mean.
The Greek letter \sigma (sigma) is used for the population standard deviation.
The symbol \Sigma (upper case sigma) is the summation symbol.
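The population formula above can be followed step by step in a short Python sketch. The function name `population_sd` and the data are hypothetical. The result is compared with `statistics.pstdev` from the standard library as a check.

```python
from math import sqrt
from statistics import pstdev

def population_sd(scores):
    """Population standard deviation, following the formula term by term."""
    n = len(scores)                                    # number of values x_i
    mu = sum(scores) / n                               # population mean
    variance = sum((x - mu) ** 2 for x in scores) / n  # average squared difference
    return sqrt(variance)                              # square root of the variance

scores = [1, 3, 5, 7, 9]  # hypothetical data, mean 5

print(round(population_sd(scores), 2))  # 2.83
print(round(pstdev(scores), 2))         # 2.83
```

Here the squared differences are 16, 4, 0, 4 and 16, so the variance is 40 divided by 5, which is 8, and the standard deviation is the square root of 8.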
Simply put, standard deviation describes the spread of data by comparing the distance of each score to the mean. It is complicated to calculate, but it gives a lot of information about the spread of data because it takes into account every data point in the set.
Standard deviation is also a very powerful way of comparing different data sets, particularly if they have different means and different numbers of scores.
Find the population standard deviation of the following set of scores by using the Statistics mode on the calculator: 8,\,20,\,16,\,9,\,9,\,15,\,5,\,17,\,19,\,6
Round your answer to two decimal places.
Fill in the table and answer the questions below.
Complete the table given below.
\text{Class} | \text{Class Centre} | \text{Frequency} | fx |
---|---|---|---|
1-9 | | 8 | |
10-18 | | 6 | |
19-27 | | 4 | |
28-36 | | 6 | |
37-45 | | 8 | |
\text{Total} | | | |
Use the class centres to estimate the mean of the data set, correct to two decimal places.
Use the class centres to estimate the population standard deviation, correct to two decimal places.
If we used the original ungrouped data to calculate standard deviation, do you expect that the ungrouped data would have a higher or lower standard deviation?
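The general approach for grouped data can be sketched in Python: each class is represented by its class centre, and each centre is counted as many times as its frequency. The classes and frequencies below are hypothetical and are not the ones in the table above.

```python
from math import sqrt

# Hypothetical grouped data: ((class lower limit, class upper limit), frequency)
classes = [((1, 5), 2), ((6, 10), 5), ((11, 15), 3)]

centres = [(low + high) / 2 for (low, high), _ in classes]  # 3.0, 8.0, 13.0
freqs = [f for _, f in classes]
n = sum(freqs)                                              # total frequency

# Estimated mean: sum of (centre x frequency) divided by total frequency
mean = sum(c * f for c, f in zip(centres, freqs)) / n

# Estimated population variance: frequency-weighted squared differences
variance = sum(f * (c - mean) ** 2 for c, f in zip(centres, freqs)) / n
sd = sqrt(variance)

print(round(mean, 2), round(sd, 2))  # 8.5 3.5
```

These are estimates only, because every score in a class is treated as if it were equal to the class centre.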
Standard deviation measures how far, on average, the data points lie from the mean. The standard deviation can be calculated for a population (\sigma) or a sample (s).
The standard deviation is a more complex calculation but takes every data point into account. The standard deviation is significantly impacted by outliers.
For each measure of spread:
A larger value indicates a more widely spread (more variable) data set.
A smaller value indicates a more tightly packed (less variable) data set.
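To illustrate this summary, the following Python sketch compares all three measures for two hypothetical data sets that share the same mean of 50. The helper name `spread_summary` is an assumption made for this example, and the quartiles use the median-of-each-half method from this section.

```python
from statistics import median, pstdev

def spread_summary(scores):
    """Return (range, IQR, population standard deviation) for a data set."""
    ordered = sorted(scores)
    n = len(ordered)
    q1 = median(ordered[: n // 2])        # middle of the bottom half
    q3 = median(ordered[(n + 1) // 2 :])  # middle of the top half
    return ordered[-1] - ordered[0], q3 - q1, pstdev(ordered)

tight = [48, 49, 50, 51, 52]  # hypothetical: scores clustered near 50
wide = [30, 40, 50, 60, 70]   # hypothetical: scores spread out around 50

for name, data in (("tight", tight), ("wide", wide)):
    data_range, data_iqr, data_sd = spread_summary(data)
    print(name, data_range, data_iqr, round(data_sd, 2))
# tight 4 3.0 1.41
# wide 40 30.0 14.14
```

Every measure is larger for the more spread out data set, even though both sets have the same mean.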