When we are trying to understand what our data is telling us, we usually find statistics that tell us the location of the data (such as the mean or median) as well as measures of spread, such as the range.
To get a better picture of the distribution of a data set, with a concise set of values, we often use the five number summary.
The five number summary is made up of the minimum and maximum values, the median, and two other values, known as upper and lower quartiles.
We are familiar with the median as the middle value in a data set when the values are arranged in order. The median is a useful statistic that tells us the location of the data.
Quartiles are values at particular locations in the data set – similar to the median, but instead of dividing a data set into halves, they divide a data set into quarters.
$1$ | $3$ | $4$ | $7$ | $11$ | $12$ | $14$ | $19$ |
First locate the median, between the $4$th and $5$th values:
 | | | | Median | | | | |
 | | | | $\downarrow$ | | | | |
$1$ | $3$ | $4$ | $7$ | | $11$ | $12$ | $14$ | $19$ |
Now there are $4$ values in each half of the data set, so split each group of four values in half to find the quartiles. The lower quartile is between the $2$nd and $3$rd values, so there are two values on either side of it within the lower half. Similarly, the upper quartile is between the $6$th and $7$th values:
 | | lower quartile | | | Median | | | upper quartile | | |
 | | $\downarrow$ | | | $\downarrow$ | | | $\downarrow$ | | |
$1$ | $3$ | | $4$ | $7$ | | $11$ | $12$ | | $14$ | $19$ |
We can see that the intervals between the quartiles each contain two values–one quarter of the total number of values in the data set.
The lower quartile is also called the first quartile, or $Q_1$. It is the middle value between the minimum value and the median. To calculate the lower quartile, we identify the scores below the median (which we call the lower half). Then we determine the middle value of this lower half.
The median is also known as the second quartile, or $Q_2$. It is the middle value in the sorted data set.
The median is the $\frac{n+1}{2}$th value in the sorted data set, where $n$ is the number of values in the data set. When $n$ is even, this position falls halfway between two values, and the median is the mean of those two values.
The upper quartile is also called the third quartile, or $Q_3$. It is the middle value between the median and the maximum value. The upper quartile can be found by identifying the scores in the upper half (above the median). Then we determine the middle value of this upper half.
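The procedure described above can be sketched in Python. This is a minimal illustration, assuming the convention used in this lesson (when the number of values is odd, the median is left out of both halves). The function name `five_number_summary` is our own, and calculators or libraries may use slightly different quartile conventions.

```python
from statistics import median

def five_number_summary(values):
    """Return (minimum, Q1, median, Q3, maximum) using the median-of-halves method.

    Needs at least two values so that each half is non-empty.
    """
    data = sorted(values)
    n = len(data)
    half = n // 2
    lower_half = data[:half]        # scores below the median position
    upper_half = data[n - half:]    # scores above the median position
    return (data[0], median(lower_half), median(data), median(upper_half), data[-1])

summary = five_number_summary([1, 3, 4, 7, 11, 12, 14, 19])
print(summary)  # (1, 3.5, 9.0, 13.0, 19)
```

For the eight values used above, this places the lower quartile at $3.5$ (between the $2$nd and $3$rd values) and the upper quartile at $13$ (between the $6$th and $7$th values).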
The range is the difference between the maximum value and minimum value in the data set.
The interquartile range, or IQR, is the difference between the upper quartile and the lower quartile. Half of the values in the data set lie within the interquartile range.
The interquartile range is a useful measure of the spread of data because, unlike the range, it is not affected by unusually large or small values.
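To see why the interquartile range resists unusual values, the sketch below compares the range and IQR of the eight-value data set from earlier with a hypothetical version in which the largest value is replaced by $190$. The helper name `spread` and the modified data set are our own illustrations, and the quartiles are found with the median-of-halves method used in this lesson.

```python
from statistics import median

def spread(values):
    """Return (range, IQR), with quartiles found as the medians of each half."""
    data = sorted(values)
    half = len(data) // 2
    q1 = median(data[:half])
    q3 = median(data[len(data) - half:])
    return (data[-1] - data[0], q3 - q1)

typical = [1, 3, 4, 7, 11, 12, 14, 19]
with_extreme = [1, 3, 4, 7, 11, 12, 14, 190]  # hypothetical: largest value replaced by 190

print(spread(typical))       # (18, 9.5)
print(spread(with_extreme))  # (189, 9.5)
```

The range jumps from $18$ to $189$, while the interquartile range stays at $9.5$.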
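The five number summary is the set of values made up of the minimum, the lower quartile ($Q_1$), the median ($Q_2$), the upper quartile ($Q_3$) and the maximum.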
These values break our data set into four parts as shown in this diagram
Knowing the five number summary can help us identify key regions of the data set.
The individual values required for a five number summary are readily obtained using the Statistics mode on a CAS calculator.
Determine the five number summary and interquartile range for this data set:
$-2,10,-1,6,9,6,-6,1,7$.
ClassPad
This data can be entered in the Statistics mode, "list1", without needing to sort the values in ascending order.
Use the Calc -> One-variable menu to calculate the values for the five number summary (and other statistics).
Hence, the five number summary is set out below:
Minimum | Lower quartile | Median | Upper quartile | Maximum |
---|---|---|---|---|
$-6$ | $-1.5$ | $6$ | $8$ | $10$ |
Note that, in this example, neither quartile is a value from the data set because the positions of the quartiles fall between values.
The interquartile range is the difference between the upper quartile and the lower quartile:
Interquartile range | $=$ | $8-(-1.5)$ |
 | $=$ | $9.5$ |
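As a cross-check on the calculator output, the same values can be computed with a short script. The sketch below assumes the median-of-halves convention described earlier in this lesson. A calculator or library that uses a different quartile convention could give slightly different quartiles for other data sets.

```python
from statistics import median

data = sorted([-2, 10, -1, 6, 9, 6, -6, 1, 7])
half = len(data) // 2                  # 4 values on each side of the median

minimum, maximum = data[0], data[-1]
q1 = median(data[:half])               # median of the lower half
q2 = median(data)                      # median of the whole data set
q3 = median(data[len(data) - half:])   # median of the upper half
iqr = q3 - q1

print(minimum, q1, q2, q3, maximum)    # -6 -1.5 6 8.0 10
print(iqr)                             # 9.5
```

Here the quartiles are $-1.5$ and $8$, so the script reports an interquartile range of $8-(-1.5)=9.5$.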
Use class centres to determine the five number summary and interquartile range for the data represented by the histogram:
The histogram data can be represented by the frequency table below:
Class | Class centre | Frequency |
---|---|---|
$30-<40$ | $35$ | $5$ |
$40-<50$ | $45$ | $5$ |
$50-<60$ | $55$ | $7$ |
$60-<70$ | $65$ | $1$ |
$70-<80$ | $75$ | $3$ |
ClassPad
Using Statistics mode, enter class centres into "list1" and frequencies into "list2"
Use the Calc -> One-variable menu to calculate the values for the five number summary (and other statistics), using the "Freq" setting to select frequencies from "list2"
Hence, the five number summary is set out below:
Minimum | Lower quartile | Median | Upper quartile | Maximum |
---|---|---|---|---|
$35$ | $40$ | $55$ | $55$ | $75$ |
In this case, median and upper quartile have the same value.
The interquartile range is the difference between the upper quartile and the lower quartile:
Interquartile range | $=$ | $55-40$ |
 | $=$ | $15$ |
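The idea behind the "Freq" setting can be sketched in Python by repeating each class centre as many times as its frequency and then summarising the expanded list. This is only an illustration under the median-of-halves convention used in this lesson, not a description of how the calculator works internally.

```python
from statistics import median

class_centres = [35, 45, 55, 65, 75]
frequencies = [5, 5, 7, 1, 3]

# Repeat each class centre as many times as its frequency.
data = []
for centre, frequency in zip(class_centres, frequencies):
    data.extend([centre] * frequency)
data.sort()

half = len(data) // 2                  # 21 values, so 10 in each half
q1 = median(data[:half])
q2 = median(data)
q3 = median(data[len(data) - half:])
summary = (data[0], q1, q2, q3, data[-1])

print(summary)   # (35, 40.0, 55, 55.0, 75)
print(q3 - q1)   # 15.0
```

The median is the $11$th of the $21$ values, which falls in the run of $55$s. The upper half also has its middle inside that run, which is why the median and upper quartile coincide.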
The table shows the number of points scored by a basketball team in each game of their previous season.
$59$ | $67$ | $73$ | $82$ | $91$ | $58$ | $79$ | $88$ |
$69$ | $84$ | $55$ | $80$ | $98$ | $64$ | $82$ |
Sort the data in ascending order.
State the maximum value of the set.
State the minimum value of the set.
Find the median value.
Find the lower quartile.
Find the upper quartile.
Answer the following questions using the given frequency table.
Score | Frequency |
---|---|
$15$ | $13$ |
$16$ | $9$ |
$17$ | $23$ |
$18$ | $19$ |
$19$ | $8$ |
$20$ | $13$ |
Complete the five number summary using a CAS calculator.
Minimum: $\editable{}$
Lower quartile: $\editable{}$
Median: $\editable{}$
Upper quartile: $\editable{}$
Maximum: $\editable{}$
Calculate the interquartile range.
To gain a place in the main race of a car rally, teams must compete in a qualifying round. The median time in the qualifying round determines the cut off time to make it through to the main race. Below are some results from the qualifying round.
$75\%$ of teams finished in $159$ minutes or less.
$25\%$ of teams finished in $132$ minutes or less.
$25\%$ of teams finished with a time between $132$ and $142$ minutes.
Determine the cut off time required in the first round to make it through to the main race.
Determine the interquartile range in the qualifying round.
In the qualifying round, the ground was wet, while in the main race, the ground was dry. To make the times more comparable, the finishing time of each team from the qualifying round is reduced by $5$ minutes. What would be the new median time from the qualifying round?
We start with a number line that covers the full range of values in our data set.
We then plot the values from the five number summary above the number line, and connect them in a certain way to create a box plot. Here is an example:
The two vertical edges of the box show the lower and upper quartiles of the data set. The left-hand side of the box is $Q_1$ and the right-hand side of the box is $Q_3$. The vertical line inside the box shows the median.
Then there are two lines that extend from the box outwards. The endpoint of the left line is at the minimum value, while the endpoint of the right line is at the maximum value.
The box plot must be drawn parallel to a number line so that the values for the five number summary can be easily read from the graph.
The example above represents the five number summary set out below:
Minimum | Lower quartile | Median | Upper quartile | Maximum |
---|---|---|---|---|
$18$ | $51$ | $68$ | $87$ | $100$ |
The interquartile range (IQR) is the difference between the upper quartile and the lower quartile.
For this example, the IQR is $87-51=36$.
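The construction described above can be imitated in plain text. The sketch below is purely illustrative: the function `ascii_box_plot` and its drawing characters are our own invention, it is not a substitute for a properly drawn box plot, and it assumes the minimum is strictly less than the maximum.

```python
def ascii_box_plot(minimum, q1, median, q3, maximum, width=60):
    """Return a one-line text box plot; assumes minimum < maximum."""
    def column(value):
        # Map a data value to a character position on the line.
        return round((value - minimum) / (maximum - minimum) * (width - 1))

    cells = ["-"] * width                        # whiskers span the full range
    for i in range(column(q1), column(q3) + 1):
        cells[i] = "="                           # the box covers Q1 to Q3
    cells[0] = cells[-1] = "|"                   # minimum and maximum
    cells[column(q1)] = "["                      # left edge of the box (Q1)
    cells[column(q3)] = "]"                      # right edge of the box (Q3)
    cells[column(median)] = "|"                  # median line inside the box
    return "".join(cells)

# With width=83, each character corresponds to one unit from 18 to 100.
plot = ascii_box_plot(18, 51, 68, 87, 100, width=83)
print(plot)
```

Each marker sits at a position proportional to its value, which is the same reason a box plot must be drawn parallel to a number line.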
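Since the marks of the box plot represent quartiles, each region represents $25\%$ of the values in the data set. Hence, in this example, we can make statements such as:
$25\%$ of the values lie between $18$ and $51$.
$50\%$ of the values lie between $51$ and $87$.
$25\%$ of the values lie between $87$ and $100$.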
The box plot below shows the age at which a group of people got their driving licences.
What is the oldest age at which someone got their licence?
What is the youngest age at which someone got their licence?
What percentage of people were aged from $18$ to $22$?
$10\%$
$25\%$
$50\%$
The middle $50\%$ of responders were within how many years of one another?
$9$
$6$
$7$
$8$
In which quartile are the ages least spread out?
$4$th
$1$st
$3$rd
$2$nd
The bottom $50\%$ of responders were within how many years of one another?
$5$
$4$
$6$
Use a CAS calculator to construct a boxplot for this data set, and determine the upper quartile:
$48,4,8,36,8,28,20,40,44$.
ClassPad
Using the Statistics mode:
Use a CAS calculator to construct a boxplot for the data set represented by this frequency table:
Score | Frequency |
---|---|
$12$ | $1$ |
$13$ | $0$ |
$14$ | $8$ |
$15$ | $11$ |
$16$ | $14$ |
$17$ | $7$ |
ClassPad
Using the Statistics mode:
For this data set the box plot shows that values are most spread out below the lower quartile.
Sometimes a data set contains unusually high or low values. These unusual values are called outliers and may arise from data collection errors or from the natural variation of the data.
We often want to identify the outlier values, and see the characteristics of the data without the effect of the outliers. In this case, we can construct a modified box plot to show the outlier values separately.
For the data set used in Example 5, construct a box plot and determine the range with the outlier value displayed separately.
ClassPad
Using the Statistics mode:
The boxplot shows a dot to indicate that the value of $12$ is an outlier.
With this outlier displayed separately, the minimum score is $14$ and the lower $25\%$ of values no longer appears to be unusually spread out.
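The lesson does not state the rule the calculator uses to flag outliers. A common convention, assumed here, treats any value more than $1.5\times\text{IQR}$ below the lower quartile or above the upper quartile as an outlier. The sketch below applies that assumed rule to the frequency table from Example 5, with quartiles found by the median-of-halves method.

```python
from statistics import median

scores = [12, 13, 14, 15, 16, 17]
frequencies = [1, 0, 8, 11, 14, 7]

# Expand the frequency table into a sorted list of individual scores.
data = []
for score, frequency in zip(scores, frequencies):
    data.extend([score] * frequency)
data.sort()

half = len(data) // 2
q1 = median(data[:half])
q3 = median(data[len(data) - half:])
iqr = q3 - q1

# Assumed rule (not stated in the lesson): 1.5 x IQR fences beyond the quartiles.
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

outliers = sorted({x for x in data if x < lower_fence or x > upper_fence})
remaining = [x for x in data if lower_fence <= x <= upper_fence]

print(outliers)                          # [12]
print(min(remaining), max(remaining))    # 14 17
```

Under this assumption the fences are $13.5$ and $17.5$, so $12$ is the only outlier and the remaining scores run from $14$ to $17$.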
Answer the following questions using the given grouped frequency table.
Class | Class centre | Frequency |
---|---|---|
$40\le x<45$ | $42.5$ | $3$ |
$45\le x<50$ | $47.5$ | $4$ |
$50\le x<55$ | $52.5$ | $7$ |
$55\le x<60$ | $57.5$ | $3$ |
$60\le x<65$ | $62.5$ | $3$ |
$65\le x<70$ | $67.5$ | $9$ |
$70\le x<75$ | $72.5$ | $4$ |
$75\le x<80$ | $77.5$ | $5$ |
Complete the five number summary using a CAS calculator.
Minimum: $\editable{}$
Lower quartile: $\editable{}$
Median: $\editable{}$
Upper quartile: $\editable{}$
Maximum: $\editable{}$
Calculate the interquartile range.
The salaries earned by employees at a software company are given in the histogram below.
Use your CAS calculator to construct a box plot, using the class centres.
Calculate the interquartile range.
Using the box plot, approximately what percentage of salaries lie in the range $\$90000$ to $\$100000$?
Complete the following statement.
The highest $25\%$ of salaries lie between $\$$ $\editable{}$ and $\$$ $\editable{}$ inclusive.
Parallel box plots are used to compare two sets of data visually.
We call these parallel box plots because they are presented parallel to each other along the same number line. They must therefore be drawn on the same scale, which makes a visual comparison fairly straightforward.
It is important to clearly label each box plot. Here we have plotted two sets of data, comparing the time it took two different groups of people to complete an online task.
We will see in later lessons that this format is very useful for comparing the characteristics of two (or more) data sets.
The heights (in metres) of the boys and girls in a class of $30$ students were recorded. The results are given in the table below.
Boys: | $1.65$ | $1.66$ | $1.67$ | $1.68$ | $1.63$ | $1.62$ | $1.61$ | $1.60$ | $1.75$ | $1.76$ | $1.77$ | $1.78$ | $1.73$ | $1.72$ | $1.71$ |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
Girls: | $1.55$ | $1.56$ | $1.57$ | $1.58$ | $1.53$ | $1.52$ | $1.51$ | $1.50$ | $1.69$ | $1.70$ | $1.71$ | $1.72$ | $1.67$ | $1.66$ | $1.65$ |
Complete the table for the given data of the heights of boys in the class.
Minimum | $\editable{}$ |
---|---|
First quartile | $\editable{}$ |
Median | $\editable{}$ |
Third quartile | $\editable{}$ |
Maximum | $\editable{}$ |
Complete the table for the given data of the heights of girls in the class.
Minimum | $\editable{}$ |
---|---|
First quartile | $\editable{}$ |
Median | $\editable{}$ |
Third quartile | $\editable{}$ |
Maximum | $\editable{}$ |
Draw a parallel box plot for this data.