
7.04 Box plots

Lesson

Five number summary

When we are trying to understand what our data is telling us, we usually find statistics that tell us the location of the data (such as the mean or median) as well as measures of spread, such as the range.

To get a better picture of the distribution of a data set, with a concise set of values, we often use the five number summary.

The five number summary is made up of the minimum and maximum values, the median, and two other values, known as upper and lower quartiles.

 

Medians and quartiles

We are familiar with the median as the middle value in a data set when the values are arranged in order.  The median is a useful statistic that tells us the location of the data.

Quartiles are values at particular locations in the data set – similar to the median, but instead of dividing a data set into halves, they divide a data set into quarters.

Exploration

  • Here is a data set with $8$ values:
$1 \quad 3 \quad 4 \quad 7 \quad 11 \quad 12 \quad 14 \quad 19$

 

First locate the median, between the $4$th and $5$th values:

$1 \quad 3 \quad 4 \quad 7 \quad \overset{\text{Median}}{\downarrow} \quad 11 \quad 12 \quad 14 \quad 19$

 

Now there are $4$ values in each half of the data set, so split each half in two to find the quartiles. The lower quartile is between the $2$nd and $3$rd values, with two values of the lower half on either side of it. Similarly, the upper quartile is between the $6$th and $7$th values:

$1 \quad 3 \quad \overset{\text{lower quartile}}{\downarrow} \quad 4 \quad 7 \quad \overset{\text{Median}}{\downarrow} \quad 11 \quad 12 \quad \overset{\text{upper quartile}}{\downarrow} \quad 14 \quad 19$

 

We can see that the intervals between the quartiles each contain two values, which is one quarter of the total number of values in the data set.

 

Calculating quartiles

The lower quartile is also called the first quartile, or $Q_1$. It is the middle value between the minimum value and the median. To calculate the lower quartile, we identify the scores less than the median (which we call the lower half). Then we determine the middle value of this lower half.

The median is also known as the second quartile, or $Q_2$. As we have already learnt, it is the middle value in the sorted data set.

The median is the $\frac{n+1}{2}$th value in the sorted data set, where $n$ is the number of values in the data set.

The upper quartile is also called the third quartile, or $Q_3$. It is the middle value between the median and the maximum value. The upper quartile can be found by identifying the scores in the upper half (above the median). Then we determine the middle value of this upper half.

The range is the difference between the maximum value and minimum value in the data set.

The interquartile range, or IQR, is the difference between the upper quartile and the lower quartile. Half of the values in the data set lie within the interquartile range.

The interquartile range is a useful measure of the spread of data because, unlike the range, it is not affected by unusually large or small values.
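The steps above can be sketched in Python. This is only an illustration of the median-of-halves method described in this lesson; the function name `quartiles` is made up for the example, and the data is the $8$-value set from the exploration.

```python
from statistics import median

def quartiles(values):
    """Return (Q1, Q2, Q3) using the median-of-halves method from this lesson."""
    ordered = sorted(values)
    n = len(ordered)
    half = n // 2
    lower_half = ordered[:half]            # values before the median position
    upper_half = ordered[half + n % 2:]    # skip the median itself when n is odd
    return median(lower_half), median(ordered), median(upper_half)

data = [1, 3, 4, 7, 11, 12, 14, 19]
q1, q2, q3 = quartiles(data)
data_range = max(data) - min(data)   # maximum minus minimum
iqr = q3 - q1                        # upper quartile minus lower quartile

print(q1, q2, q3)        # 3.5 9.0 13.0
print(data_range, iqr)   # 18 9.5
```

The quartiles land between values, matching the positions marked with arrows in the exploration.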

 

Five number summary

The five number summary is the set of values made up of the:

  • minimum
  • lower quartile ($Q_1$)
  • median ($Q_2$)
  • upper quartile ($Q_3$)
  • maximum

These values break our data set into four parts, as shown in this diagram.

Knowing the five number summary can help us identify key regions of the data set.

  • One quarter of values in the data set lie below the lower quartile;
  • One quarter of values lie above the upper quartile;
  • One half of values lie below the median; one half of the values lie above the median;
  • One half of the values in the data set lie within the interquartile range, between the lower and upper quartiles.

 

Five number summary with and without technology

The individual values required for a five number summary are readily obtained using the Statistics mode on a CAS calculator or a spreadsheet. They can also be calculated by hand.

If we enter all of the data in one column (in this example column A) we can use these formulas to help us do the calculations quickly.

| Statistic | Formula |
| --- | --- |
| Minimum | =MIN(A:A) |
| $Q_1$ | =QUARTILE(A:A,1) |
| Median | =MEDIAN(A:A) |
| $Q_3$ | =QUARTILE(A:A,3) |
| Maximum | =MAX(A:A) |

Otherwise, we can do it by hand by finding the median of the whole set and then the median of the two half sets that the median divides it into.
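Technology does not always use the by-hand rule for quartiles. The sketch below compares the median-of-halves method with Python's built-in `statistics.quantiles`, which interpolates between values. It is an assumption, not something stated in this lesson, that a given spreadsheet or calculator follows the same interpolation rule, so it is worth checking a small data set by hand.

```python
from statistics import median, quantiles

data = sorted([1, 3, 4, 7, 11, 12, 14, 19])
half = len(data) // 2

# By hand: median of each half (n is even here, so no value is left out).
by_hand = (median(data[:half]), median(data), median(data[half:]))

# Built-in interpolation rule; other tools may use this or a different convention.
interpolated = tuple(quantiles(data, n=4, method="inclusive"))

print(by_hand)       # (3.5, 9.0, 13.0)
print(interpolated)  # (3.75, 9.0, 12.5)
```

Both approaches agree on the median here, but the quartiles differ slightly because of the interpolation.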

Worked examples

Example 1

Determine the five number summary and interquartile range for this data set:

$-2, 10, -1, 6, 9, 6, -6, 1, 7$.

If we are doing this by hand, we should first sort the values in ascending order. If we are using technology, we can enter the data as is, with one entry per row.

In order: $-6, -2, -1, 1, 6, 6, 7, 9, 10$

Broken into two lists with the median:

| Lower half | Median | Upper half |
| --- | --- | --- |
| $-6, -2, -1, 1$ | $6$ | $6, 7, 9, 10$ |

The five number summary:

| Minimum | Lower quartile | Median | Upper quartile | Maximum |
| --- | --- | --- | --- | --- |
| $-6$ | $-1.5$ | $6$ | $8$ | $10$ |

Note that, in this example, neither quartile is a value from the data set because the positions of the quartiles fall between values.

The interquartile range is the difference between the upper quartile and the lower quartile:

Interquartile range $= 8-(-1.5)$
  $= 9.5$
Example 2

Use class centres to determine the five number summary and interquartile range for the data represented by the histogram:

The histogram data can be represented by the frequency table below:

| Class | Class centre | Frequency |
| --- | --- | --- |
| $30-<40$ | $35$ | $5$ |
| $40-<50$ | $45$ | $5$ |
| $50-<60$ | $55$ | $7$ |
| $60-<70$ | $65$ | $1$ |
| $70-<80$ | $75$ | $3$ |

Notice that there are $21$ items in this list. This means that the $11$th item is the median, $Q_1$ is the value halfway between the $5$th and $6$th items, and $Q_3$ is the value halfway between the $16$th and $17$th items.

  • $\text{minX}=35$ and $\text{maxX}=75$. Note that these are the class centres, so it is possible that there are higher and lower values. However, in practice, box plots are usually used to summarise continuous data, with class intervals small enough that it is reasonably accurate to use the class centres to find estimates for summary statistics.
  • The median will be $\text{Med}=55$ as the $11$th data point is in this class.
  • Lower quartile is $Q_1=40$ as the $5$th item is $35$ and the $6$th is $45$.
  • Upper quartile is $Q_3=55$ as the $16$th and $17$th items are both $55$.

So, the five number summary is set out below:

| Minimum | Lower quartile | Median | Upper quartile | Maximum |
| --- | --- | --- | --- | --- |
| $35$ | $40$ | $55$ | $55$ | $75$ |

In this case, the median and the upper quartile have the same value.

The interquartile range is the difference between the upper quartile and the lower quartile:

Interquartile range $= 55-40$
  $= 15$
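The working in Example 2 can be checked with a short Python sketch. It expands the frequency table into a list of class centres and then applies the median-of-halves method; the variable names are chosen for this illustration only.

```python
from statistics import median

# (class centre, frequency) pairs from the frequency table in Example 2.
table = [(35, 5), (45, 5), (55, 7), (65, 1), (75, 3)]

# Repeat each class centre as many times as its frequency.
values = [centre for centre, freq in table for _ in range(freq)]

n = len(values)                      # 21 items in total
half = n // 2
lower_half = values[:half]           # 1st to 10th items
upper_half = values[half + n % 2:]   # 12th to 21st items (median excluded)

summary = (min(values), median(lower_half), median(values),
           median(upper_half), max(values))
print(summary)                   # (35, 40.0, 55, 55.0, 75)
print(summary[3] - summary[1])   # 15.0
```

Because the list is built from class centres, these are estimates of the summary statistics rather than exact values.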

 

Practice questions

Question 1

The table shows the number of points scored by a basketball team in each game of their previous season.

$59$ $67$ $73$ $82$ $91$ $58$ $79$ $88$
$69$ $84$ $55$ $80$ $98$ $64$ $82$
  1. Sort the data in ascending order.

  2. State the maximum value of the set.

  3. State the minimum value of the set.

  4. Find the median value.

  5. Find the lower quartile.

  6. Find the upper quartile.

 

Question 2

Answer the following questions using the given frequency table.

| Score | Frequency |
| --- | --- |
| $15$ | $13$ |
| $16$ | $9$ |
| $17$ | $23$ |
| $18$ | $19$ |
| $19$ | $8$ |
| $20$ | $13$ |
  1. Complete the five number summary using a CAS calculator.

    Minimum: $\editable{}$

    Lower quartile: $\editable{}$

    Median: $\editable{}$

    Upper quartile: $\editable{}$

    Maximum: $\editable{}$

  2. Calculate the interquartile range.

 

Question 3

To gain a place in the main race of a car rally, teams must compete in a qualifying round. The median time in the qualifying round determines the cut off time to make it through to the main race. Below are some results from the qualifying round.

$75\%$ of teams finished in $159$ minutes or less.

$25\%$ of teams finished in $132$ minutes or less.

$25\%$ of teams finished with a time between $132$ and $142$ minutes.

  1. Determine the cut off time required in the first round to make it through to the main race.

  2. Determine the interquartile range in the qualifying round.

  3. In the qualifying round, the ground was wet, while in the main race, the ground was dry. To make the times more comparable, the finishing time of each team from the qualifying round is reduced by $5$ minutes. What would be the new median time from the qualifying round?

 

Box plots

Box plots, sometimes called box-and-whisker plots, can be a useful way of displaying quantitative (numerical) data as they clearly show the five values from a five number summary of a data set. In particular, a box plot highlights the middle $50\%$ of the scores in the data set, between $Q_1$ and $Q_3$.
 

Features of a box plot

We start with a number line that covers the full range of values in our data set.

We then plot the values from the five number summary above the number line, and connect them in a certain way to create a box plot. Here is an example:

The two vertical edges of the box show the lower and upper quartiles of the data set. The left-hand side of the box is $Q_1$ and the right-hand side of the box is $Q_3$. The vertical line inside the box shows the median.

Then there are two lines that extend from the box outwards. The endpoint of the left line is at the minimum value, while the endpoint of the right line is at the maximum value.

The box plot must be drawn parallel to a number line so that the values for the five number summary can be easily read from the graph.

The example above represents the five number summary set out below:

| Minimum | Lower quartile | Median | Upper quartile | Maximum |
| --- | --- | --- | --- | --- |
| $18$ | $51$ | $68$ | $87$ | $100$ |

The interquartile range (IQR) is the difference between the upper quartile and the lower quartile.

For this example, the IQR is $87-51=36$.

Since the marks of the box plot represent quartiles, each region represents $25\%$ of the values in the data set. Hence, in this example, we can make statements such as:

  • $50\%$ of values lie in the range from $51$ to $87$ (the interquartile range)
  • $25\%$ of values are less than $51$
  • $75\%$ of values are between $18$ and $87$
  • the top $25\%$ of values are least spread out because this region is the smallest
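These statements come from simple subtractions on the five number summary. As a rough sketch, the width of each region in the example box plot can be computed in Python; the dictionary keys are labels chosen for this illustration.

```python
# Five number summary read from the example box plot.
summary = {"min": 18, "Q1": 51, "median": 68, "Q3": 87, "max": 100}

marks = list(summary.values())
# Width of each of the four regions; each holds about 25% of the values.
widths = [upper - lower for lower, upper in zip(marks, marks[1:])]
print(widths)   # [33, 17, 19, 13]

iqr = summary["Q3"] - summary["Q1"]
narrowest = widths.index(min(widths)) + 1   # 1 = lowest quarter, 4 = top quarter
print(iqr, narrowest)   # 36 4
```

The fourth region has the smallest width, which is why the top quarter of values is the least spread out.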

 

Practice questions

Question 4

Question 5

The box plot below shows the age at which a group of people got their driving licenses.

A box plot displayed above a horizontal number line. The box plot represents the distribution of ages at which a group of people obtained their driving licenses.
The number line is titled "Age" and has major tick marks at intervals of $5$, ranging from $15$ to $35$. Between each pair of major tick marks there are four minor tick marks, each representing a $1$ unit increment. On the box plot, the box spans from $18$ (the first quartile) to $25$ (the third quartile), with a vertical line dividing the box at $22$ (the median). Whiskers extend from the edges of the box to $17$ on the left and $31$ on the right, representing the minimum and maximum data points, respectively.

  1. What is the oldest age at which someone got their license?

  2. What is the youngest age at which someone got their license?

  3. What percentage of people were aged from $18$ to $22$?

    A. $10\%$

    B. $25\%$

    C. $50\%$
  4. The middle $50\%$ of responders were within how many years of one another?

    A. $9$

    B. $6$

    C. $7$

    D. $8$
  5. In which quartile are the ages least spread out?

    A. $4$th

    B. $1$st

    C. $3$rd

    D. $2$nd
  6. The bottom $50\%$ of responders were within how many years of one another?

    A. $5$

    B. $4$

    C. $6$

 

 

Outliers and box plots

Sometimes a data set contains unusually high or low values. These unusual values are called outliers. They may arise from data collection errors or from the natural variation of the data.

We often want to identify the outlier values, and see the characteristics of the data without the effect of the outliers. In this case, we can construct a modified box plot to show the outlier values separately.

In later grades we will look at how to calculate the upper and lower bounds for outliers.

Practice questions

Question 6

Answer the following questions using the given grouped frequency table.

| Class | Class centre | Frequency |
| --- | --- | --- |
| $40\le x<45$ | $42.5$ | $3$ |
| $45\le x<50$ | $47.5$ | $4$ |
| $50\le x<55$ | $52.5$ | $7$ |
| $55\le x<60$ | $57.5$ | $3$ |
| $60\le x<65$ | $62.5$ | $3$ |
| $65\le x<70$ | $67.5$ | $9$ |
| $70\le x<75$ | $72.5$ | $4$ |
| $75\le x<80$ | $77.5$ | $5$ |
  1. Complete the five number summary using a CAS calculator.

    Minimum: $\editable{}$

    Lower quartile: $\editable{}$

    Median: $\editable{}$

    Upper quartile: $\editable{}$

    Maximum: $\editable{}$

  2. Calculate the interquartile range.

 
Question 7

The salaries earned by employees at a software company are given in the histogram below.

  1. Use your CAS calculator to construct a box plot, using the class centres.

    [Number line for the box plot, marked from $65000$ to $105000$ in steps of $5000$]

  2. Calculate the interquartile range.

  3. Using the box plot, approximately what percentage of salaries lie in the range $\$90000$ to $\$100000$?

  4. Complete the following statement.

    The highest $25\%$ of salaries lie between $\$\editable{}$ and $\$\editable{}$ inclusive.

 

Parallel box plots

Parallel box plots are used to compare two sets of data visually.

We call these parallel box plots because they are drawn parallel to each other along the same number line. Because they share the same scale, a visual comparison is fairly straightforward.

It is important to clearly label each box plot. Here we have plotted two sets of data, comparing the time it took two different groups of people to complete an online task.

We will see in later lessons that this format is very useful for comparing the characteristics of two (or more) data sets.
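Before drawing parallel box plots, it can help to set the two five number summaries side by side. The sketch below does this for two groups; the completion times are hypothetical values made up for illustration (they are not the data behind the plot described above), and `five_number_summary` is a helper written for this example using the median-of-halves method.

```python
from statistics import median

def five_number_summary(values):
    """Return (min, Q1, median, Q3, max) using the median-of-halves method."""
    ordered = sorted(values)
    n = len(ordered)
    half = n // 2
    return (ordered[0], median(ordered[:half]), median(ordered),
            median(ordered[half + n % 2:]), ordered[-1])

# Hypothetical completion times in minutes (made up for illustration).
groups = {
    "Group A": [12, 15, 11, 19, 14, 17, 13, 16],
    "Group B": [18, 22, 16, 25, 20, 21, 17, 24],
}

# Print one row per group so the summaries line up on a common scale.
print(f"{'':8}{'Min':>6}{'Q1':>6}{'Med':>6}{'Q3':>6}{'Max':>6}")
for name, times in groups.items():
    row = five_number_summary(times)
    print(f"{name:8}" + "".join(f"{value:>6}" for value in row))
```

Each row of the output corresponds to one box plot, and comparing the columns shows which group has the higher median or the wider spread.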

 

Practice question

Question 8

The heights (in metres) of the boys and girls in a class of $30$ students were recorded. The results are given in the table below.

Boys: $1.65$ $1.66$ $1.67$ $1.68$ $1.63$ $1.62$ $1.61$ $1.60$ $1.75$ $1.76$ $1.77$ $1.78$ $1.73$ $1.72$ $1.71$
Girls: $1.55$ $1.56$ $1.57$ $1.58$ $1.53$ $1.52$ $1.51$ $1.50$ $1.69$ $1.70$ $1.71$ $1.72$ $1.67$ $1.66$ $1.65$
  1. Complete the table for the given data of the heights of boys in the class.

    Minimum $\editable{}$
    First quartile $\editable{}$
    Median $\editable{}$
    Third quartile $\editable{}$
    Maximum $\editable{}$
  2. Complete the table for the given data of the heights of girls in the class.

    Minimum $\editable{}$
    First quartile $\editable{}$
    Median $\editable{}$
    Third quartile $\editable{}$
    Maximum $\editable{}$
  3. Draw a parallel box plot for this data.

Outcomes

9.B3.5

Pose and solve problems involving rates, percentages, and proportions in various contexts, including contexts connected to real-life applications of data, measurement, geometry, linear relations, and financial literacy.

9.D1.2

Represent and statistically analyse data from a real-life situation involving a single variable in various ways, including the use of quartile values and box plots.
