An outlier is a data point that varies significantly from the body of the data. An outlier will be a value that is either significantly larger or smaller than other observations. In this lesson, we will visually identify outliers present in a set of data. Outliers are important to identify as they point to unusual bits of data that may require further investigation. For example, if we had data on the temperature in a volcanic lake and found a high outlier, it would be worth investigating whether the sensor is faulty or whether we need to prepare a nearby town for evacuation.

Consider the dot plot below. We would call 9 an outlier as it is well above the rest of the data.

The image shows a dot plot with numbers 0 to 10 on the axis.

Examples

Example 1

Identify the outlier in the data set:

63,\, 67,\, 71,\, 76,\, 111

Worked Solution

Create a strategy

Identify the value that is much greater or much smaller than the other values.

Apply the idea

The outlier is 111.

Idea summary

An outlier is a data point that varies significantly from the body of the data. An outlier will be a value that is either significantly larger or smaller than other observations.

Quantify outliers

To determine if a data value is an outlier, there is a rule that involves the interquartile range (IQR). This rule calculates the upper and lower fences of a box plot. The fences are the upper and lower boundaries, and any score that lies outside the fences is classified as an outlier.

A data point is classified as an outlier if it lies above the upper fence or below the lower fence. These are calculated as follows:

\text{Lower fence} = \text{Lower quartile} -1.5 \times \text{Interquartile Range}\\ \text{Upper fence} = \text{Upper quartile} +1.5 \times \text{Interquartile Range}
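As a rough illustration, the fence rule can be written as a small Python function. The function name `fences` and the quartile values 10 and 20 are hypothetical, chosen only for this sketch.

```python
def fences(q1, q3):
    """Return (lower_fence, upper_fence) from the lower and upper quartiles."""
    iqr = q3 - q1                  # interquartile range
    lower_fence = q1 - 1.5 * iqr
    upper_fence = q3 + 1.5 * iqr
    return lower_fence, upper_fence


# Hypothetical quartiles, not taken from the lesson's data sets.
lower, upper = fences(10, 20)
print(lower, upper)  # -5.0 35.0
```

With these quartiles the IQR is 10, so any score below -5 or above 35 would be classified as an outlier.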

Using the five-number summary and the upper and lower fences, we can construct a box plot and identify any outliers.

A box plot showing the first quartile, third quartile, lower fence, and upper fence.

The above diagram shows the construction of a box plot and where the upper and lower fences lie. Any data points outside the fences are outliers.

Examples

Example 2

Consider the data set below: 1,\,1,\,3,\,21,\,7,\,9,\,10,\,6,\,11

(a) Construct a five number summary.

Worked Solution

Create a strategy

Determine the minimum and maximum scores, the lower and upper quartiles, and the median.

Apply the idea

First, sort the data set to be in order. We have: 1,\,1,\,3,\,6,\,7,\,9,\,10,\,11,\,21

We can clearly see that the minimum is 1 and the maximum is 21.

There are 9 scores, so the middle score will be the fifth score when the scores are in order. The median is 7.

To find the quartiles, we take the parts of the list on either side of the median. The lower part is 1,\,1,\,3,\,6, and the upper part is 9,\,10,\,11,\,21. Since both have an even number of scores, we find the average of their middle scores: Q_1=\dfrac{1+3}{2}=2 \qquad \qquad Q_3=\dfrac{10+11}{2}=10.5

So our five number summary is:

\text{Minimum}	1
Q_1	2
\text{Median}	7
Q_3	10.5
\text{Maximum}	21
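The same steps can be sketched in Python. This is a minimal sketch that assumes the quartile method used above (the median of each half, leaving out the middle score when the number of scores is odd) and at least two scores. The function name `five_number_summary` is hypothetical. Library routines such as `statistics.quantiles` use other conventions and may give different quartiles.

```python
from statistics import median


def five_number_summary(scores):
    """Return (minimum, Q1, median, Q3, maximum) using the median-of-halves method."""
    ordered = sorted(scores)
    n = len(ordered)
    half = n // 2
    lower_half = ordered[:half]
    # With an odd number of scores, the middle score belongs to neither half.
    upper_half = ordered[half + 1:] if n % 2 else ordered[half:]
    return (ordered[0], median(lower_half), median(ordered),
            median(upper_half), ordered[-1])


data = [1, 1, 3, 21, 7, 9, 10, 6, 11]
print(five_number_summary(data))  # (1, 2.0, 7, 10.5, 21)
```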

(b) Determine if there are any outliers.

Worked Solution

Create a strategy

Calculate the lower fence and upper fence using the formulas:

\text{Lower fence} = \text{Lower quartile} -1.5 \times \text{IQR}\\\text{Upper fence} = \text{Upper quartile} +1.5 \times \text{IQR}

Apply the idea

We calculate the \text{IQR} first. We found in part (a) that Q_1=2 and Q_3=10.5.

\displaystyle \text{IQR}	\displaystyle =	\displaystyle Q_3-Q_1	Write the formula
	\displaystyle =	\displaystyle 10.5-2	Substitute the quartiles
	\displaystyle =	\displaystyle 8.5	Evaluate

\displaystyle \text{Lower fence}	\displaystyle =	\displaystyle 2-1.5\times8.5	Substitute the values
	\displaystyle =	\displaystyle -10.75	Evaluate using a calculator
\displaystyle \text{Upper fence}	\displaystyle =	\displaystyle 10.5+1.5\times8.5	Substitute the values
	\displaystyle =	\displaystyle 23.25	Evaluate using a calculator

Every score lies between the lower fence of -10.75 and the upper fence of 23.25, so there are no outliers in this data set.

Reflect and check

Notice that although 21 is quite a bit larger than the other scores in the set, it is not far enough away to be considered an outlier. We should always check the upper and lower fence values to determine if a score is an outlier or not.

(c) Draw a box plot to represent this data.

Worked Solution

Create a strategy

Use the five number summary from part (a).

Apply the idea

Example 3

Consider the dot plot below.

A dot plot of the scores ranging from 1 to 9.

(a) Determine the median, lower quartile score and the upper quartile score.

Worked Solution

Create a strategy

Use the formula \dfrac{n+1}{2} to find the position of the median, then find each quartile as the median of the scores on its side of the median.

Apply the idea

There are n=16 scores. The position of the median is given by:

\displaystyle \text{Position of median}	\displaystyle =	\displaystyle \dfrac{16+1}{2}	Substitute n=16
	\displaystyle =	\displaystyle \dfrac{17}{2}	Evaluate the addition
	\displaystyle =	\displaystyle 8.5	Evaluate the division

The median is the average of the 8th and 9th scores. Since both are 3, the median is also 3.

\text{Median}=3
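The position formula can be checked with a short Python sketch. The function names are hypothetical, and the scores are the 16 values listed in this worked solution.

```python
def median_position(n):
    """1-based position of the median among n ordered scores."""
    return (n + 1) / 2


def median_from_position(ordered_scores):
    """Find the median of already ordered scores from its position."""
    position = median_position(len(ordered_scores))
    below = int(position)              # position rounded down (1-based)
    if position == below:
        return ordered_scores[below - 1]
    # A fractional position such as 8.5 means: average the two neighbouring scores.
    return (ordered_scores[below - 1] + ordered_scores[below]) / 2


scores = [1, 1, 2, 2, 2, 2, 2, 3, 3, 3, 3, 4, 4, 4, 5, 9]
print(median_position(len(scores)))   # 8.5
print(median_from_position(scores))   # 3.0
```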

The lower quartile will be the median of the first 8 scores: 1,\,1,\,2,\,2,\,2,\,2,\,2,\,3

We can see that the lower quartile is 2.

\text{Lower quartile}=2

The upper quartile will be the median of the last 8 scores: 3,\,3,\,3,\,4,\,4,\,4,\,5,\,9

We can see that the upper quartile is 4.

\text{Upper quartile}=4

(b) Hence, calculate the interquartile range.

Worked Solution

Create a strategy

Use the interquartile range formula: \text{IQR} = Q_{3} - Q_{1}

Apply the idea

\displaystyle \text{IQR}	\displaystyle =	\displaystyle 4-2	Substitute the quartiles
	\displaystyle =	\displaystyle 2	Evaluate

(c) Calculate 1.5\times \text{IQR}, where \text{IQR} is the interquartile range.

Worked Solution

Create a strategy

Use the calculated \text{IQR} from part (b).

Apply the idea

\displaystyle 1.5\times \text{IQR}	\displaystyle =	\displaystyle 1.5\times 2	Substitute the \text{IQR}
	\displaystyle =	\displaystyle 3	Evaluate

(d) An outlier is a score that is more than 1.5\times \text{IQR} above the upper quartile or below the lower quartile. State the outlier.

Worked Solution

Create a strategy

Calculate the lower fence and upper fence using the formulas:

\text{Lower fence} = \text{Lower quartile} -1.5 \times \text{IQR}\\\text{Upper fence} = \text{Upper quartile} +1.5 \times \text{IQR}

Apply the idea

In part (a), we found that the upper quartile is 4 and the lower quartile is 2. In part (c), we found the value, 1.5\times \text{IQR}=3.

\displaystyle \text{Lower fence}	\displaystyle =	\displaystyle 2-3	Substitute the values
	\displaystyle =	\displaystyle -1	Evaluate
\displaystyle \text{Upper fence}	\displaystyle =	\displaystyle 4+3	Substitute the values
	\displaystyle =	\displaystyle 7	Evaluate

Since 9 is above the upper fence of 7, the outlier is 9.
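Parts (a) to (d) can be combined into one Python sketch that goes from raw scores to the list of outliers. It assumes the median-of-halves quartile method used in this lesson and at least two scores. The function name `find_outliers` is hypothetical.

```python
from statistics import median


def find_outliers(scores):
    """Return the scores that lie outside the lower and upper fences."""
    ordered = sorted(scores)
    n = len(ordered)
    half = n // 2
    q1 = median(ordered[:half])
    # With an odd number of scores, the middle score is left out of both halves.
    q3 = median(ordered[half + 1:] if n % 2 else ordered[half:])
    step = 1.5 * (q3 - q1)             # 1.5 x IQR
    lower_fence = q1 - step
    upper_fence = q3 + step
    return [x for x in ordered if x < lower_fence or x > upper_fence]


dot_plot_scores = [1, 1, 2, 2, 2, 2, 2, 3, 3, 3, 3, 4, 4, 4, 5, 9]
print(find_outliers(dot_plot_scores))  # [9]
```

Running the same function on the data from Example 2 returns an empty list, which agrees with the conclusion there that 21 is not an outlier.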

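Idea summary

A data point is classified as an outlier if it lies above the upper fence or below the lower fence:

\text{Lower fence} = \text{Lower quartile} -1.5 \times \text{Interquartile Range}\\ \text{Upper fence} = \text{Upper quartile} +1.5 \times \text{Interquartile Range}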

Effects of outliers

Once an outlier is identified, the underlying cause of the outlier should be investigated. If the outlier is simply a mistake, then it should be removed from the data. Mistakes can often occur when recording or transferring data by hand, or when conducting a survey where a respondent may not take the questionnaire seriously. If the outlier is not a mistake, it should not be removed from the data set: while it is unusual, it is representative of possible outcomes. For example, you would not remove a very tall student's height from data for a class just because it was unusual for the class.

When data contains an outlier we should be aware of its impact on any calculations we make. Let's look at the effect that outliers have on the three measures of centre (mean, median and mode):

The Mean will be significantly affected by the inclusion of an outlier:

Including a high outlier will increase the mean.
Including a low outlier will decrease the mean.

The Median is the middle value of a data set. The inclusion of an outlier will not generally have a significant impact on the median unless there is a large gap in the centre of the data.

Including a high outlier may increase the median slightly or it may remain unchanged.
Including a low outlier may decrease the median slightly or it may remain unchanged.

The Mode is the most frequent value. As an outlier is an unusual value, it will not be the mode, so the inclusion of an outlier will have no impact on the mode.
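The remark about a large gap in the centre of the data can be illustrated with a short Python sketch. Both data sets and the extreme value 100 are hypothetical, made up only for this illustration, and the function name `median_shift` is an assumption.

```python
from statistics import median


def median_shift(scores, extra):
    """Change in the median when one extra score is included."""
    return median(scores + [extra]) - median(scores)


clustered = [10, 11, 12, 13, 14]   # hypothetical data with no gap in the centre
gapped = [1, 2, 3, 20, 21]         # hypothetical data with a large gap after 3

print(median_shift(clustered, 100))  # 0.5
print(median_shift(gapped, 100))     # 8.5
```

In the clustered set the median only moves from 12 to 12.5. In the gapped set it moves from 3 to 11.5, because the new middle pair of scores (3 and 20) sits on either side of the gap.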

Examples

Example 4

Consider the following set of data: 37,\,46,\,35,\,56,\,56,\,35,\,125,\,36,\,48,\,56

(a) Fill in this table of summary statistics.

Mean	\quad
Median	\quad
Mode	\quad

Worked Solution

Create a strategy

To find the mean, use the formula: \text{Mean}=\dfrac{\text{Sum of scores}}{\text{Number of scores}}

To find the median, find the middle score.

To find the mode, find the most frequent score.

Apply the idea

\displaystyle \text{Mean}	\displaystyle =	\displaystyle \dfrac{37+46+35+56+56+35+125+36+48+56}{10}	Use the formula
	\displaystyle =	\displaystyle \dfrac{530}{10}	Evaluate the addition
	\displaystyle =	\displaystyle 53	Evaluate the division

To find the median, order the scores: 35,\,35,\,36,\,37,\,46,\,48,\,56,\,56,\,56,\,125

The middle scores are: 46,\,48

\displaystyle \text{Median}	\displaystyle =	\displaystyle \dfrac{46+48}{2}	Find the average
	\displaystyle =	\displaystyle 47	Evaluate

To find the mode, choose the score which occurs most often.

\text{Mode}=56

By filling in the table:

Mean	53
Median	47
Mode	56

(b) Which data value is an outlier?

Worked Solution

Create a strategy

Choose the value that is much greater or much smaller than the rest of the data set.

Apply the idea

\text{Outlier}=125

(c) Fill in this table of summary statistics after removing the outlier.

Mean	\quad
Median	\quad
Mode	\quad

Worked Solution

Apply the idea

\displaystyle \text{Mean}	\displaystyle =	\displaystyle \dfrac{37+46+35+56+56+35+36+48+56}{9}	Use the formula
	\displaystyle =	\displaystyle \dfrac{405}{9}	Evaluate the addition
	\displaystyle =	\displaystyle 45	Evaluate the division

To find the median, order the scores: 35,\,35,\,36,\,37,\,46,\,48,\,56,\,56,\,56

The middle score is 46.

\text{Median}=46

To find the mode, choose the score which occurs most often.

\text{Mode}=56

By filling in the table:

Mean	45
Median	46
Mode	56

(d) Let A be the original data set and B be the data set without the outlier.

Complete the table using the symbols >,< and = to compare the statistics before and after removing the outlier.

	\text{With outlier}		\text{Without\ outlier}
Mean:	A	⬚	B
Median:	A	⬚	B
Mode:	A	⬚	B

Worked Solution

Create a strategy

Compare the statistics in part (a) and in part (c).

Apply the idea

Statistics from parts (a) and (c):

	With outlier	Without outlier
Mean	53	45
Median	47	46
Mode	56	56

Comparison table:

	\text{With outlier}		\text{Without\ outlier}
Mean:	A	>	B
Median:	A	>	B
Mode:	A	=	B
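The two tables of summary statistics can be reproduced with Python's standard `statistics` module. This is only a check of the arithmetic above. The helper name `summarise` is hypothetical.

```python
from statistics import mean, median, mode


def summarise(scores):
    """Return the mean, median and mode of a list of scores."""
    return {"mean": mean(scores), "median": median(scores), "mode": mode(scores)}


data_a = [37, 46, 35, 56, 56, 35, 125, 36, 48, 56]   # original data set A
data_b = [x for x in data_a if x != 125]             # data set B, outlier removed

with_outlier = summarise(data_a)       # mean 53, median 47, mode 56
without_outlier = summarise(data_b)    # mean 45, median 46, mode 56

for name in ("mean", "median", "mode"):
    print(name, with_outlier[name], without_outlier[name])
```

Removing 125 lowers the mean by 8 but the median by only 1, and leaves the mode unchanged.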

Idea summary

The Mean will be significantly affected by the inclusion of an outlier:

Including a high outlier will increase the mean.
Including a low outlier will decrease the mean.

The Median is the middle value of a data set. The inclusion of an outlier will not generally have a significant impact on the median.

The Mode is the most frequent value. As an outlier is an unusual value, it will not be the mode, so the inclusion of an outlier will have no impact on the mode.

Suitability of measures of centre

We can use the mean, median, or mode to describe the centre of a data set. Sometimes one measure may better represent the data than another and sometimes we want just one statistic for an article or report rather than detail on the different measures. When deciding which to use we need to ask ourselves which measure would best represent the type of data we have. Some main considerations are:

Is there a repeated value? If there are no repeated values or only a couple of randomly repeated values then the mode will not be representative of the data. If there are one or two highly frequent data points these may be a fair representation of the centre of the data.
Is there an outlier? As we have seen, an outlier will significantly affect the mean, which may give a distorted view of the centre of the data. For example, if we had a list of houses sold in an area and a historic mansion was sold for a price well above the other houses in the area, then using the median would be a better representation of average house prices in the area than the mean.
Do you need all the data values to be taken into account? Only the mean uses all the values in its calculation.

Examples

Example 5

The salaries of part-time employees at a company are given in the dot plot below. Which measure of centre best reflects the typical wage of a part-time employee?

A dot plot of salaries in thousand dollars, ranging from 18 to 38.

Worked Solution

Create a strategy

Choose the measure that is appropriate for data with extreme values.

Apply the idea

The data set has extreme values at 37 and 38 that lie far from the rest of the set.

So the median is the best measure of centre, since it will not be affected by the extreme values.

Idea summary

Main considerations when choosing a suitable measure of centre:

If there are no repeated values or only a couple of randomly repeated values then the mode will not be representative of the data.
If there are outliers the median will be more suitable than the mean.
Only the mean uses all the values in its calculation.

Outcomes

U1.AoS1.5

construct and interpret graphical displays of data, and describe the distributions of the variables involved and interpret in the context of the data

1.08 Outliers
