Australia, VIC
VCE 11 General 2023

1.05 Data distribution

Lesson

Introduction

When describing the shape of a data set, it is often useful to focus on how the data is distributed and whether the shape is symmetrical or not. Recall that the measures of centre previously explored were the median, mean and mode. Skew is considered relative to a central measure.

Shape of data

Data may be described as symmetrical or asymmetrical.

There are many cases where the data tends to be around a central value with no bias left or right. In such a case, roughly 50 \% of scores will be above the mean and 50 \% of scores will be below the mean. In other words, the mean and median roughly coincide.

A bell-shaped curve.

The normal distribution is a common example of a symmetrical distribution of data. The normal distribution looks like this bell-shaped curve.

A symmetrical curve drawn over the histogram. Ask your teacher for more information.

This picture shows how a data set that has an approximate normal distribution may appear in a histogram. The dark line shows the nice, symmetrical curve that can be drawn over the histogram that the data roughly follows.

A symmetrical curve labelled 0 in the middle and negative 3 in the left.

In this distribution, the 0 point in the middle represents the mean, the median and the mode. All of these measures of central tendency are equal for this distribution, since it is symmetrical. If there were a line at 0 as our axis of symmetry, notice that the left-hand side is a perfect reflection of the right-hand side.

If a data set is asymmetrical instead (i.e. it isn't symmetrical), it may be described as skewed.

A data set that has positive skew (sometimes called a 'right skew') has a longer tail of values above the peak of the graph, such that more than half of the scores are above the peak. This means the mean is greater than the median, which is greater than the mode: \text{mode} < \text{median} < \text{mean}

A positively skewed graph looks something like this:

General shape of positively skewed data with right side stretched out. Ask your teacher for more information.

Notice that there are more scores above the peak than below the peak.

A data set that has negative skew (sometimes called a 'left skew') has a longer tail of values below the peak of the graph, such that more than half of the scores are below the peak. This means the mode is greater than the median, which is greater than the mean: \text{mean} < \text{median} < \text{mode}

A negatively skewed graph looks something like this:

General shape of negatively skewed data with left side stretched out. Ask your teacher for more information.

Notice that there are more scores below the peak than above the peak.
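
The ordering of the three measures of centre under each kind of skew can be checked numerically. The sketch below uses two small data sets that are invented purely for illustration (they do not come from the graphs above) and Python's standard statistics module.

```python
from statistics import mean, median, mode

# Hypothetical scores, invented for illustration only.
positive_skew = [1, 1, 1, 2, 2, 3, 4, 6, 10]    # long tail of high values
negative_skew = [1, 5, 7, 8, 9, 9, 10, 10, 10]  # long tail of low values


def centre_measures(scores):
    """Return (mode, median, mean) for a list of scores."""
    return mode(scores), median(scores), mean(scores)


pos_mode, pos_median, pos_mean = centre_measures(positive_skew)
neg_mode, neg_median, neg_mean = centre_measures(negative_skew)

# Positive skew: mode < median < mean
print(pos_mode < pos_median < pos_mean)
# Negative skew: mean < median < mode
print(neg_mean < neg_median < neg_mode)
```

Both comparisons hold for these two sets. Small or irregular real data sets will not always follow the ordering this neatly.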

Examples

Example 1

State whether the scores in each histogram are positively skewed, negatively skewed or symmetrical (approximately).

a
A histogram starting from score 8 to 17. Ask your teacher for more information.
A
Positively skewed
B
Negatively skewed
C
Symmetrical
Worked Solution
Create a strategy

Check where the bulk of the data sits and look at the general shape of the distribution.

Apply the idea

The scores are roughly even in both the high and low end, so the distribution is symmetrical, option C.

b
A histogram with relatively low scores. Ask your teacher for more information.
A
Positively skewed
B
Negatively skewed
C
Symmetrical
Worked Solution
Create a strategy

Check where the bulk of the data sits and look at the general shape of the distribution.

Apply the idea

Most of the scores on the histogram are relatively low, so the distribution is positively skewed, option A.

c
A histogram with relatively high scores. Ask your teacher for more information.
A
Symmetrical
B
Negatively skewed
C
Positively skewed
Worked Solution
Create a strategy

Check where the bulk of the data sits and look at the general shape of the distribution.

Apply the idea

Most of the scores on the histogram are relatively high, so the distribution is negatively skewed, option B.

Idea summary

A distribution is said to be symmetric if its left and right sides are mirror images of one another.

A uniform distribution is a symmetrical distribution where each outcome is equally likely, so the frequency should be the same for each outcome.

A data set that has positive skew (sometimes called a 'right skew') has a longer tail of values to the right of the data set. The mass of the distribution is concentrated on the left of the figure. Most of the scores are relatively low.

A data set that has negative skew (sometimes called a 'left skew') has a longer tail of values to the left of the data set. The mass of the distribution is concentrated on the right of the figure. Most of the scores are relatively high.

Cluster, outliers, and modality

In a set of data, a cluster occurs when a large number of the scores are grouped together within a very small range.

The shape of data also shows us whether there are any outliers or unusually high or low values in a data set.

For example, in the dot plot below, do you see how all the ages range between 12 and 14 except one? This means that 24 is an outlier.

A dot plot showing ages from 11 to 25. Ask your teacher for more information.

In this case, the outlier is very obviously way outside the range of the rest of the data set.

The formal definition of an outlier is a score that is more than 1.5 \times \text{IQR} above the upper quartile, or more than 1.5 \times \text{IQR} below the lower quartile. This will be discussed further in a later section.

Modality describes the prevalence of local peaks in a data set. The peaks don't necessarily need to be the mode of the whole data set, but rather a local cluster of data that is more frequent and stands out from the surrounding data. When looking at the modality of a data set, it is usually useful to examine a graph of the data. Modality is described by the number of peaks.

A histogram showing prices ranging from 0 to 35. Ask your teacher for more information.

A data set that has two distinct peaks, as in the histogram shown here, is called bimodal.

A dot plot showing scores ranging from 0 to 20. Ask your teacher for more information.

To determine the modality of a distribution, simply identify the number of modal peaks. For instance, the data shown in the dot plot below has three modal peaks because there are local peaks at scores of 6,\,12, and 20.
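
Counting peaks can be sketched in code as well. The frequencies below are hypothetical: they were invented so that the peaks fall at 6, 12 and 20 and are not read from the dot plot. The helper local_peaks is also an illustrative name, and it uses a simple rule: a score counts as a peak when its frequency is higher than that of each neighbouring score.

```python
# Hypothetical frequencies (score -> number of dots), invented for illustration;
# they are not read from the lesson's dot plot.
frequencies = {2: 1, 4: 2, 6: 5, 8: 2, 10: 3, 12: 6, 14: 2, 16: 1, 18: 2, 20: 4}


def local_peaks(freq):
    """Return the scores whose frequency is higher than each neighbour's."""
    scores = sorted(freq)
    peaks = []
    for i, score in enumerate(scores):
        # Treat the missing neighbour at either end as having frequency 0.
        left = freq[scores[i - 1]] if i > 0 else 0
        right = freq[scores[i + 1]] if i < len(scores) - 1 else 0
        if freq[score] > left and freq[score] > right:
            peaks.append(score)
    return peaks


peaks = local_peaks(frequencies)
print(peaks)       # [6, 12, 20]
print(len(peaks))  # 3
```

This strict neighbour rule is only a rough guide. With real data, small bumps caused by chance would also be counted, so judging modality from a graph is usually more reliable.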

Examples

Example 2

For the Stem and Leaf plot attached:

Stem | Leaf
0 | 5
1 | 7 8
2 | 0 8
3 | 1 3 3 7 8 9
4 | 1 3 5 8 8 8
5 |
6 |
7 |
8 |
9 | 2
Key: 1 | 2 = 12
a

Are there any outliers?

Worked Solution
Create a strategy

Look for significantly large or small values in the data set.

Apply the idea

Looking at the stem, we can see that after 4, we have 9, which is significantly larger. So, yes, there is an outlier.

Reflect and check

Another way to look for an outlier is to find the upper and lower quartiles. The upper quartile is 45 and the lower quartile is 28, so the \text{IQR} is 45 - 28 = 17. An outlier is then any score that is more than 45 + 1.5 \times 17 = 70.5, or less than 28 - 1.5 \times 17 = 2.5. The largest score in the plot is well above 70.5, which confirms that there is an outlier.

b

Identify the outlier.

Worked Solution
Create a strategy

Use the answer in part (a).

Apply the idea

In part (a), we found an outlier and looking again at the stem and leaf plot, 92 is the outlier. \text{Outlier}=92

Reflect and check

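To be an outlier, a score has to be more than 1.5 \times \text{IQR} above the upper quartile or more than 1.5 \times \text{IQR} below the lower quartile. In this case, any outlier must be more than 70.5 or less than 2.5. Since 92 is greater than 70.5, it is an outlier. The lowest score, 5, is not less than 2.5, so it is not an outlier.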
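
The quartile check above can be reproduced with a short script. This is a minimal sketch that assumes the quartiles are taken as the medians of the lower and upper halves of the ordered scores, which matches the values of 28 and 45 quoted in this example. Other software may use a different quartile convention and give slightly different limits. The names quartiles, lower_limit and upper_limit are illustrative.

```python
from statistics import median

# Scores read from the stem-and-leaf plot in Example 2.
scores = [5, 17, 18, 20, 28, 31, 33, 33, 37, 38, 39,
          41, 43, 45, 48, 48, 48, 92]


def quartiles(data):
    """Lower and upper quartiles as the medians of the lower and upper halves."""
    ordered = sorted(data)
    half = len(ordered) // 2
    lower_half = ordered[:half]
    upper_half = ordered[-half:]  # leaves out the middle score when the count is odd
    return median(lower_half), median(upper_half)


q1, q3 = quartiles(scores)
iqr = q3 - q1
lower_limit = q1 - 1.5 * iqr
upper_limit = q3 + 1.5 * iqr
outliers = [x for x in scores if x < lower_limit or x > upper_limit]

print(q1, q3, iqr)               # 28 45 17
print(lower_limit, upper_limit)  # 2.5 70.5
print(outliers)                  # [92]
```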

c

Is there any clustering of data?

Worked Solution
Create a strategy

Look for a large number of the scores grouped within a small range.

Apply the idea

Looking at the stem, both 3 and 4 have 6 leaf values which can be identified as clustering - where the large portion of the scores lie. So, yes, there is clustering in the data set.

d

Where does the clustering occur?

A
10s to 20s
B
30s to 40s
C
20s to 30s
Worked Solution
Create a strategy

Use the answer in part (c).

Apply the idea

In part (c), we found that stems 3 and 4 have clustering of data. So, the correct answer is B.

e

What is the modal class(es)?

A
10-19
B
40-49
C
30-39
D
20-29
Worked Solution
Create a strategy

Look at the class that occurs most often. Note that there may be more than one modal class.

Apply the idea

We found in part (c) that stems 3 and 4 have the same number of data points, and both have the highest frequency.

So the correct answers are options B and C.

f

The distribution of the data is:

A
Positively skewed
B
Symmetrical
C
Negatively skewed
Worked Solution
Create a strategy

Ignore the outlier and check where most of the scores lie.

Apply the idea

Most of the scores are relatively high in the data set, which makes a negatively skewed distribution. So the correct answer is C.

Example 3

How many peaks are there on the graph?

A dot plot showing scores ranging from 2 to 20. Ask your teacher for more information.
Worked Solution
Create a strategy

Count how many high points are in the data distribution.

Apply the idea

Scores on 6,\,12, and 20 have high points or frequency in the data distribution, which makes them the peaks. So we have 3 peaks on the graph.

Idea summary

In a set of data, a cluster occurs when a large number of the scores are grouped together within a very small range.

An outlier is a value that is either noticeably greater or smaller than other observations.

Modality is described by the number of peaks. The peaks don't necessarily need to be the mode of the whole data set but rather a local cluster of data that is more frequent and stands out from the surrounding data.

Best choice of centre

It is important to note that certain features in a data set can significantly affect one or more of the three measures of central tendency (the mean, median and mode).

Remember the mode is the most frequently occurring score. So, if a data set has a significant number of repeated scores, then the mode could be a good measure of centre.

If the range of scores is reasonably small and there are no outliers, then the mean is an appropriate measure of centre.

Unlike the mean, the median is not affected by outliers. So, the median is a good measure of central tendency if a data set has outliers or a large range.

The shape of the data may also determine which measure of central tendency is the most appropriate measure of a data set.

If a data set is symmetrical, then the mean and median will be approximately equal. If the data is also unimodal (has only one mode), then the mode will be approximately equal to the mean and median as well. If the data has more than one mode (e.g. if it is bimodal), then the modes may be different to the mean and median.

When data is positively skewed, the mean is the highest measure of central tendency and the mode is the lowest measure of central tendency. For positively skewed data: \text{mode}<\text{median}<\text{mean}

When data is negatively skewed, the mode is the highest measure of central tendency and the mean is the lowest measure of central tendency. For negatively skewed data: \text{mean}<\text{median}<\text{mode}

Therefore, in skewed data, the most appropriate measure of central tendency will be the median.

Here is a basic summary of selecting an appropriate measure of central tendency. Note that often it can be helpful to consider more than one measure.

Data set ... | Mean | Median | Mode
has outliers | | yes |
has many repeated values | | | yes
has a relatively small range | yes | |
is skewed | | yes |

Of course, sometimes the context of the data being analysed lends itself to particular measures as well.

Examples

Example 4

Which measure of centre would be best for the following data set? 15,\,13,\,16,\,17,\,15,\,15,\,15

A
Mean
B
Mode
C
Median
Worked Solution
Create a strategy

Use the table to check the best measure of centre suited for the given data set.

Data set ... | Mean | Median | Mode
has outliers | | yes |
has many repeated values | | | yes
has a relatively small range | yes | |
is skewed | | yes |
Apply the idea

Notice that there are no outliers, but one value, 15, is repeated several times. So the mode would be the best measure of centre for the given data set. The correct answer is B.
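
As a quick numerical check of this example, the three measures of centre can be computed with Python's standard statistics module. This is only a sketch to support the reasoning above, not part of the expected working.

```python
from statistics import mean, median, mode

# Data set from Example 4.
scores = [15, 13, 16, 17, 15, 15, 15]

most_common = mode(scores)
repeats = scores.count(most_common)       # how many times the mode occurs
score_range = max(scores) - min(scores)   # highest score minus lowest score

print(most_common)             # 15
print(repeats)                 # 4
print(median(scores))          # 15
print(round(mean(scores), 2))  # 15.14
print(score_range)             # 4
```

The score 15 accounts for four of the seven values, which is why the mode describes this data set well. The mean and median are also close to 15 here because the range is small and there are no outliers.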

Example 5

Every week over 45 weeks, a kayaking club runs social sessions that are open to the public. On each session, the number of people who attend is recorded. The results are displayed in the table provided.

Number of people attending | Number of weeks
12 | 6
13 | 5
14 | 6
15 | 5
16 | 6
17 | 5
18 | 6
19 | 5
20 | 6

Considering the distribution of the responses, which of the following is true?

A
The mean is a better indicator of the typical number of people who attended each session than the median.
B
The median is a better indicator of the typical number of people who attended each session than the mean.
C
The mean and median are equally accurate indicators of the typical number of people who attended each session.
Worked Solution
Create a strategy

Check on how the data is distributed in the table.

Apply the idea

Notice that the distribution of the responses is fairly consistent. This distribution suggests that both mean and median can be equally accurate indicators of the data set. So the correct answer is C.
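
The conclusion can be checked by computing both measures from the table. The sketch below assumes the table reads as attendances of 12 to 20 people, with the number of weeks alternating between 6 and 5. The names attendance_weeks and sessions are illustrative.

```python
from statistics import mean, median

# Assumed reading of the table: number of people attending -> number of weeks.
attendance_weeks = {12: 6, 13: 5, 14: 6, 15: 5, 16: 6, 17: 5, 18: 6, 19: 5, 20: 6}

# Expand the frequency table into one attendance figure per session.
sessions = [people
            for people, weeks in attendance_weeks.items()
            for _ in range(weeks)]

session_mean = mean(sessions)
session_median = median(sessions)

print(session_mean, session_median)
```

Both measures come out at 16, the middle attendance value, because the frequencies are spread evenly and symmetrically around it.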

Idea summary

Here is a basic summary of selecting an appropriate measure of central tendency. Note that often it can be helpful to consider more than one measure.

Data set ... | Mean | Median | Mode
has outliers | | yes |
has many repeated values | | | yes
has a relatively small range | yes | |
is skewed | | yes |

Of course, sometimes the context of the data being analysed lends itself to particular measures as well.

Outcomes

U1.AoS1.2

the concept of a data distribution and its display using a statistical plot

U1.AoS1.3

the five-number summary and possible outliers
