Recall that data can be either numerical or categorical. While column (or bar) graphs are preferred for displaying categorical data, histograms are the preferred option for data that is numerical.
Continuous numerical data, such as times, heights, weights or temperatures, are based on measurements, so any data value is possible within a large range of values. For displaying this type of data, a histogram is used.
Although a histogram looks similar to a bar chart, there are a number of important differences between them:
Histograms show the distribution of data values, whereas a bar chart is used to compare data values.
Histograms are used for numerical data, whereas bar charts are often used for categorical data.
A histogram has a numerical scale on both axes, while a bar chart only has a numerical scale on the vertical axis.
The columns in a bar chart could be re-ordered without affecting the representation of the data. In a histogram, each column corresponds to a range of values on a continuous scale, so the columns cannot be re-ordered.
A Histogram:
A Bar chart:
Key features of a frequency histogram:
The horizontal axis is a continuous numerical scale (like a number line). It represents numerical data, such as time, height, mass or temperature, and may be divided into class intervals.
The vertical axis is the frequency of each data value or class interval.
There are no gaps between the columns because the horizontal axis is a continuous scale. It is possible for a class interval to have a frequency of zero, but this is not the same as having gaps between the columns.
It is good practice, when creating a histogram, to leave a half-column-width gap between the vertical axis and the first column.
Class interval | Frequency |
---|---|
45\leq \text{time} \lt 50 | 9 |
50\leq \text{time} \lt 55 | 7 |
55\leq \text{time} \lt 60 | 20 |
60\leq \text{time} \lt 65 | 30 |
65\leq \text{time} \lt 70 | 6 |
What may surprise us at first is that the histogram has only five columns, even though it represents 72 different data values.
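As a quick check, the frequencies in the table can be added up to confirm the total number of data values. The sketch below uses the table from the text; the function names and the text-based columns of `#` symbols are only illustrative, not a standard way of drawing a histogram.

```python
# Frequency table from the text: (lower endpoint, upper endpoint, frequency).
table = [
    (45, 50, 9),
    (50, 55, 7),
    (55, 60, 20),
    (60, 65, 30),
    (65, 70, 6),
]

def total_frequency(rows):
    """Add up the frequencies to get the number of data values."""
    return sum(freq for _, _, freq in rows)

def text_histogram(rows):
    """Return one line per class interval, with one '#' per data value."""
    return [f"{lo}-{hi} | " + "#" * freq for lo, hi, freq in rows]

print(total_frequency(table))  # 72
for line in text_histogram(table):
    print(line)
```

Five class intervals give five columns, no matter how many data values they contain.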
To produce the histogram, the data is first grouped into class intervals (which are also called bins), using the frequency distribution table.
In the table above,
The first class interval includes the running times for 9 different runners. Each of their times falls within a range that is greater than or equal to 45 minutes, but less than 50 minutes. This class interval is represented by the first column in the histogram.
The second class interval includes the running times for 7 different runners, each with a time falling within a range that is greater than or equal to 50 minutes, but less than 55 minutes. This class interval is represented by the second column in the histogram, and so on.
Every data value must go into exactly one class interval. Class intervals should be of equal width.
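The grouping step can be sketched in a few lines of code. This is a minimal illustration, assuming equal-width class intervals where the lower endpoint is included and the upper endpoint is excluded. The function name `group_into_classes` and the list `times` are hypothetical; the running times are made up and are not the data behind the table above.

```python
def group_into_classes(values, start, width, count):
    """Count how many values fall in each interval lower <= x < lower + width."""
    frequencies = [0] * count
    for x in values:
        # Number of whole class widths between the first lower endpoint and x.
        index = int((x - start) // width)
        if not 0 <= index < count:
            raise ValueError(f"{x} is outside the class intervals")
        frequencies[index] += 1
    return frequencies

# Hypothetical running times in minutes (assumed data, for illustration only).
times = [45.0, 47.2, 49.9, 50.0, 54.3, 58.1, 61.5, 64.9, 65.0, 69.5]

print(group_into_classes(times, start=45, width=5, count=5))  # [3, 2, 1, 2, 2]
```

A time of exactly 50.0 minutes is counted in the second class interval, not the first, so each value lands in exactly one interval.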
There are several different ways that class intervals are defined. Here are some examples with two adjacent class intervals:
Class interval formats | Description |
---|---|
45\lt \text{time} \leq 50 \qquad 50\lt \text{time} \leq 55 | \text{Upper endpoint included, lower endpoint} \\ \text{excluded.} |
45\leq \text{time} \lt 50 \qquad 50\leq \text{time} \lt 55 | \text{Lower endpoint included, upper endpoint} \\ \text{excluded.} |
45\text{ to} \lt 50 \qquad \qquad 50\text{ to} \lt 55 | \text{Lower endpoint included, upper endpoint} \\ \text{excluded.} |
45 - 49 \qquad \, \, \quad \qquad 50 - 54 | \text{Suitable for data rounded to the nearest } \\ \text{whole number, or discrete data.} |
45 \to 50\qquad \quad \qquad 50 \to 55 | \text{Not clear which endpoints are included or} \\ \text{ excluded. Assume the upper endpoint is included.} |
Regardless of the format used, it should be applied consistently across all class intervals for a given set of data.
Note: In this course, class intervals for any particular set of data will be the same width. There are situations in data representation when class intervals are different widths, but this is beyond the scope of this course.
The class centre is the average of the endpoints of each interval.
For example, if the class interval is 45\leq \text{time} \lt 50, or 45-50, the class centre is calculated as follows: \begin{aligned} \text{Class centre}&=\dfrac{45+50}{2} \\ &=47.5 \end{aligned}
Since the class centre is an average of the endpoints, it is often used as a single value to represent the class interval. In some histograms, it may be used for the scale on the horizontal axis, with the class centre displayed directly below the middle of each vertical column.
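The same calculation can be written as a small function. This is only a sketch: the name `class_centre` is chosen for illustration, and the intervals are the running-time class intervals from the table earlier in the lesson.

```python
def class_centre(lower, upper):
    """Average of the two endpoints of a class interval."""
    return (lower + upper) / 2

# Endpoints of the running-time class intervals used earlier.
intervals = [(45, 50), (50, 55), (55, 60), (60, 65), (65, 70)]

centres = [class_centre(lo, hi) for lo, hi in intervals]
print(centres)  # [47.5, 52.5, 57.5, 62.5, 67.5]
```

Because the class intervals are all the same width, the class centres are evenly spaced too.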
Find the class centre for the class interval 19\leq t<23 where t represents time.
In product testing, the number of faults detected during the production of a certain piece of machinery is recorded each day for several days. The frequency table shows the results.
Number of faults | Frequency |
---|---|
0 - 3 | 10 |
4 - 7 | 14 |
8 - 11 | 20 |
12 - 15 | 16 |
Construct a histogram to represent the data.
What is the lowest possible number of faults that could have been recorded on any particular day?
As part of a fuel watch initiative, the price of petrol at a service station was recorded each day for 21 days. The frequency table shows the findings.
Price (cents per Litre) | Class Centre | Frequency |
---|---|---|
130.9 - 135.9 | 133.4 | 6 |
135.9 - 140.9 | 138.4 | 5 |
140.9 - 145.9 | 143.4 | 5 |
145.9 - 150.9 | 148.4 | 5 |
What was the highest price that could have been recorded?
How many days was the price above 140.9 cents?
Key features of a frequency histogram:
The horizontal axis is a continuous numerical scale (like a number line). It represents numerical data, such as time, height, mass or temperature, and may be divided into class intervals.
The vertical axis is the frequency of each data value or class interval.
There are no gaps between the columns because the horizontal axis is a continuous scale. It is possible for a class interval to have a frequency of zero, but this is not the same as having gaps between the columns.
It is good practice, when creating a histogram, to leave a half-column-width gap between the vertical axis and the first column.
For grouped data:
Every data value must go into exactly one class interval.
Each class interval must be the same width, e.g. 0-5, 5-10, 10-15, \ldots or 10-20, 20-30, 30-40, \ldots
The class centre is the average of the end points of the class interval.
Dot plots are a graphical way of displaying the distribution of numerical or categorical data on a simple scale with dots representing the frequency of data values. They are best used for small to medium size sets of data and are good for visually highlighting how the data is spread and whether there are any gaps in the data or outliers. We will look at identifying outliers in more detail in our next lesson.
In a dot plot, each individual value is represented by a single dot, displayed above a horizontal line. When data values are identical, the dots are stacked vertically. The graph appears similar to a pictograph or column graph with the number of dots representing the total count.
To correctly display the distribution of the data, the dots must be evenly spaced in columns above the line
The scale or categories on the horizontal line should be evenly spaced
A dot plot does not have a vertical axis
The dot plot should be appropriately labelled
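These features can be seen in a simple text version of a dot plot. The sketch below is illustrative only: the function name `text_dot_plot` and the data are hypothetical, and it assumes single-digit whole-number values so that the dots line up above the scale.

```python
from collections import Counter

def text_dot_plot(values):
    """Return the lines of a text dot plot, top row first, scale last."""
    counts = Counter(values)
    # Evenly spaced scale covering every whole number from lowest to highest.
    scale = list(range(min(values), max(values) + 1))
    tallest = max(counts.values())
    lines = []
    for level in range(tallest, 0, -1):
        # Place a dot wherever the value occurs at least `level` times.
        row = " ".join("o" if counts[v] >= level else " " for v in scale)
        lines.append(row.rstrip())
    lines.append(" ".join(str(v) for v in scale))
    return lines

# Hypothetical data (assumed for illustration only).
goals = [0, 1, 1, 2, 2, 2, 4]
for line in text_dot_plot(goals):
    print(line)
```

The value 3 still appears on the scale even though it has no dots, which is how a dot plot shows a gap in the data.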
Here is a dot plot of the number of goals scored in each of Bob’s soccer games.
How many times was one goal scored?
Which numbers of goals were scored equally often, and most often?
How many games were played in total?
A stem plot, or stem and leaf plot, is used for organising and displaying numerical data. It is appropriate for small to moderately sized data sets. The graph is similar to a column graph on its side. An advantage of a stem and leaf plot over a column graph is that the individual scores are retained, so further calculations can be made accurately.
In a stem and leaf plot, the right-most digit in each data value is split from the other digits, to become the leaf. The remaining digits become the stem.
Stem | Leaf |
---|---|
1 | 0\ 3\ 6 |
2 | 1\ 6\ 7\ 8 |
3 | 5\ 5\ 6 |
4 | 1\ 1\ 5\ 6\ 9 |
5 | 0\ 3\ 6\ 8 |
Key 2\vert 1 = 21 |
The stems are arranged in ascending order, to form a column, with the lowest value at the top
The leaf values are arranged in ascending order from the stem out, in rows, next to their corresponding stem
A single vertical line separates the stem and leaf values
There are no commas or other symbols between the leaves, only a space between them
In order to correctly display the distribution of the data, the leaves must line up in imaginary columns, with each data value directly below the one above
A stem and leaf plot includes a key that describes the way in which the stem and the leaf combine to form the data value
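The construction can be sketched in code. This minimal example assumes non-negative whole-number data, with the last digit as the leaf and the remaining digits as the stem. The function name `stem_and_leaf` is chosen for illustration; the scores are the values shown in the plot above, listed out of order.

```python
def stem_and_leaf(values):
    """Return (stem, leaves) pairs with stems and leaves in ascending order."""
    data = sorted(values)
    # Include every stem between the lowest and highest, even if it has no leaves.
    stems = range(data[0] // 10, data[-1] // 10 + 1)
    plot = {stem: [] for stem in stems}
    for x in data:
        plot[x // 10].append(x % 10)
    return list(plot.items())

scores = [35, 10, 58, 21, 41, 26, 13, 45, 27, 50,
          16, 35, 28, 41, 53, 36, 46, 56, 49]

for stem, leaves in stem_and_leaf(scores):
    print(f"{stem} | " + " ".join(str(leaf) for leaf in leaves))
print("Key: 2 | 1 = 21")
```

Sorting the data first means the leaves come out in ascending order from the stem without any extra work.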
The stem-and-leaf plot below shows the ages of the people who entered through the gates of a concert in the first 5 seconds.
Stem | Leaf |
---|---|
1 | 1\ 2\ 6\ 8\ 9\ 9 |
2 | 0\ 2\ 3\ 4\ 5\ 7\ 7\ 8 |
3 | 2\ 2\ 4\ 7 |
4 | |
5 | 9 |
Key 1\vert 2 = 12 |
How many people passed through the gates in the first 5 seconds?
What was the age of the youngest person?
What was the age of the oldest person?
What proportion of the concert-goers were under 24 years old?
Back-to-back stem and leaf plots allow for the display of two data sets at the same time. These types of plots are a great way to make comparisons between data sets.
Reading a back-to-back stem and leaf plot is very similar to a regular stem and leaf plot. The "stem" is used to group the scores and each "leaf" indicates the individual scores within each group. The "stem" is a column and the stem values are written downwards in that column. The "leaf" values are written across in the rows corresponding to the "stem" value. In a back-to-back stem-and-leaf plot, however, two sets of data are displayed simultaneously. One set of data is displayed with its leaves on the left, and the other with its leaves on the right. The "leaf" values are still written in ascending order from the stem outwards.
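The layout can be sketched in code as well. This is a minimal illustration, assuming non-negative whole-number data. The function name `back_to_back` and both data sets are hypothetical. The left-hand leaves are reversed so that they still read in ascending order from the stem outwards.

```python
def back_to_back(left_values, right_values):
    """Return (left leaves, stem, right leaves) rows for two data sets."""
    everything = list(left_values) + list(right_values)
    stems = range(min(everything) // 10, max(everything) // 10 + 1)
    rows = []
    for stem in stems:
        left = sorted(x % 10 for x in left_values if x // 10 == stem)
        right = sorted(x % 10 for x in right_values if x // 10 == stem)
        # Reverse the left side so the smallest leaf sits next to the stem.
        rows.append((left[::-1], stem, right))
    return rows

# Hypothetical data sets (assumed for illustration only).
group_a = [12, 25, 21, 30]
group_b = [14, 18, 27, 33, 35]

for left, stem, right in back_to_back(group_a, group_b):
    left_text = " ".join(str(leaf) for leaf in left)
    right_text = " ".join(str(leaf) for leaf in right)
    print(f"{left_text:>5} | {stem} | {right_text}")
```

In the row for stem 2, the left side prints as `5 1`, which represents 21 and 25 when read from the stem outwards.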
The back-to-back stem plots show the number of pieces of paper used over several days by Maximillian’s and Charlie’s students.
Maximillian's students | | Charlie's students |
---|---|---|
7 | 0 | 7 |
3 | 1 | 1\ 2\ 3 |
8 | 2 | 8 |
4\ 3 | 3 | 2\ 3\ 4 |
7\ 6\ 5 | 4 | 9 |
3\ 2 | 5 | 2 |
Key: 6 \vert 1 \vert 2 = 16 \text{ and }12
Which of the following statements are true?
I. Maximillian's students did not use 7 pieces of paper on any day.
II. Charlie's median is higher than Maximillian’s median.
III. The median is greater than the mean in both groups.
A back-to-back stem plot is very similar to a regular stem plot, in that the "stem" is used to group the scores and each "leaf" indicates the individual scores within each group.
If you have to create your own stem-and-leaf plot, it's easier to write all your scores in ascending order before you start putting them into a stem and leaf plot.