7.05 Histograms with smooth curves

Histograms and smooth curves

A histogram titled Baby seal weights with Frequency on the y-axis, and Weight of baby seals (lbs) on the x-axis. Ask your teacher for more information.

We have seen that a histogram can be used to represent a set of univariate numerical data.

Remember that a histogram is a data display that divides the data into bins or intervals and shows the frequency of data points within each bin with the height of each bar.

Sometimes we will see a frequency polygon on a histogram which is a line graph that follows the shape of the histogram using either the left corners, centers, or right corners of the bars.

A histogram with line graph titled Baby seal weights with Frequency on the y-axis, and Weight of baby seals (lbs) on the x-axis. Ask your teacher for more information.
Center of bars
A histogram with line graph titled Baby seal weights with Frequency on the y-axis, and Weight of baby seals (lbs) on the x-axis. Ask your teacher for more information.
Left corner of bars
A histogram with line graph titled Baby seal weights with Frequency on the y-axis, and Weight of baby seals (lbs) on the x-axis. Ask your teacher for more information.
Right corner of bars

Notice that the pieces of the bars above the frequency polygon could be used to fill in the unfilled areas below the frequency polygon.
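The idea that the pieces above the polygon fill the gaps below it can be checked numerically. The following is a minimal sketch, assuming made-up seal weights, a bin width of 5 lbs, and a frequency polygon drawn through the center of each bar and closed off at zero half a bin beyond each end. None of these values come from the lesson's histogram.

```python
# Hypothetical baby seal weights in lbs (made-up data, not from the lesson).
weights = [42, 45, 47, 51, 52, 53, 55, 56, 57, 58, 58, 61, 62, 63, 66, 68, 71, 74]

bin_width = 5    # assumed width of each bin
bin_start = 40   # assumed left edge of the first bin
num_bins = 7     # bins cover 40 up to (but not including) 75

# Count how many weights fall in each bin [left, left + bin_width).
frequencies = [0] * num_bins
for w in weights:
    index = (w - bin_start) // bin_width
    frequencies[index] += 1

# Vertices of the frequency polygon through the center of each bar.
centers = [bin_start + bin_width * (i + 0.5) for i in range(num_bins)]
polygon = list(zip(centers, frequencies))

# Total area of the histogram bars.
bar_area = bin_width * sum(frequencies)

# Area under the polygon, closed off with a zero-height point on each side,
# found by adding the trapezoids between neighboring vertices.
heights = [0] + frequencies + [0]
polygon_area = sum(bin_width * (a + b) / 2 for a, b in zip(heights, heights[1:]))

print(frequencies)             # frequency in each bin
print(bar_area, polygon_area)  # the two areas match
```

With these assumptions the two areas agree, which is why the cut-off pieces and the gaps balance out for a center-based polygon.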

To give us a better understanding of the overall shape of the data, we can draw a smooth curve over the histogram. This curve is sometimes called a density curve and shows where values are concentrated.

The image shows a bell-shaped curve drawn over a histogram. Ask your teacher for more information.

The dark line is the smooth curve that can be drawn over the histogram to model the distribution.

Like the frequency polygons, the smooth curves may cross the bars in the center, left, or right.

A distribution shows the shape of a data set and how its values are spread out or clustered within the range.

We have formal ways to describe the shape:

The image shows a bell-shaped curve drawn over a histogram. Ask your teacher for more information.

This data distribution is symmetrical, or bell-shaped. It has no skew.

The image shows a curve shown over a histogram of positively skewed data. Ask your teacher for more information.

This data distribution shows a positive skew. Notice the "tail" at the positive end as it trails off. This is sometimes called a right skew.

The image shows a curve over a histogram of negatively skewed data. Ask your teacher for more information.

This data distribution shows a negative skew. Notice the "tail" at the negative end as it trails off. This is sometimes called a left skew.

Based on the symmetry or skew of the distribution, we can make observations about the measures of center: mean, median, and mode.

Exploration

Select different distribution shapes using the checkboxes under the histogram and notice how the measures of center compare.

  1. What do you notice about the mean, median and mode when the data is symmetrical?

  2. What do you notice about the mean, median and mode when the data is positively skewed?

  3. What do you notice about the mean, median and mode when the data is negatively skewed?

A bell-shaped curve with a dashed line at the peak, labeled with Mean, Median, and Mode.

For symmetrical or non-skewed data:

  • Roughly 50 \% of scores will be above the mean and 50 \% of scores will be below the mean.

  • The mean, median and mode should roughly coincide.

A right skew with 3 dashed lines labeled Mode, Median, and Mean.

For a positive or right skew:

  • The data frequencies are much lower as you move to the right.

  • The mean will get pulled up by the skew, so is usually greater than the median.

  • The median is usually greater than the mode.
A left skew with 3 dashed lines labeled Mean, Median, and Mode.

For a negative or left skew:

  • The data frequencies are much lower as you move to the left.

  • The mean is pulled down by the skew, so is usually less than the median.

  • The median is usually less than the mode.
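The relationships above can be illustrated with a short sketch using Python's built-in `statistics` module. The three data sets are made up for illustration, and the helper name `centers` is hypothetical.

```python
from statistics import mean, median, mode

# Hypothetical data sets (made up for illustration, not from the lesson).
symmetric = [1, 2, 2, 3, 3, 3, 4, 4, 5]
right_skewed = [1, 2, 2, 2, 3, 3, 4, 6, 13]       # long tail of large values
left_skewed = [1, 8, 10, 11, 11, 12, 12, 12, 13]  # long tail of small values


def centers(data):
    """Return (mean, median, mode) for a list of numbers."""
    return mean(data), median(data), mode(data)


for name, data in [
    ("symmetric", symmetric),
    ("right skew", right_skewed),
    ("left skew", left_skewed),
]:
    print(name, centers(data))
```

For these particular lists, the three measures coincide for the symmetric set, the mean is the largest for the right-skewed set, and the mean is the smallest for the left-skewed set. Real data usually follows the same pattern, but not always.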

The range is a helpful measure of spread when analyzing data distributions. In later lessons, we will look at how standard deviation can be read from a symmetrical smooth curve.

Examples

Example 1

The given histogram represents the distribution of hours students sleep each night.

Select the smooth curve that most accurately models this distribution.

A histogram with Frequency on the y-axis, and Sleep duration (hours) on the x-axis. Ask your teacher for more information.
A
A negative skew with Frequency on the y-axis, and Sleep duration (hours) on the x-axis. Ask your teacher for more information.
B
A negative skew with Frequency on the y-axis, and Sleep duration (hours) on the x-axis. Ask your teacher for more information.
C
A negative skew with Frequency on the y-axis, and Sleep duration (hours) on the x-axis. Ask your teacher for more information.
D
A line graph with Frequency on the y-axis, and Sleep duration (hours) on the x-axis. Ask your teacher for more information.
Worked Solution
Create a strategy

Look for the smooth curve that closely follows the shape and peak of the histogram bars, accurately representing the distribution's center, spread, and symmetry.

The correct curve will typically align well with the measures of center and spread of the data displayed in the histogram.

Apply the idea

This histogram shows a negative (left) skew and mode around 6 with a frequency of 175, so we are looking for a smooth curve with the same key features.

Option A: This smooth curve has a negative (left) skew, but the peak is around 5.5 with a frequency of 125, so this option is not correct.

Option B: This smooth curve has a peak around 6, but the peak has a frequency of 125 and the curve is symmetric in shape, so this option is not correct.

Option C: This smooth curve has a negative (left) skew and shows a peak around 6 with frequency close to 170, so this curve is correct.

Option D: This is not a smooth curve, just a frequency polygon, so it is jagged. This is not the correct option.

Reflect and check

Depending on whether we use the right corner, left corner, or middle of the bars, there can be multiple correct smooth curves.

Example 2

The given smooth curve was created from data collected on the statistical question "What weekly pay is typical for a job while in high school?"

A positive skew with Frequency on the y-axis, and Weekly pay ($) on the x-axis. Ask your teacher for more information.
a

Describe and interpret the shape of the distribution.

Worked Solution
Create a strategy

Is there a tail that slowly tapers off? If so, which side is it on? This will tell us the shape.

Apply the idea

There is a tail on the right side where the larger values are, so the shape is a positive (right) skew.

This means that the majority of students earn a weekly pay on the lower end (around \$150 per week), but a few earn a high amount (around \$350 per week).

b

Estimate and interpret an appropriate measure of center.

Worked Solution
Create a strategy

For a symmetric distribution the mean, median, and mode occur right at the peak. Since this distribution has a positive (right) skew, the median will be pulled up (to the right) a little from the mode, and the mean will be pulled up even more from the mode.

Apply the idea
A positive skew with labels Mode, Median, and Mean. Ask your teacher for more information.

The most appropriate measure of center for this data is the median because the mean is more heavily impacted by the skew and the mode is not at all impacted by the skew. The median will allow us to take the higher data values into account without letting them have too much of an impact.

By using the fact that the skew pulls the median slightly higher than the mode, we can estimate the median weekly pay for high school students to be around \$175 per week.

Reflect and check

The mode would also be a suitable measure at around \$150 per week.

The mean would likely be even higher, possibly closer to \$200 per week.

c

Estimate and interpret an appropriate measure of spread.

Worked Solution
Create a strategy

Measures of spread include range, interquartile range, and standard deviation.

Apply the idea

Without the raw data, we cannot determine the standard deviation or interquartile range.

We can estimate that the maximum pay is about \$400 and the minimum pay is about \$50, so the range is about 400-50=350. This means that the weekly pay for high school students varies by about \$350 from the lowest pay to the highest pay.
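If the raw data were available, all three measures of spread could be calculated directly. The following is a minimal sketch, assuming a made-up list of weekly pay values chosen so that the minimum is \$50 and the maximum is \$400. The list is hypothetical and is not the data behind the curve.

```python
from statistics import pstdev, quantiles

# Hypothetical weekly pay values in dollars (made up; the lesson gives no raw data).
weekly_pay = [50, 100, 125, 150, 150, 150, 175, 175, 200, 225, 250, 300, 350, 400]

# Range: highest value minus lowest value.
pay_range = max(weekly_pay) - min(weekly_pay)

# quantiles(..., n=4) returns the three quartile cut points Q1, Q2, Q3.
q1, q2, q3 = quantiles(weekly_pay, n=4)
iqr = q3 - q1

# Population standard deviation of the listed values.
std_dev = pstdev(weekly_pay)

print(pay_range, iqr, round(std_dev, 1))
```

Note that `quantiles` uses one particular convention for locating quartiles, so its results may differ slightly from quartiles found by the method used in class.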

Reflect and check

We will learn how to estimate the standard deviation from a symmetric smooth curve in future lessons.

d

What can be inferred from this distribution?

A
Most students with a job make more than \$200 per week.
B
The weekly amounts the students make are evenly distributed from \$50 to \$400.
C
No students make more than \$350.
D
Most students with a job make more than \$150 per week.
Worked Solution
Create a strategy

We can go through each statement and see which are correct.

Apply the idea

Option A: "Most" usually means more than half, so this is saying that the median is more than \$200, which is not the case. Most students earn between \$100 and \$200 per week. This is incorrect.

Option B: "Evenly distributed" means that the frequency would be about the same for all amounts between \$50 and \$400. An evenly distributed smooth curve would be a horizontal line. This is incorrect.

Option C: The curve does not drop to zero at or before \$350, so some students earned more than \$350. This is incorrect.

Option D: In part (b), we estimated that the median was around \$175, so about half of the students earn more than \$175. This means that more than half earn more than \$150. This is correct.

Option D is the correct answer.

Idea summary

A distribution is symmetrical if its left and right sides are mirror images of one another.

A data set that has positive or right skew has a longer tail of values to the right of the data set. The mass of the distribution is concentrated on the left of the figure.

A data set that has negative or left skew has a longer tail of values to the left of the data set. The mass of the distribution is concentrated on the right of the figure.

Compare data distributions

When we look at two or more univariate data displays, like smooth curves, there are certain characteristics that can help us compare the two data distributions.

Center

The center of a data set describes the entire data set with a single number. The mode is found at the peak of the smooth curve, but the mean and median may be pulled away from the peak, toward the tail, if the data is skewed.

Spread

The spread of a data set describes how varied or similar the data is. The spread of a density curve can be determined by the width of the curve at the x-axis.

Shape

The shape of a data set can be determined by looking at the outline of the curve and describes the distribution of data within the set.

Identifying the shape of a density curve can help us understand the corresponding data set. Here are some examples of how we describe the shape of density curves:

The image shows Negatively skewed (left skewed), Symmetrical distribution, and Positively skewed (right skewed). Ask your teacher for more information.

Exploration

Consider the following histograms that show the height of students in two basketball teams. We know that one graph represents a team made up of Grade 12 students and the other represents a Grade 9 team.

A histogram with symmetrical distribution with Number of students on the y-axis, and Height (in) on the x-axis. Ask your teacher for more information.
Team A
A histogram with positively skewed with Number of students on the y-axis, and Height (in) on the x-axis. Ask your teacher for more information.
Team B
  1. Are there the same number of students in each team? Does it matter?

  2. What are the similarities and differences in terms of measures of spread, central tendency and shape of data?

  3. Which team do you think corresponds to the Grade 12 team, and which team do you think corresponds to the Grade 9 team?

It is important to be able to compare data sets because it helps us make conclusions or judgements about the data. For example, suppose Jim scores 50\% on a geography test and 70\% on a history test. Based on those grades alone, it makes sense to say that he did better in history.

However, looking at the smooth curves that represent the class results, we can see they tell a different story.

Two smooth curves, one for the geography class and one for the history class, with Grade (%) on the x-axis and Number of students on the y-axis.

Notice the geography class had a mean of 40\%, while the history class had a mean of 80\%. Now we know that Jim scored well above the average in geography, and well below the average in history. With this extra information, it makes more sense to say that he did better in geography.
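The comparison can be written out as a short calculation. This is a minimal sketch that uses Jim's scores and the class means stated above; the variable names are chosen for illustration.

```python
# Jim's scores and the class means given in the text (all in percent).
results = {
    "geography": {"score": 50, "class_mean": 40},
    "history": {"score": 70, "class_mean": 80},
}

# How far Jim's score is above (+) or below (-) each class mean.
differences = {
    subject: r["score"] - r["class_mean"] for subject, r in results.items()
}

# The subject where Jim is furthest above the class mean.
best_subject = max(differences, key=differences.get)

print(differences)
print(best_subject)
```

Comparing each score to its class mean is only one way to judge relative performance. It ignores the spread of each class, which later lessons address with standard deviation.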

Examples

Example 3

The following curves show the math test results for two different classes. Curves 1 and 2 show the results for class 1 and class 2 respectively.

The image shows a graph with two bell curves representing test results (%). Ask your teacher for more information.
a

State the similarities and differences between the following pair of density curves.

Worked Solution
Create a strategy

To compare and contrast the two curves, we can look to the shapes, centers, and spreads of each curve.

Apply the idea

The shape of curve 1 is skewed right and the shape of curve 2 is skewed left, so the shapes are both skewed, but in opposite directions.

The mean of curve 1 will be above the peak due to the skew, so will be around 65\%. The mean of curve 2 will be below the peak due to the skew, so will be around 85\%. Their centers are quite different.

The spread of curve 1 goes from about 40\% to 90\%, a range of about 50 percentage points, and the spread of curve 2 goes from about 55\% to 100\%, a range of about 45 percentage points. Therefore we can say the spreads are over different percentages, but are about the same size.

b

Interpret the test results of class 1 and class 2.

Worked Solution
Create a strategy

We can use the findings from part (a) in order to draw conclusions about the test results.

Apply the idea

Class 1 has lower test results than class 2 on average and has a large spread of results. Class 2 has a much higher average test score than class 1, but also has a fairly large spread.

c

If Anthony scored 60\% in Class 1, and Brodie scored 80\% in Class 2, who did better?

Worked Solution
Create a strategy

We can look at how their results compare to their class because the class results are very different which means that one class might have had a much harder test.

Apply the idea
The image shows a graph with two curves representing test results (%). Curve 1 is blue and peaks around 60%, labeled 'Anthony'. Curve 2 is red and peaks around 80%, labeled 'Brodie'.

Anthony's grade is just above the grades of most students in the class, while Brodie's grade is below the grades of most students in his class. This tells us that Anthony did better when we consider the results of their classmates.

Example 4

The following curves show the distributions of the race times for two different years of the Shelby Forest Loop Marathon.

The image shows a graph with two curves representing time (hr). The curve for 2020 is peaking around 4 hours. The curve for 2021 is peaking slightly lower around 4.5 hours. The x-axis ranges from 3 to 10 hours.

Describe the similarities and differences between the following pair of smooth curves.

Worked Solution
Create a strategy

We can look at the shape, measure of center or clusters, and the range as a measure of spread.

Apply the idea

Both density curves have a significant right skew. However, 2021 has another small peak around 9 hours, while 2020 tapers off smoothly. 2021 is slightly more skewed than 2020.

Their centers are also fairly similar with a mean at around 5 hours. Both means are to the right of the peak. They are pulled there by the right skew of the data.

The spread of 2020 is slightly lower than 2021 as there were no extreme data values with times greater than 8 hours, unlike in 2021. The times extend to about 8 hours in 2020 and to about 10 hours in 2021, so the range is larger and the times are slightly less consistent in 2021.

Reflect and check

Looking at the corresponding histograms and frequency polygons, we can see that the 2021 race has an extreme value which leads to the secondary bump. With real-life data, it is often not perfectly smooth.

The image shows a histogram of the 2020 Race results with a frequency distribution. The x-axis represents time in hours, ranging from 3 to 10 hours. The y-axis represents the frequency, with values ranging from 0 to 16. A line graph overlays the histogram, illustrating the distribution curve.
The image shows a histogram of the 2021 Race results with a frequency distribution. The x-axis represents time in hours, ranging from 3 to 10 hours. The y-axis represents the frequency, with values ranging from 0 to 18. A line graph overlays the histogram, illustrating the distribution curve.
Idea summary

We can first identify and then compare the measures of center, measures of spread, and shape of smooth curves and histograms.

The context of the curves is important to consider when interpreting the comparisons.

Outcomes

A2.ST.1

The student will apply the data cycle (formulate questions; collect or acquire data; organize and represent data; and analyze data and communicate results) with a focus on univariate quantitative data represented by a smooth curve, including a normal curve.

A2.ST.1c

Examine the shape of a data set (skewed versus symmetric) that can be represented by a histogram, and sketch a smooth curve to model the distribution.

A2.ST.1e

Describe and interpret a data distribution represented by a smooth curve by analyzing measures of center, measures of spread, and shape of the curve.

A2.ST.1j

Compare multiple data distributions using measures of center, measures of spread, and shape of the distributions.
