We can summarize data in many ways, including with descriptive statistics like the mean, median, and mode, or with data displays like histograms and boxplots. The way we summarize data can depend on the type of data. In this lesson, we will look at summarizing and formulating questions for univariate data.
This histogram displays numerical data of the heights of students in a class.
Previously, we have seen these measures of center:
Sometimes one measure may better represent the data than another. When deciding which to use we need to ask ourselves "Which measure would best represent the type of data we have?"
This histogram summarizes numerical univariate data.
We can say the modal class is 100–\lt 200. What do you think that means?
One data value is 842. What could we call that value?
One of the measures of center is 199. Which measure of center would this be? Explain.
One of the measures of center is 114. Which measure of center would this be? Explain.
Depending on the shape of the distribution when it is graphed as a histogram, the locations of the measures of center can vary.
Mean and median can be the same and in the same bin as the mode.
Mean and median can be similar and both in the same bin as the mode.
Mean and median can be quite different, but in the same bin as the mode.
Mean, median, and mode can all be in different bins.
Measure | Benefits | Drawbacks |
---|---|---|
Mean | Includes all of the data in the calculation, widely used | Heavily impacted by extreme values or uneven distributions |
Median | Tells us the middle, not impacted by extreme values | Does not include all data values |
Mode | Quick to identify, tells us about the most frequent value(s) | Does not include all data values, is not necessarily in the middle |
The salaries of part-time employees at a company are given in the dot plot, rounded to the nearest thousand.
Which measure of center best reflects the typical wage of a part-time employee?
Calculate and interpret the chosen measure of center from part (a).
A journalist wanted to report on road speed cameras being used as revenue raisers. She obtained data that showed the number of times 20 speed cameras issued a fine to motorists in one month. The results were: 101, 102, 115, 115, 121, 124, 127, 128, 130, 130, 143, 143, 146, 162, 162, 163, 178, 183, 194, 977
Determine the mean number of times a speed camera issued a fine in that month. Give your answer rounded to one decimal place.
Determine the median number of times a speed camera issued a fine in that month. Give your answer rounded to one decimal place.
The journalist wants to give the impression that speed cameras are just being used to raise revenue. Which statement should she make? Explain.
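The effect of the single extreme value on each measure can be checked with a short script. This is a minimal sketch using Python's standard `statistics` module; the variable names are illustrative, and removing the value 977 is only an experiment to show sensitivity, not part of the original question.

```python
from statistics import mean, median

# Fines issued by each of the 20 speed cameras in one month (from the example above).
fines = [101, 102, 115, 115, 121, 124, 127, 128, 130, 130,
         143, 143, 146, 162, 162, 163, 178, 183, 194, 977]

mean_all = mean(fines)
median_all = median(fines)

# Drop the single extreme value (977) to see how much each measure moves.
without_extreme = [x for x in fines if x != 977]
mean_trimmed = mean(without_extreme)
median_trimmed = median(without_extreme)

print(f"All 20 cameras:  mean = {mean_all:.1f}, median = {median_all:.1f}")
print(f"Without the 977: mean = {mean_trimmed:.1f}, median = {median_trimmed:.1f}")
```

With all 20 values the mean is 182.2 and the median is 136.5. Without the extreme value the mean falls to about 140.4 while the median only moves to 130, which shows how much more strongly the mean reacts to one unusual camera.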
We can summarize numerical univariate data using measures of center. A measure of center is a single data value that describes the center or middle of a whole set of data.
Mean: the point on a number line where the data distribution is balanced.
Median: the middle value of a data set in ranked order.
Mode: the piece of data that occurs most frequently.
Measure | Benefits | Drawbacks |
---|---|---|
Mean | Includes all of the data in the calculation, widely used | Heavily impacted by extreme values |
Median | Tells us the middle, not impacted by extreme values | Does not include all data values |
Mode | Quick to identify, tells us about the most frequent value(s) | Does not include all data values, is not necessarily in the middle |
Previously, we have seen these measures of dispersion (spread):
There are two more measures of spread or dispersion called the variance and the standard deviation. Let's explore why they are helpful.
Two sets of data were collected from two different samples of passengers in an airport. Passengers were asked the approximate duration of the flight they had just been on.
Set A | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 | 13 | 14 | 15 | 16 | 17 |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
Set B | 1 | 1 | 1 | 1 | 5 | 9 | 9 | 9 | 9 | 9 | 9 | 9 | 13 | 17 | 17 | 17 | 17 |
Find the mean of both data sets. What does it tell us about the data?
Find the median of both data sets. What does it tell us about the data?
Find the range of both data sets. What does it tell us about the data?
Find the interquartile range of both data sets. What does it tell us about the data?
How would you describe the differences between the two sets in a way that the given summary statistics don't show?
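One way to check the summary statistics for both sets is with a short script. This is a sketch using Python's standard `statistics` module; the helper name `summarize` is hypothetical, and the quartiles use the module's default "exclusive" method, which is an assumption since quartile conventions differ between textbooks and calculators.

```python
from statistics import mean, median, quantiles

set_a = list(range(1, 18))  # 1, 2, ..., 17
set_b = [1, 1, 1, 1, 5, 9, 9, 9, 9, 9, 9, 9, 13, 17, 17, 17, 17]

def summarize(data):
    """Return the mean, median, range, and interquartile range of a data set."""
    q1, _, q3 = quantiles(data, n=4)  # default method is "exclusive"
    return {
        "mean": mean(data),
        "median": median(data),
        "range": max(data) - min(data),
        "iqr": q3 - q1,
    }

summary_a = summarize(set_a)
summary_b = summarize(set_b)
print("Set A:", summary_a)
print("Set B:", summary_b)
```

Both sets have a mean of 9, a median of 9, and a range of 16, so none of those three statistics distinguishes the evenly spread Set A from the clustered Set B.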
The variance is a way of showing how spread out the numbers in a data set are. The variance looks at the squares of the distances of each data value from the mean.
Consider the following data set with a mean of 73.7: 100, 51, 79, 57, 60, 64, 95, 98, 56, 77
A small variance indicates that most scores are close to the mean, while a large variance indicates that the scores are more spread out. This is the formula: \sigma^2=\dfrac{\sum\left(x-\mu\right)^2}{n}, where \mu is the mean and n is the number of data values.
The standard deviation is the square root of the variance and is often used when analyzing the dispersion of univariate data. (The square root undoes the squaring we did to make the distances positive when finding the variance.) This is the formula: \sigma=\sqrt{\dfrac{\sum\left(x-\mu\right)^2}{n}}
x | x-\mu | \left(x - \mu \right)^2 |
---|---|---|
100 | 26.3 | 691.69 |
51 | -22.7 | 515.29 |
79 | 5.3 | 28.09 |
57 | -16.7 | 278.89 |
60 | -13.7 | 187.69 |
64 | -9.7 | 94.09 |
95 | 21.3 | 453.69 |
98 | 24.3 | 590.49 |
56 | -17.7 | 313.29 |
77 | 3.3 | 10.89 |
\mu=73.7 | | \text{Sum}=3164.10 |
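The table can be reproduced step by step in code. The sketch below follows the same columns (x, x-\mu, and \left(x-\mu\right)^2) and then divides the sum of squares by the number of values, treating the data as a whole population; the variable names are illustrative.

```python
import math

data = [100, 51, 79, 57, 60, 64, 95, 98, 56, 77]

mu = sum(data) / len(data)               # mean
deviations = [x - mu for x in data]      # x - mu
squared = [d ** 2 for d in deviations]   # (x - mu)^2

sum_of_squares = sum(squared)
variance = sum_of_squares / len(data)    # population variance
std_dev = math.sqrt(variance)            # population standard deviation

# Print one row per data value, matching the columns of the table.
for x, d, s in zip(data, deviations, squared):
    print(f"{x:>4} | {d:>6.1f} | {s:>7.2f}")
print(f"mean = {mu:.1f}, sum = {sum_of_squares:.2f}")
print(f"variance = {variance:.2f}, standard deviation = {std_dev:.2f}")
```

Dividing the sum of 3164.10 by the 10 data values gives a variance of 316.41, and its square root gives a standard deviation of about 17.79.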
The process for calculating standard deviation is time-consuming, so we will be using our calculator to find the standard deviation. In statistics mode on a calculator, the symbol \sigma_{n} or \sigma_{x} may also be used.
There is a second type of standard deviation for when you are working with a sample and not a population. This is the sample standard deviation, with the symbol s or s_{x}. Generally, s and \sigma will be fairly close.
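To see how close the two versions are, both can be computed for the ten-value data set used above. This sketch relies on Python's standard `statistics` module, where `pstdev` divides by n (population) and `stdev` divides by n-1 (sample); treating the same values as either a population or a sample is only for comparison.

```python
from statistics import pstdev, stdev

data = [100, 51, 79, 57, 60, 64, 95, 98, 56, 77]

sigma = pstdev(data)  # population standard deviation: divides by n
s = stdev(data)       # sample standard deviation: divides by n - 1

print(f"population sigma = {sigma:.2f}")
print(f"sample s         = {s:.2f}")
```

Because the sample version divides by a smaller number, s is always a little larger than \sigma for the same values, and the gap shrinks as the data set gets larger.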
Here are some examples of what a data set can look like with the same measures of center, but different measures of dispersion.
Mean =5, Median =5, Standard deviation=2.84
Mean =5, Median =5, Standard deviation=2
Mean =5, Median =5, Standard deviation=0.93
Mean =5, Median =5, Standard deviation=0
When comparing the standard deviation, we should consider the scale or order of magnitude of the data. For example, the standard deviation for human baby weights might be smaller than for whale baby weights, but that does not necessarily mean that human baby weights are more consistent.
The number of push-ups Mario does each day is shown: 33, 32, 32, 32, 31, 32, 32, 32, 32, 32
Calculate the variance by hand using a spreadsheet or table. Round your answer to two decimal places.
Calculate his standard deviation by hand using a spreadsheet or table. Round your answer to two decimal places.
Use technology to find his standard deviation, rounded to two decimal places.
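If the technology available is a programming language rather than a calculator, the check might look like the sketch below. It assumes the ten days are treated as the whole population, so it uses `pvariance` and `pstdev` from Python's standard `statistics` module.

```python
from statistics import mean, pvariance, pstdev

push_ups = [33, 32, 32, 32, 31, 32, 32, 32, 32, 32]

average = mean(push_ups)
variance = pvariance(push_ups)  # population variance
std_dev = pstdev(push_ups)      # population standard deviation

print(f"mean = {average}")
print(f"variance = {variance:.2f}")
print(f"standard deviation = {std_dev:.2f}")
```

Only two of the ten values differ from the mean of 32, each by 1, so the variance is 2 divided by 10, or 0.20, and the standard deviation is about 0.45.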
The given data sets show the time to get to school for 20 students at two different schools, rounded to the nearest minute. Some summary statistics are given.
School A:
3 | 4 | 5 | 8 | 9 | 11 | 11 | 11 | 12 | 13 |
14 | 15 | 15 | 18 | 19 | 27 | 33 | 33 | 42 | 45 |
School B:
15 | 15 | 17 | 17 | 17 | 18 | 18 | 18 | 18 | 19 |
19 | 22 | 22 | 22 | 23 | 23 | 23 | 24 | 24 | 24 |
Statistic | School A | School B |
---|---|---|
Mean | 17.4 | 19.9 |
Median | 13.5 | 19 |
Variance | 142.14 | 9.09 |
Interpret and compare the means for both schools.
Interpret and compare the medians for both schools.
Calculate the standard deviations for both schools.
Interpret and compare the standard deviations for both schools.
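The given variances and the standard deviations can be checked from the raw travel times. This is a sketch with Python's standard `statistics` module; it assumes the first two rows of times belong to School A and the last two to School B (which matches the given means), and it treats each group of 20 students as a population.

```python
from statistics import mean, pvariance, pstdev

school_a = [3, 4, 5, 8, 9, 11, 11, 11, 12, 13,
            14, 15, 15, 18, 19, 27, 33, 33, 42, 45]
school_b = [15, 15, 17, 17, 17, 18, 18, 18, 18, 19,
            19, 22, 22, 22, 23, 23, 23, 24, 24, 24]

# Summarize each school with its mean, population variance, and
# population standard deviation.
results = {}
for name, times in (("School A", school_a), ("School B", school_b)):
    results[name] = {
        "mean": mean(times),
        "variance": pvariance(times),
        "std_dev": pstdev(times),
    }
    r = results[name]
    print(f"{name}: mean = {r['mean']:.1f}, "
          f"variance = {r['variance']:.2f}, std dev = {r['std_dev']:.2f}")
```

The standard deviations come out at about 11.92 minutes for School A and 3.01 minutes for School B, which are the square roots of the variances in the table.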
We can describe univariate data using measures of dispersion (spread). A measure of dispersion is a single value that describes how varied a data set is.
Range: the difference between the upper extreme and the lower extreme.
Interquartile range: the difference between the upper quartile and the lower quartile.
Variance: a measure of the spread of a data set. It is the mean of the squares of the differences between each element and the mean of the data set.
Standard deviation: a measure of the spread of a data set. It is the square root of the variance, that is, the square root of the mean of the squares of the differences between each element and the mean of the data set.
Measure | Benefits | Drawbacks |
---|---|---|
Range | Easy to calculate, tells about the extremes of the data | Heavily impacted by extreme values |
Interquartile range | Tells us about the middle half of the data, not impacted by extreme values | Does not include all data values |
Standard deviation | Tells us about how far values are from the mean, widely used in other areas of statistics | Impacted by extreme values, best to use technology to calculate |
The statistical investigation process is a process that begins with the need to solve a real-world problem and aims to reflect the way statisticians work. The data cycle gives us a nice structure to follow:
To help us formulate or write a question about numerical univariate data, we need to consider what variables we want to explore.
We want to formulate a question about univariate data if we are thinking about frequencies, measures of center or spread, or amounts. In contrast, with bivariate data we are thinking about relationships and predictions.
While univariate data always has exactly one variable, we may compare that variable across different categories, like comparing the heights of freshmen to the heights of seniors.
We've formulated statistical (investigative) questions with bivariate data. Now we'll write them for univariate data.
Well formulated statistical questions | Not statistical questions |
---|---|
How heavy are babies when they are born? | What is the heaviest recorded weight of a baby? |
How do the lengths of Oscar-nominated films compare to the lengths of Cannes Film Festival winners? | How long was the movie "Titanic"? |
There are different ways to collect data for our statistical question, and different questions are more suited to different methods.
Observation: Watching and noting things as they happen
Survey: Asking people questions to get information
Scientific experiment: Doing tests in a controlled way to get data
Acquire existing secondary data: Using data that was collected by a reliable source, like census data, the Common Online Data Analysis Platform (CODAP), or peer-reviewed studies
When doing a survey or using secondary sources, it is important that the data is collected from a sample that is representative of the population, so that our analysis of the data is valid.
We will aim to collect very large data sets because they provide a reasonable approximation for the population. This means that data displays like histograms will need to be used to group that data.
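Grouping values into class intervals, as a histogram does, can be sketched in a few lines. The example below reuses the 20 speed camera counts from earlier in the lesson purely as an illustration (a real investigation would use a much larger data set); the class width of 100 and the helper name `class_interval` are assumptions made for this sketch.

```python
from collections import Counter

fines = [101, 102, 115, 115, 121, 124, 127, 128, 130, 130,
         143, 143, 146, 162, 162, 163, 178, 183, 194, 977]

bin_width = 100  # assumed class width for this illustration

def class_interval(value, width):
    """Return the (lower, upper) bounds of the class containing value."""
    lower = (value // width) * width
    return (lower, lower + width)

# Count how many values fall in each class, as a histogram would.
counts = Counter(class_interval(x, bin_width) for x in fines)

for (lower, upper), frequency in sorted(counts.items()):
    print(f"{lower} to <{upper}: {frequency}")

# The modal class is the class interval with the highest frequency.
modal_class = max(counts, key=counts.get)
print("Modal class:", modal_class)
```

Each class includes its lower bound and excludes its upper bound, so a value of exactly 200 would be counted in the 200 to <300 class.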
Determine the type of data that needs to be collected for each statistical question.
How many times have students in the school been to Washington, DC?
Can we accurately predict a dog's adult size based on their birth weight?
How long do people take to run 3-mile races? Is this comparable for different age groups?
Atanasio's basketball coach just had a knee replacement. Now he is interested in wait times for joint replacement surgeries.
Formulate a question that could be used to explore the scenario.
Formulate a question that requires the use of a measure of center to explore the scenario.
Formulate a question that requires the use of a measure of dispersion to explore the scenario.
Diego loves attending amusement parks, but does not like waiting in lines. This leads him to ask the question: "How long do people wait to ride the newest roller coaster at Busch Gardens Williamsburg?"
Determine which method would be the most appropriate and explain why.
Explain how a sample could be selected to get unbiased data.
Explain what the standard deviation of the data set might tell us.
Explain what the median of the data might tell us.
We can formulate questions and then collect continuous numerical data to explore univariate data with large data sets. Univariate data will have one variable or attribute collected for each member of the sample or population.
A very large data set can help provide a reasonable approximation for the population. However, we need to make sure that the data comes from a sample that is representative of the population.
We can collect the data using surveys, experiments, observation, or secondary sources.