7.01 Formulate questions and collect bivariate data

Formulate questions and collect bivariate data

The statistical investigation process begins with the need to solve a real-world problem and aims to reflect the way statisticians work. The data cycle gives us a useful structure to follow:

A data cycle with four stages. At the top, there is Formulate questions represented by a speech bubble with a question mark. To the right, Collect or acquire data is shown with an icon of a person and a magnifying glass. At the bottom, Organize and represent data is illustrated with a dot plot. To the left, Analyze and communicate results is indicated by a person with charts. Clockwise arrows are drawn from one stage to the next.

Recall that bivariate data is data which is represented by two variables. This data is typically numerical and can be organized with a scatterplot.

We need to identify the variables that will be explored in the data cycle. Identifying the independent variable and dependent variable is important for formulating our questions and for accurate data analysis.

Previously, we have only looked at a single set of bivariate data, but we can also compare bivariate data across the groups of a categorical variable.

Categorical variables can be added to a scatterplot using color or different symbols.

A scatterplot showing age and height of males and females. Speak to your teacher for more information.
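One way to prepare such a display is to split the data by category so that each group can be drawn with its own symbol. The sketch below uses made-up age and height values; the names `records`, `MARKERS`, and `group_by_category` are illustrative assumptions, not part of any particular graphing tool.

```python
from collections import defaultdict

# Hypothetical records: (age in years, height in cm, category).
records = [
    (12, 150, "female"),
    (12, 148, "male"),
    (14, 160, "female"),
    (14, 163, "male"),
    (16, 163, "female"),
    (16, 174, "male"),
]

# Assumed symbol choice for each category.
MARKERS = {"female": "o", "male": "x"}


def group_by_category(rows):
    """Return {category: [(x, y), ...]} so each group can be plotted separately."""
    groups = defaultdict(list)
    for x, y, category in rows:
        groups[category].append((x, y))
    return dict(groups)


series = group_by_category(records)
for category, points in series.items():
    print(MARKERS[category], category, points)
```

Each list of points could then be passed to a graphing tool as its own series, with the matching symbol or color.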

Once we know what variables we will be exploring, we need to formulate a question that requires the collection of data that can be analyzed using a data display.

Statistical question

A question that can be answered by collecting data and whose answer may vary depending on the sample the data is collected from.

It can also be called an investigative question.

We need to look at the data source when questions are formulated. We can consider:

  • What population do we want to make a conclusion about?

  • How can we find relevant data? Is it easy to acquire secondary data that already exists?

  • Who will be using the conclusions from the analysis?

Once the question has been formulated, we need to determine how to collect or acquire the necessary data. Here are some ways to collect data:

  • Research using secondary sources to find existing data.

  • Surveys can be done by asking each member of the representative sample two questions or giving them a questionnaire. Answers can be more open-ended than in a poll.

  • Observations can be made by watching members of the sample and noting particular characteristics.

  • Scientific experiments can be done by carefully selecting a sample and controlling as many other variables as possible, then varying the independent variable and measuring how the dependent variable responds.

Exploration

There are many formulas and ways to estimate or predict someone's adult height such as doubling their height at age 2 or using a combination of their biological parents' heights. With a partner or in a small group, use the data cycle to explore relationships involving adult height.

  1. Brainstorm potential statistical questions.

  2. What would the variables be for each question? Are they all numerical or are some categorical?

  3. Do you think there could be a single model that would be accurate across all demographics like race and gender? Explain.

  4. Which data collection method might be best for exploring this relationship?

When doing a survey or using secondary sources, it is important that the data is collected from a sample that is representative of the population, so that our analysis of the data is valid.

Representative means that the characteristics of the sample should be similar to those of the population.

A concept of sampling from a population. On the left is a large circle labeled Population containing many diverse cartoon faces representing individuals. On the right is a smaller circle labeled Sample with a subset of the faces from the population, connected by an arrow indicating selection from the larger group to the smaller.
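As a rough illustration of what "similar characteristics" can mean, the sketch below compares the share of each category in a sample with its share in the population. The grade-level data and the helper names `shares` and `largest_gap` are made up for this example, and a small gap in one characteristic does not by itself show that a sample is representative.

```python
from collections import Counter


def shares(labels):
    """Return the fraction of the list that falls in each category."""
    counts = Counter(labels)
    total = len(labels)
    return {category: count / total for category, count in counts.items()}


def largest_gap(population, sample):
    """Largest difference between a category's population share and sample share."""
    pop_shares = shares(population)
    sample_shares = shares(sample)
    return max(abs(pop_shares[c] - sample_shares.get(c, 0.0)) for c in pop_shares)


# Hypothetical grade levels for a school of 100 students and two samples of 10.
population = ["9th"] * 30 + ["10th"] * 30 + ["11th"] * 20 + ["12th"] * 20
balanced_sample = ["9th"] * 3 + ["10th"] * 3 + ["11th"] * 2 + ["12th"] * 2
skewed_sample = ["9th"] * 8 + ["10th"] * 2

print(largest_gap(population, balanced_sample))
print(largest_gap(population, skewed_sample))
```

The second sample leaves out two grade levels entirely, so its gap is much larger than the first sample's.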

In this course, we will aim to collect larger data sets because they provide a reasonable approximation for the population.

When there is bias in the data cycle, we may get misleading or inaccurate conclusions.

Statistical bias

Any aspect of the data cycle process that leads to a difference between the conclusion and the actual truth for the population.

Examples: sampling bias, observer bias, measurement bias

Sampling bias can occur due to undercoverage or exclusion when a particular subgroup is under-represented or fully excluded.

There are a number of ways we can avoid bias in our sample, including:

  • Having a sample that is large enough to represent the characteristics of the population. The larger the sample size, the closer the results tend to be to those of the population.

  • Having a sample that is selected without strategically choosing more people from a certain group.

  • Randomly selecting the sample.
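Random selection can be sketched in a few lines. The example assumes a hypothetical population of 500 student ID numbers and a sample size of 50; the seed value is arbitrary and is only there so the illustration gives the same sample each time it is run.

```python
import random

# Hypothetical population: student ID numbers 1 to 500.
population = list(range(1, 501))

# A fixed seed only makes this illustration repeatable; it is not needed in practice.
rng = random.Random(2024)

# sample() chooses without replacement, so no student is picked twice,
# and every student has the same chance of being chosen.
sample = rng.sample(population, k=50)

print(sorted(sample)[:5])
```

Because the choice does not depend on any characteristic of the students, no group is deliberately favored, although any single random sample can still differ from the population by chance.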

Examples

Example 1

First identify the variables, then write an investigative question related to each scenario.

a

The local souvenir shop has noticed that their sweatshirt sales seem to be related to the temperature outside. They want to investigate this relationship more closely.

Worked Solution
Create a strategy

When creating an investigative question, it should focus specifically on the relationship between the two variables involved, which in this case are temperature and sweatshirt sales.

Apply the idea

Independent variable: Temperature

Dependent variable: Sweatshirt sales

Some possible investigative questions might be:

  1. Does a decrease in temperature correlate with an increase in sweatshirt sales?

  2. How does temperature relate to the number of sweatshirts sold?

  3. Is there a specific temperature range that results in the highest sweatshirt sales?

Reflect and check

Each of these questions investigates a different aspect of the relationship, making them effective investigative questions.

If we wanted to be even more explicit, we could add "For the local souvenir shop" to the beginning of each question to make the population clear. However, this should be clear based on the scenario.

b

For a school in Fairfax, VA, the principal noticed that the number of days missed by a student in September is a good predictor of the number of total days they will be absent throughout the year. She wants to investigate this relationship.

Worked Solution
Create a strategy

An investigative question must require the collection of data to answer it. Once we have identified the variables, we can formulate the question.

Apply the idea

Independent variable: Number of days absent in September

Dependent variable: Number of total days absent throughout the year

Some possible investigative questions might be:

  1. How does the number of days missed in September relate to the number of total days absent throughout the year?

  2. Does a high number of missed days in September correlate to a high number of total days absent throughout the year?

  3. Is there a threshold number of days missed in September that would indicate a high number of total days absent throughout the year?

Reflect and check

After one iteration of the data cycle, we might come up with a different question that explores something we noticed in the analysis.

If we want to be more specific, we could add "For a school in Fairfax, VA" to the beginning of each question to make the population clear.

c

A baker wants to adjust his pricing model to be more competitive. He wants to look at the price he charged for a cake compared to the time it took to create. He is curious if he would need specific models for wedding cakes versus birthday cakes or if the same model would be appropriate for all kinds of cakes.

Worked Solution
Create a strategy

The investigative question should focus on the relationship between the two numerical variables involved.

There is also a categorical variable that could be used for deeper exploration.

Apply the idea

Independent variable: Time it took to create a cake

Dependent variable: Price charged for a cake

Categorical variable: Type of cake

Some possible investigative questions might be:

  1. Does the price charged for a cake correspond to the time taken to create it?

  2. Is a higher price charged for a cake that took longer to create?

  3. Is there a specific time duration for creating a cake that would result in a higher price being charged?

Reflect and check

After one iteration of the data cycle, we could ask follow-up questions by splitting the scatterplot into two or by using different symbols or colors for the points to see if the relationship is the same.

  1. Does the price charged for a cake correspond to the time taken to create it? Is this relationship the same or different for wedding and birthday cakes?

  2. Is a higher price charged for a cake that took longer to create? Is this consistent for birthday and wedding cakes?

  3. Is there a specific time duration for creating a cake that would result in a higher price being charged? Is this range the same for wedding and birthday cakes?

Example 2

For each investigative question, select which data collection technique would be best. Explain your answer.

  A. Observation
  B. Polls
  C. Research
  D. Scientific experiment
  E. Survey

a

For the intersection of Chain Bridge Road and Eaton Place, Louisa notices that sometimes she can walk through easily, but sometimes she gets stuck in a crowd.

She asks the question "For the intersection at Chain Bridge Road and Eaton Place, how can the relationship between pedestrian density (people per square yard) and walking speed (feet per second) be modeled?"

Worked Solution
Create a strategy

To identify the best data collection technique for the investigative question, we should first identify the variables and then select the most feasible approach to collect or measure data.

Apply the idea

The variables are:

  • Independent variable: Pedestrian density

  • Dependent variable: Average walking speed

Based on the variables, the best data collection technique for the investigative question is observation. We can measure pedestrian density and average walking speed by observing, for example by using video footage from overhead cameras or drones.

This is also ideal because it is not intrusive and it can provide clear, accurate and quantifiable data.

The answer is option A.

Reflect and check

Let's examine other data collection techniques and their applicability to the given investigative question:

  • Polls - this technique may not give an accurate measurement of the variables, as the collected data would mostly be based on respondents' guesses.

  • Research - it is unlikely that we could find secondary data on pedestrian density and average walking speed for that specific intersection.

  • Scientific experiment - this technique requires selecting samples and controlling variables, which is not ideal because implementing it would be challenging.

  • Survey - this technique requires direct interaction with members of the sample and is not feasible for measuring pedestrian density and walking speed.

b

Polly loves attending concerts, but finds that she often can't see the stage because of the taller people in front of her. This leads her to ask the question:

"For those who attend concerts at the local venue, is there a relationship between height and amount spent on concerts in a year?"

Worked Solution
Create a strategy

To identify the best data collection technique, we should first identify the variables and then select the easiest and most accurate approach to collect or measure data.

Apply the idea

The variables are:

  • Independent variable: Height

  • Dependent variable: Amount spent on concerts in a year

Based on the variables, the best data collection technique for the investigative question is a survey. Surveys allow us to gather personal or private information such as height and spending habits.

She could try a convenience sample by emailing the venue's mailing list or try physical surveys handed out at concerts. Whichever method she uses, she should check that the sample is representative by making sure it reaches a diverse group of concert attendees.

The answer is option E.

Reflect and check

If she wanted to do another iteration of the data cycle, she could explore this relationship for different venues with and without tiered seating.

Idea summary

Bivariate data is data which is represented by two variables. This data is typically numerical and can be organized with a scatterplot.

To explore bivariate data, we first need to formulate an investigative question and then we can determine how to collect or acquire the necessary data. Some ways to collect data are:

  • Research using secondary sources to find existing data.

  • Surveys can be done by asking each member of the representative sample two questions or giving them a questionnaire.

  • Observations can be made by watching members of the sample and noting particular characteristics.

  • Scientific experiments can be done by controlling other variables, then varying the independent variable and measuring the dependent variable.

Scatterplots and relationships

We often display bivariate data using a scatterplot where the independent variable is plotted on the horizontal axis and the dependent variable is plotted on the vertical axis. We can describe the relationship based on how closely the points follow a particular model using the following terms:

  • Form, usually described as a linear relationship or nonlinear relationship

  • Strength, describing how closely the data points match the model line or curve

For linear relationships, we may also describe their direction as positive or negative.

As a review, here are some examples:

A scatterplot of y against x showing a linear relationship that is strong and positive.

A scatterplot of y against x showing a nonlinear relationship that is weak.
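For linear relationships, one common way to put a number on direction and strength is the correlation coefficient r, which this lesson does not define; it is brought in here only as an illustration. In the sketch below, the data values and the function name `pearson_r` are made up. The sign of r suggests the direction, and values near 1 or -1 suggest that the points lie close to a line.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Correlation coefficient r for paired lists of equal length."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)


# Hypothetical data, made up for this illustration.
x = [1, 2, 3, 4, 5, 6]
y_rising = [2, 4, 5, 8, 9, 12]   # climbs steadily with x
y_falling = [12, 9, 8, 5, 4, 2]  # drops steadily with x

print(round(pearson_r(x, y_rising), 2))
print(round(pearson_r(x, y_falling), 2))
```

This measure only describes how well a straight line fits, so a strong nonlinear relationship can still give a value of r close to 0.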

For larger data sets, we can use technology such as spreadsheets and graphing calculators to graph scatterplots. This is especially helpful for when we also want to analyze which model or equation would be the most appropriate.
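Spreadsheets and graphing calculators do this work for us, but the idea behind a plotted scatterplot can be sketched in a few lines of code. The function `text_scatter` below is a hypothetical helper, not a feature of any real tool. It scales each (x, y) pair onto a small grid of characters and assumes the x-values are not all equal and the y-values are not all equal.

```python
def text_scatter(points, width=20, height=10):
    """Render (x, y) pairs as a rough text scatterplot, one string per row."""
    xs = [x for x, _ in points]
    ys = [y for _, y in points]
    x_min, x_max = min(xs), max(xs)
    y_min, y_max = min(ys), max(ys)
    grid = [[" "] * width for _ in range(height)]
    for x, y in points:
        # Scale each value to a column or row index on the grid.
        col = round((x - x_min) / (x_max - x_min) * (width - 1))
        row = round((y - y_min) / (y_max - y_min) * (height - 1))
        # Row 0 is printed first, so flip the row to put large y-values on top.
        grid[height - 1 - row][col] = "*"
    return ["".join(cells) for cells in grid]


# Hypothetical data, made up for this illustration.
data = [(1, 2), (2, 3), (3, 5), (4, 4), (5, 7), (6, 8)]
for line in text_scatter(data):
    print(line)
```

Points that land in the same grid cell share a single symbol, so this is only a rough picture compared with a proper graphing tool.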

Exploration

A group of patients participating in a medical trial were given different dosages of the same medication. In addition to the medication provided, some of the patients also take insulin while others do not. The doctors running the trial then rated the effectiveness of the medication for each patient.

Use the checkboxes to display different subsets of data values in this scatterplot.

  1. What relationship is this scatterplot trying to explore? Formulate a question that this scatterplot could be used to help answer.

  2. When the entire data set is displayed the same way, how would you describe the trend?

  3. When just the people in the study who are taking insulin are shown, how would you describe the trend?

  4. When just the people in the study who are not taking insulin are shown, how would you describe the trend?

  5. Formulate a new question for a second round of the data cycle based on what you notice from the categories in this scatterplot.

While working through the data cycle, we will often uncover patterns or relationships we didn't think of before that can lead us to new investigative questions. For instance, we may realize that by grouping data into categories, we uncover relationships between variables that appeared to be unrelated when the data was combined.
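The effect of grouping can be illustrated with a small made-up data set. The values below are hypothetical and are not the trial data from the exploration; `slope` is an assumed helper that finds the slope of the least-squares line, used here as one possible way to summarize a trend.

```python
def slope(points):
    """Slope of the least-squares line through a list of (x, y) pairs."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    return sxy / sxx


# Hypothetical (dosage, effectiveness rating) pairs for two categories.
group_a = [(1, 2), (2, 3), (3, 5), (4, 6), (5, 8)]
group_b = [(1, 8), (2, 7), (3, 5), (4, 3), (5, 2)]

print(slope(group_a + group_b))  # combined data
print(slope(group_a))            # first category only
print(slope(group_b))            # second category only
```

In this made-up data, the combined slope is close to zero, while one category rises clearly and the other falls clearly. The two opposite trends cancel out when the categories are mixed together.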

Examples

Example 3

The table shows the grades of 12 students in English and French.

Student   Grade in English   Grade in French
1         85                 89
2         71                 71
3         57                 56
4         60                 62
5         79                 86
6         76                 76
7         71                 77
8         91                 86
9         50                 90
10        49                 47
11        66                 67
12        92                 92

a

Which of the following scatterplots correctly represents the above data?

  A. A scatterplot with title Grade in English for the x-axis and Grade in French in the y-axis. Speak to your teacher for more information.
  B. A scatterplot with title Grade in English for the x-axis and Grade in French in the y-axis. Speak to your teacher for more information.
  C. A scatterplot with title Grade in English for the x-axis and Grade in French in the y-axis. Speak to your teacher for more information.

Worked Solution
Create a strategy

Draw your own scatterplot and plot each given point to identify the correct graph.

Apply the idea

The correct answer is option A.

We would not need to plot all of the points to confirm which scatterplot is correct.

We could eliminate option B because it has a point plotted at (90, 50), but no student has those grades.

We could eliminate option C after checking only two points, because it has no point at (71, 71) to match student 2.
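The same elimination check can be sketched in code by storing each student's grades as an (English, French) pair, read from the table, and testing whether a plotted point matches any row. The names `grades` and `matches_a_student` are illustrative assumptions.

```python
# (English grade, French grade) for students 1 to 12, read from the table.
grades = [
    (85, 89), (71, 71), (57, 56), (60, 62), (79, 86), (76, 76),
    (71, 77), (91, 86), (50, 90), (49, 47), (66, 67), (92, 92),
]


def matches_a_student(point):
    """True if the plotted point corresponds to some row of the table."""
    return point in grades


print(matches_a_student((90, 50)))  # the point used to rule out option B
print(matches_a_student((71, 71)))  # student 2's point, missing from option C
```

A plotted point that matches no row rules out a graph, and so does a row from the table that has no point on the graph.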

Reflect and check

We can also use technology to create a scatterplot for the given data.

A screenshot of the GeoGebra statistics tool showing the scatterplot of the given data. Speak to your teacher for more details.
b

Is the relationship between students' English and French grades positive or negative?

  A. Positive
  B. Negative

Worked Solution
Create a strategy

Think about what happens to the students' grades in French as their grades in English increase.

Apply the idea

As the students’ grades in English increase, so do their grades in French. So, the answer is option A: Positive.

c

Is the relationship between students' English and French grades strong or weak?

  A. Strong
  B. Weak

Worked Solution
Create a strategy

Look at how scattered the points are on the scatterplot.

Apply the idea

When the points on a scatterplot tend to follow a single line, the relationship is strong.

When the points on a scatterplot are scattered greatly around a single line, the relationship is weak.

So, the answer is option A: Strong.

Idea summary

When describing a relationship shown in a scatterplot, we can describe the:

  • Form, usually described as a linear relationship or nonlinear relationship

  • Strength, describing how closely the data points match the model line or curve

  • Direction, usually described as positive relationship or negative relationship

Outcomes

A2.ST.2

The student will apply the data cycle (formulate questions; collect or acquire data; organize and represent data; and analyze data and communicate results) with a focus on representing bivariate data in scatterplots and determining the curve of best fit using linear, quadratic, exponential, or a combination of these functions.

A2.ST.2a

Formulate investigative questions that require the collection or acquisition of bivariate data and investigate questions using a data cycle.

A2.ST.2b

Collect or acquire bivariate data through research, or using surveys, observations, scientific experiments, polls, or questionnaires.

A2.ST.2c

Represent bivariate data with a scatterplot using technology.

A2.ST.2g

Make predictions, decisions, and critical judgments using data, scatterplots, or the equation(s) of the mathematical model.
