1.10. Scatter diagram

ryanrori February 3, 2021

Statisticians gather data to determine correlations (relationships) between events.  Scatter plots will often show at a glance whether a relationship exists between two sets of data. 

Example:

Let’s use a specific set of data to decide whether studying longer affects Regent’s grades. Given the data below, a scatter plot has been prepared to represent the data. Remember, when making a scatter plot, do NOT connect the dots.

http://regentsprep.org/Regents/math/data/scattergraph.gif

Notice: A single x-value may have more than one y-value, such as (7, 85), (7, 90) and (7, 100).

  • The data displayed on the graph resemble a line rising from left to right. Since the slope of the line is positive, there is a positive correlation between the two sets of data. This means that, according to this set of data, the longer Regent studies, the better the grade he tends to get on his examination.
  • If the slope of the line had been negative (falling from left to right), a negative correlation would exist. Under a negative correlation, the longer Regent studies, the worse the grade he would tend to get on his examination.
  • If the points on the graph are scattered in such a way that they do not approximate a line (they do not appear to rise or fall), there is no correlation between the sets of data. No correlation means that the data do not show whether studying longer has any effect on the examination scores.
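
The three cases above can also be checked numerically. The sketch below computes Pearson's correlation coefficient r, a common measure that this lesson does not name, and reads the direction from its sign. Apart from the three points at 7 hours mentioned above, the data are made up for illustration, and the 0.1 cutoff for "no correlation" is an arbitrary assumption.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Pearson's r for two equal-length lists of numbers.

    Assumes neither list is constant (otherwise the denominator is zero).
    """
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)


def direction(r, tolerance=0.1):
    """Label the sign of r; the tolerance is an assumed, arbitrary cutoff."""
    if r > tolerance:
        return "positive"
    if r < -tolerance:
        return "negative"
    return "none"


# Hypothetical data: only the three grades at 7 hours appear in the lesson.
hours = [2, 3, 4, 5, 6, 7, 7, 7]
grades = [60, 65, 70, 80, 82, 85, 90, 100]

print(direction(pearson_r(hours, grades)))
```

A value of r near +1 or -1 means the points lie close to a rising or falling line, while a value near 0 means no straight-line pattern is visible.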

Scatter plots are used by researchers to look for correlations. A correlation is a relationship between the data, which can suggest that one event may affect another event. 

In order to use scatter plots in this way, you must have two sets of numerical data. One set is plotted on the x-axis of a graph, and the other set is plotted on the y-axis. The resulting scatter plot will often show at a glance whether a relationship exists between the two sets of data.

Here’s an example:

Suppose you want to find out whether more hours spent studying will have an effect on a person’s mark.

You set up an experiment with some people, recording how many hours they spent studying and then recording what happened to their mark.

You can see the data in the table below:

It’s difficult to see any pattern in the table, although it’s clear that different things happened to different people. One person studied for 1 hour and had their mark go up 2%, while another person who also studied for 1 hour saw a drop of 1%!

If there is any pattern here, we’ll have to graph the data to see it:

  • We’ll plot the hours spent studying on the x-axis, since it’s the independent variable.
  • The change in the math mark is the dependent variable, so it goes on the y-axis.
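
As a rough sketch of this axis choice, the snippet below draws a scatter plot as plain text, with hours studied running left to right (x) and the change in mark running bottom to top (y). The function name text_scatter is hypothetical, and only the points (1, 2) and (1, -1) come from the lesson; the others are invented for illustration. The points are marked individually and never joined.

```python
def text_scatter(points, x_max, y_min, y_max):
    """Draw integer (x, y) points as a text grid, highest y value on the top row."""
    marked = set(points)
    rows = []
    for y in range(y_max, y_min - 1, -1):
        # One cell per x value from 0 to x_max; "*" marks a data point.
        cells = ["*" if (x, y) in marked else "." for x in range(x_max + 1)]
        rows.append(f"{y:>3} | " + " ".join(cells))
    return "\n".join(rows)


# (hours studied, change in mark in %). Only (1, 2) and (1, -1) come from the
# lesson text; the remaining points are hypothetical.
points = [(1, 2), (1, -1), (2, 1), (3, 3), (4, 2), (5, 4)]
grid = text_scatter(points, x_max=5, y_min=-1, y_max=4)
print(grid)
```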

The first thing you notice about the graph is that, while the points are scattered around, they do seem to line up. More specifically, they seem to be getting higher as you move to the right on the graph.

This type of correlation is called a weak positive correlation:

  • It’s a correlation because the points do seem to form a pattern … in this case, a line
  • It’s positive because the points tend to get higher as you move to the right
  • It’s weak because the points follow the line only loosely, with a lot of scatter around it.

Here is the graph again. We’ve shown a line that seems to describe the direction the points are heading in. This is called the line of best fit.

There are methods for determining where this line is, but for our purposes we’ll use just two criteria to find and draw the line:

  • The line of best fit must more or less follow the direction of the points
  • There should be roughly the same number of points on each side of the line

Lines of best fit can be used to predict results.
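
One of the formal methods for placing the line is least squares, which the lesson does not cover; the sketch below uses it in place of drawing the line by eye. It then counts the points on each side of the line and uses the line to predict a result. The data are hypothetical (only the two 1-hour points come from the lesson), and least squares does not guarantee an even split of points in general.

```python
def least_squares_line(xs, ys):
    """Return (slope, intercept) of the least-squares line through the points."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def count_sides(xs, ys, slope, intercept):
    """Count the points strictly above and strictly below the line."""
    above = sum(1 for x, y in zip(xs, ys) if y > slope * x + intercept)
    below = sum(1 for x, y in zip(xs, ys) if y < slope * x + intercept)
    return above, below


def predict(x, slope, intercept):
    """Read a predicted y value off the line for a given x."""
    return slope * x + intercept


# Hypothetical data: hours studied and change in mark (%).
hours = [1, 1, 2, 3, 4, 5]
change = [2, -1, 1, 3, 2, 4]

slope, intercept = least_squares_line(hours, change)
print(count_sides(hours, change, slope, intercept))
print(predict(6, slope, intercept))
```

Predictions made well outside the range of the measured data should be treated with caution, since the pattern may not continue there.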

In our example above you’ll notice that very few of the points are actually on the line of best fit. In fact, some of the data points (representing different people) are quite far from the line. You can think of the line of best fit as an average description of what’s going on in the experiment.
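
To make the "average description" idea concrete, the sketch below measures the vertical gap between each point and the line. The data and the line y = 0.8x - 0.3 are hypothetical; the line was chosen as the least-squares line for these made-up points, and for such a line the gaps above and below cancel out.

```python
def residuals(xs, ys, slope, intercept):
    """Vertical gap between each point and the line (positive = above the line)."""
    return [y - (slope * x + intercept) for x, y in zip(xs, ys)]


# Hypothetical data and line, for illustration only.
hours = [1, 1, 2, 3, 4, 5]
change = [2, -1, 1, 3, 2, 4]
slope, intercept = 0.8, -0.3

gaps = residuals(hours, change, slope, intercept)
on_line = sum(1 for g in gaps if abs(g) < 1e-9)   # points lying on the line
largest_gap = max(abs(g) for g in gaps)           # farthest point from the line
total = sum(gaps)                                 # gaps above and below cancel

print(on_line, largest_gap, total)
```

Here no point sits exactly on the line and the farthest one is about 1.5 percentage points away, yet the gaps sum to roughly zero.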

We might conclude that there is a correlation between hours spent studying and a change in your mark, and describe it this way:

  • there is a weak positive correlation
  • as the number of hours of studying increases, the math mark seems to increase

The correlation suggested by the graph is just that … a suggestion. This does not prove that more hours spent studying causes your mark to go up. There may in fact be some other (uncontrolled) variable that actually caused the increase in marks as the hours spent studying increased. But the correlation tells us that there is some sort of connection between the two, and we may want to investigate further to look for the actual cause, or confirm that more hours spent studying really is the cause.