1.7. Methods for calculating statistics

ryanrori February 3, 2021

In statistics, the most important calculations are:

  • Mean
  • Mode
  • Median
  • Variance
  • Standard deviation (std dev)

Centre and spread

Plotting data in a frequency distribution shows the general shape of the distribution and gives a general sense of how the numbers are bunched.  Several statistics can be used to represent the “centre” of the distribution.  These statistics are commonly referred to as measures of central tendency and include the mean, median and mode. 

Calculating the mean

We ask 5 friends to rate a popular movie on a scale from 1 to 10 and here’s what we get:

Fred: 6

Sally: 9

Michael: 8

Raul: 9

Elena: 2

To calculate the mean, sum all the numbers in the sample and then divide by the sample size. The sum is 6 + 9 + 8 + 9 + 2 = 34. Since the sample size is 5, the mean is 34/5 = 6.8.

This is the average of the sample.
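The same calculation can be sketched in a few lines of Python. This is only an illustration; the variable names (`ratings`, `total`, `mean`) are our own and not part of the course material.

```python
# Ratings from the five friends in the example above.
ratings = {"Fred": 6, "Sally": 9, "Michael": 8, "Raul": 9, "Elena": 2}

# Step 1: sum all the numbers in the sample.
total = sum(ratings.values())

# Step 2: divide by the sample size.
mean = total / len(ratings)

print(total)  # 34
print(mean)   # 6.8
```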

Calculating the mode

The mode is the number that appears the most often in the sample. 

To calculate the mode, we count the number of times each rating is made. So we have one 6, two 9’s, one 8, and one 2. Since we have two 9’s and one of everything else, 9 is the mode. 

But what would happen if we have the following sequence: 2, 2, 8, 9, 9?

In this case, 2 and 9 each appear twice, so we would say that there is no unique mode. A mode is unique if and only if one number is more frequent than all others.
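As a rough sketch, Python's standard `statistics` module can report every value that ties for most frequent, which makes the difference between the two samples visible. It assumes Python 3.8 or later, where `multimode` is available.

```python
from statistics import multimode

ratings = [6, 9, 8, 9, 2]      # the original sample
tied_sample = [2, 2, 8, 9, 9]  # the second sequence discussed above

# multimode returns a list of all values that share the highest count.
modes_ratings = multimode(ratings)
modes_tied = multimode(tied_sample)

print(modes_ratings)  # [9]
print(modes_tied)     # [2, 9]

# The mode is unique only when exactly one value is returned.
print(len(modes_ratings) == 1)  # True
print(len(modes_tied) == 1)     # False
```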

Calculating the median

The median is the value we get when we order all of our numbers and then find the one in the middle.

If we order the numbers from smallest to largest, we get: 2, 6, 8, 9, 9

Since we have a sample size of 5, the number in the middle is 8.

But what happens if the sample size is even? In this case, there is no single middle number, so we add the two middle numbers and divide by 2.

So, if our numbers are 2, 6, 8, 9, then the median is (6 + 8)/2 = 7.
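Both cases (odd and even sample size) can be sketched with a small helper. The function name `middle_value` is hypothetical and chosen for this illustration; the result is compared against `statistics.median` from the standard library.

```python
from statistics import median

def middle_value(numbers):
    """Return the median: order the numbers, then take the middle one."""
    ordered = sorted(numbers)
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        # Odd sample size: a single middle number exists.
        return ordered[mid]
    # Even sample size: average the two middle numbers.
    return (ordered[mid - 1] + ordered[mid]) / 2

odd_result = middle_value([6, 9, 8, 9, 2])
even_result = middle_value([2, 6, 8, 9])

print(odd_result)   # 8
print(even_result)  # 7.0

# The library function gives the same answers.
print(median([6, 9, 8, 9, 2]), median([2, 6, 8, 9]))  # 8 7.0
```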

Measures of spread

Although the mean, median and mode describe where scores are centred in a distribution, on their own they say nothing about how widely the scores are spread around that centre.

Measures of variability provide information about the degree to which individual scores are clustered about or deviate from the average value in a distribution. 

  • Range – The simplest measure of variability to compute and understand is the range.  The range is the difference between the highest and lowest score in a distribution.  Although it is easy to compute, it is not often used as the sole measure of variability due to its instability.  Because it is based solely on the most extreme scores in the distribution and does not fully reflect the pattern of variation within a distribution, the range is a very limited measure of variability. 
  • Interquartile Range (IQR) – Provides a measure of the spread of the middle 50% of the scores.  The IQR is defined as the 75th percentile minus the 25th percentile.  The advantage of using the IQR is that it is easy to compute and extreme scores in the distribution have much less impact on it.  This strength is also a weakness: as a measure of variability, the IQR discards a large share of the data.  Researchers want to study variability while eliminating only the scores that are likely to be accidents.
  • Variance – The variance is a measure based on the squared deviations of individual scores from the mean.  When the deviations are squared, the rank order and relative distance of scores in the distribution are preserved while negative values are eliminated.  Then, to control for the number of subjects in the distribution, the sum of the squared deviations, Σ(X – X̄)², is divided by N (population) or by N – 1 (sample).  The result is the average of the squared deviations and it is called the variance.
  • Standard deviation – The standard deviation (σ for a population, s for a sample) is defined as the positive square root of the variance.  The variance is a measure in squared units and is therefore hard to interpret with respect to the data.  The standard deviation, by contrast, is a measure of variability expressed in the same units as the data.  The standard deviation is very much like a mean or an “average” of these deviations.  In a normal (symmetric and mound-shaped) distribution, about two-thirds of the scores fall within one standard deviation of the mean (between -1 and +1 standard deviations), and the standard deviation is approximately 1/4 of the range in small samples (N < 30) and 1/5 to 1/6 of the range in large samples (N > 100).
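
To make the range and the IQR concrete, here is a minimal sketch using the movie ratings from earlier. One assumption to note: textbooks and software define percentiles in slightly different ways, so the quartiles (and therefore the IQR) of a small sample depend on the method chosen. The sketch uses the "inclusive" method of Python's `statistics.quantiles` (Python 3.8 or later); other methods may give a different IQR for the same data.

```python
from statistics import quantiles

ratings = [6, 9, 8, 9, 2]

# Range: the highest score minus the lowest score.
value_range = max(ratings) - min(ratings)

# IQR: the 75th percentile minus the 25th percentile.
# quantiles(..., n=4) returns the three quartile cut points.
# Assumption: the "inclusive" method; other methods give other quartiles.
q1, q2, q3 = quantiles(ratings, n=4, method="inclusive")
iqr = q3 - q1

print(value_range)  # 7
print(q1, q3, iqr)
```

The single low rating of 2 drives the range, while the IQR looks only at the middle half of the ordered ratings and is much less affected by it.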

Calculating Variance

The variance is a measure of the variation of the sample data. The larger the variance, the more spread out the answers are around the mean. Many people find standard deviation to be a more useful measure of variability.

The method for calculating the variance is different depending on whether we are calculating the variance of a population (everyone) or the variance of a sample (some but not all).

Here are the steps:

  1. Figure out the mean. This is the sum of the numbers given divided by the sample size (i.e. the average): 

(6+ 9+ 8 + 9 + 2)/5 = 34/5 = 6.8

  2. Figure out the difference between each number and the mean so that we have:

(6 – 6.8), (9 – 6.8), (8 – 6.8), (9 – 6.8), (2 – 6.8) = -0.8, 2.2, 1.2, 2.2, -4.8

  3. Get the square of each difference in step 2 so that we have:

(-0.8)*(-0.8), (2.2)*(2.2), (1.2)*(1.2), (2.2)*(2.2), (-4.8)*(-4.8) = 0.64, 4.84, 1.44, 4.84, 23.04

  4. Get the sum of all the squares in step 3 so that we have:

sum of squares = 0.64 + 4.84 + 1.44 + 4.84 + 23.04 = 34.8

  5. Now, for the sample variance, we divide the sum in step 4 by the sample size – 1

Variance = 34.8/(5-1) = 34.8/4 = 8.7
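
The steps above can be sketched in Python as follows. The variable names are our own, and the last lines show the population version (dividing by the full sample size) for comparison, which the worked example does not compute. Because computers store decimals approximately, the results are rounded before printing.

```python
from statistics import pvariance, variance

ratings = [6, 9, 8, 9, 2]
n = len(ratings)

# Step 1: the mean.
mean = sum(ratings) / n

# Step 2: difference between each number and the mean.
deviations = [x - mean for x in ratings]

# Step 3: square each difference.
squares = [d * d for d in deviations]

# Step 4: sum of the squares.
sum_of_squares = sum(squares)

# Step 5: divide by (sample size - 1) for the sample variance.
sample_variance = sum_of_squares / (n - 1)

# For a whole population we would divide by n instead.
population_variance = sum_of_squares / n

print(round(sum_of_squares, 2))       # 34.8
print(round(sample_variance, 2))      # 8.7
print(round(population_variance, 2))  # 6.96

# The standard library gives the same values (up to rounding).
print(round(variance(ratings), 2), round(pvariance(ratings), 2))  # 8.7 6.96
```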

Calculating the Standard Deviation

The standard deviation, like variance, is a measure of the variation of the sample data. The larger the standard deviation, the more spread out the answers are around the mean.  Standard deviation is more popular as a measure than variance.

The method for calculating the standard deviation is different depending on whether we are calculating the standard deviation of a population (everyone) or of a sample (some but not all).  The method is the same as for variance, with one additional step.

Here are the steps:

  1. Figure out the mean. This is the sum of the numbers given divided by the sample size (i.e. the average). (6+ 9+ 8 + 9 + 2)/5 = 34/5 = 6.8
  2. Figure out the difference between each number and its mean so that we have:

(6 – 6.8), (9 – 6.8), (8 – 6.8), (9 – 6.8), (2 – 6.8) = -0.8, 2.2, 1.2, 2.2, -4.8

  3. Get the square of each difference in step 2 so that we have:

(-0.8)*(-0.8), (2.2)*(2.2), (1.2)*(1.2), (2.2)*(2.2), (-4.8)*(-4.8) = 0.64, 4.84, 1.44, 4.84, 23.04

  4. Get the sum of all the squares in step 3 so that we have:

sum of squares = 0.64 + 4.84 + 1.44 + 4.84 + 23.04 = 34.8

  5. We divide the sum in step 4 by the sample size – 1

34.8/(5-1) = 34.8/4 = 8.7

  6. Last, we take the square root of the value in step 5.

Standard Deviation = sqrt (8.7) = roughly 2.95 
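
As a quick check of the final step, the sketch below takes the square root of the sample variance and compares it with `statistics.stdev`. The population figure from `pstdev` is shown only for comparison and is not part of the worked example.

```python
import math
from statistics import pstdev, stdev, variance

ratings = [6, 9, 8, 9, 2]

# Steps 1 to 5: the sample variance (sum of squares divided by n - 1).
sample_variance = variance(ratings)

# Step 6: the square root of the variance.
sample_sd = math.sqrt(sample_variance)

print(round(sample_sd, 2))        # 2.95
print(round(stdev(ratings), 2))   # 2.95

# Dividing by n instead of n - 1 gives the population standard deviation.
population_sd = pstdev(ratings)
print(round(population_sd, 2))    # 2.64
```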

Interpreting Standard Deviation

A smaller standard deviation means that there is more agreement between the numbers (less variation) and a larger standard deviation means that there is less agreement (more variation).

If the observations follow a bell curve (a normal distribution), then we can use the standard deviation to make the following approximate observations:

  • About 68% of the numbers lie within one standard deviation of the mean
  • About 95% of the numbers lie within two standard deviations of the mean

Back to our previous example of rating a movie:

Now, movie ratings are, in theory, not random since they are based on the quality of a movie. Additionally, we know that 100% of the ratings are between 1 and 10 and are most likely whole numbers.

But what would it tell us about another movie if the mean were 5, the standard deviation were 1, and we assumed that ratings form a bell curve?

With this information, we can expect:

  • About 68% of all people will rate the movie between 4 and 6, since 4 = 5 – 1 and 6 = 5 + 1
  • About 95% of all people will rate the movie between 3 and 7, since 3 = 5 – 2*1 and 7 = 5 + 2*1
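
The two percentages can be checked with a short sketch. It assumes, as the example does, that ratings follow an idealised bell curve with mean 5 and standard deviation 1 (real ratings are whole numbers between 1 and 10, so this is only an approximation). It uses `statistics.NormalDist`, available in Python 3.8 or later.

```python
from statistics import NormalDist

# Assumed model: a bell curve with mean 5 and standard deviation 1.
ratings_model = NormalDist(mu=5, sigma=1)

# Share of ratings within one standard deviation (between 4 and 6).
within_one = ratings_model.cdf(6) - ratings_model.cdf(4)

# Share of ratings within two standard deviations (between 3 and 7).
within_two = ratings_model.cdf(7) - ratings_model.cdf(3)

print(round(within_one, 3))  # 0.683
print(round(within_two, 3))  # 0.954
```

The 68% and 95% figures are therefore rounded values of roughly 68.3% and 95.4%.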

We will look at more methods for calculating statistics in Module 2.