# 2.4 Correlation Coefficients

Graphing your data is essential to understanding what it is telling you. Never rush the statistics: get to know your data first! You should always examine a scatterplot of your data. However, it is also useful to have a numeric measure of how strong the relationship is, and the Pearson correlation coefficient provides exactly this. We're not going to blind you with formulae, but it is helpful to have some grasp of how the statistics work. The basic principle is to measure how strongly two variables relate to each other, that is, to what extent they vary together.
The black lines across the middle represent the mean value of X (25.50) and the mean value of Y (23.16). These lines are the reference for the calculation of covariance for all of the participants. Notice how the point highlighted by blue lines is above the mean for one variable but below the mean for the other. A score below the mean creates a negative difference (approximately 10 - 23.16 = -13.2) while a score above the mean creates a positive one (approximately 41 - 25.5 = 15.5). If an observation is above the mean on X and also above the mean on Y then the product (multiplying the differences together) will be positive. The product will also be positive if the observation is below the mean for both X and Y. The product will be negative if the observation is above the mean for X and below the mean for Y, or vice versa. Only the three points highlighted in red produce positive products in this example. All of the individual products are then summed to get a total, and this is divided by the product of the standard deviations of both variables in order to scale it (don't worry too much about this!). The result is the correlation coefficient, Pearson's r.
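The procedure described above can be sketched in a few lines of code. This is a minimal illustration using invented scores, not the values from the scatterplot in the text: deviations from each mean are multiplied together, summed, and scaled by the standard deviations (and the sample size) to give Pearson's r.

```python
# Pearson's r from first principles, using hypothetical data.
import statistics

x = [12, 18, 25, 31, 41, 26]  # invented X scores
y = [10, 15, 22, 28, 35, 29]  # invented Y scores

mean_x = statistics.mean(x)
mean_y = statistics.mean(y)

# Product of deviations for each observation: positive when a point is
# above (or below) the mean on BOTH variables, negative otherwise.
products = [(xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y)]

# Sum the products, then scale by (n - 1) and the two standard deviations.
n = len(x)
r = sum(products) / ((n - 1) * statistics.stdev(x) * statistics.stdev(y))
print(round(r, 3))
```

Because most of these invented points sit above both means or below both means, the products are mostly positive and r comes out strongly positive.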
The correlation coefficient tells us two key things about the association:

**Direction** - A positive correlation tells us that as one variable gets bigger the other tends to get bigger. A negative correlation means that as one variable gets bigger the other tends to get smaller (e.g. as a student's level of economic deprivation decreases, their academic performance increases).

**Strength** - The weakest linear relationship is indicated by a correlation coefficient equal to 0 (in fact this represents no linear correlation at all!). The strongest linear correlation is indicated by a correlation of -1 or +1. The strength of the relationship is indicated by the magnitude of the value regardless of the sign (+ or -), so a correlation of -0.6 is just as strong as a correlation of +0.6; only the direction of the relationship differs.
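Direction and strength are easy to see with simulated data. In this sketch (the variable names and data are made up for illustration), an exact linear increase gives r = +1 and an exact linear decrease gives r = -1:

```python
# Direction and strength of correlation, illustrated with made-up data.
import numpy as np

x = np.arange(10.0)
exact_increase = 2 * x + 1    # perfectly linear, upward
exact_decrease = -3 * x + 5   # perfectly linear, downward

r_pos = np.corrcoef(x, exact_increase)[0, 1]  # close to +1
r_neg = np.corrcoef(x, exact_decrease)[0, 1]  # close to -1
print(round(r_pos, 3), round(r_neg, 3))
```

Real data will of course fall somewhere between these extremes, with noise pulling the coefficient towards 0.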
It is also important to use the data to find out:

**Statistical significance** - We also want to know whether the relationship is statistically significant. That is, what is the probability of finding a relationship like this in the sample, purely by chance, when there is no relationship in the population? If this probability is sufficiently low then the relationship is statistically significant.

**How well the correlation describes the data** - This is best expressed by considering how much of the variance in the outcome can be explained by the explanatory variable. This is described as the proportion of variance explained, r² (sometimes called the coefficient of determination). Conveniently, r² can be found simply by squaring the Pearson correlation coefficient. The r² provides a good gauge of the substantive size of a relationship. For example, a correlation of 0.6 explains 36% (0.6² = 0.36) of the variance in the outcome variable.
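Both quantities can be obtained outside SPSS as well. The following is a hedged sketch using SciPy's `pearsonr`, which returns the coefficient together with its p-value; the scores here are invented for illustration, not drawn from the LSYPE dataset:

```python
# r, r-squared and significance with SciPy (hypothetical scores).
from scipy import stats

x = [34, 41, 28, 50, 45, 39, 31, 47, 36, 43]
y = [30, 44, 25, 52, 41, 35, 29, 49, 33, 40]

r, p_value = stats.pearsonr(x, y)
r_squared = r ** 2  # proportion of variance explained

print(f"r = {r:.3f}, r-squared = {r_squared:.3f}, p = {p_value:.4f}")

# The worked example from the text: a correlation of 0.6
# explains 0.6 squared = 0.36, i.e. 36% of the variance.
print(round(0.6 ** 2, 2))  # 0.36
```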
Let us return to the example used on the previous page - the relationship between age 11 and age 14 exam scores. This time we will be able to produce a statistic which explains the strength and direction of the relationship we observed on our scatterplot. This example once again uses the LSYPE 15,000 dataset. Take the following route through SPSS:
The pop-up menu (shown below) will appear. Move the two variables that you wish to examine (in this case ks2stand and ks3stand) across from the left hand list into the Variables box.
SPSS will provide you with the following output:
As inferred from the scatterplot on the previous page, there is a positive correlation between age 11 and age 14 exam performances, such that a high score in one is associated with a high score in the other. The value of .886 is strong - it means that one variable accounts for about 79% of the variance in the other (r² = .886² = .785).
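The "about 79%" figure is just the square of the reported coefficient, which can be checked with a one-line calculation:

```python
# The arithmetic behind the "about 79%" figure in the text.
r = 0.886            # Pearson correlation reported by SPSS
r_squared = r ** 2   # proportion of variance explained
print(round(r_squared, 3))  # 0.785
```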
It is very important that correlation does not get confused with causation. All that a correlation shows us is that two variables are related, not that one necessarily causes the other. Consider the following example. When looking at National test results, pupils who joined their school part of the way through a key stage tend to perform at the lower end of the attainment scale compared with those who attended the school for the whole of the key stage. The figure below illustrates this relationship.
It may appear that joining a school later leads to poorer exam attainment, and that the later you join the more attainment declines. However, we cannot necessarily infer that there is a causal relationship here. It may be reverse causality, for example where pupils with low attainment and behaviour problems are excluded from school and so have to change schools. Or it might be that the relationship between mobility and attainment arises because both are related to a third variable, such as socio-economic disadvantage.
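The third-variable explanation can be demonstrated with a small simulation. In this sketch (all numbers are simulated, not real attainment data), a hypothetical variable Z drives both X and Y, so X and Y correlate substantially even though neither causes the other:

```python
# Spurious correlation from a common cause (simulated data).
import numpy as np

rng = np.random.default_rng(42)
n = 1000

z = rng.normal(size=n)                          # e.g. socio-economic disadvantage
x = 0.8 * z + rng.normal(scale=0.6, size=n)     # e.g. pupil mobility
y = 0.8 * z + rng.normal(scale=0.6, size=n)     # e.g. exam attainment

# X and Y share no direct causal link, yet they correlate because
# both depend on Z.
r_xy = np.corrcoef(x, y)[0, 1]
print(round(r_xy, 2))
```

A naive reading of r_xy here would suggest that mobility affects attainment, when by construction the entire association flows through the common cause.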
The diagram below illustrates these alternative causal explanations.