# 2.6 Assumptions of Simple Linear Regression

Simple linear regression is only appropriate when the following conditions are satisfied:

- **Linear relationship:** The outcome variable Y has a roughly linear relationship with the explanatory variable X.
- **Homoscedasticity:** For each value of X, the distribution of residuals has the same variance. This means that the level of error in the model is roughly the same regardless of the value of the explanatory variable (homoscedasticity: another disturbingly complicated word for something less confusing than it sounds).
- **Independent errors:** The residuals (errors) should be uncorrelated with one another.
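As a rough sketch of what checking these assumptions can look like in practice, here is a minimal example in Python rather than SPSS/PASW (the data are simulated and entirely hypothetical, and the informal checks below are illustrative shortcuts, not formal statistical tests):

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical data: a roughly linear relationship with constant error variance
x = rng.uniform(0, 10, 200)
y = 2.0 + 0.5 * x + rng.normal(0, 1.0, 200)

# Fit Y = b0 + b1*X by least squares and compute the residuals
b1, b0 = np.polyfit(x, y, 1)          # polyfit returns highest-degree coefficient first
resid = y - (b0 + b1 * x)

# 1) Linearity: residuals should show no trend against X
trend = np.corrcoef(x, resid)[0, 1]

# 2) Homoscedasticity: residual spread should be similar across the range of X
#    (splitting at x = 5 is an arbitrary illustrative cut-point)
low, high = resid[x < 5].std(), resid[x >= 5].std()

# 3) Independent errors: a Durbin-Watson statistic near 2 suggests
#    adjacent residuals are uncorrelated (most meaningful for ordered data)
e = resid[np.argsort(x)]
dw = np.sum(np.diff(e) ** 2) / np.sum(e ** 2)

print(round(trend, 3), round(high / low, 2), round(dw, 2))
```

For well-behaved data like this, the trend is essentially zero, the two spreads are similar (ratio near 1), and the Durbin-Watson statistic is close to 2; marked departures from any of these would prompt a closer look at the corresponding assumption.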
It may seem as if we're complicating matters, but checking that your analysis meets these assumptions is vital to ensuring that you draw valid conclusions.
The following issues are not as important as the assumptions because the regression analysis can still work even if there are problems in these areas. However, it is still vital that you check for these potential issues, as they can seriously mislead your analysis and conclusions.

- **Problems with outliers/influential cases:** It is important to look out for cases which may unduly influence your regression model by differing substantially from the rest of your data.
- **Normally distributed residuals:** The residuals (errors in prediction) should be normally distributed.
Let us look at these assumptions and related issues in more detail; they make more sense when viewed in the context of how you go about checking them.
The following points form an important checklist:
*(Figures not shown: example scatterplots with correlations of .886 and .495, one illustrating a restriction in the data and another containing a single outlier.)*
The dashed black line (which is hard to make out!) represents a normal distribution, while the red line represents the distribution of the residuals (technically the lines represent the cumulative probabilities). We are looking for the residual line to match the diagonal line of the normal distribution as closely as possible. This appears to be the case in this example: though there is some deviation, the residuals appear to be essentially normally distributed.
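The quantities behind such a plot can be computed directly: each residual's empirical cumulative probability is compared with the cumulative probability a normal distribution would assign to it. A small sketch in Python (with simulated residuals standing in for real ones):

```python
import math
import numpy as np

rng = np.random.default_rng(1)
resid = rng.normal(0, 1.5, 300)   # hypothetical regression residuals

# Standardise and sort the residuals
z = np.sort((resid - resid.mean()) / resid.std(ddof=1))

# Empirical cumulative probabilities (one per sorted residual)
empirical = (np.arange(1, len(z) + 1) - 0.5) / len(z)

# Corresponding normal cumulative probabilities, via the error function
normal_cdf = np.array([0.5 * (1 + math.erf(v / math.sqrt(2))) for v in z])

# Plotting `empirical` against `normal_cdf` gives the P-P plot; if the
# residuals are normal, the points hug the diagonal and the gap stays small
max_gap = np.abs(empirical - normal_cdf).max()
print(round(max_gap, 3))
```

When the residuals really are normally distributed, the maximum gap between the two sets of probabilities is small, which is exactly what "the residual line matching the diagonal" conveys visually.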
Even if students were randomly allocated to schools, social processes often act to create this dependence. Such clustering can be taken into account through the use of design weights, which indicate the probability with which an individual case was likely to be selected within the sample. For example, in published analyses of LSYPE, clustering was controlled by specifying school as a cluster variable and applying published design weights using the SPSS/PASW complex samples module.

More generally, researchers can control for clustering through the use of multilevel regression models (also called hierarchical linear models, mixed models, random effects or variance component models), which explicitly recognise the hierarchical structure that may be present in your data. Sounds complicated, right? It certainly can be, and these issues are beyond the scope of this website. However, if you feel you want to develop these skills, we have an excellent sister website provided by another NCRM supported node called LEMMA, which explicitly provides training on using multilevel modelling. We also know a good introductory text on multilevel modelling which you can find among our resources.

The next page will show you how to complete a simple linear regression and check the assumptions underlying it (well... most of them!) using SPSS/PASW.
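As a footnote to the clustering discussion above, the way clustered data violate the independent-errors assumption can be illustrated with simulated (entirely hypothetical) pupil scores nested within schools. The intraclass correlation (ICC) estimated below measures how much of the residual variance is shared within schools; the variance-components estimator used here is a standard one-way ANOVA shortcut, not the approach of any particular published LSYPE analysis:

```python
import numpy as np

rng = np.random.default_rng(42)
n_schools, pupils = 50, 20

# Each school contributes a shared "school effect" to all of its pupils
school_effect = rng.normal(0, 1.0, n_schools)
scores = school_effect[:, None] + rng.normal(0, 1.0, (n_schools, pupils))

# Residuals from a model that ignores schools: deviations from the grand mean
resid = scores - scores.mean()

# Pupils in the same school share the school effect, so their residuals are
# correlated. Estimate the between- and within-school variance components:
between = resid.mean(axis=1).var(ddof=1)        # variance of school means
within = resid.var(axis=1, ddof=1).mean()       # average within-school variance
sigma_b = between - within / pupils             # between-school component
icc = sigma_b / (sigma_b + within)
print(round(icc, 2))   # well above 0: the errors are not independent
```

A non-trivial ICC like this is precisely the situation where design weights, cluster-robust adjustments, or the multilevel models mentioned above become necessary.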