
6: Young people and delinquency

> 6.1 Background > 6.2 Types of imputation> 6.3 Getting started

> 6.4 Results using a small subset of variables > 6.5 Results using a large number of derived variables
> 6.6 Results using individual questions

> 6.7 Comparisons and conclusions
> 6.7.1 Trends over time > 6.7.2 Logistic regression > 6.7.3 Conclusions

> 6.8 Details of this survey and the data



6.1 Background

This exemplar is based on data from the Edinburgh Study of Youth Transitions and Crime. This is a longitudinal survey that has collected data on young people through their six years of secondary education (ages 12 to 17). A cohort of pupils beginning their secondary school education in Edinburgh in autumn 1998 was the target population. Only a few schools and a small proportion of eligible pupils were withdrawn by their parents. Further information about the project can be found at http://www.law.ed.ac.uk/cls/esytc. The data for the earlier sweeps of the study can be accessed at the ESRC data archive, and the later ones will be there in time.

To illustrate multiple imputations for longitudinal data we are using data collected at all six sweeps. Each year the participants completed a detailed questionnaire. For the first 4 years almost all the participants were in school and the response rate was excellent. At years 5 and 6 it was slightly worse as a proportion of young people left school and moved away from home.

The population who were eligible for the study changed over time as families moved in and out of the city. For the longitudinal analysis we considered the 4328 participants who were eligible to participate in year 4. The response rate at each year is shown in Figure 6.1. The non-responders in years 1 and 2 were largely those who moved into study schools after this. From year 4 onwards the decline is largely due to the worse response rate from those who had left school and had to be followed up elsewhere. 72% of participants (3124 of 4328, see Table 6.1) responded at all 6 sweeps.

Figure 6.1 Response rates by yearly sweep

In this exemplar we will be looking at how the rates of self-reported delinquency change over the 6 years. The questionnaire contained a series of questions like the following example:-

  • In the last year did you steal something from a shop or a store?
    If ‘yes’ how many times?

There were between 15 and 18 such questions, depending on the sweep of the questionnaire. Three measures of delinquency were derived from each set of questions:

  1. Prevalence: 1 if a yes response to any question, else 0.
  2. Variety: number of questions answered yes, out of the total.
  3. Volume: sum of the number of times for all questions answered yes, with the answer 'more than 10 times' coded as 11.

Being missing on any question led to a missing score. This was most pronounced at sweep 6, where some of the questions were no longer relevant to those who had left school (e.g. truancy). Figure 6.2 shows how the mean values of these measures changed over time for the observed subjects.
Figure 6.2 Mean delinquency scores by gender and school year (variety and volume divided by numbers of questions asked).
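
As an illustration of the three measures defined above, the following is a minimal Python sketch, not the study's own code. It assumes that each answer has already been recoded to a number of times (0 for 'no', 11 for 'more than 10 times') and that an unanswered question is held as None; the function name and this data layout are hypothetical.

```python
from typing import Optional, Sequence, Tuple

Scores = Tuple[Optional[int], Optional[int], Optional[int]]

def delinquency_scores(times: Sequence[Optional[int]]) -> Scores:
    """Return (prevalence, variety, volume) for one pupil at one sweep.

    `times` holds, for each question, the reported number of times
    (0 for 'no', 11 for 'more than 10 times') or None if unanswered.
    """
    # Being missing on any question gives a missing score.
    if any(t is None for t in times):
        return (None, None, None)
    variety = sum(1 for t in times if t > 0)   # questions answered yes
    volume = sum(times)                        # total number of times
    prevalence = 1 if variety > 0 else 0       # any yes at all
    return (prevalence, variety, volume)

print(delinquency_scores([0, 2, 11, 0]))   # two offences admitted
print(delinquency_scores([1, None, 0]))    # one question unanswered
```

Note how a single unanswered question is enough to lose all three scores for that pupil and sweep, which is the motivation for imputing the individual questions later in this exemplar.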

We can see that the delinquency measures peak at around secondary years 3 and 4 (S3 and S4) and decline after this. But perhaps some of this decline relates to losses to follow up? If the more delinquent failed to respond in later sweeps the mean score for responders would be reduced.

For every longitudinal survey we need to consider what we mean by our ‘eligible sample’ since some respondents may not be asked to participate in every wave of the survey. This eligible sample may sometimes vary depending on how many sweeps of the survey we wish to look at together. For this survey it was decided to consider the eligible sample as those who were pupils in Edinburgh schools at S4, and to exclude any pupils who had moved away in earlier years. Table 6.1 shows the response patterns for all 4328 pupils in the eligible sample. An R represents a response at each wave. The pupils in the first row of Table 6.1 responded at all sweeps, the second row missed the last sweep, and so on. Although a large proportion of response patterns (the first 6 groups) are nested there are many others, so a purely nested analysis would not work.

Table 6.1 Response patterns for all respondents

pattern   number of pupils   comment
RRRRRR    3124               responded all 6 sweeps
RRRRR-    399                missed 6
RRRR--    192                missed 5 and 6
RRR---    55                 missed 4,5,6
RR----    16                 missed 3,4,5,6
R-----    5                  first sweep only
RRRR-R    122                missed sweep 5 only
-RRRRR    100                missed sweep 1 only
--RRRR    69                 missed 1 and 2
others    246                37 other distinct response patterns
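
To make the idea of a nested pattern concrete, here is a small Python sketch using the counts in Table 6.1. It treats a pattern as nested when no response follows a missed sweep, and it assumes that none of the 246 'others' are nested; the function name is hypothetical.

```python
# Counts copied from Table 6.1; 'others' kept separately.
PATTERN_COUNTS = {
    "RRRRRR": 3124, "RRRRR-": 399, "RRRR--": 192, "RRR---": 55,
    "RR----": 16, "R-----": 5, "RRRR-R": 122, "-RRRRR": 100, "--RRRR": 69,
}
OTHER_PATTERNS = 246  # assumed here to contain no nested patterns

def is_nested(pattern: str) -> bool:
    """True if no response (R) follows a missed sweep (-)."""
    return "-R" not in pattern

nested_total = sum(n for p, n in PATTERN_COUNTS.items() if is_nested(p))
total = sum(PATTERN_COUNTS.values()) + OTHER_PATTERNS

print(f"{nested_total} of {total} pupils have a nested pattern "
      f"({100 * nested_total / total:.1f}%)")
```

The remaining pupils returned after missing a sweep or joined late, which is why methods that rely on a nested structure are not enough here.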

In addition to those who did not fill in a questionnaire at some sweeps (unit non-response) we will have additional missing data from individual questions that were not answered (item non-response). For the questions on delinquency that we investigate the item non-response averaged around 5%, with a smaller proportion missing in the older age groups for most questions.

We will illustrate how different ways of imputing the non-respondents' values affect the patterns of delinquency over time.


6.2 Types of imputation

6.2.1 Theoretical differences

Different types of imputation are discussed in the theory part of this site: Imputation section 6.2. We will be illustrating only model-based imputation here. Two main approaches have been developed. Both of them involve making some quite strong assumptions about the data.
  1. Assume a multivariate normal distribution. This is the approach implemented by Schafer, available in some packages and discussed in his text book. See theory section 6.
  2. Predict each variable in your data set sequentially from all the others. Methods using this approach are known as chained methods.

Both of these methods are proper imputations and both can be run as multiple imputations. Methods to combine the imputations can then be used that allow the standard errors of estimates to be increased to allow for the missing data.
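
The chained approach can be sketched in a few lines of Python. This is a deliberately simplified illustration for two continuous variables, not what any of the packages used here actually does: each variable is regressed on the other, missing values are replaced by the prediction plus random noise, and the cycle is repeated. It uses point estimates of the regression parameters rather than drawing them, so it is not a proper imputation in the sense described above. All names and the toy data are hypothetical.

```python
import random

def fit_line(xs, ys):
    """Least-squares intercept, slope and residual s.d. for y on x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    rss = sum((y - intercept - slope * x) ** 2 for x, y in zip(xs, ys))
    return intercept, slope, (rss / (n - 2)) ** 0.5

def chained_impute(rows, cycles=10, seed=1):
    """Return one completed copy of `rows` (pairs, None = missing)."""
    rng = random.Random(seed)
    missing = [[v is None for v in row] for row in rows]
    data = [list(row) for row in rows]
    # Starting values: fill each gap with the observed mean of its column.
    for j in (0, 1):
        observed = [row[j] for row in rows if row[j] is not None]
        start = sum(observed) / len(observed)
        for row in data:
            if row[j] is None:
                row[j] = start
    for _ in range(cycles):
        for j in (0, 1):           # variable being imputed
            k = 1 - j              # the other variable, used as predictor
            fit = [r for r, m in zip(data, missing) if not m[j]]
            a, b, sd = fit_line([r[k] for r in fit], [r[j] for r in fit])
            for r, m in zip(data, missing):
                if m[j]:
                    r[j] = a + b * r[k] + rng.gauss(0, sd)
    return data

# Toy data, invented for illustration only.
rows = [(1.0, 2.1), (2.0, 3.9), (3.0, None), (4.0, 8.2), (None, 9.8), (6.0, 12.1)]
completed = chained_impute(rows)
print(completed)
```

Running the function several times with different seeds would give several completed data sets, which is the starting point for multiple imputation.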

6.2.2 Different practical approaches

The scores mentioned above were derived from a large number of individual questions about offending behaviour. We have a choice of either

  • Imputing the total scores at each sweep
  • Imputing the original questions and recalculating scores (102 variables)

The second of these will have the advantage of obtaining scores from people who have answered some, but not all, of the questions that make up the score.

As well as the choices between methods and approaches, we need to decide which other variables in the data set to use to predict what values the missing data should take. We also need to decide how the imputation of each variable is to be handled, what model to use and which other variables will be used to impute it.


6.3 Getting Started

From links in this section you can:-

  • Download or open the data files
  • Analyze them with any of the 4 packages you have available
  • View the code (with comments) and the output, even if you don't have the software.

To start, click the mini guide for the statistical package you want to use to analyse Exemplar 6.

For additional help click on the appropriate novice guide.

For details of the data sets see below.


This exemplar has two different data files: one with the scores and some other variables (ex6), and one with all 102 detailed questions (ex6det). See below (ex6 and ex6det) for a list of variables in each one.

# Indicates a web page is to view program code and results outside packages.

* Indicates that file should be saved to a local computer before running.

SAS
  Data sets:     ex6.sas7bdat*  ex6det.sas7bdat*
  Program code:  ex6.sas*  ex6det.sas*  ex6formats.sas*  checkimp_MACRO.SAS*  ex6prob.sas*
                 ex6sas.htm#  ex6DETsas.htm#  ex6DETprob.htm#
  Output:        ex6ressas.htm#  ex6DETressas.htm#
  Notes:         ex6prob files give details of methods that appear not to give good results.
                 checkimp_MACRO.SAS is a SAS MACRO to check imputations.

Stata
  Data sets:     ex6.dta  ex6det.dta
  Program code:  ex6.do*  ex6det.do*  ex6Stata.htm#  ex6DETStata.htm#
  Output:        ex6resStata.htm#  (results in detail to follow)

SPSS
  Data sets:     ex6.sav
  Program code:  ex6.sps*  ex6spss.htm#
  Output:        ex6resspss.htm#

R
  Data sets:     ex6.RData  ex6det.RData
  Program code:  ex6.R  exdet6.R  ex6R.htm#
  Output:        ex6resR.htm#

Note: the current R implementations of chained equations and NORM are not working correctly. This has been reported by us and by others and we will let you know when it is fixed.

The MICE R library is not currently on the R web site, but it can be obtained directly from the authors' web site. Versions that worked with previous versions of R (and that were used for the code here) are available.


6.4 Results using a small subset of variables

6.4.1 Imputing prevalence

The data for this exemplar consist of three types of scores (prevalence, volume and variety) that are calculated from a set of questions. One way to approach this is to use the total scores and impute the missing values of the totals. An alternative would be to impute the answers to the individual questions. The first approach makes the problem smaller but might be seen as less satisfactory. It means that the totals have to be imputed whenever there is item non-response to any of the questions. As we will see, it can also lead to inconsistencies between variables.

We start by simplifying the problem and trying to impute the 1/0 variable of any reported delinquency at each sweep. These are binary variables, so obviously not normally distributed. But the literature suggests that procedures using the multivariate normal distribution can work even when the assumptions are quite wrong.

If we just use the 6 values of the delinquency prevalence we can use a repeated measures analysis of variance that will adjust correctly for the missing data without any complicated imputation being required, but it makes the assumption of multivariate normality. Only some repeated measures programs will do this correctly (see below section on software).

The programs for this exemplar carry out procedures to deal with the missing data in different ways. These include:-

1. Using all available data at each sweep
2. Using only those pupils who responded at all 6 sweeps (complete cases)
3. Carrying out a repeated measures analysis
4. Imputing prevalence at other sweeps using chained methods with logistic regressions
5. Imputing prevalence at other sweeps assuming a normal distribution
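
The first two approaches in this list need no model at all, and the difference between them can be sketched in Python as follows. The toy data are invented for illustration (1/0 prevalence for five pupils at three sweeps, None for a missed sweep) and the function names are hypothetical.

```python
def available_case_means(rows):
    """Mean at each sweep using everyone who responded at that sweep."""
    n_sweeps = len(rows[0])
    means = []
    for s in range(n_sweeps):
        observed = [row[s] for row in rows if row[s] is not None]
        means.append(sum(observed) / len(observed))
    return means

def complete_case_means(rows):
    """Mean at each sweep using only pupils who responded at every sweep."""
    complete = [row for row in rows if all(v is not None for v in row)]
    return available_case_means(complete)

# Invented toy data: the pupils who drop out are the ones reporting offences.
rows = [
    (1, 1, 1),
    (1, 1, None),
    (0, 0, 0),
    (1, None, None),
    (0, 1, 0),
]
print(available_case_means(rows))
print(complete_case_means(rows))
```

In this toy example the complete cases give lower prevalence at the first two sweeps, because the pupils lost to follow up were the more delinquent ones.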

Figure 6.3 illustrates the results from these analyses for the 1/0 variable measuring delinquency prevalence over time. Note the expanded scale here compared to Figure 6.2.

Figure 6.3 Results of imputations from a simple model

All the imputation methods give results that are fairly close to one another and differ from those for all the available data only at the sixth sweep, and then by only a modest amount. Using only complete cases (listwise deletion) gives much lower rates. This might have been expected since those who have completed the survey on every occasion might be expected to be a more law-abiding group. The complete data from each sweep and the imputed data agree well. The drop in delinquency at S5 and S6 seems, from this analysis, to be mainly genuine and not just the result of sample attrition.

Sweep 6         Boys                          Girls
Imputation      Prevalence   Standard error   Prevalence   Standard error
1               67.79%       1.02%            61.48%       1.03%
2               66.83%       1.02%            62.32%       1.03%
3               68.16%       1.01%            62.97%       1.03%
4               68.48%       1.02%            62.04%       1.03%
Combined        67.82%       1.29%            62.20%       1.24%
Original data   64.72%       1.37%            56.72%       1.31%

Table 6.2 Individual and combined estimates from multiple imputation (chained equations)

We can also use the post imputation procedures to see how the imputation procedure has affected the standard errors of the estimates, always assuming that the biases have been corrected adequately by the model fitted. Table 6.2 compares the estimated prevalence at sweep 6 from 4 imputations and gives the combined estimate. The combined estimate has a mean that is the mean of the 4 imputations, but its standard error has been increased by the variation between the imputations. The combined imputed estimate gives a lower standard error than the original data, indicating that as well as reducing bias, the imputation procedure has recovered information for the missing cases.
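
The combining step can be reproduced by hand. The sketch below assumes the usual combining rules (generally attributed to Rubin): the combined estimate is the mean of the individual estimates, and its variance is the mean within-imputation variance plus (1 + 1/m) times the between-imputation variance. Applied to the four rows of Table 6.2 it gives values that agree with the 'Combined' row to the precision shown. The function name is hypothetical.

```python
from math import sqrt
from statistics import mean, variance

def combine(estimates, std_errors):
    """Combine m estimates and their standard errors across imputations."""
    m = len(estimates)
    combined = mean(estimates)
    within = mean([se ** 2 for se in std_errors])   # average sampling variance
    between = variance(estimates)                   # divisor m - 1
    total = within + (1 + 1 / m) * between
    return combined, sqrt(total)

# Sweep 6 prevalence (%) and standard errors from Table 6.2.
boys = combine([67.79, 66.83, 68.16, 68.48], [1.02, 1.02, 1.01, 1.02])
girls = combine([61.48, 62.32, 62.97, 62.04], [1.03, 1.03, 1.03, 1.03])

print("boys  %.2f (s.e. %.2f)" % boys)
print("girls %.2f (s.e. %.2f)" % girls)
```

The between-imputation term is what makes the combined standard error larger than any of the individual ones.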

6.4.2 Conclusion from using these simple models

This all looks very reassuring. The complete case analysis showed that the data were not missing completely at random. However, the pattern of imputed data at each wave did not suggest that the results would have been very different from an analysis that used all the data available from each sweep. Sweep 6 is most strongly affected by the missing data but differences are small. It was very reassuring that all the packages were in reasonable agreement, and that different methodologies gave very close results.

Code to see how to run these analyses in different packages can be accessed as web pages here.

ex6spss.htm
ex6sas.htm
ex6Stata.htm
ex6sR.htm

The output they produce can be viewed as web pages here.

ex6resspss.htm
ex6ressas.htm
ex6resStata.htm
ex6resR.htm


6.5 Results using a large number of derived variables

6.5.1 Methods

The literature on imputation recommends that as large a model as possible should be used in the imputation and, in particular, that all the variables to be used in any analysis models should be included in the imputation. To follow these rules we attempted imputation using all the variables saved in the data set with the scores (see below).

Two methods were compared, chained equations and NORM. For chained equations the variable types available had to be selected and the options available depended on the choice of package used, see theory section on imputation section 6.7.4. Two packages (SAS and Stata) are illustrated. They offer different choices for modelling. The greatest challenge was how to model the variety and, particularly, the volume of delinquency. In Stata the variety score was modelled as an ordered categorical variable and with the SAS IVEWARE package it was modelled as a Poisson variable. In both cases the volume measure was modelled as a normal distribution, but in SAS this was modified as a normal distribution for part of the data with a spike at zero for the remainder. In both cases the data needed to be sorted after the imputation to ensure that variety and volume were always zero together.

6.5.2 Results

The imputations were checked to make sure that the distributions seemed reasonable and no obvious problems were seen. The results files from each package illustrate some of this.

Table 6.3 compares the mean scores and their standard errors by gender for sweeps 1 and 6. The same variables are used in each case but three different methods are compared. There is little change from the original data in the sweep 1 scores, but the sweep 6 scores, especially the prevalence, are somewhat higher than the observed data and also a little higher than the imputation from the smaller model (Table 6.2). The means of all methods are very similar, but the standard errors for the sweep 6 scores are lower for the NORM method. This may be the result of the truncation of negative values to zero. It is also seen in the Stata implementation of chained equations where this procedure had to be used, but not in the IVEWARE model where the volume was only imputed when the prevalence was zero.

PREVALENCE
                    Boys                                    Girls
Imputation          sweep 1   s.e.      sweep 6   s.e.      sweep 1   s.e.      sweep 6   s.e.
NORM                79.2%     0.90%     71.71%    1.3%      67.7%     1.03%     62.7%     1.16%
chained (iveware)   79.20%    0.90%     70.56%    2.2%      67.59%    1.03%     62.91%    1.67%
chained (Stata)     79.00%    0.97%     69.80%    1.45%     67.42%    1.00%     61.72%    1.42%
Original data       79.0%     0.99%     64.7%     1.37%     67.2%     1.0%      56.7%     1.31%

VOLUME
                    Boys                                    Girls
Imputation          sweep 1   s.e.      sweep 6   s.e.      sweep 1   s.e.      sweep 6   s.e.
NORM                11.12     0.33      8.80      0.30      6.19      0.21      5.84      0.25
chained (iveware)   11.35     0.36      9.78      0.60      6.29      0.22      5.86      0.46
chained (Stata)     11.18     0.28      9.01      0.41      6.22      0.28      5.83      0.33
Original data       10.75     0.29      6.60      0.27      5.70      0.29      3.90      0.25

VARIETY
                    Boys                                    Girls
Imputation          sweep 1   s.e.      sweep 6   s.e.      sweep 1   s.e.      sweep 6   s.e.
NORM                3.01      0.062     1.98      0.055     2.00      0.040     1.29      0.043
chained (iveware)   3.02      0.068     2.03      0.095     2.00      0.050     1.15      0.067
chained (Stata)     3.02      0.057     2.03      0.080     2.00      0.057     1.29      0.061
Original data       2.98      0.057     1.64      0.048     1.95      0.057     0.97      0.045

Table 6.3. Combined estimates of scores at sweeps 1 and 6 by gender and by imputation method, using scores.

The logistic regression results that are presented in Table 6.5 in section 6.7.2 were also slightly different for the prediction model of sweep 6 prevalence.

Code to see how to run these analyses in different packages can be accessed as web pages here.

ex6sas.htm
ex6stata.htm

The output they produce can be viewed as web pages here.

ex6ressas.htm
ex6resStata.htm


6.6 Results from imputing with individual questions

6.6.1 Methods and problems

Fitting the imputation model to the 102 individual questions, each with 8 possible response categories, proved challenging. It required many weeks of work to investigate all the possible approaches and to reject those with obvious problems. Some of the practical problems encountered are detailed in the theory page on imputation (Section 6.6) and problems with individual packages are documented in the code for each one. See especially the SAS file that documents problems with chained equations, ex6DETprob.htm.

Finally, two different acceptable chained-equations analyses were achieved using two different packages (Stata and SAS). In the Stata analysis each question was modelled as an ordered categorical variable. To include all of the questions in every model together would have produced models that were too big and would fail. One solution could be to write 102 equations giving details of which other variables were to be used to impute each one. The alternative of dividing the questions into subsets of related questions was used instead, since it was easier to program. The SAS IVEWARE model used a COUNT variable to model the responses to each question.
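
How the subsets were actually defined is documented in the program files, not here. Purely as an illustration of the idea, the Python sketch below groups the question variables by the three-letter item code in their names (for example SHP in YAYSHP01), so that each question would be predicted only from the same question at other sweeps. Both the grouping rule and the function names are assumptions made for this example.

```python
from collections import defaultdict

def parse_name(name):
    """Split a name such as 'YAYSHP01' into (sweep letter, item code)."""
    return name[1], name[3:6]

def predictors_by_item(names):
    """Map each variable to the other variables sharing its item code."""
    groups = defaultdict(list)
    for name in names:
        groups[parse_name(name)[1]].append(name)
    return {
        name: [other for other in groups[parse_name(name)[1]] if other != name]
        for name in names
    }

names = ["YAYSHP01", "YBYSHP01", "YCYSHP01", "YAYVND01", "YBYVND01"]
for target, predictors in predictors_by_item(names).items():
    print(target, "<-", predictors)
```

A grouping like this keeps each imputation model small, at the cost of ignoring relationships between different types of offence.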

The NORM analyses also gave some problems and had to be run in two halves because of resource problems. Scanning the results for individual questions after NORM, it was clear that, compared to the observed data, a large proportion of the missing data had been imputed as having done an offence just once. These results were not extreme, however, and it is not clear that they would have been recognised as an error had the results from other methods not been there to compare. The NORM procedure had to be run without imposing bounds on the range of values. When bounds were imposed the program failed because the predictive distribution repeatedly gave a value outside the range.

Looking at the detailed questions makes one realise the problems of imputation. One of the offending questions relates to "skiving school". School leavers, who accounted for many of the non-responders, could hardly have answered this in any way. But the other questions are all relevant. It was clear on inspecting the data that the answers to many individual questions tracked very strongly from one period to another and the imputed data preserved this pattern.

6.6.2 Results

After imputing from the individual questions, the scores for all respondents are then calculated from the imputed data. The means for each score at sweeps 1 and 6 are given by gender in Table 6.4.

PREVALENCE
                    Boys                                    Girls
Imputation          sweep 1   s.e.      sweep 6   s.e.      sweep 1   s.e.      sweep 6   s.e.
NORM                80.7%     0.85%     77.55%    1.01%     69.4%     1.00%     69.5%     1.01%
chained (iveware)   79.50%    0.90%     76.53%    1.32%     67.8%     1.04%     68.9%     1.18%
chained (Stata)     80.3%     0.90%     73.6%     1.08%     68.2%     0.97%     66.0%     1.06%
Original data       79.0%     0.99%     64.7%     1.37%     67.2%     1.0%      56.7%     1.3%

VOLUME
                    Boys                                    Girls
Imputation          sweep 1   s.e.      sweep 6   s.e.      sweep 1   s.e.      sweep 6   s.e.
NORM                11.09     0.31      9.29      0.27      6.09      0.20      5.85      0.17
chained (iveware)   10.81     0.34      10.83     0.84      5.78      0.21      6.02      0.46
chained (Stata)     11.05     0.27      9.69      0.26      5.90      0.27      5.61      0.26
Original data       10.75     0.29      6.60      0.27      5.70      0.29      3.90      0.25

VARIETY
                    Boys                                    Girls
Imputation          sweep 1   s.e.      sweep 6   s.e.      sweep 1   s.e.      sweep 6   s.e.
NORM                3.27      0.064     3.05      0.078     2.18      0.052     1.78      0.051
chained (iveware)   2.98      0.062     2.27      0.089     1.95      0.046     1.34      0.047
chained (Stata)     3.02      0.054     2.08      0.044     1.97      0.05      1.22      0.042
Original data       2.98      0.057     1.64      0.048     1.95      0.057     0.97      0.045

Table 6.4. Combined estimates of scores at sweeps 1 and 6 by gender and by imputation method, using individual questions.

The trends in these means are compared with the earlier results in the next section. But a few points stand out in Table 6.4. The NORM procedure gives increased scores for variety, in line with the number of questions imputed with a value of 'having done once'. Apart from this the means of the methods are in fairly good agreement. The differences between the imputation methods are, however, rather larger than the standard errors from the post-imputation procedures, and these standard errors vary considerably between methods. Perhaps more imputations should have been run to get a better estimate of them. But more likely, the differences are an aspect of 'model uncertainty'.

Logistic regressions predicting prevalence at sweep 6 for these data are presented in the next section.

Code to see how to run these analyses in different packages can be accessed as web pages here.

ex6DETsas.htm
ex6DETStata.htm

The output they produce can be viewed as web pages here.

ex6DETressas.htm
ex6DETresStata.htm


6.7 Comparing methods and conclusions

6.7.1 Trends in mean scores over sweeps.

Figure 6.5 shows the mean prevalence, by gender, for all respondents over all 6 sweeps. We can see that our original suspicion, that those who were not followed up at the later sweeps were the more delinquent, is now confirmed by the results of the imputation, especially those that impute from the detailed questions. Differences between methods are less pronounced than differences between the variables selected for the imputation. Compared to the smaller imputation, the prevalence at sweeps 5 and 6 was higher for all the imputation methods.

Figure 6.5 Prevalence over all 6 sweeps imputed by different methods

Figure 6.6 shows the same data for variety of offending. Differences here are smaller except for imputing with a normal distribution from the detailed questions. This was due to the high proportion of missing values that were imputed as 'one time' rather than zero.

Figure 6.6 Variety over all 6 sweeps imputed by different methods

Figure 6.7 shows the mean volume, by gender, for all respondents over all 6 sweeps. Here all methods are in reasonable agreement with the possible exception of imputing the scores from a normal distribution, which tends to give higher values, especially for girls. There is a modest increase in volume seen at sweep 6, compared to the original data.

Figure 6.7 Volume over all 6 sweeps imputed by different methods

6.7.2 Modelling prevalence by logistic regression

Logistic regressions to predict prevalence at sweep 6 were carried out for the original data and for each of the imputed data sets.

Observed
Parameter            odds ratio estimate   95% CI lower   95% CI upper   p-value
Gender F vs M        0.71                  0.60           0.83           <.0001
Deprivation          1.32                  1.10           1.57           0.0019
Mainstream School    1.0 (base)
EBD school           0.95                  0.08           11.16          0.9681
Independent school   0.73                  0.59           0.91           0.0039
Special school       0.22                  0.08           0.59           0.0021

Imputed from totals
Parameter            odds ratio estimate   95% CI lower   95% CI upper   p-value
Gender F vs M        0.70                  0.59           0.84           0.0012
Deprivation          1.40                  1.18           1.66           0.0003
Mainstream School    1.0 (base)
EBD school           1.14                  0.15           8.38           0.9022
Independent school   0.62                  0.49           0.79           0.0002
Special school       0.24                  0.11           0.53           0.0007

Imputed from questions
Parameter            odds ratio estimate   95% CI lower   95% CI upper   p-value
Gender F vs M        0.68                  0.58           0.80           <.0001
Deprivation          1.73                  1.47           2.02           <.0001
Mainstream School    1.0 (base)
EBD school           4.43                  0.79           24.74          0.0900
Independent school   0.53                  0.43           0.65           <.0001
Special school       0.29                  0.14           0.61           0.00

Table 6.5. Logistic regressions predicting prevalence at sweep 6 from gender, deprivation and sector
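
For readers less familiar with how the odds ratios in Table 6.5 relate to the fitted model, the following Python sketch shows the usual conversion. It assumes that the intervals are Wald intervals computed on the log-odds scale with a multiplier of 1.96, which the table itself does not state; the function names are hypothetical.

```python
from math import exp, log

def odds_ratio_ci(beta, se, z=1.96):
    """Odds ratio and 95% interval from a log-odds coefficient and its s.e."""
    return exp(beta), exp(beta - z * se), exp(beta + z * se)

def coefficient_from_ci(odds_ratio, lower, upper, z=1.96):
    """Recover the approximate coefficient and s.e. from a reported interval."""
    return log(odds_ratio), (log(upper) - log(lower)) / (2 * z)

# Gender (F vs M) in the observed data: 0.71 (0.60 to 0.83).
beta, se = coefficient_from_ci(0.71, 0.60, 0.83)
print("coefficient %.3f, s.e. %.3f" % (beta, se))
print("odds ratio %.2f (%.2f to %.2f)" % odds_ratio_ci(beta, se))
```

An odds ratio below 1 corresponds to a negative coefficient, so the gender row says that the odds of reporting any offence are lower for girls than for boys.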

All the analyses that imputed from the scores gave similar results for the coefficients, as did all the analyses that imputed from the detailed questions. This included the NORM method in each case. The values in Table 6.5 are from the IVEWARE imputations.

All the three analyses agree that girls offend less than boys, that deprivation is associated with more offending and that those attending independent schools offend less.

The analysis of data imputed from the total scores differs only slightly from that of the observed data. But data imputed from the individual questions give a much stronger effect for deprivation and a somewhat stronger effect for independent schools.

The numbers in the other two school sectors are smaller and so the estimates are more volatile. All three analyses agree that a non-behavioural special school is associated with less offending. The effect of having been at a school for pupils with behaviour problems shows up only in the analysis that imputes from the individual questions and is only marginally significant. This may well be because of a high percentage of missing data for this group. It would be possible to check this in the data file. I have not had time, but the data are here on the site if you want to.

If time had allowed, more modelling analyses would have been added to this site. To allow you to try this yourselves we provide the data sets with the imputed data (10 imputations) in SAS, Stata and SPSS format.

                                  SAS                Stata          SPSS
Imputed from scores               ivesc.sas7bdat     ivesc.dta      ivesc.sav
                                  MICEsc.sas7bdat    MICEsc.dta     MICEsc.sav
                                  NORMsc.sas7bdat    NORMsc.dta     NORMsc.sav
Imputed from detailed questions   ivedet.sas7bdat    ivedet.dta     ivedet.sav
                                  MICEdet.sas7bdat   MICEdet.dta    MICEdet.sav
                                  NORMdet.sas7bdat   NORMdet.dta    NORMdet.sav

NOTES

Each data set has 10 imputations

The data sets imputed from detailed questions do not have a full range of other variables, but they could easily be added from the other data sets.


RUN ex6formats.sas before using these

The SAS data sets have the imputation number identified by the variable _IMPUTATION_.

The Stata data sets have the imputation number identified by the variable _j and the sequence number in the original file identified by _i.

The SPSS data sets have the imputation number identified by the variable _j and the sequence number in the original file identified by _i.
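
Because the imputations are stacked in one file, the first step in any analysis is to split the rows by imputation number. The Python sketch below shows this for rows already read into dictionaries. The variable name dprev6 is assumed from the variable list (dprev1-6), the function name is hypothetical, and the rows are invented for illustration.

```python
from collections import defaultdict

def prevalence_by_imputation(rows, imp_var="_j", score_var="dprev6"):
    """Mean of a 0/1 score within each imputed copy of the data."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[imp_var]].append(row[score_var])
    return {j: sum(values) / len(values) for j, values in sorted(groups.items())}

# Invented rows: two imputations of two pupils each.
stacked = [
    {"_j": 1, "_i": 1, "dprev6": 1},
    {"_j": 1, "_i": 2, "dprev6": 0},
    {"_j": 2, "_i": 1, "dprev6": 1},
    {"_j": 2, "_i": 2, "dprev6": 1},
]
print(prevalence_by_imputation(stacked))
```

For the SAS files the same function could be used with imp_var set to _IMPUTATION_. The per-imputation estimates would then be combined as described for Table 6.2.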

6.7.3 Conclusions.

These are tentative conclusions based on an extensive investigation of one data set, but they may apply to other cases.

  • Imputation is not an easy option and it needs a considerable investment in time and expertise
  • All imputed data, especially model-based methods, should be thoroughly checked
  • The choice of variables to use in the imputations may be more important than the method used
  • If there is a substantial proportion of missing data and bias is important then using a NORMal distribution method when it is not appropriate may be a problem

Imputation is an art and missing data are a challenge. Any attempt to deal with missing data (even by ignoring it) is full of assumptions. We can never be sure what the missing data would have been had we got their answers.

But decisions have to be made. If I had to decide in this case I would go for the analysis that imputes from the individual questions using one or other implementation of chained equations.


6.8 Details of the survey

This was a simple longitudinal survey. All secondary schools in Edinburgh were approached. Although it was clustered by school, this was a natural clustering rather than one induced by the survey design. We have not used the information on clustering in the analyses presented here, although this might be possible with some of the software discussed.

The longitudinal data set has been produced by linking the data from all sweeps of this survey (S1 to S6). The data from sweeps 1 to 4 of the survey are available from the UK data archive. Most of the data used in this exemplar come from the young people's questionnaires, which can be viewed at the survey web site. The data archive documentation explains the codes for the variables used from the questionnaires at the first 4 sweeps, and the conventions continue in an obvious way at the final two sweeps. Details of how the data for use in this exemplar have been extracted from the larger data files are available here.

The questions that contribute to the prevalence, volume and variety scores for delinquency differed slightly for each of the sweeps. They are listed in the codes for the data sets with individual questions (below).

The data sets with total scores (named ex6.***)
Variable Label
CASEID Unique case identifier
ETHGP White or non-white
GENDER Gender (1=m 2=f)
HZEVREFI HZ. Whether ever referred on offence
HZRESPRE HZ. Whether had a hearing record at any sweep
RAALCY80 RA Frequency of drinking alcohol sweep 1
5 = 'Few times/wk' 4 = 'Weekly' 3 = 'Monthly'
2 = 'Special occasions' 1 = 'Hardly ever' 0 = 'Non-drinker'
RBALCY80 RB. Frequency of drinking alcohol sweep 2
RCALCY80 RC. Frequency of drinking alcohol sweep 3
RDALCY80 RD. Frequency of drinking alcohol sweep 4
REALCY80 RE Frequency of drinking alcohol sweep 5
RFALCY80 RF Frequency of drinking alcohol sweep 6
SASMKY80 SA Frequency of smoking sweep 1 4=daily 3=weekly
2=less than weekly 1=hardly ever 0=non smoker
SBSMKY80 SB Frequency of smoking sweep 2
SCSMKY80 SC Frequency of smoking sweep 3
SDSMKY80 SD Frequency of smoking sweep 4
SESMKY80 SE Frequency of smoking sweep 5
SFSMKY80 SF Frequency of smoking sweep 6
SZABS04 SZ. Whether school record for truancy at any time
SZINDEP SZ. Individual deprivation/socioeconomic 1=y 2=n
YZLEAVE YZ. Whether left school at earliest opportunity 1=y 2=n
dprev1-6 prevalence of offending sweeps 1-6 (0/1)
drgcode1 Drug use summary code at sweeps 1-6
0='none' 1='only cannabis' 2='only glue' 3='other drug or combination'
dvol1-6 volume of offending sweeps 1-6
sector Independent, special or behavioural education at any time
1=mainstream 2=private 3=special physical or learning disabilities
4=special/behavioural

The data sets with individual questions(named ex6det.***)

This data set includes 102 questions, each giving the number of times each participant reported having done each of the items: 0 = not done; 1 to 5 represent the number of times; 6 = 5 to 10 times; 7 = 10 or more times.

  Variable Label
39 YAYARS01 YA Ever set fire to something
21 YAYBOP01 YA Ever rowdy or rude in public
17 YAYBUS01 YA Ever dodged paying correct fare
43 YAYCBK01 YA Ever broken into a vehicle to steal
33 YAYGRF01 YA Ever written or sprayed graffiti
31 YAYHBK01 YA Ever broken into a house or building
41 YAYHIT01 YA Ever hit, kicked or punched someone
37 YAYHOM01 YA Ever stolen something from home
23 YAYJRD01 YA Ever stolen/ridden in a stolen vehicle
35 YAYROB01 YA Ever used force/threats/weapon
25 YAYSCL01 YA Ever stolen something from school
19 YAYSHP01 YA Ever stolen from a shop
45 YAYSKV01 YA Ever skived school
29 YAYVND01 YA Ever vandalised property
27 YAYWEP01 YA Ever carried a knife or weapon
94 YBYARS01 YB Set fire to something in last year
76 YBYBOP01 YB Noisy or cheeky in public in last ye
72 YBYBUS01 YB Dodged paying correct fare in last y
98 YBYCBK01 YB Broken into a vehicle to steal in la
88 YBYGRF01 YB Wrote or sprayed graffiti in last ye
86 YBYHBK01 YB Broken into a house or building to s
96 YBYHIT01 YB Hit, kicked or punched someone in la
92 YBYHOM01 YB Stolen something from home in last y
78 YBYJRD01 YB Ridden in a stolen vehicle in last y
90 YBYROB01 YB used force/threats/weapon to rob so
80 YBYSCL01 YB Stolen something from school in last
74 YBYSHP01 YB Stolen something from a shop in last
100 YBYSKV01 YB Skived school in last year
84 YBYVND01 YB Vandalised property in last year
82 YBYWEP01 YB Carried a knife or weapon in last ye
152 YCYARS01 YC: Set fire to something in last year
134 YCYBOP01 YC: Noisy or cheeky in public in last ye
130 YCYBUS01 YC: Dodged paying correct fare in last y
156 YCYCBK01 YC: Broken into a vehicle to steal in la
162 YCYDRG01 YC: Sold an illegal drug in last year
146 YCYGRF01 YC: Wrote or sprayed graffiti in last ye
144 YCYHBK01 YC: Broken into a house or building to s
154 YCYHIT01 YC: Hit, kicked or punched someone in la
150 YCYHOM01 YC: Stolen something from home in last y
136 YCYJRD01 YC: Ridden in a stolen vehicle in last y
160 YCYPET01 YC: Cruel to animals/birds in last year
164 YCYRAB01 YC: Racially abused someone in last year
148 YCYROB01 YC: Used force/threats/weapon to rob some
138 YCYSCL01 YC: Stolen something from school in last
132 YCYSHP01 YC: Stolen something from a shop in last
158 YCYSKV01 YC: Skived school in last year
142 YCYVND01 YC: Vandalised property in last year
140 YCYWEP01 YC: Carried a knife or weapon in last ye
219 YDYARS01 YD: Set fire to something in last year
201 YDYBOP01 YD: Noisy or cheeky in public in the last year
197 YDYBUS01 YD: Dodged paying correct fare in last y
223 YDYCBK01 YD: Broken into vehicle to steal in last
229 YDYDRG01 YD: Sold an illegal drug in last year
213 YDYGRF01 YD: Wrote or sprayed graffiti in last year
211 YDYHBK01 YD: Broken into house\building to steal
221 YDYHIT01 YD: Hit, kicked, punched someone in last year
217 YDYHOM01 YD: Stolen something from home in last year
203 YDYJRD01 YD: Ridden in stolen vehicle in the last year
227 YDYPET01 YD: Cruel to animals\birds in last year
231 YDYRAB01 YD: Racially abused someone in last year
215 YDYROB01 YD: Used force\threats\weapon in last ye
205 YDYSCL01 YD: Stolen something from school in last
199 YDYSHP01 YD: Stolen something from shop in last y
225 YDYSKV01 YD: Skived school in last year
209 YDYVND01 YD: Vandalised property in last year
207 YDYWEP01 YD: Carry knife or weapon in last year
267 YEYWEP01 YE: Carry knife or weapon in last year
332 YFYARS01 YF: Set fire to something in last year
350 YFYBFT01 YF: Claimed benefits not entitled to in
320 YFYBOP01 YF: Noisy or cheeky in public in the las
336 YFYCBK01 YF: Broken into vehicle to steal in last
342 YFYDRG01 YF: Sold an illegal drug in last year
352 YFYFRD01 YF: Cheque, credit fraud etc, in last ye
328 YFYHBK01 YF: Broken into house\building to steal
334 YFYHIT01 YF: Hit, kicked,punched someone in last y
322 YFYJRD01 YF: Ridden in stolen vehicle, ever done
340 YFYPET01 YF: Cruel to animals\birds in last year
344 YFYRAB01 YF: Racial abuse, ever done it
330 YFYROB01 YF: Robbed person of property in last ye
346 YFYRST01 YF: Sold on stolen property in last year
348 YFYRST21 YF: Bought stolen property in last year
349 YFYRST22 YF: Bought stolen property, times done i
318 YFYSHP01 YF: Stolen something from shop in last y
338 YFYSKV01 YF: Skived school in last year
326 YFYVND01 YF: Vandalised property in last year
324 YFYWEP01 YF: Carry knife or weapon in last year
P|E|A|S project 2004/2005/2006