Exemplar 6: Young people and delinquency

P|E|A|S

6: Young people and delinquency

> 6.1 Background > 6.2 Types of imputation > 6.3 Getting started

> 6.4 Results using a small subset of variables > 6.5 Results using a large number of derived variables
> 6.6 Results using individual questions

> 6.7 Comparisons and conclusions
> 6.7.1 Trends over time > 6.7.2 Logistic regression > 6.7.3 Conclusions

> 6.8 Details of this survey and the data

6.1 Background

This exemplar is based on data from the Edinburgh Study of Youth Transitions and Crime. This is a longitudinal survey that has collected data on young people through their six years of secondary education (ages 12 to 17). A cohort of pupils beginning their secondary school education in Edinburgh in autumn 1998 was the target population. Only a few schools and a small proportion of eligible pupils were withdrawn by their parents. Further information about the project can be found at http://www.law.ed.ac.uk/cls/esytc. The data for the earlier sweeps of the study can be accessed at the ESRC data archive, and the later ones will be there in time.

To illustrate multiple imputation for longitudinal data we are using data collected at all six sweeps. Each year the participants completed a detailed questionnaire. For the first 4 years almost all the participants were in school and the response rate was excellent. At years 5 and 6 it was slightly worse, as a proportion of young people left school and moved away from home.

The population who were eligible for the study changed over time as families moved in and out of the city. For the longitudinal analysis we considered the 4328 participants who were eligible to participate in year 4. The response rate at each year is shown in Figure 6.1. The non-responders in years 1 and 2 were largely those who moved into study schools after this. From year 4 onwards the decline is largely due to the worse response rate from those who had left school and had to be followed up elsewhere. 74% of participants responded at all 6 sweeps.

Figure 6.1 Response rates by yearly sweep

In this exemplar we will be looking at how the rates of self-reported delinquency change over the 6 years. The questionnaire contained a series of questions like the following example:-

In the last year did you steal something from a shop or a store?
If ‘yes’ how many times?

There were between 15 and 18 such questions, depending on the sweep of the questionnaire. Three measures of delinquency were derived from each set of questions:

Prevalence – 1 if a yes response to any question, else 0.
Variety – Number of questions answered yes, out of the total.
Volume – Sum of the number of times for all questions answered yes, with the answer 'more than 10 times' coded as 11.
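The three measures can be sketched in a few lines of Python. This is only an illustration, not the code used by the study: the data layout (one list of reported counts per pupil and sweep, with None for an unanswered question) and the function name are our own assumptions, and variety is taken to be the number of questions answered yes. It also applies the rule that a missing answer to any question gives a missing score.

```python
from typing import Optional, Sequence

# Hypothetical coding for one sweep: each entry is the reported number of
# times for one question (0 = "no"), with "more than 10 times" stored as 11
# and None marking an unanswered question.
Answers = Sequence[Optional[int]]


def delinquency_scores(answers: Answers) -> Optional[dict]:
    """Return prevalence, variety and volume, or None if any answer is missing."""
    if any(a is None for a in answers):
        return None  # a single missing question makes all three scores missing
    yes_counts = [a for a in answers if a > 0]
    return {
        "prevalence": 1 if yes_counts else 0,
        "variety": len(yes_counts),  # number of questions answered "yes"
        "volume": sum(yes_counts),   # total number of reported incidents
    }


print(delinquency_scores([0, 2, 0, 11, 1]))
print(delinquency_scores([0, 0, 0]))
print(delinquency_scores([0, None, 3]))
```

The last call returns None, which shows why imputing the individual questions can recover scores for pupils who answered most, but not all, of the questions.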

Being missing on any question led to a missing score. This was most pronounced at sweep 6 where some of the questions were no longer relevant to those who had left school (e.g. truancy). Figure 6.2 shows how the mean values of these measures changed over time for the observed subjects.

Figure 6.2 Mean delinquency scores by gender and school year (variety and volume divided by numbers of questions asked).

We can see that the delinquency measures peak at around secondary years 3 and 4 (S3 and S4) and decline after this. But perhaps some of this decline relates to losses to follow-up? If the more delinquent pupils failed to respond in later sweeps, the mean score for responders would be reduced.

For every longitudinal survey we need to consider what we mean by our ‘eligible sample’ since some respondents may not be asked to participate in every wave of the survey. This eligible sample may sometimes vary depending on how many sweeps of the survey we wish to look at together. For this survey it was decided to consider the eligible sample as those who were pupils in Edinburgh schools at S4, and to exclude any pupils who had moved away in earlier years. Table 6.1 shows the response patterns for all 4328 pupils in the eligible sample. An R represents a response at that wave and a dash a non-response. The pupils in the first row of Table 6.1 responded at all sweeps, those in the second row missed the last sweep, and so on. Although a large proportion of response patterns (the first 6 groups) are nested there are many others, so a purely nested analysis would not work.

Table 6.1 Response patterns for all respondents
pattern	number of pupils	comment
RRRRRR	3124	responded all 6 sweeps
RRRRR-	399	missed 6
RRRR--	192	missed 5 and 6
RRR---	55	missed 4,5,6
RR----	16	missed 3,4,5,6
R-----	5	first sweep only
RRRR-R	122	missed sweep 5 only
-RRRRR	100	missed sweep 1 only
--RRRR	69	missed 1 and 2
others	246	37 other distinct response patterns
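As a quick check on how far a purely nested analysis would get, the counts in Table 6.1 can be tallied as below. This is a minimal sketch: the function name and the rule used to decide whether a pattern is nested (responses followed only by non-responses) are our own reading of the table. The 246 pupils in the "others" row are counted as not nested, since the table already lists all six possible drop-out patterns.

```python
# Counts copied from Table 6.1; "others" covers the 37 remaining patterns.
PATTERN_COUNTS = {
    "RRRRRR": 3124,
    "RRRRR-": 399,
    "RRRR--": 192,
    "RRR---": 55,
    "RR----": 16,
    "R-----": 5,
    "RRRR-R": 122,
    "-RRRRR": 100,
    "--RRRR": 69,
}
OTHER_PATTERNS = 246


def is_nested(pattern: str) -> bool:
    """True if responses are followed only by non-responses (pure drop-out)."""
    responded = pattern.rstrip("-")
    return len(responded) > 0 and "-" not in responded


total = sum(PATTERN_COUNTS.values()) + OTHER_PATTERNS
nested = sum(n for p, n in PATTERN_COUNTS.items() if is_nested(p))
print(f"eligible sample: {total}")
print(f"nested patterns: {nested} ({nested / total:.1%})")
```

The remaining pupils have gaps or late entries in their response histories, which is why methods that can handle arbitrary patterns are needed here.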

In addition to those who did not fill in a questionnaire at some sweeps (unit non-response) we will have additional missing data from individual questions that were not answered (item non-response). For the questions on delinquency that we investigate the item non-response averaged around 5%, with a smaller proportion missing in the older age groups for most questions.

We will illustrate how different ways of imputing the non-respondents' values affect the patterns of delinquency over time.

6.2 Types of imputation

6.2.1 Theoretical differences

Different types of imputation are discussed in the theory part of this site: Imputation section 6.2. We will be illustrating only model-based imputation here. Two main approaches have been developed. Both of them involve making some quite strong assumptions about the data.

Assume a multivariate normal distribution. This is the approach implemented by Schafer, available in some packages and discussed in his text book. See theory section 6.
Predict each variable in your data set sequentially from all the others. Methods using this approach are known as chained methods.

Both of these methods are proper imputations and both can be run as multiple imputations. Methods to combine the imputations can then be used that allow the standard errors of estimates to be increased to allow for the missing data.
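To give a feel for the chained approach, here is a deliberately small sketch with just two variables, each imputed in turn from the other by a simple linear regression. It is not what any of the packages do: the function names and the toy data are hypothetical, and the regression coefficients are not redrawn at each step, so strictly speaking this is not a proper imputation. Real implementations also choose a model to suit each variable's type.

```python
import random
from statistics import mean


def fit_line(xs, ys):
    """Least-squares intercept, slope and residual standard deviation."""
    mx, my = mean(xs), mean(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    slope = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sxx
    intercept = my - slope * mx
    resid = [y - (intercept + slope * x) for x, y in zip(xs, ys)]
    sd = (sum(r * r for r in resid) / max(len(xs) - 2, 1)) ** 0.5
    return intercept, slope, sd


def chained_impute(a, b, n_iter=10, seed=0):
    """Impute None entries of two variables by alternating regressions."""
    rng = random.Random(seed)
    miss_a = [v is None for v in a]
    miss_b = [v is None for v in b]
    # Start from mean imputation, then refine each variable in turn.
    fill_a = mean(v for v in a if v is not None)
    fill_b = mean(v for v in b if v is not None)
    a = [fill_a if m else v for v, m in zip(a, miss_a)]
    b = [fill_b if m else v for v, m in zip(b, miss_b)]
    for _ in range(n_iter):
        for target, other, miss in ((a, b, miss_a), (b, a, miss_b)):
            # Fit on cases where the target was observed.
            obs = [i for i, m in enumerate(miss) if not m]
            b0, b1, sd = fit_line([other[i] for i in obs],
                                  [target[i] for i in obs])
            for i, m in enumerate(miss):
                if m:  # prediction plus random noise, not just the fitted value
                    target[i] = b0 + b1 * other[i] + rng.gauss(0, sd)
    return a, b


# Hypothetical scores at two sweeps; None marks a missing value.
sweep1 = [3.0, 1.0, None, 4.0, 0.0, 2.0, 5.0, None]
sweep2 = [2.0, None, 1.0, 3.0, 0.0, None, 4.0, 2.0]
# Different seeds give different completed data sets (multiple imputation).
imputations = [chained_impute(sweep1, sweep2, seed=s) for s in range(4)]
```

Adding random noise to each prediction is what lets the completed data sets differ from one another, and that variation is what the combining methods later use to increase the standard errors.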

6.2.2 Different practical approaches

The scores mentioned above were derived from a large number of individual questions about offending behaviour. We have a choice of either

Imputing the total scores at each sweep
Imputing the original questions and recalculating scores (102 variables)

The second of these will have the advantage of obtaining scores from people who have answered some, but not all, of the questions that make up the score.

As well as the choices between methods and approaches, we need to decide which other variables in the data set to use to predict what values the missing data should take. We also need to decide how the imputation of each variable is to be handled, what model to use and which other variables will be used to impute it.

6.3 Getting Started

From links in this section you can:-

Download or open the data files
Analyze them with any of the 4 packages you have available
View the code (with comments) and the output, even if you don't have the software.

To start, click the mini guide for the statistical package you want to use to analyse Exemplar 6.

For additional help click on the appropriate novice guide.

For details of the data sets see below.

Mini Guides

R
Stata
SPSS
SAS

Guides for Novices

mini-book

R
Stata
SPSS
SAS

This exemplar has two different data files: one with the scores and some other variables (ex6), and one with all 102 detailed questions (ex6det). See below ex6 and ex6det for a list of variables in each one.

# Indicates a web page is to view program code and results outside packages.

* Indicates that file should be saved to a local computer before running.

Package	Data Sets	Program Code	Output
SAS (the ex6prob files give details of methods that appear not to give good results; checkimp_MACRO.SAS is a SAS macro to check imputations)	ex6.sas7bdat* ex6det.sas7bdat*	ex6.sas* ex6det.sas* ex6formats.sas* checkimp_MACRO.SAS* ex6prob.sas* ex6sas.htm# ex6DETsas.htm# ex6DETprob.htm#	ex6ressas.htm# ex6DETressas.htm#
Stata	ex6.dta ex6det.dta	ex6.do* ex6det.do* ex6Stata.htm# ex6DETStata.htm#	ex6resStata.htm# results in detail to follow
SPSS	ex6.sav	ex6.sps* ex6spss.htm#	ex6resspss.htm#
R	ex6.RData ex6det.RData	ex6.R exdet6.R ex6R.htm#	ex6resR.htm#

Note: the current R implementations of chained equations and NORM are not working correctly. This has been reported by us and by others and we will let you know when it is fixed.

The MICE R library is not currently on the R web site, but it can be obtained directly from the authors' web site. Versions that worked with previous versions of R (and that were used for the code here) are available.

6.4 Results using a small subset of variables

6.4.1 Imputing prevalence

The data for this exemplar consist of three types of scores (prevalence, volume and variety) that are calculated from a set of questions. One way to approach this is to use the total scores and impute the missing values of the totals. An alternative would be to impute the answers to the individual questions. The first approach makes the problem smaller but might be seen as less satisfactory. It means that the totals have to be imputed whenever there is item non-response to any of the questions. As we will see, it can also lead to inconsistencies between variables.

We start by simplifying the problem and trying to impute the 1/0 variable of any reported delinquency at each sweep. These are binary variables, so obviously not normally distributed. But the literature suggests that procedures using the multivariate normal distribution can work even when the assumptions are quite wrong.

If we just use the 6 values of the delinquency prevalence we can use a repeated measures analysis of variance that will adjust correctly for the missing data without any complicated imputation being required, but it makes the assumption of multivariate normality. Only some repeated measures programs will do this correctly (see below section on software).

The programs for this exemplar carry out procedures to deal with the missing data in different ways. These include:-

1. Using all available data at each sweep
2. Using only those pupils who responded at all 6 sweeps (complete cases)
3. Carrying out a repeated measures analysis
4. Imputing prevalence at other sweeps using chained methods with logistic regressions
5. Imputing prevalence at other sweeps assuming a normal distribution
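The difference between the first two strategies in this list is easy to see in a toy example. The sketch below uses a handful of made-up pupils (not the study data), and the function names are our own.

```python
# Hypothetical prevalence records: one row per pupil, one 0/1 value per sweep,
# None where the pupil did not respond at that sweep.
pupils = [
    [1, 1, 1, 0, 0, 0],
    [1, 1, 1, 1, None, None],
    [0, 0, 1, 1, 1, None],
    [0, 0, 0, 0, 0, 0],
    [None, 1, 1, 1, 1, 1],
]


def available_case_rates(rows):
    """Prevalence at each sweep using everyone who responded at that sweep."""
    rates = []
    for sweep in zip(*rows):
        observed = [v for v in sweep if v is not None]
        rates.append(sum(observed) / len(observed))
    return rates


def complete_case_rates(rows):
    """Prevalence at each sweep using only pupils with no missing sweeps."""
    complete = [r for r in rows if None not in r]
    return available_case_rates(complete)


print("available cases:", available_case_rates(pupils))
print("complete cases: ", complete_case_rates(pupils))
```

In this made-up example only two of the five pupils are complete cases, so the complete-case rates rest on much less data and can differ sharply from the available-case rates.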

Figure 6.3 illustrates the results from these analyses for the 1/0 variable measuring delinquency prevalence over time. Note the expanded scale here compared to Figure 6.2.

Figure 6.3 Results of imputations from a simple model

All the imputation methods give results that are fairly close to one another. They differ from the results for all the available data only at the sixth sweep, and then by only a modest amount. Using only complete cases (listwise deletion) gives much lower rates. This might have been expected, since those who completed the survey on every occasion are likely to be a more law-abiding group. The complete data from each sweep and the imputed data agree well. The drop in delinquency at S5 and S6 seems, from this analysis, to be mainly genuine and not just the result of sample attrition.

Sweep 6	Boys		Girls
Imputation	Prevalence	Standard error	Prevalence	Standard error
1	67.79%	1.02%	61.48%	1.03%
2	66.83%	1.02%	62.32%	1.03%
3	68.16%	1.01%	62.97%	1.03%
4	68.48%	1.02%	62.04%	1.03%

Combined	67.82%	1.29%	62.20%	1.24%
Original data	64.72%	1.37%	56.72%	1.31%

Table 6.2 Individual and combined estimates from multiple imputation (chained equations)

We can also use the post-imputation procedures to see how the imputation procedure has affected the standard errors of the estimates, always assuming that the biases have been corrected adequately by the model fitted. Table 6.2 compares the estimated prevalence at sweep 6 from 4 imputations and gives the combined estimate. The combined estimate has a mean that is the mean of the 4 imputations, but its standard error has been increased by the variation between the imputations. The combined imputed estimate gives a lower standard error than the original data, indicating that as well as reducing bias, the imputation procedure has recovered information for the missing cases.
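The combination step can be reproduced by hand. The sketch below applies the usual combining rules for multiple imputation (generally attributed to Rubin) to the figures in Table 6.2: the total variance is the average within-imputation variance plus (1 + 1/m) times the between-imputation variance. We are assuming that this is what the packages' post-imputation procedures do. The assumption looks reasonable, since the sketch gives combined standard errors that agree with the table to two decimal places (1.29% for boys and 1.24% for girls). The function name is our own.

```python
from math import sqrt
from statistics import mean, variance


def combine(estimates, std_errors):
    """Pool m imputed estimates with Rubin's rules; returns (estimate, s.e.)."""
    m = len(estimates)
    pooled = mean(estimates)
    within = mean(se ** 2 for se in std_errors)  # average sampling variance
    between = variance(estimates)                # variance across imputations
    total = within + (1 + 1 / m) * between
    return pooled, sqrt(total)


# Sweep 6 values for boys and girls from Table 6.2 (in percentage points).
boys = combine([67.79, 66.83, 68.16, 68.48], [1.02, 1.02, 1.01, 1.02])
girls = combine([61.48, 62.32, 62.97, 62.04], [1.03, 1.03, 1.03, 1.03])
print(f"boys:  {boys[0]:.3f}% (s.e. {boys[1]:.2f}%)")
print(f"girls: {girls[0]:.3f}% (s.e. {girls[1]:.2f}%)")
```

With only 4 imputations the between-imputation variance is itself estimated rather imprecisely, which is one reason to run more imputations when the standard errors matter.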

6.4.2 Conclusion from using these simple models

This all looks very reassuring. The complete case analysis showed that the data were not missing completely at random. However, the pattern of imputed data at each wave did not suggest that it would have been very different from an analysis that used all the data available from each sweep. Sweep 6 is most strongly affected by the missing data but differences are small. It was very reassuring that all the packages were in reasonable agreement, and that different methodologies gave very close results.

Code to see how to run these analyses in different packages can be accessed as web pages here.

ex6spss.htm
ex6sas.htm
ex6Stata.htm
ex6R.htm

And the output they produce can be viewed as web pages here.

ex6resspss.htm
ex6ressas.htm
ex6resStata.htm
ex6resR.htm

6.5 Results using a large number of derived variables

6.5.1 Methods

The literature on imputation recommends that as large a model as possible should be used in the imputation and, in particular, that all the variables to be used in any analysis models should be included in the imputation. To follow these rules we attempted imputation using all the variables saved in the data set with the scores (see below).

Two methods were compared, chained equations and NORM. For chained equations the variable types available had to be selected and the options available depended on the choice of package used; see the theory section on imputation, section 6.7.4. Two packages (SAS and Stata) are illustrated. They offer different choices for modelling. The greatest challenge was how to model the variety and, particularly, the volume of delinquency. In Stata the variety score was modelled as an ordered categorical variable and with the SAS IVEWARE package it was modelled as a Poisson variable. In both cases the volume measure was modelled as a normal distribution, but in SAS this was modified as a normal distribution for part of the data with a spike at zero for the remainder. In both cases the data needed to be sorted after the imputation to ensure that variety and volume were always zero together.

6.5.2 Results

The imputations were checked to make sure that the distributions seemed reasonable and no obvious problems were seen. The results files from each package illustrate some of this.

Table 6.3 compares the mean scores and their standard errors by gender for sweeps 1 and 6. The same variables are used in each case but three different methods are compared. There is little change from the original data in the sweep 1 scores, but the sweep 6 scores, especially the prevalence, are somewhat higher than the observed data and also a little higher than the imputation from the smaller model (Table 6.2). The means of all methods are very similar, but the standard errors for the sweep 1 scores are lower for the NORM method. This may be the result of the truncation of negative values to zero. It is also seen in the Stata implementation of chained equations where this procedure had to be used, but not in the IVEWARE model where the volume was only imputed when the prevalence was non-zero.

PREVALENCE	Boys				Girls
Imputation	sweep 1	s.e.	sweep 6	s.e.	sweep 1	s.e.	sweep 6	s.e.
NORM	79.2%	0.90%	71.71%	1.3%	67.7%	1.03%	62.7%	1.16%
chained (iveware)	79.20%	0.90%	70.56%	2.2%	67.59%	1.03%	62.91%	1.67%
chained (Stata)	79.00%	0.97%	69.80%	1.45%	67.42%	1.00%	61.72%	1.42%
Original data	79.0%	0.99%	64.7%	1.37%	67.2%	1.0%	56.7%	1.31%
VOLUME	Boys				Girls
Imputation	sweep 1	s.e.	sweep 6	s.e.	sweep 1	s.e.	sweep 6	s.e.
NORM	11.12	0.33	8.80	0.30	6.19	0.21	5.84	0.25
chained (iveware)	11.35	0.36	9.78	0.60	6.29	0.22	5.86	0.46
chained (Stata)	11.18	0.28	9.01	0.41	6.22	0.28	5.83	0.33
Original data	10.75	0.29	6.60	0.27	5.70	0.29	3.90	0.25
VARIETY	Boys				Girls
Imputation	sweep 1	s.e.	sweep 6	s.e.	sweep 1	s.e.	sweep 6	s.e.
NORM	3.01	0.062	1.98	0.055	2.00	0.040	1.29	0.043
chained (iveware)	3.02	0.068	2.03	0.095	2.00	0.050	1.15	0.067
chained (Stata)	3.02	0.057	2.03	0.080	2.00	0.057	1.29	0.061
Original data	2.98	0.057	1.64	0.048	1.95	0.057	0.97	0.045

Table 6.3. Combined estimates of scores at sweeps 1 and 6 by gender and by imputation method, using scores.

The logistic regression results that are presented in Table 6.5 in section 6.7.2 were also slightly different for the prediction model of sweep 6 prevalence.

Code to see how to run these analyses in different packages can be accessed as web pages here.

ex6sas.htm
ex6stata.htm

And the output they produce can be viewed as web pages here.

ex6ressas.htm
ex6resStata.htm

6.6 Results from imputing with individual questions

6.6.1 Methods and problems

Fitting the imputation model to the 102 individual questions, each with 8 possible response categories, proved challenging. It required many weeks of work to investigate all the possible approaches and to reject those with obvious problems. Some of the practical problems encountered are detailed in the theory page on imputation (Section 6.6) and problems with individual packages are documented in the code for each one. See especially the SAS file that documents problems with chained equations, ex6DETprob.htm.

Finally, two acceptable chained equations analyses were achieved, using two different packages (Stata and SAS). In the Stata analysis each question was modelled as an ordered categorical variable. To include all of the questions in every model together would have produced models that were too big and would fail. One solution could be to write 102 equations giving details of which other variables were to be used to impute each one. Instead, the alternative of dividing the questions into subsets of related questions was used, since it was easier to program. The SAS IVEWARE model used a COUNT variable to model the responses to each question.

The NORM analyses also gave some problems and had to be run in two halves because of resource problems. Scanning the results for individual questions after NORM, it was clear that a large proportion of the missing data had been imputed as having done an offence just once, compared to the observed data. These results were not extreme, however, and it is not clear that they would have been recognised as an error had the results from other methods not been there to compare. The NORM procedure had to be run without imposing bounds on the range of values. When bounds were imposed the program failed because the predictive distribution repeatedly gave a value outside the range.

Looking at the detailed questions makes one realise the problems of imputation. One of the offending questions relates to "skiving school". School leavers, who accounted for many of the non-responders, could hardly have answered this in any way. But the other questions are all relevant. It was clear on inspecting the data that the answers to many individual questions tracked very strongly from one period to another and the imputed data preserved this pattern.

6.6.2 Results

After imputing from the individual questions, the scores for all respondents are then calculated from the imputed data. The means for each score at sweeps 1 and 6 are given by gender in Table 6.4.

PREVALENCE	Boys				Girls
Imputation	sweep 1	s.e.	sweep 6	s.e.	sweep 1	s.e.	sweep 6	s.e.
NORM	80.7%	0.85%	77.55%	1.01%	69.4%	1.00%	69.5%	1.01%
chained (iveware)	79.50%	0.90%	76.53%	1.32%	67.8%	1.04%	68.9%	1.18%
chained (Stata)	80.3%	0.90%	73.6%	1.08%	68.2%	0.97%	66.0%	1.06%
Original data	79.0%	0.99%	64.7%	1.37%	67.2%	1.0%	56.7%	1.3%
VOLUME	Boys				Girls
Imputation	sweep 1	s.e.	sweep 6	s.e.	sweep 1	s.e.	sweep 6	s.e.
NORM	11.09	0.31	9.29	0.27	6.09	0.20	5.85	0.17
chained (iveware)	10.81	0.34	10.83	0.84	5.78	0.21	6.02	0.46
chained (Stata)	11.05	0.27	9.69	0.26	5.90	0.27	5.61	0.26
Original data	10.75	0.29	6.60	0.27	5.70	0.29	3.90	0.25
VARIETY	Boys				Girls
Imputation	sweep 1	s.e.	sweep 6	s.e.	sweep 1	s.e.	sweep 6	s.e.
NORM	3.27	0.064	3.05	0.078	2.18	0.052	1.78	0.051
chained (iveware)	2.98	0.062	2.27	0.089	1.95	0.046	1.34	0.047
chained (Stata)	3.02	0.054	2.08	0.044	1.97	0.05	1.22	0.042
Original data	2.98	0.057	1.64	0.048	1.95	0.057	0.97	0.045

Table 6.4. Combined estimates of scores at sweeps 1 and 6 by gender and by imputation method, using individual questions.

The trends in these means are compared with the earlier results in the next section. But a few points stand out in Table 6.4. The NORM procedure gives increased scores for variety, in line with the number of questions imputed with a value of 'having done once'. Apart from this the means of the methods are in fairly good agreement. The differences between the imputation methods are, however, rather larger than the standard errors from the post-imputation procedures, and these standard errors vary considerably between methods. Perhaps more imputations should have been run to get a better estimate of them. But more likely, the differences are an aspect of 'model uncertainty'.

Logistic regressions predicting prevalence at sweep 6 for these data are presented in the next section.

Code to see how to run these analyses in different packages can be accessed as web pages here.

ex6DETsas.htm
ex6DETStata.htm

And the output they produce can be viewed as web pages here.

ex6DETressas.htm
ex6DETresStata.htm

6.7 Comparing methods and conclusions

6.7.1 Trends in mean scores over sweeps.

Figure 6.5 shows the mean prevalence, by gender, for all respondents over all 6 sweeps. We can see that our original suspicion, that those who were not followed up at the later sweeps were the more delinquent, is now confirmed by the results of the imputation, especially those that impute from the detailed questions. Differences between methods are less pronounced than differences between the variables selected for the imputation. Compared to the smaller imputation, the prevalence at sweeps 5 and 6 was higher for all the imputation methods.

Figure 6.5 Prevalence over all 6 sweeps imputed by different methods

Figure 6.6 shows the same data for variety of offending. Differences here are smaller except for imputing with a normal distribution from the detailed questions. This was due to the high proportion of missing values that were imputed as 'one time' rather than zero.

Figure 6.6 Variety over all 6 sweeps imputed by different methods

Figure 6.7 shows the mean volume, by gender, for all respondents over all 6 sweeps. Here all methods are in reasonable agreement with the possible exception of imputing the scores from a normal distribution, which tends to give higher values, especially for girls. There is a modest increase in volume seen at sweep 6, compared to the original data.

Figure 6.7 Volume over all 6 sweeps imputed by different methods

6.7.2 Modelling prevalence by logistic regression

Logistic regressions to predict prevalence at sweep 6 were carried out for the original data and for each of the imputed data sets.

Observed		odds ratio
Parameter	estimate	95% CI		p-value
Gender F vs M	0.71	0.60	0.83	<.0001
Deprivation	1.32	1.10	1.57	0.0019
Mainstream School	1.0 (base)
EBD school	0.95	0.08	11.16	0.9681
Independent school	0.73	0.59	0.91	0.0039
Special school	0.22	0.08	0.59	0.0021

Imputed from totals		odds ratio
Parameter	estimate	95% CI		p-value
Gender F vs M	0.70	0.59	0.84	0.0012
Deprivation	1.40	1.18	1.66	0.0003
Mainstream School	1.0 (base)
EBD school	1.14	0.15	8.38	0.9022
Independent school	0.62	0.49	0.79	0.0002
Special school	0.24	0.11	0.53	0.0007

Imputed from questions		odds ratio
Parameter	estimate	95% CI		p-value
Gender F vs M	0.68	0.58	0.80	<.0001
Deprivation	1.73	1.47	2.02	<.0001
Mainstream School	1.0 (base)
EBD school	4.43	0.79	24.74	0.0900
Independent school	0.53	0.43	0.65	<.0001
Special school	0.29	0.14	0.61	0.00

Table 6.5. Logistic regressions predicting prevalence at sweep 6 from gender, deprivation and sector
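For readers who want to relate the odds ratios in Table 6.5 back to the underlying regression coefficients, the sketch below recovers the log-odds coefficient, its standard error and a two-sided p-value from a published odds ratio and its interval. It assumes the intervals are symmetric (Wald) intervals on the log-odds scale with a multiplier of 1.96, which the table does not state. The function name is our own. Because the table entries are rounded, and because analyses of imputed data usually refer to a t rather than a normal distribution, the p-values will only agree approximately.

```python
from math import erfc, exp, log, sqrt


def wald_summary(odds_ratio, ci_low, ci_high, z_crit=1.96):
    """Recover the log-odds coefficient, its s.e., z and two-sided p-value."""
    beta = log(odds_ratio)
    # The interval is beta +/- z_crit * se on the log scale.
    se = (log(ci_high) - log(ci_low)) / (2 * z_crit)
    z = beta / se
    p = erfc(abs(z) / sqrt(2))  # two-sided normal tail probability
    return beta, se, z, p


# Deprivation in the observed-data model of Table 6.5.
beta, se, z, p = wald_summary(1.32, 1.10, 1.57)
print(f"beta={beta:.3f} se={se:.3f} z={z:.2f} p={p:.4f}")
# Going the other way: rebuild the interval from beta and se.
print(f"95% CI: {exp(beta - 1.96 * se):.2f} to {exp(beta + 1.96 * se):.2f}")
```

The same function can be applied to the other rows of the table, for example to compare the size of the deprivation coefficient across the three analyses.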

All the analyses that imputed from the scores gave similar results for the coefficients, as did all the analyses that imputed from the detailed questions. This included the NORM method in each case. The values in Table 6.5 are from the IVEWARE imputations.

All three analyses agree that girls offend less than boys, that deprivation is associated with more offending and that those attending independent schools offend less.

The analysis of data imputed from the total scores differs only slightly from that of the observed data. But data imputed from the individual questions give a much stronger effect for deprivation and a somewhat stronger effect for independent schools.

The numbers in the other two school sectors are smaller and so the estimates are more volatile. All three analyses agree that a non-behavioural special school is associated with less offending. The effect of having been at a school for pupils with behaviour problems shows up only in the analysis that imputes from the individual questions and is only marginally significant. This may well be because of a high percentage of missing data for this group. It would be possible to check this in the data file. I have not had time, but the data are here on the site if you want to.

If time had allowed, more modelling analyses would have been added to this site. To allow you to test this yourselves we provide the data sets with the imputed data (10 imputations) in SAS, Stata and SPSS format.

	SAS	Stata	SPSS
Imputed from scores	ivesc.sas7bdat MICEsc.sas7bdat NORMsc.sas7bdat	ivesc.dta MICEsc.dta NORMsc.dta	ivesc.sav MICEsc.sav NORMsc.sav
Imputed from detailed questions	ivedet.sas7bdat MICEdet.sas7bdat NORMdet.sas7bdat	ivedet.dta MICEdet.dta NORMdet.dta	ivedet.sav MICEdet.sav NORMdet.sav
NOTES	Each data set has 10 imputations. The data sets imputed from detailed questions do not have a full range of other variables, but they could easily be added from the other data sets.
	Run ex6formats.sas before using these. The SAS data sets have the imputation number identified by the variable _IMPUTATION_.	The Stata data sets have the imputation number identified by the variable _j and the sequence number in the original file identified by _i.	The SPSS data sets have the imputation number identified by the variable _j and the sequence number in the original file identified by _i.
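If you export one of these stacked data sets to another environment, the first step in any analysis is to split it by imputation number and analyse each completed data set separately. The sketch below shows the idea in Python for the sweep 6 prevalence. The rows are made up, the name dprev6 is our assumption about how the dprev1-6 variables are named, and the simple binomial standard error ignores the clustering by school.

```python
from collections import defaultdict
from math import sqrt

# Hypothetical stacked rows, as they might look after exporting one of the
# imputed data sets: every pupil appears once per imputation.
rows = [
    {"_IMPUTATION_": 1, "CASEID": 1, "dprev6": 1},
    {"_IMPUTATION_": 1, "CASEID": 2, "dprev6": 0},
    {"_IMPUTATION_": 1, "CASEID": 3, "dprev6": 1},
    {"_IMPUTATION_": 1, "CASEID": 4, "dprev6": 1},
    {"_IMPUTATION_": 2, "CASEID": 1, "dprev6": 1},
    {"_IMPUTATION_": 2, "CASEID": 2, "dprev6": 0},
    {"_IMPUTATION_": 2, "CASEID": 3, "dprev6": 0},
    {"_IMPUTATION_": 2, "CASEID": 4, "dprev6": 1},
]


def prevalence_by_imputation(records, outcome="dprev6", key="_IMPUTATION_"):
    """Return {imputation number: (proportion, standard error)}."""
    groups = defaultdict(list)
    for rec in records:
        groups[rec[key]].append(rec[outcome])
    results = {}
    for imp, values in sorted(groups.items()):
        p = sum(values) / len(values)
        results[imp] = (p, sqrt(p * (1 - p) / len(values)))
    return results


for imp, (p, se) in prevalence_by_imputation(rows).items():
    print(f"imputation {imp}: prevalence {p:.2f} (s.e. {se:.3f})")
```

For the Stata and SPSS files the grouping variable would be _j rather than _IMPUTATION_, and the per-imputation results would then be combined in the usual way.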

6.7.3 Conclusions.

These are tentative conclusions based on an extensive investigation of one data set, but several of them may apply to other cases.

Imputation is not an easy option and it needs a considerable investment in time and expertise
All imputed data, especially model-based methods, should be thoroughly checked
The choice of variables to use in the imputations may be more important than the method used
If there is a substantial proportion of missing data and bias is important then using a NORMal distribution method when it is not appropriate may be a problem

Imputation is an art and missing data are a challenge. Any attempt to deal with missing data (even by ignoring it) is full of assumptions. We can never be sure what the missing data would have been had we got their answers.

But decisions have to be made. If I had to decide in this case I would go for the analysis that imputes from the individual questions using one or other implementation of chained equations.

6.8 Details of the survey

This was a simple longitudinal survey. All secondary schools in Edinburgh were approached. Although it was clustered by school, this was a natural clustering rather than one induced by the survey design. We have not used the information on clustering in the analyses presented here, although this might be possible with some of the software discussed.

The longitudinal data set has been produced by linking the data from all sweeps of this survey (S1 to S6). The data from sweeps 1 to 4 of the survey are available from the UK data archive. Most of the data used in this exemplar come from the young people's questionnaires, which can be viewed at the survey web site. The data archive documentation explains the codes for the variables used from the questionnaires at the first 4 sweeps, and the conventions continue in an obvious way at the final two sweeps. Details of how the data for this exemplar have been extracted from the larger data files are available here.

The questions that contribute to the prevalence, volume and variety scores for delinquency differed slightly for each of the sweeps. They are listed in the codes for the data sets with individual questions (below).

The data sets with total scores (named ex6.***)

Variable	Label
CASEID	Unique case identifier
ETHGP	White or non-white
GENDER	Gender (1=m 2=f)
HZEVREFI	HZ. Whether ever referred on offence
HZRESPRE	HZ. Whether had a hearing record at any sweep
RAALCY80	RA Frequency of drinking alcohol sweep 1: 5 = 'Few times/wk' 4 = 'Weekly' 3 = 'Monthly' 2 = 'Special occasions' 1 = 'Hardly ever' 0 = 'Non-drinker'
RBALCY80	RB. Frequency of drinking alcohol sweep 2
RCALCY80	RC. Frequency of drinking alcohol sweep 3
RDALCY80	RD. Frequency of drinking alcohol sweep 4
REALCY80	RE Frequency of drinking alcohol sweep 5
RFALCY80	RF Frequency of drinking alcohol sweep 6
SASMKY80	SA Frequency of smoking sweep 1 4=daily 3=weekly 2=less than weekly 1=hardly ever 0=non smoker
SBSMKY80	SB Frequency of smoking sweep 2
SCSMKY80	SC Frequency of smoking sweep 3
SDSMKY80	SD Frequency of smoking sweep 4
SESMKY80	SE Frequency of smoking sweep 5
SFSMKY80	SF Frequency of smoking sweep 6
SZABS04	SZ. Whether school record for truancy at any time
SZINDEP	SZ. Individual deprivation/socioeconomic 1=y 2=n
YZLEAVE	YZ. Whether left school at earliest opportunity 1=y 2=n
dprev1-6	prevalence of offending sweeps 1-6 (0/1)
drgcode1	Drug use summary code at sweeps 1-6 0='none' 1='only cannabis' 2='only glue' 3='other drug or combination'
dvol1-6	volume of offending sweeps 1-6
sector	Independent, special or behavioural education at any time 1=mainstream 2=private 3=special physical or learning disabilities 4=special/behavioural

The data sets with individual questions(named ex6det.***)

This data set includes 102 questions, each giving the number of times each participant reported having done each of the items: 0 = not done; 1 to 5 = the number of times; 6 = 5 to 10 times; 7 = 10 or more times.

	Variable	Label
39	YAYARS01	YA Ever set fire to something
21	YAYBOP01	YA Ever rowdy or rude in public
17	YAYBUS01	YA Ever dodged paying correct fare
43	YAYCBK01	YA Ever broken into a vehicle to steal
33	YAYGRF01	YA Ever written or sprayed graffiti
31	YAYHBK01	YA Ever broken into a house or building
41	YAYHIT01	YA Ever hit, kicked or punched someone
37	YAYHOM01	YA Ever stolen something from home
23	YAYJRD01	YA Ever stolen/ridden in a stolen vehicle
35	YAYROB01	YA Ever used force/threats/weapon
25	YAYSCL01	YA Ever stolen something from school
19	YAYSHP01	YA Ever stolen from a shop
45	YAYSKV01	YA Ever skived school
29	YAYVND01	YA Ever vandalised property
27	YAYWEP01	YA Ever carried a knife or weapon
94	YBYARS01	YB Set fire to something in last year
76	YBYBOP01	YB Noisy or cheeky in public in last ye
72	YBYBUS01	YB Dodged paying correct fare in last y
98	YBYCBK01	YB Broken into a vehicle to steal in la
88	YBYGRF01	YB Wrote or sprayed graffiti in last ye
86	YBYHBK01	YB Broken into a house or building to s
96	YBYHIT01	YB Hit, kicked or punched someone in la
92	YBYHOM01	YB Stolen something from home in last y
78	YBYJRD01	YB Ridden in a stolen vehicle in last y
90	YBYROB01	YB used force/threats/weapon to rob so
80	YBYSCL01	YB Stolen something from school in last
74	YBYSHP01	YB Stolen something from a shop in last
100	YBYSKV01	YB Skived school in last year
84	YBYVND01	YB Vandalised property in last year
82	YBYWEP01	YB Carried a knife or weapon in last ye
152	YCYARS01	YC: Set fire to something in last year
134	YCYBOP01	YC: Noisy or cheeky in public in last ye
130	YCYBUS01	YC: Dodged paying correct fare in last y
156	YCYCBK01	YC: Broken into a vehicle to steal in la
162	YCYDRG01	YC: Sold an illegal drug in last year
146	YCYGRF01	YC: Wrote or sprayed graffiti in last ye
144	YCYHBK01	YC: Broken into a house or building to s
154	YCYHIT01	YC: Hit, kicked or punched someone in la
150	YCYHOM01	YC: Stolen something from home in last y
136	YCYJRD01	YC: Ridden in a stolen vehicle in last y
160	YCYPET01	YC: Cruel to animals/birds in last year
164	YCYRAB01	YC: Racially abused someone in last year
148	YCYROB01	YC: Used force/threats/weapon to rob some
138	YCYSCL01	YC: Stolen something from school in last
132	YCYSHP01	YC: Stolen something from a shop in last
158	YCYSKV01	YC: Skived school in last year
142	YCYVND01	YC: Vandalised property in last year
140	YCYWEP01	YC: Carried a knife or weapon in last ye
219	YDYARS01	YD: Set fire to something in last year
201	YDYBOP01	YD: Noisy or cheeky in public in the last year
197	YDYBUS01	YD: Dodged paying correct fare in last y
223	YDYCBK01	YD: Broken into vehicle to steal in last
229	YDYDRG01	YD: Sold an illegal drug in last year
213	YDYGRF01	YD: Wrote or sprayed graffiti in last year
211	YDYHBK01	YD: Broken into house\building to steal
221	YDYHIT01	YD: Hit, kicked, punched someone in last year
217	YDYHOM01	YD: Stolen something from home in last year
203	YDYJRD01	YD: Ridden in stolen vehicle in the last year
227	YDYPET01	YD: Cruel to animals\birds in last year
231	YDYRAB01	YD: Racially abused someone in last year
215	YDYROB01	YD: Used force\threats\weapon in last ye
205	YDYSCL01	YD: Stolen something from school in last
199	YDYSHP01	YD: Stolen something from shop in last y
225	YDYSKV01	YD: Skived school in last year
209	YDYVND01	YD: Vandalised property in last year
207	YDYWEP01	YD: Carry knife or weapon in last year
267	YEYWEP01	YE: Carry knife or weapon in last year
332	YFYARS01	YF: Set fire to something in last year
350	YFYBFT01	YF: Claimed benefits not entitled to in
320	YFYBOP01	YF: Noisy or cheeky in public in the las
336	YFYCBK01	YF: Broken into vehicle to steal in last
342	YFYDRG01	YF: Sold an illegal drug in last year
352	YFYFRD01	YF: Cheque, credit fraud etc, in last ye
328	YFYHBK01	YF: Broken into house\building to steal
334	YFYHIT01	YF: Hit, kicked,punched someone in last y
322	YFYJRD01	YF: Ridden in stolen vehicle, ever done
340	YFYPET01	YF: Cruel to animals\birds in last year
344	YFYRAB01	YF: Racial abuse, ever done it
330	YFYROB01	YF: Robbed person of property in last ye
346	YFYRST01	YF: Sold on stolen property in last year
348	YFYRST21	YF: Bought stolen property in last year
349	YFYRST22	YF: Bought stolen property, times done i
318	YFYSHP01	YF: Stolen something from shop in last y
338	YFYSKV01	YF: Skived school in last year
326	YFYVND01	YF: Vandalised property in last year
324	YFYWEP01	YF: Carry knife or weapon in last year

P|E|A|S project 2004/2005/2006