Exemplar 5: Young women and drugs in a local health survey

5.1 Background > 5.2 Getting started > 5.3 Carrying out the reweighting > 5.4 Young women’s drug use > 5.5 Details of survey

This exemplar is based on data from a local health survey conducted in 2002 by Ayrshire and Arran Health Board.

The main purpose of this exemplar is to illustrate how post-stratification is carried out to calculate non-response weights. The original survey was unweighted. We also present some results from a weighted analysis that turn out to be little affected by the weighting.

If you need more background on weighting or non-response for surveys we suggest you visit the appropriate sections of this site, 3:weighting or 5:non-response.

Table 5.1 Features of the Local Health Survey and the analysis.

This was a postal survey and the questionnaire can be viewed here. The sample was based on the Community Health Index (CHI) and was intended to represent 2.5% of all residents of the Ayrshire and Arran Health Board area.

The sample was randomly selected from the CHI, but full details of how this was done are not available. The response rate for this survey was around 50%. Response rates can be calculated as a percentage of all addresses issued, or as a percentage of those remaining after excluding mailings returned by the post office or by the current occupiers to say that the person is no longer at the address. Exact information on all of this was not available for this survey, but we did have a list of the numbers of patients in the sampling frame by age and sex.

Non-response and problems with the sampling frame


The relatively low response rate makes it important to take steps to make the sample more representative of the population.

As we had full postcodes for all respondents, it was possible to use the data available from the Scottish Neighbourhood Statistics project to carry out the reweighting of the survey. Part of this project has been the commissioning of a set of geographical areas, known as data zones, designed to be socially homogeneous (details). These zones are relatively small units with populations of between 500 and 1000 residents.

The advantage of using postcodes is that we can link each respondent to a data zone and the following information is available for each data zone.

  • Population data by age and sex, in 5-year age groups, from the 2001 census. These are available from the Registrar General for Scotland.
  • The Scottish Index of Multiple Deprivation, which gives each data zone a score on several dimensions including income, education, health and rurality. These data can be obtained from the Scottish Executive, where a map showing each data zone in its local area can also be accessed.

These data have been used to provide weights for this survey, as explained in detail in section 5.3 below. As part of this process we compared the age and sex composition of the census population for the whole area with that for the list of addresses contacted. The ratio of these two is illustrated in Figure 5.1. The age groups range from 16-19 up to 75-79 in five-year age groups.

This points to what may well have been problems with the sampling frame. There appear to be too many addresses for older people, especially older men, as well as rather too many for young people. The excess of older people in the list of addresses probably relates to the fact that the CHI is known to contain records for people who have died or have moved away. The same is probably true of young people, some of whom may have moved out of the area.


Figure 5.1 Comparison of issued addresses with 2001 census populations by age and sex.

5.2 Getting Started

From links in this section you can:-

  • Download or open the data files
  • Analyse them with any of the 4 packages you have available
  • View the code (with comments) and the output, even if you don't have the software.

To start, click the mini guide for the statistical package you want to use to analyse Exemplar 5.

For additional help click on the appropriate novice guide.

For details of the data sets see below.

Mini Guides

Guides for Novices


This exemplar has two different data files: one (ex5wt) used to carry out the reweighting and one (ex5) used to analyse the data with the weights.

# Means that it is not just a simple text file of the program, but a web page you can use to view program code and results outside packages.

* Means that the program or data set should be saved to your computer before use as it may not open correctly by just clicking on it.

5.3 Carrying out the reweighting

5.3.1 Data preparation

In order to calculate weights, the first step was to create a data set that compares the number of respondents with the number in the population. This was done here by using the sources described above, and gave a set of records for each data zone, one for each combination of age group and sex where there were population members.

Each of these records contained quite a small number in the population (ranging from 1 to 137), and the numbers in the sample are much smaller, with many data zones having no respondents in particular age-sex groups. Had the numbers been larger, the probability of response could simply have been taken as the ratio of respondents to population and its inverse used as a weight. Clearly this would not work here. For one thing, it would give a large number of infinite weights.
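To see why the simple ratio approach breaks down, the following minimal Python sketch computes the inverse of the observed response ratio for a few cells. The cell keys and counts are invented for illustration and are not taken from the ex5wt data.

```python
import math

# Hypothetical cells: (data zone, age group, sex) -> (respondents, population).
# These counts are made up; they are not values from the survey.
cells = {
    ("dz1", 1, "F"): (0, 12),
    ("dz1", 2, "F"): (1, 20),
    ("dz2", 1, "M"): (2, 137),
}

def naive_weight(respondents, population):
    """Inverse of the observed response ratio; infinite when nobody responded."""
    if respondents == 0:
        return math.inf
    return population / respondents

weights = {cell: naive_weight(r, n) for cell, (r, n) in cells.items()}
n_infinite = sum(1 for w in weights.values() if math.isinf(w))

for cell, w in weights.items():
    print(cell, w)
print("cells with infinite weight:", n_infinite)
```

Any cell with population members but no respondents ends up with an infinite weight, and cells with one or two respondents get weights that are dominated by chance.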

Modelling non-response

To overcome this we need some other way of modelling non-response, most commonly by fitting a binary regression model using logistic regression (for a discussion of issues in non-response re-weighting click here). The general form of this regression model is:

$$\operatorname{logit}(\text{response rate}) = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \dots + \beta_k X_k$$

where the Xs are variables that can be measured both for members of the population and for those in the sample, and the βs are coefficients to be estimated. In our case, the Xs consist of age group, sex and the characteristics of the data zones (the six domains of the deprivation indicator). All the possible interactions of these factors can also be included as further X variables. The data used to fit this model are the observed numbers of respondents as a proportion of the population for each sub-group.
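As an illustration of fitting such a model to grouped counts, here is a minimal Python sketch of a logistic regression with an intercept and a single X variable, fitted by Newton-Raphson. It is not the code used for this exemplar (that is in the SAS and SPSS programs); the function name, the single covariate and the counts are all assumptions made for the example.

```python
import math

def inv_logit(eta):
    """Inverse of the logit link."""
    return 1.0 / (1.0 + math.exp(-eta))

def fit_grouped_logit(x, responders, population, n_iter=50):
    """Fit logit(p) = b0 + b1 * x to grouped binomial counts by Newton-Raphson."""
    b0, b1 = 0.0, 0.0
    for _ in range(n_iter):
        # Score vector (g) and information matrix (h) of the binomial log-likelihood.
        g0 = g1 = h00 = h01 = h11 = 0.0
        for xi, ri, ni in zip(x, responders, population):
            p = inv_logit(b0 + b1 * xi)
            resid = ri - ni * p
            w = ni * p * (1.0 - p)
            g0 += resid
            g1 += resid * xi
            h00 += w
            h01 += w * xi
            h11 += w * xi * xi
        # Newton step: solve the 2x2 system h * delta = g.
        det = h00 * h11 - h01 * h01
        d0 = (h11 * g0 - h01 * g1) / det
        d1 = (h00 * g1 - h01 * g0) / det
        b0 += d0
        b1 += d1
        if abs(d0) + abs(d1) < 1e-10:
            break
    return b0, b1

# Hypothetical sub-groups: a deprivation score, respondents and population.
score = [5.0, 10.0, 20.0, 30.0, 40.0]
resp = [6, 6, 3, 3, 1]
pop = [100, 120, 90, 110, 80]

b0, b1 = fit_grouped_logit(score, resp, pop)
fitted = [inv_logit(b0 + b1 * s) for s in score]
print(b0, b1)
```

The fitted rates are smooth in the covariate and are never exactly zero, so every sub-group receives a finite predicted response rate even when it has no respondents.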

There are probably as many ways of fitting a model to the data as there are analysts who might do so. The trick is to come up with something that predicts well and yet gives reasonably stable weights that are not just noise. Such noise may have entered from uncertainties in population numbers or from chance events unrelated to the characteristics of respondents. The weights are intended to compensate for differences in the type of person who may respond, not for such chance fluctuations.

The modelling process used here had the following steps:

  • It started by taking age-group as a categorical variable with 13 levels and fitted a model with the age-group and gender interaction and linear functions of the 6 domains of the deprivation index.
  • This model was then reduced by removing 3 of the six deprivation scores, leaving income (much the strongest), access and housing.
  • The interactions of income, age-group and gender were tested, and that between age group and income was found to be important.
  • The linearity of the income score was checked by including a grouped variable, but this was not needed.
  • The housing deprivation score was dropped, as it no longer contributed to the model.

The fitted rates from this model are shown below (Figure 5.2).

Figure 5.2 Initial fitted model for survey responses as a proportion of the population by age, sex and deprivation

The pattern by age group (forced to be the same for each income group) still seems a bit noisy. A further model was fitted that used a separate quadratic function of age group for men and for women, and allowed its parameters to have linear interactions with income deprivation. The fit of this final model is shown in Figure 5.3.


Figure 5.3: Final fitted model for survey responses as a proportion of the population by age, sex and deprivation score of neighbourhood, fitted at the average value of access deprivation.

Calculating the weights

Once the model to compare the respondents with the population has been fitted, it is easy to calculate the weights. The predicted ratio of respondents to population is calculated for every individual in the data. Different computer packages allow this to be done in slightly different ways. This is illustrated in the code for each package.

Once the predicted ratio is calculated, its inverse becomes the weight for that observation. Summed over the whole sample, these weights will add to a number that is approximately the population total, so they are approximately grossing-up weights. If desired, they can be rescaled to make them exactly grossing-up weights. Further weights that add to the sample size can also be calculated. In this exemplar both grossing-up weights (GWEIGHT) and weights that add to the sample size (WEIGHT) are provided.
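The arithmetic can be sketched in a few lines of Python. The function name, the predicted rates and the population total below are hypothetical; only the two kinds of weight (grossing-up, and adding to the sample size) come from the text.

```python
def response_weights(predicted_rates, population_total):
    """Turn predicted response rates into raw, grossing-up and sample-size weights."""
    raw = [1.0 / p for p in predicted_rates]   # inverse of the predicted ratio
    total = sum(raw)
    n = len(raw)
    # Rescale so the weights add exactly to the population total (like GWEIGHT).
    gweight = [w * population_total / total for w in raw]
    # Rescale so the weights add to the number of respondents (like WEIGHT).
    weight = [w * n / total for w in raw]
    return raw, gweight, weight

# Hypothetical predicted rates for three respondents and a made-up population total.
raw, gweight, weight = response_weights([0.01, 0.02, 0.025], population_total=200)
print(raw)
print(sum(gweight), sum(weight))
```

The two rescaled sets of weights differ only by a constant factor, so they give identical weighted means and proportions; they differ only in the totals they gross up to.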

All the packages we feature can do these analyses as they all provide data manipulation features and programs for logistic regression.

These links will take you to the code that carried out this modelling in some of the packages (SAS code, SPSS code) and to the output from each (SAS code, SPSS code).


5.4 Young women and drug use

To illustrate some analyses on the weighted data, we have selected a sub-group of the population that is of particular interest to the Health Board: young women aged 16-29, and in particular their drug use. Because this survey is neither clustered nor stratified, we can analyse this subgroup without reference to the rest of the survey. Also, we saw above that the largest factor affecting weighting was age group, so we might expect the weighting here to have only a minor effect.

These are non-response weights rather than design weights (see section 3.9 in the theory section for a discussion of this). They are probability weights, so the design-based procedures can be used for inference, although they will not incorporate the uncertainty due to modelling the non-response.

There were only 361 young women in the survey and the range of their weights was considerably less than for the whole survey (see Figure 5.4 below).


Figure 5.4 Weights for young women

We can see in Table 5.4 that the weighted and unweighted proportions for categorical variables differ by very little.

General Health (score)   Unweighted   Weighted
Excellent (5)
Very good (4)
Good (3)
Fair (2)
Poor (1)

Cannabis Use (score)          Unweighted   Weighted
Daily (1)
Weekly (1)
Last month (1)
More than a month ago (0.5)
Once or twice (0)
Never (0)

Table 5.4 Weighted and unweighted proportions for General Health groups and Cannabis Use groups for young women
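For readers who want to reproduce this kind of comparison, a weighted proportion is simply the sum of the weights in a category divided by the sum of all weights. The Python sketch below uses invented responses and weights, not the survey data, and the function name is an assumption.

```python
def proportions(categories, weights=None):
    """Proportion in each category; unweighted when no weights are given."""
    if weights is None:
        weights = [1.0] * len(categories)
    total = sum(weights)
    out = {}
    for cat, w in zip(categories, weights):
        out[cat] = out.get(cat, 0.0) + w / total
    return out

# Hypothetical responses and weights for four respondents.
health = ["Good", "Good", "Fair", "Poor"]
w = [1.0, 1.0, 2.0, 4.0]

unweighted = proportions(health)
weighted = proportions(health, w)
print(unweighted)
print(weighted)
```

When the weights vary little within a subgroup, as they do for the young women here, the two sets of proportions will be close.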

By looking at the means of some of these variables, scored as shown above, we can also see that the design effects are quite modest for this sub-group of the survey, although of course they vary with what is being measured (Table 5.5). If there were no relationship between a variable and the weight, the design effect could be calculated from the weights alone (see weighting section 3.7), because there is no stratification or clustering. For this example that would give a design effect of 1.10. We can see that the income score, which we know to be related to the weights, has a larger design effect, although its size is still quite modest.
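A common way to compute a design effect from the weights alone is Kish's approximation, n Σw² / (Σw)², which equals one plus the squared coefficient of variation of the weights. We assume here that this is the calculation referred to in section 3.7; the Python sketch below uses made-up weights and a hypothetical function name.

```python
def kish_design_effect(weights):
    """Kish's approximate design effect due to unequal weighting."""
    n = len(weights)
    sum_w = sum(weights)
    sum_w2 = sum(w * w for w in weights)
    return n * sum_w2 / (sum_w * sum_w)

# Equal weights give no loss of precision; unequal weights inflate the variance.
equal = kish_design_effect([2.0, 2.0, 2.0, 2.0])
unequal = kish_design_effect([1.0, 1.0, 2.0, 2.0])
print(equal, unequal)
```

The result does not depend on the scale of the weights, so grossing-up weights and weights that add to the sample size give the same value.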

mean s.e. lower limit upper limit design effect
General health
SIMD income score
SIMD access score

Table 5.5 Design effect for estimated mean of three variables, for young women

A finite population correction was used in the analysis here. It made almost no difference, changing the standard error for general health from 0.0527406 to 0.052416. The same was true when regression models were fitted to the relationship between variables. Details can be found in the programs and output for this exemplar. We show below some results for survey-weighted regressions predicting the health score from various factors. The health score and drug use were scored according to the values given in Table 5.4 above.
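To give a feel for why the correction is so small, the sketch below applies the textbook finite population correction for simple random sampling, sqrt(1 - n/N), and back-calculates the sampling fraction implied by the two standard errors quoted above. The survey packages may compute the correction differently for weighted data, so treat this as an approximation; the function names are hypothetical.

```python
import math

def apply_fpc(se, n, population_size):
    """Standard error after the simple-random-sampling finite population correction."""
    return se * math.sqrt(1.0 - n / population_size)

def implied_sampling_fraction(se_without_fpc, se_with_fpc):
    """Sampling fraction n/N implied by the ratio of the two standard errors."""
    return 1.0 - (se_with_fpc / se_without_fpc) ** 2

# Standard errors for general health quoted in the text.
f = implied_sampling_fraction(0.0527406, 0.052416)
print(f)

# Check: applying the correction with that fraction recovers the corrected value.
print(apply_fpc(0.0527406, n=f * 1000.0, population_size=1000.0))
```

The implied sampling fraction is only a little over 1%, which is why the correction changes the standard error in the third significant figure only.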

predictor     model 1   model 2   model 3   model 4   model 5   model 6   model 7
coefficient   0.455                         0.44                0.509     0.51
t statistic   (3.03)**                      (2.92)**            (3.38)**  (3.47)**
coefficient             0.302                         0.285     0.247     0.23
t statistic             (1.86)                        (2.03)*   (2.08)*   (2.37)*
coefficient                       0.015     0.013     0.013               0.013
t statistic                       (2.85)**  (2.52)*   (2.63)**            (2.66)**
Observations  358       338       361       358       338       338       338
* significant at 5% level; ** significant at 1% level

5.5 Prediction of general health score: weighted regression

We can see that poor health is associated with income deprivation and independently with both cannabis use and amphetamine use. Of course we cannot tell from this analysis whether drug use causes poor health rating or vice-versa, or whether it is some other factor that is responsible for the association.

The unweighted regression gives broadly similar results, but the association between amphetamine use and poor reported health is not so clear-cut in the unweighted analysis. This suggests that failing to weight can sometimes obscure associations that should have been seen in the data.

predictor     model 1   model 2   model 3   model 4   model 5   model 6   model 7
coefficient   0.455                         0.44                0.509     0.51
t statistic   (3.03)**                      (2.92)**            (3.38)**  (3.47)**
coefficient             0.302                         0.285     0.247     0.23
t statistic             (1.86)                        (2.03)*   (2.08)*   (2.37)*
coefficient                       0.015     0.013     0.013               0.013
t statistic                       (2.85)**  (2.52)*   (2.63)**            (2.66)**
Observations  358       338       361       358       338       338       338
* significant at 5% level; ** significant at 1% level

5.5 Prediction of general health score: unweighted regression

The exemplars also give analyses of the data above using chi-squared tests, adjusted for weighting. They do not show the same associations as the regression did, because they do not focus on the linear-by-linear interaction. Their power is reduced by the fact that they are seeking any association between a large number of cells. The regression is more appropriate for this type of analysis with such limited numbers. But the comparisons of chi-squared statistics in the exemplars do show how the design affects these.

Effectively, all the packages featured on this site can do these analyses. SAS version 8 does not include chi-squared tests for surveys, but version 9 will. Similarly, SPSS does not do regression for surveys, but version 13 will.

Do we need to use a weighted analysis for this survey?

The case for going to all this trouble here must be marginal at best. The design effects are all pretty small, and none of the analyses we have looked at differ substantially between the weighted and unweighted versions. Some other variables may be more affected by the weighting than the ones we have investigated. And we have not looked at the whole survey, only a subgroup, so it would be safer to use analyses that allow appropriately for the weighting.

The largest factor affecting the weights was age group. It is important to get the weighting right to avoid bias in summaries for the whole population. Regression analyses that adjust for age could probably be safely carried out without weighting.

These links will take you to the code that was used to carry out these analyses in various packages (SAS code, SPSS code) and to the output from each (SAS code, SPSS code).


5.5 Details of the survey

This was a simple postal survey, so there was no clustering of addresses. We believe that there may have been some stratification to make the sample match the population in terms of age group, sex and local area. The details of this are not available, so we have not been able to incorporate it into the analysis. This should have no effect on our estimates, but had the details been available they might have given somewhat better estimates of precision. The following web link (to be added when ready) gives further details of the design of the survey.

Primary sampling units

Since this survey was not clustered, the PSUs are simply the individual respondents.


The sample was post-stratified to match the age and sex populations from the 2001 census. The weights also produced a match between the sample numbers and the deprivation categories at the data zone level. Details of this are in sections 5.1 and 5.3.

Data files

Two data sets are provided. The first one is used to calculate the weights. It has one record for each combination of age group, sex and data zone in Ayrshire and Arran Health Board. Records with no population (which fortunately also had no respondents) have been omitted.

Data sets for weighting procedure (ex5wt.***)

Variable Label
AyrA_num number of respondents
agegrp age group 1=16-19 2=20-24 3=25-29 ....... 13=75-79
dz data zone number
gender M for male F for female
npop 2001 census population estimate
sacc score from SIMD2004 access domain
sed score from SIMD2004 education domain
semp score from SIMD2004 employment domain
shlth score from SIMD2004 health domain
shse score from SIMD2004 housing domain
sinc score from SIMD2004 income domain

The second data set has been extracted from the data file for the whole questionnaire. It includes only young women aged 16 to 29, and contains just those variables we will use. The process by which the data set was constructed, and the programmes used to make it from the original data, are explained in detail here.

Data for weighted analyses (ex5.***)

Variable Label and Codes
age how old are you
weight weight
empl employment status
1 in paid work or self employed - full time
2 in paid work or self employed - part time
3 unemployed
4 intending look for work but prevented by temp sickness
5 permanently sick or disabled and not able to work
6 retired
7 looking after the home or family full time
8 in full time education
9 doing something else
10 caring
11 part time study
12 voluntary work

highest qualification
1 none
2 higher degree
3 first degree
4 A levels
5 O level
7 Other

q85a Cannabis Use
1 never used
2 tried once or twice
3 use daily
4 use weekly
5 used in last month
6 used more than a month ago
q85b amphetamines use
codes as above
sacc data zone access deprivation score
sinc data zone income deprivation score
gweight weight adding to population totals

composition of household
1="living alone"
2="single parent"
4="couple with kids"

genhelf general health
2="Very good"
peas project 2004/05