
3.1 Background

This exemplar is about using data from the Scottish Health Survey, 1998, to investigate Smoking Rates in Scotland. We will also look at rates by Health Board and by other factors.

The Scottish Health Survey (SHlthS) data for this exemplar were obtained from the ESRC Essex Data Archive. The data differ from the archive data because of steps taken to prevent any disclosure of individuals. The principles we have followed in anonymising the data are set out in precautions.htm, and the details of what has been done in this case are explained in ex3_prep.htm.

One of the purposes of the Scottish Health Survey is to contribute to the monitoring of health targets. For example, the target for adult smoking is to:

"Reduce rate of smoking from an average of 35% to 33% between 1995 and 2005 and to an average of 29% by 2010." (A breath of fresh air for Scotland )

The SHlthS is not the major performance indicator for this outcome. The Scottish Household Survey (featured in exemplar 2) is the main source of information for this target (see NHS Scotland Performance Assessment Framework). But we are using the SHlthS data to illustrate some of the problems in the use of survey data for this type of work.

The rate of 35% quoted for 1995 was the weighted percentage from adults aged 16-64 in the 1995 Health Survey. In the 1998 survey it was found that there was little change in the overall rate (weighted of course), but that the rate for women had decreased significantly while that for men had increased, though not significantly.

Weighting makes a difference!

These results are illustrated below, where we also show the simple unweighted rates. We can see that the differences between the weighted and unweighted rates are at least as large as the changes over time that the survey is designed to monitor. This emphasises the importance of doing weighting correctly. Because the interest is in comparing with the 1995 survey, we use the age group 16-64, as older people were excluded from the earlier survey.

Figure 3.1 Weighted and unweighted estimates of current cigarette smoking (ages 16-64)
from the 1998 health survey, compared to the 1995 rates.
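The difference between a weighted and an unweighted rate can be sketched in a few lines of Python. This is only an illustration: the function name `smoking_rate` and the eight respondents below are invented for the example and are not taken from the survey files.

```python
def smoking_rate(smokes, weights=None):
    """Return the percentage of current smokers.

    smokes  : list of 0/1 indicators (1 = current smoker)
    weights : optional list of survey weights; None gives the unweighted rate
    """
    if weights is None:
        weights = [1.0] * len(smokes)
    total_weight = sum(weights)
    weighted_smokers = sum(w * s for w, s in zip(weights, smokes))
    return 100.0 * weighted_smokers / total_weight


# Hypothetical respondents (NOT survey data): smoker indicator and weight.
smokes = [1, 1, 0, 0, 1, 0, 0, 0]
weights = [0.5, 0.5, 1.0, 2.0, 1.0, 1.0, 2.0, 2.0]

unweighted = smoking_rate(smokes)
weighted = smoking_rate(smokes, weights)
print(f"unweighted: {unweighted:.1f}%  weighted: {weighted:.1f}%")
```

In this made-up example the smokers happen to carry low weights, so the weighted rate falls below the unweighted one.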

The fact that the unweighted analysis gave higher smoking rates was a bit puzzling at first, since younger ages and urban areas tended to have high weights. This would have been expected to increase the smoking rates for a weighted analysis. The answer to the puzzle is below.

Detailed Explanation        Details for this survey
Weighting                   click here
Clustering                  click here
Implicit stratification     click here
Post-stratification         click here
Subgroup analyses           click here

Table 3.1 Features of the health survey and analysis


3.2 Getting Started

From links in this section you can:-

  • Download or open the data files
  • Analyse them with any of the 4 packages you have available
  • View the code (with comments) and the output, even if you don't have the software.

To start, click the mini guide for the statistical package you want to use to analyse Exemplar 3.

For additional help click on the appropriate novice guide.

For details of the data set see below.

Mini Guides


Novice Guides

Do not just click on the items in this table; go to the mini guides first.

Package   Data sets       Program Code
SAS       ex3.sas7bdat*
Stata     ex3.dta         ex3.do*
SPSS      ex3.sav         ex3.SPS
R         ex3.RData       ex3.R*

Table 3.2 Data sets and code.

* SAVE these files to your computer. They do not open from outside the software packages.

You may have to save some of the other files to disc if your set-up does not allow you to open files directly.

The html files allow you to view program code and results outside packages.


3.3 Smoking rates ages 16-74 - how design features affect precision

Table 3.3 gives weighted and unweighted smoking rates for both sexes combined, now for all adults in the survey, and also shows how the design features have affected the precision of the estimates.

Allowing for features of the design    Percentage    Standard Error    Design Effect

1. Weighting only
2. Clustering
3. Clustering and stratification
4. Clustering and stratification and post-stratification

33.2 %

Table 3.3 Features of the health survey and analysis

1. The uneven weighting in this survey reduces precision (relative to a simple random sample). The biggest weighting factors come from household size.
2. The clusters are quite large, so clustering also has quite a large effect on precision.
3. Implicit stratification into small strata of just two or three PSUs makes a real improvement.
4. The post-stratification was carried out mainly to adjust for possible biases, not to improve precision. Since it seems to have no impact on precision here, this step can safely be ignored in assessing precision in relation to smoking. It is possible that it might change the precision in other analyses.
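A design effect is the ratio of the variance of an estimate under the actual design to the variance it would have under a simple random sample of the same size. As a rough illustration of point 1 only, the sketch below uses Kish's well-known approximation for the loss of precision due to unequal weights, n * sum(w^2) / (sum(w))^2. This is not how the design effects in Table 3.3 were produced (those come from the survey packages), and the weights shown are invented.

```python
def kish_design_effect(weights):
    """Kish's approximate design effect due to unequal weighting alone."""
    n = len(weights)
    sum_w = sum(weights)
    sum_w2 = sum(w * w for w in weights)
    return n * sum_w2 / (sum_w ** 2)


def effective_sample_size(weights):
    """Sample size of a simple random sample with similar precision."""
    return len(weights) / kish_design_effect(weights)


# Hypothetical weights (NOT the survey's WEIGHTA values).
equal_weights = [1.0] * 6
uneven_weights = [1.0, 1.0, 1.0, 2.0, 2.0, 3.0]

deff_equal = kish_design_effect(equal_weights)    # equal weights lose nothing
deff_uneven = kish_design_effect(uneven_weights)  # uneven weights inflate variance
print(deff_equal, deff_uneven, effective_sample_size(uneven_weights))
```

The approximation ignores clustering and stratification, so it speaks only to the "weighting only" row.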

You can follow links from here to some of the output that was used to create this table, in R (ex3Rres.htm) and in Stata.


3.4 Regions and health boards

Results for local areas based on small numbers lose precision. Concern about this led the Health Survey to present results only for seven regional groups of Health Boards. We investigate this by looking at one region and its constituent Health Boards.

                    Percentage Smokers    Standard Error
Lothian and Fife
Lothian HB
Fife HB

Table 3.4: Smoking rates by health board Lothian and Fife region

We can see that, as expected, the Health Board estimates are less precise. This is partly because the numbers are small and partly because the stratification was not done at Health Board level. But the two areas differ, and models (part of the program code) show that this difference is just significant at the 5% level.
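For readers who want a quick check without fitting a model, a simple z statistic for the difference between two estimates can be computed from their percentages and standard errors. This is a swapped-in shortcut, not the model-based test used in the exemplar code, and it assumes the two estimates are independent, which is only approximately true here because strata were not formed at Health Board level. The figures below are invented placeholders, not the values from Table 3.4.

```python
import math


def difference_z(p1, se1, p2, se2):
    """z statistic for the difference of two (assumed independent) estimates."""
    se_diff = math.sqrt(se1 ** 2 + se2 ** 2)
    return (p1 - p2) / se_diff


# Hypothetical percentages and standard errors (NOT the Table 3.4 values).
z = difference_z(30.0, 3.0, 38.0, 4.0)

# |z| above about 1.96 corresponds to significance at the 5% level.
print(f"z = {z:.2f}, significant at 5%: {abs(z) > 1.96}")
```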

You will get more details for other local authorities by looking at the results from some of the packages. Links to regional analyses are in R ex3Rres.htm, in Stata ex3Statares.htm and in SAS ex3sasres.htm


3.5 Factors that affect smoking

We look here at factors affecting smoking, with a view to understanding why the weighted rates are lower than the unweighted ones. Some packages have logistic regression programs adapted for survey design. This allows easy exploration of the most important factors influencing smoking, and sample programs are included with the program code for these exemplars.

As in the survey report, we find social class at the individual and area level are important predictors of smoking cigarettes. The other important feature was that people who live alone have higher smoking rates. There were also variations in smoking rates by age and sex and by region, but these were less important than the social ones. You can see the details if you run the exemplar programs for Stata, R or SAS. We illustrate this here:

Figure 3.2 Rates of smoking by various factors, weighted percentages.

How would these factors affect the comparison of weighted and unweighted results?

The health survey was weighted first to account for design features and then to allow for non-response. The most important design feature for smoking was probably the selection of one adult per household, thus giving higher weights to those in households with more adults. We can see that this would reduce the smoking rate in the weighted analysis, since people who live with other adults have lower smoking rates.

Non-response weighting was by region, age and sex. From the survey volume we can see that the weighting for non-response gave higher weights to Glasgow and to 20-35 year olds. These are groups with higher smoking rates, so the effect would be to increase smoking in the weighted analysis. So it appears to be the weighting for number of adults that is bringing down the smoking rate in the weighted analysis.
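The effect of selecting one adult per household can be shown with a small worked example. The household counts and smoking rates below are invented for illustration; only the mechanism (each respondent stands for all the adults in their household) follows the text above.

```python
# Hypothetical responding households (NOT survey figures), keyed by the
# number of adults; one adult is interviewed in each household.
households = {
    1: {"count": 400, "smoking_rate": 0.40},  # people living alone smoke more
    2: {"count": 600, "smoking_rate": 0.30},
}


def expected_rates(households):
    """Expected unweighted and adult-weighted smoking percentages."""
    unweighted_num = unweighted_den = 0.0
    weighted_num = weighted_den = 0.0
    for n_adults, info in households.items():
        respondents = info["count"]  # one respondent per household
        unweighted_num += respondents * info["smoking_rate"]
        unweighted_den += respondents
        # The design weight equals the number of adults in the household.
        weighted_num += respondents * n_adults * info["smoking_rate"]
        weighted_den += respondents * n_adults
    return (100.0 * unweighted_num / unweighted_den,
            100.0 * weighted_num / weighted_den)


unweighted_pct, weighted_pct = expected_rates(households)
print(f"unweighted: {unweighted_pct:.1f}%  weighted: {weighted_pct:.1f}%")
```

Because single-adult households are over-represented among respondents and have the higher smoking rate in this example, the unweighted figure comes out above the weighted one.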

Has the weighting adjusted for all possible important features?

Given that socioeconomic factors are a major influence on smoking it would seem to be desirable to adjust for any non-response that could be identified in relation to this. One possibility would be to look at the response rates by the area deprivation (Carstairs score). It is not possible to investigate this fully without access to the details on non-responses. But the number of responders per PSU seems to be lower in the more deprived areas. However, this might be due to the fact that the PAF includes more non-contactable addresses in deprived areas.

This suggests that weighting for non-response within the sample, by deprivation score, before post-stratification might be advisable to get more accurate results. But if this affected smoking rates it would make them difficult to compare with previous years.


3.6 Which packages can do these analyses?

R survey package
SPSS complex surveys
calculation of smoking rates with standard errors
yes but need to recode as 0/1
design effects
as above, but for subsets of the survey
post-stratification for nonresponse

yes but with contributed command

logistic regression for survey data
no but can use ordinary regression
not in version 12
$ There are different ways of defining design effects for subgroups

Table 3.5: Features of different packages


3.7 Details of the survey and data set

Survey design

The Scottish Health Survey is a survey of the household population of Scotland. Fieldwork for the 1998 survey took place over the year in a manner that balanced the sample for seasonal features. The survey includes an interview and a visit from a nurse. We will only be using the interview data. One adult per household, randomly chosen, is interviewed. The sampling frame used was the Postcode Address File (PAF). This survey also includes data on children, but we do not include them in the analyses here.

Primary sampling units and sample selection

The primary sampling units were post code sectors (around 5000 households). The PSUs were selected with probability proportional to size, with some exceptions in the Island areas, that are discussed in the technical report of the survey. A sample of 46 addresses was then selected from each selected PSU. In the last quarter of the year this number was increased to 58.


The selection of postcode sectors was carried out separately within each region. Regions of Scotland were either individual Health Boards or pairs of Health Boards. Within each region the sectors were ordered by their Carstairs deprivation index and a systematic sample was selected. This means that the sample is implicitly stratified by the Carstairs index, since the balance in the sample will reflect the pattern in the sampling frame. To represent this in the design, strata have been formed by grouping adjacent sectors (with similar Carstairs values) together in pairs. A few strata contained 3 PSUs because of odd numbers.
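The selection step described above can be sketched as systematic sampling with probability proportional to size from a list ordered by deprivation. This is a generic textbook version, not the survey contractor's actual procedure; the sector names, Carstairs scores and household counts are invented, and the `start` argument exists only to make the example reproducible.

```python
import random


def pps_systematic_sample(sizes, n_select, start=None, rng=random):
    """Return indices of units chosen by systematic PPS sampling.

    sizes must already be in the desired order (here, by deprivation).
    A unit larger than the sampling interval could be selected twice.
    """
    step = sum(sizes) / n_select
    if start is None:
        start = rng.random() * step  # random start in [0, step)
    targets = [start + k * step for k in range(n_select)]
    selected, cumulative, next_target = [], 0.0, 0
    for index, size in enumerate(sizes):
        cumulative += size
        while next_target < n_select and targets[next_target] < cumulative:
            selected.append(index)
            next_target += 1
    return selected


# Hypothetical sectors: (name, Carstairs score, number of households).
sectors = [("S1", 2.1, 4000), ("S2", -1.3, 6000),
           ("S3", 0.4, 5000), ("S4", 3.5, 5000)]
ordered = sorted(sectors, key=lambda s: s[1])  # order by deprivation
picks = pps_systematic_sample([s[2] for s in ordered], n_select=2, start=2500)
chosen = [ordered[i][0] for i in picks]
print(chosen)
```

Because the list is sorted before the evenly spaced selection points are laid down, the chosen sectors are spread across the deprivation range, which is what makes the stratification implicit.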

Weighting from the design

The probability of selection for a PSU varied by region. Also a weight equal to the number of adults in the household was applied to adjust for the over-representation of people in households with few adults.


After the survey data were completed and the design weights calculated the achieved weighted sample was compared to the mid year population estimates in terms of its regional distribution and age/sex categories. A further weight was calculated to make the sample match the population data. This effectively adjusts for differential non-response by region and by age and sex. Response was lowest in Glasgow and among the age groups 20-35.
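The adjustment described above can be sketched as a simple cell-based post-stratification: within each cell the design weights are multiplied by the ratio of the population total to the weighted sample total. The sketch assumes a single cross-classification into cells (the survey's exact procedure may differ), and the cell labels, weights and population totals are invented.

```python
def post_stratify(records, population_totals):
    """Return adjusted weights so weighted cell totals match the population.

    records           : list of dicts with keys "cell" and "weight"
    population_totals : dict mapping each cell to its population count
    """
    weighted_totals = {}
    for r in records:
        weighted_totals[r["cell"]] = weighted_totals.get(r["cell"], 0.0) + r["weight"]
    factors = {cell: population_totals[cell] / total
               for cell, total in weighted_totals.items()}
    return [r["weight"] * factors[r["cell"]] for r in records]


# Hypothetical respondents and population totals (NOT survey values).
records = [
    {"cell": "Glasgow, men 20-35", "weight": 1.0},
    {"cell": "Glasgow, men 20-35", "weight": 1.0},
    {"cell": "Fife, women 36-50", "weight": 2.0},
    {"cell": "Fife, women 36-50", "weight": 2.0},
]
population_totals = {"Glasgow, men 20-35": 100, "Fife, women 36-50": 100}

adjusted = post_stratify(records, population_totals)
print(adjusted)
```

A cell with low response has a small weighted total, so its respondents receive a larger adjustment factor.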

Final weights

The final weights are a combination of the design weights and the post-stratification weights. The histogram below shows the distribution of the variable WEIGHTA, which is scaled to add (approximately) to the sample size.

Fig 3.3: Histogram of final weights scaled to add to sample size.

A further weight is provided on our data files. This is called GROSSWT and is identical to WEIGHTA except that it has been scaled up to make its total match the mid year population estimate for 16-74 year olds in Scotland. It is required by SPSS in order to get estimates of Design Effects.
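The relationship between the two weight variables is a simple rescaling, sketched below. The four weights and the population total are invented; the point is that multiplying every weight by the same constant changes the weighted totals but leaves weighted percentages unchanged.

```python
def rescale(weights, target_total):
    """Multiply all weights by one constant so that they sum to target_total."""
    factor = target_total / sum(weights)
    return [w * factor for w in weights]


def weighted_percentage(values, weights):
    """Weighted percentage of a 0/1 indicator."""
    return 100.0 * sum(w * v for w, v in zip(weights, values)) / sum(weights)


# Hypothetical values (NOT taken from the data files).
weighta = [0.5, 1.0, 1.5, 1.0]        # sums to the sample size, 4
grosswt = rescale(weighta, 4000)      # sums to an assumed population of 4000
smokes = [1, 0, 1, 0]

print(weighted_percentage(smokes, weighta), weighted_percentage(smokes, grosswt))
```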

The data sets

We have constructed a data set for this analysis with just those variables we will use. The process by which the data set has been constructed, and the programmes used to make it from the data at the Essex Data Archive, are explained in detail here for anyone who has an interest in this. The identification of the PSUs from the data on the archive required some detective work. It is explained in the comments of the SPSS program used to prepare the data set.

After the PSUs had been identified they were grouped into strata in pairs according to the ranking of their Carstairs index, within regions. Where there was an odd number of PSUs in a region the last stratum was given 3 PSUs.
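The pairing rule can be written down in a few lines. This is a sketch of the rule as described above, not the SPSS code used to prepare the data set; the function name and PSU labels are invented, and the input is assumed to be one region's PSUs already sorted by Carstairs index.

```python
def pair_into_strata(psus_sorted):
    """Group an ordered list of PSUs into strata of two.

    If the number of PSUs is odd, the leftover PSU joins the last pair,
    giving one stratum of three.
    """
    strata = [psus_sorted[i:i + 2] for i in range(0, len(psus_sorted), 2)]
    if len(strata) > 1 and len(strata[-1]) == 1:
        leftover = strata.pop()
        strata[-1].extend(leftover)
    return strata


# Hypothetical PSU labels for one region, ordered by Carstairs index.
print(pair_into_strata(["A", "B", "C", "D", "E"]))
```

Keeping at least two PSUs in every stratum matters because variance estimation within a stratum needs more than one PSU.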

The information used to derive the strata, including the exact Carstairs index values of the PSUs, has been removed from the file to prevent identification.

The variables in the data files are :-

Variable Label
1 age age of respondent in questionnaire.
2 sex sex of respondent from household grid.
3 hboard Health Board
4 carstg5 carstairs index
5 cigst1 cigarette smoking status-never/ex-reg/ex-occ/current
6 cigst2 cigarette smoking status - banded current smokers
7 nofad number of adults
8 sc social class (adult respondent)
9 weighta weight variable (with some random noise added)
10 grosswt weight grossed to population totals
11 idno sequential ID
12 psu PSU no
13 regstrat unique stratum no

The analyses presented here mostly do not allow for this final post-stratification in the calculation of the standard errors and design effects. We tested this out for smoking rates using replication methods. Results can be viewed in the corresponding results files ex3Rres.htm and ex3Statares.htm

peas project 2004/2005/2006