
Exemplar 1: Income Distributions in Scotland

1.1 Introduction > 1.2 Getting started > 1.3 Different answers for the mean income and its standard error > 1.4 Pitfalls > 1.5 Percentiles > 1.6 Income for single parents > 1.7 Data set and survey design details


1.1 Introduction

This exemplar is about estimating the income distribution of households in Scotland and was motivated by work commissioned by Communities Scotland. It is based on interviews carried out in Scotland as part of the Family Resources Survey.

The Family Resources Survey (FRS) data for this exemplar were obtained from the ESRC Essex Data Archive and originally consisted of the FRS data for the financial year 2002/2003. The data differ from those held in the archive because of steps taken to prevent disclosure of information about individuals. The principles we have followed in anonymising the data can be viewed here, and the details of what has been done in this case are explained here. Analyses of the anonymised data provide results similar to those that would have been obtained from the original data.

Our analyses also depart, to some extent, from those carried out by the FRS team. We have post-stratified the data to match Scottish totals. The FRS team carried out a similar post-stratification, but at a UK level.

We have prepared small data sets (see Table 1.2 and the data codebook) with only the variables you need for each package. To find out how the variables relate to those on the archive, click here.

In this exemplar we illustrate how to rake a survey so that it matches more than one set of population totals. The packages Stata and R can do this and the SAS macro CALMAR can be used.

We illustrate each of these. Once the survey has been post-stratified, the calculation of standard errors should allow for the gain in precision that the post-stratification should have given. This requires extra methods of analysis. Both Stata and R can do this by replication methods. The R survey package can also do it by a calibration method.

Feature             | Detailed explanation | Details for this survey | Packages that can handle this
Weighting           | click here           |                         | all
Clustering          | click here           |                         | all
Post-stratification | click here           |                         | R and Stata have programs to do this. There is also a SAS macro.
Grossing up         | click here           |                         | all
Subgroup analyses   |                      | see below               | all

Table 1.1 Features of this exemplar

This table lists the features of the survey design that needed to be allowed for in the analyses presented here.


1.2 Getting Started

From links in this section you can:

  • Download or open the data files
  • Analyse them with whichever of the four packages you have available
  • View the code (with comments) and the output, even if you don't have the software.

To start, click the mini guide for the statistical package you want to use to analyse Exemplar 1.

For additional help click on the appropriate novice guide.

For details of the data set see below.

Mini Guides

Novice Guides

Do not just click on the items in this table; go to the mini guides first.
Package | Data sets     | Program code                           | Output
SAS     | ex1.sas7bdat* | ex1.sas*, pctilegrps.sas*, ex1Sas.htm  | ex1sasres.htm
Stata   | ex1.dta       | ex1.do*, ex1_v8.do*, ex1Stata.htm      | ex1statares.htm
SPSS    | ex1.sav       | ex1.SPS, exemp1spss.htm                | ex1spssres.htm
R       | ex1.RData     | ex1.R*, ex1R.htm                       | ex1rres.htm

Table 1.2 Data sets and code.

* SAVE these files to your computer. They do not open from outside the software packages.

You may have to save some of the other files to disc if your set-up does not allow you to open files directly.

The html files allow you to view program code and results outside packages.


1.3 Different answers for the mean income and its standard error.

Method  | Mean    | Standard error | Design effect | Comments
A       | ₤470.86 | ₤6.02          | -             | Unweighted mean
B       | ₤483.09 | ₤7.88          | 1.48          | Weighted mean - no other design features allowed for
C1      | ₤483.09 | 29p            | -             | Frequency weights using the grossing-up weight GROSS2 (the default non-survey weight in SPSS)
C2      | ₤483.09 | ₤6.25          | -             | Defining analytical weights (the default in SAS) or, equivalently, using frequency weights and rescaling so they add to the sample size
D       | ₤483.09 | ₤10.64         | 2.90          | Weighted mean allowing for clustering *** CORRECT ***
D after | ₤479.58 | ₤10.58         | 2.74          | Using the weights calculated to match Scotland totals
E       | ₤479.58 | ₤7.46          | 1.43          | Using the weights calculated to match Scotland totals and using either a calibration method (R) or a jackknife procedure to see the benefits of post-stratification *** CORRECT ***

Table 1.3 Estimates from different methods: mean and standard error of all household incomes in Scotland.

Comments

Weighting corrects the bias in the income estimates: it increases the mean weekly income by about ₤12 (from ₤470.86 to ₤483.09) and somewhat increases its standard error (B).

The theory section explains different kinds of weighting (design, frequency and analytical). Wrongly defining the weights as frequency weights (C1, the default in SPSS) gives much too small a standard error here, because the weights are grossing-up weights and this makes the program think you have an enormous sample. Using analytical weights or, equivalently, rescaling the weights so they add to the sample size (C2) gets an answer in the right ballpark, but not what it should be for probability weights.
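A minimal sketch of why C1 comes out so small, assuming the usual textbook formula in which the sum of the frequency weights plays the role of the sample size. The incomes, the weights and the helper `naive_weighted_se` are made up for illustration; they are not FRS values or any package's actual routine.

```python
import math

def naive_weighted_se(values, weights):
    """Mean and standard error when weights are treated as frequency weights,
    so that the sum of the weights acts as the apparent sample size."""
    n_apparent = sum(weights)
    mean = sum(w * y for y, w in zip(values, weights)) / n_apparent
    var = sum(w * (y - mean) ** 2 for y, w in zip(values, weights)) / (n_apparent - 1)
    return mean, math.sqrt(var / n_apparent)

# Hypothetical weekly incomes and grossing-up weights (illustration only)
incomes = [210.0, 340.0, 455.0, 520.0, 615.0, 980.0]
gross_w = [480.0, 510.0, 450.0, 530.0, 470.0, 560.0]

# Rescale the weights so that they add to the sample size
scale = len(incomes) / sum(gross_w)
rescaled_w = [w * scale for w in gross_w]

mean_freq, se_freq = naive_weighted_se(incomes, gross_w)     # analogue of C1
mean_resc, se_resc = naive_weighted_se(incomes, rescaled_w)  # analogue of C2

print(f"mean {mean_freq:.2f}; s.e. with grossing-up weights {se_freq:.2f}")
print(f"mean {mean_resc:.2f}; s.e. with rescaled weights   {se_resc:.2f}")
```

Both versions give the same mean; only the apparent sample size, and hence the standard error, changes. Neither version allows for clustering.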

The correct standard error for this design is D, which also allows for clustering into 320 clusters. This increases the standard error of the estimates even more.
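As a rough illustration of what "allowing for clustering" involves, the sketch below uses one common approach, a Taylor linearisation of the weighted mean with PSU totals and no strata. The data and the function `clustered_mean_se` are hypothetical, and the packages may differ in detail from this sketch.

```python
import math
from collections import defaultdict

def clustered_mean_se(values, weights, psu_ids):
    """Weighted mean and a linearisation standard error based on PSU totals
    (with-replacement approximation, no stratification)."""
    total_w = sum(weights)
    mean = sum(w * y for y, w in zip(values, weights)) / total_w
    # Sum the linearised contributions w * (y - mean) / total_w within each PSU
    psu_totals = defaultdict(float)
    for y, w, c in zip(values, weights, psu_ids):
        psu_totals[c] += w * (y - mean) / total_w
    n_psu = len(psu_totals)
    centre = sum(psu_totals.values()) / n_psu
    var = n_psu / (n_psu - 1) * sum((z - centre) ** 2 for z in psu_totals.values())
    return mean, math.sqrt(var)

# Hypothetical data: incomes are similar within a PSU (illustration only)
incomes = [200.0, 220.0, 210.0, 500.0, 520.0, 510.0,
           900.0, 880.0, 890.0, 350.0, 360.0, 340.0]
weights = [450.0, 500.0, 550.0, 500.0, 480.0, 520.0,
           600.0, 520.0, 480.0, 400.0, 500.0, 500.0]
psu = [1, 1, 1, 2, 2, 2, 3, 3, 3, 4, 4, 4]

mean, se_clustered = clustered_mean_se(incomes, weights, psu)
# Treating every household as its own PSU ignores the clustering
_, se_unclustered = clustered_mean_se(incomes, weights, list(range(len(incomes))))

print(f"mean {mean:.2f}; clustered s.e. {se_clustered:.2f}; "
      f"ignoring clusters {se_unclustered:.2f}")
```

Because households in the same PSU resemble each other in this made-up example, the clustered standard error is larger than the one that ignores the PSUs, which is the direction of the change from B to D.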

Post-stratification to match two margins (raking) changes the weights so that they match the Scottish population totals. It changes the mean income just a little, decreasing it by about ₤3.50. This adjustment seems mainly to be due to the population having a higher proportion of privately renting households than the original weighted data. When analysed correctly, raking also improves the precision of estimation. In this example, the gain from post-stratification recovers the precision lost through clustering.
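Raking can be sketched as iterative proportional fitting: scale the weights to match one margin, then the other, and repeat until both match. The code below is an illustrative sketch only. The six households, their categories and the target totals are invented, and `rake` and `margin_totals` are hypothetical helpers, not the routines used by R, Stata or CALMAR.

```python
def margin_totals(weights, categories):
    """Sum the weights within each category of one margin."""
    totals = {}
    for w, c in zip(weights, categories):
        totals[c] = totals.get(c, 0.0) + w
    return totals

def rake(weights, margins, tol=1e-9, max_iter=200):
    """Scale weights to each margin in turn until all margins are matched.

    margins is a list of (categories, targets) pairs, one pair per margin.
    """
    w = list(weights)
    for _ in range(max_iter):
        for categories, targets in margins:
            totals = margin_totals(w, categories)
            w = [wi * targets[c] / totals[c] for wi, c in zip(w, categories)]
        # Largest relative discrepancy over all margins after this pass
        worst = 0.0
        for categories, targets in margins:
            totals = margin_totals(w, categories)
            worst = max(worst, max(abs(totals[c] / targets[c] - 1.0) for c in targets))
        if worst < tol:
            break
    return w

# Hypothetical households with two margins (illustration only)
tenure = ["owned", "owned", "owned", "rented", "rented", "rented"]
ctband = ["A", "B", "A", "B", "A", "B"]
start_w = [100.0] * 6
tenure_totals = {"owned": 330.0, "rented": 290.0}
ctband_totals = {"A": 300.0, "B": 320.0}

raked = rake(start_w, [(tenure, tenure_totals), (ctband, ctband_totals)])
print(margin_totals(raked, tenure))
print(margin_totals(raked, ctband))
```

The two sets of target totals must add to the same overall total, and every category needs at least one respondent, otherwise the procedure cannot match the margins.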

To see this improvement in precision reflected in smaller standard errors, you need to analyse the data using a method that takes the post-stratification into account. Just adjusting the weights will not do this. In fact, in this case, it makes it look as though the change in weighting has had no effect on the standard errors (D after). But analysing the data with allowance for the post-stratification, using either a calibration method (implemented in R) or a jackknife method, shows that it has improved things substantially (E).
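The idea behind the jackknife route can be sketched as follows: drop one PSU at a time, redo the post-stratification on what remains, and use the spread of the replicate means. This is a simplified sketch under stated assumptions: it post-stratifies on a single margin rather than raking on two, the data are invented, and `poststratify` and `jackknife_se` are hypothetical helpers rather than any package's implementation.

```python
import math

def poststratify(weights, groups, targets):
    """Scale weights so that each group's total matches its target."""
    totals = {}
    for w, g in zip(weights, groups):
        totals[g] = totals.get(g, 0.0) + w
    return [w * targets[g] / totals[g] for w, g in zip(weights, groups)]

def weighted_mean(values, weights):
    return sum(w * y for y, w in zip(values, weights)) / sum(weights)

def jackknife_se(values, weights, psu_ids, groups, targets):
    """Delete-one-PSU jackknife, re-post-stratifying within every replicate."""
    theta = weighted_mean(values, poststratify(weights, groups, targets))
    psus = sorted(set(psu_ids))
    n = len(psus)
    sum_sq = 0.0
    for drop in psus:
        keep = [i for i, c in enumerate(psu_ids) if c != drop]
        # Post-stratification rescales the remaining weights to the targets
        rep_w = poststratify([weights[i] for i in keep],
                             [groups[i] for i in keep], targets)
        theta_r = weighted_mean([values[i] for i in keep], rep_w)
        sum_sq += (theta_r - theta) ** 2
    return theta, math.sqrt((n - 1) / n * sum_sq)

# Hypothetical data: income depends strongly on tenure (illustration only)
incomes = [600.0, 620.0, 250.0, 590.0, 260.0, 240.0,
           610.0, 605.0, 255.0, 615.0, 245.0, 250.0]
tenure = ["owned", "owned", "rented", "owned", "rented", "rented",
          "owned", "owned", "rented", "owned", "rented", "rented"]
psu = [1, 1, 1, 2, 2, 2, 3, 3, 3, 4, 4, 4]
weights = [500.0] * 12
tenure_totals = {"owned": 3300.0, "rented": 2700.0}

theta, se_ps = jackknife_se(incomes, weights, psu, tenure, tenure_totals)
# Same jackknife with a single "post-stratum", i.e. no real post-stratification
_, se_plain = jackknife_se(incomes, weights, psu, ["all"] * 12, {"all": 6000.0})

print(f"post-stratified mean {theta:.2f}; s.e. {se_ps:.2f}; "
      f"s.e. without post-stratification {se_plain:.2f}")
```

In this toy example every PSU contains both tenure groups, so no post-stratum becomes empty when a PSU is dropped. Real data would need a check for that.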

Is it worth the bother of post-stratifying?

It depends on whether you believe the population totals are correct. If you can accept this, then the effective sample size for the whole sample is increased by a factor of almost 2 for this analysis. This represents a BIG saving. But analysing the data with correct methods brings some overheads. At present that means using either a method that allows for post-stratification by calibration (only R, of the packages we feature) or replication methods (R, Stata or, with more effort, SAS). Replication methods are more work, but Stata and especially R have very easy-to-use routines for them. See the very concise code that you need for R for this exemplar. The advantage will be less for subgroups (see section 1.6 below).
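The "factor of almost 2" can be checked from the figures in Table 1.3. The short calculation below is only a sketch of that arithmetic; it assumes the gain in effective sample size is measured either by the ratio of the two design effects or by the squared ratio of the two standard errors.

```python
# Figures taken from Table 1.3
se_weights_only = 10.58      # row "D after": raked weights, clustering only
se_poststratified = 7.46     # row E: post-stratification allowed for
deff_weights_only = 2.74
deff_poststratified = 1.43

# Gain in effective sample size, measured two ways
deff_ratio = deff_weights_only / deff_poststratified
variance_ratio = (se_weights_only / se_poststratified) ** 2

print(f"ratio of design effects: {deff_ratio:.2f}")
print(f"ratio of variances:      {variance_ratio:.2f}")
```

Both ratios come out close to 2, in line with the statement in the text.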


1.4 Pitfalls:

When a survey is post-stratified, members of a cluster can fall into different post-strata. This means that you cannot simply define your post-strata as though they were strata. Splitting PSUs would be wrong, as it would imply that each cluster consisted of several smaller clusters, one for each part of the PSU that falls into a different post-stratum. Some method that makes specific allowance for the post-stratification needs to be used. Two such methods are available in the packages we are featuring on this site. The easiest to use is a calibration method available in R, and the alternative is to use replication methods. The figures in Table 1.3 are calculated by replication, but calibration gave almost identical answers with much less effort.

When you define the post-stratification as though it were just ordinary disproportionate stratification (as carried out at the design stage), the packages do different things. The problem is that this divides PSUs between strata, which results in a split PSU problem.

  • The R survey package prints a warning. If you check the help you will see
    that the option to split the PSUs is there, and it is suggested in the warning. JOINT BEST BUY *****
  • SAS prints a warning in the log file and suggests that the option of
    splitting the PSUs between strata be considered. JOINT BEST BUY *****
  • Stata splits the PSUs with no warning, so you will only find out if you notice
    that you now have lots of PSUs instead of just 320.
  • SPSS splits the PSUs and does not appear to give any output that shows how many PSUs you have, so there is no warning at all.


1.5 Percentiles of the distribution

For an explanation of percentiles click here. All the packages can calculate estimates of weighted percentiles and all get the same answers. Only one package (the R survey package) offers a method to calculate standard errors for them. It uses the method described here by default. The current version for replicate weights seems to have a bug (as of August 2005), so you may still have to make use of the supplied alternative functions that can be found on this site (myRfunctions.R). A SAS macro, pctilegrps.sas, is also available to implement this method. The method could also be programmed in Stata, but it would require programming expertise that we don't have.

                 | Before post-stratification | After post-stratification  | 95% CI
Percentage point | Estimated quantile | s.e.  | Estimated quantile | s.e.  | Lower limit | Upper limit
5%               | 102.00             | 3.50  | 100.00             | 3.07  | 92.00       | 106.92
10%              | 131.00             | 3.50  | 129.00             | 3.50  | 121.00      | 135.00
25%              | 202.00             | 4.50  | 200.00             | 4.00  | 192.78      | 208.00
50%              | 355.00             | 9.76  | 350.00             | 7.00  | 338.00      | 365.00
75%              | 625.00             | 15.00 | 618.00             | 11.79 | 596.00      | 636.00
90%              | 986.33             | 27.60 | 981.24             | 23.53 | 930.00      | 1019.77
95%              | 1277.79            | 37.50 | 1275.00            | 32.90 | 1204.00     | 1337.84

Table 1.4 Percentiles of the distribution and their standard errors

The results above show how the standard error of the upper percentiles increases as the distribution of incomes becomes more sparse.

Note also that the post-stratification has given slightly more precise estimates of the percentiles and reduced the estimates a little.
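A weighted percentile can be read off the weighted cumulative distribution: sort the incomes and find the first value at which the cumulative share of the weights reaches the required percentage. The sketch below shows this simple definition only. The function `weighted_quantile` and the data are hypothetical, the packages may interpolate between neighbouring values, and standard errors are not attempted here.

```python
def weighted_quantile(values, weights, p):
    """Smallest value whose cumulative weight share reaches p (0 < p <= 1)."""
    pairs = sorted(zip(values, weights))
    total = sum(w for _, w in pairs)
    cumulative = 0.0
    for y, w in pairs:
        cumulative += w
        if cumulative >= p * total:
            return y
    return pairs[-1][0]

# Hypothetical incomes; the last household carries a much larger weight
incomes = [100.0, 200.0, 300.0, 400.0]
weights = [1.0, 1.0, 1.0, 5.0]

weighted_median = weighted_quantile(incomes, weights, 0.5)
unweighted_median = weighted_quantile(incomes, [1.0] * 4, 0.5)
print(weighted_median, unweighted_median)
```

The heavily weighted household pulls the weighted median up to its own income, while the same rule with equal weights returns the lower of the two middle values.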


1.6 Income for single parents

Among the 4695 households surveyed, just 334 were lone parent households. We want to investigate the income in these families.

To carry out analyses by subgroups for survey data it is NOT CORRECT to make a new data set consisting of just the subset of the data and analyse it as if it were a complete survey. Instead, a special routine to look at subgroups needs to be used. The reason for this is that, when there are strata or post-strata, the subgroup of the survey will no longer balance to the stratum totals.
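One way to picture a proper subgroup routine is that it keeps the whole design and lets households outside the subgroup contribute zero, rather than deleting them. The sketch below does this for a linearised standard error of a subgroup mean with PSUs and no strata. The data and the function `domain_mean_se` are hypothetical, and the toy example only shows the effect of losing a PSU; the imbalance against stratum or post-stratum totals described above is not reproduced.

```python
import math

def domain_mean_se(values, weights, psu_ids, in_domain):
    """Subgroup mean and linearisation s.e., keeping every PSU in the design."""
    dom_w = sum(w for w, d in zip(weights, in_domain) if d)
    mean = sum(w * y for y, w, d in zip(values, weights, in_domain) if d) / dom_w
    # Every PSU gets an entry, even if it contains no subgroup member
    psu_totals = {c: 0.0 for c in psu_ids}
    for y, w, c, d in zip(values, weights, psu_ids, in_domain):
        if d:
            psu_totals[c] += w * (y - mean) / dom_w
    n_psu = len(psu_totals)
    # The PSU totals sum to zero, so no centring is needed
    var = n_psu / (n_psu - 1) * sum(z ** 2 for z in psu_totals.values())
    return mean, math.sqrt(var)

# Hypothetical households; PSU 4 contains no lone parent household
incomes = [250.0, 600.0, 280.0, 270.0, 300.0, 550.0, 500.0, 620.0]
weights = [500.0] * 8
psu = [1, 1, 2, 2, 3, 3, 4, 4]
lone_parent = [True, False, True, True, True, False, False, False]

mean, se_domain = domain_mean_se(incomes, weights, psu, lone_parent)

# Analysing only the subset silently drops PSU 4 from the design
keep = [i for i, d in enumerate(lone_parent) if d]
_, se_subset = domain_mean_se([incomes[i] for i in keep],
                              [weights[i] for i in keep],
                              [psu[i] for i in keep],
                              [True] * len(keep))

print(f"mean {mean:.2f}; domain s.e. {se_domain:.2f}; subset s.e. {se_subset:.2f}")
```

The two standard errors differ even in this small example, because the subset analysis no longer knows that four PSUs were sampled.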

All the packages considered provide this facility in one form or another. The table below gives the results for an appropriate method (with and without post-stratification). Notice that the advantage of a smaller standard error due to post-stratification, seen for the whole sample, has now almost disappeared.

                            | Mean    | Standard error | Design effect | Comments
Before post-stratification  | ₤276.56 | ₤8.49          | 1.01          | Notice that the design effect has disappeared. Clustering no longer hurts for sub-samples that are spread across PSUs.
After post-stratification   | ₤274.48 | ₤8.19          | 0.97          | As before, mean income comes down a little on raking to Scottish totals. But there is now very little advantage in improved precision.

Table 1.5 Mean and standard error of income for single parent households in Scotland analysed by different methods.

When you check the output from some of the programs you will see that you get a different design effect. What we have quoted is a design effect relative to a simple random sample of 334 single parent households drawn from all of Scotland. Another way to define a design effect is relative to a simple random sample of all households in Scotland, within which the number of single parent households would itself vary. This gives a different answer (0.84 for the design effect). This method of calculation is the default in Stata and the only option in SPSS.


1.7 Details of the data set and survey design

We have constructed a data set for this analysis with just those variables we will use. The process by which the data set has been constructed, and the programmes used to make it from data at the UK Data Archive, are explained in detail here. The variables in the data file are:

NAME     | DESCRIPTION
HHINC    | Gross household weekly income (₤s)
DEPCHLDH | Dependent children in household
ADULTH   | Number of adults in the household
PSU      | Primary sampling unit (PSU), here the postcode sector, renumbered to prevent identification
CTBAND   | Council tax band used for post-stratification to Scottish totals (1 to 8 for A to H and 9 = not separately assessed)
TENTYPE  | Type of household tenure used for post-stratification to Scottish totals (1 = owned, 2 = LA rented, 3 = other public rented, 4 = private rented or rent free)
GROSS2   | Weight provided by the survey contractors, calculated from all aspects of the design including post-stratification to UK totals, but not to the Scottish totals

Table 1.6 Survey variables selected for analysis

Further details of the survey can be found here and in particular the methodology section. Questionnaires and further details are available from the ESRC supported Question Bank.

Sample Selection

The sampling frame is the Postcode Address File (PAF), a list from the Post Office of all addresses in the UK. We are using data from the financial year 2002/03 for Scotland only.

PSUs / Clustering

The FRS sample is a clustered sample in which the primary sampling units (PSUs) are postcode sectors, with up to 25 respondents from one sector. The data used have 4695 households in 320 PSUs.

Weighting

Figure 1.1: Box plots of weights by number of adults in the household

The weights in the file are grossing-up weights that add to the population size. There is a fairly large range of weights, with some very large relative to others. The range of weights by number of adults in the household is shown in Figure 1.1. The median weight is 466.

As this is a survey of households, not people, single person households are not down-weighted.

The major factors in weighting probably relate to the post-stratification carried out at the UK level.

Stratification

Stratification was used in selecting the sample: PSUs were stratified within regions (Scotland being one region) by eight area-based socioeconomic groups. These design variables were not available with the data and have been ignored in these analyses.

Post-stratification

The FRS data were post-stratified at the UK level, using the method of raking, to make the total households match national totals by tenure and council tax band. In this exemplar we illustrate how to carry out post-stratification for the Scottish data alone. Total numbers of Scottish households by council tax band are available from a Scottish Economic Statistics publication. The 2001 census was used to get data on housing tenure in four broad bands.

Table 1.7 compares these outside data with the weighted sample totals. We can see that differences between the sample and the population percentages are small, probably due to the raking carried out at a UK level.

NOTE: This table has French headings because it is based on the output from the SAS CALMAR macro, featured in the SAS analyses, which is written in French.

Variable | Modalité ou variable    | Marge échantillon (sample margin) | Marge population | Pourcentage échantillon (sample) | Pourcentage population
CTBAND   | A                       | 515672                            | 556691.58        | 23.05                            | 24.83
CTBAND   | B                       | 547548                            | 551983.35        | 24.48                            | 24.62
CTBAND   | C                       | 351599                            | 346390.85        | 15.72                            | 15.45
CTBAND   | D                       | 291425                            | 267023.63        | 13.03                            | 11.91
CTBAND   | E                       | 266257                            | 268144.64        | 11.90                            | 11.96
CTBAND   | F                       | 147851                            | 133399.71        | 6.61                             | 5.95
CTBAND   | G                       | 87767                             | 88335.27         | 3.92                             | 3.94
CTBAND   | H                       | 9190                              | 10089.05         | 0.41                             | 0.45
CTBAND   | not separately valued   | 19670                             | 19953.91         | 0.88                             | 0.89
tenure   | owned                   | 1459205                           | 1404172.12       | 65.23                            | 62.63
tenure   | LA rented               | 493237                            | 484050.39        | 22.05                            | 21.59
tenure   | other public rented     | 128189                            | 125104.27        | 5.73                             | 5.58
tenure   | private rented or free  | 156348                            | 228685.22        | 6.99                             | 10.20

Table 1.7: Comparison of population numbers of households with data from the weighted sample before post-stratification
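The percentages in Table 1.7 follow directly from the margins, and the ratio of the population margin to the sample margin indicates roughly how much the weights in each category need to move. The short calculation below reproduces this for the tenure rows. It is a sketch of the arithmetic only; the ratio is the adjustment a single one-way post-stratification would apply, not the final raked adjustment.

```python
# Tenure margins copied from Table 1.7
sample_margin = {
    "owned": 1459205.0,
    "LA rented": 493237.0,
    "other public rented": 128189.0,
    "private rented or free": 156348.0,
}
population_margin = {
    "owned": 1404172.12,
    "LA rented": 484050.39,
    "other public rented": 125104.27,
    "private rented or free": 228685.22,
}

sample_total = sum(sample_margin.values())
population_total = sum(population_margin.values())

summary = {}
for category in sample_margin:
    sample_pct = 100.0 * sample_margin[category] / sample_total
    population_pct = 100.0 * population_margin[category] / population_total
    # Factor a one-way adjustment on tenure alone would apply to the weights
    factor = population_margin[category] / sample_margin[category]
    summary[category] = (sample_pct, population_pct, factor)
    print(f"{category:24s} {sample_pct:6.2f} {population_pct:6.2f} {factor:6.3f}")
```

The private rented category stands out: its weights need to rise by well over 40 per cent, which is consistent with the remark in section 1.3 about privately renting households.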

peas project 2004/2005/2006