Clustering and PSU's (Primary Sampling Units)


2. Clustering and PSU's (Primary Sampling Units)


2.1 Introduction

Many surveys use clustered or ‘multi-stage’ designs. A ‘clustered’ sample is defined as a sample that is selected in two or more hierarchical stages, different ‘units’ being selected at each stage, and with multiple sub-units being selected within higher order units. A few examples will help to clarify this:

Example 1
A sample of children is selected by:
(a) sampling schools and then
(b) selecting children within schools.

This is a two-stage clustered sample, the clustering being of children within schools.

Example 2
In the design of a general population survey the Postcode Address File (PAF) is used to generate a sample of households. Within each household up to two adults are selected at random. This is a two-stage clustered sample, the clustering being of adults within households. Note that, had the instruction been to select just one adult per household, this would not be described as a clustered sample, because there would be no clustering of the adult sample within a smaller number of households.

Example 3
The most common design for PAF samples is, at the first stage, to select a random sample of postcode sectors. Then, at the second stage, households are selected within these postcode sectors. And then, at a third stage, individuals might be selected within households. Under this design adults are clustered within households (assuming more than one adult is selected per household) and households are clustered within postcode sectors.

Ultimate clusters
For clustered designs it is usual to describe the first stage sampling unit as the ‘primary sampling unit’ or PSU. When using statistics packages that compute complex standard errors from multi-stage clustered samples it is only necessary to have a PSU variable in the dataset. Any clustering after the first stage does not have to be identified - the variance between PSUs automatically incorporates later stages of clustering. For example, in a general population survey with postcode sectors as the PSUs and with clustering within households, a PSU identifier is needed but a household identifier is not.

This method of calculating standard errors is sometimes known as the 'ultimate cluster method'. Other methods that use the details of each stage of sampling are also available, but they are much harder to specify. Although some software does allow you to use them, we do not illustrate them on this site. Ultimate cluster methods are well-tested and there should be no need to use anything more complicated. The only exception to this would be where you have a design with several clustered stages and you want to apply a finite population correction (theory section 9) at each stage separately, including the first stage. See section 9.3 for a discussion of this.
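To make the idea concrete, here is a minimal Python sketch of an ultimate cluster standard error for a sample mean. It is an illustration rather than the routine any particular package uses, and it rests on assumptions not stated in the text above: equal weights, a single stratum, PSUs treated as if sampled with replacement, and the usual linearisation for a ratio of PSU totals. The function names `ultimate_cluster_se` and `srs_se` are hypothetical.

```python
import math
from collections import defaultdict


def ultimate_cluster_se(values, psu_ids):
    """Standard error of the sample mean using only PSU identifiers.

    Assumptions (illustrative): equal weights, one stratum, and PSUs
    treated as if they had been sampled with replacement.
    """
    totals = defaultdict(float)  # sum of y within each PSU
    counts = defaultdict(int)    # number of sampled units within each PSU
    for y, psu in zip(values, psu_ids):
        totals[psu] += y
        counts[psu] += 1

    n_psu = len(totals)
    if n_psu < 2:
        raise ValueError("at least two PSUs are needed to estimate a variance")

    n_obs = sum(counts.values())
    mean = sum(totals.values()) / n_obs

    # Linearised PSU-level residuals for the ratio (total of y) / (total count).
    # Later stages of clustering are absorbed into these PSU totals.
    z = [(totals[p] - mean * counts[p]) / n_obs for p in totals]

    variance = n_psu / (n_psu - 1) * sum(zi * zi for zi in z)
    return math.sqrt(variance)


def srs_se(values):
    """Standard error of the mean as if the data were a simple random sample."""
    n = len(values)
    mean = sum(values) / n
    s2 = sum((y - mean) ** 2 for y in values) / (n - 1)
    return math.sqrt(s2 / n)
```

Note that only the PSU identifier enters the calculation: a household identifier, if present, would not be used. With one sampled unit per PSU the two functions give the same answer, which is a useful sanity check.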


2.2 What impact does clustering have on standard errors

Clustering the sample almost always leads to an increase in the standard error of survey estimates (relative to the standard error for a simple random sample). This means that design effects and design factors are increased as a result of the clustered design.

The degree of increase depends upon two things:
(i) the sample size per cluster
(ii) the homogeneity of the clusters.

For a given total sample size, as the sample size per cluster increases, the standard error also tends to increase.

The homogeneity of the clusters is measured by the intra-cluster correlation coefficient (ICC, denoted r and sometimes called 'roh'). If the individuals within a cluster have more in common than individuals have in general, then r will be greater than zero. If, at the extreme, all individuals within a cluster are identical yet there is some between-cluster variation, then r will be equal to 1.

As r increases so does the standard error. This makes some intuitive sense - if the individuals within a cluster are all alike, but are different to individuals from other clusters, then with a clustered sample there is an increased risk of drawing a sample that happens to be very different to the population, and this risk is reflected in the standard error. The risk will increase as the number of clusters selected decreases (or, equivalently, as the sample size per cluster increases).
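The joint effect of cluster size and r is often summarised by a standard approximation, usually attributed to Kish, in which the design effect is 1 + (b - 1) r for clusters of equal sample size b. This formula is not given in the text above, so the Python sketch below should be read as an illustration under that equal-cluster-size assumption; the function names are hypothetical.

```python
import math


def design_effect(cluster_size, r):
    """Approximate design effect for a clustered sample.

    Assumes every cluster contributes the same number of sampled
    units (cluster_size) and that r is the intra-cluster correlation.
    """
    return 1.0 + (cluster_size - 1.0) * r


def design_factor(cluster_size, r):
    """Design factor: the square root of the design effect, i.e. the
    multiplier applied to the simple-random-sample standard error."""
    return math.sqrt(design_effect(cluster_size, r))


# Illustrative comparison: same r, increasing sample size per cluster.
for b in (1, 5, 20):
    print(b, design_effect(b, 0.05), design_factor(b, 0.05))
```

Under this approximation a sample with one unit per cluster has a design effect of 1 whatever the value of r, and the design effect grows with both the sample size per cluster and r, in line with the two factors listed above.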

In a household-based survey where the PSUs are geographical areas (such as postcode sectors) the types of variables that have relatively high r values are those with relatively little within-area variation. Tenure and dwelling type are examples.

The results from exemplar 1 and exemplar 2 illustrate this for real data.

2.3 Selecting a proportionate clustered sample

Some software packages (SAS and SPSS) include procedures that can be used to select a sample from a sampling frame. But these may not cover all the cases you wish and you may prefer to do it in other ways.

The selection of a clustered sample with probability proportional to size is easily managed by a simple set of steps:

(i) Order your sampling frame of clusters by a random number.
(ii) For each cluster, generate as many records as there are individual units in that cluster. You will now have as many records as there are individual units in the population.
(iii) Calculate the sampling interval (SINT) by dividing the total number of records by the number of clusters you want to select.
(iv) Select a systematic sample by counting down your records and selecting the cluster in which every SINTth record falls.
(v) In the unlikely event that you hit the same PSU twice (this will only happen if you have some very large PSUs), just ignore the repeat and, if necessary, replace it with another PSU using some reasonable ad hoc procedure.
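The procedure above can be sketched in Python as follows. This is a minimal illustration, not a replacement for the SAS or SPSS procedures. It makes two choices that are assumptions rather than part of the description: it walks through cumulative cluster sizes instead of physically generating one record per unit (the selections are the same), and it starts the systematic count at a random point within the first interval. The function name `select_pps_clusters` and its arguments are hypothetical.

```python
import random


def select_pps_clusters(cluster_sizes, n_select, rng=None):
    """Systematic selection of clusters with probability proportional to size.

    cluster_sizes: dict mapping cluster id -> number of units in the cluster.
    n_select: number of selection points (clusters wanted).
    rng: optional random.Random instance, for reproducibility.
    Repeat hits on the same cluster are ignored, so fewer than n_select
    clusters may be returned; any replacement is left to the caller.
    """
    if n_select < 1:
        raise ValueError("n_select must be at least 1")
    rng = rng or random.Random()

    clusters = list(cluster_sizes.items())
    rng.shuffle(clusters)                  # order the frame at random

    total = sum(size for _, size in clusters)
    if total <= 0:
        raise ValueError("the frame contains no units")

    interval = total / n_select            # the sampling interval, SINT
    start = rng.random() * interval        # random start (an assumption)
    points = [start + k * interval for k in range(n_select)]

    selected = []
    cumulative = 0
    next_point = 0
    for cluster_id, size in clusters:
        cumulative += size                 # records up to the end of this cluster
        while next_point < n_select and points[next_point] < cumulative:
            if cluster_id not in selected:  # ignore a second hit on the same PSU
                selected.append(cluster_id)
            next_point += 1
    return selected


# Hypothetical frame of four clusters; select two of them.
frame = {"A": 10, "B": 20, "C": 30, "D": 40}
print(select_pps_clusters(frame, 2, random.Random(2004)))
```

A cluster can only be hit twice if its size exceeds the sampling interval, which matches the remark that this will only happen with very large PSUs.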

If you also want the sample to be implicitly stratified by one or more other factors, then you would sort the file by these factors rather than by a random number. If more than one cluster shares the same values of the stratification factors, you would sort those clusters into a random order within each group.
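The ordering for implicit stratification can be sketched in Python as a sort on the stratification factors with a random number as the final sort key, so that clusters sharing the same factor values end up in random order within their group. The frame layout (a list of dictionaries) and the name `order_for_implicit_stratification` are assumptions made for illustration only.

```python
import random


def order_for_implicit_stratification(frame, strat_keys, rng=None):
    """Sort clusters by the stratification factors, breaking ties at random.

    frame: list of dicts, one per cluster.
    strat_keys: names of the fields to stratify on, in sort priority order.
    rng: optional random.Random instance, for reproducibility.
    """
    rng = rng or random.Random()
    # sorted() evaluates the key once per cluster, so each cluster gets a
    # single random number that only matters within tied groups.
    return sorted(
        frame,
        key=lambda c: (tuple(c[k] for k in strat_keys), rng.random()),
    )


# Hypothetical frame with one stratification factor.
frame = [
    {"id": 1, "region": "North"},
    {"id": 2, "region": "South"},
    {"id": 3, "region": "North"},
    {"id": 4, "region": "South"},
    {"id": 5, "region": "East"},
]
for cluster in order_for_implicit_stratification(frame, ["region"], random.Random(7)):
    print(cluster["region"], cluster["id"])
```

The sorted frame would then be used in place of the randomly ordered frame before counting down the records systematically.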


peas project 2004/2005/2006