
Precautions taken to anonymise data from the surveys

>> Why are we concerned?
>> Where does the survey data on the P|E|A|S site originate and where have we obtained it from?
>> Who has access to the survey data and what aspects may be restricted?
>> What special conditions apply to the P|E|A|S data?
>> How have we taken steps to make the data on this site anonymous?

Why do we need to be concerned with anonymity?

Survey subjects are, in almost every case, informed that the data they provide will be used only for 'statistical purposes' and that nothing will be published from the survey that makes it possible to identify individuals. Clearly, we must ensure that we never breach this promise when making real survey data available over the world wide web. This principle is enshrined in the Statement of Principles for the Office for National Statistics Code of Practice:

"The National Statistician will set standards for protecting confidentiality, including a guarantee that no statistics will be produced that are likely to identify an individual unless specifically agreed with them."

Data may disclose information about individuals or organisations, even when it is not indexed by a name or another identifier that could be used directly to trace the respondent. Disclosure can happen because an individual has a unique combination of characteristics. An individual responding to a survey might be the only doctor in a village. If the survey identifies the village as a geographic area and has details of occupations, then anyone with access to the anonymised survey data could identify this person's responses to other questions.
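The village doctor example can be sketched in a few lines of Python. This is only an illustration: the records, the choice of characteristics (area, occupation, age band) and the function name unique_combinations are all hypothetical and are not taken from any survey on this site. The sketch counts how many records share each combination of characteristics and lists the combinations held by exactly one record.

```python
from collections import Counter

# Hypothetical records: (area, occupation, age band). Not real survey data.
records = [
    ("Village A", "farmer", "40-59"),
    ("Village A", "farmer", "40-59"),
    ("Village A", "doctor", "40-59"),
    ("Town B", "doctor", "20-39"),
    ("Town B", "doctor", "20-39"),
    ("Town B", "teacher", "20-39"),
]

def unique_combinations(rows):
    """Return the combinations of characteristics held by exactly one record."""
    counts = Counter(rows)
    return [combo for combo, n in counts.items() if n == 1]

singletons = unique_combinations(records)
for combo in singletons:
    print("unique combination:", combo)
```

Anyone who knows that Village A has a single doctor could pick out that record and read off the answers to every other question in the file.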

Where does the survey data on the P|E|A|S site originate and where have we obtained it from?

The survey data on this site has come from three sources. Most commonly it is from large surveys commissioned by government departments and carried out either by the Office for National Statistics or by survey organisations. We also include a survey carried out by an academic group and sponsored by the ESRC, and a survey carried out by a health board and funded by the NHS. In all but the last case we have obtained the data from the ESRC UK Data Archive. This service gives bona fide researchers access to a wide range of data, including government surveys that are made available via the Economic and Social Data Service.

Who has access to the survey data and what aspects may be restricted?

The analysts working for the organisations that commissioned a survey generally have access to all the data collected. The exception is individual names and addresses, which it is good practice to hold separately from the survey data.

Secondary analysts need to register with the Data Archive (see above) and sign a guarantee of confidentiality in order to obtain copies of the data. They must also register the use they plan to make of the data in order to get access to specific resources. In some cases the more confidential aspects of the data (e.g. geographic identifiers) may be obtained only with special permission from the organisation that deposited the resource with the Archive. In other cases such confidential information may not be made available via the Archive at all.

Finally, tabular data from the surveys appears in published reports and in internal documents that are available to the general public or to policy makers. In all of these cases the principle above needs to be upheld by one or more of the following measures:

ensuring that the data, either by itself or in conjunction with other sources, could never disclose information about an individual

making minor changes to the data (e.g. to counts in tables) to prevent individuals being identified

obtaining assurances from those with access to the data that no disclosure of individual information will take place

What special conditions apply to the P|E|A|S data?

We want to make it easy for people learning about surveys to use real data, with all the problems this involves. It would have detracted from the usefulness of the P|E|A|S web site if it were necessary to apply to the Data Archive for permissions before trying out the methods. What is more, analysts require information on the survey design to use methods for complex surveys. This may include variables that identify the PSU of a respondent for a clustered sample and/or the stratum for a stratified sample. These data are not always available for surveys deposited in the Data Archive, although they may sometimes be derived from the serial number of the case. Sometimes they are only available in restricted data sets.

We have made data sets of individual records available on the P|E|A|S site, each with a subset of the survey variables. These data can be accessed by anyone who holds the relevant software and finds the web site. Therefore the data (taken by itself) must comply with the same standards of confidentiality that would apply to tables released in published reports. But other considerations apply because of the possibility of linking with data on the Data Archive. We can identify three different types of people who might access the data on the web page:

1. People with no access to data from the Archive, who have thus given no assurance of confidentiality.
2. People with access to the Archive who have obtained copies of the full data sets for the surveys from which we have extracted the data, but who have no access to any special resources for which the depositor's permission is required.
3. People with access to the data sets as above who have also obtained permission from the depositors to access special resources.

For individuals of all three types we need to ensure that the data by itself is not disclosive of individual information. For those of types 2 and 3 we need to ensure that the combination of the data on the web site with the data from the Archive is not more disclosive than the data from the Archive by itself. This might happen, for example, if the web site provided data on the primary sampling units and this, along with the data on the Archive, enabled the identification of individuals and access to other data about them.

How have we taken steps to make the data on this site anonymous?

There are various ways in which the data taken by itself can be disclosive:

1. If a small cell arises which contains someone who is unique in the population (e.g. the village doctor described above).
2. If an individual is unique in the sample and could be identified by their individual characteristics.
3. If a small unit (cluster or stratum) containing only a few respondents can be identified and this unit has a substantial sampling fraction.
4. As condition 3, but where the cluster or stratum has a small sampling fraction.
5. Weights and sampling fractions can reveal the identity of clusters, as has been pointed out by de Waal and Willenborg, if the number of units in the population is known.

Items 1 and 3 above refer to uniqueness in the population, while 2 and 4 refer to uniqueness in the sample. Sample uniqueness will only be disclosive if it is known that a respondent took part in the survey.
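The last route in the list above (weights and sampling fractions) comes down to simple arithmetic. The sketch below assumes, purely for illustration, that each weight is the inverse of the sampling fraction, so that the weights within a cluster sum to the cluster's population size. The area names, population counts, cluster labels and function names are all hypothetical.

```python
# Hypothetical published population counts for small areas,
# assumed to be known to someone trying to identify the clusters.
area_populations = {"Area X": 1200, "Area Y": 450, "Area Z": 3000}

# Hypothetical released records: (anonymised cluster label, survey weight).
released = [("c1", 30.0)] * 15 + [("c2", 100.0)] * 12 + [("c3", 100.0)] * 30

def implied_population(rows):
    """Sum the weights per cluster; with weight = N / n this recovers N."""
    totals = {}
    for cluster, weight in rows:
        totals[cluster] = totals.get(cluster, 0.0) + weight
    return totals

def match_clusters(totals, populations, tolerance=0.5):
    """Pair each cluster with every area whose known population matches."""
    return {
        cluster: [a for a, n in populations.items() if abs(n - total) <= tolerance]
        for cluster, total in totals.items()
    }

totals = implied_population(released)
matches = match_clusters(totals, area_populations)
for cluster, areas in matches.items():
    print(cluster, "implies population", totals[cluster], "matching", areas)
```

Even though the cluster labels carry no geographic meaning, each one matches a single area in this made-up example, which is why weights may need to be perturbed before release.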

We have used the following methods where they were required to prevent this type of disclosure:

adding random noise to the data as a Poisson variable for counts or a normal random variable for continuous data
providing some continuous variables as ranges only
ensuring that categorical variables have no rare classes
limiting the number of variables provided in the data sets to those required for our analyses (usually very few)
removing any identifying information, such as the original numbering, that might identify small clusters or strata
adding random noise to weights and sampling fractions where these might otherwise lead to the identification of clusters
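Several of the methods listed above can be sketched in Python. This is a minimal illustration and not the procedure actually used for the exemplars: the function names, the noise levels, the band width and the minimum class size are all assumptions. In particular, "noise as a Poisson variable" is interpreted here as replacing a count with a Poisson draw whose mean is the original count, which is only one possible reading.

```python
import math
import random
from collections import Counter

rng = random.Random(2024)  # fixed seed so the sketch is repeatable

def poisson_draw(mean, rng):
    """Draw from a Poisson distribution (Knuth's method; suits small means)."""
    limit = math.exp(-mean)
    k, product = 0, rng.random()
    while product > limit:
        k += 1
        product *= rng.random()
    return k

def perturb_count(count, rng):
    """Assumed reading: replace a count by a Poisson draw with that mean."""
    return poisson_draw(count, rng)

def perturb_continuous(value, sd, rng):
    """Add zero-mean normal noise with standard deviation sd."""
    return value + rng.gauss(0.0, sd)

def to_range(value, width):
    """Report a value only as a band: lower bound up to (not including) upper."""
    lower = (value // width) * width
    return f"{lower:g}-{lower + width:g}"

def collapse_rare(values, min_count, other="other"):
    """Merge categories with fewer than min_count cases into one class."""
    counts = Counter(values)
    return [v if counts[v] >= min_count else other for v in values]

noisy_count = perturb_count(5, rng)
noisy_income = perturb_continuous(18250.0, 250.0, rng)
age_band = to_range(37, 10)
tenure = collapse_rare(["own", "own", "rent", "own", "tied"], min_count=2)
print(noisy_count, round(noisy_income, 1), age_band, tenure)
```

How much noise to add, and how wide to make the ranges, is a judgement for each variable: too little leaves unusual cases recognisable, while too much changes the results of the analysis.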

For surveys where we are providing additional information on the web site that is either held as restricted files in the Data Archive, or is not available from the Data Archive at all, we have carried out the following steps to prevent the web data being merged with the files from the Archive:

changed the serial numbers that identify individual cases
added random noise to continuous variables and to the weights, where they take a large number of unique values
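The two steps above can be sketched as follows. The record layout, the field names (serial, weight, income), the noise levels and the function names are hypothetical. The idea is that shuffling the records and issuing fresh serial numbers removes the direct key for a merge, while a little noise on many-valued variables stops them acting as a substitute key.

```python
import random

def rekey(records, rng):
    """Shuffle the record order and assign fresh serial numbers 1..n."""
    shuffled = list(records)
    rng.shuffle(shuffled)
    return [dict(rec, serial=i) for i, rec in enumerate(shuffled, start=1)]

def jitter(value, relative_sd, rng):
    """Multiply a value by (1 + small zero-mean normal noise)."""
    return value * (1.0 + rng.gauss(0.0, relative_sd))

rng = random.Random(11)  # fixed seed so the sketch is repeatable

# Hypothetical archive records; the serial numbers and values are invented.
archive = [
    {"serial": 100234, "weight": 31.72, "income": 18250.0},
    {"serial": 100235, "weight": 28.05, "income": 42100.0},
    {"serial": 100310, "weight": 35.41, "income": 9800.0},
    {"serial": 100311, "weight": 30.96, "income": 27640.0},
]

web_release = [
    dict(rec,
         weight=jitter(rec["weight"], 0.02, rng),
         income=jitter(rec["income"], 0.01, rng))
    for rec in rekey(archive, rng)
]

for rec in web_release:
    print(rec["serial"], round(rec["weight"], 2), round(rec["income"], 1))
```

Shuffling before renumbering matters: if the new serial numbers simply followed the original order, the files could still be lined up row by row.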

For each exemplar we have checked to make sure that these procedures have worked, paying special attention to any population unique cases. We have also made sure that none of these procedures distorts the conclusions from the analysis of the exemplars compared to what would have been obtained from the original data.
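A check of this kind might look like the sketch below, which compares a weighted mean before and after perturbation. The data are simulated and the noise levels are assumptions; the checks made for the exemplars compared the actual analyses, not this single statistic.

```python
import random

def weighted_mean(values, weights):
    """Weighted mean: sum of value times weight over the sum of weights."""
    return sum(v * w for v, w in zip(values, weights)) / sum(weights)

rng = random.Random(7)  # fixed seed so the sketch is repeatable

# Simulated "original" data: 500 incomes with survey weights.
incomes = [rng.uniform(8000, 60000) for _ in range(500)]
weights = [rng.uniform(20, 40) for _ in range(500)]

# Perturbed versions, using assumed noise levels.
noisy_incomes = [v + rng.gauss(0, 500) for v in incomes]
noisy_weights = [w * (1 + rng.gauss(0, 0.02)) for w in weights]

original = weighted_mean(incomes, weights)
released = weighted_mean(noisy_incomes, noisy_weights)
relative_change = abs(released - original) / original
print(f"relative change in weighted mean: {relative_change:.4%}")
```

Because the noise has mean zero and is independent across cases, it largely averages out in summary statistics while still changing each individual record.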