Exploring online research methods - Incorporating TRI-ORM

Web analytics

[Image: web analytics from Google Analytics showing website usage figures, including a graph of site usage with the number of visits and pages viewed]

Case study details

Title: A Case Study into Web Analytics

Author: Andy Phippen

Affiliation: School of Computing, Communications and Electronics, University of Plymouth

 


Aims of the project

Understanding user behaviour online is problematic for the social researcher: traditional techniques such as interviews, surveys or focus groups rely upon users expressing their perceived behaviour rather than their actual behaviour. However, techniques already used throughout the business world to understand customers' online behaviour may offer alternative, complementary techniques that provide the social researcher with a different dataset to examine.

 


Methodological Innovation Used

The infrastructure behind the Internet gives the researcher opportunities to observe user behaviour non-intrusively, as long as the researcher understands the nature of the data presented to them. The business world invests considerable resources in the practice of web analytics: interpreting this data to understand how users behave on their sites, in order to optimise the promotion of products and site layout and therefore maximise profit. However, little has been done within social research to embrace these techniques; I have commented upon this elsewhere (http://erdt.plymouth.ac.uk/mionline/public_html/viewarticle.php?id=43&layout=html).

This case study presents a brief introduction to web analytics and considers their application to an oral history website, Mediterranean Voices (see http://www.iictd.org/medvoices/londonevent/about_the_project.cfm for project information). It suggests that, with appropriate technical methods, it is possible to understand specific user behaviour related to the website's aims and even, one might hypothesise, cultural behaviour.

What Happens When You Click a Link?

The user's perception of clicking a link within a webpage is that the requested resource is retrieved and displayed in their browser. What is less well known is that the web server (the computer upon which the resources reside) makes a note of the request it has just processed. This happens for every single request to the server, and each is recorded in a logfile: an archive of all of the requests that the server has had to process. Each entry in the logfile looks something like the following (logfile formats differ between servers, but all have some standard fields within them):

81.215.192.241, -, 30/04/2004, 09:49:03, W3SVC9, WEBSERVER, 192.168.2.53, 1002, 537, 9774, 200, 0, GET, /pages/search/showresource.aspx, id=1177&lang=0

Figure 1: An example logfile entry from the Med-Voices server

Obviously, on first observation, this information looks somewhat technical and of little benefit in observing behaviour! However, within this string of technical information lie some very useful units:

  • IP address (81.215.192.241): the first element within the line above (the four numbers separated by periods) provides the address of the computer requesting the information (the computer at which the user sits). The IP address is the core locating mechanism within the protocols that enable computers to talk to each other across the Internet. More importantly for our purposes, it identifies an individual request and its location. IP address ranges are assigned at a country and corporate level (for example, any computer within the University of Plymouth network will start 141.163….) and can therefore be used as a crude locating mechanism that works very well at a country level.
  • Time and date of request (30/04/2004, 09:49:03)
  • Name of the resource requested (/pages/search/showresource.aspx): the page or resource requested by the remote computer
  • Request arguments (id=1177&lang=0): a technical term meaning any parameters passed with the request. Generally these are values used by the server to further refine the request. For example, if we look at a website such as Google, all searches will request the http://www.google.com/search resource. However, the server needs further information to know which specific request to process. So, once we add a search term to Google (in this case “analytics”), something like http://www.google.co.uk/search?hl=en&q=analytics&meta= will appear in the address bar of the browser.
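The fields listed above can be pulled out of a logfile entry programmatically. The following is a minimal Python sketch, not the code used in the project: the function name parse_entry and the field positions are assumptions based solely on the single entry shown in Figure 1, and other servers will order their fields differently.

```python
from datetime import datetime
from urllib.parse import parse_qs

def parse_entry(line):
    """Split one comma-separated logfile entry into the fields discussed above.

    The index positions are an assumption taken from the Figure 1 example.
    """
    fields = [field.strip() for field in line.split(",")]
    return {
        "ip": fields[0],
        # The example date is day/month/year (30/04/2004).
        "timestamp": datetime.strptime(
            fields[2] + " " + fields[3], "%d/%m/%Y %H:%M:%S"
        ),
        "resource": fields[13],
        # parse_qs returns a list per key; keep the first value of each.
        "args": {key: values[0] for key, values in parse_qs(fields[14]).items()},
    }

entry = parse_entry(
    "81.215.192.241, -, 30/04/2004, 09:49:03, W3SVC9, WEBSERVER, 192.168.2.53, "
    "1002, 537, 9774, 200, 0, GET, /pages/search/showresource.aspx, id=1177&lang=0"
)
print(entry["ip"], entry["timestamp"], entry["resource"], entry["args"])
```

Keeping the arguments as a dictionary makes it straightforward to look up values such as id or lang later, when the logfile is combined with other data.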

At a very basic level, the most fundamental metrics are those such as hits and page views. A hit is a single request entry in a log file; a page view is a request asking for a web page. Arguably, one can also identify session, or clickstream, information (the pages a single user views in a single website visit) by aggregating all requests from a specific IP address on a single date within a finite timeframe. However, while these basic metrics have some use, it is when such information is combined with more site-specific data that the information one can glean becomes more useful. For example, if we resolve the location of the IP address, we have an approximate location for the user, and therefore for their clickstream. And if we know the nature of the resource requested, and how the arguments passed modify the nature of the resource (for example, looking at a particular item within an oral history database), we start to build a far richer picture of online interests and behaviour.
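As an illustration of these basic metrics, the sketch below counts hits and page views and groups requests into sessions. It is a sketch under stated assumptions rather than a standard definition: the list of page extensions, the 30-minute gap used as the "finite timeframe", the function names and the sample requests (including the image path) are all hypothetical.

```python
from datetime import datetime, timedelta

PAGE_EXTENSIONS = (".aspx", ".htm", ".html")  # assumption: what counts as a page
SESSION_GAP = timedelta(minutes=30)           # assumption: the "finite timeframe"

def is_page_view(resource):
    """A page view is a hit whose resource looks like a web page."""
    return resource.lower().endswith(PAGE_EXTENSIONS)

def sessions(requests):
    """Group (ip, timestamp, resource) tuples into per-IP sessions.

    A new session starts when an IP address has been silent for longer
    than SESSION_GAP.
    """
    result = []
    last_seen = {}  # ip -> (time of last request, that IP's open session)
    for ip, timestamp, resource in sorted(requests, key=lambda r: r[1]):
        previous = last_seen.get(ip)
        if previous is None or timestamp - previous[0] > SESSION_GAP:
            current = []
            result.append((ip, current))
        else:
            current = previous[1]
        current.append(resource)
        last_seen[ip] = (timestamp, current)
    return result

# Hypothetical requests, loosely modelled on the Figure 1 entry
start = datetime(2004, 4, 30, 9, 49, 3)
requests = [
    ("81.215.192.241", start, "/pages/search/showresource.aspx"),
    ("81.215.192.241", start + timedelta(seconds=1), "/images/logo.gif"),
    ("81.215.192.241", start + timedelta(minutes=5), "/pages/search/showresource.aspx"),
    ("81.215.192.241", start + timedelta(hours=3), "/pages/search/showresource.aspx"),
]

hits = len(requests)
page_views = sum(is_page_view(resource) for _, _, resource in requests)
visits = sessions(requests)
print(hits, page_views, len(visits))
```

With this sample data the image request counts as a hit but not as a page view, and the request made three hours later starts a second session.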

 


Understanding User Behaviour on Mediterranean Voices

The Mediterranean Voices project is an ongoing project, initially funded by the European Union, to collect an oral history of cultures in the southern Mediterranean. The histories were recorded using a number of different media (image, audio, video) and were coded according to location and “themes”: cross-cutting definitions of common practices within cultures (work, play, worship, etc.). Each resource had additional textual commentary added to it in both the local language and English, and researchers also had the option to tag resources they considered related to the one they were currently coding. Once this coding was complete, the resources were stored in a web-based archive. The project aimed to demonstrate cross-cultural similarities within the region and to encourage the exploration of resources outside of a specific location. For further information on the locations, themes and aims of the project, visit http://www.med-voices.org.

The application of analytics to the study of user behaviour within the Med-Voices site combined data from web logs and the resource database to enable an examination of individuals' explorations of the archive. IP resolution software (i.e. a piece of code that takes an IP address and provides an approximate location) was also used to examine which regions were interested in the project, and what they examined.
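To give a sense of what IP resolution involves, the following Python sketch matches an address against a small table of address ranges. It is illustrative only and is not the software used in the project: the function name resolve_country is hypothetical, only the 141.163 prefix comes from the text above, and the second range is a reserved documentation block standing in for real registry data.

```python
import ipaddress

# Hypothetical lookup table. Real IP resolution relies on large, regularly
# updated range databases; these two entries are placeholders.
RANGES = [
    (ipaddress.ip_network("141.163.0.0/16"), "UK"),    # University of Plymouth prefix
    (ipaddress.ip_network("192.0.2.0/24"), "Cyprus"),  # invented mapping for illustration
]

def resolve_country(ip):
    """Return the country whose range contains the address, or 'unknown'."""
    address = ipaddress.ip_address(ip)
    for network, country in RANGES:
        if address in network:
            return country
    return "unknown"

print(resolve_country("141.163.10.5"))
print(resolve_country("81.215.192.241"))
```

An address that falls outside every known range is reported as "unknown" rather than guessed, which keeps the crude nature of the technique visible in the results.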

The logfile data used for the experiment covered 1231 unique visits to the site, comprising in total 4672 resources viewed. The first experiment combined the data sources to consider the proportion of resources viewed by users in each of the locations represented in the project, to give an indication of the regions of interest from specific countries. This would hint at the interests of individuals within these locations. It grouped clickstream information by IP address group (to group IP addresses by country), and then decomposed the sessions by identifying requests to the “showresource.aspx” page (the page that displayed a given archive resource) and the arguments passed to that request (which would allow us to discover the origin of the archive resource being viewed). Table 1 provides a summary of this experiment, showing that some locations were more insular than others. The volume of sessions is such that this cannot be taken as authoritative without further investigation (for example, a single session from Italy cannot be taken as a general view of the interests of visitors from Italy!), but it does provide an interesting example of what we can learn from analytics techniques.

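Rows show the user location; columns show the resource location.

| User location | Valletta (Malta) | Alexandria (Egypt) | Mallorca (Spain) | London (UK) | Chania (Greece) | Nicosia (Cyprus) | Ancona (Italy) | Marseilles (France) | Granada (Spain) | Bethlehem (Palestine) | Istanbul (Turkey) |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Cyprus | 17.28 | 2.47 | 9.88 | 12.35 | 9.88 | 30.86 | 2.47 | 14.81 | 0 | 0 | 0 |
| Spain | 1.4 | | 3.15 | 3.5 | 1.4 | | 0 | 5.59 | 71 | 4.2 | 9.79 |
| France | 0.78 | 0.23 | 1.01 | 1.4 | 3.26 | 0.62 | 0.39 | 84.26 | 2.87 | 3.8 | 1.4 |
| UK | 4.77 | 2.27 | 0.68 | 9.09 | 18.64 | 0.91 | 1.35 | 23.64 | 22.27 | 9.55 | 8.18 |
| Greece | 0 | 0 | 0 | 0 | 59.46 | 0 | 1.35 | 31.08 | 8.11 | 0 | 0 |
| Italy | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 40 | 0 | 0 | 60 |
| Malta | 100 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| Palestine | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 4 | 0 | 96 | 0 |
| Turkey | 1.71 | 0 | 0 | 6.84 | 4.27 | 0.85 | 0.21 | 25.64 | 4.91 | 1.71 | 53.85 |

Table 1: % Resources viewed vs. location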
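A cross-tabulation of this kind can be produced once each resource view has been labelled with a user country and a resource location. The Python sketch below shows one way to do so; the function name resource_share and the sample counts are invented for illustration and are not the project's data.

```python
from collections import Counter, defaultdict

def resource_share(views):
    """Turn (user_country, resource_location) pairs into row percentages."""
    counts = defaultdict(Counter)
    for user_country, resource_location in views:
        counts[user_country][resource_location] += 1

    table = {}
    for user_country, by_location in counts.items():
        total = sum(by_location.values())
        table[user_country] = {
            location: round(100 * n / total, 2)
            for location, n in by_location.items()
        }
    return table

# Invented counts: five views from one country, split across two locations
views = [("Italy", "Marseilles")] * 2 + [("Italy", "Istanbul")] * 3
shares = resource_share(views)
print(shares)
```

With only five views behind the row, the percentages look precise but say very little, which is the caution raised above about small session volumes.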

A second experiment examined the preference for viewing resource descriptions either in the local language or English, compared to the location of the user. This was possible through the isolation of the argument in the request that stated the preferred language for the resource. The results displayed in Table 2 again show indicative results rather than definitive statements regarding regional interests in language.


| User location | Local % | English % |
|---|---|---|
| Cyprus | 46.51 | 53.49 |
| Spain | 39.44 | 60.56 |
| France | 42.69 | 57.31 |
| UK | 23.63 | 76.37 |
| Greece | 26.67 | 73.33 |
| Lebanon | 15.63 | 84.38 |
| Malta | 33.33 | 66.67 |
| Turkey | 13.51 | 86.49 |
| Palestine | 0.00 | 100.00 |

Table 2: Language in which resource was viewed, compared to location
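The language split can be derived from the language argument carried by each request. The following Python sketch assumes that the argument is named lang, as in Figure 1; the mapping of its values to "local" and "english" is hypothetical, since the meaning of each value is not stated here, and the sample requests are invented.

```python
from collections import defaultdict

# Assumption: the meaning of each "lang" value is not documented in this
# case study, so this mapping is hypothetical.
LANG_LABELS = {"0": "local", "1": "english"}

def language_split(requests):
    """requests: (user_country, args) pairs for showresource.aspx requests."""
    counts = defaultdict(lambda: {"local": 0, "english": 0})
    for country, args in requests:
        label = LANG_LABELS.get(args.get("lang"))
        if label is not None:  # skip requests without a recognised value
            counts[country][label] += 1
    return {
        country: {
            label: round(100 * n / sum(by_label.values()), 2)
            for label, n in by_label.items()
        }
        for country, by_label in counts.items()
    }

# Invented sample: three resource views from one country
sample = [
    ("Malta", {"id": "1177", "lang": "0"}),
    ("Malta", {"id": "1178", "lang": "1"}),
    ("Malta", {"id": "1179", "lang": "1"}),
]
split = language_split(sample)
print(split)
```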

Finally, we examined the use of themes as a technique for encouraging exploration across locations. For this examination, specific clickstreams were examined and the path of exploration through the resources was followed. A number of sample clickstreams are illustrated in Figure 2 below, showing the number of resources viewed within a specific location and theme, and the route travelled from each location:

Clickstream 1 (location: Spain)
Bethlehem (The Person) 2 resources - Granada (Worship) 3 resources - Bethlehem (Worship) 1 resource - Bethlehem (Spaces)

Clickstream 2 (location: Spain)
Marseilles (Spaces) 2 resources - Bethlehem (The Person) 2 resources - Granada (The Person) 3 resources - Marseilles (The Person) 1 resource - Marseilles (Work) 4 resources - Granada (Work) 1 resource

Clickstream 3 (location: UK)
Granada (Worship) 1 resource - Bethlehem (The Person) 2 resources - Granada (The Person) 1 resource - Granada (Play) 3 resources - Chania (Play) 1 resource - Chania (Spaces) 6 resources

Clickstream 4 (location: France)
Marseilles (Spaces) 4 resources - Marseilles (Worship) 5 resources - Marseilles (Objects) 3 resources - Marseilles (Spaces) 3 resources

Figure 2: Sample Med-Voices clickstreams
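Summaries in the style of Figure 2 can be produced by collapsing consecutive resource views that share a location and theme. The Python sketch below is one possible approach, not the method used in the project; the function names are hypothetical, and the sample sequence is constructed to match Clickstream 4.

```python
from itertools import groupby

def summarise_clickstream(resources):
    """Collapse an ordered list of (location, theme) views into runs.

    Each run is (location, theme, number of consecutive resources viewed).
    """
    return [
        (location, theme, sum(1 for _ in run))
        for (location, theme), run in groupby(resources)
    ]

def crosses_locations(summary):
    """True if the clickstream visits resources from more than one location."""
    return len({location for location, _, _ in summary}) > 1

# One (location, theme) pair per resource viewed, in order
views = (
    [("Marseilles", "Spaces")] * 4
    + [("Marseilles", "Worship")] * 5
    + [("Marseilles", "Objects")] * 3
    + [("Marseilles", "Spaces")] * 3
)
summary = summarise_clickstream(views)
print(summary)
print(crosses_locations(summary))
```

Because only consecutive views are grouped, a return to an earlier location and theme appears as a separate step, which preserves the route taken through the archive.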

 


Final Discussion

The use of analytics is still very much in its infancy within social research, but it has the potential to complement existing techniques for understanding social behaviour online. There are acknowledged drawbacks: for example, the technical knowledge required to implement such analysis, and the imprecise nature of the IP address as a user identifier (although further techniques, outside the scope of this paper, can focus the clickstream onto an individual rather than a machine). Certainly I would not propose that the techniques are immediately accessible to every researcher, but I would hope that such results encourage further dialogue between technologists and sociologists in realising that their interests can, sometimes, converge. The techniques also offer non-intrusive approaches to understanding user behaviour, and so are not subject to the Hawthorne Effect: the often observed phenomenon of individuals modifying their behaviour when they are aware that they are the subject of research. It has been observed in a variety of social research methods, but was first described in a series of experiments at the Hawthorne plant of Western Electric, USA, in the 1920s. As such, these techniques offer the opportunity to add validity to traditional techniques.

One final acknowledgement should be made regarding the ethical use of such techniques within research. I have had many discussions about whether the use of such data is a privacy infringement. Indeed, recent news articles about Google's use of users' data for prolonged periods of time (http://news.bbc.co.uk/1/hi/technology/6692063.stm) suggest that there is growing public mistrust about what websites do with 'their' data. In the case of basic log file analytics, my own view is that the IP address represents the computer, not the user, and the details stored within a web server are not related to personal information. However, it is unquestionable that some techniques within the analytics world, which I have not discussed in detail in this article, do have implications for privacy and lie in ethical grey areas.

The best approach for the researcher is to be open about the data collection. Website users are becoming familiar with viewing privacy policies within a website, and it is entirely appropriate to include a privacy policy within a social website, clearly stating how IP addresses, cookies, etc. will be stored and analysed. Another useful indicator of openness is to include contact details, so that a user who wishes to examine the use of the data in more detail can do so. However, exposing an email address on a website opens the researcher up to mail harvesting and spamming, so I would suggest a postal address rather than an electronic one. While such a policy may dissuade some people from using the site, the people who take the time to inspect such a document (who, in my experience of examining logfiles to see who has retrieved the page, will only be a minority) will have clear reassurance about the use of their data.

 


Author of this page: Andy Phippen - Year of publication: 2007 - Affiliation: University of Plymouth