ECONOMIC & SOCIAL RESEARCH COUNCIL

Data Preparation

Splitting & Merging Text Files

Splitting Files

Many databases for media research, such as LEXIS/NEXIS or Sociological Abstracts, create files that contain large numbers of records. To organize these records in qualitative software packages, each record should be stored in its own separate file. The result files from these databases therefore need to be split at the start of each record. In UNIX, this can be done with the "csplit" command, which can also be installed on Windows.

Apple Program

Apple users can resort to Dejal TextSplitter.

csplit (for Windows and UNIX)

Installation of csplit (Windows only)

A variety of simple text-manipulation utilities comes standard with most UNIX releases. These useful utilities are also available for Windows, but, alas, they have retained the non-intuitive installation routines and command-line interface of UNIX. Here are step-by-step instructions for installing them under Windows 9x/ME/2000/XP:

  1. Make sure that Internet Explorer 4.0 or later is installed.
  2. Download and run libiconv-1.8-1-bin.exe.
  3. Download and run libintl-0.11.5-2-bin.exe.
  4. Download the files highlighted in pink on the SourceForge page.
  5. Unzip these files (e.g., with WinZip) into the directory that contains the GnuWin32 files on your hard disk. Typical locations of this directory are:
    • C:\Program Files\GnuWin32\
    • D:\Program Files\GnuWin32\
    • C:\Programme\GnuWin32\
    • D:\Programme\GnuWin32\

You are done! You have installed the GnuWin32 Text Utilities. Check out their manual to make full use of these utilities, which unfortunately can only be accessed through the "Start -> Run" interface in Windows.
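
To confirm that the unzipped files ended up where expected, the typical locations listed above can be checked programmatically. The following is only a minimal Python sketch: the function name find_csplit is hypothetical, and the "bin" subfolder is assumed from the command line used in the usage example.

```python
from pathlib import Path

# The typical GnuWin32 locations listed in the installation steps.
TYPICAL_DIRS = [
    r"C:\Program Files\GnuWin32",
    r"D:\Program Files\GnuWin32",
    r"C:\Programme\GnuWin32",
    r"D:\Programme\GnuWin32",
]


def find_csplit(dirs=TYPICAL_DIRS):
    """Return the first existing bin/csplit.exe under dirs, or None."""
    for directory in dirs:
        # Assumption: the executable sits in the "bin" subfolder.
        candidate = Path(directory) / "bin" / "csplit.exe"
        if candidate.is_file():
            return candidate
    return None
```

If find_csplit() returns None, the utilities were probably unzipped into a different directory than the ones listed.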

Usage of csplit

Here is an example of how to use the csplit utility. Suppose you want to split a LEXIS/NEXIS result file into a set of subfiles, each of which contains a single article record. This is what you need to do:

  1. Download the result file as an "ASCII file". If you downloaded the default RTF file, use your word processor (e.g., Word or WordPerfect) to convert it to ASCII by saving it as "text only (.txt)".
  2. Check how many records are in the file. Let us call that number n.
  3. Open the file in a word processor (or a text editor) and prefix the total number of documents in the record delimiter (e.g., "1 of 864 Documents") with a string that does not occur in any of the articles, for instance "qq". That is, in the current example you would replace "864 Documents" by "qq864 Documents". (This replacement is needed because otherwise you run the risk of chopping up records whose contents happen to contain the total number of records.)
  4. Run

    "C:\Program Files\GnuWin32\bin\csplit.exe" C:\split\search.txt /qqn/ {*}

    in the "Start -> Run" box, where C:\split\search.txt is the full path of the LEXIS/NEXIS file you want to split and n is the number of articles in the result file (or any other string that might serve as a delimiter for records).

    NB: Make sure that neither the path nor the name of the file you want to split contains any spaces! Also, use spaces and quotation marks exactly as they are used in the command line above.

    On UNIX, the corresponding command line (run in the directory that contains search.txt) is:

    csplit search.txt /qqn/ '{*}'

  5. You will find the separate files in the folder into which csplit.exe was installed (e.g., "C:\Program Files\GnuWin32\bin\"). Their names consist of xx and a number. Use RenameFiles to give them more sensible names.
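
Where csplit is not available, the same splitting can be sketched in a few lines of Python. This is an illustration under stated assumptions, not a replacement for the procedure above: it assumes that every record starts with a line of its own in the form "1 of 864 Documents", and the names count_records, split_records and write_records are hypothetical. Because the pattern must match a whole line, the "qq" replacement from step 3 is not needed in this variant.

```python
import re
from pathlib import Path

# Assumption: each record begins with a delimiter line of its own,
# such as "1 of 864 Documents".
DELIMITER = re.compile(r"^[ \t]*\d+ of (\d+) Documents[ \t\r]*$", re.MULTILINE)


def count_records(text):
    """Return the total stated in the first delimiter line (0 if none)."""
    match = DELIMITER.search(text)
    return int(match.group(1)) if match else 0


def split_records(text):
    """Cut text at the start of every delimiter line."""
    starts = [m.start() for m in DELIMITER.finditer(text)]
    bounds = [0] + starts + [len(text)]
    chunks = [text[a:b] for a, b in zip(bounds, bounds[1:])]
    # Drop empty pieces, e.g. when the file begins with a delimiter.
    return [chunk for chunk in chunks if chunk.strip()]


def write_records(chunks, out_dir, prefix="record"):
    """Write each chunk to out_dir as record_0000.txt, record_0001.txt, ..."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    paths = []
    for number, chunk in enumerate(chunks):
        path = out_dir / f"{prefix}_{number:04d}.txt"
        path.write_text(chunk, encoding="utf-8")
        paths.append(path)
    return paths
```

A typical call would read the result file with Path("search.txt").read_text(), pass the text to split_records, and hand the resulting list to write_records together with an output folder. Any text before the first delimiter line ends up in the first file, much as csplit places it in its first output file.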

Merging Text Files

At times, it is also convenient to merge certain text files into a single data file. For this purpose, besides the command-line Uumerge utility for UNIX derivatives (a free Windows version is available), a very comfortable GUI program for Windows™ exists.
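
For simple cases, merging can also be sketched in Python. The function below performs plain concatenation in the order given, which may differ from what Uumerge or the GUI program do; the name merge_text_files is hypothetical, and UTF-8 encoding is an assumption.

```python
from pathlib import Path


def merge_text_files(paths, destination):
    """Concatenate the files in paths, in the given order, into destination."""
    with open(destination, "w", encoding="utf-8") as merged:
        for path in paths:
            text = Path(path).read_text(encoding="utf-8")
            merged.write(text)
            # Make sure the next file starts on a fresh line.
            if text and not text.endswith("\n"):
                merged.write("\n")
    return Path(destination)
```

Passing sorted(Path("records").glob("*.txt")) as paths merges all text files of a folder in alphabetical order.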
