In the past, the various language-engineering research resources created by our team have been scattered across different sites, not all under our control, and there have been problems when some of them have been shifted without my being notified of the fact. I apologize to users for any frustration and inconvenience.
I now have my own domain name, which I intend to maintain indefinitely; in future the most up-to-date versions of the research resources should always be stored in locations under my control and should be accessible by following the downloadable research resources link from my www.grsampson.net home page.
Note that this address should never change, but the URLs of the resources themselves may well change for various practical reasons. Anyone referring people to the location of these materials should quote this address (www.grsampson.net/Resources.html) – not the current addresses of the respective resources.
The following resources are currently available. The links here are to web pages describing the resources; lower on this page are links allowing you to download the resources themselves.
- SUSANNE Corpus, Release 5
Note that Release 5 of SUSANNE, completed in August 2000, is substantially revised from the previous release, which was circulated by the Oxford Text Archive.
Also given below is a link to download the SEMiSUSANNE Corpus, developed by Christopher Powell of the Ashmolean Museum, Oxford University, who has kindly permitted me to distribute it from my site. SEMiSUSANNE, completed in 2006, supplements the grammatical annotations of SUSANNE with semantic annotations identifying the WordNet senses in which vocabulary items are used. It covers 33 of the 64 SUSANNE texts.
- CHRISTINE Corpus, Release 2
The second release of CHRISTINE, which became available in August 2000, incorporates a minor change in the distribution of analytic information between the fields, to make it more compatible with SUSANNE and easier to read.
- LUCY Corpus, Release 2
Release 2 of the LUCY Corpus, circulated in December 2005, corrects a number of errors in the initial release of 2003.
Note that both the CHRISTINE and LUCY Corpora have a feature relating to filename conventions which, with hindsight, I regret. They were developed in a Unix environment, where proprietary file formats have no great significance, and consequently I felt free to devise my own system of classifying corpus files using “dot-suffixes”. Many researchers nowadays, though, work in Windows and other computing environments where adherence to standards for file-format identification is important. When I find the time, I intend to produce new versions of these corpora using different, unobjectionable filename conventions. Meanwhile, I suggest that those who use these resources in non-Unix environments begin by changing the filenames, say by replacing full stops with hyphens.
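For anyone who would rather not rename the files by hand, the change suggested above can be automated. The following is a minimal Python sketch, assuming the corpus files have been unpacked into a single directory. The function names and the `dry_run` parameter are purely illustrative and are not part of any corpus distribution.

```python
from pathlib import Path


def hyphenate_name(name: str) -> str:
    """Replace every full stop in a filename with a hyphen."""
    return name.replace(".", "-")


def rename_corpus_files(directory: str, dry_run: bool = True) -> list[tuple[str, str]]:
    """Rename files in `directory` whose names contain full stops.

    Returns a list of (old_name, new_name) pairs. With dry_run=True
    (the default) the pairs are only reported and nothing is renamed.
    """
    changes = []
    for path in sorted(Path(directory).iterdir()):
        # Leave subdirectories, hidden files, and dot-free names alone.
        if not path.is_file() or path.name.startswith(".") or "." not in path.name:
            continue
        target = path.with_name(hyphenate_name(path.name))
        if target.exists():
            raise FileExistsError(f"refusing to overwrite {target}")
        changes.append((path.name, target.name))
        if not dry_run:
            path.rename(target)
    return changes
```

Calling `rename_corpus_files(some_directory)` first, and inspecting the pairs it returns, shows what would change; passing `dry_run=False` then performs the renaming. Working on a copy of the unpacked files is a sensible precaution.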
- XML versions of the above
The SUSANNE, CHRISTINE, and LUCY treebank resources are encoded in a traditional record-and-field format. However, Olga Pustylnikov of the University of Bielefeld has transposed the data into XML-based formats. Since 2006–07 these alternative versions of the treebanks have been freely available for downloading from her project website (not from my site).
- Simple Good–Turing frequency estimation software
This software is available in C, in C++, and in Perl. The current release of the C version, which I completed in 2000, eliminates a minor bug present in the initial release; it is available via the link lower on this page. The original Perl implementation was produced by Tibor Kiss and André Halama of the Ruhr-Universität Bochum and released in 2004; they have kindly permitted me to distribute it from this site – again, use the link below. In 2007 Kiss and Halama’s Bochum colleagues Florian Doemges and Björn Wilmsmann produced a modified Perl version which can be downloaded from CPAN. The C++ version was produced by David Elworthy of Google, Inc., and is available from his site.
- leaf-ancestor assessment software
A C program implementing the leaf-ancestor metric for parse accuracy, as described by Sampson & Babarczy in J. of Natural Language Engineering vol. 9 pp. 365–80, 2003. My original program has been available online since 2005. In 2006, Derrick Higgins of the Educational Testing Service, Princeton, New Jersey, produced an improved version which he has kindly permitted me to distribute. Either version can be downloaded via the link below; I recommend the Higgins version.
So far as I am concerned, anyone is welcome to take copies of these resources and to use them for any purpose; and as far as I am able to check, I am legally entitled to make that offer. (If this is not legally watertight enough for you, you will have to go into the legalities yourself.) Naturally, if you do anything public with some of these materials, Sussex University and I would appreciate an acknowledgement (and, in the case of SUSANNE, CHRISTINE, and LUCY, so would the Economic and Social Research Council (UK), which sponsored their creation).
If you wish to read or print out the documentation files for SUSANNE, CHRISTINE, or LUCY Corpora, rather than download the data files themselves, these are available as large Web pages (respectively 12,000, 36,000, and 13,000 words) from the following links. The SEMiSUSANNE documentation file is quite short.
SUSANNE documentation file
SEMiSUSANNE documentation file
CHRISTINE documentation file
LUCY documentation file
The SUSANNE, CHRISTINE, and LUCY data files themselves, together with one version of the Good-Turing software, are published as compressed tar files; and all these files are distributed by anonymous ftp. Consequently, downloading any of these resources requires the ability to use ftp software. In the case of the larger files, you will also need to use tar and uncompression software; and to use the Simple Good-Turing or leaf-ancestor assessment programs you will need to use a C compiler (or alternatively, in the former case, a Perl interpreter). If you don’t know what these terms mean, we respectfully suggest that you proceed no further unaided; perhaps you could consult someone in your local environment who has the appropriate technical expertise, or read a guide to those aspects of the internet which go beyond the World Wide Web. Those who are confident with the relevant technology are very welcome to initiate the downloading process via the appropriate one of the following links:
There are no special problems about using the relevant software to obtain these resources, but you will need to understand what you are doing.
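For those comfortable with the technology, the fetch-and-unpack steps described above can also be carried out from a short script rather than with separate ftp and tar programs. The following Python sketch is only an illustration under stated assumptions: the host name and file paths are placeholders to be taken from the download links, the function names are my own invention, and the archive is assumed to be gzip-compressed (Python's `tarfile` module does not read files produced by the older Unix `compress` utility).

```python
import tarfile
from ftplib import FTP
from pathlib import Path


def fetch_anonymous_ftp(host: str, remote_path: str, local_path: str) -> None:
    """Download one file by anonymous ftp (host and paths are placeholders)."""
    with FTP(host) as ftp, open(local_path, "wb") as out:
        ftp.login()  # no arguments means an anonymous login
        ftp.retrbinary("RETR " + remote_path, out.write)


def unpack_archive(archive: str, dest: str) -> list[str]:
    """Unpack a compressed tar archive into `dest`; return the member names."""
    dest_dir = Path(dest).resolve()
    dest_dir.mkdir(parents=True, exist_ok=True)
    # Mode "r:*" lets tarfile detect gzip, bzip2, or xz compression itself.
    with tarfile.open(archive, "r:*") as tar:
        members = tar.getmembers()
        for member in members:
            target = (dest_dir / member.name).resolve()
            # Refuse any member that would be written outside dest_dir.
            if target != dest_dir and dest_dir not in target.parents:
                raise ValueError(f"unsafe path in archive: {member.name}")
        tar.extractall(dest_dir, members=members)
        return [m.name for m in members]


# Hypothetical usage (replace the placeholders with values from the links):
#   fetch_anonymous_ftp("ftp.example.org", "pub/resource.tgz", "resource.tgz")
#   print(unpack_archive("resource.tgz", "resource"))
```

The check on each member's path is there because a tar archive from any source can in principle contain paths pointing outside the destination directory; it is a general precaution, not a comment on these particular files.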
If any user finds errors in the data files or bugs in the code, I should be very grateful to be notified with details; if your information leads to improved versions, you will be acknowledged by name in the documentation of those versions, in the same way as I have already done with people who kindly pointed out flaws in earlier releases.
Likewise, if someone has problems in downloading or uncompressing the material and is familiar enough with the technology to be confident that the problem lies at the Sussex end, I would be very glad to be told so that I can sort the problem out. (I hope that situation will not arise, but one can never be sure.) On the other hand, with queries along the lines “Help, I’ve never used gunzip before, what’s all this stuff in my computer?”, in principle I wish we could assist but in practice I’m sorry, our research team really has not got the time.
It is envisaged that additional resources, and enlarged or upgraded versions of those already listed, will be added from time to time. The access route will always be from www.grsampson.net via a link labelled downloadable research resources.
last changed 9 Mar 2007