Geoffrey Sampson



Downloadable Research Resources

In the past, the various language-engineering research resources created by our team have been scattered across different sites, not all under our control, and there have been problems when some of them have been shifted without my being notified of the fact. I apologize to users for any frustration and inconvenience.

I now have my own domain name, which I intend to maintain indefinitely; in future the most up-to-date versions of the research resources should always be stored in locations under my control and should be accessible by following the downloadable research resources link from my www.grsampson.net home page.

Note that this address should never change, but the URLs of the resources themselves may well change for various practical reasons at any time. Anyone referring people to the location of these materials should quote this address (www.grsampson.net/Resources.html), not the current addresses of the respective resources.

The following resources are currently available. In this list, the resource names link to background descriptions addressed to a general readership; links to technical documentation and to the resources themselves are included in the indented material below the names.

Note that the SUSANNE, CHRISTINE, and LUCY data files are published as compressed tar files; and all the corpus files are distributed by anonymous ftp. Consequently, downloading any of these resources requires the ability to use ftp software. In the case of the larger files, you will also need to use tar and uncompression software; and to use the Simple Good–Turing or leaf-ancestor assessment programs you will need to use a C compiler. If you don’t know what these terms mean, I respectfully suggest that you proceed no further unaided; perhaps you could consult someone in your local environment who has the appropriate technical expertise, or read a guide to those aspects of the internet which go beyond the World Wide Web.
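For readers who prefer a script to the command-line tar tool, the unpacking step might be sketched in Python as follows. This is only an illustration: the function name unpack_archive and the file names in the usage note are hypothetical, and the sketch assumes an archive compressed with gzip, bzip2 or xz, which Python's tarfile module detects automatically. An archive made with the older Unix compress tool would need to be uncompressed separately first.

```python
import os
import tarfile


def unpack_archive(archive_path, dest_dir):
    """Unpack a compressed tar archive into dest_dir; return the member names.

    Assumption: the archive uses gzip, bzip2 or xz compression, which
    mode "r:*" detects automatically.
    """
    os.makedirs(dest_dir, exist_ok=True)
    with tarfile.open(archive_path, mode="r:*") as archive:
        names = archive.getnames()
        if hasattr(tarfile, "data_filter"):
            # Newer Python versions can refuse unsafe members
            # (absolute paths, links pointing outside dest_dir).
            archive.extractall(dest_dir, filter="data")
        else:
            archive.extractall(dest_dir)
    return names
```

A call such as unpack_archive("corpus.tgz", "corpus") would then unpack a downloaded archive into a directory called corpus; both names here are placeholders, not the actual names of the distributed files.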

SUSANNE Corpus, Release 5

Release 5 of SUSANNE, completed in August 2000, is substantially revised from the previous release, which was circulated by the Oxford Text Archive.

Documentation file here; link to data files here.

SEMiSUSANNE Corpus

The SEMiSUSANNE Corpus was developed in 2006 by Christopher Powell of the Ashmolean Museum, Oxford University, who has kindly permitted me to distribute it from my site. SEMiSUSANNE supplements the grammatical annotations of SUSANNE with semantic annotations identifying the WordNet senses in which vocabulary items are used. It covers 33 of the 64 SUSANNE texts.

Documentation file here; link to data files here.

CHRISTINE Corpus, Release 2

The second release of CHRISTINE, which became available in August 2000, incorporates a minor change in the distribution of analytic information between the fields, to make it more compatible with SUSANNE and easier to read.

Documentation file here; link to data files here.

LUCY Corpus, Release 2

Release 2 of the LUCY Corpus, circulated in December 2005, corrects a number of errors in the initial release of 2003.

Documentation file here; link to data files here.

— Note that both CHRISTINE and LUCY Corpora have a feature relating to filename conventions which, with hindsight, I regret. They were developed in a Unix environment, where proprietary file formats have no great significance, and consequently I felt free to devise my own system of classifying corpus files using “dot-suffixes”. Many researchers nowadays, though, work in Windows and other computing environments where adherence to standards for file-format identification is important. When I find the time, I intend to produce new versions of these corpora using different, unobjectionable filename conventions. Meanwhile, I suggest that those who use these resources in non-Unix environments begin by changing the filenames, say by replacing full stops with hyphens.
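The renaming suggested above could be done by hand, but a short script saves effort when there are many files. The following Python sketch is one possible approach, not part of the corpus distributions; the function names are hypothetical, and it assumes that all the corpus files have been unpacked into a single directory. By default it only reports what it would do, so that the proposed names can be checked before anything on disk is changed.

```python
from pathlib import Path
from typing import List, Tuple


def hyphenate_name(name: str) -> str:
    """Replace every full stop in a filename with a hyphen."""
    return name.replace(".", "-")


def rename_corpus_files(directory: str, dry_run: bool = True) -> List[Tuple[str, str]]:
    """Rename the files in `directory` whose names contain full stops.

    Returns a list of (old_name, new_name) pairs. With dry_run=True
    (the default) nothing is changed on disk.
    """
    changes = []
    for path in sorted(Path(directory).iterdir()):
        # Skip subdirectories, hidden files, and names without a full stop.
        if not path.is_file() or path.name.startswith(".") or "." not in path.name:
            continue
        target = path.with_name(hyphenate_name(path.name))
        if target.exists():
            raise FileExistsError(f"{target} already exists; not overwriting it")
        changes.append((path.name, target.name))
        if not dry_run:
            path.rename(target)
    return changes
```

Calling rename_corpus_files on the corpus directory first with the default dry_run=True, and then again with dry_run=False once the listed changes look right, keeps the operation easy to check.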


XML versions of the above

The SUSANNE, CHRISTINE, and LUCY treebank resources are encoded in a traditional record-and-field format. However, Olga Pustylnikov of the University of Bielefeld has transposed the data into XML-based formats. Since 2006–07 these alternative versions of the treebanks have been freely available for downloading from her project website.


Simple Good–Turing frequency estimation software

Software implementing the algorithm defined in Gale & Sampson, “Good–Turing frequency estimation without tears”, 1996. This software is available in ANSI standard C, in C++, and in Perl. I produced the C version in 2000. The C++ version was produced in 2004 by David Elworthy of Google, Inc., and is available from his site. The Perl version was produced in 2007 by Florian Doemges and Björn Wilmsmann of the Ruhr-Universität Bochum, and can be downloaded from CPAN.

(Please note that for some time an alternative Perl implementation was distributed from my site though not produced by me. I removed the link to that version after Fan Yang of Next IT Inc., Spokane (Wash.), pointed out that it was buggy. I am grateful to Fan Yang for this information, and I apologize to anyone who downloaded that software in good faith.)
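To give readers unfamiliar with the technique a rough idea of what such software computes, here is a minimal Python sketch of the basic, unsmoothed Turing estimate on which the method builds. It is not the Simple Good–Turing algorithm itself, and it is not a translation of any of the implementations listed above: the published algorithm additionally smooths the frequency-of-frequency values N_r with a log-linear regression, which this sketch omits. The function name and data layout are illustrative assumptions.

```python
from collections import Counter


def turing_estimates(observed):
    """Basic (unsmoothed) Turing estimates from observed frequencies.

    `observed` maps each item to the number of times r it was seen (r > 0).
    Returns (p0, adjusted): p0 = N_1 / N is the total probability mass
    assigned to unseen items, and adjusted[r] = (r + 1) * N_{r+1} / N_r is
    the adjusted frequency for each r whose successor r + 1 also occurs.
    """
    if not observed:
        raise ValueError("no observations supplied")
    # N_r: the number of distinct items seen exactly r times.
    freq_of_freq = Counter(observed.values())
    # N: the total number of observations.
    total = sum(r * n_r for r, n_r in freq_of_freq.items())
    p0 = freq_of_freq.get(1, 0) / total
    adjusted = {}
    for r, n_r in sorted(freq_of_freq.items()):
        n_next = freq_of_freq.get(r + 1, 0)
        if n_next:  # without smoothing, r* is undefined when N_{r+1} = 0
            adjusted[r] = (r + 1) * n_next / n_r
    return p0, adjusted
```

The gaps left wherever N_{r+1} is zero are exactly the problem that smoothing the N_r values is designed to solve, so for real work the C, C++ or Perl implementations listed above should be used rather than this sketch.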

Leaf-ancestor assessment software

A C program implementing the leaf-ancestor metric for parse accuracy, as described by Sampson & Babarczy in J. of Natural Language Engineering vol. 9 pp. 365–80, 2003. My original program has been available online since 2005; it works with datasets that are not too large, but it is slow and can fail outright with large ones. In 2006, Derrick Higgins of the Educational Testing Service, Princeton, New Jersey, produced an improved version, more efficient and robust, which he has kindly permitted me to distribute from this site. I recommend that users work with Derrick Higgins’s version; I leave my original version available because it is the only one for which I personally can vouch.


So far as I am concerned, anyone is welcome to take copies of these resources and to use them for any purpose; and as far as I am able to check, I am legally entitled to make that offer. (If this is not legally watertight enough for you, you will have to go into the legalities yourself.) Naturally, if you do anything public with some of these materials, Sussex University and I would appreciate an acknowledgement (and, in the case of SUSANNE, CHRISTINE, and LUCY, so would the Economic and Social Research Council (UK), which sponsored their creation).

If any user finds errors in the data files or bugs in the software code, I should be very grateful to be notified with details (sampson followed by at-sign followed by cantab.net). If your information leads to improved versions, you will be acknowledged by name in the documentation of those versions, as has already been done for those who kindly pointed out flaws in earlier releases. Likewise, if someone has problems in downloading or uncompressing the corpus data and is familiar enough with the technology to be confident that the problem lies at my end, I would be very glad to be told so that I can sort the problem out. (I hope that situation will not arise, but one can never be sure.) On the other hand, with queries along the lines of “Help, I’ve never used gunzip before, what’s all this stuff in my computer?”, in principle I wish I could assist but in practice I’m sorry, life is too short.

It is envisaged that additional resources, and enlarged or upgraded versions of those already listed, will be added from time to time. The access route will always be from www.grsampson.net via a link labelled downloadable research resources.




Geoffrey Sampson

last changed 5 Sep 2010