Geoffrey Sampson

Downloadable Research Resources

In the past, the various language-engineering research resources created by our team have been scattered across different sites, not all under our control, and there have been problems when some of them have been shifted without my being notified of the fact. I now have my own domain name, which I intend to maintain indefinitely; in future the most up-to-date versions of the research resources should always be stored in locations under my control and should be accessible by following the downloadable research resources link from my www.grsampson.net home page.

Note that this address should never change, but the URLs of the resources themselves may well change for various practical reasons at any time. Anyone referring people to the location of these materials should quote this address (www.grsampson.net/Resources.html) – not the current addresses of the respective resources.

Apart from resources created by researchers under my direction during the years before I retired, I also include links here to resources developed by others which build on our work. I am very glad to see our material re-used in these ways, and (with the respective researchers’ encouragement) to display links here. But I am in no position to vouch for the solidity of research products for which I was not responsible — would-be users must take their own views on that; and any problems or requests for advice would need to be taken up with those responsible, not with me.

(Of course, if anyone discovers problems serious enough to mean that I shouldn’t be advertising one of these resources, please tell me — that has already happened once!)

The following resources are currently available. In this list, the resource names link to background descriptions addressed to a general readership; links to technical documentation and to the resources themselves are included in the indented material below the names.

Note that the SUSANNE, CHRISTINE, and LUCY data files are published as compressed tar files, and all the corpus files are distributed by anonymous ftp. Consequently, downloading any of these resources requires the ability to use ftp software. In the case of the larger files, you will also need to use tar and uncompression software; and to use the Simple Good–Turing or leaf-ancestor assessment programs you will need to use a C compiler. If you don’t know what these terms mean, I respectfully suggest that you proceed no further unaided; perhaps you could consult someone in your local environment who has the appropriate technical expertise, or read a guide to those aspects of the internet which go beyond the World Wide Web.
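For readers who would rather script the unpacking step than run tar by hand, the following is a minimal Python sketch. It assumes the archive has already been fetched to local disk and that it is compressed with gzip, bzip2, or xz; Python's tarfile module does not read files packed with the older Unix compress utility. The function name unpack_corpus and the file names in the usage note are invented for this illustration and are not the names of the real distribution files.

```python
import tarfile
from pathlib import Path


def unpack_corpus(archive_path, dest_dir):
    """Extract a (possibly compressed) tar archive; return the member names.

    `unpack_corpus` is a hypothetical helper written for this page only.
    """
    dest = Path(dest_dir)
    dest.mkdir(parents=True, exist_ok=True)
    # Mode "r:*" lets tarfile detect gzip, bzip2, or xz compression itself.
    with tarfile.open(archive_path, "r:*") as archive:
        names = archive.getnames()
        archive.extractall(dest)
    return names
```

A call such as unpack_corpus("corpus.tgz", "corpus_files") would then unpack everything into the corpus_files directory and return the list of extracted names. As with any archive, it is sensible to unpack only files obtained from a source you trust.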

SUSANNE Corpus, Release 5

Release 5 of SUSANNE, completed in August 2000, is substantially revised from the previous release, which was circulated by the Oxford Text Archive.

Documentation file here; link to data files here.

SEMiSUSANNE Corpus

The SEMiSUSANNE Corpus was developed in 2006 by Christopher Powell of the Ashmolean Museum, Oxford University, who has kindly permitted me to distribute it from my site. SEMiSUSANNE supplements the grammatical annotations of SUSANNE with semantic annotations identifying the WordNet senses in which vocabulary items are used. It covers 33 of the 64 SUSANNE texts.

Documentation file here; link to data files here.

SUSANNETS

SUSANNETS was developed in 2017 by Alastair Butler of the (Japanese) National Institute for Japanese Language and Linguistics, Tokyo. It is a web-accessible automatic conversion of SUSANNE into the CorpusSearch format, with tag labels changed to the Penn Historical Corpora scheme. Its primary purpose is to serve as a companion to the Treebank Semantic Parsed Corpus, increasing the availability of high-quality syntactic parsed analyses for testing the generation of predicate-logic-based meaning representations with Treebank Semantics.

CHRISTINE Corpus, Release 2

The second release of CHRISTINE, which became available in August 2000, incorporates a minor change in the distribution of analytic information between the fields, to make it more compatible with SUSANNE and easier to read.

Documentation file here; link to data files here.

LUCY Corpus, Release 2

Release 2 of the LUCY Corpus, circulated in December 2005, corrects a number of errors in the initial release of 2003.

Documentation file here; link to data files here.

Note that both the CHRISTINE and LUCY Corpora have a feature relating to filename conventions which, with hindsight, I regret. They were developed in a Unix environment, where proprietary file formats have no great significance, and consequently I felt free to devise my own system of classifying corpus files using “dot-suffixes”. Many researchers nowadays, though, work in Windows and other computing environments where adherence to standards for file-format identification is important. When I find the time, I intend to produce new versions of these corpora using different, unobjectionable filename conventions. (I ought to have done this years ago, but one gets lazy in retirement. Not quite terminally lazy, though; see my various publication pages …)

Meanwhile, I suggest that those who use these resources in non-Unix environments begin by changing the filenames, say by replacing full stops with hyphens.
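The renaming suggested above can be done by hand or with a few lines of Python. The following is only a sketch: the function names are invented for this illustration, hidden files (names beginning with a full stop) are left alone, and dry_run defaults to True so that nothing is changed until the proposed renames have been checked.

```python
from pathlib import Path


def hyphenate_name(name):
    """Replace every full stop in a filename with a hyphen."""
    return name.replace(".", "-")


def rename_corpus_files(directory, dry_run=True):
    """Rename regular files in `directory`; return (old, new) name pairs.

    Both functions are hypothetical helpers, not part of any corpus release.
    """
    changes = []
    # sorted() takes a snapshot, so renaming does not disturb the iteration.
    for path in sorted(Path(directory).iterdir()):
        if not path.is_file() or path.name.startswith("."):
            continue
        new_name = hyphenate_name(path.name)
        if new_name == path.name:
            continue  # nothing to change
        target = path.with_name(new_name)
        if target.exists():
            raise FileExistsError(f"{target} already exists")
        if not dry_run:
            path.rename(target)
        changes.append((path.name, new_name))
    return changes
```

Calling rename_corpus_files("corpus_files") lists what would be renamed; calling it again with dry_run=False performs the renaming. Any program or documentation that refers to the original filenames will of course need the same substitution applied.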

XML versions of the above

The SUSANNE, CHRISTINE, and LUCY treebank resources are encoded in a traditional record-and-field format. However, Olga Pustylnikov (now Olga Abramov) of the University of Bielefeld has transposed the data into an XML-based format.

Simple Good–Turing frequency estimation software

Software implementing the algorithm defined in Gale & Sampson, “Good–Turing frequency estimation without tears”, 1996. Such software is available in at least the following languages:

ANSI standard C, coded by myself in 2000, available here
C++, produced in 2004 by David Elworthy of Google, Inc., and available from his site
Perl, produced in 2007 by Florian Doemges and Björn Wilmsmann of the Ruhr-Universität Bochum, available from CPAN
Python 3, by Zachary McCord of Roberson & Associates (Illinois) in 2019, available from PyPI
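
For readers who want a feel for what these programs compute before downloading one, the following is a compact Python sketch of the Simple Good–Turing procedure. It is an independent illustration, not a copy of any of the implementations listed above, and the function name and the dictionary-based interface are assumptions made here. The input maps each observed frequency r to N_r, the number of species seen exactly r times; at least two distinct frequencies are needed for the line fit.

```python
import math


def simple_good_turing(freq_of_freqs):
    """Return (p0, probs) for a dict mapping frequency r to N_r.

    p0 is the total probability estimated for unseen species; probs[r] is
    the estimated probability of each single species seen r times.
    Hypothetical interface, written for illustration only.
    """
    rs = sorted(freq_of_freqs)
    if len(rs) < 2:
        raise ValueError("need at least two distinct frequencies")
    n = [freq_of_freqs[r] for r in rs]
    total = sum(r * nr for r, nr in zip(rs, n))
    p0 = freq_of_freqs.get(1, 0) / total

    # Average each N_r over the gap between its neighbouring frequencies.
    z = []
    for i, r in enumerate(rs):
        prev_r = rs[i - 1] if i > 0 else 0
        next_r = rs[i + 1] if i + 1 < len(rs) else 2 * r - prev_r
        z.append(2 * n[i] / (next_r - prev_r))

    # Least-squares line through the points (log r, log Z_r).
    xs = [math.log(r) for r in rs]
    ys = [math.log(v) for v in z]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / sxx
    intercept = mean_y - slope * mean_x

    def smoothed(r):
        return math.exp(intercept + slope * math.log(r))

    # Use the raw Turing estimate while it differs significantly from the
    # smoothed one; after the first switch, use the smoothed estimate only.
    r_star = {}
    use_smoothed = False
    for r, nr in zip(rs, n):
        y = (r + 1) * smoothed(r + 1) / smoothed(r)
        if not use_smoothed and (r + 1) in freq_of_freqs:
            next_n = freq_of_freqs[r + 1]
            x = (r + 1) * next_n / nr
            sd = math.sqrt((r + 1) ** 2 * next_n / nr ** 2 * (1 + next_n / nr))
            if abs(x - y) > 1.96 * sd:
                r_star[r] = x
                continue
        use_smoothed = True
        r_star[r] = y

    # Renormalise so that seen and unseen probabilities sum to one.
    norm = sum(nr * r_star[r] for r, nr in zip(rs, n))
    probs = {r: (1 - p0) * r_star[r] / norm for r in rs}
    return p0, probs
```

With invented toy counts such as {1: 120, 2: 40, 3: 24, 4: 13, 5: 15, 6: 5, 7: 11, 8: 2, 9: 2, 10: 1, 12: 3}, the function returns the mass reserved for unseen species together with one probability per observed frequency. For serious work, one of the published implementations listed above is the better choice.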

leaf-ancestor assessment software

A C program implementing the leaf-ancestor metric for parse accuracy, as described by Sampson & Babarczy in J. of Natural Language Engineering vol. 9 pp. 365–80, 2003. My original program has been available online since 2005; it works with datasets that are not too large, but it is slow and can fail outright with large ones. In 2006, Derrick Higgins of the Educational Testing Service, Princeton, New Jersey, produced an improved version, more efficient and robust, which he has kindly permitted me to distribute from this site. I recommend that users work with Derrick Higgins’s version; I leave my original version available because it is the only one for which I personally can vouch.
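
To give a rough idea of the kind of calculation involved, here is a much-simplified Python sketch in the spirit of the leaf-ancestor metric. It is not a port of either C program, and it rests on several assumptions made for this illustration: trees are nested tuples of the form (label, child, ...), with words as plain strings; each word's lineage is the sequence of node labels from the word up to the root; lineages are compared by edit distance with insertion and deletion costing 1 and replacement costing 2; and a parse is scored by the mean similarity over its words. The boundary markers and the cheaper replacement of partly matching labels used in the published metric are omitted, so scores from this sketch will not in general agree with those of the real software.

```python
def lineages(tree, ancestors=()):
    """Yield (word, lineage) pairs; a lineage runs from the word up to the root."""
    label, *children = tree
    path = (label,) + ancestors
    for child in children:
        if isinstance(child, str):
            yield child, path
        else:
            yield from lineages(child, path)


def edit_distance(a, b, replace_cost=2):
    """Levenshtein distance with unit insert/delete cost (assumed costs)."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, 1):
        cur = [i]
        for j, y in enumerate(b, 1):
            substitute = prev[j - 1] + (0 if x == y else replace_cost)
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, substitute))
        prev = cur
    return prev[-1]


def leaf_ancestor_score(candidate, gold):
    """Mean lineage similarity over the words of two parses of one sentence."""
    cand = list(lineages(candidate))
    ref = list(lineages(gold))
    if [w for w, _ in cand] != [w for w, _ in ref]:
        raise ValueError("the two parses must cover the same words")
    scores = [
        1 - edit_distance(c, g) / (len(c) + len(g))
        for (_, c), (_, g) in zip(cand, ref)
    ]
    return sum(scores) / len(scores)
```

For example, leaf_ancestor_score(("S", "dogs", "bark"), ("S", ("NP", "dogs"), ("VP", "bark"))) compares a flat candidate parse with a gold-standard parse that has one extra node above each word, and a candidate identical to the gold standard scores 1.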

So far as I am concerned, anyone is welcome to take copies of these resources and to use them for any purpose; and as far as I am able to check, I am legally entitled to make that offer. (If this is not legally watertight enough for you, you will have to go into the legalities yourself.) Naturally, if you do anything public with some of the materials which I was responsible for producing, Sussex University and I would appreciate an acknowledgement (and, in the case of SUSANNE, CHRISTINE, and LUCY, so would the Economic and Social Research Council (UK), which sponsored their creation). No doubt similar remarks would apply to the materials produced by others.

If any user finds errors in the data files or bugs in the software code, I should be very grateful to be notified with details (sampson followed by at-sign followed by cantab.net). If your information leads to improved versions, you will be acknowledged by name in the documentation of those versions, just as I have already acknowledged people who kindly pointed out flaws in earlier releases. Likewise, if someone has problems in downloading or uncompressing the corpus data and is familiar enough with the technology to be confident that the problem lies at my end, I would be very glad to be told so that I can sort the problem out. (I hope that situation will not arise, but one can never be sure.) On the other hand, with queries along the lines “Help, I’ve never used gunzip before, what’s all this stuff in my computer?”, in principle I wish I could assist but in practice I’m sorry, life is too short.

It is envisaged that additional resources, and enlarged or upgraded versions of those already listed, will be added from time to time. The access route will always be from www.grsampson.net via a link labelled downloadable research resources.

Geoffrey Sampson

last changed 14 Apr 2023