Since the 1990s, the exciting growth-area in linguistics has been corpus linguistics: studying how English and other languages are used in real life, through analysis of large electronic samples – “corpora” – of spoken or written usage. In 2004, together with my colleague Diana McCarthy I edited an anthology of papers illustrating the diverse strengths of modern corpus linguistics.
Many findings of corpus linguistics shed new light on the nature of language as a human ability. But corpus analysis is crucial also for enabling computers to process human language. For that purpose, we need corpora annotated to show their structural features, as a source of information and statistics to guide the development of language-processing algorithms. This in turn requires some set of categories to be explicitly defined, so that researchers exchanging language data can be confident that they are using the annotations in the same way. Computational linguistics needs something like the Linnaean taxonomy created for botany in the 18th century, which for the first time enabled naturalists everywhere to exchange information about plants secure in the knowledge that when they used the same names they were talking about the same things.
(To get a sense of the massive variety of annotation practices which have emerged from the lack, in the past, of any explicit public taxonomy that researchers could choose to standardize on, see the catalogue compiled by the Linguistic Data Consortium.)
Beginning in 1983 I led an effort, which came to fruition with my 1995 book English for the Computer (Oxford University Press), to produce this sort of Linnaean taxonomy for English: the SUSANNE scheme. The SUSANNE scheme is so far as I am aware the first serious attempt anywhere to produce a comprehensive, fully explicit annotation scheme for English grammatical structure. It has won praise internationally, e.g.:
- “a unique achievement” — a report of the European Union EAGLES standards initiative
- “the best there is” — an anonymous referee for our later CHRISTINE project
- “the detail ... is unrivalled” — D.T. Langendoen (President, Linguistic Society of America), in Language vol. 73, 1997
- “impressive ... very detailed and thorough” — Oliver Mason, in International Journal of Corpus Linguistics vol. 2, 1997
- “meticulous treatment of detail” — Geoffrey Leech & Elizabeth Eyes, in R.G. Garside et al., eds., Corpus Annotation, Longman, 1997
The name “SUSANNE” stands for “Surface and underlying structural analysis of natural English”.
The genesis of the SUSANNE scheme lay in work on statistics-based parsing techniques led by Geoffrey Leech and Roger Garside in the early 1980s at Lancaster University, where I then worked. The automatic parser needed a manually-analysed database as a source of statistical information, and I undertook to produce this in consultation with Lancaster colleagues. At that period it was usual in computational linguistics to work with invented, artificially well-behaved language examples; our group at Lancaster was unusual in working with real-life language (we were using the LOB Corpus of written British English).
Although there has long been a broad consensus among linguists about the core grammatical categories of English — parts of speech (e.g. noun, preposition), types of phrase and clause (e.g. adjectival phrase, relative clause) — the experience of applying an agreed listing of such categories to real-life examples immediately threw up two huge problems:
First, we had expected that there would be difficult cases — should this sequence be classified as an X phrase or as a Y phrase? should the phrase be marked as ending at this word or the next? We hadn’t quite realized that such difficulties would crop up in virtually every sentence, even in published (hence, edited) prose. Second, we knew that the consensus categories focused mainly on the “core” logical constructions of English, but we hadn’t realized how much there is that occurs in real-life English and needs to be dealt with by practical Language Engineering systems, but which the consensus categories ignore. Dates, money sums, weights and measures, and multi-word personal names all have characteristic internal structure, but standard linguistics offered no guidance about how to represent it. Linguistics was so heavily oriented towards spoken language that even punctuation marks had no accepted place in parse-trees.
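To make the kind of gap concrete: a date expression has obvious internal structure, but until a scheme assigns it explicit categories, two annotators will bracket it in two different ways. The node labels in this toy sketch (DATE, DAY, MONTH, YEAR) are invented for illustration; they are not the actual SUSANNE category symbols.

```python
# Toy illustration of assigning explicit internal structure to a date.
# The labels are invented for this sketch, not actual SUSANNE tags.

def bracket(label, *children):
    """Render a labelled constituent in bracketed notation."""
    return "[" + label + " " + " ".join(children) + "]"

date = bracket("DATE",
               bracket("DAY", "15th"),
               bracket("MONTH", "March"),
               bracket("YEAR", "1995"))
print(date)  # [DATE [DAY 15th] [MONTH March] [YEAR 1995]]
```

An explicit scheme must go further still and legislate such questions as whether the “of” and “the” in “the 15th of March, 1995” fall inside or outside the date constituent — exactly the sort of grey area the consensus categories left open.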
The task I took on was to make explicit all these gaps in the consensus, and to agree with my colleagues consistent, acceptable ways of filling them.
The SUSANNE scheme is an attempt to lay down a set of annotation standards which resolve all doubts of the kinds just discussed, by specifying an explicit rule to decide each grey area. The scheme sets out to be comprehensive and fully explicit.
The scheme does not aim to be correct: that is, no claim is made that where a construction might be analysed in one way or another way, the SUSANNE analysis corresponds to how speakers process the construction psychologically, or anything of that sort. The SUSANNE philosophy is that it is more important that every linguistic form should have a predictable, explicitly defined analysis than that analyses should always be theoretically ideal. Collecting and registering data in terms of an explicit taxonomic scheme is a precondition for successful theorizing, not the other way round.
However, the SUSANNE scheme has been developed and debugged through consultation between numerous researchers and through application to sizeable samples of British and American English; so, if not “correct”, the scheme can at least be described as consistent. Analyses which looked attractive for one example were often changed during the development of the scheme because they proved unworkable when extended to related examples. Great effort was put into ensuring that the rules eventually published in English for the Computer contain no hidden drawbacks, that there are no inconsistencies between statements in different places among its 500 pages, and that the rules really can be used to represent any linguistic phenomenon that crops up.
One important question – for understanding how language works for humans, as much as for computational linguistics – is where the ultimate ceiling to precision of structural annotation lies. How fine are the distinctions it is possible to make consistently in specifying the structural properties of English as used in real life?
This is really two questions – one about the precision of definitions, and another about the ability of human analysts to apply precise definitions.
I like to draw an analogy with measuring clouds. Suppose we wanted to be able to say how large particular clouds are – what volume of space they occupy. Clouds are fuzzy things, so one problem would be what we mean by the volume of a cloud – what exactly should we count as its edge? But even if we adopted some precise definition of cloud boundaries, so that it became meaningful to say that this cloud is exactly N cubic yards in size, not N + 1 or N – 1, it might still be beyond mankind’s abilities actually to measure clouds so exactly. With language, we have a rather vague and ambiguous traditional terminology for describing grammatical elements, which the SUSANNE scheme aims to replace with a precise scheme. So the two questions are: how much precision in definitions is possible, and how far are human analysts capable of applying an extremely precise scheme of structural analysis?
The SUSANNE scheme is the obvious vehicle for examining these questions, because (unlike other analytic schemes used by computational linguists) its details are specifically motivated by the goal of maximizing precision, consistency, and comprehensiveness. (To quote a third-party commentator, Lin Dekang of the University of Alberta, writing in Anne Abeillé, ed., Treebanks, 2003, “Compared with other possible alternatives such as the Penn Treebank ... [t]he SUSANNE corpus puts more emphasis on precision and consistency”.) In a collaboration with Anna Babarczy and John Carroll I have examined the above questions experimentally. Our detailed numerical findings are written up in two papers: one on the specific issue of word classification, which appeared in Natural Language Engineering in 2006, and another on higher-level phrase and clause analysis and classification, which appeared in the same journal in 2008.
We drew a number of fairly clear-cut and not entirely predictable conclusions:
Problems of analytic consistency stem almost entirely from human limitations rather than definitional shortcomings. The most refined analytic scheme available contains surprisingly few areas of vagueness or hidden contradiction, but even the most highly-trained, experienced analysts are often unable in practice to conform their annotation to its refinements. We can define what count as cloud boundaries very precisely, but we cannot measure clouds so accurately in practice.
One particular aspect of English language structure, an aspect which is crucial for grasping the human import of utterances – namely, functional classification of clause arguments and adjuncts – turns out to be strikingly more resistant to precise definition than any other area of structure.
Structural distinctions that are clearly “real” – in some contexts they make a large difference to what is being said – in many other cases turn out to act as “distinctions without a difference”. English grammar seems analogous to a ruler which is marked off in sixteenths of an inch, but which, much of the time, is used merely to check whether one has picked up a seven-inch or eight-inch bolt.
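The papers cited above give our actual figures and methods; purely as an illustration of how agreement between analysts can be quantified, the sketch below (not our experimental code) computes a standard labelled-bracket F-score between two analysts’ parses of the same sentence, where each constituent is recorded as a (label, start, end) triple over word positions.

```python
# Illustrative sketch of one standard inter-analyst agreement measure:
# labelled-bracket F-score. Not the code used in the experiments cited above.

def f_score(brackets_a, brackets_b):
    """brackets_a, brackets_b: sets of (label, start, end) constituent triples."""
    if not brackets_a or not brackets_b:
        return 0.0
    matched = len(brackets_a & brackets_b)
    precision = matched / len(brackets_a)
    recall = matched / len(brackets_b)
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# Two analysts agree on most nodes but differ over where the VP ends:
analyst1 = {("S", 0, 5), ("NP", 0, 2), ("VP", 2, 5), ("PP", 3, 5)}
analyst2 = {("S", 0, 5), ("NP", 0, 2), ("VP", 2, 3), ("PP", 3, 5)}
print(round(f_score(analyst1, analyst2), 2))  # 0.75
```

On such a measure, disagreements over constituent labels and disagreements over constituent boundaries are penalized alike, which matches the two kinds of “difficult case” described earlier.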
A by-product of the work of creating the SUSANNE annotation scheme was the production of a corpus of English annotated in accordance with the scheme. The SUSANNE Corpus contains annotations of a 130,000-word cross-section of written American English (it is based on a subset of the million-word Brown Corpus). The SUSANNE Corpus is freely available without formalities for use by researchers anywhere (and has been heavily used since the first release was published in 1992). Many gratifying comments have been received from users about the detail and reliability of the annotated Corpus.
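For readers who have not seen the files: each line of the SUSANNE Corpus annotates one word (or punctuation mark) with several fields, including the word itself, its wordtag, a lemma, and a parse field carrying the tree structure. The sketch below reads one such line; the field layout here is my simplified mock-up for illustration (tab-separated, with invented content), not a quotation from the Corpus, whose exact format is defined in English for the Computer.

```python
# Simplified sketch of reading one SUSANNE-style corpus line.
# The sample line is a mock-up, not a quotation from the Corpus itself.

FIELDS = ("reference", "status", "wordtag", "word", "lemma", "parse")

def parse_line(line):
    """Split a tab-separated annotation line into named fields."""
    return dict(zip(FIELDS, line.rstrip("\n").split("\t")))

sample = "A01:0010a\t-\tNNS\tresults\tresult\t[S[Ns:s.Ns:s]"
record = parse_line(sample)
print(record["word"], record["wordtag"])  # results NNS
```

One word per line, with the tree encoded incrementally in the final field, is what makes such files convenient both for human inspection and as training data for statistical parsers.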
The published SUSANNE scheme includes extensions to the annotation to handle spoken English, with phenomena such as speech repairs, ums and ers, and discourse items that have no analogue in edited, written language. The SUSANNE Corpus contains written English only; but a later project, the CHRISTINE Project, has produced a counterpart of the SUSANNE Corpus based on samples of the spoken language, drawn from spontaneous speech by speakers chosen to represent a cross-section of the present-day British population. The manual for the CHRISTINE Corpus includes explicit extensions to the SUSANNE annotation to handle speech phenomena in a similarly predictable, well-defined way. (For discussion of some of the problems that arise, see a paper co-authored with my former researcher Anna Rahman.) And the manual of the even more recent LUCY Corpus contains some further extensions to allow consistent structural annotation of the written English of unskilled writers, such as children. The LUCY Corpus represents written English in modern Britain, ranging from published prose to the less-skilled writing of young adults, and spontaneous writing by nine-to-twelve-year-old children.
My resources page gives details on how to get up-to-date copies of each of these annotated corpora or “treebanks”. (Please note that the address given for the SUSANNE Corpus in English for the Computer is now out of date.) The manuals for the corpora are also available as web pages — again my resources page has links.
All of these resources were produced with the sponsorship of the Economic & Social Research Council (UK).
More recently, a related resource, SEMiSUSANNE, has been developed by Christopher Powell of the Ashmolean Museum, Oxford University. SEMiSUSANNE, completed in 2006, consists of 33 of the SUSANNE texts in which the grammatical annotations have been supplemented with annotations identifying the senses in which the vocabulary items are used, in terms of the coding scheme of Princeton WordNet 1.6. (SEMiSUSANNE was derived by merging information from SUSANNE and from SemCor for the files common to both.) SEMiSUSANNE is also available from my resources page, by kind permission of its creator Christopher Powell.
Another extension relates to dependency structure. Our own version of the SUSANNE Corpus is based on a phrase-structure approach to English grammar, and does not include explicit information about head/modifier relationships within grammatical constructions. Lin Dekang has attempted to use automatic techniques to add this information and create a dependency-structure version of SUSANNE. This is available from ftp://ftp.cs.umanitoba.ca/pub/lindek/. (Please note that I have no personal familiarity with this work; any queries about it should be directed to Lin Dekang.)
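Lin’s actual conversion procedure is described in his own work; purely as an illustration of the general idea, a phrase-structure tree can be turned into head/modifier dependencies by a table of head rules: each node takes its head word from one designated child, and the head words of the remaining children become dependents of it. The head table and tree below are toy inventions.

```python
# Generic sketch of deriving dependencies from a phrase-structure tree
# via head rules. Illustrative only: this is not Lin Dekang's procedure,
# and the head table is a toy one.

HEAD_CHILD = {"S": "VP", "NP": "N", "VP": "V", "PP": "P"}  # toy head table

def find_head(tree, deps):
    """tree is (label, children-list) or a (tag, word) leaf.
    Returns the head word; appends (head, dependent) pairs to deps."""
    label, children = tree
    if isinstance(children, str):          # leaf node: (tag, word)
        return children
    heads = [find_head(child, deps) for child in children]
    head_label = HEAD_CHILD.get(label)
    head_index = next((i for i, c in enumerate(children) if c[0] == head_label), 0)
    head = heads[head_index]
    for i, h in enumerate(heads):          # other children depend on the head
        if i != head_index:
            deps.append((head, h))
    return head

tree = ("S", [("NP", [("N", "dogs")]),
              ("VP", [("V", "chase"), ("NP", [("N", "cats")])])])
deps = []
root = find_head(tree, deps)
print(root, deps)  # chase [('chase', 'cats'), ('chase', 'dogs')]
```

The interest of applying such a conversion to SUSANNE in particular is that the detail of the phrase-structure annotation constrains the head choices far more tightly than a coarser scheme would.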
A word on the relationship of the SUSANNE Corpus to other analysed corpora.
When the work described in this Web page began in the early 1980s, no grammatically analysed corpora yet existed. The 45,000-word Lancaster-Leeds Treebank which I developed for Geoffrey Leech and Roger Garside’s parsing project, though small, was apparently the first in the field (and we believe that it was Leech who coined the term treebank, which has since come into general use in computational linguistics).
As the virtues of corpus-based methods began to be more widely appreciated in the 1990s, larger analysed corpora came into existence, the largest of which (Mitchell Marcus’s Penn Treebank) dwarfs the SUSANNE Corpus in sheer size.
These two resources have different aims. The SUSANNE Corpus was produced as an adjunct to the development of detailed analytic standards; consequently it could only be as big as was compatible with individual attention (often, attention by several individuals) to almost every difficult analytic decision posed by its language. There would have been no point in including more samples than could be scrutinized thoroughly: the extra material would have done nothing to raise the quality of the published analytic standards. The motive of the Penn project, as I understand it, is to produce the largest possible quantity of analysed language material, using an analytic scheme which is as subtle as is compatible with that aim. The Penn team have published their own analytic guidelines online, at ftp://ftp.cis.upenn.edu/pub/treebank/doc/manual/; but their approach does not seem to involve the same ambition of covering detail comprehensively. (Cf. the Lin Dekang quotation above.)
Both of these alternative research strategies are valid in their own terms. It is a case of horses for courses.
In support of the SUSANNE strategy, I would comment that what the research community ultimately needs is very large databases of language, analysed in very great detail. The precedent of software engineering suggests that this goal might best be achieved by thoroughly debugging the annotation scheme as the first stage.
last changed 13 Sep 2008