Extending Grammar Annotation Standards to Spontaneous Speech

Anna Rahman and Geoffrey Sampson

University of Sussex

Published in J.M. Kirk, ed., Corpora Galore: Analyses and Techniques in Describing English, Rodopi (Amsterdam), 1999.

Abstract

We examine the problems that arise in extending an explicit, rigorous scheme of grammatical annotation standards for written English into the domain of spontaneous speech. Problems of principle occur in connexion with part-of-speech tagging; the annotation of speech repairs and structurally incoherent speech; logical distinctions dependent on the orthography of written language (the direct/indirect speech distinction); differentiating between nonstandard usage and performance errors; and integrating inaudible wording into analyses of otherwise-clear passages. Perhaps because speech has contributed little in the past to the tradition of philological analysis, it proves difficult in this domain to devise annotation guidelines which permit the analyst to express what is true without forcing him to go beyond the evidence.

Background

To quote Jane Edwards (1992: 139), “The single most important property of any data base for purposes of computer-assisted research is that similar instances be encoded in predictably similar ways”. This principle has not often been observed in the domain of grammatical annotation. Although many alternative lists of grammatical categories have been proposed for English and for other languages, in most cases these are not backed up by detailed, rigorous specifications of boundaries between the categories. A scheme may define how to draw a parse tree for a clear, “textbook” example sentence, with node labels drawn from a large, informative label-alphabet, but may leave it entirely to analysts’ discretion how to apply the annotation to the messy constructions that are typical of real-life data.

The SUSANNE scheme, developed over the period 1983–93 (Sampson 1995; www.grsampson.net/RSue.html), is a first published attempt to fill this gap for English; the 500 pages of the scheme aim to define an explicit analysis for everything that occurs in the language in practice. Figure 1 shows a brief extract from the scheme (the first two of rather more than four pages defining the logical boundaries of the category P, “prepositional phrase”, one of the simplest categories recognized by the scheme). It gives a flavour of the kinds of issue that have to be explicitly settled, one way or another, if a category is to be applied in a consistent fashion. No claim is made that the numerous annotation rules comprised in the SUSANNE scheme are “correct” with respect to some psychological or other reality; undoubtedly there are cases where the opposite choice of rule could have yielded an equally well-defined and internally consistent annotation scheme. But, without some explicit choice of rules on a long list of issues comparable to those discussed in Figure 1, one has only a list of category-names and symbols, not a well-defined scheme for applying them.

The SUSANNE scheme has been achieving a degree of international recognition: “the detail … is unrivalled” (Langendoen 1997: 600); “impressive … very detailed and thorough” (Mason 1997: 169, 170); “meticulous treatment of detail” (Leech & Eyes 1997: 38). We are not aware of any alternative annotation scheme (for English, or for another language) which covers the ground at a comparable level of detail. (The other schemes that we know about seem to have been initiated substantially more recently than SUSANNE, as well as being less detailed. We do not survey these schemes here; but it is worth mentioning, as particularly closely related to the work described below, the scheme for annotating dysfluencies in the Switchboard corpus of American telephone conversations, http://www.ldc.upenn.edu/myl/DFL-book.pdf.)

Various research groups may prefer to use different lists of grammatical symbols; but it is not clear what value will attach to statistics derived from annotated corpora, unless the boundaries between their categories are defined with respect to the same issues that the SUSANNE scheme treats explicitly.

Currently, the CHRISTINE project (www.grsampson.net/RChristine.html) is extending the SUSANNE scheme, which was based mainly on edited written English, to the domain of spontaneous spoken English. CHRISTINE is developing the outline extensions of the SUSANNE scheme for speech which were contained in Sampson (1995: ch. 6) into a set of annotation guidelines comparable in degree of detail to the rest of the scheme, “debugging” them by applying them manually to samples of British English representing a wide variety of regional, social class, age, and social setting variables. Figure 2 displays an extract from the corpus of annotated speech currently being produced through this process. The sources of language samples used by the CHRISTINE project are the speech section of the British National Corpus (http://info.ox.ac.uk/bnc/), the Reading Emotional Speech Corpus (http://midwich.reading.ac.uk/research/speechlab/emotion/), and the London-Lund Corpus (Svartvik 1990). Figure 2 is extracted from file KSS of the British National Corpus. (Except where otherwise stated, examples quoted in later sections of this paper will also come from the BNC, with the location specified as three-character filename followed after full stop by five-digit “s-unit number”. BNC transcriptions include punctuation and capitalization, which are of questionable status in representations of spoken wording; in Figure 2 these matters are normalized away, but they have been allowed to stand in examples quoted in the text below.)
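As a purely illustrative aside (not one of the project's own tools), the location convention just described can be decoded mechanically. The Python sketch below assumes, from the way ranges are printed in the examples quoted in this paper, that a reference such as KPC.00999–1002 abbreviates its second s-unit number by omitting the leading digits it shares with the first. The names Location and parse_location are hypothetical.

```python
import re
from typing import NamedTuple


class Location(NamedTuple):
    """A BNC example location: file name plus first and last s-unit numbers."""
    filename: str
    first: int
    last: int


# Three-character file name, full stop, five-digit s-unit number, and
# optionally a dash followed by a (possibly abbreviated) closing number.
# ASSUMPTION: the abbreviation rule is inferred from the printed examples.
_LOCATION = re.compile(r"^([A-Z0-9]{3})\.(\d{5})(?:[–-](\d{1,5}))?$")


def parse_location(ref: str) -> Location:
    match = _LOCATION.match(ref.strip())
    if match is None:
        raise ValueError(f"not a recognizable location: {ref!r}")
    filename, first, last = match.groups()
    if last is None:
        full_last = first  # a single s-unit
    else:
        # Restore the leading digits that the abbreviated end number omits.
        full_last = first[: len(first) - len(last)] + last
    return Location(filename, int(first), int(full_last))
```

Under these assumptions, parse_location("KSU.00396–8") yields file KSU with s-units 396 to 398.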

In Figure 2, the words uttered by the speakers are in the next-to-rightmost field. The field to their left classifies the words, using the SUSANNE tagset supplemented by some additional refinements to handle special problems of speech: the part-word i at byte 0692161 is tagged “VV0v/ice” to show that it is a broken-off attempt to utter the verb ice. The rightmost field gives the grammatical analysis of the constructions, in the form of a labelled tree structure which again uses the SUSANNE conventions. All tagmas are classified formally, with a capital letter followed in some cases by lower-case subcategory letters: S stands for “main clause”, Nea represents “noun phrase marked as first-person-singular and subject”, Ve labels don’t know as a verb group marked as negative. Additionally, immediate constituents of clauses are classified functionally, by a letter after a colon: Nea:s in the first line shows that I is the subject of say, Fn:o in the third line shows that the nominal clause (Fn) I don’t know where … is direct object of say. Three-digit index numbers relate surface to logical structures. Thus where in the seventh line is marked as an interrogative adverb phrase (Rq) having no logical role (:G) in its own clause (that is, the clause headed by +’s gon, i.e. is going), but corresponding logically to an unspoken Place adjunct (“p101”) within the infinitival clause +na (= to) get cake done. (The character “y” in column 3 identifies a line which contains an element of the structural analysis rather than a spoken word.)
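The make-up of the node labels just described can be shown with a short Python sketch. It is an illustration under stated assumptions rather than a description of any SUSANNE or CHRISTINE software: it covers only the label shapes mentioned in this passage (a formal category such as S, Nea, Fn or Ve; an optional functional letter after a colon; an optional three-digit index), and it assumes that an index is written directly after the functional letter, as in Rq:G101. Labels for unspoken elements such as p101 are deliberately not handled. The names NodeLabel and parse_label are hypothetical.

```python
import re
from typing import NamedTuple, Optional


class NodeLabel(NamedTuple):
    form: str                # formal category, e.g. "Nea" or "Fn"
    function: Optional[str]  # functional letter after the colon, if any
    index: Optional[int]     # three-digit index linking surface and logical structure


# ASSUMPTION: capital letter plus optional lower-case subcategory letters,
# then optionally ":" and one functional letter, then optionally three digits.
_LABEL = re.compile(r"^([A-Z][a-z]*)(?::([A-Za-z]))?(\d{3})?$")


def parse_label(label: str) -> NodeLabel:
    match = _LABEL.match(label)
    if match is None:
        raise ValueError(f"unrecognized node label: {label!r}")
    form, function, index = match.groups()
    return NodeLabel(form, function, int(index) if index else None)
```

For example, parse_label("Nea:s") separates the formal category Nea from the functional letter s, and parse_label("Rq:G101") additionally recovers the index 101.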

Legal constraints permitting, the CHRISTINE Corpus will be made freely available electronically, after completion in December 1999, in the same way as the SUSANNE Corpus already is.

Defining a rigorous, predictable structural annotation scheme for spontaneous speech involves a number of difficulties which are not only additional to, but often different in kind from, those involved in defining such a scheme for written language. This paper examines various of these difficulties. In some cases, our project has already identified tentative annotation rules for addressing these difficulties, and in these cases we shall mention the decision adopted; but in other cases we have not yet been able to formulate any satisfactory solution. Even in cases where our project has chosen a provisional solution, discussing this is not central to our aims in the present paper. Our goal, rather, is to identify the types of issue needing to be resolved, and to show how devising an annotation scheme for speech involves problems of principle, of a kind that would have been difficult to anticipate before undertaking the task.

The Software Engineering Precedent

The following pages will examine a number of conceptual problems that arise in defining rigorous annotation standards for spontaneous speech. Nothing will be said about computational technicalities, for instance the possibilities of designing an automatic parser that could apply such annotation, or the nature of the software tools used in our project to support manual annotation. (The project has developed a range of such tools, but we regard them as being of interest only to ourselves.)

In our experience, some computational linguists see a paper of this type as insubstantial and of limited value in advancing the discipline. While it is not for us to decide the value of our particular contribution, as a judgement on a genre we see this attitude as profoundly wrong-headed. To explain why, let us draw an analogy with developments in industrial and commercial computing.

Writing programs and watching them running is fun. Coding and typing at keyboards are the programmer activities which are most easy for IT managers to perceive as productive. For both these reasons, in the early decades of computing it was common for software developers to move fairly quickly from taking on a new assignment to drafting code – though, unless the assignment was trivially simple, the first software drafts did not work. Sometimes they could be rescued through debugging – usually a great deal of debugging. Sometimes they could not: the history of IT is full of cases of many-man-year industrial projects which eventually had to be abandoned as irredeemably flawed without ever delivering useful results.

There is nowadays a computer science subdiscipline, software engineering (e.g. Sommerville 1992), which has as one of its main aims the training of computing personnel to resist their instincts and to treat coding as a low priority. Case studies have shown (Boehm 1981: 39–41) that the cost of curing programming mistakes rises massively, depending on how late they are caught in the process that begins with analysing a new programming task and ends with maintenance of completed software. In a well-run modern software house, tasks and their component subtasks are rigorously documented at progressively more refined levels of detail, so that unanticipated problems can be detected and resolved before a line of code is written; programming can almost be described as the easy bit at the end of a project.

The subject-matter of computational linguistics, namely human language, is one of the most complex phenomena dealt with by any branch of IT. To someone versed in modern industrial software engineering, which mainly deals with structures and processes much simpler than any natural language, it would seem very strange that our area of academic computing research could devote substantially more effort to developing language-processing software than to analysing in detail the precise specifications which software such as natural-language parsers should be asked to deliver, and to uncovering hidden indeterminacies in those specifications. Accordingly, we make no bones about the data-oriented rather than technique-oriented nature of the present paper. At the current juncture in computational linguistics, consciousness-raising about problematic aspects of the subject-matter is a high priority.

Wordtagging

One fundamental aspect of grammatical annotation is classifying the grammatical roles of words in context – wordtagging. The SUSANNE scheme defined an alphabet of over 350 distinct wordtags for written English, most of which are equally applicable to the spoken language though a few have no relevance to speech (for instance, tags for roman numerals, or mathematical operators). Spoken language also, however, makes heavy use of “discourse items” (Stenström 1990) having pragmatic functions with little real parallel in writing: e.g. well as an utterance initiator. Discourse items fall into classes which in most cases are about as clearly distinct as the classifications applicable to written words, and the CHRISTINE scheme provides a set of discourse-item wordtags developed from Stenström’s classification. However, where words are ambiguous as between alternative discourse-item classes, the fact that discourse items are not normally syntactically integrated into wider structures means that there is little possibility of finding evidence to resolve the tagging ambiguity.

Thus, three discourse-item classes are Expletive (e.g. gosh), Response (e.g. ah), and Imitated Noise (e.g. glug glug). Consider the following extracts from a sample in which children are “playing horses”, one riding on the other’s back:

KPC.00999–1002 speaker PS1DV: … all you can do is <pause> put your belly up and I’ll go flying! … Go on then, put your belly up! speaker PS1DR: Gung!

KPC.10977 Chuck a chuck a chuck chuck! Ee ee! Go on then.

In the former case, gung is neither a standard English expletive, nor an obviously appropriate vocal imitation of anything happening in the horse game. Conversely, in the latter case ee could equally well be the standard Northern regional expletive expressing mildly shocked surprise, or a vocal imitation of a “riding” noise. In many such cases, the analyst is forced by the current scheme to make arbitrary guesses, yet clear cases of the discourse-item classes are too distinct from one another to justify eliminating guesswork by collapsing the classes into one.

Not all spoken words posing tagging problems are discourse items. In:

KSU.00396–8 Ah ah! Diddums! Yeah.

any English speaker will recognize the word diddums as implying that the speaker regards the hearer as childish, but intuition does not settle how the word should be tagged (noun? if so, proper or common?); and published dictionaries do not help. To date we have formulated no principled rule for choosing an analysis in cases like these.

Speech Repairs

Probably the most crucial single area where grammatical standards developed for written language need to be extended to represent the structure of spontaneous spoken utterances is that of speech repairs. The CHRISTINE repair annotation system draws on Levelt (1983) and Howell & Young (1990, 1991), to our knowledge the most fully-worked-out and empirically-based previously existing approach. This approach identified a set of up to nine repair milestones within a repaired utterance, for instance the point at which the speaker’s first grammatical plan is abandoned (the “moment of interruption”), and the earlier point marking the beginning of the stretch of wording which will be replaced by new wording after the moment of interruption. However, this approach is not fully workable for many real-life speech repairs. In one respect it is insufficiently informative: the Levelt/Howell & Young notation provides no means of showing how a local sequence containing a repair fits into the larger grammatical architecture of the utterance containing it. In other respects, the notation proves to be excessively rich: it requires speech repairs to conform to a canonical pattern from which, in practice, many repairs deviate.

Accordingly, CHRISTINE embodies a simplified version of this notation, in which the “moment of interruption” in a speech repair is marked (by a “#” sign within the stream of words), but no attempt is made to identify other milestones, and the role of the repaired sequence is identified by making the “#” node a daughter of the lowest labelled node in a parse tree such that both the material preceding and the material following the # are (partial) attempts to realize that category, and the mother node fits normally into the surrounding structure. This approach works well for the majority of speech repairs, e.g.:

KBJ.00943 That’s why I said [Ti:o to get ma ba # , get you back then] …

KCA.02828 I’ll have to [VV0v# cha # change ] it

In the KBJ case, to get ma ba (in which ma and ba are truncated words, the former identified by the wordtagging as too distorted to reconstruct and the latter as an attempt at back as an adverb), and get you back then, are successive attempts to produce an infinitival clause (Ti) functioning as object (:o) of said. In the KCA case, cha and change are successive attempts to produce a single word whose wordtag is VV0v (base form of verb having transitive and intransitive uses). In Figure 2, the “#” symbol is used at two levels in the same speaker turn: speaker PS6RC makes two attempts to realize a main clause (S), and the second attempt begins with two attempts to pronounce the verb ice.
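The way the “#” marker divides a constituent into successive attempts can be sketched in a few lines of Python. This is a minimal illustration, assuming the flat, single-bracket form in which the two examples above are printed (an opening bracket, a label, then words and “#” signs separated by spaces); it does not handle nested bracketing, and the function name split_repair is hypothetical rather than part of the CHRISTINE tools.

```python
from typing import List, Tuple


def split_repair(bracketed: str) -> Tuple[str, List[List[str]]]:
    """Return the node label and the successive attempts separated by '#'."""
    text = bracketed.strip()
    if not (text.startswith("[") and text.endswith("]")):
        raise ValueError("expected a single bracketed constituent")
    tokens = text[1:-1].split()
    if not tokens:
        raise ValueError("empty constituent")
    # The first token after the opening bracket is the node label.
    label, words = tokens[0], tokens[1:]
    attempts: List[List[str]] = [[]]
    for word in words:
        if word == "#":
            attempts.append([])        # moment of interruption: start a new attempt
        else:
            attempts[-1].append(word)  # word belongs to the current attempt
    return label, attempts
```

Applied to the KCA example, split_repair("[VV0v# cha # change ]") returns the label as printed together with the two attempts cha and change.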

However, although the CHRISTINE speech-repair notation is less informative than the full Levelt/Howell & Young scheme, and seems as simple as is consistent with offering an adequate description of repair structure, applying it consistently is not always straightforward. In the first place, as soon as the annotation scheme includes any system for marking speech repairs, analysts are obliged to decide whether particular stretches of wording are in fact repairs or well-formed constructions, and this is often unclear. Sampson (1998) examined a number of indeterminacies that arise in this area; one of these is between repairs and appositional structures, as in:

KSS.05002 she can’t be much cop if she’d open her legs to a first date to a Dutch s- sailor

– where to a Dutch s- sailor might be intended to replace to a first date as the true reason for objecting to the girl, but alternatively to a Dutch s- sailor could be an appositional phrase giving fuller and better particulars of the nature of her offence. Annotation ought not systematically to require guesswork, but it is hard to see how a neutral notation could be devised that would allow the analyst to suspend judgment on such a fundamental issue as whether a stretch of wording is a repair or a well-formed construction.

Even greater problems are posed by a not uncommon type of ill-formed utterance that might be called “syntactically Markovian”, in which each element coheres logically with what immediately precedes but the utterance as a whole is not coherent. The following examples come from the London-Lund Corpus, with text numbers followed by first and last tone-unit numbers for the respective extracts:

S.1.3 0901–3 … of course I would be willing to um <pause => come into the common-room <pause => and uh <pause – – –> in fact I would like nothing I would like better [speaker is undergraduate, age ca 36, describing interview for Oxbridge fellowship]

S.5.5 0539–45 and what is happening <pause=> in Britain today <pause –> is ay- demand for an entirely new foreign policy quite different from the cold war policy <pause => is emerging from the Left [speaker is Anthony Wedgwood Benn MP on radio discussion programme]

In the former example, nothing functions simultaneously as the last uttered word of an intended sequence I would like nothing better and the first uttered word of an implied sequence something like there is nothing I would like better. In the latter, the long NP an entirely new foreign policy quite different from the cold war policy appears to function both as the complement of the preposition for, and as subject of is emerging. In such cases one cannot meaningfully identify a single point where one grammatical plan is abandoned in favour of another. Because these structures involve phrases which simultaneously play one grammatical role in the preceding construction and a different role in the following construction, they resist analysis in terms of tree-shaped constituency diagrams (or, equivalently, labelled bracketing of the word-string). Yet constituency analysis is so solidly established as the appropriate formalism for representing natural-language structure in general that it seems unthinkable to abandon it merely in order to deal with one special type of speech repair.

Logical Distinctions Dependent on the Written Medium

There are cases where grammatical category distinctions that are highly salient in written English seem much less significant in the spoken language, so that maintaining them in the annotation scheme arguably misrepresents the structure of speech. Probably the most important of these is the direct/indirect speech distinction. Written English takes great pains to distinguish clearly between direct speech, involving a commitment to transmit accurately the quoted speaker’s exact wording, and indirect speech which preserves only the general sense of the quotation. The SUSANNE annotation scheme uses categories which reflect this distinction (Q v. Fn). However, the most crucial cues to the distinction are orthographic matters such as inverted commas, which lack spoken counterparts. Sometimes the distinction can be drawn in spoken English by reference to pronouns, verb forms, vocatives, etc.:

KD6.03060 … he says he hates drama because the teacher takes no notice, he said one week Stuart was hitting me with a stick and the teacher just said calm down you boys …

– the he of he hates (rather than I) implies that the complement of says is indirect speech; me implies that the passage beginning one week is a direct quotation, and the imperative form calm and vocative you boys imply that the teacher is quoted directly. But in practice these cues frequently conflict rather than reinforcing one another:

KCT.10673 [reporting speaker’s own response to a directly-quoted objection]: I said well that’s his hard luck!

KCJ.01053–5 well Billy, Billy says well take that and then he’ll come back and then he er gone and pay that

In the KCT example, the discourse item well and the present tense of [i]s after past-tense said suggest direct speech, but his (which from the context denotes the objector) suggests indirect speech. Likewise in the KCJ example, well and the imperative take imply direct speech, he’ll rather than I’ll implies indirect speech. Arguably, imposing a sharp two-way direct v. indirect distinction on speech is a distortion; one might instead feel that speech uses a single construction for reporting others’ utterances, though different instances may contain more or fewer indicators of the relative directness of the report. On the other hand, logically speaking the direct v. indirect speech distinction is so fundamental that an annotation scheme which failed to recognize it could seem unacceptable. (To date, CHRISTINE analyses retain the distinction.)

Nonstandard Usage

Real-life British speech contains many differences from standard usage with respect to both individual words and syntactic patterns.

In the case of wordtagging, the SUSANNE rule (Sampson 1995: §3.67) was that words used in ways characteristic of nonstandard dialects are tagged in the same way as the words that would replace them in standard English. This rule was reasonable in the context of written English, where nonstandard forms are a peripheral nuisance, but it quickly became apparent within the CHRISTINE project that the rule is quite impractical for analysing spontaneous speech which contains a high incidence of such forms. For CHRISTINE, this particular rule has been reversed; in general, words used in nonstandard grammatical functions are given the same wordtags as their standard uses, but the phrases containing them are tagged in accordance with their grammatical function in context.

This revised rule tends to be unproblematic for pronouns and determiners, thus in:

KP4.03497 it’s a bit of fun, it livens up me day

KCT.10705 she told me to have them plums

the words me and them are wordtagged as object pronouns (rather than as my, those), but the phrases headed by day and plums are tagged as noun phrases. It is more difficult to specify a predictable way to apply such a rule in the case of nonstandard uses of strong verb forms, where the word used nonstandardly is head of a phrase requiring a tag of its own. Standard base forms can be used in past contexts, e.g.:

KCJ.01096–8 a man bought a horse and give it to her, now it’s won the race

and the solution of phrasetagging such an instance as a past-tense verb group (Vd) is put into doubt because frequently nonstandard English omits the auxiliary of the standard perfective construction, suggesting that give might be replacing given rather than gave; cf.:

KCA.02536 What I done, I taped it back like that.

KCA.02572 What it is, when you got snooker on and just snooker you’re quite <pause> content to watch it …

Eisikovits (1987: 134) argues in effect that the tense system exemplified in clauses like What I done is the same as that of standard English, but that a single form done is used for both past tense and past participle in the nonstandard dialect (in the same way that single forms such as said, allowed are used for both functions in the standard language, in the case of many other verbs); I done here would correspond straightforwardly to standard I did. (Eisikovits’s article is based on data from an Australian urban dialect, but, as Trudgill & Chambers (1991: 52) rightly point out, the facts are similar for many UK dialects.) But Eisikovits’s analysis seems to overlook cases like the you got snooker on example (which are quite common in our material) where got clearly corresponds to standard have got, meaning “have”, and not to a past tense.

It is quite impractical for annotation to be based on fully adequate grammatical analyses of each nonstandard dialect in its own terms; but it is not easy to specify consistent rules for annotating such uses as deviations from the known, standard dialect. The CHRISTINE project has attempted to introduce predictability into the analysis of cases such as those just discussed, by recognizing an extra nonstandard-English “tense” realized as past participle not preceded by auxiliary, and by ruling (as an exception to the general rule quoted earlier) that any verb form used in a nonstandard structure with past reference will be classified as a past participle (thus give in the KCJ example above is wordtagged as a nonstandard equivalent of given). This approach does work well for many cases, but it remains to be seen whether it deals satisfactorily with all the usages that arise.

At the syntactic level, an example of a nonstandard construction requiring adaptation of the written-English annotation scheme would be relative clauses containing both relative pronoun and undeleted relativized NP, unknown in standard English but usual in various nonstandard dialects, e.g.:

KD6.03075 … bloody Colin who, he borrowed his computer that time, remember?

Here the CHRISTINE decision is to treat the relativized NP (he) as appositional to the relative pronoun. For the case quoted, this works; but it will not work if a case is ever encountered where the relativized element is not the subject of the relative clause. Examples like this raise the question what it means to specify consistent grammatical annotation standards applicable to a spectrum of different dialects, rather than a single dialect. Written English usually conforms more or less closely to the norms of the national standard language, so that grammatical dialect variation is marginal and annotation standards can afford to ignore it. In the context of speech, it cannot be ignored, but the exercise of specifying annotation standards for unpredictably varying structures seems conceptually confused.

Dialect Difference v. Performance Error

Special problems arise in deciding whether a turn of phrase should be annotated as well-formed with respect to the speaker’s nonstandard dialect, or as representing standard usage but with words elided as a performance error. Speakers often do omit necessary words, e.g.:

KD2.03102–3 There’s one thing I don’t like <pause> and that’s having my photo taken. And it will be hard when we have to photos.

– it seems safe to assume that the speaker intended something like have to show photos. One might take it that a similar process explains the final words, shouting him, in:

KD6.03154 oh she was shouting at him at dinner time <shift shouting> Steven <shift> oh god dinner time she was shouting him.

where at is missing; but this is cast in doubt when other speakers, in separate samples, are found to have produced:

KPC.00332 go in the sitting room until I shout you for tea

KD2.02798 The spelling mistakes only occurred when <pause> I was shouted.

– this may add up to sufficient evidence for taking shout to have a regular transitive use in nonstandard English.

This problem is particularly common at the ends of utterances, where the utterance might be interpreted as broken off before it was grammatically complete (indicated in the SUSANNE scheme by a “#” terminal node as last daughter of the root node), but might alternatively be an intentional nonstandard elision. In:

KE2.08744 That’s right, she said Margaret never goes, I said well we never go for lunch out, we hardly ever really

the words we hardly ever really would not occur in standard English without some verb (if only a placeholding do), so the sequence would most plausibly be taken as a broken-off utterance of some clause such as we hardly ever really go out to eat at all; but it is not difficult to imagine that the speaker’s dialect might allow we hardly ever really for standard we hardly ever do really, in which case it would be misleading to include the “#” sign.

It seems inconceivable that a detailed annotation scheme could fail to distinguish difference of dialect from performance error; indeed, a scheme which ignored this distinction might seem offensive. But analysts will often in practice have no basis for applying the distinction to particular examples.

Transcription Inadequacies

One cannot expect every word of a sample of spontaneous speech recorded in field conditions to be accurately transcribable from the recordings. Our project relies on transcriptions produced by other researchers, which contain many passages marked as “unclear”; the same would undoubtedly be true if we had chosen to gather our own material. A structural annotation system needs to be capable of assigning an analysis to a passage containing unclear segments; to discard any utterance or sentence containing a single unclear word would require throwing away too many data, and would undesirably bias the retained collection of samples towards utterances that were spoken carefully and may therefore share some special structural properties.

The SUSANNE scheme uses the symbol Y to label nodes dominating stretches of wholly unclear speech, or tagmas which cannot be assigned a grammatical category because they contain unclear subsegments that make the categorization doubtful. This system is unproblematic, so long as the unclear material in fact consists of one or more complete grammatical constituents. Often, however, this is not so; e.g.:

KCT.10833 Oh we didn’t <unclear> to drink yourselves.

Here it seems sure that the unclear stretch contained multiple words, beginning with one or more words that complete the verb group (V) initiated by didn’t; and the relationship of the words to drink yourselves to the main clause could be quite different, depending on what the unclear words were. For instance, if the unclear words were give you anything, then to drink would be a modifying tagma within an NP headed by anything; on the other hand, if the unclear stretch were expect you, then to drink would be the head of an object complement clause. Ideally, a grammatical annotation scheme would permit all the clear grammar to be indicated, but allow the analyst to avoid implying any decision about unresolvable issues such as these.

Given that clear grammar is represented in terms of labelled bracketing, however, it is very difficult to find usable notational conventions that avoid commitment about the structures to which unclear wording contributes. Our best attempt so far at defining notational conventions for this area is a set of rules which prescribe, among other things, that the Y node dominating an inaudible stretch is attached to the lowest node that clearly dominates at least the first inaudible word, and that clear wording following an inaudible stretch is attached to the Y node above that stretch if the clear wording could be part of some unknown grammatical constituent that is initiated within the inaudible stretch (even if it could equally well not be).

These conventions are reasonably successful at enabling analysts to produce annotations in a predictably consistent way; but they have the disadvantage that many structures produced are undoubtedly different from the grammatical structures of the wording actually uttered. For instance, in the example above, the Y above the unclear stretch is made a daughter of the V dominating didn’t, because that word will have been followed by an unclear main verb; and to drink yourselves is placed under the Y node, because any plausible interpretation of the unclarity would make the latter words part of a tagma initiated within the unclear stretch. Yet there is no way that to drink yourselves could really be part of a verb group tagma beginning with didn’t.
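As a minimal illustrative sketch (not taken from the SUSANNE or CHRISTINE documentation), the attachment just described for example KCT.10833 can be written out as a labelled bracketing. The Python names Node and attach_unclear are hypothetical; the root label S and the flat treatment of all wording outside the V and Y nodes are simplifying assumptions, and wordtags are omitted. The code only records the analyst's decisions about which node hosts the Y and which clear words fall under it; it does not make those decisions.

```python
from dataclasses import dataclass, field
from typing import List, Union


@dataclass
class Node:
    """A labelled tree node; children are sub-nodes or plain word strings."""
    label: str
    children: List[Union["Node", str]] = field(default_factory=list)

    def bracket(self) -> str:
        """Render the subtree as a labelled bracketing."""
        parts = [c.bracket() if isinstance(c, Node) else c for c in self.children]
        return "[" + self.label + " " + " ".join(parts) + "]"


def attach_unclear(host: Node, following_clear: List[str]) -> Node:
    """Hang a Y node under `host` and place the given clear words under it.

    `host` is assumed to be the lowest node that clearly dominates the first
    inaudible word; `following_clear` is the clear wording that could belong
    to a constituent begun inside the inaudible stretch. Both are chosen by
    the analyst (hypothetical helper, for illustration only).
    """
    y = Node("Y", ["<unclear>"] + list(following_clear))
    host.children.append(y)
    return y


# Example KCT.10833: "Oh we didn't <unclear> to drink yourselves."
verb_group = Node("V", ["didn't"])
sentence = Node("S", ["Oh", "we", verb_group])  # "S" is an assumed root label
attach_unclear(verb_group, ["to", "drink", "yourselves"])

print(sentence.bracket())
```

Printing the structure gives [S Oh we [V didn't [Y <unclear> to drink yourselves]]], which makes visible the artificial placement of to drink yourselves inside the verb group.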

Provided that users of the Corpus bear in mind that a tree structure which includes a Y node makes only limited claims about the actual structure produced by the speaker, these conventions are not misleading. But at the same time they are not very satisfying.

Conclusion

In annotating written English, where one is drawing on an analytic tradition evolved over centuries, it seems on the whole to be true that most annotation decisions have definite answers; where some particular example is vague between two categories, these tend to be subcategories of a single higher-level category, so a neutral fallback annotation is available. (Most English noun phrases are either marked as singular or marked as plural, and the odd exceptional case such as the fish can at least be classified as a noun phrase, unmarked for number.) One way of summarizing many of the problems outlined in the preceding sections is to say that, in annotating speech, whose special structural features have had little influence on the analytic tradition, ambiguities of classification constantly arise that cut across traditional category schemes. In consequence, not only is it often difficult to choose a notation which attributes specific properties to an example; unlike with written language, it is also often very difficult to define fallback notations which enable the annotator to avoid attributing properties for which there is no evidence, while allowing what can safely be said to be expressed.

Some members of the research community may be tempted to feel that a paper focusing on these problems ranks as self-indulgent hand-wringing in place of serious effort to move the discipline forward. We hope that our earlier discussion of software engineering will have shown why that feeling would be misguided. Nothing is easier and more appealing than to plunge into the work of getting computers to deliver some desired behaviour, leaving conceptual unclarities to be sorted out as and when they arise. Huge quantities of industrial resources have been wasted over the decades through allowing IT workers to adopt that approach. Natural language processing was one of the first application areas ever proposed for computers (by Alan Turing in 1948 – Hodges 1983: 382); fifty years later, the level of success of NLP software (while not insignificant) does not suggest that computational linguistics can afford to go on ignoring lessons that have already been painfully learned by more central sectors of the IT industry.

Effort put into automatic analysis of natural language implies a prior requirement for serious effort devoted to defining and debugging detailed standard schemes of linguistic analysis. Our SUSANNE and CHRISTINE projects have been and are contributing to this goal, but they are no more than a beginning. We urge other computational linguists to recognize this area as a priority.

Acknowledgment

The research reported here was supported by grant R000 23 6443, “Analytic Standards for Spoken Grammatical Performance”, awarded by the Economic and Social Research Council (UK).

REFERENCES

Boehm, B.W. 1981 Software Engineering Economics. Prentice-Hall.

Edwards, Jane A. 1992 “Design principles in the transcription of spoken discourse”. In J. Svartvik, ed., Directions in Corpus Linguistics, Mouton de Gruyter.

Eisikovits, Edina 1987 “Variation in the lexical verb in Inner-Sydney English”. Australian Journal of English 7.1–24; our page reference is to the reprint in Trudgill & Chambers (1991).

Hodges, A. 1983 Alan Turing: The Enigma of Intelligence. Burnett Books.

Howell, P. & K. Young 1990 “Speech repairs: report of work conducted October 1st 1989–March 31st 1990”. Department of Psychology, University College London.

Howell, P. & K. Young 1991 “The use of prosody in highlighting alterations in repairs from unrestricted speech”. Quarterly Journal of Experimental Psychology 43A.733–58.

Langendoen, D.T. 1997 Review of Sampson (1995). Language 73.600–3.

Leech, G.N. & Elizabeth Eyes 1997 “Syntactic annotation: treebanks”. Ch. 3 of R.G. Garside et al., eds., Corpus Annotation, Longman.

Levelt, W.J.M. 1983 “Monitoring and self-repair in speech”. Cognition 14.41–104.

Mason, O. 1997 Review of Sampson (1995). International Journal of Corpus Linguistics 2.169–72.

Sampson, G.R. 1995 English for the Computer. Clarendon Press (Oxford).

Sampson, G.R. 1998 “Consistent annotation of speech-repair structures”. In A. Rubio et al., eds., Proceedings of the First International Conference on Language Resources and Evaluation, Granada, Spain, 28–30 May 1998, vol. 2.

Sommerville, I. 1992 Software Engineering (4th ed.). Addison-Wesley (Wokingham, Berks.).

Stenström, Anna-Brita 1990 “Lexical items peculiar to spoken discourse”. In Svartvik (1990).

Svartvik, J., ed. 1990 The London-Lund Corpus of Spoken English. Lund University Press.

Trudgill, P. & J.K. Chambers, eds. 1991 Dialects of English. Longman.