Geoffrey Sampson


The CHRISTINE Project

SUSANNE meets spoken English

Sponsored by the Economic & Social Research Council (UK), the CHRISTINE project set out to extend my SUSANNE analytic scheme and Corpus to cover spoken English, and particularly spontaneous, informal spoken English.

The resulting CHRISTINE Corpus is now ready and available for use. It offers structural analyses of a cross-section of 1990s spontaneous speech from all British regions, social classes, etc. For details on its current location and how to download it, see my resources page. The CHRISTINE documentation file is also available as a Web page (250 kb).

(The CHRISTINE project terminated formally in December 1999. While it was in being, the project annotated considerably more material than the sample now published, but the remainder was not brought into a suitable state for publication by the end of the project. The current CHRISTINE Corpus was originally referred to as “CHRISTINE Stage I”, in the expectation that it would soon be replaced by a larger corpus. It is still hoped to do this eventually, but the work remaining to be done has turned out to be considerably more than was envisaged in 1999; hence the short name “CHRISTINE Corpus” is now used for the corpus currently available.)

The remainder of this page describes the overall nature and aims of the CHRISTINE project. The page has evolved out of one I wrote before the project began, which was expressly designed to gather advice on how the work could best serve the international research community. I am very grateful to the many people who responded with thoughtful and useful comments.

The CHRISTINE enterprise is most easily explained by reference to the earlier SUSANNE materials. The SUSANNE analytic scheme is a notation for indicating the structural (grammatical) properties of samples of real-life English. The SUSANNE Corpus is a machine-readable sample of written English to which the notation scheme has been applied. The scheme aims to be:

The scheme is defined in a 500-page book, English for the Computer. Although other analytic schemes are available, I believe the SUSANNE scheme has no real rival internationally in terms of comprehensiveness and precision, and this has been confirmed by independent commentators (e.g. “the detail ... is unrivalled”, Terry Langendoen in Language vol. 73, 1997, p. 600; “Compared with other possible alternatives such as the Penn Treebank ... [t]he SUSANNE corpus puts more emphasis on precision and consistency”, Lin Dekang in Abeillé, ed., Treebanks, 2003, p. 321). The Corpus is also available on the internet (see my resources page), and is in use by numerous researchers all over the world.

The SUSANNE scheme was developed through the experience of applying it to corpora (the published SUSANNE Corpus, and other unpublished material). All this material was written English. After the scheme was essentially complete, we explored the problem of applying it to spoken language. This led to definition of extensions to the notation, documented in ch. 6 of English for the Computer; but these represent only the beginnings of what would be needed for a fully-defined spoken-English annotation scheme. The annotated speech samples we produced were too messy and limited to be used as more than a trial run.

The CHRISTINE project produced an annotated machine-readable corpus of representative samples of spoken English, comparable in size and “polish” to the SUSANNE Corpus; and in so doing it developed detailed definitions of the speech notations, so as to produce an analytic scheme for spoken English comparable in exactness and comprehensiveness to the scheme for written English — and compatible with this scheme. (The work also uncovered some issues that remained vague in the published SUSANNE scheme, so that it led to improvements in the scheme as it applies to the written language too.)

The samples annotated in the CHRISTINE project were selected from speech material that was already available in transcribed form. We normally worked with machine-readable orthographic representations of spoken English (including some special markings for phonetic issues such as hesitation phenomena and unclear words), though in a very few cases we checked the annotations against the original recordings.

The Research Team

The following researchers worked on the CHRISTINE project:

Alan MORRIS

Anna RAHMAN

Coverage

The new project set out to do for spoken English what SUSANNE did for written English. This includes the detailed annotation of grammar in the ordinary sense. It is clear that there are (at least) statistical differences between the ways in which speech and writing exploit the range of grammatical constructions provided by the language, and at present we have little hard evidence on the precise nature of these differences. But spoken language has additional types of structural phenomenon which are not usually found in writing, for which new annotation standards are needed.

Probably most significant are speech management phenomena, whereby wording is edited “on the fly”: computer speech processing needs ways of distinguishing between the wording made obsolete by later edits and the wording which replaces it. Other structurally significant issues more or less peculiar to the speech mode are discourse items used to mark pragmatic force, and hesitation phenomena, whose incidence relative to surrounding structure is potentially an important cue for automatic analysis. Roger Moore of DRA Malvern has written of the “overwhelming need for agreed standards of ... annotation ... [for] normal, everyday, non-prepared speech [which] is replete with repetition, false-starts, repairs, partial utterances, ‘uhms’ and ‘errs’ etc.”

Let me make things more concrete with an example, taken from the London-Lund Corpus. (Hyphens and equals signs represent long and short pauses.)

  right well let's er --= let's look at the applications -- erm -
  let me just ask initially this --- I discussed it with er 
  Reith er but we'll = have to go into it a bit further -- is it
  is it within our erm er = are we free er to er draw up a
  rather = exiguous list -- of people to interview

An adequate structural annotation of this utterance must show, for instance, that the first let’s in line 1 is a false start which is replaced by the clause beginning with the second let’s, and that (a more complex situation) the interrogative clause from are we in line 4 to the end of the passage replaces a differently-phrased interrogative clause which itself contains a false start (the is it at the end of line 3) that is replaced by the wording is it within our, and which breaks off in the middle of a noun phrase. Only with a notation that expresses these editing relationships between complete and incomplete grammatical units does it become useful or meaningful to indicate the “ordinary” grammatical properties within the separate units.
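The editing relationships just described can be pictured with a small data structure. The following Python sketch is purely illustrative: the `Repair` class and the `surviving_words` function are hypothetical names, not part of the CHRISTINE notation, and pauses and filled pauses are left out for simplicity. It shows only how nesting allows a false start to sit inside a clause that is itself abandoned.

```python
from dataclasses import dataclass
from typing import List, Union


@dataclass
class Repair:
    """Abandoned wording together with the wording that replaces it."""
    abandoned: List["Node"]
    replacement: List["Node"]


# A node is either a plain word or a (possibly nested) repair.
Node = Union[str, Repair]


def surviving_words(nodes: List[Node]) -> List[str]:
    """Return the wording left once every repair has been resolved."""
    words: List[str] = []
    for node in nodes:
        if isinstance(node, Repair):
            # Only the replacement survives; the abandoned wording is dropped.
            words.extend(surviving_words(node.replacement))
        else:
            words.append(node)
    return words


# Line 1 of the example: the first "let's" is a false start.
opening = ["right", "well",
           Repair(abandoned=["let's"],
                  replacement="let's look at the applications".split())]

# Lines 3 to 5: an abandoned question, itself containing a false start,
# is replaced by the question beginning "are we".
question = [Repair(
    abandoned=[Repair(abandoned=["is", "it"],
                      replacement="is it within our".split())],
    replacement=("are we free to draw up a rather exiguous list "
                 "of people to interview").split())]

print(" ".join(surviving_words(opening)))
print(" ".join(surviving_words(question)))
```

A real annotation would of course keep the abandoned wording available for study rather than discard it; the function above merely demonstrates that the replaced and replacing stretches can be told apart mechanically once the relationship is recorded.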

The notation also needs to indicate that the initial right and well are “discourse items” having special roles in speech which (particularly in the case of well) have no close analogue in writing, so they need wordtags drawn from a set that distinguishes the various roles of discourse items. And the notation must specify some way of integrating hesitation phenomena, such as the silent pauses indicated by hyphens and equals signs and the filled pauses indicated by er, erm, into the structure of editing relationships. Thus, if hesitations frequently occur where constituents are broken off and followed by replacements, a consistent decision is needed on whether the hesitation items are treated grammatically as part of the broken-off constituent, part of the replacement constituent, or as separate from both.

It may not matter very much which notational alternative is chosen — so long as one of them is explicitly adopted, and all similar instances in the annotated material are represented in the same way. Only then can meaningful statistical data be extracted from an analysed corpus that will allow researchers to establish what patterns occur in real-life speech. As Jane Edwards of the University of California has put it:

  “The single most important property of any data base for purposes of computer-assisted research is that similar instances be encoded in predictably similar ways.”

It is the role of an explicit annotation scheme to enable this to be achieved, by prescribing decisions on a multitude of issues where different notations would each be inherently defensible — so that separate analysts using the scheme to annotate the same sample must produce identical analyses.
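The earlier point about hesitation items at a break-off can be made concrete with a short Python sketch. Everything in it is an assumption made for illustration: the tokenization of pause marks, the `Policy` names, and the `segment` function are hypothetical and do not reproduce the decision actually taken in the CHRISTINE scheme. The sketch shows only that each of the three alternatives is mechanically definable, and that a fixed choice yields the same segmentation every time.

```python
from enum import Enum
from typing import List, Tuple

# Assumed tokenization of the pause marks and filled pauses in the example.
HESITATIONS = {"er", "erm", "-", "--", "---", "="}


class Policy(Enum):
    WITH_ABANDONED = 1    # hesitations belong to the broken-off constituent
    WITH_REPLACEMENT = 2  # hesitations belong to the replacement constituent
    SEPARATE = 3          # hesitations stand outside both


def segment(tokens: List[str], restart: int, policy: Policy
            ) -> Tuple[List[str], List[str], List[str]]:
    """Split tokens into (abandoned, separate, replacement).

    `restart` is the index of the first word of the replacement wording.
    """
    # Walk back over the run of hesitation items just before the restart.
    cut = restart
    while cut > 0 and tokens[cut - 1] in HESITATIONS:
        cut -= 1
    abandoned = tokens[:cut]
    pauses = tokens[cut:restart]
    replacement = tokens[restart:]
    if policy is Policy.WITH_ABANDONED:
        return abandoned + pauses, [], replacement
    if policy is Policy.WITH_REPLACEMENT:
        return abandoned, [], pauses + replacement
    return abandoned, pauses, replacement


tokens = ["let's", "er", "--", "=", "let's", "look"]
for policy in Policy:
    print(policy.name, segment(tokens, 4, policy))
```

Whichever policy is chosen, counts of hesitations per constituent type are comparable across a corpus only if the same policy has been applied throughout.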

All properties covered relate to grammatical structure in some broad sense. It would be over-ambitious to try to cover rhetorical issues, e.g. speaker-plan hierarchies as discussed by writers such as van Dijk & Kintsch.

Strategic Decisions

This heading refers to the issues on which I used a previous version of this Web page to consult the research community, in order to maximize the value of the project output. I proposed tentative decisions, at a stage when they could still easily be changed, and invited people to argue for changes.

However, on the whole the respondents tended to confirm that they saw my tentative decisions as appropriate.

The two main issues were:

Choice of Speech Material

Ideally, the speech samples to be analysed should meet six conditions:

  1. they should be a representative cross-section of spoken English, in something like the way that the LOB and Brown Corpora are representative for written English;
  2. they should already have been accurately transcribed;
  3. they should be available in the form of digitized acoustic signals;
  4. they should represent British English rather than (or, conceivably, as well as) American English;
  5. they should be as free as possible of copyright restrictions that would limit the possibility of freely distributing the eventual analysed corpus to the research community;
  6. they should be samples which other researchers are converging on for their own work — each kind of research tends to add value for others using the same materials in different ways.

In practice, I decided to ignore the copyright issue, (5), since the rules governing how various corpus resources may be used change unpredictably year by year. In the event, there were no copyright restrictions on the material included in the published version of the CHRISTINE Corpus, and this is available free and without strings to anyone who wants to use it.

Under point (4) I saw it as essential for our project to use British samples, not a transatlantic mixture. This is not just because Americans, after a late start, are now heavily engaged in corpus research themselves, and the British taxpayer who funded this project was entitled to expect his money to be used on keeping our version of the language in the computational-linguistics game — though that is obviously a relevant consideration. But also, the national language-varieties differ far more in the spoken mode than they do in writing. British researchers on the SUSANNE project had no difficulty in working on written American English. But I seriously doubt whether a British research team would be able to cope with spoken American English, even if we wanted to.

Of the other points, (6) is desirable but, I believe, a lower priority than points (1) to (4). If speech samples had been available meeting each of conditions (1) to (4), I would gladly have used them for this project even if no-one else had been working with them.

Unfortunately, no available samples met all four of conditions (1) to (4): some compromise was unavoidable.

I took (1) to be the overriding desideratum; in consequence (3) had, regrettably, to be given a low priority. The project used representative samples of British English which had been accurately transcribed, by transcribers sensitive to phonetic considerations, but the CHRISTINE Corpus is not linked to published digitized acoustic signals.

We decided to draw our speech material from a variety of sources, each of which has its own strengths and weaknesses. However, since the material actually published comes exclusively from the British National Corpus, no purpose would be served in discussing our other sources here.

The spoken part of the British National Corpus is unrivalled as a representative cross-section of real-life 1990s speech from all geographical regions, social classes, etc. At present it is not available as digitized speech signals, but moves are afoot which may lead to this in the future. On the other hand, the material has been transcribed in a way that is phonetically not very sophisticated (understandably, in view of the large size of this resource). Its transcriptions represent speech as ordinary written prose, with capitalization and punctuation marks; this is misleading as a way of representing spoken realities. The transcriptions do include some indicators of hesitation phenomena (ums and ers), but transcribing these accurately is believed not to have been a high priority for the compilers of BNC. These weaknesses consequently are inherited by the CHRISTINE Corpus.

Information to be Included in the Annotation

Here I am aware of three issues:

Phonetic detail in the base transcription

If we were going to produce our own transcriptions from recordings, there would be large issues about transcribing suprasegmentals. British and American speech researchers have contrasting approaches to registering intonation: Britons use an old-established (pre-computer) analytic tradition associated with J.D. O’Connor and G.F. Arnold, whereas Americans use a very different approach (“ToBI”) associated with Janet Pierrehumbert. The difference may correspond partly to real differences between the respective dialects (an issue studied by Francis Nolan of Cambridge University); and arguably the British approach is more empirical, the American more theory-oriented. For both reasons, the O’Connor-Arnold style would probably be better suited to this project. However, in practice the only control we had was the choice between the transcription practices of existing bodies of transcribed speech. Since the material in the published CHRISTINE Corpus is all taken from the BNC, issues about how to deal with the detailed suprasegmental annotations included in the London-Lund Corpus are not relevant for CHRISTINE Corpus users.

Structural/grammatical categories covered

In principle, the CHRISTINE structural annotation is intended to include all categories of information comprised in the existing SUSANNE scheme for written language, plus three further categories of information (as outlined in ch. 6 of English for the Computer).

The existing SUSANNE annotation includes:

The CHRISTINE scheme covers the following additional kinds of phenomenon:

pauses:
Both silent pauses and “filled pauses” (er, mm) are recorded and integrated into parse-trees for the surrounding wording, so that researchers can study patterns in the incidence of hesitation relative to grammatical structure.
discourse items:
The wordtag scheme for written English has been supplemented by extra tags for classes of discourse item, in the sense recognized by the Lund group. English for the Computer proposes a set of 14 discourse-item wordtags, for categories such as Engager (I_mean, mind_you, you_know, ... ) and Greeting (hi, hello, good_morning, ... ). Practical experience with the CHRISTINE material has led to some additions to this list.
speech repairs:
English for the Computer surveys previous proposals for analysing the structure of spontaneous speech including repetitions, mid-construction changes of tack, etc. It proposes a system which aims to indicate those aspects of speech repairs that analysts can determine reasonably objectively (without attempting to specify issues that would regularly have to be guessed at); and it lays down a consistent method for integrating annotations of repaired wording into parse-trees for utterances which, in other parts, may be structurally well-formed.

Method of Encoding the Annotations

Here the main issue seemed to be: to SGML or not to SGML?

When the SUSANNE Corpus was published in 1992, it did not use SGML, for reasons that seemed good at least at that period. It has a very regular structure: each text is encoded as a long sequence of records, one per word, and each record has an identical number of fields with consistent functions from record to record. This means that it would be a relatively trivial task to reformat the files as SGML documents, but for most purposes there would be little gained by doing so. SGML is valuable for documents with complex internal structuring, requiring a substantial DTD to state the allowable arrangements of elements of different categories. But it has the drawback that an SGML-encoded file is hard for a human to read directly; it is intended to be read by specialist software.

One of the factors that has helped the SUSANNE Corpus to find a wide clientele of users is that, although it incorporates a great deal of information of different kinds, anyone can easily scan a SUSANNE file, read the wording on which it is based, and grasp quite a lot of the content of the annotations. Moving to SGML would destroy this transparency, for no obvious gain — if anyone is working with an application that needs to use SUSANNE in SGML format, he could easily write a short program to insert appropriate tags.
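As an indication of how short such a program could be, here is a minimal Python sketch. It rests on several assumptions that are not taken from the SUSANNE documentation: that fields are tab-separated, that a record has just the three hypothetical fields named in `FIELD_NAMES`, and that a `<w>` element with `ref` and `tag` attributes would be a suitable target markup. The sample values are placeholders, not real reference numbers or wordtags.

```python
from xml.sax.saxutils import escape, quoteattr

# Hypothetical field inventory, for illustration only; the real record
# layout is defined in the corpus documentation.
FIELD_NAMES = ("ref", "wordtag", "word")


def parse_record(line: str) -> dict:
    """Split one record (one word per line) into named fields."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) != len(FIELD_NAMES):
        raise ValueError(
            f"expected {len(FIELD_NAMES)} fields, got {len(fields)}")
    return dict(zip(FIELD_NAMES, fields))


def to_tagged(record: dict) -> str:
    """Wrap a record in an (assumed) <w> element with quoted attributes."""
    return "<w ref={} tag={}>{}</w>".format(
        quoteattr(record["ref"]),
        quoteattr(record["wordtag"]),
        escape(record["word"]))


sample = "X01:001\tTAG1\texiguous\nX01:002\tTAG2\tlist\n"
for line in sample.splitlines():
    print(to_tagged(parse_record(line)))
```

Because every record has the same number of fields, the field-count check doubles as a sanity test on the input; a fuller conversion would also need a DTD and some treatment of the parse field, which this sketch does not attempt.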

When my previous Web page put these arguments forward as reasons not to use SGML for CHRISTINE, one respondent commented “I couldn’t agree more”! The initial version of the CHRISTINE Corpus does not use SGML; the material it takes from the BNC, which is SGML-based, has had its SGML tags stripped out (but the CHRISTINE version includes reference numbers linking words back to their locations in the original BNC files).

However, after the project got under way, the arguments changed somewhat, with the development by Nancy Ide and Jean Véronis of the Corpus Encoding Standard. The Corpus Encoding Standard is a specialized application of SGML, compliant with the Guidelines of the Text Encoding Initiative, that aims to offer a way of formatting language corpora as “plug-and-play” resources for language engineering applications. Merely to encode CHRISTINE into SGML would not obviously “buy” very much in practice; but to encode it into a specific application of SGML, using a fixed DTD also used for many other corpora, could buy a fair amount.

If the Corpus Encoding Standard does succeed in “taking off” as an accepted standard, it would be good in due course to distribute the CHRISTINE Corpus (and indeed our other annotated corpora) in alternative, CES-conformant versions for users who prefer that. (At the time of writing, it is not clear whether or not the CES is destined to achieve general acceptance.)

The Name CHRISTINE

Before this project began, I referred to it as the “Spoken SUSANNE project”. But it is useful to have a short, distinctive name for a separate research undertaking. Apart from anything else, we needed a name in order to create structure in our mass of electronic files at Sussex.

SUSANNE stood for “Surface and underlying structural analyses of natural English”. (One of the N’s was taken from “analyses”.) But the name was also appropriate for reasons that I shan’t go into here, having to do with the life of St Susanna.

Our new project was “daughter of SUSANNE”. But Susanna, as a holy virgin, had no daughter. So I chose a “successor” name in terms of the calendar. St Susanna’s day is 23rd July. 24th July is the day dedicated to SS Christina of Tyre and Christina the Astonishing. (It is also the day of our local Sussex saint, Lewina of Seaford — but “Lewina” seemed too strange a name to make a satisfactory project title.)

St Christina of Tyre makes a good patroness for a project on speech. We are told that, after being condemned to have her tongue cut out, she carried on speaking just as clearly as ever. Picking up her excised tongue, she threw it at the judge, blinding him in one eye. (A neat trick, which we shall have to bear in mind in case we have any trouble with Research Council assessors.)

If you insist on an acronym, CHRISTINE can just about be twisted into that too: “Chrestomathized speech trees in natural English”. (Ouch!) At any rate, it makes a distinctive and attractive name.



Geoffrey Sampson

last changed 12 Jun 2004