CHRISTINE Corpus: Documentation

Geoffrey Sampson
Department of Informatics
University of Sussex

Release 2, 18 August 2000

“Releases” refer to modifications to the Corpus as a tar file distributed by ftp. Inevitably, there will be occasions when modifications to this Documentation file as a Web page run ahead of the version of the same file that is included within the Corpus tar file. (When the Documentation file is changed, it takes time to create a new compressed tar file of the entire corpus and mount it on our ftp server.)

The Documentation file you are reading was last modified on 1 Jan 2004.


 
 

Change of Address

In the past, the language-engineering resources published by my research team have been scattered at different internet locations not all under my control, and they have more than once been shifted to new addresses without notification to me. I apologize to users for the frustrations this has sometimes caused. To avoid such problems in future, I have acquired my own internet domain, which I intend to maintain indefinitely. My home page has now moved to:

http://www.grsampson.net/

From now on this will always include a pointer to a list of the current locations of corpora and other downloadable research resources produced under my direction. In due course, those resources may themselves be shifted into the grsampson.net domain.


 
 

Version Information

Release 2 differs from Release 1 in that “ghost” elements in the structural analysis, representing the logical placing of elements which have been deleted or moved into a different clause in surface structure, are shown as labelled brackets in the parse field rather than as items in the word field. (Lines for ghosts in Release 2 have a hyphen in the word field.) This brings the analytic formalism of the CHRISTINE Corpus into closer conformity with that of the SUSANNE Corpus, and makes the files less confusing for human readers.

Release 1 was completed on 29 July 1999.

1. Introduction

The CHRISTINE Corpus is a structurally-annotated sample of spoken English. The sample is based on extracts from the “demographically-sampled” speech section of the British National Corpus. It therefore forms a suitable resource for studying grammatical and other structural features in the spontaneous, informal usage of a cross-section of speakers drawn from all social classes and regions of the United Kingdom in the 1990s.

The CHRISTINE Corpus conforms to relevant recommendations of the EAGLES (Expert Advisory Group on Language Engineering Standards) Spoken Language Working Group (Gibbon et al. 1997), as well as to preferences expressed by an international group of more than thirty experts consulted via the Internet at the beginning of the project which created it.

The CHRISTINE project was sponsored from 1996 to 1999 by the Economic and Social Research Council (UK), under award no. R000 23 6443, as a successor to the project which produced the SUSANNE analytic scheme and Corpus.[1] The main aim of both the SUSANNE and the CHRISTINE projects has been to develop detailed, comprehensive, and explicit standards for annotating the structural properties of samples of the English language as used in real life. Such standards can be developed only by applying an annotation scheme to language samples and refining it in response to problematic cases; so the work yields, as a valuable by-product, corpora, or “treebanks”, annotated in accordance with the scheme.

(The term “treebank” is now accepted internationally to describe a natural-language sample equipped with annotations representing grammatical structure.  I believe the term was first coined by my colleague Geoffrey Leech of the University of Lancaster, in connexion with the treebank for whose creation at Lancaster I took responsibility in 1983 and which is described in Garside et al. (1987: ch. 7).  The SUSANNE and CHRISTINE analytic scheme, though considerably more sophisticated, is the lineal descendant of the scheme developed by Leech and myself in the early 1980s.)[2]

The SUSANNE project focused chiefly on written language. It produced the structurally-annotated SUSANNE Corpus of written (American) English, published in 1992, together with a 500-page book, Sampson (1995) (referred to below as EFC), which defined the annotation scheme. The scheme has been winning a measure of international recognition; for instance, D. Terence Langendoen, President of the Linguistic Society of America, comments that “the detail ... is unrivalled” (Langendoen 1997: 600).

The CHRISTINE project extended this work to the domain of spoken English. Much of the notational apparatus defined in English for the Computer applies equally to spoken or to written English. Chapter 6 of that book proposed additional notations to deal with the special structural features of spoken language, such as the speech-repair structures produced when a speaker edits his wording “on the fly”. The CHRISTINE project tested and refined the scheme which includes these extensions, by applying it to a range of samples of recorded speech from a variety of sources. The material in the published CHRISTINE Corpus represents that part of the project’s annotation work which has been brought to a state suitable for public distribution.

(The project also annotated further passages, drawn from the London-Lund and the Reading Emotional Speech Corpora as well as additional excerpts from BNC, and it was originally intended to include these passages, too, in the published CHRISTINE Corpus. Unfortunately, various practical and staffing problems meant that this has not to date been possible, though it is still the intention to publish at least some of the additional material eventually. Because of the slippage in this aspect of our plans, we no longer use the name “CHRISTINE Corpus Stage I” for the currently-available Corpus; that was appropriate when publication of a larger resource was expected to occur within months, but it is now more realistic to use the short name for the material published to date.)

By now, the SUSANNE Corpus is in use in research institutions in many parts of the world, and numerous research publications have been based on it. It was not the first annotated corpus to have been published, and it is far from the largest, but for some research purposes it has proved specially useful. Users have commented, for instance, on the unusual richness and precision of its annotations. The CHRISTINE Corpus is likewise not the first structurally-analysed corpus of spoken English. For American English there is, for instance, the (specialized) Switchboard Corpus (Meteer et al. 1995); for British English the ICE Corpus (www.ucl.ac.uk/english-usage/ice-gb/) pipped us to the post by several months. But CHRISTINE has several virtues:

which may give it a special value in some research contexts.

CHRISTINE undoubtedly contains errors. I should be very grateful to be notified of any errors discovered by users (via e-mail to an address that I shall express cryptically to avoid the attention of spammers: grs2, followed by at-sign, followed by sussex.ac.uk), so that these can be eliminated when the full Corpus is released. Any such help will be publicly acknowledged.

More information on the CHRISTINE and SUSANNE projects and Corpora is available on the World Wide Web: visit my home page at www.grsampson.net and follow the respective links.

The CHRISTINE web page includes details of the research team, but it is proper to acknowledge them by name here. Any value the CHRISTINE Corpus may have is largely due to the dedicated hard work of Alan Morris and Anna Rahman. The first-person pronoun is used in the documentation files, because these are largely concerned with debatable points on which the Principal Investigator necessarily took the final decision; but this should not detract from the credit due to other members of the team.

The electronic files comprising the CHRISTINE Corpus may be copied freely by anyone and used for any purpose. The Economic and Social Research Council, as sponsoring agency, and the University of Sussex as the contracting institution would undoubtedly appreciate acknowledgments in any publications which emerge from research using the CHRISTINE Corpus.

2. The British National Corpus

The British National Corpus is an electronic resource intended to supply empirical data on the English language as “produced” (that is, spoken and written) and “received” (heard and read) in Britain in the 1990s.[3] The BNC was created by a consortium comprising the publishers Oxford University Press, Longman, and Chambers Harrap, Oxford and Lancaster Universities, and the British Library; the chief sponsors were the Department of Trade and Industry and the Science and Engineering Research Council (now the Engineering and Physical Sciences Research Council). Release 1.0 of the BNC was circulated in 1995 and is documented in Burnard (1995); this book is referred to below as the BNC Manual.

The BNC contains 4124 language samples comprising 100 million words in all. Of this, about 10% — ten million words — is transcribed speech (the remainder being published and unpublished written material). The spoken part of the Corpus is divided into two parts:

I shall use the abbreviations “BNC/speech”, “BNC/demographic”, and “BNC/context-governed” to refer to the spoken part of the BNC Corpus and its subparts. In essence, BNC/demographic is a sampling of the spoken interactions engaged in by a cross-section of the British population over a given period; the overwhelming majority of these are informal conversation, so BNC/context-governed samples speech-events on the basis of genre rather than on the basis of speakers’ social characteristics, in order to achieve coverage of other speech genres.

The material in the published CHRISTINE Corpus is drawn wholly from BNC/demographic. For BNC/demographic, 153 individuals were recruited in such a way as to give, so far as possible:

Inevitably, practical difficulties prevented this intended distribution from being realized perfectly, but a reasonable approximation was achieved. (Detailed figures are given in the BNC Manual, p. 20. The Manual acknowledges that 153 respondents are fewer than ideal, but resource constraints forbade a substantially larger sampling.)

The recruits — in BNC terminology, “respondents” — were provided with tape recorders and asked to record all speech events in which they took part over a period comprising at least two different days of the week, thus achieving a mixture of weekdays and weekends. As well as returning the recordings, respondents also supplied logs which were intended to include demographic descriptions of other participants in the conversations (though, as we shall see below, this proved to be an area of severe weakness in the system).

BNC/demographic comprises 153 files, one for each respondent’s recordings; the average wordage in a single respondent’s file is about 27,500 words, though there is considerable variation round this mean.

The recordings were transcribed using conventional orthography, with ordinary punctuation, sentence-initial capitalization, etc. (My understanding is that this work was done by clerical employees of the Longman Group, based at Harlow, Essex, though this is not stated in the Manual and may be incorrect.  If it is correct, then the variety of English familiar to the transcribers is likely to have been fairly close in pronunciation to RP — “Received Pronunciation”, the national standard; a number of oddities of transcription are understandable if distant regional dialects were being filtered through ears attuned to RP.) The Corpus as released comprises these transcriptions encoded into an SGML-based file structure, including various analytic annotations (e.g. wordtags) produced semi-automatically by the consortium researchers. (The CHRISTINE Corpus ignores the BNC annotations; we applied our own much more detailed annotation scheme manually to a subset of BNC/demographic, so that only the actual words uttered are common to the two corpora.)

It should be said that BNC/speech, though unrivalled as a cross-sectional sampling of contemporary British speech, is not an ideal research resource in every respect.  The sound recordings are so far not available to researchers (though this may change); in any case, having been made in “field conditions”, the recordings were clearly often of poor quality by the standards of lab-based speech research, which was quite inevitable.  Furthermore, the standards of transcription often leave something to be desired (many transcriber errors are discussed in the notes files for the individual texts).  The other sources of transcribed speech used by the CHRISTINE project have their own virtues (the Reading Emotional Speech Corpus is available as digitized sound signals, the London-Lund Corpus is transcribed to a very high standard of accuracy); conversely, neither of them can claim to be representative of the national population in the way that BNC/demographic is.  At present, there is simply no resource available which combines all desirable properties.

3. The Contents of the CHRISTINE Corpus

3.1  The Nature of the Resource

CHRISTINE comprises structural annotations of forty passages excerpted from the BNC/demographic files. Altogether 147 identified speakers are represented in CHRISTINE (there is also a good deal of speech by unidentified speakers).

Rather than the SGML format used in the original BNC files, CHRISTINE uses a one-word-per-line fixed-field format, similar to that of the SUSANNE Corpus. This is in accordance with preferences expressed by the experts consulted at the outset of the project. Because the field structure of CHRISTINE is very simple, it would be a trivial matter for anyone whose application requires an SGML-structured data resource to convert CHRISTINE into such, given a suitable DTD (Document Type Definition). For the many users who have no such requirement, the existing format is both more transparent and far more computationally tractable.

Use of SGML for a data resource with such a simple structure as the CHRISTINE Corpus is arguably a negative factor, because it creates many possibilities of inadvertently introducing meaningless coding distinctions.  We have encountered several cases of this in Release 1.0 of BNC:

In each of these cases, the coding distinction seems to represent no real difference in what is being said about the structure of the relevant utterance.[5]

(Here and below, examples from the CHRISTINE Corpus are given a location reference in the form “T12.34567”, meaning “text T12, source-unit 34567” — for “source-units”, see §6.2.  Examples are quoted with the punctuation and capitalization provided by the BNC transcribers, where this is helpful for understanding the structure of the utterance.  Some examples quoted in the present document are taken from material annotated by the CHRISTINE project but not included in the published Corpus.)
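A user who wants to handle such location references programmatically might split them as in the following minimal sketch. It assumes that every reference has exactly the shape of the examples quoted in this document (one capital letter and two digits, a full stop, then five digits); the names parse_location and Location are hypothetical and are not part of the Corpus.

```python
import re
from typing import NamedTuple

# Assumed shape, inferred only from examples such as "T12.34567" and
# "T06.00524": a text name, a full stop, and a five-digit source-unit number.
LOCATION_RE = re.compile(r"([A-Z]\d{2})\.(\d{5})")


class Location(NamedTuple):
    text: str         # e.g. "T06"
    source_unit: str  # kept as a string so that leading zeros survive


def parse_location(ref: str) -> Location:
    """Split a reference such as 'T06.00524' into text and source-unit."""
    match = LOCATION_RE.fullmatch(ref.strip())
    if match is None:
        raise ValueError(f"not a location reference: {ref!r}")
    return Location(match.group(1), match.group(2))


if __name__ == "__main__":
    print(parse_location("T06.00524"))
```

The source-unit number is deliberately left as a string, since converting it to an integer would discard the leading zeros that appear in the references as printed.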

The 40 CHRISTINE passages or “texts” are similar to one another in length, and the length was selected so as to be broadly comparable with the texts in the Brown, LOB, and SUSANNE Corpora of written English. The latter corpora were designed so that each text contains 2000 words, plus a few more as needed to make each text-end coincide with a sentence boundary. This rule is not directly applicable to a corpus of spontaneous speech, for one thing because the concept “sentence” does not apply straightforwardly to the spoken language, but also because transcribed speech contains many items — ums and ers, failed partial attempts at uttering words, markers showing that different speakers’ utterances were simultaneous, headers identifying speaker turns, records of “noises off”, pauses, etc. — which are not comparable to written words. For some sample passages from the demographically-sampled BNC speech corpus, having converted them from the original SGML format into a fixed-field format I determined the average ratio of lines to ordinary spoken words to be about 1.46:1.[6] This ratio would imply excerpts of about 2930 lines to get 2000 “real words”. However, the items other than words are themselves scientifically-interesting data items, though they seem individually less “weighty” than real spoken words. Consequently I chose 2800 lines as a target text length, as a compromise between 2930 and 2000.

Because the boundaries of excerpts from BNC were chosen to coincide with natural breaks in the speech stream, as discussed below, most excerpts in practice are longer than 2800 lines. The CHRISTINE texts as published contain about 112,000 lines in total, corresponding to about 80,500 “full words”, ignoring hesitation phenomena, etc.

My research team worked on the principle that the task of those who compile natural-language corpora is to represent the properties of language samples in a clear, explicit fashion that creates the fewest possible hurdles for researchers who wish to extract data from a corpus. We did not see it as part of our task to produce software for data extraction. We could not do that, since we have no way of knowing what sorts of questions future researchers will want to pose to our data. (SUSANNE has been used for various kinds of research that I had no thought of when I put it into circulation.) This point seems worth making, because since the publication of SUSANNE I have more than once encountered comments suggesting that, in failing to supply accompanying utility software, we left a job half done. In response, let me quote remarks I made in a recent book review (Sampson 1998: 365) about the approach which sees utility software as an essential accompaniment to corpus data:

It is hard to see this as a wise policy for allocating scarce research resources. In practice there are usually two possibilities when one wants to exploit corpus data. Often, one wants to put very obvious and simple questions to the corpus; in that case, it is usually possible to get answers via general-purpose Unix commands like grep and wc, avoiding the overhead of learning special-purpose software. Sometimes, the questions one wants to put are original and un-obvious; in those cases, the developer of a corpus utility is unlikely to have anticipated that anyone might want to ask them, so one has to write one’s own program to extract the information. No doubt there are intermediate cases where a corpus utility will do the job and grep will not. I am not convinced that these cases are common enough to justify learning to use such software, let alone writing it.

3.2  Choice of Extracts

Forty of the BNC/demographic files were chosen at random to serve as sources of excerpts for CHRISTINE. In order to explain how 2800-line excerpts were selected from these files, it is necessary to explain something of the internal structuring imposed by the BNC compilers on their demographically-sampled speech files.

These files are hierarchically structured into units delimited by SGML tags <div>, <u>, and <s>. The <div> (division) unit corresponds, at least nominally, to a recording of an individual conversation. (In practice <div> breaks sometimes interrupt what appear to be single conversations; so far as I have seen, the BNC Manual does not explain how <div> boundaries were decided.) The <u> and <s> (“utterance” and “segment”) units are intended to correspond to speaker turns, and to individual sentences. Again, in practice these units are often of questionable scientific significance. A speaker’s output is frequently split in BNC into separate <u> units merely because another participant interjects a brief remark (perhaps no more than a reassuring mm) in the middle of what is from all other points of view a single continuous speech-turn. And, although the BNC transcribers set out their transcriptions in the form of sentences, beginning with capital letters and ending with full stops or equivalent punctuation, the grammatical concept “sentence” is often inapplicable to the wording of spontaneous speech, which contains many sequences of wording that do not fit into conventional ideas of sentence structure.

Within each of the 40 randomly-selected BNC/demographic files, I used a random-number generator to select a line in the reformatted version of the file between the first line and the line 2800 short of the last line. I then began the excerpt at a <div> boundary close to this randomly-chosen line, if there was one, and continued to the first <u> boundary at least 2800 lines later. If no <div> boundary occurred near the randomly-chosen line, I began at a <u> boundary (and, if the 2800th line was close to a <div> boundary, I adjusted the excerpt to end there); furthermore, if a BNC <u> boundary did not appear to represent a natural break in the dialogue structure, I continued to a “better” <u> boundary. In general, I allowed myself considerable latitude in ranging forward or back from the randomly-chosen line to find a natural break which led to another natural break roughly 2800 lines later. There was no element of planning in terms of selecting “interesting” or “representative” excerpts from the BNC files; but I treated the aim of finding excerpts with reasonably natural boundaries as a higher priority than making the excerpt boundaries mechanically random, in the sense of being wholly determined by randomizing techniques with no exercise of discretion.
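The mechanical core of this procedure can be illustrated roughly as follows. This is only a sketch under stated assumptions, not the program actually used: the function name choose_excerpt and the threshold near (how many lines count as “close”) are hypothetical, and the sketch leaves out both the adjustment of the excerpt end to a nearby <div> boundary and the manual judgment about which boundaries were natural breaks.

```python
import random
from bisect import bisect_left

TARGET = 2800  # target excerpt length in lines, as stated in the text


def nearest(boundaries, line):
    """Return the boundary closest to `line` (boundaries must be non-empty)."""
    return min(boundaries, key=lambda b: abs(b - line))


def choose_excerpt(n_lines, div_starts, u_starts, near=200, rng=random):
    """Return (start, end) line numbers for one excerpt.

    div_starts and u_starts are sorted line numbers at which <div> and <u>
    units begin in the reformatted file; u_starts must be non-empty.
    `near` is a hypothetical threshold standing in for "close to".
    """
    if n_lines <= TARGET:
        raise ValueError("file is too short to hold an excerpt")
    # Random line between the first line and the line TARGET short of the end.
    seed_line = rng.randint(0, n_lines - TARGET)
    close_divs = [d for d in div_starts if abs(d - seed_line) <= near]
    if close_divs:
        start = nearest(close_divs, seed_line)  # prefer a nearby <div> boundary
    else:
        start = nearest(u_starts, seed_line)    # otherwise fall back to a <u>
    # Continue to the first <u> boundary at least TARGET lines later;
    # if there is none, run to the end of the file.
    i = bisect_left(u_starts, start + TARGET)
    end = u_starts[i] if i < len(u_starts) else n_lines
    return start, end


if __name__ == "__main__":
    # Invented boundary positions, purely for demonstration.
    divs = [0, 3000, 6000]
    turns = list(range(0, 10000, 50))
    print(choose_excerpt(10000, divs, turns, rng=random.Random(1)))
```

Because the excerpt always runs on to the next <u> boundary, a selection made in this way is normally somewhat longer than 2800 lines.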

As it turned out, the extracts selected in this way included a minority of cases where the BNC header file gave little or no descriptive information about the speakers, or where a high proportion of speaker turns were not attributed to any identified speaker. This is unfortunate, for purposes of studying who says what in modern Britain, and one possibility would have been to discard those extracts and find other extracts for which information was more complete. But this would probably have skewed the sample. It is surely to be expected, for instance, that less detailed identification of speakers will happen for a recording of teenagers “hanging out” in a city street than for a recording made in a middle-aged couple’s living-room. The chief aim of the project was to produce a representative sample of modern British usage, so we refrained from “improving” on the outcome of the random selection process, and we accepted some gaps in the speaker information as a price to be paid for representativeness.

In 1999, the BNC Consortium released a “BNC Sampler” corpus, containing a selection of material from all parts of the full BNC Corpus, including BNC/speech, for use by researchers whose circumstances made it unnecessary or difficult to deal with the hundreds of megabytes of the full BNC Corpus. Natural-language corpora gain value when the same language samples are studied and processed by many different researchers in different ways, so ideally it would have been desirable to make the CHRISTINE selections overlap with those of the BNC Sampler, which are probably destined to be worked over much more intensively than other parts of the BNC. However, the Sampler was produced too late to allow this. (The BNC selections included in CHRISTINE were made in late 1996; as it happened, my copy of the Sampler disc arrived the day after I had extracted and applied initial processing to the last set of BNC extracts used in our project.)

We saw, above, that the contents of BNC/demographic consist overwhelmingly of informal conversation, but nothing in the sampling methodology ruled out the possibility of including speech of other genres. (The BNC Manual, p. 20, states that respondents were asked to record all of their “conversations”, but this is probably just intended as a nontechnical way of saying “all speech-events”; at any rate, there is a small amount of non-conversational material in CHRISTINE, for instance a sermon-like monologue.) In selecting extracts, we made no attempt to exclude non-conversational material. The aim was to provide a sample of the language that people actually hear in real life; the majority of that is spontaneous conversation, but some is not.

In view of the nature of many of the conversations excerpted, it is perhaps also worth stressing that there was no deliberate intention to choose salacious material. The tone of CHRISTINE, so far as we can tell, simply reflects a fair cross-section of British conversation in the 1990s.

3.3  List of Corpus Files

The forty text extracts in CHRISTINE are named T01, T02, ..., T40. (The prefix letter “T” would become significant if the other files annotated by our project are eventually published; different prefix letters are used for material taken from different source corpora.) CHRISTINE consists of a set of 84 files, as follows:

As stated above, the version of the Documentation file included in the Corpus will sometimes be out of date relative to the version available as a Web page.

3.4  The Lexicon File

The Lexicon file contains an alphabetized list of all pairs of wordform and wordtag that occur at least once in the Corpus. Inclusion of a list of wordforms is a recommendation of the EAGLES Spoken Language Working Group, Gibbon et al. (1997: 170, Recommendation 6).  In the CHRISTINE case, separate listing of grammatically-distinct uses of single wordforms is an obvious way of increasing the value of such a list.

Each line of the file contains a wordform followed by a wordtag, separated by a tab character, and terminated by a newline.

The Lexicon file covers only actual uttered words (whether complete or distorted/truncated); it does not contain entries for non-linguistic or analytic items.  (That is, it contains entries only for word lines of the third category listed under §6.10.)
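For users who prefer to load the Lexicon into a program rather than query it with grep, a minimal sketch of a reader is given here. It assumes only the line format described above (wordform, tab, wordtag, newline); the function name read_lexicon is hypothetical, and the sample entries use placeholder tags (TAG1, TAG2) which are not real wordtags from the Corpus.

```python
from collections import defaultdict


def read_lexicon(lines):
    """Map each wordform to the sorted list of wordtags it occurs with.

    `lines` is any iterable of strings of the form 'wordform<TAB>wordtag',
    for instance an open file object.
    """
    tags_by_form = defaultdict(set)
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue  # tolerate a blank line, e.g. at the end of the file
        wordform, wordtag = line.split("\t")
        tags_by_form[wordform].add(wordtag)
    return {form: sorted(tags) for form, tags in tags_by_form.items()}


if __name__ == "__main__":
    # Placeholder entries for illustration only, not taken from the Corpus.
    sample = ["back\tTAG1\n", "back\tTAG2\n", "cat\tTAG1\n"]
    lexicon = read_lexicon(sample)
    # Wordforms listed with more than one grammatically distinct use.
    print([form for form, tags in lexicon.items() if len(tags) > 1])
```

Grouping the tags by wordform makes it easy to pick out the wordforms which have more than one grammatically-distinct use, the feature of the list that is singled out above.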

3.5  The Speakers File

The main goal of the CHRISTINE project has been to annotate a cross-section of British speech, and to develop guidelines for executing such annotation in a predictable manner — not to study differences of usage among different types of speaker.  However, the BNC source files do include background information about many of the speakers; this information is neither as complete nor as reliable as one might ideally hope, but it is a good deal better than no information.

The file Speakers summarizes in machine-usable format the information available on individual CHRISTINE speakers’ demographic characteristics; §4 discusses the assumptions on which this summary is based.

For each speaker represented in the Corpus other than “unidentified” speakers, the file includes one line, terminated by a newline character, and containing eight fields separated by tabs, e.g.:

        003    T01    1992    F    63    NO    DE    Jean

Field contents are:

  1. (e.g. 003): speaker’s CHRISTINE identification number
  2. (e.g. T01): CHRISTINE text file containing the speaker’s contribution(s)
  3. (e.g. 1992): year of recording. In the current CHRISTINE Corpus, based on material from BNC, this year is always in the early 1990s; but the material annotated by the CHRISTINE project includes speech samples recorded in years spanning several decades, so year of recording as well as speaker’s age at date of recording would be needed in order to place a speaker within the history of changing English usage, if some of this additional material is eventually added into the published Corpus.
  4. (e.g. F): speaker’s sex (M = male, F = female, X = unknown)
  5. (e.g. 63): speaker’s age at date of recording, with leading zero added if necessary to make two digits (XX = unknown)
  6. (e.g. NO): speaker’s region, classified as discussed in §4.3. Valid codes (not all used in CHRISTINE) are:
  7. (e.g. DE): speaker’s social class, determined as discussed in §4.4; valid codes are:
  8. (e.g. Jean): speaker’s nom de corpus (see §4.1) — a name similar to his or her real name, used in CHRISTINE to strengthen speakers’ anonymity. A speaker with CHRISTINE identification number 003 and nom de corpus Jean is shown as Jean003 in turn headers within a text file.
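The eight-field layout just described lends itself to a very simple reader. The following is a minimal sketch, assuming only what is stated above (tab-separated fields, XX for an unknown age); the names Speaker and parse_speaker are hypothetical, and the sketch further assumes that the year field is always numeric, which the description implies but does not state outright.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Speaker:
    speaker_id: str      # field 1, kept as a string to preserve leading zeros
    text: str            # field 2, e.g. "T01"
    year: int            # field 3, year of recording (assumed always numeric)
    sex: str             # field 4: "M", "F" or "X"
    age: Optional[int]   # field 5, None when the file has "XX"
    region: str          # field 6
    social_class: str    # field 7
    name: str            # field 8, the nom de corpus

    @property
    def turn_label(self) -> str:
        """Label used in turn headers, e.g. 'Jean003'."""
        return f"{self.name}{self.speaker_id}"


def parse_speaker(line: str) -> Speaker:
    """Parse one line of the Speakers file."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) != 8:
        raise ValueError(f"expected 8 tab-separated fields, got {len(fields)}")
    sid, text, year, sex, age, region, social_class, name = fields
    return Speaker(sid, text, int(year), sex,
                   None if age == "XX" else int(age),
                   region, social_class, name)


if __name__ == "__main__":
    speaker = parse_speaker("003\tT01\t1992\tF\t63\tNO\tDE\tJean\n")
    print(speaker.turn_label, speaker.age)
```

The identification number is stored as a string rather than an integer, so that it can be joined to the nom de corpus to reproduce the turn-header form without re-padding.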

3.6  The Notes Files

The forty notes files, one for each text, are in HTML format and are intended to be read by users rather than machines.  They include information on the following issues:

4. Identification and Classification of Speakers

4.1  Speaker Identification and Anonymity

The BNC compilers promised anonymity to the speakers represented in BNC/speech. The CHRISTINE Corpus extends this BNC policy in certain respects.

The anonymity policy was implemented in BNC by removing surnames of speakers, and a few other proper names, replacing them with an SGML entity which in CHRISTINE appears as <name>. However, this procedure is arguably not adequate.

The headers to the BNC/speech files do specify speakers’ Christian names (forenames); and of course they also specify the dates and places of the recordings. The places specified are sometimes small villages. The date and place specifications represent significant scientific data, and must be preserved. But, particularly when the speakers’ Christian names are moderately or very unusual, it seems likely that someone familiar with the locale in question would often be able to identify groups of friends from their Christian names.

True, an outsider would hardly be able to identify individuals without their surnames. But anonymity vis-à-vis outsiders is not the only kind of anonymity that matters. Surely it is equally important to protect, say, a group of youngsters who have been recorded chatting freely among themselves from embarrassment through being recognized by their own teachers or relatives. One may feel that the likelihood of such an “insider” encountering the CHRISTINE Corpus is fairly low. But the decisive point is that some of the speakers themselves understood that the corpus compilers were offering them this level of anonymity. For instance, T06.00524 shows the speaker explaining the system to her companion by saying they don’t give them a name, they just say ... sixteen-year-old girl, fifteen-year-old girl with a friend. It is not for us to breach this expectation of literal anonymity.

Furthermore, it is not only the speakers themselves who should be protected. For instance, the two speakers just mentioned comment that one of their schoolmates, identified by Christian name, behaves like a whore. This person is entitled to anonymity as much as the speakers, and arguably more so: she signed no release form for the corpus compilers. When well-known public figures or institutions are mentioned, the BNC compilers seem to have felt that there was no need to anonymize the references at all. Clearly, if someone announces that he has just bought the latest album by a named pop singer, there is no point in concealing the singer’s name. But it depends what is said. One of the CHRISTINE texts contains a series of quite damaging remarks about the management of a secondary school, named in the BNC file. In another case, speakers comment adversely on the sexual morality of a named American actress. Even American actresses, surely, are entitled to have their honour guarded by corpus linguists.

Consequently, the CHRISTINE Corpus has taken the BNC anonymization policy further, in the following ways.

Where a BNC file gives the name of an institution, or the surname of a third-party individual (it never gives surnames for participants in the dialogues), in a context where it seems possible that the identification could cause embarrassment, CHRISTINE replaces the name with the <name> entity.

Christian names of speakers are in all cases replaced by other Christian names, both in identifying the utterers of speech-turns, and in the transcription of words uttered. Each speaker represented in the CHRISTINE Corpus is assigned a name and a three-digit code, e.g. “Scott125”. Each of the speaker’s turns is headed by this name/number code; and other participants in the dialogue are shown addressing him as “Scott” — but “Scott” is not the individual’s real name. The three-digit codes are unique across the CHRISTINE Corpus. The names are sometimes shared by different speakers, as their real names are.

(An alternative would have been to attribute the speaker turns to the five-byte codes used by BNC to identify speakers, e.g. PS546. But this gives the corpus user no easy way to link the individuals who contribute particular turns to their names used vocatively by other dialogue participants. It is far easier to grasp what is going on in a dialogue, if one has naturalistic names to hook the spoken interactions onto; the fact that they are not the actual names of the speakers is scientifically irrelevant.)

Some Christian names of individuals not participating in a dialogue, but who are talked about in it, are also changed, if the comments made about them seem potentially embarrassing, or if the name might involve a special risk of rendering the speakers identifiable.

The noms de corpus are chosen to be metrically equivalent to the real names, and also as far as possible to be socially equivalent. Obviously, male names are replaced by male names and female by female. But, in addition, when a name seems to be associated with a particular age-group, social class, and/or region, it is replaced by a name which feels similar in those respects. When (say) a two-syllable formal name alternates with a one-syllable abbreviation, the replacement name is chosen to preserve the same pattern, and formal name and abbreviation of the replacement name are inserted wherever formal and abbreviated versions of the real name occur, respectively, in the original file. If two participants in a dialogue share the same Christian name, their noms de corpus are also the same (occasionally, the logic of the dialogue depends on this kind of ambiguity of names).

Two kinds of turn in the original BNC files are not attributed to speakers with identified Christian names. In many cases, the transcriber could not decide which speaker produced a particular utterance, and assigned the turn to an “empty” speaker code, usually PS000. (Sometimes, where it is clear that different speakers are involved but neither is identifiable, PS000 and PS001 are used; however, a series of turns all attributed to “PS000” sometimes appears in fact to have been uttered by more than one speaker.[8]) These turns are attributed in CHRISTINE to speakers unid0, unid1 (for PS000, PS001 respectively).

In other cases, the BNC file assigns a “normal” speaker code which is identified by the header as referring to a particular individual with specified characteristics, but no name is included. In those cases, CHRISTINE invents a nom de corpus which seems appropriate in terms of the speaker’s sex, age, etc. (Occasionally, if the speaker’s sex as well as real name is missing, CHRISTINE uses the cover name Anon.)

It must be admitted that these procedures cannot offer a watertight guarantee against speaker identification. Someone who was determined to penetrate behind the veil of anonymity provided by CHRISTINE would only have to link its files to the corresponding passages in the original BNC files to discover the names we have concealed. There is nothing we can do about that. But our policy greatly reduces the chance of an accidental betrayal of informants’ confidence. If any of their identities should ever be revealed, it will not be the fault of the CHRISTINE Corpus.

4.2  Categorization of Speakers

Relevant information about the 147 identified speakers in the CHRISTINE Corpus, adapted from the file-headers of the respective BNC files, is given discursively in the notes files for the separate CHRISTINE texts, and is summarized in computer-tractable form for all the speakers in the Speakers file. Categories such as sex and age in years are self-explanatory, but the dialect and social class categories require some discussion.

One special problem about BNC speaker categorization data relates to the fact that some of the BNC files were created not by the BNC project itself but by a separate project based in Norway, the “Bergen Corpus of London Teenager Language” (“COLT”) project (Stenström & Breivik 1993; cf. BNC Manual, p. 20).  COLT material appears to have been used where the BNC/demographic sampling system called for samples fitting its description; but, because COLT was an independent project, it did not collect the same types of information about speakers as the BNC project itself.  Users of CHRISTINE will notice that relatively little information about individual speakers is included for those texts which represent young Londoners.

4.3  Regional Classification

The BNC file headers normally identify speakers’ mother tongue, almost always as British English, and in many cases give rather detailed (though not always very clear) information about speakers’ regional dialects.

The only cases where file-header information seems to mean that speakers have a language other than English as mother tongue are two speakers in text T08 who are identified as having European accents. (Also, the header to the BNC file from which CHRISTINE text T24 is extracted codes all speakers in that file as native speakers of Irish Gaelic, but this is not credible; I take it to be a symptom of the BNC respondent’s nationalist political fervour rather than a serious linguistic description. These speakers include members of more than one family, living in Belfast, and all of them, including a child of three years, are shown as speaking fluent English.) Apart from the above, the only cases in CHRISTINE where no information is given about mother tongue are:

I have assumed that all of these speakers should in fact be classified as having British English as mother tongue.

Regional dialects are classified by BNC on a system which, with respect to England, seems to be adapted from the classification in Trudgill (1990: 3-5). (The BNC Manual does not explicitly quote Trudgill’s book, so far as I have seen.) Trudgill’s book does not deal with the British nations other than England, and BNC treats Wales, Scotland, and Ireland as three unitary dialect regions coinciding with the respective political units. (BNC makes no distinction between Northern and Southern Irish speakers; and, although all the speech samples were recorded within the UK, the CHRISTINE Corpus includes at least one Irish speaker living in England, who may well have come from the Republic rather than from Northern Ireland — so CHRISTINE likewise uses a single “Irish” category.)

Within England, Trudgill recognizes sixteen dialect areas. BNC describes speakers’ dialects via three-letter codes whose definitions (BNC Manual, pp. 86-7) are too similar to Trudgill’s areas to have been chosen independently, though they are not quite identical. (For instance, BNC uses a code XLO for “London”, which in Trudgill’s system is part of the much larger “Home Counties” area, and it uses a code XLC for “Lancashire”, whereas various parts of Lancashire fall into different areas in Trudgill’s scheme — I infer by elimination that XLC may in reality stand for Trudgill’s “Central Lancashire” area.) The complications in the relationships between the BNC and Trudgill’s dialect classification systems seem to stem partly from the fact that BNC aims wherever possible to use internationally-recognized ISO classifications for geographical regions, and partly from the fact that laymen such as the BNC respondents commonly classify speech-varieties by reference to traditional county names; both of these classification methods relate to political boundaries which are often irrelevant to linguistic realities.

Be that as it may, Trudgill’s classification in any case seems unnecessarily fine-grained for a project like CHRISTINE, which is concerned with grammar rather than with details of pronunciation; and a sixteen-way classification of English dialects is particularly inappropriate when one considers that the recording sites often happened to fall rather close to one or other of Trudgill’s isoglosses, and that BNC respondents had no expertise in classifying speakers who hailed from areas distant from the recording site.

On the other hand, linguistic differences between, say, Northern and Southern England are sufficiently large, in grammar as well as pronunciation, that it would be a pity to ignore the dialect indications in BNC altogether.

CHRISTINE has adopted a compromise strategy, which uses the data in BNC to assign as many English speakers as possible to one of four broad regions corresponding to the second level from the root in Trudgill’s hierarchical classification of modern dialects (Trudgill 1990: Fig. 3.1, p. 65). CHRISTINE uses the terms:

The boundaries of these dialect areas are defined by the heaviest lines in Trudgill’s Map 18 (1990, p. 63). In terms of traditional counties, Northern England extends from the Scottish border south to central Lancashire (excluding Merseyside), all of Yorkshire, and the northernmost part of Lincolnshire. The Midlands includes Merseyside and the counties south of Yorkshire, southwards to Shrewsbury, the Birmingham area, Leicestershire, and the southern boundary of Lincolnshire. The South West includes southern Shropshire, southern Worcestershire, Gloucestershire, western parts of Oxfordshire, Berkshire, and Hampshire, and the counties further south and west. The South East includes Northamptonshire and all of England south and east of that county, including East Anglia to the east, the bulk of Hampshire to the south, and notably the London area.

Clearly, this classification can be no more than a broad and vague indication; habits of speech do not change sharply either side of lines drawn through the map of England.

CHRISTINE contains no cases of native speakers of varieties of English from outside the British Isles. However, the complete CHRISTINE Corpus will include some speakers to whom this applies, and consequently the coding system given below includes further classifications:

4.4  Social Classification

BNC file headers include three kinds of information about speakers which could broadly be described as social classifications (though in many cases one or more items is missing for a given speaker):

Of these, the education information is so patchy (it is virtually always missing for all speakers other than the one BNC respondent for each file) that it is ignored in CHRISTINE.

The social-class information is expressed as a code drawn from a four-way classification derived from the Standard Occupational Classification (“SOC”) scheme defined in Office of Population Censuses and Surveys (1990-1).[9]

The SOC scheme assigns occupations to six social classes:

I    professional, etc.
II    managerial and technical
III    skilled occupations, divided into:
IIIN    non-manual
IIIM    manual
IV    partly skilled
V    unskilled

The BNC coding (in common with much social research) collapses this into a four-way scheme:

AB      I+II, professional, managerial, and technical
C1      IIIN, skilled non-manual
C2      IIIM, skilled manual
DE      IV+V, partly skilled and unskilled

In principle, this four-way scheme is at a very suitable level of granularity for use in CHRISTINE. But there are severe problems in practice, which presumably stem from the fact that the data in BNC file headers are only as good as the logs supplied by the non-expert respondents who filled in details about their friends and relatives.
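
The collapse from the six SOC classes to the four-way BNC codes can be written down as a simple lookup. The sketch below merely restates the two tables above; the names SOC_TO_BNC and bnc_class are hypothetical and are not taken from the Corpus files.

```python
# Six SOC social classes mapped onto the four-way BNC coding,
# exactly as listed in the two tables above.
SOC_TO_BNC = {
    "I": "AB",     # professional, etc.
    "II": "AB",    # managerial and technical
    "IIIN": "C1",  # skilled non-manual
    "IIIM": "C2",  # skilled manual
    "IV": "DE",    # partly skilled
    "V": "DE",     # unskilled
}


def bnc_class(soc_class):
    """Return the four-way code for a SOC class label such as 'IIIN'."""
    return SOC_TO_BNC[soc_class]
```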

In the first place, many speakers are assigned an “unclassified” code under the social-class heading. But, more worryingly, it not infrequently happens that the social code assigned to a speaker contradicts the statement in the same file header about that speaker’s occupation, despite the fact that the social classification is supposed to be based on occupation. An extreme case is Gillian091 in text T23, who is socially classified DE (partly skilled or unskilled) in BNC, and is described as a doctor by occupation. Any doctor is SOC class I, i.e. AB in terms of the four-way scheme.

Between them, these two problems are sufficiently severe that one might think it best to abandon any attempt to include social-class data in CHRISTINE. That would be very unfortunate: the issue of correlations (or lack of them) between speech patterns and social class is a topic of great interest from many points of view. And the data in BNC, while certainly quite “dirty” in this area, are not so irredeemably flawed as to prevent anything being said.

For CHRISTINE, therefore, I proceeded as follows. Where the BNC file header for a speaker states that speaker’s occupation, I assumed that this statement, being relatively specific and objective, was more likely to be correct than any social-class code shown: so I used the SOC mapping of occupations onto classes in order to assign a social code. (Thus Gillian091’s code was altered from DE to AB.) In the case of married couples, knowing that wives often treat earning as a subsidiary aspect of their role and take lower-level jobs than their background would qualify them for, I assigned the social-class code for the husband also to the wife; and vice versa in occasional cases where a husband was disabled or unemployed, so that the wife was likely to be the main breadwinner. (Note that these procedures were chosen in the light of experience of 1990s British society as it actually is, rather than of politically-motivated theories of how it possibly ought to be.) In the case of schoolchildren or preschool children, I assigned the father’s (or, failing that, the mother’s) code, irrespective of any code shown in BNC for the child. Only where none of these guidelines yielded a class code for an individual did I accept the code given in the BNC file header, if there was one. (Some speakers remained unclassified.)
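
The order of precedence among these guidelines can be pictured as a fall-through over candidate codes. The following is a rough sketch under stated assumptions, not the procedure actually run on the Corpus (which was applied by hand, case by case): the record type, its field names, and the decision to represent the married-couple guideline as a single analyst-supplied field are all hypothetical.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class SpeakerRecord:
    # All field names are hypothetical. Values are four-way codes
    # ("AB", "C1", "C2", "DE") or None where nothing is known.
    occupation_code: Optional[str] = None    # from stated occupation, via SOC
    breadwinner_spouse_code: Optional[str] = None  # set only where the analyst
                                                   # applied the married-couple rule
    is_child: bool = False                   # schoolchild or preschool child
    father_code: Optional[str] = None
    mother_code: Optional[str] = None
    bnc_header_code: Optional[str] = None    # code shown in the BNC file header


def assign_social_class(speaker):
    """Return the first available code in guideline order, or None."""
    if speaker.is_child:
        # Father's code, failing that the mother's, whatever BNC shows.
        candidates = [speaker.father_code, speaker.mother_code]
    else:
        candidates = [speaker.breadwinner_spouse_code, speaker.occupation_code]
    # The BNC header code is accepted only as a last resort.
    candidates.append(speaker.bnc_header_code)
    for code in candidates:
        if code is not None:
            return code
    return None  # speaker remains unclassified
```

On this sketch a record with occupation_code "AB" and bnc_header_code "DE", as in the Gillian091 case, comes out as "AB".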

The notes files for the various texts explain how these guidelines were applied in each specific case in order to derive the code included in the social-class field of the Speakers file. I believe the resulting classification is significantly more informative than omitting any attempt at social classification would have been. At the same time, it should be clearly understood that this aspect of the data is quite imperfect. Social classification is certainly one of the least satisfactory aspects of the information available about BNC speakers.

5. Analytic Principles

5.1  The Need for Explicit Annotation Guidelines

The general approach to structural analysis of real-life language samples exemplified in the CHRISTINE Corpus was described in early chapters of EFC, in connexion with the SUSANNE Corpus of written English. Our primary aim has been to refine the analytic scheme (the set of annotation symbols and detailed guidelines for applying them) through conscious consideration of every or almost every awkward case in our samples, so as to uncover hidden ambiguities or gaps in the guidelines and replace them with new explicit decisions. The work is analogous to the way in which the stream of cases arising in a nation’s lawcourts uncovers hidden uncertainties in the legal framework and causes them to be settled through judicial decisions which stand as precedents for the future.

Elsewhere (e.g. Rahman & Sampson forthcoming) we have used the analogy with the discipline of software engineering, developed in response to the “software crisis” of the 1960s-70s caused by premature coding of solutions to inadequately analysed tasks, in order to argue that this kind of detailed logging and classification of linguistic phenomena should be seen as a high priority at the current juncture in natural-language engineering. The carefully hand-crafted nature of treebanks produced in this spirit inevitably means that they are small, relative to some other treebanks now available; but small size is arguably a cost worth paying in exchange for comprehensiveness of the analytic guidelines. As Jane Edwards (1992: 139) has written, “The single most important property of any data base for purposes of computer-assisted research is that similar instances be encoded in predictably similar ways.”  Only detailed, explicit analytic guidelines can enable this kind of internal consistency to be maintained within future larger treebanks.

5.2  Indeterminacy in Spoken Structure

The exercise of adding structural annotations to samples of spontaneous speech involves problems of a higher order than arise in the equivalent work with edited written language. There are of course specific structural features found only in spoken language, which require their own annotation mechanisms; a number of such mechanisms were defined in EFC, chapter 6, and are used in CHRISTINE, and further notational devices developed in the course of the CHRISTINE project are described in §13 below. But, beyond these things, there is a pervasive incidence of indeterminacy in transcriptions of spontaneous speech which is rarely found in written prose. When people’s hasty, unstudied utterances are pinned down in black and white, again and again it is just not clear what they are saying. This is frequently true even when the transcriber has succeeded in distinguishing each of the individual words on the tape (often, of course, transcribers do not succeed in that).

In some cases the reason will be that the speech refers to features of the situation which were not visible to the transcriber (BNC transcribers worked from tape recordings but had only general indications of the locale in which a dialogue was recorded), or that participants in the dialogue were tacitly drawing on their shared knowledge of some matter that is never explicitly mentioned. In other cases the main problem seems to be that people sometimes just speak inconsequentially. Their skill at rapidly assembling word-sequences suitable to express their meaning is imperfect, so they come out with utterances which are uninterpretable (or perhaps they have not formulated a clear meaning in the first place — they are “engaging mouth before putting brain in gear”).

In the CHRISTINE project our task is limited to annotating for linguistic structure, rather than somehow indicating the meanings of the language samples; so this sort of inconsequentiality is not too problematic while it does not affect grammatical coherence. But sometimes it does. What should one make, for instance, of a speaker turn such as the following, uttered by Harold001:

When, when he rang us up it, it was looking at us like for Brian he kept, called in as I was going enjoy  T01.02733
Where the transcriber placed commas, evidently the speaker has repeated the previous word or substituted an alternative word; and it is easy to see that the word like is functioning in the colloquial “hedging” usage. But there is no clue in the context about what the it was that was looking at us (or why it was looking at us, whatever it was), or how to interpret for (“looking on Brian’s behalf”, or “because Brian called in”?); and the relationship of the closing word enjoy to what precedes is completely mysterious.

No set of explicit guidelines, however comprehensive, can define a clear, predictable structure for every case like this. The CHRISTINE principle is that, where word-sequences are open to alternative grammatical interpretations, the analyst chooses one, at random if there are no apparent clues (so the CHRISTINE analysis of the above example annotates for Brian on the “looking at us on Brian’s behalf” interpretation); and, while words and phrases are fitted into larger structures whenever they can be, there is no objection to leaving individual words or short phrases as independent elements in the stream of structures, if there is no apparent way to fit them into their verbal environment — so, in the example, the CHRISTINE analysis leaves enjoy as a verb which is not part of any higher structure, following a main clause he kept, called in as I was going.

Each word is at a minimum assigned a wordtag.  But, if the word cannot be fitted into a larger structure and alternative wordtags are available for the form, a tag may have to be chosen at random (for instance, off at T11.02554 is tagged RP simply because this is the commonest of the three tags applicable to this wordform).

Contrary to the rule specified in EFC, §6.13, the CHRISTINE Corpus does not group the succession of structures within a single speaker turn under a single root node O. To do that would imply some degree of coherence among the successive daughters of the speaker-turn node: it would suggest that those daughter nodes constituted a “construction” of some type. This is more than our data warrant. In CHRISTINE, a speaker turn consists of a disconnected sequence of one or more grammatical units which, if the speaker is articulate, may all be main clauses with recognizable internal structuring, but which in some cases may be disjointed phrases that would never be allowed to stand in writing as “complete sentences”, or may be individual words as in the case of enjoy, above.

In the example quoted, there was no stretch of speech where the transcriber found the wording indistinguishable. But stretches like that, shown in CHRISTINE as {unclear}, are very frequent, and obviously they create even greater problems for choice of structural annotation. We have evolved explicit guidelines for annotating passages that include {unclear} stretches, and these are discussed in §9.

5.3  Emendation of Transcribers’ Wording

Members of the CHRISTINE project had no access to the recordings from which the BNC transcribers worked; at present, for confidentiality reasons these are not available to researchers.  This means that those of us who provided the structural annotation had no evidence about what was actually said, other than what the BNC transcribers wrote down.

This evidence went slightly beyond the words themselves; BNC was transcribed using ordinary English orthography, including sentence-initial capitals and a variety of punctuation marks, and although these features were in due course stripped off the wording and “hidden” in a subsidiary field of the CHRISTINE text files (because they have no direct spoken significance), at the stage when our grammatical annotations were being created the orthographic features were still present and visible to the analysts — often, the BNC transcriber’s choice of punctuation was helpful in deciding between alternative structural interpretations which would have been equally plausible with respect to the words alone.

However, the possibility obviously arises that the BNC transcribers may sometimes have misheard or misunderstood, and have written down words different from those which the speakers thought they were saying. It is at least logically possible that the kind of inconsequentiality described in §5.2 might be entirely an artefact of the transcription process, and not present at all in the speakers’ actual wording. Perhaps Harold001 did not use the word enjoy but some phonetically-similar form which made good sense in context. I very much doubt that cases of apparent speaker incoherence can all be explained away in this manner, but there is no way that we can be sure. This is very regrettable; it would be extremely worthwhile from several points of view to be able to use this sort of treebank to check how structurally coherent speakers are on average in real life. CHRISTINE does not give us reliable quantitative information about the question. Given the primary goal of making the Corpus representative socially, regionally, etc., no speech samples were available to us that would have enabled us to answer it.

On the other hand, it is unquestionably true that some cases of incoherence in the transcriptions were created by transcribers rather than speakers. At the outset of the project, I had envisaged adopting a principle that the transcriber’s wording should never be “second-guessed”; it might be wrong, but the transcriber had heard the tape, we had not. As I grew familiar with the nature of the BNC transcriptions, it became clear that such a principle would be unreasonable.

Clear evidence of transcriber error occurs in a few cases where the speaker is reading aloud from a published document. For instance, T08.00933 contains a passage read out from St Matthew’s Gospel, 8:11, containing the phrase ... will come and decline at the table ..., where it is easy to check that the transcriber’s decline must have been a mistake for recline. But there are other cases where we have no independent check, yet some phonetically-small adjustment to the transcriber’s wording makes such a large improvement to the sense that it seems morally certain that the transcription is erroneous. For instance, at T29.09621, in a discussion of unsatisfactory child-minders, one speaker is made by the BNC transcriber to say unless you’ve low and detest children or unless you’re out purely and simply for the money ... The first clause seems meaningless, but emending it to unless you loathe and detest ... gives a passage which uses a standard cliché to express a straightforward piece of sense. The phonetic differences involve two voiced fricatives, which are among the least auditorily salient classes of English phoneme; and the BNC dialogues were recorded in non-ideal field conditions. It is overwhelmingly likely that the speaker said you loathe, and the BNC transcriber misheard. (As I understand it, the transcriptions were done under heavy time pressure.)

Therefore, we have allowed ourselves to make cautious emendations to the BNC transcriptions (logging any such emendation in the relevant notes file, with a summary of the justification for it). In general, where a change to the written wording implies no phonetic difference (e.g. “where as” changed to “whereas”), we have freely adopted the emendation if it improves the sense. The larger the phonetic difference between the transcribed wording and a proposed emendation, the greater we required the gain in semantic coherence to be before the emendation was adopted; and we tried to err on the side of conservatism. The notes files sometimes mention possible emendations which we regarded as too adventurous to incorporate into the text files.

5.4  Reallocation of Utterances Among Speakers

Emendations to the BNC source material are not limited to alterations of wording. Another type of apparent error in BNC is that, in some dialogues, the assignment of turns to identified participants seems to have become muddled. (The BNC system of distinguishing speakers via opaque sigla such as PS0M4, PS0M5 may have encouraged such confusions.) Sometimes the error becomes apparent because a speaker seems to be addressing himself or herself by name when it is clear that he or she is really talking to someone else. In other cases the content of the wording, taken with information in the BNC file headers about what kind of people the speakers are, makes the existence of a confusion inescapable. For instance, when text T21.01517-8 includes the exchange:

PS0M5: Did you want to have a shower with daddy?
PS13T: Umm yes.
it is startling to learn from the file header that PS13T is a 34-year-old engineer and PS0M5 his 3-year-old son; but a simple permutation of identity codes among the three participants in this dialogue makes good sense of this and many other seemingly bizarre aspects of the conversation.

In this case the evidence is so strong that CHRISTINE has reallocated the speaker identity codes (with an explanation in the file T21.nb); a few other reallocations have been made elsewhere for comparable reasons. There are other cases where the BNC attribution of individual speaker turns looks suspicious, but either the evidence is not strong enough to justify a reallocation, or it is not clear what particular reallocation would be correct, and so the BNC attributions have been left to stand. Allocation of speaker turns is an area where I suspect that CHRISTINE contains a significant incidence of errors (though not, I believe, so many as to make the information about speakers valueless).

A special case of speaker-code reallocation relates to a phenomenon found in a sizeable minority of the files, namely T06, T09, T19, T22, T28, T29, and probably T14.  In these files, there are speaker turns which as a whole are transcribed as {unclear} (the transcriber could not distinguish the wording) and are allocated by BNC to unidentified speakers, but in context it often seems clear that the inaudible passage is in fact a segment of an adjacent turn whose speaker is identified.  In the texts where this approach seems to have been adopted by the BNC transcriber, we have felt free to make guesses about how best to link these unclear turns with adjacent clear turns.  Most of them have been treated as part of the preceding (or, sometimes, the following) turn, and only a minority have been left as separate turns, where it is quite unclear who is speaking and what the turn structure can be (e.g. the unclear turn between T06.00439 and 00440, or the sequence at T06.00457ff.).  When an unclear turn has been linked with a clear turn, the adjacent s-unit in the latter has been made to extend over the {unclear} entity, contrary to the structure in the BNC original.  No log is kept of such modifications in the respective notes files.

5.5  Reordering of Utterances

One respect in which we gave ourselves complete freedom to modify the source transcriptions related to the sequencing and structuring of different speakers’ contributions as BNC <s> and <u> units. Much of the BNC material consists of lively conversations in which two or more participants often interrupt one another, talk simultaneously, make brief supportive responses while someone else holds the floor, and so forth. The BNC compilers treated it as a high priority to record the relative timing of various speakers’ contributions, and because they did not themselves equip the material with annotations for grammatical structure they seem not always to have noticed that sequencing different speakers’ wording in accordance with the physical timing does violence to the integrity of individual speakers’ contributions. Although the BNC material was transcribed in the form of orthographic sentences, a yes or mm interjected by a hearer has sometimes led to a speaker’s wording being divided into separate “sentences” in the middle of quite a low-level grammatical tagma.

The aims of the CHRISTINE project have to do with grammar rather than with social turn-taking phenomena. Consequently, whenever a construction produced continuously by one speaker has been split in BNC into separate <u> (utterance) units interrupted by another speaker’s utterance, CHRISTINE reorders the first speaker’s <u> units into one continuous speaker turn. Where one speaker holds the floor for a long time, producing a series of clauses which are interrupted in the middle by hearer responses, CHRISTINE will show the first speaker’s contribution as continuous, followed by the responses which in some cases may physically have occurred much earlier than the point where they appear in CHRISTINE. (If the hearer responses occurred between independent clauses in the first speaker’s contribution, no reordering is done, since the BNC sequencing does not distort any of the grammatical constructions.) The source transcription field (§6.8) includes markers indicating that such reordering has occurred, though users who need full information on relative timing will need to consult the original BNC files.
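
The effect of this reordering can be illustrated with a small sketch. It assumes that an analyst has already marked each unit that continues a construction begun in the same speaker's earlier unit; the Unit type, the continues flag, and the sample wording are all invented for illustration and do not come from the Corpus or from BNC.

```python
from dataclasses import dataclass


@dataclass
class Unit:
    speaker: str
    text: str
    continues: bool = False  # hypothetical analyst judgement: this unit carries
                             # on a construction from the speaker's earlier unit


def reorder(units):
    """Merge interrupted constructions into continuous speaker turns."""
    turns = []        # each entry: [speaker, [texts]]
    last_turn = {}    # speaker -> index of that speaker's most recent turn
    for unit in units:
        if turns and turns[-1][0] == unit.speaker:
            index = len(turns) - 1          # no interruption: same turn
        elif unit.continues and unit.speaker in last_turn:
            index = last_turn[unit.speaker]  # rejoin the interrupted turn
        else:
            index = None
        if index is None:
            last_turn[unit.speaker] = len(turns)
            turns.append([unit.speaker, [unit.text]])
        else:
            turns[index][1].append(unit.text)
    return [(speaker, " ".join(texts)) for speaker, texts in turns]


sample = [
    Unit("A", "I was going to the"),
    Unit("B", "mm"),
    Unit("A", "shop on the corner", continues=True),
]
reordered = reorder(sample)
```

Here the hearer's response ends up after the whole of the first speaker's construction, although physically it occurred in the middle of it. A unit not marked as a continuation is left in its original position, as with responses that fall between independent clauses.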

(It is perhaps fair to point out in this connexion that manual transcriptions like those of the BNC are unlikely to be very accurate with respect to precise relative timing of speech events. There is plenty of psycholinguistic research showing that hearing of relative timing of speech sounds is heavily influenced by the hearer’s understanding of grammatical structure, which interferes with perception of the physical facts. This is one of many reasons why it is unfortunate that the audio data from which the BNC transcriptions were made have not so far been released.)

In BNC, <s> (segment or sentence) units are wholly contained within <u> (utterance) units, implying that when a speaker’s contribution is divided into separate <u> units because of an interruption, there is necessarily also an <s>-unit boundary at the same point. But, in addition, BNC <s>-unit boundaries sometimes seem grammatically arbitrary even in cases where the <s> units are adjacent in BNC. What seems to be a single coherent grammatical construction has sometimes been split by the transcriber into separate “sentences”, or successive disconnected constructions have been grouped into one orthographic sentence. CHRISTINE analysts took complete freedom to group words into tagmas reflecting speakers’ apparent logic, ignoring the BNC <s>-unit boundaries where these clashed.

CHRISTINE does preserve the BNC <s>-unit (“source-unit”) boundaries for the purely practical convenience of having a division of the texts into short numbered chunks which can easily be cross-referred to the corresponding locations in the original BNC files.  But the consequence of the analytic approach described above is that CHRISTINE source-unit boundaries have no significance with respect to the tree structure assigned to surrounding wording. A source-unit boundary may occur between adjacent parse-trees, or anywhere in the middle of a tree.

On the other hand, parse-trees are never divided by the higher-level boundaries recognized in CHRISTINE: turn boundaries, and division boundaries. The point of re-ordering the BNC <u> units was to ensure that every grammatical tagma is complete within a single speaker turn.

5.6  No Shared Structures

One specific rule related to this last point is that CHRISTINE grammatical annotation is never allowed to link separate speakers’ wording into a single tagma, even when speakers “complete one another’s sentences” (as people often do). For instance, T01.02825ff. has this passage (shown here in the original BNC sequence):

Jean003:    yeah Chris is yeah yeah
Harold001:    she is Chris she’s a-coming
Jean003:     just with the kids
Harold001:    with two kids
At the beginning of the CHRISTINE project, I envisaged that such a passage might be analysed in such a way that just with the kids would be treated as adjunct material within the she’s a-coming clause, and with two kids perhaps tacked on as an appositional element subordinate to with the kids. But it quickly became apparent that this approach would not lead to predictable analyses — it creates too many debatable alternatives, particularly when speaker B completes speaker A’s sentence and then speaker A also completes it in the same or different wording (again a frequent scenario). Accordingly, trees crossing speaker-turn boundaries are forbidden. (This rule, though evolved independently, turned out to coincide with the rule adopted for the Switchboard Corpus (Meteer et al. 1995: §1).) In CHRISTINE, the dialogue above is structured as:
Jean003:    yeah [ Chris is yeah yeah | just with the kids ]
Harold001:    [ she is Chris ] [ she’s a-coming | with two kids ]
where square brackets enclose main clauses, and the “|” symbol marks source-unit boundaries. (The fact that the physical timing of the words was different from this is marked in the source transcription field.)

6. The Text Files

6.1  The Structure of Text Files

CHRISTINE text files use a fixed-field file structure which is intended to be transparent to manual inspection (that is, a non-expert newcomer who scans a file should be able to grasp as much as possible of what is going on in the dialogue), while making it easy, through regularity of structure, to write code to extract information automatically. The file structure is also somewhat similar to that of the SUSANNE Corpus, though differences in the nature of the information recorded unfortunately made it impossible to use identical structures. (The SUSANNE Corpus was already published years before I began to plan the CHRISTINE Corpus.)

Each CHRISTINE text file consists of a sequence of lines terminated by newline characters; each line contains a sequence of fields separated by tabs. Tab and newline, codes 9 and 10, are the only nonprinting characters found in the text files (the space character, code 32, never occurs). Among the ASCII printing characters, i.e. codes 33 (!) to 126 (~), no use is made of the characters:

$ ( ) ; \ ^ ` ~
(codes 36, 40, 41, 59, 92, 94, 96, 126).
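
The character inventory just described lends itself to a simple sanity check. The following Python fragment is only a sketch: the function name unexpected_chars is hypothetical, and it assumes a line has already been read from a text file as a string.

```python
# The eight printing characters that the text above says are never used.
UNUSED = set("$();\\^`~")


def unexpected_chars(line):
    """Return the set of characters in `line` that the stated inventory rules out."""
    bad = set()
    for ch in line:
        code = ord(ch)
        if code in (9, 10):
            continue  # tab and newline are the only permitted nonprinting characters
        if not 33 <= code <= 126 or ch in UNUSED:
            bad.add(ch)  # covers space (32), non-ASCII, and the eight unused marks
    return bad
```

An empty result means the line is consistent with the inventory; any other result lists the offending characters.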

As a specimen, here is the initial part of file T02.tx.  (As a consequence of Web technology, tabs as field dividers are simulated using HTML entities.)

T02_0003        =====   011303  m
T02_0006        -----   Gemma006
T02_0009        .....   00325
T02_0012        0050761 *       PPH1    it      [S[Ni:s.Ni:s]
T02_0015        0050770 |       VBZ     +’s     [Vzb.Vzb]
T02_0018        0050780 |       IIp     per     [P:e.
T02_0021        0050791 |       NNU1c   foot    .P:e]
T02_0024        0050803 |       RTn     then    [Rsw:c.Rsw:c]S]
T02_0027        0050815 |       RRz     so      [S[Rs:c.Rs:c]
T02_0030        0050825 |       PPY     you     [Ny:s.Ny:s]
T02_0033        0050835 |       VMd     +’d     [Vdc.
T02_0036        0050846 |       VH0     have    .Vdc]
T02_0039        0050858 |       TO      to      [Ti:z[Vi.Vi]
T02_0042        0000000 y       YR      #       .Ti:z]S]
T02_0045        0050868 c       YP      {pause} .
T02_0048        0050876 |       DDQ     what    [S?[Dq:o.Dq:o]
T02_0051        0050888 |       VD0     do      [Vo.Vo]
T02_0054        0050898 |       PPY     you     [Ny:s.Ny:s]
T02_0057        0050909 ?       VV0v    want    [Vr.Vr]S?]
T02_0060        -----   Barbara004
T02_0063        .....   00326
T02_0066        0050960 *       MC      eleven  [Nu[M.
T02_0069        0000000 y       YR      #       .
T02_0072        0050974 c       YP      {pause} .
T02_0075        0050982 |       MC      eleven  .
T02_0078        0050996 |       IIb     by      [P.
T02_0081        0051006 |       MC      eleven  [M.
T02_0084        0051020 |       CC      and     [Ns+.
T02_0087        0051031 |       AT1     a       .
T02_0090        0051041 |       NN1c    half    .Ns+]M]P]M]
T02_0093        0051053 .       NNU1c   foot    .
T02_0096        .....   00327
T02_0099        0051085 *       RGQq    how     [S?@[Dq:e.
T02_0102        0051096 |       DA1     much    .Dq:e]
T02_0105        0051108 |       VBZ     is      [Vzb.Vzb]
T02_0108        0051118 ?       DD1a    that    [Ds:s.Ds:s]S?@]Nu]
T02_0111        0051138 c       YP      {pause} .
T02_0114        0051146 ic      YY      {unclear}       [Y.Y]
T02_0117        0051156 c       -       {event16}       .
The first field of each line is an eight-byte CHRISTINE location code of the form Tnn_nnnn, where Tnn is the name of the text, and nnnn is a four-digit number uniquely identifying the line within the text. Successive line-numbers are guaranteed to increase, but are not in general consecutive; they usually increase in threes, but editing and correction of the files sometimes led to insertion of lines with intermediate numbers.
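
As an illustration of the location-code format, the sketch below splits a code into text name and line number and checks that successive numbers increase. The names parse_location and numbers_increase are hypothetical helpers, not part of any distributed software.

```python
import re

# Tnn_nnnn: a text name (T plus two digits), an underscore, and a four-digit number.
LOCATION = re.compile(r"T([0-9]{2})_([0-9]{4})")


def parse_location(code):
    """Return (text name, line number) for a code such as 'T02_0108'."""
    match = LOCATION.fullmatch(code)
    if match is None:
        raise ValueError(f"not a CHRISTINE location code: {code!r}")
    return "T" + match.group(1), int(match.group(2))


def numbers_increase(codes):
    """True if the line numbers strictly increase (they need not be consecutive)."""
    numbers = [parse_location(code)[1] for code in codes]
    return all(a < b for a, b in zip(numbers, numbers[1:]))
```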

6.2  Word Lines v. Header Lines

Lines are divided into two types: header lines, which identify the structuring of the dialogue into units of various levels above the individual words, and word lines, which contain successive spoken words (and certain non-word items, such as identification of “noises off”). In a header line, the second field is composed of five identical punctuation marks (different marks for different categories of header). In a word line, the second field is a seven-digit source location code.
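
The distinction can be tested mechanically by inspecting the second field. The sketch below is an assumption-laden illustration: it takes the three header marks to be five equals signs, five hyphens, and five full stops, as in the examples in §6.3 to §6.5, and the name line_kind is hypothetical.

```python
import re

# Header marks as they appear in the examples of sections 6.3 to 6.5 (assumed exhaustive).
HEADER_MARKS = {"=====": "division", "-----": "turn", ".....": "source-unit"}


def line_kind(raw):
    """Classify one tab-separated line as a header type or as a word line."""
    fields = raw.rstrip("\n").split("\t")
    if len(fields) < 2:
        raise ValueError(f"too few fields: {raw!r}")
    second = fields[1]
    if second in HEADER_MARKS:
        return HEADER_MARKS[second]
    if re.fullmatch(r"[0-9]{7}", second):
        return "word"  # seven-digit source location code
    raise ValueError(f"unrecognized second field: {second!r}")
```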

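The types of header line are:

division header (second field =====)
turn header (second field -----)
source-unit header (second field .....)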

Within the British National Corpus, a file contains one or more “divisions” (nominally, separate conversations, though see §3.2); a division contains one or more <u> (“utterance”) units, each called in CHRISTINE a speaker turn; and a <u> unit contains one or more <s> (“segment” or “sentence”) units, each called in CHRISTINE a source-unit (sometimes abbreviated as “s-unit”) in order to emphasize that this is a division of the wording which stems from our source material and to which we attribute no consistent significance.

CHRISTINE header lines mark the beginnings of corpus sections of these three levels. In consequence:

Thus, the general overall structure of a text file looks like this:
division header
    turn header
        source-unit header
            word
            word
            word
        source-unit header
            word
            word
        source-unit header
            word
            word
            word
            word
    turn header
        source-unit header
            word
            word
            word
        source-unit header
            word
            word
division header
    turn header
        source-unit header
            word
            word
            word
            word
    turn header
        source-unit header
            word
            word
            ...
            word
            word
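
The nesting shown above can be recovered from the flat sequence of lines in a single pass. The following sketch assumes the header marks =====, -----, and ..... seen in the examples of §6.3 to §6.5; the function name build_structure and the dictionary keys are hypothetical.

```python
def build_structure(lines):
    """Group tab-separated lines into divisions > turns > source-units > word lines."""
    divisions = []
    for raw in lines:
        fields = raw.rstrip("\n").split("\t")
        mark = fields[1]
        try:
            if mark == "=====":
                divisions.append({"header": fields, "turns": []})
            elif mark == "-----":
                divisions[-1]["turns"].append({"header": fields, "units": []})
            elif mark == ".....":
                divisions[-1]["turns"][-1]["units"].append(
                    {"header": fields, "words": []}
                )
            else:
                # Anything else is taken to be a word line.
                divisions[-1]["turns"][-1]["units"][-1]["words"].append(fields)
        except IndexError:
            # A line appeared before the header that should have introduced it.
            raise ValueError(f"line out of place: {raw!r}") from None
    return divisions
```

Each word line is kept as its list of fields, so later processing can pick out the word field, parse field, and so on by position.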
word

6.3  Division Headers

A division header has four fields; an example is:

T02_0003    =====    011303    m
The four fields are:

CHRISTINE text files are always continuous extracts from BNC files; consequently, a division coded e must be the first division in a CHRISTINE file, a division coded b must be the last division, and a division coded m must be the sole division in the CHRISTINE file.
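
The positional constraints on the codes e, b, and m can be checked as follows. This is a sketch only: it tests nothing beyond the three constraints stated above, treats any other code as unconstrained, and the name division_code_problems is hypothetical.

```python
def division_code_problems(codes):
    """Given the fourth-field codes of a file's division headers, in order,
    return a list of messages for codes that violate the stated constraints."""
    problems = []
    last = len(codes) - 1
    for index, code in enumerate(codes):
        if code == "e" and index != 0:
            problems.append(f"division {index}: code e but not first")
        if code == "b" and index != last:
            problems.append(f"division {index}: code b but not last")
        if code == "m" and len(codes) != 1:
            problems.append(f"division {index}: code m but not sole division")
    return problems
```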

6.4  Turn Headers

A turn header has three fields; an example is:

T02_0060    -----    Barbara004
The three fields are:

6.5  Source-Unit Headers

A source-unit header has three fields; an example is:

T02_0096    .....    00327
The three fields are:

A complication in the BNC file structure is that its hierarchical division into <div>, <u>, and <s> units does not account exhaustively for all the vocal material recorded in the files. There are certain forms which are treated as <u> units but not included in any <s> unit; and certain other forms which are treated as part of a <div> but are outside all <u> and <s> units. CHRISTINE treats a form of either of these types as a separate speaker turn consisting of a single source-unit; the third field of the source-unit header contains five zeros, if the form is a BNC <u> unit, and five hyphens, if it is not. Examples of both occur adjacent to source-unit T40.00141, what are you doing here. The immediately-following {laugh} entity, attributed to an unidentified speaker, is in the original BNC file a <u> unit not containing an <s> unit; the immediately-preceding {unclear} entity, attributed to Sadie148, is in the BNC file an empty tag outside all <u> and <s> tags.[10]
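
The three possibilities for the third field of a source-unit header can be distinguished as in this sketch. The function name and the returned descriptions are hypothetical; any value other than five zeros or five hyphens is simply assumed to be an ordinary unit number.

```python
def source_unit_kind(third_field):
    """Interpret the third field of a source-unit header line."""
    if third_field == "00000":
        return "BNC <u> unit containing no <s> unit"
    if third_field == "-----":
        return "form outside all BNC <u> and <s> units"
    return "ordinary source-unit"
```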

6.6  Word Lines

A word line has six fields; a typical example is:

T02_0108    0051118    ?    DD1a    that    [Ds:s.Ds:s]S?@]Nu]
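The six fields are, in order: the CHRISTINE location code, the source location code, the source transcription field, the wordtag field, the word field, and the parse field. The CHRISTINE location code was described above (§6.1). There follow descriptions of the other five fields in a word line.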

6.7  Source Location Code

The purpose of this field is to link each word in a CHRISTINE text to its location in the source file from which the text was extracted — in the case of CHRISTINE, from some file in Release 1.0 of the British National Corpus. BNC contains various categories of information which were judged to have little relevance to the aims of the CHRISTINE project and are not preserved in the CHRISTINE Corpus; one example is detailed information about the relative timing of various speakers’ wording, in cases where speakers interrupt one another or speak simultaneously. The source location field is provided in order to enable users who need to do so to check CHRISTINE wording against the original BNC file.

The BNC filename corresponding to a CHRISTINE text is given in the notes file for that text; for instance, CHRISTINE text T02 is extracted from BNC file KB6. The seven-digit source location code within the CHRISTINE text file locates the individual word (in the example, the word that) within the relevant BNC file. Because BNC files are based on an SGML structure rather than on fixed-field records, the location reference which appears in CHRISTINE ignores the internal structure of the BNC file and uses a simple byte count from the beginning of the file.

In BNC, each word uttered is enclosed within an SGML <w> ... </w> tag. The source location code 0051118 means that the character < at the start of the <w> element to the left of the word that is the 51118th byte in the BNC file (counting its initial byte as 1, not 0). Where the contents of a CHRISTINE word field correspond to an “empty” SGML tag in BNC (e.g. an indication of a silent pause or a non-speech noise), the CHRISTINE source location code identifies the opening < of the empty SGML tag.

An exception occurs in cases where an item treated as a single word in BNC is by the rules of the SUSANNE/CHRISTINE annotation scheme split into two or more words on successive lines in CHRISTINE. For instance, BNC (Manual, p. 97ff.) treats various phrases containing multiple orthographic words as single units within a single <w> element — often these correspond to SUSANNE “idioms”, sometimes they do not, but in either case the separate orthographic words appear on separate lines in CHRISTINE. Also, occasionally BNC erroneously runs together items which ought by its own standards to be separate words — for instance, text T21 contains a passage where the BNC source has a “word”:

five.The
produced by leaving out spaces between adjacent orthographic sentences, and in CHRISTINE this form is split into its separate words. In such cases, where words after the first do not have their own <w> tag in the BNC file, the intention was that the CHRISTINE source location code should identify the first character in the BNC file belonging to the respective word. Unfortunately, misunderstandings within the CHRISTINE project meant that this plan was not correctly executed, so that the byte count for such a word will fall within the appropriate BNC <w> element but in some cases will not coincide with the initial character of the relevant word.

Where a CHRISTINE word line represents an analytic item supplied as part of the structural annotation and having no equivalent in the BNC source — for instance, a “ghost” node (EFC, p. 353ff.) or “trace” representing the logical position of a constituent which appears elsewhere in surface structure — the source location code is a sequence of seven zeros.
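
The byte-count convention of this subsection can be sketched in Python as follows. The names bnc_offset and starts_at_tag are hypothetical, and the BNC file contents are assumed to have been read as a bytes object; because the count starts at 1, one is subtracted to obtain a Python index.

```python
def bnc_offset(code):
    """Convert a seven-digit source location code to a 0-based byte index.
    Returns None for seven zeros, i.e. an item with no BNC equivalent."""
    if len(code) != 7 or any(ch not in "0123456789" for ch in code):
        raise ValueError(f"not a source location code: {code!r}")
    count = int(code)
    return None if count == 0 else count - 1  # documentation counts from 1


def starts_at_tag(bnc_bytes, code):
    """True if the byte addressed by `code` is the '<' that opens an SGML tag."""
    offset = bnc_offset(code)
    if offset is None:
        return False
    return bnc_bytes[offset:offset + 1] == b"<"
```

For the split words discussed above, starts_at_tag would be expected to return False, since those codes point inside a <w> element rather than at its opening character.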

6.8  Source Transcription Field

The source transcription field is used to record various categories of information about the wording as transcribed in the source files (in the case of CHRISTINE, the BNC files) which it is convenient to eliminate from the CHRISTINE word field. The source transcription field always contains a string of one or more characters; in the majority of word lines, to which none of the relevant categories of information applies, the field contains a pipe symbol, “|”, as placeholder.

Source transcription fields not containing the one-byte string “|” contain some combination of one or more of the following elements, in the order given:

& y i I c s t * punctuation-marks
Many of these items are mutually incompatible, but some source transcription fields do contain more than one character.
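
A field of this kind can be separated into its coded elements and any trailing punctuation. The sketch below assumes that each coded element occurs at most once and that everything after the coded elements is punctuation; the name split_transcription is hypothetical.

```python
# Coded elements in the order in which they may appear.
ORDER = "&yiIcst*"


def split_transcription(field):
    """Return (list of coded elements, punctuation string) for one field."""
    if field == "|":
        return [], ""  # placeholder: nothing to record
    elements = []
    next_allowed = 0  # position in ORDER from which the next element must come
    i = 0
    while i < len(field) and field[i] in ORDER:
        position = ORDER.index(field[i])
        if position < next_allowed:
            raise ValueError(f"elements out of order in {field!r}")
        elements.append(field[i])
        next_allowed = position + 1
        i += 1
    punctuation = field[i:]
    if any(ch in ORDER for ch in punctuation):
        raise ValueError(f"coded element after punctuation in {field!r}")
    return elements, punctuation
```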

The meanings of the various items occurring in source transcription fields are as follows:

In a very few cases, a word in the BNC sources is immediately followed by more than one punctuation mark.  For instance the BNC file from which text T19 was extracted has a number of instances of a word followed by question mark and comma, in that order.  In such cases, both marks appear in the CHRISTINE source transcription field in the appropriate order, after any of the characters listed earlier as valid elements in source transcription fields.  (In every case of multiple punctuation marks, the BNC file treats them as separate “words” — there are no cases in CHRISTINE where, say, a sequence of two hyphens has been used to represent a single dash.  There is one case, discussed in the relevant notes file, where BNC represents a punctuation mark as separated from the preceding word by a space; in all other cases punctuation marks in the source transcription field should be understood as attached directly to the end of the word field contents.)

6.9  Wordtag Field

The wordtag field contains a code representing the grammatical classification of the word. Wordtags are normally strings of two or more characters, beginning with two capital letters, drawn from the class of wordtags defined in EFC supplemented by some additional wordtags for spoken English (listed in §13.1). (Note that the relatively precise set of wordtags for spoken “discourse items”, defined on pp. 447-8 of EFC, is used in place of the generic “interjection” wordtag UH defined on p. 118 of that book; UH does not occur in the CHRISTINE Corpus.) The only cases where the wordtag field contains something other than a string beginning with two capitals are lines representing non-speech “events” (§8.5), where the wordtag field contains a hyphen.

6.10  Word Field

A word field contains a sequence of one or more characters; these sequences fall into three classes, distinguished by the contents of the source transcription field on the relevant line:

On the rules for standardizing the orthography of word field contents, see §7.2.

As in the SUSANNE Corpus, enclitics such as those at the end of the words won’t, she’d, are treated as separate words on lines of their own.  The Germanic genitive suffix as in John’s book is treated as an enclitic for purposes of word division.  In the few cases where a form ending phonetically in a sibilant must in context be seen as a regular genitive plural, as in both girls’ books, the apostrophe alone is split from the stem and treated as a separate word on its own line.  (In the context of speech this is an odd procedure, since this “word” never has any phonetic substance at all, but it is the logical consequence of the preceding rules which in most cases give sensible and convenient results.)

Again as in SUSANNE, whenever the contents of a word field would in ordinary English orthography follow the contents of the preceding word field immediately, without an intervening space or spaces, a plus sign is prefixed to the later word field.  Thus the word won’t is divided between two lines as

wo
+n’t
(note that nothing in the earlier word field marks wo as something other than an independent word); and gotta as a reduced form of got to is represented in CHRISTINE as
got
+ta
The CHRISTINE words are tagged and otherwise analysed like their unreduced equivalents:  wo, +n’t, and +ta are given the same wordtags as will, not, and to respectively.
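
Conversely, ordinary orthographic words can be reassembled from word fields by attaching each plus-prefixed item to its predecessor. This is a minimal sketch; the name join_words is hypothetical, and non-word items such as pauses are not given any special treatment.

```python
def join_words(word_fields):
    """Merge plus-prefixed word fields into the preceding word."""
    words = []
    for field in word_fields:
        if field.startswith("+") and len(field) > 1:
            if words:
                words[-1] += field[1:]  # no space before an enclitic
            else:
                words.append(field[1:])  # nothing to attach to; keep the bare form
        else:
            words.append(field)
    return words
```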

(The other main use of the plus sign in word fields of the SUSANNE Corpus, in connexion with punctuation marks, is not relevant to the CHRISTINE Corpus.  When BNC transcriptions contain words followed by punctuation marks, the punctuation is moved into a different field in CHRISTINE as not part of the spoken material.  Punctuation marks attached to the beginnings of words, such as left bracket or opening inverted commas, do not occur in the speech transcriptions.)

On the treatment of hyphenated forms, see §7.12.

6.11  Parse Field

Parse fields in successive CHRISTINE word lines define a labelled tree structure over the corresponding sequence of word_wordtag pairs, considered as leaves of the tree, in the same manner as in the SUSANNE Corpus.

A parse tree for a sequence of words is represented as a labelled bracketing, with labels always repeated in full inside each of paired brackets (immediately following an opening square bracket, and immediately preceding a closing square bracket), and with no spacing between adjacent bracket/label strings (the label of the first opening bracket is immediately followed by the second opening bracket, and so on).  The character string for an entire tree (“tree string”) is divided between the parse fields of successive word lines, in a way that is rather cumbersome to define in words but which is natural and easily grasped from an example:  cf. the sample from T02.tx shown earlier in this section.

In every word line, the parse field contains a full stop, representing the word_wordtag pair. To the left of the stop is shown the maximal subsegment of the tree string which consists entirely of labelled opening brackets such that the last one represents the node whose first daughter is the word_wordtag pair in question. To the right of the stop is shown the maximal subsegment of the tree string which consists entirely of labelled closing brackets such that the first one represents the node whose last daughter is the word_wordtag pair in question. It follows that a word which occurs medially within the tagma immediately dominating it will have a parse field consisting just of a full stop character.

Referring back to the T02.tx sample: Gemma006’s turn begins with a tree whose root is labelled S and has four daughter nodes labelled Ni:s, Vzb, P:e, and Rsw:c respectively. The first and second of these nodes, labelled Ni:s and Vzb, each immediately dominates a single leaf node, it_PPH1 and the enclitic +’s_VBZ respectively. The P:e node has the two leaves per_IIp and foot_NNU1c as daughters. Gemma006’s second tree again has an S root with four daughter nodes, labelled Rs:c, Ny:s, Vdc, and Ti:z, and the last of these has daughters of which the first is itself nonterminal, labelled Vi (and the second is an analytic element, #, indicating the fact that the Ti:z tagma is incomplete). This is followed by a degenerate “tree” in which a silent pause, {pause}_YP, is both root and sole terminal node; and the turn finishes with a tree having a root labelled S? and four daughters each of which immediately dominates a leaf.
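
The way successive parse fields combine into trees can be illustrated with a short stack-based sketch. It assumes that every line supplied carries exactly one full stop in its parse field and that node labels never contain a full stop or a square bracket; the function names and the tuple representation of nodes are hypothetical.

```python
import re


def split_parse_field(field):
    """Split a parse field into (opening labels, closing labels) around the full stop."""
    left, stop, right = field.partition(".")
    if not stop:
        raise ValueError(f"no full stop in parse field: {field!r}")
    opening = re.findall(r"\[([^\[\]]+)", left)    # label follows each "["
    closing = re.findall(r"([^\[\]]+)\]", right)   # label precedes each "]"
    return opening, closing


def build_trees(rows):
    """rows: (word, wordtag, parse field) triples in file order.
    Returns a list of trees; a node is (label, children), a leaf is 'word_wordtag'."""
    roots, stack = [], []
    for word, wordtag, field in rows:
        opening, closing = split_parse_field(field)
        for label in opening:
            node = (label, [])
            (stack[-1][1] if stack else roots).append(node)
            stack.append(node)
        (stack[-1][1] if stack else roots).append(f"{word}_{wordtag}")
        for label in closing:
            if not stack or stack[-1][0] != label:
                raise ValueError(f"unmatched closing label {label!r}")
            stack.pop()
    if stack:
        raise ValueError("tree string ended with unclosed brackets")
    return roots
```

Applied to the first five word lines of the sample, this yields a single S node with four daughters, matching the description above; a line whose parse field is a bare full stop outside any bracket yields a leaf standing alone as a degenerate tree.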

A parse tree is always complete within a single speaker turn (and therefore, a fortiori, within a single text division); in other words, turn header and division header lines never interrupt a parse tree.  Source-units, on the other hand, are segments of the BNC transcriptions which are preserved in CHRISTINE for reference purposes but do not necessarily correspond to any linguistic realities.  Therefore source-unit header lines may, and often do, occur medially within parse trees.

Definitions of the meanings of the bracket labels, S, Ni:s, etc., are outside the purview of the present document.  The bulk of the labelling scheme is defined in great detail in EFC.  Much of that book deals with notation that is equally applicable to written or spoken English; its Chapter 6 describes features of the scheme applying particularly to speech.  The present documentation file, in §13-14, does list and discuss additional speech annotation symbols and guidelines which have proved necessary in the light of experience with the CHRISTINE project, but those sections are written on the assumption that readers are familiar with the contents of EFC.

7. Orthography

7.1  Phonetic Transcription

I begin this section by identifying the system of phonetic transcription used, because later subsections of the present document include such transcriptions.

Phonetic transcriptions are shown (in this documentation file, and in CHRISTINE text files) as character-sequences enclosed in square brackets, using the SAM-PA broad phonetic notation for English, as defined in Gibbon et al. (1997: 699ff.); except that, for reasons discussed in Sampson (forthcoming), the ampersand rather than opening curly bracket symbol is used for the pat vowel.  (The SAM-PA system assigns the ampersand symbol to a slightly different vowel which does not occur in English.)

Linguists for whom the “emic/etic” distinction is important might prefer to enclose broad phonetic notation such as that of the SAM-PA system within slashes (solidi) rather than square brackets.  For the CHRISTINE Corpus, the distinction has little significance, and we use square brackets with phonetic notation in all cases.

(The CHRISTINE files released so far contain only one instance of a form represented by phonetic transcription, but the full CHRISTINE Corpus will include many more cases in its text files.)

7.2  Spelling Standards

Different BNC speech files were transcribed by various workers, and it does not appear from the finished BNC that strong copy-editing standards were imposed, either via instructions to the transcribers or via post-editing. There is considerable orthographic variation, including quite a number of straightforward spelling mistakes.

For computer processing it is desirable that orthographic details should be standardized wherever possible, even in cases where the norms of English permit variation between alternative forms. CHRISTINE treats the orthographic usage of the Concise Oxford Dictionary (8th edition, 1990) as standard. Spellings in the original BNC files which deviate from COD usage are changed in CHRISTINE wordfields to agree with COD. Where COD lists alternative spellings (e.g. gaol, jail), CHRISTINE uses the one shown as primary in COD. The -ize form of the -ize/-ise suffix is used.[13]

Wherever orthographic forms in CHRISTINE deviate, for this or other reasons, from the form in the original BNC file, a note of the difference is included in the relevant notes file.

In some particularly common cases, changes to the orthography of the BNC transcriptions are made without logging them in the notes files:

Applying this standardization policy consistently depends on the analyst spotting cases where BNC orthography needed to be checked against COD; we had no means of enforcing the policy automatically. There will be cases which have escaped us. But I believe such cases in CHRISTINE are few.

7.3  Unusual Spellings of Christian Names

In some cases, BNC transcriptions include unusual spellings of Christian names.  In recent years, it has become fashionable for unusual spellings of traditional Christian names (particularly girls’ names) to be used as individuals’ official names.  However, there is no way that a BNC transcriber could have known that a particular individual mentioned in a dialogue, but not a participant in it, spelled his or her name in a special way; so CHRISTINE replaces such spellings with standard spellings.  (Where a person shown in BNC with an unusually-spelled name was a dialogue participant, his or her true name is changed in CHRISTINE anyway, for anonymization purposes.)

7.4  Nonstandard Wording v. Nonstandard Pronunciation

The issue of orthographic standardization relates only to standardizing the written representation of whatever words were uttered by speakers. Speakers’ words are not themselves changed, when their usage deviates from the standard. An idiosyncratic written form which seems to represent a purely phonological dialect variation is replaced by standard spelling: for instance, in text T39 the BNC original has the form wents, apparently representing a Liverpool pronunciation of went, and this is changed to went in CHRISTINE file T39.tx (with a record of the change in T39.nb). It is very unlikely that a BNC transcriber who recorded an occasional nonstandard pronunciation in this manner would do so consistently. But when the speaker’s words themselves, or the structure in which the words are arranged, differ from standard English, CHRISTINE records what was actually said, not what would be said in the standard language. A Northern speaker’s nowt for nothing appears as nowt in CHRISTINE; he done it as a nonstandard equivalent of he did it appears as he done it. This type of usage variation, which is lexical or grammatical rather than phonological, is part of the subject-matter of the CHRISTINE enterprise, to be preserved for study rather than discarded; and non-expert transcribers would normally be well able to reproduce it consistently.

An intermediate case is where a spelling is standardly used to represent some special pronunciation. It is not “idiosyncratic” for a transcriber occasionally to write ’e for an H-less pronunciation of he, or an’ for a reduced form of and. Logically, it might have been appropriate to change ’e and an’ to he and and in CHRISTINE (again there is unlikely to have been any consistency in transcribers’ use of the deviant orthography), but this was not done; where arguments are evenly balanced, it seemed best not to alter the material in our sources.

7.5  The Word Cos

The spelling cos deserves special mention. The standard-English word because is often reduced to a monosyllable which novelists, etc., frequently show as cos, and many examples of this form appear in our source material. There is an argument for saying that colloquial cos and standard because, while undoubtedly sharing a common origin, should be regarded as separate grammatically-distinct words in current spoken usage. Because in standard English is a subordinating conjunction; colloquial cos very often (impressionistically, more often than any other subordinating conjunction) begins a clause which stands alone, or which displays only a vague logical relationship with what precedes (as if cos were a co-ordinating conjunction like and), rather than being used to express the precise causative or inferential relationship expressed by because in written English.

The CHRISTINE Corpus preserves the orthographic form cos where it occurs in the sources, rather than standardizing it to because, but in other respects the word is treated as identical to because: a clause beginning with cos is analysed as an adverbial subordinate clause (Fa), even if it is used as an independent statement. This arguably is a distortion of the structural realities of the spoken language.[14]

7.6  Of After Modal Verb

Another nonstandard orthographic practice proved controversial. There are many cases in the BNC transcriptions where perfective forms are written with of in place of have, e.g. could of been. This is an orthographic deviation of long standing in English; Caldwell (1998) quotes it as used by the American writer Booth Tarkington in his 1914 novel Penrod as a device to suggest lack of interest in education. My assumption was that this should be classified as a spelling mistake on the transcribers’ part; when unstressed, both have and of are regularly reduced to the pronunciation [@v], and some people choose the wrong spelling for this pronunciation in the relevant context. Consequently, the policy adopted in CHRISTINE is to change sequences such as could of been to could’ve been. But one of my researchers who has an English-teaching background urged strongly that this policy is misguided, because for many speakers the word really is of rather than have, so that a sequence like could of been should be seen not as a spelling mistake but as grammatical deviance. Presumably this would imply that the speakers in question would sometimes produce a full vowel, [Qv], in such a sequence.

Even if that is so, nothing could guarantee that BNC transcribers wrote of in just those cases where the respective speakers thought of the word as of (in most if not all cases the sound on the tape will have had an obscure vowel). But I record the disagreement here, because it seems a matter of some linguistic interest. On the face of it, the pervasiveness of of for have in the writing of perfective forms is surprising, because the logic of English verb groups might seem to make it obvious that [@v] in this context stands for have (no-one, surely, would ask Of you seen this? instead of Have you seen this?). The notes files show which cases of +’ve in CHRISTINE correspond to of in BNC.

7.7  Worth After Genitive

There is variation in the BNC transcriptions between the use of s-apostrophe and plain plural forms in the construction (ten) pounds’ worth/pounds worth. CHRISTINE uses the s-apostrophe form, as in standard orthography, and analyses the wording before worth as a genitive phrase. (Note however that one speaker uses the phrase seventy-two pound worth, T13.00974; no attempt is made to represent this as a genitive construction.)

7.8  Capitals

With respect to initial capitals, the contents of CHRISTINE word fields are intended to display words as citation forms, not as they would appear in a running text. Thus the name London, or the pronoun I, appear with capitals in CHRISTINE, because these words are intrinsically capitalized. On the other hand, if an utterance begins with the article the, this will be shown in the CHRISTINE wordfield as the even though an ordinary transcription would capitalize it as sentence-initial. The aim of CHRISTINE is to identify the individual words used by speakers, in a standard format, and to show via the parsefield annotations what spoken grammatical structures the words enter into — it is not to mimic the special structural norms of the written language.

This policy has consequences which users may find surprising, in the case of proper names derived from universal terms. For reasons which are discussed at length in EFC, pp. 86-90, and will not be rehearsed here, the SUSANNE wordtagging standards define the concept “proper noun” in a very restrictive way: except for names of persons, universal terms used as names are wordtagged as universal terms, and hence in the CHRISTINE Corpus they appear in lower case. The SUSANNE annotation scheme treats “being a name” as a property of the syntactic category noun phrase, rather than of individual words. Hence (to use the EFC example) a reference to the town of Flagstaff in Arizona would appear in the CHRISTINE wordfield as flagstaff, classified in the wordtag field as a countable common noun, though the parsefield would show the wordtag dominated by a tagmatag Nn..., “proper noun phrase”.

Cases where word-initial capitals in the original BNC transcriptions correspond to lower case in CHRISTINE files are reflected by asterisks in the CHRISTINE source transcription field (§6.8). In some cases, notes on orthographic corrections in the notes files show that a lower-case form in the original BNC transcription has been capitalized, because standard English orthographic norms require a capital although the transcriber failed to write one; but what this means in practice is that the form occurs in lower case in the CHRISTINE word field (because the SUSANNE annotation rules require it to appear in lower case) but an asterisk occurs in the source transcription field, showing that it would be capitalized in a standard orthographic transcription. For instance, at T06.00637 the original BNC transcription has the form the incredible hulk.  The speaker was referring to a cartoon character, the Incredible Hulk; so the notes file has a comment to the effect that incredible hulk has been emended to Incredible Hulk, but (because the words incredible and hulk are universal terms) the CHRISTINE text file actually displays the words as incredible and hulk, with asterisks in the corresponding source transcription fields.

(However, correcting the BNC transcribers’ use of capitals with names formed from universal terms is not an issue on which great effort has been expended. For instance, at T26.00756 the original BNC file contains the spelling house of commons, referring to the lower House of the UK Parliament. By the SUSANNE annotation rules the words house and commons must be shown in the CHRISTINE wordfields in lower case. The phrase fits the SUSANNE definition of a proper noun phrase, so the CHRISTINE parsefields give it an Nn... tag. Probably any publisher’s stylebook would require the phrase to be written House of Commons, contrary to the BNC transcriber’s usage, so ideally the notes file ought to include a note “house of commons emended to House of Commons” and the source-transcription fields for house and commons should be given asterisks; but this has not been done.)

There is variation in the BNC files between God and god, in exclamations referring to the Deity. These cases are standardized to the form god in the word field with an asterisk in the source transcription field, i.e. they are treated as if the original transcription had been God in every case; the notes files do not record which examples had lower-case g in the original.

7.9  Acronyms

A special issue arises in the case of acronyms. The original BNC files differentiate between acronyms which are pronounced as ordinary words (e.g. NATO, said [neIt@U]) and acronyms which are spelled out letter by letter (e.g. USA, said [ju:eseI]). Cases like NATO are treated in BNC as single words (written in modern style without abbreviatory stops); cases like USA are transcribed as U S A, i.e. each individual letter is treated as a separate word. For grammatical annotation, the latter treatment is inconvenient; an acronym such as USA functions grammatically as “one word” as much as an acronym like NATO, and treating USA as three words is not an appropriate way of indicating the conventional pronunciation of the word. (We do not transcribe knight in a special way to indicate that it is conventionally pronounced [naIt] rather than [knIxt].) Consequently, USA-type acronyms are assimilated to NATO-type acronyms in CHRISTINE text files; but the notes files log such cases, with remarks in the form “U S A emended to USA”.
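
As a rough illustration of the assimilation just described, the sketch below joins runs of single capital letters and records a remark in the “U S A emended to USA” form. It is a minimal sketch under stated assumptions, not the CHRISTINE tooling: the function name and the regular expression are hypothetical, and a run that happens to adjoin the pronoun I would be wrongly absorbed, so the output would need checking by hand.

```python
import re

# Two or more single capital letters separated by single spaces, e.g. "U S A".
SPELLED_OUT = re.compile(r"\b[A-Z](?: [A-Z])+\b")


def join_spelled_acronyms(text):
    """Return (new_text, notes), with letter-by-letter acronyms written solid."""
    notes = []

    def join(match):
        original = match.group(0)
        joined = original.replace(" ", "")
        notes.append(f"{original} emended to {joined}")
        return joined

    return SPELLED_OUT.sub(join, text), notes
```

An acronym already written solid, such as NATO, contains no internal spaces and is therefore left untouched.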

7.10  Special Characters

Where the COD standard orthography of a word uses an accented or other non-ASCII character, this is represented as an ISO 8879 Appendix D entity (Goldfarb 1990: 506ff.) within angle brackets, e.g. d<eacute>tente.
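
A user who wants ordinary accented characters back could decode these entities along the following lines. This is a sketch only: the table covers a handful of entity names chosen for illustration (the full ISO 8879 sets are much larger), and the function name is an assumption. Names not in the table, including the anonymization elements <name>, <address>, and <telNo>, are left as they stand.

```python
import re

# Illustrative subset of entity names; extend as needed for real use.
ENTITY_TO_CHAR = {
    "eacute": "\u00e9",  # e with acute accent
    "egrave": "\u00e8",  # e with grave accent
    "agrave": "\u00e0",  # a with grave accent
    "ccedil": "\u00e7",  # c with cedilla
    "ocirc": "\u00f4",   # o with circumflex
}

ENTITY = re.compile(r"<([A-Za-z0-9]+)>")


def decode_entities(wordfield):
    """Replace known <entity> names with the corresponding character."""

    def substitute(match):
        # Unknown names are returned unchanged rather than guessed at.
        return ENTITY_TO_CHAR.get(match.group(1), match.group(0))

    return ENTITY.sub(substitute, wordfield)
```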

In the SUSANNE Corpus, the entity <apos> is used for the apostrophe, to distinguish it from the inverted comma. In a speech corpus, inverted commas do not occur, so the ASCII character (code 39) unambiguously means apostrophe and is used as such in CHRISTINE.

7.11  Anonymized Forms

The elements <name>, <address>, and <telNo> are used in CHRISTINE wordfields to represent names, addresses, and telephone numbers which have been removed from the transcriptions for anonymization purposes. (See §4.1.)

7.12  Hyphens

Hyphenation is an area where practice in CHRISTINE deviates from the SUSANNE norms, which often require written hyphens to be analysed as separate “words” with their own wordtag and place in a parsetree. This would be inappropriate in an analysed corpus of speech, since hyphens never have any phonetic reality. The policy applied in CHRISTINE is as follows:

In the case of set phrases, the authority of COD is used to decide whether to treat them as single words written solid, single words written with hyphens, or separate words (e.g. wineglass, wine-glass, or wine glass). A case written by COD like wineglass or wine-glass is treated as one word and appears on a single line in CHRISTINE, with a single wordtag; a case written by COD like wine glass is treated in CHRISTINE as a sequence of words each with its own wordtag on a separate line, linked only by higher-level structuring in the parsefields. If the spelling in the original BNC file deviates from the COD standard, as in other cases of orthographic discrepancy the BNC spelling is changed to match COD, with the emendation logged in the notes file.

CHRISTINE does, however, depart from the COD orthographic standard in cases where COD treats a phrase as hyphenated but the phrase is formed productively and the SUSANNE annotation scheme would treat it as a sequence of words with internal grammatical relationships. For instance, at T34.02863 the BNC file includes the form blonde-haired, and although COD does not contain blonde-haired as a headword it is clear from examples in the dictionary that it treats forms of the pattern X-haired as hyphenated words. However, the SUSANNE annotation scheme treats the form as an adjective phrase consisting of blonde as a JJ word followed by haired as a JJh word. Therefore CHRISTINE changes the BNC form blonde-haired to blonde haired (logging the change in the notes file), and analyses the form as a two-word phrase.

Conversely, there are cases where a clear verb group contains a word, written separately by the transcriber, which is not a verb; e.g. you ought to be second staging the whole thing W22.00168.  In cases like this, CHRISTINE introduces a hyphen (so that second-staging as a whole is a present participle, and the infinitival verb group has a normal internal structure) irrespective of the transcriber’s orthography.  Typically, such compounds are productive and will not be listed in dictionaries, but it seems likely that they would be hyphenated in careful written English.

The Corpus includes a number of examples of the nonstandard regional present participle with a- prefix, e.g. a-coming (Trudgill 1990: 80), which were sometimes treated as two words, a coming, by the BNC transcribers. These forms are normalized in CHRISTINE to single hyphenated words, a-coming (which are wordtagged like standard present participles).

7.13  Distorted Words

Where an uttered form is too truncated or distorted to be transcribed as a normal word, BNC usually represents it using letters with their conventional English values, as would be done in a representation of broken dialogue in a novel. Here the concept of standardizing spellings to consistent norms does not apply. The option of replacing conventional orthography with phonetic transcription was considered but rejected, because it would involve too much guesswork about what sounds were intended. No attempt has been made to eliminate meaningless variation, such as the representation of a truncated utterance of she as sh or as sh- at different points in the Corpus. Where a distorted form seems to have been capitalized purely because it is “sentence”-initial, the capital is changed to lower case in CHRISTINE. But in some cases capitals on distorted words have been left to stand, because they may possibly have been intended to represent a pronunciation difference (e.g. O v. o could conceivably correspond to [@U] v. [Q]).

If it is possible to infer from context what word was intended, the distorted word is given the appropriate wordtag followed by the normal orthography of the word after a slash (e.g. sh- for she is wordtagged PPHS1f/she — see §13.1, “Slash Wordtags”), and a grammatical structure is assigned as if the word had been pronounced normally.  Where the grammatical category of the intended word is not clear, it is wordtagged FD (distorted word) and is fitted into the surrounding tree as if it were a meaningless noise.
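
For users processing the wordtag field, the slash convention can be unpacked as in the sketch below. The function name is hypothetical, and the sketch assumes that a slash occurs in a wordtag only in this “tag/intended word” role.

```python
def split_slash_wordtag(wordtag):
    """Split e.g. "PPHS1f/she" into ("PPHS1f", "she").

    A wordtag without a slash, such as "FD", yields (wordtag, None).
    """
    tag, slash, intended = wordtag.partition("/")
    return tag, (intended if slash else None)
```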

8. Non-linguistic Items

8.1  Vocalizations v. Events

Curly brackets are used in the wordfield to enclose non-linguistic items: non-linguistic vocal sounds such as coughs, laughs, etc., vocal shifts marking the beginning and end of stretches of speech which the original transcribers noted as being uttered in a special manner, e.g. singing or laughing, and “events” — non-vocal noises such as telephone rings, music playing, etc., together with occurrences such as mechanical breaks in the recording.

In principle there is room for debate about where the line should be drawn in this area between vocalizations and “events”. For instance, children shouting in the background would be vocalization from the point of view of the children, but is recorded as an “event” by BNC transcribers who are attending to others’ speech in the foreground. With a few exceptions, CHRISTINE follows the BNC classification.

One case where CHRISTINE deviates from BNC practice relates to the items recorded in BNC as “vocal” elements with the description attribute “tut”. These elements appear to represent the click sound (indicating various emotions including mild disapproval) which novelists commonly show as “tsk” in dialogue. Although not listed in COD, this might be regarded as an item of English vocabulary rather than as a non-linguistic vocalization; it is shown as tsk, wordtagged UX as an Expletive, in CHRISTINE.

8.2  Silent Pauses

Silent pauses are shown in CHRISTINE as {pause}, wordtagged YP.

8.3  Inaudible Wording

The element {unclear}, wordtagged YY, represents a form or sequence of forms which the transcriber could not identify from the recording.

8.4  Non-linguistic Vocalizations

Non-linguistic vocal sounds are assigned the wordtag YV; and CHRISTINE uses vocalization shift elements, tagged YVL and YVR, to show the beginning and end of stretches of wording to which special vocal properties apply. An item between curly brackets in the wordfield identifies the nature of the vocal sound or vocalization shift. The following standard vocal categories are represented in the CHRISTINE files by name:

{belch}
{clearsThroat}
{cough}
{crying}
{giggle}
{humming}
{laugh}
{laughing}
{moan}
{onTelephone}
{panting}
{raspberry}
{scream}
{screaming}
{sigh}
{singing}
{sneeze}
{sniff}
{whistling}
{yawn}
In the case of noun/present-participle doublets such as {laugh} v. {laughing}, the present participle is used for shifts applying over a word or words; a sound interrupting the sequence of words is commonly shown in BNC with the noun, but sometimes the present participle is used in such cases also, presumably to indicate that the sound lasted for an appreciable period. CHRISTINE simply follows BNC usage in these cases.

Other vocal sounds or shifts with more complex descriptions, or which occur less frequently in conjunction with speech, are given numerical codes and appear in the text files in the form {vocal99}. The meanings of these codes are as follows:

01      imitates woman’s voice
02      imitating a sexy woman’s voice
03      imitating Chinese voice
04      imitating drunken voice
05      imitating man’s voice
06      imitating posh voice
07      mimicking police siren
08      mimicking Birmingham accent
09      mimicking Donald Duck
10      mimicking stupid man’s voice
12      mimicking
13      speaking in French
14      spelling
15      whingeing
16      ch ch i.e. face-slapping noise
17      drowning noises
18      imitates sound of something being unscrewed and popped off
19      imitates vomiting
20      makes drunken sounds and a pretend belch
21      makes running noises
22      sharp intake of breath
23      click

8.5  Non-vocal Events

Non-vocal events are shown in the wordfield as elements of the form {event...}, with a two-digit code identifying the event-type, as follows (trivial differences between BNC event-descriptions, e.g. “phone rings” v. “telephone rings”, are ignored):

01      loud music and conversation
02      banging noise
03      break in recording
04      car starts up
05      cat noises
06      children shouting
07      crash as child falls over
08      dog barks
09      poor quality recording, unable to hear conversation
10      traffic noise
11      in another room
13      intercom on
14      intercom off
15      children making a lot of noise
16      children playing in background
17      lots of kitchen banging around noises
18      loud music is on
19      man in other room shouting
20      microphone interference
21      microphone too far away
22      mouth full
23      moved in to another room to eat and watching This Is Your Life on television
24      music in background
25      music loud in background
26      music playing
27      music too loud to hear and lots of different conversations going on
28      pen on paper
29      telephone rings
30      telephone conversation
31      playing with baby
32      beep
33      poor quality recording and traffic noise, unable to hear conversation
34      clapping
35      tape recording
36      tapping on computer
37      television on very loud
38      television on
39      loud music with speakers singing along and then poor quality recording; unable to make out any conversation
40      tune to Batman
41      unzipping a bag
42      walking through subway, speech echoes
43      water running making it difficult to hear
44      eating
These “events” are given no wordtag, since they are in no sense linguistic elements. In such lines, the wordtag field contains a hyphen as placeholder. Logically speaking, events belong to no speaker turn; they are not given header lines of their own, so, where an event actually occurred between adjacent speaker turns, the CHRISTINE file looks as though the event was the closing item of the immediately-preceding turn and source-unit.

8.6  Duration Markers

Whenever BNC specifies the approximate duration of a pause, nonlinguistic vocalization, or “event”, this is shown as a figure (in seconds) within the curly brackets after an underscore character. Thus {pause_20} means a silent pause lasting about 20 seconds, {humming_7} means humming lasting for about 7 seconds, {vocal19_10} means an imitation of vomiting lasting for about 10 seconds, {event25_600} means loud music lasting about ten minutes, and so forth.[15]
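
The curly-bracket notation of this section is regular enough to be taken apart mechanically. The sketch below is one possible reading of it, offered under assumptions rather than as a definitive parser: the names CurlyItem and parse_curly_item are hypothetical, and the pattern assumes that an item consists of a letters-only name, an optional two-digit code, and an optional underscore followed by a duration in seconds.

```python
import re
from typing import NamedTuple, Optional

# {name}, {nameNN}, {name_SECONDS} or {nameNN_SECONDS}
ITEM = re.compile(r"\{([A-Za-z]+)(\d{2})?(?:_(\d+))?\}")


class CurlyItem(NamedTuple):
    name: str                # e.g. "pause", "humming", "vocal", "event"
    code: Optional[int]      # two-digit code of {vocalNN} or {eventNN}, if any
    seconds: Optional[int]   # approximate duration, if specified


def parse_curly_item(wordfield: str) -> Optional[CurlyItem]:
    """Parse one curly-bracket item; return None for anything else."""
    match = ITEM.fullmatch(wordfield)
    if match is None:
        return None
    name, code, seconds = match.groups()
    return CurlyItem(
        name,
        int(code) if code else None,
        int(seconds) if seconds else None,
    )
```

For instance, parse_curly_item("{vocal19_10}") gives the name vocal, code 19, and a duration of 10 seconds; the code can then be looked up in the tables above.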
 

9.  Annotating Inaudible Wording

The transcriptions contain many passages where, because of poor recording quality or for other reasons, the transcriber has failed to make out the wording; in the CHRISTINE text files these passages are shown with the entity {unclear}.  In context it seems likely that {unclear} elements in many cases were momentary inarticulate sounds occurring in the middle of passages which are well-formed and meaningful without them; in other cases, the {unclear} entity certainly stands in place of a fairly lengthy word sequence, so that the surrounding clear wording is uninterpretable without knowledge of the unclear wording.  In the texts derived from BNC (all the texts contained in CHRISTINE), no distinction is normally made between these cases.

When a substantial segment of an utterance is inaudible, it can often be quite impossible to know how the surrounding clear wording fits into a grammatical structure; the structure prescribed by the annotation scheme might be quite different, depending on which of the alternative plausible turns of phrase is concealed behind the {unclear} notation.  But it is inevitable that annotated corpora based on recordings of real-life speech will contain inaudible passages.  BNC/demographic perhaps has a higher incidence of such passages than some transcribed speech samples (and the CHRISTINE excerpting process made no attempt to avoid passages with much inaudible wording, for fear of biasing the selection of texts), but some incidence of untranscribable wording will occur in any corpus of spontaneous speech.  An annotation scheme needs strategies for dealing with such passages, which recognize the unavoidable structural ambiguities, but specify a predictable annotation nevertheless.

The CHRISTINE project has attempted to formulate a set of annotation rules which prescribe how to draw a tree structure in cases where inaudible wording means that the correct tree structure cannot be known, and which specify what claims are implied by such structures, and what claims are not implied, about the grammatical constructions actually produced by the speakers.  The rules as they stand are not entirely satisfactory; but I hope that describing our practice as it has developed to date, and the problems that remain, may help the discipline to find better ways forward in this area.

Our rules are as follows:

The reason why it has seemed appropriate for these two rules to be non-symmetric is that tagmas in English commonly have distinctive beginnings but not distinctive endings.

Although these rules depend on analysts’ judgements about what is “clearly true”, or “could well be true”, of the grammar of inaudible wording, this is not intended to require analysts to consider every remote possibility and base the analysis on what is absolutely guaranteed to be true or to be impossible.  Decisions must be made on the basis of common-sense judgements about what alternative structures are reasonably plausible in particular cases.

To illustrate, consider the sequence I {unclear} Sophie can walk over, T04.02592.  This could represent something like “I might go by car.  Sophie can walk over.” (in other words, two separate parsetrees).  But it could equally be something like “I suppose Sophie can walk over”, where the wording from Sophie on is a clause subordinate to a verb within the inaudible stretch.  Therefore we annotate the sequence as:
[S [Nea:s I ] [Y {unclear} [S [Nns:s Sophie ] [Vc can walk ] [R:q over ] S] Y] S]
The Sophie ... clause might be an argument of an inaudible verb, so it is placed under the Y even though it might alternatively be an independent sentence.  The word I clearly begins some clause which must continue within the inaudible stretch, so the Y is placed below the root S.
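
Labelled bracketings written in the style of the example above can be read into a tree with a short stack-based routine. The sketch below is illustrative and makes several assumptions: tokens are separated by whitespace, an opening bracket is always fused with its label ([S), a closing bracket may or may not repeat the label (S] or a bare ]), and the names Node, parse_bracketing, and path_to_leaf are hypothetical. It handles the one-line display form used in this documentation, not the field layout of the corpus files.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    label: str
    children: list = field(default_factory=list)  # Node objects or word strings


def parse_bracketing(text):
    """Parse a labelled bracketing; return the list of top-level items."""
    stack = [Node("TOP")]
    for token in text.split():
        if token.startswith("[") and len(token) > 1:
            node = Node(token[1:])
            stack[-1].children.append(node)
            stack.append(node)
        elif token.endswith("]"):
            if len(stack) == 1:
                raise ValueError("unmatched closing bracket")
            node = stack.pop()
            closing = token[:-1]
            if closing and closing != node.label:
                raise ValueError(f"{closing}] closes [{node.label}")
        else:
            stack[-1].children.append(token)
    if len(stack) != 1:
        raise ValueError("unclosed bracket")
    return stack[0].children


def path_to_leaf(items, leaf, path=()):
    """Return the labels dominating the first occurrence of leaf, outermost first."""
    for item in items:
        if isinstance(item, Node):
            found = path_to_leaf(item.children, leaf, path + (item.label,))
            if found is not None:
                return found
        elif item == leaf:
            return path
    return None
```

Applied to the T04.02592 annotation, path_to_leaf shows the {unclear} entity directly below Y, which is in turn below the root S, and shows Sophie inside the lower S that has been placed under Y.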

The rules above fix the shape of the trees, but there are also issues about the node labels.

If the tagmatag Y is used because wording is inaudible (rather than because it was interrupted), then the Y node will normally immediately dominate at least one {unclear}_YY entity.  Occasionally, however, the rules above mean that a Y node has to be created which dominates a YY leaf only at two or more removes.  Thus, in:
[Y+ [CC# and # and # and ] [Fa if er [Y {unclear} ] Fa] Y+]  T39.04812
the rules require the Y node which immediately dominates the {unclear} entity to be placed below Fa (its first words must surely have gone with if as an adverbial clause).  Probably, in reality, the {unclear} entity also included later words belonging to the construction within which the Fa is embedded; but we do not know what these words were, so we have no basis for labelling the higher construction anything more specific than Y+.

In some cases, the result of the rules is to create productions with odd relationships between mother and daughter labels.  Consider, for instance:

[S ... [Fa:t when I’m in love ] [N:s my [Y {unclear} ] N:s] S]  T06.00615
[S? who do you reckon [Fn:o [Ns:s this cheap [Y {unclear} ] Ns:s] Fn:o] S?]  T20.03027
By the rules stated above, the {unclear} entity in the T06 case is placed below the N:s node, creating the oddity of an S containing a Time adjunct followed by a subject noun phrase but no verb or further constituents.  (By comparison with another remark by the same speaker immediately afterwards, it is likely that the inaudible wording here was something like face lights up.)  In the T20 case, we assume that who has been Wh-Fronted out of a nominal clause whose subject began this cheap ...; the rules for Y result in an Fn tagma consisting of a subject noun phrase and nothing else.

These strange productions could be avoided, if we allowed {unclear} entities to be split in two, representing successive subsequences of the inaudible wording; then, in the T06 case for instance, one {unclear} entity could remain within the N:s tagma but another could be placed after it, so that the S tagma would contain the more plausible sequence of daughter-labels “... N:s Y”.  But to split {unclear} entities up in this way would require us to make assumptions which we have no basis for making, about the wording lying behind the transcriber’s {unclear} notation.  For all we know, the speaker in T06 may have broken off her utterance at or before the end of the subject phrase.  CHRISTINE does not split up single {unclear} entities in the sources into multiple entities; we accept occasional odd productions such as those illustrated here as the cost of fidelity to our data.

10.  Annotating Speech Repairs

10.1  Structure Before an Interruption

Where a speaker edits his output “on the fly”, CHRISTINE uses the system described in EFC, p. 448ff., for annotating the structures of repaired speech, which makes heavy use of the symbol # to mark points where a tagma is prematurely interrupted.

The discussion in EFC of the use of the # symbol is explicit about how a # element is fitted into the surrounding tree; but it is not explicit about how an interrupted construction is annotated.  Immediately before a “moment of interruption” there will often be words which were intended to begin tagmas that were never completed.

CHRISTINE practice is to include the tagmatags for interrupted tagmas, provided it is reasonably clear what these were likely to have been — even if the word(s) actually uttered before the point of interruption would not, in fluent, unrepaired speech, justify the relevant tagma.  For instance, in the (invented) example:

I must have the must get the ticket
an interruption point occurs between the and must, and the # symbol will be a daughter of an S node dominating the entire sequence; before the #, the word the (alone) will be tagmatagged N:o, although the alone would never be counted as a noun phrase in fluent, unrepaired speech:
[S [Nea:s I ] [Vc must have ] [N:o the ] # [Vc must get ] [Ns:o the ticket ] S]
Again, the phrase:
[P in [N [G people’s ] ] # in [Np [G people’s ] minds ] ]  W22.00164
is analysed as shown, with the first G subordinate to an N that would certainly have been required in the structure if the speaker had not interrupted himself.  A single word will be given a phrasetag node below another phrasetag, contrary to the normal SUSANNE rule, if that word would have been part of a multi-word phrase which was interrupted after the first word.

In these cases, no subcategory letter is added to the “N”.  In context it is a fair bet that the interrupted noun phrases, if completed, would have been respectively the ticket and people’s minds, like the complete phrases which were uttered after the interruption points; but the parts of the interrupted phrases that were actually uttered are not marked for number, so we would not write Ns:o or Np.  The general principle is that, so far as it is possible to work it out, the overall shape of the tree dominating interrupted wording should be what it would have been if the wording had been completed, with the same number of nodes labelled with the same main categories and functiontags; but detailed subcategory symbols are added only for features which are actually marked in the forms uttered.
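
To check where interruption points sit in a bracketing of this kind, one can scan the tokens while keeping a stack of open labels. The following is a minimal sketch, assuming whitespace-separated tokens in the one-line display form used above; the function name interruption_parents is hypothetical.

```python
def interruption_parents(bracketing):
    """Return, for each # token, the label of the tagma immediately dominating it."""
    open_labels = []
    parents = []
    for token in bracketing.split():
        if token == "#":
            # The innermost tagma still open is the mother of this #.
            parents.append(open_labels[-1] if open_labels else None)
        elif token.startswith("[") and len(token) > 1:
            open_labels.append(token[1:])
        elif token.endswith("]"):
            if open_labels:
                open_labels.pop()
        # Any other token is a word and does not change the nesting.
    return parents
```

For the invented ticket example the result is a single S, confirming that the # symbol is a daughter of the S node dominating the entire sequence; for the W22.00164 phrase it is P.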

One can scarcely ever be 100% sure how interrupted wording would have been completed.  Consequently, an element of guesswork is unavoidable in this area; we allow ourselves to make reasonable guesses about the balance of probability, in order to produce a meaningful structural annotation.  For instance, we created quite a lot of “hypothetical structure” for the example:

they’re new cases here though are, well no have you seen ...  T30.00633
The BNC transcriber’s comma evidently marks a point where the speaker interrupted himself.  The likeliest explanation occurring to us for the word are is that it was intended to begin a tag question, so the CHRISTINE annotation is:
[S ... they’re new cases here though [Iq [S? [Vab are ] ] ] # well_UW no_UN [Vo have ] ... ]
Often the intended role of words preceding an interruption is quite opaque, and in those cases no tagmatags are inserted (even though the consequence may be to create grammatically-odd relationships between the labels of mother and daughter nodes).  But where it is possible, by postulating interrupted constructions that seem at least highly plausible, to make the material immediately before an interruption point into part of a normal grammatical tree structure, we do this.

10.2  “Markovian” Syntax

A phenomenon akin to speech repairs, though without a “point of interruption”, is what might be called “Markovian” syntax.  I mean by this a sequence of words such that, if a window of limited size were moved through the sequence, the words inside the window at any point would appear to cohere as part of a normal grammatical structure, but no such structure can be imposed on the entire sequence from beginning to end.  CHRISTINE contains a number of examples.  The following case was uttered by Anthony Wedgwood Benn, MP, on a radio discussion programme:

and what is happening {pause} in Britain today {pause} is ay- demand for an entirely new foreign policy quite different from the cold war policy {pause} is emerging from the Left  X01.00539-45
With respect to what precedes it, the long noun phrase an entirely new foreign policy quite different from the cold war policy functions as the complement of a prepositional phrase postmodifying demand, which in turn is the complement of a main clause headed by is.  On the other hand, with respect to what follows, that same noun phrase is subject of a main clause headed by is emerging.

In a case like this, it is not really meaningful to identify a single point where one grammatical plan is abandoned in favour of another.  In this particular example, the fact that the transcriber recorded a pause immediately after cold war policy but no pause immediately before an entirely might perhaps tempt one to say that the noun phrase is “really” the complement of for, and that a new clause lacking a subject is initiated after the pause; but there are other cases where the “hinge” element of a Markovian sequence has no pause either immediately before or immediately after it.

However, we have found no way of annotating Markovian sequences other than by imposing an arbitrary division and treating the hinge element as belonging to one of the constructions to which it is adjacent and not to the other.   A sequence in which a single element plays roles simultaneously in two separate constructions resists analysis in terms of tree-shaped constituency diagrams (or, equivalently, in terms of labelled bracketings of word-strings).  Yet constituency analysis is so solidly established as the appropriate formalism for representing natural-language structure in general that it seems impractical to think of abandoning it, merely in order to deal with one special type of speech repair.

11.  Annotating Nonstandard Usage

11.1  Dialect Difference v. Performance Error

The UK contains an intricate diversity of English dialects, and CHRISTINE includes many turns of phrase that are clearly deviant with respect to standard English.  At the phonological level, there are good published descriptions of nonstandard English dialects; but phonology is not our concern on the CHRISTINE project.  At the level of grammatical structure, the literature on dialects other than the national standard is less well-developed.  (Leading publications are Trudgill 1990, Trudgill & Chambers 1991, Milroy & Milroy 1993.)  This means that formulating consistent, sensible guidelines for annotating nonstandard usage is a major problem for a project like ours.

Less than three years from the inception of the CHRISTINE project, we certainly have not achieved a fully adequate solution to this problem; it is too large for that.  But we have made a start in grappling with it.  I discuss our provisional guidelines for annotating nonstandard usage here, in the hope that they may help to inspire others to take the effort further.

A first difficulty lies in drawing the distinction between cases where a speaker’s usage is regular with respect to his own regional or class dialect (though deviant from the point of view of standard English), and cases where a speaker makes an error of performance yielding wording which is ill-formed from his own point of view.  When odd-looking wording is just a hasty slip of the tongue or the like, we may annotate it in ways which explicitly mark it as deviant, for instance using the # symbol to identify a tagma as incomplete; and we do not expect ever to evolve guidelines which provide well-defined, predictable structural annotations for each particular kind of performance error — there are just too many different ways in which people can trip over their tongues, so that many performance errors are likely to be “one-offs” which will inevitably have to be annotated in ways that are somewhat ad hoc and arbitrary.  If, on the other hand, a stretch of wording conforms to structural norms that happen to differ from those of the standard language, then our annotation scheme ought, if possible, to provide norms for representing its structure.  It would be scientifically misleading (and perhaps offensive) to annotate such cases in ways that equate them with performance errors.

Between us, members of the CHRISTINE team have lived for substantial periods in many different British dialect areas; we hope that this experience, together with exposure to further varieties through the media and through a cosmopolitan workplace, will usually have enabled us to recognize when a nonstandard form is regular in its own terms.  “Usually” is certainly not “always”, though.  Consider, for instance, the passage:

oh she was shouting at him at dinner-time {begin shouting} Steven {end shouting} oh god dinner-time she was shouting him  T19.03154
When I encountered this passage I was well aware that use of a time-noun like dinner-time as a Time adjunct without a preposition (as in the second instance of dinner-time in the example) is normal in nonstandard variants of English, where the national standard language would require at dinner-time.  But I took for granted that the closing phrase shouting him, where shout appears to be used transitively with the person shouted at as object, was just a slip of the tongue (or, perhaps, an accidental transcriber error).  There are a number of places in the texts where words are omitted by accident.  For instance, probably no-one would be tempted to invoke dialect difference to explain the lack of a verb before the final word photos in the passage:
There’s one thing I don’t like {pause} and that’s having my photo taken.  And it will be hard when we have to photos.  T17.03102-3
The speaker surely meant to say something like to show photos.  I took the shouting him case to be like this — after all, earlier in the same turn the speaker had produced the normal phrase shouting at him.  But this assumption was undermined when I encountered further cases uttered by speakers in other BNC files:
go in the sitting-room until I shout you for tea  T33.00332
the spelling mistakes only occurred when {pause} I was shouted  T17.02798
This looks like sufficient evidence to establish that shout has a transitive use with personal object in (some) nonstandard dialects, though I had no idea of this previously.  (The regional codes for the speakers involved are Midlands for the T19 case, and Northern England for the T33 and T17 cases.)

In this case, our material happened to contain multiple examples.  There must be other cases where we took nonstandard phrasing for performance error, because we encountered only a single example, or failed to notice similarities between separate examples.

There is no ideal solution to this problem.  One can only be aware of it and strive to draw the distinction between nonstandard dialect norms and performance errors, but this is never likely to be perfectly achieved.

The problem seems to be particularly troublesome at the ends of utterances.  Breaking a construction off prematurely is a common performance deviation; but dialects differ in the ways in which they regularize abbreviation.  In:

That’s right, she said Margaret never goes, I said well we never go for lunch out, we hardly ever really.  T28.08744
the words we hardly ever really would not occur in standard English without some verb (if only a placeholding do), so the sequence would most plausibly be taken as an interrupted utterance of some clause such as we hardly ever really go out to eat at all — in which case the CHRISTINE annotation would supply a # symbol as the last daughter of the clause node.  But on the basis of impressionistic awareness of dialect variation in Britain it seems easy to suppose that the speaker’s dialect might allow we hardly ever really for standard we hardly ever do really.  If so, it would be misleading to insert a # symbol.

Where a passage is ambiguous between an interpretation which attributes standard grammar to the wording and an interpretation which attributes a nonstandard construction to it, other things being equal we prefer the former.  Thus, in They got Mr Bean on Saturday as well T03.00814 (Mr Bean is a television programme), the word on is ambiguous.  In standard English, on Saturday could be a prepositional phrase functioning as a Time adjunct; but, alternatively, on and Saturday might be separate clause constituents, with on interpreted adverbially as in the programme is on, “it is being broadcast at the time”, and Saturday taken as a case of the common though nonstandard construction in which a time noun without preposition functions as an adjunct.  CHRISTINE chooses the former analysis.

11.2  Wordtags for Nonstandard Usage

In the case of wordtagging, we have adopted fairly clear and predictable analytic guidelines for the CHRISTINE project, but these contradict what was said in the EFC book.

The EFC rule (§3.67) was that words used in ways characteristic of nonstandard dialects were to be wordtagged in the same way as the words that would replace them in standard English.  That rule was reasonable in the context of the written language, where nonstandard forms are a peripheral nuisance.  Experience on the CHRISTINE project quickly showed the rule to be impractical for analysing spontaneous speech, which contains a high incidence of such forms.  The rule creates too many questions:  granted that we cannot use word X in exactly this way in standard English, just how is it used in the relevant speaker’s dialect, and precisely which standard word does that equate to?  To expect analysts to succeed in formulating suitable answers to a stream of such questions is not reasonable.  For CHRISTINE, the wordtagging rule is reversed:  in general, words used in nonstandard grammatical functions are given the same wordtags that the relevant wordforms are given in their standard uses.[17]

On the other hand, the phrases containing the words are tagmatagged in accordance with their grammatical function in context.  The general principle here is that English dialects differ less as one moves from the leaves towards the roots of parse-trees.  It does not seem practical to assign wordtags other than by reference to the known, standard language, because word uses in nonstandard varieties are too unpredictable (from the point of view of analysts who speak standard English).  But phrases and clauses should usually be assignable to the same range of categories in any dialect, even if some of the words composing them are used in nonstandard ways.

This revised wordtagging rule has proved to work unproblematically for many cases:  for instance, frequent nonstandard uses of (standard) adjectives in place of adverbs as qualifiers, e.g.:

[J awful_JJ quiet_JJ ]  T11.02680
or (standard) personal pronouns in modifying position, as in:
it’s a bit of fun, it livens up me day  T31.03497
she told me to have them plums  T15.10705
— here, the words me, them are wordtagged as object pronouns (they are not given the wordtags which my, those would receive), but the phrases me day, them plums are tagged as noun phrases.
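
For readers who process the annotations mechanically, the labelled bracketing used in examples such as [J awful_JJ quiet_JJ ] can be read into a small tree.  The following Python sketch is illustrative only and is not part of the CHRISTINE distribution:  the names Leaf, Node and parse_bracketing are hypothetical, and the sketch assumes the space-separated display form used in this documentation (a token beginning with [ opens a tagma, a token ending in ] closes one, and word_TAG pairs a word with its wordtag) rather than the field layout of the corpus files themselves.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Union


@dataclass
class Leaf:
    """A word, with its wordtag if the example shows one."""
    word: str
    tag: Optional[str] = None


@dataclass
class Node:
    """A tagma: a label such as 'Ns' or 'N:o' plus its daughters."""
    label: str
    children: List[Union["Node", Leaf]] = field(default_factory=list)


def parse_bracketing(text: str) -> Node:
    """Read a space-separated labelled bracketing into a tree.

    The returned node is an artificial 'ROOT' whose children are the
    top-level items of the example.
    """
    root = Node("ROOT")
    stack = [root]
    for token in text.split():
        if token.startswith("["):
            # Opening bracket: the rest of the token is the tagma label.
            node = Node(token[1:])
            stack[-1].children.append(node)
            stack.append(node)
        elif token.endswith("]"):
            # Closing bracket, optionally labelled as in 'N:o]'.
            if len(stack) == 1:
                raise ValueError("closing bracket without an opening bracket")
            closed = stack.pop()
            label = token[:-1]
            if label and label != closed.label:
                raise ValueError(f"'{label}]' closes '[{closed.label}'")
        else:
            # A leaf: 'word_TAG' if tagged, otherwise a bare word or symbol.
            word, sep, tag = token.rpartition("_")
            leaf = Leaf(word, tag) if sep else Leaf(token)
            stack[-1].children.append(leaf)
    if len(stack) != 1:
        raise ValueError("opening bracket without a closing bracket")
    return root


tree = parse_bracketing("[J awful_JJ quiet_JJ ]")
phrase = tree.children[0]
print(phrase.label, [(leaf.word, leaf.tag) for leaf in phrase.children])
```

Applied to [J awful_JJ quiet_JJ ], the sketch yields a J node whose two leaves both carry the adjective wordtag JJ, which is the relationship between wordtags and tagmatags described above.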

In these cases, since the most characteristic word of an adjective phrase or a noun phrase is its head adjective or noun, respectively, the fact that modifying words represent nonstandard usages does not create a strong feeling of inconsistency between wordtags and tagmatags.  The consequences of our rule are not so happy in the (less frequent) cases where the word used in a nonstandard fashion is itself the head or characteristic word of its tagma.  Cases of this sort include:

[P next_MDt [Nj the high ] ]  T13.01136
is it [R:e indoor_JB ]  T01.02688
wait [Fa:t to_IIt you start next year ] T27.03876
I haven’t had [Ns:o a lend_VV0t of it ]  T11.02608
In next the high (which in context means “next to the high setting [on a cooker]”), next is used as a preposition (as it can be in archaic literary English as well as in current nonstandard dialects), but in modern standard English the word functions only as an ordinal-type modifier.  In Is it indoor? the word indoor is a regional equivalent of standard indoors; in standard English the form without -s can be only an attributive adjective, as in indoor games.  In the T27 case, to is being used as a subordinating conjunction rather than a preposition.[18]  Finally, the last example illustrates the very frequent nonstandard use of lend, in standard English only a verb, for loan.  Although our rule produces odd-looking relationships between tagmatags and wordtags in cases like these, overall it seems preferable to continuing with the rule specified in EFC.

11.3  Abbreviated Idioms

Sometimes, what in standard English are multi-word idioms are reduced by speakers (whether as a regular dialect feature or a performance simplification) to single words; for instance, a bit appears as bit in bit awkward that I should think W09.00485.  In such cases we give the word the wordtag it would receive as part of the standard idiom, without creating a node above it with an idiomtag ending in an equals sign.  Thus, in the case just quoted, bit is wordtagged DD1b22, but there is no DD1b= node.

A more complex case occurs at W15.00640, shown here with the analysis assigned by CHRISTINE:

I wouldn’t mind [N:o [Dp just a_DD221 # just few_DD222 ] miles ... N:o]
— it seems that two attempts have been made to realize the idiom a few, but neither attempt was complete.

In the exchange Evening Olive. — Evening Alf.,  T03.00967-8, the word evening is wordtagged UGA22, as an abbreviation of good evening — though it is so usual for the words morning, afternoon, and evening to be used alone as UGA discourse items (and for night to be used alone as a UGZ item) that, arguably, it might have been preferable to assign the simple wordtag UGA rather than treat evening as an abbreviation of good evening.
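
The idiom wordtags quoted in this section (a_DD221, few_DD222, DD1b22, UGA22) all end in two digits.  Assuming, as in the SUSANNE scheme, that the first digit gives the number of words in the full idiom and the second gives the position of the word within it, such a wordtag can be taken apart as in the following Python sketch.  The function name split_idiom_wordtag is hypothetical, and the pattern is a heuristic which has not been checked against the full tagset.

```python
import re
from typing import Optional, Tuple

# Heuristic: <base tag><idiom length 2-9><position 1-9>, e.g. 'DD1b' + '2' + '2'.
_IDIOM_WORDTAG = re.compile(r"^(?P<base>.+?)(?P<length>[2-9])(?P<position>[1-9])$")


def split_idiom_wordtag(wordtag: str) -> Optional[Tuple[str, int, int]]:
    """Return (base tag, idiom length, position), or None for a non-idiom tag."""
    match = _IDIOM_WORDTAG.match(wordtag)
    if match is None:
        return None
    length = int(match["length"])
    position = int(match["position"])
    if position > length:
        # A word cannot be, say, third in a two-word idiom.
        return None
    return match["base"], length, position


for tag in ("DD221", "DD222", "DD1b22", "UGA22", "JJ"):
    print(tag, split_idiom_wordtag(tag))
```

On this reading, bit tagged DD1b22 is the second word of a two-word idiom whose idiomtag would be DD1b=, even though in the abbreviated case no DD1b= node is created.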

11.4  Nonstandard Verbal Structures

The rule whereby nonstandard uses of words are given the tags which the same wordforms would receive in their standard use has proved unsatisfactory for one area of grammar, namely nonstandard verb uses.  There are two difficulties.  In the first place, many standard verb forms are ambiguous between two inflexions, for instance come in standard English can be the base form or the past participle — so that a rule saying “give a nonstandard use the tag which the wordform would receive in standard English” does not yield a predictable tagging decision.  Secondly, and more importantly, verb usage in nonstandard English dialects often seems too different from the standard pattern to be adequately handled by a simple rule about wordtagging.  We find:

a man bought a horse and give it to her  T13.01096-8
he give me tablets and whatnots  T01.02717
do you want them took off  T08.00848
I thought you’d’ve ate them before now  T16.07040
what I done, I taped it back like that  T11.02536
I heard it when I come in didn’t I  T04.02377
The difficulties in applying the general nonstandard-wordtagging rule to verbs seem to relate in particular to verbs referring to past time.  There are also nonstandard verb uses where the oddity relates to subject-verb agreement, but the rule of §11.2 works satisfactorily in these cases.  Thus, in the examples:
I were flying  T01.02626
I says  T01.02763
the words were, says are tagged VBDR, VVZv respectively (rather than VBDZ, VV0v, like was, say), and the verb groups were flying, says are tagmatagged Vwu, Vz.  The nonstandardness of the grammar is manifested in the parse tree through the fact that the subcategories of subject noun phrase (Nea) and verb group, respectively, do not match in the normal way.  But, if the general rule were applied to the earlier examples, it would not be clear whether to tag come as base form or past participle, and it would further be unclear how to assign higher-level structure above the wordtags.

One writer who has discussed this problematic range of verb uses is Eisikovits (1987).  Eisikovits’s article is based on data from an Australian urban dialect, but, as Trudgill & Chambers (1991: 52) rightly point out, the facts are similar for many UK dialects.  Eisikovits (p. 134) in effect argues that the tense system exemplified in a clause like what I done is the same as that of standard English, but that a single form done is used in the nonstandard dialect for both past tense and past participle (in the same way as single forms such as said, allowed, are used for both functions in the standard language, in the case of many other verbs).  Other writers on nonstandard dialects, e.g. Beal (1993: 192), seem to take a similar line.

But this analysis seems to overlook cases (which are very common in the CHRISTINE material) like:

What it is, when you got snooker on and just snooker you’re quite {pause} content to watch it  T11.02572
Here it is clear that got is functioning as a perfective form meaning “have”, the equivalent of standard have got.  It is not just that nonstandard dialects swap verb forms between the different inflexion categories; the syntax of nonstandard verb groups is different, in that a perfective construction can lack a preceding auxiliary.  Presence or absence of auxiliary is the only diagnostic in standard English for the perfective/past-tense contrast with the majority of verbs, whose past tense and past participle forms are identical.  So, if a past form without auxiliary can in nonstandard varieties correspond to a standard perfective construction as well as a standard past tense, it seems questionable whether this distinction can meaningfully be imposed in annotating nonstandard verb constructions.

The solution adopted for CHRISTINE, which might perhaps seem over-simple if one had a comprehensive knowledge of the structures of the various nonstandard dialects, but which seems in practice to work well, is to say:

Thus give, took, ate, done, come, got in the examples above are all given VVN... wordtags and treated as Vn phrases.

11.5  Subcategories for Nonstandard Verb Groups

CHRISTINE contains many sequences of auxiliary and main verbs which would not be acceptable in standard English; an example is you’re not give me a sweet T25.00331.  In such cases, only those V subcategory letters are added which are justified by the verb forms actually used.  Thus, in the case quoted, the verb group begins with are and contains not, but does not meet the criteria for any other V subcategory listed in EFC, p. 186ff.; so the tagma is labelled Vae (a combination of symbols that will never occur in annotations of well-formed standard English).

11.6  ain’t, in’t, innit

A new wordtag VAI is used for the ai- of ain’t and the i- of in’t, innit, which are ambiguous both with respect to person and with respect to the identity of the corresponding standard verb root.  Different cases of ain’t would translate into standard English as isn’t, aren’t (aren’t you or aren’t I, i.e. am I not), hasn’t, haven’t, and possibly sometimes doesn’t, don’t; without a special neutral tag VAI, the wordtagging would have to resolve this ambiguity and represent the forms uttered as much more specific than they actually are.

The form in’t (as in, for instance, they’ve seen them in’t they T07.00382) is, as far as I know, entirely comparable to ain’t and is therefore treated in the same way.[19]  (The literature of linguistics seems to discuss ain’t much more than in’t — though see Cheshire 1982: 54ff., Cheshire et al. 1993: 73; it may be that ain’t is more widely discussed because it is the only one of the two forms found in American English.  Both forms occur in British speech, and in CHRISTINE.)

CHRISTINE divides innit into i-_VAI +nn_XX +it_PPH1.  One might feel that innit is a different case from ain’t or in’t, because the root of innit can safely be categorized as 3rd person singular.  However, I am fairly sure that this form in its longstanding use has been ambiguous with respect to verb root, if not to person:  innit could equate to hasn’t it and possibly to doesn’t it, not exclusively to isn’t it.  So the need for a “neutral” wordtag remains, and the i- of innit is also tagged VAI.[20]

With innit there is a further complication in that this form seems to be expanding its usage range at present.  For some speakers it appears to be functioning as a generalized tag question akin to French n’est-ce pas, which does not vary with the verb or subject of the declarative clause to which it is attached.  This innovative usage is actually discussed by speakers at one point in CHRISTINE text T36, who describe it as characteristic of the English of the South Asian immigrant communities.  One can well imagine that a simplification of the very complex traditional English tag-question formation rule might be initiated by speakers of a different mother tongue, though if so I have the impression that the novel usage has now spread to young members of the indigenous British population.  (According to my wife, what one might call “generalized innit” already occurred in rural Sussex in the 1960s, which suggests that it is unrelated to South Asian immigration; she describes it as having then been an aggressive, “bovver boy” usage.)  In any case, this usage makes it all the more appropriate to use VAI for the root of innit.

At one point (T16.07076) CHRISTINE contains the form dunnit, divided as du +nn +it.  Arguably, du-, although not ambiguous with respect to verb root, should be given a person-neutral wordtag parallel to VAI.  I believe forms such as [dVnaI] dunnI, “don’t I”, also occur in English speech.  But it seemed undesirable to coin an additional wordtag for a single example; CHRISTINE tags du- in dunnit as VDZ.
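
As an illustration of the divisions just described, the following Python sketch rejoins the pieces of a divided form into the word as uttered.  It is a sketch under stated assumptions rather than CHRISTINE software:  the function name surface_form is hypothetical, and it assumes only what the examples here show, namely that a leading + marks a piece attached to the piece before it and that a trailing hyphen marks a truncated root such as i- or du-.

```python
from typing import Iterable


def surface_form(tokens: Iterable[str]) -> str:
    """Rejoin divided word pieces, e.g. i-_VAI +nn_XX +it_PPH1 -> 'innit'."""
    parts = []
    for token in tokens:
        # Drop a wordtag if one is present ('+it_PPH1' -> '+it').
        form, sep, _tag = token.rpartition("_")
        if not sep:
            form = token
        attaches = form.startswith("+")
        # Remove the division marks themselves.
        form = form.lstrip("+").rstrip("-")
        if attaches and parts:
            parts[-1] += form  # glue onto the preceding piece
        else:
            parts.append(form)
    return " ".join(parts)


print(surface_form(["i-_VAI", "+nn_XX", "+it_PPH1"]))
print(surface_form(["du", "+nn", "+it"]))
```

The two calls rebuild innit and dunnit from the divisions quoted above.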

A VAI word does not lead to the Vz or Vb subcategories being marked on the verb group it initiates:  ain’t as a whole verb group is tagmatagged Ve, not Vzeb.  However, if a VAI word acts as an auxiliary with a present or past participle, the subcategories Vu, Vf are used.

It is perhaps worth underlining the fact that the analytic guidelines set out above are based on assumptions about patterns of non-standard English usage which do not always rest on authoritative published descriptions.  The decisions made by the CHRISTINE team about how to handle forms such as in’t were based mainly on an impressionistic grasp of non-standard speech patterns derived from encountering such forms in our everyday lives.

Such impressions can be misleading.  In particular, for individuals whose usage broadly conforms to the national standard it is easy to fall into the trap of thinking in terms of two dialects, “standard” and “nonstandard”.  Linguists such as Cheshire et al. (1993) suggest that there may even be a measure of truth in this idea, but it is certainly over-simple:  there are many nonstandard dialects, which differ among themselves.  Here and there we found useful statements about particular details in the linguistic literature.  But, if there exists a thorough, reliable linguistic description of the grammars of the various nonstandard dialects, we have not encountered it.

11.7  Other Nonstandard Syntactic Structures

The use of Vn phrases in finite clauses, discussed above, is one area where CHRISTINE has evolved a well-defined approach to annotating a specific nonstandard syntactic construction.  In other cases, we have begun to develop annotation precedents which will be documented when a fully detailed supplement to EFC is circulated, but at this point we do not yet feel confident that our current approach will remain satisfactory.

Consider, for instance, relative clauses containing undeleted relativized items — a structure which occurs regularly in some nonstandard English dialects (and in some standard languages, e.g. Hebrew) but which is unacceptable in standard English.  A CHRISTINE example is:

... bloody Colin who, he borrowed his computer that time, remember?  T19.03075
For this example, the solution initially chosen for CHRISTINE was to make the relativized noun phrase, he, appositional to the relative pronoun.  But that solution only works provided that the relative clause begins with a relative pronoun and that the relativized element is subject of the relative clause.  More recent experience, with material to be included in the full CHRISTINE Corpus, shows that this is not always so; e.g.:
a {beep} {pause} cock-up by [Ns a farrier {pause} [Fr that I would really like to go and hammer those nails into his feet {pause} and make him walk for two weeks Fr] Ns]
The relativized elements are in one case a genitive and in the other case an item raised to surface object of its clause; and in any case the EFC parsing scheme counts the word that as a conjunction, not as a pronoun which could be postmodified by an appositional element.  Thus the initial approach cannot be made to work for this latter example, which suggests that it ought probably to be revisited and modified in the case of the T19 example also.
 
Cases like this show the limits to the policy of developing a comprehensive analytic scheme which provides a predictable annotation for anything that may be encountered in the data.  There are good reasons for maintaining that policy as an ideal; but when our scheme is required to apply not just to a single, intensively-studied language variety but to a diverse range of poorly-studied dialects, we cannot hope to get very close to that ideal for many years to come.  Arguably, there is a conceptual confusion in the idea of specifying consistent grammatical annotation standards for a spectrum of different, unpredictably varying structures; yet somehow that is what we have to do.

12.  Swearwords

CHRISTINE samples contain a high incidence of what in everyday parlance is called “bad language” or “swearing”. This area of English has structural features of its own, but the previous analytic guidelines have not served it well. Part-of-speech information in standard dictionaries seems patchy and inconsistent for these words, probably because they are felt to be marginal to the respectable core of the language; and the SUSANNE analytic scheme, which was based mainly on the written language, did not adequately get to grips with swearwords. For CHRISTINE, a new approach has been adopted, overriding earlier decisions in EFC and dictionary information about word classification.

The usages which the following discussion is intended to cover are cases where words are used without literal reference, in order to “let off steam” in speech, but (at least in some occurrences) the words used in this way are integrated to a certain extent into the surrounding grammar. These words are assigned wordtags beginning FL... (from “four-letter word”, for want of a better mnemonic — of course not all the words are actually spelled with four letters). Wordforms given FL... tags will always receive those tags when the words are used as swearwords — they will never be tagged UX even when occurring as isolated expletives; but they may sometimes receive ordinary tags as nouns, verbs, etc. when used with substantial reference rather than as swearwords.

An example of a swearword that is not given an FL... tag is blimey; this always seems to occur as a grammatical isolate, reflecting its derivation from the imperative blind me, so it continues to be tagged UX as in the EFC system.

Examples of words which would be classified by some or all speakers as “bad language” but which are given “ordinary” part of speech tags would be:

All of these and many other comparable uses of socially-deprecated words are tagged as ordinary nouns, verbs, or other parts of speech. But this leaves many cases (e.g. bloody except when it either refers to blood or means “obnoxious in character”, e.g. he really is a bloody man, or fuck other than referring to copulation) where words are used purely in order to modify the emotional tone of an utterance. The emotional effect can be very mild: for instance, the class of words given FL... tags includes flipping as well as fucking. (However, the “mild” FL... words probably in most cases arose specifically as polite replacements for particular swearwords, for instance it seems clear that frigging functions as a replacement for fucking which it is permissible to utter in prudish company.)

The FL... tags used are as follows:

The full list of words given FL tags in CHRISTINE is:

The analytic scheme allows FL... words to occur in any category of tagma suggested by the surrounding wording, including as head of the tagma: thus I don’t give a fuck is analysed as including [Ns a_AT1 fuck_FL ], fuck me as [Tb! [V fuck_FL ] [Neo:o me_PPIO1 ] ]; the frequent exclamation bloody hell is [Ns! bloody_FLJ hell_FL1 ]. FLG and FLJ words often act as qualifiers, e.g. he’s [J:e bloody_FLJ mad_JJ ] T25.00316. At the same time, like U... words, FL... words are allowed to appear inserted in the middle of other tagmas to which they are grammatically redundant, e.g.:
[S+ but [Nea:s I ] bloody_FLJ [Vdce couldn’t get ] [R:n out ] ... ] T04.02478

[S [Nea:s I ] [Ve don’t get ] [Nop:o them ] bloody_FLJ [R:q back ] ] T16.06978
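
Because FL... words may appear either as constituents of a tagma or inserted between tagmas, a user who wants to locate them may find it simplest to scan the word_TAG pairs directly rather than walk the tree.  The Python sketch below does this for the display form used in this documentation.  The name swearword_tokens is hypothetical, and the pattern covers only the four tags FL, FL1, FLG and FLJ.

```python
import re
from typing import List, Tuple

# A word, an underscore, then one of FL, FL1, FLG, FLJ, ending at whitespace
# or at the end of the string.
_FL_TOKEN = re.compile(r"(\S+)_(FL(?:1|G|J)?)(?=\s|$)")


def swearword_tokens(annotation: str) -> List[Tuple[str, str]]:
    """Return (word, wordtag) for each FL-tagged word in an annotated string."""
    return _FL_TOKEN.findall(annotation)


print(swearword_tokens("[Ns! bloody_FLJ hell_FL1 ]"))
print(swearword_tokens("[S [Nea:s I ] [Ve do get ] [Nop:o them ] bloody_FLJ [R:q back ] ]"))
```

The second call finds bloody even though it is not enclosed in a tagma of its own, which is the inserted use illustrated above.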

Only as much higher structure is included above FL... words as is essential to show the relationship to the environment. Thus in fuck me the word fuck has to be dominated by a V node to show that the whole exclamation is a clause with me as object; but FLG words, for instance, are never given Vg or Tg nodes above them, so fucking slapper (T36.01648, commenting on a show-business personality) is [Ns fucking_FLG slapper_NN1c ], not [Ns [Tg [Vg fucking_FLG ] ] ... ].

Utterances of the type does it heck? T16.07077 are analysed with the FL1 inside rather than outside the interrogative clause:

[S? [Vzx does ] [Ni:s it ] heck_FL1 ]
Phrases of the pattern damn all are analysed as [D damn_FL all_DBa ], e.g.
[S there’s [Np:s [D piss_FL all ] jobs ] [Rw:p in there ] ]  T47.00087
The word hell tagged FL1, if modified, will be head of a noun phrase but this is not annotated as a proper noun phrase, Nn... (This contradicts various passages in EFC.)

There is a use of swearwords in which they interrupt a single multi-syllable word — the stock example is abso-bloody-lutely. The sole case in CHRISTINE is I go with-bloody-out T16.06989, which is analysed by treating with- -out as two halves of an idiom:

I go [R:h [RR= with_RR21 bloody_FLJ out_RR22 ] ]
A case where it is difficult to decide whether the wordform should be counted as occurring in a “swearing” or “literal” use is a hell of a storm T38.01011. One might feel that this means that the storm constituted a hell, literally. But a hell of ... is often used in contexts where this interpretation would be hard to maintain; so the CHRISTINE analysis is [Ns a_AT1 hell_FL1 [Po of a storm ] ].

13.  New Annotation Symbols

13.1  Symbols not Defined in EFC

This section is provided as a concise checklist of symbols used in CHRISTINE annotations which are not listed or defined in the EFC book.  Some of the symbols are discussed in greater detail in other sections of this documentation file, and cross-references are provided in these cases.

13.2  New Wordtags

FL, FL1, FLG, FLJ:  swearwords

See §12 for the definitions of these wordtags.

NP1a:  anonymized name

The entities <name> and <address>, representing a name or address removed from the speech transcription for anonymization purposes, are wordtagged NP1a.  In context some names are clearly Christian names, some are clearly surnames, and some are names of things rather than persons, but often it is not possible to be sure what category the name belonged to; and addresses are likely to have been multi-word phrases.  The “neutral” tag NP1a is used in all cases, rather than the more specific NP... tags defined in EFC.

The other anonymization entity, <telNo>, is given the wordtag FOt already defined in EFC.

UGA, UGZ:  hail and farewell

EFC defines the discourse-item tag UG, Greeting, and gives examples such as hello, good morning, all of which are said at the opening of a social interaction.  It offers no tag for words such as goodbye, said at the close.  Instead of UG, CHRISTINE uses UGA for “hail”-type and UGZ for “farewell”-type discourse items.

UO:  sort of, etc.

The wordtag UO is used for the idioms sort of, kind of, sort of thing (i.e. these phrases are tagmatagged UO=).

EFC, pp. 446-7, argued that the Lund group (who pioneered the classification of discourse items on which EFC drew) were misguided in treating (most cases of) sort of as discourse items, and it urged that the phrase should normally be counted as an adverb idiom, RR= (and only occasionally as UE=, Engager).  Experience with CHRISTINE has shown that the Lund group were wiser in this respect than I appreciated; sort of, and the other similar phrases listed, are commonly dissimilar in their usage to either adverbs or Engagers.  CHRISTINE uses UO for all cases of these forms, except for thoroughly literal uses (as in this is an unusual sort of watch, where sort of would not be a constituent at all).

US:  “sound”

The wordtag US is used for elements which resemble words rather than non-linguistic vocalizations, in the sense that they obey the rules of English phonology (or most of them), but which are intended to represent non-linguistic sounds:  e.g. ding ding dee dee (T09.02011, imitating something unknown in a children’s game), or tra-la (T14.04877, imitating music or “generalized singing” divorced from any particular words).

I say that US items obey “most” phonological rules, because such forms frequently are slightly phonologically exceptional.  The late Prof. Eugénie Henderson used to point out the phonological oddity of the word boing, standardly used to represent the sound of a spring rebounding:  the velar nasal does not normally follow diphthongs such as [OI], and the form is mandatorily pronounced on a level tone although English is not a tone language.  Nevertheless, even this form is phonetically much more like an English word than like an attempt to use the vocal organs to create an imitation of the actual physical sound referred to.

A form such as ha ha or ho ho ho, representing laughter, is tagged US:  these syllables, consisting of [h] followed by a vowel or diphthong, are conventional English indications of laughter, quite different from actual laughing, which is shown with a non-linguistic vocalization entity.[21]

The boundary between US and UX, Expletive, is fuzzy.  Having adopted US for tra-la, CHRISTINE also uses this tag for less conventional nonverbal sounds integrated into pop lyrics, e.g. wo oh ooh ooh T06.00375.  But UX is used e.g. for fairly conventional “coaxing” noises in ordinary speech, wooty coochy coochy coochy bing T34.02876, and for odd noises made in the heat of children’s horseplay, wom um T05.01005.

VAI:  ai(n’t), etc.

See §11.6.

VVNH:  got

The word got as a past participle (not as a past tense, as in I got it at Tesco’s yesterday) is tagged VVNH.  The commonest use of this word is in the forms have got, +’ve got (and similar with other forms of HAVE, e.g. +’s got, had got), or just got, as colloquial equivalents to literary have.  The EFC annotation scheme provides no special annotation for have got, treating it simply as a perfective verb group, Vf, like have eaten; but semantically it is quite different from other perfective forms, and probably occurs with a higher frequency.  It seemed desirable to mark this special status in the annotation somehow, so CHRISTINE does so in the wordtag.  (If HAVE got ever occurs in the literal perfective sense, “have acquired”, it would be annotated in the same way; my impression is that this usage is rare in colloquial speech.  In practice, people seem to say HAVE gone and got ... to avoid the ambiguity.)

YMN, YMV:  nasal and vocalic filled pauses

“Filled pauses”, sounds made as a conventional way of continuing a speech turn while formulating one’s next words, fall into two classes in English:  sounds based on a nasal consonant, normally [m] — e.g. mm, um; and sounds which are purely vocalic, e.g. er, ah, Scottish eh.[22]  EFC defined one wordtag, YM, for all filled pauses.  However, the EAGLES speech group recommend distinguishing nasal from vocalic fillers (Gibbon et al. 1997: 170, Recommendation 5.4.4).  Consequently, CHRISTINE does not use the wordtag YM; it uses YMN for filled pauses containing a nasal consonant, and YMV for wholly vocalic filled pauses.
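
The distinction can be approximated from the transcribed spelling alone, as in the following Python sketch.  This is only a rough illustration:  the function name filled_pause_tag is hypothetical, the test looks at letters rather than sounds, and treating the letter n as well as m as a sign of a nasal is an assumption (the text above says only that the nasal is normally [m]).

```python
# Letters taken to spell a nasal consonant (an assumption; see the lead-in).
NASAL_LETTERS = frozenset("mn")


def filled_pause_tag(form: str) -> str:
    """Return 'YMN' if the spelling contains a nasal letter, else 'YMV'."""
    if any(letter in NASAL_LETTERS for letter in form.lower()):
        return "YMN"
    return "YMV"


for form in ("mm", "um", "er", "ah", "eh"):
    print(form, filled_pause_tag(form))
```

Note that the r of er and the h of ah or eh are spelling conventions, not consonants, so these forms count as wholly vocalic.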

YV, YVL, YVR:  non-linguistic vocalizations, vocal shifts

The wordtag YV is used for a non-linguistic vocal sound;  YVL and YVR are used as shift elements identifying the beginning and end of stretches of speech having special vocal properties.  For details see §8.4.

YY:  inaudible wording

YY is used to tag the entity {unclear}, representing a stretch of speech whose wording could not be made out by the transcriber.  For details on the annotation of passages including such material, see §9.

Slash Wordtags

When the form in the wordfield is identifiable as an incomplete or distorted attempt to utter some particular word (as opposed to an unidentifiable distorted word, tagged FD), CHRISTINE assigns a wordtag made up of the tag that would be assigned to the complete word, followed after a slash (solidus) character by the dictionary form of the complete word:  e.g. thi for this is wordtagged DD1i/this.

Because, typically, an imperfect word-token preserves the beginning but not the end of the intended word, sometimes it is quite clear what the word-stem is and only the inflexion is in doubt.  However, since CHRISTINE is concerned with grammatical structure, for our purposes the inflexions of words are more significant than the stems.  So, if there is real doubt about which inflected form was intended, a word token will be tagged FD even if the stem is unambiguous.

Slash wordtags are not used for word forms which are nonstandard but nevertheless conventional, e.g. ’em for them, +ta for to in gotta, nowt for nothing in Northern dialect, etc.  These are given simple wordtags, usually the same as those assigned to the standard forms.  Slash wordtags are used only for forms which appear from the transcription to be performance deviations from the speaker’s own idiolectal norms.

Since the CHRISTINE Corpus has been compiled from speech transcribed by others, we are obviously in the original transcribers’ hands for purposes of deciding when a word was pronounced so imperfectly that it should be transcribed other than in its standard orthography.  No doubt some transcribers were more skilled than others at “hearing” the intended words behind the blunders of performance; so far as CHRISTINE annotation is concerned, a word was distorted or incomplete if it was intentionally transcribed with abnormal orthography.
 

13.3  New Tagmatag Subcategories

:y, Y:  formally or functionally unanalysable

On the use of the formtag Y and the functiontag :y for material that is unanalysable because inaudible or interrupted, see §9.

Except for these two symbols, which are irrelevant in the case of written language, the CHRISTINE team made strenuous attempts to avoid postulating new categories above the wordtag level (and, in general, to avoid modifying the annotation guidelines laid down in EFC other than where it was quite necessary to do so).  Tagmatag categories relate to the logic of linguistic expression more than to the superficial manner of expression; it seemed reasonable to think of spoken and written English as two modes for expressing the same range of logical relationships.

Nevertheless, at one point it seemed unreasonable not to add two subcategories which were omitted from EFC by oversight:

Mp, Mq:  plural numeral phrase, wh- numeral phrase

With hindsight it seems irrational that EFC did not define the same subcategories “plural” and “containing wh- word” for the phrase category M, numeral phrase, as were defined for N (noun phrase), D (determiner phrase), etc.  It was a principle of the SUSANNE scheme that modification by a number greater than one was not in itself a reason to ascribe the plural subcategory to a phrase:  two sheep is N, not Np (because in practice numbers greater than one often occur in singular phrases).  But that is not a reason to avoid using the plural subcategory where the head word of a numeral phrase is grammatically plural; and the arguments for marking numeral phrases modified by a wh- word, e.g. which one, with a distinctive subcategory symbol are as strong as for other phrase categories.

Accordingly CHRISTINE uses Mp for phrases whose head is a word tagged MC2, and Mq for M phrases containing a modifying wh- word or ...q phrase.

This does leave a lack of parallelism between M and the other phrases using the subcategory symbol p for plural.  Ms is defined in EFC not as a numeral phrase whose head is grammatically singular, but specifically a phrase whose head is the word one.  Accordingly, most M phrases, where the head is a number word other than one but not with a plural inflexion, are tagged just M, not Ms or Mp.
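The conditions stated above for the subcategory symbols of M can be summarized in a short Python sketch. The function name numeral_subcategories and its parameters are hypothetical. The sketch only reports which of the symbols p, q and s have their stated condition satisfied; how symbols combine when more than one condition holds is not specified in this section, and the sketch makes no claim about it.

```python
from typing import List


def numeral_subcategories(head_word: str, head_wordtag: str,
                          has_wh_modifier: bool = False) -> List[str]:
    """List the M subcategory symbols whose stated conditions hold."""
    symbols = []
    if head_wordtag == "MC2":        # head word is grammatically plural
        symbols.append("p")
    if has_wh_modifier:              # modifying wh- word or ...q phrase
        symbols.append("q")
    if head_word.lower() == "one":   # Ms: head is specifically the word one
        symbols.append("s")
    return symbols


# A head such as two (wordtagged MC) satisfies none of the conditions,
# so the phrase would be tagged plain M.
print(numeral_subcategories("two", "MC"))
```

An empty result corresponds to the commonest case described above, a phrase tagged just M.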

14.  New Precedents for Applying Existing Symbols

14.1  Accumulation of Precedents

It is beyond the scope of this documentation file to state a list of annotation precedents for speech at the level of detail contained in the 500-page EFC book (which was based mainly on the experience of annotating written corpora).  Many new precedents have been set in the course of annotating the CHRISTINE samples — some relating specifically to spoken-language features, others of which might equally well have arisen in connexion with written English.  The project team has been collecting these precedents, and we intend to compile them into a systematic statement and publish them, probably via the Web, as a supplement to EFC in due course.  At this point, we must restrict ourselves to stating a limited number of new precedents which proved to be specially significant because the features to which they relate occur relatively frequently in our spoken samples.  They are listed below in the order in which the respective topics are dealt with in EFC.

14.2  ICSk, like

The word like has two uses as a “hedge” word in colloquial speech:

CHRISTINE wordtags both of these uses with the ordinary SUSANNE tag for like, ICSk, but treats this item structurally like a discourse item (U...), or a punctuation mark in written English:  i.e. it is attached to the tree as high as possible within the structure created for the surrounding wording.

In the expression feel like meaning “feel inclined to (have)”, like is treated as initiating a P:e phrase (the complement of which may be a noun phrase or a present-participle clause, e.g. I feel [P:e like [Tg [Vg dying ] ] ] T12.04069).

14.3  UL, Response Elicitor

Many words and idioms which can occur as UR (Response) items, e.g. all right, really, oh, also occur (with question intonation) in the function of UL, Response Elicitor.  Not all UR items have this double use:  for instance, fine occurs as UR but would not, I think, occur as a Response Elicitor.

Contrary to the usual SUSANNE practice of giving words fixed wordtags independent of context where possible, we decided in this case that any UR word would be tagged UL in cases where it occurs in the UL function.  Also, mm is tagged UL rather than UY when functioning interrogatively (e.g. at T10.00944).  (Other, more explicit UY forms are not tagged UL even if functioning interrogatively — Yes? would be UY.)

The word what is tagged UL when it functions as a request for the previous speaker to repeat what he has said.  On the other hand, in an exchange like You’ve forgotten something — What?, where the word abbreviates a question beginning what (“What have I forgotten?”), or why did you do that David — what, T35.00074-5, where David is echoing the question and asking “why did I do what?”, the word is wordtagged DDQ and analysed as a Dq tagma.
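Location references of the kind cited in this documentation (T10.00944, T35.00074-5, T03.00830ff) can be decoded mechanically. The Python sketch below is an illustration only: the names Location and parse_location are hypothetical, and the pattern rests on assumptions inferred from the references quoted in this file, namely that a text name is one capital letter plus two digits, that a source-unit number has five digits, and that an abbreviated range such as 00074-5 replaces the final digits of the first number.

```python
import re
from typing import NamedTuple, Optional

# Assumed shape: text name, full stop, five-digit source unit,
# then optionally "-digits" (abbreviated range) or "ff" (open-ended).
REF = re.compile(r"^([A-Z]\d{2})\.(\d{5})(?:-(\d{1,5})|(ff))?$")


class Location(NamedTuple):
    text: str
    first: int
    last: Optional[int]  # None when the reference is open-ended ("ff")


def parse_location(ref: str) -> Location:
    match = REF.match(ref)
    if match is None:
        raise ValueError(f"unrecognized location reference: {ref!r}")
    text, start, suffix, open_ended = match.groups()
    if open_ended:
        return Location(text, int(start), None)
    if suffix:
        # Replace the trailing digits of the start number with the suffix.
        end = start[: len(start) - len(suffix)] + suffix
        return Location(text, int(start), int(end))
    return Location(text, int(start), int(start))


print(parse_location("T35.00074-5"))
```

A reference that does not fit the assumed pattern raises ValueError rather than being guessed at.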

14.4  Additional Idioms

The following multi-word phrases have been treated as “idioms” in the sense of EFC, p. 99ff., though not listed in that book as such.  The list below includes all the new idioms from CHRISTINE, and some of those from the additional texts to be included in the full CHRISTINE Corpus.

CS
DAz
DD1a
DD1b
II
JA
JJR
NN1n
RAc
RG
RL
RR
UA
UE
UGA (see §13.2)[25]
UGZ (see §13.2)
UI
UK
UL
UR
UX

14.5  Conflicting Phrase-Category Cues

More weight is given to closed-class than to open-class words in deciding how to classify a phrase (even though the head word is usually an open-class word), because it is much commoner for an open-class than a closed-class word to acquire a new grammatical function.  An example is the phrase a bit fucking crap ... discussed in §12.  Crap is listed as a noun and not as an adjective in the dictionary, but the idiom a bit introduces an adjective phrase; so the phrase is analysed as a J with a noun rather than an adjective as head.

14.6  Counting

When someone counts, e.g. Right, one two {humming} T27.03765, the number words are not incorporated into a grammatical structure of co-ordination or the like:  right_UR one_MC1 two_MC {humming}_YV with no tagmatags.
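The inline word_TAG notation used in the example above (a display convention of this documentation, not the layout of the corpus files) is easy to take apart. The following Python sketch shows one way; the function name parse_tagged is hypothetical, and the sketch assumes that tokens are separated by white space and that the tag follows the last underscore.

```python
from typing import List, Optional, Tuple


def parse_tagged(text: str) -> List[Tuple[str, Optional[str]]]:
    """Split white-space-separated word_TAG tokens into (word, tag) pairs."""
    pairs = []
    for token in text.split():
        # Split on the LAST underscore, so that any underscore inside
        # the word itself is kept with the word.
        word, sep, tag = token.rpartition("_")
        if sep:
            pairs.append((word, tag))
        else:
            pairs.append((token, None))  # token carries no tag
    return pairs


print(parse_tagged("right_UR one_MC1 two_MC {humming}_YV"))
```

The result is a flat list of pairs, which matches the analysis described above: the counted numbers receive wordtags but are not grouped under any tagmatag.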

14.7  Adverbial Clause Disconnected from Main Clause

In speech, an adverbial clause adjacent to a question often logically modifies not the questioned proposition but the speaker’s reason for asking it:  did I see him in that pub cos I’ve got no memory at all W09.00635.  In these cases the Fa is treated as disconnected from the S?, not subordinate to it.  Likewise, I said [Q:o [S? why don’t you go and see if Martin will let you stay ] [Fa cos you’ve met him ] Q:o] W09.00681.  In both of these examples the Fa begins with cos, discussed above as being only marginally a subordinating conjunction in colloquial speech; but I believe similar patterns can be found with “true” subordinating conjunctions.

14.8  Adverbial Clause without Subordinating Conjunction

A subordinate clause functioning as an Fa is analysed as such even if it lacks a subordination marker:

[S* [Fa:c you think this year’s bad for physics ] wait to you start next year ] T27.03876

14.9  L, Verbless Clause

Normally discourse items, wordtagged U..., are grammatically disjoint from adjacent wording; but there are cases where such an item occurs as a constituent of a larger tagma.  In these cases the higher tagma is labelled L.  Particularly common examples are:

[L [UT= thank you ] [Ds:h very much ] ]
[L yes_UY [S+ but ... ] ]
or cases where an “interpolation” occurs not medially within but adjacent to a single tagma to which it is linked in sense (so that it would be misleading to treat the two tagmas as disconnected):
[L [J:e rather good ] [I [S I think ] ] ] W22.00136

14.10  Ot, Title

The Ot category is used for names of school periods, e.g. Geography.

It is also used for cases where a personal name is used as the name of a regular television programme.  Well-known cases are Wogan, Parkinson.  Examples occurring in CHRISTINE are Carrot, Mr Bean, at T03.00830ff.  It is noticeable in the latter passage that these names are referred back to subsequently as it, not he.

14.11  Q, Quotation

Written English uses inverted commas and other orthographic devices to maintain a very clear distinction between direct and indirect quotation; in the SUSANNE scheme, the category Q was strictly reserved for material orthographically marked as direct quotation.  In speech there is no equivalent of inverted commas, and the Q category is used more freely for any wording that the speaker seems to be “quoting” rather than “using”; if a speaker asked what does wayzgoose mean, the word wayzgoose, wordtagged NN1c, would be tagmatagged Q.

When a speaker quotes someone else at length, it is common for him to insert phrases such as he said periodically as a way of indicating that “I am still quoting — I haven’t switched back to expressing my own views yet”.  The CHRISTINE rule is that he said or similar preceding the beginning of the quoted material (if there is such a phrase) is treated as a superior clause within which the quoted material is a Q or Fn, depending whether the quotation is cast predominantly in direct- or indirect-speech form; any phrase like he said after the start of the quoted material is a clause inserted as an interpolation:

[S he said [Q:o [S ... ] [S ... ] [I [S he said ] ] [S ... ] [S ... ] [I [S he said ] ] [S ... ] Q:o] S]
If quoted material is not preceded by a quoting phrase, but contains a quoting phrase internally, the quoted material is analysed as a root Q tagma containing the quoting phrase as an interpolation:
[Q oh [I [S she said ] ] [S I # you can’t do that S] Q]  T03.00945
(If there is no quoting phrase anywhere, then structurally speaking the utterance is not marked as a quotation and is analysed as if it were the speaker’s own wording, without a Q node.)  CHRISTINE does not use the category Ss, Embedded Quoting Clause (EFC pp. 246-7), which was introduced into the SUSANNE scheme for written English as a (perhaps over-elaborate) way of dealing with the fact that direct quotation in writing is sometimes governed by a quoting phrase placed medially within the quotation.
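The labelled-bracket notation of the displayed examples above, in which a closing bracket may optionally repeat the label of its opening bracket (as in Q:o] or S]), can be read into a nested structure. The Python sketch below is an illustration only and is not software distributed with the Corpus: the function name parse_brackets and the (label, children) tuple representation are hypothetical, and the sketch assumes that every bracket token is separated from its neighbours by white space.

```python
def parse_brackets(text):
    """Parse e.g. '[S he said [Q:o ... Q:o] S]' into (label, children) tuples."""
    root = ("ROOT", [])
    stack = [root]
    for token in text.split():
        if token.startswith("["):
            # Opening bracket: the label is whatever follows the "[".
            node = (token[1:], [])
            stack[-1][1].append(node)
            stack.append(node)
        elif token.endswith("]"):
            # Closing bracket, optionally preceded by a label to be checked.
            if len(stack) == 1:
                raise ValueError("closing bracket without an opening bracket")
            node = stack.pop()
            label = token[:-1]
            if label and label != node[0]:
                raise ValueError(
                    f"closing label {label!r} does not match {node[0]!r}")
        else:
            stack[-1][1].append(token)  # an ordinary word or symbol
    if len(stack) != 1:
        raise ValueError("unclosed bracket")
    return root[1]


tree = parse_brackets("[Q oh [I [S she said ] ] [S I # you can't do that S] Q]")
print(tree)
```

Applied to the T03.00945 example, the sketch returns a single root Q node whose children are the word oh, an I node containing the quoting clause, and the quoted S.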

Rahman & Sampson (forthcoming) discuss the fact that, in speech, cues for classifying quoted material as direct or indirect speech can often conflict.  Consider, for instance:

[reporting the speaker’s own response to a directly-quoted objection]:  I said well that’s his hard luck! T15.10673
well Billy, Billy says well take that and then he’ll come back and then he er gone and pay that  T13.01053-5
In the former example, the discourse item well and the present tense of [i]s after past-tense said suggest direct quotation; but his rather than your suggests indirect speech (in context, his refers to the person who was addressed).  In the latter case, after says the word well and the imperative take imply direct speech, he’ll rather than I’ll implies indirect speech.  Arguably, it is artificial in annotating speech to use two separate categories, Q and Fn; the linguistic reality is perhaps that “directness of quotation” is a cline with no sharp direct/indirect distinction to be drawn.  From a logical point of view, the distinction seems so fundamental that we have retained it in CHRISTINE; individual quotations are classified as Q or as Fn depending whether the majority of indicators point one way or the other.

Quotations, particularly in young people’s speech, are often introduced by the verb go rather than by a transitive verb such as say:  e.g. I go sorry but I won’t do it T06.00365.  Because go is otherwise intransitive, quotations introduced by this verb are functiontagged Q:e rather than Q:o.[26]
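The two decisions just described, Q versus Fn by majority of indicators and :e versus :o according to the quoting verb, can be sketched in Python as follows. The function names are hypothetical. Counting every indicator with equal weight is an assumption, the text does not say what happens when the indicators are evenly divided (the sketch returns None in that case), and treating every quoting verb other than go as transitive is a further simplification.

```python
from typing import Optional, Sequence


def quotation_category(direct_cues: Sequence[str],
                       indirect_cues: Sequence[str]) -> Optional[str]:
    """Return Q or Fn by simple majority of cues; None when evenly divided."""
    if len(direct_cues) > len(indirect_cues):
        return "Q"
    if len(indirect_cues) > len(direct_cues):
        return "Fn"
    return None  # no rule is stated for this case


def direct_quotation_functiontag(quoting_verb: str) -> str:
    """Functiontag for a Q introduced by the given verb (dictionary form)."""
    # go is otherwise intransitive, hence :e; other verbs are assumed
    # here to be transitive like say, hence :o.
    return "e" if quoting_verb == "go" else "o"


# Cues as listed in the text for the T15.10673 example.
category = quotation_category(
    ["discourse item well", "present tense after past-tense said"],
    ["his rather than your"],
)
print(category, direct_quotation_functiontag("say"))
```

With the cues listed for the first example above, two indicators point to direct speech and one to indirect speech, so the sketch returns Q.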

There are cases where speakers incorporate non-linguistic vocalizations into the grammar of their surrounding wording, e.g. and the fortune-teller goes {vocal22} T34.02929, where {vocal22} represents “sharp intake of breath”.  In such a case the vocalization entity, wordtagged YV, is tagmatagged as a Q within the clause:

[S+ and the fortune-teller goes [Q:e {vocal22}_YV ] ]

14.12  Co-ordinate Clauses

As discussed in EFC, §6.14, in analysing speech we do not assume that every clause beginning with a co-ordinating conjunction must be treated as part of a larger co-ordinate construction.  It is common for people to utter clauses beginning and, where there is no strong logical connexion between the and clause and what preceded — sometimes and may function as little more than verbal throat-clearing, an audible warning that the speaker is about to take the floor, or to retain it after his preceding utterance.  Cases like this are treated as S+ tagmas in which the node labelled S+ is a root node.  In other cases, clauses beginning with co-ordinating conjunctions are analysed as “subordinate conjuncts” (EFC, p. 311) within larger co-ordinate structures.  Factors taken as reasons for assigning a co-ordinate structure include:

The third of these factors is quite vague, and we cannot pretend that decisions about whether to treat successive main clauses as co-ordinated or as separate tagmas are as predictable as we have tried to make most aspects of the annotation scheme.  Written English uses punctuation to put this issue beyond doubt; speech has no equivalent machinery, and it is difficult to find any satisfactory hard-and-fast rule for discriminating between [S ... [S+ ... ] ] and [S ... ] [S+ ... ] structures.

(Of course, when clauses linked by a co-ordinating conjunction are jointly subordinate within a higher clause, they must be analysed as co-ordinated.)

14.13  :h, don’t bother

The turns of phrase don’t bother Verbing ..., don’t bother to Verb ..., are common in speech.  The material after bother is treated as a Tg:h or Ti:h clause:  they don’t bother [Tg:h having scarecrows ] this time of year T05.01167.

14.14  :r, to do with

This expression, as in e.g. it’s nothing to do with the fact that he stinks T12.03958, is analysed as it’s [Ns:e nothing [Ti to do [P:r with the fact ... ] ] ].

15. Errors and Inconsistencies in English for the Computer

Since the publication of EFC, it has inevitably emerged that that book contained a number of internal inconsistencies and mistakes. This section lists these, so far as they have come to light, and specifies how they have been resolved for the purpose of our further annotation work. Note that this section deals only with actual errors in EFC. The work of applying the annotation scheme to spoken English, under the CHRISTINE project, has thrown up many issues on which the EFC scheme needs to be extended with further detail; those are not treated here. (The most important points were treated in §14.)

p.90, §3.26: The list of proper-name wordtags should include NP1t for names of towns, etc., cf. p.113.

p.105, APPGi1: The words “as possessive” do not eliminate any use of my (§4.457, pp.308-9, explicitly states that my as an exclamation receives the same wordtag). APPGi1 applies to all uses of my.

p.105, CC:  the idiom as well as should have been classified as II, not CC.  (This phrase is annotated wrongly in Release 4 of the SUSANNE Corpus.)

p.106, CSf; p.109, IF; and cf. pp. 269-70, §4.357:  it should be made explicit that the for of a Tf clause is wordtagged IF, not CSf.

p.106, DDo: the word plenty should be added to a_lot.

p.111, MD: The statement that ordinal forms such as third are given this tag even when used as fractions is inconsistent with the statement (§4.268, p.238) that spelled-out fractions are wordtagged as nouns. The latter rule is preferred to the former.

p.116, RGi: The alternative tags for about should include RL, and for over should include JB.

p.120, II: The inclusion of given is inconsistent with note 48, p.133; the approach of the latter is preferred, and II is dropped as a tag for given.

p.121, middle of page, list of II words having the alternative tag RL: this list should not contain by, since its tags are IIb and RL.

p.121: The list of RL words should include about, as in e.g. play about, go about.

p.122, middle of page: midway should not be listed as having the alternative tag RR (see the note under RR, p.117, about incompatibility of the wordtags RL and RR).

p.122, RR= idioms: all_the_same should be added.

p.123, RR= idioms: This list should not include upside_down, which is an RL= idiom (see the note under RR, p.117, about incompatibility of the wordtags RL and RR).

p.137, §3.107: The sentence about animal names is inexplicit but seems to refer exclusively to cases where animals are named after other entities. A proper name applying only to an animal is tagged NP1m or NP1f.

p.175, §4.56: The example a shear field ... should not have been included here, because the word shear (which is in any case not the head of its phrase) is being used as a noun (a technical usage listed in COD). This is an error in Release 4 of SUSANNE, as well as in EFC: repeatedly in SUSANNE text J03 this sense of shear has inappropriately been tagged as a verb.

p.191, §4.102: With respect to examples in which a modal verb, or the quasi-modal had, is followed by better, rather, etc., without a further verb following in the same group (e.g. the D11 example, or the sequence ... would rather this was put ..., CHRISTINE T28.08740), the rule stated here contradicts the rule stated on p.271, §4.363. The latter rule is preferred. (On the other hand, in a sequence like ... would rather go ..., the main verb go completes the verb group initiated by would, and this is analysed as a Vc with rather as an included adverbial.)

p.194, §4.111:  It ought to be more explicit than it is made here that Vx is not intended to include verb groups in which DO replaces a more specific main verb, e.g.:

Suzanne056:  I do # can’t remember that — Zoe055:  oh you [Vc must do ]  T12.03911-13
p.229, §4.233: The word everywhere should be included in the list of words yielding the Rw subcategory.

pp.242-3, §4.281: This rule implies, but ought to state explicitly, that when the verb omitted in informal usage is the sole verb of its clause, as in you busy for “Are you busy?” W30.0133, the category used is L rather than S: he said [Q:o [L? you on the finance committee ] ] T28.08721.

p.256, §4.315: “as discussed in §5.77” should read “as discussed in §5.85”.

p.273, §4.368: In the first bullet point, the pattern the same ... as ... should be added to as ... as ... and so ... as ... (cf. §4.319).

p.298, §4.424: In the first displayed example, Having all the guns ..., the tag Vex should read Vzex.

pp.303-4, §4.40: The sequence not even to Fergus should be tagged L@, not S@: in co-ordination, “subordinate conjuncts” are given the category of the full tagma from which elements have been removed through Co-ordination Reduction, but the latter concept is not applicable to appositional constructions.

p.304, §4.444: The rule beginning “and, if this material begins with a verb, ...”, which is restated in §5.71, p.382, seems with hindsight to be one of the most unfortunate decisions included in EFC. Many existential-there examples (including the a lot of rot talked ... case here) would be most naturally analysed by treating the part of BE following there, and the verb following the subject, as two halves of one divided verb group.  (Nevertheless, CHRISTINE sticks to the EFC rule.)

p.305, §4.447: It should be made explicit that this passage, dealing with annotation of written English, and §6.28, p.446, dealing with annotation of speech, specify contrasting analyses for exactly the same construction.

p.330, §4.507: In the CHRISTINE project, which is concerned with the spoken language where there are no elements such as inverted commas or italics marking quoted wording, a decision was made to treat the sequences which in this section are classified as appositional instead as Q tagmas. This would probably have been a better decision for the written language also.

p.334, §4.514: The reference “§4.517” in the last line of this section should read “§4.510”.

p.379, §5.64, first displayed example: “Nns:S1” should read “Nns:S123”.

p.388, §5.85: In the first displayed example, the tags “Ncs:e” and “Ps:q” should read “Ns:e” and “P:q”. (An earlier version of the SUSANNE scheme used subcategories Nc and Ps which were eliminated from the version of the scheme used in the published corpus and in EFC.)

p.418, §5.154: The inclusion here of the last example, But [A:m as I have said before] ..., contradicts the statement at the end of §5.198, p.433, that A clauses are given the functiontag :x when they act as propositional relatives, as in §4.370, p.275. The as I have said before example is indistinguishable from the examples in §4.370, and should be tagged A:x rather than A:m.

p.422, §5.164: This section is contradicted by §5.188, pp.428-9, which states that W clauses are always functiontagged :b. It will be preferable to give up the rule of §5.164 and to change all its examples of :c to :b.

p.434, §5.200, first displayed example: The tag Vaet should read Vaeut.

p.448, §6.31, UX: Listing the_hell here implies a different analysis of this construction when it occurs in speech from the analysis prescribed on p.307, §4.452, for the same construction when it occurs in written representations of dialogue. The CHRISTINE project has meanwhile evolved a new approach to the annotation of “swearwords”, explicitly changing various decisions in this book.  Of the items listed under UX, Expletives, on p. 448 of EFC:

p.448, §6.31, UR: In fact right? is often used in speech as a tag question of the kind discussed on p.298, §4.425, and this use is appropriately wordtagged UL.
 

References

Beal, Joan (1993)  “The grammar of Tyneside and Northumbrian English”, chapter 6 of Milroy & Milroy (1993).

Burnard, L., ed. (1995) User Reference Guide for the British National Corpus, Version 1.0, Oxford University Computing Services.

Caldwell, K. (1998) posting 9.720 to the electronic LINGUIST List.

Cheshire, Jenny (1982)  Variation in an English Dialect, C.U.P.

Cheshire, Jenny, et al. (1993)  “Non-standard English and dialect levelling”, chapter 3 of Milroy & Milroy (1993).

Edwards, Jane A. (1992) “Design principles in the transcription of spoken discourse”. In J. Svartvik, ed., Directions in Corpus Linguistics: Proceedings of Nobel Symposium 82, pp. 123-44. Mouton de Gruyter (Berlin).

Eisikovits, Edina (1987)  “Variation in the lexical verb in Inner-Sydney English”, Australian Journal of English 7.1-24; the page reference cited  is to the reprint in Trudgill & Chambers (1991).

Garside, R.G., et al. (1987)  The Computational Analysis of English, Longman.

Gibbon, D. et al., eds. (1997) Handbook of Standards and Resources for Spoken Language Systems, Mouton.

Goldfarb, C.F. (1990) The SGML Handbook, Clarendon Press (Oxford).

Langendoen, D.T. (1997) Review of Sampson (1995), Language 73.600-3.

Mahdi, W.  (1998)  posting 9.682 to the electronic LINGUIST List.

Meteer, Marie, et al. (1995)  “Dysfluency annotation style book for the Switchboard Corpus”, http://www.ldc.upenn.edu/myl/DFL-book.pdf.

Milroy, J. & Leslie Milroy, eds. (1993)  Real English: The Grammar of English Dialects in the British Isles, Longman.

Office of Population Censuses and Surveys (1990-1) Standard Occupational Classification, 3 vols.: vol. 2, Coding Index, 1990; vol. 3, Social Classifications and Coding Methodology, 1991. Her Majesty’s Stationery Office.

Rahman, Anna, and G.R. Sampson (forthcoming)  “Extending grammar annotation standards to spontaneous speech”.  To be in J.M. Kirk, ed., Corpora Galore: Analyses and Techniques in Describing English, Rodopi (Amsterdam).

Sampson, G.R. (1995) English for the Computer: The SUSANNE Corpus and Analytic Scheme, Clarendon Press (Oxford).

Sampson, G.R. (1998) Review of S. Greenbaum, ed., Comparing English Worldwide. Natural Language Engineering 4.363-5.

Sampson, G.R. (forthcoming) Review of Gibbon et al. (1997). To be in Natural Language Engineering.

Stenström, Anna-Brita, & L.E. Breivik  (1993)  “The Bergen Corpus of London Teenager Language”.  ICAME Journal 17.128.

Trudgill, P. (1990) The Dialects of England, Blackwell (Oxford).

Trudgill, P. & J.K. Chambers, eds. (1991)  Dialects of English, Longman.

Wells, J. (1982) Accents of English (3 vols.), Cambridge University Press.
 

Notes

[1]  The name CHRISTINE was chosen as a distinctive label for the project under discussion, which is unlikely to coincide with names of other projects formed on the acronym principle, and has an appropriate relationship with the name of the earlier SUSANNE project.  My group has a tradition of using names of female saints, Anglicized for brevity.  CHRISTINE is the successor project to SUSANNE, and St Christina is the successor to St Susanna in the Calendar of Saints.  (The life of St Christina also had features making her a suitable patroness for a project concerned with speech; see the project web page.)

[2]  Other, independently-invented treebank annotation schemes have since been developed for other treebanks.  It has not seemed practical or desirable to change our scheme to make it more like later-emerging schemes, which have not always been defined at the same level of detail.

[3]  Some of the written material in the BNC pre-dates the 1990s, since published writings are often read years after they were written.

[4]  The BNC Manual does not explain the principles behind the regional sampling. It appears that Britain was divided into a limited number of large regions, and recruits were selected from the different regions, perhaps in proportion to population, but there was little or no attempt to achieve geographic spread within each individual region: there are many cases where several recruits all came from the same small village, perhaps because the Corpus compilers happened to have connexions there.

[5]  There is also at least one case of the opposite type of coding confusion in BNC, where the same coding is used with contrasting meanings:  see contributions to the BNC discussion list, bnc-discuss@maillist.ox.ac.uk, by David McKelvie and by Lou Burnard, both dated 6.1.1997.

[6]  The ratio would be slightly different for the published version of CHRISTINE, which has redistributed various categories of information between fields in wordlines and independent lines of their own.

[7]  To avoid confusion stemming from frequent changes of counties and county boundaries over the last quarter-century, place names in CHRISTINE notes files, as commonly in long-term scholarly publication, use the traditional pre-1974 counties.

[8]  The BNC Manual seems confused on this point. For instance, BNC text F8R, the source of CHRISTINE text V01, is described on p. 229 of the Manual as a lecture and as involving “two participants”, PS000 and the lecturer PS1PR; but the section of the file from which the CHRISTINE text is extracted appears to be a tutorial discussion involving several different students, all shown as PS000.

[9]  A revised version of this scheme has now appeared, changing the range of categories in the light of changing social and employment patterns during the 1990s.  The 1990-1 version of the scheme was the one in standard use at the period when the BNC/demographic recordings were made, and is the version discussed here.

[10]  The rationale for this complication in the BNC structure is not entirely clear.  The BNC compilers perhaps felt that a nonverbal vocalization could not be called a “sentence”, though there seems no good reason why an inaudible stretch of wording should not count as an “utterance”.

[11]  Occasionally, BNC shows a “beginning” time-pointer entity occurring within a word rather than between words (e.g. at T05.01149 a pointer interrupts mini-series, which is otherwise treated by BNC as a single word).  CHRISTINE marks such a case with the ampersand symbol as if the time-pointer preceded the word within which it occurs.

[12]  Our original plan was to mark the words following “opening” but not “closing” time pointers — words d and h in the example — which would have made for a more logical system.  However, the BNC time-pointer entities are not themselves marked as opening or closing — this status can only be inferred from their pattern of occurrence; and their distribution is not always as straightforward as in the schematic example above.  For instance, the BNC original of source-units T01.02621-3 displays the following pattern:

Harold001:  ... ptr1 ... ptr2 ...
Jean003:  ptr1 ... ptr3 ... ptr2
Harold001:  ptr3 ... ptr2
On the face of it, this seems to mean that part of Harold’s first turn was simultaneous with the whole of Jean’s, and that Harold produced a further turn which was simultaneous with the latter part of Jean’s turn — and therefore with part of his own first turn.  This is senseless; we have no way of establishing what was actually happening, so we adopted the ampersand system described here as one into which the BNC notation could be mechanically translated.

[13]  From the “localization” aspects of some modern word-processing software, it appears that American information technologists are under the impression that standard British English requires the -ise variant of this suffix. That is incorrect; -ise is an optional variant, and high-prestige publishers and style manuals tend to prefer -ize.

[14]  Disyllabic v. monosyllabic pronunciation (because v. cos) and status as subordinating v. co-ordinating conjunction may be two independent issues; there may be cases where because, pronounced as such, is equally drained of subordinating force. Note that the point made above is not the banal one that spontaneous speech is sometimes logically vague or confused; other subordinating conjunctions, e.g. although, if, when, seem to retain their subordinator status even in spontaneous speech, and the point made here is specific to cos/because.

[15]  Certain duration markers were unfortunately lost by an oversight in the process of reformatting BNC into CHRISTINE files.  This is believed to apply only to markers occurring in the exceptional source-units numbered 00000 or -----, discussed in §6.5.  For instance, the {unclear} entity produced by Sadie148 immediately before T40.00141 should have been recorded as {unclear_8}.

[16]  However, when a clause is interrupted immediately after its subject, the functiontag :s is used, as overwhelmingly most likely to apply, even though in theory the verb might have turned out to be passive in which case the subject would have been functiontagged :S.  Thus:

this sort of idea {pause} [Fn that [Ny:s you ] # it wasn’t your own room ] W09.00502

[17]  This rule may be not only easier to apply but more appropriate with respect to user needs.  An important application area for annotated speech corpora is likely to lie in the area of improved automatic speech understanding systems; the wordtag hypotheses generated by speech-recognition software are likely to be limited to those listed for particular wordforms in standard dictionaries.

[18]  Here it is possible that the oddity is not grammatical but is the transcriber’s misunderstanding of the speaker’s London accent.  A London [tIw] for till might be heard from an RP perspective as to.  But there is no indication in the transcription that the wordform was anything other than a normal utterance of to, so we have taken the transcription at face value.

[19]  The BNC transcribers typically fail to distinguish in’t (= ain’t) from in t’ (N.E. England in the), writing both as int.  But in context these are easily distinguished.

[20]  In fact the i- of innit is probably ambiguous with respect to person also; for instance Cheshire (1982: 58) records a tag question in the form in I where standard English would have aren’t I.

[21]  It is very common to represent a laugh with exactly two such syllables, ha ha or ho ho, so that these “phrases” might be treated as US= idioms.  However, this treatment was rejected as over-ingenious; in CHRISTINE, each “laughter” syllable, written as a separate word in our source transcriptions, is treated as a whole US “word”, and a sequence of such “words” is not grouped as a tagma.

[22]  For the benefit of any American users of CHRISTINE, it should perhaps be explained that, because standard RP is a non-rhotic variety of English, the letter sequence er is seen in England as a digraph representing the shwa vowel, and is the conventional orthographic device for representing shwa as a filled pause — the sound which American writings typically show as uh.  The BNC transcribers spelled almost all vocalic filled pauses as er.  Scots English is strongly rhotic and, I believe, contains shwa only as a checked vowel; in consequence Scots use a different, front vowel as a pause filler, and this is conventionally written eh.  (I am not sure what speakers of other rhotic regional dialects do about filled pauses.)

[23]  Mahdi (1998) comments that he first encountered this use of like “in Mad magazine around 1959 or 1960, when it was still explicitly characterized as Californian colloquial or youth”.  In Britain I am sure it is a more recent innovation.  A possible example does occur at W09.00511, recorded as early as 1975; but this might alternatively be seen as a case of like in its literal sense.

[24]  This corresponds both to a comparative adjective, JJR, and to an adjective used only predicatively, JA. We did not want to invent a new wordtag such as JAR, so the idiomtag JJR= was used.

[25]  One phrase, how are you, listed in EFC as a discourse-item idiom, is treated instead in CHRISTINE as a question with ordinary internal grammar.  It is noticeable both that most CHRISTINE speakers treat How are you? as a question to be given an answer, and that speakers sometimes vary the grammar of the phrase.

[26]  Incidentally, this construction is probably not as peculiar to “youthspeak” as some commentators suggest.  I think it would be normal in the speech (or writing) of all age-groups to use wording such as The timer went “ping”; the only thing that is distinctive about young people’s use of the construction is that it is generalized to cases where the sounds described are verbal.