Geoffrey Sampson
Department of Informatics
University of Sussex
“Releases” refer to modifications to the Corpus as a tar file distributed by ftp. Inevitably, there will be occasions when modifications to this Documentation file as a Web page run ahead of the version of the same file that is included within the Corpus tar file. (When the Documentation file is changed, it takes time to create a new compressed tar file of the entire corpus and mount it on our ftp server.)
The Documentation file you are reading was last modified on 1 Jan 2004.
Change of Address

In the past, the language-engineering resources published by my research team have been scattered at different internet locations not all under my control, and they have more than once been shifted to new addresses without notification to me. I apologize to users for the frustrations this has sometimes caused. To avoid such problems in future, I have acquired my own internet domain, which I intend to maintain indefinitely. My home page has now moved to:
From now on this will always include a pointer to a list of the current locations of corpora and other downloadable research resources produced under my direction. In due course, those resources may themselves be shifted into the grsampson.net domain.
Release 2 differs from Release 1 in that “ghost” elements in the structural analysis, representing the logical placing of elements which have been deleted or moved into a different clause in surface structure, are shown as labelled brackets in the parse field rather than as items in the word field. (Lines for ghosts in Release 2 have a hyphen in the word field.) This brings the analytic formalism of the CHRISTINE Corpus into closer conformity with that of the SUSANNE Corpus, and makes the files less confusing for human readers.
Release 1 was completed on 29 July 1999.
The CHRISTINE Corpus is a structurally-annotated sample of spoken English. The sample is based on extracts from the “demographically-sampled” speech section of the British National Corpus. It therefore forms a suitable resource for studying grammatical and other structural features in the spontaneous, informal usage of a cross-section of speakers drawn from all social classes and regions of the United Kingdom in the 1990s.
The CHRISTINE Corpus conforms to relevant recommendations of the EAGLES (Expert Advisory Group on Language Engineering Standards) Spoken Language Working Group (Gibbon et al. 1997), as well as to preferences expressed by an international group of more than thirty experts consulted via the Internet at the beginning of the project which created it.
The CHRISTINE project was sponsored from 1996 to 1999 by the Economic and Social Research Council (UK), under award no. R000 23 6443, as a successor to the project which produced the SUSANNE analytic scheme and Corpus.[1] The main aim of both the SUSANNE and the CHRISTINE projects has been to develop detailed, comprehensive, and explicit standards for annotating the structural properties of samples of English language as used in real life. Such standards can be developed only by applying an annotation scheme to language samples and refining it in response to problematic cases; so the work yields, as a valuable by-product, corpora, or “treebanks”, annotated in accordance with the scheme.
(The term “treebank” is now accepted internationally to describe a natural-language sample equipped with annotations representing grammatical structure. I believe the term was first coined by my colleague Geoffrey Leech of the University of Lancaster, in connexion with the treebank for whose creation at Lancaster I took responsibility in 1983 and which is described in Garside et al. (1987: ch. 7). The SUSANNE and CHRISTINE analytic scheme, though considerably more sophisticated, is the lineal descendant of the scheme developed by Leech and myself in the early 1980s.)[2]
The SUSANNE project focused chiefly on written language. It produced the structurally-annotated SUSANNE Corpus of written (American) English, published in 1992, together with a 500-page book, Sampson (1995) (referred to below as EFC), which defined the annotation scheme. The scheme has been winning a measure of international recognition; for instance, D. Terence Langendoen, President of the Linguistic Society of America, comments that “the detail ... is unrivalled” (Langendoen 1997: 600).
The CHRISTINE project extended this work to the domain of spoken English. Much of the notational apparatus defined in English for the Computer applies equally to spoken or to written English. Chapter 6 of that book proposed additional notations to deal with the special structural features of spoken language, such as the speech-repair structures produced when a speaker edits his wording “on the fly”. The CHRISTINE project tested and refined the scheme which includes these extensions, by applying it to a range of samples of recorded speech from a variety of sources. The material in the published CHRISTINE Corpus represents that part of the project’s annotation work which has been brought to a state suitable for public distribution.
(The project also annotated further passages, drawn from the London-Lund and the Reading Emotional Speech Corpora as well as additional excerpts from BNC, and it was originally intended to include these passages, too, in the published CHRISTINE Corpus. Unfortunately, various practical and staffing problems meant that this has not to date been possible, though it is still the intention to publish at least some of the additional material eventually. Because of the slippage in this aspect of our plans, we no longer use the name “CHRISTINE Corpus Stage I” for the currently-available Corpus; that was appropriate when publication of a larger resource was expected to occur within months, but it is now more realistic to use the short name for the material published to date.)
By now, the SUSANNE Corpus is in use in research institutions in many parts of the world, and numerous research publications have been based on it. It was not the first and is far from the largest annotated corpus to have been published, but for some research purposes it has proved specially useful. Users have commented, for instance, on the unusual richness and precision of its annotations. The CHRISTINE Corpus is not the first structurally-analysed corpus of spoken English. For American English there is, for instance, the (specialized) Switchboard Corpus (Meteer et al. 1995); for British English the ICE Corpus (www.ucl.ac.uk/english-usage/ice-gb/) pipped us to the post by several months. But CHRISTINE has several virtues:
which may give it a special value in some research contexts.
CHRISTINE undoubtedly contains errors. I should be very grateful to be notified of any errors discovered by users (via e-mail to an address that I shall express cryptically to avoid the attention of spammers: grs2, followed by at-sign, followed by sussex.ac.uk), so that these can be eliminated when the full Corpus is released. Any such help will be publicly acknowledged.
More information on the CHRISTINE and SUSANNE projects and Corpora is available on the World Wide Web: visit my home page at www.grsampson.net and follow the respective links.
The CHRISTINE web page includes details of the research team, but it is proper to acknowledge them by name here. Any value the CHRISTINE Corpus may have is largely due to the dedicated hard work of Alan Morris and Anna Rahman. The first-person pronoun is used in the documentation files, because these are largely concerned with debatable points on which the Principal Investigator necessarily took the final decision; but this should not detract from the credit due to other members of the team.
The electronic files comprising the CHRISTINE Corpus may be copied freely by anyone and used for any purpose. The Economic and Social Research Council, as sponsoring agency, and the University of Sussex as the contracting institution would undoubtedly appreciate acknowledgments in any publications which emerge from research using the CHRISTINE Corpus.
The British National Corpus is an electronic resource intended to supply empirical data on the English language as “produced” (that is, spoken and written) and “received” (heard and read) in Britain in the 1990s.[3] The BNC was created by a consortium comprising the publishers Oxford University Press, Longman, and Chambers Harrap, Oxford and Lancaster Universities, and the British Library; the chief sponsors were the Department of Trade and Industry and the Science and Engineering Research Council (now the Engineering and Physical Sciences Research Council). Release 1.0 of the BNC was circulated in 1995 and is documented in Burnard (1995); this book is referred to below as the BNC Manual.
The BNC contains 4124 language samples comprising 100 million words in all. Of this, about 10% — ten million words — is transcribed speech (the remainder being published and unpublished written material). The spoken part of the Corpus is divided into two parts:
I shall use the abbreviations “BNC/speech”, “BNC/demographic”, and “BNC/context-governed” to refer to the spoken part of the BNC Corpus and its subparts. In essence, BNC/demographic is a sampling of the spoken interactions engaged in by a cross-section of the British population over a given period; the overwhelming majority of these are informal conversation, so BNC/context-governed samples speech-events on the basis of genre rather than on the basis of speakers’ social characteristics, in order to achieve coverage of other speech genres.
The material in the published CHRISTINE Corpus is drawn wholly from BNC/demographic. For BNC/demographic, 153 individuals were recruited in such a way as to give, so far as possible:
Inevitably, practical difficulties prevented this intended distribution from being realized perfectly, but a reasonable approximation was achieved. (Detailed figures are given in the BNC Manual, p. 20. The Manual acknowledges that 153 respondents are fewer than ideal, but resource constraints forbade a substantially larger sampling.)
The recruits — in BNC terminology, “respondents” — were provided with tape recorders and asked to record all speech events in which they took part over a period comprising at least two different days of the week, thus achieving a mixture of weekdays and weekends. As well as returning the recordings, respondents also supplied logs which were intended to include demographic descriptions of other participants in the conversations (though, as we shall see below, this proved to be an area of severe weakness in the system).
BNC/demographic comprises 153 files, one for each respondent’s recordings; the average wordage in a single respondent’s file is about 27,500 words, though there is considerable variation round this mean.
The recordings were transcribed using conventional orthography, with ordinary punctuation, sentence-initial capitalization, etc. (My understanding is that this work was done by clerical employees of the Longman Group, based at Harlow, Essex, though this is not stated in the Manual and may be incorrect. If it is correct, then the variety of English familiar to the transcribers is likely to have been fairly close in pronunciation to RP — “Received Pronunciation”, the national standard; a number of oddities of transcription are understandable if distant regional dialects were being filtered through ears attuned to RP.) The Corpus as released comprises these transcriptions encoded into an SGML-based file structure, including various analytic annotations (e.g. wordtags) produced semi-automatically by the consortium researchers. (The CHRISTINE Corpus ignores the BNC annotations; we applied our own much more detailed annotation scheme manually to a subset of BNC/demographic, so that only the actual words uttered are common to the two corpora.)
It should be said that BNC/speech, though unrivalled as a cross-sectional sampling of contemporary British speech, is not an ideal research resource in every respect. The sound recordings are so far not available to researchers (though this may change); in any case, having been made in “field conditions”, the recordings were inevitably often of poor quality by the standards of lab-based speech research. Furthermore, the standards of transcription often leave something to be desired (many transcriber errors are discussed in the notes files for the individual texts). The other sources of transcribed speech used by the CHRISTINE project have their own virtues (the Reading Emotional Speech Corpus is available as digitized sound signals, the London-Lund Corpus is transcribed to a very high standard of accuracy); conversely, neither of them can claim to be representative of the national population in the way that BNC/demographic is. At present, there is simply no resource available which combines all desirable properties.
CHRISTINE comprises structural annotations of forty passages excerpted from the BNC/demographic files. Altogether 147 identified speakers are represented in CHRISTINE (there is also a good deal of speech by unidentified speakers).
Rather than the SGML format used in the original BNC files, CHRISTINE uses a one-word-per-line fixed-field format, similar to that of the SUSANNE Corpus. This is in accordance with preferences expressed by the experts consulted at the outset of the project. Because the field structure of CHRISTINE is very simple, it would be a trivial matter for anyone whose application requires an SGML-structured data resource to convert CHRISTINE into such, given a suitable DTD (Document Type Definition). For the many users who have no such requirement, the existing format is both more transparent and far more computationally tractable.
Use of SGML for a data resource with such a simple structure as the CHRISTINE Corpus is arguably a negative factor, because it creates many possibilities of inadvertently introducing meaningless coding distinctions. We have encountered several cases of this in Release 1.0 of BNC:
In each of these cases, the coding distinction seems to represent no real difference in what is being said about the structure of the relevant utterance.[5]
(Here and below, examples from the CHRISTINE Corpus are given a location reference in the form “T12.34567”, meaning “text T12, source-unit 34567” — for “source-units”, see §6.2. Examples are quoted with the punctuation and capitalization provided by the BNC transcribers, where this is helpful for understanding the structure of the utterance. Some examples quoted in the present document are taken from material annotated by the CHRISTINE project but not included in the published Corpus.)
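For users who process such references mechanically, the location format can be split into its two parts with a short regular expression. The following is only an illustrative sketch: the function and class names are invented here, and the pattern assumes a single capital prefix letter, two digits, a full stop, and five digits, as in the examples quoted in this document.

```python
import re
from typing import NamedTuple

class Location(NamedTuple):
    text: str         # text name, e.g. "T12"
    source_unit: str  # source-unit number, kept as a string to preserve leading zeros

# Assumed shape of a reference: prefix letter, two digits, ".", five digits.
_LOCATION = re.compile(r"^([A-Z]\d{2})\.(\d{5})$")

def parse_location(ref: str) -> Location:
    """Split a reference such as 'T06.00524' into text name and source-unit."""
    match = _LOCATION.match(ref)
    if match is None:
        raise ValueError(f"not a recognizable location reference: {ref!r}")
    return Location(match.group(1), match.group(2))

print(parse_location("T06.00524"))
```

Keeping the source-unit as a string rather than an integer means that the reference can be reassembled exactly as it was written.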
The 40 CHRISTINE passages or “texts” are similar to one another in length, and the length was selected so as to be broadly comparable with the texts in the Brown, LOB, and SUSANNE Corpora of written English. The latter corpora were designed so that each text contains 2000 words, plus a few more as needed to make each text-end coincide with a sentence boundary. This rule is not directly applicable to a corpus of spontaneous speech, for one thing because the concept “sentence” does not apply straightforwardly to the spoken language, but also because transcribed speech contains many items — ums and ers, failed partial attempts at uttering words, markers showing that different speakers’ utterances were simultaneous, headers identifying speaker turns, records of “noises off”, pauses, etc. — which are not comparable to written words. For some sample passages from the demographically-sampled BNC speech corpus, having converted them from the original SGML format into a fixed-field format I determined the average ratio of lines to ordinary spoken words to be about 1.46:1.[6] This ratio would imply excerpts of about 2930 lines to get 2000 “real words”. However, the items other than words are themselves scientifically-interesting data items, though they seem individually less “weighty” than real spoken words. Consequently I chose 2800 lines as a target text length, as a compromise between 2930 and 2000.
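The arithmetic behind these figures can be checked in a few lines. This is merely an illustrative sketch with invented function names. With the ratio exactly as quoted (1.46), 2000 words correspond to 2920 lines, slightly under the figure of 2930 given above, presumably because the quoted ratio is itself rounded.

```python
LINES_PER_WORD = 1.46  # average ratio of lines to ordinary spoken words, as quoted
TARGET_WORDS = 2000    # nominal text length in the Brown, LOB, and SUSANNE Corpora

def lines_for_words(words, ratio=LINES_PER_WORD):
    """Approximate number of fixed-field lines needed to hold `words` spoken words."""
    return round(words * ratio)

def words_in_lines(lines, ratio=LINES_PER_WORD):
    """Approximate number of spoken words contained in `lines` fixed-field lines."""
    return round(lines / ratio)

print(lines_for_words(TARGET_WORDS))  # 2920
print(words_in_lines(2800))           # 1918
```

On this ratio, the 2800-line target corresponds to a little over 1900 ordinary words per text.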
Because the boundaries of excerpts from BNC were chosen to coincide with natural breaks in the speech stream, as discussed below, most excerpts in practice are longer than 2800 lines. The CHRISTINE texts as published contain about 112,000 lines in total, corresponding to about 80,500 “full words”, ignoring hesitation phenomena, etc.
My research team worked on the principle that the task of those who compile natural-language corpora is to represent the properties of language samples in a clear, explicit fashion that creates the fewest possible hurdles for researchers who wish to extract data from a corpus. We did not see it as part of our task to produce software for data extraction. We could not do that, since we have no way of knowing what sorts of questions future researchers will want to pose to our data. (SUSANNE has been used for various kinds of research that I had no thought of when I put it into circulation.) This point seems worth making, because since the publication of SUSANNE I have more than once encountered comments suggesting that, in failing to supply accompanying utility software, we left a job half done. In response, let me quote remarks I made in a recent book review (Sampson 1998: 365) about the approach which sees utility software as an essential accompaniment to corpus data:
It is hard to see this as a wise policy for allocating scarce research resources. In practice there are usually two possibilities when one wants to exploit corpus data. Often, one wants to put very obvious and simple questions to the corpus; in that case, it is usually possible to get answers via general-purpose Unix commands like grep and wc, avoiding the overhead of learning special-purpose software. Sometimes, the questions one wants to put are original and un-obvious; in those cases, the developer of a corpus utility is unlikely to have anticipated that anyone might want to ask them, so one has to write one’s own program to extract the information. No doubt there are intermediate cases where a corpus utility will do the job and grep will not. I am not convinced that these cases are common enough to justify learning to use such software, let alone writing it.
Forty of the BNC/demographic files were chosen at random to serve as sources of excerpts for CHRISTINE. In order to explain how 2800-line excerpts were selected from these files, it is necessary to explain something of the internal structuring imposed by the BNC compilers on their demographically-sampled speech files.
These files are hierarchically structured into units delimited by SGML tags <div>, <u>, and <s>. The <div> (division) unit corresponds, at least nominally, to a recording of an individual conversation. (In practice <div> breaks sometimes interrupt what appear to be single conversations; so far as I have seen, the BNC Manual does not explain how <div> boundaries were decided.) The <u> and <s> (“utterance” and “segment”) units are intended to correspond to speaker turns and to individual sentences, respectively. Again, in practice these units are often of questionable scientific significance. A speaker’s output is frequently split in BNC into separate <u> units merely because another participant interjects a brief remark (perhaps no more than a reassuring mm) in the middle of what is from all other points of view a single continuous speech-turn. And, although the BNC transcribers set out their transcriptions in the form of sentences, beginning with capital letters and ending with full stops or equivalent punctuation, the grammatical concept “sentence” is often inapplicable to the wording of spontaneous speech, which contains many sequences of wording that do not fit into conventional ideas of sentence structure.
Within each of the 40 randomly-selected BNC/demographic files, I used a random-number generator to select a line in the reformatted version of the file between the first line and the line 2800 short of the last line. I then began the excerpt at a <div> boundary close to this randomly-chosen line, if there was one, and continued to the first <u> boundary at least 2800 lines later. If no <div> boundary occurred near the randomly-chosen line, I began at a <u> boundary (and, if the 2800th line was close to a <div> boundary, I adjusted the excerpt to end there); furthermore, if a BNC <u> boundary did not appear to represent a natural break in the dialogue structure, I continued to a “better” <u> boundary. In general, I allowed myself considerable latitude in ranging forward or back from the randomly-chosen line to find a natural break which led to another natural break roughly 2800 lines later. There was no element of planning in terms of selecting “interesting” or “representative” excerpts from the BNC files; but I treated the aim of finding excerpts with reasonably natural boundaries as a higher priority than making the excerpt boundaries mechanically random, in the sense of being wholly determined by randomizing techniques with no exercise of discretion.
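The mechanical core of this procedure can be sketched as follows. This is a simplified illustration, not the procedure actually used: the real selection involved discretion about what counts as a natural break, which no code can reproduce. The function names, the representation of boundaries as sorted lists of line numbers, and the threshold for a <div> boundary being “close” (here 200 lines) are all assumptions made for the sake of the example.

```python
import bisect
import random

def nearest(boundaries, line):
    """Return the entry of the sorted, non-empty list `boundaries` closest to `line`."""
    i = bisect.bisect_left(boundaries, line)
    candidates = boundaries[max(i - 1, 0):i + 1]
    return min(candidates, key=lambda b: abs(b - line))

def choose_excerpt(n_lines, div_starts, u_starts, target=2800, near=200, rng=random):
    """Pick (start, end) line numbers for an excerpt; `end` is exclusive.

    div_starts and u_starts are sorted line numbers at which <div> and <u>
    units begin in the reformatted file; u_starts must not be empty.
    """
    # Random line between the first line and the line `target` short of the last.
    seed_line = rng.randint(1, n_lines - target)
    # Prefer a <div> boundary if one lies close to the random line ...
    start = nearest(div_starts, seed_line) if div_starts else None
    # ... otherwise fall back on the nearest <u> boundary.
    if start is None or abs(start - seed_line) > near:
        start = nearest(u_starts, seed_line)
    # Continue to the first <u> boundary at least `target` lines later,
    # or to the end of the file if there is none.
    j = bisect.bisect_left(u_starts, start + target)
    end = u_starts[j] if j < len(u_starts) else n_lines + 1
    return start, end

# Invented boundary positions, purely for demonstration.
u_boundaries = list(range(1, 10001, 50))
div_boundaries = [1, 3000, 6000]
print(choose_excerpt(10000, div_boundaries, u_boundaries, rng=random.Random(0)))
```

A sketch of this kind yields only a first approximation; the manual adjustment to “better” boundaries described above would still have to follow.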
As it turned out, the extracts selected in this way included a minority of cases where the BNC header file gave little or no descriptive information about the speakers, or where a high proportion of speaker turns were not attributed to any identified speaker. This is unfortunate, for purposes of studying who says what in modern Britain, and one possibility would have been to discard those extracts and find other extracts for which information was more complete. But this would probably have skewed the sample. It is surely to be expected, for instance, that less detailed identification of speakers will happen for a recording of teenagers “hanging out” in a city street than for a recording made in a middle-aged couple’s living-room. The chief aim of the project was to produce a representative sample of modern British usage, so we refrained from “improving” on the outcome of the random selection process, and we accepted some gaps in the speaker information as a price to be paid for representativeness.
In 1999, the BNC Consortium released a “BNC Sampler” corpus, containing a selection of material from all parts of the full BNC Corpus, including BNC/speech, for use by researchers whose circumstances made it unnecessary or difficult to deal with the hundreds of megabytes of the full BNC Corpus. Natural-language corpora gain value when the same language samples are studied and processed by many different researchers in different ways, so ideally it would have been desirable to make the CHRISTINE selections overlap with those of the BNC Sampler, which are probably destined to be worked over much more intensively than other parts of the BNC. However, the Sampler was produced too late to allow this. (The BNC selections included in CHRISTINE were made in late 1996; as it happened, my copy of the Sampler disc arrived the day after I had extracted and applied initial processing to the last set of BNC extracts used in our project.)
We saw, above, that the contents of BNC/demographic consist overwhelmingly of informal conversation, but nothing in the sampling methodology ruled out the possibility of including speech of other genres. (The BNC Manual, p. 20, states that respondents were asked to record all of their “conversations”, but this is probably just intended as a nontechnical way of saying “all speech-events”; at any rate, there is a small amount of non-conversational material in CHRISTINE, for instance a sermon-like monologue.) In selecting extracts, we made no attempt to exclude non-conversational material. The aim was to provide a sample of the language that people actually hear in real life; the majority of that is spontaneous conversation, but some is not.
In view of the nature of many of the conversations excerpted, it is perhaps also worth stressing that there was no deliberate intention to choose salacious material. The tone of CHRISTINE, so far as we can tell, simply reflects a fair cross-section of British conversation in the 1990s.
The forty text extracts in CHRISTINE are named T01, T02, ..., T40. (The prefix letter “T” would become significant if the other files annotated by our project are eventually published; different prefix letters are used for material taken from different source corpora.) CHRISTINE consists of a set of 84 files, as follows:
As stated above, the version of the Documentation file included in the Corpus will sometimes be out of date relative to the version available as a Web page.
The Lexicon file contains an alphabetized list of all pairs of wordform and wordtag that occur at least once in the Corpus. Inclusion of a list of wordforms is a recommendation of the EAGLES Spoken Language Working Group, Gibbon et al. (1997: 170, Recommendation 6). In the CHRISTINE case, separate listing of grammatically-distinct uses of single wordforms is an obvious way of increasing the value of such a list.
Each line of the file contains a wordform followed by a wordtag, separated by a tab character, and terminated by a newline.
The Lexicon file covers only actual uttered words (whether complete or distorted/truncated); it does not contain entries for non-linguistic or analytic items. (That is, it contains entries only for word lines of the third category listed under §6.10.)
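Because the Lexicon format is so simple, it can be read with a few lines of code. The sketch below assumes only what is stated above (wordform, tab, wordtag, newline); the function name is invented, and the wordtags in the demonstration are placeholders rather than real tags from the CHRISTINE scheme.

```python
from collections import defaultdict

def read_lexicon(lines):
    """Map each wordform to the list of wordtags with which it occurs."""
    tags_by_form = defaultdict(list)
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue  # tolerate a blank line, e.g. at the end of the file
        wordform, wordtag = line.split("\t")
        tags_by_form[wordform].append(wordtag)
    return dict(tags_by_form)

# Placeholder entries, for demonstration only.
sample = ["run\tTAG_A\n", "run\tTAG_B\n", "tea\tTAG_A\n"]
lexicon = read_lexicon(sample)
print(lexicon["run"])  # ['TAG_A', 'TAG_B']
```

With the real file, the same function could be applied to an open file object, since iterating over a text file yields its lines.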
The main goal of the CHRISTINE project has been to annotate a cross-section of British speech, and to develop guidelines for executing such annotation in a predictable manner — not to study differences of usage among different types of speaker. However, the BNC source files do include background information about many of the speakers; this information is neither as complete nor as reliable as one might ideally hope, but it is a good deal better than no information.
The file Speakers summarizes in machine-usable format the information available on individual CHRISTINE speakers’ demographic characteristics; §4 discusses the assumptions on which this summary is based.
For each speaker represented in the Corpus other than “unidentified” speakers, the file includes one line, terminated by a newline character, and containing eight fields separated by tabs, e.g.:
003 T01 1992 F 63 NO DE Jean
Field contents are:
Dialect (region) codes:
- SE South East England
- SW South West England
- MD Midlands
- NO Northern England
- SC Scotland
- WA Wales
- IR Ireland
- NA North America
- WI West Indies
- SH Southern Hemisphere
- ON other native speaker
- FO non-native speaker of English
- XX unknown
Social class codes:
- AB professional, managerial, technical
- C1 skilled non-manual
- C2 skilled manual
- DE partly skilled or unskilled
- XX unknown
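A line of the Speakers file can be decoded along the following lines. This is a sketch under stated assumptions: the order of fields is inferred from the sample line above (three-digit speaker code, text name, a third field left uninterpreted here, sex, age, dialect code, social class code, name), and the function and dictionary names are invented.

```python
REGIONS = {
    "SE": "South East England", "SW": "South West England", "MD": "Midlands",
    "NO": "Northern England", "SC": "Scotland", "WA": "Wales", "IR": "Ireland",
    "NA": "North America", "WI": "West Indies", "SH": "Southern Hemisphere",
    "ON": "other native speaker", "FO": "non-native speaker of English",
    "XX": "unknown",
}
SOCIAL_CLASSES = {
    "AB": "professional, managerial, technical", "C1": "skilled non-manual",
    "C2": "skilled manual", "DE": "partly skilled or unskilled", "XX": "unknown",
}

def parse_speaker(line):
    """Decode one tab-separated line of the Speakers file into a dictionary."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) != 8:
        raise ValueError(f"expected 8 fields, found {len(fields)}")
    # Field order is an assumption based on the sample line in the text.
    code, text, third, sex, age, region, social, name = fields
    return {
        "code": code, "text": text, "third_field": third, "sex": sex,
        "age": age,  # kept as a string, since a value may be unknown
        "region": REGIONS.get(region, region),
        "social_class": SOCIAL_CLASSES.get(social, social),
        "name": name,
    }

print(parse_speaker("003\tT01\t1992\tF\t63\tNO\tDE\tJean\n"))
```

Unrecognized codes are passed through unchanged rather than rejected, so that the sketch does not fail on values not listed above.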
The forty notes files, one for each text, are in HTML format and are intended to be read by users rather than machines. They include information on the following issues:
The BNC compilers promised anonymity to the speakers represented in BNC/speech. The CHRISTINE Corpus extends this BNC policy in certain respects.
The anonymity policy was implemented in BNC by removing surnames of speakers, and a few other proper names, replacing them with an SGML entity which in CHRISTINE appears as <name>. However, this procedure is arguably not adequate.
The headers to the BNC/speech files do specify speakers’ Christian names (forenames); and of course they also specify the dates and places of the recordings. The places specified are sometimes small villages. The date and place specifications represent significant scientific data, and must be preserved. But, particularly when the speakers’ Christian names are moderately or very unusual, it seems likely that someone familiar with the locale in question would often be able to identify groups of friends from their Christian names.
True, an outsider would hardly be able to identify individuals without their surnames. But anonymity vis-à-vis outsiders is not the only kind of anonymity that matters. Surely it is equally important to protect, say, a group of youngsters who have been recorded chatting freely among themselves from embarrassment through being recognized by their own teachers or relatives. One may feel that the likelihood of such an “insider” encountering the CHRISTINE Corpus is fairly low. But the decisive point is that some of the speakers themselves understood that the corpus compilers were offering them this level of anonymity. For instance, T06.00524 shows the speaker explaining the system to her companion by saying they don’t give them a name, they just say ... sixteen-year-old girl, fifteen-year-old girl with a friend. It is not for us to breach this expectation of literal anonymity.
Furthermore, it is not only the speakers themselves who should be protected. For instance, the two speakers just mentioned comment that one of their schoolmates, identified by Christian name, behaves like a whore. This person is entitled to anonymity as much as the speakers, and arguably more so: she signed no release form for the corpus compilers. When well-known public figures or institutions are mentioned, the BNC compilers seem to have felt that there was no need to anonymize the references at all. Clearly, if someone announces that he has just bought the latest album by a named pop singer, there is no point in concealing the singer’s name. But it depends what is said. One of the CHRISTINE texts contains a series of quite damaging remarks about the management of a secondary school, named in the BNC file. In another case, speakers comment adversely on the sexual morality of a named American actress. Even American actresses, surely, are entitled to have their honour guarded by corpus linguists.
Consequently, the CHRISTINE Corpus has taken the BNC anonymization policy further, in the following ways.
Where a BNC file gives the name of an institution, or the surname of a third-party individual (it never gives surnames for participants in the dialogues), in a context where it seems possible that the identification could cause embarrassment, CHRISTINE replaces the name with the <name> entity.
Christian names of speakers are in all cases replaced by other Christian names, both in identifying the utterers of speech-turns, and in the transcription of words uttered. Each speaker represented in the CHRISTINE Corpus is assigned a name and a three-digit code, e.g. “Scott125”. Each of the speaker’s turns is headed by this name/number code; and other participants in the dialogue are shown addressing him as “Scott” — but “Scott” is not the individual’s real name. The three-digit codes are unique across the CHRISTINE Corpus. The names are sometimes shared by different speakers, as their real names are.
(An alternative would have been to attribute the speaker turns to the five-byte codes used by BNC to identify speakers, e.g. PS546. But this gives the corpus user no easy way to link the individuals who contribute particular turns to their names used vocatively by other dialogue participants. It is far easier to grasp what is going on in a dialogue, if one has naturalistic names to hook the spoken interactions onto; the fact that they are not the actual names of the speakers is scientifically irrelevant.)
Some Christian names of individuals not participating in a dialogue, but who are talked about in it, are also changed, if the comments made about them seem potentially embarrassing, or if the name might involve a special risk of rendering the speakers identifiable.
The noms de corpus are chosen to be metrically equivalent to the real names, and also as far as possible to be socially equivalent. Obviously, male names are replaced by male names and female by female. But, in addition, when a name seems to be associated with a particular age-group, social class, and/or region, it is replaced by a name which feels similar in those respects. When (say) a two-syllable formal name alternates with a one-syllable abbreviation, the replacement name is chosen to preserve the same pattern, and formal name and abbreviation of the replacement name are inserted wherever formal and abbreviated versions of the real name occur, respectively, in the original file. If two participants in a dialogue share the same Christian name, their noms de corpus are also the same (occasionally, the logic of the dialogue depends on this kind of ambiguity of names).
Two kinds of turn in the original BNC files are not attributed to speakers with identified Christian names. In many cases, the transcriber could not decide which speaker produced a particular utterance, and assigned the turn to an “empty” speaker code, usually PS000. (Sometimes, where it is clear that different speakers are involved but neither is identifiable, PS000 and PS001 are used; however, a series of turns all attributed to “PS000” sometimes appears in fact to have been uttered by more than one speaker.[8]) These turns are attributed in CHRISTINE to speakers unid0, unid1 (for PS000, PS001 respectively).
In other cases, the BNC file assigns a “normal” speaker code which is identified by the header as referring to a particular individual with specified characteristics, but no name is included. In those cases, CHRISTINE invents a nom de corpus which seems appropriate in terms of the speaker’s sex, age, etc. (Occasionally, if neither sex nor real name is given, CHRISTINE uses the cover name Anon.)
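For users who process speaker labels automatically, the naming conventions just described can be captured in a few lines. The following is a minimal Python sketch and not part of the Corpus distribution: the function name parse_speaker_label and the layout of the returned tuple are invented for illustration, names are assumed to consist of letters only, and the treatment of Anon assumes only that such a label does not follow the name-plus-three-digits pattern.

```python
import re

# Identified speakers: a nom de corpus followed by a three-digit code, e.g. "Scott125".
# Assumption: names consist of ASCII letters only.
_IDENTIFIED = re.compile(r"([A-Za-z]+)(\d{3})")
# Unidentified speakers: "unid0", "unid1" (standing for BNC PS000, PS001).
_UNIDENTIFIED = re.compile(r"unid(\d)")


def parse_speaker_label(label):
    """Return (kind, name, code) for a speaker label (hypothetical helper).

    kind is "identified", "unidentified", or "other"; "other" covers any
    label, such as a bare cover name, that fits neither pattern.
    """
    m = _UNIDENTIFIED.fullmatch(label)
    if m:
        return ("unidentified", None, m.group(1))
    m = _IDENTIFIED.fullmatch(label)
    if m:
        return ("identified", m.group(1), m.group(2))
    return ("other", label, None)
```

Because the three-digit codes are unique across the Corpus while names may be shared, the code rather than the name is the safer key for looking a speaker up.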
It must be admitted that these procedures cannot offer a watertight guarantee against speaker identification. Someone who was determined to penetrate behind the veil of anonymity provided by CHRISTINE would only have to link its files to the corresponding passages in the original BNC files to discover the names we have concealed. There is nothing we can do about that. But our policy greatly reduces the chance of an accidental betrayal of informants’ confidence. If any of their identities should ever be revealed, it will not be the fault of the CHRISTINE Corpus.
Relevant information about the 147 identified speakers in the CHRISTINE Corpus, adapted from the file-headers of the respective BNC files, is given discursively in the notes files for the separate CHRISTINE texts, and is summarized in computer-tractable form for all the speakers in the Speakers file. Categories such as sex and age in years are self-explanatory, but the dialect and social class categories require some discussion.
One special problem about BNC speaker categorization data relates to the fact that some of the BNC files were created not by the BNC project itself but by a separate project based in Norway, the “Bergen Corpus of London Teenager Language” (“COLT”) project (Stenström & Breivik 1993; cf. BNC Manual, p. 20). COLT material appears to have been used where the BNC/demographic sampling system called for samples fitting its description; but, because COLT was an independent project, it did not collect the same types of information about speakers as the BNC project itself. Users of CHRISTINE will notice that relatively little information about individual speakers is included for those texts which represent young Londoners.
The BNC file headers normally identify speakers’ mother tongue, almost always as British English, and in many cases give rather detailed (though not always very clear) information about speakers’ regional dialects.
The only cases where file-header information seems to mean that speakers have a language other than English as mother tongue are two speakers in text T08 who are identified as having European accents. (Also, the header to the BNC file from which CHRISTINE text T24 is extracted codes all speakers in that file as native speakers of Irish Gaelic, but this is not credible; I take it to be a symptom of the BNC respondent’s nationalist political fervour rather than a serious linguistic description. These speakers include members of more than one family, living in Belfast, and all including a child of three years are shown as speaking fluent English.) Apart from the above, the only cases in CHRISTINE where no information is given about mother tongue are:
Regional dialects are classified by BNC on a system which, with respect to England, seems to be adapted from the classification in Trudgill (1990: 3-5). (The BNC Manual does not explicitly quote Trudgill’s book, so far as I have seen.) Trudgill’s book does not deal with the British nations other than England, and BNC treats Wales, Scotland, and Ireland as three unitary dialect regions coinciding with the respective political units. (BNC makes no distinction between Northern and Southern Irish speakers; and, although all the speech samples were recorded within the UK, the CHRISTINE Corpus includes at least one Irish speaker living in England, who may well have come from the Republic rather than from Northern Ireland — so CHRISTINE likewise uses a single “Irish” category.)
Within England, Trudgill recognizes sixteen dialect areas. BNC describes speakers’ dialects via three-letter codes whose definitions (BNC Manual, pp. 86-7) are too similar to Trudgill’s areas to have been chosen independently, though they are not quite identical. (For instance, BNC uses a code XLO for “London”, which in Trudgill’s system is part of the much larger “Home Counties” area, and it uses a code XLC for “Lancashire”, whereas various parts of Lancashire fall into different areas in Trudgill’s scheme — I infer by elimination that XLC may in reality stand for Trudgill’s “Central Lancashire” area.) The complications in the relationships between the BNC and Trudgill’s dialect classification systems seem to stem partly from the fact that BNC aims wherever possible to use internationally-recognized ISO classifications for geographical regions, and partly from the fact that laymen such as the BNC respondents commonly classify speech-varieties by reference to traditional county names; both of these classification methods relate to political boundaries which are often irrelevant to linguistic realities.
Be that as it may, Trudgill’s classification in any case seems unnecessarily fine-grained for a project like CHRISTINE, which is concerned with grammar rather than with details of pronunciation; and a sixteen-way classification of English dialects is particularly inappropriate when one considers that the recording sites often happened to fall rather close to one or other of Trudgill’s isoglosses, and that BNC respondents had no expertise in classifying speakers who hailed from areas distant from the recording site.
On the other hand, linguistic differences between, say, Northern and Southern England are sufficiently large, in grammar as well as pronunciation, that it would be a pity to ignore the dialect indications in BNC altogether.
CHRISTINE has adopted a compromise strategy, which uses the data in BNC to assign as many English speakers as possible to one of four broad regions corresponding to the second level from the root in Trudgill’s hierarchical classification of modern dialects (Trudgill 1990: Fig. 3.1, p. 65). CHRISTINE uses the terms:
Clearly, this classification can be no more than a broad and vague indication; habits of speech do not change sharply either side of lines drawn through the map of England.
CHRISTINE contains no cases of native speakers of varieties of English from outside the British Isles. However, the complete CHRISTINE Corpus will include some speakers to whom this applies, and consequently the coding system given below includes further classifications:
BNC file headers include three kinds of information about speakers which could broadly be described as social classifications (though in many cases one or more items is missing for a given speaker):
The social-class information is expressed as a code drawn from a four-way classification derived from the Standard Occupational Classification (“SOC”) scheme defined in Office of Population Censuses and Surveys (1990-1).[9]
The SOC scheme assigns occupations to six social classes:
I professional, etc.
II managerial and technical
III skilled occupations, divided into:
IIIN non-manual
IIIM manual
IV partly skilled
V unskilled
The BNC coding (in common with much social research) collapses this into a four-way scheme:
AB I+II, professional, managerial, and technical
C1 IIIN, skilled non-manual
C2 IIIM, skilled manual
DE IV+V, partly skilled and unskilled
In principle, this four-way scheme is at a very suitable level of granularity for use in CHRISTINE. But there are severe problems in practice, which presumably stem from the fact that the data in BNC file headers are only as good as the logs supplied by the non-expert respondents who filled in details about their friends and relatives.
In the first place, many speakers are assigned an “unclassified” code under the social-class heading. But, more worryingly, it not infrequently happens that the social code assigned to a speaker contradicts the statement in the same file header about that speaker’s occupation, despite the fact that the social classification is supposed to be based on occupation. An extreme case is Gillian091 in text T23, who is socially classified DE (partly skilled or unskilled) in BNC, and is described as a doctor by occupation. Any doctor is SOC class I, i.e. AB in terms of the four-way scheme.
Between them, these two problems are sufficiently severe that one might think it best to abandon any attempt to include social-class data in CHRISTINE. That would be very unfortunate: the issue of correlations (or lack of them) between speech patterns and social class is a topic of great interest from many points of view. And the data in BNC, while certainly quite “dirty” in this area, are not so irredeemably flawed as to prevent anything being said.
For CHRISTINE, therefore, I proceeded as follows. Where the BNC file header for a speaker states that speaker’s occupation, I assumed that this statement, being relatively specific and objective, was more likely to be correct than any social-class code shown: so I used the SOC mapping of occupations onto classes in order to assign a social code. (Thus Gillian091’s code was altered from DE to AB.) In the case of married couples, knowing that wives often treat earning as a subsidiary aspect of their role and take lower-level jobs than their background would qualify them for, I assigned the social-class code for the husband also to the wife; and vice versa in occasional cases where a husband was disabled or unemployed, so that the wife was likely to be the main breadwinner. (Note that these procedures were chosen in the light of experience of 1990s British society as it actually is, rather than of politically-motivated theories of how it possibly ought to be.) In the case of schoolchildren or preschool children, I assigned the father’s (or, failing that, the mother’s) code, irrespective of any code shown in BNC for the child. Only where none of these guidelines yielded a class code for an individual did I accept the code given in the BNC file header, if there was one. (Some speakers remained unclassified.)
The notes files for the various texts explain how these guidelines were applied in each specific case in order to derive the code included in the social-class field of the Speakers file. I believe the resulting classification is significantly more informative than omitting any attempt at social classification would have been. At the same time, it should be clearly understood that this aspect of the data is quite imperfect. Social classification is certainly one of the least satisfactory aspects of the information available about BNC speakers.
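The guidelines for deriving a social-class code amount to a chain of fallbacks, which may be easier to follow when written out. The Python sketch below is illustrative only: the function and parameter names are hypothetical, an unclassified speaker is represented by None, and the order of precedence between the spouse rule and the speaker's own occupation is an assumption, since in the Corpus itself each case was decided individually and logged in the notes files.

```python
# Six SOC classes collapsed into the four BNC codes, as listed in this section.
SOC_TO_BNC = {"I": "AB", "II": "AB", "IIIN": "C1", "IIIM": "C2", "IV": "DE", "V": "DE"}


def resolve_social_class(occupation_soc=None, bnc_code=None, spouse_code=None,
                         defer_to_spouse=False, is_child=False,
                         father_code=None, mother_code=None):
    """Return a four-way class code, or None if the speaker stays unclassified.

    All names here are hypothetical; defer_to_spouse stands for the analyst's
    judgement that the spouse is the main breadwinner.
    """
    if is_child:
        # Children take the father's code or, failing that, the mother's,
        # irrespective of any code shown in BNC for the child.
        inherited = father_code or mother_code
        if inherited:
            return inherited
    if defer_to_spouse and spouse_code:
        # Assumed precedence: the spouse rule outranks the speaker's own occupation.
        return spouse_code
    if occupation_soc is not None:
        # A stated occupation outranks the class code in the BNC header.
        return SOC_TO_BNC[occupation_soc]
    # Last resort: the BNC header code, which may itself be missing.
    return bnc_code
```

On this sketch, a speaker coded DE in BNC but described as a doctor (SOC class I) comes out as AB, which is the correction applied to Gillian091.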
The general approach to structural analysis of real-life language samples exemplified in the CHRISTINE Corpus was described in early chapters of EFC, in connexion with the SUSANNE Corpus of written English. Our primary aim has been to refine the analytic scheme (the set of annotation symbols and detailed guidelines for applying them) through conscious consideration of every or almost every awkward case in our samples, so as to uncover hidden ambiguities or gaps in the guidelines and replace them with new explicit decisions. The work is analogous to the way in which the stream of cases arising in a nation’s lawcourts uncovers hidden uncertainties in the legal framework and causes them to be settled through judicial decisions which stand as precedents for the future.
Elsewhere (e.g. Rahman & Sampson forthcoming) we have used the analogy with the discipline of software engineering, developed in response to the “software crisis” of the 1960s-70s caused by premature coding of solutions to inadequately analysed tasks, in order to argue that this kind of detailed logging and classification of linguistic phenomena should be seen as a high priority at the current juncture in natural-language engineering. The carefully hand-crafted nature of treebanks produced in this spirit inevitably means that they are small, relative to some other treebanks now available; but small size is arguably a cost worth paying in exchange for comprehensiveness of the analytic guidelines. As Jane Edwards (1992: 139) has written, “The single most important property of any data base for purposes of computer-assisted research is that similar instances be encoded in predictably similar ways.” Only detailed, explicit analytic guidelines can enable this kind of internal consistency to be maintained within future larger treebanks.
The exercise of adding structural annotations to samples of spontaneous speech involves problems of a higher order than arise in the equivalent work with edited written language. There are of course specific structural features found only in spoken language, which require their own annotation mechanisms; a number of such mechanisms were defined in EFC, chapter 6, and are used in CHRISTINE, and further notational devices developed in the course of the CHRISTINE project are described in §13 below. But, beyond these things, there is a pervasive incidence of indeterminacy in transcriptions of spontaneous speech which is rarely found in written prose. When people’s hasty, unstudied utterances are pinned down in black and white, again and again it is just not clear what they are saying. This is frequently true even when the transcriber has succeeded in distinguishing each of the individual words on the tape (often, of course, transcribers do not succeed in that).
In some cases the reason will be that the speech refers to features of the situation which were not visible to the transcriber (BNC transcribers worked from tape recordings but had only general indications of the locale in which a dialogue was recorded), or that participants in the dialogue were tacitly drawing on their shared knowledge of some matter that is never explicitly mentioned. In other cases the main problem seems to be that people sometimes just speak inconsequentially. Their skill at rapidly assembling word-sequences suitable to express their meaning is imperfect, so they come out with utterances which are uninterpretable (or perhaps they have not formulated a clear meaning in the first place — they are “engaging mouth before putting brain in gear”).
In the CHRISTINE project our task is limited to annotating for linguistic structure, rather than somehow indicating the meanings of the language samples; so this sort of inconsequentiality is not too problematic while it does not affect grammatical coherence. But sometimes it does. What should one make, for instance, of a speaker turn such as the following, uttered by Harold001:
When, when he rang us up it, it was looking at us like for Brian he kept, called in as I was going enjoy T01.02733
Where the transcriber placed commas, evidently the speaker has repeated the previous word or substituted an alternative word; and it is easy to see that the word like is functioning in the colloquial “hedging” usage. But there is no clue in the context about what the it was that was looking at us (or why it was looking at us, whatever it was), or how to interpret for (“looking on Brian’s behalf”, or “because Brian called in”?); and the relationship of the closing word enjoy to what precedes is completely mysterious.
No set of explicit guidelines, however comprehensive, can define a clear, predictable structure for every case like this. The CHRISTINE principle is that, where word-sequences are open to alternative grammatical interpretations, the analyst chooses one, at random if there are no apparent clues (so the CHRISTINE analysis of the above example annotates for Brian on the “looking at us on Brian’s behalf” interpretation); and, while words and phrases are fitted into larger structures whenever they can be, there is no objection to leaving individual words or short phrases as independent elements in the stream of structures, if there is no apparent way to fit them into their verbal environment — so, in the example, the CHRISTINE analysis leaves enjoy as a verb which is not part of any higher structure, following a main clause he kept, called in as I was going.
Each word is at a minimum assigned a wordtag. But, if the word cannot be fitted into a larger structure and alternative wordtags are available for the form, a tag may have to be chosen at random (for instance, off at T11.02554 is tagged RP simply because this is the commonest of the three tags applicable to this wordform).
Contrary to the rule specified in EFC, §6.13, the CHRISTINE Corpus does not group the succession of structures within a single speaker turn under a single root node O. To do that would imply some degree of coherence among the successive daughters of the speaker-turn node: it would suggest that those daughter nodes constituted a “construction” of some type. This is more than our data warrant. In CHRISTINE, a speaker turn consists of a disconnected sequence of one or more grammatical units which, if the speaker is articulate, may all be main clauses with recognizable internal structuring, but which in some cases may be disjointed phrases that would never be allowed to stand in writing as “complete sentences”, or may be individual words as in the case of enjoy, above.
In the example quoted, there was no stretch of speech where the transcriber found the wording indistinguishable. But stretches like that, shown in CHRISTINE as {unclear}, are very frequent, and obviously they create even greater problems for choice of structural annotation. We have evolved explicit guidelines for annotating passages that include {unclear} stretches, and these are discussed in §9.
Members of the CHRISTINE project had no access to the recordings from which the BNC transcribers worked; at present, for confidentiality reasons these are not available to researchers. This means that those of us who provided the structural annotation had no evidence about what was actually said, other than what the BNC transcribers wrote down.
This evidence went slightly beyond the words themselves; BNC was transcribed using ordinary English orthography, including sentence-initial capitals and a variety of punctuation marks, and although these features were in due course stripped off the wording and “hidden” in a subsidiary field of the CHRISTINE text files (because they have no direct spoken significance), at the stage when our grammatical annotations were being created the orthographic features were still present and visible to the analysts — often, the BNC transcriber’s choice of punctuation was helpful in deciding between alternative structural interpretations which would have been equally plausible with respect to the words alone.
However, the possibility obviously arises that the BNC transcribers may sometimes have misheard or misunderstood, and have written down words different from those which the speakers thought they were saying. It is at least logically possible that the kind of inconsequentiality described in §5.2 might be entirely an artefact of the transcription process, and not present at all in the speakers’ actual wording. Perhaps Harold001 did not use the word enjoy but some phonetically-similar form which made good sense in context. I very much doubt that cases of apparent speaker incoherence can all be explained away in this manner, but there is no way that we can be sure. This is very regrettable; it would be extremely worthwhile from several points of view to be able to use this sort of treebank to check how structurally coherent speakers are on average in real life. CHRISTINE does not give us reliable quantitative information about the question. Given the primary goal of making the Corpus representative socially, regionally, etc., no speech samples were available to us that would have enabled us to answer it.
On the other hand, it is unquestionably true that some cases of incoherence in the transcriptions were created by transcribers rather than speakers. At the outset of the project, I had envisaged adopting a principle that the transcriber’s wording should never be “second-guessed”; it might be wrong, but the transcriber had heard the tape, we had not. As I grew familiar with the nature of the BNC transcriptions, it became clear that such a principle would be unreasonable.
Clear evidence of transcriber error occurs in a few cases where the speaker is reading aloud from a published document. For instance, T08.00933 contains a passage read out from St Matthew’s Gospel, 8:11, containing the phrase ... will come and decline at the table ..., where it is easy to check that the transcriber’s decline must have been a mistake for recline. But there are other cases where we have no independent check, yet some phonetically-small adjustment to the transcriber’s wording makes such a large improvement to the sense that it seems morally certain that the transcription is erroneous. For instance, at T29.09621, in a discussion of unsatisfactory child-minders, one speaker is made by the BNC transcriber to say unless you’ve low and detest children or unless you’re out purely and simply for the money ... The first clause seems meaningless, but emending it to unless you loathe and detest ... gives a passage which uses a standard cliché to express a straightforward piece of sense. The phonetic differences involve two voiced fricatives, which are among the least auditorily salient classes of English phoneme; and the BNC dialogues were recorded in non-ideal field conditions. It is overwhelmingly likely that the speaker said you loathe, and the BNC transcriber misheard. (As I understand it, the transcriptions were done under heavy time pressure.)
Therefore, we have allowed ourselves to make cautious emendations to the BNC transcriptions (logging any such emendation in the relevant notes file, with a summary of the justification for it). In general, where a change to the written wording implies no phonetic difference (e.g. where as changed to whereas), we have freely adopted the emendation if it improves the sense. The larger the phonetic difference between the transcribed wording and a proposed emendation, the greater we required the gain in semantic coherence to be before the emendation was adopted; and we tried to err on the side of conservatism. The notes files sometimes mention possible emendations which we regarded as too adventurous to incorporate into the text files.
Emendations to the BNC source material are not limited to alterations of wording. Another type of apparent error in BNC is that, in some dialogues, the assignment of turns to identified participants seems to have become muddled. (The BNC system of distinguishing speakers via opaque sigla such as PS0M4, PS0M5 may have encouraged such confusions.) Sometimes the error becomes apparent because a speaker seems to be addressing himself or herself by name when it is clear that he or she is really talking to someone else. In other cases the content of the wording, taken with information in the BNC file headers about what kind of people the speakers are, makes the existence of a confusion inescapable. For instance, when text T21.01517-8 includes the exchange:
PS0M5: Did you want to have a shower with daddy?
PS13T: Umm yes.
it is startling to learn from the file header that PS13T is a 34-year-old engineer and PS0M5 his 3-year-old son; but a simple permutation of identity codes between the three participants in this dialogue makes good sense of this and many other seemingly bizarre aspects of the conversation.
In this case the evidence is so strong that CHRISTINE has reallocated the speaker identity codes (with an explanation in the file T21.nb); a few other reallocations have been made elsewhere for comparable reasons. There are other cases where the BNC attribution of individual speaker turns looks suspicious, but either the evidence is not strong enough to justify a reallocation, or it is not clear what particular reallocation would be correct, and so the BNC attributions have been left to stand. Allocation of speaker turns is an area where I suspect that CHRISTINE contains a significant incidence of errors (though not, I believe, so many as to make the information about speakers valueless).
A special case of speaker-code reallocation relates to a phenomenon found in a sizeable minority of the files, namely T06, T09, T19, T22, T28, T29, and probably T14. In these files, there are speaker turns which as a whole are transcribed as {unclear} (the transcriber could not distinguish the wording) and are allocated by BNC to unidentified speakers, but in context it often seems clear that the inaudible passage is in fact a segment of an adjacent turn whose speaker is identified. In the texts where this approach seems to have been adopted by the BNC transcriber, we have felt free to make guesses about how best to link these unclear turns with adjacent clear turns. Most of them have been treated as part of the preceding (or, sometimes, the following) turn, and only a minority have been left as separate turns, where it is quite unclear who is speaking and what the turn structure can be (e.g. the unclear turn between T06.00439 and 00440, or the sequence at T06.00457ff.). When an unclear turn has been linked with a clear turn, the adjacent s-unit in the latter has been made to extend over the {unclear} entity, contrary to the structure in the BNC original. No log is kept of such modifications in the respective notes files.
One respect in which we gave ourselves complete freedom to modify the source transcriptions related to the sequencing and structuring of different speakers’ contributions as BNC <s> and <u> units. Much of the BNC material consists of lively conversations in which two or more participants often interrupt one another, talk simultaneously, make brief supportive responses while someone else holds the floor, and so forth. The BNC compilers treated it as a high priority to record the relative timing of various speakers’ contributions, and because they did not themselves equip the material with annotations for grammatical structure they seem not always to have noticed that sequencing different speakers’ wording in accordance with the physical timing does violence to the integrity of individual speakers’ contributions. Although the BNC material was transcribed in the form of orthographic sentences, a yes or mm interjected by a hearer has sometimes led to a speaker’s wording being divided into separate “sentences” in the middle of quite a low-level grammatical tagma.
The aims of the CHRISTINE project have to do with grammar rather than with social turn-taking phenomena. Consequently, whenever a construction produced continuously by one speaker has been split in BNC into separate <u> (utterance) units interrupted by another speaker’s utterance, CHRISTINE reorders the first speaker’s <u> units into one continuous speaker turn. Where one speaker holds the floor for a long time, producing a series of clauses which are interrupted in the middle by hearer responses, CHRISTINE will show the first speaker’s contribution as continuous, followed by the responses which in some cases may physically have occurred much earlier than the point where they appear in CHRISTINE. (If the hearer responses occurred between independent clauses in the first speaker’s contribution, no reordering is done, since the BNC sequencing does not distort any of the grammatical constructions.) The source transcription field (§6.8) includes markers indicating that such reordering has occurred, though users who need full information on relative timing will need to consult the original BNC files.
(It is perhaps fair to point out in this connexion that manual transcriptions like those of the BNC are unlikely to be very accurate with respect to precise relative timing of speech events. There is plenty of psycholinguistic research showing that hearing of relative timing of speech sounds is heavily influenced by the hearer’s understanding of grammatical structure, which interferes with perception of the physical facts. This is one of many reasons why it is unfortunate that the audio data from which the BNC transcriptions were made have not so far been released.)
In BNC, <s> (segment or sentence) units are wholly contained within <u> (utterance) units, implying that when a speaker’s contribution is divided into separate <u> units because of an interruption, there is necessarily also an <s>-unit boundary at the same point. But, in addition, BNC <s>-unit boundaries sometimes seem grammatically arbitrary even in cases where the <s> units are adjacent in BNC. What seems to be a single coherent grammatical construction has sometimes been split by the transcriber into separate “sentences”, or successive disconnected constructions have been grouped into one orthographic sentence. CHRISTINE analysts took complete freedom to group words into tagmas reflecting speakers’ apparent logic, ignoring the BNC <s>-unit boundaries where these clashed.
CHRISTINE does preserve the BNC <s>-unit (“source-unit”) boundaries for the purely practical convenience of having a division of the texts into short numbered chunks which can easily be cross-referred to the corresponding locations in the original BNC files. But the consequence of the analytic approach described above is that CHRISTINE source-unit boundaries have no significance with respect to the tree structure assigned to surrounding wording. A source-unit boundary may occur between adjacent parse-trees, or anywhere in the middle of a tree.
On the other hand, parse-trees are never divided by the higher-level boundaries recognized in CHRISTINE: turn boundaries, and division boundaries. The point of re-ordering the BNC <u> units was to ensure that every grammatical tagma is complete within a single speaker turn.
One specific rule related to this last point is that CHRISTINE grammatical annotation is never allowed to link separate speakers’ wording into a single tagma, even when speakers “complete one another’s sentences” (as people often do). For instance, T01.02825ff. has this passage (shown here in the original BNC sequence):
Jean003: yeah Chris is yeah yeah
Harold001: she is Chris she’s a-coming
Jean003: just with the kids
Harold001: with two kids
At the beginning of the CHRISTINE project, I envisaged that such a passage might be analysed in such a way that just with the kids would be treated as adjunct material within the she’s a-coming clause, and with two kids perhaps tacked on as an appositional element subordinate to with the kids. But it quickly became apparent that this approach would not lead to predictable analyses: it creates too many debatable alternatives, particularly when speaker B completes speaker A’s sentence and then speaker A also completes it in the same or different wording (again a frequent scenario). Accordingly, trees crossing speaker-turn boundaries are forbidden. (This rule, though evolved independently, turned out to coincide with the rule adopted for the Switchboard Corpus (Meteer et al. 1995: §1).) In CHRISTINE, the dialogue above is structured as:
Jean003: yeah [ Chris is yeah yeah | just with the kids ]
Harold001: [ she is Chris ] [ she’s a-coming | with two kids ]
where square brackets enclose main clauses, and the “|” symbol marks source-unit boundaries. (The fact that the physical timing of the words was different from this is marked in the source transcription field.)
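The reordering of BNC <u> units into continuous speaker turns can be illustrated with a short Python sketch. It is a simplification under stated assumptions, not the procedure actually used in compiling the Corpus: each <u> unit is treated as a single source unit, the function name reorder_units is invented, and the boolean continues flag stands in for the analyst's judgement that a unit continues a construction begun in the same speaker's previous unit. The bracketing of main clauses is not modelled.

```python
def reorder_units(units):
    """Merge interrupted utterance units into continuous speaker turns.

    units is a list of (speaker, words, continues) triples in the original
    BNC order; continues is True when the analyst judges that the unit
    carries on a construction from that speaker's previous unit.
    Returns a list of (speaker, text) turns, with " | " marking the
    boundaries between the source units that were joined.
    """
    turns = []       # [speaker, [chunk, ...]] entries in output order
    open_turn = {}   # speaker -> that speaker's most recent turn
    for speaker, words, continues in units:
        if continues and speaker in open_turn:
            # Pull the continuation back into the speaker's earlier turn.
            open_turn[speaker][1].append(words)
        else:
            turn = [speaker, [words]]
            turns.append(turn)
            open_turn[speaker] = turn
    return [(speaker, " | ".join(chunks)) for speaker, chunks in turns]


# The T01.02825ff. passage in its original BNC sequence.
bnc_order = [
    ("Jean003", "yeah Chris is yeah yeah", False),
    ("Harold001", "she is Chris she's a-coming", False),
    ("Jean003", "just with the kids", True),
    ("Harold001", "with two kids", True),
]
christine_order = reorder_units(bnc_order)
```

Applied to the passage above, the sketch yields one turn for Jean003 followed by one for Harold001, each containing two source units, which is the sequence shown in the CHRISTINE structuring.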
CHRISTINE text files use a fixed-field file structure which is intended to be transparent to manual inspection (that is, a non-expert newcomer who scans a file should be able to grasp as much as possible of what is going on in the dialogue), while making it easy, through regularity of structure, to write code to extract information automatically. The file structure is also somewhat similar to that of the SUSANNE Corpus, though differences in the nature of the information recorded unfortunately made it impossible to use identical structures. (The SUSANNE Corpus was already published years before I began to plan the CHRISTINE Corpus.)
Each CHRISTINE text file consists of a sequence of lines terminated by newline characters; each line contains a sequence of fields separated by tabs. Tab and newline, codes 9 and 10, are the only nonprinting characters found in the text files (the space character, code 32, never occurs). Among the ASCII printing characters, i.e. codes 33 (!) to 126 (~), no use is made of the characters:
$ ( ) ; \ ^ ` ~
(codes 36, 40, 41, 59, 92, 94, 96, 126).
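These character constraints lend themselves to a mechanical check. The following is a minimal Python sketch, assuming a line is held as a string without its terminating newline; the name `illegal_characters` is hypothetical and not part of any CHRISTINE tooling.

```python
# The eight printing characters that, per this documentation, are never used.
UNUSED = set("$();\\^`~")


def illegal_characters(line):
    """Return the characters in `line` that the format description rules out.

    `line` is one text-file line without its terminating newline.
    Permitted: tab (code 9) and printing ASCII 33..126, minus UNUSED.
    """
    bad = []
    for ch in line:
        if ch == "\t":
            continue
        if 33 <= ord(ch) <= 126 and ch not in UNUSED:
            continue
        bad.append(ch)
    return bad
```

An empty result means the line is consistent with the stated character inventory; a space, for instance, would be reported.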
As a specimen, here is the initial part of file T02.tx. (As a consequence of Web technology, tabs as field dividers are simulated using HTML entities.)
T02_0003 ===== 011303 m
T02_0006 ----- Gemma006
T02_0009 ..... 00325
T02_0012 0050761 * PPH1 it [S[Ni:s.Ni:s]
T02_0015 0050770 | VBZ +’s [Vzb.Vzb]
T02_0018 0050780 | IIp per [P:e.
T02_0021 0050791 | NNU1c foot .P:e]
T02_0024 0050803 | RTn then [Rsw:c.Rsw:c]S]
T02_0027 0050815 | RRz so [S[Rs:c.Rs:c]
T02_0030 0050825 | PPY you [Ny:s.Ny:s]
T02_0033 0050835 | VMd +’d [Vdc.
T02_0036 0050846 | VH0 have .Vdc]
T02_0039 0050858 | TO to [Ti:z[Vi.Vi]
T02_0042 0000000 y YR # .Ti:z]S]
T02_0045 0050868 c YP {pause} .
T02_0048 0050876 | DDQ what [S?[Dq:o.Dq:o]
T02_0051 0050888 | VD0 do [Vo.Vo]
T02_0054 0050898 | PPY you [Ny:s.Ny:s]
T02_0057 0050909 ? VV0v want [Vr.Vr]S?]
T02_0060 ----- Barbara004
T02_0063 ..... 00326
T02_0066 0050960 * MC eleven [Nu[M.
T02_0069 0000000 y YR # .
T02_0072 0050974 c YP {pause} .
T02_0075 0050982 | MC eleven .
T02_0078 0050996 | IIb by [P.
T02_0081 0051006 | MC eleven [M.
T02_0084 0051020 | CC and [Ns+.
T02_0087 0051031 | AT1 a .
T02_0090 0051041 | NN1c half .Ns+]M]P]M]
T02_0093 0051053 . NNU1c foot .
T02_0096 ..... 00327
T02_0099 0051085 * RGQq how [S?@[Dq:e.
T02_0102 0051096 | DA1 much .Dq:e]
T02_0105 0051108 | VBZ is [Vzb.Vzb]
T02_0108 0051118 ? DD1a that [Ds:s.Ds:s]S?@]Nu]
T02_0111 0051138 c YP {pause} .
T02_0114 0051146 ic YY {unclear} [Y.Y]
T02_0117 0051156 c - {event16} .
The first field of each line is an eight-byte CHRISTINE location code of the form Tnn_nnnn, where Tnn is the name of the text, and nnnn is a four-digit number uniquely identifying the line within the text. Successive line-numbers are guaranteed to increase, but are not in general consecutive; they usually increase in threes, but editing and correction of the files sometimes led to insertion of lines with intermediate numbers.
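The location codes in the first field of the specimen can be taken apart and checked for the guaranteed increasing order. A minimal Python sketch follows; the function names are hypothetical, and the pattern assumes exactly the Tnn_nnnn shape described in this section.

```python
import re

# "T", two digits, underscore, four digits: eight bytes in all.
LOCATION = re.compile(r"^(T\d\d)_(\d{4})$")


def split_location(code):
    """Split a location code into (text name, line number as int)."""
    match = LOCATION.match(code)
    if match is None:
        raise ValueError(f"not a CHRISTINE location code: {code!r}")
    return match.group(1), int(match.group(2))


def strictly_increasing(codes):
    """True if the line numbers of successive codes always increase.

    Gaps are allowed: numbers usually rise in threes but need not.
    """
    numbers = [split_location(code)[1] for code in codes]
    return all(a < b for a, b in zip(numbers, numbers[1:]))
```

Because only increase is guaranteed, the check deliberately does not test for steps of three.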
Lines are divided into two types: header lines, which identify the structuring of the dialogue into units of various levels above the individual words, and word lines, which contain successive spoken words (and certain non-word items, such as identification of “noises off”). In a header line, the second field is composed of five identical punctuation marks (different marks for different categories of header). In a word line, the second field is a seven-digit source location code.
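The second field alone is therefore enough to tell the two line types apart. The sketch below illustrates the test in Python, assuming tab-separated fields as described above; `line_kind` is a hypothetical helper name.

```python
import string


def line_kind(line):
    """Classify a CHRISTINE line as "header" or "word" from its second field."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) < 2:
        raise ValueError("a line needs at least two tab-separated fields")
    second = fields[1]
    # Header: five identical punctuation marks.
    if (len(second) == 5 and len(set(second)) == 1
            and second[0] in string.punctuation):
        return "header"
    # Word line: seven-digit source location code.
    if len(second) == 7 and second.isdigit():
        return "word"
    raise ValueError(f"unrecognized second field: {second!r}")
```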
The types of header line are division headers, turn headers, and source-unit headers.
CHRISTINE header lines mark the beginnings of corpus sections of these three levels. In consequence:
division header
turn header
source-unit header
word
word
word
source-unit header
word
word
source-unit header
word
word
word
word
turn header
source-unit header
word
word
word
source-unit header
word
word
division header
turn header
source-unit header
word
word
word
word
turn header
source-unit header
word
word
...
word
word
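The nesting shown in this schematic can be rebuilt from a flat sequence of lines. The following Python sketch is one possible way to do it, assuming each line has already been reduced to a pair of a kind ("division", "turn", "source-unit" or "word") and some payload; the function name and the dictionary layout are illustrative choices, not a prescribed format.

```python
def nest(items):
    """Group (kind, payload) pairs into divisions > turns > source units > words."""
    divisions = []
    try:
        for kind, payload in items:
            if kind == "division":
                divisions.append({"header": payload, "turns": []})
            elif kind == "turn":
                divisions[-1]["turns"].append({"header": payload, "units": []})
            elif kind == "source-unit":
                turn = divisions[-1]["turns"][-1]
                turn["units"].append({"header": payload, "words": []})
            elif kind == "word":
                unit = divisions[-1]["turns"][-1]["units"][-1]
                unit["words"].append(payload)
            else:
                raise ValueError(f"unknown line kind: {kind!r}")
    except IndexError:
        # A line appeared before the header that should contain it.
        raise ValueError("line occurs outside its enclosing section") from None
    return divisions
```

Note that this grouping reflects the header hierarchy only; parse trees may straddle source-unit boundaries, so the grouping should not be read as a grammatical segmentation.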
A division header has four fields; an example is:
T02_0003 ===== 011303 m
The four fields are:
A turn header has three fields; an example is:
T02_0060 ----- Barbara004
The three fields are:
A source-unit header has three fields; an example is:
T02_0096 ..... 00327
The three fields are:
A word line has six fields; a typical example is:
T02_0108 0051118 ? DD1a that [Ds:s.Ds:s]S?@]Nu]
The six fields are:
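In the order in which they appear on the line, and using the names under which this file discusses them, these are the location code, the source location code, the source transcription field, the wordtag field, the word field, and the parse field. A minimal Python sketch of unpacking a word line into named fields follows; the class and attribute names are hypothetical conveniences.

```python
from typing import NamedTuple


class WordLine(NamedTuple):
    location: str          # e.g. T02_0108
    source_location: str   # seven-digit code, e.g. 0051118
    transcription: str     # source transcription field, "|" as placeholder
    wordtag: str           # e.g. DD1a
    word: str              # e.g. that
    parse: str             # e.g. [Ds:s.Ds:s]S?@]Nu]


def parse_word_line(line):
    """Split one tab-separated word line into its six fields."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) != 6:
        raise ValueError(f"expected 6 fields, found {len(fields)}")
    return WordLine(*fields)
```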
The purpose of this field is to link each word in a CHRISTINE text to its location in the source file from which the text was extracted — in the case of CHRISTINE, from some file in Release 1.0 of the British National Corpus. BNC contains various categories of information which were judged to have little relevance to the aims of the CHRISTINE project and are not preserved in the CHRISTINE Corpus; one example is detailed information about the relative timing of various speakers’ wording, in cases where speakers interrupt one another or speak simultaneously. The source location field is provided in order to enable users who need to do so to check CHRISTINE wording against the original BNC file.
The BNC filename corresponding to a CHRISTINE text is given in the notes file for that text; for instance, CHRISTINE text T02 is extracted from BNC file KB6. The seven-digit source location code within the CHRISTINE text file locates the individual word (in the example, the word that) within the relevant BNC file. Because BNC files are based on an SGML structure rather than on fixed-field records, the location reference which appears in CHRISTINE ignores the internal structure of the BNC file and uses a simple byte count from the beginning of the file.
In BNC, each word uttered is enclosed within an SGML <w> ... </w> tag. The source location code 0051118 means that the character < at the start of the <w> element to the left of the word that is the 51118th byte in the BNC file (counting its initial byte as 1, not 0). Where the contents of a CHRISTINE word field correspond to an “empty” SGML tag in BNC (e.g. an indication of a silent pause or a non-speech noise), the CHRISTINE source location code identifies the opening < of the empty SGML tag.
An exception occurs in cases where an item treated as a single word in BNC is by the rules of the SUSANNE/CHRISTINE annotation scheme split into two or more words on successive lines in CHRISTINE. For instance, BNC (Manual, p. 97ff.) treats various phrases containing multiple orthographic words as single units within a single <w> element — often these correspond to SUSANNE “idioms”, sometimes they do not, but in either case the separate orthographic words appear on separate lines in CHRISTINE. Also, occasionally BNC erroneously runs together items which ought by its own standards to be separate words — for instance, text T21 contains a passage where the BNC source has a “word”:
five.The
produced by leaving out spaces between adjacent orthographic sentences, and in CHRISTINE this form is split into its separate words. In such cases, where words after the first do not have their own <w> tag in the BNC file, the intention was that the CHRISTINE source location code should identify the first character in the BNC file belonging to the respective word. Unfortunately, misunderstandings within the CHRISTINE project meant that this plan was not correctly executed, so that the byte count for such a word will fall within the appropriate BNC <w> element but in some cases will not coincide with the initial character of the relevant word.
Where a CHRISTINE word line represents an analytic item supplied as part of the structural annotation and having no equivalent in the BNC source — for instance, a “ghost” node (EFC, p. 353ff.) or “trace” representing the logical position of a constituent which appears elsewhere in surface structure — the source location code is a sequence of seven zeros.
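The cases distinguished above (an ordinary byte count, a split word whose count may not land on a tag, and an all-zero code) can be handled as in the following Python sketch. It is illustrative only: the BNC file is assumed to be held in memory as a bytes object, and both helper names are hypothetical.

```python
def byte_index(source_location):
    """Convert a seven-digit source location code to a 0-based index.

    The documentation counts the first byte of the BNC file as 1, so the
    Python index is one less. An all-zero code marks an analytic item
    with no BNC counterpart; None is returned for it.
    """
    if len(source_location) != 7 or not source_location.isdigit():
        raise ValueError(f"not a source location code: {source_location!r}")
    count = int(source_location)
    return None if count == 0 else count - 1


def points_at_tag(bnc_bytes, source_location):
    """True if the code addresses a '<' character in the BNC data.

    Expected for words with their own <w> element and for empty tags,
    but not necessarily for later parts of a word that CHRISTINE splits.
    """
    index = byte_index(source_location)
    return index is not None and bnc_bytes[index:index + 1] == b"<"
```

A false result from `points_at_tag` is therefore not in itself an error; it may simply identify a split word of the kind discussed above.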
The source transcription field is used to record various categories of information about the wording as transcribed in the source files (in the case of CHRISTINE, the BNC files) which it is convenient to eliminate from the CHRISTINE word field. The source transcription field always contains a string of one or more characters; in the majority of word lines, to which none of the relevant categories of information applies, the field contains a pipe symbol, “|”, as placeholder.
Source transcription fields not containing the one-byte string “|” contain some combination of one or more of the following elements, in the order given:
& y i I c s t * punctuation-marks
Many of these items are mutually incompatible, but some source transcription fields do contain more than one character.
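The fixed ordering of these elements can be expressed as a regular expression. The Python sketch below is a rough check under stated assumptions: it tests ordering only, not the mutual incompatibilities, and it treats as a punctuation mark any ASCII punctuation character other than the ampersand, asterisk and pipe, which is a guess rather than something this file specifies.

```python
import re
import string

# Assumed punctuation inventory (not given explicitly in the documentation).
_PUNCT = "".join(c for c in string.punctuation if c not in "&*|")

# Either the placeholder "|" alone, or the elements in their stated order.
FIELD = re.compile(
    r"^(?:\||&?y?i?I?c?s?t?\*?[" + re.escape(_PUNCT) + r"]*)$"
)


def well_formed(field):
    """True if a source transcription field respects the stated order."""
    return bool(field) and FIELD.match(field) is not None
```

For example, the value `ic` seen in the T02 specimen passes, while the reversed `ci` does not.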
The meanings of the various items occurring in source transcription fields are as follows:
speaker A: a b c <ptr 1> d e f g <ptr 2>
speaker B: <ptr 1> h i j k l <ptr 2> m n o p
Pointers with the same number are intended to mark the same time point, so the net effect is to show that speaker A’s words d e f g were uttered simultaneously with speaker B’s h i j k l. In CHRISTINE, the word line immediately following a time-pointer entity (provided that there is such a line) has an ampersand in its source transcription field. Thus, in the example, words d, h, and m would be marked with ampersands.[11] The time-pointer entities themselves are suppressed. (In early versions of the CHRISTINE files, which kept the time pointers, they proved to interfere fairly severely with the readability of speakers’ utterances; the use of the ampersand symbol described here seems a reasonable compromise between readability and preservation of simultaneity data, but to find out just which stretches of wording are asserted in BNC to be simultaneous users would need to consult the original files. The CHRISTINE ampersand symbols show that BNC makes such a claim, but the boundaries of the simultaneous stretches cannot be reconstructed from CHRISTINE.)[12]
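The ampersand rule can be illustrated with a small Python sketch. It is a simplification that works on one speaker's tokens at a time, with pointers represented as strings beginning `<ptr`; the function name is hypothetical, and the actual conversion from BNC may have involved further steps not described here.

```python
def mark_after_pointers(tokens):
    """Return (word, marked) pairs, dropping the pointer tokens.

    A word is marked when it immediately follows a time pointer, which is
    where CHRISTINE would place an ampersand in the source transcription
    field. A pointer with no following word marks nothing.
    """
    result = []
    pending = False
    for token in tokens:
        if token.startswith("<ptr"):
            pending = True
        else:
            result.append((token, pending))
            pending = False
    return result
```

Applied to the two speakers of the example, the marked words are d for speaker A and h and m for speaker B, matching the description above.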
The wordtag field contains a code representing the grammatical classification of the word. Wordtags are normally strings of two or more characters, beginning with two capital letters, drawn from the class of wordtags defined in EFC supplemented by some additional wordtags for spoken English (listed in §13.1). (Note that the relatively precise set of wordtags for spoken “discourse items”, defined on pp. 447-8 of EFC, is used in place of the generic “interjection” wordtag UH defined on p. 118 of that book; UH does not occur in the CHRISTINE Corpus.) The only lines whose wordtag field contains something other than a string beginning with two capitals are those representing non-speech “events” (§8.5), where the wordtag field contains a hyphen.
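A shape check for this field might look like the Python sketch below. It tests only the surface pattern described here (two capitals followed by anything, or a lone hyphen) and does not validate tags against the EFC inventory; the function name and return labels are hypothetical.

```python
import re

# Two capital letters, then any further non-space characters.
WORDTAG = re.compile(r"^[A-Z]{2}\S*$")


def wordtag_kind(tag):
    """Classify a wordtag field by its surface shape."""
    if tag == "-":
        return "event"
    if WORDTAG.match(tag):
        return "wordtag"
    return "unexpected"
```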
A word field contains a sequence of one or more characters; these sequences fall into three classes, distinguished by the contents of the source transcription field on the relevant line:
As in the SUSANNE Corpus, enclitics such as those at the end of the words won’t, she’d, are treated as separate words on lines of their own. The Germanic genitive suffix as in John’s book is treated as an enclitic for purposes of word division. In the few cases where a form ending phonetically in a sibilant must in context be seen as a regular genitive plural, as in both girls’ books, the apostrophe alone is split from the stem and treated as a separate word on its own line. (In the context of speech this is an odd procedure, since this “word” never has any phonetic substance at all, but it is the logical consequence of the preceding rules which in most cases give sensible and convenient results.)
Again as in SUSANNE, whenever the contents of a word field would in ordinary English orthography follow the contents of the preceding word field immediately, without an intervening space or spaces, a plus sign is prefixed to the later word field. Thus the word won’t is divided between two lines as
wo
+n’t
(note that nothing in the earlier word field marks wo as something other than an independent word); and gotta as a reduced form of got to is represented in CHRISTINE as
got
+ta
The CHRISTINE words are tagged and otherwise analysed like their unreduced equivalents: wo, +n’t, and +ta are given the same wordtags as will, not, and to respectively.
(The other main use of the plus sign in word fields of the SUSANNE Corpus, in connexion with punctuation marks, is not relevant to the CHRISTINE Corpus. When BNC transcriptions contain words followed by punctuation marks, the punctuation is moved into a different field in CHRISTINE as not part of the spoken material. Punctuation marks attached to the beginnings of words, such as left bracket or opening inverted commas, do not occur in the speech transcriptions.)
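The plus-sign convention makes it straightforward to turn a run of word fields back into ordinary spaced orthography. A minimal Python sketch, with a hypothetical function name and plain ASCII apostrophes in the usage examples:

```python
def join_words(words):
    """Rebuild running text from successive word fields.

    A leading plus sign means "no space before this item"; every other
    word is separated from its predecessor by a single space.
    """
    text = ""
    for word in words:
        if word.startswith("+"):
            text += word[1:]
        elif text:
            text += " " + word
        else:
            text = word
    return text
```

For instance, `join_words(["wo", "+n't"])` gives `won't`, and `join_words(["got", "+ta"])` gives `gotta`.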
On the treatment of hyphenated forms, see §7.12.
Parse fields in successive CHRISTINE word lines define a labelled tree structure over the corresponding sequence of word_wordtag pairs, considered as leaves of the tree, in the same manner as in the SUSANNE Corpus.
A parse tree for a sequence of words is represented as a labelled bracketing, with labels always repeated in full inside each of paired brackets (immediately following an opening square bracket, and immediately preceding a closing square bracket), and with no spacing between adjacent bracket/label strings (the label of the first opening bracket is immediately followed by the second opening bracket, and so on). The character string for an entire tree (“tree string”) is divided between the parse fields of successive word lines, in a way that is rather cumbersome to define in words but which is natural and easily grasped from an example: cf. the sample from T02.tx shown earlier in this section.
In every word line, the parse field contains a full stop, representing the word_wordtag pair. To the left of the stop is shown the maximal subsegment of the tree string which consists entirely of labelled opening brackets such that the last one represents the node whose first daughter is the word_wordtag pair in question. To the right of the stop is shown the maximal subsegment of the tree string which consists entirely of labelled closing brackets such that the first one represents the node whose last daughter is the word_wordtag pair in question. It follows that a word which occurs medially within the tagma immediately dominating it will have a parse field consisting just of a full stop character.
Referring back to the T02.tx sample: Gemma006’s turn begins with a tree whose root is labelled S and has four daughter nodes labelled Ni:s, Vzb, P:e, and Rsw:c respectively. The first and second of these nodes, labelled Ni:s and Vzb, each immediately dominates a single leaf node, it_PPH1 and the enclitic +’s_VBZ respectively. The P:e node has the two leaves per_IIp and foot_NNU1c as daughters. Gemma006’s second tree again has an S root with four daughter nodes, labelled Rs:c, Ny:s, Vdc, and Ti:z, and the last of these has daughters of which the first is itself nonterminal, labelled Vi (and the second is an analytic element, #, indicating the fact that the Ti:z tagma is incomplete). This is followed by a degenerate “tree” in which a silent pause, {pause}_YP, is both root and sole terminal node; and the turn finishes with a tree having a root labelled S? and four daughters each of which immediately dominates a leaf.
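The reading of parse fields just walked through can be mechanized with a stack. The Python sketch below is one possible reconstruction, not the project's own software: it assumes that node labels never contain square brackets or full stops, represents a nonterminal as a (label, children) pair and a leaf as a word_wordtag string, and uses a hypothetical function name.

```python
import re

OPEN = re.compile(r"\[([^\[\].]+)")    # "[" followed by a label
CLOSE = re.compile(r"([^\[\].]+)\]")   # a label followed by "]"


def build_trees(rows):
    """Build trees from (word, wordtag, parse_field) triples."""
    roots, stack = [], []
    for word, tag, parse in rows:
        opens, stop, closes = parse.partition(".")
        if stop != ".":
            raise ValueError(f"parse field lacks a full stop: {parse!r}")
        for label in OPEN.findall(opens):
            stack.append((label, []))          # start a new node
        leaf = f"{word}_{tag}"
        (stack[-1][1] if stack else roots).append(leaf)
        for label in CLOSE.findall(closes):
            node = stack.pop()                 # finish the innermost node
            if node[0] != label:
                raise ValueError(f"closing {label!r} does not match {node[0]!r}")
            (stack[-1][1] if stack else roots).append(node)
    if stack:
        raise ValueError("unclosed brackets at end of input")
    return roots
```

Run on the five word lines of Gemma006's first tree, this yields a single S root with daughters Ni:s, Vzb, P:e and Rsw:c; a line such as the silent pause, whose parse field is a bare full stop outside any bracket, comes out as a one-leaf root.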
A parse tree is always complete within a single speaker turn (and therefore, a fortiori, within a single text division); in other words, turn header and division header lines never interrupt a parse tree. Source-units, on the other hand, are segments of the BNC transcriptions which are preserved in CHRISTINE for reference purposes but do not necessarily correspond to any linguistic realities. Therefore source-unit header lines may, and often do, occur medially within parse trees.
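This constraint can be verified by tracking bracket depth. In the Python sketch below, each line is assumed to have been reduced to a pair of a kind ("division", "turn", "source-unit" or "word") and, for word lines, the parse field; the names are hypothetical, and the check looks at bracket counts only, not at label agreement.

```python
def depth_change(parse):
    """Opening brackets before the full stop minus closing brackets after it."""
    opens, _, closes = parse.partition(".")
    return opens.count("[") - closes.count("]")


def trees_respect_turns(items):
    """True if no turn or division header falls inside an open tree.

    Source-unit headers are ignored, since they may occur tree-medially.
    """
    depth = 0
    for kind, parse in items:
        if kind in ("turn", "division"):
            if depth != 0:
                return False
        elif kind == "word":
            depth += depth_change(parse)
            if depth < 0:
                return False
    return depth == 0
```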
Definitions of the meanings of the bracket labels, S, Ni:s, etc., are outside the purview of the present document. The bulk of the labelling scheme is defined in great detail in EFC. Much of that book deals with notation that is equally applicable to written or spoken English; its Chapter 6 describes features of the scheme applying particularly to speech. The present documentation file, in §13-14, does list and discuss additional speech annotation symbols and guidelines which have proved necessary in the light of experience with the CHRISTINE project, but those sections are written on the assumption that readers are familiar with the contents of EFC.
I begin this section by identifying the system of phonetic transcription used, because later subsections of the present document include such transcriptions.
Phonetic transcriptions are shown (in this documentation file, and in CHRISTINE text files) as character-sequences enclosed in square brackets, using the SAM-PA broad phonetic notation for English, as defined in Gibbon et al. (1997: 699ff.); except that, for reasons discussed in Sampson (forthcoming), the ampersand rather than opening curly bracket symbol is used for the pat vowel. (The SAM-PA system assigns the ampersand symbol to a slightly different vowel which does not occur in English.)
Linguists for whom the “emic/etic” distinction is important might prefer to enclose broad phonetic notation such as that of the SAM-PA system within slashes (solidi) rather than square brackets. For the CHRISTINE Corpus, the distinction has little significance, and we use square brackets with phonetic notation in all cases.
(CHRISTINE contains only one instance of a form represented by phonetic transcription, but the full CHRISTINE Corpus will include many more cases in its text files.)
Different BNC speech files were transcribed by various workers, and it does not appear from the finished BNC that strong copy-editing standards were imposed, either via instructions to the transcribers or via post-editing. There is considerable orthographic variation, including quite a number of straightforward spelling mistakes.
For computer processing it is desirable that orthographic details should be standardized wherever possible, even in cases where the norms of English permit variation between alternative forms. CHRISTINE treats the orthographic usage of the Concise Oxford Dictionary (8th edition, 1990) as standard. Spellings in the original BNC files which deviate from COD usage are changed in CHRISTINE word fields to agree with COD. Where COD lists alternative spellings (e.g. gaol, jail), CHRISTINE uses the one shown as primary in COD. The -ize form of the -ize/-ise suffix is used.[13]
Wherever orthographic forms in CHRISTINE deviate, for this or other reasons, from the form in the original BNC file, a note of the difference is included in the relevant notes file.
In some particularly common cases, changes to the orthography of the BNC transcriptions are made without logging them in the notes files:
In some cases, BNC transcriptions include unusual spellings of Christian names. In recent years, it has become fashionable for unusual spellings of traditional Christian names (particularly girls’ names) to be used as individuals’ official names. However, there is no way that a BNC transcriber could have known that a particular individual mentioned in a dialogue, but not a participant in it, spelled his or her name in a special way; so CHRISTINE replaces such spellings with standard spellings. (Where a person shown in BNC with an unusually-spelled name was a dialogue participant, his or her true name is changed in CHRISTINE anyway, for anonymization purposes.)
The issue of orthographic standardization relates only to standardizing the written representation of whatever words were uttered by speakers. Speakers’ words are not themselves changed, when their usage deviates from the standard. An idiosyncratic written form which seems to represent a purely phonological dialect variation is replaced by standard spelling: for instance, in text T39 the BNC original has the form wents, apparently representing a Liverpool pronunciation of went, and this is changed to went in CHRISTINE file T39.tx (with a record of the change in T39.nb). It is very unlikely that a BNC transcriber who recorded an occasional nonstandard pronunciation in this manner would do so consistently. But when the speaker’s words themselves, or the structure in which the words are arranged, differ from standard English, CHRISTINE records what was actually said, not what would be said in the standard language. A Northern speaker’s nowt for nothing appears as nowt in CHRISTINE; he done it as a nonstandard equivalent of he did it appears as he done it. This type of usage variation, which is lexical or grammatical rather than phonological, is part of the subject-matter of the CHRISTINE enterprise, to be preserved for study rather than discarded; and non-expert transcribers would normally be well able to reproduce it consistently.
An intermediate case is where a spelling is standardly used to represent some special pronunciation. It is not “idiosyncratic” for a transcriber occasionally to write ’e for an H-less pronunciation of he, or an’ for a reduced form of and. Logically, it might have been appropriate to change ’e and an’ to he and and in CHRISTINE (again there is unlikely to have been any consistency in transcribers’ use of the deviant orthography), but this was not done; where arguments are evenly balanced, it seemed best not to alter the material in our sources.
The spelling cos deserves special mention. The standard-English word because is often reduced to a monosyllable which novelists, etc., frequently show as cos, and many examples of this form appear in our source material. There is an argument for saying that colloquial cos and standard because, while undoubtedly sharing a common origin, should be regarded as separate, grammatically distinct words in current spoken usage. Because in standard English is a subordinating conjunction; colloquial cos very often (impressionistically, more often than any other subordinating conjunction) begins a clause which stands alone, or which displays only a vague logical relationship with what precedes (as if cos were a co-ordinating conjunction like and), rather than being used to express the precise causative or inferential relationship expressed by because in written English.
The CHRISTINE Corpus preserves the orthographic form cos where it occurs in the sources, rather than standardizing it to because, but in other respects the word is treated as identical to because: a clause beginning with cos is analysed as an adverbial subordinate clause (Fa), even if it is used as an independent statement. This arguably is a distortion of the structural realities of the spoken language.[14]
Another nonstandard orthographic practice proved controversial. There are many cases in the BNC transcriptions where perfective forms are written with of in place of have, e.g. could of been. This is an orthographic deviation of long standing in English; Caldwell (1998) quotes it as used by the American writer Booth Tarkington in his 1914 novel Penrod as a device to suggest lack of interest in education. My assumption was that this should be classified as a spelling mistake on the transcribers’ part; when unstressed, both have and of are regularly reduced to the pronunciation [@v], and some people choose the wrong spelling for this pronunciation in the relevant context. Consequently, the policy adopted in CHRISTINE is to change sequences such as could of been to could’ve been. But one of my researchers who has an English-teaching background urged strongly that this policy is misguided, because for many speakers the word really is of rather than have, so that a sequence like could of been should be seen not as a spelling mistake but as grammatical deviance. Presumably this would imply that the speakers in question would sometimes produce a full vowel, [Qv], in such a sequence.
Even if that is so, nothing could guarantee that BNC transcribers wrote of in just those cases where the respective speakers thought of the word as of (in most if not all cases the sound on the tape will have had an obscure vowel). But I record the disagreement here, because it seems a matter of some linguistic interest. On the face of it, the pervasiveness of of for have in the writing of perfective forms is surprising, because the logic of English verb groups might seem to make it obvious that [@v] in this context stands for have (no-one, surely, would ask Of you seen this? instead of Have you seen this?). The notes files show which cases of +’ve in CHRISTINE correspond to of in BNC.
There is variation in the BNC transcriptions between the use of s-apostrophe and plain plural forms in the construction (ten) pounds’ worth/pounds worth. CHRISTINE uses the s-apostrophe form, as in standard orthography, and analyses the wording before worth as a genitive phrase. (Note however that one speaker uses the phrase seventy-two pound worth, T13.00974; no attempt is made to represent this as a genitive construction.)
With respect to initial capitals, the contents of CHRISTINE word fields are intended to display words as citation forms, not as they would appear in a running text. Thus the name London, or the pronoun I, appear with capitals in CHRISTINE, because these words are intrinsically capitalized. On the other hand, if an utterance begins with the article the, this will be shown in the CHRISTINE wordfield as the even though an ordinary tra