The LUCY Corpus: Documentation

Geoffrey Sampson

Department of Informatics
University of Sussex
Falmer
Brighton BN1 9QH
England

Release 2, 9 Dec 2005

Contents

  • Version information
  • 1. Introduction
  • 2. Rights-free versus unreduced editions
  • 3. Corpus contents
  • 3.1 Polished texts
  • 3.1.1 Choice of individual samples
  • 3.1.2 Choice of LUCY extracts within BNC files
  • 3.2 Young Adult writing
  • 3.2.1 A-Level General Studies scripts
  • 3.2.2 Access-course coursework
  • 3.2.3 First-year undergraduate essays
  • 3.3 Child writing
  • 4. File structure
  • 4.1 Location codes
  • 4.2 Status Field
  • 4.3 Wordtag Field
  • 4.4 Word Field
  • 4.5 Parse Field
  • 4.6 The new EQ wordtag
  • 5. Transcription conventions
  • 5.1 Special character symbols
  • 5.2 Orthographic shift and text division symbols
  • 6. Treatment of linguistic errors
  • 6.1 Spelling mistakes
  • 6.2 Apostrophe errors
  • 6.3 Misprints
  • 6.4 Stylistic infelicities
  • 6.5 Incoherent but grammatical wording
  • 6.6 Ungrammatical structures
  • 6.6.1 Omitted words
  • 6.6.2 Repeated or near-repeated words
  • 6.6.3 Words taking the place of grammatically-distinct words
  • 6.6.4 Other structural errors
  • 6.7 Limiting the section of tree affected by error
  • 6.8 Multiple error annotations
  • 7. References
    Version information

    Release 1 of the LUCY Corpus was circulated on 27 Oct 2003.

    Release 2 differs from Release 1 in two main respects: it corrects a number of errors in the earlier release, and it changes the filename conventions.

    Release 2 corrects about sixty errors found in the initial release, almost all of which related to omission of closing labelled brackets balancing opening brackets in the parse fields of various files – in most cases, the missing bracket was a top-level bracket identifying the end of a paragraph, O]. In locating and correcting these errors, one or two other parse-field errors were noticed and corrected.

    Since correctly-balanced bracketing can be checked automatically, those errors should not have appeared in a published version of the Corpus, and I apologize to any researchers who may have been inconvenienced by them. With the previous annotated language corpora for which I have been responsible, many months after the formal end of the respective project were spent on systematically searching for and eliminating errors in project output, before any version was released to the public. On this occasion however the sponsoring organization insisted on rapid publication, so the Corpus had to be released without adequate checking; by the time I was next able to return to this research, I had unfortunately lost track of what checks had and had not been applied.
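    A check of the kind referred to here can be sketched in a few lines of Python. This is only an illustration, not the procedure actually applied to the Corpus: it assumes that the parse-field strings of a file have already been extracted into a list (the field layout is described in sec. 4), the function name bracket_imbalance is hypothetical, and the example strings in the usage note are invented rather than taken from a LUCY file. The sketch counts square brackets only; it does not verify that the labels on opening and closing brackets match.

```python
def bracket_imbalance(parse_fields):
    """Check square-bracket balance over a sequence of parse-field strings.

    Returns (final_depth, first_bad_index):
      final_depth      net number of opening brackets left unclosed
                       (negative if there are surplus closing brackets);
      first_bad_index  index of the first field in which a closing bracket
                       occurs with nothing open, or None if that never happens.
    """
    depth = 0
    first_bad_index = None
    for i, field in enumerate(parse_fields):
        for ch in field:
            if ch == "[":
                depth += 1
            elif ch == "]":
                depth -= 1
                if depth < 0 and first_bad_index is None:
                    first_bad_index = i
    return depth, first_bad_index
```

    For the invented sequence ["[O[S[N.", ".N]", ".S]"], in which the paragraph-final O] is missing, the function reports one bracket left open; a result of (0, None) indicates balanced bracketing.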

    In Release 1, text files and notes files were distinguished by filename suffixes after stops: .tx and .nb respectively. The LUCY Corpus has emerged from a Unix computing environment in which filename "extensions" have no technical significance, so it was reasonable to use ad hoc conventions for this purpose. However, increasing numbers of users work in very different computing environments where nonstandard filename extensions create real difficulties; so it seems better to change the conventions. In Release 2, text files have names consisting just of a letter followed by two digits, and corresponding notes files have the same name followed by a lower-case n, without an intervening stop.
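    For users who hold Release 1 files and want to rename them, the change of convention might be expressed as follows. This is a sketch under an assumption: the documentation states only the suffixes .tx and .nb, so the pattern "letter, two digits, stop, suffix" for Release 1 names (e.g. B01.tx) is a guess, and the function name release2_name is hypothetical.

```python
import re

# ASSUMPTION: Release 1 names look like "B01.tx" (text) and "B01.nb" (notes).
# Only the two suffixes themselves are stated in the documentation.
_RELEASE1_NAME = re.compile(r"^([A-Z]\d{2})\.(tx|nb)$")


def release2_name(release1_name):
    """Map an assumed Release 1 filename to its Release 2 equivalent."""
    match = _RELEASE1_NAME.match(release1_name)
    if match is None:
        raise ValueError(f"unrecognized Release 1 filename: {release1_name!r}")
    base, suffix = match.groups()
    # Text files keep the bare name; notes files gain a lower-case n.
    return base + ("n" if suffix == "nb" else "")
```

    Under that assumption, release2_name("B01.tx") gives "B01" and release2_name("B01.nb") gives "B01n".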

    I have also taken the opportunity to make two small editorial modifications to sec. 3.3 of this documentation file, including a revised identification of school-type for two of the source schools for the child writing.

    As with my other published electronic resources, anyone finding further errors is warmly invited to notify me of these (my e-mail address is grs2 followed by at-sign followed by sussex.ac.uk); any such help will be acknowledged by name in updated later releases.

    1. Introduction

    The LUCY Corpus is an electronic sample of modern written English produced in the UK by a spectrum of writers ranging from skilled published authors to young children, equipped with detailed annotation identifying grammatical and other linguistic structure. Compilation of the LUCY Corpus was sponsored by the Economic and Social Research Council (UK), under grant R000 238146, 2000-03, and was carried out at the University of Sussex.

    The "rights-free" edition of the LUCY Corpus (see sec. 2 below) is available for downloading by anonymous ftp. Anyone anywhere in the world is free to take a copy and use it for any purpose. In the case of uses which lead to publications or commercial products, it would be a friendly gesture to ensure that any documentation acknowledges the contributions of ESRC as research sponsor and the University of Sussex as research organization.

    Instructions for accessing LUCY will always be available via a link labelled "downloadable research resources" from the present writer's home page at www.grsampson.net - this domain will be maintained indefinitely, short of circumstances outside the writer's control.

    The LUCY structural annotations use the annotation scheme of Sampson (1995), as modified by Sampson (2000: secs. 13-15), which is widely regarded as the most detailed and rigorously-defined scheme extant. (Comments include: "the detail ... is unrivalled" (Langendoen 1997: 600); "Compared with other possible alternatives such as the Penn Treebank ... [t]he SUSANNE corpus puts more emphasis on precision and consistency" (Lin 2003: 321).) LUCY is thus a further "sister" in the family which also comprises the SUSANNE Corpus (of published American English, first circulated 1992), and the CHRISTINE Corpus (of British speech, first circulated 1999), available by the same route just quoted for LUCY. Because LUCY includes substantial sections of the written output of young and unskilled writers, the annotation scheme inherited from those earlier corpus-annotation projects has been supplemented by additional notations to indicate what has happened in passages where written forms are linguistically mangled. This supplementary notation scheme has no precedent, to the LUCY team's knowledge, and we hope that the definition of the scheme, in sec. 6 below, will itself be a useful research contribution, in addition to the annotated texts that make use of it.

    Like SUSANNE and CHRISTINE, the LUCY Corpus is potentially a resource for studying how the English language is used in real life by the generations now alive. But, also, LUCY has the potential to support study of the process through which English-speaking children acquire writing skills. Early research drawing on LUCY (Sampson 2003) has already yielded findings, including some that seem surprising, about the trajectory which children typically follow in moving beyond spoken fluency to written literacy. We hope that through this type of research LUCY will provide helpful inputs for the teaching profession.

    This initial LUCY release is likely to contain errors. (Less pre-release polishing has been done than I would have preferred, owing to pressure from the sponsoring agency to publish without delay.) Users are cordially invited to inform me of any errors they find, at an e-mail address that I shall state in a roundabout way to discourage spammer address harvesting: grs2 followed by at-sign followed by sussex.ac.uk - any such help will be publicly acknowledged when corrections are made to a later release.

    The team who worked on the LUCY project under the present writer's direction included Anna Babarczy, Alan Morris, and (briefly) Anna Rahman. I should like to record my gratitude for their contributions.

    The Corpus is named after St Lucy, who is among other things a patron saint of authors, and so a suitable figure to be commemorated by a research resource concerned with writing and learning to write.

    2. Rights-free versus unreduced editions

    More than half of the LUCY Corpus (files with names beginning B... or C..., based on texts extracted from the British National Corpus) comprises samples of prose whose copyright is outside our control. It is not legally possible to make this material freely available for downloading by all comers. The following solution has been adopted.

    The LUCY Corpus has been prepared in two editions, "rights-free" and "unreduced". The only difference is that, for those files where copyright is an issue, the words of the original texts are replaced in the rights-free edition by abbreviations. Any characters from the fifth to the next-to-last alphanumeric character, inclusive, in a word are replaced by a single underline symbol. (The latter character occurs nowhere else in LUCY text files.) Thus "copyright" becomes "copy_t", "replaced" becomes "repl_d", "mother's" becomes "moth_'s", but "those" or "and" are unaffected. It is the rights-free edition which is available by anonymous ftp. The characters which are retained are enough to allow a human reader to grasp what is going on in terms of the linguistic structure of a text, but it is not possible to restore the missing characters algorithmically, hence there is no breach of copyright.
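    The abbreviation rule can be illustrated with a short Python function. This is a sketch written from the description above, not the program used to produce the rights-free edition; the name reduce_word is hypothetical, and the handling of a non-alphanumeric character falling inside the replaced span (it is dropped along with the rest) is an assumption that the description does not settle.

```python
def reduce_word(word):
    """Abbreviate a word as in the rights-free edition (illustrative sketch)."""
    # Positions of the alphanumeric characters within the word.
    alnum_positions = [i for i, ch in enumerate(word) if ch.isalnum()]
    if len(alnum_positions) < 6:
        # With fewer than six alphanumerics, the span from the fifth to the
        # next-to-last alphanumeric is empty, so the word is left unchanged.
        return word
    start = alnum_positions[4]    # fifth alphanumeric character
    end = alnum_positions[-2]     # next-to-last alphanumeric character
    # ASSUMPTION: everything from start to end inclusive is replaced,
    # including any non-alphanumeric character inside that span.
    return word[:start] + "_" + word[end + 1:]
```

    Applied to the examples in the text, the function yields "copy_t", "repl_d", and "moth_'s", and leaves "those" and "and" unchanged.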

    Researchers who wish to use the unreduced edition will be provided with one, if they show that their organization has a copy of the British National Corpus (so that no copyright is violated by providing an extra copy of a subset of the same material). Anyone wishing to take advantage of this offer should send to me, by post to the address at the head of this document, suitable documentary evidence (such as a photocopy of the legal agreement signed when the BNC was obtained), together with an address to which I can upload the unreduced edition by anonymous ftp.

    3. Corpus contents

    Apart from this documentation file, the LUCY Corpus comprises 239 text files, each representing a sample of written English arranged in a consistent format in which successive lines of the file display successive words of the sample, together with annotation data, in a fixed set of fields; and 239 notes files giving information about the corresponding text files, such as document source, or misprints corrected by the compilers. Pairs of text and notes files have names which are identical except that the notes file has a suffixed n, e.g. B01 and B01n.
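    The naming convention lends itself to a simple consistency check on a downloaded copy. The following Python sketch parses Release 2 filenames and reports any text file lacking a notes file, or vice versa. The genre letters are those given in the breakdown later in this section; the function names parse_name and unpaired are hypothetical, and the sketch assumes it is given a plain list of filenames with the documentation file excluded.

```python
import re

# Genre letters as listed in the breakdown in sec. 3.
GENRES = {
    "B": "Polished, informative",
    "C": "Polished, imaginative",
    "E": "Young Adult writing",
    "F": "Child writing, 12-year-olds",
    "H": "Child writing, 11-year-olds",
    "K": "Child writing, 10-year-olds",
    "M": "Child writing, 9-year-olds",
}

_NAME = re.compile(r"^([A-Z])(\d{2})(n?)$")


def parse_name(filename):
    """Split a Release 2 filename into (genre letter, number, is_notes_file)."""
    match = _NAME.match(filename)
    if match is None or match.group(1) not in GENRES:
        raise ValueError(f"not a LUCY Release 2 filename: {filename!r}")
    return match.group(1), int(match.group(2)), match.group(3) == "n"


def unpaired(filenames):
    """Return the sorted base names that lack a text file or a notes file."""
    texts, notes = set(), set()
    for name in filenames:
        letter, number, is_notes = parse_name(name)
        (notes if is_notes else texts).add(f"{letter}{number:02d}")
    # Symmetric difference: names present in one set but not the other.
    return sorted(texts ^ notes)
```

    For a complete copy of the Corpus, unpaired should return an empty list.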

    The field format of the text files is described in detail in sec. 4 below.

    Because punctuation marks are treated as separate "words", and (for instance) by the rules of Sampson (1995) hyphenated words are in many cases split across successive lines, counts of file lines do not give figures that precisely match the numbers which would be produced by manually counting words in a LUCY file. Nevertheless, the following statistics use counts of lines, rounded to the nearest thousand, as proxies for word counts, since counting lines automatically is so much easier than counting words manually.

    The LUCY Corpus as a whole comprises 165,000 "words" (file lines). The size of individual files is, necessarily, very different as between different written genres. The samples of published prose are excerpts each chosen to be at least 2000 words long, comparable to the files of the SUSANNE and CHRISTINE Corpora. In the case of primary-school children's writing, texts as long as that are obviously unobtainable. The parts of the LUCY Corpus which represent young writers contain relatively large numbers of short files.

    Each of the 239 LUCY text files has a filename consisting of one capital letter representing a written genre, followed by two digits. (The genre letters were chosen in such a way as to avoid clashes with filenames in the SUSANNE and CHRISTINE Corpora. Numbering of texts is not continuous: for instance, between texts B06 and B08 there is no B07, because the texts were numbered before the final contents of the corpus were fixed.) The breakdown is as follows:

  • "Polished" writing: 41 files, 102,000 words
      • B, informative: 34 files, 84,000 words
      • C, imaginative: 7 files, 17,000 words
  • Young Adult writing, E: 48 files, 33,000 words
  • Child writing: 150 files, 30,000 words
      • F, 12-year-olds: 37 files, 8000 words
      • H, 11-year-olds: 36 files, 7000 words
      • K, 10-year-olds: 29 files, 6000 words
      • M, 9-year-olds: 48 files, 9000 words

    In more detail:

    3.1 Polished texts

    The Polished texts are taken from the Written section of the British National Corpus ("BNC/written"), with publication dates in the 1980s and early 1990s. The various files in BNC/written range from chapters taken from books of high literary or academic merit, to ephemeral items such as private letters to friends, e-mails, or small-circulation local newsletters or magazines. Documents at different points on this spectrum differ considerably in the extent to which effort has been expended (by authors and/or copy-editors) to make them conform to the conventions of correct written English usage. We imposed a rough two-way distinction between "Polished" and "Informal" BNC/written texts. The default classification was "polished", but an item was classified as "informal" if some of the following properties applied:

    Unavoidably, we had to exercise discretion in applying these criteria and balancing them against one another, so the classification is not totally objective. For instance, an oil industry company's annual report might be seen as addressed to a specialized audience and relevant only for a limited time, but for PR reasons such a document would normally be produced to a high standard (and I noticed no substandard usages), so I classified it as "polished". On the other hand an advertising flyer for a roofing system was classified as "informal", having an even shorter expected life than a company report, though the flyer in question also contained no noticeable substandard usage. A document about educational theory was published by the Arkleton Trust, Dumfriesshire, apparently a fairly small-scale organization concerned with rural development issues (at the time of writing it appears to have no website), and two spelling mistakes were spotted in the BNC extract; but it seemed to be a serious contribution to a weighty topic and was classified as "polished".

    LUCY category B and C samples are all drawn from BNC/written files that were classified as "polished" in these terms. Initially, the plan was to balance this material with another group of informal BNC extracts. As corpus compilation proceeded, though, increasing the representation of child writing seemed a more valuable way to use limited research resources. The aim of the writing-skills training offered to children and young people is to help them approximate the norms of conventionally correct English prose. The BNC/written texts which we categorized as Informal deviate from that target in ways that are too diverse to represent an interesting, coherent genre: some adult writers may violate various conventions intentionally, in order to achieve an informal mood, while other writers make mistakes because they lack the skill to avoid them. Hence the eventual LUCY Corpus contains no Informal BNC material.

    The Polished/Informal distinction does not coincide with the standard distinction between published and unpublished writing. Some individuals are careful with their prose even when not writing for publication; and some ephemeral documents which certainly were printed and published, legally speaking, may be worded as sloppily as if they were handwritten notes. (For that matter, the published/unpublished contrast is not easy to apply in a consistent way to all of the diverse texts in BNC/written.) Furthermore, among the BNC/written texts that we counted as Polished, some are more so than others. They certainly do contain various unconventional usages, which are logged in the notes files - but the density of these is lower than in the texts we categorized as Informal.

    Note also that Polished in this context does not imply anything like high-flown or literary. Most of the Polished texts in LUCY, like most of the contents of BNC as a whole, are relatively workaday, practical documents of one sort or another. Even extracts from fiction are often from popular rather than "literary" fiction. For the purposes of studying the modern English language as used in real life, this is as it should be.

    3.1.1 Choice of individual samples

    It would probably be desirable, other things being equal, for the LUCY extracts from BNC to be ones that many other researchers are studying and annotating from different points of view. Because the Sampler subset of BNC includes plenty of texts fitting under the "polished" heading, a first thought was to use extracts from a 50-item random subset of those texts. But that would make sense only if the written parts of the BNC Sampler Corpus themselves comprised a fair cross-section of the whole written BNC.

    It turns out that this is far from true. For instance, out of 69 Sampler texts categorized by the BNC descriptors as "book or periodical", the first twelve (more than one in six) all represent Foreign material from the Guardian newspaper, even though the Guardian is not the only national newspaper represented in the full BNC, and Guardian material in the BNC is drawn from eight different categories (e.g. Arts, City, Home, etc.). I do not know why the Sampler texts were chosen in such an unrepresentative way, but, since they were, it seemed best to draw selections at random from the full BNC in order to get a subcorpus approximately as representative as the written BNC as a whole.

    Consequently we used a pseudo-random number generator to generate a sequence of BNC filenames; from this sequence we eliminated filenames corresponding to spoken texts, together with a few texts which from the Reference Guide description sounded unduly similar to texts that had already appeared in the sequence. (Thus, having accepted BNC file FBU, extracted from The Weekly Law Reports 1992, vol. 3, we rejected BNC file FDH, another extract from the same volume; and, at a stage when we were still considering "informal" material, after accepting BNC file CCH, from the Shrewsbury Diocesan Catholic Voice, we rejected BNC file C8G, from the Leeds Diocesan Catholic Voice, published by the same Church publisher. We accepted multiple texts from the same newspaper, but only so long as these represented different genres, e.g. City material v. Leisure material.) This process continued until we had 50 "polished" files. In one case where the text was less than 2000 words long we added the most similar other text we could find to take available wordage over 2000.

    3.1.2 Choice of LUCY extracts within BNC files

    BNC files are of very different lengths, in some cases many tens of thousands of words. For LUCY, as for SUSANNE and CHRISTINE, the aim was to choose extracts each of the same length, nominally 2000 words (taking a few words more rather than a few less as needed to make the extract boundaries coincide with natural breaks). At least, this was the aim with respect to the Polished adult prose; with the Young Adults' and Child writing, the original documents were shorter, and with those genres we either aimed for 500-word extracts or took complete documents, as discussed below.

    From a few sample BNC/written extracts it appeared that a rough average proportion of lines to words in the LUCY format (in which punctuation marks and various other non-word items have lines of their own) is 1.175:1, implying that 2350 lines should correspond to about 2000 words, so we aimed to select extracts from random points within reformatted BNC files so as to begin and end at natural breaks and include 2350 or slightly more lines.

    Written BNC files are divided into sections by a hierarchy of division tags: <div1> for most-important down to <div4> for least-important divisions. The BNC tags <p> for paragraph and <s> for sentence might perhaps be seen as, logically, further elements in this hierarchy. We used a pseudo-random number generator to locate appropriate extract boundaries as follows. For a text T lines long, we generated a random number N between 2350 and T. We then looked for the <div1> boundary (which could be the beginning or end of the entire BNC text) that was least distant from either line N-2350 or line N - that is, if <div1> boundaries occurred respectively 500 lines before line N-2350, 50 lines after that line, 100 lines before line N, and 200 lines after line N, we chose the second of the four. The line selected was made the beginning of the LUCY extract, if it was close to line N-2350, or the end if it was close to line N. We then moved 2350 lines in the appropriate direction (forwards or backwards), and continued for as many more lines as necessary to reach at least an <s> boundary; however, if there were a "better" boundary not many lines further, we continued to that (for instance, we accepted another shortish sentence if it avoided breaking within a paragraph, though without any prior decision about how short "shortish" should be, and we accepted a heading if the extract already included the beginning of the text section introduced by it).

    If the nearest <div1> boundary to either line N-2350 or line N was more than 2350 lines from it, so that the above algorithm would result in an extract including no part of the material between lines N-2350 and N, I carried out the process looking instead for <div2> boundaries, and so on to lower categories of boundary as necessary. Thus the resulting set of extracts should contain a representative variety of material from beginnings, ends, and middle parts of text divisions. (Whether there are interesting differences between structures found early and late in text divisions is open to question; it may be that the procedure described was unnecessarily heavy-handed. But it seemed inadvisable to assume that there are no such differences.)
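    The first stage of this procedure, choosing the division boundary that anchors an extract, can be sketched in Python as follows. This is a reconstruction from the description above, not the code used by the project. The names choose_anchor and pick_random_line are hypothetical; the representation of boundaries as lists of line numbers, one list per division level, is an assumption; and the treatment of a tie between the two target lines (resolved here in favour of the start) is arbitrary. The final stage, extending the extract to an <s> boundary or a "better" one nearby, involved human judgement and is not modelled.

```python
import random

EXTRACT_LINES = 2350  # about 2000 words at 1.175 lines per word


def pick_random_line(text_length, rng=random):
    """Random line number N between EXTRACT_LINES and the text length."""
    return rng.randint(EXTRACT_LINES, text_length)


def choose_anchor(boundaries_by_level, n):
    """Pick the division boundary that anchors an extract.

    boundaries_by_level: lists of boundary line numbers, highest-level
        divisions (<div1>) first, then <div2>, and so on.
    n: the random line number N.

    Returns (line, role): role is "start" if the boundary is nearer line
    n - EXTRACT_LINES and "end" if it is nearer line n. Returns None if no
    level has a boundary within EXTRACT_LINES lines of either target.
    """
    start_target, end_target = n - EXTRACT_LINES, n
    for boundaries in boundaries_by_level:
        if not boundaries:
            continue
        # Boundary least distant from either of the two target lines.
        best = min(
            boundaries,
            key=lambda b: min(abs(b - start_target), abs(b - end_target)),
        )
        d_start = abs(best - start_target)
        d_end = abs(best - end_target)
        # Too far away: the extract would miss lines N-2350..N entirely,
        # so fall through to the next, lower level of division.
        if min(d_start, d_end) <= EXTRACT_LINES:
            return best, ("start" if d_start <= d_end else "end")
    return None
```

    With N = 5000 and <div1> boundaries at lines 2150, 2700, 4900, and 5200 (that is, 500 before and 50 after line N-2350, and 100 before and 200 after line N), choose_anchor selects line 2700 as the start of the extract, matching the worked example in the text.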

    3.2 Young Adult writing

    The files with names beginning E... fall into three groups:

    3.2.1 A-Level General Studies scripts

    "A-Level" (the Advanced Level of the General Certificate of Education) is the exam taken in Britain typically by youngsters in the final year of secondary education, and which serves as the main test for university entrance. Secondary schooling is relatively specialized in Britain, with most pupils studying just three or four subjects (say, English, French, and History, or Physics, Maths, and Chemistry) in the last two years of the system, at ages about 16 to 18. A pupil will take A-Level exams in each of his or her subjects, and for each subject (assuming it is passed) will receive a grade ranging from A down to E, which is a distillation of more detailed percentage marks assigned internally in the examining system to the individual answers produced by a candidate in the various papers sat in a subject. A University department will respond to applications for admission by specifying required grades in relevant subjects at A-Level. Apart from special subjects, some pupils additionally take a "General Studies" paper, which invites candidates to write essays on current affairs, social issues, and the like, and which is not prepared for by studying a set syllabus. Files E01 to E19 represent scripts graded A or B (i.e. high-graded) from the General Studies paper set by the University of Cambridge Local Exams Syndicate in 1994. We are very grateful to John Milton of the Hong Kong University of Science and Technology for supplying copies of this material, which he has used in his research comparing the written English of native speakers and Hong Kong Chinese undergoing English-medium education. (A-Level exams are taken in Hong Kong as well as in Britain, but the LUCY material is drawn exclusively from scripts written in Britain, where candidates are normally native speakers.)

    The General Studies paper contained questions grouped into seven thematic sections. The LUCY files represent one answer each by 19 candidates, chosen to give a reasonably even spread across the seven sections and across alternative questions within each section, and to give a spread of numerical marks reasonably representative of the spread over all answers by all candidates in the sample supplied by Milton. Within these constraints, choice of individual answers for analysis was random.

    3.2.2 Access-course coursework

    Although the typical undergraduate entering a British university will be aged 18 or 19 and will have left secondary education after taking A-Levels either the same year or with only one so-called "gap year" (of non-study life experience) intervening, many universities also encourage applications from students without standard A-Level qualifications, and who may be a few years older than the majority of entrants. Because of their less adequate educational background, these students often take a one-year "access course" designed by the university to help foster study skills and provide some general academic background before the student embarks on the first year of a standard undergraduate degree programme. Files E40 to E44 are extracted from coursework produced in the academic year 1999-2000 on an access course attached to computing degree schemes at Sussex University, and taught under a franchise arrangement by Crawley College, Sussex. Each writer was aged in the range 18 to 22, male, and U.K. born and educated. We are very grateful to Gabrielle Litten of Crawley College for supplying this material.

    The original documents were diverse in length. We aimed to extract passages of 500+ words (i.e. at least 500 words, beginning and ending at natural breaks such as sentence boundaries or, better, paragraph boundaries). Where an original was well over 1000 words long, a random number generator was used to choose a starting place; otherwise, material was taken from the beginning and from the end in alternate documents, and one document of less than 500 words was used in full.

    3.2.3 First-year undergraduate essays

    When the LUCY project began, the plan was to represent young adult writing entirely through coursework essays written by first-year Sussex University undergraduates. The early months of most degree schemes involve students producing assessed writing (the usual term is "essays") of a fairly general kind, not closely tied to obscure specialist knowledge. We hoped that by sampling material of this kind from across the range of arts, social studies, and science disciplines, we would get a good representation of undergraduates' general ability to use written English, without having to go far from base to gather our data.

    Unfortunately it soon emerged that this strategy was less suitable than we had expected. Impressionistically, it was very hard to believe that the material we collected did represent the writers' natural writing style. Much of the material read rather tortuously, and the effect seemed to result from attempts to pastiche the surface features of the learned academic books and journals which will have appeared on tutors' reading lists.

    The issue here was not "plagiarism", in the sense of students trying to gain marks dishonestly by passing others' writing off as their own - we did not suppose that the material was copied word for word. The impression, rather, was that these new undergraduates had arrived at University with a strong instinct that success in their University career would require them to conform to alien norms, so that the sooner they practised the heavy-handed written style of academic publication the better, even if they were writing about ideas which could have been expressed more straightforwardly. Perhaps it was naive of us not to have anticipated this problem.

    Not all the undergraduate essays share this style to the same extent, and our interpretation of it may be mistaken. Nevertheless, with hindsight we now see the A-Level scripts as a better representative sample of late-teens' natural writing style than the undergraduate essays. The A-Level scripts were written under exam conditions, and the time constraints perhaps discouraged candidates from embellishing their prose with unnatural convolutions. However, by the time we got our minds round this problem with the undergraduate essays we had already analysed a fair number of them, so we included that material in the LUCY Corpus, before going on to seek out alternative sources of data such as the A-Level scripts.

    All undergraduate essays were collected over the period November 1999 to January 2000 and written by first-year undergraduates - in other words they were produced within the first few months of undergraduate study. Although the Sussex undergraduate body includes many individuals who are not English native speakers, whether foreign or members of non-English mother-tongue ethnic communities within Britain, the tutors through whom we gathered the material were asked to filter out any by students whose speech suggested that their English was not at native-speaker competence levels.

    Most essays were well over the target length of 500+ words, so a random number generator was used to choose a starting point within an essay, with the extract beginning at the nearest paragraph boundary, and continuing to the first paragraph break after the 500th word. (If small adjustments to beginning and end points would make at least one of them fall at a higher-order boundary, this was done; naturalness of breaks was treated as a higher priority than perfect randomness in selecting extracts in this and other sections of the LUCY Corpus.) Because only complete paragraphs were used, average length of extracts from the undergraduate essays is about 600 words. In some cases we were supplied with more than one essay by the same student; we used only one from any writer, choosing between alternatives in such a way as to give a variety of topics in the overall selection of essays.
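    The basic selection rule for these extracts (before any manual adjustment towards higher-order boundaries) can be sketched in Python. The sketch is written from the description above and is not the project's own procedure in code form: the function names are hypothetical, an essay is represented simply as a list of paragraph word counts, and what was done when fewer than 500 words remained after the chosen starting paragraph is not stated in the text, so here the extract simply runs to the end of the essay.

```python
import random

TARGET_WORDS = 500


def choose_extract(paragraph_lengths, start_word):
    """Return (first, last) paragraph indices (inclusive) of an extract.

    paragraph_lengths: number of words in each paragraph of the essay.
    start_word: randomly chosen word offset within the essay (0-based).
    """
    # Word offset at which each paragraph begins.
    starts = []
    total = 0
    for length in paragraph_lengths:
        starts.append(total)
        total += length
    # Begin at the paragraph boundary nearest to the random offset.
    first = min(range(len(starts)), key=lambda i: abs(starts[i] - start_word))
    # Take whole paragraphs up to the first break after the 500th word.
    words = 0
    last = first
    for i in range(first, len(paragraph_lengths)):
        words += paragraph_lengths[i]
        last = i
        if words >= TARGET_WORDS:
            break
    return first, last


def random_extract(paragraph_lengths, rng=random):
    """Choose a random starting offset and apply choose_extract."""
    start_word = rng.randrange(sum(paragraph_lengths))
    return choose_extract(paragraph_lengths, start_word)
```

    For an essay with paragraphs of 200, 250, 300, 150, and 400 words and a random offset of 230, the extract runs from the second paragraph to the third, giving 550 words; taking only complete paragraphs in this way is why the average extract length exceeds the 500-word target.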

    At the period in question, teaching of individual subjects at undergraduate level was organized into ten broad "schools of study" at Sussex (the university has subsequently restructured itself). The LUCY files represent undergraduates belonging to five of these, as follows:

    African and Asian Studies: E51 to E54
    Biological Sciences: E61 to E63
    Cultural and Community Studies: E71 to E75
    Cognitive and Computing Sciences: E81 to E85
    Social Sciences: E91 to E97

    3.3 Child writing

    This is a data category which was not envisaged for the Corpus at all when the project proposal was put forward; for several reasons it seemed increasingly desirable to include it as the project developed. Arguably, the "market" for corpus linguistics resources is evolving at present in such a way that uses in connexion with language-engineering technology applications are becoming less significant (or are being satisfied by industry-internal research and development, without university involvement), while uses in connexion with education, for instance writing-skills education, are becoming more salient. If we hope that the LUCY Corpus may contribute to the latter, it would be a pity for it to include only young-adult data, relating to the later stages of writing-skills training, and no data on earlier stages - for most learning activities, getting early stages right is surely more important than anything that happens afterwards. (This point is made weightier by the rather unsatisfactory nature of the undergraduate-essay sample, discussed above.) Despite the crucial importance for many areas of employment of skill in writing the national language, the linguistic structure of children's writing is a severely under-researched area. Katharine Perera (1984) is an important book, but was for many years something of an isolated flower in a desert landscape - and it was compiled without the benefit of modern corpus-processing techniques. For quite a long time, the growing field of corpus linguistics by-passed child writing (at least in the native language - learners of English as a Foreign Language were better served), though work by Ngoni Chipere at Reading and Roz Ivanič at Lancaster is beginning to fill this gap.

    Most important, our group happened to have access to a particularly high-quality source of data, which is otherwise almost wholly inaccessible to current researchers, in the shape of relevant sections of material gathered and published in the 1960s by the Child Language Survey sponsored by the Nuffield Foundation. The LUCY Corpus includes annotated versions of a sample of documents written by children in 1965 and published in the Survey volumes The Written Language of Nine and Ten-Year-Old Children and The Written Language of Eleven and Twelve-Year-Old Children (Handscombe 1967a, 1967b).

    It may sound unappealing to include material gathered almost forty years ago in a Corpus intended to represent modern written-language behaviour and usage. In connexion with studies of education, though, this is arguably the ideal way to cover child writing. The "polished" adult writing in the Corpus dates to the years about 1990; individuals who were children in the 1960s fall close to the middle of the age-range of people likely to have been producing published writing twenty to thirty years later. If we aim to study the relationship between how people write when they are acquiring the skill and how they write when they have mastered it, child writing from the 1960s makes a better comparison with recent published writing than would recent child writing. The ideal comparison would be writing by the very same individuals as children and later as adults, but (except perhaps for a few very unrepresentative individuals) data of that sort would be impossible to collect. Failing that, the best we can hope for is a fair sample of child writing from one period, and a fair sample of adult writing from a generation or so later: that is what the Nuffield material and the BNC material between them give us.

    The Child Language Survey researchers visited a range of schools and collected various kinds of data on pupils' use of oral and written language. While not a perfect statistical cross-section, the range involved diversity both in region (locations in London, Kent, Sussex, and Yorkshire) and in school type (state primary and grammar schools, one secondary modern, and one then-novel comprehensive school - and, from another perspective, boys' schools, girls' schools, and mixed schools). The locations appear to have been suburban and semi-rural rather than either "inner-city" or fully rural. Insofar as this range of schools departs from an ideal cross-section (selective grammar schools over-represented, no locations in areas like Cumbria or Cornwall that are distant from the industrial heartlands), it appears to do so in ways likely to make the pupils sampled a relatively good match to the class of 1990s' published authors (who will themselves clearly not form a random cross-section of the adult population).

    The Survey work as a whole generated more than twenty volumes of data (and some volumes of analysis). For the two volumes used by our project, pupils aged nine to twelve were invited to write essays on a choice of open-ended topics (e.g. "My favourite hobbies", "My last holidays") which were transcribed into typescript and published. It was made clear to the children that this was a voluntary activity separate from the prescribed school syllabus; the resulting documents have every appearance of representing the children's natural, spontaneous writing abilities. The published transcriptions are exemplary in terms of the care taken to render details faithfully (for instance, crossed-out letters or words are recorded).

    More precisely, the age cohorts represented were those in which children are respectively nine, ten, eleven, and twelve at the beginning of the school year (presumably September) - namely the last two years of primary and first two years of secondary education. The "9/10-year-old" and "11/12-year-old" data were collected respectively in July 1965 and May 1965; so children labelled as 11-year-olds will actually have been twelve if their birthdays happened to fall earlier in the school year, and so on for the other nominal ages.

    Although the Survey volumes were (non-commercially) published, even libraries of universities with linguistics departments rarely seem to hold copies, and in consequence scarcely any research use has been made of them since (I know of none, other than the limited amount of analytic work done by the Survey researchers themselves). I held copies of the two volumes used by our project when it started, having bought them when they were published in 1967. I am very grateful to Richard Handscombe, now retired from York University, Toronto, who held all the Survey materials (transcriptions, tapes of speech, etc.), for passing them on to me in their entirety when he heard of our interest, and inviting me to act as the Survey's academic "heir".

    The Nuffield data are organized in a somewhat complex way. The teachers through whom the Nuffield researchers got access to children were asked to organize a suitable number of pupils whom they judged typical of their schools into "recording groups" each containing up to four children, selected to be friends or at least not on bad terms with one another. (This system will clearly have been a sensible way of sampling conversational speech, though it has little relevance for the written material with which the LUCY project is concerned.) The individual authors of essays are not registered even in code terms; for each essay we are given only a coded identification of the recording group to which the author belonged, together with the child's sex and age (at start of school year).

    The children were offered a range of essay titles grouped into five sections, and were asked to write on one topic from each section (not all children wrote five essays, but many did - see below). They were supplied with "small exercise books" and asked to write "at least a page" on each topic chosen. The titles offered were:

    Section I
  • My school
  • My favourite subjects
  • My favourite sports
    Section II
  • My home
  • My brothers and sisters
  • My friends
    Section III
  • My last holidays
  • My favourite hobbies
  • My pets
  • My favourite books
    Section IV
  • How I learnt to swim
  • How I learnt to ride
  • How I learnt to cycle
  • How I learnt to cook
  • How I learnt to sew
    Section V
  • My favourite possession
  • My favourite story
  • What I like doing best after school time
  • A funny thing that happened to me

    There were also a handful of pieces under other titles chosen by the children themselves; in the LUCY selection, file M17 is on the title "Our School Garden", and M74 on the title "Topics".

    The introduction to the Handscombe volumes uses the statistics on number of essays written as an index of the "naturalness" of the task for the children. The 9-to-10-year-olds volume comments:

    As 113 primary-school children were asked to write five compositions each, there should have been 565 compositions. There is, however, a gradual falling-off in the number of children still writing by the time they reach Section V ..., and they actually produce 552, an average of 4.88 per child. On the other hand, there are 25 additional pieces on the suggested titles [i.e. some children wrote on two titles from the same section] and 4 extra pieces which are completely spontaneous ..., making a final total of 581, average 5.14 per child. It would seem fair to infer, therefore, that the children were excited enough by the task itself to take it seriously and that, although they did not have complete freedom to choose any five titles, the resulting compositions may be taken as some small indication of interest in as well as preference for the topics they elected to write about within the limited range.

    The 11-to-12-year-olds volume makes similar remarks: the average essays per child are again 4.88 without "extras", and 5.01 including 11 additional pieces on the suggested titles and two spontaneous pieces. It does seem reasonable to conclude that the Nuffield material is about as good a representation of the spontaneous writing style of children at this age as one is likely to find.
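
    The figures quoted from the 9-to-10-year-olds volume can be checked with a few lines of arithmetic. The sketch below uses only the numbers given in the quotation; the variable names are ours and purely illustrative.

```python
# Figures quoted from the 9-to-10-year-olds volume.
children = 113
expected = children * 5          # five compositions requested per child
written_on_plan = 552            # one piece per section, as actually produced
extra_on_titles = 25             # second pieces on a suggested title
spontaneous = 4                  # pieces on titles the children chose themselves

total = written_on_plan + extra_on_titles + spontaneous
avg_on_plan = round(written_on_plan / children, 2)
avg_total = round(total / children, 2)

print(expected, total, avg_on_plan, avg_total)
```

    The results agree with the totals and per-child averages stated in the quoted passage.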

    Our priority was to ensure that each essay included in the LUCY Corpus was by a different child, to maximize representativeness within a limited subset of the Nuffield material. Consequently, except for one recording group containing only a single child, we took two essays on the same topic from each recording group: the identity of topic guaranteed that these were by different children. Within this constraint, choice of particular topics per recording group and choice of individual essays on that topic within a recording group was random, subject to some manual adjustment to the results of random choice in order to equalize numbers of boys and girls as far as possible, and to get as wide as possible a spread of topics. (The main significance in the random-choice aspect of the essay selection process was that it prevented members of our team from biassing their choices towards more interesting pieces of writing.)

    Because the essay titles were set by the Nuffield researchers, they are not included in the LUCY files even when the children copied them at the head of their essays (as they usually did).

    In this connexion it is worth noting that the word "favourite", which occurs in several titles and hence also occurs in the body of many essays under those titles, is frequently mis-spelled "favorite" by the children. Since Handscombe is Canadian, it is possible that this may have been triggered by American spelling habits on his own part. (Nowadays, British children get a good deal of exposure to American spellings, but this was much less true in the 1960s.)

    It is also worth noting that while the prose of these essays in general appears very spontaneous and natural, the set titles occasionally can be seen to influence wording following the title slightly. Thus K05 begins "Last year’s summer holidays was the best I have ever had" – the failure of number agreement between "holidays" and "was" may well stem from the fact that the set title was "My last holidays", in the plural. And K63 begins "My favourite possession. is my bicycle." – the redundant stop before "is" surely has to do with the fact that the set title was "My favourite possession".

    The schools where material was collected were as follows. Since only code names appear in the Handscombe volumes, reconstructing the identities has involved a measure of detective work, on which Richard Handscombe collaborated with ourselves. Note that not all these schools still exist in any form; those that do will in many cases have changed their nature and school-type name as a result of various educational reforms in the intervening decades. (The sample includes one comprehensive school, which is now the standard type of secondary school; at this time it was still an unusual school-type.)

    Primary Schools (all mixed schools)
  • DES: Desmond Anderson (now) First and Middle Schools, Canterbury Road, Tilgate, Crawley, Sussex
  • BB: Bishop Bell School, Tilgate, Crawley, Sussex
  • HIG: Highgate Primary School, North Hill, London N6
  • QUE: Queenswell (now) Infant and Junior Schools, Sweets Way, Whetstone, Middlesex
  • WAR: Warren Road Primary School, Orpington, Kent
  • CRO: Crofton (now) Infants School, Town Court Lane, Orpington, Kent
  • TAL: Talbot Primary School, East Moor Road, Leeds 8
    Secondary Schools
  • FRI: mixed secondary modern: Friern Barnet County School, North London
  • RAB and RAG: paired boys' and girls' secondary modern schools in Bromley, Kent: believed to have been Ravensbourne School for Boys and Ravensbourne School for Girls, now a single mixed school
  • THO: mixed comprehensive: Thomas Bennett, Ashdown Drive, Tilgate, Crawley, Sussex
  • AH: girls' high (i.e. grammar) school: Allerton High School, Leeds
  • ROU: boys' grammar school: Roundhay Grammar School, Leeds

    The notes files for the child writing each begin after the filename with the school code followed by the letter B or G, for boy or girl.

    4. File structure

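    Each LUCY text file is a sequence of lines, essentially one per text word, each containing the same six fields separated by tab characters: two location codes (one of seven digits and one of five digits), a status field, a wordtag field, a word field, and a parse field. These are described in turn below.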

    4.1 Location codes

    The purpose of the two location codes is to provide a means of identifying specific passages within LUCY files and also to relate them to locations within the original resources from which the LUCY files were derived. Because these original resources are diverse, the numerical codes mean different things for different groups of texts.

    In the files derived from BNC material (B... and C... filenames), the 7-digit code represents the sequential order of the first character of the word in the BNC file, considered as a linear sequence of bytes (the LUCY files are based on the original, 1995 release of the BNC). Where a line corresponds to no specific character-sequence in the BNC file, the 7-digit code is seven zeros. In these files, the 5-digit code represents the BNC s-unit number. BNC s-units are intended to correspond to sentences, though oddities in the BNC compilation process mean, for instance, that occasionally two written sentences are run together as a single s-unit; and often the point where the 5-digit code changes is a line or two away from the point which would most logically be regarded as the sentence boundary. (File B39 was created by concatenating two short BNC files; in order to make the 5-digit codes unique within this LUCY file, 10000 is added to s-unit numbers from the earlier BNC file, and 20000 to those from the later BNC file.) Again some lines created in the LUCY annotation process and outside the BNC s-unit system are assigned 00000 codes.

    With all other LUCY files, the 7-digit codes are arbitrary numbers created for the LUCY corpus: they increase (normally in tens) throughout a text file. The 5-digit codes relate to linguistic divisions of the texts, but because sentence-boundaries are hard to detect automatically, in these files the 5-digit codes are associated with successive paragraphs. Some of the short pieces by children consist of a single paragraph, and these have the same 5-digit code for each line.

    This is not a very satisfactory state of affairs. The main use of location codes in practice is to provide a convenient way of identifying short passages for citation. For BNC-derived files, s-units are the right size for this purpose; but the 7-digit codes in these files are probably of little current relevance (most researchers are likely to be working with later releases of BNC, in which given words will not always be at the same byte offset from the file beginning). Conversely, in the non-BNC-derived files the 5-digit codes cover passages which are longer than convenient for citing examples. The suggested solution at present is to quote passages in non-BNC-derived files using the 7-digit code for their first or most important word, and to quote passages in BNC-derived files using the 5-digit code. It is hoped to move to a more convenient, unified location-coding system in a later release of LUCY.
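
    The citation practice suggested above can be sketched in Python. This is a minimal illustration, not part of the Corpus distribution: the names LucyLine, parse_line and citation are hypothetical, and the order of fields within a line (7-digit code, 5-digit code, status, wordtag, word, parse) is an assumption which should be checked against the files themselves. The sample line in the usage comment is invented.

```python
from typing import NamedTuple


class LucyLine(NamedTuple):
    # ASSUMPTION: this field order is not confirmed by the documentation.
    code7: str
    code5: str
    status: str
    wordtag: str
    word: str
    parse: str


def parse_line(line: str) -> LucyLine:
    """Split one tab-separated LUCY line into its six fields."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) != 6:
        raise ValueError(f"expected 6 tab-separated fields, got {len(fields)}")
    return LucyLine(*fields)


def citation(file_id: str, row: LucyLine) -> str:
    """Cite BNC-derived files (B..., C...) by 5-digit code, others by 7-digit code."""
    bnc_derived = file_id.startswith(("B", "C"))
    return f"{file_id}.{row.code5 if bnc_derived else row.code7}"


# Usage with an invented line:
# row = parse_line("0000160\t00001\t-\tAT\tthe\t[O[S[N.\n")
# citation("H09", row) gives a reference of the form H09.0000160
```

    References built this way have the same shape as those used elsewhere in this documentation, e.g. H09.0000160 for a non-BNC-derived file and B04.00353 for a BNC-derived one.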

    4.2 Status Field

    The status field is a single byte, normally a hyphen. In a minority of lines another character appears, identifying a special status of the line, as follows:

    o: A necessary word omitted in the original document but inserted by the LUCY compilers as a word between curly brackets (cf. sec. 6.6 below)

    y: The line represents a "ghost" element in the sense of Sampson (1995: 353ff.), i.e. it is the location in logical structure of an element that has been removed by grammatical transformation

    4.3 Wordtag Field

    The chief purpose of the LUCY Corpus is to illustrate structure above the word level (phrase and clause structure). Consequently, with limited resources available, we had to make compromises in the wordtagging area of corpus compilation.

    Ideally, the tagset should be that defined in Sampson (1995), as modified by Sampson (2000). Three additional tags were created for the LUCY Corpus. The tag EQ is discussed in sec. 4.6 below. The tags YDIVL and YDIVR label the beginning- and end-markers of BNC text divisions (e.g. entities <bdiv1>, <ediv1>). Also, the tag YY, already used in the CHRISTINE Corpus for an inaudible word, in LUCY labels a <gap> entity. (See sec. 5.2 below for discussion of these latter elements.)

    The contents of a LUCY wordtag field would then be either a single member of this tagset, or (in cases of ungrammatical structures of the kind discussed in sec. 6.6 below) a character sequence consisting of a tilde character preceded and possibly followed by either one of these wordtags or a zero character. That is, if WTA and WTB stand for wordtags, well-formed wordtag field contents are of one of the forms:

  • WTA
  • WTA~WTB
  • WTA~0
  • 0~WTA
  • WTA~

    However, only one text in the current release of LUCY (text E75) was manually tagged using the full range of tags. Other files were tagged by an automatic tagger, with manual postediting. (We are very grateful to our colleague John Carroll for making his tagger available and supervising the processing of our material.) The output of the automatic tagger uses a simplified version of the SUSANNE tagset, which omits the closing lower-case letters that denote fine-grained categories such as transitivity class in verbs, countability class in nouns, and classification of proper names as e.g. town names, country names, personal surnames, etc. Thus a singular proper name in any LUCY file other than E75.tx will normally be tagged NP1, a tag which does not exist in the SUSANNE scheme (which only recognizes NP1f for female Christian names, NP1g for miscellaneous geographical names, and so forth).

    A complicating factor is that the human posteditor did sometimes use the fine-grained SUSANNE tagset when correcting automatic tagger output. As a result, the current LUCY tagset mixes tags lacking fine-grained subcategories with a smaller number of instances of tags including them. This is clearly not an ideal situation. In a future release of LUCY it is hoped to move to a fully-consistent tagset, though this would need to use the less-detailed tags of the automatic tagger.
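
    The five well-formed shapes of wordtag field listed above (WTA, WTA~WTB, WTA~0, 0~WTA, WTA~) can be checked mechanically. The Python sketch below is illustrative only: the function name split_wordtag_field is hypothetical, and since the tagset itself is not enumerated here, a "wordtag" is assumed to be any non-empty string other than "0" containing no tilde or white space.

```python
from typing import Optional, Tuple


def split_wordtag_field(field: str) -> Tuple[str, Optional[str]]:
    """Validate a wordtag field and return (left, right).

    right is None when the field has no tilde, and "" for the form WTA~.
    ASSUMPTION: a wordtag is any non-empty string other than "0" that
    contains no tilde and no white space.
    """
    if not field or any(ch.isspace() for ch in field):
        raise ValueError(f"malformed wordtag field: {field!r}")
    left, tilde, right = field.partition("~")
    if not left or "~" in right:
        raise ValueError(f"malformed wordtag field: {field!r}")
    if not tilde:
        if left == "0":
            raise ValueError("a bare zero is not one of the listed forms")
        return left, None                      # form WTA
    if left == "0" and right in ("", "0"):
        raise ValueError("at least one side of the tilde must be a wordtag")
    return left, right                         # WTA~WTB, WTA~0, 0~WTA, WTA~
```

    A checker of this kind accepts the mixture of fine-grained and simplified tags described above, because it tests only the shape of the field and not membership of any particular tagset.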

    4.4 Word Field

    The word field contains an element of the analysed text occupying a leaf node of the parse-tree assigned by the analysis to the stretch of text containing it: normally a word, or an element such as a punctuation mark. Where the contents of a word field was not immediately preceded by white space in the original text, a plus sign is prefixed to it: thus a comma will normally be shown as "+,".

    Where the entire contents of the word field is a hyphen character, this does not represent a hyphen in the text: it corresponds to the fact that the line represents a ghost element of the analysis (see "Status Field" above).

    Orthographic shifts such as passages in italics are identified by surrounding them with markers within angle brackets (in the case of italics, <bital> ... <eital>), and these occupy the word field of lines of their own. The use of symbols within angle brackets is discussed under "Transcription Conventions" below.
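
    The plus-sign and ghost-element conventions make it possible to recover an approximation of the running text from successive word fields. The following Python sketch assumes that the status and word fields of each line have already been extracted as pairs; the function name join_words is hypothetical. Symbols in angle brackets are passed through unchanged, since interpreting them is a separate matter.

```python
from typing import Iterable, Tuple


def join_words(rows: Iterable[Tuple[str, str]]) -> str:
    """Rebuild running text from (status, word) pairs given in file order."""
    pieces = []
    for status, word in rows:
        if status == "y" or word == "-":
            continue                    # ghost element: nothing on the page
        if word.startswith("+") and len(word) > 1:
            pieces.append(word[1:])     # no preceding white space in the original
        else:
            if pieces:
                pieces.append(" ")
            pieces.append(word)
    return "".join(pieces)


# Invented example rows, for illustration only.
rows = [("-", "Yes"), ("-", "+,"), ("-", "I"), ("y", "-"), ("-", "can"), ("-", "+.")]
print(join_words(rows))
```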

    4.5 Parse Field

    The contents of the parse field represent the central raison d'être of the LUCY Corpus. They code the grammatical structure of texts as a sequence of labelled trees, having a leaf node for each Corpus line. The coding system is the same as used in the SUSANNE and CHRISTINE Corpora.

    The highest-level units recognized in the grammatical analysis are paragraphs; each text is treated as a sequence of trees representing successive paragraphs, with root nodes labelled O (or Oh in the case of items such as headings or bylines which are not "paragraphs" in the normal sense). Such a tree has a leaf node labelled with a wordtag for each LUCY word or trace, i.e. each line of the Corpus. There will commonly be many intermediate nodes.

    A LUCY tree is represented as a bracketed string in the ordinary way, with the labels of nonterminal nodes written "inside" both opening and closing brackets (that is, to the right of opening brackets and to the left of closing brackets). This bracketed string is then adapted as follows for inclusion in successive LUCY parse fields. Wherever an opening bracket immediately follows a closing bracket, the string is segmented, yielding one segment per leaf node; and within each such segment, the sequence opening-bracket + wordtag + closing-bracket, representing the leaf node, is replaced by full stop. Thus each parse field contains exactly one full stop, corresponding to a terminal node labelled with the contents of the wordtag field, sometimes preceded by labelled opening bracket(s) and sometimes followed by labelled closing bracket(s), corresponding to higher tagmas which begin or end with the word on the line in question.
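
    As remarked under "Version information", correctly-balanced bracketing can be checked automatically. The Python sketch below shows one way this might be done for a sequence of parse fields taken in file order. It is not the checking procedure actually used by the project: the function name brackets_balance is hypothetical, and the sketch assumes that node labels never contain square brackets or full stops and that a closing bracket carries exactly the same label as the opening bracket it balances.

```python
import re
from typing import Iterable

_OPENINGS = re.compile(r"(?:\[[^\[\]]+)*")   # zero or more "[Label"
_CLOSINGS = re.compile(r"(?:[^\[\]]+\])*")   # zero or more "Label]"


def brackets_balance(parse_fields: Iterable[str]) -> bool:
    """Return True if the labelled brackets in successive parse fields balance."""
    stack = []
    for field in parse_fields:
        if field.count(".") != 1:
            return False                     # exactly one full stop per field
        before, after = field.split(".")
        if not _OPENINGS.fullmatch(before) or not _CLOSINGS.fullmatch(after):
            return False
        stack.extend(re.findall(r"\[([^\[\]]+)", before))
        for label in re.findall(r"([^\[\]]+)\]", after):
            if not stack or stack.pop() != label:
                return False                 # closing label has no matching opener
    return not stack                         # e.g. a missing final O] leaves O open


# Invented parse fields, for illustration only.
good = ["[O[S[N.", ".N]", "[V.V]", "[N.N]S]O]"]
missing_close = ["[O[S[N.", ".N]", "[V.V]", "[N.N]S]"]
print(brackets_balance(good), brackets_balance(missing_close))
```

    The second example mimics the commonest error corrected in Release 2, a missing top-level O] at the end of a paragraph.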

    The system of tagma labels is defined in detail in Sampson (1995); it is too complex to describe here. (A resumé of the scheme is included in Sampson (2000: sec. 4).)

    4.6 The new EQ wordtag

    Other than in the area of prose containing language errors, discussed in sec. 6 below, the LUCY project strove to avoid introducing new complications into a parsing annotation scheme that had already been highly refined with respect to both written and spoken usage. One area of standard written English, though, proved to demand a small extension of the existing scheme. The SUSANNE scheme (Sampson 1995: 109) defines a single wordtag IIx for "mathematical infix operators", including both +, −, etc., and =, >, etc. But these two symbol groups are grammatically quite different. The former group are analogous to prepositions (as implied by the II... tag); but the latter group function like verbs to create assertions.

    (The compilers of the SUSANNE Corpus did not encounter this issue, because the Brown Corpus on which SUSANNE was based happened to treat all formulae occurring in its texts as if they were single "words", replacing the details of the formula with an artificial placeholding element.)

    Accordingly, the LUCY Corpus uses a new wordtag EQ for assertion-creating infixes such as =, >. A "clause" headed by an EQ symbol is tagmatagged L (rather than S), and the EQ node is attached directly to the L node with no intervening phrase node.

    5. Transcription conventions

    Various special characters are represented in word fields by character sequences within angle brackets, for instance <ccedil> represents c with cedilla. Wherever possible these correspond to standard SGML entity names. Note that where a BNC file contains the entity <rehy>, meaning a "soft" line-end hyphen, LUCY writes the word solid; LUCY does not record line or page breaks from the original documents.

    Also, paired symbols in angle brackets, e.g. <bital> ... <eital> already mentioned above, are used to represent orthographic shifts together with boundaries of text segments identified by SGML annotation in the BNC Corpus - for instance <bbyline> ... <ebyline> surrounds a byline in journalistic prose. The reason for writing e.g. <bital> rather than <bitalic> is for consistency with the SUSANNE Corpus, where angle-bracket symbols were made to conform to the SGML entity-name convention of a maximum six characters. Where LUCY uses symbols not used in SUSANNE, though, there seemed to be no advantage in maintaining this convention, hence e.g. <bbyline> is not truncated.

    Lists of angle-bracket symbols used in LUCY are given below.

    Some special remarks are necessary in connexion with the children's writing collected by the Nuffield project. The children's handwriting was transcribed by that project on a conventional 1960s typewriter, whose output was reproduced in the published volumes. This meant that the character set was narrower than a modern computer keyboard, but on the other hand there was no difficulty in showing children's crossings-out as hyphens overstriking the original letters. Crossings-out were carefully reproduced in the Nuffield volumes. However, for present purposes it did not seem desirable to include crossed-out material (the material not crossed out is what the child intended to be read), so the LUCY Corpus omits it. (A future research plan is to make the entire Nuffield collection available in machine-readable form, since it is potentially valuable for a variety of education-studies purposes. When that is done, crossings-out will be shown along with the rest of the data.)

    Traditional typewriters normally represented the figure 1 as a lower-case l; where l stands for 1 in the Nuffield volumes it is shown as 1 in the LUCY Corpus (the children are likely to have made the distinction in handwriting). Similarly zero and capital O are distinguished in the LUCY files although again the typescript has no such distinction (and in this case there is unlikely to have been a handwritten distinction either).

    The undifferentiated vertical single inverted comma of a traditional typewriter is replaced in the LUCY files by <lsquo>, <rsquo>, or <apos> as appropriate in context. The children's handwriting is likely to have distinguished opening from closing inverted commas, although the distinction was lost when the material was transcribed by typewriter.

    Where hyphens are used to represent handwritten dashes, they are represented by the SGML entity-name <mdash>.

    In the case of child writing it seemed unwise to throw away data about line breaks, since young children may sometimes use these distinctively, for instance as the sole indication of sentence boundaries. Consequently the notes files give seven-figure location codes for line breaks: thus, the first line-break in text F01 is 0000120, meaning that a break occurred between lines 0000110 and 0000130 of the text file.
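
    Given the line-break codes from a notes file and the seven-digit codes of the corresponding text file, the lines on either side of each break can be found by a simple search. The Python sketch below is illustrative: the function name lines_around_break is hypothetical, and it assumes that the codes have already been read into a sorted list of seven-character strings (fixed-width, zero-padded strings sort in the same order as the numbers they represent).

```python
from bisect import bisect_left
from typing import List, Optional, Tuple


def lines_around_break(
    line_codes: List[str], break_code: str
) -> Tuple[Optional[str], Optional[str]]:
    """Return the codes of the text lines just before and just after a break.

    line_codes must be sorted; None is returned on a side where no line exists.
    """
    i = bisect_left(line_codes, break_code)
    before = line_codes[i - 1] if i > 0 else None
    after = line_codes[i] if i < len(line_codes) else None
    return before, after


# The example from the text: a break coded 0000120 falls between
# lines 0000110 and 0000130.
codes = ["0000100", "0000110", "0000130", "0000140"]
print(lines_around_break(codes, "0000120"))
```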

    The compilers of the Nuffield material anonymized it by replacing proper names, where they thought appropriate, with initial letters, e.g. (H09.0000160): "the two top classes were taken to A baths". At this distance of time there seems little reason to maintain anonymity with respect to the identity of the schools, so where we know what these initials stand for we restore them; however, we do not spell out addresses of individual children even in the few cases where it is clear to us what lies behind the initials. In cases where initials are not changed back to full names, they stand in the text files without further annotation.

    5.1 Special character symbols

    <agrave>	a grave
    <amp>	ampersand
    <apos>	apostrophe
    <auml>	a umlaut
    <bullet>	bullet
    <ccedil>	c cedilla
    <dollar>	dollar sign
    <eacute>	e acute
    <egrave>	e grave
    <frac12>	one-half symbol
    <frac13>	one-third symbol
    <frac14>	one-quarter symbol
    <gap>	an element which appears at this point within the original
    text has been omitted from the LUCY text file; see the corresponding
    notes file
    <hellip>	horizontal ellipsis, ...
    <hrule>	horizontal rule
    <hyphen>	hyphen
    <indent>	indent
    <ldquo>	opening double inverted commas
    <lquo>	opening quotation mark, appearance unspecified
    <lsqb>	opening square bracket
    <lsquo>	opening single inverted comma
    <mdash>	dash
    <minus>	minus sign
    <ntilde>	n tilde
    <ouml>	o umlaut
    <plus>	plus sign
    <pound>	pound sterling sign
    <Prime>	double prime
    <prime>	single prime
    <rdquo>	closing double inverted commas
    <rquo>	closing quotation mark, appearance unspecified
    <rsqb>	closing square bracket
    <rsquo>	closing single inverted comma
    <slash>	solidus
    <ucirc>	Czech u with circle above
    <uuml>	u umlaut
    

    5.2 Orthographic shift and text division symbols

    (in each case, an opening <b...> symbol is balanced by a closing <e...> symbol; only the opening symbol is listed here)

    <bbold>	bold face
    <bbyline>	byline
    <bcaption>	caption
    <bdisplay>	displayed passage
    <bdiv1>, <bdiv2>, <bdiv3>, <bdiv4>	text divisions as coded by
    the BNC Corpus (1 largest, 4 smallest).  For more detail on the BNC
    representation of text structure, see the BNC User Reference Guide,
    version 1.0 p. 33.
    <bhead>	heading
    <bital>	italics
    <bitem>	item within a list of items
    <bline>	line within a poem
    <blist>	list of items
    <bmainhead>	main heading
    <bpara>	paragraph
    <bpoem>	poem
    <bquote>	quoted passage
    <bsubhead>	subheading
    <bunderline>	underlining
    

    6. Treatment of linguistic errors

    In the SUSANNE Corpus of annotated published writing, where the disciplines of editing for publication ensured that linguistic errors in the source texts were few, our practice was to preserve such errors in the annotated text files and apply the annotation scheme to them as best we could (cf. Sampson 1995: 77). Oddities of usage were registered only via notes in the documentation file.

    For LUCY, where the child writing in particular is sometimes very unskilled, this approach would have been unworkable; very often there would simply be no reasonable way to fit the categories of the annotation scheme to the distorted linguistic forms which appear on the page. Consequently (for LUCY as a whole, not just the child writing section) we adopted a different approach.

    6.1 Spelling mistakes

    In the case of spelling mistakes, we show the correct spelling in the text files, and record the spelling actually used in the notes files. (In the notes files, the notation X -> Y means that the form X occurred in the original document and has been corrected to Y in the LUCY text file.) The central purpose of the LUCY Corpus is to show what abilities different kinds of people have at assembling words into larger meaningful structures on paper; spelling words correctly is a separate ability, and showing misspellings in the annotated text files would merely get in the way of the main goal. (Where children are very poor spellers, occasionally it has not been possible to identify the intended word - in these cases we leave the erroneous spelling in the text file, with a note about the problem in the notes file, but cases like this are few.)

    Errors of word division are treated as a type of spelling error. Thus, in text E02, "the time of the first century-conquerors" (0001490) is corrected to "... first century conquerors", and "'Action' movies are sell-outs" (0002130) is corrected to "... sellouts", in the text files, with notes about the original orthography in the notes files. The frequently-encountered present-day solecism "thankyou" would be corrected to "thank you" if it occurred in our data.

    As authority on standard spellings, including hyphenation, we have used the Eighth (1990) edition of the Concise Oxford Dictionary, though to date we have not adjusted the -ise/isation/etc. spelling widely used in Britain to the Oxford standard which uses Z rather than S. There can be no guarantee that the corpus compilers have caught every case where a writer has deviated from the COD standard on an issue as subtle as writing a compound solid rather than hyphenated or vice versa.

    6.2 Apostrophe errors

    Apostrophe errors (redundant, missing, or misplaced apostrophes) are treated in the same way as spelling errors. The text files are corrected to the conventional form in the given context, and the notes files log what actually appeared. Logically, apostrophes relate to grammatical structure rather than to spelling, and grammatical errors are normally treated in a rather different way, as we shall see below. But in practice the nature of the SUSANNE annotation scheme means that this is the most satisfactory way of producing consistent structural annotations for apostrophe errors without losing any data.

    6.3 Misprints

    Where a linguistic error in the published writing appears to represent a misprint rather than odd usage on the original writer's part, it is corrected in the text file but logged in the notes file. For instance, at B04.00353 the printed phrase allege that the course in being mismanaged is clearly intended for ... is being mismanaged and is corrected accordingly.

    6.4 Stylistic infelicities

    Where a competent copy-editor might correct wording because it sounds clumsy, but it is not actually ungrammatical, we annotate it as it stands in the LUCY text files. For instance, one document repeatedly uses that relative clauses where most of us would cut out that and the auxiliary verb: ... shows the person that is dealing with the call the relevant information ..., where I would write the person dealing with the call - this is just a matter of style. Likewise, we have not registered violations of artificial grammar rules which some mature writers prefer to obey but others do not, the obvious example being the split infinitive rule. In my own writing, I strive to avoid wording like ... to briefly develop ..., but for LUCY purposes it is not counted as a mistake.

    Another thing which perhaps belongs under this heading is "politically correct pronouns". To the present writer, a sequence like When the customer phones 'First Computers' they are asked ... reads really weirdly (with they = the customer); but this is now a recognized, accepted style and we do not correct it.

    In general, if we were in doubt ("Is this acceptable English or not?"), we aimed to give the writer the benefit of the doubt. And, if something was definitely wrong (so that we used the tilde-tag notation to relate it to some alternative, acceptable wording - see below), we aimed to find the minimal change that would produce acceptable English, rather than some larger change that might turn the wording into more idiomatic English.

    6.5 Incoherent but grammatical wording

    Some passages by unskilled writers seem grammatical but meaningless, or they mean something different from what the writer appeared to want to say. Again these cases are annotated as they stand, though often with a comment drawing attention to them in the notes file. (The examples in this section and section 6.6 are genuine, except where stated as invented, but many of them are taken from documents which did not make it into the completed LUCY Corpus, hence those cases are not given location codes.)

    For instance, one sentence that seems fine so far as English grammar is concerned, but where it is just not clear in the context what is being said, begins As asked when the customers phone the help centre ..., in the middle of a discussion about routines for setting up a call-centre system. In the context I do not know what the words As asked are doing; it seems that they could be omitted and the surrounding text would mean the same. But the phrase is good English, so it is annotated as it stands, with a note about the oddity in the notes file.

    A case where the writer clearly says something different from what he meant is: not every company calling has a credit agreement or has forgotten the agreement number ... - this ought to be something like ... or has remembered the agreement number, but there is nothing wrong with the grammar, so the structure is annotated as written.

    Another kind of oddity handled in the same way is errors of vocabulary rather than grammar: the word in the text is the correct part of speech in the correct inflected form, but that word does not mean what the writer supposes. Thus:

    the duration that they were taking to repair ... - we take "time" not "duration"
    to briefly develop on the contemporary phenomenon ... - develop is transitive; in context the writer obviously means "expand on", i.e. discuss at greater length
    utilises the use of - this is horribly clumsy for makes use of
    it is nearly completed and in usage - should be in use

    In all these cases, nothing in the annotated text file reflects the oddity, but a remark is included in the notes file.

    A more debatable case would be They can be entered ... to the system ..., where I believe the writer meant what would normally be expressed as entered into the system or perhaps entered on the system. Arguably a choice of preposition to suit a verb is an aspect of grammar rather than vocabulary; but in debatable cases one chooses the simpler rather than more complex annotation, so again the wording appears in the text file as it stands, with a remark in the notes file.

    6.6 Ungrammatical structures

    After cases of the above categories are taken care of, there remain the cases - frequent in the child writing - where the wording as it stands is grammatically unacceptable (for reasons other than apostrophe errors).

    In compiling corpora of adult usage, linguists are often chary of treating non-standard structures as "incorrect", arguing that they may represent usage which is as conventional in the speaker's or writer's dialect as some alternative usage is in the standard national language. For an annotated corpus in which child writing forms a significant element, that attitude would not be tenable. It is often clear that a child has produced severely distorted wording not because, among his or her friends, that is a normal way to talk, but because it is difficult to learn to marshal words on paper so as to create structures which make sense in terms of the language one speaks fluently. Much of the purpose of the early years of education is to help children to overcome that difficulty. To be useful, the LUCY Corpus needs to represent the children's structural errors as errors. Furthermore, as already said above, the errors are often so severe that it would scarcely be possible to apply the annotation scheme to them in any non-arbitrary fashion, if we annotated what the children actually wrote rather than what they should have written.

    Consequently we have proceeded as follows. Where some component of a structure is of the wrong grammatical type to fit into the slot where it occurs, we give it a tag (tagmatag or wordtag, as appropriate) of the form X~Y, where X is the category that should occur and Y the category that does occur. If it is clear that an X category is called for, but what actually occurs cannot be satisfactorily fitted into any category, a tag of the form X~ is used.
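
    The two halves of a tilde tag can be separated mechanically. The following Python sketch is illustrative and not part of the LUCY distribution; the names TildeTag and parse_tilde_tag are hypothetical, and the sketch assumes that the tilde character occurs in a tag only in the X~Y and X~ forms just described.

```python
from typing import NamedTuple, Optional


class TildeTag(NamedTuple):
    expected: str          # X: the category that should occur
    actual: Optional[str]  # Y: the category that does occur; None for "X~"


def parse_tilde_tag(tag: str) -> Optional[TildeTag]:
    """Return the halves of an X~Y or X~ tag, or None for an ordinary tag."""
    if "~" not in tag:
        return None
    expected, _, actual = tag.partition("~")
    # An empty right-hand side means the material fits no category.
    return TildeTag(expected, actual or None)
```

    Applied to a tag such as CSW~CSi this yields the expected category CSW and the actual category CSi; applied to Pb~ it yields Pb with no actual category; an ordinary tag such as Np yields None.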

    In some cases a form is wrong because some essential word is missing; and, conversely, unskilled writers often repeat words, unchanged or with minor changes. (One may surmise that this happens through the writer correcting the first attempt but then forgetting to cross it out or delete it - however, if there is no deletion, it would not be justifiable to omit the redundant word token from our text files.)

    Where an essential word is omitted, it is inserted between curly bracket characters, tagged WT~0 (where "WT" represents its appropriate wordtag) - that is, what should occur is a WT, what does occur is nothing. The parsetree then treats that inserted word as a leaf node along with the words that were actually written. Where a word is redundant because it is repeated or near-repeated, the tokens before the last (the ones which have notionally been corrected) are tagged 0~WT.
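
    One consequence of these two conventions is that both the wording as written and the minimally corrected wording can be recovered from the same sequence of tagged words. The Python sketch below illustrates this under stated assumptions: it takes its input as a list of (word, wordtag) pairs that a user has already extracted from the word and wordtag fields, it does not read the corpus files themselves, and the function names are hypothetical.

```python
from typing import List, Tuple

Token = Tuple[str, str]  # (word, wordtag)


def as_written(tokens: List[Token]) -> List[str]:
    """Words the writer actually produced: omit inserted WT~0 tokens."""
    return [word for word, tag in tokens if not tag.endswith("~0")]


def as_corrected(tokens: List[Token]) -> List[str]:
    """Minimally corrected wording: omit redundant 0~WT tokens and
    remove the curly brackets that mark inserted words."""
    return [word.strip("{}") for word, tag in tokens if not tag.startswith("0~")]


# Word/tag pairs quoted elsewhere in this documentation.
omitted = [("of", "IO"), ("{our}", "APPG~0"), ("lives", "NN2")]
repeated = [("in", "0~II"), ("by", "IIb")]

written_1, corrected_1 = as_written(omitted), as_corrected(omitted)
written_2, corrected_2 = as_written(repeated), as_corrected(repeated)
```

    For the first sequence the written form is of lives and the corrected form is of our lives; for the second the written form is in by and the corrected form is by.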

    Obviously this approach often requires an element of guesswork to decide what particular word should be seen as omitted, or what grammatical structure was intended where the writer has produced something ungrammatical. Furthermore, it is debatable whether this sort of guesswork always has a "right answer" (let alone a knowable right answer). When a child writes something garbled, we surely cannot assume that he or she must necessarily have had a particular good structure in mind and failed to get it down on paper. It seems quite possible that the child may often not have managed to articulate a good grammatical structure even tacitly.

    Nevertheless, frequently the necessary corrections are fairly obvious. Sometimes they are not; there are certainly cases where one analyst might choose one minimal alteration to make sense of a garbled passage while another analyst would alter it in a very different though also minimal way. That is unfortunate, because ideally a linguistic annotation scheme ought to be fully determined at all points. But we see no better way than that outlined here of recording the structure of unskilled writers' language with adequate accounting both for what is actually written and for its deviations from conventional structure.

    Some examples:

    6.6.1 Omitted words

    Text E54.0000480 has a sequence plays a ... role in all of lives. This should surely be ... all of our lives, and that is a grammatical error - no sequence all of Xs seems to be natural without some determiner before Xs (if there is no definite article, possessive pronoun, etc., then of would also normally be missing and the phrase would be all Xs).

    The sequence asked for their information that is their customer number and ... needs at least a comma before that is, and probably also one following.

    In these cases, {our} and {,} are inserted in the word fields, with the wordtags APPG~0 and YC~0 respectively in the wordtag fields. The sequence {our} lives is tagmatagged Np, as if our were a "real" word.

    Without our, the word lives on its own in this context would not be a tagmatagged constituent by the rules of the SUSANNE scheme, hence {our} lives is tagged Np~NN2 - what should occur is a plural noun phrase, what does occur is just a plural noun. But consider the (invented) example He was the kitchen for He was {in} the kitchen. In that case, the kitchen is an Ns; so the corrected constituent {in} the kitchen is tagged P:p~Ns:p, meaning that what should appear here is a prepositional phrase of Place but what actually appears is a singular noun phrase functioning as a Place adjunct.

    A tilde tagmatag may have the same tag on either side of the tilde. For instance, if the given example had run all of long lives, the analysis would be all of [Np~Np {our} long lives], because the phrase as it stands is incomplete in the context but it is a plural noun phrase with or without the necessary determiner. Likewise, in the second real example quoted in this section (above), the annotation would be asked for [Ns~Ns their information {,} that is their ...].

    6.6.2 Repeated or near-repeated words

    Some fairly simple examples are:

    to expose ourselves to almost any media medium to which ... (E54.0000670) - the writer seems to have forgotten to delete media after substituting medium for it.
    ... most radio etc +. +. Some would argue ... (E54.0000950): the writer has both given etc an abbreviatory stop and added a sentence-closing stop, but English only uses a single full stop in this situation.
    and in by doing so ...: one or other preposition is redundant.

    In the Young Adult writing, much of which was originally typed on computer keyboards, there are several cases where a writer has typed the " symbol twice, although a single instance of it represents "double inverted commas".

    In each of these cases the redundant word is given a 0~WT tag, with the identity of "WT" depending on the nature of the word in question. Unless there is some specific reason to do otherwise (in these cases there is not), the word counted as "real" is the last word in the repetitive sequence, and the word or words before it are labelled 0~WT.

    If it is fairly clear which tagma a redundant word belonged to in the writer's mind, the higher structure is arranged accordingly (otherwise, the redundant word is attached as high in the tree as is possible in the surrounding context). For instance, in the third example above it is clear that in, like by, was intended to be a preposition in construction with the following Tg, so it is placed within the Pb~:

    and [Pb~ in_0~II by_IIb [Tg doing so ] ]

    Because the sequence which actually occurred, beginning with two prepositions, is not a correct form of any category, there is nothing after the tilde in the Pb~ label.
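
    Bracketed displays of this kind are easy to turn into nested structures for inspection. The Python sketch below parses the inline notation used in this documentation, in which an opening bracket is attached to its label and closing brackets are unlabelled. It is an illustration only: it is not a reader for the parse fields of the corpus files, whose closing brackets carry labels, and the function name parse_brackets is hypothetical.

```python
from typing import List, Union

Node = Union[str, list]  # a leaf string, or [label, child, child, ...]


def parse_brackets(text: str) -> List[Node]:
    """Parse '[Label leaf leaf [Label leaf ] ]' into nested lists."""
    # Pad closing brackets so that each becomes a separate token; an
    # opening bracket stays attached to its label, as in '[Pb~'.
    tokens = text.replace("]", " ] ").split()
    root: list = []
    stack = [root]
    for tok in tokens:
        if tok.startswith("["):
            node = [tok[1:]]          # first element is the tagma label
            stack[-1].append(node)
            stack.append(node)
        elif tok == "]":
            if len(stack) == 1:
                raise ValueError("closing bracket without opening bracket")
            stack.pop()
        else:
            stack[-1].append(tok)     # a word or word_tag leaf
    if len(stack) != 1:
        raise ValueError("opening bracket without closing bracket")
    return root


tree = parse_brackets("and [Pb~ in_0~II by_IIb [Tg doing so ] ]")
```

    The result places in_0~II and by_IIb directly under the Pb~ node, with the Tg node as their sister; unbalanced input raises ValueError.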

    More complex examples include:

    a 'relational database' of which will allow different tables to ... - of is redundant
    ... that were needed to be checked - were is redundant, since only the checked clause and not the higher clause ought to be passive.

    In the latter case, were needed is tagged Vd~Vwp, and the passive ghost element is Np:s123~Np:S123 - it ought to be the surface subject of the relative clause but it is actually only the logical subject.

    6.6.3 Words taking the place of grammatically-distinct words

    Some examples here are:

    Other requirement ... were the call log ... - requirement should be plural
    a level of which I feel that it will be usable: of should be at
    ... which, if we notice or not, ... (E54.0000260): this case is perhaps debatable but it would seem that if ought to be whether

    A case involving punctuation is (irrelevant wording omitted):

    This ... will then allow `First Computers' to deal with their customers far more efficiently than they have done before. As at the moment the entire process is ...
    - the As ... clause clearly should be a subordinate part of the preceding sentence.

    The if which should be whether is wordtagged CSW~CSi; requirement is NN2~NN1c; the text files include the word actually written, but the notes files mention the word which ought to appear. Where appropriate, tilde tagmatags are used; in the level of which case, of is wordtagged II~IO, and of which is tagmatagged P~Po. In the first example, the actual form Other requirement is not a well-formed phrase of any grammatical category, so it is tagmatagged Np:s~ (it ought to be a plural noun phrase acting as subject, it is not any good category).

    The punctuation example does not require a tilde tag at the clause level. The full stop which ought to be a comma is wordtagged YC~YF, with a comment in the notes file, but Fa:c is the only tagmatag appropriate for the As clause irrespective of orthography (Sampson 1995: 245 prescribes that constituents which grammatically belong within a single sentence should be shown as such in the parsefield even across sentence-final punctuation marks). We do not annotate for the fact that As would have a lower-case a if the punctuation were correct.

    6.6.4 Other structural errors

    This is a residual category for cases where no specific word is missing or redundant, but a skilled writer just would not put it like that.

    An example is:

    if the created database is fully functional
    where normal English would not put created before the noun it modifies in this way - we would have to say the database created or, more likely, the database which has been created. In theory one might use the apparatus defined above to mark this by giving created a zero tag where it does occur, and inserting {created} where it should occur; but that seems over-contrived. It is better merely to record the fact that the internal structure of the noun phrase is wrong. That is done by tagging the noun phrase Ns:s~Ns:s (the phrase ought to be a different phrase even though both are singular subject noun phrases), with a note in the notes file about the nature of the error.

    On the other hand, in the rather similar case:

    ... to repair the problems occurred.
    where again occurred needs to be a full relative clause, because nothing is out of its correct order it is more straightforward to use the curly-bracket system to insert two words {that}{have}, together with a ghost indexed to the antecedent. Then the relative clause {that} {S123} {have} occurred is tagmatagged Fr~Tn.

    6.7 Limiting the section of tree affected by error

    Where one of the above devices is applied in annotating a word, the tagma immediately dominating the word will normally take a tilde tag, X~Y or X~ - if the tagma were well-formed, presumably the word would not need special annotation. But this logic is not applied indefinitely up a tree towards the root: an error marked at one level justifies a tilde tag only at the next level up. An (invented) case would be:

    They told that they were coming.
    in a context where it is clear that the omitted but needed indirect object of told was me. This word will be inserted as {me}, wordtagged PPIO1~0, and as a clause constituent {me} will be given a phrasetag Neo:i~0 over it; but the clause will be labelled simply S, without tilde.

    As a special case, errors relating exclusively to punctuation will be ignored for the tagmatagging, i.e. they will never in themselves justify a tilde tag above the wordtag level.

    6.8 Multiple error annotations

    Sometimes, more than one of these annotations is needed at the same place. Consider the example:

    to check the details if their agreement exists or is still valid ...
    In context this means "details of whether their agreement ..." (it does not mean "check the details, provided that ..."). Here, if is wordtagged CSW~CSi, with a note that the word ought to be whether; the clause begun by if is tagged simply Fn?, since it will be that with either conjunction; {of} is inserted before if, with the clause beginning {of} tagged Po~Fn?.

    7. References

    Handscombe, R.J., ed. 1967a. The Written Language of Nine and Ten-Year Old Children. (Nuffield Foreign Languages Teaching Materials Project, Reports and Occasional Papers, no. 24.) Leeds University.

    Handscombe, R.J., ed. 1967b. The Written Language of Eleven and Twelve-Year Old Children. (Nuffield Foreign Languages Teaching Materials Project, Reports and Occasional Papers, no. 25.) Leeds University.

    Langendoen, D.T. 1997. Review of Sampson (1995). Language 73.600-3.

    Lin, D. 2003. "Dependency-based evaluation of Minipar". In Anne Abeillé, ed., Treebanks: Building and Using Parsed Corpora, Kluwer, pp. 317-29.

    Perera, Katharine. 1984. Children's Writing and Reading: Analysing Classroom Language. Basil Blackwell (Oxford) in association with André Deutsch.

    Sampson, G.R. 1995. English for the Computer: the SUSANNE Corpus and analytic scheme. Clarendon Press (Oxford).

    Sampson, G.R. 2000. CHRISTINE Corpus, Stage I: Documentation. www.grsampson.net/ChrisDoc.html

    Sampson, G.R. 2003. "The structure of children's writing: moving from spoken to adult written norms". In S. Granger and S. Petch-Tyson, eds., Extending the Scope of Corpus-Based Research, Rodopi (Amsterdam), pp. 177-93.
