Published in Jenny Thomas & M.H. Short, eds., Using Corpora for Language Research, Longman, 1996, and reprinted in ch. 2 of Sampson, Empirical Linguistics.

From central embedding to empirical linguistics

Geoffrey Sampson



Now that the empirical approach to linguistic analysis has reasserted itself, it is not easy to recall how idiosyncratic the idea seemed, twenty years ago, that a good way to discover how the English language works is to look at real-life examples.


As a young academic in the 1970s I went along with the then-standard view that users of a language know what is grammatical and what is not, so that language description can and should be based on native-speaker intuition.  It was the structural phenomenon of “central embedding”, as it happens, which eventually showed me how crucial it is to make linguistic theories answerable to objective evidence.  Central embedding (which I shall define in a moment) was a topic that became significant in the context of linguists’ discussions of universal constraints on grammar and innate processing mechanisms.  In this chapter I describe how central embedding converted me into an empirical linguist.


Central embedding refers to grammatical structures in which a constituent occurs medially within a larger instance of the same kind of tagma; an invented example is [The book [the man left] is on the table], where a relative clause occurs medially within a main clause, as indicated by the square brackets.  A single level of central embedding like this is normal enough, but linguists agreed that multiple central embedding – cases where X occurs medially within X which occurs medially within X, for two or more levels – is in some sense not a natural linguistic phenomenon. 


Theorists differed about the precise nature of the structural configuration they regarded as unnatural.  De Roeck et al. (1982) distinguished four variant hypotheses about the unnaturalness of multiple central embedding.  For Variant 1, the unnatural structures are any trees in which a node has a daughter node which is not the first or last daughter and which is nonterminal, and where that node in turn has a nonterminal medial daughter, irrespective of the labels of the nodes; that is, the unnaturalness depends purely on the shape of the tree rather than on the identity of the higher and lower categories.  De Roeck et al. showed that several writers advocated the very strong hypothesis that multiple central embedding in this general sense is unnatural. 


For other linguists, the tree structure had to meet additional conditions before it was seen as unnatural.  Variant 2 is a weaker hypothesis which rules out only cases where the logical category is the same, e.g. S within S within S, or NP within NP within NP.  (S and NP are the standard grammatical symbols for “clause” and “noun phrase”.)  Variant 3 is a weaker hypothesis still, which treats structures as unnatural when the concentric logical categories not only are the same but occur within one another by virtue of the same surface grammatical construction, e.g. relative clause within relative clause within S; and Variant 4 weakens Variant 3 further by specifying that the structure is unnatural only when the hierarchy of tagmas incorporated into one another by the same construction are not interrupted by an instance of the same category introduced by a different construction (e.g. relative clause within nominal clause within relative clause within S would violate Variant 3 but not Variant 4).
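

The difference between Variant 1 and Variant 2 can be made concrete with a small sketch.  The code below is illustrative only and is not taken from De Roeck et al.:  the Node class and the function names are hypothetical, and it assumes a simplified tree in which each constituent of interest is recorded as a direct daughter of the next constituent up, so that "medial" means "neither first nor last daughter".  Variants 3 and 4, which also depend on the grammatical construction involved, are not modelled.

```python
from dataclasses import dataclass, field
from typing import List, Union


@dataclass
class Node:
    """A nonterminal node; plain strings in `children` stand for words."""
    label: str  # category label, e.g. "S" or "NP"
    children: List[Union[str, "Node"]] = field(default_factory=list)


def chain(node: Node, same_label: bool) -> int:
    """Length of the longest chain of medial nonterminal daughters below node.

    With same_label=True the chain continues only through daughters whose
    label equals that of their mother (the extra condition of Variant 2).
    """
    best = 0
    last = len(node.children) - 1
    for i, child in enumerate(node.children):
        if not isinstance(child, Node) or i == 0 or i == last:
            continue  # words, and first or last daughters, are not medial
        if same_label and child.label != node.label:
            continue
        best = max(best, 1 + chain(child, same_label))
    return best


def deepest(node: Node, same_label: bool) -> int:
    """Longest such chain starting at any node in the tree."""
    best = chain(node, same_label)
    for child in node.children:
        if isinstance(child, Node):
            best = max(best, deepest(child, same_label))
    return best


def violates_variant_1(tree: Node) -> bool:
    return deepest(tree, same_label=False) >= 2


def violates_variant_2(tree: Node) -> bool:
    return deepest(tree, same_label=True) >= 2


# The invented example: one level of central embedding only.
single = Node("S", ["The", "book", Node("S", ["the", "man", "left"]),
                    "is", "on", "the", "table"])
# Hypothetical shapes with placeholder words w1..w5.
mixed = Node("S", ["w1", Node("NP", ["w2", Node("S", ["w3"]), "w4"]), "w5"])
same = Node("S", ["w1", Node("S", ["w2", Node("S", ["w3"]), "w4"]), "w5"])
```

On these assumptions the single-level example violates neither variant, the S within NP within S shape violates Variant 1 only, and the S within S within S shape violates both, which is the sense in which Variant 2 is the weaker hypothesis.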


These variant concepts notwithstanding, there was general agreement that multiple central embedding in some sense of the concept does not happen in human languages.  Theorists debated why that should be. 


For generative grammarians, who laid weight on the idea that grammatical rules are recursive, there was a difficulty in accounting for rules which could apparently apply once but could not reapply to their own outputs:  they solved the problem by arguing that multiple central embeddings are perfectly grammatical in themselves, but are rendered ‘unacceptable’ by falling foul of psychological language-processing mechanisms which are independent of the rules of grammar but which, together with the latter, jointly determine what utterances people can produce and understand (Miller and Chomsky 1963: 471).  The relational-network theorist Peter Reich urged that this did not adequately explain the fact that the limitation to a single level of central embedding is as clearcut and rigid a rule as languages possess:  ‘The first thing to note about [multiple central embeddings] is that they don’t exist.  … the number of attested examples of MCEs in English can be counted on the thumbs of one hand’ (Reich and Dell 1977).  (The thumb was needed because Reich and Dell were aware of a single reported instance of multiple central embedding during the many years that linguists had been interested in the topic – but then, what linguistic rules are so rigid as never to be broken even on just one occasion?)  Reich argued that generative grammar ought to make way for a finite-state theory of language within which the permissibility of one and only one level of central embedding is inherent in the model (Reich 1969).


Even William Labov, normally an energetic champion of empirical methods when that was a deeply unfashionable position to take, felt that empirical observation of naturally-produced language was irrelevant to the multiple central embedding issue.  Labov believed that multiple central embeddings are grammatical in every sense, but he saw them as a paradigm case of constructions that are so specific and so complex that instances would be vanishingly rare for purely statistical reasons:  ‘no such sentences had ever been observed in actual use; all we have are our intuitive reactions that they seem grammatical …  We cannot wait for such embedded sentences to be uttered’ (Labov 1973: 101).


Thus, linguists of widely diverse theoretical persuasions all agreed:  if you wanted to understand the status of multiple central embedding in human language, one thing that was not worth doing was looking out for examples.  You would not find any.


Doubt about this was first sown in my mind during a sabbatical I spent in Switzerland in 1980-81.  Giving a seminar to the research group I was working with, I included a discussion of multiple central embedding, during which I retailed what I took to be the standard, uncontroversial position that speakers and writers do not produce multiple central embeddings and, if they did, hearers or readers could not easily interpret them.  In the question period Anne De Roeck asked ‘But don’t you find that sentences that people you know produce are easier to understand?’  Well, perhaps, I responded, but this did not refute the theory because …  – and I got quite a long way through my answer before the expression on Anne’s face alerted me to the fact that the point of her question had been its grammar rather than its semantics.  (The structure of the question, with finite subordinate clauses delimited by square brackets, is But don’t you find [that sentences [that people [you know] produce] are easier to understand]?) 
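

The square-bracket notation lends itself to a mechanical check.  The following is a minimal sketch, not a tool used in the research described here:  the function name embedding_depth is hypothetical, and the sketch assumes that every bracket pair marks a constituent of the same kind, that a bracketed constituent is "medial" when it is neither the first nor the last item inside the constituent containing it, and that bare punctuation can be ignored.

```python
import re

# A token is an opening bracket, a closing bracket, or a run of other characters.
TOKEN = re.compile(r"\[|\]|[^\[\]\s]+")


def _parse(tokens, pos):
    """Parse the contents of one constituent starting at pos.

    Returns (chain, deepest, pos): the longest chain of medial bracketed
    daughters below this constituent, the longest chain found anywhere
    inside it, and the position at which parsing stopped.
    """
    items = []  # None for a word, chain length for a bracketed daughter
    deepest = 0
    while pos < len(tokens) and tokens[pos] != "]":
        if tokens[pos] == "[":
            child_chain, child_deepest, pos = _parse(tokens, pos + 1)
            if pos >= len(tokens):
                raise ValueError("unmatched '['")
            pos += 1  # step over the closing bracket
            items.append(child_chain)
            deepest = max(deepest, child_deepest)
        else:
            items.append(None)
            pos += 1
    # Medial daughters are all daughters except the first and the last.
    chain = max((1 + c for c in items[1:-1] if c is not None), default=0)
    return chain, max(deepest, chain), pos


def embedding_depth(text: str) -> int:
    """Number of levels of central embedding in a bracketed sentence."""
    tokens = [t for t in TOKEN.findall(text)
              if t in ("[", "]") or any(ch.isalnum() for ch in t)]
    _, deepest, pos = _parse(tokens, 0)
    if pos != len(tokens):
        raise ValueError("unmatched ']'")
    return deepest


invented = "[The book [the man left] is on the table]"
question = ("But don’t you find [that sentences [that people [you know] "
            "produce] are easier to understand]?")
```

Under these assumptions embedding_depth(invented) is 1 and embedding_depth(question) is 2:  the outer that clause ends the question and so is not itself medial, but it contains a relative clause medially, which in turn contains another.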


So evidently, if multiple central embeddings were indeed ‘unacceptable’, this did not mean that, if produced, they would necessarily draw attention to themselves by being impenetrable to the hearer.  Perhaps, then, it was worth checking whether they were so completely lacking from natural language production as the doctrine alleged.  I began to monitor my reading, and quite soon encountered a series of examples which our research group assembled into the De Roeck et al. paper already cited.  Many of the examples violated even the weakest Variant 4 of the orthodoxy (all violated at least Variant 2); some of them involved more than two layers of central embedding.


A sceptic might have felt that there was something not fully convincing about that initial collection.  More than half of them came from a single book – it is dangerous to rest conclusions about language in general on the linguistic behaviour of one individual, perhaps idiosyncratic, writer; and most of the remainder were taken from a very serious German-language newspaper, the Neue Zürcher Zeitung:  German is a language with rigid and unusually complex word-order rules, and when used in highly formal written registers is arguably a more artificial category of linguistic behaviour than most.  But, on returning from Switzerland, I went on looking for multiple central embeddings, and I began to find examples in very diverse linguistic contexts.


Thus, one could hardly find a newspaper more different in style from the Neue Zürcher Zeitung than the British News of the World:  this is a mass-market Sunday paper beloved for its titillating exposures of the seamier side of life, and those responsible for its contents would, I believe, feel that they were failing if they allowed a highbrow or intellectual flavour to creep into its pages.  But the LOB Corpus (the earliest electronic corpus of British English, which had been completed in 1978) contains an extract from a story ‘Let’s give the Welfare State a shot in the arm’, by Kenneth Barrett, which appeared in the 5.2.1961 edition of that newspaper, and which includes the following sentence:


[And yet a widow, [whose pension, [for which her husband paid], is wiped out because she works for a living wage], will now have to pay 12s. 6d. for each lens in her spectacles, and 17s. 8d. for the frames].


This is a case of wh- relative clause within wh- relative clause within S, violating Variant 4.


Even if popular writing for adults contains these constructions, it may be thought that writing for children will not.  At this time my own children were aged seven and five, and their favourite books by a large margin for paternal bedtime reading were the series of boating adventure stories by Arthur Ransome.  The following sentence occurs in Ransome’s Swallowdale (Jonathan Cape, 1931, pp. 113-14):


[But Captain Flint laid to his oars and set so fast a stroke that John, [who, [whatever else he did], was not going to let himself get out of time], had enough to do without worrying about what was still to come].


S within S within S:  violates Variant 2.  (Indeed, the clause beginning whatever is quite similar in structure to a relative clause; if the two kinds of clause were regarded as varieties of a single construction, the sentence would violate Variant 4.  But probably almost any grammarian would treat the two clause-types as separate.)  For what it is worth, my daughters showed no observable sign of difficulty in understanding this sentence, though similar experiences in the past had resigned them to their father’s temporary unwillingness to proceed with the story while he scrutinized it.


Still, while published writing addressed to unsophisticated readers apparently does contain multiple central embeddings, one could nevertheless argue that they are not likely to be found in writing produced by people unskilled with language.  But the following sentence occurred in an essay-assignment written in February 1983 by S.S., a first-year Lancaster University undergraduate student of (at best) moderate ability:


All in all it would seem [that, [although it can not really be proved [that the language influences the script in the beginning at its invention], simply because we seldom have any information about this time in a scripts history], the spoken language does effect the ready formed script and adapts it to suit its needs].


Subordinate clause within subordinate clause within subordinate clause:  violates Variant 2.  (Here I assume that when a nominal clause is introduced by that, this word is part of the clause it introduces; this is the consensus view among linguists, but even if it were rejected the example would still be a case of S within S within S – the outer S would then be the main clause, i.e. the first opening square bracket would be repositioned at the beginning of the quotation.)  The various solecisms in the passage (can not for cannot, scripts for script’s, effect for affect, ready for already) were characteristic of the student’s writing.


Conversely, a true believer in the unnaturalness of multiple central embeddings might suggest that while laymen may sometimes produce them, professional linguists, who may be expected to be unusually sensitive to grammatically objectionable structures, would avoid them.  (For the idea that professional linguists may be in some sense more competent in their mother tongue than native speakers who are not linguists, see e.g. Snow and Meijer 1977.)  However, in a review of a book edited by W. S.-Y. Wang, the eminent linguist E.G. Pulleyblank wrote in the Journal of Chinese Linguistics (vol. 10, 1982, p. 410):


[The only thing [that the words [that can lose -d] have in common] is, apparently, that they are all quite common words].


That relative clause within that relative clause within S:  violates Variant 4.


Again, a defender of the orthodox line might suppose that, while the pressures of journalism, academic publication, and the like allow a certain number of ‘unnatural’ constructions to slip through, at least when a short text is composed and inscribed with special care there should be no multiple central embeddings.  What would be the clearest possible test of this?  A ceremonial inscription ornamentally incised on marble seems hard to beat. 


Visiting Pisa for the 1983 Inaugural Meeting of the European Chapter of the Association for Computational Linguistics, I noticed a tablet fixed to one wall of the remains of the Roman baths.  Apart from the conventional heading ‘D O M’ (Deo Optimo Maximo), and a statement at the foot of the names of the six persons referred to, the inscription on the tablet consists wholly of the following single sentence, in a language where word order is much freer than in German or even English, so that undesired structural configurations could easily be avoided.  I quote the inscription using upper and lower case to stand for the large and small capitals of the original, and replace its archaic Roman numerals with modern equivalents:


[{Sex uiri, [qui {Parthenonem, [ubi {parentibus orbæ uirgines} aluntur, et educantur], [{qui} {uulgo} {charitatis Domus} appelatur]}, moderantur, eiusque rem administrant]}, quum at suum ius ditionemque pertineat hic locus, in quo Sudatorium Thermarum Pisanarum tot Seculis, tot casibus mansit inuictum, et officii sui minime negligentes, et Magni Ducis iussis obtemperantes, et antiquitatis reuerentia moti reliquias tam uetusti, tam insignis ædificii omni ope, et cura tuendas, et conseruendas censuerunt An: Sal: MDCXCIII].


[Since this place, where the sudatorium of the Pisan Baths has remained unconquered by so many centuries and so many happenings, comes under their jurisdiction, {the six men [who govern and administer {the Parthenon, [where {orphaned girls} are brought up and educated], [{which} is known by {the common people} as {the House of Charity}]}]}, being diligent in the performance of their duty, obedient to the commands of the Grand Duke, and moved by reverence for antiquity, ordered every effort to be used carefully to protect and to conserve the remains of this building of such age and distinction, in the Year of Grace 1693].


Here, curly brackets delimit NPs, and square brackets delimit clauses.  Thus we have four NPs within NPs within NP, a quadruple violation of either Variant 2 or Variant 4 (depending on the definition of identity of grammatical constructions); at the same time, with respect to clauses, we have two relative clauses within relative clause within sentence, a double violation of Variant 4.  (The fact that one of the innermost relative clauses and the relative clause containing it are both compound might be thought to make the constructions even more similar and accordingly more ‘unnatural’ than they would otherwise be.)  The central embeddings occur at the beginning of the original text, making it wholly implausible that they were produced through careless oversight even if such carelessness were likely in inscriptions of this sort.


At this period I did not record any examples with more than two levels of central embedding, so that a defender of the orthodox view might conceivably try to rescue it by arguing that the boundary between permissible and impermissible degrees of central embedding lies not between one and two levels but between two and three levels.  This would be a gross weakening of the standard claim, which asserts not only that there is a fixed boundary but that it occurs between one and two levels (cf. Reich (1969), and his quotation from Marks (1968)).  If such a strategy were adopted, then at least the first and probably also the second of the following new examples would be relevant:


[Laughland’s assertion that [the presence of [Delors – [14 years] old when [the war] began – ] in the Compagnons de France, the Vichy youth movement,] meant that he supported fascism] is ridiculous. 

            (Charles Grant, letter to the Editor, The Spectator, 12.11.1994, p. 35.)


The phrases 14 years and the war are both cases of NP within NP within NP within NP, a double violation of the two-level limit (incidentally, two paragraphs later the same letter contains a case of S within S within S).


[Your report today [that any Tory constituency party [failing [to deselect its MP], should he not vote in accordance with a prime ministerial diktat,] might itself be disbanded], shows with certainty that Lord Hailsham’s prediction of an “elective dictatorship” is now with us]. 

            (Vice-Admiral Sir Louis Le Bailly, letter to the Editor, The Times, 25.11.1994, p. 21.)


Infinitival clause within present-participle clause within that nominal clause within S:  again a violation of the two-level limit, except that if the should clause were alternatively regarded as subordinate to deselect rather than to failing then the structure would violate only the one-level limit.


All in all, it seemed clear that no matter what kind of language one looks at, multiple central embeddings do occur.  The above examples include no case from speech; that is regrettable but not surprising, first because spoken language is structurally so much less ramified than writing that any kind of multiple embedding, central or not, is relatively unusual there, and equally importantly because it is difficult to monitor such cases in speech (when I happen on a written multiple central embedding, I always have to re-read it slowly and carefully to check that it is indeed one, and with speech this is not possible).  Nevertheless, De Roeck et al. did record one case from spoken English, which happened to have been transcribed into print because it was uttered by a prime minister in the House of Commons; and the example requiring the single thumb in Reich and Dell (1977) occurred in extempore spoken English.  Here is a third case I recently encountered, from the late Richard Feynman’s Nobel Prize address given in Stockholm on 11.12.1965 (quoted from J. Gleick, Genius, Abacus Books, 1994, p. 382):


[The odds [that your theory will be in fact right, and that the general thing [that everybody’s working on] will be wrong,] is low].


That relative clause within that nominal clause within S:  violates Variant 2.  While this speech was obviously not a case of extempore chat, the quotation does contain several features characteristic of spoken rather than written language (everybody’s for everybody is; colloquial use of thing; failure of agreement between odds and is).  In any case, Reich and Dell’s footnote 2 makes it clear that their belief in the unnaturalness of multiple central embedding applies to writing as well as to speech.


Incidentally, the difficulty of identifying multiple central embeddings on first reading offers a further argument against the claim that they are ‘unnatural’.  During fluent reading for normal purposes I register no reaction more specific than ‘clumsy structure here’, and passages which provoke this reaction often turn out to include only structures that linguists do not normally claim to be ‘unnatural’ or ungrammatical, e.g. non-central embeddings.  If the orthodox view of multiple central embedding were correct, one would surely predict that these structures should ‘feel’ much more different from other structures than they do.


The examples listed above were not the only cases of multiple central embedding I encountered in the months after I returned from Switzerland; they are ones I copied down because they seemed particularly noteworthy for one reason or another.  More recently I tried to achieve a very rough estimate of how frequent these structures are, by systematically taking a note of each case I encountered over a period. 


This project was triggered by the experience of spotting two cases in quick succession; but the following list includes only the second of these, which I read on 4 October 1993, because having decided to make a collection I did not manage to locate the earlier case in the pile of newsprint waiting to go to the dustbin.  Thus the list represents the multiple central embeddings noticed during a random period starting with one such observation and continuing for a calendar month (I made the decision to stop collecting on 4 November 1993 – both start and stop decisions were made in the middle of the day rather than on rising or retiring, though there may have been a few hours’ overlap).  In view of my failure to register Anne De Roeck’s trick question, discussed above, there could have been further cases in my reading during this month which escaped my attention.


A greater issue of principle there could not be than the transfer of self-government away from the British electorate to the European Community; [but, [though Tony Wedgwood Benn thought that “[if only Harold would look and sound a bit more convincing (on that subject)], we might have a good chance”], Wilson not only did not do so but his tactics on taking office steered his party, his government, Parliament and the electorate into a referendum of which the result is only now in course of being reversed]. 

            (J. Enoch Powell, review of P. Ziegler, Wilson, p.35 of The Times of 4.10.1993, read 4.10.1993; the brackets surrounding on that subject were square in the original, and are replaced by round brackets here to avoid confusion with the square brackets of my grammatical annotation.)


Adverbial clause within adverbial clause within S:  violates Variant 4.


Harris and I would go down in the morning, and take the boat up to Chertsey, [and George, [who would not be able to get away from the City till the afternoon (George goes to sleep at a bank from ten to four each day, except Saturdays, when they wake him up [and put him outside] at two)], would meet us there]. 

            (Jerome K. Jerome, Three Men in a Boat, 1889, p. 17 of Penguin edition, 1957, read 7.10.1993.)


Reduced relative clause within relative clause within S (the relative clause beginning when they wake … would make a fourth level, but this clause is right-embedded in the who clause):  violates at least Variant 2 and perhaps Variant 3, depending on the definition of similarity between grammatical constructions.


[When the pain, [which nobody [who has not experienced it] can imagine], finally arrives], they can be taken aback by its severity. 

            (Leader, p. 17 of The Times of 16.10.1993, read 16.10.1993.)


Wh- relative clause within wh- relative clause within adverbial clause:  violates Variant 4.


[That the perimeters of [what men can wear and [what they cannot], what is acceptable and what is not,] have become so narrow] goes to show how intolerant our society has become.

            (Iain R. Webb, ‘Begging to differ’, p. 70 of the Times Magazine of 9.10.1993, read 21.10.1993.)


Reduced antecedentless relative clause within compound antecedentless relative clause within nominal clause:  violates at least Variant 2 and perhaps Variant 4, depending on the definition of similarity between grammatical constructions.


[For the remainder of his long and industrious life (apart from during the second world war [when he worked in the Ministry of Information – [where he was banished to Belfast for being “lazy and unenthusiastic”] – and the Auxiliary Fire Service]) Quennell made his living as an author, a biographer, an essayist, a book-reviewer, and as an editor of literary and historical journals]. 

            (Obituary of Sir Peter Quennell, The Times of 29.10.1993, read 29.10.1993.)


Adverbial relative clause within adverbial relative clause within S:  violates Variant 4.


[In the 18th century, [when, [as Linda Colley shows in her book Britons], the British national identity was forged in war and conflict with France], our kings were Germans]. 

            (Timothy Garton Ash, ‘Time for fraternisation’, p. 9 of the Spectator of 30.10.1993, read 29.10.1993.)


As clause within adverbial relative clause within S:  violates Variant 2.


[The cases of Dr Starkie, the pathologist whose procedures in the diagnosis of bone cancer is now being questioned, and Dr Ashok Kumar, [whose nurse, having been taught by him, used the wrong spatula, [which must have been provided by the practice], to obtain cells for cervical smears], are very different]. 

            (Thomas Stuttaford, ‘Patients before colleagues’, The Times of 10.9.1993, read 31.10.1993; the agreement failure (procedures … is) occurs in the source.)


Wh- relative clause within wh- relative clause within S:  violates Variant 4 (and the having been clause constitutes a separate violation of Variant 2).


To assess what rate of occurrence of multiple central embeddings these examples imply requires an estimate of my overall rate of reading, which is very difficult to achieve with any accuracy.  On the basis of word-counts of typical publications read, I believe my average daily intake at this period was perhaps 50,000 and certainly not more than 100,000 written words, so that the seven multiple central embeddings quoted above would imply a frequency of perhaps one in a quarter-million words (more, if we suppose that I missed some), and at least one in a half-million words.  Some time soon, it should be possible for language-analysis software automatically to locate each instance of a specified construction in a machine-readable corpus, and we shall be able to give relatively exact figures on the frequency of the construction.  For a construction as complex as multiple central embedding we are not yet at that point; but on these figures there is no reason to suppose that the single example quoted earlier from the LOB Corpus is the only example contained in it.
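

The arithmetic behind these rough figures can be set out explicitly.  The sketch below simply restates the estimate, taking the collection period of 4 October to 4 November 1993 as 31 days; the function and variable names are hypothetical, and the daily word counts are the approximate guesses given above, not measurements.

```python
def words_per_case(words_per_day: int, days: int, cases: int) -> float:
    """Average number of words read per multiple central embedding noticed."""
    return words_per_day * days / cases


DAYS = 31   # 4 October to 4 November 1993, treated as 31 days
CASES = 7   # examples collected during that period

# Daily intake of perhaps 50,000 words: roughly one case per quarter-million.
likely = words_per_case(50_000, DAYS, CASES)
# Daily intake of at most 100,000 words: roughly one case per half-million.
upper = words_per_case(100_000, DAYS, CASES)

print(f"about one case per {likely:,.0f} words (likely intake)")
print(f"about one case per {upper:,.0f} words (maximum intake)")
```

With these inputs the first figure comes to a little over 220,000 words per case and the second to a little over 440,000, which is where the rounded "quarter-million" and "half-million" come from.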


The conclusion is unavoidable.  Multiple central embedding is a phenomenon which the discipline of linguistics was united in describing as absent from the real-life use of language; theorists differed only in the explanations they gave for this interesting absence.  Yet, if one checks, it is not absent. 


I do not go so far as to deny that there is any tendency to avoid multiple central embedding; I am not sure whether there is such a tendency or not.  Independently of the issue of central embedding, we have known since Yngve (1960) that the English language has a strong propensity to exploit right-branching and to avoid left-branching grammatical structures – this propensity alone (which is examined in detail in Chapter 4 below) would to some extent reduce the incidence of central embedding.  Whether, as for instance Sir John Lyons continues to believe (Lyons 1991: 116), multiple central embeddings are significantly less frequent than they would be in any case as a by-product of the more general preference of the language for right branching is a question whose answer seems to me far from obvious.  But it is clearly a question that must be answered empirically, not by consulting speakers’ ‘intuitions’.


Hence, as I picked up the threads of my working life at home after the Swiss sabbatical, I knew that for me it was time for a change of intellectual direction.  If intuitions shared by the leaders of the discipline could get the facts of language as wrong as this, it was imperative to find some way of engaging with the concrete empirical realities of language, without getting so bogged down in innumerable details that no analytical conclusions could ever be drawn.


Happily, easy access to computers and computerized language corpora had arrived just in time to solve this problem.  I seized the new opportunities with enthusiasm.


Naturally, the discipline as a whole was not converted overnight.  As late as 1988, reviewing a book edited by Roger Garside, Geoffrey Leech, and me about research based on LOB Corpus data, Michael Lesk (nowadays Director, Information Technology, at the U.S. National Science Foundation) found himself asking (Lesk 1988):


Why is it so remarkable to have a book whose analysis of language is entirely based on actual writing? … It is a great relief to read a book like this, which is based on real texts rather than upon the imaginary language, sharing a few word forms with English, that is studied at MIT and some other research institutes … a testimony to the superiority of experience over fantasy. 


However, one by one, other linguists came to see the virtues of the empirical approach.  What Michael Lesk found remarkable in 1988 has, ten years later, become the usual thing; and this is as it should be.






