Geoffrey Sampson


The APRIL Parsing System

The need for parsing

Computers are machines for processing information. Information has to be exchanged and stored in some medium, and for human beings overwhelmingly the preferred medium is natural language — English, German, Chinese, ... In order to enable computers to handle information, up to now we have had to avoid the complexities of natural language and regiment the information into artificial, simple formats. But in the longer term that can hardly be the right approach. Computers are made for Man, not Man for the computer: they need to adapt to communicate in ways that we find convenient, not the other way round. Ultimately the goal must be computing systems that can cope with written and spoken English and other languages as readily as we do ourselves.

Possibly the difference between machines and people means that that goal will never be wholly attainable. But since about the 1980s we have been in the business of finding out how far we can move in that direction.

Almost all practical natural-language computing applications turn out to depend crucially on automatic parsing: that is, on enabling the machine to work out the grammatical structures underlying linear sequences of words. For some applications, for instance machine translation between natural languages, or intelligent front-ends to databases which allow users to extract information from computer systems via questions couched in their ordinary everyday language, it is easy to see that parsing is needed to establish the logic and sense of a source-language text or a user’s question, in order to allow the machine to synthesize an equivalent target-language translation or decide what items of information constitute an answer to the question.

But parsing is needed also for applications where at first blush it might seem irrelevant. For instance, automatic announcement systems, which convert texts in written form into synthesized speech played through a workstation speaker or over a public-address system, can get the vowel and consonant sounds right by looking successive words up in an electronic pronouncing dictionary; but they cannot produce appropriate patterns of pitch and loudness variation (intonation) except by analysing the grammar of input texts — and Dalek-like synthetic speech lacking appropriate intonation patterns is really impossible for human hearers to tolerate for more than a short while. The writer K.K. Obermeier has described parsing as “The central problem” in nearly all natural-language processing applications.

Deterministic and probabilistic approaches

Until fairly recently, almost all work on automatic parsing treated the task as essentially similar to “compiling” programs in a formal computer language. Parsing was based on rules defining “all and only” the valid grammatical structures in a language; faced with a particular input string, the task was to find the structural analysis by virtue of which it is a valid string.

The trouble with this approach is that human language is quite messy and anarchic, by comparison with languages like Pascal or C. Programming languages are designed to conform to rigid (and fairly simple) rules. Programs that break the rules are rejected by the computer. But it isn’t clear that a language like English is rule-governed in the same rigid way. If there is a definite set of rules specifying all and only the valid sentences of English, the rules are certainly much more complicated than those of C. But, to many of us, it seems that complexity is not the whole point — it is questionable whether English is fully definable by rules at all, and such rules as there are will often be broken with no bad consequences for communication.

A group of us around Geoffrey Leech and Roger Garside at Lancaster in the early 1980s felt that the way to deal with this must involve moving away from the “compilation metaphor”. We needed to develop probabilistic techniques, which looked for analyses of linguistic inputs that maximized some numerical measure of conformity to the patterns found in samples of accurately-analysed language. That way, it shouldn’t matter too much whether a piece of English is “perfectly grammatical” or not. Faced with a messy input, instead of just falling over, a probabilistic system would offer the best analysis available for that input. This felt much more like the way we humans operate with language.

We got some way with this at Lancaster. But there was a pervasive problem. If parsing is interpreted not as finding the unique valid analysis of an input but as finding the numerically optimal analysis, then one needs to have some way to seek this analysis out within a huge search-space of alternative analyses. The output of a parser is normally assumed to be a labelled tree with the input words attached to the leaf nodes (or, equivalently, a bracketing of the input string with labels on the brackets). The target parse of a simple sentence, say The cat sat on the mat., might look like this:

[S [Ns:s The cat] [Vd sat] [P:p on [Ns the mat]] . ]
— where “Ns:s” means that The cat forms a singular noun phrase functioning as subject of the sentence, and so forth. An optimizing parser would need to allow any tree over an input string, with any distribution of recognized labels over the nonterminal nodes, as a conceivable parse — the search space would include “ridiculous” solutions such as (for the above example):
[S [P:p [P:p [P:p [P:p [P:p The cat] sat] on] the] mat] . ]
— where all the words group to the left, and every constituent is labelled “prepositional phrase of Place”. Solutions like this would be entertained, and rejected simply because their numerical measure of plausibility would be very poor.
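To make this concrete, here is a minimal sketch (in Python; it does not reflect APRIL’s actual data structures) representing such labelled analyses as nested tuples, with a routine that renders them in the bracketed notation used above:

    # A labelled parse is represented as (label, daughters...), with
    # words as plain strings at the leaves.
    good = ("S",
            ("Ns:s", "The", "cat"),
            ("Vd", "sat"),
            ("P:p", "on", ("Ns", "the", "mat")),
            ".")

    ridiculous = ("S",
                  ("P:p", ("P:p", ("P:p", ("P:p", ("P:p", "The", "cat"),
                                           "sat"), "on"), "the"), "mat"),
                  ".")

    def bracketing(node):
        """Render a tree in the labelled-bracket notation used in the text."""
        if isinstance(node, str):
            return node
        label, *daughters = node
        return "[" + label + " " + " ".join(bracketing(d) for d in daughters) + "]"

    print(bracketing(good))        # the target parse shown above
    print(bracketing(ridiculous))  # the left-grouping "ridiculous" parse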

But this means that, even for a short sentence and a small alphabet of recognized labels for nonterminal nodes, the number of solutions will be astronomically large; and the type of numerical measures that seem appropriate don’t offer any easy way to locate the best solution without grinding through all the alternatives one by one — even with a fast computer, there are far too many for this to be practical.
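The scale of the problem is easy to demonstrate with a worked figure (a sketch only; the label-set size of twenty is an invented round number). Even counting only binary-branching tree shapes, an n-word string admits a Catalan number of shapes, and each of its n - 1 internal nodes can carry any of the recognized labels:

    from math import comb

    def catalan(n):
        # Number of distinct binary-branching tree shapes over n + 1 leaves.
        return comb(2 * n, n) // (n + 1)

    def binary_parse_count(n_words, n_labels):
        # Each of the n_words - 1 internal nodes may carry any label;
        # allowing non-binary branching only enlarges the space further.
        return catalan(n_words - 1) * n_labels ** (n_words - 1)

    print(binary_parse_count(7, 20))    # 8,448,000,000 for a 7-token string
    print(binary_parse_count(20, 20))   # about 9 x 10^33 for a 20-word sentence

For a seven-token string this already gives several thousand million candidates, and for a twenty-word sentence around 10^34: hence the need for a search technique that does not enumerate them.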

Parsing by simulated annealing

In 1985 (the year I left Lancaster for the Chair of Linguistics at Leeds) I formulated the idea of solving this problem using the computing technique of optimization by simulated annealing, which had been invented a few years earlier independently by researchers in the USA and in Czechoslovakia. Optimization by simulated annealing is one of a family of stochastic optimization techniques, reminiscent of Darwinian evolution in biology. Rather than using a clever recipe for going directly to the correct answer in some domain, you shake up the possibilities randomly, and apply a selection pressure which tends to preserve good solution-elements that happen to emerge by chance.
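In outline, an annealing optimizer looks like this (a generic sketch of the technique, not APRIL’s own code): the neighbour function proposes a random local change to the current solution, the merit function is the numerical measure being maximized, and worse moves are accepted with a probability that shrinks as a notional temperature falls, so that the search can escape local optima early on but settles down later:

    import math
    import random

    def anneal(initial, neighbour, merit, t_start=10.0, cooling=0.9995, steps=20000):
        current, current_merit = initial, merit(initial)
        t = t_start
        for _ in range(steps):
            candidate = neighbour(current)
            candidate_merit = merit(candidate)
            delta = candidate_merit - current_merit
            # Accept improvements always; accept worsenings with a
            # probability that falls as the temperature drops.
            if delta >= 0 or random.random() < math.exp(delta / t):
                current, current_merit = candidate, candidate_merit
            t *= cooling  # the cooling schedule is the part needing careful tuning
        return current

For parsing, the neighbour function would make a random local rearrangement of the current tree (moving a constituent boundary, or relabelling a node), and the merit function would be the language-model score discussed below.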

At the beginning of 1986 I programmed a very simple annealing parser on one of the tiny home microcomputers that we used in those days. As luck would have it, the very first run of this system (admittedly using an extremely limited range of grammatical labels, and on an input string only slightly less artificially simple than the example above) gave an output that was exactly right. So after that there was no stopping me.

I secured a series of research grants — at first from the Royal Signals & Radar Establishment (now part of QinetiQ) and, since my move to the University of Sussex in 1991, from the joint Science & Engineering Research Council/Ministry of Defence scheme — to implement parsing by simulated annealing as a more serious technology than was possible on a BBC Micro with 32 kilobytes of RAM. I based this work on corpora of real-life language (chiefly the LOB and Brown Corpora of British and American English). The research grants paid for the development (using these corpus resources) of successive versions of a software system we call APRIL — “Annealing Parser for Realistic Input Language”.

The full-time researchers who deserve most of the credit for various versions of APRIL were (at Leeds) Robin Haigh, and (at Sussex) Miles Dennis and Alan Wallington. (Other researchers have contributed for shorter periods, and by now there are academics elsewhere conducting their own experiments in parsing by simulated annealing, independently of me.)

In the early years, we simplified the overall task by assuming that the classification of words into parts of speech had already been done: that is, early versions of APRIL accepted input in the shape of sequences of wordtags — codes for categories such as “singular common noun”, “past tense of verb”, “comma” — rather than raw character-strings. Also, we used a rather crude alphabet of nonterminal categories. For instance, rather than requiring the system to identify The cat in the above example as a singular noun phrase functioning as subject, we just asked for it to be identified as a noun phrase.

Thanks to these simplifications, we quite quickly achieved a system whose performance, though far from perfect, was good enough to encourage us to continue (see our paper “Natural language analysis by stochastic optimization”). Outputs were evaluated by the word-ancestor assessment technique (see “A proposal for improving the measurement of parse accuracy”), and received scores averaging in the mid to high eighties per cent.
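The word-ancestor idea can be conveyed in a few lines, using the tuple representation of trees sketched earlier (a sketch only; the scoring detail below, a normalized edit-distance over label chains, is illustrative rather than the exact published formula). Each word is scored by comparing its lineage, the chain of node labels from the root down to that word, in the candidate parse against the correct parse:

    def lineages(tree, path=()):
        """Yield, for each word in left-to-right order, the chain of node
        labels from the root down to that word; trees are
        (label, daughters...) tuples with words as strings at the leaves."""
        label, *daughters = tree
        for d in daughters:
            if isinstance(d, str):
                yield path + (label,)
            else:
                yield from lineages(d, path + (label,))

    def similarity(a, b):
        """1 minus normalized Levenshtein distance between label sequences."""
        m, n = len(a), len(b)
        d = [[max(i, j) if i * j == 0 else 0 for j in range(n + 1)]
             for i in range(m + 1)]
        for i in range(1, m + 1):
            for j in range(1, n + 1):
                d[i][j] = min(d[i - 1][j] + 1, d[i][j - 1] + 1,
                              d[i - 1][j - 1] + (a[i - 1] != b[j - 1]))
        return 1 - d[m][n] / max(m, n)

    def word_ancestor_score(candidate, gold):
        """Average per-word lineage similarity; both parses cover the same words."""
        cand, ref = list(lineages(candidate)), list(lineages(gold))
        return sum(similarity(a, b) for a, b in zip(cand, ref)) / len(ref)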

Most of the work was executed on conventional sequential machines, but during a sabbatical term working with Prof. David Bounds’s Research Initiative in Pattern Recognition I used a transputer array to experiment with a parallel tree-optimizing algorithm. Parallel optimization of tree-shaped data structures is a task with potential applications in many areas apart from natural-language parsing, and again I obtained pleasing though very preliminary results.

Manual v. automatic generation of language models

However, in trying to move beyond our initial level of success, we hit a barrier. An optimizing parser is only as good as its language model — the function which evaluates a candidate analysis for an input and assigns it a figure of merit. For the parser to work adequately, it needs to be the case that when an input language-sample is grammatically well-formed, the value assigned by the language model to the correct analysis is superior to the value assigned to any alternative analysis (most of the time, at least — it is too much to expect that a language model could always get the answer right). The APRIL language model is based on a set of probabilistic transition networks, and in the APRIL versions developed at Leeds these networks were hand-crafted, making heavy use of the human analyst’s (Robin Haigh’s) judgment.
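The flavour of such a language model can be suggested in miniature (the network below is invented for illustration; APRIL’s real networks were far larger and more subtle). Each node of a candidate tree is scored by treating its sequence of daughter labels as transitions through a network associated with the mother label, and the figure of merit for the whole tree is the sum of the log-probabilities:

    import math

    # One invented network: transition probabilities over daughter labels
    # under an S mother.  APRIL kept one such network per mother category.
    networks = {
        "S": {("START", "Ns:s"): 0.6, ("Ns:s", "Vd"): 0.7,
              ("Vd", "P:p"): 0.4, ("P:p", "."): 0.8, (".", "END"): 0.9},
    }

    def node_score(mother, daughter_labels):
        """Log-probability of a daughter sequence under the mother's network."""
        net = networks.get(mother, {})
        score, state = 0.0, "START"
        for label in list(daughter_labels) + ["END"]:
            score += math.log(net.get((state, label), 1e-6))  # floor for unseen moves
            state = label
        return score

    def tree_merit(tree):
        """Figure of merit for a whole analysis, using the tuple trees
        sketched earlier and treating leaves as wordtags."""
        mother, *daughters = tree
        labels = [d if isinstance(d, str) else d[0] for d in daughters]
        return node_score(mother, labels) + sum(
            tree_merit(d) for d in daughters if not isinstance(d, str))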

This turned out to be such a demanding task that it proved all but impossible to upgrade the language model from one that worked so-so to one that worked significantly better.

Consequently, when I secured funding to resume this research after moving to Sussex, it seemed important to get away from hand-crafted language models and adopt a system which creates its own probabilistic language-model automatically from relatively raw language data.

The APRIL3 project

My most recent “APRIL3” project in this area was completed in early 1996. It built an annealing-parser software system with the following features:

Raw input

The APRIL3 system does the whole task of moving from English text in the shape of a sequence of characters as normally typed at a keyboard, to output in the form of labelled tree structures. The first stage of its analysis involves segmenting a character string into words and choosing candidate wordtags for the successive words, using an electronic dictionary (derived, with the kind permission of Oxford University Press, from the Oxford Advanced Learner’s Dictionary of Current English, and including probabilities for alternative tags for grammatically-ambiguous words), together with rules for generating tags for character-strings not found in the dictionary.
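A toy version of that first stage might look like this (the entries and probabilities are invented, though the tag codes follow the general style of the LOB tagset; the real dictionary derived from the Oxford Advanced Learner’s Dictionary was of course vastly larger):

    # Candidate wordtags with probabilities, keyed by word form.
    lexicon = {
        "can": {"VM": 0.80, "NN1": 0.15, "VV0": 0.05},  # modal v. noun v. verb
        "the": {"AT": 1.0},
        "sat": {"VVD": 1.0},
    }

    def candidate_tags(word):
        """Return candidate tags from the dictionary, or fall back on
        simple guessing rules for strings the dictionary lacks."""
        entry = lexicon.get(word.lower())
        if entry:
            return entry
        if word[:1].isupper():
            return {"NP1": 0.9, "NN1": 0.1}   # capitalized: likely proper noun
        if word.endswith("ly"):
            return {"RR": 0.9, "JJ": 0.1}     # -ly: likely adverb
        return {"NN1": 0.6, "JJ": 0.2, "VV0": 0.2}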

Rich public output standards

The information in the analyses output by APRIL3 goes further than previous APRIL versions towards a “complete” account of the grammar of a language sample. Moreover, whereas earlier versions expressed their analyses in a notation which published documents defined only sketchily (so that in debatable cases it was not really possible for outsiders to check our assessment of system performance), the target parsing scheme for APRIL3 is publicly defined in great detail (we use the SUSANNE scheme): when APRIL3 makes a mistake, we have no possibility of glossing over the fact that it is a mistake. APRIL3 does not attempt to identify all grammatical features defined in the SUSANNE scheme (some of the subcategories of the formal tagma categories are omitted, and the system does not reconstruct “ghost” elements representing the logical location of constituents moved into different clauses at the surface level); but it does, for instance, identify functional roles, such as subject and indirect object, as well as formal categories, such as noun phrase and relative clause.

Self-generated language model

The APRIL3 system uses stochastic optimization to derive a language model automatically from a database of parsed samples of English (in addition to using stochastic optimization to seek the best analyses of particular inputs with respect to its language model). Our experience of the task of automatically optimizing language models is described in the article “Stochastic optimization of a probabilistic language model”.
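In the crudest outline (the real procedure is the subject of the article just cited), the same annealing loop sketched earlier can be applied one level up, to the model itself: propose a random change to the model’s parameters, measure how well the changed model ranks the treebank’s correct analyses above rival analyses, and keep or revert the change accordingly. All the names below are illustrative placeholders:

    import math
    import random

    def optimize_model(model, distort, rank_quality, t=5.0, cooling=0.9995, steps=10000):
        """Sketch: distort makes a random in-place parameter change and
        returns an undo function; rank_quality measures how often the
        model scores each correct treebank parse above rival analyses."""
        quality = rank_quality(model)
        for _ in range(steps):
            undo = distort(model)
            new_quality = rank_quality(model)
            if new_quality >= quality or random.random() < math.exp((new_quality - quality) / t):
                quality = new_quality   # keep the change
            else:
                undo()                  # revert it
            t *= cooling                # cooling schedule, as before
        return model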

Lattice-based word segmentation

When the input to a parser consists of raw strings of characters, rather than language analysed into words, the first job the parser has to do is to segment the strings into words. (Recent computational linguistics has used the term tokenizing for this process; but that usage seems to represent a misunderstanding of the logical type/token distinction, and I avoid it.)

Sometimes this job is trivial (words are delimited by spaces), but in other cases the boundaries are prima facie unclear. In the sequence 2 in. Phillips screw, on the basis of the characters alone one cannot tell whether the full stop is an abbreviation marker (i.e. in. as a whole is a form of the word inch) or is a sentence-closing punctuation mark, i.e. a separate “word”. And even when the boundaries of a word are clear, it may often represent alternative lexical units (e.g. can as modal verb v. can as tin can). Consequently, the front end of APRIL3 uses its dictionary and other information to build a lattice of word-hypotheses from the input character string. Each path through the lattice represents one way of segmenting the string into a sequence of particular lexical units. The subsequent parse-optimization process not only seeks a plausible labelled tree over a given sequence of lexical units but simultaneously seeks a plausible path through the lattice. Higher-level parsing considerations influence choice of lattice paths.
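The lattice can be pictured quite simply (the data below are invented around the example just given): each edge spans a stretch of the character string and hypothesizes one lexical unit, and each start-to-end path is one candidate segmentation:

    text = "2 in. Phillips screw"

    # (start, end, lexical-unit hypothesis) over character positions;
    # both readings of "in." appear as alternative edges.
    edges = [
        (0, 1, "2"),
        (2, 5, "in.(abbreviation of inch)"),
        (2, 4, "in(preposition)"),
        (4, 5, ".(sentence-closing)"),
        (6, 14, "Phillips"),
        (15, 20, "screw"),
    ]

    def paths(position=0, end=20):
        """Enumerate every start-to-end path, i.e. every segmentation."""
        if position >= end:
            yield []
            return
        for start, stop, unit in edges:
            if start in (position, position + 1):  # allow for a space separator
                for rest in paths(stop, end):
                    yield [unit] + rest

    for p in paths():
        print(p)  # one path reads "in." as inch; the other ends a sentence

In APRIL3 itself no such path is chosen in advance: the annealing search over trees and the choice of lattice path proceed together.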

The significance of this feature is that it makes APRIL3 compatible with the usual structure of automatic speech recognition systems. Speech recognizers normally generate a lattice of word hypotheses to account for a physical speech signal; one path through the lattice is selected by reference to considerations such as bigram frequencies. Thus APRIL3 could potentially be harnessed to a speech recognizer in order to bring more sophisticated linguistic information to bear on the choice of word-hypotheses.

The robustness of stochastic optimization is particularly interesting in the context of speech input (written English is often sufficiently well-behaved grammatically that efficient deterministic parsers can handle it). APRIL3 was developed using written-English resources, simply because suitable spoken-language materials were not available. But the word-hypothesis lattice machinery was deliberately introduced with a view to a future switch to speech inputs (it might have been unduly cumbersome if the system were intended only to deal with written language).

Incidentally, to avoid confusion it should be noted that the term lattice as used by speech researchers is not the same as the “lattice” of algebra — see my discussion in Speakeasy, 1993.

Standalone system suitable for independent assessment

The purpose of the APRIL3 project was to convert parsing by simulated annealing from a local academic hobby into a technology that interested parties elsewhere can assess, experiment with, and develop further.

Consequently, a priority for this project (unlike the earlier phases of research at Leeds) was to produce a suite of software that is self-contained, “cleanly” and systematically coded, and thoroughly documented.

Results

The stress on “documentation before coding” and production of solid software usable by other groups inevitably meant that APRIL3 began to generate assessable outputs only in the closing weeks of the project. The first results were disappointing, as first results often are. APRIL3 certainly produces much worse analyses from raw character input than earlier APRIL versions did from wordtagged input (the latter obviously being a less challenging task). APRIL3 was a very ambitious project, moving beyond its predecessors not merely by the use of raw input but also by introducing automatic language-model generation. It may be that these things were too much to attempt simultaneously. (I don’t know whether I could have got the research grant if I had promised less; this is the dilemma of modern British university research.)

On the other hand, we succeeded in achieving our main goal: to produce a suite of well-organized and well-documented software, so that we could distribute the system to others to work with. APRIL3 is a system with a large number of knobs to turn to different settings: not just the parameters of the parser annealing schedule, but choices of which elements of the full grammatical annotation scheme to include in the language model, and choices of objective function and annealing schedule for the system which derives the language model from the available data. To say that current outputs are disappointing is only to say that the first couple of combinations of settings we tried for all those knobs proved unsatisfactory.

Within the APRIL3 project, we had no time for extensive experiments with a realistic variety of knob-settings. But in any case it was always clear that if the general technique is valuable, its value is only likely to be realized through exploration by the wider community outside our small group. APRIL3 has made such exploration possible, and I hope that it will happen. (Since 1996, I have moved on to different topics of research.)

The “book of APRIL”

A book on the history of my group’s development of the idea of parsing by simulated annealing, from its gestation in 1985 up to the time of writing, was published by Cassell in 1996 under the title Evolutionary Language Understanding.



Geoffrey Sampson

last changed 5 Jan 2005