Geoffrey Sampson


Parse Evaluation

“Parsing” means detecting the structure implicit in superficially-unstructured sequences of items. For instance, someone who encounters the English sentence “The quick brown fox jumps over the lazy dog” understands it by grasping that the first four words group together, as subject of the verb “jumps”, and that the last four words go together as a group within which the preposition “over” is in construction with the phrase “the lazy dog”. Linguists represent parse structure using tree diagrams, or (equivalently, and easier to show in a web page) by inserting labelled brackets among the words – for instance, the SUSANNE parsing scheme would represent this example as:

[S [Ns:s The quick brown fox ] [Vz jumps ] [P:q over [Ns the lazy dog ] ] ]
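
To make the notation concrete, here is a short sketch in Python – purely illustrative code, not part of any software discussed below – which reads a labelled bracketing of this kind into a nested-list tree, with each node represented as its label followed by its children:

def read_bracketing(text):
    # Turn a labelled bracketing such as the SUSANNE example above into
    # a nested list: each node is [label, child, child, ...], where a
    # child is either a word or another such list.
    tokens = text.replace("[", " [ ").replace("]", " ] ").split()
    stack = [[]]
    for tok in tokens:
        if tok == "[":
            node = []
            stack[-1].append(node)
            stack.append(node)
        elif tok == "]":
            stack.pop()
        else:
            stack[-1].append(tok)
    return stack[0][0]

tree = read_bracketing(
    "[S [Ns:s The quick brown fox ] [Vz jumps ] "
    "[P:q over [Ns the lazy dog ] ] ]")
# tree[0] is now 'S'; tree[1] is ['Ns:s', 'The', 'quick', 'brown', 'fox']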

The concept of “parsing” originated in the study of language, but the concept applies also to various other domains. Resolving the long nucleotide sequence in a DNA strand, conventionally represented as a string of hundreds or thousands of G’s, C’s, T’s, and A’s, into higher-order units such as genes is effectively a kind of parsing (though so far as I know biologists do not use the term).

If utterances were always as simple as the “quick brown fox” example, natural-language parsing would be easy to automate. But they are not. In reality, automatic parsing of English and other natural languages is one of the hardest challenges facing computational linguists, but also one of the most central. Many, probably most, of the practical, economically or socially useful tasks involving natural language which we would like to delegate to computers require the ability to parse automatically; yet although hundreds of researchers have been working on the problem for decades, even now the best parsing systems available for the most-studied languages perform only indifferently.

In this situation, we clearly need some precise, objective way to measure the accuracy of parsing-software output, so that the discipline can quantify the success of particular systems with particular genres of input, and can identify problem areas in a language in a more than anecdotal fashion.

I constructed a parse-accuracy metric, which I called leaf-ancestor assessment, in the 1980s in connexion with automatic-parsing research sponsored by the Ministry of Defence at Leeds University, where I then worked. This was not an aspect of our research which we took pains to publicize, however, and in the 1990s it was eclipsed by an alternative technique adopted by the US-based PARSEVAL (parse evaluation) series of competitions between different teams’ parsing systems; this latter metric is simply called the PARSEVAL metric, or alternatively the GEIG – Grammar Evaluation Interest Group – metric.

Any technique for evaluating a parse in essence measures the similarity between two tree-structures (or labelled bracketings) for the same string: the parse being evaluated, and a “gold-standard” structure representing the ideal analysis which the parsing software is intended to produce. (Different research groups may have different ideas about what gold-standard parses ought to look like in detail. Another main area of my own research has been concerned with defining a maximally comprehensive and rigorous gold-standard parsing scheme for English. But for present purposes we assume that we have some agreement on what the gold standard should be, and we are concerned with how to measure levels of success in attaining it.)

The difference between the PARSEVAL approach and my leaf-ancestor approach lies in how pairs of trees are compared. The PARSEVAL metric counts the proportion of bracketings which group the same sequences of words in both trees. For instance, if words 4 to 7 are grouped as a unit by both candidate and gold-standard trees, that adds to the score; if that particular sequence of words is a unit in either one of the trees but not in the other, that reduces the score. (Early versions of the PARSEVAL metric ignored the question whether matching sequences were labelled the same way in both trees, but more refined versions have subsequently taken this into account.) Leaf-ancestor assessment, on the other hand, considers the “lineages” of node labels linking successive words to the root node. In our simple example, for instance, the word “lazy” is part of a noun phrase which in turn is part of a prepositional phrase which itself is part of a main clause. Leaf-ancestor assessment evaluates the parsing of each word in an input by calculating the similarity of that word’s lineages in candidate and gold-standard trees, and produces an overall figure for a complete parse by averaging the values for the individual words. (Lineage similarity is measured in terms of a widely-used concept of “edit distance” – how many operations are needed to change one sequence into the other. To keep the present account reasonably simple, I ignore one complication which arises in adapting the edit-distance concept to the parse-evaluation problem. I defined the leaf-ancestor metric more carefully in a paper, “A proposal for improving the measurement of parse accuracy”, in International Journal of Corpus Linguistics vol. 5, pp. 53–68, 2000.)
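
For readers who like to see the two calculations side by side, here is a rough Python sketch of both, using the nested-list trees of the earlier sketch and the simplifications just mentioned: unlabelled bracket matching, summarized as an F-score, for the PARSEVAL side, and plain Levenshtein distance, normalized by the longer lineage’s length, for the leaf-ancestor side. The function names and the normalization are my illustrative choices; the downloadable program described below implements the metric more carefully.

def brackets(tree, start=0):
    # Collect (first-word, last-word) spans, 0-based, for every node.
    spans = []
    i = start
    for child in tree[1:]:
        if isinstance(child, list):
            child_spans, i = brackets(child, i)
            spans.extend(child_spans)
        else:
            i += 1
    spans.append((start, i - 1))
    return spans, i

def parseval_f1(candidate, gold):
    # Unlabelled bracket match, summarized as an F-score. Real PARSEVAL
    # reports precision and recall separately, and refined versions also
    # check labels; set() collapses duplicate spans, for brevity.
    c, _ = brackets(candidate)
    g, _ = brackets(gold)
    matched = len(set(c) & set(g))
    return 2 * matched / (len(c) + len(g))

def lineages(tree, path=()):
    # For each word in left-to-right order, the sequence of node labels
    # from the root down to that word.
    out = []
    for child in tree[1:]:
        if isinstance(child, list):
            out.extend(lineages(child, path + (tree[0],)))
        else:
            out.append(path + (tree[0],))
    return out

def edit_distance(a, b):
    # Plain Levenshtein distance between two label sequences.
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, 1):
        cur = [i]
        for j, y in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1,
                           prev[j - 1] + (x != y)))
        prev = cur
    return prev[-1]

def leaf_ancestor_score(candidate, gold):
    # Average, over the words, of 1 minus the normalized distance between
    # each word's two lineages; a perfect parse scores 1.0.
    word_scores = [1 - edit_distance(lc, lg) / max(len(lc), len(lg))
                   for lc, lg in zip(lineages(candidate), lineages(gold))]
    return sum(word_scores) / len(word_scores)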

Both techniques yield quantitative measures of similarity between trees; but the measures are very different. A perfectly accurate parse is given a perfect score by both metrics; but a partly-inaccurate parse will often be treated as quite bad by one metric and as near-perfect by the other. Some kinds of error are heavily penalized by one of the techniques, other kinds by the other technique.

Although the PARSEVAL metric was extremely influential throughout the 1990s, by the end of that decade many researchers were expressing dissatisfaction with it. Its relative judgements too often seemed to be at odds with researchers’ intuitive understanding of what is a good or a bad parse. Remarks were voiced along the lines of “it is unclear as to how the score on [the PARSEVAL] metric relates to success in parsing” (from a 1998 conference paper by Srinivas Bangalore et al.). The PARSEVAL metric is easy to calculate; but there is little virtue in an easily-computed measure if it measures the wrong thing.

As I see it, the PARSEVAL metric lays too much emphasis on the boundaries of grammatical constructions. What I (and, I believe, most people who use the term) understand by “parsing” is something like “detecting what large, high-level units are constituted by the small low-level units that are immediately apparent”. Identifying the edges of those high-level units is part of that, but only part. Suppose, in the gold-standard analysis of some long example, that words 12 to 23 form a relative clause, and a candidate parse errs by treating the relative clause as ending with word 22 rather than 23: to my mind the candidate parse should receive substantial credit, though not full credit – it has correctly discovered the existence and approximate location of a relative clause, though it has made a slight error about its right-hand boundary. Leaf-ancestor assessment will give partial but not full credit. The PARSEVAL metric will give no credit. One tree contains a bracketing of words 12 to 23, the other contains a bracketing of words 12 to 22: these do not match.
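
The contrast can be seen in miniature with the sketch functions above, applied to an invented five-word example (structure and labels chosen purely for illustration) in which a gold-standard clause spanning the last two words is truncated by one word in the candidate:

# Hypothetical toy: in the gold parse a clause R spans words 4-5;
# the candidate ends it one word early, at word 4.
gold = ['S', ['N', 'w1'], ['V', 'w2'],
        ['N', 'w3', ['R', 'w4', 'w5']]]
cand = ['S', ['N', 'w1'], ['V', 'w2'],
        ['N', 'w3', ['R', 'w4'], 'w5']]

print(parseval_f1(cand, gold))          # 0.8: the R spans (3, 4) and (3, 3) never match
print(leaf_ancestor_score(cand, gold))  # about 0.93: only w5's lineage is docked, by a third

Even in this tiny example the misplaced clause bracket contributes nothing at all to the PARSEVAL score, whereas leaf-ancestor assessment penalizes only the one word affected by the boundary error.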

My then researcher Anna Babarczy and I compared the two metrics experimentally, by applying both of them to hundreds of examples of imperfectly-parsed English and checking the cases where the respective metrics gave significantly different scores to the same examples. Our paper on the experimental results appeared as Sampson & Babarczy, “A test of the leaf-ancestor metric for parse accuracy”, Natural Language Engineering vol. 9, pp. 365–80, 2003. I have yet to meet anyone who disagrees with our conclusion that the leaf-ancestor scores come far closer than the PARSEVAL scores to operationalizing our pre-theoretic intuitions of relative parse accuracy.

The leaf-ancestor metric has other advantages too: for instance, it does not just give global scores but shows where and to what extent the parsing of a long and complex input has gone wrong. But it is not the aim of this web page to give detailed arguments for preferring the leaf-ancestor metric to its rival. The publications already quoted do that, and there is no reason to repeat this material here. My chief motive for adding this page to my site is to draw researchers’ attention to the fact that I have now made software implementing the leaf-ancestor metric (as used in the Sampson & Babarczy experiment) available for public use. My resources page gives a URL from which the software can be downloaded via anonymous ftp.

The program includes long comment sections which should make it easy for others to use or adapt it to their purposes. There are no constraints on copying the program or using it for any purpose; I would ask only that publications produced with its help should include a brief acknowledgement of the software author and his employer, Sussex University. I shall be delighted to hear from anyone who works with the program.



Geoffrey Sampson

last changed 7 Nov 2005