Geoffrey Sampson


[LOGO]

Good–Turing Frequency Estimation

Suppose you want to estimate how common various species of birds are in your garden. You log the first thousand birds you see; perhaps you see 212 sparrows, 109 robins, 58 blackbirds, and lesser numbers of other species, down to one each of a list of uncommon birds. Perhaps you see 30 species all told. How do you use these numbers to estimate the probability that the next bird seen will be, say, a blackbird? Most people would surely say that the best guess is 58 ÷ 1000, i.e. 0.058. Well, that’s wrong.

To see that it’s wrong, consider a species which didn’t occur at all in the thousand-bird sample, but which does occasionally visit your garden: say, nightingales. If the probability of blackbirds is estimated as 58 ÷ 1000, then the probability of nightingales would be estimated as 0 ÷ 1000, i.e. nonexistent. Obviously this is an underestimate for nightingales; and correspondingly 58 ÷ 1000 is an overestimate for blackbirds.

This kind of statistical problem crops up in many kinds of research. In linguistics, the “species” whose frequency has to be estimated might be words, syllables, grammatical constructions, or the like. People often try to get round the problem of zero observations by adding some small quantity (say, 1) to the tally for each species; then, for the bird example, p(nightingale) would be 1 ÷ 1030 (and p(blackbird) would be (58 + 1) ÷ (1000 + 30) = 0.0573). But this is just papering over the cracks in the logic of the earlier estimates; it is still a rotten way of approximating the true probabilities (the estimates will often be wildly misleading).

A much better technique was worked out by Alan Turing and his statistical assistant I.J. Good, during their collaboration in the Second World War effort to crack German ciphers at Bletchley Park, Bucks., which led to the development of machines that were the immediate ancestors of the digital computer. (This was the organization that, long top secret, has now become familiar to a wide public through Robert Harris’s best-selling 1995 thriller Enigma, and other media coverage.) Unfortunately, most versions of the Good–Turing technique required quite cumbersome calculations, so it was not used as widely as it might have been.

Some years ago William Gale, then of AT&T Research, New Jersey, developed and tested a simple version of the Good–Turing approach, which makes it easy for people with little interest in maths to understand and use it. During 1994–5, I worked with Bill Gale (who took medical retirement about then, and who sadly died in 2002) to turn this “Simple Good–Turing” technique into a paper, “Good–Turing frequency estimation without tears”, published under our joint names in the Journal of Quantitative Linguistics, vol. 2 pp. 217–37, 1995 (and subsequently reprinted in my book Empirical Linguistics), which spells out step by step how to apply the technique, as well as explaining its rationale and demonstrating through various examples that it gives good results.

I also developed C code implementing Bill’s Simple Good–Turing algorithm; so, whether you understand the rationale or not, you can simply load your observed data into the software and, presto!, get the probability estimates out at the other end – for the location of this on the internet, see my resources page.   — But you really should try to understand why the algorithm works, rather than just using it “blind”: the concepts are not difficult, and you will be able to see what the technique can and what it can’t do for you.

(My code was subsequently incorporated into a larger suite of smoothing tools software by Joshua Goodman and Stanley Chen, then of Harvard University, now Microsoft Research and IBM Research respectively; however, the only website currently distributing the Goodman/Chen material seems to be a controlled-access site owned by Kenji Kita of Tokushima University.)

In 2004, David Elworthy of Google, Inc. (Santa Monica, Calif.) recoded my software into C++ and encapsulated it in a class so it can be used in other programs; and in 2007 Florian Doemges and Björn Wilmsmann, of the Ruhr-Universität Bochum, produced a Perl version and placed it on CPAN.



Geoffrey Sampson

last changed 5 Sep 2010