Suppose you want to estimate how common various species of birds are in your garden. You log the first thousand birds you see; perhaps you see 212 sparrows, 109 robins, 58 blackbirds, and lesser numbers of other species, down to one each of a list of uncommon birds. Perhaps you see 30 species all told. How do you use these numbers to estimate the probability that the next bird seen will be, say, a blackbird? Most people would surely say that the best guess is 58 ÷ 1000, i.e. 0.058. Well, that’s wrong.

To see that it’s wrong, consider a species which didn’t occur at all in the thousand-bird sample, but which does occasionally visit your garden: say, nightingales. If the probability of blackbirds is estimated as 58 ÷ 1000, then the probability of nightingales would be estimated as 0 ÷ 1000, i.e. nonexistent. Obviously this is an underestimate for nightingales; and correspondingly 58 ÷ 1000 is an overestimate for blackbirds.

This kind of statistical problem crops up in many kinds of
research. In linguistics, the “species” whose frequency
has to be estimated might be words, syllables,
grammatical constructions, or the like. People often
try to get round the problem of zero observations
by adding some small quantity (say, 1) to the tally for
each species; then, for the bird example,
*p*(nightingale) would be 1 ÷ 1030
(and *p*(blackbird)
would be (58 + 1) ÷ (1000 + 30) = 0.0573).
But this
is just papering over the cracks in the logic of
the earlier estimates;
it is still a rotten way of approximating the true
probabilities (the estimates will often be wildly misleading).

A much better technique was worked out by
Alan Turing
and his statistical assistant
I.J. Good,
during their
collaboration in the Second World War effort to crack
German ciphers at Bletchley Park, Bucks., which led to the
development of machines that were the immediate ancestors
of the digital computer. (This was the organization that,
long top secret, has
now become familiar to a wide public through Robert
Harris’s best-selling 1995 thriller *Enigma*, and
other media coverage.)
Unfortunately, most versions of the Good–Turing
technique required quite cumbersome calculations, so
it was not used as widely as it might have been.

Some years ago
William Gale, then of AT&T Research, New Jersey,
developed and tested a simple version of the Good–Turing
approach, which makes it easy for people with little
interest in maths to understand and use it. During
1994–5, I worked with Bill Gale (who took medical retirement
about then, and who sadly died in 2002)
to turn this **“Simple Good–Turing”**
technique into a paper,
“Good–Turing frequency estimation
without tears”, published under our joint names in the
*Journal of Quantitative Linguistics*, vol. 2 pp.
217–37, 1995 (and subsequently reprinted in my book
*Empirical Linguistics*),
which spells out step by step how to apply the technique,
as well as explaining its rationale and demonstrating
through various examples that it gives good results.

I also developed C code implementing Bill’s Simple Good–Turing algorithm; so, whether you understand the rationale or not, you can simply load your observed data into the software and, presto!, get the probability estimates out at the other end – for the location of this on the internet, see my resources page. — But you really should try to understand why the algorithm works, rather than just using it “blind”: the concepts are not difficult, and you will be able to see what the technique can and what it can’t do for you.

(My code was subsequently incorporated into a larger suite of smoothing tools software by Joshua Goodman and Stanley Chen, then of Harvard University, now Microsoft Research and IBM Research respectively; however, the only website currently distributing the Goodman/Chen material seems to be a controlled-access site owned by Kenji Kita of Tokushima University.)

In 2004, David Elworthy of Google, Inc. (Santa Monica, Calif.) recoded my software into C++ and encapsulated it in a class so it can be used in other programs; and in 2007 Florian Doemges and Björn Wilmsmann, of the Ruhr-Universität Bochum, produced a Perl version and placed it on CPAN.

*last changed 5 Sep 2010*