2

I recall hearing a statistic that, in a typical block of English text (e.g. a novel), a surprisingly large proportion (a third? half?) of the distinct words that appear do so only once. That is, if you counted the occurrences of each distinct word in the text, you'd find that a huge proportion of the distinct words appear only once each.

I think I heard this on a radio show but I can't find the source. Can anyone confirm where I might have heard this, and/or indicate whether there's any truth to it?

Thanks.

TenMinJoe
  • 123

1 Answer

5

While I can't confirm (and don't know) where you might have heard this, it is true that Zipf's law seems applicable to representative corpora of English text. That is, it describes observed patterns of word frequency approximately, but well enough to be useful. See the Zipf's law Wikipedia article for more information and detailed formulas, and also the question Frequency of word use vs number of words, in which I gave some example calculations. An implication of Zipf's law is that a large fraction of the distinct words in a typical corpus are singletons, or hapax legomena.
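A rough sketch of why this follows, assuming an idealized Zipf law in which the word of rank r occurs about C/r times (C is a corpus-dependent constant and V the number of distinct words; both symbols are my own notation, and real texts only approximate this):

```latex
\begin{align*}
  \#\{\text{words with frequency} \ge n\} &\approx \frac{C}{n}
    && \text{(the largest rank $r$ with $C/r \ge n$)} \\
  \#\{\text{words with frequency} = n\} &\approx \frac{C}{n} - \frac{C}{n+1}
    = \frac{C}{n(n+1)} \\
  V = \#\{\text{words with frequency} \ge 1\} &\approx C \\
  \frac{\#\{\text{words with frequency} = n\}}{V} &\approx \frac{1}{n(n+1)}
\end{align*}
```

Under these assumptions, n = 1 gives about 1/2 of the distinct words occurring exactly once, and n = 2 gives about 1/6 occurring exactly twice. That is an idealization, but it is in the same ballpark as the roughly 44% quoted below for Moby Dick.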

A word frequency analysis of Moby Dick appears as an illustration in the hapax legomenon Wikipedia article. The caption says “About 44% of the distinct set of words in this novel, such as "matrimonial", occur only once...”
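If you want to check a figure like this on a text of your own, here is a minimal Python sketch. The function name `hapax_stats` and the tokenization rule (lowercased runs of letters, with an optional internal apostrophe) are my own assumptions, not the method behind the Wikipedia figure; a different definition of "word" will give somewhat different percentages.

```python
import re
from collections import Counter


def hapax_stats(text):
    """Return (distinct_word_count, hapax_count, hapax_fraction) for text.

    Words are lowercased runs of ASCII letters, optionally containing one
    internal apostrophe (e.g. "don't"). This is an assumed, simplistic
    tokenization; it does no stemming or lemmatization.
    """
    words = re.findall(r"[a-z]+(?:'[a-z]+)?", text.lower())
    counts = Counter(words)
    distinct = len(counts)
    # A hapax legomenon is a distinct word that occurs exactly once.
    hapaxes = sum(1 for c in counts.values() if c == 1)
    fraction = hapaxes / distinct if distinct else 0.0
    return distinct, hapaxes, fraction


if __name__ == "__main__":
    sample = "The whale and the sea and the matrimonial bed"
    distinct, hapaxes, fraction = hapax_stats(sample)
    print(f"{hapaxes} of {distinct} distinct words occur once ({fraction:.0%})")
```

To try it on a whole novel, read a plain-text copy into a string and pass that to `hapax_stats`.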

  • The Wikipedia article on hapax legomena also says that, “[f]or large corpora, about 40% to 60% of the words (counting by type) are hapax legomena, and another 10% to 15% are dis legomena. Thus, in the Brown Corpus of American English, about half of the 50,000 words are hapax legomena within that corpus”. This is basically exactly the same statement as Joe heard in his elusive radio show. – Janus Bahs Jacquet Jul 25 '14 at 13:39
  • 2
    Since the statistic turns out to be correct, I'm not too fussed about where I might have heard it any more! – TenMinJoe Jul 25 '14 at 13:42
  • 1
    Yes, and now you have a word you can throw into any cocktail party conversation and stop it dead in its tracks. Hapax legomenon is singular, and hapax legomena is plural (neuter nouns in Greek and Latin have -a nominative/accusative plurals). – John Lawler Jul 25 '14 at 15:09