English Word Frequency

Ambiguity and Frequency

It has been suggested that word frequency (how often a word is used) and level of polysemy in words is positively correlated: that is, the more frequent a word is the more associated meanings it will have, allowing/requiring the listener to use context to choose the right meaning as s/he goes. Here is a popular discussion of the research. An informal and less exhaustive look at these data is given below. Maybe simple words aren't so simple.

Thanks to the spring 2012 section of ENG5170 at BGSU, we have 120 words, haphazardly (not precisely randomly) chosen from the list of the 5,000 most frequent English words on wordfrequency.info. The number of meanings associated with each word was counted using dictionary.com, and here is what we found.

The link above to the 5,000 most frequent words nicely separates zero-derived nouns and verbs from one another: that is, the verb 'take' (as in 'I sneak up behind you and take your money') is listed as a separate entry from the noun 'take' (as in 'Then, I slink away to count my take'). This distinction is useful on a number of levels.

Contrast that list of most frequent words with a list of the core vocabulary used by historical linguists (and others) to establish family relationships between languages. Here are the first fifty words of a recent version of the Swadesh list used for this purpose. Note the differences between frequent and core vocabulary.

There are so many corpora of English out there, each with its own story and set of sources and rules for adding things up.

Here is a nice simple one. It's a sorted and summarized set of statistics for a series of blog entries by Caveblogem (whose blog I stumbled upon while looking for just this sort of thing). The corpus is 212,000 words drawn from those posts.

Here are some more examples.

I looked back at the beginning of this page and took my opening 46 words, reposted below. Of the 46 words, only 32 are unique, and I only got to the 13th word before reusing something. Here's what I wrote:

It has been suggested that word frequency and level of polysemy in words is positively correlated: that is, the more frequent a word is the more associated meanings it will have, allowing/requiring the listener to use context to choose the right meaning as s/he goes.

And here are the summary lists. I put all inflectionally related words together, so all forms of the same word (have/has, is/are/been, meaning/meanings) are counted as instances of that word. First is the alphabetical list, and then the list sorted by frequency.

Alphabetical list:
a
allowing
and
as
associated
be (4 instances)
choose
context
correlated
frequency
frequent
go
have (2 instances)
in
it (2 instances)
level
listener
meaning (2 instances)
more (2 instances)
of
polysemy
positively
requiring
right
s/he
suggested
that (2 instances)
the (4 instances)
to (2 instances)
use
will
word (3 instances)

List by frequency:
4 instances: be the
3 instances: word
2 instances: have it meaning more that to
1 instance: a allowing and as associated choose context
correlated frequency frequent go in level
listener of polysemy positively requiring right s/he suggested use will
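
The hand tally above can be checked with a short script. What follows is only a sketch in Python, not the method used to make the lists: the tokenizing pattern and the small LEMMAS table are assumptions made for this illustration, chosen so that 'allowing/requiring' counts as two words, 's/he' counts as one, and inflected forms count as instances of their base word.

```python
import re
from collections import Counter

SENTENCE = (
    "It has been suggested that word frequency and level of polysemy in "
    "words is positively correlated: that is, the more frequent a word is "
    "the more associated meanings it will have, allowing/requiring the "
    "listener to use context to choose the right meaning as s/he goes."
)

# Hand-made lookup table (an assumption for this one sample, not a real
# lemmatizer): maps each inflected form in the sentence to its base word.
LEMMAS = {"has": "have", "been": "be", "is": "be",
          "words": "word", "meanings": "meaning", "goes": "go"}


def tally(text):
    """Return the base-word sequence, its counts, and the 1-based
    position of the first word that repeats an earlier one."""
    # Keep "s/he" whole; otherwise a word is a run of letters.
    tokens = re.findall(r"s/he|[a-z]+", text.lower())
    lemmas = [LEMMAS.get(token, token) for token in tokens]
    counts = Counter(lemmas)

    seen = set()
    first_repeat = None
    for position, lemma in enumerate(lemmas, start=1):
        if lemma in seen:
            first_repeat = position
            break
        seen.add(lemma)
    return lemmas, counts, first_repeat


lemmas, counts, first_repeat = tally(SENTENCE)
print(len(lemmas), "words,", len(counts), "unique,",
      "first repeat at word", first_repeat)

# Most frequent first; ties are broken alphabetically.
for word, n in sorted(counts.items(), key=lambda kv: (-kv[1], kv[0])):
    print(n, word)
```

Run on the sentence above, the sketch agrees with the hand count: 46 words, 32 unique, and the first repeat at the 13th word.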

Looking at that made me wonder if the large number of unique words had anything to do with the topic or maybe the formality of the writing.

So I went back to Caveblogem and swiped (thanks there, Caveblogem) 46 words from the beginning of a random post and got 36 unique words with the first repeated word coming 11 words in. Here those are arranged by frequency:

5 instances: the
3 instances: I of
2 instances: a this
1 instance each:
about agree am at but end great
having is first from little masterpiece part
phrase regeneration restatement sentence Slotkin sort stuff
that thesis think totally through trouble Turner
violence with writer

Those are roughly analogous small samples. Here is a larger, fancier one:

I wrote a computer program to replace all the words in a text with numbers. The first word that occurred in the text was given the number '1'. After that point, whenever that word reoccurred, I put the number '1' in its place. The next new word in the text got the number '2', and so on. If you had a story where no words were ever repeated, it would look like this:

1 2 3 4 5 6 7 8 9 10 ...

And it would be really hard to do; as we've seen above, it isn't long before, writing naturally, we begin to reuse words. Here is the output of that program.
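
The program itself is not reproduced on this page. A minimal Python sketch of the numbering idea might look like the following. It assumes that a word is simply a run of letters and apostrophes and that capitalization is ignored; the function name number_words is invented for the illustration.

```python
import re


def number_words(text):
    """Replace each word with the order in which it first appeared."""
    first_seen = {}   # word -> number assigned at its first occurrence
    numbers = []
    for word in re.findall(r"[a-z']+", text.lower()):
        if word not in first_seen:
            # A new word gets the next unused number.
            first_seen[word] = len(first_seen) + 1
        numbers.append(first_seen[word])
    return numbers


sample = "The cat saw the dog and the cat ran"
print(" ".join(str(n) for n in number_words(sample)))
```

For the nine-word sample, the result is 1 2 3 1 4 5 1 2 6: 'the' keeps the number 1 and 'cat' keeps the number 2 every time they come back.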

This is interesting, but it might be even more interesting to see what those words are. And let's get the computer to sort them for us so we can see which words are most frequent and add things up.

Here are two tables, one after the other. They contain the same information but are arranged differently. They give each word and the number of times it occurs in the text. Poke around until you figure out what text I used for this! (It is interesting to note that your ideas about what the whole text is about are not drawn from the words which occur most frequently.)
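
Tables like these take only a few lines to build. The sketch below is an illustration in Python rather than the program used for the tables on this page. It assumes the same simple idea of a word (a run of letters and apostrophes, lower-cased), and the name frequency_tables is made up for the example.

```python
import re
from collections import Counter


def frequency_tables(text):
    """Return the same word counts arranged two ways:
    alphabetically, and from most to least frequent."""
    counts = Counter(re.findall(r"[a-z']+", text.lower()))
    alphabetical = sorted(counts.items())
    # Negating the count sorts high-to-low; ties fall back to the alphabet.
    by_frequency = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return alphabetical, by_frequency


alphabetical, by_frequency = frequency_tables(
    "The cat and the dog and the bird"
)
for word, n in alphabetical:
    print(word, n)
print()
for word, n in by_frequency:
    print(n, word)
```

In the toy sentence, 'the' and 'and' head the frequency table, which is the same pattern noted above: the most frequent words say very little about what a text is about.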

Here is another picture of the text. This time, the function words are counted separately and shaded for you. That is, we keep two separate counts: regular 1, 2, 3 for content words and shaded 1, 2, 3 for function words.
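
Keeping two counts is a small change to the numbering idea. Here is a rough Python sketch, with two loud assumptions: FUNCTION_WORDS is a short hand-picked set, not a complete or authoritative list of English function words, and square brackets stand in for the shading.

```python
import re

# A small hand-picked set for illustration only; a real list of
# function words would be much longer.
FUNCTION_WORDS = {
    "a", "an", "the", "and", "but", "or", "of", "in", "on", "at", "to",
    "with", "is", "are", "was", "it", "that", "this", "i", "you", "he",
    "she", "we", "they",
}


def number_by_class(text):
    """Number words by first appearance, with one running count for
    function words and a separate one for content words."""
    first_seen = {"function": {}, "content": {}}
    coded = []
    for word in re.findall(r"[a-z']+", text.lower()):
        kind = "function" if word in FUNCTION_WORDS else "content"
        table = first_seen[kind]
        if word not in table:
            table[word] = len(table) + 1
        coded.append((kind, table[word]))
    return coded


def render(coded):
    """Show function-word numbers in brackets, content-word numbers bare."""
    return " ".join(
        f"[{n}]" if kind == "function" else str(n) for kind, n in coded
    )


print(render(number_by_class("The cat saw the dog and the cat ran")))
```

The sample sentence comes out as [1] 1 2 [1] 3 [2] [1] 1 4, so the bracketed function words and the bare content words each count up from 1 on their own.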

Pedagogically, though, for TESOL purposes, perhaps what you want is simply a no-frills list of the most frequent and useful English words, drawn from a diverse set of contexts so one area doesn't get more representation than it should. Here you can find John Bauman's General Service List of approximately 2,000 English words and a very nice discussion of how the list was made and how it can be used. His page is articulate and concise: there is no good reason for me to repeat him.

Once your students have learned all those words (!), you might have a look at the Academic Word List, also available on John Bauman's page.

It is interesting how little sense we have of word frequency as native speakers. Below are four texts. In the first version of each, you see a short passage. In the second, each word in the text is coded by its frequency rank:

unmarked = word in the top 1,000
* = 1,000 to 1,999
** = 2,000 to 2,999
*** = 3,000 to 3,999
**** = 4,000 to 4,999
All caps = 5,000 or higher
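
The coding scheme is mechanical enough to sketch in Python. Everything specific here is an assumption for illustration: TOY_RANKS holds made-up ranks, not values from wordfrequency.info; the stars are placed after the word; a rank of exactly 1,000 is put in the one-star band, following the ranges as listed; and a word missing from the table is treated as 5,000 or higher.

```python
import re

# Made-up ranks for illustration only; a real table would come from a
# frequency list such as the 5,000-word list mentioned earlier.
TOY_RANKS = {"the": 1, "cat": 1500, "away": 999, "slink": 4200}


def code_text(text, ranks):
    """Mark each word in text according to its frequency rank."""
    def mark(match):
        word = match.group(0)
        rank = ranks.get(word.lower())
        if rank is None or rank >= 5000:
            # Unknown or rare words are shown in all caps.
            return word.upper()
        # Ranks 1-999 get no star, 1000-1999 one star, and so on.
        return word + "*" * (rank // 1000)

    # Only the words are rewritten; punctuation and spacing are kept.
    return re.sub(r"[A-Za-z']+", mark, text)


print(code_text("The cat purred.", TOY_RANKS))
```

With the toy table, the sample prints as: The cat* PURRED.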

See what surprises you:

It is possible to write comprehensibly using only 200 or so of the most frequent words. Below, you will find some examples of this, courtesy of the spring 2012 section of Applied Syntax at Bowling Green State U: thanks again, guys.

A Collection of Texts Using Only the Most Frequently Occurring Words