Word frequency

This page lists the most frequently used words in a given text and charts their distribution. In theory, the chart should follow Zipf's law:

Zipf's law states that given some corpus of natural language utterances, the frequency of any word is inversely proportional to its rank in the frequency table. Thus the most frequent word will occur approximately twice as often as the second most frequent word, three times as often as the third most frequent word, etc.
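As a rough illustration of the idea, the sketch below counts words in a text and prints each observed count next to the count Zipf's law would predict from the top word alone. The names word_frequencies and zipf_expected, the simple tokenizer, and the sample sentence are assumptions made for this example, not how this page is actually implemented.

```python
import re
from collections import Counter


def word_frequencies(text):
    """Return (word, count) pairs sorted from most to least frequent."""
    # Simplistic tokenizer (an assumption): lowercase the text and keep
    # runs of letters, optionally followed by an apostrophe and more letters.
    words = re.findall(r"[a-z]+(?:'[a-z]+)?", text.lower())
    return Counter(words).most_common()


def zipf_expected(top_count, rank):
    """Count predicted by Zipf's law at a 1-based rank, given the top count."""
    return top_count / rank


# Hypothetical sample text; a real corpus would be far longer.
sample = "the cat and the dog and the bird"
ranked = word_frequencies(sample)
top_count = ranked[0][1]

for rank, (word, count) in enumerate(ranked, start=1):
    # Columns: rank, word, observed count, count predicted by Zipf's law.
    print(rank, word, count, round(zipf_expected(top_count, rank), 2))
```

A text this short will not follow the law closely. The fit only becomes visible on a full-length text, where plotting rank against count on logarithmic axes should give roughly a straight line.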

If you divide the chart into 4 sections, patterns begin to emerge:

  1. The most frequent words tell you about authorship, especially if your corpus contains text from multiple authors
  2. The second slice of the most frequently used words tells you about the time period when the text was written
  3. The third slice tells you about genre
  4. The fourth slice tells you about settings and themes
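The four sections could be produced along these lines. This is only a sketch: the text does not say where the boundaries between sections fall, so the code assumes four contiguous slices of near-equal size by rank. The function name split_into_slices and the word list are hypothetical.

```python
def split_into_slices(ranked_words, n_slices=4):
    """Split a rank-ordered list into n_slices contiguous, near-equal parts."""
    size, extra = divmod(len(ranked_words), n_slices)
    slices, start = [], 0
    for i in range(n_slices):
        # The first `extra` slices take one additional item each.
        end = start + size + (1 if i < extra else 0)
        slices.append(ranked_words[start:end])
        start = end
    return slices


# Hypothetical words, already ordered from most to least frequent.
ranked_words = ["the", "of", "and", "a", "whale", "ship", "sea", "boat", "harpoon", "mast"]

for number, words in enumerate(split_into_slices(ranked_words), start=1):
    print(number, words)
```

Equal-sized slices are just one option. Because the counts fall off so steeply, boundaries chosen by cumulative frequency rather than by rank would give very different sections.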

Here's a link to the .txt of Moby Dick you could use, or perhaps Cicero's Orations, or "Treatment of the diseases of the eye, by means of prussic acid vapour, and other medical agents".

