ICM Week 8 - Documentation

21 Oct 2015
courses, documentation, icm

For this week's assignment, I worked on text analysis. I stumbled upon the code for Dan Shiffman’s Programming from A to Z (Fall 2015) course, which has great examples of many of the digital humanities text analysis techniques I’m familiar with: word frequencies and distributions, topic modeling, and so on. I wanted to create a graph of word frequencies, so I combined Dan’s code for building a word frequency concordance with the section on creating graphs in Chapter 12 of Getting Started with p5.js.
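To give a sense of what a word frequency concordance boils down to, here is a minimal sketch in plain JavaScript. It is not Dan’s code and not my actual sketch; the function names countWords and rankWords are made up for illustration, and the tokenizer is a simplifying assumption (lowercase everything, split on anything that isn’t a letter, digit, or straight apostrophe).

```javascript
// Count how often each word appears in a string.
// countWords and rankWords are hypothetical helper names for this sketch.
function countWords(text) {
  const counts = new Map();
  // Assumption: lowercase, then split on anything that is not a-z, 0-9, or '.
  const tokens = text.toLowerCase().split(/[^a-z0-9']+/);
  for (const token of tokens) {
    if (token === "") continue; // skip empties from leading/trailing separators
    counts.set(token, (counts.get(token) || 0) + 1);
  }
  return counts;
}

// Turn the counts into an array sorted from most to least frequent.
// Ties are broken alphabetically so the order is stable.
function rankWords(counts) {
  return Array.from(counts, ([word, count]) => ({ word, count }))
    .sort((a, b) => b.count - a.count || a.word.localeCompare(b.word));
}

// Toy input, just to show the shape of the result.
const ranked = rankWords(countWords("the cat and the hat and the bat"));
console.log(ranked.slice(0, 3));
```

The ranked array is what you would hand to the drawing code: rank along one axis, count along the other.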

Word frequencies tend to follow Zipf’s law: a word’s frequency is roughly inversely proportional to its rank, so a handful of words account for most of a text. Looking at different sections of the ranked graph reveals interesting patterns when you’re working with multiple texts. If you divide the chart into 4 sections, patterns begin to emerge:

If you work with only the most frequent words, you will find out about authorship, especially if your corpus includes texts by multiple authors
Using the 2nd slice of the most frequently used words, you will find out about the time period when the text was written
The 3rd slice tells you about genre
The 4th slice, the least frequent words, tells you about settings and themes
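
Here is a rough sketch of cutting a ranked word list into the four sections above. I’m assuming the sections are equal-sized chunks by rank, which is a simplification on my part; sliceRanked is a hypothetical helper name and the word list is a made-up placeholder, not real data.

```javascript
// Split a ranked list (most frequent first) into `parts` contiguous slices.
// Assumption: slices are equal-sized by rank; the last one may be shorter.
function sliceRanked(ranked, parts) {
  const size = Math.ceil(ranked.length / parts);
  const slices = [];
  for (let i = 0; i < parts; i++) {
    slices.push(ranked.slice(i * size, (i + 1) * size));
  }
  return slices;
}

// Placeholder list standing in for words already sorted by frequency.
const rankedWords = ["the", "and", "of", "to", "ship", "whale", "sea", "captain"];

// Names follow the four sections described above.
const [authorship, period, genre, themes] = sliceRanked(rankedWords, 4);
console.log(authorship, period, genre, themes);
```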

How could we use this? There’s a rumor going around that Thomas Pynchon just published a new book under the pseudonym Adrian Jones Pearson. Since the first section of the graph tells us about authorship, we could compare the distributions for Pynchon’s other texts with this new text. It would be helpful to include works from other authors to give a fuller sense of context. And the “results” would not be conclusive -- if the distribution for the new text bears a striking similarity to Pynchon’s previous works, all you would be able to say is that Adrian Jones Pearson writes in a way that is similar to Thomas Pynchon.
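
One way such a comparison might look in code, as a sketch only: compute each text’s relative word frequencies, then average the absolute differences over a shared list of very frequent words. The distance measure is my own assumption (a simple mean absolute difference, not an established stylometric method), the function names are hypothetical, and the strings are toy placeholders rather than real excerpts.

```javascript
// Relative frequency of each word: its count divided by the total token count.
// Assumption: same simple tokenizer as a basic concordance (lowercase, split on non-word characters).
function relativeFrequencies(text) {
  const tokens = text.toLowerCase().split(/[^a-z0-9']+/).filter((t) => t !== "");
  const counts = new Map();
  for (const token of tokens) {
    counts.set(token, (counts.get(token) || 0) + 1);
  }
  const freqs = new Map();
  for (const [word, count] of counts) {
    freqs.set(word, count / tokens.length);
  }
  return freqs;
}

// Mean absolute difference in relative frequency over the given words.
// 0 means identical profiles for those words; larger means less alike.
function profileDistance(freqsA, freqsB, words) {
  if (words.length === 0) throw new Error("need at least one word to compare");
  let total = 0;
  for (const word of words) {
    total += Math.abs((freqsA.get(word) || 0) - (freqsB.get(word) || 0));
  }
  return total / words.length;
}

// Toy placeholder strings, not real excerpts from any book.
const knownText = relativeFrequencies("the cat and the dog");
const disputedText = relativeFrequencies("a cat or a dog");
const frequentWords = ["the", "and"]; // stand-in for the top slice

console.log(profileDistance(knownText, knownText, frequentWords));
console.log(profileDistance(knownText, disputedText, frequentWords));
```

A small distance would only suggest similar habits with very common words. To read it at all, you would want the same number for texts by other authors as a baseline.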

Email: coblezc@gmail.com
Twitter: @coblezc
CC-BY