14 Apr 2017
courses, documentation, noc
I have a couple of data sets in mind that I’ve collected and would be interested in working with for the final project, both of which are textual corpora. I’ve done some limited work applying Justin Johnson’s torch-rnn library toward creative ends and would like to explore it in more depth. The results I’m looking for are often similar to those of a Markov chain bot, but an RNN library makes it possible to generate “random” output on a much larger scale.
- Wikipedia. Wikimedia provides regular data dumps of the entirety of Wikipedia in SQL and XML (~13 GB). It would be interesting to train the neural network on this and generate an output of similar size, which hypothetically would retain the underlying data structure, hyperlink references, etc., and ideally form some sort of alternate network of information/knowledge.
- CIA World Factbook. The Factbook provides profiles of all the countries in the world, with info on demographics, economy, culture, etc. While the CIA doesn’t provide any classified info here, the Factbook is a useful resource for geopolitical data. Datahub offers a dump of the site that was scraped in 2013 (656 MB). As with the Wikipedia example, this could be used as a training corpus, with an output that describes a fictional world of countries, each with its own complex profile. What will the countries be named? How will their economies interact with each other? Will some of them have gone to war with each other? Given the training set, we can expect the “new world” to look much the same as the current one, although arranging the pieces in a different way might reveal something about our current world that has gone unnoticed (and also reinforce things we knew all along).
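For comparison, here is a minimal sketch of the kind of Markov chain bot mentioned above, written in Python. It is not part of torch-rnn, and the names `build_model`, `generate`, and the `order` parameter are my own assumptions for illustration. It works at the character level, which is roughly the granularity a character-level RNN is trained on as well.

```python
import random
from collections import defaultdict


def build_model(text, order=3):
    """Map each `order`-character context to the characters seen after it."""
    model = defaultdict(list)
    for i in range(len(text) - order):
        context = text[i:i + order]
        model[context].append(text[i + order])
    return model


def generate(model, seed, length=200, rng=None):
    """Extend `seed` by up to `length` characters sampled from the model.

    The seed's length is treated as the context size, so it should match
    the `order` the model was built with (an assumption of this sketch).
    """
    rng = rng or random.Random()
    order = len(seed)
    out = seed
    for _ in range(length):
        followers = model.get(out[-order:])
        if not followers:
            # Context never appeared in the training text; stop early.
            break
        out += rng.choice(followers)
    return out


if __name__ == "__main__":
    corpus = "the cat sat on the mat. the cat ate the rat. "
    model = build_model(corpus, order=3)
    print(generate(model, "the", length=60))
```

A model like this only ever looks at the last few characters, so it tends to lose track of longer structures such as markup or links. The hope with an RNN is that it can carry more context forward, which is what would make retaining something like Wikipedia's structure plausible at all.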