06 Feb 2018
documentation, courses, thesis
I want to create an alternate version of Wikipedia. Before going into possibilities for the project, it will be helpful to briefly discuss Wikipedia’s technology and how it is organized.
Wikipedia is built using the MediaWiki software. MediaWiki is wiki software that can also be used for other types of wiki projects. When you visit a Wikipedia page, there are actually many components in addition to the body text itself that go into making that page. These include:
- Templates: repeatable text that appears on multiple pages (e.g. infobox, badges, talk page, page navigation)
- Categories: automatic indexes that are useful as tables of contents and appear at the bottom of the article page (e.g. 1961 births, American Nobel laureates, Presidents of the United States, Obama family)
- Extensions: allow you to customize how MediaWiki looks and works (e.g. categories, citations, table of contents)
All of these components are linked and are a key part of what makes Wikipedia so useful. But they are also very complex. This makes sense, given that Wikipedia is meant to be an encyclopedia of the world’s knowledge.
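To get a sense of how much of this structure is exposed programmatically, here is a small sketch that asks the public MediaWiki API which templates and categories are attached to a single article. The article title is just an example, and error handling and result continuation are omitted.

```python
# Query the MediaWiki API for the templates and categories attached to one
# article, to show how much structure sits around the body text.
import requests

resp = requests.get(
    "https://en.wikipedia.org/w/api.php",
    params={
        "action": "query",
        "titles": "Barack Obama",
        "prop": "templates|categories",
        "tllimit": "max",
        "cllimit": "max",
        "format": "json",
    },
)
page = next(iter(resp.json()["query"]["pages"].values()))
print([t["title"] for t in page.get("templates", [])][:10])
print([c["title"] for c in page.get("categories", [])][:10])
```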
One version of the project is to create a Wikipedia for an alternate universe. The output would be a website that looks and feels like the Wikipedia we know, but whose pages are fictional.
I would like to use machine learning to generate the pages for my site, and randomness is a given in text-generation projects. I am interested in finding ways to make this randomness less “word salad” and more believable. The text-generation methods available at the moment are not refined enough to completely tame that randomness for a project of this scale, so I would like to break the project into smaller pieces. Rather than creating textual output for several pages at one time (as I’ve done in my preliminary work), I would create outputs for specific sections of a page. For example, for a page about a person, I would create sections for Early Life, Career, Controversies, etc.
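As a rough illustration of this per-section approach, the sketch below trains a separate Markov-chain model (using the markovify library rather than a neural model) for each section type. The corpus files are assumed to contain, for example, every “Early life” section scraped from a Wikipedia dump, one file per section type; the paths and section names are placeholders.

```python
# Train one small Markov model per section type and generate a short
# fictional section of each type for a single page.
import markovify

section_types = ["early_life", "career", "controversies"]

models = {}
for name in section_types:
    with open(f"corpora/{name}.txt", encoding="utf-8") as f:
        models[name] = markovify.Text(f.read(), state_size=2)

for name, model in models.items():
    sentences = (model.make_sentence(tries=100) for _ in range(4))
    paragraph = " ".join(s for s in sentences if s)
    print(f"== {name.replace('_', ' ').title()} ==\n{paragraph}\n")
```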
One consideration for this approach is determining how granular I want to get. For example, if I want to create new countries, I need to create a countries template. I also need to create a flag template, because countries have flags. The flag template has images, so do I create new flags or use existing ones?
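One way to keep that granularity manageable is to generate only the field values and pour them into a fixed template skeleton. The sketch below assembles a country infobox as wikitext; the field names and values are invented for illustration and are not the exact parameters of Wikipedia’s real {{Infobox country}} template.

```python
# Slot generated values into a country infobox skeleton (hypothetical fields).
def build_country_infobox(fields: dict) -> str:
    lines = ["{{Infobox country"]
    for key, value in fields.items():
        lines.append(f"| {key} = {value}")
    lines.append("}}")
    return "\n".join(lines)

print(build_country_infobox({
    "name": "Veldoria",                    # generated country name
    "capital": "Port Alaren",              # generated capital
    "flag_image": "Flag of Veldoria.svg",  # a new flag, or reuse an existing one
}))
```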
Same as above, but instead of training my models solely on a Wikipedia data dump, I will also include data from other sources. I am particularly interested in including data from female authors or individuals from underrepresented groups. The typical Wikipedia author/editor is a white US male in his mid-20s, and my idea for incorporating other data sources is to see what a Wikipedia article would look like if it were generated, for example, only from articles written by feminist authors. Would readers have a different reaction to the articles? Would they view them as more or less trustworthy?
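Continuing the markovify sketch above, blending corpora could be as simple as combining a model trained on Wikipedia text with one trained on another corpus; markovify’s combine function supports weighting each source. The file paths are placeholders for corpora I would still need to assemble.

```python
# Blend a Wikipedia-trained model with one trained on an alternative corpus.
import markovify

with open("corpora/wikipedia_sections.txt", encoding="utf-8") as f:
    wiki_model = markovify.Text(f.read())

with open("corpora/feminist_essays.txt", encoding="utf-8") as f:
    alt_model = markovify.Text(f.read())

# Weight the alternative corpus twice as heavily as the Wikipedia text.
blended = markovify.combine([wiki_model, alt_model], [1, 2])
print(blended.make_sentence(tries=100))
```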
Same as the initial idea but with a greater emphasis on generating content for the Edits pages. Since Wikipedia is crowdsourced and pages can be edited by anyone (there are many restrictions to this, but that’s the basic idea), the history of edits for a given page is a rich source of meta-information about that topic, revealing the process by which an encyclopedia arrives at its “version of record.” A reference for this idea is James Bridle’s Iraq War Wikihistoriography project, where he compiled and printed the edit history of the Iraq War page (into 12 volumes!) and presented it as the historiography of the war.
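For reference, a page’s edit history is also available through the MediaWiki API via prop=revisions. The sketch below pulls a single batch of 50 revisions for the Iraq War page; a full scrape like Bridle’s would need continuation handling to walk through the entire history.

```python
# Fetch one batch of a page's revision history from the MediaWiki API.
import requests

resp = requests.get(
    "https://en.wikipedia.org/w/api.php",
    params={
        "action": "query",
        "titles": "Iraq War",
        "prop": "revisions",
        "rvprop": "timestamp|user|comment",
        "rvlimit": 50,
        "format": "json",
    },
)
page = next(iter(resp.json()["query"]["pages"].values()))
for rev in page["revisions"]:
    print(rev["timestamp"], rev.get("user", ""), rev.get("comment", ""))
```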