09 Apr 2018
After the Quick and Dirty Thesis Show, I have decided to focus on two areas:
My preliminary work on #2 has yielded mixed results, which is why I’m first focusing on creating a contained set of articles. I currently have 10 great articles on the site (I’ve removed the bad ones). Each of these articles contains approximately 20 links to other articles, and these links are currently broken. I am going to create a page for each of these broken links, approximately 200 in total, and then remove the broken links from those new pages. This means I’ll have over 200 pages in my final site, and a user can click around without worrying about landing on a broken link. This is achievable in the next few weeks and will give cohesion to the site.
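The bookkeeping for this step can be sketched in a few lines of Python. This is only an illustration, not the site's actual code: it assumes the articles are stored as wiki markup with `[[Title]]` or `[[Title|label]]` links, and the names `missing_titles` and `unlink_missing` are hypothetical. It also compares titles exactly, ignoring case and section anchors.

```python
import re

# Matches [[Target]] and [[Target|label]]; group 1 is the target, group 2 the label.
WIKILINK = re.compile(r"\[\[([^\[\]|]+)(?:\|([^\[\]]*))?\]\]")

def link_targets(markup):
    """Return the set of article titles linked from one page of wiki markup."""
    return {m.group(1).strip() for m in WIKILINK.finditer(markup)}

def missing_titles(pages):
    """pages maps title -> markup; return linked titles that have no page yet."""
    linked = set()
    for markup in pages.values():
        linked |= link_targets(markup)
    return linked - set(pages)

def unlink_missing(markup, existing):
    """Replace links to titles outside `existing` with their plain label text."""
    def repl(match):
        title = match.group(1).strip()
        if title in existing:
            return match.group(0)  # keep working links untouched
        return match.group(2) or title
    return WIKILINK.sub(repl, markup)
```

Running `missing_titles` over the 10 starting articles would give the list of pages to create, and `unlink_missing` could then be applied to each new page so that it only links to pages that exist.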
After I accomplish this, if time allows, I want to build the functionality to generate articles on the fly when you click on a broken link. The problem is that every time I create a new page, it has many links to other articles, and those links are usually broken. So I want to build something that will, when you click on a broken link, detect that the article is missing and then generate the article while the page is loading.
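The "detect that the article is missing, then generate it" logic might look roughly like the sketch below. Everything here is an assumption made for illustration: articles are stored as one file per title in a directory, `generate` stands in for whatever function runs the machine learning script, and `get_or_generate` and `safe_name` are hypothetical names.

```python
import re
from pathlib import Path

def safe_name(title):
    """Turn an article title into a filesystem-safe file name (may collide)."""
    return re.sub(r"[^A-Za-z0-9_-]", "_", title)

def get_or_generate(title, cache_dir, generate):
    """Return the stored article for `title`, generating and saving it if missing."""
    path = Path(cache_dir) / (safe_name(title) + ".wiki")
    if path.exists():
        # The article was already written or generated earlier: serve it as-is.
        return path.read_text(encoding="utf-8")
    text = generate(title)  # the slow step; it runs at most once per title
    path.write_text(text, encoding="utf-8")
    return text
```

Saving the result means only the first visitor to a missing page pays the generation cost; later visits read the stored file.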
In my preliminary work, I have found that the process takes much longer than I can reasonably expect a user to wait (around 1-2 minutes). The problem is that my approach is to run the machine learning script (which generates the page) on the server side of a shared hosting server, which is not a fast or elegant solution. I’m not sure of a better way, although I’m working to minimize the amount of computational effort so that the process is faster.
An additional problem is that the output requires a fair amount of editing; it does not produce a perfectly clean Wikipedia article. First, the script creates output of a given length (e.g. 1MB, 20MB, 1GB, etc.), and this output may contain half an article or 20 articles. Second, the output usually needs a bit of cleanup so that the wiki markup is legible, repetitive bits are deleted, etc. I have resisted creating a script to do this cleanup work so far because every time I sit down to write it, I realize how many editorial tasks I am actually performing when I edit and how much of a pain it would be to write them as a program.
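As a rough illustration of the two most mechanical parts of that cleanup, here is a minimal Python sketch. It assumes, purely for the example, that each generated article is wrapped in `<page>` and `</page>` markers; the real output may be delimited differently, so the markers are parameters. The function names are hypothetical, and the sketch does not attempt any of the editorial judgment described above.

```python
import re

def complete_articles(raw, start="<page>", end="</page>"):
    """Return only the chunks that have both a start and an end marker.

    A half article at the beginning or end of the output is dropped.
    """
    pattern = re.compile(re.escape(start) + r"(.*?)" + re.escape(end), re.DOTALL)
    return [m.group(1).strip() for m in pattern.finditer(raw)]

def drop_repeated_lines(text):
    """Collapse runs of identical consecutive non-blank lines into one line."""
    out = []
    for line in text.splitlines():
        if line.strip() and out and line == out[-1]:
            continue  # same as the previous kept line: skip the repeat
        out.append(line)
    return "\n".join(out)
```

For example, `complete_articles("tail of one</page><page>A\nA\nB</page><page>half")` keeps only the middle article, and `drop_repeated_lines` then reduces its doubled line to a single one.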