What makes an ArchiveOfOurOwn story successful?

Brian Hoy

1. Introduction

ArchiveOfOurOwn (Ao3) is a website where users can post their own textual works, whether that be a poem, story, or even lyrics to a song. Most of these are derivative works ("fanfictions") based on an already existing intellectual property. The website features an intricate tagging system, allowing the user to assign a number of tags relating to the content of their work. There are several different types of tags, including ratings, content warnings, and fandom, character, relationship, and freeform tags.

Once a story has been posted, users can interact with it by bookmarking it, giving "kudos" (the equivalent of a like on Twitter), or commenting, all of which appear in the work's stats. Here is a screenshot of a work showing the stats and tags:

Goal

My goal is to uncover interesting statistics about Archive of Our Own and see if we can create a machine learning model that predicts the number of kudos a story will receive. We will go through the entire data pipeline, from data collection to hypothesis testing and analysis.

2. Data Scraping

What kind of data are we scraping?

First off, we need to establish what kind of data we want to scrape. My goal is to create a database where each row is an archive work. Since there is no way I can scrape the entire Archive of Our Own database, I've decided to scrape about four thousand stories from the Marvel fandom. I've also filtered out explicit stories featuring non-school-appropriate themes, and excluded languages besides English, as I am not interested in analyzing those types of stories. Archive of Our Own encodes search filters in the URL, so I was able to generate a URL just by clicking some filters. The URL we are using for this project is:

https://archiveofourown.org/works?utf8=%E2%9C%93&work_search%5Bsort_column%5D=authors_to_sort_on&work_search%5Bother_tag_names%5D=&exclude_work_search%5Brating_ids%5D%5B%5D=13&exclude_work_search%5Barchive_warning_ids%5D%5B%5D=19&exclude_work_search%5Barchive_warning_ids%5D%5B%5D=20&work_search%5Bexcluded_tag_names%5D=&work_search%5Bcrossover%5D=&work_search%5Bcomplete%5D=&work_search%5Bwords_from%5D=&work_search%5Bwords_to%5D=&work_search%5Bdate_from%5D=&work_search%5Bdate_to%5D=&work_search%5Bquery%5D=&work_search%5Blanguage_id%5D=en&commit=Sort+and+Filter&tag_id=Marvel

Columns of the rows in our database

I want to include as much metadata about each story as is reasonably possible in each row. For each work we'll scrape attributes like the work's ID, title, word count, number of chapters, counts of its freeform, relationship, and character tags, its kudos, comments, bookmarks, and hits, and the date it was last updated, along with a few statistics computed from chapter 1's text: its word count, the number of unique words in its first 1000 words, and the number of LanguageTool mistakes.

We use Python's BeautifulSoup and requests libraries to retrieve most of these columns from the works listed on the web pages at the URL above. The last few require an additional web request to grab chapter 1's text, with the URL generated from the ID of the work. We use the open source LanguageTool (and a Python wrapper, language-tool) to count the number of mistakes in the first chapter. We compile all the relevant data into a Pandas dataframe for easy access later.
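
As a rough sketch of what this parsing looks like, here is a minimal example that pulls a few columns out of a single made-up work "blurb"; the tag and class names mirror Ao3's listing markup as I observed it, but treat them as assumptions to verify against the live page:

```python
from bs4 import BeautifulSoup

# A tiny stand-in for one work "blurb" from a search-results page.
sample_html = """
<li class="work blurb group" id="work_12345">
  <h4 class="heading"><a href="/works/12345">Sample Title</a></h4>
  <dl class="stats">
    <dd class="words">4,321</dd>
    <dd class="kudos"><a href="#">56</a></dd>
    <dd class="hits">789</dd>
  </dl>
</li>
"""

def parse_blurb(blurb):
    """Extract one database row from a single work blurb."""
    def stat(cls):
        dd = blurb.find("dd", class_=cls)
        return int(dd.get_text(strip=True).replace(",", "")) if dd else 0
    return {
        "id": int(blurb["id"].split("_")[1]),
        "title": blurb.find("h4", class_="heading").get_text(strip=True),
        "words": stat("words"),
        "kudos": stat("kudos"),
        "hits": stat("hits"),
    }

soup = BeautifulSoup(sample_html, "html.parser")
rows = [parse_blurb(b) for b in soup.find_all("li", class_="work")]
print(rows[0])
```

In the real scraper, the HTML comes from a requests call to the search URL above rather than a hard-coded string, and each page yields 20 such blurbs.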

Rate Limiting

Archive of Our Own has some pretty aggressive rate limiting, which is entirely reasonable, as I'm sure they want to avoid excessive load on their servers. They do allow scrapers in their policy ("Using bots or scraping is not against our Terms of Service unless it relates to our guidelines against spam or other activities."), but they will nonetheless respond with a 429 status code if you send too many requests. I found that using time.sleep to limit myself to one request every 1.1 seconds mostly avoided this response, but sometimes I would still get it.

Unfortunately, this also meant that I could only scrape about 4000 or so works before I had to start performing data analysis due to time constraints.
If you're interested in reading about Rate Limiting, check out this website for a nice overview of how servers implement it.
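
The throttling logic can be sketched as a small wrapper. The function name, retry count, and doubling backoff are my own choices rather than anything Ao3 prescribes, and the `get` callable is injected so the loop can be exercised without hitting the network:

```python
import time

REQUEST_INTERVAL = 1.1  # seconds; found by trial and error to mostly avoid 429s

def fetch_with_backoff(get, url, max_retries=5, interval=REQUEST_INTERVAL):
    """Call get(url), pausing between requests and doubling the pause on a 429.

    `get` is any callable returning an object with a .status_code:
    requests.get in the real scraper, or a stub for testing.
    """
    for _ in range(max_retries):
        time.sleep(interval)
        resp = get(url)
        if resp.status_code != 429:
            return resp
        interval *= 2  # back off: the server is telling us to slow down
    raise RuntimeError(f"still rate-limited after {max_retries} tries: {url}")
```

In the scraper itself this would be called as `fetch_with_backoff(requests.get, page_url)`.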

SQLite

SQLite is a lightweight SQL database engine that stores an entire database in a single file. Below, we use SQLite to store the dataframe we generated using the processPage function we made. This allows us to load the database later, after we've shut down the Python kernel. We can also stop scraping with a KeyboardInterrupt and our data will still be saved on disk.
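
A minimal sketch of the save-and-reload round trip, using a stand-in dataframe and a hypothetical ao3_works.db filename:

```python
import sqlite3
import pandas as pd

# A stand-in dataframe; in the real pipeline this is the output of processPage.
df = pd.DataFrame({"id": [1, 2], "kudos": [10, 25], "words": [5000, 12000]})

conn = sqlite3.connect("ao3_works.db")  # hypothetical filename
# Write the rows out; replacing the table makes re-running the cell safe.
df.to_sql("works", conn, if_exists="replace", index=False)

# Later — even after a KeyboardInterrupt or a kernel restart — reload with:
restored = pd.read_sql("SELECT * FROM works", conn)
conn.close()
print(restored)
```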

Resulting Dataframe

After scraping all this data, this is the dataframe we're left with:

3. Exploratory Analysis and Data Visualization

Now, let's plot our data and see if we can find anything interesting in the distributions of values across our columns. Let's start by creating histograms of each column.
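
A sketch of the histogram step, on synthetic stand-in columns (the real ones come from the scraped dataframe); log-spaced bins and a log x-axis make the heavy-tailed counts readable:

```python
import numpy as np
import pandas as pd
import matplotlib
matplotlib.use("Agg")  # render to file; no display needed
import matplotlib.pyplot as plt

# Synthetic stand-ins for a few heavy-tailed scraped columns.
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "kudos": rng.lognormal(4, 1.5, 500).astype(int) + 1,
    "words": rng.lognormal(8, 1.2, 500).astype(int) + 1,
    "chapters": rng.geometric(0.3, 500),
})

# One histogram per column, with geometrically spaced bins.
bins = np.geomspace(1, df.to_numpy().max(), 30)
axes = df.hist(bins=bins, figsize=(12, 3), layout=(1, 3))
for ax in axes.ravel():
    ax.set_xscale("log")
plt.tight_layout()
plt.savefig("histograms.png")
```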

Now, let's make a correlation matrix using Seaborn and pandas to see which variables are correlated with one another. Heatmaps are a great way to visualize this; for more information about Seaborn's heatmaps, check out this link.
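
Here is roughly what that looks like, on synthetic stand-in columns where kudos is driven by hits and words is independent of both:

```python
import numpy as np
import pandas as pd
import matplotlib
matplotlib.use("Agg")
import matplotlib.pyplot as plt
import seaborn as sns

# Synthetic stand-in: hits drives kudos, words is unrelated.
rng = np.random.default_rng(1)
hits = rng.lognormal(6, 1, 300)
df = pd.DataFrame({
    "hits": hits,
    "kudos": hits * 0.08 + rng.normal(0, 5, 300),
    "words": rng.lognormal(8, 1, 300),
})

corr = df.corr()
sns.heatmap(corr, annot=True, fmt=".2f", cmap="coolwarm", vmin=-1, vmax=1)
plt.tight_layout()
plt.savefig("corr_matrix.png")
print(corr.round(2))
```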

Observations

Frequency Plots

  1. Almost all the histograms are best viewed with a logarithmic scale, i.e. there are a lot of outliers in almost every category. Take chapters, for instance. There are many works with fewer than 10 chapters, but there are a few with over 100.
  2. Some histograms have a normal distribution while others, like unique words per first 1000, are very right-skewed (I am very curious to read the works with fewer than 100 unique words in their first 1000).
  3. People tend to use freeform tags much more than relationship and character tags.
  4. The distribution of times that works were last updated is very right-skewed, indicating that most of the scraped stories are recent.

Correlation Matrix

  1. We're interested in predicting Kudos. Obviously kudos is strongly correlated with other user-interaction related statistics, like comments, hits, and bookmarks, but we will try not to use those in our machine learning model.
  2. Interestingly, comments is more strongly correlated with the number of chapters/word count than with the number of kudos.
  3. Among the non-user-interaction variables, words, ch1_unique_words_in_first_1000, ch_1_words, chapters, and freeforms seem to have the largest correlation with kudos. This may be because when authors are constantly updating their stories and adding more words, more users see them on Archive of Our Own's recent page.
  4. Kudos seems to have a negative correlation with time, i.e. older stories tend to have more Kudos.

4. Analysis, Hypothesis Testing, and Machine Learning

For this part of the data science pipeline, we're going to use regression algorithms to predict the number of kudos, comments, bookmarks, and hits using the other variables.

Predicting Kudos using Author-Controlled variables

As I'm not sure which model will perform best, let's try all of SciKit-Learn's available models with default parameters and sort them according to the average of their 5-fold cross-validation scores (R^2 values). Let's also make sure to normalize the input features using SciKit-Learn's StandardScaler. In addition, let's only include features that the author can control, like word count or number of freeforms. If we can find a trend, then this might be useful for authors who are looking to write more successful stories.

Okay, it appears every model performs fairly poorly. A well-performing regression model would have an R^2 greater than 0.5, but the highest we achieved was GammaRegressor at 0.06. Let's look at the residual plots of the top 5 models. Residual plots are a great way to assess performance: if the residuals aren't symmetric or don't have a mean of zero, you know you might not have the best model. See this link to learn more about Scikit's Yellowbrick residual plot API: Yellowbrick Residuals Plot.

Unfortunately, none of the residual plots indicate a reliable or useful regression model. The GammaRegressor model, for example, predicts the same value every time. The others show a clearly non-random, non-symmetric distribution of residuals, and also have very low R^2 values. It's clear that this approach isn't going to work, and we need to rethink our model. Due to time constraints, I was not able to rewrite the scraper to include more data, but some ideas I have are collecting more works, including the tags themselves as categorical features, and extracting stylistic features from the text with natural language processing; I expand on these in section 5.

Using non-author controlled variables to predict kudos

It is clear that comments, hits, kudos, and bookmarks are closely correlated by looking at the correlation matrix above. Since we can't predict kudos using the variables that the author can control, let's try predicting it using other statistics about the work. Note, this will not be terribly useful for authors looking to maximize their kudos, as they cannot control the variables we'll be using in our regression model.

Let's use SciKit-Learn's LinearRegression class along with feature scaling using its Pipeline class. We will split our data into a train set and a test set and visualize the residuals using YellowBrick again.
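
A sketch of that pipeline on synthetic stand-in stats, checking the residuals numerically here rather than with Yellowbrick's plot:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-ins for the interaction stats: kudos is driven by
# hits and bookmarks plus noise, roughly like the real data.
rng = np.random.default_rng(0)
hits = rng.lognormal(6, 1, 400)
bookmarks = hits * 0.02 + rng.normal(0, 2, 400)
comments = hits * 0.01 + rng.normal(0, 1, 400)
X = np.column_stack([hits, bookmarks, comments])
kudos = hits * 0.08 + bookmarks * 0.5 + rng.normal(0, 3, 400)

X_train, X_test, y_train, y_test = train_test_split(X, kudos, random_state=0)
model = make_pipeline(StandardScaler(), LinearRegression())
model.fit(X_train, y_train)

r2 = model.score(X_test, y_test)
residuals = y_test - model.predict(X_test)
print(f"test R^2: {r2:.3f}, mean residual: {residuals.mean():.3f}")
```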

The model seems to work fairly well. The residuals are symmetric over the x-axis and appear to have a mean of 0. In addition, the R^2 value for the test set is great, at over 0.9! It is interesting to see that words has a negative coefficient, i.e. works with more words are predicted to have fewer kudos. This may be a case of overfitting, but the model performs well on the test set regardless.

Let's see if we can do the same for comments, bookmarks, and hits.
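
The loop over the four statistics might look like this, again on synthetic stand-ins, holding out each statistic in turn and predicting it from the other three:

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in: all four stats are noisy functions of popularity.
rng = np.random.default_rng(1)
hits = rng.lognormal(6, 1, 400)
df = pd.DataFrame({
    "hits": hits,
    "kudos": hits * 0.08 + rng.normal(0, 3, 400),
    "comments": hits * 0.01 + rng.normal(0, 1, 400),
    "bookmarks": hits * 0.02 + rng.normal(0, 2, 400),
})

stats = ["kudos", "comments", "bookmarks", "hits"]
results = {}
for target in stats:
    # Predict each statistic from the other three.
    X = df[[c for c in stats if c != target]]
    model = make_pipeline(StandardScaler(), LinearRegression())
    results[target] = cross_val_score(model, X, df[target], cv=5, scoring="r2").mean()

for target, r2 in results.items():
    print(f"{target:10s} R^2 = {r2:.3f}")
```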

Observations:

5. Insights

At the beginning of this project I set out to see what attributes make a story successful on Ao3, and what authors could do to improve the odds that their story gets more hits and kudos. Unfortunately, we could not reliably predict a work's performance based on author-controlled variables. We can see from the correlation matrix that having more words, more chapters, more tags, and more unique words in the first chapter may help, as they all have a positive correlation with kudos. In addition, we found that we can predict the amount of kudos fairly well by using variables the author can't control, like bookmarks and hits.

The LanguageTool addition didn't seem to provide any insights; the correlation values were too low to be significant.

What could be improved to create a better model predicting kudos

If I were to try a project like this again, there are many things that could be improved. First of all, we only collected 4000 stories, and they were all from the Marvel franchise. There are millions of stories on Archive of Our Own, and if we could process all of them, there is a good chance we would find new insights. Second of all, we could include more categorical variables, like the actual tags we find, instead of just mapping them to a number; then we could use one-hot encoding to train our model. Finally, we could extract more features from the text itself using natural language processing techniques, capturing more of an author's style. A few ideas would be measuring the use of active verbs over total verbs, or the number of proper nouns or adjectives used per word. If you'd like to learn more about natural language processing, here is a great article giving an overview of the field.
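
For the tag idea, scikit-learn's MultiLabelBinarizer does exactly this kind of one-hot encoding of variable-length tag lists; the tags below are made up for illustration:

```python
import pandas as pd
from sklearn.preprocessing import MultiLabelBinarizer

# Hypothetical freeform-tag lists for three works.
tags = [
    ["Fluff", "Alternate Universe"],
    ["Angst"],
    ["Fluff", "Angst", "Slow Burn"],
]

# One 0/1 column per distinct tag; each row marks which tags a work has.
mlb = MultiLabelBinarizer()
encoded = pd.DataFrame(mlb.fit_transform(tags), columns=mlb.classes_)
print(encoded)
```

The resulting columns could be appended to the numeric features and fed to the same regression models as before.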