Reddit Dataset Update

Recently, Gaffney and Matias posted their findings on missing data in the pushshift.io reddit dataset to arXiv. Their careful examination showed that some data might be missing from this dataset; in particular, they estimated that 0.043% of comments and 0.65% of submissions may be missing. They then highlight several classes of studies that rely upon an assumption of "full" comment/post history.

Among these were our WWW 2017 and ICWSM 2016 works. Inspired by their examination and subsequent conversations, we decided to undertake the following:

  1. Rescrape the missing data from the reddit API
  2. Re-run key experiments from our WWW 2017 paper, which Gaffney and Matias identified as "high risk"
  3. Re-run key experiments from our ICWSM 2016 paper, which Gaffney and Matias identified as "highest risk"

In short, we found:

  1. Based on Gaffney and Matias' sequential-ID analysis, we were able to add 1.1% more posts and 0.125% more comments by re-querying the reddit API. Of all the ID gaps identifiable under the sequential-ID theory, roughly 10% of post/comment IDs were available via the reddit API.
  2. We were able to replicate the key experiments from our WWW 2017 paper and report no substantial differences between the new results and the published results.
  3. We were able to replicate the key experiments from our ICWSM 2016 paper and report no substantial differences between the new results and the published results.

Part 1: Rescraping Data

We assumed a sequential-ID theory for posts and comments and attempted to fill in the gaps in the dataset. We have some reason to believe that this sequential-ID assumption will not recover all posts/comments (e.g., perhaps the posts never existed, or were fully deleted by reddit moderators); these reasons are detailed here. While our previous work does not rely upon exactly the same dataset examined by Gaffney and Matias (our posts derive from an earlier scrape by Jason Baumgartner, and the dates range from 2007 to 2014), we did the following:

We have released (or will soon release) updated versions of the datasets from our previous work, filled in with the additional posts/comments we were able to find.
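The gap-finding step implied by the sequential-ID assumption can be sketched as follows. This is a minimal illustration, not the code we ran: it assumes that reddit IDs are base-36 strings that increase sequentially, and that candidate IDs are looked up by prefixed names ("t3_" for posts, "t1_" for comments). The function names and the batch size of 100 are hypothetical choices made for this example.

```python
import string

ALPHABET = string.digits + string.ascii_lowercase  # base-36 digits: 0-9 then a-z


def to_int(base36_id: str) -> int:
    """Decode a base-36 ID string such as '5yba3' into an integer."""
    return int(base36_id, 36)


def to_base36(n: int) -> str:
    """Encode a non-negative integer as a lowercase base-36 string."""
    if n == 0:
        return "0"
    digits = []
    while n > 0:
        n, rem = divmod(n, 36)
        digits.append(ALPHABET[rem])
    return "".join(reversed(digits))


def missing_ids(observed_ids):
    """Return the IDs absent between the smallest and largest observed ID."""
    seen = {to_int(i) for i in observed_ids}
    if not seen:
        return []
    return [to_base36(n) for n in range(min(seen), max(seen) + 1) if n not in seen]


def batches(ids, prefix, size=100):
    """Prefix each ID (e.g. 't3_' + id) and group the names `size` at a time."""
    names = [prefix + i for i in ids]
    return [names[k:k + size] for k in range(0, len(names), size)]
```

Each batch of candidate names would then be sent to the API; an ID that returns nothing may never have existed or may have been fully deleted, which is consistent with only a fraction of gap IDs being recoverable.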

Part 2: Replicating WWW 2017

Here is a link to our WWW 2017 project and paper.

Gaffney and Matias point out that works which rely upon tying comments to their associated submission are at "high risk" of being impacted by missing data. A majority of the work in our WWW 2017 paper deals with tight, time-controlled analyses of content: we conduct ranking experiments on pairs of posts within a short time window of each other, and include experiments on a fully held-out test set, evaluated exactly once, to validate the generalizability of our content-only models.

We believe the discussion section of Gaffney and Matias identifies our user-feature baselines as "high risk"; this is the primary part of our study that relies upon commenting information. So -- to be careful -- we re-ran all of our user-feature baseline experiments using the same 10-fold cross validation splits as the original paper. The results are included below. In short -- the replicated results lie within the 95% CI of the originally reported results. So -- while the user features were not the primary focus of the study -- filling in the missing data does not affect the reported results. Note, however, that our observations do not indicate that all studies that tie comments and posts are robust to missing data.

Part 2.1: The Replicated WWW 2017 Results

"Type" user features (table 4), averaged over 10-fold cross validation; mean 95% CI is +/- .5

                   aww   pics  cats  MA    FP    RL
Published Result   50.6  51.2  50.7  52.8  51.8  56.1
Replicated Result  50.7  50.8  50.9  52.9  51.4  56.3

"Activity" user features (table 4), averaged over 10-fold cross validation; mean 95% CI is +/- .5

                   aww   pics  cats  MA    FP    RL
Published Result   51.1  53.6  52.8  55.0  53.9  60.6
Replicated Result  51.0  53.8  52.7  55.1  53.7  60.5

"Quality" user features (table 4), averaged over 10-fold cross validation; mean 95% CI is +/- .5

                   aww   pics  cats  MA    FP    RL
Published Result   54.7  55.5  52.9  60.7  55.5  67.3
Replicated Result  54.6  55.5  53.0  60.8  55.3  67.4

"All" user features (table 5), averaged over 10-fold cross validation; mean 95% CI is +/- .5

                   aww   pics  cats  MA    FP    RL
Published Result   56.3  55.3  54.6  60.9  56.0  68.4
Replicated Result  56.3  55.1  54.8  61.1  55.8  68.5
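The comparison in the four tables above can be checked mechanically. The sketch below is a simplified check under one stated assumption: it treats the mean 95% CI half-width of 0.5 as a uniform band around every published value, whereas the per-cell intervals in the paper may differ. The numbers are copied from the tables; the function and variable names are hypothetical.

```python
COLUMNS = ["aww", "pics", "cats", "MA", "FP", "RL"]
HALF_WIDTH = 0.5  # mean 95% CI half-width quoted in the table captions

published = {
    "Type":     [50.6, 51.2, 50.7, 52.8, 51.8, 56.1],
    "Activity": [51.1, 53.6, 52.8, 55.0, 53.9, 60.6],
    "Quality":  [54.7, 55.5, 52.9, 60.7, 55.5, 67.3],
    "All":      [56.3, 55.3, 54.6, 60.9, 56.0, 68.4],
}
replicated = {
    "Type":     [50.7, 50.8, 50.9, 52.9, 51.4, 56.3],
    "Activity": [51.0, 53.8, 52.7, 55.1, 53.7, 60.5],
    "Quality":  [54.6, 55.5, 53.0, 60.8, 55.3, 67.4],
    "All":      [56.3, 55.1, 54.8, 61.1, 55.8, 68.5],
}


def max_abs_diff(pub, rep):
    """Largest absolute difference between paired published/replicated values."""
    return max(abs(p - r) for p, r in zip(pub, rep))


def outside_band(pub_tables, rep_tables, half_width):
    """List (feature set, column, published, replicated) cells outside the band."""
    flagged = []
    for name in pub_tables:
        for col, p, r in zip(COLUMNS, pub_tables[name], rep_tables[name]):
            if abs(p - r) > half_width:
                flagged.append((name, col, p, r))
    return flagged


flagged = outside_band(published, replicated, HALF_WIDTH)
largest = max(max_abs_diff(published[k], replicated[k]) for k in published)
print("cells outside band:", flagged)
print("largest absolute difference:", round(largest, 1))
```

Under this simplification no cell falls outside the band; the largest gaps occur in the "Type" features.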

Part 3: Replicating ICWSM 2016

Here is a link to our ICWSM 2016 project and paper.

Gaffney and Matias identify the "highest risk" results most implicated by missing data as those related to tracing user trajectories by computing a list of all posts/comments made by a given user. Indeed, our ICWSM 2016 paper contains some results that assume access to a user's full activity history. Missing posts/comments undoubtedly violate this assumption.

We believe the results in our work most implicated by missing data are our controlled user-pairing experiments, whose results are summarized by Figures 6 and 7 in the original paper. We re-did our pairing experiments and re-created Figures 6 and 7 with the new dataset. The results, included below, are quite similar. For example, for Figure 6a, the reported results are within the 95% confidence intervals of the replicated results. Similarly, the shape of the activity-vs-exploration-effect plots in Figure 7 is mirrored by the replicated results. In short -- it doesn't seem that filling in the gaps in the data affects these results. Note, however, that our observations do not indicate that all user trajectory studies are robust to missing data.

Part 3.1: The Replicated ICWSM 2016 Results

Replication of figure 6 from our ICWSM 2016 work:

[Figure: replication of Figure 6]

Replication of figure 7 from our ICWSM 2016 work:

[Figure: replication of Figure 7]

Part 4: Thanks!

We greatly appreciate Gaffney and Matias' work on the missing data matter, and will certainly better qualify the potential shortcomings of this reddit dataset in future work. We would also like to thank Jason Baumgartner of pushshift.io; his scraping work has enabled an increasing number of excellent studies.