Austin R. Benson datasets

Data!

This is a collection of datasets from my research projects. I strive to make the data used in my research easily accessible. If you encounter problems, please email me at arb@cs.cornell.edu.

Temporal higher-order networks (hypergraphs)

Each of these datasets is a timestamped sequence of simplices, where a simplex is a set of k nodes from some vertex set. The datasets also contain weighted projected graphs, where the weight is the number of times that two nodes co-appear in a simplex. These datasets were used in the following paper:

Simplicial closure and higher-order link prediction.
Austin R. Benson, Rediet Abebe, Michael T. Schaub, Ali Jadbabaie, and Jon Kleinberg.
Proceedings of the National Academy of Sciences (PNAS), 2018.
Code available at github.com/arbenson/ScHoLP-Tutorial.

Dataset pages:

coauth-DBLP: co-authorship on DBLP papers.
coauth-MAG-Geology: co-authorship on Geology papers.
coauth-MAG-History: co-authorship on History papers.
tags-stack-overflow: sets of tags applied to questions on stackoverflow.com.
tags-math-sx: sets of tags applied to questions on math.stackexchange.com.
tags-ask-ubuntu: sets of tags applied to questions on askubuntu.com.
threads-stack-overflow: sets of users asking and answering questions on threads on stackoverflow.com.
threads-math-sx: sets of users asking and answering questions on threads on math.stackexchange.com.
threads-ask-ubuntu: sets of users asking and answering questions on threads on askubuntu.com.
NDC-substances: sets of substances making up drugs.
NDC-classes: sets of classifications applied to drugs.
DAWN: sets of drugs used by patients recorded in emergency room visits.
congress-bills: sets of congresspersons cosponsoring bills.
email-Eu: sets of email addresses on emails.
email-Enron: sets of email addresses on emails.
contact-high-school: groups of people in contact at a high school.
contact-primary-school: groups of people in contact at a primary school.
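
As a rough illustration of the weighted projected graph described above, the following Python sketch counts how many simplices each pair of nodes co-appears in. The in-memory representation (a list of (timestamp, nodes) pairs) and the toy values are assumptions made for this example; they are not the file format of the datasets.

```python
from collections import Counter
from itertools import combinations


def project(simplices):
    """Return {(u, v): weight}, where weight is the number of simplices
    that contain both u and v (with u < v as the canonical key)."""
    weights = Counter()
    for _timestamp, nodes in simplices:
        # Deduplicate and sort so each unordered pair has a single key.
        for u, v in combinations(sorted(set(nodes)), 2):
            weights[(u, v)] += 1
    return weights


# Toy data, made up for illustration: (timestamp, nodes in the simplex).
simplices = [
    (1, [1, 2, 3]),
    (2, [2, 3]),
    (5, [1, 3, 4]),
]

weights = project(simplices)
for (u, v), w in sorted(weights.items()):
    print(u, v, w)
```

Timestamps are ignored here because the projection only depends on co-appearance; filtering the list by timestamp before calling the hypothetical project function would give the projected graph for a time window.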

Hypergraphs with labeled nodes

Each of these datasets is a hypergraph where the nodes are labeled into discrete classes. These can be used for community detection or node label prediction experiments. We used them in the following papers:

Generative hypergraph clustering: from blockmodels to modularity.
Philip S. Chodrow, Nate Veldt, and Austin R. Benson.
Science Advances, 2021.
Code available at github.com/PhilChodrow/HypergraphModularity.
Minimizing Localized Ratio Cut Objectives in Hypergraphs.
Nate Veldt, Austin R. Benson, and Jon Kleinberg.
Proceedings of the ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD), 2020.
Code available at github.com/nveldt/HypergraphFlowClustering.
Clustering in graphs and hypergraphs with categorical edge labels.
Ilya Amburg, Nate Veldt, and Austin R. Benson.
Proceedings of the Web Conference (WWW), 2020.
Code available at github.com/nveldt/CategoricalEdgeClustering.

Dataset pages:

stackoverflow-answers: sets of questions answered by users on Stack Overflow, where labels are question tags.
mathoverflow-answers: sets of questions answered by users on MathOverflow, where labels are question tags.
walmart-trips: sets of products bought on Walmart shopping trips, where labels are departments of products.
amazon-reviews: sets of products reviewed by users on Amazon, where labels are product categories.
trivago-clicks: sets of hotels clicked on in a Web browsing session, where labels are the countries of the accommodations.
contact-primary-school: sets of students in proximity, where labels are classrooms.
contact-high-school: sets of students in proximity, where labels are classrooms.
senate-bills: bill cosponsorship in the US Senate, where labels are political affiliation.
house-bills: bill cosponsorship in the US House of Representatives, where labels are political affiliation.
senate-committees: committee membership in the US Senate, where labels are political affiliation.
house-committees: committee membership in the US House of Representatives, where labels are political affiliation.

US county networks for node regression

These are networks of US counties, where edges come from physical adjacency or Facebook connectedness. The nodes are accompanied by various covariates, such as demographic features, climate measurements, and election statistics, depending on the dataset. We used these for transductive node regression experiments. Some of the datasets have demographic features and election statistics from both 2012 and 2016, and we used these for inductive learning experiments. The data was used in the following papers:

A Unifying Generative Model for Graph Learning Algorithms: Label Propagation, Graph Convolutions, and Combinations.
Junteng Jia and Austin R. Benson.
arXiv:2101.07730, 2021.
Code available at github.com/000Justin000/GaussianMRF.
Residual Correlation in Graph Neural Network Regression.
Junteng Jia and Austin R. Benson.
Proceedings of the ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD), 2020.
Code available at github.com/000Justin000/gnn-residual-correlation.

Dataset pages:

US-county-demos: demographics and elections, physical adjacency.
US-county-fb: demographics and elections, social connections.
CDC-climate: climate, physical adjacency.
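
To make the transductive node regression setting above concrete, here is a minimal Python sketch: the whole graph is known, a node-level outcome is observed on only some nodes, and the task is to predict it on the rest. The neighbor-averaging rule is a simple baseline chosen for illustration, not the method of the papers above, and the graph, node names, and values are made up.

```python
def neighbor_mean_predictions(adjacency, observed):
    """Predict the outcome at each unobserved node as the mean of the
    observed outcomes among its neighbors. Nodes with no observed
    neighbor fall back to the global mean of the observed outcomes."""
    fallback = sum(observed.values()) / len(observed)
    predictions = {}
    for node, neighbors in adjacency.items():
        if node in observed:
            continue  # outcome already known; nothing to predict
        values = [observed[n] for n in neighbors if n in observed]
        predictions[node] = sum(values) / len(values) if values else fallback
    return predictions


# Toy adjacency list (hypothetical counties A-D) and partially observed outcomes.
adjacency = {
    "A": ["B", "C"],
    "B": ["A", "C"],
    "C": ["A", "B", "D"],
    "D": ["C"],
}
observed = {"A": 1.0, "B": 3.0, "D": 5.0}

predictions = neighbor_mean_predictions(adjacency, observed)
print(predictions)
```

This baseline uses only the graph and the observed outcomes; it ignores the node covariates that come with the datasets.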

Hypergraphs with categorical edge labels

Each of these datasets is a hypergraph (or an ordinary graph) in which each edge has a discrete, categorical label. These datasets were collected for and analyzed in the following papers:

Clustering in graphs and hypergraphs with categorical edge labels.
Ilya Amburg, Nate Veldt, and Austin R. Benson.
Proceedings of the Web Conference (WWW), 2020.
Code available at github.com/nveldt/CategoricalEdgeClustering.
Fair Clustering for Balanced and Diverse Groups.
Ilya Amburg, Nate Veldt, and Austin R. Benson.
arXiv:2006.05645, 2020.
Code available at github.com/ilyaamburg/fair-clustering-for-diverse-and-experienced-groups.

Dataset pages:

cat-edge-Cooking: sets of ingredients in recipes labeled by cuisine type.
cat-edge-DAWN: sets of drugs used by patients recorded in emergency room visits labeled by the most common patient disposition for that combination of drugs.
cat-edge-Walmart-Trips: sets of products bought on Walmart shopping trips categorized by a trip type.
cat-edge-MAG-10: co-authorship with publication venue labels.
cat-edge-Brain: set of brain region coactivation scores based on two categories of measurement.
cat-edge-music-blues-reviews: sets of reviewers categorized by product review type, with set membership given by timestamp similarity.
cat-edge-madison-restaurant-reviews: sets of reviewers categorized by establishment review type, with set membership given by timestamp similarity.
cat-edge-vegas-bars-reviews: sets of reviewers categorized by establishment review type, with set membership given by timestamp similarity.
cat-edge-algebra-questions: sets of users categorized by question tag type, with set membership given by timestamp similarity.
cat-edge-geometry-questions: sets of users categorized by question tag type, with set membership given by timestamp similarity.

Largeish weighted graphs

Each of these datasets is an undirected weighted graph of nontrivial size. These datasets were used in the following paper:

Retrieving Top Weighted Triangles in Graphs.
Raunak Kumar, Paul Liu, Moses Charikar, and Austin R. Benson.
Proceedings of the ACM International Conference on Web Search and Data Mining (WSDM), 2020.
Code available at github.com/raunakkmr/Retrieving-top-weighted-triangles-in-graphs.

Dataset pages:

colisten-Spotify: inter-session plays of songs on Spotify.
coauth-MAG: co-authorship from the Microsoft Academic Graph.
coauth-AMiner: co-authorship from AMiner.

Graphs and hypergraphs with core-fringe structure

Each of these datasets is a graph or hypergraph, where the nodes are labeled as "core" or "fringe" according to the data collection process. Specifically, all of the graphs measured communication involving a set of nodes, and this set of nodes serves as the core. This induces what we call "core-fringe" structure in the network. We studied how well one can recover the core-labeled nodes from the network structure. In this setup, the core nodes form a "planted vertex cover" in the graph case and a "planted hitting set" in the hypergraph case. We studied these datasets in the following papers:

Found Graph Data and Planted Vertex Covers.
Austin R. Benson and Jon Kleinberg.
Advances in Neural Information Processing Systems (NeurIPS), 2018.
Code available at github.com/arbenson/FGDnPVC.
Planted Hitting Set Recovery in Hypergraphs.
Ilya Amburg, Jon Kleinberg, and Austin R. Benson.
Journal of Physics: Complexity (Special Issue on Higher-Order Structures in Networks and Network Dynamical Systems), 2021.
Code available at github.com/ilyaamburg/Hypergraph-Planted-Hitting-Set-Recovery.

Dataset pages:

pvc-call-Reality: phone calls made and received by participants in the Reality Mining project.
pvc-text-Reality: SMS texts sent and received by participants in the Reality Mining project.
pvc-email-W3C: email on W3C mailing lists (graph).
pvc-email-Enron: email involving Enron employees (graph).
phs-email-W3C: email on W3C mailing lists (hypergraph).
phs-email-Enron: email involving Enron employees (hypergraph).
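
The "planted vertex cover" and "planted hitting set" properties described above amount to the same check: every edge or hyperedge must contain at least one core node. A minimal Python sketch of that check follows, with made-up node names and edges; the function name is hypothetical and not taken from the linked code.

```python
def is_hitting_set(core, hyperedges):
    """Return True if every hyperedge contains at least one core node.
    When all hyperedges have size 2, this is the vertex cover condition."""
    core = set(core)
    return all(core & set(edge) for edge in hyperedges)


# Toy graph: every edge touches the core {"a", "b"}.
core = {"a", "b"}
graph_edges = [("a", "x"), ("b", "y"), ("a", "b")]
print(is_hitting_set(core, graph_edges))

# Toy hypergraph: the second hyperedge has only fringe nodes, so the check fails.
hyperedges = [{"a", "x", "y"}, {"x", "y"}]
print(is_hitting_set(core, hyperedges))
```

This only verifies the structural property; recovering an unknown core from the network structure is the harder problem studied in the papers.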

Stack Exchange co-tagging networks

These are 168 weighted co-tagging networks from Stack Exchange communities, where the weight of edge (i, j) is the number of questions that were annotated with both tags i and j. The data was analyzed in the following paper:

Modeling and Analysis of Tagging Networks in Stack Exchange Communities.
Xiang Fu*, Shangdi Yu*, and Austin R. Benson (*equal contribution).
Journal of Complex Networks, 2019.
Code available at github.com/yushangdi/stack-exchange-cotagging.

Dataset page:

stack-exchange-tags: weighted co-tagging networks for 168 Stack Exchange communities.

Temporal networks

These are temporal networks where (i, j, t) signifies a directed edge from i to j at time t. The networks were used in the following paper:

A sampling framework for counting temporal motifs.
Paul Liu, Austin R. Benson, and Moses Charikar.
Proceedings of the ACM International Conference on Web Search and Data Mining (WSDM), 2019.
Code available at gitlab.com/paul.liu.ubc/sampling-temporal-motifs.

Dataset pages:

temporal-reddit-reply: timestamped comment interactions on Reddit.
temporal-bitcoin: timestamped Bitcoin transactions.
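
As a small sketch of working with (i, j, t) triples like those described above, the following Python snippet sorts directed temporal edges by time and retrieves the edges in a half-open time window using binary search. The in-memory tuples and toy values are assumptions for illustration, not the file format of the datasets.

```python
from bisect import bisect_left

# Toy temporal edges, made up for illustration: (source i, target j, time t).
edges = [(0, 2, 30), (0, 1, 10), (2, 0, 31), (1, 2, 12)]

# Sort once by timestamp so window queries can use binary search.
edges_by_time = sorted(edges, key=lambda e: e[2])
times = [t for _, _, t in edges_by_time]


def edges_in_window(start, end):
    """Return the directed edges (i, j, t) with start <= t < end."""
    lo = bisect_left(times, start)
    hi = bisect_left(times, end)
    return edges_by_time[lo:hi]


print(edges_in_window(10, 30))
print(edges_in_window(30, 32))
```

Because the edges are directed, (0, 2, 30) and (2, 0, 31) are distinct interactions rather than two observations of the same edge.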

Spatial networks

Each of these datasets is a network whose nodes have spatial coordinates. These datasets were used in the following paper:

Detecting Core-Periphery Structure in Spatial Networks.
Junteng Jia and Austin R. Benson.
Proceedings of the ACM International Conference on Web Search and Data Mining (WSDM), 2019.
Code available at github.com/000Justin000/spatial_core_periphery.

Dataset pages:

spatial-Celegans: C. elegans neural network.
spatial-underground-London: Tube transportation network in London.
spatial-fungi: Fungal networks constructed from experimental data.
spatial-OpenFlights: World airline network from openflights.org.
spatial-Brightkite: Brightkite location-based social network.

Sequences of sets

These datasets are sequences of sets. Formally, a dataset consists of a collection of sequences, where each sequence is a time-ordered list of subsets of some universal set. These datasets were used in the following paper:

Sequences of Sets.
Austin R. Benson, Ravi Kumar, and Andrew Tomkins.
Proceedings of the ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD), 2018.
Code available at github.com/arbenson/Sequences-of-Sets.

Dataset pages:

sos-email-Enron-core: sets of recipients on emails from email addresses.
sos-email-Eu-core: sets of recipients on emails from email addresses.
sos-coauth-Business: sets of co-authors on publications from researchers.
sos-coauth-Geology: sets of co-authors on publications from researchers.
sos-tags-mathoverflow: sets of tags on MathOverflow questions from users.
sos-tags-math-sx: sets of tags on Mathematics Stack Exchange questions from users.
sos-contact-high-school: sets of proximity-based contacts from individuals at a high school.
sos-contact-prim-school: sets of proximity-based contacts from individuals at a primary school.
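
One way to picture the formal definition above is as a mapping from a sequence identifier to a time-ordered list of sets. The Python sketch below uses that representation to recover the universal set and to count how often a set exactly repeats an earlier set in the same sequence. The names and toy values are made up, and the repeat count is just one illustrative statistic.

```python
# Toy data, made up for illustration: each value is a time-ordered list of sets.
sequences = {
    "alice": [{"x", "y"}, {"y"}, {"x", "y"}],
    "bob": [{"z"}, {"y", "z"}],
}


def universe(seqs):
    """Union of all elements appearing in any set of any sequence."""
    return set().union(*(s for seq in seqs.values() for s in seq))


def exact_repeats(seq):
    """Number of sets in seq that equal some earlier set in the same sequence."""
    seen = set()
    count = 0
    for s in seq:
        key = frozenset(s)  # sets are unhashable; frozensets can be stored in a set
        if key in seen:
            count += 1
        seen.add(key)
    return count


print(sorted(universe(sequences)))
print({name: exact_repeats(seq) for name, seq in sequences.items()})
```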

Discrete subset choices

These datasets are from people making choices from a discrete set of alternatives. In datasets with "universal choice sets," the set of alternatives is the same for every choice that is made. In datasets with "variable choice sets," the set of alternatives changes with each subset selection. These datasets were used in the following paper:

A Discrete Choice Model for Subset Selection.
Austin R. Benson, Ravi Kumar, and Andrew Tomkins.
Proceedings of the ACM International Conference on Web Search and Data Mining (WSDM), 2018.
Code available at github.com/arbenson/discrete-subset-choice.

Universal choice dataset pages:

uchoice-Bakery: sets of items purchased at a bakery.
uchoice-Walmart-Items: sets of items purchased at Walmart.
uchoice-Walmart-Depts: sets of departments from which items were purchased at Walmart.
uchoice-Kosarak: sets of web pages viewed in a browsing session.
uchoice-Instacart: sets of items purchased from Instacart.
uchoice-Lastfm-Genres: sets of genres of music played by users in listening sessions.

Variable choice dataset pages:

vchoice-Yc-Items: sets of items purchased from the items viewed in a browsing session on an e-commerce web site.
vchoice-Yc-Cats: sets of product categories from which purchases were made from a browsing session on an e-commerce web site.
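
The distinction between universal and variable choice sets can be sketched with a small Python data structure: each choice records the alternatives that were available and the subset that was selected. The class, field names, and toy items below are hypothetical and are not taken from the datasets or the linked code.

```python
from dataclasses import dataclass
from typing import FrozenSet


@dataclass(frozen=True)
class Choice:
    alternatives: FrozenSet[str]  # what could have been chosen
    selected: FrozenSet[str]      # the subset that was actually chosen

    def __post_init__(self):
        # A selection must be a subset of the available alternatives.
        if not self.selected <= self.alternatives:
            raise ValueError("selected items must come from the alternatives")


def has_universal_choice_set(choices):
    """True if every choice was made from the same set of alternatives."""
    return len({c.alternatives for c in choices}) <= 1


menu = frozenset({"bagel", "croissant", "muffin"})
universal = [
    Choice(menu, frozenset({"bagel"})),
    Choice(menu, frozenset({"croissant", "muffin"})),
]
variable = [
    Choice(frozenset({"a", "b", "c"}), frozenset({"a"})),
    Choice(frozenset({"b", "d"}), frozenset({"b", "d"})),
]

print(has_universal_choice_set(universal))
print(has_universal_choice_set(variable))
```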

Genius.com data

This is a curated dataset of users, songs, and lyrical annotation on the web site Genius.com. The dataset was used in the following paper:

Expertise and Dynamics within Crowdsourced Musical Knowledge Curation: A Case Study of the Genius Platform.
Derek Lim and Austin R. Benson.
Proceedings of the International Conference on Web and Social Media (ICWSM), 2021.
Code available at github.com/cptq/genius-expertise.

Dataset page:

genius-expertise

Manhattan taxi cab trajectories

This dataset contains 1,000 sequences of Manhattan neighborhoods visited by taxi cabs over a one-year period. The dataset was used in the following paper:

The spacey random walk: a stochastic process for higher-order data.
Austin R. Benson, David F. Gleich, and Lek-Heng Lim.
SIAM Review (Research Spotlights), 2017.
Code available at github.com/arbenson/spacey-random-walks.

Dataset page:

Manhattan-taxi-trajectories

Flow cytometry

This flow cytometry dataset represents abundances of fluorescent molecules labeling antibodies that bind to specific targets on the surface of blood cells. The dataset was used in the following paper:

Scalable methods for nonnegative matrix factorizations of near-separable tall-and-skinny matrices.
Austin R. Benson, Jason D. Lee, Bartek Rajwa, and David F. Gleich.
Advances in Neural Information Processing Systems (NeurIPS), 2014.
Code available at github.com/arbenson/mrnmf.

Dataset page:

flow-cytometry