Data!
This is a collection of datasets from my research projects. I strive to make the data used in my research easily accessible. If you encounter problems, please email me at arb@cs.cornell.edu.
Temporal higher-order networks (hypergraphs)
Each of these datasets is a timestamped sequence of simplices, where
a simplex is a set of k nodes from some vertex set. The datasets
also contain weighted projected graphs, where the weight is the
number of times that two nodes co-appear in a simplex. These datasets
were used in the following paper:
- Simplicial closure and higher-order link prediction.
Austin R. Benson, Rediet Abebe, Michael T. Schaub, Ali Jadbabaie, and Jon Kleinberg.
Proceedings of the National Academy of Sciences (PNAS), 2018.
Code available at github.com/arbenson/ScHoLP-Tutorial.
- coauth-DBLP: co-authorship on DBLP papers.
- coauth-MAG-Geology: co-authorship on Geology papers.
- coauth-MAG-History: co-authorship on History papers.
- tags-stack-overflow: sets of tags applied to questions on stackoverflow.com.
- tags-math-sx: sets of tags applied to questions on math.stackexchange.com.
- tags-ask-ubuntu: sets of tags applied to questions on askubuntu.com.
- threads-stack-overflow: sets of users asking and answering questions on threads on stackoverflow.com.
- threads-math-sx: sets of users asking and answering questions on threads on math.stackexchange.com.
- threads-ask-ubuntu: sets of users asking and answering questions on threads on askubuntu.com.
- NDC-substances: sets of substances making up drugs.
- NDC-classes: sets of classifications applied to drugs.
- DAWN: sets of drugs used by patients recorded in emergency room visits.
- congress-bills: sets of congresspersons cosponsoring bills.
- email-Eu: sets of email addresses on emails.
- email-Enron: sets of email addresses on emails.
- contact-high-school: groups of people in contact at a high school.
- contact-primary-school: groups of people in contact at a primary school.
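The weighted projected graph described at the top of this section can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the code used to build the datasets: the function name `projected_graph` and the in-memory representation (a list of node sets) are hypothetical, and the sketch says nothing about the on-disk file format.

```python
from collections import Counter
from itertools import combinations


def projected_graph(simplices):
    """Count how often each pair of nodes co-appears in a simplex.

    `simplices` is an iterable of node collections (a hypothetical
    in-memory form). Returns a Counter mapping sorted node pairs
    (u, v) to their co-appearance counts.
    """
    weights = Counter()
    for simplex in simplices:
        # Deduplicate and sort so each unordered pair has one canonical key.
        nodes = sorted(set(simplex))
        for u, v in combinations(nodes, 2):
            weights[(u, v)] += 1
    return weights


# Toy example: three simplices over nodes 1..4.
toy_simplices = [{1, 2, 3}, {1, 2}, {2, 4}]
toy_weights = projected_graph(toy_simplices)
```

In the toy example, nodes 1 and 2 co-appear in two simplices, so the edge (1, 2) gets weight 2, while every other edge gets weight 1.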
Hypergraphs with labeled nodes
Each of these datasets is a hypergraph where the nodes are labeled
into discrete classes. These can be used for community detection or
node prediction experiments. We used them in the following papers:
- Generative hypergraph clustering: from blockmodels to modularity.
Philip S. Chodrow, Nate Veldt, and Austin R. Benson.
Science Advances, 2021.
Code available at github.com/PhilChodrow/HypergraphModularity.
- Minimizing Localized Ratio Cut Objectives in Hypergraphs.
Nate Veldt, Austin R. Benson, and Jon Kleinberg.
Proceedings of the ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD), 2020.
Code available at github.com/nveldt/HypergraphFlowClustering.
- Clustering in graphs and hypergraphs with categorical edge labels.
Ilya Amburg, Nate Veldt, and Austin R. Benson.
Proceedings of the Web Conference (WWW), 2020.
Code available at github.com/nveldt/CategoricalEdgeClustering.
- stackoverflow-answers: sets of questions answered by users on Stack Overflow, where labels are question tags.
- mathoverflow-answers: sets of questions answered by users on Math Overflow, where labels are question tags.
- walmart-trips: sets of products bought on Walmart shopping trips, where labels are departments of products.
- amazon-reviews: sets of products reviewed by users on Amazon, where labels are product categories.
- trivago-clicks: sets of hotels clicked on in a Web browsing session, where labels are the countries of the accommodation.
- contact-primary-school: sets of students in proximity, where labels are classrooms.
- contact-high-school: sets of students in proximity, where labels are classrooms.
- senate-bills: bill cosponsorship in the US Senate, where labels are political affiliation.
- house-bills: bill cosponsorship in the US House of Representatives, where labels are political affiliation.
- senate-committees: committee membership in the US Senate, where labels are political affiliation.
- house-committees: committee membership in the US House of Representatives, where labels are political affiliation.
US county networks for node regression
These are networks of US counties, where edges come from physical adjacency or
Facebook connectedness. The nodes are accompanied by various covariates, such as
demographic features, climate measurements, and election statistics, depending on the dataset.
We used these for transductive node regression experiments.
Some of the datasets have demographic features and election statistics from both 2012 and 2016, and
we used these for inductive learning experiments.
The data was used in the following papers:
- A Unifying Generative Model for Graph Learning Algorithms: Label Propagation, Graph Convolutions, and Combinations.
Junteng Jia and Austin R. Benson.
arXiv:2101.07730, 2021.
Code available at github.com/000Justin000/GaussianMRF.
- Residual Correlation in Graph Neural Network Regression.
Junteng Jia and Austin R. Benson.
Proceedings of the ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD), 2020.
Code available at github.com/000Justin000/gnn-residual-correlation.
- US-county-demos: demographics and elections, physical adjacency.
- US-county-fb: demographics and elections, social connections.
- CDC-climate: climate, physical adjacency.
Hypergraphs with categorical edge labels
Each of these datasets is a hypergraph or just a graph where the
edges have a discrete label (a categorical label). These datasets
were collected for and analyzed in the following papers:
- Clustering in graphs and hypergraphs with categorical edge labels.
Ilya Amburg, Nate Veldt, and Austin R. Benson.
Proceedings of the Web Conference (WWW), 2020.
Code available at github.com/nveldt/CategoricalEdgeClustering.
- Fair Clustering for Balanced and Diverse Groups.
Ilya Amburg, Nate Veldt, and Austin R. Benson.
arXiv:2006.05645, 2020.
Code available at github.com/ilyaamburg/fair-clustering-for-diverse-and-experienced-groups.
- cat-edge-Cooking: sets of ingredients in recipes labeled by cuisine type.
- cat-edge-DAWN: sets of drugs used by patients recorded in emergency room visits labeled by the most common patient disposition for that combination of drugs.
- cat-edge-Walmart-Trips: sets of products bought on Walmart shopping trips categorized by a trip type.
- cat-edge-MAG-10: co-authorship with publication venue labels.
- cat-edge-Brain: set of brain region coactivation scores based on two categories of measurement.
- cat-edge-music-blues-reviews: sets of reviewers categorized by product review type, with set membership given by timestamp similarity.
- cat-edge-madison-restaurant-reviews: sets of reviewers categorized by establishment review type, with set membership given by timestamp similarity.
- cat-edge-vegas-bars-reviews: sets of reviewers categorized by establishment review type, with set membership given by timestamp similarity.
- cat-edge-algebra-questions: sets of users categorized by question tag type, with set membership given by timestamp similarity.
- cat-edge-geometry-questions: sets of users categorized by question tag type, with set membership given by timestamp similarity.
Largeish weighted graphs
Each of these datasets is an undirected weighted graph of nontrivial
size. These datasets were used in the following paper:
- Retrieving Top Weighted Triangles in Graphs.
Raunak Kumar, Paul Liu, Moses Charikar, and Austin R. Benson.
Proceedings of the ACM International Conference on Web Search and Data Mining (WSDM), 2020.
Code available at github.com/raunakkmr/Retrieving-top-weighted-triangles-in-graphs.
- colisten-Spotify: inter-session plays of songs on Spotify.
- coauth-MAG: co-authorship from the Microsoft Academic Graph.
- coauth-AMiner: co-authorship from AMiner.
Graphs and hypergraphs with core-fringe structure
Each of these datasets is a graph or hypergraph, where the nodes are
labeled as "core" or "fringe" according to the data collection
process. Specifically, all of the graphs measured communication
involving a set of nodes, and this set of nodes serves as the
core. This induces what we call "core-fringe" structure in the
network. We studied how well one can recover the core-labeled nodes
from the network structure. In this setup, the core nodes form a
"planted vertex cover" in the graph case and a "planted hitting set"
in the hypergraph case. We studied these datasets in the following papers:
- Found Graph Data and Planted Vertex Covers.
Austin R. Benson and Jon Kleinberg.
Advances in Neural Information Processing Systems, 2018.
Code available at github.com/arbenson/FGDnPVC.
- Planted Hitting Set Recovery in Hypergraphs.
Ilya Amburg, Jon Kleinberg, and Austin R. Benson.
Journal of Physics: Complexity (Special Issue on Higher-Order Structures in Networks and Network Dynamical Systems), 2021.
Code available at github.com/ilyaamburg/Hypergraph-Planted-Hitting-Set-Recovery.
- pvc-call-Reality: phone calls made and received by participants in the reality mining project.
- pvc-text-Reality: SMS texts made and received by participants in the reality mining project.
- pvc-email-W3C: email on W3C mailing lists (graph).
- pvc-email-Enron: email involving Enron employees (graph).
- phs-email-W3C: email on W3C mailing lists (hypergraph).
- phs-email-Enron: email involving Enron employees (hypergraph).
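The "planted vertex cover" and "planted hitting set" properties mentioned above amount to the same check: every edge or hyperedge must contain at least one core node. The sketch below illustrates that check on toy data. The function name `is_hitting_set` and the in-memory edge representation are assumptions for illustration and are not taken from the linked repositories.

```python
def is_hitting_set(core, hyperedges):
    """Return True if every hyperedge contains at least one core node.

    For ordinary graphs (all edges of size 2) this is exactly the
    vertex cover condition.
    """
    core = set(core)
    # A non-empty intersection is truthy, an empty one is falsy.
    return all(core & set(edge) for edge in hyperedges)


# Toy graph: edges as pairs of node ids.
toy_edges = [(1, 2), (1, 3), (2, 4)]
covers = is_hitting_set({1, 2}, toy_edges)      # every edge touches 1 or 2
not_covers = is_hitting_set({1}, toy_edges)     # edge (2, 4) misses node 1

# Toy hypergraph: hyperedges as sets of node ids.
toy_hyperedges = [{1, 5, 6}, {2, 7}, {1, 2, 8}]
hits = is_hitting_set({1, 2}, toy_hyperedges)
```

Recovering the core from the network alone is the harder problem studied in the papers above. This check only verifies a candidate core.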
Stack exchange co-tagging networks
These are 168 weighted co-tagging networks from Stack Exchange communities,
where the weight of edge (i, j) is the number of questions that were annotated
with both tags i and j. The data was analyzed in the following paper:
- Modeling and Analysis of Tagging Networks in Stack Exchange Communities.
Xiang Fu*, Shangdi Yu*, and Austin R. Benson (*equal contribution).
Journal of Complex Networks, 2019.
Code available at github.com/yushangdi/stack-exchange-cotagging.
- stack-exchange-tags: 168 stack exchange community weighted co-tagging networks.
Temporal networks
These are temporal networks where (i, j, t) signifies a directed edge
from i to j at time t. The networks were used in the following paper:
- A sampling framework for counting temporal motifs.
Paul Liu, Austin R. Benson, and Moses Charikar.
Proceedings of the ACM International Conference on Web Search and Data Mining (WSDM), 2019.
Code available at gitlab.com/paul.liu.ubc/sampling-temporal-motifs.
- temporal-reddit-reply: timestamped comment interactions on reddit.
- temporal-bitcoin: timestamped bitcoin transactions.
Spatial networks
Each of these datasets is a network whose nodes have spatial coordinates.
These datasets were used in the following paper:
- Detecting Core-Periphery Structure in Spatial Networks.
Junteng Jia and Austin R. Benson.
Proceedings of the ACM International Conference on Web Search and Data Mining (WSDM), 2019.
Code available at github.com/000Justin000/spatial_core_periphery.
- spatial-Celegans: C. elegans neural network.
- spatial-underground-London: Tube transportation network in London.
- spatial-fungi: Fungal networks constructed from experimental data.
- spatial-OpenFlights: World airline network from openflights.org.
- spatial-Brightkite: Brightkite location-based social network.
Sequences of Sets
These datasets are sequences of sets. Formally, a dataset consists of a collection
of sequences, where each sequence is a time-ordered list of subsets of some universal
set. These datasets were used in the following paper:
- Sequences of Sets.
Austin R. Benson, Ravi Kumar, and Andrew Tomkins.
Proceedings of the ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD), 2018.
Code available at github.com/arbenson/Sequences-of-Sets.
- sos-email-Enron-core: sets of recipients on emails from email addresses.
- sos-email-Eu-core: sets of recipients on emails from email addresses.
- sos-coauth-Business: sets of co-authors on publications from researchers.
- sos-coauth-Geology: sets of co-authors on publications from researchers.
- sos-tags-mathoverflow: sets of tags on MathOverflow questions from users.
- sos-tags-math-sx: sets of tags on Mathematics Stack Exchange questions from users.
- sos-contact-high-school: sets of proximity-based contacts from individuals at a high school.
- sos-contact-prim-school: sets of proximity-based contacts from individuals at a primary school.
Discrete subset choices
These datasets are from people making choices from a discrete set of
alternatives. In datasets with "universal choice sets," the set of
alternatives is the same for every choice that is made. In datasets
with "variable choice sets," the set of alternatives changes with each
subset selection. These datasets were used in the following paper:
- A Discrete Choice Model for Subset Selection.
Austin R. Benson, Ravi Kumar, and Andrew Tomkins.
Proceedings of the ACM International Conference on Web Search and Data Mining (WSDM), 2018.
Code available at github.com/arbenson/discrete-subset-choice.
- uchoice-Bakery: sets of items purchased at a bakery.
- uchoice-Walmart-Items: sets of items purchased at Walmart.
- uchoice-Walmart-Depts: sets of departments from which items were purchased at Walmart.
- uchoice-Kosarak: sets of web pages viewed in a browsing session.
- uchoice-Instacart: sets of items purchased from Instacart.
- uchoice-Lastfm-Genres: sets of genres of music played by users in listening sessions.
- vchoice-Yc-Items: sets of items purchased from the items viewed in a browsing session on an e-commerce web site.
- vchoice-Yc-Cats: sets of product categories from which purchases were made from a browsing session on an e-commerce web site.
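The distinction between "universal" and "variable" choice sets can be made concrete with a small sketch. The `Choice` record and the helper `has_universal_choice_set` are hypothetical names introduced here for illustration. They do not reflect the file format of the datasets or the code in the linked repository.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Choice:
    alternatives: frozenset  # items available when the choice was made
    selected: frozenset      # the subset of alternatives that was chosen


def has_universal_choice_set(choices):
    """Return True if every choice was made from the same alternatives."""
    return len({c.alternatives for c in choices}) <= 1


# Universal choice set: the same menu is offered every time.
menu = frozenset({"bread", "cake", "pie"})
universal = [
    Choice(menu, frozenset({"bread"})),
    Choice(menu, frozenset({"cake", "pie"})),
]

# Variable choice set: the alternatives differ from one choice to the next.
variable = [
    Choice(frozenset({"a", "b"}), frozenset({"a"})),
    Choice(frozenset({"b", "c"}), frozenset({"c"})),
]
```

Under this representation, the uchoice-* datasets would correspond to the first pattern and the vchoice-* datasets to the second.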
Genius.com data
This is a curated dataset of users, songs, and lyrical
annotations on the web site Genius.com.
The dataset was used in the following paper:
- Expertise and Dynamics within Crowdsourced Musical Knowledge Curation: A Case Study of the Genius Platform.
Derek Lim and Austin R. Benson.
Proceedings of the International Conference on Web and Social Media (ICWSM), 2021.
Code available at github.com/cptq/genius-expertise.
Manhattan taxi cab trajectories
This dataset contains 1,000 sequences of neighborhoods of Manhattan visited
by taxi cabs over a one-year period. The dataset was used in the following paper:
- The spacey random walk: a stochastic process for higher-order data.
Austin R. Benson, David F. Gleich, and Lek-Heng Lim.
SIAM Review (Research Spotlights), 2017.
Code available at github.com/arbenson/spacey-random-walks.
Flow cytometry
This flow cytometry dataset represents abundances of fluorescent
molecules labeling antibodies that bind to specific targets on the surface
of blood cells. The dataset was used in the following paper:
- Scalable methods for nonnegative matrix factorizations of near-separable tall-and-skinny matrices.
Austin R. Benson, Jason D. Lee, Bartek Rajwa, and David F. Gleich.
Advances in Neural Information Processing Systems (NeurIPS), 2014.
Code available at github.com/arbenson/mrnmf.