NDC-substances Dataset

This is a temporal higher-order network dataset, meaning a sequence of timestamped simplices, where each simplex is a set of nodes. Under the Drug Listing Act of 1972, the U.S. Food and Drug Administration releases information on all commercial drugs regulated by the agency, which forms the National Drug Code (NDC) Directory. In this dataset, each simplex corresponds to an NDC code for a drug, and the nodes are the substances that make up the drug. Timestamps are in days and represent when the drug was first marketed. We restrict the data to simplices with at most 25 nodes. Some basic statistics of this dataset are:

number of nodes: 5,311
number of timestamped simplices: 112,405
number of unique simplices: 10,025
number of edges in projected graph: 88,268
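
As a rough illustration of how the four statistics above relate to the data, the following Python sketch computes them from an in-memory list of timestamped simplices. The function name `summarize`, the `(timestamp, nodes)` input layout, and the toy data are assumptions made for this example and do not describe the on-disk format of the archives. The sketch also assumes that the weight of an edge in the projected graph is the number of timestamped simplices in which its two nodes co-appear.

```python
from collections import Counter
from itertools import combinations


def summarize(timestamped_simplices):
    """Compute basic statistics for a temporal higher-order network.

    `timestamped_simplices` is an iterable of (timestamp, nodes) pairs,
    where `nodes` is an iterable of hashable, sortable node ids.
    This input layout is hypothetical, chosen only for illustration.
    """
    nodes = set()
    unique_simplices = set()
    edge_weights = Counter()  # (u, v) with u < v -> co-appearance count
    n_timestamped = 0

    for _timestamp, simplex in timestamped_simplices:
        simplex = frozenset(simplex)
        n_timestamped += 1
        nodes |= simplex
        unique_simplices.add(simplex)
        # Project the simplex onto a graph: every pair of nodes in the
        # simplex becomes an edge. Assumed weight: number of timestamped
        # simplices containing both endpoints.
        for u, v in combinations(sorted(simplex), 2):
            edge_weights[(u, v)] += 1

    stats = {
        "nodes": len(nodes),
        "timestamped_simplices": n_timestamped,
        "unique_simplices": len(unique_simplices),
        "projected_edges": len(edge_weights),
    }
    return stats, edge_weights


# Toy data: four timestamped simplices over four substances.
toy = [
    (10, [1, 2, 3]),
    (12, [2, 3]),
    (12, [1, 2, 3]),  # same node set as the first, later timestamp
    (20, [4]),        # single-substance drug: contributes no edges
]
stats, edge_weights = summarize(toy)
print(stats)
```

In the toy data, the repeated simplex {1, 2, 3} counts twice toward the timestamped total but once toward the unique total, which is why the dataset's two simplex counts differ so widely.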

Data restricted to simplices with at most 25 nodes:

NDC-substances.tar.gz (timestamped simplices, node labels, and simplex labels)
NDC-substances-proj-graph.tar.gz (weighted projected graph)

Full data without restriction on simplex size:

NDC-substances-full.tar.gz (timestamped simplices, node labels, and simplex labels)
NDC-substances-full-proj-graph.tar.gz (weighted projected graph)

If you use this data, please cite the following paper:

Simplicial closure and higher-order link prediction.
Austin R. Benson, Rediet Abebe, Michael T. Schaub, Ali Jadbabaie, and Jon Kleinberg.
Proceedings of the National Academy of Sciences (PNAS), 2018.
