phs-email-W3C dataset
This is a hypergraph dataset with core-fringe structure constructed from emails on W3C mailing lists. Nodes are labeled as either "core" or "fringe", with core nodes corresponding to email addresses with a w3c.org domain. Each hyperedge is a set of email addresses that all appeared on the same email. Each hyperedge contains at least one core node, so the core forms a hitting set for the hypergraph. We studied ways of recovering the core labels from network structure alone, i.e., the problem of finding a planted hitting set. Some summary statistics of the network are:
  • number of nodes: 14,317
  • number of hyperedges: 19,821
  • number of core nodes: 1,509
  • rank of hypergraph (maximum hyperedge size): 25
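The hitting-set property and the summary statistics above can be checked with a few lines of code. The following is a minimal sketch, not the authors' code: it assumes the hyperedges have already been loaded into memory as collections of node identifiers and the core labels as a set. The function name `summarize` and the toy addresses are hypothetical, and the on-disk file format is not assumed here.

```python
from typing import Dict, Hashable, Iterable, Set, Union


def summarize(
    hyperedges: Iterable[Iterable[Hashable]],
    core: Set[Hashable],
) -> Dict[str, Union[int, bool]]:
    """Compute basic statistics and test whether `core` hits every hyperedge.

    Hypothetical helper: duplicate hyperedges are counted as given,
    since the dataset's own counting convention is not assumed.
    """
    edges = [frozenset(e) for e in hyperedges]
    nodes = set().union(*edges) if edges else set()
    return {
        "num_nodes": len(nodes),
        "num_hyperedges": len(edges),
        # Only count core labels that actually occur in some hyperedge.
        "num_core_nodes": len(nodes & core),
        # Rank = size of the largest hyperedge.
        "rank": max((len(e) for e in edges), default=0),
        # A hitting set intersects every hyperedge.
        "core_is_hitting_set": all(e & core for e in edges),
    }


# Toy data (made up for illustration, not taken from the dataset).
toy_edges = [
    {"a@w3c.org", "x@example.com"},
    {"a@w3c.org", "b@w3c.org", "y@example.com"},
    {"b@w3c.org"},
]
toy_core = {"a@w3c.org", "b@w3c.org"}

stats = summarize(toy_edges, toy_core)
print(stats)
```

On the real data, the same check would be expected to report that the core is a hitting set, since every hyperedge contains at least one core node by construction.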
Data files:
If you use this data, please cite the following papers:
  • Planted Hitting Set Recovery in Hypergraphs.
    Ilya Amburg, Jon Kleinberg, and Austin R. Benson.
    Journal of Physics: Complexity, 2021.
  • Overview of the TREC 2005 Enterprise Track.
    Nick Craswell, Arjen P. de Vries, and Ian Soboroff.
    TREC, 2005.