A Large-Scale Database for Graph Representation Learning

Scott Freitas, Yuxiao Dong, Joshua Neil, Duen Horng (Polo) Chau


MalNet: Advancing State-of-the-Art Graph Databases. MalNet contains 1,262,024 function call graphs averaging 17,242 nodes and 39,043 edges per graph, across a hierarchy of 47 types and 696 families.


Abstract

With the rapid emergence of graph representation learning, the construction of new large-scale datasets is necessary to distinguish model capabilities and accurately assess the strengths and weaknesses of each technique. By carefully analyzing existing graph databases, we identify three components critical for advancing the field of graph representation learning: (1) large graphs, (2) many graphs, and (3) class diversity. To date, no single graph database offers all of these desired properties. We introduce MalNet, the largest public graph database ever constructed, representing a large-scale ontology of software function call graphs. MalNet contains over 1.2 million graphs, averaging over 17k nodes and 39k edges per graph, across a hierarchy of 47 types and 696 families. Compared to the popular REDDIT-12K database, MalNet offers 105x more graphs, 44x larger graphs on average, and 63x more classes. We provide a detailed analysis of MalNet, discussing its properties and provenance. The unprecedented scale and diversity of MalNet offer exciting opportunities to advance the frontiers of graph representation learning, enabling new discoveries and research into imbalanced classification, explainability, and the impact of class hardness. The database is publicly available at www.mal-net.org.
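As a rough sanity check on the comparison with REDDIT-12K, the ratios can be recomputed from the dataset statistics. The MalNet figures below come from this page; the REDDIT-12K figures (11,929 graphs, about 391 nodes per graph, 11 classes) are not stated here and are assumed from the commonly cited statistics for that benchmark. "Larger graphs" is taken to mean average node count, and "classes" to mean MalNet families; both readings are assumptions.

```python
# MalNet statistics, as stated in the abstract and figure caption.
malnet = {"graphs": 1_262_024, "avg_nodes": 17_242, "classes": 696}

# REDDIT-12K statistics: ASSUMED values (commonly cited for this
# benchmark), not taken from this page.
reddit_12k = {"graphs": 11_929, "avg_nodes": 391.41, "classes": 11}


def scale_ratios(big, small):
    """Return big[key] / small[key] for every key in `big`."""
    return {key: big[key] / small[key] for key in big}


ratios = scale_ratios(malnet, reddit_12k)
for key, value in ratios.items():
    print(f"{key}: {value:.1f}x")
```

Under these assumed REDDIT-12K values, the ratios round down to the 105x, 44x, and 63x quoted in the abstract.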

Citation

A Large-Scale Database for Graph Representation Learning
Scott Freitas, Yuxiao Dong, Joshua Neil, Duen Horng (Polo) Chau
Neural Information Processing Systems Datasets and Benchmarks (NeurIPS). Virtual, 2021.