The heterogeneous graph benchmark (HGB) aims to advance open and reproducible heterogeneous graph research. In HGB, we standardize data splits, feature processing, and performance evaluation by establishing the HGB pipeline “feature preprocessing → HGNN encoder → downstream decoder” across three heterogeneous graph learning tasks: node classification, link prediction, and knowledge-aware recommendation.
In node classification, you are given a heterogeneous graph and some labeled nodes, and you need to predict the categories of some unlabeled nodes. Our datasets include both single-label and multi-label settings.
In link prediction, you are given a heterogeneous graph, and you need to predict the probability that a link exists between some node pairs. Since real-world graphs are usually sparse, link prediction is useful for finding missing links.
In knowledge-aware recommendation, you are given a user-item graph and a knowledge graph attached to it, and you need to produce item rankings for some test users.
How to take part in HGB?
There are three channels in HGB: node classification (NC), link prediction (LP), and knowledge-aware recommendation (Recom). You can find the entries at the bottom of the HGB homepage. For NC and LP, you can join as you would any biendata competition and submit your predictions to get scores. For Recom, because the prediction files are too large to upload, you need to run the evaluation offline yourself.
If you want your scores to appear on the official HGB leaderboard, you have to make a request and fill in a form on the "My Submissions" page of each competition channel. After we have verified the information you provided and the reproducibility of your code, you will receive a rank on the official HGB leaderboard.
To standardize heterogeneous graph experiment settings, we adapt all methods to the “feature preprocessing → HGNN encoder → downstream decoder” HGB pipeline. The whole pipeline is trained end to end. Ideally, researchers only need to focus on the HGNN encoder. You can also explore more effective feature preprocessing and downstream decoder modules, but most of the time you can leave these two parts to the HGB pipeline scripts.
Since different node types may have different feature dimensions, feature preprocessing uses a separate linear layer for each node type to map all nodes into a unified vector space.
Moreover, we find that not all node features improve model performance. For example, replacing the features of some node types with one-hot encodings can improve performance. Therefore, we consider three feature combinations: using all given node features, using only the features of the target node type, or replacing all node features with one-hot vectors.
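The per-type projection described above can be illustrated with a small, dependency-free sketch. It is not the HGB implementation: the node types, feature values, helper names, and the random initialization are all hypothetical, and in the real pipeline the projection weights would be learned end to end with the rest of the model.

```python
import random

def one_hot_features(num_nodes):
    """One-hot features: node i gets a vector with a single 1 at position i."""
    return [[1.0 if i == j else 0.0 for j in range(num_nodes)]
            for i in range(num_nodes)]

def make_linear(in_dim, out_dim, rng):
    """Return a randomly initialized (in_dim x out_dim) weight and a zero bias.
    Assumption: random weights stand in for learned parameters."""
    weight = [[rng.uniform(-0.1, 0.1) for _ in range(out_dim)]
              for _ in range(in_dim)]
    bias = [0.0] * out_dim
    return weight, bias

def apply_linear(features, weight, bias):
    """Compute features @ weight + bias using plain Python lists."""
    return [
        [sum(x * w_row[k] for x, w_row in zip(row, weight)) + bias[k]
         for k in range(len(bias))]
        for row in features
    ]

def project_by_type(features_by_type, hidden_dim, seed=0):
    """Give each node type its own linear map into a shared hidden_dim space."""
    rng = random.Random(seed)
    projected = {}
    for node_type, feats in features_by_type.items():
        weight, bias = make_linear(len(feats[0]), hidden_dim, rng)
        projected[node_type] = apply_linear(feats, weight, bias)
    return projected

# Hypothetical toy input: 2 "author" nodes with 3-d features,
# and 3 "paper" nodes whose features are replaced by one-hot vectors.
features_by_type = {
    "author": [[0.5, 1.0, -1.0], [0.0, 2.0, 0.3]],
    "paper": one_hot_features(3),
}
unified = project_by_type(features_by_type, hidden_dim=4)
```

After projection, every node has a vector of the same length regardless of its type, so a single HGNN encoder can consume them.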
The HGNN encoder is the primary module for researchers to explore.
For node classification, we use a softmax decoder (single-label classification) or a sigmoid decoder (multi-label classification).
For link prediction, we use a dot product decoder or a DistMult decoder.
For recommendation, we use a decoder that ensembles dot product and BPR-MF.
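As a rough illustration of the node classification and link prediction decoders, the sketch below uses the standard textbook formulations: softmax and sigmoid over logits, a dot product between two node embeddings, and DistMult as a three-way product with a per-relation vector. The function names and toy vectors are hypothetical, and the HGB scripts may differ in details. The recommendation decoder is not sketched here.

```python
import math

def softmax(logits):
    """Single-label decoder: turn class logits into probabilities that sum to 1."""
    shift = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - shift) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def sigmoid(x):
    """Multi-label decoder (applied per class), also maps a link score to (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))

def dot_product_score(head, tail):
    """Link score as the inner product of the two node embeddings."""
    return sum(h * t for h, t in zip(head, tail))

def distmult_score(head, relation, tail):
    """DistMult link score: sum_i head_i * relation_i * tail_i,
    where `relation` is a per-edge-type vector (a diagonal relation matrix)."""
    return sum(h * r * t for h, r, t in zip(head, relation, tail))

# Hypothetical 2-d embeddings for one candidate link.
head, tail, relation = [1.0, 2.0], [0.5, -1.0], [2.0, 1.0]
dot_score = dot_product_score(head, tail)
dm_score = distmult_score(head, relation, tail)
link_probability = sigmoid(dm_score)
class_probabilities = softmax([1.0, 2.0, 3.0])
```

DistMult reduces to the plain dot product when the relation vector is all ones, which is one way to see it as an edge-type-aware extension of the simpler decoder.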
About Data Format
Although you can use our data_loader script to load the data, we clarify the dataset format here.
Fields on each line are separated by '\t'. For each dataset, we have:
node.dat: The information of nodes. Each line has (node_id, node_name, node_type_id, node_feature). One-hot node_features can be omitted. Each node type occupies a contiguous range of node_ids, and node_ids are ordered by node_type_id: node_type_id=0 takes the first interval of node_ids, node_type_id=1 takes the second interval, and so on. Node features are vectors whose entries are separated by commas.
link.dat: The information of edges. Each line has (node_id_source, node_id_target, edge_type_id, edge_weight).
label.dat: The information of node labels. Each line has (node_id, node_type_id, node_label). In the multi-label setting, node_labels are separated by commas.
label.dat.test: Test set node labels. The format is the same as label.dat, but each node_label is randomly replaced.
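For readers who want to inspect the files without the data_loader script, here is a minimal parsing sketch that follows the column layouts exactly as described above. The function names are hypothetical, and the assumptions that labels are integers and that node_ids start at 0 are not stated in this document; if the real files deviate, prefer the official loader.

```python
def parse_node_line(line):
    """node.dat: node_id, node_name, node_type_id, and an optional feature vector."""
    parts = line.rstrip("\n").split("\t")
    node_id, node_name, node_type_id = int(parts[0]), parts[1], int(parts[2])
    features = None  # one-hot features may be omitted from the file
    if len(parts) > 3 and parts[3]:
        features = [float(x) for x in parts[3].split(",")]
    return node_id, node_name, node_type_id, features

def parse_link_line(line):
    """link.dat: node_id_source, node_id_target, edge_type_id, edge_weight."""
    src, dst, edge_type, weight = line.rstrip("\n").split("\t")
    return int(src), int(dst), int(edge_type), float(weight)

def parse_label_line(line, multi_label=False):
    """label.dat: node_id, node_type_id, node_label (comma-separated if multi-label).
    Assumption: labels are integer class ids."""
    node_id, node_type_id, raw_label = line.rstrip("\n").split("\t")
    labels = [int(x) for x in raw_label.split(",")]
    label = labels if multi_label else labels[0]
    return int(node_id), int(node_type_id), label

def node_type_ranges(type_counts):
    """Map each node_type_id to its (first_id, last_id) interval, given the
    number of nodes per type in type-id order. Assumption: ids start at 0."""
    ranges, start = {}, 0
    for type_id, count in enumerate(type_counts):
        ranges[type_id] = (start, start + count - 1)
        start += count
    return ranges

# Hypothetical example lines, not taken from a real dataset.
node = parse_node_line("0\tpaper_a\t0\t0.1,0.2\n")
link = parse_link_line("0\t5\t2\t1.0\n")
label = parse_label_line("0\t0\t1,4\n", multi_label=True)
ranges = node_type_ranges([3, 2])
```

Because each node type occupies one contiguous interval, the type of any node can be recovered from its id alone once the per-type counts are known.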
node.dat: The format is the same as node.dat in node classification.
link.dat: The format is the same as link.dat in node classification.
link.dat.test: Originally, this file contained the test set links. To prevent data leakage, the target nodes have been randomly replaced.
mf.npz: Pretrained BPR-MF embeddings for recommendation.
kg_final.txt: Each line has (head_entity_id, relation_id, tail_entity_id). The entity ids are aligned with items if possible.
train.txt: On each line, the first number is the user id, and the remaining numbers are the items related to this user.
test.txt: The format is the same as train.txt, but for the test set. Note that we do not replace this test set, because the datasets are the same as those used in KGAT and can be obtained easily. Therefore, you need to guard against data leakage yourself.
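The recommendation text files can be read with a few lines of Python. This is a sketch under assumptions: the function names are hypothetical, all ids are taken to be integers, and fields are split on any whitespace so that both tab- and space-separated files parse. The mf.npz file is not covered here.

```python
def parse_interactions(lines):
    """train.txt / test.txt: the first number is the user id,
    the remaining numbers are that user's items."""
    user_items = {}
    for line in lines:
        fields = line.split()  # assumption: whitespace-separated integers
        if not fields:
            continue  # skip blank lines
        user_items[int(fields[0])] = [int(x) for x in fields[1:]]
    return user_items

def parse_kg_triples(lines):
    """kg_final.txt: (head_entity_id, relation_id, tail_entity_id) per line."""
    triples = []
    for line in lines:
        fields = line.split()
        if len(fields) != 3:
            continue  # skip blank or malformed lines
        head, relation, tail = (int(x) for x in fields)
        triples.append((head, relation, tail))
    return triples

# Hypothetical example lines, not taken from a real dataset.
train = parse_interactions(["0 10 11 12", "1 7", ""])
triples = parse_kg_triples(["3 0 9", "10 1 3"])
```

With real files, the same functions can be applied to an open file handle, for example `parse_interactions(open("train.txt"))`, since iterating over a file yields its lines.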
Knowledge Engineering Group, Department of Computer Science and Technology, Tsinghua University