HGB

Heterogeneous Graph Benchmark

The Heterogeneous Graph Benchmark (HGB) aims to advance open and reproducible heterogeneous graph research. HGB standardizes data splits, feature processing, and performance evaluation by establishing the pipeline “feature preprocessing → HGNN encoder → downstream decoder” across three heterogeneous graph learning tasks: node classification, link prediction, and knowledge-aware recommendation.

Node Classification

Given a heterogeneous graph and some labeled nodes, you need to predict the categories of some unlabeled nodes. Our datasets include both single-label and multi-label settings.

Link Prediction

Given a heterogeneous graph, you need to predict the probability that a link exists between given node pairs. Since real-world graphs are usually sparse, link prediction is useful for finding missing links.

Knowledge-aware Recommendation

Given a user-item graph and a knowledge graph attached to it, you need to produce item rankings for some test users.

How to take part in HGB?

There are three channels in HGB: node classification (NC), link prediction (LP), and knowledge-aware recommendation (Recom). You can find the entries at the bottom of the HGB homepage. For NC and LP, you can join the competition like any ordinary biendata competition and submit your predictions to get scores. For Recom, the prediction files are too large to upload, so you need to run the evaluation offline yourself.
Our GitHub repo provides helper code for data loading, model training and evaluation, and saving predictions for submission. The HGB-related scripts are in the NC/benchmark, LP/benchmark, and Recom/baseline sub-folders, respectively.
If you want your scores to appear on the official HGB leaderboard, you have to make a request and fill out a form on the "My Submissions" page of the corresponding competition channel. After we have verified the information you provided and the reproducibility of your code, you will receive a rank on the official HGB leaderboard.

About Pipeline

To standardize heterogeneous graph experiment settings, we adapt all methods to the HGB pipeline “feature preprocessing → HGNN encoder → downstream decoder”. The whole pipeline is trained end to end. Ideally, researchers only need to focus on the HGNN encoder. You can also explore more effective feature preprocessing and downstream decoder modules, but most of the time you can leave these two parts to the HGB pipeline scripts.
Feature preprocessing
- Since different node types may have different feature dimensions, we use a Linear layer for each node type to project all nodes into a unified vector space.
- Moreover, we find that not all node features have a positive effect on model performance. For example, replacing the features of some node types with one-hot encodings may improve performance. Therefore, we provide three feature combinations: using all given node features, using only the features of the target node type, or replacing all node features with one-hot vectors.
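The two preprocessing steps above can be sketched in plain Python. This is only an illustration, not the HGB implementation: the function names (select_features, project), the mode strings, and the choice to give non-target types one-hot vectors in the "target_only" mode are assumptions, and the weights here are random rather than trained end to end.

```python
import random

def one_hot(index, size):
    """Return a one-hot vector of length `size` with a 1.0 at `index`."""
    vec = [0.0] * size
    vec[index] = 1.0
    return vec

def select_features(features, target_type, mode):
    """Pick one of three feature combinations.

    features: {node_type_id: list of feature vectors}
    mode: "all", "target_only", or "one_hot" (hypothetical names).
    Types whose given features are not kept receive one-hot vectors
    indexed by node position within the type (an assumption).
    """
    selected = {}
    for type_id, rows in features.items():
        keep = mode == "all" or (mode == "target_only" and type_id == target_type)
        if keep:
            selected[type_id] = rows
        else:
            selected[type_id] = [one_hot(i, len(rows)) for i in range(len(rows))]
    return selected

def project(features, hidden_dim, seed=0):
    """Apply one linear map per node type so all nodes share `hidden_dim`.

    Weights are random stand-ins; in a real pipeline they are learned.
    """
    rng = random.Random(seed)
    projected = {}
    for type_id, rows in features.items():
        in_dim = len(rows[0])
        weight = [[rng.uniform(-0.1, 0.1) for _ in range(in_dim)]
                  for _ in range(hidden_dim)]
        projected[type_id] = [
            [sum(w * x for w, x in zip(w_row, row)) for w_row in weight]
            for row in rows
        ]
    return projected
```

For example, with two node types of feature dimension 3 and 1, project(select_features(features, 0, "target_only"), 4) yields 4-dimensional vectors for every node, regardless of its type.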
HGNN encoder
- This is the primary module for researchers to explore.
Downstream decoder
- For node classification, we use a softmax (single-label classification) or sigmoid (multi-label classification) decoder.
- For link prediction, we use a dot product decoder or a DistMult decoder.
- For recommendation, we use a decoder that ensembles dot product with BPR-MF.
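As a rough illustration of the classification and link prediction decoders listed above, the sketch below scores plain Python lists that stand in for encoder outputs. The function names are hypothetical, and the DistMult score assumes the common diagonal form with one relation vector per edge type; the HGB scripts may parameterize it differently.

```python
import math

def softmax(logits):
    """Single-label decoder: turn class logits into probabilities."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def sigmoid(z):
    """Multi-label decoder: independent probability for one label."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)  # avoids overflow for very negative z
    return e / (1.0 + e)

def dot_product_score(h_u, h_v):
    """Link score as the inner product of two node embeddings."""
    return sum(a * b for a, b in zip(h_u, h_v))

def distmult_score(h_u, r, h_v):
    """Diagonal DistMult: sum_i h_u[i] * r[i] * h_v[i]."""
    return sum(a * w * b for a, w, b in zip(h_u, r, h_v))
```

Passing either link score through sigmoid gives a value in (0, 1) that can be read as the predicted probability that the link exists.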

About Data Format

Although you can use our data_loader script to load the data, we describe the dataset format here for clarity.
Node classification

Fields in each line are separated by '\t'. For each dataset, we have:
- node.dat: The information of nodes. Each line has (node_id, node_name, node_type_id, node_feature). One-hot node_features can be omitted. Each node type takes a contiguous range of node_ids, and node_ids are ordered by node_type_id: node_type_id=0 takes the first interval of node_ids, node_type_id=1 takes the second interval, and so on. Node features are vectors whose components are separated by commas.
- link.dat: The information of edges. Each line has (node_id_source, node_id_target, edge_type_id, edge_weight).
- label.dat: The information of node labels. Each line has (node_id, node_type_id, node_label). In the multi-label setting, node_labels are separated by commas.
- label.dat.test: Test set node labels. The format is the same as label.dat, but each node_label is randomly replaced.
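A minimal parsing sketch for the three file formats described above, intended only as a reading aid next to the provided data_loader script. The function names are hypothetical, the field order follows the description here, and treating labels as integer class ids is an assumption; check both against the actual files.

```python
def parse_node_line(line):
    """Parse one node.dat line: node_id, node_name, node_type_id[, node_feature]."""
    fields = line.rstrip("\n").split("\t")
    node_id, node_name, node_type = int(fields[0]), fields[1], int(fields[2])
    features = None  # omitted features (e.g. one-hot types) stay None
    if len(fields) > 3 and fields[3]:
        features = [float(x) for x in fields[3].split(",")]
    return node_id, node_name, node_type, features

def parse_link_line(line):
    """Parse one link.dat line: source, target, edge_type_id, edge_weight."""
    src, dst, edge_type, weight = line.rstrip("\n").split("\t")
    return int(src), int(dst), int(edge_type), float(weight)

def parse_label_line(line):
    """Parse one label.dat line: node_id, node_type_id, comma-separated labels."""
    node_id, node_type, labels = line.rstrip("\n").split("\t")
    # Assumption: labels are integer class ids; a single label yields a 1-item list.
    return int(node_id), int(node_type), [int(x) for x in labels.split(",")]
```

Each function takes one line, so a file can be read with a comprehension such as [parse_link_line(l) for l in open(path)], where path points to a link.dat file.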
Link prediction
- node.dat: The format is the same as node.dat in node classification.
- link.dat: The format is the same as link.dat in node classification.
- link.dat.test: Originally, this file contains the test set links. To prevent data leakage, the target nodes are randomly replaced.
Recommendation
- mf.npz: Pretrained BPR-MF embeddings for recommendation.
- kg_final.txt: Each line has (head_entity_id, relation_id, tail_entity_id). The entity ids are aligned with items if possible.
- train.txt: In each line, the first number is the user id, and the remaining numbers are the ids of the items related to this user.
- test.txt: The format is the same as train.txt, but for the test set. Note that we do not replace this test set, because the datasets are the same as those in KGAT, which can be obtained easily. Therefore, you need to guard against data leakage yourself.
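The train.txt, test.txt, and kg_final.txt layouts above could be read roughly as follows. This is a sketch under assumptions: the function names are hypothetical, and the separator is taken to be whitespace, which this page does not state for the recommendation files.

```python
def parse_interactions(lines):
    """Map each user id to its list of item ids (train.txt / test.txt layout)."""
    user_items = {}
    for line in lines:
        ids = [int(tok) for tok in line.split()]  # assumes whitespace-separated ids
        if not ids:
            continue  # skip blank lines
        user, items = ids[0], ids[1:]
        user_items.setdefault(user, []).extend(items)
    return user_items

def parse_triples(lines):
    """Read (head_entity_id, relation_id, tail_entity_id) triples (kg_final.txt layout)."""
    triples = []
    for line in lines:
        tokens = line.split()
        if len(tokens) == 3:
            triples.append(tuple(int(t) for t in tokens))
    return triples
```

Both functions accept any iterable of lines, so an open file object can be passed directly.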

Datasets

Name	Size	Task	Metric	Last Update	Download

Paper/Code Included