KDD 2024

Overview of OAG-Challenge

OAG-Challenge currently includes three tasks, each designed to evaluate a specific aspect of academic graph mining. Our design principle is to include representative tasks that cover the life cycle of academic graph mining. First, we identify valuable and challenging tasks in the construction of academic graphs, such as author name disambiguation (AND). Then, building on the academic graph, we include application tasks that go beyond the graph itself and study knowledge acquisition and cognitive impact, such as academic question answering (AQA) and paper source tracing (PST).

WhoIsWho-IND: Given the paper assignments of each author and paper metadata, the goal is to detect paper assignment errors for each author.
AQA: Given professional questions and a pool of candidate papers, the objective is to retrieve the most relevant papers to answer these questions.
PST: Given the full texts of each paper, the goal is to automatically trace the most significant references that have inspired a given paper.

Submission Guidelines

The objective of this workshop is to discuss the winning solutions of OAG-Challenge at KDD Cup 2024. Submissions are single-blind (author names and affiliations should be listed). All teams ranked in the Top 11 of the leaderboard are guaranteed an opportunity for an in-person oral or poster presentation. Other submissions will be evaluated by a committee based on their novelty and insights.

Important Dates:

Full Paper Submission Deadline: July 20, 2024 (Anywhere on Earth).
Notification of Acceptance: August 1, 2024.

Please note that the KDD Cup workshop will not have formal proceedings. Authors retain full rights to submit or publish their papers at other venues.

Submission Website: https://openreview.net/group?id=KDD.org/2024/Workshop/OAG-Challenge_Cup

Submission Requirements

Format: Submissions must be in PDF format.
Length:
▪ Submissions for each task: Maximum of 4 pages (including all content and references).
Note: Teams winning in multiple tracks must submit a separate report for each track.

Templates: Please use the ACM Conference templates (two-column format). One recommended setting for LaTeX files is: \documentclass[sigconf, review]{acmart}. Template guidelines are here: https://www.acm.org/publications/proceedings-template
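As a rough illustration of the recommended setting, a minimal acmart skeleton might look like the following. This is only a sketch: the title, author details, e-mail address, section name, and the bibliography file name `references` are hypothetical placeholders, and the official template guidelines linked above remain the authoritative source.

```latex
% Minimal sketch of a single-blind workshop submission using acmart.
% All names below are placeholders, not prescribed by the organizers.
\documentclass[sigconf, review]{acmart}

\begin{document}

\title{Placeholder Title of the Solution Report}

% Single-blind: author names and affiliations are listed.
\author{First Author}
\affiliation{%
  \institution{Placeholder Institution}
  \city{Placeholder City}
  \country{Placeholder Country}}
\email{first.author@example.org}

% In acmart, the abstract must appear before \maketitle.
\begin{abstract}
One-paragraph summary of the approach.
\end{abstract}

\maketitle

\section{Method}
Description of the solution, citing the dataset paper~\cite{zhang2024oag}.

\bibliographystyle{ACM-Reference-Format}
\bibliography{references} % assumes a hypothetical references.bib

\end{document}
```

The `review` option adds line numbers for reviewers. The `anonymous` option is left out here because submissions are single-blind.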

Reproducibility Supplement: Authors may include an optional one-page supplement focused on reproducibility at the end of their submitted paper. This page must be part of the same PDF file.

Author Information: After the submission deadline, the names and order of authors cannot be changed.

We would appreciate it if you could cite our dataset paper, available on arXiv.

@inproceedings{zhang2024oag,
  title={OAG-Bench: A Human-Curated Benchmark for Academic Graph Mining},
  author={Fanjin Zhang and Shijie Shi and Yifan Zhu and Bo Chen and Yukuo Cen and Jifan Yu and Yelin Chen and Lulu Wang and Qingfei Zhao and Yuqing Cheng and Tianyi Han and Yuwei An and Dan Zhang and Weng Lam Tam and Kun Cao and Yunhe Pang and Xinyu Guan and Huihui Yuan and Jian Song and Xiaoyan Li and Yuxiao Dong and Jie Tang},
  booktitle={Proceedings of the 30th ACM SIGKDD Conference on Knowledge Discovery and Data Mining},
  year={2024}
}

Contact Information: For any questions, please contact open-academic-graph@googlegroups.com.

Program of OAG-Challenge KDD Cup 2024 Workshop

Online Meeting:
Topic: OAG-Challenge @ KDD Cup 2024
Time: 2024/08/26 14:00-18:00 Barcelona Time
URL: https://meeting.tencent.com/dm/EBo78DZSqz1j
Tencent Meeting ID: 835-154-872

For offline attendees:

DAY	ROOM / LOCATION	START TIME	END TIME	SESSION / EVENT
Monday, August 26	M213-214	14:00	18:00	KDD Cup: OAG

Schedule:
Invited Talk:

Speaker: Professor Jing Zhang (Renmin University of China)
Title: Large Language Model-Based Techniques for Structured Data Processing
Abstract: LLMs have demonstrated exceptional performance on general natural language tasks. However, further adaptation is required for domain-specific applications. A significant portion of data in domain-specific applications is structured, while existing general-purpose large language models primarily focus on unstructured text data, resulting in limited performance on tasks involving structured data. In the context of relational databases, the research objective is to leverage LLMs to translate natural language instructions into SQL queries. Moreover, a substantial amount of structured data is stored in spreadsheets. It is desirable for large language models to support operations such as data updating, merging, and charting, extending beyond mere querying capabilities. Furthermore, we aim to achieve transparency in data storage formats, effectively utilizing existing domain-specific data access APIs (i.e., tools) and exploring how LLMs can effectively invoke these tools. In conclusion, adapting general-purpose large language models to specific domain applications for processing structured data is a crucial challenge that demands immediate attention.
Bio: I am a Professor at the School of Computer Science, Renmin University of China. My main research interests are in the fields of Knowledge Engineering and Large Models. My research focuses on data mining and knowledge discovery, with an emphasis on tailoring large language models (LLMs) for structured data processing to advance their application in data science. Additionally, I explore model compression techniques and efficient inference methods to improve the deployment of these models. I have published over 70 papers in top international conferences such as KDD, ACL, SIGMOD, WWW, SIGIR, and EMNLP, and in top international journals such as TKDE and TOIS. My papers have been cited over 9,000 times according to Google Scholar. I received the 10-Year Best Paper Award at the 2020 SIGKDD Conference. I have led national projects such as the National Excellent Young Fund and the Key Research and Development Program of the Ministry of Science and Technology. I am also an Associate Editor for IEEE Transactions on Big Data and AI Open.

Title	Speaker/Team Name	Time
Opening	Fanjin Zhang (Tsinghua University)	14:00-14:10
Large Language Model-Based Techniques for Structured Data Processing	Jing Zhang (Professor at Renmin University of China)	14:10-14:30
Enhancing Name Disambiguation via Iterative Self-Refining with LLMs; LLM-Based Iterative Hard Example Mining with Boosting for Academic Question Answering; Grafting Learning Is All You Need	BlackPearl	14:30-15:00
An Ensemble Model with Multi-Scale Features for Incorrect Assignment Detection	LoveFishO	15:00-15:10
Synergizing Large Language Models and Tree-based Algorithms for Author Name Disambiguation	AGreat	15:10-15:20
DIEq: Dynamic Identity Equilibrium for Author Disambiguation in KDD Cup 2024 WhoIsWho-IND Challenge	Leo_Lu	15:20-15:30
Advancing Academic Knowledge Retrieval via LLM-enhanced Representation Similarity Fusion: The 2nd Place of KDD Cup 2024 OAG-Challenge AQA	Robo Space	15:30-15:40
3rd Place Solution to KDD Cup 2024 Task 2: Large Academic Graph Retrieval via Multi-Track Message Passing Neural Network	PineappleHouse	15:40-15:50
A Multi-Channel Retriever for Effective Academic Question Answering	fuxinjiang	15:50-16:00
Coffee Break		16:00-16:30
The Solution for The PST-KDD-2024 OAG-Challenge	NJUST_KMG	16:30-16:40
LLM-Powered Ensemble Learning for Paper Source Tracing: A GPU-Free Approach	英国大力士	16:40-16:50
Efficient Training and Stacking Methods with BERTs-LightGBM for Paper Source Tracing	Heart	16:50-17:00
Poster Session		17:00-17:30

UPDATES

▪ To be eligible for awards, teams that place in the Top 15 on the TEST leaderboard must open-source their solutions. The public GitHub repository must include all code necessary to reproduce the results. Placeholder repositories will not be accepted. The submission deadline is 23:59 June 12, 2024, AOE Time. Please reply with your (Track, Team Name, Rank, GitHub URL) in these threads (https://www.biendata.xyz/forum/view_post_category/1034647/, https://www.biendata.xyz/forum/view_post_category/1034648/, https://www.biendata.xyz/forum/view_post_category/1034649/).
The authors are responsible for addressing any inquiries about their code.
Please add a README that provides sufficient information, including:
  Instructions: Detailed steps and exact commands required to reproduce the submitted result.
  Method Introduction: A brief overview of your methods.
  Good examples are [WhoIsWho-IND Codes] [AQA Codes] [PST Codes].

▪ May 30th, 2024
  1. Please make sure to accurately report the parameter count and GPU memory usage of your submissions for the validation and test set leaderboards; inaccurate values will affect your team's final ranking.
  2. The deadline for applying for the GLM-4 API Token is 2024-6-5 23:59 AOE time. Teams needing to apply should submit their application through https://zhipu-ai.feishu.cn/share/base/form/shrcnnAWmk1GWiUDgHsuwnkUr1b.

▪ April 29th, 2024: Slides of the official competition analysis are available at [Download Link].

▪ April 16th, 2024: Baseline code for the three tracks is available! [WhoIsWho-IND Codes] [AQA Codes] [PST Codes]

▪ April 3rd, 2024: The GLM-4 API token recharge progress can be checked [here]. Should you encounter any issues, feel free to raise them on the Discussion Board of the respective competition's website.

▪ March 28th, 2024: GLM-4 API tokens are distributed every Thursday (no later than 23:59 AOE Time). We kindly encourage you to submit your application as early as possible to ensure a smooth process. To recharge successfully, please sign up at https://open.bigmodel.cn/ first.

▪ March 20th, 2024: OAG-Challenge at KDD Cup 2024 started!

TIMELINE

March 20th, 2024: Start of KDD Cup 2024
May 31st, 2024: Team Merge Deadline
May 31st, 2024: Release test data. All participants have 7 days to submit their results.
June 7th, 2024: All tracks end.
June 14th, 2024: Announcement of the KDD Cup winner.

RULES

Code and Report Submissions

For the winning solutions on the final leaderboard, we require public code submission through a GitHub repo. The repo should contain
▪ All the code needed to reproduce your results (including data pre-processing and model training/inference) and to save the test submission.
▪ A README.md that contains all the instructions to run the code (from data pre-processing to model inference on test data).
In addition, we require a short technical report that describes your approach. The link can point either to arXiv or to a PDF uploaded to your GitHub repository.

Use of Large Language Models (LLMs) and API

For all tracks, pre-trained models that were open-sourced before the end of the competition may be used.
WhoIsWho-IND and PST allow the use of APIs. After a valid submission to the validation set, participating teams can obtain a free quota of 1 million tokens for the GLM-4 API [How To Get].
Since the AQA dataset was collected from QA platforms, the AQA task does not allow the use of APIs.

AWARDS

Each track is allocated $10,000 in awards.

▪ Gold Medal (1st Place): $3,000
▪ Silver Medal (2nd Place): $2,000
▪ Bronze Medal (3rd Place): $1,000
▪ Honorable Prizes (4th – 11th Place): $500 per team.
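
As a quick consistency check on the figures above, assuming that places 4 through 11 correspond to eight teams per track, the per-track prizes sum to the stated total:

```latex
\[
  \underbrace{3000}_{\text{1st}} + \underbrace{2000}_{\text{2nd}} + \underbrace{1000}_{\text{3rd}}
  + \underbrace{8 \times 500}_{\text{4th to 11th}} = 10000
\]
```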

ORGANIZERS

Tsinghua University, Knowledge Engineering Group (KEG) and Zhipu AI

OAG-Challenge Team:

Fanjin Zhang (Tsinghua University), Shijie Shi (Zhipu AI), Kun Cao (Zhipu AI), Bo Chen (Tsinghua University)

Steering Committee (in alphabetical order):

Yuxiao Dong, Cho-Jui Hsieh, Jie Tang, Steffen Staab, Yizhou Sun

Reference

Details about our datasets and initial baseline analysis are described in our OAG-Bench paper. If you use OAG-Challenge in your work, please cite our paper. (BibTeX below)

@article{zhang2024oag,
  title={OAG-Bench: A Human-Curated Benchmark for Academic Graph Mining},
  author={Fanjin Zhang and Shijie Shi and Yifan Zhu and Bo Chen and Yukuo Cen and Jifan Yu and Yelin Chen and Lulu Wang and Qingfei Zhao and Yuqing Cheng and Tianyi Han and Yuwei An and Dan Zhang and Weng Lam Tam and Kun Cao and Yunhe Pang and Xinyu Guan and Huihui Yuan and Jian Song and Xiaoyan Li and Yuxiao Dong and Jie Tang},
  journal={arXiv preprint arXiv:2402.15810},
  year={2024}
}