CCKS 2019 Task 2 (Mandarin Text Data Only)

CCKS 2019 & Baidu • ￥15000 • 744 teams • 820 participants

2019-04-19 - Launch

2019-07-30 - Team Merger Deadline

2019-07-25 - Close

Update on 2019/5/30

The deadline for disclosing any extra data you use has been extended to July 10, 2019. If you use ANY data other than the data provided by the organizers, please post the data source in the forum. Please note that the extra data must be open source and free to download and use. No additional labeled data may be used in this competition.

Introduction

Background

In recent years, knowledge graphs have developed rapidly and have become an important resource for artificial intelligence applications. Search engines are one example: using structured information from a knowledge base, a search engine can semantically analyze a user’s query to better understand the user’s intent. A key task here is to recognize the named entities in the query, then disambiguate them and link them to the corresponding entries in the knowledge base. A search engine can also use the knowledge graph for additional tasks such as search recommendation and query expansion.

In terms of technical difficulty and application scenarios, many existing entity linking datasets focus on long texts, which are easier to annotate because they carry sufficient contextual information. Entity linking on short texts such as search queries, microblogs, dialogues, and advertising slogans remains challenging. The main reasons are as follows:

(1) Heavy use of colloquial language makes entities difficult to disambiguate;

(2) Entity recognition and linking for short texts require a deeper understanding of the text because short text lacks rich contextual information;

(3) Annotation of short texts in Chinese is even more challenging than in English because of its language characteristics.

By opening Baidu’s data and hosting this entity linking competition, we hope to promote algorithmic progress together with young talent from academia and industry, and to make the complex world simpler through technology.

Task

Input

Several lines of Chinese short text.

Output:

The system is expected to produce a reasonable linking result for each short text: every named entity in the text should be linked to its corresponding entity in the knowledge base if one exists, and labeled “NIL” otherwise.

Example:

Input:

{
    "text_id":"1",
    "text":"比特币吸粉无数，但央行的心另有所属|界面新闻 · jmedia"
}

The “text_id” field is the index of the text, and the “text” field is a short Chinese text.

Output:

The entity recognition and linking result is in JSON format and includes text_id, text, and mention_data. “mention_data” is a list of the mentions linked to the knowledge base; each mention includes a kb_id, the mention string, and the offset of the mention in the input text.

{
    "text_id":"1",
    "text":"比特币吸粉无数，但央行的心另有所属|界面新闻 · jmedia",
    "mention_data":[
        {
            "kb_id":"278410",
            "mention":"比特币",
            "offset":"0"
        },
        {
            "kb_id":"199602",
            "mention":"央行",
            "offset":"9"
        },
        {
            "kb_id":"215472",
            "mention":"界面新闻",
            "offset":"18"
        }
    ]
}
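
The offsets in this example can be checked mechanically before submitting. The following is a minimal sketch in Python, assuming that "offset" is a zero-based character index into "text" (this matches the example above, but is an inference rather than something the organizers state). The helper name check_mentions is hypothetical and not part of any official tooling.

```python
def check_mentions(record):
    """Return (mention, offset) pairs whose offset does not match the text.

    Assumption: offsets are zero-based character indices, stored as strings.
    """
    text = record["text"]
    mismatches = []
    for item in record["mention_data"]:
        start = int(item["offset"])
        end = start + len(item["mention"])
        # The substring at the stated offset should equal the mention itself.
        if text[start:end] != item["mention"]:
            mismatches.append((item["mention"], item["offset"]))
    return mismatches


# The example record from this page, written as a Python dict.
record = {
    "text_id": "1",
    "text": "比特币吸粉无数，但央行的心另有所属|界面新闻 · jmedia",
    "mention_data": [
        {"kb_id": "278410", "mention": "比特币", "offset": "0"},
        {"kb_id": "199602", "mention": "央行", "offset": "9"},
        {"kb_id": "215472", "mention": "界面新闻", "offset": "18"},
    ],
}

print(check_mentions(record))  # prints [] because every offset lines up
```

Python strings are indexed by Unicode code point, so each Chinese character counts as one position here. A language that indexes by bytes or UTF-16 units would need a conversion step.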

Explanation:

If the input short text contains an ambiguous entity, the system is expected to disambiguate it and link it to the correct entity in the knowledge base. For example, the knowledge base contains 3 different entities named “比特币”, and on the surface any of them could be the correct link target.

Note:

The objects to be recognized and linked include all mentions and concepts.
