Launched in China in January 2011, Zhihu is currently the world’s largest knowledge-sharing platform. Every day, more than 200 million users access Zhihu’s services to showcase their expertise, to find quality content that informs their decisions, and to connect with people from whom they seek help, collaboration or partnership. Because hundreds of thousands of pieces of high-quality user-generated content (UGC) are created on Zhihu each day, understanding that content and distributing it efficiently has become an important topic within the closed loop of content production and content distribution that Zhihu has built.
So far, Zhihu’s machine learning team has built a basic ecosystem. Using machine learning, the team has implemented user profiling, content profiling, a personalized news feed, and more, raising processing efficiency dozens of times over the previous manual operations. In Zhihu’s view, the value of algorithms lies in understanding people’s latent needs and lowering the barriers to acquiring content, thereby improving learning efficiency. The team aims not only to serve users’ reading interests, but also to meet their need for good information that helps them improve themselves. With this machine learning support, high-quality information is routed to users efficiently, automatically and intelligently.
Currently, an important content distribution channel on Zhihu is the feed stream built on “Focus” (follow) relationships. A “Focus” relationship can target either a user or a topic tag. Recommending content according to the topic tags a user follows better meets that user’s need for knowledge across different areas and types. Accurate, automatic topic tagging of Zhihu content therefore plays a very important role in improving user experience and content distribution efficiency. Meanwhile, interpreting the meaning of text and tagging it automatically, especially when the number of tags is large and the tags are interrelated, is a cutting-edge research direction in natural language processing. For these reasons, the Zhihu algorithm team, together with the IEEE Computer Society and the IEEE China office, is hosting the Zhihu Machine Learning Challenge 2017, with the aim of revolutionizing, or even completely reshaping, the way people get information.
This contest is also known as the China Artificial Intelligence Contest.
Participants should build a model that automatically tags untagged data, trained on the bindings between questions and topic tags that Zhihu provides.
The tagged data contains 3 million questions. Each question has one or more tags, and there are 2,000 tags in total. Each tag corresponds to one topic on Zhihu. Tags may have parent-child relationships, through which the topics form a Directed Acyclic Graph (DAG).
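The structure described above (questions carrying one or more tags, and tags linked by parent relationships into a DAG) can be sketched as follows. This is only an illustration: the IDs, the edge list, and the helper names `build_parent_map`, `ancestors`, and `is_acyclic` are hypothetical and do not reflect the contest's actual file layout.

```python
from collections import defaultdict

# Hypothetical toy data; the real contest uses anonymized IDs.
# Each question maps to one or more tags (a multi-label target).
question_tags = {
    "q1": ["t_ml"],
    "q2": ["t_nlp", "t_python"],
}

# (child, parent) pairs; a tag may have more than one parent,
# which is why the hierarchy is a DAG rather than a tree.
parent_edges = [
    ("t_nlp", "t_ml"),
    ("t_ml", "t_cs"),
    ("t_python", "t_cs"),
    ("t_nlp", "t_linguistics"),
]


def build_parent_map(edges):
    """Map each child tag to the set of its direct parents."""
    parents = defaultdict(set)
    for child, parent in edges:
        parents[child].add(parent)
    return parents


def ancestors(tag, parents):
    """Return every tag reachable from `tag` by following parent edges."""
    seen = set()
    stack = [tag]
    while stack:
        node = stack.pop()
        for p in parents.get(node, ()):
            if p not in seen:
                seen.add(p)
                stack.append(p)
    return seen


def is_acyclic(parents):
    """In a DAG, no tag is its own ancestor."""
    nodes = set(parents) | {p for ps in parents.values() for p in ps}
    return all(n not in ancestors(n, parents) for n in nodes)
```

One possible use of `ancestors` is to check whether a predicted tag is merely a broader version of a true tag, although the text above does not say whether the evaluation takes the hierarchy into account.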
To protect user privacy and data security, this contest does not provide the original text of questions and topics; instead, text is represented by coded characters and coded segmented words. Because distributed representations are widely used in natural language processing, the contest also provides embedding vectors at the character level and the word level. These embedding vectors are obtained by training Google’s word2vec on the large text corpus that Zhihu provides.
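As a rough sketch of how such coded text and embeddings might be consumed, the example below parses vectors in word2vec's plain-text format (a "count dimension" header, then one token and its values per line) and averages the vectors of a coded word sequence. The file format, the token codes such as `w11`, the comma separator, and the function names are all assumptions made for illustration; the announcement does not specify them.

```python
import io


def load_word2vec_text(handle):
    """Parse word2vec plain-text format: header 'count dim', then 'token v1 v2 ...'."""
    _count, dim = map(int, handle.readline().split())
    vectors = {}
    for line in handle:
        parts = line.rstrip().split(" ")
        if len(parts) != dim + 1:
            continue  # skip blank or malformed lines
        vectors[parts[0]] = [float(x) for x in parts[1:]]
    return vectors, dim


def embed_sequence(token_ids, vectors, dim):
    """Average the vectors of known tokens; unknown tokens are ignored."""
    known = [vectors[t] for t in token_ids if t in vectors]
    if not known:
        return [0.0] * dim
    return [sum(column) / len(known) for column in zip(*known)]


# Hypothetical two-dimensional embeddings for three coded words.
sample = io.StringIO("3 2\nw11 1.0 2.0\nw54 3.0 4.0\nw6 0.0 0.0\n")
vectors, dim = load_word2vec_text(sample)

# A question title given as coded words; "w999" has no vector and is skipped.
title_vector = embed_sequence("w11,w54,w999".split(","), vectors, dim)
```

Averaging is only the simplest way to turn a token sequence into a fixed-size input; a sequence model could consume the per-token vectors directly.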
Apart from case conversion, full-width to half-width character conversion, and removal of special characters (e.g. emoji and invisible characters), the training data and the prediction data do not undergo any cleaning.
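The three normalization steps named above could look roughly like the following. This is a minimal sketch under stated assumptions, not the organizers' actual pipeline: lowercasing is assumed as the direction of case conversion, and "special characters" is approximated by the Unicode categories Cc, Cf and So, which cover control characters, invisible format characters, and many (not all) emoji.

```python
import unicodedata


def fullwidth_to_halfwidth(text):
    """Map full-width ASCII variants (U+FF01..U+FF5E) and the ideographic space to ASCII."""
    out = []
    for ch in text:
        code = ord(ch)
        if code == 0x3000:  # ideographic (full-width) space
            code = 0x20
        elif 0xFF01 <= code <= 0xFF5E:  # full-width '!' through '~'
            code -= 0xFEE0
        out.append(chr(code))
    return "".join(out)


def strip_special(text):
    """Drop control (Cc), format (Cf) and other-symbol (So) characters.

    Assumption: this approximates 'emoji and invisible characters'.
    Note that Cc also includes tabs and newlines.
    """
    return "".join(
        ch for ch in text if unicodedata.category(ch) not in {"Cc", "Cf", "So"}
    )


def normalize(text):
    """Apply width conversion, special-character removal, then lowercasing."""
    return strip_special(fullwidth_to_halfwidth(text)).lower()
```

Applying the same function to any external text before looking up embeddings would keep it consistent with the preprocessed contest data, provided the assumed rules match the real ones.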
(practice) Zhihu Machine Learning Challenge 2017