The BUPT Automation School Team Wins First Place by Achieving a Multimodal Identification Technology Accuracy Rate of 91.14%
iQIYI, Inc., an innovative, market-leading online entertainment service in China, and the ACM International Conference on Multimedia (“ACM MM”) jointly held the 2019 Celebrity Video Identification Challenge (the “Challenge”), which officially came to a close after three months of competition. At this year’s AI competition, a total of 255 teams took part, including iQIYI, top foreign and domestic universities such as Carnegie Mellon University, University College London, University of Exeter, Tsinghua University, and Peking University, and well-known companies such as Baidu, ZTE, JD.com, Meitu, and Nvidia, and together they once again made a major breakthrough in video-based multimodal person identification technology. The team from Beijing University of Posts and Telecommunications’ (“BUPT”) Automation School won first place with an accuracy rate of 91.14%.
Liu Wenfeng, iQIYI’s Chief Technology Officer and President of Infrastructure and Intelligent Content Distribution Business Group (IIG), said, “The Challenge continues to make breakthroughs. In addition to adding important value to iQIYI’s entertainment environment, it also creates far-reaching effects on the advancement of person identification technology as well as academic research and professional training in the field.”
In recent years, many tech companies and academic institutions around the world have released video datasets to improve the accuracy of person recognition and better solve difficult problems in the video industry. Among them, Oxford University released the VoxCeleb2 dataset, which includes more than 6,000 identities and 150,000 videos, focusing on the problem of speaker recognition. To better identify characters in videos, the Chinese University of Hong Kong and SenseTime jointly released the CSM dataset, which includes 1,218 identities and 127,000 videos. The YouTube Faces DB, developed at Tel Aviv University in Israel, contains 3,425 video clips of 1,595 identities for face recognition in unconstrained environments.
At the contest, iQIYI announced a more challenging and detailed dataset, iQIYI-VID-2019, comprising 10,000 celebrities, 200 hours of content, and 200,000 video clips drawn from films, dramas, and short videos, making it more relevant to real application scenarios. The dataset included four kinds of pre-extracted multimodal features: face, head, body, and voiceprint. Because participating teams did not need to use their own computing resources to extract these features, the hardware barrier to entry was greatly lowered, which attracted more top academic teams from around the world and accelerated the continuous evolution of person identification technology. The first-place BUPT Automation School team re-extracted facial features from aligned faces and combined them with the four provided modalities, training a multimodal classification model on all five feature types and raising the accuracy of multimodal identification to 91.14%, an increase of 2.5 percentage points over last year.
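The fusion step described above, where several per-clip modality features are combined into one input for a classifier, can be sketched roughly as follows. The modality names echo the article, but the feature dimensions and the zero-filling strategy for missing modalities (e.g. a clip with no audible speech) are illustrative assumptions, not the actual iQIYI-VID-2019 feature specification or the BUPT team’s method:

```python
import numpy as np

# Assumed per-modality feature dimensions (illustrative only; the real
# iQIYI-VID-2019 feature sizes are not specified in this article).
MODALITY_DIMS = {
    "face": 512,            # provided face feature
    "head": 256,            # provided head feature
    "body": 256,            # provided body feature
    "voiceprint": 128,      # provided voiceprint feature
    "face_realigned": 512,  # hypothetical fifth modality: re-extracted aligned face
}

def fuse_features(clip_features):
    """Concatenate one feature vector per modality into a single fused vector.

    clip_features: dict mapping modality name -> 1-D numpy array.
    Modalities absent from the clip are zero-filled so every fused
    vector has the same fixed length and can feed one classifier.
    """
    parts = []
    for name, dim in MODALITY_DIMS.items():
        vec = clip_features.get(name)
        if vec is None:
            vec = np.zeros(dim)
        parts.append(np.asarray(vec, dtype=np.float64))
    return np.concatenate(parts)
```

With these assumed dimensions, every clip yields a 1,664-dimensional fused vector regardless of which modalities are present, which is what lets a single multimodal classification model be trained over clips with heterogeneous available features.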
The latest multimodal algorithms used in the competition will be valuable to many of iQIYI’s services, such as improving the HomeAI voice interaction platform, enhancing users’ video interaction experience, making AIWorks intelligent video content in long-form videos more accurate, generating short-form video content more quickly, and further raising the efficiency of iQIYI’s “iMAM” (iQIYI Media Asset Management) in creating high-quality content. In the future, iQIYI will continue to cooperate with both domestic and foreign academic institutions and industry leaders to advance the exploration and implementation of cutting-edge technology.