Description of the Problem

Videos carry rich information in both the visual modality (including text) and the audio modality. Video understanding has been a popular research topic in both academia and industry, and developing methods to fuse multi-modal information remains a challenging and meaningful line of exploration.
In this challenge, we focus on extracting subtitles from videos. Subtitles are text derived from a transcript or screenplay of the dialogue or commentary in a video. They are arguably the most important textual information in video data, as they record what people say, and they are widely used in recommendation, retrieval, and video understanding systems.
Subtitles refer to the text displayed in the video that accurately transcribes the audio; however, not all text appearing in a video belongs to the subtitles. For example, the text in the blue boxes in the figure is not subtitle text, whereas the text in the red box, which transcribes the audio, is.

Subtitle extraction is challenging when only a single modality is used. Audio-only methods are sensitive to background noise and variations in accent, and specific or homophonic words are difficult to recognize accurately. Relying only on visual information could plausibly avoid these problems, but ignoring the speech introduces new ones: video frames contain many categories of overlapping text, such as logos, advertisements, and background text, which strongly interfere with extracting subtitles from frames alone.
From the application perspective, annotating videos is very costly, whether for audio or for text, and sometimes annotations are available for only a single modality. In this challenge, we aim to encourage methods that use such supervision to train models that produce annotations in another modality, exploring new ways of iterating effective algorithms across modalities. Fusing the audio and visual modalities is therefore both necessary and complementary for subtitle extraction.
In this challenge, we provide a large-scale video dataset with both visual and audio annotations for extracting subtitles with multi-modal technologies. We provide three subtasks: in the first, participants may use only audio supervision; in the second, only visual annotations are provided; and in the third, both visual and audio supervision are provided and may be used.

Subtask 1: Extracting subtitles in the visual modality with audio annotations

To extract subtitles from video frames, a large number of keyframes would have to be annotated with bounding boxes and text contents, which is extremely costly. However, speech transcripts are much easier to obtain, and they contain almost all of the subtitle content. In this subtask, we present a challenge that explores learning visual subtitles under the supervision of speech transcripts; we expect annotations from the audio modality to improve subtitle extraction in the visual modality. We will provide 75 hours of video content, divided into 50, 20, and 5 hours for the training, validation, and testing sets, respectively. For the training set, only audio annotations will be provided, and participants are required to design a subtitle OCR system with these annotations. To pretrain an OCR system, participants may also use a limited number of open datasets and then fine-tune their models with the audio supervision. Under these conditions, participants will be asked to produce subtitle text for each video in our testing set, and the submitted results will be ranked by the CER metric.
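For reference, CER (character error rate) is conventionally defined as the character-level edit distance between a hypothesis and its reference, normalized by the reference length. The snippet below is a minimal sketch of that standard definition; the function name is illustrative, and the official evaluation script may apply additional text normalization (e.g. punctuation or whitespace handling) not shown here.

def cer(reference: str, hypothesis: str) -> float:
    """Character error rate: Levenshtein distance (substitutions,
    deletions, insertions) divided by the reference length."""
    ref, hyp = list(reference), list(hypothesis)
    # Edit distance via a rolling dynamic-programming row.
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr[j] = min(prev[j] + 1,          # deletion
                          curr[j - 1] + 1,      # insertion
                          prev[j - 1] + cost)   # substitution
        prev = curr
    return prev[len(hyp)] / max(len(ref), 1)

if __name__ == "__main__":
    # Lower is better; identical strings give 0.0.
    # One substitution out of six reference characters -> about 0.167.
    print(cer("今天天气不错", "今天天汽不错"))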
The site for submissions is subtask1.

Subtask 2: Extracting subtitles in the audio modality with visual annotations

In speech recognition tasks, especially video speech recognition, audio data are difficult to label owing to background music, sound effects, or noise. However, the text in a video, including its subtitles, can supply weakly labeled information. Many videos carry subtitles; some are produced manually, whereas others are generated automatically. Although automatic subtitles may be of lower quality than manual ones, they are often available in much greater quantity. Therefore, this subtask considers the use of visual annotations in videos, especially automatic annotations, to assist in building an ASR system.
In this subtask, participants will be required to use only visual annotations to build an ASR system for the corresponding videos. To improve robustness, the public ASR data listed in the following tables may be used as well. We will also provide a baseline model. The submitted results will be ranked by the CER metric on our testing set.
The site for submissions is subtask2.

Subtask 3: Extracting subtitles with both visual and audio annotations

In this subtask, the training set consists of 50 hours of video content with both visual and audio supervision, plus 200 hours of video content with no annotation. Another 20 and 5 hours of video will be provided as the validation and testing sets, respectively. For the visual annotations, we will provide the characters of all text in key frames; for the audio modality, we will provide speech transcripts of each VAD segment. With these data, participants will be required to produce subtitles for each video in our testing set, and the submitted results will be ranked by the CER metric.
The site for submissions is subtask3.