Description of the Problem
Videos carry rich information in both the visual (including on-screen text) and audio modalities.
Video understanding has been a popular research topic in both academia and industry.
Developing methods to fuse this multi-modal information is also a
challenging and meaningful direction.
In this challenge, we focus on extracting subtitles from videos. Subtitles are the text
derived from either a transcript or screenplay of the dialogue or commentary in
videos. Subtitles are arguably the most important textual information for video data, as they
record what people say. They are widely used in
recommendation, retrieval, and video understanding systems.
Subtitles are the on-screen text that accurately transcribes the audio; however, not all
text appearing in a video belongs to the subtitles. For
example, the text in the blue boxes in the figure is not a subtitle, whereas the text in the red
box, which transcribes the audio, is.
Subtitle extraction is a challenging task when only a single modality is used. Audio-based
methods are sensitive to background noise and variations in accent, and some
domain-specific or homophonic words are difficult to recognize accurately. Relying only
on visual information might seem to sidestep these problems, but overlooking the speech
information introduces other problems that come with video scenes: frames contain many
categories of stacked text, such as logos, ads, and backgrounds, which strongly interfere
with extracting subtitles from video frames alone.
From the application perspective, annotating videos is very costly, whether for
audio or for text. In practice, annotations may be available for only a single modality.
In this challenge, we aim to encourage the development of methods that use
such single-modality supervision to train models that perform annotation in the other modality,
exploring new ways of iterating algorithms effectively across modalities.
Fusing the audio and visual modalities is therefore both necessary
and complementary for subtitle extraction.
In this challenge, we provide a large-scale
video dataset with both visual and audio annotations for extracting subtitles with
multi-modal technologies. We offer three subtasks: in the first,
participants may use only audio supervision; in the second, only visual
annotations are provided; and in the third, both visual and audio supervision are
provided and may be used.
Subtask 1: Extracting subtitles in the visual modality with audio annotations
Extracting subtitles from video frames normally requires annotating a large number of
keyframes with bounding boxes and text content, which is extremely costly. Speech
transcripts, however, are much easier to obtain, and they contain almost all of the
subtitle content. In this subtask, we present a challenge that explores learning visual
subtitles under the supervision of speech transcripts. We expect that annotations from
the audio modality can improve subtitle extraction in the visual modality.
In this subtask, we will present 75 hours of video content, divided into 50, 20, and 5 hours for the
training, validation, and testing sets, respectively. For the training set, only audio
annotations will be provided, and participants are required to design a subtitle OCR
system with these annotations. To pretrain an OCR system, participants may also use a
limited number of open datasets and then fine-tune their models with the audio supervision.
Under these conditions, participants will be asked to produce the subtitle text for each video in our
testing set, and the submitted results will be ranked using the character error rate (CER) metric.
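For reference, the sketch below shows one way CER could be computed from a reference transcript and a hypothesis: the character-level Levenshtein distance normalized by the reference length. The official scoring script may apply additional text normalization (e.g., punctuation or whitespace handling), so treat this only as an illustration.

```python
# Minimal CER sketch: character-level Levenshtein distance divided by the
# reference length. The official scorer may normalize text differently.
def cer(reference: str, hypothesis: str) -> float:
    r, h = list(reference), list(hypothesis)
    prev = list(range(len(h) + 1))            # distances for the empty prefix of r
    for i in range(1, len(r) + 1):
        curr = [i] + [0] * len(h)
        for j in range(1, len(h) + 1):
            cost = 0 if r[i - 1] == h[j - 1] else 1
            curr[j] = min(prev[j] + 1,        # deletion
                          curr[j - 1] + 1,    # insertion
                          prev[j - 1] + cost) # substitution
        prev = curr
    return prev[len(h)] / max(len(r), 1)

# One substituted character against a 5-character reference gives CER = 0.2.
print(cer("hello", "hallo"))  # 0.2
```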
The site for submissions is subtask1.
Subtask 2: Extracting subtitles in the audio modality with visual annotations
In speech recognition tasks, and especially in video speech recognition, audio data
are difficult to label owing to background music, sound effects, and noise. However, the
text appearing in a video, including its subtitles, can supply weakly labeled
information.
A large number of videos carry subtitles; some subtitles are generated
manually, whereas others are generated automatically.
Although the quality of automatic subtitles may be lower than that of manual subtitles, they are
often available in much greater quantity.
Therefore, this subtask considers using
visual annotations in videos, especially automatic annotations, to assist in building an
ASR system.
In this subtask, participants will be required to use only the visual annotations to
build an ASR system for the corresponding videos. To improve robustness, the
public ASR data listed in the following tables may be used as well. We will also provide a
baseline model. The submitted results will be ranked using the CER metric on our
testing set.
The site for submissions is subtask2.
Subtask 3: Extracting subtitles with both visual and audio annotations
In this subtask, the training set comprises 50 hours of video content with both
visual and audio supervision, plus 200 hours of video content with no annotation.
Another 20 and 5 hours of video will be provided as the validation and testing sets, respectively.
For the visual annotation, we will provide the characters of all text in key frames;
for the audio annotation, we will provide the speech transcript of each voice activity
detection (VAD) segment. With these data, participants will be required to produce the
subtitles for each video in our testing set, and the submitted results will be ranked using
the CER metric.
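To clarify what a VAD segment is, the sketch below splits raw audio into contiguous speech spans with the py-webrtcvad package. This is only an illustrative pre-processing step under assumed parameters (16 kHz mono 16-bit PCM, 30 ms frames), not the pipeline used to prepare the challenge data.

```python
# Illustrative VAD segmentation (not the organizers' pipeline): splits 16-bit
# mono PCM audio into contiguous speech spans using py-webrtcvad.
import webrtcvad

def vad_segments(pcm: bytes, sample_rate: int = 16000,
                 frame_ms: int = 30, aggressiveness: int = 2):
    """Yield (start_sec, end_sec) spans of contiguous speech frames."""
    vad = webrtcvad.Vad(aggressiveness)
    frame_bytes = int(sample_rate * frame_ms / 1000) * 2  # 2 bytes per sample
    start = None
    for offset in range(0, len(pcm) - frame_bytes + 1, frame_bytes):
        t = offset / (2 * sample_rate)  # frame start time in seconds
        if vad.is_speech(pcm[offset:offset + frame_bytes], sample_rate):
            if start is None:
                start = t
        elif start is not None:
            yield start, t
            start = None
    if start is not None:
        yield start, len(pcm) / (2 * sample_rate)
```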
The site for submissions is subtask3.