We present a large-scale video dataset with 75 hours of video content, of which 50/5/20 hours are used for training, validation, and testing, respectively. Both visual (weak) and audio annotations are provided. In addition, 200 hours of unlabeled video content are provided as an unsupervised training resource.

Visual annotation:

For each video, we will provide pseudo-subtitles along with their locations and timestamps. To create them, the results of our video OCR system are generated and then corrected in combination with the ASR ground-truth labels as follows:
Step 1: We extract five frames per second from the videos, and then detect and recognize the text in the frames with the high-precision TencentOCR system. We save all the text lines as the visual annotations, which are used for subtask 2.
Step 2: To identify the subtitle text, we compare the OCR results with the ASR ground truth to determine which text lines belong to the subtitles, and take the corresponding bounding boxes and recognized text as the subtitle pseudo-annotations.
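The exact sampling and matching procedures are not specified here. As a rough illustration only, the two steps could be sketched as follows. The function names, the `text` key of each OCR line, and the 0.6 threshold are assumptions made for this sketch, and the character-overlap test with Python's `difflib` stands in for whatever comparison is actually used.

```python
from __future__ import annotations

from difflib import SequenceMatcher


def sampled_frame_indices(num_frames: int, fps: float,
                          samples_per_second: float = 5.0) -> list[int]:
    """Indices of the frames kept when sampling a fixed number of frames per second."""
    # Never step by less than one frame, so indices stay unique.
    step = max(1.0, fps / samples_per_second)  # e.g. 25 fps / 5 = every 5th frame
    indices = []
    k = 0
    while round(k * step) < num_frames:
        indices.append(round(k * step))
        k += 1
    return indices


def pick_subtitle_lines(ocr_lines: list[dict], asr_text: str,
                        threshold: float = 0.6) -> list[dict]:
    """Keep OCR lines whose characters mostly appear, in order, in the ASR transcript."""
    # The ASR text is space-tokenized; compare on characters only.
    reference = asr_text.replace(" ", "")
    picked = []
    for line in ocr_lines:
        candidate = line["text"].replace(" ", "")
        if not candidate:
            continue
        matcher = SequenceMatcher(None, candidate, reference, autojunk=False)
        matched = sum(block.size for block in matcher.get_matching_blocks())
        # Fraction of the OCR line that is covered by the transcript.
        if matched / len(candidate) >= threshold:
            picked.append(line)
    return picked
```

A line that fails the threshold (for example a channel logo or on-screen caption) would remain in the full set of visual annotations but would not be marked as a subtitle.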
The location is presented as a bounding box with the coordinates of the four corners. The timestamp can be obtained with the frames per second (FPS) of the video and the index of each frame.
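As a minimal sketch of the timestamp computation, assuming that the frame index counts from 0 at the video's native FPS (this convention is an assumption and is not stated above), the time of a frame in seconds is its index divided by the FPS. The helper name `frame_timestamp` is hypothetical.

```python
def frame_timestamp(frame_id: int, fps: float) -> float:
    """Time in seconds of a frame, assuming frame_id counts from 0 at the native FPS."""
    if fps <= 0:
        raise ValueError("fps must be positive")
    return frame_id / fps


# Frame 100 of a 25 fps video lies 4 seconds into the video.
print(frame_timestamp(100, 25))  # 4.0
```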
The annotation has the following format for a subtitle:
            "video_name": video_name,
            "frame_id": frame_id,
            "bounding_box": bounding_box,
            "text": text,
            "fps": fps,
For example, for a video named "TV_00000001", the annotations for all of the text in a frame, including one subtitle in the red box, are shown below:

            "video_name": "TV_00000001",
            "frame_id": 100,
            "content": {
            "text": "都放在你这个手指的动作上面", },
            "fps": 25

Audio annotation:

For each audio clip, we will provide its text and segment files in the Kaldi format. The segment file gives the start and end times of the VAD segments for each audio clip.
text: TV_00000001 都 放在 你 这个 手指 的 动作 上面
segments: TV_00000001 TV 0 3.259
wav.scp: TV_00000001 TV.wav
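The three sample lines above can be read with a few lines of Python. This is a sketch under the usual Kaldi conventions, in which a `segments` line has the form `<utterance-id> <recording-id> <start> <end>` with times in seconds, and a `text` line is an utterance ID followed by the transcript tokens. The function names are hypothetical, and the first field of a `wav.scp` line is treated simply as a key.

```python
def parse_text_line(line: str):
    """Split a Kaldi `text` line into (utterance_id, tokens)."""
    utt_id, *tokens = line.split()
    return utt_id, tokens


def parse_segments_line(line: str):
    """Split a Kaldi `segments` line into (utterance_id, recording_id, start, end)."""
    utt_id, rec_id, start, end = line.split()
    return utt_id, rec_id, float(start), float(end)


def parse_wav_scp_line(line: str):
    """Split a `wav.scp` line into (key, path); the path may contain spaces."""
    key, path = line.split(maxsplit=1)
    return key, path.strip()


utt_id, rec_id, start, end = parse_segments_line("TV_00000001 TV 0 3.259")
print(utt_id, rec_id, end - start)  # TV_00000001 TV 3.259
```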

Dev and Eval set

We will provide a test set without ground truth. Participants can submit their results, and we will evaluate and rank them. A development set with ground truth will also be provided so that participants can tune their algorithms on it. We will publish a website for this challenge prior to registration.

Summary of the Dataset

The following data will be provided during the training period:

Dataset | Visual Annotation | Audio Annotation
Training set provided (50h) | Yes (weak annotation) | Yes
Public dataset for training | detection: ICDAR2019_LSVT; recognition: chinese_ocr (only 10k chinese_ocr images can be used) |
Training set without annotation (200h) | No | No
Dev set (5h) | Yes | Yes
Eval set (20h) | No | No

The rules governing the use of the different datasets are outlined in the following table.

Activity | Build | Dev | Eval
Manually examine data before the end of the evaluation | Yes | No | No
Manually examine data after the end of the evaluation | Yes | Yes | No
Train models using released data | Yes | No | No
Parameter tuning | Yes | Yes | No
Score | Yes | Yes | No


The dataset whose copyright belongs to VMR is available for download for non-commercial purposes under a Creative Commons Attribution 4.0 International License. For the other datasets, we will provide URLs to the original data and release our annotations under a CC license.