Datasets
We present a large-scale video dataset containing 75 hours of video content, of which 50/5/20 hours are used for training, validation, and testing, respectively. Both visual (weak) and audio annotations are provided. In addition, 200 hours of unlabeled video content are provided as an unsupervised training resource.
Visual annotation:
For each video, we will provide pseudo-subtitles along with their locations and
timestamps. In the creation stage, the outputs of our video OCR system are generated
and then corrected in combination with the ASR ground-truth labels as follows:
Step 1: We extract five frames per second from each video, and then detect and recognize
the text in these frames with the high-precision TencentOCR system. We save all the
text lines as the visual annotations, and they are used for Subtask 2.
Step 2: To identify the subtitle text, we compare the OCR results with the ASR
ground truth to determine which text lines belong to the subtitles, and take the
corresponding bounding boxes and recognized text as the subtitle pseudo-annotations,
as sketched below.
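A minimal sketch of the two steps is given below. It assumes OpenCV for frame sampling; run_ocr stands in for the TencentOCR system (which is not part of the release), and the 0.8 similarity threshold is only an illustrative value, not the official one.

import cv2
from difflib import SequenceMatcher

def sample_frames(video_path, frames_per_second=5):
    # Step 1: yield (frame_id, frame) pairs sampled at roughly five frames per second.
    cap = cv2.VideoCapture(video_path)
    native_fps = cap.get(cv2.CAP_PROP_FPS) or 25.0
    step = max(int(round(native_fps / frames_per_second)), 1)
    frame_id = 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        if frame_id % step == 0:
            yield frame_id, frame
        frame_id += 1
    cap.release()

def run_ocr(frame):
    # Placeholder for the (non-public) TencentOCR system; it should return a
    # list of (bounding_box, text) pairs for the given frame.
    raise NotImplementedError

def is_subtitle(ocr_text, asr_utterances, threshold=0.8):
    # Step 2: keep an OCR line as a subtitle pseudo-annotation if it closely
    # matches any ASR ground-truth utterance (spaces removed for Chinese text).
    line = ocr_text.replace(" ", "")
    return any(
        SequenceMatcher(None, line, utt.replace(" ", "")).ratio() >= threshold
        for utt in asr_utterances
    )

# Usage (sketch): for each sampled frame, run OCR, then store only the lines
# whose text passes is_subtitle(...) as subtitle pseudo-annotations.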
The location is presented as a bounding box with the coordinates of the four
corners. The timestamp can be obtained with the frames per second (FPS) of the video
and the index of each frame.
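As a trivial illustration (assuming frame indices start at 0), the timestamp in seconds is simply the frame index divided by the FPS:

def frame_timestamp(frame_id, fps):
    # Timestamp (in seconds) recovered from the frame index and the video FPS.
    return frame_id / fps

frame_timestamp(100, 25)  # -> 4.0 s, matching the example annotation below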
The annotation has the following format for a subtitle:
{
    "video_name": video_name,
    "frame_id": frame_id,
    "bounding_box": bounding_box,
    "text": text,
    "fps": fps
}
For example, for a video named "TV_00000001", all of the text lines in a frame, including one subtitle in the red box, have the annotations shown below:
{
    "video_name": "TV_00000001",
    "frame_id": 100,
    "content": [
        {"text": "BTV"},
        {"text": "北京同仁堂"},
        {"text": "冠名播出"},
        {"text": "都放在你这个手指的动作上面"}
    ],
    "fps": 25
}
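A minimal sketch for reading such an annotation, assuming it is stored as a JSON file; the file name below is only an assumed example, since the release file layout is not specified here.

import json

with open("TV_00000001_frame_100.json", encoding="utf-8") as f:
    ann = json.load(f)

timestamp = ann["frame_id"] / ann["fps"]             # 100 / 25 = 4.0 s
texts = [entry["text"] for entry in ann["content"]]  # all recognized text lines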
Audio annotation:
For each audio clip, we will provide its transcript and a segment file in the Kaldi data
format (https://kaldi-asr.org/doc/data_prep.html). The segment file contains the start and
end times of the VAD segments for each audio clip.
text: TV_00000001 都 放在 你 这个 手指 的 动作 上面
segments: TV_00000001 TV 0 3.259
wav.scp: TV_00000001 TV.wav
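A minimal sketch for parsing these Kaldi-style files (utterance/recording IDs are the keys; the paths below assume the files sit in the current directory):

def read_kaldi_table(path):
    # Each line is "<key> <value...>"; returns {key: value} with values kept as strings.
    table = {}
    with open(path, encoding="utf-8") as f:
        for line in f:
            key, value = line.rstrip("\n").split(maxsplit=1)
            table[key] = value
    return table

text = read_kaldi_table("text")        # {"TV_00000001": "都 放在 你 这个 手指 的 动作 上面"}
wav_scp = read_kaldi_table("wav.scp")  # {"TV_00000001": "TV.wav"}

segments = {}
with open("segments", encoding="utf-8") as f:
    for line in f:
        utt_id, rec_id, start, end = line.split()
        segments[utt_id] = (rec_id, float(start), float(end))  # ("TV", 0.0, 3.259)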
Dev and Eval set
We will provide a testing dataset without ground truth. Participants can submit their results, and we will evaluate and rank them. A development set with ground truth will also be provided so that participants can optimize their algorithms on it. We will develop and publish a website for this challenge prior to registration.
Summary of the Dataset
The following data will be provided during the training period:
Dataset | Visual Annotation | Audio Annotation |
Training set provided (50h) | Yes (weak annotation) | Yes |
Public datasets for training | Detection: ICDAR2019_LSVT; Recognition: chinese_ocr (only 10k images may be used) | Aishell-1 (150h) |
Training set without annotation (200h) | No | No |
Dev set (5h) | Yes | Yes |
Eval set (20h) | No | No |
The rules governing the use of the different datasets are outlined in the following table.
Activity | Build | Dev | Eval |
Manually examine data before the end of the evaluation | Yes | No | No |
Manually examine data after the end of the evaluation | Yes | Yes | No |
Train models using released data | Yes | No | No |
Parameter tuning | Yes | Yes | No |
Score | Yes | Yes | No |
Copyright
The dataset whose copyright belongs to VMR is available for download for non-commercial purposes under a Creative Commons Attribution 4.0 International License. For the other datasets, we will provide URLs to the original data and release our annotations under a CC license.