Speech and Multimodal Interfaces Laboratory

Biometric Russian Audio-Visual Extended MASKS (BRAVE-MASKS) Corpus

6 classes of the BRAVE-MASKS corpus
30 speakers: 15 males, 15 females
Age: 19-86 y.o. (mean: 40.83, STD: 19.01)
Duration: 21 h 00 min 09 sec (for one channel)
Duration of utterances: 0.42 sec - 514.9 sec.
Devices: iPhone XS Max (left), iPad Pro (center), iPhone XS Max + Boya BY-M1 (right)
Audio: 48 kHz, 16 bit, mono format (PCM WAV)
Video: 4K 3840x2160 pixels, 60 (for smartphones) and 30 (for tablet) frames per second (MOV)
Data volume: ~185 GB
Sample files from the BRAVE-MASKS corpus: download

The BRAVE-MASKS corpus contains multi-angle images of people's faces in protective masks of many kinds, as well as audio recordings of continuous Russian speech produced by people wearing masks. The multimodal data were recorded with three devices: two Apple iPhone XS Max smartphones (left and right) and an Apple iPad Pro tablet (center), in regular office conditions in front of a heterogeneous background. A Boya BY-M1 microphone was connected to the right-hand phone. Three continuous audio-video recordings were made simultaneously.

At present, the corpus contains recordings of 30 native Russian speakers (15 males and 15 females, aged from 19 to 86 y.o., mean age is 40.84, std. dev. is 19.02). The informants performed various tasks and scenarios both without a mask and wearing several different protective masks: disposable medical masks, reusable tissue masks of various colors and prints, medical and special respirators both with and without filters, and protective face shields. In total, 33 types of protective masks were used. Similar masks were then combined into one class, which gave 6 classes (types of masks): tissue mask (TM), medical mask (MM), FFP2 and FFP3 protective masks (PM), respirator (R), protective face shield (PFS), and no mask (NM). Each informant was recorded in 6 sessions in 3 channels: once without any mask and 5 times wearing 5 different masks.

The corpus consists of two parts: bimodal (audio-visual data) and unimodal (video data).

Age and gender distribution of the BRAVE-MASKS corpus
The BRAVE-MASKS corpus recording setup

Bimodal part

The bimodal part contains audio-visual recordings of spoken utterances. The audio data were sampled at 48 kHz, 16 bit, in mono format. The video parameters are the same as in the unimodal part (see Unimodal part). All speakers were asked to read sentences from the Russian national standard GOST R 50840-95 "Speech transmission over communication paths. Methods for assessing quality, intelligibility and recognition" (different for each speaker) and from one phonetically representative text; they also answered several questions and described pictures shown to them (e.g. on sport activities, family, kids, food, and countries).
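
After downloading, it may be useful to confirm that an audio file matches the format stated above (48 kHz, 16 bit, mono PCM WAV). The following is a minimal sketch using Python's standard `wave` module; the function name and the idea of checking files one by one are illustrative assumptions, not part of the corpus tooling.

```python
import wave

# Expected audio parameters, taken from the corpus description.
EXPECTED_RATE_HZ = 48_000
EXPECTED_SAMPLE_WIDTH_BYTES = 2  # 16 bit
EXPECTED_CHANNELS = 1            # mono


def matches_corpus_audio_format(path):
    """Return True if the PCM WAV file at `path` is 48 kHz, 16 bit, mono.

    Hypothetical helper: the corpus does not ship such a function.
    """
    with wave.open(path, "rb") as wav:
        return (
            wav.getframerate() == EXPECTED_RATE_HZ
            and wav.getsampwidth() == EXPECTED_SAMPLE_WIDTH_BYTES
            and wav.getnchannels() == EXPECTED_CHANNELS
        )
```

Calling `matches_corpus_audio_format("some_file.wav")` on a file with a different sampling rate or channel count returns `False`, which makes it easy to spot files that were resampled or converted after download.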

Unimodal part

The unimodal part contains video-only recordings (without audio) of head rotation (clockwise and counterclockwise) made from 8 different points in the room, at distances from 0.9 m (for the audio setup) to 3.2 m (for the video setup) and at different angles. Parameters of the video files: resolution is 4K (3840x2160 pixels), frame rate is 60 (for smartphones) and 30 (for tablet) frames per second, color depth is 24 bits per pixel.

Corpus annotation

The data of each informant were recorded continuously, so we split the obtained files into sessions and utterances in a semi-automatic way. After that, we split all the data into Train/Development/Test sets in a speaker-independent way, with approximately the same distribution of age and gender classes in each set.
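
The exact splitting procedure is not described here, so the following is only a sketch of one way to build a speaker-independent split that roughly preserves age and gender proportions. The record layout `(speaker_id, gender, age)`, the age-class boundaries, and the 3:1:1 assignment pattern are all assumptions made for illustration.

```python
from collections import defaultdict


def age_bin(age):
    """Coarse age classes; the boundaries are illustrative assumptions."""
    if age < 30:
        return "young"
    if age < 55:
        return "middle"
    return "senior"


def split_speakers(speakers, pattern=("train", "train", "train", "dev", "test")):
    """Assign whole speakers to sets so that no speaker appears in two sets.

    `speakers` is a list of (speaker_id, gender, age) tuples (hypothetical layout).
    Speakers are ordered by (gender, age class) and then dealt out cyclically
    according to `pattern`, which spreads each stratum across the sets.
    """
    strata = defaultdict(list)
    for speaker_id, gender, age in speakers:
        strata[(gender, age_bin(age))].append(speaker_id)

    assignment = {}
    position = 0
    for key in sorted(strata):
        for speaker_id in sorted(strata[key]):
            assignment[speaker_id] = pattern[position % len(pattern)]
            position += 1
    return assignment
```

Every utterance then inherits the set of its speaker, which is what keeps the split speaker-independent. With 30 speakers and the default pattern, this sketch yields 18 training, 6 development, and 6 test speakers; these proportions are a property of the sketch, not of the published split.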

For each channel of the bimodal part we obtained 30 speakers x 6 masks x 83 utterances = 14940 video files, 20 h 57 min 33 sec in total. Utterance length varied from 0.42 sec to 514.9 sec (the longest spontaneous narrative).
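
The figures above can be cross-checked with a few lines of arithmetic. The sketch below recomputes the number of files per channel and derives the average utterance length implied by the stated total duration; the average is a derived value, not one reported in the corpus description.

```python
SPEAKERS = 30
MASK_CLASSES = 6
UTTERANCES_PER_SESSION = 83

files_per_channel = SPEAKERS * MASK_CLASSES * UTTERANCES_PER_SESSION

# 20 h 57 min 33 sec expressed in seconds.
total_seconds = 20 * 3600 + 57 * 60 + 33

# Average utterance length implied by the two numbers above.
mean_utterance_seconds = total_seconds / files_per_channel

print(files_per_channel)                 # 14940
print(total_seconds)                     # 75453
print(round(mean_utterance_seconds, 2))  # 5.05
```

The implied average of about 5 seconds shows that most utterances are short read sentences, while the 514.9 sec maximum is an outlier.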

All video files with head rotation recorded in the unimodal part (30 speakers x 3 channels x 2 rotation scenarios = 180 videos) were cut into fragments for each mask (180 videos x 6 masks = 1080 fragments). One frame per second was extracted from each fragment. This gave from 7800 to 13300 images (mean is 9350) per person, in JPG format. Additionally, we performed a region-of-interest (mask bounding box) annotation using the RetinaFace detector. We found that this detector produced many false positives (various non-face objects), so we had to check the annotations of each frame manually and remove erroneous cases.
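
The fragment counts and the one-frame-per-second sampling can be illustrated as follows. This is a sketch under two assumptions that the description does not confirm: the first frame of each second is the one kept, and the frame rate is an exact integer (60 or 30).

```python
SPEAKERS = 30
CHANNELS = 3
ROTATION_SCENARIOS = 2
MASK_CLASSES = 6

videos = SPEAKERS * CHANNELS * ROTATION_SCENARIOS
fragments = videos * MASK_CLASSES


def one_frame_per_second(total_frames, fps):
    """Return indices of the first frame in each second of a fragment.

    Hypothetical helper; which frame within a second was actually kept
    is not stated in the corpus description.
    """
    return list(range(0, total_frames, fps))


# A 10-second fragment yields 10 frames on both device types.
smartphone_indices = one_frame_per_second(total_frames=600, fps=60)
tablet_indices = one_frame_per_second(total_frames=300, fps=30)
```

Because sampling is tied to time rather than to frame count, the 60 fps smartphone channels and the 30 fps tablet channel contribute the same number of images for fragments of equal duration.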

Possible tasks

  • multi-class recognition of the type of mask worn by a speaker using his/her audio and video,
  • binary classification of speakers with/without masks,
  • a regression task, in particular, determining how much the masked voice changes compared to the unmasked voice,
  • speaker verification and identification.
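
For the first two tasks, the six class abbreviations defined in the corpus description can be mapped to training targets. The sketch below shows one possible encoding; the integer order of the classes and the decision to treat every class other than NM (including the face shield) as "mask" are assumptions, not conventions fixed by the corpus.

```python
# Class abbreviations and names as given in the corpus description.
MASK_CLASSES = {
    "NM": "no mask",
    "TM": "tissue mask",
    "MM": "medical mask",
    "PM": "FFP2 and FFP3 protective masks",
    "R": "respirator",
    "PFS": "protective face shield",
}

# Assumed integer encoding: follows the dictionary order above.
CLASS_TO_INDEX = {abbr: index for index, abbr in enumerate(MASK_CLASSES)}


def multiclass_label(abbr):
    """Target for 6-class mask type recognition."""
    return CLASS_TO_INDEX[abbr]


def binary_label(abbr):
    """Target for with/without mask classification: 0 = no mask, 1 = any mask."""
    if abbr not in MASK_CLASSES:
        raise KeyError(abbr)
    return 0 if abbr == "NM" else 1
```

For example, `multiclass_label("MM")` gives 2 and `binary_label("MM")` gives 1, so both tasks can be derived from the same annotation.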

Access to the corpus:

This corpus is available to the public. Permission to use, but not to reproduce or distribute, our corpus is granted to all researchers, provided that the following steps are properly followed:

  • Send an email to Maxim Markitantov (m.markitantov@yandex.ru) to get a link to download this corpus and a password to access the files of this corpus. Your email MUST be sent from a valid university account and MUST contain the following text:

    1. Subject: Application to download the BRAVE-MASKS corpus          
    2. Name: <your first and last name>
    3. Affiliation: <University where you work>
    4. Department: <your department>
    5. Position: <your job title>
    6. Email: <must be the email at the above mentioned institution>
    
    I have read and agree to the terms and conditions specified in the BRAVE-MASKS corpus webpage. 
    This corpus will only be used for research purposes. 
    I will not make any part of this corpus available to a third party. 
    I'll not sell any part of this corpus or make any profit from its use.
    
  • If you are going to use the data mentioned above, you MUST cite the paper below:

    Markitantov M., Ryumina E., Ryumin D., Karpov A. Biometric Russian Audio-Visual Extended MASKS (BRAVE-MASKS) Corpus: Multimodal Mask Type Recognition Task // In Proc. of INTERSPEECH, 2022, pp. 1756-1760.

    or:

    @inproceedings{bravemasks_corpus,
      title={Biometric Russian {Audio-Visual} Extended MASKS ({BRAVE-MASKS}) Corpus: Multimodal Mask Type Recognition Task},
      author={Maxim Markitantov and Elena Ryumina and Dmitry Ryumin and Alexey Karpov},
      booktitle={Proc. of INTERSPEECH},
      pages={1756--1760},
      year={2022},
      doi={10.21437/Interspeech.2022-10240}
    }