Speech and Multimodal Interfaces Laboratory

Analysis of Voice and Facial Features of a Human in a Mask

With the unexpected emergence and rapid global spread of the COVID-19 coronavirus pandemic, monitoring the safety of individuals and of society as a whole has become an urgent task in the new world of social distancing and "mask" culture. In recent years, periodically wearing protective face masks in public places has become familiar and commonplace for many residents of densely populated Asian countries (Japan, Singapore, Malaysia, China, etc.), who use them as protection against people with possible respiratory diseases, air pollution, and allergens. This mask culture, together with strict observance of quarantine requirements by the population of these countries, became the main guarantee of halting the spread of COVID-19. In recent months, masks have become an element of European culture and even fashion, firmly entering our dress code. Now and in the coming years, there is an urgent need for automated verification that people who are in public places, or who are in contact with infected people or those at risk of infection, are wearing a protective mask. Within this RFBR project, we propose to develop and study a new software system for automatic bimodal analysis of the voice and facial characteristics of a masked person.

A number of fundamentally new scientific and technical results will be obtained during the two-year research project: (1) new infoware: a bimodal Russian-language database (corpus) containing multi-angle images of people's faces in various types of protective masks, as well as audio recordings of dozens of native Russian speakers in masks, including disposable medical masks of various densities, reusable fabric masks of various colors with and without patterns, special respirators, and other means of protecting the mucous membranes of the face; (2) new methods and models for the automatic analysis of people's voice characteristics from speech, including detection of a protective mask when speaking, cough detection, estimation of the likelihood of a respiratory disease, etc.; (3) new methods and models for analyzing people's facial characteristics from video data, including detection of the presence or absence of a protective mask on the face and analysis of the biometric characteristics of the uncovered part of the face (the upper part of the head); (4) a prototype software system for automatic bimodal analysis of the voice and facial characteristics of a person in a mask.

The results of these studies, which are based on modern artificial intelligence technologies, can be directly used to combat the spread of viral epidemics (coronaviruses, including COVID-19, influenza viruses, and other highly pathogenic viruses in the future) both in Russia and around the world.

Results for 2021

During the second stage of the project, new methods and models were developed and studied for analyzing people's voice characteristics from natural speech and audio signals, including detection of a protective mask when speaking, recognition of the mask type, and cough detection. These methods are based on modern pre-trained convolutional neural networks, including PANN, and on SpecAugment augmentation. New methods and models were also developed and studied for analyzing people's facial characteristics from video data, including detection of a protective mask on the face and recognition of its type. These methods are based on pre-trained ResNet neural networks, YOLOv5 object detectors, and the Mixup, Insert, and Mosaic augmentation methods. For the task of analyzing the biometric characteristics of the uncovered part of the face, an analytical review of modern solutions in audio-visual recognition of people in masks was carried out, and deterministic and neural network methods were developed and studied; the latter are based on the ArcFace model. In addition, to enlarge the training data, methods were developed for generating synthetic image and video data by applying protective masks to images of people's faces.

We also developed and studied a software system for automatic bimodal analysis of the voice and facial characteristics of a person in a mask. The software receives real-time audio from a microphone and images from a webcam, pre-processes the signals, extracts informative features, calculates probabilistic predictions, and combines the resulting recognition predictions. Information from the different modalities is combined at the level of recognition predictions (hypotheses). The research showed that combining the audio and video modalities makes it possible to compensate for the weaknesses of the unimodal systems.
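The prediction-level combination of modalities described above can be illustrated with a minimal sketch. The project description does not state which fusion rule the software uses, so the weighted averaging below is only one common choice and is an assumption; the function names, the `audio_weight` parameter, and the class labels are hypothetical.

```python
from typing import List, Sequence


def fuse_predictions(
    audio_probs: Sequence[float],
    video_probs: Sequence[float],
    audio_weight: float = 0.5,
) -> List[float]:
    """Combine per-class probabilities from two unimodal classifiers.

    Assumption: a weighted average is used as the fusion rule; the
    actual rule used in the project software is not given in the text.
    """
    if len(audio_probs) != len(video_probs):
        raise ValueError("both modalities must predict the same classes")
    if not 0.0 <= audio_weight <= 1.0:
        raise ValueError("audio_weight must lie in [0, 1]")

    video_weight = 1.0 - audio_weight
    fused = [
        audio_weight * a + video_weight * v
        for a, v in zip(audio_probs, video_probs)
    ]
    total = sum(fused)
    if total <= 0.0:
        raise ValueError("fused scores must have a positive sum")
    # Renormalize so the fused scores again sum to one.
    return [p / total for p in fused]


def predict_label(probs: Sequence[float], labels: Sequence[str]) -> str:
    """Return the label with the highest fused probability."""
    best_index = max(range(len(probs)), key=lambda i: probs[i])
    return labels[best_index]


if __name__ == "__main__":
    labels = ["no_mask", "mask"]   # hypothetical class names
    audio = [0.6, 0.4]             # made-up audio model output
    video = [0.1, 0.9]             # made-up video model output
    fused = fuse_predictions(audio, video)
    print(predict_label(fused, labels))  # mask
```

In this made-up example the audio model alone would favor "no_mask", but the more confident video prediction outweighs it after fusion, which is the kind of compensation between modalities that the paragraph above refers to.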

The results are presented in a series of scientific publications comprising 11 articles: 5 in the proceedings of international conferences and 6 in specialized Russian journals indexed in Scopus, Web of Science, and RSCI: "Computer Optics" (Q1-2), "Informatics and Automation", "Scientific Visualization", "Scientific and Technical Journal of ITMO", and "Journal of Instrument Engineering". Five reports were presented at international and Russian conferences, two of them at INTERSPEECH, a top international conference ranked A in the CORE international rating.

Three results of intellectual activity have been created and registered with Rospatent: 1) "Software for recording audiovisual data of people in protective masks", 2) "Corpus of audiovisual Russian-language data of people in protective masks (BRAVE-MASKS - Biometric Russian Audio-Visual Extended MASKS corpus)", and 3) "Software package for audio-visual recognition of personal protective equipment on a person's face (Audio-visual facial masks detection - AVIFAME)". Also, based on the results of the project, applications for the registration of two RF patents for inventions were submitted: "Method for audiovisual recognition of personal protective equipment on a person's face" and "Method for generating colored protective masks on images of people's faces".

Addresses of Internet resources prepared for the Project:

  1. Web page of Biometric Russian Audio-Visual Extended MASKS corpus (BRAVE-MASKS)
  2. Web page of Masked Frontal-Faces Database (MFFD)
  3. A bimodal Russian-language database (corpus) of persons in protective masks (BRAVE-MASKS - Biometric Russian Audio-Visual Extended MASKS corpus)
  4. Software for recording audiovisual data of persons in protective masks
  5. Software for Audio-VIsual FAcial Masks dEtection - AVIFAME
  6. Ryumina E., Ryumin D., Ivanko D., Karpov A. A Novel Method for Protective Face Mask Detection Using Convolutional Neural Networks and Image Histograms // International Archives of the Photogrammetry Remote Sensing and Spatial Information Sciences. Vol. XLIV-2/W1-2021. 2021. pp. 177–182. DOI: 10.5194/isprs-archives-XLIV-2-W1-2021-177-2021
  7. Markitantov M., Dresvyanskiy D., Mamontov D., Kaya H., Minker W., Karpov A. Ensembling End-to-End Deep Models for Computational Paralinguistics Tasks: ComParE 2020 Mask and Breathing Sub-Challenges // In Proc. of INTERSPEECH. 2020. pp. 2072-2076. DOI: 10.21437/Interspeech.2020-2666
  8. Markitantov M., Ryumina E., Ryumin D., Karpov A. Biometric Russian Audio-Visual Extended MASKS (BRAVE-MASKS) Corpus: Multimodal Mask Type Recognition Task // In Proc. of INTERSPEECH. 2022. pp. 1756-1760. DOI: 10.21437/Interspeech.2022-10240
  9. Ryumina E., Ryumin D., Markitantov M., Karpov A. A method for generating training data for a protective face mask detection system // Computer Optics. Vol. 46(4). 2022. pp. 603-611. DOI: 10.18287/2412-6179-CO-1039
  10. Dvoynikova A., Markitantov M., Ryumina E., Ryumin D., Karpov A. Analytical Review of Audiovisual Systems for Determining Personal Protective Equipment on a Person’s Face // Informatics and Automation. No. 20(5). 2021. pp. 1116-1152. DOI: 10.15622/20.5.5
  11. Ryumina E., Verkholyak O., Karpov A. Annotation Confidence vs. Training Sample Size: Trade-Off Solution for Partially-Continuous Categorical Emotion Recognition // In Proc. of INTERSPEECH. 2021. pp. 3690-3694. DOI: 10.21437/Interspeech.2021-1636
  12. Letenkov M.A., Iakovlev R.N., Markitantov M.V., Ryumin D.A., Saveliev A.I., Karpov A.A. Method for Generating Synthetic Images of Masked Human Faces // Scientific Visualization. Vol. 14. No. 2. 2022. pp. 1-17. DOI: 10.26583/sv.14.2.01
  13. Kosulin K.E., Karpov A.A. Methods for audiovisual recognition of people in masks // Scientific and Technical Journal of Information Technologies, Mechanics and Optics. Vol. 22. No. 3. 2022. pp. 415–432. DOI: 10.17586/2226-1494-2022-22-3-415-432
  14. Kukharev G.A., Ryumina E.V., Shulgin N.A. Method for generating masks on face images and systems for their recognition // Scientific and Technical Journal of Information Technologies, Mechanics and Optics. Vol. 22. No. 3. 2022. pp. 547–558. DOI: 10.17586/2226-1494-2022-22-3-547-558
  15. Letenkov M., Iakovlev R., Karpov A. Approach to Image-Based Recognition of User Face in Setting of Partial Face Occlusion by Personal Protective Equipment // Electromechanics and Robotics. Smart Innovation, Systems and Technologies. Vol. 232. 2021. pp. 249-258. DOI: 10.1007/978-981-16-2814-6_22


Results for 2020

At the first stage of RFBR project No. 20-04-60529, an extended analytical review was carried out on the detection of protective masks on the human face from voice and facial characteristics, on the recognition of respiratory diseases, and on automatic COVID-19 recognition from human speech and sounds, together with a survey of currently available audio-visual speech corpora. In addition, software was developed for recording video data in order to collect and annotate a bimodal corpus containing multi-angle video of people in various types of protective masks and audio recordings of continuous Russian speech of people in masks. A distinctive feature of the software is its ability to capture and record video data from several mobile devices (up to 3) in parallel. A new methodology for creating audio-visual speech corpora was proposed, which allows recording multi-angle data and spontaneous speech. To address the fundamental problem of detecting protective masks on a person's face from voice and facial characteristics, the bimodal Russian-language database (corpus) BRAVE-MASKS was recorded; it includes recordings of 30 native Russian speakers. The corpus contains 44,820 video files (with audio) in MOV format, 180 files with the orthographic text of the spoken phrases in TXT format, and about 2 million frame-by-frame images in JPG format extracted from the video recordings. The corpus was recorded using two smartphones and one tablet controlled by the developed software for the iOS operating system. In addition, preliminary results were obtained on automatic recognition of the presence or absence of a protective mask on a person's face from both video and audio information. Moreover, an approach for creating a synthetic dataset by overlaying masks on images of human faces was presented.
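The idea of building synthetic data by overlaying masks on face images can be illustrated with a minimal sketch. It is not the project's method: it assumes simple alpha compositing of an RGBA mask patch onto an RGB image at a given position, whereas a real pipeline would first have to locate the lower part of the face. The function name `overlay_mask` and the toy pixel values are hypothetical.

```python
from typing import List, Tuple

RGB = Tuple[int, int, int]
RGBA = Tuple[int, int, int, int]


def overlay_mask(
    face: List[List[RGB]],
    mask: List[List[RGBA]],
    top: int,
    left: int,
) -> List[List[RGB]]:
    """Alpha-composite an RGBA mask patch onto a copy of an RGB image.

    Assumption: (top, left) is supplied by the caller; in practice it
    would come from a face or landmark detector, which is not shown.
    """
    result = [row[:] for row in face]  # leave the input image untouched
    for dy, mask_row in enumerate(mask):
        for dx, (r, g, b, a) in enumerate(mask_row):
            y, x = top + dy, left + dx
            # Skip mask pixels that fall outside the face image.
            if not (0 <= y < len(result) and 0 <= x < len(result[y])):
                continue
            alpha = a / 255.0
            fr, fg, fb = result[y][x]
            result[y][x] = (
                round(alpha * r + (1.0 - alpha) * fr),
                round(alpha * g + (1.0 - alpha) * fg),
                round(alpha * b + (1.0 - alpha) * fb),
            )
    return result


if __name__ == "__main__":
    # Toy 2x2 gray "face" and a 1x2 mask: one opaque blue pixel and
    # one fully transparent pixel.
    face = [[(200, 200, 200)] * 2 for _ in range(2)]
    mask = [[(0, 0, 255, 255), (0, 0, 0, 0)]]
    masked = overlay_mask(face, mask, top=1, left=0)
    print(masked[1])
```

Applying such an overlay with different mask patches to the same unmasked face images is one way to enlarge a training set without recording new data.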

Karpov A.A.
Project's head
N 20-04-60529-viruses
Russian Foundation for Basic Research (RFBR)