Speech and Multimodal Interfaces Laboratory

Software and infoware for intelligent analysis of video and audio information for assistive mobile systems in vehicles

In recent years, the field of vehicle automation and intellectualization has attracted growing interest. The main factor driving research in this area is the high accident rate on public roads, both in Russia and abroad, with more than half of road accidents resulting from the human factor. To address the problem of the driver's hands and visual attention being diverted from driving, a new approach is proposed, based on a contactless voice interface for interacting with in-vehicle assistive systems. Existing voice control systems differ in the set of supported languages, the number of recognizable commands, the number of implemented functions, etc. However, they all share one weakness: they perform poorly under strong acoustic noise, which is typical for vehicles in traffic, especially at high speed. In acoustically noisy environments, visual information about speech plays an important role. To improve the quality and robustness of automatic speech recognition in noisy traffic environments, we propose the development and study of an audiovisual speech recognition system based on the joint processing of video and audio information, integrating modern computer vision methods for automatic lip-reading with methods for analyzing acoustic information.

In this project, new mathematical, software, and information models were developed for the intelligent analysis of video and audio information in assistive transport mobile systems. A set of models and methods was developed that automatically recognizes a vehicle driver's voice control commands within a given command dictionary by analyzing speech information from two modalities: the acoustic speech signal and lip images. Models and methods were also developed for determining the driver's destructive psycho-emotional state, which affects speech; accounting for this state makes it possible to improve the accuracy of voice command recognition.
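To illustrate the general bimodal recognition scheme, the following minimal Python sketch shows score-level (late) fusion of an acoustic and a lip-reading classifier over a fixed command dictionary; the command list, the fusion weight, and the function names are illustrative assumptions, not the project's actual implementation.

    import numpy as np

    # Illustrative command dictionary (not the project's actual vocabulary).
    COMMANDS = ["navigate home", "call contact", "play music", "stop"]

    def fuse_command_scores(audio_probs, video_probs, alpha=0.7):
        """Late (score-level) fusion of acoustic and lip-reading classifiers.

        audio_probs, video_probs: per-command posterior probabilities, one
        entry per command in COMMANDS; alpha weights the acoustic modality.
        """
        fused = alpha * np.asarray(audio_probs) + (1.0 - alpha) * np.asarray(video_probs)
        best = int(np.argmax(fused))
        return COMMANDS[best], float(fused[best])

    # Noisy audio is undecided; the lips clearly say "play music".
    audio = [0.30, 0.28, 0.27, 0.15]
    video = [0.05, 0.05, 0.85, 0.05]
    print(fuse_command_scores(audio, video, alpha=0.5))

In practice, the fusion weight can be adapted to the acoustic conditions: the noisier the cabin, the more trust is shifted to the visual stream.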

Results for 2022

Within the framework of the third stage of the project, software for an automatic audiovisual Russian speech recognition system, based on the joint processing of video and audio information and intended for use in assistive transport mobile systems, was developed and experimentally studied. In addition, software was developed for the module that determines the driver's destructive psycho-emotional state from speed and accelerometer data together with video of the driver, forming a context for speech recognition in a moving vehicle. Experimental studies of the developed audiovisual Russian speech recognition system were carried out on the collected RUSAVIC audiovisual corpus of driver speech and on data about the psycho-emotional state of drivers. The audiovisual continuous speech recognition system was integrated into the previously developed system for preventing vehicle emergency situations based on mobile video measurements of driver behavior.
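As a rough illustration of how speed and accelerometer data can be combined with video-based cues into a recognition context, here is a hand-tuned Python sketch; the thresholds, weights, and field names are hypothetical placeholders, whereas the project's actual module is model-based.

    from dataclasses import dataclass

    @dataclass
    class DriverContext:
        speed_kmh: float         # vehicle speed from GPS
        accel_var: float         # variance of accelerometer magnitude, (m/s^2)^2
        neg_emotion_prob: float  # video-based negative-emotion probability, 0..1

    def destructive_state_score(ctx: DriverContext) -> float:
        """Combine sensor and video cues into one destructive-state score.

        All thresholds and weights here are illustrative placeholders.
        """
        speed_term = min(ctx.speed_kmh / 130.0, 1.0)  # normalized speed
        jerk_term = min(ctx.accel_var / 4.0, 1.0)     # erratic-driving proxy
        return 0.3 * speed_term + 0.3 * jerk_term + 0.4 * ctx.neg_emotion_prob

    # A score above a chosen threshold can serve as context for the
    # recognizer, e.g. switching to noise-robust acoustic models.
    print(destructive_state_score(DriverContext(95.0, 3.2, 0.7)))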

As part of the project, 20 papers were published in journals and proceedings indexed in the Scopus and Web of Science citation systems, including 2 papers in first-quartile (Q1) international journals according to the Scopus (IEEE Access) and Web of Science (Mathematics) rankings, as well as 3 papers in journals from the HAC list (RSCI). Project results were presented in 12 papers at leading international conferences. In addition, 3 developed computer programs and the RUSAVIC database have been officially registered with Rospatent. The development results were also featured on the TV channels Russia 1, Channel 5, and Channel 78 in programs showcasing the developed system prototype.

Addresses of Internet resources prepared for the Project:

  1. Software Driver’s Audio-Visual Speech Recognition – DAVIS
  2. Software Multimodal Interaction for Drive Safely – MIDriveSafely
  3. Software for audio and multiview video processing, synchronization and annotation
  4. The multi-speaker audiovisual corpus RUSAVIC (Russian Audio-Visual Speech in Cars)
  5. Web page about the recognition system and corpus on the DriveSafely website
  6. Axyonov A., Ryumin D., Kashevnik A., Ivanko D., Karpov A. Method for visual analysis of driver's face for automatic lip-reading in the wild // Computer Optics. 2022. Vol. 46, No. 6. pp. 955-962. (in Russian)
  7. Ivanko D., Kashevnik A., Ryumin D., Kitenko A., Axyonov A., Lashkov I., Karpov A. MIDriveSafely: Multimodal Interaction for Drive Safely // In Proc. of ACM International Conference on Multimodal Interaction (ICMI 2022). 2022. pp. 733-735.
  8. Ivanko D., Axyonov A., Ryumin D., Kashevnik A., Karpov A. RUSAVIC Corpus: Russian Audio-Visual Speech in Cars // In Proc. of 13th Language Resources and Evaluation Conference (LREC 2022). 2022. pp. 1555-1559.
  9. Ivanko D., Ryumin D., Kashevnik A., Axyonov A., Karpov A. Visual Speech Recognition in a Driver Assistance System // In Proc. of 30th European Conference on Signal Processing (EUSIPCO 2022). 2022. pp. 1131-1135.
  10. Axyonov A., Ivanko D., Lashkov I., Ryumin D., Kashevnik A., Karpov A. A methodology of multimodal corpus creation for audio-visual speech recognition in assistive transport systems // Informatization and Communication. 2020. No. 4. pp. 49-55.
  11. Dresvyanskiy D., Ryumina E., Kaya H., Markitantov M., Karpov A., Minker W. End-to-end Modelling and Transfer Learning for Audiovisual Emotion Recognition in the Wild // Multimodal Technologies and Interaction. 2022. Vol. 6(2). ID 11.
  12. Ivanko D., Ryumin D., Karpov A. A Review of Recent Advances on Deep Learning Methods for Audio-Visual Speech Recognition // Mathematics. 2023. Vol. 11(12). ID 2665.
  13. Kashevnik A., Lashkov I., Axyonov A., Ivanko D., Ryumin D., Kolchin A., Karpov A. Multimodal corpus design for audio-visual speech recognition in vehicle cabin // IEEE Access. 2021. Vol. 9. pp. 34986-35003.
  14. Lashkov I., Kashevnik A. Aggressive Behavior Detection Based on Driver Heart Rate and Hand Movement Data // In Proc. of IEEE International Intelligent Transportation Systems Conference (ITSC 2021). 2021. pp. 1490-1495.
  15. Ivanko D., Ryumin D., Axyonov A., Kashevnik A. Speaker-dependent visual command recognition in vehicle cabin: methodology and evaluation // In Proc. of 23rd International Conference on Speech and Computer (SPECOM-2021). 2021. pp. 291-302.


Results for 2021

Within the framework of the second stage of the project, research and development of methods for the effective parametric representation of multi-angle video and audio signals for bimodal analysis and recognition of Russian speech were carried out. A method for determining speech boundaries in audiovisual signals was proposed, based on computer vision, machine learning technologies, and the Silero-VAD speech boundary detection system. A method was also proposed for synchronizing audio and video information in the speech recognition system: the two streams are aligned using the speech boundary detection method, while sensor information is synchronized via a time server. New speech recognition methods based on multi-angle video and audio information, using probabilistic models of acoustic and visual speech units, were developed and investigated. Hybrid neural network models for the Russian speech recognition system were created and trained separately on video and audio information. A method and an algorithm were developed for determining the driver's destructive psycho-emotional state from speed and accelerometer data together with video of the driver, forming a context for speech recognition in a moving vehicle.
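For the speech boundary detection step, the following minimal sketch shows the standard usage of the publicly available Silero-VAD model (loaded via torch.hub from the snakers4/silero-vad repository) and maps the detected segments onto frames of the synchronized video; the audio file name and the frame rate are illustrative assumptions.

    import torch

    # Load the pretrained Silero-VAD model and helper functions via
    # torch.hub (standard usage from the snakers4/silero-vad repository).
    model, utils = torch.hub.load('snakers4/silero-vad', 'silero_vad')
    get_speech_timestamps, _, read_audio, _, _ = utils

    SR = 16000   # audio sampling rate expected by the model
    FPS = 30     # camera frame rate; illustrative value

    wav = read_audio('driver_command.wav', sampling_rate=SR)  # hypothetical file
    segments = get_speech_timestamps(wav, model, sampling_rate=SR)

    # Map each detected speech segment from audio samples to video frame
    # indices, so the same boundaries can crop the synchronized lip video.
    for seg in segments:
        t0, t1 = seg['start'] / SR, seg['end'] / SR
        print(f"speech {t0:.2f}-{t1:.2f} s -> frames {int(t0 * FPS)}-{int(t1 * FPS)}")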


Results for 2020

As part of the first stage of the project, an analytical review was carried out on the topic of assistive transport systems and existing audiovisual speech corpora; mathematical models and software were also developed for creating bimodal corpora that include multi-angle video and audio recordings of voice commands during vehicle movement, synchronized with available mobile sensors. Based on the analysis of the video stream from a smartphone camera facing the driver, synchronized with sensor information (GPS, accelerometer, gyroscope), a method was proposed for determining the driver's psycho-emotional state, which is needed for more accurate audiovisual speech recognition. A new technique for creating audiovisual speech corpora was proposed, which makes it possible to record data from different angles using the proposed methods and algorithms. To train bimodal Russian speech recognition models for assistive mobile transport systems under real and semi-natural conditions, the RUSAVIC audiovisual corpus was recorded, comprising recordings of 20 native speakers of Russian. The resulting corpus consists of 21,100 annotated audio/video files. The bimodal corpus was recorded using three smartphones controlled by software developed for the Android operating system.
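A minimal sketch of how recordings from several smartphones can be placed on a common timeline, assuming each device's clock offset relative to a reference time server has been measured beforehand (e.g. with an NTP-style exchange); the device names and numbers below are illustrative.

    def align_streams(local_starts: dict, offsets: dict) -> dict:
        """Bring multi-device recordings onto a common timeline.

        local_starts: device -> recording start time by the device clock (s).
        offsets: device -> measured clock offset to the time server (s).
        Returns per-device amounts (s) to trim from the start of each
        stream so that all streams begin at the same server-time instant.
        """
        server_starts = {d: t + offsets[d] for d, t in local_starts.items()}
        latest = max(server_starts.values())
        return {d: latest - s for d, s in server_starts.items()}

    # Hypothetical devices: three cabin smartphones with slightly
    # different clocks and recording start moments.
    starts = {'phone_front': 1000.000, 'phone_left': 1000.120, 'phone_right': 999.950}
    offs = {'phone_front': 0.010, 'phone_left': -0.085, 'phone_right': 0.040}
    print(align_streams(starts, offs))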

Project's head
Number: No. 19-29-09081-mk
Period: 2019-2022
Financing: Russian Foundation for Basic Research (RFBR)