Speech and Multimodal Interfaces Laboratory

At SPC RAS, a neural network has been taught to "read lips" to improve the accuracy of speech recognition

Researchers at the St. Petersburg Federal Research Center of the Russian Academy of Sciences (SPC RAS) have developed a way to recognize human speech from lip movements using artificial intelligence algorithms and computer vision. The development is expected to improve the accuracy of voice assistants in noisy environments, such as crowded places or when operating heavy machinery.

Today, systems that recognize human speech (the audio signal) for automated command execution are being actively deployed in a variety of areas, from cell phones to combat helicopters. As a rule, they are used by people with limb injuries or by operators of complex equipment whose hands are busy. Recently, such systems have also been gaining popularity as a convenience feature in various business areas, gadgets and voice-controlled smart home systems. Although modern recognition systems have become considerably more accurate at interpreting speech, their effectiveness can drop dramatically in heavy noise (loud sounds from equipment or crowded places).

"We have developed a smartphone application that recognizes spoken speech and 'reads lips' by analyzing the video signal from the device's camera. The program combines and analyzes information from the two sources to improve recognition accuracy. Experiments have shown that such a hybrid system is highly effective at recognizing human commands under difficult, noisy conditions," says Denis Ivanko, Senior Researcher of the Laboratory of Speech and Multimodal Interfaces at SPC RAS.

According to D. Ivanko, the application works on the same principle as human cognition: a person talking in a noisy place involuntarily watches the other speaker's lips, trying to lip-read what he or she may not have heard. This effect has been confirmed by scientific experiments in which people in noisy conditions were asked to recognize speech from audio only or from visual information only; the best results were shown by the group that received both types of data.

The application is based on a neural network model that has been trained to recognize several hundred of the most common commands from an audio-visual signal (video recordings accompanied by sound). Moreover, according to the scientists, the network can take in an audio-visual signal and automatically decide which data (video, sound, or both) will give the highest recognition accuracy.

During the experiments, the application was used by drivers of noisy heavy trucks at a large logistics company in Russia. For the sake of experimental validity, the software was installed on the subjects' smartphones. The accuracy of command recognition based on visual information alone was 60-80%, and in combination with the audio signal it exceeded 90%.
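The idea of combining the two sources and leaning on whichever is more reliable can be illustrated with a small sketch. This is not the SPC RAS model, which learns the decision inside a neural network. It is a hand-written weighted late fusion, and every name in it (`audio_weight_from_snr`, `fuse`, `recognize`, the command list, the probabilities, the noise thresholds) is a hypothetical value chosen for illustration. The sketch assumes that each modality already yields a probability for every command and that an estimate of the signal-to-noise ratio is available.

```python
import math

EPS = 1e-9  # avoids log(0) when a modality assigns zero probability


def softmax(scores):
    """Turn arbitrary scores into probabilities that sum to 1."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def audio_weight_from_snr(snr_db, midpoint_db=5.0, slope=0.5):
    """Assumed reliability curve: close to 0 in heavy noise, close to 1 in clean audio."""
    return 1.0 / (1.0 + math.exp(-slope * (snr_db - midpoint_db)))


def fuse(audio_probs, video_probs, audio_weight):
    """Weighted log-linear fusion of two per-command probability lists."""
    scores = [
        audio_weight * math.log(a + EPS) + (1.0 - audio_weight) * math.log(v + EPS)
        for a, v in zip(audio_probs, video_probs)
    ]
    return softmax(scores)


def recognize(commands, audio_probs, video_probs, snr_db):
    """Return the most likely command and the audio weight that was used."""
    weight = audio_weight_from_snr(snr_db)
    fused = fuse(audio_probs, video_probs, weight)
    best = max(range(len(commands)), key=lambda i: fused[i])
    return commands[best], weight


# Made-up example: the audio channel is unsure, the video channel prefers "stop".
COMMANDS = ["start", "stop", "left"]
AUDIO_PROBS = [0.40, 0.35, 0.25]
VIDEO_PROBS = [0.10, 0.80, 0.10]

for snr_db in (-10.0, 30.0):
    command, weight = recognize(COMMANDS, AUDIO_PROBS, VIDEO_PROBS, snr_db)
    print(f"SNR {snr_db:+.0f} dB: audio weight {weight:.3f}, command '{command}'")
```

With these made-up numbers the sketch follows the lips in heavy noise and picks "stop", while in clean audio it trusts the sound and picks "start". A trained network would replace the fixed reliability curve with a weighting learned from data.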

"Also, last year our model took first place in the world for lip-reading accuracy at specialized international scientific competitions. Participants trained their neural networks on an open English-language dataset of 500 thousand videos and tested them on a set of 25 thousand recordings. Our model reached an accuracy close to 90% using only the movements of the speakers' lips. We expect that in the future our application could be used by aircraft pilots and operators of heavy industrial equipment, or in interactive information kiosks in shopping malls and other crowded places," explains Denis Ivanko.

This research is supported by RSF grant No. 21-71-00132. In addition, a certificate of state registration has been received for the developed software. The results of the research were presented at the international European Signal Processing Conference (EUSIPCO).