Speech and Multimodal Interfaces Laboratory

Research and development of an intelligent system for complex paralinguistic speech analysis

Computational paralinguistics is a new and rapidly developing area of modern speech technology. It examines and analyzes a variety of non-verbal aspects of natural speech, text and multimodal communication: emotions, accents, intonation, psycho-physiological states, pronunciation features, parameters of the human voice and other non-verbal speech characteristics. Paralinguistics is mainly concerned with how speech is pronounced rather than what is pronounced. Automatic recognition of emotions in speech (emotional/affective computing) is the most popular and highly demanded area of computational paralinguistics; it is closely associated with recognition of the speaker's state and analysis of individual voice characteristics. The current state of a speaker, as a rule, corresponds to a dynamically changing environment and can be described by parameters such as emotional and psycho-physiological state, state of health, fatigue, stress, depression, etc. Individual characteristics of the speaker correspond to permanent or relatively constant human traits: gender, age, height, accent, ethnicity, medical conditions such as Parkinson's or Alzheimer's disease, etc.
The main objective of this Russian Science Foundation (RSF) project is to develop a new intelligent computer system for language-independent paralinguistic speech analysis. A distinctive feature of the system is that it will carry out an integral paralinguistic analysis of speech audio signals, i.e. it will automatically analyze a speaker's speech to simultaneously determine their gender, age and psycho-emotional state, evaluate the truthfulness of their statements, and estimate other paralinguistic characteristics of speech. The system will handle not only Russian speech but also other world languages, enabling universal paralinguistic speech analysis; the research is therefore relevant to both Russian and international science. Unlike other speech technologies (automatic speech recognition and understanding, speech synthesis, machine translation of speech), paralinguistic speech analysis systems are not tied to a specific natural language. It is thus possible to create practically universal methods for processing non-verbal acoustic information, bearing in mind that the manner and quality of emotional expression differ to some extent among peoples and cultures.
The main tasks of this project are the development of, and theoretical and experimental research into, the software, mathematical, informational and linguistic support for the prospective intelligent system for paralinguistic speech analysis.
For the successful fulfillment of the project, the following tasks, grouped into three successive stages, had to be solved:

  1. development of information-linguistic and mathematical support for the intelligent system for complex paralinguistic speech analysis in 2018;
  2. research and development of mathematical support and software for the intelligent system for integral paralinguistic speech analysis in 2019;
  3. testing and quantitative evaluation of the intelligent system for integral paralinguistic speech analysis, as well as assessment of the overall results, in 2020.

Results for 2020

In 2020, the authors completed the third stage of the RSF project, devoted to testing and assessing the performance of the proposed intelligent system for complex paralinguistic speech analysis, as well as summarizing the completed work and drawing conclusions about the project as a whole. The following results were obtained:

  1. Different methods of combining computer algorithms for analyzing various paralinguistic speech phenomena were investigated:
    • based on this analysis, we built new mathematical models that incorporate age and gender information when predicting emotional states, and that use information about emotional states to determine the truthfulness of a speech statement;
    • the effectiveness of the proposed methods was demonstrated via experiments and quantitative assessment on various paralinguistic speech corpora.
  2. For the task of recognizing deception in speech, we developed new mathematical models and improved on existing techniques. Experiments were conducted on the DSD and RLTDDD speech corpora using the following classification methods: Gradient Boosting Machines, bootstrap aggregation (bagging), k-Nearest Neighbors, Support Vector Machines, Random Forests and Logistic Regression (see the first sketch after this list).
  3. In 2020, we developed two computer programs and registered them with the Russian Federal Service for Intellectual Property (Rospatent).
  4. Within the framework of this RSF project, we took part in several international paralinguistic challenges. For the 12th INTERSPEECH Computational Paralinguistics Challenge (ComParE-2020) we proposed three different systems for the following sub-challenges:
    • recognizing emotions of elderly people (Elderly Emotion Sub-Challenge),
    • recognizing a breathing signal (Breathing Sub-Challenge),
    • detecting the presence of a medical mask while talking (Mask Sub-Challenge).
    In two out of the three sub-challenges, the authors of the project, in collaboration with partners from Germany and the Netherlands, took 1st place, achieving the highest accuracy among all participants; as a result, we were awarded the 1st prize in the Elderly Emotion and Breathing sub-challenges. The proposed systems featured an effective cross-validation training strategy and an ensemble classification approach to the recognition of paralinguistic speech phenomena, which provides higher generalization capability than the traditional train/development data split (see the second sketch after this list). The investigated methods include both acoustic and linguistic speech modeling, as well as neural networks based on transfer learning, which improves learning on small datasets. The official results of the ComParE challenge are available at the site of the INTERSPEECH Computational Paralinguistics Challenges.
  5. In addition to ComParE-2020, we took part in the international FG-2020 competition Affective Behavior Analysis in-the-wild (ABAW), where our audio-visual system for recognizing 7 basic emotional facial expressions took 3rd place. The system was based on a deep learning architecture trained with a transfer learning technique for better performance (see the third sketch after this list); it reached an official classification metric of 42.1% (an absolute improvement of 6.1% over the baseline) in recognizing the following classes: anger, disgust, fear, happiness, sadness, surprise and neutral state.
  6. Beyond the 2020 plan, we conducted several experiments designed to support the main objectives of the project, including facial expression recognition, sentiment analysis, addressee detection, and recognition of breathing and of the presence of a medical mask on a speaker. We also investigated the practical possibility of using speech emotion recognition algorithms in an emergency call center: the proposed call redistribution algorithm significantly reduces the waiting time for high-priority calls, i.e. calls featuring the emotions of anger and fear (see the final sketch after this list).
  7. In 2020, we published 15 scientific works in total, including 5 journal articles, 9 conference papers and 1 chapter in a collective monograph. Among these, 10 were published in international periodicals indexed by Web of Science and/or Scopus, including the proceedings of top international conferences (INTERSPEECH, SPECOM, ACM ICMI Workshop WoCBU, etc.) and Q1 journals (Sensors and Applied Sciences). The rest appeared in Russian scientific literature indexed by the Russian Science Citation Index and the Higher Attestation Commission.
  8. Additionally, the results of the project were actively covered by several news outlets. The scientific internet portal Indicator.Ru published an interview with the project leader Alexey Karpov titled "Neural networks were taught to better recognize paralinguistic phenomena"; the web-based media platform ITMO.NEWS published an interview with the principal investigator of the project Oxana Verkholyak titled "Computational paralinguistics serving elderly people"; and the newspaper "Kommersant" published an article titled "Attentive interlocutor without a key word. An improved voice assistant will speak on par with humans".
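
The classification methods named in item 2 can be compared within a common scikit-learn pipeline. Below is a minimal sketch of such a comparison, assuming pre-extracted acoustic feature vectors; the data are random placeholders standing in for the DSD and RLTDDD corpora, and the feature dimensionality is an illustrative assumption.

```python
# Hedged sketch for item 2: comparing the named classifiers on a binary
# truth/deception task with placeholder features and labels.
import numpy as np
from sklearn.ensemble import (BaggingClassifier, GradientBoostingClassifier,
                              RandomForestClassifier)
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 88))    # assumed eGeMAPS-sized feature vectors
y = rng.integers(0, 2, size=200)  # 0 = truth, 1 = deception (placeholders)

classifiers = {
    "GBM": GradientBoostingClassifier(),
    "Bagging": BaggingClassifier(),
    "kNN": KNeighborsClassifier(n_neighbors=5),
    "SVM": SVC(kernel="rbf"),
    "RandomForest": RandomForestClassifier(n_estimators=200),
    "LogReg": LogisticRegression(max_iter=1000),
}

for name, clf in classifiers.items():
    pipe = make_pipeline(StandardScaler(), clf)  # scale features first
    # macro-averaged recall equals the UAR metric used in paralinguistics
    scores = cross_val_score(pipe, X, y, cv=5, scoring="recall_macro")
    print(f"{name:12s} UAR ~ {scores.mean():.3f}")
```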
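
Item 4 mentions a cross-validation training strategy combined with ensemble classification in place of a fixed train/development split. The following sketch illustrates the general idea under simplifying assumptions (placeholder data, a single classifier type, plain probability averaging); it is not the actual ComParE-2020 system.

```python
# Hedged sketch for item 4: train one model per CV fold, then fuse their
# predictions, so every training sample contributes to the final decision.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold

rng = np.random.default_rng(1)
X = rng.normal(size=(300, 40))      # placeholder training features
y = rng.integers(0, 3, size=300)    # placeholder 3-class labels
X_test = rng.normal(size=(50, 40))  # placeholder evaluation data

fold_models = []
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=1)
for train_idx, _ in skf.split(X, y):
    model = LogisticRegression(max_iter=1000).fit(X[train_idx], y[train_idx])
    fold_models.append(model)

# Ensemble decision: average the per-fold class probabilities.
probs = np.mean([m.predict_proba(X_test) for m in fold_models], axis=0)
predictions = probs.argmax(axis=1)
print(predictions[:10])
```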
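
Item 5 relies on transfer learning from a pretrained visual backbone. A minimal sketch of this general technique with torchvision is shown below; the ResNet-18 backbone, the frozen feature extractor and the dummy batch are illustrative assumptions, not the actual competition model.

```python
# Hedged sketch for item 5: adapting an ImageNet-pretrained CNN to
# 7-class facial expression recognition.
import torch
import torch.nn as nn
from torchvision import models

NUM_CLASSES = 7  # anger, disgust, fear, happiness, sadness, surprise, neutral

backbone = models.resnet18(weights=models.ResNet18_Weights.DEFAULT)
for param in backbone.parameters():
    param.requires_grad = False  # freeze the pretrained feature extractor
backbone.fc = nn.Linear(backbone.fc.in_features, NUM_CLASSES)  # new trainable head

optimizer = torch.optim.Adam(backbone.fc.parameters(), lr=1e-3)
criterion = nn.CrossEntropyLoss()

# One illustrative training step on a dummy batch of face crops.
images = torch.randn(8, 3, 224, 224)
labels = torch.randint(0, NUM_CLASSES, (8,))
optimizer.zero_grad()
loss = criterion(backbone(images), labels)
loss.backward()
optimizer.step()
```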
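
Item 6 describes a call redistribution algorithm that prioritizes calls carrying anger or fear. A toy sketch of such a queueing policy is given below; in practice the emotion labels would come from the speech emotion recognizer, while here they are hard-coded placeholders.

```python
# Hedged sketch for item 6: serve emergency calls with detected anger or
# fear first, breaking ties by arrival order.
import heapq
import itertools

HIGH_PRIORITY = {"anger", "fear"}
arrival = itertools.count()  # keeps FIFO order within a priority level
queue = []

def enqueue(call_id, emotion):
    priority = 0 if emotion in HIGH_PRIORITY else 1  # lower = served sooner
    heapq.heappush(queue, (priority, next(arrival), call_id, emotion))

for call_id, emotion in [("c1", "neutral"), ("c2", "fear"),
                         ("c3", "joy"), ("c4", "anger")]:
    enqueue(call_id, emotion)

while queue:
    _, _, call_id, emotion = heapq.heappop(queue)
    print(f"serving {call_id} ({emotion})")  # c2, c4, then c1, c3
```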

Results for 2019

In 2019, the authors fulfilled the second stage of the RSF project, related to the development and study of mathematical models and software for the intelligent system for complex paralinguistic speech analysis. The following results were obtained:

  1. Alongside improvements to the existing methods of digital speech signal processing and machine learning for the classification of paralinguistic phenomena in speech, new mathematical models were developed in several directions:
    • automatic recognition of spontaneous emotions, including emotions occurring in dialogues;
    • automatic recognition of potential deception in speech;
    • automatic detection of gender and age group of the speaker;
    • automatic addressee recognition.
  2. Overall, at this stage we developed 5 different deep neural network architectures and 3 approaches based on traditional machine learning methods, as well as algorithms for feature selection, construction of classification ensembles, domain adaptation and hierarchical context modelling. New program architectures and prototypes were implemented in the high-level programming language Python on the basis of the proposed mathematical models.
  3. In order to integrate existing libraries and toolkits into the framework of the intelligent system under development, we analyzed open-source programming toolkits dedicated to the automatic processing and recognition of paralinguistic phenomena in speech.
  4. The program developed for automatic speaker age and gender recognition, named “GASpeakerRecognizer”, was registered with the Russian Federal Agency for Intellectual Property (certificate no. 2019662952 dated 07.10.2019; authors: Markitantov M. V., Karpov A. A.; rights holder: SPIIRAS). The program can record speech in a real environment using a microphone, as well as read existing audio files and analyze them for the paralinguistic phenomena of interest.
  5. Another program prototype was developed for speech emotion recognition into 4 categories: joy, sadness, anger and neutral state. After a simple registration procedure, which comprises recording and statistically processing arbitrary spontaneous user speech, the statistics of each registered user are saved and used during the testing stage for user adaptation.
  6. A number of databases in the possession of the project contributors were extended with new speech corpora featuring rich paralinguistic annotation.
  7. Experiments were conducted with the proposed two-level hierarchical system for two speech emotion parameters, activation and arousal. The experiments used a cross-corpus setup with a domain adaptation technique based on the consecutive application of Principal Component Analysis (PCA) and Canonical Correlation Analysis (CCA), together with acoustic context modelling at various levels of context (personal level and dialogue level); a sketch of the PCA+CCA chain is given after this list.
  8. Preliminary experiments were conducted on characteristics of potentially deceptive speech that indicate increased inner tension and stress, which directly affect the psycho-emotional state of the speaker; the results reflect binary classification accuracy for the two categories [Truth, Deception].
  9. Research and experiments on automatic speaker age and gender classification were carried out, with experimental results obtained on the German speech corpus aGender.
  10. Our team, including the contributors of the current project, participated in the 11th International Computational Paralinguistics Challenge ComParE-2019 (Graz, Austria) and conducted experiments with the proposed algorithms on the databases provided by the organizers of the challenge. Our group took 2nd place in the Baby Sounds Sub-Challenge, beating the baseline performance on both the validation and test sets (our scores: UAR=60.55% and UAR=61.44%, respectively), while the winner of the challenge achieved UAR=62.4% on the test set. In the Continuous Sleepiness Sub-Challenge we placed 4th with SCC=0.339 and SCC=0.356 on the validation and test sets, respectively, a marginal improvement over the baseline; the first-place result (SCC=0.387) was not far from ours. In the third sub-challenge, dedicated to the recognition of Styrian Dialects, our results on the validation set (UAR=51.5%) beat the baseline, but our test set results did not improve over it; the proposed scheme placed 4th among 17 participants. Additionally, our team participated in two sub-challenges of the 9th International Audio/Visual Emotion Challenge AVEC-2019 (Nice, France): in the Cross-Cultural Emotion Sub-Challenge our team (named SUN) took 3rd place, and in the Depression Level Sub-Challenge we took 5th place.
  11. In 2019, 8 articles were published, indexed in the international databases Scopus and Web of Science and in the Russian Science Citation Index. The leader and contributors of the project gave oral, poster and invited talks on the results of the second stage of the project at the following top international conferences: the 44th International Conference on Acoustics, Speech and Signal Processing ICASSP-2019, the 9th ACM International Workshop AVEC-2019, the 20th ACL International Conference SIGdial-2019, the 21st International Conference SPECOM-2019, the 13th International Symposium IDC-2019, the 3rd International Conference on Engineering and Applied Linguistics (where Karpov gave an invited plenary talk), the 8th multidisciplinary workshop “Analysis of Spoken Russian Language 2019”, and the Congress of Young Scientists of ITMO University 2019.
  12. The results of the project were highlighted in the following media sources: news of the international information agency TASS and the TV show “Matrix of Science” on the TV channel “Saint Petersburg”.
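
Item 7 applies PCA followed by CCA for cross-corpus domain adaptation. A minimal sketch of this chain with scikit-learn is shown below; the random source and target features, the chosen dimensionalities, and the one-to-one sample pairing that CCA requires are simplifying assumptions of this illustration.

```python
# Hedged sketch for item 7: consecutive PCA and CCA for cross-corpus
# adaptation with placeholder source- and target-corpus features.
import numpy as np
from sklearn.cross_decomposition import CCA
from sklearn.decomposition import PCA

rng = np.random.default_rng(2)
X_src = rng.normal(size=(400, 100))  # placeholder source-corpus features
X_tgt = rng.normal(size=(400, 100))  # placeholder target-corpus features

# Step 1: PCA reduces each corpus to a lower-dimensional space.
Z_src = PCA(n_components=30).fit_transform(X_src)
Z_tgt = PCA(n_components=30).fit_transform(X_tgt)

# Step 2: CCA finds maximally correlated directions between the spaces;
# the adapted features can then feed a shared emotion classifier.
cca = CCA(n_components=15).fit(Z_src, Z_tgt)
A_src, A_tgt = cca.transform(Z_src, Z_tgt)
print(A_src.shape, A_tgt.shape)  # (400, 15) (400, 15)
```
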
Project head
Number
No. 18-11-00145
Period
2018-2020
Funding
Russian Science Foundation