Speech and Multimodal Interfaces Laboratory

Intelligent system for multimodal recognition of human's affective states

This interdisciplinary Russian Science Foundation (RSF) project addresses multimodal analysis and recognition of people's affective states from their behavior, using modern methods of digital signal processing and deep machine learning. Affective computing is highly relevant from scientific, technical and practical points of view. Many problems in this area remain unsolved, and systems that recognize human affective states from unimodal data alone (for example, only audio or only video) have a number of significant practical limitations. The most natural way for a person to interact and exchange information is multimodal communication, which involves several modalities (communication channels) at the same time, including natural speech and sounds, facial expressions and articulation, hand and body gestures, gaze direction, general behavior, textual information, etc. Multimodal systems for the analysis of human affective states have significant advantages over unimodal methods: they allow analysis in difficult conditions, such as noise in one of the information channels (acoustic noise or lack of lighting) or the complete absence of information in one of the channels (the person is silent or not facing the camera). In addition, multimodal analysis often makes it possible to recognize ambiguous affective phenomena such as sarcasm and irony, which are characterized by a clear mismatch between the meaning of the utterance (text analysis) on the one hand and voice intonation (audio analysis) and facial expressions (video analysis) on the other. Therefore, the simultaneous analysis of several components of human behavior (speech, facial expressions, gestures, gaze direction, text transcriptions of statements) is expected to improve the quality and recognition accuracy of automatic systems for analyzing affective states in tasks such as recognizing emotions, sentiment, aggression, depression, etc.
These tasks are of great practical importance for emotional artificial intelligence (Emotional AI) technologies, as well as for psychology, medicine, banking, forensic science, cognitive sciences, etc. They are of high scientific and technical, as well as social and economic, importance.

The main goal of this RSF project is to develop and study a new intelligent computer system for multimodal analysis of human behavior that recognizes manifested affective states from a person's audio, video and text data. A unique feature of the system will be multimodal analysis, i.e. simultaneous automatic analysis of the user's speech and image, as well as the meaning of their statements, in order to determine various psychoemotional (affective) states, including emotions, sentiment, aggression and depression. The target audience of the automated system being developed will include not only the Russian-speaking population but also other representative groups, regardless of gender, age, race and language. Thus, this study is relevant and large-scale within both Russian and world science.

The main objectives of this project are the development, theoretical and experimental research of infoware, mathematical and software support for the intelligent system of multimodal analysis of affective behavior of people.

To achieve the main goal of the project, the tasks are organized into 3 sequential stages of work:

  1. development of infoware and mathematical support for the intelligent system of multimodal analysis of affective states (2022);
  2. development and research of mathematical and software support for the intelligent system of multimodal analysis of affective states (2023);
  3. experimental study and evaluation of the intelligent system for multimodal analysis of affective states, development of a system demonstrator and generalization of results (2024).

Results for 2022

In 2022, the 1st stage of the project, covering the development and research of the mathematical, information and linguistic support of an intelligent system for multimodal recognition of human affective states, was completed.

An analytical review of modern scientific and technical literature on multimodal modeling of audio-visual signals for the analysis of affective states was conducted. It shows that neural network methods are gradually replacing traditional ones, as they achieve higher accuracy in recognizing affective states and process large amounts of data quickly. Existing multimodal corpora were also analyzed, and several were obtained for the analysis of affective states: for emotions (AFEW, AFEW-VA, AffWild2, SEWA, AffectNet); for emotions and sentiment (CMU-MOSEI, MELD); for aggression (parts of the TR and SD corpora); for depression (DAIC).

We collected and annotated the Audio-Visual Aggressive Behavior in Online Streams dataset (AVABOS). This corpus consists of video files obtained from open sources on the Internet that show individual and group aggressive behavior of Russian-speaking users during live video broadcasts. The database is designed for automatic audiovisual analysis of aggressive behavior and is officially registered with Rospatent, certificate No. 2022623239 dated 12.05.2022.

Novel mathematical tools (models, methods and algorithms) have been developed, and existing ones improved, to extract informative features from audio, video and text modalities in order to automatically predict various affective states:

  • for the task of recognizing emotions, aggression and depression by audio modality, a method based on the use of expert and neural network features (openSMILE, openXBOW, AuDeep, DeepSpectrum, PANN, Wav2Vec) has been developed. The influence of data annotation quality on the effectiveness of emotion recognition methods on the RAMAS corpus was analyzed. Quantitative assessment of the developed methods for recognizing emotions, aggression and depression was carried out on the following corpora: RAMAS, TR & SD and DAIC, respectively. For audio modality, a method of audio data augmentation has been developed based on the modification of spectrogram images: rotation, rescaling, shift in width and height, brightness change, horizontal reflection, stretching, compression, and SpecAugment;
  • for the task of recognizing emotions and sentiment by text modality on the RAMAS and CMU-MOSEI corpora, methods for preprocessing orthographic text transcriptions (tokenization, punctuation removal, lowercasing, lemmatization for Russian-language data and stemming for English-language data), as well as the Word2Vec neural network vectorization method, have been developed and studied. The advantages of the proposed methods are the preservation of the syntactic and semantic information of the text after vectorization, the small size of the vector space and the possibility of using different models for Russian and English. For text data, an augmentation approach has been developed that combines several text modification methods: deletion, permutation and replacement of words, permutation of sentences, back-translation, generative models and domain augmentation;
  • for the task of recognizing emotions by video modality, a method based on the ResNet-50 convolutional neural network was developed. This method was trained on the AffectNet corpus and is capable of extracting face textural features of different dimensions, which can be fed to both classical deterministic and neural network methods of machine learning. The developed method is also effective for the tasks of recognizing other affective states, in particular, aggression and depression. For video modality, a video data augmentation method has been developed. This method is based on the use of image modification methods: Mixup, affine transformations, contrast adjustment and class weighting.
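
Two of the augmentation techniques named in the list above, SpecAugment-style masking for spectrograms and Mixup for images, can be sketched as follows. This is a minimal pure-Python illustration, not the project's implementation: the function names, the list-based data layout and all parameter values are assumptions made for the example, and only the masking part of SpecAugment is shown (time warping is omitted).

```python
import random


def mask_spectrogram(spec, max_freq_width, max_time_width, rng):
    """Zero one random frequency band and one random time band.

    spec is a list of rows (frequency bins), each a list of time frames.
    A masked copy is returned; the input is left unchanged.
    """
    n_freq, n_time = len(spec), len(spec[0])
    out = [row[:] for row in spec]

    # Frequency mask: blank a band of consecutive frequency bins.
    f_width = rng.randint(0, min(max_freq_width, n_freq))
    f_start = rng.randint(0, n_freq - f_width)
    for f in range(f_start, f_start + f_width):
        out[f] = [0.0] * n_time

    # Time mask: blank a band of consecutive time frames in every bin.
    t_width = rng.randint(0, min(max_time_width, n_time))
    t_start = rng.randint(0, n_time - t_width)
    for row in out:
        for t in range(t_start, t_start + t_width):
            row[t] = 0.0
    return out


def mixup(x_a, y_a, x_b, y_b, alpha, rng):
    """Blend two flattened samples and their one-hot labels.

    The mixing coefficient lam is drawn from Beta(alpha, alpha).
    """
    lam = rng.betavariate(alpha, alpha)
    x = [lam * a + (1.0 - lam) * b for a, b in zip(x_a, x_b)]
    y = [lam * a + (1.0 - lam) * b for a, b in zip(y_a, y_b)]
    return x, y, lam


rng = random.Random(0)  # fixed seed so the sketch is reproducible
toy_spec = [[1.0] * 10 for _ in range(8)]  # 8 frequency bins, 10 frames
masked = mask_spectrogram(toy_spec, max_freq_width=3, max_time_width=4, rng=rng)
mixed_x, mixed_y, lam = mixup([0.0, 1.0], [1.0, 0.0], [1.0, 0.0], [0.0, 1.0], 0.4, rng)
```

In practice such transformations would be applied on the fly during training, so that each epoch sees differently masked or mixed samples.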

It is well known that, when communicating, people express affective states (emotions, sentiment, aggression, depression) both verbally and non-verbally. The speaker conveys the semantic content of an utterance through verbal information, which is indicative of sentiment or the polarity of emotion. The intensity of emotions, in turn, is reflected in non-verbal manifestations and is conveyed better by the audio modality than by the video modality. For analyzing manifestations of depression, visual and acoustic information is more effective: linguistic information cannot convey the full state of an affective disorder or the manifestation of emotions and can only show the negative polarity of the statement. Multimodal recognition of a person's affective states makes it possible to analyze the speaker's verbal and non-verbal information simultaneously and to obtain reliable information about the speaker's psychological state.
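
One simple way to combine modalities, and to keep working when one channel is absent (for example, the person is silent or not facing the camera), is weighted late fusion of per-modality class probabilities. The sketch below is an illustration under stated assumptions, not the fusion method used in the project: the function name, the modality weights and the probability values are all hypothetical.

```python
def fuse_predictions(modality_probs, weights):
    """Weighted average of per-modality class probabilities.

    modality_probs maps a modality name to a list of class probabilities,
    or to None when that modality is missing. Weights of the available
    modalities are renormalized so that they sum to 1.
    """
    available = {m: p for m, p in modality_probs.items() if p is not None}
    if not available:
        raise ValueError("at least one modality must be available")

    total_weight = sum(weights[m] for m in available)
    n_classes = len(next(iter(available.values())))
    fused = [0.0] * n_classes
    for modality, probs in available.items():
        w = weights[modality] / total_weight
        for i, p in enumerate(probs):
            fused[i] += w * p
    return fused


# Hypothetical weights and two-class outputs (e.g. negative / positive).
weights = {"audio": 0.4, "video": 0.4, "text": 0.2}
fused = fuse_predictions(
    {"audio": [0.7, 0.3], "video": None, "text": [0.1, 0.9]},  # video missing
    weights,
)
```

Different weights could be chosen per task, for example giving more weight to text for sentiment polarity and to audio for emotion intensity, in line with the observations above.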

Based on the theoretical and experimental research conducted in 2022, a series of 5 scientific papers on current results was prepared and published in journals and proceedings indexed in the international citation systems Scopus, Web of Science and RSCI. These include the Russian journals “Informatics and Automation” (Scopus and RSCI) and “Proceedings of VSU. Series: Systems Analysis and Information Technologies” (RSCI), the Proceedings of the 24th International Conference “Speech and Computer” SPECOM (India; a top-level conference according to the international portal Research.com; 2 papers published in the Springer Lecture Notes in Computer Science series), and the Proceedings of the 24th International Congress on Acoustics ICA (Korea; this prestigious congress takes place every 3 years). In addition, an invited paper was presented at the 4th International Conference on Language Engineering and Applied Linguistics “Piotrowski’s Readings 2022” (St. Petersburg).

Internet resources and publications prepared for the project:

  1. Reportage: Automated call-centers: the way from IVR to "lie detector" // Delovoy Petersburg.
  2. Velichko A. A speech signal analysis method for automatic aggression detection in colloquial speech // Proceedings of VSU, series: Systems Analysis and Information Technologies. 2022. No. 4. pp. 1-9.
  3. Dvoynikova A., Markitantov M., Ryumina E., Uzdiaev M., Velichko A., Kagirov I., Kipyatkova I., Lyakso E., Karpov A. An analysis of automatic techniques for recognizing human's affective states by speech and multimodal data // In Proc. of the 24th International Congress on Acoustics ICA-2022. 2022. pp. 22-33.
  4. Dvoynikova A., Markitantov M., Ryumina E., Uzdiaev M., Velichko A., Ryumin D., Lyakso E., Karpov A. Analysis of infoware and software for human affective states recognition // Informatics and Automation. 2022. Vol. 21. No. 6. pp. 1097-1144.
  5. Mamontov D., Minker W., Karpov A. Self-Configuring Genetic Programming Feature Generation in Affect Recognition Tasks // In Proc. of International Conference on Speech and Computer (SPECOM). 2022. pp. 464-476.
  6. Ryumina E., Ivanko D. Emotional Speech Recognition Based on Lip-Reading // In Proc. of International Conference on Speech and Computer (SPECOM). 2022. pp. 616-625.
  7. Audio-Visual Aggressive Behavior in Online Streams corpus (AVABOS)
Project's head
N 22-11-00321
Russian Science Foundation