Speech and Multimodal Interfaces Laboratory

Intelligent system for multimodal recognition of human's affective states

This interdisciplinary project of the Russian Science Foundation addresses the problems of multimodal analysis and recognition of people's affective states from their behavior using modern methods of digital signal processing and deep learning. Affective computing is highly relevant and significant from scientific, technical and practical points of view. Many problems in this area remain unsolved, while the practical application of systems that recognize human affective states solely from unimodal data (for example, only from audio or only from video) has a number of significant limitations. The most natural way for people to interact and exchange information is multimodal communication, which involves several modalities (communication channels) at the same time, including natural speech and sounds, facial expressions and articulation, hand and body gestures, gaze direction, general behavior, textual information, etc.

Multimodal systems for the analysis of human affective states have significant advantages over unimodal methods: they can operate in difficult conditions, such as noise in one of the information channels (acoustic noise or poor lighting), and even in the complete absence of information in one of the channels (the person is silent or not facing the camera). In addition, multimodal analysis often makes it possible to recognize ambiguous affective phenomena such as sarcasm and irony, which are characterized by a clear mismatch between the meaning of the utterance (text analysis), the voice intonation (audio analysis) and the facial expressions (video analysis). Therefore, the simultaneous analysis of several components of human behavior (speech, facial expressions, gestures, gaze direction, text transcriptions of utterances) improves the accuracy and robustness of automatic systems for analyzing affective states in tasks such as recognizing emotions, sentiment, aggression, depression, etc. All these tasks are of great practical importance in the field of emotional artificial intelligence (Emotional AI) technologies, as well as in psychology, medicine, banking, forensic science, cognitive sciences, etc., and are of high scientific, technical, social and economic significance.

The main goal of this RSF project is to develop and study a new intelligent computer system for multimodal analysis of human behavior in order to recognize manifested affective states based on a person's audio, video and text data. A unique feature of the system will be multimodal analysis, i.e. simultaneous automatic analysis of the user's speech and image, as well as the meaning of their utterances, for the purpose of determining various psychoemotional (affective) states, including emotions, sentiment, aggression and depression. At the same time, the target audience of the automated system being developed will include not only the Russian-speaking population, but also other representative groups regardless of gender, age, race and language. Thus, this study is relevant and large-scale within both Russian and international science.

The main objectives of this project are the development and the theoretical and experimental study of the information (infoware), mathematical and software support for the intelligent system of multimodal analysis of people's affective behavior.

To achieve the main goal of the project, the following tasks must be solved, grouped into three sequential stages of work:

  1. development of infoware and mathematical support for the intelligent system of multimodal analysis of affective states (2022);
  2. development and research of mathematical and software support for the intelligent system of multimodal analysis of affective states (2023);
  3. experimental study and evaluation of the intelligent system for multimodal analysis of affective states, development of a system demonstrator and generalization of results (2024).

Results for 2023

In 2023, the 2nd stage of the project was completed. It covered the development and research of the mathematical support and software for processing unimodal data (audio, video, text), as well as the creation of bimodal models (audio+video and audio+text) for the intelligent system of human affective state analysis.

Classification and regression methods for analyzing individual affective states from unimodal data were improved: binary classification of aggression (absence or presence of the state) from audio data; classification of sentiment into three classes (negative, neutral, positive) and two classes (negative, positive) from text data; binary classification of aggression (absence or presence) and classification of emotion (anger, sadness, fear, disgust, happiness, neutral state) from video data. Experimental studies on automatic recognition of aggression (on the AVABOS corpus), sentiment (CMU-MOSEI) and emotion (CREMA-D) were conducted to select the most effective neural network features, as well as models with recurrent, fully connected and attention mechanism (AM) layers for their modeling and analysis.

A hierarchical method was proposed for the binary classification of lying (false or true information) and depression (presence or absence of signs of the disorder) and the three-level classification of aggression (low, medium or high) from audio data. Its design exploits the correlation between the considered paralinguistic phenomena: the outputs of the lie and aggression classifiers serve as the input data of the depression detection method. A method for the integral assessment of the degree of expression of destructive phenomena in speech was also proposed. Experimental studies on automatic recognition of lying (on the DSD corpus), depression (DAIC) and aggression (SD&TR) were carried out.
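To illustrate the hierarchical idea, the sketch below trains lower-level lie and aggression classifiers on acoustic features and appends their posterior probabilities to the input of a depression classifier. The synthetic data, feature dimensionality and the use of scikit-learn logistic regression are illustrative assumptions only, not the project's actual models or features.

```python
# Minimal sketch of a hierarchical audio-based pipeline (assumption: simple
# logistic-regression classifiers over utterance-level acoustic functionals).
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X_audio = rng.normal(size=(200, 88))     # e.g. 88 acoustic functionals per utterance
y_lie = rng.integers(0, 2, 200)          # false / true information
y_aggr = rng.integers(0, 3, 200)         # low / medium / high aggression
y_depr = rng.integers(0, 2, 200)         # absence / presence of depression signs

lie_clf = LogisticRegression(max_iter=1000).fit(X_audio, y_lie)
aggr_clf = LogisticRegression(max_iter=1000).fit(X_audio, y_aggr)

# Hierarchical step: lower-level posteriors become extra input features
# for the depression classifier.
X_hier = np.hstack([X_audio,
                    lie_clf.predict_proba(X_audio),
                    aggr_clf.predict_proba(X_audio)])
depr_clf = LogisticRegression(max_iter=1000).fit(X_hier, y_depr)
print(depr_clf.predict_proba(X_hier[:3]))
```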

Multi-task methods for the classification of emotion (surprise, anger, sadness, fear, disgust, happiness) and sentiment (negative, neutral, positive) from unimodal data (audio, video, text) were proposed. Experimental studies on multi-task emotion and sentiment recognition (on the RAMAS and CMU-MOSEI corpora), with training on unimodal/multimodal data, were conducted (a schematic sketch of the shared multi-task setup is given after the list), including:

  • using audio data, we compared the performance of transformer models for extracting acoustic features, which were then processed by a GRU-based model. The EW2V neural network model outperformed the other models by an average of 3.5%. The combination of AM and recurrent layers also improved recognition accuracy. The proposed method for emotion recognition outperformed the state-of-the-art (SOTA) methods on the CMU-MOSEI corpus by 3.3% (mWAcc);
  • using text data, we compared the performance of transformer models for extracting linguistic features, which were then processed by an AM-based model. The RoBERTa linguistic features outperformed the other features by an average of 2%. The best feature set was processed by two identical neural networks with AM (one for emotion and one for sentiment). The best method based on these features and the AM-based neural network outperformed the others by an average of 3%, which is attributed to the different training procedures of the original transformer models. The proposed method for emotion recognition outperformed the SOTA methods on the CMU-MOSEI corpus by 6.6% (mWAcc);
  • using video data, we compared the performance of visual features that were processed by an LSTM-based model. The EmoFFs features outperformed the others by an average of 2.4%, as they capture complex nonlinear dependencies in facial features. The proposed method for emotion recognition outperformed the SOTA methods on the CMU-MOSEI corpus by 7.2% (mWAcc).
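The sketch below shows, in minimal form, the kind of shared multi-task model discussed above: pre-extracted acoustic feature sequences pass through a GRU encoder and an attention layer, after which two separate heads predict emotions and sentiment. All dimensions, layer counts and the pooling scheme are illustrative assumptions rather than the project's actual architecture.

```python
# Minimal multi-task sketch (assumptions: PyTorch, utterance-level mean pooling).
import torch
import torch.nn as nn

class MultiTaskAudioModel(nn.Module):
    def __init__(self, feat_dim=1024, hidden=128, n_emotions=6, n_sentiments=3):
        super().__init__()
        self.encoder = nn.GRU(feat_dim, hidden, batch_first=True)
        self.attention = nn.MultiheadAttention(hidden, num_heads=4, batch_first=True)
        self.emotion_head = nn.Linear(hidden, n_emotions)      # six emotion classes
        self.sentiment_head = nn.Linear(hidden, n_sentiments)  # negative/neutral/positive

    def forward(self, x):                       # x: (batch, frames, feat_dim)
        h, _ = self.encoder(x)                  # recurrent layer over time
        h, _ = self.attention(h, h, h)          # self-attention (AM) over time steps
        pooled = h.mean(dim=1)                  # temporal average pooling
        return self.emotion_head(pooled), self.sentiment_head(pooled)

model = MultiTaskAudioModel()
emo_logits, sent_logits = model(torch.randn(2, 50, 1024))
print(emo_logits.shape, sent_logits.shape)      # (2, 6) and (2, 3)
```

In such a setup, the two heads are trained jointly with a weighted sum of their losses, which is what makes the method multi-task.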

Multi-task methods for classification of emotions (surprise, anger, sadness, fear, disgust, happiness) and sentiment (negative, neutral, positive) using bimodal data (audio+video, audio+text) were proposed. Experimental studies on multi-task multimodal emotion and sentiment recognition (on the RAMAS and CMU-MOSEI corpora) were conducted:

  • using audio and video data, the performance of different modality fusion methods was compared. The CMGSAF method, based on statistical functionals, fully connected layers and two consecutive attention layers, was proposed. CMGSAF outperformed the considered classical modality fusion methods by 2.2%. The results show that for RAMAS the video data were more effective than the audio data, whereas the opposite was true for CMU-MOSEI. CMGSAF outperformed other SOTA methods in the emotion recognition task on the RAMAS corpus by 0.7% (UAR) and on the CMU-MOSEI corpus by 18.2% (mWAcc) and 1.6% (mWF1);
  • using audio and text data, the performance of modality fusion methods was compared. An FCF method, based on the concatenation of features processed by two identical neural networks with AM (one for emotion and one for sentiment), was proposed (a minimal feature-concatenation sketch is given after the list). FCF outperformed other modality fusion methods, including AM-based fusion, by 1%. The audio data were more effective than the text data for emotion recognition, whereas the opposite was true for sentiment. The FCF method outperformed other SOTA methods on the CMU-MOSEI corpus in the emotion recognition task by 2.82% (mWAcc) and 0.7% (mWF1) and in the sentiment recognition task by 7.13% (Acc) and 6.06% (WF1).
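As a simple illustration of feature-level fusion of the kind used in FCF, the sketch below projects utterance-level audio and text representations with separate branches and concatenates them before the emotion and sentiment heads. The dimensions and layers are illustrative assumptions, not the project's implementation.

```python
# Minimal feature-concatenation fusion sketch (assumptions: PyTorch,
# utterance-level audio and text feature vectors as inputs).
import torch
import torch.nn as nn

class BimodalFusion(nn.Module):
    def __init__(self, audio_dim=1024, text_dim=768, hidden=128,
                 n_emotions=6, n_sentiments=3):
        super().__init__()
        self.audio_branch = nn.Sequential(nn.Linear(audio_dim, hidden), nn.ReLU())
        self.text_branch = nn.Sequential(nn.Linear(text_dim, hidden), nn.ReLU())
        self.emotion_head = nn.Linear(2 * hidden, n_emotions)
        self.sentiment_head = nn.Linear(2 * hidden, n_sentiments)

    def forward(self, audio_feats, text_feats):
        fused = torch.cat([self.audio_branch(audio_feats),
                           self.text_branch(text_feats)], dim=-1)  # concatenation fusion
        return self.emotion_head(fused), self.sentiment_head(fused)

model = BimodalFusion()
emo, sent = model(torch.randn(4, 1024), torch.randn(4, 768))
print(emo.shape, sent.shape)                    # (4, 6) and (4, 3)
```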

The results show that audio and video data are more effective for emotion recognition, while textual data are more informative for sentiment analysis.

Two software tools were developed and registered with Rospatent: 1) Software for Audio-Visual Emotions and Sentiment Recognition (AVESR); 2) Software for hierarchical recognition of destructive phenomena in speech (Destructive Behaviour Detection, DesBDet). Using a webcam, AVESR performs real-time recognition of emotions (surprise, anger, sadness, fear, disgust, happiness) and sentiment (negative, neutral, positive). DesBDet performs hierarchical recognition of destructive phenomena in speech (false or true information, level of aggression, and absence/presence of depression); it can record audio with a microphone or read audio files from disk. Both tools are characterized by good generalizability due to cross-corpus trained models, fast response times and high recognition accuracy.
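A real-time pipeline of the AVESR kind typically follows the pattern sketched below: frames are captured from a webcam, passed to a model, and the predicted labels are drawn on the video stream. The `predict_affect` function here is a hypothetical placeholder for the actual recognition models, and OpenCV is an assumed dependency; this is not the registered software itself.

```python
# Minimal webcam inference loop sketch (assumptions: OpenCV; predict_affect is
# a hypothetical stand-in for the real emotion/sentiment models).
import cv2

def predict_affect(frame):
    # Placeholder: a real system would run face detection, feature extraction
    # and the trained emotion/sentiment classifiers here.
    return "neutral", "neutral"

cap = cv2.VideoCapture(0)                     # default webcam
while cap.isOpened():
    ok, frame = cap.read()
    if not ok:
        break
    emotion, sentiment = predict_affect(frame)
    cv2.putText(frame, f"emotion: {emotion}  sentiment: {sentiment}",
                (10, 30), cv2.FONT_HERSHEY_SIMPLEX, 0.8, (0, 255, 0), 2)
    cv2.imshow("AVESR-style demo", frame)
    if cv2.waitKey(1) & 0xFF == ord("q"):     # press 'q' to quit
        break
cap.release()
cv2.destroyAllWindows()
```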

In 2023, a cycle of 7 scientific papers was published in journals and proceedings indexed in the international citation systems Scopus, Web of Science and RSCI, including the international journal Mathematics (Q1 WoS), the Russian journals "Information and Control Systems" (Scopus) and "Journal of Instrument Engineering" (RSCI), as well as the proceedings of the jubilee 25th International Conference on Speech and Computer SPECOM-2023 (Dharwad, India; a top-level conference according to the international portal Research.com), the 7th International Scientific Conference "Intelligent Information Technologies for Industry" IITI-2023 (St. Petersburg; including an invited talk by A. Karpov), the 29th International Conference on Computational Linguistics and Intellectual Technologies DIALOGUE-2023 (Moscow), and the 5th International Conference on Photogrammetric techniques for environmental and infraStructure monitoring, Biometry and Biomedicine PSBB-2023 (Moscow).

Addresses of Internet resources prepared for the Project:

  1. Ryumina E., Markitantov M., Karpov A. Multi-Corpus Learning for Audio-Visual Emotions and Sentiment Recognition // Mathematics, 2023, vol. 11(16), ID 3519.
  2. Ryumina E., Karpov A. Impact of visual modalities in multimodal personality and affective computing // The International Archives of the Photogrammetry, Remote Sensing and Spatial Information Sciences, 2023, vol. XLVIII-2/W3-2023, pp. 217–224.
  3. Karpov A., Dvoynikova A., Ryumina E. Intelligent Interfaces and Systems for Human-Computer Interaction // In Proc. of 7th International Conference «Intelligent Information Technologies for Industry» IITI 2023, St. Petersburg, Springer, Lecture Notes in Networks and Systems LNNS, 2023, vol. 776, pp. 3-13.
  4. Dvoynikova A., Karpov A. Bimodal sentiment and emotion classification with multi-head attention fusion of acoustic and linguistic information // Computational Linguistics and Intellectual Technologies. In Proc. of 29th International Conference DIALOGUE-2023, Moscow, 2023, pp. 51-61.
  5. Velichko A., Karpov A. An approach and software system for integral analysis of destructive paralinguistic phenomena in colloquial speech // Information and Control Systems, 2023, No. 4, pp. 2-11.
  6. Ivanko D., Ryumina E., Ryumin D., Axyonov A., Kashevnik A., Karpov A. EMO-AVSR: Two-Level Approach for Audio-Visual Emotional Speech Recognition // In Proc. of SPECOM 2023, 2023, Springer LNCS vol. 14338, pp. 1-14.
  7. Dvoynikova A., Kondratenko K. Approach to automatic recognition of emotions in speech transcriptions // Journal of Instrument Engineering. 2023, Vol. 66, No. 10, pp. 818-827.
  8. Software for Audio-Visual Emotions and Sentiment Recognition (AVESR), authors: Markitantov M., Ryumina E., Karpov A.
  9. Software for Destructive Behaviour Detection (DesBDet), authors: Velichko A., Karpov A.

Results for 2022

In 2022, the 1st stage of the project was completed. It covered the development and research of the mathematical, information and linguistic support of the intelligent system for multimodal recognition of human affective states.

An analytical review of modern scientific and technical literature on multimodal modeling of audio-visual signals for the analysis of affective states was conducted. It shows that neural network methods are gradually replacing traditional ones, achieving higher accuracy in recognizing affective states and processing large amounts of data faster. Existing multimodal corpora were analyzed, and several of them were obtained for the analysis of affective states: for emotions (AFEW, AFEW-VA, AffWild2, SEWA, AffectNet); for emotions and sentiment (CMU-MOSEI, MELD); for aggression (parts of the TR and SD corpora); for depression (DAIC).

We collected and annotated the Audio-Visual Aggressive Behavior in Online Streams (AVABOS) corpus. It contains video files obtained from open sources on the Internet that capture individual and group aggressive behavior of Russian-speaking users during live video broadcasts. The corpus is designed for automatic audio-visual analysis of aggressive behavior and is officially registered with Rospatent (certificate No. 2022623239 dated 12.05.2022).

Novel mathematical tools (models, methods and algorithms) have been developed, and existing ones improved, to extract informative features from the audio, video and text modalities in order to automatically predict various affective states:

  • for the task of recognizing emotions, aggression and depression from the audio modality, a method based on expert and neural network features (openSMILE, openXBOW, AuDeep, DeepSpectrum, PANN, Wav2Vec) has been developed. The influence of data annotation quality on the effectiveness of emotion recognition methods was analyzed on the RAMAS corpus. Quantitative assessment of the developed methods for recognizing emotions, aggression and depression was carried out on the RAMAS, TR & SD and DAIC corpora, respectively. For the audio modality, a data augmentation method has been developed based on the modification of spectrogram images: rotation, rescaling, shifts in width and height, brightness change, horizontal reflection, stretching, compression, and SpecAugment;
  • for the task of recognizing emotions and sentiment from the text modality on the RAMAS and CMU-MOSEI corpora, methods for preprocessing orthographic text transcriptions (tokenization, punctuation removal, lowercasing, lemmatization for Russian-language data and stemming for English-language data), as well as the neural network vectorization method Word2Vec, have been developed and studied (a minimal sketch of such a pipeline is given after the list). The advantages of the proposed methods are the preservation of the syntactic and semantic information of the text after vectorization, the small size of the vector space, and the possibility of using different models for Russian and English. For text data, an augmentation approach has been developed that combines several modification methods: deletion, permutation and replacement of words, permutation of sentences, back-translation, generative models and domain augmentation;
  • for the task of recognizing emotions from the video modality, a method based on the ResNet-50 convolutional neural network was developed. The method was trained on the AffectNet corpus and extracts facial texture features of different dimensions, which can be fed to both classical deterministic and neural network machine learning methods. The developed method is also effective for recognizing other affective states, in particular aggression and depression. For the video modality, a data augmentation method has been developed based on image modification techniques: Mixup, affine transformations, contrast adjustment and class weighting.
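A minimal sketch of the text branch described above follows: transcriptions are tokenized, stripped of punctuation and lowercased, then vectorized with Word2Vec and averaged into utterance-level vectors. The toy sentences, vector size and the use of the gensim library are illustrative assumptions; language-specific lemmatization/stemming is omitted.

```python
# Minimal text preprocessing + Word2Vec vectorization sketch (assumption: gensim).
import re
import numpy as np
from gensim.models import Word2Vec

transcripts = [
    "I am really happy with this result!",
    "This is terrible, I hate it.",
    "Well, it is okay I suppose.",
]

def preprocess(text):
    text = text.lower()                      # lowercasing
    text = re.sub(r"[^\w\s]", " ", text)     # punctuation removal
    return text.split()                      # whitespace tokenization

tokens = [preprocess(t) for t in transcripts]

# Train a small Word2Vec model and average word vectors per utterance.
w2v = Word2Vec(sentences=tokens, vector_size=50, window=3, min_count=1, epochs=50)
utterance_vectors = np.stack(
    [np.mean([w2v.wv[w] for w in sent], axis=0) for sent in tokens]
)
print(utterance_vectors.shape)               # (3, 50)
```

The resulting utterance-level vectors can then be fed to the emotion and sentiment classifiers.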

It is well known that, when communicating, people use both verbal and non-verbal manifestations of affective states (emotions, sentiment, aggression, depression). The speaker expresses the semantic content of a communicative utterance through verbal information, which is representative of sentiment and of the polarity of emotion. The intensity of emotions, in contrast, is reflected in non-verbal manifestations and is better conveyed through the audio modality than the video one. To analyze the manifestation of depression, it is more efficient to use visual and acoustic information; linguistic information in this case cannot convey the full state of the affective disorder and the manifestation of emotions, but can only show the negative polarity of the utterance. Multimodal recognition of human affective states makes it possible to analyze the speaker's verbal and non-verbal information simultaneously and to obtain reliable information about the communicant's psychological state.

Based on the results of theoretical and experimental research conducted in 2022, a series of 5 scientific papers on the current results was prepared and published in journals and proceedings indexed in the international citation systems Scopus, Web of Science and RSCI, including the Russian journals “Informatics and Automation” (Scopus and RSCI) and “Proceedings of VSU. Series: Systems Analysis and Information Technologies” (RSCI), as well as the proceedings of the 24th International Conference “Speech and Computer” SPECOM (India; a top-level conference according to the international portal Research.com; 2 papers published in the Springer Lecture Notes in Computer Science series) and the 24th International Congress on Acoustics ICA (Korea; this prestigious congress takes place every 3 years). In addition, an invited paper was presented at the 4th International Conference on Language Engineering and Applied Linguistics “Piotrowski’s Readings 2022” (St. Petersburg).

Addresses of Internet resources prepared for the Project:

  1. Media report "Automated call centers: the way from IVR to 'lie detector'", Delovoy Petersburg.
  2. Velichko A. A speech signal analysis method for automatic aggression detection in colloquial speech // Proceedings of VSU, Series: Systems Analysis and Information Technologies, 2022, No. 4, pp. 1-9.
  3. Dvoynikova A., Markitantov M., Ryumina E., Uzdiaev M., Velichko A., Kagirov I., Kipyatkova I., Lyakso E., Karpov A. An analysis of automatic techniques for recognizing human's affective states by speech and multimodal data // In Proc. of the 24th International Congress on Acoustics ICA-2022, 2022, pp. 22-33.
  4. Dvoynikova A., Markitantov M., Ryumina E., Uzdiaev M., Velichko A., Ryumin D., Lyakso E., Karpov A. Analysis of infoware and software for human affective states recognition // Informatics and Automation, 2022, Vol. 21, No. 6, pp. 1097-1144.
  5. Mamontov D., Minker W., Karpov A. Self-Configuring Genetic Programming Feature Generation in Affect Recognition Tasks // In Proc. of International Conference on Speech and Computer (SPECOM), 2022, pp. 464-476.
  6. Ryumina E., Ivanko D. Emotional Speech Recognition Based on Lip-Reading // In Proc. of International Conference on Speech and Computer (SPECOM), 2022, pp. 616-625.
  7. Audio-Visual Aggressive Behavior in Online Streams corpus (AVABOS)
Project's head
Number: No. 22-11-00321
Period: 2022-2024
Financing: Russian Science Foundation