Speech and Multimodal Interfaces Laboratory

Intelligent system for multimodal recognition of human's affective states

This interdisciplinary project of the Russian Science Foundation is aimed at solving the problems of multimodal analysis and recognition of people's affective states from their behavior using modern methods of digital signal processing and deep machine learning. Affective computing is highly relevant and significant from scientific, technical, and practical points of view. Many problems in this area remain unsolved, while the practical application of systems that recognize human affective states from unimodal data alone (for example, only audio or only video) has a number of significant limitations. The most natural way for a person to interact and exchange information is multimodal communication, which involves several modalities (communication channels) at the same time, including natural speech and sounds, facial expressions and articulation, hand and body gestures, gaze direction, general behavior, textual information, etc. Multimodal systems for the analysis of human affective states have significant advantages over unimodal methods: they allow analysis in difficult conditions, such as noise in one of the information channels (acoustic noise or lack of lighting), as well as in the complete absence of information in one of the channels (the person is silent or not facing the camera). In addition, multimodal analysis often makes it possible to recognize such ambiguous affective phenomena as sarcasm and irony, which are characterized by a clear mismatch between the meaning of the utterance (text analysis), voice intonation (audio analysis), and facial expressions (video analysis). Therefore, the simultaneous analysis of several components of human behavior (speech, facial expressions, gestures, gaze direction, text transcriptions of utterances) will improve the performance and recognition accuracy of automatic systems for analyzing affective states in tasks such as recognizing emotions, sentiment, aggression, depression, etc. All these tasks are of great practical importance in the field of emotional artificial intelligence (Emotional AI) technologies, as well as in psychology, medicine, banking, forensic science, cognitive sciences, etc., and they are of high scientific, technical, social, and economic importance.

The main goal of this RSF project is to develop and study a new intelligent computer system for multimodal analysis of human behavior in order to recognize manifested affective states based on audio, video, and text data. A unique feature of the system is its multimodal analysis, i.e. the simultaneous automatic analysis of the user's speech and image, as well as the meaning of their utterances, in order to determine various psychoemotional (affective) states, including emotions, sentiment, aggression, and depression. At the same time, the target audience of the automated system being developed includes not only the Russian-speaking population, but also other representative groups regardless of gender, age, race, and language. Thus, this study is relevant and large-scale within the framework of both Russian and world science.

The main objectives of this project are the development and the theoretical and experimental study of the infoware, mathematical support, and software for the intelligent system of multimodal analysis of people's affective behavior.

To achieve the main goal of the project, the following tasks must be solved, grouped into three sequential stages of work:

  1. development of infoware and mathematical support for the intelligent system of multimodal analysis of affective states (2022);
  2. development and research of mathematical and software support for the intelligent system of multimodal analysis of affective states (2023);
  3. experimental study and evaluation of the intelligent system for multimodal analysis of affective states, development of a system demonstrator and generalization of results (2024).

Results for 2024

In 2024, the final stage of the project was completed. It was devoted to the development and research of mathematical support and software for processing multimodal data, as well as the creation of a multimodal intelligent system for analyzing human affective states:

  1. A method for verbal and physical aggression recognition based on masked self-attention was developed. By forming a special mask, the method excludes the feature vectors of missing modalities from processing. It takes into account the peculiarities of combining different modalities and correctly handles situations with missing modalities, which allows the system to respond flexibly to conditions encountered in real-world affective state analysis tasks (a minimal sketch of the masking idea is given after this list).
  2. A depression recognition method based on three types of features was developed: acoustic (DenseNet), visual (OpenFace), and textual (Word2Vec). Deterministic classifiers such as CatBoost are used for each modality, and the final decision is made by voting (a sketch of this voting scheme also follows the list).
  3. A multi-task emotion and sentiment recognition method based on a triple fusion strategy was developed; it processes high-level features (wav2vec2, EmoAffectNet, RoBERTa) of all modalities. Emotion and sentiment are modeled using transformer layers. The method allowed us to solve the emotion and sentiment recognition tasks simultaneously, making efficient use of computational resources and improving the generalization ability of the model.
  4. A method for hierarchical recognition of emotion, sentiment, and depression based on a two-level approach was developed. At the first level, emotions and sentiment are recognized; their predictions are then passed as features to the second level, where binary depression recognition is performed. In this hierarchical method, emotion and sentiment are treated as factors influencing depression recognition, since depression is often associated with persistent negative emotionality and reduced positive reactions.
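
The following is a minimal sketch (in PyTorch) of the masking idea from item 1: feature vectors of missing modalities are excluded from self-attention via a key padding mask. The module, dimensions, and names are illustrative assumptions and do not reproduce the project's actual implementation.

```python
# Minimal sketch of fusing per-modality feature vectors with masked self-attention
# so that missing modalities are excluded from processing. Names and sizes are illustrative.
import torch
import torch.nn as nn

class MaskedModalityFusion(nn.Module):
    def __init__(self, dim: int = 256, n_heads: int = 4, n_classes: int = 2):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, n_heads, batch_first=True)
        self.classifier = nn.Linear(dim, n_classes)

    def forward(self, modality_feats: torch.Tensor, present: torch.Tensor):
        # modality_feats: (batch, n_modalities, dim) - one feature vector per modality
        # present: (batch, n_modalities), True where the modality is available
        key_padding_mask = ~present            # masked (True) keys are ignored by attention
        fused, _ = self.attn(modality_feats, modality_feats, modality_feats,
                             key_padding_mask=key_padding_mask)
        # pool only over the modalities that are actually present
        weights = present.unsqueeze(-1).float()
        pooled = (fused * weights).sum(dim=1) / weights.sum(dim=1).clamp(min=1.0)
        return self.classifier(pooled)

# Example: 2 samples, 3 modalities (audio, video, text); text is missing in the second sample
feats = torch.randn(2, 3, 256)
present = torch.tensor([[True, True, True], [True, True, False]])
logits = MaskedModalityFusion()(feats, present)   # shape: (2, 2)
```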
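
The per-modality classification and voting scheme from item 2 can be sketched as follows; the sketch assumes that the DenseNet, OpenFace, and Word2Vec features have already been extracted, and all function names, labels, and hyperparameters are illustrative rather than the project's actual code.

```python
# Minimal sketch of per-modality CatBoost classifiers combined by majority voting.
import numpy as np
from catboost import CatBoostClassifier

def train_modality_models(features: dict, labels: np.ndarray) -> dict:
    """Train one CatBoost classifier per modality (keys: 'audio', 'video', 'text')."""
    models = {}
    for modality, X in features.items():
        clf = CatBoostClassifier(iterations=300, depth=4, verbose=False)
        clf.fit(X, labels)
        models[modality] = clf
    return models

def predict_by_voting(models: dict, features: dict) -> np.ndarray:
    """Majority vote over per-modality binary predictions (depression / no depression)."""
    votes = np.stack([models[m].predict(features[m]).astype(int).ravel()
                      for m in models], axis=0)         # (n_modalities, n_samples)
    return (votes.mean(axis=0) >= 0.5).astype(int)      # simple majority decision
```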

Experimental studies on the classification of emotion and sentiment (on the RAMAS, MELD, CMU-MOSEI corpora), aggression (AVABOS), and depression (CMDC, MENHIR, DAIC) have been carried out:

  1. For aggression recognition, the study of different modality combinations showed the importance of choosing the initial modality for a specific type of aggression (text for verbal, video for physical). Adding a further modality increased the recognition accuracy of physical and verbal aggression, provided that the initial modality for the corresponding aggression type already yields high recognition results.
  2. For depression recognition, the focus was on finding the optimal parameters: window size, feature types, and classifiers. Experiments with video data showed that OpenFace features combined with decision trees are the most effective representation of the data. In the experiments with text data, the CatBoost method with Word2Vec features performed best, as it provided balanced recognition results. The final result of combining modalities confirms the effectiveness of the approach and its comparability with international studies, showing balanced classification results across classes.
  3. For multi-task emotion and sentiment recognition, the performance of temporal models (Transformer, Mamba, and xLSTM) was compared, along with four strategies for combining multimodal data. The best average recognition accuracy was shown by the triple fusion strategy (TFS), which uses all three modalities equally.
  4. For hierarchical recognition of emotion, sentiment, and depression, experimental studies were conducted. Due to the limitations of the DAIC corpus, only audio and text data were used. By adding sentiment information, we improved depression recognition results on audio and text data relative to the baseline method that uses no emotion or sentiment information.

Comparison of the developed methods with state-of-the-art methods showed the high performance of strategies for combining multimodal data using attention mechanisms. The proposed methods demonstrated competitive or superior accuracy in affective state recognition. The experiments confirmed that the joint analysis of acoustic, visual, and textual features models the nature of affective states more deeply and improves recognition reliability.

We developed the “Intelligent system for Multimodal Affective States Analysis (MASAI)” and registered it with Rospatent. It is a multimodal, multi-task system for emotion and sentiment recognition implemented as a web application. The MASAI system works with multimedia files that can be uploaded from a local computer or recorded with a webcam and microphone, and it is hosted on the Hugging Face platform.
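
For illustration only, a minimal sketch of how such a web demo can be exposed on the Hugging Face platform is given below, assuming a recent version of the Gradio library; the prediction function is a placeholder and the interface does not reproduce MASAI itself.

```python
# Minimal Gradio sketch of a web demo accepting an uploaded or webcam-recorded video.
# The analyze() function is a stand-in; MASAI's actual models are not shown here.
import gradio as gr

def analyze(video_path: str):
    # In the real system, audio, video and text features would be extracted here
    # and passed to the multi-task emotion/sentiment models.
    return ({"neutral": 0.6, "happiness": 0.3, "sadness": 0.1},
            {"neutral": 0.7, "positive": 0.3})

demo = gr.Interface(
    fn=analyze,
    inputs=gr.Video(sources=["upload", "webcam"], label="Video with speech"),
    outputs=[gr.Label(label="Emotion"), gr.Label(label="Sentiment")],
    title="Multimodal Affective States Analysis (demo sketch)",
)

if __name__ == "__main__":
    demo.launch()
```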

In 2024, our joint international team participated in the 6th Workshop and Competition on Affective Behavior Analysis in-the-wild (ABAW), held as a workshop at the CVPR 2024 conference, in particular in two challenges: emotion valence/activation estimation and compound emotion recognition. In the first challenge, we proposed an audio-visual method based on the acoustic PDEM model and the EfficientNet visual model for analyzing the human face, together with several strategies for combining acoustic and visual features. In the second challenge, our proposed AVCER method addressed compound emotion recognition. It combines an acoustic model (wav2vec2) and two visual models (a static ResNet-50 and an LSTM) to recognize basic emotions; the decision on a compound emotion is based on the pairwise sum of the weighted probability distributions of the basic emotions (a minimal numerical sketch is given below).
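
The pairwise summation rule can be illustrated with the following sketch; the set of compound pairs, the fusion weights, and the example probabilities are illustrative assumptions, not the exact AVCER configuration.

```python
# Minimal sketch of the compound-emotion decision rule: each compound emotion
# (a pair of basic emotions) is scored by summing the fused probabilities of its components.
import numpy as np

BASIC = ["anger", "disgust", "fear", "happiness", "sadness", "surprise"]
COMPOUND = [("fear", "surprise"), ("happiness", "surprise"), ("sadness", "surprise"),
            ("disgust", "surprise"), ("anger", "surprise"), ("sadness", "fear"),
            ("sadness", "anger")]

def compound_scores(p_audio: np.ndarray, p_video: np.ndarray,
                    w_audio: float = 0.4, w_video: float = 0.6) -> dict:
    """Fuse per-model basic-emotion probabilities, then score each compound pair."""
    p = w_audio * p_audio + w_video * p_video          # weighted probability distribution
    idx = {e: i for i, e in enumerate(BASIC)}
    return {f"{a}+{b}": float(p[idx[a]] + p[idx[b]]) for a, b in COMPOUND}

# Example: basic-emotion probabilities from the acoustic and visual models for one segment
p_a = np.array([0.05, 0.05, 0.30, 0.05, 0.15, 0.40])
p_v = np.array([0.05, 0.05, 0.25, 0.05, 0.10, 0.50])
best = max(compound_scores(p_a, p_v).items(), key=lambda kv: kv[1])  # -> ('fear+surprise', ...)
```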

In 2024, a cycle of 6 scientific papers was published in journals and proceedings indexed in the international citation systems Scopus, Web of Science, and RSCI, including the Russian journals "Informatics and Automation" (Scopus), "Scientific and Technical Journal of Information Technologies, Mechanics and Optics" (Scopus), and "Information and Control Systems" (Scopus), as well as the proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW 2024) and the 26th International Conference on Speech and Computer (SPECOM-2024).

Addresses of Internet resources prepared for the Project:

  1. Velichko A., Karpov A. An approach for depression recognition by speech using a semi-automatic data annotation // Information and Control Systems, no. 4, pp. 2-11.
  2. Dvoynikova A., Kagirov I., Karpov A. A Method for Recognition of Sentiment and Emotions in Russian Speech Transcripts Using Machine Translation // Informatics and Automation, no. 23(4), pp. 1173-1198.
  3. Uzdiaev M., Karpov A. Creation and analysis of multimodal corpus for aggressive behavior recognition // Scientific and Technical Journal of Information Technologies, Mechanics and Optics, 2024, vol. 24, no. 5, pp. 834-842.
  4. Dresvyanskiy D., Markitantov M., Yu J., Kaya H., Karpov A. Multi-modal Arousal and Valence Estimation under Noisy Conditions // In Proc. of the IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), 2024, pp. 4773-4783.
  5. Ryumina E., Markitantov M., Ryumin D., Kaya H., Karpov A. Zero-Shot Audio-Visual Compound Expression Recognition Method based on Emotion Probability Fusion // In Proc. of the IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), 2024, pp. 4752-4760.
  6. Mamontov D., Zepf S., Karpov A., Minker W. Cross-Cultural Automatic Depression Detection Based on Audio Signals // In Proc. of International Conference on Speech and Computer (SPECOM), 2025, pp. 309-323.
  7. Software «Intelligent system for Multimodal Affective States Analysis (MASAI)», authors: Ryumin D., Markitantov M., Ryumina E., Dvoynikova A., Karpov A.
  8. Web-page of developed software «Intelligent system for Multimodal Affective States Analysis (MASAI)»

 

Results for 2023

In 2023, the 2nd stage of the project was completed. It was devoted to the development and research of mathematical support and software for processing unimodal data (audio, video, text), as well as the creation of bimodal models (audio+video and audio+text) for an intelligent system for analyzing human affective states.

Classification and regression methods for analyzing individual affective states using unimodal data were improved for: binary classification of aggression (absence or presence of the state) using audio data; classification of sentiment into three classes (negative, neutral, positive) and two classes (negative, positive) using text data; binary classification of aggression (absence or presence) and classification of emotion (anger, sadness, fear, disgust, happiness, neutral state) using video data. Experimental studies on automatic recognition of aggression (on the AVABOS corpus), sentiment (CMU-MOSEI), and emotion (CREMA-D) were conducted to select the most effective neural network features, as well as models with recurrent, fully connected, and attention mechanism (AM) layers for modeling and analyzing them.

A hierarchical method for binary classification of lying (false or true information), aggression (low, medium, or high level), and depression (presence or absence of signs of the illness) using audio data was proposed. Its development relied on the theoretical premise of a correlation between the considered paralinguistic phenomena: the outputs of the classification methods for recognizing aggression and lying serve as input data for the method that determines depression (a minimal sketch of this pipeline is given below). A method for the integral evaluation of the degree of expression of destructive phenomena in speech was also proposed. Experimental studies on automatic recognition of lying (on the DSD corpus), depression (DAIC), and aggression (SD&TR) were carried out.
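
A minimal scikit-learn-style sketch of the hierarchical idea follows: the predictions of the lying and aggression classifiers are appended to the acoustic features that feed the depression classifier. The classifiers, feature extraction, and corpora are illustrative assumptions and stand in for the actual models.

```python
# Minimal sketch of hierarchical recognition of destructive phenomena in speech.
import numpy as np
from sklearn.linear_model import LogisticRegression

class HierarchicalDestructiveSpeech:
    def __init__(self):
        self.lie_clf = LogisticRegression(max_iter=1000)          # false / true information
        self.aggression_clf = LogisticRegression(max_iter=1000)   # low / medium / high level
        self.depression_clf = LogisticRegression(max_iter=1000)   # absence / presence

    def fit(self, X: np.ndarray, y_lie, y_aggr, y_depr):
        self.lie_clf.fit(X, y_lie)
        self.aggression_clf.fit(X, y_aggr)
        # the depression model uses the acoustic features plus the two upstream predictions
        self.depression_clf.fit(self._augment(X), y_depr)
        return self

    def _augment(self, X: np.ndarray) -> np.ndarray:
        lie_p = self.lie_clf.predict_proba(X)
        aggr_p = self.aggression_clf.predict_proba(X)
        return np.hstack([X, lie_p, aggr_p])

    def predict_depression(self, X: np.ndarray) -> np.ndarray:
        return self.depression_clf.predict(self._augment(X))
```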

Multi-task methods for the classification of emotion (surprise, anger, sadness, fear, disgust, happiness) and sentiment (negative, neutral, positive) using unimodal data (audio, video, text) were proposed. Experimental studies on multi-task emotion and sentiment recognition (on the RAMAS and CMU-MOSEI corpora), with training on unimodal/multimodal data, were conducted, including:

  • using audio data, we compared the performance of transformer models for extracting acoustic features, which were then processed by a GRU-based model. The EW2V neural network model outperformed the other models by an average of 3.5%. The combination of AM and recurrent layers also contributed positively to recognition accuracy. The proposed method for emotion recognition outperformed state-of-the-art (SOTA) methods on the CMU-MOSEI corpus by 3.3% (mWAcc);
  • using text data, we compared the performance of transformer models for extracting linguistic features, which were then processed by an AM-based model. The RoBERTa linguistic features outperformed the other features by an average of 2%. The best feature set was processed by two identical neural networks with AM (one for emotion and one for sentiment; see the sketch after this list). The best method based on these features and the AM neural network outperformed the others by an average of 3%, which is explained by the different training procedures of the original transformer models. The proposed method for emotion recognition outperformed the SOTA methods on the CMU-MOSEI corpus by 6.6% (mWAcc);
  • using video data, we compared the performance of visual features that were processed by an LSTM-based model. The EmoFFs features outperformed the others by an average of 2.4%, as they are able to capture complex nonlinear dependencies in facial features. The proposed method for emotion recognition outperformed the SOTA methods on the CMU-MOSEI corpus by 7.2% (mWAcc).
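
The general multi-task idea behind these unimodal methods can be sketched as follows (in PyTorch): one shared sequence encoder over pre-extracted features with two output heads for emotion and sentiment. The layer types and sizes are illustrative assumptions, not the exact architectures reported above.

```python
# Minimal sketch of a multi-task model: shared encoder, separate emotion and sentiment heads.
import torch
import torch.nn as nn

class MultiTaskAffect(nn.Module):
    def __init__(self, feat_dim: int = 768, hidden: int = 128):
        super().__init__()
        self.encoder = nn.GRU(feat_dim, hidden, batch_first=True, bidirectional=True)
        self.attn = nn.MultiheadAttention(2 * hidden, num_heads=4, batch_first=True)
        self.emotion_head = nn.Linear(2 * hidden, 6)    # surprise, anger, sadness, fear, disgust, happiness
        self.sentiment_head = nn.Linear(2 * hidden, 3)  # negative, neutral, positive

    def forward(self, x: torch.Tensor):
        # x: (batch, time, feat_dim) - sequence of frame/token features
        h, _ = self.encoder(x)
        h, _ = self.attn(h, h, h)
        pooled = h.mean(dim=1)                          # temporal average pooling
        return self.emotion_head(pooled), self.sentiment_head(pooled)

# Training would typically minimize the sum of the two cross-entropy losses:
# loss = ce(emo_logits, emo_labels) + ce(sent_logits, sent_labels)
```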

Multi-task methods for classification of emotions (surprise, anger, sadness, fear, disgust, happiness) and sentiment (negative, neutral, positive) using bimodal data (audio+video, audio+text) were proposed. Experimental studies on multi-task multimodal emotion and sentiment recognition (on the RAMAS and CMU-MOSEI corpora) were conducted:

  • using audio and video data, the performance of different modality fusion methods was compared. The CMGSAF method, based on statistical functionals, fully connected layers, and two consecutive attention layers, was proposed. CMGSAF outperformed the considered classical modality fusion methods by 2.2%. The results showed that for RAMAS, video data were more effective than audio data, whereas the opposite was true for CMU-MOSEI. CMGSAF outperformed other SOTA methods in the emotion recognition task on the RAMAS corpus by 0.7% (UAR) and on the CMU-MOSEI corpus by 18.2% (mWAcc) and 1.6% (mWF1);
  • using audio and text data, the performance of modality fusion methods was compared. An FCF method based on the concatenation of features processed by two identical neural networks with AM (for emotion and sentiment) was proposed (sketched below). FCF outperformed the other modality fusion methods by 1%, including AM-based fusion. Audio data were more effective than text data for recognizing emotion, whereas the opposite was true for sentiment. The FCF method outperformed other SOTA methods in the emotion recognition task by 2.82% (mWAcc) and 0.7% (mWF1) and in the sentiment recognition task by 7.13% (Acc) and 6.06% (WF1) on the CMU-MOSEI corpus.
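
A minimal sketch of concatenation-based bimodal fusion in the spirit of the FCF idea is given below; the branch structure, feature dimensions, and class counts are illustrative assumptions rather than the published configuration.

```python
# Minimal sketch of bimodal fusion by feature concatenation with attention-based branches.
import torch
import torch.nn as nn

class AttentionBranch(nn.Module):
    def __init__(self, feat_dim: int, dim: int = 128):
        super().__init__()
        self.proj = nn.Linear(feat_dim, dim)
        self.attn = nn.MultiheadAttention(dim, num_heads=4, batch_first=True)

    def forward(self, x):                      # x: (batch, time, feat_dim)
        h = self.proj(x)
        h, _ = self.attn(h, h, h)
        return h.mean(dim=1)                   # (batch, dim)

class BimodalConcatFusion(nn.Module):
    def __init__(self, audio_dim: int = 1024, text_dim: int = 768, dim: int = 128):
        super().__init__()
        self.audio_branch = AttentionBranch(audio_dim, dim)
        self.text_branch = AttentionBranch(text_dim, dim)
        self.emotion_head = nn.Linear(2 * dim, 6)
        self.sentiment_head = nn.Linear(2 * dim, 3)

    def forward(self, audio_feats, text_feats):
        fused = torch.cat([self.audio_branch(audio_feats),
                           self.text_branch(text_feats)], dim=-1)
        return self.emotion_head(fused), self.sentiment_head(fused)
```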

The results show that audio and video data are more effective for emotion recognition, while textual data are more informative for sentiment analysis.

Two software packages were developed and registered with Rospatent: 1) Software for Audio-Visual Emotions and Sentiment Recognition (AVESR); 2) Software for hierarchical recognition of destructive phenomena in speech (Destructive Behaviour Detection, DesBDet). Using a webcam, AVESR can perform real-time recognition of emotions (surprise, anger, sadness, fear, disgust, happiness) and sentiment (negative, neutral, positive). DesBDet performs hierarchical recognition of destructive phenomena in speech (false or true information, level of aggression, and absence/presence of depression); it can record audio files with a microphone or read them from disk. Both software packages are characterized by good generalizability due to cross-corpus trained models, fast response times, and high recognition accuracy.

In 2023, a cycle of 7 scientific papers was published in journals and proceedings indexed in the international citation systems Scopus, Web of Science, and RSCI, including the international journal Mathematics (Q1 WoS), the Russian journals "Information and Control Systems" (Scopus) and "Journal of Instrument Engineering" (RSCI), as well as the proceedings of the jubilee 25th International Conference on Speech and Computer SPECOM-2023 (Dharwad, India; a top-level conference according to the international portal Research.com), the 7th International Scientific Conference "Intelligent Information Technologies for Industry" IITI-2023 (St. Petersburg; an invited talk by A. Karpov), the 29th International Conference on Computational Linguistics and Intellectual Technologies DIALOGUE-2023 (Moscow), and the 5th International Conference on Photogrammetric techniques for environmental and infraStructure monitoring, Biometry and Biomedicine PSBB-2023 (Moscow).

Addresses of Internet resources prepared for the Project:

  1. Ryumina E., Markitantov M., Karpov A. Multi-Corpus Learning for Audio-Visual Emotions and Sentiment Recognition // Mathematics, 2023, vol. 11(16), ID 3519.
  2. Ryumina E., Karpov A. Impact of visual modalities in multimodal personality and affective computing // The International Archives of the Photogrammetry, Remote Sensing and Spatial Information Sciences, 2023, vol. XLVIII-2/W3-2023, pp. 217–224.
  3. Karpov A., Dvoynikova A., Ryumina E. Intelligent Interfaces and Systems for Human-Computer Interaction // In Proc. of 7th International Conference «Intelligent Information Technologies for Industry» IITI 2023, St. Petersburg, Springer, Lecture Notes in Networks and Systems LNNS, 2023, vol. 776, pp. 3-13.
  4. Dvoynikova A., Karpov A. Bimodal sentiment and emotion classification with multi-head attention fusion of acoustic and linguistic information // Computational Linguistics and Intellectual Technologies. In Proc. of 29th International Conference DIALOGUE-2023, Moscow, 2023, pp. 51-61.
  5. Velichko A., Karpov A. An approach and software system for integral analysis of destructive paralinguistic phenomena in colloquial speech // Information and Control Systems, 2023, No. 4, pp. 2-11.
  6. Ivanko D., Ryumina E., Ryumin D., Axyonov A., Kashevnik A., Karpov A. EMO-AVSR: Two-Level Approach for Audio-Visual Emotional Speech Recognition // In Proc. of SPECOM 2023, 2023, Springer LNCS vol. 14338, pp. 1-14.
  7. Dvoynikova A., Kondratenko K. Approach to automatic recognition of emotions in speech transcriptions // Journal of Instrument Engineering. 2023, Vol. 66, No. 10, pp. 818-827.
  8. Software for Audio-Visual Emotions and Sentiment Recognition (AVESR), authors: Markitantov M., Ryumina E., Karpov A.
  9. Software for Destructive Behaviour Detection (DesBDet), authors: Velichko A., Karpov A.

 

Results for 2022

In 2022, the 1st stage of the project, related to the development and research of the mathematical, information, and linguistic support of an intelligent system for multimodal recognition of human affective states, was completed.

An analytical review of modern scientific and technical literature on multimodal modeling of audio-visual signals for the analysis of affective states was conducted. It can be concluded that neural network methods are gradually replacing traditional ones, achieving higher accuracy in recognizing affective states and processing large amounts of data quickly. The existing multimodal corpora were analyzed, and some of them were obtained for the analysis of affective states: for emotions (AFEW, AFEW-VA, AffWild2, SEWA, AffectNet); for emotions and sentiment (CMU-MOSEI, MELD); for aggression (parts of the TR and SD corpora); for depression (DAIC).

We collected and annotated the Audio-Visual Aggressive Behavior in Online Streams (AVABOS) corpus. It contains video files obtained from open sources on the Internet that capture individual and group aggressive behavior of Russian-speaking users manifested during live video broadcasts. The corpus is designed for automatic audio-visual analysis of aggressive behavior and is officially registered with Rospatent, certificate No. 2022623239 dated 12.05.2022.

Novel mathematical tools (models, methods, and algorithms) were developed, and existing ones improved, to extract informative features from the audio, video, and text modalities in order to automatically predict various affective states:

  • for the task of recognizing emotions, aggression, and depression from the audio modality, a method based on expert and neural network features (openSMILE, openXBOW, AuDeep, DeepSpectrum, PANN, Wav2Vec) was developed. The influence of data annotation quality on the effectiveness of emotion recognition methods was analyzed on the RAMAS corpus. Quantitative evaluation of the developed methods for recognizing emotions, aggression, and depression was carried out on the RAMAS, TR & SD, and DAIC corpora, respectively. For the audio modality, an audio data augmentation method was developed based on the modification of spectrogram images: rotation, rescaling, width and height shifts, brightness change, horizontal flipping, stretching, compression, and SpecAugment (a minimal augmentation sketch follows the list below);
  • for the task of recognizing emotions and sentiment from the text modality on the RAMAS and CMU-MOSEI corpora, methods for preprocessing orthographic text transcriptions (tokenization, punctuation removal, lowercasing, lemmatization for Russian-language data and stemming for English-language data), as well as the Word2Vec neural network vectorization method, were developed and studied (see the preprocessing sketch after this list). The advantages of the proposed methods are the preservation of the syntactic and semantic information of the text after vectorization, the small size of the vector space, and the possibility of using different models for Russian and English. For text data augmentation, an approach was developed that combines several text modification methods: word deletion, permutation, and replacement, sentence permutation, back-translation, generative models, and domain augmentation;
  • for the task of recognizing emotions from the video modality, a method based on the ResNet-50 convolutional neural network was developed. It was trained on the AffectNet corpus and extracts facial texture features of different dimensions, which can be fed to both classical deterministic and neural network machine learning methods. The developed method is also effective for recognizing other affective states, in particular aggression and depression. For the video modality, a data augmentation method based on image modification techniques was developed: Mixup, affine transformations, contrast adjustment, and class weighting.
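
A minimal sketch of the audio augmentation idea follows, showing SpecAugment-style time and frequency masking on a mel-spectrogram with torchaudio; the file name and all parameters are illustrative, and the geometric and brightness modifications listed above would be applied analogously to the spectrogram treated as an image.

```python
# Minimal sketch of SpecAugment-style augmentation of a mel-spectrogram.
import torch
import torchaudio

waveform, sample_rate = torchaudio.load("utterance.wav")       # hypothetical input file
mel = torchaudio.transforms.MelSpectrogram(sample_rate=sample_rate, n_mels=64)(waveform)

augment = torch.nn.Sequential(
    torchaudio.transforms.FrequencyMasking(freq_mask_param=8),  # mask a band of mel bins
    torchaudio.transforms.TimeMasking(time_mask_param=20),      # mask a span of time frames
)
mel_augmented = augment(mel)
```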
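
The text pipeline can be sketched as follows: a transcript is normalized and an utterance is represented as the mean of its Word2Vec word vectors (using gensim). The toy corpus, model parameters, and the simple tokenizer are illustrative; lemmatization or stemming would be plugged in per language as described above.

```python
# Minimal sketch of text preprocessing and Word2Vec-based utterance vectorization.
import re
import numpy as np
from gensim.models import Word2Vec

def preprocess(text: str) -> list:
    text = text.lower()                          # case reduction
    text = re.sub(r"[^\w\s]", " ", text)         # punctuation removal
    return text.split()                          # simple tokenization

sentences = [preprocess(t) for t in ["I am fine, thanks!", "This is terrible news..."]]
w2v = Word2Vec(sentences, vector_size=100, window=5, min_count=1, epochs=50)

def vectorize(tokens: list, model: Word2Vec) -> np.ndarray:
    vecs = [model.wv[w] for w in tokens if w in model.wv]
    return np.mean(vecs, axis=0) if vecs else np.zeros(model.wv.vector_size)

utterance_vector = vectorize(preprocess("I am fine, thanks!"), w2v)   # shape: (100,)
```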

It is well known that, when communicating, people use both verbal and non-verbal manifestations of affective states (emotions, sentiment, aggression, depression). The speaker expresses the semantic content of a communicative utterance through verbal information, which is representative for expressing sentiment or the polarity of emotion. Meanwhile, the intensity of emotions is reflected in non-verbal manifestations and is conveyed better through the audio modality than the video one. For analyzing manifestations of depression, it is more efficient to use visual and acoustic information; linguistic information alone cannot convey the full state of an affective disorder and the manifestation of emotions, but can only indicate the negative polarity of an utterance. Multimodal recognition of a person's affective states makes it possible to analyze the speaker's verbal and non-verbal information simultaneously and obtain reliable information about the communicant's psychological state.

Based on the results of theoretical and experimental research conducted in 2022, a series of 5 scientific papers on current results was prepared and published in journals and proceedings indexed in the international citation systems Scopus, Web of Science, and RSCI, including the Russian journals “Informatics and Automation” (Scopus and RSCI) and “Proceedings of VSU. Series: Systems Analysis and Information Technologies” (RSCI), as well as the proceedings of the 24th International Conference “Speech and Computer” SPECOM (India; a top-level conference according to the international portal Research.com; 2 papers published in the Springer Lecture Notes in Computer Science series) and the 24th International Congress on Acoustics ICA (Korea; this prestigious congress takes place every 3 years). In addition, an invited paper was presented at the 4th International Conference on Language Engineering and Applied Linguistics “Piotrowski’s Readings 2022” (St. Petersburg).

Addresses of Internet resources prepared for the Project:

  1. Reportage “Automated call centers: the way from IVR to a ‘lie detector’”, Delovoy Petersburg
  2. Velichko A. A speech signal analysis method for automatic aggression detection in colloquial speech // Proceedings of VSU, Series: Systems Analysis and Information Technologies, 2022, No. 4, pp. 1-9.
  3. Dvoynikova A., Markitantov M., Ryumina E., Uzdiaev M., Velichko A., Kagirov I., Kipyatkova I., Lyakso E., Karpov A. An analysis of automatic techniques for recognizing human's affective states by speech and multimodal data // In Proc. of the 24th International Congress on Acoustics ICA-2022, 2022, pp. 22-33.
  4. Dvoynikova A., Markitantov M., Ryumina E., Uzdiaev M., Velichko A., Ryumin D., Lyakso E., Karpov A. Analysis of infoware and software for human affective states recognition // Informatics and Automation, 2022, Vol. 21, No. 6, pp. 1097-1144.
  5. Mamontov D., Minker W., Karpov A. Self-Configuring Genetic Programming Feature Generation in Affect Recognition Tasks // In Proc. of International Conference on Speech and Computer (SPECOM), 2022, pp. 464-476.
  6. Ryumina E., Ivanko D. Emotional Speech Recognition Based on Lip-Reading // In Proc. of International Conference on Speech and Computer (SPECOM), 2022, pp. 616-625.
  7. Audio-Visual Aggressive Behavior in Online Streams corpus (AVABOS)

Project head: Karpov A.A.
Project number: No. 22-11-00321
Period: 2022-2024
Financing: Russian Science Foundation