Speech and Multimodal Interfaces Laboratory

Intelligent system for multimodal recognition of human cognitive disorders

This RSF project addresses the pressing problem of multimodal recognition of cognitive disorders by analyzing people's conversational speech and the visual manifestations of their behavior using modern methods of digital signal processing and deep machine learning. The goal of the project is to develop and study an intelligent computer system for multimodal analysis of human behavior that recognizes cognitive disorders (in conditions such as Alzheimer's disease, Parkinson's disease, dementia, depression, etc.) from audio, video, and text data, in order to improve the efficiency and speed of contactless diagnosis. Research on the automatic diagnosis of speech-based and multimodal manifestations of cognitive disorders is a highly sought-after interdisciplinary application of modern information technologies and artificial intelligence to healthcare and human well-being. This is explained by the prospects of using artificial intelligence methods for timely, remote, and equipment-light medical diagnostics, which is especially important for people whose mobility is limited by age or health, as well as for those who live in remote areas and cannot attend an in-person appointment with a specialist. Such research must meet the high standards of recognition accuracy expected by users and specialists, as well as ethical requirements, which is why the development of new effective, reliable, and explainable AI methods for interpreting the decisions made is of particular relevance and significance.

During the project, it is planned to develop new models, methods, algorithms, and software solutions for comprehensive multimodal recognition of human cognitive disorders, and to improve existing ones. In particular, pressing problems of augmenting training audiovisual data in various languages (English, Greek, etc.) will be addressed. The possibilities of obtaining new language-independent feature sets and applying them to Russian-language data using expert approaches, neural networks, and large language models will be explored. Approaches to machine classification (presence or absence of pathology) and regression (determining the severity of the disease) of the cognitive disorders in question will also be investigated, as well as approaches to ensuring the explainability of neural network features and probabilistic models of cognitive disorders. The main result of the project should be a prototype of an intelligent expert system for automatic recognition of human cognitive disorders based on comprehensive multimodal analysis of the acoustic characteristics of the voice, the visual characteristics of facial expressions and gestures, and the linguistic content of spoken utterances. The results obtained are expected to meet modern requirements and standards in this field and to be at the advanced world level. The practical, scientific, and technical significance of the project tasks is confirmed by the high demand for the technologies being developed in the market of speech and multimodal expert technologies for healthcare and human well-being, as well as by the large number of foreign scientific publications on this problem in leading journals and proceedings of international conferences. The proposed system will be unique of its kind owing to comprehensive multimodal detection of the considered cognitive disorders in speech, the use of new sets of analyzed features, and the application of multi-level methods that take into account the interdependencies between the considered cognitive disorders.

Results for 2025

In 2025, the first stage of the project, focused on studying the mathematical and informational foundations of an intelligent system for multimodal recognition of cognitive impairments, was completed.

An analytical review of modern scientific and technical literature on speech-based and multimodal methods for recognizing cognitive impairments was conducted. The review shows that despite progress in neural network methods, the limited size of available datasets and the requirements for medical transparency and high accuracy compel the use of linear models or, alternatively, the development of architectures that support explainability (XAI). Existing informational resources were analyzed, and access was obtained to several open speech and multimodal corpora containing data from individuals with cognitive impairments (ADReSS, ADReSSo, Taukadial, GRAADRD), Parkinson’s disease (WSM), and depression (DAIC-WOZ, eDAIC-WOZ).

A patent search covering 2005-2025 was conducted. It identified 24 relevant patents, 9 of which are Russian. No works were found that aim to recognize several cognitive impairments simultaneously, and no patent documents were found that involve simultaneous analysis of all three modalities (video, audio, and text), which highlights the novelty and potential of our research.

New mathematical methods were developed and existing mathematical tools (models, methods, and algorithms) were improved for the automatic modeling of various cognitive impairments:

  1. Methods for preprocessing, normalization, and noise reduction using classical and neural approaches were proposed and implemented. For video data (WSM vlogs), a unified processing pipeline was created, including Silero VAD for voice activity detection, whisper-timestamped (Whisper-large-v3-turbo) for time-aligned speech recognition, and OCEAN-AI for face detection. Long videos are segmented, and only segments containing both speech and the speaker's face are retained, ensuring high data quality. For clinical audio data, the speaker-diarization-3.1 method is used to identify speakers, along with noisereduce for noise suppression and loudness normalization. Transcriptions are extracted using Silero VAD, whisper-timestamped, and the Whisper-large-v3-turbo / Whisper-large-v3 models with language-specific prompts that minimize recognition errors while preserving speech disfluencies as potential impairment markers (a minimal preprocessing sketch is given after this list).
  2. Methods were proposed and implemented for extracting expert and neural network features from audio and text, accounting for natural-language specifics. Acoustic and prosodic audio features include eGeMAPS (OpenSMILE) and DigiPsych Prosody (WebRTC VAD), and text embeddings are generated from Whisper transcriptions. Expert audio features from OpenSMILE and textual features from BlaBla are used, along with parameters obtained from the speech-recognition systems employed in cognitive impairment detection. A new cross-lingual feature set was created, suitable for modeling cognitive impairments in a multimodal environment (see the feature-extraction sketch after this list).
  3. Methods for text-data augmentation using LLMs and back-translation were designed and implemented. It was demonstrated that different cognitive impairments exhibit specific linguistic patterns (repetitions, syntactic simplification, etc.) that require different augmentation strategies: LLM-based paraphrasing is used for dementia, mild cognitive impairment, bipolar disorders, and Parkinson's and Alzheimer's diseases, while back-translation is applied for depression and control groups. Augmentation quality is assessed using the BLEU and BERTScore metrics (an augmentation sketch also follows the list).
  4. Methods for combining expert and neural features from different modalities were implemented and studied on the eDAIC-WOZ and WSM corpora. The experiments employed late fusion at the prediction level (audio, video, and text), where each modality is processed by a separate model and the results are merged by an ensemble of classifiers with majority voting (see the fusion sketch below). Experiments with early, feature-level fusion showed lower recognition accuracy. Future work will explore graph-based architectures and cross-modal attention, approaches previously unused in this domain but promising for improving multimodal analysis of cognitive impairments.
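
The following minimal Python sketch, referenced in item 1, shows how noise reduction, voice activity detection, and time-aligned transcription can be chained. The file name, sampling rate, language, and prompt are illustrative assumptions, and a recent Whisper release providing the large-v3-turbo checkpoint is assumed; the actual project pipeline additionally includes pyannote speaker-diarization-3.1 and OCEAN-AI face detection.

```python
# Preprocessing sketch: denoising + VAD + word-aligned transcription.
# Assumptions: 16 kHz mono English audio, hypothetical file "interview.wav".
import torch
import noisereduce as nr
import whisper_timestamped as whisper

SR = 16_000

# Load the recording at 16 kHz with Silero VAD's helper utilities.
vad_model, utils = torch.hub.load("snakers4/silero-vad", "silero_vad")
get_speech_timestamps, _, read_audio, _, _ = utils
wav = read_audio("interview.wav", sampling_rate=SR)

# Spectral-gating noise reduction (noisereduce).
denoised = nr.reduce_noise(y=wav.numpy(), sr=SR).astype("float32")

# Keep only the segments where Silero VAD detects speech.
segments = get_speech_timestamps(torch.from_numpy(denoised), vad_model,
                                 sampling_rate=SR)

# Word-aligned transcription; the initial prompt nudges the model to keep
# hesitations and fillers, which may themselves signal impairments.
asr = whisper.load_model("large-v3-turbo")
result = whisper.transcribe(asr, denoised, language="en",
                            initial_prompt="uh, um, er...")  # illustrative

for seg in segments:
    print(f"speech: {seg['start'] / SR:.2f}-{seg['end'] / SR:.2f} s")
print(result["text"])
```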
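
For item 2, a minimal sketch of expert acoustic feature extraction with OpenSMILE (eGeMAPS functionals) and of a neural text embedding for the transcript; the transcript string and the sentence-transformer checkpoint are illustrative assumptions, not necessarily the project's actual choices.

```python
# Acoustic functionals (eGeMAPSv02) plus a text embedding for one sample.
import opensmile
from sentence_transformers import SentenceTransformer

# Expert acoustic/prosodic features: 88 eGeMAPS functionals per recording.
smile = opensmile.Smile(
    feature_set=opensmile.FeatureSet.eGeMAPSv02,
    feature_level=opensmile.FeatureLevel.Functionals,
)
acoustic = smile.process_file("interview.wav")   # pandas DataFrame, 1 x 88

# Neural text features: embedding of the Whisper transcription.
encoder = SentenceTransformer("paraphrase-multilingual-MiniLM-L12-v2")
text_embedding = encoder.encode("well I... I went to the shop yesterday")

print(acoustic.shape, text_embedding.shape)
```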
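
For item 3, a hedged sketch of back-translation augmentation through a Russian pivot, with BLEU and BERTScore used to check how far the augmented sentence drifts from the original; the pivot language, the MarianMT checkpoints, and the example sentence are illustrative choices.

```python
# Back-translation augmentation (en -> ru -> en) with MarianMT, plus BLEU and
# BERTScore as drift metrics for the augmented text.
from transformers import MarianMTModel, MarianTokenizer
import sacrebleu
from bert_score import score as bert_score


def translate(texts, model_name):
    """Translate a list of sentences with a MarianMT checkpoint."""
    tok = MarianTokenizer.from_pretrained(model_name)
    model = MarianMTModel.from_pretrained(model_name)
    batch = tok(texts, return_tensors="pt", padding=True, truncation=True)
    out = model.generate(**batch, max_length=256)
    return [tok.decode(t, skip_special_tokens=True) for t in out]


original = ["I went to the... to the shop and I forgot what I wanted."]
pivot = translate(original, "Helsinki-NLP/opus-mt-en-ru")
augmented = translate(pivot, "Helsinki-NLP/opus-mt-ru-en")

bleu = sacrebleu.sentence_bleu(augmented[0], original)
_, _, f1 = bert_score(augmented, original, lang="en")
print(augmented[0], f"BLEU={bleu.score:.1f}", f"BERTScore-F1={f1.item():.3f}")
```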
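
For item 4, a minimal sketch of late, decision-level fusion with majority voting over per-modality classifiers; the feature matrices are random placeholders standing in for audio, video, and text representations, and logistic regression is used only as a stand-in classifier.

```python
# Late fusion at the prediction level: one classifier per modality, then a
# majority vote over their binary class predictions.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
n_train, n_test = 120, 30
y_train = rng.integers(0, 2, n_train)            # 0 = control, 1 = pathology

modality_dims = {"audio": 88, "video": 64, "text": 384}   # assumed sizes
predictions = []
for name, dim in modality_dims.items():
    X_train = rng.normal(size=(n_train, dim))    # placeholder features
    X_test = rng.normal(size=(n_test, dim))
    clf = LogisticRegression(max_iter=1000).fit(X_train, y_train)
    predictions.append(clf.predict(X_test))

# Majority vote across the three modality-specific models.
votes = np.vstack(predictions)                   # shape (3, n_test)
fused = (votes.sum(axis=0) >= 2).astype(int)
print(fused)
```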

The research revealed that linguistic information is the most significant for identifying cognitive impairments, as such impairments strongly affect vocabulary. However, linguistic features are language-dependent, which limits their applicability in cross-lingual analysis. Audio and video features can provide important complementary information, helping to improve reliability and reduce the influence of language characteristics on prediction hypotheses.

Based on the analytical, theoretical, and experimental research conducted in 2025, a series of three scientific papers was prepared and published in venues indexed in Scopus and RSCI: the Russian journals Informatics and Automation (Scopus, RSCI, VAK Category 2) and Scientific and Technical Journal of Information Technologies, Mechanics and Optics (Scopus, RSCI, VAK Category 2), as well as the proceedings of the 27th International Conference “Speech and Computer” SPECOM-2025 (Szeged, Hungary), published in Lecture Notes in Computer Science (Springer Nature, Scopus Q2).

 

Project's head
Number: N 25-11-00319
Period: 2025-2027
Financing: Russian Science Foundation