Speech and Multimodal Interfaces Laboratory

Automatic multilingual speech recognition with support for code-switching (on the example of the Karelian and Russian languages)

The objective of this project is to develop a prototype multilingual speech recognition system that supports code-switching, using Karelian and Russian as the example language pair.

Many communities around the world use two or more languages in everyday communication (multilingualism). One of the most striking examples of multilingualism is India (over 400 living languages, and the vast majority of Indian citizens speak at least two languages). There are over 150 languages in Russia, which has led to well-developed multilingualism in a number of regions (Republic of Tatarstan, Republic of Tuva, Republic of Dagestan, etc.).

One phenomenon characteristic of multilingual communities is code-switching: the spontaneous transition of a speaker from one language or dialect to another. Code-switching can occur both between utterances and within a sentence.

Developing an automatic speech recognition system with support for code-switching is considerably more challenging than developing a plain multilingual system. The main difficulty is the lack of training data. This primarily applies to text data, since written texts are usually stylistically edited and therefore free of code-switching. Solving this problem requires substantial work on collecting and annotating specific language data, as well as developing methods for training data augmentation. Acoustic and language modeling of code-switched speech is a non-trivial task, and in general, automatic speech recognition systems of this type perform worse than systems with no support for code-switching. Developing a speech recognition system for Karelian and Russian is further complicated by the fact that Karelian is a low-resource language, i.e., a language with little information support (few or no Internet resources, digitized databases, or language processing software).

The development of the proposed system is relevant for two reasons: first, the approaches and solutions found will be of importance for developing speech recognition systems with code-switching for other languages as well; second, the development of such a system contributes to the study of Karelian, which is especially important because Karelian is an endangered language.

The practical value of this study is that the proposed system contributes to research on the low-resource Karelian language, and the results of the project can be used by field linguists working on language contact and the modern Karelian language.

Results for 2025

During the second stage of the project in 2025, the project team prepared a text corpus representing Karelian-Russian code-switching; trained acoustic and language models for Karelian that account for potential code-switching to Russian; integrated the developed models into a prototype automatic Karelian speech recognition system with code-switching to Russian; and tested and quantitatively evaluated the prototype on a test speech database.

For acoustic modeling, Hidden Markov Models (HMMs), hybrid HMM/ANN models that combine HMMs with artificial neural networks (ANNs), and pre-trained integrated multilingual models based on Wav2Vec 2.0 were used. Training was performed on the speech corpus collected during the first stage of the project (KarRusCoS database), which contains Karelian-Russian code-switching, as well as on a monolingual Karelian speech corpus (AnKaS database) collected during a previous project.

To train the language models, a Karelian text corpus containing Karelian-Russian code-switching was prepared. The text corpus included transcripts of the training part of the speech corpus, Karelian texts augmented via partial translation into Russian, and text obtained with the EDA (Easy Data Augmentation) method.
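The partial-translation augmentation described above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the project's implementation: the lexicon `TOY_DICT` contains placeholder tokens rather than real Karelian or Russian words, and the function names, the replacement probability `p`, and the number of passes are hypothetical.

```python
import random

# Hypothetical toy lexicon with placeholder tokens (not real words).
# A real pipeline would rely on a bilingual dictionary or machine translation.
TOY_DICT = {"krl_a": "rus_a", "krl_b": "rus_b", "krl_c": "rus_c"}


def augment_partial_translation(sentence, lexicon, p=0.3, rng=None):
    """Replace each word found in the lexicon with its translation with probability p."""
    if rng is None:
        rng = random.Random()
    out = []
    for word in sentence.split():
        if word in lexicon and rng.random() < p:
            out.append(lexicon[word])   # switch this word to the other language
        else:
            out.append(word)            # keep the original word
    return " ".join(out)


def augment_corpus(sentences, lexicon, iterations=5, p=0.3, seed=0):
    """Run several augmentation passes and collect the distinct new sentences."""
    rng = random.Random(seed)
    augmented = set()
    for _ in range(iterations):
        for sentence in sentences:
            new_sentence = augment_partial_translation(sentence, lexicon, p, rng)
            if new_sentence != sentence:
                augmented.add(new_sentence)
    return sorted(augmented)
```

Each pass draws a different random subset of words, so repeated passes over the same sentence yield several distinct code-switched variants.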

Three types of language models were created: statistical models based on word trigrams, models based on pre-trained multilingual neural network architectures (BERT and T5), and models obtained by linear interpolation of the Karelian and Russian language models. The neural models were applied at the post-processing stage for rescoring the N-best list of recognition hypotheses and selecting the best hypothesis. The created models were evaluated using the perplexity (PPL) metric and the number of out-of-vocabulary (OOV) words calculated on the transcripts of the development and test parts of the speech corpus.
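The quantities mentioned in this paragraph can be illustrated with a short sketch. It is a simplified illustration, not the project's code: the function names are hypothetical, the interpolation weight `lam` is a free parameter, the models are represented only by the probabilities they assign, and the rescoring rule (first-pass score plus a weighted language-model log-probability) is one common choice assumed here.

```python
import math


def interpolate(p_a, p_b, lam):
    """Linear interpolation of two models' probabilities for the same word and history."""
    return lam * p_a + (1.0 - lam) * p_b


def perplexity(word_probs):
    """Perplexity from per-word probabilities (assumed smoothed, i.e. all > 0)."""
    log_sum = sum(math.log(p) for p in word_probs)
    return math.exp(-log_sum / len(word_probs))


def oov_count(tokens, vocabulary):
    """Number of tokens that are missing from the model vocabulary."""
    return sum(1 for token in tokens if token not in vocabulary)


def rescore_nbest(hypotheses, lm_logprob, lm_weight=0.5):
    """Select the best hypothesis from (text, first_pass_score) pairs.

    lm_logprob is any callable returning a log-probability for a text,
    for example a wrapper around a neural language model (assumption).
    """
    best = max(hypotheses, key=lambda h: h[1] + lm_weight * lm_logprob(h[0]))
    return best[0]
```

For instance, a model that assigns probability 0.25 to every word of a transcript has a perplexity of 4 on that transcript; lower perplexity and fewer OOV words indicate a better fit to the development and test transcripts.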

The developed acoustic and language models were integrated into a prototype system for automatic recognition of Karelian speech with code-switching. The prototype’s web application is available via the following link: https://huggingface.co/spaces/Mihaj/SMIL-Livvi-Krl-ASR.

Experimental studies of the prototype were also conducted, and its performance was quantitatively evaluated using the word error rate (WER) metric on the development and test parts of the Karelian speech corpus with code-switching. The best results were achieved with an acoustic model based on Wav2Vec2.0-BERT and a trigram language model obtained by linear interpolation of a Karelian model (trained on transcripts from the training part of the speech corpus, text data augmented via automatic translation of randomly selected words over 5 iterations, and text data augmented using the EDA method) and a Russian model, with an interpolation coefficient of 0.7. The resulting WER was 25.82% on the development set and 29.00% on the test set. These results are on par with the state of the art for other low-resource languages with code-switching.
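For reference, WER is the word-level edit distance (substitutions, deletions, and insertions) between the recognized text and the reference transcript, divided by the number of reference words. The sketch below shows the standard computation; the function names are hypothetical, whitespace tokenization is an assumption, and the project's actual scoring tool is not specified in the text.

```python
def edit_distance(ref_words, hyp_words):
    """Word-level Levenshtein distance between a reference and a hypothesis."""
    prev = list(range(len(hyp_words) + 1))
    for i, ref_word in enumerate(ref_words, 1):
        curr = [i]
        for j, hyp_word in enumerate(hyp_words, 1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1]


def word_error_rate(references, hypotheses):
    """Corpus-level WER: total errors divided by total reference words."""
    if len(references) != len(hypotheses):
        raise ValueError("references and hypotheses must have the same length")
    errors = 0
    ref_len = 0
    for ref, hyp in zip(references, hypotheses):
        ref_words, hyp_words = ref.split(), hyp.split()
        errors += edit_distance(ref_words, hyp_words)
        ref_len += len(ref_words)
    if ref_len == 0:
        raise ValueError("reference transcripts contain no words")
    return errors / ref_len
```

Because insertions count as errors, WER can exceed 100% when the hypothesis is much longer than the reference.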

During the current stage, three papers were published, including one in the journal “Terra Linguistica” (Scopus Q1), one in the journal “Informatics and Automation” (Scopus), and one in the “Lecture Notes in Computer Science” series (Scopus). Presentations were made at the 27th International Conference “Speech and Computer (SPECOM 2025)” (Szeged, Hungary, October 13-14, 2025) and the 11th Interdisciplinary Workshop “Analysis of Conversational Russian Speech (AR3-2025)” (St. Petersburg, June 30 - July 1).

Addresses of Internet resources prepared for the Project:

  1. Kagirov I., Kiseleva K., Kipyatkova I. Code-switching analysis in Karelian language speakers' speech // Terra Linguistica. 2025. 16(2), pp. 22–40.
  2. Kipyatkova I., Kagirov I., Dolgushin M. Use of Pre-Trained Multilingual Models for Karelian Speech Recognition // Informatics and Automation. 2025. 24(2), pp. 604–630.
  3. Kipyatkova I., Kiseleva K., Dolgushin M., Kagirov I. Modeling Intra-word Code-Switching for Karelian ASR // In Proc. 27th International Conference on Speech and Computer SPECOM 2025, Springer LNCS, vol. 16188, 2025, pp. 104–117.
  4. Web-page of developed software Livvi-Karelian ASR

 

Results for 2024

At the first stage of the project in 2024, the following tasks were completed: conducting an analytical review of the research topic; recording, transcribing, and segmenting speech data in Livvi-Karelian with Karelian-Russian code switching; developing a phoneme alphabet merging the phonemes of the Karelian and Russian languages, and developing a phonemic vocabulary for the Karelian-Russian speech recognition system.

The analytical review covers more than 50 sources. It examines the main methods and approaches to building a speech recognition system that supports code-switching. It also addresses key techniques used for training low-resource systems. The review concludes that one of the most effective training methods for such systems is leveraging pre-trained multilingual models followed by fine-tuning on data in the target languages. Additionally, various methods of speech and text data augmentation can be employed, including speech synthesis, partial automatic text translation, and text modification.

Samples of spontaneous speech in Livvi-Karelian were recorded. A total of 37 native Karelian speakers (16 men and 21 women) took part in the recording sessions. After removing noisy fragments, the total duration of the speech data is 3 hours. Russian-language speech accounts for 27% of the recordings. The recordings are stored as WAV files with a sampling rate of 16 kHz, 16 bits per sample, mono.
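A recording can be checked against the stated format (16 kHz, 16 bits per sample, mono) with Python's standard `wave` module. This is a minimal sketch; the function names are hypothetical and the file path is supplied by the caller.

```python
import wave


def matches_corpus_format(path, rate=16000, sample_width_bytes=2, channels=1):
    """Return True if the WAV file is 16 kHz, 16-bit (2 bytes per sample), mono."""
    with wave.open(path, "rb") as wav:
        return (wav.getframerate() == rate
                and wav.getsampwidth() == sample_width_bytes
                and wav.getnchannels() == channels)


def duration_seconds(path):
    """Duration of a WAV file in seconds (frames divided by sampling rate)."""
    with wave.open(path, "rb") as wav:
        return wav.getnframes() / wav.getframerate()
```

Summing `duration_seconds` over all files gives the total corpus duration, which can be compared with the reported figure.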

The audio recordings were transcribed and segmented into individual clauses. A speech corpus, "Speech Database with Karelian-Russian Code-Switching (KarRusCoS)", containing annotations of the recorded speech data was created. KarRusCoS includes audio recordings of Karelian speech as well as annotations giving the speaker's identification number; the speaker's gender; transcriptions of the speaker's utterances; the duration of each clause; the number of words in Karelian; the number of words in Russian; the number of words with intra-word code-switching; and the total word count per phrase. A certificate of database registration has been obtained from the Federal Institute of Industrial Property (FIPS).
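The annotation fields listed above could be represented, for example, by a record like the one below. This is a hypothetical sketch: the field names, types, and helper functions are assumptions made for illustration, and the actual storage format of KarRusCoS is not described in the text.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class ClauseAnnotation:
    """One annotated clause; field names are illustrative, not the database schema."""
    speaker_id: int
    gender: str
    transcription: str
    duration_s: float
    n_karelian: int        # words in Karelian
    n_russian: int         # words in Russian
    n_intraword_cs: int    # words with intra-word code-switching
    n_total: int           # total word count of the clause


def russian_word_share(clauses: List[ClauseAnnotation]) -> float:
    """Share of Russian words among all annotated words."""
    total = sum(c.n_total for c in clauses)
    if total == 0:
        return 0.0
    return sum(c.n_russian for c in clauses) / total


def clauses_with_code_switching(clauses: List[ClauseAnnotation]) -> List[ClauseAnnotation]:
    """Clauses containing Russian words or intra-word code-switching."""
    return [c for c in clauses if c.n_russian > 0 or c.n_intraword_cs > 0]
```

Keeping per-clause word counts by language makes it straightforward to select code-switched clauses for evaluation or to report corpus-level statistics.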

A phoneme alphabet was created by merging the phoneme sets of the Karelian and Russian languages, resulting in a total set of 68 phonemes.

A phonemic vocabulary was compiled, combining word forms from both Karelian and Russian. Additionally, to account for intra-word code-switching, the vocabulary included Russian word stems with Karelian affixes. Phonemic transcriptions for all words in the vocabulary were generated automatically.

The findings of the research in 2024 were presented at the International Conference on Speech and Computer (SPECOM 2024) (Belgrade, Serbia), the 5th International Scientific Conference on Engineering and Applied Linguistics "Piotrovsky Readings 2024" (St. Petersburg, Russia), and the 20th Scientific Conference "Bubrich Readings: Traditions and Innovations in the Study of Finno-Ugric Languages and Cultures" (Petrozavodsk, Russia). The results obtained were published in the Lecture Notes in Computer Science series.

Addresses of Internet resources prepared for the Project:

  1. Kipyatkova I., Kagirov I., Dolgushin M., Rodionova A. Towards a Livvi-Karelian End-to-End ASR System // In Proc. of 26th International Conference on Speech and Computer SPECOM 2024, Springer LNCS, vol. 15299, Belgrade, Serbia, 2024, pp. 57–68.
  2. KarRusCoS – Speech Database with Karelian-Russian Code-Switching

 

Project's head
Number: N 24-21-00276
Period: 2024-2025
Financing: Russian Science Foundation