Speech and Multimodal Interfaces Laboratory

Automatic speech recognition for under-resourced languages of Russia (on the example of the Karelian language)

Within this project, a study is planned on the development of speech recognition systems for under-resourced languages, using Karelian as an example. Building such a system is a rather difficult task, primarily because Karelian belongs to the so-called under-resourced languages, i.e., languages for which no significant speech and text corpora have been created. This general shortage of data makes it difficult to apply standard approaches to automatic speech-to-text conversion, which rely on training a system on large data sets.

The Karelian language is spoken by about 30 thousand people, of whom about 25 thousand live in Russia. Currently there are no systems for automatic recognition of Karelian speech, and the language itself is both under-resourced and endangered.

The development of the proposed system for automatic speech-to-text conversion of Karelian is relevant for the following reasons: the approaches and solutions developed within this project will be of significance for building automatic speech recognition and machine translation systems for other under-resourced languages of the world; Karelian is one of the languages of Russia that, according to the UNESCO "Atlas of the World's Languages in Danger", is endangered; and the development of the proposed system contributes to the research and preservation of endangered languages; in particular, it can be useful to field linguists engaged in collecting language samples.

The aim of the project is to develop a prototype system for automatic speech-to-text conversion for the Karelian language. The main tasks of the project are: collecting a speech corpus of Karelian; collecting and processing text data in Karelian; developing acoustic models; and developing language models.

The practical significance of the research is that the acoustic and language models being developed will be used in a system of automatic speech-to-text conversion for the Karelian language. Such a system can serve as a front end for machine translation from Karelian into Russian. In addition, automatic speech recognition can be used in automatic transcription (stenography) systems for under-resourced and endangered languages, serving as a useful tool for their preservation and study.

Results for 2023

In 2023, work was carried out on the second stage of the project, which included training acoustic and language models for the Karelian language, integrating the developed models into a prototype automatic Karelian speech recognition system, and testing the developed prototype.

Hybrid models combining Hidden Markov Models (HMM) with Deep Neural Networks (DNN), i.e., HMM/DNN models, were used for acoustic modeling. The acoustic models were trained on the speech corpus collected during the first stage of the project and expanded during the current stage. Additionally, augmentation methods such as pitch and speech-rate perturbation were applied to increase the volume of training data. Various DNN architectures were explored. In experiments conducted on the development part of the speech corpus, the hybrid model with factorized Time Delay Neural Networks (TDNN-F) showed the lowest Word Error Rate (WER).
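The speech-rate (speed) perturbation used for augmentation can be sketched as simple waveform resampling. This is a minimal stand-alone illustration, not the project's actual pipeline (Kaldi provides its own speed and pitch perturbation utilities); the sample rate, test tone, and factors 0.9/1.1 are assumptions chosen for the example.

```python
import numpy as np

def speed_perturb(samples: np.ndarray, factor: float) -> np.ndarray:
    """Resample a waveform to change the apparent speaking rate.

    factor > 1.0 speeds the utterance up (fewer output samples),
    factor < 1.0 slows it down (more output samples).
    """
    n_out = int(round(len(samples) / factor))
    # Positions in the original signal at which to read each output sample.
    positions = np.linspace(0, len(samples) - 1, n_out)
    return np.interp(positions, np.arange(len(samples)), samples)

# A 1-second 440 Hz test tone at 16 kHz, perturbed to 0.9x and 1.1x speed;
# adding 0.9x and 1.1x copies of each recording triples the training data.
sr = 16000
tone = np.sin(2 * np.pi * 440 * np.arange(sr) / sr)
slow = speed_perturb(tone, 0.9)  # longer signal (slower speech)
fast = speed_perturb(tone, 1.1)  # shorter signal (faster speech)
```

Pitch perturbation is done in a similar resampling spirit, but keeps the duration fixed (e.g., via phase vocoding), which is omitted here for brevity.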

For language modeling, word trigram models and models based on Recurrent Neural Networks (RNN) were trained, and a linear interpolation of the trigram and neural network models was performed as well. Training used the text corpus collected during the first stage of the project. The trigram language model was used during the speech decoding stage, while the RNN-based models were applied at the post-processing stage to rescore the N-best list of recognition hypotheses and choose the best one. The language models were evaluated by perplexity on the test portion of the text corpus and by the WER metric. The best results were achieved using a Bidirectional Long Short-Term Memory (BiLSTM) model with two hidden layers, interpolated with the trigram model with an interpolation coefficient of 0.5.
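The interpolation and N-best rescoring can be sketched as follows. The hypothesis tuples and scores are invented for illustration; a real rescorer would combine per-word log-probabilities produced by the trigram model and the BiLSTM rather than single sentence probabilities.

```python
import math

def interpolate(p_ngram: float, p_rnn: float, lam: float = 0.5) -> float:
    """Linear interpolation of two language-model probabilities.

    lam is the weight of the neural model; lam = 0.5 mirrors the
    interpolation coefficient reported above.
    """
    return lam * p_rnn + (1.0 - lam) * p_ngram

def rescore(nbest, lam=0.5):
    """Pick the best hypothesis from an N-best list.

    nbest: list of (hypothesis, acoustic_logprob, p_ngram, p_rnn) tuples.
    The total score is the acoustic log-probability plus the log of the
    interpolated language-model probability.
    """
    best_hyp, best_score = None, -math.inf
    for hyp, ac_logprob, p_ngram, p_rnn in nbest:
        score = ac_logprob + math.log(interpolate(p_ngram, p_rnn, lam))
        if score > best_score:
            best_hyp, best_score = hyp, score
    return best_hyp

# "hyp B" wins: its stronger LM probabilities outweigh the acoustic deficit.
best = rescore([("hyp A", -10.0, 0.02, 0.01),
                ("hyp B", -10.5, 0.05, 0.08)])
```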

The developed acoustic and language models were integrated into a prototype of an automatic Karelian speech recognition system that converts pre-recorded utterances in Livvi Karelian into text. The system was created with the open-source Kaldi toolkit. Experimental studies of the developed prototype were conducted, and its performance was quantitatively evaluated with the WER metric. The best result achieved on the development set was WER = 23.22%; testing the recognition system with the models mentioned above gave WER = 25.40%. These results are on par with state-of-the-art results for other low-resource languages.
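For reference, the WER metric used throughout is the word-level Levenshtein distance between the recognized text and the reference transcript, normalized by the reference length. A minimal self-contained implementation:

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word Error Rate in percent:
    (substitutions + insertions + deletions) / number of reference words."""
    ref, hyp = reference.split(), hypothesis.split()
    # d[i][j] = edit distance between the first i reference words
    # and the first j hypothesis words.
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution or match
    return 100.0 * d[len(ref)][len(hyp)] / len(ref)

# One substitution and one deletion against a 4-word reference -> 50% WER.
print(wer("a b c d", "a x c"))  # 50.0
```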

During the current project stage, four papers were published, among them a paper in the “Mathematics” journal (Q1 WoS) and papers in the journals “Information and Control Systems” (Scopus), “Lecture Notes in Computer Science” (Scopus), and “Proceedings of Petrozavodsk State University” (RSCI, Google Scholar, ERIH PLUS). A presentation was given at the international conference “Speech and Computer (SPECOM 2023)” (Hubli-Dharwad, India, November 29 – December 1, 2023). A certificate of state registration (FIPS) was obtained for the “Database of Annotations of Karelian Speech Recordings (AnKaS)”.

Addresses of Internet resources prepared for the Project:

  1. Kipyatkova I., Kagirov I. Deep Models for Low-Resourced Speech Recognition: Livvi-Karelian Case // Mathematics. 2023, vol. 11(18), ID 3814.
  2. Kipyatkova I., Kagirov I. Automatic speech recognition system for Karelian // Information and Control Systems. 2023, no. 3, pp. 16–25.
  3. Kipyatkova I., Kagirov I. Phone Durations Modeling for Livvi-Karelian ASR // In Proc. 25th International Conference SPECOM 2023, Dharwad, India, Springer Lecture Notes in Computer Science, vol. 14339, 2023, pp. 87–99.
  4. Kipyatkova I., Rodionova A., Kagirov I., Krizhanovsky A. Speech and text data preparation for developing an automatic speech recognition system for the Karelian language // Proceedings of Petrozavodsk State University, 2023, no. 45(5), pp. 89–98.
  5. AnKaS – Database of Annotations of Karelian Speech Recordings


Results for 2022

The tasks completed by the research team in 2022 included an analytical review of the research topic, the collection and annotation of a corpus of continuous Karelian speech, and the collection of a corpus of Karelian texts.

The analytical review on the research topic encompasses more than 70 works, more than 50 of which have been published in the last 7 years. The review addresses the definition of low-resource languages, points out the main challenges that arise in the area of automatic speech recognition for low-resource languages, and outlines the basic methods aimed at solving the problems at issue. Based on the review, one can conclude that the most promising ways to solve the aforesaid problems are the expansion of the training corpora (data augmentation) and the transfer of parameters of a model trained on extraneous data to the model of the target language (transfer learning).

A study of the linguistic and phonetic features of Karelian was carried out, resulting in the development of a phonemic alphabet for the Livvi dialect that includes stressed and unstressed vowels, hard and soft consonants, and long vowels and doubled consonants (phoneme length is a distinctive feature in Karelian). In addition, the back allophone of the phoneme /i/ was singled out as a separate phoneme. In total, an inventory of 26 vowels and 56 consonants was identified. The developed alphabet was used to create phonemic transcriptions of Karelian words. A Python-based software module was developed to create the transcriptions automatically: it performs grapheme-to-phoneme conversion of an input word list in accordance with the transcription rules for Karelian.
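Such a grapheme-to-phoneme module can be sketched as greedy longest-match rule application. The rule table below is a simplified placeholder with invented grapheme-to-phone mappings, not the actual Livvi rule set or phoneme inventory developed in the project:

```python
# Rules sorted longest grapheme first, so doubled letters (length is
# distinctive in Karelian) are matched before their single counterparts.
# These mappings are illustrative only.
RULES = [
    ("čč", "tʃː"),  # doubled consonant -> long affricate
    ("č", "tʃ"),
    ("aa", "aː"),   # doubled vowel -> long vowel
    ("uu", "uː"),
]

def g2p(word: str, rules=RULES) -> str:
    """Greedy left-to-right grapheme-to-phoneme conversion."""
    word = word.lower()
    phones, i = [], 0
    while i < len(word):
        for grapheme, phone in rules:
            if word.startswith(grapheme, i):
                phones.append(phone)
                i += len(grapheme)
                break
        else:
            # No rule fired: keep the letter itself as the phone symbol.
            phones.append(word[i])
            i += 1
    return " ".join(phones)
```

Applied to a word list, a module of this kind yields one pronunciation per word for the recognizer's lexicon.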

A speech corpus of the Livvi dialect was collected from recordings of 10 episodes of the radio broadcast “Kodirandaine” (“Native Shore”) provided by the State Television and Radio Broadcasting Company “Karelia”. The corpus includes audio recordings of 15 speakers (6 men and 9 women); after the removal of “junk” fragments, its volume was 3.5 hours. All the recordings were transcribed, and the whole corpus was annotated and segmented into single phrases. The corpus was split into a training part (90% of the phrases) and a test part (10% of the phrases). The training part was augmented by changing the speech rate and the fundamental frequency, and the augmented speech data was added to the real training data.

A corpus of texts in the Livvi dialect of Karelian was collected using books provided by the publishing houses “Periodika” and “Verso”. Another important source was VepKar, the open corpus of the Veps and Karelian languages, along with text data from other open sources and transcripts of audio recordings from the training part of the speech corpus. The text corpus was automatically pre-processed by a purpose-built software module: the text was split into separate sentences, punctuation marks were removed, capital letters were replaced with lowercase ones, text in brackets was deleted, and repeated sentences were removed. After de-duplication, the corpus volume was over 5 million word tokens. Based on the collected text corpus, a dictionary was created, which will further be used as a part of the Karelian speech recognition system; all words included in the dictionary were automatically supplied with phonemic transcriptions.
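The pre-processing steps listed above can be sketched in a few lines. The sentence splitter and punctuation pattern here are deliberately naive stand-ins for the project's actual module:

```python
import re

def preprocess(raw_text: str) -> list:
    """Corpus cleaning: delete bracketed text, split into sentences,
    remove punctuation, lowercase, and drop repeated sentences."""
    text = re.sub(r"\([^)]*\)", " ", raw_text)    # delete text in brackets
    sentences = re.split(r"(?<=[.!?])\s+", text)  # naive sentence splitting
    seen, cleaned = set(), []
    for sentence in sentences:
        s = re.sub(r"[^\w\s-]", " ", sentence).lower()  # strip punctuation
        s = " ".join(s.split())                         # collapse whitespace
        if s and s not in seen:                         # de-duplicate
            seen.add(s)
            cleaned.append(s)
    return cleaned
```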

The results of the research conducted in 2022 were published in the “Informatics and Automation” journal, indexed in the Scopus and RSCI databases, and were presented at the conference “Bubrikh Readings: Languages and Cultures in the Digital Age”.

Addresses of Internet resources prepared for the Project:

  1. Kipyatkova I. S., Kagirov I. A. Analytical Review of Methods for Solving Data Scarcity Issues Regarding Elaboration of Automatic Speech Recognition Systems for Low-Resource Languages // Informatics and Automation. 2022, no. 21(4), pp. 678–709.
Russian Science Foundation, project No. 22-21-00843