Automatic speech recognition for under-resourced languages of Russia (on the example of the Karelian language)
This project investigates the development of speech recognition systems for under-resourced languages, using Karelian as an example. Building such a system is difficult primarily because Karelian is an under-resourced language, i.e., a language for which no significant speech and text corpora have been created. The general shortage of data makes it hard to apply standard approaches to automatic speech-to-text conversion, which rely on training a system on large data sets.
The Karelian language is spoken by about 30 thousand people, about 25 thousand of whom live in Russia. No system for automatic recognition of Karelian speech exists at present, and the language is both under-resourced and endangered.
The proposed system for automatic speech-to-text conversion of Karelian is relevant for the following reasons: the approaches and solutions developed within this project will be significant for building automatic speech-to-text conversion systems, as well as automatic speech recognition and machine translation systems, for other under-resourced languages of the world; Karelian is one of the languages of Russia that UNESCO's "Atlas of the World's Languages in Danger" classifies as endangered; the development of the proposed system contributes to the research and preservation of endangered languages; in particular, it can be useful to field linguists engaged in collecting language samples.
The aim of the project is to develop a prototype system for automatic speech-to-text conversion for the Karelian language. The main tasks of the project are: collecting a speech corpus of Karelian; collecting and processing text data in Karelian; developing acoustic models; developing language models.
The practical significance of the research is that the acoustic and language models being developed will be used in a system for automatic speech-to-text conversion of Karelian. Such a system can be used for machine translation from Karelian to Russian. In addition, automatic speech recognition systems can be used in automatic stenography systems for under-resourced and endangered languages, serving as a useful tool for their preservation and research.
Results for 2022
The tasks completed by the research team in 2022 included an analytical review of the research topic, the collection and annotation of a corpus of continuous Karelian speech, and the collection of a corpus of Karelian texts.
The analytical review of the research topic covers more than 70 works, more than 50 of which were published in the last 7 years. The review addresses the definition of low-resource languages, points out the main challenges that arise in automatic speech recognition for low-resource languages, and outlines the basic methods for addressing them. Based on the review, the most promising ways to solve these problems are the expansion of the training corpora (data augmentation) and the transfer of parameters of a model trained on extraneous data to the model of the target language (transfer learning).
A study of the linguistic and phonetic features of Karelian was carried out, resulting in a phonemic alphabet for the Livvik dialect that includes stressed and unstressed vowels, hard and soft consonants, long vowels and doubled consonants (length is a distinctive feature of Karelian phonemes). In addition, the back allophone of the phoneme /i/ was singled out as a separate phoneme. In total, an inventory of 26 vowels and 56 consonants was identified. The alphabet was used to create phonemic transcriptions of Karelian words. A Python-based software module was developed to create the transcriptions automatically. It performs grapheme-to-phoneme conversion for an input list of words in accordance with the transcription rules for Karelian.
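The grapheme-to-phoneme step described above can be illustrated with a minimal rule-based sketch. It is not the project's module: the symbols are hypothetical, and the only rules shown are two assumptions made for illustration, namely that a doubled letter denotes a long phoneme (written here with ":") and that an apostrophe marks a soft (palatalized) consonant. Stress marking and the back allophone of /i/ are left out.

```python
# Illustrative sketch only: the rules and symbols below are assumptions,
# not the phonemic alphabet or transcription rules developed in the project.
VOWELS = set("aeiouyäö")


def g2p(word):
    """Convert one word to a list of phoneme symbols using two toy rules."""
    word = word.lower()
    phonemes = []
    i = 0
    while i < len(word):
        ch = word[i]
        if ch == "'":
            # Assumed rule: an apostrophe softens the preceding consonant.
            if phonemes and phonemes[-1][0] not in VOWELS:
                phonemes[-1] += "'"
            i += 1
            continue
        if not ch.isalpha():
            # Skip anything that is not a letter (digits, hyphens, etc.).
            i += 1
            continue
        if i + 1 < len(word) and word[i + 1] == ch:
            # Assumed rule: a doubled letter is one long phoneme.
            phonemes.append(ch + ":")
            i += 2
        else:
            phonemes.append(ch)
            i += 1
    return phonemes


def transcribe_words(words):
    """Map each word in the input list to a space-separated transcription."""
    return {w: " ".join(g2p(w)) for w in words}
```

Calling `transcribe_words` on a word list returns a dictionary from each word to its transcription, which is the shape a pronunciation lexicon for a recognizer typically takes.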
A speech corpus of the Livvik dialect was collected from recordings of 10 episodes of the radio broadcast "Kodirandaine" ("Native Shore"), provided by the State Television and Radio Broadcasting Company "Karelia". The corpus includes audio recordings of 15 speakers (6 men and 9 women). The total duration of the recordings after the removal of “junk” fragments was 3.5 hours. All the recordings were transcribed, and the whole corpus was annotated and segmented into single phrases. The corpus was split into a training part, containing 90% of the phrases, and a test part, containing the remaining 10%. The training part was augmented by changing the speed of speech and the fundamental frequency (pitch). The augmented speech data was added to the original training data.
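The split and the augmentation can be sketched as follows. This is a simplified illustration, not the procedure used in the project: the function names, the random seed and the speed factors 0.9 and 1.1 are assumptions, and plain resampling by linear interpolation changes tempo and pitch together, whereas the project may have modified them separately.

```python
import random


def split_corpus(phrases, test_fraction=0.1, seed=0):
    """Shuffle the phrases and split them into (train, test) parts."""
    rng = random.Random(seed)  # fixed seed (assumed) for a reproducible split
    shuffled = list(phrases)
    rng.shuffle(shuffled)
    n_test = max(1, round(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]


def change_speed(samples, factor):
    """Resample a waveform by linear interpolation.

    factor > 1 makes the signal shorter (faster speech, higher pitch);
    factor < 1 makes it longer (slower speech, lower pitch).
    """
    if factor <= 0:
        raise ValueError("factor must be positive")
    if not samples:
        return []
    n_out = max(1, int(len(samples) / factor))
    out = []
    for k in range(n_out):
        pos = k * factor              # position in the original signal
        i = int(pos)
        frac = pos - i
        nxt = samples[min(i + 1, len(samples) - 1)]
        out.append(samples[i] * (1 - frac) + nxt * frac)
    return out


def augment(train_waveforms, factors=(0.9, 1.1)):
    """Return the original waveforms plus one perturbed copy per factor."""
    augmented = list(train_waveforms)
    for wave in train_waveforms:
        for f in factors:             # assumed factors, not from the project
            augmented.append(change_speed(wave, f))
    return augmented
```

Only the training part is passed to `augment`; the test part stays untouched so that evaluation is done on real speech.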
A corpus of texts in the Livvik dialect of the Karelian language was collected from books provided by the publishing houses "Periodika" and "Verso". Other important sources were VepKar, the open corpus of the Veps and Karelian languages, text data from other open sources, and transcripts of audio recordings from the training part of the speech corpus. The text corpus was automatically pre-processed by a purpose-built software module, which split the text into separate sentences, removed punctuation marks, replaced capital letters with lowercase letters, deleted text in brackets and removed repeated sentences. The size of the corpus after the removal of repeated sentences was over 5 million word tokens. Based on the collected text corpus, a dictionary was created, which will later be used as part of the Karelian speech recognition system. All the words included in the dictionary were automatically transcribed (phonemic transcription).
The results of the research conducted in 2022 were published in the journal “Informatics and Automation”, indexed in the Scopus and RSCI databases, and were presented at the conference “Bubrikh readings: Languages and Cultures in the Digital Age”.
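The pre-processing steps listed above can be sketched as a short pipeline. This is an assumption-laden illustration rather than the project's module: the sentence splitter is a simple regular expression, the apostrophe is kept inside words on the assumption that it is part of the orthography, and the function names are hypothetical.

```python
import re


def preprocess(text):
    """Return a list of cleaned, unique sentences in their original order."""
    # Delete text in round or square brackets, together with the brackets.
    text = re.sub(r"\([^()]*\)|\[[^\[\]]*\]", " ", text)
    # Naive sentence split: break after ., ! or ? followed by whitespace.
    sentences = re.split(r"(?<=[.!?])\s+", text)
    seen = set()
    result = []
    for sentence in sentences:
        sentence = sentence.lower()
        # Remove punctuation; letters, digits and apostrophes are kept.
        sentence = re.sub(r"[^\w\s']", " ", sentence)
        sentence = " ".join(sentence.split())  # normalize whitespace
        if sentence and sentence not in seen:  # drop repeated sentences
            seen.add(sentence)
            result.append(sentence)
    return result


def build_vocabulary(sentences):
    """Collect the sorted list of distinct words for the dictionary."""
    return sorted({word for s in sentences for word in s.split()})
```

The word list returned by `build_vocabulary` is what would then be passed to a grapheme-to-phoneme step to obtain transcriptions for the dictionary.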
Addresses of Internet resources prepared for the Project:
- Kipyatkova I. S., Kagirov I. A. Analytical Review of Methods for Solving Data Scarcity Issues Regarding Elaboration of Automatic Speech Recognition Systems for Low-Resource Languages // Informatics and Automation. 2022. N. 21(4). pp. 678–709.
Results for 2023
Addresses of Internet resources prepared for the Project:
- Kipyatkova I., Kagirov I. Deep Models for Low-Resourced Speech Recognition: Livvi-Karelian Case