Self-Supervised Training for the Kazakh Speech Recognition System
DOI: https://doi.org/10.32523/2616-7182/bulmathenu.2023/4.2

Keywords: automatic speech recognition, Kazakh language, Wav2Vec 2.0, Wav2Vec2-XLSR, pre-trained transformer models, speech representation models

Abstract
In recent years, neural models pre-trained on large multilingual text and speech corpora have shown promise for improving the state of low-resource languages. This study reports experiments applying state-of-the-art speech recognition models, specifically Wav2Vec2.0 and Wav2Vec2-XLSR, to the Kazakh language. The primary aim is to assess how well these models transcribe spoken Kazakh. The investigation also examines whether data from other languages can be leveraged for pre-training, and whether fine-tuning the model on target-language data improves its performance. The study thus offers insight into the viability of pre-trained multilingual models for under-resourced languages. The fine-tuned Wav2Vec2-XLSR model achieved strong results: a character error rate (CER) of 1.9 and a word error rate (WER) of 8.9 on the test set of the kazcorpus dataset. These findings can support the development of robust and efficient automatic speech recognition (ASR) systems for Kazakh, benefiting applications such as speech-to-text transcription, voice-activated assistants, and speech-driven communication tools.
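To make the pipeline concrete, the sketch below shows how a fine-tuned Wav2Vec2-XLSR model can be loaded for Kazakh transcription and scored with CER/WER, using the Hugging Face transformers and jiwer libraries. The checkpoint name, audio file, and reference transcript are placeholders, not artifacts of this paper; the publicly available multilingual base weights are facebook/wav2vec2-large-xlsr-53, and the paper's own fine-tuned model is not assumed to be released.

```python
import torch
import torchaudio
import jiwer
from transformers import Wav2Vec2Processor, Wav2Vec2ForCTC

# Hypothetical checkpoint name; substitute an actual Kazakh
# fine-tuned Wav2Vec2-XLSR model. The base multilingual weights
# are "facebook/wav2vec2-large-xlsr-53".
MODEL_ID = "your-org/wav2vec2-xlsr-kazakh"

processor = Wav2Vec2Processor.from_pretrained(MODEL_ID)
model = Wav2Vec2ForCTC.from_pretrained(MODEL_ID)
model.eval()

# Load a mono recording and resample to the 16 kHz rate that
# Wav2Vec2 models expect.
waveform, sr = torchaudio.load("example_kk.wav")  # placeholder file
if sr != 16_000:
    waveform = torchaudio.functional.resample(waveform, sr, 16_000)

inputs = processor(waveform.squeeze().numpy(),
                   sampling_rate=16_000, return_tensors="pt")

with torch.no_grad():
    logits = model(inputs.input_values).logits

# Greedy CTC decoding: argmax per frame, then collapse repeated
# tokens and blanks inside batch_decode.
pred_ids = torch.argmax(logits, dim=-1)
transcription = processor.batch_decode(pred_ids)[0]
print(transcription)

# The paper's evaluation metrics, CER and WER, computed against a
# placeholder ground-truth transcript.
reference = "бүгін ауа райы жақсы"
print("WER:", jiwer.wer(reference, transcription))
print("CER:", jiwer.cer(reference, transcription))
```

Greedy decoding is the simplest option; in practice a language-model-assisted beam-search decoder typically lowers WER further at some computational cost.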