Identification of the spoken language using the Wav2Vec2 model for the Kazakh language
DOI: https://doi.org/10.32523/bulmathenu.2025/1.1

Keywords: language identification, spoken language identification, Kazakh language, Wav2Vec2, XLSR

Abstract
This study presents the development and fine-tuning of a spoken language identification model based on the XLSR (Cross-Lingual Speech Representations) variant of Wav2Vec2. Trained on a rich and diverse dataset spanning six languages, with a particular focus on low-resource languages such as Kazakh, the model demonstrates strong capabilities in multilingual speech recognition. In extensive evaluation, the fine-tuned model not only surpasses existing benchmarks but also outperforms other state-of-the-art models, including Whisper variants. Achieving an F1 score of 92.9% and an accuracy of 93%, the model demonstrates its effectiveness in realistic multilingual and low-resource scenarios. This work contributes to the advancement of speech recognition technology by providing a reliable solution for language identification across diverse language environments, especially underrepresented ones. Its success highlights the potential of Wav2Vec2-based models for improving speech processing systems in low-resource multilingual contexts. The results of this analysis can support the development of reliable and effective automatic speech recognition systems optimized for the Kazakh language. Such technologies will find applications in various fields, including speech-to-text conversion, voice assistants, and voice communication tools.
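To make the reported numbers concrete, the sketch below shows how accuracy and macro-averaged F1 are typically computed for a multi-class language identification task. The six language codes and the toy label lists are assumptions for illustration only; the abstract does not specify the language set or the evaluation tooling used.

```python
# Minimal sketch: accuracy and macro-F1 for multi-class language ID.
# The language codes below are hypothetical, not taken from the paper.
LANGS = ["kk", "ru", "en", "tr", "uz", "az"]

def accuracy(y_true, y_pred):
    """Fraction of utterances whose predicted language matches the reference."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def macro_f1(y_true, y_pred, labels):
    """Unweighted mean of per-language F1 scores (macro averaging)."""
    f1s = []
    for lang in labels:
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == lang and p == lang)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != lang and p == lang)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == lang and p != lang)
        prec = tp / (tp + fp) if tp + fp else 0.0
        rec = tp / (tp + fn) if tp + fn else 0.0
        f1s.append(2 * prec * rec / (prec + rec) if prec + rec else 0.0)
    return sum(f1s) / len(f1s)

# Toy usage on six hypothetical test utterances, one per language:
y_true = ["kk", "kk", "ru", "en", "tr", "uz"]
y_pred = ["kk", "ru", "ru", "en", "tr", "uz"]
print(f"accuracy = {accuracy(y_true, y_pred):.3f}")
print(f"macro-F1 = {macro_f1(y_true, y_pred, LANGS):.3f}")
```

Macro averaging treats every language equally regardless of how many test utterances it has, which is why it is a common choice when low-resource languages such as Kazakh must not be drowned out by better-represented ones.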