Is ASR the right tool for the construction of Spoken Corpus Linguistics in European Spanish?

San Martín, Mirari; Heras, Jónathan; Mata, Gadea; Gómez, Sara

Journal: Procesamiento del lenguaje natural

ISSN: 1135-5948

Year of publication: 2024

Issue: 73

Pages: 165-176

Type: Article

Abstract

Spoken corpora are a valuable resource for exploring naturally occurring discourse. However, large parts of these corpora remain untranscribed because manually transcribing audio files is costly, which limits access to these resources. Automatic Speech Recognition (ASR) tools, which have shown their potential to transcribe audio files automatically, could address this problem. In this work, we study two families of ASR models (Whisper and Seamless) for automatically transcribing files from the COSER corpus (Corpus Oral y Sonoro del Español Rural; in English, Audible Corpus of Rural Spanish). Our results show that these ASR models can produce accurate transcriptions regardless of the speakers' dialect and speech rate, especially the large v3 version of Whisper, which produces the best results (mean WER of 0.292). However, in some cases the transcriptions do not align perfectly with those produced by humans, because human transcribers reflect nuances of the speakers' speech that the ASR models do not capture. This shows that ASR tools can reduce the burden of manually transcribing hours of audio from spoken corpora, but human supervision is still needed.
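
The abstract reports its results as a mean word error rate (WER). As a rough illustration of what that figure measures, the sketch below computes WER as the word-level edit distance between a human reference transcription and an ASR hypothesis, divided by the number of reference words. The function names, the lowercasing and whitespace tokenization, and the sample sentences are assumptions made for this example; the record does not state which text normalization or tooling the authors used.

```python
from typing import Iterable, Tuple


def word_error_rate(reference: str, hypothesis: str) -> float:
    """Return (substitutions + deletions + insertions) / number of reference words.

    Assumption: both texts are lowercased and split on whitespace; the
    normalization used in the article may differ.
    """
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] is the edit distance between the reference words seen so far
    # and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr[j] = min(
                prev[j] + 1,         # deletion of a reference word
                curr[j - 1] + 1,     # insertion of a hypothesis word
                prev[j - 1] + cost,  # match or substitution
            )
        prev = curr
    return prev[-1] / len(ref)


def mean_wer(pairs: Iterable[Tuple[str, str]]) -> float:
    """Average the per-file WER over (reference, hypothesis) pairs."""
    scores = [word_error_rate(ref, hyp) for ref, hyp in pairs]
    if not scores:
        raise ValueError("at least one pair is required")
    return sum(scores) / len(scores)


if __name__ == "__main__":
    # Hypothetical sentences, not taken from the COSER corpus.
    print(word_error_rate("el perro come pan", "el gato come"))
```

Under this definition, a mean WER of 0.292 corresponds to roughly 29 word-level edits per 100 reference words, averaged over files. WER can exceed 1.0 when the hypothesis contains many insertions.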

Data source: Dialnet