Abstract
In this paper, we examine whether confidence scores produced by the C++ re-implementation of Whisper (Radford et al., in: International conference on machine learning, 2023) can be used to score L2 learners of English and classify them. We test whether the language prediction and its probability can be used to classify French learners of English, using a specifically collected dataset of read speech and a graded corpus, the ANGLISH corpus (Tortel and Hirst, in: Proceedings of speech prosody 2010, 2010. https://doi.org/10.21437/SpeechProsody.2010-49). We show that probability scores associated with the Whisper subtokens can be used to classify learners into levels using the k-nearest neighbours (kNN) algorithm. We show the limitations of the language detection probability beyond an initial threshold where the native language (L1) of the learner can actually be predicted by the model. We also use the ISLE corpus (Menzel et al., in: Proceedings of LREC 2000: Language resources and evaluation conference, European Language Resources Association, 2000) to test the prediction of the levels of Italian and German learners of English (Atwell et al., in: ICAME Journal, 27:5–18, 2003). We show how language detection for Whisper’s larger multilingual models can be used to detect less advanced learners’ first language but cannot be used for learner level classification with advanced learners. Using a greedy alignment algorithm, we also discuss the confidence score assigned to Whisper output subtokens and how this may be used for speaker scoring, prediction of learner levels, and learner feedback. We show that low confidence scores and alternative transcriptions can be used as potential cues for learner pronunciation errors.
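As a rough illustration of the classification step, the sketch below applies a k-nearest-neighbours vote to per-speaker features derived from token probabilities. The feature choice (mean and minimum token probability), the toy values, and the level labels are assumptions made for illustration only; they are not the features or the data used in the paper.

```python
from collections import Counter
from math import dist


def speaker_features(token_probs):
    """Summarise one speaker's token probabilities as (mean, minimum)."""
    return (sum(token_probs) / len(token_probs), min(token_probs))


def knn_predict(train, query, k=3):
    """Return the majority label among the k training points nearest to query.

    train is a list of (feature_tuple, label) pairs.
    """
    nearest = sorted(train, key=lambda item: dist(item[0], query))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]


# Hypothetical training data: token probabilities and an assumed level label.
train = [
    (speaker_features([0.95, 0.90, 0.97]), "advanced"),
    (speaker_features([0.92, 0.88, 0.99]), "advanced"),
    (speaker_features([0.93, 0.91, 0.94]), "advanced"),
    (speaker_features([0.60, 0.35, 0.80]), "beginner"),
    (speaker_features([0.55, 0.40, 0.70]), "beginner"),
    (speaker_features([0.65, 0.30, 0.75]), "beginner"),
]

new_speaker = speaker_features([0.58, 0.42, 0.77])
print(knn_predict(train, new_speaker, k=3))
```

With real data, the features would need to be scaled comparably and k chosen by cross-validation; an odd k avoids ties when there are two levels.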
Data availability
The ISLE corpus is available from ELRA (https://catalogue.elra.info/en-us/repository/browse/ELRA-S0083/). The ANGLISH data is available on the public repository ORTOLANG (https://www.ortolang.fr/market/corpora/sldr000731/v2).
Code availability
The alignment script is available on https://github.com/statsmaths/paper-replication.
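For readers who want a feel for what a greedy word alignment does before consulting the repository, here is a minimal sketch. It is not the authors' script: the function names `greedy_align` and `_resync`, the look-ahead window, and the example hypothesis are assumptions for illustration. The sketch pairs matching words left to right and, on a mismatch, resynchronises at the nearest later match within a small window, keeping the unmatched stretch as a reference/hypothesis gap.

```python
def _resync(ref, hyp, i, j, window):
    """Find the closest offsets (di, dj) with ref[i + di] == hyp[j + dj]."""
    for total in range(1, 2 * window + 1):
        for di in range(max(0, total - window), min(total, window) + 1):
            dj = total - di
            if (i + di < len(ref) and j + dj < len(hyp)
                    and ref[i + di] == hyp[j + dj]):
                return di, dj
    return None


def greedy_align(ref, hyp, window=4):
    """Greedily align two word lists into (ref_words, hyp_words) pairs."""
    i = j = 0
    pairs = []
    while i < len(ref) and j < len(hyp):
        if ref[i] == hyp[j]:
            pairs.append(([ref[i]], [hyp[j]]))
            i, j = i + 1, j + 1
            continue
        step = _resync(ref, hyp, i, j, window)
        # Assumption: with no match nearby, treat it as a one-word substitution.
        di, dj = step if step is not None else (1, 1)
        pairs.append((ref[i:i + di], hyp[j:j + dj]))
        i, j = i + di, j + dj
    if i < len(ref) or j < len(hyp):
        # Leftover words on either side form a final gap.
        pairs.append((ref[i:], hyp[j:]))
    return pairs


ref = "being made uncomfortable by the clammy heat".split()
hyp = "being made and comfortable by the clammy heat".split()
gaps = [(r, h) for r, h in greedy_align(ref, hyp) if r != h]
print(gaps)
```

A greedy strategy is fast and easy to inspect, but it can resynchronise on the wrong occurrence of a frequent word such as "the"; a dynamic-programming alignment avoids this at higher cost.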
Notes
We thank Maelle Amand for letting us use the dataset, described in Ballier et al. (2023).
The data was processed from the ELRA distribution.
References
Aksënova, A., Chen, Z., Chiu, C.-C., Esch, D., Golik, P., Han, W., King, L., Ramabhadran, B., Rosenberg, A., Schwartz, S., & Wang, G. (2022). Accented speech recognition: Benchmarking, pre-training, and diverse data. arXiv preprint. arXiv:2205.08014
Arora, V., Lahiri, A., & Reetz, H. (2018). Phonological feature-based speech recognition system for pronunciation training in non-native language learning. The Journal of the Acoustical Society of America, 143(1), 98–108.
Atwell, E., Howarth, P., & Souter, D. (2003). The ISLE corpus: Italian and German spoken learner’s English. ICAME Journal: International Computer Archive of Modern and Medieval English Journal, 27, 5–18.
Baevski, A., Zhou, Y., Mohamed, A., & Auli, M. (2020). wav2vec 2.0: A framework for self-supervised learning of speech representations. Advances in Neural Information Processing Systems, 33, 12449–12460.
Ballier, N., & Martin, P. (2015). Speech annotation of learner corpora. In S. Granger, G. Gilquin, & F. Meunier (Eds.), The Cambridge handbook of learner corpus research (pp. 107–134). Cambridge University Press.
Ballier, N., Méli, A., Amand, M., & Yunès, J.-B. (2023). Using Whisper LLM for automatic phonetic diagnosis of L2 speech, a case study with French learners of English. In Proceedings of the 6th international conference on natural language and speech processing (ICNLSP 2023) (pp. 282–292).
Ballier, N., Namdarzadeh, B., & Zimina-Poirot, M. (2023). Translating dislocations or parentheticals: Investigating the role of prosodic boundaries for spoken language translation from French into English. In Machine translation summit 2023 (Vol. 19, pp. 119–131).
Casanova, E., Weber, J., Shulby, C. D., Junior, A. C., Gölge, E., & Ponti, M. A. (2022). YourTTS: Towards zero-shot multi-speaker TTS and zero-shot voice conversion for everyone. In International conference on machine learning (pp. 2709–2720).
Chan, M. P. Y., Choe, J., Li, A., Chen, Y., Gao, X., & Holliday, N. (2022). Training and typological bias in ASR performance for world Englishes. In Proceedings of Interspeech, 2022 (pp. 1273–1277). https://doi.org/10.21437/Interspeech.2022-10869
Chanethom, V., & Henderson, A. (2022). Alignment in ASR and L1 listeners’ recognition of L2 learner speech: A replication study. In 15th International conference on native and non-native accents of English, Université de Łódź, Łódź, Poland. https://hal.science/hal-03929160
Dalby, J., & Kewley-Port, D. (1999). Explicit pronunciation training using automatic speech recognition technology. CALICO Journal, 16(3), 425–445.
Gerganov, G. (2023). whisper.cpp: A high-performance inference of OpenAI’s Whisper automatic speech recognition (ASR) model.
Inceoglu, S., Chen, W.-H., & Lim, H. (2023). Assessment of L2 intelligibility: Comparing L1 listeners and automatic speech recognition. ReCALL, 35(1), 89–104. https://doi.org/10.1017/S0958344022000192
Inceoglu, S., Lim, H., & Chen, W.-H. (2020). ASR for EFL pronunciation practice: Segmental development and learners’ beliefs. The Journal of Asia TEFL, 17(3), 824–840.
Islam, E., Park, C., & Hain, T. (2023). Exploring speech representations for proficiency assessment in language learning. In 9th Workshop on speech and language technology in education (SLaTE) proceedings (pp. 151–155). International Speech Communication Association (ISCA).
Javed, T., Joshi, S., Nagarajan, V., Sundaresan, S., Nawale, J., Raman, A., Bhogale, K., Kumar, P., & Khapra, M. M. (2023). Svarah: Evaluating English ASR systems on Indian accents. In Proceedings of Interspeech (pp. 5087–5091). https://doi.org/10.21437/Interspeech.2023-2588
Jiang, Z., Ren, Y., Ye, Z., Liu, J., Zhang, C., Yang, Q., Ji, S., Huang, R., Wang, C., Yin, X., Ma, Z., & Zhao, Z. (2023). Mega-TTS: Zero-shot text-to-speech at scale with intrinsic inductive bias. arXiv:2306.03509v1
Levenshtein, V. (1966). Binary codes capable of correcting deletions, insertions, and reversals. Soviet Physics Doklady, 10, 707–710.
Levinstein, B. A., & Herrmann, D. A. (2024). Still no lie detector for language models: Probing empirical and conceptual roadblocks. Philosophical Studies. https://doi.org/10.1007/s11098-023-02094-3
Martin, A., Daniel, E., & Ward, N. (1998). The use of the word error rate for evaluating automatic speech recognition systems. Proceedings of the IEEE International conference on acoustics, speech, and signal processing (Vol. 1, pp. 77–80).
Menzel, W., Atwell, E., Bonaventura, P., Herron, D., Howarth, P., Morton, R., & Souter, C. (2000). The ISLE corpus of non-native spoken English. In Proceedings of LREC 2000: Language resources and evaluation conference (Vol. 2, pp. 957–964). European Language Resources Association.
Povey, D., Ghoshal, A., Boulianne, G., Burget, L., Glembek, O., Goel, N., Hannemann, M., Motlicek, P., Qian, Y., Schwarz, P., Silovský, J., Stemmer, G., & Veselý, K. (2011). The Kaldi speech recognition toolkit. In IEEE 2011 workshop on automatic speech recognition and understanding. IEEE Signal Processing Society. https://infoscience.epfl.ch/record/192584/files/Povey_ASRU2011_2011.pdf
R Core Team. (2024). R: A language and environment for statistical computing [computer software manual]. R Core Team.
Radford, A., Kim, J. W., Xu, T., Brockman, G., McLeavey, C., & Sutskever, I. (2022). Robust speech recognition via large-scale weak supervision. https://doi.org/10.48550/arXiv.2212.04356
Radford, A., Kim, J. W., Xu, T., Brockman, G., McLeavey, C., & Sutskever, I. (2023). Robust speech recognition via large-scale weak supervision. In International conference on machine learning (pp. 28492–28518).
Rogers, C. L., Dalby, J. M., & DeVane, G. (1994). Intelligibility training for foreign-accented speech: A preliminary study. The Journal of the Acoustical Society of America, 96(5), 3348. https://doi.org/10.1121/1.410623
Tortel, A., & Hirst, D. (2010). Rhythm metrics and the production of English L1/L2. In Proceedings of speech prosody 2010 (p. 959). https://doi.org/10.21437/SpeechProsody.2010-49
Watson, C. S., Reed, D. J., Kewley-Port, D., & Maki, D. (1989). The Indiana Speech Training Aid (ISTRA) I: Comparisons between human and computer-based evaluation of speech quality. Journal of Speech, Language, and Hearing Research, 32(2), 245–251.
Weinberger, S. (2015). Speech accent archive. George Mason University. http://accent.gmu.edu
Witt, S. M., & Young, S. J. (2000). Phone-level pronunciation scoring and assessment for interactive language learning. Speech Communication, 30(2–3), 95–108.
Zhang, Z., Zhou, L., Wang, C., Chen, S., Wu, Y., Liu, S., Chen, Z., Liu, Y., Wang, H., Li, J., He, L., Zhao, S., & Wei, F. (2023). Speak foreign languages with your own voice: Cross-lingual neural codec language modeling. arXiv preprint. arXiv:2303.03926
Author information
Contributions
Nicolas Ballier conceptualised the paper and wrote the first draft. Taylor Arnold wrote the alignment script and contributed to the ISLE data processing. Jean-Baptiste Yunès wrote the C++ code to extract the probability from the Whisper predictions. Adrien Méli processed the data for the ablation analysis. Tori Fullerton qualitatively annotated the data for the calibration analysis. All authors of the manuscript have read and agreed to the final manuscript.
Ethics declarations
Conflict of interest
The authors declare that they have no competing interests.
Additional information
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Appendices
Appendix A: Reference text of the reading task
Observing the steady fall of the barometer, Captain MacWhirr thought, “There’s some dirty weather knocking about.” This is precisely what he thought. He had had an experience of moderately dirty weather-the term dirty as applied to the weather implying only moderate discomfort to the seaman. Had he been informed by an indisputable authority that the end of the world was to be finally accomplished by a catastrophic disturbance of the atmosphere, he would have assimilated the information under the simple idea of dirty weather, and no other, because he had no experience of cataclysms, and belief does not necessarily imply comprehension. The wisdom of his county had pronounced by means of an Act of Parliament that before he could be considered as fit to take charge of a ship he should be able to answer certain simple questions on the subject of circular storms such as hurricanes, cyclones, typhoons; and apparently he had answered them, since he was now in command of the Nan-Shan in the China seas during the season of typhoons. But if he had answered he remembered nothing of it. He was, however, conscious of being made uncomfortable by the clammy heat. He came out on the bridge, and found no relief to this oppression. The air seemed thick. He gasped like a fish, and began to believe himself greatly out of sorts.
The Nan-Shan was ploughing a vanishing furrow upon the circle of the sea that had the surface and the shimmer of an undulating piece of gray silk. The sun, pale and without rays, poured down leaden heat in a strangely indecisive light, and the Chinamen were lying prostrate about the decks. [...] Captain MacWhirr noticed two of them especially, stretched out on their backs below the bridge. As soon as they had closed their eyes they seemed dead. Three others, however, were quarrelling barbarously away forward; and one big fellow, half naked, with herculean shoulders, was hanging limply over a winch; another, sitting on the deck, his knees up and his head drooping sideways in a girlish attitude, was plaiting his pigtail with infinite languor depicted in his whole person and in the very movement of his fingers. The smoke struggled with difficulty out of the funnel, and instead of streaming away spread itself out like an infernal sort of cloud, smelling of sulphur and raining soot all over the decks.
Appendix B: Examples of Whisper mistranscriptions of uncomfortable
Ref_text | Model | Gap_text | Count |
|---|---|---|---|
uncomfortable | medium_en | and comfortable | 1 |
uncomfortable | medium_en | and comforted | 1 |
uncomfortable | medium_en | comfortable | 1 |
uncomfortable | medium_en | incompatible | 1 |
uncomfortable | medium | “uncomfortable” | 1 |
uncomfortable | medium | comfortable | 1 |
uncomfortable | tiny.en | a comfortable | 1 |
uncomfortable | tiny.en | and comfortable | 4 |
uncomfortable | tiny.en | in campatible | 1 |
uncomfortable | tiny.en | meant and capable | 1 |
uncomfortable | tiny.en | of incontivable | 1 |
uncomfortable | tiny.en | uncountable | 1 |
uncomfortable | tiny.en | ungothable | 1 |
uncomfortable | tiny | and comfortable | 2 |
uncomfortable | tiny | comfortable | 1 |
uncomfortable | tiny | incomparable | 1 |
uncomfortable | tiny | incompatible | 1 |
uncomfortable | tiny | main and comfortable | 1 |
uncomfortable | tiny | meant and comfortable | 1 |
uncomfortable | tiny | uncorruptable | 1 |
uncomfortable | tiny | ungovernable | 1 |
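
One way to rank the variants above by how far they stray from the target word is the character-level edit distance of Levenshtein (1966). The sketch below is an illustration only: the paper does not state that this measure was applied to these tables, and the function name `levenshtein` is our own. The four variants are copied from the rows above.

```python
def levenshtein(a, b):
    """Character-level edit distance (insertions, deletions, substitutions)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion from a
                            curr[j - 1] + 1,      # insertion into a
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1]


target = "uncomfortable"
variants = ["comfortable", "and comfortable", "incompatible", "uncountable"]
for variant in sorted(variants, key=lambda v: levenshtein(target, v)):
    print(variant, levenshtein(target, variant))
```

Orthographic distance is only a proxy here: two spellings can be far apart as strings yet close phonetically, so a phone-level comparison may be more informative for pronunciation feedback.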
Appendix C: Examples of Whisper mistranscriptions of leaden
Ref_text | Model | Gap_text | Count |
|---|---|---|---|
leaden | medium_en | , leading it | 1 |
leaden | medium_en | laden | 1 |
leaden | medium_en | lead and | 2 |
leaden | medium_en | leading | 4 |
leaden | medium_en | leading it | 1 |
leaden | medium_en | linen and | 1 |
leaden | medium_en | Unleaded | 1 |
leaden | medium | , leading | 2 |
leaden | medium | , leading heads | 1 |
leaden | medium | , leading it | 1 |
leaden | medium | , let on | 1 |
leaden | medium | , letting it | 1 |
leaden | medium | -laden heats | 1 |
leaden | medium | a laden | 1 |
leaden | medium | laden | 4 |
leaden | medium | laden , ate | 1 |
leaden | medium | laden heats | 2 |
leaden | medium | laden hits | 1 |
leaden | medium | lead and | 1 |
leaden | medium | lid and | 3 |
leaden | medium | lid and hip | 1 |
leaden | medium | lid and hit | 1 |
leaden | tiny.en | , leading | 1 |
leaden | tiny.en | -leading | 1 |
leaden | tiny.en | a little | 1 |
leaden | tiny.en | laddened | 1 |
leaden | tiny.en | lead and | 15 |
leaden | tiny.en | lead and hits | 1 |
leaden | tiny.en | lead in | 3 |
leaden | tiny.en | leading | 2 |
leaden | tiny.en | leading heads | 1 |
leaden | tiny.en | leading it | 1 |
leaden | tiny.en | lid and hit | 1 |
leaden | tiny.en | little | 1 |
leaden | tiny | , leading | 2 |
leaden | tiny | , leading heads | 1 |
leaden | tiny | , leading hits | 1 |
leaden | tiny | , leading it | 1 |
leaden | tiny | , led | 1 |
leaden | tiny | , lit | 1 |
leaden | tiny | foredown led | 1 |
leaden | tiny | laden | 3 |
leaden | tiny | laden hits | 1 |
leaden | tiny | lead and | 11 |
leaden | tiny | lead in | 2 |
leaden | tiny | leading hits | 1 |
leaden | tiny | linen , hit | 1 |
leaden | tiny | on lead and | 1 |
leaden | tiny | powered and ledon | 1 |
leaden | tiny | the hidden hits | 1 |
leaden | tiny | the lead in | 1 |
leaden | tiny | the unleading hits | 1 |
Appendix D: Examples of Whisper mistranscriptions of herculean
Ref_text | Model | Gap_text | Count |
|---|---|---|---|
herculean | medium_en | a Korean | 1 |
herculean | medium_en | a cholera | 1 |
herculean | medium_en | a kulean | 1 |
herculean | medium_en | achilles | 1 |
herculean | medium_en | arkadian | 1 |
herculean | medium_en | her Korean | 1 |
herculean | medium_en | her chilean | 1 |
herculean | medium_en | her clean | 3 |
herculean | medium_en | her culean | 1 |
herculean | medium_en | her curling | 1 |
herculean | medium_en | hickory | 1 |
herculean | medium | Herculean | 1 |
herculean | medium | a Cullian | 1 |
herculean | medium | aculure on | 1 |
herculean | medium | arcane | 1 |
herculean | medium | arculean | 1 |
herculean | medium | curly | 1 |
herculean | medium | her Acheulean | 1 |
herculean | medium | her clean | 3 |
herculean | medium | her curly on | 1 |
herculean | medium | her killian | 1 |
herculean | medium | heroclone | 1 |
herculean | medium | oculial | 1 |
herculean | tiny.en | Arkalian | 1 |
herculean | tiny.en | a crayon | 1 |
herculean | tiny.en | a curian | 1 |
herculean | tiny.en | a curling | 1 |
herculean | tiny.en | a hate -culline | 1 |
herculean | tiny.en | aculean | 1 |
herculean | tiny.en | aculian | 1 |
herculean | tiny.en | air -culion | 1 |
herculean | tiny.en | ekulean | 1 |
herculean | tiny.en | her cally and | 1 |
herculean | tiny.en | her clean | 2 |
herculean | tiny.en | her curling | 1 |
herculean | tiny.en | her curly and | 1 |
herculean | tiny.en | her gluing | 1 |
herculean | tiny.en | her kiln | 1 |
herculean | tiny.en | herculine | 1 |
herculean | tiny.en | oculi and | 1 |
herculean | tiny.en | our culian | 1 |
herculean | tiny.en | the hakiran , | 1 |
herculean | tiny | a chulian | 1 |
herculean | tiny | a chulian | 1 |
herculean | tiny | a clean | 1 |
herculean | tiny | a crayon | 1 |
herculean | tiny | a heckling | 1 |
herculean | tiny | aircolion | 1 |
herculean | tiny | echelion | 1 |
herculean | tiny | her Qulian | 1 |
herculean | tiny | her chulian | 2 |
herculean | tiny | her chuling | 1 |
herculean | tiny | her clean | 2 |
herculean | tiny | her collier and | 1 |
herculean | tiny | her culean | 1 |
herculean | tiny | her culey and | 1 |
herculean | tiny | her curly and | 2 |
herculean | tiny | her cutely and | 1 |
herculean | tiny | her killer and | 1 |
herculean | tiny | her killing | 1 |
herculean | tiny | herculey and | 1 |
herculean | tiny | herculian | 1 |
herculean | tiny | herkali | 1 |
herculean | tiny | herkilling | 1 |
herculean | tiny | herkilly and | 1 |
herculean | tiny | herkul and | 1 |
herculean | tiny | hurtly enchilers | 1 |
herculean | tiny | our Qliian | 1 |
herculean | tiny | percolian | 1 |
herculean | tiny | to be covered | 1 |
herculean | tiny | were her killion | 1 |
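
Rows in the pipe-separated layout used in Appendices B to D can be tallied per model with a few lines of code. The sketch below is an assumption about how one might post-process such rows, not part of the released script; the function name `tally_by_model` is hypothetical, and the four sample rows are copied from the table above.

```python
from collections import defaultdict


def tally_by_model(rows):
    """Sum the Count column per model for 'ref | model | gap | count |' rows."""
    totals = defaultdict(int)
    for row in rows:
        cells = [cell.strip() for cell in row.strip().strip("|").split("|")]
        if len(cells) != 4 or not cells[3].isdigit():
            continue  # skip the header and separator lines
        _ref, model, _gap, count = cells
        totals[model] += int(count)
    return dict(totals)


rows = [
    "Ref_text | Model | Gap_text | Count |",
    "|---|---|---|---|",
    "herculean | medium_en | her clean | 3 |",
    "herculean | medium | her clean | 3 |",
    "herculean | tiny | her chulian | 2 |",
    "herculean | tiny | her clean | 2 |",
]
print(tally_by_model(rows))
```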
Rights and permissions
Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.
About this article
Cite this article
Ballier, N., Arnold, T., Méli, A. et al. Whisper for L2 speech scoring. Int J Speech Technol 27, 923–934 (2024). https://doi.org/10.1007/s10772-024-10141-5
DOI: https://doi.org/10.1007/s10772-024-10141-5