Whisper for L2 speech scoring

International Journal of Speech Technology

Abstract

In this paper, we examine whether confidence scores produced by the C++ re-implementation of Whisper (Radford et al., in: International conference on machine learning, 2023) can be used to score L2 learners of English and classify them. We test whether the language prediction and its probability can be used to classify French learners of English, using a dataset of read speech collected specifically for this purpose and a graded corpus, the ANGLISH corpus (Tortel and Hirst, in: Proceedings of speech prosody 2010, 2010. https://doi.org/10.21437/SpeechProsody.2010-49). We show that the probability scores associated with Whisper subtokens can be used to classify learners into levels with the k-nearest neighbours (k-NN) algorithm. We show the limitations of the language detection probability beyond an initial threshold where the learner's native language (L1) can actually be predicted by the model. We have also used the ISLE corpus (Menzel et al., in: Proceedings of LREC 2000: Language resources and evaluation conference, European Language Resources Association, 2000) to test the prediction of the levels of Italian and German learners of English (Atwell et al., in: ICAME Journal, 27:5–18, 2003). We show that language detection with Whisper’s larger multilingual models can be used to detect the first language of less advanced learners but cannot be used to classify the level of advanced learners. Using a greedy alignment algorithm, we also discuss the confidence score assigned to Whisper output subtokens and how it may be used for speaker scoring, prediction of learner levels, and learner feedback. We show that low confidence scores and alternative transcriptions can be used as potential cues for learner pronunciation errors.
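As a minimal sketch of the classification idea summarised in the abstract, and not the authors' implementation, the Python example below applies a hand-written k-nearest-neighbours vote to two summary features per recording: the mean subtoken probability and the share of subtokens below a confidence threshold. The feature choice, the 0.5 threshold, the function names and every value in the toy data are assumptions made for illustration only.

```python
from collections import Counter
from math import dist
from statistics import mean


def features(token_probs, low=0.5):
    """Summarise one recording as (mean probability, share of low-confidence subtokens)."""
    share_low = sum(p < low for p in token_probs) / len(token_probs)
    return (mean(token_probs), share_low)


def knn_predict(train, query, k=3):
    """Return the majority label among the k training points closest to `query`."""
    nearest = sorted(train, key=lambda item: dist(item[0], query))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]


# Invented toy data: (subtoken probabilities, level label). Not taken from the paper.
recordings = [
    ([0.95, 0.90, 0.97, 0.92], "advanced"),
    ([0.91, 0.96, 0.88, 0.94], "advanced"),
    ([0.93, 0.89, 0.95, 0.90], "advanced"),
    ([0.60, 0.45, 0.70, 0.40], "beginner"),
    ([0.55, 0.35, 0.65, 0.50], "beginner"),
    ([0.48, 0.52, 0.58, 0.30], "beginner"),
]
train = [(features(probs), label) for probs, label in recordings]

# Classify an unseen (also invented) recording from its subtoken probabilities.
new_speaker = features([0.58, 0.42, 0.66, 0.47])
predicted = knn_predict(train, new_speaker, k=3)
print(predicted)
```

With real data, the features would be computed from the subtoken probabilities exported by the recogniser, and k and the threshold would need to be tuned on held-out speakers.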

Data availability

The ISLE corpus is available from ELRA (https://catalogue.elra.info/en-us/repository/browse/ELRA-S0083/). The ANGLISH data is available on the public repository ORTOLANG (https://www.ortolang.fr/market/corpora/sldr000731/v2).

Code availability

The alignment script is available at https://github.com/statsmaths/paper-replication.
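The authors' script is the one in the repository above. Purely as an illustration of what a greedy reference-to-transcript alignment can look like, the Python sketch below walks through both word sequences, and at each mismatch jumps to the nearest point where they agree again, recording the skipped reference words and the transcript words that replaced them. The function names, the tokenisation and the `window` parameter are hypothetical and are not taken from that script.

```python
import re


def tokenize(text):
    """Lowercase the text and split it into word tokens, dropping punctuation."""
    return re.findall(r"[a-z']+", text.lower())


def greedy_align(ref, hyp, window=4):
    """Greedily align two token lists; return (ref_text, gap_text) pairs where they differ."""
    gaps, i, j = [], 0, 0
    while i < len(ref) and j < len(hyp):
        if ref[i] == hyp[j]:
            i, j = i + 1, j + 1
            continue
        # Look for the nearest point where both sequences agree again,
        # trying the smallest total number of skipped tokens first.
        resync = None
        for total in range(1, 2 * window + 1):
            for di in range(max(0, total - window), min(total, window) + 1):
                dj = total - di
                if i + di < len(ref) and j + dj < len(hyp) and ref[i + di] == hyp[j + dj]:
                    resync = (di, dj)
                    break
            if resync is not None:
                break
        if resync is None:
            break  # no anchor within the window; the remainder becomes one final gap
        di, dj = resync
        gaps.append((" ".join(ref[i:i + di]), " ".join(hyp[j:j + dj])))
        i, j = i + di, j + dj
    if i < len(ref) or j < len(hyp):
        gaps.append((" ".join(ref[i:]), " ".join(hyp[j:])))
    return gaps


ref = tokenize("He was, however, conscious of being made uncomfortable by the clammy heat.")
hyp = tokenize("He was however conscious of being made and comfortable by the clammy heat")
gaps = greedy_align(ref, hyp)
print(gaps)  # [('uncomfortable', 'and comfortable')]
```

Because the search never backtracks, a repeated word close to an error can pull the alignment to the wrong anchor; a dynamic-programming alignment such as Levenshtein's avoids this at a higher cost.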

Notes

  1. https://alphacephei.com/vosk/.

  2. https://github.com/ggerganov/whisper.cpp.

  3. We thank Maelle Amand for letting us use the dataset, described in Ballier et al. (2023).

  4. The data was processed from the ELRA distribution.

  5. https://youglish.com/pronounce/Herculean/english?.

References

  • Aksënova, A., Chen, Z., Chiu, C.-C., Esch, D., Golik, P., Han, W., King, L., Ramabhadran, B., Rosenberg, A., Schwartz, S., & Wang, G. (2022). Accented speech recognition: Benchmarking, pre-training, and diverse data. arXiv preprint. arXiv:2205.08014

  • Arora, V., Lahiri, A., & Reetz, H. (2018). Phonological feature-based speech recognition system for pronunciation training in non-native language learning. The Journal of the Acoustical Society of America, 143(1), 98–108.

  • Atwell, E., Howarth, P., & Souter, D. (2003). The ISLE corpus: Italian and German spoken learner’s English. ICAME Journal: International Computer Archive of Modern and Medieval English Journal, 27, 5–18.

  • Baevski, A., Zhou, Y., Mohamed, A., & Auli, M. (2020). wav2vec 2.0: A framework for self-supervised learning of speech representations. Advances in Neural Information Processing Systems, 33, 12449–12460.

  • Ballier, N., & Martin, P. (2015). Speech annotation of learner corpora. In S. Granger, G. Gilquin, & F. Meunier (Eds.), The Cambridge handbook of learner corpus research (pp. 107–134). Cambridge University Press.

  • Ballier, N., Méli, A., Amand, M., & Yunès, J.-B. (2023). Using Whisper LLM for automatic phonetic diagnosis of L2 speech, a case study with French learners of English. In Proceedings of the 6th international conference on natural language and speech processing (ICNLSP 2023) (pp. 282–292).

  • Ballier, N., Namdarzadeh, B., & Zimina-Poirot, M. (2023). Translating dislocations or parentheticals: Investigating the role of prosodic boundaries for spoken language translation from French into English. In Machine translation summit 2023 (Vol. 19, pp. 119–131).

  • Casanova, E., Weber, J., Shulby, C. D., Junior, A. C., Gölge, E., & Ponti, M. A. (2022). YourTTS: Towards zero-shot multi-speaker TTS and zero-shot voice conversion for everyone. In International conference on machine learning (pp. 2709–2720).

  • Chan, M. P. Y., Choe, J., Li, A., Chen, Y., Gao, X., & Holliday, N. (2022). Training and typological bias in ASR performance for world Englishes. In Proceedings of Interspeech, 2022 (pp. 1273–1277). https://doi.org/10.21437/Interspeech.2022-10869

  • Chanethom, V., & Henderson, A. (2022). Alignment in ASR and L1 listeners’ recognition of L2 learner speech: A replication study. In 15th International conference on native and non-native accents of English, Université de Łódź, Łódź, Poland. https://hal.science/hal-03929160

  • Dalby, J., & Kewley-Port, D. (1999). Explicit pronunciation training using automatic speech recognition technology. CALICO Journal, 16(3), 425–445.

  • Gerganov, G. (2023). whisper.cpp: A high-performance inference of OpenAI’s Whisper automatic speech recognition (ASR) model.

  • Inceoglu, S., Chen, W.-H., & Lim, H. (2023). Assessment of L2 intelligibility: Comparing L1 listeners and automatic speech recognition. ReCALL, 35(1), 89–104. https://doi.org/10.1017/S0958344022000192

  • Inceoglu, S., Lim, H., & Chen, W.-H. (2020). ASR for EFL pronunciation practice: Segmental development and learners’ beliefs. The Journal of Asia TEFL, 17(3), 824–840.

  • Islam, E., Park, C., & Hain, T. (2023). Exploring speech representations for proficiency assessment in language learning. In 9th Workshop on speech and language technology in education (SLaTE) proceedings (pp. 151–155). International Speech Communication Association (ISCA).

  • Javed, T., Joshi, S., Nagarajan, V., Sundaresan, S., Nawale, J., Raman, A., Bhogale, K., Kumar, P., & Khapra, M. M. (2023). Svarah: Evaluating English ASR systems on Indian accents. In Proceedings of Interspeech (pp. 5087–5091). https://doi.org/10.21437/Interspeech.2023-2588

  • Jiang, Z., Ren, Y., Ye, Z., Liu, J., Zhang, C., Yang, Q., Ji, S., Huang, R., Wang, C., Yin, X., Ma, Z., & Zhao, Z. (2023). Mega-TTS: Zero-shot text-to-speech at scale with intrinsic inductive bias. arXiv:2306.03509v1

  • Levenshtein, V. (1966). Binary codes capable of correcting deletions, insertions, and reversals. Soviet Physics Doklady, 10, 707–710.

  • Levinstein, B. A., & Herrmann, D. A. (2024). Still no lie detector for language models: Probing empirical and conceptual roadblocks. Philosophical Studies. https://doi.org/10.1007/s11098-023-02094-3

  • Martin, A., Daniel, E., & Ward, N. (1998). The use of the word error rate for evaluating automatic speech recognition systems. Proceedings of the IEEE International conference on acoustics, speech, and signal processing (Vol. 1, pp. 77–80).

  • Menzel, W., Atwell, E., Bonaventura, P., Herron, D., Howarth, P., Morton, R., & Souter, C. (2000). The ISLE corpus of non-native spoken English. In Proceedings of LREC 2000: Language resources and evaluation conference (Vol. 2, pp. 957–964). European Language Resources Association.

  • Povey, D., Ghoshal, A., Boulianne, G., Burget, L., Glembek, O., Goel, N., Hannemann, M., Motlicek, P., Qian, Y., Schwarz, P., Silovský, J., Stemmer, G., & Veselý, K. (2011). The Kaldi speech recognition toolkit. In IEEE 2011 workshop on automatic speech recognition and understanding. IEEE Signal Processing Society. https://infoscience.epfl.ch/record/192584/files/Povey_ASRU2011_2011.pdf

  • R Core Team. (2024). R: A language and environment for statistical computing [computer software manual]. R Core Team.

  • Radford, A., Kim, J. W., Xu, T., Brockman, G., McLeavey, C., & Sutskever, I. (2022). Robust speech recognition via large-scale weak supervision. https://doi.org/10.48550/arXiv.2212.04356

  • Radford, A., Kim, J. W., Xu, T., Brockman, G., McLeavey, C., & Sutskever, I. (2023). Robust speech recognition via large-scale weak supervision. In International conference on machine learning (pp. 28492–28518).

  • Rogers, C. L., Dalby, J. M., & DeVane, G. (1994). Intelligibility training for foreign-accented speech: A preliminary study. The Journal of the Acoustical Society of America, 96(5), 3348. https://doi.org/10.1121/1.410623

  • Tortel, A., & Hirst, D. (2010). Rhythm metrics and the production of English L1/L2. In Proceedings of speech prosody 2010 (p. 959). https://doi.org/10.21437/SpeechProsody.2010-49

  • Watson, C. S., Reed, D. J., Kewley-Port, D., & Maki, D. (1989). The Indiana Speech Training Aid (ISTRA) I: Comparisons between human and computer-based evaluation of speech quality. Journal of Speech, Language, and Hearing Research, 32(2), 245–251.

  • Weinberger, S. (2015). Speech accent archive. George Mason University. http://accent.gmu.edu

  • Witt, S. M., & Young, S. J. (2000). Phone-level pronunciation scoring and assessment for interactive language learning. Speech Communication, 30(2–3), 95–108.

  • Zhang, Z., Zhou, L., Wang, C., Chen, S., Wu, Y., Liu, S., Chen, Z., Liu, Y., Wang, H., Li, J., He, L., Zhao, S., & Wei, F. (2023). Speak foreign languages with your own voice: Cross-lingual neural codec language modeling. arXiv preprint. arXiv:2303.03926

Author information

Contributions

Nicolas Ballier conceptualised the paper and wrote the first draft. Taylor Arnold wrote the alignment script and contributed to the ISLE data processing. Jean-Baptiste Yunès wrote the C++ code to extract the probability from the Whisper predictions. Adrien Méli processed the data for the ablation analysis. Tori Fullerton qualitatively annotated the data for the calibration analysis. All authors of the manuscript have read and agreed to the final manuscript.

Corresponding author

Correspondence to Nicolas Ballier.

Ethics declarations

Conflict of interest

The authors declare that they have no competing interests.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Appendices

Appendix A: Reference text of the reading task

Observing the steady fall of the barometer, Captain MacWhirr thought, “There’s some dirty weather knocking about.” This is precisely what he thought. He had had an experience of moderately dirty weather – the term dirty as applied to the weather implying only moderate discomfort to the seaman. Had he been informed by an indisputable authority that the end of the world was to be finally accomplished by a catastrophic disturbance of the atmosphere, he would have assimilated the information under the simple idea of dirty weather, and no other, because he had no experience of cataclysms, and belief does not necessarily imply comprehension. The wisdom of his country had pronounced by means of an Act of Parliament that before he could be considered as fit to take charge of a ship he should be able to answer certain simple questions on the subject of circular storms such as hurricanes, cyclones, typhoons; and apparently he had answered them, since he was now in command of the Nan-Shan in the China seas during the season of typhoons. But if he had answered he remembered nothing of it. He was, however, conscious of being made uncomfortable by the clammy heat. He came out on the bridge, and found no relief to this oppression. The air seemed thick. He gasped like a fish, and began to believe himself greatly out of sorts.

The Nan-Shan was ploughing a vanishing furrow upon the circle of the sea that had the surface and the shimmer of an undulating piece of gray silk. The sun, pale and without rays, poured down leaden heat in a strangely indecisive light, and the Chinamen were lying prostrate about the decks. [...] Captain MacWhirr noticed two of them especially, stretched out on their backs below the bridge. As soon as they had closed their eyes they seemed dead. Three others, however, were quarrelling barbarously away forward; and one big fellow, half naked, with herculean shoulders, was hanging limply over a winch; another, sitting on the deck, his knees up and his head drooping sideways in a girlish attitude, was plaiting his pigtail with infinite languor depicted in his whole person and in the very movement of his fingers. The smoke struggled with difficulty out of the funnel, and instead of streaming away spread itself out like an infernal sort of cloud, smelling of sulphur and raining soot all over the decks.

Appendix B: Examples of Whisper mistranscriptions of uncomfortable

Ref_text | Model | Gap_text | Count
uncomfortable | medium_en | and comfortable | 1
uncomfortable | medium_en | and comforted | 1
uncomfortable | medium_en | comfortable | 1
uncomfortable | medium_en | incompatible | 1
uncomfortable | medium | “uncomfortable” | 1
uncomfortable | medium | comfortable | 1
uncomfortable | tiny.en | a comfortable | 1
uncomfortable | tiny.en | and comfortable | 4
uncomfortable | tiny.en | in campatible | 1
uncomfortable | tiny.en | meant and capable | 1
uncomfortable | tiny.en | of incontivable | 1
uncomfortable | tiny.en | uncountable | 1
uncomfortable | tiny.en | ungothable | 1
uncomfortable | tiny | and comfortable | 2
uncomfortable | tiny | comfortable | 1
uncomfortable | tiny | incomparable | 1
uncomfortable | tiny | incompatible | 1
uncomfortable | tiny | main and comfortable | 1
uncomfortable | tiny | meant and comfortable | 1
uncomfortable | tiny | uncorruptable | 1
uncomfortable | tiny | ungovernable | 1

Appendix C: Examples of Whisper mistranscriptions of leaden

Ref_text | Model | Gap_text | Count
leaden | medium_en | , leading it | 1
leaden | medium_en | laden | 1
leaden | medium_en | lead and | 2
leaden | medium_en | leading | 4
leaden | medium_en | leading it | 1
leaden | medium_en | linen and | 1
leaden | medium_en | Unleaded | 1
leaden | medium | , leading | 2
leaden | medium | , leading heads | 1
leaden | medium | , leading it | 1
leaden | medium | , let on | 1
leaden | medium | , letting it | 1
leaden | medium | -laden heats | 1
leaden | medium | a laden | 1
leaden | medium | laden | 4
leaden | medium | laden , ate | 1
leaden | medium | laden heats | 2
leaden | medium | laden hits | 1
leaden | medium | lead and | 1
leaden | medium | lid and | 3
leaden | medium | lid and hip | 1
leaden | medium | lid and hit | 1
leaden | tiny.en | , leading | 1
leaden | tiny.en | -leading | 1
leaden | tiny.en | a little | 1
leaden | tiny.en | laddened | 1
leaden | tiny.en | lead and | 15
leaden | tiny.en | lead and hits | 1
leaden | tiny.en | lead in | 3
leaden | tiny.en | leading | 2
leaden | tiny.en | leading heads | 1
leaden | tiny.en | leading it | 1
leaden | tiny.en | lid and hit | 1
leaden | tiny.en | little | 1
leaden | tiny | , leading | 2
leaden | tiny | , leading heads | 1
leaden | tiny | , leading hits | 1
leaden | tiny | , leading it | 1
leaden | tiny | , led | 1
leaden | tiny | , lit | 1
leaden | tiny | foredown led | 1
leaden | tiny | laden | 3
leaden | tiny | laden hits | 1
leaden | tiny | lead and | 11
leaden | tiny | lead in | 2
leaden | tiny | leading hits | 1
leaden | tiny | linen , hit | 1
leaden | tiny | on lead and | 1
leaden | tiny | powered and ledon | 1
leaden | tiny | the hidden hits | 1
leaden | tiny | the lead in | 1
leaden | tiny | the unleading hits | 1

Appendix D: Examples of Whisper mistranscriptions of Herculean

Ref_text | Model | Gap_text | Count
herculean | medium_en | a Korean | 1
herculean | medium_en | a cholera | 1
herculean | medium_en | a kulean | 1
herculean | medium_en | achilles | 1
herculean | medium_en | arkadian | 1
herculean | medium_en | her Korean | 1
herculean | medium_en | her chilean | 1
herculean | medium_en | her clean | 3
herculean | medium_en | her culean | 1
herculean | medium_en | her curling | 1
herculean | medium_en | hickory | 1
herculean | medium | Herculean | 1
herculean | medium | a Cullian | 1
herculean | medium | aculure on | 1
herculean | medium | arcane | 1
herculean | medium | arculean | 1
herculean | medium | curly | 1
herculean | medium | her Acheulean | 1
herculean | medium | her clean | 3
herculean | medium | her curly on | 1
herculean | medium | her killian | 1
herculean | medium | heroclone | 1
herculean | medium | oculial | 1
herculean | tiny.en | Arkalian | 1
herculean | tiny.en | a crayon | 1
herculean | tiny.en | a curian | 1
herculean | tiny.en | a curling | 1
herculean | tiny.en | a hate -culline | 1
herculean | tiny.en | aculean | 1
herculean | tiny.en | aculian | 1
herculean | tiny.en | air -culion | 1
herculean | tiny.en | ekulean | 1
herculean | tiny.en | her cally and | 1
herculean | tiny.en | her clean | 2
herculean | tiny.en | her curling | 1
herculean | tiny.en | her curly and | 1
herculean | tiny.en | her gluing | 1
herculean | tiny.en | her kiln | 1
herculean | tiny.en | herculine | 1
herculean | tiny.en | oculi and | 1
herculean | tiny.en | our culian | 1
herculean | tiny.en | the hakiran , | 1
herculean | tiny | a chulian | 1
herculean | tiny | a chulian | 1
herculean | tiny | a clean | 1
herculean | tiny | a crayon | 1
herculean | tiny | a heckling | 1
herculean | tiny | aircolion | 1
herculean | tiny | echelion | 1
herculean | tiny | her Qulian | 1
herculean | tiny | her chulian | 2
herculean | tiny | her chuling | 1
herculean | tiny | her clean | 2
herculean | tiny | her collier and | 1
herculean | tiny | her culean | 1
herculean | tiny | her culey and | 1
herculean | tiny | her curly and | 2
herculean | tiny | her cutely and | 1
herculean | tiny | her killer and | 1
herculean | tiny | her killing | 1
herculean | tiny | herculey and | 1
herculean | tiny | herculian | 1
herculean | tiny | herkali | 1
herculean | tiny | herkilling | 1
herculean | tiny | herkilly and | 1
herculean | tiny | herkul and | 1
herculean | tiny | hurtly enchilers | 1
herculean | tiny | our Qliian | 1
herculean | tiny | percolian | 1
herculean | tiny | to be covered | 1
herculean | tiny | were her killion | 1

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.

About this article

Cite this article

Ballier, N., Arnold, T., Méli, A. et al. Whisper for L2 speech scoring. Int J Speech Technol 27, 923–934 (2024). https://doi.org/10.1007/s10772-024-10141-5

  • DOI: https://doi.org/10.1007/s10772-024-10141-5
