Proficiency in the English language holds significant importance in the modern world, impacting higher education, workforce integration, social mobility, and global engagement. Throughout Latin America, governments have implemented policies and programs to expand language-learning opportunities. However, disparities in the allocation of class time persist due to a shortage of adequately trained teachers. This inequality especially affects students from low-income backgrounds and rural areas who lack the means to supplement public education with private lessons and complementary material. Consequently, a substantial portion of these students graduate without a fluent command of English or even basic conversational skills.14
The field of Computer-Assisted Language Learning (CALL) has made notable progress in improving language education, offering remote-learning solutions, alleviating teacher workloads, and providing learners with opportunities for stress-free, feedback-rich practice. While CALL systems are effective for learning grammar and vocabulary, they are still suboptimal for learning pronunciation. This is mostly due to their relatively poor performance in detecting errors over short speech segments. Historically, these systems have focused on achieving native-sounding pronunciation rather than prioritizing intelligibility (that is, reducing mispronunciations that could lead to communication breakdowns), which is a more pragmatic and effective objective.12
The research described in this article was conducted at the Institute of Computer Sciences at the Faculty of Exact and Natural Sciences, University of Buenos Aires, Argentina. Our primary goal is to use innovative technology to develop a free mobile and Web application tailored to the needs of Argentinian children and adults learning English pronunciation. We place special emphasis on evaluating pronunciation at the segmental level (phones or syllables), as it has been demonstrated to be the most effective approach for novice learners compared to evaluating at the phrase or paragraph level.12 Furthermore, our focus is on segmental-level errors with the most significant impact on intelligibility rates.
To address this problem, we designed and collected EpaDB, a database of non-native English speech by Argentinian speakers intended for the development of phone-level pronunciation scoring systems.18 Then, we explored two strategies for dealing with the extreme data scarcity of this task. In Sancinetti et al.,15 we used a simple transfer learning-based approach and showed that large gains can be achieved with this procedure compared to standard methods. In Vidal et al.,19 we explored the use of self-supervised learning (SSL) speech models and created an open source repository for comparing the performance of several SSL representations. In this article, we discuss our contributions to the area of pronunciation scoring.
Background
Pronunciation scoring is the task of generating a score for the quality of pronunciation in the speech uttered by a language learner. The systems designed for this task are usually called computer-aided pronunciation training (CAPT) systems. Two major approaches for designing CAPT systems can be found in the literature. The first approach treats pronunciation scoring as a phone recognition task, using non-native speech data during training.10,11 These systems identify pronunciation errors by comparing the phonetic transcription of a student’s speech to a native speaker’s target sequence. Dynamic programming algorithms are used to align these sequences for comparison, and feedback is provided by highlighting mispronounced phone variants.
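To make this comparison step concrete, the sketch below (in Python, with illustrative ARPAbet-style phone labels) aligns a recognized phone sequence against the target sequence using a standard edit-distance dynamic program and reports matches, substitutions, insertions, and deletions as candidate mispronunciations. It is a minimal illustration of the alignment idea, not the implementation used in the cited systems.

```python
# Minimal sketch: align a recognized phone sequence against the target sequence
# with an edit-distance dynamic program and report the differences.
def align_phones(target, recognized):
    n, m = len(target), len(recognized)
    # cost[i][j] = edit distance between target[:i] and recognized[:j]
    cost = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        cost[i][0] = i
    for j in range(1, m + 1):
        cost[0][j] = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = cost[i - 1][j - 1] + (target[i - 1] != recognized[j - 1])
            cost[i][j] = min(sub, cost[i - 1][j] + 1, cost[i][j - 1] + 1)
    # Backtrace to recover the alignment operations.
    ops, i, j = [], n, m
    while i > 0 or j > 0:
        if (i > 0 and j > 0 and
                cost[i][j] == cost[i - 1][j - 1] + (target[i - 1] != recognized[j - 1])):
            ops.append(("match" if target[i - 1] == recognized[j - 1] else "substitution",
                        target[i - 1], recognized[j - 1]))
            i, j = i - 1, j - 1
        elif i > 0 and cost[i][j] == cost[i - 1][j] + 1:
            ops.append(("deletion", target[i - 1], None))   # phone the learner omitted
            i -= 1
        else:
            ops.append(("insertion", None, recognized[j - 1]))  # extra phone inserted
            j -= 1
    return list(reversed(ops))

# Example: a common error for Spanish speakers, an extra vowel before
# word-initial /s/ ("school" pronounced like "eschool").
print(align_phones(["S", "K", "UW", "L"], ["EH", "S", "K", "UW", "L"]))
```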
The second approach frames the problem as the detection of mispronunciations, generating scores that are later compared to a threshold to make final decisions. This approach further divides into two groups. The first group relies solely on native data during training, frequently employing automatic speech recognition (ASR) technology to produce scores.4,8,20 The second group incorporates non-native data, training the system to distinguish correctly pronounced segments from incorrectly pronounced ones, often using a variety of input features and classifiers and frequently employing an ASR model as the central component.5,16 Approaches that use non-native data for model training generally outperform those that use only native data. Unfortunately, annotated non-native datasets are difficult to collect and are often limited in size.
In recent years, deep neural networks (DNNs) have gained prominence in various fields, including ASR.6 This progress has prompted investigations into applying DNNs for pronunciation assessment, yielding improvements over traditional methods. For systems relying on L1-specific non-native data, where data scarcity is a significant challenge, solutions often involve transfer-learning techniques, where models initially trained for a related source task with abundant available data are adapted to the task of interest. Examples of the application of transfer-learning approaches to pronunciation assessment include the fine-tuning of DNN models originally trained for ASR2,9,15 and the use of models trained with self-supervised learning techniques.21
Pronunciation assessment databases. The development of pronunciation-assessment systems requires the use of databases containing non-native speech annotated at a level (phrase, word, or phone) that matches that of the desired predictions. These databases are used to evaluate system performance and, in some cases, to train prediction models. To our knowledge, only three publicly available databases offer phone-level annotations of pronunciation quality: L2-Arctic,23 SpeechOcean762,22 and EpaDB.18 L2-Arctic features 3,600 annotated recordings by speakers with diverse native languages, including Hindi, Korean, Mandarin, Spanish, Arabic, and Vietnamese. Of the 24 speakers in the dataset, four are native Spanish speakers. SpeechOcean762 contains 5,000 English recordings by Chinese native speakers. EpaDB, our database, contains 3,200 short English utterances recorded by 50 native Spanish speakers from Argentina, annotated at the phone level. Each speaker recorded 64 short English phrases, carefully designed to include at least one instance of every challenging phone for the target population. It also contains manually assigned phoneme boundaries for each phone, along with an overall score per phrase reflecting perceived non-nativeness. EpaDB’s speech data was recorded on participants’ personal computers using an online application, mimicking the intended usage scenario in which users practice their pronunciation at home with their own computers. Two linguists who are native Spanish speakers conducted the annotations.
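As a rough illustration of the information attached to each annotated phrase, the sketch below shows one plausible way to represent an EpaDB-style entry in code; the class and field names are hypothetical and do not reflect the database’s actual file format.

```python
# Hypothetical representation of a phone-level pronunciation annotation;
# field names are illustrative only, not the actual EpaDB schema.
from dataclasses import dataclass
from typing import List

@dataclass
class PhoneAnnotation:
    target_phone: str       # phone the speaker was expected to produce, e.g., "IH"
    pronounced_phone: str   # label assigned by the annotator, e.g., "IY"
    start_time: float       # manually assigned phone boundary, in seconds
    end_time: float
    correct: bool           # whether the realization counts as correctly pronounced

@dataclass
class PhraseAnnotation:
    speaker_id: str
    text: str                        # the short English phrase that was read
    phones: List[PhoneAnnotation]    # one entry per target phone
    nativeness_score: float          # overall per-phrase score of perceived non-nativeness
```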
In Vidal et al.,17 we validated EpaDB by comparing it to the Spanish-speaker subset of the L2-Arctic corpus. We conducted an analysis of the most common substitution errors in both databases, revealing a substantial overlap with each other and with the expected problematic phones for Spanish learners of English, as documented in the literature. We also compared results from both databases using a state-of-the-art Goodness of Pronunciation (GOP) system trained on a large dataset of native English speech. The GOP method calculates scores for each phone in a phrase as posterior probabilities of the target phones (that is, the phones the student should pronounce), computed using the acoustic model from an ASR system trained solely on native data.8 Notably, for most phones, the results were similar across the two datasets. This is particularly interesting given that EpaDB recordings, collected online by individuals in their homes using their own microphones, were notably noisier than those in L2-Arctic and often contained background noise.
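For reference, a common formulation of the GOP score for a target phone p aligned to frames t_s through t_e, with acoustic observations o_t, averages the frame-level log posterior of that phone. This is a minimal sketch of the standard formulation; the exact variant used in a given system may differ.

```latex
\mathrm{GOP}(p) \;=\; \frac{1}{t_e - t_s + 1} \sum_{t = t_s}^{t_e} \log P\!\left(p \mid \mathbf{o}_t\right)
```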
Pronunciation assessment systems. In our work, we target systems of the second family described in the introduction, designed to generate a pronunciation score for each evaluated segment. These systems allow the adjustment of the operating point via a decision threshold, ensuring that false-correction rates remain within an acceptable range to avoid frustrating or confusing users. In contrast, systems from the first family, which compare phonetic transcriptions to reference transcriptions, directly return the predicted errors and lack this flexibility.
Our proposed approach relies on the use of models trained for a speech-processing task for which a large amount of data is available. These models are leveraged to extract features that are then input to a simple linear model learned for our task of interest or the closely related task of phone recognition. In our work,19 we evaluated the use of various models trained in a self-supervised manner, meaning that no annotations are used for training. Instead, pretext tasks are used, like predicting different representations of the data.7 We also evaluated the use of a model trained for ASR using transcribed English audio from native speakers, called TDNN-F.13 Here, we present a subset of the results in Vidal et al.19 using one of the self-supervised models, WavLM+,1 which was the best-performing self-supervised model in our experiments, and using TDNN-F.
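As an illustration of the feature-extraction step, the sketch below loads a pre-trained WavLM checkpoint from the Hugging Face hub and returns one representation vector per frame. The choice of the wavlm-base-plus checkpoint and the surrounding code are assumptions for illustration, not a description of our exact setup.

```python
# Sketch: frame-level features from a pre-trained self-supervised model.
# The "microsoft/wavlm-base-plus" checkpoint is assumed here for illustration.
import torch
from transformers import AutoFeatureExtractor, WavLMModel

extractor = AutoFeatureExtractor.from_pretrained("microsoft/wavlm-base-plus")
model = WavLMModel.from_pretrained("microsoft/wavlm-base-plus")
model.eval()

def extract_features(waveform, sampling_rate=16000):
    """Return a (frames x features) matrix, roughly one vector every 20 ms."""
    inputs = extractor(waveform, sampling_rate=sampling_rate, return_tensors="pt")
    with torch.no_grad():
        outputs = model(**inputs)
    return outputs.last_hidden_state.squeeze(0)  # shape: (frames, hidden_size)
```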
The accompanying figure shows a schematic of the proposed approach. A pre-trained upstream model (WavLM+ or TDNN-F) is used to extract a representation of the audio consisting of one vector every 20ms, resulting in a matrix where the rows are features and the columns are time steps, called frames. These representations are fed to a linear layer that produces, for each frame, one score per target phone in the English language. Next, automatic time alignments are used to select, at each frame, the score corresponding to the target phone aligned to that frame. Finally, during inference, phone-level scores are computed by averaging these frame scores over all frames aligned to each target phone. For the final classification into correct or incorrect pronunciations, phone-level scores are compared to a threshold tuned on the development data for each phone.
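The sketch below illustrates this scoring step under simple assumptions: features is the frame-by-feature matrix from the upstream model, alignment lists each target phone with its frame range, and the linear layer, dictionary names, and alignment format are hypothetical.

```python
import torch

def phone_level_scores(features, alignment, linear_layer, phone_to_index):
    """features: (frames, feat_dim) matrix from the upstream model.
    alignment: list of (target_phone, (start_frame, end_frame)) entries."""
    frame_scores = linear_layer(features)              # (frames, num_phones)
    results = []
    for phone, (start, end) in alignment:
        column = phone_to_index[phone]                 # output node of the target phone
        score = frame_scores[start:end, column].mean().item()
        results.append((phone, score))                 # later compared to a per-phone threshold
    return results
```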
Two approaches are used for training the linear layer. In the first approach, which we call mispronunciation detection (MD), the linear layer is trained with the cross-entropy loss on a non-native dataset that includes pronunciation-quality labels for each phone. Each output node corresponds to the probability of correct pronunciation for one target phone. In the second approach, which we call phone recognition (PR), the linear layer is trained only on native data to recognize the target phones. Then, during inference, scores are computed by averaging the posteriors for each target phone given by the alignments on the non-native data. This approach coincides with the standard GOP algorithm typically used as a baseline in much of the pronunciation-scoring literature.
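One plausible reading of the MD objective, assuming each frame carries the binary label of the target phone it is aligned to, is a per-frame binary cross-entropy on the output node of that phone. The sketch below is illustrative, with hypothetical shapes and names, and omits batching and other training details.

```python
import torch

num_phones, feat_dim = 40, 768                     # illustrative sizes
linear = torch.nn.Linear(feat_dim, num_phones)     # the trainable downstream layer
loss_fn = torch.nn.BCEWithLogitsLoss()
optimizer = torch.optim.Adam(linear.parameters(), lr=1e-3)

def md_training_step(features, target_phone_ids, labels):
    """features: (frames, feat_dim); target_phone_ids: (frames,) index of the
    aligned target phone; labels: (frames,) 1.0 if correctly pronounced, else 0.0."""
    logits = linear(features)                                      # (frames, num_phones)
    picked = logits[torch.arange(len(labels)), target_phone_ids]   # node of the target phone
    loss = loss_fn(picked, labels)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```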
Evaluation of the Proposed Approach
For both the MD and PR approaches, our systems assign a score to each target phone at each frame. Higher values correspond to incorrectly pronounced phones. Categorical decisions are made by comparing these scores with a threshold. Each possible threshold results in a false positive rate (FPR, the rate of false corrections) and a false negative rate (FNR, the rate of missed corrections). We report the area under the ROC curve (AUC), a standard metric that integrates performance over all possible operating points given by different thresholds. We also report a second metric that we consider more appropriate for pronunciation assessment, where, as discussed earlier, the false-correction rate should be kept low to avoid frustrating the student with unnecessary corrections. We define Cost as 2 FPR + FNR, penalizing FPR more heavily than FNR and thus prioritizing a low FPR over a low FNR.
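As a small worked example of these metrics, the sketch below computes AUC with scikit-learn and the Cost at a given threshold, assuming label 1 marks an incorrectly pronounced phone and higher scores indicate likely mispronunciations, following the convention described above; the function names are illustrative.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

def cost_at_threshold(labels, scores, threshold):
    """Cost = 2*FPR + FNR at one threshold; label 1 = incorrectly pronounced."""
    labels = np.asarray(labels, dtype=bool)
    flagged = np.asarray(scores) >= threshold   # phones the system would correct
    fpr = np.mean(flagged[~labels])             # false corrections
    fnr = np.mean(~flagged[labels])             # missed corrections
    return 2 * fpr + fnr

def evaluate(labels, scores, threshold):
    return {"AUC": roc_auc_score(labels, scores),
            "Cost": cost_at_threshold(labels, scores, threshold)}
```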
Note that AUC ignores the problem of threshold selection, which is equivalent to assuming that the threshold will always be set optimally, something rarely possible in practice. Cost, on the other hand, is computed at a specific threshold, so it reflects the threshold-selection process and can be targeted at the operating point of interest for the application. For these reasons, we believe Cost is a more appropriate metric for measuring performance on this task.
The table shows the average AUC and Cost for the PR and MD approaches, for two upstream models, TDNN-F and WavLM+, on the EpaDB and L2-Arctic datasets. Despite some small discrepancies between the two metrics, the overall trend indicates that the systems that use non-native data for training (MD rows) perform better than those that only use native data (PR rows), and that the features extracted from the WavLM+ model are better suited to this task than those extracted from the TDNN-F model. Overall, the relative gains in Cost with respect to the baseline approach (PR + TDNN-F) are 21% for EpaDB and 10% for L2-Arctic. The smaller relative gains observed on L2-Arctic may be because this dataset is composed of a heterogeneous set of accents, making the MD training process less advantageous than for EpaDB, where the system is able to adapt to the specific Argentinian accent.
| Approach | Upstream model | EpaDB Av AUC | EpaDB Av Cost | L2-Arctic Av AUC | L2-Arctic Av Cost |
| --- | --- | --- | --- | --- | --- |
| PR | TDNN-F | 0.71 | 0.85 | 0.71 | 0.83 |
| PR | WavLM+ | 0.67 | 0.82 | 0.67 | 0.84 |
| MD | TDNN-F | 0.80 | 0.73 | 0.76 | 0.79 |
| MD | WavLM+ | 0.83 | 0.67 | 0.83 | 0.74 |
Conclusion
CAPT systems can help learners of a second language improve their pronunciation skills. These systems are particularly useful when they provide feedback at the phone or syllable level, when they are adapted to the target population of speakers, and when they are carefully designed to avoid high false-correction rates and to focus on mistakes that threaten intelligibility. The goal of our work is to develop a pronunciation scoring tool, following these guidelines, for Argentinian children and adult learners of English.
One of the main difficulties in the development of CAPT systems is the limited availability of data. Data from the target population of interest is needed for evaluation, since system performance depends heavily on the type of errors made by learners. Further, systems adapted to the population of interest tend to perform better than those exposed only to native data or to non-native speakers with other first languages. For this reason, the first stage of our project was the collection and annotation of EpaDB, which is, to our knowledge, the largest publicly available dataset of English speech by Spanish speakers annotated with pronunciation-scoring labels.
The next stage of our project was to develop CAPT systems, with a focus on methods that could efficiently take advantage of relatively small training datasets. We explored the use of models pre-trained on large speech datasets to generate speech representations, using them as input to a downstream model trained on our target task and data. We compared two training methods: one trained on non-native data for mispronunciation detection and one trained on native data for phone recognition. As anticipated, the MD approach outperformed the baseline PR approach. However, the PR approach is still relevant when training data specific to the target population is unavailable.
In the future, we plan to continue enriching EpaDB by including children’s speech and a larger number of adults. The most complex task in dataset creation is annotating each sample with phone-level pronunciation labels. Hence, we will explore the use of unsupervised or lightly supervised approaches for leveraging non-native data without pronunciation annotations or with phrase-level annotations. Finally, we will develop a gamified application for pronunciation learning that will be freely available and accessible from computers or mobile phones.
Acknowledgments
This work was funded by the Argentinian Ministry of Science, Technology, and Productive Innovation, by CONICET, and by a Google Research Award for Latin America.