Nguyen, T., & Tran, H. D. (2024). LingWav2Vec2: Linguistic-augmented wav2vec 2.0 for Vietnamese Mispronunciation Detection. Interspeech 2024, 2355–2359. https://doi.org/10.21437/interspeech.2024-1569
Abstract:
Pronunciation error detection algorithms rely on both acoustic and linguistic information to identify errors. However, these algorithms face challenges due to limited training data, often just a few hours, insufficient for building robust phoneme recognition models. This has led to the adoption of self-supervised learning models like wav2vec 2.0. We propose an innovative approach that combines acoustic and linguistic features by incorporating a Linguistic Encoder with cross-attention, boosting phoneme recognition. This strategy, requiring only an additional 4.3 million parameters, achieved a top-ranking F1-score of 59.68% on the VLSP Vietnamese Mispronunciation Detection 2023 challenge, a 9.72% relative improvement over the previous state-of-the-art. We further analyze the model's performance breakdown by components, offering deeper insights into our LingWav2Vec2 architecture.
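The abstract describes fusing acoustic and linguistic features via cross-attention, with acoustic frames attending to linguistic-encoder outputs. A minimal numpy sketch of that fusion pattern (not the authors' exact architecture; dimensions, the single-head form, and variable names are illustrative assumptions):

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax over the given axis.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(acoustic, linguistic):
    """Single-head cross-attention sketch.

    acoustic:   (T_a, d) frame features, e.g. from wav2vec 2.0 (queries)
    linguistic: (T_l, d) linguistic-encoder outputs (keys and values)
    Returns a (T_a, d) tensor of linguistic context per acoustic frame.
    """
    d = acoustic.shape[-1]
    scores = acoustic @ linguistic.T / np.sqrt(d)   # (T_a, T_l)
    weights = softmax(scores, axis=-1)              # rows sum to 1
    return weights @ linguistic                     # (T_a, d)

# Toy example: 50 acoustic frames attend over 12 linguistic tokens.
rng = np.random.default_rng(0)
acoustic = rng.normal(size=(50, 64))
linguistic = rng.normal(size=(12, 64))
fused = cross_attention(acoustic, linguistic)
print(fused.shape)  # (50, 64)
```

In practice the attended linguistic context would be combined with the acoustic features (e.g. by addition or concatenation) before the phoneme-recognition head; the paper itself details the exact configuration.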
License type:
Attribution 4.0 International (CC BY 4.0)
Funding Info:
This research is supported by core funding from: Aural & Language Intelligence
Grant Reference no.: EC-2023-105