LingWav2Vec2: Linguistic-augmented wav2vec 2.0 for Vietnamese Mispronunciation Detection

Title:
LingWav2Vec2: Linguistic-augmented wav2vec 2.0 for Vietnamese Mispronunciation Detection
Journal Title:
Interspeech 2024
Publication Date:
01 September 2024
Citation:
Nguyen, T., & Tran, H. D. (2024). LingWav2Vec2: Linguistic-augmented wav2vec 2.0 for Vietnamese Mispronunciation Detection. Interspeech 2024, 2355–2359. https://doi.org/10.21437/interspeech.2024-1569
Abstract:
Pronunciation error detection algorithms rely on both acoustic and linguistic information to identify errors. However, these algorithms face challenges due to limited training data, often just a few hours, which is insufficient for building robust phoneme recognition models. This has led to the adoption of self-supervised learning models like wav2vec 2.0. We propose an innovative approach that combines acoustic and linguistic features by incorporating a Linguistic Encoder with cross-attention, boosting phoneme recognition. This strategy, requiring only an additional 4.3 million parameters, achieved a top-ranking F1-score of 59.68% on the VLSP Vietnamese Mispronunciation Detection 2023 challenge, a 9.72% relative improvement over the previous state-of-the-art. We further analyze the model’s performance breakdown by components, offering deeper insights into our LingWav2Vec2 architecture.
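The abstract describes fusing acoustic features (from wav2vec 2.0) with the outputs of a Linguistic Encoder via cross-attention. As an illustration only, a minimal NumPy sketch of scaled dot-product cross-attention; the shapes, dimensions, and the choice of acoustic frames as queries over linguistic tokens are assumptions, not the authors' implementation:

```python
import numpy as np

def softmax(x, axis=-1):
    # numerically stable softmax
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(queries, keys, values):
    # scaled dot-product attention: one stream (queries) attends to another (keys/values)
    d = queries.shape[-1]
    scores = queries @ keys.T / np.sqrt(d)       # (n_q, n_kv) similarity scores
    return softmax(scores, axis=-1) @ values     # (n_q, d) fused representation

# Hypothetical shapes: 50 acoustic frames and 12 linguistic tokens, both dim 64.
rng = np.random.default_rng(0)
acoustic = rng.standard_normal((50, 64))    # stand-in for wav2vec 2.0 frame features
linguistic = rng.standard_normal((12, 64))  # stand-in for Linguistic Encoder outputs

# Each acoustic frame gathers linguistic context, keeping the acoustic time axis.
fused = cross_attention(acoustic, linguistic, linguistic)
print(fused.shape)  # (50, 64)
```

The fused sequence retains one vector per acoustic frame, so it can feed a frame-level phoneme recognition head; the extra parameter cost in a learned version would come from the query/key/value projection matrices, which this sketch omits.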
License type:
Attribution 4.0 International (CC BY 4.0)
Funding Info:
This research is supported by core funding from: Aural & Language Intelligence
Grant Reference no. : EC-2023-105
ISSN:
2308-457X
Files uploaded:
ling-wav2vec2-final-1.pdf (363.17 KB, PDF)