Can we train ASR systems on Code-switch without real code-switch data? Case study for Singapore's languages

Title:
Can we train ASR systems on Code-switch without real code-switch data? Case study for Singapore's languages
Journal Title:
Interspeech 2025
Publication Date:
22 October 2025
Citation:
Nguyen, T., & Tran, H. D. (2025). Can we train ASR systems on Code-switch without real code-switch data? Case study for Singapore’s languages. Interspeech 2025, 753–757. https://doi.org/10.21437/interspeech.2025-654
Abstract:
Code-switching (CS), common in multilingual settings, presents challenges for ASR because transcribed CS data is scarce and costly to obtain owing to its linguistic complexity. This study investigates building CS-ASR with synthetic CS data. We propose a phrase-level mixing method that generates synthetic CS data mimicking natural code-switching patterns, and use monolingual data augmented with this synthetic phrase-mixed CS data to fine-tune large pretrained ASR models (Whisper, MMS, SeamlessM4T). The paper focuses on three under-resourced Southeast Asian language pairs: Malay-English (BM-EN), Mandarin-Malay (ZH-BM), and Tamil-English (TA-EN), and establishes a new comprehensive CS-ASR benchmark for evaluating leading ASR models. Experimental results show that the proposed training strategy improves ASR performance on both monolingual and CS test sets, with BM-EN showing the highest gains, followed by TA-EN and ZH-BM. This finding offers a cost-effective approach to CS-ASR development that benefits both research and industry.
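The abstract's phrase-level mixing idea can be illustrated with a minimal sketch. This is an assumption-laden illustration, not the paper's actual implementation: it takes a monolingual sentence and a hypothetical phrase table (a dict mapping source-language phrases to target-language phrases) and stochastically swaps matched phrases to produce synthetic code-switched text. The real method's phrase selection and alignment may differ.

```python
import random

def phrase_mix(sentence, phrase_table, mix_prob=0.3, seed=0):
    """Sketch of phrase-level mixing for synthetic code-switched text.

    phrase_table is a hypothetical dict of source-language phrases to
    target-language phrases; mix_prob controls how often a matched
    phrase is actually switched.
    """
    rng = random.Random(seed)
    tokens = sentence.split()
    out = []
    i = 0
    while i < len(tokens):
        replaced = False
        # Greedily try the longest phrase match first (up to 3 tokens).
        for span in range(min(3, len(tokens) - i), 0, -1):
            phrase = " ".join(tokens[i:i + span])
            if phrase in phrase_table and rng.random() < mix_prob:
                out.append(phrase_table[phrase])
                i += span
                replaced = True
                break
        if not replaced:
            out.append(tokens[i])
            i += 1
    return " ".join(out)

# Illustrative Malay-English example with mix_prob=1.0 (always switch):
# phrase_mix("saya pergi ke pasar untuk makan",
#            {"pergi ke": "go to", "makan": "eat"}, mix_prob=1.0)
# → "saya go to pasar untuk eat"
```

In practice, such synthetic CS sentences would then be paired with corresponding synthesized or concatenated audio before fine-tuning an ASR model; that pipeline is beyond this sketch.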
License type:
Publisher Copyright
Funding Info:
This research is supported by core funding from: I2R - Aural & Language Intelligence
Grant Reference no. : EC-2023-105
ISSN:
2958-1796
Files uploaded:

cs-asr-interspeech25-final.pdf (250.37 KB, PDF)