B. Sisman, M. Zhang, M. Dong and H. Li, "On the Study of Generative Adversarial Networks for Cross-Lingual Voice Conversion," 2019 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU), SG, Singapore, 2019, pp. 144-151, doi: 10.1109/ASRU46091.2019.9003939.
Cross-lingual voice conversion (VC) aims to convert the source speaker's voice to sound like that of the target speaker, when the source and target speakers speak different languages. In this paper, we propose to use Generative Adversarial Networks (GANs) for cross-lingual voice-conversion. We further the studies on Variational Autoencoding Wasserstein GAN (VAW-GAN) and cycle-consistent adversarial network (CycleGAN), that are known to be effective for mono-lingual voice conversion. As cross-lingual voice conversion needs to converts the voice across different phonetic system, it is more challenging than mono-lingual voice conversion. By using VAW-GAN and CycleGAN, we successfully convert the speaker identity while carrying over the source speaker's linguistic content. The proposed idea is unique in the sense that it neither relies on bilingual data and their alignment, nor any external process, such as ASR. Moreover, it works with limited amount of training data of any two languages. To our best knowledge, this is the first comprehensive study of Generative Adversarial Networks in cross-lingual voice conversion. In the experiments, we achieve high-quality converted voice, that performs equally well or better than mono-lingual voice conversion.
This research is supported by Programmatic grant from the Singapore Governments Research, Innovation and Enterprise 2020 plan
(Advanced Manufacturing and Engineering domain).
This research / project is supported by the National Research Foundation, Singapore - AI Singapore Programme
Grant Reference no. : AISG-100E-2018-006