Deep Bidirectional LSTM Modeling of Timbre and Prosody for Emotional Voice Conversion

Title:
Deep Bidirectional LSTM Modeling of Timbre and Prosody for Emotional Voice Conversion
Other Titles:
Interspeech 2016
Keywords:
Publication Date:
07 May 2021
Citation:
Ming, H., Huang, D., Xie, L., Wu, J., Dong, M., Li, H. (2016) Deep Bidirectional LSTM Modeling of Timbre and Prosody for Emotional Voice Conversion. Proc. Interspeech 2016, 2453-2457.
Abstract:
Emotional voice conversion aims to convert speech from one emotional state to another. This paper proposes modeling timbre and prosody features with a deep bidirectional long short-term memory (DBLSTM) network for emotional voice conversion. Continuous wavelet transform (CWT) representations of the fundamental frequency (F0) and the energy contour are used for prosody modeling. Specifically, we use the CWT to decompose F0 into a five-scale representation and the energy contour into a ten-scale representation, where each feature scale corresponds to a temporal scale. Spectrum and prosody (F0 and energy contour) features are converted simultaneously by a sequence-to-sequence method using a DBLSTM model, which captures both frame-wise and long-range relationships between the source and target voices. Objective and subjective evaluations of the converted speech confirm the effectiveness of the proposed method.
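The multi-scale prosody decomposition described in the abstract can be illustrated with a minimal sketch. The abstract does not state the mother wavelet, the scale spacing, or the normalization, so the Mexican hat wavelet, the dyadic scales, the mean-centering step, and the names `mexican_hat` and `cwt_decompose` below are all assumptions made for illustration, not the authors' implementation.

```python
import math


def mexican_hat(t):
    """Mexican hat (Ricker) mother wavelet. Assumed choice, not from the abstract."""
    norm = 2.0 / (math.sqrt(3.0) * math.pi ** 0.25)
    return norm * (1.0 - t * t) * math.exp(-0.5 * t * t)


def cwt_decompose(contour, num_scales, base_scale=1.0):
    """Decompose a 1-D contour into num_scales rows of wavelet coefficients.

    Row i uses the dyadic scale base_scale * 2**(i + 1), measured in frames
    (an assumed spacing). Each row has the same length as the input contour.
    """
    n = len(contour)
    mean = sum(contour) / n
    centered = [x - mean for x in contour]  # assumed normalization step
    rows = []
    for i in range(num_scales):
        scale = base_scale * 2.0 ** (i + 1)
        row = []
        for b in range(n):
            # Direct (unoptimized) correlation with the scaled, shifted wavelet.
            acc = 0.0
            for k in range(n):
                acc += centered[k] * mexican_hat((k - b) / scale)
            row.append(acc / math.sqrt(scale))
        rows.append(row)
    return rows


# Synthetic contours standing in for log-F0 and energy (hypothetical data).
frames = 64
f0 = [5.0 + 0.3 * math.sin(2 * math.pi * k / 32) for k in range(frames)]
energy = [1.0 + 0.5 * math.cos(2 * math.pi * k / 16) for k in range(frames)]

f0_scales = cwt_decompose(f0, num_scales=5)           # five-scale F0
energy_scales = cwt_decompose(energy, num_scales=10)  # ten-scale energy
```

Under these assumptions, each frame yields a 5-dimensional F0 vector and a 10-dimensional energy vector (one coefficient per scale), which could be stacked with the spectral features as per-frame inputs to a sequence model.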
License type:
PublisherCopyrights
Funding Info:
Description:
ISBN:
