Deep Bidirectional LSTM Modeling of Timbre and Prosody for Emotional Voice Conversion

Title:
Deep Bidirectional LSTM Modeling of Timbre and Prosody for Emotional Voice Conversion
Journal Title:
Interspeech 2016
Publication Date:
07 May 2021
Citation:
Ming, H., Huang, D., Xie, L., Wu, J., Dong, M., Li, H. (2016) Deep Bidirectional LSTM Modeling of Timbre and Prosody for Emotional Voice Conversion. Proc. Interspeech 2016, 2453-2457.
Abstract:
Emotional voice conversion aims to convert speech from one emotional state to another. This paper proposes modeling timbre and prosody features with a deep bidirectional long short-term memory (DBLSTM) network for emotional voice conversion. Continuous wavelet transform (CWT) representations of the fundamental frequency (F0) and the energy contour are used for prosody modeling. Specifically, we use the CWT to decompose F0 into a five-scale representation and the energy contour into a ten-scale representation, where each feature scale corresponds to a temporal scale. Both spectrum and prosody (F0 and energy contour) features are converted simultaneously by a sequence-to-sequence conversion method with a DBLSTM model, which captures both frame-wise and long-range relationships between the source and target voices. The converted speech signals are evaluated both objectively and subjectively, and the results confirm the effectiveness of the proposed method.
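The multi-scale prosody representation described in the abstract can be illustrated with a minimal NumPy sketch. The abstract does not specify the mother wavelet, the scale spacing, or the normalization, so the choices here are assumptions: a Mexican hat wavelet, dyadic scales starting at a hypothetical `base_scale` of 2 frames, z-normalization of the input, and an input contour that is already continuous (unvoiced gaps interpolated). The function name `cwt_decompose` is illustrative and not taken from the paper.

```python
import numpy as np


def mexican_hat(t):
    """Mexican hat (Ricker) mother wavelet evaluated at the points t."""
    norm = 2.0 / (np.sqrt(3.0) * np.pi ** 0.25)
    return norm * (1.0 - t ** 2) * np.exp(-0.5 * t ** 2)


def cwt_decompose(contour, num_scales=5, base_scale=2.0):
    """Decompose a 1-D contour into `num_scales` components, one per scale.

    Assumptions (not from the paper): dyadic scales base_scale * 2**i,
    measured in frames, and a z-normalized input contour.
    Returns an array of shape (num_scales, len(contour)).
    """
    x = np.asarray(contour, dtype=float)
    x = (x - x.mean()) / (x.std() + 1e-8)  # z-normalize the contour
    n = len(x)
    components = np.empty((num_scales, n))
    for i in range(num_scales):
        scale = base_scale * 2.0 ** i
        half = int(np.ceil(5 * scale))  # kernel support of +/- 5 scales
        t = np.arange(-half, half + 1) / scale
        kernel = mexican_hat(t) / np.sqrt(scale)
        # Edge-pad so that "valid" convolution returns exactly n samples.
        padded = np.pad(x, half, mode="edge")
        components[i] = np.convolve(padded, kernel, mode="valid")
    return components


# Synthetic log-F0 contour: a slow phrase-level trend plus a faster wiggle.
frames = np.arange(400)
log_f0 = 5.0 + 0.2 * np.sin(2 * np.pi * frames / 200) \
             + 0.05 * np.sin(2 * np.pi * frames / 20)
f0_scales = cwt_decompose(log_f0, num_scales=5)
print(f0_scales.shape)
```

Under the same assumptions, calling `cwt_decompose(energy, num_scales=10)` on a hypothetical `energy` contour would give a ten-scale representation like the one the abstract describes for the energy contour. Fine scales respond to fast, syllable-level movement and coarse scales to slow, phrase-level trends.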
License type:
Publisher Copyrights