Wu, Y., Wang, L., Lee, K. A., Liu, M., Dang, J. (2021). Joint Feature Enhancement and Speaker Recognition with Multi-Objective Task-Oriented Network. Interspeech 2021. doi:10.21437/interspeech.2021-1978
Recently, increasing attention has been paid to the joint training of upstream and downstream tasks, and to address the challenge of how to synchronize various loss functions in a multi-objective scenario. In this paper, to address the competing gradient directions between the speaker classification loss and the feature enhancement loss, we propose an asynchronous subregion optimization approach for the joint training of feature enhancement and speaker embedding neural networks. For the asynchronous subregion optimization, the squeeze and excitation (SE) method is introduced in the enhancement network to adaptively select important channels for speaker embedding. Furthermore, channel-wise feature concatenation is applied between the input feature and the enhanced feature to address the distortion of speaker information that is caused by enhancement loss. By using the proposed joint training network with asynchronous subregion optimization and channel-wise feature concatenation, we obtained relative gains of 11.95% and 6.43% in equal error rate on a noisy version of Voxceleb1 and VOiCES corpus, respectively.
This work was supported by the National Key R&D Program of China under Grant 2018YFB1305200, the National Natural Science Foundation of China under Grant 61771333 and the Tianjin Municipal Science and Technology Project under Grant 18ZXZNGX00330.