Self-Supervised Training of Speaker Encoder With Multi-Modal Diverse Positive Pairs

Journal Title:
IEEE/ACM Transactions on Audio, Speech, and Language Processing
Publication Date:
20 April 2023
Tao, R., Lee, K. A., Das, R. K., Hautamäki, V., & Li, H. (2023). Self-Supervised Training of Speaker Encoder With Multi-Modal Diverse Positive Pairs. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 31, 1706–1719.
We study a novel neural architecture and its training strategies for a speaker encoder for speaker recognition without using any identity labels. The speaker encoder is trained to extract a fixed-dimensional speaker embedding from a spoken utterance of variable length. Contrastive learning is a typical self-supervised learning technique. However, the contrastive learning of the speaker encoder depends very much on the sampling strategy of positive and negative pairs. It is common to sample a positive pair of segments from the same utterance. Unfortunately, such poor-man’s positive pairs (PPP) lack the necessary diversity. In this work, we propose a multi-modal contrastive learning technique with novel sampling strategies. By cross-referencing between speech and face data, we find diverse positive pairs (DPP) for contrastive learning, thus improving the robustness of the speaker encoder. We train the speaker encoder on the VoxCeleb2 dataset without any speaker labels, and achieve an equal error rate (EER) of 2.89%, 3.17% and 6.27% under the proposed progressive clustering strategy, and an EER of 1.44%, 1.77% and 3.27% under the two-stage learning strategy with pseudo labels, on the three test sets of VoxCeleb1. This novel solution outperforms the state-of-the-art self-supervised learning methods by a large margin while achieving results comparable to its supervised learning counterpart. We also evaluate our self-supervised learning technique on the LRS2 and LRW datasets, where the speaker information is unknown. All experiments suggest that the proposed neural architecture and sampling strategies are robust across datasets.
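As a rough illustration of the same-utterance positive pairs described in the abstract, the sketch below draws two segments from one utterance and scores a batch of embedding pairs with an InfoNCE-style contrastive loss. The abstract does not state the exact loss or sampling details used in the paper, so the InfoNCE form, the temperature value, and the helper names `sample_same_utterance_pair` and `info_nce_loss` are all assumptions made for illustration, not the authors' implementation.

```python
import numpy as np


def sample_same_utterance_pair(utterance, segment_len, rng):
    """Draw two random fixed-length segments from one utterance.

    This mimics the common same-utterance positive pair; the segment
    length and uniform sampling are illustrative assumptions.
    """
    max_start = len(utterance) - segment_len
    starts = rng.integers(0, max_start + 1, size=2)
    return tuple(utterance[s:s + segment_len] for s in starts)


def l2_normalize(x, eps=1e-12):
    """Scale each row to unit length so dot products are cosine similarities."""
    return x / np.maximum(np.linalg.norm(x, axis=1, keepdims=True), eps)


def info_nce_loss(anchors, positives, temperature=0.1):
    """InfoNCE-style loss over a batch of embedding pairs (assumed form).

    Row i of `anchors` is paired with row i of `positives`; every other
    row of `positives` acts as a negative for anchor i.
    """
    a = l2_normalize(anchors)
    p = l2_normalize(positives)
    logits = a @ p.T / temperature
    # Subtract the row maximum for numerical stability of the softmax.
    logits = logits - logits.max(axis=1, keepdims=True)
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return float(-np.mean(np.diag(log_prob)))
```

In this sketch, a diverse positive pair would simply replace row i of `positives` with an embedding taken from a different utterance believed to share the same speaker; how such pairs are found from speech and face data is described in the paper itself and is not reproduced here.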
License type:
Publisher Copyright
Funding Info:
This research is supported by the Huawei Noah’s Ark Lab

This research / project is supported by the National Natural Science Foundation of China
Grant Reference no. : 62271432

This research / project is supported by the Shenzhen Research Institute of Big Data - Internal Project Fund
Grant Reference no. : T00120220002

This research / project is supported by the Human-Robot Collaborative AI for Advanced Manufacturing and Engineering
Grant Reference no. : A18A2b0046
© 2023 IEEE.  Personal use of this material is permitted.  Permission from IEEE must be obtained for all other uses, in any current or future media, including reprinting/republishing this material for advertising or promotional purposes, creating new collective works, for resale or redistribution to servers or lists, or reuse of any copyrighted component of this work in other works.
Files uploaded:

25659-final-paper.pdf (1.92 MB, PDF)