Tao, R., Lee, K. A., Das, R. K., Hautamäki, V., & Li, H. (2023). Self-Supervised Training of Speaker Encoder With Multi-Modal Diverse Positive Pairs. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 31, 1706–1719. https://doi.org/10.1109/taslp.2023.3268568
Abstract:
We study a novel neural architecture and training strategies for a speaker encoder for speaker recognition without using any identity labels. The speaker encoder is trained to extract a fixed-dimensional speaker embedding from spoken utterances of varying length. Contrastive learning is a typical self-supervised learning technique; however, contrastive learning of the speaker encoder depends heavily on the sampling strategy for positive and negative pairs. It is common to sample a positive pair of segments from the same utterance. Unfortunately, such poor-man's positive pairs (PPP) lack the necessary diversity. In this work, we propose a multi-modal contrastive learning technique with novel sampling strategies. By cross-referencing between speech and face data, we find diverse positive pairs (DPP) for contrastive learning, thus improving the robustness of the speaker encoder. We train the speaker encoder on the VoxCeleb2 dataset without any speaker labels, and achieve an equal error rate (EER) of 2.89%, 3.17% and 6.27% under the proposed progressive clustering strategy, and an EER of 1.44%, 1.77% and 3.27% under the two-stage learning strategy with pseudo labels, on the three test sets of VoxCeleb1. This novel solution outperforms state-of-the-art self-supervised learning methods by a large margin and, at the same time, achieves results comparable to its supervised learning counterpart. We also evaluate our self-supervised learning technique on the LRS2 and LRW datasets, where the speaker information is unknown. All experiments suggest that the proposed neural architecture and sampling strategies are robust across datasets.
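Illustrative note: the contrast the abstract draws between poor-man's positive pairs (PPP) and diverse positive pairs (DPP) can be made concrete with a minimal, hypothetical InfoNCE-style contrastive training sketch. This is not the authors' implementation; the toy encoder, feature dimensions, and noise-based "second crop" are illustrative assumptions only, and a DPP variant would instead draw the positive from a different utterance that a cross-modal (audio-visual) clustering step assigned to the same speaker.

    # Hypothetical sketch of contrastive training of a speaker encoder (not the paper's code).
    import torch
    import torch.nn.functional as F

    def info_nce_loss(anchor, positive, temperature=0.07):
        # Normalise embeddings and build a (B, B) cosine-similarity matrix;
        # diagonal entries are the designated positives, off-diagonal entries act as negatives.
        a = F.normalize(anchor, dim=-1)
        p = F.normalize(positive, dim=-1)
        logits = a @ p.t() / temperature
        targets = torch.arange(a.size(0), device=a.device)
        return F.cross_entropy(logits, targets)

    # Toy stand-in for a segment-level speaker encoder (80-dim features -> 192-dim embedding).
    speaker_encoder = torch.nn.Sequential(torch.nn.Linear(80, 192))

    # PPP: anchor and positive come from the same utterance (simulated here as the same
    # features with small perturbation). A DPP positive would instead be a segment from a
    # different utterance grouped with the anchor by audio-visual clustering.
    anchor_feats = torch.randn(8, 80)
    positive_feats = anchor_feats + 0.05 * torch.randn(8, 80)

    loss = info_nce_loss(speaker_encoder(anchor_feats), speaker_encoder(positive_feats))
    loss.backward()

Under this framing, the paper's contribution is the sampling step that supplies the positive pair, not the loss itself: replacing same-utterance crops with cross-modally mined positives is what injects the diversity the abstract refers to.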
License type:
Publisher Copyright
Funding Info:
This research is supported by the Huawei Noah’s Ark Lab
This research / project is supported by the National Natural Science Foundation of China - N/A
Grant Reference no. : 62271432
This research / project is supported by the Shenzhen Research Institute of Big Data - Internal Project Fund
Grant Reference no. : T00120220002
This research / project is supported by the Human-Robot Collaborative AI for Advanced Manufacturing and Engineering - N/A
Grant Reference no. : A18A2b0046