Liu, M., Lee, K. A., Wang, L., Zhang, H., Zeng, C., & Dang, J. (2023, June 4). Cross-Modal Audio-Visual Co-Learning for Text-Independent Speaker Verification. ICASSP 2023 - 2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). https://doi.org/10.1109/icassp49357.2023.10095883
Abstract:
Visual speech (i.e., lip motion) is highly correlated with auditory speech because the two co-occur and are synchronized in speech production. This paper investigates this correlation and proposes a cross-modal speech co-learning paradigm. The primary motivation of our cross-modal co-learning method is to model one modality with the aid of knowledge drawn from another modality. Specifically, two cross-modal boosters are introduced on top of an audio-visual pseudo-siamese structure to learn the modality-transformed correlation. Inside each booster, a max-feature-map embedded Transformer variant is proposed for modality alignment and enhanced feature generation. The network is co-learned both from scratch and with pretrained models. Experimental results on the test scenarios demonstrate that our proposed method achieves average relative performance improvements of around 60% and 20% over the baseline unimodal and fusion systems, respectively.
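Note: the abstract refers to a max-feature-map (MFM) embedded Transformer variant inside each cross-modal booster, but does not detail the module itself. The snippet below is only a minimal sketch of the generic MFM operation (as used in LightCNN-style networks), assuming a NumPy array whose feature dimension is split in half and reduced by an element-wise maximum; it is not the authors' exact architecture.

    import numpy as np

    def max_feature_map(x, axis=-1):
        # Max-Feature-Map (MFM): split the feature dimension into two halves
        # and keep the element-wise maximum, halving the feature size.
        # Assumes the size along `axis` is even.
        a, b = np.split(x, 2, axis=axis)
        return np.maximum(a, b)

    # Example: a (batch, features) activation of size 4 is reduced to size 2.
    x = np.array([[1.0, -2.0, 0.5, 3.0]])
    print(max_feature_map(x))  # [[1. 3.]]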
License type:
Publisher Copyright
Funding Info:
No specific funding was received for this research.