Generate, Discriminate and Contrast: A Semi-Supervised Sentence Representation Learning Framework

Please use this identifier to cite or link to this item: https://oar.a-star.edu.sg/communities-collections/articles/21390

Title:

Generate, Discriminate and Contrast: A Semi-Supervised Sentence Representation Learning Framework

Journal Title:

Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing

DOI:

10.18653/v1/2022.emnlp-main.558

Publication URL:

http://dx.doi.org/10.18653/v1/2022.emnlp-main.558

Authors:

Yiming Chen, Yan Zhang, Bin Wang, Zuozhu Liu, Haizhou Li

Publication Date:

04 August 2023

Citation:

Chen, Y., Zhang, Y., Wang, B., Liu, Z., Li, H. (2022). Generate, Discriminate and Contrast: A Semi-Supervised Sentence Representation Learning Framework. Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, 8150–8161. https://doi.org/10.18653/v1/2022.emnlp-main.558

Abstract:

Most sentence embedding techniques rely heavily on expensive human-annotated sentence pairs as the supervision signal. Despite the use of large-scale unlabeled data, the performance of unsupervised methods typically lags far behind that of their supervised counterparts in most downstream tasks. In this work, we propose a semi-supervised sentence embedding framework, GenSE, that effectively leverages large-scale unlabeled data. Our method includes three parts: 1) Generate: A generator/discriminator model is jointly trained to synthesize sentence pairs from an open-domain unlabeled corpus; 2) Discriminate: Noisy sentence pairs are filtered out by the discriminator to acquire high-quality positive and negative sentence pairs; 3) Contrast: A prompt-based contrastive approach is presented for sentence representation learning with both annotated and synthesized data. Comprehensive experiments show that GenSE achieves an average correlation score of 85.19 on the STS datasets and consistent performance improvements on four domain adaptation tasks, significantly surpassing the state-of-the-art methods and corroborating its effectiveness and generalization ability.
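
The Discriminate and Contrast steps described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function names (`filter_pairs`, `info_nce`), the confidence threshold of 0.9, and the temperature of 0.05 are all assumptions chosen for illustration, the loss is a generic InfoNCE-style objective rather than the exact one in the paper, and the prompt-based encoder is omitted, so sentence embeddings are passed in as plain lists of floats.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def filter_pairs(scored_pairs, threshold=0.9):
    """Discriminate step (hypothetical): keep synthesized pairs whose
    discriminator confidence is at least `threshold` (assumed value)."""
    return [pair for pair, score in scored_pairs if score >= threshold]


def info_nce(anchor, positive, negatives, temperature=0.05):
    """Contrast step (generic InfoNCE-style loss for one anchor):
    negative log-softmax of the positive's scaled similarity
    against the similarities of the negatives."""
    logits = [cosine(anchor, positive) / temperature]
    logits += [cosine(anchor, neg) / temperature for neg in negatives]
    # Log-sum-exp with the max subtracted for numerical stability.
    top = max(logits)
    log_denominator = top + math.log(sum(math.exp(x - top) for x in logits))
    return log_denominator - logits[0]


# Toy usage: two synthesized pairs with made-up discriminator scores.
scored = [(("a dog runs", "a dog is running"), 0.97),
          (("a dog runs", "stocks fell today"), 0.41)]
kept = filter_pairs(scored)

# Toy 2-d embeddings: the loss is small when the positive is close to
# the anchor and large when a negative is closer than the positive.
loss_good = info_nce([1.0, 0.0], [1.0, 0.0], [[0.0, 1.0]])
loss_bad = info_nce([1.0, 0.0], [0.0, 1.0], [[1.0, 0.0]])
```

In this sketch, low-confidence synthesized pairs are dropped before training, and the loss pushes each anchor toward its positive and away from its negatives; in a real system the embeddings would come from the trained encoder rather than being fixed by hand.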

License type:

Publisher Copyright

Funding Info:

This research / project is supported by the Agency for Science, Technology and Research (A*STAR) - National Robotics Program: Human-Robot Interaction Phase I
Grant Reference no. : 1922500054

This research / project is supported by the Agency for Science, Technology and Research (A*STAR) - Advanced Manufacturing and Engineering (AME) Programmatic Funding Scheme
Grant Reference no. : A18A2b0046

This research / project is supported by the Shenzhen Research Institute of Big Data
Grant Reference no. : T00120220002

This research / project is supported by the National Natural Science Foundation of China
Grant Reference no. : 62106222

Description:

© 2022 Association for Computational Linguistics. Permission is granted to make copies for the purposes of teaching and research. Materials published in or after 2016 are licensed on a Creative Commons Attribution 4.0 International License.

URI:

https://oar.a-star.edu.sg/communities-collections/articles/21390

ACL Anthology ID:

2022.emnlp-main.558

Collections:

Institute for Infocomm Research

Files uploaded:

https://aclanthology.org/2022.emnlp-main.558.pdf