Speech-Aware Multi-Domain Dialogue State Generation with ASR Error Correction Modules

Title:
Speech-Aware Multi-Domain Dialogue State Generation with ASR Error Correction Modules
Journal Title:
Proceedings of The Eleventh Dialog System Technology Challenge
DOI:
Publication Date:
16 September 2023
Citation:
Ridong Jiang, Wei Shi, Bin Wang, Chen Zhang, Yan Zhang, Chunlei Pan, Jung Jae Kim, and Haizhou Li. 2023. Speech-Aware Multi-Domain Dialogue State Generation with ASR Error Correction Modules. In Proceedings of The Eleventh Dialog System Technology Challenge, pages 105–112, Prague, Czech Republic. Association for Computational Linguistics.
Abstract:
Prior research on dialogue state tracking (DST) is mostly based on written dialogue corpora. For spoken dialogues, a DST model trained on written text must take the results (or hypotheses) of automatic speech recognition (ASR) as input. However, ASR hypotheses often include errors, which leads to a significant performance drop in spoken dialogue state tracking. We address the issue by developing the following ASR error correction modules. First, we train a model to convert an ASR hypothesis into the ground-truth user utterance, which can fix frequent error patterns. The model takes the hypotheses of two ASR models as input and is fine-tuned in two stages. The corrected hypothesis is fed into a large-scale pre-trained encoder-decoder model (T5) for DST training and inference. Second, if an output slot value from the encoder-decoder model is a name, we compare it with names in a dictionary crawled from websites and, if feasible, replace it with the crawled name that has the shortest edit distance. Third, we fix errors in temporal expressions in the ASR hypothesis using hand-crafted rules. Experimental results on the DSTC 11 speech-aware dataset, which is built on the popular MultiWOZ task (version 2.1), show that our proposed method can effectively mitigate the performance drop when moving from written text to spoken conversations.
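The second module (replacing a predicted name with the dictionary entry at the shortest edit distance) can be illustrated with a minimal sketch. This is not the authors' implementation: the function names `edit_distance` and `correct_name`, the case-insensitive comparison, and the `max_distance` threshold (one possible reading of "if feasible") are all assumptions made for illustration, and the sample dictionary entries are hypothetical.

```python
from typing import Iterable


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance computed with a two-row dynamic program."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(
                prev[j] + 1,         # delete a character from a
                curr[j - 1] + 1,     # insert a character into a
                prev[j - 1] + cost,  # substitute (or match)
            ))
        prev = curr
    return prev[-1]


def correct_name(value: str, dictionary: Iterable[str], max_distance: int = 3) -> str:
    """Return the closest dictionary name, or the original value if none is close.

    The max_distance cutoff is an assumed stand-in for the "if feasible"
    condition in the abstract; the source does not specify the criterion.
    """
    best, best_dist = value, max_distance + 1
    for name in dictionary:
        d = edit_distance(value.lower(), name.lower())
        if d < best_dist:
            best, best_dist = name, d
    return best


# Hypothetical dictionary entries, for illustration only.
names = ["acorn guest house", "alexander bed and breakfast"]
corrected = correct_name("acorn guess house", names)
unchanged = correct_name("zzzzzz", names)
```

In this sketch a misrecognized slot value such as "acorn guess house" is one substitution away from a dictionary entry and is replaced, while a value far from every entry is left as predicted.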
License type:
Attribution 4.0 International (CC BY 4.0)
Funding Info:
This research is supported by core funding from: Institute for Infocomm Research (I²R)
Grant Reference no. :
Description:
ISBN:
2023.dstc-1.13