Qiongqiong Wang, Hardik Bhupendra Sailor, Tianchi Liu, Wenyu Zhang, Muhammad Huzaifah, Nattadaporn Lertcheva, Shuo Sun, Nancy F. Chen, Jinyang Wu, and AiTi Aw. 2025. Benchmarking Contextual and Paralinguistic Reasoning in Speech-LLMs: A Case Study with In-the-Wild Data. In Findings of the Association for Computational Linguistics: EMNLP 2025.
Abstract:
Recent speech-LLMs have shown impressive
performance in tasks like transcription and translation, yet they remain limited in understanding
the paralinguistic aspects of speech crucial for
social and emotional intelligence. We propose
CP-Bench, a benchmark for evaluating speech-LLMs on contextual paralinguistic reasoning:
the integration of verbal content with non-verbal
cues like emotion and prosody. The benchmark
includes two curated question-answering (QA)
datasets requiring both linguistic and empathetic understanding. We evaluate state-of-the-art speech-LLMs from both open- and closed-source models and perform a comprehensive
analysis across different question types. The
top two models were further analyzed under
temperature tuning to understand its effect on
this task. Our benchmark reveals a key gap
in existing evaluations and offers insights into
building more context-aware and emotionally
intelligent speech-capable LLMs.
License type:
Attribution 4.0 International (CC BY 4.0)
Funding Info:
This research / project is supported by the National Research Foundation, Singapore - National Large Language Models Funding
Grant Reference no.: EC-2024-021