Gao, Xiaoxue, and Nancy F. Chen. "Speech-Mamba: Long-context speech recognition with selective state spaces models," in IEEE Spoken Language Technology Workshop, 2024, pp. 182–189.
Abstract:
Current automatic speech recognition systems struggle to model long speech sequences due to the quadratic complexity of Transformer-based models. Selective state space models such as Mamba have performed well on long-sequence modeling in natural language processing and computer vision tasks, but their application to speech tasks remains under-explored. We propose Speech-Mamba, which incorporates selective state space modeling into Transformer neural architectures. Long-sequence representations from selective state space models in Speech-Mamba are complemented by lower-level representations from Transformer-based modeling. Speech-Mamba thus achieves greater capacity to model long-range dependencies, as it scales near-linearly with sequence length.
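The near-linear scaling claimed above comes from the O(L) recurrence of a selective state space layer. The following is a minimal sketch of such a Mamba-style layer in PyTorch, assuming illustrative names (SelectiveSSM, d_model, d_state) and a naive sequential scan; it is not the paper's implementation, where optimized parallel kernels would replace the Python loop.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SelectiveSSM(nn.Module):
    """Simplified Mamba-style selective state space layer (illustrative).

    Recurrence per time step t:
        h_t = exp(dt_t * A) * h_{t-1} + (dt_t * B_t) * x_t
        y_t = C_t . h_t
    B_t, C_t, and dt_t depend on the input ("selective"); the scan is
    O(L) in sequence length, hence the near-linear scaling.
    """
    def __init__(self, d_model: int, d_state: int = 16):
        super().__init__()
        # Log-parameterized state matrix, negated in forward() for stability.
        self.log_A = nn.Parameter(torch.log(torch.arange(1, d_state + 1).float()))
        # Input-dependent projections for B, C, and the step size dt.
        self.to_B = nn.Linear(d_model, d_state)
        self.to_C = nn.Linear(d_model, d_state)
        self.to_dt = nn.Linear(d_model, 1)
        self.out = nn.Linear(d_model, d_model)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq_len, d_model)
        batch, seq_len, d_model = x.shape
        A = -torch.exp(self.log_A)                         # (d_state,)
        Bx = self.to_B(x)                                  # (batch, L, d_state)
        Cx = self.to_C(x)                                  # (batch, L, d_state)
        dt = F.softplus(self.to_dt(x))                     # (batch, L, 1)
        h = x.new_zeros(batch, self.log_A.numel(), d_model)
        ys = []
        for t in range(seq_len):  # sequential scan; real kernels parallelize this
            decay = torch.exp(dt[:, t] * A).unsqueeze(-1)              # (batch, d_state, 1)
            inp = (dt[:, t] * Bx[:, t]).unsqueeze(-1) * x[:, t].unsqueeze(1)
            h = decay * h + inp                                         # (batch, d_state, d_model)
            ys.append((Cx[:, t].unsqueeze(-1) * h).sum(dim=1))          # (batch, d_model)
        return self.out(torch.stack(ys, dim=1))

# Usage sketch: such a layer could be interleaved with Transformer blocks,
# letting the SSM carry long-range context while attention models local structure.
layer = SelectiveSSM(d_model=64)
y = layer(torch.randn(2, 100, 64))  # (2, 100, 64)
```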
License type:
Publisher Copyright
Funding Info:
This research/project is supported by the Agency for Science, Technology and Research (A*STAR) and the Institute for Infocomm Research (I2R) - SpeechEval Phase II: SHE4EDU (Speech Highlighter and Evaluation for Education).
Grant Reference no.: EC-2023-061