Unsupervised data selection and word-morph mixed language model for tamil low-resource keyword search

Title:
Unsupervised data selection and word-morph mixed language model for tamil low-resource keyword search
Other Titles:
2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)
Publication Date:
19 April 2015
Citation:
C. Ni, C. Leung, L. Wang, N. F. Chen and B. Ma, "Unsupervised data selection and word-morph mixed language model for tamil low-resource keyword search," 2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Brisbane, QLD, 2015, pp. 4714-4718. doi: 10.1109/ICASSP.2015.7178865
Abstract:
This paper considers an unsupervised data selection problem for the training data of an acoustic model and the vocabulary coverage of a keyword search system in low-resource settings. We propose to use Gaussian component index based n-grams as acoustic features in a submodular function for unsupervised data selection. The submodular function provides a near-optimal solution in terms of the objective being optimized. Different kinds of active learning techniques have been studied to address the data selection problem for phone recognition or LVCSR, but to the best of our knowledge, active learning for unsupervised data selection has not been investigated. We show that the selection of data plays an important role in the word error rate of the LVCSR system and the actual term weighted value (ATWV) of the keyword search system. We conducted our experiments on the 2014 NIST Open Keyword Search Evaluation (OpenKWS14) surprise language, Tamil, provided by the IARPA Babel program, in which 10 hours of data were selected from the full language pack (FLP) using our proposed acoustic feature based submodular approach. Our proposed approach provides relative ATWV improvements of 23.2% and 20.7% over two other data subsets, the 10-hour limited language pack (LLP) defined by IARPA and 10 hours of speech randomly selected from the FLP, respectively. In addition, our approach also increases the vocabulary coverage of the lexicon, implicitly alleviating the out-of-vocabulary (OOV) problem: the number of OOV search terms in the baseline conditions drops from 1,686 and 1,171 to 972. Moreover, to further resolve the high OOV rate for morphologically rich languages like Tamil, a word-morph mixed language model is also considered.
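The selection idea in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not give the exact submodular function, so the sketch assumes a common feature-based form (a sum of square roots of accumulated n-gram counts), bigrams over Gaussian component indices, a duration budget, and a greedy gain-per-second rule. The names `ngrams` and `greedy_select` and the toy input layout are hypothetical.

```python
import math
from collections import Counter


def ngrams(seq, n=2):
    """Return the n-grams of a sequence of Gaussian component indices."""
    return [tuple(seq[i:i + n]) for i in range(len(seq) - n + 1)]


def greedy_select(utterances, budget_seconds, n=2):
    """Greedily pick utterances under a duration budget.

    utterances: dict mapping an utterance id to (duration_seconds, indices),
    where indices is the per-frame Gaussian component index sequence.
    Assumed objective: f(S) = sum over n-grams of sqrt(count of n-gram in S),
    which is monotone submodular because sqrt is concave and non-decreasing.
    """
    feats = {u: Counter(ngrams(idx, n)) for u, (_, idx) in utterances.items()}
    totals = Counter()          # accumulated n-gram counts of the selection
    selected, used = [], 0.0
    remaining = set(utterances)

    while remaining:
        best, best_ratio = None, 0.0
        for u in sorted(remaining):          # sorted for deterministic ties
            dur = utterances[u][0]
            if used + dur > budget_seconds:
                continue                     # would exceed the budget
            # Marginal gain of adding u to the current selection.
            gain = sum(
                math.sqrt(totals[g] + c) - math.sqrt(totals[g])
                for g, c in feats[u].items()
            )
            ratio = gain / dur               # gain per second of audio
            if ratio > best_ratio:
                best, best_ratio = u, ratio
        if best is None:
            break                            # nothing affordable adds value
        selected.append(best)
        used += utterances[best][0]
        totals.update(feats[best])
        remaining.remove(best)
    return selected
```

Because the square root flattens as counts grow, an utterance that repeats already-covered n-grams earns a smaller gain than one that brings new n-grams, which is what pushes the selection toward acoustic diversity.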
License type:
PublisherCopyrights
Description:
(c) 2015 IEEE.
ISSN:
1520-6149
2379-190X
ISBN:
978-1-4673-6997-8