This paper considers an unsupervised data selection problem for the training data of an acoustic model and the vocabulary coverage of a keyword search system in low-resource settings. We propose to use Gaussian component index based n-grams as acoustic features in a submodular function for unsupervised data selection. The submodular function provides a near-optimal solution in terms of the objective being optimized. Various active learning techniques have been studied for data selection in phone recognition and LVCSR, but to the best of our knowledge, active learning for unsupervised data selection has not been investigated. We show that the choice of data plays an important role in the word error rate of the LVCSR system and the actual term-weighted value (ATWV) of the keyword search system. We conducted our experiments on the 2014 NIST Open Keyword Search Evaluation (OpenKWS14) surprise language, Tamil, provided by the IARPA Babel program, in which 10 hours of data were selected from the full language pack (FLP) using our proposed acoustic-feature-based submodular approach. Our approach yields relative ATWV improvements of 23.2% and 20.7% over two other data subsets: the 10-hour limited language pack (LLP) defined by IARPA and 10 hours of speech randomly selected from the FLP, respectively. In addition, our approach also increases the vocabulary coverage of the lexicon, implicitly alleviating the out-of-vocabulary (OOV) problem: the number of OOV search terms drops from 1,686 and 1,171 in the two baseline conditions to 972. Moreover, to further address the high OOV rate of morphologically rich languages like Tamil, a word-morph mixed language model is also considered.
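The selection procedure described above can be illustrated with a minimal sketch of budget-constrained greedy maximization of a monotone submodular coverage objective. This is not the paper's exact formulation: the objective here is simply the number of distinct Gaussian-index n-grams covered, the per-second normalized gain rule is a common heuristic for knapsack-type budgets, and all names (`greedy_select`, the toy utterance sets) are illustrative assumptions.

```python
def greedy_select(utterances, durations, budget_sec):
    """Greedy budget-constrained submodular data selection (sketch).

    utterances: dict mapping utterance id -> set of Gaussian-index n-grams
    durations:  dict mapping utterance id -> duration in seconds
    Greedy selection on a monotone submodular objective (here, distinct
    n-grams covered) enjoys a constant-factor approximation guarantee,
    which is the "near-optimal" property cited in the abstract.
    """
    selected, covered, used = [], set(), 0.0
    remaining = set(utterances)
    while remaining:
        best, best_gain = None, 0.0
        # Iterate in sorted order so ties are broken deterministically.
        for u in sorted(remaining):
            if used + durations[u] > budget_sec:
                continue  # would exceed the time budget (e.g. 10 hours)
            # Marginal coverage gain per second of added speech.
            gain = len(utterances[u] - covered) / durations[u]
            if gain > best_gain:
                best, best_gain = u, gain
        if best is None:
            break  # no affordable utterance adds new coverage
        selected.append(best)
        covered |= utterances[best]
        used += durations[best]
        remaining.remove(best)
    return selected, covered
```

In practice the n-gram sets would come from decoding each utterance against a GMM and recording sequences of most-likely Gaussian component indices; the toy integers below stand in for those n-grams.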