This paper considers an unsupervised data selection problem for the training data of an acoustic model and the vocabulary coverage of a keyword search system in low-resource settings. We propose to use Gaussian component index based n-grams as acoustic features in a submodular function for unsupervised data selection. The submodular function provides a near-optimal solution in terms of the objective being optimized. Various active learning techniques have been studied for data selection in phone recognition and LVCSR, but to the best of our knowledge, active learning for unsupervised data selection has not been investigated. We show that the choice of data plays an important role in the word error rate of the LVCSR system and the actual term-weighted value (ATWV) of the keyword search system. We conducted our experiments on the 2014 NIST Open Keyword Search Evaluation (OpenKWS14) surprise language, Tamil, provided by the IARPA Babel program, in which 10 hours of data were selected from the full language pack (FLP) using our proposed acoustic-feature-based submodular approach. Our approach yields relative ATWV improvements of 23.2% and 20.7% over two other data subsets: the 10-hour limited language pack (LLP) defined by IARPA and 10 hours of speech randomly selected from the FLP, respectively. In addition, our approach also increases the vocabulary coverage of the lexicon, implicitly alleviating the out-of-vocabulary (OOV) problem: the number of OOV search terms drops from 1,686 and 1,171 in the two baseline conditions to 972. Moreover, to further address the high OOV rate of morphologically rich languages like Tamil, a word-morph mixed language model is also considered.
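The selection procedure described above can be illustrated with a minimal sketch of budget-constrained greedy maximization of a monotone submodular coverage objective. This is not the paper's exact formulation: the objective here is simply the number of distinct Gaussian-index n-grams covered, the per-second normalized gain rule is a common heuristic for knapsack-type budgets, and all names (`greedy_select`, the toy utterance sets) are illustrative assumptions.

```python
def greedy_select(utterances, durations, budget_sec):
    """Greedy budget-constrained submodular data selection (sketch).

    utterances: dict mapping utterance id -> set of Gaussian-index n-grams
    durations:  dict mapping utterance id -> duration in seconds
    Greedy selection on a monotone submodular objective (here, distinct
    n-grams covered) enjoys a constant-factor approximation guarantee,
    which is the "near-optimal" property cited in the abstract.
    """
    selected, covered, used = [], set(), 0.0
    remaining = set(utterances)
    while remaining:
        best, best_gain = None, 0.0
        # Iterate in sorted order so ties are broken deterministically.
        for u in sorted(remaining):
            if used + durations[u] > budget_sec:
                continue  # would exceed the time budget (e.g. 10 hours)
            # Marginal coverage gain per second of added speech.
            gain = len(utterances[u] - covered) / durations[u]
            if gain > best_gain:
                best, best_gain = u, gain
        if best is None:
            break  # no affordable utterance adds new coverage
        selected.append(best)
        covered |= utterances[best]
        used += durations[best]
        remaining.remove(best)
    return selected, covered
```

In practice the n-gram sets would come from decoding each utterance against a GMM and recording sequences of most-likely Gaussian component indices; the toy integers below stand in for those n-grams.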