Humans perform remarkably well at speech recognition using sparse and asynchronous events carried by electrical impulses. Motivated by the observations that human brains primarily learn features from environmental stimuli in an unsupervised manner and consume extremely low power for complex cognitive tasks, we propose a biologically plausible speech recognition mechanism using an unsupervised self-organizing map (SOM) for feature representation and an event-driven spiking neural network (SNN) for spatiotemporal pattern classification. Moreover, we improve the biological realism of the proposed framework by using a mel-scaled filter bank as the front-end, so as to mimic the human auditory system. Our experiments on the TIDIGITS dataset achieve speech recognition accuracy surpassing that of other bio-inspired systems. The proposed SOM-SNN framework can be implemented using an artificial silicon cochlea and a neuromorphic processor, so as to fully exploit the potential of event-based speech recognition systems.
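To make the mel-scaled front-end concrete, the following is a minimal sketch of how triangular mel filter center frequencies are typically placed: frequencies are spaced evenly on the mel scale (here using the common 2595·log10(1 + f/700) formula) rather than linearly in Hz, mimicking the frequency resolution of the human cochlea. The function names and parameter choices are illustrative, not the paper's actual implementation.

```python
import math

def hz_to_mel(f_hz):
    """Convert a frequency in Hz to the mel scale (common 2595*log10 formula)."""
    return 2595.0 * math.log10(1.0 + f_hz / 700.0)

def mel_to_hz(m):
    """Inverse of hz_to_mel: convert a mel value back to Hz."""
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mel_center_frequencies(n_filters, f_min_hz, f_max_hz):
    """Center frequencies (Hz) of n_filters triangular mel filters.

    Centers are evenly spaced in mel between f_min_hz and f_max_hz,
    so low frequencies get denser coverage than high frequencies.
    """
    m_min, m_max = hz_to_mel(f_min_hz), hz_to_mel(f_max_hz)
    step = (m_max - m_min) / (n_filters + 1)
    return [mel_to_hz(m_min + step * (i + 1)) for i in range(n_filters)]
```

For example, `mel_center_frequencies(20, 0, 8000)` yields 20 centers that cluster toward low frequencies, which is the perceptually motivated spacing a mel filter bank front-end applies before feature learning.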