Deep, Long and Wide Artificial Neural Networks in Processing of Speech
Speaker:
Hynek Heřmanský
Abstract:
Until recently, automatic speech recognition (ASR) proceeded in a single stream: from the speech signal, through a feature extraction module and a pattern classifier, to a search for the best word sequence. Features were mostly hand-crafted and represented relatively short (10-20 ms) instantaneous snapshots of the speech signal. The introduction of artificial neural nets (ANNs) into speech processing allowed for much more ambitious and more effective schemes. Today's speech features for ASR are derived from large amounts of speech data, often using complex deep neural net architectures. The talk argues for ANNs that are not only deep but also wide (i.e., processing information in multiple parallel processing streams) and long (i.e., extracting information from speech segments much longer than 10-20 ms). Support comes from the psychophysics and physiology of speech perception, as well as from speech data itself. The talk reviews the history of the gradual shift towards nonlinear multi-stream extraction of information from the spectral dynamics of speech, and shows some advantages of this approach in ASR.