Deep, Long and Wide Artificial Neural Networks in Processing of Speech

Speaker:
Hynek Heřmanský
Abstract:
Until recently, automatic recognition of speech (ASR) proceeded in a single stream: from the speech signal, through a feature extraction module and a pattern classifier, into a search for the best word sequence. Features were mostly hand-crafted and represented relatively short (10-20 ms) instantaneous snapshots of the speech signal. The introduction of artificial neural nets (ANNs) into speech processing allowed for much more ambitious and effective schemes. Today's speech features for ASR are derived from large amounts of speech data, often using complex deep neural net architectures. The talk argues for ANNs that are not only deep but also wide (i.e., processing information in multiple parallel processing streams) and long (i.e., extracting information from speech segments much longer than 10-20 ms). Support comes from the psychophysics and physiology of speech perception, as well as from the speech data itself. The talk reviews the history of the gradual shift towards nonlinear multi-stream extraction of information from the spectral dynamics of speech, and shows some advantages of this approach in ASR.
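
As a rough illustration of the "deep, wide, and long" idea described in the abstract (not the speaker's actual architecture), the sketch below assumes PyTorch and builds a front end with several parallel sub-band streams ("wide"), each a small deep MLP ("deep") operating over roughly half a second of spectrogram frames rather than a single 10-20 ms snapshot ("long"). All class names, layer sizes, the number of streams, and the band split are illustrative assumptions.

```python
# Minimal sketch of a multi-stream, long-context feature extractor.
# Assumptions: 40 mel bands, 4 frequency sub-band streams, 50 frames of
# context (~0.5 s at a 10 ms hop), and phoneme-posterior-like outputs.
import torch
import torch.nn as nn


class SubbandStream(nn.Module):
    """One processing stream: a deep MLP over a long slice of one sub-band."""

    def __init__(self, n_bands: int, context_frames: int, hidden: int = 256, out_dim: int = 64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Flatten(),                                 # (batch, n_bands * context_frames)
            nn.Linear(n_bands * context_frames, hidden),
            nn.ReLU(),
            nn.Linear(hidden, hidden),
            nn.ReLU(),
            nn.Linear(hidden, out_dim),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.net(x)


class MultiStreamFrontEnd(nn.Module):
    """Splits the spectrogram into frequency sub-bands, runs one stream per
    band in parallel, and fuses the stream outputs into one feature vector."""

    def __init__(self, n_mel: int = 40, n_streams: int = 4,
                 context_frames: int = 50, n_targets: int = 40):
        super().__init__()
        assert n_mel % n_streams == 0
        self.band_size = n_mel // n_streams
        self.streams = nn.ModuleList(
            SubbandStream(self.band_size, context_frames) for _ in range(n_streams)
        )
        self.fusion = nn.Sequential(
            nn.Linear(64 * n_streams, 256),
            nn.ReLU(),
            nn.Linear(256, n_targets),                    # e.g. phoneme posteriors
        )

    def forward(self, spec: torch.Tensor) -> torch.Tensor:
        # spec: (batch, n_mel, context_frames) -- a long temporal slice, not one frame
        bands = torch.split(spec, self.band_size, dim=1)
        merged = torch.cat([s(b) for s, b in zip(self.streams, bands)], dim=-1)
        return self.fusion(merged)


if __name__ == "__main__":
    model = MultiStreamFrontEnd()
    dummy = torch.randn(8, 40, 50)                        # batch of long spectrogram slices
    print(model(dummy).shape)                             # torch.Size([8, 40])
```

The point of the sketch is the structure, not the particular layers: information from different spectral regions is processed independently over a long context and only then combined, which is the multi-stream extraction from spectral dynamics that the talk advocates.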
Length:
01:05:08
Date:
29/07/2014
Views:
1302

Attachments: (video, slides, etc.)
60 MB (951 downloads)
349 MB (1303 downloads)
65 MB (1011 downloads)