After many decades of slow incremental growth, computer-based automatic
recognition of human speech has recently gone through a much more rapid transition
from the research lab to mainstream application, available on most of the 1+ billion
smartphones on the planet. (Worldwide, there are almost as many mobile phones as
people, and about 1 in 5 of these are smartphones.) This growth has been fueled largely
by increases in raw computational power, rather than fundamental changes in speech
recognition technology itself. The methods used in nearly every state-of-the-art
automatic speech recognition system are based on the same statistical model that was
first used for speech more than 30 years ago, the Hidden Markov Model. Hidden
Markov Models are in many ways straightforward models, simple state machines that
take input sequences and identify the most likely corresponding state sequences. The
main strength of the approach is in its flexibility – flexibility to match sequences in a
non-linear temporal pattern, flexibility to learn more detailed models if more training
data is available, flexibility to connect multiple models together into longer continuous
patterns, and flexibility to incorporate whatever data features and probabilistic models
are best suited to the task. Nearly all of these benefits also carry over to the domain of
bioacoustics, specifically to the classification of animal vocalizations. Although there
are limits to this transfer – human speech is better understood than animal communication –
there is also much to gain, and many improvements are possible, by taking
advantage of the large body of knowledge available through the long history of human
speech processing and recognition technology. In keeping with this idea, this chapter
presents an overview of the use of Hidden Markov Models for classification, detection,
and clustering of bioacoustic signals.
Keywords: Feature extraction, Gaussian mixture models, Hidden Markov
models, Signal classification, Signal detection.
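
As a concrete illustration of the statement that a Hidden Markov Model takes an input sequence and identifies the most likely corresponding state sequence, the following minimal sketch applies Viterbi decoding to a toy two-state model. The state names ("silence", "call"), the discrete observation symbols ("low", "high"), and all probabilities are hypothetical values chosen for illustration only; practical systems typically operate on continuous acoustic features with Gaussian mixture emission models rather than on discrete symbols.

```python
import math


def viterbi(observations, states, log_start, log_trans, log_emit):
    """Return (best_path, best_log_prob) for a discrete-observation HMM."""
    # best[t][s]: log-probability of the best path ending in state s at time t
    best = [{s: log_start[s] + log_emit[s][observations[0]] for s in states}]
    back = []  # back[t-1][s]: predecessor of state s on that best path
    for obs in observations[1:]:
        scores, pointers = {}, {}
        for s in states:
            prev = max(states, key=lambda p: best[-1][p] + log_trans[p][s])
            scores[s] = best[-1][prev] + log_trans[prev][s] + log_emit[s][obs]
            pointers[s] = prev
        best.append(scores)
        back.append(pointers)
    # Trace back from the most likely final state
    last = max(states, key=lambda s: best[-1][s])
    path = [last]
    for pointers in reversed(back):
        path.append(pointers[path[-1]])
    path.reverse()
    return path, best[-1][last]


def to_log(table):
    """Convert a (possibly nested) dict of probabilities to log-probabilities."""
    return {k: to_log(v) if isinstance(v, dict) else math.log(v)
            for k, v in table.items()}


# Hypothetical toy model: all values below are illustrative assumptions.
states = ["silence", "call"]
log_start = to_log({"silence": 0.8, "call": 0.2})
log_trans = to_log({"silence": {"silence": 0.9, "call": 0.1},
                    "call": {"silence": 0.2, "call": 0.8}})
log_emit = to_log({"silence": {"low": 0.9, "high": 0.1},
                   "call": {"low": 0.3, "high": 0.7}})

observations = ["low", "high", "high", "low", "low"]
path, log_prob = viterbi(observations, states, log_start, log_trans, log_emit)
print(path, math.exp(log_prob))
```

Working in log-probabilities avoids numerical underflow on long sequences. Because the decoded path is free to dwell in each state for as many frames as the data support, the same model can match vocalizations of different durations, which is the non-linear temporal flexibility described above.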