Sensory Systems/Auditory System

From Wikibooks, open books for an open world
Jump to: navigation, search



The sensory system for the sense of hearing is the auditory system. This wikibook covers the physiology of the auditory system, and its application to the most successful neurosensory prosthesis - cochlear implants. The physics and engineering of acoustics are covered in a separate wikibook, Acoustics. An excellent source of images and animations is "Journey into the world of hearing" [1].

The ability to hear is not found as widely in the animal kingdom as other senses like touch, taste and smell. It is restricted mainly to vertebrates and insects. Within these, mammals and birds have the most highly developed sense of hearing. The table below shows frequency ranges of humans and some selected animals:

Humans 20-20'000 Hz
Whales 20-100'000 Hz
Bats 1'500-100'000 Hz
Fish 20-3'000 Hz

The organ that detects sound is the ear. It acts as receiver in the process of collecting acoustic information and passing it through the nervous system into the brain. The ear includes structures for both the sense of hearing and the sense of balance. It does not only play an important role as part of the auditory system in order to receive sound but also in the sense of balance and body position.

Mother and child
Humpback whales in the singing position
Big eared townsend bat
Hyphessobrycon pulchripinnis fish

Humans have a pair of ears placed symmetrically on both sides of the head which makes it possible to localize sound sources. The brain extracts and processes different forms of data in order to localize sound, such as:

  • the shape of the sound spectrum at the tympanic membrane (eardrum)
  • the difference in sound intensity between the left and the right ear
  • the difference in time-of-arrival between the left and the right ear
  • the difference in time-of-arrival between reflections of the ear itself (this means in other words: the shape of the pinna (pattern of folds and ridges) captures sound-waves in a way that helps localizing the sound source, especially on the vertical axis.

Healthy, young humans are able to hear sounds over a frequency range from 20 Hz to 20 kHz. We are most sensitive to frequencies between 2000 to 4000 Hz which is the frequency range of spoken words. The frequency resolution is 0.2% which means that one can distinguish between a tone of 1000 Hz and 1002 Hz. A sound at 1 kHz can be detected if it deflects the tympanic membrane (eardrum) by less than 1 Angstrom, which is less than the diameter of a hydrogen atom. This extreme sensitivity of the ear may explain why it contains the smallest bone that exists inside a human body: the stapes (stirrup). It is 0.25 to 0.33 cm long and weighs between 1.9 and 4.3 mg.

Anatomy of the Auditory System[edit]

Human (external) ear

The aim of this section is to explain the anatomy of the auditory system of humans. The chapter illustrates the composition of auditory organs in the sequence that acoustic information proceeds during sound perception.
Please note that the core information for “Sensory Organ Components” can also be found on the Wikipedia page “Auditory system”, excluding some changes like extensions and specifications made in this article. (see also: Wikipedia Auditory system)

The auditory system senses sound waves, that are changes in air pressure, and converts these changes into electrical signals. These signals can then be processed, analyzed and interpreted by the brain. For the moment, let's focus on the structure and components of the auditory system. The auditory system consists mainly of two parts:

  • the ear and
  • the auditory nervous system (central auditory system)

The ear[edit]

The ear is the organ where the first processing of sound occurs and where the sensory receptors are located. It consists of three parts:

  • outer ear
  • middle ear
  • inner ear
Anatomy of the human ear (green: outer ear / red: middle ear / purple: inner ear)

Outer ear[edit]

Function: Gathering sound energy and amplification of sound pressure.
The folds of cartilage surrounding the ear canal (external auditory meatus, external acoustic meatus) are called the pinna. It is the visible part of the ear. Sound waves are reflected and attenuated when they hit the pinna, and these changes provide additional information that will help the brain determine the direction from which the sounds came. The sound waves enter the auditory canal, a deceptively simple tube. The ear canal amplifies sounds that are between 3 and 12 kHz. At the far end of the ear canal is the tympanic membrane (eardrum), which marks the beginning of the middle ear.

Middle ear[edit]

Micro-CT image of the ossicular chain showing the relative position of each ossicle.

Function: Transmission of acoustic energy from air to the cochlea.
Sound waves traveling through the ear canal will hit the tympanic membrane (tympanum, eardrum). This wave information travels across the air-filled tympanic cavity (middle ear cavity) via a series of bones: the malleus (hammer), incus (anvil) and stapes (stirrup). These ossicles act as a lever and a teletype, converting the lower-pressure eardrum sound vibrations into higher-pressure sound vibrations at another, smaller membrane called the oval (or elliptical) window, which is one of two openings into the cochlea of the inner ear. The second opening is called round window. It allows the fluid in the cochlea to move. The malleus articulates with the tympanic membrane via the manubrium, whereas the stapes articulates with the oval window via its footplate. Higher pressure is necessary because the inner ear beyond the oval window contains liquid rather than air. The sound is not amplified uniformly across the ossicular chain. The stapedius reflex of the middle ear muscles helps protect the inner ear from damage. The middle ear still contains the sound information in wave form; it is converted to nerve impulses in the cochlea.

Inner ear[edit]

Structural diagram of the cochlea Cross section of the cochlea
Cochlea.svg Cochlea-crosssection.svg

Function: Transformation of mechanical waves (sound) into electric signals (neural signals).
The inner ear consists of the cochlea and several non-auditory structures. The cochlea is a snail-shaped part of the inner ear. It has three fluid-filled sections: scala tympani (lower gallery), scala media (middle gallery, cochlear duct) and scala vestibuli (upper gallery). The cochlea supports a fluid wave driven by pressure across the basilar membrane separating two of the sections (scala tympani and scala media). The basilar membrane is about 3 cm long and between 0.5 to 0.04 mm wide. Reissner’s membrane (vestibular membrane) separates scala media and scala vestibuli. Strikingly, one section, the scala media, contains an extracellular fluid similar in composition to endolymph, which is usually found inside of cells. The organ of Corti is located in this duct, and transforms mechanical waves to electric signals in neurons. The other two sections, scala tympani and scala vestibuli, are located within the bony labyrinth which is filled with fluid called perilymph. The chemical difference between the two fluids endolymph (in scala media) and perilymph (in scala tympani and scala vestibuli) is important for the function of the inner ear.

Organ of Corti[edit]

The organ of Corti forms a ribbon of sensory epithelium which runs lengthwise down the entire cochlea. The hair cells of the organ of Corti transform the fluid waves into nerve signals. The journey of a billion nerves begins with this first step; from here further processing leads to a series of auditory reactions and sensations.

Transition from ear to auditory nervous system[edit]

Section through the spiral organ of Corti

Hair cells[edit]

Hair cells are columnar cells, each with a bundle of 100-200 specialized cilia at the top, for which they are named. These cilia are the mechanosensors for hearing. The shorter ones are called stereocilia, and the longest one at the end of each haircell bundlekinocilium. The location of the kinocilium determines the on-direction, i.e. the direction of deflection inducing the maximum hair cell excitation. Lightly resting atop the longest cilia is the tectorial membrane, which moves back and forth with each cycle of sound, tilting the cilia and allowing electric current into the hair cell.
The function of hair cells is not fully established up to now. Currently, the knowledge of the function of hair cells allows to replace the cells by cochlear implants in case of hearing lost. However, more research into the function of the hair cells may someday even make it possible for the cells to be repaired. The current model is that cilia are attached to one another by “tip links”, structures which link the tips of one cilium to another. Stretching and compressing, the tip links then open an ion channel and produce the receptor potential in the hair cell. Note that a deflection of 100 nanometers already elicits 90% of the full receptor potential.


The nervous system distinguishes between nerve fibres carrying information towards the central nervous system and nerve fibres carrying the information away from it:

  • Afferent neurons (also sensory or receptor neurons) carry nerve impulses from receptors (sense organs) towards the central nervous system
  • Efferent neurons (also motor or effector neurons) carry nerve impulses away from the central nervous system to effectors such as muscles or glands (and also the ciliated cells of the inner ear)

Afferent neurons innervate cochlear inner hair cells, at synapses where the neurotransmitter glutamate communicates signals from the hair cells to the dendrites of the primary auditory neurons. There are far fewer inner hair cells in the cochlea than afferent nerve fibers. The neural dendrites belong to neurons of the auditory nerve, which in turn joins the vestibular nerve to form the vestibulocochlear nerve, or cranial nerve number VIII.
Efferent projections from the brain to the cochlea also play a role in the perception of sound. Efferent synapses occur on outer hair cells and on afferent (towards the brain) dendrites under inner hair cells.

Auditory nervous system[edit]

The sound information, now re-encoded in form of electric signals, travels down the auditory nerve (acoustic nerve, vestibulocochlear nerve, VIIIth cranial nerve), through intermediate stations such as the cochlear nuclei and superior olivary complex of the brainstem and the inferior colliculus of the midbrain, being further processed at each waypoint. The information eventually reaches the thalamus, and from there it is relayed to the cortex. In the human brain, the primary auditory cortex is located in the temporal lobe.

Primary auditory cortex[edit]

The primary auditory cortex is the first region of cerebral cortex to receive auditory input. Perception of sound is associated with the right posterior superior temporal gyrus (STG). The superior temporal gyrus contains several important structures of the brain, including Brodmann areas 41 and 42, marking the location of the primary auditory cortex, the cortical region responsible for the sensation of basic characteristics of sound such as pitch and rhythm. The auditory association area is located within the temporal lobe of the brain, in an area called the Wernicke's area, or area 22. This area, near the lateral cerebral sulcus, is an important region for the processing of acoustic signals so that they can be distinguished as speech, music, or noise.

Auditory Signal Processing[edit]

Now that the anatomy of the auditory system has been sketched out, this topic goes deeper into the physiological processes which take place while perceiving acoustic information and converting this information into data that can be handled by the brain. Hearing starts with pressure waves hitting the auditory canal and is finally perceived by the brain. This section details the process transforming vibrations into perception.

Effect of the head[edit]

Sound waves with a wavelength shorter than the head produce a sound shadow on the ear further away from the sound source. When the wavelength is shorter than the head, diffraction of the sound leads to approximately equal sound intensities on both ears.

Difference in loudness and timing help us to localize the source of a sound signal.

Sound reception at the pinna[edit]

The pinna collects sound waves in air affecting sound coming from behind and the front differently with its corrugated shape. The sound waves are reflected and attenuated or amplified. These changes will later help sound localization.

In the external auditory canal, sounds between 3 and 12 kHz - a range crucial for human communication - are amplified. It acts as resonator amplifying the incoming frequencies.

Sound conduction to the cochlea[edit]

Sound that entered the pinna in form of waves travels along the auditory canal until it reaches the beginning of the middle ear marked by the tympanic membrane (eardrum). Since the inner ear is filled with fluid, the middle ear is kind of an impedance matching device in order to solve the problem of sound energy reflection on the transition from air to the fluid. As an example, on the transition from air to water 99.9% of the incoming sound energy is reflected. This can be calculated using:

 \frac{I_r}{I_i} = \left ( \frac {Z_2 - Z_1}{Z_2 + Z_1} \right ) ^2

with Ir the intensity of the reflected sound, Ii the intensity of the incoming sound and Zk the wave resistance of the two media ( Zair = 414 kg m-2 s-1 and Zwater = 1.48*106 kg m-2 s-1). Three factors that contribute the impedance matching are:

  • the relative size difference between tympanum and oval window
  • the lever effect of the middle ear ossicles and
  • the shape of the tympanum.
Mechanics of the amplification effect of the middle ear.

The longitudinal changes in air pressure of the sound-wave cause the tympanic membrane to vibrate which, in turn, makes the three chained ossicles malleus, incus and stirrup oscillate synchronously. These bones vibrate as a unit, elevating the energy from the tympanic membrane to the oval window. In addition, the energy of sound is further enhanced by the areal difference between the membrane and the stapes footplate. The middle ear acts as an impedance transformer by changing the sound energy collected by the tympanic membrane into greater force and less excursion. This mechanism facilitates transmission of sound-waves in air into vibrations of the fluid in the cochlea. The transformation results from the pistonlike in- and out-motion by the footplate of the stapes which is located in the oval window. This movement performed by the footplate sets the fluid in the cochlea into motion.

Through the stapedius muscle, the smallest muscle in the human body, the middle ear has a gating function: contracting this muscle changes the impedance of the middle ear, thus protecting the inner ear from damage through loud sounds.

Frequency analysis in the cochlea[edit]

The three fluid-filled compartements of the cochlea (scala vestibuli, scala media, scala tympani) are separated by the basilar membrane and the Reissner’s membrane. The function of the cochlea is to separate sounds according to their spectrum and transform it into a neural code. When the footplate of the stapes pushes into the perilymph of the scala vestibuli, as a consequence the membrane of Reissner bends into the scala media. This elongation of Reissner’s membrane causes the endolymph to move within the scala media and induces a displacement of the basilar membrane. The separation of the sound frequencies in the cochlea is due to the special properties of the basilar membrane. The fluid in the cochlea vibrates (due to in- and out-motion of the stapes footplate) setting the membrane in motion like a traveling wave. The wave starts at the base and progresses towards the apex of the cochlea. The transversal waves in the basilar membrane propagate with

 c_{trans} = \sqrt{\frac{\mu}{\rho}}

with μ the shear modulus and ρ the density of the material. Since width and tension of the basilar membrane change, the speed of the waves propagating along the membrane changes from about 100 m/s near the oval window to 10 m/s near the apex.

There is a point along the basilar membrane where the amplitude of the wave decreases abruptly. At this point, the sound wave in the cochlear fluid produces the maximal displacement (peak amplitude) of the basilar membrane. The distance the wave travels before getting to that characteristic point depends on the frequency of the incoming sound. Therefore each point of the basilar membrane corresponds to a specific value of the stimulating frequency. A low-frequency sound travels a longer distance than a high-frequency sound before it reaches its characteristic point. Frequencies are scaled along the basilar membrane with high frequencies at the base and low frequencies at the apex of the cochlea.

The position x of the maximal amplitude of the travelling wave corresponds in a 1-to-1 way to a stimulus frequency.

Sensory transduction in the cochlea[edit]

Most everyday sounds are composed of multiple frequencies. The brain processes the distinct frequencies, not the complete sounds. Due to its inhomogeneous properties, the basilar membrane is performing an approximation to a Fourier transform. The sound is thereby split into its different frequencies, and each hair cell on the membrane corresponds to a certain frequency. The loudness of the frequencies is encoded by the firing rate of the corresponding afferent fiber. This is due to the amplitude of the traveling wave on the basilar membrane, which depends on the loudness of the incoming sound.

Transduction mechanism in auditory or vestibular hair cell. Tilting the hair cell towards the kinocilium opens the potassium ion channels. This changes the receptor potential in the hair cell. The resulting emission of neurotransmitters can elicit an action potential (AP) in the post-synaptic cell.
Auditory haircells are very similar to those of the vestibular system. Here an electron microscopy image of a frog's sacculus haircell.

The sensory cells of the auditory system, known as hair cells, are located along the basilar membrane within the organ of Corti. Each organ of Corti contains about 16’000 such cells, innervated by about 30'000 afferent nerve fibers. There are two anatomically and functionally distinct types of hair cells: the inner and the outer hair cells. Along the basilar membrane these two types are arranged in one row of inner cells and three to five rows of outer cells. Most of the afferent innervation comes from the inner hair cells while most of the efferent innervation goes to the outer hair cells. The inner hair cells influence the discharge rate of the individual auditory nerve fibers that connect to these hair cells. Therefore inner hair cells transfer sound information to higher auditory nervous centers. The outer hair cells, in contrast, amplify the movement of the basilar membrane by injecting energy into the motion of the membrane and reducing frictional losses but do not contribute in transmitting sound information. The motion of the basilar membrane deflects the stereocilias (hairs on the hair cells) and causes the intracellular potentials of the hair cells to decrease (depolarization) or increase (hyperpolarization), depending on the direction of the deflection. When the stereocilias are in a resting position, there is a steady state current flowing through the channels of the cells. The movement of the stereocilias therefore modulates the current flow around that steady state current.

Lets look at the modes of action of the two different hair cell types separately:

  • Inner hair cells:

The deflection of the hair-cell stereocilia opens mechanically gated ion channels that allow small, positively charged potassium ions (K+) to enter the cell and causing it to depolarize. Unlike many other electrically active cells, the hair cell itself does not fire an action potential. Instead, the influx of positive ions from the endolymph in scala media depolarizes the cell, resulting in a receptor potential. This receptor potential opens voltage gated calcium channels; calcium ions (Ca2+) then enter the cell and trigger the release of neurotransmitters at the basal end of the cell. The neurotransmitters diffuse across the narrow space between the hair cell and a nerve terminal, where they then bind to receptors and thus trigger action potentials in the nerve. In this way, neurotransmitter increases the firing rate in the VIIIth cranial nerve and the mechanical sound signal is converted into an electrical nerve signal.
The repolarization in the hair cell is done in a special manner. The perilymph in Scala tympani has a very low concentration of positive ions. The electrochemical gradient makes the positive ions flow through channels to the perilymph. (see also: Wikipedia Hair cell)

  • Outer hair cells:

In humans outer hair cells, the receptor potential triggers active vibrations of the cell body. This mechanical response to electrical signals is termed somatic electromotility and drives oscillations in the cell’s length, which occur at the frequency of the incoming sound and provide mechanical feedback amplification. Outer hair cells have evolved only in mammals. Without functioning outer hair cells the sensitivity decreases by approximately 50 dB (due to greater frictional losses in the basilar membrane which would damp the motion of the membrane). They have also improved frequency selectivity (frequency discrimination), which is of particular benefit for humans, because it enables sophisticated speech and music. (see also: Wikipedia Hair cell)

With no external stimulation, auditory nerve fibres discharge action potentials in a random time sequence. This random time firing is called spontaneous activity. The spontaneous discharge rates of the fibers vary from very slow rates to rates of up to 100 per second. Fibers are placed into three groups depending on whether they fire spontaneously at high, medium or low rates. Fibers with high spontaneous rates (> 18 per second) tend to be more sensitive to sound stimulation than other fibers.

Auditory pathway of nerve impulses[edit]

Lateral lemniscus in red, as it connects the cochlear nucleus, superior olivary nucleus and the inferior colliculus. Seen from behind.

So in the inner hair cells the mechanical sound signal is finally converted into electrical nerve signals. The inner hair cells are connected to auditory nerve fibres whose nuclei form the spiral ganglion. In the spiral ganglion the electrical signals (electrical spikes, action potentials) are generated and transmitted along the cochlear branch of the auditory nerve (VIIIth cranial nerve) to the cochlear nucleus in the brainstem.

From there, the auditory information is divided into at least two streams:

  • Ventral Cochlear Nucleus:

One stream is the ventral cochlear nucleus which is split further into the posteroventral cochlear nucleus (PVCN) and the anteroventral cochlear nucleus (AVCN). The ventral cochlear nucleus cells project to a collection of nuclei called the superior olivary complex.

Superior olivary complex: Sound localization[edit]

The superior olivary complex - a small mass of gray substance - is believed to be involved in the localization of sounds in the azimuthal plane (i.e. their degree to the left or the right). There are two major cues to sound localization: Interaural level differences (ILD) and interaural time differences (ITD). The ILD measures differences in sound intensity between the ears. This works for high frequencies (over 1.6 kHz), where the wavelength is shorter than the distance between the ears, causing a head shadow - which means that high frequency sounds hit the averted ear with lower intensity. Lower frequency sounds don't cast a shadow, since they wrap around the head. However, due to the wavelength being larger than the distance between the ears, there is a phase difference between the sound waves entering the ears - the timing difference measured by the ITD. This works very precisely for frequencies below 800 Hz, where the ear distance is smaller than half of the wavelength. Sound localization in the median plane (front, above, back, below) is helped through the outer ear, which forms direction-selective filters.

There, the differences in time and loudness of the sound information in each ear are compared. Differences in sound intensity are processed in cells of the lateral superior olivary complexm and timing differences (runtime delays) in the medial superior olivary complex. Humans can detect timing differences between the left and right ear down to 10 μs, corresponding to a difference in sound location of about 1 deg. This comparison of sound information from both ears allows the determination of the direction where the sound came from. The superior olive is the first node where signals from both ears come together and can be compared. As a next step, the superior olivary complex sends information up to the inferior colliculus via a tract of axons called lateral lemniscus. The function of the inferior colliculus is to integrate information before sending it to the thalamus and the auditory cortex. It is interesting to know that the superior colliculus close by shows an interaction of auditory and visual stimuli.

  • Dorsal Cochlear Nucleus:

The dorsal cochlear nucleus (DCN) analyzes the quality of sound and projects directly via the lateral lemnisucs to the inferior colliculus.

From the inferior colliculus the auditory information from ventral as well as dorsal cochlear nucleus proceeds to the auditory nucleus of the thalamus which is the medial geniculate nucleus. The medial geniculate nucleus further transfers information to the primary auditory cortex, the region of the human brain that is responsible for processing of auditory information, located on the temporal lobe. The primary auditory cortex is the first relay involved in the conscious perception of sound.

Primary auditory cortex and higher order auditory areas[edit]

Sound information that reaches the primary auditory cortex (Brodmann areas 41 and 42). The primary auditory cortex is the first relay involved in the conscious perception of sound. It is known to be tonotopically organized and performs the basics of hearing: pitch and volume. Depending on the nature of the sound (speech, music, noise), is further passed to higher order auditory areas. Sounds that are words are processed by Wernicke’s area (Brodmann area 22). This area is involved in understanding written and spoken language (verbal understanding). The production of sound (verbal expression) is linked to Broca’s area (Brodmann areas 44 and 45). The muscles to produce the required sound when speaking are contracted by the facial area of motor cortex which are regions of the cerebral cortex that are involved in planning, controlling and executing voluntary motor functions.

Lateral surface of the brain with Brodmann's areas numbered.

Human Speech[edit]



The intensity of sound is typically expressed in deciBel (dB), defined as

 SPL = 20 * log \frac{p}{p_0}

where SPL = “sound pressure level” (in dB), and the reference pressure is p_0 = 2*10^{-5} N/m^2 . Note that this is much smaller than the air pressure (ca. 105 N/m2)! Also watch out, because sound is often expressed relative to "Hearing Level" instead of SPL.

  • 0 - 20 dB SPL ... hearing level (0 dB for sinusoidal tones, from 1 kHz – 4 kHz)
  • 60 dB SPL ... medium loud tone, conversational speech

Fundamental frequency, from the vibrations of the vocal cords in the larynx, is about 120 Hz for adult male, 250 Hz for adult female, and up to 400 Hz for children.

Frequency- and loudness-dependence of human hearing loss.


Formants are the dominant frequencies in human speech, and are caused by resonances of the signals from the vocal cord in our mouth etc. Formants show up as distinct peaks of energy in the sound's frequency spectrum. They are numbered in ascending order starting with the format at the lowest frequency.

Spectrogram of the German vowels "a,e,i,o,u". These correspond approximately to the vowels in the English words "hut, hat, hit, hot, put". Calculated using the MATLAB command "spectrogram(data, 512,256, 512, fs)". The chapter Power Spectrum of Non-stationary Signals below describes the mathematics behind the spectrogram.


Speech is often considered to consist of a sequence of acoustic units called phons, which correspond to linguistic units called phonemes. Phonemes are the smallest units of sound that allows different words to be distinguished. The word "dog", for example, contains three phonemes. Changes to the first, second, and third phoneme respectively produce the words "log", "dig", and "dot". English is said to contain 40 different phonemes, specified as in /d/, /o/, /g/ for the word "dog".

Speech Perception[edit]

The ability of humans to decode speech signals still easily exceeds that of any algorithm developed so far. While automatic speech recognition has become fairly successful in recognizing clearly spoken speech in environments with high Signal-to-noise ratio, once the conditions become a bit less than ideal, recognition algorithms tend to perform vary poorly compared to humans. It seems from this that our computer speech recognition algorithms have not yet come close to capturing the underlying algorithm that humans use to recognize speech.

Evidence has shown that the perception of speech takes quite a different route than the perception of other sounds in the brain. While studies on non-speech sound responses have generally found response to be graded with stimulus, speech studies have repeatedly found a discretization of response when a graded stimulus is presented. For instance, Lisker and Abramson,[2] played a pre-voiced 'b/p' sound. Whether the sound is interpreted as a /b/ or a /p/ depends on the voice onset time (VOT). They found that when smoothly varying the VOT, there was a sharp change (at ~20ms after the consonant is played) where subjects switched their identification from /b/ to /p/. Furthermore, subjects had a great deal of difficulty differentiating between two sounds in the same category (e.g. pairs of sounds with a VOTs of -10ms to 10m, which would both be /b/'s, than sounds with a 10ms to 30ms, which would be identified as a b and a p). This shows that some type of categorization scheme is going on. One of the main problems encountered when trying to build a model of speech perception is the so-called 'Lack of Invariance', which could more straightforwardly just be stated as the 'variance'. This term refers to the fact that a single phoneme (e.g. /p/ as in sPeech or Piety), has a great variety of waveforms that map to it, and that the mapping between an acoustic waveform and a phoneme is far from obvious and heavily context-dependent, yet human listeners reliably give the correct result. Even when the context is similar, a waveform will show a great deal of variance due to factors such as the pace of speech, the identity of the speaker and the tone in which he is speaking. So while there is no agreed-upon model of speech perception, the existing models can be split into two classes: Passive Perception and Active perception.

Passive Perception Models[edit]

Passive perception theories generally describe the problem of speech perception in the same way that most sensory signal-processing algorithms do: Some raw input signal goes in, and is processed though a hierarchy where each subsequent step extracts some increasingly abstract signal from the input. One of the early examples of a passive model was distinctive feature theory. The idea is to identify the presence of sets of binary values for certain features. For example, 'nasal/oral', 'vocalic/non-vocalic'. The theory is that a phoneme is interpreted as a binary vector of the presence or absence of these features. These features can be extracted from the spectrogram data. Other passive models, such as those described by Selfridge [3] and Uttley,[4] involve a kind of template-matching, where a hierarchy of processing layers extract features that are increasingly abstract and invariant to certain irrelevant features (such as identity of the speaker when classifying phonemes).

Active Perception Models[edit]

An entirely different take on speech perception are active-perception theories. These theories make the point that it would be redundant for the brain to have two parallel systems for speech perception and speech production, given that the ability produce a sound is so closely tied with the ability to identify it - proponents of these theories argue that it would be wasteful and complicated to maintain two separate databases-one containing the programs to identify phonemes, and another to produce them. They argue that speech perception is actually done by attempting to replicate the incoming signal, and thus using the same circuits for phoneme production as for identification. The Motor Theory of speech perception (Liberman et al., 1967), states that speech sounds are identified not by any sort of template matching, but by using the speech-generating mechanisms to try and regenerate a copy of the speech signal. It states that phonemes should not be seen as hidden signals within the speech, but as “cues” that the generating mechanism attempts to reproduce in a pre-speech signal. The theory states that speech-generating regions of the brain learn which speech-precursor signals will produce which sounds by the constant feedback loop of always hearing one's own speech. The babbling of babies, it is argued, is a way of learning this how to generate these “cue” sounds from pre-motor signals.[5]

A similar idea is proposed in the analysis-by-synthesis model, by Stevens and Halle.[6] This describes a generative model which attempts to regenerate a similar signal to the incoming sound. It essentially takes advantage of the fact that speech-generating mechanisms are similar between people, and that the characteristic features that one hears in speech can be reproduced by the speaker. As the speaker hears the sound, the speech centers attempt to generate the signal that's coming in. Comparators give constant feedback on the quality of the regeneration. The 'units of perception', are therefore not so much abstractions of the incoming sound, as pre-motor commands for generating the same speech.

Motor theories took a serious hit when a series of studies on what is now known as Broca's Aphasia were published. This condition impairs one's ability to produce speech sounds, without impairing the ability to comprehend them, whereas motor theory, taken in its original form, states that production and comprehension are done by the same circuits, so impaired speech production should imply impaired speech comprehension. The existence of Broca's aphasia appears to contradicts this prediction.[7]

Current Models[edit]

The TRACE model of speech perception. All connections beyond the input layer are bidirectional. Each unit represents some unit of speech such as a word of a phoneme.

One of the most influential computational models of speech perception is called TRACE.[8] TRACE is a neural-network-like model, with three layers and a recurrent connection scheme. The first layer extracts features from an input spectrogram in temporal order, basically simulating the cochlea. The second layer extracts phonemes from the feature information, and the third layer extracts words from the phoneme information. The model contains feed-forward (bottom-up) excitatory connections, lateral inhibitory connections, and feedback (top-down) excitatory connections. In this model, each computational unit corresponds to some unit of perception (e.g. the phoneme /p/ or the word "preposterous"). The basic idea is that, based on their input, units within a layer will compete to have the strongest output. The lateral inhibitory connections result in a sort of winner-takes-all circuit, in which the unit with the strongest input will inhibit its neighbors and become the clear winner. The feedback connections allow us to explain the effect of context-dependent comprehension - for example, suppose the phoneme layer, based on its bottom-up inputs, could not decide whether it had heard a /g/ or a /k/, but that the phoneme was preceded by 'an', and followed by 'ry'. Both the /g/ and /k/ units would initially be equally activated, sending inputs up to the word level, which would already contain excited units corresponding to words such as 'anaconda', 'angry', and 'ankle', which had been activated by the preceding 'an'. The excitement of the /g/ or /k/

Cochlear Implants[edit]

Cochlear implant

A cochlear implant (CI) is a surgically implanted electronic device that replaces the mechanical parts of the auditory system by directly stimulating the auditory nerve fibers through electrodes inside the cochlea. Candidates for cochlear implants are people with severe to profound sensorineural hearing loss in both ears and a functioning auditory nervous system. They are used by post-lingually deaf people to regain some comprehension of speech and other sounds as well as by pre-lingually deaf children to enable them to gain spoken language skills. (Diagnosis of hearing loss in newborns and infants is done using otoacoustic emissions, and/or the recording of auditory evoked potentials.) A quite recent evolution is the use of bilateral implants allowing recipients basic sound localization.

Parts of the cochlear implant[edit]

The implant is surgically placed under the skin behind the ear. The basic parts of the device include:


  • a microphone which picks up sound from the environment
  • a speech processor which selectively filters sound to prioritize audible speech and sends the electrical sound signals through a thin cable to the transmitter,
  • a transmitter, which is a coil held in position by a magnet placed behind the external ear, and transmits the processed sound signals to the internal device by electromagnetic induction,


The internal part of a cochlear implant (model Cochlear Freedom 24 RE)
  • a receiver and stimulator secured in bone beneath the skin, which converts the signals into electric impulses and sends them through an internal cable to electrodes,
  • an array of up to 24 electrodes wound through the cochlea, which send the impulses to the nerves in the scala tympani and then directly to the brain through the auditory nerve system

Signal processing for cochlear implants[edit]

In normal hearing subjects, the primary information carrier for speech signals is the envelope, whereas for music, it is the fine structure. This is also relevant for tonal languages, like Mandarin, where the meaning of words depends on their intonation. It was also found that interaural time delays coded in the fine structure determine where a sound is heard from rather than interaural time delays coded in the envelope, although it is still the speech signal coded in the envelope that is perceived.

The speech processor in a cochlear implant transforms the microphone input signal into a parallel array of electrode signals destined for the cochlea. Algorithms for the optimal transfer function between these signals are still an active area of research. The first cochlear implants were single-channel devices. The raw sound was band-passed filtered to include only the frequency range of speech, then modulated onto a 16 kHz wave to allow the electrical signal to electrically couple to the nerves. This approach was able to provide very basic hearing, but was extremely limited in that it was completely unable to take advantage of the frequency-location map of the cochlea.

The advent of multi-channel implants opened the door to try a number of different speech-processing strategies to facilitate hearing. These can be roughly divided into Waveform and Feature-Extraction strategies.

Waveform Strategies[edit]

These generally involve applying a non-linear gain on the sound (as an input audio signal with a ~30dB dynamic range must be compressed into an electrical signal with just a ~5dB dynamic range), and passing it through parallel filter banks. The first waveform strategy to be tried was Compressed Analog approach. In this system, the raw audio is initially filtered with a gain-controlled amplifier (the gain-control reduces the dynamic range of the signal). The signal is then passed through parallel band-pass filters, and the output of these filters goes on to stimulate electrodes at their appropriate locations.

A problem with the Compressed Analog approach was that the there was a strong interaction-effect between adjacent electrodes. If electrodes driven by two filters happened to be stimulating at the same time, the superimposed stimulation could cause unwanted distortion in the signals coming from hair cells that were within range of both of these electrodes. The solution to this was the Continuous Interleaved Sampling Approach - in which the electrodes driven by adjacent filters stimulate at slightly different times. This eliminates the interference effect between nearby electrodes, but introduces the problem that, due to the interleaving, temporal resolution suffers.

Schematic representation of Continuous Interleaved Sampling (CIS). The processing ("Proc") comprises the envelope detection, amplitude compression, digitization, and pulse modulation.

Feature-Extraction Strategies[edit]

These strategies focus less on transmitting filtered versions of the audio signal and more on extracting more abstract features of the signal and transmitting them to the electrodes. The first feature-extraction strategies looked for the formants (frequencies with maximum energy) in speech. In order to do this, they would apply wide band filters (e.g. 270 Hz low-pass for F0 - the base formant, 300 Hz-1 kHz for F1, and 1 kHz-4 kHz for F2), then calculate the formant frequency, using the zero-crossings of each of these filter outputs, and formant-amplitude by looking at the envelope of the signals from each filter. Only electrodes corresponding to these formant frequencies would be activated. The main limitation of this approach was that formants primarily identify vowels, and consonant information, which primarily resides in higher frequencies, was poorly transmitted. The MPEAK system later improved on this design my incorporating high-frequency filters which could better simulate unvoiced sounds (consonants) by stimulating high-frequency electrodes, and formant frequency electrodes at random intervals.[9][10][11]

Current Developments[edit]

Block diagram of the SPEAK processing scheme

Currently, the leading strategy is the SPEAK system, which combines characteristics of Waveform and Feature-Detection strategies. In this system, the signal passes through a parallel array of 20 band-pass filters. The envelope is extracted from each of these and several of the most powerful frequencies are selected (how many depends on the shape of the spectrum), and the rest are discarded. This is known as a 'n-of-m" strategy. The amplitudes of these are then logarithmically compressed to adapt the mechanical signal range of sound to the much narrower electrical signal range of hair cells.

Multiple microphones[edit]

On its newest implants, the company Cochlear uses 3 microphones instead of one. The additional information is used for beam-forming, i.e. extracting more information from sound coming from straight ahead. This can improve the signal-to-noise ratio when talking to other people by up to 15dB, thereby significantly enhancing speech perception in noisy environments.

Integration CI – Hearing Aid[edit]

Preservation of low-frequency hearing after cochlear implantation is possible with careful surgical technique and with careful attention to electrode design. For patients with remaining low-frequency hearing, the company MedEl offers a combination of a cochlea implant for the higher frequencies, and classical hearing aid for the lower frequencies. This system, called EAS for electric-acoustic stimulation, uses with a lead of 18mm, compared to 31.5 mm for the full CI. (The length of the cochlea is about 36 mm.) This results in a significant improvement of music perception, and improved speech recognition for tonal languages.

Fine Structure[edit]

Graph showing how envelope (in red) and phase (black dots, for zero crossings) of a signal can be simply derived with the Hilbert Transform.

For high frequencies, the human auditory system uses only tonotopic coding for information. For low frequencies, however, also temporal information is used: the auditory nerve fires synchronously with the phase of the signal. In contrast, the original CIs only used the power spectrum of the incoming signal. In its new models, MedEl incorporates the timing information for low frequencies, which it calls fine structure, in determining the timing of the stimulation pulses. This improves music perception, and speech perception for tonal languages like Mandarin.

Mathematically, envelope and fine-structure of a signal can be elegantly obtained with the Hilbert Transform (see Figure). The corresponding Python code is available under.[12]

Virtual Electrodes[edit]

The numbers of electrodes available is limited by the size of the electrode (and the resulting charge and current densities), and by the current spread along the endolymph. To increase the frequency specificity, one can stimulate two adjacent electrodes. Subjects report to perceive this as a single tone at a frequency intermediate to the two electrodes.

Simulation of the stimulation strength of a cochlear implant

Simulation of a cochlear implant[edit]

Sound processing in cochlear implant is still subject to a lot of research and one of the major product differentiations between the manufacturers. However, the basic sound processing is rather simple and can be implemented to gain an impression of the quality of sound perceived by patients using a cochlear implant. The first step in the process is to sample some sound and analyze its frequency. Then a time-window is selected, during which we want to find the stimulation strengths of the CI electrodes. There are two ways to achieve that: i) through the use of linear filters ( see Gammatone filters); or ii) through the calculation of the powerspectrum (see Spectral Analysis).

Cochlear implants and Magnetic Resonance Imaging[edit]

With more than 150 000 implantations worldwide, Cochlear Implants (CIs) have now become a standard method for treating severe to profound hearing loss. Since the benefits of CIs become more evident, payers become more willing to support CIs and due to the screening programs of newborns in most industrialized nations, many patients get CIs in infancy and will likely continue to have them throughout their lives. Some of them may require diagnostic scanning during their lives which may be assisted by imaging studies with Magnetic resonance imaging (MRI). For large segments of the population, including patients suffering from stroke, back pain or headache, MRI has become a standard method for diagnosis. MRI uses pulses of magnetic fields to generate images and current MRI machines are working with 1.5 Tesla magnet fields. 0.2 to 4.0 Tesla devices are common and the radiofrequency power can peak as high as 6 kW in a 1.5 Tesla machine.

Cochlear implants have been historically thought to incompatible with MRI with magnetic fields higher than 0.2 T. The external parts of the device always have to be removed. There are different regulations for the internal parts of the device. Current US Food and Drug Administration (FDA) guidelines allow limited use of MRI after CI implantation. The pulsar and Sonata (MED-EL Corp, Innsbruck, Austria) devices are approved for 0.2 T MRI with the magnet in place. The Hi-res 90K (Advanced Bionics Corp, Sylmar, CA, USA) and the Nucleus Freedom (Cochlear Americas, Englewood, CO, USA) are approved for up to 1.5 T MRI after surgical removal of the internal magnet. Each removal and replacement of the magnet can be done using a small incision under local anesthesia, but the procedure is likely to weaken the pocket of the magnet and to risk infection of the patient.

Cadaver studies have shown that there is a risk that the implant may be displaced from the internal device in a 1.5 T MRI scanner. However, the risk could be eliminated when a compression dressing was applied. Nevertheless, the CI produces an artifact that could potentially reduce the diagnostic value of the scan. The size of the artifact will be larger relative to the size of the patient’s head and this might be particularly challenging for MRI scans with children. A recent study by Crane et al., 2010 found out that the artifact around the area of the CI had a mean anterior-posterior dimension of 6.6 +/- 1.5 cm (mean +/- standard deviation) and a left-right dimension averaging 4.8 +/- 1.0 cm (mean +/- standard deviation) (Crane et al., 2010). ([13])

Computer Simulations of the Auditory System[edit]

Working with Sound[edit]

Audio signals can be stored in a variety of formats. They can be uncompressed or compressed, and the encoding can be open or proprietary. On Windows systems, the most common format is the WAV-format. It contains a header with information about the number of channels, sample rate, bits per sample etc. This header is followed by the data themselves. The usual bitstream encoding is the linear pulse-code modulation (LPCM) format.

Many programing languages provide commands for reading and writing WAV-files. When working with data in other formats, you have two options:

  • You can either you convert them into WAV-format, and go on from there. A very comprehensive free cross-platform solution to record, convert and stream audio and video is ffmpeg (
  • Or you can obtain special programs moduls for reading/writing the desired format.

Reminder of Fourier Transformations[edit]

To transform a continuous function, one uses the Fourier Integral:

F(k)=\int_{-\infty}^{\infty} {f(t)} \cdot e^{-2 \pi ikt} dt

where k represents frequency. Note that F(k) is a complex value: its absolute value gives us the amplitude of the function, and its phase defines the phase-shift between cosine and sine components.

The inverse transform is given by

f(t)=\int_{-\infty}^{\infty} F(k) \cdot e^{2 \pi ikt} dk
Fourier Transformation: a sum of sine-waves can make up any repititive waveform.

If the data are sampled with a constant sampling frequency and there are N data points,

f(\tau)= \sum_{n=0}^{N-1} F_n e^{2 \pi in \tau /N}

The coefficients Fn can be obtained by

 F_n = \sum_{\tau = 0}^{N-1} f(\tau) \cdot e^{-2 \pi in \tau/N}

Since there are a discrete, limited number of data points and with a discrete, limited number of waves, this transform is referred to as Discrete Fourier Transform (DFT). The Fast Fourier Transform (FFT) is just a special case of the DFT, where the number of points is a power of 2: N = 2^n .

Note that each F_n is a complex number: its magnitude defines to the amplitude of the corresponding frequency component in the signal; and the phase of F_n defines the corresponding phase (see illustration). If the signal in the time domain "f(t)" is real valued, as is the case with most measured data, this puts a constraint on the corresponding frequency components: in that case we have

 F_n = F_{N-n}^*

A frequent source of confusion is the question: “Which frequency corresponds to F_n?” If there are N data points and the sampling period is ''T_s'', the n^{th} frequency is given by

 f_n = \frac{n}{N \cdot T_s}, 1 \le n \le N (in \; Hz)

In other words, the lowest frequency is \frac{1}{N \cdot T_s} [in Hz], while the highest independent frequency is  \frac{1}{2T_s} due to the Nyquist-Shannon theorem. Note that in MATLAB, the first return value corresponds to the offset of the function, and the second value to n=1!

Spectral Analysis of Biological Signals[edit]

Power Spectrum of Stationary Signals[edit]

Most FFT functions and algorithms return the complex Fourier coefficients F_n. If we are only interested in the magnitude of the contribution at the corresponding frequency, we can obtain this information by

 P_n = F_n \cdot F_n^* = |F_n|^2

This is the power spectrum of our signal, and tells us how big the contribution of the different frequencies is.

Power Spectrum of Non-stationary Signals[edit]

Often one has to deal with signals that are changing their characteristics over time. In that case, one wants to know how the power spectrum changes with time. The simplest way is to take only a short segment of data at a time, and calculate the corresponding power spectrum. This approach is called Short Time Fourier Transform (STFT). However in that case edge effects can significantly distort the signals, since we are assuming that our signal is periodic.

"Hanning window"

To eliminate edge artifacts, the signals can be filtered, or "windowed". An examples of such a window is shown in the figure above. While some windows provide better frequency resolution (e.g. the rectangular window), others exhibit fewer artifacts such as spectral leakage (e.g. Hanning window). For a selected section of the signal, the data resulting from windowing are obtained by multiplying the signal with the window (left Figure):

Effects of windowing a signal. STFT Example.png

An example can show how cutting a signal, and applying a window to it, can affect the spectral power distribution, is shown in the right figure above. (The corrsponding Python code can be found at [14] ) Note that decreasing the width of the sample window increases the width of the corresponding powerspectrum!

Stimulation strength for one time window[edit]

To obtain the power spectrum for one selected time window, the first step is to calculate the power spectrum through the Fast Fourier Transform (FFT) of the time signal. The result is the sound intensity in frequency domain, and the corresponding frequencies. The second step is to concentrate those intensities on a few distinct frequencies ("binning"). The result is a sound signal consisting of a few distinct frequencies - the location of the electrodes in the simulated cochlea. Back conversion into the time domain gives the simulated sound signal for that time window.

The following Python function does sound processing on a given signal.

import numpy as np
def pSpect(data, rate):
    '''Calculation of power spectrum and corresponding frequencies, using a Hamming window'''
    nData = len(data)
    window = np.hamming(nData)
    fftData = np.fft.fft(data*window)
    PowerSpect = fftData * fftData.conj() / nData
    freq = np.arange(nData) * float(rate) / nData
    return (np.real(PowerSpect), freq)
def calc_stimstrength(sound, rate=1000, sample_freqs=[100, 200, 400]): 
    '''Calculate the stimulation strength for a given sound'''
    # Calculate the powerspectrum
    Pxx, freq = pSpect(sound, rate)
    # Generate matrix to sum over the requested bins
    num_electrodes = len(sample_freqs)
    sample_freqs = np.hstack((0, sample_freqs))
    average_freqs = np.zeros([len(freq), num_electrodes])
    for jj in range(num_electrodes):
        average_freqs[((freq>sample_freqs[jj]) * (freq<sample_freqs[jj+1])),jj] = 1
    # Calculate the stimulation strength (the square root has to be taken, to get the amplitude)
    StimStrength = np.sqrt(Pxx).dot(average_freqs)
    return StimStrength

Sound Transduction by Pinna and Outer Ear[edit]

The outer ear is divided into two parts: the visible part on the side of the head (the pinna), and the external auditory meatus (outer ear canal) leading to the eardrum, as shown in the figure below. With such a structure, the outer ear contributes the ‘spectral cues’ for people’s sound localization abilities, making people not only have the ability to detect and identify a sound, but also have the ability to localize a sound source. [15]

The Atonamy of Human Ear

Pinna Function[edit]

The Pinna’s cone shape enables it to gather sound waves and funnel them into the out ear canal. On top of that, its various folds make the pinna a resonant cavity which amplifies certain frequencies. Furthermore, the interference effects resulting from the sound reflection caused by the pinna are directionally dependent and will attenuate other frequencies. Therefore, the pinna could be simulated as a filter function applied to the incoming sound, modulating its amplitude and phase spectra.

Frequency Responses for Sounds from Two Different Directions by the Pinna [16]

The resonance of the pinna cavity can be approximated well by 6 normal modes [17]. Among these normal modes, the first mode, which mainly depends on the concha depth (i.e. the depth of the bowl-shaped part of the pinna nearest the ear canal), is the dominant one.

The cancellation of certain frequencies caused by the pinna reflection is called “pinna notch”. [17] As shown in the right figure [16], sound transmitted by the pinna goes through two paths, a direct path and a longer reflected path. The different paths have different length, and thereby produce phase differences. When the frequency of incoming sound signal reaches certain criterion, which is that the path difference is half of the sound wavelength, the interference of sounds via direct and reflected paths will be destructive. This phenomenon is called “pinna notch”. Normally the notch frequency could happen in the range from 6k Hz to 16k Hz depending on the pinna shape. It is also seen that the frequency response of pinna is directionally dependent. This makes the pinna contribute to the spatial cues for sound localization.

Ear Canal Function[edit]

The outer ear canal is approximately 25 mm long and 8 mm in diameter, with a tortuous path from the entrance of the canal to the eardrum. The outer ear canal can be modeled as a cylinder closed at one end which leads to a resonant frequency around 3k Hz. This way the outer ear canal amplifies sounds in a frequency range important for human speech. [18]

Simulation of Outer Ear[edit]

Based on the main functions of the outer ear, it is easy to simulate the sound transduction by the pinna and outer ear canal with a filter, or a filter bank, if we know the characteristics of the filter.

Many researchers are working on the simulation of human auditory system, which includes the simulation of the outer ear. In the next chapter, a Pinna-Related Transfer Function model is first introduced, followed by two MATLAB toolboxes developed by Finnish and British research groups, respectively.

Model of Pinna-Related Transfer Function by Spagnol[edit]

This part is entirely from the paper published by S.Spagnol, M.Geronazzo, and F.Avanzini. [19] In order to model the functions of the pinna, Spagnol developed a reconstruction model of the Pinna-Related Transfer Function (PRTF), which is a frequency response characterizing how sound is transduced by the pinna. This model is composed by two distinct filter blocks, accounting for resonance function and reflection function of the pinna respectively, as shown in the figure below.

General Model for the Reconstruction of PRTFs[19]

There are two main resonances in the interesting frequency range of the pinna[19], which can be represented by two second-order peak filters with fixed bandwidth f_b = 5 kHz [20]:

H_{res} (z)=  \frac{V_0 (1-h)(1-z^{-2})}{1+2dhz^{-1}+(2h-1)z^{-2}}


h=  \frac{1}{1+\tan(\pi\frac{f_B}{f_s})}
 d= -\cos(2\pi \frac{f_C}{f_s} )

and f_s is the sampling frequency, f_C the central frequency, and G the notch depth.

For the reflection part, three second-order notch filters of the form [21] are designed with the parameters including center frequency f_C, notch depth G, and bandwidth f_B.

H_{refl}(z)=  \frac{1+(1+k)\frac{H_0}{2}+d(1-k)z^{-1}+(-k-(1+k)\frac{H_0}{2})z^{-2}} {1+d(1-k) z^{-1}-kz^{-2}}

where d is the same as previously defined for the resonance function, and

H_0= V_0-1
k= \frac{\tan(\pi\frac{f_B}{f_s})-V_0}{\tan(\pi\frac{f_B}{f_s})+V_0}

each accounting for a different spectral notch.

By cascading the three in-series placed notch filters after the parallel two peak filters, an eighth-order filter is designed to model the PRTF.
By comparing the synthetic PRTF with the original one, as shown in the figures below, Spagnol concluded that the synthesis model for PRTF was overall effective. This model may have missing notches due to the limitation of cutoff frequency. Approximation errors may also be brought in due to the possible presence of non-modeled interfering resonances.

Original vs Synthetic PRTF Plots[19]

HUTear MATLAB Toolbox[edit]

Block Diagram of Generic Auditory Model of HUTear

HUTear is a MATLAB Toolbox for auditory modeling developed by Lab of Acoustics and Audio Signal Processing at Helsinki University of Technology [22]. This open source toolbox could be downloaded from here. The structure of the toolbox is shown in the right figure.

In this model, there is a block for “Outer and Middle Ear” (OME) simulation. This OME model is developed on the basis of Glassberg and Moor [23]. The OME filter is usually a linear filter. Auditory filter is generated with taking the "Equal Loudness Curves at 60 dB"(ELC)/"Minimum Audible Field"(MAF)/"Minimum Audible Pressure at ear canal"(MAP) correction into account. This model accounts for the outer ear simulation. By specifying different parameters with the "OEMtool", you may compare the MAP IIR approximation and MAP data, as shown in the figure below.

UI of OEMtool from HUTear Toolbox

MATLAB Model of the Auditory Periphery (MAP)[edit]

MAP is developed by researchers in the Hearing Research Lab at University of Essex, England [24]. Being a computer model of physiological basis of human hearing, MAP is an open-source code package for testing, developing the model, which could be downloaded from here. Its model structure is shown in the right figure.

MAP Model Structure

Within the MAP model, there is the “Outer Middle Ear (OME)” sub-model, allowing the user to test and create an OME model. In this OME model, the function of the outer ear is modeled as a resonance function. The resonances are composed by two parallel bandpass filters, respectively, representing concha resonance and outer ear canal resonance. These two filters are specified by the pass frequency range, gain and order. By adding the output of resonance filters to the original sound pressure wave, the output of the outer ear model is obtained.

To test the OME model, run the function named “testOME.m”. A figure plotting the external ear resonances and stapes peak displacement will be displayed. (as shown in the figure below)

External Ear Resonances and Stapes Peak Displacement from OME Model of MAP


The outer ear, including pinna and outer ear canal, can be simulated as a linear filter, or a filter bank. This reflects its resonance and reflection effect to incoming sound. It is worth noting that since the pinna shape varies from person to person, the model parameters, like the resonant frequencies, depend on the subject.

One aspect not included in the models described above is the Head-Related Transfer Function(HRTF). The HRTF describes how an ear receives a sound from a point sound source in space. It is not introduced here because it goes beyond the effect of the outer ear (pinna and outer ear canal) as it is also influenced by the effects of head and torso. There are plenty of literature and publications for HRTF for the interested reader.(wiki, tutorial 1,2, reading list for spatial audio research including HRTF)

Simulation of the Inner Ear[edit]

The shape and organisation of the basilar membrane means that different frequencies resonate particularly strongly at different points along the membrance. This leads to a tonotopic organisation of the sensitivity to frequency ranges along the membrane, which can be modeled as being an array of overlapping band-pass filters known as "auditory filters".[25] The auditory filters are associated with points along the basilar membrane and determine the frequency selectivity of the cochlea, and therefore the listener’s discrimination between different sounds.[26] They are non-linear, level-dependent and the bandwidth decreases from the base to apex of the cochlea as the tuning on the basilar membrane changes from high to low frequency.[26][27] The bandwidth of the auditory filter is called the critical bandwidth, as first suggested by Fletcher (1940). If a signal and masker are presented simultaneously then only the masker frequencies falling within the critical bandwidth contribute to masking of the signal. The larger the critical bandwidth the lower the signal-to-noise ratio (SNR) and the more the signal is masked.

ERB related to centre frequency. The diagram shows the ERB versus centre frequency according to the formula of Glasberg and Moore.[26]

Another concept associated with the auditory filter is the "equivalent rectangular bandwidth" (ERB). The ERB shows the relationship between the auditory filter, frequency, and the critical bandwidth. An ERB passes the same amount of energy as the auditory filter it corresponds to and shows how it changes with input frequency.[26] At low sound levels, the ERB is approximated by the following equation according to Glasberg and Moore:[26]

ERB = 24.7*(4.37F + 1) \,

where the ERB is in Hz and F is the centre frequency in kHz.

It is thought that each ERB is the equivalent of around 0.9mm on the basilar membrane.[26][27]

Gammatone Filters[edit]

Sample gamma tone impulse response.

One filter type used to model the auditory filters is the "gammatone filter". It provides a simple linear filter for describing the movement of one location of the basilar membrane for a given sound input, which is therefore easy to implement. Linear filters are popular for modeling different aspects of the auditory system. In general, they are IIR-filters (infinite impulse response) incorporating feedforward and feedback, which are defined by

 \sum\limits_{j = 0}^m {{a_{j + 1}}y(k - j)}  = \sum\limits_{i = 0}^n {{b_{i + 1}}x(k - i)}

where a1=1. In other words, the coefficients ai and bj uniquely determine this type of filter. The feedback-character of these filters can be made more obvious by re-shuffling the equation

 y(k) = {b_1}x(k) + {b_2}x(k - 1) + ... + {b_{n + 1}}x(k - n) - \left( {{a_2}y(k - 1) + ... + {a_{m + 1}}y(k - m)} \right)

(In contrast, FIR-filters, or finite impulse response filters, only involve feedforward: for them a_i=0 for i>1.)

General description of an "Infinite Impulse Response" filter.

Linear filters cannot account for nonlinear aspects of the auditory system. They are nevertheless used in a variety of models of the auditory system. The gammatone impulse response is given by

g(t) = at^{n-1} e^{-2\pi bt} \cos(2\pi ft + \phi), \,

where f is the frequency, \phi is the phase of the carrier, a is the amplitude, n is the filter's order, b is the filter's bandwidth, and t is time.

This is a sinusoid with an amplitude envelope which is a scaled gamma distribution function.

Variations and improvements of the gammatone model of auditory filtering include the gammachirp filter, the all-pole and one-zero gammatone filters, the two-sided gammatone filter, and filter cascade models, and various level-dependent and dynamically nonlinear versions of these.[28]

For computer simulations, efficient implementations of gammatone models are availabel for Matlab and for Python[29] .

When working with gammatone filters, we can elegantly exploit Parseval's Theorem to determine the energy in a given frequency band:

 \int_{ - \infty }^\infty  {{{\left| {f(t)} \right|}^2}dt = } \int_{ - \infty }^\infty  {{{\left| {F(\omega )} \right|}^2}d\omega }


  1. NeurOreille and authors (2010). "Journey into the world of hearing". 
  2. Lisker L., Abramson. "The voicing dimsension: Some experiments in comparative phonetics".(1970)
  3. Selfridge, O.C "Pandomonium: a paradigm for learning". 1959
  4. Uttley, AM. 1966, "The transmission of information and effects of local feedback in theoretical and neural networks." (1966).
  5. Liberman, A.M., Mattingly, IG, Turvey, MT. "Language codes and memory codes" (1967)
  6. Stevens, KN, Halle, M. "Remarks on analysis by synthesis and distinctive features"
  7. Hick, G. The role of mirror neurons in speech and language processing. (2010)
  8. McClelland, JL. The TRACE Model of Speech Perception (1986)
  12. T. Haslwanter (2012). "Hilbert Transformation [Python"]. private communications. 
  13. Crane BT, Gottschalk B, Kraut M, Aygun N, Niparko JK (2010) Magnetic resonance imaging at 1.5 T after cochlear implantation. Otol Neurotol 31:1215-1220
  14. T. Haslwanter (2012). "Short Time Fourier Transform [Python"]. private communications. 
  15. Semple, M.N. (1998), "Auditory perception: Sounds in a virtual world", Nature (Nature Publishing Group) 396 (6713): 721-724, doi:10.1038/25447 
  16. a b
  17. a b Shaw, E.A.G. (1997), "Acoustical features of the human ear", Binaural and spatial hearing in real and virtual environments (Mahwah, NJ: Lawrence Erlbaum) 25: 47 
  18. Federico Avanzini (2007-2008), Algorithms for sound and music computing, Course Material of Informatica Musicale (, pp. 432 
  19. a b c d Spagnol, S. and Geronazzo, M. and Avanzini, F. (2010), "Structural modeling of pinna-related transfer functions", In Proc. Int. Conf. on Sound and Music Computing (SMC 2010) (barcelona): 422-428 
  20. S. J. Orfanidis, ed., Introduction To Signal Processing. Prentice Hall, 1996.
  21. U. Zölzer, ed., Digital Audio Effects. New York, NY, USA: J.Wiley & Sons, 2002.
  23. Glasberg, B.R. and Moore, B.C.J. (1990), "Derivation of auditory filter shapes from notched-noise data", Hearing research (Elsevier) 47 (1-2): 103-138 
  25. Munkong, R. (2008), IEEE Signal Processing Magazine 25 (3): 98--117, doi:10.1109/MSP.2008.918418, Bibcode2008ISPM...25...98M 
  26. a b c d e f Moore, B. C. J. (1998). Cochlear hearing loss. London: Whurr Publishers Ltd.. ISBN 0585122563. 
  27. a b Moore, B. C. J. (1986), "Parallels between frequency selectivity measured psychophysically and in cochlear mechanics", Scand. Audio Suppl. (25): 129–52 
  28. R. F. Lyon, A. G. Katsiamis, E. M. Drakakis (2010). "History and Future of Auditory Filter Models". Proc. ISCAS. IEEE. 
  29. T. Haslwanter (2011). "Gammatone Toolbox [Python"]. private communications. 

Visual_System · Vestibular_System