Efficient coding

Why do we need efficient coding?

As already described, visual signals are processed in the visual cortex in order to interpret the information they carry. Once it is understood how visual information is processed, the question arises of how that information could be coded.

Input Data Amount

Especially in the visual system, the amount of data is huge: the retina senses about ${\displaystyle 10^{10}}$ bit/s, of which approximately ${\displaystyle 3\times 10^{6}}$ to ${\displaystyle 6\times 10^{6}}$ bit/s are transmitted by about 1 million axons through each optic nerve.[1] [2] Of this, only about ${\displaystyle 10^{4}}$ bit/s reach layer IV of V1. Since the capacity of consciousness is estimated at ≤ 100 bit/s[1], reducing the amount of data is not only useful but necessary.
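The size of this reduction can be made explicit with a few lines of arithmetic. The following is only a sketch based on the approximate figures quoted above; the upper end of the optic nerve range is used, and the stage names and the helper `reduction_factors` are chosen for illustration.

```python
# Approximate information rates quoted in the text, in bits per second.
rates = {
    "retina": 1e10,
    "optic nerve": 6e6,       # upper end of the quoted range
    "layer IV of V1": 1e4,
    "consciousness": 1e2,     # quoted upper bound
}

def reduction_factors(rates):
    """Factor by which the rate drops between consecutive stages."""
    names = list(rates)
    return {f"{a} -> {b}": rates[a] / rates[b] for a, b in zip(names, names[1:])}

factors = reduction_factors(rates)
for step, factor in factors.items():
    print(f"{step}: reduced by a factor of about {factor:,.0f}")

overall = rates["retina"] / rates["consciousness"]
print(f"retina -> consciousness: factor of about {overall:.0e}")
```

Under these figures, the largest single reduction happens between the retina and the optic nerve.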

Processing Speed and Accuracy

In humans, neurons fire at rates of approximately 0.2 Hz to 10 Hz.[3] The coding of information also relies on the exact timing and frequency of firing.[4] To make matters more difficult, the processing network also has to deal with noise. Retinal noise, i.e., "spontaneous fluctuations in the electrical signals of the retina’s photoreceptors"[5], arises in the rods through thermal decomposition of rhodopsin, which creates events "that cannot be distinguished from the events which occur when light falls on the rods and a quantum is absorbed"[6]; it also arises in the cones, where it has a molecular origin.[4] It is argued that retinal noise limits visual sensitivity much more than noise in the central nervous system, which is induced by random activity at the synapses of nerve cells that creates additional action potentials.[7]

Energy Consumption

Every neural activity needs energy: the brain accounts for about 20% of resting metabolism. An increase of one action potential per neuron per second raises oxygen consumption by 145 mL/100 g grey matter/h. The human circulatory system delivers about 1.5 L of blood per minute to the brain, supplying it with energy and oxygen. "For an action potential frequency of 4 Hz in active cells, approximately 15% of a set of neurons should be simultaneously active to encode a condition".[8]

Solution

To cope with the huge amount of data that must be processed by a nervous system limited in speed, accuracy and available energy, efficient coding is needed.

In the auditory system, the basic units on which human (verbal) communication relies are the phonemes, i.e. the distinct basic sound elements of a language that distinguish one word from another.[9] For example, the word "eye" consists of just one phoneme, /aɪ/, whereas the word "code" consists of the phonemes /k/, /əʊ/, /d/.

Analogously, for the visual system, an efficient code would consist of image structures as basic elements that can be combined to represent the sensed environment (i.e. the image). As a model that preserves the basic characteristics of visual receptive fields, Olshausen & Field proposed an optimization algorithm that finds a sparse code while preserving the information in the image.[10]

Technical Demonstration

Encoding and Decoding Process

The principle of information compression can be nicely demonstrated with the "k-means" method applied to (2-dimensional) images; an example is available in the documentation of the Python library scikit-image.[11] The idea, as illustrated in Figure 1, is to compress an image (or data in general), process it, and afterwards transform it back. The processing step is much more efficient this way. In contrast to the methods present in biological systems, there also exist lossless compression methods, e.g., wavelets, that allow an exact back-transformation.

Lossless compression is not required for biological systems. The resulting loss of information is illustrated by the previously mentioned k-means example from scikit-image[11] and by a video on YouTube.[12]
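The encode/decode cycle and the information it loses can be sketched with a minimal k-means quantizer. This is not the scikit-image example cited above but a pure-Python illustration: the synthetic row of pixel values, the choice of 4 clusters and the helper names `nearest` and `kmeans_1d` are all assumptions made for the sketch.

```python
import math
import random

def nearest(value, centres):
    """Index of the centre closest to `value`."""
    return min(range(len(centres)), key=lambda j: abs(value - centres[j]))

def kmeans_1d(values, k, iterations=20, seed=0):
    """Lloyd's k-means on a list of numbers; returns k cluster centres."""
    rng = random.Random(seed)
    centres = rng.sample(sorted(set(values)), k)  # k distinct starting centres
    for _ in range(iterations):
        clusters = [[] for _ in range(k)]
        for v in values:
            clusters[nearest(v, centres)].append(v)
        # Move each centre to the mean of its cluster; keep it if the cluster is empty.
        centres = [sum(c) / len(c) if c else centres[j] for j, c in enumerate(clusters)]
    return centres

# Synthetic stand-in for one row of an 8-bit greyscale image (an assumption).
image = [int(127.5 + 127.5 * math.sin(i / 10)) for i in range(256)]

k = 4
centres = kmeans_1d(image, k)
codes = [nearest(v, centres) for v in image]   # encoding: one cluster index per pixel
decoded = [centres[c] for c in codes]          # decoding: look up the cluster centre

mse = sum((v - d) ** 2 for v, d in zip(image, decoded)) / len(image)
bits_original = 8 * len(image)
bits_compressed = math.ceil(math.log2(k)) * len(image)  # codebook not counted
print(f"{bits_original} bits -> {bits_compressed} bits, mean squared error {mse:.1f}")
```

With 4 clusters, each pixel needs 2 bits instead of 8, and the non-zero reconstruction error shows that the original image cannot be recovered exactly.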

Introduction

Between the late 1990s and the beginning of the 21st century, Bruno Olshausen and Michael Lewicki studied how natural images[10] and natural sounds[13], respectively, are encoded by the brain, and tried to create a model that would replicate this process as accurately as possible. It was found that the processing of both input signals could be modeled with very similar methods. The goal of efficient coding theory is to convey a maximal amount of information about a stimulus by using a set of statistically independent features[14]. Efficient coding of natural images gives rise to a population of localized, oriented, Gabor wavelet-like filters[10],[15]. Gammatone filters are their equivalent in the auditory system. To distinguish shapes in an image, the most important operation is edge detection, which is achieved with Gabor filters. In sound processing, sound onsets or 'acoustic edges' can be encoded by a pool of filters similar to a gammatone filterbank[13].
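How a Gabor filter responds to an edge can be sketched in one dimension. A Gabor function is a Gaussian envelope multiplied by a sinusoid; the odd-symmetric variant below, its parameters and the step-edge test signal are illustrative assumptions, not the filters learned in the cited work.

```python
import math

def odd_gabor(x, sigma=2.0, frequency=0.1):
    """Odd-symmetric 1-D Gabor function: a Gaussian envelope times a sine wave."""
    return math.exp(-x * x / (2 * sigma ** 2)) * math.sin(2 * math.pi * frequency * x)

half_width = 8
kernel = [odd_gabor(k) for k in range(-half_width, half_width + 1)]

# A step edge: dark (0) on the left, bright (1) on the right, edge at index 32.
signal = [0.0] * 32 + [1.0] * 32

def filter_response(signal, kernel, half_width):
    """Correlate the signal with the kernel at every position where it fits."""
    return {
        n: sum(signal[n + k] * kernel[k + half_width]
               for k in range(-half_width, half_width + 1))
        for n in range(half_width, len(signal) - half_width)
    }

response = filter_response(signal, kernel, half_width)
strongest = max(response, key=response.get)
print(f"strongest response at position {strongest}")
```

The response is close to zero in the uniform regions and peaks at the position of the step, which is what makes such a filter usable as an edge detector.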

Vision

In 1996, Bruno Olshausen and his team were the first to create a learning algorithm that seeks sparse linear codes for natural images. By maximizing sparseness, it develops a set of localized, oriented, bandpass receptive fields, analogous to those found in the primary visual cortex[10].

They start from the assumption that an image ${\displaystyle I(x,y)}$ can be represented as a linear superposition of basis functions, ${\displaystyle \phi _{i}(x,y)}$:

${\displaystyle I(x,y)=\sum _{i}a_{i}\phi _{i}(x,y)}$

The parameters ${\displaystyle a_{i}}$ depend on which basis functions ${\displaystyle \phi _{i}(x,y)}$ are chosen, and differ for each image. The objective of efficient coding is to find a family of ${\displaystyle \phi _{i}(x,y)}$ that spans the image space and yields parameters ${\displaystyle a_{i}}$ that are as statistically independent as possible.

Natural scenes contain many higher-order forms of statistical structure which are non-Gaussian[16]. Principal component analysis, which removes only second-order (pairwise linear) correlations, is therefore unsuitable for attaining these two objectives. Statistical dependencies among a set of parameters exist whenever the joint entropy is less than the sum of the individual entropies:

${\displaystyle H(a_{1},a_{2},...,a_{n})<\sum _{i}H(a_{i})}$
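The inequality above can be checked on a toy example. The sketch below estimates Shannon entropies from small, hand-made lists of value pairs; the data and the helper names `entropy` and `joint_and_marginal` are assumptions chosen for illustration.

```python
import math
from collections import Counter

def entropy(samples):
    """Shannon entropy (in bits) of the empirical distribution of `samples`."""
    counts = Counter(samples)
    n = len(samples)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())

def joint_and_marginal(pairs):
    """Return (joint entropy, sum of the two individual entropies)."""
    joint = entropy(pairs)
    marginals = entropy([a for a, _ in pairs]) + entropy([b for _, b in pairs])
    return joint, marginals

# Dependent pair: the second value always copies the first.
dependent = [(0, 0), (1, 1), (0, 0), (1, 1)]
# Independent pair: every combination occurs equally often.
independent = [(0, 0), (0, 1), (1, 0), (1, 1)]

joint_dep, sum_dep = joint_and_marginal(dependent)
joint_ind, sum_ind = joint_and_marginal(independent)
print(f"dependent:   joint = {joint_dep:.2f} bit, sum of marginals = {sum_dep:.2f} bit")
print(f"independent: joint = {joint_ind:.2f} bit, sum of marginals = {sum_ind:.2f} bit")
```

For the dependent pair the joint entropy is smaller than the sum of the individual entropies, whereas for the independent pair the two quantities coincide.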

Entropy here means the Shannon entropy, i.e. the expected value (average) of the information content of a variable. The joint entropy is a measure of the uncertainty associated with a set of variables. It is assumed that natural images have a 'sparse structure', meaning that an image can be expressed in terms of a small number of features out of a larger set[17],[16]. The objective is to find a code that lowers the entropy, in which the probability distribution of each parameter is unimodal and peaked around zero. This can be formulated as an optimization problem[14]:

${\displaystyle E=-[{\text{preserve information}}]-\lambda [{\text{sparseness of }}a_{i}]}$

where ${\displaystyle \lambda }$ is a positive weighting coefficient. The first quantity evaluates the mean square error between the natural image and the reconstructed image.

${\displaystyle [{\text{preserve information}}]=-\sum _{x,y}[I(x,y)-\sum _{i}a_{i}\phi _{i}(x,y)]^{2}}$

The second quantity assigns a higher cost when, for a given image, the activity is spread over many parameters rather than concentrated in a few. It is calculated by summing each coefficient's activity passed through a nonlinear function ${\displaystyle S(x)}$.

${\displaystyle [{\text{sparseness of }}a_{i}]=-\sum _{i}S\left({\frac {a_{i}}{\sigma }}\right)}$

where ${\displaystyle \sigma }$ is a scaling constant. For ${\displaystyle S(x)}$, functions are chosen that, among activity states with equal variance, favor those with the fewest non-zero parameters (e.g. ${\displaystyle -e^{-x^{2}}}$, ${\displaystyle \log(1+x^{2})}$, ${\displaystyle \left\vert x\right\vert }$).
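The effect of these candidate functions can be illustrated by comparing two coefficient vectors with the same sum of squares, one sparse and one dense. The vectors, the value ${\displaystyle \sigma =1}$ and the helper `total_cost` are assumptions made for this sketch.

```python
import math

# The three example choices of S(x) named in the text.
cost_functions = {
    "-exp(-x^2)": lambda x: -math.exp(-x * x),
    "log(1 + x^2)": lambda x: math.log(1 + x * x),
    "|x|": abs,
}

def total_cost(coefficients, S, sigma=1.0):
    """Sum of S(a_i / sigma) over all coefficients."""
    return sum(S(a / sigma) for a in coefficients)

# Two activity states with the same sum of squares (4.0):
sparse = [2.0, 0.0, 0.0, 0.0]   # one active coefficient
dense = [1.0, 1.0, 1.0, 1.0]    # activity spread over all coefficients

costs = {
    name: (total_cost(sparse, S), total_cost(dense, S))
    for name, S in cost_functions.items()
}
for name, (c_sparse, c_dense) in costs.items():
    print(f"S(x) = {name}: sparse {c_sparse:.3f}, dense {c_dense:.3f}")
```

For each of the three functions the sparse vector receives the lower cost, which is the property the sparseness term relies on.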

Learning is achieved by minimizing the total cost ${\displaystyle E}$. For each image, ${\displaystyle E}$ is minimized over the ${\displaystyle a_{i}}$; the ${\displaystyle \phi _{i}}$ then evolve by gradient descent on ${\displaystyle E}$ averaged over many image presentations. The algorithm allows the basis functions to be overcomplete (more basis functions than image dimensions) and non-orthogonal[18], without reducing the degree of sparseness.
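The minimization of ${\displaystyle E}$ over the ${\displaystyle a_{i}}$ can be sketched for a fixed basis. The example below is a toy under stated assumptions, not the authors' implementation: it uses ${\displaystyle S(x)=\left\vert x\right\vert }$ with ${\displaystyle \sigma =1}$, plain (sub)gradient descent with a constant step, a hand-picked 2-pixel "image" and three hand-picked basis functions. The learning of the ${\displaystyle \phi _{i}}$ themselves is not shown.

```python
import math

# Hypothetical toy problem: a 2-pixel "image" and three non-orthogonal basis
# functions, i.e. an overcomplete basis (more basis functions than pixels).
image = [1.0, 1.0]
basis = [
    [1.0, 0.0],
    [0.0, 1.0],
    [1 / math.sqrt(2), 1 / math.sqrt(2)],
]
lam = 0.1    # weight lambda of the sparseness term (assumed value)
step = 0.02  # step size of the gradient descent (assumed value)

def sign(x):
    return (x > 0) - (x < 0)

def residual(image, basis, a):
    """Pixel-wise difference between the image and its reconstruction."""
    return [
        pixel - sum(a_i * phi[p] for a_i, phi in zip(a, basis))
        for p, pixel in enumerate(image)
    ]

def cost(image, basis, a):
    """E = squared reconstruction error + lambda * sum_i |a_i| (sigma = 1)."""
    r = residual(image, basis, a)
    return sum(x * x for x in r) + lam * sum(abs(a_i) for a_i in a)

def infer_coefficients(image, basis, iterations=5000):
    """Minimize E over the coefficients a_i for a fixed basis."""
    a = [0.0] * len(basis)
    for _ in range(iterations):
        r = residual(image, basis, a)
        # dE/da_i = -2 * sum_x r(x) * phi_i(x) + lambda * sign(a_i)
        grad = [
            -2 * sum(r_p * phi_p for r_p, phi_p in zip(r, phi)) + lam * sign(a_i)
            for a_i, phi in zip(a, basis)
        ]
        a = [a_i - step * g for a_i, g in zip(a, grad)]
    return a

a = infer_coefficients(image, basis)
print("coefficients:", [round(a_i, 2) for a_i in a])
print("cost of sparse solution:", round(cost(image, basis, a), 3))
print("cost of dense solution [1, 1, 0]:", round(cost(image, basis, [1.0, 1.0, 0.0]), 3))
```

The image could be reconstructed exactly with the first two basis functions, but the sparseness term makes the descent settle on a code that uses almost only the third one, at the price of a small reconstruction error.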

The algorithm was tested on artificial datasets, confirming that it is suited to detecting sparse structure in the data. The learned basis functions are well localized, oriented and selective to different spatial scales. Mapping the response of each ${\displaystyle a_{i}}$ to spots at every position established a similarity between the receptive fields and the basis functions. Together, the basis functions form a complete image code that spans the joint space of spatial position, orientation and scale in a manner similar to wavelet codes.

To conclude, the results of Olshausen's team show that two objectives are sufficient for the emergence of localized, oriented, bandpass receptive fields: that information be preserved and that the representation be sparse.

Audition

Fig.1: Time–frequency analysis. (a) The filters in a Fourier transform are localized in frequency but not in time. (b) Wavelet filters are localized in both time and frequency. (c–e) The statistical structure of the signals determines how the filter shapes derived from efficient coding of the different data ensembles are distributed in time–frequency space. Each ellipse is a schematic of the extent of a single filter in time–frequency space. (c) Environmental sounds. (d) Animal vocalizations. (e) Speech.

Lewicki published his findings in 2002, after Olshausen. Inspired by the earlier paper, he applied efficient coding theory to derive efficient codes for different classes of natural sounds: animal vocalizations, environmental sounds and human speech.

He used independent component analysis (ICA), which extracts a linear decomposition of signals that minimizes correlations and higher-order statistical dependencies[19]. This learning algorithm then yields a set of filters for each data set, each of which can be interpreted as a time-frequency window. The filter shape is determined by the statistical structure of the ensemble[13].

When applied to the different sample sounds, the method yielded, for environmental sounds, filters with time-frequency windows similar to those of a wavelet, localized in both time and frequency (Fig. 1c). For animal vocalizations, it yielded a tiling pattern similar to that of a Fourier transform, localized in frequency but not in time (Fig. 1d). Speech produced a mixture of both, corresponding to a weighting of 2:1 of environmental to animal sounds (Fig. 1e). This is because speech is composed of harmonic vowels and non-harmonic consonants. These patterns had previously been observed experimentally in animals and humans[20].

To work out the core differences between these three types of sound, Lewicki's team analyzed bandwidth, filter sharpness, and the temporal envelope. Bandwidth increases as a function of center frequency for environmental sounds, whereas it stays constant for animal vocalizations. For speech, bandwidth also increases, but less than for environmental sounds. Because of the time/frequency trade-off, the temporal envelope curves behave similarly. When the sharpness with respect to center frequency of physiological measurements[21],[22] was compared with the sharpness obtained from speech data and from the combined sound ensembles, the results were found to be consistent.

It must be noted that several approximations were necessary to conduct this analysis. The analysis did not include variations in sound intensity, although the auditory system obeys certain intensity thresholds according to which frequencies are selected[23]. However, the physiological measurements with which the results are compared were made using isolated pure tones, which in turn limits the range of application of this model but does not discredit it. Moreover, the filters' symmetry in time does not match the physiologically characterized 'gammatone filters'. It is possible to modify the algorithm to be causal; the filters' temporal envelopes would then become asymmetric, similar to those of gammatone filters.
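The asymmetry of a gammatone filter can be sketched numerically. A gammatone impulse response is commonly written as ${\displaystyle t^{n-1}e^{-2\pi bt}\cos(2\pi ft)}$ for ${\displaystyle t\geq 0}$; the order, bandwidth and frequency values below, the Gaussian envelope used as the symmetric comparison, and the helper `half_widths` are assumptions chosen for illustration.

```python
import math

def gammatone_envelope(t, order=4, bandwidth=100.0):
    """Envelope t^(n-1) * exp(-2*pi*b*t) of a gammatone filter; zero for t < 0 (causal)."""
    if t < 0:
        return 0.0
    return t ** (order - 1) * math.exp(-2 * math.pi * bandwidth * t)

def gammatone(t, order=4, bandwidth=100.0, frequency=1000.0):
    """Gammatone impulse response: the envelope times a cosine carrier."""
    return gammatone_envelope(t, order, bandwidth) * math.cos(2 * math.pi * frequency * t)

def gaussian_envelope(t, centre=0.02, width=0.003):
    """Time-symmetric envelope used here for comparison."""
    return math.exp(-((t - centre) ** 2) / (2 * width ** 2))

def half_widths(envelope, times):
    """Time from half-maximum to peak (rise) and from peak to half-maximum (decay)."""
    values = [envelope(t) for t in times]
    peak = max(range(len(values)), key=values.__getitem__)
    half = values[peak] / 2
    left = peak
    while left > 0 and values[left] > half:
        left -= 1
    right = peak
    while right < len(values) - 1 and values[right] > half:
        right += 1
    return times[peak] - times[left], times[right] - times[peak]

times = [i * 1e-5 for i in range(5000)]  # 0 to 50 ms in steps of 10 microseconds

rise_gamma, decay_gamma = half_widths(gammatone_envelope, times)
rise_gauss, decay_gauss = half_widths(gaussian_envelope, times)
print(f"gammatone: rise {rise_gamma * 1e3:.2f} ms, decay {decay_gamma * 1e3:.2f} ms")
print(f"gaussian:  rise {rise_gauss * 1e3:.2f} ms, decay {decay_gauss * 1e3:.2f} ms")
```

The gammatone envelope rises faster than it decays, whereas the symmetric envelope has equal rise and decay times.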

Conclusion

An analogy emerges between these two systems. Neurons in the visual cortex encode the location and spatial frequency of visual stimuli. The trade-off between these two variables is similar to that between timing and frequency in auditory coding.

Another interesting aspect of this parallel is why ICA accounts for the neural response properties at the earlier stages of analysis in the auditory system, whereas in the visual system it accounts for the response properties of cortical neurons. It must be noted that the neuronal anatomy of the two systems differs. In the visual system a bottleneck occurs at the optic nerve, where information from 100 million photoreceptors is condensed into 1 million optic nerve fibers. The information is then expanded by a factor of 50 in the cortex. In the auditory system no bottleneck occurs, and information from 3000 cochlear inner hair cells projects directly onto 30000 auditory nerve fibers. In both systems, ICA thus corresponds to the point of expansion in the representation[24].

References

1. Marcus E. Raichle: Two views of brain function Trends Cogn Sci. 2010 Apr;14(4):180-90
2. Anderson, C.H. et al. (2005) Directed visual attention and the dynamic control of information flow. In Neurobiology of Attention (Itti, L. et al., eds), pp. 11 – 17, Elsevier
3. György Buzsáki & Kenji Mizuseki: The log-dynamic brain: how skewed distributions affect network operations, Figure 3 e, f: http://www.nature.com/nrn/journal/v15/n4/fig_tab/nrn3687_F3.html
4. Wulfram Gerstner, Andreas K. Kreiter, Henry Markram, and Andreas V. M. Herz: Neural codes: Firing rates and beyond, http://www.pnas.org/content/94/24/12740.full
5. Fred Rieke, Denis A. Baylor, Origin and Functional Impact of Dark Noise in Retinal Cones, Neuron, Volume 26, Issue 1, April 2000, Pages 181-186, ISSN 0896-6273, http://dx.doi.org/10.1016/S0896-6273(00)81148-4
6. H. B. Barlow: Retinal noise and absolute threshold, J Opt Soc Am. 1956 Aug;46(8):634-9
7. Jonathan B. Demb, Peter Sterling, Michael A. Freed: How Retinal Ganglion Cells Prevent Synaptic Noise From Reaching the Spike Output, Journal of Neurophysiology Published 1 October 2004 Vol. 92 no. 4, 2510-2519
8. David Attwell and Simon B. Laughlin: An Energy Budget for Signaling in the Grey Matter of the Brain
9. https://en.oxforddictionaries.com/definition/phoneme
10. Olshausen, B. A. & Field, D. J. Emergence of simple-cell receptive-field properties by learning a sparse code for natural images. Nature 381, 607-609 (1996)
11. http://scikit-image.org/docs/dev/auto_examples/features_detection/plot_gabors_from_astronaut.html#sphx-glr-auto-examples-features-detection-plot-gabors-from-astronaut-py