
OPTICODEC-PC INTRODUCTION
About Perceptual Coders
CD-quality audio (16-bit words at 44.1 kHz sample rate) requires 705,600 bits per
second per channel, which is far too high for economical streaming. Perceptual coding
reduces the number of bits per second necessary to transmit a high-quality audio
signal.
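The bit rate quoted above follows directly from the word length and sample rate. The short sketch below reproduces the arithmetic; the function name pcm_bitrate is a hypothetical helper introduced here for illustration only.

```python
def pcm_bitrate(bits_per_sample: int, sample_rate_hz: int, channels: int = 1) -> int:
    """Bit rate of uncompressed PCM audio, in bits per second."""
    return bits_per_sample * sample_rate_hz * channels


# CD-quality audio: 16-bit words at a 44.1 kHz sample rate.
mono_bps = pcm_bitrate(16, 44_100)        # one channel
stereo_bps = pcm_bitrate(16, 44_100, 2)   # two channels

print(mono_bps)    # 705600
print(stereo_bps)  # 1411200
```

A stereo CD-quality signal therefore needs about 1.4 million bits per second before any perceptual coding is applied.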
Perceptual coders exploit models of how humans perceive sound, in particular the
phenomenon of psychoacoustic masking: louder sounds “drown out” (or “mask”) weaker
sounds occurring at the same time, particularly when the weaker sound is close in
frequency to the louder one. Loud sounds not only mask weak sounds occurring
simultaneously (spectral masking), but can also drown out weak sounds occurring a
few milliseconds before the loud sound starts or a few milliseconds after it stops
(temporal masking).
The basic principle of perceptual coding is to divide the audio into frequency bands
and then to code each frequency band with the minimum number of bits that will
yield no audible change in that band. Reducing the number of bits used to encode a
given frequency band raises the quantization noise floor in that band. If the noise
floor is raised too far, it can become audible and cause artifacts.
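The relationship between word length and noise floor can be illustrated numerically. The sketch below assumes an ideal uniform quantizer, for which each bit removed doubles the quantization step and so raises the noise floor by 20·log10(2), or about 6 dB. The names DB_PER_BIT and noise_floor_rise_db are illustrative, and a real codec's quantizer may behave differently.

```python
import math

# Assumption: ideal uniform quantizer. Removing one bit doubles the step
# size, which raises the quantization noise floor by 20*log10(2) dB.
DB_PER_BIT = 20 * math.log10(2)  # about 6.02 dB


def noise_floor_rise_db(bits_removed: int) -> float:
    """Approximate rise in the quantization noise floor, in dB."""
    return bits_removed * DB_PER_BIT


for removed in range(5):
    rise = noise_floor_rise_db(removed)
    print(f"{removed} bits removed: noise floor rises {rise:.1f} dB")
```

Whether a given rise is audible depends on how much masking the program material provides in that band at that moment.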
A second major source of artifacts in codecs is pre- and post-echo caused by ringing
of the narrow bandpass filters used to divide the signal into frequency bands. This
ringing manifests itself as a smearing of sharp transient sounds in music, such as
those produced by claves and wood blocks. Because narrower filters ring longer, the
ringing worsens as the number of bands increases, so some codecs adaptively switch
the number of bands in use, depending on whether the sound has significant transient
content.
Psychoacoustic Models
Perceptual coders exploit complex models of the human auditory system to estimate
whether a given amount of added noise can be heard. They then adjust the number
of bits used to code each frequency band so that, if the total “bit budget” is
sufficiently high, the added noise is undetectable by the ear. Because the
psychoacoustic model in a perceptual coder is an approximation that never exactly
matches the behavior of the ear, it is desirable to leave some safety factor when
choosing the number of bits to use for each frequency band. This safety factor is
often called the “mask-to-noise ratio,” measured in dB. For example, a mask-to-noise
ratio of 12 dB in a given band would mean that the quantization noise in that band
could be raised by 12 dB before it would be heard. (That is, at roughly 6 dB per bit,
there is a safety margin of two bits in that band’s coding.) For the most efficient
coding, the mask-to-noise ratio should be the same in all bands, ensuring that the
sound elements equitably share the available bits in the transmission channel.
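One simple way to approach an equal mask-to-noise ratio in every band is a greedy loop that repeatedly gives one more bit to the band whose ratio is currently worst. The sketch below is a simplified illustration, not the allocation method of any particular codec. It assumes that each added bit improves a band's signal-to-noise ratio by about 6.02 dB, and it uses made-up signal-to-mask ratios (smr_db) in place of a real psychoacoustic model. The mask-to-noise ratio of a band is then its signal-to-noise ratio minus its signal-to-mask ratio.

```python
DB_PER_BIT = 6.02  # assumed SNR gain per added bit (uniform quantizer)


def mask_to_noise_db(bits, smr_db):
    """Per-band mask-to-noise ratio: SNR from the bits minus signal-to-mask ratio."""
    return [DB_PER_BIT * b - smr for b, smr in zip(bits, smr_db)]


def allocate_bits(smr_db, bit_budget):
    """Greedy allocation: give each bit to the band with the lowest MNR."""
    bits = [0] * len(smr_db)
    for _ in range(bit_budget):
        mnr = mask_to_noise_db(bits, smr_db)
        worst_band = mnr.index(min(mnr))
        bits[worst_band] += 1
    return bits, mask_to_noise_db(bits, smr_db)


# Hypothetical signal-to-mask ratios (dB) for four bands; a real coder
# would obtain these from its psychoacoustic model for each block of audio.
smr_db = [20.0, 8.0, 14.0, 2.0]
bits, mnr = allocate_bits(smr_db, bit_budget=16)

for band, (b, m) in enumerate(zip(bits, mnr)):
    print(f"band {band}: {b} bits, mask-to-noise ratio {m:.2f} dB")
```

With a larger bit budget, every band's mask-to-noise ratio rises together, which is the safety margin discussed above.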
Increasing the number of bits per second in the transmission always improves the
mask-to-noise ratio. It is important to allocate extra bits to the transmission if the
audio will be processed after it has been decoded at the output of the perceptual
coder (for example, by a second “cascaded” perceptual coder, or by a multiband audio
processor such as Optimod-PC). Done correctly, this increased bitrate will raise
the mask-to-noise ratio far enough to prevent downstream processing from causing
the noise to become unmasked.
Because it occurs in narrow frequency bands, unmasked noise does not
sound like familiar white noise at all. Instead, it most often sounds like