## A Look At A Sample

• Let's consider a 16-bit audio sample:

00 010100000101 01

• The sample consists of three parts, from high-order to low-order bits:

• Headroom: Not recorded at maximum amplitude to avoid clipping

• Signal: The actual audio

• Noise: The low-order bits are typically garbage

• The signal and noise blend together

• As signals are manipulated, more noise creeps up into the signal bits, because addition and multiplication accumulate and amplify the errors in the low-order bits
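
To make the layout concrete, here is a minimal sketch that splits the example sample into its three fields. The field widths (2, 12, and 2 bits) are an assumption read off the spacing in the example; in real audio the boundary between signal and noise is not this sharp.

```python
SAMPLE = 0b00_010100000101_01  # the 16-bit sample from the example

# Assumed field widths, taken from the spacing in the example.
HEADROOM_BITS = 2
SIGNAL_BITS = 12
NOISE_BITS = 2

def split_sample(sample):
    """Return the (headroom, signal, noise) bit fields of a 16-bit sample."""
    noise = sample & ((1 << NOISE_BITS) - 1)
    signal = (sample >> NOISE_BITS) & ((1 << SIGNAL_BITS) - 1)
    headroom = sample >> (NOISE_BITS + SIGNAL_BITS)
    return headroom, signal, noise

headroom, signal, noise = split_sample(SAMPLE)
print(f"{headroom:02b} {signal:012b} {noise:02b}")  # prints "00 010100000101 01"
```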

## Clipping

• Artifact of fixed-range representation of PCM sample: floating-point samples are basically unclippable

• If the amplitude to be represented goes over the max possible value or under the min, what to do?

• Not much except clamp (clip) the sample as close as you can get it

• Net effect: tops clipped off waves

• This happens in analog systems also, because of max/min voltage limits

• Discontinuity introduces harmonics: bad distortion
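
A minimal sketch of clamping, assuming signed 16-bit samples. The names `clip_sample` and `amplify` are made up for illustration. Doubling a wave that already peaks at 20000 pushes the peaks past the representable range, so they get flattened.

```python
INT16_MIN = -32768
INT16_MAX = 32767

def clip_sample(value):
    """Clamp an integer amplitude into the signed 16-bit range."""
    return max(INT16_MIN, min(INT16_MAX, value))

def amplify(samples, gain):
    """Scale each sample by gain, then round and clip the result."""
    return [clip_sample(round(s * gain)) for s in samples]

# One cycle of a crude wave that peaks at +/- 20000.
wave = [0, 10000, 20000, 10000, 0, -10000, -20000, -10000]
louder = amplify(wave, 2.0)
print(louder)  # [0, 20000, 32767, 20000, 0, -20000, -32768, -20000]
```

The two peaks should have been 40000 and -40000; they come out as 32767 and -32768 instead, which is the "tops clipped off" effect.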

## Noise

• Several kinds of common audio noise

• Uniform "white noise": easy to make with a computer

• "Pink noise", whose power falls off as 1/f, about 3 dB per octave ("1/f noise", "flicker noise")

• "Brownian noise" ("1/f^2 noise", about 6 dB per octave) from a random walk in the time domain

• Check out this Wikipedia article on noise colors
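
A minimal sketch of the two easiest colors, using only the standard library. The function names and the `step` parameter are made up for illustration, and clamping the random walk to [-1, 1] is one simple choice among several. Pink noise needs a filter and is left out here.

```python
import random

def white_noise(n, rng):
    """n independent samples drawn uniformly from [-1, 1]."""
    return [rng.uniform(-1.0, 1.0) for _ in range(n)]

def brownian_noise(n, rng, step=0.02):
    """Random walk: each sample is the previous one plus a small white step."""
    out = []
    level = 0.0
    for _ in range(n):
        level += rng.uniform(-step, step)
        level = max(-1.0, min(1.0, level))  # keep the walk in range
        out.append(level)
    return out

rng = random.Random(0)  # fixed seed so runs are repeatable
white = white_noise(1000, rng)
brown = brownian_noise(1000, rng)
```

Successive white samples are unrelated, while successive Brownian samples differ by at most `step`, which is why Brownian noise sounds much duller.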

## Audio Compression

• Idea: build a simple parameterized approximate model of the audio signal

• In the time domain
• In the frequency domain
• Transmit the parameters as part of the compressed stream

• A choice remains:

• Transmit the residue (error in approximation) as a separate compressed stream: lossless compression

• Throw the residue away: lossy compression

• Lossless (e.g. FLAC) is going to be limited for a lot of kinds of sounds. The fancier the model, the more kinds of sounds that can be compressed well

• Lossy is harder, because you mustn't throw away anything whose loss wrecks the sound. Psychoacoustics is needed to decide what is safe to discard. Tends to be done in the frequency domain; the models are generalized
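
The model-plus-residue idea can be sketched with a deliberately trivial time-domain model: predict each sample as equal to the previous one. This model has no parameters to transmit, and it is an assumption chosen for brevity, not what any particular codec does. Keeping the residue makes the scheme lossless.

```python
def encode(samples):
    """Return the residue: the error of predicting each sample as the previous one."""
    residues = []
    previous = 0
    for s in samples:
        residues.append(s - previous)
        previous = s
    return residues

def decode(residues):
    """Rebuild the samples by adding each residue to the running prediction."""
    samples = []
    previous = 0
    for r in residues:
        previous += r
        samples.append(previous)
    return samples

samples = [1000, 1010, 1025, 1030, 1028, 1020]
residues = encode(samples)
print(residues)  # [1000, 10, 15, 5, -2, -8]
```

After the first value, the residues are much smaller than the samples, so they need fewer bits. Decoding them reproduces the input exactly.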

## Audio Compression: Stereo

• Typical to take a stereo pair and turn it into a mono channel (l + r) / 2 and a side channel (l - r)

• The side channel is typically low amplitude, and so can be compressed easily

• Side benefit: mono channel is easily extracted
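
A minimal sketch of the transform for integer samples. Dividing by two discards one bit of l + r, but that bit always equals the low bit of l - r, so the pair can still be recovered exactly. The function names are made up for illustration.

```python
def to_mid_side(left, right):
    """Convert a stereo pair to (mid, side)."""
    mid = (left + right) >> 1  # floor of the average; drops the low bit of l + r
    side = left - right
    return mid, side

def from_mid_side(mid, side):
    """Recover (left, right) exactly from (mid, side)."""
    total = (mid << 1) | (side & 1)  # l + r and l - r share the same low bit
    left = (total + side) >> 1
    right = (total - side) >> 1
    return left, right

mid, side = to_mid_side(1001, 998)
print(mid, side)  # 999 3
```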

## Audio Compression: FLAC

• Predict in the time domain using a polynomial model or linear predictive coding (LPC)

• Encode residue using Rice codes (related to Huffman codes)

• Typical compression is around 2×, varying with the material

• Remember: the noise must be compressed and recreated also
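
A minimal sketch of Rice coding with parameter k: the value shifted right by k is written in unary, followed by the low k bits in binary. Residues are signed, so this sketch first folds them onto the non-negative integers with a zigzag mapping. The bit-string representation and function names are illustrative choices, not FLAC's actual bitstream layout.

```python
def zigzag(n):
    """Map signed to unsigned: 0, -1, 1, -2, 2 -> 0, 1, 2, 3, 4."""
    return 2 * n if n >= 0 else -2 * n - 1

def unzigzag(u):
    """Inverse of zigzag()."""
    return u >> 1 if u % 2 == 0 else -((u + 1) >> 1)

def rice_encode(values, k):
    """Encode signed integers as a string of '0'/'1' characters."""
    bits = []
    for v in values:
        u = zigzag(v)
        bits.append("1" * (u >> k) + "0")  # quotient in unary
        if k > 0:
            bits.append(format(u & ((1 << k) - 1), f"0{k}b"))  # k-bit remainder
    return "".join(bits)

def rice_decode(bits, k, count):
    """Decode count signed integers from a bit string."""
    values = []
    pos = 0
    for _ in range(count):
        q = 0
        while bits[pos] == "1":
            q += 1
            pos += 1
        pos += 1  # skip the terminating "0"
        r = int(bits[pos:pos + k], 2) if k > 0 else 0
        pos += k
        values.append(unzigzag((q << k) | r))
    return values

residues = [10, 15, 5, -2, -8, 0, 3]
encoded = rice_encode(residues, k=3)
```

Small residues get short codes. Picking k well for each block of residues is where most of the tuning goes.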

## Psychoacoustics: Volume

• Solid Extron article

• Robinson-Dadson curve (AKA Fletcher-Munson curve)

• Three frequency bands

• Below 100 Hz: whatever
• 100-1000 Hz: bass
• 1-6 kHz: midrange
• 6-10 kHz: treble
• 10 kHz and up: whatever

• Three volume bands

• 40 phon: low (A-weighting, midrange)
• 70 phon: normal (B-weighting, moderate midrange)
• 100 phon: loud (C-weighting, flat)
• 100+ phon: aircraft (C-weighting, treble)

## Volume, Loudness, Presence

• Volume knob is log: ideal midpoint around 50 dB

• Voltage levels are a mess, with multiple standards: usually 1-2 Vpp maximum.

• A "loudness" control typically provides a big bass boost and a smaller treble boost

• A "presence" control gives a treble boost, but with some feedback and distortion at high volume
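
To show what a logarithmic volume control means numerically, here is a minimal sketch converting between decibels and amplitude gain. The `knob_gain` function and its 60 dB range are hypothetical, chosen only for illustration.

```python
import math

def db_to_gain(db):
    """Amplitude ratio for a level change in decibels."""
    return 10.0 ** (db / 20.0)

def gain_to_db(gain):
    """Level change in decibels for an amplitude ratio."""
    return 20.0 * math.log10(gain)

def knob_gain(position, range_db=60.0):
    """Gain for a knob position in [0, 1].

    Position 1.0 is full volume (0 dB); position 0.0 is -range_db.
    Equal knob movements give equal steps in dB, not in amplitude.
    """
    return db_to_gain((position - 1.0) * range_db)

for position in (0.0, 0.5, 1.0):
    print(position, knob_gain(position))
```

With these assumptions the knob's midpoint is 30 dB down, an amplitude gain of about 0.03 rather than 0.5.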

## Psychoacoustics: Harmonics, Stretch Tuning, Masking

• Recall: harmonics are multiples of fundamental frequency produced by distortion

• Because the ear is not so sensitive at low and high frequencies (at normal volumes), it selectively hears midrange harmonics of bass notes

• This means that a piano, for example, needs to be "stretch tuned" so that the midrange harmonics sound in tune

• The low frequencies are partially "masked"

## Psychoacoustics: Time Scales

• Let's assume a 50 ksps sample rate

• Smallest useful sample chunk for most things: 100 samples, 2 ms (one period of a 500 Hz tone)

• Fused sound: 500-2500 samples, 10-50 ms

• By 20 ms (1000 samples) latencies will be perceptible

• By 100 ms (5000 samples) latencies will be annoying: larger latencies are perceived as intolerable
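
The conversions between sample counts and durations are simple arithmetic. A minimal sketch, assuming the 50 ksps rate used here; the helper names are made up for illustration.

```python
SAMPLE_RATE = 50_000  # samples per second, as assumed in this section

def samples_to_ms(n, rate=SAMPLE_RATE):
    """Duration in milliseconds of n samples."""
    return 1000.0 * n / rate

def ms_to_samples(ms, rate=SAMPLE_RATE):
    """Number of samples, rounded, spanning a duration in milliseconds."""
    return round(ms * rate / 1000.0)

def period_frequency_hz(n, rate=SAMPLE_RATE):
    """Frequency of a tone whose period is exactly n samples."""
    return rate / n

for ms in (2, 10, 20, 50, 100):
    print(ms, "ms =", ms_to_samples(ms), "samples")
```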

## Application: Lossy Compression à la MP3

• Good Ars Technica MP3 tutorial

• High-level view:

• Split the input signal up into a bunch of frequency bands using a "polyphase filter"

• In each band:

• Use an FFT to figure out what's going on (this analysis feeds the psychoacoustic model)

• Use a DCT (MP3 uses the modified DCT, MDCT) to get a spectrum (noise subframes are special)

• Quantize the spectrum to reduce the number of bits (at the cost of added quantization noise)

• Huffman-encode the quantized coefficients to get a compact representation

• Combine all the compressed quantized coefficients to get a frame

• The details are quite complex: see something like Ogg Vorbis for a cleaner version
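
The transform-and-quantize core can be sketched in a few lines. This is not MP3: it uses a plain, naive DCT in place of the MDCT, a single fixed quantizer step in place of a psychoacoustic model, and it skips the polyphase filter and Huffman coding entirely. All function names and the `step` value are made up for illustration.

```python
import math

def _scale(k, n):
    """Orthonormal scaling factor for DCT coefficient k of n."""
    return math.sqrt(1.0 / n) if k == 0 else math.sqrt(2.0 / n)

def dct(block):
    """Naive orthonormal DCT-II of a list of samples."""
    n = len(block)
    return [
        _scale(k, n)
        * sum(x * math.cos(math.pi * (i + 0.5) * k / n) for i, x in enumerate(block))
        for k in range(n)
    ]

def idct(coeffs):
    """Inverse of dct()."""
    n = len(coeffs)
    return [
        sum(_scale(k, n) * c * math.cos(math.pi * (i + 0.5) * k / n)
            for k, c in enumerate(coeffs))
        for i in range(n)
    ]

def quantize(coeffs, step):
    """Round each coefficient to an integer multiple of step."""
    return [round(c / step) for c in coeffs]

def dequantize(levels, step):
    """Map integer levels back to approximate coefficients."""
    return [q * step for q in levels]

block = [math.sin(2 * math.pi * 3 * i / 32) for i in range(32)]
step = 0.1
levels = quantize(dct(block), step)
restored = idct(dequantize(levels, step))
print(sum(1 for q in levels if q == 0), "of", len(levels), "levels are zero")
```

The integer `levels` are what an entropy coder would then compress. A larger `step` gives more zeros and smaller integers, at the cost of more error in `restored`.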