CS 410/510 SOUND Sp2019: Psychoacoustics and Compression

A Look At A Sample

Let's consider a 16-bit audio sample:
```
00 010100000101 01
```
Sample consists of
- Headroom: Not recorded at maximum amplitude to avoid clipping
- Signal: The actual audio
- Noise: The low-order bits are typically garbage
The signal and noise blend together
As signals are manipulated, more noise creeps up into the signal bits, because each addition and multiplication introduces rounding error
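The split of the example sample can be sketched with a little bit arithmetic. This is only an illustration: the 2/12/2 field widths are taken from the spacing in the example above, not from any fixed rule, and the names `HEADROOM_BITS`, `SIGNAL_BITS`, `NOISE_BITS` and `split_sample` are made up here.

```python
# Field widths read off the example "00 010100000101 01" (an assumption,
# real recordings have no fixed boundary between signal and noise).
HEADROOM_BITS, SIGNAL_BITS, NOISE_BITS = 2, 12, 2

def split_sample(sample):
    """Split an unsigned 16-bit value into (headroom, signal, noise) fields."""
    noise = sample & ((1 << NOISE_BITS) - 1)
    signal = (sample >> NOISE_BITS) & ((1 << SIGNAL_BITS) - 1)
    headroom = sample >> (NOISE_BITS + SIGNAL_BITS)
    return headroom, signal, noise

sample = 0b00_010100000101_01
fields = split_sample(sample)
print(fields)
```

A headroom field of zero means the top two bits are unused, so the sample could be about four times larger before clipping.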

Artifact of fixed-range representation of PCM sample: floating-point samples are basically unclippable
If the amplitude to be represented goes over the max possible value or under the min, what to do?
Not much except clamp (clip) the sample as close as you can get it
Net effect: tops clipped off waves
This happens in analog systems also, because of max/min voltage limits
Discontinuity introduces harmonics: bad distortion
This is why headroom is left
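A rough sketch of the effect, assuming signed 16-bit samples: clamp an over-range sine and measure the third harmonic with a naive single-bin DFT. The helper names `clip16` and `harmonic_amplitude` and the amplitudes 30000 and 48000 are illustrative choices.

```python
import math

INT16_MIN, INT16_MAX = -32768, 32767

def clip16(x):
    """Clamp a value to the signed 16-bit range."""
    return max(INT16_MIN, min(INT16_MAX, int(round(x))))

def harmonic_amplitude(samples, k):
    """Amplitude of the k-th harmonic of one period, via a single DFT bin."""
    n = len(samples)
    re = sum(s * math.cos(2 * math.pi * k * i / n) for i, s in enumerate(samples))
    im = sum(s * math.sin(2 * math.pi * k * i / n) for i, s in enumerate(samples))
    return 2 * math.hypot(re, im) / n

N = 64
clean = [30000 * math.sin(2 * math.pi * i / N) for i in range(N)]  # fits in range
loud = [48000 * math.sin(2 * math.pi * i / N) for i in range(N)]   # over range
clipped = [clip16(s) for s in loud]

print(harmonic_amplitude(clean, 3))    # essentially zero
print(harmonic_amplitude(clipped, 3))  # clearly nonzero: distortion
```

The in-range sine has no energy at the third harmonic, while the clipped one does.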

Several kinds of common audio noise
- Uniform "white noise": easy to make with a computer
- "Pink noise": power rolls off in proportion to 1/f, about 3 dB per octave ("1/f noise", "flicker noise")
- "Brownian noise" ("1/f^2 noise") from random walk in time domain
- Check out this Wikipedia article on noise colors
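White and Brownian noise are both easy to sketch with the standard library. This is a minimal version, assuming samples in the range -1.0 to 1.0; the function names and the `step` parameter are invented for the example. Pink noise needs filtering and is left out.

```python
import random

def white_noise(n, rng):
    """Independent uniform samples: flat spectrum on average."""
    return [rng.uniform(-1.0, 1.0) for _ in range(n)]

def brown_noise(n, rng, step=0.02):
    """Random walk in the time domain, clamped so it stays in range."""
    out, level = [], 0.0
    for _ in range(n):
        level += rng.uniform(-step, step)
        level = max(-1.0, min(1.0, level))
        out.append(level)
    return out

rng = random.Random(410)  # fixed seed so runs are repeatable
white = white_noise(1000, rng)
brown = brown_noise(1000, rng)
```

Successive Brownian samples differ by at most `step`, which is why it sounds much duller than white noise.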

Idea: build a simple parameterized approximate model of the audio signal
- In the time domain
- In the frequency domain
Transmit the parameters as part of the compressed scheme
A choice remains:
- Transmit the residue (error in approximation) as a separate compressed stream: lossless compression
- Throw the residue away: lossy compression
Lossless (e.g. FLAC) is going to be limited for a lot of kinds of sounds. The fancier the model, the more kinds of sounds that can be compressed well
Lossy is harder, because it mustn't throw away anything whose loss would wreck the sound. Psychoacoustics is needed. Tends to be done in the frequency domain; models are generalized
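A minimal time-domain sketch of the model-plus-residue idea, assuming a fixed second-order predictor (each sample is predicted by extending the line through the previous two). This is similar in spirit to the fixed predictors in lossless codecs, but it is not any particular codec's format; `predict`, `encode` and `decode` are names made up here.

```python
def predict(history):
    """Predict the next sample from the samples seen so far."""
    if len(history) == 0:
        return 0
    if len(history) == 1:
        return history[-1]
    return 2 * history[-1] - history[-2]

def encode(samples):
    """Return the residue: actual sample minus prediction."""
    return [x - predict(samples[:n]) for n, x in enumerate(samples)]

def decode(residue):
    """Rebuild the samples exactly by re-running the same predictor."""
    out = []
    for e in residue:
        out.append(predict(out) + e)
    return out

samples = [100, 110, 121, 133, 146, 160]
residue = encode(samples)
print(residue)
```

Keeping the residue gives lossless compression, since the residue values are small and cheap to encode. Throwing it away would leave only the model's approximation.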

Typical to take a stereo pair and turn it into a mono channel (l + r) / 2 and a side channel (l - r)
The side channel is typically low amplitude, and so can be compressed easily
Side benefit: mono channel is easily extracted
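One way to sketch this for integer samples: halving (l + r) drops a bit, but that bit equals the low bit of (l - r), so the pair can still be recovered exactly. The function names below are illustrative, and this parity trick is one lossless variant assumed for the example, not necessarily what a given codec does.

```python
def to_mid_side(left, right):
    """Mono (mid) channel and side channel for one integer sample pair."""
    mid = (left + right) >> 1   # floor of the average
    side = left - right
    return mid, side

def from_mid_side(mid, side):
    """Exact inverse: the dropped bit of the sum is the low bit of side."""
    total = (mid << 1) | (side & 1)
    return (total + side) >> 1, (total - side) >> 1

pairs = [(3, 2), (-3, 2), (1000, 998), (-32768, 32767)]
encoded = [to_mid_side(l, r) for l, r in pairs]
decoded = [from_mid_side(m, s) for m, s in encoded]
```

For the nearly identical pair (1000, 998) the side value is just 2, which is the "low amplitude" case that compresses well.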

Solid Extron article
Robinson-Dadson curve (AKA Fletcher-Munson curve)
Three frequency bands
- Below 100 Hz: whatever
- 100-1000 Hz: bass
- 1-6 kHz: midrange
- 6-10 kHz: treble
- 10 kHz and up: whatever
Three volume bands
- 40 phon: low (A-weighting, midrange)
- 70 phon: normal (B-weighting, moderate midrange)
- 100 phon: loud (C-weighting, flat)
- 100+ phon: aircraft (C-weighting, treble)
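For a feel of what A-weighting does across the bands above, here is a sketch of the commonly published analog A-weighting magnitude formula. The constants and the +2.00 dB normalization are quoted from memory of the standard curve, so treat them as an assumption and check them against a reference before relying on them.

```python
import math

def a_weighting_db(f):
    """Approximate A-weighting gain in dB at frequency f (Hz)."""
    f2 = f * f
    num = (12194.0 ** 2) * f2 * f2
    den = ((f2 + 20.6 ** 2)
           * math.sqrt((f2 + 107.7 ** 2) * (f2 + 737.9 ** 2))
           * (f2 + 12194.0 ** 2))
    return 20.0 * math.log10(num / den) + 2.00  # roughly 0 dB at 1 kHz

for freq in (50, 100, 1000, 3000, 10000):
    print(freq, round(a_weighting_db(freq), 1))
```

The curve cuts bass heavily, is near flat through the midrange, and drops off again in the treble.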

Volume knob is log: ideal midpoint around 50 dB
Voltage levels are a mess, with multiple standards: usually 1-2 Vpp maximum.
A "loudness" control typically provides a big bass boost and a smaller treble boost
A "presence" control gives a treble boost, but with some feedback and distortion at high volume
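A logarithmic volume knob can be sketched as a linear mapping from knob position to decibels, followed by the usual dB-to-amplitude conversion. The 60 dB range and the name `knob_to_gain` are assumptions made for the example, not values from any standard.

```python
def knob_to_gain(position, range_db=60.0):
    """Map knob position 0.0..1.0 to an amplitude gain.

    Position 1.0 is full volume (0 dB); position 0.0 is range_db down.
    """
    if not 0.0 <= position <= 1.0:
        raise ValueError("position must be between 0.0 and 1.0")
    attenuation_db = (position - 1.0) * range_db
    return 10.0 ** (attenuation_db / 20.0)

for pos in (0.0, 0.25, 0.5, 0.75, 1.0):
    print(pos, knob_to_gain(pos))
```

Each equal step of the knob multiplies the amplitude by the same factor, which is roughly how loudness changes are perceived.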

Recall: harmonics are multiples of fundamental frequency produced by distortion
Because the ear is not so sensitive at low and high frequencies (at normal volumes), it selectively hears midrange harmonics of bass notes
This means that a piano, for example, needs to be "stretch tuned" so that the midrange harmonics sound in tune
The low frequencies are partially "masked"

Let's assume a 50 ksps sample rate
Smallest useful sample chunk for most things: 100 samples, 2 ms (500 chunks per second)
Fused sound: 500-2500 samples, 10-50 ms
By 20 ms (1000 samples) latencies will be perceptible
By 100 ms (5000 samples) latencies will be annoying; larger latencies are perceived as intolerable
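The sample counts above are just duration times sample rate. A small sketch of the arithmetic, with helper names chosen for the example:

```python
SAMPLE_RATE = 50_000  # samples per second, as assumed above

def ms_to_samples(ms, rate=SAMPLE_RATE):
    """Number of samples covering a duration in milliseconds."""
    return round(rate * ms / 1000)

def samples_to_ms(samples, rate=SAMPLE_RATE):
    """Duration in milliseconds of a run of samples."""
    return 1000 * samples / rate

for ms in (2, 10, 20, 50, 100):
    print(ms, "ms =", ms_to_samples(ms), "samples")
```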

Good Ars Technica MP3 tutorial
High-level view:
- Split the input signal up into a bunch of frequency bands using a "polyphase filter"
- In each band:
  - Use an FFT to figure out what's going on
  - Use a DCT (MP3 uses the modified DCT, MDCT) to get a power spectrum (noise subframes are speshul)
  - Quantize the spectrum to reduce the number of bits (introducing quantization noise as error in the power values)
  - Huffman-encode the quantized coefficients to get a compact representation
- Combine all the compressed quantized coefficients to get a frame
The details are quite complex: see something like Ogg Vorbis for a cleaner version
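The transform-then-quantize step can be illustrated with a toy. The sketch below uses a plain, naive DCT-II and a uniform quantizer with a single step size. Real MP3 uses a filterbank, an MDCT, nonuniform quantization and a psychoacoustic model to pick step sizes, so this is only the bare idea, and all names and the step size are illustrative.

```python
import math

def dct2(x):
    """Naive unnormalized DCT-II."""
    n = len(x)
    return [sum(x[i] * math.cos(math.pi / n * (i + 0.5) * k) for i in range(n))
            for k in range(n)]

def idct2(coeffs):
    """Inverse of dct2 (a scaled DCT-III)."""
    n = len(coeffs)
    return [coeffs[0] / n
            + (2.0 / n) * sum(coeffs[k] * math.cos(math.pi / n * (i + 0.5) * k)
                              for k in range(1, n))
            for i in range(n)]

def quantize(coeffs, step):
    """Round each coefficient to an integer multiple of step."""
    return [round(c / step) for c in coeffs]

def dequantize(levels, step):
    return [v * step for v in levels]

frame = [math.sin(2 * math.pi * 3 * i / 32) for i in range(32)]
STEP = 0.5  # bigger step: fewer bits, more quantization noise
levels = quantize(dct2(frame), STEP)   # small integers, ready for Huffman coding
decoded = idct2(dequantize(levels, STEP))
max_error = max(abs(a - b) for a, b in zip(frame, decoded))
print(max_error)
```

The integer `levels` are what an entropy coder such as Huffman would then pack. The reconstruction error is the quantization noise, which grows with `STEP`.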

Last modified: Tuesday, 23 April 2019, 11:11 AM