Psychoacoustics and Compression

A Look At A Sample

  • Let's consider a 16-bit audio sample:

    00 010100000101 01
    
  • Sample consists of

    • Headroom: Not recorded at maximum amplitude to avoid clipping

    • Signal: The actual audio

    • Noise: The low-order bits are typically garbage

  • The signal and noise blend together

  • As signals are manipulated, more noise creeps up into the signal bits because addition and multiplication

Clipping

  • Artifact of fixed-range representation of PCM sample: floating-point samples are basically unclippable

  • If the amplitude to be represented goes over the max possible value or under the min, what to do?

  • Not much except clamp (clip) the sample as close as you can get it

  • Net effect: tops clipped off waves

  • This happens in analog systems also because max/min voltages

  • Discontinuity introduces harmonics: bad distortion

  • This is why headroom

Noise

  • Several kinds of common audio noise

    • Uniform "white noise": easy to make with a computer

    • "Pink noise" that rolls off linearly with frequency ("1/f noise", "flicker noise")

    • "Brownian noise" ("1/f^2 noise") from random walk in time domain

    • Check out this Wikipedia article on noise colors

Audio Compression

  • Idea: build a simple parameterized approximate model of the audio signal

    • In the time domain
    • In the frequency domain
  • Transmit the parameters as part of the compressed scheme

  • A choice remains:

    • Transmit the residue (error in approximation) as a separate compressed stream: lossless compression

    • Throw the residue away: lossy compression

  • Lossless (e.g. FLAC) is going to be limited for a lot of kinds of sounds. The fancier the model, the more kinds of sounds that can be compressed well

  • Lossy is harder, because mustn't throw away stuff that wrecks the sound. Psychoacoustics is needed. Tends to be done in frequency domain; models are generalized

Audio Compression: Stereo

  • Typical to take a stereo pair and turn it into a mono channel (l + r) / 2 and a side channel (l - r)

  • The side channel is typically low amplitude, and so can be compressed easily

  • Side benefit: mono channel is easily extracted

Audio Compression: FLAC

  • Predict in time domain using polynomial model or Linear Predictive Code

  • Encode residue using Rice codes (related to Huffman codes)

  • Reliable compression > 2×

  • Remember: the noise must be compressed and recreated also

Psychoacoustics: Volume

  • Solid Extron article

  • Robinson-Dadson curve (AKA Fletcher-Munson curve)

  • Three frequency bands

    • Below 100 Hz: whatever
    • 100—1000 Hz: bass
    • 1—6 Khz: midrange
    • 6—10 KHz: treble
    • 10KHz and up: whatever
  • Three volume bands

    • 40 phon: low (A-weighting, midrange)
    • 70 phon: normal (B-weighting, moderate midrange)
    • 100 phon: loud (C-weighting, flat)
    • 100+ phon: aircraft (C-weighting, treble)

Volume, Loudness, Presence

  • Volume knob is log: ideal midpoint around 50 dB

  • Voltage levels are a mess, with multiple standards: usually 1—2 Vpp maximum.

  • A "loudness" control typically provides a big bass boost and a smaller treble boost

  • A "presence" control gives a treble boost, but with some feedback and distortion at high volume

Psychoacoustics: Harmonics, Stretch Tuning, Masking

  • Recall: harmonics are multiples of fundamental frequency produced by distortion

  • Because the ear is not so sensitive at low and high frequencies (at normal volumes), it selectively hears midrange harmonics of bass notes

  • This means that a piano, for example, needs to be "stretch tuned" so that the midrange harmonics sound in tune

  • The low frequencies are partially "masked"

Psychoacoustics: Time Scales

  • Let's assume a 50Ksps sample rate

  • Smallest useful sample chunk for most things: 100 samples, 50Hz, 2ms

  • Fused sound: 500-2500 samples, 10-50 ms

  • By 20ms (1000s) latencies will be perceptible

  • By 100ms (5000s) latencies will be annoying: larger latencies are perceived as intolerable

Application: Lossy Compression ala MP3

  • Good Ars Technica MP3 tutorial

  • High-level view:

    • Split the input signal up into a bunch of frequency bands using a "polyphase filter"

    • In each band:

      • Use an FFT to figure out what's going on

      • Use a DCT to get a power spectrum (noise subframes are speshul)

      • Quantize the spectrum to reduce the number of bits (giving power errors due to noise)

      • Huffman-encode the quantized coefficients to get a compact representation

    • Combine all the compressed quantized coefficients to get a frame

  • The details are quite complex: see something like Ogg Vorbis for a cleaner version

Last modified: Tuesday, 23 April 2019, 11:11 AM