Demystifying Audio Watermarking, Fingerprinting and Modulation/Demodulation

We summarise three key approaches to extracting digital information from
acoustic signals.

In the real world, we’re continuously inferring meaning from sounds we hear
around us, from alarms and ring-tones to the complex semantics of spoken
language. Similarly, in the digital realm, there are lots of different ways to
embed and extract meaning from a piece of sound — and different motivations
for doing so.

In this post, we’ll examine three specific approaches to audio information
analysis: watermarking (as done by Digimarc and Fraunhofer), fingerprinting (as done by Shazam), and modulation/demodulation (as done by us at Chirp).

Each has specific affordances and applications, with substantial overlap
between them. This article should help to distinguish between the three, and
explain where and why each can be best deployed.

Motivations behind audio information retrieval

There are a number of different reasons we may need to extract information
from an audio signal.

  • for communication : A dominant theme — and the one we’re concerned with at Chirp — is the use of audio as a data carrier, translating digital information into sound and then decoding it at the receiving end. The objective is simply to send a digital message, intact, from machine to machine. This has a long historical precedent, from Morse code to V.90 dial-up modems. Today, the big revolution is in doing so over the air, across “air-gapped” distances between the speakers and microphones of devices, a situation that is becoming increasingly relevant in the age of the Internet of Things.
  • to obtain metadata : When listening to a piece of music on the radio or in a podcast, it’s useful to be able to find out what it is and who it’s by. Assuming that all we have to go on is the audio itself (without an additional metadata stream such as m3u), this means that the metadata must somehow be encoded within the audio, or derived from it by identifying the work.
  • to protect rights-holders : For owners or licensees of copyrighted material, it’s important to check whether their library of material has been properly licensed by third-party broadcasters. This can be done by automatically listening in to radio and TV streams, identifying the material and confirming that the broadcaster has licensed the work. Likewise, a broadcaster can use information retrieval to check over their own library to confirm the rights are in place.
  • for musical analysis and performance : Fields such as computational musicology can provide a wealth of information on compositional trends, instrumentation and performance styles by automatically analysing a recorded work.
  • device synchronisation and second-screen display : An increasing number of traditional TV broadcasts now make use of devices such as tablets and mobile phones to display supplementary information, or to provide secondary interactive controls such as voting mechanisms. Embedding sync data in the audio stream allows the broadcast and the accompanying device to stay in sync with each other, to trigger specific actions, or to present interactive interfaces to the viewer.

Approaches

We’ll now look at three particular approaches to extracting information from
an audio stream: watermarking, fingerprinting, and modulation/demodulation.

Fingerprinting

Fingerprinting, or “content-based audio identification”, produces a
fingerprint of a snippet of audio by analysing its musical content and mapping
out its general contours — for example, looking for distinctive melodies or
rhythms. (In practice, most real-world implementations compute more
sophisticated measures from properties of the frequency spectrum.)

This fragmentary fingerprint can then be looked up in a huge database or
“corpus” of known fingerprints, typically to identify the source music that
the fragment comes from. The best-known example of audio fingerprinting in the
consumer realm is Shazam, capable of identifying short fragments of music from
a vast database of tens of millions of tracks. The difficulty of this kind
of real-world fingerprinting is that it must be (a) performed efficiently, in
near real time, despite the complexity of the search process; and (b) robust to
background noise and distortion, given that the sample may be taken from a
noisy environment or played from a low-bitrate MP3. (More about how Shazam
works.)

Fingerprinting. From Cano et al (2005).
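
To make this concrete, here is a minimal Python sketch of a peak-based fingerprint in the spirit of the scheme described above. The frame size, peak count and function names are all illustrative choices; a production system pairs peaks into robust time-frequency hashes rather than using raw (frame, frequency) tuples.

```python
import numpy as np

def spectral_peak_fingerprint(samples, sr=11025, n_fft=1024, hop=512,
                              peaks_per_frame=3):
    """Toy landmark fingerprint: the strongest spectral peaks per STFT frame.

    Loud peaks are what best survive background noise and lossy compression,
    which is why real systems build their hashes from them.
    """
    window = np.hanning(n_fft)
    landmarks = []
    for i, start in enumerate(range(0, len(samples) - n_fft, hop)):
        frame = samples[start:start + n_fft] * window
        magnitude = np.abs(np.fft.rfft(frame))
        top_bins = np.argsort(magnitude)[-peaks_per_frame:]
        # Record (frame index, peak frequency in Hz) pairs as a crude digest.
        landmarks.extend((i, round(b * sr / n_fft)) for b in sorted(top_bins))
    return landmarks

# Example: fingerprint one second of a 440 Hz + 880 Hz test tone.
sr = 11025
t = np.arange(sr) / sr
audio = np.sin(2 * np.pi * 440 * t) + 0.5 * np.sin(2 * np.pi * 880 * t)
print(spectral_peak_fingerprint(audio, sr)[:5])
```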

Fingerprinting is applied in the broadcast realm to protect rights holders from copyright infringement. Services such as Nielsen Broadcast Data Systems provide fingerprinting-as-a-service, maintaining a database of tracks on behalf of their rights holders and then scanning a substantial list of radio stations to ensure licensing is in place.

It can also be deployed for synchronisation and second-screen purposes, with
the ability to make broadcasts interactive by listening for the specific audio fingerprint in question.

The greatest power — and the greatest constraint — of fingerprinting is that
it does not modify the original source material in any way. This has the major
benefit of being non-destructive and not requiring any expensive or time-
consuming retroactive processing of media databases.

However, it means that arbitrary data cannot be directly embedded or
communicated in this way. Once the fingerprint is obtained, all it provides is
a digest of the track, which is then looked up in a database. This additional
lookup step typically requires that the device be internet-connected (in the
case of networked databases such as Shazam’s), and adds some architectural
complexity.

Because a fingerprint must already be present in the corpus for a match to be
successful, the fingerprinting paradigm is a poor fit for cases where the data
to be embedded needs to be generated or modified in real time, such as a
sports score.
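
To illustrate the lookup step, here is a toy inverted index that pairs with the fingerprint sketch above: it simply votes for whichever indexed track shares the most landmarks with the query. A real corpus is vastly larger and uses far more careful hashing; all names here are illustrative.

```python
from collections import Counter, defaultdict

# Toy stand-in for the server-side corpus: landmark -> set of track IDs.
corpus = defaultdict(set)

def index_track(track_id, landmarks):
    """Register a known track's landmarks in the corpus."""
    for lm in landmarks:
        corpus[lm].add(track_id)

def lookup(query_landmarks):
    """Return the track sharing the most landmarks with the query, if any."""
    votes = Counter(track for lm in query_landmarks
                    for track in corpus.get(lm, ()))
    return votes.most_common(1)[0][0] if votes else None

# A noisy query still matches, because most landmarks survive.
index_track("Track A", [(0, 440), (1, 440), (2, 880)])
index_track("Track B", [(0, 523), (1, 659), (2, 784)])
print(lookup([(0, 440), (2, 880), (5, 1200)]))  # -> "Track A"
```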

Watermarking

Similarly to fingerprinting, audio watermarking is popular for rights
management in that it can also be used to identify a particular audio
recording or broadcast. Unlike fingerprinting, watermarking operates by
modifying an original piece of source material by layering additional
information on top of it.

The crux of watermarking is that it should be done in a way that is (a)
imperceptible to the listener, yet (b) resilient to distortion and
compression. This is a challenging and often contradictory set of
requirements: compression codecs such as MP3 operate on the basis that they
explicitly strip out those parts of the acoustic signal that are not audible
to the listener. Yet a good watermark should persist and remain detectable
even when compressed using a lossy codec such as MP3.

How is this achieved? Typically, using clever combinations of steganography
(that is, concealing information by subtly modifying the source material) and
psychoacoustics (the scientific understanding of how humans perceive sound).
Some approaches layer low-frequency noise onto the original recording, which
can be correlated with an expected carrier pattern; others make use of
psychoacoustic phenomena such as the Haas effect, adding a subtle echo to the
recording that our brains perceive as part of the original sound.
For a summary of different approaches, see this BBC R&D audio watermarking
white paper.
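
As a rough sketch of the echo-based technique (the delays, amplitude and crude autocorrelation detector below are illustrative; practical systems typically detect via cepstral analysis and embed far more robustly):

```python
import numpy as np

def embed_echo_bit(segment, bit, sr=44100, amp=0.05):
    """Echo hiding: the delay of a faint echo encodes one bit.

    Delays of ~1.0 ms vs ~1.5 ms are short enough that the Haas effect
    fuses the echo with the original sound perceptually.
    """
    delay = int(sr * (0.001 if bit == 0 else 0.0015))
    echoed = segment.copy()
    echoed[delay:] += amp * segment[:-delay]
    return echoed

def detect_echo_bit(segment, sr=44100):
    """Decide the bit from whichever candidate delay correlates more strongly.

    Real detectors use the cepstrum; plain autocorrelation is a crude
    stand-in that works here because the host is noise-like.
    """
    d0, d1 = int(sr * 0.001), int(sr * 0.0015)
    corr = lambda d: np.dot(segment[d:], segment[:-d])
    return 0 if corr(d0) > corr(d1) else 1

# Round-trip one bit through a noise segment standing in for host audio.
sr = 44100
segment = np.random.randn(sr)
print(detect_echo_bit(embed_echo_bit(segment, 1, sr), sr))  # -> 1
```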

Watermarking itself has different motivations, which affect the selected
approach. A watermark for copyright control should be as difficult as possible
to detect and remove from the original broadcast, and should ideally use
cryptographic processes that cannot easily be reverse-engineered, to prevent
others from stripping or spoofing watermarks.

More benign motives for watermarking include adding metadata to a broadcast,
such as subtitles, artist/track data or sync information. In these cases, the
hard-to-remove requirement is relaxed. Here, a watermark could be as simple as
a set of inaudible ultrasonic tones overlaid onto the original material. This
can achieve higher bit rates than the subtler approaches of steganographic
watermarking, whilst retaining the benefits of imperceptibility and dynamic
payload support. Chirp’s ultrasonic encoding protocols can be used to
introduce inaudible watermarks of this type.
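
As an illustration of the general idea (not Chirp’s actual protocol; all frequencies, amplitudes and timings below are invented for the sketch), the following mixes one quiet near-ultrasonic tone per data symbol on top of host audio:

```python
import numpy as np

def overlay_ultrasonic_tones(host, symbols, sr=44100, base_hz=18000,
                             step_hz=200, symbol_len=0.1, amp=0.02):
    """Overlay one quiet near-ultrasonic tone per symbol on the host audio.

    These values are illustrative: real protocols also need preambles,
    windowing to avoid clicks, and error correction.
    """
    out = host.copy()
    n = int(sr * symbol_len)
    t = np.arange(n) / sr
    for i, sym in enumerate(symbols):
        tone = amp * np.sin(2 * np.pi * (base_hz + step_hz * sym) * t)
        out[i * n:(i + 1) * n] += tone  # assumes the host is long enough
    return out

# Mark half a second of (stand-in) host audio with the symbols 0, 3, 7.
sr = 44100
host = 0.1 * np.random.randn(sr // 2)
marked = overlay_ultrasonic_tones(host, [0, 3, 7], sr)
```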

Modulation/Demodulation

The final approach we’ll look at is that used by Chirp in our technology
products. Audio data encoding — or modulation/demodulation — is a technology
that has been used since the early days of radio communication, from Morse
code to DTMF dial-tones to the V.90/V.92 56kbps protocols used by dial-up
modems and fax machines. This is, in fact, the etymology of “modem”:
modulation/demodulation.

Unlike either of the previous approaches, it does not require an existing
audio signal to operate on. Instead, data is encoded by generating a new
signal whose properties are determined by the data to be transmitted. In the
simplest mapping, the presence of a signal denotes a “1”, and the absence of a
signal denotes a “0”.

Modulation. From Wikipedia: Modulation.
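
Here is a minimal sketch of that simplest presence/absence mapping (on-off keying, a special case of amplitude shift keying), assuming an arbitrary carrier frequency, bit duration and energy threshold:

```python
import numpy as np

def ook_modulate(bits, sr=44100, carrier_hz=1000, bit_len=0.05):
    """On-off keying: tone present = "1", silence = "0"."""
    n = int(sr * bit_len)
    t = np.arange(n) / sr
    tone = np.sin(2 * np.pi * carrier_hz * t)
    return np.concatenate([tone if b else np.zeros(n) for b in bits])

def ook_demodulate(signal, sr=44100, bit_len=0.05, threshold=0.1):
    """Recover bits by measuring the energy in each bit-length window."""
    n = int(sr * bit_len)
    frames = signal[:len(signal) // n * n].reshape(-1, n)
    return [int(np.mean(frame ** 2) > threshold) for frame in frames]

print(ook_demodulate(ook_modulate([1, 0, 1, 1])))  # -> [1, 0, 1, 1]
```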

Where noise is present in the communications channel, such a distinction is
less clear, so early dial-up modems would instead use one frequency to denote
a “1” (1270 Hz, in the case of Bell 103) and another to denote a “0” (1070 Hz).
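
A toy version of that two-tone scheme, using the Bell 103 frequencies mentioned above; the sample rate and the simple per-bit correlation detector are illustrative rather than a faithful modem implementation:

```python
import numpy as np

MARK_HZ, SPACE_HZ = 1270, 1070  # Bell 103 originate-side "1" and "0" tones

def fsk_modulate(bits, sr=8000, baud=300):
    """Binary FSK: one of two tones per bit period."""
    n = int(sr / baud)
    t = np.arange(n) / sr
    return np.concatenate(
        [np.sin(2 * np.pi * (MARK_HZ if b else SPACE_HZ) * t) for b in bits])

def fsk_demodulate(signal, sr=8000, baud=300):
    """Decide each bit by comparing energy at the two candidate frequencies."""
    n = int(sr / baud)
    t = np.arange(n) / sr
    probes = {f: np.exp(-2j * np.pi * f * t) for f in (MARK_HZ, SPACE_HZ)}
    bits = []
    for frame in signal[:len(signal) // n * n].reshape(-1, n):
        e_mark = abs(np.dot(frame, probes[MARK_HZ]))
        e_space = abs(np.dot(frame, probes[SPACE_HZ]))
        bits.append(int(e_mark > e_space))
    return bits

print(fsk_demodulate(fsk_modulate([1, 0, 0, 1])))  # -> [1, 0, 0, 1]
```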

Of course, it is possible to go beyond a binary on-or-off approach. Chirp’s
communication system maps integers to larger sets of frequencies: our standard
protocol uses tones at 32 different frequencies, resulting in far higher
throughput.
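
To show the principle (with made-up frequencies rather than Chirp’s real protocol): each symbol carries log2(32) = 5 bits by selecting one of 32 tones, and the receiver picks whichever candidate frequency holds the most energy in each frame.

```python
import numpy as np

def mfsk_modulate(symbols, sr=44100, base_hz=1000, step_hz=100,
                  symbol_len=0.05):
    """32-ary FSK sketch: each symbol 0..31 selects one of 32 tones."""
    n = int(sr * symbol_len)
    t = np.arange(n) / sr
    return np.concatenate(
        [np.sin(2 * np.pi * (base_hz + step_hz * s) * t) for s in symbols])

def mfsk_demodulate(signal, sr=44100, base_hz=1000, step_hz=100,
                    symbol_len=0.05):
    """Pick, per frame, whichever of the 32 candidate tones has most energy."""
    n = int(sr * symbol_len)
    freqs = base_hz + step_hz * np.arange(32)
    probes = np.exp(-2j * np.pi * np.outer(freqs, np.arange(n) / sr))
    frames = signal[:len(signal) // n * n].reshape(-1, n)
    return [int(np.argmax(np.abs(probes @ frame))) for frame in frames]

print(mfsk_demodulate(mfsk_modulate([0, 17, 31])))  # -> [0, 17, 31]
```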

However, sending acoustic information between air-gapped devices has
challenges of its own. Background noise and reverberation serve to distort the
original signal, meaning that the transmission rate must be reduced to keep
reliability high.

There’s also the likelihood of missed tones — for example, where a passing
siren obscures part of the original signal — which requires additional error
correction to be factored in to compensate for misdetections.
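
As a small illustration of the kind of error correction involved, here is a classic Hamming(7,4) code, which can repair any single flipped bit per 7-bit codeword; real acoustic protocols typically use stronger codes, but the principle is the same.

```python
def hamming74_encode(d):
    """Hamming(7,4): 4 data bits -> 7-bit codeword, able to correct any
    single-bit error (e.g. one tone obscured by a passing siren)."""
    d1, d2, d3, d4 = d
    p1 = d1 ^ d2 ^ d4
    p2 = d1 ^ d3 ^ d4
    p3 = d2 ^ d3 ^ d4
    return [p1, p2, d1, p3, d2, d3, d4]

def hamming74_decode(c):
    """Compute the syndrome; a nonzero value is the 1-based error position."""
    c = list(c)
    s1 = c[0] ^ c[2] ^ c[4] ^ c[6]
    s2 = c[1] ^ c[2] ^ c[5] ^ c[6]
    s3 = c[3] ^ c[4] ^ c[5] ^ c[6]
    pos = s1 + 2 * s2 + 4 * s3
    if pos:
        c[pos - 1] ^= 1  # flip the corrupted bit back
    return [c[2], c[4], c[5], c[6]]

# A flipped bit (a "missed tone") is corrected transparently.
code = hamming74_encode([1, 0, 1, 1])
code[4] ^= 1  # simulate a misdetection
print(hamming74_decode(code))  # -> [1, 0, 1, 1]
```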

Amplitude Shift Keying, a simple method of acoustic modulation

One of the key strengths of audio modulation is that information can be
encoded and decoded in real time, without any external components such as a
look-up database. It’s a good fit for compact, dynamic payloads, which makes
it appropriate for creating network-like communication links between devices.
It offers relatively high throughput compared with watermarking (and certainly
fingerprinting), making it suitable for security-critical applications such as
exchanging authentication or payment tokens.

It is also relatively computationally light, particularly in comparison to the
great complexity of a fingerprinting server. This means that audio-based
communication is viable amongst simple devices such as IoT nodes.

Modulation plays some part in watermarking, in which data must be acoustically
encoded before it can be applied to the original source material. However, the
ultrasonic “watermarking” described above would be highly inappropriate for
copyright control or other integrity-critical purposes: it could trivially be
removed by a filter that strips out ultrasonic frequencies.

In Summary

The following summarises some of the key affordances of each of these three
approaches to audio information retrieval.

  • Fingerprinting : does not modify the source material; cannot carry arbitrary data; requires a (typically networked) database lookup; a poor fit for dynamic, real-time payloads.
  • Watermarking : modifies the source material; carries arbitrary data at relatively low bit rates; requires no lookup; ultrasonic schemes support dynamic payloads but are easily stripped.
  • Modulation/demodulation : generates its own signal rather than modifying existing audio; offers comparatively high throughput and real-time encoding and decoding; requires no lookup; computationally light enough for simple IoT devices.
