Audio fingerprinting — what is it and why is it useful?

An audio fingerprint (also referred to as an acoustic fingerprint) is a
compact representation of some audio (be it music, environmental sound etc.)
that encapsulates information that is specific to the audio that it
represents. The role of an audio fingerprint is to capture the signature of
a piece of sound, such as a song, that allows it to be differentiated from
other sounds. Audio fingerprinting has many applications, including
watermarking, monitoring broadcast/distribution of audio content, and
content-based sound retrieval. The latter application will be the focus of
this article, although the same methods and considerations apply to other
uses of audio fingerprint technology.

Much in the same way that human fingerprints are used to identify a single
person, audio fingerprints are designed to be specific to an instance of
sound (an audio file), as opposed to a concept or class of sounds (like
‘ambient music’ or ‘rain sounds’). So let’s see how this works in practice.

Figure 1 — Spectrogram of some speech

Most audio fingerprinting technology used by services like Shazam and
AcoustID extracts fingerprints from a time-frequency representation of audio
called a spectrogram (Figure 1 shows a spectrogram of some speech).
Spectrograms are great because they allow us to identify the frequency content
over time, and how loud or quiet each frequency is. This is all well and good,
but in their raw form spectrograms are not very useful as audio fingerprints,
for a couple of reasons. Firstly, they contain a lot of information, much of
which may be redundant for the purposes of audio fingerprinting. Secondly,
they are not robust to degradation of audio quality. This is shown in Figure
2, which is the same audio file as Figure 1, but played in a different
environment with clearly audible background noise. The background noise has
clearly changed the spectrogram, but one thing is immediately obvious: the
peaks are mostly intact. As such, the spectrogram peaks are a good starting
point for generating a robust audio fingerprint.
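As a rough sketch of how such a spectrogram can be computed (the window
length, overlap and file name below are illustrative assumptions, not the
parameters of any particular service), a short-time Fourier transform in
Python might look like this:

```python
import numpy as np
from scipy.io import wavfile
from scipy.signal import spectrogram

# Load an audio file (file name is just a placeholder); assume mono samples
sample_rate, samples = wavfile.read("speech.wav")

# Short-time Fourier transform: 1024-sample windows with 50% overlap
freqs, times, sxx = spectrogram(samples, fs=sample_rate,
                                nperseg=1024, noverlap=512)

# Work in decibels so that quieter components remain visible
sxx_db = 10 * np.log10(sxx + 1e-10)
```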

Figure 2 — The same piece of speech played in a noisy environment

There are many approaches to identifying peaks, but what we essentially want
is to pick out the salient points in each region of the spectrogram that are
not a result of any background noise. Each region can be considered as a 2D
window, the size of which will determine the number of peaks we end up with.
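As a minimal sketch, assuming we already have the spectrogram in decibels,
peaks can be picked with a local maximum filter; the window size and level
threshold here are illustrative choices rather than values from any
production system:

```python
import numpy as np
from scipy.ndimage import maximum_filter

def detect_peaks(sxx_db, window=(32, 32), threshold_db=-40.0):
    """Return (freq_bin, time_bin) coordinates of spectrogram peaks.

    A point counts as a peak if it is the maximum within a 2D window
    centred on it and sits above a fixed level floor, which helps to
    reject low-level background noise.
    """
    local_max = maximum_filter(sxx_db, size=window) == sxx_db
    strong = sxx_db > threshold_db
    return np.argwhere(local_max & strong)
```

Shrinking the window yields more peaks per region; enlarging it keeps only
the most prominent ones.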

Figure 3 — Segment of speech with peaks annotated

Figures 3 and 4 show the peaks detected on our speech file, with and without
background noise. Note that the peaks appear more closely grouped at higher
frequencies, but this is just due to the logarithmic frequency scale used for
the plotting. One nice property of taking the peaks (as opposed to other audio
features such as spectral statistics or zero crossing rate) is robustness to
noise. Even with the clearly audible background noise, most of the peaks in
the clean audio file are also detected in the noisy one, with a few extra
peaks introduced by the background noise.

Figure 4 — The same segment of speech played in a noisy environment, with peaks annotated

Once we have identified all the peaks in an audio file, we have the starting
point for a fingerprint. We could just take the coordinates of each peak as a
fingerprint, but it is easy to imagine that there are lots of pieces of audio
(or songs) that share some peak positions, so it is clear a single peak would
not suffice as a unique fingerprint. This is explained in this excellent
paper by Avery Wang (the man behind Shazam’s technology).

For a given frame of audio analysed over 1024 frequencies, we would have 1024
potential peak positions, equating to 10 bits of information. This is pretty
low considering the potential size of the search space (millions of songs or
many hours of audio), so we need a method of increasing the entropy of our
fingerprint. There are a few clever ways to do this. Wang’s approach is to
construct fingerprints from pairs of peaks (Figure 5). The peaks are split up
into target zones, and each zone is allocated an anchor peak. Each peak in the
target zone is paired with the anchor. A hash can then be constructed for each
pair, consisting of the frequencies of the two peaks and the time difference
between them.
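A minimal sketch of this pairing scheme is given below; the fan-out limit,
target-zone width and bit packing are illustrative assumptions rather than
the exact parameters of Wang’s paper:

```python
def pair_hashes(peaks, fan_out=5, max_dt=64):
    """Build (hash, anchor_time) entries from spectrogram peaks.

    peaks: iterable of (freq_bin, time_bin) pairs.
    Each anchor peak is paired with up to `fan_out` later peaks that fall
    within `max_dt` time frames of it, and each pair is packed into a
    single integer hash of (anchor_freq, target_freq, time_delta).
    """
    peaks = sorted(peaks, key=lambda p: p[1])
    hashes = []
    for i, (f1, t1) in enumerate(peaks):
        paired = 0
        for f2, t2 in peaks[i + 1:]:
            dt = t2 - t1
            if dt > max_dt:
                break
            # Pack three small integers into one hash value
            h = (int(f1) << 20) | (int(f2) << 10) | int(dt)
            hashes.append((h, t1))
            paired += 1
            if paired >= fan_out:
                break
    return hashes
```

At query time the same hashes are computed from the recording and looked up
in a database keyed by hash value, with the anchor times used to check that
matching hashes line up consistently in time.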

Figure 5 — Pair of peaks used for the fingerprints (each fingerprint based on an anchor-peak pair)

But there are many alternative ways to increase the entropy. For example, we
could take all the peaks in a section of audio as the fingerprint (Figure 6).
The size of the section (i.e. width of the fingerprint) is an important factor
here, and will depend on the use case. If it is too short, then we risk having
low entropy (i.e. the fingerprint will not contain enough information to
differentiate it from fingerprints of other audio files). However, if it is
too long then we increase the amount of information required to store a
database of fingerprints and search the database for a given query. As such,
there is a trade-off between the amount of information we need to make our
fingerprints unique and the practicality of searching for potentially millions
of fingerprints.
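As an illustration of the fixed-section alternative (the bin counts and
section width here are arbitrary assumptions), the peaks of one section
could simply be rasterised into a binary matrix and two such fingerprints
compared by counting mismatches:

```python
import numpy as np

def section_fingerprint(peaks, n_freq_bins=1024, n_frames=128):
    """Rasterise the peaks of one fixed-length audio section into a binary matrix."""
    fp = np.zeros((n_freq_bins, n_frames), dtype=np.uint8)
    for f, t in peaks:
        if 0 <= f < n_freq_bins and 0 <= t < n_frames:
            fp[f, t] = 1
    return fp

def fingerprint_distance(fp_a, fp_b):
    """Count the positions where two fingerprints disagree (a Hamming-style distance)."""
    return int(np.count_nonzero(fp_a != fp_b))
```

The wider the section, the larger each fingerprint becomes and the more
costly it is to store and compare, which is exactly the trade-off described
above.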

The approach that we developed for the Hijinx Alive Beat Bugs toys is unusual
in that it is capable of performing real-time audio fingerprint recognition on
a particularly low-spec computer: in fact, on an embedded microchip that costs
less than $1 per unit.

It’s a very different set of constraints to the large-scale fingerprinting
used by Shazam, and an example of how fingerprinting can be used to create
interactive audio experiences without high-performance hardware.

In this blog, we have explained what audio fingerprints are, and given some
examples of how they can be created. It is important to note that there is no
single approach to generating an audio fingerprint, although most of the best
performing methods are based on peaks in the spectrogram. Some approaches to
audio fingerprinting lend themselves more towards scaling up to potentially
millions of audio files (and hundreds or thousands of millions of
fingerprints), whereas others are better suited to tight hardware constraints,
such as low memory and limited processing power.


Adib’s experience in audio spans a wide range of areas, including sound
engineering, signal processing, machine learning, vocal analysis, and audio
perception. He holds a BSc in Audio Technology and is currently finishing up a
PhD in the Centre for Digital Music at Queen Mary University of London. Adib
joined Chirp to focus on researching machine listening and intelligent audio
systems.

Chirp is a technology company enabling a seamless transfer of digital
information via soundwaves, using a device’s loudspeaker and microphone only.
The transmission uses audible tones or inaudible ultrasound and takes place
with no network connection. To learn more visit chirp.io