At Chirp, we make a communications technology which promises zero-friction data exchange between nearby devices, by transforming information into packets of audio that can be sent from speakers and received by microphones. One of the key advantages of this is universality: Chirp can run on pretty much any device that can play and receive audio, from laptops to $1 microcontrollers.

To back up that promise, we need to demonstrate that Chirp works on any device. This is particularly tricky in the Android ecosystem, in which there are hundreds of different handsets, each with different speakers, microphones, ADCs, DACs, casings, operating system versions, and other audio software capabilities.

It's impossible to test all of these devices manually. So, instead, we've harnessed the power of the AWS Device Farm to automate testing over thousands of real, physical devices.

The Device Farm: An on-demand fleet of 2500+ automated handsets

One of our favourite, under-appreciated AWS resources, the Device Farm is a server room packed full of physical handsets which can be automated remotely to perform tasks. It's typically used for large-scale unit testing, or for checking whether your UI renders correctly on actual devices.

It's particularly useful for catching rare bugs which happen on certain handsets, architectures, or older Android API versions.

Testing end-to-end transmission

Unlike typical software or UI tests, which can be run purely in silico, our goal is to validate that a given device can both send and receive Chirp signals in real-world conditions. This means that it needs to:

  • Encode a Chirp signal, and play the audio intelligibly from the speaker, and
  • Receive audio from the microphone, and successfully decode the Chirp signal

This sounds straightforward, but there's a phenomenal range of performance across the spectrum of audio transducers. In particular, we know that speakers and mics have variable abilities to render and receive audio in the near-ultrasonic 17-20kHz range, which we rely on because it's inaudible to most adult humans but can be picked up by most mics. We need to make sure that our near-ultrasonic protocols will still work satisfactorily even on low-end devices.
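
As a rough illustration of the band we're working in, here's a small numpy sketch that synthesises a linear sweep from 17kHz to 20kHz. To be clear, this isn't Chirp's actual encoder; it's just the kind of signal you can play through a handset's speaker and re-record to get a quick feel for how well its transducers cope with the near-ultrasonic range.

```python
# Illustrative only: a bare-bones near-ultrasonic sweep, not Chirp's encoder.
import numpy as np

SAMPLE_RATE = 44_100              # Hz; supported by virtually every Android device
DURATION = 0.5                    # seconds
F_START, F_END = 17_000, 20_000   # the near-ultrasonic band our protocols sit in

t = np.arange(int(SAMPLE_RATE * DURATION)) / SAMPLE_RATE
# Sweep the instantaneous frequency linearly from F_START to F_END,
# then integrate it (cumulative sum) to get the phase of the signal.
inst_freq = F_START + (F_END - F_START) * t / DURATION
phase = 2 * np.pi * np.cumsum(inst_freq) / SAMPLE_RATE
sweep = 0.5 * np.sin(phase)       # half amplitude, to be kind to small speakers
```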

We typically carry out end-to-end tests by placing two devices across a test laboratory from each other, fixing the background noise and reverberation conditions, and recording the success and error rates.

In the case of the Device Farm, testing can only be done on one device at a time. Although parallel testing is theoretically possible, we can't use it for our purposes, as devices close to each other may interfere with one another's tests. So, instead, we set up a special version of the Chirp SDK that can hear its own transmissions:

  • When a test starts, the device begins sending 100 chirps from its loudspeaker.
  • The same chirps are simultaneously picked up by the microphone and decoded, and a detailed log of decodes and errors is recorded.
  • Finally, the collective results are aggregated and logged.

Setting up the instrumentation tests with Espresso and JUnit

For each trial, we run a simple instrumentation test which triggers the start and stop buttons of our tester app. Each test loops through 100 pre-generated payloads and sends each one.

As for the receiver, there are three possible outcomes here (sketched in code after this list):

  • If a payload is successfully decoded and appears in the pre-generated list, we count it as a successfully received payload.
  • If a decoded payload is not in the pre-generated list, it is a false positive, which indicates serious issues with the decoder, as this should never happen.
  • If a payload is successfully detected but not decoded, we count it as a failure, most likely because of the noisy environment or a low-quality microphone or speaker.
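
In code, that bookkeeping boils down to something like the sketch below. The decode-log format and field names here are illustrative rather than the tester app's real API.

```python
# Hedged sketch of the receiver-side bookkeeping described above.
# The log format is hypothetical: each event records whether a chirp was
# detected, whether it was decoded, and the payload bytes if so.
from collections import Counter

def classify_decodes(expected_payloads, decode_log):
    """Tally each decode event against the list of pre-generated payloads."""
    expected = set(expected_payloads)
    counts = Counter()
    for event in decode_log:
        if event["detected"] and not event["decoded"]:
            counts["failed_decode"] += 1      # heard a chirp, couldn't decode it
        elif event["decoded"] and event["payload"] in expected:
            counts["received"] += 1           # a genuine, successfully received payload
        elif event["decoded"]:
            counts["false_positive"] += 1     # decoded a payload we never sent: decoder bug
    return counts
```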

While it is sending, the tester app also records the audio input from the microphone, so we can compare and analyse the results later.

We usually run these tests on a wide range of devices, from the old Samsung Galaxy S3 (released in 2012, running Android 4.3) to the newer Samsung Galaxy S10 running Android 9.

In the server room hosting the Device Farm, some devices sit in close proximity to one another, which means we cannot run the tests in parallel on multiple devices, as they may clash with each other. As a consequence, each full test is a slow process, taking between 4 and 8 hours to run on ~100 devices. We typically schedule a series of tests to run overnight or over the weekend.
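
Kicking off one of these runs programmatically is a single API call. The sketch below uses boto3 (more on that in the next section); all of the ARNs are placeholders, and it assumes the app and test packages have already been uploaded to the Device Farm project.

```python
# Hedged sketch: scheduling an overnight Device Farm run. ARNs are placeholders.
import boto3

devicefarm = boto3.client("devicefarm", region_name="us-west-2")  # Device Farm lives in us-west-2

run = devicefarm.schedule_run(
    projectArn="arn:aws:devicefarm:us-west-2:123456789012:project:EXAMPLE",
    appArn="arn:aws:devicefarm:us-west-2:123456789012:upload:APP-EXAMPLE",
    devicePoolArn="arn:aws:devicefarm:us-west-2:123456789012:devicepool:POOL-EXAMPLE",
    name="overnight-chirp-regression",
    test={
        "type": "INSTRUMENTATION",  # our Espresso/JUnit tester app
        "testPackageArn": "arn:aws:devicefarm:us-west-2:123456789012:upload:TESTS-EXAMPLE",
    },
)
print(run["run"]["arn"], run["run"]["status"])
```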

Once the stop button is clicked, the app uploads the recorded audio file to an S3 bucket and reports the results.

Analysing the tests with Python and boto

We aggregate the results of each experiment using the boto3 Python module, an excellent and Pythonic way to interact with AWS services. boto3 has a specialised Device Farm client for querying projects and experiment runs, which we've used to build a simple set of command-line tools (sketched below the list) that let us:

  • list the status of past and current experiment runs
  • select the run we want to query data for
  • export the full dataset in CSV, including the number of chirps detected and decoded, plus extra data such as CPU and memory usage, which are additionally provided by the Device Farm API
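
To give a flavour of what those tools wrap, here's a minimal boto3 sketch. The project ARN is a placeholder, pagination is ignored, and the exported columns are simplified compared with our real tooling.

```python
# Hedged sketch of the queries our CLI tools are built on; simplified throughout.
import csv
import boto3

devicefarm = boto3.client("devicefarm", region_name="us-west-2")
PROJECT_ARN = "arn:aws:devicefarm:us-west-2:123456789012:project:EXAMPLE"

# 1. List the status of past and current runs for the project.
runs = devicefarm.list_runs(arn=PROJECT_ARN)["runs"]
for run in runs:
    print(run["name"], run["status"], run["result"])

# 2. Pull the per-device jobs for the most recent run and dump a summary CSV.
jobs = devicefarm.list_jobs(arn=runs[0]["arn"])["jobs"]
with open("run_summary.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["device", "os", "result", "passed", "failed"])
    for job in jobs:
        writer.writerow([
            job["device"]["name"],
            job["device"]["os"],
            job["result"],
            job["counters"]["passed"],
            job["counters"]["failed"],
        ])
```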

Once we've got the data locally in CSV, we can use numpy and matplotlib to analyse and graph the results.
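
For example, assuming the exported CSV carries a decoded count out of the 100 chirps sent per device (the real export has more columns than this), the headline histogram is only a few lines:

```python
# Minimal sketch of the analysis step; the CSV layout here is illustrative.
import numpy as np
import matplotlib.pyplot as plt

data = np.genfromtxt("results.csv", delimiter=",", names=True,
                     dtype=None, encoding="utf-8")

# With 100 chirps sent per device, the decoded count doubles as a percentage.
success_pct = data["decoded"].astype(float)

plt.hist(success_pct, bins=20)
plt.xlabel("Chirps received (%)")
plt.ylabel("Number of devices")
plt.title("Decode success across the device fleet")
plt.savefig("success_rates.png", dpi=150)
```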

Results and discoveries

The graph below shows the distribution of results across a couple of hundred tests. Over 178 devices received 100% of the chirps transmitted, and most of the rest received more than 95%. A few devices received 0%, typically because they don't meet our minimum device requirements or struggle in the insanely loud surroundings of the server room.

We've made a few interesting discoveries over the course of this research, and it's brought a whole array of benefits.

Scale makes a difference.

Running on the Device Farm has let us scale up the magnitude and speed of our optimisation processes, from a dozen devices in the office to ~100 times that number. Even with a 12-hour turnaround per test, it's a far less resource-intensive way to trial improvements.

Testing on a broad spectrum of hardware has reduced our support load.

Because the Device Farm covers older and lower-spec devices, and historical versions of Android, it's flagged up all sorts of compatibility issues that we can solve before they reach our customers. This means less time spent fielding support queries, and more time spent improving our software.

Even major software updates are now worry-free.

Formerly, releasing a new version of the Chirp SDK could be daunting - particularly when we'd re-engineered its audio innards, which is notoriously risky. Now, we run a Device Farm test for each major release. As long as the results don't show any dip in performance, we're confident that it'll be a smooth upgrade for our users.

We can optimise for CPU.

Because the Device Farm results also include average CPU%, we can monitor processor load between releases, and optimise to reduce processor cycles - thereby extending effective battery life. For example, we've recently introduced support for the NEON instruction set, improving Android performance across countless Arm cores.

Even in the deafening server rooms, our protocols are standing up to the noise.

Despite background noise levels above 80dB(A), our fleet of chirping devices are able to hear themselves with clarity, which is a testament to the work that our research team has done on minimising the impact of background noise on transmission robustness.

Obviously, our customers are more likely to be using Chirp to share Wi-Fi credentials in meeting rooms, or exchange contact details in cafés and hotels - but it's good to know that, if Chirp users do want to share data in a noisy data centre, the option is there.

Image: Spectrogram of the device farm, with near-ultrasonic chirps. Time is on the X-axis, and frequency is on the Y-axis. At the top, you can see the Chirp data, encoded as tones within the near-ultrasonic range. In the audible range, you can see the sheer, continuous background noise.