Here at Chirp, we’re passionate about sound: how it works, how it’s used
every day in nature and by people, and the endless applications it enables
through technology.

When asked recently what some of the blockers to IoT adoption within the
Connected Home may be, Chirp’s CTO, James Nesfield, noted the importance of
the ‘things’ entering our shared spaces being sympathetic to the way in which
we humans share those spaces with each other.

In this post, we will look at one of these particularly sympathetic media:
sound, and in particular speech-recognition-based technologies and the
challenges that could prevent mass consumer adoption.

Within the Connected Home space, Amazon, Microsoft, Apple and Google are all
heavily invested in voice-activated assistants underpinned by Natural Language Processing (NLP) algorithms — cue Alexa, Cortana, Siri and Google Assistant respectively.

Voice recognition and NLP, two of the most complex areas of computer science,
are now being widely adopted by industry. Indeed, the advantages to businesses
are clear — increased consumer behaviour data, process automation and of
course cost efficiencies, to name but a few.

In 2015, there were 1.7 million voice-first devices shipped. In 2016, there
were 6.5 million devices shipped. In 2017, VoiceLabs predicts
there will be 24.5 million devices shipped, leading to a total device
footprint of 33 million voice-first devices in circulation. This is certainly
getting closer to mainstream adoption in 2017 but is still an evolution as
opposed to a revolution.

But what are some of the blockers to consumer adoption? Statistics for voice
recognition uptake on mobile devices could provide a few clues as to where
these may lie. Whilst 39% of smartphone owners in the US are believed to use
voice recognition software, usage peaks at 48% amongst users aged 18–24.

The older generations seem less keen, which could be down to the unreliability
of earlier iterations of this technology. In 1995, the lowest error rate in
speech recognition, achieved by IBM, was 43%. Nine years later, IBM had cut
that error rate to 15.2%. Microsoft claim to have achieved a world-record rate
of 6.3% under an industry-standard evaluation. In spite of the great
achievements of recent years (most tech giants can now proudly state an error
rate of under 10%), human-level accuracy, which IBM estimates to be about 4%,
still wins out by a significant margin.
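
For context, the ‘error rate’ quoted in these comparisons is typically the
word error rate (WER): the minimum number of word substitutions, deletions and
insertions needed to turn the system’s transcript into a reference transcript,
divided by the number of words in the reference. Here is a minimal sketch of
the calculation, in our own illustrative Python rather than any vendor’s
implementation:

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + deletions + insertions) / reference word count."""
    ref, hyp = reference.split(), hypothesis.split()
    # d[i][j] = minimum edits to turn the first i reference words
    # into the first j hypothesis words (classic Levenshtein table)
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i                      # delete all i reference words
    for j in range(len(hyp) + 1):
        d[0][j] = j                      # insert all j hypothesis words
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            if ref[i - 1] == hyp[j - 1]:
                d[i][j] = d[i - 1][j - 1]
            else:
                d[i][j] = 1 + min(d[i - 1][j - 1],   # substitution
                                  d[i - 1][j],       # deletion
                                  d[i][j - 1])       # insertion
    return d[len(ref)][len(hyp)] / len(ref)

# One word misheard out of five gives a 20% error rate
print(word_error_rate("book two tickets for tonight",
                      "book two tickets for tomorrow"))  # 0.2
```

On this measure, an error rate of under 10% means fewer than one word in ten
is transcribed incorrectly.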

For those slightly older consumers who struggled through the multiple-choice
cinema telephone booking systems during the many, many years the technology
was developing, it is unsurprising that speech-recognition-based products seem
an unwise investment.

As part of the generation more hesitant to adopt, I can draw on my own needs
and experiences to sympathise with those who are not willing to do so. Aside
from historical reliability issues, there are two much more human factors that
could play into the adoption challenges:

  1. The very human need to be listened to and responded to
  2. How the sound of a human voice can evoke physical and emotional responses in other humans

The need to be listened to

It can be incredibly frustrating for any person when they either cannot be
heard, or are not being listened to or understood. This is why freedom of
speech, debate and democracy are so widely considered fundamental to the
successful existence of a human society.

To pay for software and/or hardware from a provider that sells it on the basis
of its ability to interact with you but then does not consistently do so is
enough to test the patience of the most reasonable consumer. In order to
increase brand advocacy and consumer adoption, it is therefore important for
providers to communicate that these systems are learning machines — the more
you use them, the more accurate they become. Expectations must be set in order
to build consumer trust and confidence in speech recognition technology.

How the sound of a human voice can evoke responses

An article published by American Scientist explores some of the ways in which listeners are affected not only by the words we say, but also by how we say them.

Inflection is of course a factor; however, the article looks more closely at
the impact of ‘pitch’ and, more specifically, how it influences our selection
of societal leaders: those in whom we place the most trust. Whilst the
findings were detailed and varied, the overarching suggestion is that lower
voices generally create perceptions of strength and competence, and that
electoral candidates with lower voices are significantly more likely to win
elections.

The actual sound of the voices used by Alexa, Siri, Cortana and Google Home
is a detail that has not been missed by the big players, and there certainly
don’t seem to be any voice assistants responding in unexpectedly high
pitches.

Also working in the field of data-over-sound, and thus providing what we
believe to be a sympathetic and natural means of communication for IoT, Chirp
understand how important gaining consumer trust is to increasing the use of
these game-changing technologies. We pride ourselves on our reliability in the
most challenging of acoustic environments, and we are incredibly excited to
see speech-recognition technologies continue to become even more reliable,
hopefully one day reaching human-level accuracy.
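
For the curious, here is roughly what data-over-sound looks like in its
simplest form: each chunk of data is mapped to an audible tone, and the tones
are played out in sequence. The sketch below is deliberately simplified and
entirely illustrative; the frequencies, symbol length and encoding scheme are
our own arbitrary choices, not Chirp’s actual protocol.

```python
import numpy as np

SAMPLE_RATE = 44100      # audio samples per second
SYMBOL_DURATION = 0.1    # seconds of audio per symbol (arbitrary choice)
BASE_FREQ = 1760.0       # Hz for symbol 0 (arbitrary choice)
FREQ_STEP = 120.0        # Hz between adjacent symbols (arbitrary choice)

def encode_bytes_as_tones(payload: bytes) -> np.ndarray:
    """Map each 4-bit nibble of the payload to one of 16 pure tones."""
    t = np.linspace(0, SYMBOL_DURATION,
                    int(SAMPLE_RATE * SYMBOL_DURATION), endpoint=False)
    tones = []
    for byte in payload:
        for nibble in (byte >> 4, byte & 0x0F):   # high nibble first
            freq = BASE_FREQ + nibble * FREQ_STEP
            tones.append(np.sin(2 * np.pi * freq * t))
    return np.concatenate(tones)

# Four tones, 0.4 seconds of audio, encoding the two bytes of "hi"
signal = encode_bytes_as_tones(b"hi")
```

A real-world system layers a great deal on top of this: a preamble for
synchronisation, error correction, and resilience to reverberation and
background noise, which is where most of the engineering challenge in real
acoustic environments lies.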


To learn more about Chirp’s data-over-sound solutions, please visit us at
chirp.io or get in touch at contact@chirp.io