Last month, I talked about the advantages that voice command can bring to audio enthusiasts -- and the complications that limit its applicability to music listening. This month I talk in a bit more depth about the prospects for voice-command technology: How much better can voice-command systems get, and might they someday be the primary user interface for audio systems?

Even considering their tendency to mess up when you request specific pieces of music, many of today’s voice-command (VC) systems are technically impressive. Consider the difficulties of their task. First, they have to pick up your voice clearly, and in a quiet, acoustically dead environment, that’s easy. But a smart speaker with VC is almost never used in such an environment; its microphones are also picking up the sounds coming from its own speaker, plus the reflections of those sounds -- and your voice -- from all the objects and surfaces in the room. Then there’s environmental noise, from HVAC systems, refrigerators, the TV in the next room, the neighbor running his leaf blower. Often, the mikes may confront a negative signal/noise ratio in which the level of the user’s voice is actually lower than the level of the noise and interference.
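To make the negative signal/noise ratio concrete, here's a minimal sketch of the dB arithmetic involved. The voice and noise levels below are made-up illustrative values, not measurements from any real smart speaker:

```python
import math

def snr_db(signal_rms, noise_rms):
    """Signal/noise ratio in decibels, from RMS levels in arbitrary units."""
    return 20 * math.log10(signal_rms / noise_rms)

# Hypothetical levels: the user's voice is half the amplitude of the
# combined noise (music, HVAC, the neighbor's leaf blower...).
voice_rms = 0.05
noise_rms = 0.10

print(round(snr_db(voice_rms, noise_rms), 1))  # → -6.0
```

Whenever the voice is quieter than the noise, the ratio inside the logarithm is less than 1 and the result in dB is negative; that's the condition the mikes have to work in.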

Yet the best VC systems can still pick out your voice from all the noise. Even the least-expensive smart speakers, such as Amazon's Echo Dot, have arrays of multiple mikes that detect the direction your voice is coming from, then use a beamforming algorithm to focus their pickup pattern in that direction. Acoustic echo-cancellation algorithms subtract the smart speaker's own program material -- and, ideally, the reflections of that material and of the user's voice -- from the mike signals. Noise-canceling algorithms subtract steady-state noise, such as that from HVAC systems and, in cars, wind and road noise.
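The simplest form of the beamforming described above is delay-and-sum: shift each mike's signal to compensate for the extra time the sound takes to reach it, then average, so that sound from the steered direction adds up while sound from other directions tends to cancel. Here's a minimal sketch in Python with integer-sample delays; the mike count, signals, and delays are invented for illustration and don't reflect any actual smart speaker's design:

```python
def delay_and_sum(mic_signals, delays):
    """Align each mike's samples by its arrival delay (in samples), then average.

    mic_signals: list of equal-length sample lists, one per microphone.
    delays: per-mike delay, in samples, relative to the earliest mike.
    A signal arriving from the steered direction adds coherently;
    sounds from other directions (and uncorrelated noise) partially cancel.
    """
    n = len(mic_signals[0])
    out = []
    for i in range(n):
        acc = 0.0
        for sig, d in zip(mic_signals, delays):
            j = i + d                      # advance each channel by its delay
            acc += sig[j] if j < n else 0.0
        out.append(acc / len(mic_signals))
    return out

# Toy example: the same voice pulse reaches mike 1 one sample later than mike 0.
voice = [0, 1, 0, 0, 0, 0]
mic0 = voice
mic1 = [0] + voice[:-1]                    # delayed copy of the same pulse
beamformed = delay_and_sum([mic0, mic1], delays=[0, 1])
print(beamformed)  # the pulse lines up: [0.0, 1.0, 0.0, 0.0, 0.0, 0.0]
```

Real systems work with fractional delays, adaptive weights, and many more mikes, but the underlying idea -- aligning and summing to favor one direction -- is the same.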

Amazon Echo Dot

These algorithms are good, and are getting better; I’ve heard a couple of demos of new algorithms and mike arrays that could reliably pick up my voice and execute my commands even when I was speaking at or below the level of the music the smart speaker was playing. That’ll help, especially considering that, other than habitually raising their voices a few dB when they utter a command, users seem unwilling to modify their behavior in ways that will deliver more reliable results.

From my experience with about a dozen VC speakers, and in talking with friends who own them, the performance of these devices seems to be in a constant state of flux. Most days mine work well, but once in a while they respond no more reliably to my commands than does my mother’s shih-tzu. I know that Amazon, Apple, and Google frequently release updates to their speakers’ algorithms via the Internet, and constantly upgrade the algorithms running on the server farms tasked with interpreting voice commands. But the tech industry has proven again and again that updates sometimes make software worse; it’s certainly possible that some algorithm updates make VC systems respond less well to certain commands -- at least until the next update. Of course, that may be only my perception. It’s certainly not the companies’ intent -- the more reliable VC systems are, the more money Amazon, Apple, and Google make.

Google recently demonstrated a new feature and a prototype of a new technology that might improve VC systems. Continued Conversation, already in use, keeps the device listening for eight seconds after you wake it by saying "Hey, Google," so that it's less likely to cut you off and therefore misunderstand your command. Google Duplex, still under development, allows Google Assistant, the core technology behind Google's VC systems, to carry on a conversation with a human, something that's possible only if the voice-recognition software is quite sophisticated. (However, Google Duplex was demonstrated with the human on a phone, which eliminates most of the signal/noise concerns noted above.)

There’s no question that VC systems will improve, and at some point will probably not only be able to better understand your commands, but also be able to carry on a conversation to help you find what you want.

Right now, it can be a challenge just to get a smart speaker to play all four movements of Beethoven's Ninth Symphony. But it's not hard to imagine asking some future smart speaker to play Beethoven's Ninth and having it respond by asking, "Want to hear Herbert von Karajan's 1963 recording? A lot of people think that one's the best. Or would you prefer the one Wilhelm Furtwängler conducted at the Bayreuth Festival in 1951? It's really great, but it's mono."

“No, I want something newer, something better-sounding,” you’d say.

“How about the Claudio Abbado version from 2000? The Guardian gave it five stars.”

“OK, that one.”

It’s a tantalizing future, and based on what I’ve seen and heard, it’s likely you’ll be able to do this someday soon, even if people are talking in the next room and your air-conditioning’s running full blast. Whether it will be enough to win over vinyl fans, that I won’t even try to predict.

. . . Brent Butterworth