The end-game of the voice UI (like that of the chat UI) is the command line interface. So, it’s useful to take cues from existing good CLI UX. (For instance, look at the differences between zsh & command.com, and at the trends in the evolution of Bourne-compatible command shells since the 1970s.)

To start with, there are a handful of differences between interfaces that center on how learning curves are managed. All major command line interfaces front-load some learning: the user is expected to learn quite a bit early in the experience in order to understand basic operations, something GUIs are loath to do, and something that therefore limits how nuanced the control users can have over GUIs with standard widgets. Even so, unix shells made big leaps in discoverability early. As of the GNU announcement in 1984, Stallman was already saying that any command shell should be expected to have auto-completion features, and online documentation systems like man & apropos already existed (soon to be joined by the hypertext system info). This was at a time when both the Macintosh & the Amiga were still under development & most users had never seen a GUI, and during a period when users were generally split between Microsoft BASIC & MS-DOS for their command environment.

No virtual assistant I am aware of will list its available functions or the set of invocations it accepts — in other words, there is no help system comparable to man or apropos — and since error reporting is nearly nonexistent, interacting with unfamiliar features is the equivalent of playing a classic text adventure. “I see no lamp here,” says Siri. This kind of interface is fine for a game (where part of the fun is figuring out the systems and thereby beating the snarky & frustrating UI), but if we intend to do real work with virtual assistants, they need to operate a lot more like a good CLI and provide detailed, specific technical information about their functionality. (The Arctek Gemini, a robot released in 1982, had a voice-controlled virtual assistant that could do this; why can’t Apple?)
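To make the comparison concrete, here is a minimal sketch, in Python, of how an assistant could index its own invocations and answer something like “what can you do with rain?” the way apropos searches man page descriptions. The skill names and phrasings are invented for illustration and don’t reflect any real assistant’s API.

```python
# Hypothetical sketch: an apropos-style index over an assistant's invocations.
# Every skill name and phrasing here is invented for illustration.

SKILLS = {
    "timer.set": {
        "summary": "start a countdown timer",
        "invocations": ["set a timer for <duration>"],
    },
    "weather.query": {
        "summary": "report a weather forecast",
        "invocations": ["will it rain <day>", "weather in <place> on <day>"],
    },
    "lights.toggle": {
        "summary": "switch smart lights",
        "invocations": ["turn the <room> lights <on or off>"],
    },
}

def apropos(keyword: str) -> list[str]:
    """List every skill whose summary or invocations mention the keyword,
    the way apropos searches man page descriptions."""
    keyword = keyword.lower()
    hits = []
    for name, skill in SKILLS.items():
        haystack = (skill["summary"] + " " + " ".join(skill["invocations"])).lower()
        if keyword in haystack:
            hits.append(f"{name}: {skill['summary']} ({'; '.join(skill['invocations'])})")
    return hits

if __name__ == "__main__":
    # A spoken "what can you do with rain?" gets back the exact accepted phrasings.
    for line in apropos("rain"):
        print(line)
```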

The second big factor is that, if we want to do real work with these interfaces, we need to reduce the amount of fuzzy matching and move toward a system in which every word is expected to be meaningful. Picking one or two key words out of a sentence produces few false positives for simple tasks, but it cannot scale: programming languages that resemble English (such as SQL) give us a good idea of how pronounceable keywords can be combined in relatively natural ways to form specific, precise, and complex queries. Some front-loading of learning is necessary for truly nuanced queries, but simple ones could still sound like natural speech, and combined with a built-in help system, such a syntax could be made very discoverable.
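As a rough illustration of the kind of syntax I mean (the command forms below are invented, not a proposal for any particular assistant), here is a toy grammar in which every word in an utterance is load-bearing, so a malformed request fails loudly instead of being matched to the nearest keyword:

```python
# Hypothetical sketch of a voice-command grammar in which every word carries
# meaning, in the spirit of SQL's pronounceable keywords. The command forms
# below are invented; the point is that an utterance either parses into a
# precise command or fails, rather than being matched on one keyword.

import re

GRAMMAR = {
    "remind": re.compile(r"^remind me at (?P<time>\w+) to (?P<task>.+)$"),
    "play": re.compile(r"^play (?P<title>.+) by (?P<artist>.+)$"),
}

def parse(utterance: str):
    """Return (command, slots) on an exact match, or None with no guessing."""
    text = utterance.strip().lower()
    for command, pattern in GRAMMAR.items():
        match = pattern.match(text)
        if match:
            return command, match.groupdict()
    return None  # unrecognized: better to report an error than to act on a guess

print(parse("remind me at seven to call ada"))
# ('remind', {'time': 'seven', 'task': 'call ada'})
print(parse("remind me sometime to call ada"))
# None, because every word matters and the request is malformed
```

The front-loaded learning here is just the command forms themselves, which a help system like the one sketched above could enumerate on request.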

Mode indication is a problem in speech interfaces that have complex behaviors. We expect a great deal of stacked context, and even today’s systems, which are capable of doing next to nothing, tend to fail miserably at consistently keeping track of stacked context and at falling in and out of different modes predictably.
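A minimal sketch of what explicit mode tracking could look like, assuming a hypothetical assistant that keeps its stacked context as an inspectable stack and can announce it on demand (the mode names are invented):

```python
# Hypothetical sketch of explicit mode tracking: the dialogue's stacked context
# lives in an inspectable stack that can be announced on demand.

class DialogueContext:
    def __init__(self):
        self.stack = ["idle"]

    def push(self, mode: str) -> None:
        self.stack.append(mode)

    def pop(self) -> None:
        if len(self.stack) > 1:  # never pop the base mode
            self.stack.pop()

    def announce(self) -> str:
        # A mode indicator the assistant could speak or display when asked
        # "where are we?", instead of silently falling out of a mode.
        return "current mode: " + " > ".join(self.stack)

ctx = DialogueContext()
ctx.push("composing a message")
ctx.push("choosing a recipient")
print(ctx.announce())  # current mode: idle > composing a message > choosing a recipient
ctx.pop()
print(ctx.announce())  # current mode: idle > composing a message
```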

Finally, a big problem is that almost everybody who has learned to type can type faster on a real keyboard than they can speak. On-screen keyboards on smartphones, disabilities affecting the hands, and contexts where keyboards are not handy or usable may justify speech-based interfaces even when they are otherwise not ideal; some of these cases are better served by chording keyboards or by improvements to predictive text systems.

If we want speech-controlled virtual assistants to graduate from toy to tool, we can’t allow ourselves to be seduced by flashy but ultimately hollow flat-pack futurism, and must instead admit that this tech will be inappropriate in most situations: voice control will be rude in public and unnecessary in the home, no less distracting while driving than actually typing on the phone, and less effective in the workplace than speaking to a coworker about the same topic.

The killer apps for this tech at the moment seem to be fielding questions from pre-literate children and taking orders while an adult is cooking. A move toward clear and precise syntax could make it possible to ask more complex queries (especially those that non-technical users assume current voice assistants can handle but that they can’t — expecting “will it rain next Tuesday” to work because “will it rain tomorrow” does), and built-in help systems could encourage curious children to learn the kind of thinking that underlies programming, giving them a leg up in school when it comes to mathematics and grammar.