Practically since the dawn of the Web, people have written about the possibility of offering voice access to websites over the phone. One of the biggest challenges is that many websites aren’t even accessible to specialized web browsers, e.g., IBM Home Page Reader, that are designed for people who are blind or who have severely impaired vision. This is partly because many websites are coded such that the visual presentation information is deeply mixed with the content. Separating out the visual presentation information is a major challenge, even for people who have control of their own website.
Most modern blogging and content management tools store the content in a database and use templates to render a presentation. However, the vast majority of the templates merely provide different skins, or themes, for a visual presentation. Perhaps you might get a template or two for printing or for PDA/mobile presentation, but those are really just specialized visual presentations. Speech access requires a dialog-style interaction that is very different in nature. Perhaps even more importantly, our hearing system, a.k.a. our ears, doesn’t offer an analogue of page scanning. While most people can visually scan a web page quickly, it’s quite difficult for a person to pick out an item of interest from a series of short audio snippets played in quick succession. It’s also difficult or impossible to navigate between audio snippets at anywhere near the speed at which you can visually navigate a web page.
Now that I’ve rained on the SpeechWeb parade, let’s take a look at a recent article entitled “Call for a Public-Domain SpeechWeb” by Richard Frost in the November 2005 issue of Communications of the ACM. It’s not that I think the SpeechWeb is a bad idea; I just think it’s really hard to do well at a reasonable cost.
Frost has proposed an architecture where the speech browser (think web browser for your ears) and the speech recognition engine reside locally on an end user device, most likely a mobile device. The user would access a speech application on the browser in the same way that you access web applications that reside and mostly run on a remote web server. That seems fine so far. However, powerful speech recognition engines tend to be very CPU and memory intensive. That’s immediately a problem on small mobile devices.
An added complexity is that speech applications have to deal with much more ambiguity than web applications. With a web application, you can offer the user a wide variety of constrained input elements, such as buttons, combo boxes, and radio buttons. Imagine a web app where each step is a question followed by a free-form text box, and where the app has to be coded to interpret whatever the user types; that’s essentially the situation a speech application is in. One of the tricks in writing a speech application, then, is to design the questions so that people tend to answer them in a limited set of ways.
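To make that concrete, here’s a minimal sketch in Python of how a directed-dialog step with a constrained grammar works. The phrase lists and function names are my own invention for illustration, not anything from Frost’s article; the point is just that a well-worded prompt shrinks the set of utterances the app has to interpret.

```python
# Sketch of one directed-dialog step with a constrained "grammar":
# the prompt is worded so answers fall into a small, known set of phrases.
# All phrase lists here are invented for illustration.

CONFIRM_GRAMMAR = {
    "yes": {"yes", "yeah", "yep", "correct", "that's right"},
    "no": {"no", "nope", "incorrect", "that's wrong"},
}

def interpret(utterance, grammar):
    """Map a recognized utterance to a semantic slot, or None if out of grammar."""
    normalized = utterance.strip().lower()
    for slot, phrases in grammar.items():
        if normalized in phrases:
            return slot
    return None  # out-of-grammar: the app would re-prompt the caller

# A prompt like "Did you say Boston? Say yes or no." keeps answers
# inside the grammar, so interpretation stays trivial.
print(interpret("Yeah", CONFIRM_GRAMMAR))    # yes
print(interpret("Boston", CONFIRM_GRAMMAR))  # None -> re-prompt
```

Contrast that with an open-ended prompt like “How can I help you today?”, where the app would need something closer to natural language understanding to make sense of whatever comes back.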
Frost has mostly avoided natural language interfaces and stuck with constrained recognition grammars. Natural language recognition always seems to be just a year or two away from being viable. Unfortunately, it’s been like that for over five years, and I haven’t seen any improvements or breakthroughs to make me think it is any less than five or even ten years away from being commonly used in applications. At speech conferences, it seems like the same old directory assistance and call steering applications get trotted out as examples of how great the stuff works.
Near the end of the article, Frost points to the promising development of lightweight, yet powerful, speech recognition engines, such as the one behind the X+V browser available with Opera via a collaboration with IBM. Maybe I need to set aside some time to download Opera and check this out.
There’s obviously a long way to go to get to an environment where speech applications can run viably on handheld devices, but I think Frost’s suggestion is worth looking into. Eventually, mobile devices will have the horsepower to perform speech recognition accurately with large vocabularies. One advantage of doing the recognition locally is that you don’t have to worry about network issues, such as noise and lost packets. However, that’s not really a major issue with speech applications in call centers. A bigger issue is dealing with the many different accents and styles of speech. If small speech recognition engines can be tuned to a single speaker in a cost-effective way, that could greatly improve their accuracy.