Automatic speech recognition (ASR) technology addresses a problem that is very difficult to solve, but researchers have made significant progress over the last decade or so. There are still many unsolved problems, but the quality of modern ASR engines has made speech a viable user interface for many different applications.
Speech recognition technologies are commonly used to recognize a speaker’s response to a prompt or to transcribe what a speaker has said. Other uses include telematics, which usually means a speech interface to systems in an automobile, and command and control of other devices, such as desktop computers. The two most common approaches used to recognize a speaker’s response are often called grammar constrained recognition and natural language recognition. When ASR is used to transcribe speech, it is commonly called dictation. The telematics and other command and control systems that I am aware of (outside of science fiction movies) are grammar constrained.
Grammar Constrained Recognition: Works well when the speaker is providing very short responses to specific questions. To create a grammar, you specify the most likely words and phrases a person will say in response to a prompt and you map those words and phrases to a token, or a semantic concept. For example, you might create a “yes-no” grammar that maps yes, yeah, uh-huh, sure, and okay to the token “yes” and no, nope, nuh-uh, and no way dude to the token “no”. If the speaker says something that doesn’t match an entry in your grammar, recognition will fail.
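To make the idea concrete, here is a minimal sketch of the grammar lookup step in Python. It assumes the ASR engine has already matched the audio to one of the phrases in the grammar; the function and grammar names are my own, not any particular vendor’s API.

```python
# A toy "yes-no" grammar: likely spoken phrases mapped to semantic tokens.
YES_NO_GRAMMAR = {
    "yes": "yes", "yeah": "yes", "uh-huh": "yes", "sure": "yes", "okay": "yes",
    "no": "no", "nope": "no", "nuh-uh": "no", "no way dude": "no",
}

def recognize(utterance, grammar):
    """Return the semantic token for an utterance, or None when the
    utterance matches nothing in the grammar (recognition failure)."""
    phrase = utterance.strip().lower()
    return grammar.get(phrase)
```

So `recognize("Yeah", YES_NO_GRAMMAR)` yields the token `"yes"`, while an out-of-grammar response like "maybe" fails with `None`, which is exactly the failure mode described above.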
Natural Language Recognition: Allows the speaker to provide more natural, sentence-length responses to specific questions. Natural language recognition uses statistical models. The general procedure is to create as large a corpus as possible of typical responses, with each response matched to a token or concept. In most approaches, a technique called Wizard of Oz is used. A person (the wizard) listens in real time or via recordings to a very large number of speakers responding naturally to a prompt. The wizard then selects the concept that represents what the user meant. A software program then analyzes the corpus of spoken utterances and their corresponding semantics and creates a statistical model that can be used to map similar sentences to the appropriate concepts for future speakers.
For example, let’s say you want to route phone calls for a customer helpdesk by asking the caller to briefly describe their problem. For the concept “forward my call to the billing department”, you would want to recognize sentences like “I have a problem with my bill”, “I was charged incorrectly”, “How much do I owe this month”, etc. While you could construct a grammar with all the likely keywords (bill, charge, charged, owed, etc.), if the caller speaks in sentences, you may pick up multiple matches. You might also miss sentences that fit the right pattern but lack the pre-ordained keywords. It is difficult to create large, rich grammars that consider the context in which the words are said. In addition, as a grammar gets very large, the chances of having similar-sounding words in the grammar greatly increase.
The obvious advantage of natural language recognition over the grammar constrained approach is that you don’t have to identify the exact words and phrases. A big disadvantage, though, is that for it to work well, the corpus must typically be very large. Creating a large corpus is time-consuming and expensive.
Dictation: Used to transcribe someone’s speech, word for word. Unlike grammar constrained and natural language recognition, dictation does not require semantic understanding. The goal isn’t to understand what the speaker meant by their speech, just to identify the exact words. However, contextual understanding of what is being said can greatly improve the transcription. Many words, at least in English, sound alike. If the dictation system doesn’t understand the context for one of these words, it will not be able to confidently identify the correct spelling.
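How context helps with sound-alike words can be sketched with a toy bigram language model. Assume the acoustic front end has narrowed a sound down to a few candidate spellings (say, "write" vs. "right"); the counts below are made-up illustrative numbers, and real dictation engines use vastly larger language models than a single bigram table.

```python
# Toy bigram counts: how often each (previous word, word) pair was seen
# in some training text. The numbers are invented for illustration.
BIGRAM_COUNTS = {
    ("please", "write"): 40, ("please", "right"): 2,
    ("turn", "right"): 55, ("turn", "write"): 1,
}

def pick_spelling(prev_word, candidates):
    """Choose the homophone spelling most likely to follow prev_word."""
    return max(candidates, key=lambda w: BIGRAM_COUNTS.get((prev_word, w), 0))
```

Given the same sound, the model picks "write" after "please" but "right" after "turn". Without that contextual signal, the engine would have no principled way to choose a spelling.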
The challenge for developers of ASR engines is that the end customer judges them on one criterion: did it understand what I said? That leaves little room for differentiation. Of course, there are areas like multi-language support, tuning tools, integration APIs (the proposed MRCP standard or proprietary interfaces), etc., but recognition quality is most visible. Because of the complex algorithms and language models required to implement a high-quality speech recognition engine, it is difficult both for new companies to enter this market and for existing vendors to maintain the level of investment needed to keep up and move ahead.
Currently, Nuance and ScanSoft dominate the speech recognition market. There are a lot of other small vendors, like Aculab, Loquendo, Lumenvox, etc., but they are essentially niche players. The speech recognition side of ScanSoft is actually composed of SpeechWorks and the products of several former niche players. IBM has also participated in the speech recognition engine market, but their ViaVoice product has gained traction primarily in the desktop command and control (grammar-constrained) and dictation markets.
This is all changing. The big software heavyweights, Microsoft (Speech Server) and IBM (references – main site, voice toolkit preview, eWeek article, older InternetNews article, new InternetNews article on VXML toolkits), are now making substantial investments in speech recognition. IBM claims to have put one hundred speech researchers on the problem of taking ASR beyond the level of human speech recognition by 2010. Bill Gates is also making very large investments in speech recognition research at Microsoft. At SpeechTEK, Gates predicted that by 2011 the quality of ASR will catch up to human speech recognition. IBM and Microsoft are still well behind Nuance and ScanSoft in technology and market share, but they are gaining fast.