Microsoft Beta to Make ‘em Talk

Microsoft has released a public beta of their Speech Server and a new beta of their Speech Application SDK.

Microsoft had previously teamed up, primarily with Intel, to propose a new standard called SALT (Speech Application Language Tags) that competes to some extent with VoiceXML. As of now, I contrast the two as:

SALT: useful for speech-enabling web applications
VoiceXML: useful for web-enabling speech applications

While this is an oversimplification, it reasonably reflects their current usage. Both VoiceXML-based and SALT-based speech applications follow a similar pattern:

  1. Prompt the user
  2. Interpret the user’s response
  3. Act on the response

The action will often be to play/speak a new prompt.

The VoiceXML or SALT prompting tag specifies either a recorded audio file or text that is synthesized by a text-to-speech engine. The user’s response is always interpreted in the context of a grammar, which specifies the allowable responses. Multiple utterances (yeah, uh-huh, sure, yep) will often be treated as the same response (yes). Other VoiceXML and SALT tags (although SALT relies much more on existing HTML tags) act like a decision tree to determine the next action. A series of these prompts and responses is called a dialog.
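To make the prompt, interpret, act cycle concrete, here is a minimal sketch in Python. It is not VoiceXML or SALT markup and does not use any real speech API: the grammar table, the dialog states, and the function names `interpret` and `run_dialog` are all hypothetical, and typed strings stand in for recognized speech.

```python
# Hypothetical sketch of a prompt / interpret / act dialog loop.
# None of these names come from VoiceXML, SALT, or any speech SDK.

# A grammar lists the allowable utterances and the response each one maps to.
YES_NO_GRAMMAR = {
    "yes": "yes", "yeah": "yes", "uh-huh": "yes", "sure": "yes", "yep": "yes",
    "no": "no", "nope": "no", "nah": "no",
}

# Each state has a prompt, an optional grammar, and one transition per response.
DIALOG = {
    "start": {
        "prompt": "Would you like to hear your balance?",
        "grammar": YES_NO_GRAMMAR,
        "next": {"yes": "balance", "no": "goodbye"},
    },
    "balance": {"prompt": "Your balance is ten dollars.", "grammar": None, "next": {}},
    "goodbye": {"prompt": "Goodbye.", "grammar": None, "next": {}},
}


def interpret(utterance, grammar):
    """Return the response the grammar assigns to an utterance, or None if it is out of grammar."""
    return grammar.get(utterance.strip().lower())


def run_dialog(utterances, dialog=DIALOG, state="start"):
    """Run the dialog against a scripted list of utterances and return the prompts played."""
    played = []
    inputs = iter(utterances)
    while True:
        node = dialog[state]
        played.append(node["prompt"])                      # 1. prompt the user
        if node["grammar"] is None:                        # terminal state: nothing to listen for
            return played
        utterance = next(inputs, None)
        if utterance is None:                              # scripted caller has hung up
            return played
        response = interpret(utterance, node["grammar"])   # 2. interpret the response
        if response is None:
            continue                                       # out of grammar: repeat the same prompt
        state = node["next"][response]                     # 3. act by moving to the next state


if __name__ == "__main__":
    for line in run_dialog(["maybe", "Yep"]):
        print(line)
```

The point of the sketch is that "yep" and "uh-huh" never reach the application logic as distinct inputs; the grammar collapses them to a single response, and the transition table plays the role of the decision tree.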

SALT is used primarily to mark up documents that are interpreted in a web browser on a client-side device. SALT consists of a very small set of tags that add multimodality to HTML/XHTML-based web applications.

VoiceXML is primarily used to create speech applications that run on a server and are accessed via a telephone. Although plenty of proprietary speech application languages preceded it, VoiceXML was the first widely accepted and implemented standard, and it greatly simplified the integration of speech applications with existing server-side web applications.

With Speech Server, Microsoft is clearly moving SALT onto VoiceXML’s turf. At the same time, IBM, Motorola, and Opera are proposing XHTML+Voice (a.k.a. X+V) as multimodal extensions to VoiceXML that would enable it to support the kinds of browser-based applications that SALT now supports. Although Microsoft and IBM have been teaming up a lot on web services, they are very much in opposition with respect to the important speech technology standards.

Microsoft has developed their own speech recognition engine, but is partnering with SpeechWorks to supply a text-to-speech engine. In my experience with a previous version of the Microsoft speech recognition engine, I found it to be mediocre. Its only redeeming quality was that it was a free download.

Until now, third-party interest in server-side development with SALT has been extremely tepid in comparison with VoiceXML. I wonder if Microsoft will weave some of their developer magic with this server, or if it will be like one of their many other failed experiments. Of course, they’re big enough that they can survive quite a few failures, as long as they occasionally hit the big home run. I think they will end up being a big player in speech technologies in the future, but I very much doubt that SALT will become a commonly accepted standard in its current form.