10/19/2005: 11:25 pm: Speech, VoiceXML

Next week, October 23-25, I will be in Phoenix for the ScanSoft/Nuance Conversations 2005 conference. ScanSoft formally changed its name to Nuance yesterday. During most of the time the Solutions Showcase area is open, you’ll be able to find me in the Voxify booth. If you’re also going to be at the conference, please stop by. I’ll be there until late Tuesday morning.

7/28/2005: 11:49 pm: Speech, VoiceXML

Many months ago, IBM announced that they were open sourcing and donating their Reusable Dialog Components library to the Jakarta project at the Apache Software Foundation. Finally, version 1.0 of the RDC has been released.

The RDC is a JSP tag library that simplifies the development of server-side code for generating VoiceXML documents for use in voice and multimodal applications. The RDCs originated as a collection of static VoiceXML files. Nearly two years ago, someone in the Speech group at IBM told me that they had decided to switch to using JSPs to dynamically generate the VoiceXML documents. Dynamic generation is the only way to go for complex VoiceXML applications.
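
To give a feel for the idea, here is a minimal sketch of what using an RDC tag in a JSP might look like. The taglib URI and the exact tag and attribute names are illustrative assumptions on my part, not copied from the released 1.0 library.

```jsp
<%-- Hypothetical sketch: one reusable tag stands in for the whole
     VoiceXML form that collects and confirms a date from the caller --%>
<%@ taglib prefix="rdc" uri="http://jakarta.apache.org/taglibs/rdc-1.0" %>
<rdc:date id="departureDate" confirm="true" echo="true" />
```

The point is that the tag expands into the prompts, grammars, and error-handling dialog at request time, so the application developer never writes that boilerplate VoiceXML by hand.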

Although my SoccerPhone and PhoneBlogger projects use static VoiceXML files, a lot of the work in those applications is being done in Python. The dialog portions of those apps are fairly simple. I’ll probably port part or all of one of these apps to use RDC to get a feel for what it’s like to develop with.

7/12/2005: 11:50 pm: Everything Else, Soccer, SoccerPhone, Speech, VoiceXML

It’s been ages since I’ve written about SoccerPhone, or even about anything at all. The last few weeks have just been too hectic. But I did find time this weekend to make a few updates to SoccerPhone, an automated speech application I built a few years ago so I could receive live Major League Soccer scores by phone.

One update of questionable merit was to use audio recordings made by me to replace some of the prompts that are currently being synthesized by a text-to-speech engine. Not only is the use of myself as a voice talent a rather dodgy decision, but also, there is still quite a bit of TTS. I’m not sure the recording effort really improved the quality of the app that much, if at all. It was fun to do the recordings, though.

Speaking of TTS, I switched from a female voice to one of the male voices that BeVocal supports. I’m now using Reed, which is a Nuance Vocalizer voice. Not only does the app sound better due to no longer switching back and forth between genders, but the TTS engine used to synthesize the Reed voice does a much better job of pronouncing names than the TTS engine used to synthesize the Jennifer voice.

I also finally got around to adding Chivas USA and Real Salt Lake to the grammar, so you can now say them at the Team Name prompt. I added FC Dallas to the grammar, but also left in their old name, the Dallas Burn.
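A team-name grammar like this might look something like the following SRGS (GRXML) sketch. The phrasings and semantic tag values here are my own invention, not SoccerPhone’s actual grammar, and the tag syntax varies by platform.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<grammar xmlns="http://www.w3.org/2001/06/grammar" version="1.0"
         root="team" xml:lang="en-US">
  <rule id="team" scope="public">
    <one-of>
      <item>chivas u s a <tag>team='CHV'</tag></item>
      <item>real salt lake <tag>team='RSL'</tag></item>
      <!-- old and new names both resolve to the same team -->
      <item>f c dallas <tag>team='DAL'</tag></item>
      <item>dallas burn <tag>team='DAL'</tag></item>
    </one-of>
  </rule>
</grammar>
```

Because both “FC Dallas” and “Dallas Burn” fill the same semantic value, the backend query doesn’t need to know which name the caller used.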

Another minor update was to add a dummy recognition block just before the backend query. Without this, the confirmation prompt from the previous dialog wasn’t being played until the HTTP fetch completed. Since it sometimes takes more than five seconds to get the response back, the confirmation had sounded sort of odd when it was played so late.
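A VoiceXML interpreter typically flushes its queued prompts when it reaches a state where it is waiting on the caller, so a throwaway recognition state forces the earlier prompt to play before the slow fetch begins. The sketch below is my reconstruction of the general trick, not SoccerPhone’s actual code; the URL and field name are invented.

```xml
<!-- Hypothetical sketch: a dummy field with a zero timeout flushes the
     prompt queue, then the noinput handler performs the slow fetch -->
<field name="dummy">
  <prompt>Getting the latest scores.</prompt>
  <property name="timeout" value="0ms"/>
  <noinput>
    <submit next="http://example.com/scores" namelist="team"/>
  </noinput>
</field>
```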

3/23/2005: 11:49 pm: Speech, VoiceXML

Like most technical specialties, the speech applications industry has a large number of official standards. Fortunately, many of these standards are widely implemented. The implementations aren’t perfectly consistent, of course, but they’re close enough that at Voxify we’ve been able to get our speech applications platform to run on different VoiceXML browsers and ASR and TTS engines with relative ease.

Deborah Dahl recently wrote an article on speech standards and specifications for Speech Technology magazine that does an excellent job of organizing and describing the relevant standards. Deborah has been very active in the speech industry, especially in the multimodal interaction area. She’s currently the chair of the W3C’s Multimodal Interaction Working Group, in which my friend and former colleague, Wu Chou, is a key participant.

12/31/2004: 11:14 pm: Speech, VoiceXML

VoiceXML Review used to regularly publish interesting articles and news updates related to VoiceXML, but had fallen somewhat silent over the last six months or so. A lot has been going on with respect to VXML 2.1 and 3.0, so I’ve been hoping to see some more activity on the site. It’s good to see that someone from the VoiceXML Forum is updating it again.

One of the articles at first looked like a totally unrelated article on internship programs at IBM. However, the two interns who wrote about their experiences both worked on speech technologies with members of the Pervasive Computing group at IBM.

The team in Austin used the Opera browser and XHTML+Voice to create a multimodal application for finding info on movies playing at local cinemas. I remain quite skeptical about the value of multimodal applications on PDAs (at least with the PDA technologies of the next 1-2 years), but it’s still good to see people making progress with standards-based approaches like X+V for building multimodal apps.

11/28/2004: 1:45 pm: Software, Speech, VoiceXML

My main experience with XML-based programming languages has been with VoiceXML. One nice advantage of an XML-based language is that the syntax checker is essentially free, assuming the language provides a DTD, or preferably a Schema. Of course, most language compilers and interpreters also come with a syntax checker, so the DTD/Schema advantage is primarily a time saver for the language creator.

The Ant build program also uses an XML-based programming language. I’ve written quite a few Ant build files by hand, and the experience has brought little joy. However, writing old school make files was even more frustrating. The many problems with the use of XML with Ant have been noted by Bruce Eckel, Martin Fowler, and Patrick Logan. Patrick points out YAML as a possible compromise.

VoiceXML has the further complication that good VoiceXML applications tend to be a lot more dynamic than build files. Since the goal of a VoiceXML application is to create a natural sounding, automated voice dialog, you generally want to customize the dialog for each caller and to slightly vary the spoken dialog on each call so as to avoid a completely robotic, scripted effect. While you can accomplish this to some degree with static VoiceXML files, it’s far easier if you dynamically generate some or all of the VoiceXML at runtime. With an Ant build file, I think it would be rare that you would want to dynamically generate the build script every time you run it.
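
As a minimal sketch of that kind of runtime variation, a JSP can pick a different greeting on each request before emitting the VoiceXML. The greeting strings are just placeholders I made up.

```jsp
<%-- Hypothetical JSP fragment: vary the opening prompt on each call --%>
<%
    String[] greetings = {
        "Welcome back.",
        "Hi again. Which team would you like scores for?",
        "Good to hear from you. Which team?"
    };
    String greeting =
        greetings[new java.util.Random().nextInt(greetings.length)];
%>
<vxml version="2.0" xmlns="http://www.w3.org/2001/vxml">
  <form id="main">
    <block><prompt><%= greeting %></prompt></block>
  </form>
</vxml>
```

A static VoiceXML file would have to enumerate every variant up front; generating the markup lets the variation live in ordinary code.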

Manual creation and editing of Ant files and VXML files is frustrating and limiting. The resulting files seem overly verbose. Because they are so verbose, it’s difficult to keep large sections of the code in your head at one time.

Because manual VoiceXML coding is so difficult (hand-coding SALT-based applications is even worse), a lot of vendors have developed SDKs or graphical IDEs that mostly or completely hide the VoiceXML code from the developer. The good news is that this opens up speech application development to a lot more people. For the advanced developer, it also makes it easier to create large, dynamic applications. Of course, there are always downsides.

In my opinion, the biggest disadvantage is that most of these SDKs and IDEs throw portability out the window. While VXML code is relatively portable between different VXML browser implementations, the IDEs typically don’t generate VXML directly. Instead, they generate an intermediary form, usually consisting of POJOs (Plain Old Java Objects), servlets, JSPs, and/or XML files with custom tags. The intermediary form must then be processed by a server runtime layer that sits on top of a web or application server.

Nonetheless, there are several very good reasons that the SDK and IDE vendors have gone down this path:

  • Markup languages like VoiceXML and SALT are too low level for large, sophisticated applications
  • Many developers, especially those in IT groups, prefer drag-and-drop graphical IDEs
  • Mapping VoiceXML tags directly to graphical forms would provide only minimal abstraction for a new developer
  • Abstracting away from VXML means a tool could theoretically use a single dialog design to dynamically generate applications in multiple markup languages (SALT, XHTML, WML, XHTML+Voice, or a chatbot markup language) for a single multimodal interaction

The VoiceXML 3 standard is targeted to close some of this gap, but I haven’t been involved enough in the process so far to comment on how successful the Voice Browser Working Group will be.

The closest thing I’ve seen to this situation in the Java world is BEA WebLogic Designer. WebLogic Designer provides a significantly higher-level abstraction over not only web services, but also web application and database integration development. The goal was to bring the good parts of Visual Basic to the corporate Java developer. The downside is that WebLogic Designer generates code that will run only with WebLogic runtime components, so the ease-of-use advantage costs you portability. Nonetheless, WebLogic Designer can provide a huge productivity advantage to many developers.

10/22/2004: 11:12 pm: Speech, VoiceXML

A great article on Voxify appeared in the East Bay Business Times this week. In addition to providing some interesting background on the company founders, page three of the article provides a good description of what we do.

In short, we design and build speech recognition applications that enable automated customer service solutions for clients in the travel, hospitality, and retail markets. On the technical side, we’ve built a really powerful speech applications platform on top of VoiceXML browsers and J2EE servers that lets us quickly build highly conversational speech applications. The platform we’ve built models the behaviors of the best call center agents. The article does a pretty good job of capturing the areas where I think we provide a lot more value than our competitors.

So, if for some reason you were wondering what I’m doing these days to pay my website hosting fees, check it out.

10/6/2004: 8:40 am: Speech, VoiceXML

A beta version of Sphinx-4, an open source speech recognition engine implemented in Java, was recently released. Sphinx development is centered at Carnegie Mellon University, with major contributions from employees at Sun, Mitsubishi Electric Research Labs, and HP, and smaller contributions from individuals at UC Santa Cruz and MIT.

Ideas that are unlikely to come to fruition, but that I like to imagine I’ll have time to implement anyway:

  • Non-real time dictation engine for PhoneBlogger
  • Along with OpenVXI, Festival, and CCXML4J, a complete open source VoiceXML and CCXML server
  • Speech recognition engine on a Treo 600

9/14/2004: 12:39 am: Speech, VoiceXML

When I first heard the announcement (full story from NY Times requires registration, excerpt from C|Net doesn’t), I was hoping that IBM was open sourcing their ASR and TTS engines. But, it turned out to be two other parts of their voice portfolio.

IBM is donating source code for their Reusable Dialog Components to the Apache Software Foundation. The RDCs were developed as chunks of static VoiceXML code that perform common dialog functions, such as collecting address information or dates. At the spring SpeechTek conference, someone from IBM told me they were porting the RDCs to JSPs that generate VoiceXML. If nothing else, I hope the RDCs will provide good code samples to further popularize the development of VoiceXML applications.

The Call Flow Builder

IBM is also donating some or all of their Voice Toolkit to the Eclipse organization. The Voice Toolkit was reimplemented as plug-ins for Eclipse about two years ago. It’s a pretty nice application, although the last time I checked out the preview version on the IBM alphaWorks site, it had a lot of complicated dependencies. Also, it was supported only on Windows. The official 5.1 release is now available. Unfortunately, it still runs only on Windows.

The Voice Toolkit Call Flow Builder is a fairly simple GUI for creating the basic dialog of a call flow as a directed graph (i.e., boxes and arrows). Once you get the call flow mostly scoped out, you can generate markup from your diagrams. The native XML dialect it generates can be automatically translated into VoiceXML. I don’t know if this feature is in 5.1, but I think they were also planning to support generation of JSPs, HTML, or whatever other markup language you wanted to generate. All you need to provide is the appropriate XSLT script to do the transformation.
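
A transformation like that might be sketched as the XSLT below. The source element names (`callflow`, `say`) are invented stand-ins; the real Voice Toolkit dialect isn’t documented here.

```xml
<?xml version="1.0"?>
<!-- Hypothetical sketch: map a made-up call flow dialect to VoiceXML -->
<xsl:stylesheet version="1.0"
                xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
                xmlns="http://www.w3.org/2001/vxml">
  <xsl:template match="/callflow">
    <vxml version="2.0">
      <form>
        <xsl:apply-templates/>
      </form>
    </vxml>
  </xsl:template>
  <!-- each "say" node becomes a block that speaks its text -->
  <xsl:template match="say">
    <block><prompt><xsl:value-of select="."/></prompt></block>
  </xsl:template>
</xsl:stylesheet>
```

Swapping in a different stylesheet would retarget the same diagram at HTML, JSP, or another markup language.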

The grammar development tools in the Toolkit Preview were nothing to get excited about, but I’ve yet to see good grammar development tools from anyone, and yes, I’ve looked at a lot of tools. The pronunciation builder was pretty nice, though.

Once you generate the markup, you move into a more traditional programming environment where you edit markup and code. The RDCs mentioned above can be helpful to fill out the rest of your app, though I expect IBM will also make them available from the GUI.

The worst thing I could say about the Voice Toolkit was that when I tried it about six months ago, the documentation was pretty bad. There were huge chunks of missing information and way too many typos.

The NY Times article ends with a comical quote from a director of marketing at Microsoft who claims that IBM is following Microsoft. Hmm, I didn’t know that Microsoft had open sourced their speech development tools under an OSI-compliant license. I think that’s news to everyone, including the development team at Microsoft.

Microsoft is clearly the follower in speech platforms and applications development. They’re still pretty far behind, even though they are making good progress. They shouldn’t be ashamed to be a follower in this space. They picked a very good time for entry. It’s just hard to take them seriously when their representatives make laughable claims.

9/8/2004: 11:57 pm: Speech, VoiceXML

And so does the W3C. Speech Synthesis Markup Language (SSML) Version 1.0 is now a W3C Recommendation. SSML is used with both VoiceXML and SALT to specify how text should be synthesized into speech. Congratulations to the co-editors from Nuance, Intel, and ScanSoft who ushered it through the process.
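
For anyone who hasn’t seen SSML, here is a small illustrative fragment (the text is made up). The `say-as` element tells the engine how to read ambiguous text, and `break` and `emphasis` shape the delivery.

```xml
<?xml version="1.0"?>
<speak version="1.0" xmlns="http://www.w3.org/2001/10/synthesis"
       xml:lang="en-US">
  The final score on
  <say-as interpret-as="date" format="mdy">9/8/2004</say-as>
  was <break time="300ms"/>
  <emphasis>two</emphasis> to <emphasis>one</emphasis>.
</speak>
```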

