VoiceXML


4/9/2008: 8:41 am: Robert: Speech, VoiceXML

An article in Speech Technology magazine reports that in the most recent update to Gartner’s Magic Quadrant for IVRs, Microsoft Speech Server and Nuance Voice Portal got dropped. The disappearance of NVP is no surprise, since Nuance announced at their Conversations conference over two years ago that they would no longer enhance it.

Microsoft moved Speech Server into Office Communications Server last year, and really doesn’t seem to be promoting it as a standalone product, even though it can still be installed separately. Although I see virtually no push by Microsoft, or even their partners, to sell Speech Server into large contact centers, I’m still a little surprised Gartner dropped them.

We’ve been doing some testing on Speech Server at Voxify, and overall it works quite well. Getting it to work with our Asterisk-based PBX was a nightmare, but otherwise the install went pretty smoothly. Recognition performance using Microsoft’s ASR is generally similar to Nuance OSR, though recognition is very slow when doing n-best recognition for even medium-sized values of n. Microsoft’s fairly faithful compliance with the VoiceXML standard (we find issues with every VXML browser vendor we have worked with) was another very pleasant surprise. The best surprise was the licensing cost. It is amazingly inexpensive considering the quantity and quality of features it includes.

One of my biggest concerns about Speech Server is that activity in discussion forums and blogs regarding the product has dwindled dramatically (at least in the places I have looked) over the last year. Without Microsoft pushing Speech Server, I think there will need to be pretty strong community support for it to gain a foothold. It would be really too bad if it ends up getting buried in the unified communication product line at Microsoft.

The rest of the report contained no surprises. Genesys is listed as the clear leader, and that is definitely what I have seen in the market. Acquiring VoiceGenie was a brilliant move on their part, and they have very good offerings for both enterprises and large VXML hosting providers. Nonetheless, there continue to be interesting developments at Nortel, Avaya and Voxeo, among others.

9/23/2007: 12:06 am: Robert: Speech, VoiceXML

AVIOS is having their second speech app building contest for student teams. “Cash and/or equipment prizes valued at over $1000 will be awarded to teams of student programmers who design and create applications judged by industry experts to be the most robust, useful, creative, innovative, and user friendly.” The finished application must be submitted by January 31, 2008.

Although the application must at least support speech input or output, students are encouraged to develop multi-modal apps. Many development environments are approved for use in building and running your application. If you are looking to get some long-term useful experience out of this exercise, I strongly recommend that you build a VoiceXML app and host it on one of the platforms like BeVocal, TellMe, etc. While working with a downloadable environment like Voxeo’s Prophecy would be highly educational, you’ll end up spending a lot of time dealing with stuff that the hosted platforms take care of for you. Even better would be to use a tool that dynamically generates VoiceXML, but there aren’t a lot of free tools available that do that.

6/14/2007: 9:16 am: Robert: Software, Speech, VoiceXML

Patrick Barnard wrote a very nice post about Voxify on his Making Contact TMCnet blog after speaking with the heads of our sales and marketing groups. Patrick’s post aptly summarizes the nature of the hosted speech applications that Voxify provides.

For the sake of credibility regarding real-world speech application implementations, it’s important to note that we don’t claim we can implement every imaginable integrated application in less than eight weeks. Patrick doesn’t say that either, but I can see how some people might jump to that conclusion. Some applications require the development of very complex call flows and extremely technically challenging integrations to back-end systems. I think we still deploy these complex kinds of applications surprisingly quickly, though.

The telephony integration for a hosted speech application can add time, too, if a lot of changes need to be made to existing circuits or if new circuits need to be provisioned. The telecom companies have gotten a lot better about this, but it can still take them 1-2 weeks to provision a new line. Fortunately, we’re able to catch most of these situations up front and get all of the telecom work queued up early.

But, Voxify absolutely can design, develop and deploy integrated speech applications in less than eight weeks. We’ve done that for several clients, and we’ve made some changes to our platform that will enable us to deliver that fast much more often in the future.

Part of the reason we can develop speech applications so quickly is that we have the experience from developing a lot of applications. In addition, we took the time, either during those deployments or soon after, to capture that experience in our core platform or in reusable libraries. We now have a very powerful platform and a strong set of reusable horizontal (e.g., geographical location, billing and shipping address, credit card information, etc.) and vertical (e.g., flight info, hotel reservation, prescription refill, order status, etc.) libraries. We also have a very efficient set of deployment processes that we have honed during all of our previous deployments. And, oh yeah, there are a bunch of smart people in our office who continually amaze me.

4/11/2007: 11:56 pm: Robert: Soccer, SoccerPhone, Software, Speech, VoiceXML

SoccerPhone is a speech application I wrote about five years ago so I could get live updates on Major League Soccer scores whenever I was away from an Internet connection. I wrote the application in VoiceXML, JavaScript, and Python. Since SoccerPhone gathers the live data by scraping information from the HTML scoreboard page on the MLS website, I often have to update my code when the MLS website changes each year. Fortunately, this year’s change was fairly minor.
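A minimal sketch of what the scraping step could look like in Python. The table layout, the `team` and `score` class names, and the scores themselves are hypothetical stand-ins; the real MLS scoreboard markup is not shown here and changes from season to season, which is exactly why this kind of code needs yearly maintenance.

```python
from html.parser import HTMLParser

# Hypothetical scoreboard markup and made-up scores, for illustration only.
SAMPLE_HTML = """
<table id="scoreboard">
  <tr><td class="team">Houston Dynamo</td><td class="score">2</td>
      <td class="team">LA Galaxy</td><td class="score">1</td></tr>
  <tr><td class="team">D.C. United</td><td class="score">0</td>
      <td class="team">Chicago Fire</td><td class="score">0</td></tr>
</table>
"""


class ScoreboardParser(HTMLParser):
    """Collects (home, home_score, away, away_score) tuples, one per table row."""

    def __init__(self):
        super().__init__()
        self.matches = []
        self._cells = []
        self._capture = False

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._cells = []  # start a fresh row
        elif tag == "td" and dict(attrs).get("class") in ("team", "score"):
            self._capture = True  # only keep text from the cells we care about

    def handle_data(self, data):
        if self._capture:
            self._cells.append(data.strip())

    def handle_endtag(self, tag):
        if tag == "td":
            self._capture = False
        elif tag == "tr" and len(self._cells) == 4:
            home, home_score, away, away_score = self._cells
            self.matches.append((home, int(home_score), away, int(away_score)))


def parse_scoreboard(html):
    """Return the list of matches found in the scoreboard HTML."""
    parser = ScoreboardParser()
    parser.feed(html)
    return parser.matches


if __name__ == "__main__":
    for match in parse_scoreboard(SAMPLE_HTML):
        print(match)
```

When the site layout changes, only the tag and class checks in `handle_starttag` should need to be touched, which keeps the yearly fix small.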

  1. Call 1-877-33-VOCAL (877-338-6225)
  2. When asked for your PIN, enter or say 5425 (5425 = KICK)
  3. When asked for your userid, enter or say 6575425 (6575425 = MLSKICK)
  4. After you hear me say “Welcome to SoccerPhone”, you can say an MLS
    team name, such as “Houston Dynamo”, or say “all teams”.
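The PIN and userid in the steps above are keypad spellings of KICK and MLSKICK. As a small illustration (not part of SoccerPhone itself), the standard telephone keypad mapping can be written in a few lines of Python; the function name `to_digits` is made up for this sketch.

```python
# Standard telephone keypad letter groups.
KEYPAD = {
    "2": "ABC", "3": "DEF", "4": "GHI", "5": "JKL",
    "6": "MNO", "7": "PQRS", "8": "TUV", "9": "WXYZ",
}
LETTER_TO_DIGIT = {
    letter: digit for digit, letters in KEYPAD.items() for letter in letters
}


def to_digits(text):
    """Translate letters to keypad digits; digits pass through, punctuation is dropped."""
    out = []
    for ch in text.upper():
        if ch in "0123456789":
            out.append(ch)
        elif ch in LETTER_TO_DIGIT:
            out.append(LETTER_TO_DIGIT[ch])
    return "".join(out)


if __name__ == "__main__":
    print(to_digits("KICK"))            # 5425
    print(to_digits("MLSKICK"))         # 6575425
    print(to_digits("1-877-33-VOCAL"))  # 18773386225
```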

Since SoccerPhone lets you request this week’s match results for a team by saying the team’s name, I also have to update the GRXML grammars when new teams are added. While I could dynamically generate grammars from the team names that the app extracts from the MLS website, it’s not that big of a deal to manually maintain the grammars. Also, manually coding them allows me to tune the grammars for better recognition rates.
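For comparison, here is a rough sketch of the dynamic alternative: generating a simple SRGS XML (GRXML) grammar from a list of team names. The rule id `team` and the function name are assumptions for illustration, and the output has none of the hand-tuning (alternate phrasings, semantic tags) that a manually maintained grammar would carry.

```python
import xml.etree.ElementTree as ET

SRGS_NS = "http://www.w3.org/2001/06/grammar"
XML_NS = "http://www.w3.org/XML/1998/namespace"


def build_team_grammar(team_names, lang="en-US"):
    """Return an SRGS XML grammar string with one <item> per team name."""
    # Serialize SRGS elements in the default namespace (no prefix).
    ET.register_namespace("", SRGS_NS)
    grammar = ET.Element(f"{{{SRGS_NS}}}grammar", {
        "version": "1.0",
        "root": "team",   # hypothetical rule id
        "mode": "voice",
        f"{{{XML_NS}}}lang": lang,
    })
    rule = ET.SubElement(
        grammar, f"{{{SRGS_NS}}}rule", {"id": "team", "scope": "public"}
    )
    one_of = ET.SubElement(rule, f"{{{SRGS_NS}}}one-of")
    # De-duplicate and sort so the generated file is stable between runs.
    for name in sorted(set(team_names)):
        item = ET.SubElement(one_of, f"{{{SRGS_NS}}}item")
        item.text = name
    return ET.tostring(grammar, encoding="unicode")


if __name__ == "__main__":
    print(build_team_grammar(["Houston Dynamo", "Chicago Fire", "Houston Dynamo"]))
```

A generated grammar like this only accepts the full official names, so callers who say a shortened name would be rejected. That gap is the recognition tuning that hand-written grammars make easy.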

If you’ve called the app before, you’ll be disappointed to hear that I am using the same lame voice talent, i.e., me. On the good side, though, I re-recorded a bunch of the prompts using Audacity. I also eliminated a little more of the TTS by adding additional recordings. I really should take advantage of the great recording studio we have at Voxify, but then I would feel obligated to use a real voice talent.

I haven’t updated the code at the SoccerPhone SourceForge project site, yet. But I will get to that soon. I got sidetracked by looking into converting the CVS repository to Subversion.

4/13/2006: 11:42 pm: Robert: Everything Else, Speech, VoiceXML

Microsoft recently announced that Speech Server 2007 will provide support for speech applications written in VoiceXML. In order to penetrate the enterprise market for speech applications, Microsoft really had no choice. SALT-based applications remain as rare as hen’s teeth in the enterprise. Ok, maybe not that rare, but certainly the number pales in comparison to the number of VoiceXML-based applications. The press release says “More than 40,000 telephony ports of capacity have been licensed, and Speech Server customers are successfully answering more than 10 million calls per month on the platform”. I know of individual companies that by themselves handle more than that many calls per month with VoiceXML applications.

Also, it’s become pretty clear that VoiceXML is winning the mindshare of the standards committees. Of course, VoiceXML had a big advantage by preceding SALT by several years. Even in the multimodal space, SALT is very unlikely to become the anointed standard. Some of SALT will likely live on in VoiceXML 3.0 and beyond. That’s a very good thing for all of us, though, as I believe VoiceXML 3.0 and XHTML+V are going to be much better standards due to some of the good ideas that originated from the work on SALT.

I’m curious if part of the reason for Microsoft picking up some of the technology assets and a few people from failed start-up Unveil was to gain some additional VoiceXML experience in advance of this plan. After all, the headline of the press release I linked to above was “Microsoft Unveils Road Map for Speech Server 2007”. Then again, maybe not.

2/21/2006: 12:43 am: Robert: Speech, VoiceXML

In the afternoon, I attended two sessions on multimodal applications.

Dave Raggett from the W3C started the session with a talk on Speech Enabling Web Browsers. He has been working on some prototype applications that combine AJAX with speech. He uses a local HTTP server to handle audio on the device (which, for now, is a laptop). A remote HTTP server provides speech services. He uses AJAX, or more specifically, the XMLHTTP object and JavaScript to interact with the remote server. Audio is sent in the request and the interpretation is returned as EMMA markup (SRGS + SISR). He presented a sample application for ordering a pizza that even handled compound utterances. For a prototype, it worked reasonably well. The application was implemented in XHTML, CSS, and JavaScript. He also used AJAX for logging, which allowed him to maintain a synchronized log on the server.
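To give a feel for what comes back from the speech server in that kind of setup, here is a small Python sketch that reads an EMMA result. The `emma:` namespace and the `emma:confidence` and `emma:tokens` attributes come from the W3C EMMA 1.0 specification; the `<order>` payload, the sample values, and the function name are invented for illustration and are not taken from the talk.

```python
import xml.etree.ElementTree as ET

EMMA_NS = "http://www.w3.org/2003/04/emma"

# Hypothetical EMMA result for a compound pizza order.
SAMPLE_EMMA = """
<emma:emma version="1.0" xmlns:emma="http://www.w3.org/2003/04/emma">
  <emma:interpretation id="int1" emma:confidence="0.82"
      emma:tokens="two large pepperoni pizzas and a small cheese pizza">
    <order>
      <pizza><quantity>2</quantity><size>large</size><topping>pepperoni</topping></pizza>
      <pizza><quantity>1</quantity><size>small</size><topping>cheese</topping></pizza>
    </order>
  </emma:interpretation>
</emma:emma>
"""


def read_interpretation(emma_xml):
    """Extract tokens, confidence, and the pizza slots from the first interpretation."""
    root = ET.fromstring(emma_xml.strip())
    interp = root.find(f"{{{EMMA_NS}}}interpretation")
    if interp is None:
        return None
    # Each <pizza> child becomes a dict of slot name -> value.
    pizzas = [
        {child.tag: child.text for child in pizza}
        for pizza in interp.iter("pizza")
    ]
    return {
        "tokens": interp.get(f"{{{EMMA_NS}}}tokens"),
        "confidence": float(interp.get(f"{{{EMMA_NS}}}confidence", "0")),
        "pizzas": pizzas,
    }


if __name__ == "__main__":
    print(read_interpretation(SAMPLE_EMMA))
```

In the browser prototype this parsing would happen in JavaScript on the AJAX response, but the shape of the data is the same.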

Mark Randolph talked about how Motorola was trying to evolve push-to-talk to “push-to-ask”, i.e., making speech queries to an online database. They are working with SandCherry to commercialize speech apps that use a radio network rather than a telephone network. One nice thing about the push-to-talk model is that it creates a clear endpoint for turn-taking in a speech app. They’ve introduced the +V Framework, which provides APIs to interface with local codecs. They are also doing distributed speech recognition by putting the front end of the SR on the device. An ANR codec is used for audio to be played back on the device. DMSP, which uses binary XML, is used to sync the local app with the remote app. Cepstral analysis and some noise reduction are done up front. Endpoint markers are added to aid with transcription. Noise reduction is done only on the sound captured during the push-to-talk phase, partly due to battery usage issues.

Luisa Cordano from Loquendo kicked off the second session. She talked about work they are doing with Airbus. SNOW is a project to provide multimodal access to maintenance info for workers. She played a video that demonstrated a worker being able to capture video with a head-mounted camera, call up manuals via speech, and display information on a PDA. The speech and PDA media channels were synchronized.

Someone from Nortel talked about the benefits of standards and gave a high level overview of the kinds of speech and multimodal apps that companies have been building for many years.

Jim Barnett talked in more depth about X+V and SALT plus XHTML. He explained how the X+V tag provides explicit binding of slots. There was some good info in his talk, but not enough of it. This happened to a lot of the speakers at the end of sessions, as their time slots got compressed by earlier speakers.

Finally, Dave Burke at VoxPilot gave a glossy, and yet very informative and technical, presentation on video interactive services. He talked about what they are doing with 3G mobile video (H.324M, 64 kbps per channel) and video over IP. Video is H.264 or MPEG-4 and audio is AMR or G.723. RTP is used for the video stream for video over IP. They use the VXML tag for video. It works, but there has been some discussion on the voice browser working group mailing list about adding tags for other media, such as video. He also talked about video streaming with Skype and Sony IVE (Instant Video Everywhere).

2/4/2006: 1:53 pm: Robert: Speech, VoiceXML

Tuesday morning I attended sessions on core speech technology and dialog design.

Dr. Randy Ford from Sonum Technologies talked about using strong Natural Language Processing (NLP) to improve speech recognition. He claimed that by using N-gram substitution (e.g., replace the likely misrec “think you” with “thank you”), phonetic tumbling, or a hybrid of the two, you can reasonably achieve a 20% improvement on the base recognition. I wish I could provide more detail on phonetic tumbling, but he had to rush through the end of his presentation and I was too busy taking notes on what he had previously said.
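The N-gram substitution idea can be illustrated with a toy Python sketch. This is only a blind table lookup under my own assumptions (the table contents and function name are made up); the talk did not describe Sonum's actual implementation, which presumably weighs context before substituting.

```python
# Hypothetical correction table: recognized bigram -> likely intended bigram.
SUBSTITUTIONS = {
    ("think", "you"): ("thank", "you"),
}


def apply_ngram_substitutions(text, table=SUBSTITUTIONS, n=2):
    """Scan left to right and replace any n-gram found in the table."""
    words = text.lower().split()
    out = []
    i = 0
    while i < len(words):
        gram = tuple(words[i:i + n])
        if gram in table:
            out.extend(table[gram])
            i += n  # skip past the words that were just replaced
        else:
            out.append(words[i])
            i += 1
    return " ".join(out)


if __name__ == "__main__":
    print(apply_ngram_substitutions("think you very much"))
```

A lookup this naive would also rewrite a legitimate “I think you are right”, so a usable version needs some measure of how likely each reading is in context.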

Yoon Kim from Novauris discussed using phonetic techniques to improve recognition for large lists. By taking into account syllable structure and stress, they feel they can significantly improve recognition performance over that of conventional SR engines for items in a large corpus. With respect to lexical stress, they are analysing the stress placed on consonants coming immediately before or after vowels, as well as many aspects of how the vowels themselves are pronounced. Recognition accuracy is locally decreased for unstressed vowels, so they have found it helps a lot in this case to also look closely at the stress placed on nearby consonants. Language-specific syllable structure affects how important this differentiation can be. While the English language generally has more complicated syllable structure and stress distinctions than Korean, Korean can have much greater distinction between consonants preceding or following vowels.

Vlad Sejnoha from Nuance then gave a talk on current work at Nuance on speech technology. The speech was very similar to ones given at last fall’s Nuance Conversations conference. One of these days, I’ll post my notes from that conference.

In the Dialog Design panel, Sondra Ahlén spoke about Spanish voice talents, including Spanish language TTS voices. She provided a lot of interesting trivia on Spanish speakers (the countries with the most Spanish speakers are, in order, Mexico, Colombia, the US, and Spain; 12% of US residents speak Spanish and half of those don’t speak English; Colombian Spanish is considered the standard dialect for Latin America; Mexican Spanish is generally considered the standard dialect for the US). She also gave some examples of the differences between the dialects. She recommended that you always use a native speaker and that you match the dialect to the greatest common population in the expected audience. She then played some sample recordings from the most popular Spanish TTS voices, pointing out that while they are not as well tuned as English TTS voices, they are still quite usable.

My friend Bob Cooper from Avaya then spoke about an older product that was developed when he was at Conita, which was later acquired by Avaya. He spoke about dialog design considerations for power users who use an application multiple times per day. He recommended using auditory feedback and lots of audio cues, optimizing for the common path, and replacing separator words with distinctive sounds.

A grad student working with Intervoice presented his work on automatic tuning of context free grammars. He used the SONIC large vocabulary (~75k words) SR engine from CU Boulder to transcribe previously recorded utterances. He then re-ranked n-best lists using phonetic, local, syntactic, and semantic weights. Or, at least, so say my hastily scribbled notes. He then employed Princeton’s WordNet to provide automatic categorization via synonyms. Lexical chains were also used to classify the transcription. The most common utterances were automatically added to the grammar, and common sub-sequences were favored over longer sequences. He claimed that for one test with an initially untuned application, his automatic grammar tuner performed within 2% of a manual tuning performed by someone at Intervoice.
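As a rough illustration of two of those steps, the Python sketch below re-ranks an n-best list with a weighted sum of per-feature scores and then picks frequent transcriptions that the grammar does not yet cover. Everything specific here (the feature names, weights, scores, thresholds, and function names) is a made-up assumption; the talk did not give the actual formulas.

```python
from collections import Counter


def rerank(nbest, weights):
    """Sort hypotheses by a weighted sum of their feature scores, best first."""
    def combined(hyp):
        return sum(w * hyp["scores"].get(name, 0.0) for name, w in weights.items())
    return sorted(nbest, key=combined, reverse=True)


def grammar_candidates(transcripts, grammar, min_count=2):
    """Return utterances seen at least min_count times that the grammar lacks."""
    counts = Counter(t.lower().strip() for t in transcripts)
    return sorted(
        utt for utt, count in counts.items()
        if count >= min_count and utt not in grammar
    )


if __name__ == "__main__":
    # Hypothetical weights and scores, for illustration only.
    weights = {"phonetic": 0.4, "syntactic": 0.3, "semantic": 0.3}
    nbest = [
        {"text": "check my ballots",
         "scores": {"phonetic": 0.8, "syntactic": 0.6, "semantic": 0.2}},
        {"text": "check my balance",
         "scores": {"phonetic": 0.7, "syntactic": 0.9, "semantic": 0.9}},
    ]
    print(rerank(nbest, weights)[0]["text"])

    transcripts = ["check my balance", "Check my balance", "pay my bill", "operator"]
    print(grammar_candidates(transcripts, grammar={"operator"}))
```

The point of combining scores is that the acoustically best hypothesis is not always the one that makes sense for the application, so syntactic and semantic evidence can overrule it.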

1/28/2006: 12:48 am: Robert: Speech, VoiceXML

I’ll be at SpeechTEK West in San Francisco next week. If anyone reading this will be there, email me or post a comment if you want to meet up. I’ll be in Voxify’s booth on Wednesday. Stop by if you want to hear first hand about the great platform and speech application templates we’ve built, as well as all the cool speech apps we’ve set up for clients. If you’re interested in joining us, bring your resumé, because we’re definitely hiring.

11/6/2005: 8:51 pm: Robert: Blogging and RSS, PhoneBlogger, Software, Speech, VoiceXML

I haven’t posted about PhoneBlogger in quite a while, but I’m thinking about updating and enhancing some of the code. A lot has happened in the audio/phone blogging world since I announced PhoneBlogger on January 9, 2003, and posted the PhoneBlogger source code on SourceForge.

One new buzzword is mobcasting. The Wikipedia page on mobcasting quotes Andy Carvin as writing:

A quick example: imagine a large protest at a political convention. During the protest, police overstep their authority and begin abusing protesters, sometimes brutally. A few journalists are covering the event, but not live. For the protestors and civil rights activists caught in the mêlée, the police abuses clearly need to be documented and publicized as quickly as possible.

This is quite similar to the scenario I was thinking of nearly three years ago when I announced PhoneBlogger:

A journalist could use it from a payphone (good luck finding one, though) or with a basic cellphone to immediately publish to the web from the scene of an unexpected event in progress. It’s moblogging for the people, man.

Note the quaint reference to a payphone. My point was that you wouldn’t need a fancy phone. Of course, mobile phones have come a long way since I wrote that. Carvin’s example also includes the use of camera/videophones, rather than just audio.

My favorite part of the Wikipedia article, though, is near the end where it says:

Carvin is now exploring the creation of an open-source mobcasting tool that could be installed on a server to allow for community mobcasts via a local telephone call.

I’ve been thinking about the same thing, too. While it makes life simpler for me to host the application with a VoiceXML hosting provider like BeVocal, I do like the idea of having a more self-contained app. It’s going to be pretty complicated, though, to sort out everything I need with a free PBX like Asterisk or sipX, a free VoiceXML browser like OpenVXI, a free ASR engine like Sphinx, and a free TTS engine like Festival. Dealing with PSTN calls will also be a hassle. If I implemented this, I would probably just deal with SIP. That led me down the path of looking into building or finding a SIP softphone that could run on a mobile phone. There is a Java API, JAIN-SIP, for building a Java SIP user agent. The phone would need only a J2ME runtime. What with all these acronyms and integration efforts, I think you can guess why I haven’t taken all of this on by myself, yet.

I’m glad to see that people like Andy are doing really interesting things with audio blogging. I built PhoneBlogger solely because I thought it would be fun to build. I never really ended up using it.

10/19/2005: 11:25 pm: Robert: Speech, VoiceXML

Next week, October 23-25, I will be in Phoenix for the ScanSoft/Nuance Conversations 2005 conference. ScanSoft formally changed their name to Nuance yesterday. During most of the time the Solutions Showcase area is open, you’ll be able to find me in the Voxify booth. If you’re also going to be at the conference, please stop by. I’ll be there until late Tuesday morning.
