VoiceXML – WombatNation

Tweeting by Phone with PhoneBlogger

Sun, 20 Mar 2011 23:06:52 +0000

In late 2002, I thought it would be cool to build an application that allowed you to blog by phone. Tools, libraries and hosted services were a bit more limited back then, but after a few months of learning, coding and debugging, I managed to release the first version of PhoneBlogger in January 2003. Along the way, I learned a lot about Python, VoiceXML, JavaScript, XML-RPC, audio encoding, shared web hosting and command line tools for Linux.

Fast forward nearly ten years and not only have the tools and libraries come a long way, but there are many more free or inexpensive hosted services that simplify building a tool/service like PhoneBlogger. Instead of hosting the application code on a shared hosting site, I can now build and deploy on Google App Engine. Though scalability is not an issue for my personal use of PhoneBlogger, if it were turned into a public service, App Engine would make scaling much simpler and more economical. App Engine also makes deployment a snap, though with a small amount of work, so would Fabric. For my PhoneBlogger rewrite, I decided to use App Engine.

In the original version of PhoneBlogger, I coded a bunch of static VoiceXML and JavaScript for managing the telephone interaction with a caller. At the time, three of the most prominent services for VoiceXML developers were Tellme (now owned by Microsoft), BeVocal (now owned by Nuance) and Voxeo (still independent). I had to write slightly different code for Tellme and BeVocal, but the differences weren’t that significant. I think it would have been pretty simple to port to Voxeo, as well. Improved support of VoiceXML 2 would now likely allow me to use the same code on each platform.

While VoiceXML is still a great option for building speech apps, a couple of new services bring you simple APIs for building speech or DTMF (touchtone) applications, at the cost of portability. This time around I’ve started with Twilio. I very quickly turned a Python/GAE example from the Twilio website into a DTMF app for tweeting by phone. Although speech recognition allows you to build much more complex and natural applications, many simple applications can be built quickly and easily with just support for pressing keys to provide input. PhoneBlogger falls into that category, for now.
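To give a feel for how little markup a DTMF-only flow needs, here is a minimal sketch that builds two TwiML responses with Python's standard library. It is not the actual PhoneBlogger code: the prompt wording, the `/record` and `/tweet` handler paths, and the function names are all assumptions made for illustration.

```python
import xml.etree.ElementTree as ET


def menu_twiml(action_url="/record"):
    """Build TwiML that asks the caller to press a key (DTMF input)."""
    response = ET.Element("Response")
    # <Gather> collects key presses and sends them to action_url as "Digits".
    gather = ET.SubElement(
        response, "Gather", numDigits="1", action=action_url, method="POST"
    )
    say = ET.SubElement(gather, "Say")
    say.text = "Press 1 to record a tweet."
    return ET.tostring(response, encoding="unicode")


def record_twiml(action_url="/tweet", max_seconds=60):
    """Build TwiML that records the caller and reports the recording URL."""
    response = ET.Element("Response")
    say = ET.SubElement(response, "Say")
    say.text = "Record your message after the beep."
    # <Record> sends a RecordingUrl parameter to action_url when finished.
    ET.SubElement(
        response, "Record", maxLength=str(max_seconds), action=action_url
    )
    return ET.tostring(response, encoding="unicode")


if __name__ == "__main__":
    print(menu_twiml())
    print(record_twiml())
```

A web handler (on App Engine or anywhere else) would return these strings as the body of its response to Twilio's request; the handler wiring is left out here because it depends on the framework.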

One very convenient thing about Twilio is that I can use their platform to capture and host recordings in a format that is simple to play back in a web browser. If I were really concerned about longevity of the recordings I could easily retrieve them and store them elsewhere, but I’m okay with keeping them on Twilio servers for now. That’s an easy enhancement to add later. The biggest downside for tweeting the Twilio links is that the Twilio recording URLs are ginormous. Fortunately, the goo.gl URL shortener made quick work of that problem.

I’m also going to take a look at porting my code to Tropo, which is a service offered by Voxeo. Tropo is built on Voxeo’s Prophecy platform and offers speech recognition as an option.

I decided to begin the rewrite by first supporting tweeting by phone. Twitter offers a great API, which is made even simpler by libraries like Tweepy. I highly recommend first checking out the OAuth support in any library for Twitter you might consider using. OAuth can be a complex beast, but libraries like Tweepy make it almost trivial.

The original PhoneBlogger source code and a couple iterations of it are available on SourceForge. I wasn’t particularly interested in learning about CVS at the time, so I just uploaded tarballs of all the code. While SourceForge has improved a lot, I’ve become more of a fan of GitHub. Google Code, LaunchPad and BitBucket are also great options. I started using LaunchPad when working on a Java library for Gearman, but then set up a couple of repos on GitHub when I started working on Log4mongo-Java. I’m much happier with Git, Bazaar and Mercurial than Subversion and CVS (Caveman Versioning System). I’ve already started posting code for the new phoneblogger project on GitHub.

As of now, the new version of PhoneBlogger supports tweeting by phone. All the code is on GitHub, along with a README file with the basic steps to set it up for yourself. In an upcoming blog post I’ll walk through those steps in a little more detail.

Gartner’s IVR Magic Quadrant

Wed, 09 Apr 2008 15:41:06 +0000

An article in Speech Technology magazine reports that in the most recent update to Gartner’s Magic Quadrant for IVRs, Microsoft Speech Server and Nuance Voice Portal got dropped. The disappearance of NVP is no surprise, since Nuance announced at their Conversations conference over two years ago that they would no longer enhance it.

Microsoft moved Speech Server into Office Communications Server last year, and really doesn’t seem to be promoting it as a standalone product, even though it can still be installed separately. Although I see virtually no push by Microsoft, or even their partners, to sell Speech Server into large contact centers, I’m still a little surprised Gartner dropped them.

We’ve been doing some testing on Speech Server at Voxify, and overall it works quite well. Getting it to work with our Asterisk-based PBX was a nightmare, but otherwise the install went pretty smoothly. Recognition performance using Microsoft’s ASR is generally similar to Nuance OSR, though recognition is very slow when doing n-best recognition for even medium-sized values of n. Microsoft’s fairly faithful compliance with the VoiceXML standard (we find issues with every VXML browser vendor we have worked with) was another very pleasant surprise. The best surprise was the licensing costs. It is amazingly inexpensive considering the quantity and quality of features it includes.

One of my biggest concerns about Speech Server is that activity in discussion forums and blogs regarding the product has dwindled dramatically (at least in the places I have looked) over the last year. Without Microsoft pushing Speech Server, I think there will need to be pretty strong community support for it to gain a foothold. It would be really too bad if it ends up getting buried in the unified communication product line at Microsoft.

The rest of the report contained no surprises. Genesys is listed as the clear leader, and that is definitely what I have seen in the market. Acquiring VoiceGenie was a brilliant move on their part, and they have very good offerings for both enterprises and large VXML hosting providers. Nonetheless, there continue to be interesting developments at Nortel, Avaya and Voxeo, among others.

AVIOS Speech App Contest for Students

Sun, 23 Sep 2007 07:06:50 +0000

AVIOS is having their second speech app building contest for student teams. “Cash and/or equipment prizes valued at over $1000 will be awarded to teams of student programmers who design and create applications judged by industry experts to be the most robust, useful, creative, innovative, and user friendly.” The finished application must be submitted by January 31, 2008.

Although the application must at least support speech input or output, students are encouraged to develop multi-modal apps. Many development environments are approved for use in building and running your application. If you are looking to get some long term useful experience out of this exercise, I strongly recommend that you build a VoiceXML app and host it on one of the platforms like BeVocal, Tellme, etc. While working with a downloadable environment like Voxeo’s Prophecy would be highly educational, you’ll end up spending a lot of time dealing with stuff that the hosted platforms take care of for you. Even better would be to use a tool that dynamically generates VoiceXML, but there aren’t a lot of free tools available that do that.

TMCnet Blog Post about Voxify

Thu, 14 Jun 2007 16:16:05 +0000

Patrick Barnard wrote a very nice post about Voxify on his Making Contact TMCnet blog after speaking with the heads of our sales and marketing groups. Patrick’s post aptly summarizes the nature of the hosted speech applications that Voxify provides.

For the sake of credibility regarding real world speech application implementations, it’s important to note that we don’t claim we can implement every imaginable integrated application in less than eight weeks. Patrick doesn’t say that either, but I can see how some people might jump to that conclusion. Some applications require the development of very complex call flows and extremely technically challenging integrations to back end systems. I think we still deploy these complex kinds of applications surprisingly quickly, though.

The telephony integration for a hosted speech application can add time, too, if a lot of changes need to be made to existing circuits or if new circuits need to be provisioned. The telecom companies have gotten a lot better about this, but it can still take them 1-2 weeks to provision a new line. Fortunately, we’re able to catch most of these situations up front and get all of the telecom work queued up early.

But, Voxify absolutely can design, develop and deploy integrated speech applications in less than eight weeks. We’ve done that for several clients, and we’ve made some changes to our platform that will enable us to deliver that fast much more often in the future.

Part of the reason we can develop speech applications so quickly is that we have the experience from developing a lot of applications. In addition, we took the time, either during those deployments or soon after, to capture that experience in our core platform or in reusable libraries. We now have a very powerful platform and a strong set of reusable horizontal (e.g., geographical location, billing and shipping address, credit card information, etc.) and vertical (e.g., flight info, hotel reservation, prescription refill, order status, etc.) libraries. We also have a very efficient set of deployment processes that we have honed during all of our previous deployments. And, oh yeah, there are a bunch of smart people in our office who continually amaze me.

SoccerPhone 2007

Thu, 12 Apr 2007 06:56:34 +0000

SoccerPhone is a speech application I wrote about five years ago so I could get live updates on Major League Soccer scores whenever I was away from an Internet connection. I wrote the application in VoiceXML, JavaScript, and Python. Since SoccerPhone gathers the live data by scraping information from the HTML scoreboard page on the MLS website, I often have to update my code when the MLS website changes each year. Fortunately, this year’s change was fairly minor.

Call 1-877-33-VOCAL (877-338-6225)
When asked for your PIN, enter or say 5425 (5425 = KICK)
When asked for your userid, enter or say 6575425 (6575425 = MLSKICK)
After you hear me say “Welcome to SoccerPhone”, you can say an MLS
team name, such as “Houston Dynamo”, or say “all teams”.
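The PIN, userid and phone number above are just words spelled on a standard telephone keypad. The mapping can be checked with a few lines of Python; the function name is my own choice for this sketch.

```python
# Letters printed on each key of a standard telephone keypad.
KEYPAD = {
    "2": "ABC", "3": "DEF", "4": "GHI", "5": "JKL",
    "6": "MNO", "7": "PQRS", "8": "TUV", "9": "WXYZ",
}
LETTER_TO_DIGIT = {
    letter: digit for digit, letters in KEYPAD.items() for letter in letters
}


def to_digits(text):
    """Translate letters to keypad digits; leave other characters as-is."""
    return "".join(LETTER_TO_DIGIT.get(ch, ch) for ch in text.upper())


print(to_digits("KICK"))      # 5425
print(to_digits("MLSKICK"))   # 6575425
print(to_digits("33-VOCAL"))  # 33-86225
```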

Since SoccerPhone lets you request this week’s match results for a team by saying the team’s name, I also have to update the GRXML grammars when new teams are added. While I could dynamically generate grammars from the team names that the app extracts from the MLS website, it’s not that big of a deal to manually maintain the grammars. Also, manually coding them allows me to tune the grammars for better recognition rates.
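For comparison, here is a rough sketch of what the dynamic option could look like: generating a simple SRGS XML grammar from a list of team names. This is not the SoccerPhone code. The function name, the rule id and the sample team list are assumptions, and a hand-tuned grammar would normally also carry alternate phrasings (for example, just the city or just the nickname) that this sketch does not attempt.

```python
import xml.etree.ElementTree as ET

SRGS_NS = "http://www.w3.org/2001/06/grammar"
XML_NS = "http://www.w3.org/XML/1998/namespace"


def build_team_grammar(team_names, rule_id="team"):
    """Return an SRGS XML grammar string matching any of the team names."""
    # Serialize SRGS elements in the default namespace (no prefix).
    ET.register_namespace("", SRGS_NS)
    grammar = ET.Element(
        "{%s}grammar" % SRGS_NS,
        {
            "version": "1.0",
            "mode": "voice",
            "root": rule_id,
            "{%s}lang" % XML_NS: "en-US",
        },
    )
    rule = ET.SubElement(
        grammar, "{%s}rule" % SRGS_NS, id=rule_id, scope="public"
    )
    one_of = ET.SubElement(rule, "{%s}one-of" % SRGS_NS)
    # One <item> per team, plus the "all teams" option.
    for name in list(team_names) + ["all teams"]:
        item = ET.SubElement(one_of, "{%s}item" % SRGS_NS)
        item.text = name
    return ET.tostring(grammar, encoding="unicode")


if __name__ == "__main__":
    # Sample data only; the real list would come from the scraped page.
    print(build_team_grammar(["Houston Dynamo", "LA Galaxy"]))
```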

If you’ve called the app before, you’ll be disappointed to hear that I am using the same lame voice talent, i.e., me. On the good side, though, I re-recorded a bunch of the prompts using Audacity. I also eliminated a little more of the TTS by adding additional recordings. I really should take advantage of the great recording studio we have at Voxify, but then I would feel obligated to use a real voice talent.

I haven’t updated the code at the SoccerPhone SourceForge project site, yet. But I will get to that soon. I got sidetracked by looking into converting the CVS repository to Subversion.

Microsoft and VoiceXML

Fri, 14 Apr 2006 06:42:26 +0000

Microsoft recently announced that Speech Server 2007 will provide support for speech applications written in VoiceXML. In order to penetrate the enterprise market for speech applications, Microsoft really had no choice. SALT-based applications remain as rare as hen’s teeth in the enterprise. Ok, maybe not that rare, but certainly the number pales in comparison to the number of VoiceXML-based applications. The press release says “More than 40,000 telephony ports of capacity have been licensed, and Speech Server customers are successfully answering more than 10 million calls per month on the platform”. I know of individual companies that by themselves handle more than that many calls per month with VoiceXML applications.

Also, it’s become pretty clear that VoiceXML is winning the mindshare of the standards committees. Of course, VoiceXML had a big advantage by preceding SALT by several years. Even in the multimodal space, SALT is very unlikely to become the anointed standard. Some of SALT will likely live on in VoiceXML 3.0 and beyond. That’s a very good thing for all of us, though, as I believe VoiceXML 3.0 and XHTML+V are going to be much better standards due to some of the good ideas that originated from the work on SALT.

I’m curious if part of the reason for Microsoft picking up some of the technology assets and a few people from failed start-up Unveil was to gain some additional VoiceXML experience in advance of this plan. After all, the headline of the press release I linked to above was “Microsoft Unveils Road Map for Speech Server 2007”. Then again, maybe not.

SpeechTEK West 2006 Day 2 Afternoon

Tue, 21 Feb 2006 07:43:22 +0000

In the afternoon, I attended two sessions on multimodal applications.

Dave Raggett from the W3C started the session with a talk on Speech Enabling Web Browsers. He has been working on some prototype applications that combine AJAX with speech. He uses a local HTTP server to handle audio on the device (which, for now, is a laptop). A remote HTTP server provides speech services. He uses AJAX, or more specifically, the XMLHTTP object and JavaScript to interact with the remote server. Audio is sent in the request and the interpretation is returned as EMMA markup (SRGS + SISR). He presented a sample application for ordering a pizza that even handled compound utterances. For a prototype, it worked reasonably well. The application was implemented in XHTML, CSS, and JavaScript. He also used AJAX for logging, which allowed him to maintain a synchronized log on the server.

Mark Randolph talked about how Motorola was trying to evolve push-to-talk to “push-to-ask”, i.e., making speech queries to an online database. They are working with SandCherry to commercialize speech apps that use a radio network rather than a telephone network. One nice thing about the push-to-talk model is that it creates a clear endpoint for turn-taking in a speech app. They’ve introduced the +V Framework, which provides APIs to interface with local codecs. They are also doing distributed speech recognition by putting the front end of the SR on the device. An ANR codec is used for audio to be played back on the device. DMSP, which uses binary XML, is used to sync the local app with the remote app. Cepstral analysis and some noise reduction are done up front. Endpoint markers are added to aid with transcription. Noise reduction is done only on the sound captured during the push-to-talk phase, partly due to battery usage issues.

Luisa Cordano from Loquendo kicked off the second session. She talked about work they are doing with AirBus. SNOW is a project to provide multimodal access to maintenance info for workers. She played a video that demonstrated a worker being able to capture video with a head mounted camera, call up manuals via speech, and display information on a PDA. The speech and PDA media channels were synchronized.

Someone from Nortel talked about the benefits of standards and gave a high level overview of the kinds of speech and multimodal apps that companies have been building for many years.

Jim Barnett talked in more depth about X+V and SALT plus XHTML. He explained how the X+V tag provides explicit binding of slots. There was some good info in his talk, but not enough of it. This happened to a lot of the speakers at the end of sessions, as their time slots got compressed by earlier speakers.

Finally, Dave Burke at VoxPilot gave a glossy, and yet very informative and technical, presentation on video interactive services. He talked about what they are doing with 3G mobile video (H.324M, 64 kbps per channel) and video over IP. Video is H.264 or MPEG-4 and audio is AMR or G.723. RTP is used for the video stream for video over IP. They use the VXML tag for video. It works, but there has been some discussion on the voice browser working group mailing list about adding tags for other media, such as video. He also talked about video streaming with Skype and Sony IVE (Instant Video Everywhere).

SpeechTEK West 2006 Day 2 Morning

Sat, 04 Feb 2006 20:53:00 +0000

Tuesday morning I attended sessions on core speech technology and dialog design.

Dr. Randy Ford from Sonum Technologies talked about using strong Natural Language Processing (NLP) to improve speech recognition. He claimed that by using N-gram substitution (e.g., replace the likely misrec “think you” with “thank you”), phonetic tumbling, or a hybrid of the two, you can reasonably achieve a 20% improvement on the base recognition. I wish I could provide more detail on phonetic tumbling, but he had to rush through the end of his presentation and I was too busy taking notes on what he had previously said.
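To make the N-gram substitution idea concrete, here is a toy sketch of the substitution step applied to a recognizer’s text output. It is my own illustration, not Sonum’s implementation: the substitution table and function name are assumptions, and the sketch replaces every match blindly, whereas a real system would need context to decide when “think you” really is a misrecognition.

```python
import re

# Hypothetical table of likely misrecognitions and their corrections.
SUBSTITUTIONS = {
    ("think", "you"): ("thank", "you"),
}


def substitute_ngrams(text, table=SUBSTITUTIONS):
    """Replace known misrecognized word sequences in a recognition result."""
    words = re.findall(r"[a-z']+", text.lower())
    out = []
    i = 0
    while i < len(words):
        for ngram, replacement in table.items():
            n = len(ngram)
            if tuple(words[i:i + n]) == ngram:
                # Matched a known misrecognition; emit the correction.
                out.extend(replacement)
                i += n
                break
        else:
            # No table entry starts at this word; keep it unchanged.
            out.append(words[i])
            i += 1
    return " ".join(out)


print(substitute_ngrams("I think you very much"))  # i thank you very much
```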

Yoon Kim from Novauris discussed using phonetic techniques to improve recognition for large lists. By taking into account syllable structure and stress, they feel they can significantly improve recognition performance over that of conventional SR engines for items in a large corpus. With respect to lexical stress, they are analysing the stress placed on consonants coming immediately before or after vowels, as well as many aspects of how the vowels themselves are pronounced. Recognition accuracy is locally decreased for unstressed vowels, so they have found it helps a lot in this case to also look closely at the stress placed on nearby consonants. Language-specific syllable structure affects how important this differentiation can be. While the English language generally has more complicated syllable structure and stress distinctions than Korean, Korean can have much greater distinction between consonants preceding or following vowels.

Vlad Sejnoha from Nuance then gave a talk on current work at Nuance on speech technology. The speech was very similar to ones given at last Fall’s Nuance Conversations conference. One of these days, I’ll post my notes from that conference.

In the Dialog Design panel, Sondra Ahlén spoke about Spanish voice talents, including Spanish language TTS voices. She provided a lot of interesting trivia on Spanish speakers (Countries with most Spanish speakers in order are Mexico, Colombia, the US, and Spain; 12% of US residents speak Spanish and half of those don’t speak English; Colombian Spanish is considered the standard dialect for Latin America; Mexican Spanish is generally considered the standard dialect for the US). She also gave some examples of the differences between the dialects. She recommended that you always use a native speaker and that you match the dialect to the greatest common population in the expected audience. She then played some sample recordings from the most popular Spanish TTS voices, pointing out that while they are not as well tuned as English TTS voices, they are still quite usable.

My friend Bob Cooper from Avaya then spoke about an older product that was developed when he was at Conita, which was later acquired by Avaya. He spoke about dialog design considerations for power users who use an application multiple times per day. Use auditory feedback and lots of audio cues, optimize for the common path, and replace separator words with distinctive sounds.

A grad student working with Intervoice presented his work on automatic tuning of context free grammars. He used the SONIC large vocabulary (~75k words) SR engine from CU Boulder to transcribe previously recorded utterances. He then re-ranked n-best lists using phonetic, local, syntactic, and semantic weights. Or, at least, so say my hastily scribbled notes. He then employed Princeton’s WordNet to provide automatic categorization via synonyms. Lexical chains were also used to classify the transcription. The most common utterances were automatically added to the grammar, and common sub-sequences were favored over longer sequences. He claimed that for one test with an initially untuned application, his automatic grammar tuner performed within 2% of a manual tuning performed by someone at Intervoice.
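I don’t have the details of how the weights were combined, but the general shape of re-ranking an n-best list can be sketched as a weighted sum of per-hypothesis scores. Everything in this sketch is assumed for illustration: the feature names, the weights, the sample scores and the linear combination itself.

```python
def rerank(nbest, weights):
    """Sort hypotheses by a weighted sum of their per-feature scores."""
    def combined(hyp):
        return sum(
            weight * hyp["scores"].get(name, 0.0)
            for name, weight in weights.items()
        )
    return sorted(nbest, key=combined, reverse=True)


# Made-up n-best list: the recognizer preferred the first hypothesis on
# phonetic score alone, but the other features favor the second.
nbest = [
    {"text": "think you",
     "scores": {"phonetic": 0.8, "syntactic": 0.2, "semantic": 0.2}},
    {"text": "thank you",
     "scores": {"phonetic": 0.6, "syntactic": 0.8, "semantic": 0.8}},
]
weights = {"phonetic": 0.5, "syntactic": 0.25, "semantic": 0.25}

for hyp in rerank(nbest, weights):
    print(hyp["text"])
```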

SpeechTEK West 2006

Sat, 28 Jan 2006 07:48:58 +0000

I’ll be at SpeechTEK West in San Francisco next week. If anyone reading this will be there, email me or post a comment if you want to meet up. I’ll be in Voxify’s booth on Wednesday. Stop by if you want to hear first hand about the great platform and speech application templates we’ve built, as well as all the cool speech apps we’ve set up for clients. If you’re interested in joining us, bring your resumé, because we’re definitely hiring.

Mobcasting

Mon, 07 Nov 2005 03:51:12 +0000

I haven’t posted about PhoneBlogger in quite a while, but I’m thinking about updating and enhancing some of the code. A lot has happened in the audio/phone blogging world since I announced PhoneBlogger January 9, 2003, and posted the PhoneBlogger source code on SourceForge.

One new buzzword is mobcasting. The Wikipedia page on mobcasting quotes Andy Carvin as writing:

A quick example: imagine a large protest at a political convention. During the protest, police overstep their authority and begin abusing protesters, sometimes brutally. A few journalists are covering the event, but not live. For the protestors and civil rights activists caught in the mêlée, the police abuses clearly need to be documented and publicized as quickly as possible.

This is quite similar to the scenario I was thinking of nearly three years ago when I announced PhoneBlogger:

A journalist could use it from a payphone (good luck finding one, though) or with a basic cellphone to immediately publish to the web from the scene of an unexpected event in progress. It’s moblogging for the people, man.

Note the quaint reference to a payphone. My point was that you wouldn’t need a fancy phone. Of course, mobile phones have come a long way since I wrote that. Carvin’s example also includes the use of camera/videophones, rather than just audio.

My favorite part of the Wikipedia article, though, is near the end where it says:

Carvin is now exploring the creation of an open-source mobcasting tool that could be installed on a server to allow for community mobcasts via a local telephone call.

I’ve been thinking about the same thing, too. While it makes life simpler for me to host the application with a VoiceXML hosting provider like BeVocal, I do like the idea of having a more self-contained app. It’s going to be pretty complicated, though, to sort out everything I need with a free PBX like Asterisk or sipX, a free VoiceXML browser like OpenVXI, a free ASR engine like Sphinx, and a free TTS engine like Festival. Dealing with PSTN calls will also be a hassle. If I implemented this, I would probably just deal with SIP. That led me down the path of looking into building or finding a SIP softphone that could run on a mobile phone. There is a Java API, JAIN-SIP, for building a Java SIP user agent. The phone would need only a J2ME runtime. What with all these acronyms and integration efforts, I think you can guess why I haven’t taken all of this on by myself, yet.

I’m glad to see that people like Andy are doing really interesting things with audio blogging. I built PhoneBlogger solely because I thought it would be fun to build. I never really ended up using it.