Speech


3/20/2011: 3:06 pm: Google App Engine, PhoneBlogger, Python, VoiceXML

In late 2002, I thought it would be cool to build an application that allowed you to blog by phone. Tools, libraries and hosted services were a bit more limited back then, but after a few months of learning, coding and debugging, I managed to release the first version of PhoneBlogger in January 2003. Along the way, I learned a lot about Python, VoiceXML, JavaScript, XML-RPC, audio encoding, shared web hosting and command line tools for Linux.

Fast forward nearly ten years and not only have the tools and libraries come a long way, but there are many more free or inexpensive hosted services that simplify building a tool/service like PhoneBlogger. Instead of hosting the application code on a shared hosting site, I can now build and deploy on Google App Engine. Though scalability is not an issue for my personal use of PhoneBlogger, if it were turned into a public service, App Engine would make scaling much simpler and more economical. App Engine also makes deployment a snap, though with a small amount of work, so would Fabric. For my PhoneBlogger rewrite, I decided to use App Engine.

In the original version of PhoneBlogger, I coded a bunch of static VoiceXML and JavaScript for managing the telephone interaction with a caller. At the time, three of the most prominent services for VoiceXML developers were Tellme (now owned by Microsoft), BeVocal (now owned by Nuance) and Voxeo (still independent). I had to write slightly different code for Tellme and BeVocal, but the differences weren’t that significant. I think it would have been pretty simple to port to Voxeo, as well. Improved support of VoiceXML 2 would now likely allow me to use the same code on each platform.

While VoiceXML is still a great option for building speech apps, a couple of new services bring you simple APIs for building speech or DTMF (touchtone) applications, at the cost of portability. This time around I’ve started with Twilio. I very quickly turned a Python/GAE example from the Twilio website into a DTMF app for tweeting by phone. Although speech recognition allows you to build much more complex and natural applications, many simple applications can be built quickly and easily with just support for pressing keys to provide input. PhoneBlogger falls into that category, for now.
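To give a feel for how little markup a keypad-driven call flow needs, here is a minimal sketch of the two TwiML documents such an app might return. It is not the actual PhoneBlogger code. The function names, prompts and the `/record` and `/done` URLs are made up for illustration; `Response`, `Gather`, `Say`, `Record` and `Hangup` are standard TwiML verbs, and Twilio posts the pressed key to the `Gather` action URL as the `Digits` parameter.

```python
import xml.etree.ElementTree as ET


def build_menu_twiml(record_url):
    """Return TwiML that asks for one key press and posts it to record_url."""
    response = ET.Element("Response")
    gather = ET.SubElement(
        response, "Gather",
        {"numDigits": "1", "action": record_url, "method": "POST"},
    )
    prompt = ET.SubElement(gather, "Say")
    prompt.text = "Press 1 to record a message."
    # These verbs run only if the caller presses nothing before Gather times out.
    fallback = ET.SubElement(response, "Say")
    fallback.text = "No input received. Goodbye."
    ET.SubElement(response, "Hangup")
    return ET.tostring(response, encoding="unicode")


def build_record_twiml(digits, done_url):
    """Return TwiML for the key the caller pressed (the Digits parameter)."""
    response = ET.Element("Response")
    if digits == "1":
        prompt = ET.SubElement(response, "Say")
        prompt.text = "Record your message after the beep."
        # Twilio hosts the recording and posts its URL to done_url.
        ET.SubElement(
            response, "Record",
            {"maxLength": "30", "action": done_url, "method": "POST"},
        )
    else:
        sorry = ET.SubElement(response, "Say")
        sorry.text = "Sorry, that is not a valid choice."
        ET.SubElement(response, "Hangup")
    return ET.tostring(response, encoding="unicode")


if __name__ == "__main__":
    print(build_menu_twiml("/record"))
    print(build_record_twiml("1", "/done"))
```

A web handler would return these strings with an XML content type. Building the markup with the standard library keeps the sketch independent of any particular Twilio helper library version.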

One very convenient thing about Twilio is that I can use their platform to capture and host recordings in a format that is simple to play back in a web browser. If I were really concerned about longevity of the recordings I could easily retrieve them and store them elsewhere, but I’m okay with keeping them on Twilio servers for now. That’s an easy enhancement to add later. The biggest downside for tweeting the Twilio links is that the Twilio recording URLs are ginormous. Fortunately, the goo.gl URL shortener made quick work of that problem.

I’m also going to take a look at porting my code to Tropo, which is a service offered by Voxeo. Tropo is built on Voxeo’s Prophecy platform and offers speech recognition as an option.

I decided to begin the rewrite by first supporting tweeting by phone. Twitter offers a great API, which is made even simpler by libraries like Tweepy. I highly recommend first checking out the OAuth support in any library for Twitter you might consider using. OAuth can be a complex beast, but libraries like Tweepy make it almost trivial.

The original PhoneBlogger source code and a couple iterations of it are available on SourceForge. I wasn’t particularly interested in learning about CVS at the time, so I just uploaded tarballs of all the code. While SourceForge has improved a lot, I’ve become more of a fan of GitHub. Google Code, LaunchPad and BitBucket are also great options. I started using LaunchPad when working on a Java library for Gearman, but then set up a couple of repos on GitHub when I started working on Log4mongo-Java. I’m much happier with Git, Bazaar and Mercurial than Subversion and CVS (Caveman Versioning System). I’ve already started posting code for the new phoneblogger project on GitHub.

As of now, the new version of PhoneBlogger supports tweeting by phone. All the code is on GitHub, along with a README file with the basic steps to set it up for yourself. In an upcoming blog post I’ll walk through those steps in a little more detail.

10/8/2008: 9:27 am: Speech

I got an outbound notification call from United this morning about a change to an upcoming flight. When United introduced Easy Update years ago, I was a big fan. I even sent email to United customer service to find out which vendor implemented it for them.

Today’s call was more like Excruciating Update, though. The call started off on a bad note when it didn’t detect me pressing 1 to indicate that I was a human, as opposed to an answering machine. The first couple of prompts were okay, but then the gaps between the prompts .. got …. longer …….. and ………………. longer. Nearing the end of the details for the second flight, each prompt was spaced out by at least 10 seconds. That is pretty painful, since each digit in the flight number is read as a separate prompt.

Most people would have hung up by then, but I work in the business, so I wanted to listen in to the bitter end. Which happened during the second flight details.

Halfway through playing the arrival time, the notification app either crashed or bailed out and switched to a message informing me that they were having technical difficulties and I should call a toll free number to speak to a person. The third-party provider for this service (Centerpost, which West bought) was presumably having serious load issues that slowed down their servers dramatically.

9/2/2008: 8:55 am: Speech

In a comment on another post, a friend pointed out a fun game from Language Trainers Group for guessing accents. You listen to someone read a few lines from a poem and then you answer a multiple choice question to identify their country of origin. For a couple of them, you can get bonus points for guessing which city the speaker is from. And if you’re not from that country, it’s very likely to be a wild guess.

I got about half of the 16 right. I would have gotten nearly three quarters of them correct if I had always gone with my first guess. I was too suspicious that the obvious answer wasn’t the right answer.

8/29/2008: 10:15 pm: Speech

This past Monday I was one of the panelists for the CRMXchange Great Debate webcast on Speech Self Service. We covered the following topics, and a bit more, during the webcast:

  • Enterprise and contact center trends that are driving new investments in IVR/speech applications
  • Best practices for determining the most beneficial activities to offer your customers via your IVR/speech application
  • How to realize the greatest return from your IVR/speech investments
  • Do’s and don’ts for IVR applications – what activities and functions should and should not be offered by an IVR/speech system
  • Which parts of a business outside the call center can benefit from speech applications

It was a lot of fun and I hope I was able to provide the attendees with interesting and useful information, specifically about what we have learned at Voxify in our many customer deployments. I thought the other panelists did a good job, though I was surprised that none of them named specific customers. I think it is really helpful to people looking to deploy speech applications to hear about the successful experiences of specific customers in businesses very similar to theirs.

You can hear a recording of the webcast (registration required) on the CRMXchange site.

7/5/2008: 12:07 am: Speech

There’s a very detailed (and long) article on the state and future of speech recognition and speech synthesis in the New York Times from late June. Although the prognosis is not that positive, the article almost seems to hold speech recognition to the standard of a Turing test, i.e., a computer recognizing the semantics of human speech as well as a human. Also, quite a bit of the article focuses on detecting the type and level of emotion in a speaker’s speech.

The article might give a reader the impression that not much is going on with advancements in emotional speech prosody in commercially available text-to-speech engines, but anyone who has listened to a demo of the Loquendo TTS engine would tell you differently.

6/26/2008: 10:38 pm: Speech

The Austin Capital Metro CIO deserves a lot of credit for owning up on the Austin CapMetro blog to some major issues with their IVR applications for bus schedules, etc. It sounds like they have some grammar definition, timeout setting and confidence level setting issues with their app, though it is harder to know for sure without taking a look at it. I would love to help, but it depends on how they have written the app.
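To illustrate what a confidence level setting does, here is a minimal sketch of the usual two-threshold pattern: accept a high-confidence result outright, confirm a middling one with the caller, and reprompt on a low one. The function name and the threshold values are hypothetical. They are not taken from the CapMetro app, and real values have to be tuned against recorded calls.

```python
def next_action(confidence, accept_at=0.80, confirm_at=0.45):
    """Decide what to do with a recognition result scored from 0.0 to 1.0.

    The thresholds are illustrative defaults, not recommendations.
    """
    if not 0.0 <= confidence <= 1.0:
        raise ValueError("confidence must be between 0.0 and 1.0")
    if confidence >= accept_at:
        return "accept"      # act on the result without asking
    if confidence >= confirm_at:
        return "confirm"     # "I think you said ... Is that right?"
    return "reprompt"        # too unsure to guess, so ask again


if __name__ == "__main__":
    for score in (0.92, 0.60, 0.20):
        print(score, next_action(score))
```

Set `accept_at` too low and callers get sent down the wrong path. Set `confirm_at` too high and they get reprompted over and over, which sounds a lot like the complaints in that post.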

I do have to disagree with one of his other IVR-related posts where he states that:

But when a rider calls in to find out how to get from Downtown to Highland mall in the shortest time possible, an IVR will not do a good job of handling this question (a lot of human judgment and discretion is required which an IVR just can’t muster).

You would be surprised how well a speech app can handle that kind of problem. Of course, it won’t be cheap, as you have to think through the common starting and destination points callers might use, know when to ask for more detail (downtown isn’t sufficient info for providing directions unless you are in Mayberry RFD), algorithms for computing shortest time based on the schedules, etc., but it can definitely be done. Now, there are certainly many customer service kinds of apps that are very difficult to handle with a speech app, but directions is not one of them.
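The shortest-time part is the most mechanical piece. As a rough sketch, assuming the schedule can be flattened into single hops between stops, one pass over the hops in departure order finds the earliest possible arrival. The stop names and times below are invented, times are minutes after midnight, and transfer times and walking between stops are ignored.

```python
def earliest_arrival(connections, origin, destination, start_time):
    """Return the earliest arrival time at destination, or None if unreachable.

    connections is a list of (from_stop, to_stop, depart, arrive) tuples.
    """
    best = {origin: start_time}
    # Scanning in departure order means each hop is checked after every
    # earlier hop that could have delivered the rider to its from_stop.
    for from_stop, to_stop, depart, arrive in sorted(connections, key=lambda c: c[2]):
        reachable = best.get(from_stop, float("inf")) <= depart
        if reachable and arrive < best.get(to_stop, float("inf")):
            best[to_stop] = arrive
    return best.get(destination)


# Made-up schedule: a slow direct bus and a faster trip with one transfer.
SCHEDULE = [
    ("Downtown", "Capitol", 480, 490),
    ("Downtown", "Highland Mall", 485, 530),
    ("Capitol", "Highland Mall", 495, 515),
]

if __name__ == "__main__":
    print(earliest_arrival(SCHEDULE, "Downtown", "Highland Mall", 475))
```

The harder and more expensive work is everything around this: mapping what a caller says to actual stops and knowing when to ask for more detail.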

In deciding whether it is worth building an app for this function, you have to look at the total number of minutes of calls like this being handled by live agents. If the number is low (and “low” varies with the complexity of the problem, and thus the solution cost, of course), then it may be better to leave the calls to a small number of trained agents who can handle many other types of calls. But, once you want to offer this service beyond regular working hours or if you expect the call traffic for this type of call to be very spiky, it may be worth building an app to take the calls.
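The minutes-based comparison boils down to simple arithmetic. Here is a back-of-the-envelope sketch. Every input is a hypothetical number chosen for illustration, and the model ignores things like after-hours coverage and spiky traffic that can tip the decision the other way.

```python
def months_to_break_even(monthly_minutes, agent_cost_per_min,
                         app_cost_per_min, automation_rate, build_cost):
    """Return months until the build cost is recovered, or None if never.

    automation_rate is the fraction of call minutes the app handles
    without handing the caller off to an agent.
    """
    saved_per_minute = agent_cost_per_min - app_cost_per_min
    monthly_savings = monthly_minutes * automation_rate * saved_per_minute
    if monthly_savings <= 0:
        return None
    return build_cost / monthly_savings


if __name__ == "__main__":
    # Illustrative inputs only: 20,000 minutes a month versus 1,000.
    print(months_to_break_even(20000, 0.80, 0.10, 0.5, 70000))
    print(months_to_break_even(1000, 0.80, 0.10, 0.5, 70000))
```

With these made-up inputs the busier line pays for itself in under a year, while the quiet one takes well over a decade, which is the "leave it to the agents" case.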

4/9/2008: 9:15 am: Speech

When I posted a couple days ago about Spinvox taking in a very large funding round, I missed an announcement that same day about Nuance’s new voicemail to text service, which they have decided to cryptically call Voicemail to Text. Nuance is providing this service only through telecom carriers.

The thing I found most interesting is that the Voicemail to Text product page states that Nuance’s transcription software is supported by over 3,000 human transcriptionists. Well, they don’t specifically say human, but I think that’s a safe bet. I would have thought that if any company could completely automate the transcription process, it would be Nuance. Then again, I often can’t understand all of the words in the voicemail messages I receive, and last time I checked, I was human.

Recently I interviewed an engineer for a position at Voxify who had worked on such a service at a company that develops unified messaging software. They were trying to fully automate the voicemail transcription process, though they seemed to be targeting a much less complete transcription. That would still be useful if you receive a lot of voicemail messages, as it might allow you to better prioritize the order in which you go through the backlog. I get upwards of three voicemails a week from my retinue of admirers, so this isn’t such a problem for me, though it would let me quickly skip through the majority of those messages that are wrong numbers.

: 8:41 am: Speech, VoiceXML

An article in Speech Technology magazine reports that in the most recent update to Gartner’s Magic Quadrant for IVRs, Microsoft Speech Server and Nuance Voice Portal got dropped. The disappearance of NVP is no surprise, since Nuance announced at their Conversations conference over two years ago that they would no longer enhance it.

Microsoft moved Speech Server into Office Communications Server last year, and really doesn’t seem to be promoting it as a standalone product, even though it can still be installed separately. Although I see virtually no push by Microsoft, or even their partners, to sell Speech Server into large contact centers, I’m still a little surprised Gartner dropped them.

We’ve been doing some testing on Speech Server at Voxify, and overall it works quite well. Getting it to work with our Asterisk-based PBX was a nightmare, but otherwise the install went pretty smoothly. Recognition performance using Microsoft’s ASR is generally similar to Nuance OSR, though recognition is very slow when doing n-best recognition for even medium-sized values of n. Microsoft’s fairly faithful compliance with the VoiceXML standard (we find issues with every VXML browser vendor we have worked with) was another very pleasant surprise. The best surprise was the licensing cost. It is amazingly inexpensive considering the quantity and quality of features it includes.

One of my biggest concerns about Speech Server is that activity in discussion forums and blogs regarding the product has dwindled dramatically (at least in the places I have looked) over the last year. Without Microsoft pushing Speech Server, I think there will need to be pretty strong community support for it to gain a foothold. It would be really too bad if it ends up getting buried in the unified communication product line at Microsoft.

The rest of the report contained no surprises. Genesys is listed as the clear leader, and that is definitely what I have seen in the market. Acquiring VoiceGenie was a brilliant move on their part, and they have very good offerings for both enterprises and large VXML hosting providers. Nonetheless, there continue to be interesting developments at Nortel, Avaya and Voxeo, among others.

4/3/2008: 3:44 pm: Speech

Nancy Jamison posted a nice write-up on her blog about a recent webcast that Voxify presented with Continental. It covered a new outbound voice app we just rolled out for Continental that calls customers up to 24 hours before their flight and offers to check them in. It’s an especially great app for Continental’s frequent fliers, since the sooner they check in, the better their chance of getting an upgrade. Nancy provides a good description of the main features of the app near the end of her post. I’m especially excited about this deployment, because I developed the integration to the remote dialer that is actually placing the phone calls.

As Nancy points out, this is the kind of outbound call that customers actually do want to receive. We’re working on a lot of stuff like this, so hopefully more of the outbound calls people receive in the future will be helpful calls, instead of just telemarketing, collections and surveys with no compensation for your time.

The outbound calling apps we build also go way beyond “read-only” notification calls. These are interactive calls that let you do things like ask to have a message repeated or reschedule the call for a more convenient time. Rescheduling a call using DTMF (i.e., pressing digits on a keypad) is terrible compared to doing it with speech. For this application, speech recognition is also used to prompt for how you want to receive your check-in confirmation, check in for multiple flights, collect information about infants or passengers under the age of thirteen, ask about upgrades, and a lot more.

4/2/2008: 9:58 am: Speech

SpinVox certainly has come a long way in the last few years since I checked out their service for converting voicemails to text. They launched at nearly the perfect time. Large vocabulary speech recognizers have been around for a long time, but in the last few years they have become particularly plentiful and cheap.

Also, SMS has taken off in the US to the point that there is now a huge number of potential customers who would be interested in getting the gist of a voicemail texted to them. There is also a fast growing population of users with phones capable of email access who would want the full transcription emailed to them so they could review it and potentially respond. If the voicemail message contains things you might want to write down, like phone numbers, names, addresses, etc., the automatic transcription saves you even more time. Of course, that assumes the transcription is at least as correct as what you would have written down.

Of course, SMS has been popular for much longer in other countries, but language is obviously an issue. Sure, there are a lot of potential customers in Finland, but that means you need a recognizer with a very good Finnish language model, and that’s going to help you out only in Finland. Obviously, tapping into a large country that uses the same language as several other large countries is pretty desirable when you are trying to really scale up a business.
