Archive for February, 2006

2/21/2006: 12:43 am: Speech, VoiceXML

In the afternoon, I attended two sessions on multimodal applications.

Dave Raggett from the W3C started the session with a talk on Speech Enabling Web Browsers. He has been working on some prototype applications that combine AJAX with speech. He uses a local HTTP server to handle audio on the device (which, for now, is a laptop). A remote HTTP server provides speech services. He uses AJAX, or more specifically, the XMLHTTP object and JavaScript to interact with the remote server. Audio is sent in the request and the interpretation is returned as EMMA markup (SRGS + SISR). He presented a sample application for ordering a pizza that even handled compound utterances. For a prototype, it worked reasonably well. The application was implemented in XHTML, CSS, and JavaScript. He also used AJAX for logging, which allowed him to maintain a synchronized log on the server.
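
To give a feel for what an EMMA result for a compound pizza order might look like, here is a rough sketch in Python. This is not Dave's code (his client was JavaScript), and I don't have his actual markup: the sample document, the order/pizza element names, and the parse_emma helper are all made up for illustration. Only the EMMA namespace and the emma:confidence and emma:tokens annotations come from the W3C spec.

```python
import xml.etree.ElementTree as ET

EMMA_NS = "http://www.w3.org/2003/04/emma"

# Hypothetical recognizer response for a compound utterance.
# The <order>/<pizza> payload is invented; EMMA leaves the
# application-specific part of the interpretation up to the grammar.
SAMPLE_EMMA = """\
<emma:emma version="1.0" xmlns:emma="http://www.w3.org/2003/04/emma">
  <emma:interpretation id="int1" emma:confidence="0.82"
      emma:tokens="two large pizzas with mushrooms and a small one with ham">
    <order>
      <pizza><quantity>2</quantity><size>large</size><topping>mushrooms</topping></pizza>
      <pizza><quantity>1</quantity><size>small</size><topping>ham</topping></pizza>
    </order>
  </emma:interpretation>
</emma:emma>
"""


def parse_emma(xml_text):
    """Pull the confidence, the recognized tokens, and the pizzas out of an EMMA result."""
    root = ET.fromstring(xml_text)
    interp = root.find(f"{{{EMMA_NS}}}interpretation")
    if interp is None:
        raise ValueError("no emma:interpretation element found")
    # Each <pizza> becomes a dict of its child element names and text values.
    pizzas = [
        {child.tag: child.text for child in pizza}
        for pizza in interp.iter("pizza")
    ]
    return {
        "confidence": float(interp.get(f"{{{EMMA_NS}}}confidence", "0")),
        "tokens": interp.get(f"{{{EMMA_NS}}}tokens", ""),
        "pizzas": pizzas,
    }


if __name__ == "__main__":
    result = parse_emma(SAMPLE_EMMA)
    print(result["tokens"])
    for pizza in result["pizzas"]:
        print(pizza)
```

A browser client would do the equivalent with the DOM returned by the XMLHTTP object and then fill in the order form fields.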

Mark Randolph talked about how Motorola was trying to evolve push-to-talk to “push-to-ask”, i.e., making speech queries to an online database. They are working with SandCherry to commercialize speech apps that use a radio network rather than a telephone network. One nice thing about the push-to-talk model is that it creates a clear endpoint for turn-taking in a speech app. They’ve introduced the +V Framework, which provides APIs to interface with local codecs. They are also doing distributed speech recognition by putting the front end of the SR on the device. An ANR codec is used for audio to be played back on the device. DMSP, which uses binary XML, is used to sync the local app with the remote app. Cepstral analysis and some noise reduction are done up front. Endpoint markers are added to aid with transcription. Noise reduction is done only on the sound captured during the push-to-talk phase, partly due to battery usage issues.

Luisa Cordano from Loquendo kicked off the second session. She talked about work they are doing with Airbus. SNOW is a project to provide multimodal access to maintenance info for workers. She played a video that demonstrated a worker being able to capture video with a head-mounted camera, call up manuals via speech, and display information on a PDA. The speech and PDA media channels were synchronized.

Someone from Nortel talked about the benefits of standards and gave a high level overview of the kinds of speech and multimodal apps that companies have been building for many years.

Jim Barnett talked in more depth about X+V and SALT plus XHTML. He explained how the X+V tag provides explicit binding of slots. There was some good info in his talk, but not enough of it. This happened to a lot of the speakers at the end of sessions, as their time slots got compressed by earlier speakers.

Finally, Dave Burke at VoxPilot gave a glossy, and yet very informative and technical, presentation on video interactive services. He talked about what they are doing with 3G mobile video (H.324M, 64 kbps per channel) and video over IP. Video is H.264 or MPEG-4 and audio is AMR or G.723. RTP is used for the video stream for video over IP. They use the VXML tag for video. It works, but there has been some discussion on the voice browser working group mailing list about adding tags for other media, such as video. He also talked about video streaming with Skype and Sony IVE (Instant Video Everywhere).

2/20/2006: 8:15 pm: Hurricane Katrina, Hurricane Rita

I’ve uploaded many photos to my website gallery and to this blog, but Hurricane Katrina has made me realize just how important it is to use my website to back up the photos that are stored on only one of my computers and the photo prints I have yet to scan. Although Dreamhost’s servers are in earthquake country in Los Angeles, at least they are far enough away from the San Francisco Bay Area that if something happens in Oakland, my photos will be safe in L.A. Obviously, I also have backups of what is on the website, so I am protected if something happens down south.

When you ask most people what they would save if they had to leave their home in an emergency, photographs tend to be pretty high on the list. Though it’s always nice to have original prints, digital copies on a web server make for a pretty convenient backup. The important thing, though, is not to wait. Even though you have more advance warning with a hurricane, you’re almost certainly going to be too busy packing up and evacuating to spend a couple of hours preparing and uploading photos.

2/12/2006: 11:24 am: Entertainment

If you live in the San Francisco Bay Area and have an HD over-the-air receiver, you may have run into problems in the last week if you tried to watch the Olympics in HD on NBC 11. Instead of getting a nice, crisp HD image of people moving quickly across the white stuff, you would have seen an unusually poor SD image of NBC weather on 11.1 and nothing on 11.2. Even worse, there was no crawl text along the bottom of the screen to explain the problem.

My wife did some searching on the Web and finally managed to track down the phone number of the newsdesk at the station. They explained that they had installed some new equipment just a week before and that this resulted in many HD receivers not being able to immediately lock onto the signal from the new equipment. If you dig around on their website, you can find the explanation there.

The fix is to force your HD OTA receiver to rememorize/reacquire KNTV’s digital channels at 11.1 and 11.2. If your receiver is like ours, it’s easiest to have it just reacquire all the channels.

2/4/2006: 1:53 pm: Speech, VoiceXML

Tuesday morning I attended sessions on core speech technology and dialog design.

Dr. Randy Ford from Sonum Technologies talked about using strong Natural Language Processing (NLP) to improve speech recognition. He claimed that by using N-gram substitution (e.g., replace the likely misrec “think you” with “thank you”), phonetic tumbling, or a hybrid of the two, you can reasonably achieve a 20% improvement on the base recognition. I wish I could provide more detail on phonetic tumbling, but he had to rush through the end of his presentation and I was too busy taking notes on what he had previously said.
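
The N-gram substitution idea is easy to sketch, at least in its most naive form. The following Python is my own toy illustration, not anything Sonum showed: the substitution table holds only the one example from the talk, and the substitute_ngrams name is made up. A real system would presumably build the table from tuning data and apply it with some notion of confidence.

```python
# Hypothetical table of likely misrecognitions and their corrections.
# Only the "think you" -> "thank you" pair comes from the talk.
SUBSTITUTIONS = {
    ("think", "you"): ("thank", "you"),
}


def substitute_ngrams(text, table=SUBSTITUTIONS):
    """Replace known misrecognized word sequences in a recognition result."""
    words = text.lower().split()
    # Try the longest n-grams first so longer matches win over shorter ones.
    sizes = sorted({len(key) for key in table}, reverse=True)
    out = []
    i = 0
    while i < len(words):
        for n in sizes:
            key = tuple(words[i:i + n])
            if key in table:
                out.extend(table[key])
                i += n
                break
        else:
            # No n-gram starting at this word matched; keep the word as is.
            out.append(words[i])
            i += 1
    return " ".join(out)


if __name__ == "__main__":
    print(substitute_ngrams("think you for calling"))
    print(substitute_ngrams("i think so"))
```

Note that the match is on the whole word sequence, so “I think so” is left alone while “think you for calling” gets corrected.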

Yoon Kim from Novauris discussed using phonetic techniques to improve recognition for large lists. By taking into account syllable structure and stress, they feel they can significantly improve recognition performance over that of conventional SR engines for items in a large corpus. With respect to lexical stress, they are analysing the stress placed on consonants coming immediately before or after vowels, as well as many aspects of how the vowels themselves are pronounced. Recognition accuracy is locally decreased for unstressed vowels, so they have found it helps a lot in this case to also look closely at the stress placed on nearby consonants. Language-specific syllable structure affects how important this differentiation can be. While the English language generally has more complicated syllable structure and stress distinctions than Korean, Korean can have much greater distinction between consonants preceding or following vowels.

Vlad Sejnoha from Nuance then gave a talk on current work at Nuance on speech technology. The speech was very similar to ones given at last Fall’s Nuance Conversations conference. One of these days, I’ll post my notes from that conference.

In the Dialog Design panel, Sondra Ahlén spoke about Spanish voice talents, including Spanish language TTS voices. She provided a lot of interesting trivia on Spanish speakers (Countries with most Spanish speakers in order are Mexico, Colombia, the US, and Spain; 12% of US residents speak Spanish and half of those don’t speak English; Colombian Spanish is considered the standard dialect for Latin America; Mexican Spanish is generally considered the standard dialect for the US). She also gave some examples of the differences between the dialects. She recommended that you always use a native speaker and that you match the dialect to the greatest common population in the expected audience. She then played some sample recordings from the most popular Spanish TTS voices, pointing out that while they are not as well tuned as English TTS voices, they are still quite usable.

My friend Bob Cooper from Avaya then spoke about an older product that was developed when he was at Conita, which was later acquired by Avaya. He spoke about dialog design considerations for power users who use an application multiple times per day. Use auditory feedback and lots of audio cues, optimize for the common path, and replace separator words with distinctive sounds.

A grad student working with Intervoice presented his work on automatic tuning of context free grammars. He used the SONIC large vocabulary (~75k words) SR engine from CU Boulder to transcribe previously recorded utterances. He then re-ranked n-best lists using phonetic, local, syntactic, and semantic weights. Or, at least, so say my hastily scribbled notes. He then employed Princeton’s WordNet to provide automatic categorization via synonyms. Lexical chains were also used to classify the transcription. The most common utterances were automatically added to the grammar, and common sub-sequences were favored over longer sequences. He claimed that for one test with an initially untuned application, his automatic grammar tuner performed within 2% of a manual tuning performed by someone at Intervoice.
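
As best I can reconstruct it from those notes, the re-ranking step amounts to combining several per-hypothesis scores into one and re-sorting the n-best list. Here is a minimal Python sketch under that assumption. The weighted sum, the equal weights, the score values, and the rerank name are all mine; I don't know how he actually combined the scores.

```python
# Hypothetical weights for the four kinds of score mentioned in my notes.
WEIGHTS = {"phonetic": 0.25, "local": 0.25, "syntactic": 0.25, "semantic": 0.25}

# Made-up n-best list, in the order the recognizer returned it.
NBEST = [
    {"text": "check my ballots",
     "scores": {"phonetic": 0.9, "local": 0.5, "syntactic": 0.6, "semantic": 0.2}},
    {"text": "check my balance",
     "scores": {"phonetic": 0.8, "local": 0.7, "syntactic": 0.9, "semantic": 0.9}},
]


def rerank(nbest, weights=WEIGHTS):
    """Return the hypotheses sorted by weighted combined score, best first."""
    def combined(hypothesis):
        return sum(weights[name] * hypothesis["scores"][name] for name in weights)
    return sorted(nbest, key=combined, reverse=True)


if __name__ == "__main__":
    for hypothesis in rerank(NBEST):
        print(hypothesis["text"])
```

With these invented numbers, the semantically sensible “check my balance” moves ahead of the slightly better phonetic match.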

2/2/2006: 11:18 pm: The Unusual and the Weird

New Scientist recently ran an article on an unusual infectious cancer that has killed one third of the wild population of Tasmanian Devils. I took this picture of a Tasmanian Devil at the Tasmanian Devil Park and Wildlife Rescue Center in Taranna in Tasmania, Australia. Several of the Devils there had terrible facial scars along their snouts. It looked as if chunks of flesh had been torn away. I was told that the injuries were caused by other Devils at feeding time.

Supposedly, the Devils have terrible eyesight and they would accidentally bite each other while ripping into the roadkill they were served for dinner. While the careless biting may have caused some of the damage, it now seems more likely that it just contributed to the spread of the cancer, and was not the direct cause of the wounds.

More photos from my trip to Tasmania, including a sleeping Devil.

2/1/2006: 2:01 am: Speech

The attendance at SpeechTEK West 2006 seems lighter than past years. One issue is that the technical sessions were in meeting rooms far away from the business sessions, so it was a little hard to tell just how many people were actually there in total. The business sessions were definitely more lightly attended than the technical sessions. I wanted to catch up on business issues the first day, so I focused on the industry workshops.

I started out with the Retail Industry Workshop Monday morning. The CTO of Voxify, Amit Desai, was one of the panelists. I’m obviously biased, but I think he did a great job. His presentation was very informative. One of the key points he covered was the ability of speech applications to help companies handle huge spikes in call volumes. Sometimes, the peaks occur for more extended periods of time for retailers, such as the few months before the end of the year holidays. This spike is becoming even more compressed, as people more often buy holiday gifts online and have become comfortable in having gifts shipped directly to the recipients. Even when the spike lasts a few months, it is very difficult for a retailer to plan, hire, and train enough staff to handle all the calls they will receive.

The spikes can be even more dramatic when a retailer offers a short duration promotion. For example, Voxify handled the calls for some commercials that ran during a couple recent major sporting events, with another big one still to come. Our speech applications received around 1,000 simultaneous calls each time the commercial ran. No callers had to wait in queue. Since most people called right after the commercial aired, the volume of calls had mostly fallen off after less than thirty minutes. If live agents had been used, an equal number of agents would have had to have been available in order to also not make a single caller wait in a queue. Even if you did force many of the callers to listen to hold music, hundreds of additional trained agents would have been needed for only about twenty minutes. This is clearly a situation where speech applications can bring a huge benefit.

Companies in the travel and hospitality business have similar spikes, but they also suffer from unpredictable, weather-related spikes. We saw huge increases in calls to our travel & hospitality client applications after Hurricane Katrina. We see similar spikes every winter when a big snowstorm strikes part of the US or Canada.

Back to SpeechTEK. Someone from Versay (looking at my notes, I realize that I was pretty bad about not writing down the speakers’ names) talked about VoIP and speech. One of their clients had IVRs in many branches so that local number access could be provided. They have moved that customer over to using a VoIP network. Many of the big VoIP providers, like Level3, provide local number access for most of the US. I think the use of VoIP networks for hosted speech applications is going to be a big trend over the next few years.

Currently, they are using G.711 as the audio codec. Although this doesn’t save you any bandwidth (in fact, it eats up quite a bit: 64 kbps for the RTP payload plus roughly 30 kbps of additional overhead for the RTP, UDP, IP, and Ethernet headers), he said they felt the bandwidth costs weren’t that bad. Although VoIP brings the promise of lower bitrate codecs, speech recognition engines need all the signal they can get in order to accurately recognize speech. Many of the lower bitrate codecs take advantage of limitations in human audio perception. Speech recognition algorithms don’t have those same limitations, so those codecs throw away data that a recognizer could potentially use. He did say they were evaluating some of the lower bitrate codecs, though, for potential future use.
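
For anyone wondering where the roughly 30 kbps of overhead comes from, here is the arithmetic as a small Python sketch. The speaker didn’t give a breakdown, so the assumptions are mine: 20 ms of audio per packet (a common default, which means 50 packets per second), IPv4, no header compression, and the Ethernet preamble and inter-frame gap counted as part of the per-packet cost.

```python
# Per-packet header sizes in bytes (assumes IPv4, no header compression).
RTP_HEADER = 12
UDP_HEADER = 8
IPV4_HEADER = 20
ETHERNET_FRAMING = 14 + 4      # Ethernet header + frame check sequence
ETHERNET_WIRE_EXTRA = 8 + 12   # preamble/start delimiter + inter-frame gap


def overhead_kbps(packet_ms=20, count_wire_extra=True):
    """Header overhead in kbps for one direction of a call."""
    header_bytes = RTP_HEADER + UDP_HEADER + IPV4_HEADER + ETHERNET_FRAMING
    if count_wire_extra:
        header_bytes += ETHERNET_WIRE_EXTRA
    packets_per_second = 1000 / packet_ms
    return header_bytes * 8 * packets_per_second / 1000


def total_kbps(payload_kbps=64, packet_ms=20, count_wire_extra=True):
    """Codec payload rate plus header overhead, in kbps, one direction."""
    return payload_kbps + overhead_kbps(packet_ms, count_wire_extra)


if __name__ == "__main__":
    print("overhead on the wire:", round(overhead_kbps(), 1), "kbps")
    print("overhead, headers only:", round(overhead_kbps(count_wire_extra=False), 1), "kbps")
    print("G.711 total:", round(total_kbps(), 1), "kbps")
```

Under those assumptions the overhead works out to about 31 kbps per direction (about 23 kbps if you ignore the preamble and inter-frame gap), for a total of around 95 kbps. Since the overhead is per packet, it stays the same when you switch to a lower bitrate codec at the same packet interval, which is another reason the savings are smaller than the codec bitrates suggest.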

I then attended the Financial Services workshop. Someone from Loquendo, a spin-off from Telecom Italia, started off with a demo of their TTS engine. It was extremely impressive. I’ve listened to the output from quite a few TTS engines, but this one was by far the best. The base engine is quite good, but the ability to fine tune the prosody via SSML tags is amazing.

She then listed quite a few of their customers and many of the applications they had built for these customers, though I would have preferred more detail about just a few of the apps, rather than just a long list. Many of the apps were very simple, but some of them sounded quite complicated. One very interesting app is a Java app running on mobile phones for ebankinter.com that generates about 2% of the trading volume on the Madrid Stock Exchange. This multi-modal app (you can speak to it and also see and interact with related text on the mobile phone screen) is pre-packaged with mobile phones, primarily Blackberries. In the near term, I think this is the only viable way to do a large scale deployment of a multi-modal app. You just face too many issues with getting the app to work reliably with all the different devices that customers are going to want to use.

Someone from Adeptra gave a really interesting presentation on auto-resolution, i.e., automatically verifying instances of fraud by calling a card holder after a purchase. Other vendors provide tools for rating transactions for likelihood of fraud. Credit card issuers use these ratings to determine when they should call a card holder to ensure that they actually initiated the transaction. The card issuers can save a lot of money by catching fraud early.

The problem is that the systems produce a lot of false positives. While they want to err on the side of safety, they don’t want to annoy their customers. Also, paying people to make these routine calls costs a lot of money. Adeptra offers speech applications that automate placing the outbound call, detecting whether a person answered the phone (can be as simple as asking them to press any key on their phone), verifying their identity using the same questions a live agent would use, and then asking them whether they initiated the transaction in question.

They also offer apps for collections. He said that about 85% of the targets are people that just need to be reminded to make the payment. These people usually prefer a call from a computer rather than a live person, because it is less embarrassing. Another 10% are in some financial trouble, but will typically pay companies in the order they contact them. By automating the calls, Adeptra’s clients can get in line first. The final 5% are the deadbeats that won’t pay a human or a computer.

The Financial Services workshop also included a presentation from someone from TellMe. While his presentation was not particularly specific to Financial Services, it was a useful, general discussion on UI design for speech apps. They have developed a quantitative approach to rating speech apps. As part of developing this system, they had to internally come to some agreement about the importance of all the major elements that make up callers’ experiences. I think there is a lot of value in having that discussion amongst an app development team before developing an application. They feel the most important parts are the interaction quality and the production quality, but they also rate things like accessibility and seamless agent interface. He played a lot of demos of really bad DTMF and speech apps, and a couple decent ones.

Finally, I caught the end of the Healthcare workshop. Healthcare calls can be difficult to automate due to privacy issues, but there are still a lot of opportunities. Medicare related apps are particularly difficult to develop due to all the privacy and general regulatory issues. Even then, there are plenty of opportunities to provide these applications in a hosted environment, as well as on premise.

