WombatNation: Speech Archives

May 15, 2004

Avaya to Voxify

After nearly eleven years at Avaya/Lucent/Mosaix/ViewStar (strung together by two acquisitions and a spin-off), I've decided to move on to Voxify, a small startup in Alameda. My last day at Avaya was May 7th. The killer blow for me was that this month Avaya is moving the R&D team from the Dublin office to the Milpitas office. I could already barely tolerate the 25-mile commute to Dublin. I was seriously dreading the 40+ mile commute to Milpitas through the extremely nasty traffic on 238 and 880. I dread no more.

Fortunately for me, near the end of March I ran into a fellow physics and philosophy major at Rice who recently started working at Voxify. Once I learned that Voxify builds sophisticated speech recognition applications for automating customer service calls, I was intrigued. After learning that the development work is in Java and VoiceXML for deployment on Linux servers, I needed to get an interview. After learning how smart and cool the people there were, I was sold.

Starting the first week of June, I'll be the lead architect for applications. While VoiceXML development started out as a hobby for me a couple years ago, I was able to turn my contact center automation development and architectural work at Avaya into an architectural role on a new VoiceXML-based platform. I'm very fortunate that that experience helped me get an opportunity to focus completely on building speech recognition apps with a co-located development team.

While I greatly enjoyed my time at Avaya, working at a 15,000 person company with over 2,000 people in R&D spread out over the US and a few other countries can be a little distracting at times. It makes sense to try to leverage the work of all those developers, but I've learned just how hard distributed development can be. The December/January 2003-2004 issue of ACM Queue had a couple of excellent articles on distributed development, though, of course, no silver bullet.

I'm not sure how you solve or avoid the problems of distributed development at any software company that large. Building a very large company organically at a single site takes a very long time. Acquisitions are almost always required to build large companies, but it often happens that the companies you want to acquire are nowhere near your current office or offices. Obviously, you could try to force everyone to move to a single site, or at least to a very small set of sites, but you will inevitably lose a significant number of key employees who don't want to relocate.

Posted by Robert at 12:30 AM | link | comments (0) | trackback (0)

April 24, 2004

Audio Credit Card

New Scientist - Credit card only works when spoken to

Beepcard has announced a new credit card they have developed that supports audio-based authentication for credit card transactions, via technology embedded within the card itself. This is a very cool idea, assuming they can get past a couple technology and personal adoption issues.

Beepcard had previously developed a credit card that could be used to verify that a remote customer had physical possession of the credit card being used for an online transaction. The customer would hold the special credit card up to a microphone hooked up to the computer being used to facilitate the transaction. The customer would press a button and the card would emit a pseudo-random sound. The actual sound is determined by an algorithm run simultaneously on a chip on the card and on a server. The sound is recorded by an applet that can be installed by the customer or downloaded from a website. Beepcard's software running on a remote server would then verify whether the correct sound was emitted. Since the sound is cryptographically (3DES) unpredictable, you don't have to worry about a replay attack.
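To make the card-and-server idea concrete, here is a minimal sketch of a counter-based one-time code. It is not Beepcard's algorithm, which the article doesn't describe beyond mentioning 3DES. I'm assuming a shared secret and a button-press counter, and I'm substituting HMAC-SHA256 from Python's standard library for 3DES. The names `one_time_code`, `Server`, and `window` are all made up for illustration.

```python
import hashlib
import hmac


def one_time_code(shared_secret: bytes, press_count: int, length: int = 8) -> bytes:
    """Derive the code for a given button press (hypothetical scheme)."""
    counter = press_count.to_bytes(8, "big")
    digest = hmac.new(shared_secret, counter, hashlib.sha256).digest()
    return digest[:length]


class Server:
    """Hypothetical verifier that remembers the last counter it accepted."""

    def __init__(self, shared_secret: bytes) -> None:
        self.shared_secret = shared_secret
        self.last_accepted = 0

    def verify(self, code: bytes, window: int = 5) -> bool:
        # Look ahead a few presses in case the button was pressed
        # without the server ever hearing the result.
        start = self.last_accepted + 1
        for count in range(start, start + window):
            expected = one_time_code(self.shared_secret, count)
            if hmac.compare_digest(code, expected):
                # Only counters after this one will be checked from now on,
                # so a recording of this code cannot be replayed.
                self.last_accepted = count
                return True
        return False
```

The server only ever checks counters it has not accepted yet, which is what defeats a replayed recording in this sketch.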

Although the article doesn't mention it (but Beepcard's website hints at this), I don't see why a company couldn't ask the customer to hold the card up to a telephone's microphone and press the button, record the sound on the call center's equipment, and then verify the recording against the server's calculation. That would provide additional security even for orders through a human or automated call center agent. Of course, calls over cellphones or poor connections might have problems. Sampling rates for telephone calls are typically around 8 kHz with 8-bit samples, so a second or two of audio should be able to provide you with plenty of information bits for a secure audio code. Heck, the RSA SecurID token I used to have at work used only a six digit number as the ID code.
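The back-of-the-envelope numbers work out roughly as follows. This is just arithmetic on the figures above; the function names are mine, and the raw bit count is an upper bound on what the channel carries, not the usable payload after noise and compression.

```python
import math

SAMPLE_RATE_HZ = 8_000   # telephone sampling rate cited above
BITS_PER_SAMPLE = 8      # sample size cited above


def raw_audio_bits(seconds: float) -> int:
    """Raw bits captured in `seconds` of telephone audio."""
    return int(SAMPLE_RATE_HZ * BITS_PER_SAMPLE * seconds)


def decimal_code_bits(digits: int) -> float:
    """Information content of a code made of `digits` decimal digits."""
    return digits * math.log2(10)


print(raw_audio_bits(1))                # 64000
print(round(decimal_code_bits(6), 1))   # 19.9
```

So even if only a tiny fraction of the raw bits survive a noisy phone line, one second of audio has far more room than the roughly 20 bits in a six digit code.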

Their new credit card contains a microphone. You speak your password and the card authenticates you. Assuming they used digit-only passwords, the voice recognition software needs to distinguish between only ten digits, albeit in a speaker-independent manner. Of course, this is still quite an accomplishment for software running on a very small, extremely low power CPU.

Some day, this will be extended to speaker authentication with non-secret phrases. You will speak a large set of phrases and a model will be constructed for your speech patterns. You will then be prompted to repeat a varying, non-secret phrase, such as count from 1 to 6, or say the alphabet from f to j. The randomness will make it harder for a thief to use a recording and the non-secret nature of the phrase will allow you to use it in public settings.
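Generating the varying prompt is the easy part. Here is a small sketch that produces challenges shaped like the two examples above. The function names and the span lengths are my own assumptions, and the hard part, matching the response against a model of the speaker's voice, is not shown.

```python
import random
import string


def counting_challenge(rng: random.Random, span: int = 6) -> str:
    """Prompt such as 'count from 1 to 6', with a random starting number."""
    start = rng.randint(1, 9)
    return f"count from {start} to {start + span - 1}"


def alphabet_challenge(rng: random.Random, span: int = 5) -> str:
    """Prompt such as 'say the alphabet from f to j', with a random start."""
    letters = string.ascii_lowercase
    start = rng.randrange(len(letters) - span + 1)
    return f"say the alphabet from {letters[start]} to {letters[start + span - 1]}"


def random_challenge(rng: random.Random) -> str:
    """Pick one of the two prompt styles at random."""
    return rng.choice([counting_challenge, alphabet_challenge])(rng)


# SystemRandom draws from the operating system's entropy source, which is
# the better choice when the prompt must be hard for a thief to predict.
print(random_challenge(random.SystemRandom()))
```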

Of course, the challenges include:

Battery life - they are targeting 10 transactions a day for two years
Thicker, more fragile card - the card is three times as thick as a normal card, and obviously more fragile
Customer security concerns - even though the card should make transactions more secure, people often fear new technology, especially if it is difficult to explain to them exactly how it works
Spoken passwords - since you have to speak your password, it is suitable for use only where you don't think anyone else can hear you
Hoarse voices - if the customer can't speak normally, they can't use the card unless they tell someone else their password. This will be an even bigger problem for speaker authentication.

Posted by Robert at 06:04 PM | link | comments (1) | trackback (0)

April 07, 2004

Speech Recognition

Automatic speech recognition (ASR) technology addresses a very difficult problem, but researchers have made a significant amount of progress over the last ten years or so. There are still many unsolved problems, but the quality of modern ASR engines has made speech a viable user interface for many different applications.

Speech recognition technologies are commonly used to recognize a speaker's response to a prompt or to transcribe what a speaker has said. Other uses are telematics, which usually means a speech interface to systems in an automobile, or command and control of other devices, like desktop computers. The two most common approaches used to recognize a speaker's response are often called grammar constrained recognition and natural language recognition. When ASR is used to transcribe speech, it is commonly called dictation. The telematics and other command and control systems that I am aware of (outside of science fiction movies) are grammar constrained.

Grammar Constrained Recognition: Works well when the speaker is providing very short responses to specific questions. To create a grammar, you specify the most likely words and phrases a person will say in response to a prompt and you map those words and phrases to a token, or a semantic concept. For example, you might create a "yes-no" grammar that maps yes, yeah, uh-huh, sure, and okay to the token "yes" and no, nope, nuh-uh, and no way dude to the token "no". If the speaker says something that doesn't match an entry in your grammar, recognition will fail.
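The "yes-no" grammar above can be sketched as a simple lookup table. This is only an illustration of the phrase-to-token mapping, operating on text that has already been recognized. A real ASR engine uses the grammar to constrain what it listens for in the audio, and real grammars are written in the engine's own grammar format rather than as a Python dict. The name `interpret` is made up.

```python
from typing import Dict, Optional

# Phrases from the example above, each mapped to its semantic token.
YES_NO_GRAMMAR: Dict[str, str] = {
    "yes": "yes",
    "yeah": "yes",
    "uh-huh": "yes",
    "sure": "yes",
    "okay": "yes",
    "no": "no",
    "nope": "no",
    "nuh-uh": "no",
    "no way dude": "no",
}


def interpret(utterance: str, grammar: Dict[str, str] = YES_NO_GRAMMAR) -> Optional[str]:
    """Return the token for an utterance, or None if it is out of grammar."""
    normalized = " ".join(utterance.lower().split())
    return grammar.get(normalized)


print(interpret("Yeah"))         # yes
print(interpret("no way dude"))  # no
print(interpret("maybe later"))  # None
```

The `None` result is the "recognition will fail" case: anything the grammar author didn't anticipate simply has no match.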

Natural Language Recognition: Allows the speaker to provide more natural, sentence-length responses to specific questions. Natural language recognition uses statistical models. The general procedure is to create as large a corpus as possible of typical responses, with each response matched up to a token or concept. In most approaches, a technique called Wizard of Oz is used. A person (the wizard) listens in real time or via recordings to a very large number of speakers responding naturally to a prompt. The wizard then selects the concept that represents what the user meant. A software program then analyzes the corpus of spoken utterances and their corresponding semantics and creates a statistical model which can be used to map similar sentences to the appropriate concepts for future speakers.

For example, let's say you want to route phone calls for a customer helpdesk by asking the caller to briefly describe their problem. For the concept "forward my call to the billing department", you would want to recognize sentences like "I have a problem with my bill", "I was charged incorrectly", "How much do I owe this month", etc. While you could construct a grammar with all the likely keywords (bill, charge, charged, owed, etc.), if the caller speaks in sentences, you may pick up multiple matches. You might also miss sentences that fit the right pattern, but just miss the pre-ordained keywords. It is difficult to create large, rich grammars that consider the context in which the words are said. In addition, as a grammar gets very large, the chance of having similar sounding words in the grammar greatly increases.

The obvious advantage of natural language recognition over the grammar constrained approach is that you don't have to identify the exact words and phrases. A big disadvantage, though, is that for it to work well, the corpus must typically be very large. Creating a large corpus is time consuming and expensive.
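To give a feel for what "statistical model" means here, this is a toy sketch of the corpus-to-concept idea using a bag-of-words naive Bayes classifier with add-one smoothing. That is my choice of a simple technique for illustration, not a description of how any commercial engine works. The billing sentences come from the example above. The "support" sentences, the concept names, and the function names are made up, and a usable corpus would be thousands of times larger.

```python
import math
from collections import Counter, defaultdict

# Tiny hand-labeled corpus: (what the caller said, concept the wizard chose).
CORPUS = [
    ("i have a problem with my bill", "billing"),
    ("i was charged incorrectly", "billing"),
    ("how much do i owe this month", "billing"),
    ("my phone will not turn on", "support"),
    ("i can not connect to the internet", "support"),
    ("the screen on my phone is broken", "support"),
]


def train(corpus):
    """Count how often each word appears under each concept."""
    word_counts = defaultdict(Counter)
    concept_counts = Counter()
    for sentence, concept in corpus:
        concept_counts[concept] += 1
        word_counts[concept].update(sentence.split())
    vocabulary = {word for counts in word_counts.values() for word in counts}
    return word_counts, concept_counts, vocabulary


def classify(sentence, model):
    """Return the concept with the highest naive Bayes log score."""
    word_counts, concept_counts, vocabulary = model
    total = sum(concept_counts.values())
    best_concept, best_score = None, -math.inf
    for concept, count in concept_counts.items():
        score = math.log(count / total)
        denominator = sum(word_counts[concept].values()) + len(vocabulary)
        for word in sentence.lower().split():
            # Add-one smoothing keeps unseen words from zeroing out a concept.
            score += math.log((word_counts[concept][word] + 1) / denominator)
        if score > best_score:
            best_concept, best_score = concept, score
    return best_concept


model = train(CORPUS)
print(classify("I was charged twice this month", model))
```

Note that the test sentence never appears in the corpus and contains a word the model has never seen. The model still scores it by overall similarity, which is the advantage over listing exact phrases. With a corpus this small, though, it is easy to find sentences it gets wrong, which is the "must be very large" disadvantage in miniature.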

Dictation: Used to transcribe someone's speech, word for word. Unlike grammar constrained and natural language recognition, dictation does not require semantic understanding. The goal isn't to understand what the speaker meant by their speech, just to identify the exact words. However, contextual understanding of what is being said can greatly improve the transcription. Many words, at least in English, sound alike. If the dictation system doesn't understand the context for one of these words, it will not be able to confidently identify the correct spelling.

The challenge for developers of ASR engines is that the end customer judges them on one criterion - did it understand what I said? That leaves little room for differentiation. Of course, there are areas like multi-language support, tuning tools, integration API (the proposed standard MRCP or proprietary), etc., but recognition quality is most visible. Because of the complex algorithms and language models required to implement a high quality speech recognition engine, it is difficult both for new companies to enter this market and for existing vendors to maintain the necessary investment level to keep up and move ahead.

Currently, Nuance and ScanSoft dominate the speech recognition market. There are a lot of other small vendors, like Aculab, Loquendo, Lumenvox, etc., but they are essentially niche players. The speech recognition side of ScanSoft is actually composed of SpeechWorks and the products of several former niche players. IBM has also participated in the speech recognition engine market, but their ViaVoice product has gained traction primarily in the desktop command and control (grammar-constrained) and dictation markets.

This is all changing. The big software heavyweights, Microsoft (Speech Server) and IBM (references - main site, voice toolkit preview, eWeek article, older InternetNews article, new InternetNews article on VXML toolkits) are now making substantial investments in speech recognition. IBM claims to have put one hundred speech researchers on the problem of taking ASR beyond the level of human speech recognition by 2010. Bill Gates is also making very large investments in speech recognition research at Microsoft. At SpeechTEK, Gates predicted that by 2011 the quality of ASR will catch up to human speech recognition. IBM and Microsoft are still well behind Nuance and ScanSoft in technology and market share, but they are gaining on them fast.

Posted by Robert at 11:05 PM | link | comments (0) | trackback (0)

March 31, 2004

Kai-Fu Lee Keynote at SpeechTEK

Kai-Fu Lee is the VP for Speech Solutions at Microsoft. He spoke at SpeechTEK after Bill Gates last week, going into much more detail on Microsoft Speech Server. Microsoft is targeting medium (25-250 agent equivalents) and large (250+ agent equivalents) enterprises. This was a bit surprising to me, as Speech Server appears to be a typical Microsoft 1.0 release: lacking in features, average in performance, and somewhat less than stable (based on the demos, anyway). I expect they will actually have far more success on the low end, but I understand the need to put on a good show about the product being enterprise ready.

There isn't a lot that is innovative in their solution. It's good and it's cheap, and there's a lot to be said for that, but mostly it's a clone of what many other companies have been doing with VoiceXML for quite a few years. As Kai-Fu said, Microsoft is good at volume sales. I think they should be proud of what they have created, but they're still a few years behind most of their competitors. The race is on.

Kai-Fu said that customers have told them that speech systems are too expensive, too complex, too inflexible with respect to scaling and deployment, and not well integrated. Microsoft appears to have taken a good shot at the first. We'll have to wait to see how they do against the other objectives.

A product manager then gave a really basic demo of changing a hotel reservation. The first call failed to connect, but Speech Server managed to respond to the second call. This was followed by a demo of a simple multimodal app using Pocket Internet Explorer and speech recognition. The Pocket PC UI giving feedback on microphone signal strength was cool.

The biggest news by far was their pricing. They do pricing per simultaneous speech channel and per processor. They also provide a low end Standard Edition and a high end Enterprise Edition.

Standard Edition - 4-24 channels - $8,000 per processor
Enterprise Edition - 24-96 channels per node - $18,000 per processor

The packages include the development tools, a SALT browser, ASR, and TTS. Both editions include the ScanSoft Speechify TTS engine. That's good, because my experience with Microsoft's own TTS software had been pretty iffy. Their ASR software was mediocre, too, but I have heard from several sources that it is significantly improved. You can use the Enterprise edition with ScanSoft's OSR ASR engine, but you can use only Microsoft ASR with the Standard edition. Nuance fans need not apply for either version. Perhaps Nuance would not bend to the OEM pricing levels that Wal-Mart, I mean Microsoft, demanded.

Kai-Fu glossed over the fact that you still have to buy all the telephony hardware and software from Intel/Dialogic and Intervoice if you actually want to use Speech Server with live telephone calls. Also, VoIP is not supported, just plain old PSTN style calls. The Microsoft website links you to some partner sites where you can request a price quote for a starter system. Microsoft offers a full-featured 180-day trial version of Speech Server, but you still have to buy all the telephony equipment. Even the most basic set-up will cost you about $1,000 at a deep discount from their partners trying to grab market share.

Standard edition is an all-in-one box. Everything has to run on the same server, so you might see some performance problems if your applications are complex and have elaborate recognition requirements. Also, you get no failover capabilities.

Kai-Fu said that speech application development costs are way too high. They hope to unleash a significant portion of the alleged 7 million developers using Visual Studio onto building speech apps. I worry that this will be like early versions of VB and FrontPage all over again, with a sea of really bad speech apps to replace the bad desktop apps and bad websites. What makes it even worse is that voice user interfaces are even harder to design than graphical user interfaces. The Microsoft speech tools are not bad, but they have a very long way to go before your average developer is going to be able to write a speech app that you can tolerate using more than once.

The presentation was followed by a couple customer demos, none of which went smoothly. First up was the NYC Department of Education. They have a web portal that parents can use to get info (absences, grades, food menus, etc.) about their kids and their school. They wanted to speech enable it so as to offer access to those families without computers. However, my understanding from Kai-Fu's speech was that Speech Server supports English only. I suspect that the parents in many of the families without computers speak little to no English. The presenter called the number three times before he finally got ringback, but Speech Server never answered the call. After a couple minutes, an assistant finally got it to work. They did a pretty basic speech enabling of the web portal. Nothing exciting, but it did show that Speech Server actually worked. It wasn't clear whether the problems were operator error or Speech Server failing to answer calls.

The next demo was a semi-disaster. The executive director for some part of the State of Alabama Corrections had trouble seeing the keyboard and the phone. Like the previous presenter, he brought up their web portal. He made a big deal about claiming that he would use his own personal information, so as to not release any private information for a citizen of Alabama. He then proceeded to type in his OWN SOCIAL SECURITY NUMBER, in plain view of a couple hundred people whom he did not know. This brought up a web page with his birth date, height, weight, driver's license number, license tag on his car, description of his car, etc. The NYC guy at least had a fake family in his system to use for demos. I couldn't believe this guy hadn't done the same.

Then the fun began. To demonstrate how a police officer would use the speech app, he called into the system and read in a license tag. Although his voice sounded pretty clear to me, the ASR engine (they didn't say if it was MS or ScanSoft) misrecognized several characters. It then read back some private information about a vehicle owned by a tow truck company in Clayton, Alabama. So much for the protection of the private info of Alabama citizens. After another try using the NATO phonetic alphabet (Alpha, Bravo, Charlie, etc.) for the letters, he got it to work.

This was followed by a demo from two people from Grange Insurance in Seattle. Their demo actually worked on the first try.

Finally, an ISV, Solar Software, and an SI, Accenture, gave demos. Their demos went very well. Solar Software speech enabled Microsoft CRM. Accenture showed a multimodal app of questionable value, but at least it worked. Their argument for going with Speech Server was that it was inexpensive (Kai-Fu Lee prefers the term "better economics") and they could use the Visual Studio environment that they were already familiar with. Given the short timeframe to give these demos, it's a little tough to do something really fancy. So, I probably shouldn't be so hard on these guys.

Posted by Robert at 10:45 PM | link | comments (0) | trackback (0)

March 25, 2004

Bill Gates Keynote at SpeechTEK

Bill Gates was the main keynote speaker at SpeechTEK/VisualStudio Live/MS Mobile Devcon on Wednesday. This was the first time I've ever been in the same room with the richest man in the world. Just me, the rich guy, and a few thousand of my very best friends.

The first ten minutes of his speech were fairly content-free. Quick summary: "Hardware sure is getting faster, year after year." Things livened up when he switched to a video of a parody commercial. This is a Microsoft tradeshow tradition, and is definitely something I admire Gates for doing. The parodies are usually very funny, and often self-deprecating. This time it was a parody of a series of Microsoft Office commercials that celebrate the accomplishments of the IT worker in a style that reminds me of old NFL highlights videos. He apparently used the same video at the International Consumer Electronics Show in Las Vegas in January.

The clip featured Bill and co-workers (no Ballmer, but for all I know, everyone else was an actor) sitting at a conference room table with an array of PCs, cellphones, Pocket PCs, routers, etc. all laid out and hooked up in a big jumble of wires. As the camera panned across the table and the deep-voiced narrator talked about the hard working IT staff (I'm not doing the video justice here, it was actually quite funny) the wires ended up hooking into a toaster. Bill, at the other end of the table, pressed a couple keys on the keyboard and two pieces of toast shot out from the toaster. The camera then cut to Bill jumping up and down in slow motion with toast in hand and celebrating with his co-workers. Lots of poorly executed high-fives, in standard mocking geek style. If you've seen the original commercials, you can probably imagine this better than I am describing it.

Then it cut to Bill, with toast in hand, and his co-workers running down a corridor in the office, gleefully leaping into the air and shouting with huge, stupid grins on their faces. Finally, they all dance around Bill and goofily celebrate as he spins a piece of toast on the ground like a football player celebrating in the endzone after scoring a touchdown. The final text and narration glorifies their proud accomplishment of having used Microsoft Office to program a toaster. You really had to be there.

Gates then talked about four key areas of focus for Microsoft; at least the areas they wanted to push to this audience.

Mobility
Speech
Web services
Location based services

He brought on a staff member to demonstrate new features in Visual Studio, a.k.a. Whidbey. The demonstrator showed off some Visual Basic coding. Overall, Visual Studio 2005 seemed pretty slow, but the compiler was unbelievably slow. The presenter looked like he was just about to give up on it before it finally finished. One new feature they are pushing hard is code snippets. Other IDEs have had this for many years, but it's an innovation for Microsoft. Code snippets could be a good thing, or a very bad thing. The code snippets feature allows you to bring up a context menu and select from a list of a few hundred code snippets Microsoft will provide, plus any code snippets you decide to add. Think of a code snippet as boilerplate code, or a template. This could definitely save you a lot of time. But, it can also create a copy and paste disaster. Rather than using common subroutines, you could end up (especially on a multi-person development team) with many slightly different versions of essentially the same code.

Although he made a disclaimer that, in Julia Child style, he was working with a previously prepared UI for his sample app, the presenter claimed that in just three lines of code (he used a code snippet to paste in a bunch more code) he finished up a web-based app for working with auto insurance claims.

Another guy came out to talk about Visual Studio for devices. His demo consisted of creating a photo blogging tool on the fly. Admittedly, he did have only ten minutes or so, but the app he created really didn't do that much and most of the code that did the real work was already prepared in advance. He then published the app to a Microsoft mobile dev portal and then downloaded it to his camera phone. He then wanted to use a Pocket PC to show that the photo had appeared on the blog.

The camera switched over to show his Pocket PC, which was displaying a note reminding him about his presentation. In probably the biggest demo disaster of the day, he couldn't dismiss the reminder. His Pocket PC had just locked up. After a short bit of desperate mashing of the buttons and poking the screen with a stylus, he bailed out and switched to a regular PC. While the earlier presenter was able to gloss over the slowness of Visual Studio 2005 by saying "if it was ready, we would have already released it," I'm assuming this guy was using released software.

The only interesting part of his demo was the location services. The blogging app was able to get his location (presumably to the accuracy that cellphone towers will allow) and automatically supply it.

Another Microsoft product manager type then gave a demo of their speech development tools. To no surprise, they are nicely integrated into Visual Studio. He showed off a data table navigator that automatically creates a grammar based on bound data. The grammar editor was very nice, but the prompt editor was quite weak. The tool also provides a built-in simulation environment so you can do basic functional testing of your app.

The surprising aspect of the speech demos was that all they showed was that Microsoft can now do what lots of other companies have been doing for five to ten years.

Although they tried to pass off the pseudo-standard SALT specification as superior to VoiceXML because SALT has multimodal capabilities designed in, they did not demo any multimodal capabilities. In the video they showed to demonstrate how a hotel might some day use Speech Server, the multimodal examples were pretty gratuitous.

Gates finished up by saying their goal was to provide seamless speech UI across devices. Their vision includes support for pervasive multimodal interaction and speech dictation.

Posted by Robert at 10:27 PM | link | comments (0) | trackback (0)

March 21, 2004

SpeechTEK

I'll be at SpeechTEK Spring across the bay in San Francisco Tuesday through Friday. Leave a comment or email me at robert AT wombatnation DOT com if you'll be there, too, and would like to meet up. I won't be manning the Avaya booth, but I'll definitely stop by there while the Expo is open.

I plan to blog parts of the conference, though I don't know whether I'll try to do it live or via tape delay.

Posted by Robert at 05:09 PM | link | comments (0) | trackback (0)