Archive for September, 2004

9/27/2004: 11:31 pm: RobertSpeech

IBM is working with researchers at the University of Twente in the Netherlands to develop a system that listens in on calls received at a call center, uses speech recognition to identify topics that a caller is asking about, looks up related information, and displays it to a call center agent. Sort of like the annoying kid who sat next to you in class and was always trying to impress you by pointing out how much he knew.

The system actually does sound interesting, but I’m pretty skeptical. First, the speech recognition system has to recognize what a caller is asking about. The prototype currently uses keyword recognition. That can work well, assuming you choose your grammar very wisely and you actually do quite a bit better than simple keyword recognition. Without putting the keywords into some context, the application is likely to behave like a loyal, but very stupid dog, running off and fetching lots of information that is tangential to what the caller is really interested in. You then run the risk of the system distracting the call center agent and leading him or her to pay more attention to the info the system fetched than to what the customer is saying.
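To make the "stupid dog" worry concrete, here is a minimal sketch of bare keyword spotting. The topic table and function name are invented for illustration and have nothing to do with the actual IBM/Twente prototype; the point is only that a keyword fires whether or not the caller is really asking about that topic.

```python
import re

# Hypothetical keyword-to-topic table; not taken from the IBM/Twente prototype.
TOPIC_KEYWORDS = {
    "mortgage": "Mortgage products",
    "penalty": "Late payment penalties",
    "insurance": "Home insurance",
}

def spot_topics(utterance):
    """Return the topics whose keyword appears anywhere in the utterance."""
    words = set(re.findall(r"[a-z']+", utterance.lower()))
    return sorted(topic for keyword, topic in TOPIC_KEYWORDS.items() if keyword in words)
```

Feed it "I paid off my mortgage years ago, I just need a new bank card" and it still fetches "Mortgage products", which is exactly the kind of tangential hit that would distract an agent.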

The problem is that the system must not only understand the caller’s utterances and dig up useful, related info, but it must also present that info to the agent in a way that the agent can then effectively present back to the caller. You now have a lot of steps at which misinterpretation can occur, sort of like the old telephone game.

Another advertised feature of the system is that it can warn the agent if it does not hear a warning that the agent is supposed to give the caller, such as about penalties for not making mortgage payments.

If the system does not “hear” the keywords of that warning the operator will receive a sharp on-screen reminder before the call ends.

And what kind of a “sharp” reminder would this be, a swift poke in the ribs, a high voltage shock on the nose? Ah, I forget this is an English magazine and my use of certain adjectives doesn’t always jibe with the Queen’s use.
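For what it's worth, the missing-warning check itself could be sketched in a few lines, assuming the system already has a transcript of the agent's side of the call. The keyword set and function names below are my own placeholders, not anything from the article.

```python
import re

# Hypothetical keywords for one required warning; a real deployment would define its own.
REQUIRED_WARNING_KEYWORDS = {"penalty", "mortgage", "payment"}

def missing_warning_keywords(agent_transcript, required=REQUIRED_WARNING_KEYWORDS):
    """Return the required keywords that never appear in the agent's side of the call."""
    heard = set(re.findall(r"[a-z']+", agent_transcript.lower()))
    return sorted(required - heard)

def needs_reminder(agent_transcript):
    """True if at least one required keyword was never heard."""
    return bool(missing_warning_keywords(agent_transcript))
```

Note that this only checks that the words were said, not that the warning made sense, so it inherits all the weaknesses of keyword recognition discussed above.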

9/26/2004: 4:46 pm: RobertThe Unusual and the Weird

How do I love thee, peat? Let me count the ways.

  1. You amend the soils of my plants by helping the soil hold water, while still allowing sufficient oxygen transfer.
  2. You halt the germination of malted barley at the perfect moment, and give whisky a lovingly smoky flavor.
  3. And finally, you help make for a good toilet?

My brother sends word of an article in the October 2004 issue of Cruising World on composting toilets that use peat moss. To keep the odor down, you really need to have an electric fan for the vent pipe. However, it’s quite possible you could get by with a solar-powered fan, since you wouldn’t need to run it all the time. A manual crank is used to mix waste with the peat moss.

The article briefly mentions Incinolet, stating that:

“The Incinolet (www.incinolet.com) is an electric incinerating toilet, very popular with commercial fisherman and other commercial users who have ample electric power.”

9/23/2004: 10:40 pm: RobertFood and Drink

Bavarian Beer Foamer

Oktoberfest Shop : Bavarian Beer Foamer

This is an interesting repurposing of a common European kitchen tool. I use a nearly identical tool (except for the cool Bayerische flag decal or sticker) to make a pseudo café latte every morning. My morning ritual is to heat about 1/3 of a coffee mug of milk in the microwave, foam it using an Aerolatte frother, and then add a double espresso using my Capresso 2000.

These frothers are typically about six inches long, battery powered, and have a ring at the bottom that is surrounded by a coil. The long metal wire connecting the ring and coil to the base rotates at a very fast speed. This apparatus works amazingly well to froth most liquids that have the chemical make-up to support foamy bubbles.

The Aerolatte works quite well for frothing milk. I’m not so sure about using it to put a new head of foam on a beer. That just strikes me as wrong. Especially if used with a 1-liter Maßkrug.

9/16/2004: 11:15 pm: RobertMusic

Johnny Ramone

From SFGate.com – Johnny Ramone, member of punk legends ‘The Ramones,’ dies at 55

We lost another one of the Ramones this week. Guitarist Johnny Ramone died at the age of fifty-five after being diagnosed with prostate cancer five years ago. Tommy is the only founding member left, and believe it or not, he was the drummer.

Just like The Clash, The Ramones were a huge influence on me in high school. When I graduated from college and finally had enough money to buy a CD player (believe it or not, kids, CD players used to cost more than CDs), the first CD I bought was Ramones Mania, which was a 20-song, sort of greatest hits, compilation. I was also lucky enough to see them perform live in Houston around 1988.

It’s curious that the article mentioned “I Wanna Be Sedated” and “Blitzkrieg Bop”, since those are the two main ringtones that I use on my phone. Those two, and “Beat on the Brat”.

9/15/2004: 12:14 am: RobertBicycling, Treo 600

Sunday was the T-Mobile International bike race in San Francisco. Below are some thumbnail images of the photos I took with the camera on my Treo 600. The full-sized images are on another page.

[Thumbnail photos: Taylor Street Hill; peloton paceline; USPS rider leading the peloton; support cars; four further untitled race photos]

9/14/2004: 12:39 am: RobertSpeech, VoiceXML

When I first heard the announcement (full story from NY Times requires registration, excerpt from C|Net doesn’t), I was hoping that IBM was open sourcing their ASR and TTS engines. But it turned out to be two other parts of their voice portfolio.

IBM is donating source code for their Reusable Dialog Components to the Apache Software Foundation. The RDCs were developed as chunks of static VoiceXML code that perform common dialog functions, such as collecting address information or dates. At the spring SpeechTek conference, someone from IBM told me they were porting the RDCs to JSPs that generate VoiceXML. If nothing else, I hope the RDCs will provide good code samples to further popularize the development of VoiceXML applications.

The Call Flow Builder

IBM is also donating some or all of their Voice Toolkit to the Eclipse organization. The Voice Toolkit was reimplemented as plug-ins for Eclipse about two years ago. It’s a pretty nice application, although the last time I checked out the preview version on the IBM alphaWorks site, it had a lot of complicated dependencies. Also, it was supported only on Windows. The official 5.1 release is now available. Unfortunately, it still runs only on Windows.

The Voice Toolkit Call Flow Builder is a fairly simple GUI for creating the basic dialog of a call flow as a directed graph (i.e., boxes and arrows). Once you get the call flow mostly scoped out, you can generate markup from your diagrams. The native XML dialect it generates can be automatically translated into VoiceXML. I don’t know if this feature is in 5.1, but I think they were also planning to support generation of JSPs, HTML, or whatever other markup language you wanted to generate. All you need to provide is the appropriate XSLT script to do the transformation.
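To illustrate the idea of translating a call-flow description into VoiceXML, here is a rough sketch. Two loud caveats: the Toolkit does this with XSLT, whereas I'm using plain Python here to keep the example self-contained, and the <callflow>/<step> dialect below is something I made up, not the Toolkit's real native format. The generated VoiceXML is also simplified (no namespace declaration, no grammars).

```python
import xml.etree.ElementTree as ET

# Hypothetical call-flow dialect; the Voice Toolkit's real native format is not shown here.
CALL_FLOW = """
<callflow>
  <step id="greeting" prompt="Welcome to the demo." next="ask_city"/>
  <step id="ask_city" prompt="Which city?"/>
</callflow>
"""

def call_flow_to_vxml(call_flow_xml):
    """Translate each <step> into a VoiceXML <form> with a prompt and an optional goto."""
    flow = ET.fromstring(call_flow_xml)
    vxml = ET.Element("vxml", version="2.0")
    for step in flow.findall("step"):
        form = ET.SubElement(vxml, "form", id=step.get("id"))
        block = ET.SubElement(form, "block")
        prompt = ET.SubElement(block, "prompt")
        prompt.text = step.get("prompt")
        if step.get("next"):
            # The arrow in the diagram becomes a jump to the next form.
            ET.SubElement(block, "goto", next="#" + step.get("next"))
    return ET.tostring(vxml, encoding="unicode")
```

Calling call_flow_to_vxml(CALL_FLOW) yields one form per box, with each arrow turned into a goto. Swapping in a different output markup would just mean writing a different translation function, which is the role the XSLT script plays in the Toolkit.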

The grammar development tools in the Toolkit Preview were nothing to get excited about, but I’ve yet to see good grammar development tools from anyone, and yes, I’ve looked at a lot of tools. The pronunciation builder was pretty nice, though.

Once you generate the markup, you move into a more traditional programming environment where you edit markup and code. The RDCs mentioned above can be helpful to fill out the rest of your app, though I expect they will also make them available from the GUI.

The worst thing I could say about the Voice Toolkit was that when I tried it about six months ago, the documentation was pretty bad. There were huge chunks of missing information and way too many typos.

The NY Times article ends with a comical quote from a director of marketing at Microsoft who claims that IBM is following Microsoft. Hmm, I didn’t know that Microsoft had open sourced their speech development tools under an OSI-compliant license. I think that’s news to everyone, including the development team at Microsoft.

Microsoft is clearly the follower in speech platforms and applications development. They’re still pretty far behind, even though they are making good progress. They shouldn’t be ashamed to be a follower in this space. They picked a very good time for entry. It’s just hard to take them seriously when their representatives make laughable claims.

9/9/2004: 11:41 pm: RobertSpeech

A work colleague (thanks, Rob!) emailed me the text from an article in the NY Times about speech recognition applications (free registration required). It’s a mostly favorable discussion of commercial speech applications and speech recognition technologies, although, to no surprise, they ran across a couple of frustrated users.

One of the areas taking the biggest hit in the article is natural language speech recognition (NLSR), e.g., “How can I help you?” A few people out there (e.g., the guy from IBM interviewed in the article and one of the Microsoft guys who spoke at last spring’s SpeechTek) seem to be living in the fusion world where the big delivery is always just ten years away. But if you look at the pace at which NLSR has improved in the last couple of years, it’s really hard to believe that it’s going to be hugely better in ten years. Moore’s Law will definitely help out, but if it were just a CPU cycle problem, why don’t you see anybody using grids or supercomputers to deliver human-like NLSR? I believe it’s going to take several major, major scientific breakthroughs before NLSR is good enough to be widely used.

NLSR works great if you have only a small set of questions to ask but need to handle a wide range of answers spoken in a wide variety of ways. The problem is that building the statistical language models for the questions and answers is a lot of work and it gets very expensive very quickly. But there are obviously significant advantages to allowing people to respond in full sentences.

Directed dialog works great when you have a large set of questions for which the answers are more predictable. While any good quality speech application will beef up the grammar to handle the extra “uhs”, “ums”, “please”, and “thanks” of everyday speech, speakers are still restricted to a more limited set of utterances, typically short phrases rather than full sentences. Nonetheless, a well-designed directed dialog application can be highly usable, and yet still relatively inexpensive to build.
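As a toy illustration of what "beefing up the grammar" buys you, here is a sketch that works on already-recognized text rather than audio. The filler list, the menu, and the function name are all assumptions of mine; a real application would express this in the platform's grammar format, not in Python.

```python
import re

# Hypothetical filler list and menu choices, for illustration only.
FILLERS = {"uh", "um", "please", "thanks"}
MENU = {"checking", "savings", "credit card"}

def match_directed_response(utterance, choices=MENU):
    """Strip filler words, then accept the utterance only if it is one of the expected phrases."""
    words = re.findall(r"[a-z]+", utterance.lower())
    core = " ".join(w for w in words if w not in FILLERS)
    return core if core in choices else None
```

"Um, credit card, please" matches "credit card", but the full sentence "I'd like to ask about my credit card" returns None. That is the restriction on utterances in action: tolerant of small talk, intolerant of sentences.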

For some applications, a hybrid of the two can work well, with the initial question or two handled via NLSR, and the rest of the conversation handled as directed dialog. The downside, though, is that hybrid applications can be much more expensive to build. You have to license more products and you need developers experienced in more technologies. Also, callers can be misled by the open-ended nature of the initial question. They then get frustrated when full sentences aren’t understood as responses to the other prompts. As they say in project management, it’s all about setting expectations appropriately.

9/8/2004: 11:57 pm: RobertSpeech, VoiceXML

And so does the W3C. Speech Synthesis Markup Language (SSML) Version 1.0 is now a W3C Recommendation. SSML is used with both VoiceXML and SALT to specify how text should be synthesized into speech. Congratulations to the co-editors from Nuance, Intel, and ScanSoft who ushered it through the process.
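For anyone who hasn't seen SSML, here is a small sketch that builds a document with a pause and an emphasized word. The speak, break, and emphasis elements and the namespace come from the SSML 1.0 Recommendation; the function name and the sample text are just mine.

```python
import xml.etree.ElementTree as ET

def build_ssml(text_before, emphasized, text_after):
    """Build a tiny SSML document: plain text, a pause, an emphasized word, more text."""
    speak = ET.Element("speak", {
        "version": "1.0",
        "xmlns": "http://www.w3.org/2001/10/synthesis",
        "xml:lang": "en-US",
    })
    speak.text = text_before
    # A half-second pause before the emphasized word.
    ET.SubElement(speak, "break", {"time": "500ms"})
    emphasis = ET.SubElement(speak, "emphasis", {"level": "strong"})
    emphasis.text = emphasized
    emphasis.tail = text_after
    return ET.tostring(speak, encoding="unicode")
```

For example, build_ssml("Your payment is ", "overdue", ".") produces markup that a conforming synthesizer should read with a pause and extra stress on "overdue". How that actually sounds will vary by TTS engine.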

9/4/2004: 5:52 pm: RobertPrivacy and Security, Speech, VoiceXML

Although the ability to spoof caller ID has been around for quite a while, I wasn’t aware of any public services that offered that capability. On August 31, a company called *38 launched a service for spoofing caller ID. With stories quickly appearing on Slashdot and in the New York Times (registration required), *38 picked up a lot of publicity very quickly.

Perhaps too quickly for founder Jason Jepson, as an article in the Houston Chronicle revealed that he received “harassing e-mail and phone messages and even a death threat taped to his front door”. Since the *38 website suggests that the service would primarily be targeted at bill collection agencies, I presumed the threats would have been from people running from the repo man. But, he contends that they are coming from hackers who are upset that a tool available only in the underground was suddenly now available to anyone, somewhat like how magicians get mad when another magician reveals how a popular trick works.

I would have thought a more likely source of the threats would be the big phone companies, since caller ID is nearly pure profit for them. If people stop trusting caller ID, there goes a fantastic source of revenue for them.

As a sheer coincidence, early last week I built a caller ID spoofing application on our speech platform at work. It was a really simple app to write, and it works like a charm. The very next day I saw the article on *38 on Slashdot.

With their service, you first register your phone number with them and agree to pay $20/month plus 7-10 cents per minute, based on calling volume. Then, you go to their website and enter a number to call and the calling number you want to spoof. An automated service calls you back and dials the first number, while spoofing the caller ID with the second number.
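To get a feel for what that pricing adds up to, here is a back-of-the-envelope sketch. I don't know where the volume tiers fall, so it just reports the range between the 7 cent and 10 cent rates; the function name is mine, and amounts are kept in whole cents to avoid floating-point noise.

```python
def monthly_cost_range(minutes, base_fee_cents=2000, low_rate_cents=7, high_rate_cents=10):
    """Return (lowest, highest) monthly cost in dollars for the given minutes of calls."""
    low = base_fee_cents + minutes * low_rate_cents
    high = base_fee_cents + minutes * high_rate_cents
    return low / 100, high / 100
```

So 100 minutes of calls in a month would run somewhere between $27 and $30, and the $20 base fee applies even if you make no calls at all.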

I like my implementation better, since it doesn’t require Internet access. I call a toll-free number that connects me to my application hosted by a VoiceXML service provider. My app then asks me to enter (by speech or DTMF) the number to call. Then, it asks for the number to spoof. Seconds later, the phone at the first number is ringing, but the calling number that the person sees (assuming they have caller ID support) is the second number.

From the NY Times article:

“The developers of Star38, who say they required only 65 lines of computer code and $3,000 to create their service …”

Heh, the original version of mine was 51 lines of commented code and took me only about four hours of coding and testing time to complete. Even if I were charging 1999 dotcom era consultant wages, that would come in well under $3,000. If I had written it in static VoiceXML, it would have been about twenty-five lines of code (and that’s human readable code with no wacky obfuscations to shorten the length). I could easily rewrite it in fewer than twenty lines of clean, albeit uncommented, code on our platform, which dynamically generates VoiceXML.

9/3/2004: 12:17 am: RobertSpeech

Speech-enabled cars have come a long way since the annoying recorded message “The door is ajar” appeared in cars in the 80’s, before quickly disappearing from later models. Honda and IBM are now working on a car that not only talks to you, but also listens. IBM’s contribution is an embedded version of their ViaVoice software that recognizes 700 commands and around 1.7 million street and city names.

Tonight I was riding in a friend’s car and I was using the built-in GPS-based navigational system. The display was a large, bright LCD built into the front dash and the user interface was quite good. However, there’s no way I could have entered street addresses while driving. A speech recognition interface like what IBM and Honda are developing would have been much, much better.

While a car is a pretty noisy environment, I often use my cellphone with an earbud to call into my VoiceXML apps while driving. The speech recognition performance is usually very good. As long as you don’t have the windows down while driving really fast or through a very noisy area, IBM’s embedded ViaVoice system should be able to work pretty well.