After nearly eleven years at Avaya/Lucent/Mosaix/ViewStar (strung together by two acquisitions and a spin-off), I've decided to move on to Voxify, a small startup in Alameda. My last day at Avaya was May 7th. The killer blow for me was that this month Avaya is moving the R&D team from the Dublin office to the Milpitas office. I could already barely tolerate the 25-mile commute to Dublin. I was seriously dreading the 40+ mile commute to Milpitas through the extremely nasty traffic on 238 and 880. I dread no more.
Fortunately for me, near the end of March I ran into a fellow physics and philosophy major at Rice who recently started working at Voxify. Once I learned that Voxify builds sophisticated speech recognition applications for automating customer service calls, I was intrigued. After learning that the development work is in Java and VoiceXML for deployment on Linux servers, I needed to get an interview. After learning how smart and cool the people there were, I was sold.
Starting the first week of June, I'll be the lead architect for applications. While VoiceXML development started out as a hobby for me a couple years ago, I was able to turn my contact center automation development and architectural work at Avaya into an architectural role on a new VoiceXML-based platform. I'm very fortunate that that experience helped me get an opportunity to focus completely on building speech recognition apps with a co-located development team.
While I greatly enjoyed my time at Avaya, working at a 15,000 person company with over 2,000 people in R&D spread out over the US and a few other countries can be a little distracting at times. It makes sense to try to leverage the work of all those developers, but I've learned just how hard distributed development can be. The December/January 2003-2004 issue of ACM Queue had a couple of excellent articles on distributed development, though, of course, no silver bullet.
I'm not sure how you solve or avoid the problems of distributed development at any software company that large. Building a very large company organically at a single site takes a very long time. Acquisitions are almost always required to build large companies, but it often happens that the companies you want to acquire are nowhere near your current office or offices. Obviously, you could try to force everyone to move to a single site, or at least to a very small set of sites, but you will inevitably lose a significant number of key employees who don't want to relocate.
Automatic speech recognition (ASR) technology addresses a very difficult problem, but researchers have made significant progress over the last ten years or so. There are still many unsolved problems, but the quality of modern ASR engines has made speech a viable user interface for many different applications.
Speech recognition technologies are commonly used to recognize a speaker's response to a prompt or to transcribe what a speaker has said. Other uses are telematics, which usually means a speech interface to systems in an automobile, or command and control of other devices, like desktop computers. The two most common approaches used to recognize a speaker's response are often called grammar constrained recognition and natural language recognition. When ASR is used to transcribe speech, it is commonly called dictation. The telematics and other command and control systems that I am aware of (outside of science fiction movies) are grammar constrained.
Grammar Constrained Recognition: Works well when the speaker is providing very short responses to specific questions. To create a grammar, you specify the most likely words and phrases a person will say in response to a prompt and you map those words and phrases to a token, or a semantic concept. For example, you might create a "yes-no" grammar that maps yes, yeah, uh-huh, sure, and okay to the token "yes" and no, nope, nuh-uh, and no way dude to the token "no". If the speaker says something that doesn't match an entry in your grammar, recognition will fail.
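The yes-no grammar described above can be sketched in a few lines of Python. This is only a text-level illustration, assuming the engine has already turned audio into words; a real ASR engine matches audio against the grammar directly, and the names `YES_NO_GRAMMAR` and `recognize` are made up for this example.

```python
# Hypothetical yes/no grammar: each accepted phrase maps to a semantic token.
YES_NO_GRAMMAR = {
    "yes": "yes", "yeah": "yes", "uh-huh": "yes", "sure": "yes", "okay": "yes",
    "no": "no", "nope": "no", "nuh-uh": "no", "no way dude": "no",
}


def recognize(utterance, grammar=YES_NO_GRAMMAR):
    """Return the token for an utterance, or None when nothing in the grammar matches."""
    # Normalize case and collapse repeated whitespace before the lookup.
    normalized = " ".join(utterance.lower().split())
    return grammar.get(normalized)
```

Anything outside the grammar, such as "maybe later", returns None, which corresponds to the recognition failure described above.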
Natural Language Recognition: Allows the speaker to provide more natural, sentence-length responses to specific questions. Natural language recognition uses statistical models. The general procedure is to create as large a corpus as possible of typical responses, with each response matched up to a token or concept. In most approaches, a technique called Wizard of Oz is used. A person (the wizard) listens in real time or via recordings to a very large number of speakers responding naturally to a prompt. The wizard then selects the concept that represents what the user meant. A software program then analyzes the corpus of spoken utterances and their corresponding semantics and it creates a statistical model which can be used to map similar sentences to the appropriate concepts for future speakers.
For example, let's say you want to route phone calls for a customer helpdesk by asking the caller to briefly describe their problem. For the concept "forward my call to the billing department", you would want to recognize sentences like "I have a problem with my bill", "I was charged incorrectly", "How much do I owe this month", etc. While you could construct a grammar with all the likely keywords (bill, charge, charged, owed, etc.), if the caller speaks in sentences, you may pick up multiple matches. You might also miss sentences that fit the right pattern, but just miss the pre-ordained keywords. It is difficult to create large, rich grammars that consider the context in which the words are said. In addition, as a grammar gets very large, the chance of having similar sounding words in the grammar greatly increases.
The obvious advantage of natural language recognition over the grammar constrained approach is that you don't have to identify the exact words and phrases. A big disadvantage, though, is that for it to work well, the corpus must typically be very large. Creating a large corpus is time consuming and expensive.
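To make the statistical approach a bit more concrete, here is a minimal sketch of call routing with a naive Bayes text classifier. The technique is my choice for illustration and is not necessarily what any commercial engine uses. The tiny corpus, the concept names, and the function names are all hypothetical, and the sketch works on text that has already been transcribed.

```python
import math
from collections import Counter, defaultdict

# Hypothetical hand-labeled corpus: (utterance, concept) pairs a "wizard" might produce.
CORPUS = [
    ("i have a problem with my bill", "billing"),
    ("i was charged incorrectly", "billing"),
    ("how much do i owe this month", "billing"),
    ("my internet connection is down", "tech_support"),
    ("i cannot connect to the network", "tech_support"),
    ("the modem keeps dropping my connection", "tech_support"),
]


def train(corpus):
    """Count words per concept and utterances per concept."""
    word_counts = defaultdict(Counter)
    concept_counts = Counter()
    for utterance, concept in corpus:
        concept_counts[concept] += 1
        word_counts[concept].update(utterance.split())
    vocabulary = {w for counts in word_counts.values() for w in counts}
    return word_counts, concept_counts, vocabulary


def classify(utterance, model):
    """Pick the concept with the highest naive Bayes log score (add-one smoothing)."""
    word_counts, concept_counts, vocabulary = model
    total = sum(concept_counts.values())
    best_concept, best_score = None, float("-inf")
    for concept, count in concept_counts.items():
        score = math.log(count / total)  # prior for this concept
        denom = sum(word_counts[concept].values()) + len(vocabulary)
        for word in utterance.lower().split():
            score += math.log((word_counts[concept][word] + 1) / denom)
        if score > best_score:
            best_concept, best_score = concept, score
    return best_concept


MODEL = train(CORPUS)
```

With six training sentences this is a toy, which is exactly the disadvantage noted above: the model is only as good as the corpus behind it, and a useful corpus is far larger than this.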
Dictation: Used to transcribe someone's speech, word for word. Unlike grammar constrained and natural language recognition, dictation does not require semantic understanding. The goal isn't to understand what the speaker meant by their speech, just to identify the exact words. However, contextual understanding of what is being said can greatly improve the transcription. Many words, at least in English, sound alike. If the dictation system doesn't understand the context for one of these words, it will not be able to confidently identify the correct spelling.
The challenge for developers of ASR engines is that the end customer judges them on one criterion - did it understand what I said? That leaves little room for differentiation. Of course, there are areas like multi-language support, tuning tools, integration API (the proposed standard MRCP or proprietary), etc., but recognition quality is most visible. Because of the complex algorithms and language models required to implement a high quality speech recognition engine, it is difficult both for new companies to enter this market and for existing vendors to maintain the necessary investment level to keep up and move ahead.
Currently, Nuance and ScanSoft dominate the speech recognition market. There are a lot of other small vendors, like Aculab, Loquendo, Lumenvox, etc., but they are essentially niche players. The speech recognition side of ScanSoft is actually composed of SpeechWorks and the products of several former niche players. IBM has also participated in the speech recognition engine market, but their ViaVoice product has gained traction primarily in the desktop command and control (grammar-constrained) and dictation markets.
This is all changing. The big software heavyweights, Microsoft (Speech Server) and IBM (references - main site, voice toolkit preview, eWeek article, older InternetNews article, new InternetNews article on VXML toolkits) are now making substantial investments in speech recognition. IBM claims to have put one hundred speech researchers on the problem of taking ASR beyond the level of human speech recognition by 2010. Bill Gates is also making very large investments in speech recognition research at Microsoft. At SpeechTEK, Gates predicted that by 2011 the quality of ASR will catch up to human speech recognition. IBM and Microsoft are still well behind Nuance and ScanSoft in technology and market share, but they are gaining on them fast.
Kai-Fu Lee is the VP for Speech Solutions at Microsoft. He spoke at SpeechTEK after Bill Gates last week, going into much more detail on Microsoft Speech Server. Microsoft is targeting medium (25-250 agent equivalents), and large (250+ agent equivalents) enterprises. This was a bit surprising to me, as Speech Server appears to be a typical Microsoft 1.0 release, lacking in features, average performance, and somewhat less than stable (based on the demos, anyway). I expect they will actually have far more success on the low end, but I understand the need to put on a good show about the product being enterprise ready.
There isn't a lot that is innovative in their solution. It's good and it's cheap, and there's a lot to be said for that, but mostly it's a clone of what many other companies have been doing with VoiceXML for quite a few years. As Kai-Fu said, Microsoft is good at volume sales. I think they should be proud of what they have created, but they're still a few years behind most of their competitors. The race is on.
Kai-Fu said that customers have told them that speech systems are too expensive, too complex, too inflexible with respect to scaling and deployment, and not well integrated. Microsoft appears to have taken a good shot at the first. We'll have to wait to see how they do against the other objectives.
A product manager then gave a really basic demo of changing a hotel reservation. The first call failed to connect, but Speech Server managed to respond to the second call. This was followed by a demo of a simple multimodal app using Pocket Internet Explorer and speech recognition. The Pocket PC UI giving feedback on microphone signal strength was cool.
The biggest news by far was their pricing. They do pricing per simultaneous speech channel and per processor. They also provide a low end Standard Edition and a high end Enterprise Edition.
- Standard Edition - 4-24 channels - $8,000 per processor
- Enterprise Edition - 24-96 channels per node - $18,000 per processor
The packages include the development tools, a SALT browser, ASR, and TTS. Both editions include the ScanSoft Speechify TTS engine. That's good, because my experience with their TTS software had been pretty iffy. Their ASR software was mediocre, too, but I have heard from several sources that it is significantly improved. You can use the Enterprise edition with ScanSoft's OSR ASR engine, but you can use only Microsoft ASR with the Standard edition. Nuance fans need not apply for either version. Perhaps Nuance would not bend to the OEM pricing levels that Wal-Mart, I mean Microsoft, demanded.
Kai-Fu glossed over the fact that you still have to buy all the telephony hardware and software from Intel/Dialogic and Intervoice if you actually want to use Speech Server with live telephone calls. Also, VoIP is not supported, just plain old PSTN style calls. The Microsoft website links you to some partner sites where you can request a price quote for a starter system. Microsoft offers a full-featured 180-day trial version of Speech Server, but you still have to buy all the telephony equipment. Even the most basic set-up will cost you about $1000 at a deep discount from their partners trying to grab marketshare.
Standard edition is an all-in-one box. Everything has to run on the same server, so you might see some performance problems if your applications are complex and have elaborate recognition requirements. Also, you get no failover capabilities.
Kai-Fu said that speech application development costs are way too high. They hope to unleash a significant portion of the alleged 7 million developers using Visual Studio onto building speech apps. I worry that this will be like early versions of VB and FrontPage all over again, with a sea of really bad speech apps to replace the bad desktop apps and bad websites. What makes it even worse is that voice user interfaces are even harder to design than graphical user interfaces. The Microsoft speech tools are not bad, but they have a very long way to go before your average developer is going to be able to write a speech app that you can tolerate using more than once.
The presentation was followed by a couple customer demos, none of which went smoothly. First up was the NYC Department of Education. They have a web portal that parents can use to get info (absences, grades, food menus, etc.) about their kids and their school. They wanted to speech enable it so as to offer access to those families without computers. However, my understanding from Kai-Fu's speech was that Speech Server supports English only. I suspect that the parents in many of the families without computers speak little to no English. The presenter called the number three times before he finally got ringback, but Speech Server never answered the call. After a couple minutes, an assistant finally got it to work. They did a pretty basic speech enabling of the web portal. Nothing exciting, but it did show that Speech Server actually worked. It wasn't clear whether the problems were operator error or Speech Server failing to answer calls.
The next demo was a semi-disaster. The executive director for some part of the State of Alabama Corrections had trouble seeing the keyboard and the phone. Like the previous presenter, he brought up their web portal. He made a big deal about claiming that he would use his own personal information, so as to not release any private information for a citizen of Alabama. He then proceeded to type in his OWN SOCIAL SECURITY NUMBER, in plain view of a couple hundred people whom he did not know. This brought up a web page with his birth date, height, weight, driver's license number, license tag on his car, description of his car, etc. The NYC guy at least had a fake family in his system to use for demos. I couldn't believe this guy hadn't done the same.
Then the fun began. To demonstrate how a police officer would use the speech app, he called into the system and read in a license tag. Although his voice sounded pretty clear to me, the ASR engine (they didn't say if it was MS or ScanSoft) misrecognized several characters. It then read back some private information about a vehicle owned by a tow truck company in Clayton, Alabama. So much for the protection of the private info of Alabama citizens. After another try using the NATO phonetic alphabet (Alpha, Bravo, Charlie, etc.) for the letters, he got it to work.
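Spelling with code words helps because "Bravo" and "Delta" are much harder to confuse than "B" and "D". As a rough sketch of the post-processing side, assuming the recognizer has already returned the code words as text, something like the following would turn them back into a tag. The table and the `decode_tag` name are my own illustration, not anything shown in the demo.

```python
# Code words mapped to letters. "alpha" and "juliet" are common spellings of the
# official "alfa" and "juliett", so both forms are accepted here.
NATO = {
    "alfa": "A", "alpha": "A", "bravo": "B", "charlie": "C", "delta": "D",
    "echo": "E", "foxtrot": "F", "golf": "G", "hotel": "H", "india": "I",
    "juliett": "J", "juliet": "J", "kilo": "K", "lima": "L", "mike": "M",
    "november": "N", "oscar": "O", "papa": "P", "quebec": "Q", "romeo": "R",
    "sierra": "S", "tango": "T", "uniform": "U", "victor": "V", "whiskey": "W",
    "xray": "X", "x-ray": "X", "yankee": "Y", "zulu": "Z",
}
DIGITS = {
    "zero": "0", "one": "1", "two": "2", "three": "3", "four": "4",
    "five": "5", "six": "6", "seven": "7", "eight": "8", "nine": "9",
}


def decode_tag(spoken):
    """Turn recognized code words into a tag string; return None on any unknown word."""
    chars = []
    for word in spoken.lower().split():
        if word in NATO:
            chars.append(NATO[word])
        elif word in DIGITS:
            chars.append(DIGITS[word])
        else:
            return None  # reject the whole tag rather than guess a character
    return "".join(chars)
```

Rejecting the whole tag on an unknown word is a deliberate choice in this sketch: for a lookup that returns private information, a re-prompt seems safer than a best guess.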
This was followed by a demo from two people from Grange Insurance in Seattle. Their demo actually worked on the first try.
Finally an ISV, Solar Software, and an SI, Accenture, gave demos. Their demos went very well. Solar Software speech enabled Microsoft CRM. Accenture showed a multimodal app of questionable value, but at least it worked. Their argument for going with Speech Server was that it was inexpensive (Kai-Fu Lee prefers the term "better economics") and they could use the Visual Studio environment that they were already familiar with. Given the short timeframe to give these demos, it's a little tough to do something really fancy. So, I probably shouldn't be so hard on these guys.
No, I'm not asking you to vote for SoccerPhone for President in 2004. I'm just letting anyone who cares know that SoccerPhone is working this year without me having to make any changes to the code. Fortunately, the people running the MLS website didn't make any significant changes to the HTML code on the live scores page. In case you are wondering what any of this means:
SoccerPhone is a free, automated service that provides live Major League Soccer scores by phone.
I wrote this application because I wanted to have remote access to updated MLS scores, I wanted to learn how to create VoiceXML applications, and I wanted to learn how to code in Python.
Bill Gates was the main keynote speaker at SpeechTEK/VisualStudio Live/MS Mobile Devcon on Wednesday. This was the first time I've ever been in the same room with the richest man in the world. Just me, the rich guy, and a few thousand of my very best friends.
The first ten minutes of his speech were fairly content free. Quick summary: "Hardware sure is getting faster, year after year." Things livened up when he switched to a video of a parody commercial. This is a Microsoft tradeshow tradition, and is definitely something I admire Gates for doing. The parodies are usually very funny, and often self-deprecating. This time it was a parody of a series of Microsoft Office commercials that celebrate the accomplishments of the IT worker in a style that reminds me of old NFL highlights videos. He apparently used the same video at the International Consumer Electronics Show in Las Vegas in January.
The clip featured Bill and co-workers (no Ballmer, but for all I know, everyone else was an actor) sitting at a conference room table with an array of PCs, cellphones, Pocket PCs, routers, etc. all laid out and hooked up in a big jumble of wires. As the camera panned across the table and the deep-voiced narrator talked about the hard working IT staff (I'm not doing the video justice here, it was actually quite funny) the wires ended up hooking into a toaster. Bill, at the other end of the table, pressed a couple keys on the keyboard and two pieces of toast shot out from the toaster. The camera then cut to Bill jumping up and down in slow motion with toast in hand and celebrating with his co-workers. Lots of poorly executed high-fives, in standard mocking geek style. If you've seen the original commercials, you can probably imagine this better than I am describing it.
Then it cut to Bill, with toast in hand, and his co-workers running down a corridor in the office, gleefully leaping into the air and shouting with huge, stupid grins on their faces. Finally, they all dance around Bill and goofily celebrate as he spins a piece of toast on the ground like a football player celebrating in the endzone after scoring a touchdown. The final text and narration glorifies their proud accomplishment of having used Microsoft Office to program a toaster. You really had to be there.
Gates then talked about four key areas of focus for Microsoft; at least the areas they wanted to push to this audience.
- Mobility
- Speech
- Web services
- Location based services
He brought on a staff member to demonstrate new features in Visual Studio, a.k.a. Whidbey. The demonstrator showed off some Visual Basic coding. Overall, Visual Studio 2005 seemed pretty slow, but the compiler was unbelievably slow. The presenter looked like he was just about to give up on it before it finally finished. One new feature they are pushing hard is code snippets. Other IDEs have had this for many years, but it's an innovation for Microsoft. Code snippets could be a good thing, or a very bad thing. The code snippets feature allows you to bring up a context menu and select from a list of a few hundred code snippets Microsoft will provide, plus any code snippets you decide to add. Think of a code snippet as boilerplate code, or a template. This could definitely save you a lot of time. But, it can also create a copy and paste disaster. Rather than using common subroutines, you could end up (especially on a multi-person development team) with many slightly different versions of essentially the same code.
Although he made a disclaimer that, in Julia Child style, he was working with a previously prepared UI for his sample app, the presenter claimed that in just three lines of code (he used a code snippet to paste in a bunch more code) he finished up a web-based app for working with auto insurance claims.
Another guy came out to talk about Visual Studio for devices. His demo consisted of creating a photo blogging tool on the fly. Admittedly, he did have only ten minutes or so, but the app he created really didn't do that much and most of the code that did the real work was already prepared in advance. He then published the app to a Microsoft mobile dev portal and then downloaded it to his camera phone. He then wanted to use a Pocket PC to show that the photo had appeared on the blog.
The camera switched over to show his Pocket PC, which was displaying a note reminding him about his presentation. In probably the biggest demo disaster of the day, he couldn't dismiss the reminder. His Pocket PC had just locked up. After a short bit of desperate mashing of the buttons and poking the screen with a stylus, he bailed out and switched to a regular PC. While the earlier presenter was able to gloss over the slowness of Visual Studio 2005 by saying "if it was ready, we would have already released it," I'm assuming this guy was using released software.
The only interesting part of his demo was the location services. The blogging app was able to get his location (presumably to the accuracy that cellphone towers will allow) and automatically supply it.
Another Microsoft product manager type then gave a demo of their speech development tools. To no surprise, they are nicely integrated into Visual Studio. He showed off a data table navigator that automatically creates a grammar based on bound data. The grammar editor was very nice, but the prompt editor was quite weak. The tool also provides a built-in simulation environment so you can do basic functional testing of your app.
The surprising aspect of the speech demos was that all they showed was that Microsoft can now do what lots of other companies have been doing for five to ten years.
Although they tried to pass off the pseudo-standard SALT specification as superior to VoiceXML because SALT has multimodal capabilities designed in, they did not demo any multimodal capabilities. In the video they showed to demonstrate how a hotel might some day use Speech Server, the multimodal examples were pretty gratuitous.
Gates finished up by saying their goal was to provide seamless speech UI across devices. Their vision includes support for pervasive multimodal interaction and speech dictation.
Archive of W3C News in 2004 - Working Draft: VoiceXML 2.1
As announced on the W3C website, the voice browser working group email discussion list, and at the SpeechTEK conference I'm attending, the working draft for VoiceXML 2.1 was released today.
The best news for me is that the <data> tag is part of the draft. The data tag lets you retrieve an XML document via an HTTP request and continue on in the same VXML document. Both Tellme (Hey, what's up with the dumbed down, Flash-crazed, nearly content-free new Tellme website? Please bring back the old site, which actually contained useful info.) and BeVocal already implement the data tag in their VXML browsers. I used the data tag in SoccerPhone and PhoneBlogger to retrieve an XML document containing configuration data. I then parse the XML document with ECMA/JavaScript to extract the config data.
Another advantage of the data tag is that it makes it easier to develop simple XML over HTTP web services that you can easily reuse with non-VXML applications. I'm talking about simpler than SOAP and XML-RPC web services. Just good old RESTful style web services. Without the data tag, the only standard way to get data back to a VXML app was to have the HTTP request return a VXML document to transition to. That makes it hard to reuse your data integration service. You typically end up having to wrap the data integration service with a simple VXML document just to keep the dialog going.
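As a rough illustration of the plain XML over HTTP style, here is a sketch of the two halves of such a service in Python: one function that renders configuration data as a small XML document, and one that a non-VXML client could use to read it back. The `config` and `setting` element names and both function names are invented for this example. In a VXML app the reading side would instead be ECMAScript walking the document returned by the data tag.

```python
import xml.etree.ElementTree as ET


def render_config_xml(settings):
    """Serialize a dict of settings as a small XML document (hypothetical schema)."""
    root = ET.Element("config")
    for name, value in settings.items():
        item = ET.SubElement(root, "setting", {"name": name})
        item.text = str(value)
    return ET.tostring(root, encoding="unicode")


def parse_config_xml(text):
    """Read the same document back into a dict of strings."""
    root = ET.fromstring(text)
    return {item.get("name"): item.text for item in root.findall("setting")}
```

Because the response is just data, the same URL can serve the voice app, a web page, or a command line script without a VXML wrapper in between.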
The <foreach> tag is also pretty handy. I used it for looping through JavaScript arrays in SoccerPhone. Since it is not yet an official part of the spec, I ended up having to implement the SoccerPhone VXML code slightly differently between Tellme and BeVocal.
Finally, it's really great to see consultation transfer get added. Many call center applications are difficult or impossible to implement without support for consultative transfers. Lots of VXML browser vendors added support anyway, just in a proprietary way.
I'll be at SpeechTEK Spring across the bay in San Francisco Tuesday through Friday. Leave a comment or email me at robert AT wombatnation DOT com if you'll be there, too, and would like to meet up. I won't be manning the Avaya booth, but I'll definitely stop by there while the Expo is open.
I plan to blog parts of the conference, though I don't know whether I'll try to do it live or via tape delay.
World Wide Web Consortium Issues VoiceXML 2.0 as a W3C Recommendation
The W3C advanced VoiceXML 2.0 and Speech Recognition Grammar Specification (SRGS) from Candidate Recommendation status to Recommendation status. Although vendors have been delivering products that implement the VXML 2.0 and SRGS specs for several years now, it's good for the specifications to reach the final stage of approval from the W3C. Hopefully, we will now see quicker progress on CCXML, VXML 2.1, and promotion of SSML to Recommendation status.
W3C risks patent tussle in standard push | CNET News.com
A software patent problem continues to hang over VoiceXML 2.0, which reached proposed recommendation status this month. The problem is that the Patent Advisory Group that was supposed to make sure that any relevant patents would be offered on royalty free terms never got Rutgers University to respond regarding a relevant patent. Rutgers' previous position was to offer it under Reasonable and Non-Discriminatory Terms (RAND), which could mean license fees.
While the patent is an immediate concern for the VoiceXML standard, I believe it is an equal concern for the SALT standard that is primarily advocated by Microsoft. It may also be an issue for certain types of software used to audio enable the web for blind users. Furthermore, it could affect proprietary IVR systems that use URLs to retrieve "audio enabled pages".
Last year the W3C established a Royalty-Free Patent Policy with respect to software patents affecting potential standards. The policy states that the W3C will not adopt any standard that cannot be implemented on a royalty free basis. While it is acceptable that a relevant technology be patented, the owner of each relevant patent must offer the patent under a royalty free license, as opposed to RAND.
The patent that is assigned to Rutgers University is titled "Method and system for audio access to information in a wide area computer network" and was originally filed in December 1996.
I am not a lawyer and I have not read the entire patent filing, but it seems fairly clear to me that this patent covers some uses of VoiceXML and SALT. Specifically, the patent makes claims regarding the use of an audio web server to make data from resources on a wide area network available through an audio interface. The patent focuses on audio enabling existing data resources that are accessed over a WAN, and it suggests that the WWW is an example of such a network.
While audio enabling standard websites is a common use of VoiceXML and a likely use of SALT via Microsoft Speech Server, it is by no means the only use. However, the patent also refers to audio enabling databases. I could not find any claims related to conducting transactions, only accessing data.
But, many of the claims (and there are 31 of them) are very vague. Hopefully, that backfires and reduces the effectiveness of the patent. Many of the claims refer to other claims and are very difficult to keep straight.
The common reaction of software engineers to software patents that they hear about in the news is that they are obvious ideas and should never have been granted. Of course, lots of inventions sound obvious once you hear them described. Although I am obviously biased, I do think much of this particular patent falls into the obvious category. There are some novel and interesting ideas, but those ideas account for only a small handful of the claims. Also, very similar work was conducted at Lucent, and possibly at IBM and Motorola, as early as 1994.
Now here's something every kid will love. Santa Claus and his elf Pixel have been outsourced as VoiceXML applications. I guess wages for customer service reps at the North Pole have gone up too much.
Sign up at www.talktosantaclaus.com to have your children confused by a Text-To-Speech driven elf. The Santa Claus bits are all pre-recorded audio. All silly criticisms aside, I think it's a really cool idea.
First you provide some information about your child (pet, spouse, or unsuspecting friend at work) - their name, some sage advice (Stop asking if we're there yet; No beer until you've finished your breakfast), and an item they want for Xmas. If you provide an email address for the lucky victim, she will get an email with a phone number to call and a secret code to provide. After going through the sign-up process, you can give it a trial run. Do it now. You won't regret it. C'mon, all the lazy parents are doing it. I don't even have kids and I did it.
French speakers are in luck. The companies that developed the application, Elix, Nü Echo, and CONCEPT S2i, are all in Quebec. They provided both French and English versions of Talk to Santa Claus.
Southern drawls have thwarted voice recognition equipment used by the Shreveport Police Department to route non-emergency calls.
"In Louisiana, we have a problem with Southern drawl and what I call lazy mouth. Because of that, the system often doesn't recognize what [callers] say," [Capt. John Dunn] said.
Having grown up in the deep South and remembering how people speak there, I'm certainly not surprised that a speech recognition system that is not tuned specifically for Southern US English accents might not work too well.
The "lazy mouth" that I remember was a combination of slurred speech, extra syllables, and the occasional omission of nouns or verbs that could be implied from the context, assuming you were a local. However, the exact nature of the accent varies significantly from Texas to Louisiana to Mississippi and across the rest of the South.
A couple of researchers from Carnegie Mellon and Hitachi have written a paper on an extension to VoiceXML to more easily support complex dialog systems. Their focus is on scenarios where you have multiple related dialogs and want to allow for flexible transitions between these dialogs. From the abstract:
This paper describes DialogXML, an extension to VoiceXML that supports a more implicitly declarative language for dialog scenarios, and ScenarioXML, a straightforward combination of DialogXML with the template-filling mechanism of Java Server Pages.
Essentially the same group also recently published a paper in the Information Processing Society of Japan SIG Notes on a spoken dialog management architecture for car telematics systems using VoiceXML, DialogXML, and ScenarioXML.
This is really interesting stuff, but I still struggle with the idea of using an XML-based language for programming. I really like the idea of being able to validate my code against a DTD or schema, but the code ends up being really verbose and hard to read. It's even worse than JSP or ASP coding. XML just doesn't seem like the most user friendly way to describe these kinds of state transition diagrams.
I just uploaded the 0.3 release of SoccerPhone to the SourceForge project site. This is a minor release. The only changes I made were to accommodate recent changes to the MLSNet.com live scores page. Unfortunately, they have been changing fundamental aspects of their HTML markup nearly every week. Sometimes I get lucky and their changes don't break my code, but too often they do.
While I would love it if MLSNet.com offered an XML feed, perhaps as SportsML, I would be happy if they used CSS more extensively to separate content from presentation. Removing the presentation markup and using meaningful tags to indicate structure would make my life a lot easier. Although they do this in a couple places, in many places they are still using a class called smtext to present unrelated content as small text. Also, the table-based page structure is a nightmare to parse and to understand.
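To show why meaningful class names matter, here is a small sketch of what scraping could look like if the scores page marked up each value by what it is rather than how it looks. The markup in `SAMPLE` and the class names (`match`, `home-team`, and so on) are entirely hypothetical. This is neither the real MLSNet.com page nor the actual SoccerPhone code.

```python
from html.parser import HTMLParser

# Hypothetical semantic class names for the values we want to extract.
FIELDS = {"home-team", "home-score", "away-team", "away-score"}

SAMPLE = (
    '<div class="match">'
    '<span class="home-team">San Jose</span> <span class="home-score">2</span> '
    '<span class="away-team">Los Angeles</span> <span class="away-score">1</span>'
    "</div>"
)


class ScoreParser(HTMLParser):
    """Collect one dict per match, keyed by the class of the element holding each value."""

    def __init__(self):
        super().__init__()
        self.matches = []
        self._field = None  # class of the element whose text we are currently inside

    def handle_starttag(self, tag, attrs):
        css_class = dict(attrs).get("class")
        if css_class == "match":
            self.matches.append({})
        elif css_class in FIELDS and self.matches:
            self._field = css_class

    def handle_data(self, data):
        if self._field and data.strip():
            self.matches[-1][self._field] = data.strip()

    def handle_endtag(self, tag):
        self._field = None


def parse_scores(html):
    """Return a list of match dicts extracted from the given HTML text."""
    parser = ScoreParser()
    parser.feed(html)
    parser.close()
    return parser.matches
```

The parser never looks at tag names or table nesting, only at classes, so a redesign that keeps the class names would not break it. That is the whole appeal compared to picking values out of nested layout tables.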
This post was created with PhoneBlogger. Click to listen to the recorded message.
This was an experiment in audioblogging a live music performance. The music you hear is from a performance by Uprock at Digital Mix, a benefit for EFF.
I don't think that bands need to worry too much any time soon about people bootlegging shows this way, although my setup was by no means ideal.
- I wasn't getting a strong signal from inside the performance space
- I have a 3+ year-old Sanyo SCP-4000 cellphone
- The acoustics in the performance space were not that great
- I could definitely have picked a better spot from which to take the recording
and, of course, there's the roughly 4 kHz bandpass filter imposed by the telephone network. Nonetheless, I think this was a very promising experiment. Most importantly, I think it is a good justification for investing in a newer, better cellphone.
I think this is a really smart move by HP. Scott McGlashan, the CTO and co-founder of PipeBeach, is one of the two chairs of the W3C Voice Browser working group. Before this move, HP was generally not considered to be a significant player in the VoiceXML market.
HP's existing OpenCall platform is targeted at telecom service providers. While they already had a VoiceXML browser suitable for use by service providers, PipeBeach gives them a better VoiceXML toolset and some desirable applications (e.g., email by phone and a voice portal) to immediately start selling through their Services group.
I think HP partners HeyAnita! and VoiceGenie have a lot to be concerned about with this move.
One mistake in the CNET article is that it talks about combining PipeBeach products with HP OpenCall speechWeb. SpeechWeb is PipeBeach's product. Of course, the press release makes it sound like OpenCall speechWeb already exists, although it actually will be a product of the merger.
Microsoft has released a public beta of their Speech Server and a new beta of their Speech Application SDK.
Microsoft had previously teamed up primarily with Intel to propose a new standard called SALT that is somewhat competitive with VoiceXML. As of now, I contrast the two as:
- SALT: useful for speech-enabling web applications
- VoiceXML: useful for web-enabling speech applications
While this is an oversimplification, it reasonably reflects their current usage. Both VoiceXML-based and SALT-based speech applications follow a similar pattern.
- Prompt the user
- Interpret the user's response
- Act on the response
The action will often be to play/speak a new prompt.
The VoiceXML or SALT prompting tag will specify a recorded audio file or text that is synthesized by a text-to-speech engine. The user's response is always interpreted in the context of a grammar. The grammar specifies the allowable responses. Multiple utterances (yeah, uh-huh, sure, yep) will often be treated as the same response (yes). Other VoiceXML and SALT tags (although SALT relies much more on existing HTML tags) act like a decision tree to determine the following action. A series of these prompts and responses is called a dialog.
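The prompt, interpret, and act pattern can be sketched in a few lines of Python. This is only an illustration of the idea, not VoiceXML or SALT: the grammar table, the function names, and the prompt text are all made up for the example, and a real recognizer matches audio rather than typed strings.

```python
# Hypothetical grammar: each canonical response lists the utterances
# that should be treated as that response.
YES_NO_GRAMMAR = {
    "yes": ["yes", "yeah", "uh-huh", "sure", "yep"],
    "no": ["no", "nope", "nah"],
}


def interpret(utterance, grammar):
    """Return the canonical response for an utterance, or None if no match."""
    normalized = utterance.strip().lower()
    for response, utterances in grammar.items():
        if normalized in utterances:
            return response
    return None


def next_prompt(response):
    """Act on the response; here the action is simply choosing the next prompt."""
    if response == "yes":
        return "Okay, here are the scores."
    if response == "no":
        return "Okay, goodbye."
    # No grammar match: reprompt, much as a voice browser would.
    return "Sorry, I didn't catch that. Please say yes or no."
```

For example, `next_prompt(interpret("Yep", YES_NO_GRAMMAR))` selects the same prompt as a plain "yes", which is the point of collapsing several utterances into one response.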
SALT is used primarily to mark up documents that are interpreted in a web browser on a client side device. SALT consists of a very small set of tags that add multimodality to HTML/XHTML-based web applications.
VoiceXML is primarily used to create speech applications that run on a server and are accessed via a telephone. Although plenty of proprietary speech application languages preceded VoiceXML, VoiceXML was the first widely accepted and implemented standard and it greatly simplified the integration of speech applications with existing server side web applications.
With Speech Server, Microsoft is clearly moving SALT onto VoiceXML's turf. At the same time, IBM, Motorola, and Opera are proposing XHTML+Voice (a.k.a., X+V) as multimodal extensions to VoiceXML that would enable it to support the kinds of browser based applications that SALT now supports. Although Microsoft and IBM have been teaming up a lot on web services, they are very much in opposition with respect to the important speech technology standards.
Microsoft has developed their own speech recognition engine, but is partnering with SpeechWorks to supply a text-to-speech engine. In my experience with a previous version of the Microsoft speech recognition engine, I found it to be very mediocre. The only redeeming quality was that it was a free download.
Until now, third party interest in server side development with SALT has been extremely tepid in comparison with VoiceXML. I wonder if Microsoft will weave some of their developer magic with this server, or if it will be like one of their many other failed experiments. Of course, they're big enough that they can survive quite a few failures, as long as they occasionally hit the big home run. I think they will end up being a big player in speech technologies in the future, but I very much doubt that SALT will become a commonly accepted standard in its current form.
While porting SoccerPhone from TellMe to BeVocal, I ran into a couple of differences between the two as development and deployment environments.
Porting Code from TellMe to BeVocal
The porting process went pretty quickly. Fortunately, the Python CGI scripts didn't have to change. Three cheers for standards and for application communication via XML over HTTP.
VoiceXML Changes
- Add DTD DOCTYPE to all vxml files so that the VoiceXML syntax checker can validate them
- Must use the BeVocal DTD if using any BeVocal extensions, like the data tag or bevocal:foreach tag
- audio tag must have an attribute like src.
- break tags must be inside prompt tags
Although TellMe also offers a VXML syntax checker, it assumes you are using their DTD. I haven't tried it with a different DTD, yet, to see if it would actually use it. I like the fact that BeVocal requires it, since it forced me to identify which parts of my code were non-standard.
TellMe also supports the data tag and the foreach tag. The data tag is really cool, as it allows you to return an XML document from a CGI script or a servlet (anything on the other side of an HTTP GET). I hope it makes it into the VoiceXML 2.1 spec.
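As a rough idea of what sits on the other side of that HTTP GET, here is a minimal Python sketch of a CGI-style script body that returns an XML document. The element names (scores, game, home, away) and the goals attribute are assumptions for illustration and are not the format SoccerPhone actually uses.

```python
import xml.etree.ElementTree as ET


def build_scores_xml(games):
    """Build a small XML document from (home, home_goals, away, away_goals) tuples."""
    root = ET.Element("scores")
    for home, home_goals, away, away_goals in games:
        game = ET.SubElement(root, "game")
        ET.SubElement(game, "home", goals=str(home_goals)).text = home
        ET.SubElement(game, "away", goals=str(away_goals)).text = away
    return ET.tostring(root, encoding="unicode")


def cgi_response(body):
    """Prefix the body with the header a CGI script writes to standard output."""
    return "Content-Type: text/xml\r\n\r\n" + body


if __name__ == "__main__":
    # A CGI script simply prints the headers and the document.
    print(cgi_response(build_scores_xml([("San Jose", 2, "Dallas", 1)])))
```

The VoiceXML page would then fetch this document and walk it with DOM calls in JavaScript, so keeping the element names short and the structure shallow pays off on the voice side.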
TellMe allows you to treat an audio tag like a prompt tag and does not require that break tags be inside a prompt. I think BeVocal's stricter interpretations of the spec are correct.
Grammar File Changes
Both TellMe and BeVocal support Nuance grammar files. I had already decided that I would switch over to the standard SRGS XML format as part of the move to BeVocal. My grammar file wasn't that complicated, but the lack of good examples for a simple SRGS XML grammar made it far too arduous. I have posted on the SoccerPhone SourceForge project site the source code for both the TellMe code (GSL grammar) and the BeVocal code (SRGS XML grammar) as part of release 0.2.
BeVocal requires an xml:lang attribute for a grammar, even if it is a dtmf grammar for which that attribute is ignored. I haven't read the SRGS spec closely enough to know if this is an error in their implementation or an oddity of the spec. Also, if I had stuck with Nuance grammar files, I would have needed to specify the grammar type as type="application/x-nuance-gsl" instead of "application/x-gsl".
JavaScript Changes
Fortunately, the change was simple. The BeVocal JavaScript interpreter didn't allow the DOM function getElementsByTagName() to take two arguments. The TellMe interpreter let me pass in a second argument, even though I'm pretty sure that was just a mistake on my part. I assume their interpreter just ignored the extra argument. My experience with both SoccerPhone and PhoneBlogger has been that the most painful part of development has been writing JavaScript code to parse XML files.
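For comparison, the same DOM call exists in Python's xml.dom.minidom, where getElementsByTagName() likewise takes a single tag name. The sketch below is a Python analogue rather than the JavaScript from SoccerPhone, and the sample XML and the helper name element_texts are invented for the example.

```python
from xml.dom.minidom import parseString

# Hypothetical document shape, used only for this illustration.
SCORES_XML = (
    "<scores>"
    "<game><home>San Jose</home><away>Dallas</away></game>"
    "<game><home>Chicago</home><away>Columbus</away></game>"
    "</scores>"
)


def element_texts(xml_text, tag_name):
    """Return the text of every element with the given tag name."""
    doc = parseString(xml_text)
    return [
        node.firstChild.data
        for node in doc.getElementsByTagName(tag_name)  # one argument only
        if node.firstChild is not None
    ]
```

Here `element_texts(SCORES_XML, "home")` walks both game elements and collects the home team names. The DOM variant that does take two arguments is getElementsByTagNameNS(), which expects a namespace URI followed by a local name.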
Development Tools
Both sites have really nice on-line development and debugging tools. Right now, I can't say that I have a clear favorite. The TellMe site seems a little more cohesive, but the BeVocal site seems more up to date. The TellMe development tools (at least the free, online ones) have improved in only a few, minor ways in over a year.
The BeVocal Vocal Debugger looked pretty cool, but I didn't spend much time with it, as the Trace Tool was sufficient for me to find all the problems.
Text To Speech
TellMe is the big winner here for using AT&T Natural Voices. It is far superior to whatever BeVocal is using. In addition to the superior sound quality and accuracy, the TTS engine on TellMe is better at guessing context. The best example is "minute". Let's say a game is in progress in the 47th minute. After reading the score, I have SoccerPhone say "minute 47". BeVocal's TTS engine pronounces it as "my-nyewt", as if it were something small. TellMe's TTS engine pronounces it correctly.
I just published the 0.2 release of SoccerPhone to the SourceForge project site. The two main features of this release are:
- Support for the 2003 version of the MLS live scores page
- Support for BeVocal as well as TellMe as a VoiceXML gateway
I had always wanted to port SoccerPhone to another VoiceXML Gateway, but never had a strong enough need to prioritize that activity over other critical activities, like going to actual soccer matches. Well, that changed when TellMe dropped support for application extensions. I tried out Voxeo as well, but ran into a lot of problems just trying to get a simple VoiceXML application working.
The Technology section of Der Spiegel Online has a long article on audio blogging [German | Google Translation] titled "audio blogs: Voices from the Web". PhoneBlogger makes an appearance in the Internet links sidebar as "Audioblog solutions (III)" and in the main text of the article.
As translated by Google:
"The ink of the W3C-Empfehlung is not yet completely drying, there urge already ready for occupancy Web log solutions of Bevoice, Tellme, Audblog or open SOURCE projects such as Phoneblogger into the market."
A better translation might be "Although the W3C standard for VoiceXML is only recently complete, the audio weblog solutions of BeVocal, TellMe, AudBlog, and open source projects like PhoneBlogger have already entered the market."
Harold's Audioblog/Mobileblogging News blog also showed up in the links sidebar.
Bad news for my free, public SoccerPhone service, which ran as a TellMe Extension. I received the following email from TellMe today:
VoiceXML Developer,
Tellme has made many investments in VoiceXML over the past four years. One of these investments was in the Extensions program, with the goal of making VoiceXML a more utilized public standard. Now with VoiceXML well on its way to standardization in the W3C and with hundreds of thousands of VoiceXML applications in production, it is clear that investment has paid off. It is time for us to retire the Extensions program and invest in other areas. As of Wednesday, April 9th we will no longer host Extensions on 1-800-555-TELL or http://studio.tellme.com. Developers can continue to build VoiceXML applications on Tellme Studio.
Thank you for your individual contribution in making VoiceXML the most widely-used and successful voice standard in the world.
The Tellme Development Team
Fortunately, it looks like TellMe will still support developer level access (i.e., you need the admin password) to a VoiceXML application, which should be sufficient for most deployments of PhoneBlogger. I'll now have to look into BeVocal and HeyAnita, although a quick scan of their websites doesn't suggest that they provide a service similar to TellMe Extensions.
Although I will miss it, this was one of the last remaining relics of the dotcom era. While Extensions got TellMe a decent amount of good PR, I imagine it cost them quite a bit of money to host it, especially when you consider the time that employees were putting into administering a free, hosted service as opposed to one of their services that generates revenue.
I just wish they had kept it, but without a toll-free number. A lot of people with cellphones have nationwide long distance included in their plan, so TellMe was paying toll charges for nothing. Or, at least I think most people choose the long distance plans. If they don't, they should. I very rarely make a long distance call from my house anymore.
Eric Snowdeal indicates on his ex machina that he has run into the same problem.
Well, enough fooling around with development and alpha code. I now have a version of PhoneBlogger that is in good enough shape to demonstrate. Click on the "recorded message" link in my post before this one to hear it in action.
PhoneBlogger is an automated voice application. After asking you for info about which pre-configured blog you wish to post to, it records your audio message. Finally, it posts a blog entry that links to the recorded audio.
Moblogging, or mobile blogging, is a hot topic right now. While PhoneBlogger isn't as fancy as tools that let you post pictures from your camera phone to your weblog with text via email, it requires nothing more than an ordinary phone connection. A journalist could use it from a payphone (good luck finding one, though) or with a basic cellphone to immediately publish to the web from the scene of an unexpected event in progress. It's moblogging for the people, man. However, I would still love to have a camera phone set up to use Joi Ito's mail2entry Python script.
I have another webpage I am working on that will have more info about PhoneBlogger, in case you want to know more about it. I'm also planning to put the code up on a SourceForge project. For now, please email me at robert AT wombatnation DOT com to get a copy of the code.
