In brief: 27 Feb 2003 [dive into mark]

While I’m going to try to add text transcription to PhoneBlogger, I don’t know how successful it will be.

In one of the items in his Feb 27 blog entry, Mark suggests that when you are audio/phone blogging, “any audio content needs to be supplemented with a simultaneous text transcript.”

The problem with this in most cases is that speaker-independent, natural-language speech recognition is just not up to the challenge yet. While you can buy dictation software that does a reasonably good job of creating transcripts, the really good stuff is too expensive, and the training period is longer than most people are willing to endure. That goes double for the casual audio blogging customers of a service like Audblog.

The type of speech recognition most people are familiar with is called directed dialog. This means the recognizer only has to listen for a restricted grammar at each prompt. For example, United Airlines’ service at 1-800-824-6200 works well because it is listening only for words related to an airplane flight, like “arrival” or “departure”. Both PhoneBlogger and SoccerPhone use directed dialogs.
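To make the restricted-grammar idea concrete, here is a minimal sketch in Python. It is not how PhoneBlogger, SoccerPhone, or the United service is actually built. The grammar table and the `match_directed_dialog` function are hypothetical names made up for illustration, and the sketch assumes the audio has already been turned into candidate words, so it only shows the "pick from a short list" step.

```python
# Hypothetical grammar for a flight-status prompt. The recognizer only
# has to choose among these words, not transcribe arbitrary speech.
FLIGHT_GRAMMAR = {
    "arrival": "ARRIVAL",
    "arrivals": "ARRIVAL",
    "departure": "DEPARTURE",
    "departures": "DEPARTURE",
}


def match_directed_dialog(utterance, grammar):
    """Return the intent of the first in-grammar word, or None if nothing matches."""
    for word in utterance.lower().split():
        token = word.strip(".,!?")  # drop trailing punctuation before lookup
        if token in grammar:
            return grammar[token]
    return None  # out of grammar: a real service would re-prompt the caller


if __name__ == "__main__":
    print(match_directed_dialog("Arrival, please", FLIGHT_GRAMMAR))
    print(match_directed_dialog("Today I went to the park", FLIGHT_GRAMMAR))
```

The second call falls outside the grammar, which is exactly the situation a free-form audio blog post is in all the time.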

On top of this, the usable voice frequency band for a regular PSTN-based phone call is about 300 Hz to 3400 Hz. Interestingly, while most vowel sounds are strongest below 3 kHz (including fundamentals and harmonics), consonants are usually more concentrated above 3 kHz. Since quite a bit of the information in your voice signal can range up to around 5 kHz or so, especially for a child, the limited bandwidth makes recognition over a phone call that much harder. And don’t even get me started on VoIP over the Internet.
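To put rough numbers on the band limits mentioned above, here is a back-of-the-envelope sketch in Python. It only compares frequency ranges. It says nothing about signal energy or intelligibility, so treat the results as a crude illustration. The 0 Hz lower bound for the speech band and all of the names are assumptions for the example, and the 5 kHz upper bound is the approximate figure from the paragraph above.

```python
PSTN_PASSBAND = (300.0, 3400.0)      # Hz, usable band of a regular phone call
SPEECH_BAND = (0.0, 5000.0)          # Hz, assumed range carrying voice information
CONSONANT_BAND = (3000.0, 5000.0)    # Hz, where consonants tend to concentrate


def overlap_hz(band, passband):
    """Width in Hz of the part of `band` that falls inside `passband`."""
    low = max(band[0], passband[0])
    high = min(band[1], passband[1])
    return max(0.0, high - low)


def fraction_kept(band, passband=PSTN_PASSBAND):
    """Share of `band` (by frequency range only) that survives the passband."""
    return overlap_hz(band, passband) / (band[1] - band[0])


if __name__ == "__main__":
    print(f"speech band kept:    {fraction_kept(SPEECH_BAND):.0%}")
    print(f"consonant band kept: {fraction_kept(CONSONANT_BAND):.0%}")
```

By this measure only the 3000 Hz to 3400 Hz slice of the consonant range gets through, which is the part of the signal a recognizer would most like to have.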