Speech Recognition for Cell Phones Comes of Age

The new Google Nexus One. (Image credit: Google)

Speech recognition technology has come a long way in recent years, and one of the fastest areas of growth is the cellphone market.

Now, the availability of 3G-enabled mobile devices with fast, always-on Internet connections and the ability to train voice modeling software with millions of phone users – a process called crowd sourcing – is helping fuel a new breed of mobile speech-recognition apps that work quickly and are amazingly accurate.

Speech recognition software has been around for years, but it was often frustrating to use because it typically required users to "train" it for optimal word recognition or to speak slowly.

"In the early days, the capabilities of the technology combined with the computing power of the various devices required that you have training so that [the software] would have data about the specific user ... and not use up too much computer power," explained Mike Thompson, senior vice president and general manager of Nuance Mobile, which makes the Dragon Dictation and Dragon Search apps for the iPhone and iPad. (Read more iPad news.)

But the computing power of today's smartphones is such that voice training is no longer required. The digital voice models that form the basis of today's speech recognition software are sophisticated enough that they can learn — on their own — their users' verbal quirks.

They're also fast: Dragon Dictation, for example, can transcribe words spoken at normal speed.

The power of the masses

Mobile voice-recognition apps also have other advantages over their older desktop counterparts.

One is the ability to communicate with powerful central computers, or servers, that can combine information from millions of users and then make broad generalizations that help improve the apps' overall ability to recognize words.

"The first time you speak to the phone, we put a cookie" — a kind of digital tag — "on your device and when you say something we call up your personal language model from our servers and use it to get better accuracy," said Dave Grannen, president and CEO of speech recognition software maker Vlingo, which also has an app for the iPhone.

An individual's voice model contains information about the speaker's accent and distinctive way of pronouncing certain words, among other things.
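The cookie-and-server arrangement Grannan describes can be pictured with a minimal Python sketch. This is an illustration only, not Vlingo's code: the names personal_models, issue_cookie, record_utterance and pick_word are hypothetical, and a simple word-count table stands in for the far richer accent and pronunciation data a real voice model would hold.

```python
import uuid

# Hypothetical server-side store: cookie ID -> personal language model.
# Each "model" is just a word-count table, a stand-in for real voice data.
personal_models = {}

def issue_cookie():
    """Create a new device tag and an empty personal model for it."""
    cookie = uuid.uuid4().hex
    personal_models[cookie] = {}
    return cookie

def record_utterance(cookie, words):
    """Update the user's model with the words they were confirmed to have said."""
    model = personal_models[cookie]
    for word in words:
        model[word] = model.get(word, 0) + 1

def pick_word(cookie, candidates):
    """Among similar-sounding candidates, prefer the one this user says most often."""
    model = personal_models.get(cookie, {})
    return max(candidates, key=lambda word: model.get(word, 0))
```

In this sketch, a user who has often said "nuance" would see that word chosen over a similar-sounding candidate the next time the recognizer is unsure.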

The servers can combine the voice models of several speakers who have similar accents to improve the accuracy for that population.

"If you're from India and speaking English as a second language on Vlingo, we work pretty darned well. If you're from Germany speaking English, it doesn't work so well," Grannan told TechNewsDaily.

The reason? Vlingo has many more users from India than from Germany, so its voice model for Indian-accented English is generally better than its model for German-accented English.
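A toy Python sketch shows why the size of each accent group matters. It assumes, purely for illustration, that every user's voice model boils down to a single number (here a made-up "vowel length" in milliseconds). The function pool_by_accent and the sample values are hypothetical and do not come from Vlingo or Nuance.

```python
from collections import defaultdict
from statistics import mean

def pool_by_accent(user_models):
    """Combine per-user values into one pooled value per accent group.

    user_models is a list of (accent, value) pairs, one per user.
    Returns a dict mapping accent -> (pooled average, number of users).
    """
    groups = defaultdict(list)
    for accent, value in user_models:
        groups[accent].append(value)
    return {accent: (mean(values), len(values))
            for accent, values in groups.items()}

# Made-up sample: three users with one accent, one user with another.
pooled = pool_by_accent([
    ("indian", 100), ("indian", 110), ("indian", 120),
    ("german", 90),
])
```

The pooled figure for the larger group rests on three speakers, while the smaller group's figure rests on one, so a single unusual speaker can skew it. That is the same imbalance Grannan describes, at a much smaller scale.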

Smart apps

Today's speech-recognition apps for smartphones can also learn from their mistakes. If an app misspells a word, users can use the keyboards on their devices to correct the mistake, and the correction is noted on the server so it is less likely to recur.
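The correction loop can be sketched in a few lines of Python. This is a simplified illustration under stated assumptions, not how any particular app works: the names corrections, report_correction and apply_corrections are hypothetical, and so is the rule that a fix is adopted only after a minimum number of reports.

```python
from collections import Counter, defaultdict

# Hypothetical server-side log: recognized text -> counts of what users typed instead.
corrections = defaultdict(Counter)

def report_correction(recognized, corrected):
    """Record that a user replaced the recognized text with their own spelling."""
    corrections[recognized][corrected] += 1

def apply_corrections(recognized, min_reports=2):
    """Return the most common user correction once enough reports exist.

    Otherwise the recognized text is returned unchanged.
    """
    if recognized in corrections:
        best, count = corrections[recognized].most_common(1)[0]
        if count >= min_reports:
            return best
    return recognized
```

The min_reports threshold is a design choice assumed for this sketch: it keeps one user's idiosyncratic edit from changing the output for everyone.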

Dragon Dictation and Dragon Search also pay attention to where a speaker is talking and can take steps to reduce background noise so a person's words are more understandable.

"If you're driving down the road in your car, you might have the window partway down, or the radio is on, or there's another person in the car with you. All of those kinds of sounds are predictable and can be eliminated through something called acoustic echo cancellation," said Dragon Dictation's Thompson.

Acoustic echo cancellation is a server-side process and also benefits from crowd sourcing. The more people who use the apps in similarly noisy environments, the better the software gets at ignoring background noise.

"Just like many forms of software, as you collect more data and expertise, you're continually pouring that back into the products," Thompson said in a telephone interview.

'Getting mainstream'

Vlingo's Grannan notes that it's only been in recent years, as fast 3G-enabled cellphones have become ubiquitous, that crowd sourcing and server-side voice analysis have really taken off.

"Before we had 3G, it was hard to do this," Grannan said.

In the future, speech recognition software will be more deeply integrated into a variety of devices, Thompson predicts.

"You're going to see a large number of devices roll out with speech recognition baked into the device," he said. "It will be built into messaging systems and the search functionality and all the apps on a phone."

This trend is already happening. Apple's iPhone 3GS, for example, includes native speech recognition capabilities that allow users to voice-dial people in their address books.

Speech recognition "is getting mainstream attention, and that's driving our business in a very positive way," Thompson said.