Speech Recognition - Piyush Modi on speech technology: Science fact or fiction?

by Diana Lightmoon, Executive Editor, Telecom Reseller

When asked about the state of the art in speech technology, Piyush Modi, Vice-President of Applications and Media Processing Technologies at IP Unity, laughed and said that we are still far from HAL. HAL, if you'll recall, was the talking computer in Stanley Kubrick's film 2001: A Space Odyssey. The reference to the film, whose story Kubrick developed with Arthur C. Clarke, is appropriate given that Modi admits his own interest in the technology was originally motivated by a science fiction-like curiosity about artificial intelligence.

Modi was in Tucson working on a Ph.D. in image processing when he got involved with Bell Labs and turned his attention to speech-related technology. The signal-processing techniques used for sound were similar to those he had been using for images, though the problems were slightly different, and he found, as he says, that he really "fancied it." What he particularly liked "was the ability to marry what researchers do in the lab with real-life problems." It was a marriage with enough mystery to keep him intrigued for a lifetime.

On one side is the speech technology itself, which was the part Modi addressed while at Bell Labs. "A key area that interested me," he says, "was that when we talk to each other, we take a lot of things for granted. Sometimes we don't understand each other, but because we have understanding and dialogue logic built into our communication, it allows us to figure out what has been said. The earlier research into speech didn't account for such things. Those kinds of issues appealed to me and helped [me] get into the nitty-gritty technology that would allow natural dialogue."

He explains that within intelligent dialogue there are certain markers that can be ordered across many applications. These "recipes" no longer require any tuning. They include categories such as identifying people's phone numbers and recognizing people saying yes or no, or variations thereof. "We can do basic things like getting ID account numbers recognition. It works best if numeric, but even alphanumeric can be recognized to pretty good satisfaction."
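To make the idea of a reusable dialogue "recipe" concrete, here is a minimal sketch of what two of them might look like once a recognizer has already turned speech into text. It is an illustration only, not IP Unity's or Bell Labs' method: the word lists and the function names `classify_yes_no` and `extract_digits` are assumptions invented for this example.

```python
import re

# Hypothetical vocabularies; a deployed system would tune these
# per language and channel rather than hard-code them.
YES_WORDS = {"yes", "yeah", "yep", "sure", "correct", "right"}
NO_WORDS = {"no", "nope", "nah", "incorrect", "wrong"}

DIGIT_WORDS = {
    "zero": "0", "oh": "0", "one": "1", "two": "2", "three": "3",
    "four": "4", "five": "5", "six": "6", "seven": "7",
    "eight": "8", "nine": "9",
}


def classify_yes_no(transcript):
    """Return "yes", "no", or None when the reply is unclear or mixed."""
    words = re.findall(r"[a-z']+", transcript.lower())
    said_yes = any(word in YES_WORDS for word in words)
    said_no = any(word in NO_WORDS for word in words)
    if said_yes and not said_no:
        return "yes"
    if said_no and not said_yes:
        return "no"
    return None  # ambiguous: the dialogue should ask again


def extract_digits(transcript):
    """Collect spoken or written digits from a transcript, in order."""
    tokens = re.findall(r"[a-z]+|\d", transcript.lower())
    digits = []
    for token in tokens:
        if token.isdigit():
            digits.append(token)
        elif token in DIGIT_WORDS:
            digits.append(DIGIT_WORDS[token])
    return "".join(digits)
```

Because neither function depends on a particular application, the same pair could in principle sit behind a banking line or a directory service, which is the sense in which such recipes carry across applications. Returning `None` for a mixed reply leaves the decision to re-prompt with the surrounding dialogue logic.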

But improving speech technology beyond these simple dialogues requires better speech algorithms, Modi says. "Better ability for algorithms to adapt to channel and language variations means more CPU power. As the task gets more complex, it requires more intense CPU and memory footprint, so solutions get expensive and require a large number of boxes. This is one of the major bottlenecks, why you don't see this technology getting used more broadly. It's expensive to deploy as well as to manage."

Which brings us to the other side of the marriage: how to package the algorithms to make the technology deployable. This is where Modi's work at IP Unity comes into play. "We built an architecture that has the power of four Pentium CPUs in one single blade. It gets managed like all the other things inside the box. What this means is that as you go through the cycles of software upgrades, as you go through management failures, everything becomes simpler. You're not creating another beast or another set of boxes that requires separate treatment."

And why would a company go to the expense of deploying the technology? "Because people like it." According to Modi, the economy could never have boomed as it did in the late eighties and nineties if it hadn't been for the innovation of the touch-tone phone. Call volume exploded once customers could interact with a company using the phone keypad. And yet we're all too familiar with the nightmare of getting lost somewhere in a phone tree, not knowing where we are or how to get out. Speech is the key that will set us free.

"People like to talk to natural things. They don't want to hear tens of options. Speech asks an open-ended question." Yet Modi doesn't see the technology stopping there. He anticipates that voice will eventually be used to interface with whole networks, giving people more access to data. "We're entering a world in which we are removing the shackles from proprietary networks." He readily admits that the issues of deployment are mind-boggling and complex in terms of security and voice verification, and that no one platform addresses them today. Yet this doesn't stop him from envisioning a future in which people interact with systems not only through speech but through other modalities, too.

"Soon there will be sensors, and as we go on, even more. We'll be asking what is the modality most appropriate to the task. For example, take a GPS system in your car. It may be very easy to just say the place you want to go to. But once it gives you a map, it's better to look at it rather than to have it described to you. And then you might want to point at the map with your finger, touching it. So an intelligent dialogue with a system won't be dependent on speech only."

So we may be far from HAL, but not so far that we can't see the day when computers will become even more a part of our lives than they are today. If Modi is right, speech technology is just the beginning.

© 2003 Telecom Reseller. All Rights Reserved.