Fwd: [RD] CURR: NYTimes.com Article: Beyond Voice Recognition, to a Computer ...

In a message dated 09/12/2003 12:48:13 PM Eastern Daylight Time, calfieri@... writes:


Beyond Voice Recognition, to a Computer That Reads Lips

September 11, 2003

By ANNE EISENBERG

PERSONAL computers have changed a lot in the last few decades, but not in the way that people communicate with them. Typing on a keyboard, with the help of a mouse, remains the most common interface.

But pounding away at a set of keys is hard on the hands and tethers users to the keyboard. Automatic speech recognition offers some relief - the systems work reasonably well for office dictation, for instance. But voice recognition is not effective in noisy places like cars, train stations or the corner cash machine, and it may stumble even under the best of conditions. Humans are still much better than any computer at the subtleties of speech recognition.

But teaching computers to read lips might boost the accuracy of automatic speech recognition. Listeners naturally use mouth movements to help them understand the difference between "bat" and "pat," for instance. If distinctions like this could be added to a computer's databank with the aid of cheap cameras and powerful processors, speech recognition software might work a lot better, even in noisy places.

Scientists at I.B.M.'s research center in Westchester County, at Intel's centers in China and California and in many other labs are developing just such digital lip-reading systems to augment the accuracy of speech recognition.

Chalapathy Neti, a senior researcher at I.B.M.'s Thomas J. Watson Research Center in Yorktown Heights, N.Y., has spent the past four years focusing on how to boost the performance of speech recognition with cameras. Dr. Neti manages the center's research in audiovisual speech technologies. "We humans fuse audio and visual perception in deciding what is being spoken," he said. A computer, he said, can be trained to do this job, too.

At I.B.M., the process starts by getting the computer and camera to locate the person who is speaking, searching for skin-tone pixels, for instance, and then using statistical models that detect any object in that area that resembles a face. Then, with the face in view, vision algorithms focus on the mouth region, estimating the location of many features, including the corners and center of the lips.
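A rough idea of that find-the-face-then-find-the-mouth pipeline can be sketched with off-the-shelf tools. The sketch below uses OpenCV's stock Haar-cascade face detector and a fixed geometric guess for the mouth area; it is only an illustration of the idea, not I.B.M.'s skin-tone search or its statistical face and lip models.

```python
# Minimal sketch of the detect-face-then-mouth pipeline described above.
# The Haar cascade and the fixed lower-third mouth guess are stand-ins for
# I.B.M.'s skin-tone search and lip-feature estimation, not the real system.
import cv2

face_cascade = cv2.CascadeClassifier(
    cv2.data.haarcascades + "haarcascade_frontalface_default.xml")

def locate_mouth_region(frame):
    """Return (x, y, w, h) of a rough mouth region in the frame, or None."""
    gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
    faces = face_cascade.detectMultiScale(gray, scaleFactor=1.1, minNeighbors=5)
    if len(faces) == 0:
        return None
    x, y, w, h = max(faces, key=lambda f: f[2] * f[3])  # largest detected face
    # The mouth sits roughly in the lower third of the face box.
    return (x + w // 4, y + 2 * h // 3, w // 2, h // 3)
```

In the system the article describes, the lip corners and lip center are estimated explicitly rather than taken as a fixed fraction of the face box.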

If the camera looked solely at the mouth, though, only about 12 to 14 sounds could be distinguished visually, Dr. Neti said - for instance, the difference between the explosive initial "p" and its close relative "b." So the group enlarged the visual region to include many types of movements. "We tried using additional visible articulators like jaw movements and the lower cheek, and other movements of tongue and teeth," he said, "and that turned out to be beneficial." Then the visual and audio features were combined and analyzed by statistical models that predicted what the speaker was saying.
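The article does not say which statistical models I.B.M. used, but the combine-and-classify step can be illustrated with a generic stand-in: concatenate the acoustic and visual feature vectors for each time window and score the result against per-word models. The feature arrays, the Gaussian-mixture models and the word list below are all assumptions made for illustration.

```python
# Illustrative sketch of the fusion step: acoustic and visual articulator
# features for the same time window are concatenated and scored by per-word
# statistical models. Gaussian mixtures are a generic stand-in here.
import numpy as np
from sklearn.mixture import GaussianMixture

def train_word_model(audio_feats, visual_feats, n_components=4):
    """audio_feats, visual_feats: (n_frames, dim) arrays for one spoken word."""
    fused = np.hstack([audio_feats, visual_feats])
    return GaussianMixture(n_components=n_components).fit(fused)

def recognize(models, audio_feats, visual_feats):
    """Pick the word whose model best explains the fused features."""
    fused = np.hstack([audio_feats, visual_feats])
    scores = {word: m.score(fused) for word, m in models.items()}
    return max(scores, key=scores.get)
```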

Using inexpensive laptop cameras, the group tested the new system repeatedly. When they introduced a lot of background audio noise, Dr. Neti said, the combined audio and visual analysis of speech worked well, demonstrating up to a 100 percent improvement in accuracy compared with using audio alone.

These were promising results, but as Dr. Neti pointed out, a studio is not the world. Many camera-based systems that work well in the controlled conditions of a laboratory fail when they are tested in a car, for instance, where the lighting is uneven or people face away from the camera.

To handle circumstances like this, he and his colleagues are developing several solutions. One is an audiovisual headset, now in prototype, with a tiny camera mounted on the boom. "This way, the mouth region can always be seen," he said, independent of head movement or walking. I.B.M. is also exploring the use of infrared illuminators for the mouth region to provide constant lighting.

Dr. Neti said that such headsets might prove useful in workplaces where people fill out forms or enter data by using speech recognition software.

Another solution to changing video conditions is a feedback system devised by the I.B.M. research group. "Our system tracks confidence levels as it combines audio and visual features," making a decision on the relative weight of the two sources, Dr. Neti said. When a speaker faces away from the camera, he said, the confidence level becomes zero and the system ignores the visual information and simply uses audio information. When the visual information is strong, it is included.

"The more pixels you can get for the mouth region," he said, "the better information you'll have."

The goal of the system is always to do better than when relying on an audio or video stream alone. "At worst, it is as good as audio," Dr. Neti said. "At best, it is much better."

At Intel, too, researchers have developed software for combined audiovisual analysis of speech and released the software for public use as part of the company's Open Source Computer Vision Library, said Ara V. Nefian, a senior Intel researcher who led the project. "We extract visual features and then acoustic features, and combine them using a model that analyzes them jointly," he said. In tests, the system could identify four out of five words in noisy environments.
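For the visual stream, frames would have to be captured and cropped before any features are extracted. The loop below is not Intel's released audiovisual speech module; it only shows, with OpenCV's modern Python bindings, how camera frames might be gathered and fed to the mouth-locator sketched earlier.

```python
# Minimal capture loop: read frames from the default camera and collect
# mouth crops using the locate_mouth_region() helper sketched earlier.
import cv2

cap = cv2.VideoCapture(0)            # default laptop camera
mouth_crops = []
while len(mouth_crops) < 100:        # gather roughly 100 mouth images
    ok, frame = cap.read()
    if not ok:
        break
    roi = locate_mouth_region(frame)
    if roi is not None:
        x, y, w, h = roi
        mouth_crops.append(frame[y:y + h, x:x + w])
cap.release()
```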

" The results were as good for Chinese as for English, " Dr.

Nefian added, suggesting that the system could be

introduced elsewhere.

Aggelos Katsaggelos, a professor of electrical and computer engineering at Northwestern University in Evanston, Ill., is also developing an audiovisual speech recognition system. He said that a future application might be improved security, using such a system, for instance, to determine whether recent videos that have surfaced indeed showed Saddam Hussein himself or an impostor. "In principle, if one can use both video and audio analysis one can have a better accuracy in identifying people," he said.

Iain Matthews, a research scientist at Carnegie Mellon University's Robotics Institute who works mainly on face tracking and modeling, said that audiovisual speech recognition was a logical step. "Psychology showed this 50 years ago," he said. "If you can see a person speaking, you can understand that person better."

http://www.nytimes.com/2003/09/11/technology/circuits/11next.html?ex=1064384722&ei=1&en=cbbced2317c66237


Copyright 2003 The New York Times Company
