Guest guest
Posted September 12, 2003

In a message dated 09/12/2003 12:48:13 PM Eastern Daylight Time, calfieri@... writes:

Beyond Voice Recognition, to a Computer That Reads Lips
September 11, 2003
By ANNE EISENBERG

PERSONAL computers have changed a lot in the last few decades, but not in the way that people communicate with them. Typing on a keyboard, with the help of a mouse, remains the most common interface. But pounding away at a set of keys is hard on the hands and tethers users to the keyboard.

Automatic speech recognition offers some relief - the systems work reasonably well for office dictation, for instance. But voice recognition is not effective in noisy places like cars, train stations or the corner cash machine, and it may stumble even under the best of conditions. Humans are still much better than any computer at the subtleties of speech recognition.

But teaching computers to read lips might boost the accuracy of automatic speech recognition. Listeners naturally use mouth movements to help them understand the difference between "bat" and "pat," for instance. If distinctions like this could be added to a computer's databank with the aid of cheap cameras and powerful processors, speech recognition software might work a lot better, even in noisy places.

Scientists at I.B.M.'s research center in Westchester County, at Intel's centers in China and California and in many other labs are developing just such digital lip-reading systems to augment the accuracy of speech recognition.

Chalapathy Neti, a senior researcher at I.B.M.'s J. Research Center in Yorktown Heights, N.Y., has spent the past four years focusing on how to boost the performance of speech recognition with cameras. Dr. Neti manages the center's research in audiovisual speech technologies.

"We humans fuse audio and visual perception in deciding what is being spoken," he said. A computer, he said, can be trained to do this job, too.

At I.B.M., the process starts by getting the computer and camera to locate the person who is speaking, searching for skin-tone pixels, for instance, and then using statistical models that detect any object in that area that resembles a face. Then, with the face in view, vision algorithms focus on the mouth region, estimating the location of many features, including the corners and center of the lips.

If the camera looked solely at the mouth, though, only about 12 to 14 sounds could be distinguished visually, Dr. Neti said - for instance, the difference between the explosive initial "p" and its close relative "b."

So the group enlarged the visual region to include many types of movements. "We tried using additional visible articulators like jaw movements and the lower cheek, and other movements of tongue and teeth," he said, "and that turned out to be beneficial."

Then the visual and audio features were combined and analyzed by statistical models that predicted what the speaker was saying. Using inexpensive laptop cameras, the group tested the new system repeatedly. When they introduced a lot of background audio noise, Dr. Neti said, the combination audio and visual analysis of speech worked well, demonstrating up to a 100 percent improvement in accuracy compared with using audio alone.

These were promising results, but as Dr. Neti pointed out, a studio is not the world. Many camera-based systems that work well in the controlled conditions of a laboratory fail when they are tested in a car, for instance, where the lighting is uneven or people face away from the camera.

To handle circumstances like this, he and his colleagues are developing several solutions. One is an audiovisual headset, now in prototype, with a tiny camera mounted on the boom. "This way, the mouth region can always be seen," he said, independent of head movement or walking. I.B.M. is also exploring the use of infrared illuminators for the mouth region to provide constant lighting. Dr. Neti said that such headsets might prove useful in workplaces where people fill out forms or enter data by using speech recognition software.

Another solution to changing video conditions is a feedback system devised by the I.B.M. research group. "Our system tracks confidence levels as it combines audio and visual features," making a decision on the relative weight of the two sources, Dr. Neti said. When a speaker faces away from the microphone, he said, the confidence level becomes zero and the system ignores the visual information and simply uses audio information. When the visual information is strong, it is included. "The more pixels you can get for the mouth region," he said, "the better information you'll have."

The goal of the system is always to do better than when relying on an audio or video stream alone. "At worst, it is as good as audio," Dr. Neti said. "At best, it is much better."

At Intel, too, researchers have developed software for combined audiovisual analysis of speech and released the software for public use as part of the company's Open Source Computer Vision Library, said Ara V. Nefian, a senior Intel researcher who led the project. "We extract visual features and then acoustic features, and combine them using a model that analyzes them jointly," he said. In tests, the system could identify four out of five words in noisy environments. "The results were as good for Chinese as for English," Dr. Nefian added, suggesting that the system could be introduced elsewhere.

Aggelos Katsaggelos, a professor of electrical and computer engineering at Northwestern University in ton, Ill., is also developing an audiovisual speech recognition system. He said that a future application might be improved security, using such a system, for instance, to determine whether recent videos that have surfaced indeed showed Saddam Hussein himself or an imposter. "In principle, if one can use both video and audio analysis one can have a better accuracy in identifying people," he said.

Iain s, a research scientist at Carnegie Mellon University's Robotics Institute who works mainly on face tracking and modeling, said that audiovisual speech recognition was a logical step. "Psychology showed this 50 years ago," he said. "If you can see a person speaking, you can understand that person better."

http://www.nytimes.com/2003/09/11/technology/circuits/11next.html?ex=1064384722&ei=1&en=cbbced2317c66237

Copyright 2003 The New York Times Company
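For anyone curious how the confidence-based weighting described in the article might look in practice, here is a minimal sketch in Python. It is not I.B.M.'s method; the article gives no implementation details. It assumes each recognizer returns a log-score per candidate word, and it blends the two with a weight that shrinks to zero as the visual confidence drops. The names fuse_scores, best_word, MAX_VISUAL_WEIGHT and MISSING_LOG_SCORE, and all the numbers, are made up for illustration.

```python
import math

# Hypothetical cap on how much the video stream may count (an assumption).
MAX_VISUAL_WEIGHT = 0.5
# Log-score used when the visual model has no score for a word (an assumption).
MISSING_LOG_SCORE = math.log(1e-6)


def fuse_scores(audio_logp, visual_logp, visual_confidence):
    """Blend per-word log-scores; visual_confidence lies in [0, 1]."""
    if not 0.0 <= visual_confidence <= 1.0:
        raise ValueError("visual_confidence must be between 0 and 1")
    # With zero confidence the weight is zero and only the audio score remains.
    w = visual_confidence * MAX_VISUAL_WEIGHT
    fused = {}
    for word, a in audio_logp.items():
        v = visual_logp.get(word, MISSING_LOG_SCORE)
        fused[word] = (1.0 - w) * a + w * v
    return fused


def best_word(scores):
    """Return the candidate word with the highest score."""
    return max(scores, key=scores.get)


# Made-up numbers: noisy audio slightly prefers "pat", the video prefers "bat".
audio = {"bat": math.log(0.48), "pat": math.log(0.52)}
visual = {"bat": math.log(0.90), "pat": math.log(0.10)}

print(best_word(fuse_scores(audio, visual, 0.0)))  # mouth not visible
print(best_word(fuse_scores(audio, visual, 1.0)))  # mouth clearly visible
```

With the confidence at zero the fused scores are exactly the audio scores, which matches the "at worst, it is as good as audio" behavior quoted above. A real system would presumably estimate the confidence from the video itself, for example from how many pixels cover the mouth region.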