Car Navigation Based on Audio-Visual Speech Recognition

Audio-visual speech recognition extends acoustic speech recognition with visual information and has received considerable attention over the last few decades. Its main motivation is the "McGurk effect", which demonstrates that listeners combine acoustic and visual information (such as lip movement) when perceiving speech. The primary advantage of the visual speech signal is that it is unaffected by acoustic noise and by cross talk among speakers. Another advantage is the complementary structure of phonemes and visemes, the smallest acoustically and visually distinguishable units of a language, respectively. The recognition rates of current automatic speech recognizers (ASRs) drop significantly in noisy environments such as automobiles. Noise in the environment degrades the acoustic signal, which carries the main information for speech recognition, so the audio signal alone cannot be relied on to identify speech correctly.

Vehicle manufacturers around the world are developing and experimenting with vehicle navigation systems that use speech to replace manual input of control commands. Noise is a serious problem for such applications: it comes from the car interior, the engine, music and surrounding traffic. Because visual speech information is not affected by acoustic noise, we use both acoustic and visual speech information to recognize speech. Even when the acoustic speech information is unreliable, speech can still be recognized from the visual speech information.

Audio-visual speech recognition requires a recognizer, acoustic and visual feature extractors, and a feature fusion mechanism. The most important part is the recognition engine. We have completed a recognition engine based on Hidden Markov Models, which we use to model both the acoustic and the visual speech signals. The engine handles both discrete and continuous word recognition; for continuous word recognition, dynamic programming is used.
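
To illustrate the dynamic programming step, the sketch below decodes the most likely state sequence of a small discrete-observation Hidden Markov Model with the Viterbi algorithm. This is only an assumption about the kind of decoding involved: the text does not say which dynamic programming variant the engine uses, and the function name and toy model are hypothetical.

```python
import math


def viterbi(obs, start_p, trans_p, emit_p):
    """Return (best_log_prob, best_state_path) for a discrete-observation HMM.

    obs      : list of observation symbol indices
    start_p  : start_p[s] = P(first state is s)
    trans_p  : trans_p[r][s] = P(next state is s | current state is r)
    emit_p   : emit_p[s][o] = P(symbol o | state s)
    """
    n_states = len(start_p)

    def log(p):
        # Work in the log domain to avoid underflow on long utterances.
        return math.log(p) if p > 0 else float("-inf")

    # delta[s]: best log-probability of any path ending in state s so far.
    delta = [log(start_p[s]) + log(emit_p[s][obs[0]]) for s in range(n_states)]
    backptr = []
    for o in obs[1:]:
        new_delta, ptr = [], []
        for s in range(n_states):
            best_prev = max(range(n_states),
                            key=lambda r: delta[r] + log(trans_p[r][s]))
            new_delta.append(delta[best_prev]
                             + log(trans_p[best_prev][s])
                             + log(emit_p[s][o]))
            ptr.append(best_prev)
        delta = new_delta
        backptr.append(ptr)

    # Trace the best path backwards from the best final state.
    last = max(range(n_states), key=lambda s: delta[s])
    path = [last]
    for ptr in reversed(backptr):
        path.append(ptr[path[-1]])
    path.reverse()
    return delta[last], path


# Toy two-state model with three observation symbols (illustrative values).
start = [0.6, 0.4]
trans = [[0.7, 0.3], [0.4, 0.6]]
emit = [[0.5, 0.4, 0.1], [0.1, 0.3, 0.6]]
log_prob, states = viterbi([0, 1, 2], start, trans, emit)
```

In a word recognizer, the same recursion would typically run over a network of word models, with feature vectors scored by continuous densities in place of the discrete emission table used here.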

The acoustic feature extraction scheme is standard: 13-dimensional Mel Frequency Cepstral Coefficients are used. Visual feature extraction, however, still needs much research. Our research has focused mostly on geometric visual features, and we have implemented a real-time lip tracking system from which we expect to extract visual speech features.
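
As a rough illustration of what geometric visual features can look like, the sketch below derives mouth width, height, area and aspect ratio from a tracked lip contour. The choice of these four measures, the function name and the contour format are assumptions made for the example; the text does not list the features actually extracted.

```python
def lip_geometry(contour):
    """Compute simple geometric features from an ordered lip contour.

    contour: list of (x, y) points around the outer lip, in drawing order
             (hypothetical output format of a lip tracker).
    """
    xs = [p[0] for p in contour]
    ys = [p[1] for p in contour]
    width = max(xs) - min(xs)
    height = max(ys) - min(ys)

    # Enclosed area via the shoelace formula.
    twice_area = 0.0
    n = len(contour)
    for i in range(n):
        x1, y1 = contour[i]
        x2, y2 = contour[(i + 1) % n]
        twice_area += x1 * y2 - x2 * y1
    area = abs(twice_area) / 2.0

    return {
        "width": width,
        "height": height,
        "area": area,
        "aspect": height / width if width else 0.0,
    }


# A diamond-shaped mouth opening, 4 units wide and 2 units high.
features = lip_geometry([(0, 0), (2, 1), (4, 0), (2, -1)])
```

One such feature vector per video frame gives a visual feature track that can be modelled alongside the acoustic features.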

Once the acoustic and visual features are obtained, the next step is to fuse them. Because acoustic and visual speech features are not synchronous, this fusion is not straightforward. We designed a feature fusion system based on direct identification. This system obtains the fused features without losing any information from either the acoustic or visual speech features. The block diagram of the audio-visual speech recognition system is shown in Figure 1.


Figure 1: Block diagram of audio-visual speech recognition system.
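
Direct identification is commonly understood as feature-level (early) fusion, in which acoustic and visual feature vectors are joined and passed to a single recognizer. The sketch below shows one simple way to do this under stated assumptions: the visual track is linearly interpolated up to the acoustic frame rate and the two vectors are concatenated frame by frame. The frame rates (100 and 25 frames per second), the function names and the interpolation step are illustrative assumptions, not the scheme described in the text.

```python
def interpolate_track(frames, src_rate, dst_rate):
    """Linearly resample a list of feature vectors from src_rate to dst_rate."""
    if len(frames) < 2:
        raise ValueError("need at least two frames to interpolate")
    duration = (len(frames) - 1) / src_rate
    n_out = int(round(duration * dst_rate)) + 1
    out = []
    for k in range(n_out):
        pos = k * src_rate / dst_rate          # position in source frames
        i = min(int(pos), len(frames) - 2)     # left neighbour index
        frac = pos - i
        out.append([(1 - frac) * a + frac * b
                    for a, b in zip(frames[i], frames[i + 1])])
    return out


def fuse_features(audio, visual, audio_rate=100, video_rate=25):
    """Concatenate audio and rate-matched visual vectors, frame by frame."""
    visual_up = interpolate_track(visual, video_rate, audio_rate)
    n = min(len(audio), len(visual_up))
    return [list(audio[t]) + visual_up[t] for t in range(n)]


# Nine audio frames (1-D for brevity) and three video frames over 80 ms.
audio = [[float(t)] for t in range(9)]
visual = [[0.0], [4.0], [8.0]]
fused = fuse_features(audio, visual)
```

Concatenation keeps every acoustic and visual coefficient in the fused vector, so nothing is discarded before recognition.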

Our experiments have demonstrated that the audio-visual speech recognition system greatly improves the recognition rate, especially at high noise levels, and outperforms both acoustic-only and visual-only speech recognition systems. This is a promising result, because in most real-world applications, such as the car navigation system, the recognizer must cope with the noise present in the environment. Figure 2 shows that audio-visual speech recognition is more robust to noise.


Figure 2: Recognition rate at various noise levels.
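
For readers who want to run a comparable test across noise levels, the sketch below scales a noise recording so that the mixture reaches a chosen signal-to-noise ratio (SNR) in decibels. It is a generic evaluation helper written under assumptions: the text does not state how the noise levels in Figure 2 were produced, and the function name and test signals are hypothetical.

```python
import math
import random


def mix_at_snr(speech, noise, snr_db):
    """Add noise to speech so that the result has the requested SNR in dB.

    speech, noise: equal-length lists of samples.
    """
    p_speech = sum(s * s for s in speech) / len(speech)
    p_noise = sum(n * n for n in noise) / len(noise)
    # Choose the gain g so that p_speech / (g^2 * p_noise) = 10^(snr_db / 10).
    gain = math.sqrt(p_speech / (p_noise * 10 ** (snr_db / 10)))
    return [s + gain * n for s, n in zip(speech, noise)]


# Illustrative signals: a sine tone as "speech" and seeded Gaussian noise.
rng = random.Random(0)
speech = [math.sin(2 * math.pi * 0.01 * t) for t in range(1000)]
noise = [rng.gauss(0.0, 1.0) for _ in range(1000)]
noisy = mix_at_snr(speech, noise, snr_db=10)
```

Repeating the recognition test on mixtures generated at several SNR values yields a recognition-rate curve of the kind plotted against noise level.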

In the course of this research we have published many papers in this area. Our next step in this project is unconstrained continuous speech recognition for car navigation and assistance. We are currently working on real-time lip tracking and on a continuous speech recognition engine.


Contact Person: Dr AD Cheok
Tel: 6874 6850, Fax: 6879 1103
Email: eleadc@nus.edu.sg