
Sumit Basu
 
Talking Heads:
Articulators
 

An interview with
Sumit Basu


Q: Please provide a brief overview of your lip modeling work and other audiovisual research.

A: Our work in lip modeling was initially driven by an interest in doing audio-visual speech recognition. As we surveyed the field, though, we saw that a great many interesting approaches to fusing audio and visual information were already being developed (more on this later).

However, we did see a significant gap in these developments, and it was at the lip modeling and tracking stage. Almost all of the models were quite simple, and the tracking mechanisms assumed a frontal view (i.e., that the user's head would always be facing the camera). In realistic scenarios, the head is constantly moving in 3D, so the tracker has to contend with this motion. Furthermore, most of the tracking mechanisms depended on fragile features such as contours, which are not very robust to non-uniform lighting or occlusions.

This is where we chose to jump in. We realized that to do audio-visual speech recognition under these conditions, we would need a 3D model, and one strong enough to deal with lighting problems, occlusion, and so on. This is feasible because, despite the large number of muscles that influence the facial region, their motions are strongly correlated. As a result, the subspace of possible lip motions seemed relatively small, and we set out to discover this subspace and then use it for the tracking and synthesis of lip motions.

To accomplish this, we started with a physical model of the lips as a prior and drove it with 3D observations from a real speaker. We then decomposed the resulting deformations of the model to discover the underlying subspace of lip motions, which we modeled as a linear subspace of 10 modes. On top of this subspace we built a tracking algorithm based on maximum a posteriori estimation, which uses whatever statistical features are available (in the sequences on our web page, the color of the lips and skin) to find the best fit of the lip shape given the data. Because the set of possible fits is highly constrained by the subspace, the method is quite robust to noise, erroneous measurements, occlusions, and so on.
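For concreteness, here is a minimal sketch of the subspace-learning step, assuming each training frame has already been flattened into a vector of 3D vertex displacements of the lip mesh. The real system is built on a finite-element lip model driven by 3D measurements; plain PCA is used below only to illustrate the idea of extracting a small number of linear modes.

    # Minimal sketch (not the authors' code): learning a low-dimensional linear
    # subspace of lip deformations with PCA. Each row of `deformations` is one
    # frame's flattened vector of 3D vertex displacements.
    import numpy as np

    def learn_subspace(deformations, n_modes=10):
        """deformations: (n_frames, 3 * n_vertices) array."""
        mean = deformations.mean(axis=0)
        centered = deformations - mean
        # Right singular vectors of the centered data are the deformation modes.
        _, _, vt = np.linalg.svd(centered, full_matrices=False)
        return mean, vt[:n_modes]           # mean shape and (n_modes, 3 * n_vertices)

    def synthesize(mean, modes, coeffs):
        """Reconstruct a lip shape from a point in the low-dimensional subspace."""
        return mean + coeffs @ modes

Moving the coefficients smoothly through the subspace is what makes the same model usable for synthesis as well as tracking.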

The resulting learned space has proved to be extremely powerful: it has allowed us to estimate the 3D shape of the lips quite accurately from 2D information alone. Because our model is in 3D, we can use it for synthesis as well (by moving the parameters through the subspace of lip motions).
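A correspondingly simplified sketch of how a learned subspace lets 2D measurements constrain the 3D shape: with an orthographic camera the problem becomes linear, and a Gaussian prior over the mode coefficients gives a closed-form MAP estimate. The `project` function, the prior, and the point-wise observation model below are illustrative stand-ins, not the color-based tracker described above.

    # Illustrative only: MAP estimate of subspace coefficients from 2D points,
    # assuming an orthographic camera and a zero-mean Gaussian prior on the
    # coefficients. obs2d is an (n_vertices, 2) array of observed positions.
    import numpy as np

    def project(shape3d):
        # (n_vertices, 3) -> (n_vertices, 2): drop depth (orthographic camera)
        return shape3d[:, :2]

    def fit_coeffs(obs2d, mean, modes, prior_var, obs_var=1.0):
        n_modes = modes.shape[0]
        # Design matrix: the 2D projection of each deformation mode, one per column.
        A = np.stack([project(m.reshape(-1, 3)).ravel() for m in modes], axis=1)
        b = (obs2d - project(mean.reshape(-1, 3))).ravel()
        # Ridge-regression form of the MAP solution for a linear-Gaussian model.
        lhs = A.T @ A / obs_var + np.eye(n_modes) / prior_var
        rhs = A.T @ b / obs_var
        return np.linalg.solve(lhs, rhs)

The recovered coefficients index a full 3D shape, which is why 2D observations are enough to estimate the lips' depth once the subspace is known.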

Example audio-visual sequences showing the tracking and synthesis can be seen at http://www.media.mit.edu/~sbasu/lips. In addition, a paper describing our work in detail will appear shortly in the journal Speech Communication, in a special issue on Audio-Visual Speech Perception.
 


Q: What drives your work, theoretically?

A: One of the main tools we use in this work is the Finite Element Method, which allows us to model complex physical structures as an assemblage of many simple structures. The other is the wide range of tools in the field of statistical modeling and estimation.
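As a toy illustration of the finite-element idea (a 1D elastic bar, not the lip model itself), here is how the small stiffness matrices of many simple elements are assembled into one global system and solved; the material constant and loading are arbitrary.

    # Toy finite-element example: a uniform 1D bar built from simple linear
    # elements. Each element contributes a tiny 2x2 stiffness matrix; summing
    # them gives the global system K u = f.
    import numpy as np

    def assemble_bar(n_elements, length=1.0, EA=1.0):
        n_nodes = n_elements + 1
        h = length / n_elements
        k_local = (EA / h) * np.array([[1.0, -1.0], [-1.0, 1.0]])
        K = np.zeros((n_nodes, n_nodes))
        for e in range(n_elements):
            K[e:e + 2, e:e + 2] += k_local    # add this element's contribution
        return K

    # Fix the left end, pull on the right end, and solve for the displacements.
    K = assemble_bar(10)
    f = np.zeros(K.shape[0]); f[-1] = 1.0
    u = np.zeros(K.shape[0])
    u[1:] = np.linalg.solve(K[1:, 1:], f[1:])   # drop the fixed degree of freedom

The same assembly idea, applied to a far more complex 3D structure, is what makes a physically based lip model tractable.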
 

Q: What is the most difficult issue, or issues, that you face when doing your research?

A: Data. Gathering, labeling, and storing the kind of data that is necessary is always a challenge. Even more importantly, finding data that is truly representative is very difficult. My greatest wish in my work is always that I had orders of magnitude more data to work with!
 

Q: What are your visions for the future of this sort of research?

A: On the practical side, I see audio-visual speech recognition solving the "headset mike" problem. At this time, it is not possible to get reasonable speech recognition rates (audio-only) without using a noise-cancelling microphone. Once we (as a community) learn how to correctly model the interrelations between audio and video signals and integrate their information, we will finally move past the cumbersome necessity of a headset. We still have some distance to go, but I firmly believe that we will achieve this goal.

To this end, I hope that the many researchers doing work on audio-visual speech recognition will integrate some of the modeling and tracking methods we have developed. I feel that the robustness of our algorithms to head pose and occlusions could greatly benefit their systems. In particular, the ability to deal with the head in any position will make audio-visual recognizers much less constrained (i.e., the user will be able to move naturally while speaking). It is also useful when the computer/camera is not the one being spoken to, for example when trying to recognize the audio-visual speech of two people having a conversation (i.e., with neither of them facing the camera).

On the science side, I see us finally gaining an understanding of how audio-visual cues are integrated in the brain. Certainly I agree that simply finding a machine model that works will not give us this answer, but I doubt that even that will come so easily. As in other domains, we will have to borrow from the wisdom of evolution to find the right approaches, and by the end I think we will have a greater understanding of the human system's capabilities and limitations.
 


Q: Do you have any comments on related work by others that you consider to be exciting?

A: There is a variety of work by other researchers that I find very interesting. The work of Eric Vatikiotis-Bateson and his colleagues on measuring the correlations/predictability of vocal tract parameters from facial images is very promising. I think his work will be able to tell us a great deal about the limits of information available from visual vs. audio sources, and perhaps some insight on how to integrate them.

The work of Lionel Reveret and Christian Benoit on lip tracking is quite interesting - they have also been working on a 3D approach, and I applaud their efforts! I look forward to seeing Reveret's thesis.

Last, I think the range of work in integrating audio and video cues is very exciting. Of particular interest are the many Boltzmann machine frameworks for integrating the separate cues in parallel Markov chains. Most of this work derives from Michael Jordan (then at MIT, now at the Univ. of California at Berkeley) and his students' pioneering work on generalized Markov models, and includes the work on Boltzmann zippers (David Stork), coupled HMMs (Matt Brand and Nuria Oliver), and others.

One aspect that I think is currently missing from this work, though, is a careful look at the mechanisms of production and the resulting perceptually important features for each modality. Most current approaches treat both lip motions and audio in a frame-based manner (i.e., sampling feature values at a constant rate). However, for the video signal, some of the most important features will be missed under such an approach. For closures, for example, the important thing is not so much that there was a closure, but exactly when it occurred. I think such visual features will need to be dealt with explicitly before we can achieve human-level performance gains when augmenting speech with lip information.
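As a rough sketch of that last point, consider a lip-aperture signal sampled at the video frame rate: frame-based features keep every sample, while an event-based view keeps only the instants at which closures begin. The threshold and frame rate below are made up for illustration.

    # Hypothetical example: extracting closure-onset times from a lip-aperture
    # signal, as opposed to treating every frame's value as an equal feature.
    import numpy as np

    def closure_onsets(aperture, threshold, frame_rate):
        """Times (in seconds) at which the aperture first drops below threshold."""
        closed = aperture < threshold
        onsets = np.flatnonzero(closed[1:] & ~closed[:-1]) + 1
        return onsets / frame_rate

    # Synthetic 30 fps aperture trace with two closures.
    aperture = np.array([5, 4, 3, 0.5, 0.4, 3, 5, 5, 0.6, 0.5, 4, 5], dtype=float)
    print(closure_onsets(aperture, threshold=1.0, frame_rate=30.0))  # [0.1, 0.2667]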



A sample audio-visual sequence demonstrates the performance of Sumit's system. A paper describing the details of the latest techniques is in press, but for now you can view the tracking by clicking on the QuickTime movie below. WARNING: the audio and video tracks are properly synced, but your QuickTime player may disregard this (the INDY, for example, does not bother to keep sync). Also, to really see the motions in detail, Sumit suggests viewing the sequence at normal speed and then again at half speed or slower to catch the subtleties in the motion. For further information on this latest work, please contact Sumit directly at sbasu@media.mit.edu.

QuickTime movie



 
Sumit Basu is a Ph.D. candidate in Electrical Engineering and Computer Science at the MIT Media Lab, in Cambridge, Massachusetts.
He can be reached via email at: sbasu@media.mit.edu or visit his website.
 
