|
|
Q: |
Please provide a brief overview of your lip modeling work and other audiovisual research.
|
|
A: |
Our work in lip modeling was initially driven by an interest in doing
audio-visual speech recognition. As we surveyed the field, though, we
saw that a great many interesting approaches to fusing audio and
visual information were already being developed (more on this later).
However, we did see a significant gap in these developments: the lip
modeling and tracking stage. Almost all of the models
were quite simple, and the tracking mechanisms all assumed a frontal
view (i.e., they assumed that the user's head would always be facing
the camera). In realistic scenarios, the head is moving constantly in
3D, so it is necessary to be able to contend with this when tracking
the lips. Furthermore, most of the tracking mechanisms depended on
fragile features such as contours, which would not be very robust to
non-uniform lighting or occlusions.
This is where we chose to jump in. We realized that to do
audio-visual speech recognition, we would need a 3D model, and one
strong enough to cope with lighting problems, occlusions, and so on.
We were able to do this
because despite the large number of muscles that influence the facial
region, their motions are strongly correlated. As a result, the
subspace of possible lip motions seemed relatively small, and we set
out to discover this subspace and then use it for the tracking and
synthesis of lip motions.
To accomplish this, we started with a physical model of the lips as a
prior and used 3D observations from a real speaker to drive it. We
then decomposed the resulting deformations of the model to discover
the underlying subspace of lip motions, which we modeled as a linear
subspace of 10 modes. Finally, we developed a tracking algorithm
based on maximum a posteriori estimation using
whatever statistical features are available (in the sequences on our
web page, we have used the color of the lips and skin) to track the
lips. In essence, this method finds the best fit of the lip shape
given whatever data is available. Because the set of possible fits is
highly constrained by the subspace, this method is quite robust to
noise, erroneous measurements, occlusions, and so on.
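As a rough illustration of these two steps (not the actual implementation; the variable names, the Gaussian prior on the coefficients, and the simple noise model below are assumptions for illustration), principal component analysis can recover a small set of deformation modes, and a regularized least-squares solve can play the role of the maximum a posteriori fit:
```python
# Illustrative sketch only: learn a low-dimensional linear subspace of lip
# deformations with PCA, then fit a new observation to it with a simple MAP
# criterion (Gaussian prior on the coefficients, Gaussian observation noise).
import numpy as np

def learn_subspace(deformations, n_modes=10):
    """deformations: (n_frames, 3 * n_vertices) stacked lip-mesh displacements."""
    mean = deformations.mean(axis=0)
    centered = deformations - mean
    # Principal modes of variation via SVD.
    _, s, vt = np.linalg.svd(centered, full_matrices=False)
    modes = vt[:n_modes]                        # (n_modes, 3 * n_vertices)
    variances = (s[:n_modes] ** 2) / len(deformations)
    return mean, modes, variances

def map_fit(observation, mean, modes, variances, obs_noise_var=1.0):
    """MAP estimate of the subspace coefficients given a noisy shape observation.

    With a zero-mean Gaussian prior (learned variances) and Gaussian noise,
    the MAP estimate reduces to regularized least squares.
    """
    A = modes.T                                 # (3 * n_vertices, n_modes)
    lhs = A.T @ A / obs_noise_var + np.diag(1.0 / variances)
    rhs = A.T @ (observation - mean) / obs_noise_var
    coeffs = np.linalg.solve(lhs, rhs)
    return coeffs, mean + modes.T @ coeffs      # coefficients and fitted shape
```
Because the fit lives entirely in the low-dimensional coefficient space, a few bad or missing measurements cannot pull the estimated shape far from a plausible lip configuration.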
The resulting learned space has proved to be extremely powerful: it
has allowed us to estimate the 3D shape of the lips quite accurately
using only 2D information. Because our model is in 3D, we can use it
for synthesis as well (by moving the parameters through the subspace
of lip motions).
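As a sketch of how a 3D subspace model lets 2D measurements determine 3D shape (again illustrative only; the affine projection, the assumption that it is known from head-pose tracking, and the regularization are not the system described above), the subspace coefficients can be fit to projected image points and the full 3D shape reconstructed from them:
```python
# Illustrative sketch only: estimate 3D lip shape from 2D image measurements
# by fitting subspace coefficients under an assumed known projection.
import numpy as np

def fit_3d_from_2d(points_2d, P, mean, modes, variances):
    """points_2d: (n_vertices, 2) observed image positions of the mesh vertices.
    P: (2, 3) affine camera projection (assumed known, e.g. from head pose).
    Returns the subspace coefficients and the reconstructed 3D shape."""
    n_modes, dim = modes.shape
    n_vertices = dim // 3
    # Project the mean shape and each deformation mode into the image plane.
    mean_2d = (P @ mean.reshape(n_vertices, 3).T).T.ravel()
    modes_2d = np.stack([
        (P @ m.reshape(n_vertices, 3).T).T.ravel() for m in modes
    ])                                          # (n_modes, 2 * n_vertices)
    # Regularized least squares (Gaussian prior on the coefficients).
    A = modes_2d.T
    lhs = A.T @ A + np.diag(1.0 / variances)
    rhs = A.T @ (points_2d.ravel() - mean_2d)
    coeffs = np.linalg.solve(lhs, rhs)
    shape_3d = (mean + modes.T @ coeffs).reshape(n_vertices, 3)
    return coeffs, shape_3d
```
The same reconstruction step, run on coefficients that are swept through the subspace rather than estimated from images, is what makes the model usable for synthesis.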
Example audio-visual sequences showing the tracking and synthesis can
be seen at
http://www.media.mit.edu/~sbasu/lips.
In addition, a paper
describing our work in detail will appear shortly in the journal
Speech Communication, in a special issue on Audio-Visual Speech
Perception.
|
|
Q: |
What drives your work, theoretically?
|
|
A: |
One of the main tools we use in this work is the Finite Element Method,
which allows us to model complex physical structures as an assemblage
of many simple structures. The other is the wide range of tools in
the field of statistical modeling and estimation.
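As a toy illustration of the finite element idea, the sketch below assembles a global stiffness matrix for a 1D chain of springs; the real lip model uses 3D elements, but the assembly principle (many simple elements summed into one global system) is the same. All values here are placeholders.
```python
# Toy finite-element-style assembly: a chain of identical linear springs.
import numpy as np

def assemble_stiffness(n_nodes, k=1.0):
    """Assemble the global stiffness matrix from per-element contributions."""
    k_element = k * np.array([[1.0, -1.0],
                              [-1.0, 1.0]])     # stiffness of one spring element
    K = np.zeros((n_nodes, n_nodes))
    for i in range(n_nodes - 1):                # add each element into the global matrix
        K[i:i + 2, i:i + 2] += k_element
    return K

# Example: 5-node chain, node 0 fixed, unit force applied at the free end.
K = assemble_stiffness(5)
K_free = K[1:, 1:]                              # remove the fixed degree of freedom
forces = np.array([0.0, 0.0, 0.0, 1.0])
displacements = np.linalg.solve(K_free, forces)
```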
|
|
Q: |
What is the most difficult issue, or issues, that you face when doing your research?
|
|
A: |
Data. Gathering, labeling, and storing the kind of data that is
necessary is always a challenge. Even more importantly, finding data
that is truly representative is very difficult. My greatest wish in
my work is always that I had orders of magnitude more data to work with!
|
|
Q: |
What are your visions for the future of this sort of research?
|
|
A: |
On the practical side, I see audio-visual speech recognition solving
the "headset mike" problem. At this time, it is not possible to get
reasonable speech recognition rates (audio-only) without using a
noise-cancelling microphone. Once we (as a community) learn how to
correctly model the interrelations between audio and video signals and
integrate their information, we will finally move past the cumbersome
necessity of a headset. We still have some distance to go, but I
firmly believe that we will achieve this goal.
To this end, I hope that the many researchers doing work on
audio-visual speech recognition will integrate some of the modeling
and tracking methods we have developed. I feel that the robustness of
our algorithms to head pose and occlusions could greatly benefit their
systems. In particular, the capability to deal with the head being in
any position will allow audio-visual recognizers to be much less
constrained (i.e., the user will be able to move naturally while
speaking). It is also useful when the computer/camera is not the one
being spoken to, as when trying to recognize the audio-visual
speech of two people having a conversation (i.e., neither would be
facing the camera).
On the science side, I see us finally gaining an understanding of how
audio-visual cues are integrated in the brain. Certainly I agree that
simply finding a machine model that works will not give us this
answer, but I doubt that even that will happen so easily. As in
other domains, we will have to borrow from the wisdom of evolution to
find the right approaches, and by the end I think we will have a greater
understanding of the human system's capabilities and limitations.
|
|
Q: |
Do you have any comments on related work by others that you consider to be exciting?
|
|
A: |
There is a variety of work by other researchers that I find very
interesting. The work of Eric Vatikiotis-Bateson
and his colleagues on
measuring the correlations/predictability of vocal tract parameters
from facial images is very promising. I think his work will be able
to tell us a great deal about the limits of the information available from
visual vs. audio sources, and perhaps give us some insight into how to integrate them.
The work of Lionel Reveret
and Christian Benoit on lip tracking is
quite interesting: they have also been working on a 3D approach, and
I applaud their efforts! I look forward to seeing Reveret's thesis.
Last, I think the range of work in integrating audio and video
cues is very exciting. Of particular interest are the many
Boltzmann machine frameworks for integrating the separate cues in
parallel Markov chains. Most of this work derives from
Michael Jordan
(then at MIT, now at the University of California, Berkeley) and his students'
pioneering work in generalized Markov models, and includes the work on
Boltzmann zippers (David Stork),
coupled HMMs (Matt Brand and
Nuria Oliver),
and others. One aspect that I think is currently missing
from this work, though, is looking carefully at the mechanisms of
production and the resulting perceptually important features for each
modality. Most current approaches treat both lip motions and audio in
a frame-based manner (i.e., sampling feature values at a constant
rate). However, for the video signal, some of the most important
features will be missed under such an approach. For closures, for
example, the important thing is not so much that there was a closure,
but exactly when it occurred. I think such visual features will
need to be dealt with explicitly before we can achieve human-level
performance increases when augmenting speech with lip information.
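For readers unfamiliar with these parallel-chain models, here is a hedged sketch of the coupled-HMM idea: two Markov chains, one per modality, whose transition probabilities each depend on both chains' previous states, evaluated with a scaled forward pass. The discrete emissions and parameter shapes are placeholder assumptions, not any particular published system.
```python
# Sketch of a coupled HMM forward pass over the joint (audio, video) state.
import numpy as np

def coupled_forward(obs_a, obs_v, trans_a, trans_v, emit_a, emit_v, prior_a, prior_v):
    """Log-likelihood of paired discrete observation sequences.

    trans_a[i, j, k]: P(audio_t = k | audio_{t-1} = i, video_{t-1} = j)
    trans_v[i, j, l]: P(video_t = l | audio_{t-1} = i, video_{t-1} = j)
    emit_a[k, o], emit_v[l, o]: discrete emission probabilities per chain.
    """
    alpha = np.outer(prior_a * emit_a[:, obs_a[0]],
                     prior_v * emit_v[:, obs_v[0]])  # joint state distribution
    scale = alpha.sum()
    loglik = np.log(scale)
    alpha = alpha / scale
    Na, Nv = len(prior_a), len(prior_v)
    for t in range(1, len(obs_a)):
        new_alpha = np.zeros((Na, Nv))
        for k in range(Na):
            for l in range(Nv):
                # Sum over all previous joint states (i, j) with coupled transitions.
                new_alpha[k, l] = np.sum(alpha * trans_a[:, :, k] * trans_v[:, :, l])
        new_alpha *= np.outer(emit_a[:, obs_a[t]], emit_v[:, obs_v[t]])
        scale = new_alpha.sum()
        loglik += np.log(scale)
        alpha = new_alpha / scale
    return loglik
```
Note that this formulation is still frame-based: both chains advance at a constant sampling rate, which is exactly the limitation discussed above for timing-critical visual events such as closures.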
|
|
A sample audio-visual sequence demonstrates the performance of Sumit's system.
A paper describing the details of the latest techniques is in press,
but for now you can view the tracking by clicking on the QuickTime movie below.
WARNING: the audio and video tracks are properly synced, but
your QuickTime player may disregard this (the INDY, for example, does not bother to keep sync).
Also, to really see the motions in detail, Sumit suggests that after
viewing the sequence at normal speed, you watch it again at half speed
or slower to catch the subtleties in the motion. For further information on this
latest work, please contact Sumit directly at sbasu@media.mit.edu.
Sumit Basu is a Ph.D. candidate in Electrical Engineering and Computer Science at the
MIT Media Lab,
in Cambridge, Massachusetts.
He can be reached via email at
sbasu@media.mit.edu, or you can visit his
website.
|
|