Auditory and visual perception play complementary roles in human interaction. Perception enables people to communicate through verbal (speech and language) and non-verbal (facial expressions, visual gaze, head movements, hand and body gestures) channels. These communication modalities overlap to a large degree, in particular in social contexts. Moreover, the modalities disambiguate each other whenever one of them is weak, ambiguous, or corrupted by perturbations. Human-computer interaction (HCI) has attempted to address these issues, e.g., using smart and portable devices. In HCI the user is in the loop for decision making: images and sounds are recorded purposively in order to optimize their quality with respect to the task at hand.
However, the robustness of HCI based on speech recognition degrades significantly as soon as the microphones are located a few meters away from the user. Similarly, face detection and recognition work well only under controlled lighting conditions and when the cameras are properly oriented towards a person. Altogether, the HCI paradigm cannot easily be extended to less constrained interaction scenarios that involve several users, or whenever it is important to consider the social context.
The PERCEPTION team investigates the fundamental role played by audio and visual perception in human-robot interaction (HRI). The main difference between HCI and HRI is that, while the former is user-controlled, the latter is robot-controlled, namely it is implemented with intelligent robots that take decisions and act autonomously. The mid-term objective of PERCEPTION is to develop computational models, methods, and applications for enabling non-verbal and verbal interactions between people, analyzing their intentions and their dialogue, extracting information, and synthesizing appropriate behaviors, e.g., the robot waves to a person, turns its head towards the dominant speaker, nods, gesticulates, asks questions, gives advice, waits for instructions, etc. The following topics are thoroughly addressed by the team members: audio-visual sound-source separation and localization in natural environments, for example to detect and track moving speakers, inference of temporal models of verbal and non-verbal activities (diarisation), continuous recognition of particular gestures and words, context recognition, and multimodal dialogue.
From 2006 to 2009, R. Horaud was the scientific coordinator of the collaborative European project POP (Perception on Purpose), an interdisciplinary effort to understand visual and auditory perception at the crossroads of several disciplines (computational and biological vision, computational auditory analysis, robotics, and psychophysics). This allowed the PERCEPTION team to launch an interdisciplinary research agenda that has been very active for the last five years. There are very few teams in the world that gather scientific competences spanning computer vision, audio signal processing, machine learning, and human-robot interaction. The fusion of several sensory modalities resides at the heart of the most recent biological theories of perception. Nevertheless, multi-sensor processing is still poorly understood from a computational point of view. So far, audio-visual fusion has mainly been investigated in the framework of speech processing, using close-distance cameras and microphones. The vast majority of these approaches attempt to model the temporal correlation between the auditory signals and the dynamics of lip and facial movements. Our original contribution has been to consider that audio-visual localization and recognition are equally important. We have taken into account the fact that the audio-visual objects of interest live in a three-dimensional physical space, and hence contributed to the emergence of audio-visual scene analysis as a scientific topic in its own right. We proposed several novel statistical approaches based on supervised and unsupervised mixture models. The conjugate mixture model (CMM) is an unsupervised probabilistic model that allows clustering of observations from different modalities (e.g., vision and audio) living in different mathematical spaces. We thoroughly investigated CMM, provided practical resolution algorithms, and studied their convergence properties.
We developed several methods for sound localization using two or more microphones. The Gaussian locally-linear model (GLLiM) is a partially supervised mixture model that maps high-dimensional observations (audio, visual, or concatenated audio-visual vectors) onto low-dimensional manifolds with a partially known structure. This model is particularly well suited for perception because it encodes both observable and unobservable phenomena. A variant of this model, probabilistic piecewise affine mapping, has also been proposed and successfully applied to the problem of sound-source localization and separation. The European project HUMAVIPS (2010-2013), coordinated by R. Horaud, applied audio-visual scene analysis to human-robot interaction.
Stereoscopy is one of the most studied topics in biological and computer vision. Nevertheless, classical approaches to this problem fail to integrate eye/camera vergence. From a geometric point of view, the integration of vergence is difficult because one has to re-estimate the epipolar geometry at every new eye/camera rotation. From an algorithmic point of view, it is not clear how to combine depth maps obtained with different relative eye/camera orientations. Therefore, we addressed the more general problem of binocular vision, which combines the low-level eye/camera geometry, sensor rotations, and practical algorithms based on global optimization. We studied the link between mathematical and computational approaches to stereo (global optimization and Markov random fields) and the brain plausibility of some of these approaches: indeed, we proposed an original mathematical model for the complex cells in visual-cortex areas V1 and V2 that is based on steering Gaussian filters and that admits simple solutions. This addresses the fundamental issue of how local image structure is represented in the brain/computer and how this structure is used for estimating a dense disparity field. The main originality of our work is therefore to address both computational and biological issues within a unifying model of binocular vision. Another equally important problem that remains to be solved is how to integrate binocular depth maps over time. Recently, we addressed this problem and proposed a semi-global optimization framework that starts with sparse yet reliable matches and propagates them over both space and time. The concept of seed-match propagation has since been extended to TOF-stereo fusion.
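As a point of reference for the global optimization methods discussed above, the classical local baseline for binocular disparity estimation is window-based block matching along rectified scanlines. The sketch below is a toy illustration of that baseline, not the team's software; the function name `block_match_disparity` and the synthetic data are assumptions for the example.

```python
import numpy as np

def block_match_disparity(left, right, max_disp, half_win=2):
    """Estimate integer disparities along a pair of rectified scanlines by
    minimising the sum of squared differences (SSD) over a small window.
    Convention: left pixel x matches right pixel x - d."""
    n = left.shape[0]
    disp = np.zeros(n, dtype=int)
    for x in range(half_win, n - half_win):
        ref = left[x - half_win:x + half_win + 1]
        best, best_cost = 0, np.inf
        for d in range(max_disp + 1):
            if x - d - half_win < 0:
                break  # candidate window falls off the scanline
            cand = right[x - d - half_win:x - d + half_win + 1]
            cost = np.sum((ref - cand) ** 2)
            if cost < best_cost:
                best, best_cost = d, cost
        disp[x] = best
    return disp

# Synthetic rectified scanlines: the right view is the left view shifted
# left by 3 pixels, i.e. a constant disparity of 3 away from the borders.
rng = np.random.default_rng(0)
left = rng.normal(size=200)
right = np.roll(left, -3)
disp = block_match_disparity(left, right, max_disp=8)
```

Such local matching propagates errors in textureless regions, which is precisely the limitation that the global and seed-growing formulations above address.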
Audio-visual fusion algorithms necessitate that the two modalities be represented in the same mathematical space. Binaural audition allows the extraction of sound-source localization (SSL) information from the acoustic signals recorded with two microphones. We have developed several methods that perform sound localization in the temporal and spectral domains. If a direct path is assumed, one can exploit the time difference of arrival (TDOA) between two microphones to recover the position of the sound source with respect to the position of the two microphones. The solution is not unique in this case: the sound source lies on a 2D manifold. However, if one further assumes that the sound source lies in a horizontal plane, it is possible to extract the azimuth. We used this approach to predict possible sound locations in order to estimate the direction of a speaker. We also developed a geometric formulation and showed that with four non-coplanar microphones the azimuth and elevation of a single source can be estimated without ambiguity. We also investigated SSL in the spectral domain. This exploits the filtering effects of the head-related transfer function (HRTF): there is a different HRTF for the left and right microphones. The interaural spectral features, namely the ILD (interaural level difference) and IPD (interaural phase difference), can be extracted from the short-time Fourier transforms of the two signals. The sound direction is encoded in these interaural features, but it is not straightforward to make SSL explicit in this case. We proposed a supervised learning formulation that estimates a mapping from interaural spectral features (ILD and IPD) to source directions, using two different setups: audio-motor learning and audio-visual learning. We are currently generalizing this approach to an arbitrary number of microphones.
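The TDOA-to-azimuth step described above can be sketched with the generic far-field, textbook formulation (not the team's implementation; function names and the sign convention are illustrative): the TDOA is estimated from the cross-correlation peak, and the azimuth follows from the assumed horizontal-plane geometry.

```python
import numpy as np

SPEED_OF_SOUND = 343.0  # m/s in air at ~20 degrees Celsius

def tdoa_from_signals(x_left, x_right, fs):
    """Estimate the time difference of arrival between two microphone
    signals from the peak of their cross-correlation. Sign convention
    (illustrative): positive when the sound reaches the left mic first."""
    corr = np.correlate(x_left, x_right, mode="full")
    lag = np.argmax(corr) - (len(x_right) - 1)
    return -lag / fs

def azimuth_from_tdoa(tdoa, mic_distance):
    """Far-field azimuth (radians, measured from broadside) of a source
    assumed to lie in the horizontal plane; without that assumption the
    TDOA only constrains the source to a 2D manifold."""
    s = np.clip(SPEED_OF_SOUND * tdoa / mic_distance, -1.0, 1.0)
    return float(np.arcsin(s))
```

With four non-coplanar microphones, as in the geometric formulation mentioned in the text, the pairwise TDOAs over-constrain the direction and both azimuth and elevation become recoverable without ambiguity.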
For the last decade, one of the most active topics in computer vision has been the visual reconstruction of objects, people, and complex scenes using a multiple-camera setup. The PERCEPTION team has pioneered this field, and by 2006 several team members had published seminal papers on the topic. Recent work has concentrated on the robustness of the 3D reconstructed data, using probabilistic outlier rejection techniques combined with algebraic geometry principles and linear algebra solvers. Subsequently, we proposed to combine 3D representations of shape (meshes) with photometric data. The originality of this work was to represent photometric information as a scalar function over a discrete Riemannian manifold, thus generalizing image analysis to mesh and graph analysis. Manifold equivalents of local-structure detectors and descriptors were developed. The outcome of this pioneering work has been twofold: the formulation of a new research topic now addressed by several teams in the world, and the start of a three-year collaboration with Samsung Electronics. We developed the novel concept of mixed camera systems, combining high-resolution color cameras with low-resolution depth cameras. Together with our start-up company 4D Views Solutions and with Samsung, we developed the first practical depth-color multiple-camera multiple-PC system and the first algorithms to reconstruct high-quality 3D content.
The analysis of articulated shapes has challenged standard computer vision algorithms for a long time. There are two difficulties associated with this problem, namely how to represent articulated shapes and how to devise robust registration and tracking methods. We addressed both difficulties and proposed a novel kinematic representation that integrates concepts from robotics and from the geometry of vision. In 2008 we proposed a method that parameterizes the occluding contours of a shape with its intrinsic kinematic parameters, such that there is a direct mapping between observed image features and joint parameters. This deterministic model was motivated by the use of 3D data gathered with multiple cameras. However, the method was not robust to various data flaws and could not achieve state-of-the-art results on standard datasets. Subsequently, we addressed the problem using probabilistic generative models. We formulated articulated-pose estimation as a maximum-likelihood problem with missing data and devised several tractable algorithms. We proposed several expectation-maximization procedures applied to various articulated shapes: human bodies, hands, etc. In parallel, we proposed to segment and register articulated shapes represented with graphs by embedding these graphs using the spectral properties of graph Laplacians. This turned out to be a very original approach that has since been followed by many other researchers in computer vision and computer graphics.
Robotic Demonstration at ICMI'15. The PERCEPTION team was present at the ACM International Conference on Multimodal Interaction, ICMI'15 (November 2015, Seattle WA, USA), with the demonstration A Distributed Architecture for Interacting with NAO. This software package enables robot programming using various languages, e.g., C, C++, Matlab, and Python. The distributed architecture is released as the NAOLab open-source software package. The development of NAOLab is part of PERCEPTION's participation in EU FP7 projects and is funded by the STREP project Embodied Audition for RobotS (EARS) and the ERC Advanced Grant Vision and Hearing in Action (VHIA).
The Xerox Foundation University Affairs Committee (UAC) awarded Radu Horaud and Florence Forbes (EPI MISTIS) a three-year grant, Advanced and Scalable Graph Signal Processing Techniques (2015-2017), in collaboration with Arijit Biswas and Anirban Mondal, research scientists at Xerox Research Center India (XRCI), Bangalore. Information about these awards is available on page 9 of this document, available online: http://
MOOC on Binaural Hearing for Robots. In May-June 2015 Radu Horaud taught a five-hour MOOC dealing with the fundamental principles of robot hearing, from binaural signal processing to robotic implementations. The MOOC content is available at https://
Vincent Drouard (PhD student) and his co-authors received the "Best Student Paper Award" (second place) at IEEE ICIP'15 for the paper Head Pose Estimation via High-Dimensional Regression.
Dionyssos Kounades-Bastian (PhD student) and his co-authors received the "Best Student Paper Award" at IEEE WASPAA'15 for the paper A Variational EM Algorithm for the Separation of Moving Sound Sources.
Functional Description
A library that associates auditory cues with 3D locations (points) and estimates the emitting state of each input location. There are two main assumptions:
The 3D locations are valid during the acquisition interval related to the audio cues
The 3D locations are the only possible locations for the sound sources, no new locations will be created in this module
The software also provides a multimodal fusion library.
Participants: Xavier Alameda-Pineda, Antoine Deleforge, Jordi Sanchez-Riera and Radu Horaud
Contact: Radu Horaud
Functional Description
The SBM ("Supervised Binaural Mapping") Matlab toolbox contains a set of functions and scripts for supervised binaural sound-source separation and localization. The approach consists of learning the acoustic space of a system from a set of white-noise measurements. Once the acoustic space has been learned, it can be used to efficiently localize one or several natural sound sources, such as speech, and to separate their signals.
Participants: Antoine Deleforge, Soraya Arias and Radu Horaud
Contact: Radu Horaud
URL: https://
Functional Description
The team has developed two audiovisual (AV) robot heads: the POPEYE head and the NAO stereo head. Both are equipped with a binocular vision system and with four microphones. The software modules comprise stereo matching and reconstruction, sound-source localization and audio-visual fusion. POPEYE has been developed within the European project POP in collaboration with the project-team MISTIS and with two other POP partners: the Speech and Hearing group of the University of Sheffield and the Institute for Systems and Robotics of the University of Coimbra. The NAO stereo head was developed under the European project HUMAVIPS in collaboration with Aldebaran Robotics (which manufactures the humanoid robot NAO) and with the University of Bielefeld, the Czech Technical Institute, and IDIAP. The software modules that we develop are compatible with both these robot heads.
Contact: Radu Horaud
Functional Description
We developed a multiple-camera platform composed of both high-definition color cameras and low-resolution depth cameras. This platform combines the advantages of the two camera types: on one side, depth (time-of-flight) cameras provide coarse, low-resolution 3D scene information; on the other side, depth and color cameras can be combined so as to provide high-resolution 3D scene reconstruction and high-quality rendering of textured surfaces. The software package developed during the period 2011-2015 contains the calibration of TOF cameras, alignment between TOF and color cameras, TOF-stereo fusion, and image-based rendering. These software developments were performed in collaboration with the Samsung Advanced Institute of Technology, Seoul, Korea. The multi-camera platform and the basic software modules are products of 4D Views Solutions SAS, a start-up company spun off from the PERCEPTION group.
Participants: Quentin Pelorson, Georgios Evangelidis, Soraya Arias, Radu Horaud.
Contact: Radu Horaud
Functional Description
NAOLab is a middleware for the development of robotic applications in C, C++, Python, and Matlab, using the humanoid robot NAO networked with a PC. NAOLab enables the joint use of NAO's on-board computing resources and external resources. More precisely, it allows the development of applications that combine embedded libraries, e.g., motion control, image/sound acquisition and transmission, etc., with external toolboxes, e.g., OpenCV, Matlab toolboxes, etc. The NAOLab toolbox has the following characteristics. The middleware complexity is transparent to the users. A user-friendly interface is provided through C++ and Python libraries, extended with mex functions for Matlab. This enables the development of sophisticated audio and visual processing algorithms without the stringent constraints of the NAOqi SDK. NAOLab and NAOqi share the same modular approach, namely there are three categories of modules: vision, audio, and motion. An interface (vision, audio, motion) is associated with each NAOqi module. Each interface deals with sensor-data access and actuator control. The role of these interfaces is twofold: (i) to feed the sensor data into a memory space that is subsequently shared with existing software or with software under development, and (ii) to send to the robot commands generated by the external modules.
Participants: Fabien Badeig, Quentin Pelorson, Soraya Arias, Radu Horaud.
Contact: Radu Horaud
We addressed the problem of localizing audio sources using binaural measurements. After proposing an unsupervised method, we proposed a supervised formulation that simultaneously localizes multiple sources at different locations. The approach is intrinsically efficient because, contrary to prior work, it relies neither on source separation nor on monaural segregation. The method starts with a training stage that establishes a locally-linear Gaussian regression between the directional coordinates of all the sources and the auditory features extracted from binaural measurements. While fixed-length wide-spectrum sounds (white noise) are used for training to reliably estimate the model parameters, we show that the testing (localization) can be extended to variable-length sparse-spectrum sounds (such as speech), thus enabling a wide range of realistic applications. Indeed, we demonstrate that the method can be used for audio-visual fusion, namely to map speech signals onto images and hence to spatially align the audio and visual modalities, thus enabling discrimination between speaking and non-speaking faces. We released a novel corpus of real-room recordings that allows quantitative evaluation of the co-localization method in the presence of one or two sound sources. Experiments demonstrate increased accuracy and speed relative to several state-of-the-art methods. More recently the method has been extended to an arbitrary number of microphones. Moreover, we have started to develop a method that extracts the direct path of an acoustic wave in order to enable robust audio-source localization in reverberant environments.
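To convey the flavor of the training and localization stages, here is a heavily simplified, deterministic stand-in for locally-linear regression: hard clustering of the feature space followed by one affine least-squares map per cluster. The actual method is a probabilistic mixture (GLLiM) with soft assignments and Gaussian noise models; all names below are illustrative.

```python
import numpy as np

def fit_locally_linear(X, Y, n_clusters=4, n_iter=20, seed=0):
    """Toy locally-linear regression from features X (N, D) to targets
    Y (N, L): k-means clustering in X-space, then one affine
    least-squares map per cluster."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), n_clusters, replace=False)].copy()
    for _ in range(n_iter):
        labels = np.argmin(((X[:, None] - centers[None]) ** 2).sum(-1), axis=1)
        for k in range(n_clusters):
            if np.any(labels == k):
                centers[k] = X[labels == k].mean(axis=0)
    labels = np.argmin(((X[:, None] - centers[None]) ** 2).sum(-1), axis=1)
    Xh = np.hstack([X, np.ones((len(X), 1))])  # homogeneous coordinates
    maps = []
    for k in range(n_clusters):
        m = labels == k
        if not np.any(m):
            m = np.ones(len(X), bool)  # empty cluster: fall back to global fit
        W, *_ = np.linalg.lstsq(Xh[m], Y[m], rcond=None)
        maps.append(W)
    return centers, maps

def predict_locally_linear(x, centers, maps):
    """Apply the affine map of the nearest cluster center."""
    k = np.argmin(((centers - x) ** 2).sum(-1))
    return np.append(x, 1.0) @ maps[k]
```

In the binaural setting, X would hold interaural spectral features and Y source directions; the white-noise training sounds mentioned in the text serve to populate X densely across the feature space.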
Websites:
https://
We addressed the problem of separating audio sources from time-varying convolutive mixtures. We proposed an unsupervised probabilistic framework based on the local complex-Gaussian model combined with non-negative matrix factorization. The time-varying mixing filters are modeled by a continuous temporal stochastic process. This model extends the case of static filters, which corresponds to static audio sources. While static filters can be learned in advance, time-varying filters cannot, and the problem is therefore more complex. We presented a variational expectation-maximization (VEM) algorithm that employs a Kalman smoother to estimate the time-varying mixing matrix and that jointly estimates the source parameters. The sound sources are then separated by Wiener filters constructed with the estimators provided by the VEM algorithm. Extensive experiments on simulated data show that the proposed method outperforms a block-wise version of a state-of-the-art baseline method. This work is part of the PhD topic of Dionyssos Kounades-Bastian and is conducted in collaboration with Sharon Gannot (Bar-Ilan University) and Xavier Alameda-Pineda (University of Trento). It received the best student paper award at WASPAA'15. An extended version has been submitted to IEEE Transactions on Audio, Speech, and Language Processing.
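The NMF building block of the source model can be illustrated with the classical multiplicative updates for the Euclidean cost. This is a simplification: the local complex-Gaussian model is usually paired with the Itakura-Saito divergence, and the sketch below (illustrative names, synthetic data) only shows the low-rank spectrogram factorization, not the VEM/Kalman machinery.

```python
import numpy as np

def nmf(V, rank, n_iter=500, seed=0, eps=1e-9):
    """Factorise a non-negative matrix V (F, T) as W @ H with
    W (F, rank) >= 0 and H (rank, T) >= 0, using Lee & Seung's
    multiplicative updates for the Euclidean cost. In the separation
    model above, V would be a power spectrogram and each source
    spectrogram would be modelled by such a low-rank factorisation."""
    rng = np.random.default_rng(seed)
    W = rng.random((V.shape[0], rank)) + eps
    H = rng.random((rank, V.shape[1])) + eps
    for _ in range(n_iter):
        H *= (W.T @ V) / (W.T @ W @ H + eps)  # update activations
        W *= (V @ H.T) / (W @ H @ H.T + eps)  # update spectral templates
    return W, H
```

The multiplicative form guarantees that W and H stay non-negative, which is what makes the factors interpretable as spectral templates and their activations over time.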
Any multi-party conversation system benefits from speaker diarization, that is, the assignment of speech signals to participants. More generally, in HRI and HCI scenarios it is important to recognize the speaker over time. We propose to address speaker diarization and speaker recognition using both audio and visual data. We cast the diarization problem into a tracking formulation whereby the active speaker is detected and tracked over time. A probabilistic tracker exploits the spatial coincidence of visual and auditory observations and infers a single latent variable that represents the identity of the active speaker. Visual and auditory observations are fused using our recently developed weighted-data mixture model, while several options for speaking-turn dynamics are handled by a multi-case transition model. The modules that translate raw audio and visual data into image observations are also described in detail. The performance of the proposed trackers is tested on challenging datasets that are available from recent contributions, which are used as baselines for comparison. Currently we are developing a variational framework for the on-line tracking of multiple persons.
Websites:
https://
Head pose estimation is an important task because it provides information about the cognitive interactions that are likely to occur. Estimating the head pose is intimately linked to face detection. We addressed the problem of head pose estimation with three degrees of freedom (pitch, yaw, roll) from a single image and in the presence of face detection errors. Pose estimation is formulated as a high-dimensional to low-dimensional mixture of linear regressions. We propose a method that maps HOG-based descriptors, extracted from face bounding boxes, to corresponding head poses. To account for errors in the observed bounding-box position, we learn regression parameters such that a HOG descriptor is mapped onto the union of a head pose and an offset, where the latter optimally shifts the bounding box towards the actual position of the face in the image. The performance of the proposed method is assessed on publicly available datasets. The experiments that we carried out show that a relatively small number of locally-linear regression functions is sufficient to deal with the non-linear mapping problem at hand. Comparisons with state-of-the-art methods show that our method outperforms several other techniques. This work is part of the PhD of Vincent Drouard and received the best student paper award (second place) at IEEE ICIP'15. We are currently investigating a temporal extension of this model.
Website:
We addressed the problem of range-stereo fusion for the construction of high-resolution depth maps. In particular, we combine low-resolution depth data with high-resolution stereo data in a maximum a posteriori (MAP) formulation. Unlike existing schemes that build on MRF optimizers, we infer the disparity map from a series of local energy minimization problems that are solved hierarchically, by growing sparse initial disparities obtained from the depth data. The accuracy of the method is not compromised, owing to three properties of the data term in the energy function. Firstly, it incorporates a new correlation function that is capable of providing refined correlations and disparities via sub-pixel correction. Secondly, the correlation scores rely on an adaptive cost aggregation step based on the depth data. Thirdly, the stereo and depth likelihoods are adaptively fused based on the scene texture and camera geometry. These properties lead to a more selective growing process which, unlike previous seed-growing methods, avoids the tendency to propagate incorrect disparities. The proposed method gives rise to an intrinsically efficient algorithm, which runs at 3 FPS on 2.0 MP images on a standard desktop computer. The strong performance of the new method is established both by quantitative comparisons with state-of-the-art methods and by qualitative comparisons using real depth-stereo datasets. This work is funded by the ANR project MIXCAM.
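The sub-pixel correction mentioned above is commonly implemented by fitting a parabola through the matching costs at the winning integer disparity and its two neighbours. The snippet below is a generic sketch of that standard trick, not the project code; the function name is illustrative.

```python
def subpixel_refine(costs, d):
    """Refine an integer disparity d to sub-pixel accuracy by fitting a
    parabola through the matching costs at d-1, d and d+1, and returning
    the abscissa of the parabola's minimum. `costs` is indexable by
    integer disparity; d must be an interior index."""
    c_m, c_0, c_p = costs[d - 1], costs[d], costs[d + 1]
    denom = c_m - 2.0 * c_0 + c_p
    if denom <= 0:  # flat or inverted curvature: keep the integer disparity
        return float(d)
    return d + 0.5 * (c_m - c_p) / denom

# Example: quadratic cost whose true minimum lies at disparity 3.25.
costs = [(d - 3.25) ** 2 for d in range(8)]
best = min(range(1, 7), key=lambda d: costs[d])  # integer winner: 3
refined = subpixel_refine(costs, best)           # -> 3.25
```

For a truly quadratic cost the refinement is exact; for real correlation curves it typically reduces the disparity quantization error well below one pixel.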
Website:
As an extension of our work on high-dimensional regression, we addressed the problem of analyzing hyper-spectral data. In particular, we addressed the problem of recovering physical properties (parameters) from hyper-spectral low-resolution images, i.e., at large planetary scales. This involves resolving inverse problems, which can be addressed within machine learning, with the advantage that, once a relationship between physical parameters and spectra has been established in a data-driven fashion, the learned relationship can be used to estimate physical parameters for new hyper-spectral observations. Within this framework, we propose a spatially-constrained and partially-latent regression method which maps high-dimensional inputs (hyper-spectral images) onto low-dimensional responses (physical parameters such as the local chemical composition of the soil). The proposed regression model comprises two key features. Firstly, it combines a Gaussian mixture of locally-linear mappings (GLLiM) with a partially-latent response model. While the former makes high-dimensional regression tractable, the latter enables dealing with physical parameters that cannot be observed or, more generally, with data contaminated by experimental artifacts that cannot be explained with noise models. Secondly, spatial constraints are introduced into the model through a Markov random field (MRF) prior, which provides a spatial structure to the Gaussian-mixture hidden variables. Experiments conducted on a database of remotely sensed observations collected from planet Mars by the Mars Express orbiter demonstrate the effectiveness of the proposed model.
The team's expertise in latent-variable mixture models was applied to the problem of adapting an acoustic-articulatory model of a reference speaker to the voice of another speaker, using a limited amount of audio-only data. In the context of pronunciation training, a virtual talking head displaying the internal speech articulators (e.g., the tongue) could be automatically animated by means of such a model using only the speaker's voice. In this study, the articulatory-acoustic relationship of the reference speaker is modeled by a Gaussian mixture model (GMM). To address the speaker adaptation problem, we propose a new framework called cascaded Gaussian mixture regression (C-GMR) and derive two implementations. The first one, referred to as Split-C-GMR, is a straightforward chaining of two distinct GMRs: one mapping the acoustic features of the source speaker into the acoustic space of the reference speaker, and the other estimating the articulatory trajectories with the reference model. In the second implementation, referred to as Integrated-C-GMR, the two mapping steps are tied together in a single probabilistic model. For this latter model, we present the full derivation of the exact EM training algorithm, which explicitly exploits the missing-data methodology of machine learning. Other adaptation schemes based on maximum a posteriori (MAP) estimation, maximum likelihood linear regression (MLLR), and direct cross-speaker acoustic-to-articulatory GMR are also investigated. Experiments conducted on two speakers with different amounts of adaptation data show the interest of the proposed C-GMR techniques. This work was done in collaboration with Thomas Hueber and Gérard Bailly from Gipsa Lab, and with Xavier Alameda-Pineda from the University of Trento, a former team member.
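The GMR step at the heart of both C-GMR variants computes the conditional expectation of the output variables given the input ones under a joint GMM. A minimal numpy sketch of this conditional-expectation step follows (illustrative names; the actual C-GMR cascades two such regressions and learns the GMM by EM):

```python
import numpy as np

def gmr_predict(x, weights, means, covs, dx):
    """Gaussian mixture regression: given the parameters of a GMM fitted
    on joint vectors [x; y] (weights (K,), means (K, dx+dy), covariances
    covs (K, dx+dy, dx+dy)), return the conditional expectation E[y | x]."""
    K = len(weights)
    resp = np.empty(K)
    cond_means = np.empty((K, means.shape[1] - dx))
    for k in range(K):
        mu_x, mu_y = means[k, :dx], means[k, dx:]
        S_xx = covs[k, :dx, :dx]
        S_yx = covs[k, dx:, :dx]
        diff = x - mu_x
        sol = np.linalg.solve(S_xx, diff)
        # unnormalised responsibility of component k for the observed x
        log_n = -0.5 * (diff @ sol + np.log(np.linalg.det(S_xx))
                        + dx * np.log(2.0 * np.pi))
        resp[k] = weights[k] * np.exp(log_n)
        # conditional mean of y given x under component k
        cond_means[k] = mu_y + S_yx @ sol
    resp /= resp.sum()
    return resp @ cond_means
```

In the Split-C-GMR chaining described above, the output of a first such regression (source acoustics to reference acoustics) would be fed as `x` into a second one (reference acoustics to articulatory trajectories).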
In 2015 we started a collaboration with Xerox Research Center India (XRCI), Bangalore. This three-year collaboration (2015-2017) is funded by a grant awarded by the Xerox Foundation University Affairs Committee (UAC) and the topic of the project is Advanced and Scalable Graph Signal Processing Techniques. The work is done in collaboration with EPI MISTIS and our Indian collaborators are Arijit Biswas and Anirban Mondal.
Type: ANR BLANC
Duration: March 2014 - February 2016
Coordinator: Radu Horaud
Partners: 4D View Solutions SAS
Abstract: Humans have an extraordinary ability to see in three dimensions, thanks to their sophisticated binocular vision system. While both biological and computational stereopsis have been thoroughly studied for the last fifty years, film and TV methodologies and technologies have exclusively used 2D image sequences, including the very recent 3D movie productions that use two image sequences, one for each eye. This state of affairs is due to two fundamental limitations: it is difficult to obtain 3D reconstructions of complex scenes, and glass-free multi-view 3D displays, which are likely to need real 3D content, are still under development. The objective of MIXCAM is to develop novel scientific concepts and associated methods and software for producing live 3D content for glass-free multi-view 3D displays. MIXCAM will combine (i) theoretical principles underlying computational stereopsis, (ii) multiple-camera reconstruction methodologies, and (iii) active-light sensor technology in order to develop a complete content-production and -visualization methodological pipeline, as well as an associated proof-of-concept demonstrator implemented on a multiple-sensor/multiple-PC platform supporting real-time distributed processing. MIXCAM plans to develop an original approach based on methods that combine color cameras with time-of-flight (TOF) cameras: TOF-stereo robust matching, accurate and efficient 3D reconstruction, realistic photometric rendering, real-time distributed processing, and the development of an advanced mixed-camera platform. The MIXCAM consortium is composed of two French partners (Inria and 4D View Solutions). The MIXCAM partners will develop scientific software that will be demonstrated using a prototype of a novel platform, developed by 4D Views Solutions, which will be available at Inria, thus facilitating scientific and industrial exploitation.
Title: Embodied Audition for RobotS
Program: FP7
Duration: January 2014 - December 2016
Coordinator: Friedrich-Alexander-Universität Erlangen-Nürnberg
Partners:
Aldebaran Robotics (France)
Ben-Gurion University of the Negev (Israel)
Friedrich-Alexander-Universität Erlangen-Nürnberg (Germany)
Imperial College London (United Kingdom)
Humboldt-Universität zu Berlin (Germany)
Inria contact: Radu Horaud
The success of future natural intuitive human-robot interaction (HRI) will critically depend on how responsive the robot will be to all forms of human expressions and how well it will be aware of its environment. With acoustic signals distinctively characterizing physical environments and speech being the most effective means of communication among humans, truly humanoid robots must be able to fully extract the rich auditory information from their environment and to use voice communication as much as humans do. While vision-based HRI is well developed, current limitations in robot audition do not allow for such an effective, natural acoustic human-robot communication in real-world environments, mainly because of the severe degradation of the desired acoustic signals due to noise, interference and reverberation when captured by the robot's microphones. To overcome these limitations, EARS will provide intelligent 'ears' with close-to-human auditory capabilities and use them for HRI in complex real-world environments. Novel microphone arrays and powerful signal processing algorithms shall be able to localise and track multiple sound sources of interest and to extract and recognize the desired signals. After fusion with robot vision, embodied robot cognition will then derive HRI actions and knowledge on the entire scenario, and feed this back to the acoustic interface for further auditory scene analysis. As a prototypical application, EARS will consider a welcoming robot in a hotel lobby offering all the above challenges. Representing a large class of generic applications, this scenario is of key interest to industry and, thus, a leading European robot manufacturer will integrate EARS's results into a robot platform for the consumer market and validate it. In addition, the provision of open-source software and an advisory board with key players from the relevant robot industry should help to make EARS a turnkey project for promoting audition in the robotics world.
Title: Vision and Hearing in Action
Program: FP7
Type: ERC
Duration: February 2014 - January 2019
Coordinator: Inria
Inria contact: Radu Horaud
The objective of VHIA is to elaborate a holistic computational paradigm of perception and of perception-action loops. We plan to develop a completely novel twofold approach: (i) learn from mappings between auditory/visual inputs and structured outputs, and from sensorimotor contingencies, and (ii) execute perception-action interaction cycles in the real world with a humanoid robot. VHIA will achieve a uniquely fine coupling between methodological findings and proof-of-concept implementations using the consumer humanoid NAO, manufactured in Europe. The proposed multimodal approach stands in strong contrast with current computational paradigms influenced by unimodal biological theories, which have hypothesized a modular view postulating quasi-independent and parallel perceptual pathways in the brain. VHIA also takes a radically different view from today's audiovisual fusion models, which rely on clean speech signals and on accurate frontal images of faces; these models assume that videos and sounds are recorded with hand-held or head-mounted sensors, and hence that a human in the loop intentionally supervises perception and interaction. Our approach deeply contradicts the belief that complex and expensive humanoids (often manufactured in Japan) are required to implement research ideas. VHIA's methodological program addresses extremely difficult issues: how to build a joint audiovisual space from heterogeneous, noisy, ambiguous, and physically different visual and auditory stimuli; how to model seamless interaction; how to deal with high-dimensional input data; and how to achieve robust and efficient human-humanoid communication through a well-thought-out tradeoff between offline training and online execution. VHIA bets on the high-risk idea that in the coming decades social robots will have a considerable economic impact: there will be millions of humanoids in our homes, schools, and offices, able to communicate naturally with us.
Professor Sharon Gannot, Bar Ilan University, Tel Aviv, Israel,
Professor Yoav Schechner, Technion, Haifa, Israel,
Dr. Miles Hansard, Queen Mary University London,
Dr. Thomas Hueber, Gipsa Lab, CNRS, Grenoble,
Professor Daniel Gatica Perez, IDIAP Institute, Martigny, Switzerland,
Professor Nicu Sebe, University of Trento, Trento, Italy,
Professor Adrian Raftery, University of Washington, Seattle, USA,
Dr. Zhengyou Zhang, Microsoft, Redmond WA, USA.
Professor Sharon Gannot (Bar Ilan University), February and October 2015.
Dr. Romain Sérizel (Telecom Paris Tech), February 2015.
Dr. Christine Evers (Imperial College), March 2015.
Dr. Xavier Alameda-Pineda (University of Trento), November 2015.
Radu Horaud was program co-chair of ACM ICMI'15, November 2015, Seattle WA, USA.
Radu Horaud is a member of the following editorial boards:
advisory board member of the International Journal of Robotics Research, Sage,
associate editor of the International Journal of Computer Vision, Kluwer, and
area editor of Computer Vision and Image Understanding, Elsevier.
Radu Horaud gave an invited talk at the IEEE ICCV Workshop on 3D Reconstruction and Understanding with Video and Sound, Santiago de Chile, Chile, 17 December 2015.
E-learning: MOOC on Binaural Hearing for Robots, May-June 2015, 5 hours over five weeks. Teacher: Radu Horaud.
Tutorial: Embodied Audition for Robots at the European Signal Processing Conference (EUSIPCO), Nice, France, 31 August 2015, 3 hours. Teachers: Radu Horaud, Heinrich Loellmann (Friedrich-Alexander Universität Erlangen) and Christine Evers (Imperial College London).
PhD in progress: Israel Dejene Gebru, October 2013, Radu Horaud and Sileye Ba.
PhD in progress: Dionyssos Kounades-Bastian, October 2013, Radu Horaud and Laurent Girin.
PhD in progress: Vincent Drouard, October 2014, Radu Horaud and Sileye Ba.
PhD in progress: Benoit Massé, October 2014, Radu Horaud and Sileye Ba.
PhD in progress: Stéphane Lathuilière, October 2014, Radu Horaud.
PhD in progress: Yutong Ban, October 2015, Radu Horaud.