A Geometric Approach to Sound Source Localization from Time-Delay Estimates

PERCEPTION Interpretation and Modelling of Images and Videos

Vision, perception and multimedia interpretation

Perception, Cognition and Interaction

http://team.inria.fr/perception Laboratoire Jean Kuntzmann (LJK) Creation of the Team: 2006 September 01, updated into Project-Team: 2008 January 01 Project-Team A3.4. - Machine learning and statistics A5.1. - Human-Computer Interaction A5.3. - Image processing and analysis A5.4. - Computer vision A5.7. - Audio modeling and processing A5.10.2. - Perception A5.10.5. - Robot interaction (with the environment, humans, other robots) A9.2. - Machine learning A9.5. - Robotics B5.6. - Robotic systems Radu Patrice Horaud Chercheur

Grenoble

Team leader, Inria, Senior Researcher oui Xavier Alameda-Pineda Chercheur

Grenoble

Inria, Researcher Xiaofei Li Chercheur

Grenoble

Inria, Starting Research Position Pablo Mesejo Santiago Chercheur

Grenoble

Inria, Starting Research Position Yutong Ban PhD

Grenoble

Inria Israel Dejene Gebru PhD

Grenoble

Inria, until Jan 2017 Guillaume Delorme PhD

Grenoble

Inria, from Sep 2017 Vincent Drouard PhD

Grenoble

Inria Sylvain Guy PhD

Grenoble

Univ Grenoble Alpes, from Oct 2017 Dionyssos Kounades PhD

Grenoble

Inria, until Mar 2017 Stephane Lathuiliere PhD

Grenoble

Inria Benoit Masse PhD

Grenoble

Inria Dionyssos Kounades Technique

Grenoble

Inria, from Apr 2017 until Sep 2017 Bastien Mourgue Technique

Grenoble

Inria Guillaume Sarrazin Technique

Grenoble

Inria Guillaume Delorme Stagiaire

Grenoble

Inria, from Apr 2017 until Aug 2017 Divya Grover Stagiaire

Grenoble

Inria, from Mar 2017 until Aug 2017 Sylvain Guy Stagiaire

Grenoble

Inria, from Feb 2017 until Jul 2017 Duc Anh Luu Stagiaire

Grenoble

Inria, from Jul 2017 until Sep 2017 Nathalie Gillot Assistant

Grenoble

Inria Sharon Gannot Visiteur

Grenoble

Bar Ilan University, from Feb 2017 until Oct 2017 Oscar David Gomez Lopez Visiteur

Grenoble

University of Granada, from Dec 2017 Laurent Girin CollaborateurExterieur

Grenoble

Institut Polytechnique de Grenoble oui Overall Objectives Overall Objectives

Auditory and visual perception play a complementary role in human interaction. Perception enables people to communicate based on verbal (speech and language) and non-verbal (facial expressions, visual gaze, head movements, hand and body gesturing) communication. These communication modalities have a large degree of overlap, in particular in social contexts. Moreover, the modalities disambiguate each other whenever one of the modalities is weak, ambiguous, or corrupted by various perturbations. Human-computer interaction (HCI) has attempted to address these issues, e.g., using smart & portable devices. In HCI the user is in the loop for decision taking: images and sounds are recorded purposively in order to optimize their quality with respect to the task at hand.

However, the robustness of HCI based on speech recognition degrades significantly as the microphones are located a few meters away from the user. Similarly, face detection and recognition work well under limited lighting conditions and if the cameras are properly oriented towards a person. Altogether, the HCI paradigm cannot be easily extended to less constrained interaction scenarios which involve several users and whenever is important to consider the social context.

The PERCEPTION team investigates the fundamental role played by audio and visual perception in human-robot interaction (HRI). The main difference between HCI and HRI is that, while the former is user-controlled, the latter is robot-controlled, namely it is implemented with intelligent robots that take decisions and act autonomously. The mid term objective of PERCEPTION is to develop computational models, methods, and applications for enabling non-verbal and verbal interactions between people, analyze their intentions and their dialogue, extract information and synthesize appropriate behaviors, e.g., the robot waves to a person, turns its head towards the dominant speaker, nods, gesticulates, asks questions, gives advices, waits for instructions, etc. The following topics are thoroughly addressed by the team members: audio-visual sound-source separation and localization in natural environments, for example to detect and track moving speakers, inference of temporal models of verbal and non-verbal activities (diarisation), continuous recognition of particular gestures and words, context recognition, and multimodal dialogue.

Video: https://team.inria.fr/perception/demos/nao-video/

Research Program Audio-Visual Scene Analysis

From 2006 to 2009, R. Horaud was the scientific coordinator of the collaborative European project POP (Perception on Purpose), an interdisciplinary effort to understand visual and auditory perception at the crossroads of several disciplines (computational and biological vision, computational auditory analysis, robotics, and psychophysics). This allowed the PERCEPTION team to launch an interdisciplinary research agenda that has been very active for the last five years. There are very few teams in the world that gather scientific competences spanning computer vision, audio signal processing, machine learning and human-robot interaction. The fusion of several sensorial modalities resides at the heart of the most recent biological theories of perception. Nevertheless, multi-sensor processing is still poorly understood from a computational point of view. In particular and so far, audio-visual fusion has been investigated in the framework of speech processing using close-distance cameras and microphones. The vast majority of these approaches attempt to model the temporal correlation between the auditory signals and the dynamics of lip and facial movements. Our original contribution has been to consider that audio-visual localization and recognition are equally important. We have proposed to take into account the fact that the audio-visual objects of interest live in a three-dimensional physical space and hence we contributed to the emergence of audio-visual scene analysis as a scientific topic in its own right. We proposed several novel statistical approaches based on supervised and unsupervised mixture models. The conjugate mixture model (CMM) is an unsupervised probabilistic model that allows to cluster observations from different modalities (e.g., vision and audio) living in different mathematical spaces , . We thoroughly investigated CMM, provided practical resolution algorithms and studied their convergence properties. We developed several methods for sound localization using two or more microphones . The Gaussian locally-linear model (GLLiM) is a partially supervised mixture model that allows to map high-dimensional observations (audio, visual, or concatenations of audio-visual vectors) onto low-dimensional manifolds with a partially known structure . This model is particularly well suited for perception because it encodes both observable and unobservable phenomena. A variant of this model, namely probabilistic piecewise affine mapping has also been proposed and successfully applied to the problem of sound-source localization and separation . The European projects HUMAVIPS (2010-2013) coordinated by R. Horaud and EARS (2014-2017), applied audio-visual scene analysis to human-robot interaction.

Stereoscopic Vision

Stereoscopy is one of the most studied topics in biological and computer vision. Nevertheless, classical approaches of addressing this problem fail to integrate eye/camera vergence. From a geometric point of view, the integration of vergence is difficult because one has to re-estimate the epipolar geometry at every new eye/camera rotation. From an algorithmic point of view, it is not clear how to combine depth maps obtained with different eyes/cameras relative orientations. Therefore, we addressed the more general problem of binocular vision that combines the low-level eye/camera geometry, sensor rotations, and practical algorithms based on global optimization , . We studied the link between mathematical and computational approaches to stereo (global optimization and Markov random fields) and the brain plausibility of some of these approaches: indeed, we proposed an original mathematical model for the complex cells in visual-cortex areas V1 and V2 that is based on steering Gaussian filters and that admits simple solutions . This addresses the fundamental issue of how local image structure is represented in the brain/computer and how this structure is used for estimating a dense disparity field. Therefore, the main originality of our work is to address both computational and biological issues within a unifying model of binocular vision. Another equally important problem that still remains to be solved is how to integrate binocular depth maps over time. Recently, we have addressed this problem and proposed a semi-global optimization framework that starts with sparse yet reliable matches and proceeds with propagating them over both space and time. The concept of seed-match propagation has then been extended to TOF-stereo fusion .

Audio Signal Processing

Audio-visual fusion algorithms necessitate that the two modalities are represented in the same mathematical space. Binaural audition allows to extract sound-source localization (SSL) information from the acoustic signals recorded with two microphones. We have developed several methods, that perform sound localization in the temporal and the spectral domains. If a direct path is assumed, one can exploit the time difference of arrival (TDOA) between two microphones to recover the position of the sound source with respect to the position of the two microphones. The solution is not unique in this case, the sound source lies onto a 2D manifold. However, if one further assumes that the sound source lies in a horizontal plane, it is then possible to extract the azimuth. We used this approach to predict possible sound locations in order to estimate the direction of a speaker . We also developed a geometric formulation and we showed that with four non-coplanar microphones the azimuth and elevation of a single source can be estimated without ambiguity . We also investigated SSL in the spectral domain. This exploits the filtering effects of the head related transfer function (HRTF): there is a different HRTF for the left and right microphones. The interaural spectral features, namely the ILD (interaural level difference) and IPD (interaural phase difference) can be extracted from the short-time Fourier transforms of the two signals. The sound direction is encoded in these interaural features but it is not clear how to make SSL explicit in this case. We proposed a supervised learning formulation that estimates a mapping from interaural spectral features (ILD and IPD) to source directions using two different setups: audio-motor learning and audio-visual learning .

Visual Reconstruction With Multiple Color and Depth Cameras

For the last decade, one of the most active topics in computer vision has been the visual reconstruction of objects, people, and complex scenes using a multiple-camera setup. The PERCEPTION team has pioneered this field and by 2006 several team members published seminal papers in the field. Recent work has concentrated onto the robustness of the 3D reconstructed data using probabilistic outlier rejection techniques combined with algebraic geometry principles and linear algebra solvers . Subsequently, we proposed to combine 3D representations of shape (meshes) with photometric data . The originality of this work was to represent photometric information as a scalar function over a discrete Riemannian manifold, thus generalizing image analysis to mesh and graph analysis. Manifold equivalents of local-structure detectors and descriptors were developed . The outcome of this pioneering work has been twofold: the formulation of a new research topic now addressed by several teams in the world, and allowed us to start a three year collaboration with Samsung Electronics. We developed the novel concept of mixed camera systems combining high-resolution color cameras with low-resolution depth cameras , ,. Together with our start-up company 4D Views Solutions and with Samsung, we developed the first practical depth-color multiple-camera multiple-PC system and the first algorithms to reconstruct high-quality 3D content .

Registration, Tracking and Recognition of People and Actions

The analysis of articulated shapes has challenged standard computer vision algorithms for a long time. There are two difficulties associated with this problem, namely how to represent articulated shapes and how to devise robust registration and tracking methods. We addressed both these difficulties and we proposed a novel kinematic representation that integrates concepts from robotics and from the geometry of vision. In 2008 we proposed a method that parameterizes the occluding contours of a shape with its intrinsic kinematic parameters, such that there is a direct mapping between observed image features and joint parameters . This deterministic model has been motivated by the use of 3D data gathered with multiple cameras. However, this method was not robust to various data flaws and could not achieve state-of-the-art results on standard dataset. Subsequently, we addressed the problem using probabilistic generative models. We formulated the problem of articulated-pose estimation as a maximum-likelihood with missing data and we devised several tractable algorithms , . We proposed several expectation-maximization procedures applied to various articulated shapes: human bodies, hands, etc. In parallel, we proposed to segment and register articulated shapes represented with graphs by embedding these graphs using the spectral properties of graph Laplacians . This turned out to be a very original approach that has been followed by many other researchers in computer vision and computer graphics.

Highlights of the Year Highlights of the Year

In collaboration with several partners, PERCEPTION completed the three year EU STREP project EARS (2014-2017). PERCEPTION contributed to audio-source localization using microphone arrays and to the disambiguation of audio information using vision, in particular to discriminate between speaking and silent persons.

Website: https://robot-ears.eu/

PERCEPTION started and completed a one year collaboration (December 2016 – November 2017) with Samsung Electronics Digital Media and Communications R&D Center, Seoul, Korea. The topic of this collaboration, fully funded by Samsung, was multi-modal methodologies for human-robot interaction (a central topic of the team) and is part of a strategic partnership between Inria and Samsung Electronics. A follow-up of this collaboration is under preparation and it is planned to start soon (February 2018).

As an ERC Advanced Grant holder, Radu Horaud was awarded a Proof of Concept grant for his project Vision and Hearing in Action Laboratory (VHIALab). The project will develop software packages enabling companion robots to robustly interact with multiple users.

Website: https://team.inria.fr/perception/projects/poc-vhialab/

Awards

Israel Dejene Gebru (PhD student) and his co-authors, Christine Evers, Patrick Naylor (both from Imperial College London) and Radu Horaud, received the best paper award at the IEEE Fifth Joint Workshop on Hands-free Speech Communication and Microphone Arrays, San Francisco, USA, 1-3 March 2017, for their paper Audio-visual Tracking by Density Approximation in a Sequential Bayesian Filtering Framework.

Yutong Ban (PhD student) and his co-authors, Xavier Alameda-Pineda, Fabien Badeig, and Radu Horaud, were among the five finalists of the “Novel Technology Paper Award for Amusement Culture” at the IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), Vancouver, Canada, September 2017, for their paper Tracking a Varying Number of People with a Visually-Controlled Robotic Head.

New Software and Platforms ECMPR

Expectation Conditional Maximization for the Joint Registration of Multiple Point Sets

Functional Description: Rigid registration of two or several point sets based on probabilistic matching between point pairs and a Gaussian mixture model

Participants: Florence Forbes, Manuel Yguel and Radu Horaud

Contact: Patrice Horaud

URL: https://team.inria.fr/perception/research/jrmpc/

Mixcam

Reconstruction using a mixed camera system

Keywords: Computer vision - 3D reconstruction

Functional Description: We developed a multiple camera platform composed of both high-definition color cameras and low-resolution depth cameras. This platform combines the advantages of the two camera types. On one side, depth (time-of-flight) cameras provide coarse low-resolution 3D scene information. On the other side, depth and color cameras can be combined such as to provide high-resolution 3D scene reconstruction and high-quality rendering of textured surfaces. The software package developed during the period 2011-2014 contains the calibration of TOF cameras, alignment between TOF and color cameras, TOF-stereo fusion, and image-based rendering. These software developments were performed in collaboration with the Samsung Advanced Institute of Technology, Seoul, Korea. The multi-camera platform and the basic software modules are products of 4D Views Solutions SAS, a start-up company issued from the PERCEPTION group.

Participants: Clément Ménier, Georgios Evangelidis, Michel Amat, Miles Hansard, Patrice Horaud, Pierre Arquier, Quentin Pelorson, Radu Horaud, Richard Broadbridge and Soraya Arias

Contact: Patrice Horaud

URL: https://team.inria.fr/perception/mixcam-project/

NaoLab

Distributed middleware architecture for interacting with NAO

Functional Description: This software provides a set of librairies and tools to simply the control of NAO robot from a remote machine. The main challenge is to make easy prototuping applications for NAO ising C++ and Matlab programming environments. Thus NaoLab provides a prototyping-friendly interface to retrieve sensor date (video and sound streams, odometric data...) and to control the robot actuators (head, arms, legs...) from a remote machine.This interface is available on Naoqi SDK, developed by Aldebarab company, Naoqi SDK is needed as it provides the tools to acess the embedded NAO services (low-level motor command, sensor data access...)

Authors: Fabien Badeig, Quentin Pelorson and Patrice Horaud

Contact: Patrice Horaud

URL: https://team.inria.fr/perception/research/naolab/

Stereo matching and recognition library

Keyword: Computer vision

Functional Description: Library providing stereo matching components to rectify stereo images, to retrieve faces from left and right images, to track faces and method to recognise simple gestures

Participants: Jan Cech, Jordi Sanchez-Riera, Radu Horaud and Soraya Arias

Contact: Soraya Arias

URL: https://code.humavips.eu/projects/stereomatch

Platforms Audio-Visual Head Popeye+

In 2016 our audio-visual platform was upgraded from Popeye to Popeye+. Popeye+ has two high-definition cameras with a wide field of view. We also upgraded the software libraries that perform synchronized acquisition of audio signals and color images. Popeye+ has been used for several datasets.

Websites:

https://team.inria.fr/perception/projects/popeye/

https://team.inria.fr/perception/projects/popeye-plus/

https://team.inria.fr/perception/avtrack1/

https://team.inria.fr/perception/avdiar/

NAO Robots

The PERCEPTION team selected the companion robot NAO for experimenting and demonstrating various audio-visual skills as well as for developing the concept of social robotics that is able to recognize human presence, to understand human gestures and voice, and to communicate by synthesizing appropriate behavior. The main challenge of our team is to enable human-robot interaction in the real world.

The humanoid robot NAO is manufactured by SoftBank Robotics Europe. Standing, the robot is roughly 60 cm tall, and 35cm when it is sitting. Approximately 30 cm large, NAO includes two CPUs. The first one, placed in the torso, together with the batteries, controls the motors and hence provides kinematic motions with 26 degrees of freedom. The other CPU is placed in the head and is in charge of managing the proprioceptive sensing, the communications, and the audio-visual sensors (two cameras and four microphones, in our case). NAO's on-board computing resources can be accessed either via wired or wireless communication protocols.

NAO's commercially available head is equipped with two cameras that are arranged along a vertical axis: these cameras are neither synchronized nor a significant common field of view. Hence, they cannot be used in combination with stereo vision. Within the EU project HUMAVIPS, Aldebaran Robotics developed a binocular camera system that is arranged horizontally. It is therefore possible to implement stereo vision algorithms on NAO. In particular, one can take advantage of both the robot's cameras and microphones. The cameras deliver VGA sequences of image pairs at 12 FPS, while the sound card delivers the audio signals arriving from all four microphones and sampled at 48 kHz. Subsequently, Aldebaran developed a second binocular camera system to go into the head of NAO v5.

In order to manage the information flow gathered by all these sensors, we implemented our software on top of the Robotics Services Bus (RSB). RSB is a platform-independent event-driven middleware specifically designed for the needs of distributed robotic applications. Several RSB tools are available, including real-time software execution, as well as tools to record the event/data flow and to replay it later, so that application development can be done off-line. RSB events are automatically equipped with several time stamps for introspection and synchronization purposes. RSB was chosen because it allows our software to be run on a remote PC platform, neither with performance nor deployment restrictions imposed by the robot's CPUs. Moreover, the software packages can be easily reused for other robots.

Recently (2015-2016) the PERCEPTION team started the development of NAOLab, a middleware for hosting robotic applications in C, C++, Python and Matlab, using the computing power available with NAO, augmented with a networked PC. More recently, NAOLab was renamed RMP (Robotics Middleware for Perception).

Websites:

https://team.inria.fr/perception/nao/

https://team.inria.fr/perception/research/naolab/

New Results Audio-Source Localization

In previous years we have developed several supervised sound-source localization algorithms. The general principle of these algorithms was based on the learning of a mapping (regression) between binaural feature vectors and source locations , . While fixed-length wide-spectrum sounds (white noise) are used for training to reliably estimate the model parameters, we showed that the testing (localization) can be extended to variable-length sparse-spectrum sounds (such as speech), thus enabling a wide range of realistic applications. Indeed, we demonstrated that the method could be used for audio-visual fusion, namely to map speech signals onto images and hence to spatially align the audio and visual modalities, thus enabling to discriminate between speaking and non-speaking faces. This year we released a novel corpus of real-room recordings that allow quantitative evaluation of the co-localization method in the presence of one or two sound sources. Experiments demonstrate increased accuracy and speed relative to several state-of-the-art methods. During the period 2015-2016 we extended this method to an arbitrary number of microphones based on the relative transfer function – RTF (between any channel and a reference channel). In the period 2016-2017 we extended this work and developed a novel transfer function that contains the direct path between the source and the microphone array, namely the direct-path relative transfer function , .

Websites:

https://team.inria.fr/perception/research/acoustic-learning/

https://team.inria.fr/perception/research/binaural-ssl/

https://team.inria.fr/perception/research/ssl-rtf/

Audio-Source Separation

We addressed the problem of separating audio sources from both static and time-varying convolutive mixtures. We proposed an unsupervised probabilistic framework based on the local complex-Gaussian model combined with non-negative matrix factorization . The time-varying mixing filters are modeled by a continuous temporal stochastic process. This model extended the case of static filters which corresponds to static audio sources. While static filters can be learnt in advance, e.g. , time-varying filters cannot and therefore the problem is more complex. We developed a variational expectation-maximization (VEM) algorithm that employs a Kalman smoother to estimate the time-varying mixing matrix, and that jointly estimates the source parameters. In 2017 we extended this method to incorporate the concept of diarization. Indeed, audio sources such as speaking persons do not emit continuously, bet merely take "turns". We formally modeled speech turn-taking within a combined separation and diarization formulation , . We also started to investigate the use of the convolutive transfer function for audio-source separation , , .

Websites:

https://team.inria.fr/perception/research/vemove/

https://team.inria.fr/perception/research/nmfig/

https://team.inria.fr/perception/research/dnd/

Speech Dereverberation and Noise Reduction

We address the problems of blind multichannel identification and equalization for joint speech dereverberation and noise reduction. The standard time-domain cross-relation methods are hardly applicable for blind room impulse response identification due to the near-common zeros of the long impulse responses. We extend the cross-relation formulation to the short-time Fourier transform (STFT) domain, in which the time-domain impulse response is approximately represented by the convolutive transfer function (CTF) with much less coefficients. For the oversampled STFT, CTFs suffer from the common zeros caused by the non-flat-top STFT window. To overcome this, we propose to identify CTFs using the STFT framework with oversampled signals and critically sampled CTFs, which is a good trade-off between the frequency aliasing of the signals and the common zeros problem of CTFs. The phases of the identified CTFs are inaccurate due to the frequency aliasing of the CTFs, and thus only their magnitudes are used. This leads to a non-negative multichannel equalization method based on a non-negative convolution model between the STFT magnitude of the source signal and the CTF magnitude. To recover the STFT magnitude of the source signal and to reduce the additive noise, the $ℓ_{2}$ -norm fitting error between the STFT magnitude of the microphone signals and the non-negative convolution is constrained to be less than a noise power related tolerance. Meanwhile, the $ℓ_{1}$ -norm of the STFT magnitude of the source signal is minimized to impose the sparsity .

Website: https://team.inria.fr/perception/research/ctf-dereverberation/

Acoustic-Articulatory Mapping

In this series of studies, we tackle the problem of adapting an acoustic-articulatory inversion model of a reference speaker to the voice of another source speaker. We exploited the framework of Gaussian mixture regressors (GMR) with missing data. To address speaker adaptation, we previously proposed a general framework called Cascaded-GMR (C-GMR) which decomposes the adaptation process into two consecutive steps: spectral conversion between source and reference speaker and acoustic-articulatory inversion of converted spectral trajectories. In particular, we proposed the Integrated C-GMR technique (IC-GMR) in which both steps are tied together in the same probabilistic model. In , , we extend the C-GMR framework with another model called Joint-GMR (J-GMR). Contrary to the IC-GMR, this model aims at exploiting all potential acoustic-articulatory relationships, including those between the source speaker's acoustics and the reference speaker's articulation. We present the full derivation of the exact Expectation-Maximization (EM) training algorithm for the J-GMR. It exploits the missing data methodology of machine learning to deal with limited adaptation data. We provide an extensive evaluation of the J-GMR on both synthetic acoustic-articulatory data and on the multi-speaker MOCHA EMA database. We compare the J-GMR performance to other models of the C-GMR framework, notably the IC-GMR, and discuss their respective merits. We also exploited the IC-GMR framework with visual data to provide visual biofeedback . Visual biofeedback is the process of gaining awareness of physiological functions through the display of visual information. As speech is concerned, visual biofeedback usually consists in showing a speaker his/her own articulatory movements, which has proven useful in applications such as speech therapy or second language learning. We automatically animate an articulatory tongue model from ultrasound images. We benchmarked several GMR-based techniques on a multispeaker database. The IC-GMR approach is able (i) to maintain good mapping performance while minimizing the amount of adaptation data (and thus limiting the duration of the enrollment session), and (ii) to generalize to articulatory configurations not seen during enrollment better than the plain GMR approach. As a result, the GMR appears to be a good mapping technique for non-linear regression tasks, and in particular for those requiring adaptation (either using J-GMR or IC-GMR).

Visual Tracking of Multiple Persons

Object tracking is an ubiquitous problem in computer vision with many applications in human-machine and human-robot interaction, augmented reality, driving assistance, surveillance, etc. Although thoroughly investigated, tracking multiple persons remains a challenging and an open problem. In this work, an online variational Bayesian model for multiple-person tracking is proposed. This yields a variational expectation-maximization (VEM) algorithm. The computational efficiency of the proposed method is made possible thanks to closed-form expressions for both the posterior distributions of the latent variables and for the estimation of the model parameters. A stochastic process that handles person birth and person death enables the tracker to handle a varying number of persons over long periods of time . The method was combined with visual servoing and implemented on our robot platform (Fig. ) .

Websites:

https://team.inria.fr/perception/research/ovbt/

https://team.inria.fr/perception/research/mot-servoing/

Audio-Visual Speaker Tracking and Diarization

We are particularly interested in modeling the interaction between an intelligent device and a group of people. For that purpose we develop audio-visual person tracking methods , , , . As the observed persons are supposed to carry out a conversation, we also include speaker diarization into our tracking methodology. We cast the diarization problem into a tracking formulation whereby the active speaker is detected and tracked over time. A probabilistic tracker exploits the spatial coincidence of visual and auditory observations and infers a single latent variable which represents the identity of the active speaker. Visual and auditory observations are fused using our recently developed weighted-data mixture model , while several options for the speaking turns dynamics are fulfilled by a multi-case transition model. The modules that translate raw audio and visual data into image observations are also described in detail. The performance of the proposed method are tested on challenging datasets that are available from recent contributions which are used as baselines for comparison .

Websites:

https://team.inria.fr/perception/research/wdgmm/

https://team.inria.fr/perception/research/speakerloc/

https://team.inria.fr/perception/research/speechturndet/

https://team.inria.fr/perception/research/avdiarization/

Head Pose Estimation and Tracking

Head pose estimation is an important task, because it provides information about cognitive interactions that are likely to occur. Estimating the head pose is intimately linked to face detection. We addressed the problem of head pose estimation with three degrees of freedom (pitch, yaw, roll) from a single image and in the presence of face detection errors. Pose estimation is formulated as a high-dimensional to low-dimensional mixture of linear regression problem . We propose a method that maps HOG-based descriptors, extracted from face bounding boxes, to corresponding head poses. To account for errors in the observed bounding-box position, we learn regression parameters such that a HOG descriptor is mapped onto the union of a head pose and an offset, such that the latter optimally shifts the bounding box towards the actual position of the face in the image. The performance of the proposed method is assessed on publicly available datasets. The experiments that we carried out show that a relatively small number of locally-linear regression functions is sufficient to deal with the non-linear mapping problem at hand. Comparisons with state-of-the-art methods show that our method outperforms several other techniques . This work is part of the PhD of Vincent Drouard that received the best student paper award (second place) at the IEEE ICIP'15.

In 2017 we extended this work and we proposed a head-pose tracker based on a switching Kalman filter (SKF) formalism. The SKF governs the temporal predictive distribution of the pose parameters (modeled as continuous latent variables) conditioned by the discrete variables associated with the mixture of linear inverse-regression formulation of . We formally derived the equations of the proposed switching linear regression model, we proposed an approximation that is both identifiable and computationally tractable, we designed an EM procedure to estimate the SKF parameters in closed-form, and we carried out experiments and comparisons with other methods using recently released datasets .

Websites:

https://team.inria.fr/perception/research/head-pose/

https://team.inria.fr/perception/research/head-pose-tracking/

Tracking Eye Gaze and of Visual Focus of Attention

The visual focus of attention (VFOA) has been recognized as a prominent conversational cue. We are interested in estimating and tracking the VFOAs associated with multi-party social interactions. We note that in this type of situations the participants either look at each other or at an object of interest; therefore their eyes are not always visible. Consequently both gaze and VFOA estimation cannot be based on eye detection and tracking. We propose a method that exploits the correlation between eye gaze and head movements. Both VFOA and gaze are modeled as latent variables in a Bayesian switching state-space model (also named switching Kalman filter). The proposed formulation leads to a tractable learning method and to an efficient online inference procedure that simultaneously tracks gaze and visual focus. The method is tested and benchmarked using two publicly available datasets, Vernissage and LAEO, that contain typical multi-party human-robot and human-human interactions .

Website:

https://team.inria.fr/perception/research/eye-gaze/.

Attention-Gated Conditional Random Fields

Recent works have shown that exploiting multi-scale representations deeply learned via convolutional neural networks (CNN) is of tremendous importance for accurate contour detection. We present a novel approach for predicting contours which advances the state of the art in two fundamental aspects, i.e. multi-scale feature generation and fusion. Different from previous works directly considering multi-scale feature maps obtained from the inner layers of a primary CNN architecture, we introduce a hierarchical deep model which produces more rich and complementary representations. Furthermore, to refine and robustly fuse the representations learned at different scales, the novel Attention-Gated Conditional Random Fields (AG-CRFs) are proposed. The experiments ran on two publicly available datasets (BSDS500 and NYUDv2) demonstrate the effectiveness of the latent AG-CRF model and of the overall hierarchical framework.

Pooling Local Virality

In our overly-connected world, the automatic recognition of virality - the quality of an image or video to be rapidly and widely spread in social networks - is of crucial importance, and has recently awaken the interest of the computer vision community. Concurrently, recent progress in deep learning architectures showed that global pooling strategies allow the extraction of activation maps, which highlight the parts of the image most likely to contain instances of a certain class. We extended this concept by introducing a pooling layer that learns the size of the support area to be averaged: the learned top-N average (LENA) pooling . We hypothesize that the latent concepts (feature maps) describing virality may require such a rich pooling strategy. We assess the effectiveness of the LENA layer by appending it on top of a convolutional siamese architecture and evaluate its performance on the task of predicting and localizing virality. We report experiments on two publicly available datasets annotated for virality and show that our method outperforms state-of-the-art approaches.

Registration of Multiple Point Sets

We have also addressed the rigid registration problem of multiple 3D point sets. While the vast majority of state-of-the-art techniques build on pairwise registration, we proposed a generative model that explains jointly registered multiple sets: back-transformed points are considered realizations of a single Gaussian mixture model (GMM) whose means play the role of the (unknown) scene points. Under this assumption, the joint registration problem is cast into a probabilistic clustering framework. We formally derive an expectation-maximization procedure that robustly estimates both the GMM parameters and the rigid transformations that map each individual cloud onto an under-construction reference set, that is, the GMM means. GMM variances carry rich information as well, thus leading to a noise- and outlier-free scene model as a by-product. A second version of the algorithm is also proposed whereby newly captured sets can be registered online. A thorough discussion and validation on challenging data-sets against several state-of-the-art methods confirm the potential of the proposed model for jointly registering real depth data .

Website:

https://team.inria.fr/perception/research/jrmpc/

Bilateral Contracts and Grants with Industry Bilateral Contracts with Industry

From December 2016 to November 2017 the PERCEPTION team had a collaborative project with Samsung's Digital Media and Communication R&D Center. The collaboration was fully funded by Samsung Electronics. The topic of this collaboration was multi-modal approach to human-robot interaction.

Partnerships and Cooperations European Initiatives FP7 & H2020 Projects VHIA

Title: Vision and Hearing in Action

EU framework: FP7

Type: ERC Advanced Grant

Duration: February 2014 - January 2019

Coordinator: Inria

Inria contact: Radu Horaud

'The objective of VHIA is to elaborate a holistic computational paradigm of perception and of perception-action loops. We plan to develop a completely novel twofold approach: (i) learn from mappings between auditory/visual inputs and structured outputs, and from sensorimotor contingencies, and (ii) execute perception-action interaction cycles in the real world with a humanoid robot. VHIA will achieve a unique fine coupling between methodological findings and proof-of-concept implementations using the consumer humanoid NAO manufactured in Europe. The proposed multimodal approach is in strong contrast with current computational paradigms influenced by unimodal biological theories. These theories have hypothesized a modular view, postulating quasi-independent and parallel perceptual pathways in the brain. VHIA will also take a radically different view than today's audiovisual fusion models that rely on clean-speech signals and on accurate frontal-images of faces; These models assume that videos and sounds are recorded with hand-held or head-mounted sensors, and hence there is a human in the loop who intentionally supervises perception and interaction. Our approach deeply contradicts the belief that complex and expensive humanoids (often manufactured in Japan) are required to implement research ideas. VHIA's methodological program addresses extremely difficult issues: how to build a joint audiovisual space from heterogeneous, noisy, ambiguous and physically different visual and auditory stimuli, how to model seamless interaction, how to deal with high-dimensional input data, and how to achieve robust and efficient human-humanoid communication tasks through a well-thought tradeoff between offline training and online execution. VHIA bets on the high-risk idea that in the next decades, social robots will have a considerable economical impact, and there will be millions of humanoids, in our homes, schools and offices, which will be able to naturally communicate with us.

Website: https://team.inria.fr/perception/projects/erc-vhia/

International Initiatives Inria International Partners Informal International Partners

Bar Ilan University, Israel (prof. Sharon Gannot and his team)

University of Trento, Italy (prof. Nicu Sebe and prof. Elisa Ricci)

Dr. Rafael Munoz-Salinas and prof. Manuel Marin-Jimenez, University of Cordoba, Spain,

Dr. Christine Evers and prof. Patrick Naylor, Imperial College of Science and Medecine, UK.

Dr. Miriam Redi, Wikimedia Foundation, UK.

Prof. Shih-Fu Chang, Columbia University, USA.

International Research Visitors Visits of International Scientists

Prof. Sharon Gannot (Bar Ilan University)

Oscar David Gomez Lopez (University of Granada)

Dissemination Promoting Scientific Activities Scientific Events Organisation General Chair, Scientific Chair

Xavier Alameda-Pineda was area chair of IEEE/CVF International Conference on Computer Vision 2017.

Member of the Organizing Committees

Xavier Alameda-Pineda co-organized a special session at ACM International Conference on Multimedia Retrieval 2017, and a workshop at ACM International Conference on Multimedia 2017.

Scientific Events Selection Reviewer

In 2017, Xavier Alameda-Pineda reviewed for IEEE Conference on Computer Vision and Pattern Recognition 2017, Advances on Neural Information Processing Systems 2017, IEEE International Conference on Audio Speech and Signal Processing 2018 and International Conference on Learning Representations.

Xavier was awarded with the CVPR 2017 Outstanding Reviewer Award.

Journal Member of the Editorial Boards

Radu Horaud is a member of the following editorial boards:

advisory board member of the International Journal of Robotics Research, Sage,

associate editor of the International Journal of Computer Vision, Kluwer, and

area editor of Computer Vision and Image Understanding, Elsevier.

Reviewer - Reviewing Activities

Xavier Alameda-Pineda regularly acts as reviewer for IEEE Transactions on Pattern Analysis and Machine Intelligence, IEEE Transactions on Audio, Speech, and Language Processing, IEEE Transactions on Multimedia, IEEE Transactions on Image Procesing, IEEE Transactions on Affective Computing, International Journal on Computer Vision and Computer Vision and Image Understanding.

Invited Talks

Xavier Alameda-Pineda gave an invited talk at GIPSA-Lab on Multi-speaker tracking with auditory data.

Radu Horaud gave invited talks at two IEEE ICCV Workshops:

https://www.msf-workshop.com/

https://mvr3d.github.io/.

Teaching - Supervision - Juries Supervision

PhD defended February 2017: Dionyssos Kounades-Bastian, October 2013, Radu Horaud, Laurent Girin, and Xavier Alameda-Pineda.

PhD defended in December 2017: Vincent Drouard, October 2014, Radu Horaud and Sileye Ba.

PhD in progress: Benoit Massé, October 2014, Radu Horaud and Sileye Ba.

PhD in progress: Stéphane Lathuilière, October 2014, Radu Horaud.

PhD in progress: Yutong Ban, October 2015, Radu Horaud and Laurent Girin

PhD in progress: Guillaume Delorme, September 2017, Radu Horaud and Xavier Alameda-Pineda

PhD in progress: Sylvain Guy, October 2017, Radu Horaud and Laurent Girin

A Geometric Approach to Sound Source Localization from Time-Delay Estimates Xavier Alameda-Pineda X. Radu Horaud R. IEEE Transactions on Audio, Speech and Language Processing 22 6 June 2014 1082-1095 https://hal.inria.fr/hal-00975293 Vision-Guided Robot Hearing Xavier Alameda-Pineda X. Radu Horaud R. International Journal of Robotics Research 34 4-5 April 2015 437-456 https://hal.inria.fr/hal-00990766 Visual Servoing from Lines Nicolas Andreff N. Bernard Espiau B. Radu Horaud R. International Journal of Robotics Research 21 8 2002 679–700 http://hal.inria.fr/hal-00520167 An On-line Variational Bayesian Model for Multi-Person Tracking from Cluttered Scenes Sileye Ba S. Xavier Alameda-Pineda X. Alessio Xompero A. Radu Horaud R. Computer Vision and Image Understanding 153 December 2016 64–76 https://hal.inria.fr/hal-01349763 Robust Temporally Coherent Laplacian Protrusion Segmentation of 3D Articulated Bodies Fabio Cuzzolin F. Diana Mateus D. Radu Horaud R. International Journal of Computer Vision 112 1 March 2015 43-70 https://hal.archives-ouvertes.fr/hal-01053737 Acoustic Space Learning for Sound-Source Separation and Localization on Binaural Manifolds Antoine Deleforge A. Florence Forbes F. Radu Horaud R. International Journal of Neural Systems 25 1 February 2015 21 https://hal.inria.fr/hal-00960796 High-Dimensional Regression with Gaussian Mixtures and Partially-Latent Response Variables Antoine Deleforge A. Florence Forbes F. Radu Horaud R. Statistics and Computing 25 5 September 2015 893-911 https://hal.inria.fr/hal-00863468 Co-Localization of Audio Sources in Images Using Binaural Features and Locally-Linear Regression Antoine Deleforge A. Radu Horaud R. Yoav Y. Schechner Y. Y. Laurent Girin L. IEEE Transactions on Audio, Speech and Language Processing 23 4 April 2015 718-731 https://hal.inria.fr/hal-01112834 Fusion of Range and Stereo Data for High-Resolution Scene-Modeling Georgios Evangelidis G. Miles Hansard M. Radu Horaud R. IEEE Transactions on Pattern Analysis and Machine Intelligence 37 11 November 2015 2178 - 2192 https://hal.archives-ouvertes.fr/hal-01110031 EM Algorithms for Weighted-Data Clustering with Application to Audio-Visual Scene Analysis Israel Dejene Gebru I. D. Xavier Alameda-Pineda X. Florence Forbes F. Radu Horaud R. IEEE Transactions on Pattern Analysis and Machine Intelligence 38 12 December 2016 2402 - 2415 https://hal.inria.fr/hal-01261374 Cross-Calibration of Time-of-flight and Colour Cameras Miles Hansard M. Georgios Evangelidis G. Quentin Pelorson Q. Radu Horaud R. Computer Vision and Image Understanding 134 April 2015 105-115 https://hal.inria.fr/hal-01059891 Automatic Detection of Calibration Grids in Time-of-Flight Images Miles Hansard M. Radu Horaud R. Michel Amat M. Georgios Evangelidis G. Computer Vision and Image Understanding 121 April 2014 108-118 https://hal.inria.fr/hal-00936333 Cyclopean geometry of binocular vision Miles Hansard M. Radu Horaud R. Journal of the Optical Society of America A 25 9 September 2008 2357-2369 http://hal.inria.fr/inria-00435548 Cyclorotation Models for Eyes and Cameras Miles Hansard M. Radu Horaud R. IEEE Transactions on Systems, Man, and Cybernetics, Part B: Cybernetics 40 1 March 2010 151-161 http://hal.inria.fr/inria-00435549 A Differential Model of the Complex Cell Miles Hansard M. Radu Horaud R. Neural Computation 23 9 September 2011 2324-2357 http://hal.inria.fr/inria-00590266 Time of Flight Cameras: Principles, Methods, and Applications Springer Briefs in Computer Science Miles Hansard M. Seungkyu Lee S. Ouk Choi O. Radu Horaud R. Springer October 2012 95 http://hal.inria.fr/hal-00725654 Stereo Calibration from Rigid Motions Radu Horaud R. Gabriela Csurka G. David Demirdjian D. IEEE Transactions on Pattern Analysis and Machine Intelligence 22 12 December 2000 1446–1452 http://hal.inria.fr/inria-00590127 Rigid and Articulated Point Registration with Expectation Conditional Maximization Radu Horaud R. Florence Forbes F. Manuel Yguel M. Guillaume Dewaele G. Jian Zhang J. IEEE Transactions on Pattern Analysis and Machine Intelligence 33 3 March 2011 587-602 http://hal.inria.fr/inria-00590265 Human Motion Tracking by Registering an Articulated Surface to 3-D Points and Normals Radu Horaud R. Matti Niskanen M. Guillaume Dewaele G. Edmond Boyer E. IEEE Transactions on Pattern Analysis and Machine Intelligence 31 1 January 2009 158-163 http://hal.inria.fr/inria-00446898 Conjugate Mixture Models for Clustering Multimodal Data Vasil Khalidov V. Florence Forbes F. Radu Horaud R. Neural Computation 23 2 February 2011 517-557 http://hal.inria.fr/inria-00590267 Human Motion Tracking with a Kinematic Parameterization of Extremal Contours David Knossow D. Remi Ronfard R. Radu Horaud R. International Journal of Computer Vision 79 3 September 2008 247-269 http://hal.inria.fr/inria-00590247 A Variational EM Algorithm for the Separation of Time-Varying Convolutive Audio Mixtures Dionyssos Kounades-Bastian D. Laurent Girin L. Xavier Alameda-Pineda X. Sharon Gannot S. Radu Horaud R. IEEE/ACM Transactions on Audio, Speech and Language Processing 24 8 August 2016 1408-1423 https://hal.inria.fr/hal-01301762 Estimation of the Direct-Path Relative Transfer Function for Supervised Sound-Source Localization Xiaofei Li X. Laurent Girin L. Radu Horaud R. Sharon Gannot S. IEEE/ACM Transactions on Audio, Speech and Language Processing 24 11 November 2016 2171 - 2186 https://hal.inria.fr/hal-01349691 Real-time Visuomotor Update of an Active Binocular Head Michael Sapienza M. Miles Hansard M. Radu Horaud R. Autonomous Robots 34 1 January 2013 33-45 http://hal.inria.fr/hal-00768615 Topology-Adaptive Mesh Deformation for Surface Evolution, Morphing, and Multi-View Reconstruction Andrei Zaharescu A. Edmond Boyer E. Radu Horaud R. IEEE Transactions on Pattern Analysis and Machine Intelligence 33 4 April 2011 823-837 http://hal.inria.fr/inria-00590271 Keypoints and Local Descriptors of Scalar Functions on 2D Manifolds Andrei Zaharescu A. Edmond Boyer E. Radu Horaud R. International Journal of Computer Vision 100 1 October 2012 78-98 http://hal.inria.fr/hal-00699620 Robust Factorization Methods Using A Gaussian/Uniform Mixture Model Andrei Zaharescu A. Radu Horaud R. International Journal of Computer Vision 81 3 March 2009 240-258 http://hal.inria.fr/inria-00446987 From Images and Sounds to Face Localization and Tracking Vincent Drouard V. Université Grenoble Alpes December 2017 https://hal.inria.fr/tel-01667740 Theses Some Contributions to Audio Source Separation and Diarisation of Multichannel Convolutive Mixtures Dionyssos Kounades-Bastian D. Université Grenoble - Alpes February 2017 https://hal.inria.fr/tel-01543101 Theses Robust Head-Pose Estimation Based on Partially-Latent Mixture of Linear Regressions Vincent Drouard V. Radu Horaud R. Antoine Deleforge A. Sileye Ba S. Georgios Evangelidis G. 1057-7149 IEEE Transactions on Image Processing 26 3 March 2017 1428 - 1440 https://hal.inria.fr/hal-01413406 https://arxiv.org/abs/1603.09732 Joint Alignment of Multiple Point Sets with Batch and Incremental Expectation-Maximization Georgios Evangelidis G. Radu Horaud R. 0162-8828 IEEE Transactions on Pattern Analysis and Machine Intelligence XX June 2017 https://hal.inria.fr/hal-01413414 https://arxiv.org/abs/1609.01466 - 14 pages, 12 figures, 5 tables Automatic animation of an articulatory tongue model from ultrasound images of the vocal tract Diandra Fabre D. Thomas Hueber T. Laurent Girin L. Xavier Alameda-Pineda X. Pierre Badin P. 0167-6393 Speech Communication 93 October 2017 63 - 75 https://hal.archives-ouvertes.fr/hal-01578315 Audio-Visual Speaker Diarization Based on Spatiotemporal Bayesian Fusion Israel Gebru I. Sileye Ba S. Xiaofei Li X. Radu Horaud R. 0162-8828 IEEE Transactions on Pattern Analysis and Machine Intelligence 39 January 2017 https://hal.inria.fr/hal-01413403 https://arxiv.org/abs/1603.09725 - 14 pages Extending the Cascaded Gaussian Mixture Regression Framework for Cross-Speaker Acoustic-Articulatory Mapping Laurent Girin L. Thomas Hueber T. Xavier Alameda-Pineda X. 1558-7916 IEEE/ACM Transactions on Audio, Speech and Language Processing 25 3 March 2017 662-673 https://hal.archives-ouvertes.fr/hal-01485540 Multiple-Speaker Localization Based on Direct-Path Features and Likelihood Maximization with Spatial Sparsity Regularization Xiaofei Li X. Laurent Girin L. Radu Horaud R. Sharon Gannot S. 1558-7916 IEEE/ACM Transactions on Audio, Speech and Language Processing 25 10 October 2017 1997 - 2012 https://hal.inria.fr/hal-01413417 https://arxiv.org/abs/1611.01172 - 16 pages, 4 figures, 4 tables Tracking Gaze and Visual Focus of Attention of People Involved in Social Interaction Benoît Massé B. Silèye Ba S. Radu Horaud R. 0162-8828 IEEE Transactions on Pattern Analysis and Machine Intelligence PP 99 December 2017 1-15 https://hal.inria.fr/hal-01511414 https://arxiv.org/abs/1703.04727 Viraliency: Pooling Local Virality Xavier Alameda-Pineda X. Andrea Pilzer A. Dan Xu D. Nicu Sebe N. Elisa Ricci E. IEEE Conference on Computer Vision and Pattern Recognition Honolulu, Hawaii, United States July 2017 https://hal.inria.fr/hal-01558137 IEEE International Conference on Computer Vision and Pattern Recognition 2009 CVPR Tracking a Varying Number of People with a Visually-Controlled Robotic Head Yutong Ban Y. Xavier Alameda-Pineda X. Fabien Badeig F. Sileye Ba S. Radu Horaud R. IEEE/RSJ International Conference on Intelligent Robots and Systems Vancouver, Canada September 2017 https://hal.inria.fr/hal-01542987 IEEE RSJ International Conference on Intelligent Robots and Systems 2011 IROS Exploiting the Complementarity of Audio and Visual Data in Multi-Speaker Tracking Yutong Ban Y. Laurent Girin L. Xavier Alameda-Pineda X. Radu Horaud R. ICCV Workshop on Computer Vision for Audio-Visual Media Venezia, Italy October 2017 https://hal.inria.fr/hal-01577965 ICCV Workshop on Computer Vision for Audio-Visual Media 2017 Switching Linear Inverse-Regression Model for Tracking Head Pose Vincent Drouard V. Sileye Ba S. Radu Horaud R. IEEE Winter Conference on Applications of Computer Vision Santa Rosa, CA, United States March 2017 https://hal.inria.fr/hal-01430727 IEEE Workshop on Applications of Computer Vision 2014 WACV Audio-visual Tracking by Density Approximation in a Sequential Bayesian Filtering Framework Israel Gebru I. Christine Evers C. Patrick Naylor P. Radu Horaud R. IEEE Workshop on Hands-free Speech Communication and Microphone Arrays San Francisco, CA, United States IEEE Signal Processing Society March 2017 https://hal.inria.fr/hal-01452167 Joint Workshop on Hands-free Speech Communication and Microphone Arrays 2017 HSCMA Best Paper Award On the Use of Latent Mixing Filters in Audio Source Separation Laurent Girin L. Roland Badeau R. 13th International Conference on Latent Variable Analysis and Signal Separation (LVA/ICA 2017) Grenoble, France Proc. 13th International Conference on Latent Variable Analysis and Signal Separation (LVA/ICA 2017) 10169 Springer February 2017 225-235 https://hal.archives-ouvertes.fr/hal-01400965 International Conference on Latent Variable Analysis and Signal Separation 13 Adaptation of a Gaussian Mixture Regressor to a New Input Distribution: Extending the C-GMR Framework Laurent Girin L. Thomas Hueber T. Xavier Alameda-Pineda X. LVA ICA 2017- International Conference on Latent Variable Analysis and Signal Separation Grenoble, France February 2017 https://hal.inria.fr/hal-01646098 International Conference on Latent Variable Analysis and Signal Separation 13 An EM Algorithm for Joint Source Separation and Diarisation of Multichannel Convolutive Speech Mixtures Dionyssos Kounades-Bastian D. Laurent Girin L. Xavier Alameda-Pineda X. Sharon Gannot S. Radu Horaud R. IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) New Orleans, United States March 2017 https://hal.inria.fr/hal-01430761 IEEE International Conference on Acoustics, Speech and Signal Processing 2008 ICASSP Exploiting the Intermittency of Speech for Joint Separation and Diarization Dionyssos Kounades-Bastian D. Laurent Girin L. Xavier Alameda-Pineda X. Radu Horaud R. Sharon Gannot S. IEEE Workshop on Applications of Signal Processing to Audio and Acoustics New Paltz, NY, United States October 2017 https://hal.inria.fr/hal-01568813 IEEE Workshop on Applications of Signal Processing to Audio and Acoustics 2013 WASPAA Recognition of Group Activities in Videos Based on Single- and Two-Person Descriptors Stéphane Lathuilière S. Georgios Evangelidis G. Radu Horaud R. IEEE Winter Conference on Applications of Computer Vision Santa Rosa, CA, United States March 2017 https://hal.inria.fr/hal-01430732 IEEE Workshop on Applications of Computer Vision 2014 WACV Deep Mixture of Linear Inverse Regressions Applied to Head-Pose Estimation Stéphane Lathuilière S. Rémi Juge R. Pablo Mesejo P. Rafael Muñoz-Salinas R. Radu Horaud R. IEEE Conference on Computer Vision and Pattern Recognition Honolulu, Hawaii, United States IEEE Computer Society July 2017 https://hal.inria.fr/hal-01504847 IEEE International Conference on Computer Vision and Pattern Recognition 2009 CVPR An EM Algorithm for Audio Source Separation Based on the Convolutive Transfer Function Xiaofei Li X. Laurent Girin L. Radu Horaud R. IEEE Workshop on Applications of Signal Processing to Audio and Acoustics New Paltz, NY, United States October 2017 https://hal.inria.fr/hal-01568818 IEEE Workshop on Applications of Signal Processing to Audio and Acoustics 2013 WASPAA Audio Source Separation Based on Convolutive Transfer Function and Frequency-Domain Lasso Optimization Xiaofei Li X. Laurent Girin L. Radu Horaud R. IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) New Orleans, United States March 2017 https://hal.inria.fr/hal-01430754 IEEE International Conference on Acoustics, Speech and Signal Processing 2008 ICASSP Microphone Array Signal Processing for Robot Audition Heinrich Löllmann H. Alastair Moore A. Patrick Naylor P. Boaz Rafaely B. Radu Horaud R. Alexandre Mazel A. Walter Kellermann W. IEEE Workshop on Hands-free Speech Communication and Microphone Arrays San Francisco, United States IEEE Signal Processing Society March 2017 https://hal.inria.fr/hal-01485322 Joint Workshop on Hands-free Speech Communication and Microphone Arrays 2017 HSCMA Learning Deep Structured Multi-Scale Features using Attention-Gated CRFs for Contour Prediction Dan Xu D. Wanli Ouyang W. Xavier Alameda-Pineda X. Elisa Ricci E. Xiaogang Wang X. Nicu Sebe N. Advances in Neural Information Processing Systems Long Beach, United States December 2017 https://hal.inria.fr/hal-01646112 Annual Conference on Neural Information Processing Systems 21 NIPS Neural Network Reinforcement Learning for Audio-Visual Gaze Control in Human-Robot Interaction Stéphane Lathuilière S. Benoît Massé B. Pablo Mesejo P. Radu Horaud R. November 2017 https://hal.inria.fr/hal-01643775 https://arxiv.org/abs/1711.06834 - 14 pages Blind MultiChannel Identification and Equalization for Dereverberation and Noise Reduction based on Convolutive Transfer Function Xiaofei Li X. Sharon Gannot S. Radu Horaud R. November 2017 https://hal.inria.fr/hal-01568835 https://arxiv.org/abs/1706.03652 - 13 pages, 5 figures, 5 tables Multichannel Source Separation and Speech Enhancement Using the Convolutive Transfer Function Xiaofei Li X. Laurent Girin L. Sharon Gannot S. Radu Horaud R. November 2017 https://hal.inria.fr/hal-01645749 https://arxiv.org/abs/1711.07911 - 13 pages, 5 figures Plane-extraction from depth-data using a Gaussian mixture regression model Richard T Marriott R. T. Alexander Pashevich A. Radu Horaud R. December 2017 https://hal.inria.fr/hal-01663984 https://arxiv.org/abs/1710.01925 - 10 pages, 2 figures, 1 table In Memoriam Roger Mohr Karl Tombre K. Long Quan L. Radu Horaud R. Patrick Gros P. Cordelia Schmid C. Peter Sturm P. Société Informatique de France September 2017 91-98 https://hal.inria.fr/hal-01598085 Article qui rappelle la carrière scientifique de Roger Mohr