Augmented reality (AR) is a field of computer research which deals with combining real-world and computer-generated data in order to provide the user with a better understanding of the surrounding environment. Usually this refers to a system in which computer graphics are overlaid onto a live video stream or projected onto a transparent screen, as in a head-up display.
Though there exist a few commercial examples demonstrating the effectiveness of the AR concept for certain applications, the state of the art in AR today is comparable to the early years of Virtual Reality. Many research ideas have been demonstrated but few have matured beyond lab-based prototypes.
Computer vision plays an important role in AR applications. Seamlessly integrating computer-generated objects at the right place, consistently with the motion of the user, requires automatic real-time detection and tracking. In addition, 3D reconstruction of the scene is needed to handle occlusions and light inter-reflections between objects, and to facilitate the user's interactions with the augmented scene. Over the past fifteen years, much work has been successfully devoted to the structure-and-motion problem, but these methods are often formulated as off-line algorithms and require batch processing of several images acquired in a sequence. The challenge now is to design robust solutions to these problems that leave the user free to move during AR applications and that widen the range of AR applications to large and/or unstructured environments. More specifically, the Magrit team aims at addressing the following problems:
On-line pose computation for structured and unstructured environments: this problem is the cornerstone of AR systems and must be solved in real time with good accuracy.
Long term management of AR applications: a key problem of numerous algorithms is the gradual drifting of the localization over time. One of our aims is to develop methods that improve the accuracy and the repeatability of the pose during arbitrarily long periods of motion.
3D modeling for AR applications: this problem is fundamental to manage light interactions between real and virtual objects, to solve occlusions and to obtain realistic fused images.
The aim of the Magrit project is to develop vision-based methods which allow significant progress of AR technologies in terms of ease of implementation, usability, reliability and robustness, in order to widen the current application field of AR and to improve the freedom of the user during applications. Our main research directions concern two crucial issues: camera tracking and scene modeling. Methods are developed with a view to meeting the expected robustness requirements and to providing the user with a good perception of the augmented scene.
One of the most basic problems currently limiting Augmented Reality applications is the registration problem. The objects in the real and virtual worlds must be properly aligned with respect to each other, or the illusion that the two worlds coexist will be compromised.
As a large number of potential AR applications are interactive, real-time pose computation is required. Although the registration problem has received a lot of attention in the computer vision community, the problem of real-time registration is still far from solved, especially for unstructured environments. Ideally, an AR system should work in all environments, without the need to prepare the scene ahead of time, and the user should be able to walk anywhere they please.
For several years, the Magrit project has aimed at developing on-line and markerless methods for camera pose computation. Within the European project ARIS, we proposed a real-time camera-tracking system designed for indoor scenes. The main difficulty with on-line tracking is to ensure the robustness of the process. For off-line processes, robustness is achieved by exploiting the spatial and temporal coherence of the considered sequence through move-matching techniques. To obtain robustness for open-loop systems, we have developed a method which combines the advantages of move-matching methods and of model-based methods by using a piecewise-planar model of the environment. This methodology can then be used in a wide variety of environments: indoor scenes, urban scenes ... We are also concerned with the development of methods for camera stabilization. Statistical fluctuations in the viewpoint computations lead to unpleasant jittering or sliding effects, especially when the camera motion is small. We have shown that the use of model selection noticeably improves the visual impression and reduces drift over time.
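To illustrate the kind of computation a piecewise-planar model enables, the sketch below (purely illustrative, not the team's implementation) recovers the homography mapping a known planar patch of the model onto its image projection from four point correspondences; with a calibrated camera, such a homography can then be decomposed into the camera pose. The function names and the 4-point direct linear transform formulation are assumptions made for the example.

```python
def solve_linear(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(A)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for col in range(n):
        piv = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[piv] = M[piv], M[col]
        for r in range(col + 1, n):
            f = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= f * M[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        x[r] = (M[r][n] - sum(M[r][c] * x[c] for c in range(r + 1, n))) / M[r][r]
    return x

def homography_from_4_points(model_pts, image_pts):
    """Direct Linear Transform: 4 correspondences give 8 linear equations
    in h11..h32 (h33 is fixed to 1)."""
    A, b = [], []
    for (X, Y), (u, v) in zip(model_pts, image_pts):
        A.append([X, Y, 1, 0, 0, 0, -u * X, -u * Y]); b.append(u)
        A.append([0, 0, 0, X, Y, 1, -v * X, -v * Y]); b.append(v)
    h = solve_linear(A, b) + [1.0]
    return [h[0:3], h[3:6], h[6:9]]

def apply_h(H, p):
    """Apply a 3x3 homography to a 2D point."""
    X, Y = p
    w = H[2][0] * X + H[2][1] * Y + H[2][2]
    return ((H[0][0] * X + H[0][1] * Y + H[0][2]) / w,
            (H[1][0] * X + H[1][1] * Y + H[1][2]) / w)
```

Tracking a plane across frames then amounts to re-estimating this homography as the four (or more) tracked points move.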
An important way to improve the reliability and the robustness of pose algorithms is to combine the camera with another kind of sensor in order to compensate for the shortcomings of each technology. Each approach has limitations: on the one hand, rapid head motions cause image features to undergo large displacements between frames, which can make visual tracking fail. On the other hand, the response of inertial sensors is largely independent of the user's motion, but their accuracy is limited and their response is sensitive to metallic objects in the scene. We recently proposed a system that makes an inertial sensor (the Xsens MT9) cooperate with the camera-based system in order to improve the robustness of the AR system to abrupt motions of the user, especially head motions. This work contributes to reducing the constraints on the users and the need to carefully control the environment during an AR application. This research area will be continued in the near future within the ASPI project in order to build a dynamic articulatory model from various image modalities and sensor data.
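A classical way to make the two sensors cooperate is a complementary filter, sketched below for a single orientation angle: the inertial rate is integrated at every step (reliable during fast motion, but drifting) and gently pulled toward the vision-based estimate whenever tracking is available (drift-free, but fragile under abrupt motion). The function name and the blending weight are illustrative assumptions, not the actual system.

```python
def fuse_orientation(theta_prev, gyro_rate, dt, theta_vision, alpha=0.98):
    """One complementary-filter step for a single orientation angle (radians).

    theta_vision is None when visual tracking has failed (e.g. during an
    abrupt head motion); the filter then falls back on pure integration.
    """
    theta_inertial = theta_prev + gyro_rate * dt  # drifts slowly (gyro bias)
    if theta_vision is None:
        return theta_inertial
    # vision corrects the slow inertial drift; inertia smooths the fast motion
    return alpha * theta_inertial + (1.0 - alpha) * theta_vision
```

With vision available, a constant gyro bias produces only a bounded error; without it, the error grows without bound.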
It must be noted that the registration problem must be addressed from the rather specific point of view of augmented reality: the success and the acceptance of an AR application depend not only on the accuracy of the pose computation but also on the visual impression of the augmented scene. The search for the best compromise between accuracy and perception is therefore an important issue in this project. This research topic is currently addressed in our project both in classical AR and in medical imaging, in order to choose the camera model, including intrinsic parameters, which best describes the considered camera.
Finally, camera tracking largely depends on the quality of the matching stage, which detects and matches features over the sequence. Ongoing research addresses the problem of establishing robust correspondences of features over time. The use of an a contrario decision criterion is currently under study to achieve this aim.
Modeling the scene is a fundamental issue in AR for many reasons. First, pose computation algorithms often use a model of the scene, or at least some 3D knowledge of the scene. Second, effective AR systems require a model of the scene to handle occlusion and to compute light reflections between the real and the virtual objects. Unlike pose computation, which has to be performed sequentially, scene modeling can be considered as an off-line or an on-line problem depending on the application.
Currently, we are mainly concerned with interactive scene modeling from various image modalities. This activity concerns our medical activities as well as the ASPI project where a complete dynamic articulatory model of a speaker must be designed from various image modalities (ultrasound, MRI, video and magnetic sensors).
For the last 15 years, we have been working in close collaboration with the neuroradiology department of the University Hospital of Nancy (CHU Nancy) and GE Healthcare. As several imaging modalities are now available in an intra-operative context (2D and 3D angiography, MRI, ...), our aim is to develop a multi-modality framework to support therapeutic decisions and treatment.
In previous works, we proposed an efficient solution to the registration of 2D/3D angiographic images and of 3DXA/MRI images. Since then, we have mainly been interested in the effective use of a multimodality framework in the treatment of arteriovenous malformations (AVM). The treatment of an AVM is classically a two-stage process: embolization, or endovascular treatment, is performed first; this step is then followed by a stereotactic irradiation of the remnant. Hence, an accurate definition of the target is of great importance for the treatment. Our short-term aim is to perform an accurate detection of the AVM shape within a multimodality framework. Our long-term aim is to develop multimodality and augmented reality tools which make various image modalities (2D and 3D angiography, fluoroscopic images, MRI, ...) cooperate in order to help physicians in clinical routine.
Besides interactive modeling, research on on-line reconstruction is conducted in our team. Sequential reconstruction of the scene structure needed by pose or occlusion algorithms is highly desirable for the numerous AR applications in which instrumenting the scene is not conceivable. Hence, structure and pose must be estimated sequentially over time. We are currently studying this problem both for multi-planar scenes and for general scenes.
We have significant experience in the AR field, especially through the European project ARIS (2001–2004), which aimed at developing effective and realistic AR systems for e-commerce and especially for interior design. Beyond this restrictive application field, this project allowed us to develop near real-time camera-tracking methods for multi-planar environments. We are currently continuing and extending our research on multi-planar environments in order to obtain effective and robust AR systems in such environments. We especially aim at automatically building a model of the scene during the application, in order to be able to handle large and unknown environments.
For 15 years, we have been working in close collaboration with the University Hospital of Nancy and GE Healthcare in interventional neuroradiology, with the aim of developing tools that allow physicians to take advantage, in their clinical practice, of the various existing brain imaging modalities. As several imaging modalities bringing complementary information on the various brain pathologies are now available in pre- and intra-operative contexts (2D and 3D subtracted angiography, fluoroscopy, MRI, ...), our aim is to develop a multi-modality framework to support therapeutic decisions. Recently, we have investigated the use of AR tools for neuronavigation. The aim of the PhD thesis of Sébastien Gorges, which ended in May 2007, was to design tools for neuronavigation that take advantage of real-time imagery (fluoroscopy) and of pre-operative 3D imagery (3D angiography). We are currently involved in the ARC SIMPLE with the ALCOVE team and the University Hospital of Nancy. Our aim is to develop simulation tools of the interventional act, adapted to the patient's anatomy and physiology, in order to help the surgeon plan the coil placement and rehearse the therapeutic gesture, and to provide new tools to improve medical training in the technique.
We are involved in the FET-STREP European project ASPI, which started in November 2005. There is strong evidence that visual information from the speaker, especially the jaw and lips, noticeably improves speech intelligibility. Hence, a realistic talking head could help language-learning technology by giving the student feedback on how to change articulation in order to achieve a correct pronunciation. This task is complex and necessitates a multidisciplinary effort involving speech production modeling and image analysis. The long-term aim of the ASPI project is the design of a 3D+t articulatory model to be used for the realistic animation of a talking head. Within this project, we especially work on the tracking of the visible articulators using stereo-vision techniques, and we intend to supplement the model with the internal articulators (tongue, larynx) obtained from medical imaging (ultrasound images for tongue tracking and MRI for the global model).
Our software efforts are integrated in a library called RAlib which contains our research development on image processing, registration (2D and 3D) and visualization. This library is licensed by the APP (French agency for software protection).
The visualization module is called QGLSG: it enables the visualization of images and of 2D and 3D objects under a consistent perspective projection. It is based on the Qt and OpenGL libraries but originally offered few 3D visualization capabilities. Frédéric Speisser joined the Magrit project-team in September 2006 as an INRIA assistant engineer, in charge of improving the 3D visualization features.
This year saw major enhancements thanks to the integration of the OpenSceneGraph library.
This module is now used as the basis of several graphical user interfaces for demo or research in medical imaging, augmented reality and the talking head application.
Our collaboration with GE Healthcare has given rise to several patent disclosures on specific calibration processes, registration and visualization.
On the theme of scene and camera reconstruction, we followed two main directions of research, one in which the objects of the scenes are planes, the other in which the environment is unspecified. In this latter case, probabilistic frameworks were considered in order to meet the required robustness. Our efforts also concentrated on designing appropriate image descriptors for structure from motion algorithms.
We proposed a novel approach to building interest points and descriptors which are invariant to a subclass of affine transformations. Scale- and rotation-invariant interest points are usually obtained via the linear scale-space representation, and it has been shown that affine invariance can be achieved by warping the smoothing kernel to match the local image structure. Our approach is instead based on the so-called Affine Morphological Scale-Space, a non-linear filtering which has been proved to be the natural equivalent of the classic linear scale-space when affine invariance is required. Simple local image descriptors are then derived from the extracted interest points. We demonstrated the proposed approach through robust matching experiments.
Fundamental matrix estimation between two views is a cornerstone of structure-from-motion problems. Estimation is usually achieved in a twofold procedure: 1) identify matching interest points between the two views, and 2) sort out the best matches through robust filtering. The success of this latter step depends on the accuracy of the first one and on several thresholds. Setting those thresholds is quite delicate and makes it difficult to automate the whole process. Moisan and Stival have proposed a statistical model that makes it possible to get rid of these thresholds. We have verified on real and synthetic data that this model performs better than existing ones, especially in terms of robustness, accuracy, and computation time.
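The idea behind such threshold-free validation can be sketched a contrario: an inlier set is accepted when its expected number of chance occurrences (the number of false alarms, NFA) is below 1, so no residual threshold needs to be hand-tuned. The toy scoring function below, with a deliberately simplified NFA formula and hypothetical normalization, illustrates the principle only; it is not the Moisan–Stival criterion itself.

```python
import math

def best_inlier_count(residuals):
    """residuals: match errors normalized to [0, 1] (error / max possible error).

    Returns (k, NFA): the inlier count minimizing a simplified
    NFA(k) = n * C(n, k) * r_k**k, where r_k is the k-th smallest residual,
    i.e. the probability that one random match falls within that error.
    """
    r = sorted(residuals)
    n = len(r)
    best = None
    for k in range(1, n + 1):
        nfa = n * math.comb(n, k) * r[k - 1] ** k
        if best is None or nfa < best[1]:
            best = (k, nfa)
    return best
```

On a set of residuals with 15 tight matches and 5 outliers, the minimum-NFA count isolates the 15 inliers and its NFA is far below 1, i.e. the detection is statistically meaningful without any user-set threshold.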
Besides, we have presented a probabilistic framework for computing correspondences and the fundamental matrix in the structure-from-motion problem. Contrary to most existing algorithms, where correspondence setting and geometry estimation are independent steps, the proposed algorithm is an all-in-one approach. We showed that it is robust to repeated patterns, which are usually difficult to match unambiguously and thus raise many problems in fundamental matrix estimation.
Incremental building of 3D maps for immediate use is a very challenging problem. Its difficulty primarily stems from the fact that it must be causal, i.e. rely on past frames only, and also permit real time implementations. Recent advances in simultaneous localization and map building (SLAM) have been made in robot navigation research. However, in most of these approaches, scene reconstruction is not the final product but an intermediary stage for pose computation. As a result, models are generally poor, i.e. a set of sparse points, and cannot easily be used to position a virtual object or manage interactions between the real and the virtual scenes (occlusions, collisions, light exchanges, ...).
During this year, we have built an interactive method based on inter-image tracking of planar surfaces, whose intersections with a reference plane are automatically extracted using a particle filter. These intersections are used to resolve the ambiguity of the SFM problem based on a single plane, and to obtain the equations of the planes intersecting the reference plane. Different particle-filtering strategies were compared on synthetic data, and convincing results were obtained on real scenes.
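A generic particle-filter cycle of the kind used for such an extraction can be sketched as follows; here the state is reduced to a single scalar (say, the offset of an intersection line), and all names and noise levels are illustrative assumptions rather than the actual tracker.

```python
import math
import random

def particle_filter_step(particles, z, motion_std=0.1, obs_std=0.2):
    """One predict/update/resample cycle for a scalar state observed as z."""
    # predict: diffuse each particle with the motion model
    particles = [p + random.gauss(0.0, motion_std) for p in particles]
    # update: weight each particle by the Gaussian likelihood of the observation
    weights = [math.exp(-0.5 * ((p - z) / obs_std) ** 2) for p in particles]
    total = sum(weights)
    weights = [w / total for w in weights]
    # resample: draw particles in proportion to their weights
    return random.choices(particles, weights=weights, k=len(particles))
```

Starting from particles spread uniformly over the search interval, a few iterations against a stable observation concentrate the cloud around the true value.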
In addition, a semi-automatic system was proposed, which requires minimum intervention from the user to reconstruct the scene in real time and immediately perceive augmentations: he/she only has to target the planes to be integrated with the camera, and validate or cancel (pressing a key) the solutions generated by the system. Whereas traditional methods are limited to applications where the user has to keep the camera fixed while he/she outlines regions or points out features using a mouse, our system is well suited to modern applications of AR that run on personal digital assistants (PDA) or mobile phones.
In order to guide tools during the procedure, the interventional radiologist uses a vascular C-arm to acquire 2D fluoroscopy images in real time. Today, 3D X-ray images (3DXA) are also available on modern vascular C-arms. There is now a broad consensus that an important next step is to leverage the high-resolution volumetric information provided by 3DXA to complement the 2D fluoroscopy images and make tool guidance easier. In particular, 3DXA could be superimposed onto fluoroscopy images to enhance them. We call this application “Augmented Fluoroscopy”.
Such questions were investigated in Sébastien Gorges's PhD thesis, in collaboration with GE Healthcare and the University Hospital of Nancy. Sébastien defended his PhD thesis in May 2007. As a consequence, the first months of this year were mostly devoted to the consolidation of last year's preliminary activities: clinical evaluation of augmented fluoroscopy thanks to the software prototype installed on the clinical site by GE Healthcare, and validation of the algorithm for tracking the guide-wire in fluoroscopy images. In addition, preliminary results were obtained with a view to reconstructing the guide-wire in real time. The angiographic equipment used in our partner hospital is biplane, meaning that two X-ray images can be taken simultaneously from quasi-orthogonal viewpoints. Thanks to the above algorithm, the guide-wire is tracked in the projection images. Our previous modeling of the vascular C-arm also provides the calibration for each view. As a consequence, this wide-baseline calibrated stereo configuration makes it possible to reconstruct the guide-wire in 3D. Preliminary comparisons with tomographic reconstructions of the guide-wire convinced us that this approach is clinically viable and shows a promising path to actual micro-tool navigation in 3D.
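The geometry of that reconstruction can be sketched as follows: each tracked guide-wire point defines one back-projected ray per calibrated view, and the 3D point can be taken as the midpoint of the shortest segment joining the two rays (the textbook midpoint method, shown here as an illustrative stand-in, not the clinical prototype's algorithm).

```python
def dot(a, b): return sum(x * y for x, y in zip(a, b))
def sub(a, b): return [x - y for x, y in zip(a, b)]
def add_scaled(a, s, d): return [x + s * y for x, y in zip(a, d)]

def triangulate_midpoint(c1, d1, c2, d2):
    """Midpoint of the shortest segment between rays c1 + s*d1 and c2 + t*d2.

    c1, c2: camera centers; d1, d2: back-projected ray directions.
    """
    # solve the 2x2 normal equations for the closest-approach parameters s, t
    r = sub(c1, c2)
    a, b, c = dot(d1, d1), dot(d1, d2), dot(d2, d2)
    e, f = dot(d1, r), dot(d2, r)
    denom = a * c - b * b          # zero only for parallel rays
    s = (b * f - c * e) / denom
    t = (a * f - b * e) / denom
    p1 = add_scaled(c1, s, d1)
    p2 = add_scaled(c2, t, d2)
    return [(x + y) / 2.0 for x, y in zip(p1, p2)]
```

With noise-free rays that actually intersect, the midpoint coincides with the true 3D point; with tracking noise it gives a sensible compromise between the two views.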
The endovascular treatment of an intracranial aneurism consists in filling the aneurismal cavity with coils, long platinum springs that, once deployed, wind into a compact ball. Considering the location of the lesion, close to the brain, and its small size, a few millimeters, the interventional gesture requires careful planning and can only be performed by a very experienced surgeon. A simulation tool of the interventional act, available in the operating room, reliable, and adapted to the patient's anatomy and physiology, would help to plan the coil placement, rehearse the procedure, and improve medical training in the technique.
The SIMPLE project, an INRIA cooperative research action (ARC), started this year. It aims at developing methods to simulate coil deployment in an intracranial aneurism, running in real time and adaptable to any patient data. We coordinate this project which runs in collaboration with the Alcove project-team at INRIA Futur-Lille and the University Hospital of Nancy.
Our task consists in providing a precise geometric model of the arteries as well as patient-specific information on blood flow. The first step, taken this year, was to extract the arterial wall. Despite the very high quality of the available 3D images (3D rotational angiography), tomographic reconstruction artefacts perturb the isosurface that should correspond to the arterial wall. Taking this isosurface as an initialization, we propose to improve it within an active-surface framework, in which the arterial wall is deformed until its X-ray projection fits a set of registered 2D angiographic images of the patient. Patricia Wills joined the project-team in September to develop and implement this approach.
As demands on hospital efficiency increase, there is a stronger need for automatic analysis, recovery, and modification of surgical workflows. Even though most of the previous works have dealt with higher level and hospital-wide workflow including issues like document management, workflow is also an important issue within the surgery room. Its study has a high potential, e.g., for building context-sensitive operating rooms, evaluating and training surgical staff, optimizing surgeries and generating automatic reports.
During this year, we have proposed an approach to segment the surgical workflow into phases based on temporal synchronization of multidimensional state vectors. Our method is evaluated on the example of laparoscopic cholecystectomy, with state vectors representing tool usage during the surgeries. The discriminative power of each instrument with regard to each phase is estimated using AdaBoost. A boosted version of the Dynamic Time Warping (DTW) algorithm is used to create a surgical reference model and to segment a newly observed surgery. Full cross-validation on ten surgeries has been performed, and the method has been compared to standard DTW and to Hidden Markov Models.
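The core alignment can be sketched with a weighted DTW, where each dimension of the binary tool-usage vector is scaled by a per-instrument weight (in the actual work these weights come from AdaBoost; the uniform weights and tiny vectors below are purely illustrative).

```python
def weighted_dtw(seq_a, seq_b, weights):
    """DTW distance between two sequences of tool-usage vectors.

    The local cost is a weighted L1 distance, so instruments that
    discriminate phases well (large weight) dominate the alignment.
    """
    inf = float("inf")
    n, m = len(seq_a), len(seq_b)
    dist = [[inf] * (m + 1) for _ in range(n + 1)]
    dist[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = sum(w * abs(a - b)
                       for w, a, b in zip(weights, seq_a[i - 1], seq_b[j - 1]))
            dist[i][j] = cost + min(dist[i - 1][j],      # insertion
                                    dist[i][j - 1],      # deletion
                                    dist[i - 1][j - 1])  # match
    return dist[n][m]
```

Because DTW allows non-linear time stretching, a surgery that goes through the same phases at a different pace still aligns at zero cost, while a surgery with different tool usage does not.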
Being able to produce realistic facial animation is crucial for many speech applications in language learning technologies. In order to reach realism, it is necessary to acquire 3D models of the face and of the internal articulators (tongue, palate,...) from various image modalities.
Our long-term objective is to provide intuitive and near-automatic tools for building a dynamic 3D model of the vocal tract from various image and sensor modalities (MRI, ultrasound (US), video, magnetic sensors ...). Previous works have proven that the ultrasound modality is an efficient way to acquire dynamic tongue information. Unfortunately, the tip of the tongue is not visible in most US images because air stops the propagation of the ultrasound waves. In this work, we use magnetic sensors glued on the tongue in order to complete, near the apex, the tongue contour obtained from ultrasound images.
Combining several modalities requires that all geometrical and temporal data be mutually consistent. During this year, we have focused on the calibration and synchronization tasks. A fast, low-cost and easily reproducible acquisition system has been designed in order to temporally align the data. In this system, the different modalities (the electromagnetic sensors, the ultrasound system and the audio recording system) are supervised by a control PC. The modalities are synchronized using an external event, an audio beep emitted by the control PC. Experiments carried out to assess the accuracy of the synchronization process showed that the synchronization error of the US acquisition with respect to the other modalities is at most two frames at an acquisition rate of 66 Hz. This setup has been successfully used to record a corpus that is currently used by our partners of the European ASPI project to evaluate audiovisual-to-articulatory inversion methods.
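Aligning the recorded streams on the shared beep amounts to finding, for each audio track, the lag that maximizes its cross-correlation with the beep template, e.g. (illustrative sketch under that assumption, not the actual acquisition code):

```python
def beep_lag(template, track, max_lag):
    """Return the lag (in samples) of `template` inside `track` that
    maximizes their cross-correlation."""
    def correlation(lag):
        return sum(template[i] * track[i + lag]
                   for i in range(len(template))
                   if 0 <= i + lag < len(track))
    return max(range(0, max_lag + 1), key=correlation)
```

Once the lag of the beep is known in each stream, every modality can be shifted to a common time origin.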
Ongoing work concerns the design of an algorithm that fuses sensor information and tongue tracking to recover the complete shape of the tongue in US sequences. We are currently investigating snake-based strategies constrained by the sensor positions in order to improve the robustness of classical tracking methods.
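The constraint can be sketched on a simplified 1D contour: each contour sample relaxes toward both its neighbors (smoothness force) and the image data (external force), while samples that coincide with a magnetic sensor are pinned to the measured value. All names, weights, and the toy data below are illustrative assumptions, not the tracking algorithm under study.

```python
def constrained_snake(data, anchors, iterations=300, alpha=0.5, beta=0.5):
    """Relax contour heights toward `data` under a smoothness prior,
    with hard constraints at the sensor indices in `anchors`.

    data:    observed contour heights (e.g. edge positions in US images)
    anchors: {index: height} measured by the glued magnetic sensors
    """
    ys = [0.0] * len(data)
    for _ in range(iterations):
        new = []
        for i, y in enumerate(ys):
            if i in anchors:               # sensor measurement: pin the point
                new.append(anchors[i])
                continue
            left = ys[i - 1] if i > 0 else y
            right = ys[i + 1] if i < len(ys) - 1 else y
            smooth = 0.5 * (left + right)  # internal (smoothness) force
            new.append(y + alpha * (smooth - y) + beta * (data[i] - y))
        ys = new
    return ys
```

Near the anchor the contour bends toward the sensor value, while far from it the contour follows the image data, which mimics how the sensors complete the contour where the ultrasound signal is missing.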
Modeling facial dynamics is often achieved using stereovision techniques. Unfortunately, reconstruction artifacts are common, and spatial and temporal regularization constraints are often imposed to cope with this problem. As with most regularization methods, the parameters of the regularization functions need to be carefully tuned to avoid reconstruction artifacts on the one hand and excessive smoothing on the other.
In order to improve the robustness of the process, we proposed an approach that borrows concepts from model-based approaches and from vision-based tracking of markers on the face. Our input data were a dense 3D map of the talker obtained for one viseme, as well as a corpus of stereo sequences of the talker with markers painted on his/her face, which allowed the face kinematics to be computed. In this study, the 3D dense map was acquired with a scanner for a sustained vowel, but other acquisition technologies could be used. Our main idea was to transfer the kinematics learned on the sparse mesh onto the dense 3D mesh in order to generate realistic dense animations of the face. As a result, we were able to recover dynamic dense meshes that presented neither depth artifacts nor over-smoothing. As a side effect, we are also able to recover the dense deformation modes of the talker, thus allowing an easy animation of the head.
The partnership with GE Healthcare (formerly GE Medical Systems) started in 1993. In the past few years, it has centered on the supervision of CIFRE PhD students on the topic of using a multi-modal framework in interventional neuroradiology. A new PhD started in January 2004 and ended in May 2007 on the design of augmented reality tools for neuronavigation. The concept of Augmented Fluoroscopy was one of the main results of this PhD. A prototype implementing these results has been developed by GE Healthcare and has been available at the University Hospital of Nancy for clinical evaluation since July 2006.
The SIMPLE project (see above), an INRIA cooperative research action (ARC) that we coordinate in collaboration with the Alcove project-team at INRIA Futur-Lille and the University Hospital of Nancy, started this year; it aims at developing real-time methods, adaptable to any patient data, to simulate coil deployment in an intracranial aneurism.
ASPI is about audiovisual-to-articulatory inversion. Participants in this project are INRIA Lorraine, ENST (Paris), KTH (Stockholm), the University Research Institute of the National Technical University of Athens, and the University of Brussels. Audiovisual-to-articulatory inversion consists in recovering the dynamics of the vocal tract shape (from vocal folds to lips) from the acoustic speech signal, supplemented by image analysis of the speaker's face. Being able to recover this information automatically would be a major breakthrough in speech research and technology, as a vocal tract representation of a speech signal would be both beneficial from a theoretical point of view and practically useful in many speech processing applications (language learning, automatic speech processing, speech coding, speech therapy, the film industry...). The Magrit team is involved in the development of articulatory models from various image modalities (ultrasound, video, MRI) and electromagnetic sensors.
This year, the first main achievement concerned the synchronization of the different imaging modalities. The present system enables the synchronized acquisition of ultrasound images, electromagnetic sensor data, speech, and stereovision images of the face. The second main achievement concerned inversion: we are starting to evaluate inversion using these articulatory data and other data acquired with an articulograph. In addition, we are working on the inversion of fricative sounds, and we are investigating the automatic processing of ultrasound and X-ray data with a view to tracking the articulators.
The acquisition of MRI data has been continued in order to elaborate a 3D model of the vocal tract.
M.-O. Berger was a member of the program committee of the conferences MICCAI 07, ISMAR 07, RFIA'08.
G. Simon was a member of the program committee of ISMAR 07 and of the ISMAR 2007 Awards Committee.
F. Sur was a member of the program committee of the Conférence sur l'Apprentissage, CAP 2007.
Workshop organizations:
M.-O. Berger co-organized with D. Bechmann (Strasbourg University) the workshop “Virtual and Augmented Reality: break the deadlocks” in Laval (April 20th).
Several members of the group, in particular assistant professors and Ph.D. students, actively teach at the Henri Poincaré Nancy 1 and Nancy 2 universities and at INPL.
Other members of the group also teach in the computer science Master of Nancy and in the “Master en sciences de la vie et de la santé” (SVS).
Members of the group participated in the following events: International Conference on Medical Image Computing and Computer Assisted Intervention (MICCAI 07, Brisbane, Australia), International Conference on Image Processing (ICIP 2007, San Antonio, USA), Audiovisual Speech Processing (AVSP 2007, Hilvarenbeek, The Netherlands), INTERSPEECH 2007 (Antwerp, Belgium), Orasis 2007 (Obernai, France), Computer Assisted Radiology and Surgery (CARS 2007, Berlin, Germany), and PDE and Variational Methods in Image Analysis (Metz, France).
Thesis committees:
M.O. Berger: S. Bourgeois (Clermont Ferrand), P. Mozer ( Grenoble), G. Sourimant (Rennes) E. Boyer (HDR, Grenoble)