The overall objective of the MOVI research team is to develop theories, models, methods, and systems in order to allow computers to see and to understand what they see. A major difference between classical computer systems and computer vision systems is that while the former are guided by sets of mathematical and logical rules, the latter are governed by the laws of nature. It turns out that formalizing interactions between an artificial system and the physical world is a tremendously difficult task.
A first objective is to be able to gather images and videos with one or several cameras, to calibrate them, and to extract 2D and 3D geometric information from these images and videos. This is an extremely difficult task because the cameras receive light stimuli and these stimuli are affected by the complexity of the objects (shape, surface, color, texture, material) composing the real world. The interpretation of light in terms of geometry is also affected by the fact that the three dimensional world projects onto two dimensional images and this projection alters the Euclidean nature of the observed scene.
A second objective is to analyse articulated and moving objects. The real world is composed of rigid, deformable, and articulated objects. Solutions for finding the motion fields associated with deformable and articulated objects (such as humans) remain to be found. It is necessary to introduce prior models that encapsulate physical and mechanical features as well as shape, aspect, and behaviour. The ambition is to describe complex motion as ``events'' at both the physical level and at the semantic level.
A third objective is to describe and interpret images and videos in terms of objects, object categories, and events. In the past it has been shown that it is possible to recognize a single occurrence of an object from a single image. A more ambitious goal is to recognize object classes such as people, cars, trees, chairs, etc., as well as events or objects evolving in time. In addition to the usual difficulties that affect images of a single object, there is the issue of variability within a class. The notion of statistical shape must be introduced and hence statistical learning should be used. More generally, learning should play a crucial role and the system must be designed so that it is able to learn from a small training set of samples. Another goal is to investigate how an object recognition system can take advantage of non-visual input such as semantic and verbal descriptions. The relationship between images and meaning is a great challenge.
A fourth objective is to build vision systems that encapsulate one or several of the objectives stated above. Vision systems are built within a specific application. The domains to which vision may contribute are numerous:
Multi-media technologies and in particular film and TV productions, database retrieval;
Visual surveillance and monitoring;
Augmented and mixed reality technologies and in particular entertainment, cultural heritage, telepresence and immersive systems, image-based rendering and image-based animation;
Embedded systems for car and driving technologies, portable devices, defense, space, etc.
Computer vision requires models that describe the image creation process. An important part (besides, e.g., radiometric effects) concerns the geometrical relations between the scene, the cameras and the captured images, commonly subsumed under the term ``multi-view geometry''. This describes how a scene is projected onto an image, and how different images of the same scene are related to one another. Many concepts are developed and expressed using the tools of projective geometry. For numerical estimation, e.g. structure and motion calculations, geometric concepts are expressed algebraically. Geometric relations between different views can for example be represented by so-called matching tensors (fundamental matrix, trifocal tensors, ...). These tools and others make it possible to devise the theory and algorithms for the general task of computing scene structure and camera motion, and especially to perform this task using various kinds of geometrical information: matches of geometrical primitives in different images, constraints on the structure of the scene or on the intrinsic characteristics or the motion of cameras, etc.
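As a minimal illustration of how a scene point is projected onto an image, the sketch below applies the classical pinhole model x ~ K (R X + t). The function name `project` and the numerical values of the intrinsic matrix are assumptions chosen for the example; they do not come from any of the team's software.

```python
def project(K, R, t, X):
    """Project a 3D world point X to pixel coordinates with a pinhole camera.

    K: 3x3 intrinsic matrix, R: 3x3 rotation, t: translation (world -> camera).
    All names and values here are illustrative assumptions.
    """
    # Camera-frame coordinates: Xc = R X + t
    Xc = [sum(R[i][j] * X[j] for j in range(3)) + t[i] for i in range(3)]
    # Homogeneous image point: x = K Xc
    x = [sum(K[i][j] * Xc[j] for j in range(3)) for i in range(3)]
    if x[2] <= 0:
        raise ValueError("point is not in front of the camera")
    # Perspective division gives pixel coordinates
    return (x[0] / x[2], x[1] / x[2])


# Hypothetical camera: focal length 800 pixels, principal point (320, 240)
K = [[800.0, 0.0, 320.0], [0.0, 800.0, 240.0], [0.0, 0.0, 1.0]]
I3 = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
print(project(K, I3, [0.0, 0.0, 0.0], [0.5, -0.25, 2.0]))  # (520.0, 140.0)
```

The perspective division is what removes depth information, which is why several views (and matching tensors relating them) are needed to recover 3D structure.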
Modern computer vision techniques and applications require the deployment of a large number of cameras linked to a powerful multi-PC computing platform. Such a system must therefore fulfill the following requirements: the cameras must be synchronized to within a millisecond, the bandwidth associated with image transfer (from the sensor to the computer memory) must be large enough to allow the transmission of uncompressed images at video rates, and the computing units must be able to dynamically store the data and/or to process them in real time.
Until recently, the vast majority of systems were based on hybrid analog-digital camera systems. Current systems are all-digital. They are based on network communication protocols such as IEEE 1394. Current systems deliver 640 × 480 grey-level/color images, but in the near future 1600 × 1200 images will be available at 30 frames/second.
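To make the bandwidth requirement concrete, the following sketch computes the raw data rate of uncompressed video. The pixel depths (1 byte per grey-level pixel, 3 bytes per color pixel) and the 30 frames/second rate for the 640 × 480 case are assumptions made for the arithmetic; the text only states the image sizes and the rate for the larger format.

```python
def bandwidth_mb_per_s(width, height, bytes_per_pixel, fps, cameras=1):
    """Raw (uncompressed) data rate in megabytes per second (1 MB = 10**6 bytes)."""
    return width * height * bytes_per_pixel * fps * cameras / 1e6


# Assumed pixel depths: 1 byte (grey-level) and 3 bytes (24-bit color)
print(bandwidth_mb_per_s(640, 480, 1, 30))     # one grey-level 640 x 480 camera
print(bandwidth_mb_per_s(1600, 1200, 3, 30))   # one color 1600 x 1200 camera
print(bandwidth_mb_per_s(640, 480, 1, 30, cameras=10))
```

Under these assumptions a single grey-level 640 × 480 stream needs roughly 9.2 MB/s, while a color 1600 × 1200 stream needs about 172.8 MB/s, which illustrates why the transfer path from sensor to memory is a design constraint.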
Camera synchronization may be performed in several ways. The most common one is to use special-purpose hardware. Since both cameras and computers are linked through a network, it is possible to synchronize them using network protocols, such as NTP (network time protocol).
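The clock comparison at the heart of NTP-style synchronization can be sketched as follows. The four timestamps follow the usual NTP convention (client send, server receive, server send, client receive); the function name and the example values are hypothetical, and this is only the offset arithmetic, not a full protocol implementation.

```python
def ntp_offset_and_delay(t0, t1, t2, t3):
    """Estimate clock offset and round-trip delay from one request/reply exchange.

    t0: client send time    (client clock)
    t1: server receive time (server clock)
    t2: server send time    (server clock)
    t3: client receive time (client clock)
    """
    offset = ((t1 - t0) + (t2 - t3)) / 2.0   # server clock minus client clock
    delay = (t3 - t0) - (t2 - t1)            # network round trip, excluding server time
    return offset, delay


# Hypothetical exchange, in milliseconds: the client clock is 5 ms behind the
# server, each network leg takes 2 ms and the server needs 1 ms to reply.
print(ntp_offset_and_delay(0, 7, 8, 5))  # (5.0, 4)
```

The offset estimate is exact only when the two network legs take the same time, which is one reason dedicated trigger hardware is usually preferred when sub-millisecond accuracy is needed.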
Recovering shapes from images is a fundamental task in computer vision. Applications are numerous and include, in particular, 3D modeling applications and mixed reality applications where real shapes are mixed with virtual environments. The problem faced here is to recover shape information such as surfaces, point positions, or differential properties from image information. A tremendous research effort has been made in the past to solve this problem and a number of partial solutions have been proposed. However, a fundamental issue still to be addressed is the recovery of full shape information over time sequences. The main difficulties are precision, robustness of computed shapes as well as consistency of these shapes over time. An additional difficulty raised by real-time applications is complexity. Such applications are today feasible but often require powerful computation units such as PC clusters. Thus, significant efforts must also be devoted to switch from traditional single-PC units to modern computation architectures.
The perception of motion is one of the major goals in computer vision with a wide range of promising applications. A prerequisite for motion analysis is motion modelling. Motion models span from rigid motion to complex articulated and/or deformable motion. Deformable objects form an interesting case because the models are closely related to the underlying physical phenomena. In the recent past, robust methods were developed for analysing rigid motion. This can be done either in image space or in 3D space. Image-space analysis is appealing but it requires sophisticated non-linear minimization methods and a probabilistic framework. An intrinsic difficulty with methods based on 2D data is the ambiguity of associating a 3D model with multiple degrees of freedom with image contours, texture and optical flow. Methods using 3D data are more relevant with respect to our recent research investigations. 3D data are produced using stereo or a multiple-camera setup. These data are matched against an articulated object model (based on cylindrical parts, implicit surfaces, conical parts, and so forth). The matching is carried out iteratively using various methods, such as ICP (iterative closest point) or EM (expectation-maximization).
Challenging problems are the detection of motion and motion tracking. When a vision system observes complex articulated motion, such as the motion of the hands, it is crucial to be able to detect motion cues and to interpret them in terms of moving parts, independently of a prior model. Another difficult problem is to track articulated motion over time and to estimate the motions associated with each individual degree of freedom.
In the last few years, there has been a tremendous effort to combine the methods of computer vision and information retrieval, so that images and videos may be indexed, searched and retrieved more efficiently. But it soon became clear that there is a semantic gap between the lower-level visual primitives that computer vision can recognize and the higher-level concepts that information retrieval can usefully index. A promising approach extends Bayesian, statistical methods such as Hidden Markov Models - which have been very successful in speech processing - to video. Thus, objects and events are represented as flexible structures of recognizable primitives. A key difficulty is the choice of primitives, both in space and in time, from a large number of possible low-level features. The analysis and description of human motion is one area where this issue can be resolved using built-in geometric and kinematic models of the human body. Yet, detecting people without prior knowledge remains an unresolved problem, which necessitates a combination of different low-level cues and high-level reasoning. Recently, a related but much more general approach has emerged, where the recognition of objects in images is viewed as a translation problem between images and their natural language descriptions. Thus, statistical translation models are learned from large collections of described images, as was done in the 1990s for collections of textual translations. Extending this approach to the recognition of events in video will be a challenging but promising new area of research.
3D modeling from images can be seen as a basic technology, with many uses and applications in various domains. Some applications only require geometric information (measuring, visual servoing, navigation) while more and more rely on more complete models (3D models with texture maps or other models of appearance) that can be rendered in order to produce realistic images. Some of our projects directly address potential applications in virtual studios or ``edutainment'' (e.g. virtual tours), and many others may benefit from our scientific results and software.
Mixed realities consist in merging real and virtual environments. The fundamental issue in this field is the level of interaction that can be reached between real and virtual worlds, typically a person catching and moving a virtual object. This level depends directly on the precision of the real world models that can be obtained and on the rapidity of the modeling process to ensure consistency between both worlds. A challenging task is then to use images taken in real-time from cameras to model the real world without help from intrusive material such as infrared sensors or markers.
Augmented reality systems allow a user to see the real world with computer graphics and computer animation superimposed and composited with it. Applications of the concept of AR basically use virtual objects to help the user get a better understanding of her/his surroundings. Fundamentally, AR is about augmentation of human visual perception: entertainment, maintenance and repair of complex/dangerous equipment, training, telepresence in remote, space, and hazardous environments, emergency handling, and so forth. In recent years, computer vision techniques have proved their potential for solving key problems encountered in AR: real-time pose estimation, detection and tracking of rigid objects, etc. However, the vast majority of existing systems use a single camera, and the technological challenge has consisted in aligning a prestored geometrical model of an object with a monocular image sequence.
We are particularly interested in the capture and analysis of human motion, which consists in recovering the motion parameters of the human body and/or human body parts, such as the hand. In the past researchers have concentrated on recovering constrained motions such as human walking and running. We are interested in recovering unconstrained motion. The problem is difficult because of the large number of degrees of freedom, the small size of some body parts, the ambiguity of some motions, the self-occlusions, etc. Human motion capture methods have a wide range of applications: human monitoring, surveillance, gesture analysis, motion recognition, computer animation, etc.
The employment of advanced computer vision techniques for media applications is a dynamic area that will benefit from scientific findings and developments. There is a huge potential in the spheres of TV and film productions, interactive TV, multimedia database retrieval, and so forth.
Vision research provides solutions for real-time recovery of studio models (3D scene, people and their movements, etc.) in realistic conditions compatible with artistic production (several moving people in changing lighting conditions, partial occlusions). In particular, the recognition of people and their motions will offer a whole new range of possibilities for creating dynamic situations and for immersive/interactive interfaces and platforms in TV productions. These new and not yet available technologies involve integration of action and gesture recognition techniques for new forms of interaction between, for example, a TV moderator and virtual characters and objects, two remote groups of people, real and virtual actors, etc.
Another important domain is the interaction with multi-media databases through advanced multimodal interfaces. In order to archive and manage multimedia visual material such as news, social and cultural events, movies, theater and music performances, etc., it is necessary to extract and store information concerning its content in addition to its mere recording. This implies that a system is able to perform automatic analysis of the visual information available in video sequences. Generally speaking, modern audio-visual systems for understanding, classifying, archiving, and accessing multimedia databases must encapsulate the following features: (a) shot detection and classification, (b) recognition of individuals (actors, players, athletes, ...), (c) recognition of facial expressions, and (d) action and gesture recognition.
In the long term (five to ten years from now) all car manufacturers foresee that cameras with their associated hardware and software will become parts of standard car equipment. Cameras' fields of view will span both outside and inside the car. Computer vision software should be able to have both low-level (alert systems) and high-level (cognitive systems) capabilities. Forthcoming camera-based systems should be able to detect and recognize obstacles in real-time, to assist the driver for manoeuvering the car (through a verbal dialogue), and to monitor the driver's behaviour. For example, the analysis and recognition of the driver's body gestures and head motions will be used as cues for modelling the driver's behaviour and for alerting her or him if necessary.
The MOVI project has a long tradition of scientific and technological collaborations with the French defense industry. In the past we collaborated with Aérospatiale SA for 10 years (from 1992 to 2002). During these years we developed several computer vision based techniques for air-to-ground and ground-to-ground missile guidance. In particular we developed methods for enabling 3D reconstruction and pose recovery from cameras on board the missile, as well as a method for tracking a target in the presence of large scale changes.
The Grimage platform is a room dedicated to video acquisition, 3D computations and visualisation. It is an experimental platform for several research teams of the INRIA Rhone-Alpes including MOVI, MOAIS, ARTIS and EVASION. Many specific software developments have been realized for this platform using national or European funding.
The MV platform is a mobile platform composed of 4-6 video cameras and a small PC cluster including a laptop for visualization. It allows real-time visualization of dynamic textured models of real scenes. This platform has been developed by the MOVI team for the European Project Holonics.
We developed software for calibrating large numbers of cameras with minimal user intervention. The calibration object is very simple, consisting of a rigid rod holding active or passive markers. The calibration process consists of three steps – acquiring a short sequence of the calibration object moving in the field of the cameras, detecting and tracking the markers automatically in all views, and simultaneously reconstructing the 3D trajectories of the markers and the camera parameters. The new software includes a graphical user interface (GUI) to allow even non-expert users to perform and validate a calibration. In the current version, user intervention is strictly limited to initializing the camera positions and orientations in a global reference coordinate system. The software will be used internally in the GRIMAGE room and also delivered to some of our industry partners.
We developed a system for calibrating a camera from a special-purpose calibration object. In particular, the focal length and the radial distortion parameters are estimated. Other functionalities such as the acquisition of images, stereo calibration and 3D measurements are also available. In detail, the software package includes the following features:
Calibration: Non-linear estimation of the intrinsic parameters: focal length, optical center, skew parameter, radial distortion. Non-linear estimation of the position of the camera relative to the calibration object.
Stereo Calibration: Non-linear estimation of the parameters of a pair of cameras: intrinsic parameters and relative position of the two cameras. Calibration from several pairs of views of the calibration object.
Metrology: 3D reconstruction by triangulation. Statistics on the 3D reconstruction accuracy.
Image Acquisition: Acquisition from an IC-RGB acquisition card. Acquisition from PGM image files.
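The radial distortion parameters mentioned in the feature list can be illustrated with a small sketch. It assumes the common polynomial model in normalized image coordinates, x_d = x (1 + k1 r^2 + k2 r^4); the text does not state which distortion model the package uses, so the model, the function names and the coefficient values are all assumptions.

```python
def distort(x, y, k1, k2=0.0):
    """Apply polynomial radial distortion to normalized image coordinates."""
    r2 = x * x + y * y
    scale = 1.0 + k1 * r2 + k2 * r2 * r2
    return x * scale, y * scale


def undistort(xd, yd, k1, k2=0.0, iterations=20):
    """Invert the distortion by fixed-point iteration (adequate for moderate distortion)."""
    x, y = xd, yd                      # initial guess: the distorted point itself
    for _ in range(iterations):
        r2 = x * x + y * y
        scale = 1.0 + k1 * r2 + k2 * r2 * r2
        x, y = xd / scale, yd / scale  # re-estimate the undistorted point
    return x, y


# Hypothetical coefficients for a lens with mild barrel distortion
xd, yd = distort(0.3, -0.2, k1=-0.2, k2=0.05)
print(undistort(xd, yd, k1=-0.2, k2=0.05))
```

The forward model has no closed-form inverse in general, which is why the inverse mapping is computed iteratively here.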
The software package is downloadable at http://perception.inrialpes.fr/Soft/calibration/index.html.
The Blinky software library aims at real-time acquisition of images for multiple cameras spread over a PC cluster or a computing grid. The library contains tools to develop two kinds of software components:
A frontend is directly connected to the camera driver and is in charge of image acquisition. Blinky makes the images available either to the local host by using shared memory or to other hosts by networking, allowing images to be captured transparently across a local network. In effect, the frontend can be seen as an image server, which can also be used to change the camera parameters (shutter speed, aperture, zoom, ...). The frontend is also able to record the raw video stream to a file, which can later be used as if it were a live camera.
A backend is the user application that captures the images and processes them. Each camera is designated by a device name which is valid across the network, and acquiring an image is as simple as reading a file. Multiple backends can connect to a single frontend, allowing many applications to run at the same time using the same cameras. A single backend can also connect to multiple cameras, and a single function call is sufficient to acquire a set of synchronized images from multiple cameras spread over the network.
The Blinky distribution also contains a set of sample frontends and backends:
blinkyf1394, a frontend for IIDC-compliant IEEE1394 cameras on Linux.
blinkyflnrd, a frontend for CameraLink cameras, using the Arvoo Leonardo frame grabber on Linux.
blinkyfdummy, a dummy frontend serving predictable images.
blinkybenchmark, a backend to test the acquisition rate (FPS) of a camera.
blinkysaveimages, a backend that saves images acquired from one or several cameras to any standard image format.
blinkysdl, a backend that visualizes the video stream using the SDL graphical library.
The library itself is fully POSIX-compliant, and was ported to several variants of Unix and to Windows 2000/XP. As soon as this software is out of its beta-stage, it will be distributed with an open-source license.
We developed new software with a graphical user interface (GUI) to remotely control the acquisition, synchronization and display of video streams from multiple cameras (up to 16) arbitrarily distributed on a grid of workstations. The software can be used as a recorder, in which case it controls all camera parameters (e.g. gain, shutter speed, focus, white balance) through the IEEE 1394 interface. The software allows the hardware configuration to be dynamically modified, and supports either hardware or software synchronization of all cameras. The software controls remote processes which are responsible for recording the video streams on disk with frame rates up to 30 frames per second and resolutions up to 800 × 600. The software displays the video streams from all cameras simultaneously in reduced resolution during recording.
The software can also be used as a video player, allowing the synchronized display of all recorded video streams simultaneously, as well as limited video editing functions (including transcoding, trimming, bookmarking and transfer over networks).
The software makes it possible even for non-experts to successfully record and play back multiple-view video. It will be used internally in the GRIMAGE room and also delivered to some of our academic and industry partners. It will also serve as a framework for developing other real-time video processing modules.
Software for extracting and tracking interest points in video sequences has been developed. The software is based on standard computer vision techniques, and comes with a user-friendly GUI.
We developed a complete model acquisition chain, from camera acquisition to model visualization. Modeling algorithms are silhouette based and produce an approximated model of the scene, the visual hull, in real-time. The software was deployed on the GRIMAGE platform which is composed of up to 10 firewire cameras, 20 PCs and 16 projectors for visualization. Its implementation was made in collaboration with the MOAIS INRIA team and includes today: image acquisition (described before), background subtraction to extract silhouettes, 3D modeling on PC clusters to compute visual hulls, 3D rendering and 3D display using several projectors for high screen resolutions.
A versatile software library for stereo reconstruction was developed. This library contains a number of state-of-the-art algorithms, and it helps in prototyping advanced stereo algorithms, such as those dealing with occlusions and low-textured areas. Besides, two real-time implementations of standard stereo algorithms were developed: A parallel version to be run on a PC cluster, and a version using the GPU (Graphics Processing Unit). All versions can run from live cameras using the Blinky acquisition library, making them usable for real-time applications.
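For readers unfamiliar with what a standard stereo algorithm computes, here is a minimal sketch of window-based matching with a sum-of-absolute-differences (SAD) cost on one rectified scanline. It is not the library's implementation and does not handle occlusions or low-textured areas; the function name, window size and test signal are assumptions made for illustration.

```python
def sad_disparity(left, right, window=1, max_disp=4):
    """Per-pixel disparity along one rectified scanline using SAD block matching.

    left, right: lists of intensities for the same image row in both views.
    A pixel at column x in the left view is searched at column x - d in the right view.
    """
    n = len(left)
    disparity = [0] * n
    for x in range(window, n - window):
        best_cost, best_d = None, 0
        for d in range(max_disp + 1):
            if x - d - window < 0:          # window would leave the right image
                break
            cost = sum(abs(left[x + k] - right[x - d + k])
                       for k in range(-window, window + 1))
            if best_cost is None or cost < best_cost:
                best_cost, best_d = cost, d
        disparity[x] = best_d
    return disparity


# Synthetic example: the left scanline is the right one shifted by 2 pixels
right = [10, 10, 50, 90, 40, 10, 10, 10, 10, 10]
left = [10, 10, 10, 10, 50, 90, 40, 10, 10, 10]
print(sad_disparity(left, right)[4:7])  # [2, 2, 2]
```

Each pixel is processed independently, which is what makes this family of algorithms easy to distribute over a PC cluster or a GPU; in uniform regions the cost is ambiguous, which is the low-texture problem mentioned above.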
We have developed a library that allows visual hulls to be computed given a set of silhouettes and the associated camera calibration. Its complexity is quasi-optimal, which allows real-time computation of 3D models. This library is the state of the art in the field and has already been used by several research teams around the world.
We have developed a human-body modeller that takes input from videos and interactively allows a user to associate 3D points with rotational joints. Based on these point-to-joint assignments, a person is modelled with 19 body parts and 48 rotational joints. Each body part is modelled by a truncated elliptical cone.
We have developed software which displays sequences of 3D textured dynamic models. This software is dedicated to the multiple-video context. It takes as input a sequence of 3D models over time and the associated projective textures, or videos, from several viewpoints. The output is a textured 3D model that can be viewed interactively from any viewpoint and at any time in the sequence, including continuous playback. To our knowledge, this is the first public software for such input data.
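The idea of a visual hull can be sketched with simple voxel carving: a voxel is kept only if it projects inside the silhouette in every view. This is a deliberately simpler technique than whatever the library implements (the text does not describe its algorithm), and the orthographic toy cameras, grid size and function names below are assumptions for the example.

```python
def in_silhouette(pixel, silhouette):
    """True if the pixel lies inside the image and on the foreground mask."""
    r, c = pixel
    if not (0 <= r < len(silhouette) and 0 <= c < len(silhouette[0])):
        return False
    return bool(silhouette[r][c])


def carve_visual_hull(grid_size, views):
    """Return the set of voxels (i, j, k) consistent with all silhouettes.

    views: list of (project, silhouette) pairs, where project maps a voxel
    to a pixel (row, col) and silhouette is a 2D list of booleans.
    """
    hull = set()
    for i in range(grid_size):
        for j in range(grid_size):
            for k in range(grid_size):
                if all(in_silhouette(project(i, j, k), sil) for project, sil in views):
                    hull.add((i, j, k))
    return hull


# Toy setup: two orthographic views of a 4 x 4 x 4 grid, each seeing a 2 x 2 square
square = [[1 <= r <= 2 and 1 <= c <= 2 for c in range(4)] for r in range(4)]
views = [(lambda i, j, k: (i, j), square),   # view along the k axis
         (lambda i, j, k: (j, k), square)]   # view along the i axis
print(len(carve_visual_hull(4, views)))  # 8
```

With calibrated perspective cameras, `project` would be replaced by the projection of the voxel center through each camera matrix; the exhaustive loop over voxels is what more efficient methods avoid.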
Cyclope is a 6 degrees of freedom tracker using a single camera, an infrared flash, and a tracking device equipped with retro-reflective markers. It is robust and portable, making it a good candidate for virtual reality applications. Cyclope was demonstrated at SIGGRAPH'2005.
The motivation of the works in this section is to propose generic approaches for camera calibration and structure-from-motion, that would work for any type of camera. To do so, we first adopt a simple but general camera model that consists in representing one projection ray in 3D per image pixel. This model comprises most camera types used in computer vision, e.g. pinhole (with or without radial or tangential distortion), catadioptric cameras (central or non-central), pushbroom, non-central mosaics, multi-camera systems, etc.
We proposed a generic calibration algorithm for our general camera model that requires three images of a calibration grid, taken from unknown positions. Four different algorithms are proposed, working for 3D or planar calibration grids and for fully non-central or central cameras.
This first work only allowed calibrating the regions of the image where the projections of the calibration grids overlap. We then showed how to calibrate the whole camera by integrating multiple images of calibration grids.
We also address the calibration of an intermediate case between central (single viewpoint) and fully non-central camera models: so-called axial cameras are non-central, but there exists a line in 3D that cuts all projection rays. This is the case for pushbroom cameras, misaligned catadioptric cameras and possibly for fisheye lenses.
We recently proposed a first self-calibration approach for the general camera model introduced in the previous paragraph. This is a rather difficult problem and our first contribution thus considers the special case of central cameras (but with otherwise unconstrained projection rays) and relies on specific camera motions (pure translations or rotations).
We propose a unified treatment of the structure-from-motion problem by giving generic algorithms for the basic tasks of pose estimation, motion estimation and 3D point triangulation. Like our calibration approach (see above), these algorithms are applicable to any camera type; the only distinction that may have to be made is between central and non-central cameras.
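Because the general camera model represents every pixel by a projection ray, 3D point triangulation reduces to finding the point closest to two (or more) rays. The sketch below uses the simple midpoint method for two rays; it is a stand-in chosen for brevity, not necessarily the algorithm of the work described above, and all names are hypothetical.

```python
def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def triangulate_midpoint(c1, d1, c2, d2, eps=1e-12):
    """Midpoint of the shortest segment joining rays c1 + s*d1 and c2 + t*d2."""
    w = [p - q for p, q in zip(c1, c2)]
    a, b, c = dot(d1, d1), dot(d1, d2), dot(d2, d2)
    d, e = dot(d1, w), dot(d2, w)
    denom = a * c - b * b               # zero when the rays are parallel
    if abs(denom) < eps:
        raise ValueError("rays are (nearly) parallel")
    s = (b * e - c * d) / denom         # parameter of the closest point on ray 1
    t = (a * e - b * d) / denom         # parameter of the closest point on ray 2
    p1 = [ci + s * di for ci, di in zip(c1, d1)]
    p2 = [ci + t * di for ci, di in zip(c2, d2)]
    return [(u + v) / 2.0 for u, v in zip(p1, p2)]


# Two rays that both pass through the point (1, 2, 5)
print(triangulate_midpoint([0, 0, 0], [1, 2, 5], [2, 0, 0], [-1, 2, 5]))
```

Nothing in this computation depends on how the rays were obtained, which is the sense in which such algorithms apply to pinhole, catadioptric or multi-camera systems alike.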
We present a multi-view geometry for our general camera model, embodied by multi-view tensors and matching constraints. As with perspective cameras, constraints for up to 4 views at a time exist. The obtained matching tensors operate on Plücker coordinates of projection rays rather than on homogeneous coordinates of image points as in the classical case; since Plücker coordinates are 6-vectors, the tensors have six entries along each index (for example, the two-view tensor is a 6 × 6 matrix).
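As a small illustration of Plücker coordinates, the sketch below builds the 6-vector (d, m) of a ray, with direction d and moment m = p × d, and evaluates the reciprocal product of two rays, which vanishes exactly when the rays are coplanar (intersecting or parallel). This is standard line geometry rather than the tensors of the work described above; the function names are hypothetical.

```python
def cross(a, b):
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def plucker(point, direction):
    """Plücker coordinates (d, m) of the line through `point` along `direction`."""
    return tuple(direction) + cross(point, direction)


def reciprocal_product(l1, l2):
    """d1 . m2 + d2 . m1, which is zero if and only if the two lines are coplanar."""
    return dot(l1[:3], l2[3:]) + dot(l2[:3], l1[3:])


ray_a = plucker((0, 0, 0), (1, 2, 5))
ray_b = plucker((2, 0, 0), (-1, 2, 5))   # meets ray_a at (1, 2, 5)
ray_c = plucker((2, 0, 0), (0, 1, 0))    # skew with respect to ray_a
print(reciprocal_product(ray_a, ray_b), reciprocal_product(ray_a, ray_c))  # 0 10
```

The reciprocal product is a bilinear form on two 6-vectors; when the two rays are expressed in different camera frames, the rigid motion between the frames is absorbed into a 6 × 6 matrix, which plays the role of the two-view matching tensor.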
This work addresses a camera model lying between the classical parametric models and the general non-parametric model used in the previous paragraphs: cameras with arbitrary radial distortion. This model is still sufficiently general for most typical cameras, including pinhole cameras with moderate distortions, most catadioptric cameras and fisheyes. We propose calibration algorithms that use a planar calibration grid (a flat screen is used in practice). The algorithms already work with a single image, and stable results were achieved with only a few images. Examples of distortion corrections using the calibration results are shown in figure .
We propose closed-form solutions for camera calibration using images of a linear object, e.g. a stick with markers attached (``wand''). Further, we reveal the critical, or degenerate, motions that cause the calibration to fail and in the vicinity of which the calibration will be unstable.
We propose and evaluate several approaches for providing initial solutions for multi-camera calibration that can then be refined by bundle adjustment. Calibration is done using one, two or three rigidly coupled wands (sticks with markers attached) and by exploiting the available geometrical constraints in different manners.
Calibrating a multiple-camera environment using image silhouettes is an interesting alternative to the usually tedious standard calibration procedures, which require specific patterns such as LEDs or calibration sticks. A person moving in the acquisition space is then sufficient to update the camera parameters. We have studied why, and how, a set of image silhouettes gives constraints on such parameters, and we have derived from this a practical approach for the calibration of sets of cameras.
Many vision tasks need rotating and zooming cameras. We investigated the case of a fixed panoramic camera that is coupled to a pan-tilt-zoom camera and we derived the equations governing the link between the epipolar geometry and the kinematics of the pan-tilt mechanism. We developed a practical method that is well suited for visual surveillance systems that use camera cooperation. This work was done by Radu Horaud, David Knossow, and Markus Michaelis (Plettac Electronics).
One of our ambitious research projects is to solve the problem of human motion capture from several cameras and without using body markers. Whenever a 3D object is described with a set of smooth surfaces, it projects onto the image as an extremal contour. We developed a parameterization of such a contour point as a function of the kinematic description of an articulated chain. This has been applied to the problem of tracking human motion from its observed image contours. Cylindrical elliptic cones are smooth surfaces used to model body parts. They project onto images as straight lines, whose motion can be obtained from body part motions in closed-form. The Jacobian of this parameterization is used within an efficient and general tracker, figure and . The work is part of David Knossow's PhD thesis under the supervision of Remi Ronfard, and with the collaboration of Radu Horaud and Frederic Devernay.
We developed a method for matching 3D points (e.g. stereo data) with articulated shapes. The latter are represented by an articulated implicit surface that models local shape deformations such as bending. The method was applied to the problem of tracking the motions of a hand with 27 degrees of freedom. The tracker can cope with bad data and with situations where the shape rapidly turns away from the viewer, figure . The work was carried out within the PhD of Guillaume Dewaele under the supervision of Radu Horaud and Frederic Devernay.
Motion analysis methods in Computer Vision usually deal with tracking rigid, articulated or deformable objects in a scene, but little or no work has been done on the analysis of 3D deformable surfaces, such as cloth, the surface of a liquid, or the surface of a plastic solid subject to various deformations. With the recent availability of high-resolution and high-dynamic cameras, it is now possible to perceive very subtle changes in the images, and to extract information about the 3D deformations of these surfaces, especially when several viewpoints of the surface are available. There are a few methods specialized in the reconstruction of continuous surfaces from several viewpoints, which take into account constraints on the regularity of the observed surface, such as PDE-based stereo methods. However, 3D reconstruction data of a surface at each time instant does not contain information about the 3D motion of each individual point on the surface. What we get is a 2D manifold in a 3D space (i.e. a surface reconstruction), whereas we may also want a 3D vector field (the motion field) defined on that manifold. A testbed application for the new methods we are developing is the ACI Masse de données GEOLSTEREO.
The problem of matching 3D point sets is a central one in computer vision. We developed a method that treats this problem as a point-classification problem: data points are assigned to one or several model points, the latter being viewed as classes of points. In practice the matching is carried out using a robust EM algorithm. The work was carried out by Guillaume Dewaele, Radu Horaud, Frederic Devernay, and Florence Forbes (MISTIS group), and a paper was submitted to ECCV'06.
We modeled human bodies with articulated implicit surfaces and we developed a method for finding the pose parameters of this shape from 3D data. The latter are obtained from multiple silhouettes using visual-hull extraction and embed both positions and orientations. We defined a new data-to-implicit-surface sum-of-squares distance that is minimized over the pose parameters.
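The classification view of matching can be illustrated with a small sketch. The code below is not the method submitted to ECCV'06; it is a minimal 2D rigid example, assuming isotropic Gaussian classes centred on the transformed model points plus a constant-weight outlier class. The function name `em_rigid_match_2d` and the parameters `sigma` and `outlier_weight` are hypothetical.

```python
import math

def em_rigid_match_2d(data, model, sigma=0.5, outlier_weight=0.01, iterations=20):
    """Estimate a 2D rotation and translation mapping model points onto data points.

    Each model point is a class; each data point is softly assigned to the
    classes (E-step), then the rigid motion is re-estimated from the weighted
    pairs (M-step). Returns (theta, tx, ty, responsibilities), where the
    responsibilities are those of the last E-step.
    """
    theta, tx, ty = 0.0, 0.0, 0.0
    resp = []
    for _ in range(iterations):
        c, s = math.cos(theta), math.sin(theta)
        moved = [(c * mx - s * my + tx, s * mx + c * my + ty) for mx, my in model]

        # E-step: posterior probability that data point i belongs to class j.
        # The constant outlier term absorbs data points far from every class.
        resp = []
        for dx, dy in data:
            w = [math.exp(-((dx - px) ** 2 + (dy - py) ** 2) / (2 * sigma ** 2))
                 for px, py in moved]
            total = sum(w) + outlier_weight
            resp.append([wj / total for wj in w])

        # M-step: weighted 2D Procrustes alignment in closed form.
        pairs = [(a, d, m) for d, row in zip(data, resp)
                 for a, m in zip(row, model) if a > 0.0]
        weight = sum(a for a, _, _ in pairs)
        if weight == 0.0:
            break  # every data point was explained by the outlier class
        dbx = sum(a * d[0] for a, d, _ in pairs) / weight
        dby = sum(a * d[1] for a, d, _ in pairs) / weight
        mbx = sum(a * m[0] for a, _, m in pairs) / weight
        mby = sum(a * m[1] for a, _, m in pairs) / weight
        dot = sum(a * ((m[0] - mbx) * (d[0] - dbx) + (m[1] - mby) * (d[1] - dby))
                  for a, d, m in pairs)
        cross = sum(a * ((m[0] - mbx) * (d[1] - dby) - (m[1] - mby) * (d[0] - dbx))
                    for a, d, m in pairs)
        theta = math.atan2(cross, dot)
        c, s = math.cos(theta), math.sin(theta)
        tx = dbx - (c * mbx - s * mby)
        ty = dby - (s * mbx + c * mby)
    return theta, tx, ty, resp
```

With a fixed `sigma` the sketch only converges from a reasonable initial alignment; a practical tracker would typically also re-estimate or anneal the variance.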
One important ingredient of any motion capture and tracking method is the description of the human body. We developed a method and a software package that considers a kinematic representation with 19 body parts and 54 degrees of freedom, as well as a representation of each body-part as a 3D shape, i.e. a truncated elliptical cone. The fixed parameters of both the kinematic model and the 3D shapes are estimated using an interactive procedure as well as bundle adjustment techniques. The work is carried out by David Knossow, Loic Lefort, and Remi Ronfard.
Action recognition is a new research topic for MOVI. We are investigating the use of volumetric reconstructions from multiple cameras for the purpose of recognizing primitive actions such as crossing arms, waving, standing up, sitting down, etc. To that end, we derived motion descriptors based on the cylindrical Fourier decomposition of a new representation called the Motion History Volume. The descriptors were shown to be useful for classifying and segmenting human action. This is a promising avenue of research which we will continue to explore in 2006.
We propose a Bayesian approach for dense 3D surface modelling from calibrated input images. The surface is represented by the union of multiple depth maps, one per image, and associated color maps for the estimated surface color. Occlusions are handled by hidden visibility variables, which tell how likely it is that a given 3D point is visible in a given input image. A prior on 3D surfaces that allows strong discontinuities is used, and the likelihood of the estimated model is based on photoconsistency. The MAP (maximum a posteriori) estimate of the 3D surface is obtained using an EM algorithm or direct gradient descent. Good results have been obtained for scenes with different degrees of difficulty.
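As an illustration only, a motion history volume can be maintained with the same recursion as the classical 2D motion history image, applied per voxel. The update rule below is assumed by analogy with that 2D recipe and is not taken from this report; the name `update_mhv` and the flat-list voxel layout are hypothetical, and the Fourier descriptor step is not shown.

```python
def update_mhv(mhv, occupied, tau):
    """Return the motion history volume after one new occupancy frame.

    mhv      : list of per-voxel history values from the previous frame
    occupied : list of booleans (or 0/1), True where the voxel is occupied now
    tau      : history duration in frames; recently occupied voxels hold tau,
               older ones decay by one per frame down to zero
    """
    if len(mhv) != len(occupied):
        raise ValueError("mhv and occupied must have the same number of voxels")
    return [tau if occ else max(0, value - 1) for value, occ in zip(mhv, occupied)]
```

Feeding successive occupancy grids through `update_mhv` yields a single volume in which larger values mark more recent motion.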
We have developed two main approaches for reconstructing specular surfaces by acquiring images of reflections of a calibration object. In the first, both the camera and the calibration object had to be placed at different known positions. A voxel-coloring type of approach was then used to reconstruct points on the specular surface and the associated normals.
In the second, we considerably relaxed the requirements: the camera remains static and only the calibration object is moved to different, unknown positions. We developed different approaches for recovering these positions; once they are known, points on the specular surface are determined by triangulation. In practice, a flat screen is used as the calibration object; by displaying a series of appropriate Gray codes, dense point correspondences, and thus dense reconstructions of the specular surface, are obtained. An example is shown in figure .
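The Gray-code step can be sketched as follows. This is a generic binary-reflected Gray code example, not the team's acquisition software: each screen column is encoded over a sequence of black/white stripe patterns, and the sequence of bits observed at a camera pixel identifies the screen column it sees. The function names are hypothetical, and thresholding the images into bits is assumed to have been done already.

```python
def gray_encode(n):
    """Binary-reflected Gray code of a non-negative integer."""
    return n ^ (n >> 1)

def gray_decode(g):
    """Inverse of gray_encode."""
    n = 0
    while g:
        n ^= g
        g >>= 1
    return n

def stripe_patterns(width, bits):
    """patterns[k][x] is 1 if screen column x is white in pattern k (MSB first)."""
    return [[(gray_encode(x) >> (bits - 1 - k)) & 1 for x in range(width)]
            for k in range(bits)]

def decode_pixel(observed_bits):
    """Recover the screen column from the bits seen at one camera pixel (MSB first)."""
    g = 0
    for bit in observed_bits:
        g = (g << 1) | bit
    return gray_decode(g)
```

Adjacent columns differ in exactly one pattern, so a single misread bit at a stripe boundary shifts the decoded column by at most one position. A second set of horizontal patterns would give the screen row in the same way.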
Assume we are given several image silhouettes of an object corresponding to different camera viewpoints. The visual hull is the maximal solid shape consistent with the object silhouettes. Such an approximation of the object captures all the geometric information available from the silhouettes. Visual hulls are often used to provide robust 3D shape models in various modeling applications. In the past, we have studied how to compute them efficiently and we have proposed quasi-exact and exact solutions that allow complex scenes to be modelled with optimal complexity in time and space.
Some applications such as tele-immersion require real-time modeling to allow for interactions between real and virtual worlds. We have developed real-time solutions for visual hull computations. These solutions are based on both parallelization and distribution schemes over a cluster of PCs, and are therefore scalable. Our current implementation allows models to be computed at 30 frames per second using 8-10 cameras.
Many computer vision applications are based on silhouettes extracted using background subtraction methods. These methods are usually monocular and do not account for possibly redundant information in several cameras. As a result, they are very sensitive to noise that perturbs the background model. One way to improve the robustness of silhouette extraction in multiple-camera environments is to integrate all the 2D image silhouette probabilities in a 3D probabilistic grid. Fusing the information from several images allows errors that are inconsistent over the image set to be removed. We have developed such a method in a Bayesian framework.
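The definition of the visual hull can be made concrete with a brute-force voxel sketch: a voxel is kept only if it projects inside the silhouette in every view. This is a didactic approximation, not the quasi-exact or exact algorithms mentioned above. The 3x4 projection matrices, binary silhouette images stored as nested lists, and all function names are assumptions of the example.

```python
from itertools import product

def voxel_grid(n):
    """Integer voxel centres of an n x n x n grid."""
    return list(product(range(n), repeat=3))

def project(P, X):
    """Apply a 3x4 projection matrix P (nested lists) to a 3D point X."""
    x, y, z = X
    u, v, w = (row[0] * x + row[1] * y + row[2] * z + row[3] for row in P)
    if w <= 0:
        return None  # point at or behind the camera plane
    return u / w, v / w

def in_silhouette(silhouette, pixel):
    """True if the projected pixel falls on a foreground (value 1) pixel."""
    if pixel is None:
        return False
    col, row = round(pixel[0]), round(pixel[1])
    inside = 0 <= row < len(silhouette) and 0 <= col < len(silhouette[0])
    return inside and silhouette[row][col] == 1

def carve(voxels, cameras, silhouettes):
    """Keep the voxels whose projection lies inside every silhouette."""
    return [X for X in voxels
            if all(in_silhouette(sil, project(P, X))
                   for P, sil in zip(cameras, silhouettes))]
```

The cost grows with the number of voxels times the number of views, which is one reason surface-based and distributed formulations are attractive for real-time use.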
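The idea of fusing per-image silhouette probabilities in a 3D grid can be sketched for a single voxel. The sensor model below is a deliberately simplified assumption, not the model developed by the team: each view is treated as independent, with a detection rate `p_detect` and a false-alarm rate `p_false_alarm` (both hypothetical parameters), and the influence of other voxels along the same viewing ray is ignored.

```python
def occupancy_posterior(silhouette_probs, prior=0.5, p_detect=0.9, p_false_alarm=0.1):
    """Posterior probability that a voxel is occupied.

    silhouette_probs : foreground probabilities, one per camera, read at the
                       pixel where the voxel projects in that camera
    prior            : prior occupancy probability of the voxel
    p_detect         : assumed P(pixel is foreground | voxel occupied)
    p_false_alarm    : assumed P(pixel is foreground | voxel empty)
    """
    like_occupied = 1.0
    like_empty = 1.0
    for s in silhouette_probs:
        like_occupied *= p_detect * s + (1.0 - p_detect) * (1.0 - s)
        like_empty *= p_false_alarm * s + (1.0 - p_false_alarm) * (1.0 - s)
    numerator = prior * like_occupied
    denominator = numerator + (1.0 - prior) * like_empty
    return numerator / denominator if denominator > 0.0 else prior
```

Under these assumptions a single noisy view is outvoted by the others: with three views reporting 0.9 and one reporting 0.1, the voxel is still judged very likely occupied.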
Target tracking with a single camera for air-ground missile guidance. 36 months (2003-2006). 30,000 euros and the salary of a PhD student (Aude Jacquot).
TOSA develops semi-automatic and automatic tools for air-to-ground missile guidance systems. Combat aircraft are equipped with pan-tilt-yaw infrared cameras coupled with a laser beam, and the pilot has two tasks: first, to designate a target in the image and second, to keep this target within the field of view of the camera. The difficulties lie both in finding a small target in a low-resolution image and in keeping track of this target in the presence of large aircraft motions (rotations, forward translation, etc.). When the aircraft flies at 300 meters/second, the target first appears as a blob and after only a few tenths of a second it changes its appearance to become a complex 3D structure.
The scientific and technological objectives are the following: to develop tools enabling (i) off-line modelling of complex 3D man-made objects from a collection of images (geometric modelling as well as the aspect of the objects, such as color and texture), (ii) automatic identification of the target, and (iii) automatic tracking of the target such that large changes are taken into account.
Detection and classification of objects which are ahead of a vehicle. 36 months (2004-2007). 50,000 euros and the salary of a PhD student (Julien Morat).
In June 2004 we started a 3-year collaboration with the French car manufacturer Renault SA (Direction de la Recherche). Within this collaboration Renault co-funds a PhD thesis with ANRT. The topic of the collaboration and of the thesis is the detection and classification of obstacles which are ahead of a vehicle. We are currently developing a prototype system based on stereoscopic vision with the following functionalities: low speed following, pre-crash, and pedestrian detection. In particular we study the robustness of the image processing algorithms with respect to camera/stereo calibration problems (the system should be able to self-detect such problems).
OCETRE is a 2-year exploratory RNTL project granted by the French Ministry of Research. The project started in January 2004. The scientific goal of the project is to develop methods and techniques for recovering, in real-time, the geometry of a complex scene such as a scene composed of both static and dynamic objects (for example, people moving around). We will combine methods based on stereo with methods based on visual hull reconstruction from silhouettes. One original contribution of the project will be to combine dense depth data (gathered with stereo) with visual hulls.
We are developing a camera setup composed of one color camera and two black-and-white cameras. These three cameras are linked to a PC and deliver synchronized videos at 30 frames per second. Moreover, several such setups will be deployed and synchronized using a PC cluster.
The industrial collaborators (Total-Immersion SA and Thales Training and Simulation SA) are interested in developing real-time augmented reality applications using our methods. Since the moving objects are reconstructed in real-time, it will be possible to treat them as graphical objects and therefore mix real and virtual objects in a realistic manner, i.e., in 3D space, thus taking into account their interactions and mutual occlusions, and not in image space as is currently done by many augmented-reality systems.
Along with Université de Rennes, MOVI is one of two scientific contributors to the SEMOCAP project funded by the CNC and Ministère de la Recherche et de l'Industrie as part of the RIAM network (Recherche et Innovation en Audiovisuel et Multimedia) which was started in January 2004 for two years. The goal of the project is to build a low-cost system for human motion capture without markers, using multi-view video analysis and biomechanical motion models. As part of the project, we have built a prototype motion capture system using the GRIMAGE infrastructure, which has been demonstrated for the first time in December in Rennes. Evaluation is being performed against ground truth data collected using a marker-based Vicon system.
MOVI participates in the project ParkNav, in the framework of the ROBEA program. The other partners are eMOTION and PRIMA (INRIA Rhône-Alpes), VISTA (IRISA Rennes) and RIA (LAAS, Toulouse). The project is about the interpretation of complex dynamic scenes and reactive motion planning in such scenes.
MOVI is part of the project Cyber-II (ACI Masses de Données), with an extended consortium (ARTIS, MOVI, APACHE, LIRIS Lyon). Started late in 2003, Cyber-II follows the project Cyber (ACI Jeunes Chercheurs) initiated in 2000. Research on various topics will be carried out. For MOVI, this will concern real-time 3D reconstruction, recovery of surface reflectance properties, and virtual relighting of scenes.
In September 2004 we started a 3-year collaboration with the Géosciences Azur laboratory (UMR 6526). This collaboration received funding from the French Ministry of Research through the ACI Masses de données program (Action Concertée Incitative).
Mathematical modelling as well as simulation and visualization tools are widely used in order to understand, predict, and manage geological phenomena. These simulation tools cannot fully take into account the complexity of natural catastrophes such as surface earthquakes occurring at the level of the sedimentary layer, landslides, etc. Within this project we plan to study and develop measurement methods based on computer vision techniques. The physical model consists of a mock-up of the geological object to be studied. Existing techniques make it possible to reproduce, at the mock-up scale, the influence of several hundred years of the Earth's gravitational field. We plan to observe such a mock-up with a high-resolution stereoscopic camera pair and to apply dense stereo reconstruction techniques in order to study the 3D deformations over time. In particular, the expected accuracy of the planned measurements is of the order of 10 m which corresponds to an actual amplitude of a few centimeters.
This ARC is concerned with the representation of 3D objects, which plays a central role in various domains such as computer graphics or computer vision. Different disciplines use different representations, and conversion between these representations appears to be a challenging issue with an impact over a wide class of disciplines. To address this issue, this ARC connects participants having skills in various disciplines (3D acquisition, 3D reconstruction, Digital Geometry Processing, Numerical Analysis and Computer Graphics). The MOVI team is concerned with the acquisition and reconstruction part of the project.
In December 2005, Remi Ronfard signed an agreement with Grenoble Alpes Incubation on a spin-off project, nicknamed SEE-NUSH, with plans to create a startup company in 2006-07 to industrialize and commercialize the multiple-camera software developed by MOVI into products implementing virtual cameras and virtual stages. The targeted applications are in 3D cinematography for sports, performing arts, medicine, games and motion pictures.
Holonics is a European 3-year project which started on September 1, 2004. We have three industrial partners: EPTRON, coordinator (Spain), Holografika (Hungary), and Total-Immersion (France). The general scientific and technological challenge of the project is to achieve realistic virtual representations of humans through two complementary technologies: (i) multi-camera based acquisition of human data and of human actions and gestures, and (ii) visualization of these complex representations using modern 3D holographic display devices.
Our team will develop a real-time multi-camera and multi-PC system. The developments will be based on 3D reconstruction methods based on silhouettes and on visual hulls as well as on human-motion capture methods and action and gesture recognition.
Visitor is a 4-year European project (2004-2008) under the Marie-Curie actions for young researcher mobility – Early Stage Training or EST. Within these actions, the GRAVIR laboratory has been selected to host PhD students granted by the European Commission. The MOVI team, which is part of the GRAVIR laboratory, actively participated in the project elaboration. The MOVI team is currently coordinating this project and hosts two PhD students from this program.
VISIONTRAIN is a 4-year Marie Curie Research Training Network, or RTN (2005-2009). This network gathers 11 partners from 11 European countries and has the ambition to address foundational issues in computational and cognitive vision systems through a European doctoral and post-doctoral program.
VISIONTRAIN addresses the problem of understanding vision from both computational and cognitive points of view. The research approach will be based on formal mathematical models and on the thorough experimental validation of these models. We intend to reduce the gap that exists today between biological vision (which performs outstandingly well and fast but is not yet understood) and computer vision (whose robustness, flexibility, and autonomy remain to be demonstrated). In order to achieve these ambitious goals, 11 internationally recognized academic partners plan to work cooperatively on a number of targeted research topics: computational theories and methods for low-level vision, motion understanding from image sequences, learning and recognition of shapes, categories, and actions, cognitive modelling of the action of seeing, and functional imaging for observing and modelling brain activity. There will be three categories of researchers involved in this network: doctoral students, post-doctoral researchers, and highly experienced researchers. The work program will include participation in proof-of-concept achievements, annual thematic schools, industrial meetings, attendance of conferences, etc.
In January, we started an associate team with Brown University (Providence, Rhode Island, USA) on the theme of 3D cinematography. 3D cinematography, also sometimes called free-viewpoint video, is the process of building 3D models of dynamic scenes from multiple video camera inputs. Our teams at INRIA and Brown University have separately developed methods and tools for 3D photography, but moving to 3D cinematography raises difficult technological and scientific challenges, where we want to collaborate more actively. In this new application, the processing pipeline must be fully automated, so that 3D reconstruction can be performed at video frame rates. Implementing fast and robust vision algorithms using inputs from multiple streams is a difficult subject in itself. Generating new viewpoints from a limited number of cameras is another issue. Most existing methods necessitate the storage of the video streams from all cameras in full resolution, because they interpolate the new viewpoint from all available viewpoints at the same time frame. This is not a realistic scenario, because high quality can only be obtained by increasing the number of cameras. If we can reconstruct deforming surfaces with consistent texture coordinates over time, we can instead generate viewpoints by interpolating over multiple time frames. The goal of the associate team is to contribute to that effort by joint work in two crucial areas – real-time video processing for large networks of camera systems, using distributed/parallel and embedded architectures and real-time mesh processing for reconstruction, deformation, texturing and view interpolation of 3D objects from multiple video, using multi-resolution models amenable to efficient, scalable compression.
We had one week-long start-up meeting in Grenoble with Gabriel Taubin and Chad Jenkins in June, and several follow-up meetings in Providence over the summer and fall of 2005 to set up multiple collaborations between our researchers and PhD students. We have many active joint projects for 2006. In particular, Gabriel Taubin and Remi Ronfard will co-organize an international workshop on 3D cinematography in New York City on June 22, 2006, which will be attended by all associate team members.
Radu Horaud is a member of the editorial boards of the International Journal of Robotics Research and of the International Journal of Computer Vision, as well as area editor of Computer Vision and Image Understanding.
Peter Sturm has co-organized the ISPRS/IEEE Workshop Towards Benchmarking Automated Calibration, Orientation and Surface Reconstruction from Images (BenCOS), held in conjunction with ICCV, Beijing, China, October 2005.
Peter Sturm was a member of the Program Committees of:
ICCV'05 (International Conference on Computer Vision, Beijing),
CVPR'05 (IEEE International Conference on Computer Vision and Pattern Recognition, San Diego, USA),
ICIP'05 (IEEE International Conference on Image Processing, Genova, Italy),
BMVC'05 (British Machine Vision Conference, Oxford, UK),
DV'05 (Workshop on Dynamical Models for Computer Vision, held in conjunction with ICCV, Beijing, China),
OMNIVIS'05 (Workshop on Omnidirectional Vision, Camera Networks and Non-classical Cameras, held in conjunction with ICCV, Beijing, China),
ORASIS'05 (Journées Jeunes Chercheurs, Fournol, France)
and a reviewer for SIGGRAPH'05.
Edmond Boyer was Area Chair of the British Machine Vision Conference.
Edmond Boyer was a member of the Program Committees of: ICCV'05, BMVC'05, ICIP'05, ICASSP'05 and ORASIS'05.
Remi Ronfard was co-organizer of the IEEE workshop on Modeling People and Human Interaction (PHI'05), Beijing, October 15, 2005.
Remi Ronfard was a reviewer for the SIGGRAPH 2005 and ICIP 2005 conferences.
Radu Horaud is in charge of European coordination at INRIA Rhône-Alpes.
Radu Horaud is a member of the SPECIF award committee for the period 2003-2005.
Peter Sturm is Co-Chairman of the Working Group ``Image Orientation'' of the ISPRS (International Society for Photogrammetry and Remote Sensing), for the period 2004-2008.
Peter Sturm is a member of the INRIA Committee on ``Actions Incitatives'' (part of COST – Conseil d'Orientation Scientifique et Technologique).
Remi Ronfard is a member of the "Commission de Specialistes" for recruitments at the University Joseph Fourier of Grenoble.
Edmond Boyer is a member of the IMAG (Institut d'Informatique et de Mathématiques Appliquées de Grenoble) Scientific Committee.
Edmond Boyer is a member of the "Commission de specialistes" for recruitments at the University Joseph Fourier of Grenoble and at the Institut National Polytechnique de Grenoble.
Edmond Boyer is coordinator of the Marie-Curie Visitor Project and a member of the Visitor Scientific Committee.
Radu Horaud is the coordinator of the Visiontrain Marie Curie Research Training Network.
3D Computer Vision, postgraduate course, University of Zaragoza, Spain, 20h, P. Sturm.
Analyse d'images, DESS informatique, Univ. Joseph Fourier, 30h, R. Ronfard.
Optimisation, M2R IVR, INPG, 6h, P. Sturm.
Vision 3D, M2R IVR, INPG, 12h, P. Sturm.
Géométrie projective, M2R IVR, INPG, 6h, E. Boyer.
Stereoscopic Perception, Mastère Photogrammétrie Numérique, ENSG, Marne-la-Vallée, 17h, F. Devernay.
Computer Vision, Mastère 2 Pro IICAO, Université Joseph Fourier, Grenoble, 24h, F. Devernay.
Peter Sturm gave invited talks at:
CREA Lab, Amiens, France, November 2005.
National Lab for Pattern Recognition, Beijing, China, October 2005.
National Technical University Athens, Greece, April 2005.
Remi Ronfard gave an invited talk at Brown University, July 2005, on Human motion capture and action recognition for 3D cinematography.
Remi Ronfard, Herve Mathieu and Florian Geffray presented a live demonstration of 3D cinematography to 30 students of the Ecole Nationale Superieure des Arts du Theatre (Lyon, December 1).
Jean-Sébastien Franco and Guillaume Dewaele defend their PhD theses in December 2005.
Peter Sturm acted as reviewer for the following PhD theses:
John Mallon, Dublin City University, Ireland.
Diego Ortín Trasobares, University of Zaragoza, Spain.
Peter Sturm acted as examiner for the following theses:
Srikumar Ramalingam, University of California, Santa Cruz (qualifying examination).
Neil Birkbeck, University of Alberta, Canada, (MSc examination).
Sio-Hoï Ieng, Université Pierre et Marie Curie, Paris (PhD).
The following PhD students did internships with Peter Sturm:
Kiyoung Kim, PhD student at Gwangju Institute of Science and Technology (South Korea), 2005/06 (6 months).
Lazaros Grammatikopoulos, PhD student at National Technical University Athens (Greece), 2005 (10 days).
Diego Aguilera, PhD student at Universidad de Salamanca (Spain), 2005 (3 months).
Jean-Philippe Tardif, PhD student at Université de Montréal, 2005.