One of the consequences of the increasing ease of use and significant cost reduction of computer systems is the production and exchange of more and more digital and multimedia documents. These documents are fundamentally heterogeneous in structure and content as they usually contain text, images, graphics, video and sounds.
Information retrieval can no longer rely on text-based queries alone; it will have to be multi-modal and to integrate all the aspects of the multimedia content. In particular, the visual content has a major role and represents a central vector for the transmission of information. The description of that content by means of image analysis techniques is less subjective than the usual keyword-based annotations, whenever they exist. Moreover, being independent from the query language, the description of visual content is becoming paramount for the efficient exploration of a multimedia stream.
In the IMEDIA group we focus on the intelligent access by visual content. With this goal in mind, we develop methods that address key issues such as content-based indexing, interactive search and image database navigation, in the context of multimedia content.
Content-based image retrieval systems provide help for automatic search and assist human decisions. The user remains in charge, the only one able to take the final decision. The numerous research activities in this field during the last decade have proved that retrieval based on visual content is feasible. Nevertheless, current practice shows that a usability gap remains between the designers of these techniques and methods and their potential users.
One of the main goals of our research group is to reduce the gap between real usages and the functionalities resulting from our research on visual content-based information retrieval. We therefore strive to design methods and techniques that can address realistic scenarios, which often lead to exciting methodological challenges.
Among the "usage" objectives, an important one is the ability, for the user, to express his specific visual interest for a part of a picture. It allows him to better target his intention and to formulate it more accurately. Another goal in the same spirit is to express subjective preferences and to provide the system with the ability to learn those preferences. When dealing with any of these issues, we keep in mind the importance of the scalability of such interactive systems in terms of indexing and response times. Of course, the acceptable values of these times and how critical they are depend heavily on the domain (specific or generic) and on the cost of errors.
Our research work is then at the intersection of several scientific specialities. The main ones are image analysis, pattern recognition, statistical learning, human-machine interaction and database systems. It is structured into the following main themes:
Image indexing: this part mainly concerns modeling the visual aspect of images, by means of image analysis techniques. It leads to the design of image signatures that can then be obtained automatically.
Clustering and statistical learning: generic and fundamental methods for solving problems of pattern recognition, which are central in the context of image indexing.
Interactive search and personalization: to let the system take into account the preferences of the user, who usually expresses subjective or high-level semantic queries.
Cross-media indexing, and in particular bimodal text + image indexing, which addresses the challenge of combining those two media for more efficient indexing and retrieval.
More generally, the research work and the academic and industrial collaborations of the IMEDIA team aim to answer the complex problem of the intelligent access to multimedia content.
The final CHORUS conference identified cross-disciplinary challenges and recommendations in the domain of search engine technology. It was a great success: in addition to high representatives of the European Commission, the conference was attended by major industrial (e.g. Yahoo!, Thomson, Philips, Exalead, etc.) and academic stakeholders of the search engine community (including representatives from North America and Japan).
Pl@ntNet project: beginning of the Pl@ntNet project “Plant Computational Identification & Collaborative Information System”
http://
- Developing cutting-edge transdisciplinary research at the frontier between integrative systematics and computational sciences, based on the exploitation of large datasets, knowledge and expertise on plant morphology, anatomy, taxonomy, ecology, biogeography and uses.
- Providing free, easy-access software tools and methods for plant identification and for the collection, management, sharing and exploitation of botanical data.
- Promoting citizen science as a powerful means to enrich databases with new information on plants and to meet the need for capacity building in agronomy, botany and ecology.
SHREC 2009 - Content-based retrieval of 3D generic models: our 3D alignment method, coupled with one of our 2D/3D descriptors, the MDLA approach, ranked first in the SHREC 2009 Generic Shape Retrieval Contest with respect to the precision-recall measures.
We group the existing problems in the domain of content-based image indexing and retrieval in the following themes: image indexing, pattern recognition, personalisation and cross-media indexing. In the following we give a short introduction to each of these themes.
Image indexing: the process of extracting from a document (here a picture) compact, structured and significant visual features that will be used and compared during the interactive search.
The goal of the IMEDIA team is to provide users with the ability to perform content-based search in image databases in a way that is both intelligent and intuitive. When formulated in concrete terms, this problem gives rise to several mathematical and algorithmic challenges.
To represent the content of an image, we are looking for a representation that is compact (less data and more semantics), relevant (with respect to the visual content and the users) and fast to compute and compare. The choice of the feature space consists in selecting the significant features, the descriptors for those features and, finally, the encoding of those descriptors as image signatures.
We deal both with generic databases, in which images are heterogeneous (for instance, search of Internet images), and with specific databases, dedicated to a specific application field. The specific databases are usually provided with a ground truth and have homogeneous content (faces, medical images, fingerprints, etc.).
Note that for specific databases one can develop dedicated and optimal features for the application considered (face recognition, etc.). In contrast, generic databases require generic features (colour, texture, shape, etc.).
We must not only distinguish generic and specific signatures, but also local and global ones. They correspond respectively to queries concerning parts of pictures or entire pictures. In this case, we can again distinguish approximate and precise queries. In the latter case one has to be provided with various descriptions of parts of images, as well as with means to specify them as regions of interest. In particular, we have to define both global and local similarity measures.
When the computation of signatures is over, the image database is finally encoded as a set of points in a high-dimensional space: the feature space.
A second step in the construction of the index can be valuable when dealing with very high-dimensional feature spaces. It consists in pre-structuring the set of signatures and storing it efficiently, in order to reduce access time for future queries (a trade-off between access time and storage cost). In this second step, we have to address problems that have been dealt with for some time in the database community, but which arise here in a new context: image databases. The diversity of the feature spaces we deal with forces us to design specific methods for structuring each of these spaces.
Statistical learning and classification methods are of central interest for content-based image retrieval .
We consider here both supervised and unsupervised methods. Depending on our knowledge of the contents of a database, we may or may not be provided with a set of labelled training examples. For the detection of known objects, methods based on hierarchies of classifiers have been investigated. In this context, face detection was a main topic, as it can automatically provide high-level semantic information about video streams. For a collection of pictures whose content is unknown, e.g. in a navigation scenario, we are investigating techniques that adaptively identify homogeneous clusters of images, which is a challenging problem due to the configuration of the feature space.
Object detection is the most straightforward solution to the challenge of content-based image indexing. Classical approaches (artificial neural networks, support vector machines, etc.) are based on induction: they construct generalisation rules from training examples. The generalisation error of these techniques can be controlled, given the complexity of the models considered and the size of the training set.
Our research on object detection addresses the design of invariant kernels and algorithmically efficient solutions, as well as boosting methods for similarity learning. We have developed several algorithms for face detection based on a hierarchical combination of simple two-class classifiers. Such architectures concentrate the computation on ambiguous parts of the scene and achieve error rates as good as those of far more expensive techniques.
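To make the hierarchical combination of simple two-class classifiers more concrete, here is a minimal sketch (our own illustration, not the team's actual detector): cheap stages reject obvious non-face windows immediately, so only ambiguous windows reach the more expensive stages.

```python
from typing import Callable, List, Sequence, Tuple

# A stage is any scoring function plus a rejection threshold.
Stage = Tuple[Callable[[Sequence[float]], float], float]

def cascade_detect(window: Sequence[float], stages: List[Stage]) -> bool:
    """Run a window through increasingly expensive two-class stages.

    Each stage scores the window; if the score falls below the stage's
    threshold, the window is rejected immediately, so most of the image is
    discarded by the first, cheapest stages and only ambiguous windows
    reach the expensive ones.
    """
    for score_fn, threshold in stages:
        if score_fn(window) < threshold:
            return False          # early rejection: clearly not a face
    return True                   # survived every stage: report a detection

if __name__ == "__main__":
    # Toy stages: a cheap coarse filter followed by a more selective one.
    stages = [
        (lambda w: sum(w) / len(w), 0.2),
        (lambda w: max(w) - min(w), 0.5),
    ]
    print(cascade_detect([0.1, 0.3, 0.9, 0.4], stages))
```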
Unsupervised clustering techniques automatically define categories and are for us a matter of visual knowledge discovery. We need them in order to:
Solve the "page zero" problem by generating a visual summary of a database that takes into account all the available signatures together.
Perform image segmentation by clustering local image descriptors.
Structure and sort the signature space for either global or local signatures, allowing a hierarchical search that is necessarily more efficient, as it only requires "scanning" the representatives of the resulting clusters.
Given the complexity of the feature spaces we are considering, this is a very difficult task. Noise and class overlap challenge the estimation of the parameters for each cluster. The main aspects that define the clustering process and inevitably influence the quality of the result are the clustering criterion, the similarity measure and the data model.
We investigate a family of clustering methods based on competitive agglomeration, which allows us to cope with our primary requirements: estimate the unknown number of classes, handle noisy data and deal with overlapping classes (by using fuzzy memberships that delay the decision as long as possible).
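A minimal sketch of the flavour of such methods (a simplified fuzzy c-means loop with cluster pruning, not the exact competitive agglomeration algorithm): clusters whose fuzzy cardinality drops below a threshold are discarded, so the number of classes is estimated rather than fixed.

```python
import numpy as np

def fuzzy_clustering_with_pruning(X, n_init=10, m=2.0, min_card=1.0,
                                  n_iter=50, seed=0):
    """Fuzzy c-means style loop that discards weak clusters.

    Starts with an over-estimated number of centres and removes any
    cluster whose fuzzy cardinality falls below `min_card`, which is
    the spirit of competitive-agglomeration clustering.
    """
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), n_init, replace=False)]
    u = None
    for _ in range(n_iter):
        # distances to every current centre (eps avoids division by zero)
        d = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2) + 1e-9
        # fuzzy memberships: closer centres get larger membership
        inv = d ** (-2.0 / (m - 1.0))
        u = inv / inv.sum(axis=1, keepdims=True)
        # prune clusters whose total membership (cardinality) is too small
        card = u.sum(axis=0)
        keep = card >= min_card
        u, centers = u[:, keep], centers[keep]
        # update the surviving centres
        w = u ** m
        centers = (w.T @ X) / w.sum(axis=0)[:, None]
    return centers, u

if __name__ == "__main__":
    X = np.vstack([np.random.randn(100, 2) + c for c in ([0, 0], [6, 6])])
    centers, _ = fuzzy_clustering_with_pruning(X)
    print("estimated number of clusters:", len(centers))
```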
We are studying here approaches that allow a reduction of the "semantic gap". There are several ways to deal with the semantic gap. A first, prior task is to optimise the fidelity of the physical content descriptors (image signatures) to the visual appearance of the images. The objective of this preliminary step is to bridge what we call the numerical gap. To minimise the numerical gap, we have to develop efficient image signatures. The weakness of visual retrieval results, due to the numerical gap, is often confusingly attributed to the semantic gap. We think that providing richer user-system interaction allows the user to express his preferences and to focus on his semantic visual-content target.
Rich user expression comes in a variety of forms:
allow the user to indicate his satisfaction (or dissatisfaction) with the system's retrieval results, a method commonly called relevance feedback. In this case, the user's reaction expresses, more generally, a subjective preference and can therefore compensate for the semantic gap between visual appearance and user intention,
provide precise visual query formulation that allows the user to select precisely his region of interest and discard the image parts that are not representative of his visual target,
provide interactive visualisation tools to help the user when querying and browsing the database,
provide a mechanism to search for the user's mental image when no starting image example is available. Several approaches are being investigated; as an example, we can mention logical composition from a visual thesaurus. Besides, learning methods related to information theory are also being developed for efficient relevance feedback models in several contexts, including mental image retrieval.
We have described, up to now, our research approaches using the visual content alone. But when additional information is available, it may prove complementary and potentially valuable in improving the results returned to the user. We may cite here metadata (file name, date of creation, caption, etc.) but also the textual annotations that are sometimes available. We must note that annotations usually carry high-level information related to prior knowledge of the context. The use of these sources of information means that we can speak of multimedia indexing.
We can think of several approaches for combining textual and visual information in the context of indexing and retrieval. As examples, we may cite the automatic textual annotation of images based on similarities between visual signatures or the propagation of textual annotations relying on the interaction between textual ontologies and visual ontologies. We also investigate methods that allow automatic textual annotation from visual content analysis. This part of our research activities is yet another solution for the reduction of the "semantic gap".
Security applications. Examples: identifying faces or fingerprints (biometrics). Biometrics is an interesting specific application from both a theoretical and an applicative (recognition, surveillance, ...) point of view. Two PhD theses were defended on themes related to biometrics. Our team also worked with a database of images of stolen objects and a database of images seized during searches (for fighting paedophilia).
Audio-visual applications. Examples: looking for a specific shot in a movie, documentary or TV news programme; presenting a video summary; helping archivists annotate the contents; detecting copies of a given material in a TV stream or on the web. Our team collaborates with INA (French TV archives), IRT (German broadcasters) and the press agencies AFP and Belga in the context of a European project. Text annotation is still very important in such applications, so cross-media access is crucial.
Scientific applications. Examples: environmental image databases (fauna and flora); satellite image databases (ground typology); medical image databases (finding images of a pathological character for educational or investigation purposes). We have an ongoing project on multimedia access to biodiversity collections for species identification.
Culture, art and design. IMEDIA has been contacted by the French Ministry of Culture and by museums for their image archives.
Finding a specific texture for the textile industry, illustrating an advertisement with an appropriate picture. IMEDIA is working with a picture library that provides images for advertising agencies. IMEDIA is involved in the TRENDS European project, dedicated to providing designers (CRF Fiat, Stile Bertone) with advanced content selection and visualisation tools.
IKONA is a framework for building Content Based Image Retrieval software prototypes. It has been designed and implemented in our team during the last four years . The current version is fully generic and is highly adaptable to any CBIR scenario thanks to its level of abstraction. As a research environment, IKONA offers support to the researchers in their work by providing stable and tested tools. As an application, it can easily be deployed and used by non-specialist users.
IKONA is based on a client/server architecture. The communication between the two components is achieved through a proprietary network protocol. It is a set of commands the server understands and a set of answers it returns to the client. The communication protocol is extensible, i.e. it is easy to add new functionalities without disturbing the overall architecture. It is also modular and therefore can be replaced by any new or existing protocol dealing with multimedia information retrieval.
The main processes are on the server side. They can be separated in two main categories:
off-line processes: data analysis, feature extraction and structuring
on-line processes: answering client requests
The images are characterised with global signatures that are implemented in the server:
Generic signatures: Colour, Shape and Texture features investigated at the IMEDIA Group.
Specific signatures: Faces and signatures for fingerprints.
Annotations: Some keywords.
Besides, two local signatures are included: the region-based description and the point-based one. The server uses image signatures and offers several types of query paradigms, available to the user through the graphical interfaces of the clients:
query by global example: The user selects an entire image as visual query.
partial queries: the user is looking for regions in images that are visually similar to the selected region;
relevance feedback on global and partial queries: the user interacts with the system in a feedback loop, giving positive and negative examples to help the system identify the category of images he or she is interested in;
mental image search: two different methods are investigated. The first is target image search with a relevance feedback model based on mutual information; the second consists in logical query composition.
We have developed two main clients that can communicate with the server. A good starting point for exploring the possibilities offered by IKONA is our web demo, available at http://www-roc.inria.fr/cgi-bin/imedia/circario.cgi/bio_diversity?select_db=1. This CGI client is connected to a running server with several generalist and specific image databases, including more than 23,000 images. It features query by example searches, switch database functionality and relevance feedback for image category searches. The second client is a desktop application. It offers more functionalities. More screen-shots describing the visual searching capabilities of IKONA are available at http://www-rocq.inria.fr/imedia/cbir-demo.html.
The architecture of this client/server software and several visual signatures were the subject of a deposit at the APP. It is distributed to INA, AFP, INRA, the Ministry of the Interior, the JRC and Alinari.
PMH is a general-purpose software library dedicated to locality sensitive hashing in metric spaces for approximate similarity search. It allows large datasets of content descriptors, usually represented by high-dimensional feature vectors, to be indexed and exploited efficiently. The construction of the index and the required memory space are linear in the dataset size, while the nearest-neighbour search algorithm is sublinear in the dataset size.
PMH is globally related to Locality Sensitive Hashing (LSH) methods, which have been proved to be the most efficient ones for approximate similarity search in large, high-dimensional datasets. Contrary to classical LSH methods (such as the ones used in the MIT E2LSH package), PMH includes a multi-probe search algorithm which drastically reduces the memory space complexity, making it possible to deal with datasets several orders of magnitude larger. Since our multi-probe algorithm is based on a probabilistic control of the buckets' success probability, it also allows the quality of the approximate search to be controlled accurately. Finally, the PMH library is far more generic than competing libraries (such as FLANN or LSHKIT). It allows the use of different metric types (L1, L2, Hamming, inner product, weighted distances, etc.), different data types (binary, float, sparse, non-vectorial, etc.), different query types (K nearest neighbours, range queries, probabilistic queries, empirical models, etc.) and different hashing function families (random projections with different distributions, kernel-based projections, optimised projections such as PCA or LDA, etc.).
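The sketch below is purely illustrative and independent of the PMH implementation; it shows the core multi-probe idea: random-projection hash keys index the data, and at query time neighbouring buckets are probed in addition to the query's own bucket, so a single small table can reach the recall that classical LSH obtains only with many tables.

```python
import itertools
from collections import defaultdict

import numpy as np

class MultiProbeLSH:
    """Toy multi-probe LSH index over L2 vectors.

    Keys are signs of random projections; at query time we also probe
    buckets whose key differs from the query key in up to `n_flips` bits,
    instead of building many independent hash tables.
    """

    def __init__(self, dim, n_bits=10, seed=0):
        rng = np.random.default_rng(seed)
        self.planes = rng.normal(size=(n_bits, dim))   # random hyperplanes
        self.table = defaultdict(list)

    def _key(self, x):
        return tuple(int(b) for b in (self.planes @ x > 0))

    def add(self, idx, x):
        self.table[self._key(x)].append((idx, x))

    def query(self, q, k=5, n_flips=1):
        key = self._key(q)
        candidates = []
        # probe the query bucket plus all buckets at Hamming distance <= n_flips
        for r in range(n_flips + 1):
            for flips in itertools.combinations(range(len(key)), r):
                probe = tuple(b ^ (i in flips) for i, b in enumerate(key))
                candidates.extend(self.table.get(probe, []))
        # exact re-ranking of the (small) candidate set
        candidates.sort(key=lambda item: np.linalg.norm(item[1] - q))
        return [idx for idx, _ in candidates[:k]]

if __name__ == "__main__":
    data = np.random.randn(1000, 64)
    index = MultiProbeLSH(dim=64)
    for i, v in enumerate(data):
        index.add(i, v)
    print(index.query(data[0]))
```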
Notably, the PMH library is the core technology for the scalability issues addressed by the VITALAS European project and is fully integrated in the resulting VITALAS multimedia search engine. It has been successfully applied to multi-user, real-time content-based retrieval in 20 million Flickr images and to real-time local search of small objects in a 100K-image collection (including 120 million SIFT features).
Millions of users interact with search engines daily. Most existing popular search engines allow users to express their search intent by issuing the query as a list of keywords. However, keyword queries are usually ambiguous. This ambiguity often leads to unsatisfying search results. For example, the query “apple” covers several different topics: fruit, smart phone, computer and so on. Heterogeneous search results need to be combined and structured efficiently and generically.
We propose to use clustering techniques that rely only on ranked nearest-neighbour information (and not directly on features or similarity measures). Such methods have proved to be very promising.
We notably use an a contrario principle to normalise the connectivity information.
The goal is to easily fuse different sources of information without any learning or prior knowledge, and to produce both mono-source and multi-source clusters within the same clustering result.
The first step is to consider all objects as candidate cluster centres and to compute a significance score for each centre from its nearest neighbours, including an oracle selection step to decide which modalities are most significant for each candidate cluster.
Because this step is time consuming, we construct, for each modality, a shared-neighbour intersection matrix at the beginning of the process.
This optimisation accelerates our algorithm so that a user can quickly get an overview of the different clusters, together with the modalities used.
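As a rough illustration of the ranked-neighbour principle (assumed details, not the project's exact scoring): each modality contributes only its k-nearest-neighbour lists, and a shared-neighbour matrix counts how many neighbours two items have in common, from which candidate cluster centres can be scored without touching the underlying features.

```python
import numpy as np

def knn_lists(dist, k):
    """Indices of the k nearest neighbours of every item (self excluded)."""
    order = np.argsort(dist, axis=1)
    return order[:, 1:k + 1]

def shared_neighbour_matrix(nn):
    """S[i, j] = number of nearest neighbours items i and j have in common."""
    n, _ = nn.shape
    member = np.zeros((n, n), dtype=int)
    member[np.arange(n)[:, None], nn] = 1          # neighbour membership indicator
    return member @ member.T

def centre_score(S, nn, i):
    """Score a candidate centre by how strongly its neighbours share neighbours."""
    return S[i, nn[i]].mean()

if __name__ == "__main__":
    X = np.random.randn(50, 16)                    # one visual modality
    dist = np.linalg.norm(X[:, None] - X[None, :], axis=2)
    nn = knn_lists(dist, k=10)
    S = shared_neighbour_matrix(nn)
    scores = [centre_score(S, nn, i) for i in range(len(X))]
    print("best candidate centre:", int(np.argmax(scores)))
```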
We experimented with our approach on the Exalead corpus (http://www.exalead.com/search/) and found very interesting multi-source clusters for different queries.
We plan to evaluate our work in the scope of re-ranking rather than clustering, since there is no evaluation dataset for web search clustering.
For now, the different information sources that we use are mostly visual ones (bags of features, global features, etc.). We would like to test our fusion (re-ranking/clustering) algorithm on different modalities and see how we perform compared to the state of the art.
An example of a structured result for the query “Flag” is shown in Figure (see also http://www-roc.inria.fr/~hamzaoui/InterfaceExa.html).
This year, we pursued our work on 3D model retrieval and indexing in several directions.
A new global descriptor, called the 3D Gaussian descriptor (3DGA), derived from the Gauss transform, has been proposed in and . It consists in a spatial description of the model built from the Gaussian law and obtained by a summation over the surface of the model (see figure ). The 3DGA descriptor is efficient but less effective than our 2D/3D descriptors for generic models. Nevertheless, it may be useful for describing 3D models that have a significant part of their surface hidden when computing their 2D projections.
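A crude sketch of the underlying idea (our own simplification, not the published 3DGA computation): the Gauss transform of surface samples is evaluated on a regular 3D grid, so each cell accumulates the Gaussian contributions of nearby surface points and the resulting volume acts as a spatial signature.

```python
import numpy as np

def gauss_transform_descriptor(points, grid_res=8, sigma=0.1):
    """Volumetric Gaussian descriptor of a 3D point set.

    `points` are surface samples of a model, assumed normalised to [0, 1]^3.
    Each grid cell stores the sum of Gaussian contributions of all samples,
    i.e. a discretised Gauss transform of the surface.
    """
    axes = np.linspace(0.0, 1.0, grid_res)
    gx, gy, gz = np.meshgrid(axes, axes, axes, indexing="ij")
    grid = np.stack([gx, gy, gz], axis=-1).reshape(-1, 3)        # (res^3, 3)
    # squared distances between every grid node and every surface sample
    d2 = ((grid[:, None, :] - points[None, :, :]) ** 2).sum(axis=2)
    desc = np.exp(-d2 / (2.0 * sigma ** 2)).sum(axis=1)
    return desc / np.linalg.norm(desc)                           # L2-normalised signature

if __name__ == "__main__":
    pts = np.random.rand(500, 3)           # stand-in for mesh surface samples
    print(gauss_transform_descriptor(pts).shape)   # (512,) for an 8^3 grid
```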
Our 3D alignment method has again demonstrated its good performance. Indeed, our alignment method, coupled with the MDLA descriptor (AL-MDLA), won the generic track of the SHREC 2009 contest (see figure ).
Moreover, this result has been reinforced by the detailed evaluations made by Mohamed Chaouch in his thesis on the main 3D generic shapes databases: once again the AL-MDLA approach obtained the best retrieval performances in all the cases. These results confirmed the importance of an appropriate choice of a 3D alignment method during the normalisation step of the retrieval process and the effectiveness of our 2D/3D descriptor when retrieving 3D models inside a database of 3D generic models.
Our alignment work has also been extended to reduce the number of reference frames that can be associated to a 3D model to find its natural pose among the 48 coordinate systems associated to the alignment axes. The principle of the extension is detailed in and in . It is based on observations of human perception w.r.t. the vertical symmetries of the models.
An interactive tool has been developed by Skander El Fekih during his master's thesis . Figure shows examples of reduced sets of models reference frames proposed to the user by the tool.
The main difficulty in 2D shape recognition is that shapes of objects can vary within the same semantic class. These variations, called deformations, can be due to multiple reasons: the objects may be viewed from different perspectives, the objects may be structurally different (in the case of articulated and deformable objects), or objects may have a different scale. In general, a normalization step to achieve invariance under all possible deformations is required before the recognition process. The normalization consists of three steps. The first step centers the objects to achieve translation invariance. The second step normalizes the scale of the objects. The third step aligns the objects to achieve rotation invariance. Most existing normalization methods are efficient solutions for centering and scaling. However, alignment remains unsolved.
Humans achieve this task efficiently by placing objects in the way that they are most commonly seen in their surroundings. Finding a technique that simulates this behavior is challenging. Results from psychological tests on human perception and recent 3D alignment methods show that symmetry is an important factor that contributes to such intuitive alignment. Based on this, we propose a new approach to automatically align 2D shapes in an intuitive way. Inspired by an idea related to 3D alignment , this approach is based on two types of symmetry: reflective symmetry and local translational symmetry. The reflective symmetry is used as a criterion to validate the principal component analysis (PCA) alignment.
In case the PCA alignment is rejected, an alternative technique is proposed, which is based on the local translational symmetry. This is defined as the repetition of the same geometrical properties along a given direction. In our algorithm, we used two representations of a shape: its boundary and its surface. We show that the surface representation, which takes into account all points of the shape, often works better than the boundary representation; it can be argued that points on the periphery are more sensitive to deformations. In general, compared to other alignment approaches, our method rapidly and efficiently computes intuitive alignments, such as the ones presented in figure .
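A simplified sketch of the alignment principle (illustrative implementation details): the shape points are rotated into their PCA frame and a reflective-symmetry score of the result is computed, which can then be used to accept or reject the PCA alignment.

```python
import numpy as np

def pca_align(points):
    """Rotate a 2D point set into its principal-axes frame (translation removed)."""
    centred = points - points.mean(axis=0)
    cov = np.cov(centred.T)
    _, vecs = np.linalg.eigh(cov)            # columns = principal directions
    return centred @ vecs[:, ::-1]           # major axis first

def reflection_score(points, n_bins=32):
    """Crude reflective-symmetry score with respect to the vertical axis.

    The aligned shape is rasterised into a small occupancy grid and
    compared with its left-right mirror; 1.0 means perfectly symmetric.
    """
    norm = points - points.min(axis=0)
    norm = norm / (norm.max() + 1e-9)
    grid = np.zeros((n_bins, n_bins), dtype=bool)
    idx = np.minimum((norm * (n_bins - 1)).astype(int), n_bins - 1)
    grid[idx[:, 1], idx[:, 0]] = True
    mirror = grid[:, ::-1]
    return (grid & mirror).sum() / max((grid | mirror).sum(), 1)

if __name__ == "__main__":
    # an ellipse-like shape: PCA alignment should be accepted (high symmetry)
    t = np.linspace(0, 2 * np.pi, 400)
    shape = np.stack([3 * np.cos(t), np.sin(t)], axis=1)
    aligned = pca_align(shape)
    print("symmetry score:", round(reflection_score(aligned), 3))
```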
In the scope of the Pl@ntNet project, we are working on plant identification. Previous work on Orchidae of Laos showed that precise identification is possible using images of their leaves. In that case, the leaves were scanned and appropriately cropped in order to retain only the relevant information. We are now extending this preliminary work to grape identification, first by using a regular digital camera and second by evaluating several shooting protocols. The latter aim at being more realistic with respect to working conditions in the field.
Contour-based shape descriptors, such as the one presented in , have interesting discriminative properties and should address all these issues. Before the regions of interest can be described, a segmentation has to be performed. The segmentation algorithm should ideally work with few, intuitive parameters and should be fast. The original watershed transform, along with some of its improvements and extensions , were interesting candidates for this task.
Our work consisted in implementing and evaluating the original watershed on images under varying shooting conditions. We first focused on images with a relatively homogeneous background, with either controlled or uncontrolled illumination conditions. In the semi-supervised version of the watershed, an image marking the inside of each region of interest is needed. We postponed the automatic choice of the markers to future work and used manually placed markers.
The details of this work are presented in and an example of segmentation is shown in figure .
These results mainly show that the watershed transform is able to address the extraction of regions of interest. Some work should be done in order to address less controlled shooting conditions and automatic processing.
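A minimal sketch of the marker-controlled watershed step, using scikit-image as an example library and manually chosen seed positions standing in for the annotation step (the file name and marker coordinates are hypothetical):

```python
import numpy as np
from skimage import color, filters, io
from skimage.segmentation import watershed

def segment_leaf(image_rgb, inside_marker, background_marker):
    """Marker-controlled watershed on the gradient of a grayscale image.

    `inside_marker` and `background_marker` are (row, col) seeds that would
    come from manual annotation (automatic marker placement is future work).
    """
    gray = color.rgb2gray(image_rgb)
    gradient = filters.sobel(gray)                 # flood the gradient image
    markers = np.zeros(gray.shape, dtype=int)
    markers[inside_marker] = 1                     # label 1: region of interest
    markers[background_marker] = 2                 # label 2: background
    labels = watershed(gradient, markers)
    return labels == 1                             # binary mask of the leaf

if __name__ == "__main__":
    img = io.imread("leaf.jpg")                    # hypothetical input image
    mask = segment_leaf(img, inside_marker=(200, 300), background_marker=(10, 10))
    print("leaf pixels:", int(mask.sum()))
```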
The extensions of this work are threefold. First, we are investigating automatic marker placement. The visual cues on which we base this work are the colour and the vein network; indeed, the vein network of the grape families is almost always visible. Second, the segmentation should be robust to varying illumination conditions and particularly to shadows. We propose to enhance the currently used gradients for that purpose. Third, partial image description inside the regions should make the final identification step robust to frequently occurring occlusions. Finally, we would also like to extend these investigations to flower segmentation.
In recent years the resolution of satellite images has increased significantly, reaching nowadays 41 cm/pixel in the panchromatic band with the GeoEye-1 sensor. Consequently, new challenges arise for an accurate land-cover interpretation of spectrally and spatially highly heterogeneous data. Because of this heterogeneity, satellite images are ambiguous and their classification remains a difficult task despite many thoughtful attempts. Indeed, most existing classification methods are only suitable for a specific range of resolutions and, on the whole, they fail when the resolution is high. In order to overcome this shortcoming, we proceed in as follows. First, we perform a multi-cue combination by incorporating various features such as color, texture and edges in a single unified discriminative model. Given a high-resolution satellite image database, we learn an appropriate dictionary which consists of meaningful cue clusters, namely color clusters, textons and shapemes. Second, we adopt a probabilistic modeling approach to resolve uncertainties and intra-region variabilities, as well as to enforce global labeling consistency. In fact, we define a Discriminative Random Field (DRF) model on an adjacency graph of superpixels which focuses directly on the conditional distribution of the labels $L$ given the image observations $X$ and the learned parameters. Our DRF model captures similarity, proximity and familiar configuration so that a powerful discrimination is ensured. In order to capture contextual interactions of the labels as well as of the data, we define in a non-homogeneous discriminative model with spatially dependent association and pairwise potentials. Third, we take a feature selection approach based on sharing boosting to learn the feature functions efficiently and to discriminate the regions of interest powerfully despite the content complexity. Finally, we apply a cluster sampling algorithm , which combines the representational advantages of DRF and graph-cut approaches, to infer the globally optimal labeling.
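To make the adjacency graph of superpixels concrete, here is a small, self-contained sketch (our own illustration, independent of the actual DRF implementation) that derives the graph edges from a superpixel label image; the pairwise potentials of the DRF are defined on exactly these edges.

```python
import numpy as np

def superpixel_adjacency(labels):
    """Set of undirected edges between superpixels that share a boundary.

    `labels` is a 2-D array of superpixel ids (e.g. produced by SLIC);
    two superpixels are adjacent when their labels touch horizontally
    or vertically.
    """
    edges = set()
    # horizontally adjacent pixels with different labels
    horiz = labels[:, :-1] != labels[:, 1:]
    for a, b in zip(labels[:, :-1][horiz], labels[:, 1:][horiz]):
        edges.add(tuple(sorted((int(a), int(b)))))
    # vertically adjacent pixels with different labels
    vert = labels[:-1, :] != labels[1:, :]
    for a, b in zip(labels[:-1, :][vert], labels[1:, :][vert]):
        edges.add(tuple(sorted((int(a), int(b)))))
    return edges

if __name__ == "__main__":
    toy = np.array([[0, 0, 1],
                    [0, 2, 1],
                    [2, 2, 1]])
    print(sorted(superpixel_adjacency(toy)))   # [(0, 1), (0, 2), (1, 2)]
```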
We train and test our model on high-resolution SPOT-5 satellite images. Our method is suitable for any range of resolution, since we just need to perform training on the appropriate database. Promising results are obtained, as shown in figures and .
The non-homogeneous DRF model provides better results than the homogeneous DRF model, which demonstrates the importance of integrating contextual information. In figure , we illustrate results obtained by our homogeneous DRF model for urban area extraction. In future work, we plan to learn the weighting parameters of the potentials and to extend our model to a multi-scale framework.
The description and recognition of textures in satellite images has attracted growing attention in recent years. In , an approach for texture retrieval based on a novel type of image representation is presented: the Local Binary Pattern Correlograms (LBPCs). Our representation is obtained by first extracting the most informative points in the image. Then, we compute local binary patterns around these interest points. Furthermore, we propose a novel texture feature by computing the correlogram of the LBPs computed around the interest points. Our new LBPCs combine the potential of local and global descriptors. Local descriptors, represented by local features extracted around interest points, are characterised by their robustness to occlusions, scale and geometric transformations. Global descriptors, represented by correlograms, are very informative about the overall visual structure of an object. The LBP occurrence correlogram proves to be a very powerful texture feature. Our proposed LBP Correlograms have been tested on a real SPOT image database. The experimental results show good average retrieval accuracy, and excellent results are achieved compared with some state-of-the-art methods.
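A rough sketch of the representation (illustrative parameter choices, not the exact configuration of the paper): local binary patterns are computed around detected interest points, and a correlogram counts how often pairs of LBP codes co-occur at given spatial distances.

```python
import numpy as np
from skimage.feature import corner_harris, corner_peaks, local_binary_pattern

def lbp_correlogram(gray, n_points=8, radius=1, dist_edges=(0, 8, 16, 32)):
    """LBP correlogram around interest points of a grayscale image.

    Returns a (n_codes, n_codes, n_distance_bins) co-occurrence tensor:
    entry [a, b, d] counts pairs of interest points at a distance falling in
    bin d whose LBP codes are a and b.
    """
    lbp = local_binary_pattern(gray, n_points, radius, method="uniform")
    n_codes = n_points + 2                               # uniform LBP code count
    points = corner_peaks(corner_harris(gray), min_distance=5)
    codes = lbp[points[:, 0], points[:, 1]].astype(int)
    corr = np.zeros((n_codes, n_codes, len(dist_edges) - 1))
    for i in range(len(points)):
        for j in range(i + 1, len(points)):
            d = np.linalg.norm(points[i] - points[j])
            b = np.searchsorted(dist_edges, d) - 1
            if 0 <= b < corr.shape[2]:
                corr[codes[i], codes[j], b] += 1
                corr[codes[j], codes[i], b] += 1
    return corr / max(corr.sum(), 1)                     # normalised signature

if __name__ == "__main__":
    img = np.random.rand(128, 128)                        # stand-in for a SPOT tile
    print(lbp_correlogram(img).shape)
```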
In Figure , the precision-recall curve of our proposed approach (LBPCs) is compared with the curves of the other approaches: (I) LBPCs combining the monochrome and opponent LBP and using three neighbourhood sets, (II) LBPCs with only one neighbourhood set, (III) MLBP Histogram [7] and (IV) traditional correlograms. It clearly shows that the performance of our method is better than the others. There is not much difference in accuracy between our proposed LBPCs and the LBPCs combining monochrome and opponent LBP; our method is faster and requires less memory to store the index.
The characterization, evaluation and use of plant biodiversity is based on the precise and efficient identification of its components and especially of the species. The identification keys issued from systematic botany mainly rely on characteristics that are ineffective in many real-world situations. The development of the inventory of species, of community ecology and of the monitoring of self-propagating plants is limited because it requires an active and continuing involvement of the very few highly specialized botanists. The collaboration between the UMR AMAP and the IMEDIA team aims to address this challenge by exploiting image analysis and recognition in a generic interactive species identification system. Since the identification process should be interactive, we decided to further explore relevance feedback on sets of local image features that describe regions of interest of an image. In the case under focus here, such regions would correspond to plant organs whose attributes are potentially relevant for identification. It should be noted that the problem we address is very difficult, since there is significant variability in pose and the relevant plant organs often correspond to sets of patches that are scattered in a region of interest.
During the first year of Wajih Ouertani's PhD, we studied and tested state-of-the-art kernels for matching sets of vectors, in order to extend SVM-based relevance feedback to the use of local features. The first experiments concerned the Pyramid Match Kernel (PMK) , for which interesting results are reported in the literature on object class recognition. PMK is based on a hierarchical uniform quantization of the feature space and represents a set of local features as a multi-resolution histogram. The kernel is obtained as a modified histogram intersection, with level-specific weights. Our experiments on the Graz-02 dataset with several descriptors including SIFT show that there are two significant problems with this approach. The first is related to PMK and more specifically to the construction of the hierarchy. Consider a quantization level that is too coarse and unable to provide enough discrimination to separate local features that have low similarity. When moving from this level to the next, finer level, each quantization interval is divided by two in all dimensions. If the dimension of the description space is high, the resulting quantization intervals are too small and local features that should be considered similar actually fall in different intervals. To address this issue, we investigated the random histogram representation of feature sets , associated with a linear kernel; this representation appears better able to avoid this kind of problem.
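As an illustration of the random histogram idea (assumed details): each set of local features is mapped to a fixed-length vector by quantising random projections, after which a plain linear kernel can be used.

```python
import numpy as np

def random_histogram(features, n_projections=64, n_bins=16, seed=0):
    """Fixed-length histogram representation of a set of local features.

    Each random projection maps every feature to a scalar, which is then
    quantised into `n_bins` bins; concatenating the per-projection bin
    counts gives a vector on which a plain linear kernel can be used,
    avoiding the very fine cells of a high-dimensional uniform grid.
    """
    rng = np.random.default_rng(seed)
    dim = features.shape[1]
    directions = rng.normal(size=(n_projections, dim))
    proj = features @ directions.T                        # (n_features, n_projections)
    # quantise each projection into equal-width bins over a fixed range
    lo, hi = -3.0, 3.0
    bins = np.clip(((proj - lo) / (hi - lo) * n_bins).astype(int), 0, n_bins - 1)
    hist = np.zeros((n_projections, n_bins))
    for p in range(n_projections):
        np.add.at(hist[p], bins[:, p], 1)
    return hist.ravel() / max(len(features), 1)

if __name__ == "__main__":
    region_a = np.random.randn(200, 128)    # e.g. SIFT descriptors in a region
    region_b = np.random.randn(150, 128)
    ha, hb = random_histogram(region_a), random_histogram(region_b)
    print("linear-kernel value:", float(ha @ hb))
```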
The second problem is not specific to the use of PMK but rather to the fact that all the local features selected in a region of interest are used as a single (positive or negative) example. Set kernels tend to “bind” the local features in a set together, so it becomes harder to ignore those that are actually irrelevant (e.g. they come from the background) or to be robust to strong occlusions. We explore solutions based on replacing a large set of local features by several localised subsets of features.
Another issue is object/noise separation: since the relevant plant organs often correspond to sets of patches that are scattered in a region of interest, many of the local features falling in the region selected by the user actually belong to the background or have in their description a strong influence from the background. It is then necessary to find appropriate feature selection solutions in order to reduce the level of such noise.
As part of our work, several programs and software modules were developed to handle, integrate and evaluate this type of feedback. In order to evaluate the performance in the target application, a botanical database with a local ground truth (region-based annotations) was prepared by AMAP using annotation software developed by IMEDIA.
In the frame of the Pl@ntNet project, we began to work on classification methods helping botanists identify plant species. One field of investigation which has recently started concerns “multiple biological criteria” classification. Indeed, botanists are used to observing and analysing specimens according to various visual aspects, various “characteristics” or “biological criteria”, in order to identify the taxonomy of a plant and to discriminate between plants.
Figure shows an example of this biological description for one specimen. This sample, an “Ebenaceae Diospyros Elliotii” plant in its natural environment, is represented by several pictures where each picture is annotated by a set of labels, i.e. some usual botanical characteristics such as “bark”, “flowers”, “inflorescence”, “limb margin”, “leaf”, “petiole”, etc. These annotations in this botanical context lead us to an original image classification problem where each individual sample in the training data is represented by several multi-labelled pictures. Moreover, each class (i.e. each species) is represented by several specimens which do not necessarily cover the same botanical characteristics. Furthermore, this is a challenging classification problem because even the flora of a limited geographical area can contain several hundred species.
Our first investigations are centred on a hierarchical classification model, an extension of previous work on information fusion . This classification method combines the visual signatures of the partial and complementary views of the known species with the botanical expert annotations. Our next challenge is to take into account the botanical expertise of a user in an interactive approach in order to improve the classification performance.
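A very small sketch of the late-fusion idea behind such a hierarchical scheme (our own simplification): each annotated view (bark, leaf, flower, ...) gets its own per-organ classifier, and the species scores of the views available for a specimen are combined into a single prediction.

```python
import numpy as np

def fuse_views(view_scores, weights=None):
    """Combine per-organ species scores into one prediction for a specimen.

    `view_scores` maps an organ label (e.g. "leaf", "bark") to a vector of
    species scores produced by the classifier trained for that organ; views
    that were not photographed for this specimen are simply absent.
    """
    organs = list(view_scores)
    if weights is None:
        weights = {o: 1.0 for o in organs}            # uniform fusion by default
    total = sum(weights[o] * np.asarray(view_scores[o], dtype=float)
                for o in organs)
    total /= sum(weights[o] for o in organs)
    return int(np.argmax(total)), total

if __name__ == "__main__":
    # hypothetical scores over 4 candidate species from two available views
    scores = {"leaf": [0.1, 0.6, 0.2, 0.1], "bark": [0.2, 0.3, 0.4, 0.1]}
    species, fused = fuse_views(scores)
    print("predicted species index:", species, fused)
```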
In the scope of a use case of the VITALAS European project, we worked on a new content-based retrieval framework applied to logo retrieval in large natural image collections. The first contribution is a new challenging dataset, called BelgaLogos , which was created in collaboration with professionals of the BELGA press agency in order to evaluate logo retrieval technologies in real-world scenarios. The dataset as well as baseline results have been made available to the community on a dedicated web page (http://www-rocq.inria.fr/imedia/belga-logo.html) and exchanges with other partners have started on the topic.
The second and main contribution is a new visual query expansion method using an a contrario thresholding strategy in order to improve the accuracy of expanded query images. Whereas previous methods based on the same paradigm used a purely hand-tuned fixed threshold, we provide a fully adaptive method enhancing both genericity and effectiveness. This new technique has been evaluated on both the OxfordBuilding dataset and our new BelgaLogos dataset. Results show that the proposed technique outperforms both the baseline method and the previous state-of-the-art visual query expansion method. Mean Average Precision results on the BelgaLogos dataset are provided in Table . More details can be found in .
Logo name | Baseline Qset1 | Baseline Qset2 | Qexp a contrario Qset1 | Qexp a contrario Qset2
Adidas | 7.8 | 0.7 | 13.3 | 0.7 |
Adidas-text | 5.6 | 1.1 | 7.8 | 1.1 |
Base | 14.4 | 38.9 | 21.5 | 58.2 |
Bouygues | 18.2 | 11.3 | 18.6 | 15.3 |
Citroën | 6.1 | 4.5 | 38.4 | 4.5 |
Citroën-text | 5.3 | 0.1 | 18.8 | 0.1 |
CocaCola | 23.0 | 0.1 | 48.6 | 0.1 |
Cofidis | 26.0 | 55.2 | 26.6 | 65.3 |
Dexia | 16.6 | 29.3 | 24.0 | 51.3 |
Ecusson | 1.1 | 0.1 | 5.9 | 0.1 |
Eleclerc | 78.1 | 74.1 | 80.6 | 80.1 |
Ferrari | 24.7 | 7.5 | 41.4 | 17.5 |
Gucci | 50.0 | 0.0 | 50.0 | 0.0 |
Kia | 32.8 | 61.3 | 67.5 | 75.6 |
Mercedes | 9.7 | 18.5 | 15.0 | 19.2 |
Nike | 1.4 | 1.2 | 3.5 | 2.6 |
Peugeot | 20.0 | 20.7 | 20.2 | 23.2 |
US President | 64.3 | 60.3 | 96.6 | 100.0 |
Puma | 8.6 | 2.2 | 20.0 | 2.2 |
Puma-text | 51.6 | 0.7 | 56.6 | 0.7 |
Quick | 24.4 | 39.0 | 41.4 | 56.6 |
Roche | 50.0 | 0.2 | 50.0 | 0.2 |
SNCF | 33.3 | 27.9 | 35.4 | 33.7 |
StellaArtois | 32.7 | 31.8 | 39.3 | 43.4 |
TNT | 22.5 | 2.5 | 33.54 | 4.4 |
VRT | 11.1 | 5.8 | 12.53 | 11.2 |
All | 20.8 | 19.0 | 34.11 | 25.7 |
In the scope of the VITALAS project, we developed a graphical interface that exploits the temporal relationships of images within videos. The tests were conducted on a database of 10 hours of news videos (approximately 75,000 images). The interface combines the classical similarity search of Maestro with the temporal information available from news events. Based on this information, we allow a user to navigate more efficiently through a large collection of audio-visual data. Indeed, the navigation combines image similarity search with temporal relationships by proposing two views. The first one shows an unordered similarity search and, for each image, the videos and its time stamp within each video. The events that are closer to the beginning of the videos are, for instance, more related to the hot news. The second view shows the main topics and the key frames of the videos associated with the selected images, along with their temporal occurrence. The main topics are drawn from a clustering of the whole database. The clusters with a large number of elements are considered as structuring each video (reports, interviews, jingles, ...), while clusters of smaller size are considered as providing information on the topics covered by the videos. In the example shown in figure , we used maps as entry points to the semantic content of news reports, which in this case are the events related to identifiable and geographically located parts of the world. The first view, on the left, reveals that the first map is used in three different videos within the 10 hours, and might be a location covered by a series of reports. When an image is selected, a time-line view (on the right) is shown. This view stresses the contents co-occurring with the map within the same time period. It may then provide information on events, people, polls or popular opinions that, to some extent, are related to this geographical event and might be hard to infer from the visual similarity alone.
We demonstrated the functionalities of this interface during the VITALAS annual review.
The need for watching movies is perpetually increasing due to the spread of the Internet and the growing popularity of video-on-demand services. The large mass of movies stored on the Internet or on VOD servers needs to be structured to accelerate browsing. We propose in a new system, called "The Scene Pathfinder", that aims at segmenting movies into scenes to give users the opportunity to have non-sequential access and to watch particular scenes of a movie. This helps them judge the movie quickly and decide whether to buy or download it, avoiding a waste of time and money. The proposed approach is multimodal (see also , , ). We use both visual and auditory information to accomplish the segmentation. We rely on the assumption that every movie scene is either an action or a non-action scene. Non-action scenes are generally characterised by static backgrounds and occur in the same place. For this reason, we rely on content information and on the Kohonen map to extract these kinds of scenes (agglomerations of shots). Action scenes are characterised by high tempo and motion. For this reason, we rely on tempo features and on Fuzzy C-Means to classify shots and to localise the action zones. The two processes are complementary. Indeed, the over-segmentation that may occur when extracting action scenes from content information is repaired by the fuzzy clustering. Our system has been tested on a varied database and the results obtained show the merit of our approach (compared to ) and that our assumptions are well founded.
In figure , we present our framework. We divide the scenes of a movie into two important classes: action scenes and non-action scenes. To detect non-action scenes (dialogue, monologue, landscape, romance, ...) we use content information and the Kohonen map to discover agglomerations of shots (scenes) having common backgrounds and objects. On the other hand, we use audio-visual tempo features and the Fuzzy C-Means classifier to delimit the core of action scenes (fights, car chases, war, gunfire, ...) and to remedy the over-segmentation that may occur in action scenes.
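As a minimal illustration of the Kohonen-map step (a toy self-organising map over shot signatures, not the actual system): visually similar shots are mapped to nearby units of a small 2-D grid, and agglomerations of shots on the map suggest non-action scenes sharing the same background.

```python
import numpy as np

def train_som(shots, grid=(6, 6), n_iter=2000, lr0=0.5, sigma0=2.0, seed=0):
    """Train a small self-organising (Kohonen) map on shot signatures.

    Returns the grid of weight vectors and the best-matching unit of every
    shot; shots falling on the same or neighbouring units form the
    agglomerations used to hypothesise non-action scenes.
    """
    rng = np.random.default_rng(seed)
    h, w = grid
    weights = rng.random((h, w, shots.shape[1]))
    coords = np.dstack(np.meshgrid(np.arange(h), np.arange(w), indexing="ij"))
    for t in range(n_iter):
        x = shots[rng.integers(len(shots))]
        # best-matching unit for the sampled shot
        d = np.linalg.norm(weights - x, axis=2)
        bmu = np.unravel_index(np.argmin(d), d.shape)
        # decaying learning rate and neighbourhood radius
        frac = t / n_iter
        lr, sigma = lr0 * (1 - frac), sigma0 * (1 - frac) + 0.5
        influence = np.exp(-((coords - bmu) ** 2).sum(axis=2) / (2 * sigma ** 2))
        weights += lr * influence[..., None] * (x - weights)
    bmus = [np.unravel_index(np.argmin(np.linalg.norm(weights - s, axis=2)),
                             grid) for s in shots]
    return weights, bmus

if __name__ == "__main__":
    shots = np.random.rand(80, 32)      # stand-in for per-shot visual signatures
    _, bmus = train_som(shots)
    print("first shots mapped to units:", bmus[:5])
```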
Unlike search-by-similarity techniques, which have to some extent become reliable over the past few years, object retrieval still has many open issues. In fact, searching for a concept raises various problems related to feature extraction (e.g. invariance to viewpoint, illumination, affine transformations, etc.) and machine learning (robustness, over-fitting, genericity, computation time, etc.).
Our research focuses on building concise and powerful models that make it possible to retrieve objects in large heterogeneous image collections. To this end, we integrated a feature selection algorithm based on both boosting and lasso techniques.
Contrary to most training algorithms used in automatic object recognition, this new algorithm generates sparse models from the complete space of all the local features of the training images. The intuitive idea is to add an extra term to the loss function. This term represents a constraint which causes shrinkage of the solutions towards zero.
Given a loss function $L(y, f(x))$, where $f(x) = \sum_{t} \alpha_t h_t(x)$ is the sum of base learners, the objective is to minimize the following function:

$$\sum_{i=1}^{n} L(y_i, f(x_i)) + \lambda \sum_{t} |\alpha_t|,$$

where $n$ is the number of training examples and $\lambda \geq 0$ is a parameter that controls the amount of shrinkage applied. The larger the $\lambda$ coefficient, the sparser the model. Sparsity is known to be a good tradeoff between model simplicity and good category representation. In addition, sparsity tends to favor interpretability, which is practical for subsequent human interaction.
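A compact sketch of the shrinkage idea (an illustrative forward-stagewise variant, not the exact algorithm of the team): at each boosting round the coefficient of the selected base learner is soft-thresholded by lambda, so weak contributions are driven exactly to zero and the final model stays sparse.

```python
import numpy as np

def soft_threshold(a, lam):
    """Lasso shrinkage operator: pushes small coefficients exactly to zero."""
    return np.sign(a) * max(abs(a) - lam, 0.0)

def sparse_boost(X, y, n_rounds=50, lam=0.1):
    """Forward-stagewise boosting with an L1 penalty on the coefficients.

    Base learners are single-feature "stumps" h_j(x) = sign(x_j); at each
    round the best-correlated learner gets a soft-thresholded coefficient,
    so only a sparse subset of features ends up in the model.
    """
    n, d = X.shape
    coef = np.zeros(d)
    residual = y.astype(float)
    H = np.sign(X)                                  # responses of all base learners
    for _ in range(n_rounds):
        # least-squares fit of every learner to the current residual
        alphas = H.T @ residual / n
        j = int(np.argmax(np.abs(alphas)))
        step = soft_threshold(alphas[j], lam)
        if step == 0.0:                             # everything shrunk away: stop
            break
        coef[j] += step
        residual -= step * H[:, j]
    return coef

if __name__ == "__main__":
    X = np.random.randn(300, 40)
    y = np.sign(X[:, 3] + 0.5 * X[:, 7])            # only two informative features
    coef = sparse_boost(X, y)
    print("non-zero coefficients:", np.flatnonzero(coef))
```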
Preliminary experiments carried out on the PascalVOC database using two successful state-of-the-art descriptors (SIFT and SURF) are promising.
The bag-of-visual-words is a popular representation for images that has proven to be quite effective for automatic annotation.
The main idea behind bag-of-visual-words is to represent an image as a collection of visual patches and to compute a histogram counting the occurrences of these patches as a global signature. This representation can then be used in any learning framework to address the automatic annotation problem. It is simple to implement and provides current state-of-the-art performance on several evaluation benchmarks. One of the main characteristics of bags of visual words is their orderless nature. The spatial position of the visual patches is dropped and never used. On the one hand, this choice brings flexibility and robustness to the representation, as it is able to deal with changes in viewpoint or occlusion. On the other hand, the spatial relations between patches could be useful to describe the internal structure of objects or to highlight the importance of contextual visual information for these objects.
We extend this representation in order to include weak geometrical information by using visual word pairs. We choose to consider the co-occurrence of words in a predefined local neighborhood of each patch. Thus, we only consider the distance between two patches, whatever their relative orientation. This way, we include both contextual and structural information in our new visual signature.
Following our previous work, we choose to extract standard low-level visual patches on a regular grid before creating the pairs and we use SVMs as a learning strategy.
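The following sketch shows one possible way (assumed details) to build the word-pair signature: patches on a regular grid are quantised into visual words, and every pair of words whose patches lie within a fixed spatial radius increments a pair-occurrence histogram that is appended to the classical word histogram.

```python
import numpy as np

def word_pair_signature(positions, words, vocab_size, radius=2.0):
    """Bag-of-words histogram augmented with local word-pair co-occurrences.

    `positions` are the grid coordinates of the patches and `words` their
    visual-word indices; pairs are counted whenever two patches are closer
    than `radius`, ignoring their relative orientation.
    """
    # classical orderless histogram
    unary = np.bincount(words, minlength=vocab_size).astype(float)
    # unordered pair histogram indexed by (min_word, max_word)
    pairs = np.zeros((vocab_size, vocab_size))
    for i in range(len(words)):
        for j in range(i + 1, len(words)):
            if np.linalg.norm(positions[i] - positions[j]) <= radius:
                a, b = sorted((words[i], words[j]))
                pairs[a, b] += 1
    signature = np.concatenate([unary, pairs[np.triu_indices(vocab_size)]])
    return signature / max(signature.sum(), 1.0)

if __name__ == "__main__":
    grid = np.array([(r, c) for r in range(8) for c in range(8)], dtype=float)
    words = np.random.randint(0, 20, size=len(grid))   # toy visual-word labels
    print(word_pair_signature(grid, words, vocab_size=20).shape)
```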
On a standard image database (Pascal VOC 2007), we achieve 10% higher annotation performances by considering the word pairs.
Embedding the word pairs in a standard bag-of-visual-words representation brings very significant improvement for an automatic annotation task. The weak geometrical information they encode is complementary to the standard words occurrences histogram.
This work has been published in . The overall system is described in the thesis .
Most recent and effective recognition techniques are based on high-dimensional and sparse representations induced by the large number of local visual features. Classifiers learned on such representations are usually applied to test images one by one, so the complexity in a retrieval context is intrinsically linear in the dataset size. Hence, we explored an efficient boosting strategy in order to reduce the retrieval complexity when using feature-rich representations of images. For the learning step we used the AdaBoost algorithm with a weak learner based on distances between training local features. Instead of predicting the scores of the images one by one, we perform $T$ range queries in the dataset according to the parameters of the $T$ weak classifiers, using the a posteriori multi-probe locality sensitive hashing similarity search structure . Each range query returns the set of features $R_t$ lying within the range of the corresponding weak classifier, i.e. $R_t = \{x : d(x, x_t) \leq r_t\}$, where $x_t$ and $r_t$ denote the reference feature and the range of the $t$-th weak classifier.
Experiments on Caltech 256 dataset show that the technique is about 250 times faster than the naive exhaustive method with surprisingly better performances (see Table ).
We also applied the proposed method to a real time relevance feedback mechanism based on freely selected image regions. Experiments show that the active learning provides significant effectiveness improvements (see figure and , for more details).
As every year, the latest research results have been integrated.
The OS-specific dependencies on which we were working in 2008 have now been tested on several platforms, such as different Unices (32- and 64-bit) and Mac OS X, and with different compilers. The program now runs safely on all these architectures. To achieve this, we progressively moved to the Boost library (threads, unit tests, graphs, linear algebra support), which is known for its quality, portability and permissive licence. The dependencies of the core software were also cleaned up, and we are now able to build the software easily from scratch, which was previously possible only at the expense of a lot of time. Particular care was taken concerning the licence issues of these dependencies.
A lot of effort was also put into the quality and stability of the program, in order to develop new functionalities without introducing regressions and to safely deploy MAESTRO on remote servers within the scope of industrial partnerships (needs of VITALAS, Pl@ntNet, ...). Automatic reports are built at each modification of the code repository, for several architectures and platforms, in a distributed manner.
During 2009, we finalised, jointly with INA, release version v1 of the PMH library. The global architecture of the library was modified in order to make it more flexible and generic. Several parts of the indexing chain have been isolated in independent modules: a Transformer module that projects the original data into a new feature space, a scalar Quantizer module, a multi-dimensional Quantizer module that creates the hash keys, and a query modeller that learns the prior probability tables for a given query model. Several new functionalities were also added, the most important one being the ability to search directly on binary compressed signatures instead of original-space signatures. The second main new functionality is a two-level hierarchy of hash tables that overcomes the previous limit on hash key sizes induced by memory limitations. Other new functionalities include new metrics, new scalar quantisations and new hash function families. Discussions with INA regarding the licensing of the software are under way, with the objective of an open-source licence next year. Finally, new research results on random maximum margin hashing have been implemented for experiments and should be fully integrated in the next few months.
It is a joint project with AMAP (CIRAD, Montpellier) and Tela Botanica, an international botanical network with 8,500 members and an active collaborative web platform (10,000 visits /day).
The project is financially supported by the Agropolis International Foundation (
http://
Dissemination:
Project presentation at SIA 2009 (“La botanique numérique” meeting, Salon International de l'Agriculture) February 23, 2009, Paris, France.
Posters ( , ) and demo at e-Biosphere 09, June 1-3 2009, London, U.K.
Presentation at the XIII Congrès Forestier Mondial in Buenos Aires, in October 2009 .
Presentations at the Taxonomic Database Working Group annual conference , (TDWG 2009), November 9-13 2009, Montpellier, France.
The PhD thesis of Wajih Ouertani, financed by INRA, in the context of a strategic collaboration between INRIA and INRA, addresses interactive species identification through advanced relevance feedback mechanisms based on local image information.
The project "R2I - Recherche Interactive d'Images" is a joint project which aims at designing new methods for interactive image search. The final goal of this project is a system which can index about one billion of images and provide users with advanced interaction capabilities. The partners are the company Exalead, a leader in the area of corporate network indexing and a specialist for user-centered approaches, the INRIA project-team Imedia, a research group with a strong background in interactive search of multi-media documents, as well as LEAR and the University of Caen, both specialists in object recognition. Amel Hamzaoui begun her PhD thesis inside the R2I project (see section for more details).
“Video & image indexing and retrieval in the large scale” (
http://
A presentation of the VITALAS project (Video and Image Indexing and Retrieval in the Large Scale) was given at the International Symposium of the THESEUS Research Program kick-off talks, in Berlin, Germany, in June 2009.
A demo of the second version of the VITALAS system was given at the CHORUS final conference in Brussels, Belgium, in May 2009.
VITALAS participated in the TRECVID-2009 evaluation (High-Level Feature Extraction and Interactive Search tasks, ).
CHORUS is a Coordination Action in the field of audio-visual search engines, accepted in call 6 of the 6th Framework Programme (
http://
The EU coordination action CHORUS organised the international CHORUS conference (held May 26-28, 2009 in Brussels, Belgium) to present its final report, which identified cross-disciplinary challenges and recommendations in the domain of search engine technology. In addition to high representatives of the European Commission, the conference was attended by major industrial (e.g. Yahoo!, Thomson, Philips, Exalead, etc.) and academic stakeholders of the search engine community (including representatives from North America and Japan).
A joint collaboration with Shin'Ichi Satoh and Michael Houle has been in place since 2006. Several visits and exchanges have taken place between IMEDIA and NII. The main topics are social web mining, scalable clustering and object recognition.
Don Geman has been a regular visiting professor for several years; the scientific topics addressed are related to relevance feedback and mental category image search.
The CIVE (Classification d'Images d'espèces VEgétales) project is a collaborative project between AMAP, INRIA, ISI (Institut Supérieur d'Informatique de Tunisie) and Sup'Com (Ecole Supérieure de Communication de Tunis). It is financed by both the Tunisian Universities and INRIA.
Participation in the “Conférence annuelle de la société savante des naturalistes en Tunisie” in Hammamet, in November 2009.
In 2007, IMEDIA organised the first international benchmark on video copy detection technologies, as a "live" event during the ACM CIVR 2007 conference (
http://
Demos of IKONA/MAESTRO software have been presented at:
CHORUS final conference, May 26-27, 2009, in Brussels, Belgium;
Salon Européen de la recherche et de l'innovation, June 3-5, 2009, in Paris, France.
Pl@ntNet: La Tribune (March 3, 2009) “L'herbier collaboratif arrive”, Les Echos (March 2009) “La reconnaissance vidéo au service de la botanique”, RFI and France 3.
See also
http://
VITALAS: Les Echos (June 26, 2009) “Le moteur de recherche vidéo européen Vitalas entre en piste”.
European dissemination during the Final Chorus Conference:
http://
General co-Chair of ACM Multimedia Information Retrieval (ACM MIR 2010 - March 29-31 - Philadelphia, Pennsylvania):
http://
Co-chair of "Track V: Multimedia and Document Analysis, Processing and Retrieval" in ICPR 2010: International Conference on Pattern Recognition 2010 (23-26 August
Istanbul),
http://
Chair of "Brave New Ideas" in ACM Multimedia 2010 25-29 October 2010, Florence, Italy (
http://
Founding member of the ACM ICMR "ACM International Conference on Multimedia Retrieval" born from the fusion of: ACM MIR (International Conference on Multimedia Information Retrieval) and ACM CIVR (International Conference on Image and Video Retrieval)
Member of the steering committee of ACM ICMR (4 years)
Program Chair of final Chorus conference
http://
Member of the Scientific Advisory Board of the Japanese project "Multimedia Web Social Analysis and Mining" supported by MEXT (the Japanese Ministry of Research).
Member of an academic think-tank for the PPP European initiative.
Scientific coordinator of VITALAS IP FP6;
Scientific coordinator of CHORUS CA FP6;
Expert for ESF (European Science Foundation :
http://
Expert for the EC for FP7 preparation, participation to several expert meetings.
Expert for NWO (Netherlands).
Elected member in the Steering Board of NEM ETP (Networked and Electronic Media European Technology Platform) and acting as INRIA representative
Invited speaker in the "Session: Content"
http://
Organizers: European Commission, NICT - Tokyo, October 2009.
Member of the Future Internet of Content (FCN) cluster within FIA. - Scientific committee of the "Search and Discovery" session at FIA (Future Internet Assembly), Stockholm, November 2009.
French expert for COST ICT Domain (intergovernmental network for European Cooperation in the field of Scientific and Technical Research)
Member of ACM - SIGMM committee and of ACM Multimedia Information Retrieval International Conference steering committee
Member of the Editorial board of scientific journals: I3, PRA
Member of several technical program committees (TPC) of major international conferences: ACM MM, ACM CIVR, ACM MIR, IEEE ICME, IEEE ICPR, CBMI, SAMT, WIAMIS...
Responsibilities within INRIA: member of the COPIL SDRH of INRIA (comité de pilotage national sur les priorités et la prospective de la politique des ressources humaines), member of the “Comité d'animation scientifique” of the research topic "Perception, cognition, interaction" of INRIA, member of the direction team of the CRI Paris-Rocquencourt representing the researchers, and member of the BCP (Bureau du comité des projets) of the CRI Paris-Rocquencourt.
Co-organizer of the ISIS Workshop “Scalability in multimedia information retrieval” (
http://
Scientific expert for the French National Research Agency (ANR), call “Programme Blanc”, and for PENEK (Cyprus).
On leave from CNAM between March and August 2009 (“Congé pour Recherches”) at INRIA Rocquencourt and New Jersey Institute of Technology (USA).
Journal reviewer: IEEE Transactions on Neural Networks, Information Science, Multimedia Tools and Applications, Pattern Recognition Letters.
Member of the steering committee of Pl@ntNet: project assistant-coordinator, WP3 co-leader (platform specifications and development) and WP5 co-leader (dissemination).
Journal Reviewer: IEEE Transactions on Image Processing.
Technical Programme committee member for SAMT 2009, the international conference on Semantic and Digital Media Technologies.
Member of the steering committee of VITALAS IP FP6 (leader of WP2 "Enabling technologies: Media Content Description and Summarisation").
Member of the steering committee of Pl@ntNet: WP2 co-leader.
Member of the steering committee of VITALAS IP FP6 (leader of WP7:“User interface and visualisation”),
Member of the steering committee of Pl@ntNet (WP5 co-leader: “Dissemination”).
Member of the Humanities and Social Sciences committee for the “Blanc” and “Young researcher” 2009 programmes of the French National Research Agency (ANR),
Member of the steering committee of the CNRS GDR IG (Informatique Graphique) ;
Member of the technical programme committee of the First International Conference on Advances in Multimedia (MMEDIA 2009),
Member of the editorial board of the “Revue Electronique Francophone d'Informatique Graphique”.
Journal Reviewer: International Journal of Computer Vision, IEEE Transactions on Visualization and Computer Graphics, Computers & Graphics.
Year 2008
20h course on multimedia indexing at ISI and SupCom Tunis.
In charge of the course “Multimedia Databases” of the Master in computer science of the University Paris Dauphine.
24h training course on “Advanced C++ programming” for researchers and PhD students at INRIA Rocquencourt in February 2009.
24h of lab sessions (TP) on Java Database Connectivity (JDBC), CNAM 3rd year (NFA011).
192 hours of teaching in the Mathematics and Computer Science Department of Reims Champagne-Ardenne University;
In charge of the course "Image Acquisition and Analysis" of the Master "Engineering, Images and Knowledge" of Reims Champagne-Ardenne University.