One of the consequences of the increasing ease of use and significant cost reduction of computer systems is the production and exchange of more and more digital and multimedia documents. These documents are fundamentally heterogeneous in structure and content as they usually contain text, images, graphics, video and sounds.
Information retrieval can no longer rely on text-based queries alone; it must be multi-modal and integrate all the aspects of the multimedia content. In particular, the visual content plays a major role and represents a central vector for the transmission of information. The description of that content by means of image analysis techniques is less subjective than the usual keyword-based annotations, when the latter exist at all. Moreover, being independent of the query language, the description of visual content is becoming paramount for the efficient exploration of a multimedia stream.
In the IMEDIA group we focus on intelligent access to multimedia documents by means of their visual content. With this goal in mind, we develop methods that address key issues such as content-based indexing, interactive search and image database navigation, in the context of multimedia content.
Content-based image retrieval systems support automatic search and assist human decisions. The user remains in charge: he is the only one able to take the final decision. The numerous research activities in this field during the last decade have proven that retrieval based on visual content is feasible. Nevertheless, current practice shows that a usability gap remains between the designers of these techniques and their potential users.
One of the main goals of our research group is to reduce the gap between real usages and the functionalities resulting from our research on visual content-based information retrieval. We therefore strive to conceive methods and techniques that can address realistic scenarios, which often leads to exciting methodological challenges.
Among the "usage" objectives, an important one is the ability, for the user, to express his specific visual interest in a part of a picture. It allows him to better target his intention and to formulate it more accurately. Another goal in the same spirit is to let the user express subjective preferences and to provide the system with the ability to learn them. When dealing with any of these issues, we keep in mind the importance of the response time of such interactive systems. Of course, how critical the response time is, and what value it should take, depends heavily on the domain (specific or generic) and on the cost of errors.
Our research work thus lies at the intersection of several scientific fields, the main ones being image analysis, pattern recognition, statistical learning, human-machine interaction and database systems.
Our work is structured into the following main themes:
Image indexing: this part mainly concerns modeling the visual aspect of images, by means of image analysis techniques. It leads to the design of image signatures that can then be obtained automatically.
Clustering and statistical learning: generic and fundamental methods for solving pattern recognition problems, which are central in the context of image indexing.
Interactive search and personalization: to let the system take into account the preferences of the user, who usually expresses subjective or high-level semantic queries.
Cross-media indexing, and in particular bimodal text + image indexing, which addresses the challenge of combining these two media for more efficient indexing and retrieval.
More generally, the research work and the academic and industrial collaborations of the IMEDIA team aim to answer the complex problem of the intelligent access to multimedia content.
We group the existing problems in the domain of content-based image indexing and retrieval into the following themes: image indexing, pattern recognition, personalization and cross-media indexing. In the following we give a short introduction to each of these themes.
Image indexing is the process of extracting from a document (here a picture) compact, structured and significant visual features that will be used and compared during the interactive search.
The goal of the IMEDIA team is to provide the user with the ability to perform content-based searches in image databases in a way that is both intelligent and intuitive. When formulated in concrete terms, this problem gives rise to several mathematical and algorithmic challenges.
To represent the content of an image, we are looking for a representation that is compact (less data and more semantics), relevant (with respect to the visual content and the users) and fast to compute and compare. The choice of the feature space consists in selecting the significant features, the descriptors for those features and finally the encoding of those descriptors as image signatures.
We deal both with generic databases, in which images are heterogeneous (for instance, search of Internet images), and with specific databases, dedicated to a specific application field. The specific databases are usually provided with a ground truth and have a homogeneous content (faces, medical images, fingerprints, etc.).
Note that for specific databases one can develop dedicated, optimal features for the application considered (face recognition, etc.). Generic databases, on the contrary, require generic features (color, texture, shape, etc.).
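As a toy illustration of a generic global signature, the sketch below computes a joint RGB histogram and compares two signatures with an L1 distance. The bin count and the distance choice are arbitrary values for the example, not the group's actual descriptors:

```python
import numpy as np

def color_signature(image, bins=4):
    """Global color signature: a joint RGB histogram, normalized to sum to 1.
    `image` is an (H, W, 3) uint8 array."""
    # Quantize each channel into `bins` levels and build a joint histogram.
    q = (image.astype(np.int32) * bins) // 256           # values in 0..bins-1
    idx = (q[..., 0] * bins + q[..., 1]) * bins + q[..., 2]
    hist = np.bincount(idx.ravel(), minlength=bins ** 3).astype(float)
    return hist / hist.sum()

def l1_distance(s1, s2):
    """Histogram dissimilarity: 0 for identical signatures, up to 2."""
    return float(np.abs(s1 - s2).sum())

# A flat red image and a flat blue image are maximally dissimilar,
# while two identical red images match exactly.
red = np.zeros((8, 8, 3), dtype=np.uint8); red[..., 0] = 255
blue = np.zeros((8, 8, 3), dtype=np.uint8); blue[..., 2] = 255
```

A specific database would instead call for dedicated features (e.g. face-specific signatures), as noted above.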
We must distinguish not only generic and specific signatures, but also local and global ones, which correspond respectively to queries concerning parts of pictures or entire pictures. For partial queries, we can further distinguish approximate and precise queries. In the latter case one must be provided with various descriptions of parts of images, as well as with means to specify them as regions of interest. In particular, we have to define both global and local similarity measures.
Also, since the arrival of Anne Verroust-Blondet, we have been investigating the problem of 3D model description, in order to complete our approach to the description of visual appearance in 2D and 3D.
Once the computation of signatures is over, the image database is encoded as a set of points in a high-dimensional space: the feature space.
A second step in the construction of the index can be valuable when dealing with very high-dimensional feature spaces. It consists in pre-structuring the set of signatures and storing it efficiently, in order to reduce the access time for future queries (a tradeoff between access time and storage cost). In this second step, we have to address problems that have long been studied in the database community, but that arise here in a new context: image databases. The diversity of the feature spaces we deal with forces us to design specific methods for structuring each of these spaces. A collaboration on this topic is under way with Michel Scholl (INRIA/CNAM).
Statistical learning and classification methods are of central interest for content-based image retrieval.
We consider here both supervised and unsupervised methods. Depending on our knowledge of the contents of a database, we may or may not be provided with a set of labeled training examples. For the detection of known objects, methods based on hierarchies of classifiers have been investigated. In this context, face detection has been a main topic, as it can automatically provide high-level semantic information about video streams. For a collection of pictures whose content is unknown, e.g. in a navigation scenario, we are investigating techniques that adaptively identify homogeneous clusters of images, a challenging problem due to the configuration of the feature space.
Object detection is the most straightforward solution to the challenge of content-based image indexing. Classical approaches (artificial neural networks, support vector machines, etc.) are based on induction: they construct generalization rules from training examples. The generalization error of these techniques can be controlled as a function of the complexity of the models considered and the size of the training set.
Our research on object detection addresses the design of invariant kernels and algorithmically efficient solutions. We have developed several algorithms for face detection based on a hierarchical combination of simple two-class classifiers. Such architectures concentrate the computation on ambiguous parts of the scene and achieve error rates as good as those of far more expensive techniques. The computational efficiency we are looking for has the effect of a regularization constraint: it favors structurally simple classifiers, which have good generalization properties.
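The coarse-to-fine rejection principle can be sketched as follows; the stage classifiers and thresholds below are invented for the illustration and are far simpler than the team's actual face detectors:

```python
import numpy as np

def cascade_detect(window, stages):
    """Evaluate a window through a hierarchy of two-class classifiers.
    `stages` is a list of (classifier, threshold) pairs ordered from
    cheapest to most expensive; a window is rejected as soon as one
    stage scores below its threshold, so computation concentrates on
    ambiguous regions (hypothetical interface, not the team's code)."""
    for clf, thr in stages:
        if clf(window) < thr:
            return False          # early rejection: most windows stop here
    return True                   # survived all stages: candidate detection

# Toy stages: a mean-intensity test (cheap) then a variance test (costlier).
stages = [
    (lambda w: w.mean(), 0.2),
    (lambda w: w.var(), 0.01),
]
dark_flat = np.full((8, 8), 0.1)                  # rejected by the first, cheap stage
textured = np.random.RandomState(0).rand(8, 8)    # passes both stages
```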
Beside this work focusing on the trade-off between error rate and computational cost, we are working on the design of invariant kernels for vision. We have worked on the scale invariance of kernel methods based on the triangular kernel, and we have unified kernel methods and the matching of points of interest by designing matching kernels.
Such matching kernels bring the high invariance of matching schemes to the view-based representations that underlie support vector machines and other kernel methods.
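For reference, the triangular kernel mentioned above is simply K(x, y) = -||x - y||. Its homogeneity under a rescaling of the inputs is what underlies the scale invariance, as this small check illustrates (the data are arbitrary):

```python
import numpy as np

def triangular_kernel(x, y):
    """Conditionally positive definite triangular kernel K(x, y) = -||x - y||."""
    return -np.linalg.norm(x - y)

# The kernel is homogeneous: rescaling all inputs by a factor a > 0 rescales
# the Gram matrix by the same factor, so decision functions built on it are
# invariant to a global change of scale.
rng = np.random.RandomState(0)
x, y = rng.rand(5), rng.rand(5)
a = 3.0
```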
Unsupervised clustering techniques automatically define categories and are, for us, a matter of visual knowledge discovery. We need them in order to:
Solve the "page zero" problem by generating a visual summary of a database that takes into account all the available signatures together.
Perform image segmentation by clustering local image descriptors.
Structure and sort out the signature space, for either global or local signatures, allowing a hierarchical search that is more efficient since it only requires "scanning" the representatives of the resulting clusters.
Given the complexity of the feature spaces we are considering, this is a very difficult task. Noise and class overlap challenge the estimation of the parameters of each cluster. The main aspects that define the clustering process, and inevitably influence the quality of the result, are the clustering criterion, the similarity measure and the data model.
We investigate a family of clustering methods based on competitive agglomeration that allows us to cope with our primary requirements: estimating the unknown number of classes, handling noisy data and dealing with overlapping classes (by using fuzzy memberships that delay the hard decision as much as possible).
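A deliberately simplified sketch of this idea: fuzzy memberships plus a competition step that discards clusters with too small a fuzzy cardinality, so the number of classes is estimated rather than fixed in advance. This is an illustration, not the exact competitive agglomeration algorithm:

```python
import numpy as np

def competitive_agglomeration(X, c0=8, min_card=2.0, iters=30, m=2.0, seed=0):
    """Start from an overestimated number of clusters c0, run fuzzy
    C-means-style updates, and discard clusters whose fuzzy cardinality
    drops below `min_card`. All parameter values are illustrative."""
    rng = np.random.RandomState(seed)
    centers = X[rng.choice(len(X), c0, replace=False)]
    for _ in range(iters):
        d2 = ((X[:, None, :] - centers[None]) ** 2).sum(-1) + 1e-12
        u = (1.0 / d2) ** (1.0 / (m - 1.0))
        u /= u.sum(axis=1, keepdims=True)       # fuzzy memberships
        card = u.sum(axis=0)                    # fuzzy cardinality of each cluster
        keep = card > min_card                  # competition: drop weak clusters
        um = u.T ** m
        centers = (um @ X)[keep] / um.sum(axis=1)[keep, None]
    return centers

# Two well-separated blobs: starting from 8 clusters, the surviving centers
# all end up inside one of the two blobs.
rng = np.random.RandomState(1)
X = np.vstack([rng.randn(50, 2), rng.randn(50, 2) + 8.0])
centers = competitive_agglomeration(X)
```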
We study here approaches that allow for a reduction of the "semantic gap". There are several ways to deal with it. A first, preliminary step is to optimize the fidelity of the physical-content descriptors (image signatures) to the visual appearance of the images; its objective is to bridge what we call the numerical gap, which requires designing efficient image signatures. The weakness of visual retrieval results that is due to the numerical gap is often confusingly attributed to the semantic gap. We think that providing richer user-system interaction allows the user to express his preferences and to focus on his semantic visual-content target.
Rich user expression comes in a variety of forms:
allow the user to indicate his satisfaction (or dissatisfaction) with the system's retrieval results, a method commonly called relevance feedback. In this case, the user's reactions express a subjective preference and can therefore compensate for the semantic gap between visual appearance and user intention,
provide precise visual query formulation that allows the user to select his region of interest exactly and to discard the image parts that are not representative of his visual target,
provide a mechanism to search for the user's mental image when no starting image example is available. Several approaches are investigated; as an example, we can mention logical composition from a visual thesaurus. Besides, learning methods related to information theory are also developed for efficient relevance feedback models in several study contexts, including mental image retrieval.
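The relevance-feedback loop of the first item can be sketched with a classical Rocchio-style update in signature space. This textbook rule is a stand-in for the information-theoretic models mentioned above, and the weights are conventional values, not the team's:

```python
import numpy as np

def rocchio_update(query, positives, negatives, alpha=1.0, beta=0.75, gamma=0.25):
    """One relevance-feedback iteration: move the query signature toward
    the mean of the examples marked relevant and away from the mean of
    those marked irrelevant."""
    q = alpha * query
    if len(positives):
        q = q + beta * np.mean(positives, axis=0)
    if len(negatives):
        q = q - gamma * np.mean(negatives, axis=0)
    return q

# A query at the origin, two positive examples at [1, 0] and one negative
# at [-1, 0]: the updated query moves toward the positives.
q_new = rocchio_update(np.zeros(2),
                       np.array([[1.0, 0.0], [1.0, 0.0]]),
                       np.array([[-1.0, 0.0]]))
```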
We have described, up to now, our research approaches using the visual content alone. But when additional information is available, it may prove complementary and valuable in improving the results returned to the user. We may cite here metadata (file name, date of creation, caption, etc.) but also the textual annotations that are sometimes available. Note that annotations usually carry high-level information related to prior knowledge of the context. Using these sources of information means that we can properly speak of multimedia indexing.
We can think of several approaches for combining textual and visual information in the context of indexing and retrieval. As examples, we may cite the automatic textual annotation of images based on similarities between visual signatures or the propagation of textual annotations relying on the interaction between textual ontologies and visual ontologies. We also investigate methods that allow automatic textual annotation from visual content analysis. This part of our research activities is yet another solution for the reduction of the "semantic gap".
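A minimal sketch of automatic annotation by visual similarity, using a plain nearest-neighbor vote instead of the ontology-based propagation described above (all names and data are illustrative):

```python
import numpy as np

def propagate_annotations(signatures, keywords, query_sig, k=3):
    """Propagate to an unannotated query image the keywords of its k
    visually nearest annotated images, keeping only keywords supported
    by a strict majority of the neighbors."""
    d = np.linalg.norm(signatures - query_sig, axis=1)
    nearest = np.argsort(d)[:k]
    votes = {}
    for i in nearest:
        for kw in keywords[i]:
            votes[kw] = votes.get(kw, 0) + 1
    return {kw for kw, v in votes.items() if v > k / 2}   # strict majority

# Two annotated beach images near the query, one distant mountain image.
sigs = np.array([[0.0, 0.0], [0.1, 0.0], [5.0, 5.0]])
kws = [{"beach", "sea"}, {"beach"}, {"mountain"}]
```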
Security applications. Examples: identifying faces or fingerprints (biometrics). Biometrics is an interesting specific application from both a theoretical and an applicative (recognition, surveillance, ...) point of view. Two PhD theses were defended on themes related to biometrics. Our team has also worked with a database of images of stolen objects and a database of images seized during searches (for fighting pedophilia). We are currently collaborating with the Ministry of the Interior.
Multimedia. Examples: looking for a specific shot in a movie, documentary or TV news program; presenting a video summary. Our team collaborates with the TV channel TF1 in the context of a RIAM project. Text annotation is still very important in such applications, so cross-media access is crucial.
Scientific applications. Examples: environmental image databases (fauna and flora); satellite image databases (ground typology); medical image databases (finding images of a pathological character for educational or investigation purposes). We have an ongoing project on multimedia access to biodiversity collections.
Culture, art and education. Examples: encyclopaedic research, query by example on paintings or drawings, query by a detail of an image. IMEDIA has been contacted by the French Ministry of Culture and by museums for their image archives.
Finding a specific texture for the textile industry, illustrating an advertisement by an appropriate picture. IMEDIA is working with a picture library that provides images for advertising agencies.
Telecommunications. Examples: image representation and content-based queries stand as the basis of MPEG-4 and MPEG-7. IMEDIA does not contribute to their normative aspects but follows the latest results of the MPEG-7 group. Note that the signatures developed by IMEDIA can be used with this standard.
IKONA is a framework for building content-based image retrieval (CBIR) software prototypes. It has been designed and implemented in our team over the last four years. The current version is fully generic and highly adaptable to any CBIR scenario thanks to its level of abstraction. As a research environment, IKONA supports researchers in their work by providing stable, tested tools. As an application, it can easily be deployed and used by non-specialist users.
IKONA is based on a client/server architecture. The communication between the two components is achieved through a proprietary network protocol. It is a set of commands the server understands and a set of answers it returns to the client. The communication protocol is extensible, i.e. it is easy to add new functionalities without disturbing the overall architecture. It is also modular and therefore can be replaced by any new or existing protocol dealing with multimedia information retrieval.
The main processes are on the server side. They can be separated into two main categories:
offline processes: data analysis, feature extraction and structuring
online processes: answering client requests
The images are characterized with global signatures implemented in the server:
Generic signatures: the color, shape and texture features investigated in the IMEDIA group.
Specific signatures: face signatures and fingerprint signatures.
Annotations: some keywords.
Besides, two local signatures are included: the region-based description and the point-based one. The server uses image signatures and offers several types of query paradigms, available to the user through the graphical interfaces of the clients:
query by global example: the user selects an entire image as the visual query;
partial queries: the user looks for regions in images that are visually similar to the selected region;
relevance feedback on global and partial queries: the user interacts with the system in a feedback loop, giving positive and negative examples to help the system identify the category of images he or she is interested in;
mental image search: two different methods are investigated. The first is target image search with a relevance feedback model based on mutual information; the second consists in logical query composition.
We have developed two main clients that can communicate with the server. A good starting point for exploring the possibilities offered by IKONA is our web demo, available at http://www-rocq.inria.fr/cgi-bin/imedia/ikona. This CGI client is connected to a running server with several generalist and specific image databases, totalling more than 23,000 images. It features query-by-example searches, a switch-database functionality and relevance feedback for image category searches. The second client is a desktop application, which offers more functionalities. More screenshots describing the visual search capabilities of IKONA are available at http://www-rocq.inria.fr/imedia/cbir-demo.html.
The architecture of this client/server software and several visual signatures have been registered with the APP (the French software protection agency).
Several categories of image descriptors are studied in the IMEDIA group. Some of them, the global descriptors in particular, allow querying large databases (about 500,000 images) in real time on standard hardware. Other descriptors, such as the local descriptors involving points of interest, currently only allow querying small databases (about 3,000 images). Our objective is to scale up the descriptors developed at IMEDIA. For the moment, we focus on the local approaches, which do not yet allow real-time responses for the databases encountered in various applications.
This year, our first contribution was to experimentally assess whether the so-called curse of dimensionality phenomenon occurs for various categories of distributions, at the dimensions encountered with local descriptors (usually between 8 and 30). We compare the performance of a single sphere query when the collection is indexed by a tree structure (an SR-tree in our experiments) to that of a sequential scan. The tested distributions were:
synthetic clustered distributions (DC);
synthetic uniform distributions (DU);
real distributions (DR), obtained by extracting and locally describing points of interest from a set of generalist images.
Some of the experiments are illustrated in the figure, which shows the ratio of the sequential scan CPU time over the CPU time obtained with the SR-tree traversal, versus the number of neighbors, for DR and DC.
Clearly, for the real dataset as well as for the clustered distribution, the curse of dimensionality is not reached. The speed-up for dimension d = 29 is significant even for large numbers of neighbors. In contrast, as reported earlier in the literature, there is no gain with the uniform dataset (not shown here). These trends were confirmed by the study of the ratio of nodes accessed during the tree traversal, which remains small for the DR and DC distributions. All the experiments performed show that when the dimension of the space is moderate, i.e. between 8 and 30, the tree structure indeed performs well: it exhibits a significant gain with respect to the sequential scan even in a 29-dimensional space.
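Since no SR-tree is available in standard libraries, the sketch below reproduces its pruning principle with a one-level ball partition: a cell is visited only if its bounding ball can intersect the query sphere, so a sphere query on clustered data touches only a fraction of the points while returning exactly the sequential-scan result. All sizes and parameters are illustrative:

```python
import numpy as np

def build_cells(data, n_cells=50, seed=0):
    """Group points into cells, each with a centroid and a covering radius
    (a one-level stand-in for the nodes of an SR-tree)."""
    rng = np.random.RandomState(seed)
    centroids = data[rng.choice(len(data), n_cells, replace=False)]
    assign = np.argmin(((data[:, None] - centroids[None]) ** 2).sum(-1), axis=1)
    cells = []
    for c in range(n_cells):
        idx = np.flatnonzero(assign == c)
        if len(idx):
            center = data[idx].mean(0)
            radius = np.linalg.norm(data[idx] - center, axis=1).max()
            cells.append((idx, center, radius))
    return cells

def sphere_query(data, cells, q, r):
    """Return indices of points within distance r of q, pruning whole cells
    whose bounding ball cannot intersect the query sphere."""
    hits, visited = [], 0
    for idx, center, radius in cells:
        if np.linalg.norm(center - q) - radius > r:
            continue                      # the whole cell lies outside the sphere
        visited += len(idx)
        d = np.linalg.norm(data[idx] - q, axis=1)
        hits.extend(idx[d <= r].tolist())
    return sorted(hits), visited

# Clustered data in dimension 16 (the moderate-dimension regime above).
rng = np.random.RandomState(1)
data = np.vstack([rng.randn(200, 16) * 0.3 + rng.randn(16) * 5 for _ in range(20)])
cells = build_cells(data)
q = data[0] + 0.01
hits, visited = sphere_query(data, cells, q, r=1.0)
scan_hits = sorted(np.flatnonzero(np.linalg.norm(data - q, axis=1) <= 1.0).tolist())
```

On such clustered data the query visits only the cells of one blob, mirroring the gain observed for the DC and DR distributions.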
When objects or parts of images are described with a set of local descriptors, searching the feature space is usually done independently and sequentially for each local descriptor. We have studied the multiple-query approaches existing in the database community before applying and adapting them to the retrieval of groups of local descriptors. Two directions have been investigated:
Reduction of the I/O costs, by studying a new approach for searching the multidimensional structure. Many multidimensional structures exist; the one considered is the SR-tree, which achieves good performance for feature spaces based on local descriptors (see the previous section);
Reduction of the CPU costs, by exploiting relations between the distances computed from several query points to points of the feature space, for instance the triangle inequality. The structure considered is again the SR-tree, but the proposed improvements could be applied to any tree structure. We have revisited the two previously proposed lemmas and proposed three novel ones.
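The triangle-inequality idea behind such lemmas can be sketched as follows: once the distances to a first query point are known, they provide free lower bounds on the distances to a nearby second query point, and every point whose bound exceeds the search radius is discarded without a distance computation. This is a deliberately simple, single-lemma version with illustrative data:

```python
import numpy as np

def pruned_second_query(points, q1, q2, radius):
    """Range search around q2 reusing distances already computed for q1.
    By the triangle inequality, d(q2, p) >= |d(q1, p) - d(q1, q2)|, so any
    point whose lower bound exceeds `radius` is pruned for free."""
    d_q1 = np.linalg.norm(points - q1, axis=1)      # paid once, for q1's search
    d_q1q2 = np.linalg.norm(q1 - q2)
    lower = np.abs(d_q1 - d_q1q2)                   # free lower bounds for q2
    candidates = np.flatnonzero(lower <= radius)    # only these need a real distance
    d_q2 = np.linalg.norm(points[candidates] - q2, axis=1)
    return candidates[d_q2 <= radius], len(candidates)

rng = np.random.RandomState(0)
pts = rng.rand(10000, 8)
q1 = rng.rand(8)
q2 = q1 + 0.01                                      # two nearby query descriptors
res, n_computed = pruned_second_query(pts, q1, q2, radius=0.3)
```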
The figure illustrates our approach: it presents the CPU time consumed by similarity search with multiple queries, with and without some of the lemmas. The evaluation was done on a dataset of 1 million real points of interest. The results show that the joint use of these lemmas accelerates the search, whatever the dimension.
All these studies are carried out in collaboration with the CEDRIC/Vertigo research group, in the context of Nouha Bouteldja's thesis at the CEDRIC laboratory, and have been accepted for publication at an international conference.
This work is done in collaboration with the INA (the French National Audiovisual Institute), within the scope of the CIFRE thesis of Julien Law-To. The main application considered is the real-time monitoring of huge video databases (about 100,000 hours). At present, we focus more particularly on content-based copy detection (CBCD) for video.
In order to protect the INA from the piracy of its videos, we have to link the TV broadcast on one side to the video database on the other (monitoring), in order to find identical sequences. The CBCD system must be robust to common transformations used in TV post-production, such as zooming, cropping, shifting, etc.
Local descriptors have proved very useful for image indexing, with applications to object or sub-image retrieval. In the computer vision community, a number of recent techniques have been proposed to identify points of interest or regions of interest in images. When directly applied to image sequences, one of the drawbacks of such descriptors is their spatio-temporal redundancy. For applications like real-time monitoring, it is necessary to build a very compact description. In this context, two kinds of promising approaches can be investigated:
The spatio-temporal information can be exploited jointly, instead of working on the time and spatial dimensions separately. See for example the work of Laptev on space-time interest points;
To be more compact and more significant, the image description should involve new signatures inspired by pre-attentive human vision and the focalization of attention. Such features have been widely studied by the neurophysiological community and can be modeled mathematically. See for example the work of Heidemann on focus-of-attention from local color symmetries.
This year, we have investigated the first class of approaches. Studying the trajectories of the points of interest is an efficient way to avoid temporal redundancy and, moreover, an interesting way to strongly fingerprint the video sequence. By modeling and labeling the different kinds of behaviour of those points according to their trajectories, the CBCD system can be improved. Such a content-based video description is richer and more compact than the usual ones. It is also generic, and thus opens new possibilities for video retrieval, such as:
Specific queries on video sequences, such as retrieving objects with a particular behaviour;
Segmentation of the video based on the behaviour of the points of interest (camera motion, cuts, credits, ...).
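A minimal sketch of trajectory-based behaviour labelling; the two labels and the threshold are illustrative, and the actual system uses a richer behaviour model:

```python
import numpy as np

def label_trajectory(points, motion_thresh=1.0):
    """Label the behaviour of a tracked interest point from its trajectory:
    a point that barely moves over the sequence is tagged 'background',
    the others 'moving'. `points` is a (T, 2) array of (x, y) positions
    over T frames."""
    displacement = np.linalg.norm(points - points[0], axis=1).max()
    return "moving" if displacement > motion_thresh else "background"

still = np.tile([10.0, 20.0], (30, 1))             # a static background point
drift = np.cumsum(np.full((30, 2), 0.2), axis=0)   # a steadily moving point
```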
Based on this description, the current experiments show an improvement in the quality of the monitoring process and in its flexibility. This work has been submitted to the international conference CVPR 2006.
Within the BIOTIM project, we have designed a new shape descriptor called the Directional Fragment Histogram (DFH). This shape descriptor is computed from the outline of the plant, and codes both local and global directional information. Local information is computed from segments taken randomly from the external contour; the accumulation of all possible segments of a specific length provides the global information. Several versions of this descriptor are studied and tested on different image databases. The implemented DFH versions are based respectively on Freeman chain codes, gradient angles, curvatures, and the combination of gradient angles and curvatures. In our evaluation, the following two kinds of image databases are used:
Botanical databases
The Arabidopsis database provided by the INRA Institute. It consists of about 400 images.
The Swedish leaves database, taken from the web site of the Swedish Museum of Natural History. This database is composed of 1125 images and 15 classes; each class represents a tree species.
The Smithsonian leaves database, provided through the collaboration with the NSF project "An Electronic Field Guide: Plant Exploration in the 21st Century". It contains 134 classes comprising 1520 images: see http://www.cfar.umd.edu/~gaaga/leaf/leaf.html.
Generic Objects databases
The MPEG-7 CE-Shape-1 database, taken from the web page of Professor Latecki (http://www.cis.temple.edu/~latecki/research.html). It contains 1400 images and 70 classes.
The Kimia database is composed of three subsets, containing respectively 99, 256 and 1032 silhouette images: see http://www.lems.brown.edu/vision/software/index.html.
The ETH-80 database includes 3280 images classified into 80 objects : see http://www.vision.ethz.ch/projects/categorization/eth80-db.html.
We evaluate the gradient version of the DFH as well as some existing shape descriptors on the Swedish database: the IKONA shape descriptor based on the Hough transform, the Edge Orientation Histogram (EOH), the MPEG-7 CSS descriptor and the CCH descriptor. The figures show the precision and ROC graphs, respectively.
As illustrated in these graphs, our DFH shape descriptor outperforms the others. Note that the CSS-MPEG7 descriptor gives comparable performance; however, its computation time is about 250 times that of the DFH. Both CCH and EOH give lower results because they focus only on the global shape, without taking into account the local distribution of directions. The figure shows "partial query" results on two leaves taken from the Swedish database.
The performance of the different versions of the DFH depends on the content of the shape database. In future work, we intend to improve our DFH descriptor to overcome this limitation.
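The core DFH idea, accumulating a histogram of directions over randomly sampled contour fragments, can be sketched as follows. The chord angle stands in for the chain-code, gradient and curvature variants, and all parameter values are illustrative:

```python
import numpy as np

def directional_fragment_histogram(contour, frag_len=5, n_frags=500, bins=8, seed=0):
    """Toy DFH: accumulate, over randomly chosen contour fragments of a
    fixed length, a histogram of the fragments' directions (here the angle
    of the fragment's chord, taken modulo pi so direction is orientation-free).
    `contour` is an (N, 2) array of ordered points on a closed outline."""
    rng = np.random.RandomState(seed)
    n = len(contour)
    hist = np.zeros(bins)
    for _ in range(n_frags):
        start = rng.randint(n)
        a, b = contour[start], contour[(start + frag_len) % n]   # closed: wrap around
        angle = np.arctan2(b[1] - a[1], b[0] - a[0]) % np.pi
        hist[int(angle / np.pi * bins) % bins] += 1
    return hist / hist.sum()

# A square outline: most chords lie along the two axis directions, so the
# histogram concentrates on the horizontal (bin 0) and vertical (bin 4) bins.
side = np.arange(10, dtype=float)
square = np.vstack([
    np.stack([side, np.zeros(10)], axis=1),              # bottom edge
    np.stack([np.full(10, 10.0), side], axis=1),         # right edge
    np.stack([10.0 - side, np.full(10, 10.0)], axis=1),  # top edge
    np.stack([np.zeros(10), 10.0 - side], axis=1),       # left edge
])
hist = directional_fragment_histogram(square)
```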
The development and application of various remote sensing platforms result in the production of huge amounts of satellite image data. Therefore, there is an increasing need for effective querying and browsing in these image databases. Region-Based Image Retrieval (RBIR) is a powerful tool, since it allows searching for images containing objects similar to those of a reference image. It requires the satellite image to be segmented into a number of regions. Segmentation consists in partitioning the image into non-overlapping regions that are homogeneous with regard to some characteristics, such as spectral response and texture. Remotely sensed images contain both textured and non-textured regions. This is even more true today with high-resolution images such as IKONOS, SPOT-5 and QuickBird data. In order to cope with this content heterogeneity, we propose an adaptive variational segmentation algorithm.
We have implemented and compared the efficiency of several existing 2D/3D shape descriptors, i.e. descriptors built from 2D views of 3D models. We have improved the efficiency of the descriptors based on silhouettes or depth-buffer images. The 2D shape signatures are extracted from classical Fourier Transform (1DFFT for silhouettes and 2DFFT for depth-buffer images). We have introduced a new approach based on relevance index, which takes into account the diversity of information contained in the projection images. It was evaluated on the Princeton database (907 models) and these retrieval results (cf. Figure ) show its performance and robustness in the 3D-model retrieval process. This work is described in and is supported by the European Network of Excellence DELOS II within "Description, Matching and Retrieval by Content of 3D Objects" Project (Task 3.8 of JPA2).
There is an increasing variety of content-based image retrieval scenarios involving the use of local descriptors (object class recognition, object and scene recognition, content-based copy detection). Enhancing the performance of these scenarios by using the geometric distribution or the relative positions of the local descriptors is an active research area. During this year, we have shown that in the copy detection scenario, the robust estimation of a global geometric transformation model after the search greatly improves the discrimination of the detection (paper submitted to IEEE Transactions on Multimedia). However, for other scenarios, using the geometry remains a challenging task: including the geometric distribution in the descriptor itself often leads to a lack of robustness during the search for similar local descriptors, whereas post-processing techniques are generally highly time-consuming and thus limited to very small datasets. Moreover, in most of them, the geometric consistency is limited to rigid transformation models, which cannot enforce the matching when two geometric distributions are dependent but not linearly related. For the past few months, we have been investigating the use of non-parametric geometric consistency measurements, such as mutual information and the robust correlation ratio, and we plan to combine them with robust local geometric properties that could be included in the descriptor itself, in order to limit the number of matches during the second step.
This work was done in collaboration with the NII (National Institute of Informatics, Japan) within the scope of Alexis Joly's visit to the NII (July 2005). Local features are well suited to content-based image retrieval because of their locality, their local uniqueness and their high information content. However, as they are selected only according to the local information content of the image, there is no guarantee that they will be distinctive in a large set of images. A local feature corresponding to a high saliency in the image can be highly redundant in some specific databases, such as the TV news database stored at the NII, in which textual characters are extremely frequent. To overcome this issue, we propose to select relevant local features directly according to their discrimination power in a specific set of images. By computing the density of the local features in a source database with a new fast non-parametric density estimation technique, it is indeed possible to quickly select the most rare local features in a large set of images. The figure illustrates the difference between the 20 most salient points of an image and the 20 most rare points according to their density in a large image database.
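The selection principle can be sketched with a plain k-nearest-neighbor density estimate, a simple stand-in for the fast estimator developed with the NII: descriptors whose k-th neighbor is far away lie in low-density regions and are therefore rare. The data here are synthetic:

```python
import numpy as np

def rarest_features(descriptors, k=10, n_select=20):
    """Estimate each descriptor's density by the distance to its k-th
    nearest neighbor (large distance = low density = rare) and return
    the indices of the n_select rarest descriptors."""
    d = np.linalg.norm(descriptors[:, None, :] - descriptors[None], axis=2)
    np.fill_diagonal(d, np.inf)                 # ignore self-distances
    kth = np.sort(d, axis=1)[:, k - 1]          # distance to the k-th neighbor
    return np.argsort(kth)[::-1][:n_select]     # largest k-NN distance first

# A dense, redundant cluster of descriptors plus a few isolated ones:
# the isolated descriptors (indices 100..104) are selected as the rarest.
rng = np.random.RandomState(0)
dense = rng.randn(100, 8) * 0.1
rare = rng.randn(5, 8) + 10.0
X = np.vstack([dense, rare])
sel = rarest_features(X, k=3, n_select=5)
```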
Query by Visual Thesaurus is a new query alternative that overcomes the absence of a starting example image and offers the possibility to combine multiple visual patches in order to retrieve the target mental image. The visual thesaurus is obtained by means of region categorization into visual patches that serve as representatives of the database regions.
We introduce a novel semantic region labelling criterion based on the dispersion of points of interest. This point-based coherence criterion (PCC) labels regions through the topological and spatial dispersion of points of interest. This dual region/point description strengthens the "page zero" construction process and is not computationally expensive, since PCC is evaluated on single coarse regions using the Harris color point detector, which captures the local photometric variability on small sites (a few pixels). The use of points of interest was motivated by their ability to finely describe coarsely segmented regions. Thus our approach tends to add semantic knowledge to low-level features by labelling rough regions into homogeneous and textured classes. The novelty of our approach is the unsupervised region categorization and the construction of a visual region summary that handles both photometric attributes and the point-based description.
Gouet-Brunet and Boujemaa proposed a color version of the Harris detector for image description. The idea of the Harris detector is that interest points are local maxima of a measure derived from the second moment matrix \mu(x, \sigma_I, \sigma_D), which summarizes the gradient distribution in a local neighborhood of a point x:

\mu(x, \sigma_I, \sigma_D) = \sigma_D^2 \, g(\sigma_I) * \sum_{i \in \{R,G,B\}} \begin{pmatrix} L_{i,x}^2(x, \sigma_D) & L_{i,x} L_{i,y}(x, \sigma_D) \\ L_{i,x} L_{i,y}(x, \sigma_D) & L_{i,y}^2(x, \sigma_D) \end{pmatrix}

where \sigma_I is the integration scale, \sigma_D the derivation scale, g the Gaussian, and L_i the image smoothed by a Gaussian for each channel i \in \{R, G, B\}.
As far as points of interest are concerned, a homogeneous region is likely to contain fewer points than a textured one, because points capture the local photometric variability around a very small site. As a matter of fact, presumably homogeneous patches should be well described by means of classical color moments, whereas textured or non-homogeneous regions are better characterized by point attributes.
Let P = \{P_k, k \in \{1, ..., n\}\} be the set of points of interest detected on the region R_j. We superimpose a grid Cell = \{Cell(i, j)\} on the region.
The idea of PCC is to separate the effective points contributing to the description of a texture from those that are only due to coarse segmentation. To do so, we build a point histogram whose bin (i, j) counts the interest points located in the (i, j)-th cell:

H(i, j) = \mathrm{card}\,\{P_k \in P : P_k \in Cell(i, j)\}
Once all points have been visited, the histogram is binarized, i.e. all filled cells are switched on, while empty ones are switched off (Figure ). Each cell is visited once, so that we keep only the cells containing at least one point of interest (effective cells).
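The grid histogram and its binarization can be sketched as follows (a minimal illustration; the cell geometry and the way the occupancy feeds the final homogeneous/textured decision are simplified assumptions):

```python
import numpy as np

def pcc_occupancy(points, bbox, n_cells=4):
    """Binarized point histogram over an n_cells x n_cells grid.
    points: (n, 2) interest-point coordinates inside the region,
    bbox: (xmin, ymin, xmax, ymax) bounding box of the region.
    Returns the boolean occupancy grid and the fraction of effective
    (non-empty) cells, used here as a rough texture indicator."""
    xmin, ymin, xmax, ymax = bbox
    ix = np.clip(((points[:, 0] - xmin) / (xmax - xmin) * n_cells).astype(int),
                 0, n_cells - 1)
    iy = np.clip(((points[:, 1] - ymin) / (ymax - ymin) * n_cells).astype(int),
                 0, n_cells - 1)
    hist = np.zeros((n_cells, n_cells), dtype=int)
    np.add.at(hist, (iy, ix), 1)       # point histogram: bin (i, j) counts
    occupancy = hist > 0               # binarization: switched-on cells
    return occupancy, occupancy.mean()
```

A textured region scatters its interest points across many cells, whereas a coarsely over-segmented homogeneous region concentrates them in few cells.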
For the visual thesaurus construction, we started with region clustering using mean color descriptors. This first level of classification brings together all visually similar regions into a set of representatives. This step is followed by PCC computation and labelling for all classes. PCC discriminates well between different region structures, labelling them either homogeneous or textured. Figure shows, for example, three classes obtained by mean color classification; each class is divided into two classes using PCC: homogeneous and textured.
Many of the available image databases have keyword annotations associated with the images. As keywords and visual features provide complementary information, using both sources of information is an advantage in many applications. We address the challenge of semantic gap reduction using a hybrid visual and conceptual representation of the content within an active relevance feedback context. In , we introduce a new feature vector, based on the keyword annotations available for the images, which makes use of conceptual information extracted from an external lexical database.
We rely on an external ontology, defining semantic relations between concepts, to find good candidates for the core concepts and to define the feature vectors for sets of keywords. WordNet is a well-known general-purpose ontology that organizes nouns, verbs, adjectives and adverbs into synsets (sets of words having similar meanings), each representing one underlying lexical concept. The concepts are linked by semantic relations of various types, such as synonymy, hypernymy and hyponymy.
The core concepts we need for building the conceptual feature vectors should allow us to evaluate the conceptual similarity between keywords w that are mapped to different concepts c(w) in the ontology. We must then rely on the hypernymy/hyponymy subgraph in WordNet linking the concepts associated with all the keywords in the database to the most generic concepts. For every concept corresponding to a keyword annotating an image, we find all the paths in the ontology that lead to the most generic concepts. The paths obtained for all the keywords in the database define the hypernym graph. A small set (compared to the number of different keywords) of core concepts is then selected; good candidates are super-concepts of several c(w) concepts that are relatively close to them; also, the core concepts must be balanced among all the branches containing c(w) concepts.
For any image in the database, we project the keywords associated with it onto the set of core concepts, through the use of several semantic similarity functions. The resulting feature vectors have the advantage of being compact, comprehensible (each dimension corresponds to a core concept) and easy to integrate in any CBIR system. Experimental evidence shows that the joint use of the visual descriptor and the new conceptual descriptor dramatically improves the quality of the results, both in a Query By Example (QBE) context and with Relevance Feedback (RF), as illustrated in the following figures. A detailed presentation of our method can be found in . Our relevance feedback framework was described in .
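A toy illustration of projecting keywords onto core concepts follows; a hand-written mini-ontology with invented names stands in for WordNet, and the path-based similarity is only one simple choice among the several semantic similarity functions mentioned above:

```python
# hypothetical mini-ontology: child -> parent (hypernym) links
HYPERNYMS = {
    "dog": "canine", "wolf": "canine", "canine": "animal",
    "cat": "feline", "feline": "animal", "animal": "entity",
    "oak": "tree", "pine": "tree", "tree": "plant", "plant": "entity",
}
CORE_CONCEPTS = ["animal", "plant"]

def hypernym_path(word):
    """Path from a word up to the most generic concept."""
    path = [word]
    while path[-1] in HYPERNYMS:
        path.append(HYPERNYMS[path[-1]])
    return path

def similarity(word, concept):
    """Path-based similarity: 1 / (1 + #edges from word up to concept),
    0 if the concept is not a hypernym of the word."""
    path = hypernym_path(word)
    return 1.0 / (1 + path.index(concept)) if concept in path else 0.0

def conceptual_vector(keywords):
    """One dimension per core concept: best similarity over the keywords."""
    return [max(similarity(w, c) for w in keywords) for c in CORE_CONCEPTS]
```

The resulting vector is compact and interpretable: each dimension says how strongly the image's annotation relates to one core concept.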
We have pursued the work performed with Marie-Luce Viaud from INA (Institut National de l'Audiovisuel) within a collaboration agreement between INRIA and INA: in order to propose a graphic interface allowing the user to elaborate new search strategies in a database, as presented in , a tool that automatically computes an Euler-like diagrammatic representation of the result is under construction. This year, we have implemented a system that automatically draws a planar extended Euler diagram from any configuration involving fewer than nine sets. Each set corresponds to the database documents having a given term in their associated description. Our results have been presented at Euler Diagrams 2005, a workshop co-organized with INA (cf. examples in figure ).
Content-based image retrieval (CBIR) can be much improved by providing a relevant organization of image collections. While it is easy to apply standard unsupervised clustering algorithms to the descriptors of the images in a database, the results of this fully automatic categorization are rarely satisfactory. The problem we are interested in can be stated as follows: simple semantic information, in the form of pairwise (must-link or cannot-link) constraints between data items or class labels for some items, is available, and the structure of the entire data set has to be discovered with respect to this semantic knowledge.
We assume that users can easily evaluate whether two images should be in the same category or rather in different categories, so they can easily define the constraints mentioned above. Following previous work by Demiriz et al. , Wagstaff et al. and Basu et al. , in we introduced Pairwise-Constrained Competitive Agglomeration (PCCA), a fuzzy semi-supervised clustering algorithm, recalled in the next paragraph. In the original version of PCCA we did not make further assumptions regarding the data, so the pairs of items for which the user is required to define constraints are randomly selected. But in many cases, such assumptions regarding the data are available. We argue in that quite general assumptions let us perform a more adequate, active selection of the pairs of items and thus significantly reduce the number of constraints required for achieving a desired level of performance.
In , we proposed an algorithm named PCCA that belongs to the family of search-based semi-supervised clustering methods. It is based on the Competitive Agglomeration (CA) algorithm , a fuzzy partitional algorithm that does not require the user to specify the number of clusters to be found. Using the same notations as in , PCCA minimizes the objective function

J(V, U) = \sum_{i=1}^{C} \sum_{r=1}^{N} u_{ri}^2 \, d^2(x_r, \mu_i) + \alpha \Big( \sum_{(x_r, x_s) \in \mathcal{M}} \sum_{i=1}^{C} \sum_{j=1, j \neq i}^{C} u_{ri} u_{sj} + \sum_{(x_r, x_s) \in \mathcal{C}} \sum_{i=1}^{C} u_{ri} u_{si} \Big) - \beta \sum_{i=1}^{C} \Big[ \sum_{r=1}^{N} u_{ri} \Big]^2

under the constraint \sum_{i=1}^{C} u_{ri} = 1 for every item r, where \mathcal{M} and \mathcal{C} denote the sets of must-link and cannot-link constraints, u_{ri} the membership of item x_r to cluster i, and \mu_i the cluster prototypes.
It can be shown (see ) that the equation for updating the memberships takes the form

u_{rs} = u_{rs}^{FCM} + u_{rs}^{Constraints} + u_{rs}^{Bias}

The first term, u_{rs}^{FCM}, comes from FCM and only focuses on the distances between data items and prototypes. The second term, u_{rs}^{Constraints}, takes the supervision into account: memberships are reinforced or deprecated according to the pairwise constraints given by the user. The third term, u_{rs}^{Bias}, leads to a reduction of the cardinality of spurious clusters, which are discarded when their cardinality drops below a threshold.
To make this semi-supervised clustering approach attractive for the user, it is important to minimize the number of constraints he has to provide for reaching some given level of quality. This can be done by asking the user to define must-link or cannot-link constraints for the pairs of data items that are expected to have the strongest corrective effect on the clustering algorithm (i.e. that are maximally informative).
When using PCCA we consider that the similarities between data items provide relatively reliable information regarding the target categorization and that the constraints only help in finding the most relevant clusters. There is then little uncertainty in identifying well-separated compact clusters. To be maximally informative, the supervision effort (i.e. the constraints) should rather be spent on defining those clusters that are neither compact nor well separated from their neighbours. One can note that this is consistent with the findings in regarding unsupervised clustering.
Following these remarks, we consider that the least well-defined cluster at iteration t is the one with the smallest density at that iteration.
When the least well-defined cluster is found, we need to identify the data items near its boundary. In the fuzzy setting, one can consider that a data item represented by the vector x_r is assigned to cluster s if u_{rs} is the highest among its membership degrees. The data items at the boundary are those having the lowest membership values to this cluster among all the items assigned to it.
As already mentioned, in real configurations there is very often no sharp boundary between clusters, so a fuzzy partition is often better suited for the determination of membership-based boundaries. To this end, after finding the least well-defined cluster with the above criterion, we consider a virtual boundary that is only defined by a membership threshold and will usually be larger than the true one (this is why we call it the "extended" boundary). The items whose membership values are closest to this threshold are considered to be on the boundary, and the user is asked to provide constraints directly between these items. We should note that the extended boundary will probably contain ambiguous points shared with nearby clusters.
Non-redundancy is complementary to ambiguity in maximizing the amount of information provided by the constraints. The non-redundancy criterion iteratively chooses, from an augmented set S of feature points lying on the boundary of the least well-defined cluster, the vectors x_j that maximize the lowest of the distances d(x_i, x_j) over all the items x_i already included in the selected non-redundant set. This can be expressed as:

x_j = \arg\max_{x \in S} \min_{x_i} d(x_i, x)

where S is the augmented set of points and the x_i are the already chosen points.
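This non-redundancy criterion amounts to a greedy farthest-first (max-min) traversal of the boundary points; a minimal sketch, with the starting point chosen arbitrarily:

```python
import numpy as np

def select_non_redundant(boundary, n_select, start=0):
    """Greedy max-min selection: from the boundary points of the least
    well-defined cluster, repeatedly pick the point whose smallest distance
    to the already selected points is largest.
    boundary: (n, d) array; returns indices of the selected points."""
    chosen = [start]
    # minimum distance from every point to the current selected set
    d_min = np.linalg.norm(boundary - boundary[start], axis=1)
    while len(chosen) < n_select:
        nxt = int(np.argmax(d_min))  # x_j maximizing min_i d(x_i, x_j)
        chosen.append(nxt)
        d_min = np.minimum(d_min,
                           np.linalg.norm(boundary - boundary[nxt], axis=1))
    return chosen
```

Each newly selected point is as far as possible from all previously selected ones, so the user is never asked to constrain two nearly identical pairs.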
In the following, we use the name AFCC (Active Fuzzy Competitive Clustering) for the resulting algorithm.
We compared AFCC to the original PCCA, to the CA algorithm (unsupervised clustering) and to PCKmeans (semi-supervised clustering).
The comparison shown here is performed on a ground-truth database composed of images of different phenotypes of Arabidopsis thaliana, corresponding to slightly different genotypes. This scientific image database comes from studies of gene expression. A sample of the images is shown in .
Figure presents the dependence between the percentage of well-categorized data points and the number of pairwise constraints considered. We provide as a reference the graphs for the CA algorithm and for K-means, both ignoring the constraints (unsupervised learning). The correct number of classes was directly provided to K-means and PCKmeans. CA, PCCA and AFCC were initialized with a significantly larger number of classes and found the appropriate number by themselves.
These experimental results clearly show that the user can significantly improve the quality of the categories obtained by providing a simple form of supervision, the pairwise constraints. With a similar number of constraints, PCCA performs significantly better than PCKmeans by making a better use of the available constraints. The fuzzy clustering process directly takes into account the pairwise constraints thanks to the signed constraint terms in the equation for updating the memberships. The active selection of constraints (AFCC) further reduces the number of constraints required for reaching such an improvement. The number of constraints becomes very low with respect to the number of items in the dataset.
Object classification with images represented by sets of local features is a challenging task for kernel-based methods. We introduce a new kernel which operates on sets of vectors, named the intermediate matching kernel , . It is based on a new and flexible matching approach which improves results over the GCS kernel . The matching procedure is guided by an intermediate set of vectors P = \{p_1, ..., p_n\}. Indeed, an explicit mapping m_p depending on a vector p defines the core of the proposed kernel. One possible choice, but not the only one, for m_p is to associate with each element B of the input space X (a set of vectors from R^d) the vector of B nearest to p:

m_p(B) = \arg\min_{x \in B} \|x - p\|

This mapping is applied to B_1 and B_2, the two sets of vectors to be compared; the matched pair of vectors is then compared using any positive definite kernel k operating on vectors, such as the RBF kernel. The sequence of parametric mappings m_{p_1}, ..., m_{p_n} is applied separately to the two sets of vectors, and the intermediate matching kernel is finally obtained as a sum over the matched pairs. More formally, for any two sets of vectors B_1 and B_2, the intermediate matching kernel can be defined as:

K(B_1, B_2) = \sum_{i=1}^{n} k\big(m_{p_i}(B_1), m_{p_i}(B_2)\big)

By construction, K is positive definite. An important fact to notice is that the positive definiteness of the intermediate matching kernel is ensured when the matching set P is chosen independently from B_1 and B_2. We define P as the set of class centers obtained by clustering all vectors of the training sample; however, other approaches can be used.
Local jets:
| Kernels | valid. error (%) | test error (%) |
| Matching kernel (loser-take-nothing) | 19.66 ± 0.20 | 19.66 ± 0.40 |
| Matching kernel (winner-take-all) | 13.46 ± 0.87 | 13.1 ± 0.69 |
| GCS-based kernel | 8.93 ± 0.53 | 9.33 ± 1.00 |
| Intermediate matching kernel | 8.33 ± 0.54 | 8.93 ± 0.16 |
Table summarizes performance comparisons with other kernels that operate on sets of vectors. The task is image classification, with images represented by sets of local jets around interest points. The intermediate matching kernel yielded the best results, with a validation error of 8.33% and a test error of 8.93%. In this experiment, the size of the matching set is 40.
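A numpy sketch of the intermediate matching kernel, with an RBF base kernel and a hypothetical matching set P (in practice, e.g. k-means centers of the training descriptors; the `gamma` value is an illustrative assumption):

```python
import numpy as np

def rbf(x, y, gamma=0.5):
    """RBF base kernel on individual vectors."""
    return np.exp(-gamma * np.sum((x - y) ** 2))

def intermediate_matching_kernel(B1, B2, P, base_kernel=rbf):
    """K(B1, B2) = sum_i k(m_{p_i}(B1), m_{p_i}(B2)), where m_p(B) is the
    vector of B nearest to p. B1, B2: (n1, d), (n2, d) sets of local
    descriptors; P: (m, d) matching set chosen independently of B1 and B2
    (which preserves positive definiteness)."""
    total = 0.0
    for p in P:
        x = B1[np.argmin(((B1 - p) ** 2).sum(1))]  # m_p(B1)
        y = B2[np.argmin(((B2 - p) ** 2).sum(1))]  # m_p(B2)
        total += base_kernel(x, y)
    return total
```

The kernel value can then be fed to any kernel machine (e.g. an SVM) exactly like a vector kernel, even though the inputs are variable-size sets of descriptors.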
Tuning hyper-parameters is a necessary step to improve the performance of learning algorithms. For SVMs, adjusting the kernel parameters may improve results drastically. Parameter tuning is usually performed by cross-validation, sweeping the parameter space. When the number of parameters increases, as in the case of multiple kernel parameters, the complexity of such a grid search is exponential in the number of optimized parameters. The gradient descent approach introduced in significantly reduces the search for the optimal parameters. We define the LCCP (Log Convex Concave Procedure) , derived from the CCCP (Convex ConCave Procedure) optimization framework , for optimizing kernel parameters by minimizing the radius-margin bound. To apply the LCCP, we prove, for a suitable choice of kernels, that the radius is log-convex and the margin is log-concave . The LCCP is more efficient than the gradient descent technique since it ensures that the minimized criterion decreases monotonically and converges to a local minimum without a step-size search. We apply the LCCP to the optimization of the parameters \theta_i in the following kernel:

K_\theta(x, y) = -\sum_i \theta_i |x_i - y_i|

which is an extension of the L1-distance kernel:

K_{L1}(x, y) = -\sum_i |x_i - y_i|

We also recall the L2-distance kernel:

K_{L2}(x, y) = -\|x - y\|^2

The above kernels can be shown to be conditionally positive definite , .
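The kernels themselves are straightforward to write down; the sketch below assumes non-negative per-dimension weights \theta_i (the quantities tuned by the LCCP):

```python
import numpy as np

def weighted_l1_kernel(x, y, theta):
    """K_theta(x, y) = -sum_i theta_i * |x_i - y_i|, the multiple-parameter
    extension of the L1-distance kernel (theta_i = 1 recovers it)."""
    return -np.sum(theta * np.abs(x - y))

def l2_distance_kernel(x, y):
    """K(x, y) = -||x - y||^2 (conditionally positive definite)."""
    return -np.sum((x - y) ** 2)
```

With one \theta_i per feature dimension, a grid search over all weights would be exponential in the dimension, which is what motivates the LCCP optimization above.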
Table summarizes the average test errors for different data sets and clearly shows the interest of multiple kernel parameters, despite a preliminary weighting of the databases.
There is growing interest in shape recognition, driven by the increasing need for retrieval applications such as finding leaf species. In this work, we introduced a new shape description method targeted at 2D shape contours. A contour is a one-dimensional manifold parameterized by its curvilinear abscissa s; we consider a uniform sampling of the contour points according to s.
It is shown in our work , that the triangular kernel, defined as k(x, y) = -\|x - y\|^p with p \in \,]0, 2[, achieves similarity invariance, and in particular scale invariance, for many kernel methods including the support vector machine and kernel principal component analysis (KPCA). Given a shape contour, the top d eigenvalues of KPCA on the contour points are used as its shape description. These eigenvalues are rotation and translation invariant when using the Gaussian kernel, and can be normalized to be also scale invariant when using the triangular kernel. Notice that computing this description does not require sampling the curve according to an ordered curvilinear abscissa, which might be unavailable and difficult to find, mainly for complex contours.
This shape description is also scale invariant when using the linear kernel; nevertheless, the dimension of the underlying eigenspace then does not exceed two: we have at most two non-null eigenvalues when solving KPCA with the linear kernel.
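A minimal sketch of the eigenvalue-based descriptor (plain numpy KPCA on the contour points with the triangular kernel; the normalization by the leading eigenvalue is a simplified stand-in for the scale normalization discussed above):

```python
import numpy as np

def kpca_shape_descriptor(contour, d=5, p=1.9):
    """Top-d eigenvalues of centered KPCA on 2-D contour points with the
    triangular kernel k(x, y) = -||x - y||^p, p in ]0, 2[.
    contour: (n, 2) array of contour points (no ordering required)."""
    diff = contour[:, None, :] - contour[None, :, :]
    K = -np.linalg.norm(diff, axis=2) ** p       # triangular kernel matrix
    n = len(contour)
    H = np.eye(n) - np.ones((n, n)) / n          # centering matrix
    Kc = H @ K @ H                               # centered Gram matrix
    eigvals = np.sort(np.linalg.eigvalsh(Kc))[::-1][:d]
    return eigvals / eigvals[0]                  # ratios are scale invariant
```

Distances are preserved by rotation and translation, and scaling by a factor s multiplies every eigenvalue by s^p, so the eigenvalue ratios are invariant to the full similarity group.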
We ran our experiments on the Smithsonian and Swedish databases, consisting respectively of 1525 and 1125 images of leaves. The Smithsonian dataset has 135 categories with different cardinalities, while the Swedish set has 15 classes, each containing 75 images. Notice that only the external contours are used for shape description, even though this is not sufficient to make the "best" predictions, as color and internal structures are also prominent.
Figure ( ) shows the precision-recall curves for different values of the parameter p of the triangular kernel. The precision reaches its highest value for p = 1.9 and drops dramatically when p is close to its upper bound (i.e., p = 2). Figure ( ) shows the precision-recall on the Swedish dataset for d = 100 dimensions using our shape description. Comparisons with representative state-of-the-art work are reported using linear PCA (d = 2) and other shape descriptors including the edge orientation histogram (EOH) , Hough and Curvature Scale Space (CSS) . When the recall is less than 30%, the precision of KPCA is better than that of the other descriptors, including CSS.
This year, the Ikona/Maestro software has been fully re-engineered to improve its modularity. Most of the changes concern the internal architecture of the server part. In this new version, the interfaces of all the main components have been unified. These changes will ease the integration of new descriptors, new query paradigms and new functionalities.
In the aceMedia European integrated project, IMEDIA is in charge of the development of the "Intelligent Search and Retrieval" application module. This module brings together the software of four research teams that work on different multimedia information retrieval paradigms. The first version was delivered in June and has been integrated in a global client-side application: the PCS User Interface. Further improvements of this module, as well as its integration in a web-based application, are planned for next year.
Co-supervision of a PhD thesis within the CIFRE framework. The main topic is optimal fine-grained visual signatures for the monitoring of INA video collections.
The partners of this project are the IMEDIA and ATOLL teams of INRIA Rocquencourt, the CEDRIC laboratory of the CNAM Paris, the LIFO laboratory of the University of Orléans, the Institute of Research for Development (IRD) and the National Institute for Research in Agriculture (INRA). BIOTIM is coordinated by IMEDIA. The project is financially supported by the French National Science Fund (FNS).
This project concerns the conception and development of content description methods for content-based indexing and retrieval of aerial and satellite images. This work is done jointly with the ARIANA project team (Sophia Antipolis), ENST-Paris (CNRS) and the URISA research team from Sup'Com (School of Engineering, Tunis). One of the objectives is to make the connection with symbolic and semantic feature queries in the context of satellite image repositories.
This project is part of the IMVN (image, video and digital life) competitiveness cluster in the Île-de-France region. It aims to develop a framework for an advanced multimedia search engine. The main partner is Thales.
"Integrating knowledge, semantics and content for user-centred intelligent media services" in the 6th Framework Program. The consortium of this project is composed of 15 industrial and academic European partners (Alinari, Belgavox, DCU, France Telecom, Fraunhofer, INRIA, ITI, Motorola, Philips, QMUL, Telefonica, Thomson, UAM, UKarlsruhe).
"Multimedia Understanding through Semantics, Computation and Learning" in the 6th Framework Programme. This network of excellence is composed of 42 European academic institutions. Nozha Boujemaa chairs the Workpackage "Single Media Processing" and is deputy scientific coordinator of the network.
"Network of excellence on Digital Libraries" in the 6th Framework Programme. This network of excellence is composed of 44 European academic institutions for the period 2004-2007.
ViMining is an associated research team composed of IMEDIA group and the team of Pr. Shin'ichi Satoh from the National Institute of Informatics (NII), Japan. The major topics of common interest are : detection and description of semantic video events; organisation of the feature space; cross-media indexing and retrieval.
For more information, see http://www-rocq.inria.fr/imedia/vimining/index.html and http://www-direction.inria.fr/international/EQUIPES_ASSOCIEES/index.eng.htm
This project involves the URISA research team from the Sup'Com school of engineering in Tunis. It aims at developing unsupervised classification methods in order to segment satellite images and organize visual database indexes.
Information about past and on-going projects is also detailed at http://www-rocq.inria.fr/imedia/projects.html.
Keynote speaker for the international conference ESA-EUSC'05: Image Information Mining - Theory and Application to Earth Observation ;
Scientific expert for the European commission, invited to the consultation workshop "Challenges of Future Search Engines" held at Brussels during September 2005 for the preparation of FP7 calls ;
Scientific expert for NWO Research agency (Netherlands - CATCH projects) ;
Organizer of a special session "Machine Learning for Visual Information Retrieval" for the 7th ACM SIGMM International Workshop on Multimedia Information Retrieval (ACM-MIR'05) ;
"Tutorials Chair" for ECDL'05, as well as member of several technical program committees, among them ICME'05 and MIR'05 ;
Deputy scientific coordinator of the Muscle NoE (Network of Excellence, FP6), member of the steering committee and scientific coordinator of WP5 "Single Modality". Co-organizer (with Valerie Gouet) of several scientific meetings for WP5 ;
In charge of international relations at INRIA Rocquencourt and member of the "Bureau du Comité des Projets" until October 2005 ;
Since September 2005, member of the "National Evaluation Commission" of INRIA.
PhD Jury member :
Chairman : INSA Lyon.
Reviewer : Ecole Centrale Lyon, Université de Nice, Université De Reims.
HDR Jury member :
Reviewer : Antoine Tabbone - Université de Nancy.
Member of the scientific commission (section CNU 27) of CNAM ;
Member of the CNAM/CEDRIC laboratory council ;
Scientific expert for the French research program ARA "Masses of Data" ;
Deputy scientific leader of the work package WP5 in the Muscle Network of Excellence ;
Leader of the WP5/task 2 in the Muscle Network of Excellence ;
Co-organiser (with Nozha Boujemaa) of the WP5 Third Scientific meeting in the Muscle Network of Excellence, 27-29 April 2005, Paris ;
Co-organiser (with Nozha Boujemaa) of the WP5 First focus meeting in the Muscle Network of Excellence, 1-2 December 2005, Rocquencourt ;
Conference program committee member of the International Workshop on Computer Vision meets Databases (CVDB'05), in conjunction with ACM Sigmod, June 17, 2005, Baltimore, MD, USA ;
Jury member of several engineer diplomas at CNAM ;
Reviewer for conference papers : ICME'05, CVDB'05, Acivs'05 ;
Reviewer for a journal paper : IVC.
Active member of the JPSearch ad-hoc group (ISO/IEC JTC 1/SC 29/WG 1 - Coding of still pictures) that aims to produce standards to facilitate management, search, and retrieval of content in the form of still pictures.
Journal Reviewer : Journal of Mathematical Imaging and Vision ;
Jury member :
PhD of Sabri Boughorbel, Paris Orsay University, July 13, 2005.
Member of the program committee of the 11th International Conference on Computer Analysis of Images and Patterns (CAIP'2005). Member of the reading committee of the Colloque du Groupement de Recherche en Traitement du Signal et de l'Image (GRETSI'2005).
AFIG President (Association Française d'informatique Graphique) ;
Member of the Executive Committee (Conseil d'Administration) of the French chapter of Eurographics ;
Co-organizer of Euler Diagrams 2005 workshop ( http://www-rocq.inria.fr/imedia/euler2005/) with Marie-Luce Viaud of INA.
30 hours of lab sessions (TP) on Java, 1st year, IUT Vélizy, May 2005.
192 hours in the Computer Science Department of CNAM;
National responsible for the course "Computer Vision" of the Master research STIC - Computer Science of CNAM (6 ECTS - 60 hours);
In charge of the course "Image indexing and retrieval" in the master SAR of Paris 6 (7,5 hours).
6-hour course on "Object Detection", Master STIC, CNAM, November 2005.
24 hours of lab sessions (TP) on "Computer Vision", Master Sciences de l'ingénieur, Pierre and Marie Curie University, October 2005.
3-hour course on "Object Detection", Master Sciences de l'ingénieur, Pierre and Marie Curie University, November 2005.
9-hour course on Computational Geometry in the "Computer Vision" option of the last year of the engineering degree at the ENSTA school (École Nationale Supérieure de Techniques Avancées, Paris).