TEXMEX is a joint project with CNRS and University of Rennes 1. The team was created on January 1st, 2002 and became an INRIA project on November 1st, 2002.
The explosion in the quantity of digital documents raises the problem of managing these documents. Beyond the storage problem, we are interested in problems related to the management of their content: how to exploit large document bases, how to classify the documents, how to index them so that they can be searched, how to visualize their content? To address these problems, we propose a multidisciplinary effort gathering within the same team specialists of the various media (image, video, text) and specialists of the techniques used to exploit the data and metadata extracted from them (databases, statistics, information retrieval). Our work lies at the intersection of these fields and focuses more particularly on three points: searching in large image databases, adding semantics to search engines, and coupling media for multimedia document description.
The exploitation of the contents of large databases of digital multimedia documents is a problem with multiple facets, and the construction of a system exploiting such a database calls upon many techniques: study and description of documents, organization of the bases, search algorithms, classification, visualization, but also adapted management of the primary and secondary memories, interfaces and interaction with the user.
The five major challenges of the field appear to us to be the following:
it is necessary, first of all, to be able to handle large sets of documents: it is important to develop techniques which scale gracefully with the number of documents they handle, and to evaluate their results in terms of both quality and speed;
multimedia documents are not a simple juxtaposition of independent media, and it is important to better exploit the links between the various media present in the same document;
multimedia document databases are evolutionary: the sets of documents evolve, as do the document description techniques and the modes of interrogation, which in turn modifies the way the bases are used;
while users' needs evolve towards queries of a semantic nature, description techniques only have access to the syntax of the documents; it is thus necessary to find means of reducing this gap between semantic needs and syntactic description tools;
the user-system interaction is a central point: the user must be able to express his needs efficiently, simply but with nuances, to guide the system and to evaluate the results; he must remain the one who controls the system.
We have adopted a matrix organization. On the one hand, we have competencies in two main fields, automatic document description and the exploitation of these descriptions; on the other hand, we have defined three transversal research topics. The main idea is to concentrate on the questions where the multidisciplinarity of the team appears to be an asset for obtaining original results. Most graduate students of the team work at the intersection of two domains and are thus advised by two persons.
organization and management of the bases, logical and temporal consistency, selection and strategies for computing descriptors and metadata; statistical techniques for the exploration of large volumes of data; indexing techniques aiming at confining the exploitation of the data to the smallest possible volume, thus avoiding an exhaustive scan whose cost is certainly controlled but crippling; system problems related to the physical organization of large volumes of data, such as disk access or cache memory management, requiring new techniques adapted to the characteristics of the descriptors and to the way they are used.
Going from corpora of a few thousand images to corpora containing a few million remains a research challenge today. The solution can come neither from descriptors alone nor from new indexing techniques alone, but requires taking into account all the components of the system and their articulation. We thus propose to work on:
data description, especially in the case of compressed or watermarked images,
indexing and search algorithms,
database organization and use of the metadata,
system and hardware support,
and on the coupling between these various techniques, to improve the performance of current systems in terms of speed as well as quality of recognition.
Search engines are extensively used tools, but their results are disappointing most of the time, due to their syntactic, keyword-based approach. Natural language processing tools could however provide them with more semantic capabilities, by allowing word sense disambiguation or the recognition of the various formulations of the same concept. It is thus advisable to combine these two techniques.
This union is, however, not so simple. On the one hand, it requires providing query expansion strategies to the search engines and then translating these expansions in terms of similarity. On the other hand, natural language processing tools must work in much broader environments than the ones in which they are usually used. The contribution of such a modification of the engines must also be established, which requires a precise evaluation of the results obtained.
Media coupling is studied in two ways. Within the framework of video, we are interested in descriptions which jointly use the sound and image tracks of the video. Such techniques can be applied to automatic video structuring, but also to improve the detection and recognition of people, whether by their face or by their voice.
In addition, we study the coupling between text and image in documents where these two media are strongly coupled, a common case in scientific bibliographical databases, on the web, in newspapers, in art books or in technical documents. The goal is to connect, within the same document, an image and the text which refers to it. This should make it possible to obtain an automatic and semantic description of the images, to connect different documents, either by searching for visually similar images or by searching for texts treating the same subject, and thus to improve the description of the images and to remove possible ambiguities in the understanding of the text.
The work within the team requires two kinds of competencies: to exploit the content of documents, one should first be able to access this content, i.e. to characterize or describe it; one should then be able to use this description in order to fulfill a task related to these documents. Finally, both description and exploitation techniques must satisfy the needs of the user (and showing that they do is not trivial).
Finding a solution requires document description techniques based on text, image or video processing (sound and speech processing are studied by the METISS team, with which we closely collaborate). It is also necessary to exploit the correlation and complementarity between the different media, since they do not carry the same information and do not share the same limitations.
After this description stage, the descriptions must be exploited to satisfy the user's query. This second stage requires sorting, indexing and retrieval algorithms which must provide good results quickly, two usually conflicting constraints.
These two aspects are not independent, and a solution addressing only one of them cannot solve any real problem. Combining the two in the context of large databases raises many difficult but interesting questions, and their solution can only come from a confrontation of people and ideas coming from both sides.
All multimedia documents have the ambivalent characteristic of being, on the one hand, semantically very rich and, on the other hand, very poor when considering the elementary components which constitute them (sequences of letters or pixels). More concise and informative descriptions are needed in order to handle these documents.
Computing image descriptors has been studied for about ten years. The aim of such a description is to extract indices, called descriptors, whose distances reflect those of the images they are computed from. This can be seen as a coding problem: how should images be coded such that the similarity between the codes reflects the similarity between the original images?
The first difficulty of the problem is that image similarity is not a well defined concept. Images are polysemic, and their level of similarity will depend on the user who judges this similarity, on the problem this user tries to solve, and on the set of images he is using. As a consequence, there does not exist a single descriptor which can solve every problem.
The problem can be specialized with respect to the different kinds of users, databases and needs. For example, the problem of professional users is usually very specific, while domestic users need more generic solutions. The same difference exists between databases composed of very dissimilar images and those composed of images of only one kind (fingerprints or X-ray images). Finally, retrieving one particular image from an excerpt and browsing a database to choose a set of images may require very different descriptors.
To solve these problems, many descriptors have been proposed in the literature. The most frequent framework is image retrieval from a large database of dissimilar images using the query-by-example paradigm. In this case, the descriptors integrate the information of the whole image: color histograms in various color spaces, texture descriptors, shape descriptors (whose major drawback is to require automatic image segmentation). This field of research is still active: color histograms provide information that is too poor to solve any problem as soon as the size of the database increases; texture descriptors are usually useful for one kind of texture, but fail to describe all possible textures, and no technique exists to decide in which category a given texture falls, and thus which descriptor should be used to describe it properly; shape descriptors suffer from the lack of robustness of shape extractors.
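As an illustration of such global descriptions, the following sketch computes a normalized joint color histogram and compares two images with an L1 distance; the bin count and the distance are arbitrary choices for illustration, not those used in our experiments.

```python
# Hedged sketch: a global color-histogram descriptor and an L1 comparison.
import numpy as np

def color_histogram(image, bins_per_channel=8):
    """image: H x W x 3 array of uint8 RGB values -> normalized joint histogram."""
    pixels = image.reshape(-1, 3)
    hist, _ = np.histogramdd(pixels, bins=(bins_per_channel,) * 3,
                             range=((0, 256),) * 3)
    hist = hist.ravel()
    return hist / hist.sum()          # normalize so images of different sizes compare

def histogram_distance(h1, h2):
    """L1 distance between two normalized histograms (0 = identical distributions)."""
    return float(np.abs(h1 - h2).sum())

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    img_a = rng.integers(0, 256, size=(64, 64, 3), dtype=np.uint8)
    img_b = rng.integers(0, 256, size=(64, 64, 3), dtype=np.uint8)
    print(histogram_distance(color_histogram(img_a), color_histogram(img_b)))
```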
Much other work has been done in the case of specific databases. Face detection and recognition is the most classical and important case, but other work concerns medical images, for example.
In the team, we work with a different paradigm based on local descriptors: one image is described by a set of descriptors. This solution offers the possibility of partial recognition, for example recognizing an object independently of the background.
The main stages of the method are the following. First, simple features are extracted from each image (interest points in our case, but edges and regions can be used as well). The most widely used interest point extractor is the Harris detector.
The similarity between images is then translated into the concept of invariance: measurements of the image that are invariant to some geometric (rotations, translations, scalings) or photometric (intensity variations) transformations are searched for. In practice, this concept of invariance is usually replaced by the weaker concept of quasi-invariance.
In the case of points, the classical technique consists of characterizing the signal around each point by its convolution with a Gaussian kernel and its first derivatives, and of combining these measurements in order to obtain the desired invariance properties. Invariance with respect to rotations, scalings and affine transformations was obtained by successive refinements of this approach, starting with the differential invariants of Florack et al.
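The sketch below illustrates this local-description paradigm with a simplified Harris-like interest point extractor and a small descriptor built from Gaussian derivatives; parameter values are illustrative, and no invariant combination of the derivatives is computed here.

```python
# Hedged sketch: interest points (Harris measure) and a local descriptor made of
# Gaussian derivatives; sigma, k and thresholds are illustrative values only.
import numpy as np
from scipy.ndimage import gaussian_filter, maximum_filter

def harris_response(img, sigma=1.5, k=0.04):
    img = img.astype(float)
    ix = gaussian_filter(img, sigma, order=(0, 1))   # derivative along x
    iy = gaussian_filter(img, sigma, order=(1, 0))   # derivative along y
    sxx = gaussian_filter(ix * ix, 2 * sigma)
    syy = gaussian_filter(iy * iy, 2 * sigma)
    sxy = gaussian_filter(ix * iy, 2 * sigma)
    det = sxx * syy - sxy ** 2
    trace = sxx + syy
    return det - k * trace ** 2

def interest_points(img, max_points=200):
    r = harris_response(img)
    peaks = (r == maximum_filter(r, size=5)) & (r > 0.01 * r.max())
    ys, xs = np.nonzero(peaks)
    order = np.argsort(r[ys, xs])[::-1][:max_points]
    return np.stack([ys[order], xs[order]], axis=1)

def local_descriptor(img, y, x, sigma=2.0):
    """Jet of Gaussian derivatives up to order 2 at (y, x); invariants would be
    built by combining these measurements (not done here). Slow but simple."""
    img = img.astype(float)
    orders = [(0, 0), (0, 1), (1, 0), (0, 2), (1, 1), (2, 0)]
    return np.array([gaussian_filter(img, sigma, order=o)[y, x] for o in orders])
```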
One of the main difficulties of the domain is the evaluation and comparison of the methods. Each one corresponds to a slightly different problem, and comparing them is difficult and usually unfair: the results depend on the database used, especially when databases are small. In that case, a simple syntactic criterion can give the impression of a good semantic description, but this says nothing about what would happen with a larger database.
Professional and domestic video collections are usually much bigger than the corresponding still image collections: there is commonly a factor of 1000 between the two. While the images often have lower quality (motion, blurred frames...), they present a temporal redundancy which can be exploited to gain some robustness.
Video indexing is a broad concept which covers different research topics. Video structuring consists of finding the temporal units of a video (shots, scenes) and is a first step towards computing a table of contents of the video. Key-event detection is more oriented towards the creation of an index of the video. Finally, all the extracted elements can be characterized with various descriptors, such as motion descriptors.
Many contributions have been proposed in the literature to compute a temporal segmentation of videos, and especially to detect shot boundaries and transitions.
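As a minimal illustration of such a temporal segmentation, the following sketch declares a cut whenever the grey-level histograms of two consecutive frames differ by more than a threshold; the threshold is arbitrary, and gradual transitions would require more elaborate tests.

```python
# Hedged sketch: cut detection by comparing histograms of consecutive frames.
import numpy as np

def frame_histogram(frame, bins=64):
    hist, _ = np.histogram(frame, bins=bins, range=(0, 256))
    return hist / hist.sum()

def detect_cuts(frames, threshold=0.4):
    """Return the indices i such that a cut is declared between frames i-1 and i."""
    cuts = []
    prev = frame_histogram(frames[0])
    for i in range(1, len(frames)):
        cur = frame_histogram(frames[i])
        if np.abs(cur - prev).sum() > threshold:   # large histogram change -> cut
            cuts.append(i)
        prev = cur
    return cuts
```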
Automatic indexing of textual documents raises two problems: first, choosing index terms, i.e. simple or complex words automatically extracted from a document, that "represent" its semantic content and make its retrieval possible within the document database; second, dealing with the fact that this representation is word-based and not conceptual. Information retrieval therefore has to overcome two semantic problems: the various possibilities of formulating the same idea (how to match the same concept in a text and in a query when it is expressed with different words), and word ambiguity (the same word, as a graphical chain, can cover different concepts). In addition to these difficulties, the meaning of a word, and thus the semantic relations that link it to other words, varies from one domain to another. One solution is to make use of domain-specific linguistic resources, both to disambiguate words and to expand user queries with synonyms, hyponyms, etc. These domain-specific resources are however not pre-existing and must be automatically extracted from corpora (collections of texts) using machine learning techniques.
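The following toy sketch illustrates such a query expansion with a purely hypothetical domain lexicon; it only shows the mechanism, not an actual resource or system.

```python
# Hedged sketch: expanding a keyword query with a (hypothetical) domain-specific
# lexicon of related words (synonyms, hyponyms...).
RESOURCE = {                      # hypothetical domain lexicon
    "car": {"automobile", "vehicle"},
    "buy": {"purchase", "acquire"},
}

def expand_query(terms, resource):
    expanded = set(terms)
    for t in terms:
        expanded |= resource.get(t, set())   # add related words when known
    return expanded

print(expand_query(["buy", "car"], RESOURCE))
# e.g. {'buy', 'purchase', 'acquire', 'car', 'automobile', 'vehicle'}
```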
A lot of work has been done during the last decade in the domain of automatic corpus-based acquisition of lexical resources, essentially based on statistical methods, even though symbolic approaches also raise a growing interest; they can produce finer semantic relations (e.g. disk shop: a shop whose purpose is to sell disks) that are up to now not often used in information retrieval systems.
Our research concerns both the machine learning algorithms developed to extract lexical elements from corpora, and the linguistic and applicative interest of the learnt elements.
Differential (or interpretative) semantics describes word meaning in terms of semes (i.e. semantic features). Within a given semantic class (a group of words that can be exchanged in some contexts), words share generic semes that characterize their common points and are used to build the class (e.g. /to seat/ is associated with {chair, armchair, stool...}), and specific semes that make their differences explicit (/has arms/ differentiates armchair from the two others).
Following Rastier, two kinds of linguistic contexts are fundamental to characterize lexical meaning relations: the topic of the text unit in which a word occurrence is found, and its neighborhood. Differential semantics states that valid semantic classes, in which specific semes can be determined, can only be defined within a specific topic. And a topic can be recognized within a text by the presence of a semantic isotopy, i.e. the co-presence of some recurrent semes within the sets of semes (named sememes) representing some of its words. For example, a war topic can be detected in a text unit containing the words soldier, offensive, general... by the presence of the same seme /war/ within the sememes of all these words.
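The sketch below illustrates this isotopy criterion on a toy seme lexicon; the lexicon and the recurrence threshold are illustrative only.

```python
# Hedged sketch: detecting a semantic isotopy as a seme shared by several words.
SEMEMES = {                                  # word -> set of semes (hypothetical)
    "soldier":   {"/war/", "/human/"},
    "offensive": {"/war/", "/action/"},
    "general":   {"/war/", "/human/", "/rank/"},
    "chair":     {"/to seat/"},
}

def isotopies(words, min_recurrence=3):
    """Return the semes shared by at least `min_recurrence` words of the unit."""
    counts = {}
    for w in words:
        for seme in SEMEMES.get(w, set()):
            counts[seme] = counts.get(seme, 0) + 1
    return {s for s, c in counts.items() if c >= min_recurrence}

print(isotopies(["soldier", "offensive", "general"]))   # {'/war/'}
```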
We have developed a three-level method to extract lexicons based on Rastier's principles. First, with the help of a hierarchical classification method (Linkage Likelihood Analysis, LLA), the words of a corpus are classified according to their distribution over its paragraphs.
In one of the components of Pustejovsky's Generative Lexicon model, the qualia structure, words are described in terms of semantic roles. For example, the telic role indicates the purpose or function of an item (cut for knife), and the agentive role its creation mode (build for house). The qualia structure of a noun is mainly made up of verbal associations, encoding relational information. We have developed a learning method based on inductive logic programming, asares, which is generic enough to be applied to the extraction of other kinds of semantic lexical information.
A collection of texts is said to be thematically homogeneous if the texts share some domains of interest. We are concerned with the indexing and analysis of such texts. The search for relevant keywords is not trivial: even in thematically homogeneous sets, there is a high variability in the words used and even in the sub-fields concerned. Beyond the indexing of the texts, it is also valuable to detect thematic evolutions in the underlying corpus.
Generally, textual data are not structured, and we must assume that the files we deal with have either a minimal structure or a common general theme. The method we use is factorial correspondence analysis: it yields clusters of documents and their characteristic words. Recently, R. Priam defended a thesis in which he proposes methods very close to Kohonen maps to visualize local proximities between words and documents.
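The following sketch shows the core computation of a factorial correspondence analysis on a small document-term contingency table; it is a minimal illustration and omits all the corpus preprocessing (tagging, filtering) a real analysis requires.

```python
# Hedged sketch: factorial correspondence analysis of a document-term count matrix.
import numpy as np

def correspondence_analysis(counts, n_axes=2):
    """counts: documents x terms contingency table. Returns row (document)
    and column (term) coordinates on the first factorial axes."""
    P = counts / counts.sum()
    r = P.sum(axis=1)                      # document weights
    c = P.sum(axis=0)                      # term weights
    S = (P - np.outer(r, c)) / np.sqrt(np.outer(r, c))   # standardized residuals
    U, d, Vt = np.linalg.svd(S, full_matrices=False)
    rows = (U[:, :n_axes] * d[:n_axes]) / np.sqrt(r)[:, None]
    cols = (Vt.T[:, :n_axes] * d[:n_axes]) / np.sqrt(c)[:, None]
    return rows, cols

counts = np.array([[4, 0, 1],
                   [3, 1, 0],
                   [0, 5, 2]], dtype=float)
doc_coords, term_coords = correspondence_analysis(counts)
```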
The situation on this subject varies greatly depending on the medium. Reference test collections exist for text, sound and speech, and regular evaluation campaigns are organized (NIST for sound and speech recognition, TREC for text in English, AMARYLLIS for text in French, SENSEVAL or ROMANSEVAL for text disambiguation).
In the domain of images and videos, the Benchathlon provides a database to evaluate image retrieval systems, while TREC provides a test database for video indexing. A system to evaluate shot transition detection algorithms has been developed by G. Quenot and P. Joly.
To improve the data organization or to define strategies to choose or compute descriptors, it may be advisable to use contextual information. This information, called metadata (data about the data), can indicate how the data were produced, obtained or used; it can also provide information about the users or describe the content of a document.
V. Kashyap and A. Sheth propose a classification of such metadata: some are computed directly from the documents, while others are provided by a human, like keywords.
This organization of the metadata can be refined, and become a matrix, when also considering the various objectives of the metadata (easy data access, data summaries, interoperability, media or content representation...) or the way the metadata were obtained. Metadata are a privileged way to keep information relative to a document or its descriptors in order to facilitate future processing. They appear to be a key point in a coherent exploitation of large multimedia databases.
Even if the description of the documents can be done automatically, this is not enough to build a complete indexing and retrieval system usable in practice. The system must indeed be able to answer a query in a reasonable amount of time, which requires dedicated tools. This section is devoted to some of these tools.
On-line and off-line processing define the two main categories of exploitation. Off-line processing usually corresponds to techniques that need to consider all the data, and for which time complexity is thus not the main issue. On-line processing, on the other hand, needs to be really fast. To reach such performance, these procedures use the results of the off-line processing to limit the work to the smallest data subset necessary to answer the query.
The situation where few data are available has been well studied, but a huge amount of data generates different kinds of problems: for instance, classical inferential statistics applied to hypothesis testing conclude rather often that the null hypothesis should be rejected. Besides, model identification methods fail very often, or the quality of the model is overestimated. The question is: how can we build a representative sample from such datasets? We must also add that some clustering algorithms are unusable on such large datasets. Working with huge datasets is therefore difficult because of the computational complexity, the data quality, and the scaling problems of inferential statistics.
However, statistical methods can be used with caution if the data quality is good. The first step is therefore the cleaning and checking of the data to ensure their coherence. The second step depends on our goal: either we want to build a global model, or we are looking for hidden structures in the data. In the first case, we can work on a sample of the data and use methods such as clustering, segmentation or regression models. When we are looking for hidden structures, sampling is not appropriate and other heuristics are needed.
Exploratory data analysis (EDA) is an essential tool to deal with huge amounts of data. EDA describes data interactively, without a priori hypotheses, and provides useful graphical representations. Visualization methods for data of dimension greater than three, such as parallel coordinates, are also indispensable. All these methods analyse the data to discover their properties.
We can add that most of the available data mining programs are very expensive, and that the contents of most of them are disappointing and poor.
This section gives an overview of the techniques used in databases for indexing multimedia data (often focusing on still images). Database indexing techniques are needed as soon as the space required to store all the descriptors gets too big to fit in main memory. They are therefore used to store descriptors on disks and to accelerate the search process through multi-dimensional index structures. Their goal is mainly to minimize the resulting number of I/Os. This section first gives an overview of traditional multidimensional indexing approaches achieving exact NN-searches; we especially focus on the filtering rules these techniques use to dramatically reduce their response times. We then move to approximate NN-search schemes.
Traditional database multidimensional indexing techniques typically divide the data space into cells containing vectors. Cell construction strategies can be classified in two broad categories: data-partitioning and space-partitioning indexing methods. Data-partitioning index methods all derive from the seminal R-Tree. Space-partitioning techniques, like the grid-file, instead divide the data space itself into disjoint cells, independently of the actual distribution of the vectors.
NN-algorithms typically use the geometrical properties of cells to eliminate those cells that cannot have any impact on the result of the current query, thanks to two filtering rules. The first rule is applied at the very beginning of the search process and identifies irrelevant cells as follows: a cell $C_i$ can be safely discarded if $d_{\min}(q, C_i) > \min_j d_{\max}(q, C_j)$, where $q$ is the query point and $d_{\min}(q,C)$ (resp. $d_{\max}(q,C)$) denotes the minimum (resp. maximum) possible distance between $q$ and a vector stored in cell $C$. The search process then ranks the remaining cells on their increasing distances to $q$.
The second filtering rule is applied to stop the search as soon as it is detected that none of the vectors in any remaining cell can possibly impact the current set of neighbors; all remaining cells are then skipped. This second rule is: $d_{\min}(q, C_i) > d(q, nn_k)$, where $nn_k$ denotes the current $k$-th nearest neighbor, i.e. the farthest element of the current set of neighbors.
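A minimal sketch of an exact NN-search using these two rules over spherical cells is given below; cells are simply given as (center, radius) pairs, whereas real index structures are of course more elaborate.

```python
# Hedged sketch: exact nearest-neighbor search over spherical cells using the
# two filtering rules described above (1-NN case for simplicity).
import numpy as np

def nn_search(query, cells, points_in_cell):
    """cells: list of (center, radius); points_in_cell[i]: array of vectors."""
    d_min = [max(0.0, np.linalg.norm(query - c) - r) for c, r in cells]
    d_max = [np.linalg.norm(query - c) + r for c, r in cells]
    best_possible = min(d_max)                       # rule 1: drop hopeless cells
    candidates = [i for i in range(len(cells)) if d_min[i] <= best_possible]
    candidates.sort(key=lambda i: d_min[i])          # visit closest cells first
    best_dist, best_point = np.inf, None
    for i in candidates:
        if d_min[i] > best_dist:                     # rule 2: stop early
            break
        for p in points_in_cell[i]:
            d = np.linalg.norm(query - p)
            if d < best_dist:
                best_dist, best_point = d, p
    return best_point, best_dist
```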
The ``curse of dimensionality'' phenomenon makes these filtering rules ineffective in high-dimensional spaces.
This phenomenon is particularly prevalent when performing exact
NN-searches. There is therefore an increasing interest in performing
approximate NN-searches, where result quality is traded for
reduced query execution time. Many approaches to approximate
NN-searches have been published.
Dimension reduction techniques, such as PCA, SVD or DFT, have been used to overcome the ``curse of dimensionality'' phenomenon. Weber and Böhm address the problem with an approximate version of the VA-File.
Geometrical approaches typically consider an approximation of the sizes of cells instead of their exact sizes; they typically account for an additional, user-supplied approximation error when filtering cells. The AC-NN scheme for M-Trees is one example of this family.
Approximate NN-searches using locality sensitive hashing (LSH) techniques have also been described in the literature, as well as approaches such as DBIN and P-Sphere Trees.
To our knowledge, however, no existing technique linking the precision of the search to a probability of improving the result can search for the k nearest neighbors of query points.
We are particularly interested in large image bases, such as those managed by photo agencies. These agencies have between five hundred thousand and twelve million images: the Andia Presse agency has one million of them, Sygma twelve million, and the Corbis agency, which gathers all of Bill Gates' acquisitions, thirty-six million. These agencies work according to two modes. In the first one, they answer a customer query by sending a set of images, and the customer pays for the images he publishes. In the second mode, the customers are subscribers, and the agencies systematically send them their new photographs, the mode of payment being the same. This second mode is that of AFP or Reuters.
One of the concerns of the agencies is of course digital rights management, i.e. ensuring that their images are not used by people or institutions that have not paid the corresponding rights. Watermarking and indexing are two techniques considered to control image diffusion, either by looking for a property watermark in the images, or by checking, using indexing techniques, whether an image is a fragment of an image of the agency's base.
Existing video archives are generally only partially digitized. The progressive transition to digital television should quickly change this point. As a matter of fact, TF1 has moved to an entirely digital production, the cameras remaining the only analog stage: processing, editing and broadcasting are digital. In addition, domestic digital decoders can now be equipped with hard disks allowing a storage that is initially modest, around ten hours of video, but will grow in the long term to a thousand hours.
One can then distinguish two types of digital collections. First of all, those of private individuals, including recordings of broadcast programs and films shot with digital camcorders. The management effort devoted to such bases will probably be weak and without rigorous method, so there is a great need for tools to help the user: automatic creation of summaries and synopses to find information easily, or to get, in a few minutes, a general idea of a program. Even if the service is basic, it will be evaluated according to the added value it brings to a device (video recorder, decoder); it will have to remain inexpensive, but will benefit from a wide diffusion.
Professional collections (TV channel archives, copyright registration, film clubs, producers...), on the other hand, are of a much larger size, but benefit from the attentive care of documentation and archiving professionals. In this field, systems can be much more expensive and are judged according to the productivity gains and the assistance they bring to documentalists, journalists and users.
Searching in large textual corpora has already been the topic of much research. The current stakes are the management of very large volumes of data, the possibility of answering requests relying on concepts rather than on the simple presence of words in the texts, and the characterization of sets of texts.
We work on the exploitation of scientific bibliographical databases. The explosion of the number of scientific publications makes the retrieval of relevant data a very difficult task for a researcher, and the generalization of document indexing in data banks did not solve the problem. The main difficulty is to choose the keywords which will delimit a domain of interest. The statistical method we use, factorial correspondence analysis, makes it possible to index a document or a whole set of documents and provides the list of the most discriminating keywords for this or these documents. The indexing is validated by searching for information in databases more general than the one from which the index was built, and by studying the returned documents. This generally makes it possible to further reduce the subset of words characterizing a field.
Another difficulty is to find, within a given document, the parts which tackle a given subject. Working on bioinformatics texts coming from databases such as Medline, we have thus studied the automatic extraction of the text zones describing interactions between genes, and the modeling of the described interactions. Since modeling requires a fine and expensive analysis of sentences, it should be carried out only on text zones that are indeed likely to contain an interaction. Our machine learning methods for acquiring semantic links between words are used to determine these relevant zones: on a corpus of abstracts extracted from Medline, we apply ILP-based learning to try to learn what distinguishes the sentences containing interaction descriptions from the others.
We also explore scientific documentary corpora to solve two different problems: indexing the publications by means of meta-keys, and identifying the relevant publications in a large textual database. For that, we use factorial data analysis, which allows us to find the minimal sets of relevant words that we call meta-keys and to free the bibliographical search from the problems of noise and silence. The performance of factorial correspondence analysis is clearly better than that of classical search by logical equations.
While the collaboration between robotics and vision is an old subject, it has undergone an important paradigm change in the last five years. Hitherto, the collaboration took place at the planning level: a camera observed the world around a robot to enable it to plan its movements. The results turned out to be not very satisfactory.
The field of collaboration then moved towards control: vision is no longer used to plan a movement, but to ensure its follow-up and correct execution, by setting up a closed control loop including vision.
Some difficulties remain: the tasks to be achieved are specified using a target image that should be reached, which assumes that the robot is able to establish a link between this image and the current image provided by the camera. This is a classical image matching problem. If these two images have nothing in common, it is necessary to use a collection of intermediate images, which define intermediate positions of the robot before reaching the final position.
The control problem therefore corresponds to an image collection management problem, with dynamic collections to follow the evolution of the robot's environment, and a need for fast access for recognition. This application appears important because it widely broadens the experimental conditions in which visual servoing can be used: once an environment has been collected in a base, the robot can start from any position to go towards any target. While this kind of approach presents little interest for articulated arms, whose articular coordinates can be read directly, an autonomous vehicle can benefit from it in restricted environments such as car parks. In this case, positioning systems such as GPS do not offer sufficient relative precision and do not give orientation information.
This software computes local or global image descriptors: differential local invariants, global and local color histograms, and weighted histograms. It was deposited at the "Agence pour la Protection des Programmes" under the number IDDN.FR.001.270047.000.S.P.2003.000.21000. (Contact: Patrick Gros.)
Our work on image description does not aim at finding new general descriptors: the IMEDIA and LEAR teams are very active in this field, and we use their results. The originality of our work comes from the size of the databases we want to handle. In large databases, most images are compressed: is it possible to describe an image without decompressing it? In many databases, images are also watermarked, and the influence of watermarking (and of the systems for breaking watermarks) on content-based description techniques is not clear. This is our first direction of research.
A second direction concerns the combination of descriptors: when documents are described by many descriptors, how should a query be processed in order to provide the fastest possible answer? To answer this question, we study the information that each descriptor can provide about the other ones, the aim being to determine the order in which the descriptors should be considered.
The third direction is description indexing and retrieval. In the local description scheme, one million images can give rise to 600 million descriptors, and retrieving any information in such an amount of data requires really fast access techniques, whatever the purpose of this access may be.
A fourth direction comes from our collaboration with the roboticians of the VISTA team. They work on visual servoing, and using an image database is a good way to extend the applicability of their techniques to large displacements. Our description techniques appear to be particularly well suited to such an application, where an actual matching between images is required, and not only a global similarity link.
This is a joint work with the TEMICS team (S. Pateux).
Image authentication is becoming very important for certifying image data integrity. A key issue in image authentication is the design of a compact signature that is robust under allowable manipulations. Watermarking has been mostly investigated to deal with the detection of illegal copies, but it provides only a presumption, not a proof, of illegality. We believe that content-based image description techniques may provide robust detection of illegal copies. Large databases are made of compressed images; in order to speed up the matching scheme, it is therefore of interest to compute signatures directly from the compressed images. Thanks to its wavelet analysis, the JPEG2000 compression standard allows the design of multiresolution signatures. Inspired by classical content-based local description techniques, we have developed a robust point extractor in the wavelet space; its average robustness is 10% lower than that of the reference multiresolution Harris point extractor. We will investigate how to describe (in the wavelet space) the neighborhood of these points by means of vectors invariant to allowable image manipulations. Another point we consider is the comparison of robustness and speed between classical local signatures and wavelet signatures.
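The sketch below illustrates, on a purely indicative basis, how candidate robust points can be selected as local maxima of wavelet detail energy; it is an illustration of the general idea, not the extractor developed in this work.

```python
# Hedged sketch: candidate points as local maxima of wavelet detail energy.
import numpy as np
import pywt
from scipy.ndimage import maximum_filter

def wavelet_points(image, level=3, per_level=50):
    coeffs = pywt.wavedec2(image.astype(float), "db2", level=level)
    points = []                                  # (band index, row, col) in coefficient grids
    for band, (ch, cv, cd) in enumerate(coeffs[1:], start=1):
        energy = ch ** 2 + cv ** 2 + cd ** 2     # detail energy at this resolution
        peaks = (energy == maximum_filter(energy, size=3)) & (energy > 0)
        ys, xs = np.nonzero(peaks)
        order = np.argsort(energy[ys, xs])[::-1][:per_level]
        points += [(band, int(ys[i]), int(xs[i])) for i in order]
    return points
```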
Content-based image retrieval is not easy when image databases become very large. Still images can be described in several ways by global visual descriptors of color, texture or shape (at the pixel level). Most frequent queries imply and combine the results of several types of descriptors, such as: "retrieve all images that have a color and a texture similar to the given example image". To retrieve images from a large database more efficiently and more effectively, we exploit combinations of descriptors. We first surveyed the state of the art of image mining and content-based image retrieval. Our objective was then to study the interest of association rules between descriptors to accelerate the response time of queries on large still image databases. We used 5 MPEG-7 descriptors to describe several thousands of still images. We first used a K-means based algorithm to compute clusters of images for each descriptor, and then generated relations between the different clusters in the form of association rules. Multiple correspondence analysis was used to study the relevance of the found associations and to validate our approach. We are now exploiting association rules between clusters of descriptors to optimize content-based retrieval.
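The following sketch illustrates the general idea: images are clustered separately in two descriptor spaces, and simple association rules between clusters are mined from the co-occurrences; the descriptor names, thresholds and parameters are illustrative, not those of our experiments.

```python
# Hedged sketch: association rules between clusters obtained on two descriptor spaces.
import numpy as np
from sklearn.cluster import KMeans

def cluster_association_rules(desc_a, desc_b, k=8, min_support=0.05, min_conf=0.6):
    """desc_a, desc_b: (n_images, dim) arrays of two descriptors of the same images.
    Returns rules (cluster_a -> cluster_b) with their support and confidence."""
    labels_a = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(desc_a)
    labels_b = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(desc_b)
    n = len(labels_a)
    rules = []
    for ca in range(k):
        in_a = labels_a == ca
        if in_a.sum() == 0:
            continue
        for cb in range(k):
            both = np.logical_and(in_a, labels_b == cb).sum()
            support, confidence = both / n, both / in_a.sum()
            if support >= min_support and confidence >= min_conf:
                rules.append((ca, cb, support, confidence))
    return rules
```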
We designed an approximate search scheme for high-dimensional databases where the precision of the search can be stochastically controlled and where the search can retrieve the k nearest neighbors of query points.
This approach first clusters vectors and encloses the clusters in minimum bounding hyperspheres in a Euclidean space. All existing vectors might not belong to clusters, because the clustering isolates outliers; outliers are stored in a specific file that we treat separately. The clustering algorithm we use is derived from the first phase of Birch.
The output of the clustering phase is a set of minimum bounding hyperspheres defined by their center and their exact radius. As for Birch, clusters might overlap and outliers are treated separately. Data points are stored sequentially on disk on a per cluster basis. No specific data structure is used to index the clusters. Outliers are also stored in a separate data file, in a sequential manner.
Each cluster is analyzed off-line to derive several approximate radii from its exact radius, its volume and the distribution of vectors within the cluster. For each cluster, several approximate radii are determined, each corresponding to a predetermined level of precision. All the approximate radii of a cluster are always smaller than its exact radius. These approximate radii are ultimately the ones considered during the approximate NN-searches.
At query submission time, the user provides, along with the query, an imprecision level, i.e. the probability of missing actual nearest neighbors that he is ready to accept.
This imprecision level then determines which specific approximate
radii must be taken into account by the filtering rules during the
NN-search. Irrelevant clusters are thus filtered out and the remaining
clusters are then ranked with respect to the distance of their centers
to the query point. Clusters are then accessed one after the other.
When a cluster is accessed, all the data points it contains (all
points enclosed within its exact bounding hypersphere) are
fetched in memory. The search then computes the distances between all
points in the cluster and the query vector. This might in turn update
the current set of neighbors. It might also filter out more clusters.
The search stops when all remaining clusters have either been accessed or filtered out.
Before returning the result to the user, a sequential scan of the file where outliers are stored is performed. This might also update the current set of neighbors.
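A simplified sketch of this search strategy is given below; how the approximate radii are derived and how clusters are stored on disk are abstracted away.

```python
# Hedged sketch of the cluster-based approximate search described above: clusters
# carry precomputed approximate radii indexed by imprecision level; filtering and
# ranking use these radii, and the outlier file is scanned at the end (1-NN case).
import numpy as np

def approx_nn(query, clusters, outliers, level):
    """clusters: list of dicts {center, approx_radii, points};
    level selects one precomputed approximate radius per cluster."""
    radii = [c["approx_radii"][level] for c in clusters]
    d_min = [max(0.0, np.linalg.norm(query - c["center"]) - r)
             for c, r in zip(clusters, radii)]
    d_max = [np.linalg.norm(query - c["center"]) + r for c, r in zip(clusters, radii)]
    keep = [i for i in range(len(clusters)) if d_min[i] <= min(d_max)]
    keep.sort(key=lambda i: np.linalg.norm(query - clusters[i]["center"]))
    best_dist, best = np.inf, None
    for i in keep:
        if d_min[i] > best_dist:          # remaining clusters cannot improve the result
            break
        for p in clusters[i]["points"]:   # fetch the whole cluster, scan its points
            d = np.linalg.norm(query - p)
            if d < best_dist:
                best_dist, best = d, p
    for p in outliers:                    # final sequential scan of the outlier file
        d = np.linalg.norm(query - p)
        if d < best_dist:
            best_dist, best = d, p
    return best, best_dist
```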
This is a joint work with the VISTA team (F. Chaumette).
We are working on automatic robot motion control, using visual information provided by an on-board camera and an image database of the navigation space. The image base describes the environment in which the robotic system moves; more exactly, it describes the features that the robot camera can observe. Thanks to this base, robot localization is nothing but an image retrieval problem.
The path the robot has to follow is also defined in terms of images: the desired position corresponds to the image the camera should obtain at the end of the motion. The same image retrieval method localizes the desired position. By translating the image base into a weighted graph (whose edge weights express the feasibility of going from one image to another) and using graph theory, the shortest image path between the initial image and the desired one can easily be found. The images extracted from the database along this path describe, in a continuous way, the space the robot has to pass through in order to reach the desired position.
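A minimal sketch of this path computation on a hypothetical image graph is given below (edge weights standing for matching feasibility).

```python
# Hedged sketch: shortest image path in a weighted image graph (Dijkstra-style).
import heapq

def shortest_image_path(graph, start, goal):
    """graph: dict node -> list of (neighbor, weight). Returns the node path."""
    queue = [(0.0, start, [start])]
    seen = set()
    while queue:
        cost, node, path = heapq.heappop(queue)
        if node == goal:
            return path
        if node in seen:
            continue
        seen.add(node)
        for nxt, w in graph.get(node, []):
            if nxt not in seen:
                heapq.heappush(queue, (cost + w, nxt, path + [nxt]))
    return None

graph = {"img0": [("img1", 1.0)], "img1": [("img2", 0.5)], "img2": []}
print(shortest_image_path(graph, "img0", "img2"))   # ['img0', 'img1', 'img2']
```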
During this year, we have defined a formalism that makes it possible to control robot motions given this image sequence, the features matched between each consecutive pair of images, and the images acquired by the camera. Notably, no 3D reconstruction is necessary. Furthermore, robot motion is not defined during an off-line stage; motions are determined for each image acquired by the camera. This will allow us to easily take into account unexpected exterior events such as occlusions or obstacles.
Our method is based on potential field theory: the robot moves in order to make features defined on the image path, initially out of the camera's field of view, become visible. Furthermore, the obtained trajectory is independent of the intermediate image positions. This work has been validated through experiments with a planar environment and planar motions, using an articulated arm. We are trying to relax these constraints in order to deal with more general motions and 3D scenes. Non-holonomic constraints will then be added in order to manage mobile robots in real environments.
Furthermore, we want to improve the management of the database, which could accelerate the retrieval process. For example, grabbing conditions (time of day, weather conditions...) are criteria that can be extracted automatically from the image signal. This information could help to categorize the images of the base and to provide the robot with the images that best correspond to the current exterior conditions (which can be very useful since those images are used in the feature tracking stage). Finally, a protocol for autonomous image base acquisition should be defined, in order to be able to carry out experiments with the Cycab robot owned by Irisa.
The general framework of our work has been presented in the preceding sections.
During 2003, our work has especially concerned the following three points:
asares: two semi-supervised versions combining symbolic and
statistical approaches.
asares is our inductive logic programming (ILP) based system (relying on the aleph algorithm) that automatically infers, from a set of positive and negative examples of elements in a given relation, morpho-syntactic and semantic patterns that characterize this relation and can be applied to a corpus to obtain new elements of this same relation. In our case, the examples are noun-verb (N-V) pairs in which the V plays for the N one of the roles defined in the qualia structure of Pustejovsky's Generative Lexicon model, or does not play such a role; we refer to these two cases as N-V qualia and non-qualia pairs respectively. We have developed two semi-supervised versions of asares that combine in two different ways a statistical approach (N-V qualia pairs are considered as one special kind of cooccurrences) and the ILP approach. The first combination is sequential. Both versions require less supervision than the original asares, but keep the advantages of the two approaches they mix: the robustness and automation of the statistical method, and the quality of the results and expressiveness of the symbolic one.
Acquisition of semantic lexicons based on Rastier's differential semantics: automatic generation of sets of keywords for topic characterization and detection.
We have completed the elaboration of a sequence of statistical data analysis treatments that refines and enriches the results of an initial LLA (linkage likelihood analysis) classification of the words of a given corpus, based on their distribution over its paragraphs, and allows us to obtain, in a fully automatic way and without any prior information, sets of keywords that characterize the main topics of the (morpho-syntactically tagged) corpus. Each class can then be used to detect the presence of its topic in any paragraph of the corpus, by a simple keyword cooccurrence criterion. The obtained sets enable us to split an initial non-specialized corpus into several topic-specific ones and to get the linguistic material necessary to carry out the second step of the elaboration of semantic lexicons based on Rastier's principles, i.e. the automatic constitution of semantic classes within homogeneous topics. We have begun work in this direction. In order to achieve this goal, once again without any human intervention or external data, and thus possibly favoring precision over recall, we consider statistical techniques that group words appearing in similar contexts. We plan to take into account different context lengths for different word categories, and to consider the relative positions of contextual elements. This idea leads us to define a non-symmetric similarity measure to automatically build the semantic classes.
Linguistic resources and information retrieval (IR).
We have evaluated the interest of N-V qualia relations for query expansion in an information retrieval system (IRS). More precisely, we have used Salton's IRS smart and the data of the IRS evaluation campaign Amaryllis, and asares has learnt qualia pairs from one Amaryllis corpus. Our experiments have shown that this expansion improves the results obtained with smart: in particular, the relevance of the first ten documents is increased, which is particularly interesting if we consider the way search engines are commonly used. Moreover, Fabienne Moreau began in October a PhD thesis which aims at exploring methods of extending Salton's vector space model (VSM) to improve its ability to capture the semantics of natural language texts. Currently, under Salton's theory, documents are represented as a set of features, without regard for the relationships between individual terms. The goal of this thesis is to adapt the VSM so that information gained from natural language processing can inform IR.
Nicolas Bonnel, a second-year PhD student, is currently working on the dynamic generation of 2D and 3D multimedia interactive presentations aiming at representing the results of a search in a database. N. Bonnel has a CIFRE contract with France Telecom, and his thesis is done in cooperation with FT. He uses metaphors developed by France Telecom and works on the relevance of descriptors for the documents and on the improvement of the graphical representation. We therefore need to perform quality evaluation and to take into account the user profile to optimize the results of a query.
Knowledge extraction from textual databases is not obvious. Among the methods used, we find factorial analysis, neural networks and Kohonen maps. R. Priam's PhD thesis explored such approaches.
The term multimedia document is broadly used and covers in fact most documents. It is more and more appropriate, since many documents are now truly multimedia and contain several media: sound, image, video, text. The description of these documents, videos for example, remains quite difficult. Research groups are often monodisciplinary, specialists of only one of these media, and the interaction between the different media of the same document is not taken into account. Nevertheless, it is clear that this interaction is a very rich source of information and makes it possible to overcome the limitations of the techniques devoted to a single medium, since these limits vary from one medium to another.
We propose to investigate, in conjunction with other teams like METISS and VISTA, this new aspect of multimedia document description, along two directions. The first one concerns video, where image and sound are closely related and provide complementary information. The sound track also opens the possibility of speech recognition and thus requires natural language processing in order to use this new modality; in this case, one of the problems is to handle the dynamics proper to each medium. The second direction concerns documents which mix text and still images, like journals, technical manuals, or most web pages.
Our work on this topic is done in close collaboration with the METISS and VISTA teams of IRISA and with the Thomson company, where E. Kijak has done most of her thesis work.
The aim of this work is to define a general method for describing all the media of a video, as well as their interaction. Another constraint is that this method should also allow a user to formulate a task or a query concerning videos. This problem was first studied during the thesis of E. Kijak on a restricted case: the structuring of sport (and particularly tennis) broadcasts. In such documents, there are four main sources of information: the tennis rules, which explain how a tennis game is organized; the production rules, which explain how the producer works, what tools he uses and how he tries to reflect what is going on through formal techniques; the image track; and the sound track.
It is clear that none of these sources can explain the video alone. The goal is thus to integrate these sources so that their complementarity can be used to obtain a description as complete as possible. Three kinds of integration are possible. In the late integration framework, the processing is done independently for each medium, and the results are merged in a second step. The main difficulty of such a method comes from the merging operation, since no coherence is guaranteed between the results provided by each medium, and there is usually no satisfying way to solve this point.
The second way is to give a leading role to one of the modalities. We experimented with such a solution, with image as the leading medium: to help characterize the shots of a tennis broadcast, each shot was characterized by the presence, during this shot, of speech, applause or ball hits. Sound is then seen as a complementary source of information and improves the results obtained previously with images only. Such a solution is not fully satisfactory, since only a small portion of the information carried by the sound track can be taken into account.
The third way, which we plan to study, is early integration, where the different media are mixed from the beginning and all decisions are taken based on the whole stream of information.
Such integration frameworks must be supported by a foundation technique able to handle them. Hidden Markov Models were chosen first, due to their nice properties for representing temporal streams and their ability to represent a priori knowledge about tennis and production rules. A hierarchical model was used to represent the complete structure of a tennis broadcast, and the Viterbi algorithm was used to identify this structure from the video stream. A problem with this model is that its time model is not flexible enough to properly handle the different time granularities of the various media.
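For illustration, the sketch below runs a Viterbi decoding on a toy shot-level model; the states, observations and probabilities are invented for the example and do not correspond to the actual model of the thesis.

```python
# Hedged sketch: Viterbi decoding of a state sequence from per-shot observations.
def viterbi(obs, states, start_p, trans_p, emit_p):
    """obs: list of observation symbols; returns the most likely state sequence."""
    V = [{s: start_p[s] * emit_p[s][obs[0]] for s in states}]
    back = [{}]
    for t in range(1, len(obs)):
        V.append({})
        back.append({})
        for s in states:
            prob, prev = max((V[t - 1][p] * trans_p[p][s] * emit_p[s][obs[t]], p)
                             for p in states)
            V[t][s], back[t][s] = prob, prev
    last = max(V[-1], key=V[-1].get)
    path = [last]
    for t in range(len(obs) - 1, 0, -1):
        path.append(back[t][path[-1]])
    return path[::-1]

states = ["rally", "replay", "break"]                 # toy tennis-report states
start_p = {"rally": 0.6, "replay": 0.2, "break": 0.2}
trans_p = {s: {t: 1 / 3 for t in states} for s in states}
emit_p = {"rally": {"court": 0.8, "closeup": 0.2},
          "replay": {"court": 0.4, "closeup": 0.6},
          "break": {"court": 0.1, "closeup": 0.9}}
print(viterbi(["court", "court", "closeup"], states, start_p, trans_p, emit_p))
```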
To circumvent this problem, we propose to use another kind of model, segment models. In these models, each state does not correspond to a unique observation, as is the case in HMMs, but can correspond to a variable number of observations; it is also possible to use several streams of observations. On the other hand, the models are more complicated and their use is more costly. The use of these models is the subject of the thesis of M. Delakis.
In text retrieval engines, images are not taken into account. When an image retrieval engine exists alongside a text engine, it treats the images independently of the text surrounding them. It would of course be better to couple these two engines or, at least, the information that both media can provide.
A first way to do this is to determine the parts of the text which are related to the images. This should make it possible to get a textual description of the images, and to formulate textual queries to retrieve images in a much richer context than that of systems using simple keywords linked to the images.
Conversely, it is possible to find documents containing the same image and to use both texts to disambiguate or improve the understanding of the text. These two points are the subject of the thesis of H. Renault.
The theses of S.A. Berrani and E. Kijak were supported by CIFRE grants in the frame of a contract between Thomson and TEXMEX.
The thesis of N. Bonnel is supported by a CIFRE grant in the frame of a contract between France Telecom and TEXMEX.
For one year, we have been collaborating with Socio Logiciel, a service company working in statistics and data analysis in the marketing area. We acted as consultants on some statistical problems and have to report on new statistical methods. This contract is conducted jointly with the ERIC laboratory of the University of Lyon 2.
This is a joint project with the VISTA team of IRISA. Duration 46 months, Start date: September 2000. Partners: LIMSI - CNRS, AEGIS, INRIA (VISTA, TEXMEX, and IMEDIA projects), TF1.
The Mediaworks project was created in the frame of the PRIAM program and the French information society program, financed by the Ministry of Industry. It began on September 1st, 2000. This project gathers the TF1 TV channel, the LIMSI lab, AEGIS (an SME) and INRIA (the IMEDIA project at INRIA Rocquencourt, and VISTA). It concerns the development of a system to assist the documentalists who index TV archives. Its principal features are the cooperation between the text and image media, and the development of a semantic search engine. TEXMEX works together with VISTA to develop tools for the automatic structuring of videos into shots and for computing an iconic representation of these shots.
Duration : 30 months, starting January 2002. Partners: Canon, L2S, Andia Presse, INRIA (projects CODES, TEMICS and TEXMEX).
Copyright protection is a key component allowing photo holders such as photo agencies to diffuse their collections through the Web. It seems impossible today to prevent skilled hackers from breaking into Web sites and stealing images. Therefore, legal image holders need a way to check whether the images available on a third-party site originate from their own database (DB) of images. This is particularly crucial if that third party is making money by selling images it pretends to own. Watermarking is a first solution to this problem, but it requires a complex organization to become a legal argument; moreover, pirated images are often washed out in order to remove the inserted marks.
This project addresses the problem of enforcing copyright protection by relying on a content-based image retrieval (CBIR) scheme. The idea is to provide a tool allowing professionals (e.g. photo agencies) to check whether a published image comes from their DBs, using only visual similarity. Its goal is therefore to detect matches between a set of doubtful images (e.g. downloaded from the Web) and the ones stored in the DB of the legal holders of the photographs. If an image was indeed stolen and used to create a pirated copy, the tool tries to identify which original image the pirated copy comes from.
So far, we performed extensive experiments showing that our image description scheme was useful in this context.
Duration 12 months, Start date: January 2003. Partners: Thomson, LTU, CLIPS-IMAG.
The ANAPURNA project is a small project focused on demonstrating the feasibility of a picture management system on a digital set-top box in a family context of use. The project concerns the definition of the possible ergonomics of such a system, and brings together three technologies: image description, image indexing and search, and tuning of the system on an experimental platform developed by Thomson.
TEXMEX was in charge of the indexing and retrieval part of the project, which made it possible to transfer to Thomson the results of Sid-Ahmed Berrani's thesis.
Duration 24 months, Start date: October 2003. Partners: Communications et Systèmes, INA, IRIT, Canal+ Technologie, Vecsys, Arte France.
The FERIA project aims at developing a framework for multimedia applications in the domain of archive diffusion and valorization. This framework should make it easy to develop applications in the domain of multimedia production; in a second stage, these applications will be used to produce DVDs, web sites or other products.
Within this project, TEXMEX is in charge of still image analysis (logo and text detection, face detection and recognition), and of coordinating a research group on multimedia description of video documents.
Duration: 30 months. Partners: IRISA (TEMICS and TEXMEX projects), Motorola, Telefonica, Technical University Munich, Queen Mary University of London, BTexact Technologies, Heinrich-Hertz Institute Berlin, FramePOOL.
This project is concerned with the design of new algorithms for indexing and watermarking video streams in order to create new multimedia-related services. Our contributions are focused on the architecture of the indexing and search schemes involved in the design of the database of videos.
L. Berti-Équille collaborates with INSERM U522 on the management and cleaning of data produced by hepatic transcriptome experiments.
L. Berti-Équille collaborates with INRA Animal Genetics on the management and sharing of data on gene interactions.
Joint work with SYMBIOSE, ADEPT and R2D2.
The goal of this ACI is to provide a portal giving access to shared, geographically distributed computing resources in order to speed up bio-informatics algorithms (such as DNA alignments). Our team is involved in the design of the architecture of the grid.
Joint work with SYMBIOSE and R2D2.
This work aims at defining a specialized, highly parallel architecture devoted to processing large amounts of data such as genomic sequences. This architecture is based on FPGAs. We are involved in its design.
Inter-EPST bioinformatics action from CNRS, INSERM, INRA, INRIA and the Ministère de la Recherche, gathering people from the Symbiose and TexMex teams at INRIA and from the following laboratories: Leibniz/Imag, LIPN, LRI, MIG/INRA and Inra-Ensar. This action, which ended in late October 2003, aimed at discovering, within bioinformatics textual databases (MedLine), texts dealing with gene interactions, and at extracting the interaction networks from those texts. Our participation concerned the automatic detection of the text segments that describe the interactions.
This national action, coordinated by L. Romary (Loria), concerns information retrieval within electronic textual databases. Together with industrial firms, it aims at developing a software chain able to capture, analyze, and search through textual documents, using and grouping research solutions proposed by different INRIA teams.
Joint work with Atlas, Gemo and Paris.
In this three-year project, we investigate how peer-to-peer (P2P) distributed architectures can be used to cope with the fast-increasing volumes of digital data (such as text and multimedia data) often stored on autonomous, heterogeneous and distributed equipment. The potential advantages of P2P systems are node autonomy, scalability to large numbers of nodes, high availability (through replication) and performance (through parallelism). The main objective of the project is thus to provide high-level services for managing text and multimedia data in large-scale P2P systems.
Our role in this project is focused on the management (querying and retrieval) of multimedia data.
Joint work with Symbiose, R2D2 and Equipage (Valoria, Vannes).
The REMIX project proposes the design of a dedicated, very large RAM index memory (several hundred gigabytes), big enough to store huge indexes entirely in main memory and thus avoid any disk access. Such an almost unlimited main memory raises completely new issues when designing indexes and allows the principles at the root of almost all existing indexing strategies to be revisited. Within this scheme, direct access to data, massive parallel processing, heavy data redundancy, precomputed structures, etc., can all be exploited to speed up the search.
Our role in this project focuses on the design of main-memory-oriented index structures dedicated to speeding up content-based retrieval.
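As a rough illustration of the kind of structure an abundant main memory makes attractive, the toy sketch below stores every vector redundantly in the neighbouring cells of a coarse grid so that a query costs a single direct bucket access followed by exact re-ranking; the cell size, the low dimensionality and the 3^d replication factor are arbitrary assumptions chosen for readability, not the REMIX design.

import numpy as np
from collections import defaultdict

CELL = 0.25  # assumed quantization step along each dimension (illustrative only)

def cell_of(v):
    # coordinates of the grid cell containing vector v
    return tuple((v // CELL).astype(int))

def build_ram_index(vectors):
    # redundant, precomputed buckets: each vector is replicated in the
    # 3^d cells surrounding its own cell (d must stay tiny for this toy)
    buckets = defaultdict(list)
    for i, v in enumerate(vectors):
        base = np.array(cell_of(v))
        for offset in np.ndindex(*([3] * len(base))):
            buckets[tuple(base + np.array(offset) - 1)].append(i)
    return buckets

def query(buckets, vectors, q):
    # one direct bucket access, then exact distances inside the bucket
    candidates = buckets.get(cell_of(q), [])
    if not candidates:
        return None
    dists = [np.linalg.norm(vectors[i] - q) for i in candidates]
    return candidates[int(np.argmin(dists))]

The point of the toy is the trade-off it embodies: when main memory is assumed abundant, redundancy and precomputation are paid once at build time so that every query touches a single, directly addressed bucket instead of traversing a disk-oriented structure.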
This program of the French Ministry of Research aims at supporting the creation of new research teams by young researchers. Within this program, TEXMEX received 103 kEUR over 3 years.
L. Amsaleg participates in the AS "multimedia data, querying and storage" of the STIC department of CNRS.
L. Berti-Équille participates in the GafoQualité group of the GafoDonnées AS of the STIC department of CNRS.
L. Berti-Équille participates in the "Documents Multimédia" and "Médiation" working groups of GDR I3.
L. Berti-Équille participates in the AS "Médiation d'informations via les méta-données" of the STIC department of CNRS.
P. Gros is a member of the steering committees of RTP 25 (Computer Vision) and RTP 33 (Documents and Contents: creation, indexing, browsing) of the STIC department of CNRS.
P. Gros and L. Amsaleg participate in the "image mining" AS of the STIC department of CNRS.
P. Sébillot is a member of the thematic network "Information and knowledge: discovering and abstracting" of the STIC department of CNRS.
P. Sébillot is a member of the AS "Semantic Web" of the STIC department of CNRS.
P. Sébillot is a member of AFIA Café (Collège apprentissage, fouille et extraction).
P. Sébillot is a member of the A3CTE working group (Application, Learning and Knowledge Acquisition from Electronic Texts) of GDR I3.
This working group aims at encouraging research activities in video and image analysis and understanding among the members of ERCIM. Its main action was to set up the MUSCLE consortium, which has been accepted as a Network of Excellence in the 6th Framework Programme.
An MOU (Memorandum of Understanding) was signed in 2002 between IRISA (TEXMEX team) and NII (National Institute of Informatics, Tokyo, Japan) to provide a general framework for initiating an M4 project (metadata and multimedia management).
This work is supported by Égide.
Image databases, and therefore content-based retrieval systems, have become increasingly important in many application areas. While extremely effective (they return high-quality results), these techniques are very inefficient (they answer very slowly) because of their complexity and the inadequacy of traditional operating system support.
The goal of this project is to develop techniques that integrate efficiency and effectiveness in content-based image retrieval systems. The long-term benefits of this work are expected to be much improved image retrieval systems that are key for emerging applications.
A. Morin organized an invited paper session on the impact of developments in information systems on statistics education during the meeting of the International Statistical Institute in Berlin, August 2003.
Within the "image mining" AS of the STIC department of CNRS, A. Morin co-organized with Thierry Denoeux a meeting on data analysis, statistics and learning for image mining on April 1st, 2003. L. Amsaleg and P. Gros organized another meeting on image mining and databases.
P. Sébillot was a member of the program committee and of the organizing committee of TALN'03 (Traitement automatique des langues naturelles), June 2003, Batz-sur-Mer.
P. Sébillot was a member of the organizing committee of RECITAL 2003 (young researchers' conference associated with TALN'03).
P. Gros was a member of the organizing committee and of the program committee of the European Content-Based Multimedia Indexing workshop which took place in Rennes in September 2003.
L. Berti-Équille was a member of the program committee of the conference INFORSID 2003.
L. Berti-Équille was a member of the program committee of the conference SEMSOFT 2003.
P. Gros is associate editor of the journal "Traitement du signal".
P. Gros is a member of the scientific board of University of Rennes 1.
P. Sébillot is associate editor of the journal In Cognito.
P. Sébillot is associate editor of Jedai (Journal Électronique d'Intelligence Artificielle).
P. Sébillot was a member of the program committee of GL'03 (Generative Approaches to the Lexicon), Geneva, Switzerland, 15-17 May 2003.
P. Sébillot was a member of the program committee of the workshop "Acquisition, apprentissage et exploitation de connaissances sémantiques pour l'accès au contenu textuel", AFIA 2003 platform, Laval, July 1-4, 2003.
P. Sébillot is a member of the board of ATALA (Association pour le traitement automatique des langues).
P. Sébillot is a member of the scientific committee of the TCAN program of CNRS (Traitement des connaissances, apprentissage et NTIC).
DEA Computer Science, Rennes. P. Sébillot, L. Amsaleg and P. Gros: Multimedia Indexing: Techniques and Applications.
DESS Mitic, Ifsic, Rennes 1. L. Amsaleg, P. Gros and P. Sébillot: Digital document indexing and retrieval. L. Amsaleg and P. Sébillot: Data management. L. Amsaleg and P. Sébillot: Databases.
INSA Rennes, 5th year. L. Berti-Équille: Bioinformatics - biological data management.
ENST Bretagne, 3rd year. L. Berti-Équille: Data warehouses and data mining.
Diic2, LSI, 2nd year. L. Amsaleg: Databases.
IUP Miage, 3rd year. L. Amsaleg: Data warehouses and data mining.
P. Sébillot gave an invited talk at the workshop "Acquisition, apprentissage et exploitation de connaissances sémantiques pour l'accès au contenu textuel", AFIA 2003 platform, Laval, July 1-4, 2003.
P. Gros was invited to a France-Taiwan seminar on recent issues in multimedia (March 2003).