With the success of sites like YouTube or Dailymotion and the development of Digital Terrestrial TV, it is now obvious that digital videos have invaded our usual information channels such as the web. While such documents are now available in huge quantities, using them remains difficult. Beyond the storage problem, they are not easy to manipulate, browse, describe, search, summarize or visualize as soon as the simple scenario "1. search the title by keywords, 2. watch the complete document" no longer fulfills the user's needs, that is, in most cases.
Most usages are linked with the key concept of repurposing. Videos are a raw material that each user recombines in a new way, to offer new views of the content, to adapt it to new devices (ranging from HD TV sets to mobile phones), to mix it with other videos, to answer information queries... Somehow, each use of a video gives rise to a new short-lived document that exists only while it is viewed. Achieving such a repurposing process implies the ability to manipulate video extracts as easily as words in a text.
Many applications exist in both professional and domestic areas. On the professional side, such applications include transforming a TV broadcast program into a web site, a DVD or a mobile phone service, switching from a traditional TV program to an interactive one, better exploiting TV and video archives, and constructing new video services (video on demand, video editing, etc). On the domestic side, video summarizing can be of great help, as can better management of locally recorded videos, or simple tools to cope with the exponentially growing number of TV channels, which increases the quantity of interesting documents available but makes them really hard to find.
In order to face such new application needs, we propose a multidisciplinary work, gathering in a single team specialists who are able to deal with the various media and aspects of large video collections: image, video, text, sound and speech, but also data analysis, indexing, machine learning... The main goal of this work is to segment, structure, describe, or de-linearize the multimedia content in order to be able to recombine or re-use that content in new conditions. The focus on the document analysis aspect of the problem is an explicit choice, since it is the first mandatory step of any subsequent application, but using the descriptions obtained by the processing tools we develop is also an important goal of our activity.
To summarize our research project in one short sentence, let us say that we would like our computers to be able to watch TV and to use what has been watched and understood in innovative new services. The main challenges to address in order to reach that goal are: the size of the documents and of the document collections to be processed, the necessity to process several media jointly and to obtain a high level of semantics, and the variety of contents, contexts, needs and usages, combined with the difficulty of managing such documents through a traditional interface.
Our own research is organized in three directions: (1) developing advanced algorithms for data analysis, description and indexing, (2) devising new techniques for linguistic information acquisition and use, (3) building new processing tools for audiovisual documents.
Processing multimedia documents usually produces large amounts of descriptive metadata. These metadata can take many different forms, ranging from a simple label taken from a limited list to high-dimensional vectors or matrices of any kind; they can be numeric or symbolic, exact, approximate or noisy. For example, image descriptors are usually vectors whose dimension varies between 2 and 900, while text descriptors are vectors of much higher dimension, up to 100,000, but very sparse. Real-size document collections can produce sets of billions of such vectors.
Most of the operations to be achieved on the documents are in fact translated in terms of operations on their metadata, which appear as key objects to be manipulated. Although their nature is much simpler than the data used to compute them, these metadata require specific tools and algorithms to cope with their particular structure and volume. Our work concerns mainly three domains:
data analysis techniques, possibly coupled with data visualization techniques, to study the structure of large sets of metadata, with applications to classical problems like data classification, clustering, sampling, or modeling,
advanced data indexing techniques in order to speed-up the manipulation of these metadata for retrieval or query answering problems,
description of compressed, watermarked or attacked data.
Natural languages are a privileged way to carry high-level semantic information. Used in speech from an audio track, in textual format or overlaid in images or videos, alone or associated with images, graphics or tables, organized linearly or with hyperlinks, expressed in English, French or Chinese, this linguistic information may take many different forms, but always exhibits a common basic structure: it is composed of sequences of words. Building techniques that preserve the subtle links between these words, their representations with letters or other symbols, and the semantics they carry is a difficult challenge.
As an example, current search engines work at the representation level (they search for sequences of letters) and do not consider the meaning of the searched words. Therefore, they do not use the fact that “bike” and “bicycle” denote a single concept, while “bank” has at least two different meanings (a river bank and a financial institution).
Extracting such high-level information is the goal of our work. First, acquisition techniques that associate pieces of semantics with words and create links between words are still an active field of research. Once this linguistic information is available, its use raises new issues. For example, in search engines, new pieces of information can be stored and the representation of the data can be improved in order to increase the quality of the results.
One of the main characteristics of audiovisual documents is their temporal dimension. As a consequence, they cannot be watched or listened to globally, but only by a linear process that takes some time. On the processing side, these documents often mix several media (image track, sound track, some text) that should be all taken into account to understand the meaning and the structure of the document. They can also have an endless stream structure with no clear temporal boundaries, like on most TV or radio channels. Therefore, there is an important need to segment and structure them, at various scales, before describing the pieces that are obtained.
Our work is organized in three directions. Segmenting and structuring long TV streams (up to several weeks, 24 hours a day) is a first goal that allows us to extract program and non-program segments from these streams. These programs can then be structured at a finer level. Finally, once the structure is extracted, we use linguistic information to describe and characterize the various segments. In all this work, the interaction between the various media is a constant source of difficulty, but also of inspiration.
TexMex has co-organized the MediaEval 2011 evaluation campaign tasks on Violent Scenes Detection and on Spoken Web Search.
TexMex has successfully participated in two tasks of Trecvid 2011, the main benchmark in automatic video analysis and retrieval, organized by the National Institute of Standards and Technology (NIST).
In the Semantic Indexing task, we contributed to the submission of the Quaero consortium, jointly with LIG and the Karlsruhe Institute of Technology. This submission was ranked 3rd out of 19 participants.
In the Copy Detection task, our joint submission with the LEAR project-team was ranked approximately 3rd out of 21 participants with respect to search quality.
Gwénolé Lecorvé was awarded the best Ph.D. award of the French Speech Communication Association.
In most contexts where images are to be compared, a direct comparison is impossible. Images are compressed in different formats, most formats are error-prone, images are re-sized, cropped, etc. The solution consists in computing descriptors, which are invariant to these transformations.
The first description methods associate a unique global descriptor with each image, e.g., a color histogram or correlogram, or a texture descriptor. Such descriptors are easy to compute and use, but they usually fail to handle cropping and cannot be used for object recognition. The most successful approach to address a large class of transformations relies on the use of local descriptors, extracted from regions of interest detected, for instance, by the Harris detector or by the Difference of Gaussians method proposed by David Lowe.
The detectors select a square, circular or elliptic region that is described in turn by a patch descriptor, usually referred to as a local descriptor. The most established description method, namely the SIFT descriptor, was shown to be robust to geometric and photometric transforms. Each local SIFT descriptor captures the gradient directions and intensities within the region of interest, computed in each cell of a 4x4 grid, yielding a 128-dimensional vector.
Local descriptors can be used in many applications: image comparison for object recognition, image copy detection, detection of repeats in television streams, etc. While they are very reliable, local descriptors are not without problems. As many descriptors can be computed for a single image, a collection of one million images generates on the order of a billion descriptors, which is why specific indexing techniques are required. Taking full advantage of these strong descriptors on a large scale is still an open and active problem. A recent trend consists in computing a global descriptor from local ones, as proposed in the so-called bag-of-visual-words approach. Recently, global descriptions computed from local descriptors have been shown successful in breaking the complexity problem. We are active in designing methods that aggregate local descriptors into a single vector representation without losing too much of the discriminative power of the descriptors.
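As an illustration of this pipeline, the following minimal sketch (Python, assuming OpenCV with SIFT support and scikit-learn; the image file names and codebook size are placeholder choices, not values from our systems) extracts local SIFT descriptors and aggregates them into bag-of-visual-words histograms.

```python
# Minimal bag-of-visual-words sketch: extract SIFT local descriptors with OpenCV,
# build a small visual codebook with k-means, and represent each image by a
# histogram of visual-word occurrences. Image paths are placeholders.
import cv2
import numpy as np
from sklearn.cluster import KMeans

def sift_descriptors(path):
    img = cv2.imread(path, cv2.IMREAD_GRAYSCALE)
    sift = cv2.SIFT_create()
    _, desc = sift.detectAndCompute(img, None)
    return desc if desc is not None else np.empty((0, 128), np.float32)

paths = ["img1.jpg", "img2.jpg", "img3.jpg"]          # placeholder image files
all_desc = [sift_descriptors(p) for p in paths]

k = 64                                                # codebook size (toy value)
codebook = KMeans(n_clusters=k, n_init=4, random_state=0)
codebook.fit(np.vstack(all_desc))

def bovw_histogram(desc):
    words = codebook.predict(desc)                    # assign each descriptor to a visual word
    hist = np.bincount(words, minlength=k).astype(float)
    return hist / max(hist.sum(), 1.0)                # L1-normalize the histogram

signatures = np.array([bovw_histogram(d) for d in all_desc])
print(signatures.shape)                               # (n_images, k)
```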
Our work on textual material (textual documents, transcriptions of speech documents, captions in images or videos, etc.) is characterized by a chiefly corpus-based approach, as opposed to an introspective one. A corpus is for us a huge collection of textual documents, gathered or used for a precise objective. We thus exploit specialized (abstracts of biomedical articles, computer science texts, etc.) or non-specialized (newspapers, broadcast news, etc.) collections for our various studies. In TexMex, according to our applications, different kinds of knowledge can be extracted from the textual material. For example, we automatically extract terms characteristic of each successive topic in a corpus with no a priori knowledge; we produce representations of documents in an indexing perspective; we acquire lexical resources from the collections (morphological families, semantic relations, translation equivalences, etc.) in order to better grasp the relations between text segments in which the same idea is expressed with different terms or in different languages.
In the domain of corpus-based text processing, much research has been carried out over the last decade. While most of it essentially relies on statistical methods, symbolic approaches are also attracting growing interest. For our various problems involving language processing, we use both approaches, making the most of existing machine learning techniques or proposing new ones. Relying on the advantages of both families of methods, we aim at developing machine learning solutions that are automatic and generic enough to extract from a corpus the kind of elements required by a given task.
Describing multimedia documents, i.e., documents that contain several modalities (e.g., text, images, sound), requires taking all modalities into account, since they carry complementary pieces of information. The problem is that the various modalities are only weakly synchronized and do not have the same rate, so that combining the information extracted from them is not straightforward. Of course, we would like to find generic ways to combine these pieces of information. Stochastic models appear as a well-suited tool for such combinations, especially for image and sound information.
Markov models are composed of a set of states, transition probabilities between these states, and emission probabilities giving the probability of emitting a given symbol in a given state. Such models can generate sequences: starting from an initial state, they iteratively emit a symbol and then switch to a subsequent state according to the respective probability distributions. These models can also be used the other way around: given a sequence of symbols (called observations), hidden Markov models (HMMs) aim at finding the best sequence of states that explains this sequence. The Viterbi algorithm provides an optimal solution to this problem.
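For illustration, the following toy sketch (Python/NumPy; the matrices are made-up assumptions, not values used in our models) implements Viterbi decoding for a discrete HMM.

```python
# Toy Viterbi decoding for a discrete HMM (illustrative only).
import numpy as np

def viterbi(obs, pi, A, B):
    """obs: observation indices; pi: initial probs (S,);
    A: transitions (S,S); B: emissions (S,V). Returns the most likely state path."""
    S, T = len(pi), len(obs)
    logp = np.full((T, S), -np.inf)
    back = np.zeros((T, S), dtype=int)
    logp[0] = np.log(pi) + np.log(B[:, obs[0]])
    for t in range(1, T):
        scores = logp[t - 1][:, None] + np.log(A)     # (previous state, current state)
        back[t] = scores.argmax(axis=0)
        logp[t] = scores.max(axis=0) + np.log(B[:, obs[t]])
    path = [int(logp[-1].argmax())]
    for t in range(T - 1, 0, -1):                     # backtrack the best path
        path.append(int(back[t, path[-1]]))
    return path[::-1]

pi = np.array([0.6, 0.4])                             # made-up toy parameters
A = np.array([[0.7, 0.3], [0.4, 0.6]])
B = np.array([[0.5, 0.4, 0.1], [0.1, 0.3, 0.6]])
print(viterbi([0, 1, 2, 2], pi, A, B))
```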
For such HMMs, the structure and probability distributions need to be a priori determined. They can be fixed manually (this is the case for the structure: number of states and their topology), or estimated from example data (this is often the case for the probability distributions). Given a document, such an HMM can be used to retrieve its structure from the features that can be extracted. As a matter of fact, these models allow an audiovisual analysis of the videos, the symbols being composed of a video and an audio component.
Two of the main drawbacks of HMMs are that they can only emit a single symbol per state, and that they imply that the duration spent in a given state follows an exponential distribution. Such drawbacks can be circumvented by segment models. These models are an extension of HMMs where each state can emit several symbols and contains a duration model that governs the number of symbols emitted (or observed) in this state. Such a scheme allows us to process features at different rates.
Bayesian networks are an even more general model family. Static Bayesian networks are composed of a set of random variables linked by edges indicating their conditional dependencies. Such models allow us to learn from example data the distributions and the links between the variables. A key point is that both the network structure and the distributions of the variables can be learned. However, such static networks are difficult to use in the case of temporal phenomena.
Dynamic Bayesian networks are a generalization of the previous models. Such networks are composed of an elementary network that is replicated at each time step. Duration variables can be added in order to provide some flexibility in the time processing, as was the case with segment models.
While HMMs and segment models are well suited for dense segmentation of video streams, Bayesian networks offer better capabilities for sparse event detection. Defining a trash state that corresponds to non-event segments is a well-known problem in speech recognition: computing the observation probabilities in such a state is very difficult.
Techniques for indexing multimedia data are needed to preserve the efficiency of search processes as soon as the data to search in becomes large in volume and/or in dimension. These techniques aim at reducing the number of I/Os and CPU cycles needed to perform a search. Multi-dimensional indexing methods either perform exact nearest neighbor (NN) searches or approximate NN-search schemes. Often, approximate techniques are faster as speed is traded off against accuracy.
Traditional multidimensional indexing techniques typically group high-dimensional feature vectors into cells. At query time, only a few such cells are selected for searching, which in turn provides performance gains since each cell contains a limited number of vectors. Cell construction strategies can be classified into two broad categories: data-partitioning indexing methods, which divide the data space according to the distribution of the data, and space-partitioning indexing methods, which divide the data space along predefined lines and store each descriptor in the appropriate cell.
Unfortunately, the “curse of dimensionality” problem strongly impacts the performance of many techniques. Some approaches address this problem by simply relying on dimensionality reduction techniques. Other approaches abort the search process early, after having accessed an arbitrary and predetermined number of cells. Some other approaches improve their performance by considering approximations of cells (with respect to their true geometry for example).
Recently, several approaches have made use of quantization operations. This, somehow, transforms costly nearest neighbor searches in a multidimensional space into efficient uni-dimensional accesses. One seminal approach, the LSH technique, uses a structured scalar quantizer made of projections on segmented random lines, acting as spatial locality-sensitive hash functions. In this approach, several hash functions are used such that co-located vectors are likely to collide in buckets. Other approaches use unstructured quantization schemes, sometimes together with a vector aggregation mechanism to boost performance.
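A minimal sketch of this idea, in the spirit of projection-based LSH (Python/NumPy; the number of tables, projections per table and cell width are arbitrary assumptions, not tuned parameters), is given below.

```python
# Sketch of scalar-quantized random-projection hashing: random lines are segmented
# into cells, and the tuple of cell indices in one table is the hash key.
import numpy as np
from collections import defaultdict

rng = np.random.default_rng(0)
d, n, n_tables, n_proj, w = 128, 10000, 8, 4, 4.0     # dim, #vectors, tables, projections, cell width

data = rng.normal(size=(n, d)).astype(np.float32)
proj = rng.normal(size=(n_tables, n_proj, d))         # a few random lines per hash table
offsets = rng.uniform(0, w, size=(n_tables, n_proj))

def keys(x):
    cells = np.floor((proj @ x + offsets) / w).astype(int)
    return [tuple(row) for row in cells]              # one key per table

tables = [defaultdict(list) for _ in range(n_tables)]
for i, x in enumerate(data):
    for t, k in enumerate(keys(x)):
        tables[t][k].append(i)

query = data[42] + 0.05 * rng.normal(size=d)          # slightly perturbed database vector
candidates = set()
for t, k in enumerate(keys(query)):
    candidates.update(tables[t][k])
# exact re-ranking restricted to the short list of colliding candidates
best = min(candidates, key=lambda i: np.sum((data[i] - query) ** 2))
print(best, len(candidates))
```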
Data Mining (DM) is the core of knowledge discovery in databases, whatever the contents of the databases are. Here, we focus on the aspects of DM that we use to describe documents and to retrieve information. DM has two major goals: description and prediction. The descriptive part includes unsupervised and visualization aspects, while prediction is often referred to as supervised mining.
The description step very often includes feature extraction and dimensional reduction. As we deal mainly with contingency tables crossing "documents and words", we intensively use factorial correspondence analysis. "Documents" in this context can be a text as well as an image.
Correspondence analysis is a descriptive/exploratory technique designed to analyze simple two-way and multi-way tables containing some measure of correspondence between the rows and columns. The results provide information similar in nature to that produced by factor analysis techniques, and they allow one to explore the structure of the categorical variables included in the table. The most common table of this type is the two-way frequency cross-tabulation. There are several parallels in interpretation between correspondence analysis and factor analysis: suppose one could find a low-dimensional space in which to position the row points in a manner that retains all, or almost all, of the information about the differences between the rows. One could then present all the information about the similarities between the rows in a simple 1-, 2- or 3-dimensional graph. The presentation and interpretation of very large tables can greatly benefit from the simplification achieved via correspondence analysis (CA).
One of the most important concepts in CA is inertia, i.e., the dispersion of either the row points or the column points around their center of gravity. The total inertia is directly related to the total Pearson chi-square statistic of the table: it equals the chi-square divided by the grand total of the table.
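The following minimal sketch (Python/NumPy, on a made-up toy contingency table) computes a correspondence analysis by singular value decomposition of the standardized residuals; the squared singular values sum to the total inertia.

```python
# Minimal correspondence analysis sketch: SVD of the standardized residuals of a
# contingency table; the squared singular values sum to the total inertia
# (Pearson chi-square divided by the grand total of the table).
import numpy as np

N = np.array([[30., 10., 5.],                         # toy document-by-word counts
              [ 5., 25., 10.],
              [10.,  5., 20.]])
n = N.sum()
P = N / n
r, c = P.sum(axis=1), P.sum(axis=0)                   # row and column masses
S = (P - np.outer(r, c)) / np.sqrt(np.outer(r, c))    # standardized residuals
U, sv, Vt = np.linalg.svd(S, full_matrices=False)

total_inertia = (sv ** 2).sum()                       # equals chi-square / n
row_coords = (U * sv) / np.sqrt(r)[:, None]           # principal coordinates of the rows
print(total_inertia, row_coords[:, :2])               # keep the first two factorial axes
```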
In the supervised classification task, many algorithms can be used; the most popular ones are decision trees and, more recently, Support Vector Machines (SVMs). SVMs provide very good results in supervised classification, but they are used as "black boxes": their results are difficult to explain. We use graphical methods to help the user understand the SVM results, based on the distribution of the data according to their distance to the separating boundary computed by the SVM, together with another visualization method (such as scatter-plot matrices or parallel coordinates) to try to explain this boundary. Other drawbacks of SVM algorithms are their computational cost and the large memory required to deal with very large datasets. We have developed a set of incremental and parallel SVM algorithms to classify very large datasets on standard computers.
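The sketch below (Python with scikit-learn and matplotlib, on synthetic data) illustrates the kind of view mentioned above: a histogram of the distances of training points to the SVM separating boundary, split by class; it is an illustrative example, not our visualization tool.

```python
# Illustrative view: histogram of decision_function values (distance-like scores to
# the SVM boundary) per class, on a synthetic dataset.
import matplotlib.pyplot as plt
from sklearn.datasets import make_classification
from sklearn.svm import SVC

X, y = make_classification(n_samples=500, n_features=20, random_state=0)
clf = SVC(kernel="rbf", gamma="scale").fit(X, y)
dist = clf.decision_function(X)                       # signed score w.r.t. the boundary

plt.hist(dist[y == 0], bins=30, alpha=0.6, label="class 0")
plt.hist(dist[y == 1], bins=30, alpha=0.6, label="class 1")
plt.axvline(0.0, color="k", linestyle="--")           # the separating boundary
plt.xlabel("decision function value")
plt.legend()
plt.show()
```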
With the proliferation of high-speed Internet access, piracy of multimedia data has developed into a major problem and media distributors, such as photo agencies, are making strong efforts to protect their digital property. Today, many photo agencies expose their collections on the web with a view to selling access to the images. They typically create web pages of thumbnails, from which it is possible to purchase high-resolution images that can be used for professional publications. Enforcing intellectual property rights and fighting against copyright violations is particularly important for these agencies, as these images are a key source of revenue. The most problematic cases, and the ones that induce the largest losses, occur when “pirates” steal the images that are available on the Web and then make money by illegally reselling those images.
This applies to photo agencies, and also to producers of videos and movies. Despite the poor image quality, thousands of (low-resolution) videos are uploaded every day to video-sharing sites and peer-to-peer networks such as YouTube, eDonkey or BitTorrent. In 2005, a study conducted by the Motion Picture Association of America estimated that its members lost 2.3 billion US$ in sales due to video piracy over the Internet. Due to the high risk of piracy, movie producers have tried many means to restrict illegal distribution of their material, albeit with very limited success.
Photo and video pirates have found many ways to circumvent even the most clever protection mechanisms. In order to cover their tracks, stolen photos are typically cropped, scaled, and their colors slightly modified; videos, once ripped, are typically compressed, modified and re-encoded, making them more suitable for easy downloading. Another very popular method for stealing videos is cam-cording, where pirates smuggle digital camcorders into a movie theater and record what is projected on the screen. Once back home, the recording is put on the web.
Clearly, this environment calls for an automatic content-based copyright enforcement system, for images, videos, and also audio as music gets heavily pirated. Such a system needs to be effective as it must cope with often severe attacks against the contents to protect, and efficient as it must rapidly spot the original contents from a huge reference collection.
Existing video archives are still, for the most part, not digitized. The progressive migration to digital television should quickly change this situation. As a matter of fact, the French TV channel TF1 has switched to an entirely digital production chain, cameras remaining the only analog link: processing, editing and broadcasting are digital. In addition, domestic digital set-top boxes can now be equipped with hard disks, allowing an initially modest storage of some ten hours of video, but much more in the long term, up to a thousand hours.
One can distinguish two types of collections: private and professional. On the one hand, the collections of private individuals include recordings of broadcast programs and films recorded with digital camcorders. It is unlikely that users will rigorously manage such collections; thus, there is a great need for tools to help them: automatic creation of summaries and synopses to find information easily or to get, within a few minutes, a general idea of a program. Even if such a service is basic, it is evaluated according to the added value it brings to a device (video recorder, set-top box); it must remain inexpensive, but it will benefit from wide distribution.
On the other hand, there are professional collections: TV channel archives, film clubs, producers... These collections are much larger, but they benefit from the attentive care of documentation and archiving professionals. In this field, the systems can be much more expensive and are judged according to the productivity gains and the assistance they bring to archivists, journalists and users.
A crucial problem for many professionals is the need to produce documents in many formats for various terminals from the same raw material, without multiplying the editing costs. The aim of such repurposing is, for example, to produce a DVD, a web site or a mobile phone alert service from a TV program at minimum cost. The basic idea is to describe the documents in such a way that they can be easily manipulated and reconfigured.
Searching in large textual corpora has already been the topic of much research. The current stakes are the management of very large volumes of data, the possibility to answer queries relying on concepts rather than on the simple presence of words in the texts, and the characterization of sets of texts.
We work on the exploitation of scientific bibliographic databases. The explosion of the number of scientific publications makes the retrieval of relevant data a very difficult task for a researcher. The generalization of document indexing in data banks did not solve the problem. The main difficulty is to choose the keywords that will delimit a domain of interest. The statistical method we use, factorial correspondence analysis, makes it possible to index a document or a whole set of documents and to provide the list of the most discriminating keywords for these documents. The index is validated by searching a database more general than the one used to build the index and by studying the retrieved documents; this generally makes it possible to further reduce the subset of words characterizing a field.
We also explore scientific documentary corpora to solve two different problems: indexing the publications with the help of meta-keys, and identifying the relevant publications in a large textual database. For that, we use factorial data analysis, which allows us to find minimal sets of relevant words, which we call meta-keys, and to free the bibliographic search from the problems of noise (irrelevant documents retrieved) and silence (relevant documents missed). The performance of factorial correspondence analysis is sharply better than that of classical search by logical equations.
The deposit of this software at APP is currently being processed (submitted). The software is available from its homepage, namely
http://
Babaz is an audio database management system with an audio-based search function, intended for audio-based search in video archives.
It is licensed under the terms of the GNU General Public License v3.0.
Joint work with Christian Wengert (Kooaba) and Matthijs Douze (INRIA LEAR and SED project-teams).
This package implements the color descriptor proposed in our ACM Multimedia paper , which improves the previous color histogram representation.
The bag-of-colors software corresponds to two packages:
The (reference) Matlab package, which was co-developed with Christian Wengert and Matthijs Douze ;
The python package was translated from Matlab by Sébastien Campion.
The Matlab version of this package is available on Github at
https://
The python version is available on the gforge INRIA server, and might be available on request.
The software homepage is available at
http://
Bonzaiboost stands for boosting over small decision trees. Bonzaiboost is a general-purpose machine-learning program based on decision trees and boosting for building a classifier from text and/or attribute-value data. Currently, one configuration of bonzaiboost is ranked first on
http://
This software was developed in collaboration with project-team TEMICS (P. Meerwald)
Don Quixotte is a software suite in C for Tardos fingerprinting codes (code generation, collusion, and accusation with single and/or joint decoding).
This software was developed in collaboration with project-team ASPI (F. Cérou, A. Guyader)
Rare Event is a Matlab package for estimating rare event probabilities and extreme quantiles.
This software is jointly maintained by Matthijs Douze, from INRIA Grenoble.
Bigimbaz is a platform originally developed in the Lear project-team, and now co-maintained by TexMex. It integrates several contributions on image description and large-scale indexing: detectors, descriptors, retrieval using bag-of-words and inverted files, and geometric verification.
Graphical interface for tracking visual targets, based on particle filter tracking or on mean-shift. The deposit of this software at APP is currently being processed.
Creation of spatio-temporal mosaics based on dominant motion compensation. The software depends on the Motion2D library, which computes the dominant motion, and then adjusts the images by back-warping. The deposit of this software at APP is currently being processed.
The software homepage is available here:
http://
First APP deposit: IDDN.FR.001.260038.000.S.P.2011.000.40000
PimPy stands for Indexing Multimedia with Python (or Platform for Indexing Multimedia with Python). The aim of this module is to provide a convenient and high-level API to manage common multimedia indexing tasks. It includes several features. It is used, in particular,
to retrieve video features, such as histograms, binarized DCT descriptors, SIFT, SURF, etc. ;
to detect video cuts and dissolves (GoodShotDetector; a generic cut-detection sketch is given after this list) ;
for fast video frame access (pyffas) ;
for raw frame extraction, or video segment extraction and re-encoding ;
to search a video segment in another video (content based retrieval) ;
to perform scene clustering.
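As a generic illustration of the cut-detection feature listed above (this is NOT PimPy's actual API), the following sketch (Python with OpenCV; the threshold and file name are arbitrary assumptions) detects cuts by thresholding frame-to-frame color histogram differences.

```python
# Generic cut detection by thresholding frame-to-frame histogram differences.
import cv2

def detect_cuts(video_path, threshold=0.5):
    cap = cv2.VideoCapture(video_path)
    cuts, prev_hist, idx = [], None, 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        hsv = cv2.cvtColor(frame, cv2.COLOR_BGR2HSV)
        hist = cv2.calcHist([hsv], [0, 1], None, [32, 32], [0, 180, 0, 256])
        hist = cv2.normalize(hist, None).flatten()
        if prev_hist is not None:
            # Bhattacharyya distance is high when consecutive frames differ strongly
            d = cv2.compareHist(prev_hist, hist, cv2.HISTCMP_BHATTACHARYYA)
            if d > threshold:
                cuts.append(idx)
        prev_hist, idx = hist, idx + 1
    cap.release()
    return cuts

print(detect_cuts("some_video.mp4"))                  # placeholder file name
```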
This software is jointly maintained by Matthijs Douze, from INRIA Grenoble.
First APP deposit: IDDN.FR.001.220012.000.S.P.2010.000.10000
A new version of the software at APP is currently being processed.
Pqcodes is a library which implements the approximate k nearest neighbor search method of . This software has been transferred to Technicolor in August 2011.
The deposit of this software at APP is currently being processed.
Implementation of the Geometric Hashing algorithm of to check the geometric consistency between pairs of images.
This software is jointly maintained with Guillaume Gravier.
Samusa enables the detection of speech and/or music segments in multimedia content.
This software is jointly maintained by Matthijs Douze, from INRIA Grenoble.
APP deposit: IDDN.FR.001.220014.000.S.P.2010.000.10000
A new version of the software at APP is currently being processed.
Yael is a C/Python/Matlab library providing optimized implementations (multi-threading, BLAS/LAPACK, low-level optimizations) of computationally demanding functions. In particular, it provides highly optimized functions for k-means clustering and exact nearest neighbor search.
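For reference, the following plain-NumPy baseline (an illustration, not Yael code) shows the kind of primitive that such a library optimizes: exact k-nearest-neighbor search expressed as one matrix product using the expansion of the squared Euclidean distance.

```python
# Exact nearest neighbor baseline using ||q - x||^2 = ||q||^2 + ||x||^2 - 2 q.x.
import numpy as np

def knn(queries, database, k=10):
    q2 = (queries ** 2).sum(axis=1)[:, None]
    x2 = (database ** 2).sum(axis=1)[None, :]
    d2 = q2 + x2 - 2.0 * queries @ database.T         # squared Euclidean distances
    idx = np.argpartition(d2, k, axis=1)[:, :k]       # unordered k smallest per query
    order = np.take_along_axis(d2, idx, axis=1).argsort(axis=1)
    return np.take_along_axis(idx, order, axis=1)     # sorted neighbor indices

rng = np.random.default_rng(0)
base = rng.normal(size=(100000, 128)).astype(np.float32)
q = rng.normal(size=(5, 128)).astype(np.float32)
print(knn(q, base, k=5))
```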
TVSearch is a content-based retrieval search engine used to search for and propagate manual annotations, such as advertisements, in a TV corpus. Based on a binary DCT descriptor, it uses a GPU card to compute exhaustive Hamming distances between the query and the database. For example, searching an 11-second query in 21 days of television (504 hours) takes 9 seconds (i.e., a search rate of 2.3 days per second). TVSearch offers a web service API using the HTTP/REST protocol.
The deposit of this software at APP is currently being processed.
AVSST is an Automatic Video Stream Structuring Tool. First, it allows the detection of repetitions in a TV stream. Second, a machine learning method allows the classification of programs and inter-programs such as advertisements, trailers, etc. Finally, the electronic program guide is synchronized with the right timestamps based on dynamic time warping. A graphical user interface is provided to manage the complete workflow.
Several software programs have been developed in the team over the years:
I-Description (APP deposit number: IDDN.FR.001.270047.000.S.P.2003.000.21000),
Asares is a symbolic machine learning system that automatically infers, from descriptions of pairs of linguistic elements found in a corpus whose components are linked by a given semantic relation, corpus-specific morpho-syntactic and semantic patterns that convey the target relation (IDDN.FR.001.0032.000.S.C.2005.000.20900),
AnaMorpho detects morphological relations between words in many languages (IDDN.FR.001.050022.000.S.P.2008.000.20900),
DiVATex is an audio/video frame server (IDDN.FR.001.320006.000.S.P.2006.000.40000),
NaviTex is a video annotation tool (IDDN.FR.001.190034.000.S.P.2007.000.40000),
Telemex is a web service that enables TV and radio stream recording.
VidSig computes a small and robust video signature (64 bits per image).
VidSeg computes segmentation features such as cuts, dissolves, silences in the audio track, changes of aspect ratio, and monochrome images (IDDN.FR.001.250009.000.S.P.2009.000.40000),
Isec is a web application used as a graphical interface for content-based image search engines.
GPU-KMeans is an implementation of the k-means algorithm on graphics processing units (graphics cards).
Correspondence Analysis computes a factorial correspondence analysis (FCA) for image retrieval.
GPU Correspondence Analysis is an implementation of the previous Correspondence Analysis software on graphics processing units (graphics cards).
CAVIZ is an interactive graphical tool to display and extract knowledge from the results of a Correspondence Analysis on images.
Kiwi (standing for Keywords Extractor) is mostly dedicated to indexing and keyword extraction purposes.
Topic Segmenter is a software dedicated to the topic segmentation of texts and (automatic) transcripts.
S2E (Structuring Events Extractor) is a module which allows the automatic discovery of audiovisual structuring events in videos.
2pac builds classes of words with similar meanings (“semantic classes”), specific to the use that is made of them in a given topic (IDDN.FR.001.470028.000.S.P.2006.000.40000).
Faestos (Fully Automatic Extraction of Sets of keywords for TOpic characterization and Spotting) is a tool composed of a sequence of statistical treatments that extracts, from a morpho-syntactically tagged corpus, sets of keywords characterizing the main topics that corpus deals with (IDDN.FR.001.470029.000.S.P.2006.000.40000).
Fishnet is an automatic grabber of web pages associated with a specific theme.
Match Maker extracts semantic relations by statistical methods.
IRISA News Topic Segmenter (irints) automatically segments speech transcripts into topic-consistent parts.
IRISAphon produces phonetic forms of words.
The gradual migration of television from broadcast diffusion to Internet diffusion offers tremendous possibilities for the generation of rich navigable contents. However, it also raises numerous scientific issues regarding de-linearization of TV streams and content enrichment. In this demonstration, we illustrate how speech in TV news shows can be exploited for de-linearization of the TV stream. In this context, de-linearization consists in automatically converting a collection of video files extracted from the TV stream into a navigable portal on the Internet where users can directly access specific stories or follow their evolution in an intuitive manner.
Structuring a collection of news shows requires some level of semantic understanding of the content in order to segment shows into their successive stories and to create links between stories in the collection, or between stories and related resources on the Web. Spoken material embedded in videos, accessible by means of automatic speech recognition, is a key feature to semantic description of video contents. At IRISA/INRIA Rennes, we have developed multimedia content analysis technology combining automatic speech recognition, natural language processing and information retrieval to automatically create a fully navigable news portal from a collection of video files.
The demonstration was presented in several workshops (Quaero CTC workshop, Journée INRIA Industrie La Télévision du Futur) and a video has been made available online on the portal of the EIT ICT Labs OpenSEM project.
See the demo at
http://
Until 2005, we used various computers to store our data and to carry out our experiments. In 2005, we began some work to specify and set-up dedicated equipment to experiment on very large collections of data. During 2006 and 2007, we specified, bought and installed our first complete platform. It is organized around a very large storage capacity (155TB), and contains 4 acquisition devices (for Digital Terrestrial TV), 3 video servers, and 15 computing servers partially included in the local cluster architecture (IGRIDA).
In 2010, we acquired a new large-memory server with 144GB of RAM, which is used for memory-demanding tasks, in particular to speed up the building of indexes or language models. The previous server dedicated to this kind of job (acquired in 2008) has been upgraded to 96GB of RAM.
A dedicated website was developed in 2009 to provide user support. It contains useful information such as references of available and ready-to-use software on the cluster, the list of corpora stored on the platform, pages for monitoring disk space consumption and cluster load, tutorials for best practices, and cookbooks for the processing of large datasets.
In 2008, we built up a corpus of multimedia data. It consists of a continuous recording (6 months) of two TV channels and three radio stations. It also includes web pages related to these contents, captured on the broadcasters' websites. This corpus is to be used for different studies, like the treatment of news over time, and to provide sub-corpora, like TV news, within the Quaero project (see below). The manual annotation of all the TV programs is in progress.
This platform is funded by a joint effort of INRIA, INSA Rennes and University of Rennes 1.
This is a joint work with the Temics project-team (J. Zepeda and C. Guillemot).
In the context of the ANR project ICOS-HD, which ended in December 2010, and in collaboration with Christine Guillemot from Temics, we investigated sparse representation methods for local image description. We have developed methods for learning dictionaries to be used for sparse signal representations. These methods lead to dictionaries called Iteration-Tuned Dictionaries (ITDs): Basic ITD (BITD), Tree-Structured ITD (TSITD) and Iteration-Tuned and Aligned Dictionaries (ITAD). All three proposed ITD schemes (BITD, TSITD and ITAD) have been shown to outperform state-of-the-art learned dictionaries in terms of PSNR versus sparsity. The performance of these dictionaries has also been assessed for both compression and de-noising applications. ITAD in particular has been used to produce a new image codec that outperforms JPEG2000 for a fixed image class, and led in 2011 to two new publications.
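For readers unfamiliar with sparse representations, the sketch below illustrates generic dictionary learning and sparse coding with scikit-learn (it is NOT the ITD/ITAD schemes above; the patch size, dictionary size and sparsity level are arbitrary toy choices).

```python
# Generic sparse-coding sketch: learn a dictionary on (stand-in) image patches and
# encode each patch as a sparse combination of a few atoms via orthogonal matching pursuit.
import numpy as np
from sklearn.decomposition import DictionaryLearning

rng = np.random.default_rng(0)
patches = rng.normal(size=(2000, 64))                 # stand-ins for 8x8 image patches

learner = DictionaryLearning(n_components=128, transform_algorithm="omp",
                             transform_n_nonzero_coefs=5, max_iter=10, random_state=0)
codes = learner.fit_transform(patches)                # sparse codes, 5 atoms per patch
D = learner.components_                               # the learned dictionary (128 atoms)

reconstruction = codes @ D
err = np.linalg.norm(patches - reconstruction) / np.linalg.norm(patches)
print(codes.shape, D.shape, err)                      # relative reconstruction error
```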
This is joint work with Christian Wengert (Kooaba) and Matthijs Douze (INRIA LEAR and SED project-teams).
This work investigates the use of color information when used within a state-of-the-art large scale image search system. We introduce a simple color signature generation procedure, used either to produce global or local descriptors. As a global descriptor, it outperforms several state-of-the-art color description methods, in particular the bag-of-words method based on color SIFT. As a local descriptor, our signature is used jointly with SIFT descriptors (no color) to provide complementary information.
This is joint work with Matthijs Douze (INRIA LEAR and SED project-teams), Patrick Pérez (Technicolor), Florent Perronnin (Xerox Research Center Europe) and Cordelia Schmid (INRIA LEAR).
This work addresses the problem of large-scale image search and consolidates and extends results from a previous work . Different ways of aggregating local image descriptors into a vector are compared, and the Fisher vector is shown to achieve better performance than the reference bag-of-visual-words approach for any given vector dimension. We then jointly optimize dimensionality reduction and indexing in order to obtain a precise vector comparison as well as a compact representation. The evaluation shows that the image representation can be reduced to a few dozen bytes. Searching a 100 million image dataset takes about 250 ms on one processor core.
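To make the aggregation idea concrete, the sketch below shows one simple member of this family, a VLAD-like accumulation of residuals to a small visual vocabulary (Python with NumPy and scikit-learn; it is an illustration, not the Fisher-vector pipeline evaluated in the paper, and the vocabulary size is a toy choice).

```python
# VLAD-style aggregation: assign each local descriptor to its nearest centroid,
# accumulate residuals per centroid, then power-law and L2 normalize.
import numpy as np
from sklearn.cluster import KMeans

def vlad(local_descriptors, centroids):
    k, d = centroids.shape
    assign = ((local_descriptors[:, None, :] - centroids[None]) ** 2).sum(-1).argmin(1)
    v = np.zeros((k, d))
    for i, x in zip(assign, local_descriptors):
        v[i] += x - centroids[i]                      # accumulate residuals per centroid
    v = v.flatten()
    v = np.sign(v) * np.sqrt(np.abs(v))               # power-law (signed square root) normalization
    return v / max(np.linalg.norm(v), 1e-12)          # L2 normalization

rng = np.random.default_rng(0)
train = rng.normal(size=(5000, 128))                  # stand-ins for SIFT descriptors
centroids = KMeans(n_clusters=16, n_init=4, random_state=0).fit(train).cluster_centers_
image_desc = rng.normal(size=(300, 128))
print(vlad(image_desc, centroids).shape)              # (16 * 128,) = (2048,)
```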
This is a joint work with Björn Þór Jónsson and Grímur Tómasson from the School of Computer Science, Reykjavik University, Iceland.
Since the introduction of personal computers, personal collections of digital media have been growing ever larger. It is therefore increasingly important to provide effective browsing tools for such collections. We have proposed a multi-dimensional model for media browsing, called ObjectCube, based on the multi-dimensional model commonly used in OLAP applications. We implemented a prototype of a media browser based on the ObjectCube model. We then ran evaluations of its performance using three different underlying data stores and photo collections of up to one million photos.
Textual data can easily be transformed into frequency tables, and any method working on contingency tables can then be used to process them. Besides, given the large amount of textual data available, we need to find convenient ways to process the data and to extract valuable information. It appears that factorial correspondence analysis allows us to capture most of the information included in the data. We have started exploring temporal changes in textual data and mainly focus on the visualization of results: we try to detect the topics, if they have not already been identified, and to study the evolution of the vocabulary inside a topic through time. In fact, as with economic datasets, we try to find seasonal and cyclic components in the documents and to characterize these components.
Support Vector Machines (SVMs) and kernel methods are known to provide accurate models, but the learning task usually requires solving a quadratic program, so that, for very large datasets, it requires a large memory capacity and a long time. We have developed new algorithms. The first versions were based on a CPU-distributed software program; we have then used GP-GPU (General-Purpose GPU) versions to significantly improve the speed (130 times faster than the CPU version, 2500 times faster than libSVM, SVMPerf or CB-SVM). We have extended the least squares SVM algorithm (LS-SVM) to datasets having a very large number of dimensions, and have applied boosting to LS-SVM for datasets having simultaneously a very large number of vectors and of dimensions, on standard computers. In image classification, the usual framework involves three steps: feature extraction, codebook construction by feature quantization, and training of a classifier with a standard classification algorithm (e.g., SVM). However, the task becomes very complex when this approach is applied to large-scale datasets like ImageNet, which contains more than 14 million images and 21,000 classes. The complexity concerns both the time needed to perform each task and the memory and disk usage (e.g., 11TB are needed to store the SIFT descriptors computed on the full dataset). Efficient algorithms must therefore be used in all three steps: the descriptors computed for one image are independent of those of other images, so they can be computed in parallel; the quantization step usually relies on a k-means algorithm, and we have developed different versions of parallel k-means to run on GPUs or on a cluster of CPUs; for the learning task, we have developed a parallel version of LibSVM. The first results on the ten largest classes of the ImageNet dataset are promising: we have developed a fast and efficient framework for large-scale image classification.
Over the years, the level of maturity reached by content-based retrieval systems (CBRSs) has significantly increased. CBRSs have so far been used in very friendly settings where cultural enrichment is paramount. CBRSs are also used in quite different settings where the control, surveillance and filtering of multimedia information are central, such as copyright enforcement systems. While an abundant literature shows that today's CBRSs are robust against general-purpose attacks, we address in this work the security of content-based retrieval systems. Because of our expertise, we focus on the security of content-based image retrieval, where images are described by SIFT descriptors and indexed with an NV-Tree. We showed in a preliminary study that a real system fails to match a specifically attacked image and its quasi-copy, breaking its otherwise excellent copyright protection performance. After proposing specific attacks aiming to disturb the descriptor detection stage, both by preventing some key-points from being detected and by creating new ones, we pursued the work by considering attacks dedicated to the description computation stage.
A key issue in watermarking and fingerprinting applications is to satisfy the requirement on the probability of false detection or false accusation. Assume commercial contents are encrypted and watermarked and that future consumer electronics storage devices have a watermark detector. These devices refuse to record a watermarked content since it is copyrighted material. The probability of false alarm is the probability that the detector considers an original piece of content (which has not been watermarked) as protected. The movie that a user shot during his holidays could be rejected by his storage device. This absolutely non user-friendly behavior really scares consumer electronics manufacturers.
In fingerprinting, users' identifiers are embedded in purchased contents. When such a content is found in an illegal place (e.g., a P2P network), the copyright holders decode the hidden message, find an identifier, and can thus trace the traitor, i.e., the customer who illegally broadcast his copy. However, the task is not that simple because dishonest users might collude. For security reasons, anti-collusion codes have to be employed. Yet, these solutions have a non-zero probability of error (defined as the probability of accusing an innocent). This probability should of course be extremely low, but it is also a very sensitive parameter: anti-collusion codes get longer (in terms of the number of bits to be hidden in the content) as the probability of error decreases. Fingerprint designers have to strike a trade-off, which is hard to conceive when only a rough estimation of the probability of error is known. The major issue for fingerprinting algorithms is the fact that embedding long sequences also implies assessing reliability on a huge amount of data, which may be practically unachievable without using rare event analysis.
In collaboration with the team-projects ASPI and ALEA, we developed a novel strategy for simulating rare events and an associated Monte Carlo estimation of tail probabilities. Our method uses a system of interacting particles and exploits a Feynman-Kac representation of that system to analyze their fluctuations. Our precise analysis of the variance of a standard multilevel splitting algorithm reveals an opportunity for improvement. This leads to a novel method that relies on adaptive levels and produces, in the limit of an idealized version of the algorithm, estimates with optimal variance. Some numerical results show performance close to the idealized version of our technique for these practical applications. This work has been published in the journal Statistics and computing . Algorithms for estimating extreme probabilities and quantiles are implemented as a Matlab package.
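The following toy Python sketch (assuming NumPy and SciPy; it is an illustration of the interacting-particle, adaptive-level idea, not the published algorithm or the Matlab package) estimates the tail probability P[X > 4] for a standard Gaussian, whose exact value is known, by adaptive multilevel splitting.

```python
# Toy adaptive multilevel splitting for a rare-event probability.
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)
N, m, L, rho = 1000, 100, 4.0, 0.9                    # particles, killed per step, target level

x = rng.normal(size=N)                                # initial particles ~ N(0,1)
prob = 1.0
while True:
    level = np.sort(x)[m - 1]                         # adaptive level: m-th smallest score
    if level >= L:
        break
    prob *= (N - m) / N                               # survival fraction at this level
    survivors = x[x > level]
    clones = rng.choice(survivors, size=N - survivors.size)
    x = np.concatenate([survivors, clones])
    # a few Metropolis moves leaving N(0,1) conditioned on {x > level} invariant
    for _ in range(10):
        prop = rho * x + np.sqrt(1 - rho ** 2) * rng.normal(size=N)
        x = np.where(prop > level, prop, x)

prob *= np.mean(x > L)                                # final fraction above the target level
print(prob, 1.0 - norm.cdf(L))                        # estimate vs exact value (~3.2e-5)
```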
So far, the accusation process of a Tardos fingerprinting code is based on single decoders, which compute a score per user. Users with the highest score, or whose scores are above a threshold, are then deemed guilty. In the past years, we have contributed to this approach with two improvements: the `learn and match' strategy aims at estimating the collusion process and using the matched score function; a rare event analysis translates this score into a more meaningful probability of being guilty. A fast implementation computes the scores of one million users within 0.2 second on a regular laptop. Therefore, contrary to common belief, a single decoder, although exhaustive with a complexity linear in the number of users, remains tractable in practice.
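For illustration, the toy sketch below (Python/NumPy) generates a probabilistic Tardos-like code, simulates a simple majority-vote collusion, and computes a symmetric single-decoder score for every user; the score function used is the well-known symmetric variant, and all parameter values are arbitrary toy choices rather than the settings used in our work.

```python
# Toy Tardos fingerprinting: code generation, majority-vote collusion, symmetric single-decoder scores.
import numpy as np

rng = np.random.default_rng(0)
n_users, m, t = 200, 4096, 0.01                       # users, code length, cutoff (toy values)

# biases p_i drawn from the arcsine-like Tardos distribution restricted to [t, 1-t]
theta = rng.uniform(np.arcsin(np.sqrt(t)), np.arcsin(np.sqrt(1 - t)), size=m)
p = np.sin(theta) ** 2
X = (rng.random((n_users, m)) < p).astype(int)        # user codewords

colluders = [3, 57, 120]
Y = (X[colluders].sum(axis=0) * 2 > len(colluders)).astype(int)   # majority-vote pirated copy

# symmetric single-decoder score: one score per user, linear in the number of users
g1, g0 = np.sqrt((1 - p) / p), -np.sqrt(p / (1 - p))
scores = np.where(Y == 1, np.where(X == 1, g1, g0),
                          np.where(X == 0, -g0, -g1)).sum(axis=1)
print(np.argsort(scores)[-3:])                        # highest scores should point to the colluders
```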
This fast implementation allows us to propose iterative decoders. A first idea is that conditioning on the identities of some colluders brings more discriminative power to the score function. The first iteration is thus a single decoder; users we are extremely confident to accuse are enrolled as side information. The next iteration computes new scores for the remaining users, and so on. A second idea is that information theory proves that a joint decoder, computing scores for pairs, triplets, or in general subsets of users, is more powerful than a single decoder.
A key assumption of the fingerprinting schemes developed so far is that the colluders may know their own codewords, but they do not know the codewords of the other, innocent users. Otherwise, the collusion can very easily forge a pirated content framing an innocent user because it contains a sequence close enough to his/her codeword. This puts a lot of pressure on the versioning mechanism which creates the personal copy of the content in accordance with a codeword. For instance, suppose that the versioning is done in the user's set-top box, the unique codeword being loaded into this device at manufacture time. If the code matrix ends up in the hands of an untrustworthy employee, then the whole fingerprinting system is pulled down. This is one of the motivations for designing cryptographic protocols for the construction, the versioning and the accusation. We have proposed a new asymmetric fingerprinting protocol dedicated to the state-of-the-art Tardos codes. We believe that this is the first such protocol and that it is practically efficient. The construction of the fingerprints and their embedding within pieces of content are based on oblivious transfer and do not need a trusted third party. Note, however, that during the accusation stage a trusted third party, such as a judge, is necessary, as in any asymmetric fingerprinting scheme we are aware of. This work was done in collaboration with the team-project TEMICS, Lab-STICC Telecom Bretagne and University College London, and was presented at Information Hiding. Ana Charpentier defended her Ph.D. thesis in October 2011.
We show that an image can be approximately reconstructed based on the output of a black-box local description software such as those classically used for image indexing. Our approach consists first in using an off-the-shelf image database to find patches which are visually similar to each region of interest of the unknown input image, according to the associated local descriptors. These patches are then warped into the input image domain according to the interest region geometry and seamlessly stitched together. Final completion of the still missing texture-free regions is obtained by smooth interpolation. As demonstrated in our experiments, visually meaningful reconstructions are obtained just from image local descriptors like SIFT, provided the geometry of the regions of interest is known. The reconstruction most often allows a clear interpretation of the semantic image content. As a result, this work raises critical issues of privacy and rights when local descriptors of photos or videos are given away for indexing and search purposes.
This is a joint work with Björn Þór Jónsson from the School of Computer Science, Reykjavik University, Iceland and with Herwig Lejsek, Videntifier Technologies, Iceland.
We have further improved the NV-Tree (Nearest Vector Tree) indexing technique. It addresses the specific, yet important, problem of efficiently and effectively finding the approximate nearest neighbors of query descriptors in very large high-dimensional collections.
Dynamic Time Warping (DTW) is the most popular approach for evaluating the similarity of time series, but its computation is costly. Therefore, simple functions lower bounding the DTW distance have been designed, accelerating searches by quickly pruning sequences that could not possibly be best matches. The tighter the bounds, the more they prune and the better the performance. Designing new functions that are even tighter is difficult because their computation is likely to become complex, canceling the benefits of their pruning. It is possible, however, to design simple functions with a higher pruning power by relaxing the no-false-dismissal assumption, resulting in approximate lower bound functions. We have discovered how very popular approaches accelerating DTW, such as LB_Keogh and LB_PAA, can be made more efficient via approximations. The accuracy of the approximations can be tuned, ranging from no false dismissals to potential losses when aggressively set for large response time savings. At very large scale, indexing time series is mandatory; these approximate lower bound functions can also be used within existing time series indexing schemes.
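For reference, the sketch below (Python/NumPy) implements the classical (exact) LB_Keogh lower bound: an envelope is built around the query with a Sakoe-Chiba band of radius r, and the bound sums the squared amounts by which the candidate leaves that envelope. The series and the radius are toy values.

```python
# LB_Keogh lower bound on the DTW distance (exact, no-false-dismissal version).
import numpy as np

def lb_keogh(query, candidate, r):
    n = len(query)
    lower = np.array([query[max(0, i - r):i + r + 1].min() for i in range(n)])
    upper = np.array([query[max(0, i - r):i + r + 1].max() for i in range(n)])
    above = np.maximum(candidate - upper, 0.0)        # candidate exceeds the upper envelope
    below = np.maximum(lower - candidate, 0.0)        # candidate falls below the lower envelope
    return np.sqrt(((above + below) ** 2).sum())      # <= true DTW distance under the same band

rng = np.random.default_rng(0)
q = np.sin(np.linspace(0, 6, 200)) + 0.1 * rng.normal(size=200)
c = np.sin(np.linspace(0.3, 6.3, 200))
print(lb_keogh(q, c, r=10))
```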
We have proposed an improved asymmetric Hamming Embedding scheme for large-scale image search based on local descriptors. The comparison of two descriptors relies on a vector-to-binary code comparison, which limits the quantization error associated with the query compared with the original Hamming Embedding method. The approach is used in combination with an inverted file structure that offers high efficiency, comparable to that of a regular bag-of-features retrieval system, and consistently improves the search quality over the symmetric version on the two datasets used for the evaluation.
Part of this work on this topic was done in cooperation with Matthijs Douze and Cordelia Schmid (INRIA/ Lear).
An extension of our previous work on source coding techniques for high-dimensional indexing has been proposed . The goal is to index a large set of vectors, as large as 1 billion vectors, with limited CPU and memory usage. Based on the product quantization-based indexing technique , we show that it is interesting to add an additional level of processing to refine the estimated distances. It consists in quantizing the difference vector between a point and the corresponding centroid. When combined with an inverted file, this gives three levels of quantization. Experiments performed on SIFT and GIST image descriptors show excellent search accuracy, outperforming three state-of-the-art approaches. Compared with the original work , the proposed re-ranking technique is shown to obtain a better trade-off between memory, efficiency and search quality.
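To fix ideas, the sketch below (Python with NumPy and scikit-learn) illustrates the basic product quantization principle on which this work builds: vectors are split into sub-vectors, each quantized with its own small codebook, and query-to-code distances are computed asymmetrically. It is a minimal illustration, not the indexed and re-ranked system described above; all sizes are toy choices.

```python
# Minimal product quantization: per-sub-space codebooks and asymmetric distance computation.
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
d, m, ks = 128, 8, 256                                # dim, sub-quantizers, centroids per sub-space
sub = d // m

train = rng.normal(size=(20000, d)).astype(np.float32)
codebooks = [KMeans(n_clusters=ks, n_init=1, random_state=0)
             .fit(train[:, j * sub:(j + 1) * sub]) for j in range(m)]

def encode(x):                                        # one byte per sub-vector
    return np.array([cb.predict(x[None, j * sub:(j + 1) * sub])[0]
                     for j, cb in enumerate(codebooks)], dtype=np.uint8)

def adc_distance(query, code):
    # distance estimated between the raw query and the quantized database vector
    return sum(np.sum((query[j * sub:(j + 1) * sub]
                       - codebooks[j].cluster_centers_[code[j]]) ** 2)
               for j in range(m))

x = train[7]
code = encode(x)                                      # 8 bytes instead of 512 bytes
print(adc_distance(x + 0.05 * rng.normal(size=d), code))
```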
Following recent work on Hamming Embedding techniques, we have proposed a binarization method that aims at addressing the problem of nearest neighbor search for the Euclidean metric by mapping the original vectors into binary vectors, which are compact in memory and for which the distance computation is more efficient.
Our method is based on the recent concept of anti-sparse coding, which exhibits excellent performance for approximate nearest neighbor search. Unlike other binarization schemes, this framework allows, up to a scaling factor, the explicit reconstruction of the original vector from its binary representation. We also show that the random projections used in Locality Sensitive Hashing algorithms are significantly outperformed by regular frames for both synthetic and real data when the number of bits exceeds the vector dimensionality, i.e., when high precision is required.
This is a joint work with Björn Þór Jónsson from the School of Computer Science, Reykjavik University, Iceland.
The scale of multimedia data collections is expanding at a very fast rate. In order to cope with this growth, the high-dimensional indexing methods used for content-based multimedia retrieval must adapt gracefully to secondary storage. Recent progress in storage technology, however, means that algorithm designers must now cope with a spectrum of secondary storage solutions, ranging from traditional magnetic hard drives to state-of-the-art solid state disks. We have analyzed the impact of storage technology on a simple, prototypical high-dimensional indexing method for large scale query processing. We found that while the algorithm implementation deeply impacts the performance of the indexing method, the setup of the underlying storage technology is equally important.
This work is done in the framework of the Quaero project (see below).
On this subject, TexMex is involved in three tasks of the Quaero project.
The first task concerns the extraction of terminology from documents. The objective of this work is to study the development and the adaptation of methods to automate the acquisition and the structuring of terminologies. In this context, in 2011, we took part in a new evaluation of terminology extraction systems. Here again, our system, relying on TermoStat (see previous reports), ranked first for the tracks in which we participated. We have also continued our work on the use of morphology for biomedical terminologies. This approach relies on the decomposition of terms into morphemes and on the translation of these morphemes into Japanese (kanji) sub-words. The kanji characters offer a way to access the semantics of the morphemes and allow us to detect semantic relations between them. We have tested this approach on more languages and have shown its relevance for information retrieval problems.
The second task aims at extracting semantic and ontological relations from documents. Indeed, detecting semantic and ontological relations in texts is key to describing a domain and thus to manipulating documents intelligently. In 2011, we developed a new relation extraction system based on k-nearest neighbors and language modeling. It was tested in the framework of the Quaero evaluation campaign and ranked first or second on all tracks. We have also developed a clustering technique for named entities. It relies on new representation schemes called bags-of-vectors (or bags-of-bags-of-features), which perform better than the classical bag-of-words approach.
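The bag-of-vectors idea can be sketched as follows (the data, the aggregation function and the use of TF-IDF vectors are illustrative assumptions, not the exact representation used in the system): each entity keeps one vector per occurrence context, and two entities are compared by aggregating the best matches between their two bags, which preserves distinctions that a single merged bag-of-words vector would blur.

# Toy sketch of a bag-of-vectors comparison between named entities.
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Each entity comes with the list of contexts in which it occurs.
entities = {
    "Paris":  ["capital of France", "city on the Seine"],
    "Lyon":   ["city in France", "capital of the Rhone region"],
    "Python": ["programming language", "language with dynamic typing"],
}

all_contexts = [c for ctxs in entities.values() for c in ctxs]
vectorizer = TfidfVectorizer().fit(all_contexts)
bags = {e: vectorizer.transform(ctxs) for e, ctxs in entities.items()}

def bag_similarity(b1, b2):
    """Average best-match similarity between two bags of context vectors."""
    sims = cosine_similarity(b1, b2)
    return 0.5 * (sims.max(axis=1).mean() + sims.max(axis=0).mean())

print(bag_similarity(bags["Paris"], bags["Lyon"]), ">",
      bag_similarity(bags["Paris"], bags["Python"]))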
The last task directly deals with the semantic annotation of multimedia documents based on textual data, since textual or language-related data can very often be found in multimedia documents or come along with them. For example, a TV broadcast contains speech that can be transcribed, and comes with Electronic Program Guide data, standard program guide information, closed captions, associated websites, etc. All these sources offer complementary information that can be used to semantically annotate multimedia documents. During this year, we finished the development of a football multimedia corpus. It contains the videos of several matches, their speech transcripts, and associated textual data from specialized websites. All these media have been manually annotated in terms of events, named entities, specialized relations (fouls, substitutions, etc.) and other relevant information. This corpus will be distributed under the LGPL-LR license.
This work is done in the context of a joint TexMex/Orange Ph.D. thesis supported by a CIFRE grant with Orange Labs.
We aim at helping multimedia content understanding by taking advantage of textual clues embedded in digital video data. In 2011, we proposed an Optical Character Recognition-based method to recognize natural scene text in images, avoiding the conventional character segmentation step. The text image is scanned with multi-scale windows and a robust recognition model, relying on a neural classification approach, is applied to each window to reject invalid characters and recognize valid ones. A graph model is used to represent spatial constraints between recognition results and to determine the best sequence of characters. Some linguistic knowledge is also incorporated in the graph to remove errors due to recognition confusions. The method was evaluated on the ICDAR 2003 database of scene text images and outperforms state-of-the-art approaches. This work will be presented at DAS 2012.
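The graph-based decoding step can be illustrated with a toy example (the window scores, the reject class and the bigram model below are invented, and the search is a brute-force path enumeration rather than the actual graph optimization): each window proposes scored character hypotheses, and the best character sequence is obtained by combining recognition and language-model scores along the paths.

# Toy decoding over sliding-window hypotheses (invented scores and bigrams).
import math

# Hypothetical per-window recognition scores, with a reject class for
# windows that do not contain a valid character.
window_scores = [
    {"c": 0.7, "e": 0.2, "<rej>": 0.1},
    {"a": 0.5, "o": 0.4, "<rej>": 0.1},
    {"t": 0.6, "f": 0.3, "<rej>": 0.1},
]
bigram = {("c", "a"): 0.5, ("a", "t"): 0.5, ("c", "o"): 0.1, ("o", "t"): 0.1}

def best_sequence(windows, lm, lm_weight=1.0, floor=1e-3):
    """Combine recognition and bigram scores; rejected windows are skipped."""
    paths = {(): 0.0}                               # sequence -> log-score
    for scores in windows:
        new_paths = {}
        for seq, logp in paths.items():
            for char, p in scores.items():
                if char == "<rej>":
                    cand, extra = seq, math.log(p)
                else:
                    trans = lm.get((seq[-1], char), floor) if seq else 1.0
                    cand = seq + (char,)
                    extra = math.log(p) + lm_weight * math.log(trans)
                if logp + extra > new_paths.get(cand, -math.inf):
                    new_paths[cand] = logp + extra
        paths = new_paths
    return max(paths.items(), key=lambda kv: kv[1])

print(best_sequence(window_scores, bigram))         # best sequence: c, a, t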
Christian Raymond and Vincent Claveau participated in the DEFT evaluation campaign (http://).
Our work on this topic is done in close collaboration with Olivier Pivert from the Pilgrim project-team of IRISA Lannion.
Database (DB) querying mechanisms, and more particularly the division of relations, were at the origin of the Boolean model for Information Retrieval Systems (IRSs). This model quickly showed its limitations and is no longer used in Information Retrieval (IR). One reason is that the Boolean approach cannot represent and exploit the relative importance of the terms indexing the documents or representing the queries. However, this notion of importance can be captured by the division of fuzzy relations. This division, modeled by fuzzy implications, corresponds to graded inclusions. Theoretical work conducted by the Pilgrim project-team has shown the interest of this operator for IR.
Our first work was to investigate the use of graded inclusions to model the information retrieval process. In this framework, documents and queries are represented by fuzzy sets, which are compared with operations like fuzzy implications and T-norms. Through different experiments, we have shown that only some operations, among the wide range of fuzzy operations, are relevant for information retrieval. When appropriate settings are chosen, it is possible to mimic classical systems, yielding results that rival those of state-of-the-art systems. These positive results have validated the proposed approach, while the negative ones have given some insights into the properties needed by such a model.
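A minimal sketch of this graded-inclusion view of retrieval is given below (the Goguen implication and the min aggregation are chosen purely for illustration, as only one of the many possible fuzzy operations): documents and queries are fuzzy sets of weighted terms, and a document is scored by the degree to which the query is included in it.

# Toy graded-inclusion scoring of a document against a query.
def goguen(a, b):
    """Goguen fuzzy implication I(a, b)."""
    return 1.0 if a <= b else b / a

def graded_inclusion(query, doc, implication=goguen):
    """Degree to which the query fuzzy set is included in the document one."""
    degrees = [implication(w, doc.get(term, 0.0)) for term, w in query.items()]
    return min(degrees) if degrees else 0.0

query = {"video": 1.0, "indexing": 0.6}
doc_a = {"video": 0.9, "indexing": 0.7, "speech": 0.2}
doc_b = {"video": 0.3, "speech": 0.8}

print(graded_inclusion(query, doc_a))   # high: both query terms well covered
print(graded_inclusion(query, doc_b))   # low: "indexing" is missing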
More recently, the links between our fuzzy model and other classical IR models have been studied. It has been shown that our fuzzy implication-based model can be interpreted as several classical models: an Extended Boolean Model, a Logical Model, a Vector Space Model or a Language Model in IR.
Automatic speech recognition outputs are by nature incomplete and uncertain, so much so that lexical indexes of speech cannot overcome the errors due to out-of-vocabulary words and to most named entities, which carry important semantic information of the discourse. Falling back on a phonetic index makes it possible to partially retrieve the mis-recognized words, but at the price of a lower precision, because the phonetic representation is also noisy. We proposed this year (still to be submitted) an indexing method which jointly models the lexical and phonetic levels with finite-state transducers, offering the possibility of taking a lexical path or a phonetic path between two synchronization nodes. The edges are weighted by a vector of features (edit scores, confidence measures, durations) that is used in a supervised manner to estimate the reliability of the returned results at search time. The experiments have shown the complementarity of the lexical and phonetic representations and their contribution to a task of spoken utterance retrieval using named entity queries.
We work on the issue of structuring large TV streams. More precisely, we focus on the problem of labeling the segments of a stream according to their type (e.g., programs, commercial breaks, sponsoring). Contrary to existing techniques, we wanted to take into account the sequential aspect of the data, and thus we used Conditional Random Fields (CRFs), a classifier which has proved useful for handling sequential data in other domains such as computational linguistics or computational biology. During this year, we demonstrated the relevance of CRFs for TV segment labeling. We conducted different experiments, on both manually and automatically segmented streams, with different label granularities, and showed that this approach rivals existing ones. The use of this model for semi-supervised and unsupervised learning is under study.
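The sequence-level formulation can be sketched as follows (segments, features and labels are invented for illustration, and the sklearn-crfsuite package is used here only as a convenient CRF implementation): each day of stream is a sequence of segment descriptors, and the CRF predicts one label per segment while exploiting transitions between neighboring segments.

# Toy CRF labeling of a day of stream segments (invented features/labels),
# using the sklearn-crfsuite package.
import sklearn_crfsuite

def segment_features(seg):
    """Turn one segment (raw attributes) into a CRF feature dictionary."""
    return {
        "duration": seg["duration"],
        "monochrome_frames": float(seg["monochrome"]),
        "silence_ratio": seg["silence"],
        "hour_of_day": seg["hour"],
    }

day = [
    {"duration": 1800, "monochrome": False, "silence": 0.05, "hour": 20},
    {"duration": 15,   "monochrome": True,  "silence": 0.30, "hour": 20},
    {"duration": 20,   "monochrome": True,  "silence": 0.35, "hour": 20},
    {"duration": 2400, "monochrome": False, "silence": 0.04, "hour": 21},
]
labels = ["program", "commercial", "commercial", "program"]

X_train = [[segment_features(s) for s in day]]      # one training sequence
y_train = [labels]

crf = sklearn_crfsuite.CRF(algorithm="lbfgs", c1=0.1, c2=0.1,
                           max_iterations=100, all_possible_transitions=True)
crf.fit(X_train, y_train)
print(crf.predict(X_train))     # one label per segment, here on training data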
This work was performed in close collaboration with Technicolor as an external partner.
Following our work on the detection of audio concepts related to violence in movie soundtracks, we developed a system for the detection of violent scenes in movies, combining multimodal features. We investigated multimodal fusion strategies and temporal integration exploiting Bayesian networks as a joint distribution model. Several strategies for learning the structure of the Bayesian networks were compared, resulting in a complete system for violence detection. The system was evaluated on the Violent Scenes Detection task of the MediaEval 2011 international evaluation that we co-organized with Technicolor and the University of Geneva. A fair amount of time was dedicated this year to the organization of the evaluation campaign, which included defining the task and metrics, supervising the annotation, recruiting participants, analyzing the results and organizing the corresponding workshop session.
This work on audio content discovery was partially carried out in collaboration with Armando Muscariello and Frédéric Bimbot from the Metiss project-team.
As an alternative to supervised approaches for multimedia content analysis, where predefined concepts are searched for in the data, we investigate content discovery approaches where knowledge emerges from the data. Following this general philosophy, we pursued work on motif discovery in audio and video content.
Audio motif discovery is the task of finding, without any prior knowledge, all pieces of signal that repeat, possibly allowing for some variability. In 2011, we extended our recent work on seeded discovery to near-duplicate detection and spoken document retrieval from examples. First, we proposed algorithmic speed-ups for the discovery of near-duplicate motifs (low variability) in large (several days long) audio streams, exploiting subsampling strategies. Second, we investigated the use of previously proposed efficient pattern matching techniques to deal with motif variability in speech data in a different setting, that of spoken document retrieval from an audio example. We demonstrated the potential of model-free approaches for efficient spoken document retrieval on a variety of data sets, in particular in the framework of the Spoken Web Search task of the MediaEval 2011 international evaluation.
Video structure is often enforced through editing rules which result in a set of shots defining an event that repeats throughout the video with high visual and audio similarity. Typical examples of such shots are anchorperson shots and close-ups on guests in talk shows. We recently proposed an unsupervised multimodal approach to discover such events, exploiting the audio and visual consistency between two sets of independent nested clusters, one for each modality. In 2011, we extended the approach in two directions. First, we improved the selection of consistent audio and visual clusters and the unsupervised selection of positive and negative examples, exploiting redundancy between nested clusters. Second, we extended the method to discover several audio-visually consistent events rather than the single one of our previous work, thus enabling the use of unsupervised mining as a pre-processing step for video structure analysis.
Our work on this topic is done in close collaboration with Sébastien Lefèvre from the Seaside project-team of IRISA Vannes.
Segmenting a program into topics is an important step for fine-grained structuring of TV streams. Based on our work on vectorization (see previous reports), we have developed a new segmentation technique using speech transcripts. Making an analogy with image segmentation, we have adapted the watershed transform to handle textual data, and more precisely the distances computed by vectorization between candidate segments.
This method has been tested on different TV collections (news, reports) as well as on the more usual text collections used for segmentation evaluation. In every case, our technique outperformed state-of-the-art approaches.
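A much simplified sketch of the underlying idea is given below (TF-IDF cosine similarity stands in for vectorization, and a plain valley-picking rule stands in for the full watershed transform): lexical cohesion is computed between adjacent units of the transcript, and boundaries are placed at the deepest valleys of the cohesion curve.

# Toy valley-based segmentation of a transcript into topics.
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

sentences = [
    "the football match started with a goal by the home team",
    "the home team scored another goal in the football match",
    "the weather forecast announces rain and wind for tomorrow",
    "tomorrow the weather forecast also announces snow and wind",
]

tfidf = TfidfVectorizer().fit_transform(sentences)
cohesion = np.array([cosine_similarity(tfidf[i], tfidf[i + 1])[0, 0]
                     for i in range(len(sentences) - 1)])

def boundaries(cohesion, n_topics=2):
    """Cut at the lowest cohesion valleys to obtain n_topics segments."""
    return sorted(int(i) + 1 for i in np.argsort(cohesion)[: n_topics - 1])

print(boundaries(cohesion))   # the cut falls before the weather sentences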
Speech can be used to structure and organize large collections of spoken documents (videos, audio streams, etc) based on semantics. This is typically achieved by first transforming speech into text using automatic speech recognition (ASR), before applying natural language processing (NLP) techniques on the transcripts. Our research focuses firstly on the adaptation of NLP methods designed for regular texts to account for the specific aspects of automatic transcripts. In particular, we investigate a deeper integration between ASR and NLP, i.e., between the transcription phase and the semantic analysis phase.
In 2011, we mostly focused on robust transcription, hierarchical topic segmentation and collection structuring.
On the one hand, we investigated the use of broad phonetic landmarks and syllable prominence to improve large vocabulary speech recognition by guiding the Viterbi search process. Several mechanisms to incorporate landmarks into the search space were studied. Significant improvements were observed on radio broadcast news data in the French language. On the other hand, we pursued our work on unsupervised topic adaptation, focusing on the automatic selection of out-of-vocabulary words combining phonetic and morpho-syntactic criteria.
Linear topic segmentation has been widely studied for textual data and recently adapted to spoken content. However, most documents exhibit a hierarchy of topics which cannot be recovered using linear segmentation. We investigated hierarchical topic segmentation of TV programs exploiting the spoken material. Recursively applying linear segmentation methods is one solution, but it fails at the lowest levels of the hierarchy when small segments are targeted, in particular when transcription errors jeopardize lexical cohesion. We proposed new probabilistic measures of lexical cohesion that emphasize the contribution of words that appear only locally, thus attenuating the impact of words which already contributed to the segments at an upper level of the hierarchy.
Finally, we initiated work in collaboration with INA on structuring a large collection of news reports. The idea is to automatically create links and threads between reports in several months of broadcast news shows, based either on the documentary records of the shows and/or on the automatic transcripts. As a preliminary step towards this goal, we investigated distances between documentary records in an information retrieval setting so as to construct a nearest neighbor graph. The next step consists in exploiting graph clustering methods.
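This preliminary step can be sketched as follows (toy records and TF-IDF cosine similarity are illustrative assumptions): records are compared in a vector-space setting and each one is linked to its nearest neighbors, yielding the graph on which clustering methods can then be applied.

# Toy nearest-neighbor graph over documentary records.
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

records = [
    "presidential election campaign debate",
    "candidates debate before the presidential election",
    "flood warning in the south of the country",
    "heavy rains cause floods in the south",
]

sim = cosine_similarity(TfidfVectorizer().fit_transform(records))
np.fill_diagonal(sim, 0.0)                  # ignore self-similarity

k = 1
graph = {i: [int(j) for j in np.argsort(sim[i])[::-1][:k]]
         for i in range(len(records))}
print(graph)                    # each record linked to its nearest neighbor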
Our research in speech for TV content structuring was illustrated through the Texmix demonstration (see Section ) which exploits most of our achievements in the field, including transcription, topic segmentation and collection structuring.
In 2005, the French government set up competitiveness clusters (pôles de compétitivité) to strengthen ties, in given regions, between industries (large and small companies), research labs (both public and private) and teaching institutions (universities and engineering schools). In 2011, the cluster actively prepared a proposal to build an “IRT” (Institut de Recherche Technologique), a new tool proposed by the government to foster innovation and transfer between academic and industrial partners. TexMex is involved in this project and is responsible for one of its experimental platforms. Until October 1st, Patrick Gros was also a deputy member of the executive committee and of the project selection committee.
Duration: 36 months, since September 15th, 2010.
C. Penet's Ph.D. thesis is supported by a CIFRE grant in the framework of a contract between Technicolor and TexMex. The aim of this work is to study and develop techniques based on stochastic models to analyze the content of movies. The application developed in Technicolor consists in detecting violent scenes in movies in order to facilitate parental supervision.
Duration: 36 months, since October 2009.
K. Elagouni's Ph.D. thesis is supported by a CIFRE grant in the framework of a contract between Orange Labs and TexMex. The aim of the work is to investigate a more semantic approach to describe multimedia documents based on textual material found inside the images.
Duration: 36 months, since April 2011.
Ludivine Kuznik's Ph.D. thesis is supported by a CIFRE grant in the framework of a contract between INA and TexMex within the OSEO/Quaero project. The aim of the work is to investigate a more semantic approach to structure and navigate very large collections of TV archives.
Duration: 5 years, starting in May 2008. Prime: Technicolor.
Quaero is a large research and application program in the field of multimedia description (ranging from text to speech and video) and search engines. It groups 5 application projects, a joint Core Technology Cluster developing and providing advanced technologies to the application projects, and a Corpus project in charge of providing the necessary data to develop and evaluate the technologies. The large scope of Quaero's ambitious objectives allows it to take full advantage of TexMex's many areas of research, through its tasks on: Indexing Multimedia Objects, Term Acquisition and Recognition, Semantic Annotation, Video Segmentation, Multi-modal Video Structuring, and Image and Video Fingerprinting.
In 2011, the Quaero team of TexMex was mainly affected by the departure of Mathieu Ben, our technical coordinator, and of Stacy Payne, our financial coordinator. S. Payne was replaced by Carryn Hayward. Among the key facts of our participation this year is our participation in Trecvid.
Duration: 1 year, starting January 2011.
OpenSEM is a project of the EIT KIC ICT Labs grouping 5 academic partners: TU Delft (The Netherlands), VTT (Finland), TU Berlin (Germany), Institut Eurecom (France) and INRIA Rennes.
The project (see http://) addresses the following objectives:
Maximizing the open dissemination and impact of existing knowledge, tangible results (software, tools, demonstrations, field trial results), and rich social content (multimedia, plus metadata such as tags and ratings, plus social network information).
Driving the immediate potential for the triple synergy between content-based analysis, user-based collaborative analysis and social networks and community building through large scale benchmarking competitions (MediaEval).
Participation in the project includes contributing software and demonstrations to the OpenSEM portal, as well as organizing and participating in the MediaEval 2011 benchmark initiative. As a particularly visible action, we developed the Texmix demo interface, which demonstrates, on a corpus of news reports provided by INA, the work developed in the team on topic segmentation, keyword extraction, image retrieval, named entity extraction and classification. This demo was shown during the fall Quaero plenary meeting and during the INRIA-industry special day on future TV.
Duration: 32 months, starting November 2011. Prime: Videntifier Technologies.
FIIA is an innovative software service for the forensic market that automatically identifies and analyzes the content of images on web sites and seized computers. The service saves time and money, gathers better evidence, and builds stronger court cases. We are in charge of helping with the technology needed to identify the logos of terrorist organizations inserted in images or videos. The challenges are related to the poor resolution and small size of the logos, as well as to the very strict efficiency constraints that the logo detector must meet.
Duration: 3 years, started in November 2009.
Partners: IRISA, LIA, LIUM
The project ASH (Automatic System Harnessing) aims at developing new collaborative paradigms for speech recognition. Many current ASR systems rely on an a posteriori combination of the output of several systems (e.g., confusion network combination). In the ASH project, we investigate new approaches in which three ASR systems work in parallel, exchanging information at every step of the recognition process rather than limiting ourselves to an a posteriori combination. What information is to be shared and how to share such information and make use of it are the key questions that the project is addressing. The collaborative paradigm is being extended to landmark-based speech recognition where detection of landmarks and speech transcription can be considered as two (or more) collaborative processes.
Guillaume Gravier was invited to the Multimedia Information Retrieval Lab at Delft University of Technology for one week in May 2011. He gave a seminar entitled “Speech, Language and Multimedia at IRISA/INRIA Rennes” and participated in the organization of MediaEval 2011, an international benchmark initiative in multimedia processing. An EU project involving TU Delft and IRISA/INRIA Rennes, initiated during his visit to Delft, has been submitted to the FET Open program.
Julien Fayolle spent three months, from May to July 2011, in the BUSIM speech processing group at Bogazici University (Istanbul, Turkey) to work on lexical-phonetic automata for spoken utterance retrieval in collaboration with Murat Saraçlar. Whereas state-of-the-art approaches consist in a late fusion of the results of phonetic and lexical searches, the idea was to adapt Murat Saraçlar's spoken utterance retrieval methods to a new representation combining the lexical and phonetic levels earlier than the retrieval step.
Jiangbo Yuan spent five months in the TexMex project-team to work on audio indexing for video copy detection. He contributed to the INRIA submission to the copy detection task, working on our audio indexing engine, and more precisely on the post-verification step. He then worked on the detection of near-duplicate patterns in very large datasets of vectors.
His visit was funded by the EIT ICT Labs OpenSEM project.
Hervé Jégou spent one week (October 2011) in the Center for Machine Perception at the Czech Technical University, Prague, to initiate a collaboration with Pr. Jiri Matas and Dr. Ondrej Chum. During this visit, he gave an invited talk at the 29th Pattern Recognition and Computer Vision Colloquium, entitled “Approximate search as a source coding problem, with application to large scale image retrieval”. This visit was the opportunity to start joint work on image search.
Björn Þór Jónsson, from the School of Computer Science, Reykjavik University, Iceland, and Herwig Lejsek, from Videntifier Technologies, Iceland, spent one week in the team. They came to participate in the large-scale high-dimensional indexing experiments involving more than 30 billion SIFT descriptors.
Laurent Amsaleg
was a program committee member of CIVR 2010;
was a program committee member of ACM Multimedia 2011;
was a program committee member of ICMR 2011;
was a program committee member of MMM 2011;
was a program committee member of VLDB 2011 PhD Forum;
was a reviewer for IET Information Security;
was a reviewer for EURASIP Journal on Advances in Signal Processing;
was the publicity chairman of ACM Multimedia 2011;
was a member of the “commission de spécialistes, Toulon”;
was a member of the “commission de spécialistes, Nantes”.
Vincent Claveau
was a program committee member of WI'11 (International conference on Web Intelligence), Grenoble, France, July 2011;
was a program committee member of TALN'11 (18e conférence francophone Traitement automatique des langues naturelles), Montpellier, France, July 2011;
was a program committee member of RECITAL'11, Montpellier, France, July 2011;
was a program committee member of SIIM (Symposium sur l'Ingénierie de l'Information Médicale), Toulouse, France, June 2011;
was a program committee member of Conférence en Recherche d'Information et Applications, CORIA 2011, Avignon, France, March 2011;
is a member of the editorial board of the journal TAL, Traitement Automatique des Langues;
was a reviewing committee member for the journal Multimedia Tools and Applications (MTAP).
Teddy Furon
was a program committee member of Information Hiding 2011, Prague, Czech Republic;
was a program committee member of CMS 2011, Ghent, Belgium;
was a program committee member of ISPA 2011, Dubrovnik, Croatia;
was a program committee member of IEEE ICME 2011, Barcelona, Spain;
was a program committee member of EUSIPCO 2011, Barcelona, Spain;
was a program committee member of IEEE WIFS 2011, Foz do Iguaçu, Brazil;
was an evaluator for the French ANR, 2011;
was an associate editor of EURASIP Journal on Information Security, 2011;
was an associate editor of IET Information Security, 2011;
was a member of the technical committee of IEEE Information Forensics and Security subsociety, 2011.
Guillaume Gravier
was a technical chair of CBMI 2011, Madrid, Spain;
was a program committee member of MediaEval 2011;
is vice-president of the French Speech Communication Association (AFCP);
is the scientific leader of the French ANR project ETAPE targeting an evaluation campaign on speech technologies for multimedia;
co-founded in 2011 the Speech and Language in Multimedia (SLIM) Special Interest Group of the Intl. Speech Communication Association (ISCA).
Patrick Gros
was a co-organizer of the International ACM Workshop on Automated Media Analysis and Production for Novel TV Services (AIEMPro 2011) that took place during the ACM International conference on Multimedia in Scottsdale, Arizona, USA, in December 2011;
was a program committee member of the ninth International Workshop on Content-Based Multimedia Indexing (CBMI), which was held in Madrid, Spain, in June 2011;
is a member of the steering board of the Content Based Multimedia Indexing (CBMI) workshop series;
was a program committee member of RFIA'12, 18ème conférence en Reconnaissance des Formes et Intelligence Artificielle, Lyon, France, January 2012;
was a scientific committee member of the Conférence en Recherche d'Information et Applications (CORIA) 2011, Avignon, March 2011;
was a program committee member of the Second International Conference on Creative Content Technologies (CONTENT), Rome, Italy, September 2011.
Hervé Jégou
was a program committee member of CVPR 2011;
was a program committee member of ICCV 2011;
was an evaluator for the French ANR, 2011;
was an expert for the program “Futur et Rupture” of the Institut Télécom;
was a reviewer for several journals, in particular the International Journal of Computer Vision, IEEE Transactions on Pattern Analysis and Machine Intelligence, and IEEE Transactions on Multimedia.
Annie Morin
was a vice-president of the CNU (Conseil National des Universités) in Computer Science;
was a committee member of the ITI (Information Technology Interfaces) 2011 conference.
François Poulet
was a program committee member of VINCI'11, Visual INformation Communications International, Hong-Kong, China, August 2011;
was a program committee member of AusDM'11, Australasian Data Mining Conference, Ballarat, Australia, December 2011;
was a program committee member of EGC'11, Extraction et Gestion de Connaissances, Brest, France, January 2011;
was a reviewer of ESANN 2011, European Symposium on Artificial Neural Networks, Bruges, Belgium, April 2011;
was a reviewer of NeuroComputing;
was a reviewer of EJOR, European Journal of Operational Research;
was a co-organizer of the 9th workshop Visualisation et Extraction de Connaissances (AVEC-EGC'11), Brest, France, January 2011.
Christian Raymond
is a member of the editorial board of the e-journal "Discours" (http://);
was a reviewing committee member for the journal CSL, Computer Speech and Language;
was a reviewing committee member of Interspeech (12th Annual Conference of the International Speech Communication Association), Florence, Italy;
was a reviewing committee member of ICMLA (the tenth International Conference on Machine Learning and Applications), Honolulu, Hawaii, USA.
Pascale Sébillot
was a member of the program committee of CORIA 2011 (8e conférence en recherche d'information et applications), Avignon, France, March 2011;
was a member of the program committee of TALN 2011 (18e conférence francophone Traitement automatique des langues naturelles), Montpellier, France, June-July 2011;
was a member of the program committee of TIA 2011 (9e conférence Terminologie et intelligence artificielle), Paris, France, November 2011;
is an editorial committee member of the Journal TAL (Traitement automatique des langues; since July 2009);
was a member of the reading committee of several issues of the Journal TAL (Traitement automatique des langues) in 2011.
Laurent Ughetto
was a program committee member of the Rencontres Francophones sur la Logique Floue et ses Applications (LFA'11).
Teddy Furon. Talk at GdR ISIS French national day on fingerprinting, July 2011
Hervé Jégou. Talk at GdR ISIS French national day on fingerprinting, July 2011
Hervé Jégou. Invited talk at INESC Porto, Portugal, July 2011
Hervé Jégou. Invited talk at CVUT Prague, at the 29th Pattern Recognition and Computer Vision Colloquium, Czech Republic, October 13th, 2011
Hervé Jégou. Talk at the Trecvid Workshop, Gaithersburg, USA, December 2011
Gwénolé Lecorvé was awarded the best Ph.D. award of the French Speech Communication Association.
Hervé Jégou was awarded an Outstanding Reviewer Award at CVPR ’11.
TexMex participated in the Trecvid copy detection task in 2011. This joint submission with the LEAR project-team was ranked roughly 3rd out of 21 participants. On the “no false alarm” profile, our submission was the best for 23 of the 56 transformation types for the optimal NDCR measure.
TRECVID semantic indexing task: Hervé Jégou contributed to the Quaero submission, jointly with LIG (main contributor) and the Karlsruhe Institute of Technology. This submission was ranked 3rd out of 19 participants.
TexMex was ranked first on the diachronic task (identification of the writing year of OCR-processed newspapers) at the DEFT evaluation campaign.
Laurent Amsaleg: Managing Large Collections of Digital Data, 14h, M2 R, University Rennes 1, France.
Laurent Amsaleg: Advanced Databases, 8h, Master, ENSAI, France.
Vincent Claveau: Symbolic Sequential Data, Master by research in computer science 2nd year (8 students, 7 hours), University of Rennes 1, France.
Vincent Claveau: Multimedia Databases, engineer diploma 3rd year, (14 students, 10 hours), ENSSAT - University of Rennes 1, Lannion, France.
Patrick Gros: coordinates the track "From Data to Knowledge: Machine Learning, Modeling and Indexing Multimedia Contents and Symbolic Data" of the Master by research in computer science (2nd year), University of Rennes 1, Rennes. He is responsible for the Math workshop of the master (10h).
Camille Guinaudeau: Analysis of audiovisual documents and flows for indexing (7h), Master by research in computer science 2nd year (M2), University of Rennes 1, France.
Annie Morin: Data Analysis L3 MIAGE ISTIC University of Rennes 1 (38 hours, 60 students).
Annie Morin: Short term forecast M1 MIAGE ISTIC University of Rennes 1 (40 hours, 20 students).
Annie Morin: Statistical Data Mining M2 MIAGE ISTIC University of Rennes 1 (38 hours, 20 students).
Annie Morin: Statistical Process control and Reliability M2 Micro Electronics ISTIC University of Rennes 1 (37 hours, 10 students).
Annie Morin: Statistical Process control and Reliability International M2 Telecom and Electronics, South East University, Nanjing, China (40 hours, 15 students).
Ewa Kijak is head of the Image engineering track of ESIR, the engineering school of the University of Rennes 1, France.
Ewa Kijak: Image processing (50h), Image analysis and classification (24h), engineer diploma 2nd year (M1), ESIR - University of Rennes, France.
Ewa Kijak: Computer vision: Image indexing and retrieval (10h), engineer diploma 3rd year (M2), ESIR - University of Rennes, France.
Ewa Kijak: Multimedia Databases (10h), engineer diploma 3rd year (M2), ENSSAT - University of Rennes 1, Lannion, France.
Ewa Kijak and Camille Guinaudeau: Digital Documents Indexing and Retrieval (22h and 10h), Professional Master in Computer Science 2nd year (M2), ISTIC, University of Rennes 1, France.
François Poulet is in charge of the Master in computer science, M2, MITIC, Computer Science Methods and Information and Communication Technologies, ISTIC, University of Rennes 1.
François Poulet: Managing Large Collections of Digital Data. Master by research in computer science, M2, ISTIC, University of Rennes 1, 10h EqTD.
François Poulet: Supervised Learning. Master by research in computer science, M2, ISTIC, University of Rennes 1, 16h EqTD.
François Poulet: Introduction to Data Mining. Professional Master in Computer Science, M2, ISTIC, University of Rennes 1, 15h EqTD.
François Poulet: Mining Symbolic Data. Professional Master in Computer Science, M2, ISTIC, University of Rennes 1, 25h EqTD.
François Poulet: Applications and Problem Solving. Professional Master in Computer Science, M2, ISTIC, University of Rennes 1, 10h EqTD.
François Poulet: Learning Methods for Multimedia Data. Professional Master in Computer Science, M2, ISTIC, University of Rennes 1, 26h EqTD.
François Poulet: Algorithms and Functional Programming. Computer Science Licence, L1, ISTIC, University of Rennes 1, 60h EqTD.
Pascale Sébillot was course co-director of the Research in Computer Science specialism of the Master's in Computer Science (2nd year), University of Rennes 1, till September 12, 2011.
Pascale Sébillot: Advanced Databases and Modern Information Systems, 69 hours, M2, INSA de Rennes, France.
Pascale Sébillot: Data-Based Knowledge Acquisition 2: Symbolic Methods, 18 hours, M1, INSA de Rennes, France.
PhD : Romain Tavenard, Indexation de séquences de descripteurs, University Rennes 1, defended July 4th, 2011, Laurent Amsaleg & Patrick Gros.
PhD : Camille Guinaudeau, Structuration automatique de flux télévisuels, INSA de Rennes, defended December 7th, 2011, Guillaume Gravier and Pascale Sébillot.
PhD in progress : Juan David Cruz-Gomez, Algorithmique de réseaux socio-sémantiques pour la Visualisation par point de vue de communautés en ligne, since December 2009, Cécile Bothorel (Télécom Bretagne), François Poulet.
PhD in progress : Thanh Toan Do, Challenging the security of CBIR systems, 2nd year, Laurent Amsaleg, Ewa Kijak, Teddy Furon.
PhD in progress : Thanh-Nghi Doan, Image Classification, since November 2010, François Poulet.
PhD in progress : Ali Reza Ebadat, Annotation de documents multimédias à partir d'indices textuels, since October 10th, 2009, Vincent Claveau and Pascale Sébillot.
PhD in progress : Khaoula Elagouni, Indexation automatique d'images et de vidéos par reconnaissance automatique de textes incrustés et traitement automatique des langues, since October 10th, 2009, Pascale Sébillot, Christophe Garcia (LIRIS, Lyon) and Franck Mamalet (Orange Labs, Rennes).
PhD in progress : Gylfi Gudmundsson, Towards parallel and distributed CBIR systems, 2nd year, Laurent Amsaleg.
PhD in progress : Mihir Jain, Video description and indexing, since February 2011, Hervé Jégou and Patrick Gros.
PhD in progress : Ludivine Kuznik, Structuration et navigation dans des archives documentaires, since April 18th, 2011, Guillaume Gravier, Pascale Sébillot and Jean Carrive (INA).