LINKMEDIA

LINKMEDIA - 2024

2024Activity reportProject-TeamLINKMEDIA

RNSR: 201421145C

Research center Inria Centre at Rennes University
In partnership with:Institut national des sciences appliquées de Rennes, CNRS, Université de Rennes
Team name: Creating and exploiting explicit links between multimedia fragments
In collaboration with:Institut de recherche en informatique et systèmes aléatoires (IRISA)
Domain:Perception, Cognition and Interaction
Theme:Vision, perception and multimedia interpretation

Keywords

Computer Science and Digital Science

A3.3.2. Data mining
A3.3.3. Big data analysis
A3.4. Machine learning and statistics
A3.4.1. Supervised learning
A3.4.2. Unsupervised learning
A3.4.8. Deep learning
A4. Security and privacy
A5.3.3. Pattern recognition
A5.4.1. Object recognition
A5.4.3. Content retrieval
A5.7. Audio modeling and processing
A5.7.1. Sound
A5.7.3. Speech
A5.8. Natural language processing
A9.2. Machine learning
A9.3. Signal analysis
A9.4. Natural language processing

1 Team members, visitors, external collaborators

Research Scientists

Laurent Amsaleg [Team leader, CNRS, Senior Researcher]
Teddy Furon [INRIA, Senior Researcher]
Eva Giboulot [INRIA, Researcher, from Oct 2024]
Guillaume Gravier [CNRS, Senior Researcher]

Faculty Members

Caio Corro [INSA RENNES, Associate Professor, from Sep 2024]
Ewa Kijak [Univ. Rennes, Associate Professor]
Simon Malinowski [Univ. Rennes, Associate Professor]
Pascale Sébillot [INSA RENNES, Professor]

Post-Doctoral Fellows

Eva Giboulot [INRIA, Post-Doctoral Fellow, until Sep 2024]
Ryan Webster [INRIA, Post-Doctoral Fellow]

PhD Students

Adèle Denis [INRAE, from Sep 2024]
Virgile Dine [INRIA, from Sep 2024]
Deniz Engin [INRIA, until Feb 2024]
Gautier Evennou [IMATAG, CIFRE]
Pierre Fernandez [FACEBOOK, CIFRE]
Enoal Gesny [INRIA, from Oct 2024]
Louis Hemadou [SAFRAN, CIFRE]
Chloé Imadache [INRIA, from Oct 2024]
Carolina Jeronimo De Almeida [GOUV BRESIL]
Quentin Le Roux [THALES, CIFRE]
Hugo Thomas [Univ. Rennes]
Karim Tit [INRIA, from Feb 2024 until Mar 2024]
Karim Tit [THALES, until Jan 2024]
Shashanka Venkataramanan [INRIA, until May 2024]

Technical Staff

Morgane Casanova [CNRS, from May 2024 until Oct 2024, Engineer]
Morgane Casanova [CNRS, Engineer, until Apr 2024]
Nicolas Fouqué [CNRS, from Mar 2024, Engineer]

Interns and Apprentices

Enoal Gesny [INRIA, Intern, from Apr 2024 until Sep 2024]
Chloé Imadache [INRIA, Intern, from May 2024 until Sep 2024]
Amelie Knecht [Univ. Rennes, from Sep 2024]

Administrative Assistants

Aurélie Patier [Univ. Rennes, until Jul 2024]
Sabrina Ysope [INRIA, from Jul 2024]

Visiting Scientists

Isabela Borlido Barcelos [GOUV BRESIL, from Sep 2024 until Sep 2024]
Caio Corro [SORBONNE UNIVERSITE, from Jul 2024 until Aug 2024]

External Collaborator

Charly Faure [DGA-MI]

2 Overall objectives

2.1 Context

Linkmedia is concerned with the processing of extremely large collections of multimedia material. The material we refer to are collections of documents that are created by humans and intended for humans. It is material that is typically created by media players such as TV channels, radios, newspapers, archivists (BBC, INA, ...), as well as the multimedia material that goes through social-networks. It has images, videos and pathology reports for e-health applications, or that is in relation with e-learning which typically includes a fair amount of texts, graphics, images and videos associating in new ways teachers and students. It also includes material in relation with humanities that study societies through the multimedia material that has been produced across the centuries, from early books and paintings to the latest digitally native multimedia artifacts. Some other multimedia material are out of the scope of Linkmedia, such as the ones created by cameras or sensors in the broad areas of video-surveillance or satellite images.

Multimedia collections are rich in contents and potential, that richness being in part within the documents themselves, in part within the relationships between the documents, in part within what humans can discover and understand from the collections before materializing its potential into new applications, new services, new societal discoveries, ... That richness, however, remains today hardly accessible due to the conjunction of several factors originating from the inherent nature of the collections, the complexity of bridging the semantic gap or the current practices and the (limited) technology:

Multimodal: multimedia collections are composed of very diverse material (images, texts, videos, audio, ...), which require sophisticated approaches at analysis time. Scientific contributions from past decades mostly focused on analyzing each media in isolation one from the other, using modality-specific algorithms. However, revealing the full richness of collections calls for jointly taking into account these multiple modalities, as they are obviously semantically connected. Furthermore, involving resources that are external to collections, such as knowledge bases, can only improve gaining insight into the collections. Knowledge bases form, in a way, another type of modality with specific characteristics that also need to be part of the analysis of media collections. Note that determining what a document is about possibly mobilizes a lot of resources, and this is especially costly and time consuming for audio and video. Multimodality is a great source of richness, but causes major difficulties for the algorithms running analysis;
Intertwined: documents do not exist in isolation one from the other. There is more knowledge in a collection than carried by the sum of its individual documents and the relationships between documents also carry a lot of meaningful information. (Hyper)Links are a good support for materializing the relationships between documents, between parts of documents, and having analytic processes creating them automatically is challenging. Creating semantically rich typed links, linking elements at very different granularities is very hard to achieve. Furthermore, in addition to being disconnected, there is often no strong structure into each document, which makes even more difficult their analysis;
Collections are very large: the scale of collections challenges any algorithm that runs analysis tasks, increasing the duration of the analysis processes, impacting quality as more irrelevant multimedia material gets in the way of relevant ones. Overall, scale challenges the complexity of algorithms as well as the quality of the result they produce;
Hard to visualize: It is very difficult to facilitate humans getting insight on collections of multimedia documents because we hardly know how to display them due to their multimodal nature, or due to their number. We also do not know how to well present the complex relationships linking documents together: granularity matters here, as full documents can be linked with small parts from others. Furthermore, visualizing time-varying relationships is not straightforward. Data visualization for multimedia collections remains quite unexplored.

2.2 Scientific objectives

The ambition of Linkmedia is to propose foundations, methods, techniques and tools to help humans make sense of extremely large collections of multimedia material. Getting useful insight from multimedia is only possible if tools and users interact tightly. Accountability of the analysis processes is paramount in order to allow users understanding their outcome, to understand why some multimedia material was classified this way, why two fragments of documents are now linked. It is key for the acceptance of these tools, or for correcting errors that will exist. Interactions with users, facilitating analytics processes, taking into account the trust in the information and the possible adversarial behaviors are topics Linkmedia addresses.

3 Research program

3.1 Scientific background

Linkmedia is de facto a multidisciplinary research team in order to gather the multiple skills needed to enable humans to gain insight into extremely large collections of multimedia material. It is multimedia data which is at the core of the team and which drives the design of our scientific contributions, backed-up with solid experimental validations. Multimedia data, again, is the rationale for selecting problems, applicative fields and partners.

Our activities therefore include studying the following scientific fields:

multimedia: content-based analysis; multimodal processing and fusion; multimedia applications;
computer vision: compact description of images; object and event detection;
machine learning: deep architectures; structured learning; adversarial learning;
natural language processing: topic segmentation; information extraction;
information retrieval: high-dimensional indexing; approximate k-nn search; embeddings;
data mining: time series mining; knowledge extraction.

3.2 Workplan

Overall, Linkmedia follows two main directions of research that are (i) extracting and representing information from the documents in collections, from the relationships between the documents and from what user build from these documents, and (ii) facilitating the access to documents and to the information that has been elaborated from their processing.

3.3 Research Direction 1: Extracting and Representing Information

Linkmedia follows several research tracks for extracting knowledge from the collections and representing that knowledge to facilitate users acquiring gradual, long term, constructive insights. Automatically processing documents makes it crucial to consider the accountability of the algorithms, as well as understanding when and why algorithms make errors, and possibly invent techniques that compensate or reduce the impact of errors. It also includes dealing with malicious adversaries carefully manipulating the data in order to compromise the whole knowledge extraction effort. In other words, Linkmedia also investigates various aspects related to the security of the algorithms analyzing multimedia material for knowledge extraction and representation.

Knowledge is not solely extracted by algorithms, but also by humans as they gradually get insight. This human knowledge can be materialized in computer-friendly formats, allowing algorithms to use this knowledge. For example, humans can create or update ontologies and knowledge bases that are in relation with a particular collection, they can manually label specific data samples to facilitate their disambiguation, they can manually correct errors, etc. In turn, knowledge provided by humans may help algorithms to then better process the data collections, which provides higher quality knowledge to humans, which in turn can provide some better feedback to the system, and so on. This virtuous cycle where algorithms and humans cooperate in order to make the most of multimedia collections requires specific support and techniques, as detailed below.

Machine Learning for Multimedia Material.

Many approaches are used to extract relevant information from multimedia material, ranging from very low-level to higher-level descriptions (classes, captions, ...). That diversity of information is produced by algorithms that have varying degrees of supervision. Lately, fully supervised approaches based on deep learning proved to outperform most older techniques. This is particularly true for the latest developments of Recurrent Neural Networkds (RNN, such as LSTMs) or convolutional neural network (CNNs) for images that reach excellent performance 54. Linkmedia contributes to advancing the state of the art in computing representations for multimedia material by investigating the topics listed below. Some of them go beyond the very processing of multimedia material as they also question the fundamentals of machine learning procedures when applied to multimedia.

Learning from few samples/weak supervisions. CNNs and RNNs need large collections of carefully annotated data. They are not fitted for analyzing datasets where few examples per category are available or only cheap image-level labels are provided. Linkmedia investigates low-shot, semi-supervised and weakly supervised learning processes: Augmenting scarce training data by automatically propagating labels 57, or transferring what was learned on few very well annotated samples to allow the precise processing of poorly annotated data 66. Note that this context also applies to the processing of heritage collections (paintings, illuminated manuscripts, ...) that strongly differ from contemporary natural images. Not only annotations are scarce, but the learning processes must cope with material departing from what standard CNNs deal with, as classes such as "planes", "cars", etc, are irrelevant in this case.
Ubiquitous Training. NN (CNNs, LSTMs) are mainstream for producing representations suited for high-quality classification. Their training phase is ubiquitous because the same representations can be used for tasks that go beyond classification, such as retrieval, few-shot, meta- and incremental learning, all boiling down to some form of metric learning. We demonstrated that this ubiquitous training is relatively simpler 57 yet as powerful as ad-hoc strategies fitting specific tasks 71. We study the properties and the limitations of this ubiquitous training by casting metric learning as a classification problem.
Beyond static learning. Multimedia collections are by nature continuously growing, and ML processes must adapt. It is not conceivable to re-train a full new model at every change, but rather to support continuous training and/or allowing categories to evolve as the time goes by. New classes may be defined from only very few samples, which links this need for dynamicity to the low-shot learning problem discussed here. Furthermore, active learning strategies determining which is the next sample to use to best improve classification must be considered to alleviate the annotation cost and the re-training process 61. Eventually, the learning process may need to manage an extremely large number of classes, up to millions. In this case, there is a unique opportunity of blending the expertise of Linkmedia on large scale indexing and retrieval with deep learning. Base classes can either be "summarized" e.g. as a multi-modal distribution, or their entire training set can be made accessible as an external associative memory 77.
Learning and lightweight architectures. Multimedia is everywhere, it can be captured and processed on the mobile devices of users. It is necessary to study the design of lightweight ML architectures for mobile and embedded vision applications. Inspired by 81, we study the savings from quantizing hyper-parameters, pruning connections or other approximations, observing the trade-off between the footprint of the learning and the quality of the inference. Once strategy of choice is progressive learning which early aborts when confident enough 62.
Multimodal embeddings. We pursue pioneering work of Linkmedia on multimodal embedding, i.e., representing multiple modalities or information sources in a single embedded space 75, 74, 76. Two main directions are explored: exploiting adversarial architectures (GANs) for embedding via translation from one modality to another, extending initial work in 76 to highly heterogeneous content; combining and constraining word and RDF graph embeddings to facilitate entity linking and explanation of lexical co-occurrences 51.
Accountability of ML processes. ML processes achieve excellent results but it is mandatory to verify that accuracy results from having determined an adequate problem representation, and not from being abused by artifacts in the data. Linkmedia designs procedures for at least explaining and possibly interpreting and understanding what the models have learned. We consider heat-maps materializing which input (pixels, words) have the most importance in the decisions 70, Taylor decompositions to observe the individual contributions of each relevance scores or estimating LID 38 as a surrogate for accounting for the smoothness of the space.
Extracting information. ML is good at extracting features from multimedia material, facilitating subsequent classification, indexing, or mining procedures. Linkmedia designs extraction processes for identifying parts in the images 67, 68, relationships between the various objects that are represented in images 44, learning to localizing objects in images with only weak, image-level supervision 70 or fine-grained semantic information in texts 49. One technique of choice is to rely on generative adversarial networks (GAN) for learning low-level representations. These representations can e.g. be based on the analysis of density 80, shading, albedo, depth, etc.
Learning representations for time evolving multimedia material. Video and audio are time evolving material, and processing them requests to take their time line into account. In 63, 47 we demonstrated how shapelets can be used to transform time series into time-free high-dimensional vectors, preserving however similarities between time series. Representing time series in a metric space improves clustering, retrieval, indexing, metric learning, semi-supervised learning and many other machine learning related tasks. Research directions include adding localization information to the shapelets, fine-tuning them to best fit the task in which they are used as well as designing hierarchical representations.

Adversarial Machine Learning.

Systems based on ML take more and more decisions on our behalf, and maliciously influencing these decisions by crafting adversarial multimedia material is a potential source of dangers: a small amount of carefully crafted noise imperceptibly added to images corrupts classification and/or recognition. This can naturally impact the insight users get on the multimedia collection they work with, leading to taking erroneous decisions for example.

This adversarial phenomenon is not particular to deep learning, and can be observed even when using other ML approaches 43. Furthermore, it has been demonstrated that adversarial samples generalize very well across classifiers, architectures, training sets. The reasons explaining why such tiny content modifications succeed in producing severe errors are still not well understood.

We are left with little choice: we must gain a better understanding of the weaknesses of ML processes, and in particular of deep learning. We must understand why attacks are possible as well as discover mechanisms protecting ML against adversarial attacks (with a special emphasis on convolutional neural networks). Some initial contributions have started exploring such research directions, mainly focusing on images and computer vision problems. Very little has been done for understanding adversarial ML from a multimedia perspective 48.

Linkmedia is in a unique position to throw at this problem new perspectives, by experimenting with other modalities, used in isolation one another, as well as experimenting with true multimodal inputs. This is very challenging, and far more complicated and interesting than just observing adversarial ML from a computer vision perspective. No one clearly knows what is at stake with adversarial audio samples, adversarial video sequences, adversarial ASR, adversarial NLP, adversarial OCR, all this being often part of a sophisticated multimedia processing pipeline.

Our ambition is to lead the way for initiating investigations where the full diversity of modalities we are used to work with in multimedia are considered from a perspective of adversarial attacks and defenses, both at learning and test time. In addition to what is described above, and in order to trust the multimedia material we analyze and/or the algorithms that are at play, Linkmedia investigates the following topics:

Beyond classification. Most contributions in relation with adversarial ML focus on classification tasks. We started investigating the impact of adversarial techniques on more diverse tasks such as retrieval 37. This problem is related to the very nature of euclidean spaces where distances and neighborhoods can all be altered. Designing defensive mechanisms is a natural companion work.
Detecting false information. We carry-on with earlier pioneering work of Linkmedia on false information detection in social media. Unlike traditional approaches in image forensics 52, we build on our expertise in content-based information retrieval to take advantage of the contextual information available in databases or on the web to identify out-of-context use of text or images which contributed to creating a false information 64.
Deep fakes. Progress in deep ML and GANs allow systems to generate realistic images and are able to craft audio and video of existing people saying or doing things they never said or did 60. Gaining in sophistication, these machine learning-based "deep fakes" will eventually be almost indistinguishable from real documents, making their detection/rebutting very hard. Linkmedia develops deep learning based counter-measures to identify such modern forgeries. We also carry on with making use of external data in a provenance filtering perspective 69 in order to debunk such deep fakes.
Distributions, frontiers, smoothness, outliers. Many factors that can possibly explain the adversarial nature of some samples are in relation with their distribution in space which strongly differs from the distribution of natural, genuine, non adversarial samples. We are investigating the use of various information theoretical tools that facilitate observing distributions, how they differ, how far adversarial samples are from benign manifolds, how smooth is the feature space, etc. In addition, we are designing original adversarial attacks and develop detection and curating mechanisms 38.

Multimedia Knowledge Extraction.

Information obtained from collections via computer ran processes is not the only thing that needs to be represented. Humans are in the loop, and they gradually improve their level of understanding of the content and nature of the multimedia collection. Discovering knowledge and getting insight is involving multiple people across a long period of time, and what each understands, concludes and discovers must be recorded and made available to others. Collaboratively inspecting collections is crucial. Ontologies are an often preferred mechanism for modeling what is inside a collection, but this is probably limitative and narrow.

Linkmedia is concerned with making use of existing strategies in relation with ontologies and knowledge bases. In addition, Linkmedia uses mechanisms allowing to materialize the knowledge gradually acquired by humans and that might be subsequently used either by other humans or by computers in order to better and more precisely analyze collections. This line of work is instantiated at the core of the iCODA project Linkmedia coordinates.

We are therefore concerned with:

Multimedia analysis and ontologies. We develop approaches for linking multimedia content to entities in ontologies for text and images, building on results in multimodal embedding to cast entity linking into a nearest neighbor search problem in a high-dimensional joint embedding of content and entities 74. We also investigate the use of ontological knowledge to facilitate information extraction from content 51.
Explainability and accountability in information extraction. In relation with ontologies and entity linking, we develop innovative approaches to explain statistical relations found in data, in particular lexical or entity co-occurrences in textual data, for example using embeddings constrained with translation properties of RDF knowledge or path-based explanation within RDF graphs. We also work on confidence measures in entity linking and information extraction, studying how the notions of confidence and information source can be accounted for in knowledge basis and used in human-centric collaborative exploration of collections.
Dynamic evolution of models for information extraction. In interactive exploration and information extraction, e.g., on cultural or educational material, knowledge progressively evolves as the process goes on, requiring on-the-fly design of new models for content-based information extractors from very few examples, as well as continuous adaptation of the models. Combining in a seamless way low-shot, active and incremental learning techniques is a key issue that we investigate to enable this dynamic mechanisms on selected applications.

3.4 Research Direction 2: Accessing Information

Linkmedia centers its activities on enabling humans to make good use of vast multimedia collections. This material takes all its cultural and economic value, all its artistic wonder when it can be accessed, watched, searched, browsed, visualized, summarized, classified, shared, ... This allows users to fully enjoy the incalculable richness of the collections. It also makes it possible for companies to create business rooted in this multimedia material.

Accessing the multimedia data that is inside a collection is complicated by the various type of data, their volume, their length, etc. But it is even more complicated to access the information that is not materialized in documents, such as the relationships between parts of different documents that however share some similarity. Linkmedia in its first four years of existence established itself as one of the leading teams in the field of multimedia analytics, contributing to the establishment of a dedicated community (refer to the various special sessions we organized with MMM, the iCODA and the LIMAH projects, as well as 58, 59, 55).

Overall, facilitating the access to the multimedia material, to the relevant information and the corresponding knowledge asks for algorithms that efficiently search collections in order to identify the elements of collections or of the acquired knowledge that are matching a query, or that efficiently allow navigating the collections or the acquired knowledge. Navigation is likely facilitated if techniques are able to handle information and knowledge according to hierarchical perspectives, that is, allow to reveal data according to various levels of details. Aggregating or summarizing multimedia elements is not trivial.

Figure 1: Exploration-search axis with example tasks

Three topics are therefore in relation with this second research direction. Linkmedia tackles the issues in relation to searching, to navigating and to summarizing multimedia information. Information needs when discovering the content of a multimedia collection can be conveniently mapped to the exploration-search axis, as first proposed by Zahálka and Worring in 79, and illustrated by Figure 1 where expert users typically work near the right end because their tasks involve precise queries probing search engines. In contrast, lay-users start near the exploration end of the axis. Overall, users may alternate searches and explorations by going back and forth along the axis. The underlying model and system must therefore be highly dynamic, support interactions with the users and propose means for easy refinements. Linkmedia contributes to advancing the state of the art in searching operations, in navigating operations (also referred to as browsing), and in summarizing operations.

Searching.

Search engines must run similarity searches very efficiently. High-dimensional indexing techniques therefore play a central role. Yet, recent contributions in ML suggest to revisit indexing in order to adapt to the specific properties of modern features describing contents.

Advanced scalable indexing. High-dimensional indexing is one of the foundations of Linkmedia. Modern features extracted from the multimedia material with the most recent ML techniques shall be indexed as well. This, however, poses a series of difficulties due to the dimensionality of these features, their possible sparsity, the complex metrics in use, the task in which they are involved (instance search, $k$ -nn, class prototype identification, manifold search 57, time series retrieval, ...). Furthermore, truly large datasets require involving sketching 41, secondary storage and/or distribution 40, 39, alleviating the explosion of the number of features to consider due to their local nature or other innovative methods 56, all introducing complexities. Last, indexing multimodal embedded spaces poses a new series of challenges.
Improving quality. Scalable indexing techniques are approximate, and what they return typically includes a fair amount of false positives. Linkmedia works on improving the quality of the results returned by indexing techniques. Approaches taking into account neighborhoods 50, manifold structures instead of pure distance based similarities 57 must be extended to cope with advanced indexing in order to enhance quality. This includes feature selection based on intrinsic dimensionality estimation 38.
Dynamic indexing. Feature collections grow, and it is not an option to fully reindex from scratch an updated collection. This trivially applies to the features directly extracted from the media items, but also to the base class prototypes that can evolve due to the non-static nature of learning processes. Linkmedia will continue investigating what is at stake when designing dynamic indexing strategies.

Navigating.

Navigating a multimedia collection is very central to its understanding. It differs from searching as navigation is not driven by any specific query. Rather, it is mostly driven by the relationships that various documents have one another. Relationships are supported by the links between documents and/or parts of documents. Links rely on semantic similarity, depicting the fact that two documents share information on the same topic. But other aspects than semantics are also at stake, e.g., time with the dates of creation of the documents or geography with mentions or appearance in documents of some geographical landmarks or with geo-tagged data.

In multimedia collections, links can be either implicit or explicit, the latter being much easier to use for navigation. An example of an implicit link can be the name of someone existing in several different news articles; we, as humans, create a mental link between them. In some cases, the computer misses such configurations, leaving such links implicit. Implicit links are subject to human interpretation, hence they are sometimes hard to identify for any automatic analysis process. Implicit links not being materialized, they can therefore hardly be used for navigation or faceted search. Explicit links can typically be seen as hyperlinks, established either by content providers or, more aligned with Linkmedia, automatically determined from content analysis. Entity linking (linking content to an entity referenced in a knowledge base) is a good example of the creation of explicit links. Semantic similarity links, as investigated in the LIMAH project and as considered in the search and hyperlinking task at MediaEval and TRECVid, are also prototypical links that can be made explicit for navigation. Pursuing work, we investigate two main issues:

Improving multimodal content-based linking. We exploit achievements in entity linking to go beyond lexical or lexico-visual similarity and to provide semantic links that are easy to interpret for humans; carrying on, we work on link characterization, in search of mechanisms addressing link explainability (i.e., what is the nature of the link), for instance using attention models so as to focus on the common parts of two documents or using natural language generation; a final topic that we address is that of linking textual content to external data sources in the field of journalism, e.g., leveraging topic models and cue phrases along with a short description of the external sources.
Dynamicity and user-adaptation. One difficulty for explicit link creation is that links are often suited for one particular usage but not for another, thus requiring creating new links for each intended use; whereas link creation cannot be done online because of its computational cost, the alternative is to generate (almost) all possible links and provide users with selection mechanisms enabling personalization and user-adaptation in the exploration process; we design such strategies and investigate their impact on exploration tasks in search of a good trade-off between performance (few high-quality links) and genericity.

Summarizing.

Multimedia collections contain far too much information to allow any easy comprehension. It is mandatory to have facilities to aggregate and summarize a large body on information into a compact, concise and meaningful representation facilitating getting insight. Current technology suggests that multimedia content aggregation and story-telling are two complementary ways to provide users with such higher-level views. Yet, very few studies already investigated these issues. Recently, video or image captioning 78, 73 have been seen as a way to summarize visual content, opening the door to state-of-the-art multi-document text summarization 53 with text as a pivot modality. Automatic story-telling has been addressed for highly specific types of content, namely TV series 45 and news 65, 72, but still need a leap forward to be mostly automated, e.g., using constraint-based approaches for summarization 42, 72.

Furthermore, not only the original multimedia material has to be summarized, but the knowledge acquired from its analysis is also to summarize. It is important to be able to produce high-level views of the relationships between documents, emphasizing some structural distinguishing qualities. Graphs establishing such relationships need to be constructed at various level of granularity, providing some support for summarizing structural traits.

Summarizing multimedia information poses several scientific challenges that are:

Choosing the most relevant multimedia aggregation type: Taking a multimedia collection into account, a same piece of information can be present in several modalities. The issue of selecting the most suitable one to express a given concept has thus to be considered together with the way to mix the various modalities into an acceptable production. Standard summarization algorithms have to be revisited so that they can handle continuous representation spaces, allowing them to benefit from the various modalities 46.
Expressing user’s preferences: Different users may appreciate quite different forms of multimedia summaries, and convenient ways to express their preferences have to be proposed. We for example focus on the opportunities offered by the constraint-based framework.
Evaluating multimedia summaries: Finding criteria to characterize what a good summary is remains challenging, e.g., how to measure the global relevance of a multimodal summary and how to compare information between and across two modalities. We tackle this issue particularly via a collaboration with A. Smeaton at DCU, comparing the automatic measures we will develop to human judgments obtained by crowd-sourcing.
Taking into account structuring and dynamicity: Typed links between multimedia fragments, and hierarchical topical structures of documents obtained via work previously developed within the team are two types of knowledge which have seldom been considered as long as summarization is concerned. Knowing that the event present in a document is causally related to another event described in another document can however modify the ways summarization algorithms have to consider information. Moreover the question of producing coarse-to-fine grain summaries exploiting the topical structure of documents is still an open issue. Summarizing dynamic collections is also challenging and it is one of the questions we consider.

4 Application domains

4.1 Asset management in the entertainment business

Media asset management—archiving, describing and retrieving multimedia content—has turned into a key factor and a huge business for content and service providers. Most content providers, with television channels at the forefront, rely on multimedia asset management systems to annotate, describe, archive and search for content. So do archivists such as the Institut National de l'Audiovisuel, the bibliothèque Nationale de France, the Nederlands Instituut voor Beeld en Geluid or the British Broadcast Corporation, as well as media monitoring companies, such as Yacast in France. Protecting copyrighted content is another aspect of media asset management.

4.2 Multimedia Internet

One of the most visible application domains of linked multimedia content is that of multimedia portals on the Internet. Search engines now offer many features for image and video search. Video sharing sites also feature search engines as well as recommendation capabilities. All news sites provide multimedia content with links between related items. News sites also implement content aggregation, enriching proprietary content with user-generated content and reactions from social networks. Most public search engines and Internet service providers offer news aggregation portals. This also concerns TV on-demand and replay services as well as social TV services and multi-screen applications. Enriching multimedia content, with explicit links targeting either multimedia material or knowledge databases is central here.

4.3 Data journalism

Data journalism forms an application domain where most of the technology developed by Linkmedia can be used. On the one hand, data journalists often need to inspect multiple heterogeneous information sources, some being well structured, some other being fully unstructured. They need to access (possibly their own) archives with either searching or navigational means. To gradually construct insight, they need collaborative multimedia analytics processes as well as elements of trust in the information they use as foundations for their investigations. Trust in the information, watching for adversarial and/or (deep) fake material, accountability are all crucial here.

5 Social and environmental responsibility

5.1 Impact of research results

The Synapses Labcom

The year 2024 is marked by close collaboration with a major French media organization. The Linkmedia Ouest-France team is launching Synapses, the first “joint laboratory” with a press organization to develop AI for journalism. Supported by the French National Research Agency (ANR), it comes after thirty years of partnership, and targets the analysis of photo archives, the processing of historical texts and the visualization of complex data. Synapses combines “AI and data sovereignty” to exploit a unique heritage of 105 million documents. This partnership highlights the sharing of scientific knowledge, but also our respective sensitivities to the societal impact of AI in order to work on better information for diverse audiences.

6 Highlights of the year

The Linkmedia team split in September 2024, giving birth to the Artishau team, lead by Teddy Furon. Artishau is composed of many (old) members of Linkmedia plus a few members from the Wide team.

Because that split is very recent, the current activity report merges the elements from Artishau and the ones from Linkmedia, leading to the writing of a single report for both teams.

In order to materialize the split, a few elements are important to note:

Eva Giboulot got a permanent INRIA research position (CRCN) at the time of that split, and she is now a member of Artishau. Before that, she was a post-doc inside Linkmedia.
4 students (Denis, Dine, Gesny, Imadache) started their PhD at the time of that split.
Publications of Artishau have been added to the ones of Linkmedia.

Artishau will have its own activity report in 2025.

7 New results

7.1 Extracting and Representing Information

7.1.1 Decreasing graph complexity with transitive reduction to improve temporal graph classification

Participants: Carolina Jerônimo, Zenilton Patrocínio [PUC Minas - Pontifícia Universidade Católica de Minas Gerais], Simon Malinowski, Guillaume Gravier, Silvio Guimarães [PUC Minas - Pontifícia Universidade Católica de Minas Gerais].

Domains such as bioinformatics, social network analysis, and computer vision, describe relations between entities and cannot be interpreted as vectors or fixed grids. Instead, they are naturally represented by graphs. Often this kind of data evolves over time in a dynamic world, respecting a temporal order being known as temporal graphs. The latter became a challenge since subgraph patterns are very difficult to find and the distance between those patterns may change irregularly over time. While state-of-the-art methods are primarily designed for static graphs and may not capture temporal information, recent works have proposed mapping temporal graphs to static graphs to allow for the use of conventional static kernels approaches. This work presents a new method for temporal graph classification based on transitive reduction, which explores new kernels and graph neural networks for temporal graph classification 11. We compare the transitive reduction impact on the map to static graphs in terms of accuracy and computational efficiency across different classification tasks. Experimental results demonstrate the effectiveness of the proposed mapping method in improving the accuracy of supervised classification for temporal graphs while maintaining reasonable computational efficiency.

7.1.2 DINOv2: Learning Robust Visual Features without Supervision

Participants: Maxime Oquab [Meta AI], Timothée Darcet [Meta AI, Thoth], Théo Moutakanni [CentraleSupélec, Meta AI, Université Paris-Saclay], Huy Vo [Meta AI], Marc Szafraniec [Meta AI], Vasil Khalidov [Meta AI], Pierre Fernandez [Meta AI], Daniel Haziza [Meta AI], Francisco Massa [Meta AI], Alaaeldin El-Nouby [Meta AI], Mahmoud Assran [Meta AI], Nicolas Ballas [Meta AI], Wojciech Galuba [Meta AI], Russell Howes [Meta AI], Po-Yao Huang [Meta AI], Shang-Wen Li [Meta AI], Ishan Misra [Meta AI], Michael Rabbat [Meta AI], Vasu Sharma [Meta AI], Gabriel Synnaeve [Meta AI], Hu Xu [Meta AI], Hervé Jegou [Meta AI], Julien Mairal [Thoth], Patrick Labatut [Meta AI], Armand Joulin [Meta AI], Piotr Bojanowski [Meta AI].

The recent breakthroughs in natural language processing for model pretraining on large quantities of data have opened the way for similar foundation models in computer vision. These models could greatly simplify the use of images in any system by producing all-purpose visual features, i.e., features that work across image distributions and tasks without finetuning. This work shows that existing pretraining methods, especially self-supervised methods, can produce such features if trained on enough curated data from diverse sources. We revisit existing approaches and combine different techniques to scale our pretraining in terms of data and model size 12. Most of the technical contributions aim at accelerating and stabilizing the training at scale. In terms of data, we propose an automatic pipeline to build a dedicated, diverse, and curated image dataset instead of uncurated data, as typically done in the self-supervised literature. In terms of models, we train a ViT model with 1B parameters and distill it into a series of smaller models that surpass the best available all-purpose features, OpenCLIP on most of the benchmarks at image and pixel levels.

7.1.3 Functional invariants to watermark large transformers

Participants: Pierre Fernandez [Meta], Guillaume Couairon [Meta], Teddy Furon, Matthijs Douze [Meta].

The rapid growth of transformer-based models increases the concerns about their integrity and ownership insurance. Watermarking addresses this issue by embedding a unique identifier into the model, while preserving its performance. However, most existing approaches require to optimize the weights to imprint the watermark signal, which is not suitable at scale due to the computational cost. This paper explores watermarks with virtually no computational cost 19. It is applicable to a non-blind white-box setting (assuming access to both the original and watermarked networks). They generate functionally equivalent copies by leveraging the models' invariance, via operations like dimension permutations or scaling/unscaling. This enables to watermark models without any change in their outputs and remains stealthy. Experiments demonstrate the effectiveness of the approach and its robustness against various model transformations (fine-tuning, quantization, pruning), making it a practical solution to protect the integrity of large models.

7.1.4 Recherche de relation à partir d'un seul exemple fondée sur un modèle N-way K-shot : une histoire de distracteurs

Participants: Hugo Thomas, Guillaume Gravier, Pascale Sébillot.

La recherche de relation à partir d'un exemple consiste à trouver dans un corpus toutes les occurrences d'un type de relation liant deux entités dans une phrase, nommé type cible et caractérisé à l'aide d'un seul exemple. Nous empruntons le scénario d'entraînement et évaluation N-way K-shot à la tâche de classification de relations rares qui prédit le type de relation liant deux entités à partir de peu d'exemples d'entraînement, et l'adaptons à la recherche de relation avec un exemple. Lors de l'évaluation, un modèle entraîné pour la classification de relations en N-way K-shot est utilisé, dans lequel K vaut un pour le type cible, une des N classes (du N-way) représente le type cible, et les N-1 classes restantes sont des distracteurs modélisant la classe de rejet. Les résultats sur FewRel et TACREV démontrent l'efficacité de notre approche malgré la difficulté de la tâche. L'étude de l'évolution des performances en fonction du nombre de distracteurs et des stratégies de leur choix met en avant une bonne configuration globale, à savoir un nombre élevé de distracteurs à une distance intermédiaire du type de relation cible dans l'espace latent appris par le modèle. Le diagnostic a posteriori de notre méthode révèle l'existence de configurations optimales pour chaque type cible que nos analyses actuelles échouent à caractériser, ouvrant la voie à de futurs travaux 32.

7.1.5 One-shot relation retrieval in news archives: adapting N-way K-shot relation classification for efficient knowledge extraction

Participants: Hugo Thomas, Guillaume Gravier, Pascale Sébillot.

One-shot relation retrieval is the knowledge extraction task that consists in searching in a textual dataset for all occurrences of a relation of interest, named the source relation, characterized by a single example–a relation being a link between a pair of entities in an utterance. Performing this task on large datasets requires an intelligent system to automate the process, for instance when exploring news archives for press review or business intelligence. We propose a framework that leverages the representation learning capabilities of N-way K-shot models for few-shot relation classification and extends these models to enable one-shot retrieval with a rejection class 28. At evaluation time, one-shot relation retrieval is performed in a N-way K-shot setting where 1 of the N ways (or relations) is the source relation and the N-1 others are distractors, i.e., relations modeling a rejection class. We benchmark this framework and investigate the influence of the number and the choice of distractors on the standard TACREV and FewRel datasets. Experimental results demonstrate the effectiveness of our approach to address this highly challenging task, however with high variability primarily induced by the type of the source relation. Experiments also highlight a sound strategy for the choice of distractors-a large number of distractors at an intermediate distance from the embedding of the source relation in the latent space learned by the model-, which provides a competing trade-off between recall and precision. This strategy is globally optimal but can however be surpassed on certain source relations by others, depending on the characteristics of the source relation, paving the way for future work. We finally show the substantial benefit of two-shot retrieval over one-shot retrieval, which sheds light on the design of actual intelligent applications leveraging one-or few-shot relation retrieval.

7.1.6 Is ImageNet worth 1 video? Learning strong image encoders from 1 long unlabelled video

Participants: Shashanka Venkataramanan, Mamshad Rizve [UFC], João Carreira [Google DeepMind], Yuki Asano [UvA], Yannis Avrithis [IARAI].

Self-supervised learning has unlocked the potential of scaling up pretraining to billions of images, since annotation is unnecessary. But are we making the best use of data? How more economical can we be? In this work, we attempt to answer this question by making two contributions. First, we investigate first-person videos and introduce a "Walking Tours" dataset. These videos are high-resolution, hourslong, captured in a single uninterrupted take, depicting a large number of objects and actions with natural scene transitions. They are unlabeled and uncurated, thus realistic for self-supervision and comparable with human learning. Second, we introduce a novel self-supervised image pretraining method tailored for learning from continuous videos. Existing methods typically adapt image-based pretraining approaches to incorporate more frames. Instead, we advocate a "tracking to learn to recognize" approach. Our method called DORA, leads to attention maps that Discover and tRAck objects over time in an end-to-end manner, using transformer cross-attention 31. We derive multiple views from the tracks and use them in a classical self-supervised distillation loss. Using our novel approach, a single Walking Tours video remarkably becomes a strong competitor to ImageNet for several image and video downstream tasks.

7.1.7 AggNet: Learning to aggregate faces for group membership verification

Participants: Marzieh Gheisari [IBENS], Javad Amirian [ISIR], Teddy Furon, Laurent Amsaleg.

In certain applications of face recognition, our goal is to verify whether an individual belongs to a particular group while keeping their identity undisclosed. Existing methods have suggested a process of quantizing pre-computed face descriptors into discrete embeddings and aggregating them into a single representation for the group. However, this mechanism is only optimized for a given closed set of individuals and requires relearning the group representations from scratch whenever the groups change. In this paper, we introduce a deep architecture that simultaneously learns face descriptors and the aggregation mechanism to enhance overall performance 10. Our system can be utilized for new groups comprising individuals who have never been encountered before, and it easily handles new memberships or the termination of existing memberships. Through experiments conducted on multiple extensive, real-world face datasets, we demonstrate that our proposed method achieves superior verification performance compared to other baseline approaches.

7.1.8 REStore: Exploring a Black-Box Defense against DNN Backdoors using Rare Event Simulation

Participants: Quentin Le Roux, Kassem Kallas, Teddy Furon.

Backdoor attacks pose a significant threat to deep neural networks as they allow an adversary to inject a malicious behavior in a victim model during training. This paper addresses the challenge of defending against backdoor attacks in a blackbox setting where the defender has a limited access to a suspicious model. In this paper, we introduce Importance Splitting, a Sequential Monte-Carlo method previously used in neural network robustness certification, as an off-the-shelf tool for defending against backdoors 25. We demonstrate that a black-box defender can leverage rare event simulation to assess the presence of a backdoor, reconstruct its trigger, and finally purify test-time input data in real-time. So-called REStore, our input purification defense proves effective in black-box scenarios because it uses triggers recovered with a query access to a model (only observing its logit, probit, or top-1 label outputs). We test our method on MNIST, CIFAR-10, and CASIA-Webface. We believe we are the first to demonstrate that backdoors may be considered under the lens of rare event simulation. Moreover, REStore is the first one-stage black-box input purification defense that approaches the performance of more complex comparables. REStore avoids gradient estimation, model reconstruction, or the vulnerable training of additional models.

7.1.9 Proactive Detection of Voice Cloning with Localized Watermarking

Participants: Robin San Roman [FAIR, MultiSpeech], Pierre Fernandez [Meta], Hady Elsahar [FAIR], Alexandre Défossez [Kyutai], Teddy Furon, Tuan Tran [FAIR].

In the rapidly evolving field of speech generative models, there is a pressing need to ensure audio authenticity against the risks of voice cloning. We present AudioSeal, the first audio watermarking technique designed specifically for localized detection of AI-generated speech 26. AudioSeal employs a generator / detector architecture trained jointly with a localization loss to enable localized watermark detection up to the sample level, and a novel perceptual loss inspired by auditory masking, that enables AudioSeal to achieve better imperceptibility. AudioSeal achieves state-of-the-art performance in terms of robustness to real life audio manipulations and imperceptibility based on automatic and human evaluation metrics. Additionally, AudioSeal is designed with a fast, single-pass detector, that significantly surpasses existing models in speed, achieving detection up to two orders of magnitude faster, making it ideal for large-scale and real-time applications. Code is available at audioseal.

7.1.10 A Fast and Sound Tagging Method for Discontinuous Named-Entity Recognition

Participant: Caio Corro.

We introduce a novel tagging scheme for discontinuous named entity recognition based on an explicit description of the inner structure of discontinuous mentions 16. We rely on a weighted finite state automaton for both marginal and maximum a posteriori inference. As such, our method is sound in the sense that (1) well-formedness of predicted tag sequences is ensured via the automaton structure and (2) there is an unambiguous mapping between well-formed sequences of tags and (discontinuous) mentions. We evaluate our approach on three English datasets in the biomedical domain, and report comparable results to state-of-the-art while having a way simpler and faster model.

7.1.11 Few-Shot Domain Adaptation for Named-Entity Recognition via Joint Constrained k-Means and Subspace Selection

Participants: Ayoub Hammal [STL LISN], Benno Uthayasooriyar [LMBA, SCOR SE], Caio Corro.

Named-entity recognition (NER) is a task that typically requires large annotated datasets, which limits its applicability across domains with varying entity definitions. This paper addresses few-shot NER, aiming to transfer knowledge to new domains with minimal supervision 22. Unlike previous approaches that rely solely on limited annotated data, we propose a weakly supervised algorithm that combines small labeled datasets with large amounts of unlabeled data. Our method extends the k-means algorithm with label supervision, cluster size constraints and domain-specific discriminative subspace selection. This unified framework achieves state-of-the-art results in few-shot NER on several English datasets.

7.1.12 Training LayoutLM from Scratch for Efficient Named-Entity Recognition in the Insurance Domain

Participants: Benno Uthayasooriyar [LMBA, SCOR SE], Antoine Ly [SCOR SE], Franck Vermet [LMBA], Caio Corro.

Generic pre-trained neural networks may struggle to produce good results in specialized domains like finance and insurance. This is due to a domain mismatch between training data and downstream tasks, as in-domain data are often scarce due to privacy constraints. In this work, we compare different pre-training strategies for LAYOUTLM 30. We show that using domain-relevant documents improves results on a named-entity recognition (NER) problem using a novel dataset of anonymized insurance-related financial documents called PAYSLIPS. Moreover, we show that we can achieve competitive results using a smaller and faster model.

7.1.13 WaterMax: breaking the LLM watermark detectability-robustness-quality trade-off

Participants: Eva Giboulot, Teddy Furon.

Watermarking is a technical means to dissuade malfeasant usage of Large Language Models. This paper proposes a novel watermarking scheme, so-called WaterMax, that enjoys high detectability while sustaining the quality of the generated text of the original LLM 21. Its new design leaves the LLM untouched (no modification of the weights, logits, temperature, or sampling technique). WaterMax balances robustness and complexity contrary to the watermarking techniques of the literature inherently provoking a trade-off between quality and robustness. Its performance is both theoretically proven and experimentally validated. It outperforms all the SotA techniques under the most complete benchmark suite.

7.1.14 When does gradient estimation improve black-box adversarial attacks?

Participants: Enoal Gesny, Eva Giboulot, Teddy Furon.

The recent black-box adversarial attack SurFree demonstrated its high effectiveness resorting to a purely geometric construction. The method drastically reduced the number of queries necessary to craft low-distortion adversarial examples compared to the preceding art which relied on costly gradient estimation. Recently, CGBA proposed to reintroduce gradient information to SurFree. Despite promising empirical results, no theoretical study of the method was provided. This paper fills this gap by providing a comprehensive analysis of the performance of SurFree and CGBA 20. Notably, we express conditions under which using the gradient information is guaranteed to improve upon SurFree performance. We also provide the theoretical distortion of each attack at a given iteration, demonstrating the convergence of CGBA to the optimal adversarial image. Finally, we study the optimal query allocation schedule for CGBA. The accompanying code is to be found at Use-of-gradient-for-black-box-attacks.

7.1.15 SWIFT: Semantic Watermarking for Image Forgery Thwarting

Participants: Gautier Evennou [IMATAG], Vivien Chappelier [IMATAG], Ewa Kijak, Teddy Furon.

This paper proposes a novel approach towards image authentication and tampering detection by using watermarking as a communication channel for semantic information 17. We modify the HiDDeN deep-learning watermarking architecture to embed and extract high-dimensional real vectors representing image captions. Our method improves significantly robustness on both malign and benign edits. We also introduce a local confidence metric correlated with Message Recovery Rate, enhancing the method's practical applicability. This approach bridges the gap between traditional watermarking and passive forensic methods, offering a robust solution for image integrity verification. The code is available at swift_watermarking.

7.1.16 Distinctive Image Captioning: Leveraging ground truth captions in CLIP guided reinforcement learning

Participants: Antoine Chaffin [IMATAG], Ewa Kijak, Vincent Claveau.

Training image captioning models using teacher forcing results in very generic samples, whereas more distinctive captions can be very useful in retrieval applications or to produce alternative texts describing images for accessibility. Reinforcement Learning (RL) allows to use cross-modal retrieval similarity score between the generated caption and the input image as reward to guide the training, leading to more distinctive captions. Recent studies show that pre-trained cross-modal retrieval models can be used to provide this reward, completely eliminating the need for reference captions. However, we argue in this paper that Ground Truth (GT) captions can still be useful in this RL framework. We propose a new image captioning model training strategy that makes use of GT captions in different ways 15. Firstly, they can be used to train a simple MLP discriminator that serves as a regularization to prevent reward hacking and ensures the fluency of generated captions, resulting in a textual GAN setup extended for multimodal inputs. Secondly, they can serve as strong baselines when added to the pool of captions used to compute the proposed contrastive reward to reduce the variance of gradient estimate. Thirdly, they can serve as additional trajectories in the RL strategy, resulting in a teacher forcing loss weighted by the similarity of the GT to the image. This objective acts as an additional learning signal grounded to the distribution of the GT captions. Experiments on MS COCO demonstrate the interest of the proposed training strategy to.

7.1.17 Fast Reliability Estimation for Neural Networks with Adversarial Attack-Driven Importance Sampling

Participants: Karim Tit [uni.lu], Teddy Furon.

This paper introduces a novel approach to evaluate the reliability of Neural Networks (NNs) by integrating adversarial attacks with Importance Sampling (IS), enhancing the assessment's precision and efficiency 29. Leveraging adversarial attacks to guide IS, our method efficiently identifies vulnerable input regions, offering a more directed alternative to traditional Monte Carlo methods. While comparing our approach with classical reliability techniques like FORM and SORM, and with classical rare event simulation methods such as Cross-Entropy IS, we acknowledge its reliance on the effectiveness of adversarial attacks and its inability to handle very high-dimensional data such as ImageNet. Despite these challenges, our comprehensive empirical validations on the datasets the MNIST and CIFAR10 demonstrate the method's capability to accurately estimate NN reliability for a variety of models. Our research not only presents an innovative strategy for reliability assessment in NNs but also sets the stage for further work exploiting the connection between adversarial robustness and the field of statistical reliability engineering.

7.1.18 Watermarking Makes Language Models Radioactive

Participants: Tom Sander [Meta AI Research, X], Pierre Fernandez [Meta AI Research], Alain Durmus [Centre Borelli], Matthijs Douze [Meta AI Research], Teddy Furon.

We investigate the radioactivity of text generated by large language models (LLM), i.e., whether it is possible to detect that such synthetic input was used to train a subsequent LLM. Current methods like membership inference or active IP protection either work only in settings where the suspected text is known or do not provide reliable statistical guarantees. We discover that, on the contrary, it is possible to reliably determine if a language model was trained on synthetic data if that data is output by a watermarked LLM. Our new methods, specialized for radioactivity, detects with a provable confidence weak residuals of the watermark signal in the fine-tuned LLM 27. We link the radioactivity contamination level to the following properties: the watermark robustness, its proportion in the training set, and the fine-tuning process. For instance, if the suspect model is open-weight, we demonstrate that training on watermarked instructions can be detected with high confidence (p-value $< 10^{- 5}$ ) even when as little as 5% of training text is watermarked.

7.1.19 A Double-Edged Sword: The Power of Two in Defending Against DNN Backdoor Attacks

Participants: Quentin Le Roux [THALES], Kassem Kallas, Teddy Furon.

Backdoor attacks on deep neural networks work by injecting them with a malicious behavior during training. Such behavior can then be activated at test-time using cleverly-crafted triggers. Defending against backdoors is key in machine learning security in order to safeguard the trust between model providers and users. This paper demonstrates the open problem of back-door defense performance against a representative selection of backdoor attacks, with a main focus on input purification (a valuable defense category in black-box contexts where all DNN inputs are preprocessed in the hope of erasing a potential trigger). We show that current defenses are adversary-aware and dataset-dependent. They typically focus on patch-based attacks and simpler image classification datasets. This brittleness when using stand-alone defenses highlights the cat-and-mouse game currently affecting the backdoor literature. In this context, we propose a two-defense strategy using existing methods as a palliative solution while waiting for future developments 24.

7.1.20 A Comprehensive Survey on Backdoor Attacks and Their Defenses in Face Recognition Systems

Participants: Quentin Le Roux [THALES], Eric Bourbao [THALES], Yannick Teglia, Kassem Kallas.

Deep learning has significantly transformed face recognition, enabling the deployment of large-scale, state-of-the-art solutions worldwide. However, the widespread adoption of deep neural networks (DNNs) and the rise of Machine Learning as a Service emphasize the need for secure DNNs. This paper revisits the face recognition threat model in the context of DNN ubiquity and the common practice of outsourcing their training and hosting to third-parties. Here, we identify backdoor attacks as a significant threat to modern DNN-based face recognition systems (FRS). Backdoor attacks involve an attacker manipulating a DNN’s training or deployment, injecting it with a stealthy and malicious behavior. Once the DNN has entered its inference stage, the attacker may activate the backdoor and compromise the DNN’s intended functionality. Given the critical nature of this threat to DNN-based FRS, our paper comprehensively surveys the literature of backdoor attacks and defenses previously demonstrated on FRS DNNs 13. As a last point, we highlight potential vulnerabilities and unexplored areas in FRS security.

7.1.21 Beyond Internet Images: Evaluating Vision-Language Models for Domain Generalization on Synthetic-to-Real Industrial Datasets

Participants: Louis Hemadou, Helena Vorobieva [Safran Tech], Ewa Kijak, Frédéric Jurie [GREYC, UNICAEN].

Vision Language Foundation Models (VLFMs) have shown impressive generalization capabilities, making them suitable for Domain Generalization (DG) tasks, such as training on synthetic images and testing on real data. However, existing evaluations predominantly use academic benchmarks constructed from internet images, akin to the datasets used for training VLFMs. This paper assesses the performance of VLFM-based DG algorithms on two synthetic-to-real classification datasets, Rareplanes-tiles and Aerial Vehicles, designed to emulate industrial contexts 33. Our findings reveal that while VLFMs excel on academic benchmarks, outperforming randomly initialized networks, their advantage is significantly diminished on these industrial-like datasets. This study underscores the importance of evaluating models on diverse, representative data to understand their real-world applicability and limitations.

7.1.22 HYBRINFOX at CheckThat! 2024 - Task 2: Enriching BERT Models with the Expert System VAGO for Subjectivity Detection

Participants: Morgane Casanova, Julien Chanson [Mondeca], Benjamin Icard [DEC, ENS-PSL, PSL, SMA], Géraud Faye [MICS, Airbus Defence and Space], Guillaume Gadek [Airbus Defence and Space], Guillaume Gravier, Paul Égré [IJN, ENS-PSL].

This paper presents the HYBRINFOX method used to solve Task 2 of Subjectivity detection of the CLEF 2024 CheckThat! competition 14. The specificity of the method is to use a hybrid system, combining a RoBERTa model, fine-tuned for subjectivity detection, a frozen sentence-BERT (sBERT) model to capture semantics, and several scores calculated by the English version of the expert system VAGO, developed independently of this task to measure vagueness and subjectivity in texts based on the lexicon. In English, the HYBRINFOX method ranked 1st with a macro F1 score of 0.7442 on the evaluation data. For the other languages, the method used a translation step into English, producing more mixed results (ranking 1st in Multilingual and 2nd in Italian over the baseline, but under the baseline in Bulgarian, German, and Arabic). We explain the principles of our hybrid approach, and outline ways in which the method could be improved for other languages besides English.

7.2 Accessing Information

7.2.1 A Multi-Label Dataset of French Fake News: Human and Machine Insights

Participants: Benjamin Icard [DEC, ENS-PSL, PSL, SMA], François Maine [SMA, Freedom Partners], Morgane Casanova, Géraud Faye [MICS, Airbus Defence and Space], Julien Chanson [Mondeca], Guillaume Gadek [Airbus Defence and Space], Ghislain Atemezing [Mondeca], François Bancilhon [Observatoire des Médias], Paul Égré [IJN, ENS-PSL].

We present in 23 a corpus of 100 documents, OBSINFOX, selected from 17 sources of French press considered unreliable by expert agencies, annotated using 11 labels by 8 annotators. By collecting more labels than usual, by more annotators than is typically done, we can identify features that humans consider as characteristic of fake news, and compare them to the predictions of automated classifiers. We present a topic and genre analysis using Gate Cloud, indicative of the prevalence of satire-like text in the corpus. We then use the subjectivity analyzer VAGO, and a neural version of it, to clarify the link between ascriptions of the label Subjective and ascriptions of the label Fake News. The annotated dataset is available online at the following url: OBSINFOX

7.2.2 Exposing propaganda: an analysis of stylistic cues comparing human annotations and machine classification

Participants: Géraud Faye [MICS, Airbus Defence and Space], Benjamin Icard [DEC, ENS-PSL, PSL, SMA], Morgane Casanova, Julien Chanson [Mondeca], François Maine [SMA, Freedom Partners], François Bancilhon [Observatoire des Médias], Guillaume Gadek [Airbus Defence and Space], Guillaume Gravier, Paul Égré [IJN, ENS-PSL].

This paper investigates the language of propaganda and its stylistic features 18. It presents the PPN dataset, standing for Propagandist Pseudo-News, a multisource, multilingual, multimodal dataset composed of news articles extracted from websites identified as propaganda sources by expert agencies. A limited sample from this set was randomly mixed with papers from the regular French press, and their URL masked, to conduct an annotation-experiment by humans, using 11 distinct labels. The results show that human annotators were able to reliably discriminate between the two types of press across each of the labels. We propose different NLP techniques to identify the cues used by the annotators, and to compare them with machine classification. They include the analyzer VAGO to measure discourse vagueness and subjectivity, a TF-IDF to serve as a baseline, and four different classifiers: two RoBERTa-based models, CATS using syntax, and one XGBoost combining syntactic and semantic features.

8 Bilateral contracts and grants with industry

8.1 Bilateral contracts with industry

CIFRE PhD: Certification of Deep Neural Networks

Participants: Teddy Furon, Quentin Le Roux.

Duration: 3 years, started in November 2022 Partner:THALES

This is a CIFRE PhD thesis project aiming at assessing the security of already trained Deep Neural Networks, especially in the context of face recognition.

CIFRE PhD: Watermarking and deep learning

Participants: Teddy Furon, Pierre Fernandez.

Duration: 3 years, started in May 2022 Partner: META AI

This is a CIFRE PhD thesis project aiming at watermarking deep learning models analyzing or generating images or at using deep learning to watermark images.

CIFRE PhD: Domain generalization exploiting synthetic data

Participants: Ewa Kijak, Louis Hemadou.

Duration: 3 years, started in Nov. 2022 Partner: SAFRAN

This is a CIFRE PhD thesis project aiming at exploiting synthetic data to be able to perform transfer learning in presence of very few or inexistent real data in the context of image detection or classification tasks.

CIFRE PhD: Detection and explanation of semantic manipulations in multimedia content

Participants: Ewa Kijak, Gautier Evennou.

Duration: 3 years, started in Sep. 2023 Partner: IMATAG

This is a CIFRE PhD thesis project aiming at detecting and explaining semantic manipulations in multimedia content, in the context of misinformation.

CIFRE PhD: Machine learning for identification of factors impacting the quality of service of urban buses

Participants: Simon Malinowski, Guillaume Gravier, Erwan Vincent.

Duration: 3 years, started in Feb. 2022Partner: KEOLIS

This is a CIFRE PhD thesis project aiming at identifying factors that have an impact on the quality of service of urban buses, and at predicting inter-arrival times in order to better understand the urban bus network.

CIFRE PhD: Introduction of rejection capabilities and externalized language models in deep learning systems for text reading under adverse conditions

Participants: Guillaume Gravier.

Duration: 3 years, started in June 2023 Partner: ANTAI

The thesis, in conjunction with the team SHADOC at IRISA, studies deep models for license plate recognition capable of balancing end-to-end training with separate language model training and adaptation.

Telegramme-CNRS bilateral contract: NLP for computational journalism

Participants: Laurent Amsaleg, Pascale Sébillot, Christian Raymond [Insa Rennes], Nicolas Fouqué.

Duration: 2 years, started in Jan 2022

The project aims at developing a wide range of text-mining and classification tools with the French press group Le Télégramme. In particular, we aim at discovering cues of success in the already published news articles and then exploit them to propose new angles of coverage of newsworthy events to the journalists.

DGA-Inria collaboration

Participants: Teddy Furon, Charly Faure [DGA-MI], Virgile Dine.

Duration: 3 years, started in Oct. 2024

The project aims at developing algorithms to make computer unlearn. From a model trained over a training dataset, we aim at deriving a second model ignoring some training samples, or some classes of samples without retraining it from scratch.

9 Partnerships and cooperations

9.1 International initiatives

9.1.1 Inria associate team not involved in an IIL or an international program

LOGIC

Title:
Learning on graph-based hierarchical methods for image and multimedia data
Duration:
2020 ->
Coordinator:
Silvio Jamil Guimaraes (sjamil@pucminas.br)
Partners:
- Pontifícia Universidade Católica de Minas Gerais Belo Horizonte (Brésil)
Inria contact:
Simon Malinowski
Summary:
The main goal of this project is related to learning graph-based hierarchical methods to be applied on image and multimedia data. Regarding image data, we aim at advancing in the state-of-the-art on hierarchy of partitions taking into account aspects of efficiency, quality, and interactivity, as well as the use of hierarchical information to help the information extraction process. Research on graph-based multimedia label/information propagation will be developed within this project along two main lines of research : - construction of multimedia graphs where links should depict semantic proximity between documents or fragments of documents - how different graph structures can be used to propagate information (usually tags or labels) from one document to another and across modalities

9.1.2 STIC/MATH/CLIMAT AmSud projects

GIMMD

Title:
Graph-based analysis and understanding of image, video and multimedia data
Program:
STIC-AmSud
Duration:
January 2, 2024 – December 31, 2025
Local supervisor:
Simon Malinowski
Partners:
- Guimarães (Brésil)
- Randall (Uruguay)
Inria contact:
Simon Malinowski
Summary:
Graphs can be seen as a way of representing relationships between elements, which can be pixels in image analysis, voxels in video analysis, people in contact networks, or even weather stations for data capture. Understanding the relationships between elements, called vertices, as well as identifying groups of elements that have similar characteristics make the use of graphs a powerful tool to solve real problems through their representation (or modeling) in graphs. Still, methods of analyzing images and videos, and even social networks, which use hierarchical representations, aim to explore the visual representation as a space-scale oriented by regions, that is, a set of representations based, for example, on graphs, with different levels of detail, in which representation at finer levels are nested to obtain coarser levels, thus producing a hierarchy of partitions. This type of data structure has been successfully applied in medical imaging, object detection and video captioning, as well as community identification in social networks. Despite the various approaches to computing partition hierarchies, developing efficient and effective methods is not an easy task, due to the semantic information needed to perform the segmentation. In fact, the state-of-the-art in graph partitioning methods are highly dependent on using good gradients, when there is differentiability between elements, to produce good results. Models based on optimal paths in trees represent an excellent direction to consider any problems produced by hierarchies, since any errors in the delineation of the borders of the regions can be corrected. These methods can eventually be transformed, without loss of quality, into hierarchical methods, incorporating new properties thanks to the use of hierarchy. In addition, with the advances of deep learning, it becomes essential to explore semantic relationships through graphs for the annotation of pseudo labels in order to train deep neural networks in addition to estimating saliences through networks to assist in the graphbased segmentation. The main objective of this study is both to advance the state of the art in partition hierarchy, considering aspects of efficiency, quality, hierarchical transformations and interactivity, as well as to explore the relationships of graphs and neural networks in image/video applications like inpainting, video captioning, for instances. Finally, we will explore methods of semi-supervised segmentation through the (semi) automatic location of markers. The results of these studies will be used to resolve various applications such as identi cation of cancer-susceptible cells in medical images, labeling regions in images and videos, identifying superpixels and supervoxels, inpainting, predicting solar irradiation in regions of interest, among others. We will build upon existing research and skills at LIGM, IRISA, UNICAMP, PUC Minas and UDELAR to develop collaborative work exploiting complementarity of these institutions.

9.2 National initiatives

Chaire Security of AI for Defense Applications (SAIDA)

Participants: Teddy Furon, Laurent Amsaleg, Mathias Rousset [SIMSMART], Quentin Le Roux, Karim Tit.

Duration: 4 years, started Sept 2020 ANR-20-CHIA-0011-01

SAIDA targets the AID "Fiabilité de l’intelligence artificielle, vulnérabilités et contre-mesures" chair. It aims at establishing the fundamental principles for designing reliable and secure AI systems: a reliable AI maintains its good performance even under uncertainties; a secure AI resists attacks in hostile environments. Reliability and security are challenged at training and at test time. SAIDA therefore studies core issues in relation with poisoning training data, stealing the parameters of the model or inferring sensitive training from information leaks. Additionally, SAIDA targets uncovering the fundamentals of attacks and defenses engaging AI at test time. Three converging research directions make SAIDA: 1) theoretical investigations grounded in statistics and applied mathematics to discover the underpinnings of reliability and security, 2) connects adversarial sampling and Information Forensics and Security, 3) protecting the training data and the AI system. SAIDA thus combines theoretical investigations with more applied and heuristic studies to guarantee the applicability of the findings as well as the ability to cope with real world settings.

ANR MEERQAT: MultimEdia Entity Representation and Question Answering Tasks

Participants: Laurent Amsaleg, Yannis Avrithis, Ewa Kijak, Shashanka Venkataramanan.

Duration: 3.5 year, started in April 2020 Partners: Inria project-teams Linkmedia, CEA LIST, LIMSI, IRIT.

The overall goal of the project is to tackle the problem of ambiguities of visual and textual content by learning then combining their representations. As a final use case, we propose to solve a Multimedia Question Answering task, that requires to rely on three different sources of information to answer a (textual) question with regard to visual data as well as an external knowledge base containing millions of unique entities, each being represetd by textual and visual content as well as some links to other entities. An important work will deal with the representation of entities into a common tri-modal space, in which one should determine the content to associate to an entity to adequately represent it. The challenge consists in defining a representation that is compact (for performance) while still expressive enough to reflect the potential links between the entity and a variety of others.

MinArm: EVE4

Participants: Teddy Furon, Eva Giboulot.

Duration: 3 year, started in April 2022 Partners: MinArm, CRIStAL Lille, LIRMM, Univ. Troyes, Univ. Paris Saclay

Teaching and technology survey on steganography and steganalysis in the real world.

ASTRID: HybrInfox

Participants: Vincent Claveau, Guillaume Gravier, Morgane Casanova.

Duration: 20 months, started Jan. 2022

This ANR-AID funded project aims at building exploring how hybridation of symbolic and deep learning NLP tools. These hybrid tools are expected to be used to detect some types of disinformation; in particular, these NLP tools target vagueness (non precise) or subjective (opinion rather than factual) discourses.

Labcom Synapses

Participants: Laurent Amsaleg, Guillaume Gravier, Pascale Sébillot, Michel Le Nouy [Ouest-France], Morgane Casanova.

Duration: 54 months, started Jan. 2024

In spring 2024, the French ANR accepted to financially support the Synapses Laboratoire commun with Ouest-France. It is administratively managed by the CNRS. For 5 years, starting in spring 2024, we will work closely with Ouest-France on a rather applied research program with the goal to eventually transfer some technological solutions to their development teams. The support from ANR amounts will be used to hire two engineers who will prepare proof-of-concept prototypes demonstrating the power of DL technologies applied to a subset of their photo stock and of their news archives. CIFRE PhDs as well as PhDs funded by academia will be enrolled to explore open issues. Note that the consortium agreement signed for Synapses includes chapters clarifying the intellectual property and PGDR issues.

ANR AGAPE

Participants: Laurent Amsaleg, Guillaume Gravier, Pascale Sébillot.

Duration: 48 months, started Jan. 2025

That ANR (ANR-24-CE38-7253), accepted during the summer of 2024, is coordinated by the Lastig laboratory of the IGN. It includes Linkmedia, Ilda from INRIA, the LIRIS, the National Archives, France TV and the University G. Eiffel. AGAPE aims to aggregate and process multimedia content related to cultural and natural heritage, leveraging open data policies and the vast information available online. The project focuses on visual-based documents, such as images, videos, 3D point clouds, and text descriptions. Its first goal is to conduct innovative research on multimodal analysis to link and structure this diverse content. The second objective is to integrate the structured data into a 3D environment, offering new ways of visualizing, navigating, and interacting with it. AGAPE seeks to create an open-source, interoperable, and reproducible framework encapsulated in a digital twin dedicated to heritage. This framework will be validated and applied in various fields, supporting archivists in enriching collections, historians in studying substandard housing, and journalists in engaging the public through media. A Ph.D. thesis for Linkmedia will be funded by AGAPE.

PEPR Cybersecurity COMPROMIS project

Participants: Teddy Furon, Eva Giboulot, Ewa Kijak, Enoal Gesny.

Duration: 4.5 years, started Apr. 2024

The COMPROMIS project is based on a modern vision of multimedia data protection, with deep learning at its heart. This project defends the idea that the protection of multimedia data must necessarily be associated with the security of the tools that analyse this data, i.e. these days Artificial Intelligence (AI). The observation is simple: the protection of multimedia data is undoubtedly the area of cybersecurity that has benefited most from AI, but it has neglected to check the level of security of this new tool. AI has become one of the weak links in the protection of multimedia data. The scientific hurdles thus concern both the classic applications of multimedia data protection and the emerging field of deep learning.

10 Dissemination

10.1 Promoting scientific activities

10.1.1 Scientific events: organisation

General chair, scientific chair

Laurent Amsaleg was the general co-chair of CBMI 2024
Teddy Furon was the general chair of ESSAI 2024, European Symposium on Security of Artificial Intelligence
Ewa Kijak was technical program chair for CBMI 2024, the 21st International Conference on Content-based Multimedia Indexing

10.1.2 Scientific events: selection

Member of the conference program committees

Caio Corro was an area chair for EMNLP 2024
Laurent Amsaleg was a senior area chair for ACM Multimedia 2024
Laurent Amsaleg was a PC member of ICMR, ICME, MMM, SISAP, CBMI
Pascale Sébillot was a PC member for Conférence nationale en intelligence artificielle CNIA 2024

Reviewer

Caio Corro was a reviewer for Coling 2025
Teddy Furon was a reviewer for CVPR 2025, ICLR 2025, ICML 2025, NeurIPS 2024
Pascale Sébillot was a reviewer for LREC-Coling 2024
Eva Giboulot was a reviewer for IEEE WIFS 2024, ACM AiSec 2024, IEEE ICASSP 2025, ACM IH&MMSec 2024

10.1.3 Journal

Reviewer - reviewing activities

Teddy Furon was a reviewer for IEEE Transactions on Information Forensics and Security, IEEE Transactions on Dependable and Secure Computing, Transactions on Machine Learning Research
Eva Giboulot was a reviewer for IEEE Transactions on Information Forensics and Security

10.1.4 Invited talks

Teddy Furon was an invited speaker at `Trustworthy Machine Learning' (Sorbonne Center for AI), Salon VivaTech Paris, and Atelier Inria - UK AI Safety Institute, Summer school `Cyber in Normandy', Winter school of the CyberSchool
Ewa Kijak gave an invited talk for the scientific seminar of the ENS Rennes computer science department
Ewa Kijak was an invited speaker for the French Ministry of Defence's scientific and strategic day on LLMs and generative AIs
Caio Corro gave an invited talk at INRIA Paris: “Named-Entity Recognition: Resurrecting Old School Machine Learning in the Era of Deep Learning”, Dec. 2024

10.1.5 Leadership within the scientific community

Guillaume Gravier is a member of the scientific board of the GDR Traitement automatique des langues
Pascale Sébillot is a member of the board of the GDR Traitement automatique des langues

10.1.6 Scientific expertise

Caio Corro reviewed grants for UTTER’s Financial Support for Third Parties call (collaborative Research and Innovation project funded under Horizon Europe)
Teddy Furon reviewed grant proposal for Region Normandy

10.1.7 Research administration

Guillaume Gravier is director of IRISA (UMR 6074)
Pascale Sébillot is deputy director of IRISA

10.2 Teaching - Supervision - Juries

10.2.1 Teaching

Participants: Eva Giboulot, Ewa Kijak, Laurent Amsaleg, Guillaume Gravier, Pascale Sébillot.

Master: Laurent Amsaleg, Bases de données avancées, 25h, M2, INSA Rennes, France
Master: Eva Giboulot, Rare Event Simulations, 40h, INSA Rennes, France
Licence: Guillaume Gravier, Natural language processing, 12h, L3, INSA Rennes
Licence: Guillaume Gravier, Markov models, 6h, L3, INSA Rennes
Master: Guillaume Gravier, Natural Language Processing, 6h, M1, INSA Rennes
Master: Guillaume Gravier, Natural Language Processing, 51h, M2, ENSAI
Master: Pascale Sébillot, Natural Language Processing, 4h, M1, INSA Rennes, France
Master: Pascale Sébillot, Databases, 18h, M1, DIGISPORT graduate school (EUR), France
Licence: Pascale Sébillot, Natural Language Processing, 6h, L3, INSA Rennes, France
Licence: Caio Corro, Databases, 34h, L2, INSA Rennes, France
Licence: Caio Corro, Probabilities, 26h, L3, INSA Rennes, France
Ewa Kijak is head of the Image engineering track (M1-M2) of ESIR, Univ. Rennes
Master: Ewa Kijak, Information retrieval and Multimodal applications, 24h, M2, ESIR
Master: Ewa Kijak, Deep Learning for Vision, 12h, M2, ESIR
Master: Ewa Kijak, Supervised machine learning, 20h, M1R, ENS Rennes
Master: Ewa Kijak, Machine learning, 12h, M1, ESIR
Master: Ewa Kijak, Image processing, 45h, M1, ESIR, Univ. Rennes

10.2.2 Supervision

Ph.D. Duc Hau Nguyen, Making AI understandable for humans: the plausibility of attention-based mechanisms in natural language processing. Oct. 11, 2024. With Guillaume Gravier and Pascale Sébillot.
Ph.D. Shashanka Venkataramanan, Metric learning for instance- and category-level visual representations. Jul. 1, 2024. With Yannis Avrithis and Ewa Kijak.
Ph.D. Karim Tit, Reliability of Deep Learning with rare event simulation : theory and practice. Apr. 22, 2024. With Matthias Rousset and Teddy Furon
Ph.D. Deniz Engin, Videa Question Answering with limited ressources. Jun. 11, 2024. With Yannis Avrithis and Teddy Furon
PhD in progress: Pierre Fernandez, Watermarking Generative AI. Started Oct. 2022, Teddy Furon
PhD in progress: Gautier Evennou, Detection and explanation of semantic manipulations in multimedia content. Started in Sep. 2023, Ewa Kijak
PhD in progress: Louis Hemadou, Domain generalization exploiting synthetic data. Started Nov. 2022, Ewa Kijak
PhD in progress: Carolina Jeronimo, Machine learning for temporal graphs. Started in Sept. 2022. Simon Malinowski and Guillaume Gravier
PhD in progress: Hugo Thomas, Zero-shot and few-shot relation extraction in press archives. Started Sept. 2022, Guillaume Gravier and Pascale Sébillot
PhD in progress: Ahmed Abdourahman, AI-driven character simulation based on Multi-Agents Interaction Imitation Learning. Started Dec. 2023, Ewa Kijak and Franck Multon (MIMETIC Team at IRISA)
PhD in progress: Adèle Denis, IA-based automated detection and behavior analysis among piglets. Started Sep. 2024, Ewa Kijak, Caroline Clouard (INRAE) and Céline Tallet (INRAE)
PhD in progress: Virgile Dine, Machine Unlearning. Started Oct. 2024, Teddy Furon
PhD in progress: Enoal Gesny, Watermarking of Generative AI. Started Nov. 2024, Eva Giboulot and Teddy Furon
PhD in progress: Chloé Imadache, Security of Deep Learning based Watermarking. Started Dec. 2024, Eva Giboulot and Teddy Furon

10.2.3 Juries

Laurent Amsaleg was a reviewer for the PhD. of Huiyu Li, Univ. Nice, Nov. 2024.
Laurent Amsaleg was the president of the PhD jury of Tom Bachard, Univ. Rennes, Nov. 2024.
Teddy Furon was the president of the PhD jury of Etienne Levecque, Univ. Lille, Nov. 2024.
Ewa Kijak was a reviewer for the PhD. of Emile Blettery, Univ. Gustave Eiffel, Jan. 2024.
Ewa Kijak was a jury member for the PhD. of Mireille El Assal, Univ. Lille, Fev. 2024.
Ewa Kijak was a jury member for the PhD. of Guillaume Jeanneret, Univ. Caen, Sep. 2024.
Ewa Kijak was a jury member for the PhD. of Tom Bachard, Univ. Rennes, Nov. 2024.
Ewa Kijak was a jury member for the PhD. of Paul Berg, Univ. Bretagne Sud, Dec. 2024.
Pascale Sébillot was a jury member for the PhD. of Kim-Anh Nguyen, Sorbonne-Univ., Apr. 2024
Pascale Sébillot was a reviewer for the PhD. of Evan Dufraisse, Univ. Lorraine, Sept. 2024
Pascale Sébillot was the president of the PhD. jury of Hui-Syuan Yeh, Univ. Paris-Saclay, Dec. 2024
Caio Corro was a jury member for the PhD. of Nathan Godey, Inria, Dec. 2024.

10.3 Popularization

10.3.1 Productions (articles, videos, podcasts, serious games, ...)

Teddy Furon was involved in the writing of the policy paper "Cybersecurity specific to AI" from the Inria Program Agency

10.3.2 Participation in Live events

Laurent Amsaleg was involved into the "Chiche" program with 6 classes at the Lycée Saint joseph, Bruz.
Laurent Amsaleg was invited at a panel during the "50 ans du Club de la presse de Bretagne"
Teddy Furon was a speaker at `Math and Art' Festival of Saint-Brieuc
Teddy Furon was involved into the "Chiche" program with 6 classes at Lycée Lasalle (Verrières en Anjou) and Lycée Rabelais (Saint-Brieuc)

11 Scientific production

11.1 Major publications

1 articleL.Laurent Amsaleg, J.James Bailey, A.Amelie Barbe, S.Sarah Erfani, T.Teddy Furon, M.Michael Houle, M.Milos Radovanovic and N. X.Nguyen Xuan Vinh. High Intrinsic Dimensionality Facilitates Adversarial Attack: Theoretical Evidence.IEEE Transactions on Information Forensics and Security16September 2020, 1-12HAL DOI
2 articleB.Benoit Bonnet, T.Teddy Furon and P.Patrick Bas. Generating Adversarial Images in Quantized Domains.IEEE Transactions on Information Forensics and Security2022HAL DOI
3 inproceedingsA.Antoine Chaffin, V.Vincent Claveau and E.Ewa Kijak. PPL-MCTS: Constrained Textual Generation Through Discriminator-Guided Decoding.CtrlGen 2021 - Workshop on Controllable Generative Modeling in Language and Vision at NeurIPS 2021Proceedings of the CtrlGen workshopvirtual, United StatesDecember 2021, 1-19HAL
4 inproceedingsP.Pierre Fernandez, A.Antoine Chaffin, K.Karim Tit, V.Vivien Chappelier and T.Teddy Furon. Three bricks to consolidate watermarks for large language models.Proceedings of IEEE WIFSWIFS 2023 - IEEE International Workshop on Information Forensics and SecurityNuremberg, GermanyIEEEDecember 2023, 1-9HAL
5 inproceedingsP.Pierre Fernandez, G.Guillaume Couairon, H.Hervé Jégou, M.Matthijs Douze and T.Teddy Furon. The Stable Signature: Rooting Watermarks in Latent Diffusion Models.2023 IEEE International Conference on Computer Vision (ICCV)ICCV 2023 - International Conference on Computer Vision2023 IEEE International Conference on Computer VisionParis, FranceOctober 2023HAL
6 inproceedingsA.Ahmet Iscen, G.Giorgos Tolias, Y.Yannis Avrithis, T.Teddy Furon and O.Ondřej Chum. Efficient Diffusion on Region Manifolds: Recovering Small Objects with Compact CNN Representations.2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR)Honolulu, United StatesJuly 2017HAL
7 inproceedingsT.Thibault Maho, T.Teddy Furon and E. L.Erwan Le Merrer. SurFree: a fast surrogate-free black-box attack.CVPR 2021 - Conference on Computer Vision and Pattern RecognitionProc. of {IEEE} Conference on Computer Vision and Pattern Recognition, {CVPR}Virtual, FranceJune 2021, 10430--10439HAL
8 inproceedingsS.Shashanka Venkataramanan, E.Ewa Kijak, L.Laurent Amsaleg and Y.Yannis Avrithis. AlignMixup: Improving Representations By Interpolating Aligned Features.CVPR 2022 - IEEE/CVF Conference on Computer Vision and Pattern RecognitionNew Orleans, United StatesIEEEJune 2022, 1-13HAL
9 articleV.Vedran Vukotić, C.Christian Raymond and G.Guillaume Gravier. A Crossmodal Approach to Multimodal Fusion in Video Hyperlinking.IEEE MultiMedia2522018, 11-23HAL DOI

11.2 Publications of the year

International journals

10 articleM.Marzieh Gheisari, J.Javad Amirian, T.Teddy Furon and L.Laurent Amsaleg. AggNet: Learning to aggregate faces for group membership verification.Signal Processing: Image Communication132December 2024, 117237HAL DOI back to text
11 articleC.Carolina Jerônimo, Z.Zenilton Patrocínio, S.Simon Malinowski, G.Guillaume Gravier and S. J.Silvio Jamil Ferzoli Guimarães. Decreasing graph complexity with transitive reduction to improve temporal graph classification.International Journal of Data Science and Analytics2024HAL DOI back to text
12 articleM.Maxime Oquab, T.Timothée Darcet, T.Théo Moutakanni, H.Huy Vo, M.Marc Szafraniec, V.Vasil Khalidov, P.Pierre Fernandez, D.Daniel Haziza, F.Francisco Massa, A.Alaaeldin El-Nouby, M.Mahmoud Assran, N.Nicolas Ballas, W.Wojciech Galuba, R.Russell Howes, P.-Y.Po-Yao Huang, S.-W.Shang-Wen Li, I.Ishan Misra, M.Michael Rabbat, V.Vasu Sharma, G.Gabriel Synnaeve, H.Hu Xu, H.Hervé Jegou, J.Julien Mairal, P.Patrick Labatut, A.Armand Joulin and P.Piotr Bojanowski. DINOv2: Learning Robust Visual Features without Supervision.Transactions on Machine Learning Research Journal2024HAL DOI back to text
13 articleQ. L.Quentin Le Roux, E.Eric Bourbao, Y.Yannick Teglia and K.Kassem Kallas. A Comprehensive Survey on Backdoor Attacks and Their Defenses in Face Recognition Systems.IEEE Access122024, 47433-47468HAL DOI back to text

International peer-reviewed conferences

14 inproceedingsM.Morgane Casanova, J.Julien Chanson, B.Benjamin Icard, G.Géraud Faye, G.Guillaume Gadek, G.Guillaume Gravier and P.Paul Égré. HYBRINFOX at CheckThat! 2024 - Task 2: Enriching BERT Models with the Expert System VAGO for Subjectivity Detection.Proceedings of the Conference and Labs of the Evaluation Forum (CLEF 2024 CheckThat!)CLEF 2024 - Conference and Labs of the Evaluation ForumGrenoble, France2024, 1-9HAL DOI back to text
15 inproceedingsA.Antoine Chaffin, E.Ewa Kijak and V.Vincent Claveau. Distinctive image captioning: leveraging ground truth captions in clip guided reinforcement learning.Proceedings of 2024 IEEE International Conference on Image ProcessingICIP 2024 - IEEE International Conference on Image ProcessingAbu Dhabi, United Arab EmiratesOctober 2024HAL back to text
16 inproceedingsC.Caio Corro. A Fast and Sound Tagging Method for Discontinuous Named-Entity Recognition.EMNLP 2024 - Conference on Empirical Methods in Natural Language ProcessingMiami, United StatesAssociation for Computational LinguisticsNovember 2024, 19506-19518HAL DOI back to text
17 inproceedingsG.Gautier Evennou, V.Vivien Chappelier, E.Ewa Kijak and T.Teddy Furon. SWIFT: Semantic Watermarking for Image Forgery Thwarting.Proc. of IEEE WIFSWIFS 2024 - 16th IEEE International Workshop on Information Forensics and SecurityRoma, ItalyIEEEDecember 2024, 1-6HAL back to text
18 inproceedingsG.Géraud Faye, B.Benjamin Icard, M.Morgane Casanova, J.Julien Chanson, F.François Maine, F.François Bancilhon, G.Guillaume Gadek, G.Guillaume Gravier and P.Paul Égré. Exposing propaganda: an analysis of stylistic cues comparing human annotations and machine classification.Proceedings of the EACL Workshop on Understanding Implicit and Underspecified Language (UnImplicit 2024)EACL Workshop on Understanding Implicit and Underspecified Language (UnImplicit 2024)Malte, Malta2024HAL back to text
19 inproceedingsP.Pierre Fernandez, G.Guillaume Couairon, T.Teddy Furon and M.Matthijs Douze. Functional invariants to watermark large transformers.Proceedings of ICASSP'24ICASSP 2024 - IEEE International Conference on Acoustics, Speech and Signal ProcessingSeoul, South KoreaApril 2024, 1-5HAL back to text
20 inproceedingsE.Enoal Gesny, E.Eva Giboulot and T.Teddy Furon. When does gradient estimation improve black-box adversarial attacks?Proceedings of IEEE WIFS 2024WIFS 2024 -16th IEEE International Workshop on Information Forensics and SecurityRoma, ItalyIEEEDecember 2024, 1-6HAL back to text
21 inproceedingsE.Eva Giboulot and T.Teddy Furon. WaterMax: breaking the LLM watermark detectability-robustness-quality trade-off.38th Conference on Neural Information Processing Systems (NeurIPS 2024)NeurIPS 2024 - 38th Conference on Neural Information Processing SystemsVancouver, CanadaDecember 2024, 1-34HAL back to text
22 inproceedingsA.Ayoub Hammal, B.Benno Uthayasooriyar and C.Caio Corro. Few-Shot Domain Adaptation for Named-Entity Recognition via Joint Constrained k-Means and Subspace Selection.Proceedings of the 31st International Conference on Computational Linguistics (COLING 2025)Abu DHABI, FranceJanuary 2025HAL back to text
23 inproceedingsB.Benjamin Icard, F.François Maine, M.Morgane Casanova, G.Géraud Faye, J.Julien Chanson, G.Guillaume Gadek, G.Ghislain Atemezing, F.François Bancilhon and P.Paul Égré. A Multi-Label Dataset of French Fake News: Human and Machine Insights.Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024)LREC-COLING 2024 - The 2024 Joint International Conference on Computational Linguistics, Language Resources and EvaluationTorino, ItalyELRA; ICCLMay 2024, 812–818HAL back to text
24 inproceedingsQ.Quentin Le Roux, K.Kassem Kallas and T.Teddy Furon. A Double-Edged Sword: The Power of Two in Defending Against DNN Backdoor Attacks.EUSIPCO 2024 - 32nd IEEE European Signal Processing ConferenceLyon, FranceIEEE2024, 2007-2011HAL DOI back to text
25 inproceedingsQ.Quentin Le Roux, K.Kassem Kallas and T.Teddy Furon. REStore: Exploring a Black-Box Defense against DNN Backdoors using Rare Event Simulation.SaTML 2024 - 2nd IEEE Conference on Secure and Trustworthy Machine LearningProceedings of the 2nd IEEE Conference on Secure and Trustworthy Machine Learning (SaTML)Toronto, CanadaIEEE2024, 1-22HAL back to text
26 inproceedingsR. S.Robin San Roman, P.Pierre Fernandez, H.Hady Elsahar, A.Alexandre Défossez, T.Teddy Furon and T.Tuan Tran. Proactive Detection of Voice Cloning with Localized Watermarking.Proceedings of the 41st International Conference on Machine LearningICML 2024 - 41st International Conference on Machine Learning235Vienna, AustriaJuly 2024, 1-17HAL back to text
27 inproceedingsT.Tom Sander, P.Pierre Fernandez, A.Alain Durmus, M.Matthijs Douze and T.Teddy Furon. Watermarking Makes Language Models Radioactive.38th Conference on Neural Information Processing Systems (NeurIPS 2024).NeurIPS 2024 - 38th Conference on Neural Information Processing SystemsSpotlightVancouver, CanadaDecember 2024, 1-35HAL back to text
28 inproceedingsH.Hugo Thomas, G.Guillaume Gravier and P.Pascale Sébillot. One-shot relation retrieval in news archives: adapting N-way K-shot relation classification for efficient knowledge extraction.KES 2024 - 28th International Conference on Knowledge-Based and Intelligent Information & Engineering SystemsSeville, Spain2024, 1060-1069HAL back to text
29 inproceedingsK.Karim Tit and T.Teddy Furon. Fast Reliability Estimation for Neural Networks with Adversarial Attack-Driven Importance Sampling.UAI 2024 - 40th Conference on Uncertainty in Artificial IntelligenceBarcelona, Spain2024HAL back to text
30 inproceedingsB.Benno Uthayasooriyar, A.Antoine Ly, F.Franck Vermet and C.Caio Corro. Training LayoutLM from Scratch for Efficient Named-Entity Recognition in the Insurance Domain.Proceeedings of the COLING 2025 Workshop on Financial Technology and Natural Language Processing (FinNLP), Financial Narrative Processing (FNP), and on Large Language Models for Finance and Legal (LLMFinLegal)COLING 2025 Workshop on Financial Technology and Natural Language Processing (FinNLP), Financial Narrative Processing (FNP), and on Large Language Models for Finance and Legal (LLMFinLegal)Abu Dabi, United Arab EmiratesDecember 2024HAL back to text
31 inproceedingsS.Shashanka Venkataramanan, M. N.Mamshad Nayeem Rizve, J.João Carreira, Y.Yuki Asano and Y.Yannis Avrithis. Is ImageNet worth 1 video? Learning strong image encoders from 1 long unlabelled video.ICLR 2024 - Twelfth International Conference on Learning RepresentationsVienna, Austria2024, 1-21HAL back to text

National peer-reviewed Conferences

32 inproceedingsH.Hugo Thomas, G.Guillaume Gravier and P.Pascale Sébillot. Recherche de relation à partir d’un seul exemple fondée sur un modèle N-way K-shot : une histoire de distracteurs.35èmes Journées d'Études sur la Parole (JEP 2024)35èmes Journées d'Études sur la Parole (JEP 2024) 31ème Conférence sur le Traitement Automatique des Langues Naturelles (TALN 2024) 26ème Rencontre des Étudiants Chercheurs en Informatique pour le Traitement Automatique des Langues (RECITAL 2024)1 : articles longs et prises de positionToulouse, FranceATALA & AFPC2024, 157-168HAL back to text

Conferences without proceedings

33 inproceedingsL.Louis Hémadou, H.Héléna Vorobieva, E.Ewa Kijak and F.Frédéric Jurie. Beyond Internet Images: Evaluating Vision-Language Models for Domain Generalization on Synthetic-to-Real Industrial Datasets.Synthetic Data for Computer Vision Workshop @ CVPR 2024Seattle, Washington, United StatesJune 2024HAL back to text

Doctoral dissertations and habilitation theses

34 thesisD.Deniz Engin. Video question answering with limited supervision.Université de RennesJune 2024HAL
35 thesisS.Shashanka Venkataramanan. Metric learning for instance and category-level visual representation.Université de RennesJuly 2024HAL

Reports & preprints

36 miscR. S.Robin San Roman, P.Pierre Fernandez, A.Antoine Deleforge, Y.Yossi Adi and R.Romain Serizel. Latent Watermarking of Audio Generative Models.2024HAL DOI

11.3 Cited publications

37 inproceedingsL.Laurent Amsaleg, J. E.James E. Bailey, D.Dominique Barbe, S.Sarah Erfani, M. E.Michael E. Houle, V.Vinh Nguyen and M.Miloš Radovanović. The Vulnerability of Learning to Adversarial Perturbation Increases with Intrinsic Dimensionality.WIFS2017HAL back to text
38 inproceedingsL.Laurent Amsaleg, O.Oussama Chelly, T.Teddy Furon, S.Stephane Girard, M. E.Michael E. Houle, K.-I.Ken-Ichi Kawarabayashi and M.Michael Nett. Estimating Local Intrinsic Dimensionality.KDD2015HAL back to text back to text back to text
39 articleL.Laurent Amsaleg, G. \.Gylfi \TH{}ór Gu\dh{}mundsson, B. \.Björn \TH{}ór Jónsson and M. J.Michael J Franklin. Prototyping a Web-Scale Multimedia Retrieval Service Using Spark.ACM TOMCCAP143s2018HAL back to text
40 inproceedingsL.Laurent Amsaleg, B. \.Björn \TH{}ór Jónsson and H.Herwig Lejsek. Scalability of the NV-tree: Three Experiments.SISAP2018HAL back to text
41 inproceedingsR.Raghavendran Balu, T.Teddy Furon and L.Laurent Amsaleg. Sketching techniques for very large matrix factorization.ECIR2016HAL back to text
42 inproceedingsS.-A.Sid-Ahmed Berrani, H.Haykel Boukadida and P.Patrick Gros. Constraint Satisfaction Programming for Video Summarization.ISM2013back to text
43 articleB.Battista Biggio and F.Fabio Roli. Wild Patterns: Ten Years After the Rise of Adversarial Machine Learning.Pattern Recognition2018back to text
44 phdthesisP.Petra Bosilj. Image indexing and retrieval using component trees.Université de Bretagne Sud2016HAL back to text
45 phdthesisX.Xavier Bost. A storytelling machine? : Automatic video summarization: the case of TV series.University of Avignon, France2016back to text
46 inproceedingsM.Mateusz Budnik, M.Mikail Demirdelen and G.Guillaume Gravier. A Study on Multimodal Video Hyperlinking with Visual Aggregation.ICME2018back to text
47 inproceedingsR.Ricardo Carlini Sperandio, S.Simon Malinowski, L.Laurent Amsaleg and R.Romain Tavenard. Time Series Retrieval using DTW-Preserving Shapelets.SISAP2018HAL back to text
48 articleN.Nicholas Carlini and D. A.David A. Wagner. Audio Adversarial Examples: Targeted Attacks on Speech-to-Text.CoRRabs/1801.019442018back to text
49 inproceedingsV.Vincent Claveau, L. E.Lucas Emanuel Silva Oliveira, G.Guillaume Bouzillé, M.Marc Cuggia, C. M.Claudia Maria Cabral Moro and N.Natalia Grabar. Numerical eligibility criteria in clinical protocols: annotation, automatic detection and interpretation.AIME2017HAL back to text
50 inproceedingsA.Agni Delvinioti, H.Hervé Jégou, L.Laurent Amsaleg and M. E.Michael E. Houle. Image Retrieval with Reciprocal and shared Nearest Neighbors.VISAPP2014HAL back to text
51 inproceedingsC. B.Cheikh Brahim El Vaigh, F.François Goasdoué, G.Guillaume Gravier and P.Pascale Sébillot. Using Knowledge Base Semantics in Context-Aware Entity Linking.DocEng 2019 - 19th ACM Symposium on Document EngineeringBerlin, GermanyACMSeptember 2019, 1-10HAL DOI back to text back to text
52 bookH.Hany Farid. Photo Forensics.The MIT Press2016back to text
53 articleM.Mahak Gambhir and V.Vishal Gupta. Recent automatic text summarization techniques: a survey.Artif. Intell. Rev.4712017back to text
54 bookI.Ian Goodfellow, Y.Yoshua Bengio and A.Aaron Courville. Deep Learning.MIT Press2016back to text
55 inproceedingsG.Guillaume Gravier, M.Martin Ragot, L.Laurent Amsaleg, R.Rémi Bois, G.Grégoire Jadi, E.Eric Jamet, L.Laura Monceaux and P.Pascale Sébillot. Shaping-Up Multimedia Analytics: Needs and Expectations of Media Professionals.MMM, Special Session Perspectives on Multimedia Analytics2016HAL back to text
56 inproceedingsA.Ahmet Iscen, L.Laurent Amsaleg and T.Teddy Furon. Scaling Group Testing Similarity Search.ICMR2016HAL back to text
57 inproceedingsA.Ahmet Iscen, G.Giorgos Tolias, Y.Yannis Avrithis and O.Ondřej Chum. Mining on Manifolds: Metric Learning without Labels.CVPR2018HAL back to text back to text back to text back to text
58 inproceedingsB. \.Björn \TH{}ór Jónsson, G.Gr\'imur Tómasson, H.Hlynur Sigur\th{}órsson, Á.Áslaug Er\'iksdóttir, L.Laurent Amsaleg and M. K.Marta Kristin Larusdottir. A Multi-Dimensional Data Model for Personal Photo Browsing.MMM2015HAL back to text
59 inproceedingsB. \.Björn \TH{}ór Jónsson, M.Marcel Worring, J.Jan Zahálka, S.Stevan Rudinac and L.Laurent Amsaleg. Ten Research Questions for Scalable Multimedia Analytics.MMM, Special Session Perspectives on Multimedia Analytics2016HAL back to text
60 articleH.H. Kim, P.P. Garrido, A.A. Tewari, W.W. Xu, J.J. Thies, N.N. Nie\ss}ner, P.P. P{érez, C.C. Richardt, M.M. Zollhöfer and C.C. Theobalt. Deep Video Portraits.ACM TOG2018back to text
61 inproceedingsM.Mathieu Laroze, R.Romain Dambreville, C.Chloé Friguet, E.Ewa Kijak and S.Sébastien Lefèvre. Active Learning to Assist Annotation of Aerial Images in Environmental Surveys.CBMI2018back to text
62 articleS.Sam Leroux, P.Pavlo Molchanov, P.Pieter Simoens, B.Bart Dhoedt, T.Thomas Breuel and J.Jan Kautz. IamNN: Iterative and Adaptive Mobile Neural Network for Efficient Image Classification.CoRRabs/1804.101232018back to text
63 inproceedingsA.Arnaud Lods, S.Simon Malinowski, R.Romain Tavenard and L.Laurent Amsaleg. Learning DTW-Preserving Shapelets.IDA2017HAL back to text
64 inproceedingsC.Cédric Maigrot, E.Ewa Kijak and V.Vincent Claveau. Context-Aware Forgery Localization in Social-Media Images: A Feature-Based Approach Evaluation.ICIP2018back to text
65 inproceedingsD.Dafna Shahaf and C.Carlos Guestrin. Connecting the dots between news articles.KDD2010back to text
66 inproceedingsM.Miaojing Shi, H.Holger Caesar and V.Vittorio Ferrari. Weakly Supervised Object Localization Using Things and Stuff Transfer.ICCV2017back to text
67 inproceedingsR.Ronan Sicre, Y.Yannis Avrithis, E.Ewa Kijak and F.Frédéric Jurie. Unsupervised part learning for visual recognition.CVPR2017HAL back to text
68 inproceedingsR.Ronan Sicre and H.Hervé Jégou. Memory Vectors for Particular Object Retrieval with Multiple Queries.ICMR2015HAL back to text
69 inproceedingsA.Allan da Silva Pinto, D.Daniel Moreira, A.Aparna Bharati, J.Joel Brogan, K. W.Kevin W. Bowyer, P. J.Patrick J. Flynn, W. J.Walter J. Scheirer and A.Anderson Rocha. Provenance filtering for multimedia phylogeny.ICIP2017back to text
70 inproceedingsO.Oriane Siméoni, A.Ahmet Iscen, G.Giorgos Tolias, Y.Yannis Avrithis and O.Ondřej Chum. Unsupervised Object Discovery for Instance Recognition.WACV2018HAL back to text back to text
71 inproceedingsH. O.Hyun Oh Song, Y.Yu Xiang, S.Stefanie Jegelka and S.Silvio Savarese. Deep Metric Learning via Lifted Structured Feature Embedding.CVPR2016back to text
72 inproceedingsC.-Y.Chun-Yu Tsai, M. L.Michelle L. Alexander, N.Nnenna Okwara and J. R.John R. Kender. Highly Efficient Multimedia Event Recounting from User Semantic Preferences.ICMR2014back to text back to text
73 articleO.Oriol Vinyals, A.Alexander Toshev, S.Samy Bengio and D.Dumitru Erhan. Show and Tell: Lessons Learned from the 2015 MSCOCO Image Captioning Challenge.TPAMI3942017back to text
74 phdthesisV.Vedran Vukotić. Deep Neural Architectures for Automatic Representation Learning from Multimedia Multimodal Data.INSA de Rennes2017HAL back to text back to text
75 inproceedingsV.Vedran Vukotić, C.Christian Raymond and G.Guillaume Gravier. Bidirectional Joint Representation Learning with Symmetrical Deep Neural Networks for Multimodal and Crossmodal Applications.ICMR2016HAL back to text
76 inproceedingsV.Vedran Vukotić, C.Christian Raymond and G.Guillaume Gravier. Generative Adversarial Networks for Multimodal Representation Learning in Video Hyperlinking.ICMR2017HAL back to text back to text
77 articleJ.Jason Weston, S.Sumit Chopra and A.Antoine Bordes. Memory Networks.CoRRabs/1410.39162014back to text
78 inproceedingsH.Haonan Yu, J.Jiang Wang, Z.Zhiheng Huang, Y.Yi Yang and W.Wei Xu. Video Paragraph Captioning Using Hierarchical Recurrent Neural Networks.CVPR2016back to text
79 inproceedingsJ.Jan Zahálka and M.M. Worring. Towards interactive, intelligent, and integrated multimedia analytics.VAST2014back to text
80 inproceedingsL.Lu Zhang, M.Miaojing Shi and Q.Qiaobo Chen. Crowd Counting via Scale-Adaptive Convolutional Neural Network.WACV2018HAL back to text
81 articleX.Xiangyu Zhang, X.Xinyu Zhou, M.Mengxiao Lin and J.Jian Sun. ShuffleNet: An Extremely Efficient Convolutional Neural Network for Mobile Devices.CoRRabs/1707.010832017back to text

LINKMEDIA - 2024

LINKMEDIA - 2024

2024Activity reportProject-TeamLINKMEDIA

Keywords

Computer Science and Digital Science

Other Research Topics and Application Domains

1 Team members, visitors, external collaborators

Research Scientists

Faculty Members

Post-Doctoral Fellows

PhD Students

Technical Staff

Interns and Apprentices

Administrative Assistants

Visiting Scientists

External Collaborator

2 Overall objectives

2.1 Context

2.2 Scientific objectives

3 Research program

3.1 Scientific background

3.2 Workplan

3.3 Research Direction 1: Extracting and Representing Information

Machine Learning for Multimedia Material.

Adversarial Machine Learning.

Multimedia Knowledge Extraction.

3.4 Research Direction 2: Accessing Information

Searching.

Navigating.

Summarizing.

4 Application domains

4.1 Asset management in the entertainment business

4.2 Multimedia Internet

4.3 Data journalism

5 Social and environmental responsibility

5.1 Impact of research results

The Synapses Labcom

6 Highlights of the year

7 New results

7.1 Extracting and Representing Information

7.1.1 Decreasing graph complexity with transitive reduction to improve temporal graph classification

7.1.2 DINOv2: Learning Robust Visual Features without Supervision

7.1.3 Functional invariants to watermark large transformers

7.1.4 Recherche de relation à partir d'un seul exemple fondée sur un modèle N-way K-shot : une histoire de distracteurs

7.1.5 One-shot relation retrieval in news archives: adapting N-way K-shot relation classification for efficient knowledge extraction

7.1.6 Is ImageNet worth 1 video? Learning strong image encoders from 1 long unlabelled video

7.1.7 AggNet: Learning to aggregate faces for group membership verification

7.1.8 REStore: Exploring a Black-Box Defense against DNN Backdoors using Rare Event Simulation

7.1.9 Proactive Detection of Voice Cloning with Localized Watermarking

7.1.10 A Fast and Sound Tagging Method for Discontinuous Named-Entity Recognition

7.1.11 Few-Shot Domain Adaptation for Named-Entity Recognition via Joint Constrained k-Means and Subspace Selection

7.1.12 Training LayoutLM from Scratch for Efficient Named-Entity Recognition in the Insurance Domain

7.1.13 WaterMax: breaking the LLM watermark detectability-robustness-quality trade-off

7.1.14 When does gradient estimation improve black-box adversarial attacks?

7.1.15 SWIFT: Semantic Watermarking for Image Forgery Thwarting

7.1.16 Distinctive Image Captioning: Leveraging ground truth captions in CLIP guided reinforcement learning

7.1.17 Fast Reliability Estimation for Neural Networks with Adversarial Attack-Driven Importance Sampling

7.1.18 Watermarking Makes Language Models Radioactive

7.1.19 A Double-Edged Sword: The Power of Two in Defending Against DNN Backdoor Attacks

7.1.20 A Comprehensive Survey on Backdoor Attacks and Their Defenses in Face Recognition Systems

7.1.21 Beyond Internet Images: Evaluating Vision-Language Models for Domain Generalization on Synthetic-to-Real Industrial Datasets

7.1.22 HYBRINFOX at CheckThat! 2024 - Task 2: Enriching BERT Models with the Expert System VAGO for Subjectivity Detection

7.2 Accessing Information

7.2.1 A Multi-Label Dataset of French Fake News: Human and Machine Insights

7.2.2 Exposing propaganda: an analysis of stylistic cues comparing human annotations and machine classification

8 Bilateral contracts and grants with industry

8.1 Bilateral contracts with industry

CIFRE PhD: Certification of Deep Neural Networks

CIFRE PhD: Watermarking and deep learning

CIFRE PhD: Domain generalization exploiting synthetic data

CIFRE PhD: Detection and explanation of semantic manipulations in multimedia content

CIFRE PhD: Machine learning for identification of factors impacting the quality of service of urban buses

CIFRE PhD: Introduction of rejection capabilities and externalized language models in deep learning systems for text reading under adverse conditions

Telegramme-CNRS bilateral contract: NLP for computational journalism

DGA-Inria collaboration

9 Partnerships and cooperations

9.1 International initiatives

9.1.1 Inria associate team not involved in an IIL or an international program

LOGIC

9.1.2 STIC/MATH/CLIMAT AmSud projects