2025 Activity report
Project-Team CEDAR
RNSR: 201622056J
- Research center Inria Saclay Centre at Institut Polytechnique de Paris
- In partnership with: Institut Polytechnique de Paris, CNRS
- Team name: Rich Data Exploration at Cloud Scale
- In collaboration with: Laboratoire d'informatique de l'école polytechnique (LIX)
Creation of the Project-Team: 2018 April 01
Keywords
Computer Science and Digital Science
- A3.3. Data and knowledge analysis
- A9.1. Knowledge
- A9.2. Machine learning
- A9.2.1. Supervised learning
- A9.2.2. Unsupervised learning
- A9.2.3. Reinforcement learning
- A9.2.6. Neural networks
- A9.2.8. Deep learning
- A9.4. Natural language processing
- A9.13. Agentic AI
- A9.15. Symbolic AI
- A9.16. Societal impact of AI
- A9.17. Cybersecurity and AI
Other Research Topics and Application Domains
- B2.3. Epidemiology
- B6.5. Information systems
- B8.5.1. Participative democracy
- B9.5.6. Data science
- B9.7.2. Open data
- B9.10. Privacy
- B9.11.2. Financial risks
1 Team members, visitors, external collaborators
Research Scientists
- Ioana Manolescu Goujot [Team leader, INRIA, Senior Researcher, HDR]
- Oana-Denisa Balalau [INRIA, ISFP]
- Oana Goga [Inria, Senior Researcher, HDR]
- Madhulika Mohanty [INRIA, Researcher]
Faculty Member
- Yanlei Diao [ECOLE POLY PALAISEAU, Professor]
Post-Doctoral Fellows
- Garima Gaur [INRIA, Post-Doctoral Fellow]
- Chadi Helwe [INRIA, Post-Doctoral Fellow, until Apr 2025]
- Guillaume Lachaud [ECOLE POLY PALAISEAU, Post-Doctoral Fellow]
- Kun Zhang [ECOLE POLY PALAISEAU, from Apr 2025 until May 2025]
PhD Students
- Ines Abdelaziz [INRIA, from Dec 2025]
- Nardjes Amieur [CNRS]
- Gabriel Ben Zenou [Ministère Armées]
- Abir Benzaamia [CNRS]
- Theo Bouganim [INRIA, until Mar 2025]
- Tom Calamai [INRIA & Amundi, CIFRE]
- Salim Chouaki [CNRS]
- Przemyslaw Dominikowski [ECOLE POLY PALAISEAU, from Sep 2025]
- Asmaa El Fraihi [CNRS]
- Vincent Jacob [ECOLE POLY PALAISEAU, until Mar 2025]
- Hritika Kathuria [INRIA]
- Muhammad Khan [INRIA, until Sep 2025]
- Gabriel Lozano Pinzon [ECOLE POLY PALAISEAU, from Sep 2025]
- Mohamed Mezhoudi [BNP PARIBAS, CIFRE]
- Kun Zhang [INRIA, until Mar 2025]
Technical Staff
- Ines Abdelaziz [INRIA, Engineer, until Nov 2025]
- Simon Ebel [INRIA, Engineer, until Jun 2025]
- Theo Galizzi [INRIA, Engineer, until Jun 2025]
- Ismail Hatim [ECOLE POLYTECHNIQUE, Engineer, from Nov 2025]
- Aurelien Peden [INRIA, Engineer, from Mar 2025 until Oct 2025]
- Georgios Siachamis [INRIA, Engineer]
Interns and Apprentices
- Pablo Bertaud-Velten [INRIA, Intern, from Mar 2025 until Jul 2025]
- Nikola Dobricic [INRIA, Intern, until Mar 2025]
- Przemyslaw Dominikowski [INRIA, Intern, from Mar 2025 until Aug 2025]
- Paul Kronlund-Drouault [INRIA, Intern, from Jun 2025 until Aug 2025]
- Gabriel Lozano Pinzon [ECOLE POLY PALAISEAU, Intern, from Mar 2025 until Aug 2025]
- Maria-Justina-Adriana Mateescu [INRIA, Intern, from Jul 2025 until Jul 2025]
- Maria Jesus Mellado Tenorio [INRIA, Intern, from Mar 2025 until May 2025]
- Saba Shahsavari [INRIA, Intern, from Apr 2025 until Aug 2025]
- Yanis Zaamoun [ECOLE POLY PALAISEAU, Intern, until Mar 2025]
Administrative Assistant
- Michael Barbosa [INRIA]
External Collaborators
- Alexandre Barlot [Radio France]
- Nelly Barret [ECOLE POLYT. MILAN, until Apr 2025]
- Antoine Deiana [Radio France, until May 2025]
- Helena Galhardas [Instituto Superior Técnico, University of Lisbon]
- Emilie Gautreau [Radio France, until Apr 2025]
- Remi Guillou [ECOLE POLY PALAISEAU, from Jun 2025 until Aug 2025]
- Samuel Guimaraes [CNRS, until Mar 2025]
- Paul Kronlund-Drouault [ENS DE LYON, from Sep 2025]
- Chenghao Lyu [Univ Massachusetts Amherst, from Sep 2025]
- Adrien Maumy [Radio France, until Apr 2025]
- Tobias Moller [TELECOM PARIS, from Jul 2025 until Nov 2025]
- Thomas Pontillon [Radio France, until Apr 2025]
- Gerald Roux [Radio France, until Apr 2025]
- Prajna Devi Upadhyay [BITS PILANI HYDERABAD CAMPUS]
- Joanna Yakin [Radio France, until Apr 2025]
2 Overall objectives
Our research aims at models, algorithms and tools for highly efficient, easy-to-use data and knowledge management; throughout our research, performance at scale is a core concern, which we address, among other techniques, by designing algorithms for a cloud (massively parallel) setting. In addition, we explore and mine rich data via machine learning techniques. Our scientific contributions fall into four interconnected areas:
- Optimization and performance at scale. We devise efficient and effective optimization techniques that make the processing of data at very large scale as efficient as possible. These efforts span relational, graph, and text-rich data, in centralized as well as distributed architectures.
- Data discovery and exploration. Today's Big Data is complex; understanding and exploiting it is daunting, especially for novice users such as journalists or domain scientists. We devise techniques that allow users to explore graph data and large, heterogeneous data lakes, as well as more subtle signals hidden in the data, such as anomalies in time series and in dynamic graphs.
- Natural language understanding for analyzing and supporting digital arenas. In this area, we are interested in applications with high social value, such as analysing public discourse with the goal of finding elements that could bias the world view of citizens, such as false claims, fallacious arguments, propaganda, or greenwashing.
- Safeguarding information systems. Recent events have shown how easily current online systems can be used to propagate information, some of it false, and that we are facing an information war. We create knowledge and technology in this area to make the online information space safer.
3 Research program
3.1 Multi-model querying
As the world's affairs become increasingly digital, a large and varied set of data sources becomes available: structured databases, such as government-gathered data (demographics, economics, taxes, elections), legal records, and stock quotes for specific companies, as well as unstructured or semi-structured sources, including in particular graph data, sometimes endowed with semantics (see e.g., the Linked Open Data cloud). Modern data management applications, such as data journalism, are eager to combine in innovative ways both static and dynamic information coming from structured, semi-structured, and unstructured databases and social feeds. However, current content management tools are not suited to this task, in particular when they require a lengthy, rigid cycle of data integration and consolidation in a warehouse. Thus, we need flexible tools allowing us to interconnect various kinds of data sources and query them together.
3.2 New methods for exploring and querying data graphs
Semantic graphs, including data and knowledge, are hard for users to apprehend due to the complexity of their structure and, often, their large volumes. To help tame this complexity, we seek new methods for exploring highly heterogeneous data graphs resulting from integrating structured, semi-structured, and unstructured (text) data. In this context, we study methods for automatically identifying, in a large corpus of data sources, interesting data paths that connect Named Entities (NE) to each other. Further, in some application contexts where RDF data graphs are collaboratively used, it is essential that access control methods be in place to guard access to the data. Query answers then need to be computed by taking into account access control restrictions, as well as ontologies that describe the data semantics.
3.3 Navigating the continuum between text and (semi) structured data
In data journalism and fact-checking applications, useful information comes both in structured records and in natural language text,
3.4 A unified framework for optimizing data analytics
Data analytics in the cloud has become an integral part of enterprise businesses. Big data analytics systems, however, still lack the ability to take user performance goals and budgetary constraints for a task, collectively referred to as task objectives, and automatically configure an analytic job to achieve these objectives. Our goal is to develop a data analytics optimizer that can automatically determine a cluster configuration with a suitable number of cores and other runtime system parameters that best meet the task objectives. To achieve this, we also need to design a multi-objective optimizer that constructs a Pareto optimal set of job configurations for task-specific objectives and recommends new job configurations to best meet these objectives.
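As an illustration of what a Pareto optimal set of job configurations means, the following minimal sketch filters a handful of candidate configurations down to the non-dominated ones and then picks one by a weighted preference. The `JobConfig` fields, the candidate numbers, and the weighted-sum recommendation rule are all hypothetical choices made for the example; they are not the team's optimizer.

```python
from typing import NamedTuple

class JobConfig(NamedTuple):
    cores: int         # hypothetical knob: number of cores
    latency_s: float   # hypothetical predicted latency, in seconds
    cost_usd: float    # hypothetical predicted cost, in dollars

def dominates(a, b):
    """a dominates b if it is no worse on both objectives and strictly better on one."""
    return (a.latency_s <= b.latency_s and a.cost_usd <= b.cost_usd
            and (a.latency_s < b.latency_s or a.cost_usd < b.cost_usd))

def pareto_set(configs):
    """Keep only the configurations that no other configuration dominates."""
    return [c for c in configs if not any(dominates(o, c) for o in configs)]

def recommend(configs, w_latency=0.5, w_cost=0.5):
    """Pick, within the Pareto set, the configuration minimizing a weighted
    sum of normalized objectives (one simple way to encode task objectives)."""
    front = pareto_set(configs)
    max_lat = max(c.latency_s for c in front)
    max_cost = max(c.cost_usd for c in front)
    return min(front, key=lambda c: w_latency * c.latency_s / max_lat
                                    + w_cost * c.cost_usd / max_cost)

candidates = [
    JobConfig(4, 120.0, 1.0),
    JobConfig(8, 70.0, 1.4),
    JobConfig(16, 40.0, 2.4),
    JobConfig(32, 45.0, 4.0),  # slower and more expensive than the 16-core option
]
front = pareto_set(candidates)
balanced = recommend(candidates)
fastest = recommend(candidates, w_latency=1.0, w_cost=0.0)
```

Here the 32-core candidate is dropped because the 16-core one is both faster and cheaper; changing the weights moves the recommendation along the remaining trade-off curve.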
3.5 Elastic resource management for virtualized database engines
Database engines are migrating to the cloud to leverage the opportunities for efficient resource management by adapting to the variations and heterogeneity of the workloads. Resource management in a virtualized setting, like the cloud, must be enforced in a performance-efficient manner to avoid introducing overheads to the execution. We design elastic systems that change their configuration at runtime with minimal cost to adapt to the workload every time. Changes in the design include both different resource allocations and different data layouts. We consider different workloads, including transactional, analytical, and mixed, and we study the performance implications on different configurations to propose a set of adaptive algorithms.
3.6 Argumentation mining
Argumentation appears when we evaluate the validity of new ideas, convince an addressee, or solve a difference of opinion. An argument contains a statement to be validated (a proposition also called claim or conclusion), a set of backing propositions (called premises, which should be accepted ideas), and a logical connection between all the pieces of information presented that allows the inference of the conclusion. In our work, we focus on fallacious arguments, where the evidence does not prove or disprove the claim. For example, in an "ad hominem" argument, a claim is declared false because the person making it has a character flaw. We study the impact of fallacies in online discussions and show the need for improving tools for their detection. In addition, we look into detecting verifiable claims made by politicians. We started a collaboration with RadioFrance and with Wikidébats, a debate platform focused on providing quality arguments for controversial topics.
3.7 Measuring and mitigating risks of AI-driven information targeting
We are witnessing a massive shift in the way people consume information. In the past, people had an active role in selecting the news they read. More recently, information started to appear on people's social media feeds as a byproduct of their social relations. We now see a new shift brought by the emergence of online advertising platforms, where third parties can pay ad platforms to show specific information to particular groups of people through paid targeted ads. AI-driven algorithms power these targeting technologies. Our goal is to study the risks of AI-driven information targeting at three levels: (1) human level: under which conditions targeted information can influence an individual's beliefs; (2) algorithmic level: under which conditions AI-driven targeting algorithms can exploit people's vulnerabilities; and (3) platform level: whether targeting technologies lead to biases in the quality of information that different groups of people receive and assimilate. Then, we will use this understanding to propose protection mechanisms for platforms, regulators, and users.
4 Application domains
4.1 Cloud computing
Cloud computing services are developing strongly, and more and more companies and institutions run their computations in the cloud in order to avoid the hassle of running their own infrastructure. Today's cloud service providers guarantee machine availability in their Service Level Agreement (SLA), without any guarantees on performance measures under a specific cost budget. Running analytics on big data systems requires the user not only to reserve suitable cloud instances over which the big data system will run, but also to set many system parameters, such as the degree of parallelism and the granularity of scheduling. Choosing values for these parameters, and choosing cloud instances, must meet user objectives regarding latency, throughput and cost, which is a complex task if done manually by the user. Hence the need arises to transform cloud service models from availability guarantees to user performance objectives, which leads to the problem of multi-objective optimization. Research carried out in the team within the ERC project “Big and Fast Data Analytics” aims to develop a novel optimization framework for providing guarantees on the performance while controlling the cost of data processing in the cloud.
4.2 Computational journalism
Modern journalism increasingly relies on content management technologies in order to represent, store, and query source data and media objects themselves. Writing news articles increasingly requires consulting several sources, interpreting their findings in context, and crossing links between related sources of information. Cedar research results directly applicable to this area provide techniques and tools for rich Web content warehouse management. Within the SourcesSay AI Chair project, we work to devise concrete algorithms and platforms to help journalists perform their work better and/or faster. This work is in collaboration with the journalists from RadioFrance, the team Le vrai du faux.
4.3 Computational social science
Political discussions revolve around ideological conflicts that often split the audience into two opposing parties. Both parties try to win the argument by bringing forward information. However, often this information is misleading, and its dissemination employs propaganda techniques. We investigate the impact of propaganda in online forums and we study a particular type of propagandist content, the fallacious argument. We show that identifying such arguments remains a difficult task, but one of high importance because of the pervasiveness of this type of discourse. We also explore trends around the diffusion and consumption of propaganda and how this can impact or be a reflection of society.
4.4 Online targeted advertising
The enormous financial success of online advertising platforms is partially due to the precise targeting features they offer. Ad platforms collect large amounts of data on users and use powerful AI-driven algorithms to infer users' fine-grained interests and demographics, which they make available to advertisers to target users. For instance, advertisers can target groups of users as small as tens or hundreds and as specific as “people interested in anti-abortion movements that have a particular education level”. Ad platforms also employ AI-driven targeting algorithms to predict how “relevant” ads are to particular groups of people, in order to decide to whom to deliver them. While these targeting technologies create opportunities for businesses to reach interested parties and lead to economic growth, they also open the way for interested groups to use users' data to manipulate them by targeting messages that resonate with each user.
5 Social and environmental responsibility
5.1 Contribution to Diversity, Equity and Inclusion
Madhulika Mohanty co-led the SCOUT action of the Diversity, Equity and Inclusion initiative (website) for the DB research community from 2021 to 2025. This action provided a checklist of items to be checked before submitting a paper, in order to promote and ensure more DEI-compliant papers. The checklist will be integrated within the standard submission systems for DB conferences. This work has led to the publication of [13].
6 Highlights of the year
6.1 Awards
The paper “RDF Query Answering in the Presence of Access Restrictions” by Maxime Buron, Hritika Kathuria, Ioana Manolescu Goujot and Georgios Siachamis won the CoopIS 2025 Best Paper Award [28].
7 Latest software developments, platforms, open data
7.1 Latest software developments
7.1.1 ConnectionLens
- Name: Integration of heterogeneous data using information extraction
- Keyword: Data analysis
- Functional Description: ConnectionLens treats a set of heterogeneous, independently authored data sources as a single virtual graph, where nodes represent fine-granularity data items (relational tuples, attributes, key-value pairs, RDF, JSON or XML nodes…) and edges correspond either to structural connections (e.g., a tuple is in a database, an attribute is in a tuple, a JSON node has a parent…) or to similarity (sameAs) links. To further enrich the content journalists work with, we also apply entity extraction, which makes it possible to detect the people, organizations, etc. mentioned in text, whether full text or text snippets found e.g. in RDF or XML. ConnectionLens is thus capable of finding and exploiting connections present across heterogeneous data sources without requiring the user to specify any join predicate.
- URL:
- Publications:
- Contact: Ioana Manolescu Goujot
7.1.2 Abstra
- Name: Abstra: Toward Generic Abstractions for Data of Any Model
- Keywords: Heterogeneous Data, Data Exploration, Data analysis, Databases, LOD - Linked open data
- Functional Description: Abstra computes a description meant for humans, based on the idea that, regardless of the syntax or the data model, any dataset holds some collections of entities/records that are possibly linked by relationships. Abstra relies on a common graph representation of any incoming dataset, leverages Information Extraction to detect what the dataset is about, and relies on an original algorithm for selecting the core entity collections and their relations. Abstractions are shown both as HTML text and as a lightweight Entity-Relationship diagram.
- URL:
- Publications:
- Contact: Madhulika Mohanty
- Participants: Ioana Manolescu Goujot, Madhulika Mohanty, Nelly Barret, Prajna Devi Upadhyay
7.1.3 StatCheck
- Name: Fact-checking Multidimensional Statistic Claims in French
- Keywords: Machine learning, Databases, Natural language processing, Software engineering
- Scientific Description: To strengthen public trust and counter disinformation, computational fact-checking, leveraging digital data sources, attracts interest from journalists and the computer science community. A particular class of interesting data sources comprises statistics, that is, numerical data compiled mostly by governments, administrations, and international organizations. Statistics are often multidimensional datasets, where multiple dimensions characterize one value, and the dimensions may be organized in hierarchies. This paper describes STATCHECK, a statistic fact-checking system jointly developed by the authors, who are either computer science researchers or fact-checking journalists working for a French-language media outlet with a daily audience of more than 15 million (aud, 2022). The technical novelty of STATCHECK is twofold: (i) we focus on multidimensional, complex-structure statistics, which have received little attention so far, despite their practical importance, and (ii) we provide novel statistical claim extraction modules for French, an area where few resources exist. We validate the efficiency and quality of our system on large statistic datasets (hundreds of millions of facts), including the complete INSEE (French) and Eurostat (European Union) datasets, as well as French presidential election debates.
- Functional Description: StatCheck first collects the data it needs to operate. Two types of data are collected, statistical tables and posts from social networks:
  - Acquisition of statistical files from the sites of reference organisations (INSEE, Eurostat)
  - Extraction of statistical tables from these files, and storage of the extracted tables
  - Acquisition of political tweets from a list of accounts
  The application then supports the detection, extraction and search of statistical facts:
  - Detection and extraction of statistical facts from Twitter posts (e.g. "Unemployment rate increased by 30% in 2023")
  - Search for statistical facts in our database, with display of the twenty most relevant statistical tables for a statistical fact
  - Automatic transcription of audio files, to detect and extract statistical facts from the transcripts
- Release Contributions:
  - Redesign of the user interface
  - Modification of the software architecture
  - Addition of audio transcription
- URL:
- Publications:
- Contact: Ioana Manolescu Goujot
- Participants: Antoine Gauquier, Tien Duc Cao, Ioana Manolescu Goujot, Xavier Tannier, Oana-Denisa Balalau, Simon Ebel, Theo Galizzi
7.1.4 ConnectionStudio
- Keywords: Heterogeneous Data, Data Exploration
- Functional Description: ConnectionStudio integrates highly heterogeneous data into graphs, enriched with extracted entities. Studio users can discover the entities in their data, navigate across connections between datasets, and explore and query the data in many ways. The Studio currently supports CSV, JSON, XML, RDF, text, property graphs, all Office formats, and PDF datasets. ConnectionStudio is a novel front-end to ConnectionLens, Abstra and PathWays (see also the respective Web sites). Its own novel features are outlined in a CoopIS 2023 article.
- URL:
- Publications:
- Contact: Ioana Manolescu Goujot
- Participants: Madhulika Mohanty, Simon Ebel, Theo Galizzi
7.1.5 FactSpotter
- Keywords: Factual Faithfulness, Text generation
- Functional Description: We propose a new metric that correctly identifies factual faithfulness, i.e., given a triple (subject, predicate, object), it decides if the triple is present in a generated text. We show that our metric FactSpotter achieves the highest correlation with human annotations on data correctness, data coverage, and relevance. In addition, FactSpotter can be used as a plug-in feature to improve the factual faithfulness of existing models.
- Contact: Kun Zhang
- Partner: Ecole Polytechnique
7.1.6 PathWays
- Name: PathWays: finding entity paths in heterogeneous data graphs
- Keywords: Named entities, Data Journalism, Heterogeneous Data
- Functional Description: PathWays models heterogeneous datasets in a graph (see ConnectionLens). To identify interesting paths in this graph, PathWays works on its (smaller) summary (see Abstra) for efficiency and optimisation. Then, it sorts paths by their potential interest (a metric based on the entities found and the information dilution along the path) before evaluating them with the help of a new multi-query optimisation algorithm. Finally, PathWays shows the most interesting (evaluated) paths in the form of tables, which are very easy to understand for journalists, who are at the origin of this scenario.
- URL:
- Publications:
- Contact: Ioana Manolescu Goujot
7.1.7 OpenIEEntity
- Name: Open Information Extraction with Entity Focused Constraints
- Keyword: Information extraction
- Functional Description: This tool takes a sentence as input and outputs the facts contained in the sentence, in the format (subject, predicate, object).
- Contact: Oana-Denisa Balalau
7.1.8 FactCheckBureau
- Name: FactCheckBureau: Build Your Own Fact-Check Analysis Pipeline
- Keywords: Fact Check Retrieval, Fact-checking
- Functional Description: FactCheckBureau is an end-to-end solution that enables researchers to easily and interactively design and evaluate Fact Check retrieval pipelines. Further, it provides a query interface for non-technical users to find relevant Fact Checks for an input query in the form of a key phrase, a social media post, or an image.
- URL:
- Publication:
- Contact: Ioana Manolescu Goujot
7.1.9 FDSpotter
- Name: Structured Discourse Representation for Factual Consistency Verification
- Keyword: LLM
- Functional Description: The repository includes the tool to test for factual consistency, as well as all the code necessary to compare our tool with state-of-the-art methods for factual consistency.
- Contact: Oana-Denisa Balalau
7.1.10 COI-OpenIE
- Keywords: Conflict Of Interest Mining, Knowledge graph, Scientific Text, Information extraction
- Functional Description: This software expects as input a collection of certain sections (Acknowledgment, Funding disclosure, and so on) of scientific publications, and produces a knowledge graph describing the interesting relations among the Individuals and Organizations present in the input text corpus.
- Contact: Oana-Denisa Balalau
7.1.11 ClimateNLP toolbox
- Name: Climate NLP toolbox
- Keywords: Climate change, Classification, Natural language processing
- Functional Description: Python scripts to train or download models (BERT-based models, TF-IDF). The toolbox also contains scripts to run LLM pipelines that perform the same tasks.
- Contact: Tom Calamai
7.1.12 MultilingualPoliticalLLMs
- Keywords: LLM, Multilingual
- Functional Description: We test different scenarios, where we vary the language of the prompt while also assigning a nationality to the model. We evaluate models on the 50 most populous countries and their official languages.
- URL:
- Contact: Oana-Denisa Balalau
8 New results
8.1 Data management for analyzing and verifying digital arenas
8.1.1 Graph data lakes of heterogeneous data sources for data journalism
Participants: Oana-Denisa Balalau, Pablo Bertaud-Velten, Nikola Dobricic, Przemyslaw Dominikowski, Simon Ebel, Theo Galizzi, Garima Gaur, Ioana Manolescu, Maria Jesus Mellado Tenorio, Madhulika Mohanty, Saba Shahsavari, Georgios Siachamis.
Work carried out within the ANR AI Chair SourcesSay project has focused on developing a platform, ConnectionLens, for integrating arbitrary heterogeneous data into a graph, then exploring and querying that graph using simple, intuitive query interfaces. The main technical challenges addressed were: (i) how to interconnect structured and semi-structured data sources? We address this through information extraction (when an entity appears in two data sources or two places in the same graph, we only create one node, thus interlinking the two locations) and through similarity comparisons (see Section 7.1.1); (ii) how to find all connections between nodes matching specific search criteria, or certain keywords? The question is particularly challenging in our context since ConnectionLens graphs can be quite large, and query answers can traverse edges in both directions; (iii) how to convert this graph into standard graph data models such as property graphs? ConnectionLens is available online at: ConnectionLens Gitlab repository, while ConnectionStudio, its GUI, is available at ConnectionStudio Gitlab repository.
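To make the "one node per entity" idea concrete, here is a minimal sketch of a graph in which two records from different sources become connected because they mention the same entity. The `DataGraph` class, its method names, and the node identifier scheme are hypothetical, and the entity mentions are supplied by hand instead of coming from an actual extractor; this is not the ConnectionLens data model or API.

```python
from collections import defaultdict

class DataGraph:
    """Toy integration graph: records, their fields, and shared entity nodes."""

    def __init__(self):
        self.edges = defaultdict(set)   # node -> set of (label, neighbour)
        self.entity_nodes = {}          # (type, normalized text) -> node id

    def add_edge(self, src, label, dst):
        # Stored in both directions, since queries may traverse edges either way.
        self.edges[src].add((label, dst))
        self.edges[dst].add((label + "^-1", src))

    def entity(self, etype, text):
        # A single node per (type, normalized text), however often it is mentioned.
        key = (etype, text.strip().lower())
        return self.entity_nodes.setdefault(key, f"entity:{etype}:{key[1]}")

    def add_record(self, source, record_id, fields, entities):
        """`entities` maps a field name to (type, text) mentions, as a
        hypothetical extractor might return them."""
        rec = f"{source}/{record_id}"
        self.add_edge(source, "contains", rec)
        for name in fields:
            val = f"{rec}/{name}"
            self.add_edge(rec, name, val)
            for etype, text in entities.get(name, []):
                self.add_edge(val, "mentions", self.entity(etype, text))

g = DataGraph()
g.add_record("companies.csv", "row1", {"owner": "Alice Martin"},
             {"owner": [("Person", "Alice Martin")]})
g.add_record("articles.json", "doc7", {"text": "alice martin met the minister"},
             {"text": [("Person", "alice martin")]})

alice = g.entity("Person", "Alice Martin")
mentioned_in = sorted(n for label, n in g.edges[alice] if label == "mentions^-1")
```

Both the CSV cell and the JSON text field end up adjacent to the same entity node, so a path between the two sources exists without any join predicate being specified.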
With the ANR TopOL project, we now extend our contributions to large scale data lakes of heterogeneous data sources and explore novel ways of exploration. In this context, the following new contributions have been brought:
- Efficiently Profiling, Indexing and Querying Heterogeneous Datasets in Graph Data Lakes. Building on the ConnectionLens (Section 7.1.1) and Abstra (Section 7.1.2) frameworks, this work focuses on enabling natural language question answering over large-scale heterogeneous data lakes. In each dataset, we have formalized the concept of entities and their contexts, which serve as natural "anchors" of users' questions, e.g. which Person interacted with which Organization, and at what Location. To support efficient search over the set of entities-in-context, we developed an end-to-end system that ingests heterogeneous sources into a graph data lake (using ConnectionLens), abstracts them into collections (using Abstra), and finally builds and indexes the entities-in-context. The indexes developed include Locality Sensitive Hashing (LSH) for semantic similarity search and a trie-like structure for exact lookups.
This work provides a foundation for future work (e.g. building a Retrieval-Augmented Generation system) allowing non-technical users such as journalists to uncover interesting facts in large heterogeneous data lakes, in particular in domains such as investigative journalism (through the team's ongoing collaboration with ICIJ).
- Batch Generic Evaluation of Keyword Queries on Graphs. Keyword search is a popular paradigm for searching for information in graphs: users specify a few search terms (or keywords), and the system returns subtrees of the graph, where each keyword is matched by a node in each returned subtree. Because the problem is NP-hard in general, many keyword search algorithms consider a fixed score function which is applied to rank result trees, and explore only part of the search space, pruning trees with low scores. In contrast, generic algorithms explore the complete search space (subject to space or time limits, due to the high complexity), but can be used with any score function. In this work, we consider the problem of simultaneously answering a set (batch) of keyword queries, in a way compatible with any score function. Building upon our recent one-query generic algorithm [36], we show that when graph nodes match keywords from multiple queries, graph exploration effort can be shared, to speed up the evaluation of the query batch. We formally establish guarantees on the correctness and completeness of our algorithm, and demonstrate its efficiency through comprehensive experiments over synthetic and real-world graphs.
- Named Entity Cleaning and Enhancement with Human-in-the-loop. Named Entities (NEs, in short) are frequently encountered in datasets about varied topics, e.g., journalistic investigations (people, places, companies), market analysis (companies and officers), etc. NEs often appear under different forms within or across datasets, due to spelling variants or mistakes. To leverage NE-rich datasets, the NEs need to be clean (error-free), and possibly enriched with information from external sources. While numerous data cleaning solutions exist, in this work, we focus on the specific challenges raised by the cleaning of NE sets, in particular (i) through a visual workflow interface, (ii) leveraging old and new techniques (string distances, Knowledge Bases, and carefully controlled access to LLMs), and especially (iii) enabling human inspection and interaction with the NE cleaning process, down to the granularity of an individual attribute of a record. The latter need is crucial in order to capture advanced knowledge that only domain experts have, and which may be absent from all other sources of information (KB, LLM, etc.). We support this by gathering how-provenance that traces the numerous ways in which information is brought to clean NEs. We built NiceT, a system addressing these challenges, and tested it on a variety of real-life datasets.
- Named Entity Centric Querying over Heterogeneous Data Integrating information from diverse sources, particularly in investigative journalism, often hinges on linking data through shared named entities (NEs). The same entity may appear across multiple sources, each providing a different contextual perspective. For instance, when combining U.S. financial and political datasets, Donald Trump may emerge as a common entity, associated with distinct roles such as businessperson and politician. From a journalistic standpoint, the ability to seamlessly integrate heterogeneous data sources and query entity roles or inter-entity relationships, without requiring advanced technical expertise, is critical.
This project, centered on the problem of extracting and integrating information about a named entity (NE) that may appear across heterogeneous datasets within a data lake, gives rise to two concrete research challenges. First, given an input NE, identify the roles (contexts) it plays across different datasets and aggregate relevant information about the NE. We refer to the aggregated output as the Infocard of the NE. Second, given a collection of heterogeneous datasets and an NE, find its interesting relationships with other named entities. We leverage the capabilities of our in-house tools, ConnectionLens and Abstra, which can integrate structured, semi-structured, and unstructured datasets into a unified graph, and create high-level semantic abstractions of complex datasets.
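As an illustration of the string-distance step mentioned for named entity cleaning above, the following minimal sketch groups spelling variants of the same entity. It is not the NiceT implementation: the function names, the normalization rules, and the 0.85 threshold are assumptions chosen for the example, and the similarity measure is the one provided by Python's standard difflib module.

```python
from difflib import SequenceMatcher


def normalize(name):
    """Lowercase, drop dots and commas, and collapse whitespace (illustrative rules)."""
    return " ".join(name.lower().replace(".", " ").replace(",", " ").split())


def similar(a, b, threshold=0.85):
    """True if the normalized names have a difflib similarity ratio >= threshold."""
    return SequenceMatcher(None, normalize(a), normalize(b)).ratio() >= threshold


def group_variants(names, threshold=0.85):
    """Greedily attach each name to the first group whose representative is similar."""
    groups = []
    for name in names:
        for group in groups:
            if similar(name, group[0], threshold):
                group.append(name)
                break
        else:
            # No existing group is close enough: start a new one.
            groups.append([name])
    return groups


groups = group_variants(["Radio France", "radio france", "Radio Frnace", "Amundi"])
```

In a human-in-the-loop setting, such automatically proposed groups would only be candidates; a domain expert would confirm or reject each merge.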
8.1.2 RDF Query Answering in the Presence of Access Restrictions
Participants: Maxime Buron, Hritika Kathuria, Ioana Manolescu, Georgios Siachamis.
In this work, we explore algorithms for answering conjunctive RDF queries in the presence of RDFS ontologies and access control. We consider an access control setting where by default all users have access to the complete graph, and a restriction can forbid a user's access to specific IRIs. Here, restricting a user's access to an IRI entails that: no answer to a query by that user may contain the IRI; no triple containing the IRI can be used to compute an answer for a query by that user, nor to entail such a triple via reasoning with the ontology. We present a set of query answering algorithms for this novel context, and prove that five among them are correct, i.e., sound and complete, with respect to both the ontology and the access restrictions in place. We have implemented all our algorithms and present experiments comparing their performance. This work was published in CoopIS 2025 28 where it won the Best Paper Award.
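The restriction semantics described above can be illustrated with a naive baseline, which is not one of the algorithms of the paper: drop every triple mentioning a forbidden IRI, saturate the remaining graph, then evaluate the query. The sketch below assumes triples are plain Python tuples, covers only two RDFS rules (subclass transitivity and type propagation), and assumes that ontology triples are subject to the restriction as well; all names are hypothetical.

```python
RDF_TYPE = "rdf:type"
SUBCLASS = "rdfs:subClassOf"


def visible(triples, forbidden):
    """Keep only the triples that mention no forbidden IRI."""
    return {t for t in triples if not (set(t) & forbidden)}


def saturate(triples):
    """Apply subclass transitivity and type propagation until a fixpoint is reached."""
    closed = set(triples)
    while True:
        sub = {(s, o) for s, p, o in closed if p == SUBCLASS}
        new = set()
        for s, p, o in closed:
            if p in (RDF_TYPE, SUBCLASS):
                # (s p C) and (C subClassOf D) entail (s p D).
                new |= {(s, p, d) for c, d in sub if c == o}
        if new <= closed:
            return closed
        closed |= new


def instances_of(cls, triples, forbidden=frozenset()):
    """Answer the query 'x rdf:type cls' for a user with the given restrictions."""
    graph = saturate(visible(triples, set(forbidden)))
    return {s for s, p, o in graph if p == RDF_TYPE and o == cls}


triples = {
    (":alice", RDF_TYPE, ":Journalist"),
    (":Journalist", SUBCLASS, ":Person"),
    (":bob", RDF_TYPE, ":Editor"),
    (":Editor", SUBCLASS, ":Person"),
}
unrestricted = instances_of(":Person", triples)
restricted = instances_of(":Person", triples, {":Editor"})
```

With the restriction on `:Editor`, the fact that `:bob` is a `:Person` can no longer be derived, because both triples needed for the inference mention the forbidden IRI.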
8.1.3 FactCheck-KG: Towards LLM-backed FC Retrieval
Participants: Garima Gaur, Madhulika Mohanty.
There is an unprecedented rise in the volume and reach of disinformation due to the popularity of social media and the advent of generative AI models. Fact-checking, that is, checking the veracity of a certain claim, is infeasible at this scale by human effort alone. This is primarily due to the rise in the volume of claims requiring verification, and also the number of documents to be processed to verify a certain claim. This process is further complicated by disinformation re-surfacing in paraphrased or incomplete forms, or in an altered context. Fact-checkers often find themselves re-assessing a previously evaluated claim, which wastes precious human effort. In order to tackle these challenges, fact-check retrieval (FCR) pipelines have been developed that, given a newly encountered claim, aim to identify the most relevant claims among a set of previously assessed claims. In this work, we leverage NLP techniques over a set of fact-checked claims and their related articles, to build a Knowledge Graph (KG), FactCheck-KG, of named entities, topics, claims and articles, with edges capturing the connection across different fact-checks via common topics and named entities. This representation lays the foundation for more context-aware, fine-grained fact-check retrieval. For example, with the success of retrieval augmented generation (RAG) and its extension to a graph-based retrieval (GraphRAG) framework, our KG can form a starting point for its application to solve the fact-check retrieval problem.
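To make the idea of connecting fact-checks through shared entities and topics concrete, here is a minimal sketch under strong simplifying assumptions: each fact-check is reduced to a set of already extracted labels, and candidates are ranked by the number of labels they share with the new claim. The data layout, label prefixes, and function names are hypothetical and do not describe the actual FactCheck-KG schema or retrieval method.

```python
from collections import defaultdict


def build_index(fact_checks):
    """Map each entity/topic label to the set of fact-checked claims mentioning it."""
    index = defaultdict(set)
    for claim_id, labels in fact_checks.items():
        for label in labels:
            index[label].add(claim_id)
    return index


def retrieve(new_labels, index, k=3):
    """Rank previously checked claims by the number of labels shared with the new claim."""
    scores = defaultdict(int)
    for label in new_labels:
        for claim_id in index.get(label, ()):
            scores[claim_id] += 1
    # Highest overlap first; ties are broken by claim identifier for determinism.
    return sorted(scores, key=lambda c: (-scores[c], c))[:k]


fact_checks = {
    "c1": {"entity:WHO", "topic:vaccines"},
    "c2": {"entity:WHO", "topic:budget"},
    "c3": {"topic:climate"},
}
index = build_index(fact_checks)
candidates = retrieve({"entity:WHO", "topic:vaccines"}, index)
```

A graph-based retrieval pipeline could use such candidates as a starting set before applying finer-grained, context-aware ranking.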
8.1.4 Efficient and Scalable Search for Statistics
Participants: Simon Ebel, Helena Galhardas, Theo Galizzi, Ioana Manolescu, Aurelien Peden.
Informed public debate needs high-quality data. In this context, high-quality statistical data sources are a valuable category of reference information based on which a claim can be checked. To facilitate the work of journalists or other fact-checkers, users' questions about a specific claim should be automatically answered based on statistical tables. This task is complicated by the large number, size, and variety of statistical datasets. This work introduces the statistical table discovery problem (STD, in short), which aims, given a natural language question and a set of statistical datasets (multidimensional tables), to find the tables most relevant for the question. We then describe STAR, an algorithm for solving the STD problem. Unlike existing table discovery (TD) solutions aimed at relational tables, STAR is devised specifically for multidimensional ones. Further, STAR treats the space and time dimensions of statistical datasets separately. We experimentally show that these features, together, make STAR outperform state-of-the-art TD systems adapted to the STD problem, in terms of scalability, search quality, preprocessing and question answering time. This work has been informally presented at BDA 2025 19 and the code is available at its Gitlab repository.
8.1.5 Structured Discourse Representation for Factual Consistency Verification
Participants: Oana-Denisa Balalau, Ioana Manolescu, Kun Zhang.
Analysing the differences in how events are represented across texts, or verifying whether language model generations contain hallucinations, requires the ability to systematically compare their content. To support such a comparison, a structured representation that captures fine-grained information plays a vital role. In particular, identifying distinct atomic facts and the discourse relations connecting them enables deeper semantic comparison. Our proposed approach combines structured discourse information extraction with a classifier, FDSpotter, for factual consistency verification. We show that adversarial discourse relations pose challenges for language models, but fine-tuning on our annotated data, DiscInfer, achieves competitive performance. Our proposed approach advances factual consistency verification by grounding it in linguistic structure and decomposing it into interpretable components. We demonstrate the effectiveness of our method on the evaluation of two tasks: data-to-text generation and text summarisation. This work has been published in ACL (Findings) 2025 27 and the software is available on BIL 7.1.9.
8.1.6 The Search for Conflicts of Interest: Open Information Extraction in Scientific Publications
Participants: Oana-Denisa Balalau, Garima Gaur, Ioana Manolescu, Prajna Upadhyay.
A conflict of interest (COI) appears when a person or a company has two or more interests that may directly conflict. This happens, for instance, when a scientist whose research is funded by a company audits the same company. For transparency and to avoid undue influence, public repositories of relations of interest are increasingly recommended or mandated in various domains, and can be used to avoid COIs. In this work, we propose an LLM-based open information extraction (OpenIE) framework for extracting financial or other types of interesting relations from scientific text. We target scientific publications in which authors declare funding sources or collaborations in the acknowledgment section, in the metadata, or in the publication, following editors’ requirements. We introduce an extraction methodology and present a knowledge base (KB) with a comprehensive taxonomy of COI-centric relations. Finally, we perform a comparative study of the disclosures of two journals in the field of toxicology and pharmacology. The work has been published in EMNLP (Findings) 2025 20 and the software is available on BIL 7.1.10.
8.2 Online targeted advertising
Participants: Ines Abdelaziz, Nardjes Amieur, Abir Benzaamia, Salim Chouaki, Asmaa El Fraihi, Oana Goga.
8.2.1 A Year Under the DSA: Ad Transparency's Uneven Landscape
The Digital Services Act (DSA) has put platform accountability on center stage, requiring online platforms to provide greater transparency into how advertisements are targeted and delivered to users. Central to these obligations are two mechanisms: user-facing ad explanations, which inform individuals why they were shown a given ad, and public ad repositories, which are intended to enable independent auditing of advertising practices. This study provides the first multi-platform evaluation of these two mechanisms across Facebook, Instagram, YouTube and X. Using 48,511 user-facing “Why am I seeing this ad?” (WAIST) notices, and a systematic analysis of each platform's public ad repository, we assess how well current implementations disclose the parameters and decision processes involved in targeting. To do so, we develop an operational framework based on Articles 26 and 39 of the DSA, capturing granularity, attribution of targeting and delivery choices, data source disclosures, and accuracy, and apply it across both user-facing notices and public ad repositories. Our findings show that transparency remains fragmented and inconsistent across platforms. User-facing explanations vary widely in precision and often omit key targeting information, while repositories provide incomplete, misattributed, and at times difficult-to-interpret targeting data. Moreover, discrepancies between explanations and repository entries undermine the reliability of both mechanisms. Overall, current transparency infrastructures fall short of the DSA's expectations and highlight the need for clearer and more enforceable standards for advertising transparency moving forward. This work has been accepted for publication in PETs/PoPETs 2026.
8.2.2 A Comparative Study of News Exposure and Consumption On and Off Facebook
Social media giants like Meta, Google, and X leverage powerful algorithms to personalize user feeds, a practice now under intense public scrutiny. These algorithms can inadvertently skew the information users consume, potentially influencing political opinions and voting decisions. This raises critical questions: Do social media platforms foster misinformation and contribute to echo chambers? To address this ongoing debate, our study directly compares news exposure on Facebook (where algorithmic influence is strong) with news consumption off-platform (where user behavior plays a larger role). Specifically, we investigate: (1) Are users exposed to more/less misinformation on Facebook compared with their off-platform misinformation consumption? (2) Is news exposure on Facebook more/less diverse than off-platform news consumption? (3) To what extent do socio-demographic and psychological factors influence misinformation exposure on Facebook and consumption off Facebook? (4) Is there a relationship between socio-demographic and psychological factors and news diversity on and off Facebook? and (5) Is users' exposure to misinformation on Facebook correlated to off-platform news consumption?
Our study of 123,995 news-related posts on Facebook and 70,587 news article visits off Facebook, collected from 642 users over 12 weeks, reveals the following central findings: (1) Only a small fraction (4%) of users' news consumption off Facebook is driven by news exposure on Facebook, and only 5.7% of misinformation consumption off Facebook is driven by news exposure on Facebook. (2) There is a higher prevalence of misinformation in user-received content on Facebook compared to deliberately consumed content off-platform. On Facebook, 5.9% of our users' news exposure comes from sources known for spreading misinformation, while off-platform, only 2.6% of our users' news consumption is from misinformation sources. Conversely, Facebook presents more diverse content: 22% of users received content from only one political leaning on Facebook, compared to 36% of users who consumed content from only one political leaning off-platform. (3) Several socio-demographic and psychological factors showed a statistically significant correlation with misinformation exposure on Facebook but not with misinformation consumption off Facebook. (4) The proportion of misinformation consumed off Facebook emerged as a statistically significant predictor of users' exposure to misinformation on Facebook, independent of news consumption on Facebook.
This work has been published in CSCW 2025 15.
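The two kinds of percentages reported above (share of news items coming from misinformation sources, and share of users who saw a single political leaning) can be computed from a per-item log as in the sketch below. The record layout `(user, source, leaning)` and the toy data are assumptions made for the example; they do not reflect the study's actual dataset or pipeline.

```python
def misinformation_share(items, misinformation_sources):
    """Fraction of news items whose source is in the misinformation list."""
    if not items:
        return 0.0
    flagged = sum(1 for _user, source, _leaning in items
                  if source in misinformation_sources)
    return flagged / len(items)


def single_leaning_share(items):
    """Fraction of users whose news items all come from one political leaning."""
    leanings_by_user = {}
    for user, _source, leaning in items:
        leanings_by_user.setdefault(user, set()).add(leaning)
    if not leanings_by_user:
        return 0.0
    single = sum(1 for seen in leanings_by_user.values() if len(seen) == 1)
    return single / len(leanings_by_user)


# Toy log: (user, news source, political leaning of the source).
items = [
    ("u1", "a.example", "left"),
    ("u1", "b.example", "right"),
    ("u2", "a.example", "left"),
    ("u2", "m.example", "left"),
]
misinfo = misinformation_share(items, {"m.example"})
one_sided = single_leaning_share(items)
```

Computing the same two measures on the on-platform log and on the off-platform log yields the pairs of figures that are compared in the findings.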
8.2.3 Privacy Settings and Ad Perception: The Shift from Third-Party Cookies to the Privacy Sandbox
Online behavioral advertising, heavily reliant on privacy-invasive third-party cookie tracking, faces a significant shift as browsers like Safari, Brave, and Firefox have already deprecated them. Google Chrome announced its parallel move with the "Privacy Sandbox Initiative" in 2019, proposing privacy-preserving advertising mechanisms. The extent to which Privacy Sandbox can deliver comparable ad relevance and purchase intent to the established third-party cookie ecosystem will likely determine its adoption as a widespread alternative. This paper presents the first user study evaluating the impact of Privacy Sandbox APIs on ad perception. Our findings show that users perceive Privacy Sandbox ads as less relevant and exhibit lower purchase intent compared to third-party cookie–based ads, without a corresponding increase in perceived privacy protection. These results contribute to the ongoing assessment of Privacy Sandbox as an alternative to third-party cookies.
8.2.4 Is Contextual Advertising Safe? Analyzing Systemic Risks with Ads on YouTube
Contextual advertising is seeing a resurgence in popularity as a privacy-preserving alternative to behavioral targeting. While often regarded as a coarse-grained approach, advances in AI-driven content analysis have transformed it into a highly granular form of targeting. This work examines the safety risks of contextual targeting through a two-part empirical study, analyzing its potential to enable targeting of audiences with sensitive attributes and to expose users to harmful or exploitative ads. In controlled ad experiments, we show that advertisers can target audiences defined by sensitive attributes (e.g., religious belief, mental health condition, and political ideology) by strategically selecting contextual placements, thereby circumventing policies that prohibit such targeting through behavioral signals. To understand how this risk manifests in practice, we develop an automated measurement framework to collect contextual ads delivered in high-risk content environments, focusing on conspiracy videos. We find that contextual ads are highly prevalent in these environments, disproportionately deliver sensitive categories (e.g., alternative health, religion, and politics), and lack transparency. We argue that contextual ad systems require deeper empirical scrutiny and robust transparency mechanisms to prevent exploitation and abuse, and that regulators should extend behavioral advertising risk principles to the contextual domain.
8.2.5 A Framework for Auditing Ad Delivery Responsiveness to Psychological Traits
Online advertising platforms increasingly personalize ad delivery using users' behavioral signals, even when advertisers cannot explicitly target many underlying user characteristics. Auditing delivery skews for traits that are latent, complex, or not directly targetable through advertiser-facing tools remains challenging. We propose an experimental framework for auditing ad delivery across latent traits by constructing trait-defined audiences and examining how delivery systems allocate ads to these audiences under controlled competitive conditions. We demonstrate this framework on Meta's advertising platform using extraversion as a case study. We construct trait-based audiences using two approaches: psychometric assessment combined with tracking-based retargeting, and behavioral profiling based on on-platform engagement. Under controlled delivery conditions, we examine how the platform allocates personality-aligned and misaligned ads across these audiences. We find a statistically significant alignment effect in ad delivery: ads are more likely to be delivered when their framing matches the personality of the target audience. This effect is strongest in behaviorally profiled segments, where misaligned ads also exhibit reduced reach relative to aligned ads. Our framework provides a general approach for auditing ad delivery behavior and personalization dynamics driven by latent user traits.
8.2.6 How Persuasive Are LLMs in the Wild? Assessing Personalized Ads in Real-World Delivery
Large language models (LLMs) have demonstrated persuasive potential in controlled experiments and survey-based studies across commercial, political, and social domains. However, their effectiveness in real-world communication environments remains largely unexplored. This work addresses this gap by evaluating LLM-generated personalized messages deployed in controlled advertising experiments on Meta platforms. We assess effectiveness along three complementary dimensions: (1) behavioral user engagement measured through field experiments, (2) perceived appeal captured via user surveys, and (3) platform-level dynamics analyzed through algorithmic ad delivery patterns. Our results show that LLM-based personalized messages do not significantly improve user engagement compared to non-personalized messages. We also show that user perceptions, as measured through surveys, can diverge significantly from observed behavioral outcomes online. This highlights the limitations of relying on survey-based evaluations alone to assess the persuasive capabilities of LLMs. Finally, we show that LLM-generated personalization can influence platform ad delivery, shifting impressions toward the intended audience by up to 8% even without explicit targeting instructions. These effects are often constrained by the platform's relevance predictions, which may override the cues embedded in the message. Together, these findings provide a comprehensive real-world audit of the effectiveness and limits of LLM-based persuasion in the wild. This work has been accepted for publication in AAAI ICWSM 2026.
8.3 Bias and issues in LLMs and Benchmarks
Participants: Oana-Denisa Balalau, Tom Calamai, Chadi Helwe.
8.3.1 Navigating the Political Compass: Evaluating Multilingual LLMs across Languages and Nationalities
Large Language Models (LLMs) have become ubiquitous in today's technological landscape, boasting a plethora of applications, and even endangering human jobs in complex and creative fields. One such field is journalism: LLMs are being used for summarization, generation and even fact-checking. However, in today's political landscape, LLMs could accentuate tensions if they exhibit political bias. In this work, we evaluate the political bias of the 15 most used multilingual LLMs via the Political Compass Test. We test different scenarios, where we vary the language of the prompt, while also assigning a nationality to the model. We evaluate models on the 50 most populous countries and their official languages. Our results indicate that language has a strong influence on the political ideology displayed by a model. In addition, smaller models tend to display a more stable political ideology, i.e., an ideology that is less affected by variations in the prompt. The work has been published in ACL (Findings) 2025 21 and the tool is available on BIL 7.1.12.
8.3.2 Benchmarking the Benchmarks: Reproducing Climate-Related NLP Tasks
Significant efforts have been made in the NLP community to facilitate the automatic analysis of climate-related corpora through tasks such as climate-related topic detection, climate risk classification, question answering over climate topics, and many more. In this work, we perform a reproducibility study on 8 tasks and 29 datasets, testing 6 models. We find that many tasks rely heavily on surface-level keyword patterns rather than deeper semantic or contextual understanding. Moreover, we find that 96% of the datasets contain annotation issues, with 16.6% of the sampled wrong predictions of a zero-shot classifier being actually clear annotation mistakes, and 38.8% being ambiguous examples. These results call into question the reliability of current benchmarks to meaningfully compare models and highlight the need for improved annotation practices. We conclude by outlining actionable recommendations to enhance dataset quality and evaluation robustness. The work has been published in ACL (Findings) 2025 18 and the tool is available on BIL 7.1.11.
8.4 Efficient Big Data analytics
8.4.1 Graph Transformers for Query Plan Representation: Potentials and Challenges
Participants: Yanlei Diao, Guillaume Lachaud, Gabriel Lozano Pinzon, Chenghao Lyu.
Query Plan Representation (QPR) is central to workload modeling, with various deep-learning based architectures proposed in the literature. Our work is motivated by two key observations: (i) the research community still lacks clarity on which model, if any, best suits the QPR problem; and (ii) while transformers have revolutionized many fields, their potential for QPR remains largely underexplored. This study examines the strengths and challenges of Graph Transformers for QPR. We introduce a new taxonomy that unifies deep-learning based QPR techniques along key design axes. Our benchmark analysis of common QPR architectures reveals that Graph Transformer Networks (GTNs) consistently outperform alternatives, but can degrade under limited training data. To address this, we propose novel data augmentation techniques to enhance training diversity and refine GTN architectures by replacing ineffective language-model-inspired components with techniques better suited for query plans. Evaluation on JOB, TPC-H, and TPC-DS benchmarks shows that with sufficient training data, enhanced GTNs outperform existing models for capturing complex queries (JOB Full and TPC-DS) and enable the query embedder trained on TPC-DS to generalize to TPC-H queries out of the box. The work has been accepted in VLDB 2026.
8.4.2 Unsupervised Anomaly Detection in Multivariate Time Series across Heterogeneous Domains
Participants: Yanlei Diao, Vincent Jacob.
The widespread adoption of digital services, along with the scale and complexity at which they operate, has made incidents in IT operations increasingly more likely, diverse, and impactful. This has led to the rapid development of a central aspect of "Artificial Intelligence for IT Operations" (AIOps), focusing on detecting anomalies in vast amounts of multivariate time series data generated by service entities. In this paper, we begin by introducing a unifying framework for benchmarking unsupervised anomaly detection (AD) methods, and highlight the problem of shifts in normal behaviors that can occur in practical AIOps scenarios. To tackle anomaly detection under domain shift, we then cast the problem in the framework of domain generalization and propose a novel approach, Domain-Invariant VAE for Anomaly Detection (DIVAD), to learn domain-invariant representations for unsupervised anomaly detection. Our evaluation results using the Exathlon benchmark show that the two main DIVAD variants significantly outperform the best unsupervised AD method in maximum performance, with 20% and 15% improvements in maximum peak F1-scores, respectively. Evaluation using the Application Server Dataset further demonstrates the broader applicability of our domain generalization methods. The work has been published in VLDB 2025 22.
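For readers unfamiliar with the metric, the sketch below shows one common way to compute a peak F1-score for an unsupervised anomaly detector: sweep every candidate threshold over the anomaly scores and keep the best F1. This is a point-wise simplification written for illustration; the exact metric definition used in the paper's evaluation on Exathlon may differ (for instance, by scoring anomaly ranges rather than individual points).

```python
def peak_f1(scores, labels):
    """Best point-wise F1 over all thresholds; labels are 1 (anomaly) or 0 (normal)."""
    best = 0.0
    for threshold in sorted(set(scores)):
        # A point is flagged as anomalous when its score reaches the threshold.
        tp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 1)
        fp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 0)
        fn = sum(1 for s, y in zip(scores, labels) if s < threshold and y == 1)
        if tp == 0:
            continue  # F1 is zero (or undefined) without any true positive.
        precision = tp / (tp + fp)
        recall = tp / (tp + fn)
        best = max(best, 2 * precision * recall / (precision + recall))
    return best


# Toy example: four points, the last two are true anomalies.
value = peak_f1([0.1, 0.4, 0.35, 0.8], [0, 0, 1, 1])
```

Because the threshold is chosen after seeing the labels, peak F1 measures the best achievable separation between normal and anomalous points rather than the performance of a deployed threshold.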
8.4.3 Transactional Stateful Functions on Streaming Dataflows
Participants: Georgios Siachamis.
Developing stateful cloud applications, such as low-latency workflows and microservices with strict consistency requirements, remains arduous for programmers. The Stateful Functions-as-a-Service (SFaaS) paradigm aims to serve these use cases. However, existing approaches provide weak transactional guarantees or perform expensive external state accesses requiring inefficient transactional protocols that increase execution latency. In this paper, we present Styx, a novel dataflow-based SFaaS runtime that executes serializable transactions consisting of stateful functions that form arbitrary call-graphs with exactly-once guarantees. Styx extends a deterministic transactional protocol by contributing: i) a function acknowledgment scheme to determine transaction boundaries required in SFaaS workloads, ii) a function-execution caching mechanism, and iii) an early commit-reply mechanism that substantially reduces transaction execution latency. Experiments with the YCSB, TPC-C, and Deathstar benchmarks show that Styx outperforms state-of-the-art approaches by achieving at least one order of magnitude higher throughput while exhibiting near-linear scalability and low latency. This work has been published in SIGMOD 2025 24 and demonstrated in VLDB 2025 25.
8.4.4 Dynamic Graph Databases with Out-of-order Updates
Participants: Muhammad Khan, Ioana Manolescu.
Dynamic graphs are omnipresent in real-time applications that generate massive amounts of data. We consider dynamic graphs where edges are continuously added to and deleted from a single graph, from multiple update streams. The dynamic graphs are stored in a transactional graph database. Each edge update or deletion carries a source (stream) time, assigned at the moment when it was emitted, and an arrival (or transaction) time, assigned when the graph database receives it. Updates may be received at the database out-of-order (ooo, in short), due to different latencies on the propagation paths between the data source and the database. We proposed HAL, a novel in-memory dynamic graph database design, addressing these challenges. HAL outperforms comparable systems by a factor of up to 73 in terms of update processing throughput and up to 357 for analytics, while being the first to support out-of-order updates. We have also extended it with support for node and edge properties, and for historical queries, where queries are evaluated over the graph as it was at a specific moment in the past. This work has been accepted at VLDB 2025 12 and at the VLDB 2025 Large-Scale Graph Data Analytics (LSGDA) workshop 16, and demonstrated at SIGMOD 2025 17. The code is available on Gitlab.
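The distinction between source time and arrival time can be illustrated with a toy, in-memory structure that orders edge events by source time regardless of the order in which they arrive, and answers historical queries. This is only a sketch of the semantics; it is not HAL's design, and the class name, method names, and tie-breaking rule are assumptions made for the example.

```python
class OutOfOrderGraph:
    """Toy edge store in which the state at time t depends only on source times."""

    def __init__(self):
        self._events = {}  # edge -> list of (source_time, operation)

    def apply(self, edge, operation, source_time):
        """Record an 'add' or 'delete' event; calls may arrive in any order."""
        self._events.setdefault(edge, []).append((source_time, operation))

    def edges_at(self, t):
        """Edges present at source time t (historical query)."""
        present = set()
        for edge, events in self._events.items():
            past = [event for event in events if event[0] <= t]
            # The latest event by source time decides; on equal times,
            # 'delete' sorts after 'add' and therefore wins (arbitrary choice).
            if past and max(past)[1] == "add":
                present.add(edge)
        return present


graph = OutOfOrderGraph()
# The deletion (source time 5) arrives before the addition (source time 2).
graph.apply(("a", "b"), "delete", source_time=5)
graph.apply(("a", "b"), "add", source_time=2)
```

Here the edge is visible at source time 3 and gone at source time 6, even though the deletion reached the store first; a real system would additionally need efficient indexing and transactional guarantees.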
Participants: Ioana Manolescu, Oana Balalau, Yanlei Diao, Ghufran Khan, Maxime Buron, Hritika Kathuria, Georgios Siachamis.
9 Bilateral contracts and grants with industry
9.1 Bilateral contracts with industry
The collaborative contract with RadioFrance in which Oana-Denisa Balalau and Ioana Manolescu Goujot participate has ended. We have successfully transferred the StatCheck software to our RadioFrance partner.
The collaborative contract with Amundi led by Oana-Denisa Balalau for the CIFRE project has ended; the PhD student will defend his PhD in 2026.
9.2 Bilateral Grants with Industry
Ioana Manolescu Goujot is involved in the BPI-funded project CodeCommons, in collaboration with the Software Heritage Foundation (SWF). We work to generalize, enlarge, and enable the efficient processing of the world's largest repository of free software. The end of the PhD of Muhammad Khan contributed to the project.
Ioana Manolescu Goujot, Georgios Siachamis and Hritika Kathuria have been involved in the BPI-funded project DXP (Data Exchange Project), with Amadeus, the international tourism services operator. We participate in this project in collaboration with Maxime Buron, former team member, now an Assistant Professor at UCA. Our contribution here is to devise an architecture for decentralized, access-controlled data sharing, allowing tourism service providers and clients to exchange their information via Amadeus' platform.
Participants: Ioana Manolescu Goujot, Oana-Denisa Balalau, Oana Goga, Madhulika Mohanty, Garima Gaur, Yanlei Diao, Muhammad Khan, Maxime Buron, Hritika Kathuria, Georgios Siachamis.
10 Partnerships and cooperations
10.1 International initiatives
10.1.1 Associate Teams in the framework of an Inria International Lab or in the framework of an Inria International Program
MediumAI
- Title: Responsible AI for Journalism
- Duration: 2024 - 2026
- Coordinator: Davide Ceolin (Davide.Ceolin@cwi.nl)
- Partners: CWI Amsterdam (Netherlands)
- Inria contact: Oana-Denisa Balalau
- Summary: From recommender systems to large language models, data-driven AI tools have shown different forms of limitations and bias. Bias in AI tools may stem from multiple factors, including bias in the input data the AI tools are trained on, the algorithm and the individuals responsible for designing the AI tools, and bias in the evaluation and interpretation of AI tool outputs. Limitations are due to technical difficulties in achieving specific tasks. Media outlets use different algorithmic aids in their workflow: keyword extraction, entities and relations extractions, event extraction, sentiment analysis, automatic summarization, newsworthy story detection, semi-automatic production of news using text generation models, and search, among others. Given the importance of the media sector for our democracies, shortcomings in the tools they use could have severe consequences. Both Inria and CWI have partnerships with large media groups and can help them address bias and limitations in their AI workflows.
10.2 International research visitors
10.2.1 Visits of international scientists
Other international visits to the team
Benjamin Ocampo
- Status: PhD
- Institution of origin: Human-Centered Data Analytics team, University of Amsterdam
- Country: Netherlands
- Dates: October 13-17, 2025
- Context of the visit: Associated team MediumAI
- Mobility program/type of mobility: research stay
Davide Ceolin
- Status: researcher
- Institution of origin: Human-Centered Data Analytics team, CWI
- Country: Netherlands
- Dates: October 16-17, 2025
- Context of the visit: Associated team MediumAI
- Mobility program/type of mobility: research stay
Mae Sosto
- Status: post-doc
- Institution of origin: Human-Centered Data Analytics team, CWI
- Country: Netherlands
- Dates: November 27-December 03, 2025
- Context of the visit: Associated team MediumAI
- Mobility program/type of mobility: research stay
10.2.2 Visits to international teams
Research stays abroad
Tom Calamai
- Visited institution: CWI, Amsterdam
- Country: Netherlands
- Context of the visit: Associated team MediumAI
- Mobility program/type of mobility: research stay
10.3 European initiatives
10.3.1 Horizon Europe
ELIAS
ELIAS project on cordis.europa.eu
- Title: European Lighthouse of AI for Sustainability
- Duration: From September 1, 2023 to August 31, 2027
- Partners:
- ECOLE POLYTECHNIQUE (EP), France
- INSTITUT NATIONAL DE RECHERCHE EN INFORMATIQUE ET AUTOMATIQUE (INRIA), France
- ROBERT BOSCH KFT, Hungary
- BITDEFENDER SRL (Bitdefender), Romania
- ETHNIKO KENTRO EREVNAS KAI TECHNOLOGIKIS ANAPTYXIS (CENTRE FOR RESEARCH AND TECHNOLOGY HELLAS CERTH), Greece
- THE UNIVERSITY OF MANCHESTER (UNIVERSITY OF MANCHESTER), United Kingdom
- ROBERT BOSCH GMBH (BOSCH), Germany
- INSTITUT JOZEF STEFAN (JSI), Slovenia
- INSTITUT POLYTECHNIQUE DE PARIS, France
- UNIVERSITAT DE VALENCIA (UVEG), Spain
- PROMETEIA SOCIETA PER AZIONI (Prometeia), Italy
- IBM IRELAND LIMITED, Ireland
- KOBENHAVNS UNIVERSITET (UCPH), Denmark
- AALTO KORKEAKOULUSAATIO SR (AALTO), Finland
- IDEAS NCBR SP Z O.O., Poland
- UMEA UNIVERSITET, Sweden
- INSTITUT MINES-TELECOM, France
- FONDAZIONE ISTITUTO ITALIANO DI TECNOLOGIA (IIT), Italy
- FONDATION DE L'INSTITUT DE RECHERCHE IDIAP (IDIAP), Switzerland
- UNIVERSITATEA NATIONALA DE STIINTA SI TEHNOLOGIE POLITEHNICA BUCURESTI (NATIONAL UNIVERSITY OF SCIENCE AND TECHNOLOGY POLITEHNICA BUCHAREST), Romania
- EIDGENOESSISCHE TECHNISCHE HOCHSCHULE ZUERICH (ETH Zürich), Switzerland
- CESKE VYSOKE UCENI TECHNICKE V PRAZE (CVUT), Czechia
- FUNDACION DE LA COMUNITAT VALENCIANA UNIDAD ELLIS ALICANTE, Spain
- FONDAZIONE BRUNO KESSLER (FBK), Italy
- POLITECNICO DI MILANO (POLIMI), Italy
- LA COMMUNAUTE D UNIVERSITES ET ETABLISSEMENTS DE TOULOUSE (LA COMMUNAUTE D UNIVERSITES ET ETABLISSEMENTS DE TOULOUSE), France
- UNIVERSITA DEGLI STUDI DI TRENTO (UNITN), Italy
- UNIVERSITA DEGLI STUDI DI MILANO (UMIL), Italy
- HASSO-PLATTNER-INSTITUT FUR DIGITAL ENGINEERING GGMBH (HPI), Germany
- ENGINEERING - INGEGNERIA INFORMATICA SPA (ENG), Italy
- EBERHARD KARLS UNIVERSITAET TUEBINGEN (UT), Germany
- UNIVERSITA DEGLI STUDI DI GENOVA (UNIGE), Italy
- MAX-PLANCK-GESELLSCHAFT ZUR FORDERUNG DER WISSENSCHAFTEN EV (MPG), Germany
- UNIVERSITA DEGLI STUDI DI MODENA E REGGIO EMILIA (UNIMORE), Italy
- UNIVERSITEIT VAN AMSTERDAM (UvA), Netherlands
-
Inria contact:
Ioana Manolescu
- Coordinator:
-
Summary:
We live in a crucial historical moment, with tremendous challenges ahead, from climate change to the energy crisis. ELIAS emerges from the belief that AI will be a key discipline to help us tackle these challenges. At the same time, the development of AI entails deep ethical and societal concerns that need to be addressed. As for fundamental research, ELIAS will address key scientific questions about how AI can reduce computational costs, serve to model the effects of policy decisions on society, and impact individuals. ELIAS will strive for a deep integration of the fundamental research that takes place in academia and the more applications-focused research from industry.
ELIAS builds on and expands the highly successful and internationally recognized European Laboratory for Learning and Intelligent Systems (ELLIS). ELIAS will further develop the excellence criteria and the pillars in ELLIS and implement actions that will support AI researchers and young talents at different stages of their careers. Furthermore, ELIAS will develop a Sciencentrepreneurship track, with the purpose of attracting and empowering talents at the interface of scientific innovation and business and establish original AI solutions that move towards a sustainable long-term future for our planet, contribute to a cohesive society, and respect individual rights.
The outcome of ELIAS will be to establish Europe as a leader in AI research in which impact on the environment, society and the individual are integral considerations during development. We will measure the success of this endeavor in terms of key indicators, including the number of new cross-institutional collaborations, the number of cross-disciplinary collaborations, the number of industry-academic partnerships, publications in top conferences and journals, patents, and the number of projects that have resulted in deployed technologies.
10.3.2 H2020 projects
Ioana Manolescu Goujot is the local PI for the Inria partner in the project "ELIAS - European Lighthouse of AI for Sustainability" (2,800,000 euros). Madhulika Mohanty and Garima Gaur have also been strongly involved.
Yanlei Diao has been awarded an ERC Proof of Concept grant for ExplainableAD: Explainable Anomaly Detection for Safeguarding and Enhancing Modern Data Industry.
10.4 National initiatives
10.4.1 ANR
- Oana Goga is the local PI for the LIX partner in the ANR PRC 2022 - 2026 project “FeedingBias: A multi-platform mixed-methods approach to news exposure on social media” (our part: 128,000 euros).
- Oana Goga is the local PI for the LIX partner in the ANR PRCE 2021 - 2025 project “PROPEOS: Privacy-oriented Personalization of Online Services” (our part: 202,720 euros).
- The project "TopOL (Top of the Lake): discovery and exploitation of heterogeneous data lakes through graph models", coordinated by Ioana Manolescu Goujot, has been funded by the ANR. The project is a collaboration with U. Paris Saclay, U. Paris Dauphine, U. Blois and U. Tours; the International Consortium of Investigative Journalists (ICIJ) is a non-funded partner. Madhulika Mohanty also participates and is a Work Package co-leader.
10.5 Regional initiatives
Ioana Manolescu Goujot has been awarded a Fellowship of the Hi!Paris AI Cluster "PREDIAL: AI Data Dialogs for the Press".
Yanlei Diao has been awarded an AAP Premat IP Paris 2025.
11 Dissemination
11.1 Promoting scientific activities
Chair of conference program committees
Ioana Manolescu Goujot was the Demonstration chair at EDBT 2025.
Madhulika Mohanty was the demonstration chair at the French database conference, BDA 2025.
Member of the conference program committees
The team members have been part of the following program committees:
- Ioana Manolescu Goujot: ACL Rolling Review 2025, IEEE ICDE 2025, ACM PACMMOD (formerly SIGMOD) 2025, BDA 2025
- Oana-Denisa Balalau: ACL Rolling Review February 2025
- Madhulika Mohanty: VLDB 2025, ICDE 2025, EDBT 2025 (Demo), ICDE 2025 (Demo), VLDB 2025 (Demo), CODS 2025, CMLS Workshop in ER 2025
- Garima Gaur: CIKM 2025, BDA 2025, CMLS Workshop in ER 2025
- Georgios Siachamis: ICDE 2025, EDBT 2025 (Demo), DEBS 2025
11.1.1 Journal
Member of the editorial boards
Ioana Manolescu Goujot served as an Associate Editor for PVLDB 2025.
Reviewer - reviewing activities
Madhulika Mohanty reviewed for Transactions on Graph Data and Knowledge (TGDK) and Georgios Siachamis reviewed for the VLDB Journal (VLDBJ).
11.1.2 Invited talks
Ioana Manolescu Goujot delivered a keynote at the AFIA (French AI Research Association) workshop “Perspectives et Défis de l'IA” on « Désinformation, Démocratie et IA », June 10, 2025.
Oana-Denisa Balalau delivered a talk at ESSEC in the workshop Comprendre et Changer le Monde (CCM), titled “Improving the quality of public debate with AI”.
Madhulika Mohanty delivered the following talks:
- “Intelligence Artificielle: un outil au service de l'investigation” at VIGINUM in May 2025.
- “Effective Exploration of Graph-Structured Data” at LHC and IDIA Days 2025 in June 2025.
Tom Calamai delivered a workshop on “les applications de l'IA pour l'investissement responsable” organised by the FIR (Forum pour l'Investissement Responsable).
11.1.3 Leadership within the scientific community
Ioana Manolescu Goujot has been the president of the informal French Data Management Association (BDA).
11.1.4 Research administration
Ioana Manolescu Goujot represents Inria in the Comité Opérationnel of Hi!Paris, an AI Pole of Excellence comprising IP Paris and HEC. She is also an elected member of IP Paris' Comité Académique and serves on its Scientific Committee.
11.2 Teaching - Supervision - Juries - Educational and pedagogical outreach
11.2.1 Teaching
Ioana Manolescu Goujot is a part-time professor (50%) at Ecole Polytechnique. She taught:
- Courses, labs and TDs in CSC_51053_EP (Database Management Systems);
- She is in charge of the M1 Internship program in Artificial Intelligence and Data Science (CSC_52992_EP).
- She is also in charge of the Artificial Intelligence M1 program at Ecole Polytechnique
Madhulika Mohanty has a 25% Chargée d'Enseignement contract at Ecole Polytechnique for 10 months. She taught:
- Labs and TDs in CSC_51053_EP (Database Management Systems)
- Labs and TDs in CSC_52083_EP (Systems for Big Data)
- She also taught 3h of CM and 3h of TP for ECE_5DA04_TP (Big Graph Data Management) at Télécom for DATAAI Masters.
Oana-Denisa Balalau is a part-time (33%) assistant professor at Ecole Polytechnique, where she teaches “Mining, learning and reasoning on Web Graphs”, L3
Przemyslaw Dominikowski carries out a complementary teaching assignment (64h) at Ecole Polytechnique. He teaches the labs in CSC_2F001_EP (Object Oriented Programming in C++).
Garima Gaur carried out the following teaching duties:
- Course, Labs and TDs in CSC_52640_EP (Database Management Systems) offered by DMAP, Ecole Polytechnique
- Labs and TDs in CSC_51053_EP (Database Management Systems)
- 3h of CM and 3h of TP for ECE_5D04_TP (Big Graph Data Management) at Télécom for DATAAI Masters.
Hritika Kathuria carries out a complementary teaching assignment (64h) at Ecole Polytechnique and teaches 2 Labs in CSE_102.
Tom Calamai has a 30h teaching assistant (Vacataire) contract at Télécom Paris and Ecole Polytechnique. He teaches:
- INF473G
- Machine Learning for Text Mining
- Machine learning avancé
- Database
- Language Modeling
Georgios Siachamis carried out 3h of CM and 3h of TP for ECE_5D04_TP (Big Graph Data Management) at Télécom for DATAAI Masters.
Yanlei Diao holds a part-time (50%) full Professor position at Ecole Polytechnique. She teaches Systems for Big Data (CSC_52083_EP), M1, Ecole Polytechnique.
Guillaume Lachaud has a 58h teaching assistant position at Ecole Polytechnique. He teaches:
- CSC_52087_EP - Advanced Deep Learning
- CSC_41011_EP - Les bases de la programmation et de l'algorithmique
- CSC_43M02_EP (for one day) - Modal d'informatique - Exploration et apprentissage sur les graphes du Web
11.2.2 Supervision
The team supervised the following PhDs:
- Przemysław Dominikowski, Sep 2025 - Dec 2025, advised by Ioana Manolescu Goujot and Madhulika Mohanty
- Kun Zhang, Jan 2025-April 2025, advised by Ioana Manolescu Goujot and Oana-Denisa Balalau
- Tom Calamai, Jan 2025-Dec 2025, advised by Fabian Suchanek and Oana-Denisa Balalau
- Hritika Kathuria, Jan 2025-Dec 2025, advised by Ioana Manolescu Goujot and Maxime Buron
- Ines Abdelaziz, Dec 2025, advised by Oana Goga
- Nardjes Amieur, Jan 2025-Dec 2025, advised by Oana Goga
- Abir Benzaamia, Jan 2025-Dec 2025, advised by Oana Goga
- Asmaa El Fraihi, Jan 2025-Dec 2025, advised by Oana Goga
- Gabriel Ben Zenou, Jan 2025-Dec 2025, advised by Oana Goga
- Gabriel Lozano, Sept 2025-Dec 2025, advised by Yanlei Diao and Guillaume Lachaud
- Nazim Mezhoudi, Jan 2025-Dec 2025, advised by Yanlei Diao and Mariam Barry (BNP Paribas)
The team supervised the following postdocs:
- Chadi Helwe, Jan 2025-March 2025, advised by Oana-Denisa Balalau and Davide Ceolin
- Guillaume Lachaud, Jan 2025-Dec 2025, advised by Yanlei Diao
The team supervised the following engineers:
- Simon Ebel and Théo Galizzi (January to June 2025), Aurélien Peden (March to August 2025): Oana-Denisa Balalau and Ioana Manolescu Goujot supervised them on their collaboration project with RadioFrance.
- Georgios Siachamis: supervised by Ioana Manolescu Goujot and Madhulika Mohanty on efficient and expressive graph data management.
- Ines Abdelaziz (January to November 2025): supervised by Oana Goga.
The team supervised the following interns:
- Pablo Bertaud-Velten, M1 IP Paris, advised by Ioana Manolescu Goujot, Madhulika Mohanty, Garima Gaur and Georgios Siachamis
- Przemyslaw Dominikowski, M2 UP Saclay, advised by Ioana Manolescu Goujot, Madhulika Mohanty, Garima Gaur and Georgios Siachamis
- Nikola Dobriçic, X Bachelor 3rd year, advised by Ioana Manolescu Goujot, Madhulika Mohanty and Georgios Siachamis
- Joanne Jegou, X Bachelor 3rd year, co-advised by Ioana Manolescu Goujot and Michael Thy (APHP)
- Paul Kronlund-Drouault, ENS Lyon Bachelor 2nd year, advised by Ioana Manolescu Goujot
- Maria Mellado, M2 University of Chile, advised by Ioana Manolescu Goujot, Madhulika Mohanty and Garima Gaur
- Saba Shashsavari, M1 IP Paris, advised by Ioana Manolescu Goujot, Madhulika Mohanty, Garima Gaur and Georgios Siachamis
- Vlada Voronina, M1, advised by Oana-Denisa Balalau and Marine Le Morvan
- Rémi Guillou, X Bachelor 3rd Year, advised by Yanlei Diao
- Yanis Zaamoun, X Bachelor 3rd year, advised by Yanlei Diao
The team supervised the following part-time projects:
- PSC "Analyse du discours médiatique autour du changement climatique", advised by Oana-Denisa Balalau and Etienne Ollion
- Léo Nivelle (X3A), "Automatic verbalisation of statistics", advised by Ioana Manolescu Goujot
- Yiheng Chen, Antoine Delacour and Elliot Thorel (X3A): "Natural language querying of large heterogeneous datasets", advised by Ioana Manolescu Goujot, Madhulika Mohanty, Garima Gaur and Georgios Siachamis
- Cédric Trinh and Tom Léon (X3A): "Building a Knowledge Graph for Fact-checks", advised by Madhulika Mohanty and Garima Gaur
- Moritz Sommer (X and RWTH Exchange Program): "Identification of Core Properties for Semantic Concepts in Universal Datasets", advised by Ioana Manolescu Goujot, Madhulika Mohanty and Garima Gaur
- Maximilien Rambaud, Nicolas Gromitsaris, Anthony Chassagne (X3A): "Anomaly detection and explanation in dynamic graphs, with applications in finance", advised by Yanlei Diao and Guillaume Lachaud
- Gabriel Cheval, Armand Vabre (X3A): "Detecting data drift in graphs for model retraining", advised by Yanlei Diao and Guillaume Lachaud
- Loric Roger, Joseph de Roffignac, Sylvain Dehayem (X3A): "Anomaly detection in dynamic graphs", advised by Yanlei Diao and Guillaume Lachaud
- Berthé Zié, Goly Kodia (X3A): "Explainable dynamic graph neural networks for anomaly detection", advised by Yanlei Diao and Guillaume Lachaud
11.2.3 Juries
Oana-Denisa Balalau has served as a:
- member of the recruitment committee for an assistant professor position at Télécom Paris
- member of the PhD defense committees of Jonathan Colin (Université Paris Saclay) and William Soto (Université de Lorraine)
Ioana Manolescu Goujot has served on the following juries:
- Member of a Professor hiring committee at Université de Paris Dauphine (June 2025)
- Reported on the PhD thesis of Yifan Wang, Université de Lille, defended in November 2025
11.3 Popularization
11.3.1 Specific official responsibilities in science outreach structures
Oana-Denisa Balalau is a member of Inria Saclay's Scientific Commission. She also led the foresight seminar on LLMs&Science at the "Data and Knowledge" Inria seminar in March 2025.
Ioana Manolescu Goujot is an elected member of Inria's Comité d'Evaluation.
11.3.2 Participation in Live events
Ioana Manolescu Goujot made several appearances in national media:
- Participated in ARTE's "28 minutes" show on the impact of AI on society, December 24, 2025.
- Interviewed by Michaël Szadkowsky (Le Monde) for the article "2025, l'année où la vidéo par IA a envahi les réseaux sociaux", December 22, 2025.
- Interviewed by Désirée de Lamarzelle (Forbes Magazine) for the article "Future of work: is AI a friend or a foe?", November 13, 2025.
- Interviewed by Mélinée Le Priol (La Croix) for the article "Faut-il avoir peur de la 'superintelligence artificielle'?", October 30, 2025.
- Interviewed by Marina Alcaraz (Les Echos) on the frequency of fake news in chatbot responses, September 2025.
- Interviewed by Marina Alcaraz (Les Echos) on disinformation sometimes present in Mistral outputs, July 2025.
- Interviewed by Alexandre Capron (TF1) on fake AI videos, June 6, 2025.
- Guest in the radio show "Je pense donc j'agis": "Où vont nos données et comment les protéger?", hosted by Melchior Gormand, on RCF, April 3, 2025.
- Participated in a press conference organized as part of a "Stand Up for Science" day on April 3, 2025 (AEF dispatch, video recording).
- Member of a panel about ethical and regulatory bounds on research in "Journée Sciences et Médias" (French Association of Science Journalists), February 10, 2025.
- Interviewed by Chloé Woitier for the article "C'est une nouvelle pollution numérique : le Slop, ce raz-de-marée de contenus IA qui menace internet", Le Figaro, February 2, 2025.
- Authored an invited opinion piece in l'Humanité "Les réseaux sociaux nuisent-ils à la démocratie?" on January 27, 2025.
11.3.3 Other relevant science outreach activities
Przemyslaw Dominikowski conducted an outreach session (1.5h) for high school students (stage de seconde), presenting the CEDAR team's research, in particular data lake indexing.
Ioana Manolescu Goujot gave a presentation for CPES (1st year higher education) students at Lycée International de Palaiseau Paris-Saclay.
12 Scientific production
12.1 Major publications
- 1 Towards Scalable Hybrid Stores: Constraint-Based Rewriting to the Rescue. SIGMOD 2019 - ACM SIGMOD International Conference on Management of Data, Amsterdam, Netherlands, June 2019.
- 2 Fact-checking Multidimensional Statistic Claims in French. TTO 2022 - Truth and Trust Online, Boston [Hybrid Event], United States, October 2022.
- 3 From the Stage to the Audience: Propaganda on Reddit. EACL 2021 - 16th Conference of the European Chapter of the Association for Computational Linguistics, Online, France, April 2021.
- 4 Reformulation-based query answering for RDF graphs with RDFS ontologies. ESWC 2019 - European Semantic Web Conference, Portoroz, Slovenia, March 2019.
- 5 Teaching an RDBMS about ontological constraints. Very Large Data Bases, New Delhi, India, September 2016.
- 6 A Content Management Perspective on Fact-Checking. The Web Conference 2018 - alternate paper tracks "Journalism, Misinformation and Fact Checking", Lyon, France, April 2018, 565-574.
- 7 Summarizing Semantic Graphs: A Survey. The VLDB Journal, 2018.
- 8 Spade: A Modular Framework for Analytical Exploration of RDF Graphs. VLDB 2019 - 45th International Conference on Very Large Data Bases, Proceedings of the VLDB Endowment, Vol. 12, No. 12, Los Angeles, United States, August 2019.
- 9 Optimization for active learning-based interactive database exploration. Proceedings of the VLDB Endowment (PVLDB) 12(1), September 2018, 71-84.
- 10 Massively Parallel Processing of Whole Genome Sequence Data: An In-Depth Performance Study. SIGMOD '17: Proceedings of the 2017 ACM International Conference on Management of Data, Chicago, Illinois, United States, ACM, May 2017, 187-202.
- 11 Breaking Down the Invisible Wall of Informal Fallacies in Online Discussions. ACL-IJCNLP 2021 - Joint Conference of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing, Online, France, August 2021.
12.2 Publications of the year
International journals
International peer-reviewed conferences
Conferences without proceedings
Reports & preprints
Other scientific publications
Scientific popularization
12.3 Cited publications
- 36 Integrating Connection Search in Graph Queries. ICDE 2023 - 39th IEEE International Conference on Data Engineering, Anaheim (CA), United States, April 2023.