CEDAR

CEDAR - 2025

2025 Activity report, Project-Team CEDAR

RNSR: 201622056J

Research center Inria Saclay Centre at‌ Institut Polytechnique de Paris
In partnership with: Institut Polytechnique de Paris, CNRS
Team name: Rich Data Exploration at Cloud Scale
In collaboration with: Laboratoire d'informatique de l'école polytechnique (LIX)

Creation of the‌ Project-Team: 2018 April 01


Keywords

Computer Science and Digital Science

A3.3.‌ Data and knowledge analysis
A9.1. Knowledge
A9.2. Machine‌ learning
A9.2.1. Supervised learning
A9.2.2. Unsupervised learning
A9.2.3.‌ Reinforcement learning
A9.2.6. Neural networks
A9.2.8. Deep learning‌
A9.4. Natural language processing
A9.13. Agentic AI
A9.15.‌ Symbolic AI
A9.16. Societal impact of AI
A9.17.‌ Cybersecurity and AI

1 Team members, visitors,‌ external collaborators

Research Scientists

Ioana Manolescu Goujot [‌Team leader, INRIA, Senior Researcher,‌ HDR]
Oana-Denisa Balalau [INRIA, ISFP‌]
Oana Goga [Inria, Senior Researcher‌, HDR]
Madhulika Mohanty [INRIA,‌ Researcher]

Faculty Member

Yanlei Diao [ECOLE‌ POLY PALAISEAU, Professor]

Post-Doctoral Fellows

Garima‌ Gaur [INRIA, Post-Doctoral Fellow]
Chadi‌ Helwe [INRIA, Post-Doctoral Fellow, until‌ Apr 2025]
Guillaume Lachaud [ECOLE POLY‌ PALAISEAU, Post-Doctoral Fellow]
Kun Zhang [‌ECOLE POLY PALAISEAU, from Apr 2025 until‌ May 2025]

PhD Students

Ines Abdelaziz [INRIA, from Dec‌ 2025]
Nardjes Amieur‌ [CNRS]
Gabriel‌‌ Ben Zenou [Ministère Armées]
Abir Benzaamia‌ [CNRS]
Theo‌ Bouganim [INRIA,‌‌ until Mar 2025]
Tom Calamai [INRIA‌ & Amundi, CIFRE‌]
Salim Chouaki [‌‌CNRS]
Przemyslaw Dominikowski [ECOLE POLY PALAISEAU‌, from Sep 2025‌]
Asmaa El Fraihi‌‌ [CNRS]
Vincent Jacob [ECOLE POLY‌ PALAISEAU, until Mar‌ 2025]
Hritika Kathuria‌‌ [INRIA]
Muhammad Khan [INRIA,‌ until Sep 2025]‌
Gabriel Lozano Pinzon [‌‌ECOLE POLY PALAISEAU, from Sep 2025]‌
Mohamed Mezhoudi [BNP‌ PARIBAS , CIFRE]‌‌
Kun Zhang [INRIA, until Mar 2025‌]

Technical Staff

Ines‌ Abdelaziz [INRIA,‌‌ Engineer, until Nov 2025]
Simon Ebel‌ [INRIA, Engineer‌, until Jun 2025‌‌]
Theo Galizzi [INRIA, Engineer,‌ until Jun 2025]‌
Ismail Hatim [ECOLE‌‌ POLYTECHNIQUE, Engineer, from Nov 2025]‌
Aurelien Peden [INRIA‌, Engineer, from‌‌ Mar 2025 until Oct 2025]
Georgios Siachamis‌ [INRIA, Engineer‌]

Interns and Apprentices‌‌

Pablo Bertaud-Velten [INRIA, Intern, from‌ Mar 2025 until Jul‌ 2025]
Nikola Dobricic‌‌ [INRIA, Intern, until Mar 2025‌]
Przemyslaw Dominikowski [‌INRIA, Intern,‌‌ from Mar 2025 until Aug 2025]
Paul‌ Kronlund-Drouault [INRIA,‌ Intern, from Jun‌‌ 2025 until Aug 2025]
Gabriel Lozano Pinzon‌ [ECOLE POLY PALAISEAU‌, Intern, from‌‌ Mar 2025 until Aug 2025]
Maria-Justina-Adriana Mateescu‌ [INRIA, Intern‌, from Jul 2025‌‌ until Jul 2025]
Maria Jesus Mellado Tenorio‌ [INRIA, Intern‌, from Mar 2025‌‌ until May 2025]
Saba Shahsavari [INRIA‌, Intern, from‌ Apr 2025 until Aug‌‌ 2025]
Yanis Zaamoun [ECOLE POLY PALAISEAU‌, Intern, until‌ Mar 2025]

Administrative‌‌ Assistant

Michael Barbosa [INRIA]

External Collaborators‌

Alexandre Barlot [Radio‌ France]
Nelly Barret‌‌ [ECOLE POLYT. MILAN, until Apr 2025‌]
Antoine Deiana [‌Radio France, until‌‌ May 2025]
Helena Galhardas [Instituto Superior‌ Técnico, University of Lisbon‌]
Emilie Gautreau [‌‌Radio France, until Apr 2025]
Remi‌ Guillou [ECOLE POLY‌ PALAISEAU, from Jun‌‌ 2025 until Aug 2025]
Samuel Guimaraes [‌CNRS, until Mar‌ 2025]
Paul Kronlund-Drouault‌‌ [ENS DE LYON, from Sep 2025‌]
Chenghao Lyu [‌Univ Massachusetts Amherst,‌‌ from Sep 2025]
Adrien Maumy [Radio‌ France, until Apr‌ 2025]
Tobias Moller‌‌ [TELECOM PARIS, from Jul 2025 until‌ Nov 2025]
Thomas‌ Pontillon [Radio France‌‌, until Apr 2025]
Gerald Roux [‌Radio France, until‌ Apr 2025]
Prajna‌‌ Devi Upadhyay [BITS PILANI HYDERABAD CAMPUS]‌
Joanna Yakin [Radio‌ France, until Apr‌‌ 2025]

2 Overall‌ objectives

Our research aims at models, algorithms and‌ tools for highly efficient, easy-to-use data and knowledge‌ management; throughout our research, performance at scale‌ is a core concern, which we address, among‌ other techniques, by designing algorithms for a cloud‌ (massively parallel) setting. In addition, we explore and‌ mine rich data via machine learning techniques. Our‌ scientific contributions fall into four interconnected areas:

Optimization‌ and performance at scale.
We devise efficient and effective optimization techniques that make the processing of data at very large scale as efficient as possible. These efforts span relational, graph, and text-rich data, in centralized as well as distributed architectures.
Data discovery and‌ exploration.
Today's Big Data is complex; understanding and‌ exploiting it is daunting, especially to novice users‌ such as journalists or domain scientists. We work‌ to devise techniques for allowing users to explore‌ graph data, large, heterogeneous data lakes, as well‌ as more subtle signals hidden in the data,‌ such as anomalies in time series and in‌ dynamic graphs.
Natural language understanding for analyzing and‌ supporting digital arenas.
In this area, we are‌ interested in applications with high social value, such‌ as analysing public discourse with the goal of‌ finding elements that could bias the world view‌ of citizens, such as false claims, fallacious arguments,‌ propaganda, or greenwashing.
Safeguarding information systems.
Recent events have brought to light how easily current online systems can be used to propagate information (that is sometimes false), and that we are facing an information war. We create knowledge and technology in this area to make the online information space safer.

3 Research program

3.1 Multi-model querying

As the world's affairs become increasingly digital, a large and varied set of data sources becomes available: they are either structured databases, such as government-gathered data (demographics, economics, taxes, elections), legal records, or stock quotes for specific companies, or unstructured or semi-structured sources, including in particular graph data, sometimes endowed with semantics (see e.g., the Linked Open Data cloud). Modern data management applications, such as data journalism, are eager to combine in innovative ways both static and dynamic information coming from structured, semi-structured, and unstructured databases and social feeds. However, current content management tools are not suited for this task, in particular when they require a lengthy, rigid cycle of data integration and consolidation in a warehouse. Thus, we need flexible tools allowing us to interconnect various kinds of data sources and query them together.

3.2‌ New methods for exploring and querying data graphs‌

Semantic graphs, including data and knowledge, are hard for users to apprehend due to the complexity of their structure and, often, to their large volumes. To help tame this complexity, we seek new methods for exploring highly heterogeneous data graphs resulting from integrating structured, semi-structured, and unstructured (text) data. In this context, we study methods for automatically identifying, in a large corpus of data sources, interesting data paths that connect Named Entities (NE) to each other. Further, in some application contexts where RDF data graphs are collaboratively used, it is essential that access control methods be in place to guard access to the data. Query answers then need to be computed by taking into account access control restrictions, as well as ontologies that describe the data semantics.

3.3 Navigating the continuum between‌ text and (semi) structured‌ data

In data journalism‌‌ and fact-checking applications, useful information comes both in‌ structured records and in‌ natural language text,

3.4 A unified framework for optimizing data analytics

Data analytics in the cloud has become an integral part of enterprise businesses. Big data analytics systems, however, still lack the ability to take user performance goals and budgetary constraints for a task, collectively referred to as task objectives, and automatically configure an analytic job to achieve these objectives. Our goal is to develop a data analytics optimizer that can automatically determine a cluster configuration with a suitable number of cores and other runtime system parameters that best meet the task objectives. To achieve this, we also need to design a multi-objective optimizer that constructs a Pareto optimal set of job configurations for task-specific objectives and recommends new job configurations to best meet these objectives.
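To make the notion of a Pareto optimal set of job configurations concrete, here is a minimal sketch, not the team's optimizer. It assumes only two objectives (predicted latency and cost, both to be minimized); the `JobConfig` type, its fields, the candidate values and the `recommend` policy are all hypothetical and chosen for illustration.

```python
from typing import NamedTuple, Optional


class JobConfig(NamedTuple):
    """A hypothetical job configuration with its predicted objectives."""
    cores: int
    memory_gb: int
    latency_s: float  # predicted running time (lower is better)
    cost_usd: float   # predicted monetary cost (lower is better)


def dominates(a: JobConfig, b: JobConfig) -> bool:
    """True if a is no worse than b on both objectives and strictly better on one."""
    no_worse = a.latency_s <= b.latency_s and a.cost_usd <= b.cost_usd
    strictly_better = a.latency_s < b.latency_s or a.cost_usd < b.cost_usd
    return no_worse and strictly_better


def pareto_front(configs):
    """Keep only the configurations that no other configuration dominates."""
    return [c for c in configs if not any(dominates(o, c) for o in configs)]


def recommend(front, max_cost_usd: float) -> Optional[JobConfig]:
    """Pick the fastest Pareto-optimal configuration that fits the budget."""
    affordable = [c for c in front if c.cost_usd <= max_cost_usd]
    return min(affordable, key=lambda c: c.latency_s, default=None)


# Made-up candidate configurations with made-up predictions.
candidates = [
    JobConfig(cores=4, memory_gb=16, latency_s=120.0, cost_usd=1.0),
    JobConfig(cores=8, memory_gb=32, latency_s=70.0, cost_usd=1.8),
    JobConfig(cores=16, memory_gb=64, latency_s=45.0, cost_usd=3.5),
    JobConfig(cores=8, memory_gb=64, latency_s=75.0, cost_usd=2.4),  # dominated
]
front = pareto_front(candidates)
best = recommend(front, max_cost_usd=2.0)
```

The quadratic dominance check is only suitable for small candidate sets; a real optimizer would also need predictive models to obtain the latency and cost estimates in the first place.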

3.5 Elastic‌ resource management for virtualized‌‌ database engines

Database engines are migrating to the‌ cloud to leverage the‌ opportunities for efficient resource‌‌ management by adapting to the variations and heterogeneity‌ of the workloads. Resource‌ management in a virtualized‌‌ setting, like the cloud, must be enforced in‌ a performance-efficient manner to‌ avoid introducing overheads to‌‌ the execution. We design elastic systems that change‌ their configuration at runtime‌ with minimal cost to‌‌ adapt to the workload every time. Changes in‌ the design include both‌ different resource allocations and‌‌ different data layouts. We consider different workloads, including‌ transactional, analytical, and mixed,‌ and we study the‌‌ performance implications on different configurations to propose a‌ set of adaptive algorithms.‌

3.6 Argumentation mining

Argumentation appears when we evaluate the validity of new ideas, convince an addressee, or solve a difference of opinion. An argument contains a statement to be validated (a proposition also called claim or conclusion), a set of backing propositions (called premises, which should be accepted ideas), and a logical connection between all the pieces of information presented that allows the inference of the conclusion. In our work, we focus on fallacious arguments, where evidence does not prove or disprove the claim. For example, in an "ad hominem" argument, a claim is declared false because the person making it has a character flaw. We study the impact of fallacies in online discussions and show the need for improving tools for their detection. In addition, we look into detecting verifiable claims made by politicians. We started a collaboration with RadioFrance and with Wikidébats, a debate platform focused on providing quality arguments for controversial topics.

3.7 Measuring and mitigating risks‌ of AI-driven information targeting

We are witnessing a massive shift in the way people consume information. In the past, people had an active role in selecting the news they read. More recently, information started to appear on people's social media feeds as a byproduct of their social relations. We see a new shift brought by the emergence of online advertising platforms, where third parties can pay ad platforms to show specific information to particular groups of people through paid targeted ads. AI-driven algorithms power these targeting technologies. Our goal is to study the risks of AI-driven information targeting at three levels: (1) human level: in which conditions can targeted information influence an individual's beliefs; (2) algorithmic level: in which conditions can AI-driven targeting algorithms exploit people's vulnerabilities; and (3) platform level: are targeting technologies leading to biases in the quality of information different groups of people receive and assimilate. Then, we will use this understanding to propose protection mechanisms for platforms, regulators, and users.

4 Application domains

4.1 Cloud computing‌

Cloud computing services are developing strongly, and more and more companies and institutions resort to running their computations in the cloud, in order to avoid the hassle of running their own infrastructure. Today's cloud service providers guarantee machine availability in their Service Level Agreement (SLA), without any guarantees on performance measures according to a specific cost budget. Running analytics on big data systems requires the user not only to reserve the suitable cloud instances over which the big data system will be running, but also to set many system parameters, like the degree of parallelism and the granularity of scheduling. Choosing values for these parameters and choosing cloud instances need to meet user objectives regarding latency, throughput and cost measures, which is a complex task if it is done manually by the user. Hence, the need arises to transform cloud service models from availability guarantees to user performance objectives, which leads to the problem of multi-objective optimization. Research carried out in the team within the ERC project “Big and Fast Data Analytics” aims to develop a novel optimization framework for providing guarantees on the performance while controlling the cost of data processing in the cloud.

4.2 Computational journalism

Modern journalism increasingly relies‌ on content management technologies in order to represent,‌ store, and query source data and media objects‌ themselves. Writing news articles increasingly requires consulting several‌ sources, interpreting their findings in context, and crossing‌ links between related sources of information. Cedar research‌ results directly applicable to this area provide techniques‌ and tools for rich Web content warehouse management.‌ Within the SourcesSay AI Chair project, we work‌ to devise concrete algorithms and platforms to help‌ journalists perform their work better and/or faster. This‌ work is in collaboration with the journalists from‌ RadioFrance, the team Le vrai du faux.

4.3‌ Computational social science

Political discussions revolve around ideological conflicts that often split‌ the audience into two‌ opposing parties. Both parties‌‌ try to win the argument by bringing forward‌ information. However, often this‌ information is misleading, and‌‌ its dissemination employs propaganda techniques. We investigate the‌ impact of propaganda in‌ online forums and we‌‌ study a particular type of propagandist content, the‌ fallacious argument. We show‌ that identifying such arguments‌‌ remains a difficult task, but one of high‌ importance because of the‌ pervasiveness of this type‌‌ of discourse. We also explore trends around the‌ diffusion and consumption of‌ propaganda and how this‌‌ can impact or be a reflection of society.‌

4.4 Online targeted advertising‌

The enormous financial success of online advertising platforms is partially due to the precise targeting features they offer. Ad platforms collect large amounts of data on users and use powerful AI-driven algorithms to infer users' fine-grained interests and demographics, which they make available to advertisers to target users. For instance, advertisers can target groups of users as small as tens or hundreds and as specific as “people interested in anti-abortion movements that have a particular education level”. Ad platforms also employ AI-driven targeting algorithms to predict how “relevant” ads are to particular groups of people to decide to whom to deliver them. While these targeting technologies are creating opportunities for businesses to reach interested parties and lead to economic growth, they also open the way for interested groups to use users' data to manipulate them by targeting messages that resonate with each user.

5‌ Social and environmental responsibility‌‌

5.1 Contribution to Diversity, Equity and Inclusion

Madhulika Mohanty co-led the SCOUT action of the Diversity, Equity and Inclusion initiative (website) for the DB research community from 2021 to 2025. This action provided a checklist of items to verify before submitting a paper, in order to promote and ensure more DEI-compliant papers. This will be integrated within the standard submission systems for DB conferences. This has led to the publication of 13.

6 Highlights‌ of the year

6.1‌ Awards

The paper “RDF Query Answering in the Presence of Access Restrictions” by Maxime Buron, Hritika Kathuria, Ioana Manolescu Goujot and Georgios Siachamis won the CoopIS 2025 Best Paper Award 28.

7 Latest software‌‌ developments, platforms, open data

7.1 Latest software developments‌

7.1.1 ConnectionLens

Name:
Integration‌ of heterogeneous data using‌‌ information extraction
Keyword:
Data analysis
Functional Description:
ConnectionLens treats a set of heterogeneous, independently authored data sources as a single virtual graph, where nodes represent fine-granularity data items (relational tuples, attributes, key-value pairs, RDF, JSON or XML nodes…) and edges correspond either to structural connections (e.g., a tuple is in a database, an attribute is in a tuple, a JSON node has a parent…) or to similarity (sameAs) links. To further enrich the content journalists work with, we also apply entity extraction, which makes it possible to detect the people, organizations etc. mentioned in text, whether full-text or text snippets found e.g. in RDF or XML. ConnectionLens is thus capable of finding and exploiting connections present across heterogeneous data sources without requiring the user to specify any join predicate.
URL:
https://team.inria.fr/cedar/connectionlens/
Publications:
hal-02934277,‌ hal-02904797, hal-01841009
Contact:
Manolescu Ioana
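
The virtual-graph idea described in this entry can be sketched as follows. This is a toy illustration, not ConnectionLens code: the source names, records and helper functions are made up, each source is reduced to flat records, and sameAs links are approximated by equality of normalized text, whereas the description above mentions similarity links.

```python
from collections import defaultdict, deque

graph = defaultdict(list)  # node -> list of (neighbor, edge label), undirected
labels = {}                # value node -> its text content


def add_edge(u, v, label):
    graph[u].append((v, label))
    graph[v].append((u, label))


def add_record(source, record_id, record):
    """Add one flat record as a small tree: source -> record -> attribute values."""
    rec_node = f"{source}/{record_id}"
    add_edge(source, rec_node, "contains")
    for attr, value in record.items():
        val_node = f"{rec_node}/{attr}"
        labels[val_node] = str(value)
        add_edge(rec_node, val_node, attr)


def link_same_values():
    """Add sameAs edges between value nodes whose normalized text is equal."""
    by_text = defaultdict(list)
    for node, text in labels.items():
        by_text[text.strip().lower()].append(node)
    for nodes in by_text.values():
        for other in nodes[1:]:
            add_edge(nodes[0], other, "sameAs")


def connect(start, goal):
    """Breadth-first search: one shortest node path from start to goal, or None."""
    parent = {start: None}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        if node == goal:
            path = []
            while node is not None:
                path.append(node)
                node = parent[node]
            return path[::-1]
        for neighbor, _ in graph[node]:
            if neighbor not in parent:
                parent[neighbor] = node
                queue.append(neighbor)
    return None


# Two independently authored (made-up) sources.
add_record("companies.csv", "row1", {"name": "Acme Corp", "ceo": "Jane Doe"})
add_record("donations.json", "obj7", {"donor": "jane doe", "amount": 5000})
link_same_values()
path = connect("companies.csv/row1/name", "donations.json/obj7/amount")
```

The path crosses from one source to the other through the sameAs edge between the two "Jane Doe" values, without any join predicate being specified.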

7.1.2 Abstra‌

Name:
Abstra: Toward Generic Abstractions for Data of‌ Any Model
Keywords:
Heterogeneous Data, Data Exploration, Data‌ analysis, Databases, LOD - Linked open data
Functional‌ Description:
Abstra computes a description meant for humans,‌ based on the idea that, regardless of the‌ syntax or the data model, any dataset holds‌ some collections of entities/records, that are possibly linked‌ with relationships. Abstra relies on a common graph‌ representation of any incoming dataset, it leverages Information‌ Extraction to detect what the dataset is about,‌ and relies on an original algorithm for selecting‌ the core entity collections and their relations. Abstractions‌ are shown both as HTML text and a‌ lightweight Entity-Relationship diagram.
URL:
https://team.inria.fr/cedar/projects/abstra/
Publications:
hal-04131974, hal-03767967, hal-03774599
Contact:
Madhulika Mohanty‌
Participants:
Ioana Manolescu Goujot, Madhulika Mohanty, Nelly Barret,‌ Prajna Devi Upadhyay

7.1.3 StatCheck

Name:
Fact-checking Multidimensional‌ Statistic Claims in French
Keywords:
Machine learning, Databases,‌ Natural language processing, Software engineering
Scientific Description:
To strengthen public trust and counter disinformation, computational fact-checking, leveraging digital data sources, attracts interest from journalists and the computer science community. A particular class of interesting data sources comprises statistics, that is, numerical data compiled mostly by governments, administrations, and international organizations. Statistics are often multidimensional datasets, where multiple dimensions characterize one value, and the dimensions may be organized in hierarchies. This paper describes STATCHECK, a statistic fact-checking system jointly developed by the authors, who are either computer science researchers or fact-checking journalists working for a French-language media outlet with a daily audience of more than 15 million (aud, 2022). The technical novelty of STATCHECK is twofold: (i) we focus on multidimensional, complex-structure statistics, which have received little attention so far, despite their practical importance, and (ii) we propose novel statistical claim extraction modules for French, an area where few resources exist. We validate the efficiency and quality of our system on large statistic datasets (hundreds of millions of facts), including the complete INSEE (French) and Eurostat (European Union) datasets, as well as French presidential election debates.
Functional‌ Description:
StatCheck first allows the collection of data for its operation. Two types of data are collected, statistical tables and posts from social networks:
- Acquisition of statistical files on the site of referent organisations (INSEE, Eurostat)
- Extraction of statistical tables from these files, and storage of the extracted tables
- Acquisition of political tweets from a list of accounts
The application allows the detection, extraction and search of statistical facts:
- Detection and extraction of statistical facts from Twitter posts (e.g. "Unemployment rate increased by 30% in 2023")
- Search for statistical facts in our database. Display of the twenty most relevant statistical tables for a statistical fact
- Automatic transcription of audio files to detect and extract transcripts of statistical facts.
Release Contributions:
- Redesign of the user interface
- Modification of the software architecture
- Addition of audio transcription
URL:
https://cedar-rf.saclay.inria.fr/
Publications:
hal-01496700,‌ hal-01745768, hal-02121389,‌ hal-01915148, hal-03767992,‌‌ hal-03791175
Contact:
Ioana Manolescu Goujot
Participants:
Antoine Gauquier,‌ Tien Duc Cao, Ioana‌ Manolescu Goujot, Xavier Tannier,‌‌ Oana-Denisa Balalau, Simon Ebel, Theo Galizzi

7.1.4 ConnectionStudio‌

Keywords:
Heterogeneous Data, Data‌ Exploration
Functional Description:

ConnectionStudio‌‌ integrates highly heterogeneous data into graphs, enriched with‌ extracted entities. Studio users‌ can discover the entities‌‌ in their data, navigate across connections between datasets,‌ explore and query the‌ data in many ways.‌‌ The Studio currently supports: CSV, JSON, XML, RDF,‌ text, property graphs, all‌ Office formats, and PDF‌‌ datasets.

ConnectionStudio is a novel front-end to ConnectionLens,‌ Abstra and PathWays (see‌ also the respective Web‌‌ sites). Its own novel features are outlined in‌ a CoopIS 2023 article.‌
URL:
https://connectionstudio.inria.fr/
Publications:
hal-04185938‌‌, hal-04591897, hal-04727209
Contact:
Ioana Manolescu Goujot‌
Participants:
Madhulika Mohanty, Simon‌ Ebel, Theo Galizzi

7.1.5‌‌ FactSpotter

Keywords:
Factual Faithfulness, Text generation
Functional Description:‌
We propose a new metric that correctly identifies factual faithfulness, i.e., given a triple (subject, predicate, object), it decides if the triple is present in a generated text. We show that our metric FactSpotter achieves the highest correlation with human annotations on data correctness, data coverage, and relevance. In addition, FactSpotter can be used as a plug-in feature to improve the factual faithfulness of existing models.
Contact:‌
Kun Zhang
Partner:
Ecole‌‌ Polytechnique

7.1.6 PathWays

Name:
PathWays: finding entity paths‌ in heterogeneous data graphs‌
Keywords:
Named entities, Data‌‌ Journalism, Heterogeneous Data
Functional Description:
PathWays models heterogeneous datasets in a graph (see ConnectionLens). To identify interesting paths in this graph, PathWays works on its (smaller) summary (see Abstra) for efficiency and optimisation. Then, it sorts paths by their potential interest (a metric based on the entity found and the information dilution along the path) before evaluating them with the help of a new multi-query optimisation algorithm. Finally, PathWays shows the most interesting (evaluated) paths in the form of tables, which are very easy to understand for journalists, who initiated this scenario.
URL:‌‌
https://team.inria.fr/cedar/projects/pathways/
Publications:
hal-04131977, hal-04727209
Contact:‌
Ioana Manolescu Goujot

7.1.7‌ OpenIEEntity

Name:
Open Information‌‌ Extraction with Entity Focused Constraints
Keyword:
Information extraction‌
Functional Description:
This tool takes a sentence as input and outputs the facts contained in the sentence, in the format (subject, predicate, object).
Contact:
Oana-Denisa Balalau‌‌

7.1.8 FactCheckBureau

Name:
FactCheckBureau: Build Your Own Fact-Check‌ Analysis Pipeline
Keywords:
Fact Check Retrieval, Fact-checking
Functional Description:
FactCheckBureau is an end-to-end solution that enables researchers to easily and interactively design and evaluate Fact Check retrieval pipelines. Further, it provides a query interface for non-technical users to find relevant Fact Checks for an input query in the form of a key phrase, a social media post, or an image.
URL:‌
https://gitlab.inria.fr/cedar/factcheckbureau
Publication:
hal-04684068
Contact:
Ioana Manolescu Goujot

7.1.9‌ FDSpotter

Name:
Structured Discourse Representation for Factual Consistency‌ Verification
Keyword:
LLM
Functional Description:
The repository includes‌ the tool to test for factual consistency, but‌ also all the code necessary to compare our‌ tool with state of the art methods for‌ factual consistency.
Contact:
Oana-Denisa Balalau

7.1.10 COI-OpenIE

Keywords:‌
Conflict Of Interest Mining, Knowledge graph, Scientific Text,‌ Information extraction
Functional Description:
This software expects as‌ input a collection of certain sections (Acknowledgment, Funding‌ disclosure, and so on) of scientific publications, and‌ produces a knowledge graph that has information about‌ the different interesting relations among Individuals and Organizations‌ that were present in the input text corpus.‌
Contact:
Oana-Denisa Balalau

7.1.11 ClimateNLP toolbox

Name:
Climate‌ NLP toolbox
Keywords:
Climate change, Classification, Natural language‌ processing
Functional Description:
Python Scripts to train or‌ download models (BERT-based models, TF-IDF). It also contains‌ scripts to run LLM pipelines to perform the‌ same tasks.
Contact:
Tom Calamai

7.1.12 MultilingualPoliticalLLMs

Keywords:‌
LLM, Multilingual
Functional Description:
We test different scenarios,‌ where we vary the language of the prompt‌ while also assigning a nationality to the model.‌ We evaluate models on the 50 most populous‌ countries and their official languages.
URL:
https://github.com/ChadiHelwe/navigating_the_political_compass
Contact:‌
Oana-Denisa Balalau

8 New results

8.1 Data management‌ for analyzing and verifying digital arenas

8.1.1 Graph‌ data lakes of heterogeneous data sources for data‌ journalism

Participants: Oana-Denisa Balalau, Pablo Bertaud-Velten,‌ Nikola Dobricic, Przemyslaw Dominikowski, Simon Ebel‌, Theo Galizzi, Garima Gaur, Ioana‌ Manolescu, Maria Jesus Mellado Tenorio, Madhulika‌ Mohanty, Saba Shahsavari, Georgios Siachamis.‌

Work carried out within the ANR AI Chair SourcesSay project has focused on developing a platform, ConnectionLens, for integrating arbitrary heterogeneous data into a graph, then exploring and querying that graph using simple, intuitive query interfaces. The main technical challenges addressed were: (i) how to interconnect structured and semi-structured data sources? We address this through information extraction (when an entity appears in two data sources or two places in the same graph, we only create one node, thus interlinking the two locations) and through similarity comparisons (see 7.1.1); (ii) how to find all connections between nodes matching specific search criteria, or certain keywords? The question is particularly challenging in our context since ConnectionLens graphs can be pretty large, and query answers can traverse edges in both directions; (iii) how to convert this graph into standard graph data models like property graphs, etc. ConnectionLens is available online in the ConnectionLens Gitlab repository, while ConnectionStudio, its GUI, is available in the ConnectionStudio Gitlab repository.

With the ANR TopOL project, we now extend‌ our contributions to large scale data lakes of‌ heterogeneous data sources and explore novel ways of‌ exploration. In this context, the following new contributions‌ have been brought:

Efficiently Profiling, Indexing and Querying Heterogeneous Datasets in Graph Data Lakes
Building on the ConnectionLens (7.1.1) and Abstra (7.1.2) frameworks, this work focuses on enabling natural language question answering over large-scale heterogeneous data lakes. In each dataset, we have formalized the concept of entities and their contexts, which serve as natural "anchors" of users' questions, e.g. which Person interacted with which Organization, and at what Location. To support efficient search over the set of entities-in-context, we developed an end-to-end system that ingests heterogeneous sources into a graph data lake (using ConnectionLens), abstracts them into collections (using Abstra) and finally builds and indexes the entities-in-context. The developed indexes include Locality Sensitive Hashing (LSH) for semantic similarity search and a TRIE-like structure for exact lookups.

This work provides a foundation for future work (e.g. building a Retrieval-Augmented Generation system) allowing non-technical users like journalists to uncover interesting facts in large heterogeneous data lakes, in particular in domains such as investigative journalism (with the team's ongoing collaboration with ICIJ).
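
The two kinds of indexes mentioned above (a trie for exact lookups, LSH for similarity search) can be sketched as follows. This is an illustrative toy, not the system's implementation: it assumes random-hyperplane LSH over vector embeddings, and the entity strings, context identifiers and 3-dimensional "embeddings" are made up.

```python
import random
from collections import defaultdict


class Trie:
    """Character trie mapping exact entity strings to the contexts where they occur."""

    def __init__(self):
        self.children = {}
        self.contexts = []

    def insert(self, key, context):
        node = self
        for ch in key:
            node = node.children.setdefault(ch, Trie())
        node.contexts.append(context)

    def lookup(self, key):
        node = self
        for ch in key:
            node = node.children.get(ch)
            if node is None:
                return []
        return node.contexts


class HyperplaneLSH:
    """Random-hyperplane LSH: vectors separated by a small angle tend to share a bucket."""

    def __init__(self, dim, num_planes=8, seed=0):
        rng = random.Random(seed)
        self.planes = [[rng.gauss(0, 1) for _ in range(dim)] for _ in range(num_planes)]
        self.buckets = defaultdict(list)

    def _signature(self, vec):
        # One bit per hyperplane: on which side of the plane does the vector fall?
        return tuple(sum(p * v for p, v in zip(plane, vec)) >= 0 for plane in self.planes)

    def insert(self, vec, item):
        self.buckets[self._signature(vec)].append(item)

    def candidates(self, vec):
        return self.buckets.get(self._signature(vec), [])


trie = Trie()
lsh = HyperplaneLSH(dim=3)
# Made-up entities-in-context with toy embeddings.
entries = [
    ("Acme Corp", "companies.csv/row1/name", [0.9, 0.1, 0.0]),
    ("Jane Doe", "donations.json/obj7/donor", [0.0, 0.2, 0.9]),
]
for text, context, vector in entries:
    trie.insert(text, context)
    lsh.insert(vector, context)

exact = trie.lookup("Jane Doe")            # exact string lookup
similar = lsh.candidates([1.8, 0.2, 0.0])  # same direction as the "Acme Corp" vector
```

A single hash table can miss near neighbors that fall just across a hyperplane; practical LSH indexes usually combine several tables to reduce such misses.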
Batch Generic Evaluation of Keyword Queries on Graphs
Keyword search is a popular paradigm for searching for information in graphs: users specify a few search terms (or keywords), and the system returns subtrees of the graph, where each keyword is matched by a node in each returned subtree. Because the problem is NP-hard in general, many keyword search algorithms consider a fixed score function which is applied to rank result trees, and explore only part of the search space, pruning trees with low scores. In contrast, generic algorithms explore the complete search space (subject to space or time limits, due to the high complexity), but can be used with any score function. In this work, we consider the problem of simultaneously answering a set (batch) of keyword queries, in a way compatible with any score function. Building upon our recent one-query generic algorithm 36, we show that when graph nodes match keywords from multiple queries, graph exploration effort can be shared, to speed up the evaluation of the query batch. We formally establish guarantees on the correctness and completeness of our algorithm, and demonstrate its efficiency through comprehensive experiments over synthetic and real-world graphs.
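
The benefit of sharing exploration across a batch can be illustrated with a deliberately simplified sketch. It is not the generic algorithm described above: it assumes a fixed score (the sum of hop distances to the keywords), returns only the best connecting node rather than result subtrees, and uses a made-up graph. What it does show is that the exploration started from a keyword's matching nodes is computed once and reused by every query containing that keyword.

```python
from collections import deque


def bfs_distances(adjacency, sources):
    """Multi-source BFS: hop distance from the nearest source to each reachable node."""
    dist = {s: 0 for s in sources}
    queue = deque(sources)
    while queue:
        node = queue.popleft()
        for neighbor in adjacency[node]:
            if neighbor not in dist:
                dist[neighbor] = dist[node] + 1
                queue.append(neighbor)
    return dist


def answer_batch(adjacency, node_text, queries):
    """For each query, return the node minimizing the summed distance to its keywords."""
    cache = {}  # keyword -> distances, explored once and shared by all queries
    results = []
    for keywords in queries:
        for kw in keywords:
            if kw not in cache:
                matches = [n for n, text in node_text.items() if kw in text.lower()]
                cache[kw] = bfs_distances(adjacency, matches)
        # Only nodes reachable from every keyword can connect them all.
        common = set(adjacency)
        for kw in keywords:
            common &= set(cache[kw])
        best = min(
            common,
            key=lambda n: (sum(cache[kw][n] for kw in keywords), n),
            default=None,
        )
        results.append(best)
    return results, len(cache)


# Made-up graph: a - b - c - d, plus b - e.
adjacency = {"a": ["b"], "b": ["a", "c", "e"], "c": ["b", "d"], "d": ["c"], "e": ["b"]}
node_text = {"a": "Alice", "b": "record", "c": "Bob", "d": "Paris", "e": "Acme"}
queries = [["alice", "bob", "acme"], ["bob", "paris", "acme"]]
roots, explorations = answer_batch(adjacency, node_text, queries)
```

Here the two queries contain six keyword occurrences but only four distinct keywords, so four explorations suffice instead of six.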
Named Entity Cleaning and Enhancement with Human-in-the-loop
Named Entities (NEs, in short) are frequently encountered in datasets about varied topics, e.g., journalistic investigations (people, places, companies), market analysis (companies and officers), etc. NEs often appear under different forms within or across datasets, due to spelling variants or mistakes. To leverage NE-rich datasets, the NEs need to be clean (error-free), and possibly enriched with information from external sources. While numerous data cleaning solutions exist, in this work, we focus on the specific challenges raised by the cleaning of NE sets, in particular (i) through a visual workflow interface, (ii) leveraging old and new techniques (string distances, Knowledge Bases, and carefully controlled access to LLMs), and especially (iii) enabling human inspection and interaction with the NE cleaning process, down to the granularity of an individual attribute of a record. The latter need is crucial in order to capture advanced knowledge that only domain experts have, and which may be absent from all other sources of information (KB, LLM, etc.). We support this by gathering how-provenance that traces the numerous ways in which information is brought to clean NEs. We built NiceT, a system addressing these challenges, and tested it on a variety of real-life datasets.
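
As a rough illustration of combining string distances, expert input and provenance, consider the sketch below. It is not NiceT: it assumes a single technique (similarity from Python's `difflib`), a hypothetical threshold of 0.85, and a simple textual explanation per decision in place of real how-provenance; the names are made up.

```python
from difflib import SequenceMatcher


def normalize(name):
    """Lowercase and collapse whitespace so trivial variants compare equal."""
    return " ".join(name.lower().split())


def similarity(a, b):
    """String similarity in [0, 1] from difflib (a stand-in for any string distance)."""
    return SequenceMatcher(None, normalize(a), normalize(b)).ratio()


def clean_entities(names, threshold=0.85, overrides=None):
    """Map each name to a canonical form and record why each decision was taken."""
    overrides = overrides or {}
    canonical = []   # canonical forms chosen so far
    mapping = {}     # original name -> canonical form
    provenance = {}  # original name -> explanation of the decision
    for name in names:
        if name in overrides:  # an expert decision always wins
            mapping[name] = overrides[name]
            provenance[name] = "human override"
            if overrides[name] not in canonical:
                canonical.append(overrides[name])
            continue
        best = max(canonical, key=lambda c: similarity(name, c), default=None)
        if best is not None and similarity(name, best) >= threshold:
            mapping[name] = best
            provenance[name] = f"string similarity {similarity(name, best):.2f} to '{best}'"
        else:
            canonical.append(name)
            mapping[name] = name
            provenance[name] = "new canonical form"
    return mapping, provenance


names = ["Jean Dupont", "Jean Dupond", "J. Dupont", "Acme Corp"]
# The abbreviation is too far from the full name for the string distance alone,
# so a (hypothetical) domain expert supplies the link.
mapping, provenance = clean_entities(names, overrides={"J. Dupont": "Jean Dupont"})
```

Keeping the explanation next to each decision is what lets a human later inspect, confirm or revert an individual cleaning step.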
Named Entity Centric Querying over Heterogeneous Data

Integrating information from diverse sources, particularly in investigative journalism, often hinges on linking data through shared named entities (NEs). The same entity may appear across multiple sources, each providing a different contextual perspective. For instance, when combining U.S. financial and political datasets, Donald Trump may emerge as a common entity, associated with distinct roles such as businessperson and politician. From a journalistic standpoint, the ability to seamlessly integrate heterogeneous data sources and query entity roles or inter-entity relationships, without requiring advanced technical expertise, is critical.

This project, centered on the problem of extracting and integrating information about a named entity (NE) that may appear across heterogeneous datasets within a datalake, gives rise to two concrete research challenges. First, given an input NE, identify the roles (contexts) it plays across different datasets and aggregate relevant information about the NE. We refer to the aggregated output as the Infocard of the NE. Second, given a collection of heterogeneous datasets and an NE, find its interesting relationships with other named entities. We leverage the capabilities of our in-house tools, ConnectionLens and Abstra, which can integrate structured, semi-structured, and unstructured datasets into a unified graph, and create high-level semantic abstractions of complex datasets.

8.1.2 RDF‌ Query Answering in the Presence of Access Restrictions‌

Participants: Maxime Buron, Hritika Kathuria, Ioana‌ Manolescu, Georgios Siachamis.

In this work, we explore algorithms for answering conjunctive RDF queries in the presence of RDFS ontologies and access control. We consider an access control setting where by default all users have access to the complete graph, and a restriction can forbid a user's access to specific IRIs. Here, restricting for user $u$ the access to an IRI $i$ entails that: no answer to a query by $u$ may contain the IRI $i$; no triple containing $i$ can be used to compute an answer for a query by $u$, nor to entail such a triple via reasoning with the ontology. We present a set of query answering algorithms for this novel context, and prove that five among them are correct, i.e., sound and complete, with respect to both the ontology and the access restrictions in place. We have implemented all our algorithms and present experiments comparing their performance. This work was published in CoopIS 2025 28 where it won the Best Paper Award.
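To make the restriction semantics concrete, here is a minimal Python sketch of a naive baseline, under one reading of the definition above. It is not one of the algorithms of the paper: it discards every triple mentioning an IRI restricted for the user, then saturates the remaining triples with two RDFS rules (rdfs9 and rdfs11, i.e., subclass propagation), and finally evaluates a single triple pattern. The tuple encoding of triples, the function names, and the example IRIs are assumptions made for the example.

```python
from typing import Iterable, Optional, Set, Tuple

Triple = Tuple[str, str, str]
Pattern = Tuple[Optional[str], Optional[str], Optional[str]]

RDF_TYPE = "rdf:type"
SUBCLASS = "rdfs:subClassOf"

# Toy graph (hypothetical IRIs): data triples and ontology triples together.
GRAPH: Set[Triple] = {
    ("ex:alice", RDF_TYPE, "ex:Journalist"),
    ("ex:bob", RDF_TYPE, "ex:Writer"),
    ("ex:Journalist", SUBCLASS, "ex:Writer"),
    ("ex:Writer", SUBCLASS, "ex:Person"),
}


def visible_triples(graph: Iterable[Triple], restricted: Set[str]) -> Set[Triple]:
    """Keep only the triples in which no position holds a restricted IRI."""
    return {t for t in graph if not any(term in restricted for term in t)}


def saturate(graph: Set[Triple]) -> Set[Triple]:
    """Apply rdfs9 (type propagation) and rdfs11 (subclass transitivity) to a fixpoint."""
    result = set(graph)
    while True:
        subclass_pairs = [(s, o) for (s, p, o) in result if p == SUBCLASS]
        new: Set[Triple] = set()
        for (c, d) in subclass_pairs:
            for (s, p, o) in result:
                if p == RDF_TYPE and o == c:
                    new.add((s, RDF_TYPE, d))   # rdfs9
                if p == SUBCLASS and o == c:
                    new.add((s, SUBCLASS, d))   # rdfs11
        if new <= result:
            return result
        result |= new


def answer(graph: Iterable[Triple], restricted: Set[str], pattern: Pattern) -> Set[Triple]:
    """Evaluate one triple pattern for a user; None stands for a variable."""
    data = saturate(visible_triples(graph, restricted))
    return {t for t in data if all(q is None or q == v for q, v in zip(pattern, t))}


UNRESTRICTED = answer(GRAPH, set(), (None, RDF_TYPE, "ex:Person"))
RESTRICTED = answer(GRAPH, {"ex:Writer"}, (None, RDF_TYPE, "ex:Person"))
```

Without restrictions, both ex:alice and ex:bob are inferred to be of type ex:Person. When ex:Writer is restricted, every derivation of these facts passes through a triple containing ex:Writer, so this baseline returns no answer.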

8.1.3 FactCheck-KG: Towards LLM-backed FC‌ Retrieval

Participants: Garima Gaur, Madhulika Mohanty.

There is an unprecedented rise in the volume and reach of disinformation due to the popularity of social media and the advent of generative AI models. Fact-checking, that is, checking the veracity of a claim, is infeasible at this scale by human effort alone. This is primarily due to the rise in the volume of claims requiring verification, and also to the number of documents that must be processed to verify a claim. The process is further complicated by disinformation resurfacing in paraphrased or incomplete forms, or in an altered or shifted context. Fact-checkers often find themselves re-assessing a previously evaluated claim, which wastes precious human effort. To tackle these challenges, fact-check retrieval (FCR) pipelines have been developed that, given a newly encountered claim, aim to identify the most relevant claims among a set of previously assessed claims. In this work, we leverage NLP techniques over a set of fact-checked claims and their related articles to build FactCheck-KG, a Knowledge Graph (KG) of named entities, topics, claims and articles, with edges capturing the connections across different fact-checks via common topics and named entities. This representation lays the foundation for more context-aware, fine-grained fact-check retrieval. For example, given the success of retrieval augmented generation (RAG) and its extension to a graph-based retrieval (GraphRAG) framework, our KG can form a starting point for applying GraphRAG to the fact-check retrieval problem.

8.1.4‌ Efficient and Scalable Search‌ for Statistics

Participants: Simon‌‌ Ebel, Helena Galhardas, Theo Galizzi,‌ Ioana Manolescu, Aurelien‌ Peden.

Informed public debate needs high-quality data. In this context, high-quality statistical data sources are a valuable category of reference information based on which a claim can be checked. To facilitate the work of journalists or other fact-checkers, users' questions about a specific claim should be automatically answered based on statistical tables. This task is complicated by the large number, size, and variety of statistical datasets. This work introduces the statistical table discovery problem (STD, in short), which aims, given a natural language question and a set of statistical datasets (multidimensional tables), to find the tables most relevant for the question. We then describe STAR, an algorithm for solving the STD problem. Unlike existing table discovery (TD) solutions aimed at relational tables, STAR is devised specifically for multidimensional ones. Further, STAR treats the space and time dimensions of statistical datasets separately. We experimentally show that these features, together, make STAR outperform state-of-the-art TD systems adapted to the STD problem, in terms of scalability, search quality, preprocessing and question answering time. This work has been informally presented at BDA 2025 19 and the code is available at its Gitlab repository.

8.1.5 Structured‌ Discourse Representation for Factual‌‌ Consistency Verification

Participants: Oana-Denisa Balalau, Ioana Manolescu‌, Kun Zhang.‌

Analysing the differences in how events are represented across texts, or verifying whether language model generations contain hallucinations, requires the ability to systematically compare their content. To support such a comparison, a structured representation that captures fine-grained information plays a vital role. In particular, identifying distinct atomic facts and the discourse relations connecting them enables deeper semantic comparison. Our proposed approach combines structured discourse information extraction with a classifier, FDSpotter, for factual consistency verification. We show that adversarial discourse relations pose challenges for language models, but fine-tuning on our annotated data, DiscInfer, achieves competitive performance. Our proposed approach advances factual consistency verification by grounding it in linguistic structure and decomposing it into interpretable components. We demonstrate the effectiveness of our method on the evaluation of two tasks: data-to-text generation and text summarisation. This work has been published in ACL (Findings) 2025 27 and the software is available on BIL 7.1.9.

8.1.6 The‌ Search for Conflicts of Interest: Open Information Extraction‌ in Scientific Publications

Participants: Oana-Denisa Balalau, Garima‌ Gaur, Ioana Manolescu, Prajna Upadhyay.‌

A conflict of interest (COI) appears when a person or a company has two or more interests that may directly conflict. This happens, for instance, when a scientist whose research is funded by a company audits the same company. For transparency and to avoid undue influence, public repositories of relations of interest are increasingly recommended or mandated in various domains, and can be used to avoid COIs. In this work, we propose an LLM-based open information extraction (OpenIE) framework for extracting financial or other types of interesting relations from scientific text. We target scientific publications in which authors declare funding sources or collaborations in the acknowledgment section, in the metadata, or in the publication, following editors' requirements. We introduce an extraction methodology and present a knowledge base (KB) with a comprehensive taxonomy of COI-centric relations. Finally, we perform a comparative study of the disclosures of two journals in the field of toxicology and pharmacology. The work has been published in EMNLP (Findings) 2025 20 and the software is available on BIL 7.1.10.

8.2 Online targeted‌ advertising

Participants: Ines Abdelaziz, Nardjes Amieur,‌ Abir Benzaamia, Salim Chouaki, Asmaa El‌ Fraihi, Oana Goga.

8.2.1 A Year‌ Under the DSA: Ad Transparency's Uneven Landscape

The Digital Services Act (DSA) has put platform accountability on center stage, requiring online platforms to provide greater transparency into how advertisements are targeted and delivered to users. Central to these obligations are two mechanisms: user-facing ad explanations, which inform individuals why they were shown a given ad, and public ad repositories, which are intended to enable independent auditing of advertising practices. This study provides the first multi-platform evaluation of these two mechanisms across Facebook, Instagram, YouTube and X. Using 48,511 user-facing “Why am I seeing this ad?” (WAIST) notices, and a systematic analysis of each platform's public ad repository, we assess how well current implementations disclose the parameters and decision processes involved in targeting. To do so, we develop an operational framework based on Articles 26 and 39 of the DSA, capturing granularity, attribution of targeting and delivery choices, data source disclosures, and accuracy, and apply it across both user-facing notices and public ad repositories. Our findings show that transparency remains fragmented and inconsistent across platforms. User-facing explanations vary widely in precision and often omit key targeting information, while repositories provide incomplete, misattributed, and at times difficult-to-interpret targeting data. Moreover, discrepancies between explanations and repository entries undermine the reliability of both mechanisms. Overall, current transparency infrastructures fall short of the DSA's expectations and highlight the need for clearer and more enforceable standards for advertising transparency moving forward. This work has been accepted for publication in PETs/PoPETs 2026.

8.2.2 A Comparative‌ Study of News Exposure‌‌ and Consumption On and Off Facebook.

Social media‌ giants like Meta, Google,‌ and X leverage powerful‌‌ algorithms to personalize user feeds, a practice now‌ under intense public scrutiny.‌ These algorithms can inadvertently‌‌ skew the information users consume, potentially influencing political‌ opinions and voting decisions.‌ This raises critical questions:‌‌ Do social media platforms foster misinformation and contribute‌ to echo chambers? To‌ address this ongoing debate,‌‌ our study directly compares news exposure on Facebook‌ (where algorithmic influence is‌ strong) with news consumption‌‌ off-platform (where user behavior plays a larger role).‌ Specifically, we investigate: (1)‌ Are users exposed to‌‌ more/less misinformation on Facebook compared with their off-platform‌ misinformation consumption? (2) Is‌ news exposure on Facebook‌‌ more/less diverse than off-platform news consumption? (3) To‌ what extent do socio-demographic‌ and psychological factors influence‌‌ misinformation exposure on Facebook and consumption off Facebook?‌ (4) Is there a‌ relationship between socio-demographic and‌‌ psychological factors and news diversity on and off‌ Facebook? and (5) Is‌ users' exposure to misinformation‌‌ on Facebook correlated to off-platform news consumption?

Our study of 123,995 news-related posts on Facebook and 70,587 news article visits off Facebook, collected from 642 users during 12 weeks, reveals the following central findings: (1) Only a small fraction (4%) of users' news consumption off Facebook is driven by news exposure on Facebook, and only 5.7% of misinformation consumption off Facebook is driven by news exposure on Facebook. (2) There is a higher prevalence of misinformation in user-received content on Facebook compared to deliberately consumed content off-platform. On Facebook, 5.9% of our users' news exposure comes from sources known for spreading misinformation, while off-platform, only 2.6% of our users' news consumption is from misinformation sources. Conversely, Facebook presents more diverse content: 22% of users received content from only one political leaning on Facebook, compared to 36% of users who consumed content from only one political leaning off-platform. (3) Several socio-demographic and psychological factors showed a statistically significant correlation with misinformation exposure on Facebook but not with misinformation consumption off Facebook. (4) The proportion of misinformation consumed off Facebook emerged as a statistically significant predictor of users' exposure to misinformation on Facebook, independent of news consumption on Facebook.

This work has been published in CSCW‌ 2025 15.

8.2.3 Privacy Settings and Ad‌ Perception: The Shift from Third-Party Cookies to the‌ Privacy Sandbox

Online behavioral advertising, heavily reliant on‌ privacy-invasive third-party cookie tracking, faces a significant shift‌ as browsers like Safari, Brave, and Firefox have‌ already deprecated them. Google Chrome announced its parallel‌ move with the "Privacy Sandbox Initiative" in 2019,‌ proposing privacy-preserving advertising mechanisms. The extent to which‌ Privacy Sandbox can deliver comparable ad relevance and‌ purchase intent to the established third-party cookie ecosystem‌ will likely determine its adoption as a widespread‌ alternative. This paper presents the first user study‌ evaluating the impact of Privacy Sandbox APIs on‌ ad perception. Our findings show that users perceive‌ Privacy Sandbox ads as less relevant and exhibit‌ lower purchase intent compared to third-party cookie–based ads,‌ without a corresponding increase in perceived privacy protection.‌ These results contribute to the ongoing assessment of‌ Privacy Sandbox as an alternative to third-party cookies.‌

8.2.4 Is Contextual Advertising Safe? Analyzing Systemic Risks‌ with Ads on YouTube.

Contextual advertising is seeing a resurgence in popularity as a privacy-preserving alternative to behavioral targeting. While often regarded as a coarse-grained approach, advances in AI-driven content analysis have transformed it into a highly granular form of targeting. This work examines the safety risks of contextual targeting through a two-part empirical study, analyzing its potential to enable targeting of audiences with sensitive attributes and to expose users to harmful or exploitative ads. In controlled ad experiments, we show that advertisers can target audiences defined by sensitive attributes (e.g., religious belief, mental health condition, and political ideology) by strategically selecting contextual placements, thereby circumventing policies that prohibit such targeting through behavioral signals. To understand how this risk manifests in practice, we develop an automated measurement framework to collect contextual ads delivered in high-risk content environments, focusing on conspiracy videos. We find that contextual ads are highly prevalent in these environments, disproportionately deliver sensitive categories (e.g., alternative health, religion, and politics), and lack transparency. We argue that contextual ad systems require deeper empirical scrutiny and robust transparency mechanisms to prevent exploitation and abuse, and that regulators should extend behavioral advertising risk principles to the contextual domain.

8.2.5 A Framework for Auditing Ad Delivery Responsiveness‌ to Psychological Traits

Online advertising platforms increasingly personalize ad delivery using users' behavioral signals, even when advertisers cannot explicitly target many underlying user characteristics. Auditing delivery skews for traits that are latent, complex, or not directly targetable through advertiser-facing tools remains challenging. We propose an experimental framework for auditing ad delivery across latent traits by constructing trait-defined audiences and examining how delivery systems allocate ads to these audiences under controlled competitive conditions. We demonstrate this framework on Meta's advertising platform using extraversion as a case study. We construct trait-based audiences using two approaches: psychometric assessment combined with tracking-based retargeting, and behavioral profiling based on on-platform engagement. Under controlled delivery conditions, we examine how the platform allocates personality-aligned and misaligned ads across these audiences. We find a statistically significant alignment effect in ad delivery: ads are more likely to be delivered when their framing matches the personality of the target audience ($\beta = 0.40$, $p < 0.001$). This effect is strongest in behaviorally profiled segments, where misaligned ads also exhibit reduced reach relative to aligned ads. Our framework provides a general approach for auditing ad delivery behavior and personalization dynamics driven by latent user traits.

8.2.6 How Persuasive‌‌ Are LLMs in the Wild? Assessing Personalized Ads‌ in Real-World Delivery

Large‌ language models (LLMs) have‌‌ demonstrated persuasive potential in controlled experiments and survey-based‌ studies across commercial, political,‌ and social domains. However,‌‌ their effectiveness in real-world communication environments remains largely‌ unexplored. This work addresses‌ this gap by evaluating‌‌ LLM-generated personalized messages deployed in controlled advertising experiments‌ on Meta platforms. We‌ assess effectiveness along three‌‌ complementary dimensions: (1) behavioral user engagement measured through‌ field experiments, (2) perceived‌ appeal captured via user‌‌ surveys, and (3) platform-level dynamics analyzed through algorithmic‌ ad delivery patterns. Our‌ results show that LLM-based‌‌ personalized messages do not significantly improve user engagement‌ compared to non-personalized messages.‌ We also show that‌‌ user perceptions—measured through surveys—can diverge significantly from observed‌ behavioral outcomes online. This‌ highlights the limitations of‌‌ relying on survey-based evaluations alone to assess the‌ persuasive capabilities of LLMs.‌ Finally, we show that‌‌ LLM-generated personalization can influence platform ad delivery—shifting impressions‌ toward the intended audience‌ by up to 8%‌‌ even without explicit targeting instructions. These effects are‌ often constrained by the‌ platform's relevance predictions, which‌‌ may override the cues embedded in the message.‌ Together, these findings provide‌ a comprehensive real-world audit‌‌ for the effectiveness and limits of LLM-based persuasion‌ in the wild. It‌ has been accepted for‌‌ publication in AAAI ICWSM 2026.

8.3 Bias and‌ issues in LLMs and‌ Benchmarks

Participants: Oana-Denisa Balalau‌‌, Tom Calamai, Chadi Helwe.

8.3.1‌ Navigating the Political Compass:‌ Evaluating Multilingual LLMs across‌‌ Languages and Nationalities

Large Language Models (LLMs) have become ubiquitous in today's technological landscape, boasting a plethora of applications, and even endangering human jobs in complex and creative fields. One such field is journalism: LLMs are being used for summarization, generation and even fact-checking. However, in today's political landscape, LLMs could accentuate tensions if they exhibit political bias. In this work, we evaluate the political bias of 15 of the most used multilingual LLMs via the Political Compass Test. We test different scenarios, where we vary the language of the prompt, while also assigning a nationality to the model. We evaluate models on the 50 most populous countries and their official languages. Our results indicate that language has a strong influence on the political ideology displayed by a model. In addition, smaller models tend to display a more stable political ideology, i.e., an ideology that is less affected by variations in the prompt. The work has been published in ACL (Findings) 2025 21 and the tool is available on BIL 7.1.12.

8.3.2 Benchmarking‌ the Benchmarks: Reproducing Climate-Related NLP Tasks

Significant efforts have been made in the NLP community to facilitate the automatic analysis of climate-related corpora through tasks such as climate-related topic detection, climate risk classification, question answering over climate topics, and many more. In this work, we perform a reproducibility study on 8 tasks and 29 datasets, testing 6 models. We find that many tasks rely heavily on surface-level keyword patterns rather than deeper semantic or contextual understanding. Moreover, we find that 96% of the datasets contain annotation issues, with 16.6% of the sampled wrong predictions of a zero-shot classifier actually being clear annotation mistakes, and 38.8% being ambiguous examples. These results call into question the reliability of current benchmarks to meaningfully compare models and highlight the need for improved annotation practices. We conclude by outlining actionable recommendations to enhance dataset quality and evaluation robustness. The work has been published in ACL (Findings) 2025 18 and the tool is available on BIL 7.1.11.

8.4 Efficient Big Data analytics

8.4.1 Graph‌ Transformers for Query Plan Representation: Potentials and Challenges‌

Participants: Yanlei Diao, Guillaume Lachaud, Gabriel‌ Lozano Pinzon, Chenghao Lyu.

Query Plan‌ Representation (QPR) is central to workload modeling, with‌ various deep-learning based architectures proposed in the literature.‌ Our work is motivated by two key observations:‌ (i) the research community still lacks clarity on‌ which model, if any, best suits the QPR‌ problem; and (ii) while transformers have revolutionized many‌ fields, their potential for QPR remains largely underexplored.‌ This study examines the strengths and challenges of‌ Graph Transformers for QPR. We introduce a new‌ taxonomy that unifies deep-learning based QPR techniques along‌ key design axes. Our benchmark analysis of common‌ QPR architectures reveals that Graph Transformer Networks (GTNs)‌ consistently outperform alternatives, but can degrade under limited‌ training data. To address this, we propose novel‌ data augmentation techniques to enhance training diversity and‌ refine GTN architectures by replacing ineffective language-model-inspired components‌ with techniques better suited for query plans. Evaluation‌ on JOB, TPC-H, and TPC-DS benchmarks shows that‌ with sufficient training data, enhanced GTNs outperform existing‌ models for capturing complex queries (JOB Full and‌ TPC-DS) and enable the query embedder trained on‌ TPC-DS to generalize to TPC-H queries out of‌ the box. The work has been accepted in‌ VLDB 2026.

8.4.2 Unsupervised Anomaly Detection in Multivariate‌ Time Series across Heterogeneous Domains

Participants: Yanlei Diao‌, Vincent Jacob.

The widespread adoption of‌ digital services, along with the scale and complexity‌ at which they operate, has made incidents in‌ IT operations increasingly more likely, diverse, and impactful.‌ This has led to the rapid development of‌ a central aspect of "Artificial Intelligence for IT‌ Operations" (AIOps), focusing on detecting anomalies in vast‌ amounts of multivariate time series data generated by service entities. In this‌ paper, we begin by‌ introducing a unifying framework‌‌ for benchmarking unsupervised anomaly detection (AD) methods, and‌ highlight the problem of‌ shifts in normal behaviors‌‌ that can occur in practical AIOps scenarios. To‌ tackle anomaly detection under‌ domain shift, we then‌‌ cast the problem in the framework of domain‌ generalization and propose a‌ novel approach, Domain-Invariant VAE‌‌ for Anomaly Detection (DIVAD), to learn domain-invariant representations‌ for unsupervised anomaly detection.‌ Our evaluation results using‌‌ the Exathlon benchmark show that the two main‌ DIVAD variants significantly outperform‌ the best unsupervised AD‌‌ method in maximum performance, with 20% and 15%‌ improvements in maximum peak‌ F1-scores, respectively. Evaluation using‌‌ the Application Server Dataset further demonstrates the broader‌ applicability of our domain‌ generalization methods. The work‌‌ has been published in VLDB 2025 22.‌

8.4.3 Transactional Stateful Functions‌ on Streaming Dataflows

Participants:‌‌ Georgios Siachamis.

Developing stateful cloud applications, such‌ as low-latency workflows and‌ microservices with strict consistency‌‌ requirements, remains arduous for programmers. The Stateful Functions-as-a-Service‌ (SFaaS) paradigm aims to‌ serve these use cases.‌‌ However, existing approaches provide weak transactional guarantees or‌ perform expensive external state‌ accesses requiring inefficient transactional‌‌ protocols that increase execution latency. In this paper,‌ we present Styx, a‌ novel dataflow-based SFaaS runtime‌‌ that executes serializable transactions consisting of stateful functions‌ that form arbitrary call-graphs‌ with exactly-once guarantees. Styx‌‌ extends a deterministic transactional protocol by contributing: i)‌ a function acknowledgment scheme‌ to determine transaction boundaries‌‌ required in SFaaS workloads, ii) a function-execution caching‌ mechanism, and iii) an‌ early commit-reply mechanism that‌‌ substantially reduces transaction execution latency. Experiments with the‌ YCSB, TPC-C, and Deathstar‌ benchmarks show that Styx‌‌ outperforms state-of-the-art approaches by achieving at least one‌ order of magnitude higher‌ throughput while exhibiting near-linear‌‌ scalability and low latency. This work has been‌ published in SIGMOD 2025‌ 24 and demonstrated in‌‌ VLDB 2025 25.

8.4.4 Dynamic Graph Databases‌ with Out-of-order Updates

Participants:‌ Muhammad Khan, Ioana‌‌ Manolescu.

Dynamic graphs are omnipresent in real-time applications that generate massive amounts of data. We consider dynamic graphs where edges are continuously added to and deleted from a single graph, by multiple update streams. The dynamic graphs are stored in a transactional graph database. Each edge update (addition or deletion) carries a source (stream) time $ST$, assigned at the moment when it was emitted, and an arrival (or transaction) time $WT$, assigned when the graph database receives it. Updates may be received at the database out-of-order (ooo, in short), due to different latencies on the propagation paths between the data sources and the database. We proposed HAL, a novel in-memory dynamic graph database design addressing these challenges. HAL outperforms comparable systems by a factor of up to 73$\times$ in terms of update processing throughput and up to 357$\times$ for analytics, while being the first to support out-of-order updates. We have also extended it with support for node and edge properties, and for historical queries, where queries are evaluated over the graph as it was at a specific moment in the past. This work has been accepted in VLDB 2025 12 and the VLDB 2025 Large-Scale Graph Data Analytics (LSGDA) workshop 16, and demonstrated in SIGMOD 2025 17. The code is available on Gitlab.
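As an illustration of the two timestamps, the following minimal Python sketch keeps, for each edge, its updates sorted by source time, so that an update arriving late is still placed at its correct position in the edge's history, and a historical query returns the edges present at a given source time. This is not HAL's design: the class `OutOfOrderEdgeLog`, its methods, and the integer timestamps are hypothetical and chosen only for the example.

```python
import bisect
from collections import defaultdict
from typing import Dict, List, Set, Tuple

Edge = Tuple[str, str]
# One update: (source_time, arrival_time, is_insert)
Event = Tuple[int, float, bool]


class OutOfOrderEdgeLog:
    """Per-edge update history ordered by source time, not by arrival time."""

    def __init__(self) -> None:
        self._history: Dict[Edge, List[Event]] = defaultdict(list)

    def apply(self, edge: Edge, source_time: int, arrival_time: int, is_insert: bool) -> None:
        # insort keeps the list sorted by source time, whatever the arrival order.
        bisect.insort(self._history[edge], (source_time, arrival_time, is_insert))

    def edges_at(self, source_time: int) -> Set[Edge]:
        """Historical query: edges present in the graph at the given source time."""
        present: Set[Edge] = set()
        for edge, events in self._history.items():
            # Position just after the last event whose source time is <= source_time.
            idx = bisect.bisect_right(events, (source_time, float("inf"), True))
            if idx > 0 and events[idx - 1][2]:
                present.add(edge)
        return present


LOG = OutOfOrderEdgeLog()
# The deletion (source time 5) arrives BEFORE the insertion (source time 1).
LOG.apply(("a", "b"), source_time=5, arrival_time=10, is_insert=False)
LOG.apply(("a", "b"), source_time=1, arrival_time=11, is_insert=True)
LOG.apply(("a", "c"), source_time=3, arrival_time=12, is_insert=True)

AT_2 = LOG.edges_at(2)
AT_6 = LOG.edges_at(6)
```

Although the deletion of edge (a, b) is received first, ordering by source time makes the edge visible at time 2 and absent at time 6. A design that applied updates in arrival order would instead leave the edge present after both updates.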

Participants: Ioana Manolescu, Oana Balalau‌, Yanlei Diao, Ghufran Khan, Maxime‌ Buron, Hritika Kathuria, Georgios Siachamis.‌

9 Bilateral contracts and grants with industry

9.1‌ Bilateral contracts with industry

The collaborative contract with‌ RadioFrance in which Oana-Denisa Balalau and Ioana Manolescu‌ Goujot participate has ended. We have successfully transferred‌ the StatCheck software to our RadioFrance partner.

The collaborative contract with Amundi, led by Oana-Denisa Balalau for the CIFRE project, has ended. The PhD student will defend his PhD in 2026.

9.2‌ Bilateral Grants with Industry

Ioana Manolescu Goujot is‌ involved in the BPI-funded project CodeCommons, in collaboration‌ with the Software Heritage Foundation (SWF). We work‌ to generalize, enlarge, and enable the efficient processing‌ of the world's largest repository of free software.‌ The end of the PhD of Muhammad Khan‌ contributed to the project.

Ioana Manolescu Goujot, Georgios Siachamis and Hritika Kathuria have been involved in the BPI-funded project DXP (Data Exchange Project), with Amadeus, the international tourism services operator. We participate in this project in collaboration with Maxime Buron, former team member, now an Assistant Professor at UCA. Our contribution here is to devise an architecture for decentralized, access-controlled data sharing, allowing tourism service providers and clients to exchange their information via Amadeus' platform.

Participants: Ioana Manolescu Goujot‌, Oana-Denisa Balalau, Oana Goga, Madhulika‌ Mohanty, Garima Gaur, Yanlei Diao,‌ Muhammad Khan, Maxime Buron, Hritika Kathuria‌, Georgios Siachamis.

10 Partnerships and cooperations‌

10.1 International initiatives

10.1.1 Associate Teams in the‌ framework of an Inria International Lab or in‌ the framework of an Inria International Program

MediumAI‌

Title:
Responsible AI for Journalism
Duration:
2024 -‌ 2026
Coordinator:
Davide Ceolin (Davide.Ceolin@cwi.nl)
Partners:
- CWI Amsterdam (Netherlands)
Inria contact:
Oana-Denisa Balalau
Summary:
From recommender‌ systems to large language models, data-driven AI tools‌ have shown different forms of limitations and bias.‌ Bias in AI tools may stem from multiple‌ factors, including bias in the input data the‌ AI tools are trained on, the algorithm and‌ the individuals responsible for designing the AI tools,‌ and bias in the evaluation and interpretation of‌ AI tool outputs. Limitations are due to technical‌ difficulties in achieving specific tasks. Media outlets use‌ different algorithmic aids in their workflow: keyword extraction,‌ entities and relations extractions, event extraction, sentiment analysis,‌ automatic summarization, newsworthy story detection, semi-automatic production of‌ news using text generation models, and search, among‌ others. Given the importance of the media sector‌ for our democracies, shortcomings in the tools they‌ use could have severe consequences. Both Inria and CWI have partnerships with‌ large media groups and‌ can help them address‌‌ bias and limitations in their AI workflows.

10.2‌ International research visitors

10.2.1‌ Visits of international scientists‌‌

Other international visits to the team

Benjamin Ocampo‌

Status:
PhD
Institution of‌ origin:
Human-Centered Data Analytics‌‌ team, University of Amsterdam
Country:
Netherlands
Dates:
October‌ 13-17, 2025
Context of‌ the visit:
Associated team‌‌ MediumAI
Mobility program/type of mobility:
research stay

Davide‌ Ceolin

Status:
researcher
Institution‌ of origin:
Human-Centered Data‌‌ Analytics team, CWI
Country:
Netherlands
Dates:
October 16-17,‌ 2025
Context of the‌ visit:
Associated team MediumAI‌‌
Mobility program/type of mobility:
research stay

Mae Sosto‌

Status:
post-doc
Institution of‌ origin:
Human-Centered Data Analytics‌‌ team, CWI
Country:
Netherlands
Dates:
November 27-December 03,‌ 2025
Context of the‌ visit:
Associated team MediumAI‌‌
Mobility program/type of mobility:
research stay

10.2.2 Visits‌ to international teams

Research‌ stays abroad

Tom Calamai

Visited‌‌ institution:
CWI, Amsterdam
Country:
Netherlands
Context of the‌ visit:
Associated team MediumAI‌
Mobility program/type of mobility:‌‌
research stay

10.3 European initiatives

10.3.1 Horizon Europe‌

ELIAS

ELIAS project on‌ cordis.europa.eu

Title:
European Lighthouse‌‌ of AI for Sustainability
Duration:
From September 1,‌ 2023 to August 31,‌ 2027
Partners:
- ECOLE POLYTECHNIQUE‌‌ (EP), France
- INSTITUT NATIONAL DE RECHERCHE EN INFORMATIQUE‌ ET AUTOMATIQUE (INRIA), France‌
- ROBERT BOSCH KFT, Hungary‌‌
- BITDEFENDER SRL (Bitdefender), Romania
- ETHNIKO KENTRO EREVNAS KAI‌ TECHNOLOGIKIS ANAPTYXIS (CENTRE FOR‌ RESEARCH AND TECHNOLOGY HELLAS‌‌ CERTH), Greece
- THE UNIVERSITY OF MANCHESTER (UNIVERSITY OF‌ MANCHESTER), United Kingdom
- ROBERT‌ BOSCH GMBH (BOSCH), Germany‌‌
- INSTITUT JOZEF STEFAN (JSI), Slovenia
- INSTITUT POLYTECHNIQUE DE‌ PARIS, France
- UNIVERSITAT DE‌ VALENCIA (UVEG), Spain
- PROMETEIA‌‌ SOCIETA PER AZIONI (Prometeia), Italy
- IBM IRELAND LIMITED,‌ Ireland
- KOBENHAVNS UNIVERSITET (UCPH),‌ Denmark
- AALTO KORKEAKOULUSAATIO SR‌‌ (AALTO), Finland
- IDEAS NCBR SP Z O.O., Poland‌
- UMEA UNIVERSITET, Sweden
- INSTITUT‌ MINES-TELECOM, France
- FONDAZIONE ISTITUTO‌‌ ITALIANO DI TECNOLOGIA (IIT), Italy
- FONDATION DE L'INSTITUT‌ DE RECHERCHE IDIAP (IDIAP),‌ Switzerland
- UNIVERSITATEA NATIONALA DE STIINTA SI TEHNOLOGIE POLITEHNICA BUCURESTI (NATIONAL UNIVERSITY OF SCIENCE AND TECHNOLOGY POLITEHNICA BUCHAREST), Romania
- EIDGENOESSISCHE TECHNISCHE HOCHSCHULE ZUERICH‌‌ (ETH Zürich), Switzerland
- CESKE VYSOKE UCENI TECHNICKE V‌ PRAZE (CVUT), Czechia
- FUNDACION‌ DE LA COMUNITAT VALENCIANA‌‌ UNIDAD ELLIS ALICANTE, Spain
- FONDAZIONE BRUNO KESSLER (FBK),‌ Italy
- POLITECNICO DI MILANO‌ (POLIMI), Italy
- LA COMMUNAUTE‌‌ D UNIVERSITES ET ETABLISSEMENTS DE TOULOUSE (LA COMMUNAUTE‌ D UNIVERSITES ET ETABLISSEMENTS‌ DE TOULOUSE), France
- UNIVERSITA‌‌ DEGLI STUDI DI TRENTO (UNITN), Italy
- UNIVERSITA DEGLI‌ STUDI DI MILANO (UMIL),‌ Italy
- HASSO-PLATTNER-INSTITUT FUR DIGITAL‌‌ ENGINEERING GGMBH (HPI), Germany
- ENGINEERING - INGEGNERIA INFORMATICA‌ SPA (ENG), Italy
- EBERHARD‌ KARLS UNIVERSITAET TUEBINGEN (UT),‌‌ Germany
- UNIVERSITA DEGLI STUDI DI GENOVA (UNIGE), Italy‌
- MAX-PLANCK-GESELLSCHAFT ZUR FORDERUNG DER‌ WISSENSCHAFTEN EV (MPG), Germany‌‌
- UNIVERSITA DEGLI STUDI DI MODENA E REGGIO EMILIA‌ (UNIMORE), Italy
- UNIVERSITEIT VAN‌ AMSTERDAM (UvA), Netherlands
Inria contact: Ioana Manolescu
Coordinator:
Summary:

We live in a crucial historical moment, with tremendous challenges ahead, from climate change to the energy crisis. ELIAS emerges from the belief that AI will be a key discipline to help us tackle these challenges. At the same time, the development of AI entails deep ethical and societal concerns that need to be addressed. As for fundamental research, ELIAS will address key scientific questions about how AI can reduce computational costs, serve to model the effects of policy decisions on society, and impact individuals. ELIAS will strive for a deep integration of the fundamental research that takes place in academia and the more applications-focused research from industry.

ELIAS builds on and expands the highly successful and internationally recognized European Laboratory for Learning and Intelligent Systems (ELLIS). ELIAS will further develop the excellence criteria and the pillars in ELLIS and implement actions that will support AI researchers and young talents at different stages of their careers. Furthermore, ELIAS will develop a Sciencentrepreneurship track, with the purpose of attracting and empowering talents at the interface of scientific innovation and business and establish original AI solutions that move towards a sustainable long-term future for our planet, contribute to a cohesive society, and respect individual rights.

The outcome of ELIAS will‌ be to establish Europe as a leader in‌ AI research in which impact on the environment,‌ society and the individual are integral considerations during‌ development. We will measure the success of this‌ endeavor in terms of key indicators, including the‌ number of new cross-institutional collaborations, the number of‌ cross-disciplinary collaborations, the number of industry-academic partnerships, publications‌ in top conferences and journals, patents, and the‌ number of projects that have resulted in deployed‌ technologies.

10.3.2 H2020 projects

Ioana Manolescu Goujot is‌ the local PI for the Inria partner in‌ the project "ELIAS - European Lighthouse of AI‌ for Sustainability" (2,800,000 euros). Madhulika Mohanty and Garima‌ Gaur have also been strongly involved.

Yanlei Diao‌ has been awarded the ERC Grant - ERC‌ Proof of Concept - on ExplainableAD: Explainable Anomaly‌ Detection for Safeguarding and Enhancing Modern Data Industry.‌

10.4 National initiatives

10.4.1 ANR

Oana Goga is‌ the local PI for LIX partner - ANR‌ PRC 2022 - 2026 “FeedingBias: A multi-platform mixed-methods‌ approach to news exposure on social media” (our‌ part: 128,000 euros)
Oana Goga is the local‌ PI for LIX partner - ANR PRCE 2021‌ - 2025 “PROPEOS: Privacy-oriented Personalization of Online Services”‌ (our part: 202,720 euros)
The project "TopOL (Top of the Lake): discovery and exploitation of heterogeneous data lakes through graph models", coordinated by Ioana Manolescu Goujot, has been funded by the ANR. The project is a collaboration with U. Paris Saclay, U. Paris Dauphine, U. Blois and U. Tours; the International Consortium of Investigative Journalists (ICIJ) is a non-funded partner. Madhulika Mohanty also participates and is a Work Package co-leader.

10.5‌ Regional initiatives

Ioana Manolescu Goujot has been awarded‌ a Fellowship of the Hi!Paris AI Cluster "PREDIAL:‌ AI Data Dialogs for the Press".

Yanlei Diao‌ has been awarded an AAP Premat IP Paris‌ 2025.

11 Dissemination

11.1 Promoting scientific activities

Chair‌ of conference program committees

Ioana Manolescu Goujot was‌ the Demonstration chair at EDBT 2025.

Madhulika Mohanty was the demonstration chair at the French database conference, BDA 2025.

Member of the conference program committees

The‌ team members have been‌ part of the following‌‌ program committees:

Ioana Manolescu Goujot: ACL Rolling Review 2025, IEEE ICDE 2025, ACM PACMMOD (formerly SIGMOD) 2025, BDA 2025
Oana-Denisa Balalau: ACL Rolling Review February 2025
Madhulika Mohanty: VLDB 2025, ICDE 2025, EDBT 2025 (Demo), ICDE 2025 (Demo), VLDB 2025 (Demo), CODS 2025, CMLS Workshop in ER 2025
Garima Gaur: CIKM 2025, BDA 2025, CMLS Workshop in ER 2025
Georgios Siachamis: ICDE 2025, EDBT 2025 (Demo), DEBS 2025

11.1.1 Journal

Member‌ of the editorial boards‌‌

Ioana Manolescu Goujot served as an Associate Editor‌ for PVLDB 2025.

Reviewer‌ - reviewing activities

Madhulika‌‌ Mohanty reviewed for Transactions on Graph Data and‌ Knowledge (TGDK) and Georgios‌ Siachamis reviewed for the‌‌ VLDB Journal (VLDBJ).

11.1.2 Invited talks

Ioana Manolescu Goujot delivered a keynote at the AFIA (French AI Research Association) workshop “Perspectives et Défis de l'IA” on « Désinformation, Démocratie et IA », June 10, 2025 (link).

Oana-Denisa Balalau delivered‌ a talk at ESSEC‌ in the workshop Comprendre‌‌ et Changer le Monde (CCM), titled “Improving the‌ quality of public debate‌ with AI”.

Madhulika Mohanty‌‌ delivered the following talks:

“Intelligence Artificielle: un outil‌ au service de l'investigation”‌ at VIGINUM in May‌‌ 2025.
“Effective Exploration of Graph-Structured Data” at LHC‌ and IDIA Days 2025‌ in June 2025.

Tom‌‌ Calamai delivered a workshop on “les applications de‌ l'IA pour l'investissement responsable”‌ organised by the FIR‌‌ (forum pour l'investissement responsable) (link)

11.1.3‌ Leadership within the scientific‌ community

Ioana Manolescu Goujot‌‌ has been the president of the informal French‌ Data Management Association (BDA).‌

11.1.4 Research administration

Ioana‌‌ Manolescu Goujot represents Inria in the Comité Operationnel‌ of Hi!Paris, an AI‌ Pole of Excellency comprising‌‌ IP Paris and HEC. She is also an‌ elected member of IP‌ Paris' Comité Académique and‌‌ serves on its Scientific Committee.

11.2 Teaching -‌ Supervision - Juries -‌ Educational and pedagogical outreach‌‌

11.2.1 Teaching

Ioana Manolescu Goujot is a part-time‌ professor (50%) at Ecole‌ Polytechnique. She taught:

Courses,‌‌ labs and TDs in CSC_51053_EP (Database Management Systems);‌
She is in charge‌ of the M1 Internship‌‌ program in Artificial Intelligence and Data Science (CSC_52992_EP).‌
She is also in‌ charge of the Artificial‌‌ Intelligence M1 program at Ecole Polytechnique

Madhulika Mohanty‌ has a 25% Chargée‌ d'Enseignement contract at Ecole‌‌ Polytechnique for 10 months. She taught:

Labs and‌ TDs in CSC_51053_EP (Database‌ Management Systems)
Labs and‌‌ TDs in CSC_52083_EP (Systems for Big Data)
She‌ also taught 3h of‌ CM and 3h of‌‌ TP for ECE_5DA04_TP (Big Graph Data Management) at‌ Télécom for DATAAI Masters.‌

Oana-Denisa Balalau is a‌‌ part-time (33%) assistant professor at Ecole Polytechnique, where‌ she teaches “Mining, learning‌ and reasoning on Web‌‌ Graphs”, L3

Przemyslaw Dominikowski carries out a complementary‌ teaching assignment (64h) at‌ Ecole Polytechnique. He teaches‌‌ the labs in CSC_2F001_EP‌ (Object Oriented Programming in C++).

Garima Gaur carried out the following teaching duties:

Course, Labs and TDs‌ in CSC_52640_EP (Database Management Systems) offered by DMAP,‌ Ecole Polytechnique
Labs and TDs in CSC_51053_EP (Database‌ Management Systems)
3h of CM and 3h of‌ TP for ECE_5D04_TP (Big Graph Data Management) at‌ Télécom for DATAAI Masters.

Hritika Kathuria carries out‌ a complementary teaching assignment (64h) at Ecole Polytechnique‌ and teaches 2 Labs in CSE_102.

Tom Calamai‌ has a 30h teaching assistant (Vacataire) contract at‌ Télécom Paris and Ecole Polytechnique. He teaches:

INF473G‌
Machine Learning for Text Mining
Machine learning avancé‌
Database
Language Modeling

Georgios Siachamis carried out 3h‌ of CM and 3h of TP for ECE_5D04_TP‌ (Big Graph Data Management) at Télécom for DATAAI‌ Masters.

Yanlei Diao holds a part-time (50%) full Professor position at Ecole Polytechnique. She teaches Systems for Big Data (CSC_52083_EP), M1, Ecole Polytechnique.

Guillaume Lachaud has a 58h‌ teaching assistant position at Ecole Polytechnique. He teaches:‌

CSC_52087_EP - Advanced Deep Learning
CSC_41011_EP - Les bases‌ de la programmation et de l'algorithmique
CSC_43M02_EP (for‌ one day) - Modal d'informatique - Exploration et‌ apprentissage sur les graphes du Web

11.2.2 Supervision‌

The team supervised the following PhDs:

Przemysław Dominikowski,‌ Sep 2025 - Dec 2025, advised by Ioana‌ Manolescu Goujot and Madhulika Mohanty
Kun Zhang, Jan‌ 2025-April 2025, advised by Ioana Manolescu Goujot and‌ Oana-Denisa Balalau
Tom Calamai, Jan 2025-Dec 2025, advised‌ by Fabian Suchanek and Oana-Denisa Balalau
Hritika Kathuria,‌ Jan 2025-Dec 2025, advised by Ioana Manolescu Goujot‌ and Maxime Buron
Ines Abdelaziz, Dec 2025, advised‌ by Oana Goga
Nardjes Amieur, Jan 2025-Dec 2025,‌ advised by Oana Goga
Abir Benzaamia, Jan 2025-Dec‌ 2025, advised by Oana Goga
Asmaa El Fraihi,‌ Jan 2025-Dec 2025, advised by Oana Goga
Gabriel‌ Ben Zenou, Jan 2025-Dec 2025, advised by Oana‌ Goga
Gabriel Lozano, Sept 2025-Dec 2025, advised by‌ Yanlei Diao and Guillaume Lachaud
Nazim Mezhoudi, Jan‌ 2025-Dec 2025, advised by Yanlei Diao and Mariam‌ Barry (BNP Paribas)

The team supervised the following‌ postdocs:

Chadi Helwe, Jan 2025-March 2025, advised by‌ Oana-Denisa Balalau and Davide Ceolin
Guillaume Lachaud, Jan‌ 2025-Dec 2025, advised by Yanlei Diao

The team‌ supervised the following engineers:

Simon Ebel and Théo‌ Galizzi (January to June 2025), Aurélien Peden (March‌ to August 2025): Oana-Denisa Balalau and Ioana Manolescu‌ Goujot supervised them on their collaboration project with‌ RadioFrance.
George Siachamis: supervised by Ioana Manolescu Goujot‌ and Madhulika Mohanty on efficient and expressive graph‌ data management.
Ines Abdelaziz (January to November 2025): supervised by Oana Goga.

The team supervised‌ the following interns:

Pablo Bertaud-Velten, M1 IP Paris, advised by Ioana Manolescu Goujot, Madhulika Mohanty, Garima Gaur and Georgios Siachamis
Przemyslaw Dominikowski, M2 UP Saclay, advised by Ioana Manolescu Goujot, Madhulika Mohanty, Garima Gaur and Georgios Siachamis
Nikola Dobriçic, X Bachelor 3rd year, advised by Ioana Manolescu Goujot, Madhulika Mohanty and Georgios Siachamis
Joanne Jegou, X Bachelor 3rd year, co-advised by Ioana Manolescu Goujot and Michael Thy (APHP)
Paul Kronlund-Drouault, ENS Lyon Bachelor 2nd year, advised by Ioana Manolescu Goujot
Maria Mellado, M2 University of Chile, advised by Ioana Manolescu Goujot, Madhulika Mohanty and Garima Gaur
Saba Shashsavari, M1 IP Paris, advised by Ioana Manolescu Goujot, Madhulika Mohanty, Garima Gaur and Georgios Siachamis
Vlada Voronina,‌‌ M1, advised by Oana-Denisa Balalau and Marine Le‌ Morvan
Rémi Guillou, X‌ Bachelor 3rd Year, advised‌‌ by Yanlei Diao
Yanis Zaamoun, X Bachelor 3rd‌ year, advised by Yanlei‌ Diao

The team supervised‌‌ the following part-time projects:

PSC "Analyse du discours‌ médiatique autour du changement‌ climatique", advised by Oana-Denisa‌‌ Balalau and Etienne Ollion
Léo Nivelle (X3A), "Automatic‌ verbalisation of statistics", advised‌ by Ioana Manolescu Goujot‌‌
Yiheng Chen, Antoine Delacour and Elliot Thorel (X3A): "Natural language querying of large heterogeneous datasets", advised by Ioana Manolescu Goujot, Madhulika Mohanty, Garima Gaur and Georgios Siachamis
Cédric Trinh and Tom Léon (X3A): "Building a Knowledge Graph for Fact-checks", advised by Madhulika Mohanty and Garima Gaur
Moritz Sommer (X and RWTH Exchange Program): "Identification of Core Properties for Semantic Concepts in Universal Datasets", advised by Ioana Manolescu Goujot, Madhulika Mohanty and Garima Gaur
Maximilien Rambaud, Nicolas Gromitsaris, Anthony Chassagne (X3A): "Anomaly detection and explanation in dynamic graphs, with applications in finance", advised by Yanlei Diao and Guillaume Lachaud
Gabriel Cheval, Armand Vabre (X3A): "Detecting data drift in graphs for model retraining", advised by Yanlei Diao and Guillaume Lachaud
Loric Roger, Joseph‌ de Roffignac, Sylvain Dehayem‌‌ (X3A): "Anomaly detection in dynamic graphs", advised by‌ Yanlei Diao and Guillaume‌ Lachaud
Berthé Zié, Goly‌‌ Kodia (X3A): "Explainable dynamic graph neural networks for‌ anomaly detection", advised by‌ Yanlei Diao and Guillaume‌‌ Lachaud

11.2.3 Juries

Oana-Denisa Balalau has served as‌ a:

member of the recruitment committee for an assistant professor position at Télécom Paris
member of the PhD defense committees of Jonathan Colin (Université Paris Saclay) and William Soto (Université de Lorraine)

Ioana Manolescu Goujot has served on the following juries:

Member of a Professor hiring committee at Université de Paris Dauphine (June 2025)
Reported on the PhD thesis of Yifan Wang, Université de Lille, defended in November 2025

11.3 Popularization‌

11.3.1 Specific official responsibilities‌‌ in science outreach structures

Oana-Denisa Balalau is a member of Inria Saclay's Scientific Commission. She also facilitated the foresight seminar on LLMs&Science at the "Data and Knowledge" Inria seminar in March 2025.

Ioana Manolescu Goujot is an elected member of Inria's Comité d'Evaluation.

11.3.2‌ Participation in Live events‌‌

Ioana Manolescu Goujot made several appearances in national media:

Participated in the ARTE "28 minutes" show on the impact of AI on society, December 24, 2025.
Interviewed by Michaël‌ Szadkowsky (Le Monde) for‌‌ the article "2025, l'année où la vidéo par‌ IA a envahi les‌ réseaux sociaux", December 22,‌‌ 2025.
Interviewed by Désirée de Lamarzelle (Forbes Magazine)‌ for the article "Future‌ of work: is AI‌‌ a friend or a‌ foe?", November 13, 2025.
Interviewed by Mélinée Le Priol (La Croix) for the article "Faut-il avoir peur de la 'superintelligence artificielle'?", October 30, 2025
Interviewed by Marina Alcaraz (Les Echos) on the‌ frequency of fake news in chatbot responses, September‌ 2025
Interviewed by Marina Alcaraz (Les Echos) on‌ disinformation sometimes present in Mistral outputs, July 2025‌
Interviewed by Alexandre Capron (TF1) on fake AI videos, June 6, 2025.
Guest on the radio show "Je pense donc j'agis" ("Où vont nos données et comment les protéger?"), hosted by Melchior Gormand, on RCF, April 3, 2025
Participated in a press conference organized as part of a "Stand Up for Science" day on April 3, 2025 (dépêche AEF, video recording)
Member of a panel about ethical and regulatory‌ bounds on research in "Journée Sciences et Médias"‌ (French Association of Science Journalists), February 10, 2025.‌
Interviewed by Chloé Woitier for the article "C'est‌ une nouvelle pollution numérique : le Slop, ce‌ raz-de-marée de contenus IA qui menace internet", Le‌ Figaro, February 2, 2025.
Authored an invited opinion‌ piece in l'Humanité "Les réseaux sociaux nuisent-ils à‌ la démocratie?" on January 27, 2025.

11.3.3 Other relevant science outreach activities

Przemyslaw Dominikowski conducted an outreach session (1.5h) for high school students (stage de seconde), presenting the CEDAR team's research, in particular data lake indexing.

Ioana Manolescu Goujot gave a‌ presentation for CPES (1st year higher education) students‌ at Lycée International de Palaiseau Paris-Saclay.

12 Scientific‌ production

12.1 Major publications

1 inproceedings Rana Alotaibi, Damian Bursztyn, Alin Deutsch, Ioana Manolescu and Stamatis Zampetakis. Towards Scalable Hybrid Stores: Constraint-Based Rewriting to the Rescue. SIGMOD 2019 - ACM SIGMOD International Conference on Management of Data, Amsterdam, Netherlands, June 2019.
2 inproceedings Oana Balalau, Simon Ebel, Théo Galizzi, Ioana Manolescu, Quentin Massonnat, Antoine Deiana, Emilie Gautreau, Antoine Krempf, Thomas Pontillon, Gérald Roux and Joanna Yakin. Fact-checking Multidimensional Statistic Claims in French. TTO 2022 - Truth and Trust Online, Boston [Hybrid Event], United States, October 2022.
3 inproceedings Oana Balalau and Roxana Horincar. From the Stage to the Audience: Propaganda on Reddit. EACL 2021 - 16th Conference of the European Chapter of the Association for Computational Linguistics, Online, France, April 2021.
4 inproceedings Maxime Buron, François Goasdoué, Ioana Manolescu and Marie-Laure Mugnier. Reformulation-based query answering for RDF graphs with RDFS ontologies. ESWC 2019 - European Semantic Web Conference, Portoroz, Slovenia, March 2019.
5 inproceedings Damian Bursztyn, François Goasdoué and Ioana Manolescu. Teaching an RDBMS about ontological constraints. Very Large Data Bases, New Delhi, India, September 2016.
6 inproceedings Sylvie Cazalens, Philippe Lamarre, Julien Leblay, Ioana Manolescu and Xavier Tannier. A Content Management Perspective on Fact-Checking. The Web Conference 2018 - alternate paper tracks "Journalism, Misinformation and Fact Checking", Lyon, France, April 2018, 565-574.
7 article Sejla Cebiric, François Goasdoué, Haridimos Kondylakis, Dimitris Kotzinos, Ioana Manolescu, Georgia Troullinou and Mussab Zneika. Summarizing Semantic Graphs: A Survey. The VLDB Journal, 2018.
8 inproceedings Yanlei Diao, Pawel Guzewicz, Ioana Manolescu and Mirjana Mazuran. Spade: A Modular Framework for Analytical Exploration of RDF Graphs. VLDB 2019 - 45th International Conference on Very Large Data Bases, Proceedings of the VLDB Endowment, Vol. 12, No. 12, Los Angeles, United States, August 2019.
9 article Enhui Huang, Liping Peng, Luciano Di Palma, Ahmed Abdelkafi, Anna Liu and Yanlei Diao. Optimization for active learning-based interactive database exploration. Proceedings of the VLDB Endowment (PVLDB), 12(1), September 2018, 71-84.
10 inproceedings Abhishek Roy, Yanlei Diao, Uday Evani, Avinash Abhyankar, Clinton Howarth, Rémi Le Priol and Toby Bloom. Massively Parallel Processing of Whole Genome Sequence Data: An In-Depth Performance Study. SIGMOD '17 - Proceedings of the 2017 ACM International Conference on Management of Data, Chicago, Illinois, United States, ACM, May 2017, 187-202.
11 inproceedings Saumya Yashmohini Sahai, Oana Balalau and Roxana Horincar. Breaking Down the Invisible Wall of Informal Fallacies in Online Discussions. ACL-IJCNLP 2021 - Joint Conference of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing, Online, France, August 2021.

12.2 Publications of‌ the year

International journals‌

12 article Angelos Christos Anadiotis, Muhammad Ghufran Khan and Ioana Manolescu. Dynamic Graph Databases with Out-of-Order Updates. Proceedings of the VLDB Endowment (PVLDB), 17(13), February 2025, 4799-4812.
13 article Nelly Barret, Sourav S. Bhowmick, Angela Bonifati, Barbara Catania, Stratos Idreos, Ekaterini Ioannou, Madhulika Mohanty, Sana Sellami, Roee Shraga, Utku Sirin, Juno Steegmans, Pinar Tozun, Soror Sahri and Genoveva Vargas-Solar. Diversity, Equity and Inclusion Activities in Database Conferences: A 2024 Report. SIGMOD Record, 54(3), 2025, 40-43.
14 article Nelly Barret, Antoine Gauquier, Jia-Jean Law and Ioana Manolescu. Finding meaningful paths in heterogeneous graphs with PathWays. Information Systems, 127, January 2025, 102463.

International peer-reviewed conferences

15 inproceedings Nardjes Amieur, Salim Chouaki, Oana Goga and Beatrice Roussillon. A Comparative Study of News Exposure and Consumption On and Off Facebook. Proc. ACM Hum.-Comput. Interact., 9(7), CSCW 2025 - 28th ACM SIGCHI Conference on Computer-Supported Cooperative Work & Social Computing, Bergen, Norway, November 2025, CSCW359.
16 inproceedings Angelos-Christos Anadiotis, Muhammad Ghufran Khan and Ioana Manolescu. Growing Up HAL: Historic and Property Graph Queries. LSGDA 2025 - 4th International Workshop on Large Scale Graph Data Analytics (in conjunction with VLDB 2025), London, United Kingdom, September 2025.
17 inproceedings Angelos C Anadiotis, Muhammad Ghufran Khan and Ioana Manolescu. Catching up with Disorder: Dynamic Graphs with Out-of-Order Updates. In Companion of the 2025 International Conference on Management of Data (SIGMOD-Companion ’25), ACM SIGMOD/PODS 2025 - International Conference on Management of Data, Berlin, Germany, June 2025, 15-18.
18 inproceedings Tom Calamai, Oana Balalau and Fabian M Suchanek. Benchmarking the Benchmarks: Reproducing Climate-Related NLP Tasks. ACL 2025 - The 63rd Annual Meeting of the Association for Computational Linguistics, Vienna, Austria, July 2025.
19 inproceedings Antoine Gauquier, Simon Ebel, Helena Galhardas, Théo Galizzi, Ioana Manolescu, Aurélien Peden and Pierre Senellart. Efficient and Scalable Search for Statistics. ICDE 2026 - 42nd IEEE International Conference on Data Engineering, Montréal, Canada, May 2026.
20 inproceedings Garima Gaur, Oana Balalau, Ioana Manolescu and Prajna Devi Upadhyay. The Search for Conflicts of Interest: Open Information Extraction in Scientific Publications. EMNLP 2025 - Conference on Empirical Methods in Natural Language Processing, Suzhou, China, November 2025.
21 inproceedings Chadi Helwe, Oana Balalau and Davide Ceolin. Navigating the Political Compass: Evaluating Multilingual LLMs across Languages and Nationalities. ACL 2025 - 63rd Annual Meeting of the Association for Computational Linguistics, Vienna, Austria, Association for Computational Linguistics, July 2025, 17179-17204.
22 inproceedings Vincent Jacob and Yanlei Diao. Unsupervised Anomaly Detection in Multivariate Time Series across Heterogeneous Domains. Proceedings of the VLDB Endowment, 18(6), VLDB 2025 - 51st International Conference on Very Large Databases, London, United Kingdom, ACM Digital Library, February 2025, 1691-1704.
23 inproceedings Chenghao Lyu, Guillaume Lachaud, Gabriel Lozano and Yanlei Diao. Graph Transformers for Query Plan Representation: Potentials and Challenges. VLDB 2026 - 52nd International Conference on Very Large Databases, Boston (MA), United States, August 2026.
24 inproceedings Kyriakos Psarakis, George Christodoulou, Georgios Siachamis, Marios Fragkoulis and Asterios Katsifodimos. Styx: Transactional Stateful Functions on Streaming Dataflows. International Conference on Management of Data (SIGMOD), 3(3), Berlin, Germany, June 2025, 1-28.
25 inproceedings Kyriakos Psarakis, Oto Mraz, George Christodoulou, Georgios Siachamis, Marios Fragkoulis and Asterios Katsifodimos. Styx in Action: Transactional Cloud Applications Made Easy. International Conference on Very Large Data Bases (VLDB), 18(12), London, United Kingdom, September 2025, 5275-5278.
26 inproceedings Atte Torri, Przemysław Dominikowski, Brice Pointal, Oguz Kaya, Laércio Lima Pilla and Olivier Coulaud. Near-Optimal Contraction Strategies for the Scalar Product in the Tensor-Train Format. Euro-Par 2025: Parallel Processing, Euro-Par 2025 - 31st International European Conference on Parallel and Distributed Computing, Lecture Notes in Computer Science 15902, Dresden, Germany, Springer Nature Switzerland, August 2025, 63-77.
27 inproceedings Kun Zhang, Oana Balalau and Ioana Manolescu. Structured Discourse Representation for Factual Consistency Verification. ACL 2025 - 63rd Annual Meeting of the Association for Computational Linguistics, Vienna, Austria, July 2025.

Conferences‌ without proceedings

28 inproceedings Maxime Buron, Hritika Kathuria, Ioana Manolescu and George Siachamis. RDF Query Answering in the Presence of Access Restrictions. LNCS Series book, CoopIS 2025 - 31st International Conference on Cooperative Information Systems, Marbella, Spain, October 2025.

Reports & preprints

29 report Maxime Buron, Hritika Kathuria, Ioana Manolescu and George Siachamis. RDF Query Answering in the Presence of Access Restrictions. INRIA; Ecole polytechnique, September 2025.
30 report Paul Kronlund-Drouault. Towards better identification of fact tables in statistical spreadsheets. Inria Saclay; LIX -- Ecole polytechnique; ENS Lyon, July 2025.
31 misc Marijan Soric, Cécile Gracianne, Ioana Manolescu and Pierre Senellart. Benchmarking Table Extraction from Heterogeneous Scientific Extraction Documents. November 2025.
32 report Lex Zard, Oana Goga, Asmaa El fraihi and Nataliia Bielova. Feedback to the European Data Protection Board's Guidelines 3/2025 on the interplay between the DSA and the GDPR (Version 1.1) - Advertisement. Inria & Université Cote d'Azur, Sophia Antipolis, France, October 2025, 1-20.

Other scientific publications

33 inproceedings Przemysław Dominikowski, Atte Torri, Brice Pointal, Oguz Kaya, Laercio Lima Pilla and Olivier Coulaud. Exploring Near-Optimal Contraction Strategies for the Scalar Product in the Tensor-Train Format. IPDPS 2025 - 39th IEEE International Parallel & Distributed Processing Symposium, Milan, Italy, June 2025, 1274-1276.
34 thesis Marijan Soric. Understanding and Extracting Table Information from BRGM Documents. Ecole centrale de Lyon, February 2025.

Scientific popularization

35 inbook Ioana Manolescu and Patrick Valduriez. De nouvelles architectures pour les Big Data. Le calcul à découvert, CNRS Editions, January 2025.

12.3 Cited publications‌

36 inproceedings Angelos Christos Anadiotis, Ioana Manolescu and Madhulika Mohanty. Integrating Connection Search in Graph Queries. ICDE 2023 - 39th IEEE International Conference on Data Engineering, Anaheim (CA), United States, April 2023.

Benjamin Ocampo

Davide Ceolin

Mae Sosto

10.2.2 Visits to international teams

Research stays abroad

10.3 European initiatives

10.3.1 Horizon Europe

10.3.2 H2020 projects

10.4.1 ANR

10.5 Regional initiatives

11.1 Promoting scientific activities

Chair of conference program committees

Member of the conference program committees