VALDA

2024 Activity report, Project-Team VALDA

RNSR: 201622223R

Research center Inria Paris Centre
In partnership with: Ecole normale supérieure de Paris, CNRS
Team name: Value from Data
In collaboration with: Département d'Informatique de l'Ecole Normale Supérieure
Domain: Perception, Cognition and Interaction
Theme: Data and Knowledge Representation and Processing

Keywords

Computer Science and Digital Science

A3.1. Data
A3.1.1. Modeling, representation
A3.1.2. Data management, querying and storage
A3.1.3. Distributed data
A3.1.4. Uncertain data
A3.1.5. Control access, privacy
A3.1.6. Query optimization
A3.1.7. Open data
A3.1.8. Big data (production, storage, transfer)
A3.1.9. Database
A3.1.10. Heterogeneous data
A3.1.11. Structured data
A3.2. Knowledge
A3.2.1. Knowledge bases
A3.2.2. Knowledge extraction, cleaning
A3.2.3. Inference
A3.2.4. Semantic Web
A3.2.5. Ontologies
A3.2.6. Linked data
A3.3. Data and knowledge analysis
A3.3.1. On-line analytical processing
A3.3.2. Data mining
A3.3.3. Big data analysis
A3.4.3. Reinforcement learning
A3.4.5. Bayesian methods
A3.5.1. Analysis of large graphs
A4.7. Access control
A7.2. Logic in Computer Science
A7.3. Calculability and computability
A9.1. Knowledge
A9.8. Reasoning

1 Team members, visitors, external collaborators

Research Scientists

Serge Abiteboul [Inria, Emeritus]
Paul Boniol [Inria, ISFP]
Camille Bourgaux [CNRS, Researcher]
Luc Segoufin [Inria, Senior Researcher, HDR]
Michaël Thomazo [Inria, Researcher, HDR]

Faculty Member

Pierre Senellart [Team leader, ENS Paris, Professor, HDR]

Post-Doctoral Fellow

Sven Dziadek [Inria, Post-Doctoral Fellow, until Aug 2024]

PhD Students

Felix Chavelli [Inria, from Oct 2024]
Anatole Dahan [Université de Paris]
Antoine Gauquier [ENS]
Robin Jean [CNRS]
Lucas Larroque [ENS]
Shrey Mishra [ENS, until Jul 2024]
Aryak Sen [CNRS, from Feb 2024]
Emmanouil Sylligardos [ENS, from Feb 2024]

Technical Staff

N. Smith [ENS, Engineer, until Feb 2024]

Interns and Apprentices

Leo Boullot [PSL, Intern, until Jun 2024]
Atefe Khodadaditaghanaki [Université Paris-Cité & CNRS, Intern, from May 2024 until Sep 2024]
Haoming Lin [EPFL & Inria, Intern, from Mar 2024 until Sep 2024]
Paul Sevestre [PSL, Intern, until Jun 2024]
Marijan Soric [Centrale Lyon & Inria, Intern, from Sep 2024]
Julie Zhan [PSL, Intern, until Jun 2024]

Administrative Assistant

Meriem Guemair [Inria]

Visiting Scientists

Thomas Schwentick [TU Dortmund, until Jan 2024]
Victor Vianu [UC San Diego, from Jun 2024]

2 Overall objectives

Valda's focus is on both foundational and systems aspects of complex data management, especially human-centric data. The data we are interested in is typically heterogeneous, massively distributed, rapidly evolving, intensional, and often subjective, possibly erroneous, imprecise, incomplete. In this setting, Valda is in particular concerned with the optimization of complex resources such as computer time and space, communication, monetary, and privacy budgets. The goal is to extract value from data, beyond simple query answering.

Data management 30, 32 is now an old, well-established field, for which many scientific results and techniques have been accumulated since the sixties. Originally, most work dealt with static, homogeneous, and precise data. Later work was devoted to heterogeneous data 29, 31, possibly distributed 36 but at a small scale.

However, these classical techniques are poorly adapted to handle the new challenges of data management. Consider human-centric data, which is either produced by humans, e.g., emails, chats, recommendations, or produced by systems when dealing with humans, e.g., geolocation, business transactions, results of data analysis. When dealing with such data, and to accomplish any task to extract value from such data, we rapidly encounter the following facets:

Heterogeneity: data may come in many different structures such as unstructured text, graphs, data streams, complex aggregates, etc., using many different schemas or ontologies.
Massive distribution: data may come from a large number of autonomous sources distributed over the web, with complex access patterns.
Rapid evolution: many sources may be producing data in real time, even if little of it is perhaps relevant to the specific application. Typically, recent data is of particular interest and changes have to be monitored.
Intensionality: in a classical database, all the data is available. In modern applications, data is increasingly available only intensionally, possibly at some cost, with the added difficulty of discovering, under uncertainty, which source can contribute towards a particular goal.
Confidentiality and security: some personal data is critical and needs to remain confidential. Applications manipulating personal data must take this into account and must be secure against linking.
Uncertainty: modern data, and in particular human-centric data, typically includes errors, contradictions, imprecision, incompleteness, which complicates reasoning. Furthermore, the subjective nature of the data, with opinions, sentiments, or biases, also makes reasoning harder since one has, for instance, to consider different agents with distinct, possibly contradicting knowledge.

These problems have already been studied individually and have led to techniques such as query rewriting 34 or distributed query optimization 35.

Among all these aspects, intensionality is perhaps the one that has been studied the least, so let us expand a bit on this. Consider a user's query, taken in a very broad sense: it may be a classical database query, some information retrieval search, a clustering or classification task, or some more advanced knowledge extraction request. Because of the intensionality of data, solving such a query is typically a dynamic task: each time new data is obtained, the partial knowledge a system has of the world is revised, and query plans need to be updated, as in adaptive query processing 33 or aggregated search 39. The system then needs to decide, based on this partial knowledge, on the best next access to perform. This is reminiscent of the central problem of reinforcement learning 38 (train an agent to accomplish a task in a partially known world based on rewards obtained) and of active learning 37 (decide which action to perform next in order to optimize a learning strategy), and we intend to explore this connection further.

Uncertainty of the data interacts with its intensionality: efforts are required to obtain more precise, more complete, sounder results, which yields a trade-off between processing cost and data quality.

Other aspects, such as heterogeneity and massive distribution, are of major importance as well. A standard data management task, such as query answering, information retrieval, or clustering, may become much more challenging when taking into account the fact that data is not available in a central location, or in a common format. We aim to take these aspects into account, to be able to apply our research to real-world applications.

3 Research program

3.1 Research axis 1: Foundations of data management

This axis covers the theory of data management, broadly taken, and in particular the fields of database theory, knowledge representation, and some symbolic aspects of artificial intelligence (especially, reasoning on data).

The goal is to define solid and high-level foundations of data management tasks (query evaluation and optimization of various forms of queries, counting, reasoning, verification of data-centric processes, etc.) through formal tools, such as logics (esp., finite model theory), automata theory, complexity theory; we occasionally have contributions in these areas as well, though most of our work is motivated by data applications. We are especially interested in clean specifications of key aspects of database systems and data management tasks (e.g., confidentiality, access control, robustness), whether they are properties of the data or appropriate (query) languages for these tasks. We study the expressive power of languages, the computability and complexity of deciding or computing results, as well as the design of appropriate structures (e.g., indexes) to optimize these tasks.

3.2 Research axis 2: Uncertainty, provenance, and explainability in data management

This research axis deals with the modeling and efficient management of data that come with some uncertainty (probabilistic distributions, logical incompleteness, missing values, inconsistencies, open-world assumption, etc.) and with provenance information (indicating where the data originates from), as well as with the extraction of uncertainty and provenance annotations from real-world data. Provenance is also linked to explainability: determining where the result of a data management task comes from, and how and why it was produced, helps explain it. Interestingly, the foundations and tools for uncertainty management often rely on provenance annotations. For example, a typical way to compute the probability of query results in probabilistic databases is the so-called intensional approach: first generate the provenance of these query results (in some appropriate framework, e.g., that of Boolean functions or of provenance semirings), and then compute the probability of the resulting provenance annotation. For this reason, we deal with uncertainty and provenance in a unified manner, and with explainability as an application thereof.
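
As an illustration of the intensional approach, here is a minimal Python sketch. It assumes a tuple-independent setting in which the provenance of a query result is given as a monotone DNF over fact identifiers; the function name, the fact identifiers t1, t2, t3 and their probabilities are hypothetical. The probability is computed by brute-force enumeration of possible worlds, which is exponential and only meant to make the two steps explicit.

```python
from itertools import product


def provenance_probability(dnf, prob):
    """Probability that a monotone DNF provenance formula is true.

    dnf  : list of sets of fact identifiers (each set is one derivation)
    prob : dict mapping each fact identifier to its independent probability
    """
    facts = sorted({f for clause in dnf for f in clause})
    total = 0.0
    # Enumerate every possible world (subset of facts that are present).
    for world in product([False, True], repeat=len(facts)):
        present = {f for f, keep in zip(facts, world) if keep}
        # The result holds in this world if some derivation is fully present.
        if any(clause <= present for clause in dnf):
            weight = 1.0
            for f, keep in zip(facts, world):
                weight *= prob[f] if keep else 1.0 - prob[f]
            total += weight
    return total


# Hypothetical example: the result is derived either from facts t1 and t2
# together, or from fact t3 alone.
example_dnf = [{"t1", "t2"}, {"t3"}]
example_prob = {"t1": 0.5, "t2": 0.5, "t3": 0.5}
result = provenance_probability(example_dnf, example_prob)
```

The first step (producing the provenance) is represented here by the hand-written DNF; only the second step is computed.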

3.3 Research axis 3: Knowledge discovery at scale

Our final axis deals with knowledge discovery at scale. The goal is to use techniques such as data mining, information extraction, data cleaning, information integration, and machine learning to derive knowledge from raw, dirty, inconsistent, heterogeneous, rapidly changing data from real-world application scenarios.

We intend to leverage our expertise on data management to focus on the scalability of the approaches and tools developed. This is also in some sense an application axis for techniques developed in the other two axes; in particular, we have a focus on intensionality of data (i.e., cost of data access), on the trade-off between data uncertainty and its cost, on data provenance and explanations.

The subtopics of this axis typically change often, depending on projects, collaborations, and application partners.

4 Application domains

A large part of Valda's research is foundational in nature and not tailored to any specific application domain. Some applied works target certain application domains however:

Web data
in a broad sense (semi-structured, structured or unstructured content extracted from Web databases; knowledge bases from the Semantic Web; social networks; Web archives and Web crawls; Web applications and deep Web databases; crowdsourcing platforms). This is a historical domain of interest of Valda researchers, and we have expertise in the acquisition, extraction, and management of this kind of data.
Open science
(publication databases, scientific publications, open-source software).
Clinical data
(notably inconsistent or incomplete hospital records).
Energy
(notably data from power stations, in collaboration with industrial partners).
Geoscience
(seismology or volcanology time series, structured data about geological campaigns).
Data journalism
(statistical datasets, fact checking data).

Finally, cross-cutting concerns that arise in different application areas and motivate some of our theoretical work are the ethics of data management and privacy.

5 Highlights of the year

5.1 Scientific events

In March 2024, Valda organized an event in honor of Serge Abiteboul, jointly with the EDBT/ICDT 2024 conference in Paestum, Italy. The event marked Serge's 70th birthday and retirement from Arcep, and celebrated his scientific career and achievements. It included talks by his colleagues, former students, and Serge himself.

5.2 Awards

Paul Boniol's and Emmanouil Sylligardos's demonstration paper 18 was named runner-up for best demonstration at the ICDE 2024 conference.

5.3 Inria policy

Note: Readers are advised that the Institute does not endorse the text in the “Highlights of the year” section, which is the sole responsibility of the team leader.

At the end of 2024, Inria's top management enacted a new “contrat d'objectifs, de moyens et de performance” (COMP), which defines Inria's objectives for the period 2024–2028. We are very unhappy and concerned about the content of this document and the way it was imposed.

Neither the staff nor their representative bodies were given the opportunity to participate in (or influence) the drafting of this document.
The document defines Inria's main mission as “contributing to the digital sovereignty of the Nation through research and innovation” and proposes to amend Inria's founding decree to reflect this new definition. We strongly believe that our primary mission is (and should remain) the advancement of human knowledge through research. Research is not a means to achieve “digital sovereignty”, whatever that may mean. Research should not be associated with any particular nation, whatever that nation may be.
The document announces the creation of a funding agency within Inria. France already has an independent funding agency, the ANR. The creation of a new funding agency within a research institute is unnecessary and a waste of resources. It is also likely to create confusion, opacity, and conflicts of interest.
Many aspects of the document reflect a desire to drive research in a top-down manner, for example through the selection of “strategic partner institutions” and “strategic themes”. This threatens the fundamental freedom of researchers to choose their research topics and collaborations.
The document indicates that all of Inria's research should have “dual nature”, that is, both civilian and military applications. While some of the institute's research may have military applications, the vast majority of it is independent of the military, and should remain so.
The document announces a desire to place all of Inria in a “restricted regime area” (ZRR), which means that the hiring of researchers and interns will be reviewed and possibly vetoed by the Fonctionnaire Sécurité Défense. This creates administrative delays, subjects hiring to opaque criteria, and discourages the hiring of foreign nationals, thus harming research and collaboration.
Staff opposition to these policies, which has been expressed in several votes and petitions, has been largely ignored.

6 New software, platforms, open data

6.1 New software

6.1.1 ProvSQL

Keywords:
Databases, Provenance, Probability
Functional Description:
The goal of the ProvSQL project is to add support for (m-)semiring provenance and uncertainty management to PostgreSQL databases, in the form of a PostgreSQL extension/module/plugin.
News of the Year:
Revamp of the implementation of the circuit storage, now stored in memory-mapped files accessed through a single worker process. Addition of a local circuit cache within each PostgreSQL backend. Implementation of expected value computation. Implementation of union of intervals semiring. Support for provenance tracking of deletions. Dropped support for PostgreSQL 9.6. Support for PostgreSQL 17. macOS and WSL continuous integration. Various optimizations, bug fixes, and robustness enhancements.
URL:
https://github.com/PierreSenellart/provsql
Publications:
hal-01672566, hal-01851538, hal-04561331
Contact:
Pierre Senellart
Participant:
Pierre Senellart

6.1.2 apxproof

Keyword:
LaTeX
Functional Description:
apxproof is a LaTeX package facilitating the typesetting of research articles with proofs in appendix, a common practice in database theory and theoretical computer science in general. The appendix material is written in the LaTeX code along with the main text which it naturally complements, and it is automatically deferred. The package can automatically send proofs to the appendix, can repeat in the appendix the theorem environments stated in the main text, can section the appendix automatically based on the sectioning of the main text, and supports a separate bibliography for the appendix material.
Release Contributions:
Support for lipcs's claimproof, support optional arguments in proofs
News of the Year:
Support for user-defined claimproof environment.
URL:
https://github.com/PierreSenellart/apxproof
Contact:
Pierre Senellart
Participant:
Pierre Senellart

6.1.3 dissem.in

Name:
Dissemin
Keywords:
Open Access, Publishing, HAL
Functional Description:
Dissemin is a web platform gathering metadata from many sources to analyze the open-access full text availability of publications of researchers. It has been designed to foster the use of repositories such as HAL (rather than preprints posted on personal homepages). It allows deposit on these repositories.
News of the Year:
Finalization of Crossref ingestion.
URL:
https://gitlab.com/dissemin/dissemin
Contact:
Pierre Senellart
Participant:
Pierre Senellart
Partner:
CAPSH

7 New results

7.1 Research axis 1: Foundations of data management

Participants: Camille Bourgaux, Sven Dziadek, Lucas Larroque, Michaël Thomazo, Luc Segoufin.

Knowledge representation and knowledge bases

Ontology-based query answering is a problem that takes as input an ontology $\mathcal{R}$ (in our context, a set of existential rules), a set $\mathcal{F}$ of facts, and a Boolean conjunctive query (CQ) $q$, and asks whether $q$ follows from $(\mathcal{R}, \mathcal{F})$ under standard first-order semantics. This problem is undecidable in general, and a widely investigated approach to tackle it in some cases is query rewriting: given a “rule query” $(\mathcal{R}, q)$, we compute a Boolean query $q'$ such that, for any fact set $\mathcal{F}$, it holds that $q$ follows from $(\mathcal{R}, \mathcal{F})$ if and only if $q'$ follows from $\mathcal{F}$. So far, previous work has mostly focused on output queries $q'$ expressed as unions of Boolean conjunctive queries (UCQs), and an effective algorithm that computes such a query $q'$ whenever it exists has been proposed in the literature. However, UCQ-rewritability is not a very general notion and many interesting real-world rule queries do not admit UCQ-rewritings. This raises the question of whether such a generic algorithm can be designed for a more expressive target language, such as datalog. We answer this question in the negative in 20, by studying the difference between datalog-expressibility and datalog-rewritability. More precisely, we show that query answering under datalog-expressible rule queries is undecidable.
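
To make the idea of query rewriting concrete, the following Python sketch handles a deliberately tiny special case that is assumed here for illustration only: rules of the form A(x) → B(x) (no existential variables) and atomic Boolean queries B(c). In this case the UCQ-rewriting is simply the disjunction of all predicates from which the queried predicate can be derived. The function names, predicate names and facts are hypothetical, and this is not the algorithm studied in 20.

```python
def rewrite_atomic_query(rules, target):
    """Predicates P such that P(c) entails target(c) under the rules.

    rules  : iterable of (body, head) pairs, each meaning body(x) -> head(x)
    target : predicate of the atomic Boolean query target(c)
    The returned set is read as a union (disjunction) of atomic queries.
    """
    rewriting = {target}
    frontier = [target]
    while frontier:
        current = frontier.pop()
        # Apply each rule backwards: if head matches, body also entails target.
        for body, head in rules:
            if head == current and body not in rewriting:
                rewriting.add(body)
                frontier.append(body)
    return rewriting


def holds(facts, rewriting, constant):
    """Evaluate the rewritten query directly on the facts, ignoring the rules."""
    return any((pred, constant) in facts for pred in rewriting)


# Hypothetical ontology and data.
rules = [("Professor", "Researcher"), ("Researcher", "Person")]
ucq = rewrite_atomic_query(rules, "Person")
facts = {("Professor", "alice")}
alice_is_person = holds(facts, ucq, "alice")
bob_is_person = holds(facts, ucq, "bob")
```

Once the rewriting is computed, the rules are no longer needed at query time: the rewritten query is evaluated on the facts alone.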

Research on knowledge graph embeddings has recently evolved into knowledge base embeddings, where the goal is not only to map facts into vector spaces but also to constrain the models so that they take into account the relevant conceptual knowledge available. In 19, we examine recent methods that have been proposed to embed description logic knowledge bases into vector spaces, through the lens of their geometry-based semantics. We identify several relevant theoretical properties, which we draw from the literature and sometimes generalize or unify. We then investigate how concrete embedding methods fit in this theoretical framework.

Consistent query answering

In 25, we consider the dichotomy conjecture for consistent query answering under primary key constraints. It states that, for every fixed Boolean conjunctive query $q$, testing whether $q$ is certain (i.e., whether it evaluates to true over all repairs of a given inconsistent database) is either in PTime or coNP-complete. This conjecture has been verified for self-join-free and path queries. We show that it also holds for queries with two atoms.
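
The notion of certainty can be illustrated with a brute-force Python sketch. It assumes binary relations whose first attribute is the primary key, represents facts as (relation, key, value) triples, and enumerates all repairs explicitly, which takes exponential time; the relation names, the example query and the data are hypothetical. It only spells out the definition and says nothing about the complexity results of 25.

```python
from collections import defaultdict
from itertools import product


def repairs(facts):
    """Enumerate repairs: keep exactly one fact per (relation, key) block."""
    blocks = defaultdict(list)
    for fact in facts:
        relation, key, _value = fact
        blocks[(relation, key)].append(fact)
    for choice in product(*blocks.values()):
        yield set(choice)


def is_certain(facts, query):
    """A Boolean query is certain if it is true in every repair."""
    return all(query(repair) for repair in repairs(facts))


def two_atom_query(db):
    """Hypothetical two-atom query: exists x, y, z with R(x, y) and S(y, z)."""
    return any(
        r == "R" and s == "S" and y == y2
        for (r, _x, y) in db
        for (s, y2, _z) in db
    )


# R(a, b) and R(a, c) violate the key of R, so there are two repairs.
db_certain = {("R", "a", "b"), ("R", "a", "c"), ("S", "b", "d"), ("S", "c", "e")}
db_uncertain = {("R", "a", "b"), ("R", "a", "c"), ("S", "b", "d")}
```

In the first database, the query holds whichever R-fact is kept; in the second, the repair keeping R(a, c) falsifies it.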

Other aspects of theoretical computer science

Our research occasionally touches on other aspects of theoretical computer science not related to data management. In 11, we show how to efficiently solve problems involving a quantitative measure, here called energy, as well as a qualitative acceptance condition, expressed as a Büchi or Parity objective, in finite weighted automata and in one-clock weighted timed automata. Solving the former problem and extracting the corresponding witness is our main contribution and is handled by a modified version of the Bellman-Ford algorithm interleaved with Couvreur’s algorithm. The latter problem is handled via a reduction to the former relying on the corner-point abstraction.

7.2 Research axis 2: Uncertainty, provenance, and explainability in data management

Participants: Camille Bourgaux, Robin Jean, Pierre Senellart.

Inconsistent knowledge bases

In 15, we explore a quantitative approach to querying inconsistent description logic knowledge bases. We consider weighted knowledge bases in which both axioms and assertions have (possibly infinite) weights, which are used to assign a cost to each interpretation based upon the axioms and assertions it violates. Two notions of certain and possible answer are defined by either considering interpretations whose cost does not exceed a given bound or restricting attention to optimal-cost interpretations. Our main contribution is a comprehensive analysis of the combined and data complexity of bounded cost satisfiability and certain and possible answer recognition, for description logics between $\mathcal{EL}_\bot$ and $\mathcal{ALCO}$.

In 16, we present a novel approach to querying classical inconsistent description logic (DL) knowledge bases by adopting a paraconsistent semantics with the four “Belnapian” values: exactly true, exactly false, both, and neither. In contrast to prior studies on paraconsistent DLs, we allow truth value operators in the query language, which can be used to differentiate between answers having contradictory evidence and those having only positive evidence. We present a reduction to classical DL query answering that allows us to pinpoint the precise combined and data complexity of answering queries with values in paraconsistent $\mathcal{ALCHI}$ and its sublogics. Notably, we show that tractable data complexity is retained for Horn DLs. We present a comparison with repair-based inconsistency-tolerant semantics, showing that the two approaches are incomparable.
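
As a small illustration of the four Belnapian values only (not of the reduction or the query language of 16), the Python sketch below assumes that, for each candidate answer, we already know whether positive and negative evidence exist. The names, the evidence table and the filtering function are hypothetical.

```python
from enum import Enum


class Belnap(Enum):
    TRUE = "exactly true"
    FALSE = "exactly false"
    BOTH = "both"
    NEITHER = "neither"


def belnap_value(positive_evidence, negative_evidence):
    """Map the presence of positive/negative evidence to a Belnapian value."""
    if positive_evidence and negative_evidence:
        return Belnap.BOTH
    if positive_evidence:
        return Belnap.TRUE
    if negative_evidence:
        return Belnap.FALSE
    return Belnap.NEITHER


def answers_with_value(evidence, wanted):
    """Candidates whose value is `wanted`.

    evidence : dict candidate -> (has_positive, has_negative)
    """
    return sorted(
        candidate
        for candidate, (pos, neg) in evidence.items()
        if belnap_value(pos, neg) is wanted
    )


# Hypothetical evidence for three candidate answers.
evidence = {
    "alice": (True, False),   # only positive evidence
    "bob": (True, True),      # contradictory evidence
    "carol": (False, False),  # no evidence at all
}
only_positive = answers_with_value(evidence, Belnap.TRUE)
contradictory = answers_with_value(evidence, Belnap.BOTH)
```

Filtering on a value is what lets a query separate answers with contradictory evidence from those with only positive evidence.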

Provenance management and probabilistic databases

In 26, we report on the impact that the theory of provenance semirings, developed by Val Tannen and his collaborators, has had on the design of a practical system for maintaining the provenance of query results over a relational database, namely ProvSQL.

Shapley values, originating in game theory and increasingly prominent in explainable AI, have been proposed to assess the contribution of facts in query answering over databases, along with other similar power indices such as Banzhaf values. In 12 we adapt these Shapley-like scores to probabilistic settings, the objective being to compute their expected value. We show that the computations of expected Shapley values and of the expected values of Boolean functions are interreducible in polynomial time, thus obtaining the same tractability landscape. We investigate the specific tractable case where Boolean functions are represented as deterministic decomposable circuits, designing a polynomial-time algorithm for this setting. We present applications to probabilistic databases through database provenance, and an effective implementation of this algorithm within the ProvSQL system, which experimentally validates its feasibility over a standard benchmark.
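
The following Python sketch illustrates Shapley values of a Boolean function and one possible reading of their expected value. It is a brute-force computation, not the polynomial-time circuit-based algorithm of 12. It rests on an assumption made here for illustration: the expected score of a fact is taken to be the average, over possible worlds of independent facts, of its Shapley value in the game restricted to the facts present in that world, counting 0 when the fact itself is absent. The function names, fact identifiers and example function are hypothetical.

```python
from itertools import combinations
from math import factorial


def shapley(phi, players, target):
    """Shapley value of `target` for the 0/1-valued function phi on `players`."""
    others = [p for p in players if p != target]
    n = len(players)
    value = 0.0
    for size in range(len(others) + 1):
        coeff = factorial(size) * factorial(n - size - 1) / factorial(n)
        for subset in combinations(others, size):
            s = frozenset(subset)
            # Marginal contribution of target when added to coalition s.
            value += coeff * (phi(s | {target}) - phi(s))
    return value


def expected_shapley(phi, prob, target):
    """Expected Shapley value under the assumption stated in the lead-in."""
    others = [f for f in sorted(prob) if f != target]
    expected = 0.0
    for size in range(len(others) + 1):
        for subset in combinations(others, size):
            world = set(subset) | {target}
            weight = prob[target]  # worlds without target contribute 0
            for f in others:
                weight *= prob[f] if f in world else 1.0 - prob[f]
            expected += weight * shapley(phi, sorted(world), target)
    return expected


def phi(present):
    """Hypothetical Boolean provenance (t1 AND t2) OR t3, as 0 or 1."""
    return int(({"t1", "t2"} <= present) or ("t3" in present))


all_facts = ["t1", "t2", "t3"]
sh_t3 = shapley(phi, all_facts, "t3")
esh_t3 = expected_shapley(phi, {"t1": 0.0, "t2": 1.0, "t3": 1.0}, "t3")
```

When every fact has probability 1, the expected value coincides with the ordinary Shapley value; in the example with t1 absent, t3 alone makes the function true.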

7.3 Research axis 3: Knowledge discovery at scale

Participants: Paul Boniol, Antoine Gauquier, Shrey Mishra, Pierre Senellart, Emmanouil Sylligardos.

Mining time series

Anomaly detection is an important problem in data analytics with applications in many domains. In recent years, there has been an increasing interest in anomaly detection tasks applied to time series. In two tutorials at two separate conferences 13, 17, we take a holistic view of anomaly detection in time series, starting from the core definitions and taxonomies related to time series and anomaly types, to an extensive description of the anomaly detection methods proposed by different communities in the literature. We explore the literature and the proposed methods by demonstrating systems that help users understand the core computational steps of some methods and navigate benchmark results. Finally, we describe the problem of model selection for anomaly detection and discuss recent experimental results.

Despite increasing academic interest in anomaly detection over time series and the large number of methods proposed in the literature, recent benchmark and evaluation studies demonstrated that there exists no single best anomaly detection method when applied to heterogeneous time series datasets. Therefore, the only scalable and viable solution to solve anomaly detection over very different time series collected from diverse domains is to propose a model selection method that will choose, based on time series characteristics, the best anomaly detection method to run. 18 describes ADecimo, a modular and extensible web application that helps users understand the performance of time series classification algorithms used as model selection methods for time series anomaly detection. Overall, our system enables users to compare 17 different classifiers over 1980 time series, and decide on the most suitable time series classification method for their own time series and use cases.

Beyond anomaly detection, exploring and comparing non-stationary multivariate time series is an important problem in many domains and real-world applications. In work conducted before Paul Boniol's arrival in Valda, we introduced $d_{symb}$, a symbolic representation that transforms multivariate time series into interpretable symbolic sequences that come along with a compatible and efficient distance measure to compare the obtained symbolic sequences. We have shown how $d_{symb}$ can handle the non-stationarity of multivariate physiological signals, how interpretable the symbolization is, and how suitable the distance measure is compared to Dynamic Time Warping (DTW) variants. We have also empirically shown that the computation time when using $d_{symb}$ on a clustering task is significantly smaller than with DTW variants (typically 100 times faster). In 21, we present the $d_{symb}$ playground, an interactive web-based tool to interpret and compare a large multivariate time series dataset quickly. We showcase the relevance of this tool in several scenarios based on real-world datasets.

Information extraction and structuring

22 describes Antoine Gauquier's PhD project, which aims to study holistic methods for building, populating, and exploiting warehouses of heterogeneous content. Each warehouse is characterized by a specification of the types of content we search for; a set of websites in which to search for the content; and a set of dedicated methods to analyze and understand the content, including methods to establish or find links that connect different pieces of content. AI and uncertainty are naturally involved in these steps. We present the overall thesis aims, as well as encouraging preliminary results for one use case: the acquisition of statistical data resources from French government websites, leveraging reinforcement learning.

23 explores first steps towards extracting information about theorems and proofs from scholarly documents to build a knowledge base of interlinked results. Specifically, we consider two main tasks: extracting results and their proofs from the PDF of scientific articles; and establishing which result is used in the proof of which, across the scientific literature. We discuss the problem statement, the methodologies employed in both phases of our approach, and preliminary findings, highlighting the challenges faced.

More specifically, in 24, 28 we address the extraction of mathematical statements and their proofs from scholarly PDF articles as a multimodal classification problem, utilizing text, font features, and bitmap image renderings of PDFs as distinct modalities. We propose a modular sequential multimodal machine learning approach specifically designed for extracting theorem-like environments and proofs. This is based on a cross-modal attention mechanism to generate multimodal paragraph embeddings, which are then fed into our novel multimodal sliding window transformer architecture to capture sequential information across paragraphs. Our approach demonstrates performance improvements obtained by transitioning from unimodality to multimodality, and finally by incorporating sequential modeling over paragraphs.

8 Partnerships and cooperations

8.1 International initiatives

8.1.1 Participation in other International Programs

DesCartes

Participants: Pierre Senellart.

Title:
Intelligent Modelling for Decision-making in Critical Urban Systems
Partner Institution(s):
CNRS@CREATE, National University of Singapore
Duration:
2021–2026
Additional info:
DesCartes is a project managed by CNRS@CREATE, a CNRS subsidiary in Singapore, and funded by Singapore’s National Research Foundation, with a total budget of 50 million. Pierre Senellart is involved in the project as one of the French PIs.

PHC AURORA with Ana Ozaki

Participants: Camille Bourgaux.

Title:
Learning and Reasoning in Knowledge Graph Embeddings
Partner Institution(s):
University of Oslo, University of Bergen
Duration:
2023–2024

International ANR project EQUUS

Participants: Luc Segoufin.

Title:
Efficient query answering under updates
Partner Institution(s):
TU Ilmenau, Uni. Bayreuth, HU Berlin, CNRS
Duration:
2020–2025

8.2 International research visitors

8.2.1 Visits of international scientists

Other international visits to the team

Thomas Schwentick

Status:
Professor
Institution of origin:
TU Dortmund
Country:
Germany
Dates:
October 2023 to January 2024
Mobility program:
Sabbatical

Victor Vianu

Status:
Professor
Institution of origin:
UC San Diego
Country:
USA
Dates:
June 2024 to January 2025
Mobility program:
Sabbatical

8.2.2 Visits to international teams

Research stays abroad

Pierre Senellart was an invited participant to the Representation, Provenance, and Explanations in Database Theory and Logic seminar at Dagstuhl (January 2024).

8.3 National initiatives

8.3.1 ANR

Valda has been part of two national ANR projects in 2024:

CQFD
(2018–2024; 19 k€ for Valda, budget managed by Inria), with Inria Sophia (GraphIK, coordinator), LaBRI, LIG, Inria Saclay (Cedar), IRISA, Inria Lille (Spirals), and Télécom ParisTech, on complex ontological queries over federated and heterogeneous data.
QUID
(2018–2024; 49 k€ for Valda, budget managed by Inria), with LIGM (coordinator), IRIF, and LaBRI, on incomplete and inconsistent data.

Camille Bourgaux has been participating in the AI Chair of Meghyn Bienvenu on INTENDED (Intelligent handling of imperfect data) since 2020.

Pierre Senellart held a chair within the PR[AI]RIE institute for artificial intelligence in Paris from 2019. He now holds a chair in the new PR[AI]RIE – Paris School of AI (PSAI) AI cluster, approved for funding from 2024. Camille Bourgaux is also a Research Fellow of PSAI.

8.3.2 Others

Dissemin
(2021–2024; 124 k€ for Valda, budget managed by ENS), sole partner, on the development of the dissem.in platform for open science promotion. Funded by the Fonds National Science Ouverte.

9 Dissemination

9.1 Promoting scientific activities

9.1.1 Scientific events: organisation

Member of the organizing committees

Camille Bourgaux, member of the DL steering committee
Luc Segoufin, member of the STACS steering committee
Pierre Senellart, editorial board of the LIPIcs series of conference proceedings

Chair of conference program committees

Camille Bourgaux, co-chair of the KR 2024 Doctoral Consortium

Member of the conference program committees

Camille Bourgaux, BDA 2024 (demonstrations), DL 2024, IJCAI 2024, CSL 2025
Paul Boniol, BDA 2024, DSAA 2024 (Applications track), MulTiSA 2024
Pierre Senellart, JCDL 2024 (senior PC), SUM 2024, TaPP 2024, ICDT 2025
Michaël Thomazo, KR 2024, RuleML+RR 2024, IJCAI 2024
Victor Vianu, PODS 2024

Member of the editorial boards

Luc Segoufin, associate editor, ACM ToCL
Victor Vianu, editor, Database Theory column, SIGACT News

9.1.2 Invited talks

Camille Bourgaux, Querying inconsistent prioritized data, invited talk at DL 2024 (June 2024, Bergen, Norway) [14]
Pierre Senellart, Provenance, Probabilities, and Power Indices in Databases, department seminar at ENS Paris-Saclay (March 2024, Gif-sur-Yvette, France)
Paul Boniol, Anomaly Detection in Time Series, keynote for the AALTD workshop at ECML/PKDD 2024 (September 2024, Vilnius, Lithuania)
Camille Bourgaux, Querying inconsistent databases, department seminar at ENS Rennes (October 2024, Bruz, France)
Paul Boniol, Time Series Anomaly Detection: Overview and New Trends, keynote for the ML4Jets International Conference (November 2024, Paris, France)
Pierre Senellart, Qu'est-ce que l'IA ? (What is AI?), keynote at Assemblée des partenaires de Hal (November 2024, Lyon, France)

9.1.3 Leadership within the scientific community

Serge Abiteboul is a member of the French Academy of Sciences, of the Academia Europaea, and an ACM Fellow.
Pierre Senellart is a junior member of the Institut Universitaire de France.

9.1.4 Research administration

Serge Abiteboul was a member of the jury for prizes of the Académie des Sciences: Computer science (president), Lovelace-Babbage, Mines-Télécom
Serge Abiteboul was a member of the commission for new members of the Académie des Sciences
Serge Abiteboul is a member of the scientific committee of the Programme Inria Quadrant (PIQ)
Luc Segoufin is a member of the Formation Spécialisée de Site (FSS) of the Inria Paris research centre.
Pierre Senellart is the president of section 6 of the National Committee for Scientific Research. As a representative of CoNRS, Pierre Senellart was in the Hcéres evaluation committee of the LMF research unit.
Pierre Senellart is a member of the board of the conference of presidents of the national committee (CPCN) and as such a member of the coordination of managing parties of the national committee (C3N).
Pierre Senellart is deputy director of the DI ENS laboratory, joint between ENS, CNRS, and Inria.
Pierre Senellart is the scientific resource person for Scientific information & edition of the Inria Paris centre
We participated in the following hiring and promotion juries within universities:
- Serge Abiteboul, Professeur, CNAM
- Camille Bourgaux, Maître de conférences, IUT d'Orsay
- Pierre Senellart, Repyramidage Professeur des Universités, Université Paris-Cité
- Michaël Thomazo, Maître de conférences, Université de Bordeaux
- Michaël Thomazo, Maître de conférences, CentraleSupélec

9.2 Teaching - Supervision - Juries

9.2.1 Teaching

Licence: Algorithms, L1, CPES, PSL – Antoine Gauquier
Licence: Differential calculus, L2, CPES, PSL – Antoine Gauquier
Licence: Practical Computing, L3, École normale supérieure – Pierre Senellart
Licence: Formal Languages, Computability, Complexity, L3, École normale supérieure – Michaël Thomazo, Lucas Larroque
Licence: Databases, L3, École normale supérieure – Pierre Senellart, Lucas Larroque
Master: Logiques de description, M1, DCI – Camille Bourgaux
Master: Data Acquisition, Extraction, and Storage, M2, IASD – Pierre Senellart
Master: NoSQL Databases, M2, IASD – Paul Boniol
Master: Knowledge graphs, description logics, and reasoning on data, M2, IASD – Camille Bourgaux, Michaël Thomazo
Master: Description logics and reasoning on data, M2, LMFI – Camille Bourgaux, Michaël Thomazo
Professional training: Web Security, PESTO (Corps des Mines professional training) – Pierre Senellart

As a professor at ENS, Pierre Senellart holds various teaching responsibilities there (M1 projects, M2 administration, entrance competition). Pierre Senellart is the academic director of the graduate program of PSL.

As an adjunct professor at PSL, Michaël Thomazo is in charge of PhD committees within DI ENS and co-responsible for the international entrance competition at ENS.

We also gave invited courses in summer schools:

Paul Boniol, Time Series Anomaly Detection, International Summer School on Internet of Things - ISSIOT, Salerno, Italy
Paul Boniol, Time Series Anomaly Detection, diiP Summer School 2024 – dSDS, Paris, France
Michaël Thomazo, Summer School of Palacky University, Olomouc, Czechia

Most permanent members of the group are also involved in tutoring ENS students, advising them on their curriculum, their internships, etc. They are also occasionally involved with reviewing internship reports, supervising student projects, etc.

9.2.2 Supervision

PhD defended: Shrey Mishra, Multimodal extraction of proofs and theorems from the scientific literature, 2021–2024, Pierre Senellart [27]
PhD in progress: Anatole Dahan, Logical foundations of the polynomial hierarchy, started in October 2020, Arnaud Durand (Université Paris-Cité) & Luc Segoufin
PhD in progress: Antoine Gauquier, Intelligent construction of a multimodal and heterogeneous data warehouse, with data traceability, started in September 2023, Pierre Senellart & Ioana Manolescu (Inria Cedar)
PhD in progress: Lucas Larroque, Extension of rewriting procedures for reasoning using existential rules, started in September 2023, Michaël Thomazo
PhD in progress: Robin Jean, Integration of preferences and domain knowledge in inconsistency-tolerant ontology-based data access, started in October 2023, Meghyn Bienvenu (CNRS LaBRI) & Camille Bourgaux
PhD in progress: Aryak Sen, Scalability of a data provenance and probability management system, started in February 2024, Silviu Maniu (Université Grenoble Alpes) & Pierre Senellart
PhD in progress: Emmanouil Sylligardos, Accuracy and execution time trade-off in ensembling and model selection for time series analytics, started in February 2024, Paul Boniol & Pierre Senellart
PhD in progress: Felix Chavelli, Graph representations for multivariate time series analytics, started in October 2024, Paul Boniol & Michaël Thomazo
PhD in progress: Pratik Karmakar, Quality, uncertainty, and lineage of data, Stéphane Bressan (NUS) & Pierre Senellart (as he is based in Singapore, he is not considered a Valda member)
M2 internship: Atefe Khodadaditaghanaki, Camille Bourgaux
M2 internship: Haoming Lin, Paul Boniol & Michaël Thomazo
M2 internship: Marijan Soric, Cécile Gracianne (BRGM), Ioana Manolescu (Inria Cedar) & Pierre Senellart
L2 internship: Leo Boullot, Antoine Gauquier & Pierre Senellart
L2 internship: Paul Sevestre, Antoine Gauquier & Pierre Senellart
L2 internship: Julie Zhan, Michaël Thomazo

9.2.3 Juries

PhD: Edwige Cyffers [president], Université de Lille, Pierre Senellart
PhD: Florent Martin-Lafay, Université Paris 1 Panthéon-Sorbonne, Serge Abiteboul
PhD: Alexandra Rogova, Université Paris-Cité, Michaël Thomazo
HdR: Pierre Bourhis [president & examiner], Université de Lille, Pierre Senellart & Serge Abiteboul
HdR: Charles Paperman [reviewer], Université de Lille, Pierre Senellart

9.3 Popularization

9.3.1 Specific official responsibilities in science outreach structures

Serge Abiteboul is the president of the scientific council of the direction of public finances (DGFIP)
Serge Abiteboul is a member of the board of the Inria Foundation, of the Sopra Steria Foundation, of the Blaise Pascal Foundation
Serge Abiteboul is a member of the scientific councils of La Main à la Pâte and Cigref (on responsible digitalization); he was a member of the scientific council of the exhibition on AI “Double Je” in Toulouse in February 2024
Pierre Senellart is a scientific expert advising the Scientific and Ethical Committee of Parcoursup and MonMaster, the platforms for the selection of higher-education students at the first-year level and the Master's level. As such, he contributed to the 6th yearly report of the committee to the French parliament

9.3.2 Productions (articles, videos, podcasts, serious games, ...)

Serge Abiteboul is an editor of the binaire blog
Serge Abiteboul is the co-author (with François Bancilhon) of a popularization book on digital commons: Vive les communs numériques, Odile Jacob, February 2024

9.3.3 Participation in Live events

Serge Abiteboul is one of the authors of the play “Qui a hacké Garoutzia”, which was staged on various occasions throughout France in 2024 and has had regular performances in Paris at La Scène parisienne since September 2024
Serge Abiteboul participated in several dozen scientific outreach activities: lectures, round tables, interviews (notably with the Académie des sciences; on the radio or on TV; and in many other settings)
Camille Bourgaux participated in the RJMI (Rendez-Vous des Jeunes Mathématiciennes et Informaticiennes), an event to promote mathematics and computer science to female high-school students

10 Scientific production

10.1 Major publications

[1] Michael Benedikt, Pierre Bourhis, Georg Gottlob and Pierre Senellart. Monadic Datalog, Tree Validity, and Limited Access Containment. ACM Transactions on Computational Logic 21(1), 2020, 6:1–6:45.
[2] Paul Boniol, Emmanouil Sylligardos, John Paparrizos, Panos Trahanias and Themis Palpanas. ADecimo: Model Selection for Time Series Anomaly Detection. ICDE 2024 - IEEE 40th International Conference on Data Engineering, Utrecht, Netherlands, May 2024.
[3] Camille Bourgaux, Pierre Bourhis, Liat Peterfreund and Michaël Thomazo. Revisiting Semiring Provenance for Datalog. Proceedings of the 19th International Conference on Principles of Knowledge Representation and Reasoning (KR 2022), Haifa, Israel, July 2022, 91–101.
[4] Camille Bourgaux, David Carral, Markus Krötzsch, Sebastian Rudolph and Michaël Thomazo. Capturing Homomorphism-Closed Decidable Queries with Existential Rules. KR 2021 - 18th International Conference on Principles of Knowledge Representation and Reasoning, Virtual, Vietnam, November 2021, 141–150.
[5] Maxime Buron, Marie-Laure Mugnier and Michaël Thomazo. Parallelisable Existential Rules: a Story of Pieces. KR 2021 - 18th International Conference on Principles of Knowledge Representation and Reasoning, Virtual, Vietnam, November 2021.
[6] Nofar Carmeli and Luc Segoufin. Conjunctive Queries With Self-Joins, Towards a Fine-Grained Complexity Analysis. PODS'23, Seattle, United States, June 2023.
[7] Nathan Grosshans, Pierre Mckenzie and Luc Segoufin. Tameness and the power of programs over monoids in DA. Logical Methods in Computer Science 18(3), August 2022, 14:1–14:34.
[8] Pratik Karmakar, Mikaël Monet, Pierre Senellart and Stéphane Bressan. Expected Shapley-Like Scores of Boolean Functions: Complexity and Applications to Probabilistic Databases. Proceedings of the ACM on Management of Data 2(2) (PODS), January 2024.
[9] Nicole Schweikardt, Luc Segoufin and Alexandre Vigny. Enumeration for FO Queries over Nowhere Dense Graphs. Journal of the ACM (JACM) 69(3), June 2022, 1–37.
[10] Pierre Senellart, Louis Jachiet, Silviu Maniu and Yann Ramusat. ProvSQL: Provenance and Probability Management in PostgreSQL. Proceedings of the VLDB Endowment (PVLDB) 11(12), August 2018, 2034–2037.

10.2 Publications of the year

International journals

[11] Sven Dziadek, Uli Fahrenberg and Philipp Schlehuber. ω-Regular Energy Problems. Formal Aspects of Computing, July 2024. In press.
[12] Pratik Karmakar, Mikaël Monet, Pierre Senellart and Stéphane Bressan. Expected Shapley-Like Scores of Boolean Functions: Complexity and Applications to Probabilistic Databases. Proceedings of the ACM on Management of Data 2(2) (PODS), January 2024.
[13] Qinghua Liu, Paul Boniol, Themis Palpanas and John Paparrizos. Time-Series Anomaly Detection: Overview and New Trends. Proceedings of the VLDB Endowment (PVLDB) 17(12), 2024, 4229–4232.

Invited conferences

[14] Camille Bourgaux. Querying Inconsistent Prioritized Data. DL 2024 - 37th International Workshop on Description Logics, Bergen, Norway, June 2024.

International peer-reviewed conferences

[15] Meghyn Bienvenu, Camille Bourgaux and Robin Jean. Cost-Based Semantics for Querying Inconsistent Weighted Knowledge Bases. Proceedings of the 21st International Conference on Principles of Knowledge Representation and Reasoning (KR 2024), Hanoi, Vietnam, November 2024.
[16] Meghyn Bienvenu, Camille Bourgaux and Daniil Kozhemiachenko. Queries With Exact Truth Values in Paraconsistent Description Logics. Proceedings of the 21st International Conference on Principles of Knowledge Representation and Reasoning (KR 2024), Hanoi, Vietnam, November 2024.
[17] Paul Boniol, John Paparrizos and Themis Palpanas. An Interactive Dive into Time-Series Anomaly Detection. ICDE 2024 - 40th International Conference on Data Engineering, Utrecht, Netherlands, May 2024.
[18] Paul Boniol, Emmanouil Sylligardos, John Paparrizos, Panos Trahanias and Themis Palpanas. ADecimo: Model Selection for Time Series Anomaly Detection. ICDE 2024 - IEEE 40th International Conference on Data Engineering, Utrecht, Netherlands, May 2024.
[19] Camille Bourgaux, Ricardo Guimarães, Raoul Koudijs, Victor Lacerda and Ana Ozaki. Knowledge Base Embeddings: Semantics and Theoretical Properties. Proceedings of the 21st International Conference on Principles of Knowledge Representation and Reasoning (KR 2024), Hanoi, Vietnam, November 2024.
[20] David Carral, Lucas Larroque and Michaël Thomazo. Ontology-Based Query Answering over Datalog-Expressible Rule Sets is Undecidable. KR 2024 - 21st International Conference on Principles of Knowledge Representation and Reasoning, Hanoi, Vietnam, November 2024.
[21] Sylvain W. Combettes, Paul Boniol, Charles Truong and Laurent Oudre. d_symb playground: an interactive tool to explore large multivariate time series datasets. ICDE 2024 - IEEE 40th International Conference on Data Engineering, Utrecht, Netherlands, May 2024.
[22] Antoine Gauquier. Towards Efficient Construction of a Traceable, Multimodal, and Heterogeneous Data Warehouse. VLDB 2024 PhD Workshop - The 50th International Conference on Very Large Data Bases, CEUR Workshop Proceedings, Guangzhou, China, August 2024.
[23] Shrey Mishra, Yacine Brihmouche, Theo Delemazure, Antoine Gauquier and Pierre Senellart. First Steps in Building a Knowledge Base of Mathematical Results. SDP Fourth Workshop on Scholarly Document Processing at ACL 2024, Bangkok, Thailand, August 2024.
[24] Shrey Mishra, Antoine Gauquier and Pierre Senellart. Modular Multimodal Machine Learning for Extraction of Theorems and Proofs in Long Scientific Documents. JCDL 2024 - ACM/IEEE-CS Joint Conference on Digital Libraries, Hong Kong, China, December 2024.
[25] Anantha Padmanabha, Luc Segoufin and Cristina Sirangelo. A Dichotomy in the Complexity of Consistent Query Answering for Two Atom Queries With Self-Join. PODS'24 - ACM Conference on Principles of Database Systems 2(2), Santiago, Chile, May 2024, 1–15.

Scientific book chapters

[26] Pierre Senellart. On the Impact of Provenance Semiring Theory on the Design of a Provenance-Aware Database System. In: The Provenance of Elegance in Computation - Essays Dedicated to Val Tannen. OpenAccess Series in Informatics, Schloss Dagstuhl, 2024.

Doctoral dissertations and habilitation theses

[27] Shrey Mishra. Multimodal Extraction of Proofs and Theorems from the Scientific Literature. PhD thesis, Université Paris Sciences & Lettres, July 2024.

Reports & preprints

[28] Shrey Mishra, Antoine Gauquier and Pierre Senellart. Modular Multimodal Machine Learning for Extraction of Theorems and Proofs in Long Scientific Documents (Extended Version). November 2024.

10.3 Cited publications

[29] Serge Abiteboul, Peter Buneman and Dan Suciu. Data on the Web: From Relations to Semistructured Data and XML. Morgan Kaufmann, 1999.
[30] Serge Abiteboul, Richard Hull and Victor Vianu. Foundations of Databases. Addison-Wesley, 1995. URL: http://webdam.inria.fr/Alice/
[31] Serge Abiteboul, Ioana Manolescu, Philippe Rigaux, Marie-Christine Rousset and Pierre Senellart. Web Data Management. Cambridge University Press, 2011. URL: http://webdam.inria.fr/Jorge
[32] Michael Benedikt and Pierre Senellart. Databases. In: Computer Science, The Hardware, Software and Heart of It. Springer, 2011, 169–229. URL: https://doi.org/10.1007/978-1-4614-1168-0_10
[33] Amol Deshpande, Zachary G. Ives and Vijayshankar Raman. Adaptive Query Processing. Foundations and Trends in Databases 1(1), 2007, 1–140. URL: https://doi.org/10.1561/1900000001
[34] Alon Y. Halevy. Answering queries using views: A survey. VLDB J. 10(4), 2001, 270–294. URL: https://doi.org/10.1007/s007780100054
[35] Donald Kossmann. The State of the art in distributed query processing. ACM Comput. Surv. 32(4), 2000, 422–469. URL: http://doi.acm.org/10.1145/371578.371598
[36] M. Tamer Özsu and Patrick Valduriez. Principles of Distributed Database Systems, Third Edition. Springer, 2011. URL: https://doi.org/10.1007/978-1-4419-8834-8
[37] Burr Settles. Active Learning. Synthesis Lectures on Artificial Intelligence and Machine Learning. Morgan & Claypool Publishers, 2012. URL: https://doi.org/10.2200/S00429ED1V01Y201207AIM018
[38] Richard S. Sutton and Andrew G. Barto. Reinforcement learning - an introduction. Adaptive computation and machine learning. MIT Press, 1998. URL: http://www.worldcat.org/oclc/37293240
[39] Ke Zhou, Mounia Lalmas, Tetsuya Sakai, Ronan Cummins and Joemon M. Jose. On the reliability and intuitiveness of aggregated search metrics. 22nd ACM International Conference on Information and Knowledge Management, CIKM'13, San Francisco, CA, USA, October 27 - November 1, 2013, 689–698. URL: http://doi.acm.org/10.1145/2505515.2505691
