2025 Activity report
Project-Team VALDA
RNSR: 201622223R
- Research center: Inria Paris Centre
- In partnership with: Ecole normale supérieure de Paris, CNRS
- Team name: Value from Data
- In collaboration with: Département d'Informatique de l'Ecole Normale Supérieure
Creation of the Project-Team: 2018 January 01
Keywords
Computer Science and Digital Science
- A3.1. Data
- A3.1.1. Modeling, representation
- A3.1.2. Data management, querying and storage
- A3.1.3. Distributed data
- A3.1.4. Uncertain data
- A3.1.5. Control access, privacy
- A3.1.6. Query optimization
- A3.1.7. Open data
- A3.1.8. Big data (production, storage, transfer)
- A3.1.9. Database
- A3.1.10. Heterogeneous data
- A3.1.11. Structured data
- A3.2. Knowledge
- A3.2.1. Knowledge bases
- A3.2.2. Knowledge extraction, cleaning
- A3.2.3. Inference
- A3.2.4. Semantic Web
- A3.2.5. Ontologies
- A3.2.6. Linked data
- A3.3. Data and knowledge analysis
- A3.3.1. On-line analytical processing
- A3.3.2. Data mining
- A3.3.3. Big data analysis
- A3.5.1. Analysis of large graphs
- A4.7. Access control
- A7.2. Logic in Computer Science
- A7.3. Calculability and computability
- A9.1. Knowledge
- A9.2.3. Reinforcement learning
- A9.2.5. Bayesian methods
- A9.8. Reasoning
Other Research Topics and Application Domains
- B2. Digital health
- B3.3. Geosciences
- B4. Energy
- B4.2. Nuclear Energy Production
- B6.3.1. Web
- B6.3.5. Search engines
- B9.3. Medias
- B9.5.6. Data science
- B9.6.5. Sociology
- B9.6.10. Digital humanities
- B9.7.2. Open data
- B9.9. Ethics
- B9.10. Privacy
1 Team members, visitors, external collaborators
Research Scientists
- Serge Abiteboul [Inria, Emeritus, HDR]
- Paul Boniol [Inria, ISFP]
- Camille Bourgaux [CNRS, Researcher]
- Luc Segoufin [Inria, Senior Researcher, HDR]
- Michael Thomazo [Inria, Researcher, HDR]
Faculty Member
- Pierre Senellart [Team leader, ENS-PSL, Professor, HDR]
PhD Students
- Felix Chavelli [Inria]
- Anatole Dahan [Université Paris-Cité, until Jul 2025]
- Antoine Gauquier [ENS-PSL]
- Robin Jean [CNRS]
- Lucas Larroque [ENS-PSL]
- Magali Parrino [EDF, CIFRE, from Jul 2025]
- Aryak Sen [CNRS & Université de Grenoble]
- Marijan Soric [Inria, from Mar 2025]
- Emmanouil Sylligardos [ENS-PSL]
Technical Staff
- Louis Chanaron [Inria, Engineer, from Oct 2025]
Interns and Apprentices
- Arushi Goyal [IIT Delhi & ENS-PSL, Intern, until May 2025]
- Adam Rozzio [ENS Paris-Saclay & ENS-PSL, Intern, from Feb 2025 until Jul 2025]
- Marijan Soric [Centrale Lyon & Inria, Intern, until Feb 2025]
Administrative Assistant
- Meriem Guemair [Inria]
Visiting Scientist
- Victor Vianu [UC San Diego, from Jun 2025]
2 Overall objectives
Valda's focus is on both foundational and systems aspects of complex data management, especially human-centric data. The data we are interested in is typically heterogeneous, massively distributed, rapidly evolving, intensional, and often subjective, possibly erroneous, imprecise, incomplete. In this setting, Valda is in particular concerned with the optimization of complex resources such as computer time and space, communication, monetary, and privacy budgets. The goal is to extract value from data, beyond simple query answering.
Data management 50, 52 is now an old, well-established field, for which many scientific results and techniques have been accumulated since the sixties. Originally, most works dealt with static, homogeneous, and precise data. Later, works were devoted to heterogeneous data 49, 51, and to possibly distributed data 56, but at a small scale.
However, these classical techniques are poorly adapted to handle the new challenges of data management. Consider human-centric data, which is either produced by humans, e.g., emails, chats, recommendations, or produced by systems when dealing with humans, e.g., geolocation, business transactions, results of data analysis. When dealing with such data, and to accomplish any task to extract value from such data, we rapidly encounter the following facets:
- Heterogeneity: data may come in many different structures such as unstructured text, graphs, data streams, complex aggregates, etc., using many different schemas or ontologies.
- Massive distribution: data may come from a large number of autonomous sources distributed over the web, with complex access patterns.
- Rapid evolution: many sources may be producing data in real time, even if little of it is perhaps relevant to the specific application. Typically, recent data is of particular interest and changes have to be monitored.
- Intensionality¹: in a classical database, all the data is available. In modern applications, the data is more and more available only intensionally, possibly at some cost, with the difficulty of discovering which source can contribute towards a particular goal, and this with some uncertainty.
- Confidentiality and security: some personal data is critical and needs to remain confidential. Applications manipulating personal data must take this into account and must be secure against linking.
- Uncertainty: modern data, and in particular human-centric data, typically includes errors, contradictions, imprecision, incompleteness, which complicates reasoning. Furthermore, the subjective nature of the data, with opinions, sentiments, or biases, also makes reasoning harder since one has, for instance, to consider different agents with distinct, possibly contradicting knowledge.
These problems have already been studied individually and have led to techniques such as query rewriting 54 or distributed query optimization 55.
Among all these aspects, intensionality is perhaps the one that has been studied the least, so let us expand a bit on this. Consider a user's query, taken in a very broad sense: it may be a classical database query, some information retrieval search, a clustering or classification task, or some more advanced knowledge extraction request. Because of the intensionality of data, solving such a query is a typically dynamic task: each time new data is obtained, the partial knowledge a system has of the world is revised, and query plans need to be updated, as in adaptive query processing 53 or aggregated search 59. The system then needs to decide, based on this partial knowledge, which access is best to perform next. This is reminiscent of the central problem of reinforcement learning 58 (train an agent to accomplish a task in a partially known world based on rewards obtained) and of active learning 57 (decide which action to perform next in order to optimize a learning strategy) and we intend to explore this connection further.
Uncertainty of the data interacts with its intensionality: efforts are required to obtain more precise, more complete, sounder results, which yields a trade-off between processing cost and data quality.
Other aspects, such as heterogeneity and massive distribution, are of major importance as well. A standard data management task, such as query answering, information retrieval, or clustering, may become much more challenging when taking into account the fact that data is not available in a central location, or in a common format. We aim to take these aspects into account, to be able to apply our research to real-world applications.
3 Research program
3.1 Research axis 1: Foundations of data management
This axis covers the theory of data management, broadly taken, and in particular the fields of database theory, knowledge representation, and some symbolic aspects of artificial intelligence (especially, reasoning on data).
The goal is to define solid and high-level foundations of data management tasks (query evaluation and optimization of various forms of queries, counting, reasoning, verification of data-centric processes, etc.) through formal tools, such as logics (esp., finite model theory), automata theory, complexity theory; we occasionally have contributions in these areas as well, though most of our work is motivated by data applications. We are especially interested in clean specifications of key aspects of database systems and data management tasks (e.g., confidentiality, access control, robustness), whether they are properties of the data or appropriate (query) languages for these tasks. We study expressive power of languages, computability and complexity of deciding or computing results, as well as the design of appropriate structures (e.g., indexes) to optimize these tasks.
3.2 Research axis 2: Uncertainty, provenance, and explainability in data management
This research axis deals with the modeling and efficient management of data that come with some uncertainty (probabilistic distributions, logical incompleteness, missing values, inconsistencies, open-world assumption, etc.) and with provenance information (indicating where the data originates from), as well as with the extraction of uncertainty and provenance annotations from real-world data. Provenance is also linked to explainability: determining where the result of a data management task comes from, how and why it was produced, helps explaining it. Interestingly, the foundations and tools for uncertainty management often rely on provenance annotations. For example, a typical way to compute the probability of query results in probabilistic databases is the so-called intensional approach: first generate the provenance of these query results (in some appropriate framework, e.g., that of Boolean functions or of provenance semirings), and then compute the probability of the resulting provenance annotation. For this reason, we deal with uncertainty and provenance in a unified manner, and with explainability as an application thereof.
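As a minimal sketch of this two-step intensional approach, the following Python code first computes a Boolean provenance (in disjunctive normal form) for a small join query, and then computes the probability of that provenance by enumerating possible worlds. The relation contents, tuple identifiers, probabilities, and function names are illustrative assumptions; the brute-force enumeration is exponential in the number of tuples and only stands in for the more refined techniques used in practice.

```python
from itertools import product


def join_provenance(r, s):
    """Provenance of the Boolean query 'exists x: R(x) and S(x)'.

    r and s map (hypothetical) tuple identifiers to attribute values.
    The result is a DNF: a list of monomials, each a set of tuple identifiers.
    """
    return [{rid, sid} for rid, x in r.items() for sid, y in s.items() if x == y]


def probability(dnf, probs):
    """Probability that the DNF holds when each tuple is independently present."""
    names = sorted(probs)
    total = 0.0
    # Enumerate every possible world (subset of present tuples).
    for bits in product([False, True], repeat=len(names)):
        present = {n for n, b in zip(names, bits) if b}
        if any(monomial <= present for monomial in dnf):
            weight = 1.0
            for n in names:
                weight *= probs[n] if n in present else 1.0 - probs[n]
            total += weight
    return total


# Toy tuple-independent database (assumed data).
R = {"r1": "a", "r2": "b"}
S = {"s1": "a", "s2": "a"}
PROBS = {"r1": 0.5, "r2": 0.5, "s1": 0.5, "s2": 0.5}

provenance = join_provenance(R, S)               # step 1: provenance
answer_probability = probability(provenance, PROBS)  # step 2: probability
```

Here the provenance is r1 ∧ (s1 ∨ s2), so with all probabilities set to 0.5 the query probability is 0.5 × 0.75 = 0.375.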
3.3 Research axis 3: Knowledge discovery at scale
Our final axis deals with knowledge discovery at scale. The goal is to use techniques such as data mining, information extraction, data cleaning, information integration, machine learning, to derive knowledge from raw, dirty, inconsistent, heterogeneous, rapidly changing data from real-world application scenarios.
We intend to leverage our expertise on data management to focus on the scalability of the approaches and tools developed. This is also in some sense an application axis for techniques developed in the other two axes; in particular, we have a focus on intensionality of data (i.e., the cost of data access), on the trade-off between data uncertainty and its cost, on data provenance and explanations.
The subtopics of this axis typically change considerably over time, depending on projects, collaborations, application partners.
4 Application domains
A large part of Valda's research is foundational in nature and not tailored to any specific application domain. Some applied works target certain application domains however:
- Web data in a broad sense (semi-structured, structured or unstructured content extracted from Web databases; knowledge bases from the Semantic Web; social networks; Web archives and Web crawls; Web applications and deep Web databases; crowdsourcing platforms). This is a historical domain of interest of Valda researchers, and we have expertise in the acquisition, extraction, and management of this kind of data.
- Open science (publication databases, scientific publications, open-source software).
- Clinical data (notably inconsistent or incomplete hospital records).
- Energy (notably data from power stations, in collaboration with industrial partners).
- Geoscience (seismology or vulcanology time series, structured data about geological campaigns).
- Data journalism (statistical datasets, fact checking data).
Finally, transversal concerns which occur in different application areas and motivate some of our theory work are ethics of data management and privacy.
5 Highlights of the year
The Inria–BRGM Géolaug challenge, which Valda contributes to, was launched in September 2025.
5.1 Awards
Camille Bourgaux, Anton Gnatenko (Free University of Bozen–Bolzano), and Michael Thomazo have received a best contribution award at DL 2025 and an outstanding paper award at ECAI 2025 for their work on analysing temporal reasoning in description logics using formal grammars 27, 28.
6 Latest software developments, platforms, open data
6.1 Latest software developments
6.1.1 ProvSQL
- Keywords: Databases, Provenance, Probability
- Scientific Description: ProvSQL is a general and easy-to-deploy provenance tracking and probabilistic database system implemented as a PostgreSQL extension. ProvSQL’s data and query models closely reflect those of a large core of SQL, including multiset semantics, the full relational algebra, and aggregation. A key part of its implementation relies on generic provenance circuits stored in memory-mapped files.
- Functional Description: The goal of the ProvSQL project is to add support for (m-)semiring provenance and uncertainty management to PostgreSQL databases, in the form of a PostgreSQL extension/module/plugin.
- News of the Year: Compatibility with PostgreSQL 18. Support for PROV-XML output. Partial support of HAVING queries. Support for compiled semirings, including the counting, Boolean, and Why semirings. Basic documentation infrastructure. Temporal semiring and temporal database support. Various minor enhancements and bug fixes.
- URL:
- Publications: hal-05037471, hal-05072212, hal-04930705, hal-04911715, hal-04561331, hal-04393781, hal-01672566, hal-01851538
- Contact: Pierre Senellart
- Participants: Aryak Sen, Pierre Senellart
- Partners: Université Grenoble Alpes, CNRS, National University of Singapore
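To illustrate what semiring provenance means, independently of ProvSQL's actual implementation or API, here is a small Python sketch that evaluates one provenance expression in the counting, Boolean, and Why semirings. The expression encoding, the `Semiring` tuple, and the tuple identifiers are assumptions made for this example only.

```python
from collections import namedtuple

# A semiring is given here only by its two operations (illustrative encoding).
Semiring = namedtuple("Semiring", ["plus", "times"])

COUNTING = Semiring(plus=lambda a, b: a + b, times=lambda a, b: a * b)
BOOLEAN = Semiring(plus=lambda a, b: a or b, times=lambda a, b: a and b)
# Why-provenance: sets of witnesses, each witness being a set of tuple ids.
WHY = Semiring(
    plus=lambda a, b: a | b,
    times=lambda a, b: frozenset(x | y for x in a for y in b),
)


def evaluate(expr, semiring, leaf):
    """Evaluate a provenance expression bottom-up in the given semiring.

    expr is either a tuple identifier (str) or (op, left, right) with
    op in {"plus", "times"}; leaf maps identifiers to semiring values.
    """
    if isinstance(expr, str):
        return leaf(expr)
    op, left, right = expr
    lval = evaluate(left, semiring, leaf)
    rval = evaluate(right, semiring, leaf)
    return semiring.plus(lval, rval) if op == "plus" else semiring.times(lval, rval)


# Provenance of an answer derived in two ways: (r1 * s1) + (r1 * s2).
EXPR = ("plus", ("times", "r1", "s1"), ("times", "r1", "s2"))

derivations = evaluate(EXPR, COUNTING, lambda name: 1)
still_holds = evaluate(EXPR, BOOLEAN, lambda name: name != "s1")
witnesses = evaluate(EXPR, WHY, lambda name: frozenset({frozenset({name})}))
```

The same expression yields the number of derivations (counting), whether the answer survives the deletion of `s1` (Boolean), and the sets of input tuples that jointly justify the answer (Why).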
6.1.2 VUS
- Name: Volume Under the Surface
- Keywords: Time Series, Anomaly detection, Measures, Performance measure, Python
- Scientific Description: Anomaly detection (AD) is a fundamental task for time-series analytics with important implications for the downstream performance of many applications. In contrast to other domains where AD mainly focuses on point-based anomalies (i.e., outliers in standalone observations), AD for time series is also concerned with range-based anomalies (i.e., outliers spanning multiple observations). Nevertheless, it is common to use traditional point-based information retrieval measures, such as Precision, Recall, and F-score, to assess the quality of methods by thresholding the anomaly score to mark each point as an anomaly or not. However, mapping discrete labels into continuous data introduces unavoidable shortcomings, complicating the evaluation of range-based anomalies. Notably, the choice of evaluation measure may significantly bias the experimental outcome. Despite over six decades of attention, there has never been a large-scale systematic quantitative and qualitative analysis of time-series AD evaluation measures. This paper extensively evaluates quality measures for time-series AD to assess their robustness under noise, misalignments, and different anomaly cardinality ratios. Our results indicate that measures producing quality values independently of a threshold (i.e., AUC-ROC and AUC-PR) are more suitable for time-series AD. Motivated by this observation, we first extend the AUC-based measures to account for range-based anomalies. Then, we introduce a new family of parameter-free and threshold-independent measures, VUS (Volume Under the Surface), to evaluate methods while varying parameters. Our findings demonstrate that our four measures are significantly more robust in assessing the quality of time-series AD methods.
- Functional Description: The receiver operating characteristic (ROC) curve and the area under the curve (AUC) are widely used to compare the performance of different anomaly detectors. They mainly focus on point-based detection. However, the detection of collective anomalies concerns two factors: whether this outlier is detected and what percentage of this outlier is detected. The first factor is not reflected in the AUC. Another problem is the possible shift between the anomaly score and the real outlier due to the application of the sliding window. To tackle these problems, we incorporate the idea of range-based precision and recall, and suggest the range-based ROC and its counterpart in the precision-recall space, which provides a new evaluation for the collective anomalies. We finally introduce a new measure VUS (Volume Under the Surface) which corresponds to the averaged range-based measure when we vary the range size. We demonstrate in a large experimental evaluation that the proposed measures are significantly more robust to important criteria (such as lag and noise) and also significantly more useful to separate correctly the accurate from the inaccurate methods.
- News of the Year: In 2025, we published a new paper introducing two optimized implementations of VUS that significantly reduce the execution time of the initial implementation. Publication: https://inria.hal.science/hal-05076186
- URL:
- Publication:
- Contact: Paul Boniol
- Participants: Paul Boniol, Emmanouil Sylligardos, 9 anonymous participants
- Partners: Ohio State University, Université Paris-Descartes
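The idea of scoring a detector without fixing a threshold, and of averaging a range-aware measure over several range sizes, can be illustrated with the following simplified Python sketch. It is not the VUS implementation: the binary dilation of labels by a buffer, the function names, and the toy data are assumptions made for illustration, whereas the actual measures rely on a more refined range-based definition.

```python
def auc_roc(labels, scores):
    """Threshold-free AUC-ROC: probability that an anomalous point
    receives a higher score than a normal one (ties count for 0.5)."""
    pos = [s for l, s in zip(labels, scores) if l == 1]
    neg = [s for l, s in zip(labels, scores) if l == 0]
    if not pos or not neg:
        raise ValueError("both classes must be present")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def dilate(labels, buffer):
    """Extend every anomalous point by `buffer` positions on both sides
    (a crude stand-in for range-based labels)."""
    n = len(labels)
    out = [0] * n
    for i, label in enumerate(labels):
        if label == 1:
            for j in range(max(0, i - buffer), min(n, i + buffer + 1)):
                out[j] = 1
    return out


def averaged_auc(labels, scores, max_buffer):
    """Average the AUC over buffer sizes 0..max_buffer."""
    values = [auc_roc(dilate(labels, b), scores) for b in range(max_buffer + 1)]
    return sum(values) / len(values)


# Toy example: the anomaly score lags the labeled anomaly by one step.
LABELS = [0, 0, 0, 1, 1, 0, 0, 0]
SCORES = [0.0, 0.0, 0.0, 0.0, 0.9, 0.9, 0.0, 0.0]

point_based = auc_roc(LABELS, SCORES)
buffer_averaged = averaged_auc(LABELS, SCORES, max_buffer=1)
```

On this toy input, the lagged detection is penalized less once a buffer around the labeled range is tolerated, which is the kind of robustness to lag that the description above refers to.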
6.1.3 TSB-UAD
- Keywords: Time Series, Anomaly detection, Python, Library
- Scientific Description: The detection of anomalies in time series has gained ample academic and industrial attention. However, no comprehensive benchmark exists to evaluate time-series anomaly detection methods. It is common to use (i) proprietary or synthetic data, often biased to support particular claims, or (ii) a limited collection of publicly available datasets. Consequently, we often observe methods performing exceptionally well in one dataset but surprisingly poorly in another, creating an illusion of progress. To address the issues above, we thoroughly studied over one hundred papers to identify, collect, process, and systematically format datasets proposed in the past decades. We summarize our effort in TSB-UAD, a new benchmark to ease the evaluation of univariate time-series anomaly detection methods. Overall, TSB-UAD contains 13766 time series with labeled anomalies spanning different domains with high variability of anomaly types, ratios, and sizes. TSB-UAD includes 18 previously proposed datasets containing 1980 time series and we contribute two collections of datasets. Specifically, we generate 958 time series using a principled methodology for transforming 126 time-series classification datasets into time series with labeled anomalies. In addition, we present data transformations with which we introduce new anomalies, resulting in 10828 time series with varying complexity for anomaly detection. Finally, we evaluate 12 representative methods demonstrating that TSB-UAD is a robust resource for assessing anomaly detection methods. TSB-UAD provides a valuable, reproducible, and frequently updated resource to establish a leaderboard of univariate time-series anomaly detection methods.
- Functional Description: TSB-UAD is a new open, end-to-end benchmark suite to ease the evaluation of univariate time-series anomaly detection methods. Overall, TSB-UAD contains 12686 time series with labeled anomalies spanning different domains with high variability of anomaly types, ratios, and sizes. Specifically, TSB-UAD includes 18 previously proposed datasets containing 1980 time series from real-world data science applications. Motivated by flaws in certain datasets and evaluation strategies in the literature, we study anomaly types and data transformations to contribute two collections of datasets. Specifically, we generate 958 time series using a principled methodology for transforming 126 time-series classification datasets into time series with labeled anomalies. In addition, we present a set of data transformations with which we introduce new anomalies in the public datasets, resulting in 10828 time series (92 datasets) with varying difficulty for anomaly detection.
- URL:
- Contact: Paul Boniol
- Participants: Paul Boniol, Emmanouil Sylligardos, 5 anonymous participants
- Partners: Université Paris-Descartes, Ohio State University
6.1.4 ADecimo
- Name: A Web-app for the Evaluation of Model selection for Anomaly Detection in Time Series
- Keywords: Time Series, Anomaly detection, Web Application
- Scientific Description: Anomaly detection is a fundamental task for time-series analytics with important implications for the downstream performance of many applications. Despite increasing academic interest and the large number of methods proposed in the literature, recent benchmark and evaluation studies demonstrated that there exists no single best anomaly detection method when applied to heterogeneous time series datasets. Therefore, the only scalable and viable solution to solve anomaly detection over very different time series collected from diverse domains is to propose a model selection method that will choose, based on time series characteristics, the best anomaly detection method to run. This paper describes ADecimo, a modular and extensible web application that helps users understand the performance of time series classification algorithms used as model selection methods for time series anomaly detection. Overall, our system enables users to compare 17 different classifiers over 1980 time series, and decide on the most suitable time series classification method for their own time series and use cases.
- Functional Description: We present here ADecimo, a modular and extensible web application that helps users understand the performance of time series classification algorithms used as model selection methods for time series anomaly detection. Overall, our system enables users to compare 17 different classifiers over 1980 time series, and decide on the most suitable time series classification method for their own time series and use cases.
- URL:
- Publication:
- Contact: Paul Boniol
- Participants: Paul Boniol, Emmanouil Sylligardos, 3 anonymous participants
6.1.5 MSAD
- Name: Model Selection for Anomaly Detection
- Keywords: Time Series, Machine learning, Classification, Ensemble classifier, Python
- Scientific Description: Anomaly detection is a fundamental task for time-series analytics with important implications for the downstream performance of many applications. Despite increasing academic interest and the large number of methods proposed in the literature, recent benchmark and evaluation studies demonstrated that no overall best anomaly detection methods exist when applied to very heterogeneous time series datasets. Therefore, the only scalable and viable solution to solve anomaly detection over very different time series collected from diverse domains is to propose a model selection method that will select, based on time series characteristics, the best anomaly detection method to run. Existing AutoML solutions are, unfortunately, not directly applicable to time series anomaly detection, and no evaluation of time series-based approaches for model selection exists. Towards that direction, this paper studies the performance of time series classification methods used as model selection for anomaly detection. Overall, we compare 17 different classifiers over 1800 time series, and we propose the first extensive experimental evaluation of time series classification as model selection for anomaly detection. Our results demonstrate that model selection methods outperform every single anomaly detection method while being in the same order of magnitude regarding execution time. This evaluation is the first step to demonstrate the accuracy and efficiency of time series classification algorithms for anomaly detection, and represents a strong baseline that can then be used to guide the model selection step in general AutoML pipelines.
- Functional Description: MSAD proposes a pipeline for model selection based on time series classification and an extensive experimental evaluation of existing classification algorithms for this new pipeline. Our results demonstrate that model selection methods outperform every single anomaly detection method while being in the same order of magnitude regarding execution time.
- News of the Year: In 2025, we published a new paper that extended the model selection pipeline, improving performance in Out-of-Distribution (OoD) settings. Paper: https://inria.hal.science/hal-05343228
- URL:
- Publication:
- Contact: Paul Boniol
- Participants: Emmanouil Sylligardos, Paul Boniol, Pierre Senellart, 2 anonymous participants
- Partners: Ohio State University, Université Paris-Descartes
6.1.6 apxproof
- Keyword: LaTeX
- Functional Description: apxproof is a LaTeX package facilitating the typesetting of research articles with proofs in appendix, a common practice in database theory and theoretical computer science in general. The appendix material is written in the LaTeX code along with the main text which it naturally complements, and it is automatically deferred. The package can automatically send proofs to the appendix, can repeat in the appendix the theorem environments stated in the main text, can section the appendix automatically based on the sectioning of the main text, and supports a separate bibliography for the appendix material.
- Release Contributions: Fix forward linking when used in conjunction with aliascnt (e.g., in Springer classes); compatibility with recent versions of acmart.cls.
- News of the Year: Fix forward linking when used in conjunction with aliascnt (e.g., in Springer classes); compatibility with recent versions of acmart.cls; support for user-defined claimproof environments; removal of the forward linking command from PDF bookmarks.
- URL:
- Contact: Pierre Senellart
- Participant: Pierre Senellart
7 New results
7.1 Research axis 1: Foundations of data management
Participants: Camille Bourgaux, Anatole Dahan, Robin Jean, Lucas Larroque, Arthur Lombardo, Michael Thomazo, Luc Segoufin.
Knowledge representation and knowledge bases
In 28, 27, we establish a correspondence between (fragments of) $\mathcal{TEL}^{\bigcirc}$, a temporal extension of the $\mathcal{EL}$ description logic with the LTL operator $\bigcirc^k$, and some specific kinds of formal grammars, in particular, conjunctive grammars (context-free grammars equipped with the operation of intersection). This connection implies that $\mathcal{TEL}^{\bigcirc}$ does not possess the property of ultimate periodicity of models, and further leads to undecidability of query answering in $\mathcal{TEL}^{\bigcirc}$, closing a question left open since the introduction of $\mathcal{TEL}^{\bigcirc}$. Moreover, it also allows us to establish decidability of query answering for some new interesting fragments of $\mathcal{TEL}^{\bigcirc}$, and to reuse for this purpose existing tools and algorithms for conjunctive grammars.
Consistent query answering
In 17, we consider the dichotomy conjecture for consistent query answering under primary key constraints. It states that, for every fixed Boolean conjunctive query $q$, testing whether $q$ is certain (i.e., whether it evaluates to true over all repairs of a given inconsistent database) is either polynomial time or coNP-complete. This conjecture has been verified for self-join-free and path queries. We propose a simple inflationary fixpoint algorithm for consistent query answering which, for a given database, naively computes a set $\Delta$ of subsets of facts of the database of size at most $k$, where $k$ is the size of the query $q$. The algorithm runs in polynomial time and can be formally defined as: (1) Initialize $\Delta$ with all sets $S$ of at most $k$ facts such that $S \models q$. (2) Add any set $S$ of at most $k$ facts to $\Delta$ if there exists a block $B$ (i.e., a maximal set of facts sharing the same key) such that for every fact $a \in B$ there is a set $S' \subseteq S \cup \{a\}$ such that $S' \in \Delta$. For an input database $D$, the algorithm answers "$q$ is certain" iff $\Delta$ eventually contains the empty set. The algorithm correctly computes certainty when the query $q$ falls in the polynomial time cases of the known dichotomies for self-join-free queries and path queries. For arbitrary Boolean conjunctive queries, the algorithm is an under-approximation: the query is guaranteed to be certain if the algorithm claims so. However, there are polynomial time certain queries (with self-joins) which are not identified as such by the algorithm.
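The two steps of this fixpoint computation can be sketched in Python as follows. This is a naive illustration and not the implementation studied in 17: the encoding of facts as `(relation, key, value)` triples, the `satisfies` callback standing for the test that a set of facts satisfies the query, the name `delta` for the computed set, and the example query and databases are all assumptions.

```python
from itertools import combinations


def blocks(db):
    """Group facts (relation, key, value) into blocks sharing relation and key."""
    grouped = {}
    for fact in db:
        grouped.setdefault(fact[:2], set()).add(fact)
    return list(grouped.values())


def certain_underapprox(db, satisfies, k):
    """Return True if the fixpoint derives the empty set, i.e. the
    under-approximation claims that the query is certain."""
    facts = list(db)
    candidates = [frozenset(c) for n in range(k + 1)
                  for c in combinations(facts, n)]
    # Step (1): all sets of at most k facts that satisfy the query.
    delta = {s for s in candidates if satisfies(s)}
    db_blocks = blocks(db)
    changed = True
    while changed:  # Step (2), repeated until nothing new is added.
        changed = False
        for s in candidates:
            if s in delta:
                continue
            # Is there a block B such that, for every fact u of B,
            # some subset of s ∪ {u} is already in delta?
            if any(all(any(t <= s | {u} for t in delta) for u in b)
                   for b in db_blocks):
                delta.add(s)
                changed = True
    return frozenset() in delta


def q(facts):
    """Example query (assumed): exists x, y, z with R(x, y) and S(y, z)."""
    return any(r[0] == "R" and s[0] == "S" and r[2] == s[1]
               for r in facts for s in facts)


# R has two conflicting facts for key 1; both choices join with S.
DB_CERTAIN = {("R", 1, 2), ("R", 1, 3), ("S", 2, "c"), ("S", 3, "c")}
# Without S(3, c), the repair keeping R(1, 3) falsifies the query.
DB_NOT_CERTAIN = {("R", 1, 2), ("R", 1, 3), ("S", 2, "c")}
```

With `k = 2` (the number of atoms of the example query), the sketch answers "certain" on the first database and not on the second, in line with the guarantee that a positive answer is always sound.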
The Chase and Existential Rules
In 29, we study the chase, a fundamental algorithm with ubiquitous uses in database theory. Given a database and a set of existential rules (aka tuple-generating dependencies), it iteratively extends the database to ensure that the rules are satisfied in a most general way. This process may not terminate, and a major problem is to decide whether it does. This problem has been studied for a large number of chase variants, which differ by the conditions under which a rule is applied to extend the database. Surprisingly, the complexity of the universal termination of the restricted (aka standard) chase is not fully understood. We close this gap by placing universal restricted chase termination in the analytical hierarchy. This higher hardness is due to the fairness condition, and we propose an alternative condition to reduce the hardness of universal termination.
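To make the procedure concrete, here is a minimal Python sketch of a restricted chase with a step budget. It is only an illustration under simplifying assumptions: atoms are tuples of strings, variables start with `?`, fresh nulls are named `_:n0`, `_:n1`, and so on, and the simple rule-order strategy makes no attempt to enforce fairness. All names and example rules are hypothetical.

```python
from itertools import count


def is_var(term):
    return term.startswith("?")


def matches(atoms, facts, sub):
    """Yield every extension of `sub` that maps all `atoms` into `facts`."""
    if not atoms:
        yield dict(sub)
        return
    first, rest = atoms[0], atoms[1:]
    for fact in facts:
        if fact[0] != first[0] or len(fact) != len(first):
            continue
        new = dict(sub)
        if all(new.setdefault(t, v) == v if is_var(t) else t == v
               for t, v in zip(first[1:], fact[1:])):
            yield from matches(rest, facts, new)


def restricted_chase(facts, rules, max_steps):
    """Apply rules until no trigger is active or the budget runs out.

    Returns (facts, terminated); terminated is True only if a fixpoint
    was confirmed within max_steps rule applications.
    """
    facts = set(facts)
    fresh = count()
    for _ in range(max_steps):
        step = None
        for body, head in rules:
            for h in matches(body, list(facts), {}):
                # Restricted chase: apply only if the head is not yet satisfied.
                if next(matches(head, facts, h), None) is None:
                    step = (head, h)
                    break
            if step:
                break
        if step is None:
            return facts, True
        head, h = step
        for atom in head:
            for t in atom[1:]:
                if is_var(t) and t not in h:
                    h[t] = f"_:n{next(fresh)}"  # fresh null for an existential
        facts |= {(a[0],) + tuple(h.get(t, t) for t in a[1:]) for a in head}
    return facts, False


# Every employee works somewhere: the chase terminates.
WORK_RULES = [([("Emp", "?x")], [("WorksIn", "?x", "?y")])]
work_facts, work_done = restricted_chase(
    {("Emp", "a"), ("Emp", "b"), ("WorksIn", "a", "d1")}, WORK_RULES, 10)

# Every person has a parent, who is a person: the chase never terminates.
PARENT_RULES = [([("Person", "?x")], [("HasParent", "?x", "?y")]),
                ([("HasParent", "?x", "?y")], [("Person", "?y")])]
parent_facts, parent_done = restricted_chase(
    {("Person", "alice")}, PARENT_RULES, 5)
```

The first rule set reaches a fixpoint after a single application (only employee `b` needs a fresh null), while the second keeps creating new nulls, so the sketch stops when its budget is exhausted.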
In 34, we address one of the fundamental open questions in the realm of existential rules: the conjecture on the finite controllability of bounded derivation depth rule sets (bdd $\Rightarrow$ fc). We take a step toward a positive resolution of this conjecture by demonstrating that universal models generated by bdd rule sets cannot contain arbitrarily large tournaments (arbitrarily directed cliques) without entailing a loop query, $\exists x\, E(x, x)$. This simple yet elegant result narrows the space of potential counterexamples to the (bdd $\Rightarrow$ fc) conjecture.
Other aspects of theoretical computer science
Our research occasionally touches other aspects of theoretical computer science not related to data management.
In 31, we introduce an extension of fixed-point logic (FP) with a group-order operator (ord), that computes the size of a group generated by a definable set of permutations. This operation is a generalization of the rank operator (rk). We show that FP + ord constitutes a new candidate logic for the class of polynomial-time computable queries (P). As was the case for FP + rk, the model-checking of FP + ord formulae is polynomial-time computable. Moreover, the query separating FP + rk from P exhibited by Lichter in his recent breakthrough is definable in FP + ord. Precisely, we show that FP + ord canonizes structures with Abelian colors, a class of structures which contains Lichter's counter-example. This proof involves expressing a fragment of the group-theoretic approach to graph canonization in the logic FP + ord.
7.2 Research axis 2: Uncertainty, provenance, and explainability in data management
Participants: Camille Bourgaux, Robin Jean, Pierre Senellart, Aryak Sen.
Inconsistent knowledge bases
Repair-based semantics have been extensively studied as a means of obtaining meaningful answers to queries posed over inconsistent knowledge bases (KBs). While several works have considered how to exploit a priority relation between facts to select optimal repairs, the question of how to specify such preferences remains largely unaddressed. This motivates us in 23, 22 to introduce a declarative rule-based framework for specifying and computing a priority relation between conflicting facts. As the expressed preferences may contain undesirable cycles, we consider the problem of determining when a set of preference rules always yields an acyclic relation, and we also explore a pragmatic approach that extracts an acyclic relation by applying various cycle removal techniques. Towards an end-to-end system for querying inconsistent KBs, we present a preliminary implementation and experimental evaluation of the framework, which employs answer set programming to evaluate the preference rules, apply the desired cycle resolution techniques to obtain a priority relation, and answer queries under prioritized-repair semantics.
In 25, 24, we explore the issue of inconsistency handling in DatalogMTL, an extension of Datalog with metric temporal operators. Since facts are associated with time intervals, there are different ways to restore consistency when they contradict the rules, such as removing facts or modifying their time intervals. Our first contribution is the definition of relevant notions of conflicts (minimal explanations for inconsistency) and repairs (possible ways of restoring consistency) for this setting, together with a study of the properties of these notions and of the associated inconsistency-tolerant semantics. Our second contribution is a data complexity analysis of the tasks of generating a single conflict or repair and of query entailment under repair-based semantics.
Provenance and probability management
Ensemble methods aggregate the predictions of multiple models by some form of weighted voting. In 33, we consider the impact of the choice of the assignment of voting power to every individual model on the performance of ensemble methods. We empirically and comparatively evaluate the accuracy and running time of the different power voting ensemble methods using standard classifiers and mainstream classification benchmarks. The results show that power ensemble voting outperforms the equal-power baseline, and that unsupervised learning of the voting power can be competitive with respect to supervised learning; within supervised approaches, learning voting power through Shapley values and regression outperforms simply using accuracy.
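To make the notion of voting power concrete, here is a minimal Python sketch of weighted voting, comparing the equal-power baseline with weights set to each model's accuracy. The predictions, the accuracy values, and the function name are assumptions made for illustration; they are not taken from the evaluated methods.

```python
from collections import defaultdict


def weighted_vote(predictions, weights):
    """Aggregate one predicted label per model by weighted voting.

    predictions: list of class labels, one per model
    weights: list of non-negative voting powers, one per model
    """
    scores = defaultdict(float)
    for label, weight in zip(predictions, weights):
        scores[label] += weight
    # Iterate over sorted labels so that ties are broken deterministically.
    return max(sorted(scores), key=lambda label: scores[label])


predictions = ["cat", "dog", "dog"]
equal_power = [1.0, 1.0, 1.0]
accuracy_power = [0.9, 0.4, 0.4]  # hypothetical validation accuracies

print(weighted_vote(predictions, equal_power))     # majority wins
print(weighted_vote(predictions, accuracy_power))  # the most accurate model wins
```

With equal power the two weaker models outvote the stronger one; with accuracy-based power the single accurate model carries the vote.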
The Shapley value provides a principled framework for attributing marginal contributions to players in coalitional games. While its axiomatic fairness guarantees have made it a cornerstone of value distribution in economics and multi-agent systems, recent computational advances have extended its applicability to data-driven domains. 32 bridges game-theoretic foundations with probabilistic reasoning by studying Shapley-like scores in stochastic environments. We prove that the expected Shapley value (EShap), that is, a player's average impact in a game with an independent probabilistic setting, coincides with the Shapley value of the game whose utility is the expected utility of the original game (ShapE). This equality, however, fails for other power indices, such as the Banzhaf index, underscoring that this consistency under uncertainty is specific to the Shapley value. We further identify a class of coefficients (including normalized Banzhaf indices) for which the equality persists, broadening the scope of reliable attribution mechanisms.
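The equality between EShap and ShapE can be checked by brute force on a toy game. The sketch below assumes that each player is present independently with a given probability and that an absent player receives a score of zero; the game, the probabilities, and all function names are hypothetical, and exact rational arithmetic is used so that the comparison is not affected by rounding.

```python
from fractions import Fraction
from itertools import combinations
from math import factorial


def subsets(players):
    players = list(players)
    for size in range(len(players) + 1):
        for combo in combinations(players, size):
            yield frozenset(combo)


def shapley(players, utility, i):
    """Shapley value of player i in the game (players, utility)."""
    n = len(players)
    others = [p for p in players if p != i]
    total = Fraction(0)
    for s in subsets(others):
        coeff = Fraction(factorial(len(s)) * factorial(n - len(s) - 1), factorial(n))
        total += coeff * (utility(s | {i}) - utility(s))
    return total


def world_probability(world, probs):
    """Probability that exactly the players in `world` are present."""
    result = Fraction(1)
    for player, prob in probs.items():
        result *= prob if player in world else 1 - prob
    return result


def expected_shapley(probs, utility, i):
    """EShap: Shapley value of i averaged over random sets of present players."""
    total = Fraction(0)
    for world in subsets(probs):
        if i in world:  # an absent player is assumed to score zero
            total += world_probability(world, probs) * shapley(world, utility, i)
    return total


def shapley_of_expectation(probs, utility, i):
    """ShapE: Shapley value of i in the game whose utility is the expected utility."""
    def expected_utility(s):
        return sum(world_probability(w, probs) * utility(s & w) for w in subsets(probs))
    return shapley(list(probs), expected_utility, i)


# Hypothetical game: a coalition wins if it contains both a and b, or contains c.
def utility(s):
    return 1 if {"a", "b"} <= s or "c" in s else 0


probs = {"a": Fraction(1, 2), "b": Fraction(1, 3), "c": Fraction(3, 4)}
for player in probs:
    print(player, expected_shapley(probs, utility, player),
          shapley_of_expectation(probs, utility, player))
```

The enumeration is exponential in the number of players, so this is only a sanity check on small games, not a practical way to compute either score.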
ProvSQL is a PostgreSQL extension implementing provenance management and probabilistic database features. ProvSQL seamlessly extends relational database functionality to support the storage, tracking through derivations and transformations, and querying of metadata that explain and qualify the data and query results. In 40, ProvSQL is used to implement a content-based image retrieval system. A deep learning object detection model identifies objects of selected classes located within the images of a large-scale image data set. The uncertainty associated with object detection is recorded. ProvSQL's provenance model incorporates this uncertainty into the retrieval process, thus facilitating the generation of accurate and reliable results and allowing for decision-making in scenarios with incomplete or uncertain information. The demonstration illustrates how ProvSQL handles query processing, uncertainty tracking, and probability computation. It highlights the utility of a probabilistic database for applications dealing with uncertain data, compared to traditional threshold-based approaches.
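The contrast with threshold-based retrieval can be illustrated with a small Python sketch. It assumes, for illustration only, that detections within an image are independent events and that a detection confidence can be read as a probability; the confidence values and function names are hypothetical and this is not ProvSQL's interface.

```python
from math import prod


def prob_at_least_one(confidences):
    """Probability that at least one detection is correct, assuming independence."""
    return 1 - prod(1 - c for c in confidences)


def threshold_match(confidences, threshold=0.5):
    """Threshold-based retrieval: keep the image if any detection passes the threshold."""
    return any(c >= threshold for c in confidences)


# Hypothetical confidences for three detections of the same class in one image.
detections = [0.4, 0.4, 0.4]

print(threshold_match(detections))    # the image is discarded by thresholding
print(prob_at_least_one(detections))  # yet the class is likely present
```

Thresholding discards each weak detection separately, whereas the probabilistic view accumulates the evidence carried by several uncertain detections.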
In 39, we further enhance ProvSQL by enabling provenance tracking for update operations (DELETE, INSERT, UPDATE). We illustrate the practical utility of update provenance by implementing a temporal database capable of standard operations, including time travel (inspecting past database states), history tracking (monitoring tuple states over time), and undo (reversing previous updates). These features rely on a provenance formalism based on the union-of-intervals m-semiring. Additionally, we emphasize a key advantage of using a semiring-based provenance model: its generality allows the same semiring structure to seamlessly support various applications, such as probabilistic databases, by simply modifying the semiring definition.
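The generality of semiring-based provenance can be sketched in a few lines of Python: one provenance expression is evaluated under several semirings simply by swapping the semiring definition. The expression encoding, the `Semiring` class, and the validity semiring below (sets of discrete time points, a simplified stand-in for unions of intervals) are assumptions made for illustration and do not reflect ProvSQL's actual interface or data structures.

```python
from dataclasses import dataclass
from typing import Any, Callable


@dataclass(frozen=True)
class Semiring:
    zero: Any
    one: Any
    plus: Callable[[Any, Any], Any]
    times: Callable[[Any, Any], Any]


def evaluate(expr, semiring, valuation):
    """Evaluate a provenance expression in the given semiring.

    expr is ("var", name), ("plus", e1, e2) or ("times", e1, e2).
    """
    kind = expr[0]
    if kind == "var":
        return valuation[expr[1]]
    left = evaluate(expr[1], semiring, valuation)
    right = evaluate(expr[2], semiring, valuation)
    if kind == "plus":
        return semiring.plus(left, right)
    if kind == "times":
        return semiring.times(left, right)
    raise ValueError(f"unknown node kind: {kind}")


ALL_TIME = frozenset(range(10))  # hypothetical discrete timeline 0..9
counting = Semiring(0, 1, lambda a, b: a + b, lambda a, b: a * b)
boolean = Semiring(False, True, lambda a, b: a or b, lambda a, b: a and b)
validity = Semiring(frozenset(), ALL_TIME, lambda a, b: a | b, lambda a, b: a & b)

# Provenance of a result obtained either by joining t1 with t2, or from t3 alone.
expr = ("plus", ("times", ("var", "t1"), ("var", "t2")), ("var", "t3"))

print(evaluate(expr, counting, {"t1": 2, "t2": 3, "t3": 1}))
print(evaluate(expr, boolean, {"t1": True, "t2": False, "t3": False}))
print(evaluate(expr, validity, {
    "t1": frozenset(range(0, 5)),
    "t2": frozenset(range(3, 8)),
    "t3": frozenset({9}),
}))
```

Under the validity semiring, the result is valid at the time points where both joined tuples are valid, plus those where the alternative tuple is valid, which is the intuition behind time travel queries.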
7.3 Research axis 3: Knowledge discovery at scale
Participants: Paul Boniol, Felix Chavelli, Antoine Gauquier, Magali Parrino, Pierre Senellart, Marijan Soric, Emmanouil Sylligardos.
Mining time series
Recent advances in data collection technology, accompanied by the ever-rising volume and velocity of streaming data, underscore the vital need for time series analytics. In this regard, time-series anomaly detection has been an important activity, entailing various applications in fields such as cyber security, financial markets, law enforcement, and health care. While traditional literature on anomaly detection is centered on statistical measures, the increasing number of machine learning algorithms in recent years calls for a structured, general characterization of the research methods for time-series anomaly detection. In 36, we present a process-centric taxonomy for time-series anomaly detection methods, systematically categorizing traditional statistical approaches and contemporary machine learning techniques. Beyond this taxonomy, we conduct a meta-analysis of the existing literature to identify broad research trends. Given the absence of a one-size-fits-all anomaly detector, we also introduce emerging trends for time-series anomaly detection. Furthermore, we review commonly used evaluation measures and benchmarks, followed by an analysis of benchmark results to provide insights into the impact of different design choices on model performance. Through these contributions, we aim to provide a holistic perspective on time-series anomaly detection and highlight promising avenues for future investigation.
Anomaly detection is a fundamental task for time series analytics with important implications for the downstream performance of many applications. Despite increasing academic interest and the large number of methods proposed in the literature, recent benchmarks and evaluation studies demonstrated that no single best anomaly detection method exists when applied to very heterogeneous time series datasets. Therefore, the only scalable and viable solution to solve anomaly detection over very different time series collected from diverse domains is to propose a model selection method that will select, based on time series characteristics, the best anomaly detection methods to run. Existing AutoML solutions are, unfortunately, not directly applicable to time series anomaly detection, and no evaluation of time series-based approaches for model selection exists. Towards that direction, 19 studies the performance of time series classification methods used as model selection for anomaly detection. In total, we evaluate 234 model configurations derived from 16 base classifiers across more than 1980 time series, and we propose the first extensive experimental evaluation of time series classification as model selection for anomaly detection. Our results demonstrate that model selection methods outperform every single anomaly detection method while being in the same order of magnitude regarding execution time. This evaluation is the first step to demonstrate the accuracy and efficiency of time series classification algorithms for anomaly detection.
In contrast to other domains where anomaly detection (AD) mainly focuses on point-based anomalies (i.e., outliers in standalone observations), AD for time series is also concerned with range-based anomalies (i.e., outliers spanning multiple observations). Nevertheless, it is common to use traditional point-based information retrieval measures, such as Precision, Recall, and F-score, to assess the quality of methods by thresholding the anomaly score to mark each point as an anomaly or not. However, mapping discrete labels into continuous data introduces unavoidable shortcomings, complicating the evaluation of range-based anomalies. Notably, the choice of evaluation measure may significantly bias the experimental outcome. Despite over six decades of attention, there has never been a large-scale systematic quantitative and qualitative analysis of time-series AD evaluation measures. 15 extensively evaluates quality measures for time-series AD to assess their robustness under noise, misalignments, and different anomaly cardinality ratios. Our results indicate that measures producing quality values independently of a threshold (i.e., AUC-ROC and AUC-PR) are more suitable for time-series AD. Motivated by this observation, we first extend the AUC-based measures to account for range-based anomalies. Then, we introduce a new family of parameter-free and threshold-independent measures, Volume Under the Surface (VUS), to evaluate methods while varying parameters. We also introduce two optimized implementations for VUS that significantly reduce the execution time of the initial implementation. Our findings demonstrate that our four measures are significantly more robust in assessing the quality of time-series AD methods.
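The sensitivity of thresholded measures can be seen on a toy example. The Python sketch below computes a point-based F-score at several thresholds and a point-based AUC-ROC for the same anomaly scores; the labels and scores are made up, and the code covers neither the range-based extensions nor VUS.

```python
def f_score(labels, scores, threshold):
    """Point-based F1 after thresholding the anomaly score."""
    predicted = [s >= threshold for s in scores]
    tp = sum(1 for y, p in zip(labels, predicted) if y and p)
    fp = sum(1 for y, p in zip(labels, predicted) if not y and p)
    fn = sum(1 for y, p in zip(labels, predicted) if y and not p)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def auc_roc(labels, scores):
    """AUC-ROC as the probability that an anomalous point outranks a normal one."""
    positives = [s for y, s in zip(labels, scores) if y]
    negatives = [s for y, s in zip(labels, scores) if not y]
    wins = sum((p > n) + 0.5 * (p == n) for p in positives for n in negatives)
    return wins / (len(positives) * len(negatives))


# Hypothetical ground truth (1 = anomaly) and anomaly scores for eight points.
labels = [0, 0, 0, 1, 1, 0, 0, 0]
scores = [0.1, 0.2, 0.3, 0.8, 0.6, 0.4, 0.2, 0.1]

for threshold in (0.25, 0.5, 0.7):
    print(threshold, f_score(labels, scores, threshold))
print("AUC-ROC", auc_roc(labels, scores))
```

The same scores yield different F-scores depending on the threshold, while AUC-ROC summarizes the ranking quality in a single value that needs no threshold.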
Motif Discovery involves identifying recurring patterns and locating their occurrences within a time series without prior knowledge about their shape or location. In practice, Motif Discovery faces several data-related challenges, leading to various definitions of the problem and multiple algorithms addressing these challenges to different extents. However, there has been no systematic evaluation and comparison of these diverse approaches. Consequently, 18 presents a comprehensive literature review covering data-related challenges, motif definitions, and algorithms. We also analyze the strengths and limitations of algorithms carefully chosen to represent the literature diversity. The analysis is structured around key research questions identified from our review. Our experimental findings provide practical guidelines for selecting Motif Discovery algorithms suitable for a given task and suggest directions for future research.
Time series clustering poses a significant challenge with diverse applications across domains. A prominent drawback of existing solutions lies in their limited interpretability, often confined to presenting users with centroids. In addressing this gap, 16 presents k-Graph, an unsupervised method explicitly crafted to augment interpretability in time series clustering. Leveraging a graph representation of time series subsequences, k-Graph constructs multiple graph representations based on different subsequence lengths. This feature accommodates variable-length time series without requiring users to predetermine subsequence lengths. Our experimental results reveal that k-Graph outperforms current state-of-the-art time series clustering algorithms in accuracy, while providing users with meaningful explanations and interpretations of the clustering outcomes.
Time series clustering is important for identifying patterns in temporal datasets. However, prevailing methods often encounter obstacles in maintaining data relationships and ensuring interpretability. We present in 26 Graphint, an innovative system based on the k-Graph methodology that addresses these challenges. Graphint integrates a robust time series clustering algorithm with an interactive tool for comparison and interpretation. More precisely, our system allows users to compare results against competing approaches, identify discriminative subsequences within specified datasets, and visualize the critical information utilized by k-Graph to generate outputs. Overall, Graphint offers a comprehensive solution for extracting actionable insights from complex temporal datasets.
Time series segmentation is a fundamental task in analyzing temporal data across various domains, from human activity recognition to energy monitoring. While numerous state-of-the-art methods have been developed to tackle this problem, the evaluation of their performance remains critically limited. Existing measures predominantly focus on change point accuracy or rely on point-based measures such as the Adjusted Rand Index (ARI), which fail to capture the quality of the detected segments, ignore the nature of errors, and offer limited interpretability. In 30, we address these shortcomings by introducing two novel evaluation measures: WARI (Weighted Adjusted Rand Index), which accounts for the position of segmentation errors, and SMS (State Matching Score), a fine-grained measure that identifies and scores four fundamental types of segmentation errors while allowing error-specific weighting. We empirically validate WARI and SMS on synthetic and real-world benchmarks, showing that they not only provide a more accurate assessment of segmentation quality but also uncover insights, such as error provenance and type, that are inaccessible with traditional measures.
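The limitation of the plain ARI can be reproduced on a toy segmentation. The Python sketch below implements the standard ARI on point-wise state labels and applies it to two hypothetical predictions that each mislabel one point, once next to the change point and once in the middle of a segment. It does not implement WARI or SMS.

```python
from collections import Counter
from math import comb


def adjusted_rand_index(truth, pred):
    """Standard Adjusted Rand Index between two point-wise labelings."""
    n = len(truth)
    sum_cells = sum(comb(c, 2) for c in Counter(zip(truth, pred)).values())
    sum_rows = sum(comb(c, 2) for c in Counter(truth).values())
    sum_cols = sum(comb(c, 2) for c in Counter(pred).values())
    expected = sum_rows * sum_cols / comb(n, 2)
    max_index = (sum_rows + sum_cols) / 2
    if max_index == expected:
        return 1.0  # degenerate case, e.g. a single cluster in both labelings
    return (sum_cells - expected) / (max_index - expected)


# Hypothetical ground truth: two states, change point after the fifth observation.
truth = [0, 0, 0, 0, 0, 1, 1, 1, 1, 1]
shifted = [0, 0, 0, 0, 1, 1, 1, 1, 1, 1]   # change point detected one step early
isolated = [0, 0, 1, 0, 0, 1, 1, 1, 1, 1]  # spurious state in the middle of a segment

print(adjusted_rand_index(truth, shifted))
print(adjusted_rand_index(truth, isolated))
```

Both predictions obtain exactly the same ARI, because the measure only depends on how many points fall in each pair of states and not on where the errors occur.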
In recent years, electricity suppliers have installed millions of smart meters worldwide to improve the management of the smart grid system. These meters collect a large amount of electrical consumption data to produce valuable information to help consumers reduce their electricity footprint. However, having non-expert users (e.g., consumers or sales advisors) understand these data and derive usage patterns for different appliances has become a significant challenge for electricity suppliers because these data record the aggregated behavior of all appliances. At the same time, ground-truth labels (which could train appliance detection and localization models) are expensive to collect and extremely scarce in practice. 37 introduces DeviceScope, an interactive tool designed to facilitate understanding smart meter data by detecting and localizing individual appliance patterns within a given time period. Our system is based on CamAL (Class Activation Map-based Appliance Localization), a novel weakly supervised approach for appliance localization that only requires the knowledge of the existence of an appliance in a household to be trained.
Improving smart grid system management is crucial in the fight against climate change, and enabling consumers to play an active role in this effort is a significant challenge for electricity suppliers. In this regard, millions of smart meters have been deployed worldwide in the last decade, recording the main electricity power consumed in individual households. This data produces valuable information that can help consumers reduce their electricity footprint; nevertheless, the collected signal aggregates the consumption of the different appliances running simultaneously in the house, making it difficult to interpret. Non-Intrusive Load Monitoring (NILM) refers to the challenge of estimating the power consumption, pattern, or on/off state activation of individual appliances using the main smart meter signal. Recent methods proposed to tackle this task are based on a fully supervised deep-learning approach that requires both the aggregate signal and the ground truth of individual appliance power. However, such labels are expensive to collect and extremely scarce in practice, as they require conducting intrusive surveys in households to monitor each appliance. In 38, we introduce CamAL, a weakly supervised approach for appliance pattern localization that only requires information on the presence of an appliance in a household to be trained. CamAL combines an ensemble of deep-learning classifiers with an explainable classification method to localize appliance patterns. Our experimental evaluation, conducted on 4 real-world datasets, demonstrates that CamAL significantly outperforms existing weakly supervised baselines and that current state-of-the-art fully supervised NILM approaches require significantly more labels to reach CamAL's performance.
Information Extraction
35, which is situated within the TheoremKB 41 project, presents TheoremView, a novel framework for extracting proofs and theorems from raw PDF scientific papers without requiring LaTeX source files. Our approach combines three modalities (font, text, and vision) with sequential modeling to capture long-term dependencies and layout information. By eliminating OCR preprocessing, TheoremView reduces computational overhead for real-time applications while providing robust automated theorem extraction.
Graphs, and notably RDF graphs, are a prominent way of sharing data. As data usage democratizes, users need help figuring out the useful content of a graph dataset. In particular, journalists with whom we collaborate are interested in identifying, in a graph, the connections between entities, e.g., people, organizations, emails, etc. In 14, we present a novel method for exploring data graphs through their data paths connecting Named Entities (NEs, in short); each data path leads to a tabular-looking set of results. NEs are extracted from the data through dedicated Information Extraction modules. Our method builds upon the pre-existing ConnectionLens platform and follow-up work in the Abstra project, which builds simple, visual ER-style summaries of semi-structured data. The contribution of the present work, and its novelty, is twofold. First, we propose a novel analysis of entity-to-entity paths contained in datasets of any nature, and propose a new method for ranking paths, leveraging a novel Information Extraction module we built on top of ChatGPT. Second, we present an efficient approach to enumerate and compute NE paths, based on an algorithm which automatically recommends sub-paths to materialize, and rewrites the path queries using these subpaths. Our experiments demonstrate the interest of NE paths and the efficiency of our method for computing and ranking them.
8 Bilateral contracts and grants with industry
8.1 Bilateral contracts with industry
Participants: Paul Boniol, Magali Parrino, Pierre Senellart.
Magali Parrino started her PhD in 2025, under a CIFRE agreement between Valda (Paul Boniol and Pierre Senellart) and EDF (Chatou lab).
9 Partnerships and cooperations
9.1 International initiatives
9.1.1 Participation in other International Programs
-
DesCartes
Participants: Pierre Senellart.
-
Title:
Intelligent Modelling for Decision-making in Critical Urban Systems
-
Partner Institution(s):
CNRS@CREATE, National University of Singapore
-
Duration:
2021–2026
-
Additional info:
DesCartes is a project managed by CNRS@CREATE, a CNRS subsidiary in Singapore, and funded by Singapore’s National Research Foundation, with a total budget of 50 million. Pierre Senellart is involved in the project as one of the French PIs, and in 2025 became Lead PI for one of the workpackages.
-
International ANR project EQUUS
Participants: Luc Segoufin.
-
Title:
Efficient query answering under updates
-
Partner Institution(s):
TU Ilmenau, Uni. Bayreuth, HU Berlin, CNRS
-
Duration:
2020–2025
9.2 International research visitors
9.2.1 Visits of international scientists
Other international visits to the team
Anton Gnatenko
-
Status:
PhD student
-
Institution of origin:
Free University of Bozen–Bolzano
-
Country:
Italy
-
Dates:
December 2024 to May 2025
-
Mobility program:
PhD research visit
Amélie Marian
-
Status:
Professor
-
Institution of origin:
Rutgers University
-
Country:
USA
-
Dates:
March 2026 to April 2026
-
Mobility program:
ENS Visiting Professor
Victor Vianu
-
Status:
Professor
-
Institution of origin:
UC San Diego
-
Country:
USA
-
Dates:
June 2025 to January 2026
-
Mobility program:
Sabbatical
9.2.2 Visits to international teams
Research stays abroad
- Pierre Senellart was an invited participant in the Logic and Algorithms in DB Theory and AI Reunion seminar at UC Berkeley, CA, USA (January 2025)
- Camille Bourgaux was an invited participant in the Semirings in Databases, Automata, and Logic seminar in Dagstuhl, Germany (February 2025)
9.3 National initiatives
9.3.1 ANR
-
PRC EXPAND (coordinator)
Participants: Michael Thomazo, Camille Bourgaux.
-
Title:
Expanding the reach of ontology-based data access: EXpressivity, exPlanation, and Algorithms
-
Partner Institution(s):
LIRMM, LaBRI, LIMOS, Inria Lille (SPIRALS & D-DAL), IRISA
-
Duration:
2025–2030
-
Budget for Valda:
55 k€ (Inria budget)
-
PR[AI]RIE-PSAI AI Cluster
Participants: Pierre Senellart, Camille Bourgaux.
-
Title:
Paris Artificial Intelligence Research Institute – Paris School of AI
-
Duration:
2025–2029
-
Funding for Valda:
575 k€ (ENS budget)
-
Meghyn Bienvenu's INTENDED Chair in Artificial Intelligence
Participants: Camille Bourgaux.
-
Title:
Intelligent handling of imperfect data
-
Partner Institution(s):
LaBRI
-
Duration:
2020–2026
9.3.2 Others
-
France 2030 i-Demo Cyberté project
Participants: Paul Boniol.
-
Partner institution(s):
Scality, Inria Rennes (CIDRE)
-
Duration:
2025–2030
-
Funding for Valda:
499 k€ (Inria budget)
-
CNRS MITI nanoNet project
Participants: Paul Boniol.
-
Title:
Méthodologie avancée pour la détection des nanoparticules dans les séries temporelles spICP-ToF-MS
-
Partner Institution(s):
IPGP
-
Duration:
2025–2026
10 Dissemination
10.1 Promoting scientific activities
10.1.1 Scientific events: organisation
General chair, scientific chair
- Paul Boniol , IEEE BigData 2025, Chair of the Industrial & Government track
- Paul Boniol , International Workshop on Multivariate Time Series Analytics (MulTiSA) 2025, Panel Chair
Member of the organizing committees
- Camille Bourgaux , member of the DL steering committee
- Camille Bourgaux , co-responsible for the MaDICS/RADIA RECAST working group (organization of a thematic day in November, and two sessions of the GDR MaDICS symposium in May)
- Luc Segoufin , member of the STACS steering committee
- Pierre Senellart , editorial board of the LIPIcs series of conference proceedings
10.1.2 Scientific events: selection
Member of the conference program committees
- Paul Boniol , VLDB 2025, EDBT 2025, ICDE 2025 (Industry & applications track), Multisa Workshop of ICDE 2025, BDA 2025, BERT2S Workshop of NeurIPS 2025
- Camille Bourgaux , IJCAI 2025, KR 2025, DL 2025
- Antoine Gauquier , WASP 2025
- Pierre Senellart , SIGMOD 2026, Provenance Week 2025, SDProc 2025
- Michael Thomazo , KR 2025, RuleML+RR 2025, IJCAI 2025
10.1.3 Journal
Member of the editorial boards
- Luc Segoufin , associate editor, ACM Transactions on Computational Logic
- Victor Vianu , editor, Database Theory Column, SIGACT News
Reviewer - reviewing activities
- Pierre Senellart , review for Transactions on Graph Data and Knowledge
10.1.4 Invited talks
- Paul Boniol , Anomaly Detection in Time Series: Overview and New Trends, Invited speaker at Orange Innovation
- Paul Boniol , An introduction to Time series anomaly detection (a data-driven perspective), Invited speaker for SIDOS at EGC 2025, Strasbourg, France (January 2025)
- Pierre Senellart , Quels horizons de pratique pour la recherche en IA?, Invited speaker at Printemps Couperin, Paris, France (March 2025)
- Silviu Maniu (Univ. Grenoble–Alpes) & Pierre Senellart , Making Provenance and Probabilistic Database Theory Work in Practice, Invited talk at ICDT 2025 (Database Theory in Practice), Barcelona, Spain (March 2025) 21
- Pierre Senellart , Qualitative Evaluation of Academic Careers in Computer Science at CNRS, Invited keynote speaker at the Global Forum on Development of Computer Science of Tsinghua University, Beijing, China (April 2025) 44
- Pierre Senellart , Artificial Intelligence. A Personal View. Invited speaker at INSP Days, Paris, France (July 2025)
- Pierre Senellart , Les BD pourront-elles sauver l'IA?, Panel participant, BDA 2025, Toulouse, France (October 2025)
- Pierre Senellart , Intelligence artificielle: Concepts, modèles et enjeux, Invited speaker at Séminaire scientifique et technique de l'Inrap, Chartres, France (November 2025)
- Paul Boniol , Time Series Anomaly Detection: The Road to Automatic Solutions, Invited Speaker at the 3rd Macau Symposium on Data Science, Macau SAR, China (December 2025)
10.1.5 Leadership within the scientific community
- Serge Abiteboul is a member of the French Academy of Sciences, of the Academia Europaea, and an ACM Fellow.
- Pierre Senellart was until August 2025 a junior member of the Institut Universitaire de France.
10.1.6 Research administration
- Serge Abiteboul is a member of the scientific committee of the Programme Inria Quadrant (PIQ).
- Antoine Gauquier is an elected member of the Conseil d'Administration of ENS-PSL
- Antoine Gauquier is an elected member of the DIENS lab council
- Luc Segoufin is a member of the Formation Spécialisée de Site (FSS) of the Inria Paris research centre.
- Pierre Senellart is Vice-President of PSL University in charge of Digital infrastructure and IT convergence. 48
- Pierre Senellart was until August 2025 the president of section 6 of the National Committee for Scientific Research. 43 As a representative of CoNRS, Pierre Senellart was in the Hcéres evaluation committee of the IRIT research unit, and president of the evaluation committee of the LIRMM research unit.
- Pierre Senellart was until August 2025 a member of the board of the conference of presidents of the national committee (CPCN) and as such a member of the coordination of managing parties of the national committee (C3N).
- Pierre Senellart is deputy director of the DI ENS laboratory, joint between ENS, CNRS, and Inria.
- Pierre Senellart is the scientific resource person for Scientific information & edition of the Inria Paris centre.
- Pierre Senellart is the vice-president of the Gilles Kahn PhD award of Société Informatique de France.
- Pierre Senellart is a member of the strategic orientation committee of ISIMA.
- Michael Thomazo is a deputy director of the École Doctorale Sciences Mathématiques de Paris-Centre (ED386)
- We participated in the following hiring committee within universities:
- Camille Bourgaux , Maître de conférences, ENSEIRB-MATMECA-Bordeaux INP
10.2 Teaching - Supervision - Juries - Educational and pedagogical outreach
- Licence: The Art of Computer Programming, L1, International Bachelor of Science in Artificial Intelligence, PSL – Pierre Senellart
- Licence: Algorithms, L1, CPES, PSL – Antoine Gauquier
- Licence: Differential calculus, L2, CPES, PSL – Antoine Gauquier
- Licence: Formal Languages, Computability, Complexity, L3, ENS – Michael Thomazo , Lucas Larroque
- Licence: Databases, L3, ENS – Pierre Senellart , Paul Boniol , Lucas Larroque
- Licence: Practical Computing, L3, École normale supérieure – Pierre Senellart
- Master: Logiques de description, M1, DCI – Camille Bourgaux
- Master: Data acquisition, extraction, and storage, M2, IASD – Pierre Senellart
- Master: Knowledge graphs, description logics, and reasoning on data, M2, IASD – Michael Thomazo
- Master: NoSQL databases, M2, IASD – Paul Boniol
- Professional training: Web Security, PESTO (Corps des Mines professional training) – Pierre Senellart
As a professor at ENS, Pierre Senellart held various teaching responsibilities (M2 administration, entrance competition). Pierre Senellart is the academic director of the graduate program in Computer Science of PSL.
As an adjunct professor at PSL, Michaël Thomazo is in charge of PhD committees within DI ENS and deputy director of the École doctorale.
We also gave invited courses in summer schools:
- Camille Bourgaux , Inconsistency-Tolerant Semantics Based on (Preferred) Repairs, 21st Reasoning Web Summer School (RW 2025) – Istanbul, Turkey 20
- Paul Boniol , Time Series Anomaly Detection, Summer school on Artificial Intelligence for Aerospace – GSSI, L’Aquila, Italy
- Paul Boniol , Time Series Anomaly Detection: Foundations and Practice, TwinODIS 1st Summer School – FORTH-ICS, Heraklion, Greece
Most permanent members of the group are also involved in tutoring ENS students, advising them on their curriculum, their internships, etc. They are also occasionally involved with reviewing internship reports, supervising student projects, etc.
10.2.1 Supervision
- PhD defended: Anatole Dahan, The Role of Permutation Groups in the Search for a Logic for Polynomial Time, 2020–2025, Arnaud Durand (Université Paris-Cité) & Luc Segoufin 42
- PhD in progress: Antoine Gauquier , Intelligent construction of a multimodal and heterogeneous data warehouse, with data traceability, started in September 2023, Pierre Senellart & Ioana Manolescu (Inria Cedar)
- PhD in progress: Lucas Larroque, Extension of rewriting procedures for reasoning using existential rules, started in September 2023, Michaël Thomazo
- PhD in progress: Robin Jean , Integration of preferences and domain knowledge in inconsistency-tolerant ontology-based data access, started in October 2023, Meghyn Bienvenu (CNRS LaBRI) & Camille Bourgaux
- PhD in progress: Aryak Sen , Scalability of a data provenance and probability management system, started in February 2024, Silviu Maniu (Université Grenoble Alpes) & Pierre Senellart
- PhD in progress: Emmanouil Sylligardos , Accuracy and execution time trade-off in ensembling and model selection for time series analytics, started in February 2024, Paul Boniol & Pierre Senellart
- PhD in progress: Felix Chavelli , Graph representations for multivariate time series analytics, started in October 2024, Paul Boniol & Michaël Thomazo
- PhD in progress: Pratik Karmakar , Quality, uncertainty, and lineage of data, Stéphane Bressan (NUS, deceased), Tan Kian-Lee (NUS), & Pierre Senellart (as he is based in Singapore, he is not considered a Valda member)
- PhD in progress: Marijan Soric , Exploitation et structuration des données et des connaissances géologiques hétérogènes, started in March 2025, Pierre Senellart , Ioana Manolescu (Inria Cedar), & Cécile Gracianne (BRGM)
- PhD in progress: Magali Parrino , Détection non-supervisée d’anomalies dans des flux continus de séries temporelles multivariées, started in July 2025, Paul Boniol , Emmanuel Remy (EDF), & Pierre Senellart
- PhD in progress: Arthur Lombardo; started in October 2025, Pierre Senellart , Antoine Amarilli (Inria D-DAL) & Mikaël Monet (Inria D-DAL) (as he is based in Lille, he is considered a D-DAL member)
- Master's internship: Arushi Goyal; Pierre Senellart
- Master's internship: Marijan Soric; Pierre Senellart and Ioana Manolescu (Inria Cedar) 45
- M1 research project: Jeanne Coschieri; Michael Thomazo & David Carral (Inria Boreal)
- M1 research project: Paul Raphaël; Michael Thomazo & Lucas Larroque
10.2.2 Juries
- PhD: François Amat [reviewer], Institut polytechnique de Paris, Pierre Senellart
10.3 Popularization
10.3.1 Specific official responsibilities in science outreach structures
- Serge Abiteboul , President of the scientific steering committee of ANR
- Serge Abiteboul , President of the AFNIC Foundation
- Pierre Senellart is a scientific expert advising the Scientific and Ethical Committee of Parcoursup and MonMaster, the platforms for the selection of higher-education students at the first-year level and the Master's level. As such, he contributed to the 7th yearly report of the committee to the French parliament
10.3.2 Productions (articles, videos, podcasts, serious games, ...)
- Serge Abiteboul , editor of the binaire blog, which moved from the blog platform of Le Monde to that of La Recherche
- Serge Abiteboul , codirector of the Parlez-moi d'IA podcast on Cause commune
- Serge Abiteboul , co-author of articles on theatre and computer science 46, 47
10.3.3 Participation in Live events
- Serge Abiteboul , co-organizer with French Senator Ghislaine Senée of a Colloquium at the Senate : Les données au service des territoires intelligents
- Serge Abiteboul , co-organizer with Isabelle Hilali from Datacraft of the conference: Quantum & Intelligence artificielle : vers une convergence des ruptures technologiques ?
11 Scientific production
11.1 Major publications
- 1 article. Monadic Datalog, Tree Validity, and Limited Access Containment. ACM Transactions on Computational Logic 21(1), 2020, 6:1-6:45.
- 2 inproceedings. Answering Counting Queries over DL-Lite Ontologies. IJCAI 2020 - Twenty-Ninth International Joint Conference on Artificial Intelligence. Proceedings of the Twenty-Ninth International Joint Conference on Artificial Intelligence, IJCAI 2020. Postponed from July 2020 to January 2021 because of COVID. Yokohama, Japan, July 2020.
- 3 inproceedings. ADecimo: Model Selection for Time Series Anomaly Detection. ICDE 2024 - IEEE 40th International Conference on Data Engineering. Utrecht, Netherlands, May 2024.
- 4 inproceedings. Revisiting Semiring Provenance for Datalog. KR 2022 - 19th International Conference on Principles of Knowledge Representation and Reasoning. Proceedings of the 19th International Conference on Principles of Knowledge Representation and Reasoning. Haifa, Israel, July 2022, 91-101.
- 5 inproceedings. Capturing Homomorphism-Closed Decidable Queries with Existential Rules. KR 2021 - 18th International Conference on Principles of Knowledge Representation and Reasoning. Virtual, Vietnam, November 2021, 141-150.
- 6 inproceedings. Parallelisable Existential Rules: a Story of Pieces. KR 2021 - 18th International Conference on Principles of Knowledge Representation and Reasoning. Virtual, Vietnam, November 2021.
- 7 inproceedings. Conjunctive Queries With Self-Joins, Towards a Fine-Grained Complexity Analysis. PODS'23. Seattle, United States, June 2023.
- 8 inproceedings. Coping with Incomplete Data: Recent Advances. SIGMOD/PODS 2020 - International Conference on Management of Data. Portland / Virtual, United States. ACM, June 2020, 33-47.
- 9 article. Tameness and the power of programs over monoids in DA. Logical Methods in Computer Science 18(3), August 2022, 14:1-14:34.
- 10 article. Expected Shapley-Like Scores of Boolean Functions: Complexity and Applications to Probabilistic Databases. Proceedings of the ACM on Management of Data 2(2) (PODS), January 2024.
- 11 article. Enumeration for FO Queries over Nowhere Dense Graphs. Journal of the ACM (JACM) 69(3), June 2022, 1-37.
- 12 article. ProvSQL: Provenance and Probability Management in PostgreSQL. Proceedings of the VLDB Endowment (PVLDB) 11(12), August 2018, 2034-2037.
- 13 article. Troubles with nulls, views from the users. Proceedings of the VLDB Endowment (PVLDB) 15(11), July 2022, 2613-2625.
11.2 Publications of the year
International journals
Invited conferences
International peer-reviewed conferences
National peer-reviewed Conferences
Doctoral dissertations and habilitation theses
Reports & preprints
Other scientific publications
Scientific popularization
11.3 Cited publications
- 49 book. Data on the Web: From Relations to Semistructured Data and XML. Morgan Kaufmann, 1999.
- 50 book. Foundations of Databases. Addison-Wesley, 1995. URL: http://webdam.inria.fr/Alice/
- 51 book. Web Data Management. Cambridge University Press, 2011. URL: http://webdam.inria.fr/Jorge
- 52 incollection. Databases. Computer Science, The Hardware, Software and Heart of It. Springer, 2011, 169-229. URL: https://doi.org/10.1007/978-1-4614-1168-0_10
- 53 article. Adaptive Query Processing. Foundations and Trends in Databases 1(1), 2007, 1-140. URL: https://doi.org/10.1561/1900000001
- 54 article. Answering queries using views: A survey. VLDB J. 10(4), 2001, 270-294. URL: https://doi.org/10.1007/s007780100054
- 55 article. The State of the art in distributed query processing. ACM Comput. Surv. 32(4), 2000, 422-469. URL: http://doi.acm.org/10.1145/371578.371598
- 56 book. Principles of Distributed Database Systems, Third Edition. Springer, 2011. URL: https://doi.org/10.1007/978-1-4419-8834-8
- 57 book. Active Learning. Synthesis Lectures on Artificial Intelligence and Machine Learning. Morgan & Claypool Publishers, 2012. URL: https://doi.org/10.2200/S00429ED1V01Y201207AIM018
- 58 book. Reinforcement learning - an introduction. Adaptive computation and machine learning. MIT Press, 1998. URL: http://www.worldcat.org/oclc/37293240
- 59 inproceedings. On the reliability and intuitiveness of aggregated search metrics. 22nd ACM International Conference on Information and Knowledge Management, CIKM'13, San Francisco, CA, USA, October 27 - November 1, 2013. 2013, 689-698. URL: http://doi.acm.org/10.1145/2505515.2505691