2024 Activity report
Project-Team VALDA
RNSR: 201622223R
- Research center: Inria Paris Centre
- In partnership with: Ecole normale supérieure de Paris, CNRS
- Team name: Value from Data
- In collaboration with: Département d'Informatique de l'Ecole Normale Supérieure
- Domain: Perception, Cognition and Interaction
- Theme: Data and Knowledge Representation and Processing
Keywords
Computer Science and Digital Science
- A3.1. Data
- A3.1.1. Modeling, representation
- A3.1.2. Data management, querying and storage
- A3.1.3. Distributed data
- A3.1.4. Uncertain data
- A3.1.5. Control access, privacy
- A3.1.6. Query optimization
- A3.1.7. Open data
- A3.1.8. Big data (production, storage, transfer)
- A3.1.9. Database
- A3.1.10. Heterogeneous data
- A3.1.11. Structured data
- A3.2. Knowledge
- A3.2.1. Knowledge bases
- A3.2.2. Knowledge extraction, cleaning
- A3.2.3. Inference
- A3.2.4. Semantic Web
- A3.2.5. Ontologies
- A3.2.6. Linked data
- A3.3. Data and knowledge analysis
- A3.3.1. On-line analytical processing
- A3.3.2. Data mining
- A3.3.3. Big data analysis
- A3.4.3. Reinforcement learning
- A3.4.5. Bayesian methods
- A3.5.1. Analysis of large graphs
- A4.7. Access control
- A7.2. Logic in Computer Science
- A7.3. Calculability and computability
- A9.1. Knowledge
- A9.8. Reasoning
Other Research Topics and Application Domains
- B2. Health
- B3.3. Geosciences
- B4. Energy
- B4.2. Nuclear Energy Production
- B9.3. Medias
- B9.5.6. Data science
- B9.6.5. Sociology
- B9.6.10. Digital humanities
- B9.7.2. Open data
- B9.9. Ethics
- B9.10. Privacy
1 Team members, visitors, external collaborators
Research Scientists
- Serge Abiteboul [Inria, Emeritus]
- Paul Boniol [Inria, ISFP]
- Camille Bourgaux [CNRS, Researcher]
- Luc Segoufin [Inria, Senior Researcher, HDR]
- Michaël Thomazo [Inria, Researcher, HDR]
Faculty Member
- Pierre Senellart [Team leader, ENS Paris, Professor, HDR]
Post-Doctoral Fellow
- Sven Dziadek [Inria, Post-Doctoral Fellow, until Aug 2024]
PhD Students
- Felix Chavelli [Inria, from Oct 2024]
- Anatole Dahan [Université de Paris]
- Antoine Gauquier [ENS]
- Robin Jean [CNRS]
- Lucas Larroque [ENS]
- Shrey Mishra [ENS, until Jul 2024]
- Aryak Sen [CNRS, from Feb 2024]
- Emmanouil Sylligardos [ENS, from Feb 2024]
Technical Staff
- N. Smith [ENS, Engineer, until Feb 2024]
Interns and Apprentices
- Leo Boullot [PSL, Intern, until Jun 2024]
- Atefe Khodadaditaghanaki [Université Paris-Cité & CNRS, Intern, from May 2024 until Sep 2024]
- Haoming Lin [EPFL & Inria, Intern, from Mar 2024 until Sep 2024]
- Paul Sevestre [PSL, Intern, until Jun 2024]
- Marijan Soric [Centrale Lyon & Inria, Intern, from Sep 2024]
- Julie Zhan [PSL, Intern, until Jun 2024]
Administrative Assistant
- Meriem Guemair [Inria]
Visiting Scientists
- Thomas Schwentick [TU Dortmund, until Jan 2024]
- Victor Vianu [UC San Diego, from Jun 2024]
2 Overall objectives
Valda's focus is on both foundational and systems aspects of complex data management, especially human-centric data. The data we are interested in is typically heterogeneous, massively distributed, rapidly evolving, intensional, and often subjective, possibly erroneous, imprecise, incomplete. In this setting, Valda is in particular concerned with the optimization of complex resources such as computer time and space, communication, monetary, and privacy budgets. The goal is to extract value from data, beyond simple query answering.
Data management 30, 32 is now an old, well-established field, for which many scientific results and techniques have been accumulated since the sixties. Originally, most works dealt with static, homogeneous, and precise data. Later works were devoted to heterogeneous data 29, 31, possibly distributed 36, but at a small scale.
However, these classical techniques are poorly adapted to handle the new challenges of data management. Consider human-centric data, which is either produced by humans, e.g., emails, chats, recommendations, or produced by systems when dealing with humans, e.g., geolocation, business transactions, results of data analysis. When dealing with such data, and to accomplish any task to extract value from such data, we rapidly encounter the following facets:
- Heterogeneity: data may come in many different structures such as unstructured text, graphs, data streams, complex aggregates, etc., using many different schemas or ontologies.
- Massive distribution: data may come from a large number of autonomous sources distributed over the web, with complex access patterns.
- Rapid evolution: many sources may be producing data in real time, even if little of it is perhaps relevant to the specific application. Typically, recent data is of particular interest and changes have to be monitored.
- Intensionality1: in a classical database, all the data is available. In modern applications, the data is more and more available only intensionally, possibly at some cost, with the difficulty of discovering which source can contribute towards a particular goal, and this with some uncertainty.
- Confidentiality and security: some personal data is critical and needs to remain confidential. Applications manipulating personal data must take this into account and must be secure against linking.
- Uncertainty: modern data, and in particular human-centric data, typically includes errors, contradictions, imprecision, incompleteness, which complicates reasoning. Furthermore, the subjective nature of the data, with opinions, sentiments, or biases, also makes reasoning harder since one has, for instance, to consider different agents with distinct, possibly contradicting knowledge.
These problems have already been studied individually and have led to techniques such as query rewriting 34 or distributed query optimization 35.
Among all these aspects, intensionality is perhaps the one that has been studied the least, so let us expand on it. Consider a user's query, taken in a very broad sense: it may be a classical database query, some information retrieval search, a clustering or classification task, or some more advanced knowledge extraction request. Because of the intensionality of data, solving such a query is typically a dynamic task: each time new data is obtained, the partial knowledge a system has of the world is revised, and query plans need to be updated, as in adaptive query processing 33 or aggregated search 39. The system then needs to decide, based on this partial knowledge, on the best next access to perform. This is reminiscent of the central problem of reinforcement learning 38 (train an agent to accomplish a task in a partially known world based on rewards obtained) and of active learning 37 (decide which action to perform next in order to optimize a learning strategy) and we intend to explore this connection further.
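To make the connection with reinforcement learning concrete, here is a minimal sketch of an epsilon-greedy strategy that repeatedly decides which source to access next, based only on the rewards observed so far. It is an illustration under stated assumptions, not a method from the team's work: the function names, the Bernoulli reward model (an access either yields relevant data or not), and the `epsilon` parameter are all hypothetical.

```python
import random


def choose_next_source(estimates, counts, epsilon, rng):
    """Epsilon-greedy choice of the next source to access.

    Untried sources are accessed first; afterwards a random source is
    explored with probability epsilon, and otherwise the source with the
    best estimated reward is exploited.
    """
    untried = [s for s in estimates if counts[s] == 0]
    if untried:
        return untried[0]
    if rng.random() < epsilon:
        return rng.choice(sorted(estimates))
    return max(sorted(estimates), key=lambda s: estimates[s])


def run_accesses(sources, budget, epsilon=0.1, seed=0):
    """Simulate `budget` accesses.

    `sources` maps a source name to the (hypothetical) probability that one
    access yields relevant data; this probability is hidden from the agent,
    which only sees the 0/1 reward of each access.
    """
    rng = random.Random(seed)
    estimates = {s: 0.0 for s in sources}
    counts = {s: 0 for s in sources}
    for _ in range(budget):
        s = choose_next_source(estimates, counts, epsilon, rng)
        reward = 1.0 if rng.random() < sources[s] else 0.0
        counts[s] += 1
        # Running mean of the rewards observed for this source.
        estimates[s] += (reward - estimates[s]) / counts[s]
    return estimates, counts


if __name__ == "__main__":
    estimates, counts = run_accesses({"archive": 0.2, "api": 0.7}, budget=100)
    print(estimates, counts)
```

The partial knowledge of the world is reduced here to one running average per source; a realistic setting would also account for access costs and for uncertainty about the data obtained.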
Uncertainty of the data interacts with its intensionality: efforts are required to obtain more precise, more complete, sounder results, which yields a trade-off between processing cost and data quality.
Other aspects, such as heterogeneity and massive distribution, are of major importance as well. A standard data management task, such as query answering, information retrieval, or clustering, may become much more challenging when taking into account the fact that data is not available in a central location, or in a common format. We aim to take these aspects into account, to be able to apply our research to real-world applications.
3 Research program
3.1 Research axis 1: Foundations of data management
This axis covers the theory of data management, broadly taken, and in particular the fields of database theory, knowledge representation, and some symbolic aspects of artificial intelligence (especially, reasoning on data).
The goal is to define solid and high-level foundations of data management tasks (query evaluation and optimization of various forms of queries, counting, reasoning, verification of data-centric processes, etc.) through formal tools, such as logics (esp., finite model theory), automata theory, complexity theory; we occasionally have contributions in these areas as well, though most of our work is motivated by data applications. We are especially interested in clean specifications of key aspects of database systems and data management tasks (e.g., confidentiality, access control, robustness), whether they are properties of the data or appropriate (query) languages for these tasks. We study expressive power of languages, computability and complexity of deciding or computing results, as well as the design of appropriate structures (e.g., indexes) to optimize these tasks.
3.2 Research axis 2: Uncertainty, provenance, and explainability in data management
This research axis deals with the modeling and efficient management of data that come with some uncertainty (probabilistic distributions, logical incompleteness, missing values, inconsistencies, open-world assumption, etc.) and with provenance information (indicating where the data originates from), as well as with the extraction of uncertainty and provenance annotations from real-world data. Provenance is also linked to explainability: determining where the result of a data management task comes from, how and why it was produced, helps explaining it. Interestingly, the foundations and tools for uncertainty management often rely on provenance annotations. For example, a typical way to compute the probability of query results in probabilistic databases is the so-called intensional approach: first generate the provenance of these query results (in some appropriate framework, e.g., that of Boolean functions or of provenance semirings), and then compute the probability of the resulting provenance annotation. For this reason, we deal with uncertainty and provenance in a unified manner, and with explainability as an application thereof.
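The two steps of the intensional approach can be sketched as follows, on a toy tuple-independent database. This is a minimal illustration under stated assumptions: the relations, fact identifiers, and function names are hypothetical, the provenance is kept as a simple DNF over fact identifiers, and the probability is computed by brute-force enumeration of possible worlds, which is exponential and only meant to show the principle.

```python
from itertools import product


def provenance_of_join(r, s):
    """Step 1: Boolean provenance of the query 'exists x, y: R(x) and S(x, y)'.

    `r` and `s` map fact identifiers to tuples. The result is a DNF,
    represented as a list of frozensets of fact identifiers: the query is
    true as soon as all facts of one clause are present.
    """
    return [frozenset({rid, sid})
            for rid, (x,) in r.items()
            for sid, (x2, _y) in s.items()
            if x == x2]


def probability(dnf, prob):
    """Step 2: probability that the provenance is true.

    Each fact f is assumed to be present independently with probability
    prob[f]. All possible worlds are enumerated (exponential time).
    """
    facts = sorted(prob)
    total = 0.0
    for world in product([False, True], repeat=len(facts)):
        present = {f for f, keep in zip(facts, world) if keep}
        if any(clause <= present for clause in dnf):
            weight = 1.0
            for f, keep in zip(facts, world):
                weight *= prob[f] if keep else 1.0 - prob[f]
            total += weight
    return total


if __name__ == "__main__":
    R = {"r1": ("a",), "r2": ("b",)}
    S = {"s1": ("a", 1), "s2": ("b", 2)}
    lineage = provenance_of_join(R, S)  # two clauses: {r1, s1} and {r2, s2}
    p = probability(lineage, {"r1": 0.5, "r2": 0.5, "s1": 0.5, "s2": 0.5})
    print(p)  # 0.4375, i.e., 1 - (1 - 0.25) * (1 - 0.25)
```

Separating the two steps is what makes the approach reusable: the same provenance annotation could be evaluated in another framework (for instance a provenance semiring) without re-running the query.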
3.3 Research axis 3: Knowledge discovery at scale
Our final axis deals with knowledge discovery at scale. The goal is to use techniques such as data mining, information extraction, data cleaning, information integration, and machine learning to derive knowledge from raw, dirty, inconsistent, heterogeneous, rapidly changing data from real-world application scenarios.
We intend to leverage our expertise on data management to focus on the scalability of the approaches and tools developed. This is also in some sense an application axis for techniques developed in the other two axes; in particular, we have a focus on intensionality of data (i.e., the cost of data access), on the trade-off between data uncertainty and its cost, and on data provenance and explanations.
The subtopics of this axis typically change considerably, depending on projects, collaborations, and application partners.
4 Application domains
A large part of Valda's research is foundational in nature and not tailored to any specific application domain. Some applied work, however, targets specific application domains:
- Web data in a broad sense (semi-structured, structured or unstructured content extracted from Web databases; knowledge bases from the Semantic Web; social networks; Web archives and Web crawls; Web applications and deep Web databases; crowdsourcing platforms). This is a historical domain of interest of Valda researchers, and we have expertise in the acquisition, extraction, and management of this kind of data.
- Open science (publication databases, scientific publications, open-source software).
- Clinical data (notably inconsistent or incomplete hospital records).
- Energy (notably data from power stations, in collaboration with industrial partners).
- Geoscience (seismology or vulcanology time series, structured data about geological campaigns).
- Data journalism (statistical datasets, fact-checking data).
Finally, transversal concerns that occur in different application areas and motivate some of our theory work are ethics of data management and privacy.
5 Highlights of the year
5.1 Scientific events
In March 2024, Valda organized an event in honor of Serge Abiteboul, held jointly with the EDBT/ICDT 2024 conference in Paestum, Italy. It took place on the occasion of Serge's 70th birthday and retirement from Arcep, and in celebration of his scientific career and achievements. This event included talks by his colleagues, former students, and Serge himself.
5.2 Awards
Paul Boniol's and Emmanouil Sylligardos's demonstration paper 18 was distinguished as a runner-up for best demonstration at the ICDE 2024 conference.
5.3 Inria policy
Note: Readers are advised that the Institute does not endorse the text in the “Highlights of the year” section, which is the sole responsibility of the team leader.
At the end of 2024, Inria's top management enacted a new “contrat d'objectifs, de moyens et de performance” (COMP), which defines Inria's objectives for the period 2024–2028. We are very unhappy and concerned about the content of this document and the way it was imposed.
- Neither the staff nor their representative bodies were given the opportunity to participate in (or influence) the drafting of this document.
- The document defines Inria's main mission as “contributing to the digital sovereignty of the Nation through research and innovation” and proposes to amend Inria's founding decree to reflect this new definition. We strongly believe that our primary mission is (and should remain) the advancement of human knowledge through research. Research is not a means to achieve “digital sovereignty”, whatever that may mean. Research should not be associated with any particular nation, whatever that nation may be.
- The document announces the creation of a funding agency within Inria. France already has an independent funding agency, the ANR. The creation of a new funding agency within a research institute is unnecessary and a waste of resources. It is also likely to create confusion, opacity, and conflicts of interest.
- Many aspects of the document reflect a desire to drive research in a top-down manner, for example through the selection of “strategic partner institutions” and “strategic themes”. This threatens the fundamental freedom of researchers to choose their research topics and collaborations.
- The document indicates that all of Inria's research should have “dual nature”, that is, both civilian and military applications. While some of the institute's research may have military applications, the vast majority of it is independent of the military, and should remain so.
- The document announces a desire to place all of Inria in a “restricted regime area” (ZRR), which means that the hiring of researchers and interns will be reviewed and possibly vetoed by the Fonctionnaire Sécurité Défense. This creates administrative delays, subjects hiring to opaque criteria, and discourages the hiring of foreign nationals, thus harming research and collaboration.
- Staff opposition to these policies, which has been expressed in several votes and petitions, has been largely ignored.
6 New software, platforms, open data
6.1 New software
6.1.1 ProvSQL
- Keywords: Databases, Provenance, Probability
- Functional Description: The goal of the ProvSQL project is to add support for (m-)semiring provenance and uncertainty management to PostgreSQL databases, in the form of a PostgreSQL extension/module/plugin.
- News of the Year: Revamp of the implementation of the circuit storage, now stored in memory-mapped files accessed through a single worker process. Addition of a local circuit cache within each PostgreSQL backend. Implementation of expected value computation. Implementation of the union of intervals semiring. Support for provenance tracking of deletions. Dropped support for PostgreSQL 9.6. Support for PostgreSQL 17. macOS and WSL continuous integration. Various optimizations, bug fixes, and robustness enhancements.
- URL:
- Publications:
- Contact: Pierre Senellart
- Participant: Pierre Senellart
6.1.2 apxproof
- Keyword: LaTeX
- Functional Description: apxproof is a LaTeX package facilitating the typesetting of research articles with proofs in appendix, a common practice in database theory and theoretical computer science in general. The appendix material is written in the LaTeX code along with the main text which it naturally complements, and it is automatically deferred. The package can automatically send proofs to the appendix, can repeat in the appendix the theorem environments stated in the main text, can section the appendix automatically based on the sectioning of the main text, and supports a separate bibliography for the appendix material.
- Release Contributions: Support for lipcs's claimproof, support optional arguments in proofs
- News of the Year: Support for user-defined claimproof environment.
- URL:
- Contact: Pierre Senellart
- Participant: Pierre Senellart
6.1.3 dissem.in
- Name: Dissemin
- Keywords: Open Access, Publishing, HAL
- Functional Description: Dissemin is a web platform gathering metadata from many sources to analyze the open-access full text availability of publications of researchers. It has been designed to foster the use of repositories such as HAL (rather than preprints posted on personal homepages). It allows deposit on these repositories.
- News of the Year: Finalization of Crossref ingestion.
- URL:
- Contact: Pierre Senellart
- Participant: Pierre Senellart
- Partner: CAPSH
7 New results
7.1 Research axis 1: Foundations of data management
Participants: Camille Bourgaux, Sven Dziadek, Lucas Larroque, Michaël Thomazo, Luc Segoufin.
Knowledge representation and knowledge bases
Ontology-based query answering is a problem that takes as input an ontology
Research on knowledge graph embeddings has recently evolved into knowledge base embeddings, where the goal is not only to map facts into vector spaces but also constrain the models so that they take into account the relevant conceptual knowledge available. 19 examines recent methods that have been proposed to embed knowledge bases in description logic into vector spaces through the lens of their geometric-based semantics. We identify several relevant theoretical properties, which we draw from the literature and sometimes generalize or unify. We then investigate how concrete embedding methods fit in this theoretical framework.
Consistent query answering
In 25, we consider the dichotomy conjecture for consistent query answering under primary key constraints. It states that, for every fixed Boolean conjunctive query
Other aspects of theoretical computer science
Our research occasionally touches other aspects of theoretical computer science not related to data management. In 11, we show how to efficiently solve problems involving a quantitative measure, here called energy, as well as a qualitative acceptance condition, expressed as a Büchi or Parity objective, in finite weighted automata and in one-clock weighted timed automata. Solving the former problem and extracting the corresponding witness is our main contribution and is handled by a modified version of the Bellman-Ford algorithm interleaved with Couvreur’s algorithm. The latter problem is handled via a reduction to the former relying on the corner-point abstraction.
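For readers unfamiliar with the building block mentioned above, the following is a minimal sketch of the textbook Bellman-Ford algorithm, used here to detect whether a cycle with positive total energy is reachable from a given state of a finite weighted automaton. It is not the modified version of 11, and it does not include the interleaving with Couvreur's algorithm or the Büchi and Parity conditions; the function name and the encoding of transitions as `(source, target, energy)` triples are assumptions made for the illustration.

```python
def has_positive_cycle(num_states, edges, source):
    """Detect a reachable cycle whose total energy gain is positive.

    `edges` is a list of (u, v, weight) triples, where weight is the energy
    gained (or consumed, if negative) when taking the transition u -> v.
    Standard Bellman-Ford on longest paths: after num_states - 1 rounds of
    relaxation, any further improvement reveals a positive cycle.
    """
    neg_inf = float("-inf")
    best = [neg_inf] * num_states  # best accumulated energy found so far
    best[source] = 0
    for _ in range(num_states - 1):
        changed = False
        for u, v, w in edges:
            if best[u] != neg_inf and best[u] + w > best[v]:
                best[v] = best[u] + w
                changed = True
        if not changed:
            return False  # fixpoint reached: no positive cycle is reachable
    return any(best[u] != neg_inf and best[u] + w > best[v]
               for u, v, w in edges)


if __name__ == "__main__":
    gaining = [(0, 1, -2), (1, 2, 3), (2, 1, -1)]  # cycle 1 -> 2 -> 1 gains 2
    losing = [(0, 1, -2), (1, 2, 3), (2, 1, -4)]   # cycle 1 -> 2 -> 1 loses 1
    print(has_positive_cycle(3, gaining, 0))  # True
    print(has_positive_cycle(3, losing, 0))   # False
```

A positive cycle matters in energy problems because looping on it allows accumulating an arbitrary amount of energy; this sketch ignores the requirement that energy remain non-negative along the way.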
7.2 Research axis 2: Uncertainty, provenance, and explainability in data management
Participants: Camille Bourgaux, Robin Jean, Pierre Senellart.
Inconsistent knowledge bases
In 15, we explore a quantitative approach to querying inconsistent description logic knowledge bases. We consider weighted knowledge bases in which both axioms and assertions have (possibly infinite) weights, which are used to assign a cost to each interpretation based upon the axioms and assertions it violates. Two notions of certain and possible answer are defined by either considering interpretations whose cost does not exceed a given bound or restricting attention to optimal-cost interpretations. Our main contribution is a comprehensive analysis of the combined and data complexity of bounded cost satisfiability and certain and possible answer recognition, for description logics between
In 16, we present a novel approach to querying classical inconsistent description logic (DL) knowledge bases by adopting a paraconsistent semantics with the four “Belnapian” values: exactly true, exactly false, both, and neither. In contrast to prior studies on paraconsistent DLs, we allow truth value operators in the query language, which can be used to differentiate between answers having contradictory evidence and those having only positive evidence. We present a reduction to classical DL query answering that allows us to pinpoint the precise combined and data complexity of answering queries with values in paraconsistent
Provenance management and probabilistic databases
In 26, we report on the impact that the theory of provenance semirings, developed by Val Tannen and his collaborators, has had on the design of a practical system for maintaining the provenance of query results over a relational database, namely ProvSQL.
Shapley values, originating in game theory and increasingly prominent in explainable AI, have been proposed to assess the contribution of facts in query answering over databases, along with other similar power indices such as Banzhaf values. In 12 we adapt these Shapley-like scores to probabilistic settings, the objective being to compute their expected value. We show that the computations of expected Shapley values and of the expected values of Boolean functions are interreducible in polynomial time, thus obtaining the same tractability landscape. We investigate the specific tractable case where Boolean functions are represented as deterministic decomposable circuits, designing a polynomial-time algorithm for this setting. We present applications to probabilistic databases through database provenance, and an effective implementation of this algorithm within the ProvSQL system, which experimentally validates its feasibility over a standard benchmark.
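As an illustration of the quantities involved, here is a brute-force sketch of the Shapley value of a variable of a Boolean function, and of one reading of its expected value when variables are present independently with given probabilities. It is not the polynomial-time algorithm over deterministic decomposable circuits of 12 and runs in exponential time. The definition of the expected score used here (the average, over random sets of present variables containing `x`, of the Shapley value of `x` in the game restricted to that set) is an assumption of this sketch, as are all function names.

```python
from itertools import combinations
from math import factorial


def subsets(items):
    """Yield every subset of `items` as a frozenset."""
    items = list(items)
    for k in range(len(items) + 1):
        for combo in combinations(items, k):
            yield frozenset(combo)


def shapley(phi, players, x):
    """Shapley value of x for the Boolean function phi over `players`.

    phi takes a set of variables set to true and returns a Boolean.
    """
    players = frozenset(players)
    n = len(players)
    value = 0.0
    for s in subsets(players - {x}):
        coeff = factorial(len(s)) * factorial(n - len(s) - 1) / factorial(n)
        value += coeff * (phi(s | {x}) - phi(s))
    return value


def expected_shapley(phi, prob, x):
    """Expected Shapley value of x (assumed definition, see lead-in).

    Each variable v is present independently with probability prob[v];
    worlds where x is absent contribute zero.
    """
    others = [v for v in prob if v != x]
    total = 0.0
    for z in subsets(others):
        weight = prob[x]
        for v in others:
            weight *= prob[v] if v in z else 1.0 - prob[v]
        total += weight * shapley(phi, z | {x}, x)
    return total


if __name__ == "__main__":
    def both(s):
        # Provenance of a query that needs both facts a and b.
        return "a" in s and "b" in s

    print(shapley(both, {"a", "b"}, "a"))                   # 0.5
    print(expected_shapley(both, {"a": 0.5, "b": 0.5}, "a"))  # 0.125
```

When all probabilities are 1, the expected value coincides with the ordinary Shapley value, which gives a simple sanity check.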
7.3 Research axis 3: Knowledge discovery at scale
Participants: Paul Boniol, Antoine Gauquier, Shrey Mishra, Pierre Senellart, Emmanouil Sylligardos.
Mining time series
Anomaly detection is an important problem in data analytics with applications in many domains. In recent years, there has been an increasing interest in anomaly detection tasks applied to time series. In two tutorials at two separate conferences 13, 17, we take a holistic view of anomaly detection in time series, starting from the core definitions and taxonomies related to time series and anomaly types, to an extensive description of the anomaly detection methods proposed by different communities in the literature. We explore the literature and the proposed methods by demonstrating systems that help users understand the core computational steps of some methods and navigate benchmark results. Finally, we describe the problem of model selection for anomaly detection and discuss recent experimental results.
Despite increasing academic interest in anomaly detection over time series and the large number of methods proposed in the literature, recent benchmark and evaluation studies demonstrated that there exists no single best anomaly detection method when applied to heterogeneous time series datasets. Therefore, the only scalable and viable solution to solve anomaly detection over very different time series collected from diverse domains is to propose a model selection method that will choose, based on time series characteristics, the best anomaly detection method to run. 18 describes ADecimo, a modular and extensible web application that helps users understand the performance of time series classification algorithms used as model selection methods for time series anomaly detection. Overall, our system enables users to compare 17 different classifiers over 1980 time series, and decide on the most suitable time series classification method for their own time series and use cases.
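The idea of using a time series classifier as a model selection method can be sketched as follows. This is a deliberately small illustration, not ADecimo's implementation: the two features, the 1-nearest-neighbour rule, and the detector names `detector_A` and `detector_B` are hypothetical, and each labelled example is assumed to carry the name of the detector that performed best on it.

```python
from statistics import mean, pstdev


def features(series):
    """Two simple characteristics of a series with at least two points:
    overall variability and average absolute step between neighbours."""
    diffs = [abs(b - a) for a, b in zip(series, series[1:])]
    return (pstdev(series), mean(diffs))


def select_detector(series, labelled_examples):
    """1-nearest-neighbour model selection.

    `labelled_examples` is a list of (series, detector_name) pairs. The
    detector attached to the closest example in feature space is returned.
    """
    target = features(series)

    def distance(example):
        f = features(example[0])
        return sum((a - b) ** 2 for a, b in zip(f, target))

    return min(labelled_examples, key=distance)[1]


if __name__ == "__main__":
    examples = [
        ([0, 0, 0, 0, 10, 0, 0, 0], "detector_A"),  # isolated spike
        ([0, 1, 0, 1, 0, 1, 0, 1], "detector_B"),   # regular oscillation
    ]
    print(select_detector([5, 6, 5, 6, 5, 6, 5, 6], examples))   # detector_B
    print(select_detector([1, 1, 1, 1, 20, 1, 1, 1], examples))  # detector_A
```

Only the selected detector then needs to be run on the new series, which is what makes the approach scale to heterogeneous collections.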
Beyond anomaly detection, exploring and comparing non-stationary multivariate time series is an important problem in many domains and real-world applications. In work conducted before Paul Boniol's arrival in Valda, we introduced
Information extraction and structuring
22 describes Antoine Gauquier's PhD project, which aims to study holistic methods for building, populating, and exploiting warehouses of heterogeneous content. Each warehouse is characterized by a specification of the types of content we search for; a set of websites in which to search for the content; a set of dedicated methods to analyze and understand the content, including to establish or find links that connect different pieces of content. AI and uncertainty are naturally involved in these steps. We present the overall thesis aims, as well as encouraging preliminary results for one use case: the acquisition of statistical data resources from French government websites, leveraging reinforcement learning.
23 explores first steps towards extracting information about theorems and proofs from scholarly documents to build a knowledge base of interlinked results. Specifically, we consider two main tasks: extracting results and their proofs from the PDF of scientific articles; and establishing which results are used in the proof of which, across the scientific literature. We discuss the problem statement, the methodologies employed in both phases of our approach, and preliminary findings, highlighting the challenges faced.
More specifically, in 24, 28 we address the extraction of mathematical statements and their proofs from scholarly PDF articles as a multimodal classification problem, utilizing text, font features, and bitmap image renderings of PDFs as distinct modalities. We propose a modular sequential multimodal machine learning approach specifically designed for extracting theorem-like environments and proofs. This is based on a cross-modal attention mechanism to generate multimodal paragraph embeddings, which are then fed into our novel multimodal sliding window transformer architecture to capture sequential information across paragraphs. Our approach demonstrates performance improvements obtained by transitioning from unimodality to multimodality, and finally by incorporating sequential modeling over paragraphs.
8 Partnerships and cooperations
8.1 International initiatives
8.1.1 Participation in other International Programs
DesCartes
Participants: Pierre Senellart.
- Title: Intelligent Modelling for Decision-making in Critical Urban Systems
- Partner Institution(s): CNRS@CREATE, National University of Singapore
- Duration: 2021–2026
- Additional info: DesCartes is a project managed by CNRS@CREATE, a CNRS subsidiary in Singapore, and funded by Singapore’s National Research Foundation, with a total budget of 50 million. Pierre Senellart is involved in the project as one of the French PIs.
PHC AURORA with Ana Ozaki
Participants: Camille Bourgaux.
- Title: Learning and Reasoning in Knowledge Graph Embeddings
- Partner Institution(s): University of Oslo, University of Bergen
- Duration: 2023–2024
International ANR project EQUUS
Participants: Luc Segoufin.
- Title: Efficient query answering under updates
- Partner Institution(s): TU Ilmenau, Uni. Bayreuth, HU Berlin, CNRS
- Duration: 2020–2025
8.2 International research visitors
8.2.1 Visits of international scientists
Other international visits to the team
Thomas Schwentick
- Status: Professor
- Institution of origin: TU Dortmund
- Country: Germany
- Dates: October 2023 to January 2024
- Mobility program: Sabbatical
Victor Vianu
- Status: Professor
- Institution of origin: UC San Diego
- Country: USA
- Dates: June 2024 to January 2025
- Mobility program: Sabbatical
8.2.2 Visits to international teams
Research stays abroad
- Pierre Senellart was an invited participant in the Representation, Provenance, and Explanations in Database Theory and Logic seminar at Dagstuhl (January 2024).
8.3 National initiatives
8.3.1 ANR
Valda has been part of two national ANR projects in 2024:
- CQFD (2018–2024; 19 k€ for Valda, budget managed by Inria), with Inria Sophia (GraphIK, coordinator), LaBRI, LIG, Inria Saclay (Cedar), IRISA, Inria Lille (Spirals), and Télécom ParisTech, on complex ontological queries over federated and heterogeneous data.
- QUID (2018–2024; 49 k€ for Valda, budget managed by Inria), with LIGM (coordinator), IRIF, and LaBRI, on incomplete and inconsistent data.
Camille Bourgaux has been participating in the AI Chair of Meghyn Bienvenu on INTENDED (Intelligent handling of imperfect data) since 2020.
Pierre Senellart had held a chair within the PR[AI]RIE institute for artificial intelligence in Paris since 2019. He now holds a chair in the new PR[AI]RIE – Paris School of AI (PSAI) AI cluster approved for funding from 2024. Camille Bourgaux is also a Research Fellow of PSAI.
8.3.2 Others
- Dissemin (2021–2024; 124 k€ for Valda, budget managed by ENS), sole partner, on the development of the dissem.in platform for open science promotion. Funded by the Fonds National Science Ouverte.
9 Dissemination
9.1 Promoting scientific activities
9.1.1 Scientific events: organisation
Member of the organizing committees
- Camille Bourgaux, member of the DL steering committee
- Luc Segoufin, member of the STACS steering committee
- Pierre Senellart, editorial board of the LIPIcs series of conference proceedings
Chair of conference program committees
- Camille Bourgaux, co-chair of the KR 2024 Doctoral Consortium
Member of the conference program committees
- Camille Bourgaux, BDA 2024 (demonstrations), DL 2024, IJCAI 2024, CSL 2025
- Paul Boniol, BDA 2024, DSAA 2024 (Applications track), MulTiSA 2024
- Pierre Senellart, JCDL 2024 (senior PC), SUM 2024, TaPP 2024, ICDT 2025
- Michaël Thomazo, KR 2024, RuleML+RR 2024, IJCAI 2024
- Victor Vianu, PODS 2024
Member of the editorial boards
- Luc Segoufin, associate editor, ACM ToCL
- Victor Vianu, editor, Database Theory column, SIGACT News
9.1.2 Invited talks
- Camille Bourgaux, Querying inconsistent prioritized data, invited talk at DL 2024 (June 2024, Bergen, Norway) 14
- Pierre Senellart, Provenance, Probabilities, and Power Indices in Databases, department seminar at ENS Paris-Saclay (March 2024, Gif-sur-Yvette, France)
- Paul Boniol, Anomaly Detection in Time Series, keynote for AALTD workshop at ECML/PKDD 2024 (September 2024, Vilnius, Lithuania)
- Camille Bourgaux, Querying inconsistent databases, department seminar at ENS Rennes (October 2024, Bruz, France)
- Paul Boniol, Time Series Anomaly Detection: Overview and New Trends, keynote for ML4Jets International Conference (November 2024, Paris, France)
- Pierre Senellart, Qu'est-ce que l'IA ?, keynote at Assemblée des partenaires de Hal (November 2024, Lyon, France)
9.1.3 Leadership within the scientific community
- Serge Abiteboul is a member of the French Academy of Sciences, of the Academia Europaea, and an ACM Fellow.
- Pierre Senellart is a junior member of the Institut Universitaire de France.
9.1.4 Research administration
- Serge Abiteboul was a member of the juries for prizes of the Académie des Sciences: Computer science (president), Lovelace-Babbage, Mines-Télécom
- Serge Abiteboul was a member of the commission for new members of the Académie des Sciences
- Serge Abiteboul is a member of the scientific committee of the Programme Inria Quadrant (PIQ)
- Luc Segoufin is a member of the Formation Spécialisée de Site (FSS) of the Inria Paris research centre.
- Pierre Senellart is the president of section 6 of the National Committee for Scientific Research. As a representative of CoNRS, Pierre Senellart was in the Hcéres evaluation committee of the LMF research unit.
- Pierre Senellart is a member of the board of the conference of presidents of the national committee (CPCN) and as such a member of the coordination of managing parties of the national committee (C3N).
- Pierre Senellart is deputy director of the DI ENS laboratory, joint between ENS, CNRS, and Inria.
- Pierre Senellart is the scientific resource person for Scientific information & edition of the Inria Paris centre
- We participated in the following hiring and promotion juries within universities:
- Serge Abiteboul, Professeur, CNAM
- Camille Bourgaux, Maître de conférences, IUT d'Orsay
- Pierre Senellart, Repyramidage Professeur des Universités, Université Paris-Cité
- Michaël Thomazo, Maître de conférences, Université de Bordeaux
- Michaël Thomazo, Maître de conférences, CentraleSupélec
9.2 Teaching - Supervision - Juries
9.2.1 Teaching
- Licence: Algorithms, L1, CPES, PSL – Antoine Gauquier
- Licence: Differential calculus, L2, CPES, PSL – Antoine Gauquier
- Licence: Practical Computing, L3, École normale supérieure – Pierre Senellart
- Licence: Formal Languages, Computability, Complexity, L3, École normale supérieure – Michaël Thomazo, Lucas Larroque
- Licence: Databases, L3, École normale supérieure – Pierre Senellart, Lucas Larroque
- Master: Logiques de description, M1, DCI – Camille Bourgaux
- Master: Data Acquisition, Extraction, and Storage, M2, IASD – Pierre Senellart
- Master: NoSQL Databases, M2, IASD – Paul Boniol
- Master: Knowledge graphs, description logics, and reasoning on data, M2, IASD – Camille Bourgaux, Michaël Thomazo
- Master: Description logics and reasoning on data, M2, LMFI – Camille Bourgaux, Michaël Thomazo
- Professional training: Web Security, PESTO (Corps des Mines professional training) – Pierre Senellart
As a professor at ENS, Pierre Senellart holds various teaching responsibilities there (M1 projects, M2 administration, entrance competition). He is also the academic director of the graduate program of PSL.
As an adjunct professor at PSL, Michaël Thomazo is in charge of PhD committees within DI ENS and is co-responsible for the international entrance competition at ENS.
We also gave invited courses in summer schools:
- Paul Boniol, Time Series Anomaly Detection, International Summer School on Internet of Things - ISSIOT, Salerno, Italy
- Paul Boniol, Time Series Anomaly Detection, diiP Summer School 2024 – dSDS, Paris, France
- Michaël Thomazo, Summer School of Palacky University, Olomouc, Czechia
Most permanent members of the group are also involved in tutoring ENS students, advising them on their curriculum, their internships, etc. They are also occasionally involved with reviewing internship reports, supervising student projects, etc.
9.2.2 Supervision
- PhD defended: Shrey Mishra, Multimodal extraction of proofs and theorems from the scientific literature, 2021–2024, Pierre Senellart [27]
- PhD in progress: Anatole Dahan, Logical foundations of the polynomial hierarchy, started in October 2020, Arnaud Durand (Université Paris-Cité) & Luc Segoufin
- PhD in progress: Antoine Gauquier, Intelligent construction of a multimodal and heterogeneous data warehouse, with data traceability, started in September 2023, Pierre Senellart & Ioana Manolescu (Inria Cedar)
- PhD in progress: Lucas Larroque, Extension of rewriting procedures for reasoning using existential rules, started in September 2023, Michaël Thomazo
- PhD in progress: Robin Jean, Integration of preferences and domain knowledge in inconsistency-tolerant ontology-based data access, started in October 2023, Meghyn Bienvenu (CNRS LaBRI) & Camille Bourgaux
- PhD in progress: Aryak Sen, Scalability of a data provenance and probability management system, started in February 2024, Silviu Maniu (Université Grenoble Alpes) & Pierre Senellart
- PhD in progress: Emmanouil Sylligardos, Accuracy and execution time trade-off in ensembling and model selection for time series analytics, started in February 2024, Paul Boniol & Pierre Senellart
- PhD in progress: Felix Chavelli, Graph representations for multivariate time series analytics, started in October 2024, Paul Boniol & Michaël Thomazo
- PhD in progress: Pratik Karmakar, Quality, uncertainty, and lineage of data, Stéphane Bressan (NUS) & Pierre Senellart (Pratik Karmakar is based in Singapore and is therefore not considered a Valda member)
- M2 internship: Atefe Khodadaditaghanaki, Camille Bourgaux
- M2 internship: Haoming Lin, Paul Boniol & Michaël Thomazo
- M2 internship: Marijan Soric, Cécile Gracianne (BRGM), Ioana Manolescu (Inria Cedar) & Pierre Senellart
- L2 internship: Leo Boullot, Antoine Gauquier & Pierre Senellart
- L2 internship: Paul Sevestre, Antoine Gauquier & Pierre Senellart
- L2 internship: Julie Zhan, Michaël Thomazo
9.2.3 Juries
- PhD: Edwige Cyffers [president], Université de Lille, Pierre Senellart
- PhD: Florent Martin-Lafay, Université Paris 1 Panthéon-Sorbonne, Serge Abiteboul
- PhD: Alexandra Rogova, Université Paris-Cité, Michaël Thomazo
- HdR: Pierre Bourhis [president & examiner], Université de Lille, Pierre Senellart & Serge Abiteboul
- HdR: Charles Paperman [reviewer], Université de Lille, Pierre Senellart
9.3 Popularization
9.3.1 Specific official responsibilities in science outreach structures
- Serge Abiteboul is the president of the scientific council of the direction of public finances (DGFIP)
- Serge Abiteboul is a member of the board of the Inria Foundation, of the Sopra Steria Foundation, of the Blaise Pascal Foundation
- Serge Abiteboul is a member of the scientific councils of La Main à la Pâte and of Cigref (on responsible digitalization); he was a member of the scientific council of the exhibition on AI “Double Je” in Toulouse in February 2024
- Pierre Senellart is a scientific expert advising the Scientific and Ethical Committee of Parcoursup and MonMaster, the platforms for the selection of higher-education students at the first-year level and the Master's level. As such, he contributed to the 6th yearly report of the committee to the French parliament
9.3.2 Productions (articles, videos, podcasts, serious games, ...)
- Serge Abiteboul is an editor of the binaire blog
- Serge Abiteboul is the co-author (with François Bancilhon) of a popularization book on digital commons: Vive les communs numériques, Odile Jacob, February 2024
9.3.3 Participation in Live events
- Serge Abiteboul is one of the authors of the play “Qui a hacké Garoutzia”, which was staged on various occasions throughout France in 2024 and has had regular performances in Paris at La Scène parisienne since September 2024
- Serge Abiteboul participated in several dozen scientific outreach activities: lectures, round tables, interviews (notably with the Académie des sciences; on the radio or on TV; and in many other settings)
- Camille Bourgaux participated in the RJMI (Rendez-Vous des Jeunes Mathématiciennes et Informaticiennes), an event to promote mathematics and computer science to female high-school students
10 Scientific production
10.1 Major publications
- 1 article. Monadic Datalog, Tree Validity, and Limited Access Containment. ACM Transactions on Computational Logic 21(1), 2020, 6:1-6:45.
- 2 inproceedings. ADecimo: Model Selection for Time Series Anomaly Detection. ICDE 2024 - IEEE 40th International Conference on Data Engineering, Utrecht, Netherlands, May 2024.
- 3 inproceedings. Revisiting Semiring Provenance for Datalog. KR 2022 - 19th International Conference on Principles of Knowledge Representation and Reasoning. Proceedings of the 19th International Conference on Principles of Knowledge Representation and Reasoning, Haifa, Israel, July 2022, 91–101.
- 4 inproceedings. Capturing Homomorphism-Closed Decidable Queries with Existential Rules. KR 2021 - 18th International Conference on Principles of Knowledge Representation and Reasoning, Virtual, Vietnam, November 2021, 141–150.
- 5 inproceedings. Parallelisable Existential Rules: a Story of Pieces. KR 2021 - 18th International Conference on Principles of Knowledge Representation and Reasoning, Virtual, Vietnam, November 2021.
- 6 inproceedings. Conjunctive Queries With Self-Joins, Towards a Fine-Grained Complexity Analysis. PODS'23, Seattle, United States, June 2023.
- 7 article. Tameness and the power of programs over monoids in DA. Logical Methods in Computer Science 18(3), August 2022, 14:1–14:34.
- 8 article. Expected Shapley-Like Scores of Boolean Functions: Complexity and Applications to Probabilistic Databases. Proceedings of the ACM on Management of Data 2(2) (PODS), January 2024.
- 9 article. Enumeration for FO Queries over Nowhere Dense Graphs. Journal of the ACM (JACM) 69(3), June 2022, 1-37.
- 10 article. ProvSQL: Provenance and Probability Management in PostgreSQL. Proceedings of the VLDB Endowment (PVLDB) 11(12), August 2018, 2034-2037.
10.2 Publications of the year
International journals
- 11 article. ω-Regular Energy Problems. Formal Aspects of Computing, July 2024. In press.
- 12 article. Expected Shapley-Like Scores of Boolean Functions: Complexity and Applications to Probabilistic Databases. Proceedings of the ACM on Management of Data 2(2) (PODS), January 2024.
- 13 article. Time-Series Anomaly Detection: Overview and New Trends. Proceedings of the VLDB Endowment (PVLDB) 17(12), 2024, 4229-4232.
Invited conferences
- 14 inproceedings. Querying Inconsistent Prioritized Data. DL 2024 - 37th International Workshop on Description Logics, Bergen, Norway, June 2024.
International peer-reviewed conferences
- 15 inproceedings. Cost-Based Semantics for Querying Inconsistent Weighted Knowledge Bases. Proceedings of the 21st International Conference on Principles of Knowledge Representation and Reasoning. KR 2024 - 21st International Conference on Principles of Knowledge Representation and Reasoning, Hanoi, Vietnam, November 2024.
- 16 inproceedings. Queries With Exact Truth Values in Paraconsistent Description Logics. Proceedings of the 21st International Conference on Principles of Knowledge Representation and Reasoning. KR 2024 - 21st International Conference on Principles of Knowledge Representation and Reasoning, Hanoi, Vietnam, November 2024.
- 17 inproceedings. An Interactive Dive into Time-Series Anomaly Detection. ICDE 2024 - 40th International Conference on Data Engineering, Utrecht, Netherlands, May 2024.
- 18 inproceedings. ADecimo: Model Selection for Time Series Anomaly Detection. ICDE 2024 - IEEE 40th International Conference on Data Engineering, Utrecht, Netherlands, May 2024.
- 19 inproceedings. Knowledge Base Embeddings: Semantics and Theoretical Properties. Proceedings of the 21st International Conference on Principles of Knowledge Representation and Reasoning. KR 2024 - 21st International Conference on Principles of Knowledge Representation and Reasoning, Hanoi, Vietnam, November 2024.
- 20 inproceedings. Ontology-Based Query Answering over Datalog-Expressible Rule Sets is Undecidable. KR 2024 - 21st International Conference on Principles of Knowledge Representation and Reasoning, Hanoi, Vietnam, November 2024.
- 21 inproceedings. d_symb playground: an interactive tool to explore large multivariate time series datasets. ICDE 2024 IEEE 40th International Conference on Data Engineering, Utrecht, Netherlands, May 2024.
- 22 inproceedings. Towards Efficient Construction of a Traceable, Multimodal, and Heterogeneous Data Warehouse. CEUR Workshop Proceedings. VLDB 2024 PhD Workshop - The 50th International Conference on Very Large Data Bases, Guangzhou, China, August 2024.
- 23 inproceedings. First Steps in Building a Knowledge Base of Mathematical Results. SDP Fourth Workshop on Scholarly Document Processing at ACL 2024, Bangkok, Thailand, August 2024.
- 24 inproceedings. Modular Multimodal Machine Learning for Extraction of Theorems and Proofs in Long Scientific Documents. JCDL 2024 - ACM/IEEE-CS Joint Conference on Digital Libraries, Hong Kong, China, December 2024.
- 25 inproceedings. A Dichotomy in the Complexity of Consistent Query Answering for Two Atom Queries With Self-Join. PODS'24 - ACM Conference on Principles of Database Systems 2(2), Santiago, Chile, May 2024, 1-15.
Scientific book chapters
- 26 inbook. On the Impact of Provenance Semiring Theory on the Design of a Provenance-Aware Database System. The Provenance of Elegance in Computation - Essays Dedicated to Val Tannen. OpenAccess Series in Informatics, Schloss Dagstuhl, 2024.
Doctoral dissertations and habilitation theses
- 27 thesis. Multimodal Extraction of Proofs and Theorems from the Scientific Literature. Université Paris Sciences & Lettres, July 2024.
Reports & preprints
- 28 misc. Modular Multimodal Machine Learning for Extraction of Theorems and Proofs in Long Scientific Documents (Extended Version). November 2024.
10.3 Cited publications
- 29 book. Data on the Web: From Relations to Semistructured Data and XML. Morgan Kaufmann, 1999.
- 30 book. Foundations of Databases. Addison-Wesley, 1995. URL: http://webdam.inria.fr/Alice/
- 31 book. Web Data Management. Cambridge University Press, 2011. URL: http://webdam.inria.fr/Jorge
- 32 incollection. Databases. Computer Science, The Hardware, Software and Heart of It, Springer, 2011, 169-229. URL: https://doi.org/10.1007/978-1-4614-1168-0_10
- 33 article. Adaptive Query Processing. Foundations and Trends in Databases 1(1), 2007, 1-140. URL: https://doi.org/10.1561/1900000001
- 34 article. Answering queries using views: A survey. VLDB J. 10(4), 2001, 270-294. URL: https://doi.org/10.1007/s007780100054
- 35 article. The State of the art in distributed query processing. ACM Comput. Surv. 32(4), 2000, 422-469. URL: http://doi.acm.org/10.1145/371578.371598
- 36 book. Principles of Distributed Database Systems, Third Edition. Springer, 2011. URL: https://doi.org/10.1007/978-1-4419-8834-8
- 37 book. Active Learning. Synthesis Lectures on Artificial Intelligence and Machine Learning, Morgan & Claypool Publishers, 2012. URL: https://doi.org/10.2200/S00429ED1V01Y201207AIM018
- 38 book. Reinforcement learning - an introduction. Adaptive computation and machine learning, MIT Press, 1998. URL: http://www.worldcat.org/oclc/37293240
- 39 inproceedings. On the reliability and intuitiveness of aggregated search metrics. 22nd ACM International Conference on Information and Knowledge Management, CIKM'13, San Francisco, CA, USA, October 27 - November 1, 2013, 689-698. URL: http://doi.acm.org/10.1145/2505515.2505691