VALDA - 2022 - Annual activity report

2022

Activity report

Project-Team

VALDA

RNSR: 201622223R

Research center

Inria Paris Center

In partnership with:

Ecole normale supérieure de Paris, CNRS

Value from Data

In collaboration with:

Département d'Informatique de l'Ecole Normale Supérieure

Domain

Perception, Cognition and Interaction

Theme

Data and Knowledge Representation and Processing

Creation of the Project-Team: 2018 January 01

Keywords

Computer Science and Digital Science

A3.1. Data
A3.1.1. Modeling, representation
A3.1.2. Data management, querying and storage
A3.1.3. Distributed data
A3.1.4. Uncertain data
A3.1.5. Control access, privacy
A3.1.6. Query optimization
A3.1.7. Open data
A3.1.8. Big data (production, storage, transfer)
A3.1.9. Database
A3.1.10. Heterogeneous data
A3.1.11. Structured data
A3.2. Knowledge
A3.2.1. Knowledge bases
A3.2.2. Knowledge extraction, cleaning
A3.2.3. Inference
A3.2.4. Semantic Web
A3.2.5. Ontologies
A3.2.6. Linked data
A3.3.2. Data mining
A3.4.3. Reinforcement learning
A3.4.5. Bayesian methods
A3.5.1. Analysis of large graphs
A4.7. Access control
A7.2. Logic in Computer Science
A7.3. Calculability and computability
A9.1. Knowledge
A9.8. Reasoning

1 Team members, visitors, external collaborators

Research Scientists

Serge Abiteboul [Inria, Emeritus, HDR]
Camille Bourgaux [CNRS, Researcher]
Luc Segoufin [Inria, Senior Researcher, HDR]
Michael Thomazo [Inria, Researcher]

Faculty Members

Pierre Senellart [Team leader, ENS Paris, Professor, HDR]
Leonid Libkin [ENS Paris, Professor]
Cristina Sirangelo [Université Paris-Cité, Professor, from Feb 2022 until Jul 2022, Secondment to Inria, HDR]
Victor Vianu [ENS Paris, Professor, from Jul 2022 until Nov 2022, Visiting professor]

Post-Doctoral Fellows

Nofar Carmeli [ENS Paris, until Sep 2022]
Shufan Jiang [ENS Paris, from Dec 2022, ATER]
Anantha Padmanabha [ENS Paris]

PhD Students

Anatole Dahan [Université Paris-Cité]
Baptiste Lafosse [ENS Paris]
Shrey Mishra [ENS Paris]
Yann Ramusat [ENS Paris, ATER, until Aug 2022]
Alexandra Rogova [Université Paris-Cité]

Technical Staff

N. Smith [ENS Paris, Engineer, from Feb 2022]

Interns and Apprentices

Yacine Brihmouche [Université Paris-Dauphine, Intern, from May 2022 until Sep 2022]
Antoine Gauquier [Télécom Paris, Intern, from Oct 2022, Part-time]
Siméon Gheorgin [PSL, Intern, until Jun 2022, Part-time]

Administrative Assistant

Meriem Guemair [Inria]

2 Overall objectives

2.1 Objectives

Valda's focus is on both foundational and systems aspects of complex data management, especially human-centric data. The data we are interested in is typically heterogeneous, massively distributed, rapidly evolving, intensional, and often subjective, possibly erroneous, imprecise, incomplete. In this setting, Valda is in particular concerned with the optimization of complex resources such as computer time and space, communication, monetary, and privacy budgets. The goal is to extract value from data, beyond simple query answering.

Data management 40, 49 is now an old, well-established field, for which many scientific results and techniques have been accumulated since the sixties. Originally, most works dealt with static, homogeneous, and precise data. Later, works were devoted to heterogeneous data 38, 41, and possibly distributed 72 but at a small scale.

However, these classical techniques are poorly adapted to handle the new challenges of data management. Consider human-centric data, which is either produced by humans, e.g., emails, chats, recommendations, or produced by systems when dealing with humans, e.g., geolocation, business transactions, results of data analysis. When dealing with such data, and to accomplish any task to extract value from such data, we rapidly encounter the following facets:

Heterogeneity: data may come in many different structures such as unstructured text, graphs, data streams, complex aggregates, etc., using many different schemas or ontologies.
Massive distribution: data may come from a large number of autonomous sources distributed over the web, with complex access patterns.
Rapid evolution: many sources may be producing data in real time, even if little of it is perhaps relevant to the specific application. Typically, recent data is of particular interest and changes have to be monitored.
Intensionality1: in a classical database, all the data is available. In modern applications, the data is more and more available only intensionally, possibly at some cost, with the difficulty to discover which source can contribute towards a particular goal, and this with some uncertainty.
Confidentiality and security: some personal data is critical and needs to remain confidential. Applications manipulating personal data must take this into account and must be secure against linking.
Uncertainty: modern data, and in particular human-centric data, typically includes errors, contradictions, imprecision, incompleteness, which complicates reasoning. Furthermore, the subjective nature of the data, with opinions, sentiments, or biases, also makes reasoning harder since one has, for instance, to consider different agents with distinct, possibly contradicting knowledge.

These problems have already been studied individually and have led to techniques such as query rewriting 62 or distributed query optimization 68.

Among all these aspects, intensionality is perhaps the one that has least been studied, so we pay particular attention to it. Consider a user's query, taken in a very broad sense: it may be a classical database query, some information retrieval search, a clustering or classification task, or some more advanced knowledge extraction request. Because of the intensionality of data, solving such a query is typically a dynamic task: each time new data is obtained, the partial knowledge a system has of the world is revised, and query plans need to be updated, as in adaptive query processing 55 or aggregated search 80. The system then needs to decide, based on this partial knowledge, on the best next access to perform. This is reminiscent of the central problem of reinforcement learning 78 (train an agent to accomplish a task in a partially known world based on rewards obtained) and of active learning 74 (decide which action to perform next in order to optimize a learning strategy) and we intend to explore this connection further.

Uncertainty of the data interacts with its intensionality: efforts are required to obtain more precise, more complete, sounder results, which yields a trade-off between processing cost and data quality.

Other aspects, such as heterogeneity and massive distribution, are of major importance as well. A standard data management task, such as query answering, information retrieval, or clustering, may become much more challenging when taking into account the fact that data is not available in a central location, or in a common format. We aim to take these aspects into account, to be able to apply our research to real-world applications.

2.2 The Issues

We intend to tackle hard technical issues such as query answering, data integration, data monitoring, verification of data-centric systems, truth finding, knowledge extraction, data analytics, that take a different flavor in this modern context. In particular, we are interested in designing strategies to minimize data access cost towards a specific goal, possibly a massive data analysis task. That cost may be in terms of communication (accessing data in distributed systems, on the Web), of computational resources (when data is produced by complex tools such as information extraction, machine learning systems, or complex query processing), of monetary budget (paid-for application programming interfaces, crowdsourcing platforms), or of a privacy budget (as in the standard framework of differential privacy).

A number of data management tasks in Valda are inherently intractable. In addition to properly characterizing this intractability in terms of complexity theory, we intend to develop solutions for solving these tasks in practice, based on approximation strategies, randomized algorithms, enumeration algorithms with constant delay, or identification of restricted forms of data instances lowering the complexity of the task.

3 Research program

3.1 Scientific Foundations

We now detail some of the scientific foundations of our research on complex data management. This is the occasion to review connections between data management, especially on complex data as is the focus of Valda, with related research areas.

Complexity & Logic

Data management has been connected to logic since the advent of the relational model as main representation system for real-world data, and of first-order logic as the logical core of database querying languages 40. Since these early developments, logic has also been successfully used to capture a large variety of query modes, such as data aggregation 67, recursive queries (Datalog), or querying of XML databases 49. Logical formalisms facilitate reasoning about the expressiveness of a query language or about its complexity.

The main problem of interest in data management is that of query evaluation, i.e., computing the results of a query over a database. The complexity of this problem has far-reaching consequences. For example, it is because first-order logic is in the $\mathrm{AC}^{0}$ complexity class that evaluation of SQL queries can be parallelized efficiently. It is usual 79 in data management to distinguish data complexity, where the query is considered to be fixed, from combined complexity, where both the query and the data are considered to be part of the input. Thus, though conjunctive queries, corresponding to a simple SELECT-FROM-WHERE fragment of SQL, have PTIME data complexity, they are NP-hard in combined complexity. Making this distinction is important, because data is often far larger (up to the order of terabytes) than queries (rarely more than a few hundred bytes). Beyond simple query evaluation, a central question in data management remains that of complexity; tools from algorithm analysis and complexity theory can be used to pinpoint the tractability frontier of data management tasks.
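To make the distinction between data and combined complexity concrete, here is a minimal Python sketch of a naive conjunctive-query evaluator. The function name `evaluate_cq`, the tuple encoding of atoms, and the toy relations are illustrative assumptions, not part of any Valda software. For a fixed query with k atoms, the backtracking search performs work polynomial in the size of the database (roughly the database size to the power k); when the query itself is part of the input, the exponent grows with it, which is where the hardness in combined complexity shows up.

```python
from typing import Dict, List, Set, Tuple

Atom = Tuple[str, Tuple[str, ...]]   # (relation name, variables)
Database = Dict[str, Set[Tuple]]     # relation name -> set of tuples


def evaluate_cq(head: Tuple[str, ...], body: List[Atom], db: Database) -> Set[Tuple]:
    """Naive backtracking evaluation of a conjunctive query (illustrative only)."""
    answers: Set[Tuple] = set()

    def extend(i: int, binding: Dict[str, object]) -> None:
        if i == len(body):
            # Every atom is matched: project the binding on the head variables.
            answers.add(tuple(binding[v] for v in head))
            return
        rel, variables = body[i]
        for fact in db.get(rel, set()):
            if len(fact) != len(variables):
                continue
            new_binding = dict(binding)
            ok = True
            for var, value in zip(variables, fact):
                # A variable must keep the same value across all atoms.
                if new_binding.setdefault(var, value) != value:
                    ok = False
                    break
            if ok:
                extend(i + 1, new_binding)

    extend(0, {})
    return answers


# Toy database (hypothetical data).
db = {"R": {(1, 2), (2, 3), (3, 4)}, "S": {(2, "a"), (3, "b")}}

# Q(x, z) :- R(x, y), S(y, z)
result = evaluate_cq(("x", "z"), [("R", ("x", "y")), ("S", ("y", "z"))], db)
print(sorted(result))
```

The query above corresponds to `SELECT R.x, S.z FROM R, S WHERE R.y = S.y`, with column names chosen for the example.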

Automata Theory

Automata theory and formal languages arise as important components of the study of many data management tasks: in temporal databases 39, queries, expressed in temporal logics, can often be compiled to automata; in graph databases 45, queries are naturally given as automata; typical query and schema languages for XML databases such as XPath and XML Schema can be compiled to tree automata 71, or for more complex languages to data tree automata 65. Another reason for the importance of automata theory, and tree automata in particular, comes from Courcelle's results 53 that show that very expressive queries (from monadic second-order logic) can be evaluated as tree automata over tree decompositions of the original databases, yielding linear-time algorithms (in data complexity) for a wide variety of applications.
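As an illustration of graph queries given as automata, the following Python sketch evaluates a regular path query by a breadth-first search over the product of an edge-labelled graph and a nondeterministic finite automaton. This is a standard textbook construction, not a description of a specific system; the function name `rpq_reachable`, the dictionary encoding of the automaton, and the toy graph are assumptions made for the example.

```python
from collections import deque
from typing import Dict, List, Set, Tuple

Edge = Tuple[str, str, str]              # (source, label, target)
Nfa = Dict[Tuple[int, str], Set[int]]    # (state, label) -> next states


def rpq_reachable(edges: List[Edge], delta: Nfa, start_state: int,
                  final_states: Set[int], source: str) -> Set[str]:
    """Nodes reachable from `source` along a path whose label word the NFA accepts."""
    out: Dict[str, List[Tuple[str, str]]] = {}
    for u, label, v in edges:
        out.setdefault(u, []).append((label, v))

    # Breadth-first search over pairs (graph node, automaton state).
    seen = {(source, start_state)}
    queue = deque(seen)
    answers: Set[str] = set()
    while queue:
        node, state = queue.popleft()
        if state in final_states:
            answers.add(node)
        for label, target in out.get(node, []):
            for next_state in delta.get((state, label), set()):
                pair = (target, next_state)
                if pair not in seen:
                    seen.add(pair)
                    queue.append(pair)
    return answers


# Hypothetical graph and the query "knows+ worksAt" encoded as an NFA.
edges = [
    ("alice", "knows", "bob"),
    ("bob", "knows", "carol"),
    ("alice", "worksAt", "cnrs"),
    ("bob", "worksAt", "inria"),
    ("carol", "worksAt", "ens"),
]
delta = {(0, "knows"): {1}, (1, "knows"): {1}, (1, "worksAt"): {2}}
employers = rpq_reachable(edges, delta, 0, {2}, "alice")
print(sorted(employers))
```

Each product pair is visited at most once, so the search is linear in the size of the product of the graph and the automaton.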

Verification

Complex data management also has connections to verification and static analysis. Besides query evaluation, a central problem in data management is that of deciding whether two queries are equivalent 40. This is critical for query optimization, in order to determine if the rewriting of a query, maybe cheaper to evaluate, will return the same result as the original query. Equivalence can easily be seen to be an instance of the problem of (non-)satisfiability: $q \equiv q'$ if and only if $(q \land \neg q') \lor (\neg q \land q')$ is not satisfiable. In other words, some aspects of query optimization are static analysis issues. Verification is also a critical part of any database application where it is important to ensure that some property will never (or always) arise 51.
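The reduction from equivalence to non-satisfiability can be illustrated on a deliberately simplified, propositional analogue. In the Python sketch below, "queries" are Boolean formulas encoded as Python functions, and satisfiability is decided by brute-force enumeration of assignments; the names `is_satisfiable` and `equivalent` are hypothetical. This is only meant to show the shape of the reduction: for actual relational queries, the satisfiability check is much harder and depends on the query language.

```python
from itertools import product
from typing import Callable, Dict, Sequence

Formula = Callable[[Dict[str, bool]], bool]


def is_satisfiable(formula: Formula, variables: Sequence[str]) -> bool:
    """Brute-force satisfiability: try every truth assignment."""
    return any(
        formula(dict(zip(variables, values)))
        for values in product([False, True], repeat=len(variables))
    )


def equivalent(q1: Formula, q2: Formula, variables: Sequence[str]) -> bool:
    """q1 and q2 are equivalent iff (q1 and not q2) or (not q1 and q2) is unsatisfiable."""
    def difference(a: Dict[str, bool]) -> bool:
        return (q1(a) and not q2(a)) or (not q1(a) and q2(a))

    return not is_satisfiable(difference, variables)


variables = ["a", "b", "c"]
q1 = lambda v: v["a"] and (v["b"] or v["c"])
q2 = lambda v: (v["a"] and v["b"]) or (v["a"] and v["c"])   # distributed form of q1
q3 = lambda v: v["a"] and v["b"]

same = equivalent(q1, q2, variables)
different = equivalent(q1, q3, variables)
print(same, different)
```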

Workflows

The orchestration of distributed activities (under the responsibility of a conductor) and their choreography (when they are fully autonomous) are complex issues that are essential for a wide range of data management applications including, notably, e-commerce systems, business processes, health care, and scientific workflows. The difficulty is to guarantee consistency or, more generally, quality of service, and to statically verify critical properties of the system. Different approaches to workflow specifications exist: automata-based, logic-based, or predicate-based control of function calls 37.

Probability & Provenance

To deal with the uncertainty attached to data, proper models need to be used (such as attaching provenance information to data items and viewing the whole database as being probabilistic) and practical methods and systems need to be developed to both reliably estimate the uncertainty in data items and properly manage provenance and uncertainty information throughout a long, complex system.

The simplest model of data uncertainty is the NULLs of SQL databases, also called Codd tables 40. This representation system is too basic for any complex task, and has the major drawback of not being closed under even simple queries or updates. A solution to this has been proposed in the form of conditional tables 64 where every tuple is annotated with a Boolean formula over independent Boolean random events. This model has been recognized as foundational and extended in two different directions: to more expressive models of provenance than what Boolean functions capture, through a semiring formalism 60, and to a probabilistic formalism by assigning independent probabilities to the Boolean events 61. These two extensions form the basis of modern provenance and probability management, largely subsuming previous works 52, 46. Research in the past ten years has focused on a better understanding of the tractability of query answering with provenance and probabilistic annotations, in a variety of specializations of this framework 77, 66, 43.
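The following Python sketch illustrates the two ideas on a toy example: tuples annotated with Boolean events, annotations combined during query evaluation (conjunction for joins, disjunction for projections), and the probability of the resulting annotation computed by enumerating possible worlds. The representation of annotations as monotone formulas in disjunctive normal form, the helper names, and the data are assumptions chosen for brevity; the exhaustive enumeration is exponential in the number of events and is not how practical systems proceed.

```python
from itertools import product
from typing import Dict, FrozenSet

Dnf = FrozenSet[FrozenSet[str]]   # a disjunction of conjunctions of event names


def var(name: str) -> Dnf:
    return frozenset({frozenset({name})})


def conj(a: Dnf, b: Dnf) -> Dnf:
    """Annotation of a joined tuple: both inputs must be present."""
    return frozenset(x | y for x in a for y in b)


def disj(a: Dnf, b: Dnf) -> Dnf:
    """Annotation of a projected tuple: either derivation suffices."""
    return a | b


def probability(formula: Dnf, prob: Dict[str, float]) -> float:
    """Sum the weights of the possible worlds (sets of true events) satisfying the formula."""
    events = sorted(prob)
    total = 0.0
    for values in product([False, True], repeat=len(events)):
        world = {e for e, v in zip(events, values) if v}
        weight = 1.0
        for e, v in zip(events, values):
            weight *= prob[e] if v else 1.0 - prob[e]
        if any(clause <= world for clause in formula):
            total += weight
    return total


# Hypothetical annotated relations: R(a): e1, R(b): e2, S(a): e3, S(b): e3.
# Boolean query "exists x. R(x) and S(x)".
annotation = disj(conj(var("e1"), var("e3")), conj(var("e2"), var("e3")))
p = probability(annotation, {"e1": 0.5, "e2": 0.5, "e3": 0.8})
print(p)
```

Here the annotation is (e1 and e3) or (e2 and e3), so the expected probability is 0.8 × (1 − 0.5 × 0.5) = 0.6, up to floating-point rounding.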

Machine Learning

Statistical machine learning, and its applications to data mining and data analytics, is a major foundation of data management research. A large variety of research areas in complex data management, such as wrapper induction 73, crowdsourcing 44, focused crawling 59, or automatic database tuning 47 critically rely on machine learning techniques, such as classification 63, probabilistic models 58, or reinforcement learning 78.

Machine learning is also a rich source of complex data management problems: thus, the probabilities produced by a conditional random field 69 system result in probabilistic annotations that need to be properly modeled, stored, and queried.

Finally, complex data management also brings new twists to some classical machine learning problems. Consider for instance the area of active learning 74, a subfield of machine learning concerned with how to optimally use a (costly) oracle, in an interactive manner, to label training data that will be used to build a learning model, e.g., a classifier. In most of the active learning literature, the cost model is very basic (uniform or fixed-value costs), though some works 75 consider more realistic costs. Also, oracles are usually assumed to be perfect with only a few exceptions 56. These assumptions usually break when applied to complex data management problems on real-world data, such as crowdsourcing.

3.2 Research Directions

At the beginning of the Valda team, the project was to focus on the following directions:

foundational aspects of data management, in particular related to query enumeration and reasoning on data, especially regarding security issues;
implementation of provenance and uncertainty management, real-world applications, other aspects of uncertainty and incompleteness, in particular dynamic;
development of personal information management systems, integration of machine learning techniques.

We believe the first two directions have been followed in a satisfactory manner. For various organizational reasons, however, the focus on personal information management has not been kept, and the third axis of the project has been reoriented towards more general aspects of Web data management.

New permanent arrivals in the group since its creation have impacted its research directions in the following manner:

Camille Bourgaux and Michaël Thomazo are both specialists of knowledge representation and formal aspects of knowledge bases, which is an expertise that did not exist in the group. They are also both interested in, and have started working on aspects related to connecting their research with database theory, and investigating aspects of uncertainty and incompleteness in their research. This will lead to more work on knowledge representation and symbolic AI aspects, while keeping the focus of Valda on foundations of data management and uncertainty.
Leonid Libkin is a specialist of database theory, of incomplete data management, and has a line of current research on graph data management. His profile fits very well with the original orientation of the Valda project.

We intend to keep producing leading research on the foundations of data management. Generally speaking, the goal is to investigate the borders of feasibility of various tasks. For instance, what are the assumptions on data that allow for computable problems? When is it not possible at all? When can we hope for efficient query answering, when is it hopeless? This is a problem of theoretical nature which is necessary for understanding the limits of the methods and driving research towards the scenarios where positive results may be obtainable. Only when we have understood the limitations of different methods and have many examples where this is possible can we hope to design a solid foundation that allows for a good trade-off between what can be done (needs from the users) and what can be achieved (limitations from the system).

Similarly, we will continue our work, both foundational and practical, on various aspects of provenance and uncertainty management. One overall long-term goal is to reach a full understanding of the interactions between query evaluation or other broader data management tasks and uncertain and annotated data models. We would in particular want to go towards a full classification of tractable (typically polynomial-time) and intractable (typically NP-hard for decision problems, or #P-hard for probability evaluation) tasks, extending and connecting the query-based dichotomy 54 on probabilistic query evaluation with the instance-based one of 42, 43. Another long-term goal is to consider more dynamic scenarios than what has been considered so far in the uncertain data management literature: when following a workflow, or when interacting with intensional data sources, how to properly represent and update uncertainty annotations that are associated with data. This is critical for many complex data management scenarios where one has to maintain a probabilistic current knowledge of the world, while obtaining new knowledge by posing queries and accessing data sources. Such intensional tasks require jointly minimizing data uncertainty and the cost of data access.

As an application area, in addition to the historical focus on personal information management, which is now less stressed, we target Web data (Web pages, the semantic Web, social networks, the deep Web, crowdsourcing platforms, etc.).

We aim at keeping a delicate balance between theoretical, foundational research, and systems research, including development and implementation. This is a difficult balance to find, especially since most Valda researchers have a tendency to favor theoretical work, but we believe it is also one of the strengths of the team.

4 Application domains

4.1 Personal Information Management Systems

We recall that Valda's focus is on human-centric data, i.e., data produced by humans, explicitly or implicitly, or more generally containing information about humans. Quite naturally, we have used as a privileged application area to validate Valda’s results that of personal information management systems (Pims for short) 36.

A Pims is a system that allows a user to integrate her own data, e.g., emails and other kinds of messages, calendar, contacts, web search, social network, travel information, work projects, etc. Such information is commonly spread across different services. The goal is to give back to a user the control on her information, allowing her to formulate queries such as “What kind of interaction did I have recently with Alice B.?”, “Where were my last ten business trips, and who helped me plan them?”. The system has to orchestrate queries to the various services (which means knowing the existence of these services, and how to interact with them), integrate information from them (which means having data models for this information and its representation in the services), e.g., align a GPS location of the user to a business address or place mentioned in an email, or an event in a calendar to some event in a Web search. This information must be accessed intensionally: for instance, costly information extraction tools should only be run on emails which seem relevant, perhaps identified by a less costly cursory analysis (this means, in turn, obtaining a cost model for access to the different services). Impacted people can be found by examining events in the user's calendar and determining who is likely to attend them, perhaps based on email exchanges or former events' participant lists. Of course, uncertainty has to be maintained along the entire process, and provenance information is needed to explain query results to the user (e.g., indicate which meetings and trips are relevant to each person of the output). Knowledge about services, their data models, their costs, need either to be provided by the system designer, or to be automatically learned from interaction with these services, as in 73.

One motivation for that choice is that Pims concentrate many of the problems we intend to investigate: heterogeneity (various sources, each with a different structure), massive distribution (information spread out over the Web, in numerous sources), rapid evolution (new data regularly added), intensionality (knowledge from Wikidata, OpenStreetMap...), confidentiality and security (mostly private data), and uncertainty (very variable quality). Though the data is distributed, its size is relatively modest; other applications may be considered for works focusing on processing data at large scale, which is a potential research direction within Valda, though not our main focus. Another strong motivation for the choice of Pims as application domain is the importance of this application from a societal viewpoint.

A Pims is essentially a system built on top of a user's personal knowledge base; such knowledge bases are reminiscent of those found in the Semantic Web, e.g., linked open data. Some issues, such as ontology alignment 76, exist in both scenarios. However, there are some fundamental differences between building personal knowledge bases and collecting information from the Semantic Web: first, the scope is much smaller, as one is only interested in knowledge related to a given individual; second, only a small proportion of the data is already present in the form of semantic information, and most needs to be extracted and annotated through appropriate wrappers and enrichers; third, though the linked open data is meant to be read-only, the only update possible to a user being adding new triples, a personal knowledge base is very much something that a user needs to be able to edit, and propagating updates from the knowledge base to original data sources is a challenge in itself.

4.2 Web Data

The choice of Pims is not exclusive: we also consider other application areas. In particular, we have worked in the past and have a strong expertise on Web data 41 in a broad sense: semi-structured, structured, or unstructured content extracted from Web databases 73; knowledge bases from the Semantic Web 76; social networks 70; Web archives and Web crawls 57; Web applications and deep Web databases 50; crowdsourcing platforms 44. We intend to continue using Web data as a natural application domain for the research within Valda when relevant. For instance 48, deep Web databases are a natural application scenario for intensional data management issues: determining if a deep Web database contains some information requires optimizing the number of costly requests to that database.

A common aspect of both personal information and Web data is that their exploitation raises ethical considerations. Thus, a user needs to remain fully in control of the usage that is made of her personal information; a search engine or recommender system that ranks Web content for display to a specific user needs to do so in an unbiased, justifiable manner. These ethical constraints sometimes forbid solutions that would be technically useful, such as sharing a model learned from the personal data of one user with another user, or using black boxes to rank query results. We fully intend to consider these ethical considerations within Valda. One of the main goals of a Pims is indeed to empower the user with full control over the use of this data.

5 Social and environmental responsibility

Data-driven algorithmic systems raise ethical and legal concerns that need to be taken into account within research. Serge Abiteboul, with collaborators from NYU, U. Washington, U. Michigan, U. Amsterdam, wrote a position article detailing the role that data management research needs to play in ensuring responsible design and use of algorithmic data-driven systems 17.

6 Highlights of the year

6.1 Awards

Michaël Thomazo, together with Maxime Buron and Marie-Laure Mugnier, received the BDA (French database community) award for their work on Parallelisable Existential Rules: a Story of Pieces 31, also published at KR 2021 31.

6.2 Broader Inria Context

The work of the Valda team in 2022 was affected by several issues within Inria; in particular major issues with the deployment of a new information system (Eksae) negatively impacted the work of our administrative assistant and made it impossible for the team leader to keep track of expenses.

The team would also like to thank the Inria evaluation committee for its admirable work in support of the research community, for its transparency, and for the integrity with which it conducts its activities.

7 New software and platforms

7.1 New software

7.1.1 ORBITS

Name:
Optimal Repair-Based Inconsistency-Tolerant Semantics
Keywords:
Knowledge Bases, Databases
Scientific Description:
ORBITS (Optimal Repair-Based Inconsistency-Tolerant Semantics) is a tool for filtering answers that hold under a given inconsistency-tolerant semantics among AR, IAR and brave with standard repairs or Pareto- or completion-optimal repairs in the case where a priority relation between the conflicting facts is given. ORBITS implements a variety of algorithms and propositional encoding variants for each semantics and type of repairs.
Functional Description:
ORBITS is a tool for filtering answers that hold under a given inconsistency-tolerant semantics based on some kind of optimal repairs in the case where a priority relation between the conflicting facts is given.
URL:
https://github.com/bourgaux/orbits
Publication:
hal-03770516
Contact:
Camille Bourgaux
Participant:
Camille Bourgaux
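To give an intuition of the three semantics named above, here is a brute-force Python sketch on a toy knowledge base, with standard (subset-maximal) repairs only and no priority relation. It is not based on the ORBITS code, which relies on propositional encodings; the function names, the encoding of a Boolean query by its "causes" (sets of facts that suffice to entail it), and the example facts are all assumptions made for illustration.

```python
from itertools import combinations
from typing import FrozenSet, List, Sequence, Tuple

Facts = FrozenSet[str]


def repairs(facts: Sequence[str], conflicts: Sequence[Tuple[str, str]]) -> List[Facts]:
    """All subset-maximal conflict-free subsets of the facts (exponential enumeration)."""
    def consistent(subset) -> bool:
        return not any(a in subset and b in subset for a, b in conflicts)

    candidates = [
        frozenset(c)
        for r in range(len(facts) + 1)
        for c in combinations(sorted(facts), r)
        if consistent(c)
    ]
    return [s for s in candidates if not any(s < t for t in candidates)]


def entails(fact_set: Facts, causes: Sequence[Facts]) -> bool:
    return any(cause <= fact_set for cause in causes)


def brave(reps: List[Facts], causes: Sequence[Facts]) -> bool:
    return any(entails(r, causes) for r in reps)      # holds in some repair


def ar(reps: List[Facts], causes: Sequence[Facts]) -> bool:
    return all(entails(r, causes) for r in reps)      # holds in every repair


def iar(reps: List[Facts], causes: Sequence[Facts]) -> bool:
    return entails(frozenset.intersection(*reps), causes)   # holds in the intersection


# Hypothetical data: ann cannot be both a professor and a student.
facts = ["Prof(ann)", "Student(ann)", "Teaches(ann,db)"]
conflicts = [("Prof(ann)", "Student(ann)")]
reps = repairs(facts, conflicts)

# "Is ann a person?", assuming an ontology where professors and students are persons.
person = [frozenset({"Prof(ann)"}), frozenset({"Student(ann)"})]
# "Is ann a professor?"
prof = [frozenset({"Prof(ann)"})]

print(brave(reps, person), ar(reps, person), iar(reps, person))
print(brave(reps, prof), ar(reps, prof), iar(reps, prof))
```

On this example, "ann is a person" holds under the brave and AR semantics but not under IAR, while "ann is a professor" holds only under the brave semantics.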

7.1.2 ProvSQL

Keywords:
Databases, Provenance, Probability
Functional Description:
The goal of the ProvSQL project is to add support for (m-)semiring provenance and uncertainty management to PostgreSQL databases, in the form of a PostgreSQL extension/module/plugin.
News of the Year:
Support for PostgreSQL 15. Miscellaneous enhancements and bug fixes.
URL:
https://github.com/PierreSenellart/provsql
Publications:
hal-01672566, hal-01851538
Contact:
Pierre Senellart
Participants:
Pierre Senellart, Baptiste Lafosse

7.1.3 TheoremKB

Keyword:
Information extraction
Functional Description:
TheoremKB is a collection of tools to extract semantic information from (mathematical) research articles.
News of the Year:
Improvements to theorem extraction, preliminary work on multimodal approach.
URL:
https://github.com/PierreSenellart/theoremkb
Publications:
hal-02956526, hal-02940819, hal-03293643, hal-03897168
Contact:
Pierre Senellart
Participants:
Pierre Senellart, Shrey Mishra, Yacine Brihmouche

7.1.4 apxproof

Keyword:
LaTeX
Functional Description:
apxproof is a LaTeX package facilitating the typesetting of research articles with proofs in appendix, a common practice in database theory and theoretical computer science in general. The appendix material is written in the LaTeX code along with the main text which it naturally complements, and it is automatically deferred. The package can automatically send proofs to the appendix, can repeat in the appendix the theorem environments stated in the main text, can section the appendix automatically based on the sectioning of the main text, and supports a separate bibliography for the appendix material.
Release Contributions:
Support for lipics's claimproof, support optional arguments in proofs
News of the Year:
1.2.4 release: support for claimproof environment from lipics, support optional arguments in proofs.
URL:
https://github.com/PierreSenellart/apxproof
Contact:
Pierre Senellart
Participant:
Pierre Senellart

7.1.5 dissem.in

Name:
Dissemin
Keywords:
Open Access, Publishing, HAL
Functional Description:
Dissemin is a web platform gathering metadata from many sources to analyze the open-access full text availability of publications of researchers. It has been designed to foster the use of repositories such as HAL (rather than preprints posted on personal homepages). It allows deposit on these repositories.
News of the Year:
Support for a large variety of IdP from Shibboleth. Various Shibboleth fixes. Support for v2 of Sherpa/Romeo API. Other small improvements, bug fixes, maintenance.
URL:
https://gitlab.com/dissemin/dissemin
Contact:
Pierre Senellart
Participant:
Pierre Senellart
Partner:
CAPSH

7.2 New platforms

7.2.1 dissem.in

dissem.in, the openly accessible platform for promoting full-text deposit of scientific articles of researchers, which is based on the dissem.in (7.1.5) software, has been maintained by Valda since 2021. Works on the platform in 2022, in addition to works on the base software, include updating information about journals and publisher policies from the Sherpa/Romeo API.

Participants: Pierre Senellart, N. Smith.

8 New results

We present the results we obtained and published in 2022. Much research within Valda centers around the central problem of query answering in databases, but exploring various side questions: How to handle incomplete or inconsistent information? How to efficiently access query results when there are many of them? How to incorporate external ontologies within query answering? How to keep track of the provenance of queries? We describe our works in each of these areas in turn, and finish with other theoretical research conducted in the team, beyond data management.

8.1 Incomplete and inconsistent information

We first consider databases containing incomplete (missing) or inconsistent (contradictory) information.

One of the most common scenarios of handling incomplete information occurs in relational databases. They describe incomplete knowledge with three truth values, using Kleene’s logic for propositional formulae and a rather peculiar extension to predicate calculus. This design by a committee from several decades ago is now part of the standard adopted by vendors of database management systems. But is it really the right way to handle incompleteness in propositional and predicate logics? Our goal in 13 is to answer this question. Using an epistemic approach, we first characterize possible levels of partial knowledge about propositions, which leads to six truth values. We impose rationality conditions on the semantics of the connectives of the propositional logic, and prove that Kleene’s logic is the maximal sublogic to which the standard optimization rules apply, thereby justifying this design choice. For extensions to predicate logic, however, we show that the additional truth values are not necessary: every many-valued extension of first-order logic over databases with incomplete information represented by null values is no more powerful than the usual two-valued logic with the standard Boolean interpretation of the connectives. We use this observation to analyze the logic underlying SQL query evaluation, and conclude that the many-valued extension for handling incompleteness does not add any expressiveness to it.
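For readers less familiar with the three-valued logic mentioned above, the following Python sketch implements Kleene's connectives, with `None` standing for the "unknown" truth value, together with a comparison that returns unknown as soon as one operand is null. The function names are illustrative assumptions; the sketch covers only the propositional connectives, not the full SQL evaluation procedure.

```python
from typing import Optional

TruthValue = Optional[bool]   # True, False, or None for "unknown"


def k_not(a: TruthValue) -> TruthValue:
    return None if a is None else not a


def k_and(a: TruthValue, b: TruthValue) -> TruthValue:
    if a is False or b is False:
        return False          # one false conjunct settles the result
    if a is None or b is None:
        return None
    return True


def k_or(a: TruthValue, b: TruthValue) -> TruthValue:
    if a is True or b is True:
        return True           # one true disjunct settles the result
    if a is None or b is None:
        return None
    return False


def null_aware_equals(x, y) -> TruthValue:
    """Comparison in the style of SQL: unknown if either side is null (None)."""
    return None if x is None or y is None else x == y


# With x null, the condition "x = 1 OR NOT (x = 1)" evaluates to unknown,
# although it would be a tautology in two-valued logic.
x = None
condition = k_or(null_aware_equals(x, 1), k_not(null_aware_equals(x, 1)))
print(condition)
```

Since a SQL `WHERE` clause keeps only rows for which the condition is true, a row with a null `x` is not returned by such a condition.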

We continue on the topic of incomplete information in 18, where our goal is to collect and analyze the shortcomings of nulls and their treatment by SQL, and to re-evaluate existing research in this light. To this end, we designed and conducted a survey on the everyday usage of null values among database users. From the analysis of the results we reached two main conclusions. First, null values are ubiquitous and relevant in real-life scenarios, but SQL's features designed to deal with them cause multiple problems. The severity of these problems varies depending on the SQL features used, and they cannot be reduced to a single issue. Second, foundational research on nulls is misdirected and has been addressing problems of limited practical relevance. We urge the community to view the results of this survey as a way to broaden the spectrum of their research and further bridge the theory-practice gap on null values.

To answer database queries over incomplete data, the gold standard is finding certain answers: those that are true regardless of how incomplete data is interpreted. Such answers can be found efficiently for conjunctive queries and their unions, even in the presence of constraints such as keys or functional dependencies. With negation added, however, finding certain answers becomes intractable. In 28 we exhibit a well-behaved class of queries that extends unions of conjunctive queries with a limited form of negation and that permits efficient computation of certain answers even in the presence of constraints by means of rewriting into Datalog with negation. The class consists of queries that are the closure of conjunctive queries under Boolean operations of union, intersection and difference. We show that for these queries, certain answers can be expressed in Datalog with negation, even in the presence of functional dependencies, thus making them tractable in data complexity. We show that in general Datalog cannot be replaced by first-order logic, but without constraints such a rewriting can be done in first-order.
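To make the notion of certain answers concrete, here is a brute-force Python sketch that intersects the answers of a query over all ways of replacing nulls by constants. It is exponential in the number of nulls and is not the Datalog rewriting of 28. The `Null` class, the function name, and the choice of domain (the known constants plus one fresh constant per null, which we assume is enough for generic queries) are assumptions made for this illustration.

```python
from itertools import product


class Null:
    """A marked null: each Null object stands for one unknown value."""

    def __init__(self, name):
        self.name = name

    def __repr__(self):
        return f"Null({self.name!r})"


def certain_answers(db, query):
    """Intersect the query answers over all valuations of the nulls.

    db maps relation names to sets of tuples; query maps a complete
    database (same shape, no nulls) to a set of tuples.
    """
    values = {v for rel in db.values() for t in rel for v in t}
    nulls = [v for v in values if isinstance(v, Null)]
    consts = {v for v in values if not isinstance(v, Null)}
    # Assumption: one fresh constant per null suffices for generic queries.
    fresh = [("fresh", i) for i in range(len(nulls))]
    result = None
    for choice in product(list(consts) + fresh, repeat=len(nulls)):
        valuation = dict(zip(nulls, choice))
        complete = {
            name: {tuple(valuation.get(v, v) for v in t) for t in rel}
            for name, rel in db.items()
        }
        answers = set(query(complete))
        result = answers if result is None else result & answers
    # Keep only tuples built from constants of the original database.
    return {t for t in result if all(v in consts for v in t)}


# Example: R contains one tuple with an unknown second component.
n = Null("n")
db = {"R": {(1, n)}, "S": {(1, 2)}}


def first_of_r(d):
    """Positive query: project R on its first column."""
    return {(a,) for a, _ in d["R"]}


def first_of_diff(d):
    """Query with negation: project R - S on its first column."""
    return {(a,) for a, _ in d["R"] - d["S"]}
```

In this example the positive query has the certain answer `(1,)`, whereas the query with difference has none, because the null may be interpreted as `2`. Evaluating the difference naively, treating the null as a fresh value, would wrongly return `(1,)`.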

While all relational database systems are based on the bag data model, much of theoretical research still views relations as sets. Recent attempts to provide theoretical foundations for modern data management problems under the bag semantics concentrated on applications that need to deal with incomplete relations, i.e., relations populated by constants and nulls. Our goal in 12 is to provide a complete characterization of the complexity of query answering over such relations in fragments of bag relational algebra. The main challenges that we face are twofold. First, bag relational algebra has more operations than its set analog (e.g., additive union, max-union, min-intersection, duplicate elimination) and the relationship between various fragments is not fully known. Thus we first fill this gap. Second, we look at query answering over incomplete data, which again is more complex than in the set case: rather than certainty and possibility of answers, we now have numerical information about occurrences of tuples. We then fully classify the complexity of finding this information in all the fragments of bag relational algebra.
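The bag operations named above can be tried out with Python's standard `collections.Counter`, which stores a multiplicity for each tuple. This is only a sketch of the operations' semantics on complete bags (no nulls); the function names are ours.

```python
from collections import Counter


def additive_union(r, s):
    """Multiplicities add up (as in SQL's UNION ALL)."""
    return r + s


def max_union(r, s):
    """Each tuple keeps its larger multiplicity."""
    return r | s


def min_intersection(r, s):
    """Each tuple keeps its smaller multiplicity."""
    return r & s


def duplicate_elimination(r):
    """Every tuple that occurs at least once gets multiplicity one."""
    return Counter({t: 1 for t, count in r.items() if count > 0})


# Two small bags of unary tuples.
r = Counter({("a",): 2, ("b",): 1})
s = Counter({("a",): 1, ("c",): 3})
```

For these two bags, the tuple `("a",)` has multiplicity 3 in the additive union, 2 in the max-union, and 1 in the min-intersection, which shows why the three operations must be distinguished under bag semantics even though they coincide with union and intersection on sets.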

Finally, we turn to inconsistent data. In 19, 20, we investigate practical algorithms for inconsistency-tolerant query answering over prioritized knowledge bases, which consist of a logical theory, a set of facts, and a priority relation between conflicting facts. We consider three well-known semantics (AR, IAR and brave) based upon two notions of optimal repairs (Pareto and completion). Deciding whether a query answer holds under these semantics is (co)NP-complete in data complexity for a large class of logical theories, and SAT-based procedures have been devised for repair-based semantics when there is no priority relation, or the relation has a special structure. We introduce the first SAT encodings for Pareto- and completion-optimal repairs w.r.t. general priority relations and propose several ways of employing existing and new encodings to compute answers under (optimal) repair-based semantics, by exploiting different reasoning modes of SAT solvers. The comprehensive experimental evaluation of our implementation compares both (i) the impact of adopting semantics based on different kinds of repairs, and (ii) the relative performance of alternative procedures for the same semantics.
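The three semantics can be illustrated by a brute-force Python sketch in a deliberately simplified setting: no logical theory, no priority relation, binary conflicts only, and repairs taken to be the maximal conflict-free subsets of the facts. This enumeration is exponential and is not the SAT-based approach of 19, 20; all names and the example facts are hypothetical.

```python
from itertools import combinations


def repairs(facts, conflicts):
    """Return all maximal subsets of facts containing no conflicting pair."""
    facts = list(facts)

    def consistent(subset):
        return not any(a in subset and b in subset for a, b in conflicts)

    candidates = [
        frozenset(c)
        for k in range(len(facts) + 1)
        for c in combinations(facts, k)
        if consistent(c)
    ]
    # Keep the candidates that are not strictly contained in another one.
    return [s for s in candidates if not any(s < t for t in candidates)]


def holds_ar(query, reps):
    """AR: the query holds in every repair."""
    return all(query(r) for r in reps)


def holds_iar(query, reps):
    """IAR: the query holds in the intersection of all repairs."""
    return query(frozenset.intersection(*reps))


def holds_brave(query, reps):
    """Brave: the query holds in at least one repair."""
    return any(query(r) for r in reps)


# Hypothetical data: ann cannot be both a professor and a student.
facts = {"Prof(ann)", "Student(ann)", "Teaches(ann,db)"}
conflicts = [("Prof(ann)", "Student(ann)")]
reps = repairs(facts, conflicts)


def is_prof(s):
    return "Prof(ann)" in s


def prof_or_student(s):
    return "Prof(ann)" in s or "Student(ann)" in s
```

Here there are two repairs. The query `is_prof` holds only under brave semantics, while `prof_or_student` holds under AR and brave but not under IAR, since neither conflicting fact survives in the intersection of the repairs.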

8.2 Enumeration and direct access to query results

Many queries have as output sets of results which are too big to be generated at once. Two strategies can then be used: either to design algorithms for efficient enumeration of the query results, one after the other, or for efficient direct access to one specific result among the set of results.

In 16, we consider the evaluation of first-order queries over classes of databases that are nowhere dense. The notion of nowhere-dense classes was introduced by Nešetřil and Ossona de Mendez as a formalization of classes of “sparse” graphs and generalizes many well-known classes of graphs, such as classes of bounded degree, bounded tree-width, or bounded expansion. It has recently been shown by Grohe, Kreutzer, and Siebertz that over nowhere-dense classes of databases, first-order sentences can be evaluated in pseudo-linear time (pseudo-linear time means that for all $ϵ > 0$ there exists an algorithm working in time $O(n^{1 + ϵ})$, where $n$ is the size of the database). For first-order queries of higher arities, we show that over any nowhere-dense class of databases, the set of their solutions can be enumerated with constant delay after a pseudo-linear time preprocessing. In the same context, we also show that after a pseudo-linear time preprocessing we can, on input of a tuple, test in constant time whether it is a solution to the query.

A class of relational databases has low degree if for all $δ > 0$, all but finitely many databases in the class have degree at most $n^{δ}$, where $n$ is the size of the database. Typical examples are databases of bounded degree or of degree bounded by $\log n$. It is known that over a class of databases having low degree, first-order Boolean queries can be checked in pseudo-linear time, i.e., for all $ϵ > 0$ in time bounded by $n^{1 + ϵ}$. We generalize this result in 14 by considering query evaluation. We show that counting the number of answers to a query can be done in pseudo-linear time and that after a pseudo-linear time preprocessing we can test in constant time whether a given tuple is a solution to a query or enumerate the answers to a query with constant delay.

Finally, we consider in 25 the task of lexicographic direct access to query answers. That is, we want to simulate an array containing the answers of a join query sorted in a lexicographic order chosen by the user. A recent dichotomy showed for which queries and orders this task can be done in polylogarithmic access time after quasilinear preprocessing, but this dichotomy does not tell us how much time is required in the cases classified as hard. We determine the preprocessing time needed to achieve polylogarithmic access time for all self-join free queries and all lexicographic orders. To this end, we propose a decomposition-based general algorithm for direct access on join queries. We then explore its optimality by proving lower bounds for the preprocessing time based on the hardness of a certain online Set-Disjointness problem, which shows that our algorithm’s bounds are tight for all lexicographic orders on self-join free queries. Then, we prove the hardness of Set-Disjointness based on the Zero-Clique Conjecture, which is an established conjecture from fine-grained complexity theory. We also show that similar techniques can be used to prove that, for enumerating answers to Loomis-Whitney joins, it is not possible to significantly improve upon trivially computing all answers at preprocessing. This, in turn, gives further evidence (based on the Zero-Clique Conjecture) for the enumeration hardness of self-join free cyclic joins with respect to linear preprocessing and constant delay.
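The idea of simulating a sorted array of answers without materializing it can be sketched on one easy case: the join Q(a, b, c) of R(a, b) and S(b, c), ordered lexicographically by (a, b, c). The Python class below is a hypothetical illustration for this single query and order, not the decomposition-based algorithm of 25. It sorts the inputs once, stores prefix sums of the number of answers contributed by each tuple of R, and answers each access with a binary search.

```python
from bisect import bisect_right
from collections import defaultdict
from itertools import accumulate


class DirectAccess:
    """Array-like view of R(a, b) joined with S(b, c), sorted by (a, b, c)."""

    def __init__(self, r, s):
        # Group the c-values of S by b and sort each group.
        by_b = defaultdict(list)
        for b, c in set(s):
            by_b[b].append(c)
        for cs in by_b.values():
            cs.sort()
        self.by_b = by_b
        # Keep only the tuples of R that join, sorted by (a, b).
        self.rows = sorted(t for t in set(r) if t[1] in by_b)
        # prefix[i] = number of answers produced by rows[0..i].
        self.prefix = list(accumulate(len(by_b[b]) for _, b in self.rows))

    def __len__(self):
        return self.prefix[-1] if self.prefix else 0

    def __getitem__(self, k):
        if not 0 <= k < len(self):
            raise IndexError(k)
        # First row whose cumulative count exceeds k.
        i = bisect_right(self.prefix, k)
        a, b = self.rows[i]
        offset = k - (self.prefix[i - 1] if i else 0)
        return (a, b, self.by_b[b][offset])


# Small example instance.
r = {(1, "x"), (2, "y"), (1, "y")}
s = {("x", 10), ("y", 5), ("y", 7)}
answers = DirectAccess(r, s)
```

Preprocessing is dominated by sorting, and each access costs one binary search over the prefix sums, so the join result is never written out in full. Orders that interleave the variables less favorably (for instance, sorting by (a, c, b)) are where this simple scheme stops working.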

8.3 Ontology-mediated query answering

We now consider cases where, to answer a query, we need to take into account external knowledge given in the form of a logical ontology (e.g., described in description logics, or through existential rules).

While ontology-mediated query answering most often adopts (unions of) conjunctive queries as the query language, some recent works have explored the use of counting queries coupled with DL-Lite ontologies. The aim of 22, 21 is to extend the study of counting queries to Horn description logics outside the DL-Lite family. Through a combination of novel techniques, adaptations of existing constructions, and new connections to closed predicates, we achieve a complete picture of the data and combined complexity of answering counting conjunctive queries (CCQs) and cardinality queries (a restricted class of CCQs) in ${ℰℒℋℐ}_{⊥}$ and its various sublogics. Notably, we show that CCQ answering is 2EXP-complete in combined complexity for ${ℰℒℋℐ}_{⊥}$ and every sublogic that extends EL or DL-Lite $_{pos}^{ℋ}$ . Our study not only provides the first results for counting queries beyond DL-Lite, but it also closes some open questions about the combined complexity of CCQ answering in DL-Lite.

Existential rules are a very popular ontology-mediated query language for which the chase represents a generic computational approach for query answering. It is straightforward that existential rule queries exhibiting chase termination are decidable and can only recognize properties that are preserved under homomorphisms. 24 is an extended abstract of our eponymous publication at KR 2021 where we show the converse: every decidable query that is closed under homomorphism can be expressed by an existential rule set for which the standard chase universally terminates. Membership in this fragment is not decidable, but we show via a diagonalisation argument that this is unavoidable.

In the literature, existential rules are often supposed to be in some normal form that simplifies technical developments. For instance, a common assumption is that rule heads are atomic, i.e., restricted to a single atom. Such assumptions are considered to be made without loss of generality as long as all sets of rules can be normalised while preserving entailment. However, an important question is whether the properties that ensure the decidability of reasoning are preserved as well. We provide in 26 a systematic study of the impact of these procedures on the different chase variants with respect to chase (non-)termination and FO-rewritability. This also leads us to study open problems related to chase termination of independent interest.

8.4 Provenance for recursive queries

Data provenance consists in bookkeeping meta information during query evaluation, in order to enrich query results with their trust level, likelihood, evaluation cost, and more. The framework of semiring provenance abstracts from the specific kind of meta information that annotates the data.

While the definition of semiring provenance is uncontroversial for unions of conjunctive queries, the picture is less clear for Datalog. Indeed, the original definition might include infinite computations, and is not consistent with other proposals for Datalog semantics over annotated data. In 23, we propose and investigate several provenance semantics, based on different approaches for defining classical Datalog semantics. We study the relationship between these semantics, and introduce properties that allow us to analyze and compare them.
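The following Python sketch illustrates one possible way of evaluating a Datalog program over annotated facts: it iterates the rules of a transitive-closure program from the empty interpretation, combining alternative derivations with the semiring addition and joint use of facts with the semiring multiplication. It is a naive fixpoint under our own assumptions (names, representation, and a bound on the number of rounds), not one of the semantics defined in 23.

```python
from typing import Any, Callable, NamedTuple


class Semiring(NamedTuple):
    zero: Any
    one: Any
    plus: Callable[[Any, Any], Any]
    times: Callable[[Any, Any], Any]


def path_provenance(edges, sr, max_rounds=1000):
    """Annotate path facts for the program
         path(x, y) :- edge(x, y).
         path(x, z) :- path(x, y), edge(y, z).
    edges maps pairs (x, y) to semiring annotations.
    """
    path = {}
    for _ in range(max_rounds):
        new = {}
        for (x, y), w in edges.items():
            new[(x, y)] = sr.plus(new.get((x, y), sr.zero), w)
        for (x, y), p in path.items():
            for (y2, z), w in edges.items():
                if y == y2:
                    new[(x, z)] = sr.plus(new.get((x, z), sr.zero), sr.times(p, w))
        if new == path:
            return path
        path = new
    raise RuntimeError("annotations did not stabilise")


# Tropical semiring (min, +): annotations are costs, result is cheapest path.
tropical = Semiring(float("inf"), 0, min, lambda a, b: a + b)
# Counting semiring (+, *): annotations count derivations.
counting = Semiring(0, 1, lambda a, b: a + b, lambda a, b: a * b)

costs = {("a", "b"): 1, ("b", "c"): 2, ("a", "c"): 5, ("c", "a"): 1}
shortest = path_provenance(costs, tropical)
```

On this cyclic graph the tropical annotations stabilise (the cheapest path from `a` to `c` costs 3), whereas running the same function with the `counting` semiring and all edges annotated by 1 never stabilises, because the cycle yields ever more derivations. This is the kind of infinite computation mentioned above.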

In 30, 33, we establish a translation between a formalism for dynamic programming over hypergraphs and the computation of semiring-based provenance for Datalog programs. The benefit of this translation is a new method for computing the provenance of Datalog programs for specific classes of semirings, which we apply to provenance-aware querying of graph databases. Theoretical results and practical optimizations lead to an efficient implementation using Soufflé, a state-of-the-art Datalog interpreter. Experimental results on real-world data suggest this approach to be efficient in practical contexts, competing with dedicated solutions for graphs.

8.5 Theoretical computer science beyond databases

Valda's research has always encompassed other foundational topics. We conclude with a description of other theoretical computer science work (namely, in algebraic automata theory and logic) that does not fit within the previous areas of research.

The program-over-monoid model of computation originates with Barrington's proof that the model captures the complexity class NC $^{1}$ . In 15 we make progress in understanding the subtleties of the model. First, we identify a new tameness condition on a class of monoids that entails a natural characterization of the regular languages recognizable by programs over monoids from the class. Second, we prove that the class known as DA satisfies tameness and hence that the regular languages recognized by programs over monoids in DA are precisely those recognizable in the classical sense by morphisms from QDA. Third, we show by contrast that the well studied class of monoids called J is not tame. Finally, we exhibit a program-length-based hierarchy within the class of languages recognized by programs over monoids from DA.

When we bundle quantifiers and modalities together (as in $\exists x □$ , $◊ \forall x$ etc.) in first-order modal logic (FOML), we get new logical operators whose combinations produce interesting bundled fragments of FOML. It is well-known that finding decidable fragments of FOML is hard, but existing work shows that certain bundled fragments are decidable, without any restriction on the arity of predicates, the number of variables, or the modal scope. In 29, we explore generalized bundles such as $\forall x \forall y □$ , $\forall x \exists y ◊$ etc., and map the terrain with regard to decidability, presenting both decidability and undecidability results. In particular, we propose the loosely bundled fragment, which is decidable over increasing domains and encompasses all known decidable bundled fragments.

9 Bilateral contracts and grants with industry

9.1 Standardization activities

Leonid Libkin is involved in the standardization process of the GQL and SQL query languages. In particular, he is a chair of the LDBC working group on semantics of GQL, and a member of ISO/IEC JTC1 SC32 WG3 (SQL committee). He is also a member of INCITS, the US InterNational Committee for Information Technology Standards.

As part of this standardization effort, 27 presents the key elements of the graph pattern matching language at the core of both SQL/PGQ and GQL, in advance of the publication of the corresponding new standards.

Participants: Leonid Libkin.

10 Partnerships and cooperations

10.1 International initiatives

10.1.1 Associate Teams in the framework of an Inria International Lab or in the framework of an Inria International Program

GQA

Title:
Languages for Graph Querying and Analytics
Duration:
2022 ->
Coordinator:
Pablo Barceló (pbarcelo@dcc.uchile.cl)
Partners:
- Pontificia Universidad Católica de Chile, Santiago (Chile)
Inria contact:
Leonid Libkin
Summary:
The project brings together experts in graph databases, in particular in the new generation of query languages currently standardized by the ISO. The history of collaboration between the two groups goes back many years and pre-dates our current collaboration on graph data, having started in the areas of tree-structured data and data interoperability. Our main objective is to combine the graph query language expertise of the Inria group with the machine learning and graph analytics expertise of the Chilean group to come up with a new generation of query languages that seamlessly integrate graph querying with analytics.

10.1.2 Participation in other International Programs

DesCartes
(2021–2026) is a project managed by CNRS@CREATE, a CNRS subsidiary in Singapore and funded by Singapore's National Research Foundation, with 50 million total budget. Pierre Senellart is involved in the project as one of the French PIs.

10.1.3 Informal international partners

Valda has strong collaborations with the following international groups:

Univ. Edinburgh, United Kingdom:
Paolo Guagliardo, Andreas Pieris
Univ. Oxford, United Kingdom:
Michael Benedikt and Georg Gottlob
TU Dresden, Germany:
Markus Krötzsch and Sebastian Rudolph
Dortmund University, Germany:
Thomas Schwentick
Bayreuth University, Germany:
Wim Martens
Univ. Bergen, Norway:
Ana Ozaki
Univ. Roma La Sapienza, Italy:
Marco Console
Warsaw University, Poland:
Mikołaj Bojańczyk and Szymon Toruńczyk
Tel Aviv University, Israel:
Daniel Deutch and Tova Milo
NYU, USA:
Julia Stoyanovich
Univ. California San Diego, USA:
Victor Vianu
Pontifical Catholic University of Chile:
Marcelo Arenas, Pablo Barceló
National University of Singapore:
Stéphane Bressan

10.2 International research visitors

10.2.1 Visits of international scientists

Victor Vianu, Professor at UCSD, visited the group for several months in 2022. He was also hired on a fixed-term contract by ENS.
Yael Amsterdamer, Senior Lecturer at Bar-Ilan University & Daniel Deutch, Professor at Tel Aviv University jointly visited Valda in July 2022.
Dan Suciu, Professor at University of Washington, visited Valda in November 2022.

10.3 European initiatives

10.3.1 Other European programs/initiatives

A bilateral French–German ANR project, entitled EQUUS – Efficient Query answering Under UpdateS, started in 2020. It involves CNRS (CRIL, CRIStAL, IMJ), Télécom Paris, HU Berlin, and Bayreuth University, in addition to Inria Valda.

10.4 National initiatives

10.4.1 ANR

Valda has been part of three national ANR projects in 2022:

CQFD
(2018–2024; 19 k€ for Valda, budget managed by Inria), with Inria Sophia (GraphIK, coordinator), LaBRI, LIG, Inria Saclay (Cedar), IRISA, Inria Lille (Spirals), and Télécom ParisTech, on complex ontological queries over federated and heterogeneous data.
QUID
(2018–2024; 49 k€ for Valda, budget managed by Inria), with LIGM (coordinator), IRIF, and LaBRI, on incomplete and inconsistent data.
VERIGRAPH
(2022–2026; 150 k€ for Valda (coordinator), budget managed by ENS), with LIG and LIRIS, on verifiable graph queries and transformations.

Camille Bourgaux has been participating in the AI Chair of Meghyn Bienvenu on INTENDED (Intelligent handling of imperfect data) since 2020.

Pierre Senellart has held a chair within the PR[AI]RIE institute for artificial intelligence in Paris since 2019.

10.4.2 Others

Dissemin
(2021–2024; 124 k€ for Valda, budget managed by ENS), sole partner, on the development of the dissem.in platform for open science promotion. Funded by the Fonds National Science Ouverte.

11 Dissemination

11.1 Promoting scientific activities

11.1.1 Scientific events: organisation

General chair, scientific chair

Leonid Libkin, general chair of PODS 2022; chair (until July 2022) and now member of the PODS Executive Committee

Member of the organizing committees

Leonid Libkin, member of the LICS Steering Committee
Luc Segoufin, member of the steering committee of the conference series Highlights of Logic, Games and Automata

11.1.2 Scientific events: selection

Chair of conference program committees

Camille Bourgaux, program co-chair of the Artificial Intelligence in Bergen research school, AIB 2022

Member of the conference program committees

Camille Bourgaux, AAAI 2023, IJCAI-ECAI 2022, KR 2022, DL 2022
Nofar Carmeli, PODS 2023
Leonid Libkin, KR 2022 (area chair), KR 2023, The Web Conference 2023 (industry track)
Pierre Senellart, BDA 2022, SIGMOD 2023
Michaël Thomazo, IJCAI 2022, KR 2022

11.1.3 Journal

Member of the editorial boards

Leonid Libkin, Acta Informatica
Leonid Libkin, Bulletin of Symbolic Logic
Luc Segoufin, ACM Transactions on Computational Logic

11.1.4 Invited talks

Nofar Carmeli, Invited Tutorial at ICDT 2022 on Answering Unions of Conjunctive Queries with Ideal Time Guarantees
Leonid Libkin, Invited Talk at KR 2022 on Graph queries: do we study what matters?
Leonid Libkin, Invited Talk at the Workshop on Finite Model Theory and Many-Valued Logics

11.1.5 Leadership within the scientific community

Serge Abiteboul is a member of the French Academy of Sciences, of the Academia Europaea, of the scientific council of the Société Informatique de France, and an ACM Fellow.
Leonid Libkin is a Fellow of the Royal Society of Edinburgh, a member of the Academia Europaea, of the UK Computing Research Committee, and an ACM Fellow.
Pierre Senellart is a junior member of the Institut Universitaire de France.

11.1.6 Research administration

Luc Segoufin is a member of the CNHSCT of Inria.
Pierre Senellart is the president of section 6 of the National Committee for Scientific Research.
Pierre Senellart is a member of the board of the conference of presidents of the national committee (CPCN) and as such a member of the coordination of managing parties of the national committee (C3N).
Pierre Senellart is deputy director of the DI ENS laboratory, joint between ENS, CNRS, and Inria.

11.2 Teaching - Supervision - Juries

11.2.1 Teaching

Licence: Algorithms, L2, CPES, PSL – Pierre Senellart
Licence: Practical Computing, L3, École normale supérieure – Pierre Senellart
Licence: Formal Languages, Computability, Complexity, L3, École normale supérieure – Michaël Thomazo, Yann Ramusat
Licence: Databases, L3, École normale supérieure – Leonid Libkin, Yann Ramusat
Master: Advanced Databases, M2, IASD – Pierre Senellart, Michaël Thomazo
Master: Data wrangling, Data privacy, M2, IASD – Leonid Libkin, Pierre Senellart
Master: Anonymization, privacy, IASD – Pierre Senellart
Master: Knowledge graphs, description logics, reasoning on data, M2, IASD – Camille Bourgaux, Michaël Thomazo

Pierre Senellart holds various teaching responsibilities (L3 internships, M1 projects, M2 administration, entrance competition) at ENS. Pierre Senellart is on the managing board of the graduate program. Leonid Libkin is co-responsible for the international entrance competition at ENS. Yann Ramusat was the secretary of the entrance competition at ENS for computer science. Michaël Thomazo is an adjunct professor at PSL.

Most members of the group are also involved in tutoring ENS students, advising them on their curriculum, their internships, etc. They are also occasionally involved with reviewing internship reports, supervising student projects, etc.

11.2.2 Supervision

PhD completed: Quentin Manière, Counting queries in ontology-based data access, 2019–2022, Meghyn Bienvenu & Michaël Thomazo (as he was based in Bordeaux, he was not considered a Valda member)
PhD completed: Yann Ramusat, Provenance-based routing in probabilistic graphs 33, 2018–2022, Silviu Maniu & Pierre Senellart
PhD in progress: Anatole Dahan, Logical foundations of the polynomial hierarchy, started in October 2020, Arnaud Durand & Luc Segoufin
PhD in progress: Baptiste Lafosse, Compiler dedicated to the evaluation of SQL queries, started in October 2021, Pierre Senellart & Jean-Marie Lagniez
PhD in progress: Shrey Mishra, Towards a knowledge base of mathematic results, started in January 2021, Pierre Senellart
PhD in progress: Alexandra Rogova, Query analytics in Cypher, started in October 2021, Amélie Gheerbrant & Leonid Libkin
PhD in progress: Étienne Toussaint, Paolo Guagliardo & Leonid Libkin (as he is based in Edinburgh, he is not considered a Valda member)
Internship: Yacine Brihmouche, M1 internship, Pierre Senellart 34
Internship: Siméon Gheorghin, L3 internship, Pierre Senellart 35

11.2.3 Juries

PhD: Sajad Nazari [reviewer], INSA Centre Val de Loire, Pierre Senellart

11.3 Popularization

11.3.1 Responsibilities

Serge Abiteboul is a member of the strategic committee of the Blaise Pascal foundation for scientific mediation.
Pierre Senellart is a scientific expert advising the Scientific and Ethical Committee of Parcoursup, the platform for the selection of first-year higher education students.

11.3.2 Articles and contents

Serge Abiteboul is a founding editor of the binaire blog for popularizing computer science.
Serge Abiteboul contributed an interview about artificial intelligence to the May 2022 special edition of Pour la Science; this article was among the ten 2022 articles recommended by the editorial team of the magazine.
Serge Abiteboul wrote a book on the regulation of social networks 32.
Serge Abiteboul wrote an article on the carbon impact of 5G 11.

12 Scientific production

12.1 Major publications

1 Michael Benedikt, Pierre Bourhis, Georg Gottlob and Pierre Senellart. Monadic Datalog, Tree Validity, and Limited Access Containment. ACM Transactions on Computational Logic 21(1), 2020, 6:1–6:45.
2 Meghyn Bienvenu, Quentin Manière and Michaël Thomazo. Answering Counting Queries over DL-Lite Ontologies. IJCAI 2020 - Twenty-Ninth International Joint Conference on Artificial Intelligence, Yokohama, Japan, July 2020 (postponed from July 2020 to January 2021 due to COVID).
3 Camille Bourgaux, Pierre Bourhis, Liat Peterfreund and Michaël Thomazo. Revisiting Semiring Provenance for Datalog. KR 2022 - 19th International Conference on Principles of Knowledge Representation and Reasoning, Haifa, Israel, July 2022, 91–101.
4 Camille Bourgaux, David Carral, Markus Krötzsch, Sebastian Rudolph and Michaël Thomazo. Capturing Homomorphism-Closed Decidable Queries with Existential Rules. KR 2021 - 18th International Conference on Principles of Knowledge Representation and Reasoning, Virtual, Vietnam, November 2021, 141–150.
5 Maxime Buron, Marie-Laure Mugnier and Michaël Thomazo. Parallelisable Existential Rules: a Story of Pieces. KR 2021 - 18th International Conference on Principles of Knowledge Representation and Reasoning, Virtual, Vietnam, November 2021.
6 Marco Console, Paolo Guagliardo, Leonid Libkin and Etienne Toussaint. Coping with Incomplete Data: Recent Advances. SIGMOD/PODS 2020 - International Conference on Management of Data, Portland / Virtual, United States, ACM, June 2020, 33–47.
7 Nathan Grosshans, Pierre Mckenzie and Luc Segoufin. Tameness and the power of programs over monoids in DA. Logical Methods in Computer Science 18(3), August 2022, 14:1–14:34.
8 Nicole Schweikardt, Luc Segoufin and Alexandre Vigny. Enumeration for FO Queries over Nowhere Dense Graphs. Journal of the ACM (JACM) 69(3), June 2022, 1–37.
9 Pierre Senellart, Louis Jachiet, Silviu Maniu and Yann Ramusat. ProvSQL: Provenance and Probability Management in PostgreSQL. Proceedings of the VLDB Endowment (PVLDB) 11(12), August 2018, 2034–2037.
10 Etienne Toussaint, Paolo Guagliardo, Leonid Libkin and Juan Sequeda. Troubles with nulls, views from the users. Proceedings of the VLDB Endowment (PVLDB) 15(11), July 2022, 2613–2625.

12.2 Publications of the year

International journals

11 Serge Abiteboul and Patrick Lagrange. 5G : amélioration ou aggravation du bilan carbone ? Polytechnique Insights, March 2022.
12 Marco Console, Paolo Guagliardo and Leonid Libkin. Fragments of bag relational algebra: Expressiveness and certain answers. Information Systems 105, March 2022, 101604.
13 Marco Console, Paolo Guagliardo and Leonid Libkin. Propositional and predicate logics of incomplete information. Artificial Intelligence 302, January 2022, 103603.
14 Arnaud Durand, Nicole Schweikardt and Luc Segoufin. Enumerating Answers to First-Order Queries over Databases of Low Degree. Logical Methods in Computer Science 18(2), May 2022, 23.
15 Nathan Grosshans, Pierre Mckenzie and Luc Segoufin. Tameness and the power of programs over monoids in DA. Logical Methods in Computer Science 18(3), August 2022, 14:1–14:34.
16 Nicole Schweikardt, Luc Segoufin and Alexandre Vigny. Enumeration for FO Queries over Nowhere Dense Graphs. Journal of the ACM (JACM) 69(3), June 2022, 1–37.
17 Julia Stoyanovich, Bill Howe, Hosagrahar Visvesvaraya Jagadish, Sebastian Schelter and Serge Abiteboul. Responsible data management. Communications of the ACM 65(6), June 2022, 64–74.
18 Etienne Toussaint, Paolo Guagliardo, Leonid Libkin and Juan Sequeda. Troubles with nulls, views from the users. Proceedings of the VLDB Endowment (PVLDB) 15(11), July 2022, 2613–2625.

International peer-reviewed conferences

19 Meghyn Bienvenu and Camille Bourgaux. Querying Inconsistent Prioritized Data with ORBITS: Algorithms, Implementation, and Experiments (Extended Abstract). DL 2022 - 35th International Workshop on Description Logics, Haifa, Israel, August 2022.
20 Meghyn Bienvenu and Camille Bourgaux. Querying Inconsistent Prioritized Data with ORBITS: Algorithms, Implementation, and Experiments. KR 2022 - 19th International Conference on Principles of Knowledge Representation and Reasoning, Haifa, Israel, July 2022.
21 Meghyn Bienvenu, Quentin Manière and Michaël Thomazo. Complexity Landscape for Counting Queries. Proceedings of the 35th International Workshop on Description Logics, Haifa, Israel, August 2022.
22 Meghyn Bienvenu, Quentin Manière and Michaël Thomazo. Counting Queries over ELHI⊥ Ontologies. KR 2022 - 19th International Conference on Principles of Knowledge Representation and Reasoning, Haifa, Israel, July 2022, 53–62.
23 Camille Bourgaux, Pierre Bourhis, Liat Peterfreund and Michaël Thomazo. Revisiting Semiring Provenance for Datalog. KR 2022 - 19th International Conference on Principles of Knowledge Representation and Reasoning, Haifa, Israel, July 2022, 91–101.
24 Camille Bourgaux, David Carral, Markus Krötzsch, Sebastian Rudolph and Michaël Thomazo. Capturing Homomorphism-Closed Decidable Queries with Existential Rules (Extended Abstract). IJCAI-ECAI 2022 - 31st International Joint Conference on Artificial Intelligence - 25th European Conference on Artificial Intelligence, Vienna, Austria, July 2022, 5269–5273.
25 Karl Bringmann, Nofar Carmeli and Stefan Mengel. Tight Fine-Grained Bounds for Direct Access on Join Queries. SIGMOD/PODS '22: International Conference on Management of Data, Philadelphia PA, United States, ACM, June 2022, 427–436.
26 David Carral, Lucas Larroque, Marie-Laure Mugnier and Michaël Thomazo. Normalisations of Existential Rules: Not so Innocuous! KR 2022 - 19th International Conference on Principles of Knowledge Representation and Reasoning, Haifa, Israel, 2022, 102–111.
27 Alin Deutsch, Nadime Francis, Alastair Green, Keith Hare, Bei Li, Leonid Libkin, Tobias Lindaaker, Victor Marsault, Wim Martens, Jan Michels, Filip Murlak, Stefan Plantikow, Petra Selmer, Hannes Voigt, Oskar van Rest, Domagoj Vrgoč, Mingxi Wu and Fred Zemke. Graph Pattern Matching in GQL and SQL/PGQ. SIGMOD '22: International Conference on Management of Data, Philadelphia, United States, June 2022.
28 Amélie Gheerbrant, Leonid Libkin, Alexandra Rogova and Cristina Sirangelo. Certain Answers of Extensions of Conjunctive Queries by Datalog and First-Order Rewriting. 4th International Workshop on the Resurgence of Datalog in Academia and Industry, Genoa, Italy, September 2022.
29 Mo Liu, Anantha Padmanabha, R. Ramanujam and Yanjing Wang. Generalized Bundled Fragments for First-Order Modal Logic. 47th International Symposium on Mathematical Foundations of Computer Science (MFCS 2022), Leibniz International Proceedings in Informatics 241, Schloss Dagstuhl – Leibniz-Zentrum für Informatik, Vienna, Austria, August 2022, 70:1–70:14.
30 Yann Ramusat, Silviu Maniu and Pierre Senellart. Efficient Provenance-Aware Querying of Graph Databases with Datalog. GRADES-NDA 2022 - Joint Workshop on Graph Data Management Experiences & Systems (GRADES) and Network Data Analytics (NDA), Philadelphia, United States, June 2022.

Conferences without proceedings

31 inproceedings Maxime Buron, Marie-Laure Mugnier and Michaël Thomazo. Parallelisable Existential Rules: a Story of Pieces. BDA 2022 - 38ème journée "Gestion de Données – Principes, Technologies et Applications", Clermont-Ferrand, France, October 2022.

Scientific books

32 book Serge Abiteboul and Jean Cattan. Nous sommes les réseaux sociaux. Odile Jacob, 2022.

Doctoral dissertations and habilitation theses

33 thesis Yann Ramusat. The Semiring-Based Provenance Framework for Graph Databases. Ecole normale supérieure - ENS PARIS; PSL University, April 2022.

Other scientific publications

34 thesis Yacine Brihmouche. TheoremKB : une base de connaissance des résultats mathématiques. Paris IX Dauphine, September 2022.
35 thesis Siméon Gheorghin. Etude de données Twitter en lien avec l'élection présidentielle française d'avril 2022. Paris Sciences et Lettres, Paris, June 2022, 14.

12.3 Cited publications

36 article Serge Abiteboul, Benjamin André and Daniel Kaplan. Managing your digital life. Commun. ACM 58(5), 2015, 32-35. URL: http://doi.acm.org/10.1145/2670528
37 article Serge Abiteboul, Pierre Bourhis and Victor Vianu. Comparing workflow specification languages: A matter of views. ACM Trans. Database Syst. 37(2), 2012, 10:1-10:59. URL: http://doi.acm.org/10.1145/2188349.2188352
38 book Serge Abiteboul, Peter Buneman and Dan Suciu. Data on the Web: From Relations to Semistructured Data and XML. Morgan Kaufmann, 1999.
39 inproceedings Serge Abiteboul, Laurent Herr and Jan Van den Bussche. Temporal Versus First-Order Logic to Query Temporal Databases. Proceedings of the Fifteenth ACM SIGACT-SIGMOD-SIGART Symposium on Principles of Database Systems, June 3-5, 1996, Montreal, Canada, 49-57. URL: http://doi.acm.org/10.1145/237661.237674
40 book Serge Abiteboul, Richard Hull and Victor Vianu. Foundations of Databases. Addison-Wesley, 1995. URL: http://webdam.inria.fr/Alice/
41 book Serge Abiteboul, Ioana Manolescu, Philippe Rigaux, Marie-Christine Rousset and Pierre Senellart. Web Data Management. Cambridge University Press, 2011. URL: http://webdam.inria.fr/Jorge
42 inproceedings Antoine Amarilli, Pierre Bourhis and Pierre Senellart. Provenance Circuits for Trees and Treelike Instances. Automata, Languages, and Programming - 42nd International Colloquium, ICALP 2015, Kyoto, Japan, July 6-10, 2015, Proceedings, Part II, 56-68. URL: https://doi.org/10.1007/978-3-662-47666-6_5
43 inproceedings Antoine Amarilli, Pierre Bourhis and Pierre Senellart. Tractable Lineages on Treelike Instances: Limits and Extensions. Proceedings of the 35th ACM SIGMOD-SIGACT-SIGAI Symposium on Principles of Database Systems, PODS 2016, San Francisco, CA, USA, June 26 - July 01, 2016, 355-370. URL: http://doi.acm.org/10.1145/2902251.2902301
44 article Yael Amsterdamer, Yael Grossman, Tova Milo and Pierre Senellart. CrowdMiner: Mining association rules from the crowd. PVLDB 6(12), 2013, 1250-1253. URL: http://www.vldb.org/pvldb/vol6/p1250-amsterdamer.pdf
45 inproceedings Pablo Barceló Baeza. Querying graph databases. Proceedings of the 32nd ACM SIGMOD-SIGACT-SIGART Symposium on Principles of Database Systems, PODS 2013, New York, NY, USA - June 22 - 27, 2013, 175-188. URL: http://doi.acm.org/10.1145/2463664.2465216
46 article Daniel Barbará, Hector Garcia-Molina and Daryl Porter. The Management of Probabilistic Data. IEEE Trans. Knowl. Data Eng. 4(5), 1992, 487-502. URL: https://doi.org/10.1109/69.166990
47 article Debabrota Basu, Qian Lin, Weidong Chen, Hoang Tam Vo, Zihong Yuan, Pierre Senellart and Stéphane Bressan. Regularized Cost-Model Oblivious Database Tuning with Reinforcement Learning. T. Large-Scale Data- and Knowledge-Centered Systems 28, 2016, 96-132. URL: https://doi.org/10.1007/978-3-662-53455-7_5
48 inproceedings Michael Benedikt, Georg Gottlob and Pierre Senellart. Determining relevance of accesses at runtime. Proceedings of the 30th ACM SIGMOD-SIGACT-SIGART Symposium on Principles of Database Systems, PODS 2011, June 12-16, 2011, Athens, Greece, 211-222. URL: http://doi.acm.org/10.1145/1989284.1989309
49 incollection Michael Benedikt and Pierre Senellart. Databases. Computer Science, The Hardware, Software and Heart of It, Springer, 2011, 169-229. URL: https://doi.org/10.1007/978-1-4614-1168-0_10
50 inproceedings Meghyn Bienvenu, Daniel Deutch, Davide Martinenghi, Pierre Senellart and Fabian M. Suchanek. Dealing with the Deep Web and all its Quirks. Proceedings of the Second International Workshop on Searching and Integrating New Web Data Sources, Istanbul, Turkey, August 31, 2012, 21-24. URL: http://ceur-ws.org/Vol-884/VLDS2012_p21_Bienvenu.pdf
51 inproceedings Mikołaj Bojańczyk, Luc Segoufin and Szymon Toruńczyk. Verification of database-driven systems via amalgamation. Proceedings of the 32nd ACM SIGMOD-SIGACT-SIGART Symposium on Principles of Database Systems, PODS 2013, New York, NY, USA - June 22 - 27, 2013, 63-74. URL: http://doi.acm.org/10.1145/2463664.2465228
52 inproceedings Peter Buneman, Sanjeev Khanna and Wang-Chiew Tan. Why and Where: A Characterization of Data Provenance. Database Theory - ICDT 2001, 8th International Conference, London, UK, January 4-6, 2001, Proceedings, 316-330. URL: https://doi.org/10.1007/3-540-44503-X_20
53 article Bruno Courcelle. The Monadic Second-Order Logic of Graphs. I. Recognizable Sets of Finite Graphs. Inf. Comput. 85(1), 1990, 12-75. URL: https://doi.org/10.1016/0890-5401(90)90043-H
54 article Nilesh N. Dalvi and Dan Suciu. The dichotomy of probabilistic inference for unions of conjunctive queries. J. ACM 59(6), 2012, 30:1-30:87. URL: http://doi.acm.org/10.1145/2395116.2395119
55 article Amol Deshpande, Zachary G. Ives and Vijayshankar Raman. Adaptive Query Processing. Foundations and Trends in Databases 1(1), 2007, 1-140. URL: https://doi.org/10.1561/1900000001
56 inproceedings Pinar Donmez and Jaime G. Carbonell. Proactive learning: cost-sensitive active learning with multiple imperfect oracles. Proceedings of the 17th ACM Conference on Information and Knowledge Management, CIKM 2008, Napa Valley, California, USA, October 26-30, 2008, 619-628. URL: http://doi.acm.org/10.1145/1458082.1458165
57 inproceedings Muhammad Faheem and Pierre Senellart. Adaptive Web Crawling Through Structure-Based Link Classification. Digital Libraries: Providing Quality Information - 17th International Conference on Asia-Pacific Digital Libraries, ICADL 2015, Seoul, Korea, December 9-12, 2015, Proceedings, 39-51. URL: https://doi.org/10.1007/978-3-319-27974-9_5
58 book Lise Getoor. Introduction to statistical relational learning. MIT Press, 2007.
59 inproceedings Georges Gouriten, Silviu Maniu and Pierre Senellart. Scalable, generic, and adaptive systems for focused crawling. 25th ACM Conference on Hypertext and Social Media, HT '14, Santiago, Chile, September 1-4, 2014, 35-45. URL: http://doi.acm.org/10.1145/2631775.2631795
60 inproceedings Todd J. Green, Gregory Karvounarakis and Val Tannen. Provenance semirings. Proceedings of the Twenty-Sixth ACM SIGACT-SIGMOD-SIGART Symposium on Principles of Database Systems, June 11-13, 2007, Beijing, China, 31-40. URL: http://doi.acm.org/10.1145/1265530.1265535
61 article Todd J. Green and Val Tannen. Models for Incomplete and Probabilistic Information. IEEE Data Eng. Bull. 29(1), 2006, 17-24. URL: http://sites.computer.org/debull/A06mar/green.ps
62 article Alon Y. Halevy. Answering queries using views: A survey. VLDB J. 10(4), 2001, 270-294. URL: https://doi.org/10.1007/s007780100054
63 article Marti A. Hearst, Susan T. Dumais, Edgar Osuna, John Platt and Bernhard Scholkopf. Support vector machines. IEEE Intelligent Systems 13(4), 1998, 18-28. URL: https://doi.org/10.1109/5254.708428
64 article Tomasz Imielinski and Witold Lipski Jr. Incomplete Information in Relational Databases. J. ACM 31(4), 1984, 761-791. URL: http://doi.acm.org/10.1145/1634.1886
65 article Florent Jacquemard, Luc Segoufin and Jérémie Dimino. FO2(<, +1, ~) on data trees, data tree automata and branching vector addition systems. Logical Methods in Computer Science 12(2), 2016. URL: https://doi.org/10.2168/LMCS-12(2:3)2016
66 incollection Benny Kimelfeld and Pierre Senellart. Probabilistic XML: Models and Complexity. Advances in Probabilistic Databases for Uncertain Information Management, Springer, 2013, 39-66. URL: https://doi.org/10.1007/978-3-642-37509-5_3
67 article Anthony C. Klug. Equivalence of Relational Algebra and Relational Calculus Query Languages Having Aggregate Functions. J. ACM 29(3), 1982, 699-717. URL: http://doi.acm.org/10.1145/322326.322332
68 article Donald Kossmann. The State of the art in distributed query processing. ACM Comput. Surv. 32(4), 2000, 422-469. URL: http://doi.acm.org/10.1145/371578.371598
69 inproceedings John D. Lafferty, Andrew McCallum and Fernando C. N. Pereira. Conditional Random Fields: Probabilistic Models for Segmenting and Labeling Sequence Data. Proceedings of the Eighteenth International Conference on Machine Learning (ICML 2001), Williams College, Williamstown, MA, USA, June 28 - July 1, 2001, 282-289.
70 inproceedings Siyu Lei, Silviu Maniu, Luyi Mo, Reynold Cheng and Pierre Senellart. Online Influence Maximization. Proceedings of the 21th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, Sydney, NSW, Australia, August 10-13, 2015, 645-654. URL: http://doi.acm.org/10.1145/2783258.2783271
71 article Frank Neven. Automata Theory for XML Researchers. SIGMOD Record 31(3), 2002, 39-46. URL: http://doi.acm.org/10.1145/601858.601869
72 book M. Tamer Özsu and Patrick Valduriez. Principles of Distributed Database Systems, Third Edition. Springer, 2011. URL: https://doi.org/10.1007/978-1-4419-8834-8
73 inproceedings Pierre Senellart, Avin Mittal, Daniel Muschick, Rémi Gilleron and Marc Tommasi. Automatic wrapper induction from hidden-web sources with domain knowledge. 10th ACM International Workshop on Web Information and Data Management (WIDM 2008), Napa Valley, California, USA, October 30, 2008, 9-16. URL: http://doi.acm.org/10.1145/1458502.1458505
74 book Burr Settles. Active Learning. Synthesis Lectures on Artificial Intelligence and Machine Learning, Morgan & Claypool Publishers, 2012. URL: https://doi.org/10.2200/S00429ED1V01Y201207AIM018
75 inproceedings Burr Settles, Mark Craven and Lewis Friedland. Active learning with real annotation costs. NIPS 2008 Workshop on Cost-Sensitive Learning, 2008. URL: http://burrsettles.com/pub/settles.nips08ws.pdf
76 article Fabian M. Suchanek, Serge Abiteboul and Pierre Senellart. PARIS: Probabilistic Alignment of Relations, Instances, and Schema. PVLDB 5(3), 2011, 157-168. URL: http://www.vldb.org/pvldb/vol5/p157_fabianmsuchanek_vldb2012.pdf
77 book Dan Suciu, Dan Olteanu, Christopher Ré and Christoph Koch. Probabilistic Databases. Synthesis Lectures on Data Management, Morgan & Claypool Publishers, 2011. URL: https://doi.org/10.2200/S00362ED1V01Y201105DTM016
78 book Richard S. Sutton and Andrew G. Barto. Reinforcement learning - an introduction. Adaptive computation and machine learning, MIT Press, 1998. URL: http://www.worldcat.org/oclc/37293240
79 inproceedings Moshe Y. Vardi. The Complexity of Relational Query Languages (Extended Abstract). Proceedings of the 14th Annual ACM Symposium on Theory of Computing, May 5-7, 1982, San Francisco, California, USA, 137-146. URL: http://doi.acm.org/10.1145/800070.802186
80 inproceedings Ke Zhou, Mounia Lalmas, Tetsuya Sakai, Ronan Cummins and Joemon M. Jose. On the reliability and intuitiveness of aggregated search metrics. 22nd ACM International Conference on Information and Knowledge Management, CIKM'13, San Francisco, CA, USA, October 27 - November 1, 2013, 689-698. URL: http://doi.acm.org/10.1145/2505515.2505691
