Monadic Datalog, Tree Validity, and Limited Access Containment

VALDA Value from Data

Data and Knowledge Representation and Processing

Perception, Cognition and Interaction

http://team.inria.fr/valda Département d'Informatique de l'Ecole Normale Supérieure Ecole normale supérieure de Paris, CNRS Creation of the Project-Team: 2018 January 01 Project-Team A3.1. - Data A3.1.1. - Modeling, representation A3.1.2. - Data management, quering and storage A3.1.3. - Distributed data A3.1.4. - Uncertain data A3.1.5. - Control access, privacy A3.1.6. - Query optimization A3.1.7. - Open data A3.1.8. - Big data (production, storage, transfer) A3.1.9. - Database A3.1.10. - Heterogeneous data A3.1.11. - Structured data A3.2. - Knowledge A3.2.1. - Knowledge bases A3.2.2. - Knowledge extraction, cleaning A3.2.3. - Inference A3.2.4. - Semantic Web A3.2.5. - Ontologies A3.2.6. - Linked data A3.3.2. - Data mining A3.4.3. - Reinforcement learning A3.4.5. - Bayesian methods A3.5.1. - Analysis of large graphs A4.7. - Access control A7.2. - Logic in Computer Science A7.3. - Calculability and computability A9.1. - Knowledge A9.8. - Reasoning B6.3.1. - Web B6.3.4. - Social Networks B6.5. - Information systems B9.5.6. - Data science B9.6.5. - Sociology B9.6.10. - Digital humanities B9.7.2. - Open data B9.9. - Ethics B9.10. - Privacy Inria Paris Center Serge Abiteboul Chercheur Inria, Emeritus oui Camille Bourgaux Chercheur CNRS, Researcher Luc Segoufin Chercheur Inria, Senior Researcher oui Michael Thomazo Chercheur Inria, Researcher Pierre Senellart Enseignant Team leader, ENS Paris, Professor oui Leonid Libkin Enseignant ENS Paris, Professor Cristina Sirangelo Enseignant Université Paris-Cité, Professor, from Feb 2022 until Jul 2022, oui Victor Vianu Enseignant ENS Paris, Professor, from Jul 2022 until Nov 2022, Nofar Carmeli PostDoc ENS Paris, until Sep 2022 Shufan Jiang PostDoc ENS Paris, from Dec 2022, Anantha Padmanabha PostDoc ENS Paris Anatole Dahan PhD Université Paris-Cité Baptiste Lafosse PhD ENS Paris Shrey Mishra PhD ENS Paris Yann Ramusat PhD ENS Paris, ATER, until Aug 2022 Alexandra Rogova PhD Université Paris-Cité N. Smith Technique ENS Paris, Engineer, from Feb 2022 Yacine Brihmouche Stagiaire Université Paris-Dauphine, Intern, from May 2022 until Sep 2022 Antoine Gauquier Stagiaire Télécom Paris, Intern, from Oct 2022, Siméon Gheorgin Stagiaire PSL, Intern, until Jun 2022, Meriem Guemair Assistant Inria Overall objectives Objectives

Valda's focus is on both foundational and systems aspects of complex data management, especially human-centric data. The data we are interested in is typically heterogeneous, massively distributed, rapidly evolving, intensional, and often subjective, possibly erroneous, imprecise, incomplete. In this setting, Valda is in particular concerned with the optimization of complex resources such as computer time and space, communication, monetary, and privacy budgets. The goal is to extract value from data, beyond simple query answering.

Data management 40, 49 is now an old, well-established field, for which many scientific results and techniques have been accumulated since the sixties. Originally, most works dealt with static, homogeneous, and precise data. Later, works were devoted to heterogeneous data 3841, and possibly distributed 72 but at a small scale.

However, these classical techniques are poorly adapted to handle the new challenges of data management. Consider human-centric data, which is either produced by humans, e.g., emails, chats, recommendations, or produced by systems when dealing with humans, e.g., geolocation, business transactions, results of data analysis. When dealing with such data, and to accomplish any task to extract value from such data, we rapidly encounter the following facets:

Heterogeneity: data may come in many different structures such as unstructured text, graphs, data streams, complex aggregates, etc., using many different schemas or ontologies.

Massive distribution: data may come from a large number of autonomous sources distributed over the web, with complex access patterns.

Rapid evolution: many sources may be producing data in real time, even if little of it is perhaps relevant to the specific application. Typically, recent data is of particular interest and changes have to be monitored.

Intensionality1: in a classical database, all the data is available. In modern applications, the data is more and more available only intensionally, possibly at some cost, with the difficulty to discover which source can contribute towards a particular goal, and this with some uncertainty.

Confidentiality and security: some personal data is critical and need to remain confidential. Applications manipulating personal data must take this into account and must be secure against linking.

Uncertainty: modern data, and in particular human-centric data, typically includes errors, contradictions, imprecision, incompleteness, which complicates reasoning. Furthermore, the subjective nature of the data, with opinions, sentiments, or biases, also makes reasoning harder since one has, for instance, to consider different agents with distinct, possibly contradicting knowledge.

These problems have already been studied individually and have led to techniques such as query rewriting62 or distributed query optimization68.

Among all these aspects, intensionality is perhaps the one that has least been studied, so we pay particular attention to it. Consider a user's query, taken in a very broad sense: it may be a classical database query, some information retrieval search, a clustering or classification task, or some more advanced knowledge extraction request. Because of intensionality of data, solving such a query is a typically dynamic task: each time new data is obtained, the partial knowledge a system has of the world is revised, and query plans need to be updated, as in adaptive query processing 55 or aggregated search 80. The system then needs to decide, based on this partial knowledge, of the best next access to perform. This is reminiscent of the central problem of reinforcement learning 78 (train an agent to accomplish a task in a partially known world based on rewards obtained) and of active learning 74 (decide which action to perform next in order to optimize a learning strategy) and we intend to explore this connection further.

Uncertainty of the data interacts with its intensionality: efforts are required to obtain more precise, more complete, sounder results, which yields a trade-off between processing cost and data quality.

Other aspects, such as heterogeneity and massive distribution, are of major importance as well. A standard data management task, such as query answering, information retrieval, or clustering, may become much more challenging when taking into account the fact that data is not available in a central location, or in a common format. We aim to take these aspects into account, to be able to apply our research to real-world applications.

The Issues

We intend to tackle hard technical issues such as query answering, data integration, data monitoring, verification of data-centric systems, truth finding, knowledge extraction, data analytics, that take a different flavor in this modern context. In particular, we are interested in designing strategies to minimize data access cost towards a specific goal, possibly a massive data analysis task. That cost may be in terms of communication (accessing data in distributed systems, on the Web), of computational resources (when data is produced by complex tools such as information extraction, machine learning systems, or complex query processing), of monetary budget (paid-for application programming interfaces, crowdsourcing platforms), or of a privacy budget (as in the standard framework of differential privacy).

A number of data management tasks in Valda are inherently intractable. In addition to properly characterizing this intractability in terms of complexity theory, we intend to develop solutions for solving these tasks in practice, based on approximation strategies, randomized algorithms, enumeration algorithms with constant delay, or identification of restricted forms of data instances lowering the complexity of the task.

Research program Scientific Foundations

We now detail some of the scientific foundations of our research on complex data management. This is the occasion to review connections between data management, especially on complex data as is the focus of Valda, with related research areas.

Complexity & Logic

Data management has been connected to logic since the advent of the relational model as main representation system for real-world data, and of first-order logic as the logical core of database querying languages 40. Since these early developments, logic has also been successfully used to capture a large variety of query modes, such as data aggregation 67, recursive queries (Datalog), or querying of XML databases 49. Logical formalisms facilitate reasoning about the expressiveness of a query language or about its complexity.

The main problem of interest in data management is that of query evaluation, i.e., computing the results of a query over a database. The complexity of this problem has far-reaching consequences. For example, it is because first-order logic is in the ${AC}_{0}$ complexity class that evaluation of SQL queries can be parallelized efficiently. It is usual 79 in data management to distinguish data complexity, where the query is considered to be fixed, from combined complexity, where both the query and the data are considered to be part of the input. Thus, though conjunctive queries, corresponding to a simple SELECT-FROM-WHERE fragment of SQL, have PTIME data complexity, they are NP-hard in combined complexity. Making this distinction is important, because data is often far larger (up to the order of terabytes) than queries (rarely more than a few hundred bytes). Beyond simple query evaluation, a central question in data management remains that of complexity; tools from algorithm analysis, and complexity theory can be used to pinpoint the tractability frontier of data management tasks.

Automata Theory

Automata theory and formal languages arise as important components of the study of many data management tasks: in temporal databases 39, queries, expressed in temporal logics, can often by compiled to automata; in graph databases 45, queries are naturally given as automata; typical query and schema languages for XML databases such as XPath and XML Schema can be compiled to tree automata 71, or for more complex languages to data tree automata 65. Another reason of the importance of automata theory, and tree automata in particular, comes from Courcelle's results 53 that show that very expressive queries (from the language of monadic second-order language) can be evaluated as tree automata over tree decompositions of the original databases, yielding linear-time algorithms (in data complexity) for a wide variety of applications.

Verification

Complex data management also has connections to verification and static analysis. Besides query evaluation, a central problem in data management is that of deciding whether two queries are equivalent40. This is critical for query optimization, in order to determine if the rewriting of a query, maybe cheaper to evaluate, will return the same result as the original query. Equivalence can easily be seen to be an instance of the problem of (non-)satisfiability: $q \equiv q^{'}$ if and only if $(q \land \neg q^{'}) \lor (\neg q \land q^{'})$ is not satisfiable. In other words, some aspects of query optimization are static analysis issues. Verification is also a critical part of any database application where it is important to ensure that some property will never (or always) arise 51.

Workflows

The orchestration of distributed activities (under the responsibility of a conductor) and their choreography (when they are fully autonomous) are complex issues that are essential for a wide range of data management applications including notably, e-commerce systems, business processes, health-care and scientific workflows. The difficulty is to guarantee consistency or more generally, quality of service, and to statically verify critical properties of the system. Different approaches to workflow specifications exist: automata-based, logic-based, or predicate-based control of function calls 37.

Probability & Provenance

To deal with the uncertainty attached to data, proper models need to be used (such as attaching provenance information to data items and viewing the whole database as being probabilistic) and practical methods and systems need to be developed to both reliably estimate the uncertainty in data items and properly manage provenance and uncertainty information throughout a long, complex system.

The simplest model of data uncertainty is the NULLs of SQL databases, also called Codd tables 40. This representation system is too basic for any complex task, and has the major inconvenient of not being closed under even simple queries or updates. A solution to this has been proposed in the form of conditional tables64 where every tuple is annotated with a Boolean formula over independent Boolean random events. This model has been recognized as foundational and extended in two different directions: to more expressive models of provenance than what Boolean functions capture, through a semiring formalism 60, and to a probabilistic formalism by assigning independent probabilities to the Boolean events 61. These two extensions form the basis of modern provenance and probability management, subsuming in a large way previous works 52, 46. Research in the past ten years has focused on a better understanding of the tractability of query answering with provenance and probabilistic annotations, in a variety of specializations of this framework 7766, 43.

Machine Learning

Statistical machine learning, and its applications to data mining and data analytics, is a major foundation of data management research. A large variety of research areas in complex data management, such as wrapper induction 73, crowdsourcing 44, focused crawling 59, or automatic database tuning 47 critically rely on machine learning techniques, such as classification 63, probabilistic models 58, or reinforcement learning 78.

Machine learning is also a rich source of complex data management problems: thus, the probabilities produced by a conditional random field 69 system result in probabilistic annotations that need to be properly modeled, stored, and queried.

Finally, complex data management also brings new twists to some classical machine learning problems. Consider for instance the area of active learning74, a subfield of machine learning concerned with how to optimally use a (costly) oracle, in an interactive manner, to label training data that will be used to build a learning model, e.g., a classifier. In most of the active learning literature, the cost model is very basic (uniform or fixed-value costs), though some works 75 consider more realistic costs. Also, oracles are usually assumed to be perfect with only a few exceptions 56. These assumptions usually break when applied to complex data management problems on real-world data, such as crowdsourcing.

Research Directions

At the beginning of the Valda team, the project was to focus on the following directions:

foundational aspects of data management, in particular related to query enumeration and reasoning on data, especially regarding security issues;

implementation of provenance and uncertainty management, real-world applications, other aspects of uncertainty and incompleteness, in particular dynamic;

development of personal information management systems, integration of machine learning techniques.

We believe the first two directions have been followed in a satisfactory manner. The focus on personal information management has not been kept for various organizational reasons, however, but the third axis of the project is reoriented to more general aspects of Web data management.

New permanent arrivals in the group since its creation have impacted its research directions in the following manner:

Camille Bourgaux and Michaël Thomazo are both specialists of knowledge representation and formal aspects of knowledge bases, which is an expertise that did not exist in the group. They are also both interested in, and have started working on aspects related to connecting their research with database theory, and investigating aspects of uncertainty and incompleteness in their research. This will lead to more work on knowledge representation and symbolic AI aspects, while keeping the focus of Valda on foundations of data management and uncertainty.

Leonid Libkin is a specialist of database theory, of incomplete data management, and has a line of current research on graph data management. His profile fits very well with the original orientation of the Valda project.

We intend to keep producing leading research on the foundations of data management. Generally speaking, the goal is to investigate the borders of feasibility of various tasks. For instance, what are the assumptions on data that allow for computable problems? When is it not possible at all? When can we hope for efficient query answering, when is it hopeless? This is a problem of theoretical nature which is necessary for understanding the limit of the methods and driving research towards the scenarios where positive results may be obtainable. Only when we have understood the limitation of different methods and have many examples where this is possible, we can hope to design a solid foundation that allowing for a good trade-off between what can be done (needs from the users) and what can be achieved (limitation from the system).

Similarly, we will continue our work, both foundational and practical, on various aspects of provenance and uncertainty management. One overall long-term goal is to reach a full understanding of the interactions between query evaluation or other broader data management tasks and uncertain and annotated data models. We would in particular want to go towards a full classification of tractable (typically polynomial-time) and intractable (typically NP-hard for decision problems, or #P-hard for probability evaluation) tasks, extending and connecting the query-based dichotomy 54 on probabilistic query evaluation with the instance-based one of 42, 43. Another long-term goal is to consider more dynamic scenarios than what has been considered so far in the uncertain data management literature: when following a workflow, or when interacting with intensional data sources, how to properly represent and update uncertainty annotations that are associated with data. This is critical for many complex data management scenarios where one has to maintain a probabilistic current knowledge of the world, while obtaining new knowledge by posing queries and accessing data sources. Such intensional tasks requires minimizing jointly data uncertainty and cost to data access.

As application area, in addition to the historical focus on personal information management which is now less stressed, we target Web data (Web pages, the semantic Web, social networks, the deep Web, crowdsourcing platforms, etc.).

We aim at keeping a delicate balance between theoretical, foundational research, and systems research, including development and implementation. This is a difficult balance to find, especially since most Valda researchers have a tendency to favor theoretical work, but we believe it is also one of the strengths of the team.

Application domains Personal Information Management Systems

We recall that Valda's focus is on human-centric data, i.e., data produced by humans, explicitly or implicitly, or more generally containing information about humans. Quite naturally, we have used as a privileged application area to validate Valda’s results that of personal information management systems (Pims for short) 36.

A Pims is a system that allows a user to integrate her own data, e.g., emails and other kinds of messages, calendar, contacts, web search, social network, travel information, work projects, etc. Such information is commonly spread across different services. The goal is to give back to a user the control on her information, allowing her to formulate queries such as “What kind of interaction did I have recently with Alice B.?”, “Where were my last ten business trips, and who helped me plan them?”. The system has to orchestrate queries to the various services (which means knowing the existence of these services, and how to interact with them), integrate information from them (which means having data models for this information and its representation in the services), e.g., align a GPS location of the user to a business address or place mentioned in an email, or an event in a calendar to some event in a Web search. This information must be accessed intensionally: for instance, costly information extraction tools should only be run on emails which seem relevant, perhaps identified by a less costly cursory analysis (this means, in turn, obtaining a cost model for access to the different services). Impacted people can be found by examining events in the user's calendar and determining who is likely to attend them, perhaps based on email exchanges or former events' participant lists. Of course, uncertainty has to be maintained along the entire process, and provenance information is needed to explain query results to the user (e.g., indicate which meetings and trips are relevant to each person of the output). Knowledge about services, their data models, their costs, need either to be provided by the system designer, or to be automatically learned from interaction with these services, as in 73.

One motivation for that choice is that Pims concentrate many of the problems we intend to investigate: heterogeneity (various sources, each with a different structure), massive distribution (information spread out over the Web, in numerous sources), rapid evolution (new data regularly added), intensionality (knowledge from Wikidata, OpenStreetMap...), confidentiality and security (mostly private data), and uncertainty (very variable quality). Though the data is distributed, its size is relatively modest; other applications may be considered for works focusing on processing data at large scale, which is a potential research direction within Valda, though not our main focus. Another strong motivation for the choice of Pims as application domain is the importance of this application from a societal viewpoint.

A Pims is essentially a system built on top of a user's personal knowledge base; such knowledge bases are reminiscent of those found in the Semantic Web, e.g., linked open data. Some issues, such as ontology alignment 76 exist in both scenarios. However, there are some fundamental differences in building personal knowledge bases vs collecting information from the Semantic Web: first, the scope is quite smaller, as one is only interested in knowledge related to a given individual; second, a small proportion of the data is already present in the form of semantic information, most needs to be extracted and annotated through appropriate wrappers and enrichers; third, though the linked open data is meant to be read-only, the only update possible to a user being adding new triples, a personal knowledge base is very much something that a user needs to be able to edit, and propagating updates from the knowledge base to original data sources is a challenge in itself.

Web Data

The choice of Pims is not exclusive. We also consider other application areas as well. In particular, we have worked in the past and have a strong expertise on Web data 41 in a broad sense: semi-structured, structured, or unstructured content extracted from Web databases 73; knowledge bases from the Semantic Web 76; social networks 70; Web archives and Web crawls 57; Web applications and deep Web databases 50; crowdsourcing platforms 44. We intend to continue using Web data as a natural application domain for the research within Valda when relevant. For instance 48, deep Web databases are a natural application scenario for intensional data management issues: determining if a deep Web database contains some information requires optimizing the number of costly requests to that database.

A common aspect of both personal information and Web data is that their exploitation raises ethical considerations. Thus, a user needs to remain fully in control of the usage that is made of her personal information; a search engine or recommender system that ranks Web content for display to a specific user needs to do so in an unbiased, justifiable, manner. These ethical constraints sometimes forbid some technically solutions that may be technically useful, such as sharing a model learned from the personal data of a user to another user, or using blackboxes to rank query result. We fully intend to consider these ethical considerations within Valda. One of the main goals of a Pims is indeed to empower the user with a full control on the use of this data.

Social and environmental responsibility

Data-driven algorithmic systems raise ethical and legal concerns, that need to be taken into account within research. Serge Abiteboul, with collaborators from NYU, U. Washington, U. Michigan, U. Amsterdam, wrote a position article detailing the role that data management research needs to play in ensuring responsible design and use of algorithmic data-driven systems. 17

Highlights of the year Awards

Michaël Thomazo, together with Maxime Buron and Marie-Laure Mugnier, received the BDA (French database community) award for their work on Parallelisable Existential Rules: a Story of Pieces31, also published at KR 2021 31

Broader Inria Context

The work of the Valda team in 2022 was affected by several issues within Inria; in particular major issues with the deployment of a new information system (Eksae) negatively impacted the work of our administrative assistant and made it impossible for the team leader to keep track of expenses.

The team also would like to thank the Inria evaluation committee for its admirable work in support of the research community, for its transparency, and for the integrity in which it conducts its activities.

New software and platforms New software ORBITS Name:

Optimal Repair-Based Inconsistency-Tolerant Semantics

Keywords:

Knowledge Bases, Databases

Scientific Description:

ORBITS (Optimal Repair-Based Inconsistency-Tolerant Semantics) is a tool for filtering answers that hold under a given inconsistency-tolerant semantics among AR, IAR and brave with standard repairs or Pareto- or completion-optimal repairs in the case where a priority relation between the conflicting facts is given. ORBITS implements a variety of algorithms and propositional encoding variants for each semantics and type of repairs.

Functional Description:

ORBITS is a tool for filtering answers that hold under a given inconsistency-tolerant semantics based on some kind of optimal repairs in the case where a priority relation between the conflicting facts is given.

URL:

https://github.com/bourgaux/orbits

Publication:

hal-03770516

Contact:

Camille Bourgaux

Participant:

Camille Bourgaux

ProvSQL Keywords:

Databases, Provenance, Probability

Functional Description:

The goal of the ProvSQL project is to add support for (m-)semiring provenance and uncertainty management to PostgreSQL databases, in the form of a PostgreSQL extension/module/plugin.

News of the Year:

Support for PostgreSQL 15. Miscellaneous enhancements and bug fixes.

URL:

https://github.com/PierreSenellart/provsql

Publications:

hal-01672566, hal-01851538

Contact:

Pierre Senellart

Participants:

Pierre Senellart, Baptiste Lafosse

TheoremKB Keyword:

Information extraction

Functional Description:

TheoremKB is a collection of tools to extract semantic information from (mathematical) research articles.

News of the Year:

Improvements to theorem extraction, preliminary work on multimodal approach.

URL:

https://github.com/PierreSenellart/theoremkb

Publications:

hal-02956526, hal-02940819, hal-03293643, hal-03897168

Contact:

Pierre Senellart

Participants:

Pierre Senellart, Shrey Mishra, Yacine Brihmouche

apxproof Keyword:

LaTeX

Functional Description:

apxproof is a LaTeX package facilitating the typesetting of research articles with proofs in appendix, a common practice in database theory and theoretical computer science in general. The appendix material is written in the LaTeX code along with the main text which it naturally complements, and it is automatically deferred. The package can automatically send proofs to the appendix, can repeat in the appendix the theorem environments stated in the main text, can section the appendix automatically based on the sectioning of the main text, and supports a separate bibliography for the appendix material.

Release Contributions:

Support for lipcs's claimproof, support optional arguments in proofs

News of the Year:

1.2.4 release: support for claimproof environment from lipics, support optional arguments in proofs.

URL:

https://github.com/PierreSenellart/apxproof

Contact:

Pierre Senellart

Participant:

Pierre Senellart

dissem.in Name:

Dissemin

Keywords:

Open Access, Publishing, HAL

Functional Description:

Dissemin is a web platform gathering metadata from many sources to analyze the open-access full text availability of publications of researchers. It has been designed to foster the use of repositories such as HAL (rather than preprints posted on personal homepages). It allows deposit on these repositories.

News of the Year:

Support for a large variety of IdP from Shibboleth. Various Shibboleth fixes. Support for v2 of Sherpa/Romeo API. Other small improvements, bug fixes, maintainance.

URL:

https://gitlab.com/dissemin/dissemin

Contact:

Pierre Senellart

Participant:

Pierre Senellart

Partner:

CAPSH

New platforms dissem.in

dissem.in, the openly accessible platform for promoting full-text deposit of scientific articles of researchers, which is based on the dissem.in (7.1.5) software, has been maintained by Valda since 2021. Works on the platform in 2022, in addition to works on the base software, include updating information about journals and publisher policies from the Sherpa/Romeo API.

PierreSenellartN.Smith New results

We present the results we obtained and published in 2022. Much research within Valda centers around the central problem of query answering in databases, but exploring various side questions: How to handle incomplete or inconsistent information? How to efficiently access query results when there are many of them? How to incorporate external ontologies within query answering? How to keep track of the provenance of queries? We describe our works in each of these areas in turn, and finish with other theoretical research conducted in the team, beyond data management.

Incomplete and inconsistent information

We first consider databases containing incomplete (missing) or inconsistent (contradictory) information.

One of the most common scenarios of handling incomplete information occurs in relational databases. They describe incomplete knowledge with three truth values, using Kleene’s logic for propositional formulae and a rather peculiar extension to predicate calculus. This design by a committee from several decades ago is now part of the standard adopted by vendors of database management systems. But is it really the right way to handle incompleteness in propositional and predicate logics? Our goal in 13 is to answer this question. Using an epistemic approach, we first characterize possible levels of partial knowledge about propositions, which leads to six truth values. We impose rationality conditions on the semantics of the connectives of the propositional logic, and prove that Kleene’s logic is the maximal sublogic to which the standard optimization rules apply, thereby justifying this design choice. For extensions to predicate logic, however, we show that the additional truth values are not necessary: every many-valued extension of first-order logic over databases with incomplete information represented by null values is no more powerful than the usual two-valued logic with the standard Boolean interpretation of the connectives. We use this observation to analyze the logic underlying SQL query evaluation, and conclude that the many-valued extension for handling incompleteness does not add any expressiveness to it.

We continue on the topic of incomplete information in 18, where our goal is to collect and analyze the shortcomings of nulls and their treatment by SQL, and to re-evaluate existing research in this light. To this end, we designed and conducted a survey on the everyday usage of null values among database users. From the analysis of the results we reached two main conclusions. First, null values are ubiquitous and relevant in real-life scenarios, but SQL's features designed to deal with them cause multiple problems. The severity of these problems varies depending on the SQL features used, and they cannot be reduced to a single issue. Second, foundational research on nulls is misdirected and has been addressing problems of limited practical relevance. We urge the community to view the results of this survey as a way to broaden the spectrum of their researches and further bridge the theory-practice gap on null values.

To answer database queries over incomplete data the gold standard is finding certain answers: those that are true regardless of how incomplete data is interpreted. Such answers can be found efficiently for conjunctive queries and their unions, even in the presence of constraints such as keys or functional dependencies. With negation added, the complexity of finding certain answers becomes intractable however. In 28 we exhibit a well-behaved class of queries that extends unions of conjunctive queries with a limited form of negation and that permits efficient computation of certain answers even in the presence of constraints by means of rewriting into Datalog with negation. The class consists of queries that are the closure of conjunctive queries under Boolean operations of union, intersection and difference. We show that for these queries, certain answers can be expressed in Datalog with negation, even in the presence of functional dependencies, thus making them tractable in data complexity. We show that in general Datalog cannot be replaced by first-order logic, but without constraints such a rewriting can be done in first-order.

While all relational database systems are based on the bag data model, much of theoretical research still views relations as sets. Recent attempts to provide theoretical foundations for modern data management problems under the bag semantics concentrated on applications that need to deal with incomplete relations, i.e., relations populated by constants and nulls. Our goal in 12 is to provide a complete characterization of the complexity of query answering over such relations in fragments of bag relational algebra. The main challenges that we face are twofold. First, bag relational algebra has more operations than its set analog (e.g., additive union, max-union, min-intersection, duplicate elimination) and the relationship between various fragments is not fully known. Thus we first fill this gap. Second, we look at query answering over incomplete data, which again is more complex than in the set case: rather than certainty and possibility of answers, we now have numerical information about occurrences of tuples. We then fully classify the complexity of finding this information in all the fragments of bag relational algebra.

Finally, we turn to inconsistent data. In 19, 20, we investigate practical algorithms for inconsistency-tolerant query answering over prioritized knowledge bases, which consist of a logical theory, a set of facts, and a priority relation between conflicting facts. We consider three well-known semantics (AR, IAR and brave) based upon two notions of optimal repairs (Pareto and completion). Deciding whether a query answer holds under these semantics is (co)NP-complete in data complexity for a large class of logical theories, and SAT-based procedures have been devised for repair-based semantics when there is no priority relation, or the relation has a special structure. We introduce the first SAT encodings for Pareto- and completion-optimal repairs w.r.t. general priority relations and proposes several ways of employing existing and new encodings to compute answers under (optimal) repair-based semantics, by exploiting different reasoning modes of SAT solvers. The comprehensive experimental evaluation of our implementation compares both (i) the impact of adopting semantics based on different kinds of repairs, and (ii) the relative performances of alternative procedures for the same semantics.

Enumeration and direct access to query results

Many queries have as output sets of results which are too big to be generated at once. Two strategies can then be used: either to design algorithms for efficient enumeration of the query results, one after the other, or for efficient direct access to one specific result among the set of results.

In 16, we consider the evaluation of first-order queries over classes of databases that are nowhere dense. The notion of nowhere-dense classes was introduced by Nešetřil and Ossona de Mendez as a formalization of classes of “sparse” graphs and generalizes many well-known classes of graphs, such as classes of bounded degree, bounded tree-width, or bounded expansion. It has recently been shown by Grohe, Kreutzer, and Siebertz that over nowhere-dense classes of databases, first-order sentences can be evaluated in pseudo-linear time (pseudo-linear time means that for all $ϵ$ there exists an algorithm working in time $O (n^{1 + ϵ})$ , where $n$ is the size of the database). For first-order queries of higher arities, we show that over any nowhere dense class of databases, the set of their solutions can be enumerated with constant delay after a pseudo-linear time preprocessing. In the same context, we also show that after a pseudo-linear time preprocessing we can, on input of a tuple, test in constant time whether it is a solution to the query.

A class of relational databases has low degree if for all $δ > 0$ , all but finitely many databases in the class have degree at most $n^{δ}$ , where $n$ is the size of the database. Typical examples are databases of bounded degree or of degree bounded by $log n$ . It is known that over a class of databases having low degree, first-order boolean queries can be checked in pseudo-linear time, i.e. for all $ϵ > 0$ in time bounded by $n^{1 + ϵ}$ . We generalize this result in 14 by considering query evaluation. We show that counting the number of answers to a query can be done in pseudo-linear time and that after a pseudo-linear time preprocessing we can test in constant time whether a given tuple is a solution to a query or enumerate the answers to a query with constant delay.

Finally, we consider in 25 the task of lexicographic direct access to query answers. That is, we want to simulate an array containing the answers of a join query sorted in a lexicographic order chosen by the user. A recent dichotomy showed for which queries and orders this task can be done in polylogarithmic access time after quasilinear preprocessing, but this dichotomy does not tell us how much time is required in the cases classiﬁed as hard. We determine the pre-processing time needed to achieve polylogarithmic access time for all self-join free queries and all lexicographical orders. To this end, we propose a decomposition-based general algorithm for direct access on join queries. We then explore its optimality by proving lower bounds for the preprocessing time based on the hardness of a certain online Set-Disjointness problem, which shows that our algorithm’s bounds are tight for all lexicographic orders on self-join free queries. Then, we prove the hardness of Set-Disjointness based on the Zero-Clique Conjecture which is an established conjecture from ﬁne-grained complexity theory. We also show that similar techniques can be used to prove that, for enumerating answers to Loomis-Whitney joins, it is not possible to signiﬁcantly improve upon trivially computing all answers at preprocessing. This, in turn, gives further evidence (based on the Zero-Clique Conjecture) to the enumeration hardness of self-join free cyclic joins with re-spect to linear preprocessing and constant delay.

Ontology-mediated query answering

We know consider cases where to answer a query, we need to take into account external knowledge given in the form of a logical ontology (e.g., described in description logics, or through existential rules).

While ontology-mediated query answering most often adopts (unions of) conjunctive queries as the query language, some recent works have explored the use of counting queries coupled with DL-Lite ontologies. The aim of 22, 21 is to extend the study of counting queries to Horn description logics outside the DL-Lite family. Through a combination of novel techniques, adaptations of existing constructions, and new connections to closed predicates, we achieve a complete picture of the data and combined complexity of answering counting conjunctive queries (CCQs) and cardinality queries (a restricted class of CCQs) in ${ℰℒℋℐ}_{⊥}$ and its various sublogics. Notably, we show that CCQ answering is 2EXP-complete in combined complexity for ${ℰℒℋℐ}_{⊥}$ and every sublogic that extends EL or DL-Lite $_{pos}^{ℋ}$ . Our study not only provides the first results for counting queries beyond DL-Lite, but it also closes some open questions about the combined complexity of CCQ answering in DL-Lite.

Existential rules are a very popular ontology-mediated query language for which the chase represents a generic computational approach for query answering. It is straightforward that existential rule queries exhibiting chase termination are decidable and can only recognize properties that are preserved under homomorphisms. 24 is an extended abstract of our eponymous publication at KR 2021 where we show the converse: every decidable query that is closed under homomorphism can be expressed by an existential rule set for which the standard chase universally terminates. Membership in this fragment is not decidable, but we show via a diagonalisation argument that this is unavoidable.

In the literature, existential rules are often supposed to be in some normal form that simplifies technical developments. For instance, a common assumption is that rule heads are atomic, i.e., restricted to a single atom. Such assumptions are considered to be made without loss of generality as long as all sets of rules can be normalised while preserving entailment. However, an important question is whether the properties that ensure the decidability of reasoning are preserved as well. We provide in 26 a systematic study of the impact of these procedures on the different chase variants with respect to chase (non-)termination and FO-rewritability. This also leads us to study open problems related to chase termination of independent interest.

Provenance for recursive queries

Data provenance consists in bookkeeping meta information during query evaluation, in order to enrich query results with their trust level, likelihood, evaluation cost, and more. The framework of semiring provenance abstracts from the specific kind of meta information that annotates the data.

While the definition of semiring provenance is uncontroversial for unions of conjunctive queries, the picture is less clear for Datalog. Indeed, the original definition might include infinite computations, and is not consistent with other proposals for Datalog semantics over annotated data. In 23, we propose and investigate several provenance semantics, based on different approaches for defining classical Datalog semantics. We study the relationship between these semantics, and introduce properties that allow us to analyze and compare them.

In 30, 33, we establish a translation between a formalism for dynamic programming over hypergraphs and the computation of semiring-based provenance for Datalog programs. The benefit of this translation is a new method for computing the provenance of Datalog programs for specific classes of semirings, which we apply to provenance-aware querying of graph databases. Theoretical results and practical optimizations lead to an efficient implementation using Soufflé, a state-of-the-art Datalog interpreter. Experimental results on real-world data suggest this approach to be efficient in practical contexts, competing with dedicated solutions for graphs.

Theoretical computer science beyond databases

Valda's research has always encompassed other foundational topics. We conclude with the description of other theoretical computer science works (namely, in algebraic automata theory and logic), which does not fit within the previous areas of research.

The program-over-monoid model of computation originates with Barrington's proof that the model captures the complexity class NC $^{1}$ . In 15 we make progress in understanding the subtleties of the model. First, we identify a new tameness condition on a class of monoids that entails a natural characterization of the regular languages recognizable by programs over monoids from the class. Second, we prove that the class known as DA satisfies tameness and hence that the regular languages recognized by programs over monoids in DA are precisely those recognizable in the classical sense by morphisms from QDA. Third, we show by contrast that the well studied class of monoids called J is not tame. Finally, we exhibit a program-length-based hierarchy within the class of languages recognized by programs over monoids from DA.

When we bundle quantifiers and modalities together (as in $\exists x □$ , $◊ \forall x$ etc.) in first-order modal logic (FOML), we get new logical operators whose combinations produce interesting bundled fragments of FOML. It is well-known that finding decidable fragments of FOML is hard, but existing work shows that certain bundled fragments are decidable, without any restriction on the arity of predicates, the number of variables, or the modal scope. In 29, we explore generalized bundles such as $\forall x \forall y □$ , $\forall x \exists y ◊$ etc., and map the terrain with regard to decidability, presenting both decidability and undecidability results. In particular, we propose the loosely bundled fragment, which is decidable over increasing domains and encompasses all known decidable bundled fragments.

Bilateral contracts and grants with industry Standardization activities

Leonid Libkin is involved in the standardization process of the GQL and SQL query languages. In particular, he is a chair of the LDBC working group on semantics of GQL, and a member of ISO/IEC JTC1 SC32 WG3 (SQL committee). He is also a member of INCITS, the US InterNational Committee for Information Technology Standards.

As part of this standardization effort, 27 presents the key elements of the graph pattern matching language at the core of both SQL/PGQ and GQL, in advance of the publication of the corresponding new standards.

LeonidLibkin Partnerships and cooperations International initiatives Associate Teams in the framework of an Inria International Lab or in the framework of an Inria International Program GQA Title:

Languages for Graph Querying and Analytics

Duration:

2022 ->

Coordinator:

Pablo Barceló (pbarcelo@dcc.uchile.cl)

Partners:

Pontificia Universidad Católica de Chile Santiago (Chili)

Inria contact:

LeonidLibkin

Summary:

The project brings together experts in graph databases, in particular in the new generation of query languages currently standardized by the ISO. The history of collaboration between the two groups goes back many years and pre-dates our current collaboration on graph data; having started in the areas of tree-structured data and data interoperability. Our main objective is to combine the graph query languages expertise of the Inria group with the machine learning and graphs analytics expertise of the Chilean group to come up with a new generation of query languages that seamlessly integrate graph querying with analytics.

Participation in other International Programs DesCartes

(2021–2026) is a project managed by CNRS@CREATE, a CNRS subsidiary in Singapore and funded by Singapore's National Research Foundation, with 50 million total budget. Pierre Senellart is involved in the project as one of the French PIs.

Informal international partners

Valda has strong collaborations with the following international groups:

Univ. Edinburgh, United Kingdom:

Paolo Guagliardo, Andreas Pieris

Univ. Oxford, United Kingdom:

Michael Benedikt and Georg Gottlob

TU Dresden, Germany:

Markus Krötzsch and Sebastian Rudolph

Dortmund University, Germany:

Thomas Schwentick

Bayreuth University, Germany:

Wim Martens

Univ. Bergen, Norway:

Ana Ozaki

Univ. Roma La Sapienza, Italy:

Marco Console

Warsaw University, Poland:

Mikołaj Bojańczyk and Szymon Toruńczyk

Tel Aviv University, Israel:

Daniel Deutch and Tova Milo

NYU, USA:

Julia Stoyanovich

Univ. California San Diego, USA:

Victor Vianu

Pontifical Catholic University of Chile:

Marcelo Arenas, Pablo Barceló

National University of Singapore:

Stéphane Bressan

International research visitors Visits of international scientists Visits of international scientists

Victor Vianu, Professor at UCSD, visited the group during several months in 2022. He was also hired on a fixed-term contract by ENS.

Yael Amsterdamer, Senior Lecturer at Bar-Ilan University & Daniel Deutch, Professor at Tel Aviv University jointly visited Valda in July 2022.

Dan Suciu, Professor at University of Washington, visited Valda in November 2022.

European initiatives Other european programs/initiatives

A bilateral French–German ANR project, entitled EQUUS – Efficient Query answering Under UpdateS started in 2020. It involves CNRS (CRIL, CRIStAL, IMJ), Télécom Paris, HU Berlin, and Bayreuth University, in addition to Inria Valda.

National initiatives ANR

Valda has been part of three national ANR projects in 2022:

CQFD

(2018–2024; 19 k€ for Valda, budget managed by Inria), with Inria Sophia (GraphIK, coordinator), LaBRI, LIG, Inria Saclay (Cedar), IRISA, Inria Lille (Spirals), and Télécom ParisTech, on complex ontological queries over federated and heterogeneous data.

QUID

(2018–2024; 49 k€ for Valda, budget managed by Inria), LIGM (coordinator), IRIF, and LaBRI, on incomplete and inconsistent data.

VERIGRAPH

(2022–2026; 150 k€ for Valda (coordinator), budget managed by ENS), LIG, and LIRIS, on verifiable graph queries and transformations

Camille Bourgaux has been participating in the AI Chair of Meghyn Bienvenu on INTENDED (Intelligent handling of imperfect data) since 2020.

Pierre Senellart has held a chair within the PR[AI]RIE institute for artificial intelligence in Paris since 2019.

Others Dissemin

(2021–2024; 124 k€ for Valda, budget managed by ENS), sole partner, on the development of the dissem.in platform for open science promotion. Funded by the Fonds National Science Ouverte.

Dissemination Promoting scientific activities Scientific events: organisation General chair, scientific chair

Leonid Libkin, general chair of PODS 2022; chair (until July 2022) and now member of the PODS Executive Committee

Member of the organizing committees

Leonid Libkin, member of the LICS Steering Committee

Luc Segoufin, member of the steering committee of the conference series Highlights of Logic, Games and Automata

Scientific events: selection Chair of conference program committees

Camille Bourgaux, program co-chair of the Artificial Intelligence in Bergen research school, AIB 2022

Member of the conference program committees

Camille Bourgaux, AAAI 2023, IJCAI-ECAI 2022, KR 2022, DL 2022

Nofar Carmeli, PODS 2023

Leonid Libkin, KR 2022 (area chair), KR 2023, The Web Conference 2023 (industry track)

Pierre Senellart, BDA 2022, SIGMOD 2023

Michaël Thomazo, IJCAI 2022, KR 2022

Journal Member of the editorial boards

Leonid Libkin, Acta Informatica

Leonid Libkin, Bulletin of Symbolic Logic

Luc Segoufin, ACM Transactions on Computational Logics

Invited talks

Nofar Carmeli, Invited Tutorial at ICDT 2022 on Answering Unions of Conjunctive Queries with Ideal Time Guarantees

Leonid Libkin, Invited Talk at KR 2022 on Graph queries: do we study what matters?

Leonid Libkin, Invited Talk at the Workshop on Finite Model Theory and Many-Valued Logics

Leadership within the scientific community

Serge Abiteboul is a member of the French Academy of Sciences, of the Academia Europaea, of the scientific council of the Société Informatique de France, and an ACM Fellow.

Leonid Libkin is a Fellow of the Royal Society of Edinburgh, a member of the Academia Europaea, of the UK Computing research committee, and an ACM Fellow.

Pierre Senellart is a junior member of the Institut Universitaire de France.

Research administration

Luc Segoufin is a member of the CNHSCT of Inria.

Pierre Senellart is the president of section 6 of the National Committee for Scientific Research.

Pierre Senellart is a member of the board of the conference of presidents of the national committee (CPCN) and as such a member of the coordination of managing parties of the national committee (C3N).

Pierre Senellart is deputy director of the DI ENS laboratory, joint between ENS, CNRS, and Inria.

Teaching - Supervision - Juries Teaching

Licence: Algorithms, L2, CPES, PSL – Pierre Senellart

Licence: Practical Computing, L3, École normale supérieure – Pierre Senellart

Licence: Formal Languages, Computability, Complexity, L3, École normale supérieure – Michaël Thomazo, Yann Ramusat

Licence: Databases, L3, École normale supérieure – Leonid Libkin, Yann Ramusat

Master: Advanced Databases, M2, IASD – Pierre Senellart, Michaël Thomazo

Master: Data wrangling, Data privacy, M2, IASD – Leonid Libkin, Pierre Senellart

Master: Anonymization, privacy, IASD – Pierre Senellart

Master: Knowledge graphs, description logics, reasoning on data, M2, IASD – Camille Bourgaux, Michaël Thomazo

Pierre Senellart holds various teaching responsibilities (L3 internships, M1 projects, M2 administration, entrance competition) at ENS. Pierre Senellart is in the managing board of the graduate program. Leonid Libkin is co-responsible of the international entrance competition at ENS. Yann Ramusat was the secretary of the entrance competition at ENS for computer science. Michaël Thomazo is an adjunct professor at PSL.

Most members of the group are also involved in tutoring ENS students, advising them on their curriculum, their internships, etc. They are also occasionally involved with reviewing internship reports, supervising student projects, etc.

Supervision

PhD completed: Quentin Manière, Counting queries in ontology-based data access, 2019–2022, Meghyn Bienvenu & Michaël Thomazo (as he was based in Bordeaux, he was not considered a Valda member)

PhD completed: Yann Ramusat, Provenance-based routing in probabilistic graphs 33, 2018–2022, Silviu Maniu & Pierre Senellart

PhD in progress: Anatole Dahan, Logical foundations of the polynomial hierarchy, started in October 2020, Arnaud Durand & Luc Segoufin

PhD in progress: Baptiste Lafosse, Compiler dedicated to the evaluation of SQL queries, started in October 2021, Pierre Senellart & Jean-Marie Lagniez

PhD in progress: Shrey Mishra, Towards a knowledge base of mathematic results, started in January 2021, Pierre Senellart

PhD in progress: Alexandra Rogova, Query analytics in Cypher, started October 2021, Amelie Gheerbrant & Leonid Libkin

PhD in progress: Étienne Toussaint, Paolo Guagliardo & Leonid Libkin (as he is based in Edinburgh, he is not considered a Valda member)

Internship: Yacine Brihmouche, M1 internship, Pierre Senellart 34

Internship: Siméon Gheorghin, L3 internship, Pierre Senellart 35

Juries

PhD: Sajad Nazari [reviewer], INSA Centre Val de Loire, Pierre Senellart

Popularization Responsibilities

Serge Abiteboul is a member of the strategic committee of the Blaise Pascal foundation for scientific mediation.

Pierre Senellart is a scientific expert advising the Scientific and Ethical Committee of Parcoursup, the platform for the selection of first-year higher education students.

Articles and contents

Serge Abiteboul is a founding editor of the binaire blog for popularizing computer science.

Serge Abiteboul contributed an interview about artificial intelligence to the May 2022 special edition of Pour la Science; this article was among the ten 2022 articles recommanded by the editorial team of the magazine.

Serge Abiteboul wrote a book on the regulation of social networks 32

Serge Abiteboul wrote an article on the carbon impact of 5G 11

Monadic Datalog, Tree Validity, and Limited Access Containment M. Michael Benedikt P. Pierre Bourhis G. Georg Gottlob P. Pierre Senellart ACM Transactions on Computational Logic 2020 21 1 6:1-6:45 Answering Counting Queries over DL-Lite Ontologies M. Meghyn Bienvenu Q. Quentin Manière M. Michaël Thomazo Proceedings of the Twenty-Ninth International Joint Conference on Artificial Intelligence, IJCAI 2020. July 2020 Revisiting Semiring Provenance for Datalog C. Camille Bourgaux P. Pierre Bourhis L. Liat Peterfreund M. Michaël Thomazo Proceedings of the 19th International Conference on Principles of Knowledge Representation and Reasoning KR 2022 - 19th International Conference on Principles of Knowledge Representation and Reasoning Haifa, Israel July 2022 91–101 Capturing Homomorphism-Closed Decidable Queries with Existential Rules C. Camille Bourgaux D. David Carral M. Markus Krötzsch S. Sebastian Rudolph M. Michaël Thomazo KR 2021 - 18th International Conference on Principles of Knowledge Representation and Reasoning Virtual, Vietnam November 2021 141--150 Parallelisable Existential Rules: a Story of Pieces M. Maxime Buron M.-L. Marie-Laure Mugnier M. Michaël Thomazo KR 2021 - 18th International Conference on Principles of Knowledge Representation and Reasoning Virtual, Vietnam November 2021 Coping with Incomplete Data: Recent Advances M. Marco Console P. Paolo Guagliardo L. Leonid Libkin E. Etienne Toussaint June 2020 ACM 33-47 Tameness and the power of programs over monoids in DA N. Nathan Grosshans P. Pierre Mckenzie L. Luc Segoufin Logical Methods in Computer Science August 2022 18 3 14:1–14:34 Enumeration for FO Queries over Nowhere Dense Graphs N. Nicole Schweikardt L. Luc Segoufin A. Alexandre Vigny Journal of the ACM (JACM) June 2022 69 3 1-37 ProvSQL: Provenance and Probability Management in PostgreSQL P. Pierre Senellart L. Louis Jachiet S. Silviu Maniu Y. Yann Ramusat Proceedings of the VLDB Endowment (PVLDB) August 2018 11 12 2034-2037 Troubles with nulls, views from the users E. Etienne Toussaint P. Paolo Guagliardo L. Leonid Libkin J. Juan Sequeda Proceedings of the VLDB Endowment (PVLDB) July 2022 15 11 2613-2625 5G : amélioration ou aggravation du bilan carbone ? S. Serge Abiteboul P. Patrick Lagrange Polytechnique Insights March 2022 Fragments of bag relational algebra: Expressiveness and certain answers M. Marco Console P. Paolo Guagliardo L. Leonid Libkin Information Systems March 2022 105 101604 Propositional and predicate logics of incomplete information M. Marco Console P. Paolo Guagliardo L. Leonid Libkin Artificial Intelligence January 2022 302 103603 Enumerating Answers to First-Order Queries over Databases of Low Degree A. Arnaud Durand N. Nicole Schweikardt L. Luc Segoufin Logical Methods in Computer Science May 2022 18 2 23 Tameness and the power of programs over monoids in DA N. Nathan Grosshans P. Pierre Mckenzie L. Luc Segoufin Logical Methods in Computer Science August 2022 18 3 14:1–14:34 Enumeration for FO Queries over Nowhere Dense Graphs N. Nicole Schweikardt L. Luc Segoufin A. Alexandre Vigny Journal of the ACM (JACM) June 2022 69 3 1-37 Responsible data management J. Julia Stoyanovich B. Bill Howe H. V. Hosagrahar Visvesvaraya Jagadish S. Sebastian Schelter S. Serge Abiteboul Communications of the ACM June 2022 65 6 64-74 Troubles with nulls, views from the users E. Etienne Toussaint P. Paolo Guagliardo L. Leonid Libkin J. Juan Sequeda Proceedings of the VLDB Endowment (PVLDB) July 2022 15 11 2613-2625 Querying Inconsistent Prioritized Data with ORBITS: Algorithms, Implementation, and Experiments (Extended Abstract) M. Meghyn Bienvenu C. Camille Bourgaux Proceedings of the 35th International Workshop on Description Logics DL 2022 - 35th International Workshop on Description Logics Haifa, Israel August 2022 Querying Inconsistent Prioritized Data with ORBITS: Algorithms, Implementation, and Experiments M. Meghyn Bienvenu C. Camille Bourgaux KR 2022 - 19th International Conference on Principles of Knowledge Representation and Reasoning Haifa, Israel July 2022 Complexity Landscape for Counting Queries M. Meghyn Bienvenu Q. Quentin Manière M. Michaël Thomazo 35th International Workshop on Description Logics Haifa, Israel August 2022 Counting Queries over ELHI⊥ Ontologies M. Meghyn Bienvenu Q. Quentin Manière M. Michaël Thomazo KR 2022 - 19th International Conference on Principles of Knowledge Representation and Reasoning Haifa, Israel July 2022 53-62 Revisiting Semiring Provenance for Datalog C. Camille Bourgaux P. Pierre Bourhis L. Liat Peterfreund M. Michaël Thomazo Proceedings of the 19th International Conference on Principles of Knowledge Representation and Reasoning KR 2022 - 19th International Conference on Principles of Knowledge Representation and Reasoning Haifa, Israel July 2022 91–101 Capturing Homomorphism-Closed Decidable Queries with Existential Rules (Extended Abstract) C. Camille Bourgaux D. David Carral M. Markus Krötzsch S. Sebastian Rudolph M. Michaël Thomazo IJCAI-ECAI 2022 - 31st International Joint Conference on Artificial Intelligence - 25th European Conference on Artificial Intelligence Vienna, Austria July 2022 5269-5273 Tight Fine-Grained Bounds for Direct Access on Join Queries K. Karl Bringmann N. Nofar Carmeli S. Stefan Mengel PODS '22: International Conference on Management of Data SIGMOD/PODS '22: International Conference on Management of Data Philadelphia PA, United States June 2022 ACM 427-436 Normalisations of Existential Rules: Not so Innocuous! D. David Carral L. Lucas Larroque M.-L. Marie-Laure Mugnier M. Michaël Thomazo KR 2022 - 19th International Conference on Principles of Knowledge Representation and Reasoning HaÏfa, Israel 2022 102-111 Graph Pattern Matching in GQL and SQL/PGQ A. Alin Deutsch N. Nadime Francis A. Alastair Green K. Keith Hare B. Bei Li L. Leonid Libkin T. Tobias Lindaaker V. Victor Marsault W. Wim Martens J. Jan Michels F. Filip Murlak S. Stefan Plantikow P. Petra Selmer H. Hannes Voigt O. Oskar van Rest D. Domagoj Vrgoč M. Mingxi Wu F. Fred Zemke SIGMOD '22: International Conference on Management of Data Philadelphia, United States June 2022 Certain Answers of Extensions of Conjunctive Queries by Datalog and First-Order Rewriting A. Amélie Gheerbrant L. Leonid Libkin A. Alexandra Rogova C. Cristina Sirangelo 4th International Workshop on the Resurgence of Datalog in Academia and Industry Genoa, Italy September 2022 Generalized Bundled Fragments for First-Order Modal Logic M. Mo Liu A. Anantha Padmanabha R. R Ramanujam Y. Yanjing Wang 47th International Symposium on Mathematical Foundations of Computer Science (MFCS 2022) 47th International Symposium on Mathematical Foundations of Computer Science (MFCS 2022) Vienna, Austria August 2022 241 70:1--70:14 Efficient Provenance-Aware Querying of Graph Databases with Datalog Y. Yann Ramusat S. Silviu Maniu P. Pierre Senellart GRADES-NDA 2022 - Joint Workshop on Graph Data Management Experiences & Systems (GRADES) and Network Data Analytics (NDA) Philadelphia, United States June 2022 Parallelisable Existential Rules: a Story of Pieces M. Maxime Buron M.-L. Marie-Laure Mugnier M. Michaël Thomazo BDA 2022 - 38ème journée "Gestion de Données – Principes, Technologies et Applications" Clermont-Ferrand, France October 2022 Nous sommes les réseaux sociaux S. Serge Abiteboul J. Jean Cattan 2022 Odile Jacob The Semiring-Based Provenance Framework for Graph Databases Y. Yann Ramusat April 2022 TheoremKB : une base de connaissance des résultats mathématiques Y. Yacine Brihmouche September 2022 Etude de données Twitter en lien avec l'élection présidentielle française d'avril 2022 S. Siméon Gheorghin Paris June 2022 14 Managing your digital life S. Serge Abiteboul B. Benjamin André D. Daniel Kaplan Commun. ACM 2015 58 5 32-35 Comparing workflow specification languages: A matter of views S. Serge Abiteboul P. Pierre Bourhis V. Victor Vianu ACM Trans. Database Syst. 2012 37 2 10:1-10:59 Data on the Web: From Relations to Semistructured Data and XML S. Serge Abiteboul P. Peter Buneman D. Dan Suciu 1999 Morgan Kaufmann Temporal Versus First-Order Logic to Query Temporal Databases S. Serge Abiteboul L. Laurent Herr J. V. Jan Van den Bussche 1996 49-57 Foundations of Databases S. Serge Abiteboul R. Richard Hull V. Victor Vianu 1995 Addison-Wesley Web Data Management S. Serge Abiteboul I. Ioana Manolescu P. Philippe Rigaux M.-C. Marie-Christine Rousset P. Pierre Senellart 2011 Cambridge University Press Provenance Circuits for Trees and Treelike Instances A. Antoine Amarilli P. Pierre Bourhis P. Pierre Senellart 2015 56-68 Tractable Lineages on Treelike Instances: Limits and Extensions A. Antoine Amarilli P. Pierre Bourhis P. Pierre Senellart 2016 355-370 CrowdMiner: Mining association rules from the crowd Y. Yael Amsterdamer Y. Yael Grossman T. Tova Milo P. Pierre Senellart PVLDB 2013 6 12 1250-1253 Querying graph databases P. B. Pablo Barceló Baeza 2013 175-188 The Management of Probabilistic Data D. Daniel Barbará H. Hector Garcia-Molina D. Daryl Porter IEEE Trans. Knowl. Data Eng. 1992 4 5 487-502 Regularized Cost-Model Oblivious Database Tuning with Reinforcement Learning D. Debabrota Basu Q. Qian Lin W. Weidong Chen H. T. Hoang Tam Vo Z. Zihong Yuan P. Pierre Senellart S. Stéphane Bressan T. Large-Scale Data- and Knowledge-Centered Systems 2016 28 96-132 Determining relevance of accesses at runtime M. Michael Benedikt G. Georg Gottlob P. Pierre Senellart 2011 211-222 Databases M. Michael Benedikt P. Pierre Senellart 2011 Springer 169-229 Dealing with the Deep Web and all its Quirks M. Meghyn Bienvenu D. Daniel Deutch D. Davide Martinenghi P. Pierre Senellart F. M. Fabian M. Suchanek 2012 21-24 Verification of database-driven systems via amalgamation M. Miko\laj Bojańczyk L. Luc Segoufin S. Szymon Toruńczyk 2013 63-74 Why and Where: A Characterization of Data Provenance P. Peter Buneman S. Sanjeev Khanna W.-C. Wang-Chiew Tan 2001 316-330 The Monadic Second-Order Logic of Graphs. I. Recognizable Sets of Finite Graphs B. Bruno Courcelle Inf. Comput. 1990 85 1 12-75 The dichotomy of probabilistic inference for unions of conjunctive queries N. N. Nilesh N. Dalvi D. Dan Suciu J. ACM 2012 59 6 30:1-30:87 Adaptive Query Processing A. Amol Deshpande Z. G. Zachary G. Ives V. Vijayshankar Raman Foundations and Trends in Databases 2007 1 1 1-140 Proactive learning: cost-sensitive active learning with multiple imperfect oracles P. Pinar Donmez J. G. Jaime G. Carbonell 2008 619-628 Adaptive Web Crawling Through Structure-Based Link Classification M. Muhammad Faheem P. Pierre Senellart 2015 39-51 Introduction to statistical relational learning L. Lise Getoor 2007 MIT Press Scalable, generic, and adaptive systems for focused crawling G. Georges Gouriten S. Silviu Maniu P. Pierre Senellart 2014 35-45 Provenance semirings T. J. Todd J. Green G. Gregory Karvounarakis V. Val Tannen 2007 31-40 Models for Incomplete and Probabilistic Information T. J. Todd J. Green V. Val Tannen IEEE Data Eng. Bull. 2006 29 1 17-24 Answering queries using views: A survey A. Y. Alon Y. Halevy VLDB J. 2001 10 4 270-294 Support vector machines M. A. Marti A. Hearst S. T. Susan T Dumais E. Edgar Osuna J. John Platt B. Bernhard Scholkopf IEEE Intelligent Systems 1998 13 4 18-28 Incomplete Information in Relational Databases T. Tomasz Imielinski W. Witold Lipski Jr. J. ACM 1984 31 4 761-791 FO2(<, +1, <formula type="inline"><math xmlns="http://www.w3.org/1998/Math/MathML"><mo/></math></formula>) on data trees, data tree automata and branching vector addition systems F. Florent Jacquemard L. Luc Segoufin J. Jerémie Dimino Logical Methods in Computer Science 2016 12 2 Probabilistic XML: Models and Complexity B. Benny Kimelfeld P. Pierre Senellart 2013 Springer 39-66 Equivalence of Relational Algebra and Relational Calculus Query Languages Having Aggregate Functions A. C. Anthony C. Klug J. ACM 1982 29 3 699-717 The State of the art in distributed query processing D. Donald Kossmann ACM Comput. Surv. 2000 32 4 422-469 Conditional Random Fields: Probabilistic Models for Segmenting and Labeling Sequence Data J. D. John D. Lafferty A. Andrew McCallum F. C. Fernando C. N. Pereira 2001 282-289 Online Influence Maximization S. Siyu Lei S. Silviu Maniu L. Luyi Mo R. Reynold Cheng P. Pierre Senellart 2015 645-654 Automata Theory for XML Researchers F. Frank Neven SIGMOD Record 2002 31 3 39-46 Principles of Distributed Database Systems, Third Edition M. T. M. Tamer Özsu P. Patrick Valduriez 2011 Springer Automatic wrapper induction from hidden-web sources with domain knowledge P. Pierre Senellart A. Avin Mittal D. Daniel Muschick R. Rémi Gilleron M. Marc Tommasi 2008 9-16 Active Learning B. Burr Settles Synthesis Lectures on Artificial Intelligence and Machine Learning 2012 Morgan & Claypool Publishers Active learning with real annotation costs B. Burr Settles M. Mark Craven L. Lewis Friedland 2008 PARIS: Probabilistic Alignment of Relations, Instances, and Schema F. M. Fabian M. Suchanek S. Serge Abiteboul P. Pierre Senellart PVLDB 2011 5 3 157-168 Probabilistic Databases D. Dan Suciu D. Dan Olteanu C. Christopher Ré C. Christoph Koch Synthesis Lectures on Data Management 2011 Morgan & Claypool Publishers Reinforcement learning - an introduction R. S. Richard S. Sutton A. G. Andrew G. Barto Adaptive computation and machine learning 1998 MIT Press The Complexity of Relational Query Languages (Extended Abstract) M. Y. Moshe Y. Vardi 1982 137-146 On the reliability and intuitiveness of aggregated search metrics K. Ke Zhou M. Mounia Lalmas T. Tetsuya Sakai R. Ronan Cummins J. M. Joemon M. Jose 2013 689-698