VALDA - 2020 - Annual activity report

VALDA

VALDA - 2020

2020

Activity report

Project-Team

VALDA

RNSR: 201622223R

Research center

Paris

In partnership with:

CNRS, Ecole normale supérieure de Paris

Value from Data

In collaboration with:

Département d'Informatique de l'Ecole Normale Supérieure

Domain

Perception, Cognition and Interaction

Theme

Data and Knowledge Representation and Processing

Creation of the Team: 2016 December 01, updated into Project-Team: 2018 January 01

Keywords

Computer Science and Digital Science

A3.1. Data
A3.1.1. Modeling, representation
A3.1.2. Data management, quering and storage
A3.1.3. Distributed data
A3.1.4. Uncertain data
A3.1.5. Control access, privacy
A3.1.6. Query optimization
A3.1.7. Open data
A3.1.8. Big data (production, storage, transfer)
A3.1.9. Database
A3.1.10. Heterogeneous data
A3.1.11. Structured data
A3.2. Knowledge
A3.2.1. Knowledge bases
A3.2.2. Knowledge extraction, cleaning
A3.2.3. Inference
A3.2.4. Semantic Web
A3.2.5. Ontologies
A3.2.6. Linked data
A3.3.2. Data mining
A3.4.3. Reinforcement learning
A3.4.5. Bayesian methods
A3.5.1. Analysis of large graphs
A4.7. Access control
A7.2. Logic in Computer Science
A7.3. Calculability and computability
A9.1. Knowledge
A9.8. Reasoning

1 Team members, visitors, external collaborators

Research Scientists

Serge Abiteboul [Inria, Emeritus, HDR]
Camille Bourgaux [CNRS, Researcher]
Olivier Cappé [CNRS, Senior Researcher]
Luc Segoufin [Inria, Senior Researcher]
Michaël Thomazo [Inria, Researcher]
Victor Vianu [Inria, Advanced Research Position, from Sep 2020 until Oct 2020]

Faculty Members

Pierre Senellart [Team leader, École Normale Supérieure de Paris, Professor]
Leonid Libkin [École Normale Supérieure de Paris, Professor]
Silviu Maniu [Université Paris-Saclay, Associate Professor, until Aug 2020]

Post-Doctoral Fellows

Ashish Deepak Dandekar [École Normale Supérieure de Paris, until Sep 2020]
Nathan Grosshans [École Normale Supérieure de Paris, until Aug 2020]
Liat Peterfreund [CNRS]

PhD Students

Juliette Achddou [1000 Mercis, CIFRE]
Anatole Dahan [Université de Paris, from October 2020]
Julien Grange [Université Denis Diderot, until Aug 2020]
Yann Ramusat [École Normale Supérieure de Paris]
Yoan Russac [École Normale Supérieure de Paris]

Interns and Apprentices

Raphaël Chekroun [ENS, Intern, from Oct 2020]
Théo Delmazure [École Normale Supérieure de Paris, Intern, from Apr 2020 until Aug 2020]
Lucas Pluvinage [École Normale Supérieure de Paris, Intern, from Apr 2020 until Sep 2020]

Administrative Assistant

Meriem Guemair [Inria]

External Collaborator

Victor Vianu [Université de Californie, from Oct 2020]

2 Overall objectives

2.1 Objectives

Valda's focus is on both foundational and systems aspects of complex data management, especially human-centric data. The data we are interested in is typically heterogeneous, massively distributed, rapidly evolving, intensional, and often subjective, possibly erroneous, imprecise, incomplete. In this setting, Valda is in particular concerned with the optimization of complex resources such as computer time and space, communication, monetary, and privacy budgets. The goal is to extract value from data, beyond simple query answering.

Data management 43, 52 is now an old, well-established field, for which many scientific results and techniques have been accumulated since the sixties. Originally, most works dealt with static, homogeneous, and precise data. Later, works were devoted to heterogeneous data 4144, and possibly distributed 75 but at a small scale.

However, these classical techniques are poorly adapted to handle the new challenges of data management. Consider human-centric data, which is either produced by humans, e.g., emails, chats, recommendations, or produced by systems when dealing with humans, e.g., geolocation, business transactions, results of data analysis. When dealing with such data, and to accomplish any task to extract value from such data, we rapidly encounter the following facets:

Heterogeneity: data may come in many different structures such as unstructured text, graphs, data streams, complex aggregates, etc., using many different schemas or ontologies.
Massive distribution: data may come from a large number of autonomous sources distributed over the web, with complex access patterns.
Rapid evolution: many sources may be producing data in real time, even if little of it is perhaps relevant to the specific application. Typically, recent data is of particular interest and changes have to be monitored.
Intensionality1: in a classical database, all the data is available. In modern applications, the data is more and more available only intensionally, possibly at some cost, with the difficulty to discover which source can contribute towards a particular goal, and this with some uncertainty.
Confidentiality and security: some personal data is critical and need to remain confidential. Applications manipulating personal data must take this into account and must be secure against linking.
Uncertainty: modern data, and in particular human-centric data, typically includes errors, contradictions, imprecision, incompleteness, which complicates reasoning. Furthermore, the subjective nature of the data, with opinions, sentiments, or biases, also makes reasoning harder since one has, for instance, to consider different agents with distinct, possibly contradicting knowledge.

These problems have already been studied individually and have led to techniques such as query rewriting65 or distributed query optimization71.

Among all these aspects, intensionality is perhaps the one that has least been studied, so we pay particular attention to it. Consider a user's query, taken in a very broad sense: it may be a classical database query, some information retrieval search, a clustering or classification task, or some more advanced knowledge extraction request. Because of intensionality of data, solving such a query is a typically dynamic task: each time new data is obtained, the partial knowledge a system has of the world is revised, and query plans need to be updated, as in adaptive query processing 58 or aggregated search 83. The system then needs to decide, based on this partial knowledge, of the best next access to perform. This is reminiscent of the central problem of reinforcement learning 81 (train an agent to accomplish a task in a partially known world based on rewards obtained) and of active learning 77 (decide which action to perform next in order to optimize a learning strategy) and we intend to explore this connection further.

Uncertainty of the data interacts with its intensionality: efforts are required to obtain more precise, more complete, sounder results, which yields a trade-off between processing cost and data quality.

Other aspects, such as heterogeneity and massive distribution, are of major importance as well. A standard data management task, such as query answering, information retrieval, or clustering, may become much more challenging when taking into account the fact that data is not available in a central location, or in a common format. We aim to take these aspects into account, to be able to apply our research to real-world applications.

2.2 The Issues

We intend to tackle hard technical issues such as query answering, data integration, data monitoring, verification of data-centric systems, truth finding, knowledge extraction, data analytics, that take a different flavor in this modern context. In particular, we are interested in designing strategies to minimize data access cost towards a specific goal, possibly a massive data analysis task. That cost may be in terms of communication (accessing data in distributed systems, on the Web), of computational resources (when data is produced by complex tools such as information extraction, machine learning systems, or complex query processing), of monetary budget (paid-for application programming interfaces, crowdsourcing platforms), or of a privacy budget (as in the standard framework of differential privacy).

A number of data management tasks in Valda are inherently intractable. In addition to properly characterizing this intractability in terms of complexity theory, we intend to develop solutions for solving these tasks in practice, based on approximation strategies, randomized algorithms, enumeration algorithms with constant delay, or identification of restricted forms of data instances lowering the complexity of the task.

3 Research program

3.1 Scientific Foundations

We now detail some of the scientific foundations of our research on complex data management. This is the occasion to review connections between data management, especially on complex data as is the focus of Valda, with related research areas.

Complexity & Logic

Data management has been connected to logic since the advent of the relational model as main representation system for real-world data, and of first-order logic as the logical core of database querying languages 43. Since these early developments, logic has also been successfully used to capture a large variety of query modes, such as data aggregation 70, recursive queries (Datalog), or querying of XML databases 52. Logical formalisms facilitate reasoning about the expressiveness of a query language or about its complexity.

The main problem of interest in data management is that of query evaluation, i.e., computing the results of a query over a database. The complexity of this problem has far-reaching consequences. For example, it is because first-order logic is in the ${AC}_{0}$ complexity class that evaluation of SQL queries can be parallelized efficiently. It is usual 82 in data management to distinguish data complexity, where the query is considered to be fixed, from combined complexity, where both the query and the data are considered to be part of the input. Thus, though conjunctive queries, corresponding to a simple SELECT-FROM-WHERE fragment of SQL, have PTIME data complexity, they are NP-hard in combined complexity. Making this distinction is important, because data is often far larger (up to the order of terabytes) than queries (rarely more than a few hundred bytes). Beyond simple query evaluation, a central question in data management remains that of complexity; tools from algorithm analysis, and complexity theory can be used to pinpoint the tractability frontier of data management tasks.

Automata Theory

Automata theory and formal languages arise as important components of the study of many data management tasks: in temporal databases 42, queries, expressed in temporal logics, can often by compiled to automata; in graph databases 48, queries are naturally given as automata; typical query and schema languages for XML databases such as XPath and XML Schema can be compiled to tree automata 74, or for more complex languages to data tree automata 68. Another reason of the importance of automata theory, and tree automata in particular, comes from Courcelle's results 56 that show that very expressive queries (from the language of monadic second-order language) can be evaluated as tree automata over tree decompositions of the original databases, yielding linear-time algorithms (in data complexity) for a wide variety of applications.

Verification

Complex data management also has connections to verification and static analysis. Besides query evaluation, a central problem in data management is that of deciding whether two queries are equivalent43. This is critical for query optimization, in order to determine if the rewriting of a query, maybe cheaper to evaluate, will return the same result as the original query. Equivalence can easily be seen to be an instance of the problem of (non-)satisfiability: $q \equiv q^{'}$ if and only if $(q \land \neg q^{'}) \lor (\neg q \land q^{'})$ is not satisfiable. In other words, some aspects of query optimization are static analysis issues. Verification is also a critical part of any database application where it is important to ensure that some property will never (or always) arise 54.

Workflows

The orchestration of distributed activities (under the responsibility of a conductor) and their choreography (when they are fully autonomous) are complex issues that are essential for a wide range of data management applications including notably, e-commerce systems, business processes, health-care and scientific workflows. The difficulty is to guarantee consistency or more generally, quality of service, and to statically verify critical properties of the system. Different approaches to workflow specifications exist: automata-based, logic-based, or predicate-based control of function calls 40.

Probability & Provenance

To deal with the uncertainty attached to data, proper models need to be used (such as attaching provenance information to data items and viewing the whole database as being probabilistic) and practical methods and systems need to be developed to both reliably estimate the uncertainty in data items and properly manage provenance and uncertainty information throughout a long, complex system.

The simplest model of data uncertainty is the NULLs of SQL databases, also called Codd tables 43. This representation system is too basic for any complex task, and has the major inconvenient of not being closed under even simple queries or updates. A solution to this has been proposed in the form of conditional tables67 where every tuple is annotated with a Boolean formula over independent Boolean random events. This model has been recognized as foundational and extended in two different directions: to more expressive models of provenance than what Boolean functions capture, through a semiring formalism 63, and to a probabilistic formalism by assigning independent probabilities to the Boolean events 64. These two extensions form the basis of modern provenance and probability management, subsuming in a large way previous works 55, 49. Research in the past ten years has focused on a better understanding of the tractability of query answering with provenance and probabilistic annotations, in a variety of specializations of this framework 8069, 46.

Machine Learning

Statistical machine learning, and its applications to data mining and data analytics, is a major foundation of data management research. A large variety of research areas in complex data management, such as wrapper induction 76, crowdsourcing 47, focused crawling 62, or automatic database tuning 50 critically rely on machine learning techniques, such as classification 66, probabilistic models 61, or reinforcement learning 81.

Machine learning is also a rich source of complex data management problems: thus, the probabilities produced by a conditional random field 72 system result in probabilistic annotations that need to be properly modeled, stored, and queried.

Finally, complex data management also brings new twists to some classical machine learning problems. Consider for instance the area of active learning77, a subfield of machine learning concerned with how to optimally use a (costly) oracle, in an interactive manner, to label training data that will be used to build a learning model, e.g., a classifier. In most of the active learning literature, the cost model is very basic (uniform or fixed-value costs), though some works 78 consider more realistic costs. Also, oracles are usually assumed to be perfect with only a few exceptions 59. These assumptions usually break when applied to complex data management problems on real-world data, such as crowdsourcing.

3.2 Research Directions

At the beginning of the Valda team, the project was to focus on the following directions:

foundational aspects of data management, in particular related to query enumeration and reasoning on data, especially regarding security issues;
implementation of provenance and uncertainty management, real-world applications, other aspects of uncertainty and incompleteness, in particular dynamic;
development of personal information management systems, integration of machine learning techniques.

We believe the first two directions have been followed in a satisfactory manner. The focus on personal information management has not been kept for various organizational reasons, however, but the third axis of the project is reoriented to more general aspects of Web data management.

New permanent arrivals in the group since its creation have impacted its research directions in the following manner:

Camille Bourgaux and Michaël Thomazo are both specialists of knowledge representation and formal aspects of knowledge bases, which is an expertise that did not exist in the group. They are also both interested in, and have started working on aspects related to connecting their research with database theory, and investigating aspects of uncertainty and incompleteness in their research. This will lead to more work on knowledge representation and symbolic AI aspects, while keeping the focus of Valda on foundations of data management and uncertainty.
Olivier Cappé is a specialist in statistics and machine learning, in particular multi-armed bandits and reinforcement learning. He is also interested in applications of these learning techniques to data management problems. His arrival in the group therefore complements the expertise of other researchers, and will lead to more work on machine learning issues.
Leonid Libkin is a specialist of database theory, of incomplete data management, and has a line of current research on graph data management. His profile fits very well with the original orientation of the Valda project.

We intend to keep producing leading research on the foundations of data management. Generally speaking, the goal is to investigate the borders of feasibility of various tasks. For instance, what are the assumptions on data that allow for computable problems? When is it not possible at all? When can we hope for efficient query answering, when is it hopeless? This is a problem of theoretical nature which is necessary for understanding the limit of the methods and driving research towards the scenarios where positive results may be obtainable. Only when we have understood the limitation of different methods and have many examples where this is possible, we can hope to design a solid foundation that allowing for a good trade-off between what can be done (needs from the users) and what can be achieved (limitation from the system).

Similarly, we will continue our work, both foundational and practical, on various aspects of provenance and uncertainty management. One overall long-term goal is to reach a full understanding of the interactions between query evaluation or other broader data management tasks and uncertain and annotated data models. We would in particular want to go towards a full classification of tractable (typically polynomial-time) and intractable (typically NP-hard for decision problems, or #P-hard for probability evaluation) tasks, extending and connecting the query-based dichotomy 57 on probabilistic query evaluation with the instance-based one of 45, 46. Another long-term goal is to consider more dynamic scenarios than what has been considered so far in the uncertain data management literature: when following a workflow, or when interacting with intensional data sources, how to properly represent and update uncertainty annotations that are associated with data. This is critical for many complex data management scenarios where one has to maintain a probabilistic current knowledge of the world, while obtaining new knowledge by posing queries and accessing data sources. Such intensional tasks requires minimizing jointly data uncertainty and cost to data access.

As application area, in addition to the historical focus on personal information management which is now less stressed, we target Web data (Web pages, the semantic Web, social networks, the deep Web, crowdsourcing platforms, etc.).

We aim at keeping a delicate balance between theoretical, foundational research, and systems research, including development and implementation. This is a difficult balance to find, especially since most Valda researchers have a tendency to favor theoretical work, but we believe it is also one of the strengths of the team.

4 Application domains

4.1 Personal Information Management Systems

We recall that Valda's focus is on human-centric data, i.e., data produced by humans, explicitly or implicitly, or more generally containing information about humans. Quite naturally, we have used as a privileged application area to validate Valda’s results that of personal information management systems (Pims for short) 39.

A Pims is a system that allows a user to integrate her own data, e.g., emails and other kinds of messages, calendar, contacts, web search, social network, travel information, work projects, etc. Such information is commonly spread across different services. The goal is to give back to a user the control on her information, allowing her to formulate queries such as “What kind of interaction did I have recently with Alice B.?”, “Where were my last ten business trips, and who helped me plan them?”. The system has to orchestrate queries to the various services (which means knowing the existence of these services, and how to interact with them), integrate information from them (which means having data models for this information and its representation in the services), e.g., align a GPS location of the user to a business address or place mentioned in an email, or an event in a calendar to some event in a Web search. This information must be accessed intensionally: for instance, costly information extraction tools should only be run on emails which seem relevant, perhaps identified by a less costly cursory analysis (this means, in turn, obtaining a cost model for access to the different services). Impacted people can be found by examining events in the user's calendar and determining who is likely to attend them, perhaps based on email exchanges or former events' participant lists. Of course, uncertainty has to be maintained along the entire process, and provenance information is needed to explain query results to the user (e.g., indicate which meetings and trips are relevant to each person of the output). Knowledge about services, their data models, their costs, need either to be provided by the system designer, or to be automatically learned from interaction with these services, as in 76.

One motivation for that choice is that Pims concentrate many of the problems we intend to investigate: heterogeneity (various sources, each with a different structure), massive distribution (information spread out over the Web, in numerous sources), rapid evolution (new data regularly added), intensionality (knowledge from Wikidata, OpenStreetMap...), confidentiality and security (mostly private data), and uncertainty (very variable quality). Though the data is distributed, its size is relatively modest; other applications may be considered for works focusing on processing data at large scale, which is a potential research direction within Valda, though not our main focus. Another strong motivation for the choice of Pims as application domain is the importance of this application from a societal viewpoint.

A Pims is essentially a system built on top of a user's personal knowledge base; such knowledge bases are reminiscent of those found in the Semantic Web, e.g., linked open data. Some issues, such as ontology alignment 79 exist in both scenarios. However, there are some fundamental differences in building personal knowledge bases vs collecting information from the Semantic Web: first, the scope is quite smaller, as one is only interested in knowledge related to a given individual; second, a small proportion of the data is already present in the form of semantic information, most needs to be extracted and annotated through appropriate wrappers and enrichers; third, though the linked open data is meant to be read-only, the only update possible to a user being adding new triples, a personal knowledge base is very much something that a user needs to be able to edit, and propagating updates from the knowledge base to original data sources is a challenge in itself.

4.2 Web Data

The choice of Pims is not exclusive. We also consider other application areas as well. In particular, we have worked in the past and have a strong expertise on Web data 44 in a broad sense: semi-structured, structured, or unstructured content extracted from Web databases 76; knowledge bases from the Semantic Web 79; social networks 73; Web archives and Web crawls 60; Web applications and deep Web databases 53; crowdsourcing platforms 47. We intend to continue using Web data as a natural application domain for the research within Valda when relevant. For instance 51, deep Web databases are a natural application scenario for intensional data management issues: determining if a deep Web database contains some information requires optimizing the number of costly requests to that database.

A common aspect of both personal information and Web data is that their exploitation raises ethical considerations. Thus, a user needs to remain fully in control of the usage that is made of her personal information; a search engine or recommender system that ranks Web content for display to a specific user needs to do so in an unbiased, justifiable, manner. These ethical constraints sometimes forbid some technically solutions that may be technically useful, such as sharing a model learned from the personal data of a user to another user, or using blackboxes to rank query result. We fully intend to consider these ethical considerations within Valda. One of the main goals of a Pims is indeed to empower the user with a full control on the use of this data.

5 Highlights of the year

5.1 Awards

Leonid Libkin was awarded the Gems of PODS award at PODS 2020 for his work on incomplete data, and invited to write a survey article for the occasion. 21
Pierre Senellart was named a junior member of Institut Universitaire de France.

6 New software and platforms

6.1 New software

6.1.1 ProvSQL

Keywords: Databases, Provenance, Probability
Functional Description: The goal of the ProvSQL project is to add support for (m-)semiring provenance and uncertainty management to PostgreSQL databases, in the form of a PostgreSQL extension/module/plugin.
News of the Year: Implementation of an in-memory storage of the provenance circuit. Implementation of aggregate provenance. Major performance enhancements. Support for PostgreSQL 13. Miscellaneous enhancements and bug fixes.
URL: https://github.com/PierreSenellart/provsql
Publications: hal-01672566, hal-01851538
Contact: Pierre Senellart
Participants: Pierre Senellart, Silviu Maniu, Yann Ramusat

6.1.2 apxproof

Keyword: LaTeX
Functional Description: apxproof is a LaTeX package facilitating the typesetting of research articles with proofs in appendix, a common practice in database theory and theoretical computer science in general. The appendix material is written in the LaTeX code along with the main text which it naturally complements, and it is automatically deferred. The package can automatically send proofs to the appendix, can repeat in the appendix the theorem environments stated in the main text, can section the appendix automatically based on the sectioning of the main text, and supports a separate bibliography for the appendix material.
Release Contributions: Compatibility fixes with xypic, fancyvrb, memoir, natbib
News of the Year: Minor 1.2.1 release: compatibility fixes with xypic, fancyvrb, memoir, natbib
URL: https://github.com/PierreSenellart/apxproof
Contact: Pierre Senellart
Participant: Pierre Senellart

6.1.3 TheoremKB

Keyword: Information extraction
Functional Description: TheoremKB is a collection of tools to extract semantic information from (mathematical) research articles.
News of the Year: Initial version. Initial version of extractors of theorems and proofs from PDF and LaTeX. Construction and analysis of relations between theorems.
URL: https://github.com/PierreSenellart/theoremkb
Publications: hal-02956526, hal-02940819
Contact: Pierre Senellart
Participants: Pierre Senellart, Theo Delemazure, Lucas Pluvinage

7 New results

We present the results we obtained and published in 2020 in four directions: the management of incomplete and uncertain data; the complexity of query languages over restricted structures; information extraction; and some other works in the area of database theory.

7.1 Incompleteness, Uncertainty, and Provenance of Data

A major research area within Valda is the management of incomplete, missing, imprecise, uncertain data, along with the tracking of data provenance, a tool often necessary for uncertain data management.

Incomplete data.

The standard notion of query answering over incomplete database is that of certain answers, guaranteeing correctness regardless of how incomplete data is interpreted. In 22, we consider databases with numerical data and queries with arithmetic and comparisons. Even though the notion of certain answers still applies,we explain that it becomes much more problematic in situations when missing data occurs in numerical columns. We propose a new general framework that allows us to assign a measure of certainty to query answers. We test it in the agnostic scenario where we do not have prior information about values of numerical attributes, similarly to the predominant approach in handling incomplete data which assumes that each null can be interpreted as an arbitrary value of the domain.

In 29, we consider incomplete databases whose information content may be enriched by additional knowledge. The knowledge order among them is derived from their semantics, rather than being fixed a priori. The resulting framework allows us to capture and justify existing notions of certainty, and extend these concepts to other data models and query languages. As natural applications, we provide for the first time a well-founded definition of certain answers for the relational bag data model and for value-inventing queries on incomplete databases, addressing the key shortcomings of previous approaches.

Missing data.

When dealing with missing data, one regularly estimates likelihoods of certain events by computing volumes of sets that serve as a mathematical representation of such events. Such sets need to be measurable, which is usually achieved by putting bounds, sometimes ad hoc, on them. In 23, we address the question how unbounded or unmeasurable sets can be measured nonetheless. Intuitively, we want to know how likely a randomly chosen point is to be in a given set, even in the absence of a uniform distribution over the entire space.

Inconsistent data.

Another form of uncertainty in databases arises in the presence of inconsistencies; a way to address such inconsistencies is by way of optimal repairs. In 17, we explore the issue of inconsistency handling over prioritized knowledge bases (KBs), which consist of an ontology, a set of facts, and a priority relation between con- flicting facts. After transfer- ring the notions of globally-, Pareto- and completion-optimal repairs from the database literature to our setting, we study the data complexity of the core reasoning tasks: query entailment under inconsistency-tolerant semantics based upon optimal repairs, existence of a unique optimal repair, and enumeration of all optimal repairs. Our results provide a nearly complete picture of the data complexity of these tasks for ontologies formulated in common DL-Lite dialects.

Data provenance.

In 19, 20, we address the problem of handling provenance information in ELHr ontologies. We consider a setting recently introduced for ontology-based data access, based on semirings and extending classical data provenance, in which ontology axioms are annotated with provenance tokens. A consequence inherits the provenance of the axioms involved in deriving it, yielding a provenance polynomial as annotation. We analyse the semantics for the ELHr case and show that the presence of conjunctions poses various difficulties for handling provenance, some of which are mitigated by assuming multiplicative idempotency of the semiring. Under this assumption, we study three problems: ontology completion with provenance, computing the set of relevant axioms for a consequence, and query answering.

In 12, we investigate compact representations of Boolean provenance represented as Boolean circuits, by providing a systematic picture of many circuit classes considered in knowledge compilation and how they can be systematically connected to width measures, through upper and lower bounds. Our upper bounds show that bounded-treewidth circuits can be constructively converted to d-SDNNFs, in time linear in the circuit size and singly exponential in the treewidth; and that bounded-pathwidth circuits can similarly be converted to uOBDDs. We show matching lower bounds on the compilation of monotone DNF or CNF formulas to structured targets, assuming a constant bound on the arity (size of clauses) and degree (number of occurrences of each variable): any d-SDNNF (resp., SDNNF) for such a DNF (resp., CNF) must be of exponential size in its treewidth, and the same holds for uOBDDs (resp., n-OBDDs) when considering pathwidth.

7.2 Query Languages over Restricted Structures

Another major line of research within Valda is to investigate the complexity of classical database problems (query evaluation, query enumeration, query containement), and the expressive power of quer classes, when data is assumed to have a restricted structure: trees, relations with bounded treewidth, bounded expansion, bounded degree...

Complexity.

In 15, we consider the evaluation of first-order queries over classes of databases with bounded expansion. The notion of bounded expansion is fairly broad and generalizes bounded degree, bounded treewidth and exclusion of at least one minor. It was known that over a class of databases with bounded expansion, first-order sentences could be evaluated in time linear in the size of the database. We give a different proof of this result. Moreover, we show that answers to first-order queries can be enumerated with constant delay after a linear time preprocessing. We also show that counting the number of answers to a query can be done in time linear in the size of the database.

In 14, we consider the problem of containment of monadic datalog (MDL) queries in unions of conjunctive queries (UCQs). We start by revisiting the connection between MDL/UCQ containment and containment problems involving regular tree languages. We then present a general approach for getting tighter bounds on the complexity of query containment, based on analysis of the number of mappings of queries into tree-like instances. We give two applications of the machinery. We first give an important special case of the MDL/UCQ containment problem that is in EXPTIME, and use this bound to show an EXPTIME bound on containment under access patterns. Secondly we show that the same technique can be used to get a new tight upper bound for containment of tree automata in UCQs. We finally show that the new MDL/UCQ upper bounds are tight. We establish a 2EXPTIME lower bound on the MDL/UCQ containment problem, resolving an open problem from the early 1990s.

Expresive power.

Julien Grange's PhD thesis 32 focused on the expressive power of invariant logics over sparse classes of structures. In 25, we show that the expressive power of order-invariant first-order logic collapses to first-order logic over hollow trees. A hollow tree is an unranked ordered tree where every non leaf node has at most four adjacent nodes: two siblings (left and right) and its first and last children. In particular there is no predicate for the linear order among siblings nor for the descendant relation. Moreover only the first and last nodes of a siblinghood are linked to their parent node, and the parent-child relation cannot be completely reconstructed in first-order. In 26, we study the expressive power of successor-invariant first-order logic, which is an extension of first-order logic where the usage of an additional successor relation on the structure is allowed, as long as the validity of formulas is independent on the choice of a particular successor. We show that when the degree is bounded, successor-invariant first-order logic is no more expressive than first-order logic.

7.3 Information Extraction

Information extraction consists in extracting structured data and knowledge from unstructured text or semi-structured documents.

The framework of document spanners abstracts the task of information extraction from text as a function that maps every document (a string) into a relation over the document's spans (intervals identified by their start and end indices). In 24, we embark on the investigation of document spanners that can annotate extractions with auxiliary information such as confidence, support, and confidentiality measures. To this end, we adopt the abstraction of provenance semirings. Hence, the proposed spanner extension, referred to as an annotator, maps every string into an annotated relation over the spans. We investigate key aspects of expressiveness, such as the closure under the positive RA, and key aspects of computational complexity, such as the enumeration of annotated answers and their ranked enumeration in the case of numeric semirings.

Beyond these formal approaches, we also consider a practical application of information extraction: building a knowledge base of (mathematical) results in the scientific literature. This is the goal of the TheoremKB project. In 38, we start on the task of extracting theorems and proofs from the PDF version or source of mathematical articles. In 37, we aim at building a graph of interconnected results from these extracted theorems and proofs.

One standard approach to information extraction has been to rely on human intelligence by using the power of crowd data sourcing platforms. In 13, we discuss challenges of such platforms involving humans in the loop, and more generally how such platforms can evolve to support the future of work.

7.4 Other Topics in Database Theory

Finally, we also dealt with other research problems, mostly in the field of database theory.

Register automata have been used as a convenient model for specifying and verifying database driven systems. An important problem in such systems is to provide views that hide or restructure certain information about the data or process, extending classical notions of database views. In 28 we carry out a formal investigation of views of register automata by considering simple views that project away some of the registers. We show that classical register automata are not able to describe such projections and introduce more powerful register automata that are able to do so. We also show useful properties of these automata such as closure under projection and decidability of verifying temporal properties of their runs.

Ontology-mediated query answering (OMQA) is a promising approach to data access and integration that has been actively studied in the knowledge representation and database communities for more than a decade. The vast majority of work on OMQA focuses on conjunctive queries, whereas more expressive queries that feature counting or other forms of aggregation remain largely unex-plored. In 18, we introduce a general form of counting query, relate it to previous proposals, and study the complexity of answering such queries in the presence of DL-Lite ontologies. As it follows from existing work that query answering is intractable and often of high complexity, we consider some practically relevant restrictions, for which we establish improved complexity bounds.

The program-over-monoid model of computation originates with Barrington's proof that it captures the complexity class ${NC}^{1}$ . In 27 we make progress in understanding the subtleties of the model. First, we identify a new tameness condition on a class of monoids that entails a natural characterization of the regular languages recognizable by programs over monoids from the class. Second, we prove that the class known as DA satisfies tameness and hence that the regular languages recognized by programs over monoids in DA are precisely those recognizable in the classical sense by morphisms from QDA. Third, we show by contrast that the well studied class of monoids called J is not tame. Finally, we exhibit a program-length-based hierarchy within the class of languages recognized by programs over monoids from DA.

8 Bilateral contracts and grants with industry

8.1 Bilateral contracts with industry

Numberly:

Duration: 2019–2022
Local coordinator: Olivier Cappé
Juliette Achddou's PhD research is set up as a CIFRE contract and supervision agreement between her employer, the Numberly company, and École normale supérieure.

Neo4j:

Duration: 2020–2021
Local coordinator: Leonid Libkin
A contract has been established with Neo4j, the leading company in the field of graph databases, to work towards the creation of a new standard for graph languages called GQL, building on Neo4j's Cypher query language. Leonid Libkin is chairing a working group on the formal semantics of GQL. In addition to Valda, it involves researchers from Edinburgh, Santiago, Warsaw, and other universities in Paris (UPEM, Université de Paris). This project is supported by a grant from Neo4j. Leonid Libkin is also a scientific advisor of Neo4j.

8.2 Standardization activities

Leonid Libkin is involved in the standardization process of the GQL and SQL query languages. In particular, he is a chair of the LDBC working group on semantics of GQL, and a member of ISO/IEC JTC1 SC32 WG3 (SQL committee).

9 Partnerships and cooperations

9.1 International initiatives

Informal international partners

Valda has strong collaborations with the following international groups:

Univ. Edinburgh, United Kingdom: Paolo Guagliardo, Andreas Pieris
Univ. Oxford, United Kingdom: Michael Benedikt, Dan Olteanu, and Georg Gottlob
TU Dresden, Germany: Markus Krötzsch and Sebastian Rudolph
Dortmund University, Germany: Thomas Schwentick
Bayreuth University, Germany: Wim Martens
Univ. Bergen, Norway: Ana Ozaki
Univ. Roma La Sapienza, Italy: Marco Console
Warsaw University, Poland: Mikołaj Bojańczyk and Szymon Toruńczyk
Tel Aviv University, Israel: Daniel Deutch and Tova Milo
NYU, USA: Julia Stoyanovich
Univ. California San Diego, USA: Victor Vianu
Pontifical Catholic University of Chile: Marcelo Arenas, Pablo Barceló
National University of Singapore: Stéphane Bressan

9.2 International research visitors

9.2.1 Visits of international scientists

Victor Vianu, Professor at UC San Diego and former holder of an Inria international chair, spent a few months within Valda, employed on a short-term Advanced Research Position on the HeadWork project.

9.3 European initiatives

9.3.1 Collaborations in European programs, except FP7 and H2020

A bilateral French–German ANR project, entitled EQUUS – Efficient Query answering Under UpdateS has started in 2020. It involves CNRS (CRIL, CRIStAL, IMJ), Télécom Paris, HU Berlin, and Bayreuth University, in addition to Inria Valda.

9.4 National initiatives

9.4.1 ANR

Valda has been part of four national ANR projects in 2020:

HEADWORK (2016–2021; 38 k€ for Valda, budget managed by Inria), together with IRISA (Druid, coordinator), Inria Lille (Links & Spirals), and Inria Rennes (Sumo), and two application partners: MNHN (Cesco) and FouleFactory. The topic is workflows for crowdsourcing. See http://headwork.gforge.inria.fr/.
BioQOP (2017–2021; 66 k€ for Valda, budget managed by ENS), with Idemia (coordinator) and GREYC, on the optimization of queries for privacy-aware biometric data management. See http://bioqop.di.ens.fr/.
CQFD (2018–2022; 19 k€ for Valda, budget managed by Inria), with Inria Sophia (GraphIK, coordinator), LaBRI, LIG, Inria Saclay (Cedar), IRISA, Inria Lille (Spirals), and Télécom ParisTech, on complex ontological queries over federated and heterogeneous data. See http://www.lirmm.fr/cqfd/.
QUID (2018–2022; 49 k€ for Valda, budget managed by Inria), LIGM (coordinator), IRIF, and LaBRI, on incomplete and inconsistent data. See https://quid.labri.fr/home.html.

Camille Bourgaux has been participating in the AI Chair of Meghyn Bienvenu on INTENDED (Intelligent handling of imperfect data) since 2020.

9.5 Regional initiatives

Liat Peterfreund obtained a post-doc scholarship from FSMP from 2020 to 2022.

Pierre Senellart has held a Chair of the PaRis Artificial Intelligence Research InstitutE, PRAIRIE since the fall of 2019.

10 Dissemination

10.1 Promoting scientific activities

10.1.1 Scientific events: organisation

General chair, scientific chair

Leonid Libkin, general chair of PODS 2021 and chair of the PODS Executive Committee
Luc Segoufin, chair of the steering committee of the conference series Highlights of Logic, Games and Automata
Pierre Senellart, co-organizer and chief judge of the ICPC (International Collegiate Programming Contest) Southwestern Europe 2019-2020 competition

Member of the organizing committees

Leonid Libkin, member of the SIGMOD Executive Committee.
Pierre Senellart, member of the steering committee of BDA, the French scientific community on data management.
Pierre Senellart, co-organizer and secretary of the ICPC (International Collegiate Programming Contest) Southwestern Europe 2020-2021 competition.

10.1.2 Scientific events: selection

Chair of conference program committees

Leonid Libkin, LICS 2021

Member of the conference program committees

Camille Bourgaux, AAAI 2021, DL 2020, IJCAI 2020, KR 2020, TIME 2020
Leonid Libkin, FOSSACS 2020, IJCAI 2020, KR 2020 (track chair)
Olivier Cappé, NeurIPS 2020 (area chair)
Liat Peterfreund, PODS 2021
Luc Segoufin, ICALP 2020
Pierre Senellart, BDA 2020, ICDT 2021 Test-of-Time Committee, PODS 2021
Michaël Thomazo, AAAI 2021, IJCAI 2020, RJCIA 2020

10.1.3 Journal

Member of the editorial boards

Olivier Cappé, Annals of the Institute of Statistical Mathematics
Leonid Libkin, Bulletin of Symbolic Logic
Leonid Libkin, Acta Informatica
Leonid Libkin, RAIRO Theoretical Informatics and Applications
Leonid Libkin, Journal of Applied Logic
Leonid Libkin, SN Computer Science

10.1.4 Leadership within the scientific community

Serge Abiteboul is a member of the French Academy of Sciences, of the Academia Europaea, of the scientific council of the Société Informatique de France, and an ACM Fellow.
Leonid Libkin is a Fellow of the Royal Society of Edinburgh, a member of the Academia Europaea, of the UK Computing research committee, and an ACM Fellow.
Pierre Senellart is a junior member of the Institut Universitaire de France.

10.1.5 Scientific expertise

Pierre Senellart, reviews for ANR (Flash Covid-19), FONDECYT (Chili)

10.1.6 Research administration

Olivier Cappé is a scientific deputy director of CNRS division of Information Sciences and Technologies (INS2I).
Luc Segoufin is a member of the CNHSCT of Inria.
Pierre Senellart is a member of the board of section 6 of the National Committee for Scientific Research.
Pierre Senellart is deputy director of the DI ENS laboratory, joint between ENS, CNRS, and Inria.
Pierre Senellart is a member of the board of the DIM RFSI (Réseau Francilien en Sciences Informatiques).

10.2 Teaching - Supervision - Juries

10.2.1 Teaching

Licence: Databases, 74 heqTD, L3, École normale supérieure – Pierre Senellart, Nathan Grosshans, Leonid Libkin, Michaël Thomazo
Licence: Data Structures, NYU Paris – Ashish Dandekar
Master: Data wrangling, Data privacy, 36 heqTD, M2, IASD – Leonid Libkin, Pierre Senellart
Master: Anonymization, privacy, 36 heqTD, M2, IASD – Ashish Dandekar, Pierre Senellart
Master: Knowledge graphs, description logics, reasoning on data, 72 heqTD, M2, IASD – Camille Bourgaux, Michaël Thomazo
Other: invited one-day mini course on graph data at Peking University (Beijing, online) – Leonid Libkin

Pierre Senellart has had various teaching responsibilities (L3 internships, M1 projects, M2 administration, entrance competition) at ENS. Leonid Libkin is responsible of the graduate program in computer science of PSL University, and co-responsible of the international entrance competition at ENS. Nathan Grosshans was the secretary of the entrance competition at ENS for computer science. Most members of the group are also involved in tutoring ENS students, advising them on their curriculum, their internships, etc. They are also occasionally involved with reviewing internship reports, supervising student projects, etc.

10.2.2 Supervision

PhD: Julien Grange, Successor-Invariant First-Order Logic on Classes of Bounded Degree, PSL University, 29 June 2020, Luc Segoufin
PhD in progess: Juliette Achddou, Application of reinforcement learning strategies to the context of Real-Time Bidding, started in September 2018, Olivier Cappé & Aurélien Garivier
PhD in progress: Anatole Dahan, Logical foundations of the polynomial hierarchy, started in October 2020, Arnaud Durand & Luc Segoufin
PhD in progress: Yann Ramusat, Provenance-based routing in probabilistic graphs, started in September 2018, Silviu Maniu & Pierre Senellart
PhD in progess: Yoan Russac, Sequential methods for robust decision making, started in December 2018, Olivier Cappé

10.2.3 Juries

PhD: Julien Romero [president], Institut Polytechnique de Paris, Pierre Senellart

10.3 Popularization

10.3.1 Internal or external Inria responsibilities

Serge Abiteboul is the president of the strategic committee of the Blaise Pascal foundation for scientific mediation.
Pierre Senellart is a research fellow within the CERRE (Centre on Regulation in Europe), a European think tank that produces policy papers and organize events about the regulation of network industries. He contributes in particular to reflections on the use of artificial intelligence techniques and on the interoperability of software platforms.

10.3.2 Articles and contents

Serge Abiteboul is a founding editor of the binaire blog for popularizing computer science. See https://www.lemonde.fr/blog/binaire/.
Serge Abiteboul co-edited a special issue of a magazine on the industrial heritage of information technology 16 in which he also co-wrote an article on Pictures of the digital transformation11.
Olivier Cappé co-wrote two research reports on population mobility in France during the Covid-19 pandemic 34, 33.
Pierre Senellart co-wrote a CERRE report on Making data portability more effective for the digital economy 35.

11 Scientific production

11.1 Major publications

1 inproceedings SergeS. Abiteboul, PierreP. Bourhis and VictorV. Vianu. Explanations and Transparency in Collaborative Workflows PODS 2018 - 37th ACM SIGMOD-SIGACT-SIGAI Symposium on Principles Of Database Systems Houston, Texas, United States June 2018
HAL
2 articleMichaelM. Benedikt, PierreP. Bourhis, GeorgG. Gottlob and PierreP. Senellart. Monadic Datalog, Tree Validity, and Limited Access ContainmentACM Transactions on Computational Logic2112020, 6:1-6:45
HAL DOI
3 inproceedings MeghynM. Bienvenu, QuentinQ. Manière and MichaëlM. Thomazo. Answering Counting Queries over DL-Lite Ontologies IJCAI 2020 - Twenty-Ninth International Joint Conference on Artificial Intelligence Proceedings of the Twenty-Ninth International Joint Conference on Artificial Intelligence, IJCAI 2020. Reportée de juillet 2020 à janvier 2021 en raison de la COVID Yokohama, Japan July 2020
HAL
4 inproceedingsCamilleC. Bourgaux, AnaA. Ozaki, RafaelR. Peñaloza and LiviaL. Predoiu. Provenance for the Description Logic ELHrIJCAI-PRICAI-20 - Twenty-Ninth International Joint Conference on Artificial Intelligence and Seventeenth Pacific Rim International Conference on Artificial IntelligenceProceedings of the Twenty-Ninth International Joint Conference on Artificial Intelligence, IJCAI 2020Reportée de juillet 2020 à janvier 2021 en raison de la COVIDYokohama, JapanJuly 2020, 1862-1869
HAL DOI
5 inproceedingsMarcoM. Console, PaoloP. Guagliardo, LeonidL. Libkin and EtienneE. Toussaint. Coping with Incomplete Data: Recent AdvancesSIGMOD/PODS 2020 - International Conference on Management of DataPortland / Virtual, United StatesACMJune 2020, 33-47
HAL DOI
6 article WojciechW. Kazana and LucL. Segoufin. First-order queries on classes of structures with bounded expansion Logical Methods in Computer Science 16 1 2020
HAL
7 articlePaulP. Lagrée, OlivierO. Cappé, BogdanB. Cautis and SilviuS. Maniu. Algorithms for Online Influencer MarketingACM Transactions on Knowledge Discovery from Data (TKDD)131January 2019, 1-30
HAL DOI
8 inproceedings YoanY. Russac, ClaireC. Vernade and OlivierO. Cappé. Weighted Linear Bandits for Non-Stationary Environments NeurIPS 2019 - 33rd Conference on Neural Information Processing Systems https://arxiv.org/abs/1909.09146 Vancouver, Canada December 2019
HAL
9 inproceedings NicoleN. Schweikardt, LucL. Segoufin and AlexandreA. Vigny. Enumeration for FO Queries over Nowhere Dense Graphs PODS 2018 - Principles Of Database Systems Houston, United States June 2018
HAL
10 articlePierreP. Senellart, LouisL. Jachiet, SilviuS. Maniu and YannY. Ramusat. ProvSQL: Provenance and Probability Management in PostgreSQLProceedings of the VLDB Endowment (PVLDB)1112August 2018, 2034-2037
HAL DOI

11.2 Publications of the year

International journals

11 article SergeS. Abiteboul and ClaireC. Mathieu. Pictures of the digital transformation Patrimoine industriel 2020
HAL back to text
12 article AntoineA. Amarilli, FlorentF. Capelli, MikaëlM. Monet and PierreP. Senellart. Connecting Knowledge Compilation Classes and Width Parameters Theory of Computing Systems August 2020
HAL DOI back to text
13 articleSenjutiS. Basu Roy, LeiL. Chen, AtsuyukiA. Morishima, James AbelloJ. Monedero, PierreP. Bourhis, FrançoisF. Charoy, MarinaM. Danilevsky, GautamG. Das, GianlucaG. Demartini, AbishekA. Dubey, ShadyS. Elbassuoni, DavidD. Gross-Amblard, EmilieE. Hoareau, MunenariM. Inoguchi, JaredJ. Kenworthy, ItaruI. Kitahara, DongwonD. Lee, YunyaoY. Li, Ria MaeR. Borromeo, PaoloP. Papotti, RaghavR. Rao, SudeepaS. Roy, PierreP. Senellart, KeishiK. Tajima, SaravananS. Thirumuruganathan, MarionM. Tommasi, KazutoshiK. Umemoto, AndreaA. Wiggins, KoichiroK. Yoshida and SihemS. Amer-Yahia. Making AI Machines Work for Humans in FoWSIGMOD record492December 2020, 30-35
HAL DOI back to text
14 articleMichaelM. Benedikt, PierreP. Bourhis, GeorgG. Gottlob and PierreP. Senellart. Monadic Datalog, Tree Validity, and Limited Access ContainmentACM Transactions on Computational Logic2112020, 6:1-6:45
HAL DOI back to text
15 article WojciechW. Kazana and LucL. Segoufin. First-order queries on classes of structures with bounded expansion Logical Methods in Computer Science 16 1 2020
HAL back to text

National journals

16 article SergeS. Abiteboul and FlorenceF. Hachez-Leroy. What Heritage for Information Technology ? Introduction to the journal Patrimoine industriel 73 2020
HAL back to text

International peer-reviewed conferences

17 inproceedingsMeghynM. Bienvenu and CamilleC. Bourgaux. Querying and Repairing Inconsistent Prioritized Knowledge Bases: Complexity Analysis and Links with Abstract ArgumentationKR 2020 - 17th International Conference on Principles of Knowledge Representation and ReasoningRhodes / Virtual, Greece2020, 141-151
HAL DOI back to text
18 inproceedings MeghynM. Bienvenu, QuentinQ. Manière and MichaëlM. Thomazo. Answering Counting Queries over DL-Lite Ontologies IJCAI 2020 - Twenty-Ninth International Joint Conference on Artificial Intelligence Proceedings of the Twenty-Ninth International Joint Conference on Artificial Intelligence, IJCAI 2020. Yokohama, Japan July 2020
HAL back to text
19 inproceedings CamilleC. Bourgaux, AnaA. Ozaki, RafaelR. Peñaloza and LiviaL. Predoiu. Provenance for the Description Logic ELHr (Extended Abstract) DL 2020 - 33rd International Workshop on Description Logics Rhodes / Virtual, Greece September 2020
HAL back to text
20 inproceedingsCamilleC. Bourgaux, AnaA. Ozaki, RafaelR. Peñaloza and LiviaL. Predoiu. Provenance for the Description Logic ELHrIJCAI-PRICAI-20 - Twenty-Ninth International Joint Conference on Artificial Intelligence and Seventeenth Pacific Rim International Conference on Artificial IntelligenceProceedings of the Twenty-Ninth International Joint Conference on Artificial Intelligence, IJCAI 2020Yokohama, JapanJuly 2020, 1862-1869
HAL DOI back to text
21 inproceedingsMarcoM. Console, PaoloP. Guagliardo, LeonidL. Libkin and EtienneE. Toussaint. Coping with Incomplete Data: Recent AdvancesSIGMOD/PODS 2020 - International Conference on Management of DataPortland / Virtual, United StatesMay 2020, 33-47
HAL DOI back to text
22 inproceedingsMarcoM. Console, MatthiasM. Hofer and LeonidL. Libkin. Queries with Arithmetic on Incomplete DatabasesSIGMOD/PODS 2020 : International Conference on Management of DataPortland / Virtual, United StatesJune 2020, 179-189
HAL DOI back to text
23 inproceedingsMarcoM. Console, MatthiasM. Hofer and LeonidL. Libkin. Reasoning about Measures of Unmeasurable SetsKR 2020 - 17th International Conference on Principles of Knowledge Representation and ReasoningRhodes / Virtual, GreeceSeptember 2020, 264-273
HAL DOI back to text
24 inproceedings JohannesJ. Doleschal, BennyB. Kimelfeld, WimW. Martens and LiatL. Peterfreund. Weight Annotation in Information Extraction ICDT 2020 - 23rd International Conference on Database Theory Copenhague / Virtual, Denmark https://diku-dk.github.io/edbticdt2020/ March 2020
HAL DOI back to text
25 inproceedingsJulienJ. Grange and LucL. Segoufin. Order-Invariant First-Order Logic over Hollow TreesCSL 2020 - 28th annual conference of the European Association for Computer Science Logic23Barcelona, SpainJanuary 2020, 1-23
HAL DOI back to text
26 inproceedings JulienJ. Grange. Successor-Invariant First-Order Logic on Classes of Bounded Degree LICS 2020 - Thirty-Fifth Annual ACM/IEEE Symposium on Logic in Computer Science 13 Saarbrücken / Virtual, Germany July 2020
HAL DOI back to text
27 inproceedings NathanN. Grosshans. The Power of Programs over Monoids in J LATA 2020 - 14th International Conference on Language and Automata Theory and Applications Milan, Italy March 2020
HAL DOI back to text
28 inproceedingsLucL. Segoufin and VictorV. Vianu. Projection Views of Register AutomataPODS'20: Proceedings of the 39th ACM SIGMOD-SIGACT-SIGAI Symposium on Principles of Database SystemsPortland / Virtual, United StatesJune 2020, 299-313
HAL DOI back to text
29 inproceedingsEtienneE. Toussaint, PaoloP. Guagliardo and LeonidL. Libkin. Knowledge-Preserving Certain Answers for SQL-like QueriesKR 2020 - 17th International Conference on Principles of Knowledge Representation and ReasoningRhodes / Virtual, GreeceSeptember 2020, 758-767
HAL DOI back to text

National peer-reviewed Conferences

30 inproceedings AshishA. Dandekar, DebabrotaD. Basu, PierreP. Senellart and StéphaneS. Bressan. Confidentialité différentielle à risque : Relier les sources d’aléa et un budget de confidentialité BDA 2020 - 36ème Conférence sur la Gestion de Données – Principes, Technologies et Applications Paris / Virtuel, France October 2020
HAL
31 inproceedings YannY. Ramusat, SilviuS. Maniu and PierreP. Senellart. Algorithmes à base de provenance pour des requêtes enrichies sur les bases de données graphes BDA 2020 - 36ème Conférence sur la Gestion de Données – Principes, Technologies et Applications Paris / Virtuel, France October 2020
HAL

Conferences without proceedings

Doctoral dissertations and habilitation theses

32 thesis JulienJ. Grange. On the Expressive Power of Invariant Logics over Sparse Classes of Structures ENS Paris June 2020
HAL back to text

Reports & preprints

33 report JamalJ. Atif, BertrandB. Cabot, OlivierO. Cappé, OlgaO. Mula and RafaelR. Pinot. Initiative face au virus. Regards croisés sur l'épidemie de Covid-19 apportés par les données sanitaires et de géolocalisation (mars à octobre 2020) Université PSL; Inria; CNRS December 2020
HAL back to text
34 report JamalJ. Atif, OlivierO. Cappé, AkinA. Kazakçi, YannickY. Léo, LaurentL. Massoulié and OlgaO. Mula. Initiative face au virus Observations sur la mobilité pendant l'épidémie de Covid-19 Université PSL May 2020
HAL back to text
35 report JanJ. Krämer, PierreP. Senellart and AlexandreA. de Streel. Making data portability more effective for the digital economy CERRE June 2020
HAL back to text
36 misc YoanY. Russac, OlivierO. Cappé and AurélienA. Garivier. Algorithms for Non-Stationary Generalized Linear Bandits March 2020
HAL

Other scientific publications

37 thesis TheoT. Delemazure. A Knowledge Base of Mathematical Results Ecole Normale Supérieure (ENS) September 2020
HAL back to text
38 thesis LucasL. Pluvinage. Extracting scientific results from research articles Ecole Normale Supérieure (ENS) September 2020
HAL back to text

11.3 Cited publications

39 articleSergeS. Abiteboul, BenjaminB. André and DanielD. Kaplan. Managing your digital lifeCommun. ACM5852015, 32-35URL: http://doi.acm.org/10.1145/2670528
DOI back to text
40 articleSergeS. Abiteboul, PierreP. Bourhis and VictorV. Vianu. Comparing workflow specification languages: A matter of viewsACM Trans. Database Syst.3722012, 10:1-10:59URL: http://doi.acm.org/10.1145/2188349.2188352
DOI back to text
41 book SergeS. Abiteboul, PeterP. Buneman and DanD. Suciu. Data on the Web: From Relations to Semistructured Data and XML Morgan Kaufmann 1999
back to text
42 inproceedingsSergeS. Abiteboul, LaurentL. Herr and Jan VanJ. den Bussche. Temporal Versus First-Order Logic to Query Temporal DatabasesProceedings of the Fifteenth ACM SIGACT-SIGMOD-SIGART Symposium on Principles of Database Systems, June 3-5, 1996, Montreal, Canada1996, 49-57URL: http://doi.acm.org/10.1145/237661.237674
DOI back to text
43 bookSergeS. Abiteboul, RichardR. Hull and VictorV. Vianu. Foundations of DatabasesAddison-Wesley1995, URL: http://webdam.inria.fr/Alice/
back to text back to text back to text back to text
44 bookSergeS. Abiteboul, IoanaI. Manolescu, PhilippeP. Rigaux, Marie-ChristineM.-C. Rousset and PierreP. Senellart. Web Data ManagementCambridge University Press2011, URL: http://webdam.inria.fr/Jorge
back to text back to text
45 inproceedingsAntoineA. Amarilli, PierreP. Bourhis and PierreP. Senellart. Provenance Circuits for Trees and Treelike InstancesAutomata, Languages, and Programming - 42nd International Colloquium, ICALP 2015, Kyoto, Japan, July 6-10, 2015, Proceedings, Part II2015, 56-68URL: https://doi.org/10.1007/978-3-662-47666-6_5
DOI back to text
46 inproceedingsAntoineA. Amarilli, PierreP. Bourhis and PierreP. Senellart. Tractable Lineages on Treelike Instances: Limits and ExtensionsProceedings of the 35th ACM SIGMOD-SIGACT-SIGAI Symposium on Principles of Database Systems, PODS 2016, San Francisco, CA, USA, June 26 - July 01, 20162016, 355-370URL: http://doi.acm.org/10.1145/2902251.2902301
DOI back to text back to text
47 articleYaelY. Amsterdamer, YaelY. Grossman, TovaT. Milo and PierreP. Senellart. CrowdMiner: Mining association rules from the crowdPVLDB6122013, 1250-1253URL: http://www.vldb.org/pvldb/vol6/p1250-amsterdamer.pdf
back to text back to text
48 inproceedingsPablo BarcelóP. Baeza. Querying graph databasesProceedings of the 32nd ACM SIGMOD-SIGACT-SIGART Symposium on Principles of Database Systems, PODS 2013, New York, NY, USA - June 22 - 27, 20132013, 175-188URL: http://doi.acm.org/10.1145/2463664.2465216
DOI back to text
49 articleDanielD. Barbará, HectorH. Garcia-Molina and DarylD. Porter. The Management of Probabilistic DataIEEE Trans. Knowl. Data Eng.451992, 487-502URL: https://doi.org/10.1109/69.166990
DOI back to text
50 articleDebabrotaD. Basu, QianQ. Lin, WeidongW. Chen, Hoang TamH. Vo, ZihongZ. Yuan, PierreP. Senellart and StéphaneS. Bressan. Regularized Cost-Model Oblivious Database Tuning with Reinforcement LearningT. Large-Scale Data- and Knowledge-Centered Systems282016, 96-132URL: https://doi.org/10.1007/978-3-662-53455-7_5
DOI back to text
51 inproceedingsMichaelM. Benedikt, GeorgG. Gottlob and PierreP. Senellart. Determining relevance of accesses at runtimeProceedings of the 30th ACM SIGMOD-SIGACT-SIGART Symposium on Principles of Database Systems, PODS 2011, June 12-16, 2011, Athens, Greece2011, 211-222URL: http://doi.acm.org/10.1145/1989284.1989309
DOI back to text
52 incollectionMichaelM. Benedikt and PierreP. Senellart. DatabasesComputer Science, The Hardware, Software and Heart of ItSpringer2011, 169-229URL: https://doi.org/10.1007/978-1-4614-1168-0_10
DOI back to text back to text
53 inproceedingsMeghynM. Bienvenu, DanielD. Deutch, DavideD. Martinenghi, PierreP. Senellart and Fabian M.F. Suchanek. Dealing with the Deep Web and all its QuirksProceedings of the Second International Workshop on Searching and Integrating New Web Data Sources, Istanbul, Turkey, August 31, 20122012, 21-24URL: http://ceur-ws.org/Vol-884/VLDS2012_p21_Bienvenu.pdf
back to text
54 inproceedingsMiko\lajM. Bojańczyk, LucL. Segoufin and SzymonS. Toruńczyk. Verification of database-driven systems via amalgamationProceedings of the 32nd ACM SIGMOD-SIGACT-SIGART Symposium on Principles of Database Systems, PODS 2013, New York, NY, USA - June 22 - 27, 20132013, 63-74URL: http://doi.acm.org/10.1145/2463664.2465228
DOI back to text
55 inproceedingsPeterP. Buneman, SanjeevS. Khanna and Wang-ChiewW.-C. Tan. Why and Where: A Characterization of Data ProvenanceDatabase Theory - ICDT 2001, 8th International Conference, London, UK, January 4-6, 2001, Proceedings.2001, 316-330URL: https://doi.org/10.1007/3-540-44503-X_20
DOI back to text
56 articleBrunoB. Courcelle. The Monadic Second-Order Logic of Graphs. I. Recognizable Sets of Finite GraphsInf. Comput.8511990, 12-75URL: https://doi.org/10.1016/0890-5401(90)90043-H
DOI back to text
57 articleNilesh N.N. Dalvi and DanD. Suciu. The dichotomy of probabilistic inference for unions of conjunctive queriesJ. ACM5962012, 30:1-30:87URL: http://doi.acm.org/10.1145/2395116.2395119
DOI back to text
58 articleAmolA. Deshpande, Zachary G.Z. Ives and VijayshankarV. Raman. Adaptive Query ProcessingFoundations and Trends in Databases112007, 1-140URL: https://doi.org/10.1561/1900000001
DOI back to text
59 inproceedingsPinarP. Donmez and Jaime G.J. Carbonell. Proactive learning: cost-sensitive active learning with multiple imperfect oraclesProceedings of the 17th ACM Conference on Information and Knowledge Management, CIKM 2008, Napa Valley, California, USA, October 26-30, 20082008, 619-628URL: http://doi.acm.org/10.1145/1458082.1458165
DOI back to text
60 inproceedingsMuhammadM. Faheem and PierreP. Senellart. Adaptive Web Crawling Through Structure-Based Link ClassificationDigital Libraries: Providing Quality Information - 17th International Conference on Asia-Pacific Digital Libraries, ICADL 2015, Seoul, Korea, December 9-12, 2015, Proceedings2015, 39-51URL: https://doi.org/10.1007/978-3-319-27974-9_5
DOI back to text
61 book LiseL. Getoor. Introduction to statistical relational learning MIT Press 2007
back to text
62 inproceedingsGeorgesG. Gouriten, SilviuS. Maniu and PierreP. Senellart. Scalable, generic, and adaptive systems for focused crawling25th ACM Conference on Hypertext and Social Media, HT '14, Santiago, Chile, September 1-4, 20142014, 35-45URL: http://doi.acm.org/10.1145/2631775.2631795
DOI back to text
63 inproceedingsTodd J.T. Green, GregoryG. Karvounarakis and ValV. Tannen. Provenance semiringsProceedings of the Twenty-Sixth ACM SIGACT-SIGMOD-SIGART Symposium on Principles of Database Systems, June 11-13, 2007, Beijing, China2007, 31-40URL: http://doi.acm.org/10.1145/1265530.1265535
DOI back to text
64 articleTodd J.T. Green and ValV. Tannen. Models for Incomplete and Probabilistic InformationIEEE Data Eng. Bull.2912006, 17-24URL: http://sites.computer.org/debull/A06mar/green.ps
back to text
65 articleAlon Y.A. Halevy. Answering queries using views: A surveyVLDB J.1042001, 270-294URL: https://doi.org/10.1007/s007780100054
DOI back to text
66 articleMarti A.M. Hearst, Susan TS. Dumais, EdgarE. Osuna, JohnJ. Platt and BernhardB. Scholkopf. Support vector machinesIEEE Intelligent Systems1341998, 18-28URL: https://doi.org/10.1109/5254.708428
DOI back to text
67 articleTomaszT. Imielinski and WitoldW. Lipski Jr.. Incomplete Information in Relational DatabasesJ. ACM3141984, 761-791URL: http://doi.acm.org/10.1145/1634.1886
DOI back to text
68 articleFlorentF. Jacquemard, LucL. Segoufin and JerémieJ. Dimino. FO2(<, +1, ) on data trees, data tree automata and branching vector addition systems'Logical Methods in Computer Science1222016, URL: https://doi.org/10.2168/LMCS-12(2:3)2016
DOI back to text
69 incollectionBennyB. Kimelfeld and PierreP. Senellart. Probabilistic XML: Models and ComplexityAdvances in Probabilistic Databases for Uncertain Information ManagementSpringer2013, 39-66URL: https://doi.org/10.1007/978-3-642-37509-5_3
DOI back to text
70 articleAnthony C.A. Klug. Equivalence of Relational Algebra and Relational Calculus Query Languages Having Aggregate FunctionsJ. ACM2931982, 699-717URL: http://doi.acm.org/10.1145/322326.322332
DOI back to text
71 articleDonaldD. Kossmann. The State of the art in distributed query processingACM Comput. Surv.3242000, 422-469URL: http://doi.acm.org/10.1145/371578.371598
DOI back to text
72 inproceedingsJohn D.J. Lafferty, AndrewA. McCallum and Fernando C. N.F. Pereira. Conditional Random Fields: Probabilistic Models for Segmenting and Labeling Sequence DataProceedings of the Eighteenth International Conference on Machine Learning (ICML 2001), Williams College, Williamstown, MA, USA, June 28 - July 1, 20012001, 282-289
back to text
73 inproceedingsSiyuS. Lei, SilviuS. Maniu, LuyiL. Mo, ReynoldR. Cheng and PierreP. Senellart. Online Influence MaximizationProceedings of the 21th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, Sydney, NSW, Australia, August 10-13, 20152015, 645-654URL: http://doi.acm.org/10.1145/2783258.2783271
DOI back to text
74 articleFrankF. Neven. Automata Theory for XML ResearchersSIGMOD Record3132002, 39-46URL: http://doi.acm.org/10.1145/601858.601869
DOI back to text
75 bookM. TamerM. Özsu and PatrickP. Valduriez. Principles of Distributed Database Systems, Third EditionSpringer2011, URL: https://doi.org/10.1007/978-1-4419-8834-8
DOI back to text
76 inproceedingsPierreP. Senellart, AvinA. Mittal, DanielD. Muschick, RémiR. Gilleron and MarcM. Tommasi. Automatic wrapper induction from hidden-web sources with domain knowledge10th ACM International Workshop on Web Information and Data Management (WIDM 2008), Napa Valley, California, USA, October 30, 20082008, 9-16URL: http://doi.acm.org/10.1145/1458502.1458505
DOI back to text back to text back to text
77 bookBurrB. Settles. Active LearningSynthesis Lectures on Artificial Intelligence and Machine LearningMorgan & Claypool Publishers2012, URL: https://doi.org/10.2200/S00429ED1V01Y201207AIM018
DOI back to text back to text
78 inproceedingsBurrB. Settles, MarkM. Craven and LewisL. Friedland. Active learning with real annotation costsNIPS 2008 Workshop on Cost-Sensitive Learning2008, URL: http://burrsettles.com/pub/settles.nips08ws.pdf
back to text
79 articleFabian M.F. Suchanek, SergeS. Abiteboul and PierreP. Senellart. PARIS: Probabilistic Alignment of Relations, Instances, and SchemaPVLDB532011, 157-168URL: http://www.vldb.org/pvldb/vol5/p157_fabianmsuchanek_vldb2012.pdf
back to text back to text
80 bookDanD. Suciu, DanD. Olteanu, ChristopherC. Ré and ChristophC. Koch. Probabilistic DatabasesSynthesis Lectures on Data ManagementMorgan & Claypool Publishers2011, URL: https://doi.org/10.2200/S00362ED1V01Y201105DTM016
DOI back to text
81 bookRichard S.R. Sutton and Andrew G.A. Barto. Reinforcement learning - an introductionAdaptive computation and machine learningMIT Press1998, URL: http://www.worldcat.org/oclc/37293240
back to text back to text
82 inproceedingsMoshe Y.M. Vardi. The Complexity of Relational Query Languages (Extended Abstract)Proceedings of the 14th Annual ACM Symposium on Theory of Computing, May 5-7, 1982, San Francisco, California, USA1982, 137-146URL: http://doi.acm.org/10.1145/800070.802186
DOI back to text
83 inproceedingsKeK. Zhou, MouniaM. Lalmas, TetsuyaT. Sakai, RonanR. Cummins and Joemon M.J. Jose. On the reliability and intuitiveness of aggregated search metrics22nd ACM International Conference on Information and Knowledge Management, CIKM'13, San Francisco, CA, USA, October 27 - November 1, 20132013, 689-698URL: http://doi.acm.org/10.1145/2505515.2505691
DOI back to text

VALDA - 2020

VALDA - 2020

Keywords

Computer Science and Digital Science

Other Research Topics and Application Domains

1 Team members, visitors, external collaborators

Research Scientists

Faculty Members

Post-Doctoral Fellows

PhD Students

Interns and Apprentices

Administrative Assistant

External Collaborator

2 Overall objectives

2.1 Objectives

2.2 The Issues

3 Research program

3.1 Scientific Foundations

Complexity & Logic

Automata Theory

Verification

Workflows

Probability & Provenance

Machine Learning

3.2 Research Directions

4 Application domains

4.1 Personal Information Management Systems

4.2 Web Data

5 Highlights of the year

5.1 Awards

6 New software and platforms

6.1 New software

6.1.1 ProvSQL

6.1.2 apxproof

6.1.3 TheoremKB

7 New results

7.1 Incompleteness, Uncertainty, and Provenance of Data

Incomplete data.

Missing data.

Inconsistent data.

Data provenance.

7.2 Query Languages over Restricted Structures

Complexity.

Expresive power.

7.3 Information Extraction

7.4 Other Topics in Database Theory

8 Bilateral contracts and grants with industry

8.1 Bilateral contracts with industry

8.2 Standardization activities

9 Partnerships and cooperations

9.1 International initiatives

Informal international partners

9.2 International research visitors

9.2.1 Visits of international scientists

9.3 European initiatives

9.3.1 Collaborations in European programs, except FP7 and H2020

9.4 National initiatives

9.4.1 ANR

9.5 Regional initiatives

10 Dissemination

10.1 Promoting scientific activities

10.1.1 Scientific events: organisation

General chair, scientific chair

Member of the organizing committees

10.1.2 Scientific events: selection

Chair of conference program committees

Member of the conference program committees

10.1.3 Journal

Member of the editorial boards

10.1.4 Leadership within the scientific community

10.1.5 Scientific expertise

10.1.6 Research administration

10.2 Teaching - Supervision - Juries

10.2.1 Teaching

10.2.2 Supervision

10.2.3 Juries

10.3 Popularization

10.3.1 Internal or external Inria responsibilities

10.3.2 Articles and contents

11 Scientific production