Managing your digital life

Valda Value from Data

Data and Knowledge Representation and Processing

Perception, Cognition and Interaction

http://www.di.ens.fr/ValdaTeam.html.en

Creation of the Team: 2016 December 01, updated into Project-Team: 2018 January 01

Team

A3.1.1. - Modeling, representation
A3.1.2. - Data management, querying and storage
A3.1.3. - Distributed data
A3.1.4. - Uncertain data
A3.1.5. - Control access, privacy
A3.1.9. - Database
A3.2.2. - Knowledge extraction, cleaning
A3.2.3. - Inference
A3.3.2. - Data mining
A3.4.3. - Reinforcement learning
A3.4.5. - Bayesian methods
A3.5.1. - Analysis of large graphs
A4.7. - Access control
A7.2. - Logic in Computer Science
A9.1. - Knowledge
B6.3.1. - Web
B6.3.4. - Social Networks
B6.5. - Information systems
B9.5.5. - Sociology
B9.5.10. - Digital humanities
B9.7.2. - Open data
B9.8. - Privacy
B9.10. - Ethics

Inria teams are typically groups of researchers working on the definition of a common project and objectives, with the goal of creating a project-team. Such project-teams may include other partners (universities or research institutions).

Valda integrated members of the Inria DAHU project-team in 2017. Their relevant activity is included in this activity report.

Pierre Senellart (Faculty member, Paris): Team leader, École normale supérieure, Professor, HDR: yes

Serge Abiteboul (Researcher, Paris): Inria, Senior Researcher, HDR: yes

Luc Segoufin (Researcher, Paris): Inria, Senior Researcher, from Sep 2017, HDR: yes

Julien Grange (PhD student, Paris): École normale supérieure, from Sep 2017

Miyoung Han (PhD student, Paris): Institut Telecom ex GET Groupe des Ecoles des Télécommunications

Quentin Lobbe (PhD student, Paris): Institut Telecom ex GET Groupe des Ecoles des Télécommunications

Mikael Monet (PhD student, Paris): Institut Telecom ex GET Groupe des Ecoles des Télécommunications

David Montoya (PhD student, Paris): ENGIE, until March 2017

Karima Rafes (PhD student, Paris): BorderCloud

Alexandre Vigny (PhD student, Paris): Univ Denis Diderot, from Sep 2017

Su Yang (PhD student, Paris): Ecole Nationale Supérieure des Mines de Paris, until Apr 2017

Yann Ramusat (Intern, Paris): Inria, from Mar 2017 until Jul 2017

Victor Vianu (Visitor, Paris): UCSD & École normale supérieure, from Jul 2017

Yann Ramusat (External collaborator, Paris): Student at École normale supérieure on a long-term project, from Sep 2017

Linday Polienor (Assistant, Paris): until June 2017

Sandrine Vergès (Assistant, Paris): from September 2017

Overall Objectives

Objectives

Valda's focus is on both foundational and systems aspects of complex data management, especially human-centric data. The data we are interested in is typically heterogeneous, massively distributed, rapidly evolving, intensional, and often subjective, possibly erroneous, imprecise, incomplete. In this setting, Valda is in particular concerned with the optimization of complex resources such as computer time and space, communication, monetary, and privacy budgets. The goal is to extract value from data, beyond simple query answering.

Data management is now an old, well-established field, for which many scientific results and techniques have been accumulated since the sixties. Originally, most works dealt with static, homogeneous, and precise data. Later, works were devoted to heterogeneous data, possibly distributed but at a small scale.

However, these classical techniques are poorly adapted to handle the new challenges of data management. Consider human-centric data, which is either produced by humans, e.g., emails, chats, recommendations, or produced by systems when dealing with humans, e.g., geolocation, business transactions, results of data analysis. When dealing with such data, and to accomplish any task to extract value from such data, we rapidly encounter the following facets:

Heterogeneity: data may come in many different structures such as unstructured text, graphs, data streams, complex aggregates, etc., using many different schemas or ontologies.

Massive distribution: data may come from a large number of autonomous sources distributed over the web, with complex access patterns.

Rapid evolution: many sources may be producing data in real time, even if little of it is perhaps relevant to the specific application. Typically, recent data is of particular interest and changes have to be monitored.

Intensionality: in a classical database, all the data is available. In modern applications, the data is more and more available only intensionally, possibly at some cost, with the difficulty of discovering which source can contribute towards a particular goal, and this with some uncertainty. (We use the spelling intensional, as in mathematical logic and philosophy, to describe something that is neither available nor defined in extension; intensional is derived from intension, while intentional is derived from intent.)

Confidentiality and security: some personal data is critical and needs to remain confidential. Applications manipulating personal data must take this into account and must be secure against linking.

Uncertainty: modern data, and in particular human-centric data, typically includes errors, contradictions, imprecision, incompleteness, which complicates reasoning. Furthermore, the subjective nature of the data, with opinions, sentiments, or biases, also makes reasoning harder since one has, for instance, to consider different agents with distinct, possibly contradicting knowledge.

These problems have already been studied individually and have led to techniques such as query rewriting or distributed query optimization.

Among all these aspects, intensionality is perhaps the one that has been studied the least, so we will pay particular attention to it. Consider a user's query, taken in a very broad sense: it may be a classical database query, some information retrieval search, a clustering or classification task, or some more advanced knowledge extraction request. Because of the intensionality of data, solving such a query is typically a dynamic task: each time new data is obtained, the partial knowledge a system has of the world is revised, and query plans need to be updated, as in adaptive query processing or aggregated search. The system then needs to decide, based on this partial knowledge, on the best next access to perform. This is reminiscent of the central problem of reinforcement learning (train an agent to accomplish a task in a partially known world based on rewards obtained) and of active learning (decide which action to perform next in order to optimize a learning strategy), and we intend to explore this connection further.
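To make the connection with reinforcement learning more concrete, here is a minimal sketch, under stated assumptions and not the team's method: an epsilon-greedy strategy that picks the next source to access from the rewards observed so far. The names choose_next_source, record_access and run, the reward callables, and the parameter epsilon are all hypothetical and introduced only for illustration.

```python
import random


def choose_next_source(stats, epsilon, rng):
    """Pick the next source to access.

    stats maps a source name to (number_of_accesses, total_reward).
    Untried sources are accessed first; afterwards, with probability
    epsilon a random source is explored, otherwise the source with the
    best average reward so far is exploited.
    """
    names = sorted(stats)
    untried = [s for s in names if stats[s][0] == 0]
    if untried:
        return untried[0]
    if rng.random() < epsilon:
        return rng.choice(names)
    return max(names, key=lambda s: stats[s][1] / stats[s][0])


def record_access(stats, source, reward):
    """Revise the partial knowledge after one access."""
    accesses, total = stats[source]
    stats[source] = (accesses + 1, total + reward)


def run(sources, budget, epsilon=0.1, seed=0):
    """Perform `budget` accesses.

    sources maps a name to a callable taking a random generator and
    returning the (hypothetical) reward of one access to that source.
    """
    rng = random.Random(seed)
    stats = {name: (0, 0.0) for name in sources}
    for _ in range(budget):
        chosen = choose_next_source(stats, epsilon, rng)
        record_access(stats, chosen, sources[chosen](rng))
    return stats
```

With epsilon set to 0 the strategy never explores after the initial round, which is the purely greedy behavior; a realistic setting would also need a cost per access, which this sketch leaves out.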

Uncertainty of the data interacts with its intensionality: efforts are required to obtain more precise, more complete, sounder results, which yields a trade-off between processing cost and data quality.

Other aspects, such as heterogeneity and massive distribution, are of major importance as well. A standard data management task, such as query answering, information retrieval, or clustering, may become much more challenging when taking into account the fact that data is not available in a central location, or in a common format. We aim to take these aspects into account, to be able to apply our research to real-world applications.

The Issues

We intend to tackle hard technical issues such as query answering, data integration, data monitoring, verification of data-centric systems, truth finding, knowledge extraction, data analytics, that take a different flavor in this modern context. In particular, we are interested in designing strategies to minimize data access cost towards a specific goal, possibly a massive data analysis task. That cost may be in terms of communication (accessing data in distributed systems, on the Web), of computational resources (when data is produced by complex tools such as information extraction, machine learning systems, or complex query processing), of monetary budget (paid-for application programming interfaces, crowdsourcing platforms), or of a privacy budget (as in the standard framework of differential privacy).
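As an illustration of what a privacy budget can look like in practice, the sketch below answers counting queries with the Laplace mechanism of differential privacy and tracks the budget by simple sequential composition (the epsilons of successive queries add up). The class PrivacyBudget and the function noisy_count are hypothetical names; the sketch assumes counting queries of sensitivity 1 and is not meant as a hardened implementation.

```python
import random


class PrivacyBudget:
    """Tracks the remaining epsilon under sequential composition."""

    def __init__(self, total_epsilon):
        self.remaining = total_epsilon

    def spend(self, epsilon):
        if epsilon <= 0:
            raise ValueError("epsilon must be positive")
        if epsilon > self.remaining + 1e-12:
            raise RuntimeError("privacy budget exhausted")
        self.remaining -= epsilon


def noisy_count(rows, predicate, epsilon, budget, rng):
    """Count the rows satisfying predicate, with Laplace noise.

    A counting query has sensitivity 1, so the noise scale is 1/epsilon.
    The difference of two independent exponential variables of rate
    epsilon follows a Laplace distribution of scale 1/epsilon.
    """
    budget.spend(epsilon)
    true_count = sum(1 for row in rows if predicate(row))
    noise = rng.expovariate(epsilon) - rng.expovariate(epsilon)
    return true_count + noise
```

Each call consumes part of the budget before touching the data, so a query that would exceed the budget fails without revealing anything.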

A number of data management tasks in Valda are inherently intractable. In addition to properly characterizing this intractability in terms of complexity theory, we intend to develop solutions for solving these tasks in practice, based on approximation strategies, randomized algorithms, enumeration algorithms with constant delay, or identification of restricted forms of data instances lowering the complexity of the task.

Research Program

Scientific Foundations

We now detail some of the scientific foundations of our research on complex data management. This is the occasion to review the connections between data management, especially on complex data as is the focus of Valda, and related research areas.

Complexity & Logic.

Data management has been connected to logic since the advent of the relational model as the main representation system for real-world data, and of first-order logic as the logical core of database querying languages. Since these early developments, logic has also been successfully used to capture a large variety of query modes, such as data aggregation, recursive queries (Datalog), or querying of XML databases. Logical formalisms facilitate reasoning about the expressiveness of a query language or about its complexity.

The main problem of interest in data management is that of query evaluation, i.e., computing the results of a query over a database. The complexity of this problem has far-reaching consequences. For example, it is because the evaluation of first-order queries is in the $\mathrm{AC}^0$ complexity class (in data complexity) that evaluation of SQL queries can be parallelized efficiently. It is usual in data management to distinguish data complexity, where the query is considered to be fixed, from combined complexity, where both the query and the data are considered to be part of the input. Thus, though conjunctive queries, corresponding to a simple SELECT-FROM-WHERE fragment of SQL, have PTIME data complexity, they are NP-hard in combined complexity. Making this distinction is important, because data is often far larger (up to the order of terabytes) than queries (rarely more than a few hundred bytes). Beyond simple query evaluation, a central question in data management remains that of complexity; tools from algorithm analysis and complexity theory can be used to pinpoint the tractability frontier of data management tasks.
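The gap between data complexity and combined complexity can be seen on a naive evaluator for conjunctive queries. The sketch below is an assumption-laden illustration (the function evaluate_cq, the "?x" convention for variables, and the dictionary encoding of the database are all hypothetical): it tries every fact for every atom, so its running time is roughly the database size raised to the number of atoms, which is polynomial when the query is fixed and exponential when the query is part of the input.

```python
def is_var(term):
    """Variables are strings starting with '?'; anything else is a constant."""
    return isinstance(term, str) and term.startswith("?")


def evaluate_cq(head, atoms, db):
    """Naive backtracking evaluation of a conjunctive query.

    head  -- tuple of variables to output
    atoms -- list of (relation_name, tuple_of_terms)
    db    -- dict mapping a relation name to a set of tuples
    """
    results = set()

    def extend(i, binding):
        if i == len(atoms):
            results.add(tuple(binding[v] for v in head))
            return
        relation, terms = atoms[i]
        for fact in db.get(relation, ()):
            if len(fact) != len(terms):
                continue
            new_binding = dict(binding)
            consistent = True
            for term, value in zip(terms, fact):
                if is_var(term):
                    # Bind the variable, or check an earlier binding.
                    if new_binding.setdefault(term, value) != value:
                        consistent = False
                        break
                elif term != value:
                    consistent = False
                    break
            if consistent:
                extend(i + 1, new_binding)

    extend(0, {})
    return results
```

For example, the triangle query with atoms E(?x, ?y), E(?y, ?z), E(?z, ?x) over an edge relation E is evaluated in time cubic in the number of edges by this procedure.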

Automata Theory.

Automata theory and formal languages arise as important components of the study of many data management tasks: in temporal databases, queries, expressed in temporal logics, can often be compiled to automata; in graph databases, queries are naturally given as automata; typical query and schema languages for XML databases such as XPath and XML Schema can be compiled to tree automata, or for more complex languages to data tree automata. Another reason for the importance of automata theory, and tree automata in particular, comes from Courcelle's results, which show that very expressive queries (expressed in monadic second-order logic) can be evaluated as tree automata over tree decompositions of the original databases, yielding linear-time algorithms (in data complexity) for a wide variety of applications.
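As a small illustration of queries given as automata over graph databases, the following sketch evaluates a regular path query by exploring the product of an edge-labeled graph with a nondeterministic finite automaton. The function rpq_reachable and the dictionary encoding of the automaton (keys "initial", "final", "delta") are hypothetical choices made for this example.

```python
from collections import deque


def rpq_reachable(edges, nfa, source):
    """Nodes reachable from source by a path whose labels the NFA accepts.

    edges -- iterable of (node, label, node) triples
    nfa   -- dict with "initial" and "final" (sets of states) and "delta",
             mapping (state, label) to a set of successor states
    """
    outgoing = {}
    for u, label, v in edges:
        outgoing.setdefault(u, []).append((label, v))

    # Breadth-first search over (graph node, automaton state) pairs.
    start = {(source, q) for q in nfa["initial"]}
    seen = set(start)
    queue = deque(start)
    answers = set()
    while queue:
        node, state = queue.popleft()
        if state in nfa["final"]:
            answers.add(node)
        for label, successor in outgoing.get(node, ()):
            for next_state in nfa["delta"].get((state, label), ()):
                pair = (successor, next_state)
                if pair not in seen:
                    seen.add(pair)
                    queue.append(pair)
    return answers
```

Each pair is visited at most once, so the exploration is bounded by the size of the product of the graph and the automaton.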

Verification.

Complex data management also has connections to verification and static analysis. Besides query evaluation, a central problem in data management is that of deciding whether two queries are equivalent. This is critical for query optimization, in order to determine if the rewriting of a query, maybe cheaper to evaluate, will return the same result as the original query. Equivalence can easily be seen to be an instance of the problem of (non-)satisfiability: $q \equiv q'$ if and only if $(q \land \neg q') \lor (\neg q \land q')$ is not satisfiable. In other words, some aspects of query optimization are static analysis issues. Verification is also a critical part of any database application where it is important to ensure that some property will never (or always) arise.

Workflows.

The orchestration of distributed activities (under the responsibility of a conductor) and their choreography (when they are fully autonomous) are complex issues that are essential for a wide range of data management applications including, notably, e-commerce systems, business processes, health-care and scientific workflows. The difficulty is to guarantee consistency or, more generally, quality of service, and to statically verify critical properties of the system. Different approaches to workflow specifications exist: automata-based, logic-based, or predicate-based control of function calls.

Probability & Provenance.

To deal with the uncertainty attached to data, proper models need to be used (such as attaching provenance information to data items and viewing the whole database as being probabilistic) and practical methods and systems need to be developed to both reliably estimate the uncertainty in data items and properly manage provenance and uncertainty information throughout a long, complex system.

The simplest model of data uncertainty is the NULLs of SQL databases, also called Codd tables. This representation system is too basic for any complex task, and has the major drawback of not being closed under even simple queries or updates. A solution to this has been proposed in the form of conditional tables, where every tuple is annotated with a Boolean formula over independent Boolean random events. This model has been recognized as foundational and extended in two different directions: to more expressive models of provenance than what Boolean functions capture, through a semiring formalism, and to a probabilistic formalism by assigning independent probabilities to the Boolean events. These two extensions form the basis of modern provenance and probability management, largely subsuming previous works. Research in the past ten years has focused on a better understanding of the tractability of query answering with provenance and probabilistic annotations, in a variety of specializations of this framework.
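The semiring formalism can be sketched in a few lines: tuples carry annotations from a semiring, joins multiply annotations, and projections add them. The code below is an illustrative sketch only, for the single hypothetical query Q(x) :- R(x, y), S(y); the Semiring tuple and the three instances COUNTING, BOOLEAN and TROPICAL are names introduced here, not an existing API.

```python
from collections import namedtuple

Semiring = namedtuple("Semiring", "zero one plus times")

# Three classical instances: number of derivations, plain existence,
# and minimal cost of a derivation.
COUNTING = Semiring(0, 1, lambda a, b: a + b, lambda a, b: a * b)
BOOLEAN = Semiring(False, True, lambda a, b: a or b, lambda a, b: a and b)
TROPICAL = Semiring(float("inf"), 0, min, lambda a, b: a + b)


def join_project(r, s, semiring):
    """Annotated evaluation of Q(x) :- R(x, y), S(y).

    r maps pairs (x, y) to annotations and s maps 1-tuples (y,) to
    annotations. Joining two tuples multiplies their annotations;
    projecting y away adds up the annotations of all derivations.
    """
    result = {}
    for (x, y), annotation_r in r.items():
        if (y,) in s:
            product = semiring.times(annotation_r, s[(y,)])
            current = result.get((x,), semiring.zero)
            result[(x,)] = semiring.plus(current, product)
    return result
```

The same query code yields multiplicities, truth values, or cheapest derivations depending only on the semiring passed in, which is the point of the formalism.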

Machine Learning.

Statistical machine learning, and its applications to data mining and data analytics, is a major foundation of data management research. A large variety of research areas in complex data management, such as wrapper induction, crowdsourcing, focused crawling, or automatic database tuning, critically rely on machine learning techniques, such as classification, probabilistic models, or reinforcement learning.

Machine learning is also a rich source of complex data management problems: thus, the probabilities produced by a conditional random field system result in probabilistic annotations that need to be properly modeled, stored, and queried.

Finally, complex data management also brings new twists to some classical machine learning problems. Consider for instance the area of active learning, a subfield of machine learning concerned with how to optimally use a (costly) oracle, in an interactive manner, to label training data that will be used to build a learning model, e.g., a classifier. In most of the active learning literature, the cost model is very basic (uniform or fixed-value costs), though some works consider more realistic costs. Also, oracles are usually assumed to be perfect, with only a few exceptions. These assumptions usually break when applied to complex data management problems on real-world data, such as crowdsourcing.

Having situated Valda's research area within its broader scientific scope, we now move to the discussion of Valda's application domains.

Research Directions

We now detail three main research axes within the research agenda of Valda. For each axis, we first mention the leading researcher, and other permanent members involved.

Foundations of data management (Luc Segoufin; Serge Abiteboul, Pierre Senellart).

The systems we are interested in, i.e., for manipulating heterogeneous and confidential data, rapidly changing and massively distributed, are inherently error-prone. The need for formal methods to verify data management systems is best illustrated by the long list of famous leakages of sensitive or personal data that made the front pages of newspapers recently. Moreover, because of the cost in accessing intensional data, it is important to optimize the resources needed for manipulating them.

This creates a need for solid, high-level foundations of database management systems (DBMS), formulated in a manner that is easy to understand while also facilitating optimization and the verification of critical properties.

In particular, these foundations are necessary for various design and reasoning tasks. They allow for clean specifications of key properties of the system such as confidentiality, access control, robustness, etc. Once clean specifications are available, the door is open to formal and runtime verification of the specification. Such foundations also permit the design of appropriate query languages (with good expressive power and limited usage of resources), the design of good indexes (for optimized evaluation), and so on. Note that access control policies currently used in database management systems are relatively crude: for example, PostgreSQL offers access control rules on tables, views, or tuples (row security policies), but provides no guarantee that these access methods do not contradict each other, or that a user cannot gain access through a query to information that she is not supposed to have access to.

Valda involves leading researchers in the formal verification of data flow in systems manipulating data. Other notable efforts include the WAVE project (http://db.ucsd.edu/WAVE/default.html) at U. C. San Diego, and the Business Artifact research program of IBM (http://researcher.watson.ibm.com/researcher/view_group.php?id=2501). One of Valda's objectives is to continue this line of research.

In the short run, we plan to contribute to the state of the art of foundations of systems manipulating data by identifying new scenarios, i.e., specification formalisms, query languages, index structures, query evaluation plans, etc., that allow for any of the tasks mentioned above: formal or runtime verification, optimization, etc. Several such scenarios are already known and Valda researchers contributed significantly to their discovery, but this research is still in its infancy and there is a clear need for more functionalities and more efficiency. This research direction has many facets.

One facet is to develop new logical frameworks and new automaton models with good algorithmic properties (for instance, efficient emptiness tests, efficient inclusion tests, and so on), in order to develop a toolbox for reasoning tasks around systems manipulating data. This toolbox can then be used for higher-level tasks such as optimization, verification, or query rewriting using views.

Another facet is to develop new index structures and new algorithms for efficient query evaluation. For example, the enumeration of the output of a query requires the construction of index structures allowing for an efficient compressed representation of the output, with efficient streaming decompression algorithms, as we aim for a constant delay between any two consecutive outputs. We have contributed a lot to this field by providing several such indexes, but much remains to be investigated.
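To give an idea of what constant-delay enumeration after preprocessing means, consider the simple join query Q(x, y, z) = R(x, y) ∧ S(y, z). The sketch below is a hypothetical illustration (the function names and the list encoding of relations are assumptions, and hash-table operations are assumed to take constant expected time): a linear-time preprocessing phase indexes S by y and discards the tuples of R that have no matching tuple in S, so that every remaining tuple of R contributes at least one answer and the delay between two consecutive answers does not depend on the size of the data.

```python
def preprocess(r, s):
    """Build the index, in time linear in the size of r and s.

    r and s are duplicate-free lists of pairs (x, y) and (y, z).
    """
    s_by_y = {}
    for y, z in s:
        s_by_y.setdefault(y, []).append(z)
    # Semi-join reduction: keep only the tuples of r that produce output.
    r_reduced = [(x, y) for x, y in r if y in s_by_y]
    return r_reduced, s_by_y


def enumerate_answers(index):
    """Yield the answers one by one.

    After the reduction every tuple of r_reduced has a non-empty list of
    matches, so no step of the loops is wasted between two outputs.
    """
    r_reduced, s_by_y = index
    for x, y in r_reduced:
        for z in s_by_y[y]:
            yield (x, y, z)
```

Without the reduction step, the enumeration could scan arbitrarily many tuples of R with no match in S before producing its next answer, and the delay would depend on the database.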

Our medium-term goal is to investigate the borders of feasibility of all the reasoning tasks above. For instance, what are the assumptions on data that allow for computable verification problems? When is it not possible at all? When can we hope for efficient query answering, and when is it hopeless? This is a problem of theoretical nature, which is necessary for understanding the limits of the methods and driving research towards the scenarios where positive results may be obtainable.

A typical result would be to show that constant delay enumeration of queries is not possible unless the database satisfies property A and the query satisfies property B. Another typical result would be to show that a robust access control policy satisfying two given properties at the same time is not achievable.

Very few such results exist nowadays. While many problems have been shown to be decidable or undecidable, charting the frontier of tractability (say, linear time) remains a challenge.

Only when we have understood the limitations of the methods (medium-term goal) and have many examples where this is possible can we hope to design a solid foundation allowing for a good trade-off between what can be done (needs from the users) and what can be achieved (limitations from the system). This will be our long-term goal.

Uncertainty and provenance of data (Pierre Senellart; Luc Segoufin).

This research axis deals with the modeling and efficient management of data that come with some uncertainty (probability distributions, logical incompleteness, missing values, open-world assumption, etc.) and with provenance information (indicating where the data originates from), as well as with the extraction of uncertainty and provenance annotations from real-world data. Interestingly, the foundations and tools for uncertainty management often rely on provenance annotations. For example, a typical way to compute the probability of query results in probabilistic databases is first to generate the provenance of these query results (in some Boolean framework, e.g., that of Boolean functions or of provenance semirings), and then to compute the probability of the resulting provenance annotation. For this reason, we will deal with uncertainty and provenance in a unified manner.
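The second step of this two-step approach can be sketched as follows, assuming the provenance of a query result has already been computed as a monotone Boolean formula in disjunctive normal form over independent events. The function probability and the encoding of the formula as a list of sets of event names are hypothetical; the sketch enumerates all possible worlds and is therefore exponential in the number of events, which is exactly what practical systems need to avoid.

```python
from itertools import product


def probability(dnf, event_probs):
    """Probability that a monotone DNF over independent events is true.

    dnf         -- list of clauses, each a set of event names; the formula
                   is the disjunction of the conjunctions of its clauses
    event_probs -- dict mapping each event name to its probability
    """
    events = sorted({event for clause in dnf for event in clause})
    total = 0.0
    for values in product([False, True], repeat=len(events)):
        world = dict(zip(events, values))
        if any(all(world[e] for e in clause) for clause in dnf):
            # Weight of this possible world under independence.
            weight = 1.0
            for e in events:
                weight *= event_probs[e] if world[e] else 1.0 - event_probs[e]
            total += weight
    return total
```

For the provenance (e1 ∧ e2) ∨ e3 with all probabilities equal to 0.5, the result is 0.625, not the 0.75 obtained by naively summing the probabilities of the two clauses, since the clauses are not mutually exclusive.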

Valda researchers have carried out seminal work on probabilistic databases, provenance management, incomplete information, and uncertainty analysis and propagation in conflicting datasets. These research areas have reached a point where the foundations are well understood, and where it becomes critical, while continuing to develop the theory of uncertain and provenance data management, to move to concrete implementations and applications to real-world use cases.

In the short term, we will focus on implementing techniques from the database theory literature on provenance and uncertainty data management, in the direction of building a full-featured database management add-on that transparently manages provenance and probability annotations for a large class of querying tasks. This work has started recently with the creation of the ProvSQL extension to PostgreSQL, discussed in more detail in the following section. To support this development work, we need to resolve the following research question: what representation systems and algorithms should be used to support both semiring provenance frameworks and their extensions to queries with negation, aggregation, or recursion?

Next, we will study how to add support for incompleteness, probabilities, and provenance annotations in the scenarios identified in the first axis, and how to extract and derive such annotations from real-world datasets and tasks. We will also work on the efficiency of our uncertain data management system, and compare it to other uncertainty management solutions, with a view to making it a fully usable system, with little overhead compared to a classical database management system. This requires a careful choice of the provenance representation system used, which should be both compact and amenable to probability computations. We will study practical applications of uncertainty management. As an example, we intend to consider routing in public transport networks, given a probabilistic model of the reliability and schedule uncertainty of different transit routes. The system should be able to provide a user with an itinerary that comes with a (probabilistic) guarantee of reaching her destination within a given time frame; this itinerary may not be the shortest route in the classical sense.

One overall long-term goal is to reach a full understanding of the interactions between query evaluation or other broader data management tasks and uncertain and annotated data models. We would in particular want to go towards a full classification of tractable (typically polynomial-time) and intractable (typically NP-hard for decision problems, or #P-hard for probability evaluation) tasks, extending and connecting the query-based dichotomy on probabilistic query evaluation with the instance-based one.

Another long-term goal is to consider more dynamic scenarios than what has been considered so far in the uncertain data management literature: when following a workflow, or when interacting with intensional data sources, how can we properly represent and update the uncertainty annotations that are associated with data? This is critical for many complex data management scenarios where one has to maintain a probabilistic current knowledge of the world, while obtaining new knowledge by posing queries and accessing data sources. Such intensional tasks require jointly minimizing data uncertainty and the cost of data access.

Personal information management (Serge Abiteboul; Pierre Senellart).

This is a more applied direction of research that will serve as the context in which to study issues of interest (see the discussion of application domains below).

A typical person today usually has data on several devices and in a number of commercial systems that function as data traps, where it is easy to check in information and difficult to remove it or sometimes simply to access it. It is also difficult, sometimes impossible, to control data access by other parties. This situation is unsatisfactory because it requires users to trade privacy against convenience, but also because it limits the value we, as individuals and as a society, can derive from the data. This leads to the concept of Personal Information Management System, in short, a Pims.

A Pims runs, on a user's server, the services selected by the user, storing and processing the user's data. The Pims centralizes the user's personal information. It is a digital home. The Pims is also able to exert control over information that resides in external services (for example, Facebook), and that only gets replicated inside the Pims. The advantages of Pims, as well as the issues they raise (e.g., security issues), have been discussed elsewhere; it has been argued that the main reason for a user to move to a Pims is that these systems enable great new functionalities.

Valda will study in particular the integration of the user's data. Researchers in the team have already provided important contributions in the context of data integration, notably in the context of the Webdam ERC (2009–2013).

Based on such an integration, a Pims can provide functions that go beyond simple query answering:

Global search over the person's data with a semantic layer using a personal ontology (for example, the data organization the person likes and the person's terminology for data) that helps give meaning to the data;

Automatic synchronization of data on different devices/systems, and global task sequencing to facilitate interoperating different devices/services;

Exchange of information and knowledge between "friends" in a truly social way, even if these use different social network platforms, or no platform at all;

Centralized control point for connected objects, a hub for the Internet of Things; and

Data analysis/mining over the person's information.

The focus on personal data and these various aspects raise interesting technical challenges that we intend to address.

In the short term, we intend to continue work on the ThymeFlow system to turn it into an easily extendable and deployable platform for the management of personal information – we will in particular encourage students from the M2 Web Data Management class taught by Serge and Pierre in the MPRI programme to use this platform in their course projects. The goal is to make it easy to add new functionalities (such as new source synchronizers to retrieve data and propagate updates to original data sources, and enrichers to add value to existing data) to considerably broaden the scope of the platform and consequently expand its value.

In the medium term, we will continue the work already started that focuses on turning information into knowledge and on knowledge integration. Issues related to intensionality or uncertainty will in particular be considered, relying on the works produced in the other two research axes. We stress, in particular, the importance of minimizing the cost of data access (or, in specific scenarios, the privacy cost associated with obtaining data items) in the context of personal information management: legacy data is often only available through costly APIs, interaction between several Pims may require sharing information within a strict privacy budget, etc. For these reasons, intensionality of data will be a strong focus of the research.

In the long term, we intend to use the knowledge acquired and machine learning techniques to predict the user's behavior and desires, and support new digital assistant functions, providing real value from data. We will also look into possibilities for deploying the ThymeFlow platform at a large scale, perhaps in collaboration with industry partners.

Application Domains

Personal Information Management Systems

We recall that Valda's focus is on human-centric data, i.e., data produced by humans, explicitly or implicitly, or more generally containing information about humans. Quite naturally, we will use personal information management systems (Pims for short) as a privileged application area to validate Valda's results.

A Pims is a system that allows a user to integrate her own data, e.g., emails and other kinds of messages, calendar, contacts, web search, social network, travel information, work projects, etc. Such information is commonly spread across different services. The goal is to give back to a user the control over her information, allowing her to formulate queries such as “What kind of interaction did I have recently with Alice B.?” or “Where were my last ten business trips, and who helped me plan them?”. The system has to orchestrate queries to the various services (which means knowing the existence of these services, and how to interact with them) and integrate information from them (which means having data models for this information and its representation in the services), e.g., align a GPS location of the user to a business address or place mentioned in an email, or an event in a calendar to some event in a Web search. This information must be accessed intensionally: for instance, costly information extraction tools should only be run on emails which seem relevant, perhaps identified by a less costly cursory analysis (this means, in turn, obtaining a cost model for access to the different services). Impacted people can be found by examining events in the user's calendar and determining who is likely to attend them, perhaps based on email exchanges or former events' participant lists. Of course, uncertainty has to be maintained along the entire process, and provenance information is needed to explain query results to the user (e.g., indicate which meetings and trips are relevant to each person of the output). Knowledge about services, their data models, and their costs needs either to be provided by the system designer, or to be automatically learned from interaction with these services.

One motivation for that choice is that Pims concentrate many of the problems we intend to investigate: heterogeneity (various sources, each with a different structure), massive distribution (information spread out over the Web, in numerous sources), rapid evolution (new data regularly added), intensionality (knowledge from Wikidata, OpenStreetMap...), confidentiality and security (mostly private data), and uncertainty (very variable quality). Though the data is distributed, its size is relatively modest; other applications may be considered for works focusing on processing data at large scale, which is a potential research direction within Valda, though not our main focus. Another strong motivation for the choice of Pims as application domain is the importance of this application from a societal viewpoint.

A Pims is essentially a system built on top of a user's personal knowledge base; such knowledge bases are reminiscent of those found in the Semantic Web, e.g., linked open data. Some issues, such as ontology alignment, exist in both scenarios. However, there are some fundamental differences between building personal knowledge bases and collecting information from the Semantic Web: first, the scope is much smaller, as one is only interested in knowledge related to a given individual; second, only a small proportion of the data is already present in the form of semantic information, and most of it needs to be extracted and annotated through appropriate wrappers and enrichers; third, whereas linked open data is meant to be read-only, the only update possible to a user being adding new triples, a personal knowledge base is very much something that a user needs to be able to edit, and propagating updates from the knowledge base to the original data sources is a challenge in itself.

Web Data

The choice of Pims is not exclusive. We intend to consider other application areas as well. In particular, we have worked in the past on, and have strong expertise in, Web data in a broad sense: semi-structured, structured, or unstructured content extracted from Web databases; knowledge bases from the Semantic Web; social networks; Web archives and Web crawls; Web applications and deep Web databases; crowdsourcing platforms. We intend to continue using Web data as a natural application domain for the research within Valda when relevant. For instance, deep Web databases are a natural application scenario for intensional data management issues: determining if a deep Web database contains some information requires optimizing the number of costly requests to that database.

A common aspect of both personal information and Web data is that their exploitation raises ethical considerations. Thus, a user needs to remain fully in control of the usage that is made of her personal information; a search engine or recommender system that ranks Web content for display to a specific user needs to do so in an unbiased, justifiable manner. These ethical constraints sometimes forbid solutions that may be technically useful, such as sharing a model learned from the personal data of one user with another user, or using black boxes to rank query results. We fully intend to take these ethical considerations into account within Valda. One of the main goals of a Pims is indeed to give the user full control over the use of her data.

New Software and Platforms

ProvSQL

Keywords: Databases - Provenance - Probability

Functional Description: The goal of the ProvSQL project is to add support for (m-)semiring provenance and uncertainty management to PostgreSQL databases, in the form of a PostgreSQL extension/module/plugin.

News Of The Year: ProvSQL has become usable for a large range of queries. Support for semirings and m-semirings is present; support for probability computation has been added through a variety of techniques, including knowledge compilation; support for where-provenance is currently being implemented.

Participants: Pierre Senellart and Yann Ramusat

Contact: Pierre Senellart

Publication: Provenance and Probabilities in Relational Databases: From Theory to Practice

URL: https://github.com/PierreSenellart/provsql
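To illustrate what semiring provenance means, here is a minimal Python sketch. It does not use ProvSQL or its interface; the expression encoding, the `Semiring` class, and the tuple identifiers `t1`, `t2`, `t3` are assumptions made for illustration. A single provenance expression, (t1 ⊗ t2) ⊕ (t1 ⊗ t3), is evaluated in three different semirings to obtain a multiplicity, a presence flag, and a minimal cost.

```python
from dataclasses import dataclass
from typing import Any, Callable


@dataclass(frozen=True)
class Semiring:
    plus: Callable[[Any, Any], Any]   # combines alternative derivations
    times: Callable[[Any, Any], Any]  # combines jointly used tuples


def evaluate(expr, semiring, valuation):
    """Evaluate a provenance expression given as nested tuples."""
    kind = expr[0]
    if kind == "var":
        return valuation[expr[1]]
    left = evaluate(expr[1], semiring, valuation)
    right = evaluate(expr[2], semiring, valuation)
    if kind == "plus":
        return semiring.plus(left, right)
    if kind == "times":
        return semiring.times(left, right)
    raise ValueError(f"unknown node kind: {kind}")


COUNTING = Semiring(plus=lambda a, b: a + b, times=lambda a, b: a * b)
BOOLEAN = Semiring(plus=lambda a, b: a or b, times=lambda a, b: a and b)
TROPICAL = Semiring(plus=min, times=lambda a, b: a + b)

# (t1 * t2) + (t1 * t3): the answer is derived either from t1 and t2,
# or from t1 and t3.
provenance = ("plus",
              ("times", ("var", "t1"), ("var", "t2")),
              ("times", ("var", "t1"), ("var", "t3")))

multiplicity = evaluate(provenance, COUNTING, {"t1": 2, "t2": 1, "t3": 3})
still_present = evaluate(provenance, BOOLEAN,
                         {"t1": True, "t2": False, "t3": True})
cheapest = evaluate(provenance, TROPICAL, {"t1": 1, "t2": 5, "t3": 2})
```

The point of the sketch is that the expression is computed once and then reinterpreted: changing the semiring changes the question being answered without re-running the query.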

Thymeflow

Keyword: Personal information

Functional Description: ThymeFlow allows in particular the development of plugins for both interacting with existing Web sources and presenting users with rich interfaces and query facilities over their personal information. A preliminary version of ThymeFlow tools has also been deployed on the Cozy Cloud personal cloud system. The model allows the open-source community to contribute individual plugins while we focus on providing users with useful ways to exploit their personal information.

News Of The Year: Minor maintenance.

Participants: David Montoya, Pierre Senellart, Serge Abiteboul and Su Yang

Partner: ENGIE

Contact: Pierre Senellart

Publication: Personal Knowledge Base Systems

URL: https://github.com/thymeflow/thymeflow/

apxproof

Keyword: LaTeX

Functional Description: apxproof is a LaTeX package facilitating the typesetting of research articles with proofs in appendix, a common practice in database theory and theoretical computer science in general. The appendix material is written in the LaTeX code along with the main text which it naturally complements, and it is automatically deferred. The package can automatically send proofs to the appendix, can repeat in the appendix the theorem environments stated in the main text, can section the appendix automatically based on the sectioning of the main text, and supports a separate bibliography for the appendix material.

Release Functional Description: Ability to specify a sectioning counter, Compilation fix of proofsketch in inline mode

News Of The Year: Overall software maintenance. Support for more document classes. Some new features.

Participant: Pierre Senellart

Contact: Pierre Senellart

URL: https://github.com/PierreSenellart/apxproof

New Results

Enumeration of Query Results

In many applications the output of a query may be huge, and computing all the answers may already consume too many of the allowed resources. In this case it may be appropriate to first output a small subset of the answers and then, on demand, output a subsequent small number of answers, and so on until all possible answers have been exhausted. To make this even more attractive, it is preferable to minimize the time necessary to output the first answers and, from a given set of answers, to minimize the time necessary to output the next set of answers; this second time interval is known as the delay. We have shown that this is doable with almost linear preprocessing time and constant enumeration delay for first-order queries over structures having local bounded expansion.
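The preprocessing/delay split can be sketched on a much simpler case than the one treated in our result: the join of two binary relations R(x, y) and S(y, z), assumed duplicate-free. This is only an illustration of the enumeration paradigm, not the algorithm for first-order queries over structures of local bounded expansion; the function names are hypothetical. Preprocessing indexes S and discards the tuples of R that have no join partner, so that enumeration never hits a dead end and each answer is produced a constant number of steps after the previous one.

```python
from collections import defaultdict
from itertools import islice


def preprocess(r, s):
    """Linear-time preprocessing (with hashing): index S by its join
    attribute and keep only the R-tuples that have a partner in S."""
    index = defaultdict(list)
    for y, z in s:
        index[y].append(z)
    reduced_r = [(x, y) for x, y in r if y in index]
    return reduced_r, index


def enumerate_answers(reduced_r, index):
    """Lazily yield the answers (x, y, z); since dangling tuples were
    removed, every step of the loops produces an answer."""
    for x, y in reduced_r:
        for z in index[y]:
            yield (x, y, z)


r = [(1, "a"), (2, "b"), (3, "c")]
s = [("a", 10), ("a", 11), ("c", 12)]

reduced_r, index = preprocess(r, s)
answers = enumerate_answers(reduced_r, index)
first_two = list(islice(answers, 2))   # answers requested on demand
remaining = list(answers)              # the rest, requested later
```

The generator makes the "on demand" aspect explicit: the consumer decides how many answers to pull, and nothing beyond the preprocessing is computed in advance.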

Ethical Data Management

Issues of responsible data analysis and use are coming to the forefront of the discourse in data science research and practice. Research has so far focused on analyzing the fairness, accountability and transparency (FAT) properties of specific algorithms and their outputs. Although these issues are most apparent in the social sciences, where fairness is interpreted in terms of the distribution of resources across protected groups, management of bias in source data affects a variety of fields. Consider climate change studies that require representative data from geographically diverse regions, or supply chain analyses that require data that represents the diversity of products and customers. In a paper, we argue that FAT properties must be considered as database system issues, further upstream in the data science lifecycle: bias in source data goes unnoticed, and bias may be introduced during pre-processing (fairness), spurious correlations lead to reproducibility problems (accountability), and assumptions made during pre-processing have invisible but significant effects on decisions (transparency). As machine learning methods continue to be applied broadly by non-experts, the potential for misuse increases. There is a need for a data sharing and collaborative analytics platform with features to encourage (and in some cases, enforce) best practices at all stages of the data science lifecycle. We describe features of such a platform, which we term Fides, in the context of urban analytics, outlining a systems research agenda in responsible data science.

Structure and Tractability of Uncertain Data

A major part of the work conducted in Valda has been to study the connections between tractability and structure in databases, in particular uncertain databases.

In a first line of work, we have investigated incompleteness related to order. We have introduced a query language for order-incomplete data, based on the positive relational algebra with order-aware accumulation. We have used partial orders to represent order-incomplete data, and studied possible and certain answers for queries in this context, showing that these problems are respectively NP-complete and coNP-complete, but identifying tractable cases depending on query operators and the structure of input partial orders. We have also considered a different setting where some partial order is known, but actual values are unknown. Our work is the first to propose a principled scheme to derive the value distributions and expected values of unknown items in this setting, with the goal of computing estimated top-$k$ results by interpolating the unknown values from the known ones. We have studied the complexity of this general task and shown tight complexity bounds, proving that the problem is intractable, but can be tractably approximated. We have also isolated structure-based restrictions that allow for a PTIME solution.
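The notions of possible and certain answers over order-incomplete data can be made concrete with a brute-force sketch, assuming a toy representation in which a partial order is a list of "a precedes b" pairs and a possible world is any total order consistent with them. The function names and the hotel example are hypothetical, and the exhaustive enumeration is exponential; it only illustrates the definitions, not the tractable cases identified in our work.

```python
from itertools import permutations


def linear_extensions(items, constraints):
    """Yield every total order of items consistent with the pairs
    (a, b) in constraints, each meaning that a precedes b."""
    for perm in permutations(items):
        position = {item: i for i, item in enumerate(perm)}
        if all(position[a] < position[b] for a, b in constraints):
            yield perm


def distinct_results(items, constraints, query):
    """Set of results of the query over all possible worlds."""
    return {query(order) for order in linear_extensions(items, constraints)}


def is_possible(candidate, results):
    return candidate in results


def is_certain(candidate, results):
    return results == {candidate}


def top1(order):
    return order[:1]


hotels = ["hotel_a", "hotel_b", "hotel_c"]

# Only "hotel_a before hotel_b" is known: hotel_c may be ranked anywhere.
loose = distinct_results(hotels, [("hotel_a", "hotel_b")], top1)

# Adding "hotel_a before hotel_c" pins down the first position.
tight = distinct_results(
    hotels, [("hotel_a", "hotel_b"), ("hotel_a", "hotel_c")], top1)
```

With the loose constraints, `("hotel_a",)` is a possible but not a certain top-1 result; with the tighter constraints it becomes certain.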

In a second line of work, we have investigated parameterizations of both database instances and queries that make query evaluation fixed-parameter tractable in combined complexity, first in a setting without uncertainty. For this, we have introduced a new Datalog fragment with stratified negation, intensional-clique-guarded Datalog (ICG-Datalog), with linear-time evaluation on structures of bounded treewidth for programs of bounded rule size. Our result is shown by compiling to alternating two-way automata, whose semantics is defined via cyclic provenance circuits (cycluits) that can be tractably evaluated. Finally, moving to the probabilistic setting, we have shown that probabilistic query evaluation remains intractable in combined complexity under this parameterization.

Finally, a last line of work concerns efficient queries over probabilistic graphs. In a first, theoretical work, we have studied the combined complexity of conjunctive query evaluation on probabilistic graphs, which can be alternatively phrased as a probabilistic version of the graph homomorphism problem. We have shown that the complexity landscape is surprisingly rich, using a variety of technical tools. In a more practical work, we have proposed indexing techniques and algorithms to evaluate source-to-target queries in probabilistic graphs by exploiting their structure. We have shown that these significantly enhance the accuracy and efficiency of existing query evaluation approaches on probabilistic graphs.
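As a point of reference for what a source-to-target query on a probabilistic graph computes, here is a naive baseline, assuming directed edges that are each present independently with a given probability. It enumerates all possible worlds and is therefore exponential in the number of edges; it is not the indexing framework mentioned above, and the function names and the example graph are hypothetical.

```python
from itertools import product


def reachable(edges, source, target):
    """Depth-first search for target from source over directed edges."""
    adjacency = {}
    for u, v in edges:
        adjacency.setdefault(u, []).append(v)
    seen, stack = {source}, [source]
    while stack:
        node = stack.pop()
        if node == target:
            return True
        for nxt in adjacency.get(node, []):
            if nxt not in seen:
                seen.add(nxt)
                stack.append(nxt)
    return False


def reachability_probability(prob_edges, source, target):
    """Sum the probabilities of all possible worlds (subsets of edges)
    in which target is reachable from source."""
    edges = list(prob_edges)
    total = 0.0
    for present in product([False, True], repeat=len(edges)):
        weight, world = 1.0, []
        for edge, keep in zip(edges, present):
            p = prob_edges[edge]
            weight *= p if keep else 1.0 - p
            if keep:
                world.append(edge)
        if reachable(world, source, target):
            total += weight
    return total


graph = {("s", "a"): 0.5, ("a", "t"): 0.5, ("s", "t"): 0.5}
probability = reachability_probability(graph, "s", "t")
```

On this three-edge graph the two routes from s to t are independent, so the result can be checked by hand as 1 - (1 - 0.5)(1 - 0.25) = 0.625.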

Partnerships and Cooperations

Regional Initiatives

Valda has obtained a 10k€ budget from ENS in 2017, as a start-up grant for the team (Action Concertée Incitative).

Inria established a bilateral contract with the Centre – Val de Loire region, for the expert assessment and audit of a research project by Pierre Senellart. Because of delays due to the company being audited, the assessment is still in progress.

National Initiatives

ANR

Valda has been part of one ANR project in 2017 (Headwork, budget managed by Inria), together with IRISA (DRUID team, coordinator), Inria Lille (LINKS & SPIRAL), and Inria Rennes (SUMO), and two application partners: MNHN (Cesco) and FouleFactory. The topic is workflows for crowdsourcing. See http://headwork.gforge.inria.fr/.

In addition, another project (BioQOP, budget managed by ENS) will start in January 2018, with Morpho and GREYC, on the optimization of queries for privacy-aware biometric data management.

International Initiatives

Informal International Partners

Valda has strong collaborations with the following international groups:

Univ. Edinburgh, United Kingdom:

Peter Buneman and Leonid Libkin

Univ. Oxford, United Kingdom:

Michael Benedikt, Evgeny Kharlamov, and Georg Gottlob

Dortmund University, Germany:

Thomas Schwentick

Warsaw University, Poland:

Mikołaj Bojańczyk and Szymon Toruńczyk

Tel Aviv University, Israel:

Daniel Deutch and Tova Milo

Drexel University, USA:

Julia Stoyanovich

Univ. California San Diego, USA:

Victor Vianu

National University of Singapore:

Stéphane Bressan

International Research Visitors

Visits of International Scientists

Victor Vianu, Professor at UC San Diego and holder of an Inria international chair, spent 6 months within Valda: three months employed by Inria and three months as an ENS invited professor.

Internships

Debabrota Basu, PhD student at the National University of Singapore, stayed 2.5 months within Valda, to work with Pierre Senellart.

Visits to International Teams

Research Stays Abroad

Pierre Senellart has spent around two months at the University of Edinburgh, collaborating with Peter Buneman and Leonid Libkin.

Pierre Senellart has spent a cumulative total of more than one month at the National University of Singapore, co-advising Debabrota Basu, a PhD student working under the co-supervision of Stéphane Bressan.

Dissemination

Promoting Scientific Activities

Scientific Events Organisation

Member of the Organizing Committees

Serge Abiteboul, organization of Personal Analytics & Privacy workshop, joint with ECML-PKDD 2017, Skopje, Macedonia

Serge Abiteboul, scientific organization of colloquium on La communauté scientifique face au renseignement, École militaire, Paris, France

Serge Abiteboul, organization of colloquium on Les enjeux scientifiques de l'éthique du numérique, Académie des sciences

Pierre Senellart, organization of ParisBD 2017, Télécom ParisTech, Paris, France

Pierre Senellart, co-organizer of ACM-ICPC Southwestern Europe 2017 competition

Scientific Events Selection

Chair of Conference Program Committees

Pierre Senellart, BDA 2017 (French conference on data management)

Pierre Senellart, WebDB workshop, joint with SIGMOD 2017

Member of the Conference Program Committees

Pierre Senellart, Gems of PODS 2017 committee

Pierre Senellart, SIGMOD 2017 (distinguished PC member), ICDT 2017, EDBT 2018

Journal

Reviewer - Reviewing Activities

Pierre Senellart, Journal of the ACM, VLDB Journal, Artificial Intelligence

Invited Talks

Serge Abiteboul, keynote at PPDP-LOPSTR, Namur, Belgium

Serge Abiteboul, keynote at ETAPS, Uppsala, Sweden

Serge Abiteboul, keynote at Law & Big Data Conference, Paris, France

Leadership within the Scientific Community

Serge Abiteboul is a member of the French Academy of Sciences, of the Academia Europaea, and of the scientific council of the Société Informatique de France.

Scientific Expertise

Pierre Senellart, ANR, NSF

Research Administration

Serge Abiteboul was the president of the Dune jury (Développement d'universités numériques expérimentales)

Serge Abiteboul participated in the NCU jury (nouveaux cursus à l'université)

Serge Abiteboul contributed to the report on Éthique de la recherche en apprentissage machine of Cerna-Allistene

Serge Abiteboul is co-chair of the “Committee on Gender Equality and Equal Opportunities” of Inria.

Luc Segoufin is a member of the CNHSCT of Inria.

Pierre Senellart is a member of the board of section 6 of the National Committee for Scientific Research.

Pierre Senellart is vice-director of the DI ENS laboratory, joint between ENS, CNRS, and Inria.

Teaching - Supervision - Juries

Teaching

Licence: Pierre Senellart, Databases, 54 heqTD, L3, École normale supérieure

Licence: Pierre Senellart, Algorithms, 18 heqTD, L3, École normale supérieure

Master: Serge Abiteboul & Pierre Senellart, Web data management, 36 heqTD, M2, MPRI

Master: Luc Segoufin, Logic, descriptive complexity and database theory, 36 heqTD, M2, MPRI

Pierre Senellart has various teaching responsibilities (L3 internships, M2 internships, M2 administration) at ENS.

Serge Abiteboul offered, with Benjamin Nguyen and Philippe Rigaux, a second session of the MOOC “Bases de données relationnelles: comprendre pour maîtriser” (FUN). He gave, with Julia Stoyanovich, a course on “Ethical data management” at the EDBT Summer School, Genova, 2017.

Supervision

PhD: David Montoya, Une base de connaissance personnelle intégrant les données d'un utilisateur et une chronologie de ses activités, Université Paris-Saclay, 6 March 2017, Serge Abiteboul & Pierre Senellart

PhD in progress: Debabrota Basu, Reinforcement learning applications to data management problems, started in 2015, Stéphane Bressan & Pierre Senellart

PhD in progress: Julien Grange, Graph properties: order and arithmetic in predicate logics, started in 2017, Luc Segoufin

PhD in progress: Miyoung Han, Learning approaches to dynamic data management, started in 2015, Pierre Senellart

PhD in progress: Quentin Lobbé, Diachronic analysis of diaspora communities through web archives enrichment, started in 2015, Pierre Senellart & Dana Diminescu

PhD in progress: Mikaël Monet, Efficient querying of large uncertain graphs by exploiting their structure, started in 2015, Pierre Senellart & Antoine Amarilli

PhD in progress: Karima Rafes, Security and management of personal data in the Web of things, started in 2015, Serge Abiteboul & Sarah Cohen-Boulakia

PhD in progress: Alexandre Vigny, Query enumeration on nowhere-dense graphs, started in 2015, Luc Segoufin & Arnaud Durand

Juries

PhD Paul Lagrée, October 2017, Université Paris-Saclay, Pierre Senellart

Popularization

Serge Abiteboul is involved in several popular science activities. He founded and runs the blog http://binaire.blog.lemonde.fr/ on computer science. He was the scientific curator (commissaire scientifique) of the exhibition “Terra Data” at the Cité des Sciences. He published two popular science books in 2017: “Le temps des algorithmes”, with Gilles Dowek, and “Terra data”, with Valérie Peugeot.

Serge Abiteboul is the president of the strategic committee of the Blaise Pascal foundation for scientific mediation.

Morgan & Claypool Publishers 2011 https://doi.org/10.2200/S00362ED1V01Y201105DTM016 Reinforcement learning - an introduction Adaptive computation and machine learning Richard S. Sutton R. S. Andrew G. Barto A. G. MIT Press 1998 http://www.worldcat.org/oclc/37293240 The Complexity of Relational Query Languages (Extended Abstract) Moshe Y. Vardi M. Y. Harry R. Lewis H. R. Barbara B. Simons B. B. Walter A. Burkhard W. A. Lawrence H. Landweber L. H. Proceedings of the 14th Annual ACM Symposium on Theory of Computing, May 5-7, 1982, San Francisco, California, USA ACM 1982 137–146 http://doi.acm.org/10.1145/800070.802186 On the reliability and intuitiveness of aggregated search metrics Ke Zhou K. Mounia Lalmas M. Tetsuya Sakai T. Ronan Cummins R. Joemon M. Jose J. M. Qi He Q. Arun Iyengar A. Wolfgang Nejdl W. Jian Pei J. Rajeev Rastogi R. 22nd ACM International Conference on Information and Knowledge Management, CIKM'13, San Francisco, CA, USA, October 27 - November 1, 2013 ACM 2013 689–698 http://doi.acm.org/10.1145/2505515.2505691 Principles of Distributed Database Systems, Third Edition M. Tamer Özsu M. T. Patrick Valduriez P. Springer 2011 https://doi.org/10.1007/978-1-4419-8834-8