Valda integrated members of the Inria DAHU project-team in 2017. Their relevant activity is included in this activity report.
Valda's focus is on both foundational and systems aspects of complex data management, especially human-centric data. The data we are interested in is typically heterogeneous, massively distributed, rapidly evolving, intensional, and often subjective, possibly erroneous, imprecise, incomplete. In this setting, Valda is in particular concerned with the optimization of complex resources such as computer time and space, communication, monetary, and privacy budgets. The goal is to extract value from data, beyond simple query answering.
Data management is now an old, well-established field, for which many scientific results and techniques have been accumulated since the sixties. Originally, most works dealt with static, homogeneous, and precise data. Later, works were devoted to heterogeneous data, possibly distributed but at a small scale.
However, these classical techniques are poorly adapted to handle the new challenges of data management. Consider human-centric data, which is either produced by humans, e.g., emails, chats, recommendations, or produced by systems when dealing with humans, e.g., geolocation, business transactions, results of data analysis. When dealing with such data, and to accomplish any task to extract value from such data, we rapidly encounter the following facets:
Heterogeneity: data may come in many different structures such as unstructured text, graphs, data streams, complex aggregates, etc., using many different schemas or ontologies.
Massive distribution: data may come from a large number of autonomous sources distributed over the web, with complex access patterns.
Rapid evolution: many sources may be producing data in real time, even if little of it is perhaps relevant to the specific application. Typically, recent data is of particular interest and changes have to be monitored.
Intensionality
Confidentiality and security: some personal data is critical and needs to remain confidential. Applications manipulating personal data must take this into account and must be secure against linking.
Uncertainty: modern data, and in particular human-centric data, typically includes errors, contradictions, imprecision, incompleteness, which complicates reasoning. Furthermore, the subjective nature of the data, with opinions, sentiments, or biases, also makes reasoning harder since one has, for instance, to consider different agents with distinct, possibly contradicting knowledge.
These problems have already been studied individually and have led to techniques such as query rewriting or distributed query optimization.
Among all these aspects, intensionality is perhaps the one that has least been studied, so we will pay particular attention to it. Consider a user's query, taken in a very broad sense: it may be a classical database query, some information retrieval search, a clustering or classification task, or some more advanced knowledge extraction request. Because of the intensionality of data, solving such a query is typically a dynamic task: each time new data is obtained, the partial knowledge a system has of the world is revised, and query plans need to be updated, as in adaptive query processing or aggregated search. The system then needs to decide, based on this partial knowledge, on the best next access to perform. This is reminiscent of the central problem of reinforcement learning (train an agent to accomplish a task in a partially known world based on rewards obtained) and of active learning (decide which action to perform next in order to optimize a learning strategy), and we intend to explore this connection further.
Uncertainty of the data interacts with its intensionality: efforts are required to obtain more precise, more complete, sounder results, which yields a trade-off between processing cost and data quality.
Other aspects, such as heterogeneity and massive distribution, are of major importance as well. A standard data management task, such as query answering, information retrieval, or clustering, may become much more challenging when taking into account the fact that data is not available in a central location, or in a common format. We aim to take these aspects into account, to be able to apply our research to real-world applications.
We intend to tackle hard technical issues such as query answering, data integration, data monitoring, verification of data-centric systems, truth finding, knowledge extraction, data analytics, that take a different flavor in this modern context. In particular, we are interested in designing strategies to minimize data access cost towards a specific goal, possibly a massive data analysis task. That cost may be in terms of communication (accessing data in distributed systems, on the Web), of computational resources (when data is produced by complex tools such as information extraction, machine learning systems, or complex query processing), of monetary budget (paid-for application programming interfaces, crowdsourcing platforms), or of a privacy budget (as in the standard framework of differential privacy).
A number of data management tasks in Valda are inherently intractable. In addition to properly characterizing this intractability in terms of complexity theory, we intend to develop solutions for solving these tasks in practice, based on approximation strategies, randomized algorithms, enumeration algorithms with constant delay, or identification of restricted forms of data instances lowering the complexity of the task.
We now detail some of the scientific foundations of our research on complex data management. This is the occasion to review connections between data management, especially on complex data as is the focus of Valda, and related research areas.
Data management has been connected to logic since the advent of the relational model as the main representation system for real-world data, and of first-order logic as the logical core of database querying languages. Since these early developments, logic has also been successfully used to capture a large variety of query modes, such as data aggregation, recursive queries (Datalog), or querying of XML databases. Logical formalisms facilitate reasoning about the expressiveness of a query language or about its complexity.
The main problem of interest in data management is that of query evaluation, i.e., computing the results of a query over a database. The complexity of this problem has far-reaching consequences. For example, it is because first-order logic is in the
Automata theory and formal languages arise as important components of the study of many data management tasks: in temporal databases, queries, expressed in temporal logics, can often be compiled to automata; in graph databases, queries are naturally given as automata; typical query and schema languages for XML databases such as XPath and XML Schema can be compiled to tree automata, or for more complex languages to data tree automata. Another reason for the importance of automata theory, and tree automata in particular, comes from Courcelle's results, which show that very expressive queries (expressed in monadic second-order logic) can be evaluated as tree automata over tree decompositions of the original databases, yielding linear-time algorithms (in data complexity) for a wide variety of applications.
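To make the remark on graph databases concrete, here is a minimal sketch of how a path query given as an automaton can be evaluated by exploring the product of the graph and the automaton. The toy graph, the edge labels, and the function name `rpq_reachable` are hypothetical and chosen only for illustration; this is not the method of any particular system.

```python
from collections import deque


def rpq_reachable(edges, nfa_delta, nfa_start, nfa_accepting, source):
    """Return the nodes reachable from `source` along a path whose sequence
    of edge labels is accepted by the automaton.

    edges: iterable of (u, label, v) triples (the graph database).
    nfa_delta: dict mapping (state, label) to a set of next states.
    """
    # Index outgoing edges by source node.
    out = {}
    for u, label, v in edges:
        out.setdefault(u, []).append((label, v))

    # Breadth-first search over (node, state) pairs of the product.
    start = (source, nfa_start)
    seen = {start}
    queue = deque([start])
    answers = set()
    while queue:
        node, state = queue.popleft()
        if state in nfa_accepting:
            answers.add(node)
        for label, nxt in out.get(node, []):
            for nxt_state in nfa_delta.get((state, label), ()):
                pair = (nxt, nxt_state)
                if pair not in seen:
                    seen.add(pair)
                    queue.append(pair)
    return answers


# Hypothetical toy graph (illustration only).
EDGES = [
    ("alice", "knows", "bob"),
    ("bob", "knows", "carol"),
    ("carol", "worksAt", "inria"),
    ("alice", "worksAt", "ens"),
]

# Automaton for the path expression: knows+ followed by worksAt.
DELTA = {
    ("q0", "knows"): {"q1"},
    ("q1", "knows"): {"q1"},
    ("q1", "worksAt"): {"q2"},
}

print(rpq_reachable(EDGES, DELTA, "q0", {"q2"}, "alice"))
```

The search visits each (node, state) pair at most once, so the cost is bounded by the size of the graph times the size of the automaton.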
Complex data management also has connections to verification and static analysis. Besides query evaluation, a central problem in data management is that of deciding whether two queries are equivalent. This is critical for query optimization, in order to determine if the rewriting of a query, maybe cheaper to evaluate, will return the same result as the original query. Equivalence can easily be seen to be an instance of the problem of (non-)satisfiability:
The orchestration of distributed activities (under the responsibility of a conductor) and their choreography (when they are fully autonomous) are complex issues that are essential for a wide range of data management applications including, notably, e-commerce systems, business processes, health-care and scientific workflows. The difficulty is to guarantee consistency or, more generally, quality of service, and to statically verify critical properties of the system. Different approaches to workflow specifications exist: automata-based, logic-based, or predicate-based control of function calls.
To deal with the uncertainty attached to data, proper models need to be used (such as attaching provenance information to data items and viewing the whole database as being probabilistic) and practical methods and systems need to be developed to both reliably estimate the uncertainty in data items and properly manage provenance and uncertainty information throughout a long, complex system.
The simplest model of data uncertainty is the NULLs of SQL databases, also called Codd tables. This representation system is too basic for any complex task, and has the major drawback of not being closed under even simple queries or updates. A solution to this has been proposed in the form of conditional tables, where every tuple is annotated with a Boolean formula over independent Boolean random events. This model has been recognized as foundational and extended in two different directions: to more expressive models of provenance than what Boolean functions capture, through a semiring formalism, and to a probabilistic formalism by assigning independent probabilities to the Boolean events. These two extensions form the basis of modern provenance and probability management, largely subsuming previous works. Research in the past ten years has focused on a better understanding of the tractability of query answering with provenance and probabilistic annotations, in a variety of specializations of this framework.
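As an illustration of the semiring formalism, the following sketch evaluates one fixed join-and-project query over relations whose tuples carry semiring annotations: joint use of tuples multiplies annotations, and alternative derivations of the same output tuple add them. The `Semiring` class, the relation contents, and the query are assumptions made for this example only.

```python
from dataclasses import dataclass
from typing import Any, Callable


@dataclass(frozen=True)
class Semiring:
    zero: Any
    one: Any
    plus: Callable[[Any, Any], Any]
    times: Callable[[Any, Any], Any]


# Two standard instances: counting derivations, and Boolean presence.
COUNTING = Semiring(0, 1, lambda a, b: a + b, lambda a, b: a * b)
BOOLEAN = Semiring(False, True, lambda a, b: a or b, lambda a, b: a and b)


def join_project(r, s, semiring):
    """Evaluate Q(x, z) :- R(x, y), S(y, z) over annotated relations.

    r, s: dicts mapping tuples to semiring annotations.
    """
    result = {}
    for (x, y), ann_r in r.items():
        for (y2, z), ann_s in s.items():
            if y == y2:
                # Joint use of two tuples: multiply their annotations.
                contribution = semiring.times(ann_r, ann_s)
                # Alternative derivations of (x, z): add them up.
                previous = result.get((x, z), semiring.zero)
                result[(x, z)] = semiring.plus(previous, contribution)
    return result


# Hypothetical annotated relations (illustration only).
R_COUNT = {("a", "b1"): 2, ("a", "b2"): 1}
S_COUNT = {("b1", "c"): 3, ("b2", "c"): 1}
R_BOOL = {("a", "b1"): True, ("a", "b2"): True}
S_BOOL = {("b1", "c"): False, ("b2", "c"): True}

print(join_project(R_COUNT, S_COUNT, COUNTING))  # multiplicities
print(join_project(R_BOOL, S_BOOL, BOOLEAN))     # presence or absence
```

The same query code serves both interpretations; only the semiring passed as a parameter changes, which is the point of the formalism.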
Statistical machine learning, and its applications to data mining and data analytics, is a major foundation of data management research. A large variety of research areas in complex data management, such as wrapper induction, crowdsourcing, focused crawling, or automatic database tuning, critically rely on machine learning techniques, such as classification, probabilistic models, or reinforcement learning.
Machine learning is also a rich source of complex data management problems: for instance, the probabilities produced by a conditional random field system result in probabilistic annotations that need to be properly modeled, stored, and queried.
Finally, complex data management also brings new twists to some classical machine learning problems. Consider for instance the area of active learning, a subfield of machine learning concerned with how to optimally use a (costly) oracle, in an interactive manner, to label training data that will be used to build a learning model, e.g., a classifier. In most of the active learning literature, the cost model is very basic (uniform or fixed-value costs), though some works consider more realistic costs. Also, oracles are usually assumed to be perfect, with only a few exceptions. These assumptions usually break when applied to complex data management problems on real-world data, such as crowdsourcing.
Having situated Valda's research area within its broader scientific scope, we now move to the discussion of Valda's application domains.
We now detail three main research axes within the research agenda of Valda. For each axis, we first mention the leading researcher, and other permanent members involved.
The systems we are interested in, i.e., for manipulating heterogeneous and confidential data, rapidly changing and massively distributed, are inherently error-prone. The need for formal methods to verify data management systems is best illustrated by the long list of famous leakages of sensitive or personal data that made the front pages of newspapers recently. Moreover, because of the cost in accessing intensional data, it is important to optimize the resources needed for manipulating them.
This creates a need for solid, high-level foundations of database management systems (DBMSs) that are easier to understand and that also facilitate optimization and verification of their critical properties.
In particular, these foundations are necessary for various design and reasoning tasks. They allow for clean specifications of key properties of the system such as confidentiality, access control, robustness, etc. Once clean specifications are available, the door is open to formal and runtime verification of these specifications. They also permit the design of appropriate query languages (with good expressive power and limited usage of resources), the design of good indexes (for optimized evaluation), and so on. Note that access control policies currently used in database management systems are relatively crude: for example, PostgreSQL offers access control rules on tables, views, or tuples (row security policies), but provides no guarantee that these access methods do not contradict each other, or that a user cannot gain access, through a query, to information that she is not supposed to have access to.
Valda involves leading researchers in the formal verification of data
flow in a system manipulating data. Other notable teams involve the
WAVE project
In the short run, we plan to contribute to the state of the art of foundations of systems manipulating data by identifying new scenarios, i.e., specification formalisms, query languages, index structures, query evaluation plans, etc., that allow for any of the tasks mentioned above: formal or runtime verification, optimization, etc. Several such scenarios are already known and Valda researchers contributed significantly to their discovery, but this research is still in its infancy and there is a clear need for more functionalities and more efficiency. This research direction has many facets.
One facet is to develop new logical frameworks and new automaton models, with good algorithmic properties (for instance, efficient emptiness tests, efficient inclusion tests, and so on), in order to develop a toolbox for reasoning tasks around systems manipulating data. This toolbox can then be used for higher-level tasks such as optimization, verification, or query rewriting using views.
Another facet is to develop new index structures and new algorithms for efficient query evaluation. For example, the enumeration of the output of a query requires the construction of index structures allowing for an efficient compressed representation of the output with efficient streaming decompression algorithms, as we aim for a constant delay between any two consecutive outputs. We have contributed a lot to this field by providing several such indexes, but there remains a lot to be investigated.
Our medium-term goal is to investigate the borders of feasibility of all the reasoning tasks above. For instance, what are the assumptions on data that allow for computable verification problems? When is it not possible at all? When can we hope for efficient query answering, and when is it hopeless? This is a problem of a theoretical nature, which is necessary for understanding the limits of the methods and driving research towards the scenarios where positive results may be obtainable.
A typical result would be to show that constant-delay enumeration of queries is not possible unless the database satisfies property A and the query satisfies property B. Another typical result would be to show that having a robust access control policy satisfying this and that property at the same time is not achievable.
Very few such results exist nowadays. While many problems have been shown decidable or undecidable, charting the frontier of tractability (say, linear time) remains a challenge.
Only when we have understood the limitations of the methods (medium-term goal) and have many examples where this is possible can we hope to design a solid foundation that allows for a good trade-off between what can be done (needs from the users) and what can be achieved (limitations from the system). This will be our long-term goal.
This research axis deals with the modeling and efficient management of data that come with some uncertainty (probabilistic distributions, logical incompleteness, missing values, open-world assumption, etc.) and with provenance information (indicating where the data originates from), as well as with the extraction of uncertainty and provenance annotations from real-world data. Interestingly, the foundations and tools for uncertainty management often rely on provenance annotations. For example, a typical way to compute the probability of query results in probabilistic databases is first to generate the provenance of these query results (in some Boolean framework, e.g., that of Boolean functions or of provenance semirings), and then to compute the probability of the resulting provenance annotation. For this reason, we will deal with uncertainty and provenance in a unified manner.
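The two-step approach described above (first obtain a Boolean provenance annotation, then compute its probability) can be sketched as follows. The formula, the event names, and the probabilities are hypothetical, and the second step is done here by naive enumeration of possible worlds, which is exponential in the number of events and is only meant to make the semantics explicit.

```python
from itertools import product


def probability(formula, event_probs):
    """Probability that `formula` is true, by enumerating possible worlds.

    formula: function taking a dict {event: bool} and returning a bool.
    event_probs: dict {event: probability}; events are assumed independent.
    """
    events = sorted(event_probs)
    total = 0.0
    for values in product([False, True], repeat=len(events)):
        world = dict(zip(events, values))
        # Weight of this world: product over independent events.
        weight = 1.0
        for event, value in world.items():
            p = event_probs[event]
            weight *= p if value else 1.0 - p
        if formula(world):
            total += weight
    return total


# Hypothetical provenance of one query answer: (e1 AND e2) OR e3.
def provenance(world):
    return (world["e1"] and world["e2"]) or world["e3"]


PROBS = {"e1": 0.5, "e2": 0.5, "e3": 0.5}

print(probability(provenance, PROBS))
```

Practical systems avoid this enumeration whenever the structure of the provenance allows it, for instance through knowledge compilation.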
Valda researchers have carried out seminal work on probabilistic databases, provenance management, incomplete information, and uncertainty analysis and propagation in conflicting datasets. These research areas have reached a point where the foundations are well understood, and where it becomes critical, while continuing to develop the theory of uncertain and provenance data management, to move to concrete implementations and applications to real-world use cases.
In the short term, we will focus on implementing techniques from the database theory literature on provenance and uncertainty data management, in the direction of building a full-featured database management add-on that transparently manages provenance and probability annotations for a large class of querying tasks. This work has started recently with the creation of the ProvSQL extension to PostgreSQL, discussed in more detail in the following section. To support this development work, we need to resolve the following research question: what representation systems and algorithms should be used to support both semiring provenance frameworks and extensions to queries with negation, aggregation, or recursion?
Next, we will study how to add support for incompleteness, probabilities, and provenance annotations in the scenarios identified in the first axis, and how to extract and derive such annotations from real-world datasets and tasks. We will also work on the efficiency of our uncertain data management system, and compare it to other uncertainty management solutions, with a view to making it a fully usable system, with little overhead compared to a classical database management system. This requires a careful choice of the provenance representation system used, which should be both compact and amenable to probability computations. We will study practical applications of uncertainty management. As an example, we intend to consider routing in public transport networks, given a probabilistic model on the reliability and schedule uncertainty of different transit routes. The system should be able to provide a user with an itinerary that comes with a (probabilistic) guarantee of reaching her destination within a given time frame, and which may not be the shortest route in the classical sense.
One overall long-term goal is to reach a full understanding of the interactions between query evaluation or other broader data management tasks and uncertain and annotated data models. We would in particular want to go towards a full classification of tractable (typically polynomial-time) and intractable (typically NP-hard for decision problems, or #P-hard for probability evaluation) tasks, extending and connecting the query-based dichotomy on probabilistic query evaluation with the instance-based one.
Another long-term goal is to consider more dynamic scenarios than what has been considered so far in the uncertain data management literature: when following a workflow, or when interacting with intensional data sources, how to properly represent and update uncertainty annotations that are associated with data. This is critical for many complex data management scenarios where one has to maintain a probabilistic current knowledge of the world, while obtaining new knowledge by posing queries and accessing data sources. Such intensional tasks require jointly minimizing data uncertainty and data access cost.
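The public transport example can be illustrated with a deliberately simplified sketch. It assumes that each itinerary is a sequence of legs with independent, discrete travel-time distributions, and it ignores waiting times and missed connections; the itinerary names, durations, and probabilities are invented for the example.

```python
def total_time_distribution(legs):
    """Convolve independent per-leg duration distributions.

    legs: list of dicts {duration_in_minutes: probability}.
    """
    dist = {0: 1.0}
    for leg in legs:
        new = {}
        for t, p in dist.items():
            for d, q in leg.items():
                new[t + d] = new.get(t + d, 0.0) + p * q
        dist = new
    return dist


def on_time_probability(legs, deadline):
    """Probability that the total travel time does not exceed `deadline`."""
    dist = total_time_distribution(legs)
    return sum(p for t, p in dist.items() if t <= deadline)


def expected_time(legs):
    """Expected total travel time of the itinerary."""
    return sum(t * p for t, p in total_time_distribution(legs).items())


def best_itinerary(itineraries, deadline, guarantee):
    """Among itineraries whose on-time probability reaches `guarantee`,
    return the name of the one with the smallest expected travel time,
    or None if no itinerary qualifies."""
    eligible = [
        (expected_time(legs), name)
        for name, legs in itineraries.items()
        if on_time_probability(legs, deadline) >= guarantee
    ]
    return min(eligible)[1] if eligible else None


# Hypothetical itineraries (illustration only).
ITINERARIES = {
    "direct_bus": [{20: 0.75, 80: 0.25}],          # fast on average, risky
    "metro_then_walk": [{25: 1.0}, {20: 1.0}],     # slower, fully reliable
}

print(best_itinerary(ITINERARIES, deadline=50, guarantee=0.9))
```

With a deadline of 50 minutes and a 90% guarantee, the reliable itinerary is selected even though the bus is shorter in expectation (35 minutes against 45), which is the kind of answer that differs from the shortest route in the classical sense.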
This is a more applied direction of research that will be the context to study issues of interest (see discussion in application domains further).
A typical person today usually has data on several devices and in a number of commercial systems that function as data traps where it is easy to check in information and difficult to remove it or sometimes to simply access it. It is also difficult, sometimes impossible, to control data access by other parties. This situation is unsatisfactory because it requires users to trade privacy against convenience, but also because it limits the value we, as individuals and as a society, can derive from the data. This leads to the concept of Personal Information Management System, in short, a Pims.
A Pims runs, on a user's server, the services selected by the user, storing and processing the user's data. The Pims centralizes the user's personal information. It is a digital home. The Pims is also able to exert control over information that resides in external services (for example, Facebook), and that only gets replicated inside the Pims. The advantages of Pims, as well as the issues they raise (e.g., security issues), have been discussed in the literature, where it is argued that the main reason for a user to move to Pims is that these systems enable great new functionalities.
Valda will study in particular the integration of the user's data. Researchers in the team have already provided important contributions in the context of data integration, notably in the context of the Webdam ERC (2009–2013).
Based on such an integration, a Pims can provide functions that go beyond simple query answering:
Global search over the person's data with a semantic layer using a personal ontology (for example, the data organization the person likes and the person's terminology for data) that helps give meaning to the data;
Automatic synchronization of data on different devices/systems, and global task sequencing to facilitate interoperating different devices/services;
Exchange of information and knowledge between "friends" in a truly social way, even if these use different social network platforms, or no platform at all;
Centralized control point for connected objects, a hub for the Internet of Things; and
Data analysis/mining over the person's information.
The focus on personal data and these various aspects raise interesting technical challenges that we intend to address.
In the short term, we intend to continue work on the ThymeFlow system to turn it into an easily extendable and deployable platform for the management of personal information – we will in particular encourage students from the M2 Web Data Management class taught by Serge and Pierre in the MPRI programme to use this platform in their course projects. The goal is to make it easy to add new functionalities (such as new source synchronizers to retrieve data and propagate updates to original data sources, and enrichers to add value to existing data) to considerably broaden the scope of the platform and consequently expand its value.
In the medium term, we will continue the work already started that focuses on turning information into knowledge and on knowledge integration. Issues related to intensionality or uncertainty will in particular be considered, relying on the works produced in the other two research axes. We stress, in particular, the importance of minimizing the cost of data access (or, in specific scenarios, the privacy cost associated with obtaining data items) in the context of personal information management: legacy data is often only available through costly APIs, interaction between several Pims may require sharing information within a strict privacy budget, etc. For these reasons, intensionality of data will be a strong focus of the research.
In the long term, we intend to use the knowledge acquired and machine learning techniques to predict the user's behavior and desires, and support new digital assistant functions, providing real value from data. We will also look into possibilities for deploying the ThymeFlow platform at a large scale, perhaps in collaboration with industry partners.
We recall that Valda's focus is on human-centric data, i.e., data produced by humans, explicitly or implicitly, or more generally containing information about humans. Quite naturally, we will use personal information management systems (Pims for short) as a privileged application area to validate Valda’s results.
A Pims is a system that allows a user to integrate her own data, e.g., emails and other kinds of messages, calendar, contacts, web search, social network, travel information, work projects, etc. Such information is commonly spread across different services. The goal is to give back to a user the control of her information, allowing her to formulate queries such as “What kind of interaction did I have recently with Alice B.?”, “Where were my last ten business trips, and who helped me plan them?”. The system has to orchestrate queries to the various services (which means knowing the existence of these services, and how to interact with them), integrate information from them (which means having data models for this information and its representation in the services), e.g., align a GPS location of the user to a business address or place mentioned in an email, or an event in a calendar to some event in a Web search. This information must be accessed intensionally: for instance, costly information extraction tools should only be run on emails which seem relevant, perhaps identified by a less costly cursory analysis (this means, in turn, obtaining a cost model for access to the different services). Impacted people can be found by examining events in the user's calendar and determining who is likely to attend them, perhaps based on email exchanges or former events' participant lists. Of course, uncertainty has to be maintained along the entire process, and provenance information is needed to explain query results to the user (e.g., indicate which meetings and trips are relevant to each person of the output). Knowledge about services, their data models, and their costs needs either to be provided by the system designer, or to be automatically learned from interaction with these services.
One motivation for that choice is that Pims concentrate many of the problems we intend to investigate: heterogeneity (various sources, each with a different structure), massive distribution (information spread out over the Web, in numerous sources), rapid evolution (new data regularly added), intensionality (knowledge from Wikidata, OpenStreetMap...), confidentiality and security (mostly private data), and uncertainty (very variable quality). Though the data is distributed, its size is relatively modest; other applications may be considered for works focusing on processing data at large scale, which is a potential research direction within Valda, though not our main focus. Another strong motivation for the choice of Pims as application domain is the importance of this application from a societal viewpoint.
A Pims is essentially a system built on top of a user's personal knowledge base; such knowledge bases are reminiscent of those found in the Semantic Web, e.g., linked open data. Some issues, such as ontology alignment, exist in both scenarios. However, there are some fundamental differences in building personal knowledge bases vs collecting information from the Semantic Web: first, the scope is much smaller, as one is only interested in knowledge related to a given individual; second, only a small proportion of the data is already present in the form of semantic information, and most needs to be extracted and annotated through appropriate wrappers and enrichers; third, though the linked open data is meant to be read-only, the only update possible to a user being adding new triples, a personal knowledge base is very much something that a user needs to be able to edit, and propagating updates from the knowledge base to original data sources is a challenge in itself.
The choice of Pims is not exclusive. We intend to consider other application areas as well. In particular, we have worked in the past and have a strong expertise on Web data in a broad sense: semi-structured, structured, or unstructured content extracted from Web databases; knowledge bases from the Semantic Web; social networks; Web archives and Web crawls; Web applications and deep Web databases; crowdsourcing platforms. We intend to continue using Web data as a natural application domain for the research within Valda when relevant. For instance, deep Web databases are a natural application scenario for intensional data management issues: determining if a deep Web database contains some information requires optimizing the number of costly requests to that database.
A common aspect of both personal information and Web data is that their exploitation raises ethical considerations. Thus, a user needs to remain fully in control of the usage that is made of her personal information; a search engine or recommender system that ranks Web content for display to a specific user needs to do so in an unbiased, justifiable manner. These ethical constraints sometimes forbid some solutions that may be technically useful, such as sharing a model learned from the personal data of a user with another user, or using black boxes to rank query results. We fully intend to consider these ethical considerations within Valda. One of the main goals of a Pims is indeed to empower the user with full control over the use of this data.
Keywords: Databases - Provenance - Probability
Functional Description: The goal of the ProvSQL project is to add support for (m-)semiring provenance and uncertainty management to PostgreSQL databases, in the form of a PostgreSQL extension/module/plugin.
News Of The Year: ProvSQL becomes usable for a large range of queries. Support for semirings and m-semirings is present, and support for probability computation has been added through a variety of techniques, including knowledge compilation. Support for where-provenance is currently being implemented.
Participants: Pierre Senellart and Yann Ramusat
Contact: Pierre Senellart
Publication: Provenance and Probabilities in Relational Databases: From Theory to Practice
Keyword: Personal information
Functional Description: ThymeFlow allows in particular the development of plugins for both interacting with existing Web sources and presenting users with rich interfaces and query facilities over their personal information. A preliminary version of ThymeFlow tools has also been deployed on the Cozy Cloud personal cloud system. The model allows the open-source community to contribute individual plugins while we focus on providing users with useful ways to exploit their personal information.
News Of The Year: Minor maintenance.
Participants: David Montoya, Pierre Senellart, Serge Abiteboul and Su Yang
Partner: ENGIE
Contact: Pierre Senellart
Publication: Personal Knowledge Base Systems
Keyword: LaTeX
Functional Description: apxproof is a LaTeX package facilitating the typesetting of research articles with proofs in appendix, a common practice in database theory and theoretical computer science in general. The appendix material is written in the LaTeX code along with the main text which it naturally complements, and it is automatically deferred. The package can automatically send proofs to the appendix, can repeat in the appendix the theorem environments stated in the main text, can section the appendix automatically based on the sectioning of the main text, and supports a separate bibliography for the appendix material.
Release Functional Description: Ability to specify a sectioning counter; compilation fix for proofsketch in inline mode.
News Of The Year: Overall software maintenance. Support for more document classes. Some new features.
Participant: Pierre Senellart
Contact: Pierre Senellart
In many applications the output of a query may be huge, and computing all the answers may already consume too many of the allowed resources. In this case it may be appropriate to first output a small subset of the answers and then, on demand, output a subsequent small number of answers, and so on until all possible answers have been exhausted. To make this even more attractive, it is preferable to minimize the time necessary to output the first answers and, from a given set of answers, also the time necessary to output the next set of answers; this second time interval is known as the delay. We have shown that this is doable with an almost linear preprocessing time and constant enumeration delay for first-order queries over structures having local bounded expansion.
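The split between a preprocessing phase and an enumeration phase with bounded delay can be illustrated on a toy case. The sketch below handles only the simple join query Q(x, y) :- R(x), S(x, y), with R and S assumed to be duplicate-free; it is not the algorithm for first-order queries mentioned above, and the function names `preprocess` and `enumerate_answers` are hypothetical.

```python
from collections import defaultdict
from typing import Dict, Hashable, Iterable, Iterator, List, Tuple

Prepared = List[Tuple[Hashable, List[Hashable]]]


def preprocess(r: Iterable[Hashable],
               s: Iterable[Tuple[Hashable, Hashable]]) -> Prepared:
    """Linear-time preprocessing: index S by its first column and keep
    only the values of R that have at least one matching pair in S."""
    index: Dict[Hashable, List[Hashable]] = defaultdict(list)
    for x, y in s:
        index[x].append(y)
    # dict.fromkeys removes duplicates from R while preserving order.
    return [(x, index[x]) for x in dict.fromkeys(r) if x in index]


def enumerate_answers(prepared: Prepared) -> Iterator[Tuple[Hashable, Hashable]]:
    """Enumeration phase: every kept x has a non-empty list of partners,
    so a bounded number of steps separates two consecutive answers."""
    for x, ys in prepared:
        for y in ys:
            yield (x, y)


if __name__ == "__main__":
    prepared = preprocess([1, 2, 3], [(1, "a"), (1, "b"), (3, "c"), (4, "d")])
    for answer in enumerate_answers(prepared):
        print(answer)
```

Dropping the values of R without a partner during preprocessing is what keeps the delay constant: the generator never has to skip over an unbounded run of dead ends between two answers.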
Issues of responsible data analysis and use are coming to the forefront of the discourse in data science research and practice. The research has been focused on analyzing the fairness, accountability and transparency (FAT) properties of specific algorithms and their outputs. Although these issues are most apparent in the social sciences, where fairness is interpreted in terms of the distribution of resources across protected groups, management of bias in source data affects a variety of fields. Consider climate change studies that require representative data from geographically diverse regions, or supply chain analyses that require data that represents the diversity of products and customers. In a paper, we argue that FAT properties must be considered as database system issues, further upstream in the data science lifecycle: bias in source data goes unnoticed, and bias may be introduced during pre-processing (fairness), spurious correlations lead to reproducibility problems (accountability), and assumptions made during pre-processing have invisible but significant effects on decisions (transparency). As machine learning methods continue to be applied broadly by non-experts, the potential for misuse increases. There is a need for a data sharing and collaborative analytics platform with features to encourage (and in some cases, enforce) best practices at all stages of the data science lifecycle. We describe features of such a platform, which we term Fides, in the context of urban analytics, outlining a systems research agenda in responsible data science.
A major part of the work conducted in Valda has been to study the connections between tractability and structure in databases, in particular uncertain databases.
In a first line of work, we have investigated incompleteness related to order. We have introduced a query language for order-incomplete data, based on the positive relational algebra with order-aware accumulation. We have used partial orders to represent order-incomplete data, and studied possible and certain answers for queries in this context, showing these problems are respectively NP-complete and coNP-complete, but identifying tractable cases depending on query operators and the structure of input partial orders. We also consider a different setting where some partial order is known, but actual values are unknown. Our work is the first to propose a principled scheme to derive the value distributions and expected values of unknown items in this setting, with the goal of computing estimated top-
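The notions of possible and certain answers over order-incomplete data can be sketched by brute force: a property is possible if it holds in some total order compatible with the known partial order, and certain if it holds in all of them. The Python code below enumerates every compatible total order, which takes exponential time and is only meant to fix the definitions; it does not model the query language with accumulation, and the names `linear_extensions`, `possible` and `certain` are assumptions of this sketch.

```python
from itertools import permutations
from typing import Callable, Hashable, Iterable, Iterator, Sequence, Tuple

Order = Tuple[Hashable, ...]
Constraint = Tuple[Hashable, Hashable]  # (a, b) means "a comes before b"


def linear_extensions(items: Sequence[Hashable],
                      constraints: Iterable[Constraint]) -> Iterator[Order]:
    """Yield every total order of `items` compatible with the partial order."""
    constraints = list(constraints)
    for order in permutations(items):
        position = {item: i for i, item in enumerate(order)}
        if all(position[a] < position[b] for a, b in constraints):
            yield order


def possible(items, constraints, prop: Callable[[Order], bool]) -> bool:
    """True if `prop` holds in at least one compatible total order."""
    return any(prop(order) for order in linear_extensions(items, constraints))


def certain(items, constraints, prop: Callable[[Order], bool]) -> bool:
    """True if `prop` holds in every compatible total order."""
    return all(prop(order) for order in linear_extensions(items, constraints))


if __name__ == "__main__":
    items = ["a", "b", "c"]
    known = [("a", "b")]  # only "a before b" is known
    print(possible(items, known, lambda o: o[0] == "a"))
    print(certain(items, known, lambda o: o[0] == "a"))
```

With only "a before b" known, "a is first" is possible but not certain, since the order c, a, b is also compatible.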
We have investigated parameterizations of both database instances and queries that make query evaluation fixed-parameter tractable in combined complexity, first in a setting without uncertainty. For this, we have introduced a new Datalog fragment with stratified negation, intensional-clique-guarded Datalog (ICG-Datalog), with linear-time evaluation on structures of bounded treewidth for programs of bounded rule size. Our result is shown by compiling to alternating two-way automata, whose semantics is defined via cyclic provenance circuits (cycluits) that can be tractably evaluated. Finally, we have moved to the probabilistic setting and shown that probabilistic query evaluation remains intractable in combined complexity under this parameterization.
Finally, a last line of work concerns efficient queries over probabilistic graphs. In a first theoretical work, we have studied the combined complexity of conjunctive query evaluation on probabilistic graphs, which can be alternatively phrased as a probabilistic version of the graph homomorphism problem. We have shown that the complexity landscape is surprisingly rich, using a variety of technical tools. In a more practical work, we have proposed indexing techniques and algorithms to evaluate source-to-target queries in probabilistic graphs, by exploiting their structure. We have shown that these significantly enhance the accuracy and efficiency of existing query evaluation approaches on probabilistic graphs.
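As a point of reference for what a source-to-target query computes, the sketch below evaluates the probability that a target is reachable from a source, assuming each edge is present independently with a given probability. It enumerates all possible worlds, so it is exponential in the number of edges and usable only on tiny graphs; it does not reproduce the indexing techniques mentioned above, and the function names and the example graph are hypothetical.

```python
from itertools import product
from typing import Dict, Hashable, Iterable, Tuple

Edge = Tuple[Hashable, Hashable]


def reachable(edges: Iterable[Edge], source: Hashable, target: Hashable) -> bool:
    """Depth-first search in a deterministic directed graph."""
    adjacency: Dict[Hashable, list] = {}
    for u, v in edges:
        adjacency.setdefault(u, []).append(v)
    seen, stack = {source}, [source]
    while stack:
        node = stack.pop()
        if node == target:
            return True
        for successor in adjacency.get(node, []):
            if successor not in seen:
                seen.add(successor)
                stack.append(successor)
    return False


def reachability_probability(prob_edges: Dict[Edge, float],
                             source: Hashable, target: Hashable) -> float:
    """Sum the probabilities of all possible worlds in which `target`
    is reachable from `source` (edges assumed independent)."""
    edges = list(prob_edges.items())
    total = 0.0
    for kept in product([False, True], repeat=len(edges)):
        weight = 1.0
        world = []
        for (edge, p), keep in zip(edges, kept):
            weight *= p if keep else 1.0 - p
            if keep:
                world.append(edge)
        if reachable(world, source, target):
            total += weight
    return total


if __name__ == "__main__":
    graph = {("s", "a"): 0.5, ("a", "t"): 0.5, ("s", "t"): 0.5}
    print(reachability_probability(graph, "s", "t"))
```

On the example graph, the target is reached either through the direct edge or through the two-edge path, giving 1 - (1 - 0.5)(1 - 0.25) = 0.625.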
Valda has obtained a 10k€ budget from ENS in 2017, as a start-up grant for the team (Action Concertée Incitative).
Inria established a bilateral contract with the Centre – Val de Loire region, for the expertise and audit of a research project by Pierre Senellart. Because of delays due to the company being audited, the expertise is still in progress.
Valda has been part of one ANR project in 2017 (Headwork, budget managed by Inria), together with IRISA (DRUID team, coordinator), Inria Lille (LINKS & SPIRAL), and Inria Rennes (SUMO), and two application partners: MNHN (Cesco) and FouleFactory. The topic is workflows for crowdsourcing. See http://
In addition, another project (BioQOP, budget managed by ENS) will start in January 2018, with Morpho and GREYC, on the optimization of queries for privacy-aware biometric data management.
Valda has strong collaborations with the following international groups:
Peter Buneman and Leonid Libkin
Michael Benedikt, Evgeny Kharlamov, and Georg Gottlob
Thomas Schwentick
Mikołaj Bojańczyk and Szymon Toruńczyk
Daniel Deutch and Tova Milo
Julia Stoyanovich
Victor Vianu
Stéphane Bressan
Victor Vianu, Professor at UC San Diego and holder of an Inria international chair, spent 6 months within Valda: three months employed by Inria and three months as an ENS invited professor.
Debabrota Basu, PhD student at the National University of Singapore, stayed 2.5 months within Valda, to work with Pierre Senellart.
Pierre Senellart has spent around two months at the University of Edinburgh, collaborating with Peter Buneman and Leonid Libkin.
Pierre Senellart has spent a cumulative total of more than one month at the National University of Singapore, co-advising Debabrota Basu, PhD student working under the co-supervision of Stéphane Bressan.
Serge Abiteboul, organization of Personal Analytics & Privacy workshop, joint with ECML-PKDD 2017, Skopje, Macedonia
Serge Abiteboul, scientific organization of colloquium on La communauté scientifique face au renseignement, École militaire, Paris, France
Serge Abiteboul, organization of colloquium on Les enjeux scientifiques de l'éthique du numérique, Académie des sciences
Pierre Senellart, organization of ParisBD 2017, Télécom ParisTech, Paris, France
Pierre Senellart, co-organizer of ACM-ICPC Southwestern Europe 2017 competition
Pierre Senellart, BDA 2017 (French conference on data management)
Pierre Senellart, WebDB workshop, joint with SIGMOD 2017
Pierre Senellart, Gems of PODS 2017 committee
Pierre Senellart, SIGMOD 2017 (distinguished PC member), ICDT 2017, EDBT 2018
Pierre Senellart, Journal of the ACM, VLDB Journal, Artificial Intelligence
Serge Abiteboul, keynote at PPDP-LOPSTR, Namur, Belgium
Serge Abiteboul, keynote at ETAPS, Uppsala, Sweden
Serge Abiteboul, keynote at Law & Big Data Conference, Paris, France
Serge Abiteboul is a member of the French Academy of Sciences, of the Academia Europaea, and of the scientific council of the Société Informatique de France.
Pierre Senellart, ANR, NSF
Serge Abiteboul was the president of the Dune jury (Développement d'universités numériques expérimentales)
Serge Abiteboul participated in the NCU jury (nouveaux cursus à l'université)
Serge Abiteboul contributed to the report on Éthique de la recherche en apprentissage machine of Cerna-Allistene
Serge Abiteboul is co-chair of the “Committee on Gender Equality and Equal Opportunities” of Inria.
Luc Segoufin is a member of the CNHSCT of Inria.
Pierre Senellart is a member of the board of section 6 of the National Committee for Scientific Research.
Pierre Senellart is vice-director of the DI ENS laboratory, joint between ENS, CNRS, and Inria.
Licence: Pierre Senellart, Databases, 54 heqTD, L3, École normale supérieure
Licence: Pierre Senellart, Algorithms, 18 heqTD, L3, École normale supérieure
Master: Serge Abiteboul & Pierre Senellart, Web data management, 36 heqTD, M2, MPRI
Master: Luc Segoufin, Logic, descriptive complexity and database theory, 36 heqTD, M2, MPRI
Pierre Senellart has various teaching responsibilities (L3 internships, M2 internships, M2 administration) at ENS.
Serge Abiteboul offered, with Benjamin Nguyen and Philippe Rigaux, a second session of the MOOC “Bases de données relationnelles: comprendre pour maîtriser” (FUN). He offered, with Julia Stoyanovich, a course on “Ethical data management” at the EDBT Summer School, Genova, 2017.
PhD: David Montoya, Une base de connaissance personnelle intégrant les données d'un utilisateur et une chronologie de ses activités, Université Paris-Saclay, 6 March 2017, Serge Abiteboul & Pierre Senellart
PhD in progress: Debabrota Basu, Reinforcement learning applications to data management problems, started in 2015, Stéphane Bressan & Pierre Senellart
PhD in progress: Julien Grange, Graph properties: order and arithmetic in predicate logics, started in 2017, Luc Segoufin
PhD in progress: Miyoung Han, Learning approaches to dynamic data management, started in 2015, Pierre Senellart
PhD in progress: Quentin Lobbé, Diachronic analysis of diaspora communities through web archives enrichment, started in 2015, Pierre Senellart & Dana Diminescu
PhD in progress: Mikaël Monet, Efficient querying of large uncertain graphs by exploiting their structure, started in 2015, Pierre Senellart & Antoine Amarilli
PhD in progress: Karima Rafes, Security and management of personal data in the Web of things, started in 2015, Serge Abiteboul & Sarah Cohen-Boulakia
PhD in progress: Alexandre Vigny, Query enumeration on nowhere-dense graphs, started in 2015, Luc Segoufin & Arnaud Durand
PhD Paul Lagrée, October 2017, Université Paris-Saclay, Pierre Senellart
Serge Abiteboul is involved in several popular science activities. He founded and runs the blog http://
Serge Abiteboul is the president of the strategic committee of the Blaise Pascal foundation for scientific mediation.