Our best results of the year appeared in highly visible and selective venues: automated recommendation of materialized XML views at the ACM SIGMOD conference , XML query-update independence and RDF materialized view selection at the VLDB 2012 conference, and scalable duplicate detection in IEEE TKDE .
On the national scientific stage, our team has invested significant effort in the recently accepted LabEx DigiCosme proposal, in which I. Manolescu coordinates the “Scalable and secure data management” task, and in the national database conference (BDA), where I. Manolescu served as Program Committee chair while Nicole Bidoit and François Goasdoué served on the Program Committee.
Significant prototype development effort was also invested, leading in particular to the Amada and Nautilus software demonstrations at ACM CIKM .
The development of Web technologies has led to a strong increase in the number and complexity of applications that represent their data in Web formats, among which XML (for structured documents) and RDF (for Semantic Web data) are the most prominent. Oak has carried out research on algorithms and systems for efficiently processing expressive queries over such Web data formats. We have considered the efficient management of XML and RDF data, both for query evaluation and for efficiently applying updates, possibly concurrently with queries. We have also started investigating multidimensional data analysis within RDF data warehouses.
For applications that integrate such Web data from various sources, we developed efficient and effective techniques to automatically recognize multiple representations of the same real-world object. Specifically, we devised main-memory algorithms for hierarchical data, as well as algorithms for graph data that leverage off-the-shelf database management systems and parallelization to achieve both efficiency and scalability beyond main memory .
We have recently started to work on the efficient management of complex Web data, in particular structured XML documents and Semantic Web data in the form of RDF, in a cloud-based data management platform. We have investigated architectures and algorithms for storing Web data in elastic cloud-based stores and for indexing such data within the efficient key-value stores provided by off-the-shelf cloud data management platforms. We have devised and prototyped such platforms for both XML and RDF data, and started experimenting with them on the Amazon Web Services (AWS) platform , , .
With the increasing complexity of data processing queries, for instance in applications such as relational data analysis or the integration of Web data (e.g., XML or RDF), comes the need to better manage complex data transformations. This includes systematically verifying, maintaining, and testing the transformations an application relies on. In this context, Oak has focused on verifying the semantic correctness of a declarative program that specifies a data transformation, e.g., an SQL query. To this end, we have investigated how to leverage data provenance (information on the origin of data and on the query operators that produced a result) for query debugging. More specifically, we developed and implemented algorithms to explain unexpected results produced by a query (why-provenance) as well as expected results that are missing from the query result (why-not provenance). Results have been presented in the form of a software prototype .
This concerns archiving filtered content from online information sources (journals, blogs,
Open data intelligence: the goal is to build and efficiently exploit warehouses of Open Data, integrated from several data sources on the Web, in order to deliver consolidated, rich information to decision makers and citizens. Such projects have been started in France, notably by the city areas of Rennes (a pioneer in Open Data usage) and Paris, and more recently by Grenoble, in a project in which we participate; the ICT Labs DataBridges activity also focused on such topics. The Oak competencies required for such projects relate to large-scale RDF data management as well as to the design of innovative data models for semantically rich content.
In a cloud environment, data catalogs and indices need to be efficiently built and maintained, typically in parallel; queries need to be routed only to those data subsets likely to lead to results, and efficiently executed, using parallelism and the available indexes. We work on such topics within the Europa ICT Labs activity. Algorithms for efficiently handling indexes and views in the cloud for heterogeneous, complex-structure data are also at the core of our work proposal in the Datalyse project (see below). Our 2012 work in this context has led to , , .
Companies seek to gather as much information as possible about their customers, for instance for more targeted advertising campaigns, market analysis, offer personalization, etc. That is, they want to consider data beyond what they have collected about their customers in their proprietary databases. Complementary data includes, for instance, social data extracted from customers' activities in social networks, and public data related to their place of residence (e.g., crime rate or housing price evolution). To achieve this goal, data integration over highly heterogeneous and massive data is necessary. Furthermore, as one means to assess both the correctness of the integration result and the quality (in this application, most notably trustworthiness) of the data, we can resort to data provenance. These topics are explored in the Datalyse project, which we submitted in 2012 to the French national “AAP Cloud 3: Big Data” call; the project is headed by the “Business & Decision” company and its evaluation will continue in early 2013.
Amada (https://
Jesús Camacho-Rodríguez (jesus.camacho-rodriguez@inria.fr)
Zoi Kaoudi (zoi.kaoudi@inria.fr), Ioana Manolescu (ioana.manolescu@inria.fr), Dario Colazzo (dario.colazzo@lri.fr), François Goasdoué (fg@lri.fr)
A platform for Web data management in the Amazon cloud
Nautilus Analyzer (http://
Melanie Herschel (melanie.herschel@lri.fr)
n.a.
A tool for analyzing and debugging SQL queries using why-provenance and why-not provenance.
RDFViewS (http://
Konstantinos Karanasos (konstantinos.karanasos@inria.fr)
François Goasdoué (fg@lri.fr), Julien Leblay (julien.leblay@inria.fr), and Ioana Manolescu (ioana.manolescu@inria.fr)
A storage tuning wizard for RDF applications
ViP2P (views in peer-to-peer, http://
Ioana Manolescu (ioana.manolescu@inria.fr)
Jesús Camacho-Rodríguez (jesus.camacho-rodriguez@inria.fr), Asterios Katsifodimos (asterios.katsifodimos@inria.fr), Konstantinos Karanasos (konstantinos.karanasos@inria.fr)
A P2P platform for disseminating and querying XML and RDF data in large-scale distributed networks.
XUpOp (XML Update Optimization)
Dario Colazzo (colazzo@lri.fr)
Nicole Bidoit (bidoit@lri.fr), Marina Sahakian (Marina.Sahakyan@lri.fr), and Mohamed Amine Baazizi (baazizi@lri.fr)
A general-purpose, type-based optimizer for XML updates
XUpIn (XML Update Independence)
Federico Ulliana (Federico.Ulliana@lri.fr)
Dario Colazzo (colazzo@lri.fr), Nicole Bidoit (bidoit@lri.fr)
An XML query-update independence tester
XUpTe (XML Update for Temporal documents)
Dario Colazzo (colazzo@lri.fr)
Nicole Bidoit (bidoit@lri.fr), Mohamed-Amine Baazizi (amine.baazizi@gmail.com)
A type-based optimizer for representing and updating temporal XML data
XPUQ (XML Partitioning for Updates and Queries)
Dario Colazzo (colazzo@lri.fr)
Nicole Bidoit (bidoit@lri.fr), Noor Malla (noorwm@hotmail.com)
A static analyzer and partitioner for XML queries and updates
We addressed the problem of detecting independence between XML queries and updates. Since the problem is undecidable for XQuery queries and updates, and intractable even for restricted fragments, we adopted an approximation technique based on schema-based static analysis. Our analysis turned out to be precise and, at the same time, fast to run. The main results of this research line have been published in , while the complete study is reported in Federico Ulliana's PhD thesis (defended on December 12) .
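To give a flavor of such a static analysis (this is a toy illustration only, not the type system of the paper; the schema encoding and labels are hypothetical), one can approximate independence by comparing the sets of schema types a query may read with those an update may write:

```python
def may_touch(schema, root, targets):
    """Collect the schema types reachable from `root` that appear in `targets`.
    `schema` maps an element type to its allowed child types (a toy stand-in
    for a real XML type language)."""
    reachable, stack = set(), [root]
    while stack:
        t = stack.pop()
        if t in reachable:
            continue
        reachable.add(t)
        stack.extend(schema.get(t, []))
    return reachable & set(targets)

def independent(schema, root, query_targets, update_targets):
    """Sound approximation: if the types the query may read and the types the
    update may write are disjoint, the update cannot change the query result.
    An overlap only means 'possibly dependent' (the analysis over-approximates)."""
    q = may_touch(schema, root, query_targets)
    u = may_touch(schema, root, update_targets)
    return q.isdisjoint(u)

# Hypothetical schema: a library of books with titles and prices.
schema = {"lib": ["book"], "book": ["title", "price"]}
```

A query reading only titles is thus provably independent of an update that rewrites prices, while two operations touching the same type are conservatively flagged as possibly dependent.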
To address the problem of manipulating large XML documents via main-memory XQuery engines, widely used for their efficiency and ease of integration in a programming environment, we developed partitioning techniques for both XQuery queries and updates. Our technique is based on a static analysis of queries and updates (no schema is used) that infers the information used to partition the input document in a streaming fashion. Besides allowing existing main-memory systems to scale up in terms of query/update input size, our technique also admits a MapReduce implementation. The main results have been published in , while the complete study is reported in Noor Malla's PhD thesis (defended on September 21) .
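A minimal sketch of the partitioning idea, simplified to a flat sequence of records rather than a real XML stream (the size budget stands in for the main-memory engine's capacity): the input is cut into parts that each fit the budget, each part is processed independently, and the per-part results are concatenated.

```python
def partition(records, budget):
    """Cut a sequence of records into consecutive parts whose total size
    stays within `budget` (each record is assumed to fit on its own)."""
    parts, current, size = [], [], 0
    for r in records:
        if current and size + len(r) > budget:
            parts.append(current)
            current, size = [], 0
        current.append(r)
        size += len(r)
    if current:
        parts.append(current)
    return parts

def evaluate_partitioned(records, budget, query):
    """For queries that inspect each record independently, evaluating per
    part and concatenating equals evaluating over the whole input."""
    results = []
    for part in partition(records, budget):
        results.extend(query(part))
    return results
```

The static analysis in the actual work serves precisely to determine, before execution, whether a given query or update has this "per-part" behavior and where the document may be safely cut.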
We also tackled the problem of safely manipulating JSON data. Typed, MapReduce-based programming languages for manipulating JSON data have recently been proposed. However, the problem of inferring a schema for untyped JSON data was still open, and having a schema for the manipulated data is fundamental for the aforementioned programming languages. We started investigating techniques able to deal with massive JSON data sets. To ensure efficiency, our technique is based on MapReduce, while to ensure precision and conciseness it adopts type rewriting rules able to: i) compact intermediate inferred types as much as possible, and ii) avoid gross approximation when compacting types. Preliminary results are quite encouraging and appeared in .
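The flavor of such type rewriting rules can be sketched as follows (a much-simplified model, not our actual rules; the "optional" and "union" constructors are illustrative). Record types are merged field-wise, so a large collection of similar objects compacts into a single schema, while genuinely different shapes fall back to a union rather than being crudely approximated:

```python
def infer(value):
    """Infer a compact structural type for one JSON value."""
    if isinstance(value, dict):
        return ("record", tuple(sorted((k, infer(v)) for k, v in value.items())))
    if isinstance(value, list):
        t = None
        for v in value:
            t = infer(v) if t is None else merge(t, infer(v))
        return ("array", t)
    return ("atom", type(value).__name__)

def merge(t1, t2):
    """Compaction rule: merge two record types field-wise; fields present on
    only one side become optional; incompatible shapes become a union."""
    if t1 == t2:
        return t1
    if t1[0] == t2[0] == "record":
        d1, d2 = dict(t1[1]), dict(t2[1])
        fields = []
        for k in sorted(set(d1) | set(d2)):
            if k in d1 and k in d2:
                fields.append((k, merge(d1[k], d2[k])))
            else:
                fields.append((k, ("optional", d1.get(k) or d2.get(k))))
        return ("record", tuple(fields))
    return ("union", (t1, t2))
```

In a MapReduce setting, `infer` would run in the map phase and `merge`, being associative on similar shapes, in the combine/reduce phases.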
Considerable energy is spent towards enriching XML data on the web with semantics through annotations. These annotations can range from simple metadata to complex semantic relationships between data items. Although the vision of supporting such annotations is spreading, it still lacks the infrastructure that will enable it. To this end we have proposed a framework enabling the storage and querying of annotated documents. We have introduced (i) the XR data model, in which annotated documents are XML documents described by RDF triples and (ii) the query language XRQ to interrogate annotated documents through their structure and their semantics. A prototype platform XRP for the management of annotated documents has also been developed, to show the relevance of our approach through experiments .
A promising method for efficiently querying RDF data consists of translating SPARQL queries into efficient RDBMS-style operations. However, answering SPARQL queries requires handling RDF reasoning, which must be implemented outside the relational engines that do not support it. We have introduced the database (DB) fragment of RDF, going beyond the expressive power of previously studied RDF fragments. Within this fragment, we have devised novel sound and complete techniques for answering Basic Graph Pattern (BGP) queries, exploring the two established approaches for handling RDF semantics, namely reformulation and saturation. In particular, we have focused on handling database updates within each approach and proposed a method for incrementally maintaining the saturation; updates raise specific difficulties due to the rich RDF semantics. Our techniques have been designed to be deployed on top of any RDBMS(-style) engine, and we have experimentally studied their performance trade-offs , , .
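The duality between the two approaches can be illustrated on a tiny fragment (only subclass reasoning; the actual DB fragment handles much richer entailment, and these helper names are ours). Saturation materializes all implied triples up front, whereas reformulation rewrites the query and evaluates it on the raw data; both yield the same answers:

```python
SUBCLASS, TYPE = "rdfs:subClassOf", "rdf:type"

def saturate(triples):
    """Forward-chain until fixpoint:
       (x type C) + (C subClassOf D)         => (x type D)
       (C subClassOf D) + (D subClassOf E)   => (C subClassOf E)"""
    closed = set(triples)
    changed = True
    while changed:
        changed = False
        new = set()
        for (s, p, o) in closed:
            if p == SUBCLASS:
                for (s2, p2, o2) in closed:
                    if p2 == TYPE and o2 == s:
                        new.add((s2, TYPE, o))
                    if p2 == SUBCLASS and s2 == o:
                        new.add((s, SUBCLASS, o2))
        if not new <= closed:
            closed |= new
            changed = True
    return closed

def reformulate(cls, triples):
    """Rewrite 'instances of cls' into the union of cls and its subclasses."""
    classes, changed = {cls}, True
    while changed:
        changed = False
        for (s, p, o) in triples:
            if p == SUBCLASS and o in classes and s not in classes:
                classes.add(s)
                changed = True
    return classes

def answer_type_query(cls, triples):
    # Reformulation-based answering: evaluate the rewritten query on raw data.
    classes = reformulate(cls, triples)
    return {s for (s, p, o) in triples if p == TYPE and o in classes}
```

Saturation pays at update time (the closure must be maintained), reformulation at query time (the rewritten query is larger); this is exactly the trade-off studied experimentally.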
We addressed the problem of detecting multiple heterogeneous representations of a real-world object (often referred to as record linkage, duplicate detection, or entity resolution) in two contexts: hierarchical data, and data where relationships between entities form a graph.
Concerning XML entity resolution, we contributed to a novel algorithm that uses a Bayesian network to determine the probability of two XML elements being duplicates. The probability is based both on content and on the structural information given by the hierarchical XML model. To efficiently evaluate the Bayesian network when finding duplicates, we devised two pruning techniques. Whereas the first is lossless in that it does not lose any true duplicates, the second pruning heuristic trades a somewhat lower accuracy of the duplicate detection result for a shorter runtime. An experimental evaluation shows that the proposed solutions outperform other state-of-the-art XML duplicate detection methods .
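To give a flavor of the lossless pruning idea (on a deliberately simplified scoring model, not the actual Bayesian network): if the duplicate score combines content similarity with child evidence, an upper bound on the child contribution tells us when evaluating the children cannot possibly push the score over the threshold:

```python
def dup_score(content_sim, child_probs, w=0.5):
    """Toy score: weighted mix of the elements' content similarity and the
    average duplicate probability of their matched children (all in [0, 1])."""
    child = sum(child_probs) / len(child_probs) if child_probs else 0.0
    return w * content_sim + (1 - w) * child

def can_prune(content_sim, threshold, w=0.5):
    """Lossless pruning test: the child contribution is at most (1 - w), so if
    even perfect child evidence cannot reach the threshold, the (expensive)
    child evaluation can safely be skipped without losing true duplicates."""
    return w * content_sim + (1 - w) < threshold
```

The lossy heuristic of the paper goes further, also skipping pairs whose bound is merely unlikely (rather than impossible) to reach the threshold.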
As for duplicate detection in entity graphs, we defined a general framework for algorithms tackling this problem. The general process consists of three steps, namely retrieval, classification, and update. We then proposed an algorithm complying with the framework that leverages an off-the-shelf relational database to store and efficiently query the information (both data and relationships) relevant for duplicate classification. We further extended our framework and algorithm to allow for parallel and batched processing. Our experimental validation, on data up to two orders of magnitude larger than that considered by other state-of-the-art algorithms, showed that the proposed methods scale duplicate detection in entity graphs to large data volumes .
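The three-step process can be sketched as a work-queue loop (a schematic in-memory version; in the actual system the candidate pairs live in a relational database and the loop is parallelized and batched). The key point is the update step: classifying a pair as a duplicate re-enqueues related pairs, since the new evidence may now let them be classified as duplicates too:

```python
from collections import deque

def deduplicate(candidates, classify, related):
    """Retrieve-classify-update loop over candidate pairs.
    `classify(pair, duplicates)` decides given the duplicates found so far;
    `related(pair)` yields pairs whose classification may change as a result."""
    queue = deque(candidates)        # retrieval step (here: precomputed)
    duplicates = set()
    while queue:
        pair = queue.popleft()
        if pair in duplicates:
            continue
        if classify(pair, duplicates):        # classification step
            duplicates.add(pair)
            for dep in related(pair):         # update step: propagate evidence
                if dep not in duplicates:
                    queue.append(dep)
    return duplicates
```

Termination holds because the set of duplicates only grows and is bounded by the candidate space.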
Data warehousing (DW) research has led to a set of tools and techniques for efficiently analyzing large amounts of multi-dimensional data. As more data gets produced and shared in RDF, analytic concepts and tools for analyzing such irregular, graph-shaped, semantically rich data are needed. We have introduced the first all-RDF model for warehousing RDF graphs. Notably, we have defined RDF analytical schemas, themselves full RDF graphs, and RDF analytical queries, corresponding to relational DW star/snowflake schemas and cubes. We have shown how RDF OLAP operations can be performed on our RDF cubes. We have also performed experiments validating the practical interest of our approach.
We investigate architectures for storing Web data (in particular, XML documents and RDF graphs) on commercial cloud platforms. In particular, we have developed the AMADA platform, which operates in a Software as a Service (SaaS) approach, allowing users to upload, index, store, and query large volumes of Web data. Since cloud users incur monetary costs directly tied to their consumption of cloud resources, we focus on indexing content in the cloud. We study the applicability of several indexing strategies, and show that they not only reduce query evaluation time but also, importantly, the monetary costs associated with the exploitation of the cloud-based warehouse , , .
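A toy cost model illustrates why indexing can pay off monetarily and not just in latency (the prices and the selectivity notion below are hypothetical, not AWS's actual pricing): the index adds storage cost but shrinks the fraction of the data each query must read.

```python
def total_cost(queries, data_gb, storage_per_gb, io_per_gb, index=None):
    """Hypothetical monthly cost: pay storage for the data (plus the index,
    if any) and I/O for the fraction of the data each query actually reads.
    `index["selectivity"]` is the average fraction of data a query touches
    when routed through the index."""
    storage = data_gb * storage_per_gb
    scan_fraction = 1.0
    if index:
        storage += index["size_gb"] * storage_per_gb
        scan_fraction = index["selectivity"]
    io = queries * data_gb * scan_fraction * io_per_gb
    return storage + io
```

Under such a model, an index is worthwhile above some query-rate break-even point, which is exactly the kind of trade-off the indexing-strategy study quantifies.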
When developing data transformations – a task omnipresent in applications like data integration, data migration, data cleaning, or scientific data processing – developers quickly face the need to verify the semantic correctness of the transformation. Declarative specifications of data transformations, e.g., in SQL or ETL tools, increase developer productivity but usually provide limited or no means for inspection or debugging. In this situation, developers today have no choice but to manually analyze the transformation and, in case of an error, to (repeatedly) fix and test it.
The above observations call for a more systematic management of data transformations. Within Oak, we have so far focused on the first phase of the process described above, namely the analysis phase. Leveraging results obtained in previous years (by us and others), we solidified the theory of why-not provenance. Analogously to the distinction between different types of why-provenance, we defined three types of why-not provenance. For each of the three types, we surveyed the semantics employed by different approaches, e.g., set vs. bag semantics or existential vs. universal quantification. We also identified cases of implication and equivalence between why-not provenance of different types. We leveraged this theoretical work in the design of a novel algorithm that, pending further optimization, implementation, and validation, has the potential to overcome usability and efficiency limitations of previous algorithms. Furthermore, we implemented different approaches for why-provenance and why-not provenance and included them in the Nautilus Analyzer, a system prototype for declarative query debugging. We demonstrated this prototype at CIKM 2012 .
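The intuition behind instance-based why-not provenance can be sketched as follows (a drastically simplified model of the Nautilus algorithms; the operator names and record format are illustrative): replay the query's operator pipeline on the source rows compatible with the missing answer, and report the first operator that eliminates all of them.

```python
def why_not(source, pipeline, compatible):
    """Return the name of the first operator in `pipeline` (an ordered list
    of (name, predicate) filters) that discards every source row compatible
    with the missing answer, or None if some compatible row survives."""
    survivors = [row for row in source if compatible(row)]
    for name, predicate in pipeline:
        survivors = [row for row in survivors if predicate(row)]
        if not survivors:
            return name  # the operator responsible for the absence
    return None
```

For example, asking why item 42 is missing from a query selecting cheap, in-stock items points at the price filter when the item is too expensive.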
A collaboration grant is ongoing with DataPublica, which started based on our common work on Linked Data for Digital Cities.
This Digiteo DIM LSC (Logiciels et Systèmes Complexes) project started in October 2011. Its aim is to design and implement data warehouse-style models and technologies for RDF data. The project supports the PhD scholarship of A. Roatis.
The ANR Codex project (Coordination, dynamicity and efficiency for XML, 2009-2012) has ended; the final review took place in Lyon in January 2012. The project was coordinated by Ioana Manolescu; Nicole Bidoit, Dario Colazzo and François Goasdoué also participated.
The ANR DataBridges project (Data integration for digital cities, 2011-2012) has ended; the final review took place in Paris in September 2012. The project was coordinated by Ioana Manolescu; François Goasdoué also participated.
The ANR ConnectedCities project (Clouds for digital cities, 2011-2012) has ended; the final review took place in Paris in September 2012. Dario Colazzo, François Goasdoué and Ioana Manolescu participated in the project.
The ANR DataRing project (Massive data management in peer-to-peer, 2009-2012) has ended; the final review took place in Lyon in January 2012. Ioana Manolescu participated in the project.
Program: KIC EIT ICT Labs
Project acronym: DataBridges
Project title: Data Integration for Digital Cities
Duration: January 2012 - December 2012
Coordinator: Ioana Manolescu
Other partners: Université Paris Sud (France), Technical University of Delft (The Netherlands), DFKI (Germany), Aalto University (Finland), KTH (Sweden), Alcatel-Lucent Bell Labs (France), DataPublica (France)
Abstract: DataBridges work focuses on two main topics: (
Program: KIC EIT ICT Labs
Project acronym: Europa
Project title: Efficient cloud-based data management
Duration: January 2012 - December 2012
Coordinator: Volker Markl (Technical Univ. Berlin)
Other partners: Université Paris Sud (France), Technical University of Delft (The Netherlands), DFKI (Germany), Aalto University (Finland), SICS (Sweden)
Abstract: Europa aims at developing techniques for large-scale efficient data management based on a cloud (massively parallel) processing paradigm. Within Europa, we have finalized the Amada platform, and our ongoing work focuses on an algebraic translation framework from XQuery into PACT programs. PACT is the parallel data processing language proposed by the Berlin partner.
We have been visited by:
Prof. Paolo Atzeni (Università Roma Tré), in June
Prof. Alin Deutsch (UCSD, USA) in June-July (Digiteo invited scientist)
Prof. Evi Pitoura (University of Ioannina, Greece), in October
Prof. Vassilis Christophides (FORTH, Greece) in December
Prof. Themis Palpanas (University of Trento, Italy) in December
Prof. Yanlei Diao (U. Massachusetts at Amherst, USA) in December
Three students visited the team within the Inria Internship program: Karan Aggaral, Abishek Choudhary and Kuldeep Reddy.
M. Herschel is the guest editor of the special issue on Data Integration of the journal it-Information Technology, Volume 54, Issue 3, 2012.
I. Manolescu is the editor in chief of the ACM SIGMOD Record, an associate editor of the ACM Transactions on Internet Technologies, and a member of the “Experiments and Analysis” track of PVLDB.
Members of the project have been chairs of scientific events:
Ioana Manolescu
Bases de Données Avancées (BDA) 2012
IEEE International Conference on Data Engineering (ICDE) 2012, “Semistructured Data, XML and RDF” track
Extending Database Technologies (EDBT) 2012 conference, Tutorial track
Co-chair of the VLDB 2012 PhD workshop
Members of the project have participated in program committees:
Nicole Bidoit
Bases de données Avancées (BDA) 2012
Dario Colazzo
IEEE International Conference on Data Engineering (ICDE) 2012
François Goasdoué
Bases de Données Avancées (BDA) 2012
IEEE International Conference on Tools with Artificial Intelligence (ICTAI) 2012
Melanie Herschel
IEEE International Conference on Data Engineering (ICDE) 2012
Conference on Very Large Databases (VLDB) 2012
VLDB PhD Workshop 2012
Bases de Données Avancées (BDA) 2012
Ioana Manolescu
ACM SIGMOD Conference 2012
IEEE International Conference on Data Engineering (ICDE), “Cloud, Data Warehousing, and Large Data” track
Data Analytics in the Cloud Workshop, in collaboration with EDBT/ICDT 2012
Non-conventional Data Access Workshop, in collaboration with the ER conference 2012
Cloud Intelligence Workshop, in collaboration with VLDB 2012
Members of the project participate in steering committees:
Nicole Bidoit
École thématique Masse de Données Distribuées (thematic school on massive distributed data), Aussois, May 27 to June 1, 2012
Ioana Manolescu
Workshop on Open Data (WOD) 2012
Ioana Manolescu gave a keynote presentation titled “Triples with a purpose” at the Semantic Search Workshop (SSW) next to the VLDB Conference 2012.
Licence, Nicole Bidoit, Bases de données, 25.5h éq. TD, L3, Université Paris-Sud, France
Master, Nicole Bidoit, SGBD relationnels: implémentation, 18 éq. TD, M2, Université Paris-Sud, France
Master, Nicole Bidoit, Bases de données et Systèmes d'Information, 27 éq. TD, M2, Université Paris-Sud, France
Master, Nicole Bidoit, Mise à Niveau en Informatique - Bases de Données, 40 éq. TD, M1, Université Paris-Sud, France
Master, Nicole Bidoit, Base de données Avancées, 30 éq. TD, M1, Université Paris-Sud, France
Master, Nicole Bidoit, Données et connaissances pour le WEB, 32 éq. TD, M2, Université Paris-Sud, France
Master : Dario Colazzo, SGBD relationnels: tuning d'applications, 21h éq. TD, M2, Université Paris-Sud, France
Master : Dario Colazzo, Bases de données, 21h éq. TD, M2, Université Paris-Sud, France
Master : Dario Colazzo, Systèmes de Gestion de Bases de Données, 54h éq. TD, Polytech, Université Paris-Sud, France
Master : Dario Colazzo, Base de Données, 21h éq. TD, Polytech, Université Paris-Sud, France
Master : Dario Colazzo, Base de Données Avancées, 31h éq. TD, Polytech, Université Paris-Sud, France
Master : Dario Colazzo, Mise à niveau bases de données, 17h éq. TD, M1, Université Paris-Sud, France
Master : Dario Colazzo, Gestion des données sur Internet, 18h éq. TD, M1, Université Paris-Sud, France
Master : Dario Colazzo, Base de Données, 20h éq. TD, Polytech, Université Paris-Sud, France
Master : Dario Colazzo, Tutorat d'apprentis, 20h eq. TD, Polytech, Université Paris-Sud, France
Licence : François Goasdoué, Bases de données, 62,5h éq. TD, L3, Université Paris-Sud, France
Master : François Goasdoué, Web Sémantique, 74h éq. TD, M2, Université Paris-Sud, France
Master : François Goasdoué, Données et connaissances pour le Web, 7,5h éq. TD, M2, Université Paris-Sud, France
Master : François Goasdoué, Modèles de raisonnement distribué, 4,5h éq. TD, M2, Université Paris-Sud, France
Master : François Goasdoué, Décision et raisonnement, 9h éq. TD, M2, Université Paris-Dauphine, France
Master : François Goasdoué, Ontologies et Web Sémantique, 13,5h éq. TD, M2, Université Paris-Dauphine, France
Master: Melanie Herschel, Entrepôts de données et requêtes OLAP, 99.5h éq. TD, M2, Université Paris-Sud, France
Master: Melanie Herschel, Intégration de données et Web sémantique, 4,5h éq. TD, M2, Université Paris-Sud, France
Master : Ioana Manolescu, Données Sémi-Structurées et XML, 18h éq. TD M2, Université Paris-Sud, France
Master : Ioana Manolescu, Services Web, 27 éq. TD, M2, Université Paris-Dauphine, France
PhD & HdR :
HdR : François Goasdoué, Knowledge Representation meets DataBases for the sake of ontology-based data management, Univ. Paris-Sud, 11/07/2012
PhD : Mohamed-Amine Baazizi, Static Analysis for the Optimization of Temporal XML Document Updates, Univ. Paris-Sud, 07/09/2012, Nicole Bidoit and Dario Colazzo
PhD : Noor Malla, Partitioning XML data, towards distributed and parallel management, Univ. Paris-Sud, 21/09/2012, Nicole Bidoit and Dario Colazzo
PhD : Federico Ulliana, Types for Detecting XML Query-Update Independence, Univ. Paris-Sud, 12/12/2012, Nicole Bidoit and Dario Colazzo
PhD : Konstantinos Karanasos, View-Based Techniques for the Efficient Management of Web Data, Univ. Paris-Sud, 29/06/2012, François Goasdoué and Ioana Manolescu
PhD in progress: Asterios Katsifodimos, Efficient Distributed Views for XML Data Management, Univ. Paris-Sud, 29/06/2012, Ioana Manolescu
PhD in progress: Julien Leblay, Efficient management of annotated documents, 01/10/2010, François Goasdoué and Ioana Manolescu
PhD in progress: Alexandra Roatis, Efficient SPARQL query processing with OLAP extensions for RDF warehouses, 01/09/2011, Dario Colazzo, François Goasdoué and Ioana Manolescu
PhD in progress: Aikaterini Tzompanaki, Foundations and Algorithms to Compute the Provenance of Missing Data, 1/11/2012, Melanie Herschel and Nicole Bidoit
PhD in progress: Stamatis Zampetakis, Scalable algorithms for cloud-based semantic web data management, 15/10/2012, François Goasdoué and Ioana Manolescu
Nicole Bidoit, PhD defense committee of François Hantry, “Problèmes basés sur les Causes dans le cadre de la Logique Temporelle Linéaire : Théorie et Applications”, Univ. Lyon 1, 17/09/2012.
Dario Colazzo, PhD defense committee of Bogdan Butnaru, “Optimizations of XQuery in peer-to-peer distributed XML databases”, Univ. Versailles, 12/04/2012.
François Goasdoué, PhD defense committee of Jitao Yang, “Un modèle de données pour bibliothèques numériques”, Univ. Paris-Sud, 30/05/2012.
Ioana Manolescu, PhD defense committee of Silviu Maniu, “Data Management in Social Networks”, Telecom ParisTech, 28/09/2012.
Ioana Manolescu, PhD defense committee of Jordi Creus, “ROSES : Un moteur de requêtes continues pour l’aggrégation de flux RSS à large échelle”, UPMC - Sorbonne Universités, 7/12/2012.
Ioana Manolescu, HDR defense committee of Cédric du Mouza, “Indexing Very Large Datasets”, Université Paris-Dauphine, 30/11/2012.
Ioana Manolescu presented the issues involved in cloud-based management of Web data to a panel on the technology aspects of Big Data, at the “Big Data” industrial conference in Paris on March 20, 2012.