Section: New Results

Data and Metadata Management

Uncertain Data Management

Participants: Reza Akbarinia, Patrick Valduriez, Guillaume Verger.

Data uncertainty in scientific applications can arise for many reasons: incomplete knowledge of the underlying system, inexact model parameters, inaccurate representation of initial boundary conditions, inaccurate equipment, etc. For instance, in the monitoring of plant contamination, sensors periodically generate data that may be uncertain. Instead of ignoring (or correcting) uncertainty, which may cause major errors, we need to manage it rigorously and provide support for querying.

In [46], we address the problem of aggregate queries that return possible sum values and their probabilities. This kind of query, which we call ALL-SUM, is also known as the sum probability distribution. The results of ALL-SUM can be used to answer many other types of queries over probabilistic data. In general, the problem of ALL-SUM query execution is NP-complete. We propose pseudo-polynomial algorithms that are efficient in many practical settings, e.g., when the aggregated attribute values are small integers or real numbers with small precision, i.e., a small number of digits after the decimal point. These cases cover many practical attributes, e.g., temperature, blood pressure, or the human resources needed per patient in medical applications.
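The pseudo-polynomial idea can be illustrated with a small dynamic-programming sketch (an illustration of the general technique, not the algorithm of [46] itself): assuming each uncertain tuple exists independently with a given probability and contributes a small integer value, the sum distribution is built one tuple at a time.

```python
def all_sum(tuples):
    """Sum probability distribution (ALL-SUM) over independent
    uncertain tuples, each given as a (value, probability) pair.

    Pseudo-polynomial dynamic programming: the table size is bounded
    by the number of distinct reachable sums, so the cost stays low
    when values are small integers.
    """
    dist = {0: 1.0}  # before any tuple is considered, P(sum = 0) = 1
    for value, prob in tuples:
        new_dist = {}
        for s, p in dist.items():
            # case 1: the tuple is absent (probability 1 - prob)
            new_dist[s] = new_dist.get(s, 0.0) + p * (1 - prob)
            # case 2: the tuple is present and adds its value
            new_dist[s + value] = new_dist.get(s + value, 0.0) + p * prob
        dist = new_dist
    return dist
```

For example, two tuples (1, 0.5) and (2, 0.5) yield the four sums 0, 1, 2, 3, each with probability 0.25.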

We have started to develop a probabilistic database prototype, called ProbDB (Probabilistic Database), on top of an RDBMS. ProbDB divides a query into two parts: probabilistic and deterministic (i.e., non-probabilistic). The deterministic part is executed by the underlying RDBMS; the rest of the work is done by our probabilistic query processing algorithms, which are executed over the data returned by the RDBMS. In [51], we demonstrated the efficient execution of aggregate queries with the first version of ProbDB.
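A minimal sketch of this two-phase split, using an in-memory SQLite database as a stand-in RDBMS and a hypothetical `readings(sensor, temp, prob)` table (schema and data are illustrative, not ProbDB's): the deterministic selection is pushed down to the database, and the probabilistic aggregation (here, an expected sum) is computed outside it.

```python
import sqlite3

# Stand-in RDBMS with a hypothetical uncertain-readings table:
# each row carries an existence probability in the 'prob' column.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE readings (sensor TEXT, temp INTEGER, prob REAL)")
conn.executemany(
    "INSERT INTO readings VALUES (?, ?, ?)",
    [("s1", 20, 0.9), ("s2", 25, 0.6), ("s3", 30, 0.3)],
)

# Deterministic part: selection/projection executed by the RDBMS.
rows = conn.execute(
    "SELECT temp, prob FROM readings WHERE temp >= 25"
).fetchall()

# Probabilistic part: processed outside the RDBMS over the returned
# data, e.g. the expected value of the sum of qualifying readings.
expected_sum = sum(t * p for t, p in rows)  # 25*0.6 + 30*0.3 = 24.0
```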

Metadata Integration

Participants: Zohra Bellahsène, Rémi Coletta, Duy Hoa Ngo.

Due to the various types of heterogeneity of ontologies, ontology matching must exploit many features of ontology elements in order to improve matching quality. For this purpose, numerous similarity metrics have been proposed to deal with ontology semantics at different levels: the element level, the structural level and the instance level.

Element-level metrics can be categorized into three groups: (1) terminological, (2) structural and (3) semantic. Metrics of the first group exploit textual features such as names, labels and comments to compute a similarity score between entities, whereas metrics of the last two groups exploit hierarchy and semantic relationship features. Our approach first uses terminological metrics; then, during the matching process, the mappings discovered by the terminological metrics are used as input mappings to the metrics of the second and third groups. Naturally, the more precise the results of the terminological metrics, the more accurate the results of the structural and semantic metrics.
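As an illustration of a terminological metric of the kind used in the first step, here is a minimal edit-distance-based name similarity (an illustrative sketch, not one of the metrics defined in YAM++):

```python
def levenshtein(a, b):
    """Classic dynamic-programming edit distance between two strings."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                  # deletion
                           cur[j - 1] + 1,               # insertion
                           prev[j - 1] + (ca != cb)))    # substitution
        prev = cur
    return prev[-1]

def name_similarity(e1, e2):
    """Terminological score in [0, 1] over entity names, case-insensitive:
    1 means identical names, 0 means maximally different."""
    a, b = e1.lower(), e2.lower()
    if not a and not b:
        return 1.0
    return 1.0 - levenshtein(a, b) / max(len(a), len(b))
```

Pairs scoring above a threshold would then seed the structural and semantic metrics of the later steps.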

However, finding a good combination of different metrics is very difficult and time-consuming. We proposed YAM++ (not Yet Another Matcher), an approach that uses machine learning to combine similarity metrics. Our main contributions are: the definition of new metrics dealing with the terminological and context profile features of entities in ontologies [37], and the use of a decision tree model to combine similarity metrics [38].
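To illustrate how a decision tree can combine several similarity scores into a match decision, here is a hand-written stand-in tree over three hypothetical scores; both the structure and the thresholds are illustrative only, not those learned by YAM++:

```python
def decision_tree_match(name_sim, label_sim, context_sim):
    """Hand-written stand-in for a learned decision tree that turns
    similarity scores (each in [0, 1]) into a match / no-match decision.
    A real tree of this shape would be induced from labelled mappings."""
    if name_sim >= 0.9:
        return True                       # near-identical names: match
    if label_sim >= 0.8:
        return context_sim >= 0.3         # similar labels need weak context support
    # otherwise require both a decent name score and strong context
    return name_sim >= 0.6 and context_sim >= 0.7
```

Compared to a fixed weighted average, the tree lets different metrics dominate in different regions of the score space, which is what the learned combination exploits.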

To improve the matching quality of YAM++, we exploit the instances accompanying the ontologies. We then apply the similarity flooding algorithm to discover more semantic mappings. At the 2011 competition of the Ontology Alignment Evaluation Initiative (http://oaei.ontologymatching.org), YAM++ achieved excellent results: first position on the Conference track and second position on the Benchmark track [39].
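The propagation idea behind similarity flooding can be sketched as follows: the similarity of a pair of entities is reinforced by the similarity of neighbouring pairs connected by matching edges in the two ontologies. This is a minimal illustrative fixed-point iteration, not the implementation used in YAM++:

```python
def similarity_flooding(edges1, edges2, seed, rounds=10):
    """Minimal similarity-flooding sketch.

    edges1, edges2: (parent, child) edges of the two ontologies.
    seed: initial pair similarities {(a, b): score}, e.g. from
          terminological metrics.
    Similarity propagates between pairs linked by matching edges,
    with normalisation by the maximum score after each round.
    """
    sigma = dict(seed)
    for _ in range(rounds):
        nxt = dict(seed)  # keep the initial scores as a fixed base
        for a, a2 in edges1:
            for b, b2 in edges2:
                # propagate along matching edges, in both directions
                nxt[(a2, b2)] = nxt.get((a2, b2), 0.0) + sigma.get((a, b), 0.0)
                nxt[(a, b)] = nxt.get((a, b), 0.0) + sigma.get((a2, b2), 0.0)
        top = max(nxt.values()) or 1.0
        sigma = {pair: v / top for pair, v in nxt.items()}  # normalise to [0, 1]
    return sigma
```

For instance, if `Thing -> Person` in one ontology matches `Thing -> Human` in the other, a seed mapping between the two `Thing` nodes floods similarity down to the (`Person`, `Human`) pair.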