## Section: New Results

### Querying Heterogeneous Linked Data

#### Aggregates

Aggregation refers to the computation of aggregates in databases, that
is, the computation of a function of the answer of a query, such as
counting the number of answers, finding the optimal one for a given
objective function or enumerating all of them with a small delay
between two distinct answers. The goal of aggregation is typically to
compute such aggregates without explicitly generating the whole set of
answers. We study aggregation problems within the ANR project
*Aggreg* coordinated by Niehren.

At **ICALP**, Bourhis (with Amarilli, Jachiet and Mengel)
[13] developed a new algorithm that efficiently
enumerates the solutions of certain types of circuits. They apply their
result to give new proofs of previous results on efficient enumeration
for queries defined by tree automata or `FO` queries over structures
with bounded tree width, by using these circuits as aggregates that
represent the set of all solutions of a query and then enumerating them.
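To make the idea of "a circuit that represents a set of answers" concrete, here is a minimal Python sketch. It is not the algorithm of [13], which comes with guarantees on preprocessing time and delay; it is only a naive lazy enumeration. The gate classes (`VarGate`, `UnionGate`, `ProductGate`) and the function `enumerate_answers` are hypothetical names introduced for illustration, and the sketch assumes that union gates combine disjoint sets of answers and that product gates combine sub-circuits over disjoint variables.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class VarGate:
    name: str  # leaf representing the single answer {name}


@dataclass(frozen=True)
class UnionGate:
    children: tuple  # assumption: the children capture disjoint sets of answers


@dataclass(frozen=True)
class ProductGate:
    left: object   # assumption: left and right mention disjoint variables
    right: object


def enumerate_answers(gate):
    """Lazily yield every answer (a frozenset of variable names) captured by gate."""
    if isinstance(gate, VarGate):
        yield frozenset({gate.name})
    elif isinstance(gate, UnionGate):
        # The answers of a union are the answers of its children, one child after another.
        for child in gate.children:
            yield from enumerate_answers(child)
    elif isinstance(gate, ProductGate):
        # The answers of a product pair every left answer with every right answer.
        for left_answer in enumerate_answers(gate.left):
            for right_answer in enumerate_answers(gate.right):
                yield left_answer | right_answer
    else:
        raise TypeError(f"unknown gate: {gate!r}")


# A small circuit capturing the three answers {x1, y1}, {x1, y2} and {x2}.
circuit = UnionGate((
    ProductGate(VarGate("x1"), UnionGate((VarGate("y1"), VarGate("y2")))),
    VarGate("x2"),
))

for answer in enumerate_answers(circuit):
    print(sorted(answer))
```

Because the function is a generator, answers are produced one at a time and the full set is never materialised, which is the point of enumeration as an aggregation task.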

Again at **ICALP** [15], Bacquey, in a
collaboration with Caen and Marseille (Grandjean and Olive), proved that
linear time complexity on cellular automata is exactly characterized
by inductive first-order Horn formulas. The method of proof also
implies the following result: the enumeration of the ground atoms that
are consequences of any inductive first-order Horn formula on a given
structure can be performed in linear time (in the cardinality of the
domain of the structure) by a cellular automaton (of appropriate
dimension).

#### Provenance

Provenance is a type of aggregate that aims to exhibit the
contributions of the tuples of a database to a query answer. This makes
it possible to explain query answers, which can help to judge
their reliability. Provenance is studied within the ANR project
*Aggreg*.

In a paper at ICDT [14], Bourhis (with Amarilli, Monet and Senellart) studies the combined complexity of computing circuit representations of provenance, which have been used to efficiently evaluate aggregation tasks. In particular, they exhibit a recursive query language, capturing path queries, for which a compact representation of the provenance can be computed.
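As a rough illustration of what the provenance of a path query records, the following Python sketch lists, for a reachability query on a small graph, every set of edge facts that forms a simple path from the source to the target. This is a naive, exponential-time enumeration and not the construction of [14]: the paper is precisely about avoiding such explicit lists by using compact circuits. The function name `path_provenance` and the fact identifiers `e1` to `e4` are assumptions made for the example.

```python
def path_provenance(edges, source, target):
    """Return the witnesses of 'target is reachable from source'.

    edges maps a fact identifier to a directed edge (u, v). Each witness is a
    frozenset of fact identifiers forming a simple path from source to target.
    """
    outgoing = {}
    for fact_id, (u, v) in edges.items():
        outgoing.setdefault(u, []).append((fact_id, v))

    witnesses = set()

    def explore(node, used_facts, visited):
        if node == target:
            witnesses.add(frozenset(used_facts))
            return
        for fact_id, successor in outgoing.get(node, []):
            if successor not in visited:  # keep paths simple
                explore(successor, used_facts + [fact_id], visited | {successor})

    explore(source, [], {source})
    return witnesses


edges = {
    "e1": ("a", "b"),
    "e2": ("b", "c"),
    "e3": ("a", "c"),
    "e4": ("c", "a"),
}

# Read as a Boolean formula over facts: (e1 AND e2) OR e3.
for witness in sorted(path_provenance(edges, "a", "c"), key=sorted):
    print(sorted(witness))
```

The answer "c is reachable from a" holds as long as at least one witness is fully present in the database, which is how such a representation helps to explain an answer or to judge its reliability.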

#### Recursive Queries

At **PODS** [21], P. Bourhis proposed a formalisation of JSON documents, query languages and schemas. This work is a collaboration with Chile. After defining a clean theoretical framework to study JSON documents, Bourhis and his co-authors study the decidability and complexity of navigational query answering for different languages, relating each of them to existing implementations. Finally, they extend the documents with recursion together with a suitable query language, and study the complexity of query evaluation and query answering in this case.
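For readers unfamiliar with navigational queries over JSON, the Python sketch below evaluates a small path expression over a parsed document. The step syntax is hypothetical and is not the language formalised in [21]: a string selects the value under a key, an integer selects an array position, and the marker `"**"` stands for a recursive descent to the current value and everything nested inside it. The function names `descendants` and `navigate` are likewise assumptions.

```python
def descendants(value):
    """Yield value itself and every value nested inside it."""
    yield value
    if isinstance(value, dict):
        for child in value.values():
            yield from descendants(child)
    elif isinstance(value, list):
        for child in value:
            yield from descendants(child)


def navigate(document, steps):
    """Return the list of values reached from document by following steps."""
    current = [document]
    for step in steps:
        reached = []
        for value in current:
            if step == "**":
                # Recursive step: the value and all of its descendants.
                reached.extend(descendants(value))
            elif isinstance(step, str):
                if isinstance(value, dict) and step in value:
                    reached.append(value[step])
            elif isinstance(step, int):
                if isinstance(value, list) and 0 <= step < len(value):
                    reached.append(value[step])
        current = reached
    return current


doc = {
    "name": "root",
    "members": [
        {"name": "m1"},
        {"name": "m2", "team": {"name": "t1"}},
    ],
}

print(navigate(doc, ["members", 0, "name"]))     # ['m1']
print(navigate(doc, ["members", "**", "name"]))  # ['m1', 'm2', 't1']
```

The non-recursive steps only move one level down, whereas `"**"` reaches values at arbitrary depth; this is the kind of distinction that affects the complexity of evaluation.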

At **ICALP** [17], P. Bourhis studied, in a collaboration with Oxford, the problem of definability in decidable fixpoint logics. Bourhis and his co-authors give new characterisations of formulas that can be expressed in decidable logics with fixpoints. One of their main results is an effective characterisation of the formulas of the guarded negation fragment with fixpoint that can be expressed in the guarded fragment with fixpoint. Their techniques are then extended to effectively characterise the first-order formulas that can be defined in the guarded fragment.

A. Lemay contributed at ICDE [16] the
*gMark* benchmark, a tool to generate large graph databases
and associated sets of queries. This work was done in cooperation
with Eindhoven and previous members of Links who are now in Lyon and
Clermont-Ferrand. Its main strengths are its great flexibility (the
generation of the graph can be done from a simple schema, but can also
incorporate elaborate parameters), its ability to generate recursive
queries, and the possibility of generating large sets of queries of a
desired selectivity. For instance, this benchmark made it possible to
highlight the difficulties that existing query engines have in dealing
with recursive queries of high selectivity.

#### Data Integration

P. Bourhis and S. Tison presented at
**IJCAI** [18], the
top conference in Artificial Intelligence, a new ontology-mediated
query answering (OMQA) system for JSON documents. This work is a
collaboration with researchers from the University of Montpellier. The
strength of their contribution lies in the fact that their ontology
language is very expressive and yet yields a tractable query answering
system. Moreover, they establish a non-trivial connection between
their query answering system and term rewriting, allowing them to
pinpoint the exact complexity of query answering and to evaluate
queries directly over KV-stores.

Also at **IJCAI** [20], P. Bourhis studied guarded ontology
languages that are compatible with cross products. This work was done
in cooperation with Edinburgh and Vienna. The cross product is a useful
modelling tool that makes it possible to connect every element of one
relation to every element of another relation. However, in this paper,
Bourhis and his co-authors show that its introduction into guarded
ontologies, even when it is limited to two relations, quickly leads to
the undecidability of query evaluation and query answering. Nevertheless,
they isolate fragments where one can add cross products without losing
the decidability of these problems, by restricting either the queries or
the ontology.

#### Schema Validation

I. Boneva presented at ISWC [19] her work on ShEx 2.0 (Shape Expression Language 2.0), a language to describe the vocabulary and the structure of an RDF graph. This work is a collaboration with Oviedo and MIT. The language is based on the notion of shapes, a typing system supporting algebraic operations, recursive references to other shapes, and Boolean combinations. In the paper, Boneva and her co-authors give efficient algorithms to test whether an RDF graph satisfies a shapes schema, together with implementation guidelines. Her research on the topic has also led to the publication of a book [25] on the validation of RDF data, containing among other things her contribution to ShEx.

JSON documents are basically unordered data trees. Schemas for unordered data trees can thus be defined by appropriate notions of tree automata for unordered trees, as studied in a systematic manner by Boiret, Hugot, and Niehren [11] in cooperation with Treinen from Paris 7. Alternatively, schemas can be defined by closed logic formulas in the logics proposed by the same authors in [12]. They showed that logics for unordered data trees with equality tests on the data values of sibling nodes remain decidable, and thus so do the equivalence problems of the corresponding tree automata. In contrast, the problem becomes undecidable when cousins are compared for equality of data values.
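The difference between sibling and cousin comparisons can be pictured on a single tree with the Python sketch below. It only evaluates the two kinds of equality tests on a given unordered data tree; it does not implement the logics or automata of [11] and [12], whose decidability results concern reasoning over all trees rather than checking one. The tree encoding (a pair of a data value and a list of subtrees) and the function names are assumptions made for the example.

```python
from collections import Counter


def has_equal_siblings(node):
    """True if two children of the same node carry the same data value."""
    _, children = node
    counts = Counter(child[0] for child in children)
    if any(count > 1 for count in counts.values()):
        return True
    return any(has_equal_siblings(child) for child in children)


def has_equal_cousins(node):
    """True if two nodes with different parents but the same grandparent
    carry the same data value."""
    _, children = node
    first_parent = {}  # data value -> index of the first parent it was seen under
    for index, child in enumerate(children):
        for grandchild in child[1]:
            value = grandchild[0]
            if value in first_parent and first_parent[value] != index:
                return True
            first_parent.setdefault(value, index)
    return any(has_equal_cousins(child) for child in children)


# A node is (data value, list of subtrees); the order of subtrees is irrelevant.
tree = ("root", [
    ("a", [("1", []), ("2", [])]),
    ("b", [("2", []), ("3", [])]),
])

print(has_equal_siblings(tree))  # False: all siblings carry distinct values
print(has_equal_cousins(tree))   # True: value "2" occurs under both "a" and "b"
```

Both functions give the same result whatever the order of the subtrees, which reflects the unordered nature of the trees.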