Efficient Enumeration for Conjunctive Queries over X-underbar Structures

Mostrare Modeling Tree Structures, Machine Learning, and Information Extraction

Perception, Cognition, Interaction

Knowledge and Data Representation and Management

Mostrare is a joint project with the LIFL (UMR CNRS 8022) and the Lille 1 and Lille 3 universities.

Name	Institution	Category	Location	Position	HDR
Rémi Gilleron	French university	Faculty	Lille	Professor, team leader	yes
Joachim Niehren	INRIA	Researcher	Lille	Senior researcher (DR2), vice leader	yes
Karine Lewandowski	INRIA	Assistant	Lille	Shared by 2 projects	
Iovka Boneva	French university	Faculty	Lille	Assistant professor	
Anne-Cécile Caron	French university	Faculty	Lille	Assistant professor	
Aurélien Lemay	French university	Faculty	Lille	Assistant professor	
Yves Roos	French university	Faculty	Lille	Assistant professor	
Sophie Tison	French university	Faculty	Lille	Professor	yes
Marc Tommasi	French university	Faculty	Lille	Professor	yes
Fabien Torre	French university	Faculty	Lille	Assistant professor	
Sławek Staworko	INRIA	Faculty	Lille	Assistant professor	
Gemma Garriga	INRIA	Researcher	Lille	Junior researcher (CR1) since October 2010	
Benoît Groz	INRIA	PhD	Lille	AMN fellowship, since September 2008	
Édouard Gilbert	French university	PhD	Lille	AMN fellowship, since November 2007	
Grégoire Laurence	French university	PhD	Lille	MESR, since October 2008	
Jean-Baptiste Faddoul	Private company	PhD	Lille	CIFRE Xerox, since December 2008	
Jean Decoster	French university	PhD	Lille	MESR, since October 2009	
Antoine M. Ndione	INRIA	PhD	Lille	INRIA fellowship, since October 2010	
Tom Sebastian	Private company	PhD	Lille	CIFRE Innovimax, since December 2010	
Jérôme Champavère	French university	PostDoc	Lille	ATER from October 2009 to August 2011	
Guillaume Bagan	INRIA	PostDoc	Lille	INRIA, postdoc from September 2009 to August 2011	
Camille Vacher	French university	PostDoc	Lille	ATER since September 2010	
Shunichi Amano	INRIA	PostDoc	Lille	INRIA, postdoc from October 2010 to March 2011	
Denis Debarbieux	INRIA	Technical staff	Lille	INRIA, from December 2010 to November 2012	

Overall Objectives

Presentation

The objective of Mostrare is to develop adaptive document processing methods for XML-based information systems. Adaptiveness becomes important when documents evolve frequently, as they do on the Web. The particularity of Mostrare is that we develop semi-automatic or automatic information extraction approaches that fully benefit from the available tree structure of XML documents.

Information extraction is an instance of document transformation. In order to exploit the tree structure of XML documents, our goal is to investigate specification languages for tree transformations. These are based on approaches from database theory (such as the W3C standards XQuery and XSLT), automata, logic, and programming languages. We wish to define stochastic models of tree transformations, and to develop automatic or semi-automatic procedures for inferring them. Once these are available, we want to integrate the learning algorithms into innovative information extraction systems, semantic Web platforms, and document processing engines.

The following two paragraphs summarize our two main research objectives:

Modeling tree structures for information extraction.

We wish to continue our work on modeling languages for node selection queries in tree-structured documents, to which we contributed in the first phase of Mostrare. The new subjects of interest in the second phase are XML document transformations and tree transformations that generalize node selection queries.

Machine learning for information extraction.

We wish to continue to study machine learning techniques for information extraction. One new goal is to develop learning algorithms that can induce XML document transformations, based on their tree structure. Another new goal is to explore stochastic machine learning techniques that can deal with uncertainty in document sources.

Highlights

Lemay and Niehren presented at ACM PODS the first learning algorithm for top-down XML transformations. This is a breakthrough result on transducer learning, since it supports copying and flipping of subtrees for the first time, while all previous proposals were restricted either to words or to relabelings on trees.

A paper by Tison and Dauchet received the 2010 IEEE LICS Test-of-Time Award. They developed innovative techniques on tree automata to show that the first-order theory of one-step rewriting is decidable.

Scientific Foundations

Modeling XML document transformations

Participants: Guillaume Bagan, Iovka Boneva, Anne-Cécile Caron, Benoît Groz, Joachim Niehren, Yves Roos, Sławek Staworko, Sophie Tison, Antoine M. Ndione, Camille Vacher, Shunichi Amano, Tom Sebastian

XML document transformations can be defined in the W3C standard languages XQuery or XSLT. Programming XML transformations in these languages is often difficult and error-prone, even if the schemata of input and output documents are known. Advanced programming experience and considerable programming time may be necessary, and these are not available in Web services or similar scenarios.

Alternative programming languages for defining XML transformations have been proposed by the programming language community, for instance XDuce, Xtatic, and CDuce. The type systems of these languages simplify the programming tasks considerably. But of course, they do not solve the general difficulty of programming XML transformations manually.

Languages for defining node selection queries arise as sub-languages of all XML transformation languages. The W3C standards use XPath for defining monadic queries, while XDuce and CDuce rely on regular queries defined by regular patterns equivalent to tree automata. Indeed, it is natural to look at node selection as a simple form of tree transformation. Monadic node selection queries correspond to deterministic transformations that annotate all selected nodes positively and all others negatively. N-ary node selection queries become non-deterministic transformations, yielding trees annotated by Boolean vectors.
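The view of monadic node selection as a tree transformation can be illustrated with a minimal Python sketch. The `Node` class, the `annotate` and `show` helpers, the sample document and the label-based query are all assumptions made for illustration; a real XPath query would also inspect the context of a node, not only its label.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class Node:
    label: str
    children: List["Node"] = field(default_factory=list)


def annotate(tree: Node, selects: Callable[[Node], bool]) -> Node:
    """Copy the tree, marking selected nodes with '+' and all others with '-'."""
    mark = "+" if selects(tree) else "-"
    return Node(tree.label + mark, [annotate(c, selects) for c in tree.children])


def show(tree: Node) -> str:
    """Render a tree as a term, e.g. a(b,c)."""
    if not tree.children:
        return tree.label
    return tree.label + "(" + ",".join(show(c) for c in tree.children) + ")"


# Hypothetical document and hypothetical monadic query "select all title nodes".
doc = Node("lib", [Node("book", [Node("title")]), Node("dvd", [Node("title")])])
annotated = annotate(doc, lambda n: n.label == "title")
print(show(annotated))
```

An n-ary query would, in the same spirit, attach a vector of n Boolean marks to each node, with one output tree per answer tuple.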

After extensive studies of node selection queries in trees (in XPath and many other languages), the XML community has more recently started to formally investigate XML tree transformations. The expressiveness and complexity of XQuery have been studied. Type preservation is another problem, i.e., whether all trees of the input type get transformed into the output type, or vice versa, whether the inverse image of the output type is contained in the input type.

The automata community usually approaches tree transformations by tree transducers, i.e., tree automata producing output structure. Macro tree transducers, for instance, have been proposed recently for defining XML transformations. From the viewpoint of logic, tree transducers have been studied for MSO definability.

Machine learning for XML document transformations

Participants: Jérôme Champavère, Jean Decoster, Jean-Baptiste Faddoul, Édouard Gilbert, Rémi Gilleron, Grégoire Laurence, Aurélien Lemay, Joachim Niehren, Sławek Staworko, Marc Tommasi, Fabien Torre, Gemma Garriga

Automatic or semi-automatic tools for inferring tree transformations are needed for information extraction. Annotated examples may support the learning process. The learning target will be models of XML tree transformations specified in some of the languages discussed above.

Grammatical inference is commonly used to learn languages from examples and can be applied to learn transductions. Previous work on grammatical inference for transducers remains limited to the case of strings. For the tree case, so far only very basic tree transducers have been shown to be learnable, by previous work of the Mostrare project. These are node selecting tree transducers (NSTTs), which preserve the structure of trees while relabeling their nodes deterministically.

Statistical inference is most appropriate for dealing with uncertain or noisy data. It is generally useful for information extraction from textual data, given that current text understanding tools are still very limited. XML transformations with noisy input data typically arise in data integration tasks, for instance when converting PDF into XML.

Stochastic tree transducers have been studied in the context of natural language processing. A set of pairs of input and output trees defines a relation that can be represented by a 2-tape automaton called a stochastic finite-state transducer (SFST). A major problem consists in estimating the parameters of such a transducer. SFST training algorithms are lacking so far.

Probabilistic context-free grammars (pCFGs) are used in the context of PDF to XML conversion. In the first step, a procedure is learned that labels the leaves of the input document with labels of the output DTD. In the second step, given a CFG as a generative model of output documents, probabilities are learned. Such two-step approaches are in competition with one-step approaches that estimate conditional probabilities directly.

A popular non-generative model for information extraction is conditional random fields (CRFs). One main advantage of CRFs is that they take into account long-distance dependencies in the observed data. CRFs have been defined for general graphs but have mainly been applied to sequences; thus CRFs for XML trees should be investigated.

So-called structured output has recently become a research topic in machine learning. It aims at extending the classical categorization task, which consists of associating one or several labels with each input example, in order to handle structured output labels such as trees. The applicability of structured output learning algorithms remains to be assessed for real tasks such as XML transformations.

Application Domains

Context

XML transformations are basic to data integration: HTML to XML transformations are useful for information extraction from the Web; XML to XML transformations are useful for data exchange between Web services, between peers, or between databases. Doan and Halevy survey novel integration tasks that appear with the Semantic Web and the usage of ontologies. Therefore, the semi-automatic generation of XML transformations is a challenge in the database community and in the semantic Web community.

Also, XML transformations are useful for document processing. For instance, there is a need to design transformations from documents organized w.r.t. a visual format (HTML, DOC, PDF) into documents organized w.r.t. a semantic format (XML according to a DTD or a schema). The semi-automatic design of such transformations is obviously a very challenging objective.

Furthermore, a number of Mostrare's activities concern the efficient evaluation of XPath queries on XML documents and XML streams. XPath is fundamental to all XML standards, in particular to XQuery, XSLT, and XProc.

Software

VOLATA

Participant: Fabien Torre (correspondent)

VOLATA: VOtes of Least generAl generalizaTions in jAva

VOLATA is a software bundle containing several learning algorithms: learning algorithms for attribute-value datasets, grammatical inference algorithms, and inductive logic programming algorithms. VOLATA has been applied to document classification tasks and information extraction tasks. The software is available at http://www.grappa.univ-lille3.fr/~torre/Recherche/Softwares/volata/.

PICCATA

Participants: Édouard Gilbert (correspondent), Feriel Lahlali, Marc Tommasi

PICCATA: Programming Interface for effiCient Computations and Approximation on multiplicity Tree Automata.

Piccata is a programming interface for learning weighted and classical tree automata from examples. Piccata development started in 2009 with the former member Feriel Lahlali. Source code under the CeCILL licence is available at http://piccata.gforge.inria.fr.

EVOXS

Participants: Joachim Niehren (correspondent), Denis Debarbieux, Tom Sebastian

EVOXS: Earliest evaluation of XPath on streams

This is an implementation of XPath query answering algorithms on XML streams, following the algorithms developed in the PhD thesis of O. Gauwin in 2009, directed by J. Niehren and S. Tison. It consists of a compiler from a fragment of XPath to deterministic streaming tree automata and an earliest query answering algorithm for queries defined by deterministic streaming tree automata. The main developers of the first version are O. Gauwin and J. Niehren.

EVOXS will be the starting point for our cooperation with INNOVIMAX S.A.R.L. in Paris, within the QuiXProc transfer project powered by D. Debarbieux and the CIFRE PhD thesis of T. Sebastian, both starting in December 2010. The source code is available under the GNU public licence at https://gforge.inria.fr/projects/evoxs.

New Results

Modeling XML document transformations

Participants: Joachim Niehren, Sophie Tison, Sławek Staworko, Aurélien Lemay, Anne-Cécile Caron, Yves Roos, Shunichi Amano, Camille Vacher, Benoît Groz, Antoine M. Ndione, Tom Sebastian

Query answering and access control. Groz, Boneva, Roos, Tison, Caron and Staworko study the problem of update translation for views on XML documents. More precisely, given an XML view definition and a user-defined view update program, the problem is to find a source update program that translates the view update without side effects on the view. Both views and update programs are represented by recognizable tree languages, and different settings for the update problem are studied. The results of this line of research were obtained during 2010 and will be published at the 14th International Conference on Database Theory (ICDT 2011).

Answer enumeration. Bagan et al. investigate efficient enumeration algorithms for conjunctive queries for databases over binary relations that satisfy the X-underbar property. In particular, their algorithm is able to enumerate answers of XPath queries with variables with linear delay and quadratic precomputation time, both in the size of the database.

Bagan and Niehren study a more efficient answer enumeration algorithm for a fragment of conditional XPath with variables, which is a first-order complete query language for unranked trees of bounded depth. Their algorithm requires only linear precomputation time and constant delay for fixed queries, while depending linearly on the size of the query. It is based on a new enumeration algorithm for disjunctions of acyclic conjunctive queries on so-called X-doublebar structures that they introduce, which are more restrictive than the previously known X-underbar structures.
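To illustrate the split between precomputation and delay, here is a minimal Python sketch for the simplest acyclic case, a path query x0 -R1-> x1 -R2-> x2 over binary relations. It uses the classical semi-join filtering idea, not the X-underbar or X-doublebar algorithms of the papers; the function names and the sample relations are assumptions made for illustration.

```python
from collections import defaultdict
from typing import Dict, Iterator, List, Optional, Set, Tuple

Relation = Set[Tuple[str, str]]


def preprocess(relations: List[Relation]) -> List[Dict[str, List[str]]]:
    """Backward pass: keep only pairs whose target can still reach the end of the path."""
    succ: List[Dict[str, List[str]]] = [dict() for _ in relations]
    alive: Optional[Set[str]] = None  # values of the next variable that extend to an answer
    for i in range(len(relations) - 1, -1, -1):
        index = defaultdict(list)
        for a, b in sorted(relations[i]):
            if alive is None or b in alive:
                index[a].append(b)
        succ[i] = dict(index)
        alive = set(index)
    return succ


def enumerate_answers(succ: List[Dict[str, List[str]]]) -> Iterator[Tuple[str, ...]]:
    """Enumerate answers one by one; after filtering, no partial answer is a dead end."""
    def extend(i: int, prefix: List[str]) -> Iterator[Tuple[str, ...]]:
        if i == len(succ):
            yield tuple(prefix)
            return
        for b in succ[i].get(prefix[-1], []):
            yield from extend(i + 1, prefix + [b])

    for a in sorted(succ[0]):
        yield from extend(0, [a])


# Hypothetical database with two binary relations.
r1 = {("r", "a"), ("r", "b"), ("s", "c")}
r2 = {("a", "x"), ("a", "y"), ("c", "z")}
answers = list(enumerate_answers(preprocess([r1, r2])))
print(answers)
```

Because the backward pass removes every pair that cannot be completed, the work between two consecutive answers in this sketch depends on the length of the query rather than on the size of the database.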

Query answering on XML streams. Niehren started the transfer project QuiXProc with the industrial partner INNOVIMAX S.A.R.L. in December 2010, in which they plan to integrate the XPath query answering algorithms developed by Mostrare into XProc, the XML coordination language of the W3C. In this line of research, Niehren and Tison (with their previous PhD student O. Gauwin) showed how to distinguish node selection queries on XML streams with bounded delay and concurrency. This is a journal version of a previous LATA 2009 conference paper.

Tree automata. Tison et al. revisit in an invited paper their results on tree automata with global equality and disequality constraints (TAGEDs), previously published in 2008 at the International Conference on Developments in Language Theory (DLT).

Vacher is starting his postdoctoral studies on tree automata with constraints under the supervision of Niehren and Tison.

M. Ndione is starting his PhD thesis on probabilistic algorithms for tree automata and transducers under the supervision of Lemay and Niehren.

Logic. Amano started his postdoc on XML data exchange under the supervision of Niehren.

Staworko et al. introduce the framework of consistent query answers and repairs in order to alleviate the impact of inconsistent data on answers to a query. In particular, a repair is a minimally different consistent instance, and an answer is consistent if it is present in every repair.

Programming languages. Niehren et al. present a journal version of the attributed pi-calculus, a modeling language for systems biology. They add priorities compared to the CMSB 2008 conference version, while elaborating the analogy of priorities and stochastic rates in the pi-calculus.
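The definition of consistent answers can be made concrete with a small Python sketch. It assumes a simplified relational setting that is not taken from the paper: facts are (key, value) pairs, the only constraint is that each key has a single value, and a repair keeps exactly one fact per key. The names `repairs`, `consistent_answers` and the sample data are hypothetical.

```python
from itertools import product
from typing import Callable, Dict, FrozenSet, Iterator, List, Set, Tuple

Fact = Tuple[str, str]  # (key, value); assumed constraint: one value per key


def repairs(instance: Set[Fact]) -> Iterator[FrozenSet[Fact]]:
    """Yield every maximal consistent subset: one fact is kept for each key."""
    groups: Dict[str, List[Fact]] = {}
    for key, value in sorted(instance):
        groups.setdefault(key, []).append((key, value))
    for choice in product(*groups.values()):
        yield frozenset(choice)


def consistent_answers(instance: Set[Fact],
                       query: Callable[[FrozenSet[Fact]], Set[str]]) -> Set[str]:
    """An answer is consistent if the query returns it on every repair."""
    return set.intersection(*[set(query(r)) for r in repairs(instance)])


# Hypothetical inconsistent instance: "ann" has two cities.
lives_in = {("ann", "lille"), ("ann", "paris"), ("bob", "lille")}
in_lille = lambda facts: {name for name, city in facts if city == "lille"}
print(consistent_answers(lives_in, in_lille))
```

Here "ann" is returned on only one of the two repairs, so only "bob" is a consistent answer. The number of repairs can grow exponentially, which is why this brute-force enumeration is only an illustration of the definition.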

Machine learning for XML document transformations

Participants: Gemma Garriga, Rémi Gilleron, Aurélien Lemay, Joachim Niehren, Sławek Staworko, Marc Tommasi, Fabien Torre, Jérôme Champavère, Jean Decoster, Jean-Baptiste Faddoul, Édouard Gilbert, Grégoire Laurence

Learning tree transformations: transducer induction. Lemay, Niehren et al. present at ACM PODS the first learning algorithm for top-down XML transformations, which can restructure trees by copying, flipping, and deleting subtrees. This is a breakthrough result on transducer learning. Previous proposals were restricted either to transducers on words or to relabelings on trees. The results are obtained by combining a new top-down encoding of unranked trees into ranked trees guided by DTDs with a new learning algorithm for deterministic top-down tree transducers (DTOPs). This learning result for DTOPs is derived from a new Myhill-Nerode theorem for DTOPs that the paper establishes. This theorem also shows the existence of unique minimal DTOPs that are consistent with a DTD. An alternative minimization algorithm for DTOPs was obtained previously, but without the Myhill-Nerode theorem and without any link to learning. This result was obtained in cooperation within our associated team TRANSDUCE with NICTA Sydney, which was created in 2010, although the cooperation had already started in 2007.
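What copying, flipping, and deleting of subtrees mean for a top-down transducer can be sketched in a few lines of Python. This only shows how a given transducer is applied, not how it is learned; the tuple encoding of trees, the `Call` class, the single state `q` and the rule table are assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Dict, Tuple

Tree = Tuple  # (label, child_0, ..., child_n-1)


@dataclass(frozen=True)
class Call:
    state: str  # state in which the input subtree is processed
    child: int  # index of the input subtree to process


def run(rules: Dict[Tuple[str, str], object], state: str, tree: Tree) -> Tree:
    """Apply the unique rule for (state, root label), recursing into the called subtrees."""
    label, children = tree[0], tree[1:]

    def build(rhs):
        if isinstance(rhs, Call):
            return run(rules, rhs.state, children[rhs.child])
        return (rhs[0],) + tuple(build(sub) for sub in rhs[1:])

    return build(rules[(state, label)])


# Hypothetical rules over a ranked alphabet.
rules = {
    ("q", "pair"): ("pair", Call("q", 1), Call("q", 0)),  # flip the two subtrees
    ("q", "dup"): ("pair", Call("q", 0), Call("q", 0)),   # copy the subtree
    ("q", "drop"): ("kept", Call("q", 0)),                # delete the second subtree
    ("q", "a"): ("a",),
    ("q", "b"): ("b",),
}
source = ("pair", ("dup", ("a",)), ("drop", ("b",), ("a",)))
target = run(rules, "q", source)
print(target)
```

Determinism shows up in the rule table being a dictionary: each pair of state and input label has exactly one right-hand side.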

Learning queries or schemas: automata induction. Champavère defended his PhD thesis in September 2010 under the supervision of Niehren, Lemay and Gilleron. The thesis studies the incorporation of schema knowledge in XML query induction from annotated example trees. The results lift tree automata based induction algorithms for total monadic node selecting queries to partial queries whose domain is fixed by a known schema. Schema consistency is checked dynamically by testing language inclusion for tree automata. Another contribution of the thesis is to study query induction from pruned annotated examples where subtrees irrelevant for node selection may be cut away; in this line of research, the thesis presents a new learnability result for classes of queries that are stable under schema-guided pruning strategies.

Sequence classification. Torre et al. present a general framework for supervised classification; in particular, their goal is the classification of sequences by deciding whether a word belongs to some language or not. They integrate classical grammatical inference techniques into a general framework resulting from supervised classification: the hypotheses are therefore represented by automata or balls of strings, which are then combined by traditional boosting algorithms.
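A ball of strings can be read as the set of all words within a given edit distance of a center word. The following Python sketch shows such a hypothesis as a classifier. It assumes the Levenshtein distance, and it leaves out how the paper learns the balls or combines them by boosting; the function names are hypothetical.

```python
def edit_distance(u: str, v: str) -> int:
    """Levenshtein distance computed row by row with dynamic programming."""
    prev = list(range(len(v) + 1))
    for i, a in enumerate(u, 1):
        cur = [i]
        for j, b in enumerate(v, 1):
            cur.append(min(prev[j] + 1,              # delete a
                           cur[j - 1] + 1,           # insert b
                           prev[j - 1] + (a != b)))  # substitute or match
        prev = cur
    return prev[-1]


def in_ball(word: str, center: str, radius: int) -> bool:
    """A ball hypothesis accepts exactly the words within the radius of its center."""
    return edit_distance(word, center) <= radius


print(in_ball("abba", "aba", 1), in_ball("abab", "abba", 1))
```

In a boosting scheme, each such ball would play the role of a weak classifier whose vote is weighted according to its error on the training sample.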

Induction of stochastic tree automata. Gilbert, Gilleron and Tommasi study the inference of probability distributions over sets of unranked trees (i.e., trees where a node can have an unbounded number of direct subtrees). The main objective here is to build probabilistic decision procedures able to classify (XML/HTML) trees from different sources, or to get concise representations of sets of trees. The problem is formalized as the more general problem of learning tree series. To define recognizable tree series of unranked trees, the authors specify weighted automata for unranked trees by extending previous hedge automata and weighted trees. The paper also considers binary representations of unranked trees and shows that recognizable tree series for unranked trees can be defined and studied from recognizable tree series of those binary representations. The paper also presents decidability results on probabilistic tree automata and algorithms for computing sums of convergent series.
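One common binary representation of unranked trees is the first-child/next-sibling encoding. The Python sketch below assumes this encoding for illustration; the paper may rely on a different one. Unranked trees are written as pairs (label, list of children), and `encode` and `decode` are hypothetical helper names.

```python
from typing import List, Optional, Tuple

Unranked = Tuple[str, list]  # (label, [children])
Binary = Optional[tuple]     # (label, first_child, next_sibling) or None


def encode(forest: List[Unranked]) -> Binary:
    """Encode a sequence of unranked trees as one binary tree."""
    if not forest:
        return None
    (label, children), rest = forest[0], forest[1:]
    return (label, encode(children), encode(rest))


def decode(binary: Binary) -> List[Unranked]:
    """Inverse of encode: rebuild the sequence of unranked trees."""
    if binary is None:
        return []
    label, first_child, next_sibling = binary
    return [(label, decode(first_child))] + decode(next_sibling)


tree = ("a", [("b", []), ("c", [("d", [])])])
binary = encode([tree])
print(binary)
```

Since the encoding is a bijection between hedges and binary trees, a weighted automaton over the binary trees induces a tree series over the unranked trees, which is the kind of transfer the paragraph above describes.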

Multitask learning. Faddoul, Torre and Gilleron, with their partner from XEROX Grenoble, study the problem of learning multiple related tasks from data simultaneously. For this problem, the authors propose a novel learning algorithm, called MT-Adaboost, which extends the traditional Adaboost algorithm to the multitask setting by using simple decision stumps as weak classifiers. The practical and theoretical results of the paper show that the new algorithm learns the dependencies between tasks for different regions of the learning space.

Conditional random fields. Tommasi participated in the writing of a book chapter on conditional Markov fields for information extraction. In this French book chapter, the authors review some machine learning methods for information extraction and focus on Conditional Random Fields (CRFs), for which some prototypes were developed last year. The chapter also gives the connections with Hidden Markov Models and logistic regression.

Learning and mining in graphs. Garriga was hired as a researcher (CR1). She started a research project on learning and mining data and data streams in networks, and she led a new working group inside Mostrare.

Contracts and Grants with Industry

Cifre Xerox (2009-2012)

Participants: Jean-Baptiste Faddoul, Rémi Gilleron, Fabien Torre (correspondent)

Gilleron and Torre continue supervising the PhD thesis (Cifre) of Jean-Baptiste Faddoul together with B. Chidlovskii from the Xerox European Research Center (XRCE).

Cifre Innovimax (2010-2013)

Participants: Tom Sebastian, Joachim Niehren (correspondent)

Niehren started supervising the PhD thesis (Cifre) of Tom Sebastian on streaming algorithms for XSLT, together with M. Zergaoui from INNOVIMAX S.A.R.L. in Paris.

QuiXProc: INRIA Transfer Project with Innovimax (2010-2012)

Participants: Denis Debarbieux, Joachim Niehren (correspondent)

Niehren and Debarbieux started an INRIA transfer project with Innovimax S.A.R.L. in Paris, on the integration of XPath streaming algorithms into XProc, the XML coordination language of the W3C.

Other Grants and Activities

National Actions

ANR Lampada (2009-2013)

Participants: Marc Tommasi (correspondent), Édouard Gilbert, Rémi Gilleron, Aurélien Lemay, Fabien Torre, Gemma Garriga

The Lampada project on “Learning Algorithms, Models and sPArse representations for structured DAta” is coordinated by Tommasi from Mostrare. Our partners are the Sequel project of INRIA Lille Nord Europe, the LIF (Marseille), the Hubert Curien laboratory (Saint-Etienne), and LIP6 (Paris). More information on the project can be found at http://lampada.gforge.inria.fr/.

Lampada is a fundamental research project on machine learning and structured data. It focuses on scaling learning algorithms to handle large sets of complex data. The main challenges are 1) high-dimensional learning problems, 2) large sets of data and 3) dynamics of data. The complex data we consider are evolving and composed of parts in some relations. Representations of these data embed both structure and content information and are typically large sequences, trees and graphs. The main application domains are Web 2.0, social networks and biological data.

The project proposes to study formal representations of such data together with incremental or sequential machine learning methods and similarity learning methods.

The representation research topic includes condensed data representation, sampling, prototype selection and representation of streams of data. Machine learning methods include edit distance learning, reinforcement learning and incremental methods, density estimation of structured data and learning on streams.

ANR Defis Codex (2009-2012)

Participants: Joachim Niehren (correspondent), Sławek Staworko, Aurélien Lemay, Sophie Tison, Anne-Cécile Caron, Jérôme Champavère

The Codex project on “Efficiency, Dynamicity and Composition for XML Models, Algorithms, and Systems” is coordinated by Manolescu (GEMO, INRIA Saclay). The other partners of Mostrare there are Geneves (WAM, INRIA Grenoble), Colazzo (LRI, Orsay), Castagna (PPS, Paris 7), and Halfeld (Blois). Public information on Codex can be found at http://codex.saclay.inria.fr/.

The Codex project seeks to push the frontier of XML technology in three interconnected directions. First, efficient algorithms and prototypes for massively distributed XML repositories are studied. Second, models are developed for describing, controlling, and reacting to the dynamic behavior of XML collections and XML schemas over time. Third, methods and prototypes are developed for composing XML programs for richer interactions, and XML schemas into rich, expressive, yet formally grounded type descriptions.

The main contributions of Mostrare are results on learning top-down XML transformations, on XPath query answering algorithms on XML streams, and on XML query learning. In addition, the Codex project has led to the creation of an INRIA transfer project QuiXProc with Innovimax and of a CIFRE project with Innovimax.

ANR Blanc Enum (2007-2011)

Participants: Guillaume Bagan, Joachim Niehren (correspondent), Sophie Tison

The Enum project on “Complexity and Algorithms for Answer Enumeration” is coordinated by A. Durand (Paris VII). The other partners are E. Grandjean (University of Caen) and N. Creignou (University of Marseille). Public information on Enum can be found at http://enumeration.gforge.inria.fr.

Enum studies algorithmic and complexity questions of answer enumeration, the task of generating all solutions of a given problem. Answer enumeration requires innovative efficient algorithms that can quickly serve large numbers of answers on demand. The prime application is query answering in databases, where huge answer sets arise naturally.

Mostrare contributed in 2010 to new answer enumeration algorithms for XPath queries.

ARC ACCESS (2010–2011)

Participants: Iovka Boneva (correspondent), Sophie Tison, Anne-Cécile Caron, Yves Roos, Benoît Groz, Sławek Staworko

This is a collaboration on the subject Access Control Policies for XML: Verification, Enforcement and Collaborative Edition, supported by the INRIA Collaboration Program (Action de Recherche Collaborative). The other participants involved are from the INRIA teams DAHU (INRIA Saclay – Île de France), PAREO and CASSIS (INRIA Nancy – Grand Est). This project is concerned with security and access control for Web data exchange, in the context of Web applications and Web services. We aim at defining automatic verification methods for checking properties of access control policies (ACPs) for XML, like consistency or secrecy, and for the comparison of ACPs. One of our goals is to apply formal tools from tree automata theory for this purpose. A second important goal is to design efficient methods for ACP enforcement for secure query evaluation. We will study several scenarios for solving different variants of this problem, based on the notion of secure user views. As a case study, we will apply our methods to an XML-based collaborative editing system.

International Cooperations

Transduce: INRIA Associated Team with NICTA Sydney

Participants: Guillaume Bagan, Joachim Niehren (correspondent), Aurélien Lemay, Benoît Groz, Slawomir Staworko, Grégoire Laurence

The leader of our NICTA partner team is S. Maneth. Public information on Transduce can be found at http://transduce.gforge.inria.fr/.

We continue our cooperation on learning algorithms for XML to XML transformations and on XML query answering algorithms. The main result in 2010 was a learning algorithm for top-down XML transformations.

Dissemination

Animation of the scientific community

Program Committees

R. Gilleron was a member of the program committee of CAP 2010 (Conférence Francophone sur l'Apprentissage Automatique).

J. Niehren is a member of the steering committee of RTA (International Conference on Rewriting Techniques and Applications) and of the editorial board of FUNDAMENTA INFORMATICAE. In 2010 he was a member of the program committees of LPAR 2010 (International Conference on Logic for Programming, Artificial Intelligence and Reasoning) and ATANLP 2010 (ACL 2010 Workshop on Applications of Tree Automata in Natural Language Processing).

S. Tison was a member of the program committees of RTA 2010 (the 21st International Conference on Rewriting Techniques and Applications) and STACS 2011 (Annual Symposium on Theoretical Aspects of Computer Science). She is a member of the editorial board of RAIRO - ITA and of the steering committee of STACS.

M. Tommasi was a member of the program committees of ECML 2010 (European Conference on Machine Learning), ATANLP 2010 (ACL 2010 Workshop on Applications of Tree Automata in Natural Language Processing) and LATA 2010 (the 4th International Conference on Language and Automata Theory and Applications).

F. Torre was a member of the program committees of ECML 2010 (European Conference on Machine Learning) and CAP 2010 (Conférence Francophone sur l'Apprentissage Automatique).

French Scientific Responsibilities

A.-C. Caron is a member of the French national evaluation committee for computer science assistant professors (CNU 27). She was a member of a selection committee for assistant professors in Lille.

R. Gilleron was a member of the scientific committee of the ANR program Programme blanc SIMI2. He participated in the AERES evaluation committee of the computer science lab IRISA (Rennes). He was a member of the selection committee in Nantes for a professor position, of the selection committee in Paris 6 for assistant professors, and of the selection committee in Lille for assistant professors.

S. Tison was co-head of the scientific committee of the ANR program Programme blanc SIMI3. She is head of the computer science lab in Lille (LIFL). She chairs the scientific council of the "Pôle de Compétitivité industries du Commerce". She was a member of the national PES commission 27. She was an invited member of the scientific council of ST2I (CNRS) until September 2010.

J. Niehren was president of the selection committee for postdocs and PhD students of the research center INRIA Lille Nord Europe, and a member of the selection committee for one professor position at Ecole Polytechnique Lille.

M. Tommasi was a member of the Technological Development Committee of INRIA Lille and of the scientific committees for the selection of assistant professors at Rennes IFSIC and the University of Marseille.

Miscellaneous

M. Tommasi gave an invited talk on conditional random fields at the ATALA workshop (Association pour le Traitement Automatique des Langues).

Teaching

Teaching hours


Iovka Boneva	192 hours	bachelor
Anne-Cécile Caron	192 hours	bachelor and masters
Jérôme Champavère	96 hours	masters
Rémi Gilleron	140 hours	bachelor and masters
Aurélien Lemay	49 hours	masters
Yves Roos	192 hours	bachelor and masters
Sławek Staworko	192 hours	bachelor and masters
Marc Tommasi	192 hours	masters
Sophie Tison	96 hours	masters
Camille Vacher	90 hours	masters
Benoît Groz	64 hours	masters

Master lectures at the University of Lille

Affaire et Negociation Internationale: Base de Données, by A. Lemay

Traduction Spécialisé Multilingue: Création de Site Web, by A. Lemay

XML, by M. Tommasi

Networks, by M. Tommasi

Automatisation du traitement de l'information, by M. Tommasi

Supervised classification, by R. Gilleron

Unsupervised classification, by R. Gilleron

Information retrieval, by R. Gilleron

Advanced algorithms and complexity, by B. Groz, S. Tison

Content Management Systems, by J. Champavère

Semantic Web, by J. Champavère

Web Programming by J. Champavère

Advanced databases, by A-C. Caron

Semantic Web, by A-C. Caron

Master projects and internships

Radu Ciucanu from Iasi in Romania started an internship on implementing enumeration algorithms for XPath dialects with variables in Scala. Directed by G. Bagan and J. Niehren.

Surbi Maheshwari from IIT Guwahati, India, started an internship on implementing learning algorithms for n-ary queries. Directed by A. Lemay and J. Niehren.

PhD theses

E. Gilbert, Learning weighted tree automata for information extraction from XML. Supervised by Tommasi and Gilleron

J. Champavère, Schema-guided query induction for information extraction. PhD defended in September 2010. Supervised by Niehren, Gilleron, and Lemay.

G. Laurence, Learning XMLtransformations for data exchange on the web. Supervised by Tommasi, Niehren, Staworko and Lemay.

B. Groz, XMLdatabase security and access control. Supervised by Tison and Staworko.

J.-B. Faddoul, Machine learning and applications to social network analysis. Supervised by Gilleron and Chidlovskii from XEROX European Research Center (XRCE).

J. Decoster, Statistical relational learning of XML transformations. Supervised by Tommasi and Torre.

A. M. Ndione, Probabilistic algorithms for tree automata and transducers. Supervised by Niehren and Lemay.

T. Sebastian, Streaming algorithms for XSLT. Supervised by Niehren.

PhD committees

A. Lemay belonged to the PhD committee of J. Champavère (Lille 1).

J. Niehren reviewed the PhD thesis of M. John (University of Rostock, Germany) and was a member of the PhD committee of J. Champavère (Lille 1).

M. Tommasi belonged to the PhD committees of J. Champavère (Lille 1) and Nataliya Sokolovska (Telecom ParisTech).

R. Gilleron was a reviewer of the PhD thesis of A. Zidouni (Marseille).

S. Tison belonged to the PhD committees of H. Sharif (Lille 1), H. Idabal (CRI Paris 1 Panthéon Sorbonne University), and C. Vacher (LSV, ENS Cachan).

Habilitation committees

R. Gilleron participated in the habilitation committees of L. Ralaivola (Marseille, as reviewer), J.C. Janodet (Saint-Etienne, as reviewer), and A. Habrard (Marseille, as member).

S. Tison belonged to the habilitation committee of F. Clautiaux (Lille 1).
