Mostrare is a joint project with the LIFL (UMR CNRS 8022) and the universities Lille 1 and Lille 3.
The objective of Mostrare is to develop adaptive document processing methods for XML-based information systems. Adaptiveness becomes important when documents evolve frequently, as they do on the Web. The particularity of Mostrare is that we develop semi-automatic or automatic information extraction approaches that fully benefit from the available tree structure of XML documents.
Information extraction is an instance of document transformation. In order to exploit the tree structure of XML documents, our goal is to investigate specification languages for tree transformations. These are based on approaches from database theory (such as the W3C standards XQuery and XSLT), automata, logic, and programming languages. We wish to define stochastic models of tree transformations, and to develop automatic or semi-automatic procedures for inferring them. Once available, we want to integrate these learning algorithms into innovative information extraction systems, semantic Web platforms, and document processing engines.
The following two paragraphs summarize our two main research objectives:
We wish to continue our work on modeling languages for node selection queries in tree-structured documents, to which we contributed in the first phase of Mostrare. The new subjects of interest in the second phase are XML document transformations and tree transformations that generalize node selection queries.
We wish to continue to study machine learning techniques for information extraction. One new goal is to develop learning algorithms that can induce XML document transformations, based on their tree structure. Another new goal is to explore stochastic machine learning techniques that can deal with uncertainty in document sources.
Lemay and Niehren present at ACM PODS the first learning algorithm for top-down XML transformations. This is a breakthrough result on transducer learning, since it supports copying and flipping of subtrees for the first time, while all previous proposals were restricted either to words or to relabelings on trees.
A paper by Tison and Dauchet received the 2010 IEEE LICS Test-of-Time Award. It established innovative tree automata techniques for showing that the first-order theory of one-step rewriting is decidable.
XML document transformations can be defined in the W3C standard languages XQuery or XSLT. Programming XML transformations in these languages is often difficult and error-prone, even if the schemata of input and output documents are known. Advanced programming experience and considerable programming time may be necessary, neither of which is available in Web services or similar scenarios.
Alternative programming languages for defining XML transformations have been proposed by the programming language community, for instance XDuce, Xtatic, and CDuce. The type systems of these languages simplify programming tasks considerably. But of course, they do not remove the general difficulty of programming XML transformations manually.
Languages for defining node selection queries arise as sub-languages of all XML transformation languages. The W3C standards use XPath for defining monadic queries, while XDuce and CDuce rely on regular queries defined by regular patterns equivalent to tree automata. Indeed, it is natural to look at node selection as a simple form of tree transformation. Monadic node selection queries correspond to deterministic transformations that annotate all selected nodes positively and all others negatively. N-ary node selection queries become non-deterministic transformations, yielding trees annotated by Boolean vectors.
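The view of a monadic query as an annotating transformation can be sketched as follows. This is only an illustration: the `Node` and `Annotated` classes, the `annotate` function, and the sample query `title_under_book` are hypothetical names introduced here, and the query is a plain Python predicate rather than an XPath expression or a tree automaton.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Tuple


@dataclass
class Node:
    label: str
    children: List["Node"] = field(default_factory=list)


@dataclass
class Annotated:
    label: str
    selected: bool  # True if the query selects this node
    children: List["Annotated"] = field(default_factory=list)


def annotate(tree: Node,
             query: Callable[[Node, Tuple[str, ...]], bool],
             ancestors: Tuple[str, ...] = ()) -> Annotated:
    """Turn a monadic query into a deterministic transformation that
    keeps the tree shape and marks every node as selected or not."""
    path = ancestors + (tree.label,)
    return Annotated(
        label=tree.label,
        selected=query(tree, path),
        children=[annotate(child, query, path) for child in tree.children],
    )


def title_under_book(node: Node, path: Tuple[str, ...]) -> bool:
    # Hypothetical query: select "title" nodes whose parent is a "book".
    return len(path) >= 2 and path[-2:] == ("book", "title")


doc = Node("library", [
    Node("book", [Node("title"), Node("author")]),
    Node("title"),
])
result = annotate(doc, title_under_book)
```

An n-ary query would, under the same assumptions, return one annotated tree per answer tuple, with a Boolean vector at each node instead of a single flag.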
After extensive studies of node selection queries in trees (in XPath and many other languages), the XML community has more recently started to investigate XML tree transformations formally. The expressiveness and complexity of XQuery have been studied. Type preservation is another problem, i.e., whether all trees of the input type get transformed into the output type, or vice versa, whether the inverse image of the output type is contained in the input type.
The automata community usually approaches tree transformations by tree transducers, i.e., tree automata producing output structure. Macro tree transducers, for instance, have been proposed recently for defining XML transformations. From the viewpoint of logic, tree transducers have been studied for MSO definability.
Automatic or semi-automatic tools for inferring tree transformations are needed for information extraction. Annotated examples may support the learning process. The learning target will be models of XML tree transformations specified in some of the languages discussed above.
Grammatical inference is commonly used to learn languages from examples and can be applied to learn transductions. Previous work on grammatical inference for transducers remains limited to the case of strings. For the tree case, only very basic tree transducers have so far been shown to be learnable, in previous work of the Mostrare project. These are node selecting tree transducers (NSTTs), which preserve the structure of trees while relabeling their nodes deterministically.
Statistical inference is most appropriate for dealing with uncertain or noisy data. It is generally useful for information extraction from textual data, given that current text understanding tools are still very limited. XML transformations with noisy input data typically arise in data integration tasks, for instance when converting PDF into XML.
Stochastic tree transducers have been studied in the context of natural language processing. A set of pairs of input and output trees defines a relation that can be represented by a 2-tape automaton called a stochastic finite-state transducer (SFST). A major problem consists in estimating the parameters of such a transducer. SFST training algorithms are lacking so far.
Probabilistic context-free grammars (pCFGs) are used in the context of PDF to XML conversion. In the first step, a procedure is learned that labels the leaves of the input document with labels of the output DTD. In the second step, given a CFG as a generative model of output documents, probabilities are learned. Such two-step approaches compete with one-step approaches that estimate conditional probabilities directly.
A popular non-generative model for information extraction is conditional random fields (CRFs). One main advantage of CRFs is that they take into account long-distance dependencies in the observed data. CRFs have been defined for general graphs but have mainly been applied to sequences, thus CRFs for XML trees should be investigated.
So-called structured output has recently become a research topic in machine learning. It aims at extending the classical categorization task, which consists in associating one or several labels with each input example, in order to handle structured output labels such as trees. The applicability of structured output learning algorithms remains to be assessed for real tasks such as XML transformations.
XML transformations are basic to data integration: HTML to XML transformations are useful for information extraction from the Web; XML to XML transformations are useful for data exchange between Web services, between peers, or between databases. Doan and Halevy survey novel integration tasks that appear with the Semantic Web and the usage of ontologies. Therefore, the semi-automatic generation of XML transformations is a challenge in the database community and in the Semantic Web community.
XML transformations are also useful for document processing. For instance, there is a need to design transformations from documents organized with respect to a visual format (HTML, DOC, PDF) into documents organized with respect to a semantic format (XML according to a DTD or a schema). The semi-automatic design of such transformations is a very challenging objective.
Furthermore, quite some activities of Mostrare concern efficient evaluation of XPath queries on XML documents and XML streams. XPath is fundamental to all XML standards, in particular to XQuery, XSLT, and XProc.
VOLATA: VOtes of Least generAl generalizaTions in jAva
VOLATA is a software bundle containing several learning
algorithms: learning algorithms for attribute-value datasets,
grammatical inference algorithms, and inductive logic
programming algorithms. VOLATA has been applied to document
classification and information extraction tasks. The
software is available at
http://
PICCATA: Programming Interface for effiCient Computations and Approximation on multiplicity Tree Automata.
Piccata is a programming interface for learning weighted
and classical tree automata from examples. Piccata
development started in 2009 with the former member Feriel
Lahlali. The source code is available under the CeCILL
licence at
http://
EVOXS: Earliest evaluation of XPath on streams
This is an implementation of XPath query answering algorithms on XML streams, following the algorithms developed in the PhD thesis of O. Gauwin in 2009, directed by J. Niehren and S. Tison. It consists of a compiler from a fragment of XPath to deterministic streaming tree automata and an earliest query answering algorithm for queries defined by deterministic streaming tree automata. The main developers of the first version are O. Gauwin and J. Niehren.
EVOXS will be the starting point for our cooperation with
INNOVIMAX S.A.R.L. in Paris, within the QuiXProc transfer
project powered by D. Debarbieux, and the CIFRE PhD thesis of
T. Sebastian, both starting in December 2010. The source code
is available under the GNU public licence at
https://
Query answering and access control. Groz, Boneva, Roos, Tison, Caron and Staworko study the problem of update translation for views on XML documents. More precisely, given an XML view definition and a user-defined view update program, the problem is to find a source update program that translates the view update without side effects on the view. Both views and update programs are represented by recognizable tree languages, and different settings for the update problem are studied. The results of this line of research were obtained during 2010 and will be published at the 14th International Conference on Database Theory (ICDT'2011).
Answer enumeration. Bagan et al. investigate efficient enumeration algorithms for conjunctive queries for databases over binary relations that satisfy the X-underbar property. In particular, their algorithm is able to enumerate answers of XPath queries with variables with linear delay and quadratic precomputation time, both in the size of the database.
Bagan and Niehren study a more efficient answer enumeration algorithm for a fragment of conditional XPath with variables, which is a first-order complete query language for unranked trees of bounded depth. Their algorithm requires only linear precomputation time and constant delay for fixed queries, while depending linearly on the size of the query. It is based on a new enumeration algorithm for disjunctions of acyclic conjunctive queries on the so-called X-doublebar structures they introduce, which are more restrictive than the previously known X-underbar structures.
Query answering on XML streams. Niehren started a transfer project QuiXProc with the industrial partner INNOVIMAX S.A.R.L. in December 2010, in which they plan to integrate the XPath query answering algorithms developed by Mostrare into XProc, the XML coordination language of the W3C. In this line of research, Niehren and Tison (with their previous PhD student O. Gauwin) showed how to distinguish node selection queries on XML streams with bounded delay and concurrency. This is a journal version of a previous LATA 2009 conference paper.
Tree automata. Tison et al. revisit in an invited paper their results on tree automata with global equality and disequality constraints (TAGEDs), previously published in 2008 at the International Conference on Developments in Language Theory (DLT).
Vacher is starting his postdoctoral studies on tree automata with constraints under the supervision of Niehren and Tison.
M. Ndione is starting his PhD thesis on probabilistic algorithms for tree automata and transducers under the supervision of Lemay and Niehren.
Logic. Amano started his postdoc on XML data exchange under the supervision of Niehren.
Staworko et al. introduce the framework of consistent query answers and repairs in order to alleviate the impact of inconsistent data on the answers to a query. In particular, a repair is a minimally different consistent instance, and an answer is consistent if it is present in every repair.
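The two definitions above can be illustrated on a deliberately small case. The sketch below assumes a flat relation with a key constraint and deletion-only repairs, which is a simplification chosen for illustration and not the setting of the cited work; the names `repairs`, `consistent_answers`, and the `employees` data are hypothetical.

```python
from collections import defaultdict
from itertools import product


def repairs(rows, key):
    """Enumerate the repairs of `rows` under a key constraint on column
    index `key`: keep exactly one tuple per key value (assumption:
    repairs are obtained by deleting tuples only)."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[key]].append(row)
    for choice in product(*groups.values()):
        yield set(choice)


def consistent_answers(rows, key, query):
    """An answer is consistent if the query returns it on every repair."""
    answer_sets = [query(repair) for repair in repairs(rows, key)]
    return set.intersection(*answer_sets)


# The name column (index 0) should be a key, but "ann" violates it.
employees = [("ann", "lille"), ("ann", "paris"), ("bob", "lille")]


def names_in_lille(instance):
    return {name for name, city in instance if city == "lille"}


answers = consistent_answers(employees, 0, names_in_lille)
```

Here there are two repairs, one per choice of city for "ann", and only "bob" is returned by the query on both of them.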
Programming languages. Niehren et al. present a journal version of the attributed pi-calculus, a modeling language for systems biology. They add priorities compared to the CMSB 2008 conference version, while elaborating the analogy between priorities and stochastic rates in the pi-calculus.
Learning tree transformations: transducer induction. Lemay, Niehren et al. present at ACM PODS the first learning algorithm for top-down XML transformations that can restructure trees by copying, flipping, and deleting subtrees. This is a breakthrough result on transducer learning. Previous proposals were restricted either to transducers on words or to relabelings on trees. The results are obtained by combining a new top-down encoding of unranked into ranked trees guided by DTDs with a new learning algorithm for deterministic top-down tree transducers (DTOPs). This learning result for DTOPs is derived from a new Myhill-Nerode theorem for DTOPs that the paper establishes. This theorem also shows the existence of unique minimal DTOPs that are consistent with a DTD. An alternative minimization algorithm for DTOPs was obtained previously, but without the Myhill-Nerode theorem or any link to learning. This result was obtained within our associated team TRANSDUCE with NICTA Sydney, which was created in 2010, although the cooperation actually started in 2007.
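To make the three restructuring operations concrete, here is a minimal sketch of how a deterministic top-down tree transducer can be run on ranked trees. It does not reproduce the learning algorithm or the DTD-guided encoding of the paper; the rule table `RULES`, the state name `q`, and the labels are hypothetical examples.

```python
from typing import NamedTuple


class Call(NamedTuple):
    """Right-hand-side leaf: process input subtree `child` in `state`."""
    state: str
    child: int


# Ranked trees are tuples: (label, subtree, subtree, ...).
# Each rule maps (state, input label) to an output template.
RULES = {
    ("q", "pair"): ("pair", Call("q", 1), Call("q", 0)),  # flipping
    ("q", "dup"): ("both", Call("q", 0), Call("q", 0)),   # copying
    ("q", "drop"): ("kept", Call("q", 1)),                # deleting child 0
    ("q", "a"): ("a",),
    ("q", "b"): ("b",),
}


def run(state, tree, rules=RULES):
    """Apply the unique rule for (state, root label); a missing rule
    raises KeyError, since the transduction may be partial."""
    label, children = tree[0], tree[1:]
    return instantiate(rules[(state, label)], children, rules)


def instantiate(rhs, children, rules):
    if isinstance(rhs, Call):
        return run(rhs.state, children[rhs.child], rules)
    return (rhs[0],) + tuple(instantiate(r, children, rules) for r in rhs[1:])


flipped = run("q", ("pair", ("a",), ("b",)))
copied = run("q", ("dup", ("a",)))
pruned = run("q", ("drop", ("a",), ("b",)))
```

Determinism shows up in the fact that `RULES` is a dictionary with at most one template per state and input label.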
Learning queries or schemas: automata induction. Champavère defended his PhD thesis in September 2010 under the supervision of Niehren, Lemay and Gilleron. The thesis studies the incorporation of schema knowledge in XML query induction from annotated example trees. The results lift tree automata based induction algorithms for total monadic node selecting queries to partial queries whose domain is fixed by a known schema. Schema consistency is checked dynamically by testing language inclusion for tree automata. Another contribution of the thesis is to study query induction from pruned annotated examples where subtrees irrelevant for node selection may be cut away; in this line of research, the thesis presents a new learnability result for classes of queries that are stable under schema-guided pruning strategies.
Sequence classification. Torre et al. present a general framework for supervised classification; in particular, their goal is the classification of sequences by deciding whether a word belongs to some language or not. They integrate classical grammatical inference techniques into a general framework resulting from supervised classification: the hypotheses are represented by automata or balls of strings, which are then combined by traditional boosting algorithms.
Induction of stochastic tree automata. Gilbert, Gilleron and Tommasi study the inference of probability distributions over sets of unranked trees (i.e., trees where a node can have an unbounded number of direct subtrees). The main objective here is to build probabilistic decision procedures able to classify (XML/HTML) trees from different sources, or to get concise representations of sets of trees. The problem is formalized as the more general problem of learning tree series. To define recognizable tree series of unranked trees, the authors specify weighted automata for unranked trees by extending previous hedge automata and weighted trees. The paper also considers binary representations of unranked trees and shows that recognizable tree series for unranked trees can be defined and studied from recognizable tree series of those binary representations. The paper also presents decidability results on probabilistic tree automata and algorithms for computing sums of convergent series.
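One common binary representation of unranked trees is the first-child/next-sibling encoding, sketched below. Which encoding the paper uses is not stated here, so this choice is an assumption made for illustration; `encode` and `decode` are hypothetical helper names.

```python
def encode(hedge):
    """Encode a hedge (list of unranked trees, each a pair
    (label, list_of_children)) as one binary tree
    (label, first_child, next_sibling), with None for absence."""
    if not hedge:
        return None
    (label, children), *siblings = hedge
    return (label, encode(children), encode(siblings))


def decode(binary):
    """Inverse of `encode`: rebuild the hedge from its binary encoding."""
    if binary is None:
        return []
    label, first_child, next_sibling = binary
    return [(label, decode(first_child))] + decode(next_sibling)


tree = ("a", [("b", []), ("c", [("d", [])])])
binary = encode([tree])
restored = decode(binary)
```

Because the encoding is a bijection between hedges and binary trees, a weight assigned to a binary tree can be read as a weight of the unranked tree it encodes.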
Multitask learning. Faddoul, Torre and Gilleron, with their partner from XEROX Grenoble, study the problem of learning multiple related tasks from data simultaneously. For this problem, the authors propose a novel learning algorithm, called MT-Adaboost, which extends the traditional Adaboost algorithm to the multitask setting by using simple decision stumps as weak classifiers. The practical and theoretical results of the paper show that the new algorithm learns the dependencies between tasks for different regions of the learning space.
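As background, the traditional single-task Adaboost with decision stumps can be sketched as follows. This is the standard textbook algorithm, not the multitask extension proposed by the authors; the function names and the toy data are hypothetical.

```python
import math


def stump_predict(x, feature, threshold, sign):
    """A decision stump: predict `sign` below the threshold, else `-sign`."""
    return sign if x[feature] <= threshold else -sign


def best_stump(X, y, w):
    """Exhaustively pick the stump with the smallest weighted error."""
    best = None
    for j in range(len(X[0])):
        for t in sorted({x[j] for x in X}):
            for s in (1, -1):
                err = sum(wi for x, yi, wi in zip(X, y, w)
                          if stump_predict(x, j, t, s) != yi)
                if best is None or err < best[0]:
                    best = (err, j, t, s)
    return best


def adaboost(X, y, rounds=5):
    """Labels in y must be +1 or -1. Returns a list of weighted stumps."""
    n = len(X)
    w = [1.0 / n] * n
    ensemble = []
    for _ in range(rounds):
        err, j, t, s = best_stump(X, y, w)
        err = min(max(err, 1e-10), 1 - 1e-10)  # avoid log(0) and division by 0
        alpha = 0.5 * math.log((1 - err) / err)
        ensemble.append((alpha, j, t, s))
        # Increase the weight of misclassified examples, then renormalize.
        w = [wi * math.exp(-alpha * yi * stump_predict(x, j, t, s))
             for x, yi, wi in zip(X, y, w)]
        total = sum(w)
        w = [wi / total for wi in w]
    return ensemble


def predict(ensemble, x):
    score = sum(a * stump_predict(x, j, t, s) for a, j, t, s in ensemble)
    return 1 if score >= 0 else -1


X = [[1.0], [2.0], [3.0], [4.0]]
y = [1, 1, -1, -1]
model = adaboost(X, y, rounds=3)
```

The reweighting step is what lets later stumps concentrate on the examples that earlier stumps got wrong.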
Conditional random fields. Tommasi participated in the writing of a book chapter on conditional Markov fields for information extraction. In this French book chapter, the authors review some machine learning methods for information extraction and focus on Conditional Random Fields (CRFs), for which some prototypes were developed last year. The chapter also gives the connections with Hidden Markov Models and logistic regression.
Learning and mining in graphs. Garriga was hired as a researcher (CR1). She started a research project on learning and mining data and data streams in networks, and she led a new working group inside Mostrare.
Gilleron and Torre continue supervising the PhD thesis (CIFRE) of Jean-Baptiste Faddoul together with B. Chidlovskii from the Xerox Research Centre Europe (XRCE).
Niehren started supervising the PhD thesis (CIFRE) of Tom Sebastian on streaming algorithms for XSLT with M. Zergaoui from INNOVIMAX S.A.R.L. in Paris.
Niehren and Debarbieux started an INRIA transfer project with INNOVIMAX S.A.R.L. in Paris, on the integration of XPath streaming algorithms into XProc, the XML coordination language of the W3C.
The Lampada project on “Learning Algorithms, Models and
sPArse representations for structured DAta” is coordinated
by Tommasi from Mostrare. Our partners are the
Sequel project of Inria Lille Nord Europe, the
LIF (Marseille), the Hubert Curien laboratory
(Saint-Etienne), and LIP6 (Paris).
More information on the project can be found on
http://
Lampada is a fundamental research project on machine learning and structured data. It focuses on scaling learning algorithms to handle large sets of complex data. The main challenges are 1) high-dimensional learning problems, 2) large sets of data and 3) dynamics of data. The complex data we consider are evolving and composed of parts that stand in some relations. Representations of these data embed both structure and content information and are typically large sequences, trees and graphs. The main application domains are Web 2.0, social networks and biological data.
The project proposes to study formal representations of such data together with incremental or sequential machine learning methods and similarity learning methods.
The representation research topic includes condensed data representation, sampling, prototype selection and representation of streams of data. Machine learning methods include edit distance learning, reinforcement learning and incremental methods, density estimation of structured data and learning on streams.
The Codex project on “Efficiency, Dynamicity and
Composition for XML Models, Algorithms, and Systems” is
coordinated by Manolescu (GEMO, INRIA Saclay). The other
partners of Mostrare there are Geneves (WAM, INRIA
Grenoble), Colazzo (LRI, Orsay), Castagna (PPS, Paris 7),
and Halfeld (Blois).
Public information on Codex can be found on
http://
The Codex project seeks to push the frontier of XML technology in three interconnected directions. First, efficient algorithms and prototypes for massively distributed XML repositories are studied. Second, models are developed for describing, controlling, and reacting to the dynamic behavior of XML collections and XML schemas over time. Third, methods and prototypes are developed for composing XML programs for richer interactions, and XML schemas into rich, expressive, yet formally grounded type descriptions.
The main contributions of Mostrare are results on learning top-down XML transformations, on XPath query answering algorithms on XML streams, and on XML query learning. In addition, the Codex project has led to the creation of an INRIA transfer project QuiXProc with Innovimax and of a CIFRE project with Innovimax.
The Enum project on “Complexity and Algorithms for
Answer Enumeration” is coordinated by A. Durand (Paris
VII). The other partners are E. Grandjean (University of
Caen) and N. Creignou (University of Marseille). Public
information on Enum can be found on
http://
Enum studies algorithmic and complexity questions of answer enumeration, the task of generating all solutions of a given problem. Answer enumeration requires innovative efficient algorithms that can quickly serve large numbers of answers on demand. The prime application is query answering in databases, where huge answer sets arise naturally.
Mostrare contributed in 2010 to new answer enumeration algorithms for XPath queries.
This is a collaboration on the subject Access Control Policies for XML: Verification, Enforcement and Collaborative Edition, supported by the INRIA Collaboration Program (Action de Recherche Collaborative). The other participants involved are from the INRIA teams DAHU (INRIA Saclay – Île de France), PAREO and CASSIS (INRIA Nancy – Grand Est). This project is concerned with security and access control for Web data exchange, in the context of Web applications and Web services. We aim at defining automatic verification methods for checking properties of access control policies (ACPs) for XML, such as consistency or secrecy, and for the comparison of ACPs. One of our goals is to apply formal tools from tree automata theory for this purpose. A second important goal is to design efficient methods for ACP enforcement for secure query evaluation. We will study several scenarios for solving different variants of this problem, based on the notion of secure user views. As a case study, we will apply our methods to an XML-based collaborative editing system.
The leader of our NICTA partner team is S. Maneth.
Public information on Transduce can be found on
http://
We continue to cooperate on learning algorithms for XML to XML transformations and on XML query answering algorithms. The main result in 2010 was a learning algorithm for top-down XML transformations.
R. Gilleron was a member of the program committee of CAP 2010 (Conférence Francophone sur l'Apprentissage Automatique).
J. Niehren is a member of the steering committee of RTA (International Conference on Rewriting Techniques and Applications) and of the editorial board of FUNDAMENTA INFORMATICAE. In 2010 he was a member of the program committees of LPAR 2010 (International Conference on Logic for Programming, Artificial Intelligence and Reasoning) and ATANLP 2010 (ACL 2010 Workshop on Applications of Tree Automata in Natural Language Processing).
S. Tison was a member of the program committees of RTA 2010 (the 21st International Conference on Rewriting Techniques and Applications) and STACS 2011 (Annual Symposium on Theoretical Aspects of Computer Science). She is a member of the editorial board of RAIRO - ITA and of the steering committee of STACS.
M. Tommasi was a member of the program committees of ECML 2010 (European Conference on Machine Learning), ATANLP 2010 (ACL 2010 Workshop on Applications of Tree Automata in Natural Language Processing) and LATA 2010 (the 4th International Conference on Language and Automata Theory and Applications).
F. Torre was a member of the program committees of ECML 2010 (European Conference on Machine Learning) and CAP 2010 (Conférence Francophone sur l'Apprentissage Automatique).
A.C. Caron is a member of the French national evaluation committee for computer science assistant professors (CNU 27). She was a member of a selection committee for assistant professors in Lille.
R. Gilleron was a member of the scientific committee of the program Programme blanc SIMI2, ANR. He participated in the AERES evaluation committee of the computer science lab IRISA (Rennes). He was a member of the selection committee in Nantes for a professor position, of the selection committee in Paris 6 for assistant professors, and of the selection committee in Lille for assistant professors.
S. Tison was co-head of the scientific committee of the program Programme blanc SIMI3, ANR. She is head of the computer science lab in Lille (LIFL). She chairs the scientific council of the "Pôle de Compétitivité industries du Commerce". She was a member of the national PES commission 27. She was an invited member of the scientific council of ST2I (CNRS) until September 2010.
J. Niehren was president of the selection committee for postdocs and PhD students of the research center INRIA Lille Nord Europe, and a member of the selection committee for one professor position at Ecole Polytechnique Lille.
M. Tommasi was a member of the Technological Development Committee of Inria Lille and of the scientific committees for the selection of assistant professors at Rennes IFSIC and the University of Marseille.
M. Tommasi gave an invited talk on conditional random fields at the workshop ATALA (Association pour le traitement automatique des Langues).
Iovka Boneva | 192 hours | bachelor |
Anne-Cécile Caron | 192 hours | bachelor and masters |
Jérôme Champavère | 96 hours | masters |
Rémi Gilleron | 140 hours | bachelor and masters |
Aurélien Lemay | 49 hours | masters |
Yves Roos | 192 hours | bachelor and masters |
Sławek Staworko | 192 hours | bachelor and masters |
Marc Tommasi | 192 hours | masters |
Sophie Tison | 96 hours | masters |
Camille Vacher | 90 hours | masters |
Benoît Groz | 64 hours | masters |
Affaire et Negociation Internationale: Base de Données, by A. Lemay
Traduction Spécialisé Multilingue: Création de Site Web, by A. Lemay
XML, by M. Tommasi
Networks, by M. Tommasi
Automatisation du traitement de l'information, by M. Tommasi
Supervised classification, by R. Gilleron
Unsupervised classification, by R. Gilleron
Information retrieval, by R. Gilleron
Advanced algorithms and complexity, by B. Groz, S. Tison
Content Management Systems, by J. Champavère
Semantic Web, by J. Champavère
Web Programming, by J. Champavère
Advanced databases, by A-C. Caron
Semantic Web, by A-C. Caron
Radu Ciucanu from Iași, Romania, started an internship on implementing enumeration algorithms for XPath dialects with variables in Scala. Directed by G. Bagan and J. Niehren.
Surbi Maheshwari from IIT Guwahati, India, started an internship on implementing learning algorithms for n-ary queries. Directed by A. Lemay and J. Niehren.
E. Gilbert, Learning weighted tree automata for information extraction from XML. Supervised by Tommasi and Gilleron.
J. Champavère, Schema-guided query induction for information extraction. PhD defended in September 2010. Supervised by Niehren, Gilleron, and Lemay.
G. Laurence, Learning XML transformations for data exchange on the web. Supervised by Tommasi, Niehren, Staworko and Lemay.
B. Groz, XML database security and access control. Supervised by Tison and Staworko.
J.-B. Faddoul, Machine learning and applications to social network analysis. Supervised by Gilleron and Chidlovskii from the Xerox Research Centre Europe (XRCE).
J. Decoster, Statistical relational learning of XML transformations. Supervised by Tommasi and Torre.
A. M. Ndione, Probabilistic algorithms for tree automata and transducers. Supervised by Niehren and Lemay.
T. Sebastian, Streaming algorithms for XSLT. Supervised by Niehren.
A. Lemay belonged to the PhD committee of J. Champavère (Lille 1).
J. Niehren reviewed the PhD thesis of M. John (University of Rostock, Germany) and was a member of the PhD committee of J. Champavère (Lille 1).
M. Tommasi belonged to the PhD committees of J. Champavère (Lille 1) and Nataliya Sokolovska (Telecom ParisTech).
R. Gilleron was a reviewer of the PhD thesis of A. Zidouni (Marseille).
S. Tison belonged to the PhD committees of H. Sharif (Lille 1), H. Idabal (CRI Paris 1 Panthéon Sorbonne University), and C. Vacher (LSV, ENS Cachan).
R. Gilleron participated in the habilitation committees of L. Ralaivola (Marseille, as reviewer), J.C. Janodet (Saint-Etienne, as reviewer) and A. Habrard (Marseille, as member).
S. Tison belonged to the habilitation committee of F. Clautiaux (Lille 1).