Extraction and Implication of Path Constraints

MOSTRARE: Modeling Tree Structures, Machine Learning, and Information Extraction (SYM)

Rémi Gilleron (French university, faculty): professor, University of Lille 3; HDR
Karine Lewandowski (INRIA, assistant): shared by 3 projects
Joachim Niehren (INRIA, researcher): senior researcher (DR2), UR Futurs; HDR
Aurélien Lemay (French university, faculty): assistant professor
Isabelle Tellier (French university, faculty): assistant professor; HDR
Marc Tommasi (French university, faculty): assistant professor; HDR
Fabien Torre (French university, faculty): assistant professor
Anne-Cécile Caron (French university, faculty): assistant professor
Yves Roos (French university, faculty): assistant professor
Jean-Marc Talbot (French university, faculty): assistant professor until September 2006; HDR
Sophie Tison (French university, faculty): professor; HDR
Iovka Boneva (French university, PhD): MESR fellowship, from October 2002 to June 2006
Laurent Candillier (private company, PhD): CIFRE fellowship, from May 2003 to September 2006
Jérôme Champavère (French university, PhD): MESR fellowship, since October 2006
Emmanuel Filiot (INRIA, PhD): INRIA and Région Nord-Pas-de-Calais fellowship, since October 2005
Olivier Gauwin (INRIA, PhD): INRIA Cordi fellowship, since December 2006
Florent Jousse (INRIA, PhD): INRIA and Région Nord-Pas-de-Calais fellowship, since October 2004
Patrick Marty (INRIA, PhD): INRIA and Région Nord-Pas-de-Calais fellowship, since October 2003
Mathieu Keith (INRIA, technical staff): junior engineer since November 2006
Jean-Philippe Nirel (INRIA, technical staff): junior engineer from October 2005 to June 2006
Missi Tran (INRIA, technical staff): engineer from January to June 2006
Denis Debarbieux (French university, PhD): assistant professor until September 2006

MOSTRARE is a joint project with the LIFL - UMR 8022 (CNRS, Lille 1 and Lille 3 universities).

Overall Objectives

The objective of MOSTRARE is to develop adaptive document processing methods for XML-based information systems. Adaptiveness matters when documents evolve frequently, as on the Web. The particularity of MOSTRARE is that we develop semi-automatic or automatic information extraction approaches that can fully benefit from the available tree structure of XML documents.

Information extraction is an instance of document transformation. In order to exploit the tree structure of XML documents, our goal is to investigate specification languages for tree transformations. These are based on approaches from database theory (such as the W3C standards XQuery and XSLT), automata, logic, and programming languages. We wish to define stochastic models of tree transformations, and to develop automatic or semi-automatic procedures for inferring them. Once available, we want to integrate these learning algorithms into innovative information extraction systems, semantic Web platforms, and document processing engines.

The following two paragraphs summarize our two main research objectives:

Modeling tree structures for information extraction.

We wish to extend the study of modeling languages for node selection queries in tree-structured documents, to which we contributed in the first phase of Mostrare. The new subjects of interest in the second phase are XML document transformations and tree transformations that generalize node selection queries.

Machine learning for information extraction.

We wish to extend our study of machine learning techniques for information extraction. One new goal is to develop learning algorithms that can induce XML document transformations, based on their tree structure. Another new goal is to explore stochastic machine learning techniques that can deal with uncertainty in document sources.

Scientific Foundations

Modeling XML document transformations

Keywords: semi-structured documents, trees, queries, transformations, automata, logic.

XML document transformations can be defined in the W3C standard query languages XQuery or XSLT. Programming XML transformations in these languages is often difficult and error-prone, even if the schemata of input and output documents are known. Advanced programming experience and considerable programming time may be necessary, neither of which is available in Web services or similar scenarios.

To illustrate the main difficulty of programming XML transformations, consider the example of PDF to XML conversion, under the assumption that the output's DTD is given. In a first step, one can use an existing PDF to HTML converter. In a second step, it remains to convert HTML into XML. The DTD of the HTML input document, however, will be either unknown or uninformative. Furthermore, the input will contain errors that have to be accounted for.

Alternative programming languages for defining XML transformations have been proposed by the programming language community, for instance XDuce, Xtatic, and CDuce. The type systems of these languages simplify the programming tasks considerably. They do not, however, remove the general difficulty of programming XML transformations manually.

Languages for defining node selection queries arise as sublanguages of all XML transformation languages. The W3C standards use XPath for defining monadic queries, while XDuce and CDuce rely on regular queries defined by regular patterns equivalent to tree automata. Indeed, it is natural to look at node selection as a simple form of tree transformation. Monadic node selection queries correspond to deterministic transformations that annotate all selected nodes positively and all others negatively. N-ary node selection queries become non-deterministic transformations, yielding trees annotated by Boolean vectors.
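The view of node selection as tree relabeling can be made concrete with a minimal Python sketch. The class and function names below (`Node`, `Annotated`, `relabel`, `monadic`, `nary`) and the toy table document are illustrative assumptions, not part of any Mostrare software; queries are given extensionally, as sets of node positions, rather than in XPath or as automata.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Set, Tuple

Position = Tuple[int, ...]  # a node is addressed by the child indexes leading to it


@dataclass
class Node:
    label: str
    children: List["Node"] = field(default_factory=list)


@dataclass
class Annotated:
    label: str
    mark: object  # a Boolean (monadic case) or a tuple of Booleans (n-ary case)
    children: List["Annotated"] = field(default_factory=list)


def positions(tree: Node, here: Position = ()):
    """Enumerate (position, node) pairs in document order."""
    yield here, tree
    for i, child in enumerate(tree.children):
        yield from positions(child, here + (i,))


def relabel(tree: Node, mark: Callable[[Position], object], here: Position = ()) -> Annotated:
    """Copy the tree, attaching mark(position) to every node; the shape is preserved."""
    kids = [relabel(c, mark, here + (i,)) for i, c in enumerate(tree.children)]
    return Annotated(tree.label, mark(here), kids)


def monadic(tree: Node, selected: Set[Position]) -> Annotated:
    """Deterministic: one output tree, a node is marked True iff it is selected."""
    return relabel(tree, lambda p: p in selected)


def nary(tree: Node, tuples: Set[Tuple[Position, ...]]) -> List[Annotated]:
    """Non-deterministic: one output tree per selected tuple; component k of a
    node's Boolean vector is True iff the node is the k-th element of the tuple."""
    return [relabel(tree, lambda p, t=t: tuple(p == q for q in t))
            for t in sorted(tuples)]


# Hypothetical document: a table with two (name, price) rows.
table = Node("table", [
    Node("tr", [Node("name"), Node("price")]),
    Node("tr", [Node("name"), Node("price")]),
])
names = {p for p, n in positions(table) if n.label == "name"}
one_tree = monadic(table, names)
pairs = {((0, 0), (0, 1)), ((1, 0), (1, 1))}
pair_trees = nary(table, pairs)
```

In the monadic case a single annotated tree is produced, whereas the binary query yields one annotated tree per selected (name, price) pair, which is where the non-determinism comes from.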

After extensive studies of node selection queries in trees (in XPath and many other languages), the XML community has more recently started to formally investigate XML tree transformations. The expressiveness and complexity of XQuery have been studied. Type preservation is another problem, i.e., whether all trees of the input type get transformed into the output type or, conversely, whether the inverse image of the output type is contained in the input type.

The automata community usually approaches tree transformations by tree transducers, i.e., tree automata producing output structure. Macro tree transducers, for instance, have been proposed recently for defining XML transformations. From the viewpoint of logic, tree transducers have been studied for MSO definability.

Machine learning for XML document transformations

Keywords: grammatical inference, statistical learning, wrapper induction, trees, annotation, transformation.

Automatic or semi-automatic tools for inferring tree transformations are needed for information extraction. Annotated examples may support the learning process. The learning target will be models of XML tree transformations specified in some of the languages discussed above.

Grammatical inference should be useful for inferring tree transducers that represent XML transformations. So far, only very basic tree transducers have been shown to be learnable, in previous work of the Mostrare project. These are node selecting tree transducers (NSTTs), which preserve the structure of trees while relabeling their nodes deterministically. Other work on grammatical inference for transducers remains limited to the case of strings. The case of trees remains to be explored.

Stochastic tree transducers have been studied in the context of natural language processing. A set of pairs of input and output trees defines a relation that can be represented by a 2-tape automaton called a stochastic finite-state transducer (SFST). A major problem consists in estimating the parameters of such a transducer. SFST training algorithms are lacking so far.

Statistical inference is most appropriate for dealing with uncertain or noisy data. It is generally useful for information extraction from textual data, given that current text understanding tools are still very limited. XML transformations with noisy input data typically arise in data integration tasks, for instance when converting PDF into XML.

Probabilistic context-free grammars (pCFGs) are used in the context of PDF to XML conversion. Such methods infer a CFG as a generative model, to which probabilities are added in a second step. Such two-step approaches compete with one-step approaches that estimate conditional probabilities directly.
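As a rough illustration of the second step of such a two-step approach, the following Python sketch attaches probabilities to the rules of a given CFG by relative-frequency counting over example derivation trees, and then scores a derivation as the product of its rule probabilities. The estimation method, the toy grammar (`doc`, `title`, `body`, `para`) and the terminal names are assumptions made for the example; the cited conversion systems may proceed differently.

```python
from collections import Counter

# A derivation tree is either a terminal (a string) or a tuple
# (nonterminal, child, child, ...).


def rules_of(tree):
    """Yield the rule (lhs, rhs) applied at every inner node of a derivation."""
    if isinstance(tree, str):
        return
    label, *children = tree
    yield label, tuple(c if isinstance(c, str) else c[0] for c in children)
    for child in children:
        yield from rules_of(child)


def estimate(trees):
    """Relative-frequency estimate: count(lhs -> rhs) / count(lhs)."""
    counts = Counter(rule for tree in trees for rule in rules_of(tree))
    totals = Counter()
    for (lhs, _), n in counts.items():
        totals[lhs] += n
    return {rule: n / totals[rule[0]] for rule, n in counts.items()}


def tree_probability(tree, probs):
    """Probability of a derivation: product of the probabilities of its rules."""
    p = 1.0
    for rule in rules_of(tree):
        p *= probs.get(rule, 0.0)
    return p


# Hypothetical treebank: documents made of a title and one or two paragraphs.
short = ("doc", ("title", "BIG"), ("body", ("para", "LINE")))
long_ = ("doc", ("title", "BIG"),
         ("body", ("para", "LINE"), ("body", ("para", "LINE"))))
probs = estimate([short, long_])
```

With this treebank, the rule `body -> para` is used twice and `body -> para body` once, so they receive probabilities 2/3 and 1/3, and the two example derivations score 2/3 and 2/9 respectively.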

A popular non-generative model for information extraction is conditional random fields (CRFs). One main advantage of CRFs is that they take into account long-distance dependencies in the observed data. CRFs have also been applied in many other settings, such as gene prediction in bioinformatics. Essentially, CRFs have been used to model sequences of words. A CRF assumes a graph that encodes the conditional independence of random variables, together with features that are used to estimate conditional probabilities. CRFs have been used for sequences whose internal graph structure is represented as a linear chain.
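The linear-chain case can be sketched in a few lines of Python. The conditional probability of a label sequence is the exponential of a weighted feature sum, normalized by a partition function that the forward algorithm computes without enumerating all labelings. The label set, the three features and their hand-set weights are assumptions chosen for the example; in practice the weights are learned from data.

```python
import math
from itertools import product

LABELS = ("NAME", "OTHER")


def local_score(prev, cur, words, i):
    """Weighted feature sum at position i (weights are hand-set assumptions)."""
    capitalized = words[i][0].isupper()
    s = 0.0
    if cur == "NAME" and capitalized:
        s += 2.0
    if cur == "OTHER" and not capitalized:
        s += 1.0
    if prev == "NAME" and cur == "NAME":
        s += 0.5
    return s


def sequence_score(words, labels):
    """Unnormalized log-score of a complete labeling."""
    return sum(
        local_score(labels[i - 1] if i > 0 else None, labels[i], words, i)
        for i in range(len(words))
    )


def log_partition(words):
    """log Z(words) by the forward algorithm, quadratic in the number of labels."""
    alpha = {cur: local_score(None, cur, words, 0) for cur in LABELS}
    for i in range(1, len(words)):
        alpha = {
            cur: math.log(sum(
                math.exp(alpha[prev] + local_score(prev, cur, words, i))
                for prev in LABELS))
            for cur in LABELS
        }
    return math.log(sum(math.exp(a) for a in alpha.values()))


def conditional_probability(words, labels):
    """p(labels | words) = exp(score(words, labels)) / Z(words)."""
    return math.exp(sequence_score(words, labels) - log_partition(words))


words = ["Alice", "Smith", "teaches"]
labelings = list(product(LABELS, repeat=len(words)))
best = max(labelings, key=lambda y: conditional_probability(words, y))
total = sum(conditional_probability(words, y) for y in labelings)
```

Summing the conditional probabilities over all labelings gives 1 up to rounding, which is a convenient sanity check on the forward computation.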

So-called structured output has very recently become a hot research topic in machine learning. It aims at extending the classical categorization task, which consists in associating one or several labels with each input example, in order to handle structured output labels such as trees. For instance, consider the task of syntactic parsing, which consists in finding the syntactic tree of a sentence. One classical way is to find the tree that maximizes the probability of the sentence, where the tree is usually modeled by a pCFG. From the structured output point of view, this task is considered a classical classification task, with the difference that the label of the sentence is not a ``simple'' discrete value but its syntactic tree: the set of possible labels is the set of all possible syntactic trees. Given a learning set of (tree, sentence) pairs, the structured output classification task consists in finding the tree label of a new sentence.

Application Domains

Introduction

Keywords: Web intelligence, data integration, semantic integration, peer data management systems, semantic Web, document processing.

XML transformations are basic to data integration: HTML to XML transformations are useful for information extraction from the Web; XML to XML transformations are useful for data exchange between Web services, between peers, or between databases. Doan and Halevy survey novel integration tasks that appear with the Semantic Web and the usage of ontologies. Therefore, the semi-automatic generation of XML transformations is a challenge in the database community and in the semantic Web community.

Also, XML transformations are useful for document processing. For instance, there is a need to design transformations from documents organized according to a visual format (HTML, DOC, PDF) into documents organized according to a semantic format (XML according to a DTD or a schema). The semi-automatic design of such transformations is a very challenging objective.

A Web Service for Information Extraction

Participants: Aurélien Lemay (correspondent), Mathieu Keith, Patrick Marty, Fabien Torre.

Keywords: wrapper induction, monadic queries, n-ary queries, Web service.

A Web service for information extraction is currently under development. The Web service will be included in a platform for the semantic Web developed by all partners of the Webcontent project. Our Web service will include wrapper induction programs for monadic queries and n-ary queries. These programs correspond to the Squirrel and PAF prototypes, which were described in the previous MOSTRARE reports. The construction of a wrapper induction program will be interactive: the user can interact with the program by providing information to be extracted or by correcting wrong extractions.

TreeCRF: conditional random fields for trees

Participants: Florent Jousse (correspondent), Missi Tran, Marc Tommasi.

Keywords: conditional random fields, XML, trees, tree labeling.

TreeCRF is a stochastic system for labeling the element, attribute, and text nodes of XML trees. It is freely available as a Java library at http://treecrf.gforge.inria.fr/. It automatically generates features from pairs (XML input tree, its labeling), which define the model. The library provides efficient implementations of the inference and training algorithms. After training, labelings of new XML input trees can be computed.

New Results

Modeling XML document transformations

Node Selection Queries

Participants: Sophie Tison (correspondent), Emmanuel Filiot, Joachim Niehren, Jean-Marc Talbot, Olivier Gauwin, Anne-Cécile Caron.

Keywords: database theory, semi-structured documents, XPath, XQuery, logic, automata.

XQuery is the W3C standard for defining tree transformations. Each XQuery expression defines a composition of basic queries. Basic queries, given by FLWR expressions, return the result of an n-ary node selecting query in some output tree. They rely on path expressions, storing the selected nodes of the n-tuples in variables. The expressions for selecting n-tuples of nodes have very recently been pushed down to XPath 2.0, which is a proper sublanguage of XQuery 1.0.
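The two ingredients of such a basic query, selecting n-tuples of nodes through variables bound by path expressions and then placing the result in an output tree, can be imitated in Python with the standard `xml.etree.ElementTree` module. This is only an analogy under stated assumptions: the document, the element names and the helper functions are hypothetical, and ElementTree supports a small subset of XPath rather than XQuery itself.

```python
import xml.etree.ElementTree as ET

# Hypothetical input document.
DOC = """
<library>
  <book><title>T1</title><author>Ann</author><author>Bob</author></book>
  <book><title>T2</title><author>Cy</author></book>
</library>
"""


def title_author_pairs(root):
    """Binary query: all (title, author) node pairs that share a book.

    Roughly the selection part of
        for $b in //book, $t in $b/title, $a in $b/author return ...
    """
    return [
        (t, a)
        for b in root.iter("book")
        for t in b.findall("title")
        for a in b.findall("author")
    ]


def to_output_tree(pairs):
    """Return step: place the selected pairs in a new output tree."""
    result = ET.Element("pairs")
    for t, a in pairs:
        ET.SubElement(result, "pair", title=t.text, author=a.text)
    return result


root = ET.fromstring(DOC.strip())
pairs = title_author_pairs(root)
output = to_output_tree(pairs)
```

Each loop variable plays the role of an XQuery variable bound to a selected node, and the list of pairs is the answer set of the binary query before it is wrapped into the output tree.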

Variables in XPath 2.0 are fundamental for selecting n-tuples of nodes in trees. The navigational core of XPath 2.0 is known to capture first-order logic while being PSPACE-complete with respect to model checking. Filiot, Niehren, Talbot, and Tison distinguish a fragment of Core XPath 2.0 that they call the polynomial-time path language (PPL). They show that PPL remains first-order complete while enjoying polynomial-time query answering (and thus model checking).

Monadic second-order (MSO) logic is more expressive than FO and thus than XPath 2.0. The famous theorem of Thatcher and Wright (1968) states that tree automata can express the same queries as MSO. The original theorem holds for ranked trees, but has meanwhile been lifted to unranked trees as in XML. It is also well known that n-ary queries represented by deterministic tree automata can be answered in polynomial time.
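To make the automata side concrete, here is a minimal Python sketch of a bottom-up deterministic tree automaton on ranked trees. The automaton itself (two states, Boolean formulas as trees) is an assumption chosen for the example, and the sketch only decides acceptance; answering queries additionally requires relating states to selected nodes, which is not shown.

```python
from itertools import product

STATES = ("q0", "q1")  # q1: the subtree evaluates to true, q0: to false
FINAL = {"q1"}

# Transition table (symbol, states of the children) -> state.
DELTA = {("true", ()): "q1", ("false", ()): "q0"}
for a, b in product(STATES, repeat=2):
    DELTA[("and", (a, b))] = "q1" if (a, b) == ("q1", "q1") else "q0"
    DELTA[("or", (a, b))] = "q1" if "q1" in (a, b) else "q0"
for a in STATES:
    DELTA[("not", (a,))] = "q0" if a == "q1" else "q1"


def run(tree):
    """Return the unique state reached at the root; every node is visited once."""
    symbol, *children = tree
    return DELTA[(symbol, tuple(run(child) for child in children))]


def accepts(tree):
    return run(tree) in FINAL


# Trees are tuples (symbol, child, ...); a leaf is a 1-tuple such as ("true",).
formula = ("or", ("false",), ("and", ("true",), ("not", ("false",))))
```

Because the automaton is deterministic, each node receives exactly one state and the run takes time linear in the size of the tree.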

Martens and Niehren study minimization of XML Schema and tree automata for unranked trees. First, they study unranked tree automata that are standard in database theory, assuming bottom-up determinism and that horizontal recursion is represented by deterministic finite automata. They show that minimal automata in that class are not unique and that minimization is NP-complete. Second, they study more recent automata classes that do allow for polynomial-time minimization. Among those, they show that bottom-up deterministic stepwise tree automata (invented in the Mostrare project) yield the most succinct representations. Third, they investigate abstractions of XML schema languages. In particular, they show that the class of one-pass preorder typable schemas allows for polynomial-time minimization and unique minimal models.

Erk and Niehren study conjunctive queries in ranked trees with respect to satisfiability. They show how to express dominance constraints in the once-only nesting fragment of stratified context unification, which therefore is NP-complete.

Programming Languages

Participants: Joachim Niehren (correspondent).

Keywords: concurrency, functional programming.

Kuttler, Lhoussaine, and Niehren propose to model the dynamics of gene regulatory networks as concurrent processes in the stochastic pi calculus. As a first case study, they show how to express the control of transcription initiation at the lambda switch, a prototypical example where cooperative enhancement is crucial. This requires concurrent programming techniques that are new to systems biology, and necessitates stochastic parameters that they derive from the literature.

Niehren, Schwinghammer and Smolka introduce a new lambda calculus with futures, Lambda(fut), that models the operational semantics of concurrent statically typed functional programming languages with mixed eager and lazy threads such as Alice ML, a concurrent extension of Standard ML. Lambda(fut) is a minimalist extension of the call-by-value lambda-calculus that is sufficiently expressive to define and combine a variety of standard concurrency abstractions, such as channels, semaphores, and ports.

Machine learning for XML document transformations

Wrapper induction by grammatical inference

Participants: Aurélien Lemay (correspondent), Rémi Gilleron, Joachim Niehren, Yves Roos, Jérôme Champavère.

Keywords: tree automata, monadic queries, ordered trees, wrapper induction, grammatical inference.

Carme, Gilleron, Niehren, and Lemay investigate wrapper induction for Web information extraction by methods of grammatical inference. They consider Web documents in HTML as unranked ordered trees, and wrappers (the extraction target) as node selection queries in unranked trees. Users of a Web information extraction system are supposed to annotate example HTML documents visually, with the help of a Web browser. They may label informative nodes positively and others negatively. The task of the extraction system is then to infer the correct node selection query from the sample of annotated examples.
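The shape of this learning task (annotated trees in, node selection query out) can be illustrated with a deliberately naive Python learner. It is not the grammatical inference algorithm of the authors: it merely memorizes the root-to-node label paths of positively annotated nodes and assumes complete annotations, so that every unmarked node counts as negative. All names and the toy pages are hypothetical.

```python
# A tree is (label, [children]); a node is addressed by its child-index path.


def label_paths(tree, prefix=(), position=()):
    """Yield (position, root-to-node label path) for every node."""
    label, children = tree
    path = prefix + (label,)
    yield position, path
    for i, child in enumerate(children):
        yield from label_paths(child, path, position + (i,))


def induce(examples):
    """Learn a query as a set of label paths from (tree, marked positions) pairs.

    Fails when a positive and a negative node share the same label path, i.e.
    when this naive hypothesis class cannot separate the annotations.
    """
    positive, negative = set(), set()
    for tree, marked in examples:
        for position, path in label_paths(tree):
            (positive if position in marked else negative).add(path)
    if positive & negative:
        raise ValueError("label paths cannot separate the annotations")
    return positive


def apply_query(query, tree):
    """Select the positions whose label path belongs to the learned query."""
    return {position for position, path in label_paths(tree) if path in query}


# Hypothetical annotated page: the user marked both list items.
page = ("html", [("body", [("h1", []), ("ul", [("li", []), ("li", [])])])])
query = induce([(page, {(0, 1, 0), (0, 1, 1)})])

# The learned query transfers to an unseen page with three items.
new_page = ("html", [("body", [("ul", [("li", []), ("li", []), ("li", [])])])])
selected = apply_query(query, new_page)
```

Queries of this kind ignore sibling order and context, which is one reason why richer query classes such as those defined by tree automata are of interest.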

Carme, Gilleron, Lemay, and Niehren turn their induction algorithm for monadic queries into a visually interactive learning process that can also deal with documents carrying just a few annotations (complete annotations are no longer required). Experiments on realistic Web documents confirm excellent quality with very few user interactions (annotations and corrections) during wrapper induction.

Lemay, Niehren and Gilleron consider n-ary queries. They propose an induction algorithm based on grammatical inference techniques. Preliminary experimental results are quite promising. Nevertheless, the work will be pursued to introduce pruning techniques and heuristics, in order to obtain an even more efficient system and to allow the interactive design of n-ary wrappers.

Latteux, Lemay, Roos and Terlutte study learning of finite automata from positive examples. They consider residual finite state automata (RFSA), which are non-deterministic automata that share some properties with DFA (in particular, DFA are RFSA, and RFSA can be much smaller). They introduce the class of biRFSA, which are RFSA whose reverse is also an RFSA. This class is not learnable in general, but they identify two non-trivial subclasses that are learnable, the second one in polynomial time.

Statistical wrapper induction

Participants: Patrick Marty, Rémi Gilleron, Marc Tommasi, Fabien Torre (correspondent).

Keywords: supervised classification, attribute-value representation, wrapper induction, textual data, HTML data.

Gilleron, Marty, Tommasi, and Torre approach wrapper induction by statistical machine learning techniques within Marty's PhD project. They have defined a system, PAF, for extracting n-ary queries in tree-structured documents. The system is based on combination techniques. They have extended PAF to an interactive system that allows n-ary queries over HTML Web pages to be defined with very few user interactions. It is worth noting that the system can be applied regardless of how the target n-ary relation is organized in the input Web page.

Statistical clustering

Participants: Fabien Torre (correspondent), Isabelle Tellier, Laurent Candillier.

Keywords: unsupervised classification, subspace clustering, Expectation-Maximization.

Laurent Candillier defended his PhD thesis in September 2006. His work has achieved two main results. First, a new subspace clustering algorithm for attribute-value databases has been defined. This algorithm has been tested on many problems, on which it has been shown to perform very well. It could also be applied to semi-structured data, after an appropriate encoding of XML data. The algorithm participated in the 2005 INEX/PASCAL challenge on document mining, where it was ranked second out of six in clustering. An adaptation of decision-tree learning algorithms applied to the same encoding for the supervised learning task, tested by the same authors, was ranked first in classification. The participants who obtained the best results in the challenge have written the chapter "Mining XML documents" of the book "Data Mining Patterns: New Methods and Applications", accepted to appear next year (the co-authors of this chapter are Laurent Candillier, Ludovic Denoyer, Patrick Gallinari, Marie-Christine Rousset, Alexandre Termier and Anne-Marie Vercoustre). The second main achievement of the work of Candillier and his co-authors is a new evaluation method for unsupervised learning. This method considers clustering as a preprocessing step for a task (for example supervised learning) that can be rigorously evaluated. Comparing how the task is performed with and without clustering as preprocessing measures the information that the clustering has brought. Many experiments have shown that this method is robust and largely domain independent.
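The evaluation idea can be sketched as follows in Python: train the same supervised learner once on the original attributes and once on the attributes enriched with the cluster identifier, and report the difference in test accuracy. Everything concrete here is an assumption made for the illustration: the weak lookup learner, the fixed threshold that stands in for a clustering algorithm, and the toy data. The actual experimental protocol of the authors is not reproduced.

```python
from collections import Counter, defaultdict


def learn_lookup(train):
    """Deliberately weak learner (assumption): majority class per value of the
    last attribute, falling back to the global majority for unseen values."""
    by_value = defaultdict(Counter)
    overall = Counter()
    for x, y in train:
        by_value[x[-1]][y] += 1
        overall[y] += 1
    default = overall.most_common(1)[0][0]
    table = {v: c.most_common(1)[0][0] for v, c in by_value.items()}
    return lambda x: table.get(x[-1], default)


def accuracy(classifier, data):
    return sum(classifier(x) == y for x, y in data) / len(data)


def cascade_gain(train, test, cluster_of, learn=learn_lookup):
    """Accuracy with the cluster id appended as an extra attribute, minus the
    accuracy obtained on the original attributes alone."""
    def enrich(data):
        return [(x + (cluster_of(x),), y) for x, y in data]

    base = accuracy(learn(train), test)
    enriched = accuracy(learn(enrich(train)), enrich(test))
    return enriched - base


# Hypothetical one-attribute data set with two well-separated groups.
train = [((1.0,), "a"), ((1.5,), "a"), ((2.0,), "a"), ((8.0,), "b"), ((9.0,), "b")]
test = [((1.2,), "a"), ((8.5,), "b")]


def threshold_clustering(x):
    """Stand-in for a real clustering: cluster 1 above 5.0, cluster 0 otherwise."""
    return int(x[0] > 5.0)


gain = cascade_gain(train, test, threshold_clustering)
```

A positive gain indicates that the clustering contributed information useful to the supervised task; on this toy data the accuracy rises from 0.5 to 1.0.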

Probabilistic XML tree labeling

Participants: Florent Jousse (correspondent), Isabelle Tellier, Marc Tommasi, Rémi Gilleron.

Keywords: conditional random fields, XML, trees, tree labeling.

Conditional random fields are graphical models defining conditional probability distributions. They have been successfully applied to labeling tasks in the case of sequences. We have defined XML Conditional Random Fields, a framework for building conditional models for labeling XML documents. We have defined efficient algorithms for inference and parameter estimation. A prototype, TreeCRF, has been implemented. We have applied XML Conditional Random Fields to tree labeling tasks in information extraction and schema matching. Experiments yield very good results.

Contracts and Grants with Industry

RNTL ATASH

Participants: Rémi Gilleron (correspondent), Florent Jousse, Aurélien Lemay, Joachim Niehren, Marc Tommasi.

Atash is a French industrial project supported by the ``Agence Nationale de la Recherche (ANR)''. It is a collaboration with the Xerox Research Center Europe (XRCE) in Grenoble and the LIP6 laboratory. The objective is the design of learning algorithms for tree transformations and their implementation for data integration of documents (PDF, HTML, DOC) in XML databases according to a target DTD. The project began in 2006. A CIFRE PhD supported by XRCE and supervised by Rémi Gilleron and Boris Chidlovskii will begin in January 2007.

RNTL Webcontent

Participants: Rémi Gilleron, Florent Jousse, Patrick Marty, Marc Tommasi, Fabien Torre (correspondent).

Webcontent is a French industrial project supported by the ``Agence Nationale de la Recherche (ANR)''. It involves academic partners and companies. The objective is to develop a platform for Web document processing and the semantic Web. We will integrate our Web service for information extraction, currently under development, into the platform and adapt our prototypes for Web information extraction. We are also involved in academic work on the semantic Web: semantic annotations, ontology inference and ontology mapping.

Others

The PhD of Laurent Candillier was supported by the company PERTINENCE in Paris.

Other Grants and Activities

French Actions

ACI TraLaLA: Transformation Languages, Logic and Application

Participants: Anne-Cécile Caron (correspondent), Emmanuel Filiot, Joachim Niehren, Yves Roos, Sophie Tison.

We are involved in the French cooperation project ``ACI masse de données – TraLaLA – XML Transformation Languages, Logic and Application'' (2004–2007). We pay particular attention to the programming languages and query languages problems. We aim to cover in a uniform way a wide spectrum of different areas, namely: programming languages (expressiveness, typing, new programming primitives, query underlying logics, logical optimization), data access (streamed data, compression, access to secondary memory storages, persistency engines), implementation (pattern matching compiling, physical optimization, subtyping verification, execution models for streamed data). The marginal budget allocated to the Mostrare project is 53 Keuros over the period 2004-2007.

Our partners are: Giuseppe Castagna (coordinator, LIENS), Luc Ségoufin (GEMO INRIA project), Silvano Dal Zilio (LIF) and Véronique Benzaken (LRI). More information about the project can be found on http://www.cduce.org/tralala.html.

ARA MDCA Marmota: Stochastic Tree Models and Stochastic Tree Transformation

Participants: Rémi Gilleron, Aurélien Lemay, Joachim Niehren, Marc Tommasi (correspondent).

We propose to study computational issues at the intersection of three domains: formal tree languages, machine learning and probabilistic models. Our study is mainly motivated by XML data manipulation: data integration on the Internet from heterogeneous and distributed sources; XML annotation and transformation; XML document classification and clustering. However, the intended fundamental results should have an important impact in many application domains. For instance, in bioinformatics and music retrieval, it is relevant to model data by using probabilistic trees. Therefore, this project is also concerned with the specific problems of these two application domains, and we will use large data sets from these areas. We will consider generative models for tree-structured data, non-generative models for tree-structured data, and models for probabilistic tree pattern matching and probabilistic tree transformations: tree pattern matching algorithms, learning pattern languages, induction of tree transformations. The coordinator of the project is M. Tommasi. Our partners are: P. Gallinari (LIP6), F. Denis (LIF), and M. Sebban (Saint Etienne). More information about the project can be found on http://www.grappa.univ-lille3.fr/marmota.

ARC Mosaique

Participants: Isabelle Tellier (correspondent).

This ARC (Common Research Action between INRIA projects) gathers several French teams working on the syntactic formalization of natural language. Some of them have developed syntactic resources, but these resources are neither comparable (because they are based on different grammatical formalisms) nor reusable by any formalism other than their own. None of them has very wide coverage. The purpose of this project is therefore to capitalize as much as possible on the efforts already made, by developing bridges between various formalisms, or by proposing a higher-level formalism able to generalize several others. The first year of the ARC has mainly been dedicated to an exhaustive presentation of the available resources and of their particularities. The use of XML formats and tree models in this context links this project with Mostrare's goals.

Dissemination

Scientific Animation

Program Committees:

S. Tison was a member of the editorial board of RAIRO - Theoretical Informatics and Applications, and a PC member of LPAR'2006, PLAN-X'2007, and FOSSACS'2007.

R. Gilleron was a PC member of EGC'2006 and EGC'2007 (French conference on knowledge discovery).

F. Torre was a PC member of CAp'2006 (French conference on machine learning).

I. Tellier was a PC member of CORIA'2007 (French conference on information retrieval), and a member of the editorial committee for the special issue ``TAL: systemes question-reponse''.

J. Niehren was a PC member of MFCS'2006, LPAR'2006, ROMAND'2006 and CSLP'2006.

Workshop Organization

Anne-Cécile Caron co-organizes EGC'2006 (French conference on knowledge discovery) and BDA'2006 (French conference on databases) in Lille.

Sophie Tison co-organizes a Workshop on Tree Automata, funded by the European Science Foundation, in Bonn (June 2006).

Invited talks

Joachim Niehren was invited to the Dagstuhl seminar on constraint satisfaction.

Rémi Gilleron, Jean-Marc Talbot and Joachim Niehren were invited for presentation at the tree automata workshop in Bonn.

French Scientific Responsibilities

S. Tison is vice-director of the LIFL (computer science department in Lille) and head of the research group STC of the LIFL. She is a member of the national evaluation committee (MSTP-DS9) for teaching and research.

R. Gilleron is a member of the scientific council for the ARA MDCA program of the ANR.

I. Tellier is a member of the CNU 27 (French evaluation committee for assistant professors in computer science).

Teaching and Scientific Diffusion

Teaching

Joachim Niehren 10 hours masters

Aurélien Lemay 192 hours bachelor and masters

Isabelle Tellier 192 hours bachelor and masters

Marc Tommasi 192 hours bachelor and masters

Fabien Torre 192 hours bachelor and masters

Anne-Cécile Caron 192 hours bachelor and masters

Yves Roos 192 hours bachelor and masters

Sophie Tison 192 hours bachelor and masters

Master lectures presented at the university of Lille 1

Logic et Modelisation: A.-C. Caron, J. Niehren, and S. Tison

Machine Learning for Information Extraction: I. Tellier (2006-07)

Master projects:

Supervision of PhD theses submitted in 2006:

I. Boneva, Logics for unranked and unordered trees, supervised by J.-M. Talbot and S. Tison.

L. Candillier, on unsupervised learning by subspace clustering, supervised by I. Tellier and F. Torre.

habilitation thesis in 2006:

M. Tommasi, Machine Learning for tree structures.

PhD committees:

R. Gilleron belonged to the committee of L. Candillier; I. Tellier belonged to the committees of L. Candillier, E. Moreau (Nantes), and R. Eyraud (Saint Etienne, reviewer); S. Tison belonged to the committees of D. Marchal, A. Muller, J. Lemesre, I. Boneva, and C. Miachon (Orsay, reviewer).

Habilitation committees: R. Gilleron belonged to the committees of M. Tommasi (Lille) and F. Yvon (Paris); S. Tison belonged to the committee of M. Tommasi (Lille).

Extraction and Implication of Path Constraints Yves André Y. Anne-Cécile Caron A.-C. Denis Debarbieux D. Yves Roos Y. Sophie Tison S. Proceedings of the 29th Symposium on Mathematical Foundations of Computer Science (MFCS'04) Lecture Notes in Computer Science 3153 Springer Verlag 2004 863-875 When Ambients Cannot be Opened Iovka Boneva I. Jean-Marc Talbot J.-M. Theoretical Computer Science 0304-3975 333 2 2005 127-169 Expressiveness of a spatial logic for trees Iovka Boneva I. Jean-Marc Talbot J.-M. Sophie Tison S. Proceedings of the 20th Annual IEEE Symposium on Logic in Computer Science (LICS'05) IEEE Comp. Soc. Press 2005 280 - 289 Cascade Evaluation of Clustering Algorithms Laurent Candillier L. Isabelle Tellier I. Fabien Torre F. Olivier Bousquet O. 17th European Conference on Machine Learning (ECML'2006) Lecture Notes in Artificial Intelligence 4212 Springer Verlag 2006 574–581 Interactive Learning of Node Selecting Tree Transducers Julien Carme J. Rémi Gilleron R. Aurélien Lemay A. Joachim Niehren J. Machine Learning 0885-6125 Appears 2007 66 1 2006 33-67 Learning from Positive and Unlabeled Examples François Denis F. Rémi Gilleron R. Fabien Letouzey F. Theoretical Computer Science 0304-3975 348 1 2005 70-83 Interactive Tuples Extraction from Semi-Structured Data Rémi Gilleron R. Patrick Marty P. Marc Tommasi M. Fabien Torre F. 2006 IEEE / WIC / ACM International Conference on Web Intelligence 2006 On the Minimization of XML Schemas and Tree Automata for Unranked Trees Wim Martens W. Joachim Niehren J. Journal of Computer and System Science 0022-0000 In press 2007 2006 Logics for unranked and unordered trees and their use for querying semistructured data Iovka Boneva I. Ph. D. Thesis Universite Lille 1 2006 Contextualisation, Visualisation et Evaluation en Apprentissage Non Supervise Laurent Candillier L. Ph. D. Thesis Universite Charles de Gaulle, Lille 3 2006 Habilitation thesis: Machine Learning for Tree Structures Marc Tommasi M. Ph. D. 
Thesis Universite Charles de Gaulle, Lille 3 2006 Interactive Learning of Node Selecting Tree Transducers Julien Carme J. RÃ©mi Gilleron R. AurÃ©lien Lemay A. Joachim Niehren J. Machine Learning 0885-6125 To appear 66 1 2007 33-67 Dominance Constraints in Stratified Context Unification Katrin Erk K. Joachim Niehren J. Information Processing Letters 0020-0190 in press 2007 CÃ©line Kuttler C. Joachim Niehren J. Gene Regulation in the Pi Calculus: Simulating Cooperativity at the Lambda Switch Transactions on Computational Systems Biology 4230 VII 2006 24-55 Identification of biRFSA languages Michel Latteux M. AurÃ©lien Lemay A. Yves Roos Y. Alain Terlutte A. Theoretical Computer Science 0304-3975 356 1-2 2006 212-223 Minimal NFA and biRFSA Languages Michel Latteux M. Yves Roos Y. Alain Terlutte A. RAIRO - Theoretical Informatics and Applications 0988-3754 To appear 2007 On the Minimization of XML Schemas and Tree Automata for Unranked Trees Wim Martens W. Joachim Niehren J. Journal of Computer and System Science 0022-0000 in press 2007 A Concurrent Lambda Calculus with Futures Joachim Niehren J. Jan Schwinghammer J. Gert Smolka G. Theoretical Computer Science 0304-3975 364 3 2006 338-356 Learning Recursive Automata from Positive Examples Isabelle Tellier I. Revue d'Intelligence Artificielle 0992-499X New Methods in Machine Learning 20/2006 2006 775-804 Cascade Evaluation of Clustering Algorithms Laurent Candillier L. Isabelle Tellier I. Fabien Torre F. Olivier Bousquet O. 17th European Conference on Machine Learning (ECML'2006) Lecture Notes in Artificial Intelligence 4212 Springer Verlag 2006 574–581 Evaluation en Cascade d'Algorithmes de Clustering Laurent Candillier L. Isabelle Tellier I. Fabien Torre F. Olivier Bousquet O. 8eme Conference francophone sur l'Apprentissage automatique (CAp'2006) 2006 109–124 SuSE: Subspace Selection embedded in an EM algorithm Laurent Candillier L. Isabelle Tellier I. Fabien Torre F. Olivier Bousquet O. 
8ème Conférence francophone sur l'Apprentissage automatique (CAp'2006), 2006, 331–345.
Composing Monadic Queries in Trees. Emmanuel Filiot, Joachim Niehren, Jean-Marc Talbot, Sophie Tison. International PLAN-X Workshop, Basic Research in Computer Science, 2006.
Polynomial Time Fragments of XPath with Variables. Emmanuel Filiot, Joachim Niehren, Jean-Marc Talbot, Sophie Tison. Submitted, 2007.
Extraction de relations dans les documents Web. Rémi Gilleron, Patrick Marty, Marc Tommasi, Fabien Torre. Revue RNTI - Actes de EGC'06, 2006, 415–420.
Interactive Tuples Extraction from Semi-Structured Data. Rémi Gilleron, Patrick Marty, Marc Tommasi, Fabien Torre. 2006 IEEE / WIC / ACM International Conference on Web Intelligence, 2006.
Learning n-ary tree-pattern queries for web data integration. Benjamin Habegger, Denis Debarbieux. 5th International Conference on Ontologies, Databases, and Applications of Semantics, 2006.
Champs conditionnels aléatoires pour l'annotation d'arbres. Florent Jousse, Rémi Gilleron, Isabelle Tellier, Marc Tommasi. 8ème Conférence francophone sur l'Apprentissage automatique (CAp'2006), 2006, 171–186.
Conditional Random Fields for XML Trees. Florent Jousse, Rémi Gilleron, Isabelle Tellier, Marc Tommasi. ECML Workshop on Mining and Learning in Graphs, 2006.
A Stochastic Pi Calculus for Concurrent Objects. Céline Kuttler, Cédric Lhoussaine, Joachim Niehren. 1st International Workshop on Probabilistic Automata and Logics, 2006.
Identification des langages biAFER. Michel Latteux, Aurélien Lemay, Yves Roos, Alain Terlutte. 8ème Conférence francophone sur l'Apprentissage automatique (CAp'2006), 2006, 33–48.
Learning n-ary Node Selecting Tree Transducers from Completely Annotated Examples. Aurélien Lemay, Joachim Niehren, Rémi Gilleron.
International Colloquium on Grammatical Inference, Lecture Notes in Artificial Intelligence 4201, Springer Verlag, 2006, 253-267.
CDuce: an XML-centric general-purpose language. Véronique Benzaken, Giuseppe Castagna, Alain Frisch. ACM SIGPLAN Notices, ISSN 0362-1340, 38(9), 2003, 51–63.
A Full Pattern-Based Paradigm for XML Query Processing. Véronique Benzaken, Giuseppe Castagna, Cédric Miachon. PADL, Lecture Notes in Computer Science, Springer Verlag, 2005, 235-252.
Patterns and Types for Querying XML. Giuseppe Castagna. 10th International Symposium on Database Programming Languages, Lecture Notes in Computer Science 3774, Springer Verlag, 2005, 1-26.
Wrapping Web Information Providers by Transducer Induction. Boris Chidlovskii. Proc. European Conference on Machine Learning, Lecture Notes in Artificial Intelligence 2167, 2001, 61–73.
Supervised learning for the legacy document conversion. Boris Chidlovskii, Jérôme Fuselier. Proceedings of the 2004 ACM Symposium on Document Engineering, 2004, 220-228.
A probabilistic learning method for XML annotation of documents. Boris Chidlovskii, Jérôme Fuselier. Proceedings of the 19th International Joint Conference on Artificial Intelligence (IJCAI'05), 2005, 1016-1021.
Gene prediction with conditional random fields. Aron Culotta, David Kulp, Andrew McCallum. Technical report UM-CS-2005-028, University of Massachusetts, Amherst, April 2005. http://www.cs.umass.edu/~culotta/pubs/culotta05gene.pdf
Semantic Integration Research in the Database Community: A Brief Survey. AnHai Doan, Alon Y. Halevy. AI Magazine, ISSN 0738-4602, 26(1), 2005, 83-94.
Parameter Estimation for Probabilistic Finite-State Transducers. Jason Eisner. Proceedings of the Annual Meeting of the Association for Computational Linguistics, 2002, 1–8.
Bottom-up and top-down tree transformations: a comparison. J. Engelfriet. Mathematical Systems Theory, 9, 1975, 198–231.
Macro tree transducers, attribute grammars, and MSO definable tree translations. Joost Engelfriet,
Sebastian Maneth. Information and Computation, ISSN 0890-5401, 154(1), 1999, 34–91.
Regular Object Types. Vladimir Gapeyev, Benjamin C. Pierce. European Conference on Object-Oriented Programming, 2003. http://www.cis.upenn.edu/~bcpierce/papers/regobj.pdf
Training tree transducers. J. Graehl, K. Knight. NAACL-HLT, 2004, 105-112.
Regular expression pattern matching for XML. Haruo Hosoya, Benjamin Pierce. Journal of Functional Programming, ISSN 0956-7968, 13(6), 2003, 961-1004.
An overview of probabilistic tree transducers for natural language processing. K. Knight, J. Graehl. Sixth International Conference on Intelligent Text Processing, 2005, 1-24.
On the complexity of nonrecursive XQuery and functional query languages on complex values. Christoph Koch. 24th SIGMOD-SIGACT-SIGART Symposium on Principles of Database Systems, ACM Press, 2005, 84–97.
Conditional Random Fields: Probabilistic Models for Segmenting and Labeling Sequence Data. John D. Lafferty, Andrew McCallum, Fernando C. N. Pereira. Proceedings of the Eighteenth International Conference on Machine Learning (ICML), 2001, 282-289.
Type-based Optimization for Regular Patterns. Michael Y. Levin, Benjamin C. Pierce. 10th International Symposium on Database Programming Languages, Lecture Notes in Computer Science 3774, 2005.
XML type checking with macro tree transducers. Sebastian Maneth, Alexandru Berlea, Thomas Perst, Helmut Seidl. 24th ACM Symposium on Principles of Database Systems, 2005, 283–294.
Foundations of Statistical Natural Language Processing. C. Manning, H. Schütze. MIT Press, Cambridge, 1999.
Typechecking Top-Down Uniform Unranked Tree Transducers. Wim Martens, Frank Neven. 9th International Conference on Database Theory, London, UK, Lecture Notes in Computer Science 2572, Springer Verlag, 2003, 64–78.
Composable XML transformations with tree transducers. Hisashi Miyashita, Makoto Murata. 2005.
Learning Subsequential Transducers for Pattern Recognition and Interpretation Tasks. J. Oncina, P. Garcia, E. Vidal. IEEE Trans. Patt. Anal. and Mach. Intell., 15, 1993, 448-458.
Learning Structured Prediction Models: A Large Margin Approach. B. Taskar, V. Chatalbashev, D. Koller, C. Guestrin. Proceedings of the Twenty Second International Conference on Machine Learning (ICML'05), 2005, 896–903.
Large Margin Methods for Structured and Interdependent Output Variables. Ioannis Tsochantaridis, Thorsten Joachims, Thomas Hofmann, Yasemin Altun. Journal of Machine Learning Research, ISSN 1532-4435, 6, 2005, 1453–1484.
Deciding Well-Definedness of XQuery Fragments. Stijn Vansummeren. Proceedings of the 24th ACM SIGMOD-SIGACT-SIGART Symposium on Principles of Database Systems, 2005, 37–48.