MOSTRARE is a joint project with the LIFL - UMR 8022 (CNRS, Lille 1 and Lille 3 universities).
The objective of MOSTRARE is to develop adaptive document processing methods for XML-based information systems. Adaptiveness matters when documents evolve frequently, as on the Web. The particularity of MOSTRARE is that we develop semi-automatic or automatic information extraction approaches that can fully benefit from the available tree structure of XML documents.
Information extraction is an instance of document transformation. In order to exploit the tree structure of XML documents, our goal is to investigate specification languages for tree transformations. These are based on approaches from database theory (such as the W3C standards XQuery and XSLT), automata, logic, and programming languages. We wish to define stochastic models of tree transformations, and to develop automatic or semi-automatic procedures for inferring them. Once available, we want to integrate these learning algorithms into innovative information extraction systems, semantic Web platforms, and document processing engines.
The following two paragraphs summarize our two main research objectives:
We wish to extend the studies of modeling languages for node selection queries in tree structured documents that we contributed in the first phase of Mostrare. The new subject of interest in the second phase is XML document transformations, and more generally tree transformations, which generalize node selection queries.
We wish to extend our study of machine learning techniques for information extraction. One new goal is to develop learning algorithms that can induce XML document transformations, based on their tree structure. Another new goal is to explore stochastic machine learning techniques that can deal with uncertainty in document sources.
XML document transformations can be defined in the W3C standard query languages XQuery or XSLT. Programming XML transformations in these languages is often difficult and error prone, even if the schemata of input and output documents are known. Advanced programming experience and considerable programming time may be necessary, neither of which is available in Web services or similar scenarios.
To illustrate the main difficulty of programming XML transformations, consider the example of PDF to XML conversion, under the assumption that the DTD of the output is given. In a first step, one can use an existing PDF to HTML converter. In a second step, it remains to convert HTML into XML. The DTD of the HTML input document, however, will be either unknown or uninformative. Furthermore, the input will contain errors that have to be accounted for.
Alternative programming languages for defining XML transformations have been proposed by the programming language community, for instance XDuce, Xtatic, and CDuce. The type systems of these languages simplify the programming tasks considerably. Still, they do not remove the general difficulty of programming XML transformations manually.
Languages for defining node selection queries arise as sublanguages of all XML transformation languages. The W3C standards use XPath for defining monadic queries, while XDuce and CDuce rely on regular queries defined by regular patterns, which are equivalent to tree automata. Indeed, it is natural to look at node selection as a simple form of tree transformation. Monadic node selection queries correspond to deterministic transformations that annotate all selected nodes positively and all others negatively. N-ary node selection queries become non-deterministic transformations, yielding trees annotated by Boolean vectors.
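The view of queries as annotating transformations can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the sample document, the helper names `annotate_monadic`, `annotate_nary` and `title_author_pairs`, and the use of the standard `xml.etree.ElementTree` module are all hypothetical choices made here, not part of the project's software.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample document, used only for illustration.
DOC = ("<lib><book><title>A</title><author>X</author></book>"
       "<book><title>B</title><author>Y</author></book></lib>")

def annotate_monadic(root, path):
    """Monadic query as a deterministic relabeling: every node of the tree
    is annotated True (selected) or False (not selected)."""
    selected = {id(n) for n in root.findall(path)}
    return [(n.tag, id(n) in selected) for n in root.iter()]

def annotate_nary(root, tuple_query):
    """N-ary query as a non-deterministic transformation: one annotated copy
    of the tree per selected tuple, each node carrying a Boolean vector."""
    nodes = list(root.iter())
    results = []
    for tup in tuple_query(root):
        ids = [id(n) for n in tup]
        results.append([(n.tag, tuple(id(n) == i for i in ids)) for n in nodes])
    return results

def title_author_pairs(root):
    """Hypothetical binary query: (title, author) pairs of the same book."""
    return [(b.find("title"), b.find("author")) for b in root.findall("book")]

root = ET.fromstring(DOC)
monadic = annotate_monadic(root, ".//title")      # one annotated tree
binary = annotate_nary(root, title_author_pairs)  # one annotated tree per pair
```

Nodes are listed in document order. The monadic query yields a single annotated tree, while the binary query yields one annotated tree per selected pair, which is the sense in which n-ary selection is non-deterministic.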
After extensive studies of node selection queries in trees (in XPath and many other languages), the XML community has more recently started to formally investigate XML tree transformations. The expressiveness and complexity of XQuery have been studied. Type preservation is another problem, i.e., whether all trees of the input type get transformed into trees of the output type or, conversely, whether the inverse image of the output type is contained in the input type.
The automata community usually approaches tree transformations through tree transducers, i.e., tree automata producing output structure. Macro tree transducers, for instance, have recently been proposed for defining XML transformations. From the viewpoint of logic, tree transducers have been studied for MSO definability.
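To make the notion of a tree automaton producing output structure concrete, here is a minimal sketch of a deterministic top-down transducer on unranked trees. It is far simpler than the macro tree transducers mentioned above, and the rule table, the state name `q` and the HTML-like labels are assumptions invented for this example.

```python
# Trees are (label, [children]) pairs.
# Rules map (state, input_label) to (output_label, next_state).
# An output_label of None means "emit no node here": the translated
# children are spliced directly into the parent's output.
RULES = {
    ("q", "html"): ("doc", "q"),
    ("q", "body"): (None, "q"),
    ("q", "h1"): ("title", "q"),
    ("q", "p"): ("para", "q"),
}

def transduce(state, tree):
    """Return the list of output trees produced for `tree` in `state`."""
    label, children = tree
    out_label, next_state = RULES[(state, label)]
    out_children = [t for c in children for t in transduce(next_state, c)]
    if out_label is None:
        return out_children
    return [(out_label, out_children)]

html = ("html", [("body", [("h1", []), ("p", []), ("p", [])])])
result = transduce("q", html)
```

In this sketch the `body` node is deleted and its children are attached to `doc`, so the output tree has a different shape from the input, which a pure relabeling could not achieve.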
Automatic or semi-automatic tools for inferring tree transformations are needed for information extraction. Annotated examples may support the learning process. The learning target will be models of XMLtree transformations specified in some of the languages discussed above.
Grammatical inference should be useful for inferring tree transducers that represent XML transformations. So far, only very basic tree transducers have been shown to be learnable, in previous work of the Mostrare project. These are node selecting tree transducers (NSTTs), which preserve the structure of trees while relabeling their nodes deterministically. Other previous work on grammatical inference for transducers remains limited to the case of strings. The case of trees remains to be explored.
Stochastic tree transducers have been studied in the context of natural language processing. A set of pairs of input and output trees defines a relation that can be represented by a 2-tape automaton called a stochastic finite-state transducer (SFST). A major problem consists in estimating the parameters of such a transducer. SFST training algorithms are lacking so far.
Stochastic machine learning is most appropriate for dealing with uncertain or noisy data. It is generally useful for information extraction from textual data, given that current text understanding tools are still very limited. XML transformations with noisy input data typically arise in data integration tasks, for instance when converting PDF into XML.
Probabilistic context-free grammars (pCFGs) are used in the context of PDF to XML conversion. Such methods infer a CFG as a generative model, to which probabilities are added in a second step. Such two-step approaches compete with one-step approaches that estimate conditional probabilities directly.
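As a small illustration of the generative view, the probability that a pCFG assigns to a derivation tree is the product of the probabilities of the rules used in it. The toy grammar, its probabilities and the function name `tree_probability` below are hypothetical and serve only to show the computation.

```python
from math import prod

# Hypothetical toy grammar: (left-hand side, right-hand side) -> probability.
# Probabilities of rules sharing a left-hand side sum to 1.
PCFG = {
    ("Doc", ("Title", "Body")): 1.0,
    ("Body", ("Para", "Body")): 0.6,
    ("Body", ("Para",)): 0.4,
    ("Title", ("text",)): 1.0,
    ("Para", ("text",)): 1.0,
}

def tree_probability(tree):
    """Probability of a derivation tree: product of its rule probabilities.
    Trees are (label, [children]) pairs; leaves are terminals."""
    label, children = tree
    if not children:  # terminal symbol, no rule applied
        return 1.0
    rhs = tuple(child[0] for child in children)
    return PCFG[(label, rhs)] * prod(tree_probability(c) for c in children)

text = ("text", [])
tree = ("Doc", [
    ("Title", [text]),
    ("Body", [("Para", [text]), ("Body", [("Para", [text])])]),
])
p = tree_probability(tree)  # 1.0 * 0.6 * 0.4, about 0.24
```

A one-step discriminative approach would instead estimate the probability of an output structure given the input directly, without first fixing a grammar.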
A popular non-generative model for information extraction is conditional random fields (CRFs). One main advantage of CRFs is that they take into account long-distance dependencies in the observed data. CRFs have also been applied in many other settings, such as gene prediction in bioinformatics. Essentially, CRFs have been used to model sequences of words. CRFs assume a graph that encodes the conditional independence of random variables, together with features that are used to estimate conditional probabilities. So far, CRFs have mostly been used for sequences, whose internal graph structure is a linear chain.
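The following sketch shows what a linear-chain CRF computes: a conditional probability p(y | x) obtained by exponentiating a weighted sum of features over nodes and label pairs, then normalizing over all label sequences. The label set, the two feature templates and the weights are assumptions chosen for the example. The normalization is done by brute-force enumeration, which is exponential in the sequence length; practical implementations use dynamic programming instead.

```python
from itertools import product
from math import exp

LABELS = ("T", "O")  # hypothetical label set: Title / Other

def score(x, y, w):
    """Unnormalized log-score: weighted observation and label-pair features."""
    s = 0.0
    for t, (tok, lab) in enumerate(zip(x, y)):
        s += w.get(("emit", tok.istitle(), lab), 0.0)  # feature on (x_t, y_t)
        if t > 0:
            s += w.get(("trans", y[t - 1], lab), 0.0)  # feature on (y_{t-1}, y_t)
    return s

def conditional(x, y, w):
    """p(y | x) = exp(score(x, y)) / Z(x), with Z(x) computed by enumeration."""
    z = sum(exp(score(x, cand, w)) for cand in product(LABELS, repeat=len(x)))
    return exp(score(x, y, w)) / z

# Hypothetical weights: capitalized tokens tend to be titles,
# and a title tends to be followed by other content.
w = {("emit", True, "T"): 2.0, ("trans", "T", "O"): 1.0}
x = ["Intro", "text"]
best = max(product(LABELS, repeat=len(x)), key=lambda y: conditional(x, y, w))
```

Because the model is conditional, the features may inspect the whole observation `x` freely, which is what allows long-distance dependencies in the observed data to be used.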
So-called structured output has very recently become a hot research topic in machine learning. It aims at extending the classical categorization task, which consists in associating one or several labels with each input example, in order to handle structured output labels such as trees. For instance, consider the task of syntactic parsing, which consists in finding the syntactic tree of a sentence. One classical way is to find the most probable tree for the sentence, where the tree distribution is usually modeled by a pCFG. From the structured output point of view, this task is considered as a classical classification task, with the difference that the label of the sentence is not a ``simple'' discrete value but its syntactic tree: the set of possible labels is the set of all possible syntactic trees. Given a training set of (tree, sentence) pairs, the structured output classification task consists in finding the tree label of a new sentence.
XML transformations are basic to data integration: HTML to XML transformations are useful for information extraction from the Web; XML to XML transformations are useful for data exchange between Web services, between peers, or between databases. Doan and Halevy survey novel integration tasks that appear with the Semantic Web and the usage of ontologies. Therefore, the semi-automatic generation of XML transformations is a challenge in the database community and in the semantic Web community.
XML transformations are also useful for document processing. For instance, there is a need for transformations from documents organized according to a visual format (HTML, DOC, PDF) into documents organized according to a semantic format (XML conforming to a DTD or a schema). The semi-automatic design of such transformations is obviously a very challenging objective.
A Web service for information extraction is currently under development. The Web service will be included in a platform for the semantic Web which was developed by all partners of the Webcontent project. Our Web service will include wrapper induction programs for monadic queries and n-ary queries. These programs correspond to the Squirrel prototype and the PAF prototype, which were described in the previous MOSTRARE reports. The construction of a wrapper induction program will be interactive: the user can interact with the program by indicating the information to be extracted or by correcting wrong extractions.
TreeCRF is a stochastic system for labeling element, attribute and text nodes of XML trees. It is freely available as a Java library.
XQuery is the W3C standard for defining tree transformations. Each XQuery expression defines a composition of basic queries. Basic queries, given by FLWR expressions, return the result of an n-ary node selection query in some output tree. They rely on path expressions, storing the selected nodes of the n-tuples in variables. The expressions for selecting n-tuples of nodes have very recently been pushed down to XPath 2.0, which is a proper sublanguage of XQuery 1.0.
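The structure of such a basic query can be imitated in Python: nested loops bind variables to nodes reached by path expressions, and each selected tuple is written into an output tree. The sketch below is a rough analogue only. The document shape, the element names and the function `flwr_pairs` are assumptions for illustration, and it copies text content rather than whole nodes.

```python
import xml.etree.ElementTree as ET

def flwr_pairs(root):
    """Rough Python analogue of the illustrative XQuery expression
         <result>{ for $b in //book, $t in $b/title, $a in $b/author
                   return <pair>{$t}{$a}</pair> }</result>
    Each binding of ($t, $a) is one tuple of a binary node selection query."""
    result = ET.Element("result")
    for b in root.iter("book"):            # for $b in //book
        for t in b.findall("title"):       # $t in $b/title
            for a in b.findall("author"):  # $a in $b/author
                pair = ET.SubElement(result, "pair")
                ET.SubElement(pair, "title").text = t.text
                ET.SubElement(pair, "author").text = a.text
    return result

doc = ET.fromstring(
    "<lib><book><title>A</title><author>X</author><author>Z</author></book>"
    "<book><title>B</title><author>Y</author></book></lib>")
out = flwr_pairs(doc)
pairs = [(p.find("title").text, p.find("author").text) for p in out.findall("pair")]
```

The selection of pairs and the construction of the output tree are clearly separated here, which mirrors the decomposition of a transformation into an n-ary query plus an output template.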
Variables in XPath 2.0 are fundamental for selecting n-tuples of nodes in trees. The navigational core of XPath 2.0 is known to capture first-order logic while being PSPACE-complete with respect to model checking. Filiot, Niehren, Talbot, and Tison distinguish a fragment of Core XPath 2.0 that we call the polynomial-time path language (PPL). They show that PPL remains first-order complete while enjoying polynomial-time query answering (and thus model checking).
Monadic second-order (MSO) logic is more expressive than FO and thus than XPath 2.0. The famous theorem of Thatcher and Wright (1968) states that tree automata can express the same queries as MSO. The original theorem holds with respect to ranked trees, but has since been lifted to unranked trees as in XML. It is also well known that n-ary queries represented by deterministic tree automata can be answered in polynomial time.
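For readers unfamiliar with tree automata, here is a minimal sketch of a bottom-up deterministic tree automaton on ranked trees. The example language (Boolean expressions that evaluate to true) and all names are chosen for illustration only.

```python
from itertools import product

# States are the two Boolean values. The transition table maps a node label
# together with the states of its children to the state of the node.
STATES = (False, True)
DELTA = {("true",): True, ("false",): False}
for a, b in product(STATES, repeat=2):
    DELTA[("and", a, b)] = a and b
    DELTA[("or", a, b)] = a or b
FINAL = {True}

def run(tree):
    """Bottom-up run: the state of a node is determined by its label and
    the states already assigned to its children."""
    label, children = tree
    return DELTA[(label, *(run(c) for c in children))]

def accepts(tree):
    """A tree is accepted if the run reaches a final state at the root."""
    return run(tree) in FINAL

t1 = ("or", [("and", [("true", []), ("false", [])]), ("true", [])])
t2 = ("and", [("true", []), ("false", [])])
```

Here `accepts(t1)` holds and `accepts(t2)` does not. Determinism means that each tree has exactly one run, computed in a single bottom-up pass.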
Martens and Niehren study minimization of XML Schema and tree automata for unranked trees. First, they study unranked tree automata that are standard in database theory, assuming bottom-up determinism and that horizontal recursion is represented by deterministic finite automata. They show that minimal automata in that class are not unique and that minimization is NP-complete. Second, they study more recent automata classes that do allow for polynomial-time minimization. Among those, they show that bottom-up deterministic stepwise tree automata (invented in the Mostrare project) yield the most succinct representations. Third, they investigate abstractions of XML schema languages. In particular, they show that the class of one-pass preorder typable schemas allows for polynomial-time minimization and unique minimal models.
Erk and Niehren study conjunctive queries in ranked trees with respect to satisfiability. They show how to express dominance constraints in the once-only nesting fragment of stratified context unification, which therefore is NP-complete.
Kuttler, Lhoussaine, and Niehren propose to model the dynamics of gene regulatory networks as concurrent processes in the stochastic pi calculus. As a first case study, they show how to express the control of transcription initiation at the lambda switch, a prototypical example where cooperative enhancement is crucial. This requires concurrent programming techniques that are new to systems biology, and necessitates stochastic parameters that they derive from the literature.
Niehren, Schwinghammer and Smolka introduce a new lambda calculus with futures, Lambda(fut), that models the operational semantics of concurrent statically typed functional programming languages with mixed eager and lazy threads such as Alice ML, a concurrent extension of Standard ML. Lambda(fut) is a minimalist extension of the call-by-value lambda-calculus that is sufficiently expressive to define and combine a variety of standard concurrency abstractions, such as channels, semaphores, and ports.
Carme, Gilleron, Niehren, and Lemay investigate wrapper induction for Web information extraction by methods of grammatical inference. They consider Web documents in HTML as unranked ordered trees, and wrappers (the extraction target) as node selection queries in unranked trees. Users of a Web information extraction system are supposed to annotate example HTML documents visually, with the help of a Web browser. They may label informative nodes positively and others negatively. The task of the extraction system is then to infer the correct node selection query from the sample of annotated examples.
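The learning setting (annotated trees in, node selection query out) can be illustrated with a deliberately naive learner. The sketch below is not the grammatical inference algorithm of the authors: it merely memorizes root-to-node tag paths that were only ever seen on positively annotated nodes. All function names and sample documents are hypothetical.

```python
import xml.etree.ElementTree as ET

def tag_paths(root):
    """Yield (node, path) in document order, where path is the tuple of
    tags from the root down to the node."""
    stack = [(root, (root.tag,))]
    while stack:
        node, path = stack.pop()
        yield node, path
        for child in reversed(list(node)):
            stack.append((child, path + (child.tag,)))

def learn(examples):
    """examples: list of (root, positively_annotated_nodes); every other
    node of each tree counts as a negative annotation.
    Returns the set of tag paths seen only on positive nodes."""
    positive, negative = set(), set()
    for root, pos_nodes in examples:
        pos_ids = {id(n) for n in pos_nodes}
        for node, path in tag_paths(root):
            (positive if id(node) in pos_ids else negative).add(path)
    return positive - negative

def select(root, query):
    """Apply the learned query: select all nodes whose tag path is in it."""
    return [n for n, p in tag_paths(root) if p in query]

train = ET.fromstring("<html><body><h1>T</h1><p>a</p></body></html>")
query = learn([(train, [train.find("body/h1")])])
test_doc = ET.fromstring("<html><body><p>x</p><h1>U</h1><h1>V</h1></body></html>")
selected = [n.text for n in select(test_doc, query)]
```

Such a path-based hypothesis space is much weaker than queries defined by tree automata, since it cannot express conditions on siblings or descendants, but the interaction pattern is the same: annotate, infer, apply, correct.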
Carme, Gilleron, Lemay, and Niehren turn their induction algorithm for monadic queries into a visually interactive learning process that can also deal with documents carrying just a few annotations (complete annotations are no longer required). Experiments on realistic Web documents confirm excellent quality with very few user interactions (annotations and corrections) during wrapper induction.
Lemay, Niehren and Gilleron consider n-ary queries. They propose an induction algorithm based on grammatical inference techniques. Preliminary experimental results are quite promising. Nevertheless, the work will be pursued to introduce pruning techniques and heuristics, in order to obtain an even more efficient system and to allow the interactive design of n-ary wrappers.
Latteux, Lemay, Roos and Terlutte study learning of finite automata from positive examples. They consider Residual Finite State Automata (RFSA), which are non-deterministic automata that share some properties with DFA (in particular, DFA are RFSA, and RFSA can be much smaller). Latteux, Lemay, Roos and Terlutte introduced the class of biRFSA, which are RFSA whose reverses are RFSA. This class is not learnable in general, but they identified two non-trivial subclasses that are learnable, the second one being learnable in polynomial time.
Gilleron, Marty, Tommasi, and Torre approach wrapper induction by statistical machine learning techniques within Marty's PhD project. They have defined a system, PAF, for extracting n-ary relations from tree structured documents. The system is based on combination techniques. They have then extended PAF into an interactive system that allows n-ary queries over HTML Web pages to be defined with very few user interactions. It is worth noting that the system can be applied whatever the organization of the target n-ary relation in the input Web page.
Laurent Candillier defended his PhD thesis in September 2006. His work achieved two main results. First, a new subspace clustering algorithm for attribute-value databases has been defined. This algorithm has been tested on many problems, on which it has been shown to perform very well. It could also be applied to semi-structured data, after an appropriate encoding of XML data. The algorithm participated in the 2005 INEX/PASCAL challenge on document mining, where it was ranked second out of six in clustering. An adaptation of decision-tree learning algorithms, applied to the same encoding for the supervised learning task and tested by the same authors, was ranked first in classification. The participants who obtained the best results in the challenge have written the chapter "Mining XML documents" of the book "Data Mining Patterns: New Methods and Applications", accepted to appear next year (the co-authors of this chapter are Laurent Candillier, Ludovic Denoyer, Patrick Gallinari, Marie-Christine Rousset, Alexandre Termier and Anne-Marie Vercoustre). The second main achievement of the work of Candillier and his co-authors is a new evaluation method for unsupervised learning. This method considers clustering as a preprocessing step for a task (for example supervised learning) which can be rigorously evaluated. Comparing how well the task is performed with and without the clustering as a preprocessing step measures the information that the clustering has brought. Many experiments have shown that this method is robust and largely domain independent.
Conditional random fields are graphical models defining conditional probability distributions. They have been successfully applied to labeling tasks in the case of sequences. We have defined XML Conditional Random Fields, a framework for building conditional models for labeling XML documents. We have defined efficient algorithms for inference and parameter estimation. A prototype, TreeCRF, has been implemented. We have applied XML Conditional Random Fields to tree labeling tasks in information extraction and schema matching. Experiments yield very good results.
Atash is a French industrial project supported by the ``Agence Nationale de la Recherche (ANR)''. It is a collaboration with the Xerox Research Center Europe (XRCE) in Grenoble and the LIP6 laboratory. The objective is the design of learning algorithms for tree transformations and their implementation for data integration of documents (PDF, HTML, DOC) in XML databases according to a target DTD. The project began in 2006. A CIFRE PhD supported by XRCE and supervised by Rémi Gilleron and Boris Chidlovski will begin in January 2007.
Webcontent is a French industrial project supported by the ``Agence Nationale de la Recherche (ANR)''. It involves academic partners and companies. The objective is to develop a platform for Web document processing and the semantic Web. We are to integrate our Web service for information extraction, currently under development, into the platform, and to adapt our prototypes for Web information extraction. We are also involved in academic work on the semantic Web: semantic annotations, ontology inference and ontology mapping.
The PhD of Laurent Candillier was supported by the company PERTINENCE in Paris.
We are involved in the French cooperation project ``ACI masse de données – TraLaLA – XML Transformation Languages, Logic and Application'' (2004–2007). We pay particular attention to the programming languages and query languages problems. We aim to cover in a uniform way a wide spectrum of different areas, namely: programming languages (expressiveness, typing, new programming primitives, query underlying logics, logical optimization), data access (streamed data, compression, access to secondary memory storages, persistency engines), implementation (pattern matching compiling, physical optimization, subtyping verification, execution models for streamed data). The marginal budget allocated to the Mostrare project is 53 Keuros over the period 2004-2007.
Our partners are: Giuseppe Castagna (coordinator - LIENS), Luc Ségoufin (GEMO INRIA project), Silvano Dal Zilio (LIF) and Véronique Benzaken (LRI). More information about the project can be found on http://www.cduce.org/tralala.html.
We propose to study computational issues at the intersection of three domains: formal tree languages, machine learning and probabilistic models. Our study is mainly motivated by XML data manipulation: data integration on the Internet from heterogeneous and distributed sources; XML annotation and transformation; XML document classification and clustering. However, the intended fundamental results would have an important impact in many application domains. For instance, in bioinformatics and music retrieval, it is relevant to model data by using probabilistic trees. Therefore, this project is also concerned with the specific problems of these two application domains, and we will use large data sets from these areas. We will consider generative models for tree structured data, non-generative models for tree structured data, and models for probabilistic tree pattern matching and probabilistic tree transformations: tree pattern matching algorithms, learning pattern languages, induction of tree transformations. The coordinator of the project is M. Tommasi. Our partners are: P. Gallinari (LIP6), F. Denis (LIF), and M. Sebban (Saint Etienne). More information about the project can be found on http://www.grappa.univ-lille3.fr/marmota.
This ARC (Common Research Action between INRIA projects) gathers several French teams working on the syntactic formalisation of natural language. Some of them have developed syntactic resources, but the problem faced is that these resources are neither comparable (because they are based on different grammatical formalisms) nor reusable by any formalism other than their own. None of them has very large coverage. So, the purpose of this project is to capitalize as much as possible on the efforts already made, by developing bridges between various formalisms, or by proposing some higher-level formalism able to generalize several others. The first year of the ARC has mainly been dedicated to an exhaustive presentation of the available resources and of the particularities of each of them. The use of XML formats and tree models in this context links this project with Mostrare's goals.
Program Committees:
S. Tison was a member of the editorial board of RAIRO - Theoretical Informatics and Applications, and was a PC member of LPAR'2006, PLAN-X'2007, and FoSSaCS'2007.
R. Gilleron was a PC member of EGC'2006 and EGC'2007 (French conference on knowledge discovery).
F. Torre was a PC member of CAp'2006 (French conference on machine learning).
I. Tellier was a PC member of CORIA'2007 (French conference on information retrieval) and a member of the editorial committee for the special issue ``TAL: systemes question-reponse''.
J. Niehren was a PC member of MFCS'2006, LPAR'2006, ROMAND'2006 and CSLP'2006.
Workshop Organization
Anne-Cécile Caron co-organizes EGC'2006 (French conference on knowledge discovery) and BDA'2006 (French conference on databases) in Lille.
Sophie Tison co-organizes a Workshop on Tree Automata, funded by the European Science Foundation, in Bonn (June 2006).
Invited talks
Joachim Niehren was invited to the Dagstuhl seminar on constraint satisfaction.
Rémi Gilleron, Jean-Marc Talbot and Joachim Niehren were invited for presentation at the tree automata workshop in Bonn.
French Scientific Responsibilities
S. Tison is vice-director of the LIFL (computer science department in Lille) and head of the research group STC of the LIFL. She is a member of the national evaluation committee (MSTP-DS9) for teaching and research.
R. Gilleron is a member of the scientific council for the ANR program ARA - MDCA.
I. Tellier is a member of the CNU 27 (French evaluation committee for assistant professors in computer science).
Teaching
Joachim Niehren | 10 hours | masters |
Aurélien Lemay | 192 hours | bachelor and masters |
Isabelle Tellier | 192 hours | bachelor and masters |
Marc Tommasi | 192 hours | bachelor and masters |
Fabien Torre | 192 hours | bachelor and masters |
Anne-Cécile Caron | 192 hours | bachelor and masters |
Yves Roos | 192 hours | bachelor and masters |
Sophie Tison | 192 hours | bachelor and masters |
Master lectures presented at the university of Lille 1
Logic et Modelisation: A.-C. Caron, J. Niehren, and S. Tison
Machine Learning for Information Extraction: I. Tellier (2006-07)
Master projects:
Supervision of PhD theses submitted in 2006:
I. Boneva, Logics for unranked and unordered trees, supervised by J. M. Talbot and S. Tison.
L. Candillier, on unsupervised learning by subspace clustering, supervised by I. Tellier and F. Torre.
Habilitation thesis in 2006:
M. Tommasi, Machine Learning for tree structures.
PhD committees:
R. Gilleron belonged to the committee of L. Candillier; I. Tellier belonged to the committees of L. Candillier, E. Moreau (Nantes), and R. Eyraud (Saint Etienne, reviewer); S. Tison belonged to the committees of D. Marchal, A. Muller, J. Lemesre, I. Boneva, and C. Miachon (Orsay, reviewer).
Habilitation committees: R. Gilleron belonged to the committees of M. Tommasi (Lille) and F. Yvon (Paris); S. Tison belonged to the committee of M. Tommasi (Lille).