

Section: New Results

Digital Humanities and Cultural Heritage

Participants : Stefan Pernes, Marie Puren, Charles Riondet, Laurent Romary, Dorian Seillier, Lionel Tadonfouet.

The very broad scope of Digital Humanities and Cultural Heritage is well represented in the latest work of the ALMAnaCH team, undertaken in various contexts (European and national research infrastructures, bilateral partnerships). The issues tackled, however, consistently revolve around interoperability, reusability and standardization:

  • The "Data Reuse Charter" project [33] is carried out by a large consortium of European infrastructures and institutions.

  • The “Standardization Survival Kit” (or SSK) [66], developed within the PARTHENOS project, intends to show that proper data modelling and corresponding standards make digital content more sustainable and reusable, and that the Arts and Humanities are well placed to take up the technological prerequisites of standardization [41], as most other technological domains have already done.

  • A concrete application of what the SSK offers has been developed within the EHRI project, where we built a methodology for managing heterogeneous archival sources, expressed in the EAD (Encoded Archival Description) format, in a single environment, namely a federated portal [40], [48]. This method is based on a specification and customisation approach inspired by the TEI, i.e. the definition of project-specific subsets of the standard and the maintenance of both technical and editorial specifications within a single framework.
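
    As a minimal illustration of the subsetting idea, the following sketch (in Python, using lxml) checks that an EAD fragment only uses elements allowed by a project-specific subset; the whitelist and the sample record are invented for the example and do not reflect the actual EHRI specifications.

      # Illustrative only: a project-specific EAD subset expressed as a whitelist
      # of allowed element names, checked with lxml. The subset and the sample
      # record below are hypothetical.
      from lxml import etree

      ALLOWED = {"ead", "eadheader", "eadid", "filedesc", "titlestmt", "titleproper",
                 "archdesc", "did", "unitid", "unittitle", "unitdate",
                 "scopecontent", "p"}

      SAMPLE_EAD = b"""<ead>
        <eadheader>
          <eadid>example-001</eadid>
          <filedesc><titlestmt><titleproper>Sample finding aid</titleproper></titlestmt></filedesc>
        </eadheader>
        <archdesc level="fonds">
          <did>
            <unitid>F/1</unitid>
            <unittitle>Correspondence</unittitle>
            <unitdate>1942-1944</unitdate>
          </did>
          <scopecontent><p>Letters and administrative files.</p></scopecontent>
        </archdesc>
      </ead>"""

      def check_subset(xml_bytes, allowed):
          """Return element names used in the document that fall outside the subset."""
          root = etree.fromstring(xml_bytes)
          used = {etree.QName(el).localname for el in root.iter()
                  if isinstance(el.tag, str)}        # skip comments and PIs
          return sorted(used - allowed)

      print("Elements outside the subset:", check_subset(SAMPLE_EAD, ALLOWED) or "none")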

  • The Time-Us project aims to reconstruct the remuneration and time budgets of women and men working in the textile trades of four French industrial regions (Lille, Paris, Lyon, Marseille) over the long term. During the launch phase, the team has been active in the following areas:

    • Collection of primary sources. The Time-Us team works on a heterogeneous corpus of French handwritten and printed sources spanning the seventeenth to the twentieth century; it includes court decisions, petitions, police reports and files, and sociological surveys on the living conditions of the working class.

    • Evaluation of technical solutions for image visualization, transcription and collaboration, such as Transkribus (https://transkribus.eu/Transkribus/). The Transkribus interface enables Humanities scholars to transcribe handwritten and printed historical sources, and offers a very powerful Handwritten Text Recognition engine.

    • Creation of an annotation schema in XML/TEI. As the corpus gathers together diverse historical sources, the definition of a light and flexible annotation schema is a major step towards producing data suitable for training parsing models. These data take the form of annotated texts encoded in TEI (Text Encoding Initiative). The annotation process starts as a collaborative effort, in order to obtain a first dataset that will later be used to train and configure NLP tools. This step also helps NLP researchers and historians design precise annotation guidelines and clarify their mutual expectations (see the workflow sketch after this list).

    • Installation of a customized MediaWiki. Several digital projects have already taken into account the specific needs of historians in terms of image visualization, transcription and collaboration, but they do not address all the requirements of Humanities scholars working on primary sources, and the need for comprehensive Digital Humanities publishing systems is emerging. We have therefore set up a specific digital workflow enabling historians and NLP experts to work together: a MediaWiki instance (http://timeusage.paris.inria.fr/mediawiki/index.php/Accueil) with the Transcribe Bentham transcription desk, adapted to our needs, and a TEI toolbar specifically customized for tagging named entities and measures.
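
    To give an idea of how the two last steps fit together, the following sketch (in Python, using the requests and lxml libraries) retrieves the raw wikitext of a transcription page through the standard MediaWiki API and wraps it in a minimal TEI skeleton in which named entities and measures would then be tagged; the api.php location and the page title are assumptions, not guaranteed features of our installation.

      # Illustrative sketch: pull the raw wikitext of a transcription page from the
      # project wiki and wrap it in a minimal TEI document. The api.php endpoint and
      # the page name are assumed; error handling is kept to a minimum.
      import requests
      from lxml import etree

      API = "http://timeusage.paris.inria.fr/mediawiki/api.php"   # assumed endpoint
      TEI_NS = "http://www.tei-c.org/ns/1.0"

      def fetch_wikitext(title):
          """Retrieve the current wikitext of a page through the MediaWiki API."""
          params = {"action": "query", "prop": "revisions", "rvprop": "content",
                    "rvslots": "main", "titles": title, "format": "json",
                    "formatversion": "2"}
          page = requests.get(API, params=params, timeout=30).json()["query"]["pages"][0]
          return page["revisions"][0]["slots"]["main"]["content"]

      def wrap_in_tei(title, transcription):
          """Build a minimal TEI document around a raw transcription string."""
          def el(parent, tag, text=None):
              node = etree.SubElement(parent, "{%s}%s" % (TEI_NS, tag))
              node.text = text
              return node

          tei = etree.Element("{%s}TEI" % TEI_NS, nsmap={None: TEI_NS})
          file_desc = el(el(tei, "teiHeader"), "fileDesc")
          el(el(file_desc, "titleStmt"), "title", title)
          el(el(file_desc, "publicationStmt"), "p", "Unpublished working transcription.")
          el(el(file_desc, "sourceDesc"), "p", "Transcribed on the Time-Us wiki.")
          body = el(el(tei, "text"), "body")
          # Named entities and measures are tagged afterwards with the TEI toolbar,
          # using elements such as <persName> and <measure>.
          el(body, "p", transcription)
          return etree.tostring(tei, pretty_print=True, encoding="unicode")

      if __name__ == "__main__":
          page_title = "Exemple_de_transcription"      # hypothetical page name
          print(wrap_in_tei(page_title, fetch_wikitext(page_title)))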

  • Archives nationales

  • In a set of projects (eRabbinica, LAKME, NEH/DFG Mishna-Tosefta Synopsis) carried out with different partners and dealing with classical rabbinic literature in Middle Hebrew, we strive to create a critical edition of the Mishna (200k tokens, the hypotext of the Talmud) with translation, linguistic annotation and a lexicon. Hebrew, written from right to left and highly agglutinative, poses great challenges to encoding standards and demands the development of new technical solutions. No open-source corpora of linguistically annotated texts in rabbinic Hebrew exist so far.

    • Building on the HTR capacities of OCRopus, we have added our own layout analysis algorithms for column and line segmentation [35]. They have proven very successful on literary manuscripts, both for aligning existing transcriptions with word and character ROIs and for producing new transcriptions, reaching results similar to Transkribus but with much easier and more complete control over the layout analysis (a simplified illustration of the line segmentation step is sketched after this list).

    • With our partners at the University of Maryland, we have produced a preliminary TEI transcription of the most important manuscript, Kaufmann A50 (https://raw.githubusercontent.com/umd-mith/mishnah/master/data/tei/S07326.xml). Further improvements are currently under way. We have already been able to use this transcription to realign it with the manuscript glyphs.

    • We have produced preliminary transcriptions of two further manuscripts (Cambridge 450.2 and Parma A), which are currently being converted to TEI. A fourth manuscript (Munich Cod. Ebr. 95) is currently being processed.

    • Our partners at Dicta have produced a preliminary automatic linguistic annotation of a vulgate text of the Mishna with HMMs, providing lemma, POS and morphological analyses. In the LAKME project, we have now manually corrected 25k tokens (ca. 12 percent of the whole text), which will be used to train RNNs to improve the current annotation of the remaining text and to enter a human-machine dialogue aimed at fully annotating the whole Mishna (a minimal sketch of such a tagger is given after this list). The result will not only be the first open-source annotation of the Mishna; it will also be considerably more detailed than the excellent but closed annotation of the Israel Academy of the Hebrew Language (http://maagarim.hebrew-academy.org.il/). The resulting system will enable us to annotate other texts, such as the Tosefta and the Halakhic Midrashim, for the upcoming Sofer Mahir (tachygraph) project.
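
    To illustrate the line segmentation step mentioned above, the following sketch (in Python, using NumPy) applies a generic horizontal projection profile to a binarized column image; it is a deliberately simplified stand-in for the layout analysis algorithms of [35], with arbitrary thresholds and a tiny synthetic example.

      # Generic projection-profile line segmentation on a binarized column image
      # (0 = background, 1 = ink). A simplified illustration, not the algorithm of [35].
      import numpy as np

      def segment_lines(binary_column, min_gap=3, min_ink=2):
          """Return (top, bottom) row indices of detected text lines."""
          profile = binary_column.sum(axis=1)          # ink pixels per row
          is_text = profile >= min_ink                 # rows that contain writing
          lines, start = [], None
          for row, has_ink in enumerate(is_text):
              if has_ink and start is None:
                  start = row                          # a new line begins
              elif not has_ink and start is not None:
                  # close the line only after a sufficiently large blank gap
                  if not is_text[row:row + min_gap].any():
                      lines.append((start, row))
                      start = None
          if start is not None:
              lines.append((start, len(is_text)))
          return lines

      # Tiny synthetic example: two "lines" of ink separated by blank rows.
      page = np.zeros((20, 30), dtype=int)
      page[3:6, 5:25] = 1
      page[12:16, 4:28] = 1
      print(segment_lines(page))                       # -> [(3, 6), (12, 16)]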
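
    As a rough illustration of the planned RNN step, the following sketch (in PyTorch, with a toy vocabulary, tag set and arbitrary hyperparameters) defines a small bidirectional LSTM tagger of the kind that could be trained on the manually corrected tokens to predict labels for the remaining text; it is not the actual LAKME/Dicta pipeline.

      # Minimal BiLSTM sequence tagger sketch (PyTorch). Vocabulary, tag set and
      # hyperparameters are toy values used only to show the shape of the approach.
      import torch
      import torch.nn as nn

      class BiLSTMTagger(nn.Module):
          def __init__(self, vocab_size, tagset_size, emb_dim=64, hidden_dim=128):
              super().__init__()
              self.embed = nn.Embedding(vocab_size, emb_dim)
              self.lstm = nn.LSTM(emb_dim, hidden_dim, batch_first=True,
                                  bidirectional=True)
              self.out = nn.Linear(2 * hidden_dim, tagset_size)

          def forward(self, token_ids):                 # (batch, seq_len)
              states, _ = self.lstm(self.embed(token_ids))
              return self.out(states)                   # (batch, seq_len, n_tags)

      # One toy training step on random data, just to show the training loop shape.
      model = BiLSTMTagger(vocab_size=1000, tagset_size=40)
      optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
      loss_fn = nn.CrossEntropyLoss()

      tokens = torch.randint(0, 1000, (8, 30))          # 8 sentences of 30 token ids
      gold = torch.randint(0, 40, (8, 30))              # their gold tag indices
      logits = model(tokens)
      loss = loss_fn(logits.reshape(-1, 40), gold.reshape(-1))
      loss.backward()
      optimizer.step()
      print(float(loss))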