Marco Passarotti

2024

pdf bib abs
Querying the Lexicon der indogermanischen Verben in the LiLa Knowledge Base: Two Use Cases
Valeria Irene Boano | Marco Passarotti | Riccardo Ginevra
Proceedings of the 9th Workshop on Linked Data in Linguistics @ LREC-COLING 2024

This paper presents two use cases of the etymological data provided by the *Lexicon der indogermanischen Verben* (LIV) after their publication as Linked Open Data and their linking to the LiLa Knowledge Base (KB) of interoperable linguistic resources for Latin. The first part of the paper briefly describes the LiLa KB and its structure. Then, the LIV and the information it contains are introduced, followed by a short description of the ontologies and the extensions used for modelling the LIV’s data and interlinking them to the LiLa ecosystem. The last section details the two use cases. The first case concerns the inflection types of the Latin verbs that reflect Proto-Indo-European stems, while the second one focusses on the Latin derivatives of the inherited stems. The results of the investigations are put in relation to current research topics in Historical Linguistics, demonstrating their relevance to the discipline.

pdf bib abs
The MOLOR Lemma Bank: a New LLOD Resource for Old Irish
Theodorus Fransen | Cormac Anderson | Sacha Beniamine | Marco Passarotti
Proceedings of the 9th Workshop on Linked Data in Linguistics @ LREC-COLING 2024

This paper describes the first steps in creating a Lemma Bank for Old Irish (600-900CE) within the Linked Data paradigm, taking inspiration from a similar resource for Latin built as part of the LiLa project (2018–2023). The focus is on the extraction and RDF conversion of nouns from Goidelex, a novel and highly structured morphological resource for Old Irish. The aim is to strike a good balance between retaining a representative level of morphological granularity and at the same time keeping the amount of lemma variants within workable limits, to facilitate straightforward resource interlinking for Old Irish, planned as future work.

pdf bib abs
The Services of the LiLa Knowledge Base of Interoperable Linguistic Resources for Latin
Marco Passarotti | Francesco Mambrini | Giovanni Moretti
Proceedings of the 9th Workshop on Linked Data in Linguistics @ LREC-COLING 2024

This paper describes three online services designed to ease the tasks of querying and populating the linguistic resources for Latin made interoperable through their publication as Linked Open Data in the LiLa Knowledge Base. As for querying the KB, we present an interface to search the collection of lemmas that represents the core of the Knowledge Base, and an interactive, graphical platform to run queries on the resources currently interlinked. As for populating the KB with new textual resources, we describe a tool that performs automatic tokenization, lemmatization and Part-of-Speech tagging of a raw text in Latin and links its tokens to LiLa.

pdf bib
Proceedings of the Third Workshop on Language Technologies for Historical and Ancient Languages (LT4HALA) @ LREC-COLING-2024
Rachele Sprugnoli | Marco Passarotti
Proceedings of the Third Workshop on Language Technologies for Historical and Ancient Languages (LT4HALA) @ LREC-COLING-2024

pdf bib abs
The Rise and Fall of Dependency Parsing in Dante Alighieri’s Divine Comedy
Claudia Corbetta | Marco Passarotti | Giovanni Moretti
Proceedings of the Third Workshop on Language Technologies for Historical and Ancient Languages (LT4HALA) @ LREC-COLING-2024

In this paper, we conduct parsing experiments on Dante Alighieri’s Divine Comedy, an Old Italian poem composed between 1306-1321 and organized into three Cantiche —Inferno, Purgatorio, and Paradiso. We perform parsing on subsets of the poem using both a Modern Italian training set and sections of the Divine Comedy itself to evaluate under which scenarios parsers achieve higher scores. We find that employing in-domain training data supports better results, leading to an increase of approximately +17% in Unlabeled Attachment Score (UAS) and +25-30% in Labeled Attachment Score (LAS). Subsequently, we provide brief commentary on the differences in scores achieved among subsections of Cantiche, and we conduct experimental parsing on a text from the same period and style as the Divine Comedy.

pdf bib abs
Overview of the EvaLatin 2024 Evaluation Campaign
Rachele Sprugnoli | Federica Iurescia | Marco Passarotti
Proceedings of the Third Workshop on Language Technologies for Historical and Ancient Languages (LT4HALA) @ LREC-COLING-2024

This paper describes the organization and the results of the third edition of EvaLatin, the campaign for the evaluation of Natural Language Processing tools for Latin. The two shared tasks proposed in EvaLatin 2024, i.,e., Dependency Parsing and Emotion Polarity Detection, are aimed to foster research in the field of language technologies for Classical languages. The shared datasets are described and the results obtained by the participants for each task are presented and discussed.

pdf bib abs
Exploring Neural Topic Modeling on a Classical Latin Corpus
Ginevra Martinelli | Paola Impicciché | Elisabetta Fersini | Francesco Mambrini | Marco Passarotti
Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024)

The large availability of processable textual resources for Classical Latin has made it possible to study Latin literature through methods and tools that support distant reading. This paper describes a number of experiments carried out to test the possibility of investigating the thematic distribution of the Classical Latin corpus Opera Latina by means of topic modeling. For this purpose, we train, optimize and compare two neural models, Product-of-Experts LDA (ProdLDA) and Embedded Topic Model (ETM), opportunely revised to deal with the textual data from a Classical Latin corpus, to evaluate which one performs better both on the basis of topic diversity and topic coherence metrics, and from a human judgment point of view. Our results show that the topics extracted by neural models are coherent and interpretable and that they are significant from the perspective of a Latin scholar. The source code of the proposed model is available at https://github.com/MIND-Lab/LatinProdLDA.

pdf bib abs
Modelling and Linking an Old Latin-Portuguese Dictionary to the LiLa Knowledge Base
Lucas Consolin Dezotti | Marco Passarotti | Francesco Mambrini
Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024)

This paper describes the steps undertaken to include data from Antonio Velez’s bilingual Latin-Portuguese dictionary (Index Totius Artis, 1744) into the LiLa Knowledge Base of interoperable linguistic resources for Latin. The paper focuses on how the lexical and lexicographic information of the source dictionary was modelled by using respectively the Lexicon Model for Ontologies (OntoLex-lemon) and its lexicog module. The linking process of the dictionary entries with those of the LiLa collection of Latin lemmas is detailed, discussing issues in dealing with ambiguities and typographical errors found in the source. The result is the first Latin-Portuguese lexical resource made interoperable with the (meta)data of the other linguistic resources for Latin interlinked in the LiLa Knowledge Base, providing new ways of assessing the dictionary information or using its content as starting point to explore the connections with other interlinked linguistic resources. A couple of use case scenarios illustrate those possibilities.

pdf bib abs
Representing Compounding with OntoLex. An Evaluation of Vocabularies for Word Formation Resources
Elena Benzoni | Matteo Pellegrini | Francesco Dedè | Marco Passarotti
Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024)

This paper explores how compounds are represented in resources documenting word formation, and proposes ways to convert them into Linked Open Data using the OntoLex model. The ultimate purpose is to offer a broad empirical evaluation of which of the two OntoLex modules allowing for the representation of compounds – Decomp and Morph – fits best the different formats and theoretical approaches of the resources we examine. We show that the vocabulary of Decomp alone is rarely sufficient to account for all relevant facts; in almost all cases, it is necessary to resort to the vocabulary of Morph, either to reify the relation between compounds and their constituents or to represent specifically morphological information or other aspects. Special attention is devoted to the format of the Universal Derivations project: the modelling strategy that we propose can be applied to all resources harmonized in that format, potentially allowing for the conversion into Linked Open Data of a large amount of structured data.

2023

pdf bib abs
Linking the Neulateinische Wortliste to the LiLa Knowledge Base of Interoperable Resources for Latin
Federica Iurescia | Eleonora Litta | Marco Passarotti | Matteo Pellegrini | Giovanni Moretti | Paolo Ruffolo
Proceedings of the 7th Joint SIGHUM Workshop on Computational Linguistics for Cultural Heritage, Social Sciences, Humanities and Literature

This paper describes the process of interlinking a lexical resource consisting of a list of more than 20,000 Neo-Latin words with other resources for Latin. The resources are made interoperable thanks to their linking to the anonymous Knowledge Base, which applies Linguistic Linked Open Data practices and data categories to describe and publish on the Web both textual and lexical resources for the Latin language.

2022

pdf bib
Proceedings of the Second Workshop on Language Technologies for Historical and Ancient Languages
Rachele Sprugnoli | Marco Passarotti
Proceedings of the Second Workshop on Language Technologies for Historical and Ancient Languages

pdf bib abs
Overview of the EvaLatin 2022 Evaluation Campaign
Rachele Sprugnoli | Marco Passarotti | Flavio Massimiliano Cecchini | Margherita Fantoli | Giovanni Moretti
Proceedings of the Second Workshop on Language Technologies for Historical and Ancient Languages

This paper describes the organization and the results of the second edition of EvaLatin, the campaign for the evaluation of Natural Language Processing tools for Latin. The three shared tasks proposed in EvaLatin 2022, i.,e.,Lemmatization, Part-of-Speech Tagging and Features Identification, are aimed to foster research in the field of language technologies for Classical languages. The shared dataset consists of texts mainly taken from the LASLA corpus. More specifically, the training set includes only prose texts of the Classical period, whereas the test set is organized in three sub-tasks: a Classical sub-task on a prose text of an author not included in the training data, a Cross-genre sub-task on poetic and scientific texts, and a Cross-time sub-task on a text of the 15th century. The results obtained by the participants for each task and sub-task are presented and discussed.

pdf bib abs
Linking the LASLA Corpus in the LiLa Knowledge Base of Interoperable Linguistic Resources for Latin
Margherita Fantoli | Marco Passarotti | Francesco Mambrini | Giovanni Moretti | Paolo Ruffolo
Proceedings of the 8th Workshop on Linked Data in Linguistics within the 13th Language Resources and Evaluation Conference

This paper describes the process of interlinking the 130 Classical Latin texts provided by an annotated corpus developed at the LASLA laboratory with the LiLa Knowledge Base, which makes linguistic resources for Latin interoperable by following the principles of the Linked Data paradigm and making reference to classes and properties of widely adopted ontologies to model the relevant information. After introducing the overall architecture of the LiLa Knowledge Base and the LASLA corpus, the paper details the phases of the process of linking the corpus with the collection of lemmas of LiLa and presents a federated query to exemplify the added value of interoperability of LASLA’s texts with other resources for Latin.

pdf bib abs
Computational Morphology with OntoLex-Morph
Christian Chiarcos | Katerina Gkirtzou | Fahad Khan | Penny Labropoulou | Marco Passarotti | Matteo Pellegrini
Proceedings of the 8th Workshop on Linked Data in Linguistics within the 13th Language Resources and Evaluation Conference

This paper describes the current status of the emerging OntoLex module for linguistic morphology. It serves as an update to the previous version of the vocabulary (Klimek et al. 2019). Whereas this earlier model was exclusively focusing on descriptive morphology and focused on applications in lexicography, we now present a novel part and a novel application of the vocabulary to applications in language technology, i.e., the rule-based generation of lexicons, introducing a dynamic component into OntoLex.

pdf bib abs
The Index Thomisticus Treebank as Linked Data in the LiLa Knowledge Base
Francesco Mambrini | Marco Passarotti | Giovanni Moretti | Matteo Pellegrini
Proceedings of the Thirteenth Language Resources and Evaluation Conference

Although the Universal Dependencies initiative today allows for cross-linguistically consistent annotation of morphology and syntax in treebanks for several languages, syntactically annotated corpora are not yet interoperable with many lexical resources that describe properties of the words that occur therein. In order to cope with such limitation, we propose to adopt the principles of the Linguistic Linked Open Data community, to describe and publish dependency treebanks as LLOD. In particular, this paper illustrates the approach pursued in the LiLa Knowledge Base, which enables interoperability between corpora and lexical resources for Latin, to publish as Linguistic Linked Open Data the annotation layers of two versions of a Medieval Latin treebank (the Index Thomisticus Treebank).

2020

pdf bib abs
A New Latin Treebank for Universal Dependencies: Charters between Ancient Latin and Romance Languages
Flavio Massimiliano Cecchini | Timo Korkiakangas | Marco Passarotti
Proceedings of the Twelfth Language Resources and Evaluation Conference

The present work introduces a new Latin treebank that follows the Universal Dependencies (UD) annotation standard. The treebank is obtained from the automated conversion of the Late Latin Charter Treebank 2 (LLCT2), originally in the Prague Dependency Treebank (PDT) style. As this treebank consists of Early Medieval legal documents, its language variety differs considerably from both the Classical and Medieval learned varieties prevalent in the other currently available UD Latin treebanks. Consequently, besides significant phenomena from the perspective of diachronic linguistics, this treebank also poses several challenging technical issues for the current and future syntactic annotation of Latin in the UD framework. Some of the most relevant cases are discussed in depth, with comparisons between the original PDT and the resulting UD annotations. Additionally, an overview of the UD-style structure of the treebank is given, and some diachronic aspects of the transition from Latin to Romance languages are highlighted.

pdf bib abs
Odi et Amo. Creating, Evaluating and Extending Sentiment Lexicons for Latin.
Rachele Sprugnoli | Marco Passarotti | Daniela Corbetta | Andrea Peverelli
Proceedings of the Twelfth Language Resources and Evaluation Conference

Sentiment lexicons are essential for developing automatic sentiment analysis systems, but the resources currently available mostly cover modern languages. Lexicons for ancient languages are few and not evaluated with high-quality gold standards. However, the study of attitudes and emotions in ancient texts is a growing field of research which poses specific issues (e.g., lack of native speakers, limited amount of data, unusual textual genres for the sentiment analysis task, such as philosophical or documentary texts) and can have an impact on the work of scholars coming from several disciplines besides computational linguistics, e.g. historians and philologists. The work presented in this paper aims at providing the research community with a set of sentiment lexicons built by taking advantage of manually-curated resources belonging to the long tradition of Latin corpora and lexicons creation. Our interdisciplinary approach led us to release: i) two automatically generated sentiment lexicons; ii) a gold standard developed by two Latin language and culture experts; iii) a silver standard in which semantic and derivational relations are exploited so to extend the list of lexical items of the gold standard. In addition, the evaluation procedure is described together with a first application of the lexicons to a Latin tragedy.

pdf bib abs
Representing Etymology in the LiLa Knowledge Base of Linguistic Resources for Latin
Francesco Mambrini | Marco Passarotti
Proceedings of the 2020 Globalex Workshop on Linked Lexicography

In this paper we describe the process of inclusion of etymological information in a knowledge base of interoperable Latin linguistic resources developed in the context of the LiLa: Linking Latin project. Interoperability is obtained by applying the Linked Open Data principles. Particularly, an extensive collection of Latin lemmas is used to link the (distributed) resources. For the etymology, we rely on the Ontolex-lemon ontology and the lemonEty extension to model the information, while the source data are taken from a recent etymological dictionary of Latin. As a result, the collection of lemmas LiLa is built around now includes 1,465 Proto-Italic and 1,393 Proto-Indo-European reconstructed forms that are used to explain the history of 1,400 Latin words. We discuss the motivation, methodology and modeling strategies of the work, as well as its possible applications and potential future developments.

pdf bib
Proceedings of LT4HALA 2020 - 1st Workshop on Language Technologies for Historical and Ancient Languages
Rachele Sprugnoli | Marco Passarotti
Proceedings of LT4HALA 2020 - 1st Workshop on Language Technologies for Historical and Ancient Languages

pdf bib abs
Overview of the EvaLatin 2020 Evaluation Campaign
Rachele Sprugnoli | Marco Passarotti | Flavio Massimiliano Cecchini | Matteo Pellegrini
Proceedings of LT4HALA 2020 - 1st Workshop on Language Technologies for Historical and Ancient Languages

This paper describes the first edition of EvaLatin, a campaign totally devoted to the evaluation of NLP tools for Latin. The two shared tasks proposed in EvaLatin 2020, i. e. Lemmatization and Part-of-Speech tagging, are aimed at fostering research in the field of language technologies for Classical languages. The shared dataset consists of texts taken from the Perseus Digital Library, processed with UDPipe models and then manually corrected by Latin experts. The training set includes only prose texts by Classical authors. The test set, alongside with prose texts by the same authors represented in the training set, also includes data relative to poetry and to the Medieval period. This also allows us to propose the Cross-genre and Cross-time subtasks for each task, in order to evaluate the portability of NLP tools for Latin across different genres and time periods. The results obtained by the participants for each task and subtask are presented and discussed.

2019

pdf bib abs
Harmonizing Different Lemmatization Strategies for Building a Knowledge Base of Linguistic Resources for Latin
Francesco Mambrini | Marco Passarotti
Proceedings of the 13th Linguistic Annotation Workshop

The interoperability between lemmatized corpora of Latin and other resources that use the lemma as indexing key is hampered by the multiple lemmatization strategies that different projects adopt. In this paper we discuss how we tackle the challenges raised by harmonizing different lemmatization criteria in the context of a project that aims to connect linguistic resources for Latin using the Linked Data paradigm. The paper introduces the architecture supporting an open-ended, lemma-based Knowledge Base, built to make textual and lexical resources for Latin interoperable. Particularly, the paper describes the inclusion into the Knowledge Base of its lexical basis, of a word formation lexicon and of a lemmatized and syntactically annotated corpus.

pdf bib
Linked Open Treebanks. Interlinking Syntactically Annotated Corpora in the LiLa Knowledge Base of Linguistic Resources for Latin
Francesco Mambrini | Marco Passarotti
Proceedings of the 18th International Workshop on Treebanks and Linguistic Theories (TLT, SyntaxFest 2019)

pdf bib
Proceedings of the Second International Workshop on Resources and Tools for Derivational Morphology
Magda Ševčíková | Zdeněk Žabokrtský | Eleonora Litta | Marco Passarotti
Proceedings of the Second International Workshop on Resources and Tools for Derivational Morphology

pdf bib
The Treatment of Word Formation in the LiLa Knowledge Base of Linguistic Resources for Latin
Eleonora Litta | Marco Passarotti | Francesco Mambrini
Proceedings of the Second International Workshop on Resources and Tools for Derivational Morphology

2018

pdf bib abs
Challenges in Converting the Index Thomisticus Treebank into Universal Dependencies
Flavio Massimiliano Cecchini | Marco Passarotti | Paola Marongiu | Daniel Zeman
Proceedings of the Second Workshop on Universal Dependencies (UDW 2018)

This paper describes the changes applied to the original process used to convert the Index Thomisticus Treebank, a corpus including texts in Medieval Latin by Thomas Aquinas, into the annotation style of Universal Dependencies. The changes are made both to harmonise the Universal Dependencies version of the Index Thomisticus Treebank with the two other available Latin treebanks and to fix errors and inconsistencies resulting from the original process. The paper details the treatment of different issues in PoS tagging, lemmatisation and assignment of dependency relations. Finally, it assesses the quality of the new conversion process by providing an evaluation against a gold standard.

2017

pdf bib
The Lemlat 3.0 Package for Morphological Analysis of Latin
Marco Passarotti | Marco Budassi | Eleonora Litta | Paolo Ruffolo
Proceedings of the NoDaLiDa 2017 Workshop on Processing Historical Language

pdf bib
The Treebanked Conspiracy. Actors and Actions in Bellum Catilinae
Marco Passarotti | Berta González Saavedra
Proceedings of the 16th International Workshop on Treebanks and Linguistic Theories

2016

pdf bib
Nomen Omen. Enhancing the Latin Morphological Analyser Lemlat with an Onomasticon
Marco Budassi | Marco Passarotti
Proceedings of the 10th SIGHUM Workshop on Language Technology for Cultural Heritage, Social Sciences, and Humanities

pdf bib abs
Differentia compositionem facit. A Slower-Paced and Reliable Parser for Latin
Edoardo Maria Ponti | Marco Passarotti
Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC'16)

The Index Thomisticus Treebank is the largest available treebank for Latin; it contains Medieval Latin texts by Thomas Aquinas. After experimenting on its data with a number of dependency parsers based on different supervised machine learning techniques, we found that DeSR with a multilayer perceptron algorithm, a right-to-left transition, and a tailor-made feature model is the parser providing the highest accuracy rates. We improved the results further by using a technique that combines the output parses of DeSR with those provided by other parsers, outperforming the previous state of the art in parsing the Index Thomisticus Treebank. The key idea behind such improvement is to ensure a sufficient diversity and accuracy of the outputs to be combined; for this reason, we performed an in-depth evaluation of the results provided by the different parsers that we combined. Finally, we assessed that, although the general architecture of the parser is portable to Classical Latin, yet the model trained on Medieval Latin is inadequate for such purpose.

pdf bib abs
Latin Vallex. A Treebank-based Semantic Valency Lexicon for Latin
Marco Passarotti | Berta González Saavedra | Christophe Onambele
Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC'16)

Despite a centuries-long tradition in lexicography, Latin lacks state-of-the-art computational lexical resources. This situation is strictly related to the still quite limited amount of linguistically annotated textual data for Latin, which can help the building of new lexical resources by supporting them with empirical evidence. However, projects for creating new language resources for Latin have been launched over the last decade to fill this gap. In this paper, we present Latin Vallex, a valency lexicon for Latin built in mutual connection with the semantic and pragmatic annotation of two Latin treebanks featuring texts of different eras. On the one hand, such a connection between the empirical evidence provided by the treebanks and the lexicon allows to enhance each frame entry in the lexicon with its frequency in real data. On the other hand, each valency-capable word in the treebanks is linked to a frame entry in the lexicon.

2014

pdf bib abs
A Compact Interactive Visualization of Dependency Treebank Query Results
Chris Culy | Marco Passarotti | Ulla König-Cardanobile
Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC'14)

One of the challenges of corpus querying is making sense of the results of a query, especially when a large number of results and linguistically annotated data are concerned. While the most widespread tools for querying syntactically annotated corpora tend to focus on single occurrences, one aspect that is not fully exploited yet in this area is that language is a complex system whose units are connected to each other at both microscopic (the single occurrences) and macroscopic level (the whole system itself). Assuming that language is a system, we describe a tool (using the DoubleTreeJS visualization) to visualize the results of querying dependency treebanks by forming a node from a single item type, and building a network in which the heads and the dependents of the central node are respectively the left and the right vertices of the tree, which are connected to the central node by dependency relations. One case study is presented, consisting in the exploitation of DoubleTreeJS for supporting one assumption in theoretical linguistics with evidence provided by the data of a dependency treebank of Medieval Latin.

pdf bib abs
Thomas Aquinas in the TüNDRA: Integrating the Index Thomisticus Treebank into CLARIN-D
Scott Martens | Marco Passarotti
Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC'14)

This paper describes the integration of the Index Thomisticus Treebank (IT-TB) into the web-based treebank search and visualization application TueNDRA (Tuebingen aNnotated Data Retrieval & Analysis). TueNDRA was originally designed to provide access via the Internet to constituency treebanks and to tools for searching and visualizing them, as well as tabulating statistics about their contents. TueNDRA has now been extended to also provide full support for dependency treebanks with non-projective dependencies, in order to integrate the IT-TB and future treebanks with similar properties. These treebanks are queried using an adapted form of the TIGERSearch query language, which can search both hierarchical and sequential information in treebanks in a single query. As a web application, making the IT-TB accessible via TueNDRA makes the treebank and the tools to use of it available to a large community without having to distribute software and show users how to install it.

pdf bib
From Syntax to Semantics. First Steps Towards Tectogrammatical Annotation of Latin
Marco Passarotti
Proceedings of the 8th Workshop on Language Technology for Cultural Heritage, Social Sciences, and Humanities (LaTeCH)

2013

pdf bib
Non-Projectivity in the Ancient Greek Dependency Treebank
Francesco Mambrini | Marco Passarotti
Proceedings of the Second International Conference on Dependency Linguistics (DepLing 2013)

2012

pdf bib abs
First Steps towards the Semi-automatic Development of a Wordformation-based Lexicon of Latin
Marco Passarotti | Francesco Mambrini
Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC'12)

Although lexicography of Latin has a long tradition dating back to ancient grammarians, and almost all Latin grammars devote to wordformation at least one part of the section(s) concerning morphology, none of the today available lexical resources and NLP tools of Latin feature a wordformation-based organization of the Latin lexicon. In this paper, we describe the first steps towards the semi-automatic development of a wordformation-based lexicon of Latin, by detailing several problems occurring while building the lexicon and presenting our solutions. Developing a wordformation-based lexicon of Latin is nowadays of outmost importance, as the last years have seen a large growth of annotated corpora of Latin texts of different eras. While these corpora include lemmatization, morphological tagging and syntactic analysis, none of them features segmentation of the word forms and wordformation relations between the lexemes. This restricts the browsing and the exploitation of the annotated data for linguistic research and NLP tasks, such as information retrieval and heuristics in PoS tagging of unknown words.

2010

pdf bib abs
Improvements in Parsing the Index Thomisticus Treebank. Revision, Combination and a Feature Model for Medieval Latin
Marco Passarotti | Felice Dell’Orletta
Proceedings of the Seventh International Conference on Language Resources and Evaluation (LREC'10)

The creation of language resources for less-resourced languages like the historical ones benefits from the exploitation of language-independent tools and methods developed over the years by many projects for modern languages. Along these lines, a number of treebanks for historical languages started recently to arise, including treebanks for Latin. Among the Latin treebanks, the Index Thomisticus Treebank is a 68,000 token dependency treebank based on the Index Thomisticus by Roberto Busa SJ, which contains the opera omnia of Thomas Aquinas (118 texts) as well as 61 texts by other authors related to Thomas, for a total of approximately 11 million tokens. In this paper, we describe a number of modifications that we applied to the dependency parser DeSR, in order to improve the parsing accuracy rates on the Index Thomisticus Treebank. First, we adapted the parser to the specific processing of Medieval Latin, defining an ad-hoc configuration of its features. Then, in order to improve the accuracy rates provided by DeSR, we applied a revision parsing method and we combined the outputs produced by different algorithms. This allowed us to improve accuracy rates substantially, reaching results that are well beyond the state of the art of parsing for Latin.

2009

pdf bib
The Development of the “Index Thomisticus” Treebank Valency Lexicon
Barbara McGillivray | Marco Passarotti
Proceedings of the EACL 2009 Workshop on Language Technology and Resources for Cultural Heritage, Social Sciences, Humanities, and Education (LaTeCH – SHELT&R 2009)

pdf bib
The Index Thomisticus Treebank Project: Annotation, Parsing and Valency Lexicon
Barbara McGillivray | Marco Passarotti | Paolo Ruffolo
Traitement Automatique des Langues, Volume 50, Numéro 2 : Langues anciennes [Ancient Languages]

2008

pdf bib abs
The Annotation Guidelines of the Latin Dependency Treebank and Index Thomisticus Treebank: the Treatment of some specific Syntactic Constructions in Latin
David Bamman | Marco Passarotti | Roberto Busa | Gregory Crane
Proceedings of the Sixth International Conference on Language Resources and Evaluation (LREC'08)

The paper describes the treatment of some specific syntactic constructions in two treebanks of Latin according to a common set of annotation guidelines. Both projects work within the theoretical framework of Dependency Grammar, which has been demonstrated to be an especially appropriate framework for the representation of languages with a moderately free word order, where the linear order of constituents is broken up with elements of other constituents. The two projects are the first of their kind for Latin, so no prior established guidelines for syntactic annotation are available to rely on. The general model for the adopted style of representation is that used by the Prague Dependency Treebank, with departures arising from the Latin grammar of Pinkster, specifically in the traditional grammatical categories of the ablative absolute, the accusative + infinitive, and gerunds/gerundives. Sharing common annotation guidelines allows us to compare the datasets of the two treebanks for tasks such as mutually checking annotation consistency, diachronically studying specific syntactic constructions, and training statistical dependency parsers.