Stefanie Dipper

2024

Complexity of German Texts Written by Primary School Children
Jammila Laâguidi | Dana Neumann | Ronja Laarmann-Quante | Stefanie Dipper | Mihail Chifligarov
Proceedings of the 20th Conference on Natural Language Processing (KONVENS 2024)

Guidelines for the Annotation of Intentional Linguistic Metaphor
Stefanie Dipper | Adam Roussel | Alexandra Wiemann | Won Kim | Tra-my Nguyen
Proceedings of the 4th Workshop on Figurative Language Processing (FigLang 2024)

This paper presents guidelines for the annotation of intentional (i.e., non-conventionalized) linguistic metaphors. Expressions that contribute to the same metaphorical image are annotated as a chain; in addition, a semantically contrasting expression of the target domain is marked as an anchor. So far, a corpus of ten TEDx talks with a total of 20k tokens has been annotated according to these guidelines. 1.25% of the tokens are intentional metaphorical expressions.

UD for German Poetry
Stefanie Dipper | Ronja Laarmann-Quante
Proceedings of the 4th International Conference on Natural Language Processing for Digital Humanities

This article deals with the syntactic analysis of German-language poetry from different centuries. We use Universal Dependencies (UD) as our syntactic framework. We discuss particular challenges of the poems in terms of tokenization, sentence boundary recognition and special syntactic constructions. Our annotated corpus currently consists of 20 poems with a total of 2,162 tokens, which originate from the PoeTree.de corpus. We present some statistics on our annotations and also evaluate the automatic UD annotation from PoeTree.de using our annotations.

Universal Dependencies: Extensions for Modern and Historical German
Stefanie Dipper | Cora Haiber | Anna Maria Schröter | Alexandra Wiemann | Maike Brinkschulte
Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024)

In this paper we present extensions of the UD scheme for modern and historical German. The extensions relate in part to fundamental differences such as those between different kinds of arguments and modifiers. We illustrate the extensions with examples from the MHG data and discuss a number of MHG-specific constructions. At the current time, we have annotated a corpus of Middle High German with almost 29K tokens using this scheme, which to our knowledge is the first UD treebank for Middle High German. Inter-annotator agreement is very high: the annotators achieve a score of α = 0.85. A statistical analysis of the annotations shows some interesting differences in the distribution of labels between modern and historical German.

Independently of the medial representation (written/spoken), language can exhibit characteristics of conceptual orality or literacy, which mainly manifest themselves on the lexical or syntactic level. In this paper we aim at automatically identifying conceptually-oral historical texts, with the ultimate goal of gaining knowledge about spoken data of historical time stages. We apply a set of general linguistic features that have been proven to be effective for the classification of modern language data to historical German texts from various registers. Many of the features turn out to be equally useful in determining the conceptuality of historical data as they are for modern data, especially the frequency of different types of pronouns and the ratio of verbs to nouns. Other features like sentence length, particles or interjections point to peculiarities of the historical data and reveal problems with the adoption of a feature set that was developed on modern language data.

Proceedings of the 14th Linguistic Annotation Workshop
Stefanie Dipper | Amir Zeldes
Proceedings of the 14th Linguistic Annotation Workshop

2019

Variation between Different Discourse Types: Literate vs. Oral
Katrin Ortmann | Stefanie Dipper
Proceedings of the Sixth Workshop on NLP for Similar Languages, Varieties and Dialects

This paper deals with the automatic identification of literate and oral discourse in German texts. A range of linguistic features is selected and their role in distinguishing between literate- and oral-oriented registers is investigated, using a decision-tree classifier. It turns out that all of the investigated features are related in some way to oral conceptuality. Especially simple measures of complexity (average sentence and word length) are prominent indicators of oral and literate discourse. In addition, features of reference and deixis (realized by different types of pronouns) also prove to be very useful in determining the degree of orality of different registers.

The making of the Litkey Corpus, a richly annotated longitudinal corpus of German texts written by primary school children
Ronja Laarmann-Quante | Stefanie Dipper | Eva Belke
Proceedings of the 13th Linguistic Annotation Workshop

To date, corpus and computational linguistic work on written language acquisition has mostly dealt with second language learners who have usually already mastered orthography acquisition in their first language. In this paper, we present the Litkey Corpus, a richly-annotated longitudinal corpus of written texts produced by primary school children in Germany from grades 2 to 4. The paper focuses on the (semi-)automatic annotation procedure at various linguistic levels, which include POS tags, features of the word-internal structure (phonemes, syllables, morphemes) and key orthographic features of the target words as well as a categorization of spelling errors. Comprehensive evaluations show that high accuracy was achieved on all levels, making the Litkey Corpus a useful resource for corpus-based research on literacy acquisition of German primary school children and for developing NLP tools for educational purposes. The corpus is freely available under https://www.linguistics.rub.de/litkeycorpus/.

2018

Survey: Anaphora With Non-nominal Antecedents in Computational Linguistics: a Survey
Varada Kolhatkar | Adam Roussel | Stefanie Dipper | Heike Zinsmeister
Computational Linguistics, Volume 44, Issue 3 - September 2018

This article provides an extensive overview of the literature related to the phenomenon of non-nominal-antecedent anaphora (also known as abstract anaphora or discourse deixis), a type of anaphora in which an anaphor like “that” refers to an antecedent (marked in boldface) that is syntactically non-nominal, such as the first sentence in “It’s way too hot here. That’s why I’m moving to Alaska.” Annotating and automatically resolving these cases of anaphora is interesting in its own right because of the complexities involved in identifying non-nominal antecedents, which typically represent abstract objects such as events, facts, and propositions. There is also practical value in the resolution of non-nominal-antecedent anaphora, as this would help computational systems in machine translation, summarization, and question answering, as well as, conceivably, any other task dependent on some measure of text understanding. Most of the existing approaches to anaphora annotation and resolution focus on nominal-antecedent anaphora, classifying many of the cases where the antecedents are syntactically non-nominal as non-anaphoric. There has been some work done on this topic, but it remains scattered and difficult to collect and assess. With this article, we hope to bring together and synthesize work done in disparate contexts up to now in order to identify fundamental problems and draw conclusions from an overarching perspective. Having a good picture of the current state of the art in this field can help researchers direct their efforts to where they are most necessary. Because of the great variety of theoretical approaches that have been brought to bear on the problem, there is an equally diverse array of terminologies that are used to describe it, so we will provide an overview and discussion of these terminologies. 
We also describe the linguistic properties of non-nominal-antecedent anaphora, examine previous annotation efforts that have addressed this topic, and present the computational approaches that aim at resolving non-nominal-antecedent anaphora automatically. We close with a review of the remaining open questions in this area and some of our recommendations for future research.

2017

Variance in Historical Data: How bad is it and how can we profit from it for historical linguistics?
Stefanie Dipper
Proceedings of the NoDaLiDa 2017 Workshop on Processing Historical Language

Investigating Diatopic Variation in a Historical Corpus
Stefanie Dipper | Sandra Waldenberger
Proceedings of the Fourth Workshop on NLP for Similar Languages, Varieties and Dialects (VarDial)

This paper investigates diatopic variation in a historical corpus of German. Based on equivalent word forms from different language areas, replacement rules and mappings are derived which describe the relations between these word forms. These rules and mappings are then interpreted as reflections of morphological, phonological or graphemic variation. Based on sample rules and mappings, we show that our approach can replicate results from historical linguistics. While previous studies were restricted to predefined word lists, or confined to single authors or texts, our approach uses a much wider range of data available in historical corpora.

Annotating Orthographic Target Hypotheses in a German L1 Learner Corpus
Ronja Laarmann-Quante | Katrin Ortmann | Anna Ehlert | Maurice Vogel | Stefanie Dipper
Proceedings of the 12th Workshop on Innovative Use of NLP for Building Educational Applications

NLP applications for learners often rely on annotated learner corpora. For such corpora, it is important that the annotations are meaningful for the task as well as consistent and reliable. We present a new longitudinal L1 learner corpus for German (handwritten texts collected in grades 2–4), which is transcribed and annotated with a target hypothesis that corrects only orthographic errors and is thus tailored to research and tool development for orthographic issues in primary school. While transcription and target hypothesis are not evaluated for most corpora, we conducted a detailed inter-annotator agreement study for both tasks. Although we achieved high agreement, our discussion of cases of disagreement shows that even with detailed guidelines, annotators occasionally differ for various reasons. This should also be considered when working with transcriptions and target hypotheses of other corpora, especially if no explicit guidelines for their construction are known.

2016

Annotating Spelling Errors in German Texts Produced by Primary School Children
Ronja Laarmann-Quante | Lukas Knichel | Stefanie Dipper | Carina Betken
Proceedings of the 10th Linguistic Annotation Workshop held in conjunction with ACL 2016 (LAW-X 2016)

Evaluating Inter-Annotator Agreement on Historical Spelling Normalization
Marcel Bollmann | Stefanie Dipper | Florian Petran
Proceedings of the 10th Linguistic Annotation Workshop held in conjunction with ACL 2016 (LAW-X 2016)

2014

CorA: A web-based annotation tool for historical and other non-standard language data
Marcel Bollmann | Florian Petran | Stefanie Dipper | Julia Krasselt
Proceedings of the 8th Workshop on Language Technology for Cultural Heritage, Social Sciences, and Humanities (LaTeCH)

2013

Proceedings of the 7th Linguistic Annotation Workshop and Interoperability with Discourse
Antonio Pareja-Lora | Maria Liakata | Stefanie Dipper
Proceedings of the 7th Linguistic Annotation Workshop and Interoperability with Discourse

2012

The Use of Parallel and Comparable Data for Analysis of Abstract Anaphora in German and English
Stefanie Dipper | Melanie Seiss | Heike Zinsmeister
Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC'12)

Parallel corpora (original texts aligned with their translations) are a widely used resource in computational linguistics. Translation studies have shown that translated texts often differ systematically from comparable original texts. Translators tend to be faithful to structures of the original texts, resulting in a "shining through" of the original language preferences in the translated text. Translators also tend to make their translations most comprehensible with the effect that translated texts can be more explicit than their source texts. Motivated by the need to use a parallel resource for cross-linguistic feature induction in abstract anaphora resolution, this paper investigates properties of English and German texts in the Europarl corpus, taking into account both general features such as sentence length as well as task-dependent features such as the distribution of demonstrative noun phrases. The investigation is based on the entire Europarl corpus as well as on a small subset thereof, which has been manually annotated. The results indicate English translated texts are sufficiently "authentic" to be used as training data for anaphora resolution; results for German texts are less conclusive, though.

2011

Rule-Based Normalization of Historical Texts
Marcel Bollmann | Florian Petran | Stefanie Dipper
Proceedings of the Workshop on Language Technologies for Digital Humanities and Cultural Heritage

2010

OTTO: A Transcription and Management Tool for Historical Texts
Stefanie Dipper | Lara Kresse | Martin Schnurrenberger | Seong-Eun Cho
Proceedings of the Fourth Linguistic Annotation Workshop

2009

Annotating Discourse Anaphora
Stefanie Dipper | Heike Zinsmeister
Proceedings of the Third Linguistic Annotation Workshop (LAW III)

2008

Measures for Term and Sentence Relevances: an Evaluation for German
Heike Bieler | Stefanie Dipper
Proceedings of the Sixth International Conference on Language Resources and Evaluation (LREC'08)

Terms, term relevances, and sentence relevances are concepts that figure in many NLP applications, such as Text Summarization. These concepts are implemented in various ways, though. In this paper, we want to shed light on the impact that different implementations can have on the overall performance of the systems. In particular, we examine the interplay between term definitions and sentence-scoring functions. For this, we define a gold standard that ranks sentences according to their significance and evaluate a range of relevant parameters with respect to the gold standard.

Annotation of Information Structure: an Evaluation across different Types of Texts
Julia Ritz | Stefanie Dipper | Michael Götze
Proceedings of the Sixth International Conference on Language Resources and Evaluation (LREC'08)

We report on the evaluation of information structural annotation according to the Linguistic Information Structure Annotation Guidelines (LISA; Dipper et al., 2007). The annotation scheme differentiates between the categories of information status, topic, and focus. It aims at being language-independent and has been applied to highly heterogeneous data: written and spoken evidence from typologically diverse languages. For the evaluation presented here, we focused on German texts of different types, both written texts and transcriptions of spoken language, and analyzed the annotation quantitatively and qualitatively.

A Flexible Framework for Integrating Annotations from Different Tools and Tag Sets
Christian Chiarcos | Stefanie Dipper | Michael Götze | Ulf Leser | Anke Lüdeling | Julia Ritz | Manfred Stede
Traitement Automatique des Langues, Volume 49, Numéro 2 : Plate-formes pour le traitement automatique des langues [Platforms for Natural Language Processing]