Recent developments in historical NER are dominated by deep learning techniques, which have recently shown state-of-the-art results for modern NER. Deep learning-based sequence labelling approaches rely on word and character distributed representations and learn sentence or sequence features during end-to-end training. Most models are based on BiLSTM architectures or self-attention networks, and use a CRF layer as tag decoder to capture dependencies between labels (see Appendix A.3.3). Building on these results, much work attempts to apply and/or adapt deep learning approaches to historical documents, under different settings and following different strategies.
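To make this architecture concrete, the sketch below (ours, not taken from any of the reviewed papers) instantiates a BiLSTM-CRF tagger over pre-trained static embeddings with the Flair framework; the corpus path and column format are hypothetical placeholders.

```python
# Minimal sketch of a BiLSTM-CRF sequence tagger, assuming a CoNLL-style
# corpus with token and NER-tag columns; paths are placeholders.
from flair.datasets import ColumnCorpus
from flair.embeddings import WordEmbeddings
from flair.models import SequenceTagger
from flair.trainers import ModelTrainer

corpus = ColumnCorpus("data/my_historical_corpus", {0: "text", 1: "ner"})
tag_dictionary = corpus.make_label_dictionary(label_type="ner")

# BiLSTM encoder over pre-trained static word embeddings, with a CRF
# decoder (use_crf=True) to capture dependencies between adjacent labels.
tagger = SequenceTagger(
    hidden_size=256,
    embeddings=WordEmbeddings("glove"),
    tag_dictionary=tag_dictionary,
    tag_type="ner",
    use_crf=True,
)

ModelTrainer(tagger, corpus).train("models/bilstm-crf", max_epochs=50)
```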
6.3.2 Approaches based on Static Embeddings.
First attempts are based on the state-of-the-art BiLSTM-CRF architecture and investigate the transferability of various types of pre-trained static embeddings to historical material. They all use traditional CRFs as a baseline.
Focusing on location names in 19-20C English travelogues, Sprugnoli [187] compares two classifiers, Stanford CRF and BiLSTM-CRF, and experiments with different word embeddings: GloVe embeddings, based on linear bag-of-words contexts and trained on Common Crawl data [154]; Levy and Goldberg embeddings, produced from the English Wikipedia with a dependency-based approach [122]; and fastText embeddings, also trained on the English Wikipedia but using sub-word information [22]. In addition to these pre-trained vectors, Sprugnoli trains each embedding type afresh on historical data (a subset of the Corpus of Historical American English), ending up with \(3 \times 2\) input options for the neural model. Both classifiers are trained on a relatively small labelled corpus. Results show that the neural approach performs systematically and remarkably better than CRF, with a difference ranging from 11 to 14 F-score percentage points depending on the word vectors used (best F-score is \(87.4\%\)). While in-domain supervised training improves the F-score of the Stanford CRF module, it is worth noting that the gain is mainly due to recall, the precision of the English default model remaining higher. In this regard, the neural approach shows a better P/R balance across all settings. With respect to embeddings, linear bag-of-words contexts (GloVe) prove to be more appropriate (at least in this context), with the historical GloVe embeddings yielding the highest scores across all metrics (fastText following immediately after). A detailed examination of results reveals an uneven impact of in-domain embeddings, leading either to higher precision but lower recall (Levy and GloVe) or higher recall but lower precision (fastText and GloVe). Overall, this work shows the positive impact of in-domain training data: the BiLSTM-CRF approach, combined with an in-domain training set and in-domain historical embeddings, systematically outperforms the linear CRF classifier.
In the context of reference mining in the arts and humanities, Rodriguez Alves et al. [170] also investigate the benefit of BiLSTM over traditional CRFs, and of multiple input representations. Their experiments focus on three architectural components: the input layer (word and character-level word embeddings), the prediction layer (softmax or CRF), and the learning setting (multi-task or single-task). The authors consider a domain-specific tagset of 27 entity types covering reference components (e.g., author, title, archive, publisher) and work with 19-21C scholarly books and journals featuring a wide variety of referencing styles and sources. While character-level word embeddings, likely to help with OCR noise and rare words, are learned either via CNN or BiLSTM, word embeddings are based on word2vec and are tested under various settings: present or not, pre-trained on the in-domain raw corpus or randomly initialised, and frozen or fine-tuned on the labelled corpus during training. Among these settings, the one including in-domain word embeddings further fine-tuned during training and a CRF prediction layer yields the best results (\(89.7\%\) F-score). Character embeddings provide a minor yet positive contribution, and are better learned via BiLSTM than via CNN. The BiLSTM outperforms the CRF baseline by a large margin (\(+7\%\)), except for very infrequent tags. Overall, this work confirms the importance of word information (preferably in-domain, though results with generic embeddings were not reported here) and the remarkable capacity of a BiLSTM network to learn features, which are better decoded by a CRF classifier than by a softmax function.
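As an illustration of the CNN variant of character-level word embeddings mentioned above, the following sketch (our reconstruction, not the authors' code; all dimensions are arbitrary choices) embeds each word's characters, applies a 1-D convolution, and max-pools over character positions to obtain a fixed-size word vector:

```python
# CNN-based character-level word embeddings: characters are embedded,
# convolved, and max-pooled into one vector per word.
import torch
import torch.nn as nn

class CharCNNEmbedder(nn.Module):
    def __init__(self, n_chars: int, char_dim: int = 25, out_dim: int = 50):
        super().__init__()
        self.char_emb = nn.Embedding(n_chars, char_dim, padding_idx=0)
        self.conv = nn.Conv1d(char_dim, out_dim, kernel_size=3, padding=1)

    def forward(self, char_ids: torch.Tensor) -> torch.Tensor:
        # char_ids: (batch, word_len), character indices of each word
        x = self.char_emb(char_ids).transpose(1, 2)  # (batch, char_dim, len)
        x = torch.relu(self.conv(x))                 # (batch, out_dim, len)
        return x.max(dim=2).values                   # max-pool over characters

emb = CharCNNEmbedder(n_chars=100)
print(emb(torch.randint(1, 100, (4, 12))).shape)     # torch.Size([4, 50])
```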
Working with Czech historical newspapers, Hubková et al. [101] target the recognition of five generic entity types. The authors experiment with two neural architectures, LSTM and BiLSTM, each followed by a softmax layer. Both are trained on a relatively small labelled corpus (4k entities) and fed with modern fastText embeddings (as released by the fastText library) under three scenarios: randomly initialised, frozen, and fine-tuned. Character-level word embeddings are not used. Results show that the BiLSTM model based on pre-trained embeddings with no further fine-tuning performs best (\(73\%\) F-score). The authors do not comment on the performance degradation resulting from fine-tuning, but one reason might be the small size of the training data.
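The three embedding scenarios can be expressed in a few lines of PyTorch; this is an illustrative reconstruction rather than the authors' code, and `pretrained` stands in for a fastText matrix aligned with the model vocabulary:

```python
# Three input scenarios: random initialisation, frozen pre-trained
# vectors, and pre-trained vectors fine-tuned during training.
import torch
import torch.nn as nn

vocab_size, dim = 20_000, 300
pretrained = torch.randn(vocab_size, dim)  # stand-in for fastText vectors

random_emb = nn.Embedding(vocab_size, dim)                          # random init
frozen_emb = nn.Embedding.from_pretrained(pretrained, freeze=True)  # no updates
tuned_emb = nn.Embedding.from_pretrained(pretrained, freeze=False)  # fine-tuned
```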
Rather than aiming to calibrate a system to a specific historical setting, Riedl and Padó [166] adopt a more generic stance and investigate the possibility of building a German NER system that performs at the state of the art for both contemporary and historical texts. The underlying question of whether one type of model can be optimised to perform well across settings naturally resonates with the needs of cultural heritage institution practitioners (see also Schweter and Baiter [180] and Labusch et al. [118] hereafter). Experimental settings consist of: two sets of German labelled corpora, with large contemporary datasets (CoNLL-03 and GermEval) and small historical ones (from the Friedrich Tessmann Library and the Austrian National Library); two types of classifiers, CRFs (Stanford and GermaNER) and BiLSTM-CRF; and, for the neural system, fastText embeddings derived from generic (Wikipedia) and in-domain (Europeana corpus) data. On this basis, the authors perform three experiments. The first investigates the performance of the two types of systems on the contemporary datasets. On both GermEval and CoNLL-03, the BiLSTM-CRF models outperform the traditional CRF ones, with Wikipedia-based embeddings yielding better results than the Europeana-based ones. It is noteworthy that the GermaNER CRF model performs better than the LSTM of Lample et al. [119] on CoNLL-03, but suffers from low recall compared to BiLSTM. The second experiment focuses on all-corpora crossing, with each system being trained and evaluated on all possible combinations of contemporary and historical corpus pairs. Unsurprisingly, best results are obtained when models are trained and evaluated on the same material. Interestingly, CRFs perform better than BiLSTM in the historical setting (i.e., train and test sets from historical corpora) by quite a margin, suggesting that although not optimised for historical texts, CRFs are more robust than BiLSTMs when faced with small training datasets. The type of embeddings (Wikipedia vs. Europeana) plays a minor role in the BiLSTM performance in the historical setting. Ultimately, the third experiment explores how to overcome this neural-network dependence on large data with domain-adaptation transfer learning: the model is trained on a contemporary corpus until convergence and then further trained on a historical one for a few more epochs. Results show consistent benefits for BiLSTM on historical datasets (ca. +4 F-score percentage points). In general, the main difficulties relate to OCR mistakes and wrongly hyphenated words due to line breaks, and to the Organisation type. Overall, this work shows that BiLSTM and CRF achieve similar performances in a small-data historical setting, but that BiLSTM-CRF outperforms CRF when supplied with enough data or in a transfer learning setting.
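The transfer-learning schedule can be sketched as follows (our approximation in Flair, assuming both corpora share the same tagset; all paths are hypothetical):

```python
# Stage 1: train to convergence on a large contemporary corpus.
# Stage 2: continue training on the small historical corpus for a few
# epochs, with a lower learning rate to avoid erasing stage-1 knowledge.
from flair.datasets import ColumnCorpus
from flair.embeddings import WordEmbeddings
from flair.models import SequenceTagger
from flair.trainers import ModelTrainer

modern = ColumnCorpus("data/conll03_de", {0: "text", 1: "ner"})
historical = ColumnCorpus("data/historical_de", {0: "text", 1: "ner"})

tagger = SequenceTagger(
    hidden_size=256,
    embeddings=WordEmbeddings("de"),
    tag_dictionary=modern.make_label_dictionary(label_type="ner"),
    tag_type="ner",
    use_crf=True,
)

ModelTrainer(tagger, modern).train("models/stage1", max_epochs=100)
ModelTrainer(tagger, historical).train("models/stage2",
                                       max_epochs=5, learning_rate=0.05)
```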
This first set of work confirms the suitability of the state-of-the-art BiLSTM-CRF approach for historical documents, with the major advantage of not requiring feature engineering. Provided that there is enough in-domain training data, this architecture achieves better performance than traditional CRFs (the latter performing on par or better otherwise). In-domain pre-training of static word embeddings seems to contribute positively, although to varying degrees depending on the experimental settings and embedding types. Sub-word information (either character embeddings or character-based word embeddings) also appears to have a positive effect.
6.3.3 Approaches based on Character-level LM Embeddings.
The approaches described above rely on static, token-level word representations, which fail to capture context information. This drawback can be overcome by context-dependent representations derived from the task of language modelling, computed either over characters, such as the Flair contextual string embeddings [5], or over words, such as BERT [48] and ELMo [155] (see Appendix A.3.2). Such representations have boosted the performance of modern NER and are also used in the context of historical texts. This section considers work based on character-based contextualised embeddings (Flair).
In the context of the CLEF-HIPE-2020 shared task [60], Dekhili and Sadat [46] proposed different variations of a BiLSTM-CRF network, with and without the in-domain HIPE flair embeddings and/or an attention layer. The gains of adding one, the other, or both are not easy to interpret, with uneven performances of the model variants across NE types. Their overall F-scores range from \(62\%\) to \(65\%\) under the strict evaluation regime. For some entity types the CRF baseline is better than the neural models, and the benefit of in-domain embeddings is overall more evident than that of the attention layer (which proved more useful in handling metonymic entities).
Kew et al. [110] address the recognition of toponyms in an alpine heritage corpus consisting of over 150 years of mountaineering articles in five languages (mainly from the Swiss and British Alpine Clubs). Focusing on fine-grained entity types (city, mountain, glacier, valley, lake, and cabin), the authors compare three approaches. The first is a traditional gazetteer-based approach complemented with a few heuristics, which achieves high precision across types (\(88\%\) P, \(73\%\) F-score), and even very high precision (\(\gt 95\%\)) for infrequent categories with regular patterns. Suitable for reliable location-based search but suffering from low recall, this approach is then compared with a BiLSTM-CRF architecture. The neural system is fed with stacked embeddings composed of in-domain contextual string embeddings pre-trained on the alpine corpus concatenated with general-purpose fastText word embeddings pre-trained on web data, and is trained on a silver training set containing 28k annotations obtained via the application of the gazetteer-based approach. The model leads to an increase of recall for the most frequent categories without degrading precision (\(76\%\) F-score). This shows the generalisation capacity of the neural approach in combination with context-sensitive string embeddings and given sufficient training data.
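In Flair terms, such a stacked input representation could look as follows (a sketch; the paths to the in-domain language models are hypothetical):

```python
# Stacked embeddings: in-domain contextual string embeddings concatenated
# with general-purpose fastText word embeddings.
from flair.embeddings import FlairEmbeddings, StackedEmbeddings, WordEmbeddings

stacked = StackedEmbeddings([
    FlairEmbeddings("models/alpine-forward.pt"),   # in-domain, forward LM
    FlairEmbeddings("models/alpine-backward.pt"),  # in-domain, backward LM
    WordEmbeddings("de"),                          # generic German fastText
])
# `stacked` can then be passed as `embeddings=` to a SequenceTagger.
```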
Swaileh et al. [194] target even more specific entity types in French and German financial yearbooks from the first half of the 20C. They apply a BiLSTM-CRF network trained on custom data and fed with modern flair embeddings. Results are very good (between \(85\%\) and \(95\%\) F-score depending on the book sections), with the CRF baseline and the BiLSTM model performing on par for French books, and BiLSTM outperforming CRF for the German one, which has lower OCR quality. Overall, these performances can be explained by the regularity of the structure and language as well as the quality of the material considered, resulting in stable contexts and non-noisy entities.
6.3.4 Approaches based on Word-level LM Embeddings.
The release of pre-trained contextualised language-model-based word embeddings such as BERT (based on transformers) and ELMo (based on LSTMs) pushed further the upper bound of modern NER performances. They show promising results either as replacements for or in combination with other embedding types, and offer the possibility of being further fine-tuned [125]. While they are becoming a new paradigm for modern NER, the same seems to be true for historical NER.
Using pre-trained modern embeddings. We first consider work based on pre-trained modern LM-based word embeddings (BERT or ELMo) without extensive comparison experiments. These works make use of BiLSTM or transformer architectures.
Working on the “Chinese Twenty-Four Histories”, a set of Chinese official history books covering a period from 3000 BCE to the 17C, Yu and Wang [212] face the problems of the complexity of classical Chinese and of the absence of appropriate training data in their attempt to recognise Person and Location entities. Their BiLSTM-CRF model is trained on an NE-annotated modern Chinese corpus and makes use of modern Chinese BERT embeddings in a feature-extraction setting (frozen). Evaluated on a (small) dataset representative of the time span of the target corpus, the model achieves relatively good performances (from \(72\%\) to \(82\%\) F-score depending on the book), with a fairly good P/R balance, better results for Location than for Person, and better results on the more recent books. Given the completely ‘modern’ setting of embeddings and labelled training data, these results show the benefit of large LM-based embeddings, keeping in mind the small size of the test set and perhaps the regularity of entity occurrences in the material, which is not detailed in the paper.
Also based on the bare usage of state-of-the-art LM-based representations is a set of work from the HIPE-2020 evaluation campaign. These works tackle the recognition of five entity types in about 200 years of historical newspapers in French, English, and German. The task included various NER settings; however, only coarse literal NE recognition is considered here. Ortiz Suárez et al. [148] focused on French and German. They first pre-process the newspaper line-based format (or column segments) into sentence-split segments before training a BiLSTM-CRF model using a combination of modern static fastText and contextualised ELMo embeddings as input representations. They favoured ELMo over BERT because of its capacity to handle long sequences and its dynamic vocabulary thanks to its CNN character embedding layer. In-domain fastText embeddings provided by the organisers were tested but performed worse. Their models ranked third on both languages during the shared task, with strict F-scores of \(79\%\) and \(65\%\) for French and German, respectively. The considerably lower performance of their improved CRF baseline illustrates the advantage of neural models based on contextual embeddings. Ablation experiments on sentence splitting showed an improvement of 3.5 F-score percentage points on French (except for Location), confirming the importance of proper context for neural NER.
Running for French and English, Kristanti and Romary [114] also make use of a BiLSTM-CRF relying on modern fastText and ELMo embeddings. In the absence of a training set for English, the authors use the CoNLL-2012 corpus, while for French the training data is further augmented with another NE-annotated journalistic corpus from 1990, which proved to have a positive impact. They scored \(63\%\) and \(52\%\) in terms of strict F-score for French and English, respectively. Compared to the French results of Ortiz Suárez et al., Kristanti and Romary use the same French embeddings but a different implementation framework and different hyper-parameters, and do not apply sentence segmentation.
Finally, still within the HIPE-2020 context, two teams tested pre-trained LM embeddings with transformer-based architectures. Provatorova et al. [163] proposed an approach based on the fine-tuning of BERT models using Huggingface’s transformers library for the three shared task languages, using the cased multilingual BERT base model for French and German and the cased monolingual BERT base model for English. They used the CoNLL-03 data for training their English model and the HIPE data for the others, and additionally set up a majority-vote ensemble of five fine-tuned model instances per language in order to improve the robustness of the approach. Their models achieved F-scores of \(68\%\), \(52\%\), and \(47\%\) for French, German, and English, respectively.
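A token-level majority vote over model instances can be sketched as follows (our illustration; the `models` list and their `predict` interface are assumptions, not the authors' API):

```python
# Majority-vote ensembling: for each token, the label predicted by most
# of the fine-tuned model instances wins.
from collections import Counter

def ensemble_predict(models, tokens):
    """Return one label per token, chosen by majority vote."""
    # Each (hypothetical) model returns a list of labels, one per token.
    all_predictions = [model.predict(tokens) for model in models]
    voted = []
    for token_labels in zip(*all_predictions):
        voted.append(Counter(token_labels).most_common(1)[0][0])
    return voted
```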
Ghannay et al. [82] used CamemBERT, a multi-layer bidirectional transformer similar to RoBERTa [128, 134], initialised with a pre-trained modern French CamemBERT model and completed with a CRF tag decoder. This model obtained the second-best results for French, with \(81\%\) strict F-score.
Even when learned from modern data, pre-trained LM-based word embeddings encode rich prior knowledge that effectively supports neural models trained on (usually) small historical training sets. As for HIPE-related systems, it should be noted that word-level LM embeddings systematically lead to slightly higher recall than precision, demonstrating their powerful generalisation capacities, even on noisy texts.
Using modern and historical pre-trained embeddings. As for static embeddings, it is logical to expect higher performances from LM embeddings when pre-trained on historical data, whether or not in combination with modern ones. The set of work reviewed here explores this perspective.
Ahmed et al. [4] work on the recognition of universal and domain-specific entities in German historical biodiversity literature. They experiment with two BiLSTM-CRF implementations (their own and the Flair framework), which both use modern token-level German word embeddings and are trained on the BIOfid corpus. Experiments consist in adding richer representations (modern Flair embeddings, additionally complemented by newly trained ELMo embeddings or BERT base multilingual cased embeddings) or adding more task-specific training data (GermEval, CoNLL-03, and BIOfid). Models perform more or less equally, and the authors explain the low gain of in-domain ELMo embeddings by the small size of the training data (100k sentences). Higher gains come with larger labelled data; however, the absence of ablation tests hinders a complete understanding of the contribution of the historical part of this labelled data, and the use of two implementation frameworks does not warrant full comparability of results.
Both Schweter and Baiter [180] and Labusch et al. [118] build on the work of Riedl and Padó [166] and try to improve NER performance on the same historical German evaluation datasets, thereby constituting (with HIPE-2020) one of the few sets of comparable experiments. Schweter and Baiter seek to offset the lack of training data by using only unlabelled data via pre-trained embeddings and language models. They use the Flair framework to train and combine (“stack”) their language models, and to train a BiLSTM-CRF model. Their first experiment consists in testing various static word representations: character embeddings learned during training, fastText embeddings pre-trained on Wikipedia or Common Crawl (with no sub-word information), and the combination of all of these. While Riedl and Padó experimented with similar settings (character embeddings and pre-trained modern and historical fastText embeddings), it appears that combining Wikipedia and Common Crawl embeddings leads to better performances, even higher than the transfer learning setting of Riedl and Padó using more labelled data. As a second experiment, Schweter and Baiter use pre-trained LM embeddings: flair embeddings newly trained on two historical corpora with temporal overlap with the test data, and two modern pre-trained BERT models (multilingual and German). On both historical test sets, in-domain LMs yield the best results (outperforming those of Riedl and Padó), all the more so when the temporal overlap between the embedding and task-specific training data is large. This demonstrates that the selection of the language model corpus plays an important role, and that unlabelled data close in time might have more impact than more (and difficult to obtain) labelled data.
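Training such in-domain contextual string embeddings is supported by the Flair framework; a minimal sketch, with a hypothetical corpus path and illustrative hyper-parameters, could look like this:

```python
# Train a forward character-level language model on a historical corpus
# (the corpus folder is assumed to contain train/valid/test splits).
from flair.data import Dictionary
from flair.models import LanguageModel
from flair.trainers.language_model_trainer import LanguageModelTrainer, TextCorpus

dictionary = Dictionary.load("chars")  # Flair's default character dictionary
corpus = TextCorpus("data/historical_german", dictionary,
                    forward=True, character_level=True)

language_model = LanguageModel(dictionary, is_forward_lm=True,
                               hidden_size=1024, nlayers=1)

trainer = LanguageModelTrainer(language_model, corpus)
trainer.train("models/historical-forward", sequence_length=250,
              mini_batch_size=100, max_epochs=10)
```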
With the objective of developing a versatile approach that performs decently on texts of different epochs without intensive adaptation, Labusch et al. [118] experiment with BERT under different pre-training and fine-tuning settings. In a nutshell, they apply a model based on multilingual BERT embeddings, which is further pre-trained on large OCRed historical German unlabelled data (the Digital Collection of the Berlin State Library) and subsequently fine-tuned on several NE-labelled datasets (CoNLL-03, GermEval, and the German part of the Europeana NER corpora). Tested across different contemporary/historical dataset pairs (similar to the all-corpora crossing of Riedl and Padó [166]), additional in-domain pre-training proves most of the time beneficial for historical pairs, while performances worsen on contemporary ones. The combination of several task-specific training datasets has a positive yet less important impact than BERT pre-training, as already observed by Schweter and Baiter [180]. Overall, this work shows that an appropriately pre-trained BERT model delivers decent recognition performance in a variety of settings. In order to improve it further, the authors propose to use the BERT large instead of the BERT base model, to build more historical labelled training data, and to improve the OCR quality of the collections.
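The two-step recipe (domain-adaptive masked-language-model pre-training followed by NER fine-tuning) can be outlined with the Huggingface transformers library; this is a conceptual sketch, not the authors' code, and `historical_dataset` stands in for a tokenised unlabelled corpus:

```python
# Step 1: continue MLM pre-training of multilingual BERT on unlabelled
# OCRed historical text; Step 2: fine-tune the adapted encoder for NER.
from transformers import (AutoModelForMaskedLM, AutoModelForTokenClassification,
                          AutoTokenizer, DataCollatorForLanguageModeling,
                          Trainer, TrainingArguments)

tokenizer = AutoTokenizer.from_pretrained("bert-base-multilingual-cased")

mlm_model = AutoModelForMaskedLM.from_pretrained("bert-base-multilingual-cased")
collator = DataCollatorForLanguageModeling(tokenizer, mlm_probability=0.15)
Trainer(
    model=mlm_model,
    args=TrainingArguments(output_dir="bert-historical", num_train_epochs=3),
    train_dataset=historical_dataset,  # tokenised unlabelled corpus (assumed)
    data_collator=collator,
).train()
mlm_model.save_pretrained("bert-historical")

# Fine-tune the adapted checkpoint on NE-labelled data.
ner_model = AutoModelForTokenClassification.from_pretrained(
    "bert-historical", num_labels=9  # e.g., BIO tags over four entity types
)
```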
The same spirit of combinatorial optimisation drove the work of Todorov and Colavizza [200] and Schweter and März [181] in the context of HIPE-2020. Todorov and Colavizza build on the bidirectional LSTM-CRF architecture of Lample et al. [119] and introduce a multi-task approach by splitting the top layers of the model (i.e., the final layers are specific to each entity type). Their general embedding layer combines a multitude of embeddings at the character, sub-word, and word levels, some newly trained by the authors, as well as pre-trained BERT and HIPE’s in-domain fastText embeddings. They also vary the segmentation of the input: line segmentation, document segmentation, as well as sub-document segmentation for long documents. No additional NER training material was used for German and French, while for English the Groningen Meaning Bank was adapted for training. Results suggest that splitting the top layers for each entity type is not beneficial. However, the addition of various embeddings improves the performance, as shown in the very detailed ablation test report. In this regard, character-level and BERT embeddings are particularly important, while in-domain embeddings contribute mainly to recall. Fine-tuning pre-trained embeddings did not prove beneficial. Using (sub-)document segmentation clearly improved results when compared to the line segmentation found in newspapers, emphasising once again the importance of context. Post-campaign F-scores for coarse literal NER are \(75\%\) and \(66\%\) for French and German, respectively (strict setting). English experiments yielded poor results, most likely due to the temporal and linguistic gaps between training and test data, and to the rather poor OCR quality of the material (as for Provatorova et al. [163] and Kristanti and Romary [114]).
For their part, Schweter and März [181] focused on German and experimented with ensembling different word and sub-word embeddings (modern fastText as well as historical self-trained and HIPE flair embeddings), as well as transformer-based language models (trained on modern and historical data), all integrated via the Flair NER tagging framework [5]. They used a state-of-the-art BiLSTM with a CRF layer on top, as proposed by [100], and performed sentence splitting and hyphen normalisation as pre-processing. To identify the optimal combination of embeddings and LMs, the authors first selected the best embeddings of each type before combining them. Using richer representations (fastText < flair < BERT) leads to better results each time. Among the options, Wikipedia fastText embeddings proved better than the Common Crawl ones, suggesting that similar data (news) is more beneficial than larger data for static representations; HIPE flair embeddings proved better than other historical ones, likely because of their larger training data size and data proximity; and the BERT LM trained on large data proved better than the one trained on historical (smaller) data. The best final combination includes fastText and BERT, leading to a \(65\%\) F-score on coarse literal NER (strict setting).
Finally, Boros et al. [28] also tackled NER tagging for HIPE-2020 in all languages and achieved the best results. They used a hierarchical transformer-based model [205] built upon BERT in a multi-task learning setting. On top of the pre-trained BERT blocks (multilingual BERT for all languages, additionally Europeana BERT for German and CamemBERT for French [134]), two task-specific transformer layers were optionally added to alleviate data sparsity issues, for instance out-of-vocabulary words, spelling variations, or OCR errors in the HIPE dataset. A state-of-the-art CRF layer was added on top in order to model the context dependencies between entity tags. For base BERT, with a limited context of 512 sub-tokens, documents are too long and newspaper lines are too short for proper contextualisation. Therefore, an important pre-processing step consisted in the reconstruction of hyphenated words and in sentence segmentation. For the two languages with in-domain training data (French and German), their best run consisted in BERT fine-tuning, complemented with the two stacked transformer blocks and the CRF layer. For English, without in-domain training data, two options for fine-tuning were tested: (a) training on monolingual CoNLL-03 data, and (b) transfer learning by training on the French and German HIPE data. Both options worked better without the additional transformer layers, and training on the French and German HIPE data led to better results. Final F-scores for coarse literal NER were \(84\%\), \(79\%\), and \(63\%\) for French, German, and English, respectively (strict setting).
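Such pre-processing can be approximated in a few lines; the sketch below (ours) uses deliberately naive regexes for both steps, whereas real newspaper OCR would need more care:

```python
# De-hyphenation and sentence segmentation for line-broken OCR text.
import re

def dehyphenate(text: str) -> str:
    # "Zei-\ntung" -> "Zeitung": join a hyphen at end-of-line with the
    # word fragment starting the next line.
    return re.sub(r"(\w+)-\n(\w+)", r"\1\2", text)

def split_sentences(text: str) -> list[str]:
    # Naive segmentation on terminal punctuation; a proper segmenter
    # (e.g., spaCy or NLTK punkt) would be used in practice.
    text = re.sub(r"\n", " ", text)
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]

print(split_sentences(dehyphenate("Die Zei-\ntung erschien in Bern. Sie war alt.")))
# ['Die Zeitung erschien in Bern.', 'Sie war alt.']
```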
Conclusion on deep learning approaches. What conclusions can be drawn from all this? First, the 20 or so papers reviewed above illustrate the growing interest of researchers and practitioners from different fields in the application of deep learning approaches to NER on historical collections. Second, it is obvious that these many publications also come with a great diversity in terms of document, system, and task settings. Apart from the historical German [118, 166, 180] and HIPE papers, most publications use different datasets and evaluation settings, which prevents result comparison; what is more, the sensitivity of DL approaches to experimental settings (pre-processing, embeddings, hyper-parameters, hardware) usually undermines any attempt to compare or reproduce experiments, and often leads to misconceptions about what works and what does not [211]. As shown in the DL literature review above, what is reported can sometimes be contradictory. However, and with this in mind, a few conclusions can be drawn:
– State-of-the-art BiLSTM architectures achieve very good performances and largely outperform traditional CRFs, except in small-data contexts and on very regular entities. As an inference layer, CRF is a better choice than softmax (also confirmed by Yang et al. [211]). Yet, in the fast-changing DL landscape, transformer-based networks are already taking over from BiLSTMs.
– Character and sub-word information is beneficial and helps to deal with OOV words, presumably historical spelling variations and OCR errors. CNN appears to be a better option than LSTM to learn character embeddings.
– As for word representations, the richer the better. The same neural architecture performs better with character- or word-based contextualised embeddings than with static ones, and even better with stacked embeddings. The combination of flair or fastText embeddings plus a BERT language model seems to provide an appropriate mix of morphological and lexical information. Contextualised representations also have a positive impact in low-resource settings.
– Pre-trained modern embeddings prove to transfer reasonably well to historical texts, even more so when learned on very large textual data. As expected, in-domain embeddings contribute positively to performances most of the time, and the temporal proximity between the corpora from which embeddings are derived and the targeted historical material seems to play an important role. Although a combination of generic and historical prior knowledge is likely to increase performance, whether very large modern or in-domain LMs work best remains an open question.
– Careful pre-processing of the input text (word de-hyphenation, sentence segmentation) in order to work with valid linguistic units appears to be a key factor.
Ultimately, apart from clearly outperforming traditional ML and rule-based systems, the most compelling aspect of DL approaches is certainly their transferability; while much still needs to be investigated, the possibility of having systems that perform (relatively) well across historical settings, or a subset thereof, seems to be an achievable goal.