Overview of CLEF HIPE 2020: Named Entity Recognition and Linking on Historical Newspapers

Authors:

Matteo Romanello,

Alex Flückiger,

Simon Clematide

Experimental IR Meets Multilinguality, Multimodality, and Interaction: 11th International Conference of the CLEF Association, CLEF 2020, Thessaloniki, Greece, September 22–25, 2020, Proceedings

Pages 288 - 310

https://doi.org/10.1007/978-3-030-58219-7_21

Published: 22 September 2020

Abstract

This paper presents an overview of the first edition of HIPE (Identifying Historical People, Places and other Entities), a pioneering shared task dedicated to the evaluation of named entity processing on historical newspapers in French, German and English. Since its introduction some twenty years ago, named entity (NE) processing has become an essential component of virtually any text mining application and has undergone major changes. Recently, two main trends characterise its developments: the adoption of deep learning architectures and the consideration of textual material originating from historical and cultural heritage collections. While the former opens up new opportunities, the latter introduces new challenges with heterogeneous, historical and noisy inputs. In this context, the objective of HIPE, run as part of the CLEF 2020 conference, is threefold: strengthening the robustness of existing approaches on non-standard inputs, enabling performance comparison of NE processing on historical texts, and, in the long run, fostering efficient semantic indexing of historical documents. Tasks, corpora, and results of 13 participating teams are presented.

Published In

Experimental IR Meets Multilinguality, Multimodality, and Interaction: 11th International Conference of the CLEF Association, CLEF 2020, Thessaloniki, Greece, September 22–25, 2020, Proceedings

Sep 2020

408 pages

ISBN:978-3-030-58218-0

DOI:10.1007/978-3-030-58219-7

Editors:
Avi Arampatzis, Department of Electrical and Computer Engineering, Democritus University of Thrace, Xanthi, Greece
Evangelos Kanoulas, University of Amsterdam, Amsterdam, The Netherlands
Theodora Tsikrika, Information Technologies Institute, Centre for Research and Technology Hellas, Thessaloniki, Greece
Stefanos Vrochidis, Information Technologies Institute, Centre for Research and Technology Hellas, Thessaloniki, Greece
Hideo Joho, Faculty of Library, Information and Media Science, University of Tsukuba, Ibaraki, Japan
Christina Lioma, Department of Computer Science, University of Copenhagen, Copenhagen, Denmark
Carsten Eickhoff, Brown University, Providence, RI, USA
Aurélie Névéol, LIMSI-CNRS, Orsay, France
Linda Cappellato, Department of Information Engineering, University of Padova, Padua, Italy
Nicola Ferro, Department of Information Engineering, University of Padova, Padua, Italy

© Springer Nature Switzerland AG 2020.

Publisher

Springer-Verlag

Berlin, Heidelberg

Cited By

Piryani B, Mozafari J, Jatowt A, Hui Yang G, Wang H, Han S, Hauff C, Zuccon G, Zhang Y (2024). ChroniclingAmericaQA: A Large-scale Question Answering Dataset based on Historical American Newspaper Pages. Proceedings of the 47th International ACM SIGIR Conference on Research and Development in Information Retrieval. DOI: 10.1145/3626772.3657891, pp. 2038-2048. Online publication date: 10-Jul-2024
https://dl.acm.org/doi/10.1145/3626772.3657891

Sun W, Tran H, González-Gallardo C, Coustaty M, Doucet A (2024). LIAS: Layout Information-Based Article Separation in Historical Newspapers. Linking Theory and Practice of Digital Libraries. DOI: 10.1007/978-3-031-72437-4_15, pp. 256-272. Online publication date: 24-Sep-2024
https://dl.acm.org/doi/10.1007/978-3-031-72437-4_15

Avram A, Iuga A, Manolache G, Matei V, Micliuş R, Muntean V, Sorlescu M, Şerban D, Urse A, Păiş V, Cercel D (2024). HistNERo: Historical Named Entity Recognition for the Romanian Language. Document Analysis and Recognition - ICDAR 2024. DOI: 10.1007/978-3-031-70543-4_8, pp. 126-144. Online publication date: 30-Aug-2024
https://dl.acm.org/doi/10.1007/978-3-031-70543-4_8

González-Gallardo C, Boros E, Giamphy E, Hamdi A, Moreno J, Doucet A (2023). Injecting Temporal-Aware Knowledge in Historical Named Entity Recognition. Advances in Information Retrieval. DOI: 10.1007/978-3-031-28244-7_24, pp. 377-393. Online publication date: 2-Apr-2023
https://dl.acm.org/doi/10.1007/978-3-031-28244-7_24

Sevgili Ö, Shelmanov A, Arkhipov M, Panchenko A, Biemann C, Alam M, Buscaldi D, Cochez M, Osborne F, Refogiato Recupero D, Sack H (2022). Neural entity linking. Semantic Web 13:3. DOI: 10.3233/SW-222986, pp. 527-570. Online publication date: 1-Jan-2022
https://dl.acm.org/doi/10.3233/SW-222986

Boros E, Cabrera-Diego L, Doucet A (2022). Experimenting with Unsupervised Multilingual Event Detection in Historical Newspapers. From Born-Physical to Born-Virtual: Augmenting Intelligence in Digital Libraries. DOI: 10.1007/978-3-031-21756-2_15, pp. 182-193. Online publication date: 30-Nov-2022
https://dl.acm.org/doi/10.1007/978-3-031-21756-2_15

Ehrmann M, Romanello M, Najem-Meyer S, Doucet A, Clematide S (2022). Overview of HIPE-2022: Named Entity Recognition and Linking in Multilingual Historical Documents. Experimental IR Meets Multilinguality, Multimodality, and Interaction. DOI: 10.1007/978-3-031-13643-6_26, pp. 423-446. Online publication date: 5-Sep-2022
https://dl.acm.org/doi/10.1007/978-3-031-13643-6_26

Monroc C, Miret B, Bonhomme M, Kermorvant C (2022). A Comprehensive Study of Open-Source Libraries for Named Entity Recognition on Handwritten Historical Documents. Document Analysis Systems. DOI: 10.1007/978-3-031-06555-2_29, pp. 429-444. Online publication date: 22-May-2022
https://dl.acm.org/doi/10.1007/978-3-031-06555-2_29

Tüselmann O, Fink G (2022). Named Entity Linking on Handwritten Document Images. Document Analysis Systems. DOI: 10.1007/978-3-031-06555-2_14, pp. 199-213. Online publication date: 22-May-2022
https://dl.acm.org/doi/10.1007/978-3-031-06555-2_14

Koudoro-Parfait C, Lejeune G, Roe G (2021). Spatial Named Entity Recognition in Literary Texts. Proceedings of the 5th ACM SIGSPATIAL International Workshop on Geospatial Humanities. DOI: 10.1145/3486187.3490206, pp. 13-21. Online publication date: 2-Nov-2021
https://dl.acm.org/doi/10.1145/3486187.3490206
