research-article

IsiXhosa Named Entity Recognition Resources

Authors:

Andiswa Bukula

ACM Transactions on Asian and Low-Resource Language Information Processing, Volume 22, Issue 2

Article No.: 35, Pages 1 - 19

https://doi.org/10.1145/3531478

Published: 27 December 2022

Abstract

Named entity recognition has been one of the most widely researched natural language processing technologies over the past two decades. For the South African languages, however, relatively little research and development work has been done. This changed with the release of the NCHLT named entity annotated resources, a collection of named entity annotated data and Conditional Random Field-based named entity recognisers for ten of the official languages.

In this work, we provide a detailed description and linguistic analysis of the named entity (NE) annotated data for the agglutinative isiXhosa language, by analysing the morphosyntactic features relevant to the three main types of NE, viz. person, location, and organisation. From the data, we identify suffix and capitalisation features that may be good predictors of the different NE types. Based on these features, we describe the named entity recogniser and feature set developed as part of the NCHLT release. The recogniser has high precision, 0.9713 overall, but relatively low recall, 0.7409, especially for person names, 0.5963, resulting in an overall F-score of 0.8406. Although there are various avenues to improve the named entity recogniser, this is a significant release for a historically under-resourced language.
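
The reported overall scores can be checked against each other. The sketch below assumes that the F-score in the abstract is the balanced F1 measure, i.e. the harmonic mean of precision and recall; the abstract does not state this, and the function name `f_score` is only illustrative.

```python
def f_score(precision: float, recall: float, beta: float = 1.0) -> float:
    """Weighted harmonic mean of precision and recall (F1 when beta is 1)."""
    if precision == 0.0 and recall == 0.0:
        # Avoid division by zero when nothing was predicted correctly.
        return 0.0
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)


# Overall precision and recall as reported in the abstract.
overall_precision = 0.9713
overall_recall = 0.7409

# Assumption: the reported overall F-score is balanced F1.
overall_f = f_score(overall_precision, overall_recall)
print(round(overall_f, 4))  # 0.8406
```

Under that assumption the result agrees with the reported overall F-score of 0.8406. Because the harmonic mean sits closer to the lower of its two inputs, the low recall pulls the overall score down more than the high precision lifts it.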

Index Terms

IsiXhosa Named Entity Recognition Resources
1. Information systems
  1. Information retrieval
    1. Retrieval tasks and goals
      1. Information extraction

Published In

ACM Transactions on Asian and Low-Resource Language Information Processing Volume 22, Issue 2

February 2023

624 pages

ISSN:2375-4699

EISSN:2375-4702

DOI:10.1145/3572719

Editor:
Imed Zitouni
Google, USA

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org.

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 27 December 2022

Online AM: 02 June 2022

Accepted: 12 April 2022

Revision received: 25 March 2022

Received: 07 September 2021

Published in TALLIP Volume 22, Issue 2

Qualifiers

Research-article
Refereed

Funding Sources

National Centre for Human Language Technology
