research-article

IsiXhosa Named Entity Recognition Resources

Published: 27 December 2022

Abstract

Named entity recognition has been one of the most widely researched natural language processing technologies over the past two decades. For the South African languages, however, relatively little research and development work has been done. This changed with the release of the NCHLT named entity annotated resources, a collection of named entity annotated data and Conditional Random Field-based named entity recognisers for ten of the official languages.
In this work, we provide a detailed description and linguistic analysis of the named entity (NE) annotated data for isiXhosa, an agglutinative language, by analysing the morphosyntactic features relevant to the three main NE types: person, location, and organisation. From the data, we identify suffix and capitalisation features that may be good predictors of the different NE types. Based on these features, we describe the named entity recogniser and feature set developed as part of the NCHLT release. The recogniser achieves high overall precision (0.9713) but relatively low recall (0.7409), particularly for person names (0.5963), resulting in an overall F-score of 0.8406. Although there are various avenues for improving the named entity recogniser, this is a significant release for a historically under-resourced language.
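As a quick consistency check on the figures above, the sketch below recomputes the overall F-score from the reported precision and recall. It assumes the reported F-score is the balanced F1 measure (the harmonic mean of precision and recall); the abstract does not state this explicitly. The function name `f_score` and its `beta` parameter are illustrative only and are not taken from the NCHLT release.

```python
def f_score(precision: float, recall: float, beta: float = 1.0) -> float:
    """F-beta score: weighted harmonic mean of precision and recall.

    beta = 1.0 gives the balanced F1 measure (assumed here, not confirmed
    by the abstract).
    """
    if precision == 0.0 and recall == 0.0:
        return 0.0  # avoid division by zero when both scores are zero
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)


# Overall precision and recall as reported in the abstract.
overall = f_score(0.9713, 0.7409)
print(round(overall, 4))  # 0.8406
```

Under that assumption the recomputed value agrees with the reported 0.8406 to four decimal places. The gap between precision and recall also shows how the harmonic mean is pulled towards the lower of the two scores.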



Published In

ACM Transactions on Asian and Low-Resource Language Information Processing, Volume 22, Issue 2
February 2023
624 pages
ISSN:2375-4699
EISSN:2375-4702
DOI:10.1145/3572719

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 27 December 2022
Online AM: 02 June 2022
Accepted: 12 April 2022
Revision received: 25 March 2022
Received: 07 September 2021
Published in TALLIP Volume 22, Issue 2


Author Tags

  1. Named entity recognition
  2. isiXhosa
  3. natural language processing

Qualifiers

  • Research-article
  • Refereed

Funding Sources

  • National Centre for Human Language Technology

Article Metrics

  • Total Citations: 0
  • Total Downloads: 289
  • Downloads (Last 12 months): 92
  • Downloads (Last 6 weeks): 10

Reflects downloads up to 29 Jan 2025
