Boosting Neural POS Tagger for Farsi Using Morphological Information

Published: 22 July 2016

Abstract

Farsi (Persian) is a low-resource language that suffers from data sparsity and a lack of efficient processing tools. Because part-of-speech (POS) taggers are used across a broad range of natural language processing tasks, they are among the most important of these tools. Despite recent work on Farsi tagging, there is still room for improvement: the best reported accuracy so far is 96%, which in special cases can rise to 96.9%. The main problem with existing taggers is their inefficiency in coping with out-of-vocabulary (OOV) words. To address both accuracy and OOV words, we developed a neural network-based POS tagger (NPT) that performs efficiently on Farsi. Despite using less data, NPT outperforms state-of-the-art systems. Our proposed tagger achieves an accuracy of 97.4%, with performance highly influenced by morphological features. We carry out a shallow morphological analysis and show considerable improvement over the baseline configuration.

Published In

ACM Transactions on Asian and Low-Resource Language Information Processing, Volume 16, Issue 1
TALLIP Notes and Regular Papers
March 2017
133 pages
ISSN:2375-4699
EISSN:2375-4702
DOI:10.1145/2961867

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 22 July 2016
Accepted: 01 April 2016
Revised: 01 March 2016
Received: 01 January 2016
Published in TALLIP Volume 16, Issue 1

Author Tags

  1. Farsi
  2. POS tagging
  3. morphological analysis

Qualifiers

  • Note
  • Research
  • Refereed

Funding Sources

  • CNGL Programme
  • Science Foundation Ireland
  • ADAPT Centre at Dublin City University
