Acquisition of Morphology of an Indic Language from Text Corpus

Published: 01 June 2008

Abstract

This article describes an approach to unsupervised learning of morphology from an unannotated corpus for Assamese, a highly inflectional Indo-European language spoken by about 30 million people. Although Assamese is one of India's national languages, it utterly lacks computational linguistic resources. No prior computational work exists on this language, which is widely spoken in northeast India. The work presented is pioneering in this respect. In this article, we discuss salient issues in Assamese morphology, where the large number of suffixal determiners, sandhi, samas, and the propensity to use suffix sequences mean that approximately 50% of the words used in written and spoken text are inflected. We implement methods proposed by Gaussier and Goldsmith for the acquisition of morphological knowledge, and obtain F-measure performance below 60%. This motivates us to present a method better suited to handling suffix sequences, which increases the F-measure performance of morphology acquisition to almost 70%. We describe how we build a morphological dictionary for Assamese from the text corpus. Using the acquired morphological knowledge and the morphological dictionary, we are able to process small chunks of data at a time as well as a large corpus. We achieve approximately 85% precision and recall during the analysis of small chunks of coherent text.
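The abstract reports results as precision, recall, and F-measure without saying which F-measure variant is used. As a minimal sketch, assuming the common balanced form (the harmonic mean of precision and recall), the relationship between the three figures can be illustrated as follows. The function name `f_measure` and the `beta` parameter are illustrative choices, not taken from the article.

```python
def f_measure(precision: float, recall: float, beta: float = 1.0) -> float:
    """Weighted harmonic mean of precision and recall (F_beta).

    beta = 1.0 gives the balanced F-measure; this is an assumption,
    since the abstract does not state which variant the authors use.
    """
    if precision <= 0.0 or recall <= 0.0:
        return 0.0
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)


# When precision and recall are equal, the balanced F-measure equals
# that shared value, e.g. roughly 0.85 for the small-chunk analysis.
print(round(f_measure(0.85, 0.85), 2))
```

Because the harmonic mean is pulled toward the smaller of its two inputs, an F-measure below 60% implies that at least one of precision and recall is below 60%.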

References

[1] Bora, S. 1968. bahal byaakaran. Jnananath Bora, Guwahati.
[2] Borer, H. 1998. Morphology and syntax. In The Handbook of Morphology, Spencer, A. and Zwicky, A. M. eds., 151--190, Blackwell Publishers Ltd.
[3] Chen, T. Y., Kuo, F.-C., and Merkel, R. 2004. On the statistical properties of the F-measure. In Proceedings of the 4th International Conference on Quality Software (QSIC’04). IEEE Press, Los Alamitos, CA, 146--153.
[4] Creutz, M. 2003. Unsupervised segmentation of words using prior distributions of morph length and frequency. In Proceedings of the 41st Annual Meeting of the Association of Computational Linguistics (ACL’03), 280--287.
[5] Creutz, M. and Lagus, K. 2004. Induction of a simple morphology for highly-inflecting languages. In Proceedings of the 7th Meeting of the ACL Special Interest Group in Computational Phonology (SIGPHON’04), 43--51.
[6] Creutz, M. and Lagus, K. 2005. Inducing the morphological lexicon of a natural language from unannotated text. In Proceedings of the International and Interdisciplinary Conference on Adaptive Knowledge Representation and Reasoning (AKRR’05), 106--113.
[7] Daelemans, W. 1993. Memory-based lexical acquisition and processing. In Proceedings of the Third International EAMT Workshop on Machine Translation and the Lexicon (EAMT’93), 85--98.
[8] Gaussier, E. 1999. Unsupervised learning of derivational morphology from inflectional lexicons. In Proceedings of the Unsupervised Learning in Natural Language Processing Workshop (ACL’99). ACL, 24--30.
[9] Gasser, M. 1994. Acquiring receptive morphology: A connectionist approach. In Proceedings of the 32nd Annual Meeting on Association for Computational Linguistics (ACL’94), 279--286.
[10] Goldsmith, J. 2001. Unsupervised learning of the morphology of a natural language. Comput. Linguist. 27, 2, 153--193.
[11] Goswami, G. 1990. asamiyaa byaakaranar moulik bisaar. Bina Library, Guwahati, India.
[12] Lieber, R. 1992. Deconstructing morphology: Word formation in syntactic theory. University of Chicago Press, Chicago, IL.
[13] Medhi, K. 1999. Assamese Grammar and Origin of the Assamese Language. 3rd Ed. Lawyer’s Book Stall, Guwahati, India.
[14] Porter, M. 1980. An algorithm for suffix stripping. Autom. Library Inform. Syst., 14, 3, 130--137.
[15] Saravanan, M., Reghv Raj, P. C., Murty, V. S., and Raman, S. 2002. Improved Porter’s algorithm for root word stemming. In Proceedings of the International Conference on Natural Language Processing (ICON’02), 21--30.
[16] Sarma, D. D. 1977. sahaj byaakaran. Assam State Textbook Production and Publication Corporation Ltd., Guwahati-1, India.
[17] Schneider, G. 1998. An introduction to government and binding. University of Zurich. http://www.ifi.unizh.ch/CL/gschneid/dreitaegig.ps.gz.
[18] Sharma, U. 2006. Unsupervised acquisition of morphology of a highly inflectional language. PhD dissertation, Department of Computer Science and Engineering, Tezpur University, Assam, India.
[19] Sharma, U., Das, R., and Kalita, J. 2006. Unsupervised acquisition of morphological features of Assamese from a text corpus. In Proceedings of the National Workshop on Trends in Advanced Computing (NWTAC’06), 178--184.
[20] Sharma, U., Kalita, J., and Das, R. 2002. Unsupervised learning of morphology for building a lexicon for highly inflectional language. In Proceedings of the Workshop on Morphological and Phonological Learning (ACL’02), 1--10.
[21] Sharma, U., Kalita, J., and Das, R. 2003. Root word stemming by multiple evidence from corpus. In Proceedings of the 6th International Conference on Computational Intelligence and Natural Computing (CINC’03).
[22] Snover, M. G., Jarosz, G. E., and Brent, M. R. 2002. Unsupervised learning of morphology using a novel directed search algorithm: Taking the first step. In Proceedings of the Workshop on Morphological and Phonological Learning (ACL’02), 11--20.
[23] Vasu, S. C. 1891. The Ashtadhyayi of Panini (edited and translated into English), vol. I. Motilal Banarsidass, Delhi, India.

    Published In

    ACM Transactions on Asian Language Information Processing  Volume 7, Issue 3
    August 2008
    78 pages
    ISSN:1530-0226
    EISSN:1558-3430
    DOI:10.1145/1386869

    Publisher

    Association for Computing Machinery

    New York, NY, United States

    Publication History

    Published: 01 June 2008
    Accepted: 01 February 2008
    Revised: 01 February 2008
    Received: 01 March 2007
    Published in TALIP Volume 7, Issue 3

    Author Tags

    1. Assamese
    2. Indo-European languages
    3. Morphology
    4. machine learning

    Qualifiers

    • Research-article
    • Research
    • Refereed
