The present paper discusses the methodology used to create what is probably the first large corpus for Bhojpuri, with 169,275 words, introduced in Singh (2015). This general-domain Bhojpuri corpus was built by web crawling with ILCrawler. A statistical tagger for Bhojpuri has already been trained on the same corpus using the Support Vector method. This experiment was a test of the representativeness of the web-drawn corpus for Bhojpuri, which as a rule has no standard variety and has many regional dialects. The objective of this presentation is to provide researchers and practitioners working on less-resourced languages with a full-fledged guideline on creating and validating a corpus, taking into account language variety, genre and archiving issues.
This paper describes and discusses the development of a new Bangla monolingual digital text corpus, the Bangla Web Corpus (BWC), which is being developed as part of the ILCI-2 project supported by DeitY, Govt. of India, with textual data retrieved from the internet, digital portals, and web pages. It also addresses the methods and strategies applied for this purpose, the issues that cropped up while generating the corpus database, and the major problems faced while creating the corpus. In our opinion, these issues, problems, strategies and methods can give clear insights into dealing with similar situations when generating corpora in other less-resourced and less computer-savvy Indian languages. Fishing language data from the web and harvesting the BWC may be treated as milestones in the history of Bangla corpus generation, as the BWC holds tremendous potential for opening up new avenues for web crawling and language corpus building in the wider spectrum of research in language technology and applied linguistics. An online version of the BWC, which is on the verge of being hosted on the net, will contribute towards building an interface where language users can navigate through a web-enabled corpus to address their linguistic needs. Here lies the theoretical relevance, empirical pertinence, and functional importance of the work, which seeks to propose a makeshift guideline for the new generation of corpus developers in Indian languages.
2018
Developing a corpus for the study of various aspects of a language is a highly challenging task that involves effective planning and implementation. The prime concern in the development of a corpus is the overall design criteria. In this chapter we present some theoretical guidelines on the design criteria of a one-million-word digital corpus, the Hindi Newspaper Text Corpus (HNTC), which has been developed as part of an ongoing research activity. After the planning stage, the various steps involved in the development of the corpus are described comprehensively. An overview of the developed corpus is also given, with detailed specifications. Since the developed corpus is to be used subsequently for various kinds of linguistic analysis, it has been documented carefully. This chapter also gives importance to documentation, storage and management of the developed corpus as it requires extreme care on the part of t...
WILDRE 2, LREC 2014. ISBN (978-2-9517408-8-4), 2014
The present paper discusses the application of the Bureau of Indian Standards (BIS) scheme to Bhojpuri, one of the most widely spoken Indian languages. A claim has been made for the inclusion of Bhojpuri in the Eighth Schedule of the Indian Constitution, where 22 major Indian languages are currently listed. Through recent Indian government initiatives these scheduled languages have received attention from a computational perspective, but unfortunately this non-scheduled language still lacks such attention for its development in the field of NLP. The present work is possibly the first of its kind. The BIS tagset is an Indian standard designed for tagging almost all the Indian languages. Annotated corpora in Bhojpuri and the simplified annotation guideline for this tagset will serve as an important resource for such well-known NLP tasks as POS tagging, phrase chunking, parsing, structural transfer, Word Sense Disambiguation (WSD), etc.
Annotation of corpora is very significant for any kind of NLP-oriented research and application. Natural language processing, speech recognition and other related areas require annotated corpora as an important tool for investigators. For constructing statistical models for the automatic processing of natural languages, annotated corpora offer a basic building block. Creation of annotated corpora is the first step towards natural language processing. Annotated corpora have proved to be very useful for language processing, so they are created for languages across the world. Unfortunately, not much work has been carried out on the creation of annotated corpora for Indian languages. The unavailability of annotated corpora large enough for experimenting with statistical algorithms is the main bottleneck for the computational processing of Indian languages. Annotation of corpora is done at different levels of language analysis, such as part of speech, phrase/clause level, dependency level, etc. The basic step towards building an annotated corpus is part-of-speech (POS) tagging, and the next level of tagging is chunking. The present draft is a compilation of information I gathered while working on Indian Language to Indian Language Machine Translation. The material had been lying in my laptop for a long time, and I thought of bringing it to light.
2012
To study various naturally occurring phenomena in natural language text, a well-structured text corpus is essential. The quality and structure of a corpus can directly influence the performance of various Natural Language Processing applications. Assamese is one of the major Indian languages, used by the people of north-east India. Language technology development work in Assamese has started at various levels, and research and development work has begun to demand a structured and well-covered Assamese corpus in Unicode format. Here we present various issues and problems related to building an Assamese text corpus. We review our experience with constructing one such corpus, comprising about 1.5 million words of Assamese. It will serve as an important research tool for language and NLP researchers.
International Journal of Engineering Trends and Technology, 2014
Chhattisgarhi is an official language in the Indian state of Chhattisgarh, spoken by 17.5 million people. In this paper we review the work that has been done in the field of natural language processing (NLP) using Chhattisgarhi and other state languages. The main goals of NLP here are to build machine learning systems, translators, dictionaries and POS taggers. A POS tagger is one of the important tools used to develop language translators and information extraction systems, so that computers can be made suitable for natural language processing. Part-of-speech tagging is the process of assigning a part of speech such as noun, verb, pronoun, preposition, adverb, adjective or other lexical class marker to each word in a sentence. Different types of POS taggers exist; some are based on probabilistic approaches and some on morphological approaches. So in this paper we will see various developments of POS tagger and the major work has been done using Chhattisgarhi and other Indian...
This paper presents an overview of corpus classification and development in electronic format for 16 language pairs, with Hindi as the source language. In a multilingual country like India, the major thrust in language technology lies in providing inter-communication services and direct information access in one's own language. As a result, language technology in India has seen major developments over the last decade in terms of machine translation and speech synthesis systems. As deeper research advances, the need for a high-quality standardised corpus is being seen as a primary challenge. To address these needs, the government of India has initiated a mega project called the Indian Languages Corpora Initiative (ILCI) to collect a parallel annotated corpus in 17 scheduled languages of the Indian constitution. The project is currently in its second phase, within which it aims to collect 8,50,000 parallel annotated sentences in 17 Indian languages in the domains of Entertainment and Agriculture. Together with the 6,00,000 parallel sentences collected in Phase 1 in the domains of Health and Tourism (Choudhary & Jha, 2011), the corpus being developed is one of the largest known parallel annotated corpora for any Indian language to date. This phase will ultimately also see the development of chunking standards for processing the annotated corpus.