
Web Drawn Corpus for Bhojpuri


Srishti Singh
Jawaharlal Nehru University, New Delhi, India
[email protected]

Abstract- The present paper discusses the methodology used to create what is probably the first large corpus for Bhojpuri, comprising 169,275 words and introduced in Singh (2015). This general-domain Bhojpuri corpus was created with web-crawling technology, using ILCrawler. A statistical tagger for Bhojpuri has already been trained on the same corpus using the Support Vector method; that experiment served as a test of the representativeness of the web-drawn corpus for Bhojpuri, a language which has no standard variety but many regional dialects. The objective of this paper is to provide researchers and practitioners working on less-resourced languages with a full-fledged guideline for creating and validating a corpus, with attention to language variety, genre and archiving issues.

Keywords: web-based Bhojpuri corpus, ILCrawler, methodology, crawling and validation issues.

I. INTRODUCTION

Bhojpuri remained an oral tradition for a long time, and its limited acceptance as a standard language until the twentieth century is one reason for the language's technical underdevelopment. Although Bhojpuri literature has been growing for the last two decades, along with the language's popularity in different domains, the language is still less-resourced. Nowadays there are blogs, newspapers, columns etc. in Bhojpuri, but no application is available to the user. Crawling data from the web is a very cost-efficient process and saves much of the researcher's time otherwise spent on extremely labour-intensive manual collection (Choudhary, 2011). One main reason for extracting data from the web was the scarcity of variety in Bhojpuri textbooks, whereas web sources provide easy access to any desired discipline within no time. Moreover, Bhojpuri is a language with many regional dialects, so uniformity was another issue while creating the corpus.

This study was initiated by the researcher in 2013 as part of an M.Phil dissertation entitled "Challenges in Automatic POS Tagging of Indian Languages: A Comparative Study of Hindi and Bhojpuri". As an outcome of that study, a statistical tagger using the Support Vector algorithm was devised for Bhojpuri. The guideline for annotating the Bhojpuri corpus following the BIS (Bureau of Indian Standards) scheme (Singh and Banerjee, 2014) and the SVM-based statistical tagger for Bhojpuri (Singh and Jha, 2015) have already been released. The focus of this piece of work is to present a guideline for researchers working along similar lines, covering the selection and validation of the data, the applicability of tools such as the crawler and sanitizer to Bhojpuri, the careful selection of language variety and genre, and the archiving issues met.

A. Bhojpuri Corpus and its Architecture

A corpus is considered the basic building block of language technology: electronically retrievable data gives the researcher or practitioner an easy starting point for further annotation and analysis. The present Bhojpuri corpus is a monolingual, general-domain annotated corpus of 169,275 words. It is a web-based corpus, as its data was obtained from different internet sources covering six domains such as entertainment, politics and literature. The annotation process has already been discussed elsewhere and is therefore not the focus here.
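To make the architecture concrete, the short sketch below shows how such a domain-organized corpus can be inventoried, counting sentences and tokens per domain. It is an illustration only: the directory layout and file names are hypothetical, and the danda (।, U+0964) is assumed as the sentence boundary marker.

    # Illustrative sketch: inventory a domain-organized corpus like the one
    # described above. Paths and file names are hypothetical, not those of
    # the actual Bhojpuri corpus.
    from pathlib import Path

    DOMAINS = ["blogs", "entertainment", "literature", "politics", "sports", "misc"]

    def corpus_stats(root: str) -> dict:
        """Count sentences and whitespace-separated tokens per domain file."""
        stats = {}
        for domain in DOMAINS:
            path = Path(root) / f"{domain}.txt"   # assumed: one UTF-8 file per domain
            if not path.exists():
                continue
            text = path.read_text(encoding="utf-8")
            # Devanagari text marks sentence boundaries with the danda (U+0964).
            sentences = [s for s in text.split("\u0964") if s.strip()]
            stats[domain] = {"sentences": len(sentences), "tokens": len(text.split())}
        return stats

A run over the compiled corpus files would reproduce headline figures such as the total token and sentence counts reported above.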
One motivation for selecting web-drawn data over the literature available in Bhojpuri was the demand for variety in the data. Moreover, the scarcity of texts in machine-readable format and the dialectal variety of Bhojpuri were further reasons for turning to the internet. The processes involved in creating the corpus are shown in the figure below:

Fig. 1 Process involved in creating the corpus

The process of creating a corpus for Bhojpuri started with data collection. The data was collected with the help of ILCrawler, a tool devised as part of the ILCI project (Indian Languages Corpora Initiative, a government-funded project for creating language resources) for collecting data from web sources. The collected data was then cleaned of errors and noise, both manually and with ILSanitizer. After cleaning, the data was formatted by manually selecting and eliminating undesired information and by editing and compiling the error-free data into the desired format (the ILCI format is followed to some extent).

II. CREATING THE CORPUS

A. Corpus Data

The corpus under discussion is probably the first large corpus for Bhojpuri, with a total of 169,275 tokens and an average sentence length of 16 tokens. This amounted to 9,019 sentences at the time the experiment was conducted. The corpus data covers six major domains, which were further subdivided into categories for a fairer selection and distribution of text; the broader the coverage of the training file, the better the tagger's performance. These domains are blogs, entertainment, literature, politics, sports and miscellaneous, and they were chosen for both quality and quantity, considering the representativeness of the corpus. Each domain contributes to the corpus in the following ratio:

TABLE 1. SIX MAJOR CORPUS DOMAINS

Domain          Share of corpus
Blogs           35%
Entertainment   32%
Miscellaneous   17%
Politics        11%
Literature      3%
Sports          0.26%

As the table shows, blogs and entertainment are the two major contributing domains, at 35 and 32 percent respectively, while miscellaneous and politics occupy 17 and 11 percent. Only 3 percent of literary data was chosen at the time of the experiment; more is now being added. The sports data was added to the corpus later and measures only 0.26 percent, because little sports-related online text was available and some of it was already present in the blogs and miscellaneous data. These files were used as the initial input for training the tagger (Singh and Jha, 2015).

B. Data Collection

1) Web sources of the data

The authenticity of a corpus depends in part on the sources its input data comes from; the verifiability and representativeness of the corpus are also important. With this in mind, websites of Bhojpuri newspapers, literary sites and only authentic reports were chosen as the sources for retrieving data (Singh, 2015, cited from the dissertation). The major websites used for archiving the corpus data are:

- http://www.thesundayindian.com/
- http://www.bhojpurika.com/
- http://trendsarrived.com/
- http://anjoria.wordpress.com/
- http://tatkakhabar.com/
- http://norivers.org/ (for blogs and literary columns)

According to Hardie (2012: 57-60), copying and redistributing web data can raise legal issues such as copyright; these can be avoided to some extent by promoting public sharing of the data, since it is freely available on the internet and authors are also paid for every visit. The uniformity of the writing styles on websites like 'The Sunday Indian' and 'Tatka Khabar' was a further motivation for web-drawn data collection.

2) ILCrawler

In this era of technology, readily available Unicode resources relieve researchers of much of the cost and effort of collecting data by the gigabyte. A crawler is one such data-mining tool: it reduces the manual effort of finding and collecting data, and hundreds of web links can be explored within minutes. ILCrawler, a product of the ILCI project used for collecting the Bhojpuri data, is a Java-based application designed to meet the researcher's needs in the following ways (an illustrative sketch follows the list):

a) The script of the language to be extracted, the links, and the limit on sub-links within each link can all be adjusted as needed. The output is also produced link-wise, making it easy for the researcher to cross-check the data.

b) Extra information such as advertisements can be filtered out by configuring the tool, and there is very little (around 0 percent) chance of obtaining data in any script other than the one chosen. Roman digits were sometimes found in the Devanagari data, but these were necessary for the elicitation of facts and figures.
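ILCrawler itself is a Java application and its code is not reproduced here. The following Python sketch only illustrates the behaviour described in a) and b): a depth-limited crawl whose output can be kept link-wise and which retains only pages written predominantly in the chosen script (here Devanagari). The 0.8 ratio threshold and all helper names are assumptions for the example.

    # Illustrative sketch of crawler behaviour like that described for
    # ILCrawler; not the tool itself. Threshold and names are assumptions.
    import re
    import urllib.request
    from urllib.parse import urljoin

    DEVANAGARI = re.compile(r"[\u0900-\u097F]")
    TAG = re.compile(r"<[^>]+>")
    HREF = re.compile(r'href="([^"#]+)"')

    def devanagari_ratio(text: str) -> float:
        """Share of alphabetic characters falling in the Devanagari block."""
        letters = [c for c in text if c.isalpha()]
        return sum(bool(DEVANAGARI.match(c)) for c in letters) / max(len(letters), 1)

    def crawl(seed: str, max_depth: int = 1, min_ratio: float = 0.8):
        """Yield (url, text) for pages reachable from seed that pass the script filter."""
        seen, frontier = set(), [(seed, 0)]
        while frontier:
            url, depth = frontier.pop()
            if url in seen:
                continue
            seen.add(url)
            try:
                html = urllib.request.urlopen(url, timeout=10).read().decode("utf-8", "replace")
            except (OSError, ValueError):
                continue
            text = TAG.sub(" ", html)                 # crude tag stripping
            if devanagari_ratio(text) >= min_ratio:   # script filter, cf. b)
                yield url, text
            if depth < max_depth:                     # sub-link limit, cf. a)
                frontier += [(urljoin(url, h), depth + 1) for h in HREF.findall(html)]

Writing each yielded page to its own file mirrors the link-wise output described in a), which makes cross-checking against the source easy.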
3) Corpus cleaning

Cleaning is the process of reconstructing damaged text; sanitation and the selection of data for the final text are integral parts of corpus cleaning. The objective of data cleaning is to keep a check on noise and reduce it to a minimum; this covers all the discrepancies that can creep into a running text while copying, such as spelling errors, redundancy, omitted letters or words, improper sentence breaks etc. (Singh, 2015). Symbol- and punctuation-related errors, for example (.) instead of the danda (।) as the sentence boundary marker in the Devanagari data, were also checked here, and problems such as multiple spacing and clubbed tokens were dealt with. ILSanitizer was applied for cleaning the corpus data in this experiment; it has the following features:

a) ILSanitizer was programmed to accept only full-fledged sentences: any string of fewer than three tokens was eliminated, except for imperatives.

b) It helps in the sentence alignment of the text, and scattered or fragmented tokens were removed. This alignment of sentences and shuffling of data is an integral part of the sanitizer.

c) The sentence length can also be set in the sanitizer, but here only a maximum size per file was imposed, of at most 100,000 tokens.

C. Data Management and Formatting

The ILCI standard of formatting is followed for data management and compilation: the data from each domain is compiled into sets of approximately 1,000 sentences, with an average sentence length of 16 tokens. Each sentence is given a unique sentence ID following the ILCI naming convention, containing letters from the language name and the domain plus the sentence number, and the files are then converted to UTF-8 format. After compilation, the corpus data consisted of the following sets under each domain, with the contents described in the table below:

TABLE 2. FILE FORMATTING OF THE DATA

From Table 2, we see that four files were compiled each for the blogs and entertainment domains, one file each for literature, politics and sports, and two for miscellaneous, giving thirteen files in total. All the files are listed in the table along with their respective sentence counts and numbers of tokens; blog sets 1, 2 and 3, entertainment sets 1, 2 and 3, politics set 1 and miscellaneous set 1 average about 900 sentences each, whereas the others have relatively fewer. Once the formatting was done, the corpus was sent for manual validation before proceeding with further annotation. The validation issues are discussed in detail in the following section.
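Before turning to the challenges, the sketch below gathers the cleaning and packaging rules just described: collapsing multiple spaces, restoring the danda as sentence marker, dropping strings under three tokens, and writing UTF-8 sets of roughly 1,000 sentences capped at 100,000 tokens per file. It is a minimal illustration; the ID pattern 'bho-<domain>-NNNN' is an assumption based on the convention described above, not the exact ILCI scheme, and the imperative exception to the three-token rule is not modelled.

    # Illustrative cleaning/packaging rules in the spirit of ILSanitizer and
    # the ILCI format. The ID pattern is an assumption, and the exemption of
    # imperatives from the three-token minimum is not modelled here.
    import re

    SENTENCES_PER_SET = 1000        # sets of approx. 1,000 sentences
    MAX_TOKENS_PER_FILE = 100_000   # per-file token cap used in the experiment

    def clean(sentences):
        """Collapse extra spaces, fix the boundary marker, drop short fragments."""
        cleaned = []
        for s in sentences:
            s = re.sub(r"\s+", " ", s).strip()      # multiple spacing
            if s.endswith("."):
                s = s[:-1] + "\u0964"               # (.) -> danda (।) as boundary
            if len(s.split()) >= 3:                 # drop strings under 3 tokens
                cleaned.append(s)
        return cleaned

    def write_sets(sentences, domain, lang="bho"):
        """Write UTF-8 sets of ~1,000 sentences with unique sentence IDs."""
        for start in range(0, len(sentences), SENTENCES_PER_SET):
            chunk = sentences[start:start + SENTENCES_PER_SET]
            set_no = start // SENTENCES_PER_SET + 1
            tokens = 0
            with open(f"{lang}_{domain}_set{set_no}.txt", "w", encoding="utf-8") as f:
                for i, s in enumerate(chunk, start=start + 1):
                    tokens += len(s.split())
                    if tokens > MAX_TOKENS_PER_FILE:  # enforce the token cap
                        break
                    f.write(f"{lang}-{domain}-{i:04d}\t{s}\n")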
III. CHALLENGES IN CORPUS CREATION FOR BHOJPURI

A. Restrictions of ILCrawler

a) Some issues were found on the crawler's side. ILCrawler is a Java-based application which detects languages on the internet on the basis of their script, and more than a hundred languages are spoken in India, including the 22 scheduled languages. Among these, the first ten languages in the list (for example, Hindi, Sanskrit, Marathi, Konkani and Bodo) have adopted Devanagari as their script. So if a page is opened for extracting Marathi or Bodo, the crawler may also pull down any open windows with Hindi advertisements and columns. This happened here too: some Hindi, Marathi, Sanskrit and other-language data was found in the crawled Bhojpuri output. Such sentences were handled manually later, during validation (a toy flagging sketch is given after this list). Some examples of the foreign-language data are given in the table below:

TABLE 3. DATA FROM FOREIGN LANGUAGES

Moreover, script-based selection sometimes also loses useful information, such as dates or measurements written in Roman digits, or scientific names written in Roman script in the running text.

b) Some web pages were also found to carry Bhojpuri data in Roman script, mostly poetry and fiction. Such data was excluded from the corpus to save the time and effort of transliterating it into Devanagari; these sections can be considered for further development of the corpus.

c) Multiple copies of headers (the main heading of a page) appear with every sub-link: the crawler copies the header as many times as it accesses any of the sub-links. This can be caught by checking for duplicates in the data. In some major types of headers copied during crawling, different tabs (buttons) are joined as if to form a single string. For example:

TABLE 4. HEADER DISCREPANCY
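Two of the checks above lend themselves to simple automatic flagging ahead of manual review. The toy sketch below flags probable Hindi sentences (a script check alone cannot separate Devanagari languages) and verbatim-repeated header lines. The Hindi and Bhojpuri marker lists are illustrative assumptions; in the project itself both problems were resolved manually during validation.

    # Toy review helpers for crawled Devanagari output. Marker lists are
    # illustrative assumptions only; the project handled both cases manually.
    from collections import Counter

    HINDI_MARKERS = {"है", "हैं", "था", "थी"}           # common Hindi copulas
    BHOJPURI_MARKERS = {"बा", "बाड़े", "रहल", "बानी"}    # common Bhojpuri copulas

    def flag_probable_hindi(sentence: str) -> bool:
        """True if the sentence uses Hindi markers but no Bhojpuri ones."""
        tokens = set(sentence.split())
        return bool(tokens & HINDI_MARKERS) and not (tokens & BHOJPURI_MARKERS)

    def repeated_headers(lines, min_repeats=3):
        """Lines recurring verbatim, as page headers copied per sub-link do."""
        counts = Counter(line.strip() for line in lines)
        return {line for line, n in counts.items() if n >= min_repeats and line}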
B. Sanitizing Issues

a) Typing errors. A large number of typing mistakes were encountered throughout the data. These are probably due to the changed format of the output after copying from the internet, because the same text can be verified against the source. Letters and diacritics were found misplaced and sometimes also reduplicated. A list of such words is presented in the table below:

TABLE 5. TYPING ERRORS

b) Word fragments. As discussed above, some words were found broken into fragments isolated from the string, like 'चि' (ʧɪ), part of an honorific for 'mister' (Mr.) in Hindi. There are also cases where words (names of a person or place, or both), like 'नई दिल्ली' (nəʝi ðɪlli, 'New Delhi'), were found split by multiple spaces inside a complete sentence. Such instances were corrected and also noted down for reformation. Some of these are:

TABLE 6. LIST OF FRAGMENTED WORDS

c) Space missing between words or collocations. The data was also checked for spacing between words. Multiple spaces between words and omitted spaces between phrases and collocations were found in large numbers. Multiple spacing can be corrected automatically, but adding missed-out spaces manually was very time-consuming. See the table below for examples:

TABLE 7. SPACE MISSING BETWEEN WORDS/COLLOCATIONS

C. Validation Issues

a) Genre-specific sentences. A corpus best represents a language when its data is collected from day-to-day natural language use, but there are limitations to this. Genre-specific text such as poetry and idioms is problematic for machine learning, because it demands more semantic and background knowledge of the language. Therefore, the sections of songs and poetry were avoided at this stage to reduce ambiguity for the machine. For example:

TABLE 8. POETRY DATA

b) Unaccepted terminology. Although slang is an important aspect of language and also helps in understanding kinship relations and society better, Mishra (2003) correctly points out that decency in speech and correct use of language are necessary elements for the standardization of a language. Therefore, terminology not welcomed by the speakers was eliminated from the corpus. Slang words like 'ससुरा' (səsura), 'भँड़ुआ' (bʰə̃ɽua) and 'हरामजादा' (həramʤaða), all roughly meaning 'moron', are not part of the present corpus.

IV. CONCLUSION

This paper discussed the motivation for building a language resource for Bhojpuri by creating a monolingual general-domain corpus for the language. An SVM-based statistical Bhojpuri tagger trained on this same corpus has already been developed. The crawling technique for extracting data from web sources was the main focus of the study. The data was carefully selected with the representativeness of the corpus in mind, as it does not represent any particular domain but many. The paper first dealt with the methodology for creating the corpus, involving data collection and cleaning, then data management and formatting, followed by the careful selection and validation processes. The second half discussed the various issues faced, categorized into three groups: crawling, cleaning and validation issues.

REFERENCES

[1] Choudhary, N. (2011). Web-Drawn Corpus for Indian Language: A Case of Hindi. In ICISIL 2011, CCIS, p. 139.
[2] Hardie, A. and McEnery, T. (2012). Corpus Linguistics: Method, Theory and Practice. Cambridge University Press.
[3] Mishra, R.B. (2003). Sociology of Bhojpuri Language. Swasti Publication, Varanasi.
[4] Singh, S. and Banerjee, E. (2014). Annotating Bhojpuri Corpus Using BIS Scheme. In Proceedings of the 2nd Workshop on Indian Language Data: Resources and Evaluation (LREC 2014).
[5] Singh, S. and Jha, G.N. (2015). Statistical Tagger for Bhojpuri: Employing Support Vector Machine. In Proceedings of the International Conference on Advances in Computing, Communications and Informatics (ICACCI 2015).