srishti singh

Jawaharlal Nehru University, Centre Of Linguistics, Research Scholar

Publications by srishti singh

English Multi-word Expressions (MWE): A Tagset for Health Domain

Proceedings of ICACCI, 2018. ISBN 978-1-5386-5314-2

Abstract: This paper discusses the need for an independent MWE tagset to handle the technical language of the healthcare domain in clinical English, and reports efforts to develop a tagset, a guideline and a tagger for that domain. The tagset contains 12 tags. A CRF-based MWE tagger model was trained to test the reliability of the tagset on medical data and reached an accuracy of 73%. The tagger, which is under continuous improvement through cleanup of the corpus, annotation and errors, has since improved to an accuracy of 79%.

Keywords: healthcare domain, Multiword Expressions, English MWE tagset.

Demo: Part-of-Speech Tagger for Bhojpuri

LREC 2018 (ISBN 979-10-95546-09-2)

This paper is a demonstration of a POS (Part-of-Speech) annotation tool created for Bhojpuri, a lesser-resourced language. Bhojpuri is a popular Indian language spoken by more than 33 million people in India (Census 2001). On digital platforms, the availability of a good POS tagger is an important requirement for language resource creation, and the POS tagger discussed here is one of the initial experiments aimed at language resource creation for Bhojpuri. The tagger was created as part of dissertation work and is based on the BIS (Bureau of Indian Standards) annotation scheme. It also performs decently on other varieties of Bhojpuri because the corpus data was collected from a variety of sources. The average accuracy achieved by the tool so far is 88.6% for the general domain.

Challenges in Annotation and Domain Adaptation in Hindi POS Tagger: with Reference to Cricket

by Atul Kr. Ojha and srishti singh

Proceedings of the ICATCCTT - 2017, 2017

In this paper, the authors report on the scope for adapting a multi-domain Hindi POS tagger to a new domain, the popular domain of cricket, through an initial experiment. The utility of adapting to a new domain is proposed and verified by testing the existing Hindi POS tagger on sports-domain (here, cricket) data, where the average accuracy drops to 87.77% from the tagger's overall accuracy of approx. 94%. The test output is validated manually in order to generate a correct error report for the sports-domain data. Inter-annotator agreement and disagreement among the evaluators, and major tagger errors such as unseen vocabulary and inconsistent performance, are recorded along with suggestions for improvement, which serve as the basis for introducing adaptation for the Hindi tagger.
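The abstract does not state how inter-annotator agreement was measured. As a minimal sketch, assuming two evaluators label the same tokens, Cohen's kappa is one common way to quantify it. The function name and the tag labels below are illustrative and are not taken from the paper.

```python
from collections import Counter
from typing import Sequence


def cohen_kappa(a: Sequence[str], b: Sequence[str]) -> float:
    """Cohen's kappa for two annotators labelling the same tokens.

    Assumption: one label per token from each annotator, same order.
    """
    if len(a) != len(b) or not a:
        raise ValueError("need two equal-length, non-empty label sequences")
    n = len(a)
    # Share of tokens on which the two annotators agree.
    observed = sum(x == y for x, y in zip(a, b)) / n
    # Agreement expected by chance, from each annotator's label frequencies.
    count_a, count_b = Counter(a), Counter(b)
    expected = sum(count_a[label] * count_b[label] for label in count_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both annotators used a single identical label throughout
    return (observed - expected) / (1 - expected)


# Illustrative labels only; the real evaluation data is not available here.
annotator_1 = ["N_NN", "N_NN", "V_VM", "V_VM"]
annotator_2 = ["N_NN", "N_NN", "V_VM", "N_NN"]
print(cohen_kappa(annotator_1, annotator_2))
```

Kappa corrects raw agreement for chance, which matters when a few frequent tags dominate the data.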

Indian languages on the TypeCraft platform – the case of Hindi and Odia

by Dr. Pitambar Behera, Nath Girish, and srishti singh

Valance annotation of Hindi on TypeCraft

WILDRE 3, LREC, 2016. ISBN (978-2-9517408-9-1), 2016

The present paper describes the suitability of the TypeCraft framework for Hindi through valence annotation and construction labeling. It deals with the different levels of annotation provided on TypeCraft and with the source for initiating construction labeling using syntactic and semantic information embedded in the language. The annotation challenges in representing some major Hindi constructions, such as idiomatic expressions, reduplication, conjunct verbs and explicator verbs, are discussed along with construction labeling for Hindi, which is a new technique for closer syntactic analysis. The platform also supports more than one free translation and discourse sense labeling for each sentence.

Annotating Bhojpuri Corpus using BIS Scheme

WILDRE 2, LREC 2014. ISBN (978-2-9517408-8-4), 2014

The present paper discusses the application of the Bureau of Indian Standards (BIS) scheme to Bhojpuri, one of the most widely spoken Indian languages. Bhojpuri has a claim to inclusion in the Eighth Schedule of the Indian Constitution, which currently lists 22 major Indian languages. Through recent Indian government initiatives these scheduled languages have received attention from a computational perspective, but this non-scheduled language still lacks such attention for its development in the field of NLP. The present work is possibly the first of its kind. The BIS tagset is an Indian standard designed for tagging almost all Indian languages. Annotated corpora in Bhojpuri and the simplified annotation guideline for this tagset will serve as an important resource for well-known NLP tasks such as POS tagging, phrase chunking, parsing, structural transfer and Word Sense Disambiguation (WSD).

Statistical Tagger for Bhojpuri (employing Support Vector )

ICACCI, 2015. ISBN 978-1-4799-8791-7

The authors present the first Support Vector Machine (SVM) based statistical Part-of-Speech (POS) tagger developed for Bhojpuri. Bhojpuri is a less-resourced Indo-Aryan language of the Asian continent, and the POS tagger presented here is a step towards developing language resources for it. SVMs have already been trained on other languages such as Malayalam and Bangla with accuracies of 86-90%. The present research achieved approximately 89-90% and 94.24% accuracy on the test and gold datasets respectively. Keywords: Bhojpuri, annotated corpora, POS tagset, SVM based tagger.

A Hybrid Chunker for Hindi and Indian English

by Atul Kr. Ojha, srishti singh, and Dr. Pitambar Behera

Proceedings of the 3rd Workshop on Indian Language Data: Resources and Evaluation (under the 10th LREC2016, May 23-28, 2016), May 23, 2016

The paper presents a CRF-based hybrid chunker for Hindi and Indian English. The immediate goal is to chunk text data in the ILCI project funded by DeitY, Govt. of India. The experiment was conducted on 25k annotated sentences from the health and tourism domains: 23k sentences were used for training and the remaining 2k for evaluation. The experiment involved the following stages: training the chunker, automatic chunking, and validation of the chunked output for Hindi and Indian English; and finding measures to resolve issues detected at different levels of the experiment. The chunker for Indian English is built on the ILMT chunk tag scheme to meet the mapping requirements of the translation tool for English to Indian languages. The accuracies of the Hindi and Indian English chunkers are 88.84% and 89.04%, respectively. For the Hindi chunker, errors were observed in chunk categories such as noun (pronominal), verb finite, verb non-finite (conjunct verb) and adjectival phrase. For the English chunker, errors such as finite/non-finite, adverb/conjunction, wh-determiner and conjunction chunk are discussed in detail. A hybrid approach to error resolution has also been attempted.

Training & Evaluation of POS Taggers in Indo-Aryan Languages: A Case of Hindi, Odia and Bhojpuri

by Atul Kr. Ojha, Dr. Pitambar Behera, and srishti singh

Proceedings of 7th Language & Technology Conference: Human Language Technologies as a Challenge for Computer Science and Linguistics, ISBN No-978-83-932640-8-7, Nov 27, 2015

The present paper discusses the training and evaluation of the CRF and SVM algorithms for three Indo-Aryan languages: Hindi, Odia and Bhojpuri. The corpus is annotated with the Bureau of Indian Standards (BIS) annotation scheme, a common annotation standard for Indian languages. The main objective of the paper is to give an idea of the error patterns, with suggestions, when the same algorithms are applied to each language. The experiment uses 90k tokens of training data and 2k tokens of test data per language, for ease of comparison among languages. The evaluation report covers each tool (SVM and CRF++) in terms of accuracy, error analysis, error patterns and common system errors. The accuracy of the SVM taggers ranges from 88% to 93.7%, whereas that of the CRF taggers ranges from 82% to 86.7%. CRF performs worse than SVM for Odia and Hindi, which is not the case for Bhojpuri. In this study, we have observed that languages with more variation are better suited to CRF than to SVM.
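The accuracy and error-pattern figures described above can be illustrated with a short sketch. It assumes token-level accuracy over aligned gold and predicted tag sequences; the function name and sample tags are hypothetical and do not reproduce the paper's evaluation scripts.

```python
from collections import Counter
from typing import List, Tuple


def tagging_report(gold: List[str], pred: List[str]) -> Tuple[float, Counter]:
    """Return token-level accuracy and a count of (gold, predicted) confusions."""
    if len(gold) != len(pred) or not gold:
        raise ValueError("need two equal-length, non-empty tag sequences")
    # Every mismatch is recorded as a (gold tag, predicted tag) pair.
    errors = Counter((g, p) for g, p in zip(gold, pred) if g != p)
    accuracy = 1 - sum(errors.values()) / len(gold)
    return accuracy, errors


# Illustrative tags only, not data from the paper.
gold_tags = ["N_NN", "V_VM", "N_NN", "PSP"]
pred_tags = ["N_NN", "N_NN", "N_NN", "PSP"]
accuracy, errors = tagging_report(gold_tags, pred_tags)
print(f"accuracy = {accuracy:.2%}")
for (gold_tag, pred_tag), count in errors.most_common():
    print(f"{gold_tag} tagged as {pred_tag}: {count}")
```

Sorting confusions by frequency is one simple way to surface the common errors of a tagger before a closer manual analysis.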

Conference Presentations by srishti singh

Multi-Word Expressions Extraction for Clinical Domain (write to us for full paper)

9th Language & Technology Conference: Human Language Technologies as a Challenge for Computer Science and Linguistics May 17-19, 2019, Poznań, Poland, 2019

The present paper describes the development of a Multiword Expression (MWE) extractor for English in the clinical domain. Identification of a named entity can either leave some tokens of the entity unidentified or include extra tokens that are not part of that entity. This is the problem of Multiword Expression extraction. Correct identification of MWEs results in more accurate Named Entity Recognition (NER) and, in turn, in better downstream products that use them. The paper also discusses the challenges met in training the tagger, along with the complexities of the clinical domain. The evaluation of the tagger output and an error analysis are also presented. The MWE tagger achieved an F1 score of 79% on five-fold cross-validation.
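The abstract reports an F1 score over five-fold cross-validation without giving the scoring details. The following is a minimal sketch, assuming exact-match span scoring over B/I/O tags and contiguous folds. All function names are hypothetical, and the paper's tag scheme and evaluation setup may differ.

```python
from typing import Iterator, List, Set, Tuple

Span = Tuple[int, int]  # (start, end) token offsets, end exclusive


def bio_to_spans(tags: List[str]) -> Set[Span]:
    """Collect MWE spans from a B/I/O tag sequence."""
    spans: Set[Span] = set()
    start = None
    for i, tag in enumerate(tags):
        if tag in ("B", "O"):
            if start is not None:
                spans.add((start, i))  # close the span that was open
            start = i if tag == "B" else None
        # "I" extends the open span; a stray "I" with no open span is ignored.
    if start is not None:
        spans.add((start, len(tags)))
    return spans


def span_f1(gold: List[List[str]], pred: List[List[str]]) -> float:
    """Exact-match span F1: a span counts only if both boundaries match."""
    tp = fp = fn = 0
    for g, p in zip(gold, pred):
        gold_spans, pred_spans = bio_to_spans(g), bio_to_spans(p)
        tp += len(gold_spans & pred_spans)
        fp += len(pred_spans - gold_spans)
        fn += len(gold_spans - pred_spans)
    if tp == 0:
        return 0.0
    precision, recall = tp / (tp + fp), tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def k_fold_indices(n: int, k: int = 5) -> Iterator[Tuple[List[int], List[int]]]:
    """Yield (train, test) index lists for k contiguous folds over n sentences."""
    sizes = [n // k + (1 if i < n % k else 0) for i in range(k)]
    start = 0
    for size in sizes:
        test = list(range(start, start + size))
        train = [i for i in range(n) if i < start or i >= start + size]
        yield train, test
        start += size


# A prediction that misses the last token of an MWE earns no credit.
print(span_f1([["B", "I", "I"]], [["B", "I", "O"]]))
```

Under exact matching, both failure modes named in the abstract (a token left out, or an extra token included) count as a missed gold span plus a spurious predicted span.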

Challenges in Annotation and Domain Adaptation in Hindi POS Tagger: with Reference to Cricket

by Anupama Pandey, srishti singh, and Atul Kr. Ojha

In this paper, the authors report on the scope for adapting a multi-domain Hindi POS tagger to a new domain, the popular domain of cricket, through an initial experiment. The utility of adapting to a new domain is proposed and verified by testing the existing Hindi POS tagger on sports-domain (here, cricket) data, where the average accuracy drops to 87.77% from the tagger's overall accuracy of approx. 94%. The test output is validated manually in order to generate a correct error report for the sports-domain data. Inter-annotator agreement and disagreement among the evaluators, and major tagger errors such as unseen vocabulary and inconsistent performance, are recorded along with suggestions for improvement, which serve as the basis for introducing adaptation for the Hindi tagger.

SANSKRIT IN VARANASI: A STRUGGLE FOR SURVIVAL

Sanskrit is the language in which the Vedas, Upanishads, Puranas, ethics and several other ancient scriptures are transcribed. Historically, it was a widely spoken common man's language in India for a long period of time. Sanskrit is a classical language of India. The corpus of Sanskrit literature encompasses a rich tradition of poetry and drama as well as scientific, technical, philosophical and Hindu religious texts. Sanskrit continues to be widely used as a ceremonial language in Hindu religious rituals in the form of hymns and mantras. Spoken Sanskrit is still in use in a few traditional institutions in India, and there are many attempts at revival. Language is dynamic and so is time. With the passing of time and the emergence of several other languages, Sanskrit started losing its dominance, but it showed complete tolerance towards all.

Web Drawn Corpus for Bhojpuri

The present paper discusses the methodology for creating what is probably the first big corpus for Bhojpuri, with 169,275 words, introduced in Singh (2015). This general-domain Bhojpuri corpus was created by web crawling using ILCrawler. A statistical tagger for Bhojpuri has already been trained on the same corpus using the Support Vector method. The experiment tested the representativeness of the web-drawn corpus for Bhojpuri, which as a rule has no standard variety and has many regional dialects. The objective of this presentation is to provide researchers and practitioners working on less-resourced languages with a full-fledged guideline on creating and validating a corpus, considering the language variety, genre and the achieving issues.

Book by srishti singh

Automatic POS Tagging of Bhojpuri: A Comparative Study with Hindi

LAP LAMBERT Academic Publishing. ISBN 978-613-8-38992-7, 2018

The book is the printed version of the M.Phil. dissertation entitled "Challenges in Automatic POS Tagging of Indian Languages- A Comparative Study of Hindi and Bhojpuri", submitted to the Centre for Linguistics, Jawaharlal Nehru University, in 2015.

The present work is an initial experiment in creating the first automatic Part-of-Speech (POS) tagger for the Bhojpuri language as part of a research programme. Bhojpuri is a lesser-resourced language with little technology available. This work therefore presents the first big representative Bhojpuri corpus, of approx. 2.67 lakh (267,000) tokens from different domains, and an SVM (Support Vector Machine) based POS tagger trained on this corpus. The accuracy of the tagger achieved in this experiment is approx. 87%. The book also covers a detailed guideline for annotating a Bhojpuri corpus following the BIS scheme and a comparative analysis of the performance of Bhojpuri and Hindi POS taggers trained with SVM.

Journal paper by srishti singh

Grammatical Sketch of Banarasi : A Dialect of Bhojpuri

Research Review International Journal of Multidisciplinary & Best Linguistics paper in 'Student Paper Contest' @ ICON 2018, Patiala, Nov 10, 2018

The present paper demonstrates the effort made to draw a "Grammatical Sketch of Banarasi". Banarasi, a dialect of Bhojpuri, is still an oral tradition in Varanasi, and its culture demands to be preserved. This work is therefore an initiative towards documenting the present-day local language of Varanasi. A spoken corpus of folk stories in Banarasi was collected and transcribed as part of this research work, which was later extended and eventually contributed to the creation of the Bhojpuri corpus (Singh and Banerjee, 2014). The paper covers phonological, morphological and syntactic analysis of the language and also touches on discourse at the level of code-mixing and code-switching in the present form of the language. Features like PNG (person, number and gender), TAM (tense, aspect and mode), kinship and focus are discussed in detail.

Keywords: First Banarasi Corpus, Banarasi Folk Literature, Phonological and Morphological features, Code-mixing and Code-switching

Papers by srishti singh

English Multi-Word Expressions (MWE): A Tagset for Health Domain

2018 International Conference on Advances in Computing, Communications and Informatics (ICACCI), 2018

This paper discusses the need for an independent MWE tagset to handle the technical language of the healthcare domain in clinical English, and reports efforts to develop a tagset, a guideline and a tagger for that domain. The tagset contains 12 tags. A CRF-based MWE tagger model was trained to test the reliability of the tagset on medical data and reached an accuracy of 73%. The tagger, which is under continuous improvement through cleanup of the corpus, annotation and errors, has since improved to an accuracy of 79%.

Challenges in annotation and domain adaptation in Hindi POS tagger: With reference to cricket

In this paper, the authors report on the scope for adapting a multi-domain Hindi POS tagger to a new domain, the popular domain of cricket, through an initial experiment. The utility of adapting to a new domain is proposed and verified by testing the existing Hindi POS tagger on sports-domain (here, cricket) data, where the average accuracy drops to 87.77% from the tagger's overall accuracy of approx. 94%. The test output is validated manually in order to generate a correct error report for the sports-domain data. Inter-annotator agreement and disagreement among the evaluators, and major tagger errors such as unseen vocabulary and inconsistent performance, are recorded along with suggestions for improvement, which serve as the basis for introducing adaptation for the Hindi tagger.
