Investigator name recognition from medical journal articles: a comparative study of SVM and structural SVM

Authors:

George R. Thoma

DAS '10: Proceedings of the 9th IAPR International Workshop on Document Analysis Systems

Pages 121 - 128

https://doi.org/10.1145/1815330.1815346

Published: 09 June 2010

Abstract

Automated extraction of bibliographic information from journal articles is key to the affordable creation and maintenance of citation databases, such as MEDLINE®. A newly required bibliographic field in this database is "Investigator Names": names of people who have contributed to the research addressed in the article, but who are not listed as authors. Since the number of such names is often large, several score or more, their manual entry is prohibitively costly. The automated extraction of these names is a problem in Named Entity Recognition (NER), but differs from typical NER due to the absence of normal English grammar in the text containing the names. In addition, since MEDLINE conventions require names to be expressed in a particular format, it is necessary to identify both the first and last names of each investigator, which is an additional challenge. We seek to automate this task through two machine learning approaches: a Support Vector Machine (SVM) and a structural SVM, both of which show good performance at the word and chunk levels. In contrast to traditional SVM, structural SVM attempts to learn a sequence by using contextual label features in addition to observational features. At the initial learning stage, when contextual observation features are not used, structural SVM outperforms SVM. However, with the addition of these contextual features from neighboring tokens, SVM performance improves to match or slightly exceed that of the structural SVM.
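The idea of adding contextual observation features from neighboring tokens can be illustrated with a minimal sketch. The abstract does not list the features actually used, so the feature names below (`is_capitalized`, `is_initial`, and so on), the helper names, and the window size are all hypothetical and chosen only for illustration. Each token's feature dictionary is extended with the features of its neighbors, keyed by relative offset, which is one common way to give a per-token classifier such as an SVM access to local context.

```python
from typing import Dict, List


def observation_features(token: str) -> Dict[str, bool]:
    """Hypothetical per-token features; not the feature set used in the paper."""
    return {
        "is_capitalized": token[:1].isupper(),
        "is_initial": len(token) <= 2 and token[:1].isupper(),
        "ends_with_comma": token.endswith(","),
        "has_digit": any(ch.isdigit() for ch in token),
    }


def with_context(tokens: List[str], window: int = 1) -> List[Dict[str, bool]]:
    """Add the features of neighbors within `window` positions to each token."""
    base = [observation_features(t) for t in tokens]
    rows = []
    for i in range(len(tokens)):
        row: Dict[str, bool] = {}
        for offset in range(-window, window + 1):
            j = i + offset
            if 0 <= j < len(tokens):
                # Prefix each feature with its relative position, e.g. "-1:" or "+1:".
                for name, value in base[j].items():
                    row[f"{offset:+d}:{name}"] = value
            else:
                # Mark positions that fall outside the token sequence.
                row[f"{offset:+d}:pad"] = True
        rows.append(row)
    return rows


if __name__ == "__main__":
    for token, row in zip(["Smith", "J.", "Jones,", "A"],
                          with_context(["Smith", "J.", "Jones,", "A"])):
        print(token, sorted(name for name, on in row.items() if on))
```

With `window=0` each row contains only the token's own observations, which corresponds to the setting without contextual observation features. Larger windows expose more of the neighboring tokens to the classifier.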

Cited By

Misra D, Hall R, Payne S, Thoma G, Boughida K, Howard B, Nelson M, Van de Sompel H, Sølvberg I (2012). Digital preservation and knowledge discovery based on documents from an international health science program. Proceedings of the 12th ACM/IEEE-CS joint conference on Digital Libraries, 23-26. DOI: 10.1145/2232817.2232823. Online publication date: 10-Jun-2012
https://dl.acm.org/doi/10.1145/2232817.2232823
Packer T, Lutes J, Stewart A, Embley D, Ringger E, Seppi K, Jensen L, Basili R, Lopresti D, Ringlstetter C, Roy S, Schulz K, Subramaniam L (2010). Extracting person names from diverse and noisy OCR text. Proceedings of the fourth workshop on Analytics for noisy unstructured text data, 19-26. DOI: 10.1145/1871840.1871845. Online publication date: 26-Oct-2010
https://dl.acm.org/doi/10.1145/1871840.1871845

Published In

DAS '10: Proceedings of the 9th IAPR International Workshop on Document Analysis Systems

June 2010

490 pages

ISBN:9781605587738

DOI:10.1145/1815330

General Chairs:
David Doermann, University of Maryland, College Park
Venu Govindaraju, University at Buffalo, SUNY
Daniel Lopresti, Lehigh University
Prem Natarajan, Raytheon BBN Technologies

Copyright © 2010 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Publisher

Association for Computing Machinery

New York, NY, United States

Qualifiers

Research-article

Conference

DAS '10

DAS '10: The 9th IAPR International Workshop on Document Analysis Systems

June 9 - 11, 2010

Boston, Massachusetts, USA
