DOI: 10.1145/1815330.1815346

Investigator name recognition from medical journal articles: a comparative study of SVM and structural SVM

Published: 09 June 2010

Abstract

Automated extraction of bibliographic information from journal articles is key to the affordable creation and maintenance of citation databases such as MEDLINE®. A newly required bibliographic field in this database is "Investigator Names": the names of people who contributed to the research addressed in the article but who are not listed as authors. Since the number of such names is often large, several score or more, manual entry is prohibitively expensive. The automated extraction of these names is a Named Entity Recognition (NER) problem, but it differs from typical NER because the text containing the names lacks normal English grammar. In addition, since MEDLINE conventions require names to be expressed in a particular format, both the first and last names of each investigator must be identified, which is an additional challenge. We seek to automate this task through two machine learning approaches, Support Vector Machine (SVM) and structural SVM, both of which show good performance at the word and chunk levels. In contrast to traditional SVM, structural SVM attempts to learn a sequence by using contextual label features in addition to observational features. At the initial learning stage, when contextual observation features are not used, structural SVM outperforms SVM. However, with the addition of these contextual features from neighboring tokens, SVM performance improves to match or slightly exceed that of the structural SVM.
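
The idea of adding contextual observation features from neighboring tokens can be illustrated with a minimal Python sketch. It is not the authors' feature set: the function names `observation_features` and `windowed_features`, the four per-token features, and the `window` parameter are all assumptions chosen for illustration. The sketch only shows how a per-token classifier such as an SVM could be given features copied from the tokens on either side of the one being labeled.

```python
from typing import Dict, List


def observation_features(token: str) -> Dict[str, bool]:
    """Features computed from a single token (hypothetical feature set)."""
    return {
        "is_capitalized": token[:1].isupper(),
        # A single uppercase letter, optionally followed by periods, e.g. "J."
        "is_initial": len(token.rstrip(".")) == 1 and token[:1].isupper(),
        "ends_with_comma": token.endswith(","),
        "is_all_caps": token.isupper(),
    }


def windowed_features(tokens: List[str], i: int, window: int = 1) -> Dict[str, bool]:
    """Features for tokens[i] plus the same features from neighbors within `window`.

    Each feature name is prefixed with the neighbor's relative offset, so the
    classifier can tell "previous token is an initial" from "next token is an
    initial". Positions outside the token list get a marker feature instead.
    """
    feats: Dict[str, bool] = {}
    for offset in range(-window, window + 1):
        j = i + offset
        if 0 <= j < len(tokens):
            for name, value in observation_features(tokens[j]).items():
                feats[f"{offset:+d}:{name}"] = value
        else:
            feats[f"{offset:+d}:out_of_range"] = True
    return feats
```

With `window=0` each token is described only by its own observations, which corresponds to the setting without contextual observation features. Increasing `window` widens the context that a token-level classifier sees, without modeling label sequences the way a structural SVM does.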

Cited By

  • (2012) Digital preservation and knowledge discovery based on documents from an international health science program. Proceedings of the 12th ACM/IEEE-CS joint conference on Digital Libraries, 23-26. DOI: 10.1145/2232817.2232823. Online publication date: 10-Jun-2012
  • (2010) Extracting person names from diverse and noisy OCR text. Proceedings of the fourth workshop on Analytics for noisy unstructured text data, 19-26. DOI: 10.1145/1871840.1871845. Online publication date: 26-Oct-2010

Published In

DAS '10: Proceedings of the 9th IAPR International Workshop on Document Analysis Systems
June 2010
490 pages
ISBN: 9781605587738
DOI: 10.1145/1815330
Publisher

Association for Computing Machinery

New York, NY, United States

Author Tags

  1. MEDLINE
  2. document analysis
  3. investigator name
  4. named entity recognition
  5. structural SVM
  6. support vector machine (SVM)

Qualifiers

  • Research-article

Conference

DAS '10
