DOI: 10.1145/2254129.2254154
research-article

A comparison of layout based bibliographic metadata extraction techniques

Published: 13 June 2012

Abstract

Social research networks such as Mendeley and CiteULike offer various services for collaboratively managing bibliographic metadata. Compared with traditional libraries, metadata quality is of crucial importance in order to create a crowdsourced bibliographic catalog for search and browsing. Artifacts managed by the users of these networks, PDFs in particular, become an important metadata source and the starting point for creating a homogeneous, high-quality bibliographic catalog. Natural Language Processing and Information Extraction techniques have been employed to extract structured information from unstructured sources. However, given highly heterogeneous artifacts that cover a range of publication styles and stem from different publication sources, and given imperfect PDF processing tools, how accurate are metadata extraction methods in such real-world settings? This paper addresses that question by investigating the use of Conditional Random Fields and Support Vector Machines on real-world data gathered from Mendeley and Linked-Data repositories. We compare style and content features using existing state-of-the-art methods on two newly created real-world data sets for metadata extraction. Our analysis shows that two-stage SVMs provide reasonable performance in solving the challenge of metadata extraction for crowdsourced bibliographic metadata management.

Published In

WIMS '12: Proceedings of the 2nd International Conference on Web Intelligence, Mining and Semantics
June 2012
571 pages
ISBN: 9781450309158
DOI: 10.1145/2254129

Sponsors

  • UCV: University of Craiova
  • WNRI: Western Norway Research Institute

Publisher

Association for Computing Machinery

New York, NY, United States

Author Tags

  1. bibliographic metadata
  2. layout features
  3. metadata extraction
  4. research papers

Qualifiers

  • Research-article

Acceptance Rates

Overall Acceptance Rate 140 of 278 submissions, 50%
