research-article

A comparison of layout based bibliographic metadata extraction techniques

Authors:

Roman Kern

WIMS '12: Proceedings of the 2nd International Conference on Web Intelligence, Mining and Semantics

Article No.: 19, Pages 1 - 8

https://doi.org/10.1145/2254129.2254154

Published: 13 June 2012

Abstract

Social research networks such as Mendeley and CiteULike offer various services for collaboratively managing bibliographic metadata. Compared with traditional libraries, metadata quality is of crucial importance in order to create a crowdsourced bibliographic catalog for search and browsing. Artifacts, in particular PDFs which are managed by the users of the social research networks, become one important metadata source and the starting point for creating a homogeneous, high quality, bibliographic catalog. Natural Language Processing and Information Extraction techniques have been employed to extract structured information from unstructured sources. However, given highly heterogeneous artifacts that cover a range of publication styles, stemming from different publication sources, and imperfect PDF processing tools, how accurate are metadata extraction methods in such real-world settings? This paper focuses on answering that question by investigating the use of Conditional Random Fields and Support Vector Machines on real-world data gathered from Mendeley and Linked-Data repositories. We compare style and content features on existing state-of-the-art methods on two newly created real-world data sets for metadata extraction. Our analysis shows that two-stage SVMs provide reasonable performance in solving the challenge of metadata extraction for crowdsourcing bibliographic metadata management.

Index Terms

A comparison of layout based bibliographic metadata extraction techniques
1. Computing methodologies
  1. Artificial intelligence
    1. Natural language processing
      1. Language resources
2. Information systems
  1. Information retrieval
    1. Document representation

Information & Contributors

Information

Published In

WIMS '12: Proceedings of the 2nd International Conference on Web Intelligence, Mining and Semantics

June 2012

571 pages

ISBN:9781450309158

DOI:10.1145/2254129

Conference Chair:
Dumitru Dan Burdescu, University of Craiova, Romania

Program Chairs:
Rajendra Akerkar, Western Norway Research Institute, Norway
Costin Bădică, University of Craiova, Romania

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Publisher

Association for Computing Machinery

New York, NY, United States

Qualifiers

Research-article

Funding Sources

Seventh Framework Programme

Conference

WIMS '12

Sponsor:

UCV
WNRI

WIMS '12: 2nd International Conference on Web Intelligence, Mining and Semantics

June 13 - 15, 2012

Craiova, Romania

Acceptance Rates

Overall Acceptance Rate 140 of 278 submissions, 50%

Bibliometrics & Citations

Bibliometrics

Article Metrics

Total Citations: 8
Total Downloads: 305
Downloads (Last 12 months): 3
Downloads (Last 6 weeks): 0

Reflects downloads up to 25 Dec 2024

Citations

Cited By

Ahmed M, Afzal M (2020). FLAG-PDFe: Features Oriented Metadata Extraction Framework for Scientific Publications. IEEE Access 8, 99458-99469. Online publication date: 2020.
https://doi.org/10.1109/ACCESS.2020.2997907
Bodo Z (2018). A CiteSeerX-Based Dataset for Record Linkage and Metadata Extraction. 2018 20th International Symposium on Symbolic and Numeric Algorithms for Scientific Computing (SYNASC), 230-236. Online publication date: Sep-2018.
https://doi.org/10.1109/SYNASC.2018.00044
Nasar Z, Jaffry S, Malik M (2018). Information extraction from scientific articles. Scientometrics 117:3, 1931-1990. Online publication date: 1-Dec-2018.
https://dl.acm.org/doi/10.1007/s11192-018-2921-5
Klampfl S, Granitzer M, Jack K, Kern R (2018). Unsupervised document structure analysis of digital scientific articles. International Journal on Digital Libraries 14:3-4, 83-99. Online publication date: 19-Dec-2018.
https://dl.acm.org/doi/10.1007/s00799-014-0115-1
Bui D, Del Fiol G, Jonnalagadda S (2016). PDF text classification to leverage information extraction from publication reports. Journal of Biomedical Informatics 61:C, 141-148. Online publication date: 1-Jun-2016.
https://dl.acm.org/doi/10.1016/j.jbi.2016.03.026
Hsiao W, Chang T, Thomas E (2014). Extracting bibliographical data for PDF documents with HMM and external resources. Program 48:3, 293-313. Online publication date: Jul-2014.
https://doi.org/10.1108/PROG-12-2011-0059
Klampfl S, Kern R (2013). An Unsupervised Machine Learning Approach to Body Text and Table of Contents Extraction from Digital Scientific Articles. Research and Advanced Technology for Digital Libraries, 144-155. Online publication date: 2013.
https://doi.org/10.1007/978-3-642-40501-3_15
Stegmaier F, Seifert C, Kern R, Höfler P, Bayerl S, Granitzer M, Kosch H, Lindstaedt S, Mutlu B, Sabol V, Schlegel K, Zwicklbauer S (2012). Unleashing Semantics of Research Data. Revised Selected Papers of the First Workshop on Specifying Big Data Benchmarks - Volume 8163, 103-112. Online publication date: 17-Dec-2012.
https://dl.acm.org/doi/10.1007/978-3-642-53974-9_10
