DOI:10.1145/1998076.1998140
research-article

A metadata geoparsing system for place name recognition and resolution in metadata records

Published: 13 June 2011

Abstract

This paper describes an approach for recognizing and resolving place names mentioned in the descriptive metadata records of typical digital libraries. Our approach exploits evidence provided by the existing structured attributes within the metadata records to support place name recognition and resolution, in order to achieve better results than by using only lexical evidence from the textual values of these attributes. In metadata records, lexical evidence is very often insufficient for this task, since short sentences and simple expressions predominate. Our implementation uses a dictionary-based technique for recognizing place names (with names provided by Geonames), and machine learning for reasoning on the evidence and choosing a resolution candidate. We evaluated our approach on data sets with a metadata schema rich in Dublin Core elements, using two methods. First, cross-validation showed that our solution can achieve a very high precision of 0.99 at 0.55 recall, or a recall of 0.79 at 0.86 precision. Second, in a comparative evaluation against an existing commercial service, our solution performed better at every confidence level (p < 0.001).
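
As a rough illustration of the two stages the abstract names, and not the authors' implementation, the sketch below shows dictionary-based recognition by greedy longest-match lookup, followed by turning structured-attribute evidence into features for a candidate. The gazetteer entries and their IDs, the `record_country` attribute, and the two features are all hypothetical; the abstract does not state which features or which learner the system uses.

```python
import re

# Toy gazetteer: lowercased name -> candidate entries. The IDs and fields are
# hypothetical stand-ins for records that would come from Geonames.
GAZETTEER = {
    "lisbon": [{"id": "pt-1", "country": "PT"}, {"id": "us-1", "country": "US"}],
    "new york": [{"id": "us-2", "country": "US"}],
    "york": [{"id": "gb-1", "country": "GB"}],
}


def recognize(text, gazetteer=GAZETTEER, max_len=3):
    """Return (surface form, candidates) pairs using greedy longest-match lookup."""
    tokens = re.findall(r"\w+", text)
    matches, i = [], 0
    while i < len(tokens):
        # Try the longest phrase starting at token i first, then shorter ones.
        for n in range(min(max_len, len(tokens) - i), 0, -1):
            phrase = " ".join(tokens[i:i + n])
            candidates = gazetteer.get(phrase.lower())
            if candidates:
                matches.append((phrase, candidates))
                i += n
                break
        else:
            i += 1  # no place name starts at this token
    return matches


def features(element, record_country, candidate):
    """Assumed evidence for one candidate, taken from the record's structure."""
    return {
        # Did the name appear in the Dublin Core coverage element?
        "in_coverage_element": element == "dc:coverage",
        # Does the candidate agree with a (hypothetical) country attribute?
        "country_matches_record": candidate["country"] == record_country,
    }
```

Calling `recognize("Map of New York and Lisbon")` yields one match for "New York" and one ambiguous match for "Lisbon" with two candidates. A trained classifier would then score each candidate's feature dictionary to choose a resolution, or to reject the match when confidence is low.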

Published In

JCDL '11: Proceedings of the 11th annual international ACM/IEEE joint conference on Digital libraries
June 2011
500 pages
ISBN:9781450307444
DOI:10.1145/1998076

Publisher

Association for Computing Machinery

New York, NY, United States

Author Tags

  1. entity recognition
  2. entity resolution
  3. geographic information
  4. information extraction
  5. metadata

Qualifiers

  • Research-article

Conference

JCDL '11
JCDL '11: Joint Conference on Digital Libraries
June 13 - 17, 2011
Ottawa, Ontario, Canada

Acceptance Rates

Overall Acceptance Rate 415 of 1,482 submissions, 28%
