DOI: 10.1145/3139958.3139986

Effective Scalable and Integrative Geocoding for Massive Address Datasets

Published: 07 November 2017

Abstract

With the increased accessibility of large-scale open data, public health studies can take advantage of integrative spatial big data to increase spatial resolution to the community or neighborhood level. One critical piece of information for such studies is the large number of patient addresses, which are private and highly sensitive. Geocoding such massive private address datasets poses major challenges for public health researchers. Many geocoders provide only Web APIs, which require sending private addresses over the Internet and are therefore not feasible. Commercial geocoders require high licensing fees and often limit daily usage, which is a major hurdle for researchers. Scalability is another major challenge for large-scale address datasets. In this paper, we present EaserGeocoder, a novel open source geocoder for effectively geocoding massive address datasets. EaserGeocoder takes an integrative approach by using multiple references based on open address data sources contributed by governments or communities. It takes a machine learning approach to automatically find the best answer from candidates produced by multiple references. The system provides high scalability through parallel processing. Our comparative studies demonstrate that EaserGeocoder outperforms open source geocoders and is comparable to commercial ones in terms of both accuracy and error. It provides a cost-effective and feasible solution for large-scale public health studies.
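
The pipeline outlined in the abstract (several reference sources produce candidates, a model picks the best one, and batches run in parallel) can be illustrated with a minimal sketch. Everything below is an assumption made for illustration: the `Candidate` type, the `REFERENCES` data, the made-up coordinates, and the function names are hypothetical, and a plain string-similarity score stands in for the learned model, whose details the abstract does not give. It is not EaserGeocoder's implementation.

```python
from __future__ import annotations

from concurrent.futures import ThreadPoolExecutor
from dataclasses import dataclass
from difflib import SequenceMatcher


@dataclass(frozen=True)
class Candidate:
    source: str   # which reference dataset produced this candidate
    address: str  # matched reference address (upper case)
    lat: float
    lon: float


# Hypothetical in-memory reference datasets standing in for real open
# address sources; addresses and coordinates are invented for this sketch.
REFERENCES = {
    "ref_a": [Candidate("ref_a", "100 MAIN ST ALBANY NY", 42.65, -73.75)],
    "ref_b": [
        Candidate("ref_b", "100 MAIN STREET ALBANY NY", 42.651, -73.751),
        Candidate("ref_b", "100 MAPLE ST ALBANY NY", 42.70, -73.80),
    ],
}


def score(query: str, cand: Candidate) -> float:
    """Placeholder for a learned ranking model: string similarity in [0, 1]."""
    return SequenceMatcher(None, query.upper(), cand.address).ratio()


def geocode(query: str) -> Candidate | None:
    """Gather candidates from every reference and return the best-scoring one."""
    candidates = [c for ref in REFERENCES.values() for c in ref]
    return max(candidates, key=lambda c: score(query, c), default=None)


def geocode_batch(queries: list[str], workers: int = 4) -> list[Candidate | None]:
    """Geocode many addresses concurrently; results keep the input order.

    Threads keep the sketch simple; a CPU-bound workload would more likely
    use processes or a cluster framework.
    """
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(geocode, queries))
```

Because all lookups run locally against the reference data, no address leaves the machine, which is the privacy property the abstract emphasizes.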


    Published In

    SIGSPATIAL '17: Proceedings of the 25th ACM SIGSPATIAL International Conference on Advances in Geographic Information Systems
    November 2017
    677 pages
    ISBN:9781450354905
    DOI:10.1145/3139958

    Publisher

    Association for Computing Machinery

    New York, NY, United States

    Author Tags

    1. Geocoding
    2. Geographic Information System
    3. Text Searching

    Qualifiers

    • Research-article
    • Research
    • Refereed limited

    Conference

    SIGSPATIAL'17

    Acceptance Rates

    SIGSPATIAL '17 Paper Acceptance Rate 39 of 193 submissions, 20%;
    Overall Acceptance Rate 257 of 1,238 submissions, 21%
