DOI: 10.1145/3308558.3313685
research-article

Google Dataset Search: Building a search engine for datasets in an open Web ecosystem

Published: 13 May 2019

Abstract

There are thousands of data repositories on the Web, providing access to millions of datasets. National and regional governments, scientific publishers and consortia, commercial data providers, and others publish data for fields ranging from social science to life science to high-energy physics to climate science and more. Access to this data is critical to facilitating reproducibility of research results, enabling scientists to build on others' work, and giving data journalists easier access to information and its provenance. In this paper, we discuss Google Dataset Search, a dataset-discovery tool that provides search capabilities over potentially all datasets published on the Web. The approach relies on an open ecosystem, where dataset owners and providers publish semantically enhanced metadata on their own sites. We then aggregate, normalize, and reconcile this metadata, providing a search engine that lets users find datasets in the “long tail” of the Web. We also discuss the social and technical challenges of building this type of tool and the lessons we learned from the experience.
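
To make the publish-then-aggregate idea concrete, here is a minimal sketch of the first two steps. It assumes, for illustration only, that the "semantically enhanced metadata" is schema.org `Dataset` markup embedded in a page as JSON-LD; the abstract itself does not name a format. The sample page, the class `JsonLdExtractor`, and the functions `extract_datasets` and `normalize` are hypothetical names invented for this example and are not the system described in the paper. Reconciliation across sites is not shown.

```python
import json
from html.parser import HTMLParser


class JsonLdExtractor(HTMLParser):
    """Collects the text of every <script type="application/ld+json"> block."""

    def __init__(self):
        super().__init__()
        self._in_jsonld = False
        self._buffer = []
        self.blocks = []

    def handle_starttag(self, tag, attrs):
        if tag == "script" and dict(attrs).get("type") == "application/ld+json":
            self._in_jsonld = True
            self._buffer = []

    def handle_data(self, data):
        # Script content may arrive in several chunks, so buffer it.
        if self._in_jsonld:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag == "script" and self._in_jsonld:
            self.blocks.append("".join(self._buffer))
            self._in_jsonld = False


def extract_datasets(html):
    """Return every JSON-LD object in the page whose @type is "Dataset"."""
    parser = JsonLdExtractor()
    parser.feed(html)
    parser.close()
    datasets = []
    for block in parser.blocks:
        try:
            doc = json.loads(block)
        except json.JSONDecodeError:
            continue  # skip malformed markup instead of failing the whole page
        items = doc if isinstance(doc, list) else [doc]
        for item in items:
            if isinstance(item, dict) and item.get("@type") == "Dataset":
                datasets.append(item)
    return datasets


def normalize(record):
    """Hypothetical normalization: trim strings, coerce keywords to a list."""
    keywords = record.get("keywords", [])
    if isinstance(keywords, str):
        keywords = [k.strip() for k in keywords.split(",") if k.strip()]
    return {
        "name": str(record.get("name", "")).strip(),
        "description": str(record.get("description", "")).strip(),
        "url": record.get("url"),
        "keywords": keywords,
    }


# Invented sample page; the dataset and URL are placeholders.
SAMPLE_PAGE = """
<html><head>
<script type="application/ld+json">
{"@context": "https://schema.org/", "@type": "Dataset",
 "name": " Example rainfall measurements ",
 "description": "Hypothetical dataset used only for illustration.",
 "url": "https://example.org/datasets/rainfall",
 "keywords": "rainfall, climate"}
</script>
</head><body></body></html>
"""

records = [normalize(d) for d in extract_datasets(SAMPLE_PAGE)]
print(records)
```

Keeping extraction and normalization as separate functions mirrors the separation in the abstract between what providers publish and what the aggregator does with it; a real pipeline would also have to handle other markup syntaxes and conflicting descriptions of the same dataset.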


Published In

WWW '19: The World Wide Web Conference
May 2019
3620 pages
ISBN: 9781450366748
DOI: 10.1145/3308558
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from ACM.

In-Cooperation

  • IW3C2: International World Wide Web Conference Committee

Publisher

Association for Computing Machinery

New York, NY, United States

Author Tags

  1. data discovery
  2. metadata
  3. search
  4. structured data

Qualifiers

  • Research-article
  • Research
  • Refereed limited

Conference

WWW '19: The Web Conference
May 13 - 17, 2019
San Francisco, CA, USA

Acceptance Rates

Overall Acceptance Rate 1,899 of 8,196 submissions, 23%

Article Metrics

  • Downloads (Last 12 months): 279
  • Downloads (Last 6 weeks): 32
Reflects downloads up to 29 Jan 2025
