DOI: 10.1145/3395027.3419597
short-paper

COVIDSeer: Extending the CORD-19 Dataset

Published: 29 September 2020

Abstract

We develop an enhanced version of the CORD-19 dataset released by the Allen Institute for AI. Tools from the SeerSuite project are used to extract information from the original articles that is not directly provided in the CORD-19 dataset. We add 728 new abstracts, 70,102 figures, and 31,446 tables with captions that are not provided in the current data release. We also build COVIDSeer, a vertical search engine based on the new dataset. COVIDSeer has a relatively simple architecture with features such as keyword filtering and similar-paper recommendation. The goal is to provide a system and dataset that help scientists better navigate the literature concerning COVID-19. The enriched dataset can serve as a supplement to the existing dataset. The search engine, which offers keyphrase-enhanced search, is intended to help biomedical and life science researchers, medical students, and the general public explore coronavirus-related literature more effectively. The entire dataset and the system will be made open source.

    Published In

    DocEng '20: Proceedings of the ACM Symposium on Document Engineering 2020
    September 2020
    130 pages
    ISBN: 9781450380003
    DOI: 10.1145/3395027

    Publisher

    Association for Computing Machinery

    New York, NY, United States

    Author Tags

    1. data mining
    2. datasets
    3. information retrieval

    Qualifiers

    • Short-paper
    • Research
    • Refereed limited

    Conference

    DocEng '20
    DocEng '20: ACM Symposium on Document Engineering 2020
    September 29 - October 1, 2020
    CA, Virtual Event, USA

    Acceptance Rates

    Overall Acceptance Rate 194 of 564 submissions, 34%

    Article Metrics

    • Downloads (Last 12 months): 5
    • Downloads (Last 6 weeks): 1
    Reflects downloads up to 28 Dec 2024

    Cited By

    • (2024) Evaluation of the use of virtual simulators for training in problem-solving skills in university students. Salud, Ciencia y Tecnología 4 (1281). DOI: 10.56294/saludcyt20241281. Online publication date: 1-Jan-2024
    • (2023) Heterogeneous deep graph convolutional network with citation relational BERT for COVID-19 inline citation recommendation. Expert Systems with Applications: An International Journal 213:PA. DOI: 10.1016/j.eswa.2022.118841. Online publication date: 1-Mar-2023
    • (2021) Comprehensive Review and Future Research Directions on Dynamic Faceted Search. Applied Sciences 11:17 (8113). DOI: 10.3390/app11178113. Online publication date: 31-Aug-2021
    • (2021) BIP4COVID19: Releasing impact measures for articles relevant to COVID-19. Quantitative Science Studies 2:4 (1447-1465). DOI: 10.1162/qss_a_00169. Online publication date: 1-Dec-2021
    • (2021) Seer-Dock: A General-Purpose Dockerized Scholarly Document Collection and Management Framework. Proceedings of the 44th International ACM SIGIR Conference on Research and Development in Information Retrieval (2485-2490). DOI: 10.1145/3404835.3463251. Online publication date: 11-Jul-2021
