SciND: a new triplet-based dataset for scientific novelty detection via knowledge graphs

Published: 08 January 2024

Abstract

Detecting texts that contain semantically new information is not straightforward, and the problem becomes even more challenging for research articles. Over the years, many datasets and techniques have been developed for automatic novelty detection. However, the majority of existing textual novelty detection studies target general domains such as newswire, and a comprehensive dataset for scientific novelty detection is not available in the literature. In this paper, we present a new triplet-based corpus (SciND) for scientific novelty detection from research articles via knowledge graphs. The proposed dataset consists of three types of triplets: (i) triplets for the knowledge graph, (ii) novel triplets, and (iii) non-novel triplets. We build a scientific knowledge graph for research articles using triplets across several natural language processing (NLP) domains and extract novel triplets from papers published in 2021. For the non-novel class, we use blog post summaries of the research articles. Our knowledge graph is domain-specific; we build it for seven NLP domains. We further propose a feature-based novelty detection scheme for research articles as a baseline and show the applicability of our proposed dataset using this baseline algorithm, which yields an F1 score of 72%. We present an analysis and discuss the future scope of our proposed dataset. To the best of our knowledge, this is the first dataset for scientific novelty detection via a knowledge graph. We make our code and dataset publicly available at https://github.com/92Komal/Scientific_Novelty_Detection.
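To make the triplet-and-knowledge-graph setup concrete, the sketch below shows one minimal way a candidate triplet extracted from a new paper could be checked against a domain-specific knowledge graph. It is an illustration only, not the SciND implementation; the names DOMAIN_KG and extract_triplets, the example facts, and the exact-match novelty test are all assumptions.

```python
# Minimal sketch of triplet-based novelty checking against a domain knowledge graph.
# Illustration only; DOMAIN_KG, extract_triplets, and the example facts are hypothetical
# and are not taken from the SciND codebase.

from typing import List, Set, Tuple

Triplet = Tuple[str, str, str]  # (subject, relation, object)

# Hypothetical domain-specific knowledge graph built from prior NLP papers.
DOMAIN_KG: Set[Triplet] = {
    ("BERT", "is-a", "language model"),
    ("BERT", "pretrained-on", "BooksCorpus"),
    ("pointer-generator", "used-for", "abstractive summarization"),
}

def extract_triplets(text: str) -> List[Triplet]:
    """Placeholder for an information-extraction step (e.g., an OpenIE or
    BERT-based extractor) that turns sentences into (s, r, o) triplets."""
    raise NotImplementedError

def novel_triplets(candidates: List[Triplet], kg: Set[Triplet]) -> List[Triplet]:
    """Treat a candidate triplet as novel if it does not already appear in the
    domain knowledge graph (exact match; a real system would also account for
    semantic similarity between entities and relations)."""
    return [t for t in candidates if t not in kg]

if __name__ == "__main__":
    candidates = [
        ("BERT", "is-a", "language model"),              # already known, non-novel
        ("SciND", "is-a", "novelty detection dataset"),  # unseen, novel
    ]
    print(novel_triplets(candidates, DOMAIN_KG))
```

The feature-based baseline reported in the paper (72% F1) presumably relies on richer graph-derived features and a trained classifier rather than exact membership; the sketch only illustrates how the knowledge-graph triplets, novel triplets, and non-novel triplets relate to one another.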



Published In

International Journal on Digital Libraries, Volume 25, Issue 4
Dec 2024
294 pages

Publisher

Springer-Verlag

Berlin, Heidelberg

Publication History

Published: 08 January 2024
Accepted: 04 November 2023
Revision received: 03 November 2023
Received: 31 October 2022

Author Tags

  1. Scientific knowledge graph
  2. Information extraction
  3. Data preparation
  4. Novelty detection

Qualifiers

  • Research-article
