Research outputs by CORE

Writing about CORE? Discover our research outputs and cite our work. Below we have listed key articles about CORE by the Big Scientific Data and Text Analytics group (BSDTAG).

CORE reference articles
CORE aggregation approaches and infrastructure
Selected AI/ML papers
CORE Recommender
CORE repositories dashboard
CORE and download statistics
Supporting research assessment and evaluation

Find more of our research outputs on the Big Scientific Data and Text Analytics group website.

CORE vision and mission

If you use CORE in your work, we kindly ask you to cite one of our reference publications.

CORE: A Global Aggregation Service for Open Access Papers

P. Knoth, D. Herrmannova, M. Cancellieri, L. Anastasiou, N. Pontika, S. Pearce, B. Gyawali and D. Pride. In Nature Scientific Data 10, 366 (2023)

This paper introduces CORE, describes CORE’s continuously growing dataset and the motivation behind its creation, presents the challenges associated with systematically gathering research papers from thousands of data providers worldwide at scale and outlines the solutions developed to overcome these challenges. It provides an in-depth discussion of the services and tools built on top of the indexed content and finally examines several use cases that have leveraged the CORE dataset and services.

CORE: three access levels to underpin open access

P. Knoth and Z. Zdrahal. D-Lib Magazine, 18 (11/12)

This paper sets out the vision for the CORE service: worldwide aggregation of all open access research literature (built on top of the OAI-PMH protocol for metadata indexing and other protocols). It sets the mission of providing three access levels (access at the granularity of papers, analytical access and access to raw data) via CORE.

CORE aggregation approaches and infrastructure

From open access metadata to open access content: two principles for increased visibility of open access content

P. Knoth. In Open Repositories 2013

This paper describes the two principles that should be followed to ensure that content can be properly indexed from repositories. This paper could be of great interest to repository managers.

CORE: aggregation use cases for open access

P. Knoth and Z. Zdrahal. In 2nd International Workshop on Mining Scientific Publications (WOSP 2013)

This paper describes the use cases that must be supported by open access aggregators, establishes the CORE use case and demonstrates the benefits of open access content aggregators.

Aggregating Research Papers from Publishers? Systems to Support Text and Data Mining: Deliberate Lack of Interoperability or Not?

P. Knoth and N. Pontika. In R. Eckart de Castilho, S. Ananiadou, T. Margoni, W. Peters and S. Piperidis (Eds.) INTEROP2016

This paper describes the technical challenges relating to machine interfaces, the interoperability issues in obtaining open access content and the complications of achieving harmonisation across repositories’ and publishers’ systems.

CORE: connecting repositories in the open access domain

P. Knoth and Z. Zdrahal. In CERN Workshop on Innovations in Scholarly Communication (OAI7)

Connecting repositories in the open access domain using text mining and semantic data

P. Knoth, V. Robotka and Z. Zdrahal. In Research and Advanced Technology for Digital Libraries

This paper describes the CORE system in its early stages with a focus on the original idea of the CORE recommender.

AI/ML papers

CORE-GPT: Combining Open Access research and large language models for credible, trustworthy question answering

D. Pride, M. Cancellieri and P. Knoth. In arXiv preprint arXiv:2307.04683 (2023)

In this paper, we present CORE-GPT, a novel question-answering platform that combines GPT-based language models and more than 32 million full-text open access scientific articles from CORE. We first demonstrate that GPT3.5 and GPT4 cannot be relied upon to provide references or citations for generated text. We then introduce CORE-GPT which delivers evidence-based answers to questions, along with citations and links to the cited papers, greatly increasing the trustworthiness of the answers and reducing the risk of hallucinations. CORE-GPT's performance was evaluated on a dataset of 100 questions covering the top 20 scientific domains in CORE, resulting in 100 answers and links to 500 relevant articles. The quality of the provided answers and relevance of the links were assessed by two annotators. Our results demonstrate that CORE-GPT can produce comprehensive and trustworthy answers across the majority of scientific domains, complete with links to genuine, relevant scientific articles.

Predicting article quality scores with machine learning: The UK Research Excellence Framework

M. Thelwall, Kayvan Kousha, Paul Wilson, Meiko Makita, Mahshid Abdoli, Emma Stuart, Jonathan Levitt, Petr Knoth and Matteo Cancellieri. In Quantitative Science Studies, 2023/5/1.

National research evaluation initiatives and incentive schemes choose between simplistic quantitative indicators and time-consuming peer/expert review, sometimes supported by bibliometrics. Here we assess whether machine learning could provide a third alternative, estimating article quality using more multiple bibliometric and metadata inputs. We investigated this using provisional three-level REF2021 peer review scores for 84,966 articles submitted to the U.K. Research Excellence Framework 2021, matching a Scopus record 2014–18 and with a substantial abstract. We found that accuracy is highest in the medical and physical sciences Units of Assessment (UoAs) and economics, reaching 42% above the baseline (72% overall) in the best case. This is based on 1,000 bibliometric inputs and half of the articles used for training in each UoA. Prediction accuracies above the baseline for the social science, mathematics, engineering, arts, and humanities UoAs were much lower or close to zero. The Random Forest Classifier (standard or ordinal) and Extreme Gradient Boosting Classifier algorithms performed best from the 32 tested. Accuracy was lower if UoAs were merged or replaced by Scopus broad categories. We increased accuracy with an active learning strategy and by selecting articles with higher prediction probabilities, but this substantially reduced the number of scores predicted.

Dynamic Context Extraction for Citation Classification

S. Nambanoor Kunnath, David Pride and Petr Knoth. In Proceedings of the 2nd Conference of the Asia-Pacific Chapter of the Association for Computational Linguistics and the 12th International Joint Conference on Natural Language Processing, 2022/11.

We investigate the effect of varying citation context window sizes on model performance in citation intent classification. Prior studies have been limited to the application of fixed-size contiguous citation contexts or the use of manually curated citation contexts. We introduce a new automated unsupervised approach for the selection of a dynamic-size and potentially non-contiguous citation context, which utilises the transformer-based document representations and embedding similarities. Our experiments show that the addition of non-contiguous citing sentences improves performance beyond previous results. Evaluating on the (1) domain-specific (ACL-ARC) and (2) the multi-disciplinary (SDP-ACT) dataset demonstrates that the inclusion of additional context beyond the citing sentence significantly improves the citation classification model’s performance, irrespective of the dataset’s domain.

Benchmark for research theme classification of scholarly documents

Ó. E Mendoza, Wojciech Kusa, Alaa El-Ebshihy, Ronin Wu, David Pride, Petr Knoth, Drahomira Herrmannova, Florina Piroi, Gabriella Pasi and Allan Hanbury. In Proceedings of the Third Workshop on Scholarly Document Processing, 2022/10.

We present a new gold-standard dataset and a benchmark for the Research Theme Identification task, a sub-task of the Scholarly Knowledge Graph Generation shared task, at the 3rd Workshop on Scholarly Document Processing. The objective of the shared task was to label given research papers with research themes from a total of 36 themes. The benchmark was compiled using data drawn from the largest overall assessment of university research output ever undertaken globally (the Research Excellence Framework-2014). We provide a performance comparison of a transformer-based ensemble, which obtains multiple predictions for a research paper, given its multiple textual fields (e.g. title, abstract, reference), with traditional machine learning models. The ensemble involves enriching the initial data with additional information from open-access digital libraries and Argumentative Zoning techniques. It uses a weighted sum aggregation for the multiple predictions to obtain a final single prediction for the given research paper.

CORE Recommender

Towards effective research recommender systems for repositories

P. Knoth, L. Anastasiou, A. Charalampous, M. Cancellieri, S. Pearce, N. Pontika and V. Bayer. In Open Repositories 2017

In this paper, we argue why and how the integration of recommender systems for research can enhance the functionality and user experience in repositories.

Automatic generation of inter-passage links based on semantic similarity

P. Knoth, J. Novotny and Z. Zdrahal. In Computational Linguistics (COLING 2010)

This paper describes the algorithm that was used in the original version of the CORE recommender.

CORE repositories dashboard

Developing Infrastructure to Support Closer Collaboration of Aggregators with Open Repositories

N. Pontika, P. Knoth, M. Cancellieri and S. Pearce. LIBER Quarterly, 25 (4)

This paper presents the CORE Repositories Dashboard, a tool designed primarily for repository managers. It describes how the Dashboard improves the quality of the indexed papers, advances collaboration between repository managers and CORE, enables straightforward management of repository collections and enhances the transparency of the indexed content.

CORE and download statistics

Integration of the IRUS-UK Statistics in the CORE Repositories Dashboard

S. Pearce and N. Pontika

This poster presents the integration of the IRUS-UK service with the CORE Repositories Dashboard tool, which enables repository managers to access reliable download statistics for the full-text papers indexed by CORE.

My repository is being aggregated: a blessing or a curse?

P. Knoth, L. Anastasiou and S. Pearce. In Open Repositories 2014 (OR2014)

This paper describes the collaboration between aggregators and repositories in terms of sharing download usage statistics.

Supporting research assessment and evaluation

An Authoritative Approach to Citation Classification

D. Pride and P. Knoth. In Proceedings of the ACM/IEEE Joint Conference on Digital Libraries in 2020

The paper introduces the new multi-disciplinary ACT dataset, which is currently the largest of its kind, annotated by the authors themselves.

ACT: An Annotation Platform for Citation Typing at Scale

D. Pride, P. Knoth and J. Harag. In 2019 ACM/IEEE Joint Conference on Digital Libraries (JCDL)

This paper describes an online tool for annotating the citations in a research paper according to their purpose. Citations are also annotated based on how influential they are for research assessment.

Overview of the 2020 WOSP 3C Citation Context Classification Task

S. N. Kunnath, D. Pride, B. Gyawali and P. Knoth. In Proceedings of the 8th International Workshop on Mining Scientific Publications

The overview paper highlights findings from the first edition of the 3C Citation Context Classification task, organised as part of the workshop, WOSP 2020.