BERT fine-tuned CORD-19 NER Dataset
- Submitted by:
- Andres Frederic
- Last updated:
- Sun, 07/09/2023 - 12:37
- DOI:
- 10.21227/m7gj-ks21
Abstract
This named-entity dataset was built by applying the widely used Large Language Model (LLM) BERT to the CORD-19 biomedical literature corpus. By fine-tuning pre-trained BERT on the CORD-NER dataset, the model learns the context and semantics of biomedical named entities. The fine-tuned model is then applied to CORD-19 to extract more contextually relevant and up-to-date named entities. Fine-tuning an LLM on such a large corpus is expensive, so two sampling strategies are used. First, for the NER task on CORD-19, Latent Dirichlet Allocation (LDA) topic modeling selects thematically related content while preserving sentence structure. Second, a simple greedy method gathers the most informative examples covering 25 entity types from the CORD-NER dataset.
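The LDA-based sampling step described above can be sketched as follows. This is a minimal illustration using scikit-learn; the toy corpus, topic count, and the choice of keeping one dominant topic are assumptions for demonstration, not the dataset's actual configuration.

```python
# Illustrative LDA topic-based document selection (assumed library:
# scikit-learn; corpus snippets and parameters are toy examples).
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

docs = [
    "coronavirus spike protein binds the ACE2 receptor",
    "antiviral drug remdesivir inhibits RNA polymerase",
    "spike protein mutations affect receptor binding affinity",
    "clinical trial of remdesivir in hospitalized patients",
]

# Bag-of-words representation of the corpus.
vectorizer = CountVectorizer(stop_words="english")
X = vectorizer.fit_transform(docs)

# Fit LDA and assign each document to its dominant topic.
lda = LatentDirichletAllocation(n_components=2, random_state=0)
doc_topics = lda.fit_transform(X)
dominant = doc_topics.argmax(axis=1)

# Keep whole documents from one topic: related content is concentrated
# while sentence structure stays intact.
selected = [d for d, t in zip(docs, dominant) if t == dominant[0]]
print(selected)
```

In the actual pipeline the retained subset would then be fed to BERT fine-tuning rather than printed.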
This NER dataset can be used with any supervised, unsupervised, or deep learning approach.
Because the dataset is auto-generated by the BERT model, it is not 100% accurate. This property also makes it useful for approaches designed to handle noisy or weakly labeled data.
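For a supervised use of the dataset, a loader like the following might be a starting point. The two-column CoNLL-style layout and the example tokens and tags shown here are assumptions about the file format, not a confirmed specification of this dataset.

```python
# Sketch: parse (token, tag) sentence pairs from a CoNLL-style file.
# The format and labels below are illustrative assumptions.
from io import StringIO

sample = StringIO(
    "SARS-CoV-2 B-VIRUS\n"
    "infects O\n"
    "lung B-ORGAN\n"
    "tissue I-ORGAN\n"
    "\n"
    "Remdesivir B-CHEMICAL\n"
    "helps O\n"
)

def read_conll(fh):
    """Yield (tokens, tags) pairs, one per sentence."""
    tokens, tags = [], []
    for line in fh:
        line = line.strip()
        if not line:  # blank line terminates a sentence
            if tokens:
                yield tokens, tags
            tokens, tags = [], []
        else:
            tok, tag = line.split()
            tokens.append(tok)
            tags.append(tag)
    if tokens:  # flush a trailing sentence with no final blank line
        yield tokens, tags

sentences = list(read_conll(sample))
print(sentences)
```

Since the labels are model-generated, downstream training code may want to treat them as noisy supervision rather than gold annotations.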