Distilling Named Entity Recognition Models for Endangered Species from Large Language Models

Atuhurra, Jesse; Dujohn, Seiveright Cargill; Kamigaito, Hidetaka; Shindo, Hiroyuki; Watanabe, Taro

Computer Science > Computation and Language

arXiv:2403.15430 (cs)

[Submitted on 13 Mar 2024]

Abstract: Natural language processing (NLP) practitioners are leveraging large language models (LLMs) to create structured datasets from semi-structured and unstructured sources such as patents, papers, and theses, without needing domain-specific knowledge. At the same time, ecological experts are searching for ways to preserve biodiversity. To contribute to these efforts, we focused on endangered species and distilled knowledge from GPT-4 through in-context learning. We created datasets for both named entity recognition (NER) and relation extraction (RE) via a two-stage process: 1) we generated synthetic data for four classes of endangered species with GPT-4, and 2) humans verified the factual accuracy of the synthetic data, resulting in gold data. The resulting dataset contains a total of 3.6K sentences, evenly divided between 1.8K NER and 1.8K RE sentences. Because GPT-4 is resource intensive, we then used the dataset to fine-tune both general BERT and domain-specific BERT variants, completing the knowledge distillation from GPT-4 to BERT. Experiments show that our knowledge transfer approach is effective at creating an NER model suitable for detecting endangered species in text.
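As an illustration of the two-stage process described in the abstract, the following is a minimal sketch of how human-verified sentences might be filtered and converted into token-level BIO tags for NER fine-tuning. It is not the authors' code: the `Candidate` record, the `verified` flag, the single `SPECIES` label, the whitespace tokenization, and the example sentences are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass
class Candidate:
    """One synthetic sentence (hypothetical record layout)."""
    sentence: str
    species: str    # entity mention expected to appear in the sentence
    verified: bool  # set by a human annotator in the verification stage


def to_bio(sentence, mention, label="SPECIES"):
    """Tag the first occurrence of `mention` with B-/I- tags, everything else O.

    Uses naive whitespace tokenization and ignores trailing punctuation
    when matching; `label` is an assumed tag name.
    """
    tokens = sentence.split()
    target = mention.split()
    tags = ["O"] * len(tokens)
    for i in range(len(tokens) - len(target) + 1):
        window = [t.strip(".,;:") for t in tokens[i:i + len(target)]]
        if window == target:
            tags[i] = f"B-{label}"
            for j in range(i + 1, i + len(target)):
                tags[j] = f"I-{label}"
            break
    return list(zip(tokens, tags))


def build_gold(candidates):
    """Keep only human-verified sentences and convert them to BIO format."""
    return [to_bio(c.sentence, c.species) for c in candidates if c.verified]


# Made-up example inputs: one accepted by the annotator, one rejected.
candidates = [
    Candidate("The Amur leopard lives in the Russian Far East.", "Amur leopard", True),
    Candidate("The kakapo is a species of fish.", "kakapo", False),
]
gold = build_gold(candidates)
for token, tag in gold[0]:
    print(token, tag)
```

Token and tag pairs in this form are a common input for fine-tuning BERT-style token classifiers, although a real pipeline would align tags with the model's subword tokenizer rather than split on whitespace.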

Subjects:	Computation and Language (cs.CL)
Cite as:	arXiv:2403.15430 [cs.CL]
	(or arXiv:2403.15430v1 [cs.CL] for this version)
	https://doi.org/10.48550/arXiv.2403.15430

Submission history

From: Jesse Atuhurra
[v1] Wed, 13 Mar 2024 15:38:55 UTC (3,575 KB)