Unsupervised Term Extraction for Highly Technical Domains

Fusco, Francesco; Staar, Peter; Antognini, Diego

arXiv:2210.13118 (cs)

[Submitted on 24 Oct 2022]

Abstract: Term extraction is an information extraction task at the root of knowledge discovery platforms. Developing term extractors that are able to generalize across very diverse and potentially highly technical domains is challenging, as annotations for domains requiring in-depth expertise are scarce and expensive to obtain. In this paper, we describe the term extraction subsystem of a commercial knowledge discovery platform that targets highly technical fields such as pharma, medical, and material science. To be able to generalize across domains, we introduce a fully unsupervised annotator (UA). It extracts terms by combining novel morphological signals from sub-word tokenization with term-to-topic and intra-term similarity metrics, computed using general-domain pre-trained sentence-encoders. The annotator is used to implement a weakly-supervised setup, where transformer-models are fine-tuned (or pre-trained) over the training data generated by running the UA over large unlabeled corpora. Our experiments demonstrate that our setup can improve the predictive performance while decreasing the inference latency on both CPUs and GPUs. Our annotators provide a very competitive baseline for all the cases where annotations are not available.
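
To make the three signals named in the abstract concrete, the following is a minimal, illustrative sketch and not the authors' implementation. It assumes that the morphological signal can be approximated by how many sub-word pieces a word splits into, and it replaces the pre-trained sentence encoder with a toy character-trigram embedding so the example runs without external models. The vocabulary `VOCAB` and the helpers `subword_pieces`, `embed`, `cosine`, and `score_term` are all hypothetical names introduced here. The abstract does not say how the signals are combined, so the sketch returns them separately.

```python
import math
from collections import Counter

# Hypothetical toy vocabulary; a real system would use a trained sub-word tokenizer.
VOCAB = {"the", "of", "a", "was", "study", "in", "rate", "poly", "mer", "iz", "ation"}

def subword_pieces(word, vocab=VOCAB):
    """Greedy longest-match split; unmatched characters become one-character pieces."""
    pieces, i = [], 0
    while i < len(word):
        for j in range(len(word), i, -1):
            if word[i:j] in vocab or j == i + 1:
                pieces.append(word[i:j])
                i = j
                break
    return pieces

def embed(text):
    """Stand-in for a sentence encoder (assumption): character-trigram counts."""
    padded = f" {text.lower()} "
    return Counter(padded[i:i + 3] for i in range(len(padded) - 2))

def cosine(u, v):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(u[k] * v[k] for k in u.keys() & v.keys())
    norm = math.sqrt(sum(x * x for x in u.values())) * math.sqrt(sum(x * x for x in v.values()))
    return dot / norm if norm else 0.0

def score_term(term, topic):
    """Return the three illustrative signals for a candidate term."""
    words = term.split()
    # Morphological signal (assumed form): average number of sub-word pieces per word.
    fragmentation = sum(len(subword_pieces(w.lower())) for w in words) / len(words)
    # Term-to-topic similarity: candidate term versus a short topic description.
    topicality = cosine(embed(term), embed(topic))
    # Intra-term similarity: mean similarity of adjacent words, 1.0 for single words.
    pairs = list(zip(words, words[1:]))
    cohesion = sum(cosine(embed(a), embed(b)) for a, b in pairs) / len(pairs) if pairs else 1.0
    return {"fragmentation": fragmentation, "topicality": topicality, "cohesion": cohesion}

if __name__ == "__main__":
    topic = "polymerization kinetics"
    for candidate in ["polymerization rate", "the"]:
        print(candidate, score_term(candidate, topic))
```

In this toy setting a word such as "polymerization" fragments into several pieces and overlaps with the topic description, while a common word such as "the" stays whole and shares nothing with it. A real pipeline would swap in an actual sub-word tokenizer and sentence encoder and would need a rule, not given in the abstract, for turning the signals into a term decision.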

Comments:	Accepted at EMNLP 2022 (industry). 8 pages, 3 figures, 3 tables
Subjects:	Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Cite as:	arXiv:2210.13118 [cs.CL]
	(or arXiv:2210.13118v1 [cs.CL] for this version)
	https://doi.org/10.48550/arXiv.2210.13118

Submission history

From: Diego Antognini
[v1] Mon, 24 Oct 2022 11:08:09 UTC (1,050 KB)
