GP-GPT: Large Language Model for Gene-Phenotype Mapping

Lyu, Yanjun; Wu, Zihao; Zhang, Lu; Zhang, Jing; Li, Yiwei; Ruan, Wei; Liu, Zhengliang; Yu, Xiaowei; Cao, Chao; Chen, Tong; Chen, Minheng; Zhuang, Yan; Li, Xiang; Liu, Rongjie; Huang, Chao; Li, Wentao; Liu, Tianming; Zhu, Dajiang

arXiv:2409.09825 (cs)

[Submitted on 15 Sep 2024 (v1), last revised 27 Sep 2024 (this version, v2)]

Abstract: Pre-trained large language models (LLMs) have attracted increasing attention in biomedical domains due to their success in natural language processing. However, the complex traits and heterogeneity of multi-source genomics data pose significant challenges when adapting these models to the bioinformatics and biomedical fields. To address these challenges, we present GP-GPT, the first specialized large language model for genetic-phenotype knowledge representation and genomics relation analysis. Our model is fine-tuned in two stages on a comprehensive corpus composed of over 3,000,000 terms in genomics, proteomics, and medical genetics, derived from multiple large-scale validated datasets and scientific publications. GP-GPT demonstrates proficiency in accurately retrieving medical genetics information and performing common genomics analysis tasks, such as genomics information retrieval and relationship determination. Comparative experiments across domain-specific tasks reveal that GP-GPT outperforms state-of-the-art LLMs, including Llama2, Llama3 and GPT-4. These results highlight GP-GPT's potential to enhance genetic disease relation research and facilitate accurate and efficient analysis in the fields of genomics and medical genetics. Our investigation also revealed subtle changes in the representations of bio-factor entities within GP-GPT, suggesting opportunities for applying LLMs to advance gene-phenotype research.

Subjects:	Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Cite as:	arXiv:2409.09825 [cs.CL]
	(or arXiv:2409.09825v2 [cs.CL] for this version)
	https://doi.org/10.48550/arXiv.2409.09825

Submission history

From: Yanjun Lyu
[v1] Sun, 15 Sep 2024 18:56:20 UTC (3,922 KB)
[v2] Fri, 27 Sep 2024 20:26:15 UTC (3,922 KB)
