research-article

Mapping layperson medical terminology into the Human Phenotype Ontology using neural machine translation models

Authors: Enrico Manzini, Jon Garrido-Aguirre, Jordi Fonollosa, Alexandre Perera-Lluna

Volume 204, Issue C

https://doi.org/10.1016/j.eswa.2022.117446

Published: 15 October 2022

Abstract

In the medical domain, a terminological gap separates patients and caregivers from healthcare professionals. This gap may hinder communication between healthcare consumers and professionals, with negative emotional and clinical consequences. In this work, we build a machine learning-based tool for automatic translation between the terminology used by laypeople and that of the Human Phenotype Ontology (HPO), a structured vocabulary of phenotypic abnormalities found in human disease. Our method uses an HPO-specific embedding as the output vector space of a neural network model trained on vector representations of layperson versions and other textual descriptors of medical terms. We explored different output embeddings coupled with different neural network architectures for the machine translation stage. We compute a similarity measure to evaluate the ability of the model to assign an HPO term to a layperson input. The best-performing models achieved a similarity higher than 0.7 for more than 80% of the terms, with a median between 0.98 and 1. The translator model is available as a web application at https://hpotranslator.b2slab.upc.edu.
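The evaluation described in the abstract can be illustrated with a minimal sketch. It assumes that the model outputs a vector in the ontology embedding space, that the assigned term is the nearest ontology vector, and that cosine similarity is the measure; the abstract does not name the similarity measure, so this choice is an assumption. The function names, term identifiers, vectors, and scores below are hypothetical toy values, not data from the paper.

```python
import numpy as np


def nearest_term(pred, term_vectors, term_ids):
    """Return the ID and cosine similarity of the term vector closest to pred.

    Cosine similarity is an assumption made for illustration only.
    """
    unit_terms = term_vectors / np.linalg.norm(term_vectors, axis=1, keepdims=True)
    unit_pred = pred / np.linalg.norm(pred)
    sims = unit_terms @ unit_pred  # one cosine similarity per ontology term
    best = int(np.argmax(sims))
    return term_ids[best], float(sims[best])


def summarize(similarities, threshold=0.7):
    """Return the share of scores above the threshold and the median score."""
    sims = np.asarray(similarities, dtype=float)
    return float(np.mean(sims > threshold)), float(np.median(sims))


# Hypothetical three-term ontology embedded in a 3-dimensional space.
term_ids = ["TERM_A", "TERM_B", "TERM_C"]
term_vectors = np.array([
    [1.0, 0.0, 0.0],
    [0.0, 1.0, 0.0],
    [0.0, 0.0, 1.0],
])

# Hypothetical model output for one layperson expression.
pred = np.array([0.9, 0.1, 0.0])
best_id, best_sim = nearest_term(pred, term_vectors, term_ids)

# Hypothetical per-term similarity scores on a held-out set.
share_above, median_sim = summarize([1.0, 0.98, 0.75, 0.4, 1.0])
print(best_id, round(best_sim, 3), share_above, median_sim)
```

A summary of this kind (share of terms above a threshold, plus the median) matches the form in which the abstract reports its results, although the actual scores come from the authors' own similarity measure and test set.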

Highlights

• We propose a method to map lay expressions into the Human Phenotype Ontology.
• Inspired by machine translation, we design different deep learning architectures.
• We explore strategies to encode the semantic space of the terms in the ontology.
• We evaluate several combinations of models, hyperparameters, and output embeddings.
• Overall, the correct term is identified for 80% of the lay terms in the test set.


Cited By

Bacco L., Dell’Orletta F., Lai H., Merone M., Nissim M. (2023). A text style transfer system for reducing the physician–patient expertise gap. Expert Systems with Applications: An International Journal 233 (C). Online publication date: 15-Dec-2023. https://dl.acm.org/doi/10.1016/j.eswa.2023.120874

Index Terms

Mapping layperson medical terminology into the Human Phenotype Ontology using neural machine translation models
1. Applied computing
  1. Life and medical sciences
2. Computing methodologies
  1. Artificial intelligence
    1. Natural language processing
  2. Machine learning
    1. Machine learning approaches
      1. Neural networks

Index terms have been assigned to the content through auto-classification.


Published In

Expert Systems with Applications: An International Journal, Volume 204, Issue C

Oct 2022

1098 pages

ISSN: 0957-4174

The Author(s).

Publisher

Pergamon Press, Inc.

United States
