Mapping layperson medical terminology into the Human Phenotype Ontology using neural machine translation models

Published: 15 October 2022

Abstract

In the medical domain, there is a terminological gap between patients and caregivers on one side and healthcare professionals on the other. This gap may hinder communication between healthcare consumers and professionals, with negative emotional and clinical consequences. In this work, we build a machine learning-based tool for automatic translation between the terminology used by laypeople and that of the Human Phenotype Ontology (HPO), a structured vocabulary of phenotypic abnormalities found in human disease. Our method uses an HPO-specific embedding as the output vector space of a neural network model trained on vector representations of layperson versions and other textual descriptors of medical terms. We explored different output embeddings coupled with different neural network architectures for the machine translation stage. To evaluate the ability of the model to assign an HPO term to a layperson input, we compute a similarity measure. The best-performing models achieved a similarity higher than 0.7 for more than 80% of the terms, with a median between 0.98 and 1. The translator model is available as a web application at https://hpotranslator.b2slab.upc.edu.
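As a rough illustration of the evaluation idea described in the abstract, the sketch below maps a predicted output vector to its nearest HPO term and then summarizes a list of similarity scores by the share above 0.7 and the median. It is not the authors' implementation. The abstract does not say which similarity measure or embedding is used, so cosine similarity, the three-dimensional toy vectors, and the function names `cosine`, `nearest_term`, and `summarize` are all assumptions made for this example.

```python
import math
from statistics import median


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (assumed measure)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # treat a zero vector as dissimilar to everything
    return dot / (norm_u * norm_v)


def nearest_term(predicted, term_vectors):
    """Return the term ID whose embedding is most similar to `predicted`."""
    return max(term_vectors, key=lambda t: cosine(predicted, term_vectors[t]))


def summarize(similarities, threshold=0.7):
    """Return (fraction of scores above `threshold`, median score)."""
    above = sum(1 for s in similarities if s > threshold) / len(similarities)
    return above, median(similarities)


# Toy output embedding: the vectors are invented for illustration only.
hpo_vectors = {
    "HP:0001250": [1.0, 0.0, 0.0],  # Seizure
    "HP:0002315": [0.0, 1.0, 0.0],  # Headache
    "HP:0001945": [0.0, 0.0, 1.0],  # Fever
}

# Hypothetical model outputs for two lay expressions, with their true terms.
predictions = {
    "fits": ([0.9, 0.1, 0.0], "HP:0001250"),
    "sore head": ([0.2, 0.8, 0.1], "HP:0002315"),
}

scores = []
for lay_term, (vector, true_id) in predictions.items():
    assigned = nearest_term(vector, hpo_vectors)
    score = cosine(vector, hpo_vectors[true_id])
    scores.append(score)
    print(f"{lay_term!r} -> {assigned} (similarity to true term: {score:.3f})")

share_above, med = summarize(scores)
print(f"share above 0.7: {share_above:.2f}, median: {med:.3f}")
```

In a real setting the output vectors would come from the trained translation model, and the term vectors from whichever HPO-specific embedding is being evaluated.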

Highlights

We propose a method to map lay expressions into the Human Phenotype Ontology.
Inspired by machine translation, we design different deep learning architectures.
We explore strategies to encode the semantic space of the terms in the ontology.
We evaluate several combinations of models, hyperparameters, and output embeddings.
Overall, the correct term is identified for 80% of the lay terms in the test set.

        Published In

        Expert Systems with Applications: An International Journal  Volume 204, Issue C
        Oct 2022
        1098 pages

        Publisher

        Pergamon Press, Inc.

        United States

        Author Tags

        1. Machine translation
        2. Word embedding
        3. Deep learning
        4. Medical informatics
        5. Deep phenotyping
        6. Human Phenotype Ontology
