research-article · DOI: 10.1145/3584371.3613013

CodonBERT: Using BERT for Sentiment Analysis to Better Predict Genes with Low Expression

Published: 04 October 2023

Abstract

Synonymous codons, which encode the same amino acid in a protein, are known to be used unequally across organisms. Prior research has uncovered "preferred" codons that occur more often in highly expressed genes. This has enabled computational models that predict the expression of protein-coding genes; however, their performance is often degraded by the more diverse gene expression found in higher organisms, i.e., high expression in only specific tissues or cell types. In this paper, we use a Natural Language Processing (NLP) algorithm, Bidirectional Encoder Representations from Transformers (BERT), to develop a new framework for predicting gene expression. Notably, our model architecture relies on the idea of sentiment analysis, i.e., assigning an overall "emotion" (sentiment) to protein-coding sequences. Our new framework, CodonBERT, is a pre-trained model that better captures intrinsic relationships between sequences and their expression, and we show that it makes substantially better predictions for a diverse collection of model organisms. Additionally, we show that our model learns inherent patterns of codon usage that can be traced using explainable AI (XAI) algorithms.
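The abstract's "sentiment analysis" framing treats a coding sequence as a sentence whose words are codons, with expression level playing the role of the sentiment label. A minimal sketch of the codon-level tokenization such a BERT-style model would need is shown below; the function names, vocabulary layout, and special tokens are illustrative assumptions, not taken from the paper.

```python
# Sketch of codon-level tokenization for a BERT-style model, assuming a
# vocabulary of all 64 codons plus standard BERT special tokens.
# Names here (tokenize_cds, encode, VOCAB) are hypothetical.
from itertools import product

SPECIAL_TOKENS = ["[PAD]", "[UNK]", "[CLS]", "[SEP]", "[MASK]"]
CODONS = ["".join(c) for c in product("ACGT", repeat=3)]  # all 64 codons
VOCAB = {tok: i for i, tok in enumerate(SPECIAL_TOKENS + CODONS)}

def tokenize_cds(seq: str) -> list:
    """Split a coding sequence into codon 'words', wrapped in [CLS]/[SEP]."""
    seq = seq.upper()
    usable = len(seq) - len(seq) % 3  # drop any trailing partial codon
    codons = [seq[i:i + 3] for i in range(0, usable, 3)]
    return ["[CLS]"] + codons + ["[SEP]"]

def encode(seq: str) -> list:
    """Map codon tokens to integer ids; unknown trigrams fall back to [UNK]."""
    return [VOCAB.get(tok, VOCAB["[UNK]"]) for tok in tokenize_cds(seq)]

print(len(VOCAB))                 # 69 = 64 codons + 5 special tokens
print(tokenize_cds("ATGGCTTAA"))  # ['[CLS]', 'ATG', 'GCT', 'TAA', '[SEP]']
```

Under this framing, the encoded id sequence would be fed to a BERT encoder and the `[CLS]` representation passed to a classification or regression head over expression levels, exactly as in sentence-level sentiment classification.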



Published In

BCB '23: Proceedings of the 14th ACM International Conference on Bioinformatics, Computational Biology, and Health Informatics
September 2023
626 pages
ISBN:9798400701269
DOI:10.1145/3584371
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].


Publisher

Association for Computing Machinery

New York, NY, United States



Author Tags

  1. CUB
  2. expression
  3. sentiment analysis
  4. transformers
  5. SHAP

Qualifiers

  • Research-article

Conference

BCB '23

Acceptance Rates

Overall Acceptance Rate 254 of 885 submissions, 29%

