DOI: 10.1007/978-3-031-21517-9_16
Article

Deep Learning Based NLP Embedding Approach for Biosequence Classification

Published: 15 December 2022

Abstract

Biological sequence analysis involves the study of the structural characteristics and chemical composition of a sequence. From a computational perspective, the goal is to represent sequences as vectors that bring out the essential features of the virus and enable efficient classification. Methods such as one-hot encoding and Word2Vec models have been explored for embedding sequences into Euclidean space. However, these methods either fail to capture similarity information between k-mers or struggle to handle Out-of-Vocabulary (OOV) k-mers. To overcome these challenges, in this paper we explore the possibility of embedding biosequences of MERS, SARS and SARS-CoV-2 using the Global Vectors (GloVe) model and the FastText n-gram representation. We conduct an extensive study to evaluate their performance using classical Machine Learning algorithms and Deep Learning methods. We compare our results with dna2vec, an existing Word2Vec approach. Experimental results show that FastText n-gram based sequence embeddings enable deeper insights into the composition of each virus and thus give a classification accuracy close to 1. We also provide a study of the patterns in the viruses and support our results using various visualization techniques.
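
The OOV point in the abstract can be illustrated with a small sketch. It is not the authors' implementation: the sizes `DIM` and `BUCKETS`, the n-gram range, the CRC32 hash, and the randomly initialised table are all assumptions chosen for illustration, standing in for parameters that a trained FastText model would learn. The sketch only shows why a subword representation can still produce a vector for a k-mer that was never seen during training, which a fixed-vocabulary lookup cannot do.

```python
import random
import zlib

def kmers(seq, k=3):
    """Slide a window of width k over the sequence, one base at a time."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]

def char_ngrams(token, n_min=2, n_max=3):
    """FastText-style subwords: wrap the token in < > and take all n-grams."""
    wrapped = f"<{token}>"
    return [wrapped[i:i + n]
            for n in range(n_min, n_max + 1)
            for i in range(len(wrapped) - n + 1)]

# Illustrative sizes only; not taken from the paper.
DIM, BUCKETS = 8, 1000
rng = random.Random(0)
# Random bucket vectors stand in for trained subword parameters.
table = [[rng.uniform(-1, 1) for _ in range(DIM)] for _ in range(BUCKETS)]

def embed_token(token):
    """Average the bucket vectors of the token's hashed n-grams.

    No vocabulary lookup is involved, so any k-mer gets a vector.
    """
    grams = char_ngrams(token)
    vec = [0.0] * DIM
    for g in grams:
        row = table[zlib.crc32(g.encode()) % BUCKETS]
        vec = [v + r for v, r in zip(vec, row)]
    return [v / len(grams) for v in vec]

def embed_sequence(seq, k=3):
    """Average k-mer vectors into one fixed-length sequence vector."""
    vecs = [embed_token(t) for t in kmers(seq, k)]
    return [sum(col) / len(vecs) for col in zip(*vecs)]

if __name__ == "__main__":
    print(kmers("ACGTAC"))                # overlapping 3-mers
    print(char_ngrams("ACG"))             # subwords of one k-mer
    print(len(embed_sequence("ACGTACGT")))  # fixed-length output
```

Because two k-mers that share subwords (for example `ACG` and `ACGT`) hash to some of the same buckets, their vectors overlap, which is the similarity signal that one-hot encoding lacks. The resulting fixed-length sequence vector is the kind of input a downstream classifier would consume.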

Published In

Mining Intelligence and Knowledge Exploration: 9th International Conference, MIKE 2021, Hammamet, Tunisia, November 1–3, 2021, Proceedings
Nov 2021
247 pages
ISBN:978-3-031-21516-2
DOI:10.1007/978-3-031-21517-9
Editors: Richard Chbeir, Yannis Manolopoulos, Rajendra Prasath

Publisher

Springer-Verlag

Berlin, Heidelberg

Author Tags

  1. Biosequence classification
  2. MERS
  3. SARS
  4. SARS-CoV-2
  5. NLP
  6. FastText
  7. Global vectors
