Protein language models learn from diverse sequences spanning the evolutionary tree and have proven to be powerful tools for sequence design, variant effect prediction and structure prediction. What are the foundations of protein language models, and how are they applied in protein engineering?
Models like ChatGPT and DALL-E2 generate text and images in response to a text prompt. Given the different data and goals involved, how can generative models be useful for protein engineering?
A mathematical concept known as a de Bruijn graph turns the formidable challenge of assembling a contiguous genome from billions of short sequencing reads into a tractable computational problem.
Hierarchical models provide reliable statistical estimates for data sets from high-throughput experiments where measurements vastly outnumber experimental samples.
Flux balance analysis is a mathematical approach for analyzing the flow of metabolites through a metabolic network. This primer covers the theoretical basis of the approach, several practical examples and a software toolbox for performing the calculations.
When prioritizing hits from a high-throughput experiment, it is important to correct for random events that falsely appear significant. How is this done and what methods should be used?
Networks in biology can appear complex and difficult to decipher. Merico et al. illustrate how to interpret biological networks with the help of frequently used visualization and analysis patterns.
Mapping the vast quantities of short sequence fragments produced by next-generation sequencing platforms is a challenge. What programs are available and how do they work?
Only a subset of single-nucleotide polymorphisms (SNPs) can be genotyped in genome-wide association studies. Imputation methods can infer the alleles of 'hidden' variants and use those inferences to test the hidden variants for association.
Only a subset of genetic variants can be examined in genome-wide surveys for genetic risk factors. How can a fixed set of markers account for the entire genome by acting as proxies for neighboring associations?
Decision trees have been applied to problems such as assigning protein function and predicting splice sites. How do these classifiers work, what types of problems can they solve and what are their advantages over alternatives?
The expectation maximization algorithm arises in many computational biology applications that involve probabilistic models. What is it good for, and how does it work?
Principal component analysis is often incorporated into genome-wide expression studies, but what is it and how can it be used to explore high-dimensional data?
Artificial neural networks have been applied to problems ranging from speech recognition to prediction of protein secondary structure, classification of cancers and gene prediction. How do they work and what might they be good for?
Computational prediction of gene structure is crucial for interpreting genomic sequences. But how do the algorithms involved work and how accurate are they?
Instrumentation aside, algorithms for matching mass spectra to proteins are at the heart of shotgun proteomics. How do these algorithms work, what can we expect of them and why is it so difficult to find protein modifications?
Support vector machines (SVMs) are becoming popular in a wide variety of biological applications. But what exactly are SVMs, and how do they work? And what are their most promising applications in the life sciences?
How can we computationally extract an unknown motif from a set of target sequences? What are the principles behind the major motif discovery algorithms? Which of these should we use, and how do we know we've found a 'real' motif?
Sequence motifs are becoming increasingly important in the analysis of gene regulation. How do we define sequence motifs, and why should we use sequence logos instead of consensus sequences to represent them? Do they have any relation with binding affinity? How do we search for new instances of a motif in this sea of DNA?