HMMER

Last updated
HMMER
Developer(s) Sean Eddy, Travis Wheeler, HMMER development team
Stable release
3.4 [1] / 15 August 2023;10 months ago (15 August 2023)
Repository
Written in C
Available in English
Type Bioinformatics tool
License BSD-3
Website hmmer.org
A profile HMM modelling a multiple sequence alignment A profile HMM modelling a multiple sequence alignment.png
A profile HMM modelling a multiple sequence alignment

HMMER is a free and commonly used software package for sequence analysis written by Sean Eddy. [2] Its general usage is to identify homologous protein or nucleotide sequences, and to perform sequence alignments. It detects homology by comparing a profile-HMM (a Hidden Markov model constructed explicitly for a particular search) to either a single sequence or a database of sequences. Sequences that score significantly better to the profile-HMM compared to a null model are considered to be homologous to the sequences that were used to construct the profile-HMM. Profile-HMMs are constructed from a multiple sequence alignment in the HMMER package using the hmmbuild program. The profile-HMM implementation used in the HMMER software was based on the work of Krogh and colleagues. [3] HMMER is a console utility ported to every major operating system, including different versions of Linux, Windows, and macOS.

Contents

HMMER is the core utility that protein family databases such as Pfam and InterPro are based upon. Some other bioinformatics tools such as UGENE also use HMMER.

HMMER3 also makes extensive use of vector instructions to increase computational speed. This work is based upon an earlier publication showing a significant acceleration of the Smith-Waterman algorithm for aligning two sequences. [4]

Profile HMMs

A profile HMM is a variant of an HMM relating specifically to biological sequences. Profile HMMs turn a multiple sequence alignment into a position-specific scoring system, which can be used to align sequences and search databases for remotely homologous sequences. [5] They capitalise on the fact that certain positions in a sequence alignment tend to have biases in which residues are most likely to occur, and are likely to differ in their probability of containing an insertion or a deletion. Capturing this information gives them a better ability to detect true homologs than traditional BLAST-based approaches, which penalise substitutions, insertions and deletions equally, regardless of where in an alignment they occur. [6]

The core profile HMM architecture used by HMMER. Hmmer coreProfileHMM.png
The core profile HMM architecture used by HMMER.

Profile HMMs center around a linear set of match (M) states, with one state corresponding to each consensus column in a sequence alignment. Each M state emits a single residue (amino acid or nucleotide). The probability of emitting a particular residue is determined largely by the frequency at which that residue has been observed in that column of the alignment, but also incorporates prior information on patterns of residues that tend to co-occur in the same columns of sequence alignments. This string of match states emitting amino acids at particular frequencies is analogous to position specific score matrices or weight matrices. [5]

A profile HMM takes this modelling of sequence alignments further by modelling insertions and deletions, using I and D states, respectively. D states do not emit a residue, while I state do emit a residue. Multiple I state can occur consecutively, corresponding to multiple residues between consensus columns in an alignment. M, I and D states are connected by state transition probabilities, which also vary by position in the sequence alignment, to reflect the different frequencies of insertions and deletions across sequence alignments. [5]

The HMMER2 and HMMER3 releases used an architecture for building profile HMMs called the Plan 7 architecture, named after the seven states captured by the model. In addition to the three major states (M, I and D), six additional states capture non-homologous flanking sequence in the alignment. These 6 states collectively are important for controlling how sequences are aligned to the model, e.g. whether a sequence can have multiple consecutive hits to the same model (in the case of sequences with multiple instances of the same domain). [7]

Programs in the HMMER package

The HMMER package consists of a collection of programs for performing functions using profile hidden Markov models. [8] The programs include:

Profile HMM building

Homology searching

Other functions

The package contains numerous other specialised functions.

The HMMER web server

In addition to the software package, the HMMER search function is available in the form of a web server. [9] The service facilitates searches across a range of databases, including sequence databases such as UniProt, SwissProt, and the Protein Data Bank, and HMM databases such as Pfam, TIGRFAMs and SUPERFAMILY. The four search types phmmer, hmmsearch, hmmscan and jackhmmer are supported (see Programs). The search function accepts single sequences as well as sequence alignments or profile HMMs. [10]

The search results are accompanied by a report on the taxonomic breakdown, and the domain organisation of the hits. Search results can then be filtered according to either parameter.

The web service is currently run out of the European Bioinformatics Institute (EBI) in the United Kingdom, while development of the algorithm is still performed by Sean Eddy's team in the United States. [9] Major reasons for relocating the web service were to leverage the computing infrastructure at the EBI, and to cross-link HMMER searches with relevant databases that are also maintained by the EBI.

The HMMER3 release

The latest stable release of HMMER is version 3.0. HMMER3 is complete rewrite of the earlier HMMER2 package, with the aim of improving the speed of profile-HMM searches. Major changes are outlined below:

Improvements in speed

A major aim of the HMMER3 project, started in 2004 was to improve the speed of HMMER searches. While profile HMM-based homology searches were more accurate than BLAST-based approaches, their slower speed limited their applicability. [8] The main performance gain is due to a heuristic filter that finds high-scoring un-gapped matches within database sequences to a query profile. This heuristic results in a computation time comparable to BLAST with little impact on accuracy. Further gains in performance are due to a log-likelihood model that requires no calibration for estimating E-values, and allows the more accurate forward scores to be used for computing the significance of a homologous sequence. [11] [6]

HMMER still lags behind BLAST in speed of DNA-based searches; however, DNA-based searches can be tuned such that an improvement in speed comes at the expense of accuracy. [12]

Improvements in remote homology searching

The major advance in speed was made possible by the development of an approach for calculating the significance of results integrated over a range of possible alignments. [11] In discovering remote homologs, alignments between query and hit proteins are often very uncertain. While most sequence alignment tools calculate match scores using only the best scoring alignment, HMMER3 calculates match scores by integrating across all possible alignments, to account for uncertainty in which alignment is best. HMMER sequence alignments are accompanied by posterior probability annotations, indicating which portions of the alignment have been assigned high confidence and which are more uncertain.

DNA sequence comparison

A major improvement in HMMER3 was the inclusion of DNA/DNA comparison tools. HMMER2 only had functionality to compare protein sequences.

Restriction to local alignments

While HMMER2 could perform local alignment (align a complete model to a subsequence of the target) and global alignment (align a complete model to a complete target sequence), HMMER3 only performs local alignment. This restriction is due to the difficulty in calculating the significance of hits when performing local/global alignments using the new algorithm.

See also

Several implementations of profile HMM methods and related position-specific scoring matrix methods are available. Some are listed below:

Related Research Articles

<span class="mw-page-title-main">Sequence alignment</span> Process in bioinformatics that identifies equivalent sites within molecular sequences

In bioinformatics, a sequence alignment is a way of arranging the sequences of DNA, RNA, or protein to identify regions of similarity that may be a consequence of functional, structural, or evolutionary relationships between the sequences. Aligned sequences of nucleotide or amino acid residues are typically represented as rows within a matrix. Gaps are inserted between the residues so that identical or similar characters are aligned in successive columns. Sequence alignments are also used for non-biological sequences such as calculating the distance cost between strings in a natural language, or to display financial data.

In bioinformatics, sequence analysis is the process of subjecting a DNA, RNA or peptide sequence to any of a wide range of analytical methods to understand its features, function, structure, or evolution. It can be performed on the entire genome, transcriptome or proteome of an organism, and can also involve only selected segments or regions, like tandem repeats and transposable elements. Methodologies used include sequence alignment, searches against biological databases, and others.

In bioinformatics and evolutionary biology, a substitution matrix describes the frequency at which a character in a nucleotide sequence or a protein sequence changes to other character states over evolutionary time. The information is often in the form of log odds of finding two specific character states aligned and depends on the assumed number of evolutionary changes or sequence dissimilarity between compared sequences. It is an application of a stochastic matrix. Substitution matrices are usually seen in the context of amino acid or DNA sequence alignments, where they are used to calculate similarity scores between the aligned sequences.

In bioinformatics, BLAST is an algorithm and program for comparing primary biological sequence information, such as the amino-acid sequences of proteins or the nucleotides of DNA and/or RNA sequences. A BLAST search enables a researcher to compare a subject protein or nucleotide sequence with a library or database of sequences, and identify database sequences that resemble the query sequence above a certain threshold. For example, following the discovery of a previously unknown gene in the mouse, a scientist will typically perform a BLAST search of the human genome to see if humans carry a similar gene; BLAST will identify sequences in the human genome that resemble the mouse gene based on similarity of sequence.

A Gap penalty is a method of scoring alignments of two or more sequences. When aligning sequences, introducing gaps in the sequences can allow an alignment algorithm to match more terms than a gap-less alignment can. However, minimizing gaps in an alignment is important to create a useful alignment. Too many gaps can cause an alignment to become meaningless. Gap penalties are used to adjust alignment scores based on the number and length of gaps. The five main types of gap penalties are constant, linear, affine, convex, and profile-based.

FASTA is a DNA and protein sequence alignment software package first described by David J. Lipman and William R. Pearson in 1985. Its legacy is the FASTA format which is now ubiquitous in bioinformatics.

<span class="mw-page-title-main">Sequence logo</span>

In bioinformatics, a sequence logo is a graphical representation of the sequence conservation of nucleotides or amino acids . A sequence logo is created from a collection of aligned sequences and depicts the consensus sequence and diversity of the sequences. Sequence logos are frequently used to depict sequence characteristics such as protein-binding sites in DNA or functional units in proteins.

The European Bioinformatics Institute (EMBL-EBI) is an intergovernmental organization (IGO) which, as part of the European Molecular Biology Laboratory (EMBL) family, focuses on research and services in bioinformatics. It is located on the Wellcome Genome Campus in Hinxton near Cambridge, and employs over 600 full-time equivalent (FTE) staff. Institute leaders such as Rolf Apweiler, Alex Bateman, Ewan Birney, and Guy Cochrane, an adviser on the National Genomics Data Center Scientific Advisory Board, serve as part of the international research network of the BIG Data Center at the Beijing Institute of Genomics.

<span class="mw-page-title-main">Pfam</span> Database of protein families

Pfam is a database of protein families that includes their annotations and multiple sequence alignments generated using hidden Markov models. Last version of Pfam, 36.0, was released in September 2023 and contains 20,795 families. It is currently provided through InterPro database.

<span class="mw-page-title-main">Multiple sequence alignment</span> Alignment of more than two molecular sequences

Multiple sequence alignment (MSA) is the process or the result of sequence alignment of three or more biological sequences, generally protein, DNA, or RNA. These alignments are used to infer evolutionary relationships via phylogenetic analysis and can highlight homologous features between sequences. Alignments highlight mutation events such as point mutations, insertion mutations and deletion mutations, and alignments are used to assess sequence conservation and infer the presence and activity of protein domains, tertiary structures, secondary structures, and individual amino acids or nucleotides.

InterPro is a database of protein families, protein domains and functional sites in which identifiable features found in known proteins can be applied to new protein sequences in order to functionally characterise them.

<span class="mw-page-title-main">BLOSUM</span> Bioinformatics tool

In bioinformatics, the BLOSUM matrix is a substitution matrix used for sequence alignment of proteins. BLOSUM matrices are used to score alignments between evolutionarily divergent protein sequences. They are based on local alignments. BLOSUM matrices were first introduced in a paper by Steven Henikoff and Jorja Henikoff. They scanned the BLOCKS database for very conserved regions of protein families and then counted the relative frequencies of amino acids and their substitution probabilities. Then, they calculated a log-odds score for each of the 210 possible substitution pairs of the 20 standard amino acids. All BLOSUM matrices are based on observed alignments; they are not extrapolated from comparisons of closely related proteins like the PAM Matrices.

BLAT is a pairwise sequence alignment algorithm that was developed by Jim Kent at the University of California Santa Cruz (UCSC) in the early 2000s to assist in the assembly and annotation of the human genome. It was designed primarily to decrease the time needed to align millions of mouse genomic reads and expressed sequence tags against the human genome sequence. The alignment tools of the time were not capable of performing these operations in a manner that would allow a regular update of the human genome assembly. Compared to pre-existing tools, BLAT was ~500 times faster with performing mRNA/DNA alignments and ~50 times faster with protein/protein alignments.

CS-BLAST (Context-Specific BLAST) is a tool that searches a protein sequence that extends BLAST, using context-specific mutation probabilities. More specifically, CS-BLAST derives context-specific amino-acid similarities on each query sequence from short windows on the query sequences. Using CS-BLAST doubles sensitivity and significantly improves alignment quality without a loss of speed in comparison to BLAST. CSI-BLAST is the context-specific analog of PSI-BLAST, which computes the mutation profile with substitution probabilities and mixes it with the query profile. CSI-BLAST is the context specific analog of PSI-BLAST. Both of these programs are available as web-server and are available for free download.

SUPERFAMILY is a database and search platform of structural and functional annotation for all proteins and genomes. It classifies amino acid sequences into known structural domains, especially into SCOP superfamilies. Domains are functional, structural, and evolutionary units that form proteins. Domains of common Ancestry are grouped into superfamilies. The domains and domain superfamilies are defined and described in SCOP. Superfamilies are groups of proteins which have structural evidence to support a common evolutionary ancestor but may not have detectable sequence homology.

Fast statistical alignment or FSA is a multiple sequence alignment program for aligning many proteins, RNAs, or long genomic DNA sequences. Along with MUSCLE and MAFFT, FSA is one of the few sequence alignment programs which can align datasets of hundreds or thousands of sequences. FSA uses a different optimization criterion which allows it to more reliably identify non-homologous sequences than these other programs, although this increased accuracy comes at the cost of decreased speed.

Phyre and Phyre2 are free web-based services for protein structure prediction. Phyre is among the most popular methods for protein structure prediction having been cited over 1500 times. Like other remote homology recognition techniques, it is able to regularly generate reliable protein models when other widely used methods such as PSI-BLAST cannot. Phyre2 has been designed to ensure a user-friendly interface for users inexpert in protein structure prediction methods. Its development is funded by the Biotechnology and Biological Sciences Research Council.

<span class="mw-page-title-main">Sean Eddy</span> American professor at Harvard University

Sean Roberts Eddy is Professor of Molecular & Cellular Biology and of Applied Mathematics at Harvard University. Previously he was based at the Janelia Research Campus from 2006 to 2015 in Virginia. His research interests are in bioinformatics, computational biology and biological sequence analysis. As of 2016 projects include the use of Hidden Markov models in HMMER, Infernal Pfam and Rfam.

The HH-suite is an open-source software package for sensitive protein sequence searching. It contains programs that can search for similar protein sequences in protein sequence databases. Sequence searches are a standard tool in modern biology with which the function of unknown proteins can be inferred from the functions of proteins with similar sequences. HHsearch and HHblits are two main programs in the package and the entry point to its search function, the latter being a faster iteration. HHpred is an online server for protein structure prediction that uses homology information from HH-suite.

References

  1. "Release 3.4". 15 August 2023. Retrieved 18 September 2023.
  2. Durbin, Richard; Sean R. Eddy; Anders Krogh; Graeme Mitchison (1998). Biological Sequence Analysis: Probabilistic Models of Proteins and Nucleic Acids. Cambridge University Press. ISBN   0-521-62971-3.
  3. Krogh A, Brown M, Mian IS, Sjölander K, Haussler D (February 1994). "Hidden Markov models in computational biology. Applications to protein modeling". J. Mol. Biol. 235 (5): 1501–31. doi:10.1006/jmbi.1994.1104. PMID   8107089.
  4. Farrar M (January 2007). "Striped Smith-Waterman speeds database searches six times over other SIMD implementations". Bioinformatics. 23 (2): 156–61. doi: 10.1093/bioinformatics/btl582 . PMID   17110365.
  5. 1 2 3 Eddy, SR (1998). "Profile hidden Markov models". Bioinformatics. 14 (9): 755–63. doi: 10.1093/bioinformatics/14.9.755 . PMID   9918945.
  6. 1 2 Eddy, Sean R.; Pearson, William R. (20 October 2011). "Accelerated Profile HMM Searches". PLOS Computational Biology. 7 (10): e1002195. Bibcode:2011PLSCB...7E2195E. CiteSeerX   10.1.1.290.1476 . doi: 10.1371/journal.pcbi.1002195 . PMC   3197634 . PMID   22039361.
  7. Eddy, Sean. "HMMER2 User's Guide" (PDF).
  8. 1 2 Sean R. Eddy; Travis J. Wheeler. "HMMER User's Guide" (PDF). and the HMMER development team. Retrieved 23 July 2017.
  9. 1 2 Finn, Robert D.; Clements, Jody; Arndt, William; Miller, Benjamin L.; Wheeler, Travis J.; Schreiber, Fabian; Bateman, Alex; Eddy, Sean R. (1 July 2015). "HMMER web server: 2015 update". Nucleic Acids Research. 43 (W1): W30–W38. doi:10.1093/nar/gkv397. PMC   4489315 . PMID   25943547.
  10. Finn, Robert D.; Clements, Jody; Eddy, Sean R. (2011-07-01). "HMMER web server: interactive sequence similarity searching". Nucleic Acids Research. 39 (Web Server issue): W29–W37. doi:10.1093/nar/gkr367. ISSN   0305-1048. PMC   3125773 . PMID   21593126.
  11. 1 2 Eddy SR (2008). Rost, Burkhard (ed.). "A probabilistic model of local sequence alignment that simplifies statistical significance estimation". PLOS Comput Biol. 4 (5): e1000069. Bibcode:2008PLSCB...4E0069E. doi: 10.1371/journal.pcbi.1000069 . PMC   2396288 . PMID   18516236.
  12. Sean R. Eddy; Travis J. Wheeler. "HMMER3.1b2 Release Notes". and the HMMER development team. Retrieved 23 July 2017.