KO Database of Molecular Functions
The KO (KEGG Orthology) database is a database of molecular functions represented in terms of functional orthologs. A functional ortholog is manually defined in the context of KEGG molecular networks, namely, KEGG pathway maps, BRITE hierarchies and KEGG modules, and is given a KO identifier called a K number. Most KOs are defined from experimentally characterized genes and proteins in specific organisms, which are then generalized to other organisms based on sequence similarity. The granularity of "function" is context-dependent, so the resulting KO grouping may correspond to a group of highly similar sequences within a limited organism group or to a more divergent group.
The term KO system is used for a network-based classification of KOs shown below:
00001 KEGG Orthology (KO)
It consists of six top categories (09100 to 09160) for KEGG pathway maps and one top category (09180) for BRITE hierarchies, as well as one top category (09190) for those KOs that are not yet included in either of them. The category numbers for these top categories and the second-level categories under metabolism (09101 to 09112) are used to define color coding of functions (see KEGG Color Codes).
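As a rough illustration of the top-level grouping described above, the sketch below sorts a top category number into one of the three groups named in the text. The function name is hypothetical, and the code assumes its input is already a top-level category number; it does not check whether a number inside the 09100 to 09160 range is actually in use.

```python
def ko_top_category_group(code: str) -> str:
    """Return the group of a KO system top category number.

    Assumption: `code` is a top-level category number such as "09100".
    The ranges below are taken from the description in the text only.
    """
    number = int(code)
    if 9100 <= number <= 9160:
        return "KEGG pathway maps"
    if number == 9180:
        return "BRITE hierarchies"
    if number == 9190:
        return "not yet included in pathway maps or BRITE hierarchies"
    raise ValueError(f"{code!r} is not a KO system top category number")


print(ko_top_category_group("09100"))
print(ko_top_category_group("09180"))
```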
Efforts have been made to associate KO entries with publication records reporting experimental evidence of functionally characterized sequence data, as shown in the SEQUENCE field of the KO entry page. In many cases such data are not available for genes and proteins in the KEGG organisms with completely sequenced genomes. Thus, the addendum (ag) category was introduced in the GENES database, enabling functionally characterized individual protein sequences to be included in KEGG. As a byproduct of these efforts, sequence data have also been associated with EC numbers in Enzyme Nomenclature.
Genome Annotation in KEGG
Genome annotation in KEGG has two distinctive aspects, KO assignment and KEGG mapping, as summarized below.
KO assignment
- Molecular functions are stored in the KO (KEGG Orthology) database containing orthologs of experimentally characterized genes/proteins.
- Genome annotation in KEGG assigns KO identifiers (K numbers) to individual genes in the genome, rather than giving text descriptions of functions.
- Cellular and organism-level functions are stored in the PATHWAY, BRITE and MODULE databases in terms of the molecular networks, which are all created as networks of K number nodes.
- The KO assignment procedure converts a gene set in the genome to a K number set and leads to automatic reconstruction of KEGG pathways and other networks by the process called KEGG mapping, enabling interpretation of high-level functions.
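The two steps in the list above, converting a gene set to a K number set and then matching it against networks of K number nodes, can be sketched as follows. This is a minimal illustration, not KEGG's implementation: the function names, gene names, K numbers and network names are all made-up placeholders and do not reflect actual KEGG content.

```python
from typing import Dict, Iterable, Set


def assign_k_numbers(genes: Iterable[str], gene_to_ko: Dict[str, str]) -> Set[str]:
    """Convert a gene set to a K number set, skipping unannotated genes."""
    return {gene_to_ko[gene] for gene in genes if gene in gene_to_ko}


def kegg_mapping(k_numbers: Set[str], networks: Dict[str, Set[str]]) -> Dict[str, Set[str]]:
    """For each network, return the K number nodes found in the genome.

    Networks with no matching node are left out of the result.
    """
    mapped = {}
    for name, nodes in networks.items():
        found = nodes & k_numbers
        if found:
            mapped[name] = found
    return mapped


# Toy data: every identifier here is a hypothetical placeholder.
gene_to_ko = {"geneA": "K90001", "geneB": "K90002", "geneC": "K90001"}
networks = {
    "toy_pathway_1": {"K90001", "K90002", "K90003"},
    "toy_module_1": {"K90004"},
}
genome = ["geneA", "geneB", "geneC", "geneD"]  # geneD has no KO assignment

k_set = assign_k_numbers(genome, gene_to_ko)
mapped = kegg_mapping(k_set, networks)
print(sorted(k_set))
print({name: sorted(nodes) for name, nodes in mapped.items()})
```

Because the networks are defined purely in terms of K numbers, the same mapping step works for any genome once its genes have been assigned K numbers.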
KO Assignment Tools
The basic tool for the internal annotation procedure of assigning KOs in the GENES database has been the KOALA (KEGG Orthology And Links Annotation) program. It processes GFIT tables generated from the SSDB database of SSEARCH computation results for all pairwise genome comparisons and presents the most appropriate KOs to be assigned. The previous version of the KOALA algorithm was based on a weighted sum of SW (Smith-Waterman) scores, but it was too complicated to refine for better assignments. The current version, the newkoala algorithm, is much simpler and introduces a penalty for sequence length difference for safer predictions. The measure of similarity between two sequences is defined by a modified identity:
identity * min(1, overlap*2/(aalen1+aalen2))
where "identity" is the identity score given by SSEARCH, "overlap" is the alignment length, and aalen1 and aalen2 are the lengths of the two sequences being compared. The new KOALA program is currently used to assign and update KOs automatically four times a week.
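The formula above can be written as a small function to show how the length penalty behaves. This is a sketch under assumptions: the function name is hypothetical, and the identity is taken here as a fraction between 0 and 1, although the scale used internally by KOALA is not stated in the text.

```python
def modified_identity(identity: float, overlap: int, aalen1: int, aalen2: int) -> float:
    """Compute identity * min(1, overlap*2/(aalen1+aalen2)).

    identity : identity score of the alignment (assumed to be a fraction, 0 to 1)
    overlap  : alignment length
    aalen1, aalen2 : lengths of the two sequences being compared
    """
    if aalen1 <= 0 or aalen2 <= 0:
        raise ValueError("sequence lengths must be positive")
    if overlap < 0:
        raise ValueError("overlap must not be negative")
    # The coverage factor is capped at 1, so it can only lower the score.
    coverage = min(1.0, overlap * 2 / (aalen1 + aalen2))
    return identity * coverage


# Alignment covering both 300-residue sequences in full: no penalty.
print(modified_identity(0.8, 300, 300, 300))
# Alignment of 150 residues between sequences of 200 and 400 residues:
# the coverage factor is 300/600 = 0.5, so the score is halved.
print(modified_identity(0.8, 150, 200, 400))
```

The second call shows the effect of the penalty: a short alignment between sequences of very different lengths keeps its raw identity but receives a much lower modified identity.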
In addition, there are two tools for manual annotations based on GFIT tables. One is the KoAnn (KO Annotation) tool for examining the entire set of annotated genes for a given KO, and the other is the Check GN tool for checking the consistency of the entire GENES annotation. The consistency check is performed every night presenting additional candidates and possible misannotations for human intervention.
Since the newkoala algorithm uses the identity score, it can readily be applied not only to SSEARCH, but also to other programs. BlastKOALA is a web server for automatic KO assignment now using the newkoala algorithm for BLAST search against a limited set of GENES data. BlastKOALA is also used internally for initial annotation of new genomes, before SSDB computation and GFIT table generation are completed.
Feature | KOALA | BlastKOALA |
Purpose | Internal GENES annotation | External service for genome annotation |
Search program | SSEARCH | BLASTP |
Scoring | newkoala algorithm (using SSEARCH identity scores) | newkoala algorithm (using BLASTP identity scores) |
Database | Entire GENES database sequences | KEGG Reference genomes and functionally characterized sequences linked from KO references |
References
- Kanehisa, M., Sato, Y., Kawashima, M., Furumichi, M., and Tanabe, M.; KEGG as a reference resource for gene and protein annotation. Nucleic Acids Res. 44, D457-D462 (2016).
- Kanehisa, M., Sato, Y., and Morishima, K.; BlastKOALA and GhostKOALA: KEGG tools for functional characterization of genome and metagenome sequences. J. Mol. Biol. 428, 726-731 (2016).
Last updated: January 1, 2025