Abstract
Metagenome studies have retrieved vast amounts of sequence data from a variety of environments, leading to new discoveries and insights into the uncultured microbial world. Except for very simple communities, the encountered diversity has made fragment assembly and the subsequent analysis a challenging problem. A taxonomic characterization of metagenomic fragments is required for a deeper understanding of shotgun-sequenced microbial communities, but success has mostly been limited to sequences containing phylogenetic marker genes. Here we present PhyloPythia, a composition-based classifier that combines higher-level generic clades from a set of 340 completed genomes with sample-derived population models. Extensive analyses on synthetic and real metagenome data sets showed that PhyloPythia allows the accurate classification of most sequence fragments across all considered taxonomic ranks, even for unknown organisms. The method requires no more than 100 kb of training sequence for the creation of accurate models of sample-specific populations and can assign fragments ≥1 kb with high specificity.
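As a rough illustration of what "composition-based" means here, the sketch below turns a DNA fragment into a vector of normalised oligonucleotide (k-mer) frequencies, the kind of feature vector a classifier could be trained on. It is a minimal sketch under stated assumptions, not PhyloPythia's implementation: the function name `kmer_frequencies`, the default `k=4`, and the handling of ambiguous bases are all illustrative choices that do not come from the abstract.

```python
from itertools import product


def kmer_frequencies(seq, k=4):
    """Return normalised k-mer frequencies of seq in lexicographic order.

    Hypothetical helper for illustration only; the word length k and the
    normalisation are assumptions, not the published method's parameters.
    """
    seq = seq.upper()
    # Fixed ordering of all 4**k possible words over the alphabet ACGT.
    kmers = ["".join(p) for p in product("ACGT", repeat=k)]
    counts = dict.fromkeys(kmers, 0)
    # Slide a window of width k along the fragment, one base at a time.
    for i in range(len(seq) - k + 1):
        word = seq[i:i + k]
        if word in counts:  # skip windows containing N or other ambiguity codes
            counts[word] += 1
    total = sum(counts.values())
    # Divide by the number of counted windows so fragments of different
    # lengths are comparable; return all zeros if nothing was counted.
    return [counts[w] / total if total else 0.0 for w in kmers]


vec = kmer_frequencies("ACGTACGTAC", k=2)
```

With `k=2` the vector has 16 entries, one per dinucleotide. Normalising by the window count is one simple way to let fragments of variable length share a feature space; the paper itself may use a different scheme.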
Acknowledgements
We thank N. Ivanova, V. Kunin and F. Warnecke for help with selection of CAP and Thiothrix-specific training sets and for validation analyses of the metagenomic data-set binning, L. Krause for providing the SEED data, T. Huynh for implementing the web interface, and S. Polonsky for comments and discussion. The work of H.G.M. and P.H. was performed under the auspices of the US Department of Energy's Office of Science, Biological and Environmental Research Program; the University of California, Lawrence Livermore National Laboratory, under contract W-7405-Eng-48; Lawrence Berkeley National Laboratory under contract DE-AC03-76SF00098; and Los Alamos National Laboratory under contract W-7405-ENG-36. PhyloPythia's results were incorporated in the US Department of Energy Joint Genome Institute Integrated Microbial Genomes & Metagenomes (IMG/M) experimental system (http://www.jgi.doe.gov).
Author information
Contributions
A.C.M. developed and evaluated the method; A.T. contributed code for pattern discovery and took part in discussions; P.H. and H.G.M. helped with discussions and the evaluation of the results for the EBPR sludges; A.C.M., I.R., H.G.M. and P.H. contributed to the writing of the manuscript; and A.C.M. and I.R. designed and planned the project.
Ethics declarations
Competing interests
The authors declare no competing financial interests.
Supplementary information
Supplementary Fig. 1
Assignment accuracy for differently sized genomic fragments and coding sequences from unknown organisms at the level of the class.
Supplementary Fig. 2
Wn parameter search for the sequence composition space with the highest classification accuracy for 15 kb fragments of unknown organisms at different phylogenetic levels.
Supplementary Fig. 3
Evaluation of the relation of genomic fragment length used for model creation and classification accuracy for genomic fragments of unknown organisms and different lengths.
Supplementary Fig. 4
Assignments at the domain level with PhyloPythia for 50 kb genomic fragments from unknown organisms.
Supplementary Fig. 5
Comparison of classification accuracy for 3 kb fragments and 3 kb fragments carrying ribosomal proteins with PhyloPythia.
Supplementary Fig. 6
Clades at different depths of the phylogenetic tree that are sufficiently represented by genomes of the 340 organisms for composition-based modeling.
Supplementary Table 1
Wn parameter search for the sequence composition space with the highest classification accuracy.
Supplementary Table 2
Classification accuracy of the SVM with a gaussian versus a linear kernel.
Supplementary Table 3
Classification accuracy of PhyloPythia for genomic fragments of unknown organisms at different taxonomic ranks.
Supplementary Table 4
Phylogenetic classification accuracy of PhyloPythia for genomic fragments of known organisms at different taxonomic ranks.
Supplementary Table 5
Search for the best parameter settings for the SOM and TETRA-method.
Supplementary Table 6
Comparison of PhyloPythia to the SOM-phylotype associations and tetranucleotide-based binning of the dominant sample populations for the contigs ≥1kb of the Sargasso Sea sample.
Supplementary Table 7
Evaluation of PhyloPythia's classification accuracy for genome fragments of different Prochlorococcus strains.
About this article
Cite this article
McHardy, A., Martín, H., Tsirigos, A. et al. Accurate phylogenetic classification of variable-length DNA fragments. Nat Methods 4, 63–72 (2007). https://doi.org/10.1038/nmeth976
DOI: https://doi.org/10.1038/nmeth976