Accurate phylogenetic classification of variable-length DNA fragments

Abstract

Metagenome studies have retrieved vast amounts of sequence data from a variety of environments, leading to new discoveries and insights into the uncultured microbial world. Except for very simple communities, the encountered diversity has made fragment assembly and the subsequent analysis a challenging problem. A taxonomic characterization of metagenomic fragments is required for a deeper understanding of shotgun-sequenced microbial communities, but success has mostly been limited to sequences containing phylogenetic marker genes. Here we present PhyloPythia, a composition-based classifier that combines higher-level generic clades from a set of 340 completed genomes with sample-derived population models. Extensive analyses on synthetic and real metagenome data sets showed that PhyloPythia allows the accurate classification of most sequence fragments across all considered taxonomic ranks, even for unknown organisms. The method requires no more than 100 kb of training sequence for the creation of accurate models of sample-specific populations and can assign fragments ≥1 kb with high specificity.
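To make the term "composition-based" concrete, the following is a minimal sketch of how a DNA fragment can be turned into an oligonucleotide frequency vector and compared with reference profiles. It is an illustration only and not the authors' implementation: the function names `kmer_profile` and `nearest_profile`, the default k-mer length of 4, and the nearest-profile assignment rule are assumptions introduced here. The supplementary material indicates that PhyloPythia itself uses a support vector machine, which this sketch does not reproduce.

```python
from collections import Counter
from itertools import product


def kmer_profile(seq, k=4):
    """Return relative frequencies of all 4**k DNA k-mers, in a fixed order."""
    seq = seq.upper()
    kmers = ["".join(p) for p in product("ACGT", repeat=k)]
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    # Windows containing N or other ambiguity codes are not in `kmers`,
    # so they are ignored when normalizing.
    total = sum(counts[m] for m in kmers)
    if total == 0:
        return [0.0] * len(kmers)
    return [counts[m] / total for m in kmers]


def nearest_profile(fragment, references, k=4):
    """Return the label whose reference profile is closest to the fragment.

    Hypothetical stand-in for a trained classifier: it uses squared
    Euclidean distance between k-mer frequency vectors.
    """
    frag = kmer_profile(fragment, k)

    def distance(label):
        ref = kmer_profile(references[label], k)
        return sum((a - b) ** 2 for a, b in zip(frag, ref))

    return min(references, key=distance)


if __name__ == "__main__":
    # Toy reference sequences, invented for illustration only.
    refs = {
        "at_rich": "ATATTTAAATATTAAT",
        "gc_rich": "GCGCGGCCGCGCCGGC",
    }
    print(nearest_profile("GGCCGCGCGG", refs, k=2))
```

Short fragments yield sparse, noisy frequency vectors, which is one reason a classifier of this kind would be expected to perform better on longer fragments.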

Figure 1: Accuracy of phylogenetic assignments for differently sized genomic fragments with PhyloPythia.
Figure 2: Phylogenetic classification accuracy of PhyloPythia by clade for differently sized genomic fragments from unknown organisms.
Figure 3: Binning accuracy of Thiothrix sp. contigs using PhyloPythia.

Acknowledgements

We thank N. Ivanova, V. Kunin and F. Warnecke for help with selection of CAP and Thiothrix-specific training sets and for validation analyses of the metagenomic data-set binning, L. Krause for providing the SEED data, T. Huynh for implementing the web interface, and S. Polonsky for comments and discussion. The work of H.G.M. and P.H. was performed under the auspices of the US Department of Energy's Office of Science, Biological and Environmental Research Program; the University of California, Lawrence Livermore National Laboratory, under contract W-7405-Eng-48; Lawrence Berkeley National Laboratory under contract DE-AC03-76SF00098; and Los Alamos National Laboratory under contract W-7405-ENG-36. PhyloPythia's results were incorporated in the US Department of Energy Joint Genome Institute Integrated Microbial Genomes & Metagenomes (IMG/M) experimental system (http://www.jgi.doe.gov).

Author information

Contributions

A.C.M. developed and evaluated the method, A.T. contributed code for pattern discovery and discussion, P.H. and H.G.M. helped with discussions and the evaluation of the results for the EBPR sludges, A.C.M., I.R., H.G.M. and P.H. contributed to the writing of the manuscript, and A.C.M. and I.R. designed and planned the project.

Corresponding author

Correspondence to Isidore Rigoutsos.

Ethics declarations

Competing interests

The authors declare no competing financial interests.

Supplementary information

Supplementary Fig. 1

Assignment accuracy for differently sized genomic fragments and coding sequences from unknown organisms at the level of the class.

Supplementary Fig. 2

Wn parameter search for the sequence composition space with the highest classification accuracy for 15 kb fragments of unknown organisms at different phylogenetic levels.

Supplementary Fig. 3

Evaluation of the relation of genomic fragment length used for model creation and classification accuracy for genomic fragments of unknown organisms and different lengths.

Supplementary Fig. 4

Assignments at the domain level with PhyloPythia for 50 kb genomic fragments from unknown organisms.

Supplementary Fig. 5

Comparison of classification accuracy for 3 kb fragments and 3 kb fragments carrying ribosomal proteins with PhyloPythia.

Supplementary Fig. 6

Clades at different depths of the phylogenetic tree that are sufficiently represented by genomes of the 340 organisms for composition-based modeling.

Supplementary Table 1

Wn parameter search for the sequence composition space with the highest classification accuracy.

Supplementary Table 2

Classification accuracy of the SVM with a Gaussian versus a linear kernel.

Supplementary Table 3

Classification accuracy of PhyloPythia for genomic fragments of unknown organisms at different taxonomic ranks.

Supplementary Table 4

Phylogenetic classification accuracy of PhyloPythia for genomic fragments of known organisms at different taxonomic ranks.

Supplementary Table 5

Search for the best parameter settings for the SOM and TETRA-method.

Supplementary Table 6

Comparison of PhyloPythia to the SOM-phylotype associations and tetranucleotide-based binning of the dominant sample populations for the contigs ≥1 kb of the Sargasso Sea sample.

Supplementary Table 7

Evaluation of PhyloPythia's classification accuracy for genome fragments of different Prochlorococcus strains.

Supplementary Methods

Supplementary Note

About this article

Cite this article

McHardy, A., Martín, H., Tsirigos, A. et al. Accurate phylogenetic classification of variable-length DNA fragments. Nat Methods 4, 63–72 (2007). https://doi.org/10.1038/nmeth976
