Abstract
Metagenomics is a rapidly emerging field of research for studying microbial communities. To evaluate methods presently used to process metagenomic sequences, we constructed three simulated data sets of varying complexity by combining sequencing reads randomly selected from 113 isolate genomes. These data sets were designed to model real metagenomes in terms of complexity and phylogenetic composition. We assembled sampled reads using three commonly used genome assemblers (Phrap, Arachne and JAZZ), and predicted genes using two popular gene-finding pipelines (fgenesb and CRITICA/GLIMMER). The phylogenetic origins of the assembled contigs were predicted using one sequence similarity–based (BLAST hit distribution) and two sequence composition–based (PhyloPythia, oligonucleotide frequencies) binning methods. We explored the effects of the simulated community structure and method combinations on the fidelity of each processing step by comparison to the corresponding isolate genomes. The simulated data sets are available online to facilitate standardized benchmarking of tools for metagenomic analysis.
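The construction step described above, combining reads drawn at random from isolate genomes into one labeled data set, can be illustrated with a short sketch. This is not the authors' pipeline: it assumes each organism's reads are already loaded as a list of strings, and the function name, parameter names and the organism fractions are all hypothetical. Each sampled read keeps its source label so that later assembly, gene-prediction or binning output can be scored against the known origin.

```python
import random


def build_simulated_dataset(read_pools, fractions, total_reads, seed=0):
    """Combine randomly selected reads from several isolate read pools.

    read_pools  : dict mapping organism name -> list of read sequences
    fractions   : dict mapping organism name -> target share of the data set
                  (hypothetical values; the article does not give them here)
    total_reads : approximate size of the simulated data set
    Returns a shuffled list of (organism, read) pairs.
    """
    rng = random.Random(seed)  # fixed seed so the sample is reproducible
    dataset = []
    for organism, pool in read_pools.items():
        wanted = round(total_reads * fractions.get(organism, 0.0))
        wanted = min(wanted, len(pool))  # cannot draw more reads than exist
        # Sample without replacement so no read is included twice.
        for read in rng.sample(pool, wanted):
            dataset.append((organism, read))
    rng.shuffle(dataset)  # remove the per-organism ordering
    return dataset


# Toy example: one dominant and one rare organism (placeholder reads).
pools = {
    "dominant_org": [f"dom_read_{i}" for i in range(1000)],
    "rare_org": [f"rare_read_{i}" for i in range(1000)],
}
shares = {"dominant_org": 0.9, "rare_org": 0.1}
low_complexity = build_simulated_dataset(pools, shares, total_reads=200)
```

Changing the shares (for example, toward equal fractions across many organisms) is one simple way to vary the complexity of the resulting data set while keeping the ground truth for every read.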
Acknowledgements
We thank A. Lykidis and I. Anderson from the Genome Biology Program at DOE-JGI for their feedback and comments on this manuscript. This work was performed under the auspices of the US Department of Energy's Office of Science, Biological and Environmental Research Program, and the University of California, Lawrence Livermore National Laboratory under contract number W-7405-Eng-48, Lawrence Berkeley National Laboratory under contract number DE-AC02-05CH11231 and Los Alamos National Laboratory under contract number W-7405-ENG-36.
Author information
Contributions
K.M. and N.I. performed the analysis; K.B., H.S. and E.G. performed assemblies with Phrap, JAZZ and Arachne, respectively; A.C.M. performed binning with PhyloPythia; A.S. performed gene predictions with fgenesb and developed and performed binning with BLAST distr; F.K. developed and performed binning with kmer; M.L. performed gene prediction with the GLIMMER/CRITICA pipeline; A.L., I.G., P.R. and I.R. supported the project; and P.H. and N.C.K. supported the project and contributed conceptually. K.M., P.H. and N.C.K. wrote the manuscript.
Ethics declarations
Competing interests
The authors declare no competing financial interests.
Supplementary information
Supplementary Fig. 1
Enlarged versions of panels in Figure 1b,c. (PDF 1337 kb)
Supplementary Fig. 2
Relative abundance of Alpha and Gamma proteobacteria as derived from binning results for the simLC and simMC data sets. (PDF 118 kb)
Supplementary Table 1
Organisms used for the simulated data sets. (PDF 79 kb)
Supplementary Table 2
Binning summary for contigs longer than 8 kb and comprising more than 10 reads. (PDF 54 kb)
About this article
Cite this article
Mavromatis, K., Ivanova, N., Barry, K. et al. Use of simulated data sets to evaluate the fidelity of metagenomic processing methods. Nat Methods 4, 495–500 (2007). https://doi.org/10.1038/nmeth1043
DOI: https://doi.org/10.1038/nmeth1043