Coverage (genetics)

An overlap of the product of three sequencing runs, with the read sequence coverage at each point indicated.

In genetics, coverage is one of several measures of the depth or completeness of DNA sequencing. It is expressed more specifically as sequence coverage, physical coverage, or genomic coverage.

Sequence coverage

Rationale

Even though the sequencing accuracy for each individual nucleotide is very high, the very large number of nucleotides in the genome means that if an individual genome is sequenced only once, there will be a significant number of sequencing errors. Furthermore, many positions in a genome contain rare single-nucleotide polymorphisms (SNPs). Hence, to distinguish between sequencing errors and true SNPs, it is necessary to increase the sequencing accuracy even further by sequencing individual genomes a large number of times.
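
The effect of repeated sequencing on accuracy can be illustrated with a small calculation. The sketch below is not taken from the cited sources: it assumes that reads miscall a position independently and with the same per-base error rate (the value 0.01 is a hypothetical figure chosen for illustration), and it computes the probability that erroneous calls outnumber correct ones at a single position. The function name `prob_majority_error` is likewise an illustrative choice.

```python
from math import comb


def prob_majority_error(depth: int, error_rate: float) -> float:
    """Probability that more than half of `depth` reads miscall one position.

    Assumes independent errors with a constant per-base error rate
    (a simplification; real sequencing errors are often correlated).
    """
    k_min = depth // 2 + 1  # smallest number of wrong reads forming a strict majority
    return sum(
        comb(depth, k) * error_rate**k * (1 - error_rate) ** (depth - k)
        for k in range(k_min, depth + 1)
    )


# Hypothetical per-base error rate of 1%, evaluated at several depths.
for depth in (1, 5, 15, 30):
    print(depth, prob_majority_error(depth, 0.01))
```

Under these assumptions the probability of a wrong majority call falls steeply as depth grows, which is the intuition behind sequencing each genome many times.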

Ultra-deep sequencing

The term "ultra-deep" can sometimes also refer to higher coverage (>100-fold), which allows for the detection of sequence variants in mixed populations. [5] [6] [7] In the extreme, error-corrected sequencing approaches such as Maximum-Depth Sequencing allow the coverage of a given region to approach the throughput of a sequencing machine, giving coverages of >10^8. [8]
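
Why rare variants in a mixed population call for high coverage can be seen from a rough estimate. The following sketch assumes that each read at a position independently carries the variant with probability equal to the variant's fraction in the population, and it ignores sequencing errors entirely; the function name and the example figures (a 1% variant, 95% confidence) are hypothetical.

```python
import math


def min_depth_to_observe(variant_fraction: float, confidence: float) -> int:
    """Smallest depth n such that P(at least one variant read) >= confidence.

    Uses 1 - (1 - f)**n >= p, i.e. n >= ln(1 - p) / ln(1 - f).
    Assumes independent sampling of reads and no sequencing errors.
    """
    if not (0.0 < variant_fraction < 1.0 and 0.0 < confidence < 1.0):
        raise ValueError("variant_fraction and confidence must lie strictly between 0 and 1")
    return math.ceil(math.log(1.0 - confidence) / math.log(1.0 - variant_fraction))


# Hypothetical example: a variant present in 1% of the population.
print(min_depth_to_observe(0.01, 0.95))
```

Because a single supporting read cannot be told apart from a sequencing error, a practical analysis would require several variant reads, so the depth returned here should be read as a lower bound.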

Transcriptome sequencing

Deep sequencing of transcriptomes, also known as RNA-Seq, provides both the sequence and frequency of RNA molecules that are present at any particular time in a specific cell type, tissue or organ. [9] Counting the number of mRNAs that are encoded by individual genes provides an indicator of protein-coding potential, a major contributor to phenotype. [10] Improving methods for RNA sequencing is an active area of research in terms of both experimental and computational methods. [11]

Calculation

The average coverage for a whole genome can be calculated from the length of the original genome (G), the number of reads (N), and the average read length (L) as $N \times L / G$. For example, a hypothetical genome with 2,000 base pairs reconstructed from 8 reads with an average length of 500 nucleotides will have 2× redundancy (8 × 500 / 2,000 = 2). This parameter also enables one to estimate other quantities, such as the percentage of the genome covered by reads (sometimes also called breadth of coverage). A high coverage in shotgun sequencing is desired because it can overcome errors in base calling and assembly. The subject of DNA sequencing theory addresses the relationships of such quantities. [2]
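
The calculation can be sketched in a few lines of code. The first function computes average coverage as N × L / G, using the quantities defined in this section. The second function is an added assumption rather than something stated here: it estimates breadth of coverage with the idealized Lander-Waterman (Poisson) approximation 1 − e^(−c), which treats read start positions as uniformly random and ignores edge effects. Both function names are illustrative.

```python
import math


def average_coverage(num_reads: int, avg_read_length: float, genome_length: int) -> float:
    """Average coverage c = N * L / G."""
    return num_reads * avg_read_length / genome_length


def expected_breadth(coverage: float) -> float:
    """Expected fraction of the genome covered by at least one read.

    Assumption: reads start at uniformly random positions, so per-base depth
    is approximately Poisson with mean `coverage` (Lander-Waterman idealization).
    """
    return 1.0 - math.exp(-coverage)


# The example from the text: 8 reads of average length 500 over a 2,000 bp genome.
c = average_coverage(num_reads=8, avg_read_length=500, genome_length=2000)
print(c)                    # 2.0
print(expected_breadth(c))  # about 0.865 under the Poisson assumption
```

Under this idealization, 2× average coverage would still leave roughly 13.5% of bases unread, which is one reason higher coverage is sought in shotgun sequencing.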

Physical coverage

Sometimes a distinction is made between sequence coverage and physical coverage. Where sequence coverage is the average number of times a base is read, physical coverage is the average number of times a base is read or spanned by mate paired reads. [2] [12] [4]
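
The difference between the two measures can be made concrete with a toy example. The sketch below assumes each mate pair is described by the start and length of the fragment it spans (0-based coordinates), with one read of fixed length at each end; the coordinates, the read length, and the function name are hypothetical. A base read by both mates of the same pair is counted once, which is a simplifying choice.

```python
def coverage_profiles(genome_length, fragments, read_length):
    """Return per-base (sequence_depth, physical_depth) lists.

    fragments: list of (fragment_start, fragment_length) tuples, 0-based,
    each representing one mate pair with a read at either end.
    """
    seq_depth = [0] * genome_length
    phys_depth = [0] * genome_length
    for start, frag_len in fragments:
        end = min(start + frag_len, genome_length)
        # Physical coverage: every base between the outer ends of the pair.
        for i in range(start, end):
            phys_depth[i] += 1
        # Sequence coverage: only the bases actually read at each end.
        read_positions = set(range(start, min(start + read_length, end)))
        read_positions |= set(range(max(end - read_length, start), end))
        for i in read_positions:
            seq_depth[i] += 1
    return seq_depth, phys_depth


# Two hypothetical 10 bp fragments on a 20 bp genome, with 3 bp reads.
seq, phys = coverage_profiles(20, [(0, 10), (5, 10)], read_length=3)
print(sum(seq) / len(seq))    # average sequence coverage: 0.6
print(sum(phys) / len(phys))  # average physical coverage: 1.0
```

In this toy case the unread interior of each fragment contributes to physical coverage but not to sequence coverage, so physical coverage is the larger of the two.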

Genomic coverage

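In terms of genomic coverage and accuracy, whole genome sequencing can broadly be classified as producing either a draft sequence or a finished sequence, which are distinguished by the fraction of the genome covered and the accuracy achieved. [13]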

Producing a truly high-quality finished sequence by this definition is very expensive. Thus, most human "whole genome sequencing" results are draft sequences (sometimes above and sometimes below the accuracy defined above). [13]

References

  1. "Sequencing Coverage". illumina.com. Illumina education. Retrieved 2020-10-08.
  2. Sims, David; Sudbery, Ian; Ilott, Nicholas E.; Heger, Andreas; Ponting, Chris P. (2014). "Sequencing depth and coverage: key considerations in genomic analyses". Nature Reviews Genetics. 15 (2): 121–132. doi:10.1038/nrg3642. PMID 24434847. S2CID 13325739.
  3. Mardis, Elaine R. (2008-09-01). "Next-Generation DNA Sequencing Methods". Annual Review of Genomics and Human Genetics. 9 (1): 387–402. doi:10.1146/annurev.genom.9.081307.164359. ISSN 1527-8204. PMID 18576944.
  4. Ekblom, Robert; Wolf, Jochen B. W. (2014). "A field guide to whole-genome sequencing, assembly and annotation". Evolutionary Applications. 7 (9): 1026–42. Bibcode:2014EvApp...7.1026E. doi:10.1111/eva.12178. PMC 4231593. PMID 25553065.
  5. Ajay SS, Parker SC, Abaan HO, Fajardo KV, Margulies EH (September 2011). "Accurate and comprehensive sequencing of personal genomes". Genome Res. 21 (9): 1498–505. doi:10.1101/gr.123638.111. PMC 3166834. PMID 21771779.
  6. Mirebrahim, Hamid; Close, Timothy J.; Lonardi, Stefano (2015-06-15). "De novo meta-assembly of ultra-deep sequencing data". Bioinformatics. 31 (12): i9–i16. doi:10.1093/bioinformatics/btv226. ISSN 1367-4803. PMC 4765875. PMID 26072514.
  7. Beerenwinkel, Niko; Zagordi, Osvaldo (2011-11-01). "Ultra-deep sequencing for the analysis of viral populations". Current Opinion in Virology. 1 (5): 413–418. doi:10.1016/j.coviro.2011.07.008. PMID 22440844.
  8. Jee, J.; Rasouly, A.; Shamovsky, I.; Akivis, Y.; Steinman, S.; Mishra, B.; Nudler, E. (2016). "Rates and mechanisms of bacterial mutagenesis from maximum-depth sequencing". Nature. 534 (7609): 693–696. Bibcode:2016Natur.534..693J. doi:10.1038/nature18313. PMC 4940094. PMID 27338792.
  9. Malone, John H.; Oliver, Brian (2011-01-01). "Microarrays, deep sequencing and the true measure of the transcriptome". BMC Biology. 9: 34. doi:10.1186/1741-7007-9-34. ISSN 1741-7007. PMC 3104486. PMID 21627854.
  10. Hampton M, Melvin RG, Kendall AH, Kirkpatrick BR, Peterson N, Andrews MT (2011). "Deep sequencing the transcriptome reveals seasonal adaptive mechanisms in a hibernating mammal". PLOS ONE. 6 (10): e27021. Bibcode:2011PLoSO...627021H. doi:10.1371/journal.pone.0027021. PMC 3203946. PMID 22046435.
  11. Heyer EE, Ozadam H, Ricci EP, Cenik C, Moore MJ (2015). "An optimized kit-free method for making strand-specific deep sequencing libraries from RNA fragments". Nucleic Acids Res. 43 (1): e2. doi:10.1093/nar/gku1235. PMC 4288154. PMID 25505164.
  12. Meyerson, M.; Gabriel, S.; Getz, G. (2010). "Advances in understanding cancer genomes through second-generation sequencing". Nature Reviews Genetics. 11 (10): 685–696. doi:10.1038/nrg2841. PMID 20847746. S2CID 2544266.
  13. Wetterstrand, Kris A. "The Cost of Sequencing a Human Genome". National Human Genome Research Institute. Last updated: November 1, 2021.