计算机科学 ›› 2015, Vol. 42 ›› Issue (9): 235-239.doi: 10.11896/j.issn.1002-137X.2015.09.045

• 人工智能 • 上一篇    下一篇

CGDNA:基于簇图的基因组序列集成拼接算法

徐魁,陈 科,徐 君,田佳林,刘 浩,王宇凡   

  1. 天津工业大学计算机科学与软件学院 天津300387,天津工业大学计算机科学与软件学院 天津300387,南开大学数学科学学院 天津300071,天津工业大学计算机科学与软件学院 天津300387,天津工业大学计算机科学与软件学院 天津300387,天津工业大学计算机科学与软件学院 天津300387
  • 出版日期:2018-11-14 发布日期:2018-11-14
  • 基金资助:
    本文受国家自然科学基金(11201134),天津市自然科学基金一般项目(12JCYBJC31900),天津市高校中青年骨干创新人才培养计划资助

CGDNA:An Ensemble De Novo Genome Assembly Algorithm Based on Clustering Graph

XU Kui, CHEN Ke, XU Jun, TIAN Jia-lin, LIU Hao and WANG Yu-fan   

  • Online:2018-11-14 Published:2018-11-14

摘要: 基因组测序的目的是获取一个生物体完整的DNA序列信息,而DNA信息是进行遗传学研究和疾病诊断的基础。通常而言,完整的基因组测序分为两个步骤:第一步通过实验手段测定DNA序列片段,第二步通过计算方法把DNA片段拼接为完整的基因组。尽管桑格测序技术成功解析了包括人类在内的多个基因组,但其由于成本过高,目前逐渐被新一代测序技术所取代。新一代测序技术的特点为高通量、高覆盖率、低成本,随之而来的缺点体现为短读长、更多类型的错误。这些特点也给基因拼接算法带来了更大的挑战。鉴于目前的数十种基因拼接算法中并没有一种算法显著优于其它算法,且一些分析表明不同算法的拼接结果具有互补性,提出了CGDNA算法框架,它把不同算法的拼接结果整合到一起,使得整合的结果超越任何单个算法的结果。提出了一种基于簇图的基因组序列集成拼接算法,它通过构建索引、读长映射、重叠群聚簇、构建簇图等步骤将重叠群拼接成更长的序列。实验结果表明,相对于目前最优的算法Velvet、ABySS、SOAPdenovo,CGDNA在N50与最长拼接序列这两项指标上的增长比例高达50%以上,并且达到了较高的覆盖度。当更多的基本算法集成到本算法时,性能可进一步提高。提出的方法大幅提高了基因拼接的长度,为下一步的遗传分析降低了难度,并加快了生物基因组研究的步伐。

关键词: 基因组拼接,集成算法,簇图,索引,读长映射

Abstract: The ultimate goal of genome sequencing is to determine the complete DNA sequence of an organism,which is the basis for genetic research and disease diagnosis.In general,genome sequencing can be divided into two steps:first,generating and determining the DNA fragments experimentally;second,assembling the fragments into full genome through computational method.Although the Sanger technology successfully resolves the human genome,it is replaced by the next generation of sequencing technology due to its high cost.The next generation of sequencing technology has the merits of high throughput,high coverage and low cost and accompanies with short reads and more errors as a byproduct,which brings more challenge to the assembly algorithms.Since it is reported that the assembly results by different algorithms are complementary and none of the assembly algorithms consistently outperforms the remaining algorithms,this study aimed at integrating the assembly results produced by multiple algorithms.In this study,we proposed an algorithm based on clustering graph.Through building index,mapping of reads,clustering of contigs and building of clustering graph,the proposed algorithm outperforms any of the single algorithm.The experimental results demonstrate that by implementing the CGDNA algorithm,two standard metrics(the largest scaffold and scaffold N50) are increased by 50% when compared to the state-of-the-art algorithms,i.e.,Velvet,ABySS,and SOAPdenovo.Moreover,the performance of CGDNA algorithm should be further improved when more base algorithms are added.The proposed algorithm largely improves the quality of assembly result,reduces the difficulty of genetic analysis and accelerates the genome research.

Key words: De novo genome assembly,Ensemble algorithm,Clustering graph,Indexing,Read mapping

[1] Medvedev P,Georgiou K,Myers G,et al.Computability of Mo-dels for Sequence Assmbly[M]∥Algorithms in Bioinformatics.Springer-Verlag Berlin Heidelberg,2007:289-301
[2] Drmanac R,et al.Human genome sequencing using unchained base reads on self-assembling DNA nanoarrays[J].Science,2010,327(5961):78-81
[3] Harris T D,et al.Single-molecule DNA sequencing of a viral genome[J].Science,2008,320(5872):106-109
[4] Margulies M,et al.Genome sequencing inmicrofabricated high-density picolitre reactors[J].Nature,2005,7(7075):376-380
[5] McKernan K J,et al.Sequence and structural variation in a human genome uncovered by short-read,massively parallel ligation sequencing using two-base encoding[J].Genome Res,2009,19(9):1527-1541
[6] Medvedev P,et al.Paired de Bruijn graphs:a novel approach for incorporating mate pair information into genome assemblers[M]∥Research in Computational Molecular Biology.Springer,2011:238-251
[7] Hernandez D,Francois P,Farinelli L,et al.De novo bacterial genome sequencing:millions of very short reads assembled on a desktop computer[J].Genome Res.,2008,8(5):802-809
[8] Miller J R,Koren S,Sutton G.Assembly algorithms for next-generation sequencing data[J].Genomics,2010,5(6):315-327
[9] Warren R L,Sutton G G,Jones S J,et al.Assembling millions of short DNA sequences using SSAKE[J].Bioinformatics,2007,23(4):500-501
[10] Dohm J C,Lottaz C,Borodina T,et al.A fast and highly accurate short-read assembly algorithm for de novo genomic sequencing[J].Genome Res,2007,7(11):1697-1706
[11] Jeck W R,Reinhardt J A,Baltrus D A,et al.Extending assembly of short DNA sequences to handle error[J].Bioinformatics,2007,23(21):2942-2944
[12] http://linux1.softberry.com/berry.phtml?topic=OligoZip
[13] Bresler M,Sheehan S,Chan A H,et al.Telescoper:De novo Assembly of Highly Repetitive Regions[J].Bioinformatics,2012,8(18):311-317
[14] Chaisson M J P,et al.De novo fragment assembly with short mate-paired reads:does the read length matter?[J].Genome Res,2009,19(2):336-346
[15] MacCallum I,et al.ALLPATHS 2:small genomes assembled accurately and with high continuity from short paired reads[J].Genome Biol,2009,10(10):R103
[16] Simpson J T,et al.ABySS:a parallel assembler for short-read sequence data[J].Genome Res,2009,19(6):1117-1123
[17] Zerbino D R,Birney E.Velvet:algorithms for de novo short-read assembly using de Bruijn graphs[J].Genome Res,2008,18(5):821-829
[18] Li R,et al.De novo assembly of human genomes with massively parallel shortread sequencing[J].Genome Res,2010,20(2):265-272
[19] Earl D A,et al.Assemblathon 1:a competitive assessment of de novo short-read assembly methods[J].Genome Res,2011,21(12):2224-2241
[20] Miller J,Koren S,Sutton G.Assembly algorithms for next-gene-ration sequencing data[J].Genomics,2010,95(6):315-27

No related articles found!
Viewed
Full text


Abstract

Cited

  Shared   
  Discussed   
No Suggested Reading articles found!