Computer Science (计算机科学), 2015, Vol. 42, Issue (9): 235-239. doi: 10.11896/j.issn.1002-137X.2015.09.045
徐魁,陈 科,徐 君,田佳林,刘 浩,王宇凡
XU Kui, CHEN Ke, XU Jun, TIAN Jia-lin, LIU Hao and WANG Yu-fan
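Abstract: Genome sequencing aims to obtain the complete DNA sequence of an organism, and DNA information is the basis of genetic research and disease diagnosis. In general, complete genome sequencing consists of two steps: first, DNA sequence fragments are determined experimentally; second, the fragments are assembled into a complete genome by computational methods. Although Sanger sequencing successfully resolved many genomes, including the human genome, its high cost has led to its gradual replacement by next-generation sequencing technologies. Next-generation sequencing offers high throughput, high coverage and low cost, but these come with short read lengths and more types of errors, which pose greater challenges for genome assembly algorithms. Given that none of the dozens of existing assembly algorithms significantly outperforms the others, and that some analyses indicate the assembly results of different algorithms are complementary, this paper proposes the CGDNA algorithm framework, which integrates the assembly results of different algorithms so that the integrated result surpasses that of any single algorithm. Specifically, it proposes a cluster-graph-based ensemble assembly algorithm for genome sequences, which assembles contigs into longer sequences through four steps: index construction, read mapping, contig clustering and cluster graph construction. Experimental results show that, compared with the state-of-the-art algorithms Velvet, ABySS and SOAPdenovo, CGDNA increases both N50 and the length of the longest assembled sequence by as much as 50% or more, while achieving relatively high coverage. Performance improves further when more base algorithms are integrated into the framework. The proposed method substantially increases assembly length, reduces the difficulty of subsequent genetic analysis, and accelerates research on biological genomes.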
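The two evaluation metrics named in the abstract, N50 and the length of the longest assembled sequence, can be computed from a list of contig lengths. The following is a minimal sketch that assumes the common definition of N50 (the largest length L such that contigs of length at least L account for at least half of all assembled bases). The function names `n50` and `assembly_summary` and the sample lengths are illustrative assumptions and are not taken from the paper.

```python
from typing import Dict, Iterable


def n50(contig_lengths: Iterable[int]) -> int:
    """Return the N50 of a set of contig lengths (common definition)."""
    lengths = sorted(contig_lengths, reverse=True)
    if not lengths:
        raise ValueError("at least one contig length is required")
    total = sum(lengths)
    running = 0
    # Walk from the longest contig down; stop once half the bases are covered.
    for length in lengths:
        running += length
        if 2 * running >= total:
            return length
    raise AssertionError("unreachable for non-empty input")


def assembly_summary(contig_lengths: Iterable[int]) -> Dict[str, int]:
    """Summarise an assembly by total length, longest contig, and N50."""
    lengths = list(contig_lengths)
    return {
        "total": sum(lengths),
        "longest": max(lengths),
        "n50": n50(lengths),
    }


if __name__ == "__main__":
    # Hypothetical contig lengths, for illustration only.
    print(assembly_summary([80, 70, 50, 40, 30, 20]))
```

Comparing these summaries for a single assembler's contigs and for an integrated contig set is one way to express the kind of relative improvement the abstract reports.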