Hybrid genome assembly

high-quality contigs constructed from sequencing reads from second-generation (Illumina and 454) technology. These contigs were supplemented by aligning them to PacBio long reads to achieve linear scaffolds that were gap-filled using PacBio long reads. These scaffolds were then supplemented again, this time using PacBio strobe reads (multiple subreads from a single contiguous fragment of DNA), to achieve a final, high-quality assembly. This approach was used to sequence the genome of a strain of

) to conduct genomic analyses involving an organism of interest. The advent of next-generation sequencing has brought significant improvements in the speed, accuracy and cost of DNA sequencing and has made the sequencing of entire genomes feasible. Many different sequencing technologies have been developed by various biotechnology companies, each of which produces sequencing reads that differ in accuracy and read length. Some of these technologies include

can be difficult when using reads of substantially different lengths. Currently, this challenge is being overcome by using multiple genome assembly programs. An example of this can be seen in Goldberg et al., where the authors paired 454 reads with Sanger reads. The 454 reads were first assembled using the Newbler assembler (which is optimized for short reads), generating pseudo-reads that were then paired with the longer Sanger reads and assembled using the Celera assembler.
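The overlap step at the heart of OLC assembly can be illustrated with a minimal sketch. It is not how Newbler or the Celera assembler work internally: the function names `longest_overlap` and `merge_reads`, the `min_len` threshold and the toy reads are assumptions made for illustration, and real assemblers tolerate sequencing errors rather than requiring exact matches.

```python
from typing import Optional


def longest_overlap(a: str, b: str, min_len: int = 3) -> int:
    """Return the length of the longest suffix of `a` that exactly
    matches a prefix of `b`, or 0 if it is shorter than `min_len`."""
    max_len = min(len(a), len(b))
    # Try the longest candidate first so the first hit is the best one.
    for length in range(max_len, min_len - 1, -1):
        if a[-length:] == b[:length]:
            return length
    return 0


def merge_reads(a: str, b: str, min_len: int = 3) -> Optional[str]:
    """Merge two reads along their exact suffix/prefix overlap.
    Returns None when no sufficiently long overlap exists."""
    overlap = longest_overlap(a, b, min_len)
    if overlap == 0:
        return None
    return a + b[overlap:]


if __name__ == "__main__":
    long_read = "ACGTTGCA"   # hypothetical longer read
    short_read = "TGCAGGT"   # hypothetical shorter read
    print(longest_overlap(long_read, short_read))
    print(merge_reads(long_read, short_read))
```

When reads differ greatly in length, a single fixed `min_len` is hard to choose: a threshold that suits long reads can reject every overlap involving a short read. This is one way to see why mixed-length data complicates the overlap step.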

difficulties that are encountered during genome assembly will also become a concept of the past as computational efficiency and performance increase. More efficient sequencing algorithms and assembly programs are needed to develop more effective assembly approaches that can incorporate sequencing reads from multiple technologies in tandem.

generate scaffolds using the filtered BLASR data. The advantages of Cerulean are that it requires minimal resources and results in assembled scaffolds with high accuracy. These characteristics make it better suited to scaling up for larger eukaryotic genomes, but the efficiency of Cerulean when applied to larger genomes remains to be verified.

Cerulean, unlike other hybrid assembly approaches, does not use the short reads directly; instead, it uses an assembly graph that is created in a similar manner to the OLC method or the De Bruijn method. This graph is used to assemble a skeleton graph, which only uses long contigs with the edges of the

The authors of this paper present Cerulean, a hybrid genome assembly program that differs from traditional hybrid assembly approaches. Normally, hybrid assembly involves mapping short, high-quality reads to long, low-quality reads, but this still introduces errors in the assembled genomes. This process

This study offers an improvement over the typical programs and algorithms used to assemble uncorrected PacBio reads. ALLPATHS-LG (another program that can assemble PacBio reads) uses the uncorrected PacBio reads to assist in scaffolding and for the closing of gaps in short sequence assemblies. Due to

There are inherent challenges when utilizing sequence reads from various technologies to assemble a sequenced genome; data coming from different sequencers can have different characteristics. An example of this can be seen when using the overlap-layout-consensus (OLC) method of genome assembly, which

that prevents it from being used alone is its relatively low accuracy, which causes inherent errors in the sequenced DNA. Using solely second-generation sequencing technologies for genome assembly can miss or lead to the incomplete assembly of important aspects of the genome. Supplementation of third

is 149 billion base pairs). This assembly is computationally difficult and has some inherent challenges, one of these challenges being that genomes often contain complex tandem repeats of sequences that can be thousands of base pairs in length. These repeats can be long enough that second generation

Many of the current limitations in genomic research revolve around the ability to produce large amounts of high quality sequencing data and to assemble entire genomes of organisms of interest. Developing more effective hybrid genome assembly strategies is taking the next step in advancing sequence

The current challenges in genome assembly are related to the limitations of modern sequencing technologies. Advances in sequencing technology aim to develop systems that can produce long sequencing reads with very high fidelity but, at this point, these two things are mutually exclusive. The

One area of the genome where the use of the long PacBio reads was especially helpful was the ribosomal operon. This region is usually greater than 5 kb in size and occurs seven times throughout the genome, with an average identity ranging from 98.04% to 99.94%. Resolving these regions using only short

This study also used a hybrid approach to error-correction of PacBio sequencing data. This was done by utilizing high-coverage Illumina short reads to correct errors in the low-coverage PacBio reads. BLASR (a long read aligner from PacBio) was used in this process. In areas where the Illumina reads

This study employed two different methods for hybrid genome assembly: a scaffolding approach that supplemented currently available sequenced contigs with PacBio reads, as well as an error correction approach to improve the assembly of bacterial genomes. The first approach in this study started with

technologies). This mapping allows for trimming and correction of the long reads to improve the read accuracy from as low as 80% to over 99.9%. In the best example of this application from this paper, the contig size was quintupled when compared to the assemblies using only second-generation reads.

This method was tested by assembling the genome of an Escherichia coli strain. First, short reads were assembled using the ABySS assembler. These reads were then mapped to the long reads using BLASR. The results from the ABySS assembly were used to create the assembly graph, which was used to

Comparing the assembly constructed using the hybrid approach to the assembly created using the traditional reference genome approach showed that, even with the availability of a reference genome, it is more beneficial to utilize a hybrid de novo assembly strategy, as it preserves more genome sequences.

was assembled twice: once using a classical reference genome approach, and once using a hybrid approach. The hybrid approach consisted of three consecutive steps: first, contigs were generated de novo; second, the contigs were ordered and concatenated into supercontigs; and third, the gaps

reads, such as those obtained using the PacBio RS DNA sequencer. These sequences are, on average, 10,000–15,000 base pairs in length and are long enough to span most repeated regions. Using a hybrid approach to this process can increase the fidelity of assembling tandem repeats by being able to

The use of multiple sequencing technologies to facilitate genome assembly may become a thing of the past as the quality of long sequencing reads (hundreds or thousands of base pairs) approaches and exceeds the quality of current second generation sequencing reads. The computational

Goldberg, S. M., Johnson, J., Busam, D., Feldblyum, T., Ferriera, S., Friedman, R., ... Venter, J. C. (2006). A Sanger/pyrosequencing hybrid approach for the generation of high-quality draft assemblies of marine microbial genomes. Proc Natl Acad Sci U S A, 103(30), 11240–11245.

in the entire genome assembly process. As such, extensive research is being done to develop new techniques and algorithms that make genome assembly more computationally efficient and more accurate.

Ham, J. S., Kwak, W., Chang, O. K., Han, G. S., Jeong, S. G., Seol, K. H., ... Kim, H. (2013). De Novo Assembly and Comparative Analysis of the Enterococcus faecalis Genome (KACC 91532) from a Korean Neonate. Journal of Microbiology and Biotechnology, 23(7), 966–973.

This study also shows that using a lower coverage of corrected long reads is similar to using a higher coverage of shorter reads; 13x PBcR data (corrected using 50x Illumina data) was comparable to an assembly constructed using 100x paired-end Illumina reads. The

graph representing the putative genomic connection between the contigs. The skeleton graph is a simplified version of a typical De Bruijn graph, which means that unambiguous assembly using the skeleton graph is more favourable than traditional methods.
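For orientation, the plain De Bruijn graph that the skeleton graph is said to simplify can be sketched in a few lines. This is a generic textbook construction, not Cerulean's implementation; the function name `de_bruijn_graph`, the choice of `k` and the toy read are assumptions made for illustration.

```python
from collections import defaultdict
from typing import Dict, Iterable, List


def de_bruijn_graph(reads: Iterable[str], k: int) -> Dict[str, List[str]]:
    """Build a De Bruijn graph as an adjacency list.

    Every k-mer found in a read contributes one directed edge from its
    (k-1)-length prefix to its (k-1)-length suffix.
    """
    graph: Dict[str, List[str]] = defaultdict(list)
    for read in reads:
        # Slide a window of width k across the read.
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            graph[kmer[:-1]].append(kmer[1:])
    return dict(graph)


if __name__ == "__main__":
    # Hypothetical toy read; real inputs contain millions of reads.
    for node, successors in de_bruijn_graph(["ACGTACG"], k=3).items():
        print(node, "->", successors)
```

A node with more than one distinct successor marks a point where the path through the graph is ambiguous, which is typically what a repeat looks like at this level. A graph restricted to long contigs has far fewer such nodes.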

generation reads with short, high-accuracy second generation sequences can overcome these inherent errors and complete crucial details of the genome. This approach has been used to sequence the genomes of some bacterial species including a strain of

approach is used. This is because millions of sequences must be assembled to reconstruct a genome. Within genomes, there are often tandem repeats of DNA segments that can be thousands of base pairs in length, which can cause problems during assembly.

Abrams, J. Y., Copeland, J. R., Tauxe, R. V., Date, K. A., Belay, E. D., Mody, R. K., & Mintz, E. D. (2013). Real-time modelling used for outbreak management during a cholera epidemic, Haiti, 2010–2011. Epidemiology and Infection, 141(6),

Hybrid genome assembly can also be accomplished using the Eulerian path approach. In this approach, the length of the assembled sequences does not matter: once a k-mer spectrum has been constructed, the lengths of the reads are irrelevant.
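The claim that read length stops mattering once the k-mer spectrum exists can be checked on a toy case. In the sketch below, which is only an illustration and not taken from any cited assembler, one long read and two shorter reads covering the same sequence yield the same spectrum. The function name `kmer_spectrum` and the sequences are hypothetical, and the example assumes error-free reads that overlap by exactly k-1 bases.

```python
from collections import Counter
from typing import Iterable


def kmer_spectrum(reads: Iterable[str], k: int) -> Counter:
    """Count every k-mer occurring in a collection of reads."""
    spectrum: Counter = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            spectrum[read[i:i + k]] += 1
    return spectrum


if __name__ == "__main__":
    k = 4
    long_reads = ["ACGTACGGA"]            # one hypothetical long read
    short_reads = ["ACGTAC", "TACGGA"]    # two short reads over the same region
    # Both collections decompose into the same six 4-mers.
    print(kmer_spectrum(long_reads, k) == kmer_spectrum(short_reads, k))
```

With real data the counts differ because coverage and error rates differ between technologies, so the equality above is exact only for this idealized input.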

from fragmented, sequenced DNA resulting from shotgun sequencing. Genome assembly presents one of the most challenging tasks in genome sequencing as most modern DNA sequencing technologies can only produce reads that are, on average, 25–300

second generation reads would be very difficult but the use of long third generation reads makes the process much more efficient. Utilization of the PacBio reads allowed for unambiguous placement of the complex repeats along the scaffold.

Hybrid assembly may be used to resolve ambiguities that exist in genomes previously assembled using second generation sequencing. Short second generation reads have also been used to correct errors that exist in the long third generation

between contigs were closed using an iterative approach. The initial de novo assembly of contigs was achieved in parallel using Velvet, which assembles contigs by manipulating De Bruijn graphs, and Edena, which is an OLC-based assembler

Wang, Y., Yu, Y., Pan, B., Hao, P., Li, Y., Shao, Z., ... Li, X. (2012). Optimizing hybrid assembly of next-generation sequence data from Enterococcus faecium: a microbe with highly divergent genome. BMC Syst Biol, 6 Suppl 3, S21.

Koren, S., Schatz, M. C., Walenz, B. P., Martin, J., Howard, J. T., Ganapathy, G., ... Phillippy, A. M. (2012). Hybrid error correction and de novo assembly of single-molecule sequencing reads. Nature Biotechnology, 30(7), 692–+.

computational limitations, this approach limits assembly to relatively small genomes (maximum of 10 Mbp). The PBcR algorithm allows for the assembly of much larger genomes with higher fidelity and using uncorrected PacBio reads.

and assembling them into the correct order such as to reconstruct the original genome. Sequencing involves using automated machines to determine the order of nucleic acids in the DNA of interest (the nucleic acids in DNA are

genome assembly is used when the genome to be assembled is not similar to any other organisms whose genomes have been previously sequenced. This process is carried out by assembling single reads into contiguous sequences

assembly. The scaffolding approach can be useful if the genome of a similar organism has been previously sequenced. This process involves assembling the genome of interest by comparing it to a known genome or scaffold.

assembly program. This algorithm calculates an accurate hybrid consensus sequence by mapping higher accuracy short reads (from second generation sequencing technologies) to individual lower accuracy long reads (from

DiGuistini, S., Liao, N., Platt, D., Robertson, G., Siedel, M., Chan, S., ... Jones, S. J. M. (2009). De novo sequence assembly of a filamentous fungus using Sanger, 454 and Illumina sequence data. Genome Biology,

sequencing reads are not long enough to bridge the repeat, and, as such, determining the location of each repeat in the genome can be difficult. Resolving these tandem repeats can be accomplished by utilizing long

Bashir, A., Klammer, A. A., Robins, W. P., Chin, C. S., Webster, D., Paxinos, E., ... Schadt, E. E. (2012). A hybrid approach for the automated finishing of bacterial genomes. Nature Biotechnology, 30(7), 701–+.

Cerdeira, L. T., Carneiro, A. R., Ramos, R. T. J., de Almeida, S. S., D'Afonseca, V., Schneider, M. P. C., ... Silva, A. (2011). Rapid hybrid de novo assembly of a microbial genome using only short reads:

English, A. C., Richards, S., Han, Y., Wang, M., Vee, V., Qu, J. X., ... Gibbs, R. A. (2012). Mind the Gap: Upgrading Genomes with Pacific Biosciences RS Long-Read Sequencing Technology. PLoS ONE, 7(11).

Koren, S., Harhay, G., Smith, P., Bono, J., Harhay, D., Mcvey, S., ... Phillippy, A. (2013). Reducing assembly complexity of microbial genomes with single-molecule sequencing. Genome Biology.

One hybrid approach to genome assembly involves supplementing short, accurate second-generation sequencing data (e.g. from IonTorrent, Illumina or Roche 454) with long, less accurate

This study employs a hybrid genome assembly approach that only uses sequencing reads generated using SOLiD sequencing (a second-generation sequencing technology). The genome of

Deshpande, V., Fung, E., Pham, S., & Bafna, V. (2013). Cerulean: A hybrid assembly using high throughput short and long reads. Algorithms in Bioinformatics, 8126, 349–363.

for the corrected PBcR data was also longer than the Illumina data (4.65 Mbp compared to 3.32 Mbp for the Illumina reads). A similar trend was seen in the sequencing of the

) which are then extended in the 3' and 5' directions by overlapping other sequences. The latter is preferred because it allows for the conservation of more sequences.

Kim, P. G., Cho, H. G., & Park, K. (2008). A scaffold analysis tool using mate-pair information in genome sequencing. Journal of Biomedicine and Biotechnology.

Motahari, A. S., Bresler, G., & Tse, D. N. C. (2013). Information Theory of DNA Shotgun Sequencing. IEEE Transactions on Information Theory, 59(10), 6273–6289.

Pellicer, Jaume, Fay, Michael F., & Leitch, Ilia J. (2010). The largest eukaryotic genome of them all? Botanical Journal of the Linnean Society, 164(1), 10–15.

The workflow of a typical hybrid genome assembly experiment using second- and third-generation sequencing technologies. Figure adapted from Wang et al., 2012

Pevzner, P. A., Tang, H., & Waterman, M. S. (2001). An Eulerian path approach to DNA fragment assembly. Proc Natl Acad Sci U S A, 98(17), 9748–9753.

The authors of this study developed a correction algorithm called the PacBio corrected Reads (PBcR) algorithm which is implemented as part of the

Ritz, Anna, Bashir, Ali, & Raphael, Benjamin J. (2010). Structural variation analysis with strobe reads. Bioinformatics, 26(10), 1291–1298.

Although next generation sequencing technology is now capable of producing millions of reads, the assembly of these reads can cause a

. Algorithms specific for this type of hybrid genome assembly have been developed, such as the PacBio corrected Reads algorithm.

include technologies such as the PacBio RS system, which can produce long reads (maximum of 23 kb) but has a relatively low accuracy.

Alkan, C., Sajjadian, S., & Eichler, E. (2011). Limitations of next-generation genome sequence assembly. Nature Methods, 8.

technology is expanding the limits of genomic research as the cost of generating high quality sequencing data is decreasing.

is also computationally expensive and requires a large amount of running time, even for relatively small bacterial genomes.

assembly technology and these strategies are guaranteed to become more effective as more powerful technologies emerge.

in length. This is orders of magnitude smaller than the average size of a genome (the genome of the octoploid plant

The term genome assembly refers to the process of taking a large number of DNA fragments that are generated during

. These sequencing technologies produce relatively short reads (50–700 bases) and have a high accuracy (>98%).

that would be used for genome assembly. The nodes represent the sequence of the contigs being used for assembly.

data (i.e. from PacBio RS) to resolve complex repeated DNA segments. The main limitation of single-molecule

Genome assembly is normally done by one of two methods: assembly using a reference genome as a scaffold, or

Pop, M. (2009). Genome assembly reborn: recent computational challenges. Brief Bioinform, 10(4), 354–366.

Mardis, E. R. (2008). Next-generation DNA sequencing methods. Annu Rev Genom Hum Genet, 9, 387–402.

accurately place them along a linear scaffold and make the process more computationally efficient.

Glenn, T. (2011). Field guide to next-generation DNA sequencers. Molecular Ecology Resources, 11.

"DNA Sequencing: Latest Developments in Next-Generation Sequencing – Drug Discovery World (DDW)"

could be mapped, a consensus sequence was constructed using overlapping reads in that region.
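The consensus step described here can be pictured as a per-column majority vote over reads that have already been aligned to a region. The sketch below is an assumption-laden illustration, not the method used in the study: the function name `column_consensus`, the use of "-" for uncovered positions and "N" for columns with no coverage are all choices made for this example, and the alignment itself (done with BLASR in the study) is taken as given.

```python
from collections import Counter
from typing import Sequence


def column_consensus(aligned_reads: Sequence[str]) -> str:
    """Majority-vote consensus over pre-aligned, equal-frame reads.

    "-" marks a position a read does not cover; a column with no
    covering read is reported as "N".
    """
    width = max(len(read) for read in aligned_reads)
    consensus = []
    for col in range(width):
        bases = [
            read[col]
            for read in aligned_reads
            if col < len(read) and read[col] != "-"
        ]
        if not bases:
            consensus.append("N")
        else:
            # most_common(1) returns the base with the highest count.
            consensus.append(Counter(bases).most_common(1)[0][0])
    return "".join(consensus)


if __name__ == "__main__":
    # Three hypothetical accurate short reads; the middle one has an error.
    print(column_consensus(["ACGTAC", "ACGAAC", "ACGTAC"]))
```

Higher short-read coverage makes each vote more reliable, which is consistent with the study pairing high-coverage Illumina data with low-coverage PacBio reads.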

assembly of DNA sequences is a very computationally challenging process and can fall into the

Hybrid Error Correction and De Novo Assembly of Single-Molecule Sequencing Reads

Hybrid error correction and de novo assembly of single-molecule sequencing reads

"New Chemistry Boosts Average Read Length to 10 kb – 15 kb for PacBio® RS II"

JM221 genome: a 25x PBcR assembly had an N50 triple that of a 50x 454 assembly.
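N50, the statistic compared here, is the contig length at which contigs of that length or longer account for at least half of the total assembled bases. A minimal sketch of the usual calculation follows; the function name `n50` and the contig lengths are hypothetical and are not data from the study.

```python
from typing import Iterable


def n50(contig_lengths: Iterable[int]) -> int:
    """Return the N50 of a set of contig lengths.

    Contigs are visited from longest to shortest; the N50 is the length
    of the contig at which the running total first reaches half of the
    summed length of all contigs.
    """
    ordered = sorted(contig_lengths, reverse=True)
    half_total = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        if running >= half_total:
            return length
    raise ValueError("no contig lengths given")


if __name__ == "__main__":
    # Hypothetical contig lengths, in arbitrary units.
    print(n50([2, 3, 4, 5, 6, 7, 8, 9, 10]))
```

Because N50 depends only on the length distribution, a higher value indicates a more contiguous assembly but says nothing about whether the contigs are correct.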

I19 as a case study. Journal of Microbiological Methods, 86(2), 218–223.

National Center for Biotechnology Information: Genome Assembly

Virtual Poster: Hybrid Genome Assembly of a Nocturnal Lemur

In bioinformatics, hybrid genome assembly refers to utilizing various sequencing technologies to achieve the task of assembling a genome

that was responsible for a cholera outbreak in Haiti.

An example of a De Bruijn graph

Section headings: Genome Assembly; Classical Genome Assembly (main article: Sequence assembly); Hybrid Genome Assembly; Practical approaches; Hybrid error correction and de novo assembly of single-molecule sequencing reads; Automated finishing of bacterial genomes; Using only short reads; Using high throughput short and long reads; Future prospectives; References; External links.

Linked terms: base pairs; Paris japonica; shotgun sequencing; adenine, cytosine, guanine and thymine; Roche 454, Illumina, SOLiD, and IonTorrent; third-generation sequencing; de novo; contigs; NP-hard; Hamiltonian-cycle; bottleneck; Celera; N50; Vibrio cholerae; Escherichia coli; Corynebacterium pseudotuberculosis.

Reference details:
Pop (2009): doi:10.1093/bib/bbp026
Pellicer et al. (2010): doi:10.1111/j.1095-8339.2010.01072.x
Motahari et al. (2013): doi:10.1109/tit.2013.2270273
Mardis (2008): doi:10.1146/annurev.genom.9.081307.164359
Koren et al. (2012): doi:10.1038/nbt.2280
Kim et al. (2008): doi:10.1155/2008/675741
Ham et al. (2013): doi:10.4014/jmb.1303.03045
Cerdeira et al. (2011): doi:10.1016/j.mimet.2011.05.008
Wang et al. (2012): doi:10.1186/1752-0509-6-S3-S21
English et al. (2012): doi:10.1371/journal.pone.0047768
Bashir et al. (2012): doi:10.1038/nbt.2288
Goldberg et al. (2006): doi:10.1073/pnas.0604351103
Pevzner et al. (2001): doi:10.1073/pnas.171285098
Ritz et al. (2010): doi:10.1093/bioinformatics/btq153
Abrams et al. (2013): pages 1276–1285.
"DNA Sequencing: Latest Developments in Next-Generation Sequencing": 4 April 2013.
"New Chemistry Boosts Average Read Length to 10 kb – 15 kb for PacBio® RS II": archived from the original on 2015-10-10; retrieved 2015-08-31.
