Knowledge (XXG)

Sequence assembly

Source 📝

475:. This new sequencing method generated reads much shorter than those of Sanger sequencing: initially about 100 bases, now 400-500 bases. Its much higher throughput and lower cost (compared to Sanger sequencing) pushed the adoption of this technology by genome centers, which in turn pushed development of sequence assemblers that could efficiently handle the read sets. The sheer amount of data coupled with technology-specific error patterns in the reads delayed development of assemblers; at the beginning in 2004 only the 520:) continue to emerge. Despite the higher error rates of these technologies they are important for assembly because their longer read length helps to address the repeat problem. It is impossible to assemble through a perfect repeat that is longer than the maximum read length; however, as reads become longer the chance of a perfect repeat that large becomes small. This gives longer sequencing reads an advantage in assembling repeats even if they have low accuracy (~85%). 178: 33: 493:(previously Solexa) technology has been available and can generate about 100 million reads per run on a single sequencing machine. Compare this to the 35 million reads of the human genome project which needed several years to be produced on hundreds of sequencing machines. Illumina was initially limited to a length of only 36 bases, making it less suitable for de novo assembly (such as 442:) was invented and until shortly after 2000, the technology was improved up to a point where fully automated machines could churn out sequences in a highly parallelised mode 24 hours a day. Large genome centers around the world housed complete farms of these sequencing machines, which in turn led to the necessity of assemblers to be optimised for sequences from whole-genome 497:), but newer iterations of the technology achieve read lengths above 100 bases from both ends of a 3-400bp clone. Announced at the end of 2007, the SHARCGS assembler by Dohm et al. was the first published assembler that was used for an assembly with Solexa reads. It was quickly followed by a number of others. 304:) to 3 billion (e.g., the human genome) base pairs. Subsequent to these efforts, several other groups, mostly at the major genome sequencing centers, built large-scale assemblers, and an open source effort known as AMOS was launched to bring together all the innovations in genome assembly technology under the 710:. For example, sequencing "NAAAAAAAAAAAAN" and "NAAAAAAAAAAAN" which include 12 adenine might be wrongfully called with 11 adenine instead. Sequencing a highly repetitive segment of the target DNA/RNA might result in a call that is one base shorter or one base longer. Read quality is typically measured by 479:
assembler from 454 was available. Released in mid-2007, the hybrid version of the MIRA assembler by Chevreux et al. was the first freely available assembler that could assemble 454 reads as well as mixtures of 454 reads and Sanger reads. Assembling sequences from different sequencing technologies was
422:
The complexity of sequence assembly is driven by two major factors: the number of fragments and their lengths. While more and longer fragments allow better identification of sequence overlaps, they also pose problems as the underlying algorithms show quadratic or even exponential complexity behaviour
730:
Assembly: During this step, reads alignment will be utilized with different criteria to map each read to the possible location. The predicted position of a read is based on either how much of its sequence aligns with other reads or a reference. Different alignment algorithms are used for reads from
315:
Strategy how a sequence assembler would take fragments (shown below the black bar) and match overlaps among them to assembly the final sequence (in black). Potentially problematic repeats are shown above the sequence (in pink above). Without overlapping fragments it may be impossible to assign these
224:
advantages (i.e. call quality). The logic behind it is to group the reads by smaller windows within the reference. Reads in each group will then be reduced in size using the k-mere approach to select the highest quality and most probable contiguous (contig). Contigs will then will be joined together
168:
The problem of sequence assembly can be compared to taking many copies of a book, passing each of them through a shredder with a different cutter, and piecing the text of the book back together just by looking at the shredded pieces. Besides the obvious difficulty of this task, there are some extra
402:
Referring to the comparison drawn to shredded books in the introduction: while for mapping assemblies one would have a very similar book as a template (perhaps with the names of the main characters and a few locations changed), de-novo assemblies present a more daunting challenge in that one would
333:
of a cell and represent only a subset of the whole genome. A number of algorithmical problems differ between genome and EST assembly. For instance, genomes often have large amounts of repetitive sequences, concentrated in the intergenic regions. Transcribed genes contain many fewer repeats, making
373:
In terms of complexity and time requirements, de-novo assemblies are orders of magnitude slower and more memory intensive than mapping assemblies. This is mostly due to the fact that the assembly algorithm needs to compare every read with every other read (an operation that has a naive time
328:
or EST assembly was an early strategy, dating from the mid-1990s to the mid-2000s, to assemble individual genes rather than whole genomes. The problem differs from genome assembly in several ways. The input sequences for EST assembly are fragments of the transcribed
463:
With the Sanger technology, bacterial projects with 20,000 to 200,000 reads could easily be assembled on one computer. Larger projects, like the human genome with approximately 35 million reads, needed large computing farms and distributed computing.
397:
approach, which may also use one of the OLC or DBG approaches. With greedy graph-based algorithms, the contigs, series of reads aligned together, grow by greedy extension, always taking on the read that is found by following the highest-scoring
312: 423:
to both number of fragments and their length. And while shorter sequences are faster to align, they also complicate the layout phase of an assembly as shorter reads are more difficult to use with repeats or near identical repeats.
203:
Reference-guided: grouping of reads by similarity to the most similar region within the reference (step wise mapping). Reads within each group are then shortened down to mimic short reads quality. A typical method to do so is the
169:
practical issues: the original may have many repeated paragraphs, and some shreds may be modified during shredding to have typos. Excerpts from another book may also be added in, and some shreds may be completely unrecognizable.
426:
In the earliest days of DNA sequencing, scientists could only gain a few sequences of short length (some dozen bases) after weeks of work in laboratories. Hence, these sequences could be aligned in a few minutes by hand.
149:
technology might not be able to 'read' whole genomes in one go, but rather reads small pieces of between 20 and 30,000 bases, depending on the technology used. Typically, the short fragments (reads) result from
651:
Different organisms have a distinct region of higher complexity within their genome. Hence, the need of different computational approaches is needed. Some of the commonly used algorithms are:
703:
Pre-assembly: This step is essential to ensure the integrity of downstream analysis such as variant calling or final scaffold sequence. This step consists of two chronological workflows:
391:(DBG) approach, which is most widely applied to the short reads from the Solexa and SOLiD platforms. It relies on K-mer graphs, which performs well with vast quantities of short reads; 1260:
Li Z, Chen Y, Mu D, Yuan J, Shi Y, Zhang H, et al. (January 2012). "Comparison of the two major classes of assembly algorithms: overlap-layout-consensus and de-bruijn-graph".
300:
in 2000 and the human genome just a year later,—scientists developed assemblers like Celera Assembler and Arachne able to handle genomes of 130 million (e.g., the fruit fly
1503:
Simão FA, Waterhouse RM, Ioannidis P, Kriventseva EV, Zdobnov EM (October 2015). "BUSCO: assessing genome assembly and annotation completeness with single-copy orthologs".
403:
not know beforehand whether this would become a science book, a novel, a catalogue, or even several books. Also, every shred would be compared with every other shred.
743:. On the other hand, algorithms aligning 3rd generation sequencing reads requires advance approaches to account for the high error rate associated with them. 532:. However, such measures do not assess assembly completeness in terms of gene content. Some tools evaluate the quality of an assembly after the fact. 200:
Mapping/Aligning: assembling reads by aligning reads against a template (AKA reference). The assembled consensus may not be identical to the template.
666:
Greedy Graph Assembly: this approach score each added read to the assembly and selects the highest possible score from the overlapping region.
414:. On the other hand, in a mapping assembly, parts with multiple or no matches are usually left for another assembling technique to look into. 539:, using the fact that many genes are present only as single-copy genes in most genomes. The initial BUSCO sets represented 3023 genes for 768: 50: 1088:
Myers EW, Sutton GG, Delcher AL, Dew IM, Fasulo DP, Flanigan MJ, et al. (March 2000). "A whole-genome assembly of Drosophila".
354: 116: 948: 97: 1662: 916:
This is useful post-assembly. It can generate different statistics and perform multiple filtering steps to the alignment file.
535:
For instance, BUSCO (Benchmarking Universal Single-Copy Orthologs) is a measure of gene completeness in a genome, gene set, or
410:
representing neighboring repeats. Such information can be derived from reading a long fragment covering the repeats in full or
69: 938: 670:
Given a set of sequence fragments, the object is to find a longer sequence that contains all the fragments (see figure under
494: 437: 362: 350: 194: 54: 76: 861: 739:, quality, and the sequencing technique used plays a major role in choosing the best alignment algorithm in the case of 706:
Quality check: Depending on the type of sequencing technology, different errors might arise that would lead to a false
1225:
Nagaraj SH, Gasser RB, Ranganathan S (January 2007). "A hitchhiker's guide to expressed sequence tag (EST) analysis".
928: 764: 221: 209: 190: 189:
De-novo: assembling sequencing reads to create full-length (sometimes novel) sequences, without using a template (see
714:
which is an encoded score of each nucleotide quality within a read's sequence. Some sequencing technologies such as
338:), which means that unlike whole-genome shotgun sequencing, the reads are not uniformly sampled across the genome. 83: 43: 880:
This tool is designed to assemble (reference-guided) viral genomes at a greater accuracy using PacBio CCS reads.
334:
assembly somewhat easier. On the other hand, some genes are expressed (transcribed) in very high numbers (e.g.,
431: 158: 65: 242:
programs to piece together vast quantities of fragments generated by automated sequencing instruments called
1657: 619: 296: 1367:
Hu T, Chitnis N, Monos D, Dinh A (November 2021). "Next-generation sequencing technologies: An overview".
1105: 837: 481: 325: 162: 1453:"The Sanger FASTQ file format for sequences with quality scores, and the Solexa/Illumina FASTQ variants" 953: 238:
The first sequence assemblers began to appear in the late 1980s and early 1990s as variants of simpler
378:)). Current de-novo genome assemblers may use different types of graph-based algorithms, such as the: 1097: 747: 746:
Post-assembly: This step is focusing on extracting valuable information from the assembled sequence.
501: 342: 305: 217: 1110: 817: 736: 711: 517: 284:) which can, in the worst case, increase the time and space complexity of algorithms quadratically; 1404:"SHARCGS, a fast and highly accurate short-read assembly algorithm for de novo genomic sequencing" 718:
do not have a scoring method for their sequenced reads. A common tool used in this step is FastQC.
1298:
Harrington CT, Lin EI, Olson MT, Eshleman JR (September 2013). "Fundamentals of pyrosequencing".
1131: 933: 808:
This is a common tool used to check reads quality from different sequencing technologies such as
732: 505: 490: 443: 411: 361:
was invented, EST sequencing was replaced by this far more efficient technology, described under
239: 151: 138: 385:(OLC) approach, which was typical of the Sanger-data assemblers and relies on an overlap graph; 294:
Faced with the challenge of assembling the first larger eukaryotic genomes—the fruit fly
177: 90: 1634: 1569: 1520: 1482: 1433: 1384: 1315: 1277: 1242: 1180: 1123: 1036: 994: 943: 885: 813: 472: 335: 274: 1587: 1149:
Batzoglou S, Jaffe DB, Stanley K, Butler J, Gnerre S, Mauceli E, et al. (January 2002).
803: 774:
Some of the common tools used in different assembly steps are listed in the following table:
1624: 1559: 1551: 1512: 1472: 1464: 1423: 1415: 1376: 1336: 1307: 1269: 1234: 1170: 1162: 1115: 1028: 986: 875: 731:
different sequencing technologies. Some of the commonly used approaches in the assembly are
529: 1588:"Babraham Bioinformatics - FastQC A Quality Control tool for High Throughput Sequence Data" 977:
Sohn JI, Nam JW (January 2018). "The present and future of de novo whole-genome assembly".
721:
Filtering of reads: Reads that failed to pass the quality check should be removed from the
656: 513: 454: 287: 1101: 225:
to create a scaffold. The final consense is made by closing any gaps in the scaffold.
216:
Referenced-guided assembly is a combination of the other types. This type is applied on
1564: 1539: 1477: 1452: 1428: 1403: 809: 740: 468: 346: 263: 243: 146: 130: 1175: 1150: 1651: 1060: 536: 330: 1629: 1612: 1516: 528:
Most sequence assemblers have some algorithms built in for quality control, such as
1371:. Next Generation Sequencing and its Application to Medical Laboratory Immunology. 1135: 722: 707: 407: 1613:"Comparative analysis of algorithms for next-generation sequencing read alignment" 1119: 831: 699:
In general, there are three steps in assembling sequencing reads into a scaffold:
1380: 509: 32: 17: 1311: 1198: 540: 290:
in the fragments from the sequencing instruments, which can confound assembly.
259: 1040: 1016: 556: 544: 270: 1638: 1573: 1524: 1486: 1437: 1388: 1319: 1281: 1246: 1184: 1127: 998: 1468: 1273: 145:
sequence in order to reconstruct the original sequence. This is needed as
1238: 990: 903: 255: 251: 840:
tool. Mostly known for lightweight run and accurate sequence alignment.
311: 1419: 1032: 476: 358: 247: 1166: 893: 1555: 857: 715: 660: 552: 548: 246:. As the sequenced organisms grew in size and complexity (from small 154: 406:
Handling repeats in de-novo assembly requires the construction of a
769:
List of sequence alignment software § Short-read sequence alignment
341:
EST assembly is made much more complicated by features like (cis-)
851: 655:
Graph Assembly: is based on Graph theory in computer science. The
592: 310: 205: 750:
and population analysis are examples of post-assembly analysis.
559:. This table shows an example for human and fruit fly genomes: 1402:
Dohm JC, Lottaz C, Borodina T, Himmelbauer H (November 2007).
1061:"De novo genome assembly versus mapping to a reference genome" 142: 26: 911: 1451:
Cock PJ, Fields CJ, Goto N, Heuer ML, Rice PM (April 2010).
691:
The result might not be an optimal solution to the problem.
1017:"De novo genome assembly: what every biologist should know" 804:
https://www.bioinformatics.babraham.ac.uk/projects/fastqc/
185:
There are three approaches to assembling sequencing data:
876:
https://github.com/salvocamiolo/LoReTTA/releases/tag/v0.1
208:
approach. Reference-guided assembly is most useful using
898:
This is an assembly tool that runs on the command line.
266:
needed increasingly sophisticated strategies to handle:
563:
BUSCO notation assessment results (C,D,F, M in %)
1611:
Ruffalo M, LaFramboise T, Koyutürk M (October 2011).
1337:"MIRA 2.9.8 for 454 and 454 / Sanger hybrid assembly" 1331: 1329: 687:
Repeat step 2 and 3 until only one fragment is left.
280:identical and nearly identical sequences (known as 57:. Unsourced material may be challenged and removed. 1540:"How to apply de Bruijn graphs to genome assembly" 1538:Compeau PE, Pevzner PA, Tesler G (November 2011). 1068:University of Applied Sciences Western Switzerland 1293: 1291: 453:contain sequencing artifacts like sequencing and 1300:Archives of Pathology & Laboratory Medicine 832:https://sourceforge.net/projects/bio-bwa/files/ 678:Сalculate pairwise alignments of all fragments. 1220: 1218: 681:Choose two fragments with the largest overlap. 1498: 1496: 856:This command line tool is designed to handle 8: 1010: 1008: 659:is an example of this approach and utilizes 471:had been brought to commercial viability by 273:of sequencing data which need processing on 1151:"ARACHNE: a whole-genome shotgun assembler" 972: 970: 968: 1362: 1360: 1358: 1356: 1628: 1563: 1476: 1427: 1174: 1109: 516:were released and new technologies (e.g. 117:Learn how and when to remove this message 776: 561: 176: 964: 262:), the assembly programs used in these 1054: 1052: 1050: 767:. For a list of mapping aligners, see 725:file to get the best assembly contigs. 7: 735:graph and overlapping. Read length, 663:to assemble a contiguous from reads. 459:have error rates between 0.5 and 10% 141:and merging fragments from a longer 55:adding citations to reliable sources 894:http://cab.spbu.ru/software/spades/ 25: 1592:www.bioinformatics.babraham.ac.uk 355:post-transcriptional modification 1262:Briefings in Functional Genomics 949:List of sequenced animal genomes 316:segments to any specific region. 31: 864:and reads with 15% error rate. 852:https://github.com/lh3/minimap2 42:needs additional citations for 939:De novo transcriptome assembly 495:de novo transcriptome assembly 363:de novo transcriptome assembly 351:single-nucleotide polymorphism 195:de novo transcriptome assembly 1: 1630:10.1093/bioinformatics/btr477 1517:10.1093/bioinformatics/btv351 1120:10.1126/science.287.5461.2196 500:Later, new technologies like 1381:10.1016/j.humimm.2021.02.012 450:are about 800–900 bases long 1227:Briefings in Bioinformatics 979:Briefings in Bioinformatics 929:De novo sequence assemblers 765:De novo sequence assemblers 369:De-novo vs mapping assembly 191:de novo sequence assemblers 1679: 912:https://samtools.github.io 741:Next Generation Sequencing 672:Types of Sequence Assembly 181:Types of sequence assembly 1312:10.5858/arpa.2012-0463-RA 1015:Baker M (27 March 2012). 446:projects where the reads 357:. Beginning in 2008 when 778:Sequence Assembly Tools 383:Overlap/Layout/Consensus 890:Short & Long reads 828:Short & Long reads 695:Bioinformatics pipeline 684:Merge chosen fragments. 620:Drosophila melanogaster 297:Drosophila melanogaster 1663:DNA sequencing methods 1457:Nucleic Acids Research 418:Technological advances 326:Expressed sequence tag 317: 182: 954:Plant genome assembly 314: 180: 1544:Nature Biotechnology 1203:amos.sourceforge.net 748:Comparative genomics 480:subsequently coined 343:alternative splicing 51:improve this article 1469:10.1093/nar/gkp1137 1274:10.1093/bfgp/elr035 1102:2000Sci...287.2196M 1096:(5461): 2196–2204. 908:Alignment analysis 779: 647:Assembly algorithms 564: 518:Nanopore sequencing 433:dideoxy termination 66:"Sequence assembly" 1420:10.1101/gr.6435207 1239:10.1093/bib/bbl015 1033:10.1038/nmeth.1935 991:10.1093/bib/bbw096 934:Sequence alignment 777: 562: 506:Applied Biosystems 444:shotgun sequencing 395:Greedy graph-based 336:housekeeping genes 318: 275:computing clusters 240:sequence alignment 183: 152:shotgun sequencing 1623:(20): 2790–2796. 1511:(19): 3210–3212. 1414:(11): 1697–1706. 1341:groups.google.com 1167:10.1101/gr.208902 944:Set cover problem 920: 919: 644: 643: 473:454 Life Sciences 439:Sanger sequencing 412:only its two ends 135:sequence assembly 127: 126: 119: 101: 16:(Redirected from 1670: 1643: 1642: 1632: 1608: 1602: 1601: 1599: 1598: 1584: 1578: 1577: 1567: 1556:10.1038/nbt.2023 1535: 1529: 1528: 1500: 1491: 1490: 1480: 1463:(6): 1767–1771. 1448: 1442: 1441: 1431: 1399: 1393: 1392: 1369:Human Immunology 1364: 1351: 1350: 1348: 1347: 1333: 1324: 1323: 1306:(9): 1296–1303. 1295: 1286: 1285: 1257: 1251: 1250: 1222: 1213: 1212: 1210: 1209: 1195: 1189: 1188: 1178: 1146: 1140: 1139: 1113: 1085: 1079: 1078: 1076: 1074: 1065: 1056: 1045: 1044: 1012: 1003: 1002: 974: 780: 763:assemblers, see 565: 467:By 2004 / 2005, 374:complexity of O( 122: 115: 111: 108: 102: 100: 59: 35: 27: 21: 1678: 1677: 1673: 1672: 1671: 1669: 1668: 1667: 1648: 1647: 1646: 1610: 1609: 1605: 1596: 1594: 1586: 1585: 1581: 1550:(11): 987–991. 1537: 1536: 1532: 1502: 1501: 1494: 1450: 1449: 1445: 1408:Genome Research 1401: 1400: 1396: 1375:(11): 801–811. 1366: 1365: 1354: 1345: 1343: 1335: 1334: 1327: 1297: 1296: 1289: 1259: 1258: 1254: 1224: 1223: 1216: 1207: 1205: 1197: 1196: 1192: 1155:Genome Research 1148: 1147: 1143: 1087: 1086: 1082: 1072: 1070: 1063: 1058: 1057: 1048: 1014: 1013: 1006: 976: 975: 966: 962: 925: 862:Oxford Nanopore 759:For a lists of 757: 697: 657:de Bruijn Graph 649: 526: 524:Quality control 489:From 2006, the 483:hybrid assembly 455:cloning vectors 420: 389:de Bruijn Graph 377: 371: 323: 302:D. melanogaster 288:DNA read errors 264:genome projects 236: 231: 175: 159:gene transcript 123: 112: 106: 103: 60: 58: 48: 36: 23: 22: 18:Genome assembly 15: 12: 11: 5: 1676: 1674: 1666: 1665: 1660: 1658:Bioinformatics 1650: 1649: 1645: 1644: 1617:Bioinformatics 1603: 1579: 1530: 1505:Bioinformatics 1492: 1443: 1394: 1352: 1325: 1287: 1252: 1214: 1190: 1161:(1): 177–189. 1141: 1111:10.1.1.79.9822 1080: 1046: 1027:(4): 333–337. 1021:Nature Methods 1004: 963: 961: 958: 957: 956: 951: 946: 941: 936: 931: 924: 921: 918: 917: 914: 909: 906: 900: 899: 896: 891: 888: 882: 881: 878: 873: 870: 866: 865: 854: 849: 846: 842: 841: 834: 829: 826: 822: 821: 806: 801: 798: 794: 793: 790: 789:Tool web page 787: 784: 756: 753: 752: 751: 744: 728: 727: 726: 719: 696: 693: 689: 688: 685: 682: 679: 668: 667: 664: 648: 645: 642: 641: 638: 635: 632: 629: 626: 623: 615: 614: 611: 608: 605: 602: 599: 596: 588: 587: 584: 581: 578: 575: 572: 569: 525: 522: 469:pyrosequencing 461: 460: 457: 451: 419: 416: 400: 399: 392: 386: 375: 370: 367: 347:trans-splicing 322: 319: 292: 291: 285: 278: 244:DNA sequencers 235: 232: 230: 227: 214: 213: 201: 198: 174: 171: 147:DNA sequencing 131:bioinformatics 125: 124: 39: 37: 30: 24: 14: 13: 10: 9: 6: 4: 3: 2: 1675: 1664: 1661: 1659: 1656: 1655: 1653: 1640: 1636: 1631: 1626: 1622: 1618: 1614: 1607: 1604: 1593: 1589: 1583: 1580: 1575: 1571: 1566: 1561: 1557: 1553: 1549: 1545: 1541: 1534: 1531: 1526: 1522: 1518: 1514: 1510: 1506: 1499: 1497: 1493: 1488: 1484: 1479: 1474: 1470: 1466: 1462: 1458: 1454: 1447: 1444: 1439: 1435: 1430: 1425: 1421: 1417: 1413: 1409: 1405: 1398: 1395: 1390: 1386: 1382: 1378: 1374: 1370: 1363: 1361: 1359: 1357: 1353: 1342: 1338: 1332: 1330: 1326: 1321: 1317: 1313: 1309: 1305: 1301: 1294: 1292: 1288: 1283: 1279: 1275: 1271: 1267: 1263: 1256: 1253: 1248: 1244: 1240: 1236: 1232: 1228: 1221: 1219: 1215: 1204: 1200: 1194: 1191: 1186: 1182: 1177: 1172: 1168: 1164: 1160: 1156: 1152: 1145: 1142: 1137: 1133: 1129: 1125: 1121: 1117: 1112: 1107: 1103: 1099: 1095: 1091: 1084: 1081: 1069: 1062: 1055: 1053: 1051: 1047: 1042: 1038: 1034: 1030: 1026: 1022: 1018: 1011: 1009: 1005: 1000: 996: 992: 988: 984: 980: 973: 971: 969: 965: 959: 955: 952: 950: 947: 945: 942: 940: 937: 935: 932: 930: 927: 926: 922: 915: 913: 910: 907: 905: 902: 901: 897: 895: 892: 889: 887: 884: 883: 879: 877: 874: 871: 868: 867: 863: 859: 855: 853: 850: 847: 844: 843: 839: 835: 833: 830: 827: 824: 823: 819: 815: 811: 807: 805: 802: 799: 796: 795: 791: 788: 785: 782: 781: 775: 772: 770: 766: 762: 754: 749: 745: 742: 738: 734: 729: 724: 720: 717: 713: 709: 705: 704: 702: 701: 700: 694: 692: 686: 683: 680: 677: 676: 675: 673: 665: 662: 658: 654: 653: 652: 646: 639: 636: 633: 630: 627: 624: 622: 621: 617: 616: 612: 609: 606: 603: 600: 597: 595: 594: 590: 589: 585: 582: 579: 576: 573: 570: 567: 566: 560: 558: 554: 550: 546: 542: 538: 537:transcriptome 533: 531: 523: 521: 519: 515: 511: 507: 503: 498: 496: 492: 487: 485: 484: 478: 474: 470: 465: 458: 456: 452: 449: 448: 447: 445: 441: 440: 435: 434: 430:In 1975, the 428: 424: 417: 415: 413: 409: 404: 396: 393: 390: 387: 384: 381: 380: 379: 368: 366: 364: 360: 356: 352: 348: 344: 339: 337: 332: 327: 320: 313: 309: 307: 303: 299: 298: 289: 286: 283: 279: 276: 272: 269: 268: 267: 265: 261: 257: 253: 249: 245: 241: 233: 228: 226: 223: 219: 211: 207: 202: 199: 196: 192: 188: 187: 186: 179: 172: 170: 166: 164: 160: 156: 153: 148: 144: 140: 136: 132: 121: 118: 110: 99: 96: 92: 89: 85: 82: 78: 75: 71: 68: –  67: 63: 62:Find sources: 56: 52: 46: 45: 40:This article 38: 34: 29: 28: 19: 1620: 1616: 1606: 1595:. Retrieved 1591: 1582: 1547: 1543: 1533: 1508: 1504: 1460: 1456: 1446: 1411: 1407: 1397: 1372: 1368: 1344:. Retrieved 1340: 1303: 1299: 1268:(1): 25–37. 1265: 1261: 1255: 1230: 1226: 1206:. Retrieved 1202: 1193: 1158: 1154: 1144: 1093: 1089: 1083: 1071:. Retrieved 1067: 1024: 1020: 985:(1): 23–40. 982: 978: 838:command line 773: 760: 758: 698: 690: 671: 669: 650: 618: 593:Homo sapiens 591: 555:and 429 for 534: 527: 499: 488: 482: 466: 462: 438: 436:method (AKA 432: 429: 425: 421: 405: 401: 394: 388: 382: 372: 340: 324: 301: 295: 293: 281: 258:and finally 237: 215: 184: 167: 134: 128: 113: 107:October 2017 104: 94: 87: 80: 73: 61: 49:Please help 44:verification 41: 1233:(1): 6–21. 1199:"AMOS WIKI" 872:Long reads 848:Long reads 551:, 1438 for 543:, 2675 for 541:vertebrates 510:Ion Torrent 308:framework. 306:open source 222:short reads 1652:Categories 1597:2022-05-09 1346:2023-01-02 1208:2023-01-02 960:References 836:This is a 786:Read type 557:eukaryotes 547:, 843 for 545:arthropods 260:eukaryotes 229:Assemblies 218:long reads 210:long-reads 137:refers to 77:newspapers 1106:CiteSeerX 1041:1548-7105 845:MiniMap2 800:Multiple 783:Software 733:de Bruijn 708:base call 549:metazoans 271:terabytes 220:to mimic 1639:21856737 1574:22068540 1525:26059717 1487:20015970 1438:17908823 1389:33745759 1320:23991743 1282:22184334 1247:16772268 1185:11779843 1128:10731133 1059:Wolf B. 999:27742661 923:See also 904:Samtools 869:LoReTTA 810:Illumina 755:Programs 737:coverage 568:Species 491:Illumina 398:overlap. 256:bacteria 252:plasmids 157:DNA, or 139:aligning 1565:5531759 1478:2847217 1429:2045152 1136:6049420 1098:Bibcode 1090:Science 1073:6 April 797:FastQC 761:de-novo 625:13,918 598:20,364 477:Newbler 359:RNA-Seq 282:repeats 248:viruses 155:genomic 91:scholar 1637:  1572:  1562:  1523:  1485:  1475:  1436:  1426:  1387:  1318:  1280:  1245:  1183:  1176:155255 1173:  1134:  1126:  1108:  1039:  997:  886:SPAdes 860:& 858:PacBio 818:PacBio 816:, and 792:Notes 716:PacBio 661:k-mers 640:2,675 613:3,023 571:genes 353:, and 234:Genome 93:  86:  79:  72:  64:  1132:S2CID 1064:(PDF) 723:FASTQ 712:Phred 553:fungi 530:Phred 504:from 502:SOLiD 408:graph 250:over 206:k-mer 173:Types 98:JSTOR 84:books 1635:PMID 1570:PMID 1521:PMID 1483:PMID 1434:PMID 1385:PMID 1316:PMID 1278:PMID 1243:PMID 1181:PMID 1124:PMID 1075:2019 1037:ISSN 995:PMID 825:BWA 637:0.0 634:0.2 631:3.7 610:0.0 607:0.0 604:1.7 514:SMRT 512:and 331:mRNA 163:ESTs 70:news 1625:doi 1560:PMC 1552:doi 1513:doi 1473:PMC 1465:doi 1424:PMC 1416:doi 1377:doi 1308:doi 1304:137 1270:doi 1235:doi 1171:PMC 1163:doi 1116:doi 1094:287 1029:doi 987:doi 814:454 674:): 628:99 601:99 586:n: 583:M: 580:F: 577:D: 574:C: 321:EST 254:to 165:). 143:DNA 129:In 53:by 1654:: 1633:. 1621:27 1619:. 1615:. 1590:. 1568:. 1558:. 1548:29 1546:. 1542:. 1519:. 1509:31 1507:. 1495:^ 1481:. 1471:. 1461:38 1459:. 1455:. 1432:. 1422:. 1412:17 1410:. 1406:. 1383:. 1373:82 1355:^ 1339:. 1328:^ 1314:. 1302:. 1290:^ 1276:. 1266:11 1264:. 1241:. 1229:. 1217:^ 1201:. 1179:. 1169:. 1159:12 1157:. 1153:. 1130:. 1122:. 1114:. 1104:. 1092:. 1066:. 1049:^ 1035:. 1023:. 1019:. 1007:^ 993:. 983:19 981:. 967:^ 820:. 812:, 771:. 508:, 486:. 365:. 349:, 345:, 193:, 133:, 1641:. 1627:: 1600:. 1576:. 1554:: 1527:. 1515:: 1489:. 1467:: 1440:. 1418:: 1391:. 1379:: 1349:. 1322:. 1310:: 1284:. 1272:: 1249:. 1237:: 1231:8 1211:. 1187:. 1165:: 1138:. 1118:: 1100:: 1077:. 1043:. 1031:: 1025:9 1001:. 989:: 376:n 277:; 212:. 197:) 161:( 120:) 114:( 109:) 105:( 95:· 88:· 81:· 74:· 47:. 20:)

Index

Genome assembly

verification
improve this article
adding citations to reliable sources
"Sequence assembly"
news
newspapers
books
scholar
JSTOR
Learn how and when to remove this message
bioinformatics
aligning
DNA
DNA sequencing
shotgun sequencing
genomic
gene transcript
ESTs

de novo sequence assemblers
de novo transcriptome assembly
k-mer
long-reads
long reads
short reads
sequence alignment
DNA sequencers
viruses

Text is available under the Creative Commons Attribution-ShareAlike License. Additional terms may apply.