Compression of genomic sequencing data

Methods of compressing data tailored specifically for genomic data

High-throughput sequencing technologies have led to a dramatic decline in genome sequencing costs and to an astonishingly rapid accumulation of genomic data. These technologies are enabling ambitious genome sequencing endeavours, such as the 1000 Genomes Project and the 1001 (Arabidopsis thaliana) Genomes Project. The storage and transfer of this tremendous amount of genomic data have become a mainstream problem, motivating the development of high-performance compression tools designed specifically for genomic data. A recent surge of interest in novel algorithms and tools for storing and managing genomic re-sequencing data emphasizes the growing demand for efficient methods of genomic data compression.

General concepts

While standard data compression tools (e.g., zip and rar) are being used to compress sequence data (e.g., the GenBank flat file database), this approach has been criticized as extravagant because genomic sequences often contain repetitive content (e.g., microsatellite sequences) or many sequences exhibit high levels of similarity (e.g., multiple genome sequences from the same species). Additionally, the statistical and information-theoretic properties of genomic sequences can potentially be exploited for compressing sequencing data.

Figure 1: The principal steps of a workflow for compressing genomic re-sequencing data: (1) processing of the original sequencing data (e.g., reducing the original dataset to only the variations relative to a specified reference sequence); (2) encoding the processed data into binary form; and (3) decoding the data back to text form.

Base variants

With the availability of a reference template, only differences (e.g., single nucleotide substitutions and insertions/deletions) need to be recorded, thereby greatly reducing the amount of information to be stored. Relative compression is especially natural in genome re-sequencing projects, where the aim is to discover variations in individual genomes. A reference single nucleotide polymorphism (SNP) map, such as dbSNP, can be used to further reduce the number of variants that must be stored.

Relative genomic coordinates

Another useful idea is to store relative genomic coordinates in lieu of absolute coordinates. For example, representing sequence variant bases in the format ‘Position1Base1Position2Base2…’, ‘123C125T130G’ can be shortened to ‘0C2T5G’, where the integers represent intervals between the variants. The cost is the modest arithmetic calculation required to recover the absolute coordinates plus the storage of the correction factor (‘123’ in this example).

Prior information about the genomes

Further reduction can be achieved if all possible positions of substitutions in a pool of genome sequences are known in advance. For instance, if all locations of SNPs in a human population are known, then there is no need to record variant coordinate information (e.g., ‘123C125T130G’ can be abridged to ‘CTG’). This approach, however, is rarely appropriate because such information is usually incomplete or unavailable.

Encoding genomic coordinates

Encoding schemes are used to convert coordinate integers into binary form to provide additional compression gains. Encoding designs, such as the Golomb code and the Huffman code, have been incorporated into genomic data compression tools. Of course, encoding schemes entail accompanying decoding algorithms. The choice of decoding scheme potentially affects the efficiency of sequence information retrieval.

Algorithm design choices

A universal approach to compressing genomic data may not necessarily be optimal, as a particular method may be more suitable for specific purposes and aims. Thus, several design choices that potentially impact compression performance merit consideration.

Reference sequence

Selection of a reference sequence for relative compression can affect compression performance. Choosing a consensus reference sequence over a more specific reference sequence (e.g., the revised Cambridge Reference Sequence) can result in a higher compression ratio because the consensus reference may contain less bias in its data. Knowledge about the source of the sequence being compressed, however, may be exploited to achieve greater compression gains. The idea of using multiple reference sequences has been proposed. Brandon et al. (2009) alluded to the potential use of ethnic group-specific reference sequence templates, using the compression of mitochondrial DNA variant data as an example (see Figure 2). The authors found biased haplotype distribution in the mitochondrial DNA sequences of Africans, Asians, and Eurasians relative to the revised Cambridge Reference Sequence. Their result suggests that the revised Cambridge Reference Sequence may not always be optimal because a greater number of variants need to be stored when it is used against data from ethnically distant individuals. Additionally, a reference sequence can be designed based on statistical properties or engineered to improve the compression ratio.

Encoding schemes

The application of different types of encoding schemes has been explored to encode variant bases and genomic coordinates. Fixed codes, such as the Golomb code and the Rice code, are suitable when the variant or coordinate (represented as integer) distribution is well defined. Variable codes, such as the Huffman code, provide a more general entropy encoding scheme when the underlying variant and/or coordinate distribution is not well defined (this is typically the case in genomic sequence data).

List of genomic re-sequencing data compression tools

The compression ratio of currently available genomic data compression tools ranges between 65-fold and 1,200-fold for human genomes. Very close variants or revisions of the same genome can be compressed very efficiently (for example, a compression ratio of 18,133-fold was reported for two revisions of the same A. thaliana genome, which are 99.999% identical). However, such compression is not indicative of the typical compression ratio for different genomes (individuals) of the same organism. The most common encoding scheme amongst these tools is Huffman coding, which is used for lossless data compression.

Genomic sequencing data compression tools compatible with standard genome sequencing file formats (BAM & FASTQ)

| Software | Description | Compression Ratio | Data Used for Evaluation | Approach/Encoding Scheme | Link | Use Licence |
|---|---|---|---|---|---|---|
| PetaSuite | Lossless compression tool for BAM and FASTQ.gz files; transparent on-the-fly readback through BAM and FASTQ.gz virtual files | 60% to 90% | Human genome sequences from the 1000 Genomes Project | | https://petagene.com | Commercial |
| Genozip | A universal compressor for genomic files – compresses FASTQ, SAM/BAM/CRAM, VCF/BCF, FASTA, GFF/GTF/GVF, PHYLIP, BED and 23andMe files | | Human genome sequences from the 1000 Genomes Project | Genozip extensible framework | http://genozip.com | Commercial, but free for non-commercial use |
| Genomic Squeeze (G-SQZ) | Lossless compression tool designed for storing and analyzing sequencing read data | 65% to 76% | Human genome sequences from the 1000 Genomes Project | Huffman coding | http://public.tgen.org/sqz | -Undeclared- |
| CRAM (part of SAMtools) | Highly efficient and tunable reference-based compression of sequence data | | European Nucleotide Archive | deflate and rANS | http://www.ebi.ac.uk/ena/software/cram-toolkit | Apache-2.0 |
| Genome Compressor (GeCo) | A tool using a mixture of multiple Markov models for compressing reference and reference-free sequences | | Human nuclear genome sequence | Arithmetic coding | http://bioinformatics.ua.pt/software/geco/ or https://pratas.github.io/geco/ | GPLv3 |
| GenomSys codecs | Lossless compression of BAM and FASTQ files into the standard format ISO/IEC 23092 (MPEG-G) | 60% to 90% | Human genome sequences from the 1000 Genomes Project | Context-adaptive binary arithmetic coding (CABAC) | https://www.genomsys.com | Commercial |
| fastafs | Compression of FASTA / UCSC2Bit files into random access compressed archives. Toolkit to mount FASTA files, indices and dictionary files virtually. This allows neat file system (API-like) integration without the need to fully decompress archives for random / partial access. | | FASTA files | Huffman coding as implemented by Zstd | https://github.com/yhoogstrate/fastafs | GPL-v2.0 |

Genomic sequencing data compression tools not compatible with standard genome sequencing file formats

| Software | Description | Compression Ratio | Data Used for Evaluation | Approach/Encoding Scheme | Link | Use Licence |
|---|---|---|---|---|---|---|
| Genome Differential Compressor (GDC) | LZ77-style tool for compressing multiple genomes of the same species | 180 to 250-fold / 70 to 100-fold | Nuclear genome sequence of human and Saccharomyces cerevisiae | Huffman coding | http://sun.aei.polsl.pl/gdc | GPLv2 |
| Genome Re-Sequencing (GRS) | Reference sequence-based tool independent of a reference SNP map or sequence variation information | 159-fold / 18,133-fold / 82-fold | Nuclear genome sequence of human, Arabidopsis thaliana (different revisions of the same genome), and Oryza sativa | Huffman coding | https://web.archive.org/web/20121209070434/http://gmdd.shgmo.org/Computational-Biology/GRS/ | free of charge for non-commercial use |
| Genome Re-sequencing Encoding (GReEN) | Probabilistic copy model-based tool for compressing re-sequencing data using a reference sequence | ~100-fold | Human nuclear genome sequence | Arithmetic coding | http://bioinformatics.ua.pt/software/green/ | -Undeclared- |
| DNAzip | A package of compression tools | ~750-fold | Human nuclear genome sequence | Huffman coding | http://www.ics.uci.edu/~dnazip/ | -Undeclared- |
| GenomeZip | Compression with respect to a reference genome. Optionally uses external databases of genomic variations (e.g. dbSNP) | ~1200-fold | Human nuclear genome sequence (Watson) and sequences from the 1000 Genomes Project | Entropy coding for approximations of empirical distributions | https://sourceforge.net/projects/genomezip/ | -Undeclared- |

References

1. Giancarlo, R.; Scaturro, D.; Utro, F. (2009). "Textual data compression in computational biology: A synopsis". Bioinformatics. 25 (13): 1575–1586. doi:10.1093/bioinformatics/btp117. PMID 19251772.
2. Nalbantoğlu, O. U.; Russell, D. J.; Sayood, K. (2010). "Data Compression Concepts and Algorithms and their Applications to Bioinformatics". Entropy. 12 (1): 34. doi:10.3390/e12010034. PMC 2821113. PMID 20157640.
3. Hosseini, Morteza; Pratas, Diogo; Pinho, Armando (2016). "A Survey on Data Compression Methods for Biological Sequences". Information. 7 (4): 56. doi:10.3390/info7040056.
4. Brandon, M. C.; Wallace, D. C.; Baldi, P. (2009). "Data structures and compression algorithms for genomic sequence data". Bioinformatics. 25 (14): 1731–1738. doi:10.1093/bioinformatics/btp319. PMC 2705231. PMID 19447783.
5. Deorowicz, S.; Grabowski, S. (2011). "Robust relative compression of genomes with random access". Bioinformatics. 27 (21): 2979–2986. doi:10.1093/bioinformatics/btr505. PMID 21896510.
6. Wang, C.; Zhang, D. (2011). "A novel compression tool for efficient storage of genome resequencing data". Nucleic Acids Research. 39 (7): e45. doi:10.1093/nar/gkr009. PMC 3074166. PMID 21266471.
7. Pinho, A. J.; Pratas, D.; Garcia, S. P. (2012). "GReEn: A tool for efficient compression of genome resequencing data". Nucleic Acids Research. 40 (4): e27. doi:10.1093/nar/gkr1124. PMC 3287168. PMID 22139935.
8. Tembe, W.; Lowey, J.; Suh, E. (2010). "G-SQZ: Compact encoding of genomic sequence and quality data". Bioinformatics. 26 (17): 2192–2194. doi:10.1093/bioinformatics/btq346. PMID 20605925.
9. Christley, S.; Lu, Y.; Li, C.; Xie, X. (2009). "Human genomes as email attachments". Bioinformatics. 25 (2): 274–275. doi:10.1093/bioinformatics/btn582. PMID 18996942.
10. Pavlichin, D. S.; Weissman, T.; Yona, G. (2013). "The human genome contracts again". Bioinformatics. 29 (17): 2199–2202. doi:10.1093/bioinformatics/btt362. PMID 23793748.
11. Kuruppu, Shanika; Puglisi, Simon J.; Zobel, Justin (2011). "Reference Sequence Construction for Relative Compression of Genomes". String Processing and Information Retrieval. Lecture Notes in Computer Science. Vol. 7024. pp. 420–425. doi:10.1007/978-3-642-24583-1_41. ISBN 978-3-642-24582-4. S2CID 16007637.
12. Grabowski, Szymon; Deorowicz, Sebastian (2011). "Engineering Relative Compression of Genomes". arXiv:1103.2351.
13. Pratas, D.; Pinho, A. J.; Ferreira, P. J. S. G. "Efficient compression of genomic sequences". Data Compression Conference, Snowbird, Utah, 2016.
14. "The Importance of Data Compression in the Field of Genomics". IEEE Pulse. 2019-04-26. Retrieved 2024-02-22.
15. Lan, Divon; Llamas, Bastien (14 September 2022). "Genozip 14 - advances in compression of BAM and CRAM files". bioRxiv. doi:10.1101/2022.09.12.507582. S2CID 252357508.
16. Lan, Divon; Hughes, Daniel S T; Llamas, Bastien (7 July 2023). "Deep FASTQ and BAM co-compression in Genozip 15". bioRxiv. doi:10.1101/2023.07.07.548069. S2CID 259764998.
17. Lan, Divon; Tobler, Ray; Souilmi, Yassine; Llamas, Bastien (25 August 2021). "Genozip: a universal extensible genomic data compressor". Bioinformatics. 37 (16): 2225–2230. doi:10.1093/bioinformatics/btab102. PMC 8388020. PMID 33585897.
18. CRAM benchmarking.
19. CRAM format specification (version 3.0).
20. "ISO/IEC 23092-2:2019 Information technology — Genomic information representation — Part 2: Coding of genomic information". iso.org.
21. Alberti, Claudio; Paridaens, Tom; Voges, Jan; Naro, Daniel; Ahmad, Junaid J.; Ravasi, Massimo; Renzi, Daniele; Zoia, Giorgio; Ochoa, Idoia; Mattavelli, Marco; Delgado, Jaime; Hernaez, Mikel (27 September 2018). "An introduction to MPEG-G, the new ISO standard for genomic information representation". bioRxiv 10.1101/426353.
22. Hoogstrate, Youri; Jenster, Guido W.; van de Werken, Harmen J. G. (December 2021). "FASTAFS: file system virtualisation of random access compressed FASTA files". BMC Bioinformatics. 22 (1): 535. doi:10.1186/s12859-021-04455-3. PMC 8558547. PMID 34724897.
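As an illustration of the reference-based ("base variants") idea and of the abridgement that becomes possible when all variant positions are known in advance, the following is a minimal Python sketch. It handles substitutions only (no insertions or deletions), uses 0-based positions, and every function name and the toy sequences are hypothetical choices made for this example; it is not taken from any of the tools listed in this article.

```python
def find_substitutions(reference: str, sample: str) -> list[tuple[int, str]]:
    """Return (0-based position, sample base) for every mismatching site."""
    if len(reference) != len(sample):
        # Assumption of this sketch: substitutions only, so lengths must match.
        raise ValueError("sequences must have equal length")
    return [(i, s) for i, (r, s) in enumerate(zip(reference, sample)) if r != s]


def apply_substitutions(reference: str, variants: list[tuple[int, str]]) -> str:
    """Rebuild the sample by writing each stored base onto the reference."""
    bases = list(reference)
    for position, base in variants:
        bases[position] = base
    return "".join(bases)


def bases_at_known_sites(sample: str, known_sites: list[int]) -> str:
    """With a complete catalogue of variant sites, store only the bases."""
    return "".join(sample[i] for i in known_sites)


def restore_from_known_sites(reference: str, known_sites: list[int], bases: str) -> str:
    """Inverse of bases_at_known_sites, given the same site catalogue."""
    return apply_substitutions(reference, list(zip(known_sites, bases)))


reference = "ACGTACGTAC"   # toy reference template
sample = "ACCTACGAAC"      # toy re-sequenced genome

variants = find_substitutions(reference, sample)
print(variants)                                   # [(2, 'C'), (7, 'A')]
print(apply_substitutions(reference, variants) == sample)

known_sites = [2, 5, 7]    # hypothetical catalogue of all possible variant sites
abridged = bases_at_known_sites(sample, known_sites)
print(abridged)                                   # CCA
print(restore_from_known_sites(reference, known_sites, abridged) == sample)
```

The abridged form is only lossless when every site at which the sample differs from the reference is in the catalogue, which is the limitation noted in the text: such information is usually incomplete or unavailable.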

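The relative-coordinate example from the text (‘123C125T130G’ shortened to ‘0C2T5G’ with the correction factor ‘123’) can be sketched in Python as follows. The function names are hypothetical, and the sketch assumes the simple ‘PositionBase’ string format with bases drawn from A, C, G and T.

```python
import re

PAIR = re.compile(r"(\d+)([ACGT])")


def to_relative(variants: str) -> tuple[int, str]:
    """Convert 'Position1Base1Position2Base2...' to (correction factor, intervals)."""
    pairs = [(int(pos), base) for pos, base in PAIR.findall(variants)]
    if not pairs:
        return 0, ""
    offset = pairs[0][0]      # correction factor, stored once
    previous = offset
    parts = []
    for position, base in pairs:
        parts.append(f"{position - previous}{base}")  # gap since previous variant
        previous = position
    return offset, "".join(parts)


def to_absolute(offset: int, relative: str) -> str:
    """Recover absolute coordinates by accumulating the stored intervals."""
    position = offset
    parts = []
    for gap, base in PAIR.findall(relative):
        position += int(gap)
        parts.append(f"{position}{base}")
    return "".join(parts)


offset, relative = to_relative("123C125T130G")
print(offset, relative)                # 123 0C2T5G
print(to_absolute(offset, relative))   # 123C125T130G
```

The decoding loop is the "modest arithmetic calculation" mentioned in the text: one addition per variant.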

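To make the step of converting coordinate integers into binary form concrete, here is a minimal Python sketch of a Rice code (a Golomb code whose parameter is a power of two, 2^k) applied to the intervals 0, 2 and 5 from the ‘0C2T5G’ example. The unary convention used here (q ones followed by a zero) and the choice k = 2 are assumptions of this sketch; real tools may use other conventions and tune the parameter to the data.

```python
def rice_encode(n: int, k: int) -> str:
    """Encode a non-negative integer as a bit string: unary quotient + k-bit remainder."""
    if n < 0 or k < 0:
        raise ValueError("n and k must be non-negative")
    quotient = n >> k
    bits = "1" * quotient + "0"                           # unary part
    if k:
        bits += format(n & ((1 << k) - 1), f"0{k}b")      # remainder in exactly k bits
    return bits


def rice_decode_stream(bits: str, k: int) -> list[int]:
    """Decode a concatenation of Rice codewords (assumes a well-formed stream)."""
    values = []
    i = 0
    while i < len(bits):
        quotient = 0
        while bits[i] == "1":          # count the unary ones
            quotient += 1
            i += 1
        i += 1                         # skip the terminating zero
        remainder = int(bits[i:i + k], 2) if k else 0
        i += k
        values.append((quotient << k) | remainder)
    return values


gaps = [0, 2, 5]                       # intervals from the '0C2T5G' example
encoded = "".join(rice_encode(g, 2) for g in gaps)
print(encoded)                         # 0000101001
print(rice_decode_stream(encoded, 2))  # [0, 2, 5]
```

Small intervals receive short codewords, which is why such fixed codes suit coordinate gaps whose distribution is well defined; when it is not, a Huffman code built from observed frequencies is the more general option described in the text.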