54:
31:) Genomes Project. The storage and transfer of the tremendous amount of genomic data have become a mainstream problem, motivating the development of high-performance compression tools designed specifically for genomic data. A recent surge of interest in the development of novel algorithms and tools for storing and managing genomic re-sequencing data emphasizes the growing demand for efficient methods for genomic data compression.
140:) can result in a higher compression ratio because the consensus reference may contain less bias in its data. Knowledge about the source of the sequence being compressed, however, may be exploited to achieve greater compression gains. The idea of using multiple reference sequences has been proposed. Brandon et al. (2009) alluded to the potential use of ethnic group-specific reference sequence templates, using the compression of
99:
Further reduction can be achieved if all possible positions of substitutions in a pool of genome sequences are known in advance. For instance, if all locations of SNPs in a human population are known, then there is no need to record variant coordinate information (e.g., "123C125T130G" can be abridged
1364:
Alberti, Claudio; Paridaens, Tom; Voges, Jan; Naro, Daniel; Ahmad, Junaid J.; Ravasi, Massimo; Renzi, Daniele; Zoia, Giorgio; Ochoa, Idoia; Mattavelli, Marco; Delgado, Jaime; Hernaez, Mikel (27 September 2018). "An introduction to MPEG-G, the new ISO standard for genomic information representation".
60:
The principal steps of a workflow for compressing genomic re-sequencing data: (1) processing of the original sequencing data (e.g., reducing the original dataset to only variations relative to a specified reference sequence); (2) encoding the processed data into binary form; and (3) decoding the data
185:
The compression ratio of currently available genomic data compression tools ranges between 65-fold and 1,200-fold for human genomes. Very close variants or revisions of the same genome can be compressed very efficiently (for example, an 18,133-fold compression ratio was reported for two revisions of the
70:
With the availability of a reference template, only differences (e.g., single nucleotide substitutions and insertions/deletions) need to be recorded, thereby greatly reducing the amount of information to be stored. The notion of relative compression is especially evident in genome re-sequencing
156:
may not always be optimal because a greater number of variants need to be stored when it is used against data from ethnically distant individuals. Additionally, a reference sequence can be designed based on statistical properties or engineered to improve the compression ratio.
127:
A universal approach to compressing genomic data may not necessarily be optimal, as a particular method may be more suitable for specific purposes and aims. Thus, several design choices that potentially impact compression performance warrant consideration.
407:
Compression of FASTA / UCSC2Bit files into random access compressed archives. Toolkit to mount FASTA files, indices and dictionary files virtually. This allows neat file system (API-like) integration without the need to fully decompress archives for random / partial
91:", "123C125T130G" can be shortened to "0C2T5G", where the integers represent intervals between the variants. The cost is the modest arithmetic calculation required to recover the absolute coordinates plus the storage of the correction factor ("123" in this example).
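The conversion between absolute and relative coordinates can be illustrated with a minimal Python sketch. The function names (`parse_variants`, `to_relative`, `to_absolute`) and the restriction to single-base substitutions written as position followed by base are assumptions made for this illustration, not part of any tool listed in this article.

```python
import re

def parse_variants(record):
    """Split a string such as '123C125T130G' into (number, base) pairs."""
    return [(int(num), base) for num, base in re.findall(r"(\d+)([ACGTN])", record)]

def to_relative(record):
    """Return (offset, relative_record).

    The offset is the first absolute position (the correction factor);
    every integer in the relative record is the gap to the previous variant.
    """
    variants = parse_variants(record)
    offset = variants[0][0]
    previous = offset
    parts = []
    for position, base in variants:
        parts.append(f"{position - previous}{base}")
        previous = position
    return offset, "".join(parts)

def to_absolute(offset, relative_record):
    """Invert to_relative by accumulating the gaps, starting from the offset."""
    position = offset
    parts = []
    for gap, base in parse_variants(relative_record):
        position += gap
        parts.append(f"{position}{base}")
    return "".join(parts)

offset, relative = to_relative("123C125T130G")
print(offset, relative)                # 123 0C2T5G
print(to_absolute(offset, relative))   # 123C125T130G
```

The gaps are typically much smaller than the absolute coordinates, so they need fewer digits (or fewer bits after binary encoding), at the cost of a running sum during decoding.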
50:) or many sequences exhibit high levels of similarity (e.g., multiple genome sequences from the same species). Additionally, the statistical and information-theoretic properties of genomic sequences can potentially be exploited for compressing sequencing data.
186:
same A. thaliana genome, which are 99.999% identical). However, such compression is not indicative of the typical compression ratio for different genomes (individuals) of the same organism. The most common encoding scheme amongst these tools is
119:, have been incorporated into genomic data compression tools. Of course, encoding schemes entail accompanying decoding algorithms. Choice of the decoding scheme potentially affects the efficiency of sequence information retrieval.
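As a rough illustration of how an integer code and its matching decoder fit together, the following Python sketch implements a Rice code, that is, a Golomb code whose parameter is a power of two. The function names, the bit-string representation and the choice `k = 2` are assumptions made for readability; practical tools pack bits into bytes and choose the parameter from the data.

```python
def rice_encode(value, k):
    """Encode a non-negative integer as a Rice codeword (a string of '0'/'1').

    The quotient value >> k is written in unary (that many '1's, then a '0'),
    followed by the remainder in exactly k binary digits.
    """
    if value < 0 or k < 0:
        raise ValueError("value and k must be non-negative")
    quotient = value >> k
    remainder = value & ((1 << k) - 1)
    bits = "1" * quotient + "0"
    if k:
        bits += format(remainder, f"0{k}b")
    return bits

def rice_decode(bits, k):
    """Decode a concatenation of Rice codewords back into a list of integers.

    Assumes the input is well formed (no truncated codeword).
    """
    values = []
    i = 0
    while i < len(bits):
        quotient = 0
        while bits[i] == "1":          # unary part
            quotient += 1
            i += 1
        i += 1                         # skip the terminating '0'
        remainder = int(bits[i:i + k], 2) if k else 0
        i += k
        values.append((quotient << k) | remainder)
    return values

gaps = [0, 2, 5]                       # intervals between variants
stream = "".join(rice_encode(g, 2) for g in gaps)
print(stream)                          # 0000101001
print(rice_decode(stream, 2))          # [0, 2, 5]
```

Because each codeword is self-delimiting, the decoder can read the stream sequentially without separators; random access to a given variant would need an additional index.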
22:
technologies have led to a dramatic decline of genome sequencing costs and to an astonishingly rapid accumulation of genomic data. These technologies are enabling ambitious genome sequencing endeavours, such as the
136:
Selection of a reference sequence for relative compression can affect compression performance. Choosing a consensus reference sequence over a more specific reference sequence (e.g., the revised
177:, provide a more general entropy encoding scheme when the underlying variant and/or coordinate distribution is not well-defined (this is typically the case in genomic sequence data).
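A variable-length code built from observed frequencies can be sketched as follows. This is a minimal textbook Huffman construction in Python, not the implementation used by any particular tool; the function name `huffman_code` and the sample input are hypothetical.

```python
import heapq
from collections import Counter

def huffman_code(symbols):
    """Build a prefix-free code (symbol -> bit string) from observed frequencies."""
    freq = Counter(symbols)
    if not freq:
        raise ValueError("need at least one symbol")
    if len(freq) == 1:
        # A single distinct symbol still needs one bit per occurrence.
        return {sym: "0" for sym in freq}
    # Heap entries: (count, tie_breaker, partial code table for that subtree).
    heap = [(count, index, {sym: ""})
            for index, (sym, count) in enumerate(freq.items())]
    heapq.heapify(heap)
    next_index = len(heap)
    while len(heap) > 1:
        count1, _, table1 = heapq.heappop(heap)   # two least frequent subtrees
        count2, _, table2 = heapq.heappop(heap)
        merged = {sym: "0" + code for sym, code in table1.items()}
        merged.update({sym: "1" + code for sym, code in table2.items()})
        heapq.heappush(heap, (count1 + count2, next_index, merged))
        next_index += 1
    return heap[0][2]

variant_bases = "AAAAAAGGTC"           # hypothetical, skewed towards 'A'
code = huffman_code(variant_bases)
encoded = "".join(code[base] for base in variant_bases)
print({sym: len(bits) for sym, bits in code.items()})
print(len(encoded), "bits instead of", 2 * len(variant_bases))
```

Frequent symbols receive short codewords and rare ones longer codewords, so no distribution has to be assumed in advance; the code table (or the frequencies) must be stored alongside the data so that the decoder can rebuild it.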
87:
Another useful idea is to store relative genomic coordinates in lieu of absolute coordinates. For example, representing sequence variant bases in the format β
386:
165:
Different types of encoding schemes have been explored to encode variant bases and genomic coordinates. Fixed codes, such as the
1123:
111:
schemes are used to convert coordinate integers into binary form to provide additional compression gains. Encoding designs, such as the
258:
A universal compressor for genomic files: compresses FASTQ, SAM/BAM/CRAM, VCF/BCF, FASTA, GFF/GTF/GVF, PHYLIP, BED and 23andMe files
1184:
1106:
Kuruppu, Shanika; Puglisi, Simon J.; Zobel, Justin (2011). "Reference Sequence Construction for Relative Compression of Genomes".
173:, are suitable when the variant or coordinate (represented as integer) distribution is well defined. Variable codes, such as the
71:
projects where the aim is to discover variations in individual genomes. The use of a reference single nucleotide polymorphism (
516:
232:
Lossless compression tool for BAM and FASTQ.gz files; transparent on-the-fly readback through BAM and FASTQ.gz virtual files
72:
1347:"ISO/IEC 23092-2:2019 Information technology - Genomic information representation - Part 2: Coding of genomic information"
46:), this approach has been criticized as extravagant because genomic sequences often contain repetitive content (e.g.,
to "CTG"). This approach, however, is rarely appropriate because such information is usually incomplete or unavailable.
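When a complete catalogue of variant positions is shared by encoder and decoder, only the bases need to be stored. The Python sketch below illustrates the idea under that assumption; `KNOWN_SNP_POSITIONS`, `strip_positions` and `restore_positions` are hypothetical names introduced for this example.

```python
# Hypothetical catalogue of every SNP position, known to both encoder and decoder.
KNOWN_SNP_POSITIONS = [123, 125, 130]

def strip_positions(variants, known_positions):
    """Keep only the bases; valid only if variants cover exactly the known positions."""
    if [position for position, _ in variants] != list(known_positions):
        raise ValueError("variants do not match the shared position catalogue")
    return "".join(base for _, base in variants)

def restore_positions(bases, known_positions):
    """Re-attach coordinates by pairing each stored base with its known position."""
    if len(bases) != len(known_positions):
        raise ValueError("one base is required per known position")
    return list(zip(known_positions, bases))

variants = [(123, "C"), (125, "T"), (130, "G")]
compact = strip_positions(variants, KNOWN_SNP_POSITIONS)
print(compact)                                          # CTG
print(restore_positions(compact, KNOWN_SNP_POSITIONS))  # [(123, 'C'), (125, 'T'), (130, 'G')]
```

The explicit check in `strip_positions` reflects the limitation noted above: a variant at a position missing from the catalogue cannot be represented, which is why the scheme depends on complete prior knowledge.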
585:
Compression with respect to a reference genome. Optionally uses external databases of genomic variations (e.g. dbSNP)
198:
Genomic sequencing data compression tools compatible with standard genome sequencing file formats (BAM & FASTQ)
327:
53:
19:
1240:
Lan, Divon; Hughes, Daniel S T; Llamas, Bastien (7 July 2023). "Deep FASTQ and BAM co-compression in Genozip 15".
191:
1209:
Lan, Divon; Llamas, Bastien (14 September 2022). "Genozip 14 - advances in compression of BAM and CRAM files".
47:
39:
While standard data compression tools (e.g., zip and rar) are being used to compress sequence data (e.g.,
347:
A tool using a mixture of multiple Markov models for compressing reference and reference-free sequences
433:
Genomic sequencing data compression tools not compatible with standard genome sequencing file formats
979:
Tembe, W.; Lowey, J.; Suh, E. (2010). "G-SQZ: Compact encoding of genomic sequence and quality data".
1366:
24:
497:
Reference sequence-based tool independent of a reference SNP map or sequence variation information
144:
variant data as an example (see Figure 2). The authors found a biased haplotype distribution in the
Probabilistic copy model-based tool for compressing re-sequencing data using a reference sequence
308:
43:
1170:
Pratas, D., Pinho, A. J., and Ferreira, P. J. S. G. Efficient compression of genomic sequences.
1147:
Grabowski, Szymon; Deorowicz, Sebastian (2011). "Engineering Relative Compression of Genomes".
Lossless compression of BAM and FASTQ files into the standard format ISO/IEC 23092 (MPEG-G)
517:
https://web.archive.org/web/20121209070434/http://gmdd.shgmo.org/Computational-Biology/GRS/
Hoogstrate, Youri; Jenster, Guido W.; van de Werken, Harmen J. G. (December 2021).
670:"Data Compression Concepts and Algorithms and their Applications to Bioinformatics"
174:
116:
1332:
591:
Human nuclear genome sequence (Watson) and sequences from the 1000 Genomes Project
284:
Lossless compression tool designed for storing and analyzing sequencing read data
1388:"FASTAFS: file system virtualisation of random access compressed FASTA files"
1319:
1271:
Lan, Divon; Tobler, Ray; Souilmi, Yassine; Llamas, Bastien (25 August 2021).
358:
873:"A novel compression tool for efficient storage of genome resequencing data"
Highly efficient and tunable reference-based compression of sequence data
888:
419:
312:
108:
737:
720:
1110:. Lecture Notes in Computer Science. Vol. 7024. pp. 420–425.
40:
771:"Data structures and compression algorithms for genomic sequence data"
686:
927:"GReEn: A tool for efficient compression of genome resequencing data"
148:
sequences of Africans, Asians, and Eurasians relative to the revised
79:, can be used to further reduce the number of variants that must be stored.
467:
LZ77-style tool for compressing multiple genomes of the same species
Methods of compressing data tailored specifically for genomic data
362:
721:"A Survey on Data Compression Methods for Biological Sequences"
629:"Textual data compression in computational biology: A synopsis"
482:
1185:"The Importance of Data Compression in the Field of Genomics"
296:
594:
Entropy coding for approximations of empirical distributions
825:"Robust relative compression of genomes with random access"
392:
1273:"Genozip: a universal extensible genomic data compressor"
719:
Hosseini, Morteza; Pratas, Diogo; Pinho, Armando (2016).
668:
Nalbantoğlu, O. U.; Russell, D. J.; Sayood, K. (2010).
243:
383:
Human genome sequences from the 1000 Genomes Project
290:
Human genome sequences from the 1000 Genomes Project
269:
263:
Human genome sequences from the 1000 Genomes Project
238:
Human genome sequences from the 1000 Genomes Project
181:
List of genomic re-sequencing data compression tools
769:Brandon, M. C.; Wallace, D. C.; Baldi, P. (2009).
1065:Pavlichin, D. S.; Weissman, T.; Yona, G. (2013).
925:Pinho, A. J.; Pratas, D.; Garcia, S. P. (2012).
1019:Christley, S.; Lu, Y.; Li, C.; Xie, X. (2009).
627:Giancarlo, R.; Scaturro, D.; Utro, F. (2009).
507:(different revisions of the same genome), and
333:http://www.ebi.ac.uk/ena/software/cram-toolkit
8:
1060:
1058:
1056:
274:Commercial, but free for non-commercial use
1108:String Processing and Information Retrieval
598:https://sourceforge.net/projects/genomezip/
544:http://bioinformatics.ua.pt/software/green/
359:http://bioinformatics.ua.pt/software/geco/
387:Context-adaptive binary arithmetic coding
152:. Their result suggests that the revised
431:
196:
1334:CRAM format specification (version 3.0)
420:https://github.com/yhoogstrate/fastafs
823:Deorowicz, S.; Grabowski, S. (2011).
818:
816:
814:
528:Genome Re-sequencing Encoding (GReEn)
521:free of charge for non-commercial use
473:Nuclear genome sequence of human and
416:Huffman coding as implemented by Zstd
7:
1021:"Human genomes as email attachments"
464:Genome Differential Compressor (GDC)
95:Prior information about the genomes
1067:"The human genome contracts again"
503:Nuclear genome sequence of human,
14:
500:159-fold / 18,133-fold / 82-fold
470:180 to 250-fold / 70 to 100-fold
571:http://www.ics.uci.edu/~dnazip/
1289:10.1093/bioinformatics/btab102
558:A package of compression tools
363:https://pratas.github.io/geco/
352:Human nuclear genome sequence
1:
1084:10.1093/bioinformatics/btt362
1038:10.1093/bioinformatics/btn582
993:10.1093/bioinformatics/btq346
842:10.1093/bioinformatics/btr505
787:10.1093/bioinformatics/btp319
646:10.1093/bioinformatics/btp117
564:Human nuclear genome sequence
537:Human nuclear genome sequence
266:Genozip extensible framework
89:Position1Base1Position2Base2…
1116:10.1007/978-3-642-24583-1_41
871:Wang, C.; Zhang, D. (2011).
323:European Nucleotide Archive
154:Cambridge Reference Sequence
150:Cambridge Reference Sequence
138:Cambridge Reference Sequence
104:Encoding genomic coordinates
83:Relative genomic coordinates
1172:Data Compression Conference
483:http://sun.aei.polsl.pl/gdc
1460:
1405:10.1186/s12859-021-04455-3
494:Genome Re-Sequencing (GRS)
297:http://public.tgen.org/sqz
20:High-throughput sequencing
1250:10.1101/2023.07.07.548069
1219:10.1101/2022.09.12.507582
344:Genome Compressor (GeCo)
215:Approach/Encoding Scheme
212:Data Used for Evaluation
192:lossless data compression
475:Saccharomyces cerevisiae
450:Approach/Encoding Scheme
447:Data Used for Evaluation
393:https://www.genomsys.com
281:Genomic Squeeze (G-SQZ)
123:Algorithm design choices
48:microsatellite sequences
1174:, Snowbird, Utah, 2016.
931:Nucleic Acids Research
877:Nucleic Acids Research
62:
56:
505:Arabidopsis thaliana
244:https://petagene.com
190:, which is used for
29:Arabidopsis thaliana
25:1000 Genomes Project
1444:Genomics techniques
943:10.1093/nar/gkr1124
738:10.3390/info7040056
434:
199:
1392:BMC Bioinformatics
889:10.1093/nar/gkr009
432:
355:Arithmetic coding
270:http://genozip.com
209:Compression Ratio
197:
132:Reference sequence
63:
61:back to text form.
44:flat file database
1321:CRAM benchmarking
1283:(16): 2225–2230.
1125:978-3-642-24582-4
1077:(17): 2199–2302.
987:(17): 2192–2194.
835:(21): 2979–2986.
781:(14): 1731–1738.
687:10.3390/e12010034
639:(13): 1575–1586.
608:
607:
540:Arithmetic coding
444:Compression Ratio
430:
429:
146:mitochondrial DNA
142:mitochondrial DNA
374:GenomSys codecs
200:
161:Encoding schemes
35:General concepts
293:Huffman coding
75:) map, such as
1372:10.1101/426353
1356:
1338:
1325:
1312:
1277:Bioinformatics
1263:
1232:
1201:
1176:
1160:
1139:
1124:
1098:
1071:Bioinformatics
1052:
1031:(2): 274–275.
1025:Bioinformatics
1006:
981:Bioinformatics
966:
912:
856:
829:Bioinformatics
810:
775:Bioinformatics
744:
711:
660:
633:Bioinformatics
567:Huffman coding
513:Huffman coding
479:Huffman coding
188:Huffman coding
301:-Undeclared-
66:Base variants
1193:. Retrieved
1191:. 2019-04-26
1188:
1179:
1171:
1142:
1107:
1101:
1074:
1070:
1028:
1024:
984:
980:
934:
930:
880:
876:
832:
828:
778:
774:
728:
724:
714:
677:
673:
663:
636:
632:
602:-Undeclared-
575:-Undeclared-
548:-Undeclared-
509:Oryza sativa
508:
504:
474:
326:deflate and
221:Use License
206:Description
184:
175:Huffman code
164:
135:
126:
117:Huffman code
107:
98:
88:
86:
69:
57:
38:
28:
18:
725:Information
456:Use License
441:Description
413:FASTA files
397:Commercial
380:60% to 90%
337:Apache-2.0
287:65% to 76%
248:Commercial
235:60% to 90%
167:Golomb code
113:Golomb code
1398:(1): 535.
1195:2024-02-22
1189:IEEE Pulse
937:(4): e27.
883:(7): e45.
611:References
588:~1200-fold
459:Reference
229:PetaSuite
224:Reference
27:and 1001 (
1258:259764998
1227:252357508
1154:1103.2351
731:(4): 56.
680:(1): 34.
582:GenomeZip
561:~750-fold
534:~100-fold
311:(part of
203:Software
171:Rice code
58:Figure 1:
1438:Category
1424:34724897
1307:33585897
1134:16007637
1093:23793748
1047:18996942
1001:20605925
961:22139935
907:21266471
851:21896510
805:19447783
706:20157640
655:19251772
438:Software
424:GPL-v2.0
389:(CABAC)
313:SAMtools
255:Genozip
169:and the
115:and the
109:Encoding
1415:8558547
1367:bioRxiv
1351:iso.org
1298:8388020
1242:bioRxiv
1211:bioRxiv
952:3287168
898:3074166
796:2705231
697:2821113
674:Entropy
408:access.
404:fastafs
41:GenBank
555:DNAzip
367:GPLv3
1254:S2CID
1223:S2CID
1149:arXiv
1130:S2CID
487:GPLv2
218:Link
77:dbSNP
1420:PMID
1303:PMID
1120:ISBN
1089:PMID
1043:PMID
997:PMID
957:PMID
903:PMID
847:PMID
801:PMID
702:PMID
651:PMID
453:Link
328:rANS
309:CRAM
1410:PMC
1400:doi
1293:PMC
1285:doi
1246:doi
1215:doi
1112:doi
1079:doi
1033:doi
989:doi
947:PMC
939:doi
893:PMC
885:doi
837:doi
791:PMC
783:doi
733:doi
692:PMC
682:doi
641:doi
361:or
73:SNP
1440::
1418:.
1408:.
1396:22
1394:.
1390:.
1349:.
1301:.
1291:.
1281:37
1279:.
1275:.
1252:.
1244:.
1221:.
1213:.
1187:.
1163:^
1128:.
1118:.
1087:.
1075:29
1073:.
1069:.
1055:^
1041:.
1029:25
1027:.
1023:.
1009:^
995:.
985:26
983:.
969:^
955:.
945:.
935:40
933:.
929:.
915:^
901:.
891:.
881:39
879:.
875:.
859:^
845:.
833:27
831:.
827:.
813:^
799:.
789:.
779:25
777:.
773:.
747:^
727:.
723:.
700:.
690:.
678:12
676:.
672:.
649:.
637:25
635:.
631:.
619:^
315:)
194:.
1426:.
1402::
1375:.
1353:.
1309:.
1287::
1260:.
1248::
1229:.
1217::
1198:.
1157:.
1151::
1136:.
1114::
1095:.
1081::
1049:.
1035::
1003:.
991::
963:.
941::
909:.
887::
853:.
839::
807:.
785::
741:.
735::
729:7
708:.
684::
657:.
643::