Application checkpointing

checkpointing that allows users to select which data needs to be protected, in order to improve efficiency and avoid space, time and energy waste. It offers a direct data interface so that users do not need to deal with files and/or directory names. All metadata is managed by FTI in a transparent fashion for the user. If desired, users can dedicate one process per node to overlap fault tolerance workload and scientific computation, so that post-checkpoint tasks are executed asynchronously.
imperative. Thus the "checkpoint/restart" capability was born, in which after a number of transactions had been processed, a "snapshot" or "checkpoint" of the state of the application could be taken. If the application failed before the next checkpoint, it could be restarted by giving it the checkpoint information and the last place in the transaction file where a transaction had successfully completed. The application could then restart at that point.
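The batch checkpoint/restart scheme described here can be sketched in a few lines. This is a minimal illustration, not a reconstruction of any historical system: the function names (`save_checkpoint`, `load_checkpoint`, `run_batch`), the JSON checkpoint format, and the `fail_at` parameter used to simulate a crash are all assumptions made for the example. The checkpoint records the accumulated state together with the index of the next unprocessed transaction, so a restarted job resumes after the last record covered by a checkpoint.

```python
import json
import os
import tempfile


def save_checkpoint(path, state):
    """Write the checkpoint to a temp file, then rename it into place."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    with os.fdopen(fd, "w") as tmp:
        json.dump(state, tmp)
        tmp.flush()
        os.fsync(tmp.fileno())
    # os.replace swaps the file in one step, so a crash never leaves
    # a half-written checkpoint under the final name.
    os.replace(tmp_path, path)


def load_checkpoint(path):
    """Return the saved state, or a fresh state if no checkpoint exists."""
    if os.path.exists(path):
        with open(path) as f:
            return json.load(f)
    return {"next_index": 0, "total": 0}


def run_batch(transactions, ckpt_path, every=3, fail_at=None):
    """Sum the transactions, checkpointing after every `every` records.

    `fail_at` (hypothetical) simulates a crash just before that record.
    """
    state = load_checkpoint(ckpt_path)
    for i in range(state["next_index"], len(transactions)):
        if fail_at is not None and i == fail_at:
            raise RuntimeError(f"simulated failure at record {i}")
        state["total"] += transactions[i]
        state["next_index"] = i + 1
        if state["next_index"] % every == 0:
            save_checkpoint(ckpt_path, state)
    save_checkpoint(ckpt_path, state)
    return state["total"]
```

Running `run_batch` a second time with the same checkpoint path after a failure picks up at the saved `next_index`; records processed after the last checkpoint are simply reprocessed.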
parallel file system) and then continue with the execution. In case of failure, when the application restarts, it does not need to start from scratch. Rather, it reads the latest state ("the checkpoint") from stable storage and resumes execution from there. While there is ongoing debate on whether checkpointing is the dominant I/O workload on distributed computing systems, there is general consensus that checkpointing is one of the major I/O workloads.
made to application code. BLCR focuses on checkpointing parallel applications that communicate through MPI, and on compatibility with the software suite produced by the SciDAC Scalable Systems Software ISIC. Its work is broken down into four main areas: Checkpoint/Restart for Linux (CR), Checkpointable MPI Libraries, Resource Management Interface to Checkpoint/Restart, and Development of Process Management Interfaces.
or exit the application and, at a later time, restart the application and restore the saved state. This was implemented through a "save" command or menu option in the application. In many cases it became standard practice, when a user with unsaved work exited the application, to ask whether they wanted to save that work first.
The Future Technologies Group at Lawrence Berkeley National Laboratory is developing a hybrid kernel/user implementation of checkpoint/restart called BLCR. Their goal is to provide a robust, production-quality implementation that checkpoints a wide range of applications, without requiring changes to be
As batch applications began to handle tens to hundreds of thousands of transactions, where each transaction might process one record from one file against several different files, the need for the application to be restartable at some point without the need to rerun the entire job from scratch became
FTI is a library that aims to provide computational scientists with an easy way to perform checkpoint/restart in a scalable fashion. FTI leverages local storage plus multiple replication and erasure techniques to provide several levels of reliability and performance. FTI provides application-level
Checkpointing tends to be expensive, so it was generally not done with every record, but at some reasonable compromise between the cost of a checkpoint vs. the value of the computer time needed to reprocess a batch of records. Thus the number of records processed for each checkpoint might range from
This sort of functionality became extremely important for usability in applications where the particular work could not be completed in one sitting (such as playing a video game expected to take dozens of hours, or writing a book or long document amounting to hundreds or thousands of pages) or where
One of the original and now most common means of application checkpointing was a "save state" feature in interactive applications, in which the user of the application could save the state of all variables and other data to a storage medium at the time they were using it and either continue working,
two-phase commit protocol algorithm. In uncoordinated checkpointing, each process checkpoints its own state independently. It must be stressed that simply forcing processes to checkpoint their state at fixed time intervals is not sufficient to ensure global consistency. The need to establish a consistent state (i.e., no missing or duplicated messages) may force other processes to roll back to their checkpoints, which in turn may cause other processes to roll back to even earlier checkpoints; in the most extreme case, the only consistent state found may be the initial state (the so-called domino effect).
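A toy simulation can make the coordinated alternative concrete. The sketch below is a minimal in-process model of a two-phase (prepare, then commit) checkpoint, assuming each process's state is a single counter. The `Process` class, its `healthy` flag, and `coordinated_checkpoint` are hypothetical names introduced for illustration, and the model deliberately ignores messages in transit, which real protocols must also handle.

```python
class Process:
    """A toy process whose entire state is one counter."""

    def __init__(self, name):
        self.name = name
        self.counter = 0
        self.healthy = True     # hypothetical flag: can this process checkpoint?
        self.tentative = None   # checkpoint taken in phase 1, not yet agreed
        self.committed = 0      # last agreed checkpoint (initial state at start)

    def prepare(self):
        """Phase 1: take a tentative checkpoint and vote yes or no."""
        if not self.healthy:
            return False
        self.tentative = self.counter
        return True

    def commit(self):
        """Phase 2 (all voted yes): promote the tentative checkpoint."""
        self.committed = self.tentative
        self.tentative = None

    def abort(self):
        """Phase 2 (someone voted no): discard the tentative checkpoint."""
        self.tentative = None

    def rollback(self):
        """Restore the state from the last committed checkpoint."""
        self.counter = self.committed


def coordinated_checkpoint(processes):
    """Commit a new checkpoint everywhere, or nowhere."""
    votes = [p.prepare() for p in processes]
    if all(votes):
        for p in processes:
            p.commit()
        return True
    for p in processes:
        p.abort()
    return False
```

Because a checkpoint is either committed by every process or by none, a rollback always returns all processes to the same round, so no process has to fall back to progressively earlier checkpoints.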
In a distributed computing environment, checkpointing is a technique that helps tolerate failures that would otherwise force a long-running application to restart from the beginning. The most basic way to implement checkpointing is to stop the application, copy all the required data from memory to reliable storage (e.g.,
Bautista-Gomez, L., Tsuboi, S., Komatitsch, D., Cappello, F., Maruyama, N., & Matsuoka, S. (2011, November). FTI: high performance fault tolerance interface for hybrid systems. In Proceedings of 2011 International Conference for High Performance Computing, Networking, Storage and Analysis (p. 32). ACM.
Some recent protocols perform collaborative checkpointing by storing fragments of the checkpoint in nearby nodes. This is helpful because it avoids the cost of storing to a parallel file system (which often becomes a bottleneck for large-scale systems) and it uses storage that is closer. This has
DMTCP (Distributed MultiThreaded Checkpointing) is a tool for transparently checkpointing the state of an arbitrary group of programs spread across many machines and connected by sockets. It does not modify the user's program or the operating system. Among the applications supported by DMTCP are Open MPI, Python, Perl, and many programming languages
There are two main approaches to checkpointing in distributed computing systems: coordinated checkpointing and uncoordinated checkpointing. In the coordinated checkpointing approach, processes must ensure that their checkpoints are consistent. This is usually achieved by some kind of
Mirhoseini, A.; Songhori, E.M.; Koushanfar, F., "Idetic: A high-level synthesis approach for enabling long computations on transiently-powered ASICs," Pervasive Computing and Communications (PerCom), 2013 IEEE International Conference on , vol., no., pp.216,224, 18–22 March 2013 URL:
R.E. Ahmed, R.C. Frazier, and P.N. Marinos, "Cache-Aided Rollback Error Recovery (CARER) Algorithms for Shared-Memory Multiprocessor Systems", IEEE 20th International Symposium on Fault-Tolerant Computing (FTCS-20), Newcastle upon Tyne, UK, June 26–28, 1990, pp. 82–88.
from ambient background sources. Mementos frequently senses the available energy in the system and decides whether to checkpoint the program because of impending power loss or to continue computation. If it checkpoints, the data is stored in a non-volatile memory.
Bouteiller, B., Lemarinier, P., Krawezik, K., & Cappello, F. (2003, December). Coordinated checkpoint versus message log for fault tolerant MPI. In Cluster Computing, 2003. Proceedings. 2003 IEEE International Conference on (pp. 242-250). IEEE.
The problem with save state is that it requires the operator of a program to request the save. For non-interactive programs, including automated or batch-processed workloads, checkpointing also had to be automated.
Ansel, J., Arya, K., & Cooperman, G. (2009, May). DMTCP: Transparent checkpointing for cluster computations and the desktop. In Parallel & Distributed Processing, 2009. IPDPS 2009. IEEE International Symposium on (pp. 1-12). IEEE.
and shell scripting languages. With the use of TightVNC, it can also checkpoint and restart X Window applications, as long as they do not use extensions (e.g. no OpenGL or video). Among the Linux features supported by DMTCP are open
found use particularly in large-scale supercomputing clusters. The challenge is to ensure that, when the checkpoint is needed to recover from a failure, the nearby nodes holding fragments of the checkpoint are available.
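One way to picture fragment storage on nearby nodes, and the availability challenge that comes with it, is a simple XOR-parity scheme. The sketch below is an assumption made for illustration rather than the method of any particular protocol: neighbours are modelled as plain dictionaries, and `split`, `scatter`, and `gather` are hypothetical helper names. The checkpoint is cut into fragments, one extra parity fragment is added, and the checkpoint can still be rebuilt when any single neighbour is unavailable.

```python
from functools import reduce


def split(data: bytes, k: int):
    """Split data into k equal-sized fragments, zero-padding the tail."""
    size = -(-len(data) // k)  # ceiling division
    padded = data.ljust(size * k, b"\0")
    return [padded[i * size:(i + 1) * size] for i in range(k)]


def xor_bytes(a: bytes, b: bytes) -> bytes:
    """Byte-wise XOR of two equal-length byte strings."""
    return bytes(x ^ y for x, y in zip(a, b))


def scatter(data: bytes, neighbours: list) -> int:
    """Store k data fragments plus one parity fragment, one per neighbour."""
    k = len(neighbours) - 1
    fragments = split(data, k)
    parity = reduce(xor_bytes, fragments)
    for node, frag in zip(neighbours, fragments + [parity]):
        node["fragment"] = frag
    return len(data)  # original length, needed to strip the padding later


def gather(neighbours: list, length: int) -> bytes:
    """Rebuild the checkpoint; tolerates at most one unavailable neighbour."""
    frags = [n.get("fragment") if n.get("alive", True) else None
             for n in neighbours]
    missing = [i for i, f in enumerate(frags) if f is None]
    if len(missing) > 1:
        raise RuntimeError("too many fragment holders unavailable")
    if missing:
        # XOR of all surviving fragments reproduces the missing one.
        present = [f for f in frags if f is not None]
        frags[missing[0]] = reduce(xor_bytes, present)
    return b"".join(frags[:-1])[:length]
```

With four neighbours, for example, `scatter` places three data fragments and one parity fragment; `gather` then succeeds with any three of the four reachable and raises an error if two are down, which is exactly the availability risk noted above.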
Mementos is a software system that transforms general-purpose tasks into interruptible programs for platforms with frequent interruptions such as power outages. It was designed for batteryless embedded devices such as RFID tags and smart cards, which rely on harvesting energy
Benjamin Ransford, Jacob Sorber, and Kevin Fu. 2011. Mementos: system support for long-running computation on RFID-scale devices. ACM SIGPLAN Notices 47, 4 (March 2011), 159-170. DOI=10.1145/2248487.1950386
file descriptors, pipes, sockets, signal handlers, process id and thread id virtualization (ensure old pids and tids continue to work upon restart), ptys, fifos, process group ids, session ids, terminal attributes, and
Hargrove, P. H., & Duell, J. C. (2006, September). Berkeley lab checkpoint/restart (blcr) for linux clusters. In Journal of Physics: Conference Series (Vol. 46, No. 1, p. 494). IOP Publishing.
Wang, Teng; Snyder, Shane; Lockwood, Glenn; Carns, Philip; Wright, Nicholas; Byna, Suren (Sep 2018). "IOMiner: Large-Scale Analytics Framework for Gaining Knowledge from I/O Logs".
Elnozahy, E. N., Alvisi, L., Wang, Y. M., & Johnson, D. B. (2002). A survey of rollback-recovery protocols in message-passing systems. ACM Computing Surveys, 34(3), 375-408.
to a non-volatile memory, the optimum points are those that require the minimum number of registers to be stored. Idetic is deployed and evaluated on an energy-harvesting RFID tag device.
25 to 200, depending on cost factors, the relative complexity of the application and the resources needed to successfully restart the application.
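The compromise between checkpoint cost and reprocessing cost can be made concrete with a simple first-order model. The model and all names below are assumptions introduced for illustration and do not come from the text above: each checkpoint costs `ckpt_cost`, each record costs `record_cost` to process, a failure occurs with probability `fail_prob` per record, and a failure forces on average half an interval of records to be reprocessed. Under those assumptions the expected overhead per record for an interval of `n` records is `ckpt_cost / n + fail_prob * record_cost * n / 2`, which is minimised at `n = sqrt(2 * ckpt_cost / (fail_prob * record_cost))`.

```python
import math


def overhead_per_record(n, ckpt_cost, record_cost, fail_prob):
    """Expected extra cost per record when checkpointing every n records.

    First term: checkpoint cost spread over the n records it covers.
    Second term: expected rework, assuming a failure loses n/2 records.
    """
    return ckpt_cost / n + fail_prob * record_cost * n / 2


def best_interval(ckpt_cost, record_cost, fail_prob):
    """Interval that minimises overhead_per_record under this model."""
    return math.sqrt(2 * ckpt_cost / (fail_prob * record_cost))


if __name__ == "__main__":
    # Hypothetical numbers: a checkpoint costs as much as 5 records,
    # and one record in a thousand triggers a failure.
    n_star = best_interval(ckpt_cost=5.0, record_cost=1.0, fail_prob=0.001)
    print(round(n_star))
    for n in (25, 100, 400):
        print(n, overhead_per_record(n, 5.0, 1.0, 0.001))
```

Checkpointing much more often than the optimum wastes time writing checkpoints, while checkpointing much less often wastes time redoing lost work; both directions raise the overhead.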
Yibei Ling, Jie Mi, Xiaola Lin: A Variational Calculus Approach to Optimal Checkpoint Placement. IEEE Trans. Computers 50(7): 699-708 (2001)
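When the energy becomes sufficient for reboot, the data is retrieved from non-volatile memory and the program continues from the stored state. Mementos has been implemented on the MSP430 family of microcontrollers. Mementos is named after Christopher Nolan's Memento.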
Plank, J. S., Beck, M., Kingsley, G., & Li, K. (1994). Libckpt: Transparent checkpointing under unix. Computer Science Department.
mmap/mprotect (including mmap-based shared memory). DMTCP supports the OFED API for InfiniBand on an experimental basis.
the work was being done over a long period of time, such as entering rows of data into a spreadsheet.
This is particularly important for long-running applications that are executed in failure-prone computing systems.
Walters, J. P.; Chaudhary, V. (2009-07-01). "Replication-Based Fault Tolerance for MPI Applications".
of the design. Since checkpointing at the hardware level involves sending the data of dependent registers
1735: 1411: 1369: 1264: 898: 758: 649: 628: 595: 1745: 1544: 1479: 1326: 1142: 1137: 1132: 1101: 890: 750: 645: 1609: 1549: 1484: 1331: 1321: 1254: 1086: 1076: 571: 242: 1244: 953:
https://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=6526735&isnumber=6526701
"Comparative I/O workload characterization of two leadership class storage clusters" (PDF). ACM. Nov 2015.
Idetic is a set of automatic tools that help application-specific integrated circuit (ASIC) developers automatically embed checkpoints in their designs. It targets high-level synthesis tools and adds the checkpoints at the register-transfer level (Verilog code). It uses a dynamic programming approach to locate low-overhead points in the state machine
Docker and the underlying technology contain a checkpoint and restore mechanism.
"GitHub - DMTCP/DMTCP: DMTCP: Distributed MultiThreaded CheckPointing". GitHub. 2019-07-11.
2018 IEEE International Conference on Cluster Computing (CLUSTER). IEEE. pp. 466–476. doi:10.1109/CLUSTER.2018.00062. ISBN 978-1-5386-8319-4.
A technique for inserting fault tolerance into computing systems
IEEE Transactions on Parallel and Distributed Systems. 20 (7): 997–1010. doi:10.1109/TPDS.2008.172. ISSN 1045-9219.
"Docker - CRIU".
Checkpointing is a technique that provides fault tolerance for computing systems. It consists of saving a snapshot of the application's state, so that applications can restart from that point in case of failure.
CRIU is a user space checkpoint library.
See also: Process image; Save states, a similar concept provided by video game console emulators.
External links: LibCkpt; FTI; Berkeley Lab Checkpoint/Restart (BLCR); Distributed MultiThreaded CheckPointing (DMTCP); OpenVZ; CRIU; Cryopid2


Text is available under the Creative Commons Attribution-ShareAlike License. Additional terms may apply.