
Cannon's algorithm

In computer science, Cannon's algorithm is a distributed algorithm for matrix multiplication for two-dimensional meshes, first described in 1969 by Lynn Elliot Cannon.

It is especially suitable for computers laid out in an N × N mesh. While Cannon's algorithm works well in homogeneous 2D grids, extending it to heterogeneous 2D grids has been shown to be difficult.

The main advantage of the algorithm is that its storage requirements remain constant and are independent of the number of processors.

The Scalable Universal Matrix Multiplication Algorithm (SUMMA) is a more practical algorithm that requires less workspace and overcomes the need for a square 2D grid. It is used by the ScaLAPACK, PLAPACK, and Elemental libraries.

Algorithm overview

When multiplying two n × n matrices A and B, we need p = n × n processing nodes arranged in a 2D grid.

// PE(i, j)
k := (i + j) mod N;
a := a[i][k];
b := b[k][j];
c := 0;
for (l := 0; l < N; l++) {
    c := c + a * b;
    concurrently {
        send a to PE(i, (j + N − 1) mod N);
        send b to PE((i + N − 1) mod N, j);
    } with {
        receive a' from PE(i, (j + 1) mod N);
        receive b' from PE((i + 1) mod N, j);
    }
    a := a';
    b := b';
}

We need to select k in every iteration for every Processor Element (PE) so that processors don't access the same data for computing a_{ik} * b_{kj}.

Therefore processors in the same row / column must begin the summation with different indexes. If, for example, PE(0,0) calculates a_{00} * b_{00} in the first step, then PE(0,1) chooses a_{01} * b_{11} first. The selection of k := (i + j) mod n for PE(i,j) satisfies this constraint for the first step.

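To make this concrete, a short Python sketch (illustrative only) prints the starting index k = (i + j) mod n for every PE in a 3 × 3 grid; every row and every column contains each index exactly once:

n = 3
for i in range(n):
    print([(i + j) % n for j in range(n)])
# prints [0, 1, 2] / [1, 2, 0] / [2, 0, 1]: no two PEs in the
# same row or column start with the same k.
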
In the first step we distribute the input matrices between the processors based on the previous rule.

In the next iterations we choose a new k' := (k + 1) mod n for every processor. This way every processor will continue accessing different values of the matrices, and the needed data is then always at the neighbouring processors. A PE(i,j) then needs the a from PE(i, (j + 1) mod n) and the b from PE((i + 1) mod n, j) for the next step. This means that a has to be passed cyclically to the left and b cyclically upwards. The results of the multiplications are summed up as usual. After n steps, each processor has calculated every product a_{ik} * b_{kj} exactly once, and their sum is thus the sought c_{ij}.

After the initial distribution, each processor only has to store the data for its next step: the intermediate result of the previous sum, an a_{ik} and a b_{kj}. This means that all three matrices only need to be stored in memory once, evenly distributed across the processors.

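The whole procedure can be checked on a single machine by simulating the grid with NumPy, one matrix element per PE: the initial distribution becomes a row/column rotation and the two cyclic passes become np.roll calls. This is a minimal sketch assuming NumPy; cannon_multiply is an illustrative name, not a library API.

import numpy as np

def cannon_multiply(A, B):
    """Simulate Cannon's algorithm, one matrix element per PE."""
    n = A.shape[0]
    # Initial distribution: PE(i,j) holds a[i][(i+j) mod n] and
    # b[(i+j) mod n][j], i.e. row i of A rotated left by i and
    # column j of B rotated up by j.
    a = np.array([np.roll(A[i, :], -i) for i in range(n)])
    b = np.array([np.roll(B[:, j], -j) for j in range(n)]).T
    c = np.zeros_like(A)
    for _ in range(n):
        c += a * b                  # every PE multiplies its current pair
        a = np.roll(a, -1, axis=1)  # pass a cyclically to the left
        b = np.roll(b, -1, axis=0)  # pass b cyclically upwards
    return c

rng = np.random.default_rng(0)
A = rng.integers(0, 10, (4, 4))
B = rng.integers(0, 10, (4, 4))
assert np.array_equal(cannon_multiply(A, B), A @ B)

The final assert checks the simulated result against NumPy's own product A @ B.
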
Generalisation

In practice we have far fewer processors than matrix elements. We can replace the matrix elements with submatrices, so that every processor processes more values. The scalar multiplication and addition then become sequential matrix multiplication and addition. The width and height of the submatrices will be N = n / √p.

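The blocked variant can be simulated the same way; only the per-PE data (an s × s submatrix) and the per-step operation (a sequential matrix product) change. Again a hedged sketch, assuming NumPy and a square g × g grid that divides n evenly; all names are illustrative.

import numpy as np

def cannon_multiply_blocked(A, B, g):
    """Cannon's algorithm on a g x g grid, one submatrix per PE."""
    n = A.shape[0]
    s = n // g  # width and height of each submatrix
    a = [[A[i*s:(i+1)*s, j*s:(j+1)*s] for j in range(g)] for i in range(g)]
    b = [[B[i*s:(i+1)*s, j*s:(j+1)*s] for j in range(g)] for i in range(g)]
    # Initial alignment: block-row i of A rotated left by i,
    # block-column j of B rotated up by j.
    a = [row[i:] + row[:i] for i, row in enumerate(a)]
    b = [[b[(i + j) % g][j] for j in range(g)] for i in range(g)]
    c = [[np.zeros((s, s)) for _ in range(g)] for _ in range(g)]
    for _ in range(g):
        for i in range(g):
            for j in range(g):
                c[i][j] = c[i][j] + a[i][j] @ b[i][j]  # sequential block multiply-add
        a = [row[1:] + row[:1] for row in a]  # shift blocks one step left
        b = b[1:] + b[:1]                     # shift blocks one step up
    return np.block(c)

rng = np.random.default_rng(1)
A = rng.standard_normal((6, 6))
B = rng.standard_normal((6, 6))
assert np.allclose(cannon_multiply_blocked(A, B, g=3), A @ B)
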
The runtime of the algorithm is

T(n, p) = T_coll(n/N, p) + N * T_seq(n/N) + 2(N - 1)(T_start + T_byte * (n/N)^2),

where T_coll is the time of the initial distribution of the matrices in the first step, T_seq is the time for the calculation of the intermediate results, and T_start and T_byte stand for the time it takes to establish a connection and to transmit one byte, respectively.

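Plugging numbers into this formula shows how the 2(N - 1) connection setups weigh in. The sketch below reads N as the number of shift steps, so each message carries one (n/N) × (n/N) block; every constant is a made-up assumption for illustration, not a measurement.

# T(n, p) = T_coll(n/N, p) + N*T_seq(n/N) + 2(N-1)(T_start + T_byte*(n/N)^2)
def cannon_runtime(n, N, p, t_coll, t_seq, t_start, t_byte):
    m = n / N  # per-PE block width; each message carries m*m values
    return t_coll(m, p) + N * t_seq(m) + 2 * (N - 1) * (t_start + t_byte * m**2)

# Illustrative (assumed) costs: cubic block multiply, volume-proportional
# initial distribution, and a comparatively large per-connection setup time.
t_seq = lambda m: 1e-9 * m**3
t_coll = lambda m, p: 1e-8 * m**2 * p
for N in (2, 4, 8, 16):
    print(N, cannon_runtime(1024, N, N * N, t_coll, t_seq,
                            t_start=1e-4, t_byte=1e-9))
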
A disadvantage of the algorithm is that there are many connection setups, with small message sizes. It would be better to be able to transmit more data in each message.

See also

Systolic array

References

Lynn Elliot Cannon, A cellular computer to implement the Kalman Filter Algorithm, Technical report, Ph.D. Thesis, Montana State University, 14 July 1969.
Gupta, H.; Sadayappan, P.: Communication Efficient Matrix-Multiplication on Hypercubes, dbpubs.stanford.edu
4.2 Matrix Multiplication on a Distributed Memory Machine, www.phy.ornl.gov
Jean-François Pineau, Communication-aware scheduling on heterogeneous master-worker platforms, PhD thesis, October 2010.
Robert A. van de Geijn and Jerrell Watts, SUMMA: scalable universal matrix multiplication algorithm, Concurrency: Practice and Experience, Volume 9, Issue 4, pages 255–274, April 1997.

External links

Lecture at Berkeley
mu.oz.au
