Cannon's algorithm

In computer science, Cannon's algorithm is a distributed algorithm for matrix multiplication for two-dimensional meshes first described in 1969 by Lynn Elliot Cannon.

It is especially suitable for computers laid out in an N × N mesh. While Cannon's algorithm works well in homogeneous 2D grids, extending it to heterogeneous 2D grids has been shown to be difficult.

The main advantage of the algorithm is that its storage requirements remain constant and are independent of the number of processors.

The Scalable Universal Matrix Multiplication Algorithm (SUMMA) is a more practical algorithm that requires less workspace and overcomes the need for a square 2D grid. It is used by the ScaLAPACK, PLAPACK, and Elemental libraries.
Algorithm overview

When multiplying two n × n matrices A and B, we need n × n processing nodes p arranged in a 2D grid.

    // PE(i, j)
    k := (i + j) mod N;
    a := a[i][k];
    b := b[k][j];
    c := 0;
    for (l := 0; l < N; l++) {
        c := c + a * b;
        concurrently {
            send a to PE(i, (j + N − 1) mod N);
            send b to PE((i + N − 1) mod N, j);
        } with {
            receive a' from PE(i, (j + 1) mod N);
            receive b' from PE((i + 1) mod N, j);
        }
        a := a';
        b := b';
    }
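The per-PE pseudocode above can be simulated sequentially. The following sketch (the name cannon_multiply is chosen here for illustration; this is not a parallel implementation) replaces each send/receive pair with a synchronous cyclic shift of the whole grid of a- and b-values:

```python
# Sequential simulation of Cannon's algorithm with one scalar element
# per processing element PE(i, j). The concurrent send/receive pairs of
# the pseudocode become synchronous cyclic shifts of the full grids.
def cannon_multiply(A, B):
    n = len(A)
    # Initial alignment: PE(i, j) starts with a[i][k] and b[k][j], k = (i + j) mod n.
    a = [[A[i][(i + j) % n] for j in range(n)] for i in range(n)]
    b = [[B[(i + j) % n][j] for j in range(n)] for i in range(n)]
    c = [[0] * n for _ in range(n)]
    for _ in range(n):
        # Each PE adds its current product to its partial result.
        for i in range(n):
            for j in range(n):
                c[i][j] += a[i][j] * b[i][j]
        # a moves cyclically to the left, b moves cyclically upwards.
        a = [[a[i][(j + 1) % n] for j in range(n)] for i in range(n)]
        b = [[b[(i + 1) % n][j] for j in range(n)] for i in range(n)]
    return c

print(cannon_multiply([[1, 2], [3, 4]], [[5, 6], [7, 8]]))  # [[19, 22], [43, 50]]
```

After n iterations every position holds the finished dot product, so the result equals the ordinary matrix product A · B.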
We need to select k in every iteration for every Processor Element (PE) so that processors don't access the same data for computing a_ik * b_kj.

Therefore processors in the same row / column must begin the summation with different indexes. If, for example, PE(0,0) calculates a_00 * b_00 in the first step, PE(0,1) chooses a_01 * b_11 first. The selection of k := (i + j) mod n for PE(i,j) satisfies this constraint for the first step.

In the first step we distribute the input matrices between the processors based on the previous rule.

In the next iterations we choose a new k' := (k + 1) mod n for every processor. This way every processor will continue accessing different values of the matrices. The needed data is then always at the neighbour processors. A PE(i,j) then needs the a from PE(i,(j + 1) mod n) and the b from PE((i + 1) mod n,j) for the next step. This means that a has to be passed cyclically to the left and b cyclically upwards. The results of the multiplications are summed up as usual. After n steps, each processor has calculated all a_ik * b_kj once, and its sum is thus the searched c_ij.

After the initial distribution to each processor, only the data for the next step has to be stored: the intermediate result of the previous sum, an a_ik and a b_kj. This means that all three matrices only need to be stored in memory once, evenly distributed across the processors.
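The index-selection rule can be checked directly: with the skewed choice k = (i + j + step) mod n, the PEs of any row and of any column use pairwise distinct indices at every step, and each PE cycles through all n values of k, which is exactly what completes its sum for c_ij. A small self-check under this indexing:

```python
# Verify the index-selection rule k = (i + j + step) mod n for a sample grid.
n = 4
for step in range(n):
    for i in range(n):
        # Indices used across row i and across column i at this step.
        row_ks = [(i + j + step) % n for j in range(n)]
        col_ks = [(j + i + step) % n for j in range(n)]
        # All distinct: no two PEs in a row/column read the same element.
        assert sorted(row_ks) == list(range(n))
        assert sorted(col_ks) == list(range(n))
for i in range(n):
    for j in range(n):
        # Over n steps, PE(i, j) uses every k exactly once.
        assert sorted((i + j + step) % n for step in range(n)) == list(range(n))
print("index-selection rule verified for n =", n)
```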
Generalisation

In practice we have far fewer processors than matrix elements. We can replace the matrix elements with submatrices, so that every processor processes more values. The scalar multiplication and addition become sequential matrix multiplication and addition. The width and height of the submatrices will be N = n / √p.
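The blocked variant can be sketched with NumPy (a sequential simulation, assuming the grid size q divides the matrix size; cannon_blocked and blk are names chosen here for illustration):

```python
import numpy as np

# Blocked Cannon: the matrix is split into a q x q grid of submatrices;
# the scalar multiply-add of the basic algorithm becomes a sequential
# matrix multiply-add per PE. The PE grid is simulated sequentially.
def cannon_blocked(A, B, q):
    n = A.shape[0]
    s = n // q  # width and height of each submatrix

    def blk(M, i, j):
        return M[i * s:(i + 1) * s, j * s:(j + 1) * s]

    # Initial alignment with k = (i + j) mod q, exactly as in the scalar case.
    a = [[blk(A, i, (i + j) % q).copy() for j in range(q)] for i in range(q)]
    b = [[blk(B, (i + j) % q, j).copy() for j in range(q)] for i in range(q)]
    c = [[np.zeros((s, s)) for _ in range(q)] for _ in range(q)]
    for _ in range(q):
        for i in range(q):
            for j in range(q):
                c[i][j] += a[i][j] @ b[i][j]  # sequential matrix multiply-add
        a = [[a[i][(j + 1) % q] for j in range(q)] for i in range(q)]  # shift left
        b = [[b[(i + 1) % q][j] for j in range(q)] for i in range(q)]  # shift up
    return np.block(c)

rng = np.random.default_rng(0)
A, B = rng.random((6, 6)), rng.random((6, 6))
assert np.allclose(cannon_blocked(A, B, 3), A @ B)
print("blocked result matches A @ B")
```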
The runtime of the algorithm is T(n,p) = T_coll(n/N, p) + N · T_seq(n/N) + 2(N − 1)(T_start + T_byte · (n/N)²), where T_coll is the time of the initial distribution of the matrices in the first step, T_seq is the time to calculate the intermediate results, and T_start and T_byte stand for the time it takes to establish a connection and to transmit a byte, respectively.

A disadvantage of the algorithm is that there are many connection setups, with small message sizes. It would be better to be able to transmit more data in each message.
See also
Systolic array

References
Lynn Elliot Cannon, A cellular computer to implement the Kalman Filter Algorithm, Ph.D. thesis, Montana State University, 14 July 1969.
Gupta, H.; Sadayappan, P.: Communication Efficient Matrix-Multiplication on Hypercubes, dbpubs.stanford.edu
Robert A. van de Geijn and Jerrell Watts, SUMMA: scalable universal matrix multiplication algorithm, Concurrency: Practice and Experience, Volume 9, Issue 4, pages 255–274, April 1997.
Jean-François Pineau, Communication-aware scheduling on heterogeneous master-worker platforms, PhD thesis, October 2010.
www.phy.ornl.gov

External links
Lecture at Berkeley
mu.oz.au

Categories: Matrix multiplication algorithms | Distributed algorithms | Mesh networking