Knowledge (XXG)

GotoBLAS


Original author(s): Kazushige Goto
Final release: 2-1.13 / 5 February 2010
Type: Linear algebra library; implementation of BLAS
License: BSD License

In scientific computing, GotoBLAS and GotoBLAS2 are open source implementations of the BLAS (Basic Linear Algebra Subprograms) API with many hand-crafted optimizations for specific processor types. GotoBLAS was developed by Kazushige Goto at the Texas Advanced Computing Center. As of 2003, it was used in seven of the world's ten fastest supercomputers.

GotoBLAS remains available, but development ceased with a final version touting optimal performance on Intel's Nehalem architecture (contemporary in 2008). As of January 2022, the Texas Advanced Computing Center website states that GotoBLAS is no longer maintained and suggests using BLIS or the Intel Math Kernel Library (MKL) instead. OpenBLAS is an actively maintained fork of GotoBLAS, developed at the Lab of Parallel Software and Computational Science, ISCAS.

GotoBLAS was written by Goto during his sabbatical leave from the Japan Patent Office in 2002. It was initially optimized for the Pentium 4 processor and immediately raised the performance of a supercomputer based on that CPU from 1.5 TFLOPS to 2 TFLOPS. As of 2005, the library was available at no cost for noncommercial use. A later open source version was released under the terms of the BSD license.

GotoBLAS's matrix-matrix multiplication routine, called GEMM in BLAS terms, is highly tuned for the x86 and AMD64 processor architectures by means of handcrafted assembly code. Like other BLAS implementations, it decomposes GEMM into smaller "kernel" routines, but where earlier implementations streamed data from the L1 processor cache, GotoBLAS uses the L2 cache. The kernel used for GEMM is a routine called GEBP, for "General block-times-panel multiply", which Goto and van de Geijn ("Anatomy of High-Performance Matrix Multiplication", ACM Transactions on Mathematical Software, 2008) found experimentally to be "inherently superior" to the several other kernels considered in the design.

Several other BLAS routines are, as is customary in BLAS libraries, implemented in terms of GEMM.
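The loop structure behind a GEBP-based GEMM can be illustrated with a short pure-Python sketch. It is not GotoBLAS code: real GotoBLAS kernels are hand-written assembly that also tile for registers. The function names `gebp`, `gemm` and `syrk` and the block-size parameters `MC` and `KC` are hypothetical. The sketch assumes matrices stored as lists of rows. The k dimension is split into panels, and each panel of B is "packed" (copied) once. The rows of A are then split into mc x kc blocks, and each block is multiplied by the panel. In GotoBLAS that block is sized to stay resident in the L2 cache. A small `syrk` shows, in simplified form, how another routine can be layered on top of GEMM.

```python
"""Illustrative sketch of a GEBP-based GEMM (not GotoBLAS's actual code).

Assumptions: matrices are Python lists of row lists; MC and KC are hypothetical
block sizes standing in for values a real library tunes to its caches.
"""


def gebp(C, a_block, b_panel, i0):
    """General block-times-panel: C[i0:i0+mc, :] += a_block @ b_panel.

    a_block: packed mc x kc block of A (the part GotoBLAS keeps in L2 cache).
    b_panel: packed kc x n panel of B, consumed one column at a time.
    """
    kc = len(b_panel)
    n = len(b_panel[0])
    for j in range(n):
        b_col = [b_panel[p][j] for p in range(kc)]  # one column of the panel
        for i, a_row in enumerate(a_block):
            C[i0 + i][j] += sum(a_row[p] * b_col[p] for p in range(kc))


def gemm(A, B, MC=2, KC=3):
    """Return C = A @ B, split along k into panels and along m into blocks."""
    m, k, n = len(A), len(B), len(B[0])
    C = [[0.0] * n for _ in range(m)]
    for p0 in range(0, k, KC):  # rank-kc update: C += A[:, p0:p1] @ B[p0:p1, :]
        p1 = min(p0 + KC, k)
        b_panel = [list(B[p]) for p in range(p0, p1)]  # "pack" the kc x n panel
        for i0 in range(0, m, MC):
            i1 = min(i0 + MC, m)
            a_block = [A[i][p0:p1] for i in range(i0, i1)]  # "pack" an mc x kc block
            gebp(C, a_block, b_panel, i0)
    return C


def syrk(A):
    """Hypothetical helper: C = A @ A^T expressed as a single call to gemm."""
    a_t = [list(col) for col in zip(*A)]
    return gemm(A, a_t)
```

For example, `gemm([[1, 2, 3], [4, 5, 6], [7, 8, 9]], [[1, 0], [0, 1], [1, 1]])` yields `[[4.0, 5.0], [10.0, 11.0], [16.0, 17.0]]`. In this sketch, packing only copies data. In an optimized library, it rearranges the data into contiguous, kernel-friendly order, so that the inner loop reads memory sequentially.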

Text is available under the Creative Commons Attribution-ShareAlike License. Additional terms may apply.