Half-precision floating-point format

Not to be confused with bfloat16, a different 16-bit floating-point format.
Half precision can be useful for mesh quantization. Mesh data is usually stored using 32-bit single-precision floats for the vertices; however, in some situations it is acceptable to reduce the precision to 16-bit half-precision, requiring only half the storage at the expense of some precision. Mesh quantization can also be done with 8-bit or 16-bit fixed precision, depending on the requirements.

Hardware and software for machine learning or neural networks tend to use half precision: such applications usually do a large amount of calculation, but don't require a high level of precision. Due to hardware typically not supporting 16-bit half-precision floats, neural networks often use the bfloat16 format, which is the single-precision float format truncated to 16 bits.

Several earlier 16-bit floating-point formats have existed, including that of Hitachi's HD61810 DSP of 1982 (a 4-bit exponent and a 12-bit mantissa), Thomas J. Scott's WIF of 1991 (5 exponent bits, 10 mantissa bits), and the 3dfx Voodoo Graphics processor of 1995 (same as Hitachi).

Nvidia introduced native half-precision floating-point support (FP16) into their Pascal GPUs, mainly motivated by the possibility that this would speed up data-intensive and error-tolerant applications on GPUs.

Depending on the computer, half-precision can be over an order of magnitude faster than double precision, e.g. 550 PFLOPS for half-precision vs 37 PFLOPS for double precision on one cloud provider.
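The storage trade-off described for mesh data can be sketched with Python's struct module, which supports the IEEE binary16 layout through the "e" format character. This is only an illustrative sketch (real mesh pipelines quantize on the GPU side); the helper name is ours, not from any mesh library:

```python
import struct

def quantize_vertices(verts):
    """Pack the same vertex data as binary32 and as binary16.

    The half-precision buffer is half the size, at the cost of
    rounding each coordinate to ~11 significant bits.
    """
    f32 = struct.pack(f"<{len(verts)}f", *verts)
    f16 = struct.pack(f"<{len(verts)}e", *verts)
    return f32, f16

f32, f16 = quantize_vertices([0.125, -0.5, 0.3333, 1.0])
# Decode the half-precision buffer again to inspect the rounding error.
roundtrip = struct.unpack("<4e", f16)
```

Values like 0.125, −0.5 and 1.0 survive exactly (they need few significand bits); 0.3333 is rounded to the nearest multiple of 2^−12 in its binade.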
The minimum strictly positive (subnormal) value is 2^−24 ≈ 5.96 × 10^−8. The minimum positive normal value is 2^−14 ≈ 6.10 × 10^−5. The maximum representable value is (2 − 2^−10) × 2^15 = 65504.

ILM was searching for an image format that could handle a wide dynamic range, but without the hard drive and memory cost of single- or double-precision floating point. The hardware-accelerated programmable shading group led by John Airey at SGI (Silicon Graphics) used the s10e5 data type in 1997 as part of the 'bali' design effort. This is described in a SIGGRAPH 2000 paper (see section 4.3) and further documented in US patent 7518615. It was popularized by its use in the open-source OpenEXR image format.

Thus, as defined by the offset-binary representation, in order to get the true exponent, the offset of 15 has to be subtracted from the stored exponent. The stored exponents 00000 and 11111 are interpreted specially.
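These decoding rules (implicit leading bit, bias of 15, special treatment of the stored exponents 00000 and 11111) can be written out directly. The following is an illustrative sketch, not a production routine:

```python
def half_bits_to_float(h: int) -> float:
    """Decode a 16-bit IEEE binary16 pattern: 1 sign, 5 exponent, 10 significand bits."""
    sign = -1.0 if h & 0x8000 else 1.0
    exp = (h >> 10) & 0x1F            # stored (biased) exponent
    frac = h & 0x3FF                  # 10 explicitly stored significand bits
    if exp == 0:                      # zero and subnormals: no implicit leading 1
        return sign * frac * 2.0**-24
    if exp == 31:                     # infinities and NaNs
        return sign * float("inf") if frac == 0 else float("nan")
    # Normal numbers: subtract the offset 15 to get the true exponent.
    return sign * (1.0 + frac / 1024.0) * 2.0 ** (exp - 15)
```

For example, 0x3C00 decodes to 1.0 (stored exponent 15, i.e. true exponent 0), and 0x7BFF decodes to the maximum value 65504.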
Nvidia and Microsoft defined the half datatype in the Cg language, released in early 2002, and implemented it in silicon in the GeForce FX, released in late 2002. However, hardware support for accelerated 16-bit floating point was later dropped by Nvidia before being reintroduced in the Tegra X1 mobile GPU in 2015.

Support for half precision in the x86 instruction set is specified in the F16C instruction set extension, first introduced in 2009 by AMD and fairly broadly adopted by AMD and Intel CPUs by 2012. The F16C extension in 2012 allows x86 processors to convert half-precision floats to and from single-precision floats with a machine instruction. This was further extended by the AVX-512_FP16 instruction set extension implemented in the Intel Sapphire Rapids processor.

On Power ISA, VSX and the not-yet-approved SVP64 extension provide hardware support for 16-bit half-precision floats as of PowerISA v3.1B and later.

The format is assumed to have an implicit lead bit with value 1 unless the exponent field is stored with all zeros. Thus, only 10 bits of the significand appear in the memory format, but the total precision is 11 bits. In IEEE 754 parlance, there are 10 bits of significand, but there are 11 bits of significand precision (log10(2^11) ≈ 3.311 decimal digits, or 4 digits ± slightly less than 5 units in the last place).

These examples are given in bit representation of the floating-point value. This includes the sign bit, (biased) exponent, and significand.
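The same half↔single narrowing and widening that F16C performs in hardware can be sketched in software with CPython's struct module (the "e" format character is binary16). The helper names are ours, for illustration only:

```python
import struct

def float_to_half_bytes(x: float) -> bytes:
    """Narrow a float to its IEEE binary16 storage (little-endian)."""
    return struct.pack("<e", x)

def half_bytes_to_float(b: bytes) -> float:
    """Widen binary16 storage back to a Python float."""
    return struct.unpack("<e", b)[0]

one = float_to_half_bytes(1.0)   # 0x3C00, stored little-endian as b"\x00\x3c"
```

A round trip through binary16 is exact for values the format can represent, such as 1.0, 0.5, or the maximum 65504.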
If the hardware has instructions to compute half-precision math, it is often faster than single or double precision. If the system has SIMD instructions that can handle multiple floating-point numbers within one instruction, half precision can be twice as fast by operating on twice as many numbers simultaneously.

Half precision is intended for storage of floating-point values in applications where higher precision is not essential, in particular image processing and neural networks. Almost all modern uses follow the IEEE 754-2008 standard, where the 16-bit base-2 format is referred to as binary16, and the exponent uses 5 bits. This can express values in the range ±65,504, with the minimum value above 1 being 1 + 1/1024.
OpenCL also supports half-precision floating-point numbers with the half datatype, using the IEEE 754-2008 half-precision storage format.

65520 and larger numbers round to infinity. This is for round-to-even; other rounding strategies will change this cut-off.
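The cut-off can be reproduced with a small converter implementing round-to-nearest-even. This is an illustrative sketch under the stated rounding rule (hardware and libraries implement the same rule in optimized form):

```python
import math

def to_half_bits(x: float) -> int:
    """Convert a float to IEEE binary16 bits with round-to-nearest-even."""
    sign = 0x8000 if math.copysign(1.0, x) < 0 else 0
    x = abs(x)
    if math.isnan(x):
        return sign | 0x7E00                  # a quiet NaN
    if math.isinf(x):
        return sign | 0x7C00
    if x == 0.0:
        return sign
    m, e = math.frexp(x)                      # x = m * 2**e with 0.5 <= m < 1
    if e - 1 >= -14:                          # normal range
        q = round(m * 2048)                   # Python round() ties to even
        if q == 2048:                         # rounding carried into the next binade
            q, e = 1024, e + 1
        if e - 1 > 15:
            return sign | 0x7C00              # overflow rounds to infinity
        return sign | ((e - 1 + 15) << 10) | (q - 1024)
    q = round(x / 2.0**-24)                   # subnormal range, ulp = 2**-24
    return sign | ((1 << 10) if q == 1024 else q)
```

65519 still rounds down to 65504 (0x7BFF), while 65520 is a tie that rounds up past the largest finite value and becomes infinity (0x7C00); 1/3 rounds down to 0x3555.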
ARM processors support (via a floating-point control register bit) an "alternative half-precision" format, which does away with the special case for an exponent value of 31 (11111 in binary). It is almost identical to the IEEE format, but there is no encoding for infinity or NaNs; instead, an exponent of 31 encodes normalized numbers in the range 65536 to 131008.
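The difference is easy to see in a decoder that treats exponent 31 as just another binade (an illustrative sketch, not vendor code):

```python
def alt_half_to_float(h: int) -> float:
    """Decode ARM's alternative half-precision format: there are no
    infinities or NaNs; exponent 31 encodes ordinary normalized numbers."""
    sign = -1.0 if h & 0x8000 else 1.0
    exp = (h >> 10) & 0x1F
    frac = h & 0x3FF
    if exp == 0:                              # subnormals, as in IEEE binary16
        return sign * frac * 2.0**-24
    return sign * (1.0 + frac / 1024.0) * 2.0 ** (exp - 15)

# 0x7C00 is +infinity in IEEE binary16, but 65536 here;
# 0x7FFF is a NaN in IEEE binary16, but the maximum 131008 here.
```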
The half-precision binary floating-point exponent is encoded using an offset-binary representation, with the zero offset being 15; this is also known as the exponent bias in the IEEE 754 standard.

By default, 1/3 rounds down like for double precision, because of the odd number of bits in the significand. The bits beyond the rounding point are 0101..., which is less than 1/2 of a unit in the last place.
Scott, Thomas J. (March 1991). "Mathematics and computer science at odds over real numbers". Proceedings of the twenty-second SIGCSE technical symposium on Computer science education – SIGCSE '91. Vol. 23. pp. 130–139. doi:10.1145/107004.107029.
External links

- C source code to convert between IEEE double, single, and half precision
- Java source code for half-precision floating-point conversion
- Half precision floating point for one of the extended GCC features
- Fast Half Float Conversions
- OpenGL treatment of half precision
- Half precision constants from D3DX
- Analog Devices variant (four-bit exponent)
- Khronos Vulkan signed 16-bit floating point format
- Minifloats (in Survey of Floating-Point Formats)
- OpenEXR site
Swift introduced half-precision floating-point numbers in Swift 5.3 with the Float16 type.

The exponent parameters are:

Emin = 00001₂ − 01111₂ = −14
Emax = 11110₂ − 01111₂ = 15
Exponent bias = 01111₂ = 15
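These bias parameters can be checked by splitting stored encodings into their fields, here using Python's struct support for binary16 as a convenience (the helper name is ours):

```python
import struct

def half_fields(x: float):
    """Return (sign, stored_exponent, significand_bits) of x's binary16 encoding."""
    (h,) = struct.unpack("<H", struct.pack("<e", x))
    return h >> 15, (h >> 10) & 0x1F, h & 0x3FF
```

1.0 has true exponent 0, so its stored exponent is 0 + 15 = 15; the smallest normal value 2^−14 stores exponent 1; the maximum 65504 stores exponent 30.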
Julia provides support for half-precision floating-point numbers with the Float16 type. Zig provides support for half precision with its f16 type. As of 2024, Rust is currently working on adding a new f16 type for IEEE half-precision 16-bit floats.

Half precision is used in several computer graphics environments to store pixels, including MATLAB, OpenEXR, JPEG XR, GIMP, OpenGL, Vulkan, Cg, Direct3D, and D3DX. The advantage over 8-bit or 16-bit integers is that the increased dynamic range allows for more detail to be preserved in highlights and shadows for images, and avoids gamma correction. The advantage over 32-bit single-precision floating point is that it requires half the storage and bandwidth (at the expense of precision and range).
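A quick way to see the dynamic-range advantage over an 8-bit integer channel (an illustrative sketch; a real imaging pipeline would apply tone mapping rather than clipping):

```python
import struct

def to_half(x: float) -> float:
    """Round a value through binary16 storage and back."""
    return struct.unpack("<e", struct.pack("<e", x))[0]

def to_uint8_channel(x: float) -> int:
    """Store a linear [0, 1] channel in 8 bits: anything brighter clips."""
    return min(255, max(0, round(x * 255)))

highlight = 1000.5   # an HDR highlight, far above display white
shadow = 1.0e-4      # a deep shadow detail
```

The half-precision channel keeps both the highlight (1000.5 is exactly representable) and a nonzero shadow value, while the 8-bit channel clips the highlight to 255 and crushes the shadow to 0.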
On RISC-V, the Zfh and Zfhmin extensions provide hardware support for 16-bit half precision floats. The Zfhmin extension is a minimal alternative to Zfh.
IEEE 754 half-precision binary floating-point format: binary16

The IEEE 754 standard specifies a binary16 as having the following format:

Sign bit: 1 bit
Exponent width: 5 bits
Significand precision: 11 bits (10 explicitly stored)
.NET 5 introduced half-precision floating-point numbers with the System.Half standard library type. As of January 2024, no .NET language (C#, F#, Visual Basic, C++/CLI and C++/CX) has literals (e.g. in C#, 1.0f has type System.Single and 1.0m has type System.Decimal) or a keyword for the type.
Peercy, Mark S.; Olano, Marc; Airey, John; Ungar, P. Jeffrey. "Interactive Multi-Pass Programmable Shading".
Power Management Bus § Linear11 Floating Point Format
See also

- bfloat16: alternative 16-bit floating-point format with 8 bits of exponent and 7 bits of mantissa
- Minifloat: small floating-point formats
- IEEE 754: IEEE standard for floating-point arithmetic
- ISO/IEC 10967, Language Independent Arithmetic
- Primitive data type
- RGBE image format

References

- "IEEE Standard for Floating-Point Arithmetic". IEEE STD 754-2019 (Revision of IEEE 754-2008). July 2019. pp. 1–84. doi:10.1109/ieeestd.2019.8766229. ISBN 978-1-5044-5924-2.
- Ho, Nhut-Minh; Wong, Weng-Fai (September 1, 2017). "Exploiting half precision arithmetic in Nvidia GPUs". Department of Computer Science, National University of Singapore.
- "hitachi :: dataBooks :: HD61810 Digital Signal Processor Users Manual". Archive.org.
- "Patent US7518615 - Display system having floating point rasterization and floating point ... - Google Patents".
- "Interactive Multi-Pass Programmable Shading". People.csail.mit.edu.
- "OpenEXR". OpenEXR. Archived from the original on 2013-05-08.
- "/home/usr/bk/glide/docs2.3.1/GLIDEPGM.DOC". Gamers.org.
- "vs_2_sw". Cg 3.1 Toolkit Documentation. Nvidia.
- "Half-precision floating-point number support". RealView Compilation Tools Compiler User Guide. 10 December 2010.
- "Half-precision floating-point number format". ARM Compiler armclang Reference Guide Version 6.7. ARM Developer.
- Garrard, Andrew. "10.1. 16-bit floating-point numbers". Khronos Data Format Specification v1.2 rev 1. Khronos.
- "KHR_mesh_quantization". GitHub. Khronos Group.
- "Floats". ziglang.org.
- Govindarajan, Prashanth (2020-08-31). "Introducing the Half type!". .NET Blog.
- "Half Struct (System)". learn.microsoft.com.
- "Floating-point numeric types ― C# reference". learn.microsoft.com. 2022-09-29.
- "Literals ― F# language reference". learn.microsoft.com. 2022-06-15.
- "Data Type Summary — Visual Basic language reference". learn.microsoft.com. 2021-09-15.
- "swift-evolution/proposals/0277-float16.md at main · apple/swift-evolution". GitHub.
- "Integers and Floating-Point Numbers · The Julia Language". docs.julialang.org.
- Cross, Travis. "Tracking Issue for f16 and f128 float types". GitHub.
- "cl_khr_fp16 extension". registry.khronos.org.
- Towner, Daniel. "Intel® Advanced Vector Extensions 512 - FP16 Instruction Set for Intel® Xeon® Processor Based Products". Intel® Builders Programs.
- "RISC-V Instruction Set Manual, Volume I: RISC-V User-Level ISA". Five EmbedDev.
- "OPF_PowerISA_v3.1B.pdf". OpenPOWER Foundation.
- "ls005.xlen.mdwn". libre-soc.org Git.
- "About ABCI - About ABCI | ABCI". abci.ai.


Text is available under the Creative Commons Attribution-ShareAlike License. Additional terms may apply.