Attention Is All You Need

"Attention Is All You Need" is a 2017 landmark research paper in machine learning authored by eight scientists working at Google. The paper introduced a new deep learning architecture known as the transformer, based on the attention mechanism proposed in 2014 by Bahdanau et al. It is considered a foundational paper in modern artificial intelligence, as the transformer approach has become the main architecture of large language models like those based on GPT. At the time, the focus of the research was on improving Seq2seq techniques for machine translation, but the authors go further in the paper, foreseeing the technique's potential for other tasks like question answering and what is now known as multimodal Generative AI.

[Image caption: an illustration of the main components of the transformer model from the paper.]

The paper's title is a reference to the song "All You Need Is Love" by the Beatles. The name "Transformer" was picked because Uszkoreit liked the sound of that word. An early design document was titled "Transformers: Iterative Self-Attention and Processing for Various Tasks", and included an illustration of six characters from the Transformers animated show. The team was named Team Transformer.

Some early examples that the team tried their Transformer architecture on included English-to-German translation, generating Knowledge articles on "The Transformer", and parsing. These convinced the team that the Transformer is a general-purpose language model, and not just good for translation.

As of 2024, the paper has been cited more than 100,000 times.

For their 100M-parameter Transformer model, the authors suggested that the learning rate should be linearly scaled up from 0 to its maximal value for the first part of training (i.e. the first 2% of the total number of training steps), and that dropout be used to stabilize training.
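The warmup schedule described above can be written as a small helper function. The Python sketch below is illustrative only: it implements exactly what the text states (a linear ramp from 0 to a peak rate over the first 2% of steps) and, as an assumption, simply holds the rate constant afterwards; it is not the paper's full learning-rate formula, which also decays the rate after warmup.

```python
# Minimal sketch of linear learning-rate warmup, assuming the rate is held
# constant after the warmup phase (the post-warmup behaviour is an assumption,
# not something specified in the passage above).

def warmup_lr(step: int, total_steps: int, max_lr: float, warmup_frac: float = 0.02) -> float:
    """Return the learning rate for a given training step."""
    warmup_steps = max(1, int(total_steps * warmup_frac))
    if step < warmup_steps:
        # Linear ramp from (almost) 0 up to max_lr over the warmup steps.
        return max_lr * (step + 1) / warmup_steps
    return max_lr  # assumption: hold at the maximal value after warmup


# Example: 100,000 total steps -> 2,000 warmup steps.
print(warmup_lr(0, 100_000, 1e-3))      # ~5e-07, near zero at the start
print(warmup_lr(1_999, 100_000, 1e-3))  # 0.001, the peak rate at the end of warmup
```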

Authors

The authors of the paper are Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan Gomez, Ɓukasz Kaiser, and Illia Polosukhin. All eight authors were "equal contributors" to the paper; the listed order was randomized. The Wired article highlights the group's diversity:

"Six of the eight authors were born outside the United States; the other two are children of two green-card-carrying Germans who were temporarily in California and a first-generation American whose family had fled persecution, respectively."

By 2023, all eight authors had left Google and founded their own AI start-ups (except Ɓukasz Kaiser, who joined OpenAI).

Historical context

Main articles: Transformer (deep learning architecture) § History and Seq2seq § History. See also: Timeline of machine learning.

Predecessors

For many years, sequence modelling and generation was done by using plain recurrent neural networks (RNNs). A well-cited early example was the Elman network (1990). In theory, the information from one token can propagate arbitrarily far down the sequence, but in practice the vanishing-gradient problem leaves the model's state at the end of a long sentence without precise, extractable information about preceding tokens.

A key breakthrough was LSTM (1995), an RNN which used various innovations to overcome the vanishing gradient problem, allowing efficient learning of long-sequence modelling. One key innovation was the use of an attention mechanism which used neurons that multiply the outputs of other neurons, so-called multiplicative units. Neural networks using multiplicative units were later called sigma-pi networks or higher-order networks. LSTM became the standard architecture for long-sequence modelling until the 2017 publication of Transformers. However, LSTM still used sequential processing, like most other RNNs. Specifically, RNNs operate one token at a time from first to last; they cannot operate in parallel over all tokens in a sequence.

Modern Transformers overcome this problem, but unlike RNNs, they require computation time that is quadratic in the size of the context window. (Some architectures, such as RWKV or state space models, avoid this issue.) The linearly scaling fast weight controller (1992) learns to compute a weight matrix for further processing depending on the input. One of its two networks has "fast weights" or "dynamic links" (1981). A slow neural network learns by gradient descent to generate keys and values for computing the weight changes of the fast neural network, which computes answers to queries. This was later shown to be equivalent to the unnormalized linear Transformer.
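The equivalence mentioned above can be seen in a few lines of linear algebra: the fast weight matrix accumulates outer products of values and keys, and a query is answered by multiplying with that matrix, which is the same computation as attention with the softmax removed. The NumPy sketch below is a schematic illustration, not code from any of the cited papers; the random projection matrices are hypothetical stand-ins for the learned "slow" network.

```python
import numpy as np

rng = np.random.default_rng(0)
d, n = 4, 6                      # feature dimension, sequence length
X = rng.normal(size=(n, d))      # input token representations

# Hypothetical stand-ins for the learned slow network: linear maps producing
# keys, values and queries from each input (random for illustration only).
Wk, Wv, Wq = (rng.normal(size=(d, d)) for _ in range(3))
K, V, Q = X @ Wk, X @ Wv, X @ Wq

# Fast-weight view: the fast weight matrix is the running sum of outer products
# value_t x key_t ("write"), and a query is answered by multiplying with it
# ("read"). The running sum is why this formulation scales linearly in length.
W_fast = np.zeros((d, d))
fast_out = []
for t in range(n):
    W_fast += np.outer(V[t], K[t])   # weight change generated from key/value
    fast_out.append(W_fast @ Q[t])   # answer the query with the fast weights
fast_out = np.array(fast_out)

# Linear-attention view: causal dot-product attention with the softmax removed.
attn_out = np.array([(Q[t] @ K[: t + 1].T) @ V[: t + 1] for t in range(n)])

print(np.allclose(fast_out, attn_out))  # True: the two formulations coincide
```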

Attention with seq2seq

Main article: Seq2seq § History.

The idea of encoder-decoder sequence transduction had been developed in the early 2010s. The papers most commonly cited as the originators that produced seq2seq are two concurrently published papers from 2014.

(Sutskever et al, 2014) was a 380M-parameter model for machine translation using two long short-term memory (LSTM) networks. The architecture consists of two parts. The encoder is an LSTM that takes in a sequence of tokens and turns it into a vector. The decoder is another LSTM that converts the vector into a sequence of tokens. Similarly, (Cho et al, 2014) was a 130M-parameter model that used gated recurrent units (GRU) instead of LSTM. Later research showed that GRUs are neither better nor worse than LSTMs for seq2seq.

These early seq2seq models had no attention mechanism, and the state vector is accessible only after the last word of the source text has been processed. Although in theory such a vector retains the information about the whole original sentence, in practice the information is poorly preserved, since the input is processed sequentially by one recurrent network into a fixed-size output vector, which is then processed by another recurrent network into an output. If the input is long, the output vector cannot contain all the relevant information, and the output quality degrades. As evidence, reversing the input sentence improved seq2seq translation.

(Bahdanau et al, 2014) introduced an attention mechanism to seq2seq for machine translation to solve the bottleneck problem (of the fixed-size output vector), allowing the model to process long-distance dependencies more easily. They called their model RNNsearch, as it "emulates searching through a source sentence during decoding a translation".
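The bottleneck fix can be made concrete: instead of a single fixed-size vector, the decoder sees a different weighted mixture of all encoder states at every output step, with the weights produced by a small scoring network. The NumPy sketch below shows additive attention of the kind introduced by (Bahdanau et al, 2014) in schematic form only; the matrices are random stand-ins for learned parameters, not the original model.

```python
import numpy as np

rng = np.random.default_rng(1)
n_src, d_h, d_a = 5, 8, 16            # source length, hidden size, attention size

H = rng.normal(size=(n_src, d_h))     # encoder hidden states, one per source word
s = rng.normal(size=(d_h,))           # current decoder state

# Hypothetical learned parameters of the scoring network (random stand-ins).
W_a = rng.normal(size=(d_a, d_h))
U_a = rng.normal(size=(d_a, d_h))
v_a = rng.normal(size=(d_a,))

# Additive attention: score every encoder state against the decoder state ...
scores = np.tanh(H @ U_a.T + s @ W_a.T) @ v_a        # shape (n_src,)
weights = np.exp(scores) / np.exp(scores).sum()       # softmax over source positions

# ... and form a context vector as a weighted sum of all encoder states, so no
# single fixed-size vector has to summarize the whole source sentence.
context = weights @ H                                  # shape (d_h,)
print(weights.round(3), context.shape)
```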

(Luong et al, 2015) compared the relative performance of global (that of (Bahdanau et al, 2014)) and local (sliding window) attention model architectures for machine translation, and found that a mixed attention architecture had higher quality than global attention, while the use of a local attention architecture reduced translation time.

In 2016, Google Translate was revamped to Google Neural Machine Translation, which replaced the previous model based on statistical machine translation. The new model was a seq2seq model where the encoder and the decoder were both 8 layers of bidirectional LSTM. It took nine months to develop, and it achieved a higher level of performance than the statistical approach, which had taken ten years to develop. In the same year, self-attention avant la lettre, originally called intra-attention or intra-sentence attention, was proposed for LSTMs.

Parallelizing attention

Main article: Attention (machine learning) § History.

Seq2seq models with attention (including self-attention) still suffered from the same issue as recurrent networks, which is that they are hard to parallelize, which prevented them from being accelerated on GPUs. In 2016, decomposable attention applied a self-attention mechanism to feedforward networks, which are easy to parallelize, and achieved SOTA results in textual entailment with an order of magnitude fewer parameters than LSTMs. One of its authors, Jakob Uszkoreit, suspected that attention without recurrence is sufficient for language translation, thus the title "attention is all you need". That hypothesis was against the conventional wisdom of the time, and even his father, a well-known computational linguist, was skeptical.

In 2017, the original (100M-sized) encoder-decoder transformer model was proposed in the "Attention is all you need" paper. At the time, the focus of the research was on improving seq2seq for machine translation, by removing its recurrence to process all tokens in parallel while preserving its dot-product attention mechanism to keep its text processing performance. Its parallelizability was an important factor in its widespread use in large neural networks.
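The dot-product attention that the transformer kept can be computed for all tokens at once with a couple of matrix multiplications, which is what makes the architecture parallelizable, and also why its cost grows quadratically with the context length. The NumPy sketch below follows the scaled dot-product formulation from the paper, Attention(Q, K, V) = softmax(Q Kᔀ / sqrt(d_k)) V; it is a minimal single-head illustration, and the projection matrices are random stand-ins for learned parameters.

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """softmax(Q K^T / sqrt(d_k)) V, computed for all positions in parallel."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                  # (n, n): every token scores every token
    scores -= scores.max(axis=-1, keepdims=True)     # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)   # softmax over the context window
    return weights @ V                               # weighted mixture of value vectors

rng = np.random.default_rng(2)
n, d_model = 6, 8                                    # context length, model width
X = rng.normal(size=(n, d_model))                    # token representations

# Random stand-ins for the learned query/key/value projections.
Wq, Wk, Wv = (rng.normal(size=(d_model, d_model)) for _ in range(3))
out = scaled_dot_product_attention(X @ Wq, X @ Wk, X @ Wv)
print(out.shape)                                     # (6, 8): one output vector per token
```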

AI boom era

Already in spring 2017, even before the "Attention is all you need" preprint was published, one of the co-authors applied the "decoder-only" variation of the architecture to generate fictitious Knowledge articles. The Transformer architecture is now used in many generative models that contribute to the ongoing AI boom.

In language modelling, ELMo (2018) was a bi-directional LSTM that produces contextualized word embeddings, improving upon the line of research from bag of words and word2vec. It was followed by BERT (2018), an encoder-only Transformer model. In October 2019, Google started using BERT to process search queries. In 2020, Google Translate replaced the previous RNN-encoder–RNN-decoder model with a Transformer-encoder–RNN-decoder model.

Starting in 2018, the OpenAI GPT series of decoder-only Transformers became state of the art in natural language generation. In 2022, a chatbot based on GPT-3, ChatGPT, became unexpectedly popular, triggering a boom around large language models.

Since 2020, Transformers have been applied in modalities beyond text, including the vision transformer, speech recognition, robotics, and multimodal settings. The vision transformer, in turn, stimulated new developments in convolutional neural networks. Image and video generators like DALL-E (2021), Stable Diffusion 3 (2024), and Sora (2024) are based on the Transformer architecture.

References

Bahdanau, Dzmitry; Cho, Kyunghyun; Bengio, Yoshua (19 May 2016). "Neural Machine Translation by Jointly Learning to Align and Translate". arXiv:1409.0473.
Chen, Lili; Lu, Kevin; Rajeswaran, Aravind; Lee, Kimin; Grover, Aditya; Laskin, Michael; Abbeel, Pieter; Srinivas, Aravind; Mordatch, Igor (24 June 2021). "Decision Transformer: Reinforcement Learning via Sequence Modeling". arXiv:2106.01345.
Cheng, Jianpeng; Dong, Li; Lapata, Mirella (November 2016). "Long Short-Term Memory-Networks for Machine Reading". Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing. Austin, Texas: Association for Computational Linguistics. pp. 551–561.
Cho, Kyunghyun; van Merriënboer, Bart; Gulcehre, Caglar; Bahdanau, Dzmitry; Bougares, Fethi; Schwenk, Holger; Bengio, Yoshua (October 2014). "Learning Phrase Representations using RNN Encoder–Decoder for Statistical Machine Translation". Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP). Doha, Qatar: Association for Computational Linguistics. pp. 1724–1734. arXiv:1406.1078.
Choromanski, Krzysztof; Likhosherstov, Valerii; Dohan, David; Song, Xingyou; Gane, Andreea; Sarlos, Tamas; Hawkins, Peter; Davis, Jared; Mohiuddin, Afroz (19 November 2022). "Rethinking Attention with Performers". arXiv:2009.14794.
Chung, Junyoung; Gulcehre, Caglar; Cho, KyungHyun; Bengio, Yoshua (2014). "Empirical Evaluation of Gated Recurrent Neural Networks on Sequence Modeling". arXiv:1412.3555.
Devlin, Jacob; Chang, Ming-Wei; Lee, Kenton; Toutanova, Kristina (11 October 2018). "BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding". arXiv:1810.04805v2.
Dosovitskiy, Alexey; Beyer, Lucas; Kolesnikov, Alexander; Weissenborn, Dirk; Zhai, Xiaohua; Unterthiner, Thomas; Dehghani, Mostafa; Minderer, Matthias; Heigold, Georg; Gelly, Sylvain; Uszkoreit, Jakob (3 June 2021). "An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale". arXiv:2010.11929.
Esser, Patrick; Kulal, Sumith; Blattmann, Andreas; Entezari, Rahim; MĂŒller, Jonas; Saini, Harry; Levi, Yam; Lorenz, Dominik; Sauer, Axel (5 March 2024). "Scaling Rectified Flow Transformers for High-Resolution Image Synthesis". arXiv:2403.03206.
Feldman, J. A.; Ballard, D. H. (1 July 1982). "Connectionist models and their properties". Cognitive Science. 6 (3): 205–254.
Feldman, Jerome A. (December 1982). "Dynamic connections in neural networks". Biological Cybernetics. 46 (1): 27–39.
Giles, C. Lee; Maxwell, Tom (1 December 1987). "Learning, invariance, and generalization in high-order neural networks". Applied Optics. 26 (23): 4972–4978.
Goldman, Sharon (20 March 2024). "'Attention is All You Need' creators look beyond Transformers for AI at Nvidia GTC: 'The world needs something better'". VentureBeat.
Gruber, N.; Jockisch, A. (2020). "Are GRU cells more specific and LSTM cells more sensitive in motive classification of text?". Frontiers in Artificial Intelligence. 3: 40.
Hinton, Geoffrey E.; Plaut, David C. (1987). "Using Fast Weights to Deblur Old Memories". Proceedings of the Annual Meeting of the Cognitive Science Society. 9.
Katharopoulos, Angelos; Vyas, Apoorv; Pappas, Nikolaos; Fleuret, François (2020). "Transformers are RNNs: Fast autoregressive Transformers with linear attention". ICML 2020. PMLR. pp. 5156–5165.
Levy, Steven. "8 Google Employees Invented Modern AI. Here's the Inside Story". Wired.
Lewis-Kraus, Gideon (14 December 2016). "The Great A.I. Awakening". The New York Times.
Liu, Zhuang; Mao, Hanzi; Wu, Chao-Yuan; Feichtenhofer, Christoph; Darrell, Trevor; Xie, Saining (2022). "A ConvNet for the 2020s". Conference on Computer Vision and Pattern Recognition. pp. 11976–11986.
Love, Julia (10 July 2023). "AI Researcher Who Helped Write Landmark Paper Is Leaving Google". Bloomberg News.
Luong, Minh-Thang; Pham, Hieu; Manning, Christopher D. (2015). "Effective Approaches to Attention-based Neural Machine Translation". arXiv:1508.04025.
Marche, Stephen (23 August 2024). "Was Linguistic A.I. Created by Accident?". The New Yorker.
"Meet the $4 Billion AI Superstars That Google Lost". Bloomberg. 13 July 2023.
Murgia, Madhumita (23 July 2023). "Transformers: the Google scientists who pioneered an AI revolution". Financial Times.
Parikh, Ankur P.; TÀckström, Oscar; Das, Dipanjan; Uszkoreit, Jakob (25 September 2016). "A Decomposable Attention Model for Natural Language Inference". arXiv:1606.01933.
Peng, Bo; Alcaide, Eric; Anthony, Quentin; Albalak, Alon; Arcadinho, Samuel; Biderman, Stella; Cao, Huanqi; Cheng, Xin; Chung, Michael (10 December 2023). "RWKV: Reinventing RNNs for the Transformer Era". arXiv:2305.13048.
Rumelhart, David E.; McClelland, James L.; Hinton, Geoffrey E. (29 July 1987). Parallel Distributed Processing, Volume 1: Explorations in the Microstructure of Cognition: Foundations, Chapter 2. Cambridge, Mass.: Bradford Books.
Schlag, Imanol; Irie, Kazuki; Schmidhuber, JĂŒrgen (2021). "Linear Transformers Are Secretly Fast Weight Programmers". ICML 2021. Springer. pp. 9355–9366.
Schmidhuber, JĂŒrgen (1992). "Learning to control fast-weight memories: an alternative to recurrent nets". Neural Computation. 4 (1): 131–139.
Shinde, Gitanjali; Wasatkar, Namrata; Mahalle, Parikshit (6 June 2024). Data-Centric Artificial Intelligence for Multidisciplinary Applications. CRC Press. p. 75.
Sutskever, Ilya; Vinyals, Oriol; Le, Quoc V. (2014). "Sequence to Sequence Learning with Neural Networks". Advances in Neural Information Processing Systems. 27. Curran Associates, Inc. arXiv:1409.3215.
Toews, Rob (3 September 2023). "Transformers Revolutionized AI. What Will Replace Them?". Forbes.
Uszkoreit, Jakob (31 August 2017). "Transformer: A Novel Neural Network Architecture for Language Understanding". research.google. A concurrent blog post on the Google Research blog.
Vaswani, Ashish; Shazeer, Noam; Parmar, Niki; Uszkoreit, Jakob; Jones, Llion; Gomez, Aidan N.; Kaiser, Ɓukasz; Polosukhin, Illia (2017). "Attention is All you Need". Advances in Neural Information Processing Systems. 30. Curran Associates, Inc.
von der Malsburg, Christoph (1981). The Correlation Theory of Brain Function. Internal Report 81-2, MPI Biophysical Chemistry. Reprinted in Models of Neural Networks II, chapter 2, pp. 95–119. Springer, Berlin, 1994. http://cogprints.org/1380/1/vdM_correlation.pdf
Wu, Yonghui; et al. (1 September 2016). "Google's Neural Machine Translation System: Bridging the Gap between Human and Machine Translation". arXiv:1609.08144.
"Google: BERT now used on almost every English query". Search Engine Land. 15 October 2020.
"Improving language understanding with unsupervised learning". openai.com. 11 June 2018.
"Recent Advances in Google Translate". research.google.
"The inside story of how ChatGPT was built from the people who made it". MIT Technology Review.
finetune-transformer-lm. OpenAI. 11 June 2018.