
Talk:Neural network (machine learning)


1652:(2009). Unfortunately, Nilsson is not a very good source because he writes things such as, "Rumelhart, Geoffrey E. Hinton, and Ronald J. Williams (1985), introduced a new technique, called back propagation," without mentioning the true inventors of backpropagation. He also writes that "the physicist John J. Hopfield" invented the Hopfield network, without citing Amari who published it 10 years earlier. Neither Nilsson nor the even older surveys you mention cite Ivakhnenko who started deep learning in 1965. Isn't that a rather US-centric non-NPOV here? Most of the community learned about the true pioneers from JS' much more meticulous surveys which you criticize. See my previous message. His 2015 survey lists nearly 900 references, his 2022 update over 500, adding stuff that has become important since 2015 (this is not about citations). Could it be that you have a tiny little bit of non-NPOV of your own? Maybe we all have. But then let's find a consensus. You call "unnormalized linear Transformers" a "great rhetorical trick." Why? Unlike older networks you mention, they do have linearized attention and scale linearly. The terminology "linear Transformer" is due to Katharopoulos et al. (2020), but JS had the machinery already in 1991, as was pointed out in 2021 (see reverted edits). You also claim that early NN architectures (McCulloch and Pitts, 1943) did learn. I know the paper, and couldn't find a working learning algorithm in it. Could you? Note that Gauss and Legendre had a working learning algorithm for linear neural nets over 200 years ago, another must-cite. Anyway, I'll try to follow the recommendations on this talk page and go step by step from now on, in line with 1527:, who introduced residual connections or "constant error flow," the "roots of LSTM / Highway Nets / ResNets." Anyway, thanks for toning that down. You deleted important references to JS' 1991 work on self-supervised pre-training, neural network distillation, GANs, and unnormalized linear Transformers; I tried to undo this on 16 Sept 2024. Regardless of the plagiarism disputes, one cannot deny that this work predates GH/YB and colleagues by a long way. In the interest of historical accuracy, I still propose to revert the revert of my 10 edits, and continue from there. In the future, we could strive to explicitly mention details of the priority disputes between these important people, trying to represent all sides in an NPOV way. I bet you could contribute a lot here. What do you think?

1430:. 2. It relies much on Schmidhuber's history, especially "Annotated History of Machine Learning", and Schmidhuber is an unreliable propagandist who bitterly contests priority with everyone else. He aims to show that modern deep learning is mostly originated by his team, or others like Lapa and Fukushima etc., specifically *not* LeCun, Bengio, etc. You can press ctrl+f and type "did not" and find phrases like "This work did not cite the earlier LSTM", "which the authors did not cite", "extremely unfair that Schmidhuber did not get the Turing award"...

1497:(JS, GH, YB, YL) is the very explicit 2023 report which to my knowledge has not been challenged. The most comprehensive surveys of the field are those published by JS in 2015 and 2022, with over 1000 references in total; wouldn't you agree? They really credit the deep learning pioneers, unlike the surveys of GH/YB/YL. I'd say that JS has become a bit like the chief historian of the field, with the handicap that he is part of it (as you wrote: non-NPOV?).
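For readers not following the terminology dispute, here is a minimal NumPy sketch of what "linearized attention" that "scales linearly" refers to. It is an illustrative sketch only, assuming the elu(x)+1 feature map used by Katharopoulos et al. (2020) and omitting softmax normalization (hence "unnormalized"); all names and the toy data are made up for this example.

    import numpy as np

    def phi(x):
        # positive feature map (assumption: elu(x) + 1, as in Katharopoulos et al. 2020)
        return np.where(x > 0, x + 1.0, np.exp(x))

    def causal_linear_attention(Q, K, V):
        # Unnormalized linear attention: out[t] = phi(q_t) @ S_t, where
        # S_t = sum_{s <= t} outer(phi(k_s), v_s) is a running key-value summary.
        # One pass over the sequence, so time and memory grow linearly with its length.
        T, d = Q.shape
        m = V.shape[1]
        S = np.zeros((d, m))
        out = np.zeros((T, m))
        for t in range(T):
            S += np.outer(phi(K[t]), V[t])
            out[t] = phi(Q[t]) @ S
        return out

    # toy usage
    rng = np.random.default_rng(0)
    Q, K, V = rng.normal(size=(3, 8, 4))
    print(causal_linear_attention(Q, K, V).shape)  # (8, 4)

Softmax attention instead recomputes a T-by-T weight matrix, which is why it scales quadratically; whether the 1991 fast-weight formulation counts as "the same thing" is exactly the priority question being argued above.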
Anyway, without his surveys, many practitioners would not even know the following facts: 301: 280: 911: 890: 1795:. I propose to replace the section "Neural network winter" by the section "Deep learning breakthroughs in the 1960s and 1970s" below. Why? The US "neural network winter" (if any) did not affect Ukraine and Japan, where fundamental breakthroughs occurred in the 1960s and 1970s: Ivakhnenko (1965), Amari (1967), Fukushima (1969, 1979). The Kohonen maps (1980s) should be moved to a later section. I should point out that much of the proposed text is based on older resurrected text written by other editors. 1545:"Annotated History of Modern AI and Deep Learning" was cited about 63 times, while "Deep learning in neural networks: An overview" was cited over 22k times. It is clear why if you compare the two. The "Deep learning in neural networks" is a mostly neutral work (if uncommonly citation-heavy), while the "Annotated History" is extremely polemical (even beginning the essay with a giant collage of people's faces and their achievements, recalling to mind the book covers from those 17th century 247: 1874:(1965). They regarded it as a form of polynomial regression, or a generalization of Rosenblatt's perceptron. A 1971 paper described a deep network with eight layers trained by this method, which is based on layer by layer training through regression analysis. Superfluous hidden units are pruned using a separate validation set. Since the activation functions of the nodes are Kolmogorov-Gabor polynomials, these were also the first deep networks with multiplicative units or "gates." 1152: 1336: 1380:(whose efforts I appreciate) tried to compress the text‎. This massive edit has remained unchallenged until now. I also fixed links in some of the old references, added a few new ones (both primary and secondary sources), corrected many little errors, and tried to streamline some of the explanations. IMO these edits restored important parts and further improved the history section of the article, although a lot remains to be done. Now I kindly ask 2115:, Holland, Habit and Duda (1956). The perceptron raised public excitement for research in Artificial Neural Networks, causing the US government to drastically increase funding. This contributed to "the Golden Age of AI" fueled by the optimistic claims made by computer scientists regarding the ability of perceptrons to emulate human intelligence. The first perceptrons did not have adaptive hidden units. However, Joseph (1960) also discussed 1403:
for being the one to pioneer various aspects. So I'm not open to reinstating that whole linked bundle including those. Why not just slow down and put those things back in at a pace where they can be reviewed? And the ones that are a reach (transferring or assigning credit for invention), take to talk first. You are most familiar with the details of your edits and are in the best position to know those. Sincerely,
1068: 1050: 646: 390: 1004: 979: 369: 1717:: The original version of the "Early Work" section has a very good and accessible overview of the field, and it wikilinks related subjects in a rather fluid way. I think your version of that section, by going deep into crediting and describing a single primary sources on each topic, just doesn't work. As noted above, doing such a fine-grained step-by-step review of primary works of the history is better for the 1376:. He reverted and wrote, "you are doing massive reassignment of credit for Neural Networks based on your interpretation of their work and primary sources and deleting secondary sourced assignments. Please slow down and take such major reassignments to talk first." So here. Please note that most of my edits are not novel! They resurrect important old references deleted on 7 August 2024 in a major edit when 569: 548: 207: 238: 1814:. Also, the extraordinary claim that CNNs "began with" Neocognitron -- that makes it sound like Neocognitron leveraged the key insight of CNNs which was to reduce the number of weights by using the same weights, effectively, for each pixel, running the kernel(s) across the image. From my limited understand, that is not the case with Neocognitron. The article dedicated to 1698:, you say, "do not reply," but I must: that's not a learning algorithm. Sure, McCulloch and Pitts' Turing-equivalent model (1943) is powerful enough to implement any learning algorithm, but they don't describe one: no goal, no objective function to maximise, no explicit learning algorithm. Otherwise it would be known as the famous McCulloch and Pitts learning algorithm. 484: 466: 1296: 2051:'s work on perceptrons (1958). My third party source is R.D. Joseph (1960) who mentions an even earlier perceptron-like device by Farley and Clark: "Farley and Clark of MIT Lincoln Laboratory actually preceded Rosenblatt in the development of a perceptron-like device." I am also copying additional Farley and Clark references (1954) from 1836:, thanks! I agree, I must delete the phrase "of course" in the draft below. I just did. Regarding the Neocognitron: that's another article that must be corrected, because the Neocognitron CNN did have "massive weight replication," and a third party reference on this is section 5.4 of the 2015 survey. I added this to the draft below. 1580:). Physicists don't cite Newton when they write new papers. They don't even cite Schrödinger. Mathematicians don't cite Gauss-Legendre for least squares. They have a vague feeling that they did something about least squares, and that's enough. It is no serious problem. Historians will do all that detailed credit assignment later. 1078: 1587:, there were several ways to arrive at RNN. One route goes through neuroanatomy. The very first McCulloch and Pitts 1943 paper already had RNN, Hebbian learning, and universality. They had no idea of Ising, nor did they need to, because they got the idea from neuroscientists like Lorente de No. Hopfield cited Amari, btw. 1611:
But I am tired of battling over the historical minutiae. Misunderstanding history doesn't hurt the practitioners, because ideas are cheap, and are rediscovered all the time (see: Schmidhuber's long list of grievances), so not citing earlier works is not an issue. This is tiring, and I'm signing out of
As for the "very explicit 2023 report", it is... not a report. It is the most non-NPOV thing I have seen (beginning the entire report with a damned caricature comic?) and I do not want to read it. He is not the chief historian. He is the chief propagandist. If you want better history of deep learning
His campaign reached levels of absurdity when he claimed that Amari (1972)'s RNN is "based on the (uncited) Lenz-Ising recurrent architecture". If you can call the Ising model "The first non-learning recurrent NN architecture", then I can call the heat death of the universe "The first non-evolving
As a general principle, if I can avoid quoting Schmidhuber, I must, because Schmidhuber is extremely non-NPOV. I had removed almost all citations to his Annotated History except those that genuinely cannot be found anywhere else. For example, I kept all citations to that paper about Amari and Saito,
We suppose that some axonal terminations cannot at first excite the succeeding neuron; but if at any time the neuron fires, and the axonal terminations are simultaneously excited, they become synapses of the ordinary kind, henceforth capable of exciting the neuron. That is Hebbian learning (6 years
I don't know the sources on this at all, but I just lend support to editors above for at least this section, on prose, accessibility, and accuracy in a broader conceptual sense, you should not restore your edits wholesale. (I know it's a lot of work, as writing good accessible prose is super hard,
J. Schmidhuber (AI Blog, 02/20/2020, updated 2021, 2022). The 2010s: Our Decade of Deep Learning / Outlook on the 2020s. The recent decade's most important developments and industrial applications based on the AI of Schmidhuber's team, with an outlook on the 2020s, also addressing privacy and data
Recapping my response from our conversation at my talk page: Thanks for your work and your post. The series of rapid-fire edits ended up being entangled where they can't be reviewed / potentially reverted separately. In that bundle were several which IMO pretty creatively shifted/assigned credit
The most important issue is that any citation to Schmidhuber's blog posts, essays, and "Annotated History" invariably taints a Knowledge (XXG) page with non-NPOV. Before all those details, this is the main problem with citing Schmidhuber. Citing earlier works is fine, but it is *NOT* fine to cite
had backpropagation (reverse mode of auto-diff) in 1970. G.M. Ostrovski republished this in 1971. Henry J. Kelley already had a precursor in 1960. Two centuries ago, Gauss and Legendre had the method of least squares which is exactly what's now called a linear neural network (only the name has
You can find it if you ctrl+f "learn" in the paper. A little later they showed that Hebbian learning in a feedforward network is equivalent to an RNN by unrolling that RNN in time. ("THEOREM VII. Alterable synapses can be replaced by circles.", and Figure 1.i. The dashed line is the learnable
It is even more revealing if you ctrl+f on "Hinton". More than half of the citations to Hinton are followed by "Very similar to ", "although this type of deep learning dates back to Schmidhuber's work of 1991", "does not mention the pioneering works", "The authors did not cite Schmidhuber's
trying to figure out if attention really is necessary (for example, "Sparse MLP for image recognition: Is self-attention really necessary?" or MLP-mixers). Does that mean feedforward networks are "attentionless Transformers"? Or can I just put Rosenblatt into the Transformer page's history
changed). If JS is non-NPOV (as you write), then how non-NPOV are GH/YB/YL who do not cite any of this? You blasted JS' quote, "one of the most important documents in the history of machine learning," which actually refers to the 1991 diploma thesis of his student
1445:, and it came straight from his "Annotated History of Machine Learning". I removed all examples of this phrase in Knowledge (XXG) except in his own page (he is entitled to his own opinions). In fact, the entire paper is scattered with such propagandistic sentences: 2107:. R. D. Joseph (1960) mentions an even earlier perceptron-like device by Farley and Clark: "Farley and Clark of MIT Lincoln Laboratory actually preceded Rosenblatt in the development of a perceptron-like device." However, "they dropped the subject." Farley and 1559:
Mikel Olazaran, A Historical Sociology of Neural Network Research (PhD dissertation, Department of Sociology, University of Edinburgh, 1991); Olazaran, 'A Sociological History of the Neural Network Controversy', Advances in Computers, Vol. 37 (1993),
with an adaptive hidden layer. Rosenblatt (1962) cited and adopted these ideas, also crediting work by H. D. Block and B. W. Knight. Unfortunately, these early efforts did not lead to a working learning algorithm for hidden units, i.e.,
Schmidhuber is not reliable by the way. I just checked his "Deep learning in neural networks" and immediately saw an error: "Early NN architectures (McCulloch and Pitts, 1943) did not learn." In fact, it stated right here in the
H. Saito (1967). Master's thesis, Graduate School of Engineering, Kyushu University, Japan. Implementation of Amari's 1967 stochastic gradient descent method for multilayer perceptrons. (S. Amari, personal communication, 2021.)
1967:, thanks for encouraging me to resume the traditional way of editing. I tried to address the comments of the other users. Now I want to edit the article accordingly, and go step by step from there, as you suggested. 1745:
is sufficiently against the content that you had added, that it should not be reverted back in the same form. Please follow the advice of other editors above, and propose specific text to add back, here in talk.
had Hopfield networks 10 years before Hopfield, plus a sequence-learning generalization (the "dynamic RNN" as opposed to the "equilibrium RNN" you mentioned), all using the must-cite Ising architecture (1925).
basis and (just) seek prior consensus on the controversial ones such as assigning / implying credit to individuals. A slower pace with smaller edits makes it reviewable and so is itself a review process.
because 1. H. Saito is so extremely obscure that if we don't cite Schmidhuber on this, we have no citation for this. 2. I can at least trust that he didn't make up the "personal communication" with Amari.
I don't have the specialized knowledge to fully evaluate it but overall it looks pretty good to me. Mentions people in the context of early developments without being heavy on claim/credit type wording.
model of Darwinian evolution". The entire point of RNN is that it is dynamic, and the entire point of the Ising model is that it is about thermal equilibrium at a point where all dynamics has *stopped*.
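A tiny sketch of the distinction being argued over here, under my own illustrative assumptions (random weights, sign and tanh units): an "equilibrium" recurrence is iterated with a fixed configuration until the state stops changing, while a "dynamic" recurrence keeps updating its state as new inputs arrive and never needs to settle. This mirrors the "equilibrium RNN" versus "dynamic RNN" wording used elsewhere on this page.

    import numpy as np

    rng = np.random.default_rng(1)
    W = rng.normal(size=(4, 4))
    W = (W + W.T) / 2          # symmetric couplings, Hopfield/Ising style
    np.fill_diagonal(W, 0)

    # Equilibrium view: relax to a fixed point (the dynamics eventually stops).
    s = np.sign(rng.normal(size=4))
    for _ in range(50):
        s_prev = s.copy()
        for i in range(4):     # asynchronous unit-by-unit updates (these provably converge
            s[i] = 1.0 if W[i] @ s >= 0 else -1.0   # for symmetric W with zero diagonal)
        if np.array_equal(s, s_prev):
            break              # a stable configuration has been reached

    # Dynamic view: the hidden state is driven by a sequence of inputs x_t.
    U = rng.normal(size=(4, 2))
    h = np.zeros(4)
    for x_t in rng.normal(size=(5, 2)):   # a length-5 toy input sequence
        h = np.tanh(W @ h + U @ x_t)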
As one example, the phrase "one of the most important documents in the history of machine learning" used to appear several times all across Knowledge (XXG), and is an obvious violation of
but the hardest part -- finding and understanding the source material -- you've already done and banked, so you should definitely keep up editing on this and the many related articles.)
but if at any time the neuron fires, and the axonal terminations are simultaneously excited, they become synapses of the ordinary kind, henceforth capable of exciting the neuron
Again. Gauss or Legendre are not a must cite. I had read hundreds of math and CS papers and never had I needed to know who or what or at what paper least squares was proposed.
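To make the least-squares claim discussed above concrete, here is a throwaway sketch with made-up data: training a single linear unit by minimizing squared error is the same problem Gauss and Legendre solved in closed form, so gradient descent on the "linear neural network" simply recovers the normal-equations solution. The data, learning rate, and iteration count are arbitrary choices for illustration.

    import numpy as np

    rng = np.random.default_rng(0)
    X = rng.normal(size=(100, 3))
    y = X @ np.array([1.5, -2.0, 0.5]) + 0.1 * rng.normal(size=100)

    # Closed-form least squares (the Gauss/Legendre normal equations).
    w_normal_eq = np.linalg.solve(X.T @ X, X.T @ y)

    # The same problem posed as a "linear neural network" trained by gradient descent
    # on the mean squared error.
    w = np.zeros(3)
    for _ in range(2000):
        grad = X.T @ (X @ w - y) / len(y)
        w -= 0.1 * grad

    print(np.allclose(w, w_normal_eq, atol=1e-3))  # True: both find the same weights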
1576:
Ising architecture (1925) is NOT a must-cite. It is not even a neural network architecture (though you can really retroactively call an "architecture", but historians call it
on 7 August: JS' 1991 work on self-supervised pre-training, neural network distillation, GANs, and unnormalized linear Transformers, using the improved text of 24 September.
Calling something "unnormalized linear Transformers" is a great rhetorical trick, and I can call feedforward networks "attentionless Transformers". I am serious. People
Rochester, N.; J.H. Holland; L.H. Habit; W.L. Duda (1956). "Tests on a cell assembly theory of the action of the brain, using a large digital computer".
alias "pony in a strange land," thanks for your reply! I see where you are coming from. The best reference to the mentioned priority disputes between
Fukushima, K. (1980). "Neocognitron: A self-organizing neural network model for a mechanism of pattern recognition unaffected by shift in position".
632: 576: 553: 3059: 3044: 2112: 1100: 927: 85: 2964: 1648:, thanks! I am always going to the source when I find something of interest in a survey. You condemn JS and recommend alternative surveys such as 767: 414: 3009: 2989: 739: 317: 168: 2908: 2597: 2346: 2052: 1427: 856: 491: 471: 135: 3049: 3034: 30: 720: 1893:
to classify non-linearly separable pattern classes. Subsequent developments in hardware and hyperparameter tuning have made end-to-end
2994: 2141: 1091: 1055: 918: 895: 2979: 397: 374: 99: 1373: 24: 2111:(1954) also used computational machines to simulate a Hebbian network. Other neural network computational machines were created by 1318: 308: 285: 104: 20: 2613:
Fukushima, K. (1979). "Neural network model for a mechanism of pattern recognition unaffected by shift in position—Neocognitron".
2413: 1577: 1350: 129: 1612:
the debate. A word of advice: If you must use Schmidhuber's history, go directly to the source. Do not use his interpretation. @
74: 3004: 2949: 812: 777: 658: 260: 198: 2142:"How 3 Turing Awardees Republished Key Methods and Ideas Whose Creators They Failed to Credit. Technical Report IDSIA-23-23" 701: 125: 65: 1863: 822: 584:
1927: 1741:
Speedboys, whatever else may be the case, I don't think that you should "revert the revert... and continue from there."
2047:, my next proposed edit (see draft below based on the reverted edit of 15 September) is about important work predating 1342: 175: 2700:
Rosenblatt, F. (1958). "The Perceptron: A Probabilistic Model For Information Storage And Organization in the Brain".
1905: 1894: 1882: 787: 2522:
Sonoda, Sho; Murata, Noboru (2017). "Neural network with unbounded activation functions is universal approximator".
1889:. In computer experiments conducted by Amari's student Saito, a five layer MLP with two modifiable layers learned 1718: 1388:
who seem to know a lot about the subject: please review the details once more and revert the revert! Best regards,
1157: 206: 185: 109: 1674:
That is a learning algorithm. Ignore it if you must. As I said, I'm tired of fighting over this priority dispute.
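One way to read the quoted McCulloch-Pitts passage in code (my own paraphrase of the rule, not the paper's notation): a terminal permanently becomes an effective synapse the first time its presynaptic activity coincides with the neuron firing, i.e. a binary, one-shot variant of Hebb-style "fire together, wire together". The array names and the toy usage below are assumptions made for this sketch.

    import numpy as np

    def update_synapses(effective, pre, post):
        # effective[i, j] is True once terminal j can excite neuron i.
        # The quoted rule: if neuron i fires while terminal j is simultaneously
        # excited, that terminal becomes an ordinary synapse from then on.
        return np.logical_or(effective, np.outer(post, pre).astype(bool))

    # toy usage: 2 neurons, 3 input terminals
    effective = np.zeros((2, 3), dtype=bool)
    pre = np.array([1, 0, 1])    # terminals excited at this instant
    post = np.array([1, 0])      # neuron 0 fires, neuron 1 does not
    effective = update_synapses(effective, pre, post)
    # effective[0] is now [True, False, True]; neuron 1's synapses are unchanged.
    # The graded version of the same idea is Hebb's rule: W += lr * np.outer(post, pre).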
1563:
Anderson, James A., and Edward Rosenfeld, eds. Talking nets: An oral history of neural networks. MIT Press, 2000.
1346: 1317:
849: 1170: 217: 2104: 1890: 1584: 758: 266: 141: 1685: 1635: 1464: 1015: 989: 2709: 2566:
Ramachandran, Prajit; Barret, Zoph; Quoc, V. Le (October 16, 2017). "Searching for Activation Functions".
2495:
Fukushima, K. (1969). "Visual feature extraction by a multilayered network of analog threshold elements".
2116: 2056: 1823: 410: 1792: 1774: 1742: 1653: 55: 2896: 1878: 1099:
2782:
Olazaran, Mikel (1996). "A Sociological Study of the Official History of the Perceptrons Controversy".
2234: 1482: 300: 279: 70: 910: 889: 2828:
Contributions to Perceptron Theory, Cornell Aeronautical Laboratory Report No. VG-11 96--G-7, Buffalo
2274: 1751: 2713: 1942:
Others may think differently, but I'd be happy if you just made smaller edits at a slower pace on a
677: 237: 2081: 2064: 2016: 2002: 1986: 1972: 1954: 1909: 1841: 1800: 1782: 1730: 1661: 1532: 1410: 1393: 161: 1620:
histories than his history. Other than the references I gave above, I can also recommend this one
1518:
had early unpublished work (1948) with "ideas related to artificial evolution and learning RNNs."
2806: 2799: 2733: 2674: 2633: 2567: 2549: 2531: 2319: 2297: 2238: 2206: 2181: 2026: 1935: 1901: 1695: 1681: 1645: 1631: 1506: 1478: 1460: 1385: 1377: 1176: 222: 2363: 2085: 2068: 2020: 2006: 1990: 1976: 1958: 1845: 1827: 1804: 1786: 1773:, thanks. I'll try to follow your recommendations and go step by step from now on, in line with 1755: 1734: 1689: 1665: 1639: 1536: 1468: 1414: 1397: 1930:(CNNs) with convolutional layers and downsampling layers and weight replication began with the 1596:
before Hebb's 1949 book, but... Hebbian learning was an immediately obvious idea once you have
1556:
The quest for artificial intelligence: a history of ideas and achievements, by Nilsson, Nils J.
1434:
original"... You can try the same exercise by ctrl+f on "LeCun" and "Bengio". It is very funny.
2905: 2841:
Farley, B.G.; W.A. Clark (1954). "Simulation of Self-Organizing Systems by Digital Computer".
2726: 2667: 2626: 2594: 2343: 2314:
Bengio, Yoshua; LeCun, Yann; Hinton, Geoffrey (2021). "Turing Lecture: Deep Learning for AI".
2290: 2199: 1981:
Done. Now the section on CNNs must be adjusted a bit, to reflect the beginnings in the 1970s.
1867: 1833: 1819: 1649: 1498: 1083: 51: 2145: 1583:
Ising architecture is NOT a must-cite even in the 1970s, because, as you might notice in the
1442: 1174: 803: 221: 2924: 2877: 2850: 2791: 2718: 2659: 2618: 2541: 2504: 2473: 2453: 2405: 2375: 2282: 2191: 2096: 2048: 1886: 1519: 1510: 1502: 1304: 1280: 1172: 729: 581: 219: 1943: 1426:
My main concerns with the page were: 1. It had too many details that probably should go into
2437: 2108: 1770: 1747: 1601: 1524: 1486: 2278: 2055:. Finally, Frank Rosenblatt also cites Joseph's work (1960) on adaptive hidden units in 2103:, one of the first implemented artificial neural networks, funded by the United States 2060: 2044: 2012: 1982: 1968: 1964: 1920: 1837: 1796: 1778: 1766: 1726: 1657: 1613: 1597: 1528: 1421: 1389: 1381: 1369: 645: 2259: 1810:
Not bad, but there is some anti-U.S. tone. E.g. the phrase "of course" falls afoul of
668:
2943: 2677: 2636: 2379: 2121: 1916: 1859: 1858:
Fundamental research was conducted on ANNs in the 1960s and 1970s. The first working
1546: 1490: 2809: 2736: 2552: 2209: 2767:
Rosenblatt, Frank (1957). "The Perceptron—a perceiving and recognizing automaton".
2753:
Beyond Regression: New Tools for Prediction and Analysis in the Behavioral Sciences
2322: 2300: 2025:
I waited for a day, as suggested. The latest edit resurrects references deleted by
1931: 1912:. The rectifier has become the most popular activation function for deep learning. 1815: 1811: 1627: 496: 2394: 2751: 2588: 2337: 2195: 1515: 1067: 1049: 2795: 2545: 1003: 978: 389: 368: 2458: 2441: 2409: 2100: 1494: 1295: 1096: 1073: 923: 2881: 2854: 2508: 1915:
Nevertheless, research stagnated in the United States following the work of
1818:
uses the more accurate phrasing that CNNs were "inspired by" Neocognitron.
710: 406: 2729: 2293: 2202: 568: 547: 2670: 2629: 2172:
Schmidhuber, J. (2015). "Deep Learning in Neural Networks: An Overview".
402: 313: 2286: 2663: 2622: 1871: 483: 465: 2802: 2722: 2364:"Heuristic self-organization in problems of engineering cybernetics" 2904:(3rd ed.). United States of America: Pearson Education. pp. 16–28. 2572: 2536: 2243: 2011:
I'll wait a bit, as you suggested. But there is still a lot to do.
1866:, a method to train arbitrarily deep neural networks, published by 2186: 1364:
History section: request to approve edits of 15-16 September 2024
2690:
Haykin (2008) Neural Networks and Learning Machines, 3rd edition
1995:
I'd suggest smaller edits and waiting a day or 2 between them.
1330: 1290: 1177: 786:
231: 223: 15: 2237:(2022). "Annotated History of Modern AI and Deep Learning". 1791:
Dear all, please review my first proposed edit in line with
1505:
had Deep Learning by Stochastic Gradient Descent in 1967.
1623: 1621: 1713: 2590:
Perceptrons: An Introduction to Computational Geometry
2258:
LeCun, Yann; Bengio, Yoshua; Hinton, Geoffrey (2015).
1341:
On 8 March 2024, it was proposed that this article be
160: 3070:
1509:
had ReLUs in 1969, and the CNN architecture in 1979.
2970:
2497:
IEEE Transactions on Systems Science and Cybernetics
2091:
Early work on perceptrons and multilayer perceptrons
1542:
Schmidhuber's interpretation of these earlier works.
1095:, a collaborative effort to improve the coverage of 922:, a collaborative effort to improve the coverage of 580:, a collaborative effort to improve the coverage of 495:, a collaborative effort to improve the coverage of 401:, a collaborative effort to improve the coverage of 312:, a collaborative effort to improve the coverage of 2476:(1967). "A theory of adaptive pattern classifier". 174: 2402:IEEE Transactions on Systems, Man, and Cybernetics 1854:Deep learning breakthroughs in the 1960s and 1970s 1630:on a specific problem in neural network research. 1553:I would rather recommend something else, such as: 1129:This article has not yet received a rating on the 692:Computer science articles needing expert attention 529:This article has not yet received a rating on the 1938:in 1979, though not trained by backpropagation. 1013:, a project which is currently considered to be 33:for general discussion of the article's subject. 1501:had a working deep learning algorithm in 1965. 832:WikiProject Computer science/Unreferenced BLPs 2821: 2819: 1025:Knowledge (XXG):WikiProject Cognitive science 8: 2960:Knowledge (XXG) vital articles in Technology 594:Knowledge (XXG):WikiProject Computer science 2524:Applied and Computational Harmonic Analysis 2229: 2227: 2225: 2223: 2221: 2219: 2167: 2165: 2163: 1897:the currently dominant training technique. 749:Computer science articles without infoboxes 687:Computer science articles needing attention 1146: 1044: 973: 884: 653:Here are some tasks awaiting attention: 627: 542: 460: 363: 274: 2898:Artificial Intelligence A Modern Approach 2712: 2571: 2535: 2457: 2242: 2185: 2140:Schmidhuber, Juergen (14 December 2023). 3025:Top-importance Computer science articles 3015:Unknown-importance neuroscience articles 2587:Minsky, Marvin; Papert, Seymour (1969). 509:Knowledge (XXG):WikiProject Neuroscience 2336:Ivakhnenko, A. G.; Lapa, V. G. (1967). 2132: 1046: 975: 886: 544: 462: 365: 276: 235: 3055:Unknown-importance psychology articles 2955:Knowledge (XXG) level-5 vital articles 2895:Russel, Stuart; Norvig, Peter (2010). 2870:IRE Transactions on Information Theory 2843:IRE Transactions on Information Theory 2395:"Polynomial theory of complex systems" 2339:Cybernetics and Forecasting Techniques 1109:Knowledge (XXG):WikiProject Psychology 1028:Template:WikiProject Cognitive science 936:Knowledge (XXG):WikiProject Statistics 3030:WikiProject Computer science articles 2446:The Annals of Mathematical Statistics 2053:History_of_artificial_neural_networks 1428:History of artificial neural networks 597:Template:WikiProject Computer science 423:Knowledge (XXG):WikiProject Computing 7: 2975:C-Class vital articles in Technology 2144:. IDSIA, Switzerland. Archived from 1372:, on 15-16 September 2024, I edited 1089:This article is within the scope of 1009:This article is within the scope of 916:This article is within the scope of 574:This article is within the scope of 489:This article is within the scope of 395:This article is within the scope of 326:Knowledge (XXG):WikiProject Robotics 306:This article is within the scope of 2442:"A Stochastic Approximation Method" 2342:. American Elsevier Publishing Co. 265:It is of interest to the following 23:for discussing improvements to the 3040:Mid-importance Statistics articles 2771:. Cornell Aeronautical Laboratory. 
768:Timeline of computing 2020–present 14: 3065:Externally peer reviewed articles 3020:C-Class Computer science articles 3000:Top-importance Computing articles 1374:Neural network (machine learning) 794:Computing articles needing images 512:Template:WikiProject Neuroscience 50:New to Knowledge (XXG)? Welcome! 25:Neural network (machine learning) 2985:Top-importance Robotics articles 1926:Deep learning architectures for 1714:2024-09-16 diff under discussion 1334: 1294: 1150: 1076: 1066: 1048: 1002: 977: 909: 888: 644: 567: 546: 482: 464: 388: 367: 299: 278: 245: 236: 205: 45:Click here to start a new topic. 3060:WikiProject Psychology articles 3045:WikiProject Statistics articles 2419:from the original on 2017-08-29 2362:Ivakhnenko, A.G. (March 1970). 1112:Template:WikiProject Psychology 956:This article has been rated as 939:Template:WikiProject Statistics 614:This article has been rated as 443:This article has been rated as 346:This article has been rated as 2965:C-Class level-5 vital articles 2086:01:48, 22 September 2024 (UTC) 2069:11:19, 21 September 2024 (UTC) 2021:13:25, 20 September 2024 (UTC) 2007:13:05, 20 September 2024 (UTC) 1991:11:08, 20 September 2024 (UTC) 1977:10:29, 20 September 2024 (UTC) 1959:13:13, 19 September 2024 (UTC) 1846:10:29, 20 September 2024 (UTC) 1828:13:22, 19 September 2024 (UTC) 1805:12:17, 19 September 2024 (UTC) 1787:11:58, 19 September 2024 (UTC) 1756:18:58, 18 September 2024 (UTC) 1735:19:30, 18 September 2024 (UTC) 1690:21:03, 19 September 2024 (UTC) 1666:11:58, 19 September 2024 (UTC) 1640:22:07, 18 September 2024 (UTC) 1537:14:21, 18 September 2024 (UTC) 1469:01:36, 18 September 2024 (UTC) 1415:00:35, 18 September 2024 (UTC) 1398:21:24, 17 September 2024 (UTC) 1314:It was found to have 7 errors. 426:Template:WikiProject Computing 1: 3010:C-Class neuroscience articles 2990:WikiProject Robotics articles 1928:convolutional neural networks 1864:Group method of data handling 1301:This article was reviewed by 1103:and see a list of open tasks. 1011:WikiProject Cognitive science 930:and see a list of open tasks. 848:Tag all relevant articles in 588:and see a list of open tasks. 503:and see a list of open tasks. 417:and see a list of open tasks. 329:Template:WikiProject Robotics 320:and see a list of open tasks. 42:Put new text under old text. 2380:10.1016/0005-1098(70)90092-0 2196:10.1016/j.neunet.2014.09.003 857:WikiProject Computer science 633:WikiProject Computer science 577:WikiProject Computer science 3050:C-Class psychology articles 3035:C-Class Statistics articles 2929:Principles of Neurodynamics 2393:Ivakhnenko, Alexey (1971). 1895:stochastic gradient descent 1883:stochastic gradient descent 788:List of computer scientists 3086: 2995:C-Class Computing articles 2796:10.1177/030631296026003005 2546:10.1016/j.acha.2015.12.005 1719:History of neural networks 1131:project's importance scale 1031:Cognitive science articles 620:project's importance scale 531:project's importance scale 449:project's importance scale 352:project's importance scale 2980:C-Class Robotics articles 2784:Social Studies of Science 2615:Trans. 
IECE (in Japanese) 2410:10.1109/TSMC.1971.4308320 2316:Communications of the ACM 1885:was published in 1967 by 1347:Artificial neural network 1279:This page is archived by 1128: 1061: 997: 955: 904: 850:Category:Computer science 626: 613: 600:Computer science articles 562: 528: 477: 442: 383: 345: 294: 273: 80:Be welcoming to newcomers 2882:10.1109/TIT.1956.1056810 2855:10.1109/TIT.1954.1057468 2509:10.1109/TSSC.1969.300225 2105:Office of Naval Research 1908:(rectified linear unit) 1891:internal representations 1877:The first deep learning 852:and sub-categories with 492:WikiProject Neuroscience 2617:. J62-A (10): 658–665. 2459:10.1214/aoms/1177729586 1626:as a good example of a 3005:All Computing articles 2950:C-Class vital articles 2826:Joseph, R. D. (1960). 2404:. SMC-1 (4): 364–378. 2117:multilayer perceptrons 2095:In 1958, psychologist 2057:multilayer perceptrons 1682:pony in a strange land 1680:Do not reply anymore. 1632:pony in a strange land 1624:https://gwern.net/tank 1461:pony in a strange land 1309:on December 14, 2005. 1092:WikiProject Psychology 919:WikiProject Statistics 813:Computer science stubs 411:information technology 75:avoid personal attacks 2750:Werbos, P.J. (1975). 1879:multilayer perceptron 515:neuroscience articles 398:WikiProject Computing 259:on Knowledge (XXG)'s 252:level-5 vital article 199:Auto-archiving period 100:Neutral point of view 2931:. Spartan, New York. 2702:Psychological Review 2440:; Monro, S. (1951). 631:Things you can help 309:WikiProject Robotics 105:No original research 2287:10.1038/nature14539 2279:2015Natur.521..436L 2235:Schmidhuber, Jürgen 1910:activation function 1115:psychology articles 942:Statistics articles 2664:10.1007/bf00344251 2623:10.1007/bf00344251 2027:User:Cosmia Nebula 1936:Kunihiko Fukushima 1902:Kunihiko Fukushima 1862:algorithm was the 1696:User:Cosmia Nebula 1646:User:Cosmia Nebula 1507:Kunihiko Fukushima 1483:Jürgen Schmidhuber 1479:User:Cosmia Nebula 1386:User:Cosmia Nebula 1378:User:Cosmia Nebula 1368:As discussed with 429:Computing articles 261:content assessment 86:dispute resolution 47: 2925:Rosenblatt, Frank 2910:978-0-13-604259-4 2599:978-0-262-63022-1 2478:IEEE Transactions 2348:978-0-444-00020-0 2273:(7553): 436–444. 1868:Alexey Ivakhnenko 1650:Nils John Nilsson 1499:Alexey Ivakhnenko 1361: 1360: 1326: 1325: 1322: 1289: 1288: 1284: 1145: 1144: 1141: 1140: 1137: 1136: 1084:Psychology portal 1043: 1042: 1039: 1038: 1022:Cognitive science 985:Cognitive science 972: 971: 968: 967: 883: 882: 879: 878: 875: 874: 871: 870: 541: 540: 537: 536: 459: 458: 455: 454: 362: 361: 358: 357: 332:Robotics articles 230: 229: 66:Assume good faith 43: 3077: 2933: 2932: 2921: 2915: 2914: 2903: 2892: 2886: 2885: 2865: 2859: 2858: 2838: 2832: 2831: 2823: 2814: 2813: 2779: 2773: 2772: 2764: 2758: 2757: 2747: 2741: 2740: 2723:10.1037/h0042519 2716: 2697: 2691: 2688: 2682: 2681: 2647: 2641: 2640: 2610: 2604: 2603: 2584: 2578: 2577: 2575: 2563: 2557: 2556: 2539: 2519: 2513: 2512: 2492: 2486: 2485: 2474:Amari, Shun'ichi 2470: 2464: 2463: 2461: 2434: 2428: 2427: 2425: 2424: 2418: 2399: 2390: 2384: 2383: 2359: 2353: 2352: 2333: 2327: 2326: 2311: 2305: 2304: 2264: 2255: 2249: 2248: 2246: 2231: 2214: 2213: 2189: 2169: 2158: 2157: 2155: 2153: 2137: 2097:Frank Rosenblatt 2049:Frank Rosenblatt 1716: 1520:Seppo Linnainmaa 1349:. 
The result of 1338: 1337: 1331: 1316: 1305:Nature (journal) 1298: 1291: 1278: 1178: 1154: 1153: 1147: 1117: 1116: 1113: 1110: 1107: 1086: 1081: 1080: 1079: 1070: 1063: 1062: 1052: 1045: 1033: 1032: 1029: 1026: 1023: 1006: 999: 998: 993: 981: 974: 962:importance scale 944: 943: 940: 937: 934: 913: 906: 905: 900: 892: 885: 861: 855: 730:Computer science 659:Article requests 648: 641: 640: 628: 602: 601: 598: 595: 592: 591:Computer science 582:Computer science 571: 564: 563: 558: 554:Computer science 550: 543: 517: 516: 513: 510: 507: 486: 479: 478: 468: 461: 431: 430: 427: 424: 421: 392: 385: 384: 379: 371: 364: 334: 333: 330: 327: 324: 303: 296: 295: 290: 282: 275: 258: 249: 248: 241: 240: 232: 224: 210: 209: 200: 179: 178: 164: 95:Article policies 16: 3085: 3084: 3080: 3079: 3078: 3076: 3075: 3074: 2940: 2939: 2938: 2937: 2936: 2923: 2922: 2918: 2911: 2901: 2894: 2893: 2889: 2867: 2866: 2862: 2840: 2839: 2835: 2825: 2824: 2817: 2781: 2780: 2776: 2769:Report 85-460-1 2766: 2765: 2761: 2749: 2748: 2744: 2714:10.1.1.588.3775 2699: 2698: 2694: 2689: 2685: 2649: 2648: 2644: 2612: 2611: 2607: 2600: 2586: 2585: 2581: 2565: 2564: 2560: 2521: 2520: 2516: 2494: 2493: 2489: 2472: 2471: 2467: 2436: 2435: 2431: 2422: 2420: 2416: 2397: 2392: 2391: 2387: 2361: 2360: 2356: 2349: 2335: 2334: 2330: 2313: 2312: 2308: 2262: 2260:"Deep Learning" 2257: 2256: 2252: 2233: 2232: 2217: 2174:Neural Networks 2171: 2170: 2161: 2151: 2149: 2139: 2138: 2134: 2093: 1904:introduced the 1887:Shun'ichi Amari 1856: 1712: 1602:neuron doctrine 1525:Sepp Hochreiter 1511:Shun'ichi Amari 1503:Shun'ichi Amari 1487:Geoffrey Hinton 1366: 1335: 1285: 1179: 1173: 1151: 1114: 1111: 1108: 1105: 1104: 1082: 1077: 1075: 1030: 1027: 1024: 1021: 1020: 987: 941: 938: 935: 932: 931: 898: 867: 864: 859: 853: 841:Project-related 836: 817: 798: 772: 753: 734: 715: 696: 672: 599: 596: 593: 590: 589: 556: 514: 511: 508: 505: 504: 428: 425: 422: 419: 418: 377: 331: 328: 325: 322: 321: 288: 256: 246: 226: 225: 220: 197: 121: 116: 115: 114: 91: 61: 12: 11: 5: 3083: 3081: 3073: 3072: 3067: 3062: 3057: 3052: 3047: 3042: 3037: 3032: 3027: 3022: 3017: 3012: 3007: 3002: 2997: 2992: 2987: 2982: 2977: 2972: 2967: 2962: 2957: 2952: 2942: 2941: 2935: 2934: 2916: 2909: 2887: 2860: 2833: 2815: 2790:(3): 611–659. 2774: 2759: 2742: 2708:(6): 386–408. 2692: 2683: 2658:(4): 193–202. 2642: 2605: 2598: 2579: 2558: 2530:(2): 233–268. 2514: 2503:(4): 322–333. 2487: 2484:(16): 279–307. 2465: 2429: 2385: 2374:(2): 207–219. 
2354: 2347: 2328: 2306: 2250: 2215: 2159: 2148:on 16 Dec 2023 2131: 2130: 2126: 2099:described the 2092: 2089: 2045:User:North8000 2041: 2040: 2039: 2038: 2037: 2036: 2035: 2034: 2033: 2032: 2031: 2030: 1965:User:North8000 1934:introduced by 1855: 1852: 1851: 1850: 1849: 1848: 1763: 1762: 1761: 1760: 1759: 1758: 1722: 1709: 1708: 1707: 1706: 1705: 1704: 1703: 1702: 1701: 1700: 1699: 1678: 1675: 1672: 1609: 1605: 1598:associationism 1592: 1588: 1581: 1574: 1566: 1565: 1564: 1561: 1557: 1550: 1543: 1472: 1471: 1455: 1451: 1446: 1439: 1435: 1431: 1424: 1382:User:North8000 1370:User:North8000 1365: 1362: 1359: 1358: 1351:the discussion 1339: 1328: 1324: 1323: 1315: 1310: 1299: 1287: 1286: 1277: 1276: 1273: 1272: 1271: 1270: 1265: 1260: 1255: 1250: 1245: 1240: 1235: 1230: 1225: 1220: 1215: 1213:2021/September 1210: 1205: 1200: 1195: 1190: 1182: 1181: 1180: 1175: 1171: 1169: 1168: 1155: 1143: 1142: 1139: 1138: 1135: 1134: 1127: 1121: 1120: 1118: 1101:the discussion 1088: 1087: 1071: 1059: 1058: 1053: 1041: 1040: 1037: 1036: 1034: 1007: 995: 994: 982: 970: 969: 966: 965: 958:Mid-importance 954: 948: 947: 945: 928:the discussion 914: 902: 901: 899:Mid‑importance 893: 881: 880: 877: 876: 873: 872: 869: 868: 866: 865: 863: 862: 845: 837: 835: 834: 828: 818: 816: 815: 809: 799: 797: 796: 791: 783: 773: 771: 770: 764: 754: 752: 751: 745: 735: 733: 732: 726: 716: 714: 713: 707: 697: 695: 694: 689: 683: 673: 671: 670: 664: 652: 650: 649: 637: 636: 624: 623: 616:Top-importance 612: 606: 605: 603: 586:the discussion 572: 560: 559: 557:Top‑importance 551: 539: 538: 535: 534: 527: 521: 520: 518: 501:the discussion 487: 475: 474: 469: 457: 456: 453: 452: 445:Top-importance 441: 435: 434: 432: 415:the discussion 393: 381: 380: 378:Top‑importance 372: 360: 359: 356: 355: 348:Top-importance 344: 338: 337: 335: 318:the discussion 304: 292: 291: 289:Top‑importance 283: 271: 270: 264: 242: 228: 227: 218: 216: 215: 212: 211: 181: 180: 118: 117: 113: 112: 107: 102: 93: 92: 90: 89: 82: 77: 68: 62: 60: 59: 48: 39: 38: 35: 34: 28: 13: 10: 9: 6: 4: 3: 2: 3082: 3071: 3068: 3066: 3063: 3061: 3058: 3056: 3053: 3051: 3048: 3046: 3043: 3041: 3038: 3036: 3033: 3031: 3028: 3026: 3023: 3021: 3018: 3016: 3013: 3011: 3008: 3006: 3003: 3001: 2998: 2996: 2993: 2991: 2988: 2986: 2983: 2981: 2978: 2976: 2973: 2971: 2968: 2966: 2963: 2961: 2958: 2956: 2953: 2951: 2948: 2947: 2945: 2930: 2926: 2920: 2917: 2912: 2907: 2900: 2899: 2891: 2888: 2883: 2879: 2875: 2871: 2864: 2861: 2856: 2852: 2848: 2844: 2837: 2834: 2829: 2822: 2820: 2816: 2811: 2808: 2804: 2801: 2797: 2793: 2789: 2785: 2778: 2775: 2770: 2763: 2760: 2755: 2754: 2746: 2743: 2738: 2735: 2731: 2728: 2724: 2720: 2715: 2711: 2707: 2703: 2696: 2693: 2687: 2684: 2679: 2676: 2672: 2669: 2665: 2661: 2657: 2653: 2646: 2643: 2638: 2635: 2631: 2628: 2624: 2620: 2616: 2609: 2606: 2601: 2596: 2593:. MIT Press. 
2592: 2591: 2583: 2580: 2574: 2569: 2562: 2559: 2554: 2551: 2547: 2543: 2538: 2533: 2529: 2525: 2518: 2515: 2510: 2506: 2502: 2498: 2491: 2488: 2483: 2479: 2475: 2469: 2466: 2460: 2455: 2451: 2447: 2443: 2439: 2433: 2430: 2415: 2411: 2407: 2403: 2396: 2389: 2386: 2381: 2377: 2373: 2369: 2365: 2358: 2355: 2350: 2345: 2341: 2340: 2332: 2329: 2324: 2321: 2317: 2310: 2307: 2302: 2299: 2295: 2292: 2288: 2284: 2280: 2276: 2272: 2268: 2261: 2254: 2251: 2245: 2240: 2236: 2230: 2228: 2226: 2224: 2222: 2220: 2216: 2211: 2208: 2204: 2201: 2197: 2193: 2188: 2183: 2179: 2175: 2168: 2166: 2164: 2160: 2147: 2143: 2136: 2133: 2129: 2125: 2123: 2122:deep learning 2118: 2114: 2110: 2106: 2102: 2098: 2090: 2088: 2087: 2083: 2079: 2078: 2071: 2070: 2066: 2062: 2058: 2054: 2050: 2046: 2028: 2024: 2023: 2022: 2018: 2014: 2010: 2009: 2008: 2004: 2000: 1999: 1994: 1993: 1992: 1988: 1984: 1980: 1979: 1978: 1974: 1970: 1966: 1962: 1961: 1960: 1956: 1952: 1951: 1945: 1941: 1940: 1939: 1937: 1933: 1929: 1924: 1922: 1918: 1913: 1911: 1907: 1903: 1898: 1896: 1892: 1888: 1884: 1880: 1875: 1873: 1869: 1865: 1861: 1860:deep learning 1853: 1847: 1843: 1839: 1835: 1831: 1830: 1829: 1825: 1821: 1817: 1813: 1809: 1808: 1807: 1806: 1802: 1798: 1794: 1789: 1788: 1784: 1780: 1776: 1772: 1768: 1757: 1753: 1749: 1744: 1740: 1739: 1738: 1737: 1736: 1732: 1728: 1723: 1720: 1715: 1710: 1697: 1693: 1692: 1691: 1687: 1683: 1679: 1676: 1673: 1669: 1668: 1667: 1663: 1659: 1655: 1651: 1647: 1643: 1642: 1641: 1637: 1633: 1629: 1625: 1622: 1619: 1615: 1610: 1606: 1603: 1599: 1593: 1589: 1586: 1582: 1579: 1575: 1571: 1567: 1562: 1558: 1555: 1554: 1551: 1548: 1547:Pamphlet wars 1544: 1540: 1539: 1538: 1534: 1530: 1526: 1521: 1517: 1512: 1508: 1504: 1500: 1496: 1492: 1491:Yoshua Bengio 1488: 1484: 1480: 1476: 1475: 1474: 1473: 1470: 1466: 1462: 1456: 1452: 1447: 1444: 1440: 1436: 1432: 1429: 1425: 1423: 1419: 1418: 1417: 1416: 1412: 1408: 1407: 1400: 1399: 1395: 1391: 1387: 1383: 1379: 1375: 1371: 1363: 1356: 1352: 1348: 1344: 1340: 1333: 1332: 1329: 1320: 1313: 1308: 1307: 1306: 1300: 1297: 1293: 1292: 1282: 1275: 1274: 1269: 1266: 1264: 1261: 1259: 1258:2023/December 1256: 1254: 1253:2023/November 1251: 1249: 1246: 1244: 1241: 1239: 1236: 1234: 1233:2023/February 1231: 1229: 1228:2022/February 1226: 1224: 1221: 1219: 1216: 1214: 1211: 1209: 1206: 1204: 1203:2020/November 1201: 1199: 1196: 1194: 1191: 1189: 1186: 1185: 1184: 1183: 1166: 1165: 1160: 1159: 1149: 1148: 1132: 1126: 1123: 1122: 1119: 1102: 1098: 1094: 1093: 1085: 1074: 1072: 1069: 1065: 1064: 1060: 1057: 1054: 1051: 1047: 1035: 1018: 1017: 1012: 1008: 1005: 1001: 1000: 996: 991: 986: 983: 980: 976: 963: 959: 953: 950: 949: 946: 929: 925: 921: 920: 915: 912: 908: 907: 903: 897: 894: 891: 887: 858: 851: 847: 846: 844: 842: 838: 833: 830: 829: 827: 825: 824: 819: 814: 811: 810: 808: 806: 805: 800: 795: 792: 789: 785: 784: 782: 780: 779: 774: 769: 766: 765: 763: 761: 760: 755: 750: 747: 746: 744: 742: 741: 736: 731: 728: 727: 725: 723: 722: 717: 712: 709: 708: 706: 704: 703: 698: 693: 690: 688: 685: 684: 682: 680: 679: 674: 669: 666: 665: 663: 661: 660: 655: 654: 651: 647: 643: 642: 639: 638: 634: 630: 629: 625: 621: 617: 611: 608: 607: 604: 587: 583: 579: 578: 573: 570: 566: 565: 561: 555: 552: 549: 545: 532: 526: 523: 522: 519: 502: 498: 494: 493: 488: 485: 481: 480: 476: 473: 470: 467: 463: 450: 446: 440: 437: 436: 433: 416: 412: 408: 404: 400: 399: 394: 391: 387: 386: 382: 376: 373: 370: 366: 353: 349: 343: 340: 339: 336: 319: 315: 311: 310: 305: 302: 298: 297: 293: 287: 284: 281: 
277: 272: 268: 262: 254: 253: 243: 239: 234: 233: 214: 213: 208: 204: 196: 192: 189: 187: 183: 182: 177: 173: 170: 167: 163: 159: 155: 152: 149: 146: 143: 140: 137: 134: 131: 127: 124: 123:Find sources: 120: 119: 111: 110:Verifiability 108: 106: 103: 101: 98: 97: 96: 87: 83: 81: 78: 76: 72: 69: 67: 64: 63: 57: 53: 52:Learn to edit 49: 46: 41: 40: 37: 36: 32: 26: 22: 18: 17: 2928: 2919: 2897: 2890: 2876:(3): 80–93. 2873: 2869: 2863: 2849:(4): 76–84. 2846: 2842: 2836: 2827: 2787: 2783: 2777: 2768: 2762: 2752: 2745: 2705: 2701: 2695: 2686: 2655: 2652:Biol. Cybern 2651: 2645: 2614: 2608: 2589: 2582: 2561: 2527: 2523: 2517: 2500: 2496: 2490: 2481: 2477: 2468: 2449: 2445: 2432: 2421:. Retrieved 2401: 2388: 2371: 2367: 2357: 2338: 2331: 2315: 2309: 2270: 2266: 2253: 2177: 2173: 2150:. Retrieved 2146:the original 2135: 2127: 2094: 2076: 2075: 2072: 2042: 1997: 1996: 1949: 1948: 1932:Neocognitron 1925: 1914: 1899: 1876: 1870:and Lapa in 1857: 1834:Michaelmalak 1820:Michaelmalak 1816:Neocognitron 1812:MOS:INSTRUCT 1793:WP:CONSENSUS 1790: 1775:WP:CONSENSUS 1764: 1743:WP:CONSENSUS 1654:WP:CONSENSUS 1628:microhistory 1617: 1569: 1405: 1404: 1401: 1367: 1354: 1327: 1311: 1303: 1302: 1263:2024/January 1248:2023/October 1223:2022/January 1218:2021/October 1198:2020/October 1162: 1156: 1090: 1014: 957: 917: 840: 839: 823:Unreferenced 821: 820: 802: 801: 776: 775: 757: 756: 738: 737: 719: 718: 700: 699: 676: 675: 657: 656: 615: 575: 506:Neuroscience 497:Neuroscience 490: 472:Neuroscience 444: 396: 347: 307: 267:WikiProjects 250: 202: 184: 171: 165: 157: 150: 144: 138: 132: 122: 94: 19:This is the 2438:Robbins, H. 1881:trained by 1721:subarticle. 1516:Alan Turing 1281:ClueBot III 1193:2020/August 148:free images 31:not a forum 2944:Categories 2573:1710.05941 2537:1505.03654 2452:(3): 400. 2423:2019-11-05 2368:Automatica 2244:2212.11279 2180:: 85–117. 2128:References 2101:perceptron 1771:Tryptofish 1748:Tryptofish 1578:presentism 1495:Yann LeCun 1268:2024/March 1238:2023/March 1106:Psychology 1097:Psychology 1056:Psychology 933:Statistics 924:statistics 896:Statistics 2710:CiteSeerX 2678:206775608 2637:206775608 2187:1404.7828 2113:Rochester 2077:North8000 2061:Speedboys 2013:Speedboys 1998:North8000 1983:Speedboys 1969:Speedboys 1950:North8000 1900:In 1969, 1838:Speedboys 1797:Speedboys 1779:Speedboys 1767:SamuelRiv 1727:SamuelRiv 1658:Speedboys 1614:Speedboys 1600:with the 1529:Speedboys 1422:Speedboys 1406:North8000 1390:Speedboys 1355:not moved 1319:this page 1312:Comments: 1243:2023/July 1208:2021/July 1188:2020/July 711:Computing 420:Computing 407:computing 403:computers 375:Computing 255:is rated 88:if needed 71:Be polite 21:talk page 2927:(1962). 2810:16786738 2737:12781225 2730:13602029 2553:12149203 2414:Archived 2294:26017442 2210:11715509 2203:25462637 1608:synapse) 1585:RNN page 1573:section? 1560:335-425. 1450:markets. 1158:Archives 1016:inactive 990:inactive 759:Maintain 702:Copyedit 323:Robotics 314:Robotics 286:Robotics 186:Archives 56:get help 29:This is 27:article. 2671:7370364 2630:7370364 2323:3074096 2301:3074096 2275:Bibcode 1872:Ukraine 1443:WP:NPOV 1420:Hello @ 960:on the 740:Infobox 678:Cleanup 618:on the 447:on the 350:on the 257:C-class 203:90 days 154:WP refs 142:scholar 2803:285702 2267:Nature 2152:19 Dec 1944:WP:BRD 1921:Papert 1917:Minsky 1591:paper: 1493:, and 721:Expand 409:, and 263:scale. 
126:Google 2902:(PDF) 2807:S2CID 2800:JSTOR 2734:S2CID 2675:S2CID 2634:S2CID 2568:arXiv 2550:S2CID 2532:arXiv 2417:(PDF) 2398:(PDF) 2320:S2CID 2298:S2CID 2263:(PDF) 2239:arXiv 2207:S2CID 2182:arXiv 2109:Clark 2043:Dear 1963:Dear 1832:Dear 1765:Dear 1694:Dear 1670:: --> 1644:Dear 1618:other 1594:: --> 1477:Dear 1457:: --> 1448:: --> 1343:moved 1164:Index 804:Stubs 778:Photo 635:with: 244:This 191:Index 169:JSTOR 130:books 84:Seek 2906:ISBN 2727:PMID 2668:PMID 2627:PMID 2595:ISBN 2344:ISBN 2291:PMID 2200:PMID 2154:2023 2082:talk 2065:talk 2017:talk 2003:talk 1987:talk 1973:talk 1955:talk 1919:and 1906:ReLU 1842:talk 1824:talk 1801:talk 1783:talk 1769:and 1752:talk 1731:talk 1711:The 1686:talk 1662:talk 1636:talk 1533:talk 1465:talk 1411:talk 1394:talk 1384:and 1353:was 162:FENS 136:news 73:and 2878:doi 2851:doi 2792:doi 2719:doi 2660:doi 2619:doi 2542:doi 2505:doi 2454:doi 2406:doi 2376:doi 2283:doi 2271:521 2192:doi 1570:are 1345:to 1125:??? 952:Mid 610:Top 525:??? 439:Top 342:Top 176:TWL 2946:: 2872:. 2845:. 2818:^ 2805:. 2798:. 2788:26 2786:. 2732:. 2725:. 2717:. 2706:65 2704:. 2673:. 2666:. 2656:36 2654:. 2632:. 2625:. 2548:. 2540:. 2528:43 2526:. 2499:. 2482:EC 2480:. 2450:22 2448:. 2444:. 2412:. 2400:. 2370:. 2366:. 2318:. 2296:. 2289:. 2281:. 2269:. 2265:. 2218:^ 2205:. 2198:. 2190:. 2178:61 2176:. 2162:^ 2124:. 2084:) 2067:) 2059:. 2019:) 2005:) 1989:) 1975:) 1957:) 1844:) 1826:) 1803:) 1785:) 1777:. 1754:) 1746:-- 1733:) 1688:) 1664:) 1656:. 1638:) 1604:). 1535:) 1489:, 1485:, 1467:) 1413:) 1396:) 1167:) 860:}} 854:{{ 405:, 201:: 193:, 156:) 54:; 2913:. 2884:. 2880:: 2874:2 2857:. 2853:: 2847:4 2830:. 2812:. 2794:: 2756:. 2739:. 2721:: 2680:. 2662:: 2639:. 2621:: 2602:. 2576:. 2570:: 2555:. 2544:: 2534:: 2511:. 2507:: 2501:5 2462:. 2456:: 2426:. 2408:: 2382:. 2378:: 2372:6 2351:. 2325:. 2303:. 2285:: 2277:: 2247:. 2241:: 2212:. 2194:: 2184:: 2156:. 2080:( 2063:( 2015:( 2001:( 1985:( 1971:( 1953:( 1840:( 1822:( 1799:( 1781:( 1750:( 1729:( 1684:( 1660:( 1634:( 1531:( 1463:( 1409:( 1392:( 1357:. 1321:. 1283:. 1161:( 1133:. 1019:. 992:) 988:( 964:. 843:: 826:: 807:: 790:) 781:: 762:: 743:: 724:: 705:: 681:: 662:: 622:. 533:. 451:. 354:. 269:: 195:1 188:: 172:· 166:· 158:· 151:· 145:· 139:· 133:· 128:( 58:.
