A Scrapbook of Neural Network Lores

A lightly curated list of stories, anecdotes, and various other bits from neural network research.
AI, scaling, NN

Author: Yuxi Liu

Published: January 18, 2024

Modified: January 20, 2024

The scaling hypothesis

Marvin Minsky, on how he gave up on neural networks after the 1950s because he could not afford a few million neurons.

I had the naive idea that if one could build a big enough network, with enough memory loops, it might get lucky and acquire the ability to envision things in its head. This became a field of study later. It was called self-organizing random networks. Even today, I still get letters from young students who say, ‘Why are you people trying to program intelligence? Why don’t you try to find a way to build a nervous system that will just spontaneously create it?’ Finally, I decided that either this was a bad idea or it would take thousands or millions of neurons to make it work, and I couldn’t afford to try to build a machine like that. (Bernstein 1981)

Peter Norvig, on how he quickly gave up on neural networks in the 1980s due to lack of compute.

And then it finally worked. And I think the biggest difference was the computing power. Definitely there were advances in data. So we could do ImageNet because Fei-Fei Li and others gathered this large database, and that was really important. There are certainly differences in the algorithm, right? We’ve got a slightly different squashing function. Instead of shaped like this, it’s shaped like this. I mean, I don’t know how big a deal that was, but we learned how to do stochastic gradient descent a little bit better. We figured that dropout gave you a little bit better robustness.

So there were small things, but I think probably the biggest was the computing power. And I mean, I certainly remember Geoffrey Hinton came to Berkeley when I was a grad student in 1981, I think, when he talked about these neural nets. And we fellow grad students thought that was so cool. So we said, “Let’s go back into the lab and implement it.”

And of course, there was absolutely nothing you could download, so we had to build it all from scratch. And we got it to do exclusive or, and then we got it to do something a little bit more complicated. And it was exciting. And then we gave it the first real problem, and it ran overnight, and it didn’t converge, and we let it run one more day, and it still didn’t converge. And then we gave up, and we went back to our sort of knowledge-based systems approach. But if we had the computing power of today, it probably would have converged after five seconds. (Norvig 2021)

“The last bits are deepest”

Why Does Pretraining Work?

Early on in training, a model learns the crudest levels: that some letters like ‘e’ are more frequent than others like ‘z’, that every 5 characters or so there is a space, and so on. … once a model has learned a good English vocabulary and correct formatting/spelling, what’s next? There’s not much juice left in predicting within-words. The next thing is picking up associations among words. … If the word “Jefferson” is the last word, then “Washington” may not be far away, and it should hedge its bets on predicting that ‘W’ is the next character, and then if it shows up, go all-in on “ashington”. … Now training is hard. Even subtler aspects of language must be modeled, such as keeping pronouns consistent. This is hard in part because the model’s errors are becoming rare, and because the relevant pieces of text are increasingly distant and ‘long-range’. … If we compared two models, one of which didn’t understand gender pronouns at all and guessed ‘he’/‘she’ purely at random, and one which understood them perfectly and always guessed ‘she’, the second model would attain a lower average error of barely <0.02 bits per character! …

The implication here is that the final few bits are the most valuable bits, which require the most of what we think of as intelligence. A helpful analogy here might be our actions: for the most part, all humans execute actions equally well. … Where individuals differ is when they start running into the long tail of novel choices, rare choices, choices that take seconds but unfold over a lifetime, choices where we will never get any feedback (like after our death). One only has to make a single bad decision, out of a lifetime of millions of discrete decisions, to wind up in jail or dead. A small absolute average improvement in decision quality, if it is in those decisions, may be far more important than its quantity indicates, and give us some intuition for why those last bits are the hardest/deepest. (Branwen 2020)
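As a back-of-envelope check of the “<0.02 bits per character” figure quoted above (the assumption that a gendered pronoun appears about once every 50 characters is mine, not Gwern’s): a perfect model saves at most 1 bit per pronoun relative to a coin flip, so

\[
\frac{1 \text{ bit saved per pronoun}}{\approx 50 \text{ characters per pronoun}} \approx 0.02 \text{ bits per character}.
\]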

Echoes of “the last bits are deepest” from a very early paper that used a trigram model, trained on roughly 600 million words of text, to estimate the entropy of English over the Brown Corpus.

From a loftier perspective, we cannot help but notice that linguistically the trigram concept, which is the workhorse of our language model, seems almost moronic. It captures local tactic constraints by sheer force of numbers, but the more well-protected bastions of semantic, pragmatic, and discourse constraint and even morphological and global syntactic constraint remain unscathed, in fact unnoticed. Surely the extensive work on these topics in recent years can be harnessed to predict English better than we have yet predicted it.

We see this paper as a gauntlet thrown down before the computational linguistics community. The Brown Corpus is a widely available, standard corpus and the subject of much linguistic research. By predicting the corpus character by character, we obviate the need for a common agreement on a vocabulary. Given a model, the computations required to determine the cross-entropy are within reach for even a modest research budget. We hope by proposing this standard task to unleash a fury of competitive energy that will gradually corral the wild and unruly thing that we know the English language to be. (Brown et al. 1992)
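To make the proposed benchmark concrete, here is a minimal sketch of character-level cross-entropy under a trigram model, i.e. predicting each character from the previous two. The add-one smoothing, the function names, and the placeholder file names are my own simplifications, not the scheme of Brown et al.:

```python
import math
from collections import defaultdict

def train_char_trigram(text):
    """Count, for each two-character context, how often each next character follows."""
    counts = defaultdict(lambda: defaultdict(int))
    for i in range(2, len(text)):
        counts[text[i - 2:i]][text[i]] += 1
    return counts

def cross_entropy_bits_per_char(model, text, alphabet):
    """Average -log2 p(next char | previous two chars), with add-one smoothing."""
    vocab = len(alphabet)
    total_bits, n = 0.0, 0
    for i in range(2, len(text)):
        context, char = text[i - 2:i], text[i]
        context_counts = model.get(context, {})
        count = context_counts.get(char, 0)
        total = sum(context_counts.values())
        p = (count + 1) / (total + vocab)  # add-one smoothing over the alphabet
        total_bits -= math.log2(p)
        n += 1
    return total_bits / n

# Train on one body of text, measure cross-entropy on held-out text
# (file names are placeholders).
model = train_char_trigram(open("training_text.txt").read())
test = open("held_out_text.txt").read()
print(cross_entropy_bits_per_char(model, test, alphabet=set(test)))
```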

Neural networks want to work

Marvin Minsky’s SNARC (1951). Designed to simulate a single mouse escaping a maze, it ended up simulating multiple mice because of design bugs, which were never fixed. Although the machine had only 40 neurons and its parts failed all the time, the network as a whole kept working.

It turned out that because of an electronic accident in our design we could put two or three rats in the same maze and follow them all. The rats actually interacted with one another. If one of them found a good path, the others would tend to follow it. We sort of quit science for a while to watch the machine. We were amazed that it could have several activities going on at once in its little nervous system. Because of the random wiring, it had a sort of fail-safe characteristic. If one of the neurons wasn’t working, it wouldn’t make much of a difference—and, with nearly three hundred tubes and the thousands of connections we had soldered, there would usually be something wrong somewhere. In those days, even a radio set with twenty tubes tended to fail a lot. I don’t think we ever debugged our machine completely, but that didn’t matter. By having this crazy random design, it was almost sure to work, no matter how you built it. (Bernstein 1981)

Bernard Widrow once built Madaline I (circa 1962) in a rush to present it at a technical meeting. Even though about a quarter of its circuitry turned out to be defective, it still worked, at reduced capacity.

We discovered the inherent ability of adaptive computers to ignore their own defects while we were rushing through construction of a system called Madaline I for presentation at a technical meeting. The machine was finished late the night before the meeting and the next day we showed some very complex pattern discriminations. Later we discovered that about a fourth of the circuitry was defective. Things were connected backward, there were short circuits, and poor solder joints. We were pretty unhappy until it dawned on us that this system has the ability to adapt around its own internal flaws. The capacity of the system is diminished but it does not fail. (Widrow 1963)

Andrej Karpathy, on why bugs in neural network programs are so hard to find: buggy neural networks do not fail outright, they merely degrade.

… perhaps you forgot to flip your labels when you left-right flipped the image during data augmentation. Your net can still (shockingly) work pretty well because your network can internally learn to detect flipped images and then it left-right flips its predictions. Or maybe your autoregressive model accidentally takes the thing it’s trying to predict as an input due to an off-by-one bug. Or you tried to clip your gradients but instead clipped the loss, causing the outlier examples to be ignored during training. Or you initialized your weights from a pretrained checkpoint but didn’t use the original mean. Or you just screwed up the settings for regularization strengths, learning rate, its decay rate, model size, etc. Therefore, your misconfigured neural net will throw exceptions only if you’re lucky; Most of the time it will train but silently work a bit worse. (Karpathy 2019)
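One of those failure modes, clipping the loss instead of the gradients, is easy to write by accident. A hypothetical PyTorch-style sketch of the pattern (not Karpathy’s code; the function and batch keys are made up for illustration):

```python
import torch

def train_step(model, batch, loss_fn, optimizer, max_grad_norm=1.0):
    """loss_fn is assumed to return per-example losses (reduction="none")."""
    optimizer.zero_grad()
    losses = loss_fn(model(batch["x"]), batch["y"])

    # Buggy variant: clipping the losses. Any example whose loss exceeds the
    # threshold gets a zero gradient, so exactly the hard/outlier examples
    # are silently ignored during training.
    # losses = torch.clamp(losses, max=5.0)

    loss = losses.mean()
    loss.backward()

    # Intended variant: clip the gradient norm, leaving every example's
    # contribution intact while bounding the update size.
    torch.nn.utils.clip_grad_norm_(model.parameters(), max_grad_norm)
    optimizer.step()
    return loss.item()
```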

Researchers at OpenAI (2018) reported that fixing RL bugs is as important as better algorithms.

Big-picture considerations like susceptibility to the noisy-TV problem are important for the choice of a good exploration algorithm. However, we found that getting seemingly-small details right in our simple algorithm made the difference between an agent that never leaves the first room and an agent that can pass the first level. To add stability to the training, we avoided saturation of the features and brought the intrinsic rewards to a predictable range. We also noticed significant improvements in performance of RND every time we discovered and fixed a bug (our favorite one involved accidentally zeroing an array which resulted in extrinsic returns being treated as non-episodic; we realized this was the case only after being puzzled by the extrinsic value function looking suspiciously periodic). Getting such details right was a significant part of achieving high performance even with algorithms conceptually similar to prior work. This is one reason to prefer simpler algorithms where possible. (Burda and Edwards 2018)

Around 2019, Gwern, Shawn Presser, and others trained \(512\times 512\) image generation models using the BigGAN architecture. However, they used the compare_gan implementation, which had a multiply-by-zero bug. The models still worked, just not well enough to match the original BigGAN.

Our primary goal was to train & release 512px BigGAN models on not just ImageNet but all the other datasets we had like anime datasets. The compare_gan BigGAN implementation turned out to have a subtle +1 gamma bug which stopped us from reaching results comparable to the model; while we beat our heads against the wall trying to figure out why it was working but not well enough (figuring it out far too late, after we had disbanded) … “Neural nets want to work” – even if they start out being effectively multiplied by zero. (Branwen 2022)
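The bug class is easy to state in code: a normalization gain (“gamma”) that should be parameterized as an offset from 1 is instead used directly, so the layer starts out multiplying its input by roughly zero. A schematic PyTorch-style illustration of that pattern, not the actual compare_gan code:

```python
import torch
import torch.nn as nn

class ConditionalGain(nn.Module):
    """Class-conditional scale applied after normalization (BigGAN-style)."""

    def __init__(self, cond_dim, num_features, buggy=False):
        super().__init__()
        # The offset network is conventionally initialized near zero.
        self.to_offset = nn.Linear(cond_dim, num_features)
        nn.init.zeros_(self.to_offset.weight)
        nn.init.zeros_(self.to_offset.bias)
        self.buggy = buggy

    def forward(self, x, cond):
        offset = self.to_offset(cond)[..., None, None]  # broadcast over H, W
        # Correct: the gain starts near 1. Buggy (missing "+ 1"): the gain
        # starts near 0, so the feature map is effectively multiplied by zero.
        gain = offset if self.buggy else 1.0 + offset
        return gain * x
```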

A personal story from Berkeley CS 285 (Deep Reinforcement Learning), Fall 2022.

For Homework 3, we were asked to implement the soft actor-critic algorithm. We would implement the agent, run the agent on the Half Cheetah environment, and submit the trajectories to Gradescope, where an autograder would check the trajectories and see if the agent achieved a final score above 300. For the Half Cheetah, score means the distance it travels per episode, averaged over several episodes.

I noticed that the algorithm I implemented did learn, but the learning curve looked like a rollercoaster, jumping up and down in the 250–300 range. After many fruitless and paranoid programming sessions, I managed to pass the autograder by trying enough random seeds and submitting only the best ones. The professor, Sergey Levine, offered little help, admitting that RL agents are extremely hard to debug.

One day after the assignment deadline, the professor announced that there had been a critical one-line bug in the starter code: the algorithm should train on past game frames in random order, but the given code always supplied them in FIFO order. With the fix, the learning curve would rise in a smooth sigmoid to 350.
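For illustration, the kind of one-line difference involved might look like this (a minimal replay-buffer sketch; it is not the actual CS 285 starter code, and the class and method names are made up):

```python
import random
from collections import deque

class ReplayBuffer:
    def __init__(self, capacity=100_000):
        self.transitions = deque(maxlen=capacity)

    def add(self, transition):
        self.transitions.append(transition)

    def sample_fifo(self, batch_size):
        # Bug-like behaviour: always train on the oldest frames, in order,
        # so minibatches are stale and highly correlated.
        return list(self.transitions)[:batch_size]

    def sample_random(self, batch_size):
        # Intended behaviour: uniform random minibatches from the whole buffer,
        # which breaks correlations between consecutive frames.
        return random.sample(list(self.transitions), batch_size)
```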

The Neural Net Tank Urban Legend

Gwern has compiled a large list of examples in “The Neural Net Tank Urban Legend” on Gwern.net. I have a few more.

According to Sejnowski, Takeo Kanade did work on detecting tanks in images. This is unconfirmed. I have looked for the talk “Artificial Intelligence Vision: Progress and Non-Progress”, but it is not available online. I looked for Kanade’s doctoral dissertation of 1974, but it covers only facial recognition. I also cannot find anything about detecting tanks in his publication list.

In his talk “Artificial Intelligence Vision: Progress and Non-Progress,” Takeo Kanade (from Carnegie Mellon) noted that computer memories back in the 1960s were tiny by today’s standards and could hold only one image at a time. For his doctoral dissertation in 1974, Takeo had shown that, though his program could find a tank in one image, it was too difficult for it to do so in other images where the tank was in a different position and the lighting was different. But, by the time his early students graduated, the programs they designed could recognize tanks under more general conditions because computers were more powerful. Today his students’ programs can recognize tanks in any image. The difference is that today we have access to millions of images that sample a wide range of poses and lighting conditions, and computers are millions of times more powerful. (Sejnowski 2018, 256)

There was not a lot of actual research on tank recognition. (Kanal and Randall 1964) contains some good pictures. Their network was a two-layer perceptron of type \(\mathbb{R}^{N \times N} \to \{0, 1\}^{32\times 32} \to \{0, 1\}^{24} \to \{0, 1\}\). It works as follows (a code sketch appears after Figure 1):

  • The grayscale photo is down-scaled and binarized by convolution with a discrete Laplace filter: \(\mathbb{R}^{N \times N} \to \{0, 1\}^{32\times 32}\).
  • The weights for the 24 hidden perceptrons are constructed by linear discriminant analysis: \(\{0, 1\}^{32\times 32} \to \{0, 1\}^{24}\).
  • The output perceptron is learned by the perceptron learning rule: \(\{0, 1\}^{24} \to \{0, 1\}\).
(a) Grayscale photos, some containing tanks, and some not.
(b) A picture of a tank after convolution with a discrete Laplace filter.
(c) The architecture of the network.
Figure 1: Images from (Kanal and Randall 1964).
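A minimal NumPy sketch of that pipeline. The down-scaling method, the binarization threshold, and the training hyperparameters are my guesses; the hidden-layer weights are taken as given, standing in for the paper’s linear discriminant analysis step:

```python
import numpy as np

def laplace_binarize(image, out_size=32, threshold=0.0):
    """Down-scale a square grayscale photo and binarize it with a discrete
    (4-neighbour) Laplace filter: R^(N x N) -> {0,1}^(32 x 32)."""
    block = image.shape[0] // out_size
    small = image[:block * out_size, :block * out_size]
    small = small.reshape(out_size, block, out_size, block).mean(axis=(1, 3))
    padded = np.pad(small, 1)
    lap = (padded[:-2, 1:-1] + padded[2:, 1:-1]
           + padded[1:-1, :-2] + padded[1:-1, 2:] - 4 * small)
    return (lap > threshold).astype(float)

def hidden_layer(x, W_hidden, b_hidden):
    """24 fixed hidden perceptrons: {0,1}^1024 -> {0,1}^24.
    x is the flattened binary image; W_hidden, b_hidden stand in for LDA."""
    return (W_hidden @ x + b_hidden > 0).astype(float)

def train_output_unit(X, y, W_hidden, b_hidden, epochs=50, lr=1.0):
    """Learn the single output perceptron ({0,1}^24 -> {0,1}) with the
    perceptron learning rule, keeping the hidden weights fixed."""
    w_out = np.zeros(W_hidden.shape[0])
    b_out = 0.0
    for _ in range(epochs):
        for x, target in zip(X, y):
            h = hidden_layer(x, W_hidden, b_hidden)
            pred = float(w_out @ h + b_out > 0)
            w_out += lr * (target - pred) * h
            b_out += lr * (target - pred)
    return w_out, b_out
```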

The second neural network winter

The first neural network winter started around 1965, when the main research centers pivoted away from neural networks: the Stanford Research Institute group turned to symbolic AI; the Bernard Widrow group turned to using single neurons as adaptive filters; the Frank Rosenblatt group withered from lack of funds and then ended with Rosenblatt’s literal death in 1971. The field rose again around 1985, when backpropagation and improved compute allowed researchers to train networks with on the order of \(10^4\) parameters and about \(4\) layers.

Something strange happened during the 1990–2010 period: the neural network research community quietly faded away again for another 20 years. Unlike the previous case, there was no great mythology or drama about this winter, no Perceptron controversy.

I would like to find out why.

Lukas: So I remember Daphne Koller telling me, maybe 2003, that the kind of state-of-the-art handwriting systems were neural nets, but that it was such an ad hoc kind of system that we shouldn’t focus on it. And I wonder if maybe I should have paid more attention to that and tried harder to make neural nets work for the applications I was doing.

Peter: Yeah, me too. And certainly Yann LeCun had success with the digit database, and I think that was over-engineered in that they looked at exactly the features they needed for that set of digitizations of those digits. And in fact, I remember researchers talking about, “Well, what change are we going to do for sample number 347?” Right?

Lukas: Oh, really? Okay.

Peter: There were individual data points that they would perform theories on, so that was definitely over-tuning to the data. And it should have been an indication that was a good approach. It was better than other approaches at the time.

Lukas: I guess so. Although that does sound like a damning level of over-fitting the data, I suppose.

Peter: Right. There was only a couple thousand data points. I forget exactly how many. Maybe it was 10,000. Maybe it was even 100,000, but it wasn’t many. (Norvig 2021)

References

Bernstein, Jeremy. 1981. “Marvin Minsky’s Vision of the Future.” The New Yorker, December. https://www.newyorker.com/magazine/1981/12/14/a-i.
Branwen, Gwern. 2020. “The Scaling Hypothesis.” https://www.gwern.net/Scaling-hypothesis.
———. 2022. “GANs Didn’t Fail, They Were Abandoned.” https://gwern.net/gan.
Brown, Peter F., Stephen A. Della Pietra, Vincent J. Della Pietra, Jennifer C. Lai, and Robert L. Mercer. 1992. “An Estimate of an Upper Bound for the Entropy of English.” Computational Linguistics 18 (1): 31–40. https://aclanthology.org/J92-1002.pdf.
Burda, Yura, and Harri Edwards. 2018. “Reinforcement Learning with Prediction-Based Rewards.” OpenAI. https://openai.com/research/reinforcement-learning-with-prediction-based-rewards.
Kanal, Laveen N., and Neil C. Randall. 1964. “Recognition System Design by Statistical Analysis.” In Proceedings of the 1964 19th ACM National Conference, 42–501.
Karpathy, Andrej. 2019. “A Recipe for Training Neural Networks.” Andrej Karpathy Blog. https://karpathy.github.io/2019/04/25/recipe/.
Norvig, Peter. 2021. “Singularity Is in the Eye of the Beholder.” https://wandb.ai/wandb_fc/gradient-dissent/reports/Peter-Norvig-Google-s-Director-of-Research-Singularity-is-in-the-eye-of-the-beholder--Vmlldzo2MTYwNjk.
Sejnowski, Terrence J. 2018. The Deep Learning Revolution. Illustrated edition. Cambridge, Massachusetts; London, England: The MIT Press.
Widrow, Bernard. 1963. “Adaline: Smarter Than Sweet.” Stanford Today, Autumn 1963. https://web.archive.org/web/20230606185311/https://www-isl.stanford.edu/~widrow/papers/j1963adalinesmarter.pdf.