By Joe Cheung

Jun 5, 2022

Basically notes on Gwern’s commentary on recent AI progress.

The deep learning revolution started with AlexNet in 2012. Since then, more and more money has been poured into training larger and larger models.

Most of the increase in compute has been companies willing to pay more rather than algorithmic or hardware progress.

  1. GPT-3 And The Blessings of Scale
  2. Pretraining And The Scaling Hypothesis
  3. Slowly At First, Then All At Once
    1. Google AI’s 540B-parameter PaLM
    2. DeepMind’s 70B-parameter Chinchilla
    3. OpenAI’s 3.5B-parameter DALL·E 2
    4. Google Brain’s Imagen
    5. DeepMind’s 1.2B-parameter Gato
  4. Technological Unemployment Is Already Here
  5. Frankfurtian Bullshit In A Post-GPT-3 World
  6. The Singularity Is Nigh
    1. AI timeline
  7. Conclusion
  8. Additional Readings
  9. Endnotes

GPT-3 And The Blessings of Scale

In May 2020, OpenAI announced GPT-3, an unsupervised deep learning transformer-based language model trained on Internet text for the single purpose of predicting the next word in a sentence. It’s the 117x larger 175B-parameter successor to GPT-2, which itself surprised everyone with its ability to learn question answering, reading comprehension, summarisation, and translation, all from the raw text using no task-specific training data. Image GPT, the same exact model trained on pixel sequences instead of text, could generate coherent image completions and samples. GPT-3 astounded everyone: instead of running into diminishing or negative returns, the vast increase in size translated not merely into quantitative improvements in language tasks but also into qualitatively distinct improvements that imply meta-learning (the attention mechanism acting as “fast weights” that “learnt to learn”).

GPT-3 is the first AI system that has obvious, immediate, transformative economic value:

GPT-3 demonstrates the blessings of scale: for deep learning, hard problems are easier to solve than easy problems — everything gets better as models get larger (in contrast to the usual outcome in research, where small things are hard and large things impossible). The Bitter Lesson goes:

The biggest lesson that can be read from 70 years of AI research is that general methods that leverage computation are ultimately the most effective, and by a large margin. The ultimate reason for this is Moore’s law, or rather its generalization of continued exponentially falling cost per unit of computation.

The scaling laws of deep learning, highly consistent over more than 6 orders of magnitude, are driven by neural networks (NNs) functioning as ensembles of many sub-networks that average out to an Occam’s razor: for small data and models they learn superficial or memorised parts of the data, but they can be forced into true learning by hard and rich enough problems. As meta-learners learn amortised Bayesian inference, they build in informative priors when trained over many tasks, and become dramatically more sample-efficient and better at generalisation.

Once the compute is ready, the paradigm will appear:

Each paradigm has a certain amount of compute you can “feed” it before it stops scaling effectively with compute.

As Gwern writes, GPT-3 is terrifying because it is a tiny model compared to what’s possible, trained in the dumbest way possible on a single impoverished modality on tiny data, sampled in a dumb way, its benchmark performance sabotaged by bad prompts and data encoding problems, and yet the first version already manifests crazy runtime meta-learning, and the scaling curves still are not bending!

Per OpenAI’s Kaplan et al. 2020, the leaps we have seen over the past few years are not even halfway there in terms of absolute likelihood loss:

GPT-3 represents ~10^3 — plenty of room for further loss decreases.

If we see such striking gains in halving the validation loss from GPT-2 to GPT-3 but with so far left to go, what is left to emerge as we halve again?

Pretraining And The Scaling Hypothesis

The pretraining thesis goes:

Humans are the cyanobacteria of AI: we emit large amounts of structured data in which logic, causality, object permanence, history are encoded. A model like GPT-3 trains on such data in which “intelligence” is implicit, and learns from the crudest level:

  1. Some letters are more frequent than others (alphanumeric gibberish): 8-5 bit error per character
  2. Words and punctuations exist (still gibberish): 4-3 bit error per character
  3. Words “cluster” (bag of words): <3 bit error per character
  4. Sentences exist (starts making sense): 2 bit error per character (every additional 0.1 bit decrease becomes more costly[1])
  5. Grammar exists e.g. keeping pronouns consistent (multiple sentences make sense): 1-2 bit error per character
  6. Subtleties e.g. solving repetition loops (paragraphs make sense): <0.02 bit error per character
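The first rung of this ladder can be sketched in a few lines. This is a minimal illustration, not the procedure behind the figures above: the sample sentence and the helper name `bits_per_character` are made up for the example, and the unigram model is scored on the same text it was fitted to, which flatters it. A model that treats all 256 byte values as equally likely pays exactly 8 bits per character; merely learning letter frequencies already pays less.

```python
import math
from collections import Counter


def bits_per_character(text, prob):
    """Average cross-entropy, in bits, of `text` under the model `prob`."""
    return -sum(math.log2(prob(ch)) for ch in text) / len(text)


# Hypothetical sample text, chosen only for illustration.
sample = "the quick brown fox jumps over the lazy dog and the cat sat on the mat"

# Model A: every one of the 256 byte values is equally likely.
uniform_bpc = bits_per_character(sample, lambda ch: 1 / 256)

# Model B: a unigram model that only knows letter frequencies
# (fitted on the sample itself, so this is a training loss).
counts = Counter(sample)
unigram_bpc = bits_per_character(sample, lambda ch: counts[ch] / len(sample))

print(f"uniform: {uniform_bpc:.2f} bits/char")
print(f"unigram: {unigram_bpc:.2f} bits/char")
```

Every later rung (words, grammar, long-range consistency) is the same measurement with a better `prob`, which is why the loss is a convenient single number to track.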

Language model performance can be measured by how many bits it takes to convey a character, i.e. bits per character (BPC): GPT-2 had a cross-entropy WebText validation loss of ~3.3 BPC; GPT-3 halved that loss to ~1.73 BPC. For a hypothetical GPT-4, if the scaling curve continues for another 3 orders or so of compute (100-1000x) before hitting harder diminishing returns, the cross-entropy loss will drop to ~1.24 BPC.

It still won’t be near the natural language performance of humans: ASCII spends 7 bits on every character, yet text carries only about 0.7 bits of information per character by Shannon’s 7-gram character entropy estimate (0.7 BPC).
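The extrapolation above can be reproduced with a one-line power law. This is a sketch under loud assumptions: the coefficient 2.57, the exponent 0.048, and the ~3.64e3 petaflop/s-days of training compute for GPT-3 are taken here to be the values reported in the GPT-3 paper and should be checked against it; the function name is hypothetical. The loss figures for GPT-2 and for humans are the ones quoted in this post.

```python
def loss_from_compute(pf_days, coeff=2.57, exponent=0.048):
    """Power-law fit: validation loss as a function of training compute.

    ASSUMPTION: coeff and exponent are the compute fit attributed to the
    GPT-3 paper, with compute measured in petaflop/s-days.
    """
    return coeff * pf_days ** -exponent


GPT3_COMPUTE = 3.64e3   # ASSUMPTION: GPT-3 training compute, petaflop/s-days
GPT2_LOSS = 3.3         # quoted in the text
HUMAN_LOSS = 0.7        # quoted in the text

gpt3_loss = loss_from_compute(GPT3_COMPUTE)
next_loss = loss_from_compute(1000 * GPT3_COMPUTE)  # three more orders of magnitude

drop_2_to_3 = 1 - gpt3_loss / GPT2_LOSS     # relative improvement GPT-2 -> GPT-3
drop_3_to_next = 1 - next_loss / gpt3_loss  # relative improvement GPT-3 -> 1000x
gap_to_human = next_loss - HUMAN_LOSS

print(f"GPT-3 loss:      {gpt3_loss:.2f}")
print(f"1000x loss:      {next_loss:.2f}")
print(f"GPT-2 -> GPT-3:  {drop_2_to_3:.0%} lower")
print(f"GPT-3 -> 1000x:  {drop_3_to_next:.0%} lower")
print(f"gap to human:    {gap_to_human:.2f}")
```

Under these assumptions a thousandfold increase in compute buys a bit under a third off the loss and still leaves roughly half a bit per character between the model and the human estimate.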

What is in that missing >0.4? Everything! While random babbling sufficed at the start, nothing short of true understanding will suffice for ideal prediction. The last bits are deepest. Analogous to humans: we all perform everyday actions like buttoning our shirts well given enough practice and feedback, but we differ when we run into the long tail of choices that are:

  1. Novel
  2. Rare
  3. Short in execution but unfold over a lifetime
  4. Without any feedback (e.g. after our death)

One only has to make a single bad decision to fall into ruin. A small absolute average improvement in decision quality can be far more important than its quantity indicates, hence why the last bits are the hardest and deepest.

If GPT-3 gained so much meta-learning and world knowledge by dropping its absolute loss 50% when starting from GPT-2’s level, what capabilities would another 30% improvement over GPT-3 gain? If we trained a model which reached that loss of 0.7, i.e. predicted text indistinguishably from a human, how could we say that it doesn’t truly understand everything[2]?

Thus, the scaling hypothesis: the blessings of scale as the secret of artificial general intelligence (AGI) — intelligence is “just” simple NNs applied to diverse experiences at a (currently) unreachable scale; as increasing computational resources permit running such algorithms at the necessary scale, NNs will get ever more intelligent.

Slowly At First, Then All At Once

Even in 2015, the scaling hypothesis seemed highly dubious — you needed something to scale, after all, and it was all too easy to imagine flaws in existing systems would never go away, and progress would sigmoid any month now. Like the genomics revolution where a few far-sighted seers extrapolated that the necessary n for GWASes would increase exponentially and deliver powerful polygenic scores soon (which I wrote about), while sober experts wrung their hands over “missing heritability” and the miraculous complexity of biology to scoff about how such n requirements proved GWAS was a failed paradigm, the future arrived at first slowly and then quickly.

For a while after GPT-3 was published, we were possibly in a hardware overhang, where large quantities of compute can be diverted to running powerful AI systems as soon as the software is developed (so as soon as one powerful AI system exists, probably a large number of them do). Google Brain was entirely too practical and short-term; DeepMind believes that AGI will require effectively replicating the human brain module by module; OpenAI, lacking anything like DeepMind’s Google cashflow or its enormous headcount, is making a startup-like bet that they know an important truth that is a Thielian secret: the Scaling Hypothesis is true.

In August 2021, Stanford’s entire AI department released a 200-page 100-author neural scaling laws manifesto announcing their pivot to position themselves as the number one at academic ML scaling research. Only recently do we see Google AI, Google Brain, and DeepMind treat GPT-3 as scaling’s Sputnik moment:

Google’s 540 billion parameter PaLM is the right-most, up-most dot.

In just the past two months, we saw bigger and bigger models like DeepMind’s 80B-parameter Flamingo (paper), Robotics at Google and Everyday Robots (also Alphabet-owned)’s SayCan (paper), and Robotics at Google’s Socratic Models (paper). Below I highlight the most blockbusting five.

Google AI’s 540B-parameter PaLM

(Blog, paper)

Continues the Kaplan scaling, sees discontinuous improvements from model scale (see comparison with GPT-3). The surprise is perhaps how poor the communication between Google and DeepMind is, as you will see.

DeepMind’s 70B-parameter Chinchilla

(Blog, paper)

Outperforms the much larger 175B-parameter GPT-3 and DeepMind’s own 280B-parameter Gopher, demonstrating that essentially everyone has been training LLMs with a deeply suboptimal use of compute.

Model size is (almost) everything.

DeepMind chose 9 quantities of compute (10^18-10^21 FLOPs) and trained many different-sized models at each quantity:

The best models are at the minima.

Connecting the minima at each curve gives you a new scaling law: for every increase in compute, data and model size should increase in equal proportion. DeepMind verified the new scaling law by training the 70B-parameter Chinchilla with the same compute budget as their own 280B-parameter Gopher, i.e. Chinchilla was trained on 1.4 trillion tokens compared to Gopher’s 300 billion tokens. Indeed, Chinchilla outperforms Gopher by 7.6% on average:

This implies that we shouldn’t see models larger than the 540B-parameter PaLM trained on 780B tokens for a while — it doesn’t make sense until we have 60x as much compute as was used for Gopher/Chinchilla (which is why it is surprising DeepMind let Google piss away millions of dollars in TPU time).
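A back-of-the-envelope check of these numbers, as a sketch rather than DeepMind's own accounting: it assumes the common approximation that training a dense transformer costs about 6 FLOPs per parameter per token, and it reads the compute-optimal ratio straight off Chinchilla (1.4T tokens for 70B parameters, about 20 tokens per parameter). The helper name `train_flops` is hypothetical.

```python
def train_flops(params, tokens):
    """ASSUMPTION: training compute is roughly 6 * parameters * tokens."""
    return 6 * params * tokens


gopher = train_flops(280e9, 300e9)
chinchilla = train_flops(70e9, 1.4e12)

# Tokens per parameter implied by Chinchilla's own configuration.
tokens_per_param = 1.4e12 / 70e9

# PaLM as trained, versus a 540B model trained at the Chinchilla ratio.
palm_actual = train_flops(540e9, 780e9)
palm_optimal = train_flops(540e9, tokens_per_param * 540e9)

print(f"Gopher:      {gopher:.2e} FLOPs")
print(f"Chinchilla:  {chinchilla:.2e} FLOPs")
print(f"PaLM actual: {palm_actual:.2e} FLOPs")
print(f"540B at {tokens_per_param:.0f} tokens/param: "
      f"{palm_optimal / chinchilla:.0f}x Chinchilla's compute")
```

Under this approximation Gopher and Chinchilla land within about 20% of each other, and a 540B model trained at the same tokens-per-parameter ratio needs roughly 60x Chinchilla's compute, close to the figure quoted above.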

OpenAI’s 3.5B-parameter DALL·E 2

(Blog, paper)

Successor to the 12B-parameter DALL·E (an implementation of GPT-3 trained on text-image pairs from the Internet). Improvements mainly due to algorithmic innovation (not scaling). The examples are stunning:

DALL·E 2 struggles with anime, realistic faces, text in images, multiple subjects arranged in complex ways, and editing. How many of these will be solved by throwing more compute and training data at them? The scaling curves still haven’t bent, and no one has tried diffusion models’ scaling law on better hyperparameterised models like Chinchilla yet.

The text-in-image problem[3] is probably due to an obscure technical detail that also plagued GPT-3’s performance in rhyming, alliteration, punning, anagrams or permutations, acrostic poems, and arithmetic: the models do not see characters but ~51k word or sub-word chunks called byte-pair encodings (BPEs). To break through, the models just need to memorise enough of the encrypted number/word representations using tricks like rewriting numbers as individual digits or BPE-dropout to expose all possible tokenizations, or better yet, character-level representations.
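A toy illustration of why sub-word chunks hide spelling from the model. This is a simplified stand-in, not real BPE: actual byte-pair encoding applies learned merge rules over a vocabulary of tens of thousands of entries, whereas the sketch below uses greedy longest-match over a four-entry vocabulary invented for the example. The function names are hypothetical too.

```python
import re

# Hypothetical miniature vocabulary; a real BPE vocabulary is learned from data.
TOY_VOCAB = {"rhyme", "rhy", "ming", "202"}


def greedy_tokenize(text, vocab):
    """Split text into the longest vocabulary entries, else single characters."""
    tokens, i = [], 0
    while i < len(text):
        end = next((j for j in range(len(text), i, -1) if text[i:j] in vocab), i + 1)
        tokens.append(text[i:end])
        i = end
    return tokens


def split_digits(text):
    """The digit-rewriting trick: put a space between adjacent digits."""
    return re.sub(r"(?<=\d)(?=\d)", " ", text)


# "rhyme" and "rhyming" share most of their letters but no token.
print(greedy_tokenize("rhyme", TOY_VOCAB))
print(greedy_tokenize("rhyming", TOY_VOCAB))

# A number is chunked arbitrarily unless its digits are separated first.
print(greedy_tokenize("2022", TOY_VOCAB))
print(greedy_tokenize(split_digits("2022"), TOY_VOCAB))
```

The model only ever sees token IDs, so the overlap between "rhyme" and "rhyming", or the place value of each digit in "2022", has to be memorised rather than read off the input.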

Google Brain’s Imagen

(Blog, paper)

Outcompetes DALL·E 2 on text-to-image COCO benchmark despite being smaller than DALL·E 2. The main change appears to be reducing the CLIP reliance in favour of a much larger and more powerful text encoder before doing the image diffusion stuff. They make a point of noting superiority on “compositionality, cardinality, spatial relations, long-form text, rare words, and challenging prompts.” The samples also show text rendering fine inside the images. The results are stunning:

What does this mean for artists? Most AI art criticism comes from a place of not realising that (almost) everyone’s favourite artists are actually curators and not manufacturers. Artists will become more like today’s art directors: people with amazing visual imagination and taste, the ability to see new things, and to help orchestrate it into existence. The net result may be a far more visual culture[4]. (See also Oxford’s report on AI and the Arts)

In the US, only art created by humans, not AI, can be copyrighted.

When DALL·E 2 can do this to any artwork, what does it mean for copyright?

DeepMind’s 1.2B-parameter Gato

(Blog, paper)

Trained on a myriad of tasks like image captioning, engaging in dialogue, stacking blocks with a real robot arm, and playing Atari games, the tiny (1.2B parameters) Gato performs 450 out of 604 tasks at over 50% expert score:

Scaling just works: just train a 1.2B-parameter Transformer on half a thousand different tasks and the scaling curve looks exactly like you’d expect:

Multi-task learning is indeed just another blessing of scale: as DeepMind notes, it used to be that learning multiple Atari games in parallel was really hard; people thought very hard and ran lots of experiments to try to create things like Popart less than 4 years ago where it was a triumph that, due to careful engineering, a single checkpoint could play just the ALE-57 games with mediocre performance.

If one had any doubts, DeepMind is now fully scale-pilled.

We live in a timeline in which the final breakthrough that precipitates AGI could plausibly be literally some one-sentence platitude about general problem-solving — Large Language Models are Zero-Shot Reasoners, Kojima et al. 2022: simply adding “Let’s think step by step” before each answer increases GPT-3 accuracy on MultiArith from 17.7% to state-of-the-art (SOTA) 78.7% and GSM8K from 10.4% to SOTA 40.7%.
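How small the intervention is becomes clear when written out. The sketch below only builds the two prompt strings and calls no model; the "Q: ... A:" layout is assumed to match the paper's template, and the function names and the sample question are hypothetical.

```python
TRIGGER = "Let's think step by step."


def standard_prompt(question):
    """Plain zero-shot prompt: the model must answer immediately."""
    return f"Q: {question}\nA:"


def zero_shot_cot_prompt(question):
    """Same prompt, with the reasoning trigger placed before the answer."""
    return f"Q: {question}\nA: {TRIGGER}"


# Hypothetical arithmetic word problem, for illustration only.
question = "A shelf holds 3 boxes with 12 pens each. 5 pens are removed. How many pens remain?"

print(standard_prompt(question))
print()
print(zero_shot_cot_prompt(question))
```

The two prompts differ by a single sentence; everything else, including the model weights, is unchanged.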

Technological Unemployment Is Already Here

Technological unemployment is a complex issue, and normal people worry about it. Economists like Bryan Caplan, Robin Hanson, Tyler Cowen, and Arnold Kling keep telling them to relax, but LLMs since GPT-3 should make them rethink. From AlphaZero, we can see that there is no chance any useful man+machine combination will work together for more than a few years, as humans soon will be only a liability. Humans need not apply.

This time is different: before, the machines began handling brute force tasks and replacing things that offered only brute force and not intelligence like horses or watermills. It’s clear that the machines are now slowly absorbing intelligence — the final province of humans. Machines switched from being complements to being substitutes in some sectors a while ago. Humans need not apply.

During the various panics and busts in the past centuries, there were huge disemployment effects as companies were forced to automate, but the people were able to switch sectors or find new jobs. The trucking industry alone employs 3% of the entire American population, and how many of those employees are skilled operations research PhDs who can easily find employment elsewhere in logistics? Imagine a kid with an IQ of 70: his Ricardian comparative advantage doesn’t guarantee there’s anything worth hiring him for (even laundry has gotten harder). Humans need not apply.

We live in a world where in some cases we would not hire someone at any price. One video of an employee spitting in a customer’s food can go viral and do more damage to a chain’s sales than that employee would earn for the chain in a hundred years. One person in an o-ring process can do an incredible amount of damage if they are only slightly subpar; to continue the NASA analogy, one loose bolt can cost $135 million, one young inexperienced technician can cost $200 million. We just have to calculate the expected value of reducing the number of such incidents by even 0.01%. Humans need not apply.

Frankfurtian Bullshit In A Post-GPT-3 World

As Sarah Constantin writes, humans who are not concentrating are not general intelligences. Unless you make an effort to read carefully, you probably cannot detect any mistakes in GPT-3’s nonfiction upon skimming — OpenAI has achieved the ability to pass the Turing test against humans on autopilot.

As Robin Hanson notes, and I don’t think he’s exaggerating, a lot of human speech is just babbling: simply linking words and sentences that are statistically likely to come one after another, not unlike GPT-3. The median student learns a set of low order correlations, but if you ask an exam question probing a deep structure answer, most students give the wrong answer. These low order correlations also seem sufficient to capture most polite conversation talk (e.g. the weather is nice, how is your mother’s illness, and damn that other political party), inspirational TED talks, and when podcast guests pontificate on topics they really don’t understand (e.g. quantum mechanics, consciousness, postmodernism, or the need always for more regulation everywhere).

What unites GPT-3 and its cousins is an unflagging enthusiasm to render whatever’s been requested, no matter how absurd or overwrought; they are fundamentally Frankfurtian bullshitters:

The liar cares about the truth and attempts to hide it; the bullshitter doesn’t care if what they say is true or false, but cares only whether the listener is persuaded.

As Venkatesh Rao notes, bullshit in this epistemological sense is not a moral failure, but an incomplete cognition mode. It corresponds to the upstream part of what Daniel Dennett called the “multiple drafts” view of consciousness. First you confabulate, then you discriminate — you free-associate to produce output that has a cosmetic coherence, and then close the truth loop somehow in a downstream discrimination step before actual output. Basically, bullshitters output indiscriminately.

All presentations of AI art include the text prompt — the viewer’s pleasure is not in the image, but in the spectacle of the computer’s interpretation. Hence AI art is a genre unto itself, and the bullshit has not found its footing as “mere” art. As Robin Sloan notes:

That’s the paradox of AI art: it leverages access to the spigot of infinity to produce a sense of scarce invention. In an overstuffed audiovisual landscape, it’s the “AI” and not the “art” that provides a reason to look at this and not that, listen to this and not that.

Just as AI art has no artistic merit until OpenAI solves “taste”, effortful thinking is still out of reach until OpenAI fully embodies complete cognition (the generation + discrimination production pipeline with a truth loop) in GPT-n.

Constantin again:

The mental motion of “I didn’t really parse that paragraph, but sure, whatever, I’ll take the author’s word for it” is, in my introspective experience, absolutely identical to “I didn’t really parse that paragraph because it was bot-generated and didn’t make any sense so I couldn’t possibly have parsed it”, except that in the first case, I assume that the error lies with me rather than the text. This is not a safe assumption in a post-GPT2 world. Instead of “default to humility” (assume that when you don’t understand a passage, the passage is true and you’re just missing something) the ideal mental action in a world full of bots is “default to null” (if you don’t understand a passage, assume you’re in the same epistemic state as if you’d never read it at all.)

The Singularity Is Nigh

In The Precipice, the definitive book on existential risks, Toby Ord ranks unaligned artificial intelligence as the greatest risk to humanity’s potential in the next century.

Ord explains the high number for such a speculative risk:

A common approach to estimating the chance of an unprecedented event with earth-shaking consequences is to take a sceptical stance: to start with an extremely small probability and only raise it from there when a large amount of hard evidence is presented. But I disagree. Instead, I think that the right method is to start with a probability that reflects our overall impressions, then adjust this in light of the scientific evidence. When there is a lot of evidence, these approaches converge. But when there isn’t, the starting point can matter.

In the case of artificial intelligence, everyone agrees the evidence and arguments are far from watertight, but the question is where does this leave us? Very roughly, my approach is to start with the overall view of the expert community that there is something like a 1 in 2 chance that AI agents capable of outperforming humans in almost every task will be developed in the coming century. And conditional on that happening, we shouldn’t be shocked if these agents that outperform us across the board were to inherit our future.

Read Gwern’s (fictional) It Looks Like You’re Trying To Take Over The World to imagine a hard takeoff scenario using solely known sorts of NN and scaling effects. Then read AGI Ruin: A List of Lethalities in which Eliezer Yudkowsky, for the first time publicly, explains what he spent the last several years doing (and he is pessimistic).

Richard Ngo on AGI safety from first principles:

  1. We’ll build AIs which are much more intelligent than humans; that is, much better than humans at using generalisable cognitive skills to understand the world.
  2. Those AGIs will be autonomous agents which pursue long-term, large-scale goals, because goal-directedness is reinforced in many training environments, and because those goals will sometimes generalise to be larger in scope.
  3. Those goals will by default be misaligned with what we want, because our desires are complex and nuanced, and our existing tools for shaping the goals of AIs are inadequate.
  4. The development of autonomous misaligned AGIs would lead to them gaining control of humanity’s future, via their superhuman intelligence, technology and coordination - depending on the speed of AI development, the transparency of AI systems, how constrained they are during deployment, and how well humans can cooperate politically and economically.

What are some concrete problems in AI safety? From Amodei et al. 2016, take a robot that cleans up messes in an office using common cleaning tools as an example:

  1. Avoiding negative side effects (e.g. ensure robot doesn’t knock over vase to clean faster without manually specifying everything it shouldn’t disturb)
  2. Avoiding reward hacking (e.g. ensure robot doesn’t disable its vision so it won’t find any mess while rewarding it for a mess-free environment)
  3. Scalable oversight (e.g. ensure robot doesn’t throw away phone but does candy wrapper without having to ask the humans every time)
  4. Safe exploration (e.g. ensure robot doesn’t put a wet mop in an electrical outlet while allowing it to experiment with mopping strategies)
  5. Robustness to distributional shift (e.g. ensure robot learns that its cleaning strategies for an office might be dangerous on a factory workfloor)

AI safety, together with the explainable ML community, is nowadays an eminently empirical field centred around understanding the kinds of models, like transformers, that seem promising, and trying to devise new ways to train them that lead to desired behaviours, for example trying to get language models to output benign completions to a given prompt. There’s more tool building, more concrete tractable problems and less theorising about arbitrarily general intelligent systems.

AI timeline

Is AI Progress Impossible To Predict? Knowing that AI has improved on a task recently gives us exactly zero predictive power for how much the next model will improve on the same task. Moore’s Law giveth, Platt’s Law taketh away: any AI forecast will forecast strong AI to be 30 years out:

Platt’s Law in blue, OLS regression line in orange; the median forecast is 25 years out.

Nevertheless, the average 50% prediction clusters around 2040-60. Holden Karnofsky’s forecast for PASTA (Process for Automating Scientific and Technological Advancement) transformative AI (see also burden of proof and where the arguments and “experts” stand): >10% probability by 2036, 50% by 2060, and 66% by 2100.

Bio Anchors assumes transformative AI would be 10x the size of the human brain; GPT-3 is only 0.1% as big.

FLOPs alone turn the wheel of history.

Metaculus:

Cotra’s report reminds Scott Alexander of an old joke:

An astronomy professor says that the sun will explode in five billion years, and sees a student visibly freaking out. She asks the student what’s so scary about the sun exploding in five billion years. The student sighs with relief: “Oh, thank God! I thought you’d said five million years!”

You can imagine the opposite joke: A professor says the sun will explode in five minutes, sees a student visibly freaking out, and repeats her claim. The student, visibly relieved: “Oh, thank God! I thought you’d said five seconds.”

In all the AGI timeline predictions, the professor is saying the sun will explode in five minutes instead of five seconds. Compared to the alternative, it’s good news. But if it makes you feel complacent, then the joke’s on you.

Conclusion

As Kaplan notes, all products for creators will have embedded intelligence from LLMs (Copilot in VSCode, DALL·E 2 in Photoshop, GPT-3 in Google Docs); these companies will need to roll their own LLMs or pay a tax to OpenAI/DeepMind/Google.

If you actually believe in AI risk, you should vote with your feet and work on AI safety. If you don’t, you should still pivot your career to work on AI as every successful tech company will use their data moats to build some variant of AGI.

All tech problems are ultimately AGI-complete.

It is up to you to immanentize the eschaton.

Additional Readings

ML Intro

ML Books

ML Courses

ML Resources

ML Research:

Replication crisis in ML research.

Transformers

Interpretability research

Implicit meta-learning

ML Scaling

Neuroscience

AI Safety

List of lists on AI safety

For updates and links on narrow AI, subscribe to my monthly newsletter (archive).

Endnotes

  1. Markov chain & n-gram models start to fall behind; they can memorize increasingly large chunks of the training corpus, but they can’t solve increasingly critical syntactic tasks like balancing parentheses or quotes, much less start to ascend from syntax to semantics. 

  2. As Hofstadter puts it, capabilities must disintegrate — if you successfully reduce “human reasoning”, it must be to un-reasoning atoms, not to little reasoning homunculi; or as Gwern likes to say: AI succeeds not when you anthropomorphize models, but unanthropomorphize humans. 

  3. DALL·E 2 results have been so stunning that some mistake the gibberish to be a secret language; we’re genuinely seeing some kind of AI astrology emerge in real time. 

  4. With 2D images solved by AI, are we following the footsteps of image/sound recording: photography -> silent movie -> movies with sound? 
