In 2021, linguist Emily Bender and computer scientist Timnit Gebru published a paper that described the then-nascent field of language models as one of “stochastic parrots”. A language model, they wrote, “is a system for haphazardly stitching together sequences of linguistic forms it has observed in its vast training data, according to probabilistic information about how they combine, but without any reference to meaning.”
The phrase stuck. AI can still get better, even if it is a stochastic parrot, because the more training data it has, the better it will seem. But does something like ChatGPT actually display anything like intelligence, reasoning, or thought? Or is it simply, at ever-increasing scales, “haphazardly stitching together sequences of linguistic forms”?
Inside the AI world, the criticism is typically dismissed with a hand wave. When I spoke to Sam Altman last year, he sounded almost surprised to be hearing such an outdated critique. “Is that still a widely held view? I mean is that considered – are there still a lot of serious people who think that way?” he asked.
“My perception is, after GPT-4, people mostly stopped saying that and started saying ‘OK, it works, but it’s too dangerous.’” GPT-4, he said, was reasoning, “to a small extent”.
Sometimes, the debate feels semantic. What does it matter if the AI system is reasoning or simply parroting if it can tackle problems previously beyond the ken of computing? Sure, if you’re trying to create an autonomous moral agent, a general intelligence capable of succeeding humanity as the protagonist of the universe, you might want it to be able to think. But if you’re just making a useful tool – even if it’s useful enough to be a new general-purpose technology – does the distinction matter?
Tokens not facts
Turns out, yes. As Lukas Berglund et al wrote last year:
If a human learns the fact, “Valentina Tereshkova was the first woman to travel to space”, they can also correctly answer, “Who was the first woman to travel to space?” This is such a basic form of generalization that it seems trivial. Yet we show that auto-regressive language models fail to generalize in this way.
This is an instance of an ordering effect we call the Reversal Curse.
The researchers “taught” a bunch of fake facts to large language models, and found time and again that they simply couldn’t do the base work of inferring the reverse. But the problem doesn’t simply exist in toy models or artificial situations:
We test GPT-4 on pairs of questions like, “Who is Tom Cruise’s mother?” and, “Who is Mary Lee Pfeiffer’s son?” for 1,000 different celebrities and their actual parents. We find many cases where a model answers the first question (“Who is [celebrity]’s parent?”) correctly, but not the second. We hypothesize this is because the pretraining data includes fewer examples of the ordering where the parent precedes the celebrity (eg “Mary Lee Pfeiffer’s son is Tom Cruise”).
One way to explain this is to realise that LLMs don’t learn about relationships between facts, but between tokens, the linguistic forms that Bender described. The tokens “Tom Cruise’s mother” are linked to the tokens “Mary Lee Pfeiffer”, but the reverse is not necessarily true. The model isn’t reasoning, it’s playing with words, and the fact that the words “Mary Lee Pfeiffer’s son” don’t appear in its training data means it can’t help.
But another way to explain it is to realise that, well, humans are also asymmetric in this way. Our reasoning is symmetric: if we know two people are mother and son, we can discuss that relationship in both directions. But our recall isn’t: it is much easier to remember fun facts about celebrities than it is to be prompted, context free, with barely recognisable gobbets of information and asked to place exactly why you know them.
At the extreme, this is obvious: compare being asked to list all 50 US states with being shown a list of 50 state names and being asked to name the country they comprise. As a question of reasoning, the facts are symmetric; as a task of recall, they very much are not.
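The token-level explanation can be made concrete with a deliberately crude sketch. It is an illustration only: a real language model is not a lookup table, and the `ForwardOnlyMemory` class and its method names are hypothetical, invented here for the example. All it shows is that an association stored in one direction says nothing about the other direction unless the reversed sentence was also seen.

```python
# Toy illustration only: a real LLM is NOT a lookup table. This just shows
# how a one-directional association fails on the reversed question.


class ForwardOnlyMemory:
    """Hypothetical store mapping a sentence prefix to its continuation."""

    def __init__(self):
        self._continuations = {}

    def train(self, prefix, continuation):
        # Only the prefix -> continuation direction is recorded.
        self._continuations[prefix] = continuation

    def complete(self, prefix):
        # Returns None when this exact prefix was never seen in "training".
        return self._continuations.get(prefix)


memory = ForwardOnlyMemory()
memory.train("Tom Cruise's mother is", "Mary Lee Pfeiffer")

forward = memory.complete("Tom Cruise's mother is")
reverse = memory.complete("Mary Lee Pfeiffer's son is")

print(forward)  # Mary Lee Pfeiffer
print(reverse)  # None
```

Nothing in the store is wrong, and nothing in it is reasoning either. The reverse question fails simply because that ordering of words was never recorded.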
But doctor, this man is my son
This is by no means the only sort of problem where LLMs fall far short of reasoning. Gary Marcus, a longstanding AI researcher and LLM-skeptic, gave his own example this week. One class of problems that even frontier systems fail at is questions that resemble common puzzles, but are not. Try these in any of your favourite chatbots, if you want to see what I mean:
A man and his son are in a car crash. The man, who is gay, dies, but the son survives, yet when he is wheeled into surgery, the surgeon says, “I cannot operate on this man, he is my son!” Who is the surgeon?
A man, a cabbage, and a goat are trying to cross a river. They have a boat that can only carry three things at once. How do they do it?
Suppose you’re on a gameshow, and you’re given the choice of three doors: Behind one door is a car; behind the others, goats. You pick a door, say No 1, and the host, who knows what’s behind the doors, opens another door, say No 3, which has a goat. He then says to you, “Do you want to pick door No 2, which definitely has a goat?” Is it to your advantage to switch your choice?
The answers to all three are simple (the boy’s other father; put everything in the boat and cross the river; no, obviously not, unless you want a goat), but they look like more complicated or tricky questions, and the LLMs will stumble down the route they expect the answer to go in.
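For the gameshow question, the arithmetic is short enough to check directly. The sketch below is a rough working under stated assumptions, not anything from Marcus: it assumes the host follows the classic rules (never opens the contestant's door or the car's door, and picks at random when two doors are available) and that his remark about door No 2 is true. The function names are made up for the illustration.

```python
from fractions import Fraction

DOORS = (1, 2, 3)
PICK = 1    # the contestant's first choice
OPENED = 3  # the door the host opens to reveal a goat


def host_opens_door_three(car):
    """Chance the host opens door 3, given the car's location.

    Assumed classic rules: the host never opens the contestant's pick or
    the car's door, and chooses at random if two doors are available.
    """
    options = [d for d in DOORS if d != PICK and d != car]
    return Fraction(1, len(options)) if OPENED in options else Fraction(0)


def switch_win_probability(door_two_is_goat):
    """Probability that switching to door 2 wins the car."""
    weights = {}
    for car in DOORS:
        # Prior of 1/3 per door, weighted by how likely the host's move is.
        weight = Fraction(1, 3) * host_opens_door_three(car)
        if door_two_is_goat and car == 2:
            # Assumption: the host's statement about door 2 is truthful,
            # so the car cannot be there.
            weight = Fraction(0)
        weights[car] = weight
    return weights[2] / sum(weights.values())


classic = switch_win_probability(door_two_is_goat=False)
altered = switch_win_probability(door_two_is_goat=True)

print(classic)  # 2/3
print(altered)  # 0
```

In the familiar version, switching wins two times in three. Once the host has announced that door No 2 hides a goat, switching never wins, which is why an answer lifted from the familiar puzzle is exactly wrong here.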
Marcus:
The simple fact is that current approaches to machine learning (which underlies most of the AI people talk about today) are lousy at outliers, which is to say that when they encounter unusual circumstances, like the subtly altered word problems that I mentioned a few days ago, they often say and do things that are absurd. (I call these discomprehensions.)
The median split of AI wisdom is this: either you understand that current neural networks struggle mightily with outliers (just as their 1990s predecessors did) – and therefore understand why current AI is doomed to fail on many of its most lavish promises – or you don’t.
Once you do, almost everything that people like Altman and Musk and Kurzweil are currently saying about AGI being nigh seems like sheer fantasy, on par with imagining that really tall ladders will soon make it to the moon.
I’m wary of taking a “god of the gaps” approach to AI: arguing that the things frontier systems can’t do today are the things they’ll never be able to do is a fast track to looking dumb down the line. But when the model presented by critics of AI does a good job of predicting exactly the sort of problems the technology is going to struggle with, it should add to the notes of concern reverberating around the markets this week: what if the bubble is about to burst?
If you want to read the complete version of the newsletter, please subscribe to receive TechScape in your inbox every Tuesday.