One fun thing about working at OpenAI is getting to play with all the shiny new AI toys before they are released. When DALL-E 2 was announced in early 2022, we convinced ourselves that taking prompt requests on Twitter and sharing the results counted as work. A baroque painting depicting the last battle of the Triassic Period? Sure thing, coming right up. I always shared the first ten results, but people insisted we must be cherry-picking: it was just too good.
Now, AI art floods the internet, and we are contending with systems that effortlessly produce the kind of outputs that, until recently, were considered the exclusive domain of human creativity. It’s not limited to the visual arts: when DeepMind’s AlphaGo system defeated top Go player Lee Sedol in 2016, commentator Michael Redmond was taken aback by AlphaGo’s now-famous move 37:
"It's a creative move… It's something that I don't think I've seen in a top player's game."
That word, creative, comes up over and over again as those of us using this technology attempt to describe what we are witnessing. But the idea of AI being creative is controversial: despite striking advances in image generation, writing, and problem-solving, there are those who still insist AI isn’t truly creative—at least not in the way humans are.
These people include my friends Ben Chugg and Vaden Masrani, ML researchers and hosts of the podcast Increments. Sometimes we like to drink wine together and argue about whether AI will end the world.
Ben and Vaden argue that while any new technology comes with risks, there is nothing uniquely concerning about AI, precisely because human creativity is special. In their view, AI will always remain a tool under human control, because it lacks the secret sauce that gives humans our edge: true creativity. I wish I could be so sure.
No True Picasso
How might we evaluate creativity? We do in fact have a method for measuring it in humans: the Torrance Tests of Creative Thinking. These assess problem-solving and divergent thinking according to four criteria: fluency, flexibility, originality, and elaboration. As it turns out, GPT-4 nails it.
Perhaps for many people, though, creativity is more of a “you know it when you see it” kind of thing, in which case these tests may not be compelling. But it was not long ago that people confidently assumed AI would never be able to write poetry or make art, and yet AI’s progress in the creative pursuits has been undeniable.
Despite these accomplishments, the goalposts keep moving: “Sure, it can generate images, but it lacks soul.” “Okay, it can write rhyming verse, but it’s not poetry.” These skeptics argue that language models are mere next-token predictors, and can do little more than regurgitate their training data. I don’t buy it: I’m pretty sure there can’t have been many Shakespearean soliloquies about the YIMBY movement in the training corpus, but GPT-4 can whip one up faster than you can say “To build, or not to build, that is the question.”
“But style transfer is not generating something truly novel!” cry the skeptics, “It’s just smashing together two existing concepts from within the dataset!” If this is the measure of true creativity, can we really be sure that humans are doing anything different? After all, the Beatles just smashed together rock and roll with Indian classical music—what is Within You Without You if not a form of style transfer?
Ok ok, let’s grant that there might be something going on beyond just style transfer. Maybe you could argue that language models simply interpolate between data points, while true creativity involves extrapolating beyond the data. But even here we run into trouble: it may sound like an empirical question, but experts still disagree over whether deep learning should be thought of as interpolating or extrapolating. Meta’s Chief AI Scientist Yann LeCun argues that actually, due to the high dimensionality of the training data, deep learning is almost always extrapolating.
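To make that claim a bit more concrete, here is a minimal sketch of the geometric version of the argument, in Python with numpy and scipy: draw some training points and some fresh test points from the same distribution, and count how many of the new points fall inside the convex hull of the training set as the dimension grows. The sample sizes and dimensions here are arbitrary illustrative choices, not anything taken from LeCun's work.

```python
import numpy as np
from scipy.optimize import linprog

def in_convex_hull(point, points):
    """Return True if `point` is a convex combination of the rows of `points`."""
    n = points.shape[0]
    # Feasibility LP: find lambda >= 0 with points.T @ lambda = point and sum(lambda) = 1.
    A_eq = np.vstack([points.T, np.ones((1, n))])
    b_eq = np.append(point, 1.0)
    result = linprog(c=np.zeros(n), A_eq=A_eq, b_eq=b_eq, bounds=(0, None), method="highs")
    return result.success

rng = np.random.default_rng(0)
n_train, n_test = 500, 100
for dim in (2, 5, 10, 20, 40):
    train = rng.standard_normal((n_train, dim))
    test = rng.standard_normal((n_test, dim))
    inside = sum(in_convex_hull(x, train) for x in test)
    print(f"dim={dim:3d}: {inside}/{n_test} new points fall inside the training hull")

# In low dimensions most new points land inside the hull ("interpolation");
# as the dimension grows, essentially none do, so by this geometric definition
# the model is "extrapolating" on almost every input it sees.
```

Whether that geometric definition captures what we mean by interpolation is exactly the kind of thing the experts disagree about, but it at least shows why the question is less clear-cut than it sounds.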
An old adage in AI known as Tesler’s Theorem states:
Intelligence is whatever machines haven't done yet.
It seems to me the same is true for the concept of creativity. If AI knocks down every new attempt to define “true creativity,” is it even a meaningful concept?
From "Can AI Be Creative?" to "What Can AI Create?"
In “Computing Machinery and Intelligence,” Turing recognizes that the question ‘can machines think?’ could easily be derailed by definitional disagreements, and dismisses it in favor of something much more concrete: Can a human evaluator tell the difference between a human and a machine in a text-only conversation?
Similarly, I have found myself getting derailed by definitions of creativity, so perhaps we should also try to propose more falsifiable tests.
In his piece arguing that ChatGPT is just a “blurry JPEG” of the web, sci-fi author Ted Chiang proposed one such test:
“[A] useful criterion for gauging a large language model’s quality might be the willingness of a company to use the text that it generates as training material for a new model”
Researchers tried this, and did find that the quality of the outputs degraded. But then other researchers found that this “model collapse” could be avoided by accumulating data over time rather than replacing it, arguably a more realistic assumption. As synthetic data techniques advance, we don’t yet know what the limits are for the creativity we can extract from the real data we have.
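To see what is at stake in the replace-versus-accumulate distinction, here is a toy simulation in Python. The “model” is just a one-dimensional Gaussian fit, so this is a cartoon of the published experiments rather than a reproduction of them, and the sample sizes and generation counts are arbitrary.

```python
# Toy picture of training on your own outputs: fit a Gaussian to data, sample
# from it, and train the next generation on those samples. "replace" keeps only
# the newest synthetic data each round; "accumulate" keeps everything.
import numpy as np

rng = np.random.default_rng(0)
real_data = rng.normal(loc=0.0, scale=1.0, size=1_000)

def final_spread(generations=500, n_synthetic=100, strategy="replace"):
    data = real_data.copy()
    for _ in range(generations):
        mu, sigma = data.mean(), data.std()             # "train" on the current data
        synthetic = rng.normal(mu, sigma, n_synthetic)  # "generate" from the model
        if strategy == "replace":
            data = synthetic
        else:  # accumulate
            data = np.concatenate([data, synthetic])
    return data.std()

print(f"replace:    spread {final_spread(strategy='replace'):.2f}   (started at ~1.00)")
print(f"accumulate: spread {final_spread(strategy='accumulate'):.2f}   (started at ~1.00)")
# Typically the "replace" run drifts well below 1.00 as the tails wash out over
# generations, while the "accumulate" run stays anchored near the original spread.
```

A language model is obviously not a one-parameter Gaussian, which is why the real question of how far synthetic data can be pushed remains open.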
Vaden has proposed another test: could an AI system (re)invent something humans invented, after being trained only on text from before that invention? For him, this would be surprising: it would indicate that AI has the kind of creative problem-solving that makes humans special, and therefore that we might be nearer to true human-level AI than he originally thought, with all the opportunities and risks that entails. I’d love to see the results of such an experiment, but I won't hold my breath—it would be hugely costly to run.
The point of figuring out now whether AI has the potential for human-like creativity is that we could prepare before we experience any truly transformative events. But if we can’t pin down any reasonable tests for it or get more concrete about what it means, we're doing no more than losing ourselves in ungrounded philosophical debates while AI continues to reshape our world.
So instead of asking "can AI be truly creative?," perhaps we should be asking "what can AI create?"
Can AI produce art that is indistinguishable from human art? The answer seems to be increasingly yes—whatever secret sauce human artwork supposedly has, it doesn’t seem to show up in tests.
Can AI make significant scientific breakthroughs? The jury’s still out, but even if AI is “only” interpolating, I suspect there are latent insights to be had simply by making connections across the literature in different fields—no human can possibly read every scientific paper, but a language model can.
Does it matter whether we call this process "creative"? Does it matter whether AI has the same kind of intentionality that a human does? Maybe what matters is not the creator, but the creation.
AI systems may soon be capable of discovering cures for cancer, creating novel pathogens, and even contributing to their own improvement, regardless of whether we label them "creative" or not. In the face of systems with such profound impacts, quibbling over definitions seems a little... uncreative.
Thanks to Emma McAleavy, Rob Tracinski, Paul Crowley, Jannik Reigl, Julius Simonelli, Kevin Kohler, Mary Hui, and several other fellow 2024 Roots of Progress writers for valuable feedback on this essay.
You mentioned that AI should be able to find insights even if it’s only interpolating, based on the massive amount of research it has analyzed. I used to think that too: that it would be quite good at making “interpolation” discoveries.
GPT-4 is good at critiquing experimental plans, so I thought it should be able to find examples of poorly designed studies in the historical record that were included in its training data. (For example, if I present it with a horribly flawed experimental plan, one that confuses correlation with causation, say, and ask for feedback, it will usually identify the flaws and suggest improvements.)
And if those studies were bad, then all the studies built on top of them would be suspect too. Kind of like this xkcd image (https://xkcd.com/2347/), except replace the project with an unblinded study of only 30 participants—all of whom are WEIRD—that was p-hacked to find some result. I think it’s likely there are more studies in the historical record like this than we’d like to admit. So the question is, which ones are load-bearing, and can GPT-4 find them?
But when you ask it something like “Find examples of foundational studies that are so poorly designed that they shouldn’t have been used, but no one knows that they’re bad studies” the results are terrible. It usually responds with studies that have Wikipedia articles that say “this study was really bad—everyone should know this by now”. Even with iterations and prodding, I’ve never been able to get anything useful out of it.
I had thought this might work, but maybe I was wrong to expect it. There are a lot of reasons it might not work. For one, this task is very different from how the model was trained: retroactively “remembering” your archive of training data and critiquing parts of it is very different from next-token prediction. Additionally, I’m using it as a quick-and-dirty RAG, where I’m not giving it the documents, I’m just asking it to recall its training data. (Not to go too far on a tangent, but this might be extra problematic because it wouldn’t surprise me if the AI companies intentionally RLHF the model to obfuscate its training data, which messes up the experiment. I do wonder if the pre-RLHF version would be better at this.)
I bring this up because it seems like an easier goalpost than a new scientific breakthrough. Instead of finding a new breakthrough, just provide a new insight into existing data. But, like I said, I’ve never been able to get this to work.
Or maybe the goalpost isn’t as easy as I’m assuming—maybe our existing mechanisms for communicating information are better than I think. Sure, no one can read as many scientific papers as an LLM, but through conferences and asking around and whatnot you can find your way to the relevant research, even if it’s cross-discipline.
Anyway, this process made me less optimistic that LLMs, trained as they currently are, will produce scientific discoveries.
*You said “AI”, not necessarily an LLM, so maybe this isn’t exactly what you were thinking of.