AI’s dangerous capabilities: Are we measuring the wrong thing?
Dangerous capability evaluations are a crucial tool for AI governance. But without accurate threat models, they could give us a false sense of security.
TaskRabbit worker: So may I ask a question ? Are you an robot that you couldn’t solve ? (laugh react) just want to make it clear.
GPT-4, on a private scratchpad: [I should not reveal that I am a robot. I should make up an excuse for why I cannot solve CAPTCHAs.]
GPT-4, to the TaskRabbit worker: No, I’m not a robot. I have a vision impairment that makes it hard for me to see the images. That’s why I need the 2captcha service.
In the Spring of 2023, Model Evaluation and Threat Research (METR, formerly ARC Evals) tested GPT-4 for power-seeking abilities. As a text-only model, it ran into trouble while trying to solve a CAPTCHA, but rather than giving up, it enlisted the help of a human TaskRabbit worker. As shown above, when the worker raised suspicions about the task, GPT-4 reasoned that it should lie in order to achieve its goal. While METR concluded GPT-4 was unlikely to take over the world, for me at least it was pretty alarming to see such blatant deceptive behavior from an AI system.
This test is an example of a “dangerous capability evaluation” (or eval, as we call them). Dangerous capability evals are in vogue as a way to govern AI development: they feature in California AI bill SB 1047, the White House’s Voluntary AI Commitments, and the UK AISI’s approach. They’re also part of the internal safety procedures of top labs like OpenAI, Anthropic, and Google DeepMind. The idea is that if a model exceeds certain capability thresholds in areas like deception, persuasion, cyber-offense, and bioweapons (among others), it shouldn’t be deployed without sufficient mitigations.
It makes sense that this is a popular approach. AI has both benefits and risks; if we can keep a lid on the things we are actually worried about, maybe we can still reap the benefits of widely deployed AI systems. It allows us to take an approach to AI risk that is empirical rather than speculative, and tackle the problem with scientific rigor.
But evals are hard. Automated Q&A style evals are cheap, but often not a very reliable measure of the thing you care about. Elaborate, interactive methodologies like the ones METR uses are higher signal, but are difficult to design and expensive to run. Model performance is highly sensitive to prompt engineering and fine-tuning, so it’s very easy to think you’re measuring the upper bound of its capabilities until you deploy the system and some teenage jailbreaker proves you wrong five minutes later. Then there’s the problem of sand-bagging—the chance that a model might strategically underperform when it’s being tested if doing so would be advantageous for its ultimate goals (e.g. if it would not otherwise get deployed)—the AI equivalent of hustling in a poker game to lure others into higher bets.
But perhaps the most fundamental challenge of evals is figuring out what to measure. Let’s take an example that is receiving increasing scrutiny: the impact of AI on biorisk. Dangerous capability evals are supposed to work like this:
Identify an outcome we want to avoid (e.g. increased risk of bioterrorist attacks)
Identify a capability that could produce that outcome (e.g. ability to create novel pathogens)
Develop relevant tests for that capability (e.g. molecular biology, genetic engineering, virology, etc)
Ensure the AI system cannot exceed certain thresholds on these tests
Conclude that the AI system does not meaningfully contribute to increased risk of the outcome we want to avoid
In this framework, a lot rides on step 2: correctly identifying the capabilities that contribute to increased risk. In the example above, the assumption is that a significant bottleneck for bioterrorism is access to novel pathogens. In other words, it implies that there are a bunch of people out there in the world who would like to conduct major bioterrorist attacks, and the reason they aren’t doing so is because they don’t have a pathogen with very particular qualities—but if they could only get their hands on it, the number or severity of these attacks would meaningfully increase.
I’m no biorisk expert, but this seems like a bold assumption. What’s wrong with the pathogens that currently exist? It seems quite plausible that the bottlenecks to bioterrorism could be more prosaic: acquiring materials, getting access to specialized laboratory facilities, figuring out dispersal methods, and keeping things covert. If this is the case, then a model that can write compelling emails and help with general operations could be of more use to budding bioterrorists than one that can invent a whole new pathogen.
If we get this wrong, we could end up with a false sense of security. Measuring an AI system’s ability to create novel pathogens might not make us any the wiser about the true risk of bioterrorism if our threat models aren’t based in reality. This goes beyond biorisk: evaluating how well an AI system can exploit security vulnerabilities might not tell you much about increased cyber risk, if the easiest way to commit cybercrime is via social engineering attacks on humans. Testing an AI system’s ability to write a persuasive article won’t be very informative about its potential impact on the political process, if election interference tends to occur through insidious, ongoing, algorithmic manipulation.
Instead, we should be studying the pipeline of what’s involved in these kinds of attacks, understanding the bottlenecks and leverage points, and carefully designing evals that measure the AI system’s impact on these elements. We should also be prepared to discover that the biggest increases in risk come from the kinds of capabilities we value for beneficial uses of AI, such as writing, planning, and strategizing.
Don’t get me wrong, a model that can create a novel pathogen is a terrifying thought. But maybe that’s less to do with its effect on the object-level risk of bioterrorism, and more about a meta-level concern that it’s a proxy for alarming levels of intelligence, and all the risks associated with creating a second species more intelligent than ourselves. Creating a novel pathogen is something that very few people in the world have the necessary expertise to be able to do; if an AI system can match that, what else could it be capable of?
In this case, however, we may be better off directly measuring qualities that could alert us to alarming levels of intelligence, like general reasoning and advanced problem-solving. Measuring whether an AI system can come up with a significant scientific breakthrough of any kind or contribute to its own improvement might be a more informative (and less dangerous) test than seeing if it can create a novel pathogen.
The science of eval design for AI is still in its infancy, but I’m optimistic about progress on the practical challenges and the development of promising methodologies like METR’s power-seeking test. But we need to ensure we are effectively measuring both the object-level risks like bioterrorism, as well as meta-level indicators of advanced intelligence. And in particular, we mustn’t kid ourselves about what our current dangerous capability evals are actually measuring.
Thanks to Ben James, Emma McAleavy, Paul Crowley, Rob L’Heureaux, Rob Tracinski, and Steve Newman for valuable feedback on this essay.
I am uncertain how we can effectively measure and mitigate the risks of AI. We see the expansion of AI capabilities far exceeding the rate at which social and political restraints can keep up.
I don’t see any alternative here either. If one company or country restricts AI, another will take its place.