You mentioned that AI should be able to find insights even if it’s only interpolating, based on the massive amount of research it has analyzed. I used to think that too: that it would be quite good at making “interpolation” discoveries.
GPT-4 is good at critiquing experimental plans, so I thought it should be able to find examples of poorly designed studies in the historical record that were included in its training data. For example, if I present it with an experimental plan that is horribly flawed (say, one that confuses correlation with causation) and ask it what it thinks, it will usually identify the flaws and suggest improvements.
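To make the kind of test I mean concrete, here’s a minimal sketch of the “critique a flawed plan” version, assuming the standard OpenAI Python client. The model name and the toy study description are illustrative placeholders, not my exact prompt.

```python
# Minimal sketch: ask a chat model to critique an obviously flawed study design.
# Assumes the standard OpenAI Python client; the model name and the toy
# study description below are placeholders, not the exact test I ran.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

flawed_plan = (
    "We surveyed 30 undergraduates at one university. Students who reported "
    "drinking coffee also reported higher grades, so we conclude that coffee "
    "improves academic performance. No blinding or control group was used."
)

response = client.chat.completions.create(
    model="gpt-4",  # placeholder; any recent chat model
    messages=[
        {"role": "system", "content": "You are a methodologist reviewing study designs."},
        {"role": "user", "content": f"Critique this experimental plan and suggest improvements:\n\n{flawed_plan}"},
    ],
)

print(response.choices[0].message.content)
```

In my experience, models handle this framing well: the flaws are right there in the prompt, so no recall from training data is required.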
And if those studies were bad, then all the studies built on top of them would be suspect too. Kind of like this xkcd image (https://xkcd.com/2347/), except replace the project with an unblinded study of only 30 participants—all of whom are WEIRD—that was p-hacked to find some result. I think it’s likely there are more studies in the historical record like this than we’d like to admit. So the question is, which ones are load-bearing, and can GPT-4 find them?
But when you ask it something like “Find examples of foundational studies that are so poorly designed that they shouldn’t have been used, but no one knows that they’re bad studies” the results are terrible. It usually responds with studies that have Wikipedia articles that say “this study was really bad—everyone should know this by now”. Even with iterations and prodding, I’ve never been able to get anything useful out of it.
I had thought this might work, but maybe I was wrong to expect it. There are a lot of reasons it might not. For one, this task is very different from how the model was trained: retroactively “remembering” its archive of training data and critiquing parts of it is not the same as next-token prediction. Additionally, I’m using it as a quick-and-dirty RAG, where I’m not giving it the documents, I’m just asking it to recall its training data. (Not to go too tangential, but this might be extra problematic because it wouldn’t surprise me if the AI companies intentionally RLHF the model to obfuscate its training data, which messes up the experiment. I do wonder if the pre-RLHF version would be better at this.)
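For what it’s worth, here’s a sketch of the contrast I’m describing: the pure-recall version (just the question, no documents) versus a grounded version where you paste in a paper’s methods section yourself. Same client assumptions as above; the `ask()` helper and the `methods_text` stand-in are hypothetical.

```python
# Sketch of the two setups: (a) pure recall from training data, (b) a crude
# "RAG" where I supply the paper's methods section myself. Same client
# assumptions as above; ask() is a hypothetical helper and methods_text is
# a stand-in for text pasted from an actual paper.
from openai import OpenAI

client = OpenAI()

def ask(prompt: str) -> str:
    """Send a single user prompt to the model and return its reply."""
    resp = client.chat.completions.create(
        model="gpt-4",  # placeholder model name
        messages=[{"role": "user", "content": prompt}],
    )
    return resp.choices[0].message.content

# (a) The recall-from-memory version that keeps failing for me.
print(ask(
    "Find examples of foundational studies that are so poorly designed that "
    "they shouldn't have been relied on, but that are not widely known to be flawed."
))

# (b) The grounded version: give it the evidence and ask it to do the critique.
methods_text = "...methods section pasted from a specific paper..."
print(ask(
    "Here is the methods section of a study. Identify design flaws "
    f"(sample size, blinding, confounds, statistical issues):\n\n{methods_text}"
))
```

Setup (b) reliably works; it’s setup (a), where the model has to surface the suspect study on its own, that I’ve never gotten anything useful out of.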
I bring this up because it seems like an easier goalpost than a new scientific breakthrough. Instead of finding a new breakthrough, just provide a new insight into existing data. But, like I said, I’ve never been able to get this to work.
Or maybe the goalpost isn’t as easy as I’m assuming—maybe our existing mechanisms for communicating information are better than I think. Sure, no one can read as many scientific papers as an LLM, but through conferences and asking around and whatnot you can find your way to the relevant research, even if it’s cross-discipline.
Anyway, this process made me less optimistic that LLMs, trained as they currently are, will produce scientific discoveries.
*You said “AI”, not necessarily an LLM, so maybe this isn’t exactly what you were thinking of.
I love this test! I think I would be surprised if current systems did a good job of this, for the reasons you describe (though I don't think this is an inherent limitation of LLMs; I could imagine future versions doing much better). I do think current systems would probably do a much better job if you gave them a paper and asked them to point out flaws in its design - that seems like a good first thing to try?
I'm also curious what you used for this test - was it just ChatGPT? I wonder if a tool like Elicit which has been optimized for scientific literature would do a better job.
Yes, I think it was with ChatGPT, although I don't remember the specific models (I've tried it a number of times with different ones). I just tried it with Elicit, and it found research about bad papers. But I want the LLM to be the one to figure out that a paper is bad, not just point me to someone who already has.