There was something wrong with psychology. A cascade of warning signs arrived all at once in 2011. Famous psychological experiments failed, over and over, when researchers redid them in their own labs. Even worse, under close scrutiny the field's standard methods turned out to be loose enough to prove just about anything. Nonsensical claims turned up in major journals. It was a moment of crisis.
The first sign that it was time for a reckoning, researchers told Live Science, was a single paper published by Cornell psychologist Daryl Bem in 2011 in the prestigious Journal of Personality and Social Psychology.
The paper discussed nine studies that Bem had conducted over the course of 10 years, eight of which appeared to show powerful evidence that human beings can perceive things they cannot see or things that have not happened yet.
His paper presented what looked like real evidence for precognition, "for basically ESP," or extrasensory perception, Sanjay Srivastava, a research psychologist at the University of Oregon, told Live Science.
For scientists who had dedicated their lives to this science and these methods, it was as if the rug had suddenly been ripped out from under them.
"With about 100 subjects in each experiment, his sample sizes were large," Slate's Daniel Engber, who has covered the crisis in psychology at length, wrote in 2017. "He'd used only the most conventional statistical analyses. He'd double- and triple-checked to make sure there were no glitches in the randomization of his stimuli. Even with all that extra care, Bem would not have dared to send in such a controversial finding had he not been able to replicate the results in his lab, and replicate them again, and then replicate them five more times. His finished paper lists nine separate ministudies of ESP. Eight of those returned the same effect."
Bem was not a fringe scientist. These were solid results, compellingly demonstrated.
"The paper appeared to be following all the rules of science, and by doing so showed something that almost everybody thought was impossible," Srivastava said. "And so when that happens you say: Okay, either the impossible really isn't impossible, like maybe ESP exists, or there's something about how we're doing science that makes it possible to prove impossible results."
In other words, this was, by all the standards available to psychology, good science.
Within months of Bem's ESP paper getting published, a trio of researchers at the University of Pennsylvania and the University of California, Berkeley published a paper in the journal Psychological Science that was in some respects even more disturbing, according to Simine Vazire, a psychologist at the University of California, Davis.
Joseph Simmons, Leif Nelson and Uri Simonsohn's "False-Positive Psychology" paper demonstrated that, as they put it, "it is unacceptably easy to publish 'statistically significant' evidence consistent with any hypothesis."
It seemed likely that many researchers working with methods they had every reason to believe in had reported results that simply weren't true. To prove it, Simmons, Nelson and Simonsohn used psychology's accepted methods to demonstrate, among other things, that listening to the Beatles song "When I'm Sixty-Four" makes people a year and a half younger. If psychology worked properly, researchers would have to accept the proposition that Paul McCartney lyrics have the power to literally shift your birth date.
"A significant thing"
Psychology isn't a science of sure things. Humans are weird, and messy, and do things for all kinds of reasons. So, when psychologists run an experiment, there's always a risk that an effect they see — whether it's ESP or, say, a tendency to get hungry when smelling hamburgers — isn't real, and is just the result of random chance.
But statistics offers a tool for measuring that risk: the P-value.
"P-value, put simply, is: If everything was just noise, if all the data were random, what are the chances I would have observed a pattern like the one I observed?" Vazire told Live Science. "What are the chances I would have seen a difference this big or bigger if it was just random data?"
If a study has a P-value of 0.01, that means that if there was no real effect, there would still be a 1 percent chance of getting a result this big or bigger — a false positive. A value of 0.20 means that even with no real effect there's still a 20-percent chance of a result at least this big.
"As a field, we've decided that if a p-value is less than 5 percent, we're going to treat it as a statistically significant thing," Vazire said.
If the P-value indicates that a result would have no more than a 5 percent chance of appearing without a real effect, it's significant enough to be worth taking seriously. That was the rule in psychology. And it seemed to work — until it didn't.
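To see what that threshold means in practice, here is a minimal Python simulation (an illustration for this article, not code from any of the researchers quoted; the function names are invented). It runs thousands of "null" experiments in which both groups are drawn from the same population, computes a conventional two-sample test for each, and counts how often chance alone produces p < 0.05:

```python
import math
import random

def two_sample_p(a, b):
    """Two-sided p-value for a difference in group means, using a
    normal-approximation z-test (reasonable for groups this large)."""
    na, nb = len(a), len(b)
    ma, mb = sum(a) / na, sum(b) / nb
    va = sum((x - ma) ** 2 for x in a) / (na - 1)
    vb = sum((x - mb) ** 2 for x in b) / (nb - 1)
    z = (ma - mb) / math.sqrt(va / na + vb / nb)
    return math.erfc(abs(z) / math.sqrt(2))

def false_positive_rate(trials=2000, n=100, seed=0):
    """Fraction of pure-noise experiments that still reach p < .05."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        a = [rng.gauss(0, 1) for _ in range(n)]  # no real effect:
        b = [rng.gauss(0, 1) for _ in range(n)]  # same population twice
        if two_sample_p(a, b) < 0.05:
            hits += 1
    return hits / trials
```

Run it and the rate hovers near 0.05: even with no real effect at all, about 1 in 20 studies crosses the significance threshold by luck, which is exactly the risk the P-value is designed to quantify.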
So, with that test in place, how was it "unacceptably easy" to come to false conclusions?
The problem, Simmons, Nelson and Simonsohn concluded, was that researchers had too many "degrees of freedom" in performing their studies. As psychologists conduct experiments, the team wrote, they make decision after decision that can bias their results in ways P-values alone can't detect.
The P-value test, Vazire said, "works as long as you only compute one P-value per study."
But that's not always how scientists worked.
"If I get a dataset with a dozen or more variables" — things like age, gender, education level or different ways of measuring results — "I can play around with it," Vazire said. "I can try different things and look at different subgroups."
Perhaps not everyone in a study group reports getting hungry when they smell hamburgers (as in the case of the imagined study from earlier). But a lot of men ages 30 to 55 do. Scientists might be able to accurately report an apparently statistically significant claim that men in that age range get hungry when they smell hamburgers, and just not mention that the effect didn't turn up in anyone else studied.
"If we're allowed to try many times, we're eventually going to get a result that looks extreme, but it's actually by chance," Vazire said.
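Vazire's point can be checked in a few lines of Python. A standard statistical fact makes the sketch easy: under the null hypothesis, a well-calibrated p-value is uniformly distributed between 0 and 1. So subgroup-hunting can be simulated directly (illustrative code with invented names, not from any of the studies discussed):

```python
import random

def any_subgroup_significant(k, rng):
    # Under the null, each subgroup test's p-value is uniform on [0, 1],
    # so each one has a 5 percent chance of landing below .05 by luck.
    return any(rng.random() < 0.05 for _ in range(k))

def familywise_rate(k=10, trials=10_000, seed=1):
    """How often at least one of k null subgroup tests looks 'significant'."""
    rng = random.Random(seed)
    return sum(any_subgroup_significant(k, rng) for _ in range(trials)) / trials
```

With 10 subgroups, the chance of at least one spurious hit is 1 - 0.95^10, or roughly 40 percent, far above the advertised 5 percent.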
And presenting this kind of cherry-picked result just wasn't considered cheating.
"It used to be common practice to look at the data collected during a study and then make decisions," Srivastava said. "Like which variable is the key test of your hypothesis, or deciding how many subjects to collect."
One way to produce a positive result out of random noise, Srivastava said, is to add subjects to a study in small batches — collect some results and see if the data offers the answers you're looking for. If not, add a bit more. Rinse and repeat until a statistically significant effect emerges, and never mention in the final paper how many nudges and checks it took to produce that result.
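This "peeking" strategy is easy to simulate, too. The sketch below (illustrative names; a one-sample test with known variance, chosen for simplicity) collects pure-noise subjects 10 at a time, tests after every batch, and stops the moment p dips below .05:

```python
import math
import random

def p_value(total, n):
    """Two-sided p-value for the mean of n standard-normal draws
    (population variance known to be 1)."""
    z = (total / n) * math.sqrt(n)
    return math.erfc(abs(z) / math.sqrt(2))

def peeking_false_positive_rate(batch=10, max_n=100, trials=2000, seed=2):
    """Fraction of null studies declared significant when the researcher
    tests after every batch and stops as soon as p < .05."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        total, n = 0.0, 0
        while n < max_n:
            total += sum(rng.gauss(0, 1) for _ in range(batch))
            n += batch
            if p_value(total, n) < 0.05:  # peek, and stop if "significant"
                hits += 1
                break
    return hits / trials
```

With 10 looks at the data, the false-positive rate in this simulation lands near 20 percent, roughly four times the nominal rate, even though every individual test used the standard .05 cutoff.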
In these instances, most psychologists likely weren't trying to find false positives. But they were human beings who wanted positive results, and too often, they made decisions that got them there.
What was planned, and what wasn't?
Once it became clear that the normal ways of doing psychology weren't working, the question was what to do about it.
"I talked a lot about sample size in the beginning, and how we need larger samples," Vazire said.
It's a lot more difficult to fudge the results, whether intentionally or unintentionally, in an experiment performed on 2,000 people than in a study of 20 people, for example.
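One way to see why, sketched in Python (illustrative code, not from any study mentioned here, and using per-group sizes as an assumption): the typical effect size that pure noise produces shrinks roughly with the square root of the sample size, so a 20-person study can hand a researcher an impressively large fluke while a 2,000-person study almost never will.

```python
import random
import statistics

def typical_null_effect(n, trials=500, seed=3):
    """Median absolute difference in group means across pure-noise
    two-group studies with n subjects per group (population sd is 1,
    so this is roughly a standardized effect size)."""
    rng = random.Random(seed)
    diffs = []
    for _ in range(trials):
        a = [rng.gauss(0, 1) for _ in range(n)]
        b = [rng.gauss(0, 1) for _ in range(n)]
        diffs.append(abs(statistics.mean(a) - statistics.mean(b)))
    return statistics.median(diffs)
```

With n = 20 per group, the median chance "effect" is around 0.2 standard deviations, a size many psychologists would take seriously; with n = 2,000 it is roughly ten times smaller.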
"That was kind of the first big push in psychology among people pushing for reform, but eventually it shifted more to transparency," she said.
And there's where the real pushback began.
"I would say there's pretty good consensus in psychology that we should make our data publicly available whenever possible, and that we should make our materials and procedures and code — [necessary] to replicate our studies — publicly available," she said.
But increasingly, reformist psychologists — including both Srivastava and Vazire — started pushing for another solution, borrowed from clinical trials in the pharmaceutical industry: preregistration.
"Preregistration I see as another branch of transparency to let others verify what was planned and what wasn't," Vazire said.
It's a forcing mechanism designed to limit those degrees of freedom Simmons, Nelson and Simonsohn worried about.
"Preregistration means that before you collect data for a study, you write down a plan of what you're going to do," Srivastava said. "You identify all the things you might have to make decisions about along the way, and you make these decisions in advance."
These decisions include things like what variables psychologists will analyze, how many subjects they'll include, how they'll exclude bad subjects — all that gets written down in advance and published somewhere with a time stamp so that other researchers can go back and check it.
The idea is that, without too many degrees of freedom, researchers won't find themselves drifting toward false-positive results.
"Science in chains"
But not everyone loves the idea.
"There's definitely a generational difference," Srivastava said. "When I talk to younger graduate students and early-career people, it often seems like it just makes sense to them."
That's a highly visible, activist group — preregistration is a hot topic in the online psychology community — and due in part to that activism, the practice has made significant inroads. (The prominent journal Psychological Science now encourages preregistration, for example.) But preregistration advocates aren't the clear center of power in psychology, and their efforts have encountered some significant pushback.
Often, that pushback is unofficial. The controversy appears a lot more heated on Twitter and around psych-department water coolers than in the pages of journals. Not many researchers have publicly staked out anti-preregistration positions.
But preregistration isn't without its prominent opponents. Sophie Scott, a neuroscientist at University College London and an expert in the mental processes of speech, wrote a column for Times Higher Education in 2013 titled "Pre-registration would put science in chains," arguing that the practice "must be resisted."
"Limiting more speculative aspects of data interpretation risks making papers more one-dimensional in perspective," she wrote, adding that "the requirement to refine studies and their interpretation prior to data collection would prevent us from learning from our mistakes along the way."
Scott also argued that preregistration gives too much credit to a narrow kind of scientific work: hypothesis testing. Not all scientists work by figuring out in advance what questions they want to answer, she wrote, so preregistration would kill exploratory research.
Vazire acknowledged the concern that preregistration would limit researchers' ability to detect unexpected effects.
But, she said, "Many of us who push for preregistration say that's not true. You can. All you want. You just have to be honest about the fact that you're exploring and this was not planned."
Exploratory research, she said, can still be "super exciting and worth publishing," but researchers should be less confident in its results. "The part of that criticism that is true and I think we need to be really, really clear about is that I will be less confident in that result," Vazire said.
"Almost everything I do is exploratory," she said. "I'm just now very upfront about the fact that this is a hypothesis that still needs to be tested and no conclusions should be drawn yet from it."
"Scientists are human beings"
Advocates of preregistration are quick to acknowledge that it's not a cure-all for the diseases of psychological science.
In 2011, the same year the ESP and false-positives papers came out, Dutch psychologist Diederik Stapel — whose work had shaped the field of social psychology — was suspended from Tilburg University for fabricating data in "dozens of studies," according to New Scientist. It was another significant blow, but of a different kind than the one for Bem, who seemed to really believe his results demonstrated ESP.
"Preregistration is not a good check against fraud," Srivastava said. "It's a good check against well-intentioned mistakes and a check against ordinary human biases and habits."
And, as Vazire pointed out, it's possible to preregister a study incompletely or incorrectly, such that the research still has far too many degrees of freedom. There are already examples of "preregistered" studies that reformists have criticized for lax and incomplete registration efforts.
"That means instead of just saying 'this study was preregistered [link]' the main article shd say specifically *what* was preregistered (exclusion rules, scoring, transforms, etc.) and the results section needs to structurally distinguish between prereg and non-prereg analyses," Srivastava wrote on Twitter (@hardsci) on February 27, 2018.
For now, Srivastava said, the project for reformers is to continue to make the argument for preregistration as a route out of psychology's crisis, and convince their colleagues to follow along.
"One universal is that scientists are human beings," Srivastava said, "and human beings have biases and we have incentives and all these other things we have to check against."
Originally published on Live Science.