David Freedman has a very interesting article in The Atlantic that has generated a lot of justified interest in the blogosphere. Freedman describes how hard it is to develop reliable medical knowledge, saying of medical scientific studies that “you have to wonder whether they prove anything at all. Indeed, given the breadth of the potential problems raised at the meeting, can any medical-research studies be trusted?”
The article talks a whole lot about the role that researcher bias, skewed incentives to gain funding, and so forth play in making even the conclusions of large-scale randomized controlled trials (RCTs) suspect. The hero of the article, celebrated medical meta-researcher Dr. John Ioannidis, is quoted as saying:
“The studies were biased,” he says. “Sometimes they were overtly biased. Sometimes it was difficult to see the bias, but it was there.” Researchers headed into their studies wanting certain results—and, lo and behold, they were getting them. We think of the scientific process as being objective, rigorous, and even ruthless in separating out what is true from what we merely wish to be true, but in fact it’s easy to manipulate results, even unintentionally or unconsciously. “At every step in the process, there is room to distort results, a way to make a stronger claim or to select what is going to be concluded,” says Ioannidis. “There is an intellectual conflict of interest that pressures researchers to find whatever it is that is most likely to get them funded.”
This kind of thing happens, as I have often written. But as always, the question is, “Compared to what?” That is, does our imperfect scientific knowledge of medicine allow us to make better decisions than we would make in the absence of this information? After all, medical research is not the only scientific field in which funding pressure creates researcher biases, yet we still seem to be able to build functioning airplanes and mobile phones.
The author paints with too broad a brush in answering this question. What’s so striking to me about the facts asserted in the article is that, even if we accept the author’s claims, certain identifiable categories of medical knowledge seem quite reliable, while others seem worse than useless.
First, consider the differences by research methodology. The article claims that 80 percent (!) of non-randomized studies turn out to be wrong, as compared to “as much as” 10 percent of large randomized trials. Sufficiently large randomized experiments, as I have often argued, are not some nerdy nice-to-have when evaluating theories in a sufficiently complex environment; they are a requirement. Being wrong 80 percent of the time is literally worse than just flipping a coin. On the other hand, if I had a series of ailments, and had to make a series of decisions about treatments for them, I would happily rely on a method that was right at least 90 percent of the time in preference to relying on some combination of my intuition, what my brother-in-law experienced, and what I discovered on Google. Well-executed RCTs really do create useful scientific knowledge.
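To make the comparison concrete, here is a rough back-of-the-envelope sketch of how many treatment decisions each method would be expected to get right out of ten. It takes the article's figures at face value and assumes, purely for illustration, that every decision is a simple right-or-wrong call and that the decisions are independent; the function name and the table of methods are my own, not anything from the article.

```python
# Assumed accuracy of each way of choosing a treatment, as a percentage.
# The first and last entries use the figures quoted in the article; the
# coin flip is the 50 percent baseline for a yes/no decision.
ACCURACY_PERCENT = {
    "non-randomized study": 20,    # article: 80 percent turn out to be wrong
    "coin flip": 50,
    "large randomized trial": 90,  # article: "as much as" 10 percent wrong
}


def expected_correct(decisions: int, accuracy_percent: int) -> float:
    """Expected number of correct calls out of `decisions` independent ones."""
    return decisions * accuracy_percent / 100


if __name__ == "__main__":
    for method, pct in ACCURACY_PERCENT.items():
        print(f"{method}: {expected_correct(10, pct):.0f} of 10 expected correct")
```

Under those assumptions, the non-randomized studies come out behind the coin, and the large randomized trials come out well ahead of it.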
Second, consider the examples of subsequently refuted findings that are identified in the article. Here is a representative paragraph:
“Of course, medical-science “never minds” are hardly secret. And they sometimes make headlines, as when in recent years large studies or growing consensuses of researchers concluded that mammograms, colonoscopies, and PSA tests are far less useful cancer-detection tools than we had been told; or when widely prescribed antidepressants such as Prozac, Zoloft, and Paxil were revealed to be no more effective than a placebo for most cases of depression; or when we learned that staying out of the sun entirely can actually increase cancer risks; or when we were told that the advice to drink lots of water during intense exercise was potentially fatal; or when, last April, we were informed that taking fish oil, exercising, and doing puzzles doesn’t really help fend off Alzheimer’s disease, as long claimed. Peer-reviewed studies have come to opposite conclusions on whether using cell phones can cause brain cancer, whether sleeping more than eight hours a night is healthful or dangerous, whether taking aspirin every day is more likely to save your life or cut it short, and whether routine angioplasty works better than pills to unclog heart arteries.”
Do you see a common theme here? Though not exclusively, these tend very strongly to be long-term, behaviorally oriented interventions. Consider the classical therapies that were evaluated during the heroic phase of clinical trials in the mid-20th century, things like the streptomycin trials in Britain or polio vaccines in the U.S. These situations are characterized by acute conditions that cause death or obvious loss of function within a short period, and that are addressed by treatments that apply a chemical to the body. As we shade from this kind of problem to those characterized by conditions that affect people over many years, often in subjective ways, and that are addressed by lifestyle changes or daily dosages of vitamins and so on, we are shading from medicine as classically conceived to something that is analytically much more like social science. This latter end of the spectrum is where the most common and severe problems with reliably determining the causal effectiveness of interventions arise. This is not necessarily because researchers in these areas are less honest than those in other fields, but because the problem is inherently harder. Among other issues, the signal-to-noise ratio is worse, the relevant measurement period becomes years and decades rather than weeks and months, and the causal mechanism often becomes subtly entangled with many lifestyle behaviors. In such situations, RCTs are often impractical, and when they can be done, the integrated complexity of the causal mechanisms means that replications are much more likely to fail because unobserved differences in context turn out to be relevant in determining the success or failure of the treatment. The problem isn’t always the researchers; sometimes the problem is the problem.