An Observational Study Is Not An Experiment: Cautions In Research Interpretation

September 20, 2007

One of the most important long-term issues for the industry is that of health claims for produce. We have a Goldilocks variety of claims: a general claim that consumption of produce is healthy; a narrower claim that certain categories of produce offer health benefits, which is what “5-a-Day the Color Way” was based on; and hundreds of specific health claims for individual items related to specific maladies.

The research on much of this has never been as strong as we would like, and we have had numerous discussions on the subject, including a piece entitled, More Matters And The Need For Supporting Research. This article included an important letter from Elizabeth Pivonka, President & CEO of the Produce for Better Health Foundation.

Part of the frustration is what Elizabeth calls the “study of the day” mentality, by which every day “new research” points to “new benefits” or “new hazards” to be found from eating one item or another.

Yet an awful lot of this “research” is not based on an experimental trial at all; it is based on a different kind of study, typically called a “prospective” or “cohort” study. The New York Times Sunday Magazine ran an intriguing story entitled Do We Really Know What Makes Us Healthy? that deals with the problematic nature of this kind of research, on which most health claims are based:

At the center of the story is the science of epidemiology itself and, in particular, a kind of study known as a prospective or cohort study, of which the Nurses’ Health Study is among the most renowned. In these studies, the investigators monitor disease rates and lifestyle factors (diet, physical activity, prescription drug use, exposure to pollutants, etc.) in or between large populations (the 122,000 nurses of the Nurses’ study, for example). They then try to infer conclusions — i.e., hypotheses — about what caused the disease variations observed. Because these studies can generate an enormous number of speculations about the causes or prevention of chronic diseases, they provide the fodder for much of the health news that appears in the media — from the potential benefits of fish oil, fruits and vegetables to the supposed dangers of sedentary lives, trans fats and electromagnetic fields. Because these studies often provide the only available evidence outside the laboratory on critical issues of our well-being, they have come to play a significant role in generating public-health recommendations as well.

The piece points to a real problem with relying on these types of studies to generate public health recommendations:

The catch with observational studies like the Nurses’ Health Study, no matter how well designed and how many tens of thousands of subjects they might include, is that they have a fundamental limitation. They can distinguish associations between two events — that women who take H.R.T. (Hormone Replacement Therapy) have less heart disease, for instance, than women who don’t. But they cannot inherently determine causation — the conclusion that one event causes the other; that H.R.T. (Hormone Replacement Therapy) protects against heart disease. As a result, observational studies can only provide what researchers call hypothesis-generating evidence — what a defense attorney would call circumstantial evidence.

The solution? An actual experiment:

Testing these hypotheses in any definitive way requires a randomized-controlled trial — an experiment, not an observational study — and these clinical trials typically provide the flop to the flip-flop rhythm of medical wisdom.

The “flop” refers to the fact that, so often, today’s research “finding” is ultimately reversed upon further study:

Many explanations have been offered to make sense of the here-today-gone-tomorrow nature of medical wisdom — what we are advised with confidence one year is reversed the next — but the simplest one is that it is the natural rhythm of science. An observation leads to a hypothesis. The hypothesis (last year’s advice) is tested, and it fails this year’s test, which is always the most likely outcome in any scientific endeavor. There are, after all, an infinite number of wrong hypotheses for every right one, and so the odds are always against any particular hypothesis being true, no matter how obvious or vitally important it might seem.

What about all the “studies” we read that always point to “lifestyle” as a crucial variable in human health?

Indeed, if you ask the more skeptical epidemiologists in the field what diet and lifestyle factors have been convincingly established as causes of common chronic diseases based on observational studies without clinical trials, you’ll get a very short list: smoking as a cause of lung cancer and cardiovascular disease, sun exposure for skin cancer, sexual activity to spread the papilloma virus that causes cervical cancer and perhaps alcohol for a few different cancers as well.

Richard Peto, professor of medical statistics and epidemiology at Oxford University, phrases the nature of the conflict this way: “Epidemiology is so beautiful and provides such an important perspective on human life and death, but an incredible amount of rubbish is published,” by which he means the results of observational studies that appear daily in the news media and often become the basis of public-health recommendations about what we should or should not do to promote our continued good health.

In January 2001, the British epidemiologists George Davey Smith and Shah Ebrahim, co-editors of The International Journal of Epidemiology, discussed this issue in an editorial titled “Epidemiology — Is It Time to Call It a Day?” They noted that those few times that a randomized trial had been financed to test a hypothesis supported by results from these large observational studies, the hypothesis either failed the test or, at the very least, the test failed to confirm the hypothesis: antioxidants like vitamins E and C and beta carotene did not prevent heart disease, nor did eating copious fiber protect against colon cancer.

It seems that even the best of these “cohort” studies are questionable:

The Nurses’ Health Study is the most influential of these cohort studies, and in the six years since the Davey Smith and Ebrahim editorial, a series of new trials have chipped away at its credibility. The Women’s Health Initiative hormone-therapy trial failed to confirm the proposition that H.R.T. (Hormone Replacement Therapy) prevented heart disease; a W.H.I. (Women’s Health Initiative) diet trial with 49,000 women failed to confirm the notion that fruits and vegetables protected against heart disease; a 40,000-woman trial failed to confirm that a daily regimen of low-dose aspirin prevented colorectal cancer and heart attacks in women under 65. And this June, yet another clinical trial — this one of 1,000 men and women with a high risk of colon cancer — contradicted the inference from the Nurses’ study that folic acid supplements reduced the risk of colon cancer. Rather, if anything, they appear to increase risk.

The implication of this track record seems hard to avoid. “Even the Nurses’ Health Study, one of the biggest and best of these studies, cannot be used to reliably test small-to-moderate risks or benefits,” says Charles Hennekens, a principal investigator with the Nurses’ study from 1976 to 2001. “None of them can.”

The problem is that actual trials are rarely done and still have many limitations, so we are left with observational studies:

Understanding how we got into this situation is the simple part of the story. The randomized-controlled trials needed to ascertain reliable knowledge about long-term risks and benefits of a drug, lifestyle factor or aspect of our diet are inordinately expensive and time consuming. By randomly assigning research subjects into an intervention group (who take a particular pill or eat a particular diet) or a placebo group, these trials “control” for all other possible variables, both known and unknown, that might affect the outcome: the relative health or wealth of the subjects, for instance. This is why randomized trials, particularly those known as placebo-controlled, double-blind trials, are typically considered the gold standard for establishing reliable knowledge about whether a drug, surgical intervention or diet is really safe and effective.
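
To see why random assignment “controls” for other variables, here is a minimal sketch of our own (it is not from the Times piece). Every number in it is a hypothetical assumption: 40 percent of subjects are college educated, and in the observational setting the educated choose the treatment 60 percent of the time versus 20 percent for everyone else. The names `simulate` and `education_gap` are made up for this illustration.

```python
import random

def simulate(n=20000, seed=0):
    """Return three parallel lists of booleans for n hypothetical subjects."""
    rng = random.Random(seed)
    # Assumption: 40% of subjects are college educated.
    educated = [rng.random() < 0.4 for _ in range(n)]
    # Observational setting: subjects pick for themselves, and the
    # educated pick the treatment far more often (assumed 60% vs. 20%).
    chose_treatment = [rng.random() < (0.6 if e else 0.2) for e in educated]
    # Randomized trial: a coin flip decides, ignoring every trait.
    assigned_treatment = [rng.random() < 0.5 for _ in range(n)]
    return educated, chose_treatment, assigned_treatment

def education_gap(educated, treated):
    """Share of educated subjects among the treated minus the share among the untreated."""
    in_treated = [e for e, t in zip(educated, treated) if t]
    in_untreated = [e for e, t in zip(educated, treated) if not t]
    return sum(in_treated) / len(in_treated) - sum(in_untreated) / len(in_untreated)

educated, chose, assigned = simulate()
observational_gap = education_gap(educated, chose)
randomized_gap = education_gap(educated, assigned)
print(f"Education gap when subjects choose:   {observational_gap:+.3f}")
print(f"Education gap when a coin flip picks: {randomized_gap:+.3f}")
```

Under these assumptions the self-selected groups differ sharply in education, while the coin-flip groups are nearly identical. The same balancing applies to traits nobody thought to measure, which is the point the passage makes about “known and unknown” variables.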

But clinical trials also have limitations beyond their exorbitant costs and the years or decades it takes them to provide meaningful results. They can rarely be used, for instance, to study suspected harmful effects. Randomly subjecting thousands of individuals to secondhand tobacco smoke, pollutants or potentially noxious trans fats presents obvious ethical dilemmas. And even when these trials are done to study the benefits of a particular intervention, it’s rarely clear how the results apply to the public at large or to any specific patient. Clinical trials invariably enroll subjects who are relatively healthy, who are motivated to volunteer and will show up regularly for treatments and checkups. As a result, randomized trials “are very good for showing that a drug does what the pharmaceutical company says it does,” David Atkins, a preventive-medicine specialist at the Agency for Healthcare Research and Quality, says, “but not very good for telling you how big the benefit really is and what are the harms in typical people. Because they don’t enroll typical people.”

These limitations mean that the job of establishing the long-term and relatively rare risks of drug therapies has fallen to observational studies, as has the job of determining the risks and benefits of virtually all factors of diet and lifestyle that might be related to chronic diseases. The former has been a fruitful field of research; many side effects of drugs have been discovered by these observational studies. The latter is the primary point of contention.

But the “observational” studies are much more useful for discovering large risks and benefits:

Smoking and lung cancer is the emblematic success story of chronic-disease epidemiology. But lung cancer was a rare disease before cigarettes became widespread, and the association between smoking and lung cancer was striking: heavy smokers had 2,000 to 3,000 percent the risk of those who had never smoked. This made smoking a “turkey shoot,” says Greenland of U.C.L.A., compared with the associations epidemiologists have struggled with ever since, which fall into the tens of a percent range.
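
The gap between those two ranges is easier to appreciate as risk ratios. The arithmetic below is our own sketch, not the article’s. It reads “2,000 to 3,000 percent the risk” as 20 to 30 times the baseline, and it assumes that “tens of a percent” means increases of roughly 10 to 50 percent; that second range is our interpretation.

```python
def ratio_from_percent_of_baseline(percent):
    """'X percent the risk' of the comparison group, expressed as a risk ratio."""
    return percent / 100.0

def ratio_from_percent_increase(percent):
    """'X percent higher risk' than the comparison group, expressed as a risk ratio."""
    return 1.0 + percent / 100.0

# Figures quoted in the article for heavy smokers versus never-smokers.
smoking_ratios = [ratio_from_percent_of_baseline(p) for p in (2000, 3000)]

# Assumed examples of "tens of a percent" associations.
modest_ratios = [ratio_from_percent_increase(p) for p in (10, 30, 50)]

print("Smoking risk ratios:", [round(r, 2) for r in smoking_ratios])
print("Modest risk ratios: ", [round(r, 2) for r in modest_ratios])
```

A 20-fold or 30-fold risk is hard to explain away by unmeasured differences between groups. A ratio of 1.1 to 1.5 is small enough that modest differences in wealth, education or habits could produce it on their own.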

There is a basic conflict between what scientists actually know and the way public health advocates want to use the information:

From a scientific perspective, epidemiologic studies may be incapable of distinguishing a small effect from no effect at all, and so caution dictates that the scientist refrain from making any claims in that situation. From the public-health perspective, a small effect can be a very dangerous or beneficial thing, at least when aggregated over an entire nation, and so caution dictates that action be taken, even if that small effect might not be real. Hence the public-health logic that it’s better to err on the side of prudence even if it means persuading us all to engage in an activity, eat a food or take a pill that does nothing for us and ignoring, for the moment, the possibility that such an action could have unforeseen harmful consequences.

A lot of the article focuses on Hormone Replacement Therapy, or H.R.T., and a close look at the observational studies points out the problem:

In 1987, Diana Petitti, an epidemiologist now at the University of Southern California, reported that she, too, had detected a reduced risk of heart-disease deaths among women taking H.R.T. in the Walnut Creek Study, a population of 16,500 women. When Petitti looked at all the data, however, she “found an even more dramatic reduction in death from homicide, suicide and accidents.” With little reason to believe that estrogen would ward off homicides or accidents, Petitti concluded that something else appeared to be “confounding” the association she had observed. “The same thing causing this obvious spurious association might also be contributing to the lower risk of coronary heart disease,” Petitti says today.

This makes “healthy user bias” a hot area of study, and getting to the bottom of it is crucial:

At its simplest, the problem is that people who faithfully engage in activities that are good for them — taking a drug as prescribed, for instance, or eating what they believe is a healthy diet — are fundamentally different from those who don’t. One thing epidemiologists have established with certainty, for example, is that women who take H.R.T. differ from those who don’t in many ways, virtually all of which associate with lower heart-disease risk: they’re thinner; they have fewer risk factors for heart disease to begin with; they tend to be more educated and wealthier; to exercise more; and to be generally more health conscious.

So the problem is that so many factors are involved that pulling out one factor, such as taking Hormone Replacement Therapy or eating a diet rich in fruits and vegetables, and declaring it to be the “cause” of something is quite questionable:

In one large population studied by Elizabeth Barrett-Connor, an epidemiologist at the University of California, San Diego, having gone to college was associated with a 50 percent lower risk of heart disease. So if women who take H.R.T. tend to be more educated than women who don’t, this confounds the association between hormone therapy and heart disease. It can give the appearance of cause and effect where none exists.
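
A small worked example, ours rather than the article’s, shows how this can happen. The only figure taken from the passage is the 50 percent lower risk among the college educated. Everything else is a hypothetical assumption: half the women went to college, baseline risk is 10 percent without college, and H.R.T. uptake is 60 percent among the educated versus 20 percent among everyone else. By construction, H.R.T. does nothing at all here.

```python
# Hypothetical population; only the "50 percent lower" figure comes from the article.
SHARE = {"college": 0.5, "no_college": 0.5}          # assumed population split
HRT_UPTAKE = {"college": 0.6, "no_college": 0.2}     # assumed uptake rates

def risk(group, takes_hrt):
    """Heart-disease risk. H.R.T. has NO effect by construction."""
    base = {"college": 0.05, "no_college": 0.10}     # college: 50 percent lower
    return base[group]

def crude_risk(takes_hrt):
    """Risk among all H.R.T. users (or all non-users), ignoring education."""
    cases = 0.0
    people = 0.0
    for group, share in SHARE.items():
        uptake = HRT_UPTAKE[group] if takes_hrt else 1.0 - HRT_UPTAKE[group]
        people += share * uptake
        cases += share * uptake * risk(group, takes_hrt)
    return cases / people

crude_ratio = crude_risk(True) / crude_risk(False)
within_group = {g: risk(g, True) / risk(g, False) for g in SHARE}

print(f"Crude risk ratio, users vs. non-users: {crude_ratio:.2f}")
print("Risk ratio within each education group:", within_group)
```

With these made-up inputs the crude comparison suggests H.R.T. users have about three-quarters the heart-disease risk of non-users, yet within each education group the ratio is exactly 1. The apparent benefit comes entirely from who chooses to take the therapy.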

Efforts can be made to adjust for the effect, but it is not easy and the results are not certain:

George Davey Smith, who began his career studying how socioeconomic status associates with health, says one thing this research teaches is that misfortunes “cluster” together. Poverty is a misfortune, and the poor are less educated than the wealthy; they smoke more and weigh more; they’re more likely to have hypertension and other heart-disease risk factors, to eat what’s affordable rather than what the experts tell them is healthful, to have poor medical care and to live in environments with more pollutants, noise and stress. Ideally, epidemiologists will carefully measure the wealth and education of their subjects and then use statistical methods to adjust for the effect of these influences — multiple regression analysis, for instance, as one such method is called — but, as Avorn says, it “doesn’t always work as well as we’d like it to.”
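
One reason adjustment can fall short is that the factor being adjusted for is itself measured imperfectly. The sketch below is our own illustration under loudly hypothetical assumptions. It uses simple stratification rather than the multiple regression named in the passage, because stratification is easier to follow by hand. We assume a true “advantaged” trait that halves risk and raises uptake of a treatment that does nothing, and a recorded version of that trait that is right only some of the time. The function `stratum_risk_ratio` and all figures are invented for the example.

```python
# Hypothetical inputs: the treatment has no effect; only the trait matters.
P_ADVANTAGED = 0.5
RISK = {True: 0.05, False: 0.10}      # risk by TRUE trait (assumed)
UPTAKE = {True: 0.6, False: 0.2}      # treatment uptake by TRUE trait (assumed)

def stratum_risk_ratio(recorded_trait, accuracy):
    """Treated vs. untreated risk ratio among subjects whose RECORDED trait
    equals recorded_trait, when the record matches the truth with the
    given probability."""
    def risk_among(treated):
        cases = 0.0
        people = 0.0
        for true_trait in (True, False):
            p_true = P_ADVANTAGED if true_trait else 1.0 - P_ADVANTAGED
            p_record = accuracy if true_trait == recorded_trait else 1.0 - accuracy
            p_treat = UPTAKE[true_trait] if treated else 1.0 - UPTAKE[true_trait]
            weight = p_true * p_record * p_treat
            people += weight
            cases += weight * RISK[true_trait]
        return cases / people
    return risk_among(True) / risk_among(False)

perfect = stratum_risk_ratio(True, accuracy=1.0)
noisy = stratum_risk_ratio(True, accuracy=0.8)
print(f"Adjusted ratio, trait recorded perfectly:   {perfect:.3f}")
print(f"Adjusted ratio, trait recorded 80% correct: {noisy:.3f}")
```

When the trait is recorded perfectly, stratifying removes the spurious benefit and the ratio returns to 1. With a record that is right 80 percent of the time, the adjusted ratio stays below 1 even though the treatment does nothing, which is one way adjustment can fail to work as well as we would like.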

These types of studies depend on getting a lot of things right:

…one conspicuous limitation of all epidemiology is the difficulty of reliably assessing whatever it is the investigators are studying: not only determining whether or not subjects have really taken a medication or consumed the diet that they reported, but whether their subsequent diseases were correctly diagnosed. “The wonder and horror of epidemiology,” Avorn says, “is that it’s not enough to just measure one thing very accurately. To get the right answer, you may have to measure a great many things very accurately.”

So where does all this leave us?

So how should we respond the next time we’re asked to believe that an association implies a cause and effect, that some medication or some facet of our diet or lifestyle is either killing us or making us healthier? We can fall back on several guiding principles, these skeptical epidemiologists say. One is to assume that the first report of an association is incorrect or meaningless, no matter how big that association might be. After all, it’s the first claim in any scientific endeavor that is most likely to be wrong. Only after that report is made public will the authors have the opportunity to be informed by their peers of all the many ways that they might have simply misinterpreted what they saw. The regrettable reality, of course, is that it’s this first report that is most newsworthy. So be skeptical.

If the association appears consistently in study after study, population after population, but is small — in the range of tens of percent — then doubt it. For the individual, such small associations, even if real, will have only minor effects or no effect on overall health or risk of disease. They can have enormous public-health implications, but they’re also small enough to be treated with suspicion until a clinical trial demonstrates their validity.

If the association involves some aspect of human behavior, which is, of course, the case with the great majority of the epidemiology that attracts our attention, then question its validity. If taking a pill, eating a diet or living in proximity to some potentially noxious aspect of the environment is associated with a particular risk of disease, then other factors of socioeconomic status, education, medical care and the whole gamut of healthy-user effects are as well. These will make the association, for all practical purposes, impossible to interpret reliably.

The exception to this rule is unexpected harm, what Avorn calls “bolt from the blue events,” that no one, not the epidemiologists, the subjects or their physicians, could possibly have seen coming — higher rates of vaginal cancer, for example, among the children of women taking the drug DES to prevent miscarriage, or mesothelioma among workers exposed to asbestos. If the subjects are exposing themselves to a particular pill or a vitamin or eating a diet with the goal of promoting health, and, lo and behold, it has no effect or a negative effect — it’s associated with an increased risk of some disorder, rather than a decreased risk — then that’s a bad sign and worthy of our consideration, if not some anxiety. Since healthy-user effects in these cases work toward reducing the association with disease, their failure to do so implies something unexpected is at work.

All of this suggests that the best advice is to keep in mind the law of unintended consequences. The reason clinicians test drugs with randomized trials is to establish whether the hoped-for benefits are real and, if so, whether there are unforeseen side effects that may outweigh the benefits. If the implication of an epidemiologist’s study is that some drug or diet will bring us improved prosperity and health, then wonder about the unforeseen consequences. In these cases, it’s never a bad idea to remain skeptical until somebody spends the time and the money to do a randomized trial and, contrary to much of the history of the endeavor to date, fails to refute it.

It is an intriguing piece, rich with implications for the way we interpret the studies we see published every day of the week. Read the entire piece here.