How Medical Facts Are Developed:
Why Some Are More Potent Than Others

Rodger Pirnie Doyle

Medical scientists view the reality of health and disease in three ways. The most ancient, which was used by Egyptian physicians more than four thousand years ago, is the systematic observation of those who are sick. Through the intensive study of the course of illness in their patients, physicians through the ages have gained knowledge of disease and developed ways of treating it. This simple type of research, the case study, is the foundation of the healing arts and the inspiration for most of the new concepts in medical science.

Beginning in the late nineteenth century, a second type of research, the laboratory experiment, provided a powerful means of testing the concepts that physicians derived from their observations of patients. Through the study of animals, of living tissues, of cells, and of disease-causing agents, laboratory investigators helped to transform medicine from a largely hit-or-miss art into an increasingly effective, science-based discipline.

The third kind of research, like the first, involves people. However, unlike the case study, which focuses on the sick individual, this third type of research deals with groups of people, both the sick and the well. Its most important tool is statistics, and since its early practitioners were concerned with epidemics of smallpox, typhoid, and other infectious diseases, it acquired the name epidemiology.

Epidemiologists study the characteristics of people—where they live, how they earn a living, what they eat, whether or not they smoke, whether they are fat or thin, what illnesses their parents had; in short, all of the countless circumstances and habits that keep them in good health or propel them into illness.

Epidemiology has roots that go back to Hippocrates, but it did not begin to evolve as a scientific discipline until the early nineteenth century. It has become increasingly important in helping to research the causes of diseases, the ways in which they evolve, and the effectiveness of treatment. In the process, it has developed powerful methods for resolving controversies over human health and disease, and indeed, epidemiology is generally the most useful of the three types of research for this purpose.

To understand what epidemiologists do and how their methods help to settle controversies, there is no better example than the remarkable work of Dr. John Snow in Victorian England. Snow was a pioneer in the use of chloroform, and became a celebrity when he administered it to Queen Victoria. But Snow's most extraordinary achievement was to penetrate the mystery of cholera, which was then the most dreaded of all diseases.

Cholera, if unchecked, follows a spectacular course. The victim is suddenly attacked by fever, vomiting, and diarrhea, and within hours the skin turns black, or perhaps blue. With the loss of fluids, the body shrivels at an alarming rate. The hands tremble, the muscles twitch, breathing becomes difficult, the blood thickens, and facial features collapse, until the victim looks like a caricature of his or her former self. The cholera sufferer's body is racked by violent convulsions and may contract into a tight ball, which even after death cannot be unwound. The end sometimes comes within a day of the first symptoms. Ordinarily, half of the victims of cholera die if untreated.

Modern science has a highly effective remedy for cholera—replacement of body fluids with specially prepared solutions—but in early nineteenth-century Europe little was known about how to treat it, and nothing was known about its cause. No one could explain its unpredictability. Healthy people were suddenly stricken, while their next-door neighbors seemed to be immune, a phenomenon that created a frightening sense of insecurity. People by the thousands fled the affected areas, sometimes even abandoning children in their panic.

Cholera originated in India and spread rapidly throughout Europe in a series of epidemics beginning in 1817. It was in 1848, when one of these epidemics reached London, that John Snow became absorbed in finding out how the disease was transmitted. Some physicians argued that the toxic agent was borne by the air; others thought that it came from decaying organic matter in the earth. But Snow had the wit to see that the disease spread too slowly to be airborne. He noted that it never spread faster than people could travel and that it never appeared in out-of-the-way places, which suggested that the infectious agent was not in the ground. If the disease did not come from the ground or the air, how then could it be spread?

Snow observed that personal contact with a cholera victim did not always result in infection. When all of the people in a household were afflicted, the physician who attended them often walked away unscathed. Other people who had no personal contact with a sick person would inexplicably become infected.

Snow saw that all of the facts could be explained by one concept: The disease was caused by an agent that entered the mouth, and this agent therefore could be carried only by water, food, or unwashed hands. This would explain, for example, why physicians, who ordinarily did not eat with their patients, rarely got the disease. It would also explain why those who attended the funerals of the victims, where food and drink were customarily served, often did become infected.

In 1854, during a resurgence of cholera in London, Snow saw a unique opportunity to test his hypothesis. Much of London at that time received its water from two private companies. The Southwark and Vauxhall Company supplied water from a highly contaminated part of the Thames, while the Lambeth Company's water came from a less contaminated section of the river. The water lines of the two companies crisscrossed, so that adjacent houses on the same block might be served from different sources. By analyzing the records, Snow found that in the houses served by the Southwark and Vauxhall Company, the cholera incidence was eight times that of the houses supplied by the Lambeth Company.
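Snow's analysis boils down to simple arithmetic: compute a cholera rate for each company's customers and compare the two. The sketch below illustrates that calculation in Python; the counts are invented placeholders, not Snow's actual 1854 figures.

```python
# Snow-style comparison of cholera rates for two water companies.
# The counts are invented placeholders, not Snow's actual 1854 figures.
def deaths_per_10000(deaths, houses):
    """Cholera deaths per 10,000 houses supplied."""
    return 10000 * deaths / houses

southwark_vauxhall = deaths_per_10000(deaths=1200, houses=40000)  # hypothetical
lambeth = deaths_per_10000(deaths=150, houses=40000)              # hypothetical

print(f"Southwark and Vauxhall: {southwark_vauxhall:.0f} deaths per 10,000 houses")
print(f"Lambeth:                {lambeth:.0f} deaths per 10,000 houses")
print(f"Rate ratio: {southwark_vauxhall / lambeth:.1f}")  # about 8, matching the text
```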

Snow then performed an experiment which proved that the cholera toxin was borne by water. There was a severe outbreak of the disease in the vicinity of the Broad Street pump. Within ten days, more than 500 people in nearby households died, but those who got their water from deep wells and other sources were relatively unaffected. When Snow convinced the authorities that they should remove the handle from the pump, the number of deaths declined sharply.

In retrospect, Snow's conclusions may seem obvious, but viewed from the perspective of the 1850s, his feat was as difficult as that of a man who finds his way out of a deep, labyrinthine cave without a map or a light. Snow's achievement was remarkable for at least two reasons. First, it came a decade before the germ theory of disease was understood and almost two decades before the cholera bacillus was discovered. Second, Snow's method was brilliant. Indeed, the method he used is a prototype of modern disease research. He took three steps that are still, 130 years later, essential to medical progress: (a) he gathered all of the available data on the disease; (b) he formed a hypothesis that attempted to explain these data in all their variety and permutations; and (c) he tested his hypothesis in the best way he knew.

Like John Snow, modern epidemiologists are concerned with ways in which people interact with their environment and with disease-causing agents. To examine these interactions, epidemiologists focus on groups of people, just as Snow did, rather than on single individuals. There have been changes, of course, since Snow's time. There has been a shift in emphasis away from infectious diseases and toward a variety of noninfectious conditions such as heart disease, cancer, schizophrenia, alcoholism, birth defects, aging, and pregnancy. There has also been an elaboration of techniques. Of course, some of the techniques are far more useful than others for resolving health controversies.

Types of Studies

Epidemiologists, like other scientists, are concerned with the development and testing of theories [1].

Theories are peculiar in at least one way: they are designed to create controversy. A theory is always framed as a declaration—never as a question—and thus is likely to invite attack. Although questions are fundamental in medical research, by itself a question is apt to be sterile, because it cannot be tested unless translated into a theoretical declaration. Theories are the indispensable fuel of medical progress. Without them, medical science would become static, and without the powerful tools of epidemiologic research to develop and test them, progress would be far slower.

There are five principal types of studies that epidemiologists use to explore associations between diseases and their possible causes: ecological, cross-sectional, case-control, cohort, and the clinical trial. Of the five, the clinical trial is the most powerful in testing theories, while ecological studies are the least powerful [2].

Some scientists exclude clinical trials from their definition of epidemiological research, but others do not.

Ecological studies make use of existing statistics on mortality, on the number of people suffering from specific diseases, on various environmental phenomena, such as water contamination and background radiation, and on economic phenomena such as per capita income or fat consumption. Because they can often be carried out at little cost, ecological studies are among the most common types of epidemiological activity.

Ecological studies can take many forms, one of the simplest being the historical comparison, such as a study in which cigarette consumption over the years is examined in relation to the trend of lung-cancer mortality. Another common type of ecological study is based on comparisons between geographical areas. In one such study, researchers compared breast-cancer mortality, country-by-country, with a variety of statistics on dietary consumption. They found that countries in which fat consumption was high almost always had high breast-cancer mortality. However, to conclude that high-fat consumption is a cause of breast cancer would be committing what is known as the ecological fallacy. This is an error in logic in which a relationship between groups is assumed to apply also to individuals within these groups. Breast-cancer mortality may be high not because of excessive fat in the diet but because of some other factor not measured by the statistics. This other factor might be highly correlated with fat consumption, and thus there would also be an association between fat and breast cancer, but this association would be fortuitous.
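The logic of such a geographical comparison can be reduced to a single correlation across countries. The sketch below, using invented national figures rather than the data from the study described above, shows both the calculation and why it proves nothing about individuals.

```python
# Ecological (country-level) correlation between average fat consumption
# and breast-cancer mortality. All numbers are invented for illustration.
from statistics import correlation  # requires Python 3.10+

fat_g_per_day = [60, 80, 95, 120, 140, 150]        # hypothetical national averages
breast_cancer_per_100k = [10, 14, 17, 22, 25, 27]  # hypothetical mortality rates

r = correlation(fat_g_per_day, breast_cancer_per_100k)
print(f"Country-level correlation: r = {r:.2f}")

# A high r says nothing about individuals: the women who die of breast
# cancer in a high-fat country need not be the ones eating the fat.
# Reading this group-level association as an individual-level cause is
# the ecological fallacy described above.
```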

Ecological studies play an important role in revealing associations between environmental phenomena and disease, but they do not identify which associations are causal. To draw firmer conclusions about causal relationships, it is necessary to examine the behavior and circumstances of individuals.

Cross-Sectional Studies

In concept, the simplest type of study based on individuals is the cross-sectional study. A familiar example is the Gallup poll, in which a cross-section of a thousand or so people are queried in a period of a few days. In epidemiological work, the cross-sectional design is used to examine the relationship between disease and the personal and environmental circumstances of a group of people at a particular moment in time.

Occasionally, cross-sectional studies are useful in suggesting the causes of a disease. An example is a survey done years ago in the town of Berlin, New Hampshire, which was designed to study the relationship of respiratory disease to cigarette smoking. The study found that chronic bronchitis was related to smoking. Few of those who never smoked had bronchitis, while among those who did smoke the prevalence of the disease rose with the number of cigarettes consumed per day.
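A cross-sectional analysis of this kind is essentially a prevalence tabulation. Here is a minimal sketch of the arithmetic; the survey records are fabricated and merely mimic the shape of the Berlin, New Hampshire, data, which are not reproduced here.

```python
# Cross-sectional tabulation of bronchitis prevalence by smoking level.
# The records are fabricated; they only mimic the shape of such a survey.
from collections import defaultdict

# Each record: (smoking category, has chronic bronchitis?)
survey = [
    ("never", False), ("never", False), ("never", False), ("never", False),
    ("1-19 per day", False), ("1-19 per day", True),
    ("1-19 per day", False), ("1-19 per day", True),
    ("20+ per day", True), ("20+ per day", True),
    ("20+ per day", False), ("20+ per day", True),
]

counts = defaultdict(lambda: [0, 0])  # category -> [cases, total]
for category, has_bronchitis in survey:
    counts[category][1] += 1
    counts[category][0] += has_bronchitis

for category in ("never", "1-19 per day", "20+ per day"):
    cases, total = counts[category]
    print(f"{category:>12}: prevalence {cases / total:.0%}")
```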

Many cross-sectional studies provide little insight into causes of a given disease. This is true of a national survey that found that 5% of men had severe drinking problems and that these men tended to be poor. However, the study gave no hint about whether poverty causes problem drinking, whether problem drinking causes poverty, or whether some third factor causes both.

Because cross-sectional surveys are often expensive and often do not provide strong clues to cause-and-effect relationships, epidemiologists commonly turn to a variation of this design known as the case-control study.

Case-Control Studies

If the cross-sectional design is like a wide-angle shot, then the case-control design is like a close-up photograph, because it focuses more closely on people who are ill. In case-control studies, people with a particular disease (the cases) are compared to people who do not have the disease (the controls). This kind of comparison can sometimes be made in cross-sectional studies, but often the number of subjects is too small for reliable comparisons. The case-control technique allows for careful matching of cases with controls on such vital characteristics as age and sex, a procedure that greatly enhances the analytical power of the study.

By comparing the cases with the controls, it is possible to identify characteristics of the case group that are associated with the disease. In one typical study, approximately 400 patients with breast cancer were compared to 400 women (controls) free of the disease. The investigators found that the diets of women with breast cancer contained more fat than those of women who did not have the disease.
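The standard summary measure in a case-control comparison is the odds ratio: the odds of exposure among the cases divided by the odds of exposure among the controls. The sketch below uses invented counts, not the figures from the breast-cancer study just described.

```python
# Odds ratio from a hypothetical case-control study of diet and breast
# cancer; the counts are invented, not taken from the study cited above.
cases_high_fat, cases_lower_fat = 240, 160        # 400 women with breast cancer
controls_high_fat, controls_lower_fat = 180, 220  # 400 women free of the disease

odds_ratio = (cases_high_fat / cases_lower_fat) / (controls_high_fat / controls_lower_fat)
print(f"Odds ratio: {odds_ratio:.2f}")  # values above 1 point to an association
```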

The case-control method has several important limitations. Ordinarily, the cases are patients who have come to a hospital for treatment, and the controls are patients at the same hospital who have unrelated diseases. Case-control studies therefore generally deal with people who have found their way to a hospital or a clinic, but such people are not necessarily typical of those who have the disease being studied. The most serious cases may die before being admitted to the hospital, while the really mild cases may be treated in the offices of private physicians.

In order to improve the validity of their data, some researchers choose as controls healthy people living in the same neighborhoods as the cases. The most desirable method is to bypass the hospital entirely and to choose both cases and controls from the general population, but this is rarely done because of the expense.

The case-control method is also limited because it is usually retrospective; that is, it attempts to reconstruct the past through the patient's memory and whatever records are available. (In contrast, the more powerful cohort technique is usually prospective; that is, all of the observations are made as the study progresses.) The case-control, like the cross-sectional, method therefore does not always provide indisputable evidence that the suspected cause precedes the disease. For example, in a case-control study designed to evaluate the relationship of hypertension to feelings of stress, there is no reliable way to determine which of the two conditions came first.

The dependence upon the patient's memory imposes another important limitation on case-control studies, because memory can be dim or, worse yet, inaccurate. In the study of breast cancer and fat consumption, the finding that women with the disease consume more fat could be misleading because of faulty memory. Breast cancer has a long latency period—possibly twenty years or more—thus, the validity of this clue depends on the successful reconstruction of the diet followed over the course of prior decades. But how many middle-aged women can recall with even approximate accuracy the amount of fat they ate ten or twenty years earlier? When memory is typically good, as it is in recalling such noteworthy events as the age of first menstruation, there is less chance of bias. In some case-control studies, dependence on memory is unnecessary, because well-kept medical records are available.

The case-control study is controversial. Its critics claim that it will not become scientifically acceptable until ways are found to control the effects of bias. Defenders of the case-control study tend to see it as speedy, relatively inexpensive, and a fruitful source of new hypotheses. It is often the only affordable way to study rare diseases, and indeed, for reasons too technical to dwell on here, the accuracy of risk estimates based on case-control studies actually increases with the rarity of the disease.
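One technical point that may underlie this remark is the rare-disease assumption: the odds ratio estimated in a case-control study comes closer and closer to the relative risk a cohort study would measure as the disease becomes rarer. A brief illustration with invented risks:

```python
# How the odds ratio from a case-control design approaches the relative
# risk as a disease becomes rarer. All risks here are invented.
def relative_risk_and_odds_ratio(risk_exposed, risk_unexposed):
    odds = lambda p: p / (1 - p)
    return risk_exposed / risk_unexposed, odds(risk_exposed) / odds(risk_unexposed)

for baseline in (0.20, 0.02, 0.002):  # common disease -> rare disease
    rr, odds_ratio = relative_risk_and_odds_ratio(2 * baseline, baseline)
    print(f"baseline risk {baseline:.3f}: relative risk {rr:.2f}, odds ratio {odds_ratio:.2f}")
```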

Cohort Studies

If cross-sectional and case-control studies are like still photographs, cohort studies are like movies. Unlike these other types of studies, they are prospective—that is, they follow groups of people over time, often for many years. They provide the most convincing evidence in support of medical hypotheses, second only to that obtained in clinical trials. Where clinical trials are impractical, judgment may have to depend primarily on cohort-study data.

There are three essential steps in a cohort study. First, the researcher picks a group of people—that is, a cohort—and records their characteristics. Usually, the people chosen are, by design, free of the disease under investigation. Second, the researcher follows the group over time to determine which members contract the disease. Finally, the researcher relates the occurrence or nonoccurrence of the disease to the characteristics of the subjects.
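The three steps can be made concrete with a toy example. Everything below is fabricated; "exposed" stands for whatever baseline characteristic the researcher recorded in step one.

```python
# The three cohort-study steps in miniature; every record is fabricated.

# Step 1: enroll a cohort free of the disease and record a characteristic.
cohort = [{"id": i, "exposed": i % 2 == 0, "disease": False} for i in range(10)]

# Step 2: follow the cohort over time and record who develops the disease
# (simulated here with a fixed pattern rather than real follow-up).
for person in cohort:
    person["disease"] = person["exposed"] and person["id"] % 4 == 0

# Step 3: relate disease occurrence to the baseline characteristic.
def incidence(people):
    return sum(p["disease"] for p in people) / len(people)

exposed = [p for p in cohort if p["exposed"]]
unexposed = [p for p in cohort if not p["exposed"]]
print(f"Incidence among exposed:   {incidence(exposed):.2f}")
print(f"Incidence among unexposed: {incidence(unexposed):.2f}")
```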

One of the earliest and best-known cohort studies began in 1948 in Framingham, Massachusetts, and continued for more than twenty years. The Framingham study was designed to help identify the characteristics associated with coronary heart disease and other cardiovascular disorders. Over 5,000 people, all of whom were free of the ailment, were questioned extensively on their living habits and given a number of medical tests, including blood-pressure measurements, electrocardiograms, and a variety of blood analyses. The investigators found that certain characteristics—high blood pressure, cigarette smoking, and high serum cholesterol—were more prevalent among those who subsequently developed heart disease than among those who did not. Such characteristics are usually called risk factors, a term that denotes a statistical association but leaves open the question of causality. For example, the Framingham study tells us that people with high serum cholesterol are more likely to develop heart disease than those with low serum cholesterol, but it does not demonstrate conclusively that high serum cholesterol causes heart disease.

Through the use of complex statistical techniques, it is possible in epidemiological studies to determine whether certain characteristics that are associated with a disease are primary or secondary. The Framingham study, for example, found that obesity was associated with heart disease. However, obese people tend to have high blood pressure and high serum cholesterol, and, of course, they may also smoke. When the Framingham researchers took these characteristics into account, they found that obesity, as such, was a secondary characteristic that had only a weak association with the disease. In other words, people who are obese are not greatly at risk of heart disease unless they also have primary risk factors such as high blood pressure.
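The Framingham investigators used multivariate statistical methods; a simpler device, stratification, illustrates the same idea. In the sketch below the counts are invented and arranged only so that the crude association between obesity and heart disease shrinks once blood pressure is held constant.

```python
# Crude vs. stratified comparison of obesity and heart disease.
# All counts are invented and arranged so that the crude association
# disappears once blood pressure is held constant.
strata = {
    "high blood pressure":   {"obese": (24, 80), "not obese": (6, 20)},
    "normal blood pressure": {"obese": (1, 20),  "not obese": (4, 80)},
}

def risk(cases, total):
    return cases / total

# Crude comparison, ignoring blood pressure entirely.
crude = {}
for group in ("obese", "not obese"):
    cases = sum(s[group][0] for s in strata.values())
    total = sum(s[group][1] for s in strata.values())
    crude[group] = risk(cases, total)
print(f"Crude risk ratio: {crude['obese'] / crude['not obese']:.1f}")

# Within each blood-pressure stratum, the apparent obesity effect vanishes.
for name, s in strata.items():
    ratio = risk(*s["obese"]) / risk(*s["not obese"])
    print(f"{name}: risk ratio {ratio:.1f}")
```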

Cohort studies can provide high-quality information. The findings of the Framingham study and several other cohort studies of coronary heart disease are the basis for the current public-health programs aimed at reducing high blood pressure.

A major problem with this type of study is the large proportion of subjects that are lost to follow-up, particularly in the long-term studies. Cohort studies are expensive and time-consuming; and they are usually not suitable for studying rare diseases, because extremely large numbers of people would have to be included in order to provide a sufficient number of cases. Because of their expense, cohort studies are not suitable for testing highly speculative hypotheses.

There is a variant of the cohort study in which the information is not gathered prospectively. Instead of choosing a group of people and following their fortunes over a period of months or years, researchers sometimes attempt to reconstruct the history of a cohort using available records and personal interviews. Because they go back in time to look at people who may have been exposed to a suspected disease-causing agent, the term applied to this approach is retrospective cohort study. Such studies are often useful in assessing the risk to any group that has been exposed to a common health hazard; for example, workers who have been exposed to lead, patients who have been exposed to a particular type of medical radiation, or people living in the neighborhood of a chemical dump site.

Clinical Trials

Of all the procedures that medical researchers use, the clinical trial is the most decisive in settling controversies over human health. At its best, this method allows researchers to test one variable at a time. When all extraneous variables can be excluded, a cause-and-effect relationship can be established beyond a reasonable doubt.

A famous example of the clinical trial is the 1954 test of the Salk vaccine for preventing poliomyelitis. Over 400,000 schoolchildren took part in this test. Half the participants—the experimental group—received the vaccine, and the other half—the control group—were given a placebo vaccination—that is, they were inoculated with a substance containing no active ingredient. The value of this vaccine was gauged by comparing the number of cases of polio that developed in each group.

Like most well-designed trials, the Salk vaccine study incorporated the double-blind principle: The trial subjects did not know whether they were getting real medication or a placebo, and the researchers who examined the subjects for polio did not know who had been given the vaccine and who had been given the placebo. The principle of withholding information from the subjects is vital in most tests of therapy in order to control for any "placebo effect"—the tendency for subjects to feel better simply because they believe they have been treated with a beneficial substance. The test could also be invalidated if parents who knew that their children had received the placebo arranged for them to get the real vaccine. The withholding of information from the researchers is also vital to a well-conducted study, for it eliminates any unconscious tendency to bias the test data, a tendency that is apt to arise in any situation that calls for some degree of subjective judgment in diagnosing disease or analyzing data.

The Salk vaccine study incorporated another important principle of clinical testing: the subjects were assigned at random to either the experimental group or the control group. In such tests, the two groups must be comparable in all respects, and the best way to ensure comparability is by random selection. If the investigator does not follow this method, the results could be open to the suspicion that bias—conscious or unconscious—affected the allocation process.
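Random allocation and blinding are mechanical procedures, and a few lines of code show how little machinery they require. The subject identifiers, group sizes, and random seed below are arbitrary; no real trial data are involved.

```python
# Minimal random allocation with blinded labels. Subject identifiers,
# group sizes, and the random seed are arbitrary; no real trial data are used.
import random

subjects = [f"subject-{i:03d}" for i in range(8)]
rng = random.Random(42)  # fixed seed so the example is reproducible
rng.shuffle(subjects)    # random order removes any systematic allocation bias

half = len(subjects) // 2
assignment = {s: "vaccine" for s in subjects[:half]}
assignment.update({s: "placebo" for s in subjects[half:]})

# Double-blinding: subjects and evaluators see only coded labels; the key
# linking codes to treatments is held back until the analysis is complete.
code = {s: f"code-{rng.randrange(10**6):06d}" for s in subjects}
key = {code[s]: treatment for s, treatment in assignment.items()}

print(assignment)  # known only to the coordinating statisticians
```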

A variant of the clinical trial is the community trial, which is used when a public-health program cannot be applied selectively to individuals but must be applied to the community as a whole. An example is fluoridation of water to prevent tooth decay, which was tested for its effectiveness in the 1940s. Several communities had fluoride added to their water supplies, while other communities served as controls. By comparing the prevalence of tooth decay among schoolchildren in the test cities against that in control cities, it was found that the children in the fluoridated areas had far fewer cavities. Community trials are necessarily restricted to a limited number of areas, and thus there is the possibility of picking atypical areas that may give misleading results. For this reason, community trials are inherently less powerful tools for testing hypotheses than ordinary clinical trials, in which chance tends to play a far lesser role because the principles of random sampling can be more fully applied.

The fluoridation experiments illustrate that not all trials need be double-blind. It was obviously impossible to avoid disclosing the fact that the water in the test areas had been fluoridated. However, because it is unlikely that development of cavities would be much affected by this knowledge, there was little concern that a placebo effect would bias the findings. (Intuitively, the assumption that there is no placebo effect in fluoridation trials seems correct. The assumption is supported by the fact that in areas with naturally high levels of fluoride in the water, residents had few cavities before it was suspected that fluoride was protective.)

In some tests, it is not possible to use a placebo—for example, when testing a new surgical method or applying a new psychiatric therapy. Although in the 1950s some placebo operations were actually performed by putting a patient under anesthesia and making an incision in the skin, this practice is not considered ethical today. Researchers now ordinarily use as controls patients who are given a different therapy or no therapy. Under these circumstances, the trial is conducted on a single-blind basis, with the researchers who perform the evaluation being unaware of which treatment is given to which patient. In a single-blind study, there is a greater possibility of bias creeping in and blunting the power to detect true differences between the experimental and established therapy.

In any clinical trial designed to test the efficacy of treatment, it is necessary to have a control group. In most cases, the control group is chosen concurrently with the experimental group, thus enhancing comparability. However, a concurrent control is not always used. In testing new cancer drugs, for example, researchers go through a preliminary phase that uses "historical controls"—usually patients who have been treated in the past with other therapies. Such tests are normally done to get a speedy and relatively inexpensive impression of a drug's effectiveness and to decide whether a full-scale controlled double-blind trial is warranted. When the first antibiotics were used to treat life-threatening bacterial infections, for example, no control group was needed to judge the results. But in most situations the lack of a concurrent control group is unacceptable on scientific grounds, except as a preliminary to more rigorous testing.

When the effect of a treatment is obvious and unequivocal, as it sometimes is in surgical procedures, concurrent controls are not essential. In one experiment, for example, demineralized bone was implanted in patients to see whether it could fill gaps in their bone structure and help in the growth of new bone. The success of the treatment was demonstrated through X rays and biopsies. As it is improbable that the defects could have been remedied without surgery, the test findings are strong evidence for the value of the treatment. The control group, in this example, consisted of the treated individuals prior to their operations.

In certain situations, it is unethical to use a concurrent control group. Cryptococcal meningitis is a fungus-caused disease of the central nervous system. Before an effective treatment was discovered, about three-quarters of the victims died within a year. Many years ago, when researchers were testing the long-term effects of an antibiotic that was the only known cure for the disease, they did not withhold it from any of their test subjects for the purpose of creating a control group. In effect, their control group consisted of all untreated patients who had suffered from the disease up to that time.

Clinical trials are most effective in testing drugs and other relatively simple therapies in which the hoped-for benefit shows up fairly rapidly—within, say, two to three years. When the therapy is a prescribed diet, where adherence is difficult to measure, the task of the researcher is more difficult. In trials that go on for years, many subjects drop out or leave town. Some therapeutic and preventive measures cause a problem by introducing unwanted variables into the test. For example, subjects who adopt a low-cholesterol, low-fat regimen designed to prevent heart disease automatically change other components of their diet, which makes it difficult to determine which dietary factors influence the outcome.

In some situations, studies have proved impractical because the controls do not act as they are supposed to. This was true in a trial in which men at risk of coronary heart disease attempted to reduce their risk by giving up cigarettes, reducing their blood pressure, and reducing serum cholesterol through diet. However, many of the controls took up these preventive measures themselves, and so the comparison of experimental vs. control subjects could not be completed until additional data were gathered.

Clinical trials often pose problems of interpretation. In many cases, they are conducted by enthusiastic supporters of a theory, who might be less apt to insist on rigorous standards than researchers who have no emotional investment in the outcome. Another and more distasteful problem is outright fraud. The late Sir Cyril Burt, one of the most distinguished psychologists of his day, brazenly forged data to make it appear that heredity is almost the sole determinant of intelligence. The best defense against this kind of thing is duplication of studies by several independent groups of researchers.

Getting positive results in a clinical trial is the best way for a new theory to gain acceptance, and if the theory is supported by several well-designed trials, it will get good notices in the journals that sway medical opinion, such as The Lancet, the British Medical Journal, the New England Journal of Medicine, the Journal of the American Medical Association, and the leading specialty publications. Eventually, the new approach, if not seriously challenged, will be presented as accepted treatment in standard textbooks. However, this does not imply that everything in the medical textbooks has been thoroughly tested.

Despite the utility of controlled randomized clinical testing, the practice is not universally accepted. Some physicians claim that testing what they consider to be a beneficial therapy would mean depriving some of their patients—the controls—of the presumed benefits of treatment. Skeptics would disagree and insist that, because the therapy has not been rigorously tested, the controls would not necessarily be deprived.

Short-Term Human Experiments

The short-term human experiment is similar to the clinical trial, but has less power to resolve controversy. For example, consider the hypothesis that caffeine causes hypertension. The obvious first thing to do when evaluating this hypothesis is to conduct a test with a limited number of people in the least expensive way possible. Researchers from Vanderbilt University did this when they measured the blood pressure of nine people before and after the subjects had been given a beverage containing caffeine, and also after being given a similar-tasting, caffeine-free beverage. None of these subjects regularly drank beverages containing caffeine. This experiment, which was done on a double-blind basis, showed that blood pressure rose substantially during the hour after taking the caffeine and then gradually declined over the next few hours. The same subjects experienced no change in blood pressure that could be attributed to the non-caffeinated beverage. This test suggested that the effect of caffeine on blood pressure might be a worthwhile subject to explore further. However, it was too limited in scope to be a test of the hypothesis that caffeine causes hypertension. It did not, for example, take into account the possibility that those who ingest large amounts of caffeine may in time adjust to the effect of the substance and so have blood-pressure levels that do not differ from those of non-users. A more conclusive test of the hypothesis would be through a large-scale clinical trial—for example, one in which coffee drinkers who gave up the beverage for a period of months or years were compared to a control group who continued to drink coffee.
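The core of such a short-term experiment is a paired before-and-after comparison within the same subjects. The blood-pressure readings in the sketch below are fabricated and merely mimic the structure of the Vanderbilt experiment.

```python
# Paired before/after comparison mimicking the caffeine experiment.
# The blood-pressure readings (mmHg) are fabricated, not the Vanderbilt data.
baseline       = [112, 118, 121, 109, 115, 120, 117, 111, 119]  # nine subjects
after_caffeine = [121, 127, 130, 116, 124, 131, 125, 118, 128]
after_placebo  = [113, 117, 122, 108, 116, 121, 116, 112, 118]

def mean_change(before, after):
    return sum(a - b for a, b in zip(after, before)) / len(before)

print(f"Mean change after caffeinated beverage:   {mean_change(baseline, after_caffeine):+.1f} mmHg")
print(f"Mean change after caffeine-free beverage: {mean_change(baseline, after_placebo):+.1f} mmHg")
```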

Natural Experiments

Alert epidemiologists always look for the rare natural experiment—a situation in which the course of a disease can be followed when only one variable is changed, as in a controlled experiment.

John Snow saw that he had a natural experiment when he noted that the Lambeth Company drew its water from a less polluted part of the Thames than the Southwark and Vauxhall Company. In modern times, the largest and most famous (or infamous) natural experiment came about through the bombing of Hiroshima and Nagasaki. By studying survivors in relation to their distance from ground zero, epidemiologists were able to relate the radiation dose to the ensuing development of a variety of diseases.

The mass migration of people also provides an opportunity for natural experiments. For example, a study of Irishmen who migrated to Boston and adopted American habits provided an important clue to the role of diet in coronary heart disease. Identical twins provide the opportunity for natural experiments, such as those that have examined the role of heredity in alcoholism.

Animal Experiments

Among the non-epidemiological evidence that is sometimes used to help resolve health controversies, animal experiments are the most important. Such experiments have sometimes provided dramatic support for medical hypotheses. Howard Florey and Ernst Chain, who won a Nobel Prize for their work in 1945, infected mice with streptococci and then gave some of the infected mice penicillin injections. The next day, they found that all of the untreated mice were dead, while all of those given penicillin survived.

In chronic diseases, animal experiments are apt to have far less value as evidence for hypotheses than in the case of the Florey-Chain experiment. Chronic diseases develop over a long period of time, usually much longer than the life-span of a laboratory animal, and of course there are important genetic differences between animal species and Homo sapiens that may affect the course of disease.

In Vitro Studies

Sometimes supporting evidence for a hypothesis comes from experiments done in vitro, as in a test tube. The term includes studies of tissues and cells taken from both animals and humans. Such studies are important in developing new hypotheses, but they are not ordinarily considered to be strong supporting evidence for hypotheses relating to human disease, because of the obvious hazard in extrapolating from the laboratory situation to the living organism.

The Hierarchy of Evidence

There is a hierarchy in the type of evidence by which hypotheses relating to human disease are judged. The cardinal rule, as illustrated in Figure 1, is to give greater weight to the more rigorous methods of testing: (a) a report based on a clinical trial is more credible than one based on a cohort study; (b) a cohort study carries more weight than a case-control study; (c) a clear-cut finding from a case-control study, particularly one that does not suffer from an obvious methodological bias, is usually more credible than a finding from an ecological study, a cross-sectional study, or an experiment with animals; and (d) animal experiments carry more weight than in vitro experiments.

Figure 1: Relative Evidentiary Value of Study Types for Hypothesis Testing
(Arranged from highest to lowest value of the data, that is, degree of acceptance by the medical community.)

Clinical trials
Cohort studies
Retrospective cohort studies (if based on reliable historical records, including good information regarding exposure of individuals to the suspected disease-causing agent)
Case-control studies
Cross-sectional studies (some)
Short-term human experiments
Cross-sectional studies (most)
Ecological studies
Animal experiments
In vitro experiments
Anecdotal evidence
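For readers who find it helpful, the ranking in Figure 1 can be written out explicitly, so that competing reports can be sorted by the weight of the evidence behind them. The ordering below follows the figure (with the two grades of cross-sectional study collapsed into one entry), and the example reports are invented.

```python
# The hierarchy of Figure 1 written out as an explicit ranking, so that
# competing reports can be sorted by the weight of their evidence. The
# ordering follows the figure; the example reports are invented.
HIERARCHY = [
    "anecdotal evidence", "in vitro experiments", "animal experiments",
    "ecological studies", "cross-sectional studies",
    "short-term human experiments", "case-control studies",
    "retrospective cohort studies", "cohort studies", "clinical trials",
]
WEIGHT = {design: rank for rank, design in enumerate(HIERARCHY)}

reports = [
    ("fat intake and breast cancer", "ecological studies"),
    ("Salk vaccine and polio", "clinical trials"),
    ("smoking and lung cancer", "case-control studies"),
]
for claim, design in sorted(reports, key=lambda r: WEIGHT[r[1]], reverse=True):
    print(f"{claim}: {design} (weight {WEIGHT[design]})")
```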

Natural experiments are difficult to place in the hierarchy of evidence, because their value varies from one situation to the next. For example, John Snow's natural experiment based on London water company customers was unambiguous. On the other hand, the atomic-bomb studies of Hiroshima and Nagasaki survivors have not been decisive in settling the current controversy over low-level radiation.

The hierarchy of evidence denotes the value of the several types of studies in supporting hypotheses, but does not rank them for their value in inspiring, developing, and modifying hypotheses. For these purposes, studies that are of low value as support for hypotheses, such as animal experiments, may be of much greater value than clinical trials.

The hierarchy of evidence is like a high-rise building: The higher you go, the farther you see. The analogy is useful, but it does not always hold. In some cases—which are probably rare—a hypothesis may get very strong support because the lower-order evidence is highly consistent. For example, by the early 1950s, thirteen out of thirteen case-control studies and several ecological studies supported the hypothesis that smoking causes lung cancer. Although no cohort study had yet been completed, there was a widespread feeling among physicians that the hypothesis was probably correct.

A corollary to recognizing the hierarchy of epidemiological evidence is to give no credence to anecdotal evidence. Notice that in Figure 1, anecdotes are placed at the zero level of credibility. Competent medical researchers may use anecdotal evidence for suggesting new hypotheses, but never as supporting evidence. Anecdotal evidence is the raw material of health cultists. The late J.I. Rodale, founder of Prevention magazine, insisted that prostate problems could be cured by eating pumpkin seeds. His claim was based on anecdotes—letters from his readers claiming therapeutic value for the seeds. The term “anecdotal evidence” connotes secondhand or poorly documented fact and should not be confused with case studies of individual patients that involve careful observation and recording of detail.

In cases where satisfactory clinical trials can be mounted, controversies can be resolved with relative ease, at least at the technical level. Where reliance must be placed on less powerful types of studies, as in the controversy over fat consumption and cancer, researchers look at the array of information available from all types of studies and try to assess it in toto. In doing this kind of assessment, they use certain generally accepted criteria. Let's examine how these criteria were applied to the very important issue of the relationship of smoking to lung cancer.

The Rules of Evidence

The term scientific judgment implies the use of objective criteria. Since World War II, medical researchers have evolved certain more or less formal criteria for use in testing theories about health and disease. These criteria, which are nothing more than common sense elaborated to take into account the peculiarities of medical evidence, deal primarily with the association between the presumed cause and the occurrence of the disease. Seven of the criteria are particularly important in evaluating health-related theories: the strength of the association, its consistency from study to study, the presence of a dose-response relationship, the temporal relationship between the presumed cause and the disease, coherence with existing knowledge, the predictive value of the theory, and the plausibility of alternative theories.

The hypothesis that cigarette smoking causes lung cancer, for example, meets all of the criteria. The strength of the association is striking—lung cancer occurs nine to ten times more often among smokers than among nonsmokers. There is a general consistency of association from one type of study to the next. There is a strong dose-response relationship—that is, the more one smokes, the greater the risk of getting the disease. There is a logical temporal relationship, since smoking, in most cases, precedes the appearance of the disease.

The hypothesis is coherent with known facts: cigarette smoke contains a number of carcinogens that are brought into intimate contact with lung tissues. The hypothesis has predictive value. Ideally, the predictive value is tested in a rigorously controlled clinical trial. Although there was no clinical trial in this situation, there was a natural experiment that was almost as conclusive: it was found that, as a group, British doctors who had given up smoking enjoyed a substantial decline in lung-cancer mortality, as implied by the hypothesis. No plausible alternative hypotheses have gained much factual support. It is possible, of course, that an alternative theory will eventually supplant the currently accepted theory, but at present this appears to be a very remote possibility.

Some Key Considerations in Judging the Hypothesis: Cigarette Smoking Causes Lung Cancer

Strength of association. How great is the risk for those exposed to the presumed cause of the disease?
Findings: Cohort and case-control studies show that the risk of lung cancer is many times greater among smokers than among nonsmokers. About 90% of lung cancer occurs among smokers.

Consistency of association. Are the findings consistent from study to study and from one type of study to another?
Findings: Virtually all cohort and case-control studies show that cigarette smokers are more likely to get lung cancer than nonsmokers. A study of twins and several international statistical analyses are also in agreement with the hypothesis. However, exposure of animals to tobacco smoke, extracts, or condensates has usually not produced lung cancer.

Dose-response relationship. Does a greater amount or longer duration of the presumed cause increase the risk?
Findings: Cohort and case-control studies show that the risk of lung cancer increases with the number of cigarettes smoked per day. Heavy smokers have twice the risk of developing lung cancer that average smokers do. The earlier the age at which smoking begins, the higher the risk of the disease. Deep inhalers are at greater risk than light inhalers.

Temporal relationship. Is there evidence that the presumed cause precedes the disease?
Findings: Cigarette smoking among men became widespread in the 1920s, preceding the large increase in lung-cancer mortality that was evident by the 1930s. Lung cancer usually develops late in life, many years after adoption of cigarette smoking.

Coherence with existing knowledge. Does the cause-and-effect interpretation severely conflict with known facts, including biological facts?
Findings: Cigarette smoke contains a number of carcinogens that are brought into intimate contact with lung tissues. Autopsy studies show that cellular changes (loss of ciliated cells, thickening, and presence of atypical cells) are much more common in the lining of the trachea and bronchi of cigarette smokers than of nonsmokers. Men took up smoking in large numbers many years before women, and lung-cancer death rates rose first among men. Cigarette consumption and lung cancer are both higher in urban than in rural areas.

Predictive value of the theory. Does exposure to the presumed cause increase the risk of disease? Does its removal decrease the risk?
Findings: If cigarette smoking is a cause of lung cancer, mortality from the disease will decline after smoking is given up. This prediction has been confirmed by several cohort studies.

Plausibility of alternative theories. Have alternative theories languished despite a conscientious search for factual support?
Findings: A principal alternative theory is that there is a common factor, perhaps genetic, underlying both cigarette smoking and lung cancer. Neither this theory nor any other alternative has gained substantial factual support despite the efforts of those opposed to the cigarette-lung-cancer theory.

The criteria are not necessary in situations where the effect follows rapidly on the cause and where there are no circumstances that might confuse the observer. If a man falls off a roof, it is reasonable to conclude that his broken bones were caused by the fall. But the situations dealt with in Part II of this book are, of course, much more complicated. The cause (or the presumed cause) typically occurs many years before the first sign of disease appears, and during those years there may be many occurrences that in themselves could also cause the disease or modify its course. Clearly, this is true in the case of cigarette smoking and cancer.

Notice that the criteria used in judging the cigarette-lung-cancer thesis are concerned primarily with "association" rather than with "causation." In the study of disease, particularly noninfectious disease, causation is not proven with the same finality as in physics and chemistry unless all variables can be rigorously controlled, as in a well-designed clinical trial. Where an acceptable clinical trial is not possible, causation is inferred by criteria of association and plausibility. Based on these tests, it is reasonable to infer that cigarette smoking is a cause of lung cancer.

The process of evaluating the cigarette-lung-cancer hypothesis is similar to a criminal investigation of murder.

Medical investigators, like criminal investigators, prosper by explaining reality. When a medical theory is successful—when it explains all of the facts in a situation and successfully predicts new developments—it becomes an "accepted" theory. The theory that cigarette smoking causes lung cancer attained this status by the 1960s, when many doctors gave up smoking and advised their patients to do the same. Few of the theories examined in Part II of this book have attained the stature of the cigarette-lung-cancer link.

The criteria for judging medical theories are not rigid tests, but more like guidelines to be applied flexibly as circumstances permit. To be widely accepted, a theory does not have to meet all of the criteria. Furthermore, there are situations in which competent scientists may choose to downplay the criteria because of the exigencies of a particular situation. However, in most situations, scientists who ignore the criteria do so at their peril.

In some instances, the criteria are difficult to apply. This is particularly true of environmental problems where the issue relates to small doses of an agent that, in larger doses, is obviously toxic. In some cases, scientists and policymakers can't agree on what evidence is reasonable for regulatory policy.

Theories that meet all of the criteria are not necessarily immutable. If a specific chemical in cigarette smoke were unequivocally identified as an initiator or promoter of lung cancer, it would then be more correct to say that this chemical, rather than cigarette smoking, is a cause of the disease. Hypotheses that explain a fundamental aspect of nature and that can be generalized to a large number of situations become established as laws, such as Mendel's laws of genetic inheritance. But even laws are not immutable.

I am not suggesting disrespect for laws, particularly for laws of nature. Nor do non-experts have any basis for questioning established theories. But new theories are another matter. They should be greeted with the same skepticism that one brings to, say, a proposal of marriage by a stranger. Even when a theory is advanced by someone particularly eminent, it is wise for the non-expert to be on guard. The informed judgment of a respected scientist will always carry weight, but facts take precedence over reputation.

Eminent scientists can be spectacularly wrong. Linus Pauling, Ph.D., who won the Nobel Prize for his work in chemistry, proposed that high doses of vitamin C could prevent colds. Although publication of his views caused an immediate rise in vitamin C supplementation, well-designed clinical trials demonstrated that he was wrong.

On the other hand, innovative theorists occasionally turn out to be correct. In the 1980s, two Australian physicians postulated that the underlying cause of peptic ulcers is a bacterium able to live within the layer of mucus that normally protects the stomach wall. Although this theory seemed incompatible with well established beliefs, the Australians did the necessary research, eventually proved to be correct, and revolutionized the treatment of peptic ulcers.

Eminent scientists will continue to fall from grace, and occasionally some unlikely notion will make its way from the fringes of science into respectability. When either of these things occurs, it is good news because it shows that the rules of the game are in good working order.


  1. Some scientists make a distinction between hypotheses and theories, the former term being applied to unsupported theses and the latter to those that are widely accepted, such as the thesis that cigarette smoking causes lung cancer. The distinction is sometimes useful, but to avoid confusion, I follow common usage and employ the two terms interchangeably.
  2. Epidemiologists frequently disagree on terminology. Cross-sectional studies are often called prevalence studies, while case-control studies are sometimes called case-comparison studies. Cohort studies are also known as incidence, longitudinal, and prospective studies. Clinical trials are routinely referred to as intervention studies.

This article was posted on October 6, 2004