What scientists mean by error

This isn't intended as a last word on this subject, but I hope it's helpful. Scientists use the word error in at least three ways:

The first of these is the ordinary use of the word. A scientist might say, "I made an error in deciding to work for the Tobacco Institute", or "It's an error to concentrate all our research resources on energy production: we should be looking into energy conservation as well." Here, error means nothing more or less than a mistake in logic or judgment.

The second category is the one beloved of statisticians and of anyone engaged in the formal study of hypothesis testing. An error of this sort is the drawing of a conclusion not warranted by the data. There are two kinds of error in this sense: they're commonly known as "false positives" and "false negatives," or "type I errors" and "type II errors". A type I error, or false positive, involves detecting an effect that is actually absent; a type II error, or false negative, involves failing to detect an effect that is present. A null hypothesis is a statement that a particular effect is absent, i.e. that there is no relationship between two measured phenomena or no association among groups. On that basis, a type I error is rejection of a null hypothesis that is actually correct; a type II error is acceptance of a null hypothesis that is actually wrong. Obviously, unless we have some way of discerning absolute truth about a hypothesis, we can't be certain that we are committing an error of either kind; but in many instances the subsequent gathering of data can show us that, at an earlier stage of analysis, we (or someone else) committed a type I or type II error. Prior to 1982, it would have been routine to assert, "bacterial infections are irrelevant to stomach ulcers". In that year, Barry Marshall and Robin Warren discovered that over half of all stomach ulcers are caused by a bacterium called Helicobacter pylori, so that earlier assertion came to be recognized as a type II error.
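To see those two kinds of error in action, here is a small simulation sketch; the sample sizes, the 5% significance threshold, and the half-standard-deviation effect are arbitrary choices for illustration, not numbers from any real study.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
alpha = 0.05        # significance threshold
n_trials = 10_000   # number of simulated experiments
n = 30              # observations per group

# Type I error: both groups come from the same distribution (the null
# hypothesis is true), yet the test sometimes rejects it anyway.
false_positives = 0
for _ in range(n_trials):
    a = rng.normal(0.0, 1.0, n)
    b = rng.normal(0.0, 1.0, n)
    if stats.ttest_ind(a, b).pvalue < alpha:
        false_positives += 1

# Type II error: the groups really do differ (the null hypothesis is false),
# yet the test sometimes fails to reject it.
false_negatives = 0
for _ in range(n_trials):
    a = rng.normal(0.0, 1.0, n)
    b = rng.normal(0.5, 1.0, n)   # a true shift of half a standard deviation
    if stats.ttest_ind(a, b).pvalue >= alpha:
        false_negatives += 1

print(f"type I rate:  {false_positives / n_trials:.3f}  (expect roughly alpha = {alpha})")
print(f"type II rate: {false_negatives / n_trials:.3f}  (depends on effect size and n)")
```

The type I rate comes out near 5% essentially by construction; the type II rate is whatever the combination of effect size, sample size, and noise happens to leave you with.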

The last sense in which scientists use the word error is the one that is furthest from the conventional use of the word: here it refers to uncertainty in measurement. Every quantitative measure we use in science has some uncertainty associated with it. The error in a measurement is an estimate of that degree of uncertainty.

There are two sources of uncertainty in measurements: random and systematic error. Random error refers to the fact that multiple measurements of the same quantity, even of an object that isn't changing in time, and even when the measurement is made with a correctly calibrated instrument, may not yield the same value. Systematic error refers to uncertainties (recognized or not) that arise from deficiencies in the measurement process or in the model on which it is based. These could include incorrect calibration of the instrument we are using to make the measurement, time-dependent changes in the object we are measuring, or unrecognized outside influences on the measurement.

With continuous variables like length and mass, we encounter uncertainty in the operation of the instrument we use to measure that variable. My ruler has graduations of one millimeter, and if my eyes are good I might be able to estimate lengths to better than the distance between those graduations; so I can assert that the envelope in front of me is 177.3mm long, based on my use of a wooden 31cm ruler. The "error" in that value is an expression of how uncertain I am of that value. If I am confident that the ruler was calibrated correctly at the factory and hasn't stretched or shrunk since then, then I would probably say that the error in that measurement is about 0.3mm, i.e. that if I used a much more carefully calibrated and precise ruler, I would expect to find that the envelope is really somewhere between 177.0mm and 177.6mm. Alternatively, I would predict that if I used 100 wooden rulers of the kind I've just used, most of the measurements would cluster around 177.3mm. These are all instances of random error in our measurements. If the ruler itself is miscalibrated, or the envelope is being heated to a high temperature so that it undergoes thermal expansion, then we will encounter systematic error as well. When I was a teenager I would go fly-fishing with my father in the mountains of southern Colorado. The creel into which we would stuff the trout we kept had a ruler printed on its surface, and most of our creel rulers were seriously miscalibrated: an 8-inch fish would measure out at 9.2 inches. I suppose that makes for good fish stories, but it's an instance of systematic error.
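A quick simulation makes the distinction concrete. In the sketch below, the envelope's true length, the size of the random scatter, and the 15% miscalibration are all numbers I've made up to mimic the examples above; the point is simply that averaging shrinks the scatter but leaves the bias untouched.

```python
import numpy as np

rng = np.random.default_rng(1)
true_length = 177.3   # mm; the envelope's "true" length, assumed for the demo

# Random error: 100 honest rulers, each reading scattered by about 0.3mm.
honest = true_length + rng.normal(0.0, 0.3, 100)

# Systematic error: rulers that read 15% long (an 8-inch fish becomes 9.2),
# on top of the same random scatter.
miscalibrated = 1.15 * true_length + rng.normal(0.0, 0.3, 100)

print(f"honest rulers:        mean {honest.mean():.1f}mm, scatter {honest.std():.2f}mm")
print(f"miscalibrated rulers: mean {miscalibrated.mean():.1f}mm, scatter {miscalibrated.std():.2f}mm")
# Averaging beats down the random scatter, but the 15% bias survives any
# amount of averaging: that is the essential difference between the two.
```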

With discrete variables, i.e. objects that we can count, these notions play out in slightly different ways. When I use a pixel-array detector to measure the intensity of a diffraction spot in an X-ray crystallographic experiment, my instrument actually tells me the number of photons that have arrived at my detector during a particular time interval. Suppose that number of photons should be 1000. If I do that experiment 100 times, and my sample isn't moving or otherwise changing during those 100 exposures, then I would expect that all of my experiments will yield roughly 1000 photons. How rough that is depends on the way my instrument works, but in many instances there will be a distribution of measurements grouped around 1000, with most of the measurements falling between 970 and 1030. This particular type of random error obeys what are known as Poisson statistics, where the standard deviation (a measure of the uncertainty) is the square root of the number of counts made during the measurements. The square root of 1000 is about 31.6, so we expect most measurements to fall between 1000-32 and 1000+32. Countable or discrete variables can suffer from systematic error, just as continuous variables can. In my crystallographic case, the incoming X-ray beam may be fluctuating in intensity or position; the sample may be falling apart due to damage from the X-rays themselves; an ant may crawl past the detector during one of my X-ray exposures; and the time interval over which the X-ray shutter is open during one exposure may be different from what it is on another exposure.
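Those counting statistics are easy to reproduce in a few lines. The sketch below simply draws 100 simulated exposures from a Poisson distribution with a true mean of 1000 counts and checks that the scatter behaves as advertised; the numbers are illustrative, not from a real detector.

```python
import numpy as np

rng = np.random.default_rng(2)
counts = rng.poisson(lam=1000, size=100)   # 100 simulated exposures, true mean 1000

print(f"mean count:         {counts.mean():.1f}")
print(f"standard deviation: {counts.std():.1f}   (sqrt(1000) is about {np.sqrt(1000):.1f})")

# Roughly two-thirds of the exposures should land within one standard
# deviation (about 32 counts) of the true mean.
within = np.mean(np.abs(counts - 1000) <= 32)
print(f"fraction within 1000 +/- 32: {within:.2f}")
```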

In general there isn't much we can do about random error within the confines of a single experiment. In some instances we can reduce the influence of random error by increasing the time over which we make the measurement; that sometimes works in the crystallographic experiment mentioned above. Often we can reduce the significance of random error and get an estimate of how big the random error is by repeating our measurement many times. If I use 100 different wooden rulers to measure how long my envelope is, I can take the mean of all those measurements to be the best estimate of its true length, and the distribution of measurements that I made will give me some idea of the magnitude of the random (and possibly systematic) error in the measurement.
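Here, roughly, is what that bookkeeping looks like for the 100-ruler case. The noise level is an assumption on my part, but the recipe (take the mean, look at the scatter, and divide by the square root of the number of measurements to get the uncertainty of the mean) is the standard one.

```python
import numpy as np

rng = np.random.default_rng(3)
true_length = 177.3                                   # mm, assumed
readings = true_length + rng.normal(0.0, 0.3, 100)    # 100 wooden rulers

mean = readings.mean()
scatter = readings.std(ddof=1)              # spread of the individual readings
sem = scatter / np.sqrt(len(readings))      # standard error of the mean

print(f"best estimate of the length: {mean:.2f}mm")
print(f"scatter of a single reading: {scatter:.2f}mm")
print(f"uncertainty of the mean:     {sem:.3f}mm")
# Averaging does nothing for a shared systematic error, e.g. if all
# 100 rulers came out of the same miscalibrated factory run.
```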

Systematic error is sometimes subject to modeling, whereby we can reduce its influence on the ultimate answers that we want to release to the outside world. In my crystallographic experiments I know that my samples are gradually destroyed by their exposure to X-rays, so that if I measure the same diffraction intensity 40 times over a one-hour period, I will see a falloff in that intensity. But if I know how that falloff depends on time (or X-ray dose), then I can correct for it by multiplying the measurements by a factor that starts at one and becomes larger as time goes on. Obviously I would prefer not to have to rely on this kind of post-processing to correct my measurements: I would prefer to work directly with the numbers that my instrument yields. But we live in the real world, where systematic error does exist, and we are well-advised to try to do what we can to reduce its influence.
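As a toy version of that correction, suppose (purely for illustration) that the falloff is exponential with a known time constant; then the correction factor is just the reciprocal of the modeled decay.

```python
import numpy as np

rng = np.random.default_rng(4)
tau = 3600.0                          # assumed decay time constant, in seconds
t = np.linspace(0.0, 3600.0, 40)      # 40 exposures spread over one hour
true_intensity = 1000.0               # counts we would see with no damage

# Simulated measurements: Poisson counting noise on top of the modeled decay.
measured = rng.poisson(true_intensity * np.exp(-t / tau))

# The correction factor starts at 1 and grows with time, exactly cancelling
# the modeled falloff.
corrected = measured * np.exp(t / tau)

print(f"last raw measurement:       {measured[-1]}")
print(f"last corrected measurement: {corrected[-1]:.0f}   (target is about {true_intensity:.0f})")
# The correction removes the systematic trend, but it cannot restore the
# counting statistics lost as the raw counts shrink.
```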

How do these kinds of errors influence the way we describe our results? When I measure the envelope's length to be 177.3mm, it would be meaningless for me to say that it's 177.300000mm long, or 177300 micrometers, or 177300000 nanometers long. I don't actually know that all those zeroes are correct: the true length could be as low as 177.0mm or as high as 177.6mm. Even the .3 may be over-optimistic. So scientists and statisticians have promulgated the notion of significant figures, which are expressions of the number of digits that actually carry meaning. When we assert that the envelope is 177.3mm long, we are claiming that we know that it's not 176mm or 178mm, and that we have some confidence that it's closer to 177.3 than to 177.7 or 176.8. In Star Trek's original series, Mr. Spock routinely used considerably more significant figures than his measurements could possibly yield; apparently that was supposed to make him look smarter, but it tended to have the opposite effect on me.
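One way to keep yourself honest about significant figures is to let the uncertainty dictate the rounding. The little helper below is my own illustration rather than any standard routine: it keeps one significant digit of the uncertainty and rounds the value to the same decimal place.

```python
from math import floor, log10

def report(value: float, uncertainty: float) -> str:
    """Format value +/- uncertainty, keeping one significant digit of the
    uncertainty and rounding the value to the same decimal place."""
    exponent = floor(log10(abs(uncertainty)))   # decimal place of the uncertainty
    digits = max(0, -exponent)                  # how many decimals to print
    return (f"{round(value, -exponent):.{digits}f}"
            f" +/- {round(uncertainty, -exponent):.{digits}f}")

print(report(177.300000, 0.3))   # -> 177.3 +/- 0.3 (the extra zeroes carried no meaning)
print(report(177.2839, 0.3))     # -> 177.3 +/- 0.3
print(report(1000, 31.6))        # -> 1000 +/- 30
```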

Engineers, as well as scientists, pay a considerable amount of attention to this third sense of the word "error". In fact, a pithy way of characterizing the difference between scientists and engineers is to say that scientists try to maximize the signal-to-noise ratio, whereas engineers try to minimize the noise-to-signal ratio. A more useful way to view that is to recognize that a lot of engineering is devoted to analyzing the sources of error, particularly systematic error, in systems, and looking for ways to either reduce or model that error. If there are five sources of error in an engineering system, and one of them is responsible for 90% of the uncertainty, then it makes sense to concentrate one's efforts on reducing that source of error rather than the other four. Engineering involves other considerations, including economics, but error analysis is a huge part of any engineering project.
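A tiny error-budget calculation shows why. Independent uncertainties combine in quadrature, so the largest term swamps the rest; the five sources and their sizes below are invented purely for illustration.

```python
import math

# Hypothetical error budget: five independent sources of uncertainty,
# one of them clearly dominant.
sources = {
    "detector noise":     0.90,
    "beam fluctuation":   0.10,
    "shutter timing":     0.10,
    "sample decay model": 0.15,
    "temperature drift":  0.05,
}

def combined(budget):
    """Independent uncertainties add in quadrature."""
    return math.sqrt(sum(v ** 2 for v in budget.values()))

print(f"combined uncertainty:              {combined(sources):.3f}")

# Halving the dominant source helps enormously; halving a minor one barely
# registers in the total.
halved_big = {**sources, "detector noise": sources["detector noise"] / 2}
halved_small = {**sources, "temperature drift": sources["temperature drift"] / 2}
print(f"after halving the dominant source: {combined(halved_big):.3f}")
print(f"after halving a minor source:      {combined(halved_small):.3f}")
```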