General comments:
If we measure something many times, we often want to know some sort of central measure for our observations--a value that is typical of the collection of measurements. Three common central measures are the mean, the median, and the mode.
The mean. This is the "average" used in everyday speech. The unweighted mean is appropriate if the experimenter is equally confident of all the measurements: it is simply the sum of the observations divided by the number of observations:
[x]u = Σ xi / n = Σ xi / Σ 1
If instead we believe we can assign weight values to individual observations, so that our confidence in our ith measurement might differ from our confidence in the jth measurement, then we compute a mean that takes these weights into account:
[x]w = Σ xi wi / Σ wi
Note some simple properties of the weighted mean. If all the weights are equal, say wi = W for every i, then the weighted mean reduces to the unweighted mean:

[x]w = Σ xi wi / Σ wi = Σ xi W / Σ W = W Σ xi / (W Σ 1) = Σ xi / Σ 1 = [x]u
If vi = Q * wi for some nonzero Q, then the mean computed with the altered weights vi is unchanged:

[x]'w = Σ xi vi / Σ vi = Q Σ xi wi / (Q Σ wi) = Σ xi wi / Σ wi = [x]w
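Both properties are easy to check numerically. A minimal sketch in Python (the observation values and weights below are made up for illustration):

```python
def unweighted_mean(xs):
    """[x]u = sum(xi) / n."""
    return sum(xs) / len(xs)

def weighted_mean(xs, ws):
    """[x]w = sum(xi * wi) / sum(wi)."""
    return sum(x * w for x, w in zip(xs, ws)) / sum(ws)

xs = [17, 13, 20, 18, 8, 22, 19]           # sample observations
ws = [1.0, 2.0, 1.5, 1.0, 0.5, 2.0, 1.0]   # made-up confidence weights

# Property 1: equal weights reduce the weighted mean to the unweighted mean.
equal = [3.7] * len(xs)
assert abs(weighted_mean(xs, equal) - unweighted_mean(xs)) < 1e-12

# Property 2: rescaling every weight by a nonzero Q leaves the mean unchanged.
Q = 4.2
scaled = [Q * w for w in ws]
assert abs(weighted_mean(xs, scaled) - weighted_mean(xs, ws)) < 1e-12
```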
The median. This is the point in a distribution for which exactly half the measurements are smaller than the value and half are larger. Thus if we make seven observations with values

17, 13, 20, 18, 8, 22, 19,

the median is 18, since 17, 13, and 8 are less than 18, and 20, 22, and 19 are greater.
Note that the mean of this set of measurements is
(17 + 13 + 20 + 18 + 8 + 22 + 19) / 7 = 117 / 7 = 16.71
There are circumstances where the median is actually a more useful measure than the mean, because it is less sensitive to a small number of outliers. Suppose we made the above measurements as given, except that an electronic or recording error occurred in the third measurement and a much larger value was recorded--say, 73. The median will still be 18, but the mean will now be (117 + (73 - 20)) / 7 = 170 / 7 = 24.3. Thus one measurement error had a large effect on the mean, but no effect at all on the median.
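This robustness is easy to verify with Python's standard statistics module:

```python
from statistics import mean, median

clean = [17, 13, 20, 18, 8, 22, 19]
corrupt = [17, 13, 73, 18, 8, 22, 19]   # third value misrecorded as 73

# The median is unaffected by the single bad measurement ...
assert median(clean) == 18
assert median(corrupt) == 18

# ... but the mean shifts from 117/7 = 16.71 to 170/7 = 24.29.
print(round(mean(clean), 2), round(mean(corrupt), 2))  # 16.71 24.29
```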
Calculating a median is a bit more complex than computing a mean. If the data have been sorted in increasing or decreasing order, it is easy: we just trace down the list to find the midpoint and use the value located there. If the data have not been sorted, we need to find another way of keeping track of the values as we examine them. This sorting notion illustrates a related point: we haven't yet defined a median for an even number of measurements. If we have an even number of observations the median is normally taken to be the mean of the two values closest to the midpoint. Thus the median of the collection of observations
17, 13, 20, 18, 8, 22, 19, 19
(for which we happened to get two equal values as we went along) is 18.5, since 8, 13, 17, and 18 are less than that value, and 19, 19, 20, and 22 are greater.
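The procedure just described--sort, then take the middle value or the mean of the two middle values--can be sketched as:

```python
def my_median(xs):
    """Median: the middle value after sorting; for an even number of
    observations, the mean of the two values closest to the midpoint."""
    s = sorted(xs)
    n = len(s)
    mid = n // 2
    if n % 2 == 1:
        return s[mid]
    return (s[mid - 1] + s[mid]) / 2

assert my_median([17, 13, 20, 18, 8, 22, 19]) == 18        # odd count
assert my_median([17, 13, 20, 18, 8, 22, 19, 19]) == 18.5  # even count
```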
The mode is the value which occurs most frequently in a distribution. This is unlikely to be useful in measurements that are "naturally" non-integer real values, but for integers, especially relatively small integers, it can be useful. The mode of the measurements
17, 8, 13, 12, 19, 17, 12, 11, 18, 23, 18, 15, 17, 28, 25, 23, 17, 14
is 17. The mode is particularly useful if we have a feeling that there really is a "correct" value for our distribution, and the number that comes up most often will reflect that.
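Finding the mode amounts to counting occurrences and taking the most frequent value, which the standard library's Counter does directly:

```python
from collections import Counter

def my_mode(xs):
    """The value that occurs most frequently in the collection."""
    return Counter(xs).most_common(1)[0][0]

data = [17, 8, 13, 12, 19, 17, 12, 11, 18, 23, 18, 15, 17, 28, 25, 23, 17, 14]
assert my_mode(data) == 17   # 17 occurs four times, more than any other value
```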
Now consider two distributions of measurements:

A: 0, 5, 8, 9, 13, 15, 20, 27, 30, 33
B: 13, 13, 14, 15, 16, 16, 17, 18, 19, 19
We have sorted the measurements by value, but that does not alter the properties of the distributions. Each distribution has ten observations in it and a mean value of 16. But the values in distribution B are much more closely clustered around the mean value of 16; all lie within three units of the mean, whereas the values in distribution A wander much farther from the mean. We consider B to be a "tighter" distribution. How do we quantify this tightness?
One common measure of this spread is the variance:

v = Σ (xi - [x])^2 / (N - 1)
Note that for a single observation (N = 1) the denominator vanishes, so the variance is meaningless. This makes sense: we cannot usefully discuss the distribution of values when only one observation exists.
The standard deviation σ is the square root of the variance:

σ = { Σ (xi - [x])^2 / (N - 1) }^(1/2)
As you might expect, there are associated formulas for the standard deviation and variance in cases where the mean is weighted rather than unweighted.
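Applying these formulas to distributions A and B above confirms the "tightness" argument numerically. A minimal sketch:

```python
def variance(xs):
    """Sample variance: sum((xi - mean)^2) / (N - 1); needs N >= 2."""
    n = len(xs)
    if n < 2:
        raise ValueError("variance is undefined for fewer than two observations")
    m = sum(xs) / n
    return sum((x - m) ** 2 for x in xs) / (n - 1)

def std_dev(xs):
    """Sample standard deviation: square root of the variance."""
    return variance(xs) ** 0.5

A = [0, 5, 8, 9, 13, 15, 20, 27, 30, 33]
B = [13, 13, 14, 15, 16, 16, 17, 18, 19, 19]

# Both have mean 16, but B's variance is far smaller: B is "tighter".
assert sum(A) / len(A) == sum(B) / len(B) == 16
assert variance(B) < variance(A)
```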
Histograms are representations of distributions that concentrate on how often each particular value arises. Typically the histogram is represented as a graph with observed values along the horizontal axis and their frequency--the number of times that a particular value arises--along the vertical axis. Thus for the distribution given above under "mode", the histogram looks like
N
4 |                             *
3 |                             *
2 |              *              *  *              *
1 |  *        *  *  *  *  *     *  *  *           *     *        *
  ------------------------------------------------------------------
     8  9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28   value

Usually histograms are plotted for distributions with hundreds or thousands of data points, not just eighteen. But even in this case it becomes possible to see from the histogram which values appear often and which do not. In particular, the mode of the distribution is immediately evident: it's the peak of the histogram. In a distribution for which the values are real numbers, histograms of the individual values are nearly meaningless, because every measured value is likely to be slightly different from every other. Thus the histogram will be composed entirely of N values equal to one or zero--one where an observation appears, zero everywhere else. Therefore under these circumstances histograms are usually built by "binning" values along the observational axis--that is, grouping them so that all values within a small range, say between 18.5 and 19.4999, are counted as being equal to a single value, in this case 19. As long as the "bins" into which values are grouped are of equal width, this kind of binned histogram will provide a useful image of the frequency of values in the distribution.
Up to now we have been casual about the order in which we make measurements. In many circumstances this is appropriate: we may not really care whether the large measurements were recorded at the beginning, middle, or end of the sequence of observations. In other circumstances we are distinctly interested in how the observations we make are changing over time. In effect, we are interested in plotting our observations as a function of time, or at least of sequence number within our recording period. We are thus interested in the trend displayed by the data.
Trends need not be entirely monotonic in order to be significant. A monotonic distribution is one in which every value is larger than the previous one, or in which every value is smaller than the previous one. These trends tend to be easy to spot. A more alert observer is needed to recognize trends in which an overall tendency upward or downward is combined with some fluctuations up and down. If your car's gasoline mileage is slowly getting poorer because your engine needs a tune-up, you may not find that the mileage on one fill-up is always lower than that found on the previous fill-up. It may actually get better on the October 15th fill-up than on the October 9th fill-up if you did a lot of city driving on October 4-9 and a lot of steady, country-road driving on October 9-15. But the overall trend would be downward, so that the average gas mileage during June and July would be 10% better than during September and October. At that point you would know that it's time to see your friendly mechanic.
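One simple way to detect such an overall tendency despite the fluctuations is to fit a least-squares line to the observations against their sequence number and look at the sign of the slope. A sketch, with made-up mileage figures:

```python
def trend_slope(ys):
    """Least-squares slope of ys against observation index t = 0, 1, ...:
    slope = sum((t - t_mean) * (y - y_mean)) / sum((t - t_mean)^2)."""
    n = len(ys)
    ts = range(n)
    t_mean = sum(ts) / n
    y_mean = sum(ys) / n
    num = sum((t - t_mean) * (y - y_mean) for t, y in zip(ts, ys))
    den = sum((t - t_mean) ** 2 for t in ts)
    return num / den

# Made-up fill-up mileages (mpg): fluctuating, but drifting downward.
mpg = [31.0, 29.5, 30.2, 28.8, 29.4, 27.9, 28.3, 27.1]
assert trend_slope(mpg) < 0   # overall downward trend despite the wiggles
```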
Trends in scientific measurements are significant when repeated measurements need to be made, and the time-course of the experiment is potentially one of the influences on the measurement. In X-ray crystallography the diffraction spots we measure tend to get weaker over time, because the crystalline order of our sample begins to disappear due to radiation damage. In neutron crystallography decay is essentially non-existent. Thus if you re-measure the same Bragg reflection many times over the course of an X-ray experiment, the trend will be downward; in a neutron experiment, there will be no trend, and random fluctuations will be the only source of variation.