Introduction to the Professions
Biology, Chemistry, and Physics 100
lecture notes for Tuesday-Thursday, 17-19 October 2006

Least-Squares: A Method of Fitting a Function to Data

The previous lecture provided an overview of elementary statistics--the kind that scientists, social scientists, engineers, and many other people use daily. We focus today on a slightly more advanced topic: fitting a set of data to a curve, or (to put it another way) fitting a curve to a set of data.

The situation we ask you to envision is that during an experiment we measure certain values that can be associated with one another in pairs. Within each pair we designate one kind of measurement as the independent variable and the other as the dependent variable, typically choosing as the independent variable the one whose measurement we are reasonably confident of, while the dependent variable is the error-prone one. Conventionally we plot the points in two dimensions, with the independent variable along the horizontal (X) axis and the dependent variable along the vertical (Y) axis. The independent variable is in many cases time: we may be watching the time-course of a process, recording the time of each measurement as the independent variable and the quantity we're interested in as the dependent variable. Thus if we are measuring the temperature of a water bath as we heat it from room temperature to boiling, recording its temperature from time to time, then the independent variable is time and the dependent variable is the temperature in the bath:

 Temperature
       |
       |                                          *
       |
       |                                      *
       |
       |                               *
       |
       |                              *
       |
       |                       *
       |
       |                  *
       |
       |           *
       |
       |
       |          *
       |
       |    *
       |____________________________________________ time

Our task, then, is to find some mathematical function with relatively few adjustable parameters that nonetheless fits the data we've measured fairly well. We make "fits fairly well" precise by saying that we want to make the differences between the measured values of the dependent variable and the calculated values of the dependent variable as small as possible. Making a difference small is typically thought of either in terms of making the absolute value of the difference small:
     min |y_calc - y_obs|
or of making the square of the difference small:
     min (y_calc - y_obs)^2

But we want all the differences to be small, not just the difference for one of the points. So if the measured values of the independent and dependent variables in the plot are
(x_1,y_1), (x_2,y_2), (x_3,y_3), (x_4,y_4), . . . (x_n,y_n) for points 1, 2, 3, . . . n, then we want to minimize the sum of the absolute values of the differences, or the sum of the squares of the differences:
     min Σ |y_i^calc - y_i^obs|      [1]
or
     min Σ (y_i^calc - y_i^obs)^2      [2]
where the index i runs from 1 to n.

The method we will discuss henceforth involves minimizing the sum of the squares of the differences. It is therefore known as the least squares technique.
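
To make the two criteria concrete, here is a minimal Python sketch (the function names and sample values are ours, purely for illustration) that evaluates the sums in [1] and [2] for a list of observed values and a matching list of calculated values:

     def sum_abs_residuals(y_calc, y_obs):
         # criterion [1]: the sum of the absolute differences
         return sum(abs(c - o) for c, o in zip(y_calc, y_obs))

     def sum_sq_residuals(y_calc, y_obs):
         # criterion [2]: the sum of the squared differences ("least squares")
         return sum((c - o) ** 2 for c, o in zip(y_calc, y_obs))

     # e.g. sum_sq_residuals([1.9, 3.3, 4.7], [2.0, 3.0, 5.0])
     # gives 0.19 (up to floating-point rounding)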

So far we haven't said anything about the kind of function y_calc we wish to fit to the points we've observed. The points shown above look like they fit a line pretty well, so we will try to fit our observed temperature (y) values to a line:
     y_calc = a x + b      [3]
where a and b are the two parameters, the slope and intercept, of the line. Our task is in fact to calculate the values of a and b that achieve the minimum in [2]:
     min Σ (a x_i + b - y_i)^2      [4]

How do we minimize a function? The traditional technique is to recognize that at any minimum (or maximum) of a function the function is turning around, i.e. if we're at the bottom (or the top) of a curve, we must be momentarily motionless. Thus a minimum of a function occurs where the derivative of the function is zero. If the function depends on more than one variable, a minimum occurs where the derivative with respect to each of the variables in question is zero. In this case we're not actually talking about the experimental values (x_i,y_i) themselves; instead, we're treating the parameters of the line, a and b, as the variables. So to find the a and b that give us the minimum in [4], we want to take the expression shown in [4] and set its derivatives with respect to a and b equal to zero. If we can derive formulae for a and b from such an analysis, we will have found a way to determine the values of a and b that satisfy [4]. First we will find the derivative of [4] with respect to the parameter a:
     0 = d/da { Σ (a x_i + b - y_i)^2 }      [5]
We can pull the derivative inside the summation to give
     0 = Σ d/da { (a x_i + b - y_i)^2 }      [6]
where we use the general formula d/dx (u^2) = 2u (du/dx)
to show that
     0 = Σ 2 (a x_i + b - y_i) * d/da (a x_i + b - y_i)      [7]
Furthermore, x_i, b, and y_i do not depend on a at all, so the derivative that remains in [7] is just d/da (a x_i + b - y_i) = x_i,
so
     0 = Σ 2 (a x_i + b - y_i) x_i      [8]
The "2" is just being carried along for the ride, so
     0 = Σ (a x_i + b - y_i) x_i      [9]
or 0 = Σ (a x_i^2 + b x_i - y_i x_i),
which we can rewrite as
     Σ y_i x_i = Σ (a x_i^2 + b x_i)      [10]

Now we do the same thing for the other parameter, namely b:
     0 = d/db { Σ (a x_i + b - y_i)^2 }      [11]
The same kind of algebraic steps as before lead to
     0 = Σ 2 (a x_i + b - y_i) * d/db (a x_i + b - y_i)      [12]
This derivative is even simpler than before: d/db (a x_i + b - y_i) = 1,
so after getting rid of the 2, we have
     0 = Σ (a x_i + b - y_i)      [13]
or
     Σ y_i = Σ (a x_i + b)      [14]

Now let us look at equations [10] and [14] in a bit more detail. a and b are constants as far as the summations are concerned, so [10] gives
     Σ y_i x_i = a Σ x_i^2 + b Σ x_i      [15]
and [14] gives
     Σ y_i = a Σ x_i + b Σ 1      [16]

If i = 1, 2, . . . n in these summations, then Σ 1 = n, so [16] can be written
     Σ y_i = a Σ x_i + b n      [16]

Now to save some scribbling we define
     e_1 = Σ x_i,  e_2 = Σ x_i^2,  f_0 = Σ y_i,  f_1 = Σ y_i x_i      [17]

so [15] and [16] become
     e_1 b + e_2 a = f_1   and   n b + e_1 a = f_0      [18]
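As a concrete (made-up) illustration, suppose we had measured the four points (1,2), (2,3), (3,5), (4,6). Then n = 4, e_1 = 1+2+3+4 = 10, e_2 = 1+4+9+16 = 30, f_0 = 2+3+5+6 = 16, and f_1 = (1)(2) + (2)(3) + (3)(5) + (4)(6) = 47, so [18] reads 10b + 30a = 47 and 4b + 10a = 16. Solving the pair gives a = 1.4 and b = 0.5, i.e. the fitted line is y = 1.4x + 0.5.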
Equations [18] are just a pair of linear equations in two unknowns, so we expect them to have a solution. One easy way to get the solution is to cast the problem in matrix form. The matrix formulation will also allow us to generalize to fitting functions more complex than the simple line [3]. To set up the matrix formulation, we define a matrix, known as the normal matrix:
     M = ( n    e_1 )
         ( e_1  e_2 )      [19]
and two vectors, the righthand-side vector and the solution vector:
     v = ( f_0 )          p = ( b )
         ( f_1 )              ( a )      [20]
We can then write the pair of equations [18] as a matrix equation
     Mp = v      [21]
We want a solution for a and b, i.e. for the vector p. If the 2x2 matrix M is invertible, then the solution of [21] is
     p = M^-1 v      [22]
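
In code, this is only a few lines. A sketch in Python (the function name is ours; numpy's np.linalg.solve does the work of [22] without explicitly inverting M):

     import numpy as np

     def fit_line_matrix(x, y):
         x = np.asarray(x, dtype=float)
         y = np.asarray(y, dtype=float)
         n = len(x)
         e1, e2 = x.sum(), (x ** 2).sum()
         f0, f1 = y.sum(), (x * y).sum()
         M = np.array([[n, e1],
                       [e1, e2]])       # the normal matrix [19]
         v = np.array([f0, f1])         # the righthand-side vector [20]
         b, a = np.linalg.solve(M, v)   # solve M p = v for p = (b, a)
         return a, b                    # slope and intercept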
In this case the matrix M is easy to invert:
     M^-1 = [1/(n e_2 - e_1^2)] (  e_2  -e_1 )
                                ( -e_1    n  )      [23]

so the solution vector p is
     p = [1/(n e_2 - e_1^2)] (  e_2 f_0 - e_1 f_1 )
                             ( -e_1 f_0 + n f_1   )      [24]

so the individual components of p, namely b and a, are
     b = (e_2 f_0 - e_1 f_1) / (n e_2 - e_1^2)      [25]
and  a = (-e_1 f_0 + n f_1) / (n e_2 - e_1^2)      [26]
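
Written straight from [25] and [26], the whole linear fit is only a few lines of plain Python (again, the function name is ours; the zero-determinant guard anticipates the failure case discussed next):

     def fit_line(xs, ys):
         n = len(xs)
         e1 = sum(xs)                               # Σ x_i
         e2 = sum(x * x for x in xs)                # Σ x_i^2
         f0 = sum(ys)                               # Σ y_i
         f1 = sum(x * y for x, y in zip(xs, ys))    # Σ y_i x_i
         det = n * e2 - e1 ** 2                     # determinant [27]
         if det == 0:
             raise ValueError("no data, or all x values equal")
         b = (e2 * f0 - e1 * f1) / det              # intercept, eqn [25]
         a = (-e1 * f0 + n * f1) / det              # slope, eqn [26]
         return a, b

For the four points of the worked example above, fit_line([1, 2, 3, 4], [2, 3, 5, 6]) returns (1.4, 0.5), as expected.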

Under what circumstances will this approach fail? The answer is: when the matrix M cannot be inverted. For a fit to a line (the case we're considering here) that happens only if the determinant
     n e_2 - e_1^2      [27]
is zero. This happens in the trivial case of no data at all, i.e. n = e_1 = e_2 = 0, and in the almost equally uninteresting case in which all the x_i values are equal: if x_i = U for all i,
then e_1 = nU, e_2 = nU^2, and the determinant [27] is n*nU^2 - (nU)^2 = 0. In any other case, the determinant will be positive by the Cauchy-Schwarz inequality (applied to the vectors (1, 1, . . . 1) and (x_1, x_2, . . . x_n)), and the formulae [25], [26] will give us a useful answer.

There may be sets of data for which a line is not the right function to fit. The following plot looks like it would benefit from a quadratic fit:

 Temperature
       |    *                               *
       |     *                              *
       |    *                               *
       |    *                              *
       |     *                             *
       |     *                            *
       |      *                          *
       |       *                        *
       |        *                       *
       |         *                    *
       |           *                 *
       |            *                *
       |             *              *
       |              *           *
       |                *        *
       |                  *     *
       |                    ***
       |____________________________________________ time
The general equation of a quadratic is
     y = a x^2 + b x + c      [28]
so the least-squares problem we wish to study is
     min F(a,b,c) = min Σ [y_i - (a x_i^2 + b x_i + c)]^2      [29]
for a set of points i = 1, 2, . . . n. As with the line, we want to calculate the derivatives of the expression in [29] with respect to the parameters--in this case, a, b, and c--and set those derivatives to zero. We thereby end up with three equations (three derivatives) in three unknowns, which we expect to be able to solve. Thus, setting dF/da = 0 and discarding the constant factor of -2 that the differentiation produces (just as we discarded the 2 before),
     0 = Σ [y_i - (a x_i^2 + b x_i + c)] x_i^2 = Σ [y_i x_i^2 - (a x_i^4 + b x_i^3 + c x_i^2)]      [30]
and for b:
     0 = Σ [y_i - (a x_i^2 + b x_i + c)] x_i = Σ [y_i x_i - (a x_i^3 + b x_i^2 + c x_i)]      [31]
and for c:
     0 = Σ [y_i - (a x_i^2 + b x_i + c)]      [32]

As with the linear fit, we rewrite these so that the terms containing y_i appear on one side of each equation and the terms that do not are on the other side:
     Σ y_i x_i^2 = Σ (a x_i^4 + b x_i^3 + c x_i^2) = a Σ x_i^4 + b Σ x_i^3 + c Σ x_i^2      [33]
     Σ y_i x_i = Σ (a x_i^3 + b x_i^2 + c x_i) = a Σ x_i^3 + b Σ x_i^2 + c Σ x_i      [34]
     Σ y_i = Σ (a x_i^2 + b x_i + c) = a Σ x_i^2 + b Σ x_i + c Σ 1      [35]

For convenience we define
     e_1 = Σ x_i,  e_2 = Σ x_i^2,  e_3 = Σ x_i^3,  e_4 = Σ x_i^4,  f_0 = Σ y_i,  f_1 = Σ y_i x_i,  f_2 = Σ y_i x_i^2.
Thus (recalling that n = Σ 1), [33] through [35] become
     f_0 = n c + e_1 b + e_2 a      [36]
     f_1 = e_1 c + e_2 b + e_3 a      [37]
     f_2 = e_2 c + e_3 b + e_4 a      [38]
As with the line, we form a matrix M, a righthand-side vector v, and a solution vector p. This time M is 3x3, and the two vectors are 3x1:
           ( n    e_1   e_2 )             ( f_0 )            ( c )
     M =   ( e_1  e_2   e_3 )   ,   v =   ( f_1 )   ,   p =  ( b )
           ( e_2  e_3   e_4 )             ( f_2 )            ( a )      [39]
and, as before, the solution to the least-squares problem Mp = v is p = M^-1 v.
We write down the determinant of M:
     D = det M = n e_2 e_4 + 2 e_1 e_2 e_3 - e_2^3 - n e_3^2 - e_1^2 e_4      [40]
and then
                   ( e_2 e_4 - e_3^2     e_2 e_3 - e_1 e_4   e_1 e_3 - e_2^2 )
     M^-1 = D^-1   ( e_2 e_3 - e_1 e_4   n e_4 - e_2^2       e_1 e_2 - n e_3 )
                   ( e_1 e_3 - e_2^2     e_1 e_2 - n e_3     n e_2 - e_1^2  )      [41]
so
     c = D^-1 [ (e_2 e_4 - e_3^2) f_0 + (e_2 e_3 - e_1 e_4) f_1 + (e_1 e_3 - e_2^2) f_2 ]      [42]
     b = D^-1 [ (e_2 e_3 - e_1 e_4) f_0 + (n e_4 - e_2^2) f_1 + (e_1 e_2 - n e_3) f_2 ]      [43]
     a = D^-1 [ (e_1 e_3 - e_2^2) f_0 + (e_1 e_2 - n e_3) f_1 + (n e_2 - e_1^2) f_2 ]      [44]
which gives us the desired parameters. As with the line, the only condition under which the solution is nonsensical is when the determinant D is zero; an argument along the same Cauchy-Schwarz lines shows that this happens only when the x_i take on fewer than three distinct values, so in practice it happens rather seldom.
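
The quadratic fit transcribes just as directly into code. A sketch in Python written from [40] through [44] (the function name is ours):

     def fit_quadratic(xs, ys):
         n = len(xs)
         e1 = sum(xs)
         e2 = sum(x ** 2 for x in xs)
         e3 = sum(x ** 3 for x in xs)
         e4 = sum(x ** 4 for x in xs)
         f0 = sum(ys)
         f1 = sum(x * y for x, y in zip(xs, ys))
         f2 = sum(x * x * y for x, y in zip(xs, ys))
         # determinant of the normal matrix, eqn [40]
         D = n * e2 * e4 + 2 * e1 * e2 * e3 - e2 ** 3 - n * e3 ** 2 - e1 ** 2 * e4
         if D == 0:
             raise ValueError("fewer than three distinct x values")
         c = ((e2 * e4 - e3 ** 2) * f0 + (e2 * e3 - e1 * e4) * f1 + (e1 * e3 - e2 ** 2) * f2) / D  # [42]
         b = ((e2 * e3 - e1 * e4) * f0 + (n * e4 - e2 ** 2) * f1 + (e1 * e2 - n * e3) * f2) / D    # [43]
         a = ((e1 * e3 - e2 ** 2) * f0 + (e1 * e2 - n * e3) * f1 + (n * e2 - e1 ** 2) * f2) / D    # [44]
         return a, b, c

As a sanity check, fit_quadratic([0, 1, 2], [0, 1, 4]) recovers (1.0, 0.0, 0.0), i.e. y = x^2 exactly.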

Clearly we can extend to higher-order polynomials, e.g. a cubic fit will require four parameters and the inversion of a 4x4 matrix M. We are not required to make the function a polynomial. If we use a function like
     y = a cos(ωx + δ)
then separating the equations so that we can get clear formulas for M and v may be tricky--indeed, because ω and δ appear inside the cosine, the normal equations are no longer linear in the parameters, and the minimization must generally be carried out numerically--but the same basic concepts arise. Thus when Gauss developed these least-squares techniques in the early nineteenth century, he paved the way for many years of productive work fitting equations, including complicated ones, to experimental data.
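
For a function like the cosine above, a modern practitioner would usually reach for a general-purpose nonlinear least-squares routine rather than derive the normal equations by hand. A brief sketch using scipy's curve_fit (our choice of tool; the synthetic data and the initial guess p0 are purely illustrative--nonlinear fits generally need a reasonable starting guess, read off from the data, to converge):

     import numpy as np
     from scipy.optimize import curve_fit

     def model(x, a, omega, delta):
         return a * np.cos(omega * x + delta)

     # synthetic "measurements": a known cosine plus a little noise
     x = np.linspace(0.0, 10.0, 50)
     rng = np.random.default_rng(0)
     y = model(x, 2.0, 1.5, 0.3) + 0.1 * rng.normal(size=x.size)

     # curve_fit minimizes the sum of squared residuals iteratively,
     # starting from the guess p0 = (a, omega, delta)
     (a, omega, delta), pcov = curve_fit(model, x, y, p0=[1.8, 1.4, 0.2])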