Introduction to the Professions
Biology, Chemistry, and Physics 100
lecture notes for Tuesday-Thursday, 17-19 October 2006

Least-Squares: A Method of Fitting a Function to Data

The previous lecture provided an overview of elementary statistics--the kind that scientists, social scientists, engineers, and many other people use daily. We focus today on a slightly more advanced topic: fitting a set of data to a curve, or (to put it another way) fitting a curve to a set of data.

The situation we ask you to envision is that during an experiment we measure certain values that can be associated with one another in pairs. Within each pair we designate one kind of measurement as the independent variable and the other as the dependent variable, typically choosing as the independent variable the one whose measurement we are reasonably confident of, while the dependent variable is the error-prone one. Conventionally we plot the points in two dimensions, with the independent variable along the horizontal (X) axis and the dependent variable along the vertical (Y) axis. The independent variable is in many cases time: we may be watching the time-course of a process, recording the time of each measurement as the independent variable and the quantity we're interested in as the dependent variable. Thus if we are measuring the temperature of a water bath as we heat it from room temperature to boiling, recording its temperature from time to time, then the independent variable is time and the dependent variable is the temperature in the bath:

 Temperature
       |
       |                                          *
       |
       |                                      *
       |
       |                               *
       |
       |                              *
       |
       |                       *
       |
       |                  *
       |
       |           *
       |
       |
       |          *
       |
       |    *
       |____________________________________________ time

Our task, then, is to find some mathematical function with relatively few adjustable parameters that nonetheless fits the data we've measured fairly well. We make "fits fairly well" precise by saying that we want to make the differences between the measured values of the dependent variable and the calculated values of the dependent variable as small as possible. Making a difference small is typically thought of either in terms of making the absolute value of the difference small:
     min |y_calc - y_obs|
or of making the square of the difference small:
     min (y_calc - y_obs)^2

But we want all the differences to be small, not just the difference for one of the points. So if the measured values of the independent and dependent variables in the plot are
(x_1,y_1), (x_2,y_2), (x_3,y_3), (x_4,y_4), . . . (x_n,y_n) for points 1, 2, 3, . . . n, then we want to minimize the sum of the absolute values of the differences, or the sum of the squares of the differences:
     min Σ |y_i^calc - y_i^obs|      [1]
or
     min Σ (y_i^calc - y_i^obs)^2      [2]
where the index i runs from 1 to n.

The method we will discuss henceforth involves minimizing the sum of the squares of the differences. It is therefore known as the least squares technique.
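
To make the two criteria concrete, here is a minimal Python sketch (the function names and sample values are ours, purely for illustration) that evaluates the sums in [1] and [2] for a list of observed values and a matching list of calculated values:

     def sum_abs_residuals(y_calc, y_obs):
         # criterion [1]: the sum of the absolute differences
         return sum(abs(c - o) for c, o in zip(y_calc, y_obs))

     def sum_sq_residuals(y_calc, y_obs):
         # criterion [2]: the sum of the squared differences ("least squares")
         return sum((c - o) ** 2 for c, o in zip(y_calc, y_obs))

     # e.g. sum_sq_residuals([1.9, 3.3, 4.7], [2.0, 3.0, 5.0])
     # gives 0.19 (up to floating-point rounding)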

So far we haven't said anything about the kind of function y_calc we wish to fit to the points we've observed. The points shown above look like they fit a line pretty well, so we will try to fit our observed temperature (y) values to a line:
     y_calc = a x + b      [3]
where a and b are the two parameters, the slope and intercept, of the line. Our task is in fact to calculate the values of a and b that achieve the minimum in [2]:
     min Σ (a x_i + b - y_i)^2      [4]

How do we minimize a function? The traditional technique is to recognize that at any minimum (or maximum) of a function the function is turning around, i.e. if we're at the bottom (or the top) of a curve, we must be momentarily motionless. Thus a minimum of a function occurs where the derivative of the function is zero. If the function depends on more than one variable, a minimum occurs where the derivative with respect to each of the variables in question is zero. In this case we're not actually talking about the experimental values (x_i,y_i) themselves; instead, we're treating the parameters of the line, a and b, as the variables. So to find the a and b that give us the minimum in [4], we want to take the expression shown in [4] and set its derivatives with respect to a and b equal to zero. If we can derive formulae for a and b from such an analysis, we will have found a way to determine the values of a and b that satisfy [4]. First we will find the derivative of [4] with respect to the parameter a:
     0 = d/da { Σ (a x_i + b - y_i)^2 }      [5]
We can pull the derivative inside the summation to give
     0 = Σ d/da { (a x_i + b - y_i)^2 }      [6]
where we use the general formula d/dx (u^2) = 2u (du/dx)
to show that
     0 = Σ 2 (a x_i + b - y_i) * d/da (a x_i + b - y_i)      [7]
Furthermore, x_i, b, and y_i do not depend on a at all, so the derivative that remains in [7] is just d/da (a x_i + b - y_i) = x_i,
so
     0 = Σ 2 (a x_i + b - y_i) x_i      [8]
The "2" is just being carried along for the ride, so
     0 = Σ (a x_i + b - y_i) x_i      [9]
or 0 = Σ (a x_i^2 + b x_i - y_i x_i),
which we can rewrite as
     Σ y_i x_i = Σ (a x_i^2 + b x_i)      [10]

Now we do the same thing for the other parameter, namely b:
     0 = d/db { Σ (a x_i + b - y_i)^2 }      [11]
The same kind of algebraic steps as before lead to
     0 = Σ 2 (a x_i + b - y_i) * d/db (a x_i + b - y_i)      [12]
This derivative is even simpler than before: d/db (a x_i + b - y_i) = 1,
so after getting rid of the 2, we have
     0 = Σ (a x_i + b - y_i)      [13]
or
     Σ y_i = Σ (a x_i + b)      [14]

Now let us look at equations [10] and [14] in a bit more detail. a and b are constants as far as the summations are concerned, so [10] gives
     Σ y_i x_i = a Σ x_i^2 + b Σ x_i      [15]
and [14] gives
     Σ y_i = a Σ x_i + b Σ 1      [16]

If i = 1, 2, . . . n in these summations, then Σ 1 = n, so [16] can be written
     Σ y_i = a Σ x_i + b n      [16]

Now to save some scribbling we define
     e_1 = Σ x_i,  e_2 = Σ x_i^2,  f_0 = Σ y_i,  f_1 = Σ y_i x_i      [17]

so [15] and [16] become
     e_1 b + e_2 a = f_1   and   n b + e_1 a = f_0      [18]
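As a concrete (made-up) illustration, suppose we had measured the four points (1,2), (2,3), (3,5), (4,6). Then n = 4, e_1 = 1+2+3+4 = 10, e_2 = 1+4+9+16 = 30, f_0 = 2+3+5+6 = 16, and f_1 = (1)(2) + (2)(3) + (3)(5) + (4)(6) = 47, so [18] reads 10b + 30a = 47 and 4b + 10a = 16. Solving the pair gives a = 1.4 and b = 0.5, i.e. the fitted line is y = 1.4x + 0.5.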
Equations [18] are just a pair of linear equations in two unknowns, so we expect them to have a solution. One easy way to get the solution is to cast the problem in matrix form. The matrix formulation will also allow us to generalize to fitting functions more complex than the simple line [3]. To set up the matrix formulation, we define a matrix, known as the normal matrix:
     M = ( n    e_1 )
         ( e_1  e_2 )      [19]
and two vectors, the righthand-side vector and the solution vector:
     v = ( f_0 )          p = ( b )
         ( f_1 )              ( a )      [20]
We can then write the pair of equations [18] as a matrix equation
     Mp = v      [21]
We want a solution for a and b, i.e. for the vector p. If the 2x2 matrix M is invertible, then the solution of [21] is
     p = M^-1 v      [22]
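
In code, this is only a few lines. A sketch in Python (the function name is ours; numpy's np.linalg.solve does the work of [22] without explicitly inverting M):

     import numpy as np

     def fit_line_matrix(x, y):
         x = np.asarray(x, dtype=float)
         y = np.asarray(y, dtype=float)
         n = len(x)
         e1, e2 = x.sum(), (x ** 2).sum()
         f0, f1 = y.sum(), (x * y).sum()
         M = np.array([[n, e1],
                       [e1, e2]])       # the normal matrix [19]
         v = np.array([f0, f1])         # the righthand-side vector [20]
         b, a = np.linalg.solve(M, v)   # solve M p = v for p = (b, a)
         return a, b                    # slope and intercept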
In this case the matrix M is easy to invert:
     M^-1 = [1/(n e_2 - e_1^2)] (  e_2  -e_1 )
                                ( -e_1    n  )      [23]

so the solution vector p is
     p = [1/(n e_2 - e_1^2)] (  e_2 f_0 - e_1 f_1 )
                             ( -e_1 f_0 + n f_1   )      [24]

so the individual components of p, namely b and a, are
     b = (e_2 f_0 - e_1 f_1) / (n e_2 - e_1^2)      [25]
and  a = (-e_1 f_0 + n f_1) / (n e_2 - e_1^2)      [26]
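
Written straight from [25] and [26], the whole linear fit is only a few lines of plain Python (again, the function name is ours; the zero-determinant guard anticipates the failure case discussed next):

     def fit_line(xs, ys):
         n = len(xs)
         e1 = sum(xs)                               # Σ x_i
         e2 = sum(x * x for x in xs)                # Σ x_i^2
         f0 = sum(ys)                               # Σ y_i
         f1 = sum(x * y for x, y in zip(xs, ys))    # Σ y_i x_i
         det = n * e2 - e1 ** 2                     # determinant [27]
         if det == 0:
             raise ValueError("no data, or all x values equal")
         b = (e2 * f0 - e1 * f1) / det              # intercept, eqn [25]
         a = (-e1 * f0 + n * f1) / det              # slope, eqn [26]
         return a, b

For the four points of the worked example above, fit_line([1, 2, 3, 4], [2, 3, 5, 6]) returns (1.4, 0.5), as expected.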

Under what circumstances will this approach fail? The answer is: when the matrix M cannot be inverted. For a fit to a line (the case we're considering here) that happens only if the determinant
     n e_2 - e_1^2      [27]
is zero. This happens in the trivial case of no data at all, i.e. n = e_1 = e_2 = 0, and in the almost equally uninteresting case in which all the x_i values are equal: if x_i = U for all i,
then e_1 = nU, e_2 = nU^2, and the determinant [27] is n*nU^2 - (nU)^2 = 0. In any other case, the determinant will be positive by the Cauchy-Schwarz inequality (applied to the vectors (1, 1, . . . 1) and (x_1, x_2, . . . x_n)), and the formulae [25], [26] will give us a useful answer.

There may be sets of data for which a line is not the right function to fit. The following plot looks like it would benefit from a quadratic fit:

 Temperature
       |    *                               *
       |     *                              *
       |    *                               *
       |    *                              *
       |     *                             *
       |     *                            *
       |      *                          *
       |       *                        *
       |        *                       *
       |         *                    *
       |           *                 *
       |            *                *
       |             *              *
       |              *           *
       |                *        *
       |                  *     *
       |                    ***
       |____________________________________________ time
The general equation of a quadratic is
     y = a x^2 + b x + c      [28]
so the least-squares problem we wish to study is
     min F(a,b,c) = min Σ [y_i - (a x_i^2 + b x_i + c)]^2      [29]
for a set of points i = 1, 2, . . . n. As with the line, we want to calculate the derivatives of the expression in [29] with respect to the parameters--in this case, a, b, and c--and set those derivatives to zero. We thereby end up with three equations (three derivatives) in three unknowns, which we expect to be able to solve. Thus, setting dF/da = 0 and discarding the constant factor of -2 that the differentiation produces (just as we discarded the 2 before),
     0 = Σ [y_i - (a x_i^2 + b x_i + c)] x_i^2 = Σ [y_i x_i^2 - (a x_i^4 + b x_i^3 + c x_i^2)]      [30]
and for b:
     0 = Σ [y_i - (a x_i^2 + b x_i + c)] x_i = Σ [y_i x_i - (a x_i^3 + b x_i^2 + c x_i)]      [31]
and for c:
     0 = Σ [y_i - (a x_i^2 + b x_i + c)]      [32]

As with the linear fit, we rewrite these so that the terms containing y_i appear on one side of each equation and the terms that do not are on the other side:
     Σ y_i x_i^2 = Σ (a x_i^4 + b x_i^3 + c x_i^2) = a Σ x_i^4 + b Σ x_i^3 + c Σ x_i^2      [33]
     Σ y_i x_i = Σ (a x_i^3 + b x_i^2 + c x_i) = a Σ x_i^3 + b Σ x_i^2 + c Σ x_i      [34]
     Σ y_i = Σ (a x_i^2 + b x_i + c) = a Σ x_i^2 + b Σ x_i + c Σ 1      [35]

For convenience we define
     e_1 = Σ x_i,  e_2 = Σ x_i^2,  e_3 = Σ x_i^3,  e_4 = Σ x_i^4,  f_0 = Σ y_i,  f_1 = Σ y_i x_i,  f_2 = Σ y_i x_i^2.
Thus (recalling that n = Σ 1), [33] through [35] become
     f_0 = n c + e_1 b + e_2 a      [36]
     f_1 = e_1 c + e_2 b + e_3 a      [37]
     f_2 = e_2 c + e_3 b + e_4 a      [38]
As with the line, we form a matrix M, a righthand-side vector v, and a solution vector p. This time M is 3x3, and the two vectors are 3x1:
           ( n    e_1   e_2 )             ( f_0 )            ( c )
     M =   ( e_1  e_2   e_3 )   ,   v =   ( f_1 )   ,   p =  ( b )
           ( e_2  e_3   e_4 )             ( f_2 )            ( a )      [39]
and, as before, the solution to the least-squares problem Mp = v is p = M^-1 v.
We write down the determinant of M:
     D = det M = n e_2 e_4 + 2 e_1 e_2 e_3 - e_2^3 - n e_3^2 - e_1^2 e_4      [40]
and then
                   ( e_2 e_4 - e_3^2     e_2 e_3 - e_1 e_4   e_1 e_3 - e_2^2 )
     M^-1 = D^-1   ( e_2 e_3 - e_1 e_4   n e_4 - e_2^2       e_1 e_2 - n e_3 )
                   ( e_1 e_3 - e_2^2     e_1 e_2 - n e_3     n e_2 - e_1^2  )      [41]
so
     c = D^-1 [ (e_2 e_4 - e_3^2) f_0 + (e_2 e_3 - e_1 e_4) f_1 + (e_1 e_3 - e_2^2) f_2 ]      [42]
     b = D^-1 [ (e_2 e_3 - e_1 e_4) f_0 + (n e_4 - e_2^2) f_1 + (e_1 e_2 - n e_3) f_2 ]      [43]
     a = D^-1 [ (e_1 e_3 - e_2^2) f_0 + (e_1 e_2 - n e_3) f_1 + (n e_2 - e_1^2) f_2 ]      [44]
which gives us the desired parameters. As with the line, the only condition under which the solution is nonsensical is when the determinant D is zero; an argument along the same Cauchy-Schwarz lines shows that this happens only when the x_i take on fewer than three distinct values, so in practice it happens rather seldom.
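
The quadratic fit transcribes just as directly into code. A sketch in Python written from [40] through [44] (the function name is ours):

     def fit_quadratic(xs, ys):
         n = len(xs)
         e1 = sum(xs)
         e2 = sum(x ** 2 for x in xs)
         e3 = sum(x ** 3 for x in xs)
         e4 = sum(x ** 4 for x in xs)
         f0 = sum(ys)
         f1 = sum(x * y for x, y in zip(xs, ys))
         f2 = sum(x * x * y for x, y in zip(xs, ys))
         # determinant of the normal matrix, eqn [40]
         D = n * e2 * e4 + 2 * e1 * e2 * e3 - e2 ** 3 - n * e3 ** 2 - e1 ** 2 * e4
         if D == 0:
             raise ValueError("fewer than three distinct x values")
         c = ((e2 * e4 - e3 ** 2) * f0 + (e2 * e3 - e1 * e4) * f1 + (e1 * e3 - e2 ** 2) * f2) / D  # [42]
         b = ((e2 * e3 - e1 * e4) * f0 + (n * e4 - e2 ** 2) * f1 + (e1 * e2 - n * e3) * f2) / D    # [43]
         a = ((e1 * e3 - e2 ** 2) * f0 + (e1 * e2 - n * e3) * f1 + (n * e2 - e1 ** 2) * f2) / D    # [44]
         return a, b, c

As a sanity check, fit_quadratic([0, 1, 2], [0, 1, 4]) recovers (1.0, 0.0, 0.0), i.e. y = x^2 exactly.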

Clearly we can extend to higher-order polynomials, e.g. a cubic fit will require four parameters and the inversion of a 4x4 matrix M. We are not required to make the function a polynomial. If we use a function like
     y = a cos(ωx + δ)
then separating the equations so that we can get clear formulas for M and v may be tricky--indeed, because ω and δ appear inside the cosine, the normal equations are no longer linear in the parameters, and the minimization must generally be carried out numerically--but the same basic concepts arise. Thus when Gauss developed these least-squares techniques in the early nineteenth century, he paved the way for many years of productive work fitting equations, including complicated ones, to experimental data.
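
For a function like the cosine above, a modern practitioner would usually reach for a general-purpose nonlinear least-squares routine rather than derive the normal equations by hand. A brief sketch using scipy's curve_fit (our choice of tool; the synthetic data and the initial guess p0 are purely illustrative--nonlinear fits generally need a reasonable starting guess, read off from the data, to converge):

     import numpy as np
     from scipy.optimize import curve_fit

     def model(x, a, omega, delta):
         return a * np.cos(omega * x + delta)

     # synthetic "measurements": a known cosine plus a little noise
     x = np.linspace(0.0, 10.0, 50)
     rng = np.random.default_rng(0)
     y = model(x, 2.0, 1.5, 0.3) + 0.1 * rng.normal(size=x.size)

     # curve_fit minimizes the sum of squared residuals iteratively,
     # starting from the guess p0 = (a, omega, delta)
     (a, omega, delta), pcov = curve_fit(model, x, y, p0=[1.8, 1.4, 0.2])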