1. Introduction. We wish to consider fitting a set of experimental
data ([t.sub.i], [y.sub.i]) for i = 1,2, ..., m with a model of the form
y = [n.summation over (j=1)] [c.sub.j][[empty set].sub.j] ([alpha],
t),
where [alpha] = [[[[alpha].sub.1] [[alpha].sub.2] ...
[[alpha].sub.d]].sup.T] is a set of parameters that enter non-linearly
into the model, and the coefficients c = [[[c.sub.1] [c.sub.2] ...
[c.sub.n]].sup.T] obviously enter linearly. This problem can be more
usefully viewed by constructing a model matrix A whose i, j element is
given by A(i, j) = [[empty set].sub.j] ([alpha], [t.sub.i]). It is clear
that A is a function of both [alpha] and the [t.sub.i] although we
suppress that in the notation for the sake of clarity. As the [t.sub.i]
are fixed, our goal is to find values for [alpha] and c that minimize
the length of the residual r = y - Ac where y = [[[y.sub.1] [y.sub.2]
... [y.sub.m]].sup.T].
It is clear that for a fixed value of a the linear parameters c can
be found by solving a linear least squares problem. Linear least squares
problems are well understood and a variety of excellent methods are
known for their solution [1, 6, 7].
Henceforth, we shall assume that A is a full rank [R.sup.m x n]
matrix with m > n and let P be the projection matrix that projects
vectors onto the range of A. In particular, we let P = [AA.sup.[dagger]]
where [A.sup.[dagger]] = [([A.sup.T] A).sup.-1] [A.sup.T] is the
Moore-Penrose generalized inverse (1) of A which can be used to solve
the linear least squares problem yielding c = [A.sup.[dagger]]y.
We note that P is both symmetric ([P.sup.T] = P) and idempotent
([P.sup.2] = P). The orthogonal projector [P.sup.[perpendicular to]]
that projects onto the null space of [A.sup.T] is given by
[P.sup.[perpendicular to]] = I - P.
It is clear at this point that the residual, which is in fact a
function of [alpha], is given by r = [P.sup.[perpendicular to]]y and we
have effectively removed the linear parameters and now need to solve a
non-linear least squares problem to find [alpha]. One particularly
useful technique for solving this problem is developed in [4] and
involves applying the Gauss-Newton method to the variable projection
functional, in essence the norm of [P.sup.[perpendicular to]]y. The
convergence of this approach and of a variant first proposed in [8] are
elegantly analyzed in [11]. This variable projection approach has found
wide application in a variety of fields; see the excellent overview of
the history and applications in [5]. In the sequel we shall develop a
full-Newton approach to solving this problem which follows by extending
the ideas in [4].
Although the algorithm we derive is quite general, the specific
practical problem that motivated this development is rapid non-linear
curve fitting to smaller data sets. This is a problem of serious
interest in systems that use contour encoding for data compression where
one must fit curves to many thousands of small data sets. One such
system is developed in [10] for compression of hand-written documents
and it uses the Gauss-Newton algorithm developed in [2] for fitting
B-spline curves to pen strokes to achieve the needed efficiencies for a
practical compression algorithm. Another related application is
compressing track information for moving objects, etc. Although each
individual problem may be small, the huge number of problems that need
to be solved makes it worthwhile to apply more sophisticated techniques
to their solution.
In light of this motivation we demonstrate the usefulness of the
full-Newton approach first by briefly applying it to the fitting Bezier
curves that was described in [2]. We then carefully develop a specific
algorithm for discrete rational approximation in the least-squares sense
which we demonstrate on a few simple examples.
2. The Newton step. The key to deriving a full Newton algorithm for
solving a nonlinear least squares problem is to build a quadratic model
for the squared norm of the residual at the current point ac and
minimize that at each step; see, for instance, [3]. The Newton step is
given by
(2.1) [[alpha].sub.N] = - [H.sup.-1][J.sup.T] r,
where J = [nabla]r is the Jacobian and
(2.2) H = [J.sup.T] J + S,
with
(2.3) S = [m.summation over (i=1)] [r.sub.i] ([alpha])
[[nabla].sup.2] [r.sub.i]([alpha]).
In order to construct J and S we will need the first and second
partial derivatives of [P.sup.[perpendicular to]] with respect to the
non-linear parameters [[alpha].sub.i]. Henceforth, we shall use the
subscript i on a matrix to denote differentiation with respect to the
variable [[alpha].sub.i].
To compute the first partial we follow the derivation presented in
[4]. We begin by noting that the projection matrix satisfies the
equation PA = A. Differentiating both sides of this equation with
respect to [[alpha].sub.i] yields [P.sub.i]A + P [A.sub.i] = [A.sub.i].
Subtracting P [A.sub.i] from both sides and regrouping yields [P.sub.i]A
= (I - P) [A.sub.i]. Which implies that [P.sub.i]A =
[P.sup.[perpendicular to]] [A.sub.i]. Multiplying both sides on the
right by [A.sup.[dagger]] and using [AA.sup.[dagger]] = P yields
[P.sub.i]P = [P.sup.[perpendicular to]] [A.sub.i][A.sup.[dagger]]. Now
transposing and exploiting symmetry on the left gives [PP.sub.i] =
[([P.sup.[perpendicular to]] [A.sub.i][A.sup.[dagger]]).sup.T]. And
finally, since P is idempotent ([P.sup.2] = P) differentiation yields
the widely known result from [4],
(2.4) = [P.sub.i] = [P.sub.i]P + [PP.sub.i] = [P.sup.[perpendicular
to]] [A.sub.i][A.sup.[dagger]] + [([P.sup.[perpendicular to]]
[A.sub.i][A.sup.[dagger]]).sup.T].
Now we need to extend this derivation in order to compute the
second partial derivative. We begin by considering (2.4) and using the
product rule to take its partial derivative with respect to
[[alpha].sub.j]. Therefore,
(2.5) [MATHEMATICAL EXPRESSION NOT REPRODUCIBLE IN ASCII]
Now we note that
[([P.sup.[perpendicular to]]).sub.j] = [([I - P).sub.j] = -
[P.sub.j],
which we may use to simplify (2.5) giving
(2.6) [MATHEMATICAL EXPRESSION NOT REPRODUCIBLE IN ASCII].
Next, we need to compute the partial derivative of [A.sup.[dagger]]
with respect to [[alpha].sub.j]. We begin by differentiating both sides
of P = [AA.sup.[dagger]] with respect to [[alpha].sub.j], which yields
[P.sub.j] = [A.sub.j] [A.sup.[dagger]] + A
[([A.sup.[dagger]]).sub.j].
Rearranging gives
A [([A.sup.[dagger]]).sub.j] = [P.sub.j] - [A.sub.j]
[A.sup.[dagger]].
Multiplying on the left by [A.sup.[dagger]] and invoking the fact
that [A.sup.[dagger]] A = I yields (2)
[([A.sup.[dagger]]).sub.j] = [A.sup.[dagger]][P.sub.j] -
[A.sup.[dagger]] [A.sub.j] [A.sup.[dagger]].
Substituting for [P.sub.j] from (2.4) yields
[([A.sup.[dagger]]).sub.j] = [A.sup.[dagger]]
([P.sup.[perpendicular to]] [A.sub.j] [A.sup.[dagger]] +
[([P.sup.[perpendicular to]).sup.T]) - [A.sup.[dagger]] [A.sub.j]
[A.sup.[dagger]].
And finally, invoking [A.sup.[dagger]] [P.sup.[perpendicular to]] =
0 yields
[([A.sup.[dagger]]).sub.j] = [A.sup.[dagger]]
[([P.sup.[perpendicular to]] [A.sub.j] [A.sup.[dagger]]).sup.T] -
[A.sup.[dagger]] [A.sub.j] [A.sup.[dagger]].
We can now substitute this expression into (2.6) giving
(2.7) [MATHEMATICAL EXPRESSION NOT REPRODUCIBLE IN ASCII].
In the interest of brevity, we define
[MATHEMATICAL EXPRESSION NOT REPRODUCIBLE IN ASCII]
Substituting this definition into (2.7) yields a compact expression
for the second partial derivative of the projector. In particular,
(2.8) [P.sub.i,j] = [D.sub.i,j] + [D.sup.T.sub.i,j].
Now that we have expressions for the first and second partial
derivatives we may proceed to construct the Newton step. To solve (2.1)
we need to construct the J and S matrices. We begin by defining two
useful quantities. First, the current residual is given by
(2.9) r = [P.sup.[perpendicular to]] y,
and second, the current solution is given by
(2.10) c = [A.sup.[dagger]] y.
Now, since J = [nabla]r it follows that the jth column of J is
given by - [P.sub.j]y which, invoking (2.4), is
(2.11) - [P.sup.[perpendicular to]] [A.sub.j]c -
[(A.sup.[dagger]]).sup.T] [A.sup.T.sub.j] r.
To construct S we note that following (2.3) the i, j element of S
is given by
S (i, j) = - [r.sup.T] [P.sub.ij] y.
Making use of (2.8), we find that
(2.12) S(i, j) = - [r.sup.T] ([D.sub.i,j] + [D.sup.T.sub.i,j]) y.
We will evaluate this in two parts. Consider first the quantity
(2.13) [MATHEMATICAL EXPRESSION NOT REPRODUCIBLE IN ASCII].
Invoking [P.sup.[perpendicular to]]r = r and (2.9) and (2.10), we
reduce (2.13) to
(2.14) - [r.sup.T] [D.sub.jj] y = [r.sup.]T [P.sub.j] [A.sub.i]c -
[r.sup.T] [A.sub.ij]c - [r.sup.T] [A.sub.i][A.sup.[dagger]] [(A.sub.j]
[A.sup.[dagger]]).sup.T] r + [r.sup.T] [A.sub.i][A.sup.[dagger]]
[A.sub.j]c.
Substituting in for [P.sub.j] from (2.4) yields
(2.15) [MATHEMATICAL EXPRESSION NOT REPRODUCIBLE IN ASCII].
Now we note that [A.sup.[dagger]] r = 0 and this reduces to
(2.16) [MATHEMATICAL EXPRESSION NOT REPRODUCIBLE IN ASCII].
Next, we consider the second quantity. Omitting the details, it can
be reduced to
(2.17) [MATHEMATICAL EXPRESSION NOT REPRODUCIBLE IN ASCII].
Substituting (2.16) and (2.17) into (2.12) and combining terms
yields
(2.18) [MATHEMATICAL EXPRESSION NOT REPRODUCIBLE IN ASCII].
It is important to exploit symmetries and use care when
constructing the S matrix in order to minimize the operation count and
enhance accuracy.
3. Implementation details. Implementation details are important.
First and foremost is the careful construction of both the J and S
matrices. A simple observation can greatly simplify this process; in
particular, terms of the form [A.sub.j]c and [A.sup.T.sub.j]r are
ubiquitous in (2.11) and (2.18). We begin by constructing two m x d
auxiliary matrices. We shall call the first matrix U. Its jth column is
given by
(3.1) [A.sub.j]c.
We shall call the second matrix V. Its jth column is given by
(3.2) [A.sup.T.sub.j]r.
Using these definitions in (2.11), we see that
J = - ([P.sup.[perpendicular to]] U + [([A.sup.[dagger]]).sup.T] V)
.
At this point we will introduce two intermediate quantities to
simplify the notation going forward. In particular, we define
(3.3) K = [P.sup.[perpendicular to]] U
(3.4) L = [([A.sup.[dagger]]).sup.T] V.
With these definitions, we find that
(3.5) J = - (K + L),
or, equivalently,
(3.6) J = - (U - P(U - L)).
It is also true that
(3.7) [J.sup.T] J = [K.sup.T] K + [L.sup.T] L,
since the columns of L are clearly elements of the range of A, and
those of K are in the null space of [A.sup.T].
Now we examine (2.18). We begin by considering the first term and
noting that it can be written in terms of the jth column of V, which we
shall denote by [v.sub.j] and the ith column of U, which we shall denote
by [u.sub.i],
[r.sup.T] [A.sub.j] [A.sup.[dagger]] [A.sub.i]c = [v.sup.T.sub.j]
[A.sup.[dagger]] [u.sub.i],
which is the i, j element of the matrix [U.sup.T] L. By the same
argument, the quantity in the second term of (2.18) is the i, j element
of the matrix [L.sup.T] U. Similarly, the quantity in the third term of
(2.18) is twice the i, j element of the matrix [L.sup.T] L.
Hence, if we let S be a matrix whose i, j element is given by
(3.8) [??](i, j) = [r.sup.T] [A.sub.ij] c,
then it follows that (2.18) leads to
(3.9) S = [U.sup.T] L + [L.sup.T] U - 2[L.sup.T] L - [??].
Using (3.7) and (3.9) in (2.2) gives
(3.10) H = [U.sup.T] L + [L.sup.T] U + [K.sup.T] K - [L.sup.T] L -
[??],
or, noting that PL = L, more compactly,
(3.11) H = [U.sup.T] U - [(U - L).sup.T] P (U - L) - [??].
Note that the most significant difference, in terms of
computational load, between the Full Newton and the Gauss-Newton
approaches is the construction of the [??] matrix. This requires working
with O([d.sup.2]) partial derivatives (both first and second order
partials) as opposed to just O(d) first partials for the Gauss-Newton
approach. This can be significant if all of the second partial
derivatives are full rank. However, in a number of applications many of
the mixed partial derivatives vanish, are sparse, or have low rank, and
there can be a significant savings by exploiting such structure. In
Section 4, we present an example from [2] involving the fitting of a
Bezier curve to ordered two-dimensional data where all of the mixed
partial derivatives are identically zero. In this specific example we
see only a 50% increase in the work per step by using a Full-Newton
approach and reap all the benefits of greatly accelerated convergence.
In order to maximize efficiencies for any specific model one must take
full advantage of any special structure of the partial derivatives.
The other critical step is to deal with [P.sup.[perpendicular to]
and [A.sup.[dagger]] in a computationally responsible manner. We begin
by computing the full QR factorization of the model matrix A. In
particular,
[MATHEMATICAL EXPRESSION NOT REPRODUCIBLE IN ASCII]
where [[Q.sub.1] [Q.sub.2]] is an m x m orthogonal matrix, R is n x
n, upper-triangular, and invertible, and n is an n x n permutation. It
is easily verified that
K = [Q.sub.2][Q.sup.T.sub.2] U,
L = [Q.sub.1] [R.sup.-T] [PI]V,
and these forms should be used in any implementation. The Newton
step can then be calculated using either (3.5) or (3.6) to construct J,
and either (3.10) or (3.11) to construct H. It should be noted that one
clear advantage to using (3.6) in conjunction with (3.11) is that one
need not compute a full QR factorization in that case, a skinny QR
factorization will suffice. Of course, the speed advantage of, for
example, a modified Gram-Schmidt approach needs to be weighed against
the numerical risk if the model matrix A is poorly conditioned. We note,
anecdotally, that this has not been an issue in any of the examples we
have considered.
4. A brief example. As a brief example we consider the problem of
fitting a Bezier curve to ordered two-dimensional data. This problem is
carefully developed in [2] and we refer the reader there for the
details. The basic problem can be stated as follows. Given an ordered
set of data points [{([w.sub.j], [v.sub.i])}.sup.m.sub.i=1] and a
non-negative integer n < m, find nodes
[{[[alpha].sub.i]}.sup.m.sub.i=1] and control points [{([x.sub.j],
[y.sub.j])}.sup.n.sub.i=1] that minimize
(4.1) [MATHEMATICAL EXPRESSION NOT REPRODUCIBLE IN ASCII]
where [B.sup.n.sub.j] ([alpha]) is the j'th Bernstein
polynomial of degree n, that is,
[MATHEMATICAL EXPRESSION NOT REPRODUCIBLE IN ASCII].
It is clear that the control points enter the problem linearly and
the nodes are the non-linear parameters.
The critical observation is that all of the mixed partial
derivatives taken with respect to the nodes are identically zero. By
exploiting this fact and the other structure inherent to the problem we
find that we can compute the full-Newton step using only 50% more time
than computing the Gauss-Newton step. This in turn indicates that a 33%
reduction in iterations will be the break even point for switching to a
full-Newton code.
We tested this by using an existing Gauss-Newton code developed in
[2] and modifying it to use the full-Newton step; we refer the reader to
[2] for the details of the underlying algorithm. We then applied it to a
data set containing 23 points taken from an experiment on a reacting
chemical system which may have multiple steady states; see [9]. The
experiment samples the steady state oxidation rate R achieved by a
catalytic system for an input concentration of carbon monoxide
[C.sub.CO]. The data is in log-log format and we show the results of
fitting with a sixth-degree Bezier curve. Note that in (4.1) this
corresponds to m = 23 and n = 6. Even though the squared residual at the
solution is very small (0.00217) the full-Newton algorithm converges in
just 32 iterations as compared to 174 iterations for the Gauss-Newton
code. This greatly enhanced convergence more than makes up for the
increase in per iteration cost.
5. A full-Newton algorithm for discrete least squares rational
approximation. As a
specific detailed example we now consider the important problem of
fitting a rational function of the form
y = [c.sub.1] + [c.sub.2]t + ... + [c.sub.n][t.sup.n-1]/1 +
[[alpha].sub.1]t + [[alpha].sub.2][t.sup.2] + ... +
[[alpha].sub.k][t.sup.k]
to data in the least-squares sense. This is clearly a separable
non-linear least squares problem with the coefficients of the numerator
entering the problem linearly and those of the denominator non-linearly.
The model matrix A for this problem can be written in the suggestive
form A = [D.sup.-1] N where the numerator matrix N is an m x n
Vandermonde matrix,
(5.1) [MATHEMATICAL EXPRESSION NOT REPRODUCIBLE IN ASCII]
and the denominator matrix [D.sup.-1] is an m x m diagonal matrix,
[MATHEMATICAL EXPRESSION NOT REPRODUCIBLE IN ASCII],
where q(t) = 1 + [[alpha].sub.1]t + [[alpha].sub.2][t.sup.2] + ...
+ [[alpha].sub.k][t.sup.k].
We define T to be the m x m diagonal matrix,
[MATHEMATICAL EXPRESSION NOT REPRODUCIBLE IN ASCII].
It is easily verified that
(5.2) [A.sub.j] = - [D.sup.-2][T.sup.j]N
= - [T.sup.j] [D.sup.-1] A
and
(5.3) [A.sub.i,j] = 2[D.sup.-3][T.sup.i+_j]N
= 2[T.sup.i+j] [D.sup.-2] A.
Combining (5.2) with (3.1), we see that the jth column of U is
given by
(5.4) - [T.sup.j] [D.sup.-1] Ac,
and also, combining (5.2) with (3.2), we see that the jth columnofV
is given by
- [A.sup.T] [D.sup.-1] [T.sup.j] r.
It will be convenient to define an intermediate quantity [??] whose
jth column is given by
(5.5) - [D.sup.-1] [T.sup.j]r.
We note then that
(5.6) V = [A.sup.T] [??].
Using (5.3) in (3.8) yields
= [??](i, j) = 2[r.sup.T] [T.sup.i+j] [D.sup.-2] Ac
= 2[r.sup.T] [T.sup.i] [D.sup.-1] [T.sup.j] [D.sup.-1] Ac.
It is not difficult to see then that
(5.7) [??] = 2[[??].sup.T] U.
Close inspection of (5.4) and (5.5) reveals that the columns of U
and [??] are just diagonally scaled columns of a Vandermonde matrix. We
introduce the matrix M defined by
(5.8) [MATHEMATICAL EXPRESSION NOT REPRODUCIBLE IN ASCII].
It follows that
(5.9) U = - [PHI] [D.sup.-1] M,
where [PHI] = diag (Ac), and also that
(5.10) [??] = -[phi] [D.sup.-1] M,
where [phi] = diag(r). Moreover, we note that combining (5.6) with
(3.4) yields
(5.11) L = [([A.sup.[dagger]).sup.T] [A.sup.T] [??]
= P[??],
where P is the current projection, i.e., P =
[Q.sub.1][Q.sup.T.sub.1]. Using (5.11) in (3.6) and invoking the
idempotence ofP gives
J = - (U - P(U - [??])).
Next, we invoke (5.9) and (5.10) to factor [D.sup.-1]M from the
right to get a formula for the Jacobian,
(5.12) J = ([PHI] - P([PHI] - [phi])) [D.sup.-1]M.
Next, using (5.7) and (5.11) in conjunctionwith the idempotence ofP
in (3.11) gives
H = [U.sup.T] U - [(U - [??]).sup.T] P (U - [??]) - 2[[??].sup.T]
U.
We then invoke (5.9) and (5.10) to factor [D.sup.-1]M from the
right and its transpose from the left to get
(5.13) H = [M.sup.T] [D.sup.-1] ([[PHI].sup.2] - ([PHI] -
[phi])P([PHI] - [phi]) - 2[PHI][phi]) [D.sup.-1] M, or, equivalently,
H = [M.sup.T][D.sup.-1] (([PHI] - [phi])[P.sup.[perpendicular
to])([PHI] - [phi]) - [[phi].sup.2]) [D.sup.-1] M.
5.1. The algorithm. With the formulas necessary for computing the
Newton step for rational approximation we developed a basic MATLAB code
implementing the full-Newton approach. The first algorithm uses (5.12)
for J and (5.13) for H. The user specifies the degree of the numerator n
- 1 and the denominator k and may supply a starting guess for the
coefficients of the denominator, [[alpha].sub.i] for i = 1, ..., q. If
no starting guess is given the code generates one by solving the widely
known linear least squares extension of the Newton-Pade approximant. In
particular, we find the least squares solution to
[m.summation over (i=1)] [([c.sub.1] + [c.sub.2][t.sub.i] + ... +
[c.sub.n][t.sup.n-1.sub.i] - [y.sub.i] [([[alpha].sub.1] t +
[[alpha].sub.2][t.sup.2.sub.i] + ... + [[alpha].sub.k][t.sup.k.sub.i]) -
[y.sub.i].sup.2].
This is implemented by solving
(5.14) [MATHEMATICAL EXPRESSION NOT REPRODUCIBLE IN ASCII],
where Y is diagonal matrix with [y.sub.i,i] = [y.sub.i], and N and
M are defined in (5.1) and (5.8), respectively.
Once we have a current guess for a each successive iteration begins
by computing J and H using (5.12) and (5.13). We then evaluate the
gradient using J and the current residual, and then solve for the Newton
step using the Cholesky factorization ofH. One weakness of full-Newton
methods is that H may not be positive definite in a region of mixed
curvature and hence the Newton direction may not be a descent direction.
Although there are a number of methods for dealing with this we have
implemented a very simple one that we have found to be very reliable. In
particular, if the Cholesky factorization fails we regularize H by
arbitrarily shifting its spectrum to the right by a bit more than the
absolute value of its leftmost eigenvalue (we used [absolute value of
1.2 x [[lambda].sub.min]]). Once that is done we compute the Cholesky
factorization and solve for the Newton step.
Once a descent direction has been established it is essential to
use some form of step size control in order to get reasonable
convergence. We have chosen to use a backtracking line search and have
found this to work surprisingly well. We omit the implementation details
which can be found in [3].
Finally, as a stopping criterion we terminated the algorithm when
the relative change in the squared norm of the residual was less than
10-12.
For purposes of comparison we also created a Gauss-Newton code,
identical to the full-Newton code except that all calculations not
needed for computing the Gauss-Newton direction are removed. In
particular, after constructing the Jacobian with (5.12) we set H =
[J.sup.T] J and proceed to solve for the Gauss-Newton step using the
Cholesky factorization. (3) No regularization is needed in this case so
that code is also removed.
5.2. Experimental results. We begin by considering two examples
using measured data, both of these are taken from the NIST Statistical
Reference Datasets for Nonlinear Regression. (4) The first involves
measured data from an experiment on electron mobility. This data is due
to R. Thurber and is considered to be of higher difficulty. There are 37
data points to be fit to a model of the form
y = [c.sub.0] + [c.sub.1]t + [c.sub.2][t.sup.2] +
[c.sub.3][t.sup.3]/1 + [[alpha].sub.1]t + [[alpha].sub.2][t.sup2] +
[[alpha].sub.1][t.sup.3].
We consider two scenarios in order to compare the approaches. In
the first case we use the starting guess suggested in the NIST testbed
of [[alpha].sub.0] = [[1 .4 .05].sup.T]. In this case we converge to the
known solution of [alpha] = [[.9663 .3980 .0497].sup.T] in 6 iterations
with the full Newton code as compared to 20 iterations with the
Gauss-Newton code. In this case the full Newton code resorts to
regularization for the first two steps. In the second case we provide no
starting guess and hence the codes generate one using (5.14). In this
case the Full-Newton code converges in 7 iterations as compared to 30
for Gauss-Newton. In this case the full Newton code resorts to
regularization only for the first step. It is worth noting that the
squared norm of the residual at the solution is roughly 5642.7. Since it
is well known that the full-Newton approach generally outperforms
Gauss-Newton when the residuals are large this outcome is not
surprising, although its magnitude is noteworthy.
To get a baseline notion of relative speed both codes were tested
using MATLAB 7.0.4 on a Pentium(R) 4 with 504MB of RAM. Both scenarios
were run and the average time per run over 10,000 runs was calculated.
In the first scenario both codes run with the starting guess above, in
the second scenario both codes were run with no starting guess. The
results are summarized in Table 5.1, where we show the number of steps
to converge, the average time (measured in milliseconds), and the time
ratio.
The second example involves measured data from a NIST study
involving scanning electron microscope line width standards. This data
set is due to R. Kirby and is considered to be of average difficulty.
There are 151 data points to be fit to a model of the form
y = [c.sub.0] + [c.sub.1]t + [c.sub.2][t.sup.2]/1 +
[[alpha].sub.1]t + [[alpha].sub.2][t.sup.2]
We consider the same two scenarios in order to compare the
approaches. In the first case we use the starting guess suggested in the
NIST testbed of [[alpha].sub.0] = [[-.0015 .00002].sup.T]. The results
of this experiment are summarized in Table 5.2. It is worth noting that
the squared norm of the residual at the solution is roughly 3.905 and we
see the full-Newton approach comparing less favorably to Gauss-Newton
when the residual is small as we might generally expect. It is also
noteworthy that for this data set the full-Newton code did not resort to
regularization in either case.
As an additional test we consider fitting a rational of the form
y = [c.sub.0] + [c.sub.1]t + [c.sub.2][t.sup.2]/1 +
[[alpha].sub.1]t + [[alpha].sub.2][t.sup.2]
to the function y = [square root of 1 = [x.sup.2]] on the interval
[-1,1] and also to the function y = cos x on the interval [-[pi], [pi]].
To give some notion of scaling we do this for three different cases:
using 11, 101, and 501 evenly spaced points over the interval.
Regularization was not required in any of these examples. The results
are summarized in Tables 5.3 and 5.4.
Finally, we consider a function with more complex behavior. In
particular, we will try to fit y = [e.sup.-x cos 4x] on the interval [0,
[pi]]. We begin by trying to fit the function with a rational with both
numerator and denominator being degree 4 and using 20 equally spaced
points. In this case the full-Newton algorithm converges in just 12
iterations with a final squared residual of 6.6916 x [10.sup.-1] and it
resorts to regularization for each of the first four steps. The
Gauss-Newton algorithm converges in 13 iterations but to a very
unsatisfying approximation with a squared residual of 6.9470 (there are
two poles inside the interval). This is not unusual in our experience;
anecdotally, we have observed that the full-Newton algorithm is rather
less likely to stall out far from the optimal solution.
We ran this test a second time using 100 evenly spaced points and
using a higher order rational (numerator and denominator both degree 6).
In this case both algorithms converge to a nice solution with a squared
residual of 2.3965 x [10.sup.-1] but the full-Newton code requires only
20 iterations in contrast to the 25 required by the Gauss-Newton code,
moreover it runs in just 71% of the time required by the latter. It is
very interesting to note that in this experiment the full-Newton code
resorts to regularization for each of the first 10 steps. However, even
though it is regularizing fully half of the time it still noticeably
outperforms the Gauss-Newton code.
6. Conclusion. We have derived a full-Newton approach for separable
non-linear least squares problems. The derivation results in a
surprisingly compact formula for computing the Newton step. Experiments
show that the method can substantially improve the convergence rate at
the expense of additional per iteration costs. It is seen that for
problems where the second partial derivatives of the model matrix have
special structure the additional costs of using a full-Newton approach
may be minimal and hence the improved convergence rate can lead to
substantially faster solutions. This was briefly demonstrated using an
example from parametric curve fitting where all of the mixed partials
are identically zero (and structured).
We then applied our derivation of the Newton step to the problem of
discrete least squares rational approximation. This very important
problem has a structure that leads to a surprisingly compact form for
the Newton step. We showed with several examples that the fullNewton
approach can significantly outperform the Gauss-Newton approach.
REFERENCES
[1] A. Bjorck, Numerical Methods for Least Squares Problems, SIAM,
Philadelphia, PA, 1996.
[2] C. F. Borges and T. A. Pastva, Total least squares fitting of
Bezier and B-spline curves to ordered data, Comput. Aided Geom. Design,
19 (2002), pp. 275-289.
[3] J. E. Dennis and R. B. Schnabel, Numerical Methods for
Unconstrained Optimization and Nonlinear Equations, Prentice-Hall,
Englewood Cliffs, NJ, 1983.
[4] G. H. Golub and V. Pereyra, The differentiation of
pseudo-inverses and nonlinear least-squares problems whose variables
separate, SIAM J. Numer. Anal., 10 (1973), pp. 413-432.
[5] --, Separable nonlinear least squares: the variable projection
method and its applications, Inverse Problems, 19 (2003), pp. R1-R26.
[6] G. H. Golub and C. F. Van Loan, Matrix Computations, Third ed.,
The Johns Hopkins University Press, Baltimore, MD, 1996.
[7] R. J. Hanson and C. L. Lawson, Extensions and applications of
the Householder algorithm for solving linear least squares problems,
Math. Comp., 23 (1969), pp. 787-812.
[8] L. Kaufman, A variable projection method for solving separable
nonlinear least squares problems, BIT, 15 (1975), pp. 49-57.
[9] S. P. Marin and P. W. Smith, Parametric approximation of data
using ODR splines, Comput. Aided Geom. Design, 11 (1994), pp. 247-267.
[10] C. Nelson, Contour encoded compression and transmission,
Master's thesis, Department of Computer Science, Brigham Young
University, Provo, UT, 2006.
[11] A. Ruhe and P.-A. Wedin, Algorithms for separable nonlinear
least squares problems, SIAM Rev., 22 (1980), pp. 318-337.
* Received August 30, 2008. Accepted for publication December 15,
2008. Published online on April 14, 2009. Recommended by M. Benzi.
(1) This is a formal derivation and it is worth noting that one
should not form generalized inverses as a method for computing solutions
to linear least-squares problems.
(2) The assumption that A be full-rank is important here.
CARLOS F. BORGES ([dagger])
(3) We note that solving the normal equations is not a
computationally responsible approach. However, for the purpose of the
experiments that follow we found that carefully solving for the
Gauss-Newton step did not change the convergence or final results but
did add additional overhead which biased the time comparisons rather
unfairly to the benefit of the Full-Newton algorithm. Therefore we
produced all of the timings using Cholesky in order that the algorithms
would be competing on a level playing field.
(4) This data is available online at www.itl.nist.gov.
([dagger]) Department of Applied Mathematics, Naval Postgraduate
School, Monterey, CA 93943 (borges@nps.edu).
Table 5.1
Experimental Results for the Thurber data set
Full-Newton Gauss-Newton Ratio
Steps Time Steps Time
Case I 6 1.5454 20 2.9902 52%
Case II 7 1.5007 30 4.5454 33%
Table 5.2
Experimental Results for the Kirby data set
Full-Newton Gauss-Newton Ratio
Steps Time Steps Time
Case I 5 2.4767 7 3.8659 64%
Case II 4 2.2382 7 4.4928 50%
Table 5.3
Experimental Results for the y = [square root of 1 - [chi square]]
data set
# Points Full-Newton Gauss-Newton
Steps Time Steps Time
11 4 0.6268 5 0.8305
101 4 1.3843 8 2.6605
501 4 98.211 7 168.78
# Points Ratio Residual
11 75% 8.91 x [10.sup.-4]
101 52% 3.68 x [10.sup.-2]
501 58% 8.50 x [10.sup.-2]
Table 5.4
Experimental Results for the y = cos x data set
# Points Full-Newton Gauss-Newton
Steps Time Steps Time
11 4 0.6356 7 0.8549
101 4 1.3955 7 2.3812
501 4 86.802 7 167.88
# Points Ratio Residual
11 74% 2.42 x [10.sup.-2]
101 59% 1.30 x [10.sup.-1]
501 52% 5.94 x [10.sup.-1]