Abstract:

We consider a class of non-linear least squares problems that are
widely used in fitting experimental data. A defining characteristic of
the models we will consider is that the solution parameters may be
separated into two classes, those that enter the problem linearly and
those that enter non-linearly. Problems of this type are known as
separable non-linear least squares (SNLLS) problems and are often solved
using a Gauss-Newton algorithm that was developed in Golub and Pereyra
[SIAM J. Numer. Anal., 10 (1973), pp. 413-432] and has been very widely
applied. We develop a full-Newton algorithm for solving this problem.
Exploiting the structure of the general problem leads to a surprisingly
compact algorithm which exhibits all of the excellent characteristics of
the full-Newton approach (e.g. rapid convergence on problems with large
residuals). Moreover, for certain problems of quite general interest,
the per iteration cost for the full-Newton algorithm compares quite
favorably with that of the Gauss-Newton algorithm. We explore one such
problem, that of discrete least-squares fitting of rational functions.

Key words. separable nonlinear least squares, rational approximation

AMS subject classifications. 65F20, 65D10, 41A20

Key words. separable nonlinear least squares, rational approximation

AMS subject classifications. 65F20, 65D10, 41A20

Article Type:

Report

Subject:

Least squares
(Research)

Approximation theory (Research)

Approximation theory (Research)

Author:

Borges, Carlos F.

Pub Date:

01/01/2009

Publication:

Name: Electronic Transactions on Numerical Analysis Publisher: Institute of Computational Mathematics Audience: Academic Format: Magazine/Journal Subject: Computers; Mathematics Copyright: COPYRIGHT 2009 Institute of Computational Mathematics ISSN: 1068-9613

Issue:

Date: Jan, 2009 Source Volume: 35

Topic:

Event Code: 310 Science & research

Geographic:

Geographic Scope: United States Geographic Code: 1USA United States

Accession Number:

230063919

Full Text:

1. Introduction. We wish to consider fitting a set of experimental
data ([t.sub.i], [y.sub.i]) for i = 1,2, ..., m with a model of the form

y = [n.summation over (j=1)] [c.sub.j][[empty set].sub.j] ([alpha], t),

where [alpha] = [[[[alpha].sub.1] [[alpha].sub.2] ... [[alpha].sub.d]].sup.T] is a set of parameters that enter non-linearly into the model, and the coefficients c = [[[c.sub.1] [c.sub.2] ... [c.sub.n]].sup.T] obviously enter linearly. This problem can be more usefully viewed by constructing a model matrix A whose i, j element is given by A(i, j) = [[empty set].sub.j] ([alpha], [t.sub.i]). It is clear that A is a function of both [alpha] and the [t.sub.i] although we suppress that in the notation for the sake of clarity. As the [t.sub.i] are fixed, our goal is to find values for [alpha] and c that minimize the length of the residual r = y - Ac where y = [[[y.sub.1] [y.sub.2] ... [y.sub.m]].sup.T].

It is clear that for a fixed value of a the linear parameters c can be found by solving a linear least squares problem. Linear least squares problems are well understood and a variety of excellent methods are known for their solution [1, 6, 7].

Henceforth, we shall assume that A is a full rank [R.sup.m x n] matrix with m > n and let P be the projection matrix that projects vectors onto the range of A. In particular, we let P = [AA.sup.[dagger]] where [A.sup.[dagger]] = [([A.sup.T] A).sup.-1] [A.sup.T] is the Moore-Penrose generalized inverse (1) of A which can be used to solve the linear least squares problem yielding c = [A.sup.[dagger]]y.

We note that P is both symmetric ([P.sup.T] = P) and idempotent
([P.sup.2] = P). The orthogonal projector [P.sup.[perpendicular to]]
that projects onto the null space of [A.sup.T] is given by
[P.sup.[perpendicular to]] = I - P.

It is clear at this point that the residual, which is in fact a function of [alpha], is given by r = [P.sup.[perpendicular to]]y and we have effectively removed the linear parameters and now need to solve a non-linear least squares problem to find [alpha]. One particularly useful technique for solving this problem is developed in [4] and involves applying the Gauss-Newton method to the variable projection functional, in essence the norm of [P.sup.[perpendicular to]]y. The convergence of this approach and of a variant first proposed in [8] are elegantly analyzed in [11]. This variable projection approach has found wide application in a variety of fields; see the excellent overview of the history and applications in [5]. In the sequel we shall develop a full-Newton approach to solving this problem which follows by extending the ideas in [4].

Although the algorithm we derive is quite general, the specific practical problem that motivated this development is rapid non-linear curve fitting to smaller data sets. This is a problem of serious interest in systems that use contour encoding for data compression where one must fit curves to many thousands of small data sets. One such system is developed in [10] for compression of hand-written documents and it uses the Gauss-Newton algorithm developed in [2] for fitting B-spline curves to pen strokes to achieve the needed efficiencies for a practical compression algorithm. Another related application is compressing track information for moving objects, etc. Although each individual problem may be small, the huge number of problems that need to be solved makes it worthwhile to apply more sophisticated techniques to their solution.

In light of this motivation we demonstrate the usefulness of the full-Newton approach first by briefly applying it to the fitting Bezier curves that was described in [2]. We then carefully develop a specific algorithm for discrete rational approximation in the least-squares sense which we demonstrate on a few simple examples.

2. The Newton step. The key to deriving a full Newton algorithm for solving a nonlinear least squares problem is to build a quadratic model for the squared norm of the residual at the current point ac and minimize that at each step; see, for instance, [3]. The Newton step is given by

(2.1) [[alpha].sub.N] = - [H.sup.-1][J.sup.T] r,

where J = [nabla]r is the Jacobian and

(2.2) H = [J.sup.T] J + S,

with

(2.3) S = [m.summation over (i=1)] [r.sub.i] ([alpha]) [[nabla].sup.2] [r.sub.i]([alpha]).

In order to construct J and S we will need the first and second partial derivatives of [P.sup.[perpendicular to]] with respect to the non-linear parameters [[alpha].sub.i]. Henceforth, we shall use the subscript i on a matrix to denote differentiation with respect to the variable [[alpha].sub.i].

To compute the first partial we follow the derivation presented in [4]. We begin by noting that the projection matrix satisfies the equation PA = A. Differentiating both sides of this equation with respect to [[alpha].sub.i] yields [P.sub.i]A + P [A.sub.i] = [A.sub.i]. Subtracting P [A.sub.i] from both sides and regrouping yields [P.sub.i]A = (I - P) [A.sub.i]. Which implies that [P.sub.i]A = [P.sup.[perpendicular to]] [A.sub.i]. Multiplying both sides on the right by [A.sup.[dagger]] and using [AA.sup.[dagger]] = P yields [P.sub.i]P = [P.sup.[perpendicular to]] [A.sub.i][A.sup.[dagger]]. Now transposing and exploiting symmetry on the left gives [PP.sub.i] = [([P.sup.[perpendicular to]] [A.sub.i][A.sup.[dagger]]).sup.T]. And finally, since P is idempotent ([P.sup.2] = P) differentiation yields the widely known result from [4],

(2.4) = [P.sub.i] = [P.sub.i]P + [PP.sub.i] = [P.sup.[perpendicular to]] [A.sub.i][A.sup.[dagger]] + [([P.sup.[perpendicular to]] [A.sub.i][A.sup.[dagger]]).sup.T].

Now we need to extend this derivation in order to compute the second partial derivative. We begin by considering (2.4) and using the product rule to take its partial derivative with respect to [[alpha].sub.j]. Therefore,

(2.5) [MATHEMATICAL EXPRESSION NOT REPRODUCIBLE IN ASCII]

Now we note that

[([P.sup.[perpendicular to]]).sub.j] = [([I - P).sub.j] = - [P.sub.j],

which we may use to simplify (2.5) giving

(2.6) [MATHEMATICAL EXPRESSION NOT REPRODUCIBLE IN ASCII].

Next, we need to compute the partial derivative of [A.sup.[dagger]] with respect to [[alpha].sub.j]. We begin by differentiating both sides of P = [AA.sup.[dagger]] with respect to [[alpha].sub.j], which yields

[P.sub.j] = [A.sub.j] [A.sup.[dagger]] + A [([A.sup.[dagger]]).sub.j].

Rearranging gives

A [([A.sup.[dagger]]).sub.j] = [P.sub.j] - [A.sub.j] [A.sup.[dagger]].

Multiplying on the left by [A.sup.[dagger]] and invoking the fact that [A.sup.[dagger]] A = I yields (2)

[([A.sup.[dagger]]).sub.j] = [A.sup.[dagger]][P.sub.j] - [A.sup.[dagger]] [A.sub.j] [A.sup.[dagger]].

Substituting for [P.sub.j] from (2.4) yields

[([A.sup.[dagger]]).sub.j] = [A.sup.[dagger]] ([P.sup.[perpendicular to]] [A.sub.j] [A.sup.[dagger]] + [([P.sup.[perpendicular to]).sup.T]) - [A.sup.[dagger]] [A.sub.j] [A.sup.[dagger]].

And finally, invoking [A.sup.[dagger]] [P.sup.[perpendicular to]] = 0 yields

[([A.sup.[dagger]]).sub.j] = [A.sup.[dagger]] [([P.sup.[perpendicular to]] [A.sub.j] [A.sup.[dagger]]).sup.T] - [A.sup.[dagger]] [A.sub.j] [A.sup.[dagger]].

We can now substitute this expression into (2.6) giving

(2.7) [MATHEMATICAL EXPRESSION NOT REPRODUCIBLE IN ASCII].

In the interest of brevity, we define

[MATHEMATICAL EXPRESSION NOT REPRODUCIBLE IN ASCII]

Substituting this definition into (2.7) yields a compact expression for the second partial derivative of the projector. In particular,

(2.8) [P.sub.i,j] = [D.sub.i,j] + [D.sup.T.sub.i,j].

Now that we have expressions for the first and second partial derivatives we may proceed to construct the Newton step. To solve (2.1) we need to construct the J and S matrices. We begin by defining two useful quantities. First, the current residual is given by

(2.9) r = [P.sup.[perpendicular to]] y,

and second, the current solution is given by

(2.10) c = [A.sup.[dagger]] y.

Now, since J = [nabla]r it follows that the jth column of J is given by - [P.sub.j]y which, invoking (2.4), is

(2.11) - [P.sup.[perpendicular to]] [A.sub.j]c - [(A.sup.[dagger]]).sup.T] [A.sup.T.sub.j] r.

To construct S we note that following (2.3) the i, j element of S is given by

S (i, j) = - [r.sup.T] [P.sub.ij] y.

Making use of (2.8), we find that

(2.12) S(i, j) = - [r.sup.T] ([D.sub.i,j] + [D.sup.T.sub.i,j]) y.

We will evaluate this in two parts. Consider first the quantity

(2.13) [MATHEMATICAL EXPRESSION NOT REPRODUCIBLE IN ASCII].

Invoking [P.sup.[perpendicular to]]r = r and (2.9) and (2.10), we reduce (2.13) to

(2.14) - [r.sup.T] [D.sub.jj] y = [r.sup.]T [P.sub.j] [A.sub.i]c - [r.sup.T] [A.sub.ij]c - [r.sup.T] [A.sub.i][A.sup.[dagger]] [(A.sub.j] [A.sup.[dagger]]).sup.T] r + [r.sup.T] [A.sub.i][A.sup.[dagger]] [A.sub.j]c.

Substituting in for [P.sub.j] from (2.4) yields

(2.15) [MATHEMATICAL EXPRESSION NOT REPRODUCIBLE IN ASCII].

Now we note that [A.sup.[dagger]] r = 0 and this reduces to

(2.16) [MATHEMATICAL EXPRESSION NOT REPRODUCIBLE IN ASCII].

Next, we consider the second quantity. Omitting the details, it can be reduced to

(2.17) [MATHEMATICAL EXPRESSION NOT REPRODUCIBLE IN ASCII].

Substituting (2.16) and (2.17) into (2.12) and combining terms yields

(2.18) [MATHEMATICAL EXPRESSION NOT REPRODUCIBLE IN ASCII].

It is important to exploit symmetries and use care when constructing the S matrix in order to minimize the operation count and enhance accuracy.

3. Implementation details. Implementation details are important. First and foremost is the careful construction of both the J and S matrices. A simple observation can greatly simplify this process; in particular, terms of the form [A.sub.j]c and [A.sup.T.sub.j]r are ubiquitous in (2.11) and (2.18). We begin by constructing two m x d auxiliary matrices. We shall call the first matrix U. Its jth column is given by

(3.1) [A.sub.j]c.

We shall call the second matrix V. Its jth column is given by

(3.2) [A.sup.T.sub.j]r.

Using these definitions in (2.11), we see that

J = - ([P.sup.[perpendicular to]] U + [([A.sup.[dagger]]).sup.T] V) .

At this point we will introduce two intermediate quantities to simplify the notation going forward. In particular, we define

(3.3) K = [P.sup.[perpendicular to]] U

(3.4) L = [([A.sup.[dagger]]).sup.T] V.

With these definitions, we find that

(3.5) J = - (K + L),

or, equivalently,

(3.6) J = - (U - P(U - L)).

It is also true that

(3.7) [J.sup.T] J = [K.sup.T] K + [L.sup.T] L,

since the columns of L are clearly elements of the range of A, and those of K are in the null space of [A.sup.T].

Now we examine (2.18). We begin by considering the first term and noting that it can be written in terms of the jth column of V, which we shall denote by [v.sub.j] and the ith column of U, which we shall denote by [u.sub.i],

[r.sup.T] [A.sub.j] [A.sup.[dagger]] [A.sub.i]c = [v.sup.T.sub.j] [A.sup.[dagger]] [u.sub.i],

which is the i, j element of the matrix [U.sup.T] L. By the same argument, the quantity in the second term of (2.18) is the i, j element of the matrix [L.sup.T] U. Similarly, the quantity in the third term of (2.18) is twice the i, j element of the matrix [L.sup.T] L.

Hence, if we let S be a matrix whose i, j element is given by

(3.8) [??](i, j) = [r.sup.T] [A.sub.ij] c,

then it follows that (2.18) leads to

(3.9) S = [U.sup.T] L + [L.sup.T] U - 2[L.sup.T] L - [??].

Using (3.7) and (3.9) in (2.2) gives

(3.10) H = [U.sup.T] L + [L.sup.T] U + [K.sup.T] K - [L.sup.T] L - [??],

or, noting that PL = L, more compactly,

(3.11) H = [U.sup.T] U - [(U - L).sup.T] P (U - L) - [??].

Note that the most significant difference, in terms of computational load, between the Full Newton and the Gauss-Newton approaches is the construction of the [??] matrix. This requires working with O([d.sup.2]) partial derivatives (both first and second order partials) as opposed to just O(d) first partials for the Gauss-Newton approach. This can be significant if all of the second partial derivatives are full rank. However, in a number of applications many of the mixed partial derivatives vanish, are sparse, or have low rank, and there can be a significant savings by exploiting such structure. In Section 4, we present an example from [2] involving the fitting of a Bezier curve to ordered two-dimensional data where all of the mixed partial derivatives are identically zero. In this specific example we see only a 50% increase in the work per step by using a Full-Newton approach and reap all the benefits of greatly accelerated convergence. In order to maximize efficiencies for any specific model one must take full advantage of any special structure of the partial derivatives.

The other critical step is to deal with [P.sup.[perpendicular to] and [A.sup.[dagger]] in a computationally responsible manner. We begin by computing the full QR factorization of the model matrix A. In particular,

[MATHEMATICAL EXPRESSION NOT REPRODUCIBLE IN ASCII]

where [[Q.sub.1] [Q.sub.2]] is an m x m orthogonal matrix, R is n x n, upper-triangular, and invertible, and n is an n x n permutation. It is easily verified that

K = [Q.sub.2][Q.sup.T.sub.2] U,

L = [Q.sub.1] [R.sup.-T] [PI]V,

and these forms should be used in any implementation. The Newton step can then be calculated using either (3.5) or (3.6) to construct J, and either (3.10) or (3.11) to construct H. It should be noted that one clear advantage to using (3.6) in conjunction with (3.11) is that one need not compute a full QR factorization in that case, a skinny QR factorization will suffice. Of course, the speed advantage of, for example, a modified Gram-Schmidt approach needs to be weighed against the numerical risk if the model matrix A is poorly conditioned. We note, anecdotally, that this has not been an issue in any of the examples we have considered.

4. A brief example. As a brief example we consider the problem of fitting a Bezier curve to ordered two-dimensional data. This problem is carefully developed in [2] and we refer the reader there for the details. The basic problem can be stated as follows. Given an ordered set of data points [{([w.sub.j], [v.sub.i])}.sup.m.sub.i=1] and a non-negative integer n < m, find nodes [{[[alpha].sub.i]}.sup.m.sub.i=1] and control points [{([x.sub.j], [y.sub.j])}.sup.n.sub.i=1] that minimize

(4.1) [MATHEMATICAL EXPRESSION NOT REPRODUCIBLE IN ASCII]

where [B.sup.n.sub.j] ([alpha]) is the j'th Bernstein polynomial of degree n, that is,

[MATHEMATICAL EXPRESSION NOT REPRODUCIBLE IN ASCII].

It is clear that the control points enter the problem linearly and the nodes are the non-linear parameters.

The critical observation is that all of the mixed partial derivatives taken with respect to the nodes are identically zero. By exploiting this fact and the other structure inherent to the problem we find that we can compute the full-Newton step using only 50% more time than computing the Gauss-Newton step. This in turn indicates that a 33% reduction in iterations will be the break even point for switching to a full-Newton code.

We tested this by using an existing Gauss-Newton code developed in [2] and modifying it to use the full-Newton step; we refer the reader to [2] for the details of the underlying algorithm. We then applied it to a data set containing 23 points taken from an experiment on a reacting chemical system which may have multiple steady states; see [9]. The experiment samples the steady state oxidation rate R achieved by a catalytic system for an input concentration of carbon monoxide [C.sub.CO]. The data is in log-log format and we show the results of fitting with a sixth-degree Bezier curve. Note that in (4.1) this corresponds to m = 23 and n = 6. Even though the squared residual at the solution is very small (0.00217) the full-Newton algorithm converges in just 32 iterations as compared to 174 iterations for the Gauss-Newton code. This greatly enhanced convergence more than makes up for the increase in per iteration cost.

5. A full-Newton algorithm for discrete least squares rational approximation. As a

specific detailed example we now consider the important problem of fitting a rational function of the form

y = [c.sub.1] + [c.sub.2]t + ... + [c.sub.n][t.sup.n-1]/1 + [[alpha].sub.1]t + [[alpha].sub.2][t.sup.2] + ... + [[alpha].sub.k][t.sup.k]

to data in the least-squares sense. This is clearly a separable non-linear least squares problem with the coefficients of the numerator entering the problem linearly and those of the denominator non-linearly. The model matrix A for this problem can be written in the suggestive form A = [D.sup.-1] N where the numerator matrix N is an m x n Vandermonde matrix,

(5.1) [MATHEMATICAL EXPRESSION NOT REPRODUCIBLE IN ASCII]

and the denominator matrix [D.sup.-1] is an m x m diagonal matrix,

[MATHEMATICAL EXPRESSION NOT REPRODUCIBLE IN ASCII],

where q(t) = 1 + [[alpha].sub.1]t + [[alpha].sub.2][t.sup.2] + ... + [[alpha].sub.k][t.sup.k].

We define T to be the m x m diagonal matrix,

[MATHEMATICAL EXPRESSION NOT REPRODUCIBLE IN ASCII].

It is easily verified that

(5.2) [A.sub.j] = - [D.sup.-2][T.sup.j]N

= - [T.sup.j] [D.sup.-1] A

and

(5.3) [A.sub.i,j] = 2[D.sup.-3][T.sup.i+_j]N

= 2[T.sup.i+j] [D.sup.-2] A.

Combining (5.2) with (3.1), we see that the jth column of U is given by

(5.4) - [T.sup.j] [D.sup.-1] Ac,

and also, combining (5.2) with (3.2), we see that the jth columnofV is given by

- [A.sup.T] [D.sup.-1] [T.sup.j] r.

It will be convenient to define an intermediate quantity [??] whose jth column is given by

(5.5) - [D.sup.-1] [T.sup.j]r.

We note then that

(5.6) V = [A.sup.T] [??].

Using (5.3) in (3.8) yields

= [??](i, j) = 2[r.sup.T] [T.sup.i+j] [D.sup.-2] Ac

= 2[r.sup.T] [T.sup.i] [D.sup.-1] [T.sup.j] [D.sup.-1] Ac.

It is not difficult to see then that

(5.7) [??] = 2[[??].sup.T] U.

Close inspection of (5.4) and (5.5) reveals that the columns of U and [??] are just diagonally scaled columns of a Vandermonde matrix. We introduce the matrix M defined by

(5.8) [MATHEMATICAL EXPRESSION NOT REPRODUCIBLE IN ASCII].

It follows that

(5.9) U = - [PHI] [D.sup.-1] M,

where [PHI] = diag (Ac), and also that

(5.10) [??] = -[phi] [D.sup.-1] M,

where [phi] = diag(r). Moreover, we note that combining (5.6) with (3.4) yields

(5.11) L = [([A.sup.[dagger]).sup.T] [A.sup.T] [??]

= P[??],

where P is the current projection, i.e., P = [Q.sub.1][Q.sup.T.sub.1]. Using (5.11) in (3.6) and invoking the idempotence ofP gives

J = - (U - P(U - [??])).

Next, we invoke (5.9) and (5.10) to factor [D.sup.-1]M from the right to get a formula for the Jacobian,

(5.12) J = ([PHI] - P([PHI] - [phi])) [D.sup.-1]M.

Next, using (5.7) and (5.11) in conjunctionwith the idempotence ofP in (3.11) gives

H = [U.sup.T] U - [(U - [??]).sup.T] P (U - [??]) - 2[[??].sup.T] U.

We then invoke (5.9) and (5.10) to factor [D.sup.-1]M from the right and its transpose from the left to get

(5.13) H = [M.sup.T] [D.sup.-1] ([[PHI].sup.2] - ([PHI] - [phi])P([PHI] - [phi]) - 2[PHI][phi]) [D.sup.-1] M, or, equivalently,

H = [M.sup.T][D.sup.-1] (([PHI] - [phi])[P.sup.[perpendicular to])([PHI] - [phi]) - [[phi].sup.2]) [D.sup.-1] M.

5.1. The algorithm. With the formulas necessary for computing the Newton step for rational approximation we developed a basic MATLAB code implementing the full-Newton approach. The first algorithm uses (5.12) for J and (5.13) for H. The user specifies the degree of the numerator n - 1 and the denominator k and may supply a starting guess for the coefficients of the denominator, [[alpha].sub.i] for i = 1, ..., q. If no starting guess is given the code generates one by solving the widely known linear least squares extension of the Newton-Pade approximant. In particular, we find the least squares solution to

[m.summation over (i=1)] [([c.sub.1] + [c.sub.2][t.sub.i] + ... + [c.sub.n][t.sup.n-1.sub.i] - [y.sub.i] [([[alpha].sub.1] t + [[alpha].sub.2][t.sup.2.sub.i] + ... + [[alpha].sub.k][t.sup.k.sub.i]) - [y.sub.i].sup.2].

This is implemented by solving

(5.14) [MATHEMATICAL EXPRESSION NOT REPRODUCIBLE IN ASCII],

where Y is diagonal matrix with [y.sub.i,i] = [y.sub.i], and N and M are defined in (5.1) and (5.8), respectively.

Once we have a current guess for a each successive iteration begins by computing J and H using (5.12) and (5.13). We then evaluate the gradient using J and the current residual, and then solve for the Newton step using the Cholesky factorization ofH. One weakness of full-Newton methods is that H may not be positive definite in a region of mixed curvature and hence the Newton direction may not be a descent direction. Although there are a number of methods for dealing with this we have implemented a very simple one that we have found to be very reliable. In particular, if the Cholesky factorization fails we regularize H by arbitrarily shifting its spectrum to the right by a bit more than the absolute value of its leftmost eigenvalue (we used [absolute value of 1.2 x [[lambda].sub.min]]). Once that is done we compute the Cholesky factorization and solve for the Newton step.

Once a descent direction has been established it is essential to use some form of step size control in order to get reasonable convergence. We have chosen to use a backtracking line search and have found this to work surprisingly well. We omit the implementation details which can be found in [3].

Finally, as a stopping criterion we terminated the algorithm when the relative change in the squared norm of the residual was less than 10-12.

For purposes of comparison we also created a Gauss-Newton code, identical to the full-Newton code except that all calculations not needed for computing the Gauss-Newton direction are removed. In particular, after constructing the Jacobian with (5.12) we set H = [J.sup.T] J and proceed to solve for the Gauss-Newton step using the Cholesky factorization. (3) No regularization is needed in this case so that code is also removed.

5.2. Experimental results. We begin by considering two examples using measured data, both of these are taken from the NIST Statistical Reference Datasets for Nonlinear Regression. (4) The first involves measured data from an experiment on electron mobility. This data is due to R. Thurber and is considered to be of higher difficulty. There are 37 data points to be fit to a model of the form

y = [c.sub.0] + [c.sub.1]t + [c.sub.2][t.sup.2] + [c.sub.3][t.sup.3]/1 + [[alpha].sub.1]t + [[alpha].sub.2][t.sup2] + [[alpha].sub.1][t.sup.3].

We consider two scenarios in order to compare the approaches. In the first case we use the starting guess suggested in the NIST testbed of [[alpha].sub.0] = [[1 .4 .05].sup.T]. In this case we converge to the known solution of [alpha] = [[.9663 .3980 .0497].sup.T] in 6 iterations with the full Newton code as compared to 20 iterations with the Gauss-Newton code. In this case the full Newton code resorts to regularization for the first two steps. In the second case we provide no starting guess and hence the codes generate one using (5.14). In this case the Full-Newton code converges in 7 iterations as compared to 30 for Gauss-Newton. In this case the full Newton code resorts to regularization only for the first step. It is worth noting that the squared norm of the residual at the solution is roughly 5642.7. Since it is well known that the full-Newton approach generally outperforms Gauss-Newton when the residuals are large this outcome is not surprising, although its magnitude is noteworthy.

To get a baseline notion of relative speed both codes were tested using MATLAB 7.0.4 on a Pentium(R) 4 with 504MB of RAM. Both scenarios were run and the average time per run over 10,000 runs was calculated. In the first scenario both codes run with the starting guess above, in the second scenario both codes were run with no starting guess. The results are summarized in Table 5.1, where we show the number of steps to converge, the average time (measured in milliseconds), and the time ratio.

The second example involves measured data from a NIST study involving scanning electron microscope line width standards. This data set is due to R. Kirby and is considered to be of average difficulty. There are 151 data points to be fit to a model of the form

y = [c.sub.0] + [c.sub.1]t + [c.sub.2][t.sup.2]/1 + [[alpha].sub.1]t + [[alpha].sub.2][t.sup.2]

We consider the same two scenarios in order to compare the approaches. In the first case we use the starting guess suggested in the NIST testbed of [[alpha].sub.0] = [[-.0015 .00002].sup.T]. The results of this experiment are summarized in Table 5.2. It is worth noting that the squared norm of the residual at the solution is roughly 3.905 and we see the full-Newton approach comparing less favorably to Gauss-Newton when the residual is small as we might generally expect. It is also noteworthy that for this data set the full-Newton code did not resort to regularization in either case.

As an additional test we consider fitting a rational of the form

y = [c.sub.0] + [c.sub.1]t + [c.sub.2][t.sup.2]/1 + [[alpha].sub.1]t + [[alpha].sub.2][t.sup.2]

to the function y = [square root of 1 = [x.sup.2]] on the interval [-1,1] and also to the function y = cos x on the interval [-[pi], [pi]]. To give some notion of scaling we do this for three different cases: using 11, 101, and 501 evenly spaced points over the interval. Regularization was not required in any of these examples. The results are summarized in Tables 5.3 and 5.4.

Finally, we consider a function with more complex behavior. In particular, we will try to fit y = [e.sup.-x cos 4x] on the interval [0, [pi]]. We begin by trying to fit the function with a rational with both numerator and denominator being degree 4 and using 20 equally spaced points. In this case the full-Newton algorithm converges in just 12 iterations with a final squared residual of 6.6916 x [10.sup.-1] and it resorts to regularization for each of the first four steps. The Gauss-Newton algorithm converges in 13 iterations but to a very unsatisfying approximation with a squared residual of 6.9470 (there are two poles inside the interval). This is not unusual in our experience; anecdotally, we have observed that the full-Newton algorithm is rather less likely to stall out far from the optimal solution.

We ran this test a second time using 100 evenly spaced points and using a higher order rational (numerator and denominator both degree 6). In this case both algorithms converge to a nice solution with a squared residual of 2.3965 x [10.sup.-1] but the full-Newton code requires only 20 iterations in contrast to the 25 required by the Gauss-Newton code, moreover it runs in just 71% of the time required by the latter. It is very interesting to note that in this experiment the full-Newton code resorts to regularization for each of the first 10 steps. However, even though it is regularizing fully half of the time it still noticeably outperforms the Gauss-Newton code.

6. Conclusion. We have derived a full-Newton approach for separable non-linear least squares problems. The derivation results in a surprisingly compact formula for computing the Newton step. Experiments show that the method can substantially improve the convergence rate at the expense of additional per iteration costs. It is seen that for problems where the second partial derivatives of the model matrix have special structure the additional costs of using a full-Newton approach may be minimal and hence the improved convergence rate can lead to substantially faster solutions. This was briefly demonstrated using an example from parametric curve fitting where all of the mixed partials are identically zero (and structured).

We then applied our derivation of the Newton step to the problem of discrete least squares rational approximation. This very important problem has a structure that leads to a surprisingly compact form for the Newton step. We showed with several examples that the fullNewton approach can significantly outperform the Gauss-Newton approach.

REFERENCES

[1] A. Bjorck, Numerical Methods for Least Squares Problems, SIAM, Philadelphia, PA, 1996.

[2] C. F. Borges and T. A. Pastva, Total least squares fitting of Bezier and B-spline curves to ordered data, Comput. Aided Geom. Design, 19 (2002), pp. 275-289.

[3] J. E. Dennis and R. B. Schnabel, Numerical Methods for Unconstrained Optimization and Nonlinear Equations, Prentice-Hall, Englewood Cliffs, NJ, 1983.

[4] G. H. Golub and V. Pereyra, The differentiation of pseudo-inverses and nonlinear least-squares problems whose variables separate, SIAM J. Numer. Anal., 10 (1973), pp. 413-432.

[5] --, Separable nonlinear least squares: the variable projection method and its applications, Inverse Problems, 19 (2003), pp. R1-R26.

[6] G. H. Golub and C. F. Van Loan, Matrix Computations, Third ed., The Johns Hopkins University Press, Baltimore, MD, 1996.

[7] R. J. Hanson and C. L. Lawson, Extensions and applications of the Householder algorithm for solving linear least squares problems, Math. Comp., 23 (1969), pp. 787-812.

[8] L. Kaufman, A variable projection method for solving separable nonlinear least squares problems, BIT, 15 (1975), pp. 49-57.

[9] S. P. Marin and P. W. Smith, Parametric approximation of data using ODR splines, Comput. Aided Geom. Design, 11 (1994), pp. 247-267.

[10] C. Nelson, Contour encoded compression and transmission, Master's thesis, Department of Computer Science, Brigham Young University, Provo, UT, 2006.

[11] A. Ruhe and P.-A. Wedin, Algorithms for separable nonlinear least squares problems, SIAM Rev., 22 (1980), pp. 318-337.

* Received August 30, 2008. Accepted for publication December 15, 2008. Published online on April 14, 2009. Recommended by M. Benzi.

(1) This is a formal derivation and it is worth noting that one should not form generalized inverses as a method for computing solutions to linear least-squares problems.

(2) The assumption that A be full-rank is important here.

CARLOS F. BORGES ([dagger])

(3) We note that solving the normal equations is not a computationally responsible approach. However, for the purpose of the experiments that follow we found that carefully solving for the Gauss-Newton step did not change the convergence or final results but did add additional overhead which biased the time comparisons rather unfairly to the benefit of the Full-Newton algorithm. Therefore we produced all of the timings using Cholesky in order that the algorithms would be competing on a level playing field.

(4) This data is available online at www.itl.nist.gov.

([dagger]) Department of Applied Mathematics, Naval Postgraduate School, Monterey, CA 93943 (borges@nps.edu).

y = [n.summation over (j=1)] [c.sub.j][[empty set].sub.j] ([alpha], t),

where [alpha] = [[[[alpha].sub.1] [[alpha].sub.2] ... [[alpha].sub.d]].sup.T] is a set of parameters that enter non-linearly into the model, and the coefficients c = [[[c.sub.1] [c.sub.2] ... [c.sub.n]].sup.T] obviously enter linearly. This problem can be more usefully viewed by constructing a model matrix A whose i, j element is given by A(i, j) = [[empty set].sub.j] ([alpha], [t.sub.i]). It is clear that A is a function of both [alpha] and the [t.sub.i] although we suppress that in the notation for the sake of clarity. As the [t.sub.i] are fixed, our goal is to find values for [alpha] and c that minimize the length of the residual r = y - Ac where y = [[[y.sub.1] [y.sub.2] ... [y.sub.m]].sup.T].

It is clear that for a fixed value of a the linear parameters c can be found by solving a linear least squares problem. Linear least squares problems are well understood and a variety of excellent methods are known for their solution [1, 6, 7].

Henceforth, we shall assume that A is a full rank [R.sup.m x n] matrix with m > n and let P be the projection matrix that projects vectors onto the range of A. In particular, we let P = [AA.sup.[dagger]] where [A.sup.[dagger]] = [([A.sup.T] A).sup.-1] [A.sup.T] is the Moore-Penrose generalized inverse (1) of A which can be used to solve the linear least squares problem yielding c = [A.sup.[dagger]]y.

It is clear at this point that the residual, which is in fact a function of [alpha], is given by r = [P.sup.[perpendicular to]]y and we have effectively removed the linear parameters and now need to solve a non-linear least squares problem to find [alpha]. One particularly useful technique for solving this problem is developed in [4] and involves applying the Gauss-Newton method to the variable projection functional, in essence the norm of [P.sup.[perpendicular to]]y. The convergence of this approach and of a variant first proposed in [8] are elegantly analyzed in [11]. This variable projection approach has found wide application in a variety of fields; see the excellent overview of the history and applications in [5]. In the sequel we shall develop a full-Newton approach to solving this problem which follows by extending the ideas in [4].

Although the algorithm we derive is quite general, the specific practical problem that motivated this development is rapid non-linear curve fitting to smaller data sets. This is a problem of serious interest in systems that use contour encoding for data compression where one must fit curves to many thousands of small data sets. One such system is developed in [10] for compression of hand-written documents and it uses the Gauss-Newton algorithm developed in [2] for fitting B-spline curves to pen strokes to achieve the needed efficiencies for a practical compression algorithm. Another related application is compressing track information for moving objects, etc. Although each individual problem may be small, the huge number of problems that need to be solved makes it worthwhile to apply more sophisticated techniques to their solution.

In light of this motivation we demonstrate the usefulness of the full-Newton approach first by briefly applying it to the fitting Bezier curves that was described in [2]. We then carefully develop a specific algorithm for discrete rational approximation in the least-squares sense which we demonstrate on a few simple examples.

2. The Newton step. The key to deriving a full Newton algorithm for solving a nonlinear least squares problem is to build a quadratic model for the squared norm of the residual at the current point ac and minimize that at each step; see, for instance, [3]. The Newton step is given by

(2.1) [[alpha].sub.N] = - [H.sup.-1][J.sup.T] r,

where J = [nabla]r is the Jacobian and

(2.2) H = [J.sup.T] J + S,

with

(2.3) S = [m.summation over (i=1)] [r.sub.i] ([alpha]) [[nabla].sup.2] [r.sub.i]([alpha]).

In order to construct J and S we will need the first and second partial derivatives of [P.sup.[perpendicular to]] with respect to the non-linear parameters [[alpha].sub.i]. Henceforth, we shall use the subscript i on a matrix to denote differentiation with respect to the variable [[alpha].sub.i].

To compute the first partial we follow the derivation presented in [4]. We begin by noting that the projection matrix satisfies the equation PA = A. Differentiating both sides of this equation with respect to [[alpha].sub.i] yields [P.sub.i]A + P [A.sub.i] = [A.sub.i]. Subtracting P [A.sub.i] from both sides and regrouping yields [P.sub.i]A = (I - P) [A.sub.i]. Which implies that [P.sub.i]A = [P.sup.[perpendicular to]] [A.sub.i]. Multiplying both sides on the right by [A.sup.[dagger]] and using [AA.sup.[dagger]] = P yields [P.sub.i]P = [P.sup.[perpendicular to]] [A.sub.i][A.sup.[dagger]]. Now transposing and exploiting symmetry on the left gives [PP.sub.i] = [([P.sup.[perpendicular to]] [A.sub.i][A.sup.[dagger]]).sup.T]. And finally, since P is idempotent ([P.sup.2] = P) differentiation yields the widely known result from [4],

(2.4) = [P.sub.i] = [P.sub.i]P + [PP.sub.i] = [P.sup.[perpendicular to]] [A.sub.i][A.sup.[dagger]] + [([P.sup.[perpendicular to]] [A.sub.i][A.sup.[dagger]]).sup.T].

Now we need to extend this derivation in order to compute the second partial derivative. We begin by considering (2.4) and using the product rule to take its partial derivative with respect to [[alpha].sub.j]. Therefore,

(2.5) [MATHEMATICAL EXPRESSION NOT REPRODUCIBLE IN ASCII]

Now we note that

[([P.sup.[perpendicular to]]).sub.j] = [([I - P).sub.j] = - [P.sub.j],

which we may use to simplify (2.5) giving

(2.6) [MATHEMATICAL EXPRESSION NOT REPRODUCIBLE IN ASCII].

Next, we need to compute the partial derivative of [A.sup.[dagger]] with respect to [[alpha].sub.j]. We begin by differentiating both sides of P = [AA.sup.[dagger]] with respect to [[alpha].sub.j], which yields

[P.sub.j] = [A.sub.j] [A.sup.[dagger]] + A [([A.sup.[dagger]]).sub.j].

Rearranging gives

A [([A.sup.[dagger]]).sub.j] = [P.sub.j] - [A.sub.j] [A.sup.[dagger]].

Multiplying on the left by [A.sup.[dagger]] and invoking the fact that [A.sup.[dagger]] A = I yields (2)

[([A.sup.[dagger]]).sub.j] = [A.sup.[dagger]][P.sub.j] - [A.sup.[dagger]] [A.sub.j] [A.sup.[dagger]].

Substituting for [P.sub.j] from (2.4) yields

[([A.sup.[dagger]]).sub.j] = [A.sup.[dagger]] ([P.sup.[perpendicular to]] [A.sub.j] [A.sup.[dagger]] + [([P.sup.[perpendicular to]).sup.T]) - [A.sup.[dagger]] [A.sub.j] [A.sup.[dagger]].

And finally, invoking [A.sup.[dagger]] [P.sup.[perpendicular to]] = 0 yields

[([A.sup.[dagger]]).sub.j] = [A.sup.[dagger]] [([P.sup.[perpendicular to]] [A.sub.j] [A.sup.[dagger]]).sup.T] - [A.sup.[dagger]] [A.sub.j] [A.sup.[dagger]].

We can now substitute this expression into (2.6) giving

(2.7) [MATHEMATICAL EXPRESSION NOT REPRODUCIBLE IN ASCII].

In the interest of brevity, we define

[MATHEMATICAL EXPRESSION NOT REPRODUCIBLE IN ASCII]

Substituting this definition into (2.7) yields a compact expression for the second partial derivative of the projector. In particular,

(2.8) [P.sub.i,j] = [D.sub.i,j] + [D.sup.T.sub.i,j].

Now that we have expressions for the first and second partial derivatives we may proceed to construct the Newton step. To solve (2.1) we need to construct the J and S matrices. We begin by defining two useful quantities. First, the current residual is given by

(2.9) r = [P.sup.[perpendicular to]] y,

and second, the current solution is given by

(2.10) c = [A.sup.[dagger]] y.

Now, since J = [nabla]r it follows that the jth column of J is given by - [P.sub.j]y which, invoking (2.4), is

(2.11) - [P.sup.[perpendicular to]] [A.sub.j]c - [(A.sup.[dagger]]).sup.T] [A.sup.T.sub.j] r.

To construct S we note that following (2.3) the i, j element of S is given by

S (i, j) = - [r.sup.T] [P.sub.ij] y.

Making use of (2.8), we find that

(2.12) S(i, j) = - [r.sup.T] ([D.sub.i,j] + [D.sup.T.sub.i,j]) y.

We will evaluate this in two parts. Consider first the quantity

(2.13) [MATHEMATICAL EXPRESSION NOT REPRODUCIBLE IN ASCII].

Invoking [P.sup.[perpendicular to]]r = r and (2.9) and (2.10), we reduce (2.13) to

(2.14) - [r.sup.T] [D.sub.jj] y = [r.sup.]T [P.sub.j] [A.sub.i]c - [r.sup.T] [A.sub.ij]c - [r.sup.T] [A.sub.i][A.sup.[dagger]] [(A.sub.j] [A.sup.[dagger]]).sup.T] r + [r.sup.T] [A.sub.i][A.sup.[dagger]] [A.sub.j]c.

Substituting in for [P.sub.j] from (2.4) yields

(2.15) [MATHEMATICAL EXPRESSION NOT REPRODUCIBLE IN ASCII].

Now we note that [A.sup.[dagger]] r = 0 and this reduces to

(2.16) [MATHEMATICAL EXPRESSION NOT REPRODUCIBLE IN ASCII].

Next, we consider the second quantity. Omitting the details, it can be reduced to

(2.17) [MATHEMATICAL EXPRESSION NOT REPRODUCIBLE IN ASCII].

Substituting (2.16) and (2.17) into (2.12) and combining terms yields

(2.18) [MATHEMATICAL EXPRESSION NOT REPRODUCIBLE IN ASCII].

It is important to exploit symmetries and use care when constructing the S matrix in order to minimize the operation count and enhance accuracy.

3. Implementation details. Implementation details are important. First and foremost is the careful construction of both the J and S matrices. A simple observation can greatly simplify this process; in particular, terms of the form [A.sub.j]c and [A.sup.T.sub.j]r are ubiquitous in (2.11) and (2.18). We begin by constructing two m x d auxiliary matrices. We shall call the first matrix U. Its jth column is given by

(3.1) [A.sub.j]c.

We shall call the second matrix V. Its jth column is given by

(3.2) [A.sup.T.sub.j]r.

Using these definitions in (2.11), we see that

J = - ([P.sup.[perpendicular to]] U + [([A.sup.[dagger]]).sup.T] V) .

At this point we will introduce two intermediate quantities to simplify the notation going forward. In particular, we define

(3.3) K = [P.sup.[perpendicular to]] U

(3.4) L = [([A.sup.[dagger]]).sup.T] V.

With these definitions, we find that

(3.5) J = - (K + L),

or, equivalently,

(3.6) J = - (U - P(U - L)).

It is also true that

(3.7) [J.sup.T] J = [K.sup.T] K + [L.sup.T] L,

since the columns of L are clearly elements of the range of A, and those of K are in the null space of [A.sup.T].

Now we examine (2.18). We begin by considering the first term and noting that it can be written in terms of the jth column of V, which we shall denote by [v.sub.j] and the ith column of U, which we shall denote by [u.sub.i],

[r.sup.T] [A.sub.j] [A.sup.[dagger]] [A.sub.i]c = [v.sup.T.sub.j] [A.sup.[dagger]] [u.sub.i],

which is the i, j element of the matrix [U.sup.T] L. By the same argument, the quantity in the second term of (2.18) is the i, j element of the matrix [L.sup.T] U. Similarly, the quantity in the third term of (2.18) is twice the i, j element of the matrix [L.sup.T] L.

Hence, if we let S be a matrix whose i, j element is given by

(3.8) [??](i, j) = [r.sup.T] [A.sub.ij] c,

then it follows that (2.18) leads to

(3.9) S = [U.sup.T] L + [L.sup.T] U - 2[L.sup.T] L - [??].

Using (3.7) and (3.9) in (2.2) gives

(3.10) H = [U.sup.T] L + [L.sup.T] U + [K.sup.T] K - [L.sup.T] L - [??],

or, noting that PL = L, more compactly,

(3.11) H = [U.sup.T] U - [(U - L).sup.T] P (U - L) - [??].

Note that the most significant difference, in terms of computational load, between the Full Newton and the Gauss-Newton approaches is the construction of the [??] matrix. This requires working with O([d.sup.2]) partial derivatives (both first and second order partials) as opposed to just O(d) first partials for the Gauss-Newton approach. This can be significant if all of the second partial derivatives are full rank. However, in a number of applications many of the mixed partial derivatives vanish, are sparse, or have low rank, and there can be a significant savings by exploiting such structure. In Section 4, we present an example from [2] involving the fitting of a Bezier curve to ordered two-dimensional data where all of the mixed partial derivatives are identically zero. In this specific example we see only a 50% increase in the work per step by using a Full-Newton approach and reap all the benefits of greatly accelerated convergence. In order to maximize efficiencies for any specific model one must take full advantage of any special structure of the partial derivatives.

The other critical step is to deal with [P.sup.[perpendicular to] and [A.sup.[dagger]] in a computationally responsible manner. We begin by computing the full QR factorization of the model matrix A. In particular,

[MATHEMATICAL EXPRESSION NOT REPRODUCIBLE IN ASCII]

where [[Q.sub.1] [Q.sub.2]] is an m x m orthogonal matrix, R is n x n, upper-triangular, and invertible, and n is an n x n permutation. It is easily verified that

K = [Q.sub.2][Q.sup.T.sub.2] U,

L = [Q.sub.1] [R.sup.-T] [PI]V,

and these forms should be used in any implementation. The Newton step can then be calculated using either (3.5) or (3.6) to construct J, and either (3.10) or (3.11) to construct H. It should be noted that one clear advantage to using (3.6) in conjunction with (3.11) is that one need not compute a full QR factorization in that case, a skinny QR factorization will suffice. Of course, the speed advantage of, for example, a modified Gram-Schmidt approach needs to be weighed against the numerical risk if the model matrix A is poorly conditioned. We note, anecdotally, that this has not been an issue in any of the examples we have considered.

4. A brief example. As a brief example we consider the problem of fitting a Bezier curve to ordered two-dimensional data. This problem is carefully developed in [2] and we refer the reader there for the details. The basic problem can be stated as follows. Given an ordered set of data points [{([w.sub.j], [v.sub.i])}.sup.m.sub.i=1] and a non-negative integer n < m, find nodes [{[[alpha].sub.i]}.sup.m.sub.i=1] and control points [{([x.sub.j], [y.sub.j])}.sup.n.sub.i=1] that minimize

(4.1) [MATHEMATICAL EXPRESSION NOT REPRODUCIBLE IN ASCII]

where [B.sup.n.sub.j] ([alpha]) is the j'th Bernstein polynomial of degree n, that is,

[MATHEMATICAL EXPRESSION NOT REPRODUCIBLE IN ASCII].

It is clear that the control points enter the problem linearly and the nodes are the non-linear parameters.

The critical observation is that all of the mixed partial derivatives taken with respect to the nodes are identically zero. By exploiting this fact and the other structure inherent to the problem we find that we can compute the full-Newton step using only 50% more time than computing the Gauss-Newton step. This in turn indicates that a 33% reduction in iterations will be the break even point for switching to a full-Newton code.

We tested this by using an existing Gauss-Newton code developed in [2] and modifying it to use the full-Newton step; we refer the reader to [2] for the details of the underlying algorithm. We then applied it to a data set containing 23 points taken from an experiment on a reacting chemical system which may have multiple steady states; see [9]. The experiment samples the steady state oxidation rate R achieved by a catalytic system for an input concentration of carbon monoxide [C.sub.CO]. The data is in log-log format and we show the results of fitting with a sixth-degree Bezier curve. Note that in (4.1) this corresponds to m = 23 and n = 6. Even though the squared residual at the solution is very small (0.00217) the full-Newton algorithm converges in just 32 iterations as compared to 174 iterations for the Gauss-Newton code. This greatly enhanced convergence more than makes up for the increase in per iteration cost.

5. A full-Newton algorithm for discrete least squares rational approximation. As a

specific detailed example we now consider the important problem of fitting a rational function of the form

y = [c.sub.1] + [c.sub.2]t + ... + [c.sub.n][t.sup.n-1]/1 + [[alpha].sub.1]t + [[alpha].sub.2][t.sup.2] + ... + [[alpha].sub.k][t.sup.k]

to data in the least-squares sense. This is clearly a separable non-linear least squares problem with the coefficients of the numerator entering the problem linearly and those of the denominator non-linearly. The model matrix A for this problem can be written in the suggestive form A = [D.sup.-1] N where the numerator matrix N is an m x n Vandermonde matrix,

(5.1) [MATHEMATICAL EXPRESSION NOT REPRODUCIBLE IN ASCII]

and the denominator matrix [D.sup.-1] is an m x m diagonal matrix,

[MATHEMATICAL EXPRESSION NOT REPRODUCIBLE IN ASCII],

where q(t) = 1 + [[alpha].sub.1]t + [[alpha].sub.2][t.sup.2] + ... + [[alpha].sub.k][t.sup.k].

We define T to be the m x m diagonal matrix,

[MATHEMATICAL EXPRESSION NOT REPRODUCIBLE IN ASCII].

It is easily verified that

(5.2) [A.sub.j] = - [D.sup.-2][T.sup.j]N

= - [T.sup.j] [D.sup.-1] A

and

(5.3) [A.sub.i,j] = 2[D.sup.-3][T.sup.i+_j]N

= 2[T.sup.i+j] [D.sup.-2] A.

Combining (5.2) with (3.1), we see that the jth column of U is given by

(5.4) - [T.sup.j] [D.sup.-1] Ac,

and also, combining (5.2) with (3.2), we see that the jth columnofV is given by

- [A.sup.T] [D.sup.-1] [T.sup.j] r.

It will be convenient to define an intermediate quantity [??] whose jth column is given by

(5.5) - [D.sup.-1] [T.sup.j]r.

We note then that

(5.6) V = [A.sup.T] [??].

Using (5.3) in (3.8) yields

= [??](i, j) = 2[r.sup.T] [T.sup.i+j] [D.sup.-2] Ac

= 2[r.sup.T] [T.sup.i] [D.sup.-1] [T.sup.j] [D.sup.-1] Ac.

It is not difficult to see then that

(5.7) [??] = 2[[??].sup.T] U.

Close inspection of (5.4) and (5.5) reveals that the columns of U and [??] are just diagonally scaled columns of a Vandermonde matrix. We introduce the matrix M defined by

(5.8) [MATHEMATICAL EXPRESSION NOT REPRODUCIBLE IN ASCII].

It follows that

(5.9) U = - [PHI] [D.sup.-1] M,

where [PHI] = diag (Ac), and also that

(5.10) [??] = -[phi] [D.sup.-1] M,

where [phi] = diag(r). Moreover, we note that combining (5.6) with (3.4) yields

(5.11) L = [([A.sup.[dagger]).sup.T] [A.sup.T] [??]

= P[??],

where P is the current projection, i.e., P = [Q.sub.1][Q.sup.T.sub.1]. Using (5.11) in (3.6) and invoking the idempotence ofP gives

J = - (U - P(U - [??])).

Next, we invoke (5.9) and (5.10) to factor [D.sup.-1]M from the right to get a formula for the Jacobian,

(5.12) J = ([PHI] - P([PHI] - [phi])) [D.sup.-1]M.

Next, using (5.7) and (5.11) in conjunctionwith the idempotence ofP in (3.11) gives

H = [U.sup.T] U - [(U - [??]).sup.T] P (U - [??]) - 2[[??].sup.T] U.

We then invoke (5.9) and (5.10) to factor [D.sup.-1]M from the right and its transpose from the left to get

(5.13) H = [M.sup.T] [D.sup.-1] ([[PHI].sup.2] - ([PHI] - [phi])P([PHI] - [phi]) - 2[PHI][phi]) [D.sup.-1] M, or, equivalently,

H = [M.sup.T][D.sup.-1] (([PHI] - [phi])[P.sup.[perpendicular to])([PHI] - [phi]) - [[phi].sup.2]) [D.sup.-1] M.

5.1. The algorithm. With the formulas necessary for computing the Newton step for rational approximation we developed a basic MATLAB code implementing the full-Newton approach. The first algorithm uses (5.12) for J and (5.13) for H. The user specifies the degree of the numerator n - 1 and the denominator k and may supply a starting guess for the coefficients of the denominator, [[alpha].sub.i] for i = 1, ..., q. If no starting guess is given the code generates one by solving the widely known linear least squares extension of the Newton-Pade approximant. In particular, we find the least squares solution to

[m.summation over (i=1)] [([c.sub.1] + [c.sub.2][t.sub.i] + ... + [c.sub.n][t.sup.n-1.sub.i] - [y.sub.i] [([[alpha].sub.1] t + [[alpha].sub.2][t.sup.2.sub.i] + ... + [[alpha].sub.k][t.sup.k.sub.i]) - [y.sub.i].sup.2].

This is implemented by solving

(5.14) [MATHEMATICAL EXPRESSION NOT REPRODUCIBLE IN ASCII],

where Y is diagonal matrix with [y.sub.i,i] = [y.sub.i], and N and M are defined in (5.1) and (5.8), respectively.

Once we have a current guess for a each successive iteration begins by computing J and H using (5.12) and (5.13). We then evaluate the gradient using J and the current residual, and then solve for the Newton step using the Cholesky factorization ofH. One weakness of full-Newton methods is that H may not be positive definite in a region of mixed curvature and hence the Newton direction may not be a descent direction. Although there are a number of methods for dealing with this we have implemented a very simple one that we have found to be very reliable. In particular, if the Cholesky factorization fails we regularize H by arbitrarily shifting its spectrum to the right by a bit more than the absolute value of its leftmost eigenvalue (we used [absolute value of 1.2 x [[lambda].sub.min]]). Once that is done we compute the Cholesky factorization and solve for the Newton step.

Once a descent direction has been established it is essential to use some form of step size control in order to get reasonable convergence. We have chosen to use a backtracking line search and have found this to work surprisingly well. We omit the implementation details which can be found in [3].

Finally, as a stopping criterion we terminated the algorithm when the relative change in the squared norm of the residual was less than 10-12.

For purposes of comparison we also created a Gauss-Newton code, identical to the full-Newton code except that all calculations not needed for computing the Gauss-Newton direction are removed. In particular, after constructing the Jacobian with (5.12) we set H = [J.sup.T] J and proceed to solve for the Gauss-Newton step using the Cholesky factorization. (3) No regularization is needed in this case so that code is also removed.

5.2. Experimental results. We begin by considering two examples using measured data, both of these are taken from the NIST Statistical Reference Datasets for Nonlinear Regression. (4) The first involves measured data from an experiment on electron mobility. This data is due to R. Thurber and is considered to be of higher difficulty. There are 37 data points to be fit to a model of the form

y = [c.sub.0] + [c.sub.1]t + [c.sub.2][t.sup.2] + [c.sub.3][t.sup.3]/1 + [[alpha].sub.1]t + [[alpha].sub.2][t.sup2] + [[alpha].sub.1][t.sup.3].

We consider two scenarios in order to compare the approaches. In the first case we use the starting guess suggested in the NIST testbed of [[alpha].sub.0] = [[1 .4 .05].sup.T]. In this case we converge to the known solution of [alpha] = [[.9663 .3980 .0497].sup.T] in 6 iterations with the full Newton code as compared to 20 iterations with the Gauss-Newton code. In this case the full Newton code resorts to regularization for the first two steps. In the second case we provide no starting guess and hence the codes generate one using (5.14). In this case the Full-Newton code converges in 7 iterations as compared to 30 for Gauss-Newton. In this case the full Newton code resorts to regularization only for the first step. It is worth noting that the squared norm of the residual at the solution is roughly 5642.7. Since it is well known that the full-Newton approach generally outperforms Gauss-Newton when the residuals are large this outcome is not surprising, although its magnitude is noteworthy.

To get a baseline notion of relative speed both codes were tested using MATLAB 7.0.4 on a Pentium(R) 4 with 504MB of RAM. Both scenarios were run and the average time per run over 10,000 runs was calculated. In the first scenario both codes run with the starting guess above, in the second scenario both codes were run with no starting guess. The results are summarized in Table 5.1, where we show the number of steps to converge, the average time (measured in milliseconds), and the time ratio.

The second example involves measured data from a NIST study involving scanning electron microscope line width standards. This data set is due to R. Kirby and is considered to be of average difficulty. There are 151 data points to be fit to a model of the form

y = [c.sub.0] + [c.sub.1]t + [c.sub.2][t.sup.2]/1 + [[alpha].sub.1]t + [[alpha].sub.2][t.sup.2]

We consider the same two scenarios in order to compare the approaches. In the first case we use the starting guess suggested in the NIST testbed of [[alpha].sub.0] = [[-.0015 .00002].sup.T]. The results of this experiment are summarized in Table 5.2. It is worth noting that the squared norm of the residual at the solution is roughly 3.905 and we see the full-Newton approach comparing less favorably to Gauss-Newton when the residual is small as we might generally expect. It is also noteworthy that for this data set the full-Newton code did not resort to regularization in either case.

As an additional test we consider fitting a rational of the form

y = [c.sub.0] + [c.sub.1]t + [c.sub.2][t.sup.2]/1 + [[alpha].sub.1]t + [[alpha].sub.2][t.sup.2]

to the function y = [square root of 1 = [x.sup.2]] on the interval [-1,1] and also to the function y = cos x on the interval [-[pi], [pi]]. To give some notion of scaling we do this for three different cases: using 11, 101, and 501 evenly spaced points over the interval. Regularization was not required in any of these examples. The results are summarized in Tables 5.3 and 5.4.

Finally, we consider a function with more complex behavior. In particular, we will try to fit y = [e.sup.-x cos 4x] on the interval [0, [pi]]. We begin by trying to fit the function with a rational with both numerator and denominator being degree 4 and using 20 equally spaced points. In this case the full-Newton algorithm converges in just 12 iterations with a final squared residual of 6.6916 x [10.sup.-1] and it resorts to regularization for each of the first four steps. The Gauss-Newton algorithm converges in 13 iterations but to a very unsatisfying approximation with a squared residual of 6.9470 (there are two poles inside the interval). This is not unusual in our experience; anecdotally, we have observed that the full-Newton algorithm is rather less likely to stall out far from the optimal solution.

We ran this test a second time using 100 evenly spaced points and using a higher order rational (numerator and denominator both degree 6). In this case both algorithms converge to a nice solution with a squared residual of 2.3965 x [10.sup.-1] but the full-Newton code requires only 20 iterations in contrast to the 25 required by the Gauss-Newton code, moreover it runs in just 71% of the time required by the latter. It is very interesting to note that in this experiment the full-Newton code resorts to regularization for each of the first 10 steps. However, even though it is regularizing fully half of the time it still noticeably outperforms the Gauss-Newton code.

6. Conclusion. We have derived a full-Newton approach for separable non-linear least squares problems. The derivation results in a surprisingly compact formula for computing the Newton step. Experiments show that the method can substantially improve the convergence rate at the expense of additional per iteration costs. It is seen that for problems where the second partial derivatives of the model matrix have special structure the additional costs of using a full-Newton approach may be minimal and hence the improved convergence rate can lead to substantially faster solutions. This was briefly demonstrated using an example from parametric curve fitting where all of the mixed partials are identically zero (and structured).

We then applied our derivation of the Newton step to the problem of discrete least squares rational approximation. This very important problem has a structure that leads to a surprisingly compact form for the Newton step. We showed with several examples that the fullNewton approach can significantly outperform the Gauss-Newton approach.

REFERENCES

[1] A. Bjorck, Numerical Methods for Least Squares Problems, SIAM, Philadelphia, PA, 1996.

[2] C. F. Borges and T. A. Pastva, Total least squares fitting of Bezier and B-spline curves to ordered data, Comput. Aided Geom. Design, 19 (2002), pp. 275-289.

[3] J. E. Dennis and R. B. Schnabel, Numerical Methods for Unconstrained Optimization and Nonlinear Equations, Prentice-Hall, Englewood Cliffs, NJ, 1983.

[4] G. H. Golub and V. Pereyra, The differentiation of pseudo-inverses and nonlinear least-squares problems whose variables separate, SIAM J. Numer. Anal., 10 (1973), pp. 413-432.

[5] --, Separable nonlinear least squares: the variable projection method and its applications, Inverse Problems, 19 (2003), pp. R1-R26.

[6] G. H. Golub and C. F. Van Loan, Matrix Computations, Third ed., The Johns Hopkins University Press, Baltimore, MD, 1996.

[7] R. J. Hanson and C. L. Lawson, Extensions and applications of the Householder algorithm for solving linear least squares problems, Math. Comp., 23 (1969), pp. 787-812.

[8] L. Kaufman, A variable projection method for solving separable nonlinear least squares problems, BIT, 15 (1975), pp. 49-57.

[9] S. P. Marin and P. W. Smith, Parametric approximation of data using ODR splines, Comput. Aided Geom. Design, 11 (1994), pp. 247-267.

[10] C. Nelson, Contour encoded compression and transmission, Master's thesis, Department of Computer Science, Brigham Young University, Provo, UT, 2006.

[11] A. Ruhe and P.-A. Wedin, Algorithms for separable nonlinear least squares problems, SIAM Rev., 22 (1980), pp. 318-337.

* Received August 30, 2008. Accepted for publication December 15, 2008. Published online on April 14, 2009. Recommended by M. Benzi.

(1) This is a formal derivation and it is worth noting that one should not form generalized inverses as a method for computing solutions to linear least-squares problems.

(2) The assumption that A be full-rank is important here.

CARLOS F. BORGES ([dagger])

(3) We note that solving the normal equations is not a computationally responsible approach. However, for the purpose of the experiments that follow we found that carefully solving for the Gauss-Newton step did not change the convergence or final results but did add additional overhead which biased the time comparisons rather unfairly to the benefit of the Full-Newton algorithm. Therefore we produced all of the timings using Cholesky in order that the algorithms would be competing on a level playing field.

(4) This data is available online at www.itl.nist.gov.

([dagger]) Department of Applied Mathematics, Naval Postgraduate School, Monterey, CA 93943 (borges@nps.edu).

Table 5.1 Experimental Results for the Thurber data set Full-Newton Gauss-Newton Ratio Steps Time Steps Time Case I 6 1.5454 20 2.9902 52% Case II 7 1.5007 30 4.5454 33% Table 5.2 Experimental Results for the Kirby data set Full-Newton Gauss-Newton Ratio Steps Time Steps Time Case I 5 2.4767 7 3.8659 64% Case II 4 2.2382 7 4.4928 50% Table 5.3 Experimental Results for the y = [square root of 1 - [chi square]] data set # Points Full-Newton Gauss-Newton Steps Time Steps Time 11 4 0.6268 5 0.8305 101 4 1.3843 8 2.6605 501 4 98.211 7 168.78 # Points Ratio Residual 11 75% 8.91 x [10.sup.-4] 101 52% 3.68 x [10.sup.-2] 501 58% 8.50 x [10.sup.-2] Table 5.4 Experimental Results for the y = cos x data set # Points Full-Newton Gauss-Newton Steps Time Steps Time 11 4 0.6356 7 0.8549 101 4 1.3955 7 2.3812 501 4 86.802 7 167.88 # Points Ratio Residual 11 74% 2.42 x [10.sup.-2] 101 59% 1.30 x [10.sup.-1] 501 52% 5.94 x [10.sup.-1]

Gale Copyright:

Copyright 2009 Gale, Cengage Learning. All rights
reserved.