AMS subject classifications. 65F15, 15A18, 65F50.
1. Introduction. The computation of singular subspaces associated
with the k largest or smallest singular values of a large, sparse (or
structured) matrix A is commonplace. Example applications are the
solution of discrete ill-posed problems [17], or the construction of
low-rank matrix approximations in areas such as signal processing [13]
and information retrieval [5]. This paper focuses on Lanczos
bidiagonalization, a method that can be competitive in this context
because it exploits matrix sparsity.
The problem of computing the singular value decomposition (SVD) of
a matrix A can be formulated as an equivalent eigenvalue problem, using
for instance the cross product matrix [A.sup.*] A. The Lanczos
bidiagonalization algorithm can be deduced from Lanczos
tridiagonalization applied to these equivalent eigenproblems. Therefore,
it inherits the good properties as well as the implementation
difficulties present in Lanczos-based eigensolvers. It is possible to
stop after a few Lanczos steps, in which case we obtain Rayleigh-Ritz
approximations of the singular triplets. On the other hand, loss of
orthogonality among Lanczos vectors has to be dealt with, either by full
reorthogonalization or by a cheaper alternative, such as partial
reorthogonalization [25, 26]. Block variants of the method have been
proposed; see, e.g., [16]. Also, in the case of slow convergence,
restarting techniques become very important in order to keep the cost of
reorthogonalization bounded. All these techniques are intended for
numerical robustness as well as computational efficiency. Furthermore,
if these properties are to be maintained in the context of parallel
computing, then additional tuning of the algorithm may be required.
Therefore, it becomes apparent that implementing an industrial-strength
SVD solver based on Lanczos bidiagonalization requires a careful
combination of a number of different techniques.
In this paper, we present a thick restart Lanczos bidiagonalization
procedure implemented in SLEPc, the Scalable Library for Eigenvalue
Problem Computations [19, 20].
The proposed Lanczos bidiagonalization algorithm is based on full
reorthogonalization via iterated Classical Gram-Schmidt, and its main
goal is to reduce the number of synchronization points in the parallel
implementation, while maintaining numerical robustness and fast
convergence. Some of the techniques presented here were also applied to
the Arnoldi eigensolver in a previous work by the authors [18]. The
implemented bidiagonalization algorithm is used as a basis for a
thick-restarted SVD solver similar to that proposed by Baglama and
Reichel [1].
The text is organized as follows. First, Sections 2-4 provide a
general description of the Lanczos bidiagonalization method, discuss how
to deal with loss of orthogonality among Lanczos vectors, and review the
thick-restarted strategy for singular value solvers. Then, Section 5
gives some details about the SLEPc implementation. Finally, Sections 6
and 7 show some numerical and performance results obtained with this
implementation.
2. Lanczos bidiagonalization. The singular value decomposition of
an m x n complex matrix A can be written as
A = U[summation][V.sup.*]. (2.1)
where U = [[u.sub.i], ..., [u.sub.m]] is an m x m unitary matrix
([U.sup.*]U = I), V = [[v.sub.i], ..., [v.sub.n]] is an n x n unitary
matrix ([V.sup.*] V = I), and [summation] is an m x n diagonal matrix
with nonnegative real diagonal entries [[summation].sub.ii] =
[[sigma].sub.i], for i = 1, ..., min{m, n}. If A is real, U and V are
real and orthogonal. The vectors [u.sub.i] are called the left singular
vectors, the [v.sub.i] are the right singular vectors, and the
[[sigma].sub.i] are the singular values. In this work, we will assume
without loss of generality that m [greater than or equal to] n. The
singular values are labeled in descending order, [[sigma].sub.1]
[greater than or equal to] [[sigma].sub.2] [greater than or equal to]
... [greater than or equal to] [[sigma].sub.n].
The problem of computing the singular triplets ([[sigma].sub.i],
[u.sub.i], [v.sub.i]) of A can be formulated as an eigenvalue problem
involving a Hermitian matrix related to A, either
1. the cross product matrix, [A.sup.*] A, or
2. the cyclic matrix, [MATHEMATICAL EXPRESSION NOT REPRODUCIBLE IN
ASCII].
The singular values are the nonnegative square roots of the
eigenvalues of the cross product matrix. This approach may imply a
severe loss of accuracy in the smallest singular values. The cyclic
matrix approach is an alternative procedure that avoids this problem, at
the expense of significantly increasing the cost of the computation.
Note that we could also consider the alternative cross product matrix
A[A.sup.*] , but that approach is unfeasible under the assumption that m
[greater than or equal to] n.
Computing the cross product matrix explicitly is not recommended,
especially in the case of A sparse. Bidiagonalization was proposed by
Golub and Kahan [15] as a way of tridiagonalizing the cross product
matrix without forming it explicitly. Consider the decomposition
A = PB[Q.sup.*], (2.2)
where P and Q are unitary matrices, and B is an m x n upper
bidiagonal matrix. Then the tridiagonal matrix [B.sup.*] B is unitarily
similar to [A.sup.*] A. Additionally, specific methods exist (e.g.,
[11]) that compute the singular values of B without forming B* B.
Therefore, after computing the SVD of B,
B = X[summation][Y.sup.*], (2.3)
it only remains to combine (2.3) and (2.2) to get the solution of
the original problem (2.1) with U = PX and V = QY.
Bidiagonalization can be accomplished by means of Householder
transformations or alternatively via Lanczos recurrences. The latter
approach is more appropriate for sparse matrix computations and was
already proposed in [15], hence it is sometimes referred to as
GolubKahan-Lanczos bidiagonalization.
The Lanczos bidiagonalization technique can be derived from several
equivalent perspectives. Consider a compact version of(2.2),
A = [P.sub.n] [B.sub.n] [Q.sup.*.sub.n]:
where the zero rows of the bidiagonal matrix have been removed and,
therefore, [P.sub.n] is now an m x n matrix with orthonormal columns,
[Q.sub.n] is a unitary matrix of order n that is equal to Q of (2.2),
and [B.sub.n] is a square matrix of order n that can be written as
[MATHEMATICAL EXPRESSION NOT REPRODUCIBLE IN ASCII]. (2.4)
The coefficients of this matrix are real and given by
[[alpha].sup.j] = [p.sup.*.sub.j] [Aq.sub.j] and [[beta].sub.j] =
[p.sup.*.sub.j] [A.sub.qj+1], where [p.sub.j] and [q.sub.j] are the
columns of [P.sub.n] and [Q.sub.n], respectively. It is possible to
derive a double recurrence to compute these coefficients together with
the vectors [p.sub.j] and [q.sub.j], since after choosing [q.sub.1] as
an arbitrary unit vector, the other columns of [P.sub.n] and [Q.sub.n]
are determined uniquely (apart from a sign change, and assuming A has
full rank and [B.sub.n] is unreduced).
Pre-multiplying (2.4) by [P.sub.n], we have the relation [AQ.sub.n]
= [P.sub.n] [B.sub.n]. Also, if we transpose both sides of (2.4) and
pre-multiply by [Q.sub.n], we obtain [A.sup.*] [P.sub.n] =
[Q.sub.n][B.sup.*.sub.n]. Equating the first k < n columns of both
relations results in
A[Q.sub.k] = [P.sub.k] [B.sub.k], (2.5)
[A.sup.*] [P.sub.k] = [Q.sub.k] [B.sup.*.sub.k] + [[beta].sub.k]
[q.sub.k] + 1 [e.sup.*.sub.k], (2.6)
where [B.sub.k] denotes the k x k leading principal submatrix of
[B.sub.n]. Analogous expressions can be written in vector form by
equating the jth column only,
[Aq.sub.j] = [[beta].sub.j-1] [p.sub.j-1] + [[alpha].sub.j]
[p.sub.j], > (2.7)
[A.sup.*] [p.sub.j] = [[alpha].sub.j] [q.sub.j] + [[beta].sub.j]
[q.sub.j+1].
These expressions directly yield the double recursion
[[alpha].sub.j] [p.sub.j] = [Aq.sub.j] - [[beta].sub.j-1]
[p.sub.j-1] > (2.8)
[[beta].sub.j] [q.sub.j+1] = [A.sup.*] [p.sub.j] - [[alpha].sub.j]
[q.sub.j], (2.9)
with [[alpha].sub.j] = [[parallel][A.sub.qj] - [[beta].sub.j-1]
[p.sub.j-1][parallel].sub.2] and [[beta].sub.j] = [[parallel][A.sup.*]
[p.sub.j] - [[alpha].sub.j] [[q.sub.j][parallel].sub.2], since the
columns of [P.sub.n] and [Q.sub.n] are normalized. The bidiagonalization
algorithm is built from (2.8) and (2.9); see Algorithm 1.
Equations (2.5) and (2.6) can be combined by pre-multiplying the
first one by [A.sup.*], resulting in
[A.sup.*] A[Q.sub.k] = [Q.sub.k] [B.sup.*.sub.k] [B.sub.k] +
[[alpha].sub.k] [[beta].sub.k] [q.sub.k+1] [e.sup.*.sub.k]. (2.10)
The matrix [B.sup.*.sub.k] [B.sub.k] is symmetric positive definite
and tridiagonal. The conclusion is that Algorithm 1 computes the same
information as the Lanczos tridiagonalization algorithm applied to the
Hermitian matrix [A.sup.*] A. In particular, the right Lanczos vectors
[q.sub.j] computed by Algorithm 1 constitute an orthonormal basis of the
Krylov subspace
[K.sub.k] ([A.sup.*] A, [q.sub.1]) = span{[q.sub.1],
[A.sup.*]A[q.sub.1], ... , [([A.sup.*]A).sup.k-1] [q.sub.1]}.
Another way of combining (2.5) and (2.6) is by pre-multiplying the
second one by A, giving in this case the equation
A[A.sup.*] [P.sub.k] = [P.sub.k][B.sub.k] [B.sup.*.sub.k]] +
[[beta].sub.k]A[q.sub.k+1] [e.sup.*.sub.k].
In contrast to (2.10), this equation does not represent a Lanczos
decomposition, because the vector Aqk+1 is not orthogonal to Pk, in
general. However, using (2.7) we get
[MATHEMATICAL EXPRESSION NOT REPRODUCIBLE IN ASCII],
where the matrix [B.sub.k] [B.sup.*.sub.k] + [[beta].sup.2.sub.k]
[e.sub.k] [e.sup.*.sub.k] is also symmetric positive definite and
tridiagonal. Thus, a similar conclusion can be drawn for matrix A
[A.sup.*] , and the left Lanczos vectors [p.sub.j] span the Krylov
subspace [K.sub.k] (A [A.sup.*], [p.sub.1]).
There is an alternative way of deriving Algorithm 1, which further
displays the intimate relation between Lanczos bidiagonalization and the
usual three-term Lanczos tridiagonalization. The idea is to apply the
standard Lanczos algorithm to the cyclic matrix, H(A), with the special
initial vector
[MATHEMATICAL EXPRESSION NOT REPRODUCIBLE IN ASCII].
It can be shown that the generated Lanczos vectors are then
[MATHEMATICAL EXPRESSION NOT REPRODUCIBLE IN ASCII], (2.11)
and that the projected matrix after 2k Lanczos steps is
[MATHEMATICAL EXPRESSION NOT REPRODUCIBLE IN ASCII].
That is, two steps ofthis procedure compute the same informationas
one step of Algorithm 1. Moreover, it is easy to show that there is an
equivalence transformation with the odd-even permutation (also called
perfect shuffle) that maps [T.sub.2k] into the cyclic matrix
H([B.sub.k]). Note that, in a computer implementation, this procedure
would require about twice as much storage as Algorithm 1, unless the
zero components in the Lanczos vectors (2.11) are not stored explicitly.
Due to these equivalences, all the properties and implementation
considerations of Lanczos tridiagonalization (see [3], [24], [27] or
[29]) carry over to Algorithm 1. In particular, error bounds for Ritz
approximations can be computed very easily. After k Lanczos steps, the
Ritz values [[??].sub.i] (approximate singular values of A) are computed
as the singular values of [B.sub.k], and the Ritz vectors are
[MATHEMATICAL EXPRESSION NOT REPRODUCIBLE IN ASCII]
where [x.sub.i] and [y.sub.i] are the left and right singular
vectors of [B.sub.k]. With these definitions, and equations (2.5)-(2.6),
it is easy to show that
[MATHEMATICAL EXPRESSION NOT REPRODUCIBLE IN ASCII].
The residual norm associated to the Ritz singular triplet
([MATHEMATICAL EXPRESSION NOT REPRODUCIBLE IN ASCII]), defined as
[MATHEMATICAL EXPRESSION NOT REPRODUCIBLE IN ASCII],
can be cheaply computed as
[[parallel][r.sub.i][paralle].sub.2] = [[beta].sub.k] [absolute
value of [e.sup.*.sub.k] [x.sub.i]] (2.12)
3. Dealing with loss of orthogonality. As in the case of the
standard Lanczos tridiagonalization algorithm, Algorithm 1 diverts from
the expected behaviour when run in finite precision arithmetic. In
particular, after a sufficient number of steps the Lanczos vectors start
to lose their mutual orthogonality, and this happens together with the
appearance of repeated and spurious Ritz values in the set of singular
values of [B.sub.j].
The simplest cure for this loss of orthogonality is full
orthogonalization. In Lanczos bidiagonalization, two sets of Lanczos
vectors are computed, so full orthogonalization amounts to
orthogonalizing vector [p.sub.j] explicitly with respect to all the
previously computed left Lanczos vectors, and orthogonalizing vector
[q.sub.j+1] explicitly with respect to all the previously computed right
Lanczos vectors. Algorithm 2 shows this variant with a modified
Gram-Schmidt (MGS) orthogonalization procedure. Note that in the
computation of pj it is no longer necessary to subtract the term
[[beta].sub.j-1] [p.sub.j-1], since this is already done in the
orthogonalization step; a similar remark holds for the computation of
[q.sub.j+1].
This solution was already proposed in the seminal paper by Golub
and Kahan [15], and used in some of the first implementations, such as
the block version in [16]. The main advantage of full orthogonalization
is its robustness, since orthogonality is maintained to full machine
precision (provided that reorthogonalization is employed, see Section 5
for details). Its main drawback is the high computational cost, which
grows as the iteration proceeds.
An alternative to full orthogonalization is to simply ignore loss
of orthogonality and perform only local orthogonalization at every
Lanczos step. This technique has to carry out a post-process of matrix
[T.sub.2k] in order to determine the correct multiplicity of the
computed singular values as well as to discard the spurious ones; see
[8] for further details.
Semiorthogonal techniques try to find a compromise between full and
local orthogonalization. One such technique is partial
reorthogonalization [30], which uses a cheap recurrence to estimate the
level of orthogonality, and applies some corrective measures when it
drops below a certain threshold. This technique has been adapted by
Larsen [25] to the particular case of Lanczos bidiagonalization. In this
case, two recurrences are necessary, one for monitoring loss of
orthogonality among right Lanczos vectors, and the other one for left
Lanczos vectors.
However, these alternatives to full orthogonalization are not very
meaningful in the context of restarted variants, discussed in Section 4.
First, the basis size is limited so the cost of full orthogonalization
does not grow indefinitely. Second, currently there is no reliable
theory background on how to adapt semiorthogonal techniques to the case
in which a restart is performed. In [7] an attempt is done to extend
semi-orthogonalization also to locked eigenvectors in the context of an
explicitly restarted eigensolver. In the case of thick restart, the
technique employed in this paper, numerical experiments carried out by
the authors show that orthogonality with respect to restart vectors must
be enforced explicitly in each iteration, negating the advantage of
semiorthogonal techniques.
One-sided variant. There is a variation of Algorithm 2 that
maintains the effectiveness of full reorthogonalization, but with a
considerably reduced cost. This technique was proposed by Simon and Zha
[31]. The idea comes from the observation that, in the Lanczos
bidiagonalization procedure without reorthogonalization, the level of
orthogonality of left and right Lanczos vectors go hand in hand. If we
quantify the level of orthogonality of the Lanczos vectors [MATHEMATICAL
EXPRESSION NOT REPRODUCIBLE IN ASCII], computed in finite precision
arithmetic, as
[MATHEMATICAL EXPRESSION NOT REPRODUCIBLE IN ASCII],
then it can be observed that at a given Lanczos step [MATHEMATICAL
EXPRESSION NOT REPRODUCIBLE IN ASCII] differ in no more than an order of
magnitude, except maybe when Bj becomes very ill-conditioned. This
observation led Simon and Zha to propose what they called the one-sided
version, shown in Algorithm 3.
Note that the only difference of Algorithm 3 with respect to
Algorithm 2 is that [p.sub.j] is no longer orthogonalized explicitly.
Still, numerical experiments carried out by Simon and Zha show that the
computed [[??].sub.j] vectors maintain a similar level of orthogonality
as [[??].sub.j].
When the singular values of interest are the smallest ones, then
[B.sub.j] may become ill-conditioned. A robust implementation should
track this event and revert to the two-sided variant when a certain
threshold is exceeded.
4. Restarted bidiagonalization. Restarting is a key aspect in the
efficient implementation of projection-based eigensolvers, such as those
based on Arnoldi or Jacobi-Davidson. This topic has motivated an intense
research activity in recent years. These developments are also
applicable to Lanczos, especially if full reorthogonalization is
employed. In this section, we adapt the discussion to the context of
Lanczos bidiagonalization.
The number of iterations required in the Lanczos bidiagonalization
algorithm (i.e., the value of k) can be quite high if many singular
triplets are requested, and also depends on the distribution of the
singular values, as convergence is slow in the presence of clustered
singular values. Increasing k too much may not be acceptable, since this
implies a growth in storage requirements and, sometimes more important,
a growth of computational cost per iteration in the case of full
orthogonalization. To avoid this problem, restarted variants limit the
maximum number of Lanczos steps to a fixed value k, and when this value
is reached the computation is re-initiated. This can be done in
different ways.
Explicit restart consists of rerunning the algorithm with a
"better" initial vector. In Algorithms 1-3, the initial vector
is q1, so the easiest strategy is to replace [q.sub.1] with the right
Ritz vector associated to the approximate dominant singular value. A
block equivalent of this technique was employed in [16]. In the case
that many singular triplets are to be computed, it is not evident how to
build the new [q.sub.1]. One possibility is to compute [q.sub.1] as a
linear combination of a subset of the computed Ritz vectors, possibly
applying a polynomial filter to remove components in unwanted
directions.
Implicit restart is a much better alternative that eliminates the
need to explicitly compute a new start vector q1 . It consists of
combining the Lanczos bidiagonalization process with the implicitly
shifted QR algorithm. The k-step Lanczos relations described in
(2.5)-(2.6) are transformed and truncated to order l < k, and then
extended again to order k. The procedure allows the small-size equations
to retain the relevant spectral information of the full-size relations.
A detailed description of this technique can be found in [6], [21], [23]
and [26].
An equivalent yet easier to implement alternative to implicit
restart is the so-called thick restart, originally proposed in the
context of Lanczos tridiagonalization [32]. We next de scribe how this
method can be adapted to Lanczos bidiagonalization, as proposed in [1].
The main idea of thick-restarted Lanczos bidiagonalization is to
reduce the full k-step Lanczos bidiagonalization, (2.5)-(2.6), to the
following one
[MATHEMATICAL EXPRESSION NOT REPRODUCIBLE IN ASCII], (4.1)
[MATHEMATICAL EXPRESSION NOT REPRODUCIBLE IN ASCII], (4.2)
where the value of l < k could be, for instance, the number of
wanted singular values. The key point here is to build the decomposition
of (4.1)-(4.2) in such a way that it keeps the relevant spectral
information contained in the full decomposition. This is achieved
directly by setting the first l columns of [[??].sub.l+1] to be the
wanted approximate right singular vectors, and analogously in
[[??].sub.l+1] the corresponding approximate left singular vectors. It
can be shown [1] that it is possible to easily build a decomposition
that satisfies these requirements, as described in the following.
We start by defining [[??].sub.l+1] as
[MATHEMATICAL EXPRESSION NOT REPRODUCIBLE IN ASCII], (4.3)
that is, the Ritz vectors [[??].sub.i] = [Q.sub.k] [y.sub.i]
together with the last Lanczos vector generated by Algorithm 1. Note
that this matrix has orthonormal columns because [Q.sup.*.sub.k]
[q.sub.k+1] = 0 by construction. Similarly, define [[??].sub.l+1] as
[MATHEMATICAL EXPRESSION NOT REPRODUCIBLE IN ASCII], (4.4)
with [MATHEMATICAL EXPRESSION NOT REPRODUCIBLE IN ASCII] a
unit-norm vector computed as [[??].sub.l+1] = f/[parallel]f[parallel]2,
where f is the vector resulting from orthogonalizing [A.sub.qk+1] with
respect to the first l left Ritz vectors, [[??].sub.i],
[MATHEMATICAL EXPRESSION NOT REPRODUCIBLE IN ASCII].
It can be shown that the orthogonalization coefficients can be
computed as [[??].sub.i] = [[beta].sub.k] [e.sup.*.sub.k] [x.sub.i];
note that these values are similar to the residual bounds in (2.12), but
here the sign is relevant. The new projected matrix is
[MATHEMATICAL EXPRESSION NOT REPRODUCIBLE IN ASCII],
where [[??].sub.l+1] = [[parallel]f[parallel].sub.2], so that (4.1)
holds. To complete the form of a Lanczos bidiagonalization, it only
remains to define [[??].sub.l+1] and [[??].sub.l+2] in (4.2), which turn
out to be [[??].sub.l+1] = [[parallel]g[parallel].sub.2] and
[[??].sub.l+2] = g/[[parallel]g[parallel].sub.2], where g = [A.sup.*]
[[??].sub.l+1] - [[??].sub.l+1][q.sub.k+1].
It is shown in [1] that the Lanczos bidiagonalization relation is
maintained if Algorithm 1 is run for j = l + 2, ..., k, starting from
the values of [[??].sub.l+1] and [[??].sub.l+2] indicated above, thus
obtaining a new full-size decomposition. In this case, the projected
matrix is no longer
bidiagonal, but it takes the form
[MATHEMATICAL EXPRESSION NOT REPRODUCIBLE IN ASCII],
where the values without tilde are computed in the usual way with
Algorithm 1.
When carried out in an iterative fashion, the above procedure
results in Algorithm 4. The step 4 can be executed by a variation of
Algorithm 3, as illustrated in Algorithm 5. Starting from the new
initial vector, [q.sub.l+1], this algorithm first computes the
corresponding left initial vector [p.sub.l+1], and then proceeds in the
standard way.
5. Parallel implementation in SLEPc. This work is framed in the
context of the SLEPc project. SLEPc, the Scalable Library for Eigenvalue
Problem Computations [19,20], is a parallel software library intended
for the solution of large, sparse eigenvalue problems. A collection of
eigensolvers is provided, that can be used for different types of
eigenproblems, either standard or generalized ones, both Hermitian and
non-Hermitian, with either real or complex arithmetic. SLEPc also
provides built-in support for spectral transformations such as
shift-and-invert. In addition, a mechanism for computing partial
singular value decompositions is also available, either via the
associated eigenproblems (cross product or cyclic matrix) or with
Lanczos bidiagonalization as described in this paper.
SLEPc is an extension of PETSc, the Portable, Extensible Toolkit
for Scientific Computation [4], and therefore reuses part of its
software infrastructure, particularly the matrix and vector data
structures. PETSc uses a message-passing programming paradigm with
standard data distribution of vectors and matrices, i.e., by blocks of
rows. With this distribution of data across processors, parallelization
of the Lanczos bidiagonalization process (Algorithm 5) amounts to
carrying out the following three stages in parallel:
1. Basis expansion. In the case of Lanczos bidiagonalization, the
method builds two bases, one associated to A and another to [A.sup.*],
so this stage consists of a direct and transpose sparse matrix-vector
product. In many applications, the matrix-vector product operation can
be parallelized quite efficiently. This is the case in mesh-based
computations, in which, if the mesh is correctly partitioned in compact
subdomains, data needs to be exchanged only between processes owning
neighbouring subdomains.
2. Orthogonalization. This stage consists of a series of vector
operations such as inner products, additions, and multiplications by a
scalar.
3. Normalization. From the parallelization viewpoint, the
computation of the norm is equivalent to a vector inner product.
The matrix vector products at the core of the Lanczos process are
implemented with PETSc's MatMult and MatMultTranspose operations.
These operations are optimized for parallel sparse storage and even
allow for matrix-free, user-defined matrix-vector product operations;
see [4] for additional details. PETSc matrices are stored in compressed
sparse row (CSR) format so the MatMult operation achieves good cache
memory efficiency. This operation is implemented as a loop that
traverses the non-zero matrix elements and stores the resulting vector
elements in order. However, the MatMultTranspose operation writes the
resulting vector elements in a non-consecutive order, forcing frequent
cache block copies to main memory. This difference in performance is
evident especially in the case of rectangular matrices. In order to
achieve good sequential efficiency, SLEPc stores the matrix [A.sup.*]
explicitly. This detail is transparent to the user and it can be
deactivated to reduce memory usage. We will center our discussion on the
orthogonalization and normalization stages, and assume that the
implementation of the basis expansion is scalable. The test cases used
for the performance analysis presented in Section 7 have been chosen so
that basis expansion has a negligible impact on scalability. Also, in
those matrices the decision on the explicit storage of [A.sup.*] makes
little difference.
Vector addition and scaling operations can be parallelized
trivially, with no associated communication. Therefore, the parallel
efficiency of the orthogonalization and normalization steps only depends
on how the required inner products are performed. The parallelization of
a vector inner product requires a global reduction operation (an
all-reduce addition), which on distributed-memory platforms has a
significant cost, growing with the number of processes. Moreover, this
operation represents a global synchronization point in the algorithm,
thus hampering scalability if done too often. Consequently, global
reduction operations should be eliminated whenever possible, for
instance by grouping together several inner products in a single
reduction. The multiple-vector product operation accomplishes all the
individual inner products with just one synchronization point and
roughly the same communication cost as just one inner product. A
detailed analysis of these techniques applied to the Arnoldi method for
eigenproblems has been published by the authors [18].
As a consequence, classical Gram-Schmidt (CGS) orthogonalization
scales better than the MGS procedure used in Algorithm 5, because CGS
computes all the required inner products with the unmodified vector
[q.sub.j+1], that is, c = [Q.sup.*.sub.j] [q.sub.j+1], and this
operation can be carried out with a single communication. However, CGS
is known to be numerical unstable when implemented in finite precision
arithmetic. This problem can be solved by iterating the CGS procedure,
until the resulting vector is sufficiently orthogonal to working
accuracy. We will refer to this technique as selective
reorthogonalization. A simple criterion can be used to avoid unnecessary
reorthogonalization steps [9]. This criterion involves the computation
of the vector norms before and after the orthogonalization. The parallel
overhead associated to the first vector norm can be eliminated by
joining its communication with the inner products corresponding to the
computation of the orthogonalization coefficients c. The explicit
computation of the second norm can be avoided with a simple technique
used in [14]. The basic idea is to estimate this norm starting from the
original norm (before the orthogonalization), by simply applying the
Pythagorean theorem as described later in this section. Usually, at most
one reorthogonalization is necessary in practice, although provision has
to be made for a second one to handle specially ill-conditioned cases.
Another parallel optimization that can be applied to Algorithm 5 is
to postpone the normalization of [p.sub.j] until after the
orthogonalization of [q.sub.j+1]. This allows for the union of two
global communications, thus eliminating one synchronization point. As a
side effect, the basis expansion has to be done with an unnormalized
vector, but this is not a problem provided that all the computed
quantities are corrected as soon as the norm is available. This
technique was originally proposed in [22] in the context of an Arnoldi
eigensolver without reorthogonalization.
Applying all the above mentioned optimizations to Algorithm 5
results in Algorithm 6. In this algorithm, lines 7 and 14 are the CGS
orthogonalization step, and lines 17 and 19 are the CGS
reorthogonalization step. The selective reorthogonalization criterion is
checked in line 16, typically with a value of [eta] = 1/ [square root of
2] as suggested in [9, 28]. In this expression, [rho] represents the
norm of [q.sub.j+1] before the orthogonalization. This value is computed
explicitly, as discussed later in this section. In contrast, the norm
after orthogonalization, [[beta].sub.j], is estimated in lines 15 and
20. These estimations are based on the following relation due to lines
14 and 19,
[q.sub.j+1] = [q'.sub.j+1] - [Q.sub.j]c, (5.1)
where [q.sub.j+1] and [q'.sub.j+1] denote the vector before
and after reorthogonalization, respectively, and [Q.sub.j]c is the
vector resulting from projecting [q.sub.j+1] onto span {[q.sub.1],
[q.sub.2], ..., [q.sub.j]}. In exact arithmetic, [q'.sub.j+1] is
orthogonal to this subspace and it is possible to apply the Pythagorean
theorem to the right-angled triangle formed by these three vectors
[parallel][[q.sub.j+1][parallel].sup.2.sub.2] =
[parallel][[q'.sub.j+1][parallel].sup.2.sub.2] +
[parallel][Q.sub.j]c[[parallel].sup.2.sub.2]. (5.2)
Since the columns of [Q.sub.j] are orthonormal, the wanted norm can
be computed as
[[parallel] [[q'.sub.j+1][parallel].sub.2] = [square root of
[parallel][q.sub.j+1][[parallel].sup.2.sub.2] -
[[summation].sup.j.sub.i=1] [c.sup.2.sub.i] (5.3)
In deriving (5.3), we have assumed that [q'.sub.j+1] is
orthogonal to span{[q.sub.1], [q.sub.2], ..., [q.sub.j]} and that the
columns of [Q.sub.j] are orthonormal. These assumptions are not
necessarily satisfied infinite precision arithmetic, and for this reason
we consider it as an estimation of the norm. Usually, these estimates
are very accurate because the [c.sub.i] coefficients are very small
compared to [parallel][q.sub.j+1][[parallel].sup.2.sub.2], that is,
[q.sub.j+1] and [q'.sub.j+1] have roughly the same norm.
The first estimate (line 15) may be inaccurate if [q.sub.j+1] is
not fully orthogonal after the first orthogonalization. This does not
represent a problem for the normalization stage, because in that case
the criterion would force a reorthogonalization step and then a new norm
estimation would be computed. Although the reorthogonalization criterion
may seem less trustworthy, due to the use of estimates, the numerical
experiments described in Section 6 reveal that this algorithm is as
robust as Algorithm 5. In very exceptional cases, [parallel] [q.sub.j+1]
[[parallel].sup.2.sub.2] could be as small as
[[summation].sup.j.sub.i=1] [c.sup.2.sub.i], so that it is safer to
discard the estimate and to compute the norm explicitly. This
implementation detail is omitted from Algorithm 6 in order to maintain
its readability.
The second major enhancement incorporated into Algorithm 6 is that
the computation of [[alpha].sub.j] and the normalization of [p.sub.j]
are delayed until after the computation of [q.sub.j+1] = [A.sup.*]
[p.sub.j] (these operations appear at the beginning of the loop in
Algorithm 5). Thus, [q.sub.j+1] must be corrected with this factor
[[alpha].sub.j] in line 11,
[q'.sub.j+1] = [A.sup.*] [p'.sub.j] = [A.sup.*]
[[alpha].sup.-1.sub.j] [p.sub.j] = [[alpha].sup.-1.sub.j] [A.sup.*]
[p.sub.j] = [[alpha].sup.-1.sub.j] [q.sub.j+1], (5.4)
where [p'.sub.j] denotes the vector [p.sub.j] after the
normalization, and [q'.sub.j+1] denotes the vector [q.sub.j+1]
after the correction. In lines 12 and 13, c and [rho] are also corrected
as
c' = [Q.sup.*.sub.j] [q'.sub.j+1] = [Q.sup.*.sub.j]
[[alpha].sup.-1.sub.j] [q.sub.j+1] = [[alpha].sup.-1.sub.j]
[Q.sup.*.sub.j] [q.sub.j+1] = [[alpha].sup.-1.sub.j]c, (5.5)
[MATHEMATICAL EXPRESSION NOT REPRODUCIBLE IN ASCII]. (5.6)
[FIGURE 6.1 OMITTED]
In Algorithm 6, the communications associated with operations in
lines 7-9 can be joined together in one multiple reduction message. In
the same way, the operations in lines 17 and 18 can also be joined in
one message. The rest of the operations, with the exception of the two
matrix-vector products in lines 6 and 24, can be executed in parallel
trivially (without communication). Therefore, the parallel
implementation of Algorithm 6 needs only one (or two if
reorthogonalization is needed) global synchronizations per iteration.
This is a huge improvement over Algorithm 5, that has j + 2
synchronizations per iteration.
6. Numerical results. Algorithms 5 and 6 are not equivalent when
using finite precision arithmetic. Therefore, their numerical behaviour
must be analyzed. In order to check the accuracy of the computed
singular values and vectors, the associated relative errors are computed
explicitly after the finalization of the Lanczos process. If any of
these values is greater than the required tolerance, then the bound used
in the stopping criterion (2.12) is considered incorrect.
In this section, we perform an empirical test on a set of
real-problem matrices, using the implementation referred to in Section
5, with standard double precision arithmetic. The analysis consists of
measuring the relative error when computing the 10 largest singular
values of non-Hermitian matrices from the Harwell-Boeing [12] and NEP
[2] collections. These 165 matrices come from a variety of real
applications. For this test, the solver is configured with tolerance
equal to 10-7 and a maximum of 30 basis vectors. This relative error is
computed as
[MATHEMATICAL EXPRESSION NOT REPRODUCIBLE IN ASCII] (6.1)
for every converged singular value [[sigma].sub.i] and its
associated vectors [v.sub.i] and [u.sub.i].
The results obtained running Algorithm 6 on these tests are shown
in Fig. 6.1, where each dot corresponds to the maximum relative error
for one matrix within the collections. Only in three cases (arc 130,
west0156, and pores_1), both Algorithm 6 and 5 produce some singular
triplets with a residual larger than the tolerance. In these cases,
there is a difference of six orders of magnitude between the largest and
smallest computed singular values, and this causes instability in the
one-sided variants. However, these large errors can be avoided simply by
explicitly reorthogonalizing the left Lanczos vectors [p.sub.j] at the
end of the computation.
7. Performance analysis. In order to assess the parallel efficiency
of the proposed algorithm (Algorithm 6) and compare it with the original
one (Algorithm 5), several test cases were analyzed on two different
computer platforms. In all these cases the solver was requested to
compute 10 eigenvalues with tolerance set to [10.sup.-7], using a
maximum of 30 basis vectors.
On one hand, two square matrices arising from real applications
were used for measuring the parallel speed-up. These matrices are
AF23560 (order 23,560 with 460,598 non-zero elements) from the NEP [2]
collection, and PRE2 (order 659,033 with 5,834,044 non-zero elements)
from the University of Florida Sparse Matrix Collection [10]. The
speed-up is calculated as the ratio of elapsed time with p processors to
the fastest elapsed time with one processor. On the other hand, a
synthetic test case is used for analyzing the scalability of the
algorithm, measuring the scaled speed-up with variable problem size.
The first machine consists of 55 biprocessor nodes with Pentium
Xeon processors at 2.8 GHz with 2 Gbytes of RAM, interconnected with an
SCI network in a 2-D torus configuration. SLEPc was built with Intel
compilers (version 10.1 with -O3 optimization level) and the Intel MKL
version 8.1 library. Due to memory hardware contention problems, only
one processor per node was used in the tests reported in this section.
As shown in Tables 7.1 and 7.2, both algorithms carry out the same
number of matrix-vector products and restarts for both problems.
Although Algorithm 6 performs more vector dot products than Algorithm 5
due to reorthogonalization, the sequential execution times are similar,
with an advantage of the proposed algorithm for the larger problem. This
is due to the fact that CGS with reorthogonalization exploits the memory
hierarchy better than MGS. These tables also show the execution times
for the different operations. The largest execution time corresponds to
the vector AXPY operations that are used in the orthogonalization phase
and in the computation of the Ritz vectors during restart (Eqs. 4.3 and
4.4). Regarding the benefits of explicitly storing [A.sup.*] versus
using MatMultTranspose, in these cases the gain is hardly perceptible,
only about 0.08% reduction in the overall time.
[FIGURE 7.1 OMITTED]
[FIGURE 7.2 OMITTED]
As expected, Figure 7.1 shows that Algorithm 6 has better speed-up
than Algorithm 5 with the AF23560 matrix. However, both algorithms show
a good speed-up with the larger PRE2 matrix. In this case, the
communication time is shadowed by the high computational cost of this
problem and relative low number of processors.
These two tests were repeated on the MareNostrum computer in order
to extend the analysis to more processors. This computer is configured
as a cluster of 2,560 JS21 blades interconnected with a Myrinet network.
Each blade has two IBM PowerPC 970MP dual-core processors at 2.3 GHz and
2 Gbytes of main memory. The IBM XL C compiler with default optimization
and the ESSL library were used to build SLEPc. The sequential behaviour
of the two algorithms on this machine is similar to the one reported
previously in Tables 7.1 and 7.2. Results with the AF23560 and PRE2
matrices (Fig. 7.2) show the clear advantage of Algorithm 6 over
Algorithm 5 as the number of processors increases. The implementation
described in this work obtained a quasi-linear performance improvement
up to 3 84 processors with the PRE2 test problem.
[FIGURE 7.3 OMITTED]
In these fixed size problems, the work assigned to each processor
gets smaller as the number of processors increases, thus limiting the
performance with a large number of processors. To minimize this effect,
the size of local data must be kept constant, that is, the matrix
dimension must grow proportionally to the number of processors, p. For
this analysis, a square non-symmetric tridiagonal matrix with random
entries has been used, with a dimension of 100,000 x p. The scaled
speed-up shown in the left plot of Fig. 7.3 is almost linear up to 448
processors, with a slight advantage for Algorithm 6. However, the
proposed algorithm has significantly better throughput and gets closer
to the machine's peak performance, as shown in the right plot of
Fig. 7.3.
8. Discussion. In this paper, an optimized thick-restarted Lanczos
bidiagonalization algorithm has been proposed in order to improve
parallel efficiency in the context of singular value solvers. This
algorithm is based on one-sided full reorthogonalization via iterated
Classical Gram-Schmidt and its main goal is to reduce the number of
synchronization points in their parallel implementation. The thick
restart technique has proved to be effective in most cases, guaranteeing
fast convergence with moderate memory requirements.
The performance results presented in Section 7 show that the
proposed algorithm achieves good parallel efficiency in all the test
cases analyzed, and scales well when increasing the number of
processors.
The SLEPc implementation of the algorithm analyzed in this paper
represents an efficient and robust way of computing a subset of the
largest singular values, together with the associated singular vectors,
of very large and sparse matrices in parallel computers. How ever, the
standard Rayleigh-Ritz projection used by the solver is generally
inappropriate for computing small singular values. Therefore, as a
future work, it remains to implement also the possibility of performing
a harmonic Ritz projection, as proposed in [1].
Acknowledgement. The authors thankfully acknowledge the computer
resources, technical expertise and assistance provided by the Barcelona
Supercomputing Center (Centro Nacional de Supercomputacion).
REFERENCES
[1] J. BAGLAMA AND L. REICHEL, Augmented implicitly restarted
Lanczos bidiagonalization methods, SIAM J. Sci. Comput., 27 (2005), pp.
19-42.
[2] Z. BAI, D. DAY, J. DEMMEL, AND J. DONGARRA, A test matrix
collection for non-Hermitian eigenvalue problems (release 1.0),
Technical Report CS-97-355, Department of Computer Science, University
of Tennessee, Knoxville, TN, USA, 1997. Available at
http://math.nist.gov/MatrixMarket.
[3] Z. BAI, J. DEMMEL, J. DONGARRA, A. RUHE, AND H. VAN DER VORST,
eds., Templates for the Solution of Algebraic Eigenvalue Problems: A
Practical Guide, Society for Industrial and Applied Mathematics,
Philadelphia, PA, 2000.
[4] S. BALAY, K. BUSCHELMAN, V. Eijkhout, W. D. Gropp, D. Kaushik,
M. Knepley, L. C. McInnes, B. F. SMITH, AND H. ZHANG, PETSc users
manual, Tech. Rep. ANL-95/11--Revision 2.3.3, Argonne National
Laboratory, 2007. Available at http://www.mcs.an .gov/petsc/petsc-as.
[5] M. W. BERRY, Z. DRMAC, AND E. R. JESSUP, Matrices, vector
spaces, and information retrieval, SIAM Rev., 41 (1999), pp. 335-362.
[6] A. BJORCK, E. GRIMME, AND P. VAN DOOREN, An implicit shift
bidiagonalization algorithm for ill-posed systems, BIT, 34 (1994), pp.
510-534.
[7] A. COOPER, M. SZULARZ, AND J. WESTON, External selective
orthogonalization for the Lanczos algorithm in distributed memory
environments, Parallel Comput., 27 (2001), pp. 913-923.
[8] J. CULLUM, R. A. WILLOUGHBY, AND M. LAKE, A Lanczos algorithm
for computing singular values and vectors of large matrices, SIAM J.
Sci. Statist. Comput., 4 (1983), pp. 197-215.
[9] J. W. DANIEL, W. B. GRAGG, L. KAUFMAN, AND G. W. STEWART,
Reorthogonalization and stable algorithms for updating the Gram--Schmidt
QR factorization, Math. Comp., 30 (1976), pp. 772-795.
[10] T. DAVIS, University of Florida Sparse Matrix Collection. NA
Digest, 1992. Available at
http://www.cise.ufl.edu/research/sparse/matrices.
[11] J. W. DEMMEL AND W. KAHAN, Accurate singular values of
bidiagonal matrices, SIAM J. Sci. Statist. Comput., 11 (1990), pp.
873-912.
[12] I. S. DUFF, R. G. GRIMES, AND J. G. LEWIS, Sparse matrix test
problems, ACM Trans. Math. Software, 15 (1989), pp. 1-14.
[13] L. Elden AND E. Sjostrom, Fast computation ofthe principal
singular vectors of Toeplitz matrices arising in exponential data
modelling, Signal Processing, 50 (1996), pp. 151-164.
[14] J. FRANK AND C. VUIK, Parallel implementation of a multiblock
method with approximate subdomain solution, App. Numer. Math., 30
(1999), pp. 403-423.
[15] G. H. GOLUB AND W. KAHAN, Calculating the singular values and
pseudo-inverse of a matrix, SIAM J. Numer. Anal., Ser. B, 2 (1965), pp.
205-224.
[16] G. H. GOLUB, F. T. LUK, AND M. L. OVERTON, A block Lanczos
method for computing the singular values of corresponding singular
vectors of a matrix, ACM Trans. Math. Software, 7 (1981), pp. 149-169.
[17] M. HANKE, On Lanczos based methods for the regularization of
discrete ill-posed problems, BIT, 41 (2001), pp. 1008-1018.
[18] V. HERNANDEZ, J. E. ROMAN, AND A. TOMAs, Parallel Arnoldi
eigensolvers with enhanced scalability via global communications
rearrangement, Parallel Comput., 33 (2007), pp. 521-540.
[19] V. HERNANDEZ, J. E. ROMAN, A. TOMAs, AND V. VIDAL, SLEPc users
manual, Tech. Rep. DSIC-II/24/02-Revision 2.3.3, D. Sistemas
Informaticos y Computacioon, Universidad Politecnica de Valencia, 2007.
Available at http://www.grycap.upv.es/slepc.
[20] V. HERNANDEZ, J. E. ROMAN, AND V. VIDAL, SLEPc: A scalable and
flexible toolkit for the solution of eigenvalue problems, ACM Trans.
Math. Software, 31 (2005), pp. 351-362.
[21] Z. JIA AND D. NIU, An implicitly restarted refined
bidiagonalization Lanczos method for computing a partial singular value
decomposition, SIAM J. Matrix Anal. Appl., 25 (2003), pp. 246-265.
[22] S. K. KIM AND A. T. CHRONOPOULOS, An efficient parallel
algorithm for extreme eigenvalues of sparse nonsymmetric matrices, Int.
J. Supercomp. Appl., 6 (1992), pp. 98-111.
[23] E. KOKIOPOULOU, C. BEKAS, AND E. GALLOPOULOS, Computing
smallest singular triplets with implicitly restarted Lanczos
bidiagonalization, App. Numer. Math., 49 (2004), pp. 39-61.
[24] L. Komzsik, The Lanczos Method: Evolution and Application,
Society for Industrial and Applied Mathematics, Philadelphia, PA, 2003.
[25] R. M. Larsen, Lanczos bidiagonalization with partial
reorthogonalization, Tech. Rep. PB-537, Department of Computer Science,
University ofAarhus, Aarhus, Denmark, 1998. Available at
http://www.daimi.au.dk/PB/537.
[26] --, Combining implicit restart and partial reorthogonalization
in Lanczos bidiagonalization, Tech. Rep., SCCM, Stanford University,
2001. Available at http://soi.stanford.edu/~rmunk/PROPACK.
[27] B. N. PARLETT, The Symmetric Eigenvalue Problem,
Prentice-Hall, Englewood Cliffs, NJ, 1980. Reissued with revisions by
SIAM, Philadelphia, 1998.
[28] L. REICHEL and W. B. GRAGG, FORTRAN subroutines for updating
the QR decomposition, ACM Trans. Math. Software, 16 (1990), pp. 369-377.
[29] Y. SAAD, Numerical Methods for Large Eigenvalue Problems:
Theory and Algorithms, John Wiley and Sons, New York, 1992.
[30] H. D. SIMON, The Lanczos algorithm with partial
reorthogonalization, Math. Comp., 42 (1984), pp. 115142.
[31] H. D. SIMON AND H. ZHA, Low-rank matrix approximation using
the Lanczos bidiagonalization process with applications, SIAM J. Sci.
Comput., 21 (2000), pp. 2257-2274.
[32] K. WU AND H. SIMON, Thick-restart Lanczos method for large
symmetric eigenvalue problems, SIAM J. Matrix Anal. Appl., 22 (2000),
pp. 602-616.
VICENTE HERNANDEZ ([dagger]), JOSE E. ROMAN ([dagger]) AND ANDRES
TOMAS ([dagger])
* Received November 30, 2007. Accepted July 18, 2008. Published
online on January 19, 2009. Recommended by Daniel Kressner.
([dagger]) Instituto ITACA, Universidad Politecnica de Valencia,
Camino de Vera s/n, 46022 Valencia, Spain,
({vhernand,jroman,antodo}@itaca.upv.es). Work partially supported by the
Valencia Regional Administration, Directorate of Research and Technology
Transfer, under grant number GV06/091, by Universidad Politecnica de
Valencia, under program number PAID-04-07, and by the Centro para el
Desarrollo Tecnologico Industrial (CDTI) of the Spanish Ministry of
Industry, Tourism and Commerce through the CDTEAM Project (Consortium
for the Development of Advanced Medicine Technologies).
ALGORITHM 1 (Golub-Kahan-Lanczos Bidiagonalization).
Choose a unit-norm vector [q.sub.1]
Set [[beta].sub.0] = 0
For j = 1,2, ... ,k
[p.sub.j] = A[q.sub.j] - [[beta].j-1] [p.sub.j-1]
[[alpha].sub.j] = [[parallel][p.sub.j][parallel].sub.2]
[p.sub.j] = [p.sub.j] /[[alpha].sub.j]
[q.sub.j+1] = [A.sup.*] [p.sub.j] - [[alpha].sub.j] [q.sub.j]
[[beta].sub.j] = [[parallel][q.sub.j+1][parallel].sub.2]
[q.sub.j+1] = [q.sub.j+1]/[[beta].sub.j]
endALGORITHM 2 (Lanczos Bidiagonalization with Full Orthogonalization).
Choose a unit-norm vector [q.sub.1]
For j = 1,2, ..., k
[p.sub.j] = A[q.sub.j]
For i = 1,2, ..., j - 1
[gamma] = [p.sup.*.sub.i] [p.sub.j]
[p.sub.j] = [p.sub.j] - [gamma][p.sub.i]
end
[[alpha].sub.j] = [[parallel][p.sub.j][parallel].sub.2]
[p.sub.j] = [p.sub.j]/[[alpha].sub.j]
[q.sub.j+1] = [A.sup.*] [p.sub.j]
For i = 1,2, ..., j
[gamma] = [q.sup.*.sub.i] [q.sub.j+1]
[q.sub.j+1] = [q.sub.j+1] - [gamma][q.sub.i]
end
[[beta].sub.j] = [[parallel][q.sub.j+1][parallel].sub.2]
[q.sub.j+1] = [q.sub.j+1]/[[beta].sub.j]
endALGORITHM 3 (One-Sided Lanczos Bidiagonalization).
Choose a unit-norm vector [q.sub.1]
Set [[beta].sub.0] = 0
For j = 1, 2, ..., k
[p.sub.j] = A[q.sub.j] - [[beta].sub.j-1] [p.sub.j-1]
[[alpha].sub.j] = [parallel] [p.sub.j] [parallel]2
[p.sub.j] = [p.sub.j]/[[alpha].sub.j]
[q.sub.j+1] = [A.sup.*] [p.sub.j]
For i = 1,2, ..., j
[gamma] = [q.sup.*.sub.i] [q.sub.j+1]
[q.sub.j+1] = [q.sub.j+1] - [gamma][q.sub.i]
end
[[beta].sub.j] = [parallel] [q.sub.j+1] [parallel] 2
[q.sub.j+1] = [q.sub.j+1]/[[beta].sub.j]
endAlgorithm 4 (Thick-restart Lanczos Bidiagonalization).
Input: Matrix A, initial unit-norm vector [q.sub.1], and
number of steps k
Output: l [less than or equal to] k Ritz triplets
1. Build an initial Lanczos bidiagonalization of order k
2. Compute Ritz approximations of the singular triplets
3. Truncate to a Lanczos bidiagonalization of order l
4. Extend to a Lanczos bidiagonalization of order k
5. Check convergence and if not satisfied, go to step 2
Algorithm 5 (One-Sided Lanczos Bidiagonalization-restarted).
[p.sub.l+1] = A[q.sub.l+1]
For i = 1, 2, ..., l
pl+1 = pl+1 - [[??].sub.i] [p.sub.i]
end
For j = l +1, l + 2, ...,k
[[alpha].sub.j] = [[parallel][p.sub.j][parallel].sub.2]
[p.sub.j] = [p.sub.j]/[[alpha].sub.j]
[q.sub.j+1] = [A.sup.*] [p.sub.j]
For i = 1,2, ..., j
[gamma] = [q.sup.*.sub.i] [q.sub.j+1]
[q.sub.j+1] = [q.sub.j+1] - [gamma][q.sub.i]
end
[[beta].sub.j] = [[parallel][q.sub.j+1][parallel].sub.2]
[q.sub.j+1] = [q.sub.j+1]/[[beta].sub.j]
If j < k
[p.sub.j+1] = [A.sub.qj+1] - [beta].sub.j] [p.sub.j]
end
endAlgorithm 6 (One-Sided Lanczos Bidiag.--restarted, with enhancements).
1 [p.sub.l+1] = A[q.sub.l+1]
2 For i = 1, 2, ..., l
3 pl+1 = pl+1 - [[??].sub.i][p.sub.i]
4 end
5 For j = l + 1, l + 2, ..., k
6 [q.sub.j+1] = [A.sup.*] [p.sub.j]
7 c = [Q.sup.*.sub.j] [q.sub.j+1]
8 [rho] = [[parallel][q.sub.j+1][parallel].sub.2]
9 [[alpha].sub.j] = [[parallel][p.sub.j][parallel].sub.2]
10 [p.sub.j] = [p.sub.j]/[[alpha].sub.j]
11 [q.sub.j+1] = [q.sub.j+1]/[[alpha].sub.j]
12 c = c/[[alpha].sub.j]
13 [rho] = [rho]/[[alpha].sub.j]
14 [q.sub.j+1] = [q.sub.j+1] - [Q.sub.j]c
15 [[beta].sub.j] = [square root of [[rho].sup.2] -
[[summation].sup.j.sub.i=1] [c.sup.2.sub.i]
16 If [[beta].sub.j] < [eta][rho]
17 c = [Q.sup.*.sub.j] [q.sub.j+1]
18 [rho] = [parallel][q.sub.j+1][parallel]2
19 [q.sub.j+1] = [q.sub.j+1] - [Q.sub.j]c
20 [[beta].sub.j] = [square root of
[[rho].sup.2] - [[summation].sup.j.sub.i=1]
[c.sup.2.sub.i]
21 end
22 [q.sub.j+1] = [q.sub.j+1]/[[beta].sub.j]
23 If j < k
24 [p.sub.j+1] = A[q.sub.j+1] - [[beta].sub.j][p.sub.j]
25 end
26 endTable 7.1
Statistics and sequential execution times on Xeon cluster with the
AF23560 matrix. Times are given in seconds and percentages are
computed with respect to total time.
Algorithm 5 Algorithm 6
Vector dot products 1,122 1,911
Matrix-vector products 116 116
Restarts 3 3
Total execution time 0.84 (100%) 1.22 (100%)
Vector dot products execution time 0.10 (12%) 0.22 (18%)
Vector AXPY execution time 0.36 (43%) 0.60 (49%)
Matrix-vector products execution time 0.34 (40%) 0.34 (28%)
Table 7.2
Statistics and sequential execution times on Xeon cluster with the
PRE2 matrix. Times are given in seconds and percentages are computed
with respect to total time.
Algorithm 5 Algorithm 6
Vector dot products 3,225 5,108
Matrix-vector products 300 300
Restarts 9 9
Total execution time 86.58 (100%) 82.42 (100%)
Vector dot products execution time 12.61 (15%) 14.27 (17%)
Vector AXPY execution time 56.26 (65%) 49.68 (60%)
Matrix-vector products execution time 12.97 (15%) 12.96 (16%)