4 New solvers for electrochemical problems

The majority of this thesis is concerned with the development of two-dimensional finite difference simulations suitable for both traditional electrodes of millimetre dimensions and microelectrodes. Unlike one-dimensional simulations, where direct methods based on fully-implicit or Crank-Nicolson discretisation are efficient, the choice of finite difference discretisation and solver is far less clear-cut: the Hopscotch algorithm is inaccurate for chronoamperometry and inefficient for steady-state problems, yet has dominated the electrochemical literature1. In the following chapters some contemporary iterative algorithms, the multigrid2 and preconditioned Krylov subspace (PKS) methods3, are applied to electrochemical simulations. In this chapter these methods are introduced and assessed.

Even in 1-D simulations, explicit approximations of kinetic terms become unstable at high rate constants. Rudolph's FIFD method, discussed in section 2.6.1.3, is one way around this problem. Here an alternative approach is proposed, using preconditioned Krylov subspace (PKS) methods to solve the sparse linear system rather than a block-factorised Thomas Algorithm. This may be applied in a frontal fashion (as the BI method has been - see the tables at the start of Chapters 6 and 8) to channel and wall-jet electrode problems, providing a fully implicit solution which is stable at all rate constants. Unlike Rudolph's method, a PKS-based simulation may readily be extended to any number of dimensions and is used in the next chapter as the basis of a general electrochemical simulator. In order to achieve rapid solution of two-dimensional problems, a multigrid method was developed which uses the PKS methods as smoothers, thus achieving mesh-independent convergence. It is also shown how the transient BI (or frontal PKS) algorithm may be adapted for parallel processing.

4.1 The Multigrid Method (Brandt, 1977)4

The multigrid method is an efficient way of solving a set of finite difference equations, based on interpolating between grids of different coarseness. In fact, multigrid methods are often the 'optimal' algorithms for solving these linear systems2,5. The fundamental idea is that it is very quick to solve the system on a coarse grid, which may be interpolated and used as a rough approximation to the solution on a fine grid.

A linear set of finite difference equations may be written in matrix form:

Mu = b    (4.1)

where u is an unknown vector, b is a known vector and M is a known band matrix corresponding to the stencil coefficients at each node.

The matrix equation (4.1) is solved iteratively from a starting approximation, u_0. For a particular iteration, i, the residual is given by:

r_i = b - Mu_i    (4.2)

This may be used as the basis of a defect correction equation:

u_{i+1} = u_i + e_i    (4.3)

The correction, or error, e_i is related to the residual by:

e_i = M^{-1}r_i    (4.4)

This may be expressed as a linear system:

Me_i = r_i    (4.5)

which itself may be solved by an iterative (or direct) method.

In a linear multigrid method, the residuals are restricted onto the coarse grid and then the equation Me=r is solved on this grid for the errors. The errors are then interpolated up to the fine grid where they are used to correct the solution. The principles may be illustrated by considering a 2-level scheme, consisting of a fine grid and a coarse grid:

  1. An initial value of the fine-grid solution, u_0^fine, is provided.
  2. Initially u^fine is 'smoothed' - the smoothing algorithm may consist of a few iterations of a linear solver such as Gauss-Seidel. This improves u^fine a little, removing some of the high-frequency errors.
  3. The residual is calculated.
  4. The residual is then mapped onto the coarse grid using a restriction operator, R. This usually involves mapping several points of the finer grid onto a point of the coarser grid using some sort of weighted average. Similarly M^coarse must either be generated using finite differences or by restriction of M^fine, using a Galerkin operator:
    M^coarse = R M^fine P    (4.6)
    where P is the prolongation operator introduced in step 6.
  5. The system Me = r is then solved for the errors, e, on the coarse grid. This may or may not use the same algorithm as the smoother. If an iterative method is used, the initial approximation for e should be zero.
  6. e^coarse is then interpolated onto the fine grid using the prolongation (interpolation) operator, P, to generate e^fine. The prolongation operator is usually chosen to be the transpose of the restriction operator. This (together with the use of the Galerkin operator - the two together known as 'variational properties') allows analysis of the coarse-grid correction scheme6.
  7. The approximate solution u_{i+1} is then formed by adding e_i^fine to u_i.
  8. The smoother is applied once more to 'clean up' u_{i+1}, and the next iteration begins.

Figure 4.1: Multigrid V-cycle.

Rather than just using 2 levels, steps 2-4 and 6-8 may be repeated recursively on increasingly coarse meshes, resulting in what is known as the V-cycle, depicted in Figure 4.1. The system is only fully solved (step 5) on the coarsest mesh, which lies at the bottom of the 'V'. The greater the number of mesh levels for a given problem size, the greater the efficiency of each V-cycle. V-cycle iterations (consisting of steps 2-8) continue until the required accuracy is met.
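The defect-correction structure of steps 1-8 can be made concrete with a small example. The sketch below applies a recursive V-cycle to a one-dimensional Poisson problem using Gauss-Seidel smoothing, full-weighting restriction and linear interpolation; the model problem, grid sizes and smoothing counts are illustrative assumptions and do not correspond to MGD1 or to the electrochemical simulations in this thesis.

#include <stdio.h>
#include <stdlib.h>
#include <math.h>

/* Gauss-Seidel smoothing sweeps for -u'' = f on n interior nodes, spacing h */
static void smooth(double *u, const double *f, int n, double h, int sweeps)
{
    for (int s = 0; s < sweeps; s++)
        for (int i = 1; i <= n; i++)
            u[i] = 0.5 * (u[i-1] + u[i+1] + h*h*f[i]);
}

/* Residual r = f - Mu for the standard 3-point Laplacian */
static void residual(const double *u, const double *f, double *r, int n, double h)
{
    for (int i = 1; i <= n; i++)
        r[i] = f[i] - (2.0*u[i] - u[i-1] - u[i+1]) / (h*h);
    r[0] = r[n+1] = 0.0;
}

/* One V-cycle of the defect-correction scheme (steps 2-8 above) */
static void vcycle(double *u, const double *f, int n, double h)
{
    smooth(u, f, n, h, 3);                              /* pre-smoothing               */
    if (n < 3) { smooth(u, f, n, h, 50); return; }      /* 'exact' solve on coarsest   */

    int nc = (n - 1) / 2;                               /* coarse grid: every 2nd node */
    double *r  = calloc(n + 2,  sizeof(double));
    double *rc = calloc(nc + 2, sizeof(double));
    double *ec = calloc(nc + 2, sizeof(double));

    residual(u, f, r, n, h);
    for (int i = 1; i <= nc; i++)                       /* full-weighting restriction  */
        rc[i] = 0.25*r[2*i-1] + 0.5*r[2*i] + 0.25*r[2*i+1];

    vcycle(ec, rc, nc, 2.0*h);                          /* solve Me = r recursively    */

    for (int i = 1; i <= nc; i++) {                     /* prolongation and correction */
        u[2*i]   += ec[i];
        u[2*i-1] += 0.5*(ec[i-1] + ec[i]);
    }
    u[n] += 0.5*ec[nc];                                 /* last fine node              */

    smooth(u, f, n, h, 3);                              /* post-smoothing              */
    free(r); free(rc); free(ec);
}

int main(void)
{
    const double PI = 3.14159265358979;
    int n = 63;                                         /* allows 5 usable mesh levels */
    double h = 1.0/(n+1);
    double *u = calloc(n+2, sizeof(double));
    double *f = calloc(n+2, sizeof(double));
    for (int i = 1; i <= n; i++) f[i] = sin(PI*i*h);    /* -u'' = sin(pi x), u(0)=u(1)=0 */

    for (int it = 0; it < 10; it++) vcycle(u, f, n, h);
    printf("u(0.5) = %g (analytic %g)\n", u[(n+1)/2], 1.0/(PI*PI));
    free(u); free(f);
    return 0;
}

Compiled with a C99 compiler and linked against the maths library, a handful of such V-cycles should reduce the algebraic error well below the discretisation error of the 3-point stencil, which is the behaviour exploited by the methods described in the rest of this chapter.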

4.1.1 MGD1 (NAG Library subroutine D03EDF7)

MGD1 is a multigrid method which can be used to solve an elliptic partial differential equation of the form:

(4.7)

over a rectangular region. The seven-point discretised form, corresponding to the stencil shown in Figure 4.2:

(4.8)

is solved by multigrid iteration starting from an approximate solution. u_{j,k} denotes a node on a two-dimensional discrete grid, with indices j and k corresponding to the y and x co-ordinates, respectively. A^i_{j,k} are the coefficients of the seven-point finite difference stencil at each node.


Figure 4.2: 7-point stencil used in MGD1.

The high frequency components of the error are eliminated by ILU smoothing, using an incomplete Crout reduction to decompose the matrix into the upper and lower triangular matrices (see the next section on Preconditioned Krylov Subspace methods for more information on ILU decomposition). Gauss-Seidel iteration is used for the coarse grid solution. Bilinear interpolation is used for prolongation and its transpose is used for restriction. Restriction of the coefficient matrix is achieved using a Galerkin operator, formulated for the seven-point stencil. V-cycle iterations continue until the termination criterion is met, when the residual 2-norm falls below a threshold value:

(4.9)

Note that MGD1 will not diverge for a strictly diagonally dominant matrix (since an ILU factorisation with a diagonal pivoting strategy will never encounter zero pivots8):

|M_{i,i}| > Σ_{j≠i} |M_{i,j}|   for all i    (4.10)

However if this condition is strongly violated, divergence may occur7. MGD1 is available as a subroutine in the NAG FORTRAN library and is therefore immediately available for application by electrochemists.
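As an aside, condition (4.10) is straightforward to test before calling the solver. The function below checks it for a dense-stored matrix; it is purely illustrative, since MGD1 itself works with seven-diagonal storage.

#include <math.h>

/* Returns 1 if the n x n matrix m (dense, row-major) satisfies the strict
   diagonal dominance condition of equation (4.10), and 0 otherwise. */
int strictly_diagonally_dominant(const double *m, int n)
{
    for (int i = 0; i < n; i++) {
        double off = 0.0;
        for (int j = 0; j < n; j++)
            if (j != i) off += fabs(m[i*n + j]);
        if (fabs(m[i*n + i]) <= off) return 0;
    }
    return 1;
}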

4.1.2 Efficiency

A finite difference model of a single channel microband electrode was constructed using the steady-state mass transport equation:

D(∂²[A]/∂x² + ∂²[A]/∂y²) - v_x ∂[A]/∂x = 0    (4.11)

and applying the boundary conditions in Table 4.1. The relative efficiencies of SIP and MGD1 were compared in simulating the steady-state current at a channel microband electrode, using the parameters in Table 4.2, representing a typical experimental system.

Table 4.1: Boundary conditions for a microband channel electrode simulation
Region of simulation space     Boundary condition     Finite difference implementation
Left wall                                             k = 0  [x = 0]
Right wall                                            k = NGX  [x = (lu+ld+1).xe]
Upper wall                                            j = NGY  [y = 2fh]
Lower wall                                            j = 0  [y = 0]
 - before electrode                                   k = 1 to lu.kE
 - at electrode                a = 0                  k = lu.kE+1 to (lu+1).kE
 - after electrode                                    k = (lu+1).kE+1 to NGX

Table 4.2: Channel and electrode geometry
Electrode length                      xe         30 µm
Electrode width                       w          0.4 cm
Channel width                         d          0.6 cm
Channel height                        2h         0.04 cm
Flow rate                             vf         1x10-2 cm3 s-1
Diffusion coefficient                 D          2x10-5 cm2 s-1
Bulk concentration                    [A]bulk    1x10-6 mol cm-3
No. of electrode lengths upstream     lu         1
No. of electrode lengths downstream   ld         1
Fraction of the channel simulated     f          20%

The number of iterations necessary to achieve a residual norm of 1x10-8 and the CPU times on a Silicon Graphics Indigo(tm) [133 MHz R3000 processor, 32 Mb RAM] are shown in Table 4.3. It is clear that the CPU time per iteration is about 15% less for MGD1 than for SIP for typical meshes of several hundred nodes in each direction.

For a 641x641 node simulation (a mesh that typically gives convergence in electron-transfer simulations using Cartesian grids with less than 1% error9) the MGD1 method required only 4 iterations to give a fractional error in the current of less than 0.05%. The fractional error is defined in terms of the current after n iterations with respect to the current from MGD1 when the residual 2-norm is less than 1x10-3. SIP, with an optimal APARAM value of 84, required 26 iterations to give a fractional error in the current of 1% and 56 iterations to give an error less than 0.05%.

Table 4.3: Relative efficiencies of MGD1 and SIP as a function of the number of nodes
         MGD1                                      SIP                                       CPU time/it
NGX    no. of its   CPU time   CPU time/it      no. of its   CPU time   CPU time/it      ratio
6      1            0.01       0.010            6            0.01       0.002
11     5            0.02       0.004            8            0.03       0.004
21     11           0.17       0.015            10           0.17       0.017            0.909
41     11           0.64       0.058            16           0.98       0.061            0.950
81     16           3.27       0.204            22           5.32       0.242            0.845
161    16           13.03      0.814            34           32.86      0.966            0.843
321    16           51.96      3.248            48           183.68     3.827            0.849

4.1.3 Limitations

As with other stencil-based finite difference solvers such as SIP, fully implicit simulations are limited to homogeneous mechanisms where the species may be solved sequentially such as ECE and EC2E.

Due to the ILU smoother, MGD1 will diverge if the matrix is highly off-diagonally dominant. In electrochemical simulations, sources of off-diagonal dominance include:

  1. Discretisation of 1st derivatives, leading to terms where the contribution from the 1st derivative is of larger magnitude than, and opposite sign to, the contribution from the other terms.
  2. Formation of a species through a chemical reaction.
  3. Back-to-back representation of a heterogeneous boundary condition.

Conversely there are factors which increase the diagonal dominance:

  1. (Dirichlet) transport-limited electrolysis and bulk concentration boundary conditions.
  2. Loss of a species through a chemical reaction.

In the absence of these, the magnitude of the diagonal balances the off-diagonals in the general finite difference equation. Thus in the absence of kinetics the internal nodes and no-flux boundaries of the transport matrix show neither diagonal nor off-diagonal dominance.

Where off-diagonal dominance arises from the asymmetry of first derivatives, it may be removed by discretising the first derivative as an upwind or downwind difference, depending on the sign of its coefficient in the mass transport equation (i.e. diffusion terms are positive, requiring downwind differences, whilst convection terms are negative, requiring upwind differences, to ensure no off-diagonal dominance).

As noted in Chapter 3, conformal mappings generally result in a more poorly conditioned linear system due to the large range of coefficient magnitudes. Even when the matrix is diagonally dominant, MGD1 may diverge if the system is badly conditioned; examples of this were encountered in the conformally mapped simulations described in the following chapters.

In section 4.4, we investigate what might be causing multigrid methods to fail to converge in these systems.

4.2 The preconditioned Krylov subspace methods

Unlike many direct elimination methods, these iterative methods only require storage of, and calculation using, the non-zero elements of the coefficient matrix. Despite this, they retain the flexibility of the direct matrix methods - they may be used to solve randomly structured, randomly valued non-symmetric sparse matrix equations. These solvers, known as preconditioned Krylov subspace (PKS) methods, have been applied successfully to finite difference and finite element simulations in quantum dynamics10-12 and other fields13,14. Conveniently, these methods are available in the NAG FORTRAN library7 (Mk 18 onwards) and in DiffPack15 (free to academic users) so they may be used easily by electrochemists.

4.2.1 Basis

When using preconditioned Krylov methods, solution occurs in two stages - the first involves the generation of a preconditioner matrix P. The second stage is to solve the preconditioned system iteratively:

Nu = q    where N = P^{-1}M and q = P^{-1}b    (4.12)

The point of the preconditioner, whose inverse is a rough approximation to that of the coefficient matrix, is that it makes the system 'easier' to solve (i.e. fewer Krylov subspace iterations are required) and yet can be constructed in less CPU time than it saves. A common approach to preconditioning is to derive P via an incomplete LU (ILU) factorisation16 of the matrix M - a truncated form of Gaussian elimination. A parameter known as the 'fill level' (LFILL) controls how much of the LU factorisation is retained and how much is discarded.


Figure 4.3: Fill elements in ILU factorisation. The zero fill (LFILL=0) and LFILL=1 cases are shown. The white squares correspond to the fill elements introduced into the coefficient matrix.

The concept of fill is illustrated in Figure 4.3, showing the ILU factorisation of a matrix generated from a 5-point finite difference stencil. With a fill level of zero, the only decomposed elements which are retained are at points corresponding to non-zero elements in the original matrix. As the level of fill is increased, more elements of the decomposition are retained providing a better preconditioning matrix but for a higher cost in memory and CPU time.
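The zero-fill case can be written down in a few lines. The sketch below is a dense-storage illustration of ILU(0) in the IKJ ordering, in which any update falling outside the original sparsity pattern is simply discarded; the NAG F11 routines use sparse storage and a Crout-style decomposition, so this shows the idea rather than their implementation.

/* In-place ILU(0) of an n x n matrix held (for brevity) in dense storage a;
   pattern[i][j] is non-zero where the original matrix has a structural
   non-zero.  Fill outside that pattern is dropped. */
void ilu0(double **a, int **pattern, int n)
{
    for (int i = 1; i < n; i++)
        for (int k = 0; k < i; k++) {
            if (!pattern[i][k]) continue;       /* l(i,k) would be fill: skip it */
            a[i][k] /= a[k][k];                 /* multiplier l(i,k)             */
            for (int j = k + 1; j < n; j++)
                if (pattern[i][j])              /* keep only existing positions  */
                    a[i][j] -= a[i][k] * a[k][j];
        }
}

Increasing LFILL corresponds to allowing progressively more of the positions discarded here to be retained in the factorisation.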

It is possible to use a range of other methods as preconditioners - the NAG library also contains subroutines such as F11DEF which may be used for Jacobi preconditioning or SSOR preconditioning of the Krylov subspace methods. Recently Meese17 showed how SIP could be extended to act as a similar preconditioner to ILU, giving up to twice the efficiency of ILU for convective-diffusion problems.

Once the preconditioner matrix, P, has been constructed, one of the three iterative Krylov subspace solvers can be applied. These work by assuming that the next iterative solution lies somewhere in a search subspace, K, which is composed of all the previous solution vectors. In order to extract an approximate solution vector from a subspace of order i, i constraints are needed. These are usually chosen as i orthogonality conditions on the residual vector - r is constrained to be orthogonal to i linearly independent vectors forming L (the subspace of constraints). A Krylov subspace arises when:

K_i = span{r_0, Mr_0, ..., M^{i-1}r_0}    (4.13)

where r_i is the residual at step i, r_i = b - Mu_i. Note that in this section, i is used to represent the iteration number (and hence the subspace order) and j is used as a generic index.

The Krylov methods split the matrix M to generate an iterative scheme for which each successive approximation of the solution, found along a 'search vector', is an element of increasing Krylov subspaces. The process can be likened to a polynomial approximation:

u_i = p(M)r_0    (4.14)

where p is a polynomial function. The simplest splitting M = I - (I-M) gives rise to the Richardson iteration18:

u_i = (I - M)u_{i-1} + b = u_{i-1} + r_{i-1}    (4.15)

which generates solutions

u_i = r_0 + r_1 + ... + r_{i-1} = Σ_{j=0}^{i-1} r_j    (4.16)

which lie in increasing Krylov subspaces

u_i ∈ K_i where K_i = span{r_0, r_1, ..., r_{i-1}}    (4.17)

The general splitting:

M = P - Q = P - (P - M)    (4.18)

can be rewritten as the standard splitting

N = I - (I - N)    (4.19)

for the preconditioned matrix

N = P^{-1}M    (4.20)

In this way the preconditioner may be incorporated into the scheme.
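As a concrete illustration of equations (4.15)-(4.20), the sketch below applies the preconditioned Richardson iteration to a small test system, using a Jacobi (diagonal) preconditioner to stand in for P; the matrix, right-hand side and tolerance are arbitrary choices for the example.

#include <stdio.h>
#include <math.h>

#define N 4

/* One preconditioned Richardson step: u <- u + P^{-1}(b - Mu), with P taken
   as the diagonal of M (Jacobi preconditioning).  Returns ||r||_2. */
static double richardson_step(double M[N][N], const double b[N], double u[N])
{
    double r[N], rnorm = 0.0;
    for (int i = 0; i < N; i++) {                  /* residual r = b - Mu         */
        r[i] = b[i];
        for (int j = 0; j < N; j++) r[i] -= M[i][j] * u[j];
        rnorm += r[i] * r[i];
    }
    for (int i = 0; i < N; i++)                    /* correction u <- u + P^{-1}r */
        u[i] += r[i] / M[i][i];
    return sqrt(rnorm);
}

int main(void)
{
    /* A small, diagonally dominant test system */
    double M[N][N] = {{ 4,-1, 0, 0},
                      {-1, 4,-1, 0},
                      { 0,-1, 4,-1},
                      { 0, 0,-1, 4}};
    double b[N] = {1, 1, 1, 1}, u[N] = {0, 0, 0, 0};

    for (int it = 0; it < 100; it++)
        if (richardson_step(M, b, u) < 1e-10) break;
    printf("u = %f %f %f %f\n", u[0], u[1], u[2], u[3]);
    return 0;
}

The Krylov subspace methods described next improve on this basic scheme by drawing each correction from the whole accumulated subspace rather than from the latest residual alone.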

The Krylov subspace projection methods fall into three classes19:

  1. Identify u_i for which the residual 2-norm, ||b - Mu_i||_2, is minimised over K_i(M;r_0). This is, not surprisingly, known as the minimum residual approach.

  2. Orthogonal projection methods (K = L): construct u_i for which the residual is orthogonal to the current subspace, namely r_i ⊥ K_i(M;r_0). This is known as the Ritz-Galerkin (or sometimes just Galerkin) approach.

  3. Oblique projection methods (K ≠ L): find u_i so that the residual is orthogonal to some other suitable i-dimensional subspace. This is the Petrov-Galerkin approach.

The three methods provided in the NAG library and used in this thesis (CGS, BICGSTAB(l) and RGMRES) fall into the first and third classes. RGMRES is a modification of GMRES (class 1) which acknowledges the increasing number of vectors requiring storage as iterations progress. The way this problem is overcome is to restart the iteration once a maximum Krylov subspace dimension has been reached. CGS and BICGSTAB(l) are derived from the biconjugate gradient method (BiCG), which falls into class 3. The main difference between them is the erratic convergence of CGS versus the smooth convergence of BICGSTAB(l). For a detailed introduction to Krylov subspace methods see refs. 3 and 19. From the electrochemist's point of view, the key point is that there is no 'clear best' Krylov subspace method20 - each is suited to a particular problem class. The following work investigates which of the methods is most suited to solving typical electrochemical problems.

4.2.2 Performance

In this section we compare the efficiencies of the NAG library (section F11) PKS subroutines with 'established' finite difference solvers such as the MGD1 multigrid method and SIP, and with direct LU decomposition. For our model problem we simulate the transport-limited current at a 30 µm long (xe) channel microband electrode in Cartesian co-ordinates, with the parameters shown in Table 4.4. A five-point Taylor expansion was used to compute the flux, and a three-point Taylor expansion was used to apply second-order Neumann boundary conditions in the no-flux regions. All simulations were run on a SGI Indigo2 TM (150 MHz R4400 processor, 192Mb RAM). The boundary conditions for this problem are shown in Table 4.5.

Table 4.4: Parameters for the model channel microband simulation (see Fig. 2.3)
xe         30 µm
w          0.4 cm
2h         0.04 cm
d          0.6 cm
vf         0.01 cm3 s-1
[A]bulk    1x10-6 mol cm-3
D          2x10-5 cm2 s-1
lu         0.5
ld         0.5
f          0.2
PSTRAT     N
f is the fraction of the channel simulated in the y co-ordinate.
PSTRAT is the pivoting strategy used by the ILU preconditioner. The value 'N' corresponds to diagonal pivoting (suitable for a diagonally dominant matrix).
Table 4.5: Boundary conditions for the model channel microband simulation
Region                            Spatial definition              Finite difference nodes             Condition
Upstream                          x = 0                           k = 0                               [A] = [A]bulk
Downstream                        x = (lu+ld+1).xe                k = NGX                             ∂[A]/∂x = 0
Upper retaining wall              y = 2fh                         j = NGY                             ∂[A]/∂y = 0
Lower wall containing electrode   y = 0
 - before electrode               0 < x < lu.xe                   j = 0, k = 1 to lu.kE               ∂[A]/∂y = 0
 - at electrode                   lu.xe < x < (lu+1).xe           j = 0, k = lu.kE+1 to (lu+1).kE     [A] = 0
 - after electrode                (lu+1).xe < x < (lu+ld+1).xe    j = 0, k = (lu+1).kE+1 to NGX       ∂[A]/∂y = 0

The ILU preconditioner and Krylov subspace solvers have a number of adjustable parameters which may be 'tuned' for each type of problem. Fill may be introduced into the ILU preconditioner to improve the preconditioning matrix and thus reduce the number of iterations performed by the solver at the expense of a greater storage requirement.

Figure 4.4 shows the effect of varying the level of fill in the ILU preconditioner with the CGS solver. The optimum value of LFILL is 5 for this problem. Larger values increase the time taken by the preconditioner and the solver per iteration; smaller values increase the number of iterations performed by the solver. In all cases the extra memory required to store the fill was less than the memory used with no fill.


Figure 4.4: Effect of fill in the ILU preconditioner on the CGS solver CPU time

The NAG library subroutine allows the ILU preconditioner to be modified to preserve the row sums in the matrix - the option is known as Modified ILU (MILU). Instead of simply discarding fill elements within a row, they are subtracted from the diagonal element. This has the effect of accelerating the solution at the cost of instability for some systems14, 21. The system of equations arising from discretisation of the channel flow problem is well conditioned, so MILU may be used without any problems. The effect of using MILU is shown in Figure 4.5 - a dramatic improvement in efficiency is observed. Also note that the optimal fill level with MILU is zero.
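In terms of the zero-fill sketch given earlier, the modification amounts to redirecting the dropped updates onto the diagonal; the function below is again a dense-storage illustration rather than the NAG implementation.

/* MILU(0): as ilu0(), but updates that would create fill outside the
   sparsity pattern are subtracted from the diagonal instead of being
   discarded, so that the row sums of the factorisation are preserved. */
void milu0(double **a, int **pattern, int n)
{
    for (int i = 1; i < n; i++)
        for (int k = 0; k < i; k++) {
            if (!pattern[i][k]) continue;
            a[i][k] /= a[k][k];
            for (int j = k + 1; j < n; j++) {
                double update = a[i][k] * a[k][j];
                if (pattern[i][j]) a[i][j] -= update;  /* ordinary ILU update      */
                else               a[i][i] -= update;  /* MILU: lump onto diagonal */
            }
        }
}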


Figure 4.5: Effect of MILU preconditioning

The Krylov subspace methods BICGSTAB(l) and RGMRES both have a single adjustable parameter which may also be 'tuned'. Figure 4.6(a) and (b) show the effect of varying the polynomial order, l, in the BICGSTAB(l) solver for ||r||2 tolerances of 1x10-3 and 1x10-6, respectively. For the less converged solution (a) the optimum l value is 1. For (b) the optimum value seems to be about 3-4, though larger and smaller values make little difference to the CPU time.

Figure 4.6: Effect of polynomial order, l, in BiCGStab.

Figure 4.7: Effect of restart subspace, m, in RGMRES.

Figure 4.7 shows the effect of varying the size of the restart subspace, m, in the RGMRES solver for a tolerance level of 1x10-3. The optimum value is approximately 10 regardless of the tolerance level. The CPU time increases significantly if m is much smaller. The optimum values and behaviour are very similar to that found by Salvini and Shaw for a range of problems21.

In order to compare CPU times of various iterative linear methods it is necessary to specify equivalent iteration convergence criteria. These are defined differently in the NAG library routines as shown in Table 4.6. In order to compare like with like, the multigrid method was used to solve the test problem for a range of meshes with an ||r||2 threshold of 1x10-3 and 1x10-6. The former is suitable for most electrochemical simulations, but the latter was used to provide a comparison for extremely converged solutions. The thresholds for the other algorithms were computed from the residuals of these solutions, as shown in Table 4.7.

Table 4.6: Convergence criteria for MGD1 multigrid, SIP and the F11 NAG library subroutines
Method       Convergence criterion
Multigrid    ||r||2 below the specified threshold
SIP          normalised residual (CONRES) below the specified threshold*
PKS          residual norm below a bound determined by the tolerance t†
* This is only for CONRES (normalised residual) - CONCHN (change in normalised residual) was set to a high value so it was always met. See the NAG library documentation for more information.
† t is the tolerance value specified by the variable TOL in the NAG library.

Table 4.7: Equivalent convergence thresholds for MGD1 multigrid, SIP and the F11 NAG library subroutines
             ||r||2 = 1x10-3                            ||r||2 = 1x10-6
NGX = NGY    Multigrid    F11          SIP             Multigrid    F11           SIP
33           3.31x10-4    1.01x10-8    3.25x10-8       4.12x10-7    1.99x10-11    4.57x10-11
65           1.21x10-4    1.73x10-9    4.38x10-9       1.23x10-7    1.93x10-12    4.90x10-12
97           3.69x10-4    2.42x10-9    6.04x10-9       3.39x10-7    2.37x10-12    5.77x10-12
129          8.68x10-4    3.79x10-9    9.25x10-9       7.37x10-7    3.36x10-12    8.00x10-12
161          1.76x10-4    5.82x10-10   1.37x10-9       1.34x10-7    3.99x10-13    9.61x10-13
193          3.78x10-4    8.86x10-10   2.06x10-9       3.06x10-7    6.95x10-13    1.66x10-12
225          6.14x10-4    1.06x10-9    2.45x10-9       5.41x10-7    8.99x10-13    2.13x10-12

An approximately optimal value of 30 was used for the SIP acceleration parameter and five mesh levels were used for the multigrid solver (i.e. the number of nodes was chosen so that NGX-1 and NGY-1 were divisible by 2^5). For larger problems than those specified in Table 4.7 it would be advisable to use a higher SIP acceleration parameter22 and a greater number of grids in the multigrid scheme to ensure similar performance. Figure 4.8 shows the decadic log of the CPU time as a function of the decadic log of the number of nodes for (a) an ||r||2 tolerance of 1x10-3 and (b) an ||r||2 tolerance of 1x10-6. The optimal values of l for BICGSTAB(l) (l=1 for (a) and l=4 for (b)) and m=10 for RGMRES were used. Zero fill is contrasted with an optimal fill level of 5 and with the effect of MILU. The data used to generate Figure 4.8 at higher CPU times (10^4 nodes upwards) was linearly regressed to generate Table 4.8, which shows estimated CPU times for 10^4, 10^5 and 10^6 node problems.

Figure 4.8: Log CPU time as a function of the number of finite-difference nodes for (a) ||r||2 tolerance of 1x10-3 and (b) ||r||2 tolerance of 1x10-6.

Comparison of the CPU times required by the various algorithms shows that both SIP and the PKS methods lie between the multigrid method, in which CPU time depends on the total number of nodes (N) as O(N^1.15), and the direct LU method, in which CPU time depends on N as O(N^1.9). Comparing the three Krylov subspace methods, BiCGStab(4) is seen to be the fastest for the channel problem. In virtually all cases CGS and BiCGStab(4) perform significantly better than RGMRES. RGMRES has been found to be less efficient than CGS or BiCGStab(4) for other problems13,14,21. Fill can reduce the CPU time by a factor of approximately two. MILU has a more dramatic effect on the CPU time - the dependence on the number of nodes is reduced from O(N^1.6) to O(N^1.5) for this problem.
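These exponents can be read directly against Table 4.8; for example, for the direct LU method the ratio of the estimated CPU times for 10^6 and 10^4 nodes is consistent with the fitted O(N^1.9) dependence:

\[
\frac{t(10^{6}\ \text{nodes})}{t(10^{4}\ \text{nodes})}
\approx \left(\frac{10^{6}}{10^{4}}\right)^{1.9}
= 10^{3.8} \approx 6300,
\qquad\text{cf.}\qquad
\frac{12.3\ \text{days}}{2.6\ \text{min}} \approx 6800 .
\]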

Table 4.8: Estimated CPU times for various problem sizes
                              10^4 nodes*            10^5 nodes*             10^6 nodes†
Tolerance, ||r||2             1x10-3     1x10-6      1x10-3     1x10-6       1x10-3      1x10-6
Multigrid                     0.5 s      0.65 s      7.5 s      9.6 s        1.8 min     2.3 min
SIP                           2.75 s     4.3 s       56.0 s     1.4 min      18.9 min    26.2 min
BiCGStab(4) with MILU         5.2 s      6.9 s       2.2 min    3.5 min      55.7 min    1.7 hours
CGS with MILU                 5.2 s      5.8 s       2.2 min    3.2 min      54.3 min    1.8 hours
RGMRES(10) with MILU          7.3 s      8.5 s       3.4 min    5.4 min      1.6 hours   3.4 hours
BiCGStab(4): LFILL=5          6.5 s      10.6 s      4.1 min    5.4 min      2.6 hours   2.7 hours
CGS: LFILL=5                  7.3 s      9.9 s       6.1 min    6.9 min      4.4 hours   4.8 hours
RGMRES(10): LFILL=5           8.6 s      12.4 s      6.3 min    8.2 min      5.0 hours   5.4 hours
BiCGStab(4)                   8.5 s      12.9 s      7.1 min    8.8 min      5.9 hours   6.0 hours
CGS                           18.2 s     18.4 s      9.1 min    10.8 min     4.6 hours   6.3 hours
RGMRES(10)                    10.0 s     18.2 s      15.0 min   20.1 min     22 hours    22 hours
DIRECT LU                     2.6 min                3.6 hours               12.3 days
* Values were interpolated along the regression line
† Values were extrapolated from the regression line

Table 4.9 shows the memory requirement (in Mb) of the various methods as a function of the number of nodes. The Krylov subspace methods require an amount of storage which is linearly proportional to the number of nodes - a great improvement over the direct LU method, which has a quadratic dependence. The largest problem that could be simulated using direct LU factorisation was a 72x72 node mesh using a Silicon Graphics Indigo2 (tm) with 192Mb of RAM. Simulating a mesh of several hundred nodes in either direction presented no problem with the Krylov subspace methods, even when some fill was used. MGD1 uses approximately half, and SIP one-third, of the memory of the PKS methods. The very large (up to 10^6 node) problems which have been simulated by SIP23 and MGD124 would require approximately 3 or 30 times the CPU time, respectively, using the PKS methods.

Table 4.9: Memory requirement for a medium (10^4 node) and very large (10^6 node) problem
Method          10^4 nodes    10^6 nodes
MGD1            1.11 Mb       109 Mb
SIP             0.76 Mb       76.3 Mb
BiCGStab(4)     2.89 Mb       289 Mb
CGS             2.44 Mb       244 Mb
RGMRES(10)      2.90 Mb       290 Mb
DIRECT LU       1526 Mb       1.5x10^7 Mb

4.2.3 Conclusions

The preconditioned Krylov subspace methods available in the NAG library have been shown to be suitable for the solution of the large sparse non-symmetric matrices arising from application of the finite difference method to general electrochemical problems. Of the three solvers available in the NAG library, either BiCGStab or CGS is recommended, depending on the desired tolerance, since RGMRES seems to be less efficient. However in terms of robustness, RGMRES > BiCGStab > CGS3,19, so the former may prove useful for difficult problems (a larger value of m provides greater robustness). CGS has the advantage of having no adjustable parameters to 'tune', although BiCGStab is more efficient when optimally tuned and converges more smoothly. The recommended values are l=4 for BICGSTAB(l) and m=10 for RGMRES. A small amount of fill (LFILL=2-5) will generally accelerate solution (except with MILU). It should be noted that non-optimal values of adjustable parameters are far less detrimental than with large SIP simulations or with Successive Over-Relaxation, where the value of the acceleration parameter is critical.

The memory and CPU time dependencies on the number of nodes (N) for the direct LU decomposition, the preconditioned Krylov subspace solvers, SOR, SIP and MGD1 methods are summarised in Table 4.10. If the problem allows MILU to be used (as in the case of the steady-state Cartesian simulation of the channel microband electrode) then the PKS methods compare well with finite difference solvers such as SOR and SIP, though the high efficiency of the multigrid method should be noted. For ill-conditioned problems where MILU cannot be used, it is likely that finite difference solvers such as SIP and MGD1 would diverge.

The stencil-based finite difference solvers are unsuitable for simulating more complex mechanisms where the material balance equations are coupled - here matrix methods are necessary. The F11 solvers compare very favourably to the direct LU factorisation of the matrix proposed by Britz or even state-of-the-art direct solvers as shown by Salvini and Shaw21.

Table 4.10: CPU time and memory requirements as a function of the number of nodes for various methods
                                             CPU time                                  Memory
Solver method                                Poisson's equation*    Channel flow       requirement
Direct LU                                    O(N^2)                 O(N^1.9)           O(N^2)
SOR (optimal ω)                              O(N^3/2)               not investigated   O(N)
Preconditioned Krylov subspace:
  CGS or BiCGStab(4)                         O(N^3/2)               O(N^1.6)           O(N)
Preconditioned Krylov subspace with MILU     O(N^5/4)               O(N^1.5)           O(N)
SIP                                          not investigated       O(N^1.3)           O(N)
Multigrid                                    O(N)                   O(N^1.15)          O(N)
* The data for Poisson's equation was kindly provided by Gareth Shaw at NAG.

4.3 The best of both worlds?

From the previous two sections it should be clear that one is faced with a dichotomy in terms of the 'best' solver. The MGD1 multigrid method gives superior performance in terms of CPU time, but is sensitive to ill-conditioned problems and restricted in its stencil. Conversely preconditioned Krylov subspace methods are flexible and robust, but less computationally efficient. An ideal solver would combine the favourable properties from both methods.

It is possible to use a multigrid preconditioner25,26 with the Krylov subspace methods, which would accelerate their convergence. The difficulty is that this must be robust and capable of accommodating arbitrary sparsity patterns. It is possible, however, to tackle the problem from the other end and construct a multigrid method using a preconditioned Krylov subspace solver as the smoother27. This is particularly attractive as the flexibility and robustness of the PKS methods are retained, yet the mesh-independent convergence of a multigrid method is achieved. Such a multigrid algorithm was therefore developed and programmed, as described below.


Figure 4.9: Cycling strategies for visiting coarse grids. Each dot represents smoothing (or solution in the case of the coarsest grid).


Figure 4.10: CPU time vs mesh density for a WJE and ChE simulation using a PKS method [BICGSTAB(4)] or a PKS-smoothed multigrid method. Parameters for the ChE are: Xe=0.3mm, 2h=40mm, d=6mm, w=4mm, Vf=0.01 cm3 s-1, f=0.1, lu=ld=1. Parameters for the WJE are: re=2mm, rjet=0.345mm, kc=0.86, ν=0.0089 cm2 s-1, hmax=1, convection upwind in R and central in Z. For both geometries, D=2x10-5 cm2 s-1, [A]bulk=1x10-6 mol cm-3.

The preconditioner is generated on each of the meshes before beginning the multigrid cycles, using the NAG FORTRAN library routine F11DAF. The smoother and coarse-grid solver call the NAG FORTRAN library routine F11DCF. Trial simulations were conducted on both wall-jet and channel microband electrodes (for a steady-state transport-limited electrolysis). In both cases, a V-cycle with one PKS iteration for pre- and post-smoothing was found to be optimal, compared with other smoothing/cycling strategies.

Figure 4.10 shows a comparison of the CPU times for the simulation of an E process at a wall-jet (WJE) and channel electrode (ChE). For a 1025x1025 node simulation, the CPU time is cut by a factor of 10 by using the multigrid method.


Figure 4.11: Number of iterations as a function of mesh density.

The reason for this is evident when one considers the number of iterations required as a function of the number of nodes (shown in Figure 4.11 for the WJE - the ChE is not shown but is essentially the same). For the PKS methods the number of iterations increases linearly with the number of nodes in one co-ordinate, O(NGX). For the multigrid method, the number of V-cycles increases as O(log NGX)28. However this is not a 'fair' comparison since, in each V-cycle, both pre- and post-smoothing occur on each mesh, so the amount of smoothing work done in each V-cycle is given by a geometric series depending on the number of mesh levels:

2(1 + 1/4 + 1/16 + 1/64 + ...) < 8/3    (4.21)

The amount of smoothing work is also plotted in Figure 4.11, showing that at low NGX there is little improvement from using the multigrid method. As NGX increases, so does the benefit of using the multigrid method - the computational work of the PKS method is O(N^3/2) compared with O(N log(N^1/2)) for the multigrid method.
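The bound in equation (4.21) follows from summing the geometric series over the mesh levels: each coarser two-dimensional mesh carries one quarter of the nodes of the one above it, and each level is visited twice per V-cycle (pre- and post-smoothing), so

\[
W_{\text{V-cycle}}
  \;=\; 2\sum_{k=0}^{L-1}\left(\frac{1}{4}\right)^{k}
  \;<\; 2\sum_{k=0}^{\infty}\left(\frac{1}{4}\right)^{k}
  \;=\; \frac{2}{1-\tfrac{1}{4}}
  \;=\; \frac{8}{3}
\]

where L is the number of mesh levels and the work is measured in units of one smoothing sweep of the finest mesh.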

4.4 Conformal mappings and multigrid methods


Figure 4.12: 2-grid V-cycle used to investigate the convergence failure of the Amatore-microdisc problem.

A number of simulations in the following chapters using expanding grids or conformal mappings were attempted using the NAG library MGD1 multigrid method, which failed to converge. This was initially attributed to the off-diagonal dominance that may occur in the linear system as a result of such transformations, for example Verbrugge's microdisc transformation (see Chapter 9). However, if diagonal dominance is restored using an upwinding scheme, or other transformations are used which do not result in an off-diagonally dominant coefficient matrix (such as the microdisc transformation of Amatore and Fosset29), MGD1 still fails to converge unless the known analytical solution is given as a starting approximation (again see Chapter 9). Furthermore, the more 'robust' multigrid method based on PKS smoothers and coarse-grid solvers failed to converge with Amatore's transformation, even though the PKS methods alone can solve the system without any trouble. To investigate the cause of this instability, a PKS-smoothed 2-grid simulation of a simple transport-limited electrolysis at a 1 µm disc in stationary solution was conducted using Amatore's conformal mapping. All the details of the transformation and finite difference discretisation are exactly the same as for a straightforward PKS simulation, described in detail in Chapter 9. 103x103 nodes were chosen as the fine-mesh density, resulting in only 2 MG levels and allowing clear visualisation of the solution, errors and residuals (via callable IDL(r)30) as the iterations progressed.

This particular problem was chosen because, in addition to its failure to converge, the solution is known analytically - the concentration profile is a simple ramp function:

a(θ,Γ) = Γ    (4.22)

Figure 4.13: Errors predicted by coarse-grid correction (blue) plotted with the true errors (red).

This means that in each iteration, the error can be calculated and this can be compared with the error 'predicted' by the coarse-grid correction. Examination of the predicted vs. true errors, shown in Figure 4.13, in the first iteration reveals that it is indeed the coarse-grid correction which is at fault. It over-predicts the error and thus the solution is over-corrected resulting in an oscillation in the errors between iterations.

Why is this? The coarse-grid solve, error prolongation and calculation of residuals were each tested and found to be working satisfactorily. This only leaves the restriction of the residuals, shown in Figure 4.15. The residual near the origin is very large and oscillatory, since:

r = Me    (4.23)

the very small distances between nodes in the conformal mapping near the origin result in very large finite difference coefficients. Therefore the accuracy of the restriction near the origin is very poor leading to the inaccurate prediction of the errors.


Figure 4.14: A comparison of coarse grid residuals by restriction of the fine-grid residuals (blue) with coarse grid residuals synthesised by restriction of the error (red).


Figure 4.15: Residuals on the fine mesh (red), together with those predicted by linear restriction (blue).

To confirm that this is the case, coarse-grid errors were synthesised by restriction of the true errors, and from these the coarse-grid residuals were calculated via r = Me. These are compared with those obtained by restriction of the fine-grid residuals in Figure 4.14. It should be noted that the synthesised residuals are not 'perfect', since the restriction of the errors is only first-order (this explains the discrepancy along the electrode, where the error gradient is changing most rapidly). However the values near the origin are wildly different. If the synthesised residuals are used in each iteration, the multigrid method converges in 8 iterations.

It appears that multigrid methods, which rely on the physical nature of the finite difference representation, may be unsuitable for simulations in some conformal spaces where the residual is no longer smooth. It may be possible to use a locally improved restriction operator for such systems, to overcome this problem31.

4.5 A frontal solver based on the BI method

The space-marching or 'frontal' application of the BI method (section 2.3) is very efficient for convective-diffusion simulations where axial or radial diffusion is negligible. The reason for this efficiency is twofold: the system to be solved at each point along the electrode is only tridiagonal, and each of these systems needs to be solved only once.

The simulation merely consists of applying the Thomas Algorithm NGX times, copying the result into the RHS each time, so the overall algorithm is O(NGY.NGX.NS); this loop is sketched below. Unfortunately there is a major drawback when simulating kinetics, since there is nowhere to 'put' coupling kinetic terms. Often it is necessary to resort to a 'lagging' explicit approximation of the kinetic term, which breaks down at high rate constants.
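The sketch below shows the shape of this frontal loop for a single species: one tridiagonal solve per slice in x, with each solution becoming the right-hand side for the next slice. The coefficient lambda and the Dirichlet boundary treatment are placeholders for the purposes of illustration; the real BI coefficients depend on the local flow velocity and the discretisation described in Chapter 2.

#include <stdio.h>

#define NJ  100                    /* nodes normal to the electrode */
#define NGX 50                     /* slices along the electrode    */

/* Thomas algorithm for a[j]u[j-1] + b[j]u[j] + c[j]u[j+1] = d[j], j = 0..n-1
   (with a[0] = c[n-1] = 0). */
static void thomas(const double *a, const double *b, const double *c,
                   const double *d, double *u, int n)
{
    double cp[NJ], dp[NJ];
    cp[0] = c[0] / b[0];
    dp[0] = d[0] / b[0];
    for (int j = 1; j < n; j++) {
        double m = b[j] - a[j] * cp[j-1];
        cp[j] = c[j] / m;
        dp[j] = (d[j] - a[j] * dp[j-1]) / m;
    }
    u[n-1] = dp[n-1];
    for (int j = n - 2; j >= 0; j--) u[j] = dp[j] - cp[j] * u[j+1];
}

int main(void)
{
    double a[NJ], b[NJ], c[NJ], d[NJ], u[NJ], u_prev[NJ];
    double lambda = 0.5;                            /* placeholder dimensionless group */

    for (int j = 0; j < NJ; j++) { a[j] = -lambda; b[j] = 1.0 + 2.0*lambda; c[j] = -lambda; }
    a[0] = c[0] = 0.0;         b[0] = 1.0;          /* j = 0: electrode surface, u = 0 */
    a[NJ-1] = c[NJ-1] = 0.0;   b[NJ-1] = 1.0;       /* j = NJ-1: bulk solution, u = 1  */

    for (int j = 0; j < NJ; j++) u_prev[j] = 1.0;   /* upstream slice at bulk value    */

    for (int k = 0; k < NGX; k++) {                 /* frontal march along x           */
        for (int j = 0; j < NJ; j++) d[j] = u_prev[j];
        d[0] = 0.0;  d[NJ-1] = 1.0;                 /* Dirichlet boundary values       */
        thomas(a, b, c, d, u, NJ);
        for (int j = 0; j < NJ; j++) u_prev[j] = u[j];   /* becomes RHS for next slice */
    }
    printf("concentration one node above the electrode at the last slice: %g\n", u[1]);
    return 0;
}

The FIFD and PKS variants discussed next replace the call to the tridiagonal solver with a block or iterative solve over all NS species at each slice.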

One can avoid making such an approximation by replacing the Thomas Algorithm with a solver which can cope with coupled chemical reactions. This is essentially what Rudolph has done in FIFD (see section 2.6.1.3) - the Thomas Algorithm is replaced by a block version with an inner Gaussian elimination. The Gaussian elimination step is8 O(N^2[1+(N-1)/3]), where N in this case is the number of chemical species, hence the method is O(NGX.NGY.NS^2[1+(NS-1)/3]) overall. To the author's knowledge, there have been no applications, to date, of space-marching FIFD simulations.

Here an alternative approach is presented, using a PKS method to solve the coupled system of j-vectors. Although PKS methods are not O(N), the small size of the linear system means that it may be solved using only a couple of PKS iterations. Thus the overall method is O(NGX.(NGY.NS)^a) where a is close to 1 unless a very large number of nodes and species are used. A comparison with BI is given in Figure 4.16, for the simulation of a DISP1 reaction at a wall-jet electrode.


Figure 4.16: A comparison of CPU times (on a SGI Indigo2 with 192Mb RAM and 150MHz R4400 processor) for simulating a DISP1 reaction at a Wall-Jet electrode. The mesh size was 500x500 nodes.

The BI simulation contained a check to identify when the simulation produced negative concentrations due to breakdown of the 'lagging' kinetic term. If this was the case, NGX was incremented by 10% and the simulation was restarted. Notice that for low rate constants the BI method is around 60 times more efficient than the PKS simulation (using this implementation). One would expect this since more 'work' must be done using the PKS method - this was found to be distributed approximately into: 25% sorting the coefficient matrix; 50% generating the preconditioner; 25% solving the system. The average number of PKS iterations increases from about 2, where the kinetic terms are small, up to about 4, when the kinetic terms dominate. At very high rate constants the average number of iterations per j-vector starts to fall again, as the kinetically unstable species only exists in appreciable concentration for the first few j-vectors. Savings could be made by generating the coefficient matrix ready-sorted and possibly using a simpler preconditioner.

4.6 Parallel transient BI

To date, little work has been done adapting electrochemical simulations for parallel computing. This may partly be due to the selection of problems requiring minimal computing time (e.g. mass transport in one spatial dimension) and the fact that the Thomas Algorithm, which forms the inner loop of many electrochemical simulations, is not amenable to fine-grain parallelisation (or vectorisation) because it requires each element of a vector to be found from the last in a sequential loop. One possibility would be to replace the Thomas Algorithm with cyclic reduction32-34 or another parallelisable algorithm for the solution of a tridiagonal linear system35. This is especially recommended on vector computers - cyclic reduction is reported as being up to 10 times faster than Gaussian elimination on the Cray C9036.

"Embarrassingly" (high-level) parallel electrochemical problems include the generation of working curves and especially working surfaces. Here the same simulation code must be re-run with many sets of parameters (a SIMD problem). In the case of working curves, it is often beneficial to feed back the solution at the previous rate constant to use as the starting approximation for the next. If the loop is to be parallelised then this can no longer be done.

What about the simulation algorithms themselves? The ADI method is a good candidate for medium-grain parallelisation37, since each 'line' may be solved independently within each half-step (all need to be completed before the next half-step, so this forces a synchronisation point). A lot of work has been done on parallelisation of Krylov subspace methods and preconditioners3. This includes exploiting (fine-grain) parallelism in the underlying operations e.g. matrix-vector products as well as designing new algorithms based on reordering or partitioning the equation system.

We have seen that the BI frontal method (or its PKS/FIFD analogue) is an efficient algorithm for the simulation of hydrodynamic electrodes at high Peclet numbers. In the outer loop of the BI method each k vector is calculated from the last, hence it is not suited to medium-grain parallelisation since the outer loops must also be completed in order. However, the transient BI method, described in section 2.3, does lend itself to medium-grain parallelisation since each k vector depends only on the one before it in space and the one before it in time. It is therefore possible to work on different time steps on different processors: once the concentration at (k=1, t=1) has been calculated, both (k=2, t=1) and (k=1, t=2) may be calculated. Thus an increasing number of (k,t) combinations may be solved simultaneously as k and t increase. In reality, the number of available processors is usually much smaller than the maximum number of (k,t) combinations that may be solved simultaneously. Figure 4.17 gives a scaled-down example. For the first step, only 1 processor can be used; 2 for the second etc. Once step n has been reached, all the processors may be used successfully until just before the end where the process happens in reverse.


Figure 4.17: A scaled-down problem showing application of the PTBI algorithm. Each dotted box represents an NJ vector which must be solved using the Thomas Algorithm. The dark boxes represent the 3 processors, showing how they move across the problem in L,t space.

The simulation runs for a total of NL cycles, within which a parallel loop runs from 1 to n (except at the ends where this is truncated). After each parallel loop the processors must be synchronised so that all the concentration values required for the next L-cycle have been calculated. Since each processor has to do exactly the same number of operations in the parallel loop, the synchronisation overhead is minimal on an unloaded machine. If, however, one processor is effectively slower because it is being shared between processes, the others must all wait while it catches up.

The total number of cycles, NL, may be calculated from:

(4.24)

(where ceil(x) rounds x up to the nearest integer). The block numbers may be calculated from:

(4.25)

(where floor(x) rounds x down to the nearest integer). For a given L and t, the corresponding k value is given by:

(4.26)

The algorithm may be written in SGI multiprocessor C as:

for(L=1; L<=NL; L++)
{
  /* Compute block number */
  BL = L/NK; /* Integer division = effective floor */

  /* slide down at end of block */
  if((L>=(BL*NK)+1) && (L<=(BL*NK)+n))
  {
     min++; max++;
  }

  /* Truncate ends */
  if(min<1) tmin=1; else tmin=min;
  if(max>NT) tmax=NT; else tmax=max;

  /* Time loop (parallel across n processors) */
  #pragma parallel 
  #pragma pfor
  for(t=tmin; t<=tmax; t++)
  {
    BT=t/n; /* Integer division = effective floor */
    k=1+L-t+BT*(n-NK);
    solve(); /* Solve J vector */
  }

  /* Sync is implicit in ending the parallel region */
}

4.6.1 Theoretical efficiency of processor usage

In Figure 4.17 the three processor 'window' is shown moving over the simulation mapped out in L and t. This may be redrawn (as in Figure 4.18) to show the activity of each processor in each L cycle.


Figure 4.18: Processor usage in the problem defined in Figure 4.17.

There are three regions where processor inactivity occurs:

  1. Starting: CIS = Σ_{i=1}^{n-1} i = n(n-1)/2 (or 0 if n = 1)
  2. Finishing: CIF = Σ_{i=1}^{n-1-m} i = (n-1-m)(n-m)/2 (or 0 if n-m = 1)
  3. NT not exactly divisible by n: CIND = mNK - Σ_{i=1}^{m} i = mNK - m(m+1)/2

where:

m = (n - (NT mod n)) mod n    (4.27)

The fraction of time the processors are, on average, active is given by:

(4.28)

and the theoretical efficiency, assuming that all the processors run at the same speed, is given by:

(4.29)

4.6.2 Storage

For processors 1 to n-1, only a vector of NJ elements needs to be stored. In the next L cycle, these are the values at the previous time for processors 2 to n, and the values at the previous space for processors 1 to n-1. So what about processor n? Its values are written into an (NJ, NK) matrix, with the NJ vector written at the current k, forming the previous-space values in the next cycle as before. An NJ vector is also read out (at a lower k value) by processor 1 as the values at the previous time. The storage scheme is illustrated in Figure 4.19.


Figure 4.19: Storage scheme used in PTBI. The array of n-1 vectors which lies in the centre is stored into and retrieved from at every step. Storage into the matrix is represented along the top and retrieval below. The element retrieved is NK-n+1 behind the one that is stored.

4.6.3 Results


Figure 4.20: Measured speedup as a function of the number of CPUs.

CPU times were measured on a SGI Origin 2000(tm) machine (with 2.5 Gb RAM and ten 195MHz R10000 processors) with a negligible load. Figure 4.20 shows the speedup as a function of the number of processors. Although the graph is linear, the gradient is ~0.36 rather than close to 1, as we might expect from the theoretical efficiencies shown in Table 4.11. There must be some overhead in the parallel code that is impeding its performance - the most likely cause is memory contention.

Table 4.11: Theoretical efficiency as a function of the number of CPUs
n    n.NL         Efficiency
1    1 000 000    100.0%
2    1 000 002    100.0%
3    1 002 000    99.8%
4    1 000 012    100.0%
5    1 000 020    100.0%
6    1 002 018    99.8%

An explanation may be sought by considering the machine architecture. The SGI Origin 2000(tm) is a distributed memory machine. Each R10000 processor has a local 'bank' of memory and a 4Mb write-back cache. Other processors take longer to get to the memory bank of a particular processor, but in this case the cache is likely to be the important factor. For the example simulation, it was found that the parallel code runs faster on one processor than the original BI simulation it was adapted from. This is surprising since more work has to be done in the parallel version. The reason may be that the small array, holding the vectors computed on virtual processors 1 to n-1, can fit into the cache whereas the full NJ x NK array cannot.

The code for the parallel loop is shown below. The pointer assignments that attach neighbouring logical processors to the shared PU vectors are the likely source of contention, since the PU[p] vector is overlapped between two processors:

  #pragma parallel
  #pragma shared(u,PU,alpha,beta,LAMj,LAMy,l,accum) 
  #pragma local(k,t,be,p,Uoldt,Uoldk,Udest,j,d,f)
  #pragma pfor iterate(t=tmin;tmax;1)
  for (t=tmin; t<=tmax; t++)
  {
   be=floor((t-1)/(float)n);
   k = (l+1) - (t + be*(NK-n));
   p = t%n;  /* logical processor # */ 
   /* Storage: attach pointers */
   if(p == 1)           {Uoldt=u[k-1];  Uoldk=PU[p-1]; Udest[p]=PU[p-1];}
   if(p > 1)            {Uoldt=PU[p-2]; Uoldk=PU[p-1]; Udest[p]=PU[p-1];}
   if(p == 0)           {Uoldt=PU[n-2]; Uoldk=(k==1 ? 0 : u[k-2]); Udest[p]=u[k-1];}
   /* Thomas Algorithm: right-hand side, forward sweep, back substitution */
   if(k==1) for(j=0; j<NJ-1; j++) d[j]=Uoldt[j]+LAMj[j]*Uoldk[j];
   f[0] = d[0]*alpha[0];
   for(j=1; j<NJ-1; j++)  f[j]=(d[j]+f[j-1])*alpha[j];                 /* forward sweep     */
   for(j=NJ-2; j>=0; j--) Udest[p][j]=f[j] - beta[j]*Udest[p][j+1];    /* back substitution;
                             Udest[p][NJ-1] holds the bulk boundary value set at initialisation */
   accum[t-1]+=Udest[p][0];        /* Accumulator for current */
  } /* End of parallel loop*/

In a write-back cache model, each processor gets its own copy of the NJ x (n-1) array in a 'frame'. As soon as one processor begins to change the data in that frame of memory, it is locked so others cannot get hold of it. When it has finished, the next processor is able to 'grab' the frame from memory and change it, so the processing of this small memory block is essentially serial, with each processor waiting its turn for the memory. The SGI implementation is more sophisticated than this, as one would expect from the observed speedup in parallel, since memory is mapped into the cache in 512 byte 'lines' which correspond to 64 (double precision) array values. However it is likely that memory-cache synchronisation is still responsible for the overhead, on the smaller scale of cache-lines.

The overall result is that the algorithm works, but the efficiency is typically only 40-50% on the SGI Origin(tm). It should be possible, either by further optimising the code for the SGI cache architecture, or using a different machine architecture (suited to finer-grain parallelisation), to attain nearer the theoretical efficiency on unloaded processors.

References

1 B. Speiser, Electroanalytical Chemistry, ed. A.J. Bard, Vol. 19, Marcel Dekker, New York, (1997).
2 W.H. Press, W.T. Vetterling, S.A. Teukolsky, B.P. Flannery, Numerical Recipes in C, Cambridge University Press, Cambridge, (1992).
3 Y. Saad, Iterative Methods for Sparse Linear Systems, PWS Publishing Co., Boston, MA, (1996).
4 A. Brandt, Mathematics of Computation, 31, (1977), 333.
5 B. Steffen, L. Rouser, Electrochimica Acta., 40, (1995), 379.
6 W.L. Briggs, A Multigrid Tutorial, SIAM, Lancaster Press, (1987), p69-70.
7 NAG FORTRAN Library, Numerical Algorithms Group, Oxford, http://www.nag.co.uk/numeric/FLOLCH.html. NAG have an information desk which may be contacted by email: infodesk@nag.co.uk.
8 H.R. Schwartz, Numerical Analysis: A Comprehensive Introduction, Wiley, (1989), p10.
9 J.A. Alden, R.G. Compton, J. Electroanal. Chem., 415, (1996), 1.
10 H.G. Yu, S.C. Smith, J. Chem. Soc. Faraday Trans., 93, (1997), 861.
11 U. Peskin, W.H. Miller, A. Edlund, J. Chem. Phys., 103, (1995), 10030.
12 G.H. Tao, R.E. Wyatt, Chem. Phys. Lett., 239, (1995), 207.
13 C.C. Tsai, T.J. Liu, J. Non-Newtonian Fluid Mech., 60, (1995), 155.
14 D.Y. Yang, G.S. Chen, H.P. Chou, Annals of Nuclear Energy, 20, (1993), 9.
15 http://www.nobjects.com/Diffpack/ (see also http://www.netlib.org/diffpack/index.html).
16 J. Meijerink, H. Van der Vorst, J. Comput. Phys., 44, (1981), 134.
17 E. Meese, Copper Mountain Conference of Iterative Methods, Proceedings Vol 1, (1996), University of Colorado. Also see http://www.termo.unit.no/mtf/people/eammtf/cm_abstr_eng.html.
18 D. Kincaid, W. Cheney, Numerical Analysis, Brooks/Cole, California, (1990), p184.
19 H. Van der Vorst, Lecture Notes on Iterative Methods, http://www.math.ruu.nl/people/vorst/cgnotes/cgnotes.html
20 R. Barrett, M. Berry, T.F. Chan, J. Demmel, J.M. Donato, J. Dongarra, V. Eijkhout, R. Pozo, C. Romine, H. Van der Vorst, Templates for the Solution of Linear Systems: Building Blocks for Iterative Methods, SIAM, Philadelphia, PA, (1994). http://www.netlib.org/templates/Templates.html.
21 S. Salvini, G. Shaw, NAG Technical Report TR2/96.
22 J. Alden, Part II Thesis, Oxford University, (1995).
23 R.G. Compton, R.A.W. Dryfe, R.G. Wellington, J. Hirst, J. Electroanal. Chem., 383, (1995), 13.
24 J.A. Alden, R.G. Compton, J. Phys. Chem. B, 101, (1997), 9741.
25 O. Axelsson, P.S. Vassilevski, Numer. Math. 56, (1989), 157.
26 O. Axelsson, P.S. Vassilevski, SIAM J. Numer. Anal. 57, (1990), 1569.
27 This was suggested by Gareth Shaw at NAG in a personal communication. It is intended that the multigrid/PKS algorithm developed in this thesis may lead to a general NAG library 'black-box' solver, which would then be readily available to the electrochemical community.
28 W.L. Briggs, A Multigrid Tutorial, SIAM, Lancaster Press, (1987), p56.
29 C.A. Amatore, B. Fosset, J. Electroanal. Chem., 328, (1992), 21.
30 Interactive Data Language, Floating point systems; Berks, UK. http://www.floating.co.uk/idl/index.html. Callable IDL allows the IDL libraries to be linked to a user-written C or FORTRAN program, arrays to be passed to IDL and commands 'sent' to IDL to operate on and display them. This is a low-effort means of providing real time visualisation as simulations run. Since the simulation code was written in C++, an ANSI C module of wrapper functions was used to prevent name mangling of the functions in the IDL libraries.
31 G. Shaw, Numerical Algorithms Group, personal communication. e-mail gareths@nag.co.uk.
32 R. Hockney, J. ACM, 12, (1965), 95; Meth. Comput. Phys., 9, (1970), 135.
33 R. Hockney, C. Jesshope, Parallel Computers: Architecture, Programming and Algorithms, Adam Hilger Ltd., Bristol, (1981).
34 L. Johnsson, SIAM J. Numer. Anal., 20, (1987), 354.
35 J.M. Ortega, Introduction to parallel and vector solution of linear systems, Plenum, New York, (1988), p130 (References #3).
36 http://laguerre.psc.edu/general/software/packages/cycle/comparison.html.
37 J.M. Ortega, Introduction to parallel and vector solution of linear systems, Plenum, New York, (1988), p149.