1 Introduction
Multigrid/multilevel methods, including geometric multigrid (GMG) and algebraic multigrid (AMG), are among the most popular approaches for solving the large sparse linear systems of equations, $Ax = b$, that arise from the discretization of partial differential equations (Smith et al. (2004); Stuben (1999)). These methods generate a sequence of linear systems of decreasing size during the setup phase, and iteratively improve the solution of the original system using this sequence of coarse systems in the solve phase. One critical component of multigrid/multilevel methods is the formation of a coarse operator through a sparse matrix triple product $C = R A P$ of the restriction $R$, the fine operator $A$, and the interpolation $P$. Here $R$ is the transpose of $P$ ($R = P^T$) when the Galerkin method is used, and $P$ can be generated either geometrically (Kong and Cai (2016, 2017, 2018); Kong et al. (2019a)) or algebraically (Kong et al. (2018b, 2019b)). It is challenging to design an efficient parallel algorithm, and to develop the corresponding software, for the sparse matrix triple product, since both memory efficiency and compute-time efficiency must be considered. Besides multigrid/multilevel methods, the sparse matrix triple product is also used in other areas such as Schur complement algorithms, so developing an efficient triple product algorithm has become an active research topic. Let us briefly review a few of these developments; interested readers are referred to (Ballard et al. (2016b); Buluç and Gilbert (2012); McCourt et al. (2013)) for more complete literature reviews. Many previous works have considered parallel algorithms for sparse matrix-matrix multiplications, both for general-purpose use (Akbudak and Aykanat (2014)) and for particular applications (Challacombe (2000)). In (Ballard et al. (2016b)), the authors propose and analyze a sparse version of the SUMMA algorithm that was originally designed for dense matrices. (Borštnik et al. (2014)) uses a similar idea to convert a dense matrix algorithm (Cannon's algorithm) into a parallel sparse version for quantum-chemical applications, with special optimizations and tuning. In the context of multigrid/multilevel methods, (Tuminaro and Tong (2000)) is concerned with smoothed aggregation algebraic multigrid on distributed-memory supercomputers, where a row-wise algorithm is used for the sparse matrix-matrix multiplications of the two-step matrix triple product. (McCourt et al. (2013)) proposes a matrix coloring technique that casts a sparse matrix-matrix multiplication as a sparse-matrix dense-matrix operation, and the method is used in multigrid/multilevel methods. (Ballard et al. (2016a)) studies a hypergraph model that represents the communication costs of parallel algorithms for general sparse matrix-matrix multiplications, where a Galerkin triple product is employed as a case study; the authors conclude that the row-wise algorithm is communication efficient for the first matrix-matrix multiplication, but inefficient for the second one.
In most of the literature, the sparse matrix triple product $C = P^T A P$ is formed using two separate sparse matrix-matrix multiplications, $AP = A \times P$ and $C = P^T \times (AP)$ (Ballard et al. (2016a, b); Balay et al. (2019)). This two-step method works well for many applications, but it is difficult to use for certain memory-intensive applications such as neutron transport problems (Kong et al. (2018b, 2019b)), since the two-step method needs to create auxiliary matrices, causing a high memory overhead, in order to carry out the calculations efficiently. To overcome this difficulty, we develop and study two all-at-once algorithms (including a merged version) that form the coarse operator with one pass through $A$ and $P$, without creating any auxiliary matrices. Here a row-wise algorithm is employed for the first matrix-matrix multiplication, and an outer product is adopted for the second matrix-matrix multiplication. Note that in the new all-at-once approaches the two matrix-matrix multiplications are carried out simultaneously, which is discussed in more detail in Section 3. Compared with the traditional two-step method, the new all-at-once algorithms and their implementations reduce the memory usage by a factor of about eight for a model problem and a factor of about two for a realistic neutron transport application, while maintaining good computational efficiency. We numerically show that the proposed algorithms work efficiently for both the model problem and the realistic problem, with up to 27 billion unknowns, on supercomputers using up to 32,768 processor cores.
The rest of this paper is organized as follows. In Section 2, the row-wise matrix-matrix multiplication that serves as part of the calculations in a triple product is described in detail. Two new all-at-once algorithms, based on hash tables, for the sparse matrix triple products in multigrid/multilevel methods are described in Section 3, and numerical results for a model problem and a realistic neutron transport problem on up to 32,768 processor cores are given and discussed in Section 4. The results are summarized and conclusions are drawn in Section 5.
2 Sparse matrix-matrix multiplications
There are different types of algorithms for sparse matrix-matrix multiplications. The algorithms are generally classified into 1D, 2D, and 3D according to the partitioning schemes used to assign the computations of the output matrix to the processor cores. Based on the current infrastructure in PETSc, we consider only 1D algorithms in this work. In the 1D approaches, each processor core owns either a chunk of the rows of the matrix, using a row partition, or a chunk of the columns, using a column partition. The 1D matrix-matrix multiplication algorithms are further divided into row-wise, column-wise, and outer-product algorithms. A matrix in PETSc is distributed row-wise, so a row-wise algorithm is chosen. Next we describe the row-wise algorithm that is employed in the first matrix-matrix multiplication of our new all-at-once approaches.
Let $A \in \mathbb{R}^{n\times n}$ and $P \in \mathbb{R}^{n\times m}$, where $n$ is associated with the number of unknowns on the fine mesh, and $m$ is determined by the coarse mesh. A matrix-matrix multiplication operation $AP = A \times P$ is defined such that the $j$th column of the $i$th row of the output matrix, denoted as $(AP)_{ij}$, is calculated as $(AP)_{ij} = \sum_{k=0}^{n-1} A_{ik} P_{kj}$.
The atomic task in the row-wise algorithm is the calculation of a row of $AP$, that is, $(AP)_{i*} = \sum_{k=0}^{n-1} A_{ik} P_{k*}$.
Here the $i$th row of $AP$, $(AP)_{i*}$, is a linear combination of the rows of $P$, some of which are not local, corresponding to the nonzero columns of the $i$th row of $A$. In the row-wise algorithm, the matrix $A$ is distributed to $p$ processor cores in block rows, where each processor core owns $n/p$ consecutive rows, as shown in Eq. (1). Here $p$ is the number of processor cores. For simplicity of discussion, we assume $n$ is divisible by $p$; in reality $n$ may not be divisible by $p$, in which case some processor cores own a few more rows than others. Let $A_i$ denote the rows of $A$ owned by the $i$th processor core. For simplicity of description, we also define the $j$th block column of $A_i$ as $A_{ij}$, as shown in Eq. (1).
$A = \begin{bmatrix} A_0 \\ A_1 \\ \vdots \\ A_{p-1} \end{bmatrix} = \begin{bmatrix} A_{00} & A_{01} & \cdots & A_{0,p-1} \\ A_{10} & A_{11} & \cdots & A_{1,p-1} \\ \vdots & \vdots & & \vdots \\ A_{p-1,0} & A_{p-1,1} & \cdots & A_{p-1,p-1} \end{bmatrix} \quad (1)$
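In practice the block rows are assigned as one contiguous range per core. The small sketch below (our own helper, not a PETSc routine) shows one common way to compute that range, including the case mentioned above where $n$ is not divisible by $p$ and the first $n \bmod p$ cores receive one extra row:

```c
#include <stdio.h>

/* One common way to assign contiguous block rows to p processor cores:
 * every core gets n/p rows, and the first n%p cores get one extra row,
 * so the distribution stays balanced when n is not divisible by p. */
static void local_row_range(int n, int p, int rank, int *rstart, int *rend)
{
    int base  = n / p;                     /* rows every core gets        */
    int rem   = n % p;                     /* cores with one extra row    */
    int extra = rank < rem ? rank : rem;   /* extras owned by lower ranks */

    *rstart = rank * base + extra;
    *rend   = *rstart + base + (rank < rem ? 1 : 0);
}

int main(void)
{
    int rstart, rend;
    for (int rank = 0; rank < 3; rank++) {  /* e.g., n = 10 rows over p = 3 cores */
        local_row_range(10, 3, rank, &rstart, &rend);
        printf("rank %d owns rows [%d, %d)\n", rank, rstart, rend);
    }
    return 0;
}
```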
Note that $A_{ii}$ is owned by the $i$th processor core, and $A_{ij}$ represents the coupling with the $j$th processor core. If $A_{ij} = 0$, there is no communication from processor $j$ to processor $i$. A good partition of $A$ should make most of the submatrices $A_{ij}$ ($i \neq j$) zero in order to minimize the communication. We do not partition $A$ directly; instead, the partition of $A$ is derived from the finite element mesh partition, which is implemented using the approaches in (Karypis and Schloegel (2013); Kong et al. (11-16 November, 2018)), since $A$ arises from the discretization of partial differential equations on a finite element mesh. $P$ is partitioned in the same way, that is, into block rows $P_i$, where $P_i$ consists of the rows of $P$ owned by the $i$th processor core. For the $i$th processor core, the computation of the local part $(AP)_i$ of the output matrix (which has the same row partition as $A$ and the same column partition as $P$) is written as
$(AP)_i = A_i P = \sum_{j=0}^{p-1} A_{ij} P_j \quad (2)$
As stated earlier, some of the submatrices $A_{ij}$ have no nonzero elements since $A$ is a sparse matrix, and therefore the calculation of $(AP)_i$ does not involve all processor cores; instead, the $i$th core communicates only with the processor cores corresponding to nonzero submatrices. Define the number of nonzero submatrices in $A_i$ as $nnz_i$, and the index function $idx_i(k)$ that returns the index of the $k$th nonzero submatrix of $A_i$. Eq. (2) is then reformulated as
$(AP)_i = \sum_{k=0}^{nnz_i-1} A_{i,\,idx_i(k)}\, P_{idx_i(k)} \quad (3)$
To explain this clearly, we present a simple example in Eq. (4).
(4) 
Here the horizontal lines represent the row partition of each matrix, and the vertical dashed lines indicate the virtual column partition that matches the row partition of the matrix to its right. For instance, in the matrix $A$ (the first matrix on the right-hand side of Eq. (4)), the first processor core owns the first and second rows, the second processor core owns the third and fourth rows, and the third processor core owns the fifth and sixth rows. In this toy example, the first row of $AP$ is the linear combination of the first, second, and fifth rows of $P$, since the first, second, and fifth columns of the first row of $A$ are nonzero. To calculate the first row of $AP$, a remote row of $P$, the fifth row, is needed, which requires communication. The calculations of different rows of the local part of $AP$ involve different remote rows of $P$. If we did a message exchange for each row calculation individually, it would be inefficient. Thus, we extract all the required remote rows of $P$ (forming a matrix $\hat P$) that correspond to the nonzero off-processor columns of $A_i$ up front. For example, in Eq. (4), the third and fifth remote rows of $P$ are extracted for the first processor core because the nonzero off-processor columns of $A_0$ are the third and fifth columns. In PETSc (Balay et al. (2019)), the local part of a matrix, e.g., $A_i$, is physically stored as two blocks, a diagonal block $A_d$ and an off-diagonal block $A_o$, each kept as a sequential matrix in compressed sparse row (CSR) format. We drop the subscript $i$ for $A_d$ and $A_o$ without confusion. The calculation of $(AP)_i$ is then rewritten as
$(AP)_i = A_d P_i + A_o \hat P,$
where $P_i$ holds the locally owned rows of $P$ and $\hat P$ holds the gathered remote rows of $P$.
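To make the structure of this formula concrete, the following serial sketch computes the local rows of $AP$ from the two CSR blocks $A_d$ and $A_o$, the locally owned rows $P_i$, and the gathered remote rows $\hat P$. It is a minimal illustration under our own assumptions (a simple CSR container and an `emit` callback of ours); a dense accumulator is used for clarity, whereas the PETSc implementation described next uses hash tables.

```c
#include <stdlib.h>

/* Minimal CSR container: row pointer array ia, column indices ja, values va. */
typedef struct { int nrows; const int *ia, *ja; const double *va; } CSR;

/* Serial sketch of (AP)_i = Ad*Pi + Ao*Phat for the rows owned by one core.
 * Ad/Ao are the diagonal/off-diagonal blocks of the local rows of A, Pi the
 * locally owned rows of P (columns stored with global indices), and Phat the
 * gathered remote rows of P; Ao's column indices are assumed to point into
 * Phat.  Each finished row is handed to the caller through `emit`. */
static void local_rowwise_AP(const CSR *Ad, const CSR *Ao,
                             const CSR *Pi, const CSR *Phat,
                             int m, /* global number of columns of P */
                             void (*emit)(int row, int col, double val, void *ctx),
                             void *ctx)
{
    double *acc  = malloc((size_t)m * sizeof(double));  /* dense accumulator   */
    int    *mark = malloc((size_t)m * sizeof(int));      /* last row using col  */
    int    *used = malloc((size_t)m * sizeof(int));      /* nonzero cols of row */
    for (int c = 0; c < m; c++) mark[c] = -1;

    for (int i = 0; i < Ad->nrows; i++) {
        int nused = 0;
        for (int pass = 0; pass < 2; pass++) {
            const CSR *Ablk = pass ? Ao : Ad;
            const CSR *Pblk = pass ? Phat : Pi;
            for (int k = Ablk->ia[i]; k < Ablk->ia[i + 1]; k++) {
                int    j = Ablk->ja[k];   /* row of Pblk to combine */
                double a = Ablk->va[k];
                for (int t = Pblk->ia[j]; t < Pblk->ia[j + 1]; t++) {
                    int col = Pblk->ja[t];
                    if (mark[col] != i) { mark[col] = i; acc[col] = 0.0; used[nused++] = col; }
                    acc[col] += a * Pblk->va[t];
                }
            }
        }
        for (int t = 0; t < nused; t++) emit(i, used[t], acc[used[t]], ctx);
    }
    free(acc); free(mark); free(used);
}
```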
The calculation of the matrix-matrix multiplication is generally split into two parts, a symbolic calculation and a numeric calculation. In the symbolic calculation, the task is to accurately preallocate the memory for the output matrix by one pass through the sparsity patterns of $A$ and $P$, without involving any floating-point operations. In the numeric phase, the actual numeric calculation is carried out and the results are added into the preallocated space of the output matrix. The symbolic calculation is summarized in Alg. 1 and 2.
In Alg. 1, the per-row column index collections are mathematically integer sets, and they can be implemented with different data structures such as linked lists, binary trees, hash tables, etc. In this work, we use hash tables to implement these sets for our new all-at-once algorithms, since a hash table has constant average lookup time and also simplifies the implementation. Other implementations, such as linked lists, are also available in PETSc. To simplify the description, we assume that hash tables are used in the following discussion.
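The sketch below illustrates the per-row symbolic counting with a toy open-addressing integer hash set (insertion only, fixed capacity). It is our simplification of the idea, not PETSc's khash-based implementation, and it performs a full clear of the table instead of the cheap flag reset described below.

```c
#include <stdlib.h>

/* Toy open-addressing hash set of non-negative integers (insert only).
 * The capacity must exceed the number of distinct keys inserted per row. */
typedef struct { int *slot; int cap, size; } IntSet;

static void set_init(IntSet *s, int cap) {
    s->cap = 2 * cap + 1; s->size = 0;
    s->slot = malloc((size_t)s->cap * sizeof(int));
    for (int k = 0; k < s->cap; k++) s->slot[k] = -1;
}
static void set_clear(IntSet *s) {                 /* reuse memory for next row */
    for (int k = 0; k < s->cap; k++) s->slot[k] = -1;
    s->size = 0;
}
static void set_add(IntSet *s, int key) {          /* insert if not yet present */
    int k = key % s->cap;
    while (s->slot[k] != -1 && s->slot[k] != key) k = (k + 1) % s->cap;
    if (s->slot[k] == -1) { s->slot[k] = key; s->size++; }
}

/* Symbolic pass for one local row of AP: union the column indices of the rows
 * of P selected by the nonzeros of the row of A, splitting the counts into
 * diagonal columns [cstart, cend) and off-diagonal columns for preallocation. */
static void symbolic_row(const int *Acols, int nA,       /* nonzero columns of the A row */
                         const int *Pia, const int *Pja, /* CSR pattern of the needed P rows */
                         int cstart, int cend,
                         IntSet *dset, IntSet *oset, int *dnz, int *onz)
{
    set_clear(dset); set_clear(oset);
    for (int k = 0; k < nA; k++) {
        int j = Acols[k];
        for (int t = Pia[j]; t < Pia[j + 1]; t++) {
            int col = Pja[t];
            if (col >= cstart && col < cend) set_add(dset, col);
            else                             set_add(oset, col);
        }
    }
    *dnz = dset->size;   /* preallocation count for the diagonal block     */
    *onz = oset->size;   /* preallocation count for the off-diagonal block */
}
```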
In Alg. 2, two integer arrays store the numbers of nonzero columns of the diagonal and the off-diagonal parts of each row of the output matrix, respectively, and $|\cdot|$ denotes the number of elements in a set. The memory of the hash tables is reused for each row of the output matrix, and "clear" simply resets a flag in the data structure so that the memory is ready for the next row. Note that line 2 of Alg. 2 involves one message exchange, while all other parts proceed independently on each processor core. The numeric calculation of the matrix-matrix multiplication using the row-wise algorithm is shown in Alg. 3 and 4.
In Alg. 3, a hash table is used that associates each key (a column index) with its numeric value. "$+= (key, value)$" means that if $key$ already exists in the table then $value$ is added to the current value; otherwise the pair consisting of $key$ and its value is inserted into the hash table.
In Alg. 4, we extract the keys and their corresponding values from the hash table, and call the matrix function MatSetValues in PETSc directly to add the values into the preallocated space of the output matrix. "Clear" at line 8 of Alg. 4 does not deallocate memory; instead, it resets a flag so that the memory can be reused for the next row.
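For the numeric phase, the key-value counterpart of the set above is an insert-or-add hash map. The toy sketch below (ours, not PETSc's) shows the "$+= (key, value)$" semantics described for Alg. 3.

```c
#include <stdlib.h>

/* Toy open-addressing hash map implementing the insert-or-add operation of
 * the numeric phase: if `key` is present its value is incremented, otherwise
 * the pair is inserted.  Capacity must exceed the number of distinct keys. */
typedef struct { int *key; double *val; int cap, size; } IntMap;

static void map_init(IntMap *m, int cap) {
    m->cap = 2 * cap + 1; m->size = 0;
    m->key = malloc((size_t)m->cap * sizeof(int));
    m->val = malloc((size_t)m->cap * sizeof(double));
    for (int k = 0; k < m->cap; k++) m->key[k] = -1;
}

static void map_add(IntMap *m, int key, double value) {
    int k = key % m->cap;
    while (m->key[k] != -1 && m->key[k] != key) k = (k + 1) % m->cap;
    if (m->key[k] == -1) { m->key[k] = key; m->val[k] = 0.0; m->size++; }
    m->val[k] += value;                  /* add if present, insert otherwise */
}

/* Once a row is complete, its (key, value) pairs are read out and passed to
 * MatSetValues, and the table is cleared (memory reused) for the next row. */
```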
3 All-at-once algorithms for sparse matrix triple products
In multigrid/multilevel methods, to construct a scalable parallel solver, coarse spaces are included, which can be built either geometrically (Kong and Cai (2016, 2017)) or algebraically (Kong et al. (2018b, 2019b)). One of the critical tasks is to form a coarse operator, $C$, based on the interpolation, $P$, and the fine-level operator, $A$, as follows
$C = P^T A P \quad (5)$
where the notation $C$ is reused to denote the output matrix of the sparse matrix triple product without confusion. It is straightforward to apply Alg. 2 and 4 twice to compute the sparse matrix triple product, that is,
$AP = A \times P, \qquad C = P^T \times (AP), \quad (6)$
where $AP$ is an auxiliary matrix. More precisely, $AP = A \times P$ is explicitly formed using Alg. 2 for the symbolic calculation and Alg. 4 for the numeric computation, and then the second product, $C = P^T \times (AP)$, is formed using the same algorithms again. This procedure (referred to as the "two-step" method) is summarized in Alg. 5 for the symbolic calculation and Alg. 6 for the numeric calculation.
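For orientation, the two-step product maps onto PETSc's public interface roughly as in the sketch below; error handling and fill estimates are simplified here, and the exact behavior of these routines depends on the PETSc version, so the users manual should be consulted.

```c
#include <petscmat.h>

/* Two-step Galerkin product C = P^T (A P) via PETSc's public interface.
 * The intermediate matrix AP is exactly what the all-at-once algorithms
 * avoid storing.  A and P are assumed to be assembled MPIAIJ matrices. */
PetscErrorCode TwoStepPtAP(Mat A, Mat P, Mat *C)
{
  Mat            AP;
  PetscErrorCode ierr;

  ierr = MatMatMult(A, P, MAT_INITIAL_MATRIX, PETSC_DEFAULT, &AP);CHKERRQ(ierr);
  ierr = MatTransposeMatMult(P, AP, MAT_INITIAL_MATRIX, PETSC_DEFAULT, C);CHKERRQ(ierr);
  ierr = MatDestroy(&AP);CHKERRQ(ierr);
  return 0;
}

/* The fused interface forms the same coarse operator in a single call; which
 * internal algorithm it uses is selectable at runtime. */
PetscErrorCode FusedPtAP(Mat A, Mat P, Mat *C)
{
  PetscErrorCode ierr;
  ierr = MatPtAP(A, P, MAT_INITIAL_MATRIX, PETSC_DEFAULT, C);CHKERRQ(ierr);
  return 0;
}
```

In recent PETSc releases the internal algorithm behind MatPtAP, including the all-at-once kernels introduced below, can be chosen with a runtime option (for example, an option of the form -matptap_via allatonce); the exact option name is version dependent and should be checked against the PETSc documentation.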
Lines 4 and 6 of Alg. 5 are computed using an algorithm similar to Alg. 2 but without extracting any remote rows. The communication at lines 5 and 7 is implemented using non-blocking MPI techniques so that the computation and the communication are overlapped. "+=" at line 8 of Alg. 5 represents adding the received data into the local data structure.
Similarly, lines 4 and 6 of Alg. 6 are implemented using an algorithm similar to Alg. 4 without extracting any remote rows. The "+=" operator at line 8 is implemented using MatSetValues in PETSc, which efficiently adds values into the diagonal and off-diagonal matrices. The advantage of the two-step algorithm is that it can be efficiently implemented using the row-wise algorithm. However, it needs to build auxiliary matrices such as $P^T$ (the explicit transpose of $P$) and $AP$, which leads to a high memory overhead. The two-step algorithm works well for applications that do not require much memory, but it is not suitable for memory-intensive problems such as seven-dimensional neutron transport simulations. To overcome this difficulty, we introduce new all-at-once algorithms that do not involve the auxiliary matrices, in order to save memory. The basic idea of the all-at-once algorithms is to form $C$ with one pass through $A$ and $P$, all at once, without any auxiliary steps. We also mention that a similar idea has been used in (Adams et al. (2004); Van Emden Henson (2002)). However, we propose a new version of the all-at-once algorithms, based on hash tables, that is implemented in PETSc as part of this work, where the second matrix-matrix multiplication is based on an outer-product algorithm instead of the row-wise approach. More precisely, the $j$th element of the $i$th row of $C$ is obtained using the following formula
$C_{ij} = \sum_{k=0}^{n-1} (P^T)_{ik}\,(AP)_{kj} = \sum_{k=0}^{n-1} P_{ki}\,(AP)_{kj}, \quad (7)$
where we do not need to explicitly form $P^T$. The atomic task of the all-at-once algorithms is to compute the $i$th row of $C$ as follows
$C_{i*} = \sum_{k=0}^{n-1} P_{ki}\,(AP)_{k*}. \quad (8)$
Eq. (8) is potentially memory efficient, but there is no good way to access $P$ column by column since $P$ is distributed in a row-wise manner. To overcome this difficulty, we use an outer product for the second matrix-matrix multiplication, $P^T \times (AP)$, so that $P$ is accessed row by row instead of column by column. $C$ is formed all at once as a summation of outer products,
$C = \sum_{k=0}^{n-1} (P_{k*})^T (AP)_{k*}. \quad (9)$
In Eq. (9), $C$ is the summation of the outer products of the $k$th row of $P$ and the $k$th row of $AP$. Note that (Ballard et al. (2016b)) shows that, in their two-step approach, the row-wise algorithm is suitable for $A \times P$, and the outer product is the best choice for $P^T \times (AP)$ in terms of communication cost. We adopt the outer product in the second matrix-matrix multiplication not only to reduce the communication cost but also to save memory. The detailed algorithm is shown in Alg. 7 and 8.
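Before walking through the algorithms, note that the equivalence between the entry-wise formula (7) and the outer-product form (9) is simply an exchange of the order in which the contributions are accumulated:

$C_{ij} = \sum_{k=0}^{n-1} P_{ki}\,(AP)_{kj} = \sum_{k=0}^{n-1} \left[(P_{k*})^T (AP)_{k*}\right]_{ij} \quad\Longrightarrow\quad C = \sum_{k=0}^{n-1} (P_{k*})^T (AP)_{k*},$

so each locally owned row $k$ of $P$ and of $AP$ contributes a rank-one update whose nonzero rows are exactly the rows of $C$ indexed by the nonzero columns of $P_{k*}$; updates destined for rows of $C$ owned by other processor cores are accumulated and sent to their owners.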
At line 3 of Alg. 7, the symbolic structure of $C$ is implemented as a set of hash sets that stores the indices of the nonzero columns for each row. The outer product at line 11 is carried out by inserting the column indices of the $k$th row of $AP$ into the rows of $C$ corresponding to the nonzero columns of the $k$th row of $P$. The first loop, consisting of lines 5-13, computes the rows that will be sent to their remote owners. The second loop, consisting of lines 16-25, calculates the local part of $C$. The calculation is split into two parts in order to overlap the computation and the communication.
In Alg. 8, adding the outer product at line 10 is implemented by inserting the numerical values directly into the preallocated space, and the outer product at line 21 is added to the preallocated matrix using the PETSc routine MatSetValues. Again, the communications at lines 14 and 26 of Alg. 7 are based on non-blocking MPI techniques. It is easily observed that Alg. 7 and 8 do not involve the auxiliary matrices $P^T$ and $AP$. Generally, Alg. 7 and 8 work well when the amount of calculation in the first loop is small (which is true for most sparse matrices arising from the discretization of partial differential equations). If the computation of the first loop is heavy, the algorithm can be modified to save time on the calculations. In this case, we propose an alternative that merges the calculations of the first and second loops. This modified algorithm is referred to as the "merged all-at-once" algorithm, and it is described in detail in Alg. 9 and 10.
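As an illustration of the numeric accumulation just described, the following hedged sketch uses only public PETSc routines: MatGetRow to read the locally owned rows of $P$, and MatSetValues with ADD_VALUES to add each rank-one update into the preallocated $C$. The routine APGetLocalRow is a placeholder of ours standing for the on-the-fly row-wise kernel of Section 2 (it is not a PETSc function), and unlike Alg. 8, which performs its own non-blocking communication, this sketch lets MatSetValues stash the off-process contributions and move them during assembly.

```c
#include <petscmat.h>

/* Placeholder: produce row k of AP on the fly without ever storing AP. */
extern PetscErrorCode APGetLocalRow(Mat A, Mat P, PetscInt k,
                                    PetscInt *nap, const PetscInt **apcols,
                                    const PetscScalar **apvals);

/* Numeric outer-product accumulation of Eq. (9) for the locally owned rows. */
PetscErrorCode AllAtOnceNumericSketch(Mat A, Mat P, Mat C)
{
  PetscInt          rstart, rend, k, np, nap, i, j;
  const PetscInt    *pcols, *apcols;
  const PetscScalar *pvals, *apvals;
  PetscScalar       *buf;
  PetscErrorCode    ierr;

  ierr = MatGetOwnershipRange(P, &rstart, &rend);CHKERRQ(ierr);
  for (k = rstart; k < rend; k++) {
    ierr = MatGetRow(P, k, &np, &pcols, &pvals);CHKERRQ(ierr);
    ierr = APGetLocalRow(A, P, k, &nap, &apcols, &apvals);CHKERRQ(ierr);
    ierr = PetscMalloc1(np * nap, &buf);CHKERRQ(ierr);
    for (i = 0; i < np; i++)            /* rank-one update P_{k,i} * (AP)_{k,j} */
      for (j = 0; j < nap; j++) buf[i * nap + j] = pvals[i] * apvals[j];
    /* the nonzero columns of row k of P are the global rows of C receiving it */
    ierr = MatSetValues(C, np, pcols, nap, apcols, buf, ADD_VALUES);CHKERRQ(ierr);
    ierr = PetscFree(buf);CHKERRQ(ierr);
    ierr = MatRestoreRow(P, k, &np, &pcols, &pvals);CHKERRQ(ierr);
  }
  ierr = MatAssemblyBegin(C, MAT_FINAL_ASSEMBLY);CHKERRQ(ierr);
  ierr = MatAssemblyEnd(C, MAT_FINAL_ASSEMBLY);CHKERRQ(ierr);
  return 0;
}
```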
The performance difference between the all-at-once algorithm and its merged version is entirely problem dependent. For some applications, their performances may be close to each other. If the communication in the first loop is expensive, we may prefer the all-at-once algorithm to the merged all-at-once algorithm.
4 Numerical results
In this section, we report the performance of the proposed algorithms in terms of memory usage and compute time. To understand the algorithms, we study two test cases: a model problem based on structured meshes, mimicking geometric multigrid/multilevel methods, and a realistic neutron transport simulation in which AMG is employed. Let us define some notation used in the rest of the discussion for convenience. "twostep" represents the approach described in Alg. 5 and 6, "allatonce" is the algorithm described in Alg. 7 and 8, and "merged" denotes the method in Alg. 9 and 10. "np" is the number of processor cores, "Mem" is the estimated memory usage per processor core, in megabytes (MB), for the matrix triple products, "Mem_total" is the estimated total memory per processor core, in MB, used in the entire simulation, "Time" is the compute time, in seconds, of the matrix triple products, "Time_total" is the total compute time, in seconds, of the simulation, "Time_sym" is the time spent on the symbolic calculations of the triple products, "Time_num" is the time spent on the numeric computations of the triple products, and "EFF" is the parallel efficiency of the overall simulation. The proposed algorithms, all-at-once and its merged version, for the matrix triple products are implemented in PETSc (Balay et al. (2019)) as part of this work.
4.1 Model problem
This test is conducted for processor counts between 8,192 and 32,768 on the Theta supercomputer at Argonne National Laboratory. Theta is a massively parallel, many-core system with second-generation Intel Xeon Phi processors. Each compute node has a 64-core processor with 16 gigabytes (GB) of high-bandwidth in-package memory (MCDRAM), 192 GB of DDR4 RAM, and a 128 GB SSD. The test is designed to mimic a geometric two-level method based on a fine mesh and a coarse mesh. A 3D structured grid is employed as the coarse mesh, and the fine mesh is a uniform refinement of the coarse mesh. Each grid point is assigned one unknown. An interpolation is created from the coarse mesh to the fine mesh using a linear function, so that $A$ is defined on the fine mesh and $P$ maps from the coarse mesh to the fine mesh. One symbolic and eleven numeric triple products are performed to mimic the use case in which the numeric calculations have to be repeated multiple times; "Time_num" is the time spent on all eleven numeric triple products. We compare the performance of our new algorithms, in terms of compute time and memory usage, with that of the traditional two-step method. Numerical results are summarized in Table 1.
np  Algorithm  Mem (MB)  Time_sym (s)  Time_num (s)  Time (s)  EFF
8,192  allatonce  68  6.4  63  69  100% 
8,192  merged  68  6.3  63  69  100% 
8,192  twostep  554  8.3  46  54  100% 
16,384  allatonce  35  3.4  33  37  93% 
16,384  merged  35  3.2  32  35  99% 
16,384  twostep  280  4.2  23  27  100% 
24,576  allatonce  24  2.3  22  24  96% 
24,576  merged  24  2.2  21  23  100% 
24,576  twostep  188  2.9  15  18  100% 
32,768  allatonce  18  1.8  17  19  91% 
32,768  merged  18  1.7  18  20  86% 
32,768  twostep  132  2.2  12  14  96% 
From Table 1, we observe that the memory usage of the new algorithms is only a small fraction of that consumed by the two-step method, which indicates that the new algorithms incur a very low memory overhead. For example, the all-at-once algorithm takes only 68 MB of memory per core at 8,192 processor cores, while the two-step method uses 554 MB. For all cases, the all-at-once and the merged all-at-once approaches use exactly the same amount of memory. All the algorithms are scalable in terms of memory usage, in the sense that the memory usage decreases proportionally when the number of processor cores is increased. The memory usages are halved to 35 MB for the all-at-once algorithm and 280 MB for the two-step method, respectively, when the number of processor cores is doubled from 8,192 to 16,384; they are further reduced to 24 MB and 188 MB at 24,576 cores, and to 18 MB and 132 MB at 32,768 cores. The memory (denoted as "Mem" in Table 1) spent on the triple products includes the storage for the output matrix $C$. In order to understand how much memory overhead the triple product algorithms have, the memory used to store the matrices $A$, $P$ and $C$ on 8,192, 16,384, 24,576 and 32,768 processor cores is shown in Table 2. It is easily found that $C$ takes 65 MB at 8,192 processor cores, so the overhead of the all-at-once algorithm is only about 3 MB, since the corresponding total memory of the triple product for the all-at-once approach is 68 MB. On the other hand, the two-step method has a much higher memory overhead of 489 MB (554 MB - 65 MB). The memory usage for storing all the matrices is scalable, that is, it is nearly halved when we double the number of processor cores. For example, the memory required to store $A$ is 182 MB at 8,192 processor cores, and it is reduced to 92 MB when we increase the number of processor cores from 8,192 to 16,384; it is further reduced to 62 MB and 47 MB when we increase the number of processor cores to 24,576 and 32,768. We observe the same trend in the memory usage for storing $P$ and $C$, that is, it is essentially halved when the number of processor cores is doubled. Similarly, the compute times of the triple products are also scalable for all the algorithms. The compute times on 8,192 cores are 69 s for the all-at-once method and 54 s for the two-step method, and they are reduced to 37 s and 27 s when 16,384 processor cores are used. This leads to parallel efficiencies of 93% for the all-at-once method and 100% for the two-step method. When we continue increasing the number of processor cores to 24,576 and 32,768, the compute times are reduced to 24 s and 19 s for the all-at-once algorithm, and to 18 s and 14 s for the two-step method. Good scalability, with parallel efficiencies of 86% or above for all the algorithms, is achieved even when the number of processor cores reaches 32,768. The compute time of the all-at-once approach is very close to that of the merged all-at-once algorithm; there are no meaningful differences. The two-step method is slightly faster than the all-at-once and the merged all-at-once algorithms. These speedups and parallel efficiencies are also shown in Fig. 1.
Matrices  8,192  16,384  24,576  32,768 (memory per processor core, in MB)
A  182  92  62  47 
P  125  63  44  26 
C  65  34  23  17 
Another interesting observation is that the time spent on the symbolic calculation for the all-at-once method is less than that spent by the two-step method. For example, for the 8,192-core case, 6.4 s is used for the symbolic calculations in the all-at-once approach while the two-step method takes 8.3 s. The times spent on both the symbolic and the numeric calculations are nearly ideally scalable. In all, the new algorithms are scalable in terms of memory usage and compute time, and they also use much less memory than the traditional two-step method. The memory comparison between all the algorithms is also summarized in Fig. 2, where it is easily seen that the two-step approach consumes much more memory than the all-at-once and the merged all-at-once methods for all processor counts.
To further confirm the performance of the new algorithms, larger meshes are used in the following test. A larger 3D structured grid is used as the coarse mesh, and the fine mesh is again a uniform refinement of the coarse mesh, so that $A$ on the fine mesh and $P$ from the coarse mesh to the fine mesh are correspondingly larger. The same configuration as before is adopted. The numerical results are shown in Table 3, and the memory used to store the matrices $A$, $P$ and $C$ is recorded in Table 4.
np  Algorithm  Mem (MB)  Time_sym (s)  Time_num (s)  Time (s)  EFF
8,192  allatonce  223  21  218  239  100% 
8,192  merged  223  21  211  232  100% 
8,192  twostep  –  –  –  –  –% 
16,384  allatonce  114  11  106  117  102% 
16,384  merged  114  10  105  115  101% 
16,384  twostep  935  21  77  98  100% 
24,576  allatonce  76  7.2  73  81  98% 
24,576  merged  76  7  70  77  100% 
24,576  twostep  621  12  58  70  93% 
32,768  allatonce  59  5.4  55  61  98% 
32,768  merged  59  5.3  53  58  100% 
32,768  twostep  469  8.8  38  47  104% 
Matrices  8,192  16,384  24,576  32,768 (memory per processor core, in MB)
A  612  307  204  154 
P  426  275  143  107 
C  212  106  74  54 
The two-step method was not able to run on 8,192 processor cores, since it attempted to allocate more memory than is physically available. The parallel efficiencies for the two-step method are therefore computed relative to the compute time on 16,384 processor cores. Again, the memory usage of the all-at-once algorithm is the same as that of the merged all-at-once algorithm. It is 223 MB at 8,192 processor cores, and it is reduced to 114 MB, 76 MB and 59 MB when the number of processor cores is increased to 16,384, 24,576 and 32,768; in fact, it is nearly ideally halved when we double the number of processor cores. The two-step method consumes roughly eight times as much memory as the all-at-once algorithm. The detailed comparison of the memory usage of all the algorithms is drawn in Fig. 4, where it is clearly observed that the memory used in the two-step method is significantly larger than that used in the all-at-once and the merged all-at-once algorithms. The times spent on the symbolic and the numeric calculations are scalable for all the algorithms with up to 32,768 processor cores. For example, the all-at-once algorithm uses 21 s for the symbolic calculations at 8,192 processor cores, and this time is reduced to 11 s, 7.2 s, and 5.4 s when we increase the number of processor cores to 16,384, 24,576 and 32,768. The time of the numeric calculations for the all-at-once algorithm at 8,192 processor cores is 218 s, and it is almost perfectly halved to 106 s when we double the number of processor cores to 16,384. When we continue increasing the number of processor cores to 24,576 and 32,768, the time of the numeric calculations is further reduced to 73 s and 55 s. For the two-step method, the time of the symbolic computations at 16,384 processor cores is 21 s, and it is reduced to 12 s and 8.8 s when we use 24,576 and 32,768 processor cores. For the numeric calculations of the two-step method, the time is 77 s at 16,384 cores, and it is reduced to 58 s and 38 s when the number of processor cores is increased to 24,576 and 32,768. Again, the merged all-at-once algorithm has performance similar to the all-at-once algorithm in terms of both memory usage and compute time. The all-at-once and the merged all-at-once algorithms are faster than the two-step method in the symbolic calculations, and a little slower in the numeric calculations. The overall calculation time is scalable because both the symbolic and the numeric calculations are nearly perfectly scalable, and good scalability in terms of compute time is obtained for all the algorithms at all processor counts. The corresponding speedups and parallel efficiencies are drawn in Fig. 3.
4.2 Neutron transport problem
Next, we present the performance of the proposed algorithms in realistic neutron transport simulations. The numerical experiments in this section are carried out on a supercomputer at INL (Idaho National Laboratory), where each compute node has two 20-core 2.4 GHz processors, and the compute nodes are connected through an Omni-Path network.
4.2.1 Problem description
We consider the multigroup neutron transport equations here; neutron transport is a memory-intensive application because many variables are associated with each mesh vertex, arising from the discretization of the neutron flying direction. The multigroup neutron transport equations, defined on $\mathcal{D} \times \mathcal{S}$ (where $\mathcal{D}$ is the 3D spatial domain, e.g., as shown in Fig. 5, and $\mathcal{S}$ is the 2D unit sphere representing the neutron flying directions), read as
$\vec\Omega \cdot \vec\nabla \psi_g + \Sigma_{t,g}\,\psi_g = \sum_{g'=1}^{G} \int_{\mathcal{S}} \Sigma_{s,\,g'\to g}\, f_{g'\to g}(\vec\Omega'\cdot\vec\Omega)\, \psi_{g'}(\vec r, \vec\Omega')\, d\Omega' + \frac{\chi_g}{4\pi k_{\mathrm{eff}}} \sum_{g'=1}^{G} \nu\Sigma_{f,g'}\, \phi_{g'}, \quad (10a)$
$\psi_g(\vec r, \vec\Omega) = \alpha_s\, \psi_g(\vec r, \vec\Omega_r) + \frac{\alpha_d}{\pi} \int_{\vec\Omega'\cdot\vec n > 0} |\vec\Omega'\cdot\vec n|\, \psi_g(\vec r, \vec\Omega')\, d\Omega', \qquad \vec r \in \partial\mathcal{D},\ \vec\Omega\cdot\vec n < 0, \quad (10b)$
where $g = 1, \ldots, G$, and $G$ is the number of energy groups. The fundamental quantity of interest is the neutron angular flux $\psi_g(\vec r, \vec\Omega)$. In Eq. (10a), $\vec\Omega$ denotes the independent angular variable, $\vec r$ is the independent spatial variable, $\Sigma_{t,g}$ is the macroscopic total cross section, $\Sigma_{s,\,g'\to g}$ is the macroscopic scattering cross section from group $g'$ to group $g$, $\Sigma_{f,g'}$ is the macroscopic fission cross section, $\phi_{g'}$ is the scalar flux defined as $\phi_{g'} = \int_{\mathcal{S}} \psi_{g'}\, d\Omega$, $f_{g'\to g}$ is the scattering phase function, $\nu$ is the average number of neutrons emitted per fission, $\chi_g$ is the prompt fission spectrum, and $k_{\mathrm{eff}}$ is the eigenvalue (sometimes referred to as the multiplication factor). In Eq. (10b), $\vec n$ is the outward unit normal vector on the boundary $\partial\mathcal{D}$ of $\mathcal{D}$, $\vec\Omega_r$ is the specular reflection of $\vec\Omega$, $\alpha_s$ is the specular reflectivity on $\partial\mathcal{D}$, and $\alpha_d$ is the diffusive reflectivity on $\partial\mathcal{D}$. The first term of (10a) is the streaming term, and the second is the collision term. The first term on the right-hand side of Eq. (10a) is the scattering term, which couples the angular fluxes of all directions and energy groups together. The second term on the right-hand side of (10a) is the fission term, which also couples the angular fluxes of all directions and energy groups together. Eq. (10b) is the boundary condition. The multigroup neutron transport equations are discretized using first-order Lagrange finite elements in space, and the discrete ordinates method (which can be thought of as a collocation method) in angle (Wang et al. (2018); Lewis and Miller (1984); Kong et al. (2019b)). The discretizations are implemented based on RattleSnake (Wang et al. (2018)), MOOSE (Multiphysics Object Oriented Simulation Environment) (Peterson et al. (2018)) and libMesh (Kirk et al. (2006)). After the angular and spatial discretizations, we obtain a large-scale sparse eigenvalue system that is solved using an inexact Newton-Krylov method (Cai et al. (1998); Knoll and Keyes (2004); Kong (2016); Kong et al. (2018a, 2019b)). During each Newton iteration, the Jacobian system is solved using GMRES (Saad and Schultz (1986)), and to speed up the convergence of GMRES, a multilevel Schwarz preconditioner (Kong et al. (2019b)) is employed. The triple products are performed in the setup phase of the multilevel Schwarz preconditioner. We next report the performance of the new all-at-once algorithms for the triple products within the multilevel method.
4.2.2 Memory usage and scalability study
In the following, we study the memory usage of the all-at-once and the merged all-at-once algorithms, compared with the two-step approach. In this test, a mesh with 25,856,505 nodes and 26,298,300 elements is employed, and 96 variables are associated with each mesh vertex. The resulting large-scale sparse eigenvalue system has 2,482,224,480 unknowns, which is solved with 4,000, 6,000, 8,000, and 10,000 processor cores, respectively. A flux moment of the physics solution is shown in Fig. 6. The preconditioning matrix is coarsened algebraically, and a twelve-level method is generated using 11 sparse matrix triple products, in which the interpolation and coarse operator matrices are formed. Interested readers are referred to (Kong et al. (2019b)) for the coarsening algorithms and the preconditioner setup. The details of the 12 operator matrices and the 11 interpolation matrices of the twelve-level method are shown in Tables 5 and 6. Here "level" is numbered from the fine level to the coarse levels, that is, "level 0" is the finest level and "level 11" is the coarsest level; "rows" is the number of rows of a matrix, "cols" is the number of columns of a matrix, "nonzeros" denotes the number of nonzero elements of a matrix, "cols_min" is the minimum number of nonzero columns among all rows, "cols_max" is the maximum number of nonzero columns among all rows, and "cols_avg" is the average number of nonzero columns over all rows.
level  rows  nonzeros  cols_min  cols_max  cols_avg
0  2,482,224,480  66,169,152,672  8  45  26.7 
1  725,805,984  20,914,460,832  3  101  28.8 
2  256,966,368  7,995,535,008  1  172  31.1 
3  75,645,600  2,153,489,952  4  278  28.5 
4  21,120,960  617,242,944  2  251  29.2 
5  5,187,840  169,280,256  3  288  32.6 
6  1,188,576  46,487,520  4  266  39.1 
7  279,360  11,366,592  5  171  40.7 
8  63,648  2,076,192  4  86  32.6 
9  13,632  319,680  7  51  23.5 
10  2,496  37,824  8  25  15.2 
11  384  1,536  4  4  4 
level  rows  cols  cols_min  cols_max
0  2,482,224,480  725,805,984  0  12 
1  725,805,984  256,966,368  0  12 
2  256,966,368  75,645,600  0  11 
3  75,645,600  21,120,960  0  9 
4  21,120,960  5,187,840  0  9 
5  5,187,840  1,188,576  0  8 
6  1,188,576  279,360  0  8 
7  279,360  63,648  0  6 
8  63,648  13,632  0  5 
9  13,632  2,496  0  4 
10  2,496  384  0  2 
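The per-row statistics reported in Tables 5 and 6 can be gathered with a simple loop such as the sketch below; this is our own illustration using public PETSc routines, not necessarily the authors' tooling.

```c
#include <petscmat.h>

/* Gather per-row nonzero statistics for a distributed PETSc matrix:
 * the global minimum, maximum, and average number of nonzero columns per row. */
PetscErrorCode RowNonzeroStats(Mat M, PetscInt *colsmin, PetscInt *colsmax, PetscReal *colsavg)
{
  PetscInt       rstart, rend, r, ncols, nrows, ncolsglobal;
  PetscInt       lmin = PETSC_MAX_INT, lmax = 0;
  PetscReal      lsum = 0.0, gsum;
  PetscErrorCode ierr;

  ierr = MatGetSize(M, &nrows, &ncolsglobal);CHKERRQ(ierr);
  ierr = MatGetOwnershipRange(M, &rstart, &rend);CHKERRQ(ierr);
  for (r = rstart; r < rend; r++) {
    ierr = MatGetRow(M, r, &ncols, NULL, NULL);CHKERRQ(ierr);   /* pattern only */
    if (ncols < lmin) lmin = ncols;
    if (ncols > lmax) lmax = ncols;
    lsum += (PetscReal)ncols;
    ierr = MatRestoreRow(M, r, &ncols, NULL, NULL);CHKERRQ(ierr);
  }
  ierr = MPIU_Allreduce(&lmin, colsmin, 1, MPIU_INT, MPI_MIN, PetscObjectComm((PetscObject)M));CHKERRQ(ierr);
  ierr = MPIU_Allreduce(&lmax, colsmax, 1, MPIU_INT, MPI_MAX, PetscObjectComm((PetscObject)M));CHKERRQ(ierr);
  ierr = MPIU_Allreduce(&lsum, &gsum, 1, MPIU_REAL, MPI_SUM, PetscObjectComm((PetscObject)M));CHKERRQ(ierr);
  *colsavg = gsum / (PetscReal)nrows;
  return 0;
}
```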
The numerical results are summarized in Tables 7 and 8. It is easily observed in Table 7 that the traditional two-step algorithm uses roughly twice as much memory as the all-at-once and the merged all-at-once algorithms for all processor counts. For instance, the two-step algorithm uses 587 MB of memory per core at 4,000 cores, while the all-at-once and the merged all-at-once algorithms use 264 MB, which is only about 45% of that used in the two-step method. This behavior is similar for all processor counts. The memory usages of all the algorithms are scalable when the number of processor cores is increased. They are 587 MB, 264 MB and 264 MB at 4,000 processor cores for the two-step, the all-at-once and the merged all-at-once algorithms, respectively, and are proportionally reduced to 310 MB, 158 MB, and 158 MB when we increase the number of processor cores to 6,000. The memory usages continue to decrease to 283 MB, 108 MB, and 108 MB when the number of processor cores is increased to 8,000. The memory usages of the new algorithms at 10,000 processor cores are slightly higher (113 MB), but this does not affect the overall memory scalability. Meanwhile, the compute times of the all-at-once and the merged all-at-once approaches are almost the same as that of the two-step method. More importantly, the compute times of all three algorithms are aggressively optimized so that they account for a negligible portion of the total compute time; the triple products do not affect the overall scalability and performance of the simulations. The compute times spent on the sparse matrix triple products are sometimes in the machine-noise range, so that the total compute time may be slightly larger even when the compute time of the triple products is smaller. For example, the total compute time using the merged all-at-once algorithm is 1,977 s at 6,000 processor cores while that of the two-step method is 1,990 s, even though the compute time of the triple products using the two-step method is smaller than that obtained using the merged all-at-once algorithm. We therefore conclude that the proposed new algorithms, including the all-at-once and the merged all-at-once approaches, significantly save memory without any compromise on the compute time. The total compute time at 4,000 cores for the all-at-once algorithm is 2,871 s, and it is reduced to 1,998 s, 1,581 s, and 1,383 s for 6,000, 8,000, and 10,000 processor cores, respectively. The compute times for the merged all-at-once and the two-step algorithms are close to those of the all-at-once approach for all processor counts, that is, they are scalable as well. Good scalability and parallel efficiency in terms of compute time are maintained for all the algorithms, as shown in Fig. 7. A parallel efficiency of at least 82% is achieved at 10,000 processor cores.
np  Algorithm  Mem (MB)  Mem_total (MB)  Time (s)  Time_total (s)  EFF
4,000  twostep  587  1755  39  2825  100% 
4,000  allatonce  264  1423  59  2871  100% 
4,000  merged  264  1423  59  2858  100% 
6,000  twostep  310  1116  36  1990  95% 
6,000  allatonce  158  968  50  1998  96% 
6,000  merged  158  968  53  1977  96% 
8,000  twostep  283  889  36  1582  89% 
8,000  allatonce  108  714  44  1581  91% 
8,000  merged  108  714  44  1587  90% 
10,000  twostep  251  757  40  1385  82% 
10,000  allatonce  113  615  44  1383  83% 
10,000  merged  113  615  43  1392  82% 
The memory usage comparison for the different triple product algorithms is also shown in Fig. 8, where it is obvious that the memory used in the two-step method is about twice as much as that used in the new algorithms for all processor counts.
np  Algorithm  Mem (MB)  Mem_total (MB)  Time (s)  Time_total (s)  EFF
4,000  twostep  813  1981  48  2864  100% 
4,000  allatonce  402  1571  61  2873  100% 
4,000  merged  402  1571  59  2881  100% 
6,000  twostep  478  1284  42  1969  97% 
6,000  allatonce  256  1066  53  1997  96% 
6,000  merged  256  1066  54  1999  96% 
8,000  twostep  361  967  41  1586  90% 
8,000  allatonce  176  782  45  1584  91% 
8,000  merged  176  782  44  1611  89% 
10,000  twostep  336  842  38  1373  83% 
10,000  allatonce  180  682  47  1402  82% 
10,000  merged  180  682  45  1393  83% 
In the previous test, the intermediate data is freed after the preconditioner setup. In some applications, however, the preconditioner setup is repeated at each nonlinear iteration, and the corresponding triple products are also repeated. In this case, we can choose to cache the necessary intermediate data so that the symbolic calculations do not need to be redone from scratch, which saves compute time. In Table 8, we consider the amount of memory required by all the algorithms when the intermediate data is cached. It is found that the memory usage increases substantially compared with that used without storing temporary data. For example, at 4,000 processor cores, 813 MB of memory is allocated for the two-step method when caching the intermediate data, while it is 587 MB when we do not store temporary data. For the new algorithms the behavior is similar, and the memory usage increases by roughly 50%. However, the memory usages of the new algorithms are still much lower than that of the two-step method; more precisely, they are about half of that of the two-step method for all processor counts. While the compute time spent on the triple products is a tiny portion of the total compute time, the corresponding memory usage takes a large chunk of the total memory, as shown in Fig. 10. For example, at 10,000 processor cores, the memory usage of the triple products accounts for about 40% of the total memory for the two-step method, and for about 26% of the total memory for the new algorithms. This is exactly the motivation for optimizing the memory usage of the triple products.
Again, for all cases, the simulation is scalable in terms of compute time and memory usage. The corresponding parallel efficiencies and speedups are drawn in Fig. 9. We finally conclude that the newly proposed algorithms, including the all-at-once and the merged all-at-once approaches, require compute times similar to the two-step method while consuming much less memory.
5 Final remarks
Two memory-efficient all-at-once algorithms have been introduced, implemented, and discussed for the sparse matrix triple products in multigrid/multilevel methods. The all-at-once triple product methods, based on hash tables, form the output matrix with one pass through the input matrices, where the first matrix-matrix multiplication is carried out using the row-wise approach and the second multiplication is accomplished with an outer product. To save memory and reduce communication, the all-at-once approach and its merged version neither explicitly transpose the interpolation matrix nor create any auxiliary matrices. Compared with the traditional two-step method, the proposed new algorithms use much less memory while maintaining the computational efficiency; e.g., they consume roughly one eighth of the memory used by the two-step method for the model problem and about half for the realistic neutron transport simulation. We have shown that the all-at-once algorithms are scalable in terms of compute time and memory usage for the model problem, with 27 billion unknowns, on supercomputers with up to 32,768 processor cores, and for the realistic neutron transport problem, with nearly 2.5 billion unknowns, using up to 10,000 processor cores.
Acknowledgments
This manuscript has been authored by Battelle Energy Alliance, LLC under Contract No. DE-AC07-05ID14517 with the U.S. Department of Energy. The United States Government retains and the publisher, by accepting the article for publication, acknowledges that the United States Government retains a nonexclusive, paid-up, irrevocable, and worldwide license to publish or reproduce the published form of this manuscript, or allow others to do so, for United States Government purposes.
This research made use of the resources of the High-Performance Computing Center at Idaho National Laboratory, which is supported by the Office of Nuclear Energy of the U.S. Department of Energy and the Nuclear Science User Facilities under Contract No. DE-AC07-05ID14517. This research also made use of the resources of the Argonne Leadership Computing Facility, which is a DOE Office of Science User Facility supported under Contract DE-AC02-06CH11357.
References

Adams et al. (2004)
Adams MF, Bayraktar HH, Keaveny TM and Papadopoulos P (2004) Ultrascalable implicit finite element analyses in solid mechanics with over a half a billion degrees of freedom.
In: Proceedings of the 2004 ACM/IEEE Conference on Supercomputing. IEEE Computer Society, p. 34.
Akbudak and Aykanat (2014) Akbudak K and Aykanat C (2014) Simultaneous input and output matrix partitioning for outer-product-parallel sparse matrix-matrix multiplication. SIAM Journal on Scientific Computing 36(5): C568–C590.
Balay et al. (2019) Balay S, Abhyankar S, Adams MF, Brown J, Brune P, Buschelman K, Dalcin L, Dener A, Eijkhout V, Gropp WD, Karpeyev D, Kaushik D, Knepley MG, May DA, McInnes LC, Mills RT, Munson T, Rupp K, Sanan P, Smith BF, Zampini S, Zhang H and Zhang H (2019) PETSc Users Manual. Technical Report ANL-95/11, Revision 3.11, Argonne National Laboratory. URL http://www.mcs.anl.gov/petsc.
Ballard et al. (2016a) Ballard G, Druinsky A, Knight N and Schwartz O (2016a) Hypergraph partitioning for sparse matrix-matrix multiplication. ACM Transactions on Parallel Computing (TOPC) 3(3): 18.
 Ballard et al. (2016b) Ballard G, Siefert C and Hu J (2016b) Reducing communication costs for sparse matrix multiplication within algebraic multigrid. SIAM Journal on Scientific Computing 38(3): C203–C231.
Borštnik et al. (2014) Borštnik U, VandeVondele J, Weber V and Hutter J (2014) Sparse matrix multiplication: The distributed block-compressed sparse row library. Parallel Computing 40(5-6): 47–58.
Buluç and Gilbert (2012) Buluç A and Gilbert JR (2012) Parallel sparse matrix-matrix multiplication and indexing: Implementation and experiments. SIAM Journal on Scientific Computing 34(4): C170–C191.
Cai et al. (1998) Cai XC, Gropp WD, Keyes DE, Melvin RG and Young DP (1998) Parallel Newton-Krylov-Schwarz algorithms for the transonic full potential equation. SIAM Journal on Scientific Computing 19(1): 246–265.
Challacombe (2000) Challacombe M (2000) A general parallel sparse-blocked matrix multiply for linear scaling SCF theory. Computer Physics Communications 128(1-2): 93–107.
 Karypis and Schloegel (2013) Karypis G and Schloegel K (2013) PARMETIS: Parallel Graph Partitioning and Sparse Matrix Ordering Library. Technical Report Version 4.0, University of Minnesota. URL http://glaros.dtc.umn.edu/gkhome/metis/parmetis/overview.
Kirk et al. (2006) Kirk BS, Peterson JW, Stogner RH and Carey GF (2006) libMesh: A C++ Library for Parallel Adaptive Mesh Refinement/Coarsening Simulations. Engineering with Computers 22(3–4): 237–254. https://doi.org/10.1007/s00366-006-0049-3.
Knoll and Keyes (2004) Knoll DA and Keyes DE (2004) Jacobian-free Newton-Krylov methods: a survey of approaches and applications. Journal of Computational Physics 193(2): 357–397.
Kong (2016) Kong F (2016) A Parallel Implicit Fluid-Structure Interaction Solver with Isogeometric Coarse Spaces for 3D Unstructured Mesh Problems with Complex Geometry. PhD Thesis, University of Colorado Boulder.
 Kong and Cai (2016) Kong F and Cai XC (2016) A highly scalable multilevel Schwarz method with boundary geometry preserving coarse spaces for 3D elasticity problems on domains with complex geometry. SIAM Journal on Scientific Computing 38(2): C73–C95.
Kong and Cai (2017) Kong F and Cai XC (2017) A scalable nonlinear fluid-structure interaction solver based on a Schwarz preconditioner with isogeometric unstructured coarse spaces in 3D. Journal of Computational Physics 340: 498–518.
Kong and Cai (2018) Kong F and Cai XC (2018) Scalability study of an implicit solver for coupled fluid-structure interaction problems on unstructured meshes in 3D. The International Journal of High Performance Computing Applications 32(2): 207–219.
Kong et al. (2018a) Kong F, Kheyfets V, Finol E and Cai XC (2018a) An efficient parallel simulation of unsteady blood flows in patient-specific pulmonary artery. International Journal for Numerical Methods in Biomedical Engineering 34(4): 29–52.
Kong et al. (2019a) Kong F, Kheyfets V, Finol E and Cai XC (2019a) Simulation of unsteady blood flows in a patient-specific compliant pulmonary artery with a highly parallel monolithically coupled fluid-structure interaction algorithm. International Journal for Numerical Methods in Biomedical Engineering. URL https://doi.org/10.1002/cnm.3208.
Kong et al. (11-16 November, 2018) Kong F, Stogner RH, Gaston DR, Peterson JW, Permann CJ, Slaughter AE and Martineau RC (11-16 November, 2018) A general-purpose hierarchical mesh partitioning method with node balancing strategies for large-scale numerical simulations. In: 2018 IEEE/ACM 9th Workshop on Latest Advances in Scalable Algorithms for Large-Scale Systems (ScalA). Dallas, Texas.
Kong et al. (2019b) Kong F, Wang Y, Gaston DR, Permann CJ, Slaughter AE, Lindsay AD and Martineau RC (2019b) A highly parallel multilevel Newton-Krylov-Schwarz method with subspace-based coarsening and partition-based balancing for the multigroup neutron transport equations on 3D unstructured meshes. arXiv preprint arXiv:1903.03659.
Kong et al. (2018b) Kong F, Wang Y, Schunert S, Peterson JW, Permann CJ, Andrš D and Martineau RC (2018b) A fully coupled two-level Schwarz preconditioner based on smoothed aggregation for the transient multigroup neutron diffusion equations. Numerical Linear Algebra with Applications 25(3): 21–62.
 Lewis and Miller (1984) Lewis EE and Miller WF (1984) Computational Methods of Neutron Transport. New York, NY: John Wiley and Sons Inc.
McCourt et al. (2013) McCourt M, Smith B and Zhang H (2013) Efficient sparse matrix-matrix products using colorings. SIAM Journal on Matrix Analysis and Applications.
 Peterson et al. (2018) Peterson JW, Lindsay AD and Kong F (2018) Overview of the incompressible Navier–Stokes simulation capabilities in the MOOSE framework. Advances in Engineering Software 119: 68–92.
 Saad and Schultz (1986) Saad Y and Schultz MH (1986) GMRES: A generalized minimal residual algorithm for solving nonsymmetric linear systems. SIAM Journal on Scientific and Statistical Computing 7(3): 856–869.
Smith et al. (2004) Smith B, Bjorstad P and Gropp W (2004) Domain Decomposition: Parallel Multilevel Methods for Elliptic Partial Differential Equations. Cambridge University Press.
Stuben (1999) Stuben K (1999) Algebraic Multigrid (AMG): An Introduction with Applications. GMD-Forschungszentrum Informationstechnik.
 Tuminaro and Tong (2000) Tuminaro RS and Tong C (2000) Parallel smoothed aggregation multigrid: Aggregation strategies on massively parallel machines. In: SC’00: Proceedings of the 2000 ACM/IEEE Conference on Supercomputing. IEEE, pp. 5–5.
Van Emden Henson (2002) Henson VE and Yang UM (2002) BoomerAMG: a parallel algebraic multigrid solver and preconditioner. Applied Numerical Mathematics 41(1): 155–177.
 Wang et al. (2018) Wang Y, Schunert S and Laboure V (2018) Rattlesnake Theory Manual. Technical report, Idaho National Laboratory(INL), Idaho Falls, ID (United States).